Dummy Coding: The how and why


Posted May 31, 2017

Nominal variables, or variables that describe a characteristic using two or more categories, are commonplace in quantitative research, but are not always usable in their categorical form. A common workaround for using these variables in a regression analysis is dummy coding, but there is often a lot of confusion (sometimes even among dissertation committees!) about what dummy variables are, how they work, and why we use them. With this in mind, it is important that researchers know how and why to use dummy coding so they can defend its correct (and in many cases, necessary) use.

Dummy coding is a way of incorporating nominal variables into regression analysis, and the reason it is needed is pretty intuitive once you understand the regression model. Regressions are most commonly known for using continuous variables (for instance, hours spent studying) to predict an outcome value (such as grade point average, or GPA). In this example, we might find that increased study time corresponds with increased GPAs.

Now, what if we also wanted to know whether favorite class (e.g., science, math, or language) corresponded with an increased GPA? Let’s say we coded this so that science = 1, math = 2, and language = 3. Looking at the nominal favorite class variable, we can see that there is no such thing as an increase in favorite class: math is not higher than science, and it is not lower than language either. A regression would nonetheless treat these codes as numbers, as if each one-unit step from science to math to language meant the same kind of increase. This is sometimes referred to as directionality, and knowing that a high versus low score means something is an integral part of regression analysis. Luckily, there is a way around this! Enter: dummy coding.

Dummy coding allows us to turn categories into something a regression can treat as having a high (1) and low (0) score. Any binary variable can be thought of as having directionality, because if it is higher, it is category 1, but if it is lower, it is category 0. This allows the regression to look at directionality by comparing two sides, rather than expecting each unit to correspond with some kind of increase. Let’s go back to the favorite class variable. Remember, we originally coded this as science = 1, math = 2, and language = 3. To give the regression something to work with, we can make a separate column, or variable, for each category. Each of these columns shows whether that category was a student’s favorite; if a student has a (1), the high (or yes) score, in the science column, science is their favorite, but if they have a (0), the low (or no) score, science did not make the cut. The same goes for each of the dummy variables, as they are called. Below is an example of how this ends up working out:

                             Dummy variables
Student   Favorite class   Science   Math   Language
1         Science          1         0      0
2         Science          1         0      0
3         Language         0         0      1
4         Math             0         1      0
5         Language         0         0      1
6         Math             0         1      0

Now, looking at this, you can see that knowing the values of two of the variables tells us what value the final variable has to be. Let’s look at student 1; we know they can only have one favorite class. If we know science = 1 and math = 0, we know that language has to be 0 as well. The same goes for student 5; we know that science is not their favorite, nor is math, so language has to have a yes (or 1).
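
To make this concrete, here is a minimal Python sketch that builds the three dummy variables for the six students in the table above and checks that each student's row adds up to exactly 1. The helper name `dummy_code` is made up for this illustration; it is not part of any library.

```python
# Favorite class for the six students in the table above.
favorite_class = ["Science", "Science", "Language", "Math", "Language", "Math"]

# One dummy column per category, in the same order as the table's columns.
categories = ["Science", "Math", "Language"]


def dummy_code(values, categories):
    """Hypothetical helper: return {category: [0/1, ...]}, with 1 where the value matches."""
    return {c: [1 if v == c else 0 for v in values] for c in categories}


dummies = dummy_code(favorite_class, categories)

# Print one row per student: student number, favorite class, dummy values.
for i, fav in enumerate(favorite_class):
    print(i + 1, fav, [dummies[c][i] for c in categories])

# Each student has exactly one favorite, so every row sums to 1.
# This is why two of the columns always determine the third.
row_sums = [sum(dummies[c][i] for c in categories) for i in range(len(favorite_class))]
print(row_sums)
```

In practice you would more likely let a statistics package do this for you (for example, pandas has a `get_dummies` function); the hand-rolled version is only meant to show what is happening underneath.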

For this reason, we do not use all three categories in a regression. Doing so would give the regression redundant information: in a model with an intercept, the three dummy variables always add up to 1, so they are perfectly collinear with the intercept and the model cannot estimate a unique coefficient for each one. This means we have to leave one category out, and we call this omitted category the reference category. Using a reference category means all interpretation is made in reference to that category. For example, if you included the dummy variable for science and used language as the reference, the results for that variable tell you how students who favor science compare with students who have language as their favorite class. The reference category is usually chosen based on how you want to interpret the results, so if you would rather talk about students in comparison to those with math as their favorite class, simply include the other two dummy variables instead.
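
The sketch below illustrates both points using only the Python standard library. The GPA values are hypothetical, invented purely for illustration, and the variable names are assumptions of this example. It relies on a standard result: in a regression with an intercept and only the dummy variables for one nominal predictor, the intercept estimate equals the reference group's mean outcome, and each dummy's coefficient equals that group's mean minus the reference group's mean.

```python
from statistics import mean

favorite_class = ["Science", "Science", "Language", "Math", "Language", "Math"]
# HYPOTHETICAL GPAs, invented for illustration only; they are not real data.
gpa = [3.6, 3.4, 3.0, 3.3, 2.8, 3.1]


def dummy(values, category):
    """Return a 0/1 list marking which values equal the given category."""
    return [1 if v == category else 0 for v in values]


science = dummy(favorite_class, "Science")
math_fav = dummy(favorite_class, "Math")
language = dummy(favorite_class, "Language")
intercept = [1] * len(favorite_class)

# Redundancy: for every student the three dummies add up to the intercept
# column, so including all three alongside an intercept is perfectly collinear.
all_three_redundant = all(
    s + m + l == c for s, m, l, c in zip(science, math_fav, language, intercept)
)
print("All three dummies redundant with intercept:", all_three_redundant)

# Leave language out as the reference category and keep science and math.
reference = "Language"
group_mean = {
    c: mean(g for g, f in zip(gpa, favorite_class) if f == c)
    for c in ("Science", "Math", "Language")
}

# With only these dummies in the model, each coefficient is a difference
# from the reference group's mean GPA.
coefficients = {
    "Intercept": group_mean[reference],
    "Science": group_mean["Science"] - group_mean[reference],
    "Math": group_mean["Math"] - group_mean[reference],
}
for name, value in coefficients.items():
    print(name, round(value, 2))
```

Changing `reference` to "Math" (and keeping the science and language dummies instead) would describe the same group means, just expressed as differences from students whose favorite class is math.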

Now that we have covered the basics of one of the most common data transformations done for regression, next time we will cover a more general interpretation of linear regression. You can also learn more about interpreting binary logistic regression here!

