Previously, you learned about single dichotomous predictors, where you have a binary variable with one reference group and a slope that represents the difference between the two groups. It is possible to expand this idea to multicategorical variables with 3+ groups through various coding techniques, which we will go over below. These can be used for nominal or ordinal variables, though some methods are better suited to certain variable types than others.
A through line across these methods is that we will need multiple predictors and multiple slopes in the model: one fewer than the number of groups, since, much like with a dichotomous predictor, one group's information is folded into the intercept (the details of which are method dependent).
Another through line is that each method can be represented by a coding scheme, a “contrast” matrix. If you have taken the ANOVA course before regression, this concept will hopefully make sense.
At the end of the day, all of these methods are comparing group means of the outcome. What differs is which combination of group means is being compared (e.g., mean 1 vs mean 2, means 1 & 2 vs mean 3, etc.).
The Dataset
For this lesson we will use a COVID stress dataset. The outcome is COVID_Concerns - how concerned an individual is about COVID-19. The predictor of interest is Educ_Mom - the mother’s education level, which has seven categories:
“None” (no college)
“Up to 6 years of school”
“Up to 9 years of school”
“Up to 12 years of school” (high school degree)
“Some College or equivalent”
“College degree”
“PhD/Doctorate”
Note, the dataset that was given to develop the lab lesson turned out not to have any real group differences (spoiler), so this isn’t the best example to teach with. If I ever get ahold of better data, I may come back and update these examples.
Group Descriptives
Since all of these methods will be comparing group means, it is very helpful to check the group descriptive statistics beforehand. We can use the favstats() function from the mosaic package.
mosaic::favstats(COVID_Concerns ~ Educ_Mom, data = covid)
Educ_Mom min Q1 median Q3 max mean sd n
1 College degree 3 19.00 24.0 27 30 22.26774 5.981843 310
2 None 4 18.25 24.0 26 30 22.08065 6.036225 62
3 PhD/Doctorate 4 22.50 26.0 28 30 23.90323 6.425755 31
4 Some College or equivalent 3 19.00 23.0 26 30 21.85068 6.239199 221
5 Up to 12 years of school 3 18.00 23.0 26 30 21.59048 6.392215 315
6 Up to 6 years of school 3 19.75 24.0 26 30 22.47436 6.112334 156
7 Up to 9 years of school 3 19.00 22.5 26 30 21.86765 5.863647 136
missing
1 0
2 0
3 0
4 0
5 0
6 0
7 0
In terms of average COVID-19 concern levels, individuals whose mom has a PhD/Doctorate have the highest average level (23.9), whereas individuals whose mom has only a high school degree (“Up to 12 years of school”) have the lowest average COVID-19 concern (21.59). This gives us a glimpse of what relationships we might see.
A lot of the explanation will be done in the Dummy/Indicator Coding section below, as dummy coding is by far the most popular method and its rules and options apply to the other coding schemes.
1. Dummy/Indicator Coding
What Is Dummy Coding?
The most popular method for dealing with multicategorical predictors is to use dummy (or indicator) coding. This builds directly on what you have already learned about dichotomous (0/1) predictors. Previously, a dichotomous variable used 0 and 1 to represent two groups, where the coefficient told us the difference between those groups (with 0 as the reference group). Dummy coding extends this idea to variables with more than two categories by creating multiple 0/1 variables, each comparing one category to a chosen reference group. For example, if a variable has three categories, we would create two dummy variables, each coded 0 or 1, and each coefficient would represent the difference between that category and the reference category.
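To see the coding concretely (toy data here, not the COVID dataset), model.matrix() shows the dummy columns R would build for a three-category factor:

```r
# Toy three-group factor: "A" becomes the reference since it is the first level
grp <- factor(c("A", "B", "C", "B", "A"))
model.matrix(~ grp)  # columns: (Intercept), grpB, grpC
```

Each row gets a 1 in the column for its own group, and the "A" rows are 0 on both dummies, making "A" the reference.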
For the sake of example, we will make the “None” mother’s education group the baseline. Therefore, each dummy coded variable will represent the difference between group means for that respective group and the “None” group.
Dummy Coding with factor()
There are many ways to dummy code in R, but the easiest is to convert your variable to a factor class object. R will then automatically apply dummy coding when you include the variable in lm(). First, check the class of your variable.
class(covid$Educ_Mom)
[1] "character"
str(covid$Educ_Mom)
chr [1:1263] "None" "College degree" "Up to 12 years of school" ...
Educ_Mom is a character class variable. If you are ever curious about the structure of a variable or dataset, str() is a great function for that.
Now convert it to a factor. We input the variable, define the levels (with names), and specify ordered = FALSE. The first level in the list will be the baseline reference group.
covid$Educ_Mom_Factor <- factor(
  covid$Educ_Mom,
  levels = c("None", "Up to 6 years of school", "Up to 9 years of school",
             "Up to 12 years of school", "Some College or equivalent",
             "College degree", "PhD/Doctorate"), # make sure everything is spelled correctly
  ordered = FALSE
)
The last argument, ordered = FALSE, controls whether R treats the factor as ordered. Ordered factors get polynomial contrasts by default in lm(), and the reference group can only be changed on unordered factors. So even though our variable is technically ordinal, we set this argument to FALSE so we can recode the baseline and reuse the variable in the other coding schemes as well.
If your variable is numeric (no character strings, just numbers), converting it to a factor will be necessary, or else lm() will think your variable is continuous. In that case, just list the values in the levels = line (e.g. levels = c(1,2,3,4)).
str(covid$Educ_Mom_Factor)
Factor w/ 7 levels "None","Up to 6 years of school",..: 1 6 4 6 5 4 6 4 6 NA ...
levels(covid$Educ_Mom_Factor)
[1] "None" "Up to 6 years of school"
[3] "Up to 9 years of school" "Up to 12 years of school"
[5] "Some College or equivalent" "College degree"
[7] "PhD/Doctorate"
Now it is classified as a “Factor” object with 7 levels. The first group in levels() is always the reference baseline group.
You can also check the contrast matrix that R will use for dummy coding:
contrasts(covid$Educ_Mom_Factor)
Up to 6 years of school Up to 9 years of school
None 0 0
Up to 6 years of school 1 0
Up to 9 years of school 0 1
Up to 12 years of school 0 0
Some College or equivalent 0 0
College degree 0 0
PhD/Doctorate 0 0
Up to 12 years of school Some College or equivalent
None 0 0
Up to 6 years of school 0 0
Up to 9 years of school 0 0
Up to 12 years of school 1 0
Some College or equivalent 0 1
College degree 0 0
PhD/Doctorate 0 0
College degree PhD/Doctorate
None 0 0
Up to 6 years of school 0 0
Up to 9 years of school 0 0
Up to 12 years of school 0 0
Some College or equivalent 0 0
College degree 1 0
PhD/Doctorate 0 1
The row with all 0’s is the baseline group. Each column represents one dummy variable, and the 1 in each row tells you which dummy code that group gets.
Fitting the Model
Model_Factor <- lm(COVID_Concerns ~ Educ_Mom_Factor, data = covid)
summary(Model_Factor)
Call:
lm(formula = COVID_Concerns ~ Educ_Mom_Factor, data = covid)
Residuals:
Min 1Q Median 3Q Max
-19.903 -2.868 1.409 4.410 8.409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.0806 0.7815 28.255 <2e-16
Educ_Mom_FactorUp to 6 years of school 0.3937 0.9238 0.426 0.670
Educ_Mom_FactorUp to 9 years of school -0.2130 0.9429 -0.226 0.821
Educ_Mom_FactorUp to 12 years of school -0.4902 0.8549 -0.573 0.567
Educ_Mom_FactorSome College or equivalent -0.2300 0.8843 -0.260 0.795
Educ_Mom_FactorCollege degree 0.1871 0.8561 0.219 0.827
Educ_Mom_FactorPhD/Doctorate 1.8226 1.3536 1.347 0.178
(Intercept) ***
Educ_Mom_FactorUp to 6 years of school
Educ_Mom_FactorUp to 9 years of school
Educ_Mom_FactorUp to 12 years of school
Educ_Mom_FactorSome College or equivalent
Educ_Mom_FactorCollege degree
Educ_Mom_FactorPhD/Doctorate
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.153 on 1224 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.004912, Adjusted R-squared: 3.459e-05
F-statistic: 1.007 on 6 and 1224 DF, p-value: 0.419
You can see the output is given per dummy coded variable, labeled as the variable name plus the group name. Since “None” is the baseline reference, it is not included as a predictor, and its group mean is the value of the intercept. Each coefficient is the difference in group means, following the form: Group - Baseline.
Interpreting Dummy Codes
Intercept: The mean COVID-19 concern for individuals whose mothers have no education is about 22.08 points.
B1: Individuals whose mothers have up to 6 years of schooling have about .39 more average COVID-19 concern when compared to individuals whose mothers have no education.
B2: Individuals whose mothers have up to 9 years of schooling have about .21 less average COVID-19 concern when compared to individuals whose mothers have no education.
B3: Individuals whose mothers have up to 12 years of schooling have about .49 less average COVID-19 concern when compared to individuals whose mothers have no education.
B4: Individuals whose mothers have some college or equivalent education have about .23 less average COVID-19 concern when compared to individuals whose mothers have no education.
B5: Individuals whose mothers have a college degree have about .19 more average COVID-19 concern when compared to individuals whose mothers have no education.
B6: Individuals whose mothers have a doctorate degree have about 1.82 more average COVID-19 concern when compared to individuals whose mothers have no education.
However, none of the slopes are significantly different from zero. Therefore, for each comparison we fail to reject the null hypothesis; there is no evidence that any of these groups differ from the “None” group in mean COVID-19 concern in the population.
Coding Unique Variables Manually
Another method is to manually create new dummy coded variables for each group. This involves making a new column variable that has 1’s for rows that are part of that group and 0’s on everything else. This is not necessary for dummy coded variables, but it can be helpful for other coding schemes covered later. It is also possible some datasets you will work with already have dummy coded variables.
covid <- transform(covid,
  d1 = (Educ_Mom == "None"),
  d2 = (Educ_Mom == "Up to 6 years of school"),
  d3 = (Educ_Mom == "Up to 9 years of school"),
  d4 = (Educ_Mom == "Up to 12 years of school"),
  d5 = (Educ_Mom == "Some College or equivalent"),
  d6 = (Educ_Mom == "College degree"),
  d7 = (Educ_Mom == "PhD/Doctorate")
)
# Convert TRUE/FALSE to 1/0 for cleaner output
cols <- c("d1", "d2", "d3", "d4", "d5", "d6", "d7")
covid[cols] <- lapply(covid[cols], as.integer)
Note that if your categorical variable is numerically coded (e.g., “None” = 1, “Up to 6 years of school” = 2, etc.), you should just do Educ_Mom == 1, Educ_Mom == 2, etc. in the transform code above.
PLEASE NOTE that lm() will drop the last redundant dummy variable (it shows up as NA), which makes that group the effective baseline. So you need to list d1 very last, since we are using “None” education as the baseline.
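The fitted model then looks like this (the object name here is my own; the formula matches the Call in the output below):

```r
# d1 ("None") goes last so lm() drops it as the redundant dummy,
# making "None" the effective baseline
Model_Dummy2 <- lm(COVID_Concerns ~ d2 + d3 + d4 + d5 + d6 + d7 + d1, data = covid)
summary(Model_Dummy2)
```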
Call:
lm(formula = COVID_Concerns ~ d2 + d3 + d4 + d5 + d6 + d7 + d1,
data = covid)
Residuals:
Min 1Q Median 3Q Max
-19.903 -2.868 1.409 4.410 8.409
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.0806 0.7815 28.255 <2e-16 ***
d2 0.3937 0.9238 0.426 0.670
d3 -0.2130 0.9429 -0.226 0.821
d4 -0.4902 0.8549 -0.573 0.567
d5 -0.2300 0.8843 -0.260 0.795
d6 0.1871 0.8561 0.219 0.827
d7 1.8226 1.3536 1.347 0.178
d1 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.153 on 1224 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.004912, Adjusted R-squared: 3.459e-05
F-statistic: 1.007 on 6 and 1224 DF, p-value: 0.419
As you can see, the results for this method are identical to the factor() method above.
Recoding the Baseline Group
Since dummy variable slopes only give the difference between the baseline and each dummy group, the model does not directly provide every possible group comparison. For example, if you wanted to compare “Some College or equivalent” with “PhD/Doctorate”, you would need to make one of those groups the baseline. There are a few ways to do this.
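Method 1: Using relevel()
The quickest option is base R’s relevel(), which moves a chosen level to the front of an existing (unordered) factor, making it the new reference group. A minimal sketch (the object name here is an assumption):

```r
# Move "College degree" to the first (reference) position
covid$Educ_Mom_Factor2 <- relevel(covid$Educ_Mom_Factor, ref = "College degree")
levels(covid$Educ_Mom_Factor2)
```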
[1] "College degree" "None"
[3] "Up to 6 years of school" "Up to 9 years of school"
[5] "Up to 12 years of school" "Some College or equivalent"
[7] "PhD/Doctorate"
“College degree” is now first in the list, confirming it is the reference group.
Method 2: Respecifying the factor() order
covid$Educ_Mom_Factor2 <- factor(
  covid$Educ_Mom,
  levels = c("College degree", "None", "Up to 6 years of school",
             "Up to 9 years of school", "Up to 12 years of school",
             "Some College or equivalent", "PhD/Doctorate"),
  ordered = FALSE
)
Model_Factor2 <- lm(COVID_Concerns ~ Educ_Mom_Factor2, data = covid)
summary(Model_Factor2)
Call:
lm(formula = COVID_Concerns ~ Educ_Mom_Factor2, data = covid)
Residuals:
Min 1Q Median 3Q Max
-19.903 -2.868 1.409 4.410 8.409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.2677 0.3495 63.716 <2e-16
Educ_Mom_Factor2None -0.1871 0.8561 -0.219 0.827
Educ_Mom_Factor2Up to 6 years of school 0.2066 0.6040 0.342 0.732
Educ_Mom_Factor2Up to 9 years of school -0.4001 0.6329 -0.632 0.527
Educ_Mom_Factor2Up to 12 years of school -0.6773 0.4923 -1.376 0.169
Educ_Mom_Factor2Some College or equivalent -0.4171 0.5417 -0.770 0.442
Educ_Mom_Factor2PhD/Doctorate 1.6355 1.1591 1.411 0.159
(Intercept) ***
Educ_Mom_Factor2None
Educ_Mom_Factor2Up to 6 years of school
Educ_Mom_Factor2Up to 9 years of school
Educ_Mom_Factor2Up to 12 years of school
Educ_Mom_Factor2Some College or equivalent
Educ_Mom_Factor2PhD/Doctorate
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.153 on 1224 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.004912, Adjusted R-squared: 3.459e-05
F-statistic: 1.007 on 6 and 1224 DF, p-value: 0.419
Now the intercept equals the mean COVID-19 concern for individuals whose mothers have a college degree. The coefficient for the PhD dummy is about 1.64, meaning that the mean COVID-19 concern for individuals whose mothers have a doctorate is 1.64 points higher than for those whose mothers have a college degree. This difference is not significant.
You can check the contrast matrix for your new factor variable:
contrasts(covid$Educ_Mom_Factor2)
None Up to 6 years of school Up to 9 years of school
College degree 0 0 0
None 1 0 0
Up to 6 years of school 0 1 0
Up to 9 years of school 0 0 1
Up to 12 years of school 0 0 0
Some College or equivalent 0 0 0
PhD/Doctorate 0 0 0
Up to 12 years of school Some College or equivalent
College degree 0 0
None 0 0
Up to 6 years of school 0 0
Up to 9 years of school 0 0
Up to 12 years of school 1 0
Some College or equivalent 0 1
PhD/Doctorate 0 0
PhD/Doctorate
College degree 0
None 0
Up to 6 years of school 0
Up to 9 years of school 0
Up to 12 years of school 0
Some College or equivalent 0
PhD/Doctorate 1
Method 3: Reordering unique variables
If you did individual variables for each dummy code, you just need to place the one you want as the baseline last.
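For example, to make “College degree” the baseline with the unique d variables (the model name here is my own):

```r
# Listing d6 ("College degree") last makes it the dummy lm() drops, i.e., the baseline
Model_Dummy_CD <- lm(COVID_Concerns ~ d1 + d2 + d3 + d4 + d5 + d7 + d6, data = covid)
summary(Model_Dummy_CD)
```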
Method 4: Manually creating a contrast matrix
Another option is to manually create the contrast matrix yourself. This will become important when learning about other coding schemes, as many compare combinations of group means against others.
To understand how this works, look at the contrast matrix. The group whose row is all 0’s is the baseline. Within each other group’s row, the column containing the 1 tells you which dummy variable that group belongs to.
To change the baseline manually, set every value in the row of the new baseline group to 0, then give the old baseline group a 1 in one of the columns so that it gets its own dummy variable.
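Putting this into practice for our example, here is one way to build the matrix shown below and attach it to a copy of the factor (the matrix name is my own; Educ_Mom_Factor3 matches the model that follows):

```r
# Dummy contrast matrix with "College degree" as the baseline (all-zero row).
# Rows follow the factor's level order: None, 6 yrs, 9 yrs, 12 yrs,
# Some College, College degree, PhD/Doctorate.
Dummy_Matrix <- matrix(0, nrow = 7, ncol = 6)
Dummy_Matrix[1, 5] <- 1  # None
Dummy_Matrix[2, 1] <- 1  # Up to 6 years of school
Dummy_Matrix[3, 2] <- 1  # Up to 9 years of school
Dummy_Matrix[4, 3] <- 1  # Up to 12 years of school
Dummy_Matrix[5, 4] <- 1  # Some College or equivalent
# row 6 (College degree) stays all 0 = baseline
Dummy_Matrix[7, 6] <- 1  # PhD/Doctorate

covid$Educ_Mom_Factor3 <- covid$Educ_Mom_Factor
contrasts(covid$Educ_Mom_Factor3) <- Dummy_Matrix
contrasts(covid$Educ_Mom_Factor3)  # always double check the result
```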
[,1] [,2] [,3] [,4] [,5] [,6]
None 0 0 0 0 1 0
Up to 6 years of school 1 0 0 0 0 0
Up to 9 years of school 0 1 0 0 0 0
Up to 12 years of school 0 0 1 0 0 0
Some College or equivalent 0 0 0 1 0 0
College degree 0 0 0 0 0 0
PhD/Doctorate 0 0 0 0 0 1
Model_Dummy3 <- lm(COVID_Concerns ~ Educ_Mom_Factor3, data = covid)
summary(Model_Dummy3)
Call:
lm(formula = COVID_Concerns ~ Educ_Mom_Factor3, data = covid)
Residuals:
Min 1Q Median 3Q Max
-19.903 -2.868 1.409 4.410 8.409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.2677 0.3495 63.716 <2e-16 ***
Educ_Mom_Factor31 0.2066 0.6040 0.342 0.732
Educ_Mom_Factor32 -0.4001 0.6329 -0.632 0.527
Educ_Mom_Factor33 -0.6773 0.4923 -1.376 0.169
Educ_Mom_Factor34 -0.4171 0.5417 -0.770 0.442
Educ_Mom_Factor35 -0.1871 0.8561 -0.219 0.827
Educ_Mom_Factor36 1.6355 1.1591 1.411 0.159
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.153 on 1224 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.004912, Adjusted R-squared: 3.459e-05
F-statistic: 1.007 on 6 and 1224 DF, p-value: 0.419
NOTE: If you use this method, you need to make sure the contrast is correct. If you do not, the results you get will be wrong. Double check the matrix.
Adding Covariates
Adding continuous covariates is straightforward. The same rules for the baseline group follow as outlined above. Simply add the variable into lm().
# Adding COVID_Compliance as a continuous covariate
Model_Factor3 <- lm(COVID_Concerns ~ Educ_Mom_Factor + COVID_Compliance, data = covid)
summary(Model_Factor3)
Call:
lm(formula = COVID_Concerns ~ Educ_Mom_Factor + COVID_Compliance,
data = covid)
Residuals:
Min 1Q Median 3Q Max
-20.2975 -2.8673 0.9447 4.0765 13.9001
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.94592 1.28744 8.502 <2e-16
Educ_Mom_FactorUp to 6 years of school 0.35460 0.88427 0.401 0.6885
Educ_Mom_FactorUp to 9 years of school -0.02346 0.90274 -0.026 0.9793
Educ_Mom_FactorUp to 12 years of school -0.26572 0.81861 -0.325 0.7455
Educ_Mom_FactorSome College or equivalent 0.14755 0.84722 0.174 0.8618
Educ_Mom_FactorCollege degree 0.53354 0.82007 0.651 0.5154
Educ_Mom_FactorPhD/Doctorate 2.22178 1.29616 1.714 0.0868
COVID_Compliance 0.44197 0.04159 10.626 <2e-16
(Intercept) ***
Educ_Mom_FactorUp to 6 years of school
Educ_Mom_FactorUp to 9 years of school
Educ_Mom_FactorUp to 12 years of school
Educ_Mom_FactorSome College or equivalent
Educ_Mom_FactorCollege degree
Educ_Mom_FactorPhD/Doctorate .
COVID_Compliance ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.89 on 1223 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.08902, Adjusted R-squared: 0.08381
F-statistic: 17.07 on 7 and 1223 DF, p-value: < 2.2e-16
For the unique dummy variable method, just add the covariate in. R can still distinguish which variables are the dummy variables, and will use the last one as the reference code.
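For example (the model name here is my own; the formula matches the Call in the output below):

```r
# Covariate added to the unique-dummy model; d1 still goes last as the baseline
Model_Dummy4 <- lm(COVID_Concerns ~ d2 + d3 + d4 + d5 + d6 + d7 + d1 + COVID_Compliance,
                   data = covid)
summary(Model_Dummy4)
```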
Call:
lm(formula = COVID_Concerns ~ d2 + d3 + d4 + d5 + d6 + d7 + d1 +
COVID_Compliance, data = covid)
Residuals:
Min 1Q Median 3Q Max
-20.2975 -2.8673 0.9447 4.0765 13.9001
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.94592 1.28744 8.502 <2e-16 ***
d2 0.35460 0.88427 0.401 0.6885
d3 -0.02346 0.90274 -0.026 0.9793
d4 -0.26572 0.81861 -0.325 0.7455
d5 0.14755 0.84722 0.174 0.8618
d6 0.53354 0.82007 0.651 0.5154
d7 2.22178 1.29616 1.714 0.0868 .
d1 NA NA NA NA
COVID_Compliance 0.44197 0.04159 10.626 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.89 on 1223 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.08902, Adjusted R-squared: 0.08381
F-statistic: 17.07 on 7 and 1223 DF, p-value: < 2.2e-16
You can also add multiple multicategorical variables by using the factor method.
covid$Educ_Factor <- factor(
  covid$Educ,
  levels = c("None", "Up to 6 years of school", "Up to 9 years of school",
             "Up to 12 years of school", "Some College or equivalent",
             "College degree", "PhD/Doctorate"),
  ordered = FALSE
)
Model_Dummy5 <- lm(COVID_Concerns ~ Educ_Mom_Factor + Educ_Factor, data = covid)
summary(Model_Dummy5)
Call:
lm(formula = COVID_Concerns ~ Educ_Mom_Factor + Educ_Factor,
data = covid)
Residuals:
Min 1Q Median 3Q Max
-20.186 -2.827 1.291 4.173 11.385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.2681 1.7010 12.504 <2e-16
Educ_Mom_FactorUp to 6 years of school -0.1182 1.4260 -0.083 0.934
Educ_Mom_FactorUp to 9 years of school -0.2495 1.4777 -0.169 0.866
Educ_Mom_FactorUp to 12 years of school -1.0276 1.3403 -0.767 0.444
Educ_Mom_FactorSome College or equivalent 0.2968 1.5438 0.192 0.848
Educ_Mom_FactorCollege degree 0.6202 1.4052 0.441 0.659
Educ_Mom_FactorPhD/Doctorate 1.3082 2.2212 0.589 0.556
Educ_FactorUp to 6 years of school 5.1834 3.8631 1.342 0.181
Educ_FactorUp to 9 years of school -2.5352 2.0816 -1.218 0.224
Educ_FactorUp to 12 years of school 0.5592 1.5082 0.371 0.711
Educ_FactorPhD/Doctorate 1.6101 1.5114 1.065 0.288
(Intercept) ***
Educ_Mom_FactorUp to 6 years of school
Educ_Mom_FactorUp to 9 years of school
Educ_Mom_FactorUp to 12 years of school
Educ_Mom_FactorSome College or equivalent
Educ_Mom_FactorCollege degree
Educ_Mom_FactorPhD/Doctorate
Educ_FactorUp to 6 years of school
Educ_FactorUp to 9 years of school
Educ_FactorUp to 12 years of school
Educ_FactorPhD/Doctorate
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.078 on 322 degrees of freedom
(930 observations deleted due to missingness)
Multiple R-squared: 0.04442, Adjusted R-squared: 0.01474
F-statistic: 1.497 on 10 and 322 DF, p-value: 0.1392
Interpreting Covariates
To interpret a dummy coded variable alongside a covariate, you interpret its coefficient while controlling for the covariate: hold the covariate constant if it is continuous, or compare individuals with the same group membership if it is categorical.
Example with continuous covariate:
Holding COVID-19 compliance constant between individuals, the mean COVID-19 concern score for individuals whose mother has 6 years of school is expected to be .35 higher than the average COVID-19 concern for individuals whose mothers have no education.
Example with categorical covariate:
For individuals who have the same education level, the mean COVID-19 concern score for individuals whose mother has 6 years of school is expected to be .35 higher than the average COVID-19 concern for individuals whose mothers have no education.
Note that with multiple categorical variables, the intercept is the mean outcome at both baselines. In general, interpretations for models with multiple categorical variables get messy, and it is usually better to either run interactions or use an ANOVA/ANCOVA for simpler results to interpret.
For inference, it is often of interest to treat all your dummy codes as a group and do a delta-R^2 test on them to test how much contribution the multicategorical variable added after controlling for the covariate(s). See the Inference lesson for that method.
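As a preview, that test is just an F test comparing nested models with anova(). One wrinkle is missing data: both models must be fit to the same rows, so restrict to complete cases first (the object names here are my own):

```r
# Keep only rows with no missing values on the variables used in either model
complete <- na.omit(covid[, c("COVID_Concerns", "COVID_Compliance", "Educ_Mom_Factor")])

Model_Cov  <- lm(COVID_Concerns ~ COVID_Compliance, data = complete)
Model_Full <- lm(COVID_Concerns ~ COVID_Compliance + Educ_Mom_Factor, data = complete)

anova(Model_Cov, Model_Full)  # F test for all the education dummies as a set
```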
2. Sequential Coding
What Is Sequential Coding?
Moving on from dummy codes, sequential coding is an alternative way to structure your categorical predictors. Instead of comparing every group to a single baseline group (like dummy coding does), sequential coding compares groups in a step-by-step fashion. Each category is compared to the one that comes before it in some meaningful order (e.g., education level, time, dosage).
In this context the interpretation of coefficients changes: rather than asking “How is this group different from the baseline?”, we ask “How is this group different from the previous group?” This can be especially useful when your categories have a natural ordering (i.e., ordinal variables) and you are interested in incremental changes across levels rather than comparisons to a single reference group. Though, you can still use sequential coding with non-ordinal variables.
In short, dummy coding compares everything to one group, while sequential coding compares each group to the next in line.
As a side note, if you ever want to do piecewise regression with multiple joints (see the future Nonlinear Regression lab), you should use sequential coding for the joints.
Sequential Coding in Base R
Unfortunately, lm() defaults to dummy coding for categorical variables, so sequential codes must be set up manually. This is much easier to do with a numerically coded variable. The coding scheme follows sequentially (e.g., S1 is 1 for every group above group 1, S2 is 1 for every group above group 2, etc.).
Note that, unlike dummy codes, there is no “reference” group with sequential variables, so we include all variables into the model and the order does not matter.
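One way to reproduce the model shown below is to assign a sequential contrast matrix to a copy of the factor (the matrix-building code and Model_Seq name are my own sketch; the factor name matches the output):

```r
# Sequential coding: column j is 1 for every group AFTER group j in the level order,
# so coefficient j equals mean(group j + 1) - mean(group j)
Seq_Matrix <- matrix(0, nrow = 7, ncol = 6)
for (j in 1:6) Seq_Matrix[(j + 1):7, j] <- 1

covid$Educ_Mom_Factor_Seq <- covid$Educ_Mom_Factor
contrasts(covid$Educ_Mom_Factor_Seq) <- Seq_Matrix

Model_Seq <- lm(COVID_Concerns ~ Educ_Mom_Factor_Seq, data = covid)
summary(Model_Seq)
```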
Call:
lm(formula = COVID_Concerns ~ Educ_Mom_Factor_Seq, data = covid)
Residuals:
Min 1Q Median 3Q Max
-19.903 -2.868 1.409 4.410 8.409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.0806 0.7815 28.255 <2e-16 ***
Educ_Mom_Factor_Seq1 0.3937 0.9238 0.426 0.670
Educ_Mom_Factor_Seq2 -0.6067 0.7219 -0.840 0.401
Educ_Mom_Factor_Seq3 -0.2772 0.6314 -0.439 0.661
Educ_Mom_Factor_Seq4 0.2602 0.5399 0.482 0.630
Educ_Mom_Factor_Seq5 0.4171 0.5417 0.770 0.442
Educ_Mom_Factor_Seq6 1.6355 1.1591 1.411 0.159
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.153 on 1224 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.004912, Adjusted R-squared: 3.459e-05
F-statistic: 1.007 on 6 and 1224 DF, p-value: 0.419
This method requires you to specify the matrix correctly. Double check the matrix is correct before interpreting results.
Interpreting Sequential Codes
The intercept of the sequentially coded groups is the group mean of the “first” group in sequential order, which is the “None” mother’s education group. Each coefficient then compares the subsequent group mean to the previous group mean.
Example b1:
Individuals whose mothers have up to 6 years of schooling have about .39 more average COVID-19 concern when compared to individuals whose mothers have no education.
Example b2:
Individuals whose mothers have up to 9 years of schooling have about .61 less average COVID-19 concern when compared to individuals whose mothers have 6 years of education.
Example b6:
Individuals whose mothers have a PhD or doctorate have about 1.64 more average COVID-19 concern when compared to individuals whose mothers have a college degree.
3. Helmert Coding
What Is Helmert Coding?
Helmert coding is a method to code multicategorical variables where each group is compared to the average of the groups that come after it. Unlike dummy coding, which compares each group to a single reference group, Helmert coding makes a series of sequential comparisons across levels of the variable. For example, the first coefficient compares the first group to the average of all remaining group means, the second coefficient compares the second group average to the average of the group means that follow, and so on.
Helmert coding is useful when there is a meaningful ordering to the categories and you are interested in how earlier groups differ from later ones overall. Note that even though it is helpful for ordinal variables, the order of Helmert comparisons is up to you, so any sequence of comparisons can be made.
For our example, we want to do the following comparisons:
No education vs the rest
PhD vs everything except No Education
6 years vs 9 years, 12 years, Some College, and College
9 years vs 12 years, Some College, and College
12 years vs Some College and College
Some College vs College
Important note on “mean of the group means”: Helmert coding compares each group to the average of the group means of subsequent groups, not the grand mean of all individuals in those groups. These are only equal when group sizes are equal, which they usually are not. Be precise about this in your interpretations.
Helmert Coding in Base R
There is no convenient built-in option for this particular set of Helmert comparisons in R, so we will manually code unique variables. The codes are fractions that follow a specific pattern, and the | operator represents “OR”.
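A sketch of the unique variables (the h1 through h6 names match the model call below; the nested ifelse() structure is one way to write it):

```r
covid <- transform(covid,
  # h1: None (-6/7) vs everyone else (1/7)
  h1 = ifelse(Educ_Mom == "None", -6/7, 1/7),
  # h2: PhD (-5/6) vs everything except None (1/6)
  h2 = ifelse(Educ_Mom == "None", 0,
         ifelse(Educ_Mom == "PhD/Doctorate", -5/6, 1/6)),
  # h3: 6 years (-4/5) vs 9 years, 12 years, Some College, and College (1/5)
  h3 = ifelse(Educ_Mom == "None" | Educ_Mom == "PhD/Doctorate", 0,
         ifelse(Educ_Mom == "Up to 6 years of school", -4/5, 1/5)),
  # h4: 9 years (-3/4) vs 12 years, Some College, and College (1/4)
  h4 = ifelse(Educ_Mom == "None" | Educ_Mom == "PhD/Doctorate" |
              Educ_Mom == "Up to 6 years of school", 0,
         ifelse(Educ_Mom == "Up to 9 years of school", -3/4, 1/4)),
  # h5: 12 years (-2/3) vs Some College and College (1/3)
  h5 = ifelse(Educ_Mom == "None" | Educ_Mom == "PhD/Doctorate" |
              Educ_Mom == "Up to 6 years of school" |
              Educ_Mom == "Up to 9 years of school", 0,
         ifelse(Educ_Mom == "Up to 12 years of school", -2/3, 1/3)),
  # h6: Some College (-1/2) vs College (1/2)
  h6 = ifelse(Educ_Mom == "Some College or equivalent", -1/2,
         ifelse(Educ_Mom == "College degree", 1/2, 0))
)

model_Helmert <- lm(COVID_Concerns ~ h1 + h2 + h3 + h4 + h5 + h6, data = covid)
summary(model_Helmert)
```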
Call:
lm(formula = COVID_Concerns ~ h1 + h2 + h3 + h4 + h5 + h6, data = covid)
Residuals:
Min 1Q Median 3Q Max
-19.903 -2.868 1.409 4.410 8.409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.29068 0.23763 93.805 <2e-16 ***
h1 0.24504 0.81890 0.299 0.7648
h2 -1.89305 1.12196 -1.687 0.0918 .
h3 -0.58022 0.53476 -1.085 0.2781
h4 0.03532 0.56954 0.062 0.9506
h5 0.46873 0.43996 1.065 0.2869
h6 0.41706 0.54173 0.770 0.4415
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.153 on 1224 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.004912, Adjusted R-squared: 3.459e-05
F-statistic: 1.007 on 6 and 1224 DF, p-value: 0.419
Custom Contrast Method
Helmert_Matrix <- c(
  -6/7,  0,    0,    0,    0,    0,   # row 1
   1/7,  1/6, -4/5,  0,    0,    0,   # row 2
   1/7,  1/6,  1/5, -3/4,  0,    0,   # row 3
   1/7,  1/6,  1/5,  1/4, -2/3,  0,   # row 4
   1/7,  1/6,  1/5,  1/4,  1/3, -1/2, # row 5
   1/7,  1/6,  1/5,  1/4,  1/3,  1/2, # row 6
   1/7, -5/6,  0,    0,    0,    0    # row 7
)
Helmert_Contrast <- matrix(Helmert_Matrix, nrow = 7, byrow = TRUE)
# Note: you can use whole numbers, but you will need to rescale coefficients
# accordingly. I recommend sticking to fractions.
covid$Educ_Mom_Factor_Helmert2 <- covid$Educ_Mom_Factor
contrasts(covid$Educ_Mom_Factor_Helmert2) <- Helmert_Contrast
model_Helmert2 <- lm(COVID_Concerns ~ Educ_Mom_Factor_Helmert2, data = covid)
summary(model_Helmert2)
Call:
lm(formula = COVID_Concerns ~ Educ_Mom_Factor_Helmert2, data = covid)
Residuals:
Min 1Q Median 3Q Max
-19.903 -2.868 1.409 4.410 8.409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.29068 0.23763 93.805 <2e-16 ***
Educ_Mom_Factor_Helmert21 0.24504 0.81890 0.299 0.7648
Educ_Mom_Factor_Helmert22 -1.89305 1.12196 -1.687 0.0918 .
Educ_Mom_Factor_Helmert23 -0.58022 0.53476 -1.085 0.2781
Educ_Mom_Factor_Helmert24 0.03532 0.56954 0.062 0.9506
Educ_Mom_Factor_Helmert25 0.46873 0.43996 1.065 0.2869
Educ_Mom_Factor_Helmert26 0.41706 0.54173 0.770 0.4415
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.153 on 1224 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.004912, Adjusted R-squared: 3.459e-05
F-statistic: 1.007 on 6 and 1224 DF, p-value: 0.419
Interpreting Helmert Codes
Interpreting a Helmert coefficient means comparing one group mean to the average of the other group means (e.g., Mean 1 vs (Mean 2 + Mean 3 + Mean 4) / 3). Note that it is often not written this way for shorthand, but when you compare “groups” here, you are comparing group averages.
This is important because it is a very common mistake to say you are comparing the “grand mean of group 2, 3, and 4”, but this is incorrect. The grand mean is NOT the mean of the group means. They will only be equal when group sample sizes are equal. When group sample sizes are not equal (as they often are not), then the mean of the group means will be different than the grand mean.
The intercept of the Helmert coded groups is the average of the group means. Keep in mind that Helmert coding arithmetic is: later groups - current group. So a negative sign means the current group mean is larger than the average of the other group means.
Example b1:
Individuals whose mothers have no education have an average COVID-19 concern that is about .25 lower than the average of the other education level averages.
Example b2:
Individuals whose mothers have a PhD or doctorate have an average COVID-19 concern that is 1.89 greater than the average of the rest of the education group means.
Example b6:
Individuals whose mothers have some college or equivalent have an average COVID-19 concern that is about .42 lower than the average for individuals whose mother has a college degree.
4. Effects Coding
What Is Effects Coding?
Effects coding is a method for coding multicategorical variables where each category is compared to the mean of the means rather than to a single reference group. Like dummy coding, it uses a series of 0/1-type variables, but with one key difference: the reference category is coded as -1 instead of 0. This ensures that the coefficients represent how much each group differs from the average of all groups, rather than from a specific baseline group.
As a result, the intercept in an effects-coded model represents the mean of the group means, and each coefficient tells us how far a given category is above or below that overall average. This can be especially useful when no single group serves as a natural reference, or when you are interested in understanding how each group compares to the overall pattern rather than to one specific category.
The downside is that you have to leave out one group for the sake of estimating the model. The choice of group you leave out has no practical consequence. It just means that if you want the effects coded coefficient for that group, you have to change the leave-out group and rerun the model.
As a side note, effects coding is helpful for LASSO, Ridge, or Elastic Net regression because it treats all categories more evenly when coefficients are being shrunk. Unlike dummy coding, which depends on a chosen reference group, effects coding reduces the influence of that arbitrary choice and can lead to more stable results when selecting variables.
Effects Coding in Base R
lm() defaults to dummy coding, so you must create effects coded variables manually. All you need to do is pick the leave-out group and subtract its dummy indicator from each of the other groups' dummy indicators; observations in the leave-out group then get -1 on every effects variable.
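A minimal sketch of that recoding, using a hypothetical 4-level factor rather than the real data (the variable names and levels here are made up; with the COVID data you would build e1 through e6 the same way and fit lm(COVID_Concerns ~ e1 + e2 + e3 + e4 + e5 + e6)):

```r
# Hypothetical 4-level factor; "D" plays the role of the leave-out group
g <- factor(c("A", "B", "C", "D", "A", "C", "D", "B"))
leave_out <- "D"

# Each effects variable = (own dummy) - (leave-out group's dummy):
# 1 for the group itself, -1 for the leave-out group, 0 otherwise
d_out <- as.numeric(g == leave_out)
e1 <- as.numeric(g == "A") - d_out
e2 <- as.numeric(g == "B") - d_out
e3 <- as.numeric(g == "C") - d_out

# Inspect the coding: leave-out rows are -1 across the board
cbind(g = as.character(g), e1, e2, e3)
```

Note there is no e4: the leave-out group never gets its own variable, which is exactly why its coefficient is missing until you recode and rerun.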
Call:
lm(formula = COVID_Concerns ~ e1 + e2 + e3 + e4 + e5 + e6, data = covid)
Residuals:
Min 1Q Median 3Q Max
-19.903 -2.868 1.409 4.410 8.409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.29068 0.23763 93.805 <2e-16 ***
e1 0.18368 0.47941 0.383 0.7017
e2 -0.42303 0.50530 -0.837 0.4026
e3 -0.70021 0.37726 -1.856 0.0637 .
e4 -0.44000 0.42290 -1.040 0.2983
e5 -0.02294 0.37909 -0.061 0.9518
e6 1.61254 0.96379 1.673 0.0946 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.153 on 1224 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.004912, Adjusted R-squared: 3.459e-05
F-statistic: 1.007 on 6 and 1224 DF, p-value: 0.419
As you can see, every coefficient that is not associated with the PhD group stays the same. However, to get the coefficient for the PhD group itself, we had to recode the leave-out group and rerun the model. This is annoying, but not hard to do.
Call:
lm(formula = COVID_Concerns ~ Educ_Mom_Factor_Effects2, data = covid)
Residuals:
Min 1Q Median 3Q Max
-19.903 -2.868 1.409 4.410 8.409
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.29068 0.23763 93.805 <2e-16 ***
Educ_Mom_Factor_Effects21 0.18368 0.47941 0.383 0.7017
Educ_Mom_Factor_Effects22 -0.42303 0.50530 -0.837 0.4026
Educ_Mom_Factor_Effects23 -0.70021 0.37726 -1.856 0.0637 .
Educ_Mom_Factor_Effects24 -0.44000 0.42290 -1.040 0.2983
Educ_Mom_Factor_Effects25 -0.02294 0.37909 -0.061 0.9518
Educ_Mom_Factor_Effects26 1.61254 0.96379 1.673 0.0946 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.153 on 1224 degrees of freedom
(32 observations deleted due to missingness)
Multiple R-squared: 0.004912, Adjusted R-squared: 3.459e-05
F-statistic: 1.007 on 6 and 1224 DF, p-value: 0.419
If you want to switch the leave-out group, you need to move the row of -1's in the contrast matrix to the new leave-out group and rerun the model.
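With base R's built-in contr.sum(), the leave-out group is always the last factor level, so another way to switch it is to reorder the levels before assigning the contrasts. A sketch with a hypothetical 3-level factor (not the COVID data):

```r
# Toy data with a 3-level factor (hypothetical, for illustration only)
set.seed(2)
toy <- data.frame(
  g = factor(rep(c("A", "B", "C"), each = 10)),
  y = rnorm(30)
)
m <- tapply(toy$y, toy$g, mean)  # group means, for checking the intercept

# contr.sum() gives effects (sum-to-zero) coding with the LAST level left out
contrasts(toy$g) <- contr.sum(3)           # leave-out group: C
fit1 <- lm(y ~ g, data = toy)

# Move "A" to the end to make it the new leave-out group, then refit
toy$g <- factor(toy$g, levels = c("B", "C", "A"))
contrasts(toy$g) <- contr.sum(3)           # leave-out group: A
fit2 <- lm(y ~ g, data = toy)

# The intercept (mean of the group means) is the same either way;
# only which group lacks its own coefficient changes
coef(fit1)[["(Intercept)"]]
coef(fit2)[["(Intercept)"]]
```

Either approach, the hand-built contrast matrix or contr.sum() with reordered levels, fits the same model; the choice is purely about convenience.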
Interpreting Effects Codes
The intercept of the effects coded groups is the mean of the group means. Each coefficient is the difference between that group’s mean and the mean of the group means (Group - Intercept). So a negative sign means the group mean is below the mean of the group means.
This is a very common mistake: people often say you are comparing to the “grand mean” or “overall average”, but this is incorrect. The grand mean is NOT the mean of the group means. They only equal each other when group sample sizes are equal.
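A quick numeric illustration of the distinction, using deliberately unbalanced hypothetical numbers:

```r
# Two groups with very different sample sizes
y1 <- rep(10, 8)   # n = 8, group mean = 10
y2 <- rep(20, 2)   # n = 2, group mean = 20

grand_mean    <- mean(c(y1, y2))               # weighted by n: (8*10 + 2*20)/10
mean_of_means <- mean(c(mean(y1), mean(y2)))   # unweighted: (10 + 20)/2

grand_mean      # 12
mean_of_means   # 15
```

The grand mean is pulled toward the larger group, while the mean of the group means treats every group equally; the effects coded intercept is the latter.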
Example b1 (with PhD as the leave-out group):
Individuals whose mothers have no education are expected to have an average COVID-19 concern that is about .18 higher than the mean of the mother education group means.
Example b3:
Individuals whose mothers have up to 9 years of education are expected to have an average COVID-19 concern that is about .70 lower than the mean of the mother education group means.
Well Done!
You have completed the Multicategorical Predictors tutorial. Here is a summary of what was covered:
Dummy coding with factor() and relevel(), manual dummy variables, and custom contrast matrices
Sequential coding for step-by-step comparisons between adjacent groups
Helmert coding for comparing each group to the average of subsequent groups
Effects coding for comparing each group to the mean of all group means
How to interpret the intercept and slopes under each coding scheme
How to add covariates to multicategorical models
The next lesson covers prediction and cross-validation.