Prediction

Author

Michael Woller

Introduction

Most of this course has focused on inference, which involves using a model to make conclusions about population-level relationships. But there is another goal in regression: prediction. For predictive models, we are often less concerned with how well the model fits the data we already have, and more concerned with how well it will perform on new, unseen data. A model can fit the current dataset very well simply by capturing noise (overfitting), but this does not guarantee it will generalize to future datasets.

In this lesson we will cover cross-validation methods for evaluating predictive accuracy, variable selection techniques for building optimal models, prediction intervals, and LASSO regression.

The Dataset

For this lesson we continue with the Big 5 personality dataset, predicting ExistentialWellBeing from the five personality traits:

  • OpenMindedness
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • NegativeEmotionality

1. Cross-Validation and Shrunken R

What Is Cross-Validation?

Cross-validation is a method for evaluating how well a model is likely to perform in predicting new data. It involves splitting the data into two parts:

  1. A training set, used to fit the model
  2. A testing (hold-out) set, used to evaluate the model’s predictive accuracy

The model is fit on the training data, then used to predict outcomes in the testing data. The relationship between the predicted and actual values in the testing set provides a more realistic estimate of model performance than evaluating on the same data the model was trained on.

This typically results in a smaller \(R^2\) than the original model, because the model is no longer being evaluated on the data it was trained on. This reduced value is often called the shrunken \(R\) (or, when squared, the shrunken \(R^2\)). It is the correlation between the observed and predicted values in the testing set.
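To see why shrinkage happens, here is a small self-contained simulation (separate from the big5 analysis, using made-up noise data): the outcome is pure noise, so any in-sample fit is overfitting, and the shrunken \(R^2\) is much smaller than the in-sample \(R^2\).

```r
set.seed(1)

# Simulate an outcome that is pure noise, plus ten unrelated predictors
n   <- 100
dat <- data.frame(y = rnorm(2 * n), matrix(rnorm(2 * n * 10), ncol = 10))

train <- dat[1:n, ]
test  <- dat[(n + 1):(2 * n), ]

fit <- lm(y ~ ., data = train)

# In-sample R^2 is inflated: the model has fit noise in the training data
r2_in <- summary(fit)$r.squared

# Shrunken R^2: squared correlation between predictions and new outcomes
r2_out <- cor(test$y, predict(fit, newdata = test))^2

c(in_sample = r2_in, shrunken = r2_out)
```

With ten useless predictors and 100 cases, the in-sample \(R^2\) will typically sit near \(p/(n-1) \approx .10\) purely by chance, while the shrunken \(R^2\) stays near zero.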

Splitting Training/Testing Data

The first thing to do is set a seed. This is an important step: it lets you (and others) replicate the analysis with the exact same data splits. The specific seed number does not matter, but you must set one.

set.seed(8675309)

n           <- nrow(big5)
# Create a random index of which rows will be in the training data
train_index <- sample(1:n, size = floor(0.5 * n))

train_data <- big5[train_index, ]  # subsets big5 training data
test_data  <- big5[-train_index, ] # subsets big5 testing data
# Note: '-train_index' means all rows that DON'T include these values

Here we are doing a 50/50 split for demonstration purposes; in practice 80/20 or 90/10 splits are more common. Note that your shrunken R from simple, double, and (when folds are randomized) k-fold cross-validation will differ depending on your seed. Leave-one-out cross-validation is deterministic and insensitive to the seed, as you will see below.


Simple Cross-Validation

This method simply trains a model with the training data, then tries to predict the testing data by funneling the testing data into the trained model. The closer the model’s predictions are to the true testing data, the better the model is. We will use shrunken R to determine this.

# Fit the model on the training data
training_fit <- lm(ExistentialWellBeing ~ OpenMindedness + Conscientiousness 
                   + Extraversion + Agreeableness + NegativeEmotionality,
                   data = train_data)

# Predict on the testing data using "newdata ="
tested <- predict(training_fit, newdata = test_data)

# Shrunken R: correlate the testing data outcomes with the model's predictions
shrunkenR <- cor(test_data$ExistentialWellBeing, tested, use = "complete.obs")
shrunkenR
[1] 0.599186

Our model’s predictions have a moderate positive correlation (~0.60) with actual existential well-being in new data.

# Shrunken R^2
shrunkenR2 <- shrunkenR^2
shrunkenR2
[1] 0.3590239

About 36% of the variance in existential well-being is associated with the model’s predictions in new data.

We can also compute the out-of-sample \(R^2\), which is calculated from sums of squares rather than correlations. It asks “how much variance in \(Y\) would this model explain if taken to a new sample from the same population?”

SSE    <- sum((test_data$ExistentialWellBeing - tested)^2)
SST    <- sum((test_data$ExistentialWellBeing - mean(test_data$ExistentialWellBeing))^2)
R2_oos <- 1 - SSE / SST
R2_oos
[1] 0.3576984

The model explains about 36% of the variance in existential well-being in the test dataset.

# RMSE: average size of prediction error (common in predictive statistics)
rmse <- sqrt(mean((test_data$ExistentialWellBeing - tested)^2))
rmse
[1] 0.5902557

On average, the model’s predictions are off by about 0.59 units of existential well-being.


Double Cross-Validation

Double cross-validation is the same as simple cross-validation, but you train two models and test each one on the data used to train the other. This requires a 50/50 split, which is why we split the data in halves above.

# Model 1: trained on train_data, tested on test_data (same as above)
training_fit <- lm(ExistentialWellBeing ~ OpenMindedness + Conscientiousness 
                   + Extraversion + Agreeableness + NegativeEmotionality,
                   data = train_data)

# Model 2: trained on test_data, tested on train_data (swap the halves)
second_fit <- lm(ExistentialWellBeing ~ OpenMindedness + Conscientiousness 
                 + Extraversion + Agreeableness + NegativeEmotionality,
                 data = test_data)

# Predictions
tested  <- predict(training_fit, newdata = test_data)
tested2 <- predict(second_fit,   newdata = train_data)

# Shrunken R for each model (named to avoid clashing with shrunkenR2,
# the shrunken R^2 from the previous section)
shrunkenR_m1 <- cor(test_data$ExistentialWellBeing,  tested,  use = "complete.obs")
shrunkenR_m2 <- cor(train_data$ExistentialWellBeing, tested2, use = "complete.obs")

# Final shrunken R is the average of the two
shrunkenR_dcv <- mean(c(shrunkenR_m1, shrunkenR_m2))
shrunkenR_dcv
[1] 0.6102768
# Shrunken R^2
shrunkenR_dcv^2
[1] 0.3724378

Interpretations follow the same logic as simple cross-validation.


K-Fold Cross-Validation

K-fold cross-validation extends double cross-validation: instead of doing the train/test swap twice, you split the data into \(k\) folds and repeat the process \(k\) times, with each fold serving once as the test set. Typically \(k = 10\) (we use \(k = 5\) below), as there are diminishing returns for larger \(k\). This is probably the most common form of validation for regression.

k  <- 5           # number of folds
N  <- nrow(big5)  # sample size

# Assign fold indices in row order (1, 2, ..., k, 1, 2, ...).
# Fold membership is often randomized with sample() instead; we keep
# row order here so the results below do not depend on a seed.
big5$fold <- rep(1:k, length.out = N)

# Empty vector to store shrunken R per fold
shrunkenRCV_kfold <- vector(mode = "numeric", length = k)

# The loop does the cross-validation k times automatically
for (i in 1:k) {
  train     <- big5[big5$fold != i, ]               # training data (all folds except i)
  test      <- big5[big5$fold == i, ]               # testing data (fold i)
  modeltest <- lm(ExistentialWellBeing ~ OpenMindedness 
                  + Conscientiousness + Extraversion 
                  + Agreeableness + NegativeEmotionality,
                  data = train)
  Ypredcv   <- predict(modeltest, newdata = test)
  shrunkenRCV_kfold[i] <- cor(Ypredcv, test$ExistentialWellBeing)
}

# Average shrunken R across all k folds
shrunkenR_kfold <- mean(shrunkenRCV_kfold)
shrunkenR_kfold
[1] 0.6126861
# Shrunken R^2
shrunkenR_kfold^2
[1] 0.3753843

Don’t worry too much about the for loop syntax; if you aren’t familiar with loops in R, it may look complicated for now. The key idea is that it automates the cross-validation process \(k\) times and averages the results.


Leave-One-Out Validation

Leave-one-out (LOO) validation evaluates a model by using all but one observation to fit the model, then testing it on the single left-out observation, repeating this for every data point in the dataset. In simple terms: “Train on almost all the data, test on one point, and repeat for every point.”

Since we are testing on every single point, we don’t need a seed. We start by estimating the model on all data.

full_model <- lm(ExistentialWellBeing ~ OpenMindedness 
                 + Conscientiousness + Extraversion  
                 + Agreeableness + NegativeEmotionality,
                 data = big5)

# Save residuals and hat values
e <- resid(full_model)
h <- hatvalues(full_model) # a measure of leverage

# Get LOO predicted value for each case
yhat_loo <- big5$ExistentialWellBeing - e / (1 - h)

# LOO Shrunken R
shrunkenR_LOO <- cor(big5$ExistentialWellBeing, yhat_loo, use = "complete.obs")
shrunkenR_LOO
[1] 0.6120392
# LOO Shrunken R^2
shrunkenR_LOO^2
[1] 0.374592
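
The hat-value shortcut above relies on the standard deleted-residual identity \(\hat{y}_{(i)} = y_i - e_i/(1 - h_i)\). As a sanity check (using the built-in mtcars data rather than big5), it produces exactly the same predictions as literally refitting the model with each observation left out:

```r
# Verify the LOO shortcut against an explicit leave-one-out loop (mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: algebraic LOO predictions from residuals and leverages
yhat_fast <- mtcars$mpg - resid(fit) / (1 - hatvalues(fit))

# Brute force: refit the model n times, each time leaving one row out
yhat_slow <- sapply(1:nrow(mtcars), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  predict(fit_i, newdata = mtcars[i, ])
})

all.equal(unname(yhat_fast), unname(yhat_slow))  # TRUE
```

This is why no seed is needed: nothing about LOO is random, so the shortcut gives the brute-force answer at a fraction of the cost.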

2. Variable Selection

Since the goal of prediction is accuracy on new data, predictive models focus on minimizing error variance rather than on obtaining unbiased estimates.

  • Unbiased estimates -> Valid inference of the population
  • Minimized error variance -> More accurate and consistent predictions of samples

Unfortunately, when doing statistics, you generally have to choose one or the other. This is the infamous bias-variance trade-off.
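The trade-off can be made precise. For a prediction \(\hat{f}(x)\) of a new observation \(y = f(x) + \varepsilon\), the expected squared prediction error decomposes as:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{variance}}
  + \underbrace{\sigma^2_{\varepsilon}}_{\text{irreducible error}}
```

Dropping or shrinking predictors introduces bias, but it can reduce variance enough to lower the total prediction error.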

Minimizing error variance relies on variable selection: step-by-step algorithmic procedures for choosing which predictors best reduce prediction error.

An “algorithm” is just a step-by-step procedure for producing a specific outcome. Machine learning methods are automated algorithms for minimizing error variance. Some are more “black box” than others, meaning the intermediate steps are hidden from us. Ridge and LASSO regression (discussed later) are examples of non-black-box automated algorithms.

The following methods are quite out of date and should rarely be used in practice. However, you may still encounter them in published work, so it is worth knowing how they work. In psychology, Ridge or LASSO regression will usually serve you better.


Forward Selection

Forward selection starts with no predictors and adds them one at a time. At each step, the predictor that provides the most significant improvement (\(\Delta R^2\) F test) is added. The process stops when no remaining variable significantly improves the model. We use add1() with test = "F".

Note that ~1 in lm() means a model with only the intercept and no predictors.

modelnone <- lm(ExistentialWellBeing ~ 1, data = big5)

add1(modelnone, scope = ~ OpenMindedness + Conscientiousness + Extraversion + 
       Agreeableness + NegativeEmotionality, test = "F")
Single term additions

Model:
ExistentialWellBeing ~ 1
                     Df Sum of Sq    RSS      AIC F value    Pr(>F)    
<none>                            850.09  -929.04                      
OpenMindedness        1     9.639 840.45  -944.71  17.754 2.659e-05 ***
Conscientiousness     1   120.445 729.64 -1163.86 255.535 < 2.2e-16 ***
Extraversion          1   106.661 743.43 -1134.85 222.094 < 2.2e-16 ***
Agreeableness         1    76.546 773.54 -1073.30 153.184 < 2.2e-16 ***
NegativeEmotionality  1   303.423 546.66 -1611.37 859.208 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

NegativeEmotionality has the highest F statistic (and lowest p-value), so it is added first. Use update() with a + in front of the variable name to add it, and make sure every candidate variable appears in scope =.

add1(update(modelnone, ~. + NegativeEmotionality), 
     scope = ~ OpenMindedness + Conscientiousness + Extraversion
             + Agreeableness + NegativeEmotionality, test = "F")
Single term additions

Model:
ExistentialWellBeing ~ NegativeEmotionality
                  Df Sum of Sq    RSS     AIC F value    Pr(>F)    
<none>                         546.66 -1611.4                      
OpenMindedness     1    2.9432 543.72 -1617.7  8.3742 0.0038593 ** 
Conscientiousness  1   15.2054 531.46 -1653.1 44.2608 3.977e-11 ***
Extraversion       1    3.9161 542.75 -1620.5 11.1621 0.0008548 ***
Agreeableness      1    9.9049 536.76 -1637.7 28.5469 1.051e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conscientiousness is next most significant, so it is added.

add1(update(modelnone, ~. + NegativeEmotionality + Conscientiousness), 
     scope = ~ OpenMindedness + Conscientiousness + Extraversion 
             + Agreeableness + NegativeEmotionality, test = "F")
Single term additions

Model:
ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness
               Df Sum of Sq    RSS     AIC F value   Pr(>F)   
<none>                      531.46 -1653.1                    
OpenMindedness  1    0.4690 530.99 -1652.5  1.3655 0.242771   
Extraversion    1    1.0283 530.43 -1654.1  2.9972 0.083610 . 
Agreeableness   1    3.2693 528.19 -1660.7  9.5692 0.002014 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Agreeableness is significant, so it is added.

add1(update(modelnone, ~. + NegativeEmotionality + Conscientiousness + Agreeableness), 
     scope = ~ OpenMindedness + Conscientiousness + Extraversion 
             + Agreeableness + NegativeEmotionality, test = "F")
Single term additions

Model:
ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness + 
    Agreeableness
               Df Sum of Sq    RSS     AIC F value Pr(>F)
<none>                      528.19 -1660.7               
OpenMindedness  1   0.02255 528.17 -1658.7   0.066 0.7973
Extraversion    1   0.86899 527.32 -1661.2   2.546 0.1108

No significant variables are left to add, so forward selection stops. The final model is:

ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness + Agreeableness


Backward Selection

Backward selection starts with the full model and removes predictors one at a time. At each step the least useful variable (highest p-value) is dropped. The process stops when all remaining predictors are significant. We use drop1() with test = "F". Use update() with a - in front of the variable name to remove it.

This is probably the “best” of the simple variable selection methods, though even it is not recommended these days.

full_model <- lm(ExistentialWellBeing ~ OpenMindedness + Conscientiousness 
                 + Extraversion + Agreeableness + NegativeEmotionality,
                 data = big5)

drop1(full_model, test = "F")
Single term deletions

Model:
ExistentialWellBeing ~ OpenMindedness + Conscientiousness + Extraversion + 
    Agreeableness + NegativeEmotionality
                     Df Sum of Sq    RSS     AIC  F value    Pr(>F)    
<none>                            527.30 -1659.3                       
OpenMindedness        1     0.022 527.32 -1661.2   0.0633  0.801403    
Conscientiousness     1     7.020 534.32 -1640.8  20.5549 6.242e-06 ***
Extraversion          1     0.868 528.17 -1658.7   2.5417  0.111077    
Agreeableness         1     2.980 530.28 -1652.5   8.7254  0.003186 ** 
NegativeEmotionality  1   142.115 669.41 -1291.4 416.1307 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

OpenMindedness has the highest p-value so it is dropped first.

drop1(update(full_model, ~ . - OpenMindedness), test = "F")
Single term deletions

Model:
ExistentialWellBeing ~ Conscientiousness + Extraversion + Agreeableness + 
    NegativeEmotionality
                     Df Sum of Sq    RSS     AIC  F value    Pr(>F)    
<none>                            527.32 -1661.2                       
Conscientiousness     1     7.005 534.33 -1642.8  20.5225 6.346e-06 ***
Extraversion          1     0.869 528.19 -1660.7   2.5460  0.110775    
Agreeableness         1     3.110 530.43 -1654.1   9.1119  0.002581 ** 
NegativeEmotionality  1   149.727 677.05 -1275.8 438.6869 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Extraversion should be dropped next.

drop1(update(full_model, ~ . - OpenMindedness - Extraversion), test = "F")
Single term deletions

Model:
ExistentialWellBeing ~ Conscientiousness + Agreeableness + NegativeEmotionality
                     Df Sum of Sq    RSS     AIC  F value    Pr(>F)    
<none>                            528.19 -1660.7                       
Conscientiousness     1     8.570 536.76 -1637.7  25.0838 6.119e-07 ***
Agreeableness         1     3.269 531.46 -1653.1   9.5692  0.002014 ** 
NegativeEmotionality  1   184.415 712.61 -1198.5 539.7794 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All remaining variables are significant so backward selection stops. The final model matches forward selection:

ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness + Agreeableness


Stepwise Selection

Stepwise (bidirectional) selection combines forward and backward selection. Predictors are added one at a time, but after each addition the procedure also checks whether any predictors already in the model should be removed. Variables can both enter and leave the model at different steps.

modelnone <- lm(ExistentialWellBeing ~ 1, data = big5)

add1(modelnone, scope = ~ OpenMindedness + Conscientiousness + Extraversion + 
       Agreeableness + NegativeEmotionality, test = "F")
Single term additions

Model:
ExistentialWellBeing ~ 1
                     Df Sum of Sq    RSS      AIC F value    Pr(>F)    
<none>                            850.09  -929.04                      
OpenMindedness        1     9.639 840.45  -944.71  17.754 2.659e-05 ***
Conscientiousness     1   120.445 729.64 -1163.86 255.535 < 2.2e-16 ***
Extraversion          1   106.661 743.43 -1134.85 222.094 < 2.2e-16 ***
Agreeableness         1    76.546 773.54 -1073.30 153.184 < 2.2e-16 ***
NegativeEmotionality  1   303.423 546.66 -1611.37 859.208 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

NegativeEmotionality is added first (same as forward selection).

add1(update(modelnone, ~. + NegativeEmotionality), 
     scope = ~ OpenMindedness + Conscientiousness + Extraversion 
             + Agreeableness + NegativeEmotionality, test = "F")
Single term additions

Model:
ExistentialWellBeing ~ NegativeEmotionality
                  Df Sum of Sq    RSS     AIC F value    Pr(>F)    
<none>                         546.66 -1611.4                      
OpenMindedness     1    2.9432 543.72 -1617.7  8.3742 0.0038593 ** 
Conscientiousness  1   15.2054 531.46 -1653.1 44.2608 3.977e-11 ***
Extraversion       1    3.9161 542.75 -1620.5 11.1621 0.0008548 ***
Agreeableness      1    9.9049 536.76 -1637.7 28.5469 1.051e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conscientiousness is added. Now check whether NegativeEmotionality should be dropped.

drop1(update(modelnone, ~. + NegativeEmotionality + Conscientiousness), test = "F")
Single term deletions

Model:
ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness
                     Df Sum of Sq    RSS     AIC F value    Pr(>F)    
<none>                            531.46 -1653.1                      
NegativeEmotionality  1   198.184 729.64 -1163.9 576.883 < 2.2e-16 ***
Conscientiousness     1    15.205 546.66 -1611.4  44.261 3.977e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both are significant, so don’t drop either. Continue adding.

add1(update(modelnone, ~. + NegativeEmotionality + Conscientiousness), 
     scope = ~ OpenMindedness + Conscientiousness + Extraversion 
             + Agreeableness + NegativeEmotionality, test = "F")
Single term additions

Model:
ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness
               Df Sum of Sq    RSS     AIC F value   Pr(>F)   
<none>                      531.46 -1653.1                    
OpenMindedness  1    0.4690 530.99 -1652.5  1.3655 0.242771   
Extraversion    1    1.0283 530.43 -1654.1  2.9972 0.083610 . 
Agreeableness   1    3.2693 528.19 -1660.7  9.5692 0.002014 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Agreeableness is added. Check again whether any predictor should be dropped.

drop1(update(modelnone, ~. + NegativeEmotionality + Conscientiousness 
             + Agreeableness), test = "F")
Single term deletions

Model:
ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness + 
    Agreeableness
                     Df Sum of Sq    RSS     AIC  F value    Pr(>F)    
<none>                            528.19 -1660.7                       
NegativeEmotionality  1   184.415 712.61 -1198.5 539.7794 < 2.2e-16 ***
Conscientiousness     1     8.570 536.76 -1637.7  25.0838 6.119e-07 ***
Agreeableness         1     3.269 531.46 -1653.1   9.5692  0.002014 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All still significant. Check if any remaining variables can be added.

add1(update(modelnone, ~. + NegativeEmotionality + Conscientiousness + Agreeableness), 
     scope = ~ OpenMindedness + Conscientiousness 
             + Extraversion + Agreeableness + NegativeEmotionality, test = "F")
Single term additions

Model:
ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness + 
    Agreeableness
               Df Sum of Sq    RSS     AIC F value Pr(>F)
<none>                      528.19 -1660.7               
OpenMindedness  1   0.02255 528.17 -1658.7   0.066 0.7973
Extraversion    1   0.86899 527.32 -1661.2   2.546 0.1108

None are significant, so stepwise selection is done. The final model is:

ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness + Agreeableness


All Subsets Regression

All subsets regression evaluates every possible combination of predictors and identifies the best model for each possible number of predictors, using adjusted \(R^2\) as the criterion. Unlike the step-by-step methods above, this finds the globally optimal model at each size.

We use regsubsets() from the leaps package (the select() call below comes from dplyr). First subset your dataset to just the variables of interest.

big5small <- big5 %>% select(ExistentialWellBeing, OpenMindedness, 
                              Conscientiousness, Extraversion,
                              Agreeableness, NegativeEmotionality)

subsetmodels <- regsubsets(ExistentialWellBeing ~ ., data = big5small)
summary(subsetmodels)
Subset selection object
Call: regsubsets.formula(ExistentialWellBeing ~ ., data = big5small)
5 Variables  (and intercept)
                     Forced in Forced out
OpenMindedness           FALSE      FALSE
Conscientiousness        FALSE      FALSE
Extraversion             FALSE      FALSE
Agreeableness            FALSE      FALSE
NegativeEmotionality     FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: exhaustive
         OpenMindedness Conscientiousness Extraversion Agreeableness
1  ( 1 ) " "            " "               " "          " "          
2  ( 1 ) " "            "*"               " "          " "          
3  ( 1 ) " "            "*"               " "          "*"          
4  ( 1 ) " "            "*"               "*"          "*"          
5  ( 1 ) "*"            "*"               "*"          "*"          
         NegativeEmotionality
1  ( 1 ) "*"                 
2  ( 1 ) "*"                 
3  ( 1 ) "*"                 
4  ( 1 ) "*"                 
5  ( 1 ) "*"                 

The bottom table shows the best model for each number of predictors. The number on the left is how many predictors are in the model, and * means that variable is included. For example, the best one-predictor model includes only NegativeEmotionality, and the best two-predictor model includes NegativeEmotionality and Conscientiousness.

To find the overall best model, extract the adjusted \(R^2\) for each:

summary(subsetmodels)$adjr2
[1] 0.3565161 0.3740101 0.3774585 0.3780805 0.3777032

The model with the highest adjusted \(R^2\) wins. In this dataset the four-predictor model has the highest adjusted \(R^2\), giving a final model of:

ExistentialWellBeing ~ NegativeEmotionality + Conscientiousness + Agreeableness + Extraversion

Note this differs slightly from forward/backward/stepwise selection, which is expected since all subsets finds the globally optimal model rather than following one sequential path.


3. Prediction Intervals

Prediction intervals are used when we want to estimate not just the average outcome, but the value of a new individual observation given a set of predictor values. There is always uncertainty around any prediction. A prediction interval provides a range of values within which we expect a single future observation to fall, with a certain level of confidence (typically 95%).

It is important to distinguish prediction intervals from confidence intervals. A confidence interval gives a range for the mean outcome at a given set of predictors, whereas a prediction interval gives a range for an individual outcome. Because individual observations vary more than averages, prediction intervals are always wider than confidence intervals.

Conceptually, a prediction interval accounts for two sources of uncertainty: (1) uncertainty in estimating the regression line itself, and (2) the random error variance around that line.
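For simple regression, these two sources appear explicitly in the standard prediction interval formula (shown for a single predictor; the multiple-regression version has the same structure in matrix form):

```latex
\hat{y}_0 \;\pm\; t_{(1-\alpha/2,\; n-2)} \, s \,
  \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
```

Here \(s\) is the residual standard error. The leading 1 under the square root is the error-variance term; the remaining terms capture uncertainty in the fitted line. Dropping the 1 gives the (narrower) confidence interval for the mean.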

While much of what we covered in this lesson is out of date, prediction intervals are the main thing you should take away. They are used everywhere: whenever you do anything involving real-world prediction, you will need to produce some kind of interval estimate.

Getting Prediction Intervals

Use predict() with interval = "prediction". The fit column is the predicted value, lwr is the lower bound, and upr is the upper bound of the 95% prediction interval.

full_model <- lm(ExistentialWellBeing ~ OpenMindedness + Conscientiousness 
                 + Extraversion + Agreeableness + NegativeEmotionality,
                 data = big5)

head(predict(full_model, big5, interval = "prediction")) # using head() to prevent printing hundreds of lines
       fit      lwr      upr
1 3.153296 2.005024 4.301568
2 3.162984 2.009782 4.316185
3 4.039420 2.889556 5.189283
4 2.877630 1.730560 4.024699
5 3.042007 1.894629 4.189385
6 3.176778 2.022871 4.330685

To change the confidence level, use level = (default is .95):

# 90% prediction interval
head(predict(full_model, big5, interval = "prediction", level = .9))
       fit      lwr      upr
1 3.153296 2.189813 4.116779
2 3.162984 2.195365 4.130603
3 4.039420 3.074602 5.004238
4 2.877630 1.915156 3.840104
5 3.042007 2.079274 4.004740
6 3.176778 2.208567 4.144989

For comparison, here is the 95% confidence interval per predicted score:

head(predict(full_model, big5, interval = "confidence"))
       fit      lwr      upr
1 3.153296 3.085819 3.220773
2 3.162984 3.036898 3.289070
3 4.039420 3.948813 4.130027
4 2.877630 2.835295 2.919965
5 3.042007 2.991999 3.092016
6 3.176778 3.044393 3.309162

Specific Predictions

Prediction intervals also let you estimate the uncertainty for a specific hypothetical individual. Save their predictor values in a new data frame and plug it into predict().

newdata <- data.frame(OpenMindedness = 1, Conscientiousness = 2,
                      Extraversion = 3, Agreeableness = 0, 
                      NegativeEmotionality = 2)

predict(full_model, newdata, interval = "prediction")
       fit      lwr      upr
1 3.029406 1.863558 4.195255

We predict this individual has an existential well-being score of about 3, with a 95% prediction interval of [1.86, 4.19].

For multiple hypothetical individuals, just add more rows to the data frame:

newdata <- data.frame(OpenMindedness      = c(1, 0), 
                      Conscientiousness   = c(2, 1), 
                      Extraversion        = c(3, 2), 
                      Agreeableness       = c(0, 1), 
                      NegativeEmotionality = c(3, 2))

predict(full_model, newdata, interval = "prediction")
       fit      lwr      upr
1 2.588098 1.424191 3.752005
2 2.964597 1.799168 4.130025

4. LASSO Regression

Ridge vs LASSO vs Elastic Net

Ridge and LASSO regression are more modern automated methods for building predictive models. Instead of removing variables one at a time like forward/backward selection, these approaches keep all predictors in the model but shrink their coefficients according to how useful each is for prediction. This intentional penalizing of large coefficients is called regularization.

Ridge regression shrinks all coefficients toward zero but never fully removes any predictor from the model. It keeps everything but gives more weight to more important variables.

LASSO regression also shrinks coefficients, but unlike Ridge it can shrink some all the way to exactly zero, effectively removing those predictors from the model. This makes LASSO especially useful when you have many predictors or when some are not very important, because it creates a simpler, more interpretable model while reducing overfitting.

A helpful way to think about the difference: backward selection makes “yes or no” decisions about each variable, whereas Ridge and LASSO adjust how much each variable matters. LASSO can still make “drop” decisions by zeroing out coefficients, while Ridge only reduces their importance.

Elastic Net is a mix of LASSO and Ridge that can both shrink and remove variables. It is especially useful when predictors are highly correlated, but it is more complicated to use since you have to make additional choices about how to balance shrinkage versus removal.
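In terms of what each method minimizes, the three differ only in the penalty added to the sum of squared errors (written here up to the scaling constants glmnet uses internally):

```latex
\text{Ridge: } \sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j \beta_j^2
\qquad
\text{LASSO: } \sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j |\beta_j|
\qquad
\text{Elastic Net: } \sum_i (y_i - \hat{y}_i)^2
  + \lambda \Big( \alpha \sum_j |\beta_j| + \tfrac{1-\alpha}{2} \sum_j \beta_j^2 \Big)
```

The absolute-value penalty is what lets LASSO set coefficients exactly to zero; the squared penalty only shrinks them. The alpha argument in glmnet (introduced below) is exactly the \(\alpha\) in the Elastic Net penalty.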

In this lesson we will focus on LASSO. Note that the coefficients from LASSO should not be interpreted in the same way as standard regression coefficients. LASSO seeks to minimize prediction error, not produce unbiased estimates, so the coefficients may differ substantially from what you would get with lm().

LASSO with cv.glmnet()

To do LASSO in R, use cv.glmnet() from the glmnet package (loaded with library(glmnet)). The cv stands for cross-validation, which is used to choose the optimal penalty \(\lambda\). The predictors must be supplied in matrix format.

predictors <- as.matrix(big5small %>% select(-ExistentialWellBeing))

lasso_model <- cv.glmnet(x      = predictors,
                         y      = big5$ExistentialWellBeing, 
                         family = "gaussian",  # ordinary regression
                         alpha  = 1)           # alpha = 1 specifies LASSO
# NOTE: alpha = 0 fits Ridge, alpha between 0 and 1 fits Elastic Net

lasso_model$lambda.min
[1] 0.004635115

The optimal \(\lambda\) was about .005. This value does not have a direct interpretation on its own.

To visualize how prediction error changes across \(\lambda\) values:

plot(lasso_model)

The Y-axis is mean-squared prediction error (lower is better). The X-axis is \(\log(\lambda)\), so moving right corresponds to a larger \(\lambda\) (more shrinking, a simpler model). The numbers along the top indicate how many predictors have non-zero coefficients at each point. The two vertical dotted lines mark lambda.min and the more conservative lambda.1se.

To extract the model coefficients:

# Best predictive performance
coef(lasso_model, s = "lambda.min")
6 x 1 sparse Matrix of class "dgCMatrix"
                      lambda.min
(Intercept)           3.58072170
OpenMindedness        .         
Conscientiousness     0.11561031
Extraversion          0.03598063
Agreeableness         0.08164591
NegativeEmotionality -0.44021022
# Simpler model that performs nearly as well
coef(lasso_model, s = "lambda.1se")
6 x 1 sparse Matrix of class "dgCMatrix"
                      lambda.1se
(Intercept)           3.93463768
OpenMindedness        .         
Conscientiousness     0.08077601
Extraversion          .         
Agreeableness         0.02121283
NegativeEmotionality -0.40607926

The two solutions are similar but not identical: at lambda.min every predictor except OpenMindedness is retained, while the simpler lambda.1se solution also zeroes out Extraversion. In both, NegativeEmotionality has by far the largest coefficient, which aligns with the Variable Importance lesson, where it had the largest effect.

Some other useful things you can extract:

lasso_model$lambda  # all lambda values tested
 [1] 0.442444220 0.403138695 0.367324964 0.334692827 0.304959639 0.277867865
 [7] 0.253182850 0.230690783 0.210196850 0.191523541 0.174509117 0.159006208
[13] 0.144880535 0.132009748 0.120282365 0.109596812 0.099860533 0.090989198
[19] 0.082905969 0.075540831 0.068829992 0.062715325 0.057143868 0.052067365
[25] 0.047441843 0.043227241 0.039387052 0.035888015 0.032699822 0.029794860
[31] 0.027147967 0.024736217 0.022538719 0.020536442 0.018712041 0.017049715
[37] 0.015535065 0.014154973 0.012897484 0.011751707 0.010707718 0.009756474
[43] 0.008889735 0.008099996 0.007380414 0.006724759 0.006127350 0.005583013
[49] 0.005087033 0.004635115 0.004223344 0.003848154 0.003506295 0.003194805
[55] 0.002910988 0.002652383 0.002416753 0.002202055
lasso_model$cvm     # mean cross-validation error per lambda tested
 [1] 0.5474312 0.5168876 0.4892192 0.4662463 0.4471717 0.4313338 0.4181833
 [8] 0.4072640 0.3981972 0.3906685 0.3844126 0.3790556 0.3739481 0.3690966
[15] 0.3649387 0.3614773 0.3586015 0.3562074 0.3541722 0.3523402 0.3507956
[22] 0.3495270 0.3484797 0.3476107 0.3468644 0.3462141 0.3456656 0.3452136
[29] 0.3448383 0.3445046 0.3442250 0.3439929 0.3438000 0.3436397 0.3435068
[36] 0.3433957 0.3433035 0.3432268 0.3431629 0.3431098 0.3430656 0.3430288
[43] 0.3429982 0.3429727 0.3429514 0.3429337 0.3429190 0.3429067 0.3428968
[50] 0.3428886 0.3428910 0.3429005 0.3429127 0.3429252 0.3429375 0.3429501
[57] 0.3429620 0.3429771
lasso_model$nzero   # number of non-zero coefficients per lambda tested
 s0  s1  s2  s3  s4  s5  s6  s7  s8  s9 s10 s11 s12 s13 s14 s15 s16 s17 s18 s19 
  0   1   1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   3   3 
s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39 
  3   3   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53 s54 s55 s56 s57 
  4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 

To get predicted scores:

head(predict(lasso_model, newx = predictors, s = "lambda.min"))
     lambda.min
[1,]   3.153834
[2,]   3.171883
[3,]   4.034786
[4,]   2.882289
[5,]   3.043611
[6,]   3.186076

Well Done!

You have completed the Prediction tutorial. Here is a summary of what was covered:

  • Cross-validation methods: simple, double, k-fold, and leave-one-out, and how to compute shrunken R and shrunken \(R^2\)
  • Variable selection: forward selection with add1(), backward selection with drop1(), stepwise selection, and all subsets regression with regsubsets()
  • Prediction intervals with predict() and how they differ from confidence intervals
  • LASSO regression with cv.glmnet() for simultaneous variable selection and coefficient shrinkage

The next lesson covers variable importance.