Here, we discuss stepwise regression in R, including, forward, backward, and bi-directional (or forward-backward) steps.

Stepwise regression in R can be performed with the step() and lm() functions from the "stats" package in the base version of R.

For multiple regression modeling, stepwise regression can be used to perform variable selection among a set of variables by adding or dropping one variable at a time.

Sample Steps to Run a Stepwise Regression:

# Specify data
x1 = c(6.8, 4.7, 3.3, 4.3, 4.6, 7.4, 2.8)
x2 = c(7.2, 3.9, 6.7, 7.0, 5.4, 5.8, 6.6)
x3 = c(2.8, 1.8, 6.2, 3.6, 0.4, 4.5, 3.0)
x4 = c(6.4, 7.7, 7.9, 7.0, 7.4, 7.1, 6.4)
e = c(-0.01, -0.05, 0.04, -0.07, 0.05, 0.03, -0.02)
y = 3 + 4*x1 + 3*x3 + e
df_data = data.frame(y, x1, x2, x3, x4)
df_data
      y  x1  x2  x3  x4
1 38.59 6.8 7.2 2.8 6.4
2 27.15 4.7 3.9 1.8 7.7
3 34.84 3.3 6.7 6.2 7.9
4 30.93 4.3 7.0 3.6 7.0
5 22.65 4.6 5.4 0.4 7.4
6 46.13 7.4 5.8 4.5 7.1
7 23.18 2.8 6.6 3.0 6.4
# Specify the intercept model
intercept_model = lm(y ~ 1, data = df_data)

# Specify the guess model
guess_model = lm(y ~ x1 + x2,
                 data = df_data)

# Specify the full model
full_model = lm(y ~ ., data = df_data)
#or
full_model = lm(y ~ x1 + x2 +x3 + x4, 
                data = df_data)

Perform forward step-wise regression:

step(intercept_model, direction = 'forward', 
     scope = formula(full_model), trace = 0)

Call:
lm(formula = y ~ x1 + x3, data = df_data)

Coefficients:
(Intercept)           x1           x3  
      2.964        4.004        3.003  

Perform backward step-wise regression:

step(full_model, direction = 'backward', trace = 0)

Call:
lm(formula = y ~ x1 + x3, data = df_data)

Coefficients:
(Intercept)           x1           x3  
      2.964        4.004        3.003  

Perform forward-backward step-wise regression:

step(guess_model, direction = 'both',
     scope = list(lower = formula(intercept_model), upper = formula(full_model)),
     trace = 0)

Call:
lm(formula = y ~ x1 + x3, data = df_data)

Coefficients:
(Intercept)           x1           x3  
      2.964        4.004        3.003  

Showing Model Iterations and Using BIC:

To show model iterations, set trace = 1. For BIC, set k = log(number of rows).

The results shows that the algorithm only adds variables that reduce the criterion (BIC in this case, although labelled AIC). If there are multiple variables that reduce the AIC, it picks the one that reduces it the most.

When no other variable can reduce the AIC, the algorithm stops. This way, the algorithm is highly likely to have the final model as the combination of variables that have the lowest value of the criterion (AIC or BIC).

# Perform forward step-wise regression with step shown
# BIC can be used when we set k = log(nrow(df_data)
step(intercept_model, direction = 'forward', 
     scope = formula(full_model), trace = 1,
     k = log(nrow(df_data)))
Start:  AIC=30.95
y ~ 1

       Df Sum of Sq    RSS    AIC
+ x1    1   253.444 187.55 26.909
+ x3    1   164.437 276.56 29.627
<none>              440.99 30.948
+ x2    1    34.025 406.97 32.332
+ x4    1     1.493 439.50 32.870

Step:  AIC=26.91
y ~ x1

       Df Sum of Sq     RSS     AIC
+ x3    1   187.536   0.012 -38.605
<none>              187.549  26.909
+ x2    1    43.612 143.936  27.002
+ x4    1     4.962 182.587  28.667

Step:  AIC=-38.6
y ~ x1 + x3

       Df  Sum of Sq      RSS     AIC
<none>               0.012240 -38.605
+ x4    1 0.00116413 0.011076 -37.359
+ x2    1 0.00003159 0.012209 -36.677

Call:
lm(formula = y ~ x1 + x3, data = df_data)

Coefficients:
(Intercept)           x1           x3  
      2.964        4.004        3.003  
Table of Some Stepwise Regression Arguments in R
Argument Usage
object The lm model object to start from
scope Set the range of variables applicable. This can be one model or a list of the upper and lower models
direction Set to “both” for forward-backward steps, “backward”, or “forward”, with a default of “both”. If scope is not set, the default is “backward”.
trace Set to 1 to print the selection steps, set to 0 otherwise
k Multiple of the number of penalty degrees of freedom. For AIC (default), k = 2, For BIC, k = log(number of rows).

1 Definitions

Consider the set of predictors \(x_1, x_2, \cdots, x_p\), to be used to predict \(y\). We can define the following models below.

Full Model

The model with all the available variables:

\[\widehat y = \widehat \beta_0 + \widehat \beta_1 x_1 + \widehat \beta_2 x_2 + \cdots + \widehat \beta_p x_p.\]

full_model = lm(y ~ ., data = dataset)
#or
full_model = lm(y ~ x1 + x2 + ... + xp, data = dataset)

Intercept Model (or Null Model)

The model with no variable, hence, the predictor is the sample mean, \(\bar y\):

\[\widehat y = \bar y.\]

intercept_model = lm(y ~ 1, data = dataset)

Guess Model

A sub model of the full model, containing a subset of the full set of variables, which may be the guess before variable selection:

\[\widehat y = \widehat \beta_0 + \widehat \beta_1 x_1 + \widehat \beta_2 x_2.\]

guess_model = lm(y ~ x1 + x2, data = dataset)

Dataset

Using the "USJudgeRatings" data from the "datasets" package with 10 sample rows from 43 rows below:

USJudgeRatings
                CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
AARONSON,L.H.    5.7  7.9  7.7  7.3  7.1  7.4  7.1  7.1  7.1  7.0  8.3  7.8
ARMENTANO,A.J.   7.2  8.1  7.8  7.8  7.5  7.6  7.5  7.5  7.3  7.4  7.9  7.8
BURNS,E.B.       6.2  8.8  8.7  8.5  7.9  8.0  8.1  8.0  8.0  8.0  8.6  8.6
HEALEY.A.H.      8.0  7.6  6.6  7.2  6.5  6.5  6.8  6.7  6.4  6.5  6.9  6.7
O'BRIEN,F.J.     7.1  8.5  8.3  8.0  7.9  7.9  7.8  7.8  7.8  7.7  8.3  8.2
O'SULLIVAN,T.J.  7.5  9.0  8.9  8.7  8.4  8.5  8.4  8.3  8.3  8.3  8.8  8.7
PASKEY,L.        7.5  8.1  7.7  8.2  8.0  8.1  8.2  8.4  8.0  8.1  8.4  8.1
SIDOR,W.J.       7.7  6.2  5.1  5.6  5.6  5.9  5.6  5.6  5.3  5.5  6.3  5.3
STAPLETON,J.F.   6.5  8.2  7.7  7.8  7.6  7.7  7.7  7.7  7.5  7.6  8.5  7.7
ZARRILLI,K.J.    8.6  7.4  7.0  7.5  7.5  7.7  7.4  7.2  6.9  7.0  7.8  7.1

NOTE: The dependent variable is "RTEN".

The data on US judges’ ratings by lawyers with descriptions below.

Table of Variable Descriptions
Sign Code Variable Description
\(y\) RTEN Worthy of retention
\(x_1\) CONT Number of contacts of lawyer with judge
\(x_2\) INTG Judicial integrity
\(x_3\) DMNR Demeanor
\(x_4\) DILG Diligence
\(x_5\) CFMG Case flow managing
\(x_6\) DECI Prompt decisions
\(x_7\) PREP Preparation for trial
\(x_8\) FAMI Familiarity with law
\(x_9\) ORAL Sound oral rulings
\(x_{10}\) WRIT Sound written rulings
\(x_{11}\) PHYS Physical ability

Set the Defined Reference Models

# Specify the intercept model
intercept_model = lm(RTEN ~ 1, data = USJudgeRatings)

# Specify the guess model
guess_model = lm(RTEN ~ DMNR + DILG + CFMG,
                 data = USJudgeRatings)

# Specify the full model
full_model = lm(RTEN ~ ., data = USJudgeRatings)
#or
full_model = lm(RTEN ~ CONT + INTG + DMNR +
                  DILG + CFMG + DECI + PREP +
                  FAMI + ORAL + WRIT + PHYS, 
                data = USJudgeRatings)

2 Forward Stepwise Regression in R

In the forward selection, the following happens:

  • We start with the intercept model or the guess model (the guess model must be a subset of the full set of variables in the upper scope).

  • Then from the intercept model or the guess model, based on AIC or BIC criterion, we add one variable repeatedly, and potentially pick variables up to the full set of variables available (or upper scope) if they all improve model quality.

Perform forward step-wise regression from intercept:

step(intercept_model, direction = 'forward', 
     scope = formula(full_model), trace = 0)

Call:
lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings)

Coefficients:
(Intercept)         ORAL         DMNR         PHYS         INTG         DECI  
    -2.2043       0.2917       0.1520       0.2829       0.3779       0.1667  

The final selection includes only "ORAL", "DMNR", "PHYS", "INTG", and "DECI".

summary(lm(formula = RTEN ~ ORAL + DMNR +
             PHYS + INTG + DECI,
           data = USJudgeRatings))

Call:
lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.240656 -0.069026 -0.009474  0.068961  0.246402 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.20433    0.43611  -5.055 1.19e-05 ***
ORAL         0.29169    0.10191   2.862 0.006887 ** 
DMNR         0.15199    0.06354   2.392 0.021957 *  
PHYS         0.28292    0.04678   6.048 5.40e-07 ***
INTG         0.37785    0.10559   3.579 0.000986 ***
DECI         0.16672    0.07702   2.165 0.036928 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1119 on 37 degrees of freedom
Multiple R-squared:  0.9909,    Adjusted R-squared:  0.9897 
F-statistic: 806.1 on 5 and 37 DF,  p-value: < 2.2e-16

If we want our model to include a specific set of variables, we could start with those variables in a guess model. For example, for "DMNR", "DILG", and "CFMG", we could use the guess model above.

Perform forward step-wise regression from guess:

step(guess_model, direction = 'forward', 
     scope = formula(full_model), trace = 0)

Call:
lm(formula = RTEN ~ DMNR + DILG + CFMG + PHYS + ORAL + INTG + 
    DECI, data = USJudgeRatings)

Coefficients:
(Intercept)         DMNR         DILG         CFMG         PHYS         ORAL  
   -2.22619      0.15132      0.01817     -0.10734      0.29224      0.30146  
       INTG         DECI  
    0.37467      0.24209  

The final selection includes the guess variables, "DMNR", "DILG", and "CFMG", including additional variables "PHYS", "ORAL", "INTG" and "DECI".

Notice that not all the variables in the table below are significant, compared to the model selection that started with the intercept model. "DILG" and "CFMG" are not significant, which is a disadvantage of not starting from the intercept model.

summary(lm(formula = RTEN ~ DMNR + DILG + CFMG +
             PHYS + ORAL + INTG + DECI,
           data = USJudgeRatings))

Call:
lm(formula = RTEN ~ DMNR + DILG + CFMG + PHYS + ORAL + INTG + 
    DECI, data = USJudgeRatings)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.24638 -0.06583 -0.01054  0.06288  0.25587 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.22619    0.45932  -4.847 2.55e-05 ***
DMNR         0.15132    0.06908   2.190  0.03524 *  
DILG         0.01817    0.10274   0.177  0.86062    
CFMG        -0.10734    0.12109  -0.886  0.38142    
PHYS         0.29224    0.05101   5.730 1.75e-06 ***
ORAL         0.30146    0.10845   2.780  0.00869 ** 
INTG         0.37467    0.11885   3.152  0.00331 ** 
DECI         0.24209    0.12758   1.898  0.06603 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1137 on 35 degrees of freedom
Multiple R-squared:  0.9911,    Adjusted R-squared:  0.9893 
F-statistic: 557.3 on 7 and 35 DF,  p-value: < 2.2e-16

3 Backward Stepwise Regression in R

In the backward selection, the following happens:

  • We start with the full model or set of models to be possibly included in the final selection.

  • Then from the full model, based on AIC or BIC criterion, we drop one variable repeatedly, and potentially drop every variable to get to the intercept model if no variable improves model quality, or get to a final selection with specific variables if those variables are set in a lower scope model formula. The lower scope must be a subset of the full set of variables in the full model.

Perform backward step-wise regression from full model:

step(full_model, direction = 'backward', trace = 0)

Call:
lm(formula = RTEN ~ INTG + DMNR + DECI + ORAL + PHYS, data = USJudgeRatings)

Coefficients:
(Intercept)         INTG         DMNR         DECI         ORAL         PHYS  
    -2.2043       0.3779       0.1520       0.1667       0.2917       0.2829  

The final selection includes only "INTG", "DMNR", "DECI", "ORAL", and "PHYS". This is the same set as the selection above for the forward stepwise selection but in different order.

If we want our model to include a specific set of variables, we could end with those variables included in a set lower scope model. For example, for "DMNR", "DILG", and "CFMG", we could use the guess model above.

Perform backward step-wise regression with lower model as guess model:

step(full_model, direction = 'backward', 
     scope = list(lower = formula(guess_model)),
     trace = 0)

Call:
lm(formula = RTEN ~ INTG + DMNR + DILG + CFMG + DECI + ORAL + 
    PHYS, data = USJudgeRatings)

Coefficients:
(Intercept)         INTG         DMNR         DILG         CFMG         DECI  
   -2.22619      0.37467      0.15132      0.01817     -0.10734      0.24209  
       ORAL         PHYS  
    0.30146      0.29224  

The final selection includes the guess variables, "DMNR", "DILG", and "CFMG", including additional variables "INTG", "DECI", "ORAL" and "PHYS". This is the same set as the selection above for the forward stepwise selection that started with the guess model but in different order.

Again, not all the variables in the selection are significant, compared to the model selection that did not have a lower scope model set. "DILG" and "CFMG" are not significant, which is a disadvantage of arbitrarily setting a lower scope.

4 Forward-Backward (or Both) Stepwise Regression in R

In the forward-backward (or both) selection, the following happens:

  • We start with a guess model model, this could be the intercept model, the full model, or a subset of the full model.

  • Then from the guess model, based on AIC or BIC criterion, we add one or drop one variable repeatedly, and potentially add more variables to the guess model if they are significant or drop variables from the guess model if they all improve model quality. The lower scope must be as subset of the guess model which in turn must be a subset of the full model.

Perform forward-backward step-wise regression starting from intercept model:

step(intercept_model, direction = 'both',
     scope = list(lower = formula(intercept_model), upper = formula(full_model)),
     trace = 0)

Call:
lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings)

Coefficients:
(Intercept)         ORAL         DMNR         PHYS         INTG         DECI  
    -2.2043       0.2917       0.1520       0.2829       0.3779       0.1667  

Perform forward-backward step-wise regression starting from guess model:

step(guess_model, direction = 'both',
     scope = list(lower = formula(intercept_model), upper = formula(full_model)),
     trace = 0)

Call:
lm(formula = RTEN ~ DMNR + PHYS + ORAL + INTG + DECI, data = USJudgeRatings)

Coefficients:
(Intercept)         DMNR         PHYS         ORAL         INTG         DECI  
    -2.2043       0.1520       0.2829       0.2917       0.3779       0.1667  

Perform forward-backward step-wise regression starting from full model:

step(full_model, direction = 'both',
     scope = list(lower = formula(intercept_model), upper = formula(full_model)),
     trace = 0)

Call:
lm(formula = RTEN ~ INTG + DMNR + DECI + ORAL + PHYS, data = USJudgeRatings)

Coefficients:
(Intercept)         INTG         DMNR         DECI         ORAL         PHYS  
    -2.2043       0.3779       0.1520       0.1667       0.2917       0.2829  

The final selections are all the same and it includes only "DMNR", "PHYS", "ORAL", "INTG", and "DECI". This is the same set as the selection above for the forward stepwise selection.

In the forward-backward stepwise selection, unlike the forward and backward stepwise selections, you cannot pre-specify variables that must be included in the final selection, you can only set the scope of the variables to be included.

Copyright © 2020 - 2024. All Rights Reserved by Stats Codes