Here, we discuss stepwise regression in R, including forward, backward, and bi-directional (or forward-backward) selection.
Stepwise regression in R can be performed with the step() and lm() functions from the "stats" package in the base version of R.
For multiple regression modeling, stepwise regression can be used to perform variable selection among a set of variables by adding or dropping one variable at a time.
# Specify data
x1 = c(6.8, 4.7, 3.3, 4.3, 4.6, 7.4, 2.8)
x2 = c(7.2, 3.9, 6.7, 7.0, 5.4, 5.8, 6.6)
x3 = c(2.8, 1.8, 6.2, 3.6, 0.4, 4.5, 3.0)
x4 = c(6.4, 7.7, 7.9, 7.0, 7.4, 7.1, 6.4)
e = c(-0.01, -0.05, 0.04, -0.07, 0.05, 0.03, -0.02)
y = 3 + 4*x1 + 3*x3 + e
df_data = data.frame(y, x1, x2, x3, x4)
df_data
y x1 x2 x3 x4
1 38.59 6.8 7.2 2.8 6.4
2 27.15 4.7 3.9 1.8 7.7
3 34.84 3.3 6.7 6.2 7.9
4 30.93 4.3 7.0 3.6 7.0
5 22.65 4.6 5.4 0.4 7.4
6 46.13 7.4 5.8 4.5 7.1
7 23.18 2.8 6.6 3.0 6.4
# Specify the intercept model
intercept_model = lm(y ~ 1, data = df_data)
# Specify the guess model
guess_model = lm(y ~ x1 + x2,
data = df_data)
# Specify the full model
full_model = lm(y ~ ., data = df_data)
#or
full_model = lm(y ~ x1 + x2 + x3 + x4,
data = df_data)
Perform forward step-wise regression:
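The step() call producing this output is not shown here; a minimal sketch, assuming the intercept_model and full_model objects defined above (forward steps add variables, with the full model as the upper scope):

```r
# Forward stepwise selection starting from the intercept-only model;
# trace = 0 suppresses the step-by-step output
step(intercept_model, direction = 'forward',
     scope = formula(full_model), trace = 0)
```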
Call:
lm(formula = y ~ x1 + x3, data = df_data)
Coefficients:
(Intercept) x1 x3
2.964 4.004 3.003
Perform backward step-wise regression:
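A minimal sketch of the corresponding call, assuming the full_model object defined above (backward steps drop variables one at a time from the full model):

```r
# Backward stepwise selection starting from the full model
step(full_model, direction = 'backward', trace = 0)
```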
Call:
lm(formula = y ~ x1 + x3, data = df_data)
Coefficients:
(Intercept) x1 x3
2.964 4.004 3.003
Perform forward-backward step-wise regression:
step(guess_model, direction = 'both',
scope = list(lower = formula(intercept_model), upper = formula(full_model)),
trace = 0)
Call:
lm(formula = y ~ x1 + x3, data = df_data)
Coefficients:
(Intercept) x1 x3
2.964 4.004 3.003
To show the model iterations, set trace = 1. For BIC, set k = log(number of rows).
The results show that the algorithm only adds variables that reduce the criterion (BIC in the example below, although the output labels it AIC). If multiple variables would reduce the criterion, it picks the one that reduces it the most.
When no remaining variable can reduce the criterion, the algorithm stops. In this way, the final model is highly likely to be the combination of variables with the lowest value of the criterion (AIC or BIC).
# Perform forward step-wise regression with step shown
# BIC is used when we set k = log(nrow(df_data))
step(intercept_model, direction = 'forward',
scope = formula(full_model), trace = 1,
k = log(nrow(df_data)))
Start: AIC=30.95
y ~ 1
Df Sum of Sq RSS AIC
+ x1 1 253.444 187.55 26.909
+ x3 1 164.437 276.56 29.627
<none> 440.99 30.948
+ x2 1 34.025 406.97 32.332
+ x4 1 1.493 439.50 32.870
Step: AIC=26.91
y ~ x1
Df Sum of Sq RSS AIC
+ x3 1 187.536 0.012 -38.605
<none> 187.549 26.909
+ x2 1 43.612 143.936 27.002
+ x4 1 4.962 182.587 28.667
Step: AIC=-38.6
y ~ x1 + x3
Df Sum of Sq RSS AIC
<none> 0.012240 -38.605
+ x4 1 0.00116413 0.011076 -37.359
+ x2 1 0.00003159 0.012209 -36.677
Call:
lm(formula = y ~ x1 + x3, data = df_data)
Coefficients:
(Intercept) x1 x3
2.964 4.004 3.003
Argument | Usage |
object | The lm model object to start from. |
scope | The range of variables considered. This can be a single model formula or a list with the lower and upper model formulas. |
direction | Set to "both" for forward-backward steps, "backward", or "forward", with a default of "both". If scope is not set, the behavior defaults to backward selection. |
trace | Set to 1 to print the selection steps, 0 otherwise. |
k | The multiple of the number of degrees of freedom used for the penalty. For AIC (the default), k = 2; for BIC, k = log(number of rows). |
Consider the set of predictors \(x_1, x_2, \cdots, x_p\) to be used to predict \(y\). We can define the following models.
The model with all the available variables:
\[\widehat y = \widehat \beta_0 + \widehat \beta_1 x_1 + \widehat \beta_2 x_2 + \cdots + \widehat \beta_p x_p.\]
The model with no variables, in which the prediction is simply the sample mean, \(\bar y\):
\[\widehat y = \bar y.\]
A sub-model of the full model, containing a subset of its variables, which may serve as an initial guess before variable selection:
\[\widehat y = \widehat \beta_0 + \widehat \beta_1 x_1 + \widehat \beta_2 x_2.\]
Using the "USJudgeRatings" data from the "datasets" package, with 10 of its 43 rows shown below:
CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
AARONSON,L.H. 5.7 7.9 7.7 7.3 7.1 7.4 7.1 7.1 7.1 7.0 8.3 7.8
ARMENTANO,A.J. 7.2 8.1 7.8 7.8 7.5 7.6 7.5 7.5 7.3 7.4 7.9 7.8
BURNS,E.B. 6.2 8.8 8.7 8.5 7.9 8.0 8.1 8.0 8.0 8.0 8.6 8.6
HEALEY.A.H. 8.0 7.6 6.6 7.2 6.5 6.5 6.8 6.7 6.4 6.5 6.9 6.7
O'BRIEN,F.J. 7.1 8.5 8.3 8.0 7.9 7.9 7.8 7.8 7.8 7.7 8.3 8.2
O'SULLIVAN,T.J. 7.5 9.0 8.9 8.7 8.4 8.5 8.4 8.3 8.3 8.3 8.8 8.7
PASKEY,L. 7.5 8.1 7.7 8.2 8.0 8.1 8.2 8.4 8.0 8.1 8.4 8.1
SIDOR,W.J. 7.7 6.2 5.1 5.6 5.6 5.9 5.6 5.6 5.3 5.5 6.3 5.3
STAPLETON,J.F. 6.5 8.2 7.7 7.8 7.6 7.7 7.7 7.7 7.5 7.6 8.5 7.7
ZARRILLI,K.J. 8.6 7.4 7.0 7.5 7.5 7.7 7.4 7.2 6.9 7.0 7.8 7.1
NOTE: The dependent variable is "RTEN".
The data contain lawyers' ratings of US judges, with variable descriptions below.
Sign | Code | Variable Description |
\(y\) | RTEN | Worthy of retention |
\(x_1\) | CONT | Number of contacts of lawyer with judge |
\(x_2\) | INTG | Judicial integrity |
\(x_3\) | DMNR | Demeanor |
\(x_4\) | DILG | Diligence |
\(x_5\) | CFMG | Case flow managing |
\(x_6\) | DECI | Prompt decisions |
\(x_7\) | PREP | Preparation for trial |
\(x_8\) | FAMI | Familiarity with law |
\(x_9\) | ORAL | Sound oral rulings |
\(x_{10}\) | WRIT | Sound written rulings |
\(x_{11}\) | PHYS | Physical ability |
# Specify the intercept model
intercept_model = lm(RTEN ~ 1, data = USJudgeRatings)
# Specify the guess model
guess_model = lm(RTEN ~ DMNR + DILG + CFMG,
data = USJudgeRatings)
# Specify the full model
full_model = lm(RTEN ~ ., data = USJudgeRatings)
#or
full_model = lm(RTEN ~ CONT + INTG + DMNR +
DILG + CFMG + DECI + PREP +
FAMI + ORAL + WRIT + PHYS,
data = USJudgeRatings)
In the forward selection, the following happens:
We start with the intercept model or the guess model (the guess model's variables must be a subset of the variables in the upper scope).
Then, starting from that model and guided by the AIC or BIC criterion, we repeatedly add one variable at a time, potentially adding variables up to the full set available (the upper scope) if they all improve model quality.
Perform forward step-wise regression from intercept:
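The step() call producing this output is not shown here; a minimal sketch, assuming the intercept_model and full_model objects defined above for the USJudgeRatings data:

```r
# Forward stepwise selection from the intercept-only model,
# with the full model as the upper scope
step(intercept_model, direction = 'forward',
     scope = formula(full_model), trace = 0)
```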
Call:
lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings)
Coefficients:
(Intercept) ORAL DMNR PHYS INTG DECI
-2.2043 0.2917 0.1520 0.2829 0.3779 0.1667
The final selection includes only "ORAL", "DMNR", "PHYS", "INTG", and "DECI".
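The detailed summary below can be reproduced by storing the selected model and passing it to summary(); a sketch, assuming the models defined above:

```r
# Store the selected model, then inspect coefficients and fit statistics
selected_model = step(intercept_model, direction = 'forward',
                      scope = formula(full_model), trace = 0)
summary(selected_model)
```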
Call:
lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings)
Residuals:
Min 1Q Median 3Q Max
-0.240656 -0.069026 -0.009474 0.068961 0.246402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.20433 0.43611 -5.055 1.19e-05 ***
ORAL 0.29169 0.10191 2.862 0.006887 **
DMNR 0.15199 0.06354 2.392 0.021957 *
PHYS 0.28292 0.04678 6.048 5.40e-07 ***
INTG 0.37785 0.10559 3.579 0.000986 ***
DECI 0.16672 0.07702 2.165 0.036928 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1119 on 37 degrees of freedom
Multiple R-squared: 0.9909, Adjusted R-squared: 0.9897
F-statistic: 806.1 on 5 and 37 DF, p-value: < 2.2e-16
If we want our model to include a specific set of variables, we could start with those variables in a guess model. For example, for "DMNR", "DILG", and "CFMG", we could use the guess model above.
Perform forward step-wise regression from guess:
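A minimal sketch of the corresponding call, assuming the guess_model and full_model objects defined above. Because forward steps can only add variables, the guess model's variables remain in the final selection:

```r
# Forward stepwise selection starting from the guess model
step(guess_model, direction = 'forward',
     scope = formula(full_model), trace = 0)
```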
Call:
lm(formula = RTEN ~ DMNR + DILG + CFMG + PHYS + ORAL + INTG +
DECI, data = USJudgeRatings)
Coefficients:
(Intercept) DMNR DILG CFMG PHYS ORAL
-2.22619 0.15132 0.01817 -0.10734 0.29224 0.30146
INTG DECI
0.37467 0.24209
The final selection includes the guess variables "DMNR", "DILG", and "CFMG", along with the additional variables "PHYS", "ORAL", "INTG", and "DECI".
Notice that, unlike the selection that started from the intercept model, not all the variables in the summary below are significant: "DILG" and "CFMG" are not significant, which is a disadvantage of not starting from the intercept model.
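The summary below can be reproduced by storing the selection and calling summary(); a sketch, assuming the guess_model and full_model objects defined above:

```r
# Forward selection from the guess model, followed by a detailed summary
selected_model = step(guess_model, direction = 'forward',
                      scope = formula(full_model), trace = 0)
summary(selected_model)
```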
Call:
lm(formula = RTEN ~ DMNR + DILG + CFMG + PHYS + ORAL + INTG +
DECI, data = USJudgeRatings)
Residuals:
Min 1Q Median 3Q Max
-0.24638 -0.06583 -0.01054 0.06288 0.25587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.22619 0.45932 -4.847 2.55e-05 ***
DMNR 0.15132 0.06908 2.190 0.03524 *
DILG 0.01817 0.10274 0.177 0.86062
CFMG -0.10734 0.12109 -0.886 0.38142
PHYS 0.29224 0.05101 5.730 1.75e-06 ***
ORAL 0.30146 0.10845 2.780 0.00869 **
INTG 0.37467 0.11885 3.152 0.00331 **
DECI 0.24209 0.12758 1.898 0.06603 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1137 on 35 degrees of freedom
Multiple R-squared: 0.9911, Adjusted R-squared: 0.9893
F-statistic: 557.3 on 7 and 35 DF, p-value: < 2.2e-16
In the backward selection, the following happens:
We start with the full model, which contains every variable that could possibly appear in the final selection.
Then, starting from the full model and guided by the AIC or BIC criterion, we repeatedly drop one variable at a time. We may drop every variable and arrive at the intercept model if no variable improves model quality, or stop at a selection that retains specific variables if those variables are fixed in a lower-scope model formula. The lower scope must be a subset of the variables in the full model.
Perform backward step-wise regression from full model:
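A minimal sketch of the corresponding call, assuming the full_model object defined above:

```r
# Backward stepwise selection starting from the full model
step(full_model, direction = 'backward', trace = 0)
```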
Call:
lm(formula = RTEN ~ INTG + DMNR + DECI + ORAL + PHYS, data = USJudgeRatings)
Coefficients:
(Intercept) INTG DMNR DECI ORAL PHYS
-2.2043 0.3779 0.1520 0.1667 0.2917 0.2829
The final selection includes only "INTG", "DMNR", "DECI", "ORAL", and "PHYS". This is the same set as selected by the forward stepwise procedure above, just in a different order.
If we want our model to include a specific set of variables, we can fix those variables in a lower scope model. For example, to keep "DMNR", "DILG", and "CFMG", we could use the guess model above as the lower scope.
Perform backward step-wise regression with lower model as guess model:
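A minimal sketch of the corresponding call, assuming the guess_model and full_model objects defined above. Setting the lower scope to the guess model prevents its variables from ever being dropped:

```r
# Backward steps that never drop below the guess model's variables
step(full_model, direction = 'backward',
     scope = list(lower = formula(guess_model), upper = formula(full_model)),
     trace = 0)
```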
Call:
lm(formula = RTEN ~ INTG + DMNR + DILG + CFMG + DECI + ORAL +
PHYS, data = USJudgeRatings)
Coefficients:
(Intercept) INTG DMNR DILG CFMG DECI
-2.22619 0.37467 0.15132 0.01817 -0.10734 0.24209
ORAL PHYS
0.30146 0.29224
The final selection includes the guess variables "DMNR", "DILG", and "CFMG", along with the additional variables "INTG", "DECI", "ORAL", and "PHYS". This is the same set as selected by the forward stepwise procedure that started from the guess model, just in a different order.
Again, not all the variables in the selection are significant, compared with the selection that did not fix a lower scope: "DILG" and "CFMG" are not significant, which is a disadvantage of arbitrarily setting a lower scope.
In the forward-backward (or both) selection, the following happens:
We start with a guess model; this could be the intercept model, the full model, or any sub-model of the full model.
Then, starting from the guess model and guided by the AIC or BIC criterion, we repeatedly add or drop one variable at a time, adding variables beyond the guess model or dropping variables from it whenever doing so improves model quality. The lower scope must be a subset of the guess model, which in turn must be a subset of the full model.
Perform forward-backward step-wise regression starting from intercept model:
step(intercept_model, direction = 'both',
scope = list(lower = formula(intercept_model), upper = formula(full_model)),
trace = 0)
Call:
lm(formula = RTEN ~ ORAL + DMNR + PHYS + INTG + DECI, data = USJudgeRatings)
Coefficients:
(Intercept) ORAL DMNR PHYS INTG DECI
-2.2043 0.2917 0.1520 0.2829 0.3779 0.1667
Perform forward-backward step-wise regression starting from guess model:
step(guess_model, direction = 'both',
scope = list(lower = formula(intercept_model), upper = formula(full_model)),
trace = 0)
Call:
lm(formula = RTEN ~ DMNR + PHYS + ORAL + INTG + DECI, data = USJudgeRatings)
Coefficients:
(Intercept) DMNR PHYS ORAL INTG DECI
-2.2043 0.1520 0.2829 0.2917 0.3779 0.1667
Perform forward-backward step-wise regression starting from full model:
step(full_model, direction = 'both',
scope = list(lower = formula(intercept_model), upper = formula(full_model)),
trace = 0)
Call:
lm(formula = RTEN ~ INTG + DMNR + DECI + ORAL + PHYS, data = USJudgeRatings)
Coefficients:
(Intercept) INTG DMNR DECI ORAL PHYS
-2.2043 0.3779 0.1520 0.1667 0.2917 0.2829
The final selections are all the same, each including only "DMNR", "PHYS", "ORAL", "INTG", and "DECI". This is the same set as selected by the forward stepwise procedure above.
In the forward-backward stepwise selection, unlike the forward and backward stepwise selections, you cannot pre-specify variables that must be included in the final selection; you can only set the scope of the variables considered.
Copyright © 2020 - 2024. All Rights Reserved by Stats Codes