Here, we discuss linear regression in R and how to interpret its output, including the coefficients, r-squared, and p-values.

Linear regression (or ordinary least squares) in R can be performed with the lm() function from the "stats" package in the base version of R.

Linear regression can be used to study the linear relationship, if one exists, between a dependent variable \((y)\) and an independent variable \((x)\).

For multiple independent variables, see multiple regression.

The linear regression framework is based on the theoretical assumption that: \[y = \alpha + \beta x + \varepsilon,\]

where \(\varepsilon\) represents the error terms, which are assumed to 1) be independent, 2) be normally distributed, 3) have constant variance, and 4) have mean zero.
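The assumptions on the error terms cannot be observed directly, but they can be examined through the residuals of a fitted model. The following is a minimal sketch, assuming the small sample used in the example further below; the object names (fit, res, mean_res, normality) are illustrative, and the Shapiro-Wilk test is only one of several possible normality checks.

```r
# Small illustrative sample (same values as the sample steps further below)
x = c(2.7, 2.2, 3.6, 4.3, 2.6, 3.7, 3.8, 4.0)
y = c(6.9, 5.7, 7.9, 9.6, 5.1, 8.2, 8.6, 9.4)

fit = lm(y ~ x)
res = residuals(fit)

# With an intercept in the model, least squares residuals average
# to zero by construction (up to floating-point error)
mean_res = mean(res)
mean_res

# Shapiro-Wilk test on the residuals: a small p-value would
# suggest a departure from normality
normality = shapiro.test(res)
normality$p.value
```

For a graphical check, plot(fit, which = 1) draws residuals against fitted values (a funnel shape would suggest non-constant variance), and plot(fit, which = 2) draws a normal Q-Q plot of the residuals.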

The linear regression model estimates the true coefficient, \(\beta\), as \(\widehat \beta\), and the true intercept, \(\alpha\), as \(\widehat \alpha\).
Then for any \(x\) value, these two are used to predict or estimate the true \(y\), as \(\widehat y\), with the equation below:

\[\widehat y = \widehat \alpha + \widehat \beta x ,\] where, for \(n\) sample data pairs \(\{(x_i, y_i), i = 1, \ldots, n\}\) with sample means \(\bar{x}\) and \(\bar{y}\),

\[\begin{align} \widehat\beta &= \frac{ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) }{ \sum_{i=1}^n (x_i - \bar{x})^2 }, \\ \widehat\alpha & = \bar{y} - \widehat\beta\,\bar{x}. \end{align}\]
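The two formulas above can be evaluated directly and compared with the output of lm(). This is a small sketch, assuming the sample used in the example that follows; the names x_bar, y_bar, beta_hat and alpha_hat are illustrative.

```r
# Small illustrative sample
x = c(2.7, 2.2, 3.6, 4.3, 2.6, 3.7, 3.8, 4.0)
y = c(6.9, 5.7, 7.9, 9.6, 5.1, 8.2, 8.6, 9.4)

x_bar = mean(x)
y_bar = mean(y)

# Slope: sum of cross-deviations over sum of squared x-deviations
beta_hat = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: the fitted line passes through (x_bar, y_bar)
alpha_hat = y_bar - beta_hat * x_bar

c(alpha_hat = alpha_hat, beta_hat = beta_hat)  # about 0.7199 and 2.0684

# lm() returns the same estimates
coef(lm(y ~ x))
```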

Sample Steps to Run a Regression Model:

# Create the data samples for the regression model
# Values are paired based on matching position in each sample

y = c(6.9, 5.7, 7.9, 9.6, 5.1, 8.2, 8.6, 9.4)
x = c(2.7, 2.2, 3.6, 4.3, 2.6, 3.7, 3.8, 4.0)
df_data = data.frame(y, x)
df_data
    y   x
1 6.9 2.7
2 5.7 2.2
3 7.9 3.6
4 9.6 4.3
5 5.1 2.6
6 8.2 3.7
7 8.6 3.8
8 9.4 4.0

# Run the regression model and print the results

model = lm(y ~ x)
summary(model)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.99782 -0.19638  0.00295  0.41217  0.59533 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.7199     0.9392   0.767 0.472445    
x             2.0684     0.2733   7.568 0.000276 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5479 on 6 degrees of freedom
Multiple R-squared:  0.9052,    Adjusted R-squared:  0.8894 
F-statistic: 57.28 on 1 and 6 DF,  p-value: 0.0002765

# Alternatively, take the variables from the data frame;
# the estimates are the same

model = lm(y ~ x, data = df_data)

Table of Some Regression Model Arguments in R

| Argument | Usage |
|----------|-------|
| y ~ x | y is the dependent sample, and x is the independent sample |
| data | The dataframe object that contains the dependent and independent variables |
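
As a brief illustration of the two arguments, the sketch below fits the same model twice: once with the variables looked up in a dataframe through data, and once with the columns referenced directly in the formula. The object names model_df and model_cols are illustrative.

```r
df_data = data.frame(
  y = c(6.9, 5.7, 7.9, 9.6, 5.1, 8.2, 8.6, 9.4),
  x = c(2.7, 2.2, 3.6, 4.3, 2.6, 3.7, 3.8, 4.0)
)

# Formula names are looked up in the dataframe passed to `data`
model_df = lm(y ~ x, data = df_data)

# Same fit with the columns referenced directly in the formula
model_cols = lm(df_data$y ~ df_data$x)

# The estimates agree; only the coefficient labels differ
coef(model_df)
coef(model_cols)
```

Passing data is usually the more convenient form, since the coefficient labels stay short and predict() can later match new values by column name.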

Creating Regression Summary Object and Model Object:

# Create data
# No seed is set, so the output below will differ from run to run
x = rnorm(100, 20, 2)
y = 2 + 5*x + rnorm(100, 0, 1) 

# Create objects
reg_model = lm(y ~ x)
reg_summary = summary(reg_model)

# Extract a component from summary object
reg_summary$coefficients; reg_summary$coefficients[, 1]
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 2.298004 1.02674230  2.23815 2.747466e-02
x           4.986958 0.05188795 96.11012 8.741521e-99
(Intercept)           x 
   2.298004    4.986958 

# Extract a component from model object
reg_model$coefficients
(Intercept)           x 
   2.298004    4.986958 

There are more examples in the table below.

Table of Some Regression Summary and Model Object Outputs in R

| Regression Component | Usage |
|----------------------|-------|
| reg_summary$coefficients | The estimated intercept and beta values, with their standard errors, t-values and p-values |
| reg_summary$residuals | The regression model residuals |
| reg_summary$r.squared | The model r-squared value |
| reg_summary$adj.r.squared | The model adjusted r-squared value |
| reg_summary$fstatistic | The F-statistic and its degrees of freedom |
| reg_summary$sigma | The residual standard error of the model |
| reg_model$coefficients | The estimated intercept and beta values |
| reg_model$residuals | The regression model residuals |
| reg_model$fitted.values | The predicted y values |
| reg_model$df.residual | The degrees of freedom of the residuals |
| reg_model$model | The model dataframe |
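
Several of these components can be combined. The sketch below, assuming simulated data like the example above (the seed value is an arbitrary choice for reproducibility, and the names f, f_p_value and sigma_manual are illustrative), recovers the overall F-test p-value, which summary() prints but does not store, and rebuilds the residual standard error from the residuals.

```r
set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible
x = rnorm(100, 20, 2)
y = 2 + 5*x + rnorm(100, 0, 1)

reg_model = lm(y ~ x)
reg_summary = summary(reg_model)

# fstatistic is a named vector: value, numdf, dendf
f = reg_summary$fstatistic
f_p_value = pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Residual standard error rebuilt from residuals and their degrees of freedom
sigma_manual = sqrt(sum(reg_model$residuals^2) / reg_model$df.residual)
c(reg_summary$sigma, sigma_manual)

# Fitted values plus residuals reproduce the observed y
all.equal(unname(reg_model$fitted.values + reg_model$residuals), y)
```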

1 Steps to Running a Regression in R

Using the first 10 rows of the "faithful" data from the "datasets" package below:

data_f = faithful[1:10,]
data_f
   eruptions waiting
1      3.600      79
2      1.800      54
3      3.333      74
4      2.283      62
5      4.533      85
6      2.883      55
7      4.700      88
8      3.600      85
9      1.950      51
10     4.350      85

NOTE: The dependent variable is "waiting", and the independent variable is "eruptions".

1.1 Check for Linear Relationship

Use a scatter plot to visually check for a linear relationship.

data_f = faithful[1:10,]
x = data_f$eruptions
y = data_f$waiting
plot(x, y,
     main = "Scatter Plot with Regression and Lowess Lines in R")
abline(lm(y ~ x), col = "blue")
lines(lowess(x, y), col = "red", lty = "dashed")

[Figure: Scatter Plot with Regression and Lowess Lines in R]

The appearance of the scatter plot suggests a strong linear relationship.
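
The visual impression can be backed with a number. As a small sketch (the name r is illustrative), the sample Pearson correlation between the two variables measures the strength of the linear association; in simple linear regression its square equals the model r-squared.

```r
data_f = faithful[1:10,]

# Sample Pearson correlation between eruptions and waiting
r = cor(data_f$eruptions, data_f$waiting)
r    # about 0.93
r^2  # about 0.864
```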

1.2 Run the Regression

Run the linear regression model using the lm() function, and print the results using the summary() function.

data_f = faithful[1:10,]
eruptions = data_f$eruptions
waiting = data_f$waiting
model = lm(waiting ~ eruptions)
summary(model)

# or, equivalently, using the data argument

data_f = faithful[1:10,]
model = lm(waiting ~ eruptions, data = data_f)
summary(model)

Call:
lm(formula = waiting ~ eruptions, data = data_f)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.3298  -2.6033   0.6707   2.9552   9.3362 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   28.798      6.296   4.574  0.00182 ** 
eruptions     13.018      1.824   7.137 9.83e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.781 on 8 degrees of freedom
Multiple R-squared:  0.8643,    Adjusted R-squared:  0.8473 
F-statistic: 50.94 on 1 and 8 DF,  p-value: 9.832e-05

1.3 Interpretation of the Results

  • Coefficients:
    • The estimated intercept (\(\widehat \alpha\)) is \(\text{summary(model)\$coefficients[1, 1]}\) \(= 28.798\).
    • The estimated coefficient for \(x\) (\(\widehat \beta\)) is \(\text{summary(model)\$coefficients[2, 1]}\) \(= 13.018\). That is, each additional minute of eruption time is associated with an estimated increase of about \(13.018\) minutes in waiting time.
  • P-values:
    The level of significance is set at \(0.05\) (this level is conventionally also written \(\alpha\), not to be confused with the intercept \(\alpha\) above).
    • The p-value for the intercept is \(\text{summary(model)\$coefficients[1, 4]}\) \(= 0.0018165\). Since the p-value is less than \(0.05\), we say that the intercept is statistically significantly different from zero.
    • The p-value for the independent variable, \(x\) (eruptions), is \(\text{summary(model)\$coefficients[2, 4]}\) \(= 9.83\times 10^{-5}\). Since the p-value is less than \(0.05\), we say that the independent variable is a statistically significant predictor of the dependent variable \(y\) (waiting).
      If the p-value were higher than the chosen level of significance, we would conclude that the independent variable is NOT a statistically significant predictor of the dependent variable.
  • R-squared:
    The r-squared value is \(\text{summary(model)\$r.squared}\) \(= 0.864\).
    This means that the model, using the independent variable \(x\) (eruptions), explains \(86.4\%\) of the variation of the dependent variable \(y\) (waiting) about its sample mean, \(\bar y\).
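
To make the p-value and r-squared less of a black box, both can be recomputed from their definitions. The following is a sketch under the same model as above; the names coefs, t_slope, p_slope, sse, sst and r_squared are illustrative.

```r
data_f = faithful[1:10,]
model = lm(waiting ~ eruptions, data = data_f)
coefs = summary(model)$coefficients

# t-value: estimate divided by its standard error
t_slope = coefs["eruptions", "Estimate"] / coefs["eruptions", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 = 8 degrees of freedom
p_slope = 2 * pt(abs(t_slope), df = model$df.residual, lower.tail = FALSE)
p_slope  # matches coefs["eruptions", "Pr(>|t|)"]

# R-squared: one minus the ratio of residual to total sum of squares
sse = sum(residuals(model)^2)
sst = sum((data_f$waiting - mean(data_f$waiting))^2)
r_squared = 1 - sse / sst
r_squared  # about 0.864
```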

1.4 Prediction and Estimation

To predict or estimate \(y\) for any \(x\), we can use the fitted equation \(\widehat y = 28.798 + 13.018 x\).

For example, for \(x = 5\), \(\widehat y = 28.798+ 13.018 \times 5\) \(= 93.888\).
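
The same calculation can be done with predict(), which uses the unrounded coefficients. This is a minimal sketch, assuming the model fitted with the data argument so that new values can be matched by column name; the names new_x and y_hat are illustrative.

```r
data_f = faithful[1:10,]
model = lm(waiting ~ eruptions, data = data_f)

# New values must use the same column name as the predictor
new_x = data.frame(eruptions = 5)

y_hat = predict(model, newdata = new_x)
y_hat  # about 93.89

# A 95% prediction interval for a single new observation
predict(model, newdata = new_x, interval = "prediction", level = 0.95)
```

The small difference from \(93.888\) comes from rounding the coefficients to three decimal places in the hand calculation.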

