Here, we discuss generalized linear models (GLMs) in R with interpretations, covering the binomial, Gaussian, Poisson, and gamma families.
A generalized linear model can be fitted in R with the glm() function from the base "stats" package.
The generalized linear model can be used to study possibly non-linear relationships between a dependent variable \((y)\), with a specified distribution, and a set of independent variables \((X = x_1, x_2, \ldots, x_p)\).
The generalized linear model framework assumes that a link function \(g\) relates the expected value of \(y\) to a linear combination of the independent variables:
\[\begin{align} g\left(E(y|X)\right) & = g(\mu) \\ & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p. \end{align}\]
The generalized linear model estimates the true coefficients, \(\beta_1, \beta_2, \ldots, \beta_p\), as \(\widehat \beta_1, \widehat \beta_2, \ldots, \widehat \beta_p\), and the true intercept, \(\beta_0\), as \(\widehat \beta_0\).
For any values of \(x_1, x_2, \ldots, x_p\), these estimates are then used to predict or estimate the true \(\mu\) as \(\widehat \mu\), with the equation below:
\[\widehat \mu = g^{-1}\left( \widehat \beta_0 + \widehat \beta_1 x_1 + \widehat \beta_2 x_2 + \cdots + \widehat \beta_p x_p \right).\]
See also logistic regression and binomial family GLM.
# Create the data samples for the GLM
# Values are matched based on matching position in each sample
x1 = c(4.35, 6.18, 5.13, 5.80, 4.00, 4.15,
5.09, 6.27, 7.15, 3.14, 4.75, 6.01)
x2 = c(3.76, 2.86, 3.85, 2.42, 1.84, 1.55,
3.25, 3.87, 3.64, 2.16, 3.38, 4.03)
y = c(0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0)
glm_data = data.frame(y, x1, x2)
glm_data
y x1 x2
1 0 4.35 3.76
2 1 6.18 2.86
3 0 5.13 3.85
4 1 5.80 2.42
5 1 4.00 1.84
6 1 4.15 1.55
7 0 5.09 3.25
8 0 6.27 3.87
9 1 7.15 3.64
10 0 3.14 2.16
11 1 4.75 3.38
12 0 6.01 4.03
# Run the generalized linear model with family, link and variables specified
model = glm(y ~ x1 + x2, family = binomial(link = "logit"),
data = glm_data)
summary(model)
Call:
glm(formula = y ~ x1 + x2, family = binomial(link = "logit"),
data = glm_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.037 4.146 0.250 0.8026
x1 2.191 1.306 1.678 0.0934 .
x2 -3.999 2.169 -1.844 0.0652 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16.6355 on 11 degrees of freedom
Residual deviance: 7.6478 on 9 degrees of freedom
AIC: 13.648
Number of Fisher Scoring iterations: 6
The example above is of the binomial family, hence the parameter of interest is \(p\) (see below). It also uses the "logit" link, hence the parameter function is \(p = \frac{\exp(X\beta)}{1 + \exp(X\beta)}\).
These imply for any \(x_1\) and \(x_2\), the estimated \(p\) or probability of success is: \[\widehat p = \frac{\exp(1.037 + 2.191x_1 - 3.999x_2)}{1 + \exp(1.037 + 2.191x_1 - 3.999x_2)}\]
Also, the lower the AIC, the better the model fit.
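The estimated equation above can also be checked in code. The sketch below re-creates the small data set and model from above, applies the inverse-logit to the linear predictor by hand, and compares the result with `predict()` on the response scale.

```r
# Re-create the small example data set and model from above
x1 = c(4.35, 6.18, 5.13, 5.80, 4.00, 4.15,
       5.09, 6.27, 7.15, 3.14, 4.75, 6.01)
x2 = c(3.76, 2.86, 3.85, 2.42, 1.84, 1.55,
       3.25, 3.87, 3.64, 2.16, 3.38, 4.03)
y = c(0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0)
glm_data = data.frame(y, x1, x2)
model = glm(y ~ x1 + x2, family = binomial(link = "logit"),
            data = glm_data)

# Apply the inverse-logit to the linear predictor by hand
b = coef(model)
xb = b[1] + b[2]*glm_data$x1 + b[3]*glm_data$x2
p_manual = exp(xb) / (1 + exp(xb))

# This should match the built-in prediction on the response scale
all.equal(unname(p_manual), unname(predict(model, type = "response")))
```

Note that `predict(model, type = "response")` applies \(g^{-1}\) for you; `predict(model)` alone returns the linear predictor \(X\widehat\beta\).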
| Family | Links | Distribution | Dependent Variable |
|---|---|---|---|
| Binomial `binomial` Parameter: \(p\) | `logit` (default), `probit`, `cauchit`, `log`, `cloglog` | Bernoulli \((p)\): \(\{0, 1\}\) | 1's & 0's, or a factor with two levels: 'Yes' & 'No' |
| Binomial `binomial` Parameter: \(p\) | `logit` (default), `probit`, `cauchit`, `log`, `cloglog` | Binomial \((n, p)\): \(\{0, 1, \ldots, n\}\) | Two-column matrix with counts of 'successes' and counts of 'failures' |
| Gaussian `gaussian` Parameter: \(\mu\) | `identity` (default), `log`, `inverse` | Normal \((\mu, \sigma)\): \((-\infty, \infty)\) | Continuous variable with a normal/Gaussian distribution |
| Poisson `poisson` Parameter: \(\lambda\) | `log` (default), `identity`, `sqrt` | Poisson \((\lambda)\): \(\{0, 1, 2, \ldots\}\) | Counts of events in a fixed space or time period |
| Gamma `Gamma` Parameter (mean): \(\alpha/\beta\) | `inverse` (default), `identity`, `log` | Gamma \((\alpha, \beta)\): \((0, \infty)\) | Continuous variable with a gamma distribution |
| Gamma `Gamma` Parameter (mean): \(\alpha/\beta\) | `inverse` (default), `identity`, `log` | Exponential \((\lambda)\): \((0, \infty)\); Gamma \((1, \beta) \equiv\) Exponential \((\beta)\) | Continuous variable with an exponential distribution |
| Inverse Gaussian `inverse.gaussian` Parameter: \(\mu\) | `1/mu^2` (default), `inverse`, `identity`, `log` | Inverse Gaussian \((\mu, \lambda)\): \((0, \infty)\) | Continuous variable with an inverse Gaussian distribution |
In the table below, \(X\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\), and the parameter \(\mu\) or \(p\) is the expected value of \(y\) \((E(y|X))\).

| Link | Link Function | Parameter Function |
|---|---|---|
| Logit `logit` | \(\log\left(\frac{p}{1-p}\right) = X\beta\) | \(p = \frac{\exp(X\beta)}{1 + \exp(X\beta)}\) |
| Probit `probit` | \(\Phi^{-1}(p) = X\beta\) | \(p = \Phi(X\beta)\), where \(\Phi\) is the cumulative distribution function of the standard normal distribution |
| Cauchit `cauchit` | \(\tan\left(\pi\left(p-\frac{1}{2}\right)\right) = X\beta\) | \(p = \frac{1}{\pi}\arctan\left(X\beta\right) + \frac{1}{2}\); similar to the probit, but based on the cumulative distribution function of the standard Cauchy distribution |
| Complementary log-log `cloglog` | \(\log(-\log(1-p)) = X\beta\) | \(p = 1-\exp(-\exp(X\beta))\) |
| Identity `identity` | \(\mu = X\beta\) | \(\mu = X\beta\) |
| Log `log` | \(\log(\mu) = X\beta\) | \(\mu = \exp(X\beta)\) |
| Inverse `inverse` | \(\frac{1}{\mu} = X\beta\) | \(\mu = \frac{1}{X\beta}\) |
| Square root `sqrt` | \(\sqrt{\mu} = X\beta\) | \(\mu = (X\beta)^2\) |
| Inverse square `1/mu^2` | \(\frac{1}{\mu^2} = X\beta\) | \(\mu = \frac{1}{\sqrt{X\beta}}\) |
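The link and parameter (inverse-link) functions above are also available programmatically through `stats::make.link()`, which returns a `linkfun` and its inverse `linkinv`. A short sketch for the logit case:

```r
# make.link() returns the link function (linkfun) and its inverse (linkinv)
lg = make.link("logit")

p = 0.75
eta = lg$linkfun(p)   # the link function: log(p / (1 - p))

# Round-trip through the inverse link recovers p
lg$linkinv(eta)

# Check the table's logit formulas directly
all.equal(eta, log(p / (1 - p)))
all.equal(lg$linkinv(eta), exp(eta) / (1 + exp(eta)))
```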
| Argument | Usage |
|---|---|
| `y ~ x1 + x2 + … + xp` | `y` is the dependent variable, and `x1, x2, …, xp` are the independent variables |
| `y ~ ., data` | `y` is the dependent variable, and "`.`" means all other variables in `data` enter the model |
| `family` | The GLM family, based on the distribution of the dependent variable |
| `link` | The link function between the dependent variable and the independent variables |
| `data` | The data frame object that contains the dependent and independent variables |
| `start` | Optional starting values for the coefficients, to aid convergence |
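To illustrate the `y ~ .` shorthand, the sketch below (using simulated data) fits the same binomial model both ways and confirms the coefficient estimates agree:

```r
# With "y ~ .", every column of the data frame other than y enters the model
set.seed(1)
df = data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y = rbinom(50, size = 1, prob = plogis(0.5 + df$x1 - df$x2))

m_dot  = glm(y ~ ., family = binomial(), data = df)
m_full = glm(y ~ x1 + x2, family = binomial(), data = df)

# Both formulas specify the same model
all.equal(coef(m_dot), coef(m_full))
```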
set.seed(123)
n = 1000
# Simulate linear combination (mean 0)
x1 = rnorm(n, mean = 4, sd = 0.5)
x2 = rnorm(n, mean = 3, sd = 0.25)
xb = 3 + 3*x1 - 5*x2
# Simulate logit Bernoulli 1's and 0's
p = exp(xb)/(1+exp(xb))
y_bernoulli = rbinom(n, size = 1, prob = p)
# Run GLM for binomial family with logit
model = glm(y_bernoulli ~ x1 + x2,
family = binomial(link = "logit"))
summary(model)
Call:
glm(formula = y_bernoulli ~ x1 + x2, family = binomial(link = "logit"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.7353 1.1645 2.349 0.0188 *
x1 3.3301 0.2336 14.253 <2e-16 ***
x2 -5.3653 0.4286 -12.519 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1385.51 on 999 degrees of freedom
Residual deviance: 911.91 on 997 degrees of freedom
AIC: 917.91
Number of Fisher Scoring iterations: 5
Model to estimate \(\widehat p\): \[\widehat p = \frac{\exp(2.7353 + 3.3301 x_1 - 5.3653 x_2)}{1 + \exp(2.7353 + 3.3301 x_1 - 5.3653 x_2)}.\]
set.seed(100)
n = 1000
# Simulate linear combination (mean 0)
x1 = rnorm(n, mean = 4, sd = 0.025)
x2 = rnorm(n, mean = 3, sd = 0.035)
xb = 3 + 3*x1 - 5*x2
# Simulate probit binomial variables
p = pnorm(xb)
success = rbinom(n, size = 20, prob = p)
y_binomial = cbind(success, 20 - success)
# Run GLM for binomial family with probit
model = glm(y_binomial ~ x1 + x2,
family = binomial(link = "probit"))
summary(model)
Call:
glm(formula = y_binomial ~ x1 + x2, family = binomial(link = "probit"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.7426 1.5814 1.734 0.0829 .
x1 3.3704 0.3479 9.689 <2e-16 ***
x2 -5.4050 0.2635 -20.512 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1543.0 on 999 degrees of freedom
Residual deviance: 1030.3 on 997 degrees of freedom
AIC: 4427.6
Number of Fisher Scoring iterations: 3
Model to estimate \(\widehat p\): \[\widehat p = \Phi(2.7426 + 3.3704 x_1 - 5.4050 x_2).\]
set.seed(123)
n = 1000
# Simulate linear combination (must be positive)
x1 = rnorm(n, mean = 0.5, sd = 0.05)
x2 = rnorm(n, mean = 0.3, sd = 0.04)
xb = 2 + 3*x1 - 4*x2
# Simulate Poisson variables with exponent xb
lmb = exp(xb)
y_poisson = rpois(n, lambda = lmb)
# Run GLM for Poisson family with log link
model = glm(y_poisson ~ x1 + x2,
family = poisson(link = "log"))
summary(model)
Call:
glm(formula = y_poisson ~ x1 + x2, family = poisson(link = "log"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.8699 0.1220 15.33 <2e-16 ***
x1 3.1603 0.2026 15.60 <2e-16 ***
x2 -3.9220 0.2495 -15.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1427.1 on 999 degrees of freedom
Residual deviance: 975.9 on 997 degrees of freedom
AIC: 5051.8
Number of Fisher Scoring iterations: 4
Model to estimate \(\widehat \lambda\): \[\widehat \lambda = \exp(1.8699 + 3.1603 x_1 - 3.9220 x_2).\]
set.seed(123)
n = 1000
# Simulate linear combination (must be positive)
x1 = rnorm(n, mean = 2, sd = 0.1)
x2 = rnorm(n, mean = 3, sd = 0.1)
xb = 3 + 4*x1 - 3*x2
# Simulate Poisson variables with square xb
lmb = xb^2
y_poisson = rpois(n, lambda = lmb)
# Run GLM for Poisson family with square root link
model = glm(y_poisson ~ x1 + x2,
family = poisson(link = "sqrt"))
summary(model)
Call:
glm(formula = y_poisson ~ x1 + x2, family = poisson(link = "sqrt"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.2188 0.5477 5.877 4.18e-09 ***
x1 3.8461 0.1601 24.021 < 2e-16 ***
x2 -2.9787 0.1573 -18.940 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1874.2 on 999 degrees of freedom
Residual deviance: 1033.4 on 997 degrees of freedom
AIC: 4036.7
Number of Fisher Scoring iterations: 4
Model to estimate \(\widehat \lambda\): \[\widehat \lambda = (3.2188 + 3.8461 x_1 - 2.9787 x_2)^2.\]
set.seed(123)
n = 1000
# Simulate linear combination
# For the inverse link, keep xb bounded away from 0
x1 = rnorm(n, mean = 5, sd = 0.25)
x2 = rnorm(n, mean = 4, sd = 0.25)
xb = 3 + 2*x1 - 4*x2
# Simulate normal variables with inverse xb
mu = 1/xb
y_gaussian = rnorm(n, mu, 0.5)
# Run GLM for Gaussian family with inverse
model = glm(y_gaussian ~ x1 + x2,
family = gaussian(link = "inverse"))
summary(model)
Call:
glm(formula = y_gaussian ~ x1 + x2, family = gaussian(link = "inverse"))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.91729 0.14210 20.53 <2e-16 ***
x1 1.89521 0.06733 28.15 <2e-16 ***
x2 -3.81398 0.12456 -30.62 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.2393183)
Null deviance: 593.81 on 999 degrees of freedom
Residual deviance: 238.61 on 997 degrees of freedom
AIC: 1413
Number of Fisher Scoring iterations: 7
Model to estimate \(\widehat \mu\): \[\widehat \mu = \frac{1}{2.91729 + 1.89521 x_1 - 3.81398 x_2}.\]
set.seed(123)
n = 1000
# Simulate linear combination (positive values)
x1 = rnorm(n, mean = 5, sd = 0.25)
x2 = rnorm(n, mean = 4, sd = 0.25)
xb = 9 + 2*x1 - 4*x2
# Simulate normal variables with exponent xb
mu = exp(xb)
y = rnorm(n, mu, 0.5)
# Run GLM for Gaussian family with log link
model = glm(y ~ x1 + x2,
family = gaussian(link = "log"))
summary(model)
Call:
glm(formula = y ~ x1 + x2, family = gaussian(link = "log"))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.003332 0.006822 1320 <2e-16 ***
x1 1.999409 0.001186 1686 <2e-16 ***
x2 -4.000175 0.001396 -2865 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.2393446)
Null deviance: 2275692.30 on 999 degrees of freedom
Residual deviance: 238.63 on 997 degrees of freedom
AIC: 1413
Number of Fisher Scoring iterations: 3
Model to estimate \(\widehat \mu\): \[\widehat \mu = \exp(9.003332 + 1.999409 x_1 - 4.000175 x_2).\]
# Compare with OLS on the log scale
summary(lm(log(y) ~ x1 + x2))
Call:
lm(formula = log(y) ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-0.86680 -0.01391 -0.00124 0.01711 0.73651
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.959441 0.061884 144.8 <2e-16 ***
x1 2.017755 0.010145 198.9 <2e-16 ***
x2 -4.012752 0.009964 -402.7 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0792 on 997 degrees of freedom
Multiple R-squared: 0.9948, Adjusted R-squared: 0.9948
F-statistic: 9.465e+04 on 2 and 997 DF, p-value: < 2.2e-16
NOTE: As expected, the results are similar. However, for the same data, the OLS model should fit better on the log(y) scale (its own dependent variable), while the GLM should fit better on the level scale (its dependent variable).
set.seed(123)
n = 1000
# Simulate linear combination
x1 = rnorm(n, mean = 6, sd = 2)
x2 = rnorm(n, mean = 5, sd = 2)
xb = 5 + 4*x1 - 6*x2
# Simulate normal variable with identity
mu = xb
y = rnorm(n, mu, 0.5)
# Run GLM for Gaussian family with identity link
model = glm(y ~ x1 + x2,
family = gaussian(link = "identity"))
summary(model)
Call:
glm(formula = y ~ x1 + x2, family = gaussian(link = "identity"))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.987387 0.060707 82.16 <2e-16 ***
x1 3.994627 0.007836 509.79 <2e-16 ***
x2 -5.993123 0.007696 -778.70 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.2394945)
Null deviance: 192697.76 on 999 degrees of freedom
Residual deviance: 238.78 on 997 degrees of freedom
AIC: 1413.6
Number of Fisher Scoring iterations: 2
Model to estimate \(\widehat \mu\): \[\widehat \mu = 4.987387 + 3.994627 x_1 - 5.993123 x_2.\]
# Compare with OLS on the same data
summary(lm(y ~ x1 + x2))
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-1.41801 -0.31386 -0.01849 0.32692 1.68933
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.987387 0.060707 82.16 <2e-16 ***
x1 3.994627 0.007836 509.79 <2e-16 ***
x2 -5.993123 0.007696 -778.70 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4894 on 997 degrees of freedom
Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
F-statistic: 4.018e+05 on 2 and 997 DF, p-value: < 2.2e-16
NOTE: As expected, the OLS, and the GLM with the "identity" link produced the same result.
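This equivalence can be verified directly: a Gaussian GLM with the "identity" link and lm() solve the same least-squares problem, so their estimates coincide. A sketch with freshly simulated data:

```r
# Gaussian GLM with identity link versus OLS on the same data
set.seed(123)
n = 100
x1 = rnorm(n, mean = 6, sd = 2)
x2 = rnorm(n, mean = 5, sd = 2)
y = rnorm(n, mean = 5 + 4*x1 - 6*x2, sd = 0.5)

fit_glm = glm(y ~ x1 + x2, family = gaussian(link = "identity"))
fit_lm  = lm(y ~ x1 + x2)

# Coefficients and fitted values agree
all.equal(coef(fit_glm), coef(fit_lm))
all.equal(fitted(fit_glm), fitted(fit_lm))
```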
set.seed(123)
n = 1000
# Simulate linear combination
x1 = rnorm(n, mean = 10, sd = 1)
x2 = rnorm(n, mean = 8, sd = 1)
xb = 3 + 4*x1 - 5*x2
# Simulate gamma variables
average = exp(xb)
# average = shape/rate or rate = shape/average
y_gamma = rgamma(n, shape = 1, rate = 1/average)
# Run GLM for gamma family with log link
model = glm(y_gamma ~ x1 + x2,
family = Gamma(link = "log"))
summary(model)
Call:
glm(formula = y_gamma ~ x1 + x2, family = Gamma(link = "log"))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.91850 0.40803 9.603 <2e-16 ***
x1 3.94799 0.03331 118.507 <2e-16 ***
x2 -5.04674 0.03272 -154.235 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 1.082247)
Null deviance: 24128.1 on 999 degrees of freedom
Residual deviance: 1112.9 on 997 degrees of freedom
AIC: 7767
Number of Fisher Scoring iterations: 6
Model to estimate \(\text{mean}\): \[\text{mean} = \frac{\widehat \alpha}{\widehat \beta} = \exp \left( 3.91850 + 3.94799 x_1 - 5.04674 x_2 \right).\]
set.seed(123)
n = 1000
# Simulate linear combination
x1 = rnorm(n, mean = 6, sd = 0.5)
x2 = rnorm(n, mean = 4, sd = 0.5)
xb = 4 + 4*x1 - 5*x2
# Simulate exponential variables
average = exp(xb)
# rgamma(n, shape = 1, rate = 1/average) is equivalent to rexp(n, rate = 1/average)
# average = 1/rate or rate = 1/average
y_exp = rexp(n, rate = 1/average)
# Run GLM for gamma family (exponential variable) with log link
model = glm(y_exp ~ x1 + x2,
family = Gamma(link = "log"))
summary(model)
Call:
glm(formula = y_exp ~ x1 + x2, family = Gamma(link = "log"))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.47013 0.44869 9.963 <2e-16 ***
x1 3.98958 0.06487 61.497 <2e-16 ***
x2 -5.09976 0.06372 -80.035 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 1.026001)
Null deviance: 9488.1 on 999 degrees of freedom
Residual deviance: 1152.6 on 997 degrees of freedom
AIC: 17886
Number of Fisher Scoring iterations: 6
Model to estimate \(\text{mean}\): \[\text{mean} = \frac{1}{\widehat \lambda} = \exp \left(4.47013 + 3.98958 x_1 - 5.09976 x_2 \right).\]
# Create data
x1 = rnorm(1000, 4, 2)
x2 = rnorm(1000, 3, 1)
xb = 1 + 2*x1 - 3*x2
p = exp(xb)/(1+exp(xb))
y = rbinom(1000, 1, p)
# Create generalized linear model summary and model objects
reg_summary = summary(glm(y ~ x1 + x2, family = binomial()))
reg_model = glm(y ~ x1 + x2, family = binomial())
# Extract a component from GLM summary object
reg_summary$coefficients; reg_summary$coefficients[, 1]
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.143466 0.4118878 2.776158 5.500548e-03
x1 1.819446 0.1275497 14.264599 3.636756e-46
x2 -2.771271 0.2084446 -13.295001 2.474716e-40
(Intercept) x1 x2
1.143466 1.819446 -2.771271
(Intercept) x1 x2
1.143466 1.819446 -2.771271
| GLM Component | Usage |
|---|---|
| `reg_summary$coefficients` | The estimated intercept and beta values, with their standard errors, z-values and p-values |
| `reg_summary$aic` | The Akaike Information Criterion (AIC) value |
| `reg_summary$null.deviance` | The null model's deviance |
| `reg_summary$deviance` | The model's deviance |
| `reg_summary$df.residual` | The model's residual degrees of freedom |
| `reg_summary$df.null` | The null model's residual degrees of freedom |
| `reg_model$coefficients` | The estimated intercept and beta values |
| `reg_model$fitted.values` | The predicted y values (on the response scale) |
| `reg_model$linear.predictors` | The predicted linear model values \((X\widehat\beta)\) |
| `reg_model$residuals` | The working residuals from the linear model |
| `reg_model$aic` | The Akaike Information Criterion (AIC) value |
| `reg_model$null.deviance` | The null model's deviance |
| `reg_model$deviance` | The model's deviance |
| `reg_model$df.residual` | The model's residual degrees of freedom |
| `reg_model$df.null` | The null model's residual degrees of freedom |
| `reg_model$model` | The model data frame |
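The `$` components above can also be read through R's generic accessor functions (`coef()`, `fitted()`, `AIC()`, `deviance()`), which is often the more idiomatic style. One caveat worth noting: `residuals()` on a GLM defaults to deviance residuals, not the working residuals stored in `$residuals`. A quick sketch:

```r
# Small simulated logistic-regression model
set.seed(42)
x = rnorm(100)
y = rbinom(100, size = 1, prob = plogis(0.5 + x))
m = glm(y ~ x, family = binomial())

# Generic accessors mirror the $ components above
all.equal(coef(m), m$coefficients)
all.equal(fitted(m), m$fitted.values)
all.equal(AIC(m), m$aic)
all.equal(deviance(m), m$deviance)

# residuals(m) returns deviance residuals by default;
# residuals(m, type = "working") matches m$residuals
all.equal(residuals(m, type = "working"), m$residuals)
```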
Copyright © 2020 - 2024. All Rights Reserved by Stats Codes