The goal is to propose and estimate a model for explaining a continuous response variable \(Y_i\) from a number of fixed predictors \(x_{i1}, \dots, x_{ik}\).
Mathematically, the model reads: \[ Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i, \quad \mbox{with} \quad \mathbb{E}[\varepsilon_i] = 0 \quad \mbox{and} \quad \mathbb{V}\mbox{ar}[\varepsilon_i] = \sigma^2. \]
We can summarize the assumptions on which the linear regression model relies as follows:
\[ \mbox{SSD}(\boldsymbol{\beta}, \sigma^2; \mathbf{y}, \mathbb{X}) := (\mathbf{y} - \mathbb{X} \boldsymbol{\beta})^\top (\mathbf{y} - \mathbb{X} \boldsymbol{\beta}) = \sum_{i = 1}^n (y_i - (\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}))^2 .\]
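As a small illustration, the sketch below evaluates this sum of squared deviations for two candidate values of \(\boldsymbol{\beta}\) on simulated data (everything here is made up for illustration):

```r
set.seed(123)
n  <- 20
x1 <- runif(n)
y  <- 2 + 3 * x1 + rnorm(n, sd = 0.5)   # true beta = (2, 3)
X  <- cbind(1, x1)                      # design matrix with an intercept column

# Sum of squared deviations for a candidate beta
ssd <- function(beta, y, X) sum((y - X %*% beta)^2)

ssd(c(2, 3), y, X)   # near the truth: small SSD
ssd(c(0, 0), y, X)   # far from the truth: much larger SSD
```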
A categorical variable with two categories (`sex` for instance) can be transformed into a single numeric variable `sex == "male"`, which evaluates to \(1\) if `sex` is `"male"` and to \(0\) if `sex` is `"female"` (remember that logicals are numbers in R).
A categorical variable with three categories (`origin`, which stores the acronym of the New York airport of departure in the `flights` data set) can be converted into two numeric variables as shown in the table below:
| `origin` | `originJFK` | `originLGA` |
|---|---|---|
| EWR | 0 | 0 |
| JFK | 1 | 0 |
| LGA | 0 | 1 |
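A minimal sketch of how R performs this encoding in practice; the tiny `sex` and `origin` vectors below are made up for illustration:

```r
# Hypothetical toy data illustrating dummy coding
df <- data.frame(
  sex    = c("male", "female", "male"),
  origin = c("EWR", "JFK", "LGA")
)

# A logical comparison is silently coerced to 0/1 when used as a number
as.numeric(df$sex == "male")

# model.matrix() shows the numeric design matrix R builds from categorical variables
model.matrix(~ sex + origin, data = df)
```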
\[ \widehat{\boldsymbol{\beta}} := (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y}, \]
\[ \widehat{\mathbf{Y}} = \mathbb{X} \widehat{\boldsymbol{\beta}} = \mathbb{X} (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y} = \mathbb{H} \mathbf{Y}. \]
\[ \widehat{\sigma^2} := \frac{(\mathbf{Y} - \widehat{\mathbf{Y}})^\top (\mathbf{Y} - \widehat{\mathbf{Y}})}{n - k - 1}. \]
\[ \mathbb{H} = \mathbb{X} (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top. \]
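To make these formulas concrete, here is a minimal sketch computing \(\widehat{\boldsymbol{\beta}}\), the fitted values, \(\widehat{\sigma^2}\) and the hat matrix by hand and checking them against `lm()`; the simulated data are purely illustrative:

```r
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)   # design matrix with an intercept column
k <- ncol(X) - 1        # number of predictors

beta_hat   <- solve(t(X) %*% X, t(X) %*% y)      # (X'X)^{-1} X'y
H          <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
y_hat      <- H %*% y                            # fitted values
sigma2_hat <- sum((y - y_hat)^2) / (n - k - 1)   # unbiased variance estimate

fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), unname(coef(fit)))   # should return TRUE
all.equal(sigma2_hat, summary(fit)$sigma^2)          # should return TRUE
```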
Compute a confidence interval for the mean response \(\mathbf{x}_i^\top \boldsymbol{\beta}\) at an already observed point \(\mathbf{x}_i\), \(i=1,\dots,n\).
\[ \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}} \sim \mathcal{N} \left( \mathbf{x}_i^\top \boldsymbol{\beta}, \sigma^2 \mathbf{x}_i^\top (\mathbb{X}^\top \mathbb{X})^{-1} \mathbf{x}_i \right), \]
We do not know \(\sigma^2\) but we have that: \[ \frac{(n-k-1) \, \widehat{\sigma^2}}{\sigma^2} \sim \chi^2(n-k-1), \] and \(\widehat{\boldsymbol{\beta}}\) and \(\widehat{\sigma^2}\) are statistically independent.
A pivotal statistic for a parameter \(\theta\) is a random variable \(T(\theta)\) whose distribution is fully known, i.e. does not depend on any unknown parameter.
When the parameter is the mean response \(\mathbf{x}_i^\top \boldsymbol{\beta}\) at \(\mathbf{x}_i\), we can use the following pivotal statistic:
\[ \frac{\mathbf{x}_i^\top \boldsymbol{\beta} - \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}}{\sqrt{\widehat{\sigma^2} \mathbf{x}_i^\top (\mathbb{X}^\top \mathbb{X})^{-1} \mathbf{x}_i}} \sim \mathcal{S}tudent(n-k-1). \]
We can express a \(100(1-\alpha)\%\) confidence interval for the mean response \(\mathbf{x}_i^\top \boldsymbol{\beta}\) at \(\mathbf{x}_i\) as: \[ \mathbb{P} \left( \mathbf{x}_i^\top \boldsymbol{\beta} \in \left[ \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}} \pm t_{1-\frac{\alpha}{2}}(n-k-1) \sqrt{\widehat{\sigma^2} \mathbf{x}_i^\top (\mathbb{X}^\top \mathbb{X})^{-1} \mathbf{x}_i} \right] \right) = 1 - \alpha. \]
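To make the construction concrete, a minimal sketch computing this interval by hand on the built-in `mtcars` data; the model itself is purely illustrative:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

x_i   <- model.matrix(fit)[1, , drop = FALSE]  # x_i^T for the first observation
alpha <- 0.05
dfree <- fit$df.residual                       # n - k - 1

mean_i <- drop(x_i %*% coef(fit))                    # x_i^T beta_hat
se_i   <- sqrt(drop(x_i %*% vcov(fit) %*% t(x_i)))   # sqrt(sigma2_hat * x_i^T (X^T X)^{-1} x_i)

mean_i + c(-1, 1) * qt(1 - alpha / 2, dfree) * se_i  # 95% CI for the mean response
```

The same interval can be obtained directly with `predict()`, as illustrated after the prediction-interval derivation below.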
Compute a prediction interval for a new not yet observed response \(Y_{n+1}\) at a new observed point \(\mathbf{x}_{n+1}\): \[ Y_{n+1} = \mathbf{x}_{n+1}^\top \boldsymbol{\beta} + \varepsilon_{n+1} \sim \mathcal{N} \left( \mathbf{x}_{n+1}^\top \boldsymbol{\beta}, \sigma^2 \right), \]
\[ \widehat{Y_{n+1}} = \mathbf{x}_{n+1}^\top \widehat{\boldsymbol{\beta}} \sim \mathcal{N} \left( \mathbf{x}_{n+1}^\top \boldsymbol{\beta}, \sigma^2 \mathbf{x}_{n+1}^\top (\mathbb{X}^\top \mathbb{X})^{-1} \mathbf{x}_{n+1} \right), \]
We do not know \(\sigma^2\) but we have that: \[ \frac{(n-k-1) \, \widehat{\sigma^2}}{\sigma^2} \sim \chi^2(n-k-1), \] and \(\widehat{\boldsymbol{\beta}}\) and \(\widehat{\sigma^2}\) are statistically independent.
We can now define a random variable of known distribution whose only unknown (once the \(n\)-sample is observed) is \(Y_{n+1}\): \[ \frac{Y_{n+1} - \widehat{Y_{n+1}}}{\sqrt{\widehat{\sigma^2} \left( 1 + \mathbf{x}_{n+1}^\top (\mathbb{X}^\top \mathbb{X})^{-1} \mathbf{x}_{n+1} \right)}} \sim \mathcal{S}tudent(n-k-1). \]
We can express a \(100(1-\alpha)\%\) prediction interval for the new not yet observed response \(Y_{n+1}\) at \(\mathbf{x}_{n+1}\) as: \[ \mathbb{P} \left( Y_{n+1} \in \left[ \widehat{Y_{n+1}} \pm t_{1-\frac{\alpha}{2}}(n-k-1) \sqrt{\widehat{\sigma^2} \left( 1 + \mathbf{x}_{n+1}^\top (\mathbb{X}^\top \mathbb{X})^{-1} \mathbf{x}_{n+1} \right)} \right] \right) = 1 - \alpha. \]
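In practice, both intervals are readily available through `predict()`; a minimal sketch, again on `mtcars`, with a made-up new point:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new car at which we want to predict fuel consumption
new_point <- data.frame(wt = 3.0, hp = 150)

# Confidence interval for the mean response at the new point
predict(fit, newdata = new_point, interval = "confidence", level = 0.95)

# Prediction interval for a new, not yet observed response at the same point
predict(fit, newdata = new_point, interval = "prediction", level = 0.95)
```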
The first and most important thing you should do before any modeling attempt is to look at your data.
As you can see, both data sets seem to generate the same linear regression predictions, although we can already clearly see which one will go sideways…
In R, linear models are fitted with `lm()`. The `broom` package then provides three functions to explore a fitted `lm()` object:

- the `broom::glance()` help page will provide you with a short summary of the information retrieved by `glance()`;
- the `broom::tidy()` help page will provide you with a short summary of the information retrieved by `tidy()`;
- the `broom::augment()` help page will provide you with a short summary of the information retrieved by `augment()`.
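A minimal sketch of this workflow, with a purely illustrative model on the built-in `mtcars` data:

```r
library(broom)

fit <- lm(mpg ~ wt + hp, data = mtcars)

glance(fit)   # one-row, model-level summary: R^2, adjusted R^2, sigma, AIC, ...
tidy(fit)     # one row per coefficient: estimate, std.error, statistic, p.value
augment(fit)  # original data augmented with fitted values, residuals, leverage, Cook's distance, ...
```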
You must always remember that, even though R facilitates computations, it does not verify for you that the assumptions required by the model are met by your data!
This is where model diagnosis comes into play. You can almost entirely diagnose your model graphically. Using the grammar of graphics in `ggplot2`, this can be achieved with the `ggfortify` package.
Linear regression makes several assumptions about the data, such as:

- Linearity of the data: the relationship between the predictor (x) and the outcome (y) is assumed to be linear.
- Normality of residuals: the residual errors are assumed to be normally distributed.
- Homogeneity of residual variance: the residuals are assumed to have a constant variance (homoscedasticity).
- Independence of residual error terms.
You should check whether or not these assumptions hold true. Potential problems include:
- Non-linearity of the outcome-predictor relationships.
- Heteroscedasticity: non-constant variance of error terms.
- Presence of influential values in the data, which can be:
  - outliers: extreme values in the outcome (y) variable;
  - high-leverage points: extreme values in the predictor (x) variables.
All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing the residual errors.
Each diagnostic plot targets one of these assumptions or problems:

- Residuals vs fitted: to check for non-linearity not accounted for; a horizontal line, without distinct patterns, is an indication of a linear relationship, which is good.
- Normal Q-Q: to check the normality of residuals; good if residuals follow the straight dashed line.
- Scale-location: to check the homoscedasticity assumption (constant variance); a horizontal line with equally spread points is a good indication of homoscedasticity.
- Residuals vs leverage: to check the zero-mean assumption on the residuals and to spot potential outliers in the top and bottom right corners.
- Cook's distance: to spot potential outliers as observations with the largest Cook's distance.
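A minimal sketch producing these diagnostic plots with `ggfortify` (the model is illustrative; `autoplot()` accepts a `which` argument analogous to that of `plot.lm()`):

```r
library(ggplot2)
library(ggfortify)

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals vs fitted, normal Q-Q, scale-location, Cook's distance,
# residuals vs leverage and Cook's distance vs leverage
autoplot(fit, which = 1:6, ncol = 3)
```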
The left-hand side of `~` must contain the response variable, as named in the atomic vector storing its values or in the data set containing it.

The right-hand side specifies the predictors and will therefore be a combination of potentially transformed variables from your data set (or from atomic vectors defined in your R environment). To get the syntax of the right-hand side of the formula right, you should know the set of operators that have a specific interpretation by R when used within a formula:
- `+` for adding terms;
- `-` for removing terms;
- `:` for adding interaction terms only;
- `*` for crossing variables, i.e. adding variables and their interactions;
- `%in%` for nesting variables, i.e. adding one variable and its interaction with another;
- `^` for limiting variable crossing to a specified order;
- `.` for adding all other variables in the data set that have not yet been included in the model.

Linear regression assumes that the response variable (`y`) and at least one predictor (`x`) are continuous variables. The following table summarizes the effect of operators when adding a variable `z` to the predictors. When `z` is categorical, we assume that it can take \(h\) unique possible values \(z_0, z_1, \dots, z_{h-1}\).
| Type of `z` | Formula | Model |
|---|---|---|
| Continuous | `y ~ x + z` | \(Y = \beta_0 + \beta_1 x + \beta_2 z + \varepsilon\) |
| Continuous | `y ~ x : z` | \(Y = \beta_0 + \beta_1 x z + \varepsilon\) |
| Continuous | `y ~ x * z` | \(Y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 x z + \varepsilon\) |
| Categorical | `y ~ x + z` | \(Y = \beta_0^{(0)} + \sum_{\ell = 1}^{h-1} \beta_\ell^{(0)} \delta_{\{z = z_\ell\}} + \beta_1^{(1)} x + \varepsilon\) |
| Categorical | `y ~ x : z` | \(Y = \beta_0^{(0)} + \beta_1^{(1)} x + \sum_{\ell = 1}^{h-1} \beta_\ell^{(1)} x \delta_{\{z = z_\ell\}} + \varepsilon\) |
| Categorical | `y ~ x * z` | \(Y = \beta_0^{(0)} + \sum_{\ell = 1}^{h-1} \beta_\ell^{(0)} \delta_{\{z = z_\ell\}} + \beta_0^{(1)} x + \sum_{\ell = 1}^{h-1} \beta_\ell^{(1)} x \delta_{\{z = z_\ell\}} + \varepsilon\) |
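To see how these formulas are used in practice, here are a few illustrative `lm()` calls on `mtcars`, where `hp` plays the role of a continuous `z` and `factor(cyl)` that of a categorical one:

```r
# Continuous z: main effects, interaction only, full crossing
lm(mpg ~ wt + hp,  data = mtcars)
lm(mpg ~ wt:hp,    data = mtcars)
lm(mpg ~ wt * hp,  data = mtcars)   # equivalent to mpg ~ wt + hp + wt:hp

# Categorical z: R automatically expands the factor into dummy variables
lm(mpg ~ wt * factor(cyl), data = mtcars)
```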
I()
What stands out from the previous table is that multiplying variables within a formula does not simply add the product of the predictors to the model: it also adds the predictors themselves.
What if you want to actually perform an arithmetic operation on your predictors to include a transformed predictor in your model? For example, you might want to include both \(x\) and \(x^2\) in your model. You have two options:
- You compute all of the (possibly transformed) predictors you want to include in your model beforehand and store them in the data set (remember `dplyr::mutate()`, which will help you achieve this goal easily); or,
- You use the as-is operator `I()`. In this case, you instruct `lm()` that the operations declared within `I()` must be performed outside of the formula environment and before generating the design matrix for the model.
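A minimal sketch of the two options for including both \(x\) and \(x^2\), using `wt` from `mtcars` as an illustrative predictor:

```r
library(dplyr)

# Option 1: compute the transformed predictor beforehand with mutate()
mtcars2 <- mtcars |> mutate(wt2 = wt^2)
lm(mpg ~ wt + wt2, data = mtcars2)

# Option 2: use the as-is operator I() directly in the formula
lm(mpg ~ wt + I(wt^2), data = mtcars)
```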
There are two possible software suites:

- the `jtools` and `interactions` packages;
- the `ggeffects` and `sjPlot` packages.

They both provide tools for summarizing and visualising models, marginal effects, interactions and model predictions.
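As an illustration, a summary such as the one below is produced by `jtools::summ()`. The call sketched here is a hedged guess at the model behind that output and assumes a `movies` data frame with the displayed variables (`metascore`, `imdb_rating`, `us_gross`, `genre5`):

```r
library(jtools)

# Hypothetical model on a movie data set with the variables shown in the output below
fit <- lm(metascore ~ imdb_rating + log(us_gross) + genre5, data = movies)
summ(fit)
```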
| Model info | |
|---|---|
| Observations | 831 (10 missing obs. deleted) |
| Dependent variable | metascore |
| Type | OLS linear regression |
| F(6,824) | 169.37 |
| R² | 0.55 |
| Adj. R² | 0.55 |

| Term | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | -39.96 | 5.92 | -6.75 | 0.00 |
| imdb_rating | 12.80 | 0.49 | 25.89 | 0.00 |
| log(us_gross) | 0.47 | 0.31 | 1.52 | 0.13 |
| genre5Comedy | 6.32 | 1.06 | 5.95 | 0.00 |
| genre5Drama | 7.66 | 1.08 | 7.12 | 0.00 |
| genre5Horror/Thriller | -0.73 | 1.51 | -0.48 | 0.63 |
| genre5Other | 5.86 | 3.25 | 1.80 | 0.07 |

Standard errors: OLS
This is handled by the `interactions` package.
- Johnson-Neyman plot: `sim_slopes()`
- Two-way interaction with at least one continuous predictor: `interact_plot()`
- Two-way interaction between categorical predictors: `cat_plot()`
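A minimal sketch of these three functions on illustrative models fitted to `mtcars` (argument names follow the `interactions` package documentation):

```r
library(interactions)

mt <- transform(mtcars, cyl = factor(cyl), am = factor(am))

fit_num <- lm(mpg ~ hp * wt,  data = mt)   # continuous x continuous interaction
fit_cat <- lm(mpg ~ cyl * am, data = mt)   # categorical x categorical interaction

# Simple slopes analysis with a Johnson-Neyman plot
sim_slopes(fit_num, pred = hp, modx = wt, jnplot = TRUE)

# Two-way interaction with at least one continuous predictor
interact_plot(fit_num, pred = hp, modx = wt)

# Two-way interaction between categorical predictors
cat_plot(fit_cat, pred = cyl, modx = am)
```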
(Multi)collinearity refers to high correlation among two or more independent variables in the regression model.
Multicollinearity can arise from poorly designed experiments (data-based multicollinearity) or from creating new independent variables related to existing ones (structural multicollinearity).
Correlation plots are a great tool to explore collinearity between two variables. They can be seamlessly computed and visualized using packages such as `corrr` or `ggcorrplot`.
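A minimal sketch with `corrr`, using the numeric columns of `mtcars` for illustration:

```r
library(corrr)

mtcars |>
  correlate() |>   # pairwise correlations as a data frame
  rplot()          # plot of the correlation matrix
```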
The Variance inflation factor (VIF) measures the degree of multicollinearity or collinearity in the regression model.
\[ \mathrm{VIF}_i = \frac{1}{1 - R_i^2}, \]
where \(R_i^2\) is the squared multiple correlation coefficient (i.e. the \(R^2\)) of the regression of \(X_i\) against all the remaining independent variables.
You can add VIFs to the summary of a model as follows:
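The output below can be reproduced with a call of the following form; the model is a hedged reconstruction based on the terms displayed, using the `mpg` data shipped with `ggplot2`:

```r
library(jtools)
library(ggplot2)  # provides the mpg data set

# Model matching the terms displayed below (reconstructed for illustration)
fit <- lm(cty ~ displ + year + cyl + class + fl, data = mpg)
summ(fit, vifs = TRUE)
```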
| Model info | |
|---|---|
| Observations | 234 |
| Dependent variable | cty |
| Type | OLS linear regression |
| F(13,220) | 100.37 |
| R² | 0.86 |
| Adj. R² | 0.85 |

| Term | Est. | S.E. | t val. | p | VIF |
|---|---|---|---|---|---|
| (Intercept) | -199.79 | 51.14 | -3.91 | 0.00 | NA |
| displ | -1.02 | 0.29 | -3.53 | 0.00 | 11.76 |
| year | 0.12 | 0.03 | 4.52 | 0.00 | 1.11 |
| cyl | -0.85 | 0.20 | -4.26 | 0.00 | 8.69 |
| classcompact | -2.81 | 1.00 | -2.83 | 0.01 | 3.69 |
| classmidsize | -2.95 | 0.97 | -3.05 | 0.00 | 3.69 |
| classminivan | -5.08 | 1.05 | -4.82 | 0.00 | 3.69 |
| classpickup | -5.89 | 0.92 | -6.37 | 0.00 | 3.69 |
| classsubcompact | -2.63 | 1.01 | -2.61 | 0.01 | 3.69 |
| classsuv | -5.59 | 0.88 | -6.33 | 0.00 | 3.69 |
| fld | 5.93 | 1.85 | 3.21 | 0.00 | 1.60 |
| fle | -5.13 | 1.81 | -2.84 | 0.00 | 1.60 |
| flp | -2.97 | 1.72 | -1.73 | 0.08 | 1.60 |
| flr | -1.69 | 1.70 | -0.99 | 0.32 | 1.60 |

Standard errors: OLS
VIFs are always at least equal to 1.
In some domains, a VIF over 2 is worthy of suspicion; others set the bar higher, at 5 or 10; others still will say you shouldn't pay attention to these at all. Ultimately, the main thing to consider is that small effects are more likely to be "drowned out" by higher VIFs, but this may just be a natural, unavoidable fact with your model (e.g., there is no problem with high VIFs when you have an interaction effect).
Should you care?
If you are interested in causal inference, yes;
otherwise, no.
The `stats::step()` function uses Akaike's information criterion (AIC) and allows you to perform forward selection (`direction = "forward"`) or backward elimination (`direction = "backward"`).
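A minimal sketch, with a purely illustrative full model on `mtcars`:

```r
full_model <- lm(mpg ~ ., data = mtcars)
null_model <- lm(mpg ~ 1, data = mtcars)

# Backward elimination starting from the full model
step(full_model, direction = "backward")

# Forward selection starting from the intercept-only model
step(null_model, direction = "forward", scope = formula(full_model))
```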