In the following exercises, you will be asked to study the relationship of a continuous response variable and one or more predictors. In doing so, remember to:
- perform model diagnostics:
  - including visualization tools;
  - including multicollinearity assessment;
- perform informed model selection;
- comment on each result of an analysis you run with R.
Exercise 1
Analysis of the production data set which is composed of the following variables:
| Variable name | Description |
|---|---|
| x | Number of produced pieces |
| y | Production cost |
Study the relationship between x and y.
Solution
The first thing to do is to inspect a very brief summary of the data mainly to get an idea of
- the sample size;
- the presence of missing values;
- the number of categorical and continuous variables.

The skimr package provides the skim() function for this purpose:
skimr::skim(production)
Data summary

| | |
|---|---|
| Name | production |
| Number of rows | 10 |
| Number of columns | 2 |
| Column type frequency: numeric | 2 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| x | 0 | 1 | 1702.8 | 671.30 | 852 | 1213.50 | 1534.0 | 2161.25 | 3050 | ▇▃▂▃▂ |
| y | 0 | 1 | 18806.3 | 5693.01 | 11314 | 14839.75 | 16860.5 | 23158.75 | 29349 | ▇▇▁▇▂ |
We do not have missing values but only 10 observations.
Next, we can visualize the variation of y in terms of x to get an idea about the kind of relationship that we should model, should there be any:
production |>
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
The graph suggests that a linear relationship seems to hold.
Now, we can get to the modeling part:
mod <- lm(y ~ x, data = production)
dt_glance(mod)
The model exhibits an adjusted \(R^2\) of 0.9657089, which is very high, meaning that most of the variability in y is explained by the modeled relationship with x. Moreover, the p-value of the overall significance test of the regression is \(2.39 \times 10^{-7}\), which is very low, indicating that the model is statistically significant.
dt_tidy(mod)
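Note that dt_glance() and dt_tidy() appear to be course-specific helpers; assuming they are thin wrappers around the broom package, equivalent output can be obtained directly with:

library(broom)
glance(mod)  # model-level statistics (R squared, F statistic, p-value, AIC, ...)
tidy(mod)    # coefficient-level estimates, standard errors, t values and p-values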
Looking at the table of estimated coefficients, we read a coefficient of 8.3503139 for the predictor x, which means that a positive relationship exists between x and y: an increase in x implies an increase in y. Moreover, the p-value for the significance of that coefficient is \(2.39 \times 10^{-7}\), which is very low, meaning that the coefficient is statistically different from zero and that x does have a significant impact on y.
All these conclusions are valid only if the estimated model satisfies the assumptions behind simple linear regression. We can produce a number of diagnostic plots to help us investigate this:
library(ggfortify)
ggplot2::autoplot(mod)
The Residuals vs Fitted plot does show some structure, while we ideally want to see no pattern. This should be weighed against the low sample size and the presence of one or two visible outliers (obs. 9 and 10), which seem to drive the visible pattern. These two observations have high residuals and a marked departure from normality, which means that they tend to be outliers. However, the Residuals vs Leverage plot reveals that they have little influence on the estimated coefficients. Hence, overall, we cannot say that the estimated model violates critical assumptions, and the conclusions drawn from it above remain valid.
At this point, we can generate a publication-ready summary table of the model:
jtools::summ(mod)
| | |
|---|---|
| Observations | 10 |
| Dependent variable | y |
| Type | OLS linear regression |
| F(1,8) | 254.46 |
| R² | 0.97 |
| Adj. R² | 0.97 |

| | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | 4587.39 | 951.67 | 4.82 | 0.00 |
| x | 8.35 | 0.52 | 15.95 | 0.00 |

Standard errors: OLS
We can also provide a nice visualization of the estimated coefficients, along with the uncertainty about their estimation, in the form of confidence intervals:
jtools::plot_summs(mod, inner_ci_level = 0.90)
Exercise 2
Analysis of the brain data set which is composed of the following variables:
| Variable name | Description |
|---|---|
| body_weight | Body weight in kg |
| brain_weight | Brain weight in kg |
Study the relationship between body and brain weights, to establish how the variable brain_weight changes with the variable body_weight.
Solution
The first thing to do is to inspect a very brief summary of the data mainly to get an idea of
- the sample size;
- the presence of missing values;
- the number of categorical and continuous variables.

The skimr package provides the skim() function for this purpose:
skimr::skim(brain)
Data summary

| | |
|---|---|
| Name | brain |
| Number of rows | 62 |
| Number of columns | 2 |
| Column type frequency: numeric | 2 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| body_weight | 0 | 1 | 198.79 | 899.18 | 0.00 | 0.60 | 3.34 | 48.2 | 6654.18 | ▇▁▁▁▁ |
| brain_weight | 0 | 1 | 283.14 | 930.28 | 0.14 | 4.25 | 17.25 | 166.0 | 5711.86 | ▇▁▁▁▁ |
We do not have missing values and the sample size is \(62\), which is reasonable should we resort to asymptotic results such as the CLT.
Next, we can visualize the variation of brain_weight in terms of body_weight to get an idea about the kind of relationship that we should model, should there be any:
brain |>
  ggplot(aes(x = body_weight, y = brain_weight)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
The relationship between brain_weight and body_weight does not seem to be linear. This however does not mean that we cannot use a linear model. Remember that the linear model is called this way because it is linear in the coefficients, not in the predictors!
Here, we observe that
- both variables are positive;
- most of the mass is concentrated in the bottom-left part of the plot;
- the rest of the points seem to follow a logarithmic trend.
This suggests that a log-log transformation might help. This can be achieved in two ways: one is to actually compute the logarithm of the variables via log() and visualize the newly created variables (see the sketch below); the other is to specify the log() transformation directly in the model formula.
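A minimal sketch of the first approach, assuming ggplot2 is loaded as in the previous plots:

# Visualize the log-transformed variables to check whether the relationship
# becomes linear on the log-log scale.
brain |>
  ggplot(aes(x = log(body_weight), y = log(brain_weight))) +
  geom_point() +
  geom_smooth(method = "lm")

If the log-log scatterplot looks approximately linear, we can take the second route and specify log() directly in the model formula: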
mod <- lm(log(brain_weight) ~ log(body_weight), data = brain)
dt_glance(mod)
We can now inspect the estimated coefficients and comment on the table, carry out model diagnostics, and report the summary table and confidence intervals for the estimated coefficients, as done in the previous exercise.
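A sketch of what these remaining steps could look like, reusing the same tools as in Exercise 1 (dt_tidy() is assumed to be the same helper used there):

library(ggfortify)
dt_tidy(mod)                                    # estimated coefficients
ggplot2::autoplot(mod)                          # residual diagnostic plots
jtools::summ(mod)                               # publication-ready summary table
jtools::plot_summs(mod, inner_ci_level = 0.90)  # coefficient estimates with CIs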
Exercise 3
Analysis of the anscombe data set which is composed of the following variables:
| Variable name | Description |
|---|---|
| x1 | Predictor to be used for explaining y1 |
| x2 | Predictor to be used for explaining y2 |
| x3 | Predictor to be used for explaining y3 |
| x4 | Predictor to be used for explaining y4 |
| y1 | Response to be explained by x1 |
| y2 | Response to be explained by x2 |
| y3 | Response to be explained by x3 |
| y4 | Response to be explained by x4 |
Study the relationship between each \(y_i\) and the corresponding \(x_i\).
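As a starting point, the same workflow as in Exercise 1 can be applied to each of the four pairs; a minimal sketch, assuming the column names of the built-in anscombe data set:

# Fit y_i ~ x_i for each of the four pairs and compare the fitted models.
models <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})
lapply(models, summary)

Plotting each pair, as done in Exercise 1, is an essential part of the study.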
Exercise 4
Analysis of the cement data set, which contains the following variables:
| Variable name | Description |
|---|---|
| aluminium | Percentage of \(\mathrm{Ca}_3 \mathrm{Al}_2 \mathrm{O}_6\) |
| silicate | Percentage of \(\mathrm{C}_2 \mathrm{S}\) |
| aluminium_ferrite | Percentage of \(4 \mathrm{CaO} \cdot \mathrm{Al}_2 \mathrm{O}_3 \cdot \mathrm{Fe}_2 \mathrm{O}_3\) |
| silicate_bic | Percentage of \(\mathrm{C}_3 \mathrm{S}\) |
| hardness | Hardness of the cement obtained by mixing the above four components |
Study, using a multiple linear regression model, how the variable hardness depends on the four predictors.
Solution
We can take a look at the data summary:
skimr::skim(cement)
Data summary

| | |
|---|---|
| Name | cement |
| Number of rows | 13 |
| Number of columns | 5 |
| Column type frequency: numeric | 5 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| aluminium | 0 | 1 | 7.46 | 5.88 | 1.0 | 2.0 | 7.0 | 11.0 | 21.0 | ▇▃▇▁▂ |
| silicate | 0 | 1 | 46.62 | 18.03 | 16.0 | 31.0 | 52.0 | 56.0 | 71.0 | ▃▃▃▇▆ |
| aluminium_ferrite | 0 | 1 | 11.77 | 6.41 | 4.0 | 8.0 | 9.0 | 17.0 | 23.0 | ▅▇▂▃▃ |
| silicate_bic | 0 | 1 | 30.00 | 16.74 | 6.0 | 20.0 | 26.0 | 44.0 | 60.0 | ▆▇▃▃▃ |
| hardness | 0 | 1 | 95.42 | 15.04 | 72.5 | 83.8 | 95.9 | 109.2 | 115.9 | ▆▃▃▃▇ |
This shows that we have a rather low sample size (13) but no missing values.
The novelty in this exercise is that we move to multiple linear regression, in the sense that we will have more than one continuous predictor. Here the skim() function reveals that we have 4 continuous variables available as potential predictors. The next thing to do is then to visualize how the variable that we want to predict (hardness) varies with each of the 4 potential predictors:
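One way to produce these four scatterplots at once is to reshape the data to long format and facet by predictor; a minimal sketch, assuming tidyr and ggplot2 are available:

library(tidyr)
library(ggplot2)
# Put the four predictors in long format and plot hardness against each of them.
cement |>
  pivot_longer(-hardness, names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = value, y = hardness)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ predictor, scales = "free_x")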
For all potential predictors, a linear relationship with the outcome hardness seems appropriate.
Now, before moving to the modeling part, when it comes to integrating several predictors in a model, it is important to address the problem of collinearity and multicollinearity.
We can look at potential collinearities by visualizing the correlation matrix between variables:
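A minimal sketch of one way to do this (the choice of corrplot is an assumption; printing round(cor(cement), 2) conveys the same information):

library(corrplot)
# Display the pairwise correlations as numbers to spot values close to +/- 1.
corrplot(cor(cement), method = "number")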
In this matrix, correlation values very close to 1 in absolute value are the ones that might create problems by making \(\mathbb{X}^\top \mathbb{X}\) ill-conditioned. Here we notice such a high correlation (-0.98) between silicate and silicate_bic. It might therefore not be a good idea to include them both as predictors in a model.
We can test this assertion by first estimating the model with all 4 predictors and looking at the estimated coefficients and their standard errors, and then doing the same thing with a model without, say, silicate_bic.
mod1 <- lm(hardness ~ ., data = cement)
dt_tidy(mod1)
mod2 <- lm(hardness ~ . - silicate_bic, data = cement)
dt_tidy(mod2)
We observe that:
- the coefficient for silicate in model 1 is negative, while previous plots suggest that the covariation between hardness and silicate should be positive;
- the standard error of this same coefficient is huge in comparison to its value in model 2.

These two observations confirm that we should not include both silicate and silicate_bic as predictors in the same model.
Now we can focus on model 2 and assess whether we still have multicollinearity issues. This can be inspected by computing the VIF of each predictor:
jtools::summ(mod2, vifs = TRUE)
| | |
|---|---|
| Observations | 13 |
| Dependent variable | hardness |
| Type | OLS linear regression |
| F(3,9) | 90.25 |
| R² | 0.97 |
| Adj. R² | 0.96 |

| | Est. | S.E. | t val. | p | VIF |
|---|---|---|---|---|---|
| (Intercept) | 58.83 | 4.96 | 11.87 | 0.00 | NA |
| aluminium | 1.40 | 0.28 | 4.95 | 0.00 | 3.42 |
| silicate | 0.57 | 0.05 | 10.84 | 0.00 | 1.11 |
| aluminium_ferrite | -0.03 | 0.25 | -0.13 | 0.90 | 3.24 |

Standard errors: OLS
Here, all 3 VIFs are relatively small (lower than 5) which suggests that multicollinearity should not be an issue.
Lastly, we can perform a model selection step by stepwise backward elimination, starting with the model with all 3 predictors:
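The corresponding call is presumably along these lines (the object name mod3 matches how the selected model is used below):

# Stepwise backward elimination on the 3-predictor model, minimizing the AIC.
mod3 <- step(mod2, direction = "backward")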
The model selection suggests that, on the basis of the AIC, we can also get rid of the aluminium_ferrite predictor, which seems to have a small, non-significant influence on hardness.
We now have a good candidate model that we need to inspect, diagnose and report on, as done in Exercise 1.
dt_glance(mod3)
# diagnostic
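# (Assumption: the "diagnostic" placeholder above stands for the same
#  ggfortify-based residual plots used in Exercise 1.)
library(ggfortify)
ggplot2::autoplot(mod3)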
jtools::summ(mod3, vifs = TRUE)
| | |
|---|---|
| Observations | 13 |
| Dependent variable | hardness |
| Type | OLS linear regression |
| F(2,10) | 150.13 |
| R² | 0.97 |
| Adj. R² | 0.96 |

| | Est. | S.E. | t val. | p | VIF |
|---|---|---|---|---|---|
| (Intercept) | 58.28 | 2.42 | 24.10 | 0.00 | NA |
| aluminium | 1.43 | 0.15 | 9.54 | 0.00 | 1.07 |
| silicate | 0.57 | 0.05 | 11.60 | 0.00 | 1.07 |

Standard errors: OLS
jtools::plot_summs(mod3, inner_ci_level = 0.90)
In particular, when reporting on the model, we can add effect plots to focus on the effect of each individual predictor while correcting the observed data points to account for the impact of the other predictors:
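A sketch of how such effect plots can be obtained with jtools (the use of effect_plot() with partial residuals is an assumption, not necessarily the original approach):

# Effect of each predictor in mod3, with partial residuals shown as points.
jtools::effect_plot(mod3, pred = aluminium, plot.points = TRUE, partial.residuals = TRUE)
jtools::effect_plot(mod3, pred = silicate, plot.points = TRUE, partial.residuals = TRUE)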
Exercise 5
Analysis of the job data set, which contains the following variables:
| Variable name | Description |
|---|---|
| average_score | Average score obtained by the employee in the test |
| years_service | Number of years of service |
| sex | Male or female |
We want to see if it is possible to use the sex of the person in addition to the years of service to predict, with a linear model, the average score obtained in the test. Estimate a linear regression of average_score vs. years_service, considering the categorical variable sex.
Solution
skimr::skim(job)
Data summary

| | |
|---|---|
| Name | job |
| Number of rows | 20 |
| Number of columns | 3 |
| Column type frequency: character | 1 |
| Column type frequency: numeric | 2 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| average_score | 0 | 1 | 6.69 | 1.86 | 3.4 | 5.23 | 6.95 | 8.03 | 9.8 | ▆▃▇▇▆ |
| years_service | 0 | 1 | 9.80 | 8.30 | 1.0 | 4.00 | 7.00 | 12.00 | 30.0 | ▇▇▁▁▂ |
job |>
  ggplot(aes(x = years_service, y = average_score, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
mod <- lm(average_score ~ years_service * sex, data = job)
dt_glance(mod)
dt_tidy(mod)
Model selection by stepwise backward elimination minimizing AIC:
mod2 <- step(mod, direction = "backward")
Start: AIC=-7.41
average_score ~ years_service * sex
Df Sum of Sq RSS AIC
<none> 9.2538 -7.4140
- years_service:sex 1 1.2506 10.5044 -6.8788
It does not suggest removing the varying slopes. But this is mainly due to the small sample size, as the AIC is known to be biased in such cases. One could use the corrected AIC instead:
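A minimal sketch of that comparison; the aicc() helper below is hypothetical (not from the original document) and implements the standard small-sample correction \(AIC_c = AIC + \frac{2k(k+1)}{n - k - 1}\):

# Corrected AIC for a fitted lm: k counts all estimated parameters
# (coefficients plus the residual variance), n is the sample size.
aicc <- function(model) {
  k <- length(coef(model)) + 1
  n <- nobs(model)
  AIC(model) + 2 * k * (k + 1) / (n - k - 1)
}
aicc(mod)                                     # model with the interaction
aicc(update(mod, . ~ . - years_service:sex))  # model without varying slopes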