Linear Regression with R
Expectations
In the following exercises, you will be asked to study the relationship between a continuous response variable and one or more predictors. In doing so, remember to:
- perform model diagnosis
  - including visualization tools
  - including multicollinearity assessment
- perform informed model selection
- comment on each result of every analysis you run with R
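As a starting point for the diagnosis step, the sketch below fits a simple linear model and produces the standard residual diagnostics in base R. The data are simulated for illustration only (the data frame `d` and its coefficients are hypothetical, not one of the exercise data sets), and the choice of checks is a suggestion rather than a required workflow.

```r
# Synthetic data for illustration only (not one of the exercise data sets).
set.seed(1)
d <- data.frame(x = runif(50, 0, 10))
d$y <- 2 + 0.5 * d$x + rnorm(50, sd = 1)

fit <- lm(y ~ x, data = d)
summary(fit)                  # coefficients, R^2, overall F test

# Four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Numeric complements to the plots.
shapiro.test(residuals(fit))  # normality of the residuals
max(cooks.distance(fit))      # largest influence of a single observation
```

Each piece of output should then be commented: what the plot or test checks, and what it suggests for the model at hand.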
Exercise 1
Analysis of the `production` data set, which is composed of the following variables:
Variable name | Description |
---|---|
x | Number of produced pieces |
y | Production cost |
Study the relationship between `x` and `y`.
Exercise 2
Analysis of the `brain` data set, which is composed of the following variables:
Variable name | Description |
---|---|
body_weight | Body weight in kg |
brain_weight | Brain weight in kg |
Study the relationship between body and brain weights, to establish how the variable `brain_weight` changes with the variable `body_weight`.
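When both variables span several orders of magnitude, a model on the log-log scale is one option worth comparing with the model on the raw scale. The sketch below illustrates the comparison on simulated data: `brain_demo` and its power-law parameters are hypothetical stand-ins that only reuse the column names above. Whether a transformation is appropriate for the real data is for the diagnostics to decide.

```r
# Synthetic stand-in for the brain data set (hypothetical values).
set.seed(2)
brain_demo <- data.frame(body_weight = exp(runif(40, log(0.01), log(5000))))
brain_demo$brain_weight <- 0.01 * brain_demo$body_weight^0.75 *
  exp(rnorm(40, sd = 0.3))

fit_raw <- lm(brain_weight ~ body_weight, data = brain_demo)
fit_log <- lm(log(brain_weight) ~ log(body_weight), data = brain_demo)

# Compare residual patterns on the two scales.
op <- par(mfrow = c(1, 2))
plot(fitted(fit_raw), residuals(fit_raw), main = "Raw scale")
plot(fitted(fit_log), residuals(fit_log), main = "Log-log scale")
par(op)

coef(fit_log)  # the slope estimates the exponent of a power law
```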
Exercise 3
Analysis of the `anscombe` data set, which is composed of the following variables:
Variable name | Description |
---|---|
x1 | Predictor to be used for explaining y1 |
x2 | Predictor to be used for explaining y2 |
x3 | Predictor to be used for explaining y3 |
x4 | Predictor to be used for explaining y4 |
y1 | Response to be explained by x1 |
y2 | Response to be explained by x2 |
y3 | Response to be explained by x3 |
y4 | Response to be explained by x4 |
Study the relationship between each \(y_i\) and the corresponding \(x_i\).
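The four regressions can be fitted in a loop and their numerical summaries compared with the scatter plots. The sketch below assumes that the exercise data set matches the `anscombe` data frame shipped with R's `datasets` package, which uses the same column names; if your copy differs, load it instead.

```r
data(anscombe)  # built-in; assumed to match the exercise data set

# One simple regression per (x_i, y_i) pair.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Intercept, slope and R^2 of the four fits, one row per pair.
t(sapply(fits, function(f) {
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    r2        = summary(f)$r.squared)
}))

# Scatter plots with the fitted lines.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits[[i]])
}
par(op)
```

Reading the table and the plots together shows why numerical summaries alone are not enough to judge a fit.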
Exercise 4
Analysis of the `cement` data set, which contains the following variables:
Variable name | Description |
---|---|
aluminium | Percentage of \(\mathrm{Ca}_3 \mathrm{Al}_2 \mathrm{O}_6\) |
silicate | Percentage of \(\mathrm{C}_2 \mathrm{S}\) |
aluminium_ferrite | Percentage of \(4\,\mathrm{CaO} \cdot \mathrm{Al}_2 \mathrm{O}_3 \cdot \mathrm{Fe}_2 \mathrm{O}_3\) |
silicate_bic | Percentage of \(\mathrm{C}_3 \mathrm{S}\) |
hardness | Hardness of the cement obtained by mixing the above four components |
Study, using a multiple linear regression model, how the variable `hardness` depends on the four predictors.
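Since the predictors are percentages of the same mix, multicollinearity is worth checking. The sketch below computes variance inflation factors by hand in base R, using \(\mathrm{VIF}_j = 1 / (1 - R_j^2)\). The data frame `cement_demo` is simulated and purely hypothetical; only the column names come from the table above. The `car` package provides a `vif()` function that can be used instead if it is installed.

```r
# Synthetic stand-in for the cement data (hypothetical values).
set.seed(4)
n <- 30
cement_demo <- data.frame(
  aluminium         = runif(n, 1, 20),
  silicate          = runif(n, 25, 50),
  aluminium_ferrite = runif(n, 4, 20)
)
# The fourth component roughly completes the mix, which induces collinearity.
cement_demo$silicate_bic <- 100 - rowSums(cement_demo) + rnorm(n, sd = 2)
cement_demo$hardness <- 60 + 1.5 * cement_demo$aluminium +
  0.5 * cement_demo$silicate + rnorm(n, sd = 2)

fit <- lm(hardness ~ aluminium + silicate + aluminium_ferrite + silicate_bic,
          data = cement_demo)
summary(fit)

# VIF of predictor j: 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors.
predictors <- c("aluminium", "silicate", "aluminium_ferrite", "silicate_bic")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p),
                   data = cement_demo))$r.squared
  1 / (1 - r2)
})
vif
cor(cement_demo[predictors])  # pairwise correlations as a first look
```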
Exercise 5
Analysis of the `job` data set, which contains the following variables:
Variable name | Description |
---|---|
average_score | Average score obtained by the employee in the test |
years_service | Number of years of service |
sex | Male or female |
We want to see whether the sex of the person can be used, in addition to the years of service, to predict the average test score with a linear model. Estimate a linear regression of `average_score` on `years_service`, taking the categorical variable `sex` into account.
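One way to frame the question is as a comparison of nested models: a common line, parallel lines (sex shifts the intercept), and separate lines (sex also changes the slope). The sketch below shows this on simulated data; `job_demo`, its level labels and its coefficients are hypothetical, with only the column names taken from the table above.

```r
# Synthetic stand-in for the job data (hypothetical values).
set.seed(5)
n <- 40
job_demo <- data.frame(
  years_service = runif(n, 1, 20),
  sex = factor(rep(c("female", "male"), each = n / 2))
)
job_demo$average_score <- 5 + 0.2 * job_demo$years_service +
  ifelse(job_demo$sex == "male", -1, 0) + rnorm(n, sd = 0.5)

fit_common   <- lm(average_score ~ years_service, data = job_demo)
fit_parallel <- lm(average_score ~ years_service + sex, data = job_demo)
fit_interact <- lm(average_score ~ years_service * sex, data = job_demo)

# Nested F tests: does sex shift the intercept, and does it change the slope?
anova(fit_common, fit_parallel, fit_interact)

summary(fit_parallel)  # the dummy coefficient is relative to the first level
```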
Exercise 6
Analysis of the `cars` data set, which contains the following variables:
Variable name | Description |
---|---|
speed | Speed of the car before starting braking |
dist | Distance travelled by the car during the braking period until it completely stops |
Verify whether the distance travelled during braking depends on the initial speed of the car:
- Choose the best model to explain the distance as a function of the speed,
- Predict the braking distance for an initial speed of 25 km/h, using a point estimate and a prediction interval.
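The two steps can be sketched as follows. The code uses the `cars` data frame shipped with R, which has the same column names but records speed in mph and distance in feet; this is an assumption, so check the units of the data set you are given and express the new speed accordingly. The linear and quadratic candidates are only examples of models to compare.

```r
data(cars)  # built-in; units there are mph and feet (assumption: same columns)

fit_lin  <- lm(dist ~ speed, data = cars)
fit_quad <- lm(dist ~ speed + I(speed^2), data = cars)
anova(fit_lin, fit_quad)           # F test for the quadratic term
AIC(fit_lin, fit_quad)

# Point estimate and 95% prediction interval for a new observation.
new_obs <- data.frame(speed = 25)  # expressed in the units of the data
pred <- predict(fit_lin, newdata = new_obs,
                interval = "prediction", level = 0.95)
pred
```

A prediction interval (`interval = "prediction"`) covers a single future braking distance, so it is wider than the confidence interval for the mean response (`interval = "confidence"`).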
Exercise 7
Analysis of the `mussels` data set, which contains the following variables:
Variable name | Description |
---|---|
length | Length of a mussel (mm) |
width | Width of a mussel (mm) |
height | Height of a mussel (mm) |
size | Mass of a mussel (g) |
weight | Weight of the edible part of a mussel (g) |
We want to study how the weight of the edible part of a mussel varies as a function of the other four variables, using a multiple linear regression.
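For the model selection step, one possible approach is backward elimination by AIC with `step()`, starting from the model with all four predictors. The sketch below runs it on simulated data: `mussels_demo` and the relationships built into it are hypothetical, and only the column names follow the table above. An automatic search is a complement to, not a replacement for, the diagnostics and the multicollinearity check.

```r
# Synthetic stand-in for the mussels data (hypothetical values).
set.seed(7)
n <- 60
length_mm <- runif(n, 150, 300)
mussels_demo <- data.frame(
  length = length_mm,
  width  = 0.2 * length_mm + rnorm(n, sd = 3),
  height = 0.45 * length_mm + rnorm(n, sd = 5)
)
mussels_demo$size   <- 1.2 * mussels_demo$length - 100 + rnorm(n, sd = 20)
mussels_demo$weight <- 0.1 * mussels_demo$size + 0.05 * mussels_demo$width +
  rnorm(n, sd = 3)

full <- lm(weight ~ length + width + height + size, data = mussels_demo)

# Backward elimination by AIC, starting from the full model.
sel <- step(full, direction = "backward", trace = 0)
formula(sel)
summary(sel)
```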