PCA - Solutions

The temperature.csv data set

This data set contains:

  • monthly temperature recordings for European capitals in a given year;
  • GPS coordinates of each city;
  • thermal amplitude: difference between maximal and minimal temperature;
  • annual average temperature;
  • European area: South, North, West or East.

Perform a PCA to uncover temperature patterns and identify which cities follow them.

Data import

Let us begin by importing the data. After a quick inspection of the raw data file in a text editor, we find that:

  • the data is separated by semicolons;
  • the first column contains the city names, which have trailing whitespace.

We can use the readr::read_delim() function to import the data as follows:

temperature <- readr::read_delim(
  "12-PCA-Data/temperature.csv", 
  delim = ";", 
  trim_ws = TRUE,
  show_col_types = FALSE
)
New names:
• `` -> `...1`
temperature
# A tibble: 35 × 18
   ...1  January February March April   May  June  July August September October
   <chr>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>   <dbl>
 1 Amst…     2.9      2.5   5.7   8.2  12.5  14.8  17.1   17.1      14.5    11.4
 2 Athe…     9.1      9.7  11.7  15.4  20.1  24.5  27.4   27.2      23.8    19.2
 3 Berl…    -0.2      0.1   4.4   8.2  13.8  16    18.3   18        14.4    10  
 4 Brus…     3.3      3.3   6.7   8.9  12.8  15.6  17.8   17.8      15      11.1
 5 Buda…    -1.1      0.8   5.5  11.6  17    20.2  22     21.3      16.9    11.3
 6 Cope…    -0.4     -0.4   1.3   5.8  11.1  15.4  17.1   16.6      13.3     8.8
 7 Dubl…     4.8      5     5.9   7.8  10.4  13.3  15     14.6      12.7     9.7
 8 Elsi…    -5.8     -6.2  -2.7   3.1  10.2  14    17.2   14.9       9.7     5.2
 9 Kiev     -5.9     -5    -0.3   7.4  14.3  17.8  19.4   18.5      13.7     7.5
10 Krak…    -3.7     -2     1.9   7.9  13.2  16.9  18.4   17.6      13.7     8.6
# ℹ 25 more rows
# ℹ 7 more variables: November <dbl>, December <dbl>, Annual <dbl>,
#   Amplitude <dbl>, Latitude <dbl>, Longitude <dbl>, Area <chr>

We notice that the first variable has no name. We can rename it using the dplyr::rename() function as follows:

temperature <- dplyr::rename(temperature, city = ...1)
temperature
# A tibble: 35 × 18
   city  January February March April   May  June  July August September October
   <chr>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>   <dbl>
 1 Amst…     2.9      2.5   5.7   8.2  12.5  14.8  17.1   17.1      14.5    11.4
 2 Athe…     9.1      9.7  11.7  15.4  20.1  24.5  27.4   27.2      23.8    19.2
 3 Berl…    -0.2      0.1   4.4   8.2  13.8  16    18.3   18        14.4    10  
 4 Brus…     3.3      3.3   6.7   8.9  12.8  15.6  17.8   17.8      15      11.1
 5 Buda…    -1.1      0.8   5.5  11.6  17    20.2  22     21.3      16.9    11.3
 6 Cope…    -0.4     -0.4   1.3   5.8  11.1  15.4  17.1   16.6      13.3     8.8
 7 Dubl…     4.8      5     5.9   7.8  10.4  13.3  15     14.6      12.7     9.7
 8 Elsi…    -5.8     -6.2  -2.7   3.1  10.2  14    17.2   14.9       9.7     5.2
 9 Kiev     -5.9     -5    -0.3   7.4  14.3  17.8  19.4   18.5      13.7     7.5
10 Krak…    -3.7     -2     1.9   7.9  13.2  16.9  18.4   17.6      13.7     8.6
# ℹ 25 more rows
# ℹ 7 more variables: November <dbl>, December <dbl>, Annual <dbl>,
#   Amplitude <dbl>, Latitude <dbl>, Longitude <dbl>, Area <chr>

Data summary

The skimr::skim() function provides a nice summary of the data:

skimr::skim(temperature)
Data summary
Name temperature
Number of rows 35
Number of columns 18
_______________________
Column type frequency:
character 2
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1 4 14 0 35 0
Area 0 1 4 5 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
January 0 1 1.35 5.50 -9.3 -1.55 0.2 4.90 10.7 ▅▅▇▇▆
February 0 1 2.22 5.50 -7.9 -0.15 1.9 5.80 11.8 ▂▂▇▂▃
March 0 1 5.23 4.86 -3.7 1.60 5.4 8.50 14.1 ▃▂▇▃▃
April 0 1 9.28 3.81 2.9 7.25 8.9 12.05 16.9 ▅▇▆▆▃
May 0 1 13.91 3.27 6.5 12.15 13.8 16.35 20.9 ▁▃▇▃▂
June 0 1 17.41 3.32 9.3 15.40 16.9 19.80 24.5 ▁▃▇▃▂
July 0 1 19.62 3.57 11.1 17.30 18.9 21.75 27.4 ▁▅▇▂▃
August 0 1 18.98 3.73 10.6 16.65 18.3 21.60 27.2 ▁▇▇▂▃
September 0 1 15.63 4.11 7.9 13.00 14.8 18.25 24.3 ▂▇▇▂▃
October 0 1 11.00 4.32 4.5 8.65 10.2 13.30 19.4 ▅▇▆▂▅
November 0 1 6.07 4.57 -1.1 3.20 5.1 7.90 14.9 ▆▇▆▂▅
December 0 1 2.88 4.97 -6.0 0.25 1.7 5.40 12.0 ▃▇▇▃▅
Annual 0 1 10.27 3.96 4.5 7.75 9.7 12.65 18.2 ▅▇▃▁▃
Amplitude 0 1 18.32 4.51 10.2 14.90 18.5 21.45 27.6 ▃▇▇▆▃
Latitude 0 1 49.04 7.10 37.2 43.90 50.0 53.35 64.1 ▆▅▇▃▃
Longitude 0 1 13.01 9.70 0.0 5.05 10.5 19.30 37.6 ▇▇▃▂▂

We learn from it that the data contains 35 observations of 18 variables. The summary also reports the column types and the number of missing values: the city and Area variables are character vectors, and there are no missing values.

Data visualization

Columns 2 to 13 contain the monthly temperature measurements. We can use the dplyr::select() function to select these columns along with the city column, which serves as an identifier:

temperature_sub <- dplyr::select(temperature, 1:13)
temperature_sub
# A tibble: 35 × 13
   city  January February March April   May  June  July August September October
   <chr>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>   <dbl>
 1 Amst…     2.9      2.5   5.7   8.2  12.5  14.8  17.1   17.1      14.5    11.4
 2 Athe…     9.1      9.7  11.7  15.4  20.1  24.5  27.4   27.2      23.8    19.2
 3 Berl…    -0.2      0.1   4.4   8.2  13.8  16    18.3   18        14.4    10  
 4 Brus…     3.3      3.3   6.7   8.9  12.8  15.6  17.8   17.8      15      11.1
 5 Buda…    -1.1      0.8   5.5  11.6  17    20.2  22     21.3      16.9    11.3
 6 Cope…    -0.4     -0.4   1.3   5.8  11.1  15.4  17.1   16.6      13.3     8.8
 7 Dubl…     4.8      5     5.9   7.8  10.4  13.3  15     14.6      12.7     9.7
 8 Elsi…    -5.8     -6.2  -2.7   3.1  10.2  14    17.2   14.9       9.7     5.2
 9 Kiev     -5.9     -5    -0.3   7.4  14.3  17.8  19.4   18.5      13.7     7.5
10 Krak…    -3.7     -2     1.9   7.9  13.2  16.9  18.4   17.6      13.7     8.6
# ℹ 25 more rows
# ℹ 2 more variables: November <dbl>, December <dbl>

Next, we use the tidyr::pivot_longer() function to reshape the data into a long format which is required for visualization with ggplot2.

temperature_sub <- tidyr::pivot_longer(
  temperature_sub, 
  cols = -city, 
  names_to = "month", 
  values_to = "temperature"
)
temperature_sub
# A tibble: 420 × 3
   city      month     temperature
   <chr>     <chr>           <dbl>
 1 Amsterdam January           2.9
 2 Amsterdam February          2.5
 3 Amsterdam March             5.7
 4 Amsterdam April             8.2
 5 Amsterdam May              12.5
 6 Amsterdam June             14.8
 7 Amsterdam July             17.1
 8 Amsterdam August           17.1
 9 Amsterdam September        14.5
10 Amsterdam October          11.4
# ℹ 410 more rows

Next, we convert the month column to a proper date-time format using the lubridate::parse_date_time() function:

temperature_sub <- dplyr::mutate(
  temperature_sub, 
  month = lubridate::parse_date_time(month, orders = "m")
)
temperature_sub
# A tibble: 420 × 3
   city      month               temperature
   <chr>     <dttm>                    <dbl>
 1 Amsterdam 0000-01-01 00:00:00         2.9
 2 Amsterdam 0000-02-01 00:00:00         2.5
 3 Amsterdam 0000-03-01 00:00:00         5.7
 4 Amsterdam 0000-04-01 00:00:00         8.2
 5 Amsterdam 0000-05-01 00:00:00        12.5
 6 Amsterdam 0000-06-01 00:00:00        14.8
 7 Amsterdam 0000-07-01 00:00:00        17.1
 8 Amsterdam 0000-08-01 00:00:00        17.1
 9 Amsterdam 0000-09-01 00:00:00        14.5
10 Amsterdam 0000-10-01 00:00:00        11.4
# ℹ 410 more rows

Now we can plot the temperature measurements for each city using the ggplot2::ggplot() function. We also use ggplot2::scale_x_datetime() together with scales::date_format() to format the x-axis labels so that they show only the month names:

temperature_sub |>
  ggplot(aes(x = month, y = temperature, color = city)) +
  geom_line() +
  scale_x_datetime(labels = scales::date_format("%b")) +
  theme_bw() + 
  labs(
    x = "Month", 
    y = "Temperature (°C)", 
    color = "City"
  ) + 
  theme(legend.position = "none")

It is hard to see from this plot which cities have similar temperature profiles. A principal component analysis (PCA) might help us to find out.

Principal component analysis

PCA can only be performed on numeric variables. Hence, we first use dplyr::select() to retain only the columns containing the monthly temperatures:

X <- dplyr::select(temperature, 2:13)

Before performing PCA, it is important to standardize the data. For this purpose, we can use the base::scale() function:

X <- scale(X)
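
As an aside, FactoMineR::PCA() standardizes the variables itself by default (its scale.unit argument is TRUE), so this explicit scaling step is not strictly required; a minimal sketch of the equivalent direct call (pca_alt is a hypothetical name):

pca_alt <- FactoMineR::PCA(
  dplyr::select(temperature, 2:13), # raw monthly temperatures
  scale.unit = TRUE,                # the default: standardize internally
  graph = FALSE
)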

Now, we are going to use FactoMineR::PCA() to perform the PCA and the factoextra package to visualize the results. When visualizing observations, factoextra uses the rownames of the input data to label the points. Hence, we anticipate this and assign the city names as rownames of the data set:

rownames(X) <- temperature$city

Now we can perform the PCA using the FactoMineR::PCA() function:

pca_results <- FactoMineR::PCA(X, graph = FALSE)

Choice of the reduced dimension

We can use the factoextra::fviz_screeplot() function to draw the scree plot of the percentages of explained variance:

factoextra::fviz_screeplot(pca_results, addlabels = TRUE)

And the eigenvalues can be extracted with:

factoextra::get_eigenvalue(pca_results)
         eigenvalue variance.percent cumulative.variance.percent
Dim.1  1.042445e+01     86.870441346                    86.87044
Dim.2  1.370499e+00     11.420823117                    98.29126
Dim.3  1.205076e-01      1.004230241                    99.29549
Dim.4  4.233298e-02      0.352774838                    99.64827
Dim.5  2.292280e-02      0.191023370                    99.83929
Dim.6  8.684234e-03      0.072368614                    99.91166
Dim.7  4.178064e-03      0.034817200                    99.94648
Dim.8  2.930325e-03      0.024419371                    99.97090
Dim.9  1.475750e-03      0.012297915                    99.98320
Dim.10 8.529732e-04      0.007108110                    99.99030
Dim.11 7.862929e-04      0.006552441                    99.99686
Dim.12 3.772122e-04      0.003143435                   100.00000

This shows that the first two principal components together explain more than 98% of the variance in the data.

Analysis of the variables

We can use the factoextra::fviz_pca_var() function to visualize the coordinates of the variables in the proposed reduced space consisting of the first two principal directions:

factoextra::fviz_pca_var(pca_results)

We see from the above correlation circle that:

  • the first principal component produces a roughly uniformly weighted average of the monthly temperatures; a high score on this component therefore means a high average temperature over the year;
  • the second principal component contrasts the warmest (summer) months with the coldest (winter) months, providing an estimate of the range between maximal and minimal temperature; a high score on this component means a large difference between maximal and minimal temperature over the year (we verify this numerically below).
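
We can check this interpretation numerically, since the data set already records the annual average temperature and the thermal amplitude. A quick sanity check (a sketch using the individual coordinates returned by factoextra; ind_coords is a hypothetical name):

# PC1 should track the annual average temperature and PC2 the thermal
# amplitude recorded in the original data set.
ind_coords <- factoextra::get_pca_ind(pca_results)$coord
cor(ind_coords[, 1], temperature$Annual)
cor(ind_coords[, 2], temperature$Amplitude)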

We can assess whether each original variable is well represented by the first two principal components by looking at the cos2 values:

cos2 <- factoextra::get_pca_var(pca_results)$cos2[, 1:2]
corrplot::corrplot(cos2, is.corr = FALSE)

The original variables are indeed well represented; in fact, they would all be well represented by the first component alone.
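
To back this claim with numbers, we can inspect the cumulative cos2 of each variable over the first two components (a quick sketch reusing the cos2 matrix computed above):

# Each row sums the cos2 of one month over PC1 and PC2; values close to 1
# mean the variable is almost perfectly represented in the first plane.
summary(rowSums(cos2))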

We can confirm the interpretation of the first two principal components derived from the correlation circle by looking at the contribution of each original variable to them:

contrib <- factoextra::get_pca_var(pca_results)$contrib[, 1:2]
corrplot::corrplot(contrib, is.corr = FALSE)

Analysis of the city temperature profiles

We can use the factoextra::fviz_pca_ind() function to visualize the cities in the plane defined by the first two principal components:

factoextra::fviz_pca_ind(pca_results)

The plane is naturally divided into four quadrants:

  • The top-left quadrant groups cities with negative scores on the first component but positive scores on the second, reflecting a rather low average temperature but a relatively high thermal amplitude.
  • The top-right quadrant groups cities with positive scores on both components, reflecting a rather high average temperature and a relatively high thermal amplitude.
  • The bottom-left quadrant groups cities with negative scores on both components, reflecting a rather low average temperature and a relatively low thermal amplitude.
  • The bottom-right quadrant groups cities with positive scores on the first component but negative scores on the second, reflecting a rather high average temperature but a relatively low thermal amplitude.

We can also add colors to the plot to see if the area of the cities is correlated with the temperature profiles:

factoextra::fviz_pca_ind(
  pca_results,
  col.ind = temperature$Area, # color by groups
  palette = viridis::viridis(4),
  legend.title = "By Area"
)

This is interesting, as it reflects that:

  • western capitals have a medium average temperature and a medium thermal amplitude;
  • northern capitals are grouped in the bottom-left quadrant;
  • eastern capitals are grouped in the top-left quadrant;
  • southern capitals are grouped on the right side: they are globally warm, but some of them feature a high thermal amplitude, like Milan, while others feature a more stable temperature over the year, like Lisbon.

Interpretation of units

The fact that some cities have a negative score on the second principal component does not mean that their winter temperatures are higher than their summer temperatures.

Recall that the temperatures have been standardized across cities. Hence, if you want to interpret the scores in terms of actual temperatures, you should compute them by hand with something like:

orig_temp <- as.matrix(temperature[, 2:13])
loadings <- factoextra::get_pca_var(pca_results)$coord[, 1:2]
scores <- orig_temp %*% loadings
scores_tbl <- tibble::tibble(
  City = temperature$city, 
  PC1 = scores[, 1], 
  PC2 = scores[, 2]
)
gt::gt(scores_tbl)
City PC1 PC2
Amsterdam 109.75002 20.07679
Athens 198.59930 25.31486
Berlin 100.57744 26.73047
Brussels 114.61657 20.60132
Budapest 121.84567 33.06963
Copenhagen 87.01508 25.23984
Dublin 103.31766 14.35165
Elsinki 52.66654 31.43762
Kiev 78.60049 36.91400
Krakow 86.57091 31.42295
Lisbon 177.96301 15.27917
London 108.28743 19.03678
Madrid 154.79387 26.79321
Minsk 60.59012 35.15569
Moscow 56.10806 38.98164
Oslo 62.27592 30.17793
Paris 123.96285 21.37080
Prague 101.58740 29.64837
Reykjavik 51.03193 16.05866
Rome 171.82598 23.94100
Sarajevo 105.54795 27.96790
Sofia 107.82834 29.99347
Stockholm 64.97559 28.84649
Antwerp 113.97189 20.71773
Barcelona 180.85690 20.44746
Bordeaux 141.80864 20.95166
Edinburgh 92.78014 16.47195
Frankfurt 108.90323 26.73563
Geneva 108.15154 26.93270
Genoa 178.52320 21.81177
Milan 141.14609 31.35509
Palermo 196.51068 19.84736
Seville 203.30276 22.04190
St. Petersburg 50.62120 36.91273
Zurich 96.74811 26.57796

The chicken.csv data set

It gathers data on 43 chickens that each underwent one of six different diets:

  • normal diet (N),
  • fast for 16h (F16),
  • fast for 16h and back to normal diet for 5h (F16R5),
  • fast for 16h and back to normal diet for 16h (F16R16),
  • fast for 48h (F48),
  • fast for 48h and back to normal diet for 24h (F48R24).

After each process, the chickens underwent gene expression sequencing, and 7406 gene expression levels were collected.

  • Can we conclude that genes are expressed differently according to the stress level?
  • How much time do chickens need to return to normal gene expression?

Data import & Visualization

A quick glance at the content of the CSV file shows that the delimiter is the semicolon. We can therefore use readr::read_delim() to import the data:

df <- readr::read_delim(
  file = "12-PCA-Data/chicken.csv", 
  delim = ";", 
  show_col_types = FALSE
)
df
# A tibble: 7,406 × 44
   gene         N_1     N_2     N_3     N_4      N_6     N_7     j16_3   j16_4
   <chr>      <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>     <dbl>   <dbl>
 1 A4GALT  -0.153   -0.0126 -0.140   0.208  -0.461   -0.217  -0.130    -0.295 
 2 A4GNT    0.00873 -0.0130  0.0126  0.0713  0.515    0.158   0.600     0.171 
 3 AACS     0.0426   0.173  -0.0219 -0.0788 -0.00269 -0.0894 -0.0548    0.0145
 4 AADACL1  0.161    0.205   0.185   0.0576  0.307    0.309   0.000618 -0.0414
 5 AADACL2 -0.374    0.0320 -0.411   0.0361 -0.490   -0.145  -0.112    -0.151 
 6 AADACL3 -0.445    0.0638  0.444   0.106  -0.455   -0.0628 -0.0906    0.104 
 7 AAK1    -0.216   -0.157  -0.306  -0.238  -0.218   -0.114  -0.0109   -0.226 
 8 AAMP     1.23     0.650   0.350   1.25    0.138    0.690  -0.903    -0.594 
 9 AARS     0.425    0.0293  0.195  -0.0334 -0.189   -0.297   0.0937    0.0522
10 AARS2   -0.118   -0.189  -0.486   0.414  -0.404   -0.182  -0.598    -0.508 
# ℹ 7,396 more rows
# ℹ 35 more variables: j16_5 <dbl>, j16_6 <dbl>, j16_7 <dbl>, j16r5_1 <dbl>,
#   j16r5_2 <dbl>, j16r5_3 <dbl>, j16r5_4 <dbl>, j16r5_5 <dbl>, j16r5_6 <dbl>,
#   j16r5_7 <dbl>, j16r5_8 <dbl>, j16r16_1 <dbl>, j16r16_2 <dbl>,
#   j16r16_3 <dbl>, j16r16_4 <dbl>, j16r16_5 <dbl>, j16r16_6 <dbl>,
#   j16r16_7 <dbl>, j16r16_8 <dbl>, j16r16_9 <dbl>, j48_1 <dbl>, j48_2 <dbl>,
#   j48_3 <dbl>, j48_4 <dbl>, j48_6 <dbl>, j48_7 <dbl>, j48r24_1 <dbl>, …

The analysis we want to carry out puts the focus on the chickens rather than on the genes. However, the raw data records genes in rows and chickens in columns, grouped by diet. We therefore proceed with some data manipulation to obtain a data set with 43 rows, one per chicken. First, we use tidyr::pivot_longer() so that the names of all variables except gene become the values of a single column called name, with the corresponding measurements in a single column called value:

df <- tidyr::pivot_longer(df, cols = -gene)
df
# A tibble: 318,458 × 3
   gene   name    value
   <chr>  <chr>   <dbl>
 1 A4GALT N_1   -0.153 
 2 A4GALT N_2   -0.0126
 3 A4GALT N_3   -0.140 
 4 A4GALT N_4    0.208 
 5 A4GALT N_6   -0.461 
 6 A4GALT N_7   -0.217 
 7 A4GALT j16_3 -0.130 
 8 A4GALT j16_4 -0.295 
 9 A4GALT j16_5 -0.130 
10 A4GALT j16_6 -0.220 
# ℹ 318,448 more rows

Now, we can see that the name column encodes two different pieces of information:

  • the type of diet given to the chicken; and,
  • the chicken ID within its group.

We therefore use tidyr::separate() to get each piece of information into its own column:

df <- tidyr::separate(df, col = name, into = c("diet", "id"), sep = "_")
df
# A tibble: 318,458 × 4
   gene   diet  id      value
   <chr>  <chr> <chr>   <dbl>
 1 A4GALT N     1     -0.153 
 2 A4GALT N     2     -0.0126
 3 A4GALT N     3     -0.140 
 4 A4GALT N     4      0.208 
 5 A4GALT N     6     -0.461 
 6 A4GALT N     7     -0.217 
 7 A4GALT j16   3     -0.130 
 8 A4GALT j16   4     -0.295 
 9 A4GALT j16   5     -0.130 
10 A4GALT j16   6     -0.220 
# ℹ 318,448 more rows

Finally, we perform a last reshaping of the data set to get each gene in a dedicated column via tidyr::pivot_wider():

chk <- tidyr::pivot_wider(df, names_from = gene, values_from = value)
chk
# A tibble: 43 × 7,408
   diet  id     A4GALT    A4GNT     AACS  AADACL1 AADACL2 AADACL3    AAK1   AAMP
   <chr> <chr>   <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 N     1     -0.153   0.00873  0.0426   1.61e-1 -0.374  -0.445  -0.216   1.23 
 2 N     2     -0.0126 -0.0130   0.173    2.05e-1  0.0320  0.0638 -0.157   0.650
 3 N     3     -0.140   0.0126  -0.0219   1.85e-1 -0.411   0.444  -0.306   0.350
 4 N     4      0.208   0.0713  -0.0788   5.76e-2  0.0361  0.106  -0.238   1.25 
 5 N     6     -0.461   0.515   -0.00269  3.07e-1 -0.490  -0.455  -0.218   0.138
 6 N     7     -0.217   0.158   -0.0894   3.09e-1 -0.145  -0.0628 -0.114   0.690
 7 j16   3     -0.130   0.600   -0.0548   6.18e-4 -0.112  -0.0906 -0.0109 -0.903
 8 j16   4     -0.295   0.171    0.0145  -4.14e-2 -0.151   0.104  -0.226  -0.594
 9 j16   5     -0.130   0.723   -0.120    1.27e-1 -0.187  -1.23   -0.151  -0.733
10 j16   6     -0.220   0.743    0.200    6.10e-2 -0.265  -0.157  -0.141  -0.683
# ℹ 33 more rows
# ℹ 7,398 more variables: AARS <dbl>, AARS2 <dbl>, AARSD1 <dbl>, AASS <dbl>,
#   AATF <dbl>, ABAT <dbl>, ABAT1 <dbl>, ABCA1 <dbl>, ABCA3 <dbl>,
#   ABCB10 <dbl>, ABCB7 <dbl>, ABCC4 <dbl>, ABCC5 <dbl>, ABCD3 <dbl>,
#   ABCD4 <dbl>, ABCE1 <dbl>, ABCF2 <dbl>, ABCG1 <dbl>, ABCG2 <dbl>,
#   ABCG4 <dbl>, ABHD11 <dbl>, ABHD12 <dbl>, ABHD2 <dbl>, ABHD3 <dbl>,
#   ABHD5 <dbl>, ABI1 <dbl>, ABI2 <dbl>, ABL1 <dbl>, ABL11 <dbl>, ABL2 <dbl>, …

It is difficult to provide insightful visualizations in the original variable space, which has dimension 7406. We therefore begin by performing PCA, and then provide (hopefully insightful) visualizations in the reduced space.

PCA

First, we select the numerical variables on which we want to perform the PCA and we standardize them:

X <- chk |> 
  dplyr::select(-diet, -id) |> 
  scale()

Now, we can perform the PCA:

pca_res <- FactoMineR::PCA(X, graph = FALSE)

Choice of the reduced dimension

factoextra::fviz_screeplot(
  pca_res, 
  addlabels = TRUE, 
  ncp = 42L
)

The elbow method applied to the scree plot suggests keeping either 1, 6, 8, or 16 components. We can look at the cumulative percentage of variance explained:

pca_res |> 
  factoextra::get_eigenvalue() |> 
  janitor::clean_names() |> 
  tibble::as_tibble() |> 
  gt::gt() |> 
  gt::cols_label(
    eigenvalue = "Eigenvalue",
    variance_percent = "Percentage of variance explained",
    cumulative_variance_percent = "Cumulative percentage of variance explained"
  )
Eigenvalue Percentage of variance explained Cumulative percentage of variance explained
1453.57240 19.6269565 19.62696
692.78785 9.3544133 28.98137
536.20796 7.2401831 36.22155
434.45335 5.8662348 42.08779
374.62157 5.0583523 47.14614
324.08250 4.3759451 51.52209
273.79824 3.6969787 55.21906
263.23096 3.5542933 58.77336
238.82728 3.2247809 61.99814
212.22641 2.8656010 64.86374
205.86862 2.7797545 67.64349
186.68809 2.5207682 70.16426
161.71254 2.1835342 72.34780
146.67462 1.9804837 74.32828
134.41024 1.8148831 76.14316
128.77791 1.7388321 77.88199
109.49173 1.4784192 79.36041
104.11595 1.4058324 80.76625
98.79454 1.3339798 82.10023
91.34045 1.2333304 83.33356
83.86158 1.1323465 84.46590
80.15492 1.0822971 85.54820
79.43852 1.0726238 86.62082
74.46548 1.0054751 87.62630
70.99072 0.9585568 88.58486
66.21149 0.8940250 89.47888
62.68820 0.8464515 90.32533
60.07860 0.8112153 91.13655
58.69024 0.7924688 91.92902
56.15042 0.7581747 92.68719
54.70237 0.7386224 93.42581
52.81262 0.7131059 94.13892
50.97364 0.6882749 94.82719
48.64270 0.6568012 95.48400
48.15634 0.6502342 96.13423
46.40970 0.6266500 96.76088
44.81239 0.6050822 97.36596
42.90186 0.5792851 97.94525
40.79995 0.5509040 98.49615
39.33001 0.5310560 99.02721
38.25896 0.5165942 99.54380
33.78608 0.4561987 100.00000

In this case, we see that keeping 6 components retains only about 52% of the inertia, keeping 8 components brings us to almost 59%, and going for 16 components takes us above 77%.
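
A small helper makes this kind of reading automatic; the sketch below finds the smallest number of components whose cumulative inertia reaches a target threshold, here 80% (eig is a hypothetical name). On this data set, it returns 18.

# Smallest number of components whose cumulative inertia reaches 80%.
eig <- factoextra::get_eigenvalue(pca_res)
which(eig$cumulative.variance.percent >= 80)[1]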

If we use the Kaiser-style rule of keeping all components whose eigenvalue is above 1, then we would keep all 42 components.
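
This count can be checked directly; all 42 eigenvalues in the table above exceed 1, so the one-liner below returns 42:

# Count the eigenvalues above 1 (the Kaiser-style cut-off).
sum(factoextra::get_eigenvalue(pca_res)$eigenvalue > 1)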

Finally, we can estimate the optimal number of components to keep by minimizing a generalized cross-validation (GCV) score, obtained as the mean error when reconstructing each chicken's gene expression profile from a PCA computed on all the other chickens. The function FactoMineR::estim_ncp() does exactly that for us:

out <- FactoMineR::estim_ncp(X, ncp.min = 1, ncp.max = 42)
tibble::tibble(gcv = out$criterion, ncp = seq_along(gcv)) |> 
  ggplot(aes(x = ncp, y = gcv)) + 
  geom_point() +
  geom_line()

out$ncp
[1] 15
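
Should we want the PCA outputs (coordinates, cos2, contributions) stored for that many dimensions, we could re-run the decomposition with the ncp argument of FactoMineR::PCA(), which defaults to 5 (pca_res_15 is a hypothetical name):

# Keep 15 dimensions in the PCA outputs, as suggested by estim_ncp().
pca_res_15 <- FactoMineR::PCA(X, ncp = 15, graph = FALSE)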

Analysis of the variables

The analysis of variables by means of visualizations or tables becomes tricky as the number of variables grows. Typically here, drawing the correlation circle with 7406 arrows would not be of any help. That is why factoextra::fviz_pca_var() comes with an optional argument, select.var, which instructs the function to display only the variables satisfying some constraints. For example, in the code below, we ask it to retain only the variables whose cumulative cos2 value in the first principal plane is above 0.8, i.e., only those variables that are well represented in that plane:

factoextra::fviz_pca_var(pca_res, select.var = list(cos2 = 0.8))
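
Other filters are available: select.var also accepts, for instance, a contrib element to keep only the top contributing variables. A sketch, keeping the 20 largest contributors:

# Show only the 20 variables contributing most to the displayed components.
factoextra::fviz_pca_var(pca_res, select.var = list(contrib = 20))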

Analysis of the individuals

We can project chickens onto the first principal plane and color points by diet:

factoextra::fviz_pca_ind(pca_res, col.ind = chk$diet)

This is interesting, as it shows that, in this plane, only the chickens that fasted for 48h did not return to normal gene expression, including those that had been back on a normal diet for 24h after the fast.
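
To make the grouping easier to judge visually, we could also draw concentration ellipses around each diet with the addEllipses argument of factoextra::fviz_pca_ind() (a sketch):

# Concentration ellipses summarize the spread of each diet group.
factoextra::fviz_pca_ind(
  pca_res,
  col.ind = chk$diet,
  addEllipses = TRUE,
  legend.title = "Diet"
)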

The limit of linear projections

One has to be careful with the interpretation in such cases. Indeed, we could be tempted to conclude that the chickens on all the other diets went back to normal gene expression. But we are in fact only looking at the chickens in a reduced space that accounts for less than 30% of the total inertia of the original data! This is what urged researchers, typically in the field of genetics, to resort to non-linear dimension reduction methods such as t-SNE or UMAP, which achieve very low-dimensional reduced spaces while retaining most of the structure of the original data. For instance, the code below illustrates the projection of the data onto a two-dimensional reduced space via UMAP:

umap_res <- chk |> 
  dplyr::select(-diet, -id) |> 
  scale() |> 
  umap::umap()
tibble::tibble(
  X = umap_res$layout[, 1], 
  Y = umap_res$layout[, 2], 
  Diet = chk$diet
) |> 
  ggplot(aes(x = X, y = Y, color = Diet)) + 
  geom_point() + 
  theme_bw()
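
Note that UMAP is stochastic, so the layout changes from run to run. The umap package exposes a random_state parameter to fix the seed if a reproducible picture is needed (a sketch, with 1234 an arbitrary choice):

# Fix the seed of the embedding for reproducibility.
umap_res <- chk |> 
  dplyr::select(-diet, -id) |> 
  scale() |> 
  umap::umap(random_state = 1234L)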

This clearly illustrates that claiming, on the basis of the PCA results alone, that the chickens that fasted for less than 48h went back to normal gene expression would have been completely misleading.

The orange.csv data set

Six orange juices from different manufacturers were evaluated. Are all the variables indispensable? Do some juices stand out as particularly good? Particularly bad?

orange <- readr::read_delim(
  "12-PCA-Data/orange.csv", 
  delim = ";", 
  escape_double = FALSE, 
  trim_ws = TRUE
)
Rows: 6 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (3): Product, Way of preserving, Origin
dbl (14): Odour intensity, Odour typicality, Pulpiness, Intensity of taste, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
orange
# A tibble: 6 × 17
  Product    `Odour intensity` `Odour typicality` Pulpiness `Intensity of taste`
  <chr>                  <dbl>              <dbl>     <dbl>                <dbl>
1 Pampryl a…              2.82               2.53      1.66                 3.46
2 Tropicana…              2.76               2.82      1.91                 3.23
3 Fruvita f…              2.83               2.88      4                    3.45
4 Joker amb.              2.76               2.59      1.66                 3.37
5 Tropicana…              3.2                3.02      3.69                 3.12
6 Pampryl f…              3.07               2.73      3.34                 3.54
# ℹ 12 more variables: Acidity <dbl>, Bitterness <dbl>, Sweetness <dbl>,
#   Glucose <dbl>, Fructose <dbl>, Saccharose <dbl>, `Sweetening power` <dbl>,
#   pH <dbl>, `Citric acid` <dbl>, `Vitamin C` <dbl>,
#   `Way of preserving` <chr>, Origin <chr>
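
The solution would then proceed as before. A minimal sketch of the next step, assuming we standardize the numeric sensory and chemical variables and label the rows by product (orange_pca is a hypothetical name):

# Standardized PCA on the numeric variables, with products as row labels.
X <- orange |> 
  dplyr::select(where(is.numeric)) |> 
  scale()
rownames(X) <- orange$Product
orange_pca <- FactoMineR::PCA(X, graph = FALSE)
factoextra::fviz_pca_biplot(orange_pca)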