PCA - Solutions

The temperature.csv data set

This data set contains:

  • monthly temperature recordings for European capitals in a given year;
  • GPS coordinates of each city;
  • thermal amplitude: difference between maximal and minimal temperature;
  • annual average temperature;
  • European area: South, North, West or East.

Perform a PCA to uncover temperature patterns and identify which cities follow them.

Data import

Let us begin by importing the data. After a quick inspection of the raw data file in a text editor, we find that:

  • the data is separated by semicolons;
  • the first column contains the city names, which have trailing whitespace.

We can use the readr::read_delim() function to import the data as follows:

temperature <- readr::read_delim(
  "12-PCA-Data/temperature.csv", 
  delim = ";", 
  trim_ws = TRUE,
  show_col_types = FALSE
)
New names:
• `` -> `...1`
temperature
# A tibble: 35 × 18
   ...1  January February March April   May  June  July August September October
   <chr>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>   <dbl>
 1 Amst…     2.9      2.5   5.7   8.2  12.5  14.8  17.1   17.1      14.5    11.4
 2 Athe…     9.1      9.7  11.7  15.4  20.1  24.5  27.4   27.2      23.8    19.2
 3 Berl…    -0.2      0.1   4.4   8.2  13.8  16    18.3   18        14.4    10  
 4 Brus…     3.3      3.3   6.7   8.9  12.8  15.6  17.8   17.8      15      11.1
 5 Buda…    -1.1      0.8   5.5  11.6  17    20.2  22     21.3      16.9    11.3
 6 Cope…    -0.4     -0.4   1.3   5.8  11.1  15.4  17.1   16.6      13.3     8.8
 7 Dubl…     4.8      5     5.9   7.8  10.4  13.3  15     14.6      12.7     9.7
 8 Elsi…    -5.8     -6.2  -2.7   3.1  10.2  14    17.2   14.9       9.7     5.2
 9 Kiev     -5.9     -5    -0.3   7.4  14.3  17.8  19.4   18.5      13.7     7.5
10 Krak…    -3.7     -2     1.9   7.9  13.2  16.9  18.4   17.6      13.7     8.6
# ℹ 25 more rows
# ℹ 7 more variables: November <dbl>, December <dbl>, Annual <dbl>,
#   Amplitude <dbl>, Latitude <dbl>, Longitude <dbl>, Area <chr>

We notice that the first variable has no name. We can rename it using the dplyr::rename() function as follows:

temperature <- dplyr::rename(temperature, city = ...1)
temperature
# A tibble: 35 × 18
   city  January February March April   May  June  July August September October
   <chr>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>   <dbl>
 1 Amst…     2.9      2.5   5.7   8.2  12.5  14.8  17.1   17.1      14.5    11.4
 2 Athe…     9.1      9.7  11.7  15.4  20.1  24.5  27.4   27.2      23.8    19.2
 3 Berl…    -0.2      0.1   4.4   8.2  13.8  16    18.3   18        14.4    10  
 4 Brus…     3.3      3.3   6.7   8.9  12.8  15.6  17.8   17.8      15      11.1
 5 Buda…    -1.1      0.8   5.5  11.6  17    20.2  22     21.3      16.9    11.3
 6 Cope…    -0.4     -0.4   1.3   5.8  11.1  15.4  17.1   16.6      13.3     8.8
 7 Dubl…     4.8      5     5.9   7.8  10.4  13.3  15     14.6      12.7     9.7
 8 Elsi…    -5.8     -6.2  -2.7   3.1  10.2  14    17.2   14.9       9.7     5.2
 9 Kiev     -5.9     -5    -0.3   7.4  14.3  17.8  19.4   18.5      13.7     7.5
10 Krak…    -3.7     -2     1.9   7.9  13.2  16.9  18.4   17.6      13.7     8.6
# ℹ 25 more rows
# ℹ 7 more variables: November <dbl>, December <dbl>, Annual <dbl>,
#   Amplitude <dbl>, Latitude <dbl>, Longitude <dbl>, Area <chr>

Data summary

The skimr::skim() function provides a nice summary of the data:

skimr::skim(temperature)
Data summary
Name temperature
Number of rows 35
Number of columns 18
_______________________
Column type frequency:
character 2
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1 4 14 0 35 0
Area 0 1 4 5 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
January 0 1 1.35 5.50 -9.3 -1.55 0.2 4.90 10.7 ▅▅▇▇▆
February 0 1 2.22 5.50 -7.9 -0.15 1.9 5.80 11.8 ▂▂▇▂▃
March 0 1 5.23 4.86 -3.7 1.60 5.4 8.50 14.1 ▃▂▇▃▃
April 0 1 9.28 3.81 2.9 7.25 8.9 12.05 16.9 ▅▇▆▆▃
May 0 1 13.91 3.27 6.5 12.15 13.8 16.35 20.9 ▁▃▇▃▂
June 0 1 17.41 3.32 9.3 15.40 16.9 19.80 24.5 ▁▃▇▃▂
July 0 1 19.62 3.57 11.1 17.30 18.9 21.75 27.4 ▁▅▇▂▃
August 0 1 18.98 3.73 10.6 16.65 18.3 21.60 27.2 ▁▇▇▂▃
September 0 1 15.63 4.11 7.9 13.00 14.8 18.25 24.3 ▂▇▇▂▃
October 0 1 11.00 4.32 4.5 8.65 10.2 13.30 19.4 ▅▇▆▂▅
November 0 1 6.07 4.57 -1.1 3.20 5.1 7.90 14.9 ▆▇▆▂▅
December 0 1 2.88 4.97 -6.0 0.25 1.7 5.40 12.0 ▃▇▇▃▅
Annual 0 1 10.27 3.96 4.5 7.75 9.7 12.65 18.2 ▅▇▃▁▃
Amplitude 0 1 18.32 4.51 10.2 14.90 18.5 21.45 27.6 ▃▇▇▆▃
Latitude 0 1 49.04 7.10 37.2 43.90 50.0 53.35 64.1 ▆▅▇▃▃
Longitude 0 1 13.01 9.70 0.0 5.05 10.5 19.30 37.6 ▇▇▃▂▂

We learn from it that the data contains 35 observations of 18 variables. The summary also reports the column types and the number of missing values: the city and Area variables are character vectors, and there are no missing values.

Data visualization

Columns 2 to 13 contain the monthly temperature measurements. We can use the dplyr::select() function to select these columns along with the city column, which serves as an identifier:

temperature_sub <- dplyr::select(temperature, 1:13)
temperature_sub
# A tibble: 35 × 13
   city  January February March April   May  June  July August September October
   <chr>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>   <dbl>
 1 Amst…     2.9      2.5   5.7   8.2  12.5  14.8  17.1   17.1      14.5    11.4
 2 Athe…     9.1      9.7  11.7  15.4  20.1  24.5  27.4   27.2      23.8    19.2
 3 Berl…    -0.2      0.1   4.4   8.2  13.8  16    18.3   18        14.4    10  
 4 Brus…     3.3      3.3   6.7   8.9  12.8  15.6  17.8   17.8      15      11.1
 5 Buda…    -1.1      0.8   5.5  11.6  17    20.2  22     21.3      16.9    11.3
 6 Cope…    -0.4     -0.4   1.3   5.8  11.1  15.4  17.1   16.6      13.3     8.8
 7 Dubl…     4.8      5     5.9   7.8  10.4  13.3  15     14.6      12.7     9.7
 8 Elsi…    -5.8     -6.2  -2.7   3.1  10.2  14    17.2   14.9       9.7     5.2
 9 Kiev     -5.9     -5    -0.3   7.4  14.3  17.8  19.4   18.5      13.7     7.5
10 Krak…    -3.7     -2     1.9   7.9  13.2  16.9  18.4   17.6      13.7     8.6
# ℹ 25 more rows
# ℹ 2 more variables: November <dbl>, December <dbl>

Next, we use the tidyr::pivot_longer() function to reshape the data into a long format which is required for visualization with ggplot2.

temperature_sub <- tidyr::pivot_longer(
  temperature_sub, 
  cols = -city, 
  names_to = "month", 
  values_to = "temperature"
)
temperature_sub
# A tibble: 420 × 3
   city      month     temperature
   <chr>     <chr>           <dbl>
 1 Amsterdam January           2.9
 2 Amsterdam February          2.5
 3 Amsterdam March             5.7
 4 Amsterdam April             8.2
 5 Amsterdam May              12.5
 6 Amsterdam June             14.8
 7 Amsterdam July             17.1
 8 Amsterdam August           17.1
 9 Amsterdam September        14.5
10 Amsterdam October          11.4
# ℹ 410 more rows

Next, we convert the month column to a proper date-time format using the lubridate::parse_date_time() function:

temperature_sub <- dplyr::mutate(
  temperature_sub, 
  month = lubridate::parse_date_time(month, orders = "m")
)
temperature_sub
# A tibble: 420 × 3
   city      month               temperature
   <chr>     <dttm>                    <dbl>
 1 Amsterdam 0000-01-01 00:00:00         2.9
 2 Amsterdam 0000-02-01 00:00:00         2.5
 3 Amsterdam 0000-03-01 00:00:00         5.7
 4 Amsterdam 0000-04-01 00:00:00         8.2
 5 Amsterdam 0000-05-01 00:00:00        12.5
 6 Amsterdam 0000-06-01 00:00:00        14.8
 7 Amsterdam 0000-07-01 00:00:00        17.1
 8 Amsterdam 0000-08-01 00:00:00        17.1
 9 Amsterdam 0000-09-01 00:00:00        14.5
10 Amsterdam 0000-10-01 00:00:00        11.4
# ℹ 410 more rows

Now we can plot the temperature measurements for each city using the ggplot2::ggplot() function. We also use ggplot2::scale_x_datetime() together with scales::date_format() to format the x-axis labels so that they show only the month names:

temperature_sub |>
  ggplot(aes(x = month, y = temperature, color = city)) +
  geom_line() +
  scale_x_datetime(labels = scales::date_format("%b")) +
  theme_bw() + 
  labs(
    x = "Month", 
    y = "Temperature (°C)", 
    color = "City"
  ) + 
  theme(legend.position = "none")

It is hard to see from this plot which cities have similar temperature profiles. A principal component analysis (PCA) might help us to find out.

Principal component analysis

PCA can only be performed on numeric variables. Hence, we first use dplyr::select() to retain only the columns containing the monthly temperatures:

X <- dplyr::select(temperature, 2:13)

Before performing PCA, it is important to standardize the data. For this purpose, we can use the base::scale() function:

X <- scale(X)
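
As an aside, FactoMineR::PCA() standardizes the variables itself by default (its scale.unit argument is TRUE), so this explicit scaling step is not strictly required; a minimal sketch of the equivalent direct call (pca_alt is a hypothetical name):

pca_alt <- FactoMineR::PCA(
  dplyr::select(temperature, 2:13), # raw monthly temperatures
  scale.unit = TRUE,                # the default: standardize internally
  graph = FALSE
)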

Now, we are going to use FactoMineR::PCA() to perform the PCA and the factoextra package to visualize the results. When visualizing observations, factoextra uses the rownames of the input data to label the points. Hence, we anticipate this and assign the city names as rownames of the data set:

rownames(X) <- temperature$city

Now we can perform the PCA using the FactoMineR::PCA() function:

pca_results <- FactoMineR::PCA(X, graph = FALSE)

Choice of the reduced dimension

We can use the factoextra::fviz_screeplot() function to draw the scree plot of the percentages of explained variance:

factoextra::fviz_screeplot(pca_results, addlabels = TRUE)

And the eigenvalues can be extracted with:

factoextra::get_eigenvalue(pca_results)
         eigenvalue variance.percent cumulative.variance.percent
Dim.1  1.042445e+01     86.870441346                    86.87044
Dim.2  1.370499e+00     11.420823117                    98.29126
Dim.3  1.205076e-01      1.004230241                    99.29549
Dim.4  4.233298e-02      0.352774838                    99.64827
Dim.5  2.292280e-02      0.191023370                    99.83929
Dim.6  8.684234e-03      0.072368614                    99.91166
Dim.7  4.178064e-03      0.034817200                    99.94648
Dim.8  2.930325e-03      0.024419371                    99.97090
Dim.9  1.475750e-03      0.012297915                    99.98320
Dim.10 8.529732e-04      0.007108110                    99.99030
Dim.11 7.862929e-04      0.006552441                    99.99686
Dim.12 3.772122e-04      0.003143435                   100.00000

This shows that the first two principal components together explain more than 98% of the variance in the data.

Analysis of the variables

We can use the factoextra::fviz_pca_var() function to visualize the coordinates of the variables in the proposed reduced space consisting of the first two principal directions:

factoextra::fviz_pca_var(pca_results)

We see from the above correlation circle that:

  • the first principal component produces a roughly uniformly weighted average of the monthly temperatures; a high score on this component therefore means a high average temperature over the year;
  • the second principal component contrasts the warmest (summer) months with the coldest (winter) months, providing an estimate of the range between maximal and minimal temperature; a high score on this component means a large difference between maximal and minimal temperature over the year (we verify this numerically below).
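
We can check this interpretation numerically, since the data set already records the annual average temperature and the thermal amplitude. A quick sanity check (a sketch using the individual coordinates returned by factoextra; ind_coords is a hypothetical name):

# PC1 should track the annual average temperature and PC2 the thermal
# amplitude recorded in the original data set.
ind_coords <- factoextra::get_pca_ind(pca_results)$coord
cor(ind_coords[, 1], temperature$Annual)
cor(ind_coords[, 2], temperature$Amplitude)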

We can assess whether each original variable is well represented by the first two principal components by looking at the cos2 values:

cos2 <- factoextra::get_pca_var(pca_results)$cos2[, 1:2]
corrplot::corrplot(cos2, is.corr = FALSE)

The original variables are indeed well represented; in fact, they would all be well represented by the first component alone.
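
To back this claim with numbers, we can inspect the cumulative cos2 of each variable over the first two components (a quick sketch reusing the cos2 matrix computed above):

# Each row sums the cos2 of one month over PC1 and PC2; values close to 1
# mean the variable is almost perfectly represented in the first plane.
summary(rowSums(cos2))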

We can confirm the interpretation of the first two principal components derived from the correlation circle by looking at the contribution of each original variable to them:

contrib <- factoextra::get_pca_var(pca_results)$contrib[, 1:2]
corrplot::corrplot(contrib, is.corr = FALSE)

Analysis of the city temperature profiles

We can use the factoextra::fviz_pca_ind() function to visualize the cities in the plane defined by the first two principal components:

factoextra::fviz_pca_ind(pca_results)

The plane is naturally divided into four quadrants:

  • The top-left quadrant groups cities with negative scores on the first component but positive scores on the second, reflecting a rather low average temperature but a relatively high thermal amplitude.
  • The top-right quadrant groups cities with positive scores on both components, reflecting a rather high average temperature and a relatively high thermal amplitude.
  • The bottom-left quadrant groups cities with negative scores on both components, reflecting a rather low average temperature and a relatively low thermal amplitude.
  • The bottom-right quadrant groups cities with positive scores on the first component but negative scores on the second, reflecting a rather high average temperature but a relatively low thermal amplitude.

We can also add colors to the plot to see if the area of the cities is correlated with the temperature profiles:

factoextra::fviz_pca_ind(
  pca_results,
  col.ind = temperature$Area, # color by groups
  palette = viridis::viridis(4),
  legend.title = "By Area"
)

This is interesting, as it reflects that:

  • western capitals have a medium average temperature and a medium thermal amplitude;
  • northern capitals are grouped in the bottom-left quadrant;
  • eastern capitals are grouped in the top-left quadrant;
  • southern capitals are grouped on the right side: they are globally warm, but some of them feature a high thermal amplitude, like Milan, while others feature a more stable temperature over the year, like Lisbon.

Interpretation of units

The fact that some cities have a negative score on the second principal component does not mean that their winter temperatures are higher than their summer temperatures.

Recall that the temperatures have been standardized across cities. Hence, if you want to interpret the scores in terms of actual temperatures, you should compute them by hand with something like:

orig_temp <- as.matrix(temperature[, 2:13])
loadings <- factoextra::get_pca_var(pca_results)$coord[, 1:2]
scores <- orig_temp %*% loadings
scores_tbl <- tibble::tibble(
  City = temperature$city, 
  PC1 = scores[, 1], 
  PC2 = scores[, 2]
)
gt::gt(scores_tbl)
City PC1 PC2
Amsterdam 109.75002 20.07679
Athens 198.59930 25.31486
Berlin 100.57744 26.73047
Brussels 114.61657 20.60132
Budapest 121.84567 33.06963
Copenhagen 87.01508 25.23984
Dublin 103.31766 14.35165
Elsinki 52.66654 31.43762
Kiev 78.60049 36.91400
Krakow 86.57091 31.42295
Lisbon 177.96301 15.27917
London 108.28743 19.03678
Madrid 154.79387 26.79321
Minsk 60.59012 35.15569
Moscow 56.10806 38.98164
Oslo 62.27592 30.17793
Paris 123.96285 21.37080
Prague 101.58740 29.64837
Reykjavik 51.03193 16.05866
Rome 171.82598 23.94100
Sarajevo 105.54795 27.96790
Sofia 107.82834 29.99347
Stockholm 64.97559 28.84649
Antwerp 113.97189 20.71773
Barcelona 180.85690 20.44746
Bordeaux 141.80864 20.95166
Edinburgh 92.78014 16.47195
Frankfurt 108.90323 26.73563
Geneva 108.15154 26.93270
Genoa 178.52320 21.81177
Milan 141.14609 31.35509
Palermo 196.51068 19.84736
Seville 203.30276 22.04190
St. Petersburg 50.62120 36.91273
Zurich 96.74811 26.57796

The chicken.csv data set

It gathers data on 43 chickens that each underwent one of six different diets:

  • normal diet (N),
  • fast for 16h (F16),
  • fast for 16h and back to normal diet for 5h (F16R5),
  • fast for 16h and back to normal diet for 16h (F16R16),
  • fast for 48h (F48),
  • fast for 48h and back to normal diet for 24h (F48R24).

After each process, the chickens underwent gene expression sequencing, and 7406 gene expression levels were collected.

  • Can we conclude that genes are expressed differently according to the stress level?
  • How much time do chickens need to return to normal gene expression?

Data import & Visualization

A quick glance at the content of the CSV file shows that the delimiter is the semicolon. We can therefore use readr::read_delim() to import the data:

df <- readr::read_delim(
  file = "12-PCA-Data/chicken.csv", 
  delim = ";", 
  show_col_types = FALSE
)
df
# A tibble: 7,406 × 44
   gene         N_1     N_2     N_3     N_4      N_6     N_7     j16_3   j16_4
   <chr>      <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>     <dbl>   <dbl>
 1 A4GALT  -0.153   -0.0126 -0.140   0.208  -0.461   -0.217  -0.130    -0.295 
 2 A4GNT    0.00873 -0.0130  0.0126  0.0713  0.515    0.158   0.600     0.171 
 3 AACS     0.0426   0.173  -0.0219 -0.0788 -0.00269 -0.0894 -0.0548    0.0145
 4 AADACL1  0.161    0.205   0.185   0.0576  0.307    0.309   0.000618 -0.0414
 5 AADACL2 -0.374    0.0320 -0.411   0.0361 -0.490   -0.145  -0.112    -0.151 
 6 AADACL3 -0.445    0.0638  0.444   0.106  -0.455   -0.0628 -0.0906    0.104 
 7 AAK1    -0.216   -0.157  -0.306  -0.238  -0.218   -0.114  -0.0109   -0.226 
 8 AAMP     1.23     0.650   0.350   1.25    0.138    0.690  -0.903    -0.594 
 9 AARS     0.425    0.0293  0.195  -0.0334 -0.189   -0.297   0.0937    0.0522
10 AARS2   -0.118   -0.189  -0.486   0.414  -0.404   -0.182  -0.598    -0.508 
# ℹ 7,396 more rows
# ℹ 35 more variables: j16_5 <dbl>, j16_6 <dbl>, j16_7 <dbl>, j16r5_1 <dbl>,
#   j16r5_2 <dbl>, j16r5_3 <dbl>, j16r5_4 <dbl>, j16r5_5 <dbl>, j16r5_6 <dbl>,
#   j16r5_7 <dbl>, j16r5_8 <dbl>, j16r16_1 <dbl>, j16r16_2 <dbl>,
#   j16r16_3 <dbl>, j16r16_4 <dbl>, j16r16_5 <dbl>, j16r16_6 <dbl>,
#   j16r16_7 <dbl>, j16r16_8 <dbl>, j16r16_9 <dbl>, j48_1 <dbl>, j48_2 <dbl>,
#   j48_3 <dbl>, j48_4 <dbl>, j48_6 <dbl>, j48_7 <dbl>, j48r24_1 <dbl>, …

The analysis we want to carry out puts the focus on the chickens rather than on the genes. However, the raw data records genes in rows and chickens in columns, grouped by diet. We therefore proceed with some data manipulation to obtain a data set with 43 rows, one per chicken. First, we use tidyr::pivot_longer() so that the names of all variables except gene become the values of a single column called name, with the corresponding measurements in a single column called value:

df <- tidyr::pivot_longer(df, cols = -gene)
df
# A tibble: 318,458 × 3
   gene   name    value
   <chr>  <chr>   <dbl>
 1 A4GALT N_1   -0.153 
 2 A4GALT N_2   -0.0126
 3 A4GALT N_3   -0.140 
 4 A4GALT N_4    0.208 
 5 A4GALT N_6   -0.461 
 6 A4GALT N_7   -0.217 
 7 A4GALT j16_3 -0.130 
 8 A4GALT j16_4 -0.295 
 9 A4GALT j16_5 -0.130 
10 A4GALT j16_6 -0.220 
# ℹ 318,448 more rows

Now, we can see that the name column encodes two different pieces of information:

  • the type of diet given to the chicken; and,
  • the chicken ID within its group.

We therefore use tidyr::separate() to get each piece of information into its own column:

df <- tidyr::separate(df, col = name, into = c("diet", "id"), sep = "_")
df
# A tibble: 318,458 × 4
   gene   diet  id      value
   <chr>  <chr> <chr>   <dbl>
 1 A4GALT N     1     -0.153 
 2 A4GALT N     2     -0.0126
 3 A4GALT N     3     -0.140 
 4 A4GALT N     4      0.208 
 5 A4GALT N     6     -0.461 
 6 A4GALT N     7     -0.217 
 7 A4GALT j16   3     -0.130 
 8 A4GALT j16   4     -0.295 
 9 A4GALT j16   5     -0.130 
10 A4GALT j16   6     -0.220 
# ℹ 318,448 more rows

Finally, we perform a last reshaping of the data set to get each gene in a dedicated column via tidyr::pivot_wider():

chk <- tidyr::pivot_wider(df, names_from = gene, values_from = value)
chk
# A tibble: 43 × 7,408
   diet  id     A4GALT    A4GNT     AACS  AADACL1 AADACL2 AADACL3    AAK1   AAMP
   <chr> <chr>   <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 N     1     -0.153   0.00873  0.0426   1.61e-1 -0.374  -0.445  -0.216   1.23 
 2 N     2     -0.0126 -0.0130   0.173    2.05e-1  0.0320  0.0638 -0.157   0.650
 3 N     3     -0.140   0.0126  -0.0219   1.85e-1 -0.411   0.444  -0.306   0.350
 4 N     4      0.208   0.0713  -0.0788   5.76e-2  0.0361  0.106  -0.238   1.25 
 5 N     6     -0.461   0.515   -0.00269  3.07e-1 -0.490  -0.455  -0.218   0.138
 6 N     7     -0.217   0.158   -0.0894   3.09e-1 -0.145  -0.0628 -0.114   0.690
 7 j16   3     -0.130   0.600   -0.0548   6.18e-4 -0.112  -0.0906 -0.0109 -0.903
 8 j16   4     -0.295   0.171    0.0145  -4.14e-2 -0.151   0.104  -0.226  -0.594
 9 j16   5     -0.130   0.723   -0.120    1.27e-1 -0.187  -1.23   -0.151  -0.733
10 j16   6     -0.220   0.743    0.200    6.10e-2 -0.265  -0.157  -0.141  -0.683
# ℹ 33 more rows
# ℹ 7,398 more variables: AARS <dbl>, AARS2 <dbl>, AARSD1 <dbl>, AASS <dbl>,
#   AATF <dbl>, ABAT <dbl>, ABAT1 <dbl>, ABCA1 <dbl>, ABCA3 <dbl>,
#   ABCB10 <dbl>, ABCB7 <dbl>, ABCC4 <dbl>, ABCC5 <dbl>, ABCD3 <dbl>,
#   ABCD4 <dbl>, ABCE1 <dbl>, ABCF2 <dbl>, ABCG1 <dbl>, ABCG2 <dbl>,
#   ABCG4 <dbl>, ABHD11 <dbl>, ABHD12 <dbl>, ABHD2 <dbl>, ABHD3 <dbl>,
#   ABHD5 <dbl>, ABI1 <dbl>, ABI2 <dbl>, ABL1 <dbl>, ABL11 <dbl>, ABL2 <dbl>, …

It is difficult to provide insightful visualizations in the original variable space, which has dimension 7406. We therefore begin by performing PCA, and then provide (hopefully insightful) visualizations in the reduced space.

PCA

First, we select the numerical variables on which we want to perform the PCA and we standardize them:

X <- chk |> 
  dplyr::select(-diet, -id) |> 
  scale()

Now, we can perform the PCA:

pca_res <- FactoMineR::PCA(X, graph = FALSE)

Choice of the reduced dimension

factoextra::fviz_screeplot(
  pca_res, 
  addlabels = TRUE, 
  ncp = 42L
)

The elbow method applied to the scree plot suggests keeping either 1, 6, 8, or 16 components. We can look at the cumulative percentage of variance explained:

pca_res |> 
  factoextra::get_eigenvalue() |> 
  janitor::clean_names() |> 
  tibble::as_tibble() |> 
  gt::gt() |> 
  gt::cols_label(
    eigenvalue = "Eigenvalue",
    variance_percent = "Percentage of variance explained",
    cumulative_variance_percent = "Cumulative percentage of variance explained"
  )
Eigenvalue Percentage of variance explained Cumulative percentage of variance explained
1453.57240 19.6269565 19.62696
692.78785 9.3544133 28.98137
536.20796 7.2401831 36.22155
434.45335 5.8662348 42.08779
374.62157 5.0583523 47.14614
324.08250 4.3759451 51.52209
273.79824 3.6969787 55.21906
263.23096 3.5542933 58.77336
238.82728 3.2247809 61.99814
212.22641 2.8656010 64.86374
205.86862 2.7797545 67.64349
186.68809 2.5207682 70.16426
161.71254 2.1835342 72.34780
146.67462 1.9804837 74.32828
134.41024 1.8148831 76.14316
128.77791 1.7388321 77.88199
109.49173 1.4784192 79.36041
104.11595 1.4058324 80.76625
98.79454 1.3339798 82.10023
91.34045 1.2333304 83.33356
83.86158 1.1323465 84.46590
80.15492 1.0822971 85.54820
79.43852 1.0726238 86.62082
74.46548 1.0054751 87.62630
70.99072 0.9585568 88.58486
66.21149 0.8940250 89.47888
62.68820 0.8464515 90.32533
60.07860 0.8112153 91.13655
58.69024 0.7924688 91.92902
56.15042 0.7581747 92.68719
54.70237 0.7386224 93.42581
52.81262 0.7131059 94.13892
50.97364 0.6882749 94.82719
48.64270 0.6568012 95.48400
48.15634 0.6502342 96.13423
46.40970 0.6266500 96.76088
44.81239 0.6050822 97.36596
42.90186 0.5792851 97.94525
40.79995 0.5509040 98.49615
39.33001 0.5310560 99.02721
38.25896 0.5165942 99.54380
33.78608 0.4561987 100.00000

In this case, we see that keeping 6 components retains only about 52% of the inertia, keeping 8 components brings us to almost 59%, and going for 16 components takes us above 77%.
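
A small helper makes this kind of reading automatic; the sketch below finds the smallest number of components whose cumulative inertia reaches a target threshold, here 80% (eig is a hypothetical name). On this data set, it returns 18.

# Smallest number of components whose cumulative inertia reaches 80%.
eig <- factoextra::get_eigenvalue(pca_res)
which(eig$cumulative.variance.percent >= 80)[1]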

If we use the Kaiser-style rule of keeping all components whose eigenvalue is above 1, then we would keep all 42 components.
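
This count can be checked directly; all 42 eigenvalues in the table above exceed 1, so the one-liner below returns 42:

# Count the eigenvalues above 1 (the Kaiser-style cut-off).
sum(factoextra::get_eigenvalue(pca_res)$eigenvalue > 1)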

Finally, we can estimate the optimal number of components to keep by minimizing a generalized cross-validation (GCV) score, obtained as the mean error when reconstructing each chicken's gene expression profile from a PCA computed on all the other chickens. The function FactoMineR::estim_ncp() does exactly that for us:

out <- FactoMineR::estim_ncp(X, ncp.min = 1, ncp.max = 42)
tibble::tibble(gcv = out$criterion, ncp = seq_along(gcv)) |> 
  ggplot(aes(x = ncp, y = gcv)) + 
  geom_point() +
  geom_line()

out$ncp
[1] 15
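
Should we want the PCA outputs (coordinates, cos2, contributions) stored for that many dimensions, we could re-run the decomposition with the ncp argument of FactoMineR::PCA(), which defaults to 5 (pca_res_15 is a hypothetical name):

# Keep 15 dimensions in the PCA outputs, as suggested by estim_ncp().
pca_res_15 <- FactoMineR::PCA(X, ncp = 15, graph = FALSE)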

Analysis of the variables

The analysis of variables by means of visualizations or tables becomes tricky as the number of variables grows. Typically here, drawing the correlation circle with 7406 arrows would not be of any help. That is why factoextra::fviz_pca_var() comes with an optional argument, select.var, which instructs the function to display only the variables satisfying some constraints. For example, in the code below, we ask it to retain only the variables whose cumulative cos2 value in the first principal plane is above 0.8, i.e., only those variables that are well represented in that plane:

factoextra::fviz_pca_var(pca_res, select.var = list(cos2 = 0.8))
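
Other filters are available: select.var also accepts, for instance, a contrib element to keep only the top contributing variables. A sketch, keeping the 20 largest contributors:

# Show only the 20 variables contributing most to the displayed components.
factoextra::fviz_pca_var(pca_res, select.var = list(contrib = 20))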

Analysis of the individuals

We can project chickens onto the first principal plane and color points by diet:

factoextra::fviz_pca_ind(pca_res, col.ind = chk$diet)

This is interesting, as it shows that, in this plane, only the chickens that fasted for 48h did not return to normal gene expression, including those that had been back on a normal diet for 24h after the fast.
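
To make the grouping easier to judge visually, we could also draw concentration ellipses around each diet with the addEllipses argument of factoextra::fviz_pca_ind() (a sketch):

# Concentration ellipses summarize the spread of each diet group.
factoextra::fviz_pca_ind(
  pca_res,
  col.ind = chk$diet,
  addEllipses = TRUE,
  legend.title = "Diet"
)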

The limit of linear projections

One has to be careful with the interpretation in such cases. Indeed, we could be tempted to conclude that the chickens on all the other diets went back to normal gene expression. But we are in fact only looking at the chickens in a reduced space that accounts for less than 30% of the total inertia of the original data! This is what urged researchers, typically in the field of genetics, to resort to non-linear dimension reduction methods such as t-SNE or UMAP, which achieve very low-dimensional reduced spaces while retaining most of the structure of the original data. For instance, the code below illustrates the projection of the data onto a two-dimensional reduced space via UMAP:

umap_res <- chk |> 
  dplyr::select(-diet, -id) |> 
  scale() |> 
  umap::umap()
tibble::tibble(
  X = umap_res$layout[, 1], 
  Y = umap_res$layout[, 2], 
  Diet = chk$diet
) |> 
  ggplot(aes(x = X, y = Y, color = Diet)) + 
  geom_point() + 
  theme_bw()
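
Note that UMAP is stochastic, so the layout changes from run to run. The umap package exposes a random_state parameter to fix the seed if a reproducible picture is needed (a sketch, with 1234 an arbitrary choice):

# Fix the seed of the embedding for reproducibility.
umap_res <- chk |> 
  dplyr::select(-diet, -id) |> 
  scale() |> 
  umap::umap(random_state = 1234L)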

This clearly illustrates that claiming, on the basis of the PCA results alone, that the chickens that fasted for less than 48h went back to normal gene expression would have been completely misleading.

The orange.csv data set

Six orange juices from different manufacturers were evaluated. Are all the variables indispensable? Do some juices stand out as particularly good? Particularly bad?

orange <- readr::read_delim(
  "12-PCA-Data/orange.csv", 
  delim = ";", 
  escape_double = FALSE, 
  trim_ws = TRUE
)
Rows: 6 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (3): Product, Way of preserving, Origin
dbl (14): Odour intensity, Odour typicality, Pulpiness, Intensity of taste, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
orange
# A tibble: 6 × 17
  Product    `Odour intensity` `Odour typicality` Pulpiness `Intensity of taste`
  <chr>                  <dbl>              <dbl>     <dbl>                <dbl>
1 Pampryl a…              2.82               2.53      1.66                 3.46
2 Tropicana…              2.76               2.82      1.91                 3.23
3 Fruvita f…              2.83               2.88      4                    3.45
4 Joker amb.              2.76               2.59      1.66                 3.37
5 Tropicana…              3.2                3.02      3.69                 3.12
6 Pampryl f…              3.07               2.73      3.34                 3.54
# ℹ 12 more variables: Acidity <dbl>, Bitterness <dbl>, Sweetness <dbl>,
#   Glucose <dbl>, Fructose <dbl>, Saccharose <dbl>, `Sweetening power` <dbl>,
#   pH <dbl>, `Citric acid` <dbl>, `Vitamin C` <dbl>,
#   `Way of preserving` <chr>, Origin <chr>
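
The solution would then proceed as before. A minimal sketch of the next step, assuming we standardize the numeric sensory and chemical variables and label the rows by product (orange_pca is a hypothetical name):

# Standardized PCA on the numeric variables, with products as row labels.
X <- orange |> 
  dplyr::select(where(is.numeric)) |> 
  scale()
rownames(X) <- orange$Product
orange_pca <- FactoMineR::PCA(X, graph = FALSE)
factoextra::fviz_pca_biplot(orange_pca)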