Principal Component Analysis

Aymeric Stamm

2025-01-16

Overview

Dimensionality reduction

If you’ve worked with a lot of variables before, you know this can present problems. Do you understand the relationships between each variable? Do you have so many variables that you are in danger of overfitting your model to your data or that you might be violating assumptions of whichever modeling tactic you’re using?

You might ask the question, “How do I take all of the variables I’ve collected and focus on only a few of them?” In technical terms, you want to “reduce the dimension of your feature space.” By reducing the dimension of your feature space, you have fewer relationships between variables to consider and you are less likely to overfit your model.

Think about linear regression:

PCA

What?

Principal component analysis is a technique for feature extraction: it combines our input variables in a specific way, so that we can drop the “least important” variables while still retaining the most valuable parts of all of them! As an added benefit, the “new” variables produced by PCA are all uncorrelated with one another.

How?

The dimension reduction is achieved by identifying the principal directions, called principal components, in which the data varies.

PCA assumes that the directions with the largest variances are the most “important” (i.e., the most principal).

When to use?

  1. Do you want to reduce the number of variables, but aren’t able to identify variables to completely remove from consideration?
  2. Do you want to ensure your variables are uncorrelated with one another?
  3. Are you comfortable making your independent variables less interpretable?

Principle

Mathematical formalism

Context.

Let X1, …, Xn be a sample of n i.i.d. random vectors in ℝᵖ. Let 𝕏 be the n × p data matrix whose i-th row contains the observation Xi and whose j-th column is the vector xj containing the n observations on the j-th variable. Let Σ be the covariance matrix of the Xi’s.

Determination of the 1st PC

The 1st principal component (PC) accounts for the largest possible variance in the data set.
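In the notation of the mathematical formalism above, this is a variance-maximization problem (the standard characterization, stated here as a sketch):

$$ w_1 = \underset{\lVert w \rVert = 1}{\arg\max} \; \operatorname{Var}(w^\top X) = \underset{\lVert w \rVert = 1}{\arg\max} \; w^\top \Sigma w $$

The maximizer $w_1$ is a unit eigenvector of $\Sigma$ associated with its largest eigenvalue $\lambda_1$, and the variance captured by the 1st PC is exactly $\lambda_1$.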

R Packages

Principal Component Computation

Recommendation

The FactoMineR::PCA() function.

Information Extraction and Visualization

The factoextra package.

Case Study

Objectives

PCA allows us to describe a data set, to summarize it, and to reduce its dimensionality. We want to perform a PCA on all the individuals of the data set to answer several questions:

Individuals’ study (athletes’ study)

Two athletes will be close to each other if their results in the events are close. We want to see the variability between the individuals.

Variables’ study (performances’ study)

We want to see if there are linear relationships between variables. The two objectives are to summarize the correlation matrix and to look for synthetic variables: can we summarize the performance of an athlete with a small number of variables?

Can we characterize groups of individuals by variables?

PCA Terminology

Active individuals: Individuals used for PCA.

Supplementary individuals: Coordinates of these individuals will be predicted using the PCA information and parameters obtained using active individuals.

Active variables: Variables used for PCA.

Supplementary variables: The coordinates of these variables will also be predicted. They can be:

Data standardization

Description of the problem

We need to scale the variables prior to performing PCA!

A common solution: Standardization

$$ x_{ij} \leftarrow \frac{x_{ij} - \overline{x}_j}{\sqrt{\frac{1}{n-1} \sum_{\ell=1}^n (x_{\ell j} - \overline{x}_j)^2}} $$

The PCA() function in FactoMineR standardizes the data automatically.
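As a sanity check, the formula above can be reproduced in base R; this is a minimal sketch (the helper name `standardize` is ours), using `sd()`, which matches the 1/(n − 1) denominator in the formula:

```r
# Standardize each column: subtract its mean, divide by its sample
# standard deviation (sd() uses the 1/(n - 1) denominator).
standardize <- function(X) {
  means <- colMeans(X)
  sds   <- apply(X, 2, sd)
  sweep(sweep(X, 2, means, "-"), 2, sds, "/")
}

set.seed(1)
X <- matrix(rnorm(20, mean = 5, sd = 2), nrow = 10)
Z <- standardize(X)
round(colMeans(Z), 10)   # each column now has mean 0
apply(Z, 2, sd)          # ... and standard deviation 1
```

The same result can be obtained with `scale(X)`.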

PCA() Syntax

PCA(
  # a data frame with n rows (individuals) and p columns (numeric variables)
  X = , 
  
  # number of dimensions kept in the results (by default 5)
  ncp = , 
  
  # a vector indicating the indexes of the supplementary individuals
  ind.sup = , 
  
  # a vector indicating the indexes of the quantitative supplementary variables
  quanti.sup = ,
  
  # a vector indicating the indexes of the categorical supplementary variables
  quali.sup = , 
  
  # boolean, if TRUE (default) a graph is displayed
  graph = 
)

Running PCA()

PCA(X = decathlon[, 1:10], graph = FALSE)
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 41 individuals, described by 10 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"          

Visualization and Interpretation

The factoextra package

Useful extraction and visualization functions

Choice of the reduced dimension

Eigenvalues / Variances

Eigenvalues measure the amount of variation retained by each principal component.

eig_val <- get_eigenvalue(res_pca)
html_table(eig_val)
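For intuition, the columns returned by get_eigenvalue() can be recomputed from the correlation matrix in base R; a sketch on simulated data (variable names are illustrative):

```r
# Eigenvalues of the correlation matrix are the variances of the
# standardized principal components.
set.seed(2)
X   <- matrix(rnorm(200), nrow = 40, ncol = 5)
ev  <- eigen(cor(X))$values   # "eigenvalue" column
pct <- 100 * ev / sum(ev)     # "variance.percent" column
cum <- cumsum(pct)            # "cumulative.variance.percent" column
sum(ev)                       # equals p (here 5): total variance of scaled data
```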

Choosing the appropriate number of PCs

Eigenvalues can be used to determine the number of principal components to retain after PCA.

Guidelines.

Graphical tool.

fviz_screeplot(res_pca, addlabels = TRUE)

Analysis of the variables

Variable Information in PCA

Variable information extraction

var_info <- get_pca_var(res_pca)

Components

Usage

$coord: Correlation variables & components

Property: the column vectors (eigenvectors) have norm 1.
Consequence: each coordinate lies in [-1, 1].

Contribution of PCs to variables

fviz_pca_var(): Correlation circle

fviz_pca_var(res_pca, col.var = "black")

$cos2: Quality of representation

Strong correlations (positive or negative) between the PCs and a given variable imply that the PCs represent it well.
Squared correlations thus measure the quality of representation.
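This relationship can be checked in base R; a sketch on simulated data, with stats::prcomp standing in for FactoMineR:

```r
# For a standardized PCA, a variable's cos2 on a component is its squared
# correlation with that component; summing over all components gives 1.
set.seed(3)
X  <- matrix(rnorm(120), nrow = 24, ncol = 5)
pc <- prcomp(X, scale. = TRUE)
cors <- cor(X, pc$x)   # correlations variables x components (cf. $var$cor)
cos2 <- cors^2         # quality of representation (cf. $var$cos2)
rowSums(cos2)          # each row sums to 1
```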

$cos2: Graphical tool 1

corrplot::corrplot(var_info$cos2, is.corr = FALSE)

$cos2: Graphical tool 2

# Total cos2 of variables on Dim.1 and Dim.2
factoextra::fviz_cos2(res_pca, choice = "var", axes = 1:2)

$cos2: Summary

factoextra::fviz_pca_var(X = res_pca, 
                         col.var = "cos2",
                         gradient.cols = viridis::viridis(3),
                         repel = TRUE) # Avoid text overlapping

Contribution of variables to PCs

$contrib: Important variables
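The contribution of a variable to a PC is its share of that component's total cos2, expressed in percent; a base-R sketch on simulated data (prcomp again stands in for FactoMineR):

```r
# contrib[j, k] = 100 * cos2[j, k] / sum_j cos2[j, k]; columns sum to 100.
set.seed(4)
X  <- matrix(rnorm(120), nrow = 24, ncol = 5)
pc <- prcomp(X, scale. = TRUE)
cos2    <- cor(X, pc$x)^2
contrib <- 100 * sweep(cos2, 2, colSums(cos2), "/")   # cf. $var$contrib
colSums(contrib)   # each column sums to 100
```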

$contrib: Graphical tool 1

corrplot::corrplot(var_info$contrib, is.corr = FALSE)

$contrib: Graphical tool 2

# Contributions of variables to PC1
factoextra::fviz_contrib(
  X = res_pca, 
  choice = "var", 
  axes = 1, 
  top = 10
)

# Contributions of variables to PC2
factoextra::fviz_contrib(
  X = res_pca, 
  choice = "var", 
  axes = 2, 
  top = 10
)

$contrib: Summary

factoextra::fviz_pca_var(
  X = res_pca, 
  col.var = "contrib",
  gradient.cols = viridis::viridis(3),
  repel = TRUE
)

Component description

res_pca_var_desc <- FactoMineR::dimdesc(res = res_pca, axes = c(1, 2), proba = 0.05)

Description of PC1

Description of PC2

Analysis of the individuals

The reduced space

Individual information extraction

ind_info <- get_pca_ind(res_pca)

Components

Usage

First factorial plane & quality

factoextra::fviz_pca_ind(
  X = res_pca, 
  col.ind = "cos2", 
  gradient.cols = viridis::viridis(3),
  repel = TRUE # Avoid text overlapping (slow if many points)
)

First factorial plane & contribution

# Total contribution on PC1 and PC2
factoextra::fviz_contrib(res_pca, choice = "ind", axes = 1:2)

First factorial plane colored by groups

factoextra::fviz_pca_ind(
  X = res_pca,
  col.ind = decathlon$competition, # color by groups
  palette = viridis::viridis(3),
  addEllipses = TRUE, # Concentration ellipses
  legend.title = "Competition"
)

Combined variable and individual analysis

Biplot

factoextra::fviz_pca_biplot(
  X = res_pca, 
  repel = TRUE,
  col.var = "#2E9FDF", # Variables color
  col.ind = "#696969"  # Individuals color
)

Interpretation

At this point, we can divide the first factorial plane into four parts: fast and strong athletes (like Sebrle), slow but strong athletes (like Casarsa), fast but weak athletes (like Warners) and slow and weak athletes (like Lorenzo).

Your turn

References