Aymeric Stamm
2025-01-16
If you’ve worked with a lot of variables before, you know this can present problems. Do you understand the relationships between each variable? Do you have so many variables that you are in danger of overfitting your model to your data or that you might be violating assumptions of whichever modeling tactic you’re using?
You might ask the question, “How do I take all of the variables I’ve collected and focus on only a few of them?” In technical terms, you want to “reduce the dimension of your feature space.” By reducing the dimension of your feature space, you have fewer relationships between variables to consider and you are less likely to overfit your model.
Think about linear regression: every variable you add is another coefficient to estimate from the same number of observations, so the more variables you include, the greater the risk of overfitting.
What?
Principal component analysis is a technique for feature extraction: it combines our input variables in a specific way so that we can drop the “least important” variables while still retaining the most valuable parts of all of them! As an added benefit, the “new” variables produced by PCA are all uncorrelated with one another.
How?
The dimension reduction is achieved by identifying the principal directions, called principal components, in which the data varies.
PCA assumes that the directions with the largest variances are the most “important” (i.e., the most principal).
When to use?
Context.
Let $X_1, \dots, X_n$ be a sample of $n$ i.i.d. random vectors in $\mathbb{R}^p$. Let $\mathbb{X} = (X_1, \dots, X_n)^\top$ be the $n \times p$ data matrix, whose $j$-th column is the vector $x_j$ containing the $n$ observations of the $j$-th variable. Let $\Sigma$ be the covariance matrix of the $X_i$'s.
Determination of the 1st PC
The 1st principal component (PC) accounts for the largest possible variance in the data set.
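In formulas, using the notation above, the first PC direction can be characterized as the unit vector maximizing the variance of the projected data:

$$
u_1 = \underset{u \in \mathbb{R}^p,\ \|u\| = 1}{\arg\max} \ \mathrm{Var}\left(u^\top X_i\right) = \underset{\|u\| = 1}{\arg\max} \ u^\top \Sigma u,
$$

i.e. $u_1$ is the eigenvector of $\Sigma$ associated with its largest eigenvalue $\lambda_1$, and the variance of the first PC equals $\lambda_1$.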
Several implementations of PCA are available in R:

- `prcomp()` and `princomp()` from the built-in R stats package,
- `PCA()` from the FactoMineR package,
- `dudi.pca()` from the ade4 package.

Here we will use the `FactoMineR::PCA()` function together with the factoextra package.
PCA allows one to describe a data set, to summarize it, and to reduce its dimensionality. We want to perform a PCA on all the individuals of the data set to answer several questions:

- Two athletes will be close to each other if their results in the events are close. We want to see the variability between the individuals.
- We want to see whether there are linear relationships between variables. The two objectives are to summarize the correlation matrix and to look for synthetic variables: can we summarize the performance of an athlete with a small number of variables?
- Can we characterize groups of individuals by variables?
- Active individuals: individuals used for the PCA.
- Supplementary individuals: coordinates of these individuals are predicted using the PCA information and parameters obtained from the active individuals.
- Active variables: variables used for the PCA.
- Supplementary variables: coordinates of these variables are predicted as well. These can be quantitative (continuous) or qualitative (categorical).
PCA linearly combines the original variables so as to maximize variance.
If one variable is measured in meters and another in centimeters, the first one will contribute more to the variance than the second, even if the intrinsic variability of each variable is the same.
⇒ We need to scale the variables prior to performing PCA!
$$ x_{ij} \leftarrow \frac{x_{ij} - \overline{x}_j}{\sqrt{\frac{1}{n-1} \sum_{\ell=1}^n (x_{\ell j} - \overline{x}_j)^2}} $$
⇒ The `PCA()` function in FactoMineR standardizes the data automatically (argument `scale.unit = TRUE` by default).
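For illustration, here is a minimal base-R sketch of this standardization, equivalent to `scale(X)` with its default settings:

```r
# Center each column at its mean and divide by its sample standard
# deviation (denominator n - 1, matching the formula above)
standardize <- function(X) {
  X <- as.matrix(X)
  centered <- sweep(X, 2, colMeans(X), "-")
  sweep(centered, 2, apply(X, 2, sd), "/")  # sd() uses the n - 1 denominator
}

# Base R equivalent:
# scale(X, center = TRUE, scale = TRUE)
```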
`PCA()` syntax:

```r
PCA(
  # a data frame with n rows (individuals) and p columns (numeric variables)
  X = ,
  # number of dimensions kept in the results (by default 5)
  ncp = ,
  # a vector indicating the indexes of the supplementary individuals
  ind.sup = ,
  # a vector indicating the indexes of the quantitative supplementary variables
  quanti.sup = ,
  # a vector indicating the indexes of the categorical supplementary variables
  quali.sup = ,
  # boolean, if TRUE (default) a graph is displayed
  graph =
)
```
`PCA()` output.
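The output below was presumably produced by a call like the following, on the decathlon data set shipped with FactoMineR (keeping the 10 event results as active variables, which matches the 41 individuals and 10 variables reported):

```r
library(FactoMineR)

data(decathlon)  # 41 athletes, 13 columns: 10 events + Rank, Points, Competition
res.pca <- PCA(decathlon[, 1:10], ncp = 5, graph = FALSE)
print(res.pca)
```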
```
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 41 individuals, described by 10 variables
*The results are available in the following objects:

   name                description
1  "$eig"              "eigenvalues"
2  "$var"              "results for the variables"
3  "$var$coord"        "coord. for the variables"
4  "$var$cor"          "correlations variables - dimensions"
5  "$var$cos2"         "cos2 for the variables"
6  "$var$contrib"      "contributions of the variables"
7  "$ind"              "results for the individuals"
8  "$ind$coord"        "coord. for the individuals"
9  "$ind$cos2"         "cos2 for the individuals"
10 "$ind$contrib"      "contributions of the individuals"
11 "$call"             "summary statistics"
12 "$call$centre"      "mean of the variables"
13 "$call$ecart.type"  "standard error of the variables"
14 "$call$row.w"       "weights for the individuals"
15 "$call$col.w"       "weights for the variables"
```
No matter which function you decide to use (`stats::prcomp()`, `FactoMineR::PCA()`, `ade4::dudi.pca()`, `ExPosition::epPCA()`), you can easily extract and visualize the results of the PCA using the R functions provided in the factoextra package:

- `get_eigenvalue()`: extract the eigenvalues/variances of the principal components.
- `fviz_screeplot()`: visualize the eigenvalues / proportion of explained variance.
- `get_pca_ind()`, `get_pca_var()`: extract the results for individuals and variables, respectively.
- `fviz_pca_ind()`, `fviz_pca_var()`: visualize the results for individuals and variables, respectively.
- `fviz_pca_biplot()`: make a biplot of individuals and variables.

Eigenvalues measure the amount of variation retained by each principal component.
Eigenvalues can be used to determine the number of principal components to retain after PCA.
Guidelines.

- An eigenvalue > 1 indicates that the PC accounts for more variance than a single one of the original (standardized) variables; this is commonly used as a cutoff for retaining PCs.
- Alternatively, keep the smallest number of PCs whose cumulative percentage of explained variance reaches a chosen threshold.
- A scree plot can also be inspected for an “elbow” beyond which additional PCs add little.
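In practice, these quantities are typically inspected with factoextra; a minimal sketch, assuming the `res.pca` object fitted earlier:

```r
library(factoextra)

# Eigenvalues, percentage of variance, and cumulative percentage
get_eigenvalue(res.pca)

# Scree plot annotated with the percentage of explained variance
fviz_screeplot(res.pca, addlabels = TRUE)
```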
`get_pca_var()` returns a list, called `var_info` here, with:

- `var_info$coord`: coordinates of the variables.
- `var_info$cos2`: quality of representation of the variables by a PC: `var.cos2 = var.coord * var.coord`.
- `var_info$contrib`: contributions of the variables to the principal components: `(var.cos2 * 100) / (sum of cos2 of the component)`.

`$coord`: Correlations between variables and components.
Property: the norm of each column vector is 1 (they are eigenvectors).
Consequence: each coordinate lies in [-1, 1].
`$cos2`: Quality of representation.
Strong positive or negative correlations between the PCs and a given variable imply that the PCs represent it well. Squared correlations thus measure the quality of representation.
`$cos2`: Graphical tools.

`$cos2`: Summary.

- cos2 values are used to estimate the quality of the representation.
- A high cos2 indicates a good representation of the variable by the corresponding PC.
- A low cos2 indicates that the variable is poorly represented by the corresponding PC.
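A sketch of the graphical tools referred to above, using factoextra (the exact plots in the original slides are assumed to be of this kind):

```r
var_info <- get_pca_var(res.pca)
head(var_info$coord)

# Bar plot of the cos2 of variables on the first two dimensions
fviz_cos2(res.pca, choice = "var", axes = 1:2)

# Correlation circle with variables colored by their cos2
fviz_pca_var(
  res.pca,
  col.var = "cos2",
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE  # avoid overlapping text labels
)
```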
`$contrib`: Important variables.
Dim.1 (PC1) and Dim.2 (PC2) are the most important for explaining the variability in the data set, so the variables contributing most to them are the most important variables.

`$contrib`: Graphical tools.

`$contrib`: Summary.

Description of PC1.

Description of PC2.
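The contribution plots and the per-dimension descriptions can be obtained as follows (a sketch; `fviz_contrib()` is from factoextra, `dimdesc()` from FactoMineR):

```r
# Contributions of variables to PC1 and to PC2
fviz_contrib(res.pca, choice = "var", axes = 1)
fviz_contrib(res.pca, choice = "var", axes = 2)

# Automatic description of the dimensions: variables most
# significantly correlated with each principal component
desc <- dimdesc(res.pca, axes = 1:2)
desc$Dim.1
desc$Dim.2
```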
Similarly, `get_pca_ind()` returns a list, called `ind_info` here, with:

- `ind_info$coord`: coordinates of the individuals.
- `ind_info$cos2`: quality of representation of the individuals by a PC: `ind.cos2 = ind.coord * ind.coord`.
- `ind_info$contrib`: contributions of the individuals to the principal components: `(ind.cos2 * 100) / (sum of cos2 of the component)`.
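A corresponding sketch for the individuals:

```r
ind_info <- get_pca_ind(res.pca)
head(ind_info$coord)

# Map of the individuals, colored by quality of representation (cos2)
fviz_pca_ind(
  res.pca,
  col.ind = "cos2",
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE
)
```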
The variable 100m is negatively correlated with the variable Long.jump: when an athlete runs the 100m in a short time, he can jump a long distance. One has to be careful here, because a low value for the variables 100m, 400m, 110m.hurdle and 1500m means a high score: the shorter the time an athlete runs, the more points he scores. The second axis opposes athletes who are strong (variables Discus and Shot.put) to those who are weak. The variables Discus, Shot.put and High.jump are not much correlated with the variables 100m, 400m, 110m.hurdle and Long.jump: strength is not much correlated with speed. At this point, we can divide the first factorial plane into four parts: fast and strong athletes (like Sebrle), slow athletes (like Casarsa), fast but weak athletes (like Warners), and slow and weak athletes (like Lorenzo).
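This four-quadrant reading is the kind of picture a biplot of the first factorial plane gives (a sketch, assuming `res.pca` as above):

```r
# Biplot of individuals and variables on the first factorial plane
fviz_pca_biplot(res.pca, repel = TRUE)
```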
Open the file 12-PCA-Exercises.qmd and perform the PCA analyses for the proposed data sets.