This Month
Published: 29 June 2017

Points of Significance

Principal component analysis

Jake Lever¹,
Martin Krzywinski² &
Naomi Altman³

Nature Methods volume 14, pages 641–642 (2017)

PCA helps you interpret your data, but it will not always find the important patterns.

Principal component analysis (PCA) simplifies the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, which act as summaries of features. High-dimensional data are very common in biology and arise when multiple features, such as expression of many genes, are measured for each sample. This type of data presents several challenges that PCA mitigates: computational expense and an increased error rate due to multiple test correction when testing each feature for association with an outcome. PCA is an unsupervised learning method and is similar to clustering¹—it finds patterns without reference to prior knowledge about whether the samples come from different treatment groups or have phenotypic differences.

PCA reduces data by geometrically projecting them onto lower dimensions called principal components (PCs), with the goal of finding the best summary of the data using a limited number of PCs. The first PC is chosen to minimize the total distance between the data and their projection onto the PC (Fig. 1a). By minimizing this distance, we also maximize the variance of the projected points, σ² (Fig. 1b). The second (and subsequent) PCs are selected similarly, with the additional requirement that they be uncorrelated with all previous PCs. For example, projection onto PC1 is uncorrelated with projection onto PC2, and we can think of the PCs as geometrically orthogonal. This requirement of no correlation means that the maximum number of PCs possible is either the number of samples or the number of features, whichever is smaller. The PC selection process has the effect of maximizing the correlation (r²) (ref. 2) between data and their projection and is equivalent to carrying out multiple linear regression³,⁴ on the projected data against each variable of the original data. For example, the projection onto PC2 has maximum r² when used in multiple regression with PC1.
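The selection of PCs described above can be sketched with NumPy. This is a minimal illustration on simulated 2D data, not the data in Fig. 1; the function name `pca`, the random seed and the noise level are assumptions made for the example. It checks that no sampled direction gives a larger projected variance than PC1, that the PC1 and PC2 projections are uncorrelated, and that the squared projection and squared distance add up to the total (which is why minimizing one maximizes the other).

```python
import numpy as np

def pca(data):
    """PCA of a matrix whose rows are observations and columns are variables.

    Returns (components, scores, variances): unit-length PC directions as rows,
    coordinates of the centred data on those PCs, and the variance along each PC.
    """
    centred = data - data.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt.T
    variances = singular_values**2 / (len(data) - 1)
    return vt, scores, variances

# Simulated, correlated 2D data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.3 * rng.normal(size=200)
data = np.column_stack([x, y])
centred = data - data.mean(axis=0)

components, scores, variances = pca(data)

# Variance of the projection onto each of 181 candidate unit directions.
angles = np.linspace(0.0, np.pi, 181)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
candidate_variances = (centred @ directions.T).var(axis=0, ddof=1)

# Pythagoras: total squared length = squared projection + squared distance.
projection_ss = (scores[:, 0] ** 2).sum()
residual = centred - np.outer(scores[:, 0], components[0])
distance_ss = (residual**2).sum()

print("variance along PC1:", variances[0])
print("largest variance among candidates:", candidate_variances.max())
print("correlation of PC1 and PC2 scores:", np.corrcoef(scores.T)[0, 1])
print("total SS:", (centred**2).sum(), "projection + distance:", projection_ss + distance_ss)
```

The singular value decomposition is used here only as one convenient way to obtain the PCs; an eigendecomposition of the covariance matrix would give the same directions up to sign.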

**Figure 1: PCA geometrically projects data onto a lower-dimensional space.**

The PCs are defined as a linear combination of the data's original variables, and in our two-dimensional (2D) example, PC1 = x/√2 + y/√2 (Fig. 1c). These coefficients are stored in a 'PCA loading matrix', which can be interpreted as a rotation matrix that rotates data such that the projection with greatest variance goes along the first axis. At first glance, PC1 closely resembles the linear regression line³ of y versus x or x versus y (Fig. 1c). However, PCA differs from linear regression in that PCA minimizes the perpendicular distance between a data point and the principal component, whereas linear regression minimizes the distance between the response variable and its predicted value.
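To make the contrast with regression concrete, the sketch below compares the slope of PC1 with the two ordinary least-squares lines (y on x, and x on y) for simulated 2D data. The data, seed and variable names are assumptions for illustration only. It also checks that the loading matrix is orthogonal, so that applying it rotates (possibly with a reflection) the data until the covariance is diagonal with the largest variance on the first axis.

```python
import numpy as np

# Simulated, positively correlated 2D data (illustrative only).
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 0.8 * x + 0.5 * rng.normal(size=300)
centred = np.column_stack([x, y])
centred = centred - centred.mean(axis=0)

cov = np.cov(centred, rowvar=False)

# eigh returns eigenvalues in ascending order, so reverse the columns
# to put PC1 first. Columns of `loadings` are the PC coefficients.
_, eigenvectors = np.linalg.eigh(cov)
loadings = eigenvectors[:, ::-1]
pc1 = loadings[:, 0]
pc1_slope = pc1[1] / pc1[0]              # unaffected by the sign of pc1

# Least-squares lines, both expressed as dy/dx in the (x, y) plane.
slope_y_on_x = cov[0, 1] / cov[0, 0]     # minimizes vertical distances
slope_x_on_y = cov[1, 1] / cov[0, 1]     # minimizes horizontal distances

# Rotating the data with the loading matrix decorrelates the coordinates.
rotated = centred @ loadings
rotated_cov = np.cov(rotated, rowvar=False)

print("slope of y-on-x regression:", slope_y_on_x)
print("slope of PC1:              ", pc1_slope)
print("slope of x-on-y regression:", slope_x_on_y)
print("covariance after rotation:\n", rotated_cov)
```

For positively correlated data the PC1 slope falls between the two regression slopes, which is why the three lines look alike at first glance yet are not the same line.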

To illustrate PCA on biological data, we simulated expression profiles for nine genes that fall into one of three patterns across six samples (Fig. 2a). We find that the variance is fairly similar across samples (Fig. 2a), which tells us that no single sample captures the patterns in the data appreciably more than another. In other words, we need all six sample dimensions to express the data fully.

**Figure 2: PCA reduction of nine expression profiles from six to two dimensions.**

Let's now use PCA to see whether a smaller number of combinations of samples can capture the patterns. We start by finding the six PCs (PC1–PC6), which become our new axes (Fig. 2b). We next transform the profiles so that they are expressed as linear combinations of PCs (each profile is now a set of coordinates on the PC axes) and calculate the variance (Fig. 2c). As expected, PC1 has the largest variance: 52.6% of the variance is captured by PC1 and 47.0% by PC2. A useful interpretation of PCA is that r² of the regression is the percent variance (of all the data) explained by the PCs. As additional PCs are added to the prediction, the difference in r² corresponds to the variance explained by that PC. However, not all the PCs are typically used, because the majority of variance, and hence patterns in the data, will be limited to the first few PCs. In our example, we can ignore PC3–PC6, which contribute little (0.4%) to explaining the variance, and express the data in two dimensions instead of six.
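The percent variance explained by each PC can be computed as in the following sketch. The nine profiles here are hypothetical (three made-up patterns across six samples plus a little noise), so the percentages it prints will not match those in Fig. 2c; the pattern values, noise level and seed are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three hypothetical expression patterns across six samples.
patterns = np.array([
    [ 1.0,  0.6,  0.2, -0.2, -0.6, -1.0],   # decreasing
    [-1.0, -0.6, -0.2,  0.2,  0.6,  1.0],   # increasing
    [-1.0,  0.0,  1.0,  1.0,  0.0, -1.0],   # peak in the middle
])
# Nine profiles: three noisy copies of each pattern (rows = genes).
profiles = np.repeat(patterns, 3, axis=0) + 0.05 * rng.normal(size=(9, 6))

centred = profiles - profiles.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centred, full_matrices=False)

# Variance along each PC, and its share of the total variance.
pc_variances = singular_values**2 / (len(profiles) - 1)
explained = pc_variances / pc_variances.sum()
cumulative = np.cumsum(explained)        # r² when using the first k PCs

for k, (frac, cum) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{k}: {100 * frac:5.1f}% of variance, cumulative {100 * cum:5.1f}%")
```

Because the PCs only rotate the axes, the variances along the PCs sum to the same total as the variances of the original six samples; PCA redistributes the variance so that most of it falls on the first few axes.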

Figure 2d verifies visually that we can faithfully reproduce the profiles using only PC1 and PC2. For example, the root mean square (r.m.s.) distances of the original profile A from its 1D, 2D and 3D reconstructions are 0.29, 0.03 and 0.01, respectively. Approximations using two or three PCs are useful, because we can summarize the data as a scatter plot. In our case, this plot easily identifies that the profiles fall into three patterns (Fig. 2e). Moreover, the projected data in such plots often appear less noisy, which enhances pattern recognition and data summary.
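Reconstruction from a limited number of PCs, and the r.m.s. distance used to judge it, might look like the sketch below. The profiles are again hypothetical stand-ins for the data in Fig. 2, and `reconstruct` and `rms_distance` are names chosen for this example, so the printed distances are not the values quoted above.

```python
import numpy as np

def reconstruct(data, k):
    """Approximate each row of `data` using only the first k PCs."""
    mean = data.mean(axis=0)
    centred = data - mean
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top = vt[:k]                          # first k PC directions (rows)
    # Project onto the k PCs, map back to the original axes, restore the mean.
    return mean + (centred @ top.T) @ top

def rms_distance(a, b):
    """Root mean square distance between two profiles."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(3)
patterns = np.array([
    [ 1.0,  0.6,  0.2, -0.2, -0.6, -1.0],
    [-1.0, -0.6, -0.2,  0.2,  0.6,  1.0],
    [-1.0,  0.0,  1.0,  1.0,  0.0, -1.0],
])
# Hypothetical data: nine profiles across six samples.
profiles = np.repeat(patterns, 3, axis=0) + 0.05 * rng.normal(size=(9, 6))

# Distance of the first profile from its 1D, 2D, ..., 6D reconstructions.
errors = [rms_distance(profiles[0], reconstruct(profiles, k)[0])
          for k in range(1, 7)]

for k, err in enumerate(errors, start=1):
    print(f"{k} PC(s): r.m.s. distance = {err:.3f}")
```

The distance can only shrink as PCs are added, and it reaches zero (up to rounding) once all six PCs are used, because the full set of PCs is just a rotation of the original axes.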

Such PCA plots are often used to find potential clusters. To relate PCA to clustering, we return to the 26 expression profiles across 15 subjects from a previous column¹, which we grouped using hierarchical clustering (Fig. 3a). It turns out that we can recover these clusters using only two PCs (Fig. 3b), reducing the dimensionality from 15 (the number of subjects) to 2.

**Figure 3: PCA can help identify clusters in the data.**

Scale matters with PCA. We illustrate this by showing PC1 and PC2 coefficients of each profile after artificially scaling up the expression in the first two subjects in every profile by factors of 300 and 200 so that they are dominant (Fig. 3c). This scenario might arise if expression in the first two subjects was measured using a different technique, resulting in dramatically different variance. In fact, when a small set of variables has a much larger magnitude than others, the components in the PCA are heavily weighted along those variables, while other variables are ignored. As a consequence, the PCA simply recovers the values of these high-magnitude variables (Fig. 3d).

If the variance is dramatically different across variables (e.g., expression across patients in the scaled data in Fig. 3c), or if the variables use different scales, such as expression and phenotype data, it may be appropriate to standardize them such that each variable has unit variance. However, if the variables are already on the same scale, standardization is not normally appropriate, as it may actually distort the data. For instance, after standardization, gene expression that varies dramatically owing to biological function may look similar to gene expression that varies only owing to noise.
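The effect of scale, and of standardizing to unit variance, can be sketched as follows. The expression matrix is random noise standing in for the 26 profiles across 15 subjects (an assumption; the real data are not reproduced here), with the first two subjects scaled by 300 and 200 as in the text. The helper names `pc_loadings` and `weight_on_first_two` are hypothetical.

```python
import numpy as np

def pc_loadings(data, n_components=2):
    """Return the first PC directions (rows) of the centred data."""
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:n_components]

def weight_on_first_two(loadings):
    """Share of the squared PC1 and PC2 loadings carried by the first two variables."""
    return float((loadings[:, :2] ** 2).sum() / (loadings**2).sum())

rng = np.random.default_rng(4)
expression = rng.normal(size=(26, 15))   # hypothetical: 26 profiles x 15 subjects

scaled = expression.copy()
scaled[:, 0] *= 300                      # first subject scaled up
scaled[:, 1] *= 200                      # second subject scaled up

# Standardize each variable (column) to zero mean and unit variance.
standardized = (scaled - scaled.mean(axis=0)) / scaled.std(axis=0, ddof=1)

print("weight on first two subjects, scaled data:      ",
      weight_on_first_two(pc_loadings(scaled)))
print("weight on first two subjects, standardized data:",
      weight_on_first_two(pc_loadings(standardized)))
```

With the scaled data, PC1 and PC2 are almost entirely made up of the two high-magnitude subjects; after standardization no variable dominates simply because of its units. Whether standardization is appropriate still depends on the considerations in the text.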

PCA is a good data summary when the interesting patterns increase the variance of projections onto orthogonal components. But PCA also has limitations that must be considered when interpreting the output: the underlying structure of the data must be linear (Fig. 4a), patterns that are highly correlated may be unresolved because all PCs are uncorrelated (Fig. 4b), and the goal is to maximize variance and not necessarily to find clusters (Fig. 4c).
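The last limitation, that PCA maximizes variance rather than cluster separation, can be illustrated with a small simulated example. The two groups, their spread and the `separation` measure below are assumptions chosen for illustration and are not the data in Fig. 4c.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
labels = np.repeat([0, 1], n)

# Large variance along x that is unrelated to the groups;
# the two groups differ only by a small shift along y.
x = rng.normal(scale=5.0, size=2 * n)
y = np.where(labels == 0, -1.0, 1.0) + rng.normal(scale=0.2, size=2 * n)
data = np.column_stack([x, y])

centred = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ vt.T                  # columns: PC1 and PC2 coordinates

def separation(values, labels):
    """Difference between group means in units of the pooled within-group s.d."""
    a, b = values[labels == 0], values[labels == 1]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return float(abs(a.mean() - b.mean()) / pooled_sd)

print("PC1 direction:", vt[0])
print("group separation along PC1:", separation(scores[:, 0], labels))
print("group separation along PC2:", separation(scores[:, 1], labels))
```

Here PC1 follows the high-variance x axis, along which the groups overlap, while the groups are well separated only along PC2. A reduction to one dimension would therefore discard exactly the direction that distinguishes the clusters.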

**Figure 4: The assumptions of PCA place limitations on its use.**

Conclusions made with PCA must take these limitations into account. As with all statistical methods, PCA can be misused. The scaling of variables can cause different PCA results, and it is very important that the scaling is not adjusted to match prior knowledge of the data. If different scalings are tried, they should be described. PCA is a tool for identifying the main axes of variance within a data set and allows for easy data exploration to understand the key variables in the data and spot outliers. Properly applied, it is one of the most powerful tools in the data analysis tool kit.

References

1. Altman, N. & Krzywinski, M. Nat. Methods 14, 545–546 (2017).
2. Altman, N. & Krzywinski, M. Nat. Methods 12, 899–900 (2015).
3. Altman, N. & Krzywinski, M. Nat. Methods 12, 999–1000 (2015).
4. Krzywinski, M. & Altman, N. Nat. Methods 12, 1103–1104 (2015).

Author information

Authors and Affiliations

Jake Lever is a PhD candidate at Canada's Michael Smith Genome Sciences Centre.
Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.
Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Lever, J., Krzywinski, M. & Altman, N. Principal component analysis. Nat Methods 14, 641–642 (2017). https://doi.org/10.1038/nmeth.4346

Issue Date: 01 July 2017
DOI: https://doi.org/10.1038/nmeth.4346
