Points of Significance: Principal component analysis

Nature Methods 14, 641–642 (2017)
doi:10.1038/nmeth.4346

PCA helps you interpret your data, but it will not always find the important patterns.

Figures

Figure 1: PCA geometrically projects data onto a lower-dimensional space.

(a) Projection is illustrated with 2D points projected onto 1D lines, each point mapped along a path perpendicular to the line (illustrated for the solid circle). (b) The projections of the points in a onto each line; the variance σ² of the projected points depends on the line (e.g., high for u and low for v). (c) PC1, the line u from a, maximizes the σ² of the projection. The second component (v, PC2) is perpendicular to PC1. Note that PC1 is not the same as the linear regression of y on x (y~x, dark brown) or of x on y (x~y, light brown). Dashed lines indicate the distances being minimized.
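
The projection and variance maximization described in this caption can be reproduced in a few lines of linear algebra. The following is a minimal numpy sketch on simulated 2D data (not the article's data): it centers the points, obtains PC1 and PC2 from the SVD, projects onto each, and contrasts the slope of the PC1 line with the y~x regression slope.

```python
# Minimal sketch of the projection in Figure 1, using simulated (hypothetical) data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)                      # simulated x values
y = 0.8 * x + rng.normal(scale=0.4, size=100) # correlated y values
X = np.column_stack([x, y])

Xc = X - X.mean(axis=0)                       # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]                       # unit vectors: PC1, and PC2 perpendicular to it

# Variance of the projections (the sigma^2 of panel b); PC1 is maximal by construction.
print("variance along PC1:", (Xc @ pc1).var(ddof=1))
print("variance along PC2:", (Xc @ pc2).var(ddof=1))

# For contrast, the slope of the PC1 line differs from the least-squares slope of y ~ x.
print("PC1 slope:", pc1[1] / pc1[0], " y~x slope:", np.polyfit(x, y, 1)[0])
```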

Figure 2: PCA reduction of nine expression profiles from six to two dimensions.

(a) Expression profiles for nine genes (A–I) across six samples (a–f), coded by color on the basis of shape similarity, and the expression variance of each sample. (b) PC1–PC6 of the profiles in a. PC1 and PC2 reflect clearly visible trends; the remaining components capture only small fluctuations. (c) The transformed profiles, expressed as PC scores, and the σ² of each component score. (d) The profiles reconstructed using PC1–PC3. (e) The 2D coordinates of each profile based on the scores of the first two PCs.
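
As a rough illustration of the reduction in this figure, the sketch below stands in a hypothetical 9 × 6 matrix for the real profiles. It computes the PC scores and their variances (panel c), takes the first two scores as 2D coordinates (panel e), and reconstructs the profiles from PC1–PC3 (panel d).

```python
# Sketch of the Figure 2 reduction with a placeholder matrix (9 profiles x 6 samples).
import numpy as np

rng = np.random.default_rng(1)
profiles = rng.normal(size=(9, 6))        # hypothetical stand-in for the expression matrix

centered = profiles - profiles.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

scores = centered @ Vt.T                  # PC scores; column k corresponds to PC(k+1)
var_per_pc = scores.var(axis=0, ddof=1)   # variance of each component score (panel c)
coords_2d = scores[:, :2]                 # 2D coordinates of each profile (panel e)

# Reconstruction from PC1-PC3 (panel d): keep 3 scores, map back, add the mean.
k = 3
approx = scores[:, :k] @ Vt[:k] + profiles.mean(axis=0)
print("max reconstruction error with 3 PCs:", np.abs(approx - profiles).max())
```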

Figure 3: PCA can help identify clusters in the data.

(a) Complete-linkage hierarchical clustering of previously described expression profiles (ref. 1), with the expression of all 26 genes (listed vertically) represented as lines across 15 samples (horizontally). (b) When shown as coefficients of the first two PCs, the profiles group in a manner similar to the hierarchical clustering; groups D and E are still difficult to separate. (c) PCA is not scale invariant. Shown are the first two PCs of profiles whose first and second variables (samples) were scaled by 300 and 200, respectively. A grouping very different from that in b is obtained. (d) The plot of the two scaled variables in each profile, ignoring the remaining 13 variables. The grouping of points is very similar to that in c, because PCA puts more weight on variables with larger absolute magnitude.
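
The scale dependence in panels c and d is easy to reproduce with simulated data. In the sketch below (a hypothetical 26 × 15 matrix, not the profiles of ref. 1), scaling the first two variables by 300 and 200 changes the 2D PC coordinates drastically; standardizing each variable beforehand removes the effect.

```python
# Sketch of the scale sensitivity in Figure 3c,d: large scale factors dominate the
# covariance, so the leading PCs follow the scaled variables. Data are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
profiles = rng.normal(size=(26, 15))      # placeholder for 26 profiles x 15 samples

def pc_scores(X, k=2):
    """Return the first k PC scores of the rows of X (helper for this sketch)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

scaled = profiles.copy()
scaled[:, 0] *= 300                       # scale the first variable
scaled[:, 1] *= 200                       # scale the second variable

print(pc_scores(profiles)[:3])            # 2D coordinates from the raw profiles
print(pc_scores(scaled)[:3])              # very different: dominated by columns 0 and 1

# Standardizing each variable (z-scores) before PCA removes this scale dependence.
z = (scaled - scaled.mean(axis=0)) / scaled.std(axis=0, ddof=1)
print(pc_scores(z)[:3])
```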

Figure 4: The assumptions of PCA place limitations on its use.

(a–c) Limitations of PCA: it may miss nonlinear patterns in the data (a); structure that is not orthogonal to earlier PCs may not be well characterized (b); and PC1 (blue) may fail to split two obvious clusters (c). PC2 is shown in orange.
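
The failure mode of panel c can be simulated directly: when the spread within each cluster exceeds the separation between clusters, PC1 follows the spread and its scores do not split the clusters. The sketch below uses two hypothetical Gaussian clusters to show that PC1 scores overlap across clusters while PC2 scores separate them.

```python
# Sketch of the Figure 4c limitation with two hypothetical clusters.
import numpy as np

rng = np.random.default_rng(3)
# Large spread along x within each cluster, small offset along y between clusters.
cluster_a = np.column_stack([rng.normal(scale=5, size=100),
                             rng.normal(loc=+1, scale=0.3, size=100)])
cluster_b = np.column_stack([rng.normal(scale=5, size=100),
                             rng.normal(loc=-1, scale=0.3, size=100)])
X = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 100 + [1] * 100)

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]
pc2_scores = Xc @ Vt[1]

# PC1 scores overlap heavily between clusters; the low-variance PC2 separates them.
print("PC1 means by cluster:", pc1_scores[labels == 0].mean(), pc1_scores[labels == 1].mean())
print("PC2 means by cluster:", pc2_scores[labels == 0].mean(), pc2_scores[labels == 1].mean())
```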

References

  1. Altman, N. & Krzywinski, M. Nat. Methods 14, 545–546 (2017).
  2. Altman, N. & Krzywinski, M. Nat. Methods 12, 899–900 (2015).
  3. Altman, N. & Krzywinski, M. Nat. Methods 12, 999–1000 (2015).
  4. Krzywinski, M. & Altman, N. Nat. Methods 12, 1103–1104 (2015).


Author information

Affiliations

  1. Jake Lever is a PhD candidate at Canada's Michael Smith Genome Sciences Centre.

  2. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

  3. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

Competing financial interests

The authors declare no competing financial interests.
