Interpreting principal component analyses of spatial population genetic variation

Abstract

Nearly 30 years ago, Cavalli-Sforza et al. pioneered the use of principal component analysis (PCA) in population genetics and used PCA to produce maps summarizing human genetic variation across continental regions [1]. They interpreted gradient and wave patterns in these maps as signatures of specific migration events [1–3]. These interpretations have been controversial [4–7], but influential [8], and the use of PCA has become widespread in analysis of population genetics data [9–13]. However, the behavior of PCA for genetic data showing continuous spatial variation, such as might exist within human continental groups, has been less well characterized. Here, we find that gradients and waves observed in Cavalli-Sforza et al.'s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events. Our findings aid interpretation of PCA results and suggest how PCA can help correct for continuous population structure in association studies.
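
The sinusoidal artifact described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation: it assumes a hypothetical one-dimensional habitat of 50 equally spaced sites whose covariance decays geometrically with distance (the decay rate `rho = 0.9`, the site count and all function names are illustrative choices). Under that assumption, the leading eigenvectors of the covariance matrix take wave-like shapes even though no migration event is modeled.

```python
import numpy as np


def spatial_covariance(n_sites, rho):
    """Covariance between sites on a line, decaying as rho**distance (assumed model)."""
    idx = np.arange(n_sites)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def leading_components(cov, k):
    """Return the k largest eigenvalues and their eigenvectors (as columns)."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]


def sign_changes(v):
    """Count how many times consecutive entries of v switch sign."""
    s = np.sign(v)
    return int(np.sum(s[:-1] * s[1:] < 0))


cov = spatial_covariance(n_sites=50, rho=0.9)
vals, vecs = leading_components(cov, k=4)

# Number of zero crossings of each leading eigenvector along the habitat.
changes = [sign_changes(vecs[:, j]) for j in range(4)]
print(changes)
```

In this sketch no centering is applied, so the first eigenvector is a same-signed, mean-like component. The next ones cross zero once (a gradient), twice (a wave) and three times, so the counts printed are 0, 1, 2 and 3. These shapes follow from the assumed distance-decay structure alone, not from any directional expansion.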

Figure 1: Comparison of PC maps of ref. 3 with theoretical and empirical predictions.
Figure 2: Results of PCA applied to data from a one-dimensional habitat.

References

  1. Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic maps of human gene frequencies in Europeans. Science 201, 786–792 (1978).
  2. Cavalli-Sforza, L.L., Menozzi, P. & Piazza, A. Demic expansions and human evolution. Science 259, 639–646 (1993).
  3. Cavalli-Sforza, L.L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes (Princeton University Press, Princeton, New Jersey, USA, 1994).
  4. Rendine, S., Piazza, A. & Cavalli-Sforza, L.L. Simulation and separation by principal components of multiple demic expansions in Europe. Am. Nat. 128, 681–706 (1986).
  5. Sokal, R.R., Oden, N.L. & Thomson, B.A. A problem with synthetic maps. Hum. Biol. 71, 1–13 (1999).
  6. Rendine, S., Piazza, A. & Cavalli-Sforza, L.L. A problem with synthetic maps: Reply to Sokal et al. Hum. Biol. 71, 15–25 (1999).
  7. Currat, M. & Excoffier, L. The effect of the Neolithic expansion on European molecular diversity. Proc. Biol. Sci. 272, 679–688 (2005).
  8. Jobling, M., Hurles, M. & Tyler-Smith, C. Human Evolutionary Genetics (Garland Science, New York, 2004).
  9. Hanotte, O. et al. African pastoralism: genetic imprints of origins and migrations. Science 296, 336–339 (2002).
  10. Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
  11. Patterson, N., Price, A. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
  12. Bauchet, M. et al. Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80, 948–956 (2007).
  13. Linz, B. et al. An African origin for the intimate association between humans and Helicobacter pylori. Nature 445, 915–918 (2007).
  14. Ahmed, N., Natarajan, T. & Rao, K.R. Discrete cosine transform. IEEE Trans. Comput. C-23, 90–93 (1974).
  15. Brillinger, D.R. Time Series: Data Analysis and Theory (Holt, Rinehart, and Winston, New York, 1975).
  16. Podani, J. & Miklos, I. Resemblance coefficients and the horseshoe effect in principal coordinates analysis. Ecology 83, 3331–3343 (2002).
  17. Richman, M.B. Rotation of principal components. J. Climatol. 6, 293–335 (1986).
  18. Heidemann, G. The principal components of natural images revisited. IEEE Trans. Pattern Anal. Mach. Intell. 28, 822–826 (2006).
  19. Freiberger, W., ed. The International Dictionary of Applied Mathematics (D. Van Nostrand Co., Princeton, New Jersey, USA, 1960).
  20. Diaconis, P., Goel, S. & Holmes, S. Horseshoes in multidimensional scaling and kernel methods. Ann. Appl. Stat. (in the press).
  21. Zhao, K. et al. An Arabidopsis example of association mapping in structured samples. PLoS Genet. 3, e4 (2007).
  22. Irwin, D.E., Bensch, S., Irwin, J.H. & Price, T.D. Speciation by distance in a ring species. Science 307, 414–416 (2005).
  23. Handley, L.J.L., Manica, A., Goudet, J. & Balloux, F. Going the distance: human population genetics in a clinal world. Trends Genet. 23, 432–439 (2007).
  24. Semino, O. et al. Origin, diffusion, and differentiation of Y-chromosome haplogroups E and J: inferences on the Neolithization of Europe and later migratory events in the Mediterranean area. Am. J. Hum. Genet. 74, 1023–1034 (2004).
  25. Haak, W. et al. Ancient DNA from the first European farmers in 7500-year-old Neolithic sites. Science 310, 1016–1018 (2005).
  26. Pinhasi, R., Fort, J. & Ammerman, A.J. Tracing the origin and spread of agriculture in Europe. PLoS Biol. 3, e410 (2005).
  27. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
  28. Zhu, X., Zhang, S., Zhao, H. & Cooper, R.S. Association mapping, using a mixture model for complex traits. Genet. Epidemiol. 23, 181–196 (2002).
  29. Pritchard, J.K. & Rosenberg, N.A. Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet. 65, 220–228 (1999).
  30. Wakefield, J. Disease mapping and spatial regression with count data. Biostatistics 8, 158–183 (2007).

Acknowledgements

We thank G. Coop, M. Przeworski, D. Reich, N. Patterson, Y. Guan, M. Barber and C. Becquet for helpful comments and D. Witonsky for pointing out the connection to Lissajous figures. Funding for this research was provided by a US National Science Foundation postdoctoral research fellowship in bioinformatics (J.N.) and US National Institutes of Health grant RO1 HG02585-01 (M.S.).

Author information

Contributions

J.N. and M.S. jointly designed the analyses and interpreted results. J.N. performed the analyses. J.N. and M.S. wrote the paper.

Corresponding author

Correspondence to Matthew Stephens.

Supplementary information

Supplementary Text and Figures

Supplementary Methods, Supplementary Note and Supplementary Figures 1–9 (PDF 2227 kb)

About this article

Cite this article

Novembre, J., Stephens, M. Interpreting principal component analyses of spatial population genetic variation. Nat Genet 40, 646–649 (2008). https://doi.org/10.1038/ng.139
