Article | Published:

Principal components analysis corrects for stratification in genome-wide association studies


Population stratification—allele frequency differences between cases and controls due to systematic ancestry differences—can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.
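As a rough illustration of the idea in the abstract, the sketch below infers axes of ancestry variation from a genotype matrix with principal components analysis, removes the ancestry component from both a candidate marker and the phenotype, and then tests the residuals for association. It is a minimal sketch under stated assumptions, not the published EIGENSTRAT implementation: the function names (`ancestry_axes`, `adjust`, `corrected_statistic`) are hypothetical, SNPs are scaled by their empirical standard deviation, and the statistic is taken to be (N - K - 1) times the squared correlation of the adjusted values.

```python
import numpy as np

def ancestry_axes(genotypes, k):
    """Return the top-k axes of variation for an (n_samples, n_snps)
    array of 0/1/2 allele counts. Illustrative helper, not a library API."""
    g = genotypes.astype(float)
    g -= g.mean(axis=0)                    # center each SNP
    sd = g.std(axis=0)
    keep = sd > 0                          # drop monomorphic SNPs
    g = g[:, keep] / sd[keep]              # assumption: scale by empirical SD
    # Left singular vectors are the eigenvectors of the sample covariance.
    u, _, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :k]

def adjust(v, axes):
    """Residual of v after removing its mean and its projection
    onto the (orthonormal) ancestry axes."""
    v = v - v.mean()
    return v - axes @ (axes.T @ v)

def corrected_statistic(snp, phenotype, axes):
    """Association statistic between ancestry-adjusted genotype and
    ancestry-adjusted phenotype (assumed form: (N - K - 1) * r^2)."""
    g = adjust(np.asarray(snp, dtype=float), axes)
    p = adjust(np.asarray(phenotype, dtype=float), axes)
    r2 = (g @ p) ** 2 / ((g @ g) * (p @ p))
    n, k = axes.shape
    return (n - k - 1) * r2
```

Passing an axes array with zero columns gives the uncorrected statistic, which makes it easy to compare a marker before and after adjustment. Because the adjustment is computed per marker, a marker whose frequency varies little across ancestries is barely changed, while a strongly differentiated marker is corrected more heavily.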


  1. Lander, E.S. & Schork, N.J. Genetic dissection of complex traits. Science 265, 2037–2048 (1994).

  2. Lohmueller, K. et al. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat. Genet. 33, 177–182 (2003).

  3. Freedman, M. et al. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36, 388–393 (2004).

  4. Marchini, J. et al. The effects of human population structure on large genetic association studies. Nat. Genet. 36, 512–517 (2004).

  5. Helgason, A. et al. An Icelandic example of the impact of population structure on association studies. Nat. Genet. 37, 90–95 (2005).

  6. Campbell, C.D. et al. Demonstrating stratification in a European American population. Nat. Genet. 37, 868–872 (2005).

  7. Hirschhorn, J.N. & Daly, M.J. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6, 95–108 (2005).

  8. Thomas, D.C. et al. Recent developments in genomewide association scans: a workshop summary and review. Am. J. Hum. Genet. 77, 337–345 (2005).

  9. Reich, D. & Goldstein, D. Detecting association in a case-control study while allowing for population stratification. Genet. Epidemiol. 20, 4–16 (2001).

  10. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).

  11. Devlin, B. et al. Genomic control to the extreme. Nat. Genet. 36, 1129–1130 (2004).

  12. Pritchard, J.K. et al. Association mapping in structured populations. Am. J. Hum. Genet. 67, 170–181 (2000).

  13. Satten, G. et al. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am. J. Hum. Genet. 68, 466–477 (2001).

  14. Setakis, E., Stirnadel, H. & Balding, D.J. Logistic regression protects against population structure in genetic association studies. Genome Res. 16, 290–296 (2006).

  15. Pritchard, J.K. et al. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).

  16. Serre, D. & Paabo, S. Evidence for gradients of human genetic diversity within and among continents. Genome Res. 14, 1679–1685 (2004).

  17. Jackson, J.E. A User's Guide to Principal Components (John Wiley & Sons, New York, 2003).

  18. Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic maps of human gene frequencies in Europeans. Science 201, 786–792 (1978).

  19. Cavalli-Sforza, L.L., Menozzi, P. & Piazza, A. Demic expansions and human evolution. Science 259, 639–646 (1993).

  20. Johnstone, I. On the distribution of the largest eigenvalue in principal components analysis. Ann. Stat. 29, 295–327 (2001).

  21. Soshnikov, A. A note on universality of the distribution of the largest eigenvalues in certain sample covariance matrices. J. Stat. Phys. 108, 1033–1056 (2002).

  22. Baik, J., Ben Arous, G. & Peche, S. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. Ann. Probab. 33, 1643–1697 (2005).

  23. Rosenberg, N.A. et al. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genetics 1, 660–671 (2005).

  24. Pritchard, J.K. & Donnelly, P. Case-control studies of association in structured or admixed populations. Theor. Popul. Biol. 60, 227–237 (2001).

  25. Balding, D.J. & Nichols, R.A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12 (1995).

  26. Cavalli-Sforza, L.L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes (Princeton Univ. Press, Princeton, New Jersey, 1994).

  27. Nicholson, G. et al. Assessing population differentiation and isolation from single-nucleotide polymorphism data. J. R. Statist. Soc. (B) 64, 695–715 (2002).

  28. Bersaglieri, T. et al. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74, 1111–1120 (2004).

  29. Armitage, P. Tests for linear trends in proportions and frequencies. Biometrics 11, 375–386 (1955).

  30. Enattah, N.S. et al. Identification of a variant associated with adult-type hypolactasia. Nat. Genet. 30, 233–237 (2002).

  31. The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).

  32. Cimmino, M.A. et al. Prevalence of rheumatoid arthritis in Italy: the Chiavari study. Ann. Rheum. Dis. 57, 315–318 (1998).

  33. Rosati, G. The prevalence of multiple sclerosis in the world: an update. Neurol. Sci. 22, 117–139 (2001).

  34. Panza, F. et al. Shifts in angiotensin I converting enzyme insertion allele frequency across Europe: implications for Alzheimer's disease risk. J. Neurol. Neurosurg. Psychiatry 74, 1159–1161 (2003).

  35. Bernardi, F. et al. Contribution of factor VII genotype to activated FVII levels. Differences in genotype frequencies between northern and southern European populations. Arterioscler. Thromb. Vasc. Biol. 17, 2548–2553 (1997).

  36. Angastiniotis, M. & Modell, B. Global epidemiology of hemoglobin disorders. Ann. NY Acad. Sci. 850, 251–269 (1998).

  37. Clayton, D.G. et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat. Genet. 37, 1243–1246 (2005).

  38. Wright, S. The genetical structure of populations. Ann. Eugen. 15, 323–354 (1951).

  39. Benito-Garcia, E. et al. Dietary caffeine does not affect methotrexate efficacy in rheumatoid arthritis patients. J. Rheumatol. (in the press).


The authors are grateful to B. Blumenstiel, M. DeFelice, M. Parkin, R. Barry, W. Winslow, C. Healy and S. Gabriel for generation of the Affymetrix genotype data. We are grateful to the BRASS study participants, the BRASS study team, and our rheumatology colleagues at the Brigham and Women's Hospital Arthritis Center. We thank C. Campbell and J. Hirschhorn for helpful comments and for sharing data from their paper (ref. 6). The BRASS study was supported by a grant from Millennium Pharmaceuticals. D.R. is supported in part by a Burroughs Wellcome Career Development Award in the Biomedical Sciences.

Author information

Competing interests

M.E.W. serves as a consultant to Millennium Pharmaceuticals; the BRASS study, which produced a data set described in the paper, was supported by a grant from Millennium Pharmaceuticals.

Correspondence to Alkes L Price.

Supplementary information

  1. Supplementary Fig. 1

    P-P plot of EIGENSTRAT test statistics. (PDF 429 kb)

  2. Supplementary Table 1

    Simulations using K axes of variation. (PDF 58 kb)

  3. Supplementary Table 2

    Simulations using M SNPs. (PDF 66 kb)

  4. Supplementary Table 3

    Simulations of Pritchard and Donnelly. (PDF 68 kb)

  5. Supplementary Table 4

    Simulations with no stratification and n subpopulations. (PDF 73 kb)

  6. Supplementary Table 5

    Stratification correction at rs10511418 using M SNPs. (PDF 73 kb)

  7. Supplementary Note (PDF 207 kb)


Figure 1: The EIGENSTRAT algorithm, illustrated on simulated data.
Figure 2: The top two axes of variation of European American samples.