Population stratification—allele frequency differences between cases and controls due to systematic ancestry differences—can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.
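The idea in the abstract can be illustrated with a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes that ancestry is summarized by the top principal components of the genotype matrix, that both the candidate marker and the phenotype are adjusted by removing their projection onto those components, and that association is then scored as (N - K - 1) times the squared correlation of the adjusted values. The function names (`top_axes`, `residualize`, `adjusted_stat`, `demo`), the per-SNP scaling by standard deviation, and the toy simulation are all assumptions made for illustration.

```python
import numpy as np

def top_axes(genotypes, k):
    """Top-k principal components of the samples (orthonormal columns).

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)                 # center each SNP
    sd = g.std(axis=0)
    g = g[:, sd > 0] / sd[sd > 0]          # drop monomorphic SNPs, scale the rest
    # Left singular vectors of the centered SNP matrix are the eigenvectors
    # of the sample-by-sample covariance matrix.
    u, _, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :k]

def residualize(x, axes):
    """Center x and subtract its projection onto the orthonormal axes."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return x - axes @ (axes.T @ x)

def adjusted_stat(snp, phenotype, axes):
    """(N - K - 1) * squared correlation of ancestry-adjusted values.

    Assumed form of the test statistic for this sketch.
    """
    n, k = axes.shape
    g = residualize(snp, axes)
    p = residualize(phenotype, axes)
    r2 = (g @ p) ** 2 / ((g @ g) * (p @ p))
    return (n - k - 1) * r2

def demo(seed=0):
    """Toy data: two populations, ancestry-confounded phenotype, no causal SNP."""
    rng = np.random.default_rng(seed)
    n_per_pop, n_snps = 200, 500
    # Background SNPs whose allele frequencies drift between populations.
    freq_a = rng.uniform(0.1, 0.9, n_snps)
    freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
    geno = np.vstack([
        rng.binomial(2, freq_a, (n_per_pop, n_snps)),
        rng.binomial(2, freq_b, (n_per_pop, n_snps)),
    ])
    # Candidate SNP: large frequency difference, no effect on disease.
    snp = np.concatenate([rng.binomial(2, 0.1, n_per_pop),
                          rng.binomial(2, 0.9, n_per_pop)])
    # Case status depends only on population membership.
    pheno = np.concatenate([rng.random(n_per_pop) < 0.8,
                            rng.random(n_per_pop) < 0.2]).astype(float)
    no_axes = np.zeros((2 * n_per_pop, 0))  # K = 0: no ancestry correction
    unadjusted = adjusted_stat(snp, pheno, no_axes)
    adjusted = adjusted_stat(snp, pheno, top_axes(geno, k=2))
    return unadjusted, adjusted
```

Calling `demo()` returns the uncorrected and corrected statistics for a marker that differs in frequency between two simulated populations but has no effect on disease. In this toy setting the corrected value should be much smaller than the uncorrected one, because the marker's apparent association is explained by ancestry.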
Lander, E.S. & Schork, N.J. Genetic dissection of complex traits. Science 265, 2037–2048 (1994).
Lohmueller, K. et al. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat. Genet. 33, 177–182 (2003).
Freedman, M. et al. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36, 388–393 (2004).
Marchini, J. et al. The effects of human population structure on large genetic association studies. Nat. Genet. 36, 512–517 (2004).
Helgason, A. et al. An Icelandic example of the impact of population structure on association studies. Nat. Genet. 37, 90–95 (2005).
Campbell, C.D. et al. Demonstrating stratification in a European American population. Nat. Genet. 37, 868–872 (2005).
Hirschhorn, J.N. & Daly, M.J. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6, 95–108 (2005).
Thomas, D.C. et al. Recent developments in genomewide association scans: a workshop summary and review. Am. J. Hum. Genet. 77, 337–345 (2005).
Reich, D. & Goldstein, D. Detecting association in a case-control study while allowing for population stratification. Genet. Epidemiol. 20, 4–16 (2001).
Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
Devlin, B. et al. Genomic control to the extreme. Nat. Genet. 36, 1129–1130 (2004).
Pritchard, J.K. et al. Association mapping in structured populations. Am. J. Hum. Genet. 67, 170–181 (2000).
Satten, G. et al. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am. J. Hum. Genet. 68, 466–477 (2001).
Setakis, E., Stirnadel, H. & Balding, D.J. Logistic regression protects against population structure in genetic association studies. Genome Res. 16, 290–296 (2006).
Pritchard, J.K. et al. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
Serre, D. & Paabo, S. Evidence for gradients of human genetic diversity within and among continents. Genome Res. 14, 1679–1685 (2004).
Jackson, J.E. A User's Guide to Principal Components (John Wiley & Sons, New York, 2003).
Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic maps of human gene frequencies in Europeans. Science 201, 786–792 (1978).
Cavalli-Sforza, L.L., Menozzi, P. & Piazza, A. Demic expansions and human evolution. Science 259, 639–646 (1993).
Johnstone, I. On the distribution of the largest eigenvalue in principal components analysis. Ann. Stat. 29, 295–327 (2001).
Soshnikov, A. A note on universality of the distribution of the largest eigenvalues in certain sample covariance matrices. J. Stat. Phys. 108, 1033–1056 (2002).
Baik, J., Ben Arous, G. & Peche, S. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. Ann. Probab. 33, 1643–1697 (2005).
Rosenberg, N.A. et al. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genetics 1, 660–671 (2005).
Pritchard, J.K. & Donnelly, P. Case-control studies of association in structured or admixed populations. Theor. Popul. Biol. 60, 227–237 (2001).
Balding, D.J. & Nichols, R.A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12 (1995).
Cavalli-Sforza, L.L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes (Princeton Univ. Press, Princeton, New Jersey, 1994).
Nicholson, G. et al. Assessing population differentiation and isolation from single-nucleotide polymorphism data. J. R. Statist. Soc. (B) 64, 695–715 (2002).
Bersaglieri, T. et al. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74, 1111–1120 (2004).
Armitage, P. Tests for linear trends in proportions and frequencies. Biometrics 11, 375–386 (1955).
Enattah, N.S. et al. Identification of a variant associated with adult-type hypolactasia. Nat. Genet. 30, 233–237 (2002).
The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
Cimmino, M.A. et al. Prevalence of rheumatoid arthritis in Italy: the Chiavari study. Ann. Rheum. Dis. 57, 315–318 (1998).
Rosati, G. The prevalence of multiple sclerosis in the world: an update. Neurol. Sci. 22, 117–139 (2001).
Panza, F. et al. Shifts in angiotensin I converting enzyme insertion allele frequency across Europe: implications for Alzheimer's disease risk. J. Neurol. Neurosurg. Psychiatry 74, 1159–1161 (2003).
Bernardi, F. et al. Contribution of factor VII genotype to activated FVII levels. Differences in genotype frequencies between northern and southern European populations. Arterioscler. Thromb. Vasc. Biol. 17, 2548–2553 (1997).
Angastiniotis, M. & Modell, B. Global epidemiology of hemoglobin disorders. Ann. NY Acad. Sci. 850, 251–269 (1998).
Clayton, D.G. et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat. Genet. 37, 1243–1246 (2005).
Wright, S. The genetical structure of populations. Ann. Eugen. 15, 323–354 (1951).
Benito-Garcia, E. et al. Dietary caffeine does not affect methotrexate efficacy in rheumatoid arthritis patients. J. Rheumatol. (in the press).
The authors are grateful to B. Blumenstiel, M. DeFelice, M. Parkin, R. Barry, W. Winslow, C. Healy and S. Gabriel for generation of the Affymetrix genotype data. We are grateful to the BRASS study participants, the BRASS study team, and our rheumatology colleagues at the Brigham and Women's Hospital Arthritis Center. We thank C. Campbell and J. Hirschhorn for helpful comments and sharing data from their paper (ref. 6). The BRASS study was supported by a grant from Millennium Pharmaceuticals. D.R. is supported in part by a Burroughs Wellcome Career Development Award in the Biomedical Sciences.
P-P plot of EIGENSTRAT test statistics. (PDF 429 kb)
Simulations using K axes of variation. (PDF 58 kb)
Simulations using M SNPs. (PDF 66 kb)
Simulations of Pritchard and Donnelly. (PDF 68 kb)
Simulations with no stratification and n subpopulations. (PDF 73 kb)
Stratification correction at rs10511418 using M SNPs. (PDF 73 kb)