New approaches to population stratification in genome-wide association studies

Price, Alkes L.; Zaitlen, Noah A.; Reich, David; Patterson, Nick

doi:10.1038/nrg2813

Progress
Published: 15 June 2010

Alkes L. Price^1,2, Noah A. Zaitlen^1,2, David Reich^3 & Nick Patterson^1

Nature Reviews Genetics volume 11, pages 459–463 (2010)

Abstract

Genome-wide association (GWA) studies are an effective approach for identifying genetic variants associated with disease risk. GWA studies can be confounded by population stratification — systematic ancestry differences between cases and controls — which has previously been addressed by methods that infer genetic ancestry. Those methods perform well in data sets in which population structure is the only kind of structure present but are inadequate in data sets that also contain family structure or cryptic relatedness. Here, we review recent progress on methods that correct for stratification while accounting for these additional complexities.

**Figure 1: P–P plots for the visualization of stratification or other confounders.**

Author information

Authors and Affiliations

Alkes L. Price, Noah A. Zaitlen, David Reich and Nick Patterson are at the Program in Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA.
Alkes L. Price and Noah A. Zaitlen are also at the Department of Epidemiology and Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, USA.
David Reich is also at the Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA.

Corresponding author

Correspondence to Alkes L. Price.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Ancestry-informative markers: Genetic markers ascertained for large differences in allele frequency between subpopulations that are genotyped to infer genetic ancestry in new samples.
Armitage trend test: A standard χ² (1 degree of freedom) association test computed as the number of samples times the squared correlation between genotype and phenotype.
Cryptic relatedness: Sample structure due to distant relatedness among samples with no known family relationships.
Differential bias: Spurious differences in allele frequencies between cases and controls due to differences in sample collection, sample preparation and/or genotyping assay procedures.
Exome resequencing: A study design in which exon capture technologies are used to obtain resequencing data covering all exonic regions for each individual in the study.
Family-based association tests: A class of association tests that uses families with one or more affected children as the subjects rather than unrelated cases and controls. The analysis treats the alleles that each parent transmits to the affected children as 'cases' and the untransmitted alleles as 'controls' to avoid the effects of population structure.
Family structure: Sample structure due to familial relatedness among samples.
F_ST: A measure of the genetic distance between two populations that describes the proportion of overall genetic variation that is due to differences between populations.
Genetic drift: Random fluctuations in allele frequencies over time due to sampling effects, particularly in small populations.
Genetic heritability: The proportion of the total phenotypic variation in a given characteristic that can be attributed to genetic effects. In the broad sense, heritability includes all additive and non-additive genetic variance, whereas in the narrow sense, it includes only additive genetic variance.
Genetic matching: A method of association testing in which cases and controls are matched for genetic ancestry, as inferred by principal components analysis or other methods.
Genomic control: A method for detecting (or detecting and correcting for) stratification based on the genome-wide inflation of association statistics.
Mixed models: A class of models in which phenotypes are modelled using both fixed effects (candidate SNPs and fixed covariates) and random effects (the phenotypic covariance matrix).
Multidimensional scaling: A dimensionality reduction technique, similar to principal components analysis, in which points in a high-dimensional space are projected into a lower-dimensional space while approximately preserving the distance between points.
Population structure: Sample structure due to differences in genetic ancestry among samples.
Principal components analysis: A dimensionality reduction technique used to infer continuous axes of variation in genetic data, often representing genetic ancestry.
Rank statistic: A statistic describing the rank, across markers, of association of each marker. Rank statistics can be transformed into quantiles of a standard normal distribution that can be combined with other statistics.
SNP loadings: The correlations of each SNP to a given principal component in principal components analysis. The principal component coordinates of each sample are proportional to the sum of normalized genotypes weighted by SNP loadings.
Structured association: A method for correcting for stratification in which samples are assigned to subpopulation clusters and evidence of association is stratified by cluster.
Transmission disequilibrium test: A family-based association test involving case–parent trios in which alleles transmitted from parents to children are compared with untransmitted alleles.
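
The Armitage trend test, genomic control and transmission disequilibrium test entries above each reduce to a short calculation. The following is a minimal Python sketch, assuming NumPy, additive 0/1/2 genotype coding and 0/1 case-control phenotype coding. The function names are illustrative only. The median-based inflation factor and the (b − c)²/(b + c) form of the transmission disequilibrium test are the commonly used conventions, not definitions taken from this glossary.

```python
import numpy as np

# Approximate median of a chi-square distribution with 1 degree of freedom.
# Assumption: the inflation factor is defined relative to this null median.
CHI2_1DF_MEDIAN = 0.4549364


def armitage_trend(genotypes, phenotypes):
    """Armitage trend statistic: N times the squared correlation
    between genotype (0/1/2 allele counts) and phenotype (0/1)."""
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotypes, dtype=float)
    r = np.corrcoef(g, y)[0, 1]  # Pearson correlation
    return g.size * r ** 2


def genomic_control_lambda(chi2_stats):
    """Genome-wide inflation factor: median observed chi-square
    statistic divided by the median expected under the null."""
    return float(np.median(chi2_stats) / CHI2_1DF_MEDIAN)


def tdt_statistic(transmitted, untransmitted):
    """TDT chi-square (1 degree of freedom) from heterozygous parents:
    transmitted   = times the test allele was passed to an affected child,
    untransmitted = times the other allele was passed instead."""
    b, c = float(transmitted), float(untransmitted)
    if b + c == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    return (b - c) ** 2 / (b + c)
```

In this sketch, an inflation factor close to 1 would suggest little genome-wide inflation of the association statistics, whereas values well above 1 would point to stratification or another confounder. Dividing each statistic by the inflation factor is the usual genomic control correction.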

About this article

Cite this article

Price, A., Zaitlen, N., Reich, D. et al. New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11, 459–463 (2010). https://doi.org/10.1038/nrg2813

Issue Date: July 2010
