Abstract
Genome-wide association (GWA) studies are an effective approach for identifying genetic variants associated with disease risk. GWA studies can be confounded by population stratification — systematic ancestry differences between cases and controls — which has previously been addressed by methods that infer genetic ancestry. Those methods perform well in data sets in which population structure is the only kind of structure present but are inadequate in data sets that also contain family structure or cryptic relatedness. Here, we review recent progress on methods that correct for stratification while accounting for these additional complexities.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Glossary
- Ancestry-informative markers
Genetic markers ascertained for large differences in allele frequency between subpopulations that are genotyped to infer genetic ancestry in new samples.
- Armitage trend test
A standard χ² (1 degree of freedom) association test computed as the number of samples times the squared correlation between genotype and phenotype.
- Cryptic relatedness
Sample structure due to distant relatedness among samples with no known family relationships.
- Differential bias
Spurious differences in allele frequencies between cases and controls due to differences in sample collection, sample preparation and/or genotyping assay procedures.
- Exome resequencing
A study design in which exon capture technologies are used to obtain resequencing data covering all exonic regions for each individual in the study.
- Family-based association tests
A class of association tests that uses families with one or more affected children as the subjects rather than unrelated cases or controls. The analysis treats the allele that is transmitted to (one or more) affected children from each parent as a 'case' and the untransmitted alleles as 'controls' to avoid the effects of population structure.
- Family structure
Sample structure due to familial relatedness among samples.
- F_ST
A measure of the genetic distance between two populations that describes the proportion of overall genetic variation that is due to differences between populations.
- Genetic drift
Random fluctuations in allele frequencies over time due to sampling effects, particularly in small populations.
- Genetic heritability
The proportion of the total phenotypic variation in a given characteristic that can be attributed to additive genetic effects. In the broad sense, heritability involves all additive and non-additive genetic variance, whereas in the narrow sense, it involves only additive genetic variance.
- Genetic matching
A method of association testing in which cases and controls are matched for genetic ancestry, as inferred by principal components analysis or other methods.
- Genomic control
A method for detecting (or detecting and correcting for) stratification based on the genome-wide inflation of association statistics.
- Mixed models
A class of models in which phenotypes are modelled using both fixed effects (candidate SNPs and fixed covariates) and random effects (the phenotypic covariance matrix).
- Multidimensional scaling
A dimensionality reduction technique, similar to principal components analysis, in which points in a high-dimensional space are projected into a lower-dimensional space while approximately preserving the distance between points.
- Population structure
Sample structure due to differences in genetic ancestry among samples.
- Principal components analysis
A dimensionality reduction technique used to infer continuous axes of variation in genetic data, often representing genetic ancestry.
- Rank statistic
A statistic describing the rank, across markers, of association of each marker. Rank statistics can be transformed into quantiles of a standard normal distribution that can be combined with other statistics.
- SNP loadings
The correlations of each SNP to a given principal component in principal components analysis. The principal component coordinates of each sample are proportional to the sum of normalized genotypes weighted by SNP loadings.
- Structured association
A method for correcting for stratification in which samples are assigned to subpopulation clusters and evidence of association is stratified by cluster.
- Transmission disequilibrium test
A family-based association test involving case–parent trios in which alleles transmitted from parents to children are compared with untransmitted alleles.
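Two of the glossary entries describe simple calculations, and a short Python sketch may make them concrete. It computes the Armitage trend statistic as the number of samples times the squared genotype-phenotype correlation, and a genomic control inflation factor from a set of 1-degree-of-freedom statistics. The function names, the 0/1/2 genotype coding, the use of the median to estimate inflation, and the choice to leave statistics unchanged when the estimated inflation is below 1 are all assumptions made for illustration; they are not taken from the article.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom (approximate).
CHI2_1DF_MEDIAN = 0.4549364


def armitage_trend_statistic(genotypes, phenotypes):
    """Return N * r^2, where r is the Pearson correlation between
    genotype (assumed coded 0/1/2) and phenotype (assumed coded 0/1)."""
    n = len(genotypes)
    if n == 0 or n != len(phenotypes):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    cov = sum((g - mean_g) * (p - mean_p)
              for g, p in zip(genotypes, phenotypes))
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_p = sum((p - mean_p) ** 2 for p in phenotypes)
    if var_g == 0 or var_p == 0:
        # Monomorphic marker or constant phenotype: no association signal.
        return 0.0
    return n * cov * cov / (var_g * var_p)


def genomic_control(chi2_stats):
    """Estimate the inflation factor as median(statistic) / median(chi2, 1 df)
    and return (inflation, corrected statistics).

    Assumption: statistics are only deflated when the estimated inflation
    exceeds 1; otherwise they are returned unchanged.
    """
    inflation = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    scale = max(inflation, 1.0)
    return inflation, [s / scale for s in chi2_stats]


if __name__ == "__main__":
    # Toy data for six samples; not drawn from any real study.
    genotypes = [0, 1, 2, 0, 1, 2]
    phenotypes = [0, 0, 1, 0, 1, 1]
    print(armitage_trend_statistic(genotypes, phenotypes))

    inflation, corrected = genomic_control([2 * CHI2_1DF_MEDIAN, 0.2, 3.0])
    print(inflation, corrected)
```

In practice the statistic would be computed for every marker and the inflation factor estimated from the genome-wide collection. A single uniform rescaling of this kind is the simplest form of the idea and does not model structure that varies from marker to marker.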
Price, A., Zaitlen, N., Reich, D. et al. New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11, 459–463 (2010). https://doi.org/10.1038/nrg2813
DOI: https://doi.org/10.1038/nrg2813