Abstract
Estimating individual ancestry is important in genetic association studies where population structure leads to false positive signals, although assigning ancestry remains challenging with targeted sequence data. We propose a new method for the accurate estimation of individual genetic ancestry, based on direct analysis of off-target sequence reads, and implement our method in the publicly available LASER software. We validate the method using simulated and empirical data and show that the method can accurately infer worldwide continental ancestry when used with sequencing data sets with whole-genome shotgun coverage as low as 0.001×. For estimates of fine-scale ancestry within Europe, the method performs well with coverage of 0.1×. On an even finer scale, the method improves discrimination between exome-sequenced study participants originating from different provinces within Finland. Finally, we show that our method can be used to improve case-control matching in genetic association studies and to reduce the risk of spurious findings due to population structure.
Acknowledgements
We thank investigators from the FUSION study and the GoT2D Sequencing Project for generously sharing whole-genome and deep exome sequence data for 941 individuals before publication and the D2D, Finrisk 2002, Health 2000, Action LADA and Savitaipale studies for providing some of the FUSION-sequenced DNA. We thank J.Z. Li for his assistance with the HGDP data set, H. Stringham and A. Locke for assistance with the FUSION data set and M. Brooks for organizing the macular degeneration samples. C.W. acknowledges funding support from a Howard Hughes Medical Institute International Student Research Fellowship. This study is supported by the US National Institutes of Health (DK062370, HG000376, HG005552, HG006513, EY022005, HG007022, HG005855, HG003079, CA076404 and CA134294) and by the National Eye Institute Intramural Research Program.
Author information
Contributions
C.W., X.Z., S.Z. and G.R.A. conceived and implemented the approach. X.L. provided critical feedback on methodology and simulations. J.B.-G., D.S., E.Y.C., K.E.B., J.H., R.F., R.K.W., E.R.M. and A.S. contributed the macular degeneration targeted sequencing data. H.M.K. and FUSION collaborators contributed the Finnish exome sequence data. C.W. and G.R.A. wrote the first draft of the manuscript. All authors reviewed, revised and contributed critical feedback to the manuscript and presentation.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Additional information
Full lists of members and affiliations appear in the Supplementary Note.
Integrated supplementary information
Supplementary Figure 1 Off-target coverage for 410 samples from the 1000 Genomes exon project.
The off-target coverage for each sample is calculated by averaging across 632,958 loci in HGDP. For 270 loci that appear in the targeted regions, we set the coverage at these loci to 0 for all samples. Mean off-target coverage is 0.096× across the HGDP loci.
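The averaging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout and the toy depths are assumptions made for the example. Loci inside targeted regions contribute a depth of 0 but still count in the denominator, as the caption states.

```python
def mean_off_target_coverage(depth_by_locus, targeted_loci):
    """Average read depth over reference-panel loci, treating targeted loci as depth 0.

    depth_by_locus: dict mapping a locus ID to the read depth observed in one sample.
    targeted_loci:  set of locus IDs that fall inside the targeted regions.
    Both structures are hypothetical; real depths would come from alignment files.
    """
    if not depth_by_locus:
        raise ValueError("no loci supplied")
    total = 0
    for locus, depth in depth_by_locus.items():
        # Targeted loci are masked: they add 0 to the sum but stay in the denominator.
        if locus not in targeted_loci:
            total += depth
    return total / len(depth_by_locus)


# Toy example with made-up depths: five loci, one of which ("rs4") is on target.
depths = {"rs1": 0, "rs2": 1, "rs3": 0, "rs4": 40, "rs5": 1}
targeted = {"rs4"}
print(mean_off_target_coverage(depths, targeted))  # prints 0.4
```

Masking rather than dropping the targeted loci keeps the denominator equal to the full panel size, so values remain comparable across samples.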
Supplementary Figure 2 Estimation of worldwide ancestry for 410 samples in the 1000 Genomes exon project.
The SNP genotypes of these samples are from the HapMap Project. We used all HGDP individuals as the reference panel, as labeled by colored points. (A,B) Results based on SNPs that were genotyped in both HapMap 3 and HGDP. (C,D) Results based on off-target sequence data. Procrustes similarity to the SNP-based coordinates is t0 = 0.9955; r² = 0.9950, 0.9871, 0.9439 and 0.7747 for PC1, PC2, PC3 and PC4, respectively.
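For readers unfamiliar with the Procrustes similarity statistic reported in these captions, the sketch below shows one way to compute it for two matched sets of two-dimensional coordinates. It assumes the common definition t0 = sqrt(1 − D), where D is the minimized sum of squared distances after both configurations are centered, scaled to unit norm and optimally rotated or reflected. The function name is hypothetical, the restriction to two dimensions is a simplification for the example, and this is not the LASER implementation.

```python
import math


def procrustes_similarity_2d(X, Y):
    """Procrustes similarity t0 between two matched lists of (x, y) points.

    Assumes t0 = sqrt(1 - D) with D the minimized, normalized sum of squared
    distances; under that definition t0 equals the sum of the singular values
    of A^T B, where A and B are the centered, unit-norm configurations.
    """
    if len(X) != len(Y) or len(X) < 2:
        raise ValueError("need two matched point sets with at least 2 points")

    def center_and_scale(P):
        n = len(P)
        mx = sum(p[0] for p in P) / n
        my = sum(p[1] for p in P) / n
        centered = [(p[0] - mx, p[1] - my) for p in P]
        norm = math.sqrt(sum(x * x + y * y for x, y in centered))
        if norm == 0:
            raise ValueError("all points coincide")
        return [(x / norm, y / norm) for x, y in centered]

    A, B = center_and_scale(X), center_and_scale(Y)
    # Entries of the 2x2 cross-product matrix M = A^T B.
    m00 = sum(a[0] * b[0] for a, b in zip(A, B))
    m01 = sum(a[0] * b[1] for a, b in zip(A, B))
    m10 = sum(a[1] * b[0] for a, b in zip(A, B))
    m11 = sum(a[1] * b[1] for a, b in zip(A, B))
    frob_sq = m00 ** 2 + m01 ** 2 + m10 ** 2 + m11 ** 2
    det = m00 * m11 - m01 * m10
    # For a 2x2 matrix, (s1 + s2)^2 = ||M||_F^2 + 2|det M|.
    return math.sqrt(frob_sq + 2 * abs(det))


# Toy check: Y is X rotated by 90 degrees, scaled by 3 and shifted,
# so the two configurations should be judged essentially identical.
X = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
Y = [(-3 * y + 5, 3 * x - 2) for x, y in X]
print(round(procrustes_similarity_2d(X, Y), 6))
```

A value near 1 indicates that one configuration is almost a rotated, reflected, scaled and shifted copy of the other; lower values indicate larger residual differences.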
Supplementary Figure 3 Off-target coverage for 3,159 samples from the AMD study.
The red line indicates off-target coverage averaged across the 632,958 loci included in HGDP. The blue line indicates off-target coverage averaged across the 318,682 loci that are included in POPRES. For loci that appear in the targeted regions, we set the coverage at these loci to 0 for all samples, including 215 loci in HGDP and 113 loci in POPRES. Mean off-target coverage is 0.224× across the HGDP loci and 0.241× across the POPRES loci.
Supplementary Figure 4 Estimation of ancestry for 3,159 samples in the AMD targeted sequencing data set.
(A,B) Results based on the HGDP reference panel, whose colors and symbols follow Supplementary Figure 2. AMD samples are displayed in black, with different symbols representing possible ancestries based on their estimated PC coordinates. Two HapMap trios are labeled in gray. (C,D) Results based on the POPRES reference panel. Panel C displays PC1 and PC2 of POPRES; panel D displays 3,072 AMD samples on top of the POPRES samples. These samples are possibly of European or Middle Eastern ancestry, as indicated in panels A and B. Population labels for the POPRES samples are as follows: AL, Albania; AT, Austria; BA, Bosnia and Herzegovina; BE, Belgium; BG, Bulgaria; CH-F, Swiss French; CH-G, Swiss German; CH-I, Swiss Italian; CY, Cyprus; CZ, Czech Republic; DE, Germany; DK, Denmark; ES, Spain; FI, Finland; FR, France; GB, United Kingdom; GR, Greece; HR, Croatia; HU, Hungary; IE, Ireland; IT, Italy; KS, Kosovo; LV, Latvia; MK, Macedonia; NL, Netherlands; NO, Norway; PL, Poland; PT, Portugal; RO, Romania; RU, Russia; Sct, Scotland; SE, Sweden; SI, Slovenia; SK, Slovakia; TR, Turkey; UA, Ukraine; YG, Serbia and Montenegro.
Supplementary Figure 5 Sequence-based coordinates and SNP-based coordinates for 931 AMD samples when using the HGDP reference panel.
Colors and symbols for HGDP and AMD samples follow Supplementary Figure 2. (A,B) Results based on 45,700 SNPs that are shared by the HGDP, POPRES and AMD SNP data sets. (C,D) Results based on off-target sequence data. The Procrustes similarity between results in panels A and B and in panels C and D is t0 = 0.9068; r² = 0.9104, 0.8881, 0.6031 and 0.1828 for PC1, PC2, PC3 and PC4, respectively.
Supplementary Figure 6 Sequence-based coordinates and SNP-based coordinates for AMD samples when using the POPRES reference panel.
We only included 928 AMD samples whose genotype data are available and who might be European or Middle Eastern according to results in Supplementary Figure 5. (A) Results based on 45,700 SNPs that are shared by the HGDP, POPRES and AMD SNP data sets. (B) Results based on off-target sequence data. The Procrustes similarity between results in panels A and B is t0 = 0.9209; r² = 0.9557 and 0.6389 for PC1 and PC2, respectively.
Supplementary Figure 7 Results for simulated exome sequencing data for 385 POPRES samples.
(A) Coordinates estimated from SNP genotypes at 2,547 on-target loci. Procrustes similarity to the SNP-based coordinates in Figure 3A is t0 = 0.5031. (B) Coordinates estimated based on off-target sequence reads (t0 = 0.9467). (C) Coordinates estimated based on sequence reads from both off-target and on-target regions (t0 = 0.9669). Mean coverage is ~88.9× and ~1.0× for on-target and off-target regions, respectively.
Supplementary Figure 8 Different strategies for sampling 1,280 cases.
(A) Sampling from two 8 × 8 grids along one side, with ten cases from each grid point. (B) Sampling from two 8 × 8 grids along the diagonal, with ten cases from each grid point. (C) Sampling from one 8 × 8 grid at the corner, with 20 cases from each grid point. (D) Sampling from one 8 × 8 grid at the center, with 20 cases from each grid point.
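As a quick arithmetic check, each of the four strategies above yields the same total of 1,280 cases. The short sketch below recomputes the totals; the function name and the tuple layout are assumptions made for the example, and the placement of the grids within the larger sampling area is not modeled.

```python
def total_cases(n_grids, grid_side, cases_per_point):
    """Total number of cases drawn from square grids of sampling points."""
    return n_grids * grid_side * grid_side * cases_per_point


# (number of grids, grid side length, cases per grid point) for panels A-D.
strategies = {
    "A": (2, 8, 10),  # two 8 x 8 grids along one side
    "B": (2, 8, 10),  # two 8 x 8 grids along the diagonal
    "C": (1, 8, 20),  # one 8 x 8 grid at the corner
    "D": (1, 8, 20),  # one 8 x 8 grid at the center
}
for label, params in strategies.items():
    print(f"{label}: {total_cases(*params)} cases")
```

Holding the total fixed means that differences between strategies reflect where the cases are sampled, not how many are sampled.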
Supplementary Figure 9 Improvement of estimation by using coordinates averaged across multiple runs of LASER on the same data set.
The x axis indicates the number of runs used in calculating mean PC coordinates. The y axis indicates Procrustes similarity t0 between the mean coordinates and the SNP-based coordinates. Each box represents the distribution of t0 obtained from 15 repeated runs. (A) Results on sequence data of worldwide samples simulated from the genotypes of 238 HGDP individuals, using the other 700 HGDP individuals as the reference panel. We tested on three simulated data sets with coverage of 0.001×, 0.002× and 0.004×. (B) Results on sequence data of European samples simulated from the genotypes of 385 POPRES individuals, using the other 1,000 POPRES individuals as the reference panel. We tested on three simulated data sets with coverage of 0.10×, 0.20× and 0.40×. We only used one iteration in our examples of the 1000 Genomes and AMD targeted sequencing data because most samples have relatively high off-target coverage, such that improvement by using multiple iterations is small.
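Averaging coordinates across runs, as described above, can be sketched as an element-wise mean. The example assumes that every run reports coordinates for the same samples, in the same order and in the same reference PC space, so that they can be averaged directly; the function name and data layout are hypothetical and the numbers are made up.

```python
def mean_coordinates(runs):
    """Element-wise mean of per-sample PC coordinates across repeated runs.

    runs: list of runs; each run is a list of coordinate tuples, one per sample,
    with samples in the same order in every run (an assumption of this sketch).
    """
    if not runs:
        raise ValueError("no runs supplied")
    n_samples = len(runs[0])
    if any(len(run) != n_samples for run in runs):
        raise ValueError("every run must cover the same samples")
    n_runs = len(runs)
    averaged = []
    for i in range(n_samples):
        per_run = [run[i] for run in runs]
        # zip(*per_run) groups the values of each PC across runs.
        averaged.append(tuple(sum(values) / n_runs for values in zip(*per_run)))
    return averaged


# Two made-up runs, two samples, two PCs each.
runs = [
    [(1.0, 2.0), (0.0, 0.0)],
    [(3.0, 4.0), (2.0, -2.0)],
]
print(mean_coordinates(runs))  # prints [(2.0, 3.0), (1.0, -1.0)]
```

Because run-to-run variation is largest at very low coverage, the benefit of averaging is expected to shrink as coverage rises, consistent with the caption's remark about the 1000 Genomes and AMD data.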
Supplementary Figure 10 Data processing procedures for the HGDP and POPRES data sets.
(A) The HGDP data set. (B) The POPRES data set.
Supplementary Figure 11 Data processing procedures for the HapMap 3 and AMD SNP data sets.
(A) The HapMap 3 data set. (B) The AMD SNP data set.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–11, Supplementary Tables 1–9 and Supplementary Note (PDF 3687 kb)
About this article
Cite this article
Wang, C., Zhan, X., Bragg-Gresham, J. et al. Ancestry estimation and control of population stratification for sequence-based association studies. Nat Genet 46, 409–415 (2014). https://doi.org/10.1038/ng.2924
DOI: https://doi.org/10.1038/ng.2924