Abstract
A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. The aggregated number of genotyped humans is currently on the order of millions of individuals, and existing methods do not scale to data of this size. To solve this problem, we developed TeraStructure, an algorithm to fit Bayesian models of genetic variation in structured human populations on tera-sample-sized data sets (10^12 observed genotypes; for example, 1 million individuals at 1 million SNPs). TeraStructure is a scalable approach to Bayesian inference in which subsamples of markers are used to update an estimate of the latent population structure among individuals. We demonstrate that TeraStructure performs as well as existing methods on current globally sampled data, and we show using simulations that TeraStructure continues to be accurate and is the only method that can scale to tera-sample sizes.
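The idea of updating a per-individual estimate from subsampled markers can be illustrated with a toy sketch. This is not TeraStructure's actual update, which fits a Bayesian admixture model; it only shows the general pattern of stochastic approximation, here estimating each individual's mean allele dosage by looking at one randomly chosen marker per iteration. The function names, the starting value, and the step-size constants `tau` and `kappa` are assumptions made for illustration.

```python
import random


def step_size(t, tau=1.0, kappa=0.6):
    """Decaying step size (t + tau)^(-kappa); tau and kappa are assumed values."""
    return (t + tau) ** (-kappa)


def stochastic_fit(genotypes, n_iters=2000, seed=0):
    """Estimate each individual's mean allele dosage from one marker per step.

    genotypes[i][l] is the 0/1/2 genotype of individual i at marker l.
    Each iteration touches a single marker, never the full matrix.
    """
    rng = random.Random(seed)
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    estimate = [1.0] * n_ind  # hypothetical starting value for every individual
    for t in range(1, n_iters + 1):
        marker = rng.randrange(n_snp)  # subsample one marker uniformly at random
        rho = step_size(t)
        for i in range(n_ind):
            noisy = genotypes[i][marker]  # unbiased but noisy one-marker estimate
            # Blend the running estimate with the noisy one-marker estimate.
            estimate[i] = (1.0 - rho) * estimate[i] + rho * noisy
    return estimate


if __name__ == "__main__":
    toy = [[0] * 50, [2] * 50, [0, 2] * 25]  # three individuals, 50 markers
    print(stochastic_fit(toy))
```

Because each step reads one column of the genotype matrix, the cost per iteration grows with the number of individuals but not with the number of markers, which is the property that makes subsampling attractive at large scale.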
Acknowledgements
W.H. and J.D.S. were supported in part by US NIH grant R01 HG006448 and ONR grant N00014-12-1-0764. D.M.B. is supported in part by ONR N00014-11-1-0651, DARPA FA8750-14-2-0009 and DARPA N66001-15-C-4032. We thank A. Ochoa for suggesting the design of the scenario B simulation.
Author information
Contributions
D.M.B. and J.D.S. conceived the study. P.G. implemented the algorithm. W.H. carried out simulation studies. All authors performed data analyses and methods development and wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–5, Supplementary Tables 1 and 2, and Supplementary Note. (PDF 1205 kb)
About this article
Cite this article
Gopalan, P., Hao, W., Blei, D. et al. Scaling probabilistic models of genetic variation to millions of humans. Nat Genet 48, 1587–1590 (2016). https://doi.org/10.1038/ng.3710
DOI: https://doi.org/10.1038/ng.3710
This article is cited by
- Stochastic variational variable selection for high-dimensional microbiome data. Microbiome (2022)
- Putative variants, genetic diversity and population structure among Soybean cultivars bred at different ages in Huang-Huai-Hai region. Scientific Reports (2022)
- Fast and accurate population admixture inference from genotype data from a few microsatellites to millions of SNPs. Heredity (2022)
- Population genetic considerations for using biobanks as international resources in the pandemic era and beyond. BMC Genomics (2021)
- Fine population structure analysis method for genomes of many. Scientific Reports (2017)