Scaling probabilistic models of genetic variation to millions of humans

Gopalan, Prem; Hao, Wei; Blei, David M; Storey, John D

doi:10.1038/ng.3710

Technical Report
Published: 07 November 2016

Prem Gopalan¹, Wei Hao², David M Blei³,⁴ & John D Storey²

Nature Genetics volume 48, pages 1587–1590 (2016)

Abstract

A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. The aggregated number of genotyped humans is currently on the order of millions of individuals, and existing methods do not scale to data of this size. To solve this problem, we developed TeraStructure, an algorithm to fit Bayesian models of genetic variation in structured human populations on tera-sample-sized data sets (10¹² observed genotypes; for example, 1 million individuals at 1 million SNPs). TeraStructure is a scalable approach to Bayesian inference in which subsamples of markers are used to update an estimate of the latent population structure among individuals. We demonstrate that TeraStructure performs as well as existing methods on current globally sampled data, and we show using simulations that TeraStructure continues to be accurate and is the only method that can scale to tera-sample sizes.
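
To make the scale and the subsampling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard PSD admixture formulation, in which each genotype is a Binomial(2, p) draw with p a mixture of per-population allele frequencies weighted by the individual's population proportions. The function names `simulate_psd` and `sample_snp_column`, the Dirichlet parameter `alpha` and the uniform Beta(1, 1) prior on allele frequencies are illustrative assumptions and do not come from the article.

```python
import random


def simulate_psd(n_individuals, n_snps, n_populations, alpha=0.5, seed=0):
    """Draw genotypes from a PSD-style admixture model (illustrative only)."""
    rng = random.Random(seed)

    # theta[i][k]: share of individual i's genome attributed to population k.
    # A symmetric Dirichlet draw is obtained by normalizing Gamma variates.
    theta = []
    for _ in range(n_individuals):
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_populations)]
        total = sum(draws)
        theta.append([d / total for d in draws])

    # beta[k][l]: allele frequency of SNP l in population k.
    # Beta(1, 1) is an assumed prior, chosen only for simplicity.
    beta = [[rng.betavariate(1.0, 1.0) for _ in range(n_snps)]
            for _ in range(n_populations)]

    # genotypes[i][l]: number of copies (0, 1 or 2) of the counted allele,
    # i.e. two Bernoulli(p) draws with p = sum_k theta[i][k] * beta[k][l].
    genotypes = []
    for i in range(n_individuals):
        row = []
        for l in range(n_snps):
            p = sum(theta[i][k] * beta[k][l] for k in range(n_populations))
            row.append(sum(rng.random() < p for _ in range(2)))
        genotypes.append(row)
    return theta, beta, genotypes


def sample_snp_column(genotypes, rng):
    """Pick one SNP uniformly at random; return its index and genotype column."""
    l = rng.randrange(len(genotypes[0]))
    return l, [row[l] for row in genotypes]


# Size of the "tera-sample" regime mentioned in the abstract:
# 1 million individuals at 1 million SNPs.
N_OBSERVED_GENOTYPES = 1_000_000 * 1_000_000  # equals 10**12

if __name__ == "__main__":
    theta, beta, genotypes = simulate_psd(n_individuals=4, n_snps=6, n_populations=2)
    snp_index, column = sample_snp_column(genotypes, random.Random(1))
    print("sampled SNP", snp_index, "genotypes", column)
```

A subsampling scheme of the kind the abstract describes would repeatedly draw a column such as the one returned by `sample_snp_column` and use it to refine the estimate of `theta` for every individual, so that no single update needs to touch all 10¹² genotypes. The update rule itself is not sketched here.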

**Figure 1: A schematic diagram of TeraStructure, stochastic variational inference for the PSD model.**

**Figure 2: TeraStructure recovers the underlying per-individual population proportions on the simulated data sets generated via scenario A.**

**Figure 3: TeraStructure is the most accurate method for scenario B simulations.**

Acknowledgements

W.H. and J.D.S. were supported in part by US NIH grant R01 HG006448 and ONR grant N00014-12-1-0764. D.M.B. is supported in part by ONR N00014-11-1-0651, DARPA FA8750-14-2-0009 and DARPA N66001-15-C-4032. We thank A. Ochoa for suggesting the design of the scenario B simulation.

Author information

Authors and Affiliations

Department of Computer Science, Princeton University, Princeton, New Jersey, USA
Prem Gopalan
Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
Wei Hao & John D Storey
Department of Statistics, Columbia University, New York, New York, USA
David M Blei
Department of Computer Science, Columbia University, New York, New York, USA
David M Blei

Contributions

D.M.B. and J.D.S. conceived the study. P.G. implemented the algorithm. W.H. carried out simulation studies. All authors performed data analyses and methods development and wrote the manuscript.

Corresponding authors

Correspondence to David M Blei or John D Storey.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–5, Supplementary Tables 1 and 2, and Supplementary Note. (PDF 1205 kb)

About this article

Cite this article

Gopalan, P., Hao, W., Blei, D. et al. Scaling probabilistic models of genetic variation to millions of humans. Nat Genet 48, 1587–1590 (2016). https://doi.org/10.1038/ng.3710

Received: 28 May 2015
Accepted: 04 October 2016
Published: 07 November 2016
Issue Date: December 2016
DOI: https://doi.org/10.1038/ng.3710
