  • Technical Report
Scaling probabilistic models of genetic variation to millions of humans

Abstract

A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. The aggregated number of genotyped humans is currently on the order of millions of individuals, and existing methods do not scale to data of this size. To solve this problem, we developed TeraStructure, an algorithm to fit Bayesian models of genetic variation in structured human populations on tera-sample-sized data sets (10^12 observed genotypes; for example, 1 million individuals at 1 million SNPs). TeraStructure is a scalable approach to Bayesian inference in which subsamples of markers are used to update an estimate of the latent population structure among individuals. We demonstrate that TeraStructure performs as well as existing methods on current globally sampled data, and we show using simulations that TeraStructure continues to be accurate and is the only method that can scale to tera-sample sizes.
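The idea of updating per-individual population proportions from subsampled markers can be illustrated with a small, pure-Python sketch. This is not the authors' implementation. It assumes a PSD-style admixture model (Dirichlet proportions per individual, Beta allele frequencies per population and SNP, genotypes drawn as Binomial(2, p)) and uses textbook mean-field updates with a decaying step size. The function names, the hyperparameters a, b and c, and the step-size settings tau0 and kappa are hypothetical choices made for illustration.

```python
import math
import random


def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def softmax(scores):
    """Normalize exp(scores) stably by subtracting the maximum first."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [v / total for v in w]


def simulate_psd(n, n_snps, k, rng):
    """Draw genotypes x[i][l] in {0, 1, 2} from a PSD-style admixture model."""
    theta = []
    for _ in range(n):
        g = [rng.gammavariate(1.0, 1.0) for _ in range(k)]  # normalized gammas = Dirichlet draw
        theta.append([v / sum(g) for v in g])
    beta = [[rng.betavariate(1.0, 1.0) for _ in range(n_snps)] for _ in range(k)]
    x = []
    for i in range(n):
        row = []
        for l in range(n_snps):
            p = sum(theta[i][j] * beta[j][l] for j in range(k))
            row.append(sum(rng.random() < p for _ in range(2)))  # two allele copies
        x.append(row)
    return x, theta


def fit_svi(x, k, n_iters, rng, a=1.0, b=1.0, c=1.0, tau0=1.0, kappa=0.6):
    """Estimate per-individual proportions, visiting one randomly sampled SNP per step."""
    n, n_snps = len(x), len(x[0])
    gamma = [[c + rng.random() for _ in range(k)] for _ in range(n)]
    lam_a = [[a + rng.random() for _ in range(n_snps)] for _ in range(k)]
    lam_b = [[b + rng.random() for _ in range(n_snps)] for _ in range(k)]
    for t in range(1, n_iters + 1):
        l = rng.randrange(n_snps)  # subsample a single marker
        # E[log theta] up to a per-individual constant that cancels in the softmax.
        e_log_theta = [[digamma(g) for g in row] for row in gamma]
        for _ in range(5):  # a few local coordinate-ascent passes for SNP l
            norm = [digamma(lam_a[j][l] + lam_b[j][l]) for j in range(k)]
            ela = [digamma(lam_a[j][l]) - norm[j] for j in range(k)]
            elb = [digamma(lam_b[j][l]) - norm[j] for j in range(k)]
            phi_a = [softmax([e_log_theta[i][j] + ela[j] for j in range(k)]) for i in range(n)]
            phi_b = [softmax([e_log_theta[i][j] + elb[j] for j in range(k)]) for i in range(n)]
            for j in range(k):
                lam_a[j][l] = a + sum(x[i][l] * phi_a[i][j] for i in range(n))
                lam_b[j][l] = b + sum((2 - x[i][l]) * phi_b[i][j] for i in range(n))
        rho = (tau0 + t) ** (-kappa)  # decaying step size
        for i in range(n):
            for j in range(k):
                # Treat the sampled SNP as if it were replicated n_snps times.
                target = c + n_snps * (x[i][l] * phi_a[i][j] + (2 - x[i][l]) * phi_b[i][j])
                gamma[i][j] = (1 - rho) * gamma[i][j] + rho * target
    return [[g / sum(row) for g in row] for row in gamma]


rng = random.Random(0)
genotypes, true_theta = simulate_psd(n=30, n_snps=200, k=3, rng=rng)
theta_hat = fit_svi(genotypes, k=3, n_iters=300, rng=rng)
```

Each iteration touches only one column of the genotype matrix, so the per-step cost grows with the number of individuals rather than with the full individuals-by-SNPs product. The recovered proportions are identifiable only up to a relabeling of the populations, so any comparison against `true_theta` would need to account for permuted columns.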

Figure 1: A schematic diagram of TeraStructure, stochastic variational inference for the PSD model.
Figure 2: TeraStructure recovers the underlying per-individual population proportions on the simulated data sets generated via scenario A.
Figure 3: TeraStructure is the most accurate method for scenario B simulations.

Acknowledgements

W.H. and J.D.S. were supported in part by US NIH grant R01 HG006448 and ONR grant N00014-12-1-0764. D.M.B. is supported in part by ONR N00014-11-1-0651, DARPA FA8750-14-2-0009 and DARPA N66001-15-C-4032. We thank A. Ochoa for suggesting the design of the scenario B simulation.

Author information

Contributions

D.M.B. and J.D.S. conceived the study. P.G. implemented the algorithm. W.H. carried out simulation studies. All authors performed data analyses and methods development and wrote the manuscript.

Corresponding authors

Correspondence to David M Blei or John D Storey.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–5, Supplementary Tables 1 and 2, and Supplementary Note. (PDF 1205 kb)

About this article

Cite this article

Gopalan, P., Hao, W., Blei, D. et al. Scaling probabilistic models of genetic variation to millions of humans. Nat Genet 48, 1587–1590 (2016). https://doi.org/10.1038/ng.3710

  • DOI: https://doi.org/10.1038/ng.3710
