Inferring human population size and separation history from multiple genome sequences

Abstract

The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model ancestral relationships under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20,000–30,000 years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The multiple sequentially Markovian coalescent (MSMC) analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago, including the bottleneck in the peopling of the Americas and separations within Africa, East Asia and Europe.
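One way to see why focusing on the first coalescence among several haplotypes reaches more recent times is a back-of-the-envelope calculation under the standard coalescent. The sketch below is illustrative only and is not the MSMC model itself: it assumes a constant population size, and the diploid size of 10,000 and the 30-year generation time are placeholder values chosen for the example, not values taken from the text above.

```python
from math import comb


def expected_first_coalescence_years(n_haplotypes, diploid_size, generation_years):
    """Expected time to the first coalescence among n haplotypes, in years.

    Assumes the standard coalescent with constant population size: the
    waiting time is exponential with rate C(n, 2) in units of 2N generations.
    """
    pairs = comb(n_haplotypes, 2)  # number of haplotype pairs that could coalesce
    return 2 * diploid_size * generation_years / pairs


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    for n in (2, 4, 8):
        years = expected_first_coalescence_years(n, 10_000, 30)
        print(f"{n} haplotypes: about {years:,.0f} years")
```

Under these assumptions the expected first coalescence time shrinks with the number of haplotype pairs, which is consistent with the abstract's claim that analyzing more individuals gives access to more recent history.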

Figure 1: MSMC locally infers branch lengths and coalescence times from observed mutations.
Figure 2: Testing MSMC on simulated data.
Figure 3: Inference of population size from whole-genome sequences.
Figure 4: Genetic separation between population pairs.

References

  1. Behar, D.M. et al. The dawn of human matrilineal diversity. Am. J. Hum. Genet. 82, 1130–1140 (2008).

  2. Fu, Q. et al. Complete mitochondrial genomes reveal neolithic expansion into Europe. PLoS ONE 7, e32473 (2012).

  3. Balaresque, P. et al. A predominantly neolithic origin for European paternal lineages. PLoS Biol. 8, e1000285 (2010).

  4. Atkinson, Q.D., Gray, R.D. & Drummond, A.J. mtDNA variation predicts population size in humans and reveals a major Southern Asian chapter in human prehistory. Mol. Biol. Evol. 25, 468–474 (2008).

  5. McVean, G.A.T. & Cardin, N.J. Approximating the coalescent with recombination. Phil. Trans. R. Soc. Lond. B 360, 1387–1393 (2005).

  6. Marjoram, P. & Wall, J.D. Fast “coalescent” simulation. BMC Genet. 7, 16 (2006).

  7. Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).

  8. Paul, J.S., Steinrücken, M. & Song, Y.S. An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination. Genetics 187, 1115–1128 (2011).

  9. Sheehan, S., Harris, K. & Song, Y.S. Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach. Genetics 194, 647–662 (2013).

  10. Steinrücken, M., Paul, J.S. & Song, Y.S. A sequentially Markov conditional sampling distribution for structured populations with migration and recombination. Theor. Popul. Biol. 87, 51–61 (2013).

  11. Drmanac, R. et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327, 78–81 (2010).

  12. Fenner, J.N. Cross-cultural estimation of the human generation interval for use in genetics-based population divergence studies. Am. J. Phys. Anthropol. 128, 415–423 (2005).

  13. Matsumura, S. & Forster, P. Generation time and effective population size in Polar Eskimos. Proc. Biol. Sci. 275, 1501–1508 (2008).

  14. Kong, A. et al. Rate of de novo mutations and the importance of father's age to disease risk. Nature 488, 471–475 (2012).

  15. Campbell, C.D. et al. Estimating the human mutation rate using autozygosity in a founder population. Nat. Genet. 44, 1277–1281 (2012).

  16. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).

  17. Scally, A. & Durbin, R. Revising the human mutation rate: implications for understanding human evolution. Nat. Rev. Genet. 13, 745–753 (2012).

  18. Fu, Q. et al. A revised timescale for human evolution based on ancient mitochondrial genomes. Curr. Biol. 23, 553–559 (2013).

  19. Sun, J.X. et al. A direct characterization of human mutation based on microsatellites. Nat. Genet. 44, 1161–1165 (2012).

  20. Altshuler, D.M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).

  21. Henn, B.M., Cavalli-Sforza, L.L. & Feldman, M.W. The great human expansion. Proc. Natl. Acad. Sci. USA 109, 17758–17764 (2012).

  22. Mellars, P. Going east: new genetic and archaeological perspectives on the modern human colonization of Eurasia. Science 313, 796–800 (2006).

  23. Mellars, P. Why did modern human populations disperse from Africa ca. 60,000 years ago? A new model. Proc. Natl. Acad. Sci. USA 103, 9381–9386 (2006).

  24. Eriksson, A. et al. Late Pleistocene climate change and the global expansion of anatomically modern humans. Proc. Natl. Acad. Sci. USA 109, 16089–16094 (2012).

  25. O'Rourke, D.H. & Raff, J.A. The human genetic history of the Americas: the final frontier. Curr. Biol. 20, R202–R207 (2010).

  26. Goebel, T., Waters, M.R. & O'Rourke, D.H. The late Pleistocene dispersal of modern humans in the Americas. Science 319, 1497–1502 (2008).

  27. Botigué, L.R. et al. Gene flow from North Africa contributes to differential human genetic diversity in southern Europe. Proc. Natl. Acad. Sci. USA 110, 11791–11796 (2013).

  28. Berniell-Lee, G. et al. Genetic and demographic implications of the Bantu expansion: insights from human paternal lineages. Mol. Biol. Evol. 26, 1581–1589 (2009).

  29. Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014).

  30. Pickrell, J.K. et al. Ancient west Eurasian ancestry in southern and eastern Africa. Proc. Natl. Acad. Sci. USA 111, 2632–2637 (2014).

  31. Pagani, L. et al. Ethiopian genetic diversity reveals linguistic stratification and complex influences on the Ethiopian gene pool. Am. J. Hum. Genet. 91, 83–96 (2012).

  32. Green, R.E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010).

  33. Gravel, S. et al. Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. USA 108, 11983–11988 (2011).

  34. Marth, G.T. et al. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166, 351–372 (2004).

  35. Keinan, A. et al. Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nat. Genet. 39, 1251–1255 (2007).

  36. Schaffner, S.F. et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15, 1576–1583 (2005).

  37. Garrigan, D. et al. Inferring human population sizes, divergence times and rates of gene flow from mitochondrial, X and Y chromosome resequencing data. Genetics 177, 2195–2207 (2007).

  38. Plagnol, V. & Wall, J.D. Possible ancestral structure in human populations. PLoS Genet. 2, e105 (2006).

  39. Fagundes, N.J. et al. Statistical evaluation of alternative models of human evolution. Proc. Natl. Acad. Sci. USA 104, 17614–17619 (2007).

  40. Mathieson, I. & McVean, G. Demography and the age of rare variants http://arxiv.org/abs/1401.4181 (2014).

  41. Harris, K. & Nielsen, R. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 9, e1003521 (2013).

  42. Armitage, S.J. et al. The southern route “out of Africa”: evidence for an early expansion of modern humans into Arabia. Science 331, 453–456 (2011).

  43. Petraglia, M. et al. Middle Paleolithic assemblages from the Indian subcontinent before and after the Toba super-eruption. Science 317, 114–116 (2007).

  44. Takahata, N. & Satta, Y. Evolution of the primate lineage leading to modern humans: phylogenetic and demographic inferences from DNA sequences. Proc. Natl. Acad. Sci. USA 94, 4811–4815 (1997).

  45. Nachman, M.W. & Crowell, S.L. Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304 (2000).

  46. Gronau, I. et al. Bayesian inference of ancient human demography from individual genome sequences. Nat. Genet. 43, 1031–1034 (2011).

  47. Delaneau, O., Zagury, J.F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).

  48. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).

  49. Delaneau, O., Howie, B., Cox, A.J., Zagury, J.-F. & Marchini, J. Haplotype estimation using sequencing reads. Am. J. Hum. Genet. 93, 687–696 (2013).

  50. Kidd, J.M. et al. Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation. Am. J. Hum. Genet. 91, 660–671 (2012).

  51. Chen, G.K., Marjoram, P. & Wall, J.D. Fast and flexible simulation of DNA sequence data. Genome Res. 19, 136–142 (2009).

  52. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, Cambridge, UK, 1998).

  53. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286 (1989).

Acknowledgements

We thank A. Scally for useful comments and discussion, in particular on interpreting population divergence estimates, and the Durbin group for general discussion. S.S. thanks A. Fischer for helpful support with HMM implementation details. We thank J. Kidd, S. Gravel and C. Bustamante for making ancestry tracts for the MXL individuals available to us. S.S. acknowledges grant support from an EMBO (European Molecular Biology Organization) long-term fellowship. This work was funded by Wellcome Trust grant 098051.

Author information

Contributions

R.D. proposed the basic strategy and designed the overall study. S.S. developed the theory, implemented the algorithm and obtained results. S.S. and R.D. analyzed the results and wrote the manuscript.

Corresponding authors

Correspondence to Stephan Schiffels or Richard Durbin.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Recombination rate inference from PSMC'.

MSMC for two haplotypes is a special case that we call PSMC', in contrast to PSMC, because we use SMC' (Supplementary Note) as the underlying coalescent model. Here we show the iterative estimation of the recombination rate for two demographic scenarios: (i) a constant population size and (ii) a bottleneck in the past. As can be seen, in both cases, the estimated recombination rate converges quickly to the true value with very high accuracy.

Supplementary Figure 2 Simulations of other scenarios.

See the Supplementary Note for details on these additional simulations. (a) MSMC results from simulated data that represent a much simplified CEU population size history with sharp changes. (b) Similar to a, but with a simplified YRI-like history. (c) A population split with subsequent migration. (d) A population split with subsequent changes in population size. (e) Inference from 8 and 16 haplotypes. For 16 haplotypes, we needed to reduce the computational complexity by reducing the simulated sequence to 1 Gb instead of 3 Gb, and we used a coarse-grained set of parameters with 20 time intervals instead of 40.

Supplementary Figure 3 Testing singleton branch length estimates.

Here we compare MSMC estimates based on the estimates of Ts obtained via the HMM described in section 7 of the Supplementary Note (solid) with the true values of singleton branch length as output in the simulation (dotted). (a) Population size estimates. (b) Split estimates.

Supplementary Figure 4 Simulations with recombination hotspots.

To assess the effect of heterogeneous recombination rates across the genome, we simulated 100 chromosomes with 4 haplotypes of 1 Mb each (Supplementary Note). For practical reasons, we could not scale up this simulation to 3 Gb of total sequence or to more haplotypes. We used as input random chunks of the real human recombination map from the HapMap Project. (a) This plot shows the effective population size estimates from both the standard simulation and simulations with the human recombination map. We see only small effects of variable recombination rates, mostly at the two extreme ends of the estimated time interval. Some of that difference may also be caused by the much smaller total sequence length of the hotspot simulation in comparison to the standard simulation. (b) Here we show the two split scenarios at 10,000 and 100,000 years ago. Again, the differences between the hotspot simulation and the standard simulation are only small.

Supplementary Figure 5 Application to unphased data.

We generated data sets in which we deliberately ‘unphased’ one or both diploid genomes in a setting of four haplotypes. (a) Plot of the population size estimates from two diploid individuals of which both are phased (red), one is unphased (blue) and both are unphased (purple). (b) Plot of the relative cross coalescence rate estimates based on similarly unphased data.
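The caption above does not say how the genomes were 'unphased'. A minimal sketch of one possible way to discard phase information is shown here, assuming VCF-style genotype strings in which `|` separates phased alleles and `/` separates unphased ones; the function names are hypothetical and this is not presented as the authors' procedure.

```python
def unphase(genotype):
    """Return an unphased copy of a VCF-style genotype string.

    '1|0' and '0|1' both become '0/1': the alleles are sorted so that
    their order no longer carries haplotype information.
    """
    alleles = sorted(genotype.split("|"))
    return "/".join(alleles)


def unphase_individual(genotypes):
    """Unphase every site of one diploid individual (hypothetical helper)."""
    return [unphase(g) for g in genotypes]


if __name__ == "__main__":
    phased = ["0|0", "1|0", "0|1", "1|1"]
    print(unphase_individual(phased))
```

Homozygous sites are unaffected by this step; only heterozygous sites lose information, which is why unphasing matters most where two haplotypes differ.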

Supplementary Figure 6 Comparison of trio- versus population-based phasing and effect of unphased sites.

We tested whether results from trio-based phased data differ from population-based phased data. We checked this for CEU and YRI, for which we have trio sequences available. As shown in (a) and (b), the trio-phased results do not differ strongly from the population-phased results (solid versus dashed lines). When using population-based phasing of our sequences, there are rare sites that are not present in the reference data set and are therefore not phased. There are two possibilities: (i) leave them in as unphased sites (‘all’) or (ii) remove them from the analysis (‘restricted’). The two cases are shown in a and b as solid versus dotted lines. As shown, for population size inference, removing unphased sites does not appear to improve estimates (in comparison to the trio-phased estimates), but, for the population separation analysis, removing unphased sites gives smoother estimates in the most recent times and removes some non-monotonic artifacts (in CHB/GIH).

Supplementary Figure 7 Comparisons of population size estimates with two, four and eight haplotypes.

(a) We show the estimates based on four haplotypes (solid lines) together with estimates from two haplotypes (dotted lines). For clarity, we separated the curves on the basis of African and non-African samples. (b) This plot shows estimates based on eight haplotypes (thick lines) in comparison with the estimates based on four haplotypes (thin lines).

Supplementary Figure 8 Replicate analysis with four haplotypes.

We generated a replicate set of population size and relative cross coalescence rate estimates, on the basis of the two individuals in each population not used for the main analysis, as presented in Figures 3 and 4. In both figures, the replicate estimate is shown as a dashed line, and the original estimate is shown as a solid line. For clarity, African and non-African estimates are separated.

Supplementary Figure 9 Comparison of relative cross coalescence rate estimates with four and eight haplotypes.

Here we show relative cross coalescence rate estimates based on eight haplotypes (four haplotypes from each population, in solid lines) with estimates from four haplotypes (two haplotypes from each population, in dotted lines).
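The relative cross coalescence rate is not defined in this section. As a hedged illustration, it is sketched below as the across-population coalescence rate divided by the mean of the two within-population rates, per time interval. That definition, the function name and the example rates are assumptions made for this sketch and are not stated in the text above.

```python
def relative_cross_coalescence_rate(within_a, within_b, across):
    """Per-interval ratio of across- to mean within-population coalescence rate.

    Assumed definition: 2 * across / (within_a + within_b). All inputs are
    sequences of positive rates, one value per time interval.
    """
    return [
        2.0 * x / (a + b)
        for a, b, x in zip(within_a, within_b, across)
    ]


if __name__ == "__main__":
    # Made-up rates for three time intervals, from recent to ancient.
    within_a = [2.0, 1.5, 1.0]
    within_b = [2.0, 0.5, 1.0]
    across = [0.0, 0.5, 1.0]
    print(relative_cross_coalescence_rate(within_a, within_b, across))
```

Under this definition a value near 1 would mean that lineages coalesce across populations as readily as within them, and a value near 0 would mean the populations are fully separated in that interval.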

Supplementary Figure 10 Comparison with diCal.

A different method to estimate historical population sizes from multiple phased haplotypes was recently implemented in the software diCal (Supplementary Note). Here we have applied diCal to eight haplotypes, each 10 Mb long, simulated using the zigzag population size history as in Figure 2. The relatively short length of 10 Mb is the same as that used in Sheehan et al. 2013; with the current diCal implementation, analysis of larger data sets is not practical. We tested three different time intervals, using the parameter “-t”, which sets the left boundary of the last time interval in scaled units. With the “-t 1” option in the plot below, diCal obtains correct estimates between 20,000 and 200,000 years ago (red curve), roughly the same period addressed by MSMC with two haplotypes. To explore the more recent times that we access with four or eight haplotypes, we tried to change the default time interval by using lower “-t” values (see Supplementary Note for details), but the resulting population size estimates were poor (purple and blue lines). The method may be able to perform better if a more efficient implementation allows it to run on whole-genome-sized data sets.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10, Supplementary Tables 1–4 and Supplementary Note (PDF 2688 kb)

About this article

Cite this article

Schiffels, S., Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat Genet 46, 919–925 (2014). https://doi.org/10.1038/ng.3015

  • DOI: https://doi.org/10.1038/ng.3015
