Abstract
The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model ancestral relationships under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20,000–30,000 years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The multiple sequentially Markovian coalescent (MSMC) analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago, including the bottleneck in the peopling of the Americas and separations within Africa, East Asia and Europe.
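Focusing on the first coalescence among several haplotypes is what gives access to more recent times. A rough illustration, assuming the standard coalescent with a constant population size (a simplification that the populations studied here do not satisfy): the expected time to the first coalescence shrinks with the number of pairs of haplotypes. The effective size and generation time below are hypothetical round numbers chosen for illustration and are not taken from the paper.

```python
from math import comb


def expected_first_coalescence_years(n_haplotypes, effective_size, generation_years):
    """Expected time to the first coalescence among n haplotypes, in years.

    Assumes the standard coalescent with constant diploid effective size N:
    each pair of lineages coalesces at rate 1/(2N) per generation, so the
    first coalescence among n lineages happens at rate C(n, 2) / (2N).
    """
    pairs = comb(n_haplotypes, 2)
    generations = 2 * effective_size / pairs
    return generations * generation_years


# Hypothetical illustration values, not estimates from the paper.
N_E = 10_000
GENERATION_YEARS = 30

for n in (2, 4, 8):
    years = expected_first_coalescence_years(n, N_E, GENERATION_YEARS)
    print(f"{n} haplotypes: about {years:,.0f} years")
```

Under these assumptions, going from two to eight haplotypes moves the expected first coalescence 28 times closer to the present, because eight haplotypes form 28 pairs.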
References
Behar, D.M. et al. The dawn of human matrilineal diversity. Am. J. Hum. Genet. 82, 1130–1140 (2008).
Fu, Q. et al. Complete mitochondrial genomes reveal neolithic expansion into Europe. PLoS ONE 7, e32473 (2012).
Balaresque, P. et al. A predominantly neolithic origin for European paternal lineages. PLoS Biol. 8, e1000285 (2010).
Atkinson, Q.D., Gray, R.D. & Drummond, A.J. mtDNA variation predicts population size in humans and reveals a major Southern Asian chapter in human prehistory. Mol. Biol. Evol. 25, 468–474 (2008).
McVean, G.A.T. & Cardin, N.J. Approximating the coalescent with recombination. Phil. Trans. R. Soc. Lond. B 360, 1387–1393 (2005).
Marjoram, P. & Wall, J.D. Fast “coalescent” simulation. BMC Genet. 7, 16 (2006).
Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).
Paul, J.S., Steinrücken, M. & Song, Y.S. An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination. Genetics 187, 1115–1128 (2011).
Sheehan, S., Harris, K. & Song, Y.S. Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach. Genetics 194, 647–662 (2013).
Steinrücken, M., Paul, J.S. & Song, Y.S. A sequentially Markov conditional sampling distribution for structured populations with migration and recombination. Theor. Popul. Biol. 87, 51–61 (2013).
Drmanac, R. et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327, 78–81 (2010).
Fenner, J.N. Cross-cultural estimation of the human generation interval for use in genetics-based population divergence studies. Am. J. Phys. Anthropol. 128, 415–423 (2005).
Matsumura, S. & Forster, P. Generation time and effective population size in Polar Eskimos. Proc. Biol. Sci. 275, 1501–1508 (2008).
Kong, A. et al. Rate of de novo mutations and the importance of father's age to disease risk. Nature 488, 471–475 (2012).
Campbell, C.D. et al. Estimating the human mutation rate using autozygosity in a founder population. Nat. Genet. 44, 1277–1281 (2012).
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
Scally, A. & Durbin, R. Revising the human mutation rate: implications for understanding human evolution. Nat. Rev. Genet. 13, 745–753 (2012).
Fu, Q. et al. A revised timescale for human evolution based on ancient mitochondrial genomes. Curr. Biol. 23, 553–559 (2013).
Sun, J.X. et al. A direct characterization of human mutation based on microsatellites. Nat. Genet. 44, 1161–1165 (2012).
Altshuler, D.M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
Henn, B.M., Cavalli-Sforza, L.L. & Feldman, M.W. The great human expansion. Proc. Natl. Acad. Sci. USA 109, 17758–17764 (2012).
Mellars, P. Going east: new genetic and archaeological perspectives on the modern human colonization of Eurasia. Science 313, 796–800 (2006).
Mellars, P. Why did modern human populations disperse from Africa ca. 60,000 years ago? A new model. Proc. Natl. Acad. Sci. USA 103, 9381–9386 (2006).
Eriksson, A. et al. Late Pleistocene climate change and the global expansion of anatomically modern humans. Proc. Natl. Acad. Sci. USA 109, 16089–16094 (2012).
O'Rourke, D.H. & Raff, J.A. The human genetic history of the Americas: the final frontier. Curr. Biol. 20, R202–R207 (2010).
Goebel, T., Waters, M.R. & O'Rourke, D.H. The late Pleistocene dispersal of modern humans in the Americas. Science 319, 1497–1502 (2008).
Botigué, L.R. et al. Gene flow from North Africa contributes to differential human genetic diversity in southern Europe. Proc. Natl. Acad. Sci. USA 110, 11791–11796 (2013).
Berniell-Lee, G. et al. Genetic and demographic implications of the Bantu expansion: insights from human paternal lineages. Mol. Biol. Evol. 26, 1581–1589 (2009).
Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014).
Pickrell, J.K. et al. Ancient west Eurasian ancestry in southern and eastern Africa. Proc. Natl. Acad. Sci. USA 111, 2632–2637 (2014).
Pagani, L. et al. Ethiopian genetic diversity reveals linguistic stratification and complex influences on the Ethiopian gene pool. Am. J. Hum. Genet. 91, 83–96 (2012).
Green, R.E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010).
Gravel, S. et al. Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. USA 108, 11983–11988 (2011).
Marth, G.T. et al. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166, 351–372 (2004).
Keinan, A. et al. Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nat. Genet. 39, 1251–1255 (2007).
Schaffner, S.F. et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15, 1576–1583 (2005).
Garrigan, D. et al. Inferring human population sizes, divergence times and rates of gene flow from mitochondrial, X and Y chromosome resequencing data. Genetics 177, 2195–2207 (2007).
Plagnol, V. & Wall, J.D. Possible ancestral structure in human populations. PLoS Genet. 2, e105 (2006).
Fagundes, N.J. et al. Statistical evaluation of alternative models of human evolution. Proc. Natl. Acad. Sci. USA 104, 17614–17619 (2007).
Mathieson, I. & McVean, G. Demography and the age of rare variants. http://arxiv.org/abs/1401.4181 (2014).
Harris, K. & Nielsen, R. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 9, e1003521 (2013).
Armitage, S.J. et al. The southern route “out of Africa”: evidence for an early expansion of modern humans into Arabia. Science 331, 453–456 (2011).
Petraglia, M. et al. Middle Paleolithic assemblages from the Indian subcontinent before and after the Toba super-eruption. Science 317, 114–116 (2007).
Takahata, N. & Satta, Y. Evolution of the primate lineage leading to modern humans: phylogenetic and demographic inferences from DNA sequences. Proc. Natl. Acad. Sci. USA 94, 4811–4815 (1997).
Nachman, M.W. & Crowell, S.L. Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304 (2000).
Gronau, I. et al. Bayesian inference of ancient human demography from individual genome sequences. Nat. Genet. 43, 1031–1034 (2011).
Delaneau, O., Zagury, J.F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).
1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
Delaneau, O., Howie, B., Cox, A.J., Zagury, J.-F. & Marchini, J. Haplotype estimation using sequencing reads. Am. J. Hum. Genet. 93, 687–696 (2013).
Kidd, J.M. et al. Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation. Am. J. Hum. Genet. 91, 660–671 (2012).
Chen, G.K., Marjoram, P. & Wall, J.D. Fast and flexible simulation of DNA sequence data. Genome Res. 19, 136–142 (2009).
Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, Cambridge, UK, 1998).
Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286 (1989).
Acknowledgements
We thank A. Scally for useful comments and discussion, in particular on interpreting population divergence estimates, and the Durbin group for general discussion. S.S. thanks A. Fischer for helpful support with HMM implementation details. We thank J. Kidd, S. Gravel and C. Bustamante for making ancestry tracts for the MXL individuals available to us. S.S. acknowledges grant support from an EMBO (European Molecular Biology Organization) long-term fellowship. This work was funded by Wellcome Trust grant 098051.
Author information
Contributions
R.D. proposed the basic strategy and designed the overall study. S.S. developed the theory, implemented the algorithm and obtained results. S.S. and R.D. analyzed the results and wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Recombination rate inference from PSMC'.
MSMC for two haplotypes is a special case that we call PSMC', in contrast to PSMC, because we use SMC' (Supplementary Note) as the underlying coalescent model. Here we show the iterative estimation of the recombination rate for two demographic scenarios: (i) a constant population size and (ii) a bottleneck in the past. As can be seen, in both cases, the estimated recombination rate converges quickly to the true value with very high accuracy.
Supplementary Figure 2 Simulations of other scenarios.
See the Supplementary Note for details on these additional simulations. (a) MSMC results from simulated data that represent a much simplified CEU population size history with sharp changes. (b) Similar to a, but with a simplified YRI-like history. (c) A population split with subsequent migration. (d) A population split with subsequent changes in population size. (e) Inference from 8 and 16 haplotypes. For 16 haplotypes, we reduced the computational cost by simulating 1 Gb of sequence instead of 3 Gb and by using a coarse-grained set of parameters with 20 time intervals instead of 40.
Supplementary Figure 3 Testing singleton branch length estimates.
Here we compare MSMC estimates based on the estimates of Ts obtained via the HMM described in section 7 of the Supplementary Note (solid) with the true values of singleton branch length as output in the simulation (dotted). (a) Population size estimates. (b) Split estimates.
Supplementary Figure 4 Simulations with recombination hotspots.
To assess the effect of heterogeneous recombination rates across the genome, we simulated 100 chromosomes with 4 haplotypes of 1 Mb each (Supplementary Note). For practical reasons, we could not scale up this simulation to 3 Gb of total sequence or to more haplotypes. We used as input random chunks of the real human recombination map from the HapMap Project. (a) This plot shows the effective population size estimates from both the standard simulation and simulations with the human recombination map. We see only small effects of variable recombination rates, mostly at the two extreme ends of the estimated time interval. Some of that difference may also be caused by the much smaller total sequence length of the hotspot simulation in comparison to the standard simulation. (b) Here we show the two split scenarios at 10,000 and 100,000 years ago. Again, the differences between the hotspot simulation and the standard simulation are small.
Supplementary Figure 5 Application to unphased data.
We generated data sets in which we deliberately ‘unphased’ one or both diploid genomes in a setting of four haplotypes. (a) Plot of the population size estimates from two diploid individuals of which both are phased (red), one is unphased (blue) and both are unphased (purple). (b) Plot of the relative cross coalescence rate estimates based on similarly unphased data.
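The caption does not describe how the genomes were 'unphased'. One simple way to picture the operation, as a minimal sketch with hypothetical haplotype strings (this is not the MSMC input format), is to collapse the two haplotypes of an individual into unordered per-site genotypes, so that heterozygous sites no longer record which chromosome carries which allele.

```python
def unphase(hap_a, hap_b):
    """Collapse two phased haplotypes into unordered per-site genotypes.

    Each haplotype is a string with one allele character per site. Sorting
    the two alleles at every site discards which chromosome carried which
    allele, so heterozygous sites lose their phase while homozygous sites
    are unchanged.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    return ["".join(sorted(pair)) for pair in zip(hap_a, hap_b)]


# Hypothetical five-site example with two heterozygous sites.
phased_a = "ACGTA"
phased_b = "ATGCA"
print(unphase(phased_a, phased_b))
```

Swapping the two input haplotypes gives the same result, which is the defining property of unphased data.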
Supplementary Figure 6 Comparison of trio- versus population-based phasing and effect of unphased sites.
We tested whether results from trio-based phased data differ from population-based phased data. We checked this for CEU and YRI, for which we have trio sequences available. As shown in a and b, the trio-phased results do not differ strongly from the population-phased results (solid versus dashed lines). When using population-based phasing of our sequences, there are rare sites that are not present in the reference data set and are therefore not phased. There are two ways to handle them: (i) leave them in as unphased sites (‘all’) or (ii) remove them from the analysis (‘restricted’). The two cases are shown in a and b as solid versus dotted lines. For population size inference, removing unphased sites does not appear to improve estimates (in comparison to the trio-phased estimates), but, for the population separation analysis, removing unphased sites gives smoother estimates in the most recent times and removes some non-monotonic artifacts (in CHB/GIH).
Supplementary Figure 7 Comparisons of population size estimates with two, four and eight haplotypes.
(a) We show the estimates based on four haplotypes (solid lines) together with estimates from two haplotypes (dotted lines). For clarity, we separated the curves on the basis of African and non-African samples. (b) This plot shows estimates based on eight haplotypes (thick lines) in comparison with the estimates based on four haplotypes (thin lines).
Supplementary Figure 8 Replicate analysis with four haplotypes.
We generated a replicate set of population size and relative cross coalescence rate estimates, on the basis of the two individuals in each population not used for the main analysis, as presented in Figures 3 and 4. In both figures, the replicate estimate is shown as a dashed line, and the original estimate is shown as a solid line. For clarity, African and non-African estimates are separated.
Supplementary Figure 9 Comparison of relative cross coalescence rate estimates with four and eight haplotypes.
Here we show relative cross coalescence rate estimates based on eight haplotypes (four haplotypes from each population, in solid lines) with estimates from four haplotypes (two haplotypes from each population, in dotted lines).
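The captions do not restate how the relative cross coalescence rate is defined. A minimal sketch, assuming the usual definition as the across-population coalescence rate divided by the mean of the two within-population rates in the same time interval; the function name and the rate values are hypothetical and serve only as an illustration.

```python
def relative_cross_coalescence(lambda_11, lambda_12, lambda_22):
    """Relative cross coalescence rate for one time interval.

    lambda_11 and lambda_22 are the within-population coalescence rates and
    lambda_12 is the across-population rate. Values near 1 indicate a
    well-mixed population; values near 0 indicate full separation.
    """
    within = lambda_11 + lambda_22
    if within <= 0:
        raise ValueError("within-population rates must sum to a positive value")
    return 2.0 * lambda_12 / within


# Hypothetical rates for three time intervals, ordered from recent to ancient.
intervals = [
    (4.0e-4, 0.0, 6.0e-4),       # fully separated
    (2.0e-4, 1.0e-4, 2.0e-4),    # partially separated
    (1.0e-4, 1.0e-4, 1.0e-4),    # well mixed
]
for l11, l12, l22 in intervals:
    print(relative_cross_coalescence(l11, l12, l22))
```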
Supplementary Figure 10 Comparison with diCal.
A different method to estimate historical population sizes from multiple phased haplotypes was recently implemented in the software diCal (Supplementary Note). Here we have applied diCal to eight haplotypes, each 10 Mb long, simulated using the zigzag population size history as in Figure 2. The relatively short length of 10 Mb is the same length as used in Sheehan et al. 2013; with the current diCal implementation, analysis of larger data sets is not practical. We tested three different time intervals, using the parameter “-t”, which sets the left boundary of the last time interval in scaled units. With the “-t 1” option, diCal obtains correct estimates between 20,000 and 200,000 years ago (red curve), roughly the same period addressed by MSMC with two haplotypes. To explore the more recent times that we access with four or eight haplotypes, we tried to change the default time interval by using lower “-t” values (see Supplementary Note for details), but the resulting population size estimates were inaccurate (purple and blue lines). The method may be able to perform better if a more efficient implementation allows it to run on whole-genome-sized data sets.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–10, Supplementary Tables 1–4 and Supplementary Note (PDF 2688 kb)
About this article
Cite this article
Schiffels, S., Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat Genet 46, 919–925 (2014). https://doi.org/10.1038/ng.3015
DOI: https://doi.org/10.1038/ng.3015