High-throughput sequencing is enabling massively large catalogues of DNA sequence variation to be collected in geographically diverse human populations. Such data sets contain considerable information about human history but are complex and require careful analysis.
Quality control and exploratory data analyses are critical when working with large-scale sequencing data sets and help to identify features of the data that may complicate downstream inferences.
Functional and comparative genomics data (such as sequence conservation, chromatin immunoprecipitation followed by sequencing (ChIP–seq) and DNase I hypersensitivity) can be leveraged to mitigate the confounding effect of natural selection when inferring demographic models.
A large number of flexible and sophisticated methods have been developed that allow specific and detailed demographic inferences to be made. The appropriate method to use depends on the specific hypothesis or question being asked, and the underlying assumptions of a given method should be carefully considered.
As sample sizes become increasingly large, inferences about specific aspects of breeding structure and demography may be possible. However, these methods are still in their infancy and require substantial theoretical and methodological development.
The increasing availability of ancient DNA from modern and archaic humans provides exciting new possibilities to refine parameters of human evolutionary history, although new methodological development is needed to fully realize the potential of these data.
The genomes of contemporary humans contain considerable information about the history of our species. Although the general contours of human evolutionary history have been defined with increasing resolution throughout the past several decades, the continuing deluge of massively large sequencing data sets presents new opportunities and challenges for understanding human evolutionary history. Here, we review the signatures that demographic history imparts on patterns of DNA sequence variation, statistical methods that have been developed to leverage information contained in genome-scale data sets and insights gleaned from these studies. We also discuss the importance of using exploratory analyses to assess data quality, the strengths and limitations of commonly used population genomics methods, and factors that confound population genomics inferences.
The author would like to acknowledge Kelley Harris for helpful discussions regarding SMC-based approaches.
J.M.A. is a paid consultant of Glenview Capital. J.G.S. declares no competing interests.
- Exploratory data analyses
(EDA). The initial stages of 'digging into' a data set, usually by plotting low-dimensional summaries of the data.
- Likelihoods
The probabilities of the data given various models and their parameters, thought of as functions of those parameters. The parameter values that maximize the probability of the data in each model are called maximum likelihood estimates.
- Eigenvectors
Vectors that, when multiplied by a given matrix, are only rescaled and so still point along the same direction.
- Covariance matrix
An n×n matrix describing the covariance between each pair in a sample of size n.
- Panmictic population
A group of individuals among whom random mating occurs.
- Linkage disequilibrium
(LD). Nonrandom association between alleles at physically distinct genomic loci. Over time, this will be broken down by recombination.
- Coalescence times
The times in the past when genomic regions shared a common genetic ancestor.
- Isolation by distance
Genetic differentiation between individuals induced by geographic separation. Individuals that are closer geographically will be closer genetically.
- Overfitting
As more parameters are added to a model, it begins to fit the noise in the observed data rather than the true underlying mechanism of data generation. Overfit models generalize poorly to new data sets.
- Cross-validation error
The error in predicting the structure of a held-out portion of the data, when a model is trained on a subset of the whole data set. Minimizing cross-validation error is an effective way to choose parameters and hyperparameters.
- Ancestral recombination graphs
Graph structures representing the genealogical history of a sample with a recombining genome. In addition to coalescence events (which bring two lineages together and therefore reduce the number of lineages in the graph), recombination events cause splits to occur, which increases the number of lineages in the graph.
- Hidden Markov model
(HMM). A statistical model in which a set of underlying hidden states are assumed to follow Markov chain dynamics and induce a set of observed states.
- Reference panel
A large number of individuals, related to samples of interest, for which some quality is known (for example, allelic phase).
- Effective population size
The size that a theoretical population evolving under a Wright–Fisher model would need to be in order to match aspects of the observed genetic data.
- Poisson process
A stochastic process in which new events occur at a constant rate per unit of time. Often used to model mutation.
- Identity by descent
(IBD). Whether a genomic region has descended from an ancestor unchanged. A genomic region in two (or more) individuals is identical by descent if it is inherited from a common ancestor without being broken up by recombination. Some authors require IBD segments to also be identical by state, that is, to also have no mutations in the region.
- Identity by state
(IBS). Whether a genomic region has the same sequence as the corresponding region in another individual. A genomic region in two (or more) individuals is identical by state if it contains no mutations that distinguish the two individuals. Note that a region of IBS is not necessarily also identical by descent.
- SMC′
A modification to the sequential Markov coalescent (SMC) that allows for hidden recombination events that do not change the local genealogy.
- Conditionally sampled alleles
Alleles that are sampled from a population given that a set of reference alleles is already in hand.
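Several of the glossary terms above (exploratory data analyses, covariance matrix and eigenvectors) come together in principal component analysis of genotype data. The following is a minimal sketch only, not a method taken from the article: the function names `genotype_pca` and `simulate_two_populations`, the simulation settings and the choice of normalization are all illustrative assumptions. It builds the n×n covariance matrix between individuals and uses its leading eigenvectors as low-dimensional summaries of the sample.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Return the top eigenvalues and eigenvectors of the n x n covariance
    matrix between individuals.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.
    """
    # Copy to float so the caller's array is not modified.
    g = np.array(genotypes, dtype=float)
    # Centre each SNP (column) on its mean allele count.
    g -= g.mean(axis=0)
    n_snps = g.shape[1]
    # n x n covariance between each pair of individuals.
    cov = g @ g.T / n_snps
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]


def simulate_two_populations(n_per_pop=20, n_snps=500, seed=0):
    """Hypothetical toy data: two populations whose allele frequencies
    differ by random drift-like perturbations (an assumption for illustration)."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_snps)
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_snps), 0.01, 0.99)
    pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    return np.vstack([pop_a, pop_b])


if __name__ == "__main__":
    genotypes = simulate_two_populations()
    eigvals, eigvecs = genotype_pca(genotypes)
    # Coordinates of each individual on the first principal component.
    print(eigvecs[:, 0])
```

In this toy setting the first eigenvector is expected to separate the two simulated groups, which is the kind of low-dimensional summary that would be plotted in an exploratory analysis. Real data sets would usually also need per-SNP variance scaling, linkage disequilibrium pruning and handling of missing genotypes, none of which is attempted here.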
About this article
Cite this article
Schraiber, J., Akey, J. Methods and models for unravelling human evolutionary history. Nat Rev Genet 16, 727–740 (2015). https://doi.org/10.1038/nrg4005