Explosive growth is occurring in both the quantity of molecular data being collected and the efficiency of the computational machinery commonly used to analyse those data.
One of the traditional analytical paradigms has been based on models that are designed to capture the key features of the evolutionary processes.
A variety of approaches exist, and the choice of the most appropriate method, and model, depends on the features of the problem of interest.
The rapid growth in the size of data leads to an increasing computational burden for existing methods. In many cases this burden becomes overwhelming.
This has motivated a move away from exact methods (often because exact answers cannot be calculated) and towards approximate methods. The principle is that a rough answer is better than an exact answer that cannot be computed in a reasonable time.
There will be a continuing trend to move away from exact methods and towards approximate methods as the quantity and complexity of data continue to grow.
Unfortunately, there is no 'one-size-fits-all' computational analysis method. We discuss a range of methods, but the performance of each will vary from problem to problem.
Explosive growth is occurring in the quantity, quality and complexity of molecular variation data being collected. Historically, such data have been analysed by using model-based methods. Models are useful for sharpening intuition, for explanation and for prediction: they add to our understanding of how the data were formed, and they can provide quantitative answers to questions of interest. We outline some of these model-based approaches, including the coalescent, and discuss the applicability of the computational methods that are necessary given the highly complex nature of current and future data sets.
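As an illustration of the trade-off between exact and approximate methods described above, the following is a minimal sketch of one approximate approach, rejection-based approximate Bayesian computation, applied to the standard neutral coalescent. It is not taken from the article: the uniform prior, the function names and the values of `s_obs`, `n` and `theta_max` are assumptions chosen for the example. Instead of evaluating a likelihood, it simulates the number of segregating sites for many candidate values of the scaled mutation rate and keeps those whose simulated summary matches the observed one.

```python
import random


def simulate_segregating_sites(n, theta, rng):
    """Simulate the number of segregating sites S for a sample of size n under
    the standard neutral coalescent with infinitely-many-sites mutation.

    While k lineages remain, mutations occur at total rate k*theta/2 and
    coalescences at rate k*(k-1)/2, so the next event is a mutation with
    probability theta / (theta + k - 1).
    """
    k, s = n, 0
    while k > 1:
        if rng.random() < theta / (theta + k - 1):
            s += 1          # a mutation adds one segregating site
        else:
            k -= 1          # a coalescence merges two lineages
    return s


def abc_rejection(s_obs, n, theta_max, n_draws, tol, rng):
    """Return prior draws of theta whose simulated S is within tol of s_obs."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, theta_max)   # assumed Uniform(0, theta_max) prior
        s_sim = simulate_segregating_sites(n, theta, rng)
        if abs(s_sim - s_obs) <= tol:
            accepted.append(theta)
    return accepted


# Hypothetical data: 12 segregating sites observed in a sample of 20 sequences.
rng = random.Random(1)
posterior_sample = abc_rejection(s_obs=12, n=20, theta_max=20.0,
                                 n_draws=20000, tol=0, rng=rng)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
print(len(posterior_sample), "accepted draws; posterior mean of theta is about",
      round(posterior_mean, 2))
```

The accepted values approximate the posterior distribution of the mutation rate given the summary statistic rather than the full data. Increasing `tol` raises the acceptance rate at the cost of a coarser approximation, which is the kind of accuracy-for-speed exchange the key points describe.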
The authors were supported in part by two grants from the US National Institutes of Health. S.T. is a Royal Society-Wolfson Research Merit Award holder. We thank the reviewers for helpful comments on an earlier version of the manuscript.
The authors declare no competing financial interests.
- Restriction fragment length polymorphisms
Variations between individuals in the lengths of DNA regions that are cut by a particular endonuclease.
- Microsatellite marker loci
Polymorphic loci at which short DNA sequences are repeated a varying number of times.
- Stochastic model
A model that is used to describe the behaviour of a random process.
- Coalescent
A popular probabilistic model for the evolution of 'individuals'. Individuals might be single nucleotides, mitochondrial DNA, chromosomes and so on, depending on the context.
- Selective sweep
The increase in the frequency of an allele (and closely linked chromosomal segments) that is caused by selection for the allele. Sweeps initially reduce variation and subsequently lead to increased homozygosity.
- Likelihood
The probability of the data under a particular model, viewed as a function of the parameters of that model (note that data discussed in this paper are discrete).
- Mitochondrial Eve
The most recent maternal common ancestor of the entire human mitochondrial population.
- Gene conversion
A non-reciprocal recombination process that results in the alteration of the sequence of a gene to that of its homologue during meiosis.
- Admixture
Gene flow between differentiated populations.
- Maximum likelihood
A statistical analysis in which one aims to find the parameter value that maximizes the likelihood of the data.
- Test statistic
A numerical summary of the data that is used to measure support for a null hypothesis. Either the test statistic has a known probability distribution (such as χ²) under the null hypothesis, or its null distribution is approximated computationally.
- Tajima's D
A statistic that compares the observed nucleotide diversity to what is expected under a neutral model with constant population size.
- Prior distribution
The distribution of likely parameter values before any data are examined.
- Posterior distribution
The distribution that is proportional to the product of the likelihood and prior distribution.
- Support
The range of values for which the probability is non-zero.
- Summary statistics
Statistics that try to capture a complicated data set in a simpler way. An example is the use of the number of segregating sites as a surrogate for a set of DNA fragments.
- Markov process
One in which the probability of the next state depends solely on the previous state, and not on the sequence of states before it.
- Stationarity
The state in which a process has become independent of its starting position and has settled into its long-term behaviour. In an MCMC context, the process is typically assumed to be stationary at the end of a 'burn-in' period.
- Local maxima
A local region in which a distribution takes a value that is higher than those taken at other nearby points, but which is lower than at least one value taken in some other, more distant region.
- Sufficiency
The statistic S is sufficient for the parameter η if the probability of the data, given S and η, does not depend on η.
- Haplotype
The sequence of bases along a single copy of (typically, part of) a chromosome.
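Several of the glossary terms above (summary statistics, the number of segregating sites, Tajima's D) can be made concrete with a small calculation. The sketch below is illustrative only: the four aligned sequences are invented, the function names are hypothetical, and the constants follow the standard published form of Tajima's statistic rather than anything stated on this page.

```python
from itertools import combinations
from math import sqrt

# Toy alignment (invented for illustration): four sequences of equal length.
sequences = ["AATGCA", "AATGCT", "ACTGCA", "ACTGGA"]


def segregating_sites(seqs):
    """Count alignment columns in which more than one base is observed."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """Average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def tajimas_d(seqs):
    """Tajima's D: scaled difference between nucleotide diversity and
    the segregating-sites (Watterson) estimate of the mutation rate."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if s == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (nucleotide_diversity(seqs) - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


S = segregating_sites(sequences)
pi = nucleotide_diversity(sequences)
theta_w = S / sum(1.0 / i for i in range(1, len(sequences)))
D = tajimas_d(sequences)
print("S =", S, " pi =", round(pi, 3), " theta_W =", round(theta_w, 3),
      " D =", round(D, 3))
```

Each quantity reduces the alignment to a single number, which is what makes such summaries convenient and is also why they can discard information unless they are sufficient for the parameter of interest.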
About this article
Cite this article
Marjoram, P., Tavaré, S. Modern computational approaches for analysing molecular genetic variation data. Nat Rev Genet 7, 759–770 (2006). https://doi.org/10.1038/nrg1961