Modern computational approaches for analysing molecular genetic variation data

Key Points

  • An explosive growth is occurring in both the quantity of molecular data that are being collected and the efficiency of the computational machinery that is commonly used to analyse those data.

  • One of the traditional analytical paradigms has been based on models that are designed to capture the key features of the evolutionary processes.

  • A variety of approaches exist, and the choice of the most appropriate method, and model, depends on the features of the problem of interest.

  • The rapid growth in the size of data leads to an increasing computational burden for existing methods. In many cases this burden becomes overwhelming.

  • This has motivated a move away from exact methods (often because exact answers cannot be calculated) and towards more approximate methods. The principle is that it is better to obtain a rough answer than to seek an exact answer that cannot be computed in a reasonable time.

  • There will be a continuing trend to move away from exact methods and towards approximate methods as the quantity and complexity of data continue to grow.

  • Unfortunately, there is no 'one-size-fits-all' computational analysis method. We discuss a range of methods, but the performance of each will vary from problem to problem.

Abstract

An explosive growth is occurring in the quantity, quality and complexity of molecular variation data that are being collected. Historically, such data have been analysed by using model-based methods. Models are useful for sharpening intuition, for explanation and for prediction: they add to our understanding of how the data were formed, and they can provide quantitative answers to questions of interest. We outline some of these model-based approaches, including the coalescent, and discuss the applicability of the computational methods that are necessary given the highly complex nature of current and future data sets.

Acknowledgements

The authors were supported in part by two grants from the US National Institutes of Health. S.T. is a Royal Society-Wolfson Research Merit Award holder. We thank the reviewers for helpful comments on an earlier version of the manuscript.

Author information

Corresponding author

Correspondence to Simon Tavaré.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

CODA

FPG

Genetree

LAMARC v 2.0

SIMCOAL

simuPOP

The International HapMap Project

Glossary

Restriction fragment length polymorphisms

Variations between individuals in the lengths of DNA regions that are cut by a particular endonuclease.

Microsatellite marker loci

Polymorphic loci at which short DNA sequences are repeated a varying number of times.

Stochastic model

A model that is used to describe the behaviour of a random process.

Coalescent

A popular probabilistic model for the evolution of 'individuals'. Individuals might be single nucleotides, mitochondrial DNA, chromosomes and so on, depending on the context.
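
As a rough illustration of this entry, the sketch below simulates the time to the most recent common ancestor under the textbook coalescent. It assumes neutrality, constant population size and no recombination, and it measures time in units in which each pair of lineages coalesces at rate 1. The function names `simulate_tmrca` and `mean_tmrca` are hypothetical and are not taken from the article or from any of the software it lists.

```python
import random


def simulate_tmrca(n, rng):
    """Return one simulated time to the most recent common ancestor
    for a sample of n lineages, in coalescent time units (assumption:
    neutral, constant-size population, no recombination)."""
    t = 0.0
    k = n
    while k > 1:
        # With k lineages there are k*(k-1)/2 pairs, each coalescing at rate 1,
        # so the waiting time to the next coalescence is exponential with this rate.
        rate = k * (k - 1) / 2.0
        t += rng.expovariate(rate)
        k -= 1  # one coalescence merges two lineages into one
    return t


def mean_tmrca(n, replicates=20000, seed=1):
    """Average the simulated TMRCA over many independent genealogies."""
    rng = random.Random(seed)
    return sum(simulate_tmrca(n, rng) for _ in range(replicates)) / replicates
```

Under these assumptions the expected value is 2(1 − 1/n), so `mean_tmrca(10)` should be close to 1.8. A single lineage needs no coalescence, so `simulate_tmrca(1, rng)` returns 0.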

Selective sweep

The increase in the frequency of an allele (and closely linked chromosomal segments) that is caused by selection for the allele. Sweeps initially reduce variation and subsequently lead to increased homozygosity.

Likelihood

The probability of the data under a particular model, viewed as a function of the parameters of that model (note that data discussed in this paper are discrete).

Mitochondrial Eve

The most recent maternal common ancestor of the entire human mitochondrial population.

Gene conversion

A non-reciprocal recombination process that results in the alteration of the sequence of a gene to that of its homologue during meiosis.

Admixture

Gene flow between differentiated populations.

Maximum likelihood

A statistical analysis in which one aims to find the parameter value that maximizes the likelihood of the data.

Test statistic

A numerical summary of the data that is used to measure support for a null hypothesis. Either the test statistic has a known probability distribution (such as χ²) under the null hypothesis, or its null distribution is approximated computationally.

Tajima's D

A statistic that compares the observed nucleotide diversity with the diversity expected under a neutral model with constant population size.
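
The following is a minimal sketch of how this statistic is commonly computed from aligned sequences, using the standard formulas from Tajima (1989). The function name `tajimas_d` and the input format (a list of equal-length strings) are assumptions made for illustration. Any column with more than one allele is counted as a single segregating site, which is a simplification.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return (pi, S, D) for a list of aligned, equal-length sequences.

    pi is the mean number of pairwise differences, S the number of
    segregating sites, and D is Tajima's D (None when S == 0).
    """
    n = len(seqs)
    if n < 4:
        # The variance term is zero for n < 4, so D is undefined.
        raise ValueError("need at least 4 sequences")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    # Segregating sites: alignment columns showing more than one allele.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)

    # Nucleotide diversity: mean number of differences over all pairs.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    if s == 0:
        return pi, s, None

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between the two diversity estimates, scaled by its
    # estimated standard deviation.
    d = (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
    return pi, s, d
```

For example, `tajimas_d(["AAAA", "AAAT", "AATT", "ATTT"])` gives three segregating sites and a mean of 10/6 pairwise differences, which yields a small positive D.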

Prior distribution

The distribution of likely parameter values before any data are examined.

Posterior distribution

The distribution that is proportional to the product of the likelihood and prior distribution.
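
To make the proportionality concrete, the sketch below evaluates a posterior on a grid for a toy problem that is not taken from the article: an allele is observed `k` times in a sample of `n` chromosomes, and the unknown parameter is its population frequency. The binomial likelihood, the flat prior and the name `grid_posterior` are all illustrative assumptions.

```python
from math import comb


def grid_posterior(k, n, grid_size=101):
    """Return (grid, posterior) for an allele frequency p, given k copies
    of the allele in a sample of n chromosomes.

    Assumptions for illustration only: binomial likelihood, flat prior,
    and p restricted to an evenly spaced grid on [0, 1].
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 for _ in grid]  # flat prior over the grid
    likelihood = [comb(n, k) * p ** k * (1 - p) ** (n - k) for p in grid]

    # Posterior is proportional to likelihood times prior ...
    unnormalised = [lik * pr for lik, pr in zip(likelihood, prior)]
    # ... and is normalised so that it sums to 1 over the grid.
    total = sum(unnormalised)
    posterior = [u / total for u in unnormalised]
    return grid, posterior
```

With a flat prior the posterior mode coincides with the maximum-likelihood value k/n, so `grid_posterior(3, 10)` peaks at p = 0.3. A grid is only practical for one or two parameters; for richer models the normalising sum cannot be computed this way.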

Coverage

The range of values for which the probability is non-zero.

Summary statistics

Statistics that try to capture a complicated data set in a simpler way. An example is the use of the number of segregating sites as a surrogate for a set of DNA fragments.

Markov process

One in which the probability of the next state depends solely on the previous state, and not on the sequence of states before it.

Stationarity

The state in which a process has become independent of its starting position and has settled into its long-term behaviour. In an MCMC context, the process is typically assumed to be stationary at the end of a 'burn-in' period.
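
The role of the burn-in period can be sketched with a random-walk Metropolis sampler. In this toy example, which is not from the article, the target is a standard normal distribution and the chain is deliberately started far from it. The function name `metropolis_normal` and its parameters are hypothetical, and the burn-in length is chosen by hand rather than by a convergence diagnostic.

```python
import math
import random


def metropolis_normal(n_steps, burn_in, start=10.0, step=1.0, seed=1):
    """Random-walk Metropolis chain targeting a standard normal density.

    Returns (discarded, kept): the burn-in states and the states that
    are treated as draws from the stationary distribution.
    """
    rng = random.Random(seed)

    def log_target(x):
        # Log density of N(0, 1), up to an additive constant.
        return -0.5 * x * x

    x = start
    chain = []
    for _ in range(n_steps):
        # Symmetric proposal: uniform perturbation of the current state.
        proposal = x + rng.uniform(-step, step)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain[:burn_in], chain[burn_in:]
```

Calling `metropolis_normal(20000, 1000)` discards the first 1,000 states, which still reflect the starting value of 10, and keeps the remainder, whose mean should be near 0.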

Local maxima

A local region in which a distribution takes a value that is higher than those taken at other nearby points, but which is lower than at least one value taken in some other, more distant region.

Sufficiency

The statistic S is sufficient for the parameter η if the probability of the data, given S and η, does not depend on η.

Haplotype

The sequence of bases along a single copy of (typically, part of) a chromosome.

About this article

Cite this article

Marjoram, P., Tavaré, S. Modern computational approaches for analysing molecular genetic variation data. Nat Rev Genet 7, 759–770 (2006). https://doi.org/10.1038/nrg1961
