Review Article

The Bayesian revolution in genetics

Nature Reviews Genetics volume 5, pages 251–261 (2004)



Bayesian statistics allow scientists to easily incorporate prior knowledge into their data analysis. Nonetheless, the sheer amount of computational power that is required for Bayesian statistical analyses has previously limited their use in genetics. These computational constraints have now largely been overcome and the underlying advantages of Bayesian approaches are putting them at the forefront of genetic data analysis in an increasing number of areas.

Key points

  • In genetic analysis, there are often competing explanations for the same data. Sophisticated mathematical models have been developed that can encapsulate these problems in terms of parameters that need to be inferred.

  • Bayesian statistical methods are well suited to help pick out the most reasonable parameter values — as well as to choose between entire models — and they provide a framework for including background information to help with this.

  • The goal of Bayesian analysis is to compute the probability distribution of parameter values and model specifications given the data. This is called the posterior distribution.

  • It is only with the development of high-speed computing over the past ten years that the potential of Bayesian methods has been realized. As a consequence, statistical analysis in genetics has undergone a dramatic shift.

  • The computational method that has had the most influence is Markov chain Monte Carlo (MCMC), which allows parameter values to be drawn from the posterior distribution.

  • Example areas that have used Bayesian methods include: population genetics, detecting the effects of selection, sequence analysis, SNP discovery, haplotype identification, analysis of gene expression, association mapping and linkage-disequilibrium mapping.

  • A review of applications in these areas demonstrates the following advantages of Bayesian methods over other approaches: use of background information; the ability to include uncertainty in all parameter values; ease in making inferences about some parameters irrespective of the values of others; and lack of ad hoc calculations and approximations that are often associated with alternative statistical methods.

  • There are still computational difficulties with Bayesian approaches. Further improvements are needed both in testing the accuracy of the computation involved and also in model checking.
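The key points on the posterior distribution and Markov chain Monte Carlo can be illustrated with a minimal sketch. It assumes a toy problem that is not taken from the article: estimating an allele frequency p from k copies of an allele among n sampled chromosomes, with a uniform prior and a binomial likelihood. The function names, data values and tuning constants are illustrative assumptions. Because this toy posterior is known exactly (a Beta(k + 1, n - k + 1) distribution), the sampler output can be checked against the exact posterior mean.

```python
import math
import random

def log_posterior(p, k, n):
    """Unnormalised log posterior of allele frequency p:
    binomial likelihood times a uniform prior (assumed for illustration)."""
    if p <= 0.0 or p >= 1.0:
        return -math.inf
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def metropolis(k, n, n_iter=20000, step=0.1, seed=1):
    """Random-walk Metropolis sampler; returns the sampled values of p."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, k, n)
    samples = []
    for _ in range(n_iter):
        proposal = p + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, k, n)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            p, current = proposal, candidate
        samples.append(p)
    return samples

k, n = 12, 40                       # hypothetical data
draws = metropolis(k, n)[2000:]     # discard an initial burn-in period
mcmc_mean = sum(draws) / len(draws)
exact_mean = (k + 1) / (n + 2)      # mean of Beta(k + 1, n - k + 1)
print(round(mcmc_mean, 3), round(exact_mean, 3))
```

In realistic genetic models the posterior has no closed form, which is why sampling is needed; the closed-form check here is only possible because the example is deliberately simple.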





We thank the four anonymous referees for their comments. Work on this paper was supported by grants from the Biotechnology and Biological Sciences Research Council and the Natural Environment Research Council to M.A.B., and by grants from the National Institutes of Health and the Canadian Institute of Health Research to B.R.

Author information


  1. School of Animal and Microbial Sciences, University of Reading, Whiteknights, P.O. Box 228, Reading RG6 6AJ, UK.

    • Mark A. Beaumont
  2. Department of Medical Genetics, 839 Medical Sciences Building, University of Alberta, Edmonton, Alberta T6G2H7, Canada.

    • Bruce Rannala



Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Mark A. Beaumont.



Glossary

The process whereby data are observed and then statements are made about unknown features of the system that gave rise to the data.


A model in which the data are modelled as random variables, the probability distribution of which depends on parameter values. Bayesian models are sometimes called fully probabilistic because the parameter values are also treated as random variables.


The probability of the data for a particular set of parameter values.


A model that is suitable for modelling a sequence of random variables, such as nucleotide base pairs in DNA, in which the probability that a variable assumes any specific value depends only on the value of a specified number of most recent variables that precede it. In an nth-order Markov chain, the probability distribution of a variable depends on the n preceding observations.
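A first-order version of the chain described above (n = 1) can be sketched as follows. The transition probabilities are invented for illustration and are not estimated from any real sequence; the function names are likewise assumptions of this sketch.

```python
import math
import random

BASES = ("A", "C", "G", "T")

# Hypothetical first-order transition probabilities P(next base | current base).
TRANSITIONS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.1, "T": 0.4},
    "G": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "T": {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3},
}

def simulate(length, start="A", seed=0):
    """Simulate a sequence in which each base depends only on the previous one."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length - 1):
        probs = TRANSITIONS[seq[-1]]
        seq.append(rng.choices(BASES, weights=[probs[b] for b in BASES])[0])
    return "".join(seq)

def log_probability(seq):
    """Log probability of a sequence, conditional on its first base."""
    return sum(math.log(TRANSITIONS[a][b]) for a, b in zip(seq, seq[1:]))

sequence = simulate(50)
print(sequence, round(log_probability(sequence), 2))
```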


Also known as the 'prior predictive distribution'. The probability distribution of the data irrespective of the parameter values.


A quantity that might take any of a range of values (discrete or continuous) that cannot be predicted with certainty but only described probabilistically.


The probability distribution of all combinations of two or more random variables.


The probability distribution of parameter values before observing the data.


The distribution of one or more random variables when other random variables of a joint probability distribution are fixed at particular values.


The conditional distribution of the parameter given the observed data.
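The quantities defined above (the probability of the data for given parameter values, the probability of the data irrespective of the parameter values, and the distributions of the parameters before and after observing the data) are linked by Bayes' theorem. The notation is an assumption of this sketch rather than the article's own: D denotes the data and θ the parameters.

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta
```

The denominator does not depend on θ, so the posterior is proportional to the likelihood multiplied by the prior.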


A summary of the location of a parameter value. In a Bayesian setting, this is generally the mean, mode or median of the posterior distribution.


An estimate of the region in which the true parameter value is believed to be located.
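Both kinds of summary described above can be computed directly once a posterior distribution is available. The sketch below assumes a toy example not taken from the article (an allele frequency p with a uniform prior, and k copies of an allele observed among n chromosomes) and approximates the posterior on a regular grid. The equal-tailed interval is one common choice of region; other choices, such as highest-density regions, exist.

```python
def grid_posterior(k, n, grid_size=1001):
    """Posterior of allele frequency p on a regular grid (uniform prior assumed)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = [p ** k * (1.0 - p) ** (n - k) for p in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

def summarise(grid, probs, level=0.95):
    """Return the posterior mean, the posterior mode and an equal-tailed interval."""
    mean = sum(p * w for p, w in zip(grid, probs))
    mode = grid[max(range(len(grid)), key=probs.__getitem__)]
    tail = (1.0 - level) / 2.0
    cumulative, lower, upper = 0.0, None, None
    for p, w in zip(grid, probs):
        cumulative += w
        if lower is None and cumulative >= tail:
            lower = p
        if upper is None and cumulative >= 1.0 - tail:
            upper = p
    return mean, mode, (lower, upper)

grid, probs = grid_posterior(k=12, n=40)   # hypothetical data
mean, mode, interval = summarise(grid, probs)
print(round(mean, 3), mode, interval)
```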


A method for estimating parameters by using theory to obtain a formula for the expected value of statistics measured from the data as a function of the parameter values to be estimated. The observed values of these statistics are then equated to the expected values. The formula is inverted to obtain an estimate of the parameter.
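A well-known estimator in population genetics follows this recipe, and is used here as an illustration that the definition itself does not name. Under the standard neutral coalescent with infinite-sites mutation, the expected number of segregating sites S in a sample of n sequences is θ multiplied by the sum 1 + 1/2 + ... + 1/(n - 1). Equating the observed S to this expectation and inverting gives Watterson's estimator of θ. The function names are assumptions of this sketch.

```python
def watterson_constant(sample_size):
    """Sum of 1/i for i = 1 .. n - 1, the factor linking E[S] to theta."""
    return sum(1.0 / i for i in range(1, sample_size))

def watterson_theta(segregating_sites, sample_size):
    """Method-of-moments estimate of theta from the number of segregating sites."""
    if sample_size < 2:
        raise ValueError("at least two sequences are needed")
    return segregating_sites / watterson_constant(sample_size)

# Hypothetical sample: 11 segregating sites among 4 sequences.
print(watterson_theta(11, 4))
```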


Statistical inference in which probability is interpreted as the relative frequency of occurrences in an infinite sequence of trials.


A theory that describes the genealogy of chromosomes or genes. Under many life-history schemes (discrete generations, overlapping generations, non-random mating, and so on), taking certain limits, the statistical distribution of branch lengths in genealogies follows a simple form. Coalescent theory describes this distribution.


The process of repeatedly simulating new data sets with parameters that are inferred from the observed data, and then re-estimating the parameters from these simulated data sets. This process is used to obtain confidence intervals.
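The procedure described above can be sketched with a deliberately simple model that is an assumption of this example, not one used in the article: exponentially distributed observations with an unknown rate, estimated by maximum likelihood. The data values are invented.

```python
import random

def estimate_rate(data):
    """Maximum-likelihood estimate of an exponential rate."""
    return len(data) / sum(data)

def parametric_bootstrap(data, n_boot=2000, level=0.95, seed=0):
    """Percentile confidence interval from data simulated at the fitted rate."""
    rng = random.Random(seed)
    rate_hat = estimate_rate(data)
    estimates = []
    for _ in range(n_boot):
        # Simulate a new data set of the same size under the fitted model,
        # then re-estimate the parameter from it.
        simulated = [rng.expovariate(rate_hat) for _ in data]
        estimates.append(estimate_rate(simulated))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_boot)]
    upper = estimates[int((1.0 - tail) * n_boot) - 1]
    return rate_hat, (lower, upper)

observed = [0.8, 1.9, 0.3, 2.4, 1.1, 0.6, 3.0, 0.9, 1.5, 0.5]  # hypothetical
rate_hat, (lower, upper) = parametric_bootstrap(observed)
print(round(rate_hat, 3), round(lower, 3), round(upper, 3))
```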


(Ne). The size of a random mating population under a simple Fisher–Wright model that has an equivalent rate of inbreeding to that of the observed population, which might have additional complexities such as variable population size or biased sex ratio.
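Two standard textbook results illustrate how the complexities mentioned above reduce the effective size relative to the census size. They are given here as a sketch under idealised assumptions (otherwise random mating, no other departures from the simple model), and the function names are hypothetical: with unequal numbers of breeding males and females, Ne is approximately 4NmNf/(Nm + Nf); with a population size that varies over generations, Ne is approximately the harmonic mean of the sizes.

```python
def ne_sex_ratio(n_males, n_females):
    """Effective size with unequal numbers of breeding males and females."""
    return 4.0 * n_males * n_females / (n_males + n_females)

def ne_fluctuating(sizes):
    """Effective size over generations of varying size (harmonic mean)."""
    return len(sizes) / sum(1.0 / n for n in sizes)

print(ne_sex_ratio(10, 90))              # 100 breeders, strongly biased sex ratio
print(ne_fluctuating([1000, 10, 1000]))  # one generation of severe bottleneck
```

In both hypothetical cases the effective size is far below the census size, because the rarer sex or the smallest generation dominates the rate of inbreeding.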


One or more model parameters are non-identifiable if different combinations of the parameters generate the same likelihood of the data.


In a standard Bayesian model, the parameters are drawn from prior distributions, the parameters of which are fixed by the modeller. In a hierarchical model, these parameters, usually referred to as 'hyperparameters', are also free to vary and are themselves drawn from priors, often referred to as 'hyperpriors'. This form of modelling is most useful for data that are composed of exchangeable groups, such as genes, for which the model must allow the parameters that describe each group to be either the same or different.


The data are simplified by representation as a set of summary statistics, and simulations are used to draw samples from the joint distribution of parameters and summary statistics (that is, the distribution shown in figure 1). The posterior distribution is approximated by estimating the conditional distribution of parameters in the vicinity of the summary statistics that are measured from the data (the vertical dotted line in figure 1), which avoids the need to calculate a likelihood function.
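The simplest variant of this idea is a rejection scheme: draw parameters from the prior, simulate data, and keep only the draws whose simulated summary statistic falls close to the observed one. The sketch below is a toy illustration under stated assumptions, not the method of any particular paper: the parameter is an allele frequency with a uniform prior, the summary statistic is the allele count in n chromosomes, and the tolerance and function names are arbitrary choices.

```python
import random

def simulate_count(p, n, rng):
    """Simulate the summary statistic: allele count among n chromosomes."""
    return sum(rng.random() < p for _ in range(n))

def abc_rejection(observed, n, n_sims=20000, tolerance=1, seed=0):
    """Keep prior draws whose simulated count lies within the tolerance."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        p = rng.random()  # draw from the uniform prior
        if abs(simulate_count(p, n, rng) - observed) <= tolerance:
            accepted.append(p)
    return accepted

accepted = abc_rejection(observed=12, n=40)  # hypothetical data
abc_mean = sum(accepted) / len(accepted)
print(len(accepted), round(abc_mean, 3))
```

No likelihood is evaluated at any point; only the ability to simulate from the model is required. A wider tolerance accepts more draws but gives a coarser approximation to the posterior.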


The combinations of alleles that are observed when individuals are simultaneously genotyped at two or more genetic marker loci.


If two or more variables have joint outcomes that are more frequent than would be expected by chance (if the two variables were independent), they are associated. An association study statistically examines patterns of co-occurrence of variables, such as genetic variants and disease phenotypes, to identify factors (genes) that might contribute to disease risk.


The probability of homozygosity by descent — that is, the probability that a zygote obtains copies of the same ancestral gene from both its parents because they are related.


Methods for comparing traits across species to identify trends in character evolution that indicate the effects of natural selection.


A hierarchical model in which the hyperparameter is not a random variable but is estimated by some other (often classical) means.


This is an enhancement of a Markov chain model, in which the state of each observation is drawn randomly from a distribution, the parameters of which follow a Markov chain. For example, the parameter might be an indicator for whether a DNA region is coding or non-coding, and the observation is the base at each nucleotide.
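The coding versus non-coding example above can be sketched with a two-state model. All probabilities below are invented for illustration. The function computes the probability of a sequence summed over every possible path of hidden states, using the standard forward recursion with rescaling at each step to avoid numerical underflow.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, chosen only for illustration.
START = {"coding": 0.5, "noncoding": 0.5}
TRANSITION = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.05, "noncoding": 0.95},
}
EMISSION = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def forward_log_likelihood(seq):
    """Log probability of seq, summed over all hidden-state paths."""
    forward = {s: START[s] * EMISSION[s][seq[0]] for s in STATES}
    log_lik = 0.0
    for base in seq[1:]:
        # Rescale so the forward values sum to one, and record the scale.
        scale = sum(forward.values())
        log_lik += math.log(scale)
        forward = {
            s: EMISSION[s][base]
            * sum(forward[r] / scale * TRANSITION[r][s] for r in STATES)
            for s in STATES
        }
    return log_lik + math.log(sum(forward.values()))

print(round(forward_log_likelihood("ACGCGT"), 4))
```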


A large class of programming algorithms that are based on breaking a large problem down (if possible) into incremental steps so that, at any given stage, optimal solutions to the sub-problems are already known.
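A familiar instance in sequence analysis is global pairwise alignment, shown here as an illustrative example that the definition itself does not name. The best score for aligning two prefixes is built from the already-known best scores of shorter prefixes (the Needleman-Wunsch recurrence). The scoring values are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b."""
    # previous[j] holds the best score for the current prefix of a
    # aligned against the first j characters of b.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair,   # align a[i-1] with b[j-1]
                previous[j] + gap,        # gap in b
                current[j - 1] + gap,     # gap in a
            ))
        previous = current
    return previous[-1]

print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```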


This is the tendency in a hierarchical Bayesian model for the posterior distributions of parameters among exchangeable units (for example, genes) to become narrower as a result of pooling information across units.
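The narrowing described above can be seen in the simplest normal model, used here as an assumption of the sketch. Each unit (for example, a gene) has an effect drawn from a normal population distribution with mean mu and variance tau2, and is measured n_obs times with sampling variance sigma2. For simplicity mu and tau2 are treated as known; in a full hierarchical analysis they would themselves be given priors and inferred from all units jointly.

```python
def shrink(unit_mean, n_obs, sigma2, mu, tau2):
    """Posterior mean and variance of one unit's effect (normal-normal model)."""
    data_precision = n_obs / sigma2   # information from the unit's own data
    prior_precision = 1.0 / tau2      # information pooled from the other units
    post_var = 1.0 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * unit_mean + prior_precision * mu)
    return post_mean, post_var

# Hypothetical unit: observed mean 2.0 from 4 measurements.
post_mean, post_var = shrink(unit_mean=2.0, n_obs=4, sigma2=1.0, mu=0.0, tau2=0.25)
print(post_mean, post_var)
```

The posterior variance is smaller than the variance of the unit's own sample mean (sigma2 / n_obs), and the posterior mean is pulled from the unit's observed mean towards the population mean.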


The process of choosing among different models given their posterior probability.


This refers to sequences that have arisen by duplications within a single genome.


An iterative algorithm for linkage mapping. The algorithm calculates the likelihood of marker genotypes on a pedigree. Calculations on the basis of the algorithm are efficient for relatively large families, but its application is typically limited to a small number of markers.


An iterative algorithm that is used for linkage mapping. It iteratively calculates the likelihood across markers on a chromosome, rather than across families, as in the Elston–Stewart algorithm. This allows efficient calculation of pedigree likelihoods for small families with many linked markers.


A general class of genetic association tests that uses families with one or more affected children as the observations rather than unrelated cases and controls. The analysis treats the allele that is transmitted to (one or more) affected children from each parent as the 'case' and the untransmitted allele is treated as the 'control' to avoid the influence of population subdivision.
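A widely used member of this class is the transmission disequilibrium test (TDT). As a minimal sketch for a marker with two alleles: among heterozygous parents of affected children, count how often each allele is transmitted, and compare the two counts with a McNemar-type statistic that is approximately chi-squared with one degree of freedom under the null hypothesis. The counts below are hypothetical.

```python
import math

def tdt(transmitted, untransmitted):
    """TDT statistic and approximate P value for a two-allele marker.

    transmitted: heterozygous parents that passed on the allele of interest;
    untransmitted: heterozygous parents that passed on the other allele.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    statistic = (transmitted - untransmitted) ** 2 / total
    # Upper tail of a chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

statistic, p_value = tdt(60, 40)
print(statistic, round(p_value, 4))
```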


The ratio of the prior probabilities of the null versus the alternative hypotheses over the ratio of the posterior probabilities. This can be interpreted as the relative odds that the hypothesis is true before and after examining the data. If the prior odds are equal, this simplifies to become the likelihood ratio.
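Written out, the definition above gives the following identity. The notation is an assumption of this sketch: H0 is the null hypothesis, H1 the alternative and D the data. The ratio of prior odds to posterior odds, both taken for the null against the alternative, measures the support that the data give to the alternative.

```latex
B_{10}
= \frac{P(H_0)/P(H_1)}{P(H_0 \mid D)/P(H_1 \mid D)}
= \frac{P(H_1 \mid D)/P(H_0 \mid D)}{P(H_1)/P(H_0)}
= \frac{P(D \mid H_1)}{P(D \mid H_0)}
```

The last equality follows from Bayes' theorem. When each hypothesis has free parameters, P(D | H) is the probability of the data averaged over the prior for those parameters.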


A procedure for fine-scale localization to a region of a chromosome of a mutation that causes a detectable phenotype (often a disease) by use of linkage disequilibrium between the phenotype that is induced by the mutation and markers that are located near the mutation on the chromosome.


The inexorable tendency for a mathematical function to approach some particular value (or set of values) with increasing n. In the case of Markov chain Monte Carlo, n is the number of simulation replicates and the values that the chain approaches are the posterior probabilities.
