Bayesian statistics allow scientists to easily incorporate prior knowledge into their data analysis. Nonetheless, the sheer amount of computational power that is required for Bayesian statistical analyses has previously limited their use in genetics. These computational constraints have now largely been overcome and the underlying advantages of Bayesian approaches are putting them at the forefront of genetic data analysis in an increasing number of areas.
In genetic analysis, there are often competing explanations for the same data. Sophisticated mathematical models have been developed that can encapsulate these problems in terms of parameters that need to be inferred.
Bayesian statistical methods are well suited to help pick out the most reasonable parameter values — as well as to choose between entire models — and they provide a framework for including background information to help with this.
The goal of Bayesian analysis is to compute the probability distribution of parameter values and model specifications given the data. This is called the posterior distribution.
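For concreteness, a minimal sketch of this computation (with made-up numbers, not an example from the article): for binomially distributed allele counts with a conjugate Beta prior on the allele frequency, the posterior distribution has a simple closed form.

```python
# Sketch: posterior for an allele frequency p under a Beta(a, b) prior,
# given x observed copies of an allele among n sampled chromosomes.
# Beta-binomial conjugacy gives the posterior Beta(a + x, b + n - x).

def beta_binomial_posterior(a, b, x, n):
    """Return the (a, b) parameters of the Beta posterior."""
    return a + x, b + (n - x)

def beta_mean(a, b):
    return a / (a + b)

# Hypothetical data: 30 copies of an allele among 100 chromosomes,
# with a uniform Beta(1, 1) prior.
a_post, b_post = beta_binomial_posterior(1, 1, 30, 100)
print(a_post, b_post)                        # 31 71
print(round(beta_mean(a_post, b_post), 3))   # posterior mean: 0.304
```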
It is only with the development of high-speed computing over the past ten years that the potential of Bayesian methods has been realized. As a consequence, statistical analysis in genetics has undergone a dramatic shift.
The computational method that has had the most influence is Markov chain Monte Carlo, which allows parameter values to be drawn from the posterior distribution.
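The core idea can be illustrated with a toy random-walk Metropolis sampler (the simplest Markov chain Monte Carlo scheme), here targeting the posterior of an allele frequency under a binomial likelihood and uniform prior; the numbers are hypothetical and real genetic samplers are far more elaborate.

```python
import math
import random

# Sketch of a random-walk Metropolis sampler for an allele frequency p,
# targeting an unnormalized posterior: binomial likelihood times a
# uniform prior. Toy illustration only.

def log_post(p, x=30, n=100):
    if not 0.0 < p < 1.0:
        return -math.inf            # zero prior density outside (0, 1)
    return x * math.log(p) + (n - x) * math.log(1.0 - p)

def metropolis(n_iter=20000, step=0.05, seed=1):
    random.seed(seed)
    p, draws = 0.5, []
    for _ in range(n_iter):
        proposal = p + random.uniform(-step, step)
        # accept with probability min(1, posterior ratio)
        if math.log(random.random()) < log_post(proposal) - log_post(p):
            p = proposal
        draws.append(p)
    return draws

draws = metropolis()[2000:]          # discard burn-in
print(round(sum(draws) / len(draws), 2))  # close to the exact posterior mean (~0.30)
```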
Example areas that have used Bayesian methods include: population genetics, detecting the effects of selection, sequence analysis, SNP discovery, haplotype identification, analysis of gene expression, association mapping and linkage-disequilibrium mapping.
A review of applications in these areas demonstrates the following advantages of Bayesian methods over other approaches: use of background information; the ability to include uncertainty in all parameter values; ease in making inferences about some parameters irrespective of the values of others; and lack of ad hoc calculations and approximations that are often associated with alternative statistical methods.
There are still computational difficulties with Bayesian approaches. Further improvements are needed both in testing the accuracy of the computation involved and also in model checking.
We thank the four anonymous referees for their comments. Work on this paper was supported by grants from the Biotechnology and Biological Sciences Research Council and the Natural Environment Research Council to M.A.B., and by grants from the National Institutes of Health and the Canadian Institute of Health Research to B.R.
- STATISTICAL INFERENCE
The process whereby data are observed and then statements are made about unknown features of the system that gave rise to the data.
- PROBABILISTIC MODEL
A model in which the data are modelled as random variables, the probability distribution of which depends on parameter values. Bayesian models are sometimes called fully probabilistic because the parameter values are also treated as random variables.
- LIKELIHOOD
The probability of the data for a particular set of parameter values.
- MARKOV CHAIN
A model that is suitable for modelling a sequence of random variables, such as nucleotide base pairs in DNA, in which the probability that a variable assumes any specific value depends only on the value of a specified number of most recent variables that precede it. In an nth-order Markov chain, the probability distribution of a variable depends on the n preceding observations.
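A first-order chain of DNA bases can be simulated in a few lines; the transition probabilities below are invented for illustration.

```python
import random

# Sketch: simulating a first-order Markov chain over DNA bases, in which
# the next base depends only on the current one. Transition probabilities
# are made up for illustration.

TRANSITIONS = {
    'A': [('A', 0.4), ('C', 0.2), ('G', 0.2), ('T', 0.2)],
    'C': [('A', 0.1), ('C', 0.5), ('G', 0.3), ('T', 0.1)],
    'G': [('A', 0.2), ('C', 0.2), ('G', 0.4), ('T', 0.2)],
    'T': [('A', 0.3), ('C', 0.1), ('G', 0.1), ('T', 0.5)],
}

def simulate(length, start='A', seed=0):
    random.seed(seed)
    seq = [start]
    for _ in range(length - 1):
        bases, weights = zip(*TRANSITIONS[seq[-1]])
        seq.append(random.choices(bases, weights=weights)[0])
    return ''.join(seq)

print(simulate(20))
```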
- MARGINAL LIKELIHOOD
Also known as the 'prior predictive distribution'. The probability distribution of the data irrespective of the parameter values.
- RANDOM VARIABLE
A quantity that might take any of a range of values (discrete or continuous) that cannot be predicted with certainty but only described probabilistically.
- JOINT PROBABILITY DISTRIBUTION
The probability distribution of all combinations of two or more random variables.
- PRIOR [DISTRIBUTION]
The probability distribution of parameter values before observing the data.
- CONDITIONAL DISTRIBUTION
The distribution of one or more random variables when other random variables of a joint probability distribution are fixed at particular values.
- POSTERIOR DISTRIBUTION
The conditional distribution of the parameter given the observed data.
- POINT ESTIMATE
A summary of the location of a parameter value. In a Bayesian setting, this is generally the mean, mode or median of the posterior distribution.
- INTERVAL ESTIMATE
An estimate of the region in which the true parameter value is believed to be located.
- METHOD OF MOMENTS
A method for estimating parameters by using theory to obtain a formula for the expected value of statistics measured from the data as a function of the parameter values to be estimated. The observed values of these statistics are then equated to the expected values. The formula is inverted to obtain an estimate of the parameter.
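A well-known population-genetics instance is Watterson's estimator of theta = 4*Ne*mu: coalescent theory gives E[S] = theta * sum(1/i, i = 1..n-1) for the number of segregating sites S in a sample of n sequences, and equating the observed S to this expectation and inverting yields the estimate (the numbers below are hypothetical).

```python
# Sketch: method-of-moments estimation via Watterson's estimator.
# E[S] = theta * sum_{i=1}^{n-1} 1/i, so equating the observed S to its
# expectation and inverting gives theta_hat = S / sum(1/i).

def watterson_theta(segregating_sites, n_sequences):
    harmonic = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / harmonic

# Hypothetical data: 25 segregating sites in a sample of 10 sequences.
print(round(watterson_theta(25, 10), 3))  # ~8.837
```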
- FREQUENTIST INFERENCE
Statistical inference in which probability is interpreted as the relative frequency of occurrences in an infinite sequence of trials.
- COALESCENT THEORY
A theory that describes the genealogy of chromosomes or genes. Under many life-history schemes (discrete generations, overlapping generations, non-random mating, and so on), taking certain limits, the statistical distribution of branch lengths in genealogies follows a simple form. Coalescent theory describes this distribution.
- PARAMETRIC BOOTSTRAPPING
The process of repeatedly simulating new data sets with parameters that are inferred from the observed data, and then re-estimating the parameters from these simulated data sets. This process is used to obtain confidence intervals.
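A minimal sketch of the procedure for a binomial proportion (a hypothetical allele-frequency example, not one from the article):

```python
import random

# Sketch: parametric bootstrap confidence interval for a binomial
# proportion. Fit the model, repeatedly simulate new data sets under the
# fitted parameter, re-estimate from each, and take percentiles.

def bootstrap_ci(x, n, reps=5000, seed=2):
    p_hat = x / n                      # estimate from the observed data
    random.seed(seed)
    estimates = []
    for _ in range(reps):
        # simulate a new data set under the fitted model...
        sim_x = sum(random.random() < p_hat for _ in range(n))
        # ...and re-estimate the parameter from it
        estimates.append(sim_x / n)
    estimates.sort()
    return estimates[int(0.025 * reps)], estimates[int(0.975 * reps)]

lo, hi = bootstrap_ci(30, 100)
print(lo, hi)  # roughly (0.21, 0.39)
```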
- EFFECTIVE POPULATION SIZE
(Ne). The size of a random mating population under a simple Fisher–Wright model that has an equivalent rate of inbreeding to that of the observed population, which might have additional complexities such as variable population size or biased sex ratio.
- NON-IDENTIFIABLE [PARAMETERS]
One or more model parameters are non-identifiable if different combinations of the parameters generate the same likelihood of the data.
- HIERARCHICAL BAYESIAN MODEL
In a standard Bayesian model, the parameters are drawn from prior distributions, the parameters of which are fixed by the modeller. In a hierarchical model, these parameters, usually referred to as 'hyperparameters', are also free to vary and are themselves drawn from priors, often referred to as 'hyperpriors'. This form of modelling is most useful for data that are composed of exchangeable groups, such as genes, for which the parameters that describe each group must be allowed to be either the same or different.
- APPROXIMATE BAYESIAN COMPUTATION
The data are simplified by representation as a set of summary statistics, and simulations are used to draw samples from the joint distribution of parameters and summary statistics (that is, the distribution shown in figure 1). The posterior distribution is approximated by estimating the conditional distribution of the parameters in the vicinity of the summary statistics that are measured from the data (the vertical dotted line in figure 1), avoiding the need to calculate a likelihood function.
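A minimal rejection-sampling sketch of the idea, using a toy binomial model rather than a genetic one: parameters are drawn from the prior, data are simulated under each draw, and only parameters whose simulated summary statistic lands near the observed value are kept.

```python
import random

# Sketch of rejection ABC for a binomial proportion. Kept parameter draws
# approximate the posterior without any likelihood being evaluated.

def abc_rejection(obs_x, n, reps=20000, tol=2, seed=3):
    random.seed(seed)
    accepted = []
    for _ in range(reps):
        p = random.random()                    # draw from a uniform prior
        sim_x = sum(random.random() < p for _ in range(n))
        if abs(sim_x - obs_x) <= tol:          # summary statistic: allele count
            accepted.append(p)
    return accepted

post = abc_rejection(30, 100)
print(round(sum(post) / len(post), 2))  # close to the exact posterior mean (~0.30)
```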
- MULTILOCUS GENOTYPES
The combinations of alleles that are observed when individuals are simultaneously genotyped at two or more genetic marker loci.
- ASSOCIATION STUDY
If two or more variables have joint outcomes that are more frequent than would be expected by chance (if the two variables were independent), they are associated. An association study statistically examines patterns of co-occurrence of variables, such as genetic variants and disease phenotypes, to identify factors (genes) that might contribute to disease risk.
- INBREEDING COEFFICIENT
The probability of homozygosity by descent — that is, the probability that a zygote obtains copies of the same ancestral gene from both its parents because they are related.
- COMPARATIVE METHODS
Methods for comparing traits across species to identify trends in character evolution that indicate the effects of natural selection.
- EMPIRICAL BAYES PROCEDURE
A hierarchical model in which the hyperparameter is not a random variable but is estimated by some other (often classical) means.
- HIDDEN MARKOV MODEL
This is an enhancement of a Markov chain model, in which the state of each observation is drawn randomly from a distribution, the parameters of which follow a Markov chain. For example, the parameter might be an indicator for whether a DNA region is coding or non-coding, and the observation is the base at each nucleotide.
- DYNAMIC PROGRAMMING
A large class of programming algorithms that are based on breaking a large problem down (if possible) into incremental steps so that, at any given stage, optimal solutions are known for the sub-problems already solved.
- BORROW STRENGTH
This is the tendency in a hierarchical Bayesian model for the posterior distributions of parameters among exchangeable units (for example, genes) to become narrower as a result of pooling information across units.
- MODEL SELECTION
The process of choosing among different models given their posterior probability.
- PARALOGOUS [SEQUENCES]
This refers to sequences that have arisen by duplications within a single genome.
- ELSTON–STEWART ALGORITHM
An iterative algorithm for linkage mapping. The algorithm calculates the likelihood of marker genotypes on a pedigree. Calculations on the basis of the algorithm are efficient for relatively large families, but its application is typically limited to a small number of markers.
- LANDER–GREEN–KRUGLYAK ALGORITHM
An iterative algorithm that is used for linkage mapping. It iteratively calculates the likelihood across markers on a chromosome, rather than across families, as in the Elston–Stewart algorithm. This allows efficient calculation of pedigree likelihoods for small families with many linked markers.
- FAMILY-BASED ASSOCIATION TESTS
A general class of genetic association tests that uses families with one or more affected children as the observations rather than unrelated cases and controls. The analysis treats the allele that is transmitted to (one or more) affected children from each parent as the 'case' and the untransmitted allele is treated as the 'control' to avoid the influence of population subdivision.
- BAYES FACTOR
The ratio of the posterior probabilities of the null versus the alternative hypothesis divided by the corresponding ratio of prior probabilities; equivalently, the ratio of the marginal likelihoods of the data under the two hypotheses. It can be interpreted as the factor by which examining the data changes the relative odds of the hypotheses. If the prior odds are equal, the posterior odds equal this likelihood ratio.
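As a worked toy example (hypothetical numbers, not from the article): comparing H0: p = 0.5 against H1: p ~ Uniform(0, 1) for x successes in n binomial trials, the marginal likelihood under H1 integrates the binomial likelihood over the uniform prior, which works out to 1/(n + 1).

```python
import math

# Sketch: Bayes factor for H1 (p uniform) over H0 (p = 0.5) given
# x successes in n binomial trials.

def bayes_factor_10(x, n):
    m0 = math.comb(n, x) * 0.5 ** n   # likelihood of the data under H0
    m1 = 1.0 / (n + 1)                # marginal likelihood under H1
    return m1 / m0

print(bayes_factor_10(30, 100) > 1)   # 30/100 favours H1 over p = 0.5
print(bayes_factor_10(50, 100) > 1)   # 50/100 favours H0
```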
- LD MAPPING
A procedure for fine-scale localization to a region of a chromosome of a mutation that causes a detectable phenotype (often a disease) by use of linkage disequilibrium between the phenotype that is induced by the mutation and markers that are located near the mutation on the chromosome.
- CONVERGENCE
The tendency of a mathematical function to approach some particular value (or set of values) as n increases. In the case of Markov chain Monte Carlo, n is the number of simulation replicates and the values that the chain approaches are the posterior probabilities.