One of the goals of statistics is to make inferences about population parameters from a limited set of observations. Last month, we showed how Bayes' theorem is used to update probability estimates as more data are collected^1. We used the example of identifying a coin as fair or biased based on the outcome of one or more tosses. This month, we introduce Bayesian inference by treating the degree of bias as a population parameter and using toss outcomes to construct a distribution for it, from which we can make probabilistic statements about its likely values.
How are Bayesian and frequentist inference different? Consider a coin that yields heads with probability π. Both the Bayesian and the frequentist consider π to be a fixed but unknown constant and compute the probability of a given set of tosses (for example, k heads, H_k) based on this value (for example, P(H_k | π) = π^k), which is called the likelihood. The frequentist calculates the probability of different data generated by the model, P(data | model), assuming a probabilistic model with known and fixed parameters (for example, if the coin is fair, P(H_k) = 0.5^k). The observed data are assessed in light of other data generated by the same model.
In contrast, the Bayesian uses probability to quantify uncertainty and can make more precise probability statements about the state of the system by calculating P(model | data), a quantity that is meaningless in frequentist statistics. The Bayesian uses the same likelihood as the frequentist, but also assumes a probabilistic model (prior distribution) for possible values of π based on previous experience. After observing the data, the prior is updated to the posterior, which is used for inference. The data are considered fixed and possible models are assessed on the basis of the posterior.
Let's extend our coin example from last month to incorporate inference and illustrate the differences in frequentist and Bayesian approaches to it. Recall that we had two coins: coin C was fair, P(H | C) = π_0 = 0.5, and coin C_b was biased toward heads, P(H | C_b) = π_b = 0.75. A coin was selected at random with equal probability and tossed. We used Bayes' theorem to compute the probability that the biased coin was selected given that a head was observed; we found P(C_b | H) = 0.6. We also saw how we could refine our guess by updating this probability with the outcome of another toss: seeing a second head gave us P(C_b | H_2) = 0.69.
In this example, the parameter π is discrete and has two possible values: fair (π_0 = 0.5) and biased (π_b = 0.75). The prior probability of each before tossing is equal, P(π_0) = P(π_b) = 0.5, and the data-generating process has the likelihood P(H_k | π) = π^k. If we observe a head, Bayes' theorem gives the posterior probabilities as P(π_0 | H) = π_0/(π_0 + π_b) = 0.4 and P(π_b | H) = π_b/(π_0 + π_b) = 0.6. Here all the probabilities are known, and the frequentist and Bayesian agree on the approach and the results of the computation.
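These discrete two-coin updates can be reproduced with a short calculation (a minimal Python sketch; the variable and function names are ours):

```python
# Discrete two-coin example: equal prior weight on pi = 0.5 (fair) and 0.75 (biased).
priors = {0.5: 0.5, 0.75: 0.5}

def posterior_after_heads(prior, k):
    # Bayes' theorem: posterior proportional to prior * likelihood pi**k.
    unnorm = {pi: p * pi**k for pi, p in prior.items()}
    total = sum(unnorm.values())          # marginal probability of k heads
    return {pi: v / total for pi, v in unnorm.items()}

post1 = posterior_after_heads(priors, 1)  # after one head
post2 = posterior_after_heads(priors, 2)  # after two heads
print(round(post1[0.75], 2), round(post2[0.75], 2))  # 0.6 0.69
```

The same function applied to two heads reproduces the sequential update from last month, because multiplying the likelihoods π·π is equivalent to updating twice.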
In a more realistic inference scenario, nothing is known about the coin and π could be any value in the interval [0,1]. What can be inferred about π after three tosses produce H_3 (where H_kT_(n−k) denotes the outcome of n tosses that produced k heads and n − k tails)? The frequentist and the Bayesian agree on the data-generating model P(H_3 | π) = π^3, but they will use different methods to encode experience from other coins and the observed outcomes.
In part, this compatibility arises because, for the frequentist, only the data have a probability distribution. The frequentist may test whether the coin is fair using the null hypothesis H_0: π = π_0 = 0.5. In this case, H_3 and T_3 are the most extreme outcomes, each with probability 0.125. The P value is therefore P(H_3 | π_0) + P(T_3 | π_0) = 0.25. At the nominal level of α = 0.05, the frequentist fails to reject H_0 and proceeds as if π = 0.5. The frequentist might estimate π using the sample percentage of heads or compute a 95% confidence interval for π, 0.29 < π ≤ 1. The interval depends on the outcome, but across repeated experiments 95% of such intervals will include the true value of π.
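These frequentist quantities follow directly from the binomial model (a Python sketch; we assume the quoted interval is the exact two-sided Clopper–Pearson interval, whose lower bound for 3 heads in 3 tosses solves π^3 = 0.025):

```python
# Frequentist analysis of 3 heads in 3 tosses under H0: pi = 0.5.
p_H3 = 0.5 ** 3                      # P(H3 | pi = 0.5) = 0.125
p_T3 = 0.5 ** 3                      # P(T3 | pi = 0.5) = 0.125
p_value = p_H3 + p_T3                # two-tailed P value: both extreme outcomes

# Lower bound of the two-sided 95% interval for k = n = 3 (assumed
# Clopper-Pearson): the smallest pi with P(H3 | pi) = pi**3 >= 0.025.
ci_lower = 0.025 ** (1 / 3)
print(round(p_value, 2), round(ci_lower, 2))  # 0.25 0.29
```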
The frequentist approach can only tell us the probability of obtaining our data under the assumption that the null hypothesis is the true data-generating distribution. Because it considers π to be fixed, it does not recognize the legitimacy of questions like “What is the probability that the coin is biased toward heads?” The coin either is or is not biased toward heads. For the frequentist, probabilistic questions about π make sense only when the coin is selected by a known randomization mechanism from a population of coins.
By contrast, the Bayesian, while agreeing that π has a fixed true value for the coin, quantifies uncertainty about the true value as a probability distribution on the possible values, called the prior distribution. For example, if she knows nothing about the coin, she could use a uniform distribution on [0,1], which captures her assessment that any value of π is equally likely (Fig. 1a). If she thinks that the coin is most likely to be close to fair, she can pick a bell-shaped prior distribution (Fig. 1a). These distributions can be imagined as the histogram of the values of π from a large population of coins from which the current coin was selected at random. However, in the Bayesian model, the investigator chooses the prior based on her knowledge about the coin at hand, not some imaginary set of coins.
Given the toss outcome H_3, the Bayesian applies Bayes' theorem to combine the prior, P(π), with the likelihood of observing the data, P(H_3 | π), to obtain the posterior P(π | H_3) = P(H_3 | π) × P(π)/P(H_3) (Fig. 1b). This is analogous to P(A | B) = P(B | A) × P(A)/P(B), except now A is the model parameter, B is the observed data and, because π is continuous, P(·) is interpreted as a probability density. The term corresponding to the denominator P(B), the marginal likelihood P(H_3), becomes the normalizing constant that makes the total probability (area under the curve) equal to 1. As long as this constant is finite, it is often left out and the numerator alone is used to express the shape of the density. This is why it is commonly said that the posterior distribution is proportional to the prior times the likelihood.
Suppose the Bayesian knows little about the coin and uses the uniform prior, P(π) = 1. The posterior is then proportional to the likelihood, P(π | H_3) ∝ P(H_3 | π) = π^3 (Fig. 1b). The Bayesian uses the posterior distribution for inference, choosing the posterior mean (π = 0.8), the median (π = 0.84) or the value of π at which the posterior is maximal (π = 1, the mode) as a point estimate of π.
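With the uniform prior, the normalized posterior density is 4π^3 and its cumulative distribution function is π^4, so these point estimates have closed forms (a Python sketch under those assumed expressions):

```python
# Point estimates from the posterior after H3 with a uniform prior.
# Normalized density: 4 * pi**3; CDF: pi**4.
post_mean = 4 / 5              # integral of pi * 4*pi**3 over [0, 1]
post_median = 0.5 ** (1 / 4)   # solves CDF pi**4 = 0.5
post_mode = 1.0                # pi**3 increases on [0, 1], so its maximum is at 1
print(round(post_mean, 2), round(post_median, 2), post_mode)  # 0.8 0.84 1.0
```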
The Bayesian can also calculate a 95% credible region, the smallest interval that contains 95% of the area under the posterior, which here is [0.47,1] (Fig. 1b). Like the frequentist, the Bayesian cannot conclude that the coin is biased, because π = 0.5 falls within the credible interval. Unlike the frequentist, she can make statements about the probability that the coin is biased toward heads, using the area under the posterior distribution for π > 0.5: P(π > 0.5 | H_3) = 0.94 (Fig. 1b). The probability that the coin is biased toward tails is P(π < 0.5 | H_3) = 0.06. Thus, given the choice of prior, the toss outcome H_3 overwhelmingly supports the hypothesis of head bias, which is 0.9375/0.0625 = 15 times more likely than tail bias. This ratio of posterior probabilities is called the Bayes factor, and its magnitude can be associated with a degree of confidence^2. By contrast, the frequentist would test H_0: π ≤ 0.5 versus H_A: π > 0.5 using the P value from a one-tailed test at the boundary (π_0 = 0.5), obtain P = 0.125 and fail to reject the null hypothesis. Conversely, the Bayesian cannot test the point hypothesis that the coin is fair because, with the uniform prior, probability statements are limited to intervals and cannot be made for single values of π (which always have zero prior and posterior probability).
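The credible region and the exact tail areas (0.9375 and 0.0625) come straight from the posterior CDF π^4 (a Python sketch):

```python
# Credible region and bias probabilities for the posterior with CDF pi**4.
def cdf(x):
    return x ** 4

# The density pi**3 is increasing on [0, 1], so the smallest interval
# holding 95% of the area is [q, 1] with 1 - cdf(q) = 0.95.
q = 0.05 ** (1 / 4)
p_heads_bias = 1 - cdf(0.5)    # P(pi > 0.5 | H3)
p_tails_bias = cdf(0.5)        # P(pi < 0.5 | H3)
bayes_factor = p_heads_bias / p_tails_bias
print(round(q, 2), p_heads_bias, p_tails_bias, bayes_factor)
# 0.47 0.9375 0.0625 15.0
```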
Suppose now that we suspect the coin to be head-biased and want a head-weighted prior (Fig. 1a). What would be a justifiable shape? It turns out that if we consider the general case of n tosses with outcome H_kT_(n−k), we arrive at a tidy solution. With a uniform prior, this outcome has a posterior density proportional to π^k(1 − π)^(n−k). The shape and interpretation of the prior are motivated by considering n′ more tosses that produce k′ heads, H_k′T_(n′−k′). The combined toss outcome is H_(k+k′)T_((n+n′)−(k+k′)), which, with a uniform prior, has a posterior density proportional to π^(k+k′)(1 − π)^((n+n′)−(k+k′)). Another way to think about this posterior is to treat the first set of tosses as the prior, π^k(1 − π)^(n−k), and the second set as the likelihood, π^k′(1 − π)^(n′−k′). In fact, if we extrapolate this pattern back to 0 tosses (with outcome H_0T_0), the original uniform prior is exactly the distribution that corresponds to it: π^0(1 − π)^0 = 1. This iterative updating by adding exponents treats the prior as a statement about the coin based on the outcomes of previous tosses.
Let's look at how different shapes of priors might arise from this line of reasoning. Suppose we suspect that the coin is biased with π = 0.75. In a large number of tosses we expect to see 75% heads. If we are uncertain about this, we might let this imaginary outcome be H_3T_1 and set the prior proportional to π^3(1 − π)^1 (Fig. 2a). If our suspicion is stronger, we might use H_15T_5 and set the prior proportional to π^15(1 − π)^5. In either case, the posterior distribution is obtained simply by adding the number of observed heads and tails to the exponents of π and (1 − π), respectively. If our toss outcome is H_3T_1, the posteriors are proportional to π^6(1 − π)^2 and π^18(1 − π)^6.
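This exponent bookkeeping is a one-line update (a Python sketch; the function name is ours):

```python
# Prior-to-posterior updating by adding exponents of pi and (1 - pi).
def update_exponents(prior_h, prior_t, obs_h, obs_t):
    # Prior proportional to pi**prior_h * (1-pi)**prior_t from imagined
    # tosses; observed heads and tails are added to the exponents.
    return prior_h + obs_h, prior_t + obs_t

print(update_exponents(3, 1, 3, 1))    # H3T1 prior + H3T1 data -> (6, 2)
print(update_exponents(15, 5, 3, 1))   # H15T5 prior + H3T1 data -> (18, 6)
```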
As we collect data, the impact of the prior diminishes and the posterior is shaped more and more by the likelihood. For example, if we use a prior that corresponds to H_3T_1, suggesting that the coin is head-biased, but the data indicate otherwise and we see tosses of H_1T_3, H_5T_15 and H_25T_75 (75% tails), our original misjudgment about the coin is quickly mitigated (Fig. 2b).
In general, a distribution on π in [0,1] proportional to π^(a−1)(1 − π)^(b−1) is called a beta(a,b) distribution. The parameters a and b must be positive, but they do not need to be whole numbers. When a ≥ 1 and b ≥ 1, (a + b − 2) acts like a generalized number of coin tosses and controls the tightness of the distribution around its mode (the location of the maximum of the density), and (a − 1) acts like the number of heads and controls the location of the mode.
All of the curves in Figure 2 are beta distributions. The prior corresponding to a previous toss outcome of H_kT_(n−k) is a beta distribution with a = k + 1 and b = n − k + 1. For example, the prior for H_15T_5 has the shape of beta(16,6). For a prior of beta(a,b), a toss outcome of H_kT_(n−k) yields a posterior of beta(a + k, b + n − k). For example, the posterior for a toss outcome of H_3T_1 using a H_15T_5 prior is beta(19,7).
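In the beta(a,b) parameterization, the same conjugate update reads as follows (a Python sketch; the function name is ours):

```python
# Conjugate beta update: prior beta(a, b) plus outcome HkT(n-k)
# gives posterior beta(a + k, b + n - k).
def beta_update(a, b, k, n):
    return a + k, b + n - k

# Prior from imagined H15T5 is beta(16, 6); observing H3T1 (k = 3, n = 4):
print(beta_update(16, 6, 3, 4))  # (19, 7)
```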
In general, when the posterior comes from the same family of distributions as the prior, with an update formula for the parameters, we say that the prior is conjugate to the distribution generating the data. Conjugate priors are convenient when they are available for data-generating models because the posterior is readily computed. The beta distributions are conjugate priors for binary outcomes such as H or T and come in a wide variety of shapes: flat, skewed, bell-shaped or U-shaped. For a prior on the interval [0,1], it is usually possible to pick values of (a,b) that give a suitable prior for the head probability in coin tosses (or the success probability in independent binary trials).
Frequentist inference assumes that the data-generating mechanism is fixed and that only the data have a probabilistic component. Inference about the model is therefore indirect, quantifying the agreement between the observed data and the data generated by a putative model (for example, the null hypothesis). Bayesian inference quantifies the uncertainty about the data-generating mechanism by the prior distribution and updates it with the observed data to obtain the posterior distribution. Inference about the model is therefore obtained directly as a probability statement based on the posterior. Although the inferential philosophies are quite different, advances in statistical modeling, computing and theory have led many statisticians to keep both sets of methodologies in their data analysis toolkits.
The authors gratefully acknowledge M. Lavine for contributions to the manuscript.