Introduction

Statistical hypothesis testing for Hardy–Weinberg (HW) equilibrium has long had an important role in genetic studies (Haldane, 1954; Rousset & Raymond, 1995). It has received renewed attention in recent years (e.g. Guo & Thompson, 1992; Zaykin et al., 1995), resulting in part from the debate over population genetics issues in the forensic use of DNA profiling. Although hypothesis testing provides some insight into the questions of scientific interest, it forms only the most basic level of statistical inference. Tests do not directly measure the size of the effect: for example, a deviation from HW may be statistically significant, and yet insignificant in the everyday sense for the application at hand.

This weakness has long been recognized, and a number of methods have emerged for obtaining point estimates of f, a parameter measuring departure from HW caused by inbreeding. However, such point estimates are also unsatisfactory for a number of reasons. First, an investigator will most often be interested in the distribution of plausible values, rather than just a (typically imprecise) point estimate. Second, standard errors can be attached to point estimates, but these are of limited value for estimators whose sampling distributions may be highly skewed and are bounded below by the requirement that all genotype frequencies be non-negative. Indeed, some point estimation methods can even produce estimates outside this bound. Finally, a satisfactory approach to estimation should allow the assumptions of the inbreeding model to be assessed and, if necessary, weakened.

The first attempt to evaluate a probability distribution for a parameter measuring departure from HW seems to have been that of Lindley (1988). This paper did not exploit modern computational methodology and dealt only with diallelic loci, for which a one-parameter model for deviations from HW is fully general. For highly polymorphic molecular genetic markers, a fully general model involves a distinct parameter for each heterozygous genotype, such as the fixation indices of Weir (1970) or the additive disequilibrium coefficients of Hernández & Weir (1989). Although such models can be readily implemented, the data will provide little information about so many parameters, and the resulting estimates may be very imprecise.

We do discuss a more general model below, but focus first on the one-parameter inbreeding model. We outline some of the current point estimation methods available for the parameter f, the inbreeding coefficient, and then describe a Markov chain Monte Carlo (MCMC) method for approximating the probability density of f based on a sample of genotypes from the population. We illustrate the method by applying it to simulated data and to data analysed by alternative methods in the recent literature. Next, we describe a method for investigating the validity of the inbreeding model. Finally, we discuss combining information over several loci, illustrating this with simulated data and with data from Samoan individuals at three short tandem repeat (STR) loci used in forensic work.

Computer programs (in C) for the MCMC algorithms are freely available from http://www.reading.ac.uk/~snsbalng/.

The inbreeding model

If inbreeding is expected to be the main cause of any deviation from HW, the inbreeding model may be appropriate (for example see Malécot, 1969). This model is completely general for diallelic loci but, in the multiallelic case, it cannot account for assortative mating or for some forms of selection. Under the inbreeding model, pij, the relative frequency of the genotype AiAj, is

$$p_{ij} = \begin{cases} p_i^2 + f\,p_i(1 - p_i), & i = j, \\ 2\,p_i p_j (1 - f), & i \neq j, \end{cases} \tag{1}$$

where pi denotes the frequency of allele Ai, and f is the inbreeding coefficient. When f = 0, eqn (1) gives the HW proportions. When f = 1, the maximum value, heterozygotes never arise. The value of f can be negative. It is bounded below by the requirement that the population frequencies of each homozygote be non-negative, which leads to:

$$f \geq -\frac{p_{\min}}{1 - p_{\min}}, \tag{2}$$

where pmin is the smallest allele frequency.
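
The model and its lower bound can be illustrated with a short numerical sketch. It assumes the standard form of the inbreeding model, in which homozygote frequencies are pi^2 + f pi(1 − pi) and heterozygote frequencies are 2 pi pj (1 − f); the function names and the allele frequencies used are hypothetical and chosen only for illustration.

```python
import numpy as np

def f_lower_bound(p):
    """Smallest admissible f: -p_min / (1 - p_min), from non-negative homozygote frequencies."""
    p = np.asarray(p, dtype=float)
    p_min = p.min()
    return -p_min / (1.0 - p_min)

def genotype_freqs(p, f):
    """Upper-triangular k x k array G with G[i, j] = frequency of genotype AiAj (i <= j)."""
    p = np.asarray(p, dtype=float)
    if f < f_lower_bound(p) or f > 1.0:
        raise ValueError("f is outside the range allowed by the allele frequencies")
    het = 2.0 * np.outer(p, p) * (1.0 - f)      # heterozygotes: 2 p_i p_j (1 - f)
    G = np.triu(het, k=1)                        # keep i < j only
    G[np.diag_indices_from(G)] = p**2 + f * p * (1.0 - p)   # homozygotes
    return G

# Illustrative (assumed) allele frequencies, not taken from any data set in the paper.
p = np.array([0.5, 0.3, 0.2])
G = genotype_freqs(p, 0.05)
print("lower bound on f:", f_lower_bound(p))
print("genotype frequencies sum to:", G.sum())
```

At the lower bound, the homozygote for the rarest allele has frequency zero, which is why the bound depends only on the smallest allele frequency.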

The value of f can be interpreted as the correlation between an individual's two genes at a locus (for example see Crow & Kimura, 1970). It measures the deficit (or excess) of heterozygosity that results from inbreeding (or outbreeding). In some models for population subdivision, f can be interpreted as the probability that an individual's two genes are identical by descent, in which case it is constrained to be non-negative.

Point estimation methods for f

Nei & Chesser (1983) discuss an estimator for the inbreeding coefficient:

where Xii and xi are the sample frequencies of AiAi genotypes and Ai alleles, and n is the number of individuals in the sample. This estimator is developed in terms of a subdivided population. In this setting, the parameter f in a single subpopulation is often denoted FIS.

Robertson & Hill (1984) give an alternative estimator:

where nii and ni denote the sample counts of AiAi genotypes and Ai alleles, and k is the number of alleles at the locus.

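These estimators do not explicitly take account of the inbreeding model and may, in the multiallelic case, give estimates that conflict with the bound (2). The maximum likelihood estimator under the inbreeding model does respect this bound. Assuming random sampling of genotypes, and writing nij for the sample count of AiAj genotypes, the likelihood is:

$$L(f, p_1, \ldots, p_k) = c \prod_{i=1}^{k} \left[ p_i^2 + f\,p_i(1 - p_i) \right]^{n_{ii}} \prod_{i<j} \left[ 2\,p_i p_j (1 - f) \right]^{n_{ij}}, \tag{5}$$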

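where c is a constant. For k = 2, eqn (5) is readily maximized (for example, see Weir, 1996, p. 65) to obtain

$$\hat{f} = 1 - \frac{2\,n\,n_{12}}{n_1 n_2}, \tag{6}$$

where n12 is the number of heterozygotes in the sample.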

For k>2, the likelihood cannot be maximized analytically, but numerical methods such as that given in Robertson & Hill (1984) can be employed, although problems may arise with iterates going out of bounds. For the case when the maximum likelihood estimate (MLE) is non-negative, the EM algorithm given in Hill et al. (1995) may be used. More general mode-finding algorithms are described in chapter 9 of Gelman et al. (1995).
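
As a minimal sketch of these calculations, the following evaluates the log-likelihood under the inbreeding model (up to the additive constant log c) and the closed-form estimate for k = 2. The closed form used here follows from equating the observed and expected heterozygote frequencies with pi estimated by ni/(2n); the function names, the array layout for the counts and the example counts are assumptions made for illustration.

```python
import numpy as np

def log_likelihood(counts, p, f):
    """Log-likelihood up to a constant.

    counts: k x k upper-triangular array, counts[i, j] = number of AiAj genotypes (i <= j).
    p: allele frequencies; f: inbreeding coefficient.
    """
    counts = np.asarray(counts, dtype=float)
    p = np.asarray(p, dtype=float)
    iu = np.triu_indices(len(p), k=1)
    probs = np.concatenate([p**2 + f * p * (1.0 - p),                 # homozygotes
                            (2.0 * np.outer(p, p) * (1.0 - f))[iu]])  # heterozygotes
    n = np.concatenate([np.diag(counts), counts[iu]])
    used = n > 0
    if np.any(probs[used] <= 0.0):
        return -np.inf          # an observed genotype has zero (or negative) probability
    return float(np.sum(n[used] * np.log(probs[used])))

def mle_two_alleles(n11, n12, n22):
    """Closed-form estimate of f for a diallelic locus."""
    n = n11 + n12 + n22
    n1 = 2 * n11 + n12          # count of allele A1
    n2 = 2 * n22 + n12          # count of allele A2
    return 1.0 - 2.0 * n * n12 / (n1 * n2)

# Hypothetical counts: 30 A1A1, 40 A1A2, 30 A2A2.
counts = np.array([[30, 40], [0, 30]])
f_hat = mle_two_alleles(30, 40, 30)
print("f_hat =", f_hat)
print("log-likelihood at f_hat:", log_likelihood(counts, [0.5, 0.5], f_hat))
```

For k > 2 the same log-likelihood function could be passed to a numerical optimizer, provided the search is restricted to values of f above the bound (2).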

MCMC method

Although standard errors can be attached to the point estimators described above, these are of limited value for estimators with bounded, and possibly highly skewed, sampling distributions. The profile likelihood for f (the likelihood function obtained by setting each nuisance parameter pi equal to its MLE p̂i) does provide a measure of the support given by the data to different possible values for f, but it ignores uncertainty in the pi.

The nuisance parameter problem can be overcome by integration over the joint distribution of the pi, leading to a marginal likelihood for f. After normalization, this is also the posterior density of f when the joint prior distribution for f and the allele frequencies is (multivariate) uniform over the range of possible values. Informative prior distributions, reflecting, for example, knowledge that f is unlikely to be large or to be negative, or information about the allele frequencies from previous studies, can also be incorporated.

When exact integration is not feasible, approximate integration can be achieved via one of a range of stochastic simulation methods known as MCMC algorithms. These algorithms generate a sequence of realizations from a specified probability distribution, which can then be used to approximate properties of the distribution to any required accuracy. We implement an algorithm of the Metropolis–Hastings type (Metropolis et al., 1953; Hastings, 1970; Smith & Roberts, 1993). Details are given in the Appendix.
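
The general idea can be sketched with a generic random-walk Metropolis sampler. This is not the algorithm of the Appendix: the proposal mechanisms (a normal step for f, and a transfer of frequency between two randomly chosen alleles), the step sizes, the run lengths and all function names are assumptions made for illustration. The sketch also assumes a joint uniform prior, so that the posterior is proportional to the likelihood inside the allowed region, and that every allele is observed at least once.

```python
import numpy as np

def log_lik(n_hom, n_het, p, f):
    """Inbreeding-model log-likelihood up to a constant; -inf outside the allowed region."""
    if np.any(p <= 0.0) or f > 1.0 or f < -p.min() / (1.0 - p.min()):
        return -np.inf
    hom = p**2 + f * p * (1.0 - p)
    het = 2.0 * np.outer(p, p) * (1.0 - f)
    with np.errstate(divide="ignore", invalid="ignore"):
        t_hom = np.where(n_hom > 0, n_hom * np.log(hom), 0.0)
        t_het = np.where(n_het > 0, n_het * np.log(het), 0.0)
    return t_hom.sum() + t_het.sum()

def sample_f(n_hom, n_het, n_iter=20000, burn_in=2000, thin=10,
             step_f=0.05, step_p=0.02, seed=0):
    """Return thinned post-burn-in draws of f (uniform prior on f and the p_i)."""
    rng = np.random.default_rng(seed)
    n_hom = np.asarray(n_hom, dtype=float)
    n_het = np.triu(np.asarray(n_het, dtype=float), k=1)   # counts of AiAj, i < j
    k = len(n_hom)
    allele_counts = 2.0 * n_hom + n_het.sum(axis=0) + n_het.sum(axis=1)
    p = allele_counts / allele_counts.sum()                # start at sample frequencies
    f = 0.0
    cur = log_lik(n_hom, n_het, p, f)
    draws = []
    for t in range(n_iter):
        # Symmetric random-walk proposal for f.
        f_new = f + rng.normal(0.0, step_f)
        new = log_lik(n_hom, n_het, p, f_new)
        if np.log(rng.random()) < new - cur:
            f, cur = f_new, new
        # Symmetric proposal for p: move a small amount between two alleles.
        i, j = rng.choice(k, size=2, replace=False)
        delta = rng.uniform(-step_p, step_p)
        p_new = p.copy()
        p_new[i] += delta
        p_new[j] -= delta
        new = log_lik(n_hom, n_het, p_new, f)
        if np.log(rng.random()) < new - cur:
            p, cur = p_new, new
        if t >= burn_in and (t - burn_in) % thin == 0:
            draws.append(f)
    return np.array(draws)

# Hypothetical diallelic sample: 30 A1A1, 40 A1A2, 30 A2A2.
draws = sample_f([30, 30], [[0, 40], [0, 0]])
print("number of retained draws:", len(draws))
print("posterior mean of f:", draws.mean())
```

Because both proposals are symmetric, the acceptance probability reduces to the likelihood ratio, and proposals that violate the bound (2) or leave the simplex are rejected automatically through the −∞ log-likelihood.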

Application to simulated data

The solid curves in Fig. 1 show the posterior density for f from samples of size n = 200 and n = 1000, at loci with k = 2, k = 6 and k = 15 alleles, simulated from the inbreeding model with f = 5% and assuming a uniform prior density. Corresponding point estimates of f are given in Table 1. These are given for comparison with Fig. 1 only and not to assess the properties of the point estimators, for which see, for example, Curie-Cohen (1982).

Fig. 1 Posterior density curves for the inbreeding coefficient f, for samples of size n = 200 and n = 1000, at loci with k = 2, k = 6 and k = 15 alleles, simulated from the inbreeding model with f = 5%. Solid and dashed curves correspond, respectively, to a uniform prior and an informative prior for f (shown in Fig. 2). A uniform prior was used for the allele frequencies in each case. Point estimates based on the same data are given in Table 1. The density curves are obtained by applying the density command of the statistical package S-PLUS to 10 000 values generated by an MCMC algorithm (burn-in length 5000; then every 50th value retained). See Appendix for further details.

Table 1 Nei & Chesser (1983), Robertson & Hill (1984) and maximum likelihood estimates of the inbreeding coefficient f, for data sets of size n = 200 and n = 1000, with k = 2, k = 6 and k = 15 alleles, simulated from the inbreeding model with f = 5%

Note that for k = 2 and n = 200 (Fig. 1a), the density curve is relatively flat and visibly non-zero over a large interval, reflecting the fact that there is little information about f in the data. Fortuitously in this simulation, the point estimates are around 6%, close to the true value of f (5%). The posterior density curve, however, clarifies the level of uncertainty: values for f as small as −8% and as large as 20% remain plausible based on the data.
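
A range of plausible values of this kind can be read off a set of MCMC draws. The following sketch, which is an illustration rather than the procedure used for Fig. 1, summarizes draws by a central interval, the proportion of negative values and a histogram-based density estimate (used here in place of the S-PLUS density command). The function name, the interval level and the stand-in draws are assumptions.

```python
import numpy as np

def summarize_draws(draws, level=0.95, bins=40):
    """Summarize posterior draws of f by an interval, Pr(f < 0) and a density estimate."""
    draws = np.asarray(draws, dtype=float)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(draws, [tail, 1.0 - tail])          # central interval
    density, edges = np.histogram(draws, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return {
        "interval": (float(lo), float(hi)),
        "prob_negative": float(np.mean(draws < 0.0)),
        "centers": centers,      # bin midpoints, for plotting
        "density": density,      # estimated density at each midpoint
    }

# Stand-in draws for illustration only; real use would pass the sampler output.
rng = np.random.default_rng(1)
fake_draws = rng.normal(0.06, 0.07, size=10000)
summary = summarize_draws(fake_draws)
print("95% interval:", summary["interval"])
print("Pr(f < 0):", summary["prob_negative"])
```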

When either the number of alleles k or the sample size n is increased, additional information is available from the data, which is reflected by sharper peaks in the density curves (Fig. 1b–f, solid curves). For these simulations, only in the case k = 15 and n = 1000 is the hypothesis f = 0 unequivocally excluded: in the other cases, the density curve is visibly above zero at f = 0.

If additional data are unobtainable or expensive, a careful choice of prior distribution for f can be helpful in narrowing the range of plausible values. For example, the dashed curve in Fig. 1(a) shows the posterior probability density for f corresponding to an informative prior density (shown in Fig. 2), which reflects a belief that f is likely to be close to zero. The effect of incorporating this prior information is almost the same as the effect of increasing the sample size from 200 to 1000 (Fig. 1b, solid curve). In some cases, it may be reasonable to assume a priori that f is non-negative, in which case a simpler and more efficient algorithm can be implemented, because the bound (2) does not need to be recalculated at each iteration.

Fig. 2 Prior density curves corresponding to the dashed curves of Fig. 1.

Each panel in Fig. 1 corresponds to a single data set. Although there will be some variation among replicate data sets with the same parameters, the patterns of increasing precision with greater k and n, and the effect of an informative prior density, are broadly unaltered (simulations not shown).

Comparison with analyses from recent literature

Table 2 gives point and interval estimates of f obtained by Hill et al. (1995) from a sample of size 60 of the human malaria parasite Plasmodium falciparum. Corresponding posterior density curves obtained via the MCMC method with a uniform prior density are shown in Fig. 3(a). As both approaches are likelihood based, the two sets of results do not conflict. However, the MCMC output is visual and so immediately interpretable. It also reveals the support given to negative values, incorporates exactly the effect of uncertainty about the nuisance parameters, and allows background information to be included to narrow the range of plausible values.

Table 2 MLE of f, and intervals in which the log-likelihood is within two of the maximum value, based on a sample of size 60 of the human malaria parasite Plasmodium falciparum (taken from table 3 of Hill et al., 1995)

Fig. 3 Posterior density curves for f for (a) the Plasmodium falciparum data of Hill et al. (1995) and (b) the Rhesus data given in Guo & Thompson (1992). Details of the MCMC algorithm used to obtain these curves are the same as for the solid curves of Fig. 1.

Figure 3(b) shows the result of applying the MCMC method to a sample of n = 8297 Rhesus genotypes given in Fig. 3 of Guo & Thompson (1992), who used hypothesis testing methods. The Rhesus locus is a highly polymorphic blood group genetic marker. Assortative mating with respect to blood groups is unlikely to occur, and any selection is thought to be negligible, so that the inbreeding model may be reasonable. Point estimates for f are extremely small:

and the exact-test P-values reported by Guo & Thompson (1992) are large, exceeding 0.69. Similarly, the posterior density for f indicates that values close to zero are highly plausible. However, the MCMC analysis also reveals that values for f in excess of 1% are consistent with the data, even though the locus is multiallelic and the sample is very large.

Note that our analysis assumes that alleles not present in the sample do not exist in the population under investigation. If there did exist a very rare allele not represented in the sample, the bound (2) would, in effect, restrict f to be non-negative because pmin would be very small.
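
The effect of a rare allele on the lower bound can be seen from a short worked calculation, assuming the bound takes the form f ≥ −pmin/(1 − pmin). The two values of pmin used below are illustrative only and are not estimates from the Rhesus data.

```latex
f \;\ge\; -\frac{p_{\min}}{1 - p_{\min}}, \qquad
p_{\min} = 0.2 \;\Rightarrow\; f \ge -0.25, \qquad
p_{\min} = 0.001 \;\Rightarrow\; f \ge -\frac{0.001}{0.999} \approx -0.001,
\qquad
\lim_{p_{\min} \to 0} \left( -\frac{p_{\min}}{1 - p_{\min}} \right) = 0 .
```

An allele with a population frequency of one in a thousand would therefore leave almost no room for negative values of f, whether or not it appears in the sample.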

Investigating the validity of the inbreeding model

The method thus far developed has concentrated on estimating the inbreeding coefficient f, the parameter of the inbreeding model (eqn 1). For more than two alleles, a thorough investigation into departures from HW should also examine the validity of this model. This can be performed by specifying a more general model, of which the inbreeding model is a special case, and examining whether or not the data support aspects of the general model not consistent with this special case. A natural extension of the one-parameter inbreeding model is the fixation indices model (Weir, 1970) for which genotype frequencies are

$$p_{ij} = \begin{cases} p_i^2 + p_i \sum_{l \neq i} f_{il}\,p_l, & i = j, \\ 2\,p_i p_j (1 - f_{ij}), & i \neq j, \end{cases} \tag{7}$$

where fij is the fixation coefficient for the heterozygous genotype AiAj.

For k = 2 alleles, eqn (7) reduces immediately to the inbreeding model (eqn 1). For k>2, the inbreeding model is recovered when all the fij are equal. Inspecting the posterior densities of the coefficients fij will therefore provide an insight into the validity of the inbreeding model: if the posteriors do not overlap to any noticeable degree, the model may be invalid. For small sample sizes, the posteriors of the fij may each support a wide range of plausible values, so that the fact that they all overlap may reflect insufficient data to distinguish competing models rather than strong support for the inbreeding model. If the model does appear unsuitable, the posteriors of the fij may be used to infer the nature of departures from HW. Note that the lower bound of each fij is a function of the corresponding allele frequencies, which can complicate within-sample comparisons when negative values are highly supported. An MCMC method for obtaining posterior densities for the fixation indices is detailed in the Appendix.
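
The reduction to the inbreeding model can be checked numerically. The sketch below assumes the usual form of the fixation indices model, in which the heterozygote AiAj has frequency 2 pi pj (1 − fij) and each homozygote absorbs the corresponding deficit so that the genotype frequencies sum to one; the function name and the numerical values are hypothetical.

```python
import numpy as np

def fixation_genotype_freqs(p, F):
    """Genotype frequencies under the fixation indices model.

    p: allele frequencies; F: symmetric k x k array of f_ij (diagonal ignored).
    Returns an upper-triangular array G with G[i, j] = frequency of AiAj (i <= j).
    """
    p = np.asarray(p, dtype=float)
    F = np.asarray(F, dtype=float)
    if not np.allclose(F, F.T):
        raise ValueError("F must be symmetric")
    F_off = F - np.diag(np.diag(F))              # zero the unused diagonal
    pp = np.outer(p, p)
    G = np.triu(2.0 * pp * (1.0 - F_off), k=1)   # heterozygotes: 2 p_i p_j (1 - f_ij)
    # Homozygotes: p_i^2 + p_i * sum over l != i of f_il p_l
    G[np.diag_indices_from(G)] = p**2 + (F_off * pp).sum(axis=1)
    if np.any(G < -1e-12):
        raise ValueError("fixation indices imply a negative genotype frequency")
    return G

# With all f_ij equal to 0.05 the result should match the inbreeding model with f = 0.05.
p = np.array([0.5, 0.3, 0.2])
G = fixation_genotype_freqs(p, np.full((3, 3), 0.05))
print("homozygote frequencies:", np.diag(G))
print("inbreeding-model values:", p**2 + 0.05 * p * (1.0 - p))
```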

Combining information over loci

We noted from Fig. 1(a) that estimation of f when n = 200 and k = 2 is very imprecise. If deviations from HW are caused predominantly by inbreeding, then f should be approximately constant over loci. In this case, sharper estimation can be obtained by pooling information from different loci. There may also be situations in which it is reasonable to assume that f has the same value in different populations, in which case information could also be shared across populations.

The five dashed curves in Fig. 4 show posterior densities for f evaluated from samples of size n = 200 simulated with f = 5% at each of five loci. The solid curve shows the posterior density evaluated for all five loci assuming a common value of f. The prior density is uniform for each curve, and the likelihood for the combined data is given by the product of the likelihoods for each locus. This assumes independence of genotypes at different loci, given the value of f, which would be inappropriate in the presence of gametic disequilibrium or genotypic association.
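
The product of single-locus likelihoods corresponds to a sum of log-likelihoods, as in the following sketch. For simplicity it holds the allele frequencies at each locus fixed, whereas the MCMC analysis integrates over them; the function names and the example counts are assumptions made for illustration.

```python
import numpy as np

def locus_log_lik(counts, p, f):
    """Single-locus inbreeding-model log-likelihood (up to a constant).

    counts: k x k upper-triangular array of genotype counts, counts[i, j] for AiAj (i <= j).
    """
    counts = np.asarray(counts, dtype=float)
    p = np.asarray(p, dtype=float)
    iu = np.triu_indices(len(p), k=1)
    probs = np.concatenate([p**2 + f * p * (1.0 - p),
                            (2.0 * np.outer(p, p) * (1.0 - f))[iu]])
    n = np.concatenate([np.diag(counts), counts[iu]])
    used = n > 0
    if np.any(probs[used] <= 0.0):
        return -np.inf
    return float(np.sum(n[used] * np.log(probs[used])))

def combined_log_lik(loci, f):
    """Sum of single-locus log-likelihoods for a common f (loci assumed independent given f)."""
    return sum(locus_log_lik(counts, p, f) for counts, p in loci)

# Five hypothetical diallelic loci with identical counts, for illustration only.
locus = (np.array([[30, 40], [0, 30]]), np.array([0.5, 0.5]))
loci = [locus] * 5
for f in (0.0, 0.2, 0.4):
    print(f, locus_log_lik(locus[0], locus[1], f), combined_log_lik(loci, f))
```

Differences in log-likelihood between values of f are multiplied by the number of loci in this example, which is the mechanism behind the sharper peak of the combined curve.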

Fig. 4 Posterior density curves for f from each of five simulated single-locus data sets with k = 2, n = 200 and f = 5% (dashed curves) and from the combined data assuming f constant over loci (solid curve). Each prior density is uniform.

Notice that the five dashed curves in Fig. 4 all overlap substantially: if this did not occur, it would suggest that the assumption of constant f is invalid. As expected, the solid curve has a higher peak than any of the dashed curves, indicating that more precise estimation is obtained by pooling information from the five data sets.

Hill et al. (1995) combine the information from the MSP-1 and MSP-2 loci to obtain an overall point estimate of 24%. Their corresponding interval is 9–39%, narrower than either of the single-locus intervals given in Table 2. Similarly, the posterior density curve for the combined data based on a uniform prior density (not shown) is more sharply peaked than either of the single-locus curves shown in Fig. 3(a).

Application to Samoan data

The dashed curves of Fig. 5 show the result of applying the MCMC methods outlined above to data from three STR loci (THO1, TPOX and CSF1PO) for a sample, collected in forensic work, of 143 Samoans resident in New Zealand. Six alleles were observed at each locus in the data set, although additional alleles are known to exist at these loci in other populations. As for the Rhesus locus considered earlier, our analysis assumes that alleles not represented in the sample do not exist in the Samoan population.

Fig. 5 Posterior density curves for f from three single-locus STR Samoan samples, each with k = 6 (dashed curves). The sample sizes are 142 (THO1), 140 (TPOX) and 141 (CSF1PO) individuals. The solid curve represents the posterior density for f obtained from the combined data assuming f constant over loci. Each prior density is uniform. Data provided by John Buckleton of ESR, New Zealand.

The fixation coefficient model yielded posteriors for each fij that were diffuse and overlapped substantially. (Uniform prior densities were used for the fij.) The inbreeding model may, therefore, be reasonable for these data. Moreover, the dashed curves of Fig. 5 overlap substantially, supporting a common value for f at the three loci. The solid curve represents the posterior density for f obtained by combining the data over the three loci, again assuming a uniform prior density for f.

The data support a large range of plausible values over the three loci, from about −3% to more than 20%. Combining information over the three loci results in some improvement in estimation, with the plausible range being narrowed to, say, 0–14%, and with values between 3% and 8% now being highly supported. Further improvement could be obtained by implementing an informative prior for f.

Discussion

The most obvious advantage of the MCMC method outlined here, compared with traditional approaches, is that results are represented visually in terms of posterior density curves and are thus readily interpretable. The method also has the advantage of allowing the scientist to incorporate background information if desired, thus reducing the amount of direct data required. The method is flexible and readily implemented. As well as being useful in practice, the method is also well supported in statistical theory: there are compelling reasons to support the view that uncertainty about an unknown parameter should, if possible, be described by its probability distribution (Smith & Bernardo, 1994).

Figure 1 shows that large ranges of plausible values often arise, particularly when few alleles can be distinguished and/or the sample size is small. The MCMC method highlights the resulting uncertainty in a direct and visual manner, making it preferable to point estimation methods. Figure 1 also shows that sharper estimation can be achieved by incorporating prior knowledge about the parameter, if desired. Alternatively, or in addition, estimation may be improved by combining information across loci (Fig. 4).

The likelihood basis of the method makes it very flexible, allowing investigation of the validity of the inbreeding model. If the model does not appear reasonable, the posteriors of the fixation indices fij may be used to quantify the departure from HW indicated by the sample.