Introduction

The Hardy–Weinberg law, independently formulated by Hardy (1908) and Weinberg (1908) more than 100 years ago, is a fundamental genetic principle. It is of importance in many areas of genome research. Hardy–Weinberg equilibrium is typically assumed in haplotype estimation (Single et al. 2002). Classical estimation of genetic relatedness between individuals by maximum likelihood methods also rests on the equilibrium assumption (Thompson 1975). In fact, many statistical models and procedures used in genetic epidemiology make the Hardy–Weinberg assumption. Data produced by genotyping arrays undergo extensive quality control procedures, of which testing for Hardy–Weinberg proportions (HWP) forms an important part (Laurie et al. 2010). It is well known, and widely stated in genetics textbooks (Hartl 1980; Hamilton 2009), that a biological population will reach HWP in one generation of random mating.

Recently, Graffelman and Weir (2017) have stressed that this is only true under the assumption of equal allele frequencies (EAF) in the sexes. If such equality does not hold, it will take two generations before equilibrium is achieved. Graffelman and Weir (2017) show that the statistical testing of EAF and HWP by chi-square or exact procedures is intricately linked in assumptions, leading to circularity in the statistical testing, because EAF tests assume HWP, whereas HWP tests assume EAF. The authors propose novel exact and likelihood ratio procedures that avoid this dependence in assumptions, making it possible to test EAF and HWP either independently or simultaneously.

In this paper, we readdress this issue from a Bayesian perspective. Our approach extends previous Bayesian work for the analysis of X-chromosomal variants (Puig et al. 2017). Here, we enumerate ten possible scenarios (models) for the data, choose prior distributions for the parameters of each one of the models and prior probabilities for them, derive marginal likelihoods, and compute posterior probabilities to identify the most probable model, given the observed data. Six of the scenarios (models) considered here coincide with the ones considered in Graffelman and Weir (2017); they include the scenario with both HWP and EAF in place, together with five scenarios where either HWP or EAF or both restrictions fail. The four scenarios considered here for the first time are included to admit the possibility of having one sex in HWP and the other one not in HWP.

Unlike the usual frequentist testing approach, which compares only two scenarios at a time, with one of them typically nested in the other, the Bayesian approach allows one to compare any number of models, whether nested or not. Unlike approaches based on heuristic model selection criteria, the Bayesian approach assesses models through their posterior probabilities, which add up to one and thereby help to quantify the uncertainty behind the model choice. Prior distributions are chosen so that posterior probabilities can be computed either exactly or through numerical integration, hence avoiding the need to estimate them through more intensive computational methods.

The structure of the paper is as follows. In the section Theory, we develop Bayesian theory for this particular genetic context. In the section Examples, we illustrate the use of the Bayesian approach with data taken from the Japanese population of the 1,000 Genomes project (The 1,000 Genomes Project Consortium et al. 2010). A Discussion section completes the paper.

Theory

Here, we describe the Bayesian framework that enables one to select the most credible scenario among ten alternative scenarios, including the one in which both HWP and EAF are in place. In the following, the subsection Notation presents our basic definitions and the subsection Scenarios and priors gives the probabilistic definition of each scenario, including both the corresponding statistical models and the prior distributions on their parameters. The subsection Bayesian model selection describes how Bayesian model selection works, and finally the subsection Simulation evaluation explores the performance of the model selection procedure through a simulation study.

Notation

We consider a biallelic genetic polymorphism with alleles A and B having population allele frequencies pAf and pBf in females and pAm and pBm in males, with pAf + pBf = pAm + pBm = 1. We denote the observed A and B allele counts in females as nAf and nBf, and in males as nAm and nBm, and their totals by nA = nAf + nAm and nB = nBf + nBm. When the population has equal allele frequencies for both sexes, denoted by EAF, their ratio, d, is given by

$$d = \frac{{p_{Am}}}{{p_{Af}}} = 1,$$
(1)

and d will be used as a measure of the discrepancy of male and female allele frequencies.

Let (pAAf, pABf, pBBf), with pAAf + pABf + pBBf = 1, be the female genotype frequencies, and let (pAAm, pABm, pBBm), with pAAm + pABm + pBBm = 1, be the male genotype frequencies in the population. We denote the observed genotype counts in females by (nAAf, nABf, nBBf), and in males by (nAAm, nABm, nBBm). The total genotype counts are given by the sum of the latter vectors, and indicated by (nAA, nAB, nBB), without the index for sex. The total sample size is n = nm + nf, where nf = nAAf + nABf + nBBf is the total number of females, and nm = nAAm + nABm + nBBm is the total number of males. The total allele counts can be obtained as nA = nAf + nAm = 2nAAf + nABf + 2nAAm + nABm and nB = 2nBBf + nABf + 2nBBm + nABm. One considers the population of females to be in a Hardy–Weinberg equilibrium when their genotype frequencies are such that

$$\rho _f = \frac{{p_{AAf} - p_{Af}^2}}{{p_{Af}(1 - p_{Af})}} = 0,$$
(2)

and the population of males to be so when

$$\rho _m = \frac{{p_{AAm} - p_{Am}^2}}{{p_{Am}(1 - p_{Am})}} = 0,$$
(3)

where ρf and ρm are the inbreeding coefficients for females and males, respectively, which can be used as measures of the deviation of female and male genotype frequencies from HWP. The term “inbreeding coefficient” might be regarded as a misnomer because disequilibrium might arise from genotyping error or by chance instead of from inbreeding, and because here we have a different coefficient for each sex even though both sexes take part in breeding. Nevertheless, we still use the term “inbreeding coefficient” for historical reasons and because of its widespread use in population genetics.

The range of ρf is the interval [−pmin/(1 − pmin), 1], where pmin = min(pAf, 1 − pAf); in particular, when pAf is 0 or 1, the range for ρf is [0, 1], and its range will only be [−1, 1] when pAf is 0.5. The same applies to the range of ρm after replacing pAf by pAm.
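As a minimal sketch of the definitions above, the following Python functions compute the sample frequency of allele A, the plug-in sample analogue of the inbreeding coefficient in Eq. (2), and the admissible range of ρ for a given allele frequency. The function names are ours and purely illustrative, and the plug-in value is not the Bayesian estimate used later in the paper.

```python
def allele_freq(n_aa, n_ab, n_bb):
    """Sample frequency of allele A from the genotype counts of one sex."""
    n = n_aa + n_ab + n_bb
    return (2 * n_aa + n_ab) / (2 * n)


def inbreeding_coef(n_aa, n_ab, n_bb):
    """Plug-in analogue of Eq. (2): (p_AA - p_A^2) / (p_A (1 - p_A))."""
    n = n_aa + n_ab + n_bb
    p_a = allele_freq(n_aa, n_ab, n_bb)
    if p_a in (0.0, 1.0):
        raise ValueError("rho is undefined for a monomorphic sample")
    return (n_aa / n - p_a ** 2) / (p_a * (1 - p_a))


def rho_range(p_a):
    """Admissible interval [-p_min / (1 - p_min), 1] for rho given p_A."""
    p_min = min(p_a, 1 - p_a)
    return (-p_min / (1 - p_min), 1.0)
```

For instance, a sample made up only of heterozygotes sits at the lower end of the range for pA = 0.5, and a sample without heterozygotes sits at the upper end.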

When ρm = ρf, we will state that males and females have equal inbreeding coefficients, and it will be denoted by EIC. It is easy to check that the relationship between male genotype and allele frequencies in the population is such that

$$p_{Am} = \frac{{2p_{AAm} + p_{ABm}}}{2},$$
(4)
$$p_{AAm} = p_{Am}^2 + p_{Am}p_{Bm}\rho _m,$$
(5)
$$p_{ABm} = 2p_{Am}p_{Bm}(1 - \rho _m),$$
(6)
$$p_{BBm} = p_{Bm}^2 + p_{Am}p_{Bm}\rho _m,$$
(7)

and similarly between female genotype and allele frequencies. The distribution of the vector of female genotype counts (nAAf, nABf, nBBf), is assumed to be:

$${\mathrm{Multinomial}}(n_f,(p_{AAf},p_{ABf},p_{BBf})),$$
(8)

and the one for male genotype counts (nAAm, nABm, nBBm) to be:

$${\mathrm{Multinomial}}(n_m,(p_{AAm},p_{ABm},p_{BBm})).$$
(9)
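To make the parameterization concrete, the sketch below maps an allele frequency and an inbreeding coefficient to genotype frequencies through Eqs. (5) to (7), and evaluates the multinomial log-probability of Eqs. (8) and (9). It is an illustration under our own naming, using only the Python standard library.

```python
import math


def genotype_freqs(p_a, rho):
    """Genotype frequencies (p_AA, p_AB, p_BB) from Eqs. (5)-(7)."""
    p_b = 1 - p_a
    return (
        p_a ** 2 + p_a * p_b * rho,
        2 * p_a * p_b * (1 - rho),
        p_b ** 2 + p_a * p_b * rho,
    )


def multinomial_logpmf(counts, probs):
    """Log-probability of genotype counts under Multinomial(n, probs)."""
    # An observed genotype with zero probability makes the sample impossible.
    if any(c > 0 and q <= 0 for c, q in zip(counts, probs)):
        return -math.inf
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # Genotypes with a zero count contribute nothing to the sum.
    return log_coef + sum(c * math.log(q) for c, q in zip(counts, probs) if c > 0)
```

With pA = 0.4 and ρ = 0.5, for example, the heterozygote frequency is halved with respect to its HWP value, while Eq. (4) still returns pA = 0.4.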

The precise situation of a biallelic genetic variant with respect to the HWP and EAF hypotheses can be efficiently visualized by plotting the genotype probabilities in a ternary diagram, also known as a de Finetti diagram (de Finetti 1926). A sample can be represented by a single point in the diagram that uniquely defines its genotype and allele frequencies. The base of the diagram is a 0–1 axis for the allele frequencies. Genotype frequencies are represented by the relative length of the three line segments obtained by perpendicular projection of the sample onto the edges of the diagram. Allele frequencies can be read off by projection onto the triangle base. The two vertices at the base of the ternary diagram therefore have a double interpretation: they represent the homozygote genotypes (AA and BB) as well as the two alleles, A and B. Genotype compositions in HWP must satisfy \(p_{AB}^2 = (2p_Ap_B)^2 = 4p_{AA}p_{BB}\) and are therefore constrained to be on a parabola in the triangle. EAF in the sexes is indicated by male and female compositions that line up vertically, perpendicularly with respect to the triangle base. For more background on the use of the ternary diagram for genetic data, we refer to Cannings and Edwards (1968), Graffelman and Morales-Camarena (2008), and Graffelman and Weir (2016) for X-chromosomal variants.

The ten ternary diagrams presented in Fig. 1 distinguish the ten different possibilities that we consider, which we call scenarios or models. The diagrams in the first row of Fig. 1 refer to a population with equal allele frequencies for both sexes, whereas the second row in that figure corresponds to populations whose allele frequencies differ between the sexes. The diagrams in the first column in Fig. 1 refer to the two scenarios in which genotype frequencies satisfy the HWP. Scenarios with deviations from Hardy–Weinberg proportions, shown in columns 2–5 in Fig. 1, all have points which are off the HW parabola. When the inbreeding coefficients are positive, indicating a lack of heterozygotes, the points fall below the parabola. When the inbreeding coefficients are negative, indicating an excess of heterozygotes, they will fall above the parabola.
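The link between the parabola and the inbreeding coefficient can be made explicit with a short calculation, which is our own step based on Eqs. (5) to (7) and is written here without the index for sex. Substituting those expressions gives

$$p_{AB}^2 - 4p_{AA}p_{BB} = 4p_A^2p_B^2(1 - \rho )^2 - 4(p_A^2 + p_Ap_B\rho )(p_B^2 + p_Ap_B\rho ) = - 4p_Ap_B\rho (p_A + p_B)^2 = - 4p_Ap_B\rho ,$$

where the last equality uses pA + pB = 1. For 0 < pA < 1, a composition is therefore on the parabola exactly when ρ = 0, has fewer heterozygotes than under HWP when ρ > 0, and has more heterozygotes than under HWP when ρ < 0.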

Fig. 1

Ternary diagrams with male and female population genotype frequencies for ten different scenarios. Top row: scenarios with equal allele frequencies in the sexes. Bottom row: scenarios with different allele frequencies for both sexes. Population allele frequencies shown by vertical projections onto the base of the diagram. Symbols on the parabola indicate that the corresponding sexes are in HWP. The dimension of the parameter space for each scenario (k) is given at the bottom of each diagram

The statistical models that correspond to each one of these ten scenarios, together with the prior distribution assumed for their parameters, are described next in a systematic way.

Scenarios and priors

We label the statistical models behind the scenarios with a double subindex, Mij, setting the first subindex i to 1 if the EAF hypothesis holds and to 2 otherwise. We use the second subindex j for the HWP hypothesis, setting it to 1 when HWP holds for both males and females, to 2 when it holds for males but not for females, to 3 when it holds for females but not for males, to 4 when HWP holds for neither males nor females but their inbreeding coefficients are equal, and to 5 when HWP holds for neither males nor females and their inbreeding coefficients are different.

Scenarios M11, M21, M14, M24, M15, and M25 are the ones considered by Graffelman and Weir (2017). Here, we provide some more detail by explicitly admitting the possibility of having one sex in HWP and the other not, which corresponds to the scenarios with models M12, M22, M13, and M23.
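The labeling scheme can be summarized in a few lines of Python. In this sketch, the function name and dictionary keys are ours, and the dimension k is obtained by counting free parameters (one allele frequency under EAF and two otherwise, plus the number of free inbreeding coefficients), which we assume matches the values of k shown in Fig. 1.

```python
def describe_model(i, j):
    """Restrictions implied by the label M_ij and the resulting parameter count."""
    if i not in (1, 2) or j not in (1, 2, 3, 4, 5):
        raise ValueError("i must be 1 or 2 and j must be between 1 and 5")
    # Number of inbreeding coefficients left free by the HWP/EIC restrictions.
    free_rho = {1: 0, 2: 1, 3: 1, 4: 1, 5: 2}[j]
    return {
        "label": f"M{i}{j}",
        "EAF": i == 1,
        "HWP_males": j in (1, 2),
        "HWP_females": j in (1, 3),
        "EIC": j in (1, 4),  # rho_m = rho_f, trivially so when both are 0
        "k": (1 if i == 1 else 2) + free_rho,
    }


MODELS = [describe_model(i, j) for i in (1, 2) for j in range(1, 6)]
```

Listing `MODELS` yields the ten scenarios, from the one-parameter full equilibrium model M11 to the four-parameter unrestricted model M25.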

Scenario M11: EAF and HWP in both sexes

If there are no disturbing factors operating in the population, one expects EAF to hold together with HWP for both males and females. When these three conditions hold, that is, when d = 1 together with ρm = ρf = 0, all male and female genotype frequencies can be written as a function of pAf (or equivalently pAm), and

$$(n_{AAf},n_{ABf},n_{BBf})|p_A\sim {\mathrm{Multinomial}}\left(n_f,(p_A^2,2p_A(1 - p_A),(1 - p_A)^2)\right),$$
(10)
$$(n_{AAm},n_{ABm},n_{BBm})|p_A\sim {\mathrm{Multinomial}}\left(n_m,(p_A^2,2p_A(1 - p_A),(1 - p_A)^2)\right),$$
(11)

where pA = pAf = pAm. Under this scenario, male and female allele frequencies, pAm and pAf, are equal, and they will be assumed to be Beta\((b_1^{11},b_2^{11})\) distributed, where the superindex ij in \(b_k^{ij}\) denotes the model. This prior distribution on allele frequencies univocally determines the prior distribution of all male and female genotype frequencies.
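Because the likelihood under M11 is proportional to pA raised to the total A allele count times (1 − pA) raised to the total B allele count, integrating it against the Beta prior gives a ratio of Beta functions. The sketch below codes that result on the log scale. It is our own derivation from the model as stated here, not a transcription of the expression in Appendix 1, and the function names are illustrative; the default prior parameters b1 = b2 = 1/2 follow the choice made later in the paper.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_multinomial_coef(counts):
    """Logarithm of n! / (n_AA! n_AB! n_BB!)."""
    return math.lgamma(sum(counts) + 1) - sum(math.lgamma(c + 1) for c in counts)


def log_marginal_m11(females, males, b1=0.5, b2=0.5):
    """log P(y | M11) for genotype counts (n_AA, n_AB, n_BB) of each sex."""
    n_aa, n_ab, n_bb = (f + m for f, m in zip(females, males))
    n_a = 2 * n_aa + n_ab  # total A allele count
    n_b = 2 * n_bb + n_ab  # total B allele count
    return (
        log_multinomial_coef(females)
        + log_multinomial_coef(males)
        + n_ab * math.log(2)  # factor 2^{n_AB} from the heterozygote terms
        + log_beta(b1 + n_a, b2 + n_b)
        - log_beta(b1, b2)
    )
```

As a sanity check, with a uniform prior (b1 = b2 = 1) and a single female, each of the three possible genotypes has marginal probability 1/3.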

The full equilibrium in M11 can be broken because EAF does not hold, and therefore d ≠ 1, because HWP do not hold for females, and therefore ρf ≠ 0, because HWP do not hold for males, and therefore ρm ≠ 0, or because of the simultaneous occurrence of any two or of all three of these conditions. These disequilibrium situations are the ones covered by the next nine scenarios.

Scenario M21: HWP in both sexes

With HWP in both sexes, we have ρm = ρf = 0, but here male and female allele frequencies are not the same, and hence d ≠ 1. In that case, all male and all female genotype frequencies can be posed as functions of pAf and pAm, respectively:

$$(n_{AAf},n_{ABf},n_{BBf})|p_{Af}\sim {\mathrm{Multinomial}}\left(n_f,(p_{Af}^2,2p_{Af}(1 - p_{Af}),(1 - p_{Af})^2)\right),$$
(12)
$$(n_{AAm},n_{ABm},n_{BBm})|p_{Am}\sim {\mathrm{Multinomial}}\left(n_m,(p_{Am}^2,2p_{Am}(1 - p_{Am}),(1 - p_{Am})^2)\right).$$
(13)

The allele frequencies, pAf and pAm, are assumed to be independent with Beta\((b_{1f}^{21},b_{2f}^{21})\) and Beta\((b_{1m}^{21},b_{2m}^{21})\) prior distributions, respectively, where, unless one has different information about pAf and about pAm, one will most likely set \(b_{1f}^{21} = b_{1m}^{21}\) and \(b_{2f}^{21} = b_{2m}^{21}\).

Scenario M12: EAF and HWP in males only

Under this scenario, one assumes that there is EAF and HWP among males, and hence d = 1 and ρm = 0, but HWP do not hold among females, and hence ρf ≠ 0, and in that case, male and female genotype frequencies can be posed just as a function of pAAf and of pABf, and

$$(n_{AAf},n_{ABf},n_{BBf})|p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}}(n_f,(p_{AAf},p_{ABf},1 - p_{AAf} - p_{ABf})),$$
(14)
$$\begin{array}{l}(n_{AAm},n_{ABm},n_{BBm})|p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}}\left(n_m,\left(\left(p_{AAf} + \frac{1}{2}p_{ABf}\right)^2,\right.\right.\\ \left.\left.2\left(p_{AAf} + \frac{1}{2}p_{ABf}\right)\left(1 - \left(p_{AAf} + \frac{1}{2}p_{ABf}\right)\right),\left(1 - \left(p_{AAf} + \frac{1}{2}p_{ABf}\right)\right)^2\right)\right).\end{array}$$
(15)

Under this scenario, the prior distribution used for (pAAf, pABf, 1 − pAAf − pABf) will be Dirichlet \((a_{1f}^{12},a_{2f}^{12},a_{3f}^{12})\), where the superindex ij in \(a_{kf}^{ij}\) denotes the model.

Scenario M22: HWP for males only

This scenario is like M12 but without EAF, and genotype frequencies can be written as a function of pAAf, pABf, and pAm, and are such that

$$(n_{AAf},n_{ABf},n_{BBf})|p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}}(n_f,(p_{AAf},p_{ABf},1 - p_{AAf} - p_{ABf})),$$
(16)
$$(n_{AAm},n_{ABm},n_{BBm})|p_{Am}\sim {\mathrm{Multinomial}}(n_m,(p_{Am}^2,2p_{Am}(1 - p_{Am}),(1 - p_{Am})^2)).$$
(17)

The prior distribution for pAm here will be Beta\((b_{1m}^{22},b_{2m}^{22})\) and for (pAAf, pABf, 1 − pAAf − pABf) will be Dirichlet \((a_{1f}^{22},a_{2f}^{22},a_{3f}^{22})\).

Scenario M13: EAF and HWP for females only

This scenario is like M12 in that EAF holds, but where HWP holds among females and not among males. In that case, genotype frequencies are a function of just pAAm and of pABm, and

$$\begin{array}{l}(n_{AAf},n_{ABf},n_{BBf})|p_{AAm},p_{ABm}\sim {\mathrm{Multinomial}}\left(n_f,\left(\left(p_{AAm} + \frac{1}{2}p_{ABm}\right)^2,\right.\right.\\ \left.\left.2\left(p_{AAm} + \frac{1}{2}p_{ABm}\right)\left(1 - \left(p_{AAm} + \frac{1}{2}p_{ABm}\right)\right),\left(1 - \left(p_{AAm} + \frac{1}{2}p_{ABm}\right)\right)^2\right)\right),\end{array}$$
(18)
$$(n_{AAm},n_{ABm},n_{BBm})|p_{AAm},p_{ABm}\sim {\mathrm{Multinomial}}(n_m,(p_{AAm},p_{ABm},1 - p_{AAm} - p_{ABm})).$$
(19)

Analogously to M12, we now use the Dirichlet \((a_{1m}^{13},a_{2m}^{13},a_{3m}^{13})\) prior on (pAAm, pABm, 1 − pAAm − pABm).

Scenario M23: HWP for females only

This scenario is like M13 but without EAF, and genotype frequencies are a function of pAf, of pAAm, and of pABm, and are such that

$$(n_{AAf},n_{ABf},n_{BBf})|p_{Af}\sim {\mathrm{Multinomial}}\left(n_f,(p_{Af}^2,2p_{Af}(1 - p_{Af}),(1 - p_{Af})^2)\right),$$
(20)
$$(n_{AAm},n_{ABm},n_{BBm})|p_{AAm},p_{ABm}\sim {\mathrm{Multinomial}}(n_m,(p_{AAm},p_{ABm},1 - p_{AAm} - p_{ABm})).$$
(21)

The prior distribution will be Beta\((b_{1f}^{23},b_{2f}^{23})\) for pAf and Dirichlet \((a_{1m}^{23},a_{2m}^{23},a_{3m}^{23})\) for (pAAm, pABm, 1 − pAAm − pABm).

Finally, in the case where neither males nor females are in HWP, we will distinguish the setting in which male and female inbreeding coefficients are equal, EIC, which will be labeled with a 4 as the second index, from the case in which the two inbreeding coefficients are different, which will be labeled with a 5. That, coupled with the possibility of having or not having EAF, leads to the last four possible scenarios.

Scenario M14: EAF and EIC

With this scenario, one has EAF, and neither males nor females are in HWP, but male and female inbreeding coefficients are equal, ρm = ρf ≠ 0, in which case male and female genotype frequencies are a function of pAAf and of pABf and are such that

$$(n_{AAf},n_{ABf},n_{BBf})|p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}}(n_f,(p_{AAf},p_{ABf},1 - p_{AAf} - p_{ABf})),$$
(22)
$$(n_{AAm},n_{ABm},n_{BBm})|p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}}(n_m,(p_{AAf},p_{ABf},1 - p_{AAf} - p_{ABf})).$$
(23)

Under this scenario, the prior distribution for (pAAf, pABf, 1 − pAAf − pABf) will be Dirichlet \((a_{1f}^{14},a_{2f}^{14},a_{3f}^{14})\).

Scenario M24: EIC only

This scenario is like M14 but without EAF, and in that case, genotype frequencies are a function of pAAf, of pABf, and of pAm. Their distribution is

$$(n_{AAf},n_{ABf},n_{BBf})|p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}}(n_f,(p_{AAf},p_{ABf},1 - p_{AAf} - p_{ABf}))$$
(24)
$$\begin{array}{l}(n_{AAm},n_{ABm},n_{BBm})|p_{Am},p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}} \\ \left( {n_m,} \left( {p_{Am}^2 + p_{Am}(1 - p_{Am})\frac{{p_{AAf} - (p_{AAf} \ + \ \frac{1}{2}p_{ABf})^2}}{{(p_{AAf} \ + \ \frac{1}{2}p_{ABf})(1 - p_{AAf} \ - \ \frac{1}{2}p_{ABf})}},} \right. \right.\\ 2p_{Am}(1 - p_{Am})(1 - \frac{{p_{AAf} - (p_{AAf} \ + \ \frac{1}{2}p_{ABf})^2}}{{(p_{AAf} \ + \ \frac{1}{2}p_{ABf})(1 - p_{AAf} \ - \ \frac{1}{2}p_{ABf})}}),\\ (1 - p_{Am})^2 + p_{Am}(1 - p_{Am})\left. {\left. {\frac{{p_{AAf} \ - \ (p_{AAf} \ + \ \frac{1}{2}p_{ABf})^2}}{{(p_{AAf} \ + \ \frac{1}{2}p_{ABf})(1 - p_{AAf} \ - \ \frac{1}{2}p_{ABf})}}} \right)} \right).\end{array}$$
(25)

The prior distribution will be Dirichlet \((a_{1f}^{24},a_{2f}^{24},a_{3f}^{24})\) for (pAAf, pABf, 1 − pAAf − pABf) and it will be Beta\((b_{1m}^{24},b_{2m}^{24})\) for pAm, but truncated to the set of feasible values for that parameter, the way indicated in Appendix 1.
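The male genotype probabilities in Eq. (25) are Eqs. (5) to (7) evaluated at pAm with ρm replaced by the female inbreeding coefficient. The sketch below computes them and flags whether a given pAm is feasible, where we take "feasible" to mean that all three male probabilities are non-negative; this is our assumption about the truncation, since the details of Appendix 1 are not reproduced here, and the function names are illustrative.

```python
def female_rho(p_aaf, p_abf):
    """Female inbreeding coefficient, Eq. (2), from female genotype frequencies."""
    p_af = p_aaf + 0.5 * p_abf
    return (p_aaf - p_af ** 2) / (p_af * (1 - p_af))


def male_probs_m24(p_am, p_aaf, p_abf):
    """Male genotype probabilities of Eq. (25), with rho_m set equal to rho_f."""
    rho = female_rho(p_aaf, p_abf)
    q = 1 - p_am
    probs = (
        p_am ** 2 + p_am * q * rho,
        2 * p_am * q * (1 - rho),
        q ** 2 + p_am * q * rho,
    )
    # For negative rho, some values of p_Am give negative homozygote
    # probabilities; such values are outside the feasible set.
    feasible = all(x >= 0 for x in probs)
    return probs, feasible
```

With female genotype frequencies (0.28, 0.24, 0.48), for which ρf = 0.5, every pAm in (0, 1) is feasible, whereas with an all-heterozygote female population (ρf = −1) only pAm = 0.5 is.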

Scenario M15: EAF only

With this scenario, one has EAF, and neither males nor females are in HWP but, unlike in scenario M14, male and female inbreeding coefficients are assumed to be different. In that case, genotype frequencies are a function of pAAf, pABf, and pAAm, and

$$(n_{AAf},n_{ABf},n_{BBf})|p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}}(n_f,(p_{AAf},p_{ABf},1 - p_{AAf} - p_{ABf})),$$
(26)
$$\begin{array}{l}(n_{AAm},n_{ABm},n_{BBm})|p_{AAm},p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}}\left( {n_m,\left( {p_{AAm},} \right.} \right.\\ 2p_{AAf} + p_{ABf} - 2p_{AAm},1 + p_{AAm} - 2p_{AAf} - \left. {\left. {p_{ABf}} \right)} \right).\end{array}$$
(27)

Here, the prior for (pAAf, pABf, 1 − pAAf − pABf) will be Dirichlet \((a_{1f}^{15},a_{2f}^{15},a_{3f}^{15})\) and the prior for pAAm will be Beta\((b_{1m}^{15},b_{2m}^{15})\), but truncated to the interval of feasible values for that parameter, the way indicated in Appendix 1.
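The interval of feasible values for pAAm can be read off Eq. (27) by requiring the three male probabilities to be non-negative. The sketch below does this under that assumption; it is our reading of the constraint rather than a transcription of Appendix 1, and the function names are illustrative.

```python
def feasible_p_aam(p_aaf, p_abf):
    """Interval of p_AAm keeping all male probabilities in Eq. (27) non-negative."""
    p_a = p_aaf + 0.5 * p_abf      # common allele frequency under EAF
    lower = max(0.0, 2 * p_a - 1)  # from p_BBm = 1 + p_AAm - 2 p_A >= 0
    upper = p_a                    # from p_ABm = 2 (p_A - p_AAm) >= 0
    return lower, upper


def male_probs_m15(p_aam, p_aaf, p_abf):
    """Male genotype probabilities (p_AAm, p_ABm, p_BBm) of Eq. (27)."""
    two_p_a = 2 * p_aaf + p_abf
    return (p_aam, two_p_a - 2 * p_aam, 1 + p_aam - two_p_a)
```

For a common allele frequency of 0.4 the interval is [0, 0.4], and for 0.7 it is [0.4, 0.7]; the endpoints correspond to males with no BB or no AB genotypes, respectively.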

Scenario M25: neither EAF nor HWP nor EIC

Finally, under M25, neither the EAF nor the HWP hypotheses for males and females hold, and male and female inbreeding coefficients are different, so that we deal with the general unrestricted model with a full four-dimensional parameter space, with

$$(n_{AAf},n_{ABf},n_{BBf})|p_{AAf},p_{ABf}\sim {\mathrm{Multinomial}}(n_f,(p_{AAf},p_{ABf},1 - p_{AAf} - p_{ABf})),$$
(28)
$$(n_{AAm},n_{ABm},n_{BBm})|p_{AAm},p_{ABm}\sim {\mathrm{Multinomial}}(n_m,(p_{AAm},p_{ABm},1 - p_{AAm} - p_{ABm})).$$
(29)

Under this scenario, the prior for (pAAf, pABf, 1 − pAAf − pABf) is Dirichlet \((a_{1f}^{25},a_{2f}^{25},a_{3f}^{25})\) while the prior for (pAAm, pABm, 1 − pAAm − pABm) is Dirichlet \((a_{1m}^{25},a_{2m}^{25},a_{3m}^{25})\).
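Under M25 the two sexes are modeled independently, each with a multinomial likelihood and a Dirichlet prior, so the marginal likelihood is a product of two Dirichlet-multinomial terms. The following sketch evaluates it on the log scale. It follows from standard Dirichlet-multinomial conjugacy rather than being copied from Appendix 1; the function names are illustrative, and the default prior parameters of 1/3 follow the choice made later in the paper.

```python
import math


def log_dirichlet_multinomial(counts, alphas):
    """Log of the multinomial likelihood integrated against a Dirichlet prior."""
    n, a0 = sum(counts), sum(alphas)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    log_ratio = math.lgamma(a0) - math.lgamma(a0 + n) + sum(
        math.lgamma(a + c) - math.lgamma(a) for a, c in zip(alphas, counts)
    )
    return log_coef + log_ratio


def log_marginal_m25(females, males,
                     a_f=(1 / 3, 1 / 3, 1 / 3), a_m=(1 / 3, 1 / 3, 1 / 3)):
    """log P(y | M25): independent Dirichlet-multinomial terms for each sex."""
    return (log_dirichlet_multinomial(females, a_f)
            + log_dirichlet_multinomial(males, a_m))
```

With a single individual per sex and a symmetric prior, each genotype has marginal probability 1/3 in each sex, so any pair of genotypes has marginal probability 1/9.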

In frequentist inference, this last entirely unrestricted scenario M25 is used as the reference (alternative) hypothesis against which all other scenarios are tested. In the Bayesian setting, it becomes just another model, treated on the same level as the other nine.

Depending on the values picked for (a1, a2, a3), the Dirichlet(a1, a2, a3) distribution will be more or less informative, and it will capture different information about male or female genotype frequencies. In particular, its expected value is \(({a}_{1},{a}_{2},{a}_{3})/(\mathop {\sum} {a}_{j})\), and one can choose the aj’s to reflect the fact that one expects some genotypes to have larger probabilities than others. Also, the larger \(\mathop {\sum} {a}_{j}\), the smaller the variances of the components of the Dirichlet random variable, and the more informative that prior distribution will be. When one is not willing to use subjective information about genotype frequencies, Bernardo and Tomazella (2010) and Berger et al. (2015) recommend using a Dirichlet prior with a1 = a2 = a3 = 1/3. We will use this reference prior, which is like assuming that what you know about genotype frequencies is worth as much as what you learn from a sample with nm or nf equal to one. Given that the actual sample sizes in our setting will typically be a lot larger than one, the impact of this Dirichlet prior on the posterior distribution for the genotype frequencies will be negligible.

An analogous argument can be made for choosing the parameters of the Beta(b1, b2) distribution to model the prior information about allele frequencies in those scenarios where that is needed. In that case, in the absence of subjective information, one often chooses Beta(b1, b2) with b1 = b2 = 1/2.
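The role of the prior parameters can be illustrated numerically. The sketch below returns the mean and variance of each component of a Dirichlet distribution together with the sum of its parameters, which acts as an effective prior sample size, and the posterior mean after observing multinomial counts. The function names are ours, and a Beta(b1, b2) distribution is handled as the two-component case.

```python
def dirichlet_summary(alphas):
    """Component means, component variances, and effective sample size."""
    a0 = sum(alphas)
    means = [a / a0 for a in alphas]
    variances = [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alphas]
    return means, variances, a0


def posterior_mean(alphas, counts):
    """Posterior mean of the frequencies after observing multinomial counts."""
    total = sum(alphas) + sum(counts)
    return [(a + c) / total for a, c in zip(alphas, counts)]
```

With a1 = a2 = a3 = 1/3 the effective sample size is one, so with a few dozen individuals the posterior mean is already very close to the observed relative frequencies.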

Note that our choice of prior for the genotype frequencies determines that the prior distributions for ρf and for ρm will have two modes at the extremes of the range of values that they take, and one mode at 0, which are features that one expects from reference priors for parameters with finite support and a singular point in its interior. Even though the prior probability that each one of these coefficients is larger than 0 is 0.55, and not 0.5, due to the asymmetry of the support for these parameters, the prior is vague enough to avoid having that much impact on the posterior distribution for these coefficients. Our choice of prior for genotype frequencies determines that the prior distributions for allele frequencies will have modes at 0, at 0.5, and at 1.

Different parameterizations for the statistical models allow for different ways of capturing what one knows about the parameters of the model through a prior distribution for them. Adopting the parameterizations used above allows for a choice of priors that leads to simple closed-form expressions for most of the posterior probabilities of the models considered. Alternative ways of choosing prior distributions for testing for HWP under the usual autosomal data, often involving different parameterizations of the statistical models, can be found in Lindley (1988), Shoemaker et al. (1998), Consonni et al. (2008), and Wakefield (2010). All their proposals could be adapted here, but if one chose these priors to have a small effective sample size, choosing them instead of the one we chose would make a small difference at a considerable extra computational cost, because they would require the use of Markov chain Monte Carlo methods to estimate posterior probabilities.

Bayesian model selection

In the frequentist literature, one usually tests for EAF and for HWP separately, assuming that the other hypothesis holds. Graffelman and Weir (2017) propose an omnibus exact test to test both hypotheses jointly against a specific alternative scenario. They also propose a likelihood ratio approach to pick the scenario that is most parsimonious among all scenarios that cannot be rejected. Alternatively, they also suggest doing model selection among the six models they consider, based on Akaike’s information criterion (AIC).

Instead, in the Bayesian setting, one can tackle the same problem by classifying a genetic variant into one scenario among the ten alternative scenarios described above, which is equivalent to selecting one model among ten. This is done by choosing a prior distribution on the parameter space for each model, together with a prior distribution on the model space, and then computing the posterior probability of each one of the ten models. Then, one selects the model that best represents the variant by picking the model with the largest posterior probability. In that setting, one treats all ten models involved on the same level, without assigning a special role to the full equilibrium model, M11.

The posterior probability of each model, P(Mij|y), which is the probability that Mij is the model generating the data, y = (nAAf, nABf, nBBf, nAAm, nABm, nBBm), assessed after the data have been observed, can be computed through Bayes theorem:

$$P(M_{ij}|y) = \frac{{P(M_{ij})P(y|M_{ij})}}{{\mathop {\sum}\nolimits_{i = 1}^2 {\mathop {\sum}\nolimits_{j = 1}^5 P } (M_{ij})P(y|M_{ij})}},$$
(30)

where P(Mij) is the prior probability assigned to Mij, and where P(y|Mij) is the marginal likelihood of Mij. With everything else staying constant, the larger P(y|Mij), the larger P(Mij|y). If all models are considered equally likely a priori, with P(Mij) = 1/10, then P(Mij|y) is proportional to P(y|Mij).
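Equation (30) is straightforward to evaluate once the marginal likelihoods are available. Since these are typically very small numbers, the sketch below works with their logarithms and subtracts the maximum before exponentiating, a common numerical safeguard that is our addition and not part of Eq. (30) itself. The function name and the dictionary-based interface are illustrative.

```python
import math


def posterior_model_probs(log_marginals, priors=None):
    """Eq. (30): posterior model probabilities from log marginal likelihoods."""
    labels = list(log_marginals)
    if priors is None:
        # Equal prior probabilities, e.g. 1/10 for each of ten models.
        priors = {m: 1 / len(labels) for m in labels}
    log_post = {m: math.log(priors[m]) + log_marginals[m] for m in labels}
    shift = max(log_post.values())  # shift avoids underflow in exp()
    weights = {m: math.exp(v - shift) for m, v in log_post.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}
```

The function takes a dictionary mapping model labels to log P(y|Mij) and returns probabilities that add up to one; the selected model is the key with the largest value.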

Most often, computing P(y|Mij) exactly is too complicated, and the marginal likelihoods need to be estimated through MCMC simulation. In our multinomial setting with Beta and Dirichlet priors though, there are closed-form expressions for P(y|Mij) which allow one to compute these marginal likelihoods exactly in the case of M11, M14, M21, M22, M23, and M25, and to evaluate them through numerical integration in the four remaining cases. The expressions for the marginal likelihoods, P(y|Mij), under our choice of prior distribution can be found in Appendix 1.

Simulation evaluation of the performance of the method

To evaluate the performance of this Bayesian model selection procedure and of the inferences that follow from it, we apply it under a large set of known scenarios in a simulation study. In particular, the method is tried on SNPs from populations where the female inbreeding coefficient, ρf, is known and takes values across its whole range, where ρm is either 0 or 0.5, where d is either 1 or 1.5, and where pAf is either 0.2 or 0.4. Sample sizes, nf and nm, are assumed to be equal and either 50 or 500.

For each set of values for (ρf, ρm, d, pAf, nf, nm) considered, we have simulated 1,000 SNPs from a population with the corresponding values for (ρf, ρm, d, pAf), and for each one of the samples, we have computed P(Mij|y) for the ten models considered, and the posterior expected value of ρf, \(\hat \rho _f = E(\rho _f|y)\) under the fully unrestricted model, M25.
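One way to generate the genotype counts of a simulated SNP is sketched below: genotype probabilities are obtained from Eqs. (5) to (7), with pAm = d · pAf as in Eq. (1), and counts are drawn from the multinomial distributions in Eqs. (8) and (9). This is an illustration using the Python standard library, not the code used for the study, and the function names and the seed argument are our assumptions.

```python
import random


def simulate_counts(n, p_a, rho, rng):
    """Draw genotype counts (n_AA, n_AB, n_BB) for n individuals of one sex."""
    p_b = 1 - p_a
    probs = [
        p_a ** 2 + p_a * p_b * rho,
        2 * p_a * p_b * (1 - rho),
        p_b ** 2 + p_a * p_b * rho,
    ]
    # Sampling n genotypes with these weights is a multinomial draw.
    draws = rng.choices(range(3), weights=probs, k=n)
    return tuple(draws.count(g) for g in range(3))


def simulate_snp(n_f, n_m, p_af, d, rho_f, rho_m, seed=None):
    """One simulated SNP: female and male genotype counts."""
    rng = random.Random(seed)
    females = simulate_counts(n_f, p_af, rho_f, rng)
    males = simulate_counts(n_m, d * p_af, rho_m, rng)
    return females, males
```

Calling `simulate_snp(50, 50, 0.4, 1.5, 0.0, 0.5)` repeatedly, for example, produces samples from a population for which M23 is the true model.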

When simulations are set for populations with d = 1 and ρm = 0, as in the top two panels of Fig. 2, the true model is bound to be either M11, when ρf = 0, or M12 when ρf ≠ 0. These two panels present the average of the 1,000 values obtained for P(M11|y), for P(M12|y), and for the sum of the posterior probabilities of the other eight models, as a function of ρf. These averages estimate the expected value of P(M11|y), of P(M12|y), and of the sum of the remaining posterior probabilities for each given (ρf, ρm, d, pAf). As desirable, the expected value of P(M11|y) and of P(M12|y) peak for the values of ρf where the corresponding model holds true. One also observes that for nf and nm as small as 50, the sum of the posterior probabilities for the eight models known to be wrong is already negligible, and that the larger the sample sizes, the more peaked the expected values of P(M11|y) and of P(M12|y) are as a function of ρf, and hence the larger the power of the model selection procedure.

Fig. 2

Expected value of P(Mij|y) for M11 and M12 for two different sample sizes in the top two panels, and for M23, M24, and M25 for the same two sample sizes in the bottom panels, and of the sum of the posterior probabilities of the remaining models, as a function of ρf in its whole support, when pAf = 0.4

When simulations are set for populations with d = 1.5 and ρm = 0.5, as in the bottom two panels in Fig. 2, the true models are M23, when ρf = 0, M24, when ρf = ρm = 0.5, or M25, when ρf ≠ 0 and ρf ≠ ρm. These two panels show that when nf = nm = 500, these three models are indeed the ones with the largest P(Mij|y) around the values for ρf for which they are the models in place, and they also show that the sum of the posterior probabilities of the other seven models, which are known to be wrong, is negligible. Instead, when nf = nm = 50, there is not enough power to tell M25 apart from M23 and M24 when M25 is the correct model and ρf is between 0 and 0.5. Also, when sample sizes are small, the posterior probabilities of the other seven models are not negligible anymore, even though none of these alternative models ever comes as the winner.

To evaluate the performance of our approach when it comes to estimating inbreeding coefficients, we explore how the posterior expected value of ρf under M25 performs as an estimate of ρf. Figure 3 presents the 90% posterior-credible interval for E(ρf|y) together with the average of the sample of 1,000 values obtained for E(ρf|y) for known values of ρf in its whole range. Observe that the average of the E(ρf|y) is extremely close to the true ρf of the population from which the SNPs have been sampled, which is an indication that \(\hat \rho _f = E(\rho _f|y)\) is practically an unbiased estimate of ρf even for sample sizes as small as 50. The accuracy of the estimation of ρf grows with sample size and it is larger for pAf = 0.4. Note also that the prior mode at 0 shrinks inferences about ρ toward 0 and hence, for very small nf’s and positive ρf’s, there will be a small downward bias and not the upward bias that one might have anticipated from the fact that the prior expected value for ρf is positive.

Fig. 3

Expected value of \(\hat \rho _f = E(\rho _f|y)\) (circles), and 90% posterior-credible intervals for \(\hat \rho _f = E(\rho _f|y)\) (line segments), as a function of ρf, when d = 1 and ρm = 0

Examples

To illustrate our approach to testing for HWP and EAF through Bayesian model selection, we analyze markers from chromosome 22 using data from the Japanese population of the 1,000 Genomes project (The 1,000 Genomes Project Consortium et al. 2015) consisting of nm = 56 males and nf = 48 females. Genetic variants were extracted with the PLINK program (Purcell et al. 2007), using only variants that had no missing values. Variants were LD pruned (using the PLINK option --indep-pairwise 50 5 0.5) in order to produce an approximately uncorrelated subset.

Classification of ten single SNPs

To illustrate the use of our method, here, we report the posterior probabilities for the ten alternative scenarios described in the section Scenarios and priors for the ten SNPs presented in Table 1. These ten SNPs were chosen so that each of the ten scenarios considered has the largest posterior probability for exactly one of them.

Table 1 Genotype counts of ten single-nucleotide polymorphisms for males and females from a sample of the Japanese population study with nm = 56 and nf = 48

The posterior probabilities, presented in Table 2, are computed through Eq. (30), assuming equal prior probabilities for the ten models, and hence P(Mij) = 1/10, and using the expressions for the marginal likelihoods, P(y|Mij) in Appendix 1 with aj = 1/3 for all the Dirichlet priors on population genotype frequencies, and with bj = 1/2 for the beta priors on population allele frequencies. The only exception will be in Scenario M15, where we will use a Beta\((b_1^{15} = 1/3,b_2^{15} = 2/3)\) prior for the population genotype frequency, pAAm. Given that each one of these priors corresponds to an effective sample size of only one, and data involve a sample size of n = 104, the role played by prior distributions will be negligible. Sample sizes will most often be larger than in this example, and hence in practice, the choice of a prior will most often be even less relevant than here.
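The normalization step behind these posterior probabilities can be sketched as follows. This is a minimal illustration, assuming that Eq. (30) is the usual Bayes rule over a finite set of models, P(Mij|y) ∝ P(y|Mij) P(Mij). The function name `posterior_model_probs` and the log marginal likelihood values are hypothetical; they are not taken from Appendix 1 or Table 2, and only three models are shown for brevity.

```python
import math

def posterior_model_probs(log_marglik, prior=None):
    """Bayes' rule over a finite set of models, evaluated on the log scale.

    log_marglik: dict mapping model name -> log P(y | model).
    prior: dict of prior model probabilities; equal priors if omitted.
    """
    models = list(log_marglik)
    if prior is None:
        prior = {m: 1.0 / len(models) for m in models}
    log_post = {m: log_marglik[m] + math.log(prior[m]) for m in models}
    # Subtract the maximum before exponentiating to avoid underflow.
    shift = max(log_post.values())
    weights = {m: math.exp(v - shift) for m, v in log_post.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

# Hypothetical log marginal likelihoods (illustrative only, not from the paper).
example = {"M11": -12.1, "M12": -13.0, "M25": -15.4}
probs = posterior_model_probs(example)
```

Working on the log scale is a design choice rather than a requirement: marginal likelihoods of genotype counts can be very small numbers, and normalizing after subtracting the largest log value keeps the computation stable.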

Table 2 Posterior probabilities, P(Mij|y), of the ten scenarios considered in the section Scenarios and priors for the ten SNPs presented in Table 1

The first marker in Table 1, rs566641289, has a posterior probability of 0.503 of being both in HWP and having EAF, and hence one rejects all the nine disequilibrium scenarios, with posterior probabilities of 0.215 or smaller. The second marker in Table 1 has a posterior probability smaller than 0.001 of being both in HWP and having EAF, but it has instead a posterior probability of 0.741 of being in the M12 disequilibrium scenario, with d = 1 and ρm = 0 but with ρf ≠ 0, and hence where EAF holds but where HWP fails for females. For the third marker, the equilibrium scenario M11 is also rejected, because it has a posterior probability smaller than 0.001, and one settles on the M13 scenario, with d = 1 and ρf = 0 but ρm ≠ 0, because it has a posterior probability of 0.775. For the last marker, the most probable scenario is the saturated M25 disequilibrium scenario with d ≠ 1, ρf ≠ 0, and ρm ≠ 0, even though P(M25|y) = 0.433 is smaller than 0.5.

One of the advantages of the Bayesian approach is that one can easily simulate samples from the marginal posterior distribution of any function of the parameters, which is very useful when it comes to presenting and interpreting the results. Figures 4 and 5, for example, present samples from the marginal posterior distributions of (pAf, pAm), of (ρf, ρm), of log(d), and of ρm − ρf for the ten markers in Table 1. These marginal posterior distributions are computed assuming the fully unrestricted model, M25, as described in Appendix 2. These two figures also present 90% highest posterior density (hpd) credible intervals/regions for these parameter values or pairs of parameter values.
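The kind of posterior simulation described here can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of Appendix 2: we assume that, under M25, males and females have independent Dirichlet(1/3, 1/3, 1/3) priors on their genotype frequencies (AA, AB, BB), that the inbreeding coefficient is ρ = 1 − pAB/(2 pA pB), and that d is the ratio of the male to the female frequency of allele A. The genotype counts, the function names, and the use of equal-tailed intervals in place of hpd intervals are all hypothetical simplifications.

```python
import math
import random

def rdirichlet(alpha, rng):
    """One draw from a Dirichlet distribution via normalized gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def inbreeding(p_aa, p_ab, p_bb):
    """Inbreeding coefficient implied by genotype frequencies (assumed definition)."""
    p_a = p_aa + p_ab / 2.0
    return 1.0 - p_ab / (2.0 * p_a * (1.0 - p_a))

def simulate_m25(males, females, a=1.0 / 3.0, n_draws=2000, seed=1):
    """Posterior draws of log(d), rho_m and rho_f under the saturated model."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_draws):
        # Conjugate update: Dirichlet prior plus multinomial counts.
        gm = rdirichlet([a + c for c in males], rng)
        gf = rdirichlet([a + c for c in females], rng)
        p_am = gm[0] + gm[1] / 2.0
        p_af = gf[0] + gf[1] / 2.0
        out.append({"log_d": math.log(p_am / p_af),
                    "rho_m": inbreeding(*gm),
                    "rho_f": inbreeding(*gf)})
    return out

def equal_tailed_interval(values, level=0.90):
    """Equal-tailed credible interval (a simplification of the hpd interval)."""
    s = sorted(values)
    lo = s[int((1.0 - level) / 2.0 * (len(s) - 1))]
    hi = s[int((1.0 + level) / 2.0 * (len(s) - 1))]
    return lo, hi

# Hypothetical genotype counts (AA, AB, BB) with nm = 56 and nf = 48.
draws = simulate_m25(males=(20, 26, 10), females=(15, 24, 9))
rho_f_hat = sum(r["rho_f"] for r in draws) / len(draws)  # estimate of E(rho_f | y)
log_d_interval = equal_tailed_interval([r["log_d"] for r in draws])
```

Any other function of the genotype frequencies, such as ρm − ρf, can be summarized in the same way from the same set of draws.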

Fig. 4

Samples from the marginal posterior distributions of (pAf, pAm), of (ρf, ρm), of log(d), and of ρm − ρf, and 90% hpd posterior-credible regions, all for the first five SNPs in Table 1

Fig. 5

Samples from the marginal posterior distributions of (pAf, pAm), of (ρf, ρm), of log(d), and of ρm − ρf, and 90% hpd posterior-credible regions, all for the last five SNPs in Table 1

Note that in Figs. 4 and 5, all posterior distributions for a given SNP are consistent with the characteristics of the model with the largest posterior probability. In particular, observe that the posterior-credible intervals for log(d) in Fig. 4, corresponding to the five SNPs whose most probable model has d = 1, all include 0, while the opposite happens in Fig. 5, where all SNPs are from scenarios with d ≠ 1. Also, note that for all the SNPs classified into one of the four models having EIC, the credible intervals for ρm − ρf include 0 and the credible regions for (ρf, ρm) include a substantial part of the diagonal ρf = ρm. The same consistency is observed in the samples from the posterior of (pAf, pAm).

We analyzed the same set of SNPs using Akaike’s information criterion as proposed by Graffelman and Weir (2017). In order to do so, the maximum likelihood (ML) estimators of scenarios M12, M13, M22, and M23, not covered by that paper, were derived. Appendix 3 provides the details on the computation of the MLE for all ten scenarios. Table 3 presents the value taken by AIC for all ten SNPs and all ten scenarios; note that the AIC is always the smallest for the same scenario chosen by our Bayesian procedure, and that AIC and posterior probabilities provide very similar rankings of the models for all ten SNPs considered.
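As an illustration of the AIC computation, the following sketch evaluates AIC = 2k − 2 log L for the two extreme scenarios only, the equilibrium model M11 and the saturated model M25. It rests on assumptions that are not spelled out in this section: M11 is taken to have k = 1 free parameter (a common allele frequency, estimated by the pooled sample allele frequency) and M25 to have k = 4 (two free genotype frequencies per sex, estimated by the sample proportions). The function names and the genotype counts are hypothetical, and the remaining eight scenarios require the estimators of Appendix 3, which are not reproduced here.

```python
import math

def multinomial_loglik(counts, probs):
    """Multinomial log-likelihood, including the multinomial coefficient."""
    n = sum(counts)
    ll = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    for c, p in zip(counts, probs):
        if c > 0:  # terms with zero counts contribute nothing
            ll += c * math.log(p)
    return ll

def hw_probs(p):
    """Hardy-Weinberg genotype proportions (AA, AB, BB) for allele frequency p."""
    return (p * p, 2.0 * p * (1.0 - p), (1.0 - p) ** 2)

def aic_m11(males, females):
    """AIC under HWP in both sexes and EAF (assumed k = 1)."""
    n_a = 2 * males[0] + males[1] + 2 * females[0] + females[1]
    p_hat = n_a / (2.0 * (sum(males) + sum(females)))  # pooled ML estimate
    ll = (multinomial_loglik(males, hw_probs(p_hat))
          + multinomial_loglik(females, hw_probs(p_hat)))
    return 2 * 1 - 2.0 * ll

def aic_m25(males, females):
    """AIC under the saturated model (assumed k = 4)."""
    ll = sum(multinomial_loglik(g, [c / sum(g) for c in g])
             for g in (males, females))
    return 2 * 4 - 2.0 * ll

# Hypothetical genotype counts (AA, AB, BB) with nm = 56 and nf = 48.
males, females = (20, 26, 10), (15, 24, 9)
aic_values = {"M11": aic_m11(males, females), "M25": aic_m25(males, females)}
```

When both sexes show exact Hardy–Weinberg proportions with the same allele frequency, the two log-likelihoods coincide and the AIC difference reduces to the penalty difference, 2(4 − 1) = 6, in favor of M11.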

Table 3 AIC of the ten scenarios considered in the section Scenarios and priors for the ten SNPs presented in Table 1

Simultaneous analysis of multiple SNPs

In this section, we illustrate the Bayesian model selection approach to testing for HWP and EAF by carrying out the simultaneous analysis of the set of all 107,261 complete polymorphic SNPs with RS identifiers on chromosome 22 of the Japanese population of the 1,000 Genomes project.

Figure 6 presents the model with the largest posterior probability for each one of these SNPs, presented in the order in which these SNPs appear on the chromosome. Consecutive sequences of markers being systematically classified to the same scenario might be an indication of quality control problems in the SNP measurements. Too few SNPs being classified as being in HW equilibrium would also be an indication of either a problem in the measurements or of the fact that the population under scrutiny is actually in disequilibrium.

Fig. 6

Model with the largest posterior probability for the SNPs selected from the Japanese population, presented in the position where they are placed on the chromosome. The vertical gray line indicates the boundary of the centromere, and the numbers on the right extreme correspond to the proportion of SNPs classified into each one of the scenarios considered

Figure 6 also indicates the proportion of these 107,261 SNPs that have been classified into each one of the ten alternative scenarios described in the section Scenarios and priors. Scenario M11, representing the setting where both EAF and HWP for males and females are in place, is the scenario with the largest posterior probability for 92.53% of all the SNPs considered. Scenario M14 is the second most frequent scenario among all SNPs, because it is the one with the largest posterior probability in 2.67% of the cases considered. Scenarios M21 and M12 are the third and fourth most frequent ones, because they are the ones with the largest probability in 1.66% and in 1.46% of the cases, respectively. The least frequent scenario among the SNPs from chromosome 22 is the saturated model, M25, which is the most probable model in only 0.01% of the cases.
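The bookkeeping behind such a summary can be sketched as follows: each SNP is assigned to its most probable scenario, the proportions per scenario are tabulated, and stretches of consecutive markers assigned to the same non-equilibrium scenario are flagged for inspection. The function names, the run-length threshold, and the toy label sequence are hypothetical choices made for illustration; the text does not prescribe a particular threshold.

```python
from collections import Counter
from itertools import groupby

def best_model(post):
    """Name of the scenario with the largest posterior probability."""
    return max(post, key=post.get)

def scenario_proportions(labels):
    """Proportion of SNPs assigned to each scenario."""
    counts = Counter(labels)
    return {m: c / len(labels) for m, c in counts.items()}

def long_runs(labels, min_length=5, ignore="M11"):
    """Runs of consecutive SNPs assigned to the same non-equilibrium scenario.

    Returns (scenario, start_index, run_length) tuples; min_length is an
    arbitrary illustrative threshold.
    """
    runs = []
    start = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label != ignore and length >= min_length:
            runs.append((label, start, length))
        start += length
    return runs

# Toy sequence of most probable scenarios in chromosomal order (hypothetical).
labels = ["M11"] * 6 + ["M14"] * 3 + ["M11"]
proportions = scenario_proportions(labels)
flagged = long_runs(labels, min_length=3)
```

In this toy sequence, the three consecutive M14 assignments starting at index 6 are the only flagged run.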

Figure 6 shows a dense stripe for the most common scenario M11, and also reflects that EAF scenarios M1i are far more common than the corresponding M2i scenarios that have different allele frequencies in the sexes. Overall, there is little evidence for deviations from EAF, and where such deviations are found, HWP still mostly holds for both sexes separately. Chromosome 22 is acrocentric, having its centromere in the interval 12.2–17.9 Mb (hg19). It is known that the centromere region is hard to genotype, and that deviation from HWP is more often found for variants inside and flanking the centromere (Graffelman et al. 2017). Figure 6 also shows that scenario M15 is more common in the centromere in comparison with the rest of the chromosome. In fact, all disequilibrium scenarios with EAF were found to be more common in the centromere, though this is hard to perceive in Fig. 6. Outside the centromere, 92.7% of the variants are assigned to the M11 scenario, whereas this drops to 88.5% in the centromere region.

Discussion

We have presented a Bayesian method for joint inference on Hardy–Weinberg equilibrium and on equality of allele frequencies for biallelic markers. Disequilibrium might be due to a difference in allele frequencies between the sexes, to males or females not satisfying the HW proportions, or any combination of these situations simultaneously. By computing the posterior probability for each scenario, one can classify each SNP into its most probable scenario.

Frequentist tests compare models (scenarios) in pairs, and when choosing between equilibrium (i.e., M11) and disequilibrium (i.e., any of the other nine models), one can pair equilibrium with disequilibrium in nine different ways. In order to precisely determine the disequilibrium scenario that is in place with a frequentist approach, several statistical tests are necessary, because the number of ways in which scenarios can be paired grows quickly, and many of these pairs of scenarios are not nested. Instead, by assigning a posterior probability to each one of the ten scenarios, with the ten probabilities adding up to one, our Bayesian approach provides a simple way of selecting the most probable scenario in the light of the data. Moreover, the Bayesian approach allows the simultaneous comparison of all possible models, whereas the likelihood ratio approach is restricted to comparing models that are nested.

Akaike’s information criterion also offers an easy way to select the best fitting model among all available models. However, if two models have a similar AIC, then it may be hard to tell how strong the evidence is in favor of the better fitting model. In the Bayesian approach, the scenarios are compared in a probability scale, and this gives a better idea of the extent to which the best fitting model outperforms its competitors.

One nice feature of the Bayesian approach is that, on top of providing posterior probabilities for the scenarios, it also yields the posterior distribution of the parameters of interest, in the way described in Appendix 2 and illustrated in the section Classification of ten single SNPs.

We have found it convenient to parameterize disequilibrium by using the inbreeding coefficient and the ratio of male to female allele frequencies, using a Dirichlet prior on the genotype frequencies. Alternatively, priors specified directly on the disequilibrium measures might also be considered.

The Bayesian procedures described here do not require one to implement MCMC methods, as is usual in most Bayesian applications, and that simplifies computations considerably. If the integration required for the computation of some of the posterior probabilities is carried out efficiently, there will not be any problem in using this method for complete chromosomes.

The analysis of variants on chromosome 22 by our proposed Bayesian procedure reveals more deviation from the equilibrium scenario in the centromere region, in comparison with the rest of the chromosome, which is consistent with previous disequilibrium studies. A more detailed analysis of centromeric regions could be of interest, given the role this region has in human diseases like cancer (Barra and Fachinetti 2018). The analysis also shows that deviation from HWP is far more frequent than deviation from EAF. This is in agreement with the population-genetic principle that it takes only one generation to achieve EAF, but two to achieve HWP.

Software

The Bayesian methods presented here have been programmed in R by Xavier Puig. Bayesian model selection can be carried out using the function HWPosterior of version 1.6.2 of the HardyWeinberg package (Graffelman 2015). The function HWAIC performs the AIC calculations.