Introduction

Testing for deviations from Hardy–Weinberg equilibrium (HWE) is an important quality control step in genome-wide association studies (Anderson et al. 2010; Bycroft et al. 2018; Laurie et al. 2010; Turner et al. 2011). Extensive literature exists on HWE tests for the autosomes, from classical tests to recent work on Bayesian approaches, structured populations, sequenced and imputed genotypes, and software tools (Ayres and Balding 1998; Bourgain et al. 2004; Emigh 1980; Graffelman 2015; Graffelman et al. 2017; Graffelman et al. 2013; Hao and Storey 2017; Hernandez and Weir 1989; Levene 1949; Rohlfs and Weir 2008; Shriner 2011; Wakefield 2010; Wigginton et al. 2005; Yu et al. 2009). However, tests for HWE on the X chromosome have only been recently developed (Graffelman and Weir 2016, 2018; Puig et al. 2017; You et al. 2015; Zheng et al. 2007). The importance of associations of X-linked variants with complex traits, particularly as a mechanism of sexual dimorphism, has been recently recognized (Chang et al. 2014; Gao et al. 2015; Khramtsova et al. 2019; Kudelka et al. 2016; Kukurba et al. 2016; Li et al. 2015; Scelsi et al. 2018; Traglia et al. 2017; Yap et al. 2018), and these developments underscore the importance of proper quality control on X, including testing for deviations from HWE.

A naive test for HWE on X would consider females only. However, such a test would implicitly assume an equal allele frequency between males and females. Indeed, a number of tests have been recently proposed for joint testing of HWE in females as well as equality of allele frequencies between the sexes (Graffelman and Weir 2016; Puig et al. 2017; You et al. 2015). However, these tests ignore the possibility that allele frequencies in males and females would differ naturally due to sex-biased admixture.

While autosomal allele frequencies reach HWE within a single generation, it is well known that for X, if male and female allele frequencies initially differ, equilibrium (in an infinite population) is reached only asymptotically (Jennings 1916; Rosenberg 2016). The classical equations describing the evolution of allele frequencies on X, for an infinite population, are

$$p_f\left( {t + 1} \right) = \frac{{p_m\left( t \right) + p_f\left( t \right)}}{2},$$
(1)
$$p_m\left( {t + 1} \right) = p_f(t),$$
(2)

where pf(t) and pm(t) are the female and male allele frequencies, respectively, at generation t. Starting with unequal allele frequencies at generation t = 0, the male and female frequencies oscillate while gradually stabilizing. Specifically (Rosenberg 2016), if pf(0) = 1 and pm(0) = 0, then

$$p_f\left( t \right) = \frac{{2^{t + 1} + \left( { - 1} \right)^t}}{{3 \cdot 2^t}}$$
(3)
$$p_m\left( t \right) = \frac{{2^t - \left( { - 1} \right)^t}}{{3 \cdot 2^{t - 1}}}.$$
(4)

While equilibrium is approached exponentially quickly, if allele frequencies initially differ by a substantial amount, the frequency difference between the sexes can be non-negligible in the first few generations.
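As a numerical illustration of Eqs. (1)–(4), the following Python sketch iterates the recursion from pf(0) = 1 and pm(0) = 0 and compares the result with the closed-form solution. The function names are illustrative and are not taken from any published software.

```python
def iterate_x_frequencies(p_f0, p_m0, generations):
    """Iterate Eqs. (1) and (2); return [(p_f, p_m)] for t = 0..generations."""
    trajectory = [(p_f0, p_m0)]
    for _ in range(generations):
        p_f, p_m = trajectory[-1]
        # Females average both parental pools; males copy the female pool.
        trajectory.append(((p_m + p_f) / 2, p_f))
    return trajectory


def closed_form(t):
    """Eqs. (3) and (4), valid only for p_f(0) = 1 and p_m(0) = 0."""
    p_f = (2 ** (t + 1) + (-1) ** t) / (3 * 2 ** t)
    p_m = (2 ** t - (-1) ** t) / (3 * 2 ** (t - 1))
    return p_f, p_m


trajectory = iterate_x_frequencies(1.0, 0.0, 10)
for t, (p_f, p_m) in enumerate(trajectory):
    cf_f, cf_m = closed_form(t)
    # The recursion and the closed form should agree up to rounding error.
    assert abs(p_f - cf_f) < 1e-12 and abs(p_m - cf_m) < 1e-12
    print(t, round(p_f, 4), round(p_m, 4), round(abs(p_f - p_m), 4))
```

Under the recursion, the absolute difference between the sexes halves every generation (1, 0.5, 0.25, and so on), so with these starting values a difference of 1/32 ≈ 0.03 remains after five generations.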

Recent sex-biased admixture has been identified for several populations, in particular in the Pacific and the Americas (Bonnen et al. 2010; Bryc et al. 2010; Jagadeesan et al. 2018; Kim et al. 2012; Lie et al. 2007; Lind et al. 2007; Mathias et al. 2016). Moreover, admixture in these populations has often been cross-continental, which may have led to large initial frequency differences between the sexes. Thus, even if a population has been randomly mating since admixture, and even if SNPs are accurately genotyped, we may expect natural frequency differences to exist for some X-linked SNPs, along with deviations from HWE in females. It would therefore be wrong to discard X SNPs due to an HWE violation if the violation can be explained as a natural result of sex-biased admixture.

In this work, we developed a likelihood ratio test and a χ2 test for deviations from expected genotype frequencies on X, while permitting natural sex differences in frequency due to sex-biased admixture. This is achieved by taking into account the constraints imposed by Eqs. (1) and (2) on sex-specific frequencies across generations. We show by simulations that our tests have the expected size under the null, as well as power at least as high as existing tests for true deviations from the null (e.g., due to genotyping errors or inbreeding). At the same time, our tests reject the null substantially less often than existing tests when HWE is violated due to historical sex-biased admixture in otherwise randomly mating populations. Finally, we investigate the application of our tests to human data from the 1000 Genomes project.

Methods

We denote the number of males and females in the sample as nm and nf, respectively, and the two alleles as A and B. The numbers of male A and B carriers are denoted mA and mB. The numbers of females with genotypes AA, AB, and BB are denoted fAA, fAB, and fBB. We denote by pm and pf the A allele frequencies in males and females, respectively.

We develop our likelihood ratio test based on the framework of You et al. (2015). These authors defined the inbreeding coefficient ρ to represent deviations from HWE. Using ρ, the expected genotype frequencies in females can be written as

$$p_{AA} = \left( {1 - \rho } \right)p_f^2 + \rho p_f = p_f^2 + \rho p_f(1 - p_f),$$
(5)
$$p_{AB} = 2p_f\left( {1 - p_f} \right)(1 - \rho ),$$
(6)
$$\begin{array}{{lll}}p_{BB} &=& \, \left( {1 - \rho } \right)\left( {1 - p_f} \right)^2 \,+\, \rho \left( {1 - p_f} \right)\\ &=& \left( {1 - p_f} \right)^2\, +\, \rho p_f\left( {1 - p_f} \right).\end{array}$$
(7)

The null hypothesis of no deviations from HWE and no frequency difference between males and females is pm = pf = p and ρ = 0. Here, we interpret the parameter ρ more generally as a measure of the deviation from random mating in females, either positive or negative, and note that it is constrained to values that result in all genotype frequencies being in [0, 1]. The alternative hypothesis is \(p_m \ne p_f\) or \(\rho \ne 0\). Denote the parameters of the model as θ = (pf, pm, ρ). The likelihood of observing the data (genotype counts) is multinomial,

$$L\left( \theta \right) = \left( {\begin{array}{*{20}{c}} {n_f} \\ {f_{AA},f_{AB},f_{BB}} \end{array}} \right)p_{AA}^{f_{AA}}p_{AB}^{f_{AB}}p_{BB}^{f_{BB}} \times \left( {\begin{array}{*{20}{c}} {n_m} \\ {m_A} \end{array}} \right)p_m^{m_A}\left( {1 - p_m} \right)^{m_B},$$
(8)

where pAA, pAB, and pBB are given by Eqs. (5–7), respectively. You et al. have proposed an expectation–maximization algorithm to obtain the maximum-likelihood estimates (MLE) \(\hat \theta = \left( {\hat p_f,\hat p_m,\hat \rho } \right)\).

Under the null hypothesis, pm = pf = p and ρ = 0, so θ0 = (p, p, 0), and the likelihood reduces to

$$L\left( {\theta _0} \right) = \left( {\begin{array}{*{20}{c}} {n_f} \\ {f_{AA},f_{AB},f_{BB}} \end{array}} \right)\left( {\begin{array}{*{20}{c}} {n_m} \\ {m_A} \end{array}} \right)2^{f_{AB}}p^{2f_{AA} + f_{AB} + m_A}\left( {1 - p} \right)^{2f_{BB} + f_{AB} + m_B}.$$
(9)

Here, the MLE is trivial, \(\hat \theta _0 = \left( {\hat p,\hat p,0} \right)\), where \(\hat p = (2f_{AA} + f_{AB} + m_A)/(2n_f + n_m)\). The likelihood ratio (LR) statistic is

$$LR = 2{\kern 1pt} {\mathrm{ln}}\frac{{L\left( {\hat \theta } \right)}}{{L\left( {\hat \theta _0} \right)}}.$$
(10)

The LR statistic is asymptotically distributed (under the null) as a χ2 distribution with two degrees of freedom, leading to a test we call LRTP (likelihood ratio test for panmictic populations).
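The LRTP statistic of Eq. (10) can be sketched in a few lines of Python. This sketch does not implement the expectation–maximization algorithm of You et al. Instead, it assumes that the unrestricted model is saturated: because (pf, ρ) can reproduce any female genotype frequencies via Eqs. (5–7) and pm is free, the maximized likelihood under the alternative is taken at the observed proportions. The function names are hypothetical.

```python
import math


def xlogy(n, p):
    """Return n * ln(p), using the convention 0 * ln(0) = 0."""
    return n * math.log(p) if n > 0 else 0.0


def lrtp(f_aa, f_ab, f_bb, m_a, m_b):
    """Sketch of the LR statistic of Eq. (10) and its chi-square (2 df) P-value."""
    n_f, n_m = f_aa + f_ab + f_bb, m_a + m_b
    # Null MLE: pooled allele frequency over female and male X chromosomes.
    p = (2 * f_aa + f_ab + m_a) / (2 * n_f + n_m)
    ll_null = (xlogy(f_aa, p * p) + xlogy(f_ab, 2 * p * (1 - p))
               + xlogy(f_bb, (1 - p) ** 2) + xlogy(m_a, p) + xlogy(m_b, 1 - p))
    # Alternative (assumed saturated): observed proportions maximize the likelihood.
    ll_alt = (xlogy(f_aa, f_aa / n_f) + xlogy(f_ab, f_ab / n_f)
              + xlogy(f_bb, f_bb / n_f) + xlogy(m_a, m_a / n_m)
              + xlogy(m_b, m_b / n_m))
    # The multinomial coefficients cancel in the ratio.
    lr = max(0.0, 2 * (ll_alt - ll_null))
    # Survival function of a chi-square distribution with 2 df is exp(-x / 2).
    return lr, math.exp(-lr / 2)


print(lrtp(25, 50, 25, 50, 50))   # counts in exact HWE with equal frequencies
print(lrtp(24, 62, 14, 80, 20))   # counts matching random mating after admixture
```

The second call uses counts that match random mating when the previous-generation frequencies were 0.3 in males and 0.8 in females; the statistic is nevertheless large, because both components of the LRTP null (equal frequencies and HWE in females) are violated.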

As explained above, the LRTP cannot accommodate “legitimate” frequency differences between the sexes due to sex-biased admixture. To address that, we reparametrize the model as follows. Instead of θ = (pf, pm, ρ), we write θ = (pf,g, pm,g, ρ), where pf,g and pm,g are the allele frequencies in females and males in the previous generation. With these parameters, the expected genotype frequencies in males in the current generation are

$$p_{A,m,c} = p_{f,g}$$
(11)
$$p_{B,m,c} = 1 - p_{f,g}.$$
(12)

This is analogous to Eq. (2), which is true because males receive X chromosomes only from females in the previous generation. Under random mating, the expected genotype frequencies in females in the current generation are

$$p_{AA,f,c} = p_{m,g}p_{f,g},$$
(13)
$$p_{AB,f,c} = p_{m,g}\left( {1 - p_{f,g}} \right) + p_{f,g}\left( {1 - p_{m,g}} \right),$$
(14)
$$p_{BB,f,c} = \left( {1 - p_{m,g}} \right)\left( {1 - p_{f,g}} \right).$$
(15)

The above expressions reflect the fact that females receive one X chromosome from males and one from females. To incorporate deviations from random mating, we use again the parameter ρ. Analogously to the case of panmictic populations, we write the expected genotype frequencies in females in the current generation as

$$p_{AA,f,c,\rho } = p_{m,g}p_{f,g} + \frac{\rho }{2}\left( {p_{m,g}\left( {1 - p_{f,g}} \right) + p_{f,g}\left( {1 - p_{m,g}} \right)} \right),$$
(16)
$$p_{AB,f,c,\rho } = \left( {p_{m,g}\left( {1 - p_{f,g}} \right) + p_{f,g}\left( {1 - p_{m,g}} \right)} \right)\left( {1 - \rho } \right),$$
(17)
$$p_{BB,f,c,\rho } = \left( {1 - p_{m,g}} \right)\left( {1 - p_{f,g}} \right) + \frac{\rho }{2}\left( {p_{m,g}\left( {1 - p_{f,g}} \right) + p_{f,g}\left( {1 - p_{m,g}} \right)} \right).$$
(18)

Note that the overall A allele frequency in females in the current generation is (for any ρ)

$$p_{f,c} = p_{AA,f,c} + \frac{{p_{AB,f,c}}}{2} = \frac{{p_{m,g} + p_{f,g}}}{2},$$
(19)

as expected based on Eq. (1). Note that here too, ρ is constrained to values such that pAA,f,c,ρ, pAB,f,c,ρ, and pBB,f,c,ρ are all in [0, 1]. Our null hypothesis is that given the allele frequencies in the previous generation (pf,g and pm,g), the genotypes of the current generation are determined by random mating, or ρ = 0. The alternative hypothesis is that there is a deviation from random mating, or \(\rho \ne 0\). The likelihood of the data under the most general θ is
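The genotype frequencies of Eqs. (16–18) and the identity of Eq. (19) can be checked numerically with the short Python sketch below. The helper `rho_bounds` is not given in the text; it is derived here by requiring each of the three frequencies to be non-negative, and it assumes that the heterozygote term is strictly positive.

```python
def female_genotype_freqs(p_m_g, p_f_g, rho=0.0):
    """Eqs. (16)-(18): expected female genotype frequencies (AA, AB, BB)."""
    het = p_m_g * (1 - p_f_g) + p_f_g * (1 - p_m_g)
    p_aa = p_m_g * p_f_g + 0.5 * rho * het
    p_ab = het * (1 - rho)
    p_bb = (1 - p_m_g) * (1 - p_f_g) + 0.5 * rho * het
    return p_aa, p_ab, p_bb


def rho_bounds(p_m_g, p_f_g):
    """Range of rho that keeps all three frequencies non-negative (derived here)."""
    het = p_m_g * (1 - p_f_g) + p_f_g * (1 - p_m_g)
    lower = -2 * min(p_m_g * p_f_g, (1 - p_m_g) * (1 - p_f_g)) / het
    return lower, 1.0


p_m_g, p_f_g = 0.3, 0.8
low, high = rho_bounds(p_m_g, p_f_g)
for rho in (low, -0.2, 0.0, 0.3, high):
    p_aa, p_ab, p_bb = female_genotype_freqs(p_m_g, p_f_g, rho)
    # Frequencies sum to one, and Eq. (19) holds for any admissible rho.
    assert abs(p_aa + p_ab + p_bb - 1) < 1e-12
    assert abs(p_aa + p_ab / 2 - (p_m_g + p_f_g) / 2) < 1e-12
    print(round(rho, 3), round(p_aa, 3), round(p_ab, 3), round(p_bb, 3))
```

The terms in ρ move probability mass symmetrically from AB to AA and BB, which is why the female allele frequency does not depend on ρ.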

$$L\left( \theta \right) = \left( {\begin{array}{*{20}{c}} {n_f} \\ {f_{AA},f_{AB},f_{BB}} \end{array}} \right)p_{AA,f,c,\rho }^{f_{AA}}p_{AB,f,c,\rho }^{f_{AB}}p_{BB,f,c,\rho }^{f_{BB}} \times \left( {\begin{array}{*{20}{c}} {n_m} \\ {m_A} \end{array}} \right)p_{A,m,c}^{m_A}p_{B,m,c}^{m_B},$$
(20)

where pA,m,c, pB,m,c, pAA,f,c,ρ, pAB,f,c,ρ, and pBB,f,c,ρ are defined by Eqs. (11), (12), and (16–18), respectively. The MLE \(\hat \theta = \left( {\hat p_{f,g},\hat p_{m,g},\hat \rho } \right)\) is obtained by taking the derivatives of (the logarithm of) L(θ) and equating them to zero. This results in a set of three equations, which are too tedious to reproduce here, and which can be solved numerically. In practice, we directly maximized the log-likelihood based on a grid search. (We discarded any parameter set \(\hat \theta\) leading to genotype frequencies in the current generation outside the range [0, 1] in Eqs. (16–18).)

In the case of random mating, ρ = 0, and thus the parameters are θ0 = (pf,g, pm,g, 0). The likelihood is

$$L\left( {\theta _0} \right) = \left( {\begin{array}{*{20}{c}} {n_f} \\ {f_{AA},f_{AB},f_{BB}} \end{array}} \right)p_{AA,f,c}^{f_{AA}}p_{AB,f,c}^{f_{AB}}p_{BB,f,c}^{f_{BB}} \times \left( {\begin{array}{*{20}{c}} {n_m} \\ {m_A} \end{array}} \right)p_{A,m,c}^{m_A}p_{B,m,c}^{m_B},$$
(21)

where pAA,f,c, pAB,f,c, and pBB,f,c are defined by Eqs. (13–15), respectively. Taking the derivatives of log L(θ0) with respect to pm,g and pf,g and equating to zero results in the following pair of equations,

$$\frac{{f_{AA}}}{{\hat p_{m,g}}} + \frac{{f_{AB}\left( {1 - 2\hat p_{f,g}} \right)}}{{\hat p_{m,g}\left( {1 - \hat p_{f,g}} \right) + \hat p_{f,g}\left( {1 - \hat p_{m,g}} \right)}} - \frac{{f_{BB}}}{{1 - \hat p_{m,g}}} = 0,$$
(22)
$$\frac{{f_{AA}}}{{\hat p_{f,g}}} + \frac{{f_{AB}\left( {1 - 2\hat p_{m,g}} \right)}}{{\hat p_{m,g}\left( {1 - \hat p_{f,g}} \right) + \hat p_{f,g}\left( {1 - \hat p_{m,g}} \right)}} - \frac{{f_{BB}}}{{1 - \hat p_{f,g}}} + \frac{{m_A}}{{\hat p_{f,g}}} - \frac{{m_B}}{{1 - \hat p_{f,g}}} = 0.$$
(23)

The solution of these equations yields the MLE under the null, \(\hat \theta _0 = \left( {\hat p_{f,g},\hat p_{m,g},0} \right)\). Here too, in practice we used a grid search to directly maximize the log-likelihood.

The likelihood ratio is then, as in Eq. (10),

$$LR = 2{\kern 1pt} {\mathrm{ln}}\frac{{L\left( {\hat \theta } \right)}}{{L\left( {\hat \theta _0} \right)}}.$$
(24)

Under the null, the LR is asymptotically distributed as χ2 with one degree of freedom, leading to a test we call LRTA (for admixture).
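The grid-search maximization and the resulting LRTA statistic of Eq. (24) might be sketched as follows in Python. The grid resolution, the search range of [−1, 1] for ρ, and all function names are assumptions of this sketch; it is not the implementation of the HWadmiX package.

```python
import math


def female_freqs(p_m_g, p_f_g, rho=0.0):
    """Eqs. (16)-(18); rho = 0 recovers Eqs. (13)-(15)."""
    het = p_m_g * (1 - p_f_g) + p_f_g * (1 - p_m_g)
    return (p_m_g * p_f_g + 0.5 * rho * het,
            het * (1 - rho),
            (1 - p_m_g) * (1 - p_f_g) + 0.5 * rho * het)


def log_lik(counts, p_m_g, p_f_g, rho=0.0):
    """Log of Eq. (20) without the constant multinomial coefficients."""
    probs = female_freqs(p_m_g, p_f_g, rho) + (p_f_g, 1 - p_f_g)
    if min(probs) < -1e-12:          # discard inadmissible parameter sets
        return -math.inf
    total = 0.0
    for n, p in zip(counts, probs):
        if n > 0:
            if p <= 0:
                return -math.inf
            total += n * math.log(p)
    return total


def lrta(counts, n_grid=50, n_rho=25):
    """counts = (f_AA, f_AB, f_BB, m_A, m_B); returns (LR, P-value with 1 df)."""
    freqs = [i / n_grid for i in range(n_grid + 1)]
    rhos = [(j - n_rho) / n_rho for j in range(2 * n_rho + 1)]  # includes 0
    ll_null = max(log_lik(counts, pm, pf) for pm in freqs for pf in freqs)
    ll_full = max(log_lik(counts, pm, pf, r)
                  for pm in freqs for pf in freqs for r in rhos)
    lr = max(0.0, 2 * (ll_full - ll_null))
    # Survival function of a chi-square distribution with 1 df.
    return lr, math.erfc(math.sqrt(lr / 2))


lr, p_value = lrta((24, 62, 14, 80, 20))
print(lr, p_value)
```

The example counts equal the expected counts for 100 females and 100 males under random mating with previous-generation frequencies of 0.3 in males and 0.8 in females, so the null fits as well as the full model. A finer grid, or a numerical optimizer started from the best grid point, would be needed for accurate P-values in practice.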

Finally, we also propose a new χ2 test, analogous to the χ2 test proposed in (Graffelman and Weir 2016). Suppose we have used Eqs. (22) and (23) to obtain the MLE \(\left( {\hat p_{f,g},\hat p_{m,g}} \right)\). The expected values for the genotypes of males and females under the null (ρ = 0) are

$$\begin{array}{*{20}{l}} {{\mathrm{Expected}}\,f_{AA},f_{AB},f_{BB},m_A,m_B} \hfill \\ {\begin{array}{*{20}{l}} {} \hfill & = \hfill & {n_f\hat p_{m,g}\hat p_{f,g},n_f\left( {\hat p_{m,g}\left( {1 - \hat p_{f,g}} \right) + \hat p_{f,g}\left( {1 - \hat p_{m,g}} \right)} \right),} \hfill \\ {} \hfill & {} \hfill & {n_f\left( {1 - \hat p_{m,g}} \right)\left( {1 - \hat p_{f,g}} \right),n_m\hat p_{f,g},n_m\left( {1 - \hat p_{f,g}} \right)} \hfill \end{array}} \hfill \end{array}$$
(25)

Then, given the observed values of fAA, fAB, fBB, mA, mB, a standard χ2 statistic can be calculated, which would be asymptotically distributed as χ2 with one degree of freedom. We call this test χ2-ML.
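As an illustration of Eq. (25), the Python sketch below computes the χ2 statistic from a set of counts and a pair of previous-generation frequency estimates. The estimates are passed in as arguments; for the statistic to correspond to the χ2-ML test they would have to be the MLE under the null, which this sketch does not compute. The function name is hypothetical.

```python
import math


def chi2_from_estimates(counts, p_f_g, p_m_g):
    """counts = (f_AA, f_AB, f_BB, m_A, m_B); returns (statistic, P-value, 1 df).

    Assumes both estimates lie strictly inside (0, 1), so that no expected
    count is zero.
    """
    f_aa, f_ab, f_bb, m_a, m_b = counts
    n_f, n_m = f_aa + f_ab + f_bb, m_a + m_b
    # Expected counts under rho = 0, as in Eq. (25).
    expected = (
        n_f * p_m_g * p_f_g,
        n_f * (p_m_g * (1 - p_f_g) + p_f_g * (1 - p_m_g)),
        n_f * (1 - p_m_g) * (1 - p_f_g),
        n_m * p_f_g,
        n_m * (1 - p_f_g),
    )
    stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    # Survival function of a chi-square distribution with 1 df.
    return stat, math.erfc(math.sqrt(stat / 2))


stat, p_value = chi2_from_estimates((30, 50, 20, 80, 20), p_f_g=0.8, p_m_g=0.3)
print(round(stat, 3), p_value)
```

The single degree of freedom follows from counting: the five cells carry three free counts (two for females, one for males), and two parameters are estimated.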

We also note that instead of the MLE \(\hat p_{f,g}\) and \(\hat p_{m,g}\), we could use a method of moments estimator, based on isolating pf(t) and pm(t) from Eqs. (1) and (2),

$$\hat p_{m,g} = 2p_f - p_m = \frac{{2\left( {2f_{AA} + f_{AB}} \right)}}{{2n_f}} - \frac{{m_A}}{{n_m}},$$
(26)
$$\hat p_{f,g} = p_m = \frac{m_A}{{n_m}}.$$
(27)

These estimates can then be substituted in Eq. (25), and a χ2 statistic can be calculated. We call this test χ2-MM. In practice, we found that the χ2-MM did not appropriately control the type I error rate (Fig. 1), and we do not report further experiments with that test.
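The method of moments estimators of Eqs. (26) and (27) are simple enough to write out directly. The Python sketch below uses a hypothetical function name and illustrative counts.

```python
def moment_estimates(f_aa, f_ab, f_bb, m_a, m_b):
    """Eqs. (26) and (27): returns (p_m_g_hat, p_f_g_hat)."""
    n_f, n_m = f_aa + f_ab + f_bb, m_a + m_b
    p_f_current = (2 * f_aa + f_ab) / (2 * n_f)   # observed female frequency
    p_m_current = m_a / n_m                       # observed male frequency
    p_m_g_hat = 2 * p_f_current - p_m_current     # Eq. (26)
    p_f_g_hat = p_m_current                       # Eq. (27)
    return p_m_g_hat, p_f_g_hat


print(moment_estimates(24, 62, 14, 80, 20))
print(moment_estimates(10, 20, 70, 80, 20))
```

For the first set of counts the estimates are approximately 0.3 for males and 0.8 for females in the previous generation. The second set shows that the estimate of pm,g is not guaranteed to lie in [0, 1]: the observed female frequency is 0.2 and the male frequency is 0.8, which gives 2 × 0.2 − 0.8 = −0.4.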

Fig. 1

The proportion of rejections (type I error rate, or size) under random mating in a sex-biased admixed population. We compared the LRTP test (You et al. 2015), the GWET exact test (Graffelman and Weir, 2016), and the LRTA, χ2-ML, and χ2-MM tests developed in this paper. Our significance level was α = 0.05 (horizontal black line). a The size of the tests versus the number of generations since admixture. The initial allele frequencies were 80% in females and 30% in males. b The size of the tests versus the initial allele frequency (AF) difference between males and females. The number of generations since admixture was three

For comparison, we also considered the exact test of Graffelman and Weir for the X chromosome (Graffelman and Weir 2016), which we denote GWET. The null hypothesis of that test includes both equality of allele frequencies between the sexes and HWE for the genotypes in females.

Results

Simulations

We carried out several simulations to examine the behavior of our tests as compared with existing tests for deviations from HWE on X. We considered scenarios under our tests’ null hypothesis, as well as under a number of alternative hypotheses.

Our first set of simulations was designed to examine all tests under the null hypothesis of our tests, namely sex-biased admixture with random mating thereafter. We started with a population of 400 males and 400 females, and a single locus with an initial allele frequency of 80% in females and 30% in males. Given the allele frequencies in each generation, we calculated the expected genotype frequencies in the subsequent generation based on Eqs. (11–15). Then, the genotypes of 400 males and 400 females were drawn based on multinomial distributions having these expected frequencies. We repeated the process up to ten generations after admixture, and repeated the simulation 1000 times.
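One possible reading of this simulation procedure is sketched below in Python. It assumes that the allele frequencies passed to the next generation are the realized frequencies in the sampled genotypes, which the text does not state explicitly; the function names and the seed are illustrative.

```python
import random
from collections import Counter


def simulate_generation(p_m_g, p_f_g, n_f, n_m, rng):
    """Draw genotype counts for one generation under random mating, Eqs. (11)-(15)."""
    probs = [
        p_m_g * p_f_g,
        p_m_g * (1 - p_f_g) + p_f_g * (1 - p_m_g),
        (1 - p_m_g) * (1 - p_f_g),
    ]
    females = Counter(rng.choices(["AA", "AB", "BB"], weights=probs, k=n_f))
    # Males carry a single X chromosome inherited from the female pool.
    m_a = sum(rng.random() < p_f_g for _ in range(n_m))
    return females["AA"], females["AB"], females["BB"], m_a, n_m - m_a


def simulate_history(p_f0, p_m0, generations, n_f=400, n_m=400, seed=0):
    """Return a list of (f_AA, f_AB, f_BB, m_A, m_B), one tuple per generation."""
    rng = random.Random(seed)
    p_f, p_m = p_f0, p_m0
    history = []
    for _ in range(generations):
        counts = simulate_generation(p_m, p_f, n_f, n_m, rng)
        history.append(counts)
        f_aa, f_ab, _f_bb, m_a, _m_b = counts
        # Assumption: realized sample frequencies seed the next generation.
        p_f = (2 * f_aa + f_ab) / (2 * n_f)
        p_m = m_a / n_m
    return history


history = simulate_history(0.8, 0.3, generations=10)
for t, counts in enumerate(history, start=1):
    print(t, counts)
```

Each tuple in `history` can then be passed to the tests under comparison, and repeating the whole procedure with different seeds gives the proportion of rejections per generation.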

In Fig. 1a, we show the proportion of rejections (type I error rate, or size) when running five tests on the above genotype counts: the LRTP test of You et al. (2015) and the GWET (exact test) of Graffelman and Weir (2016), both of which test for departures from either HWE in females or equality of allele frequencies between males and females; and the LRTA, χ2-ML, and χ2-MM tests we have developed here for sex-biased admixed populations (see the Methods section). Our LRTA test and the χ2-ML test had an appropriate type I error rate (equal or close to the significance level α = 0.05), which is expected, because we simulated random mating post admixture. In contrast, the LRTP and GWET tests had much higher proportions of rejections, as expected due to the frequency differences between the sexes, which these tests are designed to detect. For these tests, the type I error rate decreased to a value close to the appropriate value under the null (0.05) after about 5–6 generations post admixture. The χ2-MM test did not control the type I error rate as well as the LRTA and χ2-ML tests, possibly because the parameters (allele frequencies in the preceding generation) were not accurately estimated. We thus do not further consider this test.

In Fig. 1b, we plot the type I error rate versus the initial allele frequency difference between the sexes, for a sample taken three generations post admixture. Again, our LRTA and χ2-ML tests control the type I error rate at its appropriate value (0.05), while the LRTP and GWET tests have a higher proportion of rejections, which grows, as expected, with the initial allele frequency difference between the sexes.

Our second simulation was designed to examine the power of the various tests under the alternative hypothesis of non-random mating. We considered one locus with an allele frequency of 80% in both males and females. We then calculated the expected genotype frequencies under one generation of mating, but this time with an inbreeding coefficient ρ in the range [−0.2, 0.3], and simulated genotype frequencies in 400 females and 400 males based on the multinomial distribution with probabilities defined by Eqs. (11), (12), and (16–18). This simulation did not include sex-biased admixture, as the goal was to evaluate the power of our test under non-random mating, regardless of a history of admixture. We present the power of the various tests (at the 0.05 significance level and over 1000 repeats) in Fig. 2a. The power of the χ2-ML test is always the highest, followed closely by the LRTA test. The power of the LRTP and GWET tests is slightly lower. In Fig. 2b, c, we considered the cases ρ = −0.2 (panel b) and ρ = 0.2 (panel c), for a varying allele frequency (equal between the sexes). The results are similar, again with our LRTA and χ2-ML tests achieving the highest power, followed closely by the existing LRTP and GWET tests.

Fig. 2

The proportion of rejections (power) of the various tests under non-random mating (without admixture). a Different values of the inbreeding coefficient ρ, for a fixed allele frequency (0.8, equal between males and females). b, c Different values of the allele frequency for a fixed inbreeding coefficient ρ (b: −0.2; c: 0.2). For ρ = −0.2, some values of the allele frequencies are missing as they would lead to genotype frequencies outside [0, 1]

Our third set of simulations was designed to validate that the LRTA and χ2-ML tests are powerful also under sex-biased admixture. We used the same approach as in our first simulation (Fig. 1), i.e., sex-biased admixture followed by random mating, except that after one generation, non-random mating was assumed with an inbreeding coefficient in the range [−0.2, 0.3]. The initial male allele frequency was 0.3, and the initial female allele frequency was 0.4 or 0.8. We plot in Fig. 3 the power of the various tests at the 0.05 significance level and over 50 repeats. The power of our LRTA and χ2-ML tests is almost unaffected by the historical admixture event (compare Fig. 2a). When the initial allele frequency difference between the sexes is large (panel b) and for small |ρ|, the LRTP and GWET tests have a much higher power than our tests, as expected. This is because these tests are identifying (as they are expected to) the allele frequency differences between the sexes and the deviations from HWE in females that were generated by the sex-biased admixture. Our tests reject the null only for larger values of |ρ|, which is the correct and expected behavior, given that our tests do not reject the null when the deviations from HWE or equal allele frequency are due to sex-biased admixture.

Fig. 3

The proportion of rejections (power) of the various tests under non-random mating and sex-biased admixture. The power is plotted for different values of the inbreeding coefficient ρ, for a given initial male allele frequency of 0.3. The initial female allele frequency is 0.4 in (a) and 0.8 in (b). The power of our LRTA and χ2-ML tests is similar to that reported in Fig. 2 for the case without admixture, with a slight drop in power for a large initial allele frequency difference between the sexes (panel b). In that regime, the power of the LRTP and GWET tests is high for all values of ρ, because these tests are identifying frequency differences between the sexes and deviations from HWE in females. However, as these deviations are generated (for small |ρ|) by sex-biased admixture, they are correctly ignored by our methods, which as expected have higher power with increasing |ρ|

In our last set of simulations, we sought to determine whether our tests can reject the null under deviations other than nonzero inbreeding coefficients. To this end, we simulated genotype counts for 100 males and 100 females under an allele frequency of 0.5, equal between the sexes. We then flipped each female AA genotype to AB with probability q, and compared the power of the various tests at increasing levels of the genotype flipping probability q. The results (Fig. 4a) demonstrate that our LRTA and χ2-ML tests are powered to detect these deviations, although less so than the existing tests. In contrast, when simulating 200 males and 200 females (Fig. 4b), the power of our tests was comparable with the power of the existing tests.
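The genotype-flipping scheme can be sketched as follows in Python. The text does not specify how the initial genotypes were drawn; this sketch assumes HWE in females (two independent allele draws per female), and the function name and seed are illustrative.

```python
import random


def simulate_with_flips(n_f, n_m, p, q, rng):
    """Draw counts (f_AA, f_AB, f_BB, m_A, m_B), flipping each AA to AB w.p. q."""
    f_aa = f_ab = f_bb = 0
    for _ in range(n_f):
        # Assumption: female genotypes start in HWE at allele frequency p.
        n_a = (rng.random() < p) + (rng.random() < p)
        if n_a == 2 and rng.random() < q:
            n_a = 1                       # genotyping error: AA recorded as AB
        if n_a == 2:
            f_aa += 1
        elif n_a == 1:
            f_ab += 1
        else:
            f_bb += 1
    m_a = sum(rng.random() < p for _ in range(n_m))
    return f_aa, f_ab, f_bb, m_a, n_m - m_a


rng = random.Random(0)
for q in (0.0, 0.25, 0.5, 1.0):
    print(q, simulate_with_flips(100, 100, 0.5, q, rng))
```

Flipping AA to AB creates an excess of heterozygotes and lowers the apparent A allele frequency in females, so it perturbs both the HWE proportions and the frequency difference between the sexes.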

Fig. 4

The power of the various tests to detect genotyping errors. a We simulated genotype counts for 100 males and 100 females under allele frequency 0.5 in both sexes. We then randomly flipped each female AA genotype to AB with probability q (the x-axis), and computed the power of the various tests over 100 repeats. Panel (b) is the same as (a), for 200 males and 200 females

Empirical data analysis

To examine our tests on real data, we applied them to genomes from the 1000 Genomes project (1000 Genomes Project Consortium et al. 2015) (1 kG). We selected American populations with a history of sex-biased, cross-continental recent admixture. While admixture in these populations mostly ended 5–10 generations ago (e.g., Baharian et al. 2016; Gravel 2012; Gravel et al. 2013; Moreno-Estrada et al. 2013), some SNPs may not yet have reached equilibrium, or may have been affected by more recent minor gene flow events. Our goal was to determine whether our tests reject fewer SNPs due to deviations from expected genotype frequencies. Indeed (Table 1), our LRTA test rejected the lowest proportion of SNPs in nearly all populations. The LRTP and GWET tests showed a similar, higher proportion of rejected SNPs, while our χ2-ML test rejected the largest proportion of SNPs in all populations.

Table 1 The proportion of rejected SNPs (at α = 0.05) under the LRTP, GWET, LRTA, and χ2 -ML tests in 1 kG American populations

To demonstrate the underlying reasons for the lower rejection rates, we considered the following two examples from the ACB population: SNP 1 at coordinate 116399355 (hg19), and SNP 2 at coordinate 78068904. The genotype counts, allele frequencies, and P-values are presented in Table 2. For both SNPs, a substantial allele frequency difference exists between the sexes, along with deviation from HWE in females, and hence the null is rejected by LRTP (as well as by GWET; not shown). In contrast, the null is rejected neither by our LRTA nor by the χ2-ML test (not shown). To see why, consider the rightmost seven columns of the table, where the estimated allele frequencies in the previous generation are shown, along with the expected genotype counts in the current generation (assuming random mating). In both cases, the expected genotype counts are similar to the observed ones, thus not providing a reason to reject the null hypothesis of random mating. However, we note that in these two examples, the implied allele frequency difference between the sexes in the previous generation is relatively high, at 0.32 and 0.43 in SNPs 1 and 2, respectively, and the plausibility of such a large natural difference should be carefully considered by users of the tests. A similar picture was observed in other SNPs, where often, SNPs were rejected by the existing tests, but not by our LRTA or χ2-ML tests, on the basis of a very large allele frequency difference between the sexes in the previous generation (see the Discussion section).

Table 2 Investigating HWE for two X chromosome SNPs

Discussion

In this paper, we proposed new tests for deviations from expected genotype frequencies on the X chromosome for sex-biased admixed populations. The X chromosome is unique in that allele frequencies do not reach equilibrium within one generation after perturbation, even when the population is otherwise randomly mating and all genotypes are observed without error. Thus, the X chromosome requires a specialized test, even beyond accounting for the different ploidies of the sexes. Here, we proposed new likelihood ratio and χ2 tests to address this gap. We showed that our tests have the expected size (type I error rate) under sex-biased admixture followed by random mating, whereas other tests have high error rates, in particular when admixture was very recent (Fig. 1). In addition, our tests had equal or higher power compared with the other tests to detect nonzero inbreeding coefficients (Fig. 2). Our tests were powered, although less than existing tests, to detect nonzero inbreeding coefficients under strong sex-biased admixture (Fig. 3), as well as errors introduced by directionally flipping one of the genotypes for small samples (Fig. 4).

We demonstrated that our tests rejected fewer X chromosome SNPs in American 1000 Genomes populations. However, there are a number of important caveats. One issue is that admixture in these populations likely ended 5–10 generations ago. Considering Eqs. (3) and (4) (for the case when the allele is present in all female founders but in none of the male founders), the difference in allele frequency between males and females is just ≈3% after five generations. This difference is already smaller than the standard deviation of the difference in allele frequency between 200 males and 200 females (≈4.7%; assuming true frequency 2/3 in both sexes, which is the equilibrium solution of Eqs. (3) and (4)). Thus, it cannot be established which SNPs have different allele frequency between the sexes due to sex-biased admixture, as opposed to other causes such as sampling noise or subtle ethnicity differences between males and females in the sample.

The second (and related) issue is that for most SNPs that were rejected by the existing methods but not by the tests presented here, the implied allele frequency difference between males and females in the previous generation was very high. Large differences in allele frequencies in the previous generation are permitted by our tests, as frequencies could indeed widely differ between males and females ≈1–2 generations post admixture. However, in human populations, sex-biased admixture was not as recent, and thus, the relevance of the test to human genetic data remains at this point unclear. Another limitation is the relatively low power of our test to detect genotype “flipping” for small sample sizes.

Finally, we also note that when applying our tests to genomic data, care must also be taken if the downstream application is an association test that relies on HWE, such as the alleles test (Laird and Lange 2011), because SNPs may pass our tests even if females are not in HWE.

Our tests are available as an R package called HWadmiX, at https://github.com/dbackenroth/HWadmiX. The running time of computing both the LRTA and χ2-ML P-values was on average ≈1.1 s per SNP for the 49 females and 47 males of the ACB population, on a computer with a 2.5 GHz Intel Core i7 CPU and 16 GB RAM. To reduce the running time when testing all X chromosome 1000 Genomes Project variants, we pre-calculated the P-values for all combinations of observed genotype counts. The number of combinations to pre-calculate is at most \(O\left( {n_f^2n_m} \right)\), which is smaller than the number of variants for the 1000 Genomes sample sizes.
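The size of the pre-calculation table can be made concrete with a short count. The Python sketch below assumes that every variant is genotyped in all individuals, so that nf and nm are fixed; the function name is illustrative.

```python
from math import comb


def n_count_combinations(n_f, n_m):
    """Number of distinct (f_AA, f_AB, f_BB, m_A, m_B) tables for fixed n_f, n_m."""
    female_tables = comb(n_f + 2, 2)   # ways to split n_f females into AA, AB, BB
    male_tables = n_m + 1              # m_A can be 0, 1, ..., n_m
    return female_tables * male_tables


print(n_count_combinations(49, 47))   # ACB sample sizes -> 61200
```

The count equals (nf + 1)(nf + 2)(nm + 1)/2, which is of order nf² nm, in line with the bound given above.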

Avenues for future research include replacing the grid search for the maximum-likelihood estimate by a faster method, as well as the development of tests for multiple alleles or Bayesian tests. Specifically, it would be of interest to impose a prior distribution on the allele frequency difference between females and males in the previous generation, which could make the test more appropriate to human populations, where sex-biased admixture occurred centuries ago. Finally, exact tests for HWE are in widespread use (Purcell et al. 2007) and were also developed for HWE on X (Graffelman and Weir 2016). Developing an exact test for sex-biased admixed populations will be of interest, as these tests better control the type I error rate (Wigginton et al. 2005). However, it does not seem that the approach of Graffelman and Weir (2016) can be easily extended to admixed populations.