Introduction

Admixture is an omnipresent evolutionary force in complex disease genetics of recently admixed populations. Admixture mapping locates genomic segments that harbor causal alleles with distinct ancestral frequencies through admixture linkage disequilibrium (ALD). It has been successfully applied to locate genetic variants for a range of diseases and traits, e.g., hypertension1,2, type 2 diabetes3,4, obesity5,6 and Alzheimer’s dementia7. Systematic reviews of admixture mapping approaches can be found in the literature8,9,10. Genome-wide association studies (GWASs) have proven successful in identifying individual common genetic variants associated with common diseases and traits11. In the era of GWASs, admixture mapping has become a useful compliment for identifying common variant associations12. Several hybrid methods for combining the genotype and allele ancestry at a single-nucleotide polymorphism (SNP) have been developed13,14,15. These methods may claim variants which are in ALD with the true causal variants but not associated with the phenotype in any ancestral population. ALD may extend for substantial distances16,17,18. Qin and Zhu12 proposed a two-stage fine mapping method to first identify candidate local genomic segments and then identify individual variants responsible for the admixture mapping evidence.

The common variants identified in GWASs merely explain a small proportion of the heritability19, leading to many explanations of the ‘missing’ heritability20,21,22,23. A potential source of the missing heritability is the contribution of rare variants24,25. Evidenced by deep sequencing studies26,27,28, rare variants may have stronger effects on complex diseases than do common variants. Multiple methods have been developed for identifying rare variant associations. Collapsing methods, e.g., the CAST29 and the CMC30, utilize the number of rare alleles in a gene for each individual to enrich association information. The SDWSS31 scales SNPs in a test set by their minor allele frequencies in unaffected individuals. It utilizes a Wilcoxon type statistic to aggregate information and assesses the significance by permutation. The VT method32 utilizes the maximum of the test statistics over all allele-frequency thresholds. All these methods implicitly assume that all effects have an identical direction. To combine the effects of opposite directions, the data-adaptive sum test33 incorporates the signs of the observed effects into the CAST, whereas the C-alpha method34 and the SKAT35 test for genetic variance component. In particular, two methods have been proposed to combine the effects of different sizes and opposite directions. The ORWSS36 scales SNP wise numbers of minor alleles by the logarithms of amended odds ratios in the 2 × 2 tables of disease status by allele states. The EREC method37 scales SNP wise numbers of minor alleles by the estimated regression coefficients. When families are available, incorporating linkage evidence in rare variants analysis has also been developed38,39,40,41.

However, these existing methods may be underpowered for identifying rare variant associations in admixed populations, because they do not explicitly exploit the association information conveyed by local ancestries, particularly, ancestry-gene interaction. An admixed population and its ancestral populations often bear different burdens of complex diseases, partially due to the ancestral discrepancies in causal alleles, allele frequencies, and effects. Within a chromosomal block harboring causal alleles, an affected admixed individual may have an increased probability of inheriting alleles from the ancestry population of higher disease prevalence12,42. Common variants with different ancestral frequencies are correlated to their ancestries43,44 which provide non-redundant association information12,13,14,15. We hypothesize that this argument holds for rare variants. Single rare variant association testing has unacceptably limited power since only a small portion of study individuals carry the rare allele.

In this report, we will illustrate the utility of explicitly exploiting local ancestries and genotypes together for rare variant associations. For simplicity, we aim to identify stable ancestral blocks harboring rare variants. For each person, all the SNPs within such a block share an identical ancestry. We propose a heuristic local ancestry boosted sum test (LABST). In a stable ancestral block, our test statistic combines the sum of SNP wise numbers of rare mutations and the ancestry-gene interaction. We mathematically prove that the LABST statistic asymptotically follows a chi-square distribution with 1-df if the test block is not associated with the disease. In extensive simulations, our LABST appropriately controlled type I error rates at preset nominal levels, indicating the ideal accuracy of the asymptotic approximation. Under various multiple rare variant disease modes with a large range of relative risks, our LABST were uniformly more powerful than the benchmark CAST as well as the sophisticated SDWSS and ORWSS. The LABST is a heuristic method designed for unrelated cases and controls. It can be extended to incorporate informative weights, to accommodate covariates and to allow for multiple groups of rare variants.

Methods

In an admixed population of two ancestral populations, let a test chromosomal block contain \(L\) rare variants, i.e., the minor allele frequencies (MAFs) < 2%45. For n unrelated individuals from the admixed population, let yi be disease status of individual i (yi = 1, if individual i is affected; =0, if unaffected), G1 = {i: yi = 1} and G0 = {i: yi = 0} be the index sets of affected and unaffected individuals, respectively. Let gij denote the number of minor alleles carried by individual i at SNP j, \({s}_{i}=\sum _{j=1}^{L}\,{g}_{ij},\) and s = [s1, …, sn]. Let the test block be stable in terms of variant wise ancestries. In other words, we assume that each block wide haplotype of each individual is inherited entirely from one of the two ancestral populations without any ancestry crossover points. Under such an assumption, all the m SNPs within the block share identical ancestry. We define a = [a1, …, an], where ai denotes the number of ancestries on individual i inherited from the ancestral population of the higher disease prevalence (due to the larger risk haplotype frequency). Let α be the nominal significance level of a test for block-based associations.

The proposed LABST

For each individual i, we define ui = (1 + ai)si to combine ancestry-gene interaction aisi with si. We define a Welch type t statistic

$${W}_{e}=\frac{{\bar{u}}_{1}-{\bar{u}}_{0}}{\sqrt{{\hat{\sigma }}_{1}^{2}/{n}_{1}+{\hat{\sigma }}_{0}^{2}/{n}_{0}}},$$
(1)

where \({\bar{u}}_{1}=\sum _{i\in {G}_{1}}\,{u}_{i}/{n}_{1}\), \({\hat{\sigma }}_{1}^{2}=\sum _{i\in {G}_{1}}\,{u}_{i}^{2}/{n}_{1}-{\bar{u}}_{1}^{2}\), and \({\bar{u}}_{0}^{2}\) and \({\hat{\sigma }}_{0}^{2}\) are likewise defined using the u-scores of n0 unaffected individuals. We use \({W}_{e}^{2}\) to measure the association between the test block and the disease status. Let n1/n0 converge to a finite positive constant τ when both n0 and n1 increase, e.g., τ = 1 if n1 = n0. If the test block is not associated with the disease status, then the statistic \({W}_{e}^{2}\) converges in distribution to \(\xi \sim {\chi }_{1}^{2}\), the chi-square distribution with 1-df (see Appendix A for a mathematical proof). Thus, we compute the p-value of \({W}_{e}^{2}\) as \(P\mathop{=}\limits^{{\rm{def}}}{\rm{\Pr }}(\xi > {W}_{e}^{2})\) and claim significance when P < α, the preset nominal significance level.

Existing methods

Most existing rare variant association methods exploit genotypes without explicitly capitalizing on ancestry-gene interactions. In such methods, the genotypic score of individual i is defined as \({x}_{i}=\sum _{j=1}^{L}\,{w}_{j}{g}_{ij},\) where wj is a SNP wise weight. The simplest weight is wj ≡ 1 for all the L SNPs as in the CAST29. For this universal weight, xi collapses to si, and a benchmark 1-df statistic is constructed by replacing ui’s in our LABST with the si’s.

The SDWSS31 weighs a SNP using its minor allele frequency in unaffected individuals among whole-sample individuals. At the jth SNP, \({w}_{j}=1/\sqrt{n{q}_{j}(1-{q}_{j})},\) where qj = (1 + mj)/(2 + 2n0), and mj is the number of minor alleles at the SNP over the n0 unaffected individuals. Whole-sample individuals are ranked according to the xi scores and a rank sum \(x=\sum _{i\in {G}_{1}}\,{\rm{rank}}({x}_{i})\) is defined. Let \({x}_{1}^{\ast }\), …, \(\,{x}_{k}^{\ast }\) be the rank sums based on k(=1,000) permutations of disease status, \(\hat{\mu }\) and \(\hat{\sigma }\), be their mean and standard deviation, respectively. The standardized score is defined as z = (x\(\hat{\mu }\))/\(\hat{\sigma }\) and the P value of z is computed according to the standard normal distribution.

The weighting scheme in the SDWSS favors the disease-associated mutations with very low frequencies. As acknowledged by its authors, however, this scheme may reduce the power to detect the disease-associated mutations with higher frequencies. The SDWSS is based on the implicit assumption32 that \(\mathrm{log}({{\rm{OR}}}_{j})\propto 1/\sqrt{{q}_{0j}(1-{q}_{0j})}\), where ORj is the odds ratio in the 2 × 2 table of disease status by the allele at SNP j, and q0j is the MAF in the controls. Thus, Feng et al.36 proposed the ORWSS to jointly analyze rare and common variants. This method keeps all the steps of the SDWSS except for the weighting scheme. In the ORWSS, a SNP is weighted by the logarithm of the amended odds ratio46 in the 2 × 2 table of allele by disease status. The amended odds ratio proves a useful remedy for handling potential empty cells in SNP-wise tables.

Type I error rate inflation factor

Often, a conservative method tends more likely to miss true associations whereas a liberal method tends more likely to claim false positives. A valid powerful method should accurately control the type I error rate at each preset nominal level. Herein, we propose and use type I error rate inflation factor (TIERIF) to measure how accurately a method controls type I error rate. For a given nominal level α, we define the TIERIF of a method as γα = τα/α, where τα is the probability that the method rejects the null hypothesis. If γα = 1, then the method is able to controls type I error rate at the given nominal level α. If γα is substantially smaller than 1, then the method is overly conservative. If γα is substantially larger than 1, then the method is overly liberal.

Usually, it is intractable to mathematically formulate the TIERIF of a sophisticated method. In addition, it is hard to tell what a TIERIF is unacceptably ‘small’ or ‘large’. Herein, we propose an empirical method to estimate this quantity and tell how small (large) is too small (large). Specifically, we define \({\hat{\gamma }}_{\alpha }={\hat{\tau }}_{\alpha }/\alpha \) as an estimator of γα, where \(\,{\hat{\tau }}_{\alpha }\) is the frequency that the method claims significance over R simulation replications generated under the null hypothesis of no association. As R increases, \({\hat{\gamma }}_{\alpha }\) converges in probability to γα, and \(\sqrt{R\alpha }({\hat{\gamma }}_{\alpha }-{\gamma }_{\alpha })/\sqrt{1-\alpha }\) converges in distribution to a standard normal variable (see Appendix B for a mathematical proof). Therefore, if the method properly controls type I error rate at α, then \({\hat{\gamma }}_{\alpha }\) concentrates with probability 95% between

$$L{B}_{\alpha }=1-1.96\sqrt{(1-\alpha )/(R\alpha )}$$
(2)

and

$$U{B}_{\alpha }=1+1.96\sqrt{(1-\alpha )/(R\alpha )}.$$
(3)

Under the null of no genetic association, the concentration interval [LBα, UBα] is the shortest among all the intervals [LB, UB] such that \(\mathop{\mathrm{lim}}\limits_{R\to \infty }{{\rm{\Pr }}}_{0}(LB\le {\hat{\gamma }}_{\alpha }\le UB)=0.95\) (Appendix B). A method is called to be overly conservative if \({\hat{\gamma }}_{\alpha } < L{B}_{\alpha }\). Likewise, a method is called to be overly liberal if \({\hat{\gamma }}_{\alpha } > U{B}_{\alpha }\).

Simulation Designs

For method comparisons, we simulated an admixture using the rare variants with the frequency-spectrums of two natural populations. In the simulated admixture, block-wide haplotypes were inherited from the two ancestry populations. Four disease genetic modes were considered, including the dominant, additive, recessive, and multiplicative modes. Under each disease genetic mode, the disease status of an admixed individual was determined by the penetrance conditioning on block-wide risk haplotypes other than individual risk alleles.

Admixture

To simulate a two-way admixture, we downloaded the genotype data of region ENr113.4q26 from the ENCODE project Consortium47. Applying the Beagle software48, we separately inferred 180 CEU (Centre d’Etude du Polymorphisme Humain in Utah, USA) and 180 YRI (Yoruban in Ibadan, Nigeria) haplotypes over the ENr113.4q26 region. Details on the haplotype deconvolution have been described previously36. Across the 360 inferred haplotypes, we observed 1,693 SNPs. At each of the region-wide SNPs, we chose the minor allele in the YRI haplotype data (fYRI ≤ 0.5) as the reference allele (Fig. 1a). In the CEU haplotype data, the reference alleles at 1,373 SNPs are of frequencies fCEU ≤ 0.5, whereas at the other 320 SNPs, fCEU > 0.5 (Fig. 1b). Based on our previous association study on African Americans43, we adopted ω = 0.8 vs. ϖ = 0.2 as YRI-CEU admixture weights. To ‘genotype’ one admixed individual in the ENr113.4q26 region, we randomly chose one and another haplotype from the YRI or CEU haplotypes with probabilities ω vs. ϖ. In this simulated admixture, the frequencies of reference alleles at the 1,693 SNPs (fADX = ωfYRI + ϖfCEU) range from 0.0011 to 0.5722 (Fig. 1c), where 295 SNPs are of fADX < 0.02, satisfying the conventional criterion of rare variants45.

Figure 1
figure 1

Population wise distributions of the reference alleles at 1,693 SNPs in ENr113.4q26. (a) In the YRI haplotype data, all the reference alleles are of frequencies fYRI ≤ 0.5. (b) In the CEU haplotype data, the reference alleles at 1,373 SNPs are of frequencies fCEU ≤ 0.5. (c) In the simulated admixture, the reference alleles at 295 SNPs are rare (0.001 < fADX < 0.02).

Causal haplotypes and ancestries

From the 254 SNPs with fADX < 0.015, we randomly selected 23 SNPs as deleterious allele carriers (Table 1). The minor alleles at these 23 SNPs served as deleterious alleles. The deleterious alleles appear at 32 YRI and 4 CEU haplotypes, respectively. These 36 haplotypes served as risk haplotypes. The proportions of risk haplotypes in YRI and CEU haplotype data sets are \({p}_{YRI}=\frac{8}{45}\approx 0.1778\) and \({p}_{CEU}=\frac{1}{45}\approx 0.0222\), respectively. Thus, the proportion of risk haplotypes in the simulated admixture is

$${p}_{ADX}={p}_{YRI}\omega +{p}_{CEU}\varpi =\frac{8}{45}\times \frac{4}{5}+\frac{1}{45}\times \frac{1}{5}=\frac{11}{75}\approx \mathrm{0.1467.}$$
(4)

For each admixed individual, we let H be number of risk haplotypes and let a be the number of YRI haplotypes. Table 2 presents Pr(H, a), the joint probability mass of (H, a), where q() = 1 − p() for YRI, CEU and ADX, respectively. In this simulated admixture, the coefficient of correlation between H and a is

$$\rho =\frac{\sqrt{\varpi \omega }({p}_{YRI}-{p}_{CEU})}{\sqrt{(\omega {p}_{YRI}+\varpi {p}_{CEU})(\omega {q}_{YRI}+\varpi {q}_{CEU})}}=\frac{7}{12\sqrt{11}}\approx \mathrm{0.1759.}$$
(5)
Table 1 The distribution of the frequencies of deleterious alleles.
Table 2 Generic probability mass function of (H, a) in the simulated admixture.

Modes of disease genetics

Let y be the disease status of an admixed individual (=1, if affected; =0, if unaffected). Let fH = Pr(y = 1|H) be the penetrance for a given H value (=0, 1, or 2). Then the disease prevalence κ = Pr(y = 1) can be formulated as

$$\kappa =\sum _{H}\,{f}_{H}\,{\rm{\Pr }}(H).$$
(6)

Let RR = f2/f0 be relative risk. Then, \({f}_{1}={f}_{0}\cdot RR\) for dominant mode, \({f}_{1}=\frac{1}{2}{f}_{0}\cdot (1+RR)\) for additive mode, \({f}_{1}={f}_{0}\cdot \sqrt{RR}\) for multiplicative mode, and f1 = f0 for recessive mode. Under each mode, \({\rm{\Pr }}(y,a|H)={\rm{\Pr }}(y|H)\) \({\rm{\Pr }}(a|H)\), namely, the disease status is independent of local ancestry a given haplotype H. It follows that

$${\rm{\Pr }}(H,a|y)=\frac{{\rm{\Pr }}(H,a){\rm{\Pr }}(y|H)}{{\rm{\Pr }}(y)}$$
(7)

for an arbitrary (H, a, y). Setting y = 0 in Eq. (7) yielding the joint probability mass of (H, a) in unaffected subpopulation:

$$Pr(H,a|y=0)=\frac{Pr(H,a)(1-{f}_{H})}{1-\kappa }.$$
(8)

Setting y = 1 in Eq. (7) yielding the joint probability mass of (H, a) in affected subpopulation:

$${\rm{\Pr }}(H,a|y=1)=\frac{\Pr (H,a){f}_{H}}{\kappa }.$$
(9)

Eqs (8) and (9) and Table 2 are necessary and sufficient to mathematically formulate the Pearson coefficient of the correlation between H and a in the entire affected subpopulation \({\rho }_{1}\mathop{=}\limits^{{\rm{def}}}{\rm{corr}}(H,a|y=1)\) and that in the entire unaffected subpopulation \(\,{\rho }_{0}\mathop{=}\limits^{{\rm{def}}}{\rm{corr}}(H,a|y=0)\).

Simulation configurations

Using Table 3, we numerically computed ρ1 and ρ0 values for f0 = 0.1 and each RR under each of the four disease genetic modes (Fig. 2). Under all the four modes, ρ1 increases and ρ0 decreases from 0.1759 as RR increases from 1 (no genetic association) to 3. The dominant mode shows the largest ratio ρ1/ρ0, followed in turn by the additive mode, the multiplicative mode, and the recessive mode. Of note, f0 = f1 = f2 = κ under the null hypothesis of no genetic association. We acknowledge that prevalence (κ) varies for different diseases in an admixed population. For example, about 10% African Americans suffer from lifetime major depressive disorder49, whereas about 2.7% African American suffer from dementia50. In our simulations, we fixed f0 = 0.1 as a reference value to inspect the type I error rate and power patterns of different association methods with respect to different disease modes, relative risks, sample sizes, and nominal significance levels.

Table 3 Specific probabilities of (H, a) used in the simulation*.
Figure 2
figure 2

The trends of correlation between the numbers of causal haplotypes and ancestries under four modes of disease genetics. The correlation curves in each panel were generated by fixing f0 = 0.1 and varying (f1, f2) according to the underlying genetic modes. Generically, as relative risk increases from 1 to 3, the coefficient of correlation between H and a in affected group corr(H, a|D = 1) increases from 0.1759, whereas that in unaffected group decreases from 0.1759. The dominant mode shows the largest ratio of corr(H, a|D = 1) to corr(H, a|D = 0), followed in turn by the additive, multiplicative and recessive modes.

For each RR value under each mode, at each replication we simulated region wide genotypes and ancestries of n1 affected and n0 unaffected individuals from the admixed population. For each specific scenario, we adopted sample sizes n1 = n0 = 500 and then n1 = n0 = 2,000 to inspect the impacts of sample sizes on power levels and type I error rates of the methods under comparison. These sample sizes are realistic in that they reflect the scales of recent deep sequencing studies in African Americans. For example, 489 Alzheimer’s cases and 472 controls were sequenced a target sequencing study on Alzheimer’s disease51. The Jackson Heart Study52 has deeply sequenced more than 3400 African Americans.

To accurately evaluate the TIERIFs of the methods, we simulated 108 replications of genotypes and ancestries by setting RR = 1 under each of the four disease modes. This number of replications is sufficient and necessary for evaluating type I error rates of gene-based tests at nominal genome-wide significance level (2.5 × 10−6). For power comparisons, we generated 20,000 replications for each given RR value (>1) under each of the four disease modes. This number of replications would be sufficient for reliably inspecting power patterns.

The ORWSS was designed to accommodate both rare and common variants. Thus, for this method, we used all the region wide variants to compute the weighted-sum of genotypic scores. The SDWSS have been observed to reduce statistical power when more neutral common variants are included into the test statistic. Intuitively, the other two methods will also reduce statistical power if common neutral variants are included. In our power comparisons, therefore, we used the conventional threshold 0.02 to choose rare variants36 to perform the other three methods. In the simulated admixture, the reference alleles at 295 SNPs proved rare (fADX < 0.02). Hence, we used the numbers of minor alleles at these 295 SNPs to compute the sum scores in the CAST and our LABST as well as the weighted-sum score in the SDWSS.

Results

Type I error rates

Figure 3 presents the TIERIFs of the four methods. For both sets of sample sizes, the CAST and our LABST well control type I error rates for various nominal levels across interval [10−6, 0.05]. They do not inflate or deflate type I error rates. Their TIERIF curves consistently concentrate around 1 and within the 95% concentration band (CB). The SDWSS appears overly liberal and the type I error rate inflation is quite robust to the increase in sample size. Its TIERIF curves clearly break the upper bound of the 95% CB for both the smaller and the larger sample size settings. These results would suggest that the SDWSS suffers a systematic bias in calibrating the tail probability of its test statistic. In contrast, the ORWSS appears overly conservative. Its TIERIF curves clearly break the lower bound of the 95% CB, especially for the smaller sample sizes. It becomes essentially less conservative for the larger sample sizes. These results would suggest that the ORWSS better calibrates the tail probability of its test statistic for larger sample sizes.

Figure 3
figure 3

Empirical TIERIFs of the four methods based on different sample sizes and various nominal levels. In each panel, we generated 108 replications of region wide genotypes and ancestries of the specified numbers of affected and unaffected admixed to evaluate the TIERIF of each method at each nominal level. In the ORWSS, we used all the 1,693 variants to compute the individual weighted-sums of genotypic scores. In the other methods, we used the 295 variants of fADX < 0.02 to compute the individual sums/weighted-sums of genotypic scores. The LABST and the CAST accurately controlled the type I error rates. The SDWSS appeared overly liberal. The ORWSS appeared over conservative, particularly for the smaller samples.

Power comparisons

Figure 4 presents the power comparisons under four disease genetic modes with sample sizes n0 = n1 = 500 and nominal level α = 0.05. Overall, each method performs the best at the dominant mode, followed in turn by the additive mode, multiplicative mode, and lastly, recessive mode. Under each disease genetic mode, our LABST performs the best for all relative risks, followed in turn by the CAST, the SDWSS, and the ORWSS. Exploiting an identical set of rare variants, the CAST uniformly outperforms the SDWSS. Since the SDWSS is robustly liberal, its power inferiority would be caused by the transformation of the weighted-sums of genotypic scores to ranks, which would lose information. For the moderate sample sizes, the ORWSS appears unacceptably conservative and lacks ability to effectively separate the true causal variants from the other variants.

Figure 4
figure 4

Power comparisons under four disease modes with various relative risks: n0 = n1 = 500, nominal level α = 0.05. In the simulated admixture, all the 23 risk allele frequencies are less than 0.015, and the cumulative risk haplotype frequency is 0.1467. Under each mode, at each relative risk, we evaluated the power of each method based on 20,000 simulated replications. In the ORWSS, we used all the 1,693 variants to compute the individual weighted-sums of genotypic scores. In all the other methods, we used the 295 variants of fADX < 0.02 to compute the individual sums/weighted-sums of genotypic scores.

Figure 5 presents the power comparisons when increasing sample size to n0 = n1 = 2,000 but keeping all the other parameters used in Fig. 4. All the methods show increased power across all the different disease genetic modes. The LABST keeps its uniform preference, followed by the CAST, which outperforms the SDWSS and the ORWSS. However, the ORWSS now outperforms the SDWSS for a wide range of relative risks under the dominant, additive, and multiplicative modes. Under the recessive mode, the SDWSS slightly outperforms the ORWSS for all the relative risks. When increasing the sample size, the ORWSS becomes much less conservative and better scales the causal SNPs especially when RR becomes relatively large.

Figure 5
figure 5

Power comparisons under four disease modes with various relative risks: n0 = n1 = 2,000, nominal level α = 0.05. In the simulated admixture, all the 23 risk allele frequencies are less than 0.015, and the cumulative risk haplotype frequency is 0.1467. Under each mode, at each relative risk, we evaluated the power of each method based on 20,000 simulated replications. In the ORWSS, we used all the 1,693 variants to compute the individual weighted-sums of genotypic scores. In all the other methods, we used the 295 variants of fADX < 0.02 to compute the individual sums/weighted-sums of genotypic scores.

Figure 6 presents the power comparisons when reducing the nominal level to α = 10−6 but keeping all the other parameters in Fig. 5. At this significance level, the LABST still outperforms the other methods across all the scenarios. Under the recessive mode, all methods have very low or no power with respect to various relative risks. Under the other three modes, the CAST is more powerful than the SDWSS but becomes less powerful than the ORWSS, whereas the ORWSS become the second best among our compared methods for a wide range of relative risks.

Figure 6
figure 6

Power comparisons under four disease modes with various relative risks: n0 = n1 = 2,000, nominal level α = 10−6. In the simulated admixture, all the 23 risk allele frequencies are less than 0.015, and the cumulative risk haplotype frequency is 0.1467. Under each mode, at each relative risk, we evaluated the power of each method based on 20,000 simulated replications. In the ORWSS, we used all the 1,693 variants to compute the individual weighted-sums of genotypic scores. In all the other methods, we used the 295 variants of fADX < 0.02 to compute the individual sums/weighted-sums of genotypic scores.

Discussion

The primary objective of this report is to illustrate the utility of leveraging local ancestry for rare variant association analysis. We present the LABST to combine local ancestry of a test block with the sum of genotypic scores of block-wide rare variants. Under the null of no genetic association, we mathematically prove that the LABST statistic asymptotically follows the chi-square distribution of one degree of freedom. This explicit asymptotic null distribution enables us analytically compute the significance of each ancestry block. Under our extensive simulations, the LABST properly controls type I error rates at various preset nominal levels. These results indicate that the null distribution of the LABST statistic can be accurately approximated by the 1-df chi-square distribution. Based on our results, the permutation-based evaluations of significance in the SDWSS and ORWSS are not accurate enough for genome-wide scans for samples from admixed populations. The SDWSS tends to inflate type I error rates and the inflation appears robust to the changes of sample sizes. In other words, the SDWSS would suffer a systematic bias in calibrating the tail probability of its test statistic. The ORWSS appears severely conservative when the numbers of affected and unaffected individuals are moderate. Its conservativeness becomes less significant when the sample sizes are essentially increased. The conservativeness of the ORWSS would stem from the unideal stability and effectiveness of its weighting scheme.

In our simulations for the power comparisons, we hypothesize that certain haplotypes of some rare alleles are direct causal factors. This assumption allows for diverse SNP wise MAFs but does not necessarily mean that alleles with smaller MAFs have larger effect sizes. Under four disease genetic modes, the LABST are uniformly more powerful than the CAST, SDWSS, and ORWSS across all relative risks, sample sizes, and nominal levels investigated. Its power gain stems from explicit incorporation of the interaction between a gene and the local ancestry. The CAST is more powerful than the SDWSS uniformly across all our simulations, even though the SDWSS is robustly liberal. The SDWSS loses portion of information when transforming the original weighted-sums of numerical genotypic scores to Wilcoxon rank-sum statistic. As pointed out by Wilcoxon himself, ranks are not sufficient statistics53 and hence rank-sum test would not be the most powerful test. The superiority of the ORWSS to the CAST and the SDWSS varied across different disease genetic modes, relative risks, sample sizes, and nominal significance levels. Based on our results, the ORWSS would have limited utility for studies of small to moderate samples, whereas it would be useful for studies with large samples from a homogeneous population. Based on our results, a liberal method is not necessarily more powerful uniformly than a conservative method. The preference of a method depends on how effectively it can aggregate the association information of rare variants. Although derived from simulations on a particular region, all the conclusions are generalizable for an arbitrary admixed population with different ancestry-haplotype correlations between cases and controls. Such differences often stem from the different ancestral frequencies of the risk haplotypes, disease modes, and/or relative risks.

Our LABST can be extended to Hotelling’s two-sample T-squared test54 to jointly analyze multiple groups of variants when desired. Following the LABST, we can define uij as the integrative score of individual i in group j. For d groups, we write ui = (ui1, …, uid)′ as the d × 1 vector of integrative scores, \({\bar{{\boldsymbol{u}}}}_{0}=\sum _{i\in {G}_{0}}\,{{\boldsymbol{u}}}_{i}/{n}_{0}\), \({\bar{{\boldsymbol{u}}}}_{1}=\sum _{i\in {G}_{1}}\,{{\boldsymbol{u}}}_{i}/{n}_{1}\), and V = (n−2)−1 \([\sum _{i\in {G}_{1}}\,({{\boldsymbol{u}}}_{i}-{\bar{{\boldsymbol{u}}}}_{1})({{\boldsymbol{u}}}_{i}-{\bar{{\boldsymbol{u}}}}_{1})^{\prime} +\sum _{i\in {G}_{0}}\,({{\boldsymbol{u}}}_{i}-{\bar{{\boldsymbol{u}}}}_{0})({{\boldsymbol{u}}}_{i}-{\bar{{\boldsymbol{u}}}}_{0})^{\prime} ].\) We define Hotelling’s statistic as \({T}^{2}=({n}_{0}{n}_{1}/n)({\bar{{\boldsymbol{u}}}}_{1}-{\bar{{\boldsymbol{u}}}}_{0}){{\boldsymbol{V}}}^{-1}({\bar{{\boldsymbol{u}}}}_{1}-{\bar{{\boldsymbol{u}}}}_{0})\). The covariance matrix V converges in probability to a positively definite matrix as long as the integrative scores are not in co-linearity. The statistic T2 converges in distribution to the chi-square distribution with d degrees of freedom if group set is not associated with the disease. In addition, informative group wise weights, if available, can be readily incorporated into the Hotelling’s T2 test.

In this investigation, individual local ancestries were assumed to be known. In practice, local ancestries can be inferred from available genomic data. Several software packages, such as SABER18, HAPAA55, HAPMIX56, MULTIMIX57, CSVs58, and ELAI59, have been established for inferring local ancestries. These packages utilize available marker-wise genotypes of a target individual and the haplotypes/genotypes from certain ancestral panels. When dense SNPs are genotyped across the genome, the local ancestries can be highly accurately inferred. For each admixed individual, our LABST assumes that within a short block there is no ancestry crossover. This assumption is reasonable for haplotype blocks in ancestral populations60,61. Such haplotype blocks are of little evidence for historical recombination and much shorter than ALD regions. Gene based rare variant associations often fall in such blocks. In practice, it would be important to accommodate covariates (e.g., population structure variables, environmental factors). Let zi = [1, zi1, …, zic]′ contain the covariates of individual i and let Z = [z1, …, zn]′ be the whole sample covariates matrix. To adjust for the covariates, let \({e}_{i}={u}_{i}-{z}_{i}{\rm{^{\prime} }}{(Z{\rm{^{\prime} }}Z)}^{-1}Z{\rm{^{\prime} }}u\) for each individual i, where u = [u1, …, un]′. It is clear that the vector of residuals e = [e1, …, en]′ is orthogonal to Z, that is, eZ = 0. Replacing ui’s in the statistic We (Eq. 1) with ei’s is one way to adjust for covariates.

Like many existing methods, our LABST assumes that individuals are randomly recruited from a target admixed population. It will be instructive to develop particular integrative methods for other sampling schemes that enrich rare variants. For example, the individuals with extreme values of a quantitative trait are often recruited for sequencing studies. Under such a trait-oriented sampling scheme, the LABST is valid but its power would be improved by combining local ancestry with a direct quantitative association analysis that incorporates the sampling scheme. In addition, individuals can be selected according to a secondary sampling trait, which is conveniently and economically measured. Only for the recruited individuals, the values of the primary study trait are measured. For such a sampling scheme, we will develop novel effective methods to combine block wise ancestries and genotypes with multiple phenotypes for identifying pleiotropic genes. Currently, the LABST only works for a recent (several-generations) admixture of two ancestral populations with different genetic architectures, i.e., distinct causal allele frequencies and/or effects. One typical example is the current African American population, which suffers from disproportionately heavier burdens of multiple diseases1,2,3,4,5,6,7 than European Americans. The LABST can be extended to allow for multi-way admixtures such as Hispanic and Latino Americans. For example, it can be extended to a (d + 1)-way admixture by using Hotelling’s two-sample T-squared test with d degrees of freedom, which is similar to the above extension to combine multiple groups of variants.