Identifying Rare Variant Associations in Admixed Populations

An admixed population and its ancestral populations bear different burdens of a complex disease. The ancestral populations may have different haplotypes of deleterious alleles and thus ancestry-gene interaction can influence disease risk in the admixed population. Among admixed individuals, deleterious haplotypes and their ancestries are dependent and can provide non-redundant association information. Herein we propose a local ancestry boosted sum test (LABST) for identifying chromosomal blocks that harbor rare variants but have no ancestry switches. For such a stable ancestral block, our LABST exploits ancestry-gene interaction and the number of rare alleles therein. Under the null of no genetic association, the test statistic asymptotically follows a chi-square distribution with one degree of freedom (1-df). Our LABST properly controlled type I error rates under extensive simulations, suggesting that the asymptotic approximation was accurate for the null distribution of the test statistic. In terms of power for identifying rare variant associations, our LABST uniformly outperformed several famed methods under four important modes of disease genetics over a large range of relative risks. In conclusion, exploiting ancestry-gene interaction can boost statistical power for rare variant association mapping in admixed populations.

ORWSS 36 scales SNP wise numbers of minor alleles by the logarithms of amended odds ratios in the 2 × 2 tables of disease status by allele states. The EREC method 37 scales SNP wise numbers of minor alleles by the estimated regression coefficients. When families are available, incorporating linkage evidence in rare variants analysis has also been developed 38-41 . However, these existing methods may be underpowered for identifying rare variant associations in admixed populations, because they do not explicitly exploit the association information conveyed by local ancestries, particularly, ancestry-gene interaction. An admixed population and its ancestral populations often bear different burdens of complex diseases, partially due to the ancestral discrepancies in causal alleles, allele frequencies, and effects. Within a chromosomal block harboring causal alleles, an affected admixed individual may have an increased probability of inheriting alleles from the ancestry population of higher disease prevalence 12,42 . Common variants with different ancestral frequencies are correlated to their ancestries 43,44 which provide non-redundant association information [12][13][14][15] . We hypothesize that this argument holds for rare variants. Single rare variant association testing has unacceptably limited power since only a small portion of study individuals carry the rare allele.
In this report, we will illustrate the utility of explicitly exploiting local ancestries and genotypes together for rare variant associations. For simplicity, we aim to identify stable ancestral blocks harboring rare variants. For each person, all the SNPs within such a block share an identical ancestry. We propose a heuristic local ancestry boosted sum test (LABST). In a stable ancestral block, our test statistic combines the sum of SNP wise numbers of rare mutations and the ancestry-gene interaction. We mathematically prove that the LABST statistic asymptotically follows a chi-square distribution with 1-df if the test block is not associated with the disease. In extensive simulations, our LABST appropriately controlled type I error rates at preset nominal levels, indicating the ideal accuracy of the asymptotic approximation. Under various multiple rare variant disease modes with a large range of relative risks, our LABST were uniformly more powerful than the benchmark CAST as well as the sophisticated SDWSS and ORWSS. The LABST is a heuristic method designed for unrelated cases and controls. It can be extended to incorporate informative weights, to accommodate covariates and to allow for multiple groups of rare variants.

Methods
In an admixed population of two ancestral populations, let a test chromosomal block contain L rare variants, i.e., the minor allele frequencies (MAFs) < 2% 45 . For n unrelated individuals from the admixed population, let be the index sets of affected and unaffected individuals, respectively. Let g ij denote the number of minor alleles carried by individual i at SNP j, = ∑ = s g , Let the test block be stable in terms of variant wise ancestries. In other words, we assume that each block wide haplotype of each individual is inherited entirely from one of the two ancestral populations without any ancestry crossover points. Under such an assumption, all the m SNPs within the block share identical ancestry. We define a = [a 1 , …, a n ], where a i denotes the number of ancestries on individual i inherited from the ancestral population of the higher disease prevalence (due to the larger risk haplotype frequency). Let α be the nominal significance level of a test for block-based associations.
The proposed LABST. For each individual i, we define u i = (1 + a i )s i to combine ancestry-gene interaction a i s i with s i . We define a Welch type t statistic , and u 0 2 and σ 0 2 are likewise defined using the u-scores of n 0 unaffected individuals. We use W e 2 to measure the association between the test block and the disease status. Let n 1 /n 0 converge to a finite positive constant τ when both n 0 and n 1 increase, e.g., τ = 1 if n 1 = n 0 . If the test block is not associated with the disease status, then the statistic W e 2 converges in distribution to ξ χ 1 2 , the chi-square distribution with 1-df (see Appendix A for a mathematical proof). Thus, we compute the p-value of W e 2 as ξ = > P W Pr( ) e 2 def and claim significance when P < α, the preset nominal significance level.
Existing methods. Most existing rare variant association methods exploit genotypes without explicitly capitalizing on ancestry-gene interactions. In such methods, the genotypic score of individual i is defined as where w j is a SNP wise weight. The simplest weight is w j ≡ 1 for all the L SNPs as in the CAST 29 . For this universal weight, x i collapses to s i , and a benchmark 1-df statistic is constructed by replacing u i 's in our LABST with the s i 's.
The SDWSS 31 weighs a SNP using its minor allele frequency in unaffected individuals among whole-sample individuals. At the j th SNP, = − w nq q 1/ (1 ) , j j j where q j = (1 + m j )/(2 + 2n 0 ), and m j is the number of minor alleles at the SNP over the n 0 unaffected individuals. Whole-sample individuals are ranked according to the x i scores and a rank sum = ∑ ∈ x x rank( ) x k be the rank sums based on k(=1,000) permutations of disease status, μ and σ , be their mean and standard deviation, respectively. The standardized score is defined as z = (x−μ)/σ and the P value of z is computed according to the standard normal distribution.
The weighting scheme in the SDWSS favors the disease-associated mutations with very low frequencies. As acknowledged by its authors, however, this scheme may reduce the power to detect the disease-associated mutations with higher frequencies. The SDWSS is based on the implicit assumption 32 where OR j is the odds ratio in the 2 × 2 table of disease status by the allele at SNP j, and q 0j is the MAF in the controls. Thus, Feng et al. 36 proposed the ORWSS to jointly analyze rare and common variants. This method keeps www.nature.com/scientificreports www.nature.com/scientificreports/ all the steps of the SDWSS except for the weighting scheme. In the ORWSS, a SNP is weighted by the logarithm of the amended odds ratio 46 in the 2 × 2 table of allele by disease status. The amended odds ratio proves a useful remedy for handling potential empty cells in SNP-wise tables.
Type I error rate inflation factor. Often, a conservative method tends more likely to miss true associations whereas a liberal method tends more likely to claim false positives. A valid powerful method should accurately control the type I error rate at each preset nominal level. Herein, we propose and use type I error rate inflation factor (TIERIF) to measure how accurately a method controls type I error rate. For a given nominal level α, we define the TIERIF of a method as γ α = τ α /α, where τ α is the probability that the method rejects the null hypothesis. If γ α = 1, then the method is able to controls type I error rate at the given nominal level α. If γ α is substantially smaller than 1, then the method is overly conservative. If γ α is substantially larger than 1, then the method is overly liberal.
Usually, it is intractable to mathematically formulate the TIERIF of a sophisticated method. In addition, it is hard to tell what a TIERIF is unacceptably 'small' or 'large' . Herein, we propose an empirical method to estimate this quantity and tell how small (large) is too small (large). Specifically, we define γ τ α = α αˆ/ as an estimator of γ α , where τ α is the frequency that the method claims significance over R simulation replications generated under the null hypothesis of no association. As R increases, γ α converges in probability to γ α , and α γ γ α − − α α R ( ) / 1 converges in distribution to a standard normal variable (see Appendix B for a mathematical proof). Therefore, if the method properly controls type I error rate at α, then γ α concentrates with probability 95% between and Under the null of no genetic association, the concentration interval [LB α , UB α ] is the shortest among all the inter-

simulation Designs
For method comparisons, we simulated an admixture using the rare variants with the frequency-spectrums of two natural populations. In the simulated admixture, block-wide haplotypes were inherited from the two ancestry populations. Four disease genetic modes were considered, including the dominant, additive, recessive, and multiplicative modes. Under each disease genetic mode, the disease status of an admixed individual was determined by the penetrance conditioning on block-wide risk haplotypes other than individual risk alleles.

Admixture.
To simulate a two-way admixture, we downloaded the genotype data of region ENr113.4q26 from the ENCODE project Consortium 47 . Applying the Beagle software 48 , we separately inferred 180 CEU (Centre d'Etude du Polymorphisme Humain in Utah, USA) and 180 YRI (Yoruban in Ibadan, Nigeria) haplotypes over the ENr113.4q26 region. Details on the haplotype deconvolution have been described previously 36 . Across the 360 inferred haplotypes, we observed 1,693 SNPs. At each of the region-wide SNPs, we chose the minor allele in the YRI haplotype data (f YRI ≤ 0.5) as the reference allele ( Fig. 1a). In the CEU haplotype data, the reference alleles at 1,373 SNPs are of frequencies f CEU ≤ 0.5, whereas at the other 320 SNPs, f CEU > 0.5 (Fig. 1b). Based on our previous association study on African Americans 43 , we adopted ω = 0.8 vs. ϖ = 0.2 as YRI-CEU admixture weights. To 'genotype' one admixed individual in the ENr113.4q26 region, we randomly chose one and another haplotype from the YRI or CEU haplotypes with probabilities ω vs. ϖ. In this simulated admixture, the frequencies of reference alleles at the 1,693 SNPs (f ADX = ωf YRI + ϖf CEU ) range from 0.0011 to 0.5722 (Fig. 1c), where 295 SNPs are of f ADX < 0.02, satisfying the conventional criterion of rare variants 45  www.nature.com/scientificreports www.nature.com/scientificreports/ for an arbitrary (H, a, y). Setting y = 0 in Eq. (7) yielding the joint probability mass of (H, a) in unaffected subpopulation:

(8)
H Setting y = 1 in Eq. (7) yielding the joint probability mass of (H, a) in affected subpopulation: H Eqs (8) and (9) and Table 2 Table 3, we numerically computed ρ 1 and ρ 0 values for f 0 = 0.1 and each RR under each of the four disease genetic modes (Fig. 2). Under all the four modes, ρ 1 increases and ρ 0 decreases from 0.1759 as RR increases from 1 (no genetic association) to 3. The dominant mode shows the largest ratio ρ 1 /ρ 0 , followed in turn by the additive mode, the multiplicative mode, and the recessive mode. Of note, f 0 = f 1 = f 2 = κ under the null hypothesis of no genetic association. We acknowledge that prevalence (κ) varies for different diseases in an admixed population. For example, about 10% African Americans suffer from lifetime major depressive disorder 49 , whereas about 2.7% African American suffer from dementia 50 . In our simulations, we fixed f 0 = 0.1 as a reference value to inspect the type I error rate and power patterns of different association methods with respect to different disease modes, relative risks, sample sizes, and nominal significance levels.

Simulation configurations. Using
For each RR value under each mode, at each replication we simulated region wide genotypes and ancestries of n 1 affected and n 0 unaffected individuals from the admixed population. For each specific scenario, we adopted sample sizes n 1 = n 0 = 500 and then n 1 = n 0 = 2,000 to inspect the impacts of sample sizes on power levels and type I error rates of the methods under comparison. These sample sizes are realistic in that they reflect the scales of recent deep sequencing studies in African Americans. For example, 489 Alzheimer's cases and 472 controls were sequenced a target sequencing study on Alzheimer's disease 51 . The Jackson Heart Study 52 has deeply sequenced more than 3400 African Americans.
To accurately evaluate the TIERIFs of the methods, we simulated 10 8 replications of genotypes and ancestries by setting RR = 1 under each of the four disease modes. This number of replications is sufficient and necessary for evaluating type I error rates of gene-based tests at nominal genome-wide significance level (2.5 × 10 −6 ). For power comparisons, we generated 20,000 replications for each given RR value (>1) under each of the four disease modes. This number of replications would be sufficient for reliably inspecting power patterns.
The ORWSS was designed to accommodate both rare and common variants. Thus, for this method, we used all the region wide variants to compute the weighted-sum of genotypic scores. The SDWSS have been observed to reduce statistical power when more neutral common variants are included into the test statistic. Intuitively, the other two methods will also reduce statistical power if common neutral variants are included. In our power comparisons, therefore, we used the conventional threshold 0.02 to choose rare variants 36 to perform the other three methods. In the simulated admixture, the reference alleles at 295 SNPs proved rare (f ADX < 0.02). Hence, we used the numbers of minor alleles at these 295 SNPs to compute the sum scores in the CAST and our LABST as well as the weighted-sum score in the SDWSS.

Results
Type I error rates. Figure 3 presents the TIERIFs of the four methods. For both sets of sample sizes, the CAST and our LABST well control type I error rates for various nominal levels across interval [10 −6 , 0.05]. They do not inflate or deflate type I error rates. Their TIERIF curves consistently concentrate around 1 and within the 95% concentration band (CB). The SDWSS appears overly liberal and the type I error rate inflation is quite robust to the increase in sample size. Its TIERIF curves clearly break the upper bound of the 95% CB for both the smaller and the larger sample size settings. These results would suggest that the SDWSS suffers a systematic bias in calibrating the tail probability of its test statistic. In contrast, the ORWSS appears overly conservative. Its TIERIF curves clearly break the lower bound of the 95% CB, especially for the smaller sample sizes. It becomes essentially less conservative for the larger sample sizes. These results would suggest that the ORWSS better calibrates the tail probability of its test statistic for larger sample sizes.  www.nature.com/scientificreports www.nature.com/scientificreports/ Power comparisons. Figure 4 presents the power comparisons under four disease genetic modes with sample sizes n 0 = n 1 = 500 and nominal level α = 0.05. Overall, each method performs the best at the dominant mode, followed in turn by the additive mode, multiplicative mode, and lastly, recessive mode. Under each disease genetic mode, our LABST performs the best for all relative risks, followed in turn by the CAST, the SDWSS, and the ORWSS. Exploiting an identical set of rare variants, the CAST uniformly outperforms the SDWSS. Since the SDWSS is robustly liberal, its power inferiority would be caused by the transformation of the weighted-sums of genotypic scores to ranks, which would lose information. For the moderate sample sizes, the ORWSS appears unacceptably conservative and lacks ability to effectively separate the true causal variants from the other variants. Figure 5 presents the power comparisons when increasing sample size to n 0 = n 1 = 2,000 but keeping all the other parameters used in Fig. 4. All the methods show increased power across all the different disease genetic modes. The LABST keeps its uniform preference, followed by the CAST, which outperforms the SDWSS and the ORWSS. However, the ORWSS now outperforms the SDWSS for a wide range of relative risks under the dominant, additive, and multiplicative modes. Under the recessive mode, the SDWSS slightly outperforms the ORWSS for all the relative risks. When increasing the sample size, the ORWSS becomes much less conservative and better scales the causal SNPs especially when RR becomes relatively large. Figure 6 presents the power comparisons when reducing the nominal level to α = 10 −6 but keeping all the other parameters in Fig. 5. At this significance level, the LABST still outperforms the other methods across all the scenarios. Under the recessive mode, all methods have very low or no power with respect to various relative risks. Under the other three modes, the CAST is more powerful than the SDWSS but becomes less powerful than the ORWSS, whereas the ORWSS become the second best among our compared methods for a wide range of relative risks.

Discussion
The primary objective of this report is to illustrate the utility of leveraging local ancestry for rare variant association analysis. We present the LABST to combine local ancestry of a test block with the sum of genotypic scores of block-wide rare variants. Under the null of no genetic association, we mathematically prove that the LABST statistic asymptotically follows the chi-square distribution of one degree of freedom. This explicit asymptotic null distribution enables us analytically compute the significance of each ancestry block. Under our extensive simulations, the LABST properly controls type I error rates at various preset nominal levels. These results indicate that the null distribution of the LABST statistic can be accurately approximated by the 1-df chi-square distribution. Based on our results, the permutation-based evaluations of significance in the SDWSS and ORWSS are not accurate enough for genome-wide scans for samples from admixed populations. The SDWSS tends to inflate type I error rates and the inflation appears robust to the changes of sample sizes. In other words, the SDWSS would suffer a systematic bias in calibrating the tail probability of its test statistic. The ORWSS appears severely conservative when the numbers of affected and unaffected individuals are moderate. Its conservativeness becomes less significant when the sample sizes are essentially increased. The conservativeness of the ORWSS would stem from the unideal stability and effectiveness of its weighting scheme. www.nature.com/scientificreports www.nature.com/scientificreports/ In our simulations for the power comparisons, we hypothesize that certain haplotypes of some rare alleles are direct causal factors. This assumption allows for diverse SNP wise MAFs but does not necessarily mean that alleles with smaller MAFs have larger effect sizes. Under four disease genetic modes, the LABST are uniformly more powerful than the CAST, SDWSS, and ORWSS across all relative risks, sample sizes, and nominal levels investigated. Its power gain stems from explicit incorporation of the interaction between a gene and the local ancestry. The CAST is more powerful than the SDWSS uniformly across all our simulations, even though the SDWSS  www.nature.com/scientificreports www.nature.com/scientificreports/ is robustly liberal. The SDWSS loses portion of information when transforming the original weighted-sums of numerical genotypic scores to Wilcoxon rank-sum statistic. As pointed out by Wilcoxon himself, ranks are not sufficient statistics 53 and hence rank-sum test would not be the most powerful test. The superiority of the ORWSS to the CAST and the SDWSS varied across different disease genetic modes, relative risks, sample sizes, and nominal significance levels. Based on our results, the ORWSS would have limited utility for studies of small to moderate samples, whereas it would be useful for studies with large samples from a homogeneous population. Based on our results, a liberal method is not necessarily more powerful uniformly than a conservative method. The preference of a method depends on how effectively it can aggregate the association information of rare variants. Although derived from simulations on a particular region, all the conclusions are generalizable for an arbitrary admixed population with different ancestry-haplotype correlations between cases and controls. Such differences often stem from the different ancestral frequencies of the risk haplotypes, disease modes, and/or relative risks.
Our LABST can be extended to Hotelling's two-sample T-squared test 54 to jointly analyze multiple groups of variants when desired. Following the LABST, we can define u ij as the integrative score of individual i in group j. For d groups, we write u i = (u i1 , …, u id )′ as the d × 1 vector of integrative scores, = ∑ ∈ u u n / We define Hotelling's statistic as The covariance matrix V converges in probability to a positively definite matrix as long as the integrative scores are not in co-linearity. The statistic T 2 converges in distribution to the chi-square distribution with d degrees of freedom if group set is not associated with the disease. In addition, informative group wise weights, if available, can be readily incorporated into the Hotelling's T 2 test.
In this investigation, individual local ancestries were assumed to be known. In practice, local ancestries can be inferred from available genomic data. Several software packages, such as SABER 18 , HAPAA 55 , HAPMIX 56 , MULTIMIX 57 , CSVs 58 , and ELAI 59 , have been established for inferring local ancestries. These packages utilize available marker-wise genotypes of a target individual and the haplotypes/genotypes from certain ancestral panels. When dense SNPs are genotyped across the genome, the local ancestries can be highly accurately inferred. For each admixed individual, our LABST assumes that within a short block there is no ancestry crossover. This assumption is reasonable for haplotype blocks in ancestral populations 60,61 . Such haplotype blocks are of little evidence for historical recombination and much shorter than ALD regions. Gene based rare variant associations often fall in such blocks. In practice, it would be important to accommodate covariates (e.g., population structure variables, environmental factors). Let z i = [1, z i1 , …, z ic ]′ contain the covariates of individual i and let Z = [z 1 , …, z n ]′ be the whole sample covariates matrix. To adjust for the covariates, let = − ′ ′ ′ − e u z Z Z Z u ( ) for each individual i, where u = [u 1 , …, u n ]′. It is clear that the vector of residuals e = [e 1 , …, e n ]′ is orthogonal to Z, that is, e′Z = 0. Replacing u i 's in the statistic W e (Eq. 1) with e i 's is one way to adjust for covariates.
Like many existing methods, our LABST assumes that individuals are randomly recruited from a target admixed population. It will be instructive to develop particular integrative methods for other sampling schemes that enrich rare variants. For example, the individuals with extreme values of a quantitative trait are often recruited for sequencing studies. Under such a trait-oriented sampling scheme, the LABST is valid but its power would be improved by combining local ancestry with a direct quantitative association analysis that incorporates the sampling scheme. In addition, individuals can be selected according to a secondary sampling trait, which is conveniently and economically measured. Only for the recruited individuals, the values of the primary study trait are measured. For such a sampling scheme, we will develop novel effective methods to combine block wise ancestries and genotypes with multiple phenotypes for identifying pleiotropic genes. Currently, the LABST only works for a recent (several-generations) admixture of two ancestral populations with different genetic architectures, i.e., distinct causal allele frequencies and/or effects. One typical example is the current African American population, which suffers from disproportionately heavier burdens of multiple diseases 1-7 than European Americans. The LABST can be extended to allow for multi-way admixtures such as Hispanic and Latino Americans. For example, it can be extended to a (d + 1)-way admixture by using Hotelling's two-sample T-squared test with d degrees of freedom, which is similar to the above extension to combine multiple groups of variants.