Abstract
The widespread availability of genome sequencing data made possible by next-generation technologies has yielded a flood of different gene-based rare variant association tests. Most of these tests have been published because they have superior power for particular genetic architectures. However, for applied researchers it is challenging to know which test to choose in practice when little is known a priori about genetic architecture. Recently, tests have been proposed which combine two particular individual tests (one burden and one variance components) to minimize power loss while improving robustness to a wider range of genetic architectures. In our analysis we propose an expansion of these approaches, yielding a general method that works for combining any number of individual tests. We demonstrate that running multiple different tests on the same data set and using a Bonferroni correction for multiple testing is never better than combining tests using our general method. We also find that using a test statistic that is highly robust to the inclusion of noncausal variants (joint-infinity) together with a previously published combined test (sequence kernel association test-optimal) provides improved robustness to a wide range of genetic architectures and should be considered for use in practice. Software for this approach is supplied. We support the increased use of combined tests in practice – as well as further exploration of novel combined testing approaches using the general framework provided here – to maximize robustness of rare variant testing strategies against a wide range of genetic architectures.
Introduction
Numerous tests of genotype–phenotype association for rare variants have been proposed, all of which attempt to combine signals at multiple variant sites within a gene into a single, powerful gene-based test of association. According to recent work, which test is the most powerful depends heavily on the true genetic architecture of the phenotype.^{1, 2} The challenge for the applied researcher is to know which test to choose, given the limited information about the true genetic architecture of disease.
A general understanding of test behavior can be obtained by noting the existence of two broad classes of tests (length and joint) among the many tests proposed to date.^{1} Length tests (alternatively: burden, collapsing, linear; eg, CMC^{3}) attempt to enhance the genotype–phenotype signal in a region of interest by collapsing variant measurements into a single measure of rare variant ‘burden,’ which is then tested for association with a phenotype of interest. They are called length tests because they can be interpreted geometrically as testing for a difference in the lengths of the minor allele frequency vectors between cases and controls. These tests tend to be powerful when the proportion of causal variants is large and the effects of the causal variants are similar.^{1} Joint tests (alternatively: variance components, quadratic, eg, SKAT^{4}) combine the strength of evidence of individual phenotype–variant associations across the variants in a region of interest and tend to be powerful when there are larger proportions of noncausal variants and there is more variation in the effects of causal variants.^{1} Joint tests are so named because they simultaneously test for differences between the lengths of the minor allele frequency vectors in cases and controls, as well as testing for a nonzero angle between the vectors. A full discussion and classification of existing tests is available elsewhere.^{1, 2}
Recent papers have proposed combining test statistics across both the length and joint classes to yield more powerful test statistics.^{1, 5, 6, 7, 8} Results from these papers demonstrate how to combine a single version of a length test with a single version of a joint test,^{5} how to use a weighting strategy to find the optimal weighted combination of two particular length and joint test statistics,^{6} and that different weighted combinations of particular length and joint tests can be more powerful than single tests for different genetic architectures.^{1} Overall, these combined testing approaches show improved power against a wider range of genetic architectures when compared to using either statistic separately.^{1, 5, 6, 7}
In general, any approach that combines a single length test and a single joint test will have a limited range of situations in which it is powerful. In particular, the combined test can only be powerful in cases where at least one of the two individual tests being combined is powerful. The combined test will lack power where both tests being combined lack power, even if another, powerful, alternative test exists. For example, a recent paper suggested a novel test statistic which may provide increased power when a large proportion of noncausal variants is present in the gene,^{1} but current test-combining strategies have not evaluated this class of alternatives. Thus, more general test-combining strategies are needed in order to potentially yield more powerful results when the component tests being combined are powerful for a wide range of genetic architectures.
In this paper, we demonstrate how to combine an arbitrarily large and diverse set of gene-based rare variant test statistics using an efficient permutation strategy. We then simulate a wide range of genetic architectures and evaluate the performance of two different methods of combining tests (Fisher’s, minimum P-value) when combined tests involve many different types of tests, including those using a variety of norms. We explore which combinations of tests are ideal and when.
Methods
General strategy for combining tests
We propose the following approach for combining P-values from k different gene-based rare variant tests. For a gene of interest, calculate f^{+} and f^{−}, where f^{+} = (f^{+}_{1}, …, f^{+}_{m}) is a vector of observed allele frequencies in the cases across the m variant sites in the gene, and where f^{+}_{j} = c^{+}_{j}/(2N^{+}), letting c^{+}_{j} indicate the total number of minor alleles in the cases at site j, and N^{+} be the number of cases in the sample. Vector f^{−} is defined analogously for the controls.
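As a minimal sketch of this first step, the allele frequency vectors can be computed from a genotype matrix as follows. The function name, the toy genotype data, and the diploid denominator 2N are illustrative assumptions, not taken from the supplied software.

```python
from typing import List, Sequence


def allele_frequencies(genotypes: Sequence[Sequence[int]]) -> List[float]:
    """Per-site minor allele frequency for one group (cases or controls).

    genotypes[i][j] is the minor allele count (0, 1 or 2) of individual i
    at variant site j. Assumption: diploid data, hence the 2 * N denominator.
    """
    n = len(genotypes)      # N: number of individuals in the group
    m = len(genotypes[0])   # m: number of variant sites in the gene
    counts = [sum(person[j] for person in genotypes) for j in range(m)]  # c_j
    return [c / (2 * n) for c in counts]


# Hypothetical toy data: 4 cases and 4 controls, 3 variant sites.
cases = [[0, 1, 0], [1, 0, 0], [0, 0, 2], [0, 1, 0]]
controls = [[0, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 0]]

f_plus = allele_frequencies(cases)      # f^+
f_minus = allele_frequencies(controls)  # f^-
print(f_plus, f_minus)
```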
After computing f^{+} and f^{−}, find the P-value for each of the k different gene-based rare variant tests, yielding a vector of P-values, p = (p_{1}, …, p_{k}), for each gene of interest (see Rare variant tests section for details). The vector p is used to generate a test statistic, S_{k}=f(p), which summarizes the strength of evidence across p; essentially, the combined strength of evidence of genotype–phenotype association across the entire set of k tests. We consider two different ways of computing S_{k}. The first is the Fisher’s combined P-value test statistic and is computed as F_{k} = −2 Σ_{i=1}^{k} ln(p_{i}). We note that if the k tests were mutually independent, F_{k} would follow a χ^{2} distribution with 2k degrees of freedom; however, that is likely not the case in practice. Instead, we assess significance of F_{k} using the permutation strategy described in the following section.
The second summary statistic is the minimum P-value, Min(p), with significance assessed using the permutation strategy described in the following section. For comparison, we also compute significance of the Min(p) statistic using a Bonferroni correction approach where the summary statistic is deemed significant if Min(p) is less than α/k, for some a priori specified α.
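The two summary statistics and the Bonferroni decision rule are simple to express in code. The sketch below is illustrative only: the function names and the example P-values are hypothetical, and it computes the statistics themselves, not their permutation-based significance.

```python
import math
from typing import Sequence


def fisher_statistic(p_values: Sequence[float]) -> float:
    """Fisher's combined statistic: -2 times the sum of log P-values."""
    return -2.0 * sum(math.log(p) for p in p_values)


def min_p_statistic(p_values: Sequence[float]) -> float:
    """Min(p): the smallest P-value among the k tests."""
    return min(p_values)


def bonferroni_significant(p_values: Sequence[float], alpha: float = 0.05) -> bool:
    """Bonferroni rule: significant if Min(p) is below alpha / k."""
    return min(p_values) < alpha / len(p_values)


# Hypothetical P-values from k = 3 tests applied to one gene.
p = [0.04, 0.20, 0.015]
print(fisher_statistic(p))        # larger values indicate stronger evidence
print(min_p_statistic(p))         # smaller values indicate stronger evidence
print(bonferroni_significant(p))  # compares 0.015 with 0.05 / 3
```

Note the opposite directions of the two statistics: large F_{k} and small Min(p) both indicate stronger evidence, which matters when reading off tail areas from a permutation null distribution.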
Description of the permutation strategy
For a general univariate summary statistic S_{k} of vector p (in our case either F_{k} or Min(p)), statistical significance can be assessed by permuting phenotype status, performing the k tests on the permuted data, recomputing S_{k} on each permutation, and calculating the percentage of permutations for which the permuted value of S_{k} is more extreme than the observed S_{k} (larger for F_{k}, smaller for Min(p)). Recently,^{5} an efficient permutation strategy for assessing the significance of a test S_{k} with k=2 (one length and one joint test) was proposed. We extend the approach to any number of gene-based tests k of any type. The extended approach is to: (1) Calculate the observed value of S_{k} as a function of p, where p is the vector of P-values for each of the i=1,…,k tests being combined. (2) Permute the phenotype and recompute test statistics, t_{i}*(l), under permutation for each of the i=1,…,k tests and for each of l=1,…,P permutations (where P is large), yielding t_{i}* = (t_{i}*(1), …, t_{i}*(P)), a vector of permuted test statistics for test i. Note: these are the same P permutations for all tests. (3) Calculate Rank(t_{i}*(l)), the rank of each of the test statistics in vector t_{i}* for each of the i=1,…,k tests, where Rank(t_{i}*(l))=1 for the largest value of t_{i}*(l) and Rank(t_{i}*(l))=P for the smallest value of t_{i}*(l). (4) Calculate an empirical P-value for each of the permuted test statistics as p_{i}*(l) = Rank(t_{i}*(l))/P. (5) Compute an empirical null distribution (no genotype–phenotype association) for S_{k} by calculating the value of S_{k}(l) from the vector of P-values p*(l) = (p_{1}*(l), …, p_{k}*(l)), for each permutation l=1,…,P. (6) Compute the significance of S_{k} as the percentage of S_{k}(l) values that are more extreme than the observed S_{k}, out of the set of P phenotype permutations.
A few additional comments are worthwhile. First, the procedure can be modified in a straightforward manner for two-sided tests (either individual or combined), by looking at both tails of the empirical null distribution of statistics. Second, for individual tests based on asymptotic distributions, steps (3) and (4) are merely replaced by using the asymptotic distribution to calculate the p_{i}(l). Finally, and importantly, we note that the use of the same P permutations in step (5) is needed to properly model the correlation structure between tests and generate an appropriate null distribution for S_{k}.
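The six steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the supplied R software: the two example statistics (a 1-norm length statistic and an infinity-norm joint statistic), the toy data, and all function names are hypothetical. As a deliberate simplification, the observed labelling is counted as one of the permutations and ties count as "at least as extreme", so that no empirical P-value is ever exactly zero.

```python
import math
import random
from bisect import bisect_left
from typing import Callable, List, Sequence

Genotypes = Sequence[Sequence[int]]


def group_freq(y: Sequence[int], g: Genotypes, group: int) -> List[float]:
    """Minor allele frequency per site among individuals with label == group."""
    rows = [row for label, row in zip(y, g) if label == group]
    return [sum(r[j] for r in rows) / (2 * len(rows)) for j in range(len(g[0]))]


def length_1(y: Sequence[int], g: Genotypes) -> float:
    """Illustrative length statistic: difference in 1-norms (larger = more extreme)."""
    return sum(group_freq(y, g, 1)) - sum(group_freq(y, g, 0))


def joint_inf(y: Sequence[int], g: Genotypes) -> float:
    """Illustrative joint statistic: infinity-norm of the frequency difference."""
    return max(abs(a - b) for a, b in zip(group_freq(y, g, 1), group_freq(y, g, 0)))


def fisher(p: Sequence[float]) -> float:
    return -2.0 * sum(math.log(v) for v in p)


def combined_permutation_p(y: Sequence[int], g: Genotypes, tests: Sequence[Callable],
                           combine: Callable, larger_is_extreme: bool,
                           n_perm: int = 500, seed: int = 0) -> float:
    rng = random.Random(seed)
    labels = list(y)
    # Index 0 holds the observed labelling; indices 1..n_perm hold permutations.
    stats = [[t(labels, g)] for t in tests]
    for _ in range(n_perm):
        rng.shuffle(labels)  # step 2: one shuffle shared by every test
        for i, t in enumerate(tests):
            stats[i].append(t(labels, g))
    total = n_perm + 1
    # Steps 3-4: rank-based empirical P-values (rank 1 = largest statistic).
    p_emp = []
    for s in stats:
        ordered = sorted(s)
        p_emp.append([(total - bisect_left(ordered, v)) / total for v in s])
    # Step 5: the summary statistic for every labelling, using the same index l
    # across tests so that the correlation between tests is preserved.
    s_all = [combine([p_emp[i][l] for i in range(len(tests))]) for l in range(total)]
    # Step 6: tail area of the observed summary statistic.
    s_obs = s_all[0]
    if larger_is_extreme:
        n_extreme = sum(v >= s_obs for v in s_all)
    else:
        n_extreme = sum(v <= s_obs for v in s_all)
    return n_extreme / total


# Hypothetical toy data: 20 cases then 20 controls, 2 sites; site 0 is enriched in cases.
y = [1] * 20 + [0] * 20
g = [[1, 0]] * 10 + [[0, 0]] * 10 + [[0, 1]] * 2 + [[0, 0]] * 18
p_fisher = combined_permutation_p(y, g, [length_1, joint_inf], fisher, True)
p_min = combined_permutation_p(y, g, [length_1, joint_inf], min, False)
print(p_fisher, p_min)
```

The key design point is that every test sees the same shuffled labels within a permutation, which is what lets the empirical null distribution of the summary statistic reflect the dependence between tests.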
Rare variant tests
We explored combinations of different gene-based rare variant tests which were selected to represent a variety of different approaches for evaluating genotype–phenotype associations. We define ||x||_{p} = (Σ_{j=1}^{m} |x_{j}|^{p})^{1/p} as the p-norm for a vector x = (x_{1}, …, x_{m}). The individual rare variant tests we considered were: (1) Sequence kernel association test (SKAT).^{4} SKAT is essentially equivalent to ||f^{+} − f^{−}||_{2} with an asymptotic distribution used for statistical significance – a joint test using the 2-norm. (2) Combined multivariate and collapsing test (CMC).^{3} When all variants are collapsed, CMC can be viewed as essentially equivalent to ||f^{+}||_{1} − ||f^{−}||_{1} with significance assessed using an asymptotic distribution – a length test using a 1-norm. In our analysis we collapsed all variants because our simulations focused on variants with population minor allele frequency ≤1%. (3) Sequence kernel association test-optimal (SKAT-O).^{6, 7} SKAT-O combines SKAT and a general burden test (CMC) by the optimal weight ρ, such that Q_{ρ} = ρQ_{burden} + (1 − ρ)Q_{SKAT} yields the minimum P-value, and uses an asymptotic distribution to assess statistical significance. (4) Length tests with different norms (L(p)),^{1} which test for differences in the lengths of the minor allele frequency vectors between cases and controls. We considered four versions of length tests of the form ||f^{+}||_{p} − ||f^{−}||_{p}, with significance assessed via phenotype permutation. The four versions were generated by considering different values of the norm, p = 1, 2, 4 and ∞, where ||x||_{∞} = max_{j} |x_{j}|. (5) Joint tests with different norms (J(p)),^{1} which simultaneously test for differences in the lengths and for a nonzero angle between the two allele frequency vectors. We considered four versions of joint tests of the form ||f^{+} − f^{−}||_{p}, with significance assessed via phenotype permutation. We used four different values of p: 1, 2, 4 and ∞.
Higher normed tests are more robust to the inclusion of noncausal variants.^{1} Thus, we considered a total of 11 individual gene-based variant tests (SKAT (a 2-norm joint test), SKAT-O (a combined test), CMC (a 1-norm length test), L(1), L(2), L(4), L(∞), J(1), J(2), J(4), J(∞)).
We then combined subsets of the 11 individual gene-based rare variant tests using both the Fisher’s and Min(p) approaches (see General strategy for combining tests section). The eight different combinations of tests we considered were: (1) length tests with different norms (L(1), L(2), L(4), L(∞)) (CT1); (2) joint tests with different norms (J(1), J(2), J(4), J(∞)) (CT2); (3) similar length tests (CMC, L(1)) (CT3); (4) similar joint tests (SKAT, J(2)) (CT4); (5) typical length-joint combined test (SKAT, CMC) (CT5); (6) length and joint tests across norms (L(1), L(2), L(4), L(∞), J(1), J(2), J(4), J(∞)) (CT6); (7) length and joint with some norms (L(1), L(4), J(1), J(4)) (CT7); (8) more robust SKAT-O (SKAT-O, J(∞)) (CT8). A brief rationale for the inclusion of each test is provided in Table 1.
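To make the norm-based length and joint statistics concrete, the sketch below assumes the forms L(p) = ||f^{+}||_{p} − ||f^{−}||_{p} and J(p) = ||f^{+} − f^{−}||_{p}, which follow the geometric description of length and joint tests given in the Introduction. The function names and the frequency vectors are hypothetical, and only the statistics (not their permutation P-values) are computed.

```python
import math
from typing import Sequence


def p_norm(x: Sequence[float], p: float) -> float:
    """p-norm of a vector; p = math.inf gives the largest absolute entry."""
    if math.isinf(p):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def length_stat(f_plus: Sequence[float], f_minus: Sequence[float], p: float) -> float:
    """L(p): difference in the lengths of the case and control frequency vectors."""
    return p_norm(f_plus, p) - p_norm(f_minus, p)


def joint_stat(f_plus: Sequence[float], f_minus: Sequence[float], p: float) -> float:
    """J(p): length of the difference vector, sensitive to both length and angle."""
    return p_norm([a - b for a, b in zip(f_plus, f_minus)], p)


# Hypothetical frequencies: site 0 enriched in cases, site 1 enriched in controls.
f_plus = [0.03, 0.00, 0.01]
f_minus = [0.00, 0.03, 0.01]

for p in (1, 2, 4, math.inf):
    print(p, length_stat(f_plus, f_minus, p), joint_stat(f_plus, f_minus, p))
```

In this toy example the two vectors have equal length under every norm, so each L(p) is zero, while every J(p) is positive because the vectors point in different directions. This is the situation (a mix of risk-increasing and risk-reducing variants) in which joint tests are expected to have an advantage over length tests.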
Simulations
We conducted two main simulation studies as part of our analysis. In the first simulation, we explored the general behavior of the Fisher’s and Min(p) approaches across a variety of different numbers of tests, correlation structures and power settings using generalized gene-based test statistics. In the second simulation we simulated data according to a priori specified genetic disease models and applied the gene-based rare variant tests of association described in the previous section.
Simulation #1: investigating the behavior of Min(p) and Fisher’s
Data were simulated from multivariate normal random variables, T ~ MVN(μ, Σ) (MVN=multivariate normal), using R,^{9} where μ = (μ_{1}, …, μ_{k}) and the k × k covariance matrix Σ has diagonal entries equal to 1 and off-diagonal entries ρ_{i,j}. Each multivariate normal sample represents a vector of test statistics, T, from k different gene-based rare variant tests, where H_{0}: μ = 0, H_{a}: at least one of μ_{i} > 0 and ρ_{i,j} is a measure of correlation between tests i and j. We consider all possible combinations of the following parameters: (1) Number of tests, k, equal to 2, 4, 6, 10 and 20. (2) ρ_{i,j} = 0, 0.25, 0.50, 0.75, 0.90 and 0.99 between the test statistics of two tests i,j. Note: we specified the correlation ρ between test statistics; however, the corresponding correlations between P-values are quite similar (details not shown). (3) (a) H_{0}: μ_{1} = … = μ_{k} = 0. (b) An H_{a} where all tests perform equally well: μ_{1} = … = μ_{k} = 2. We note that the approximate power of each individual test, i, under the alternative hypothesis (μ_{i} = 2) is equal to P(Z > z_{α} − μ_{i}) = P(Z > −0.355) = 0.64, where Z ~ Normal(0,1) at a significance level of 5% (z_{α} = 1.645) for a one-sided upper-tailed test, representing a moderately powered test. We also considered lower significance levels of 0.01, 0.001 and 0.0001, which yield individual test power of 37%, 14% and 4%, respectively.
After generating 10 000 multivariate normal random samples for each combination of simulation parameters, we computed the P-value of each test statistic, T_{i}, for each of the 10 000 samples, by finding 1 − Φ(T_{i}) where Φ() is the cumulative distribution function (CDF) of a standard normal distribution. We then applied Min(p) and Fisher’s methods to each set of P-values, with significance assessed by comparing alternative hypothesis values of Min(p) and Fisher’s statistics to the simulated distributions of these statistics under the null hypothesis. The power of each approach (Min(p) and Fisher’s) for each simulation setting is estimated by dividing the number of significant (α=0.05, 0.01, 0.001 or 0.0001) statistics by 10 000 (the number of independent samples). We then conducted a follow-up simulation in which we varied the number of tests, k (k=2, 4, 6, 10 and 20), fixed ρ_{i,j} = 0 between two tests i,j and then varied the number of tests for which μ_{i} = 2 from 1 to 10, with the remaining tests having μ_{i}=0. Full results from these simulations, which include observed correlations between P-values for all settings illustrating the approximately equivalent correlations between test statistics and P-values, are available in Supplementary Tables 1a–c.
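The individual-test power figures quoted above follow directly from the normal distribution and can be checked with a short calculation. The function name below is illustrative; the formula is the one-sided upper-tailed power P(Z > z_{α} − μ_{i}) stated in the text, assuming unit variance.

```python
from statistics import NormalDist


def single_test_power(mu: float, alpha: float) -> float:
    """Power of a one-sided upper-tailed z-test when the statistic is N(mu, 1)."""
    z = NormalDist()                 # standard normal
    z_alpha = z.inv_cdf(1 - alpha)   # critical value, e.g. 1.645 at alpha = 0.05
    return 1 - z.cdf(z_alpha - mu)   # P(Z > z_alpha - mu)


for alpha in (0.05, 0.01, 0.001, 0.0001):
    print(alpha, round(single_test_power(2.0, alpha), 2))
```

With μ_{i} = 2 this gives approximately 0.64, 0.37, 0.14 and 0.04 at the four significance levels, matching the values reported for the simulation.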
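A simplified version of this simulation can be sketched in Python (the study itself used R). The sketch makes two assumptions beyond the text: a single common correlation ρ ≥ 0 between all pairs of tests, generated through a shared random factor, and critical values taken as empirical quantiles of the simulated null distribution. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist
from typing import List, Sequence, Tuple

STD_NORMAL = NormalDist()


def draw_test_statistics(mu: Sequence[float], rho: float, n_samples: int,
                         rng: random.Random) -> List[List[float]]:
    """Draw vectors T with unit variances, means mu and common correlation rho."""
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    samples = []
    for _ in range(n_samples):
        shared = rng.gauss(0.0, 1.0)  # factor shared by all k tests
        samples.append([m + a * shared + b * rng.gauss(0.0, 1.0) for m in mu])
    return samples


def summary_statistics(samples: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Fisher's statistic and Min(p) for every simulated vector of test statistics."""
    fisher, min_p = [], []
    for t in samples:
        # One-sided upper-tailed P-values; the floor guards against log(0).
        p = [max(1.0 - STD_NORMAL.cdf(v), 1e-300) for v in t]
        fisher.append(-2.0 * sum(math.log(v) for v in p))
        min_p.append(min(p))
    return fisher, min_p


def estimate_power(mu: Sequence[float], rho: float, alpha: float = 0.05,
                   n_samples: int = 10000, seed: int = 1) -> Tuple[float, float]:
    """Return (Fisher power, Min(p) power) against a simulated null distribution."""
    rng = random.Random(seed)
    null = draw_test_statistics([0.0] * len(mu), rho, n_samples, rng)
    alt = draw_test_statistics(mu, rho, n_samples, rng)
    null_f, null_m = summary_statistics(null)
    alt_f, alt_m = summary_statistics(alt)
    f_crit = sorted(null_f)[int((1.0 - alpha) * n_samples)]  # upper-tail cutoff
    m_crit = sorted(null_m)[int(alpha * n_samples)]          # lower-tail cutoff
    fisher_power = sum(v > f_crit for v in alt_f) / n_samples
    min_p_power = sum(v < m_crit for v in alt_m) / n_samples
    return fisher_power, min_p_power


# k = 4 uncorrelated tests, all with mu_i = 2.
print(estimate_power([2.0] * 4, rho=0.0))
```

Changing `mu` to one entry of 2 followed by many zeros reproduces the follow-up setting with a single well-powered test among several underpowered ones.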
Simulation #2: investigating the behavior of combinations of gene-based rare variant tests across different genetic disease models
We simulated data to represent a variety of different genetic disease models. In all simulations, we considered a sample size of 2000 individuals split evenly between cases and controls. We then simulated data across all possible combinations of the following parameters: (1) number of single-nucleotide variants (SNVs) (32 or 64); (2) proportion of noncausal SNVs (0, ¼, ½, ¾, 7/8, 15/16, 31/32, 63/64, 1); (3) proportion of causal SNVs that increase disease risk (0, ¼, ½, ¾, 1), with the remaining causal SNVs decreasing disease risk; (4) relative risk of causal, risk-increasing SNVs (1.1, 1.5 and 2.0). To investigate impact on test performance in the presence of risk-reducing SNVs, some simulation settings included risk-reducing SNVs with relative risk 0.5. Furthermore, SNV minor allele frequencies were simulated in a three to one ratio of less common (0.1% population minor allele frequency) to more common (1% minor allele frequency) SNVs spread evenly across all noncausal and causal SNVs. We note that when the number of SNVs is not divisible by 4, a single 1% minor allele frequency SNV is assigned before generating up to three additional 0.1% minor allele frequency SNVs. Thus, there were a total of 2 (number of SNVs) × 9 (proportion of noncausal) × 5 (proportion of risk-increasing SNVs) × 3 (relative risk of risk-increasing SNVs) = 270 possible simulation settings. However, some of the combinations are redundant or impossible; removing these cases yields 197 total simulation settings considered in our analysis.
Five hundred samples were generated at each simulation setting, with each of the 11 individual tests and each of the eight combined tests applied to each sample, and separate P-values for Min(p) (permutation P-value) and Fisher’s for each combined test. Empirical power estimates are computed as the percentage of P-values <0.05 (nominal alpha), giving power estimates within approximately 4.4 percentage points of the true power 95% of the time (the worst-case margin of error for a proportion estimated from 500 samples). For the Bonferroni testing approach, we deem the test significant if at least one of the individual test P-values in the set is below the Bonferroni-corrected alpha value of 0.05/k. Where needed, 500 permutations were used to assess statistical significance for individual and combined tests.
To further explore test performance at significance levels commonly used in practice, additional simulations were conducted. In particular, 16 of the settings described above were investigated using 50 000 permutations at significance levels of 10^{−4}, 10^{−3} and 10^{−2}. Fourteen of these settings represented situations in which causal variants were present (32 total single-nucleotide polymorphisms (SNPs) with 1, 2 or 4 causal variants; 64 total SNPs with 1, 2, 4 or 8 causal variants), where all causal variants have RR=2 (7 cases) or 3 (7 cases); 200 simulations were conducted at each setting. Two settings represented situations in which no causal variants were present (32 total SNPs and 64 total SNPs), and used 840 and 460 total simulations at each setting, respectively.
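The precision of an empirical power estimate based on 500 simulated samples can be gauged with the usual normal-approximation margin of error for a proportion. The sketch below is a generic calculation, not taken from the paper's software; the function name is hypothetical, and the worst case (true power of 0.5) is assumed by default.

```python
import math
from statistics import NormalDist


def power_margin(n_samples: int, power: float = 0.5, confidence: float = 0.95) -> float:
    """Normal-approximation margin of error for an empirical power estimate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    return z * math.sqrt(power * (1.0 - power) / n_samples)


print(round(power_margin(500), 4))             # worst case, true power 0.5
print(round(power_margin(500, power=0.9), 4))  # tighter when power is near 0 or 1
```

The margin shrinks with the square root of the number of samples, so halving it would require roughly four times as many simulated data sets per setting.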
Application
As a proof of concept, we applied select gene-based tests to data from Genetic Analysis Workshop 17. The data consist of real genotype data (from the 1000 Genomes Project consortium) on which a disease phenotype was simulated.^{10} We considered 25 genes which were known to contain causal variants for the simulated disease phenotype and showed variation in the sample of n=321 unrelated Asian subjects. Given the small sample size and low power in this data set,^{5} final disease status for each of the 321 individuals was averaged across 200 independent phenotype simulations, with individuals who were diseased in at least 100 of the 200 independent simulations identified as ‘diseased,’ and the rest not. As has been done previously,^{5} we used a significance level of 0.05 for this analysis.
Results
General patterns in the performance of Min(p) and Fisher’s methods (Simulation #1)
We start by exploring the general behavior of Min(p) and Fisher’s method across a generic set of k tests, with different correlation structure and test performance (Simulation #1 described earlier). The goal of this analysis is to provide an intuitive sense of how the number of tests, correlation between tests and individual test performance are related to the performance of Min(p) and Fisher’s method in a well-understood environment. Detailed simulation results are provided in Supplementary Tables 1a–c. Supplementary Table 1a illustrates that the type I error rate is controlled across all simulation settings and significance levels.
When all tests are powerful
When all tests being combined have good power (64% at α=0.05), both the Fisher’s and Min(p) approaches yield increased power as the number of tests being combined increases. However, Fisher’s method tends to outperform Min(p), with the magnitude of the power gain for Fisher’s relative to Min(p) decreasing as the correlation between tests increases, and the power of combined, highly correlated tests equal to the power of a single test – ~64% (see Supplementary Table 1b and Figure 1). In situations where all tests are powerful, Min(p) ignores the power from all the tests but one, forgoing the opportunity to improve the power by combining tests and yielding lower power overall as compared to Fisher’s approach. Similar results are observed for other significance levels.
When some tests are powerful
When we varied the number of powerful (good) tests (power=64% at α = 0.05) and underpowered (bad) tests (power=5%=type I error rate) we found that Min(p) outperforms Fisher’s if there is only one good test in the set, with the magnitude of improvement increasing as the number of bad tests increases (for example, see Figure 2; similar results are observed for other significance levels, see Supplementary Table 1c). When there are two good tests in the set, Fisher’s does better when there are few bad tests, but as more and more bad tests are added to the set, Min(p) gains an advantage over Fisher’s. In general, Min(p) outperforms Fisher’s when the proportion of bad tests in the set is large. The impact of correlation between tests on these relationships can be inferred from the previous section.
Performance of combined tests on simulated phenotype–genotype data (Simulation #2)
Type I error simulation
The type I error simulation showed general control of the type I error rate across all individual tests and combined tests considered here, with the lone exception being the Bonferroni method, which was, as expected, often conservative. Detailed type I error simulation results are in Supplementary Tables 2a and b. Additional simulations at lower significance levels (1 × 10^{−2}, 1 × 10^{−3} and 1 × 10^{−4}) also showed control of the type I error rate in all cases (detailed results not shown).
Min(p) beats Bonferroni every time
Across the 197 simulation settings and eight combined tests (1576 possibilities; see Supplementary Table 3), as well as all follow-up simulations at lower significance levels, there were only 10 times where the power of the Bonferroni approach exceeded the power of the Min(p) approach, and then only minimally (by 0.002 to 0.004), well within the range of expected variation due to simulation. Thus, we conclude that the Bonferroni approach offers no power advantage over Min(p), and we do not consider the Bonferroni approach in subsequent analyses.
Improving a combined test with additional tests
We explored eight different combined tests. Rationale and summaries of performance are provided in Table 1. In general, the results of the second simulation study confirmed results of the first simulation study with regard to the use of Min(p) or Fisher’s and how many tests to combine. In short, (1) combining tests that are powerful in different situations will generally be advantageous (eg, CT6, CT7 and CT8), (2) Min(p) outperforms Fisher’s combining method when there is a mix of powerful and non-powerful tests being combined (eg, CT5, CT6, CT7) and (3) combining highly correlated tests has little benefit (eg, CT2, CT3, CT4). These results held true even at lower significance levels (see Supplementary Table 4).
Robust test statistic
As shown in Table 1, CT8 yielded the best overall performance, with the Fisher’s method performing slightly better than the Min(p) method across all simulation settings; CT6 and CT7 also performed quite well. Across the 197 simulation settings, CT8 (combination of SKAT-O and J(∞)) yielded power no more than 5% smaller than SKAT-O power in 87.3% (Fisher’s; 172/197) and 83.2% (Min(p); 164/197) of simulation settings. The power of CT8 was never more than 10% below SKAT-O power. However, the combined test was sometimes substantially better than SKAT-O, as shown in Table 2. In particular, since J(∞) is robust to the inclusion of high proportions of noncausal variants, CT8 is more robust to the inclusion of noncausal variants than SKAT-O alone. J(∞), however, performs more poorly than SKAT-O and most other tests when the proportion of causal variants in a gene is moderate (see Supplementary Table 3, which provides the full results for all simulation settings, for details). Finally, Figures 3 and 4 illustrate the performance of the methods at a low significance level, showing similar results at a relative risk of 2. We note that the power is not very high in this case. Supplementary Figures 2 and 3 illustrate the same performance using a relative risk of 3, yielding larger power.
For CT8, the performance of the Fisher’s combination approach was generally better than that of the Min(p) approach, as shown in Tables 1 and 2. In a head-to-head comparison, the Fisher’s approach yielded better power than the Min(p) approach in more than twice as many simulation settings (119 vs 45), though the power gains were only modestly larger (average power gain 1.8 vs 1%), with a maximum power difference of only 5.2%. Table 2 also illustrates the relatively good performance of CT6 and CT7 in this subset of simulation settings.
Application to data from Genetic Analysis Workshop 17
The P-values for four tests (SKAT-O, J(∞), and both the Fisher’s and Min(p) versions of CT8), which were applied to 25 genes containing at least one causal variant, are provided in Supplementary Table 5. Six genes are significant (P<0.05) using SKAT-O alone and four genes are significant using J(∞) alone (three genes are significant using both approaches), for a total of seven genes identified by at least one of the two individual testing methods. The Min(p) version of CT8 identified all seven of the genes as significant and Fisher’s identified five of the seven as significant, while the remaining two were borderline significant (P<0.07), demonstrating that the combined methods are robust. In particular, we note that the PIK3C3 gene was significant using the J(∞) approach (P=0.035), but not SKAT-O (P=0.056), and was significant for both combined tests (Min(p) P-value=0.041, Fisher’s P-value=0.035).
Software
Software written for R^{9} is available for free download on the research group’s software page (http://www.dordt.edu/academics/programs/math/statgen/software.shtml). All individual and combined tests considered here are included.
Discussion
We have proposed a general and flexible method for combining different rare variant tests of association to potentially improve robustness across a wide range of genetic architectures while minimizing power loss through the addition of multiple tests. A naïve approach to combining tests is to use a Bonferroni correction after applying multiple different rare variant tests to the same data. However, Bonferroni is often conservative, especially when tests being combined are correlated, and we demonstrated that the Min(p) approach is always more powerful because it empirically estimates the appropriate correlation structure. Thus, in practice, researchers should never run multiple (k>1) gene-based tests on the same data set and then apply a stricter Bonferroni correction strategy (α/(k × genes)) to their data set. The Min(p) approach proposed here will always be more powerful than such an approach.
We also showed that while the Min(p) approach is sometimes optimal, the Fisher’s method offers advantages over Min(p) in some cases because it combines separate signals into a combined signal when tests are well-powered and the correlation between tests is low. However, we have shown that when combining tests with lower power, Min(p) improves to the point of being better than Fisher’s method in some cases. In short, Min(p) ignores the ‘noise’ of low-powered tests, while Fisher’s averages low-powered tests into the signal. Furthermore, as the correlation between well-powered tests increases, Min(p) also gains power relative to Fisher’s. Ultimately, the answer to whether Min(p) or Fisher’s provides more power is dependent upon the underlying power and correlation structure of the tests being combined. However, combining highly correlated tests is not advantageous either. The most benefit is obtained by combining disparate tests, as we illustrated by combining J(∞) with SKAT-O, to yield a more robust and powerful test. Across simulation settings considered here the Fisher’s approach for the SKAT-O/J(∞) combined test was somewhat more robust than the Min(p) approach and so is recommended for use in practice.
More broadly than either Min(p) or Fisher’s, our method is flexible enough to consider any of the numerous other choices for S_{k}, which is simply a function of the vector of P-values from the k tests being combined, p = (p_{1}, …, p_{k}). We have focused on Fisher’s and Min(p) because they represent two extreme approaches: Fisher’s aggregates all of the P-values (through the sum of their logarithms), and Min(p) only uses a single value from the vector. Furthermore, both approaches are popular since, when tests are independent, each has fairly well-understood asymptotic properties. More research is needed to explore additional possibilities. We note that while we restricted our analysis to case–control study designs, the results are directly applicable to results for quantitative traits.
A key advantage to the combined testing approach comes when evaluating multiple genes and/or multiple phenotypes. In these cases, a priori, there may be little information about which individual test is most powerful given the wide range of potential genetic architectures. The best test strategy will be one which provides an optimal trade-off of power loss and robustness. Namely, for any particular genetic architecture, an individual test can be constructed with better power than any combined test. However, individual tests may be powerful against only a small set of genetic architectures. Thus, a combined test may trade off (vs an individual test) small amounts of power against some genetic architectures for large improvements in power versus other genetic architectures.
One area of application we have explored is the use of our approach with gene-based rare variant tests that use thresholds (eg, CMC^{3} which thresholds on minor allele frequency, or the odds ratio weighted sum statistic^{11} with thresholds on empirical odds ratio) to generate variable threshold tests in a straightforward manner. In short, simply combine the same test across multiple thresholds to yield an optimally robust test (detailed results not shown).
With this in mind, how should a researcher utilize the combined tests in practice? Prior work^{5, 6, 7} has shown that combined tests can be considered ‘optimal;’ however, these approaches have been limited to combining L(1) and J(2) tests. In this paper we have shown that combining other disparate tests can be advantageous (eg, combining SKAT-O, itself a combination of L(1) and J(2), with J(∞)). For example, we showed that the inclusion of a higher norm test can provide increased robustness to the inclusion of noncausal variants. In practice, we recommend including J(∞) in a combined test with L(1) and J(2) (eg, SKAT-O with J(∞)) to maximize robustness to the inclusion of noncausal variants in cases where little prior knowledge exists to prioritize potential causal SNPs and/or it is anticipated that a high proportion of SNPs included in the test may be noncausal. However, further analysis of simulated data with larger sample sizes, additional variation in causal variant risk distribution, etc, and which builds on our analysis of real genotype data from Genetic Analysis Workshop 17, is warranted. This exploration is especially needed given recent results yielding moderately sized relative risks, even for rare variants, in practice.
Conclusions
Combined testing approaches offer a general and appealing alternative to individual, gene-based rare variant tests of association, which may be optimized only for particular genetic architectures. We have demonstrated that the loss of power from the addition of one or two disparate tests may be offset by improved power for a wider range of genetic architectures. We also identified a particular combined test with good properties. As additional novel rare variant tests are developed, they should be evaluated for possible combination with existing tests to yield maximally robust testing approaches.
References
 1
Liu K, Fast S, Zawistowski M, Tintle NL : A geometric framework for evaluating rare variant tests of association. Genet Epidemiol 2013; 37: 345–357.
 2
Lee S, Abecasis GR, Boehnke M, Lin X : Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet 2014; 95: 5–23.
 3
Li B, Leal SM : Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 2008; 83: 311–321.
 4
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X : Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011; 89: 82–93.
 5
Derkach A, Lawless JF, Sun L : Robust and powerful tests for rare variants using Fisher’s method to combine evidence of association from two or more complementary tests. Genet Epidemiol 2013; 37: 110–121.
 6
Lee S, Emond MJ, Bamshad MJ et al: Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet 2012; 91: 224–237.
 7
Lee S, Wu MC, Lin X : Optimal tests for rare variant effects in sequencing association studies. Biostatistics 2012; 13: 762–775.
 8
Sun J, Zheng Y, Hsu L : A unified mixed-effects model for rare-variant association in sequencing studies. Genet Epidemiol 2013; 37: 334–344.
 9
R, www.r-project.org, 2013.
 10
Almasy L, Dyer TD, Peralta JM et al: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc 2011; 5: S2.
 11
Feng T, Elston RC, Zhu X : Detecting rare and common variants for complex traits: sibpair and odds ratio weighted sum statistics (SPWSS, ORWSS). Genet Epidemiol 2011; 35: 398–409.
Acknowledgements
This work was funded by the National Human Genome Research Institute (R15HG006915). We acknowledge the use of the Hope College parallel computing cluster for assistance in data analysis. We also acknowledge funding of Genetic Analysis Workshop 17 (NIH R01 GM031575), and the preparation of the Simulated Exome Data Set, which was supported in part by NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project.
Ethics declarations
Competing interests
The authors declare no conflict of interest.
Additional information
Supplementary Information accompanies this paper on the European Journal of Human Genetics website.