Introduction

The search for rare variants associated with common diseases, and traits in general, has already started.1, 2, 3, 4, 5 As these variants are rare, most studies will be inadequately powered to detect an association with any single variant.6, 7, 8 When only a handful of minor alleles are observed for any single-nucleotide variant (SNV), obtaining statistical significance, especially at traditional genome wide levels of 10−8, can be near impossible. Therefore, instead of trying to identify associations with individual variants, the goal has been to identify associations with a group of rare variants in a shared region (eg, exons, genes) or pathway, effectively increasing power by pooling information across SNVs.9 Numerous statistical tests that can search for these regional associations have already been introduced, developed, and compared.7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19

Our three goals in this paper are to (1) Unify (2) Identify, and (3) Modify association tests for rare variants. First, we introduce a simple statistical framework and show that the majority of rare-variant association tests can be reformulated within this framework. Second, we show that within this framework, we can easily identify the relationship between a statistic’s performance and the genetic characteristics of the tested SNVs, such as the proportion of SNVs associated with an outcome, direction of effects, and the relationship between effect size and MAF. Third, we show that the standard test statistics can be further tailored to the specifics of a given study or an investigator’s prior beliefs.

To achieve our objective of unification, we first revisit the standard association test for a single uncommon SNV. The standard approach would be to divide study participants into two groups, those with and without the minor allele, and then measure the difference in the average phenotype between those two groups. This approach applies equally for continuous and dichotomous variables (eg, disease), where for the latter case, we would measure the difference in disease prevalence. The resulting difference, at least for many scenarios, completely captures the available information for detecting an association. When testing a group of SNVs, the relevant information from each SNV is still only this difference, and, therefore, joint tests only vary by how they combine this information.

Most test statistics combine these differences in a very specific way. For describing the unified framework, let individual i be in the study, Yi be the phenotype (eg, weight, height, and disease status) and let Gij be the number of rare variants at SNV j. Therefore, for SNV j, we can calculate the difference between the average of the phenotype values in subjects with a minor allele, , and the average in subjects without a minor allele, .

The majority of statistics proposed for testing associations are a weighted sum of the squared differences, , and their cross-products, . As the weights are allowed to depend on Δ, all statistics cannot necessarily be formulated as a second-degree polynomial. We refer to this broader class of statistics as Data-Adaptively Weighted Generalized Score Statistics (DAWGSS, pronounced dogs). The full definition of DAWGSS, which can accommodate covariates and population stratification, will be provided later.

Standard rare-variant tests, such as the Sum test,11 Hotelling’s T2-test,20 Stouffer’s Z-test,15 Data Adaptive Sum test,21 C-alpha,16 similarity regression,17 variance components,17 CMC,22 and SKAT,23 to name a few, only vary by their chosen set of weights. We specify five global features of the weights that vary among common test statistics. We then identify the properties of these global features that are desirable, or provide high power, when genetic variants have certain behaviors. For example, one feature is the relationship between the weights and the magnitude of Δj. We show that the power of association tests can be increased by shrinking the weights when Δj is small if only a minority of rare variants are truly associated with the outcome. By observing this connection, we develop a new test statistic, ROVER, that performs well in such scenarios.

The idea of a general framework for rare-variant tests has been proposed previously.12, 13 Our objective is to propose a similar framework, but one where a rare-variant test can be recast as a function of the single SNV test statistics. This framework illuminates similarities with classical statistical tests, offers a clear means for performing meta-analyses when studies genotype different sets of SNVs, and facilitates interpretable modifications.

We use these test statistics to better understand the association between bladder cancer and the UGT1A gene locus on chromosome 2q37. UGTs facilitate cellular detoxification of multiple exogenous and endogenous substrates.24 Specifically, UGTs participate in the removal of aromatic amines, which are the main risk factors for bladder cancer found in tobacco smoke and industrial chemicals.24 This locus has been associated with colorectal cancer,25, 26 pancreatic cancer,27 liver cancer,28 and most recently bladder cancer.29 Following-up the GWAS hit, rs1189203, for bladder cancer, we performed targeted resequencing to identify variants within the UGT1A region and then a focused association study in 9319 individuals.30 After the study identified one variant that was highly associated with bladder cancer, we now use ROVER and the other test statistics described here to determine if the remaining variants, many with low MAF, were enriched for associations.

In the methods section, we provide a more complete definition of DAWGSS and provide the details about our simulations and the study of bladder cancer. In the results section, we compare the performance of these statistics for our two types of data. Importantly, we compare the statistics within their DAWGSS framework so we can demonstrate how the choice of weights determines the properties of the test statistics. Note, the common framework allows rapid computation of multiple-test statistics simultaneously, enabling our broad comparison. Our software, CRaVe, is available as both a stand-alone Unix program and an R function. In the discussion section, we summarize our conclusions. The Supplementary material shows how to reformulate many commonly used test statistics in their DAWGSS format.

Methods

DAWGSS: definition

We let n be the number of subjects, Yi be the normalized phenotype , and Gij be the number of rare variants at SNV j, j{1,…,J}. We denote the vector of all genotypes by Gi=[Gi1 Gi2GiJ]t.

We then define the genetic covariance, V, and correlation, σ, matrices from a group, S, containing nS individuals by

where , and D(V) is the matrix containing only the diagonal elements of V, with the off-diagonal elements set to 0. We will denote the element in row j1, column j2 of V and σ by , respectively. Furthermore, let σ−1 be the inverse of σ so that σσ−1=I, where I is the J × J identity matrix.

Our definition of Δj in the introduction was slightly simplified, in that it only allowed two genotypes. We redefine it here as the score statistic (or, equivalently, as the correlation between Y and G·j multiplied by a normalizing factor of √n),

and let Δ=[Δ1 Δ2…ΔJ]t. Note, for any SNV, Δj is asymptotically distributed as a normal variable with mean 0 and variance 1 under the null hypothesis.

As stated in the introduction, DAWGSS have the form

When there are covariates, we extend the definition of Δj as follows. Let us expand our notation. For subject i, let be the outcome and Xi=[Xi1 Xi2XiT]t be a set of T-covariates. We estimate by either linear or logistic regression as appropriate. We define . Usually the values of are defined so and can be ignored.

We can then use this redefined Yi in equation (4).

DAWGSS: global features

Within the DAWGSS framework, most tests can be distinguished by five global features.

  1. 1

    The magnitude of for uncorrelated SNVs (ie, are cross-product terms included when SNVs are in linkage equilibrium?)

  2. 2

    The relationship between wj and minor allele frequency (ie, are the contributions of SNVs weighted by their MAF?)

  3. 3

    The relationship between the weights and the signs of Δ (ie, is preference given to variants that appear harmful?)

  4. 4

    The relationship between wj and Δj (ie, are small Δjs shrunk on the presumption that most SNVs have no association?)

  5. 5

    The relationship between and linkage disequilibrium (ie, are independent and correlated SNVs treated equally?)

The results section will discuss how the behavior of rare variants determines the desirable properties of features 1–4. Feature 5 is listed here for completeness, but will only be addressed briefly in the discussion.

Statistics

We compare the performance of multiple-test statistics, including

  • SUM: Sum test11

  • HOI: Hotelling’s T2-test assuming independence20

  • STZ: Stouffer’s Z-test15

  • MDF: MultiDegree of Freedom test16, 17

These four tests are fully defined in Table 1. Note that HOI sets σ to be the identity matrix, allowing for more direct comparisons with the other tests, which omit any reference to σ. We focus on these four tests because they highlight the importance of the first two global features and because they are the most ubiquitous, noting that the recommended versions of tests using C-alpha, similarity regression, variance components, and kernels are all equivalent to MDF. However, for the larger simulations and bladder cancer data, we show the results for six other tests that can be described within the DAWGSS framework.

  • HOT: Hotelling’s T2-test

Table 1 Four test statistics – MultiDegree of Freedom Test (MDF), modified Hotelling’s T-test (HOI), Stouffer’s Z-Test (STZ), and the Sum test differ only by whether the statistics include cross-product terms, , and whether the weights incorporate √V, where Vjj is the variance of the genotype (ie, MAF(1−MAF))

HOT accounts for the correlation between SNVs,

  • STZ+: Positive Stouffer’s Z-test

  • MDF+: Positive MultiDegree of Freedom test

STZ+31 and MDF+ tests modify their original test statistics by including only those variants with a positive Δj. We define the binary function, 1(·) by 1(·)=1 if the enclosed statement is true, 0 otherwise.

  • THR: Threshold test

THR is designed to reflect the statistic from Hoffmann et al,12 only including variants with . In practice, α=0.05.

  • DAS: Data Adaptive Sum test

DAS, defined by Han and Pan,21 also requires the definition of a threshold, which we similarly set at α0=0.05.

where

  • ROV: Rover

ROVER shrinks the weights for SNVs with low signal

Table 2 summarizes the global features for each of these ten statistics, with Supplementary tables summarizing additional statistics.

Table 2 Summary of the global features

Simulations

Independence

We simulated genes from a case/control study with a total of n=2000 individuals, equally divided between the two groups, under the null and multiple alternative hypotheses. For each simulated gene, containing 40-independent SNVs with MAF equally spaced between 0.005 and 0.02, we calculated values for six-test statistics: SUM, HOI, STZ, MDF, MDF+, and ROV. Genes simulated under the null hypothesis were used to estimate the threshold for rejection, given a particular significance level, α. Power was then estimated as the proportion of genes simulated under an alternative, which exceeded that threshold. Under the alternative, the relationship between the probability of disease and genotype was defined by

The total number of influential SNVs , the proportion of influential SNVs that reduced risk , and the log odds ratio (βj) were varied. The values of βj, constant across all SNVs, were chosen by one of two methods. In the main text, we primarily discuss the example where ). In the Supplementary material, we discuss different effect sizes.

Linkage disequilibrium

We simulated genes from a case/control study with a total of 2000 or 10 000 individuals under the null and alternative hypotheses. Here, haplotypes were chosen to mimic those observed in the 1000 genomes project,32 and each gene contained 80 SNVs with MAF between 0.005 and 0.10. The MAF was increased because most SNVs with MAF<0.02 appeared independent of each other, while the number of SNVs was increased because, in the presence of linkage disequilibrium (LD), one influential SNV creates multiple-associated SNVs. For each alternative hypothesis, power was defined as the proportion of simulated genes with a permutation-based P-value below 10−3. Null simulations were only used to confirm that α-levels were correctly calibrated. Under the alternative, the relationship between the probability of disease and genotype was again defined by equation (15).

For the alternative hypotheses, simulations varied by the following parameters. (1) The number of subjects (n). (2) The number of influential SNVs (NI). (3) The relationship between MAF and βj. Either βj was constant for all associated variants or inversely proportional to MAF. (4) Effect direction: either 50% or all of the associated variants increased risk . (5) The magnitude of the effect size. Although the odds ratios were generally set so the power of the MDF test was ≈0.4–0.5, a ‘large’ effect size, where the power of the MDF test was inflated to ≈0.8, was also examined. Details about the simulation process are available in the Supplementary material.

Bladder cancer study

The association studies have been described elsewhere, so we only offer a brief summary here.27, 30 The original GWAS contained 3532 cases and 5120 controls of self-described European descent.29 Among the 591 637 SNVs passing quality-control, 166 SNVs were within the UGT1A region, defined as the 158 Kb of the UGT1A cluster +/100 Kb (chr2:234 091 000–234 447 000, hg18). A promising association between SNV rs11892031 in the cluster and bladder cancer (P=7.7 × 10−5) suggested additional examination. In the first step of the follow-up study, we generated highly specific long-range amplicons and sequenced the alternative first exons in each of the UGT1A genes in 44 bladder cancer cases and 30 trios from the HapMap CEU set, detecting 43 known exonic SNVs. In the second step, we selected 18 SNVs, based on LD and functional annotation, and genotyped these in a set of 1055 cases and 962 controls from the Spanish Bladder Cancer Study (SBCS), a component of the original GWAS. In step 3, based on the SBCS data enriched across the region, we imputed these additional exonic variants for the remaining samples from the stage 1 GWAS (2477 cases/4158 controls). The final data set contained 49 exonic SNVs in 3532 cases and 5120 controls. Using imputation based on the combined reference panels of HapMap 3 CEU and 1000 Genomes data, an extended data set was created with 1170 SNVs in the same group of individuals.

We performed the previously described association tests on subgroups of the 49 and 1170 SNVs, including only SNVs with a MAF below 0.1. As rs17863783 is significant by itself, we repeated these tests on the remaining SNVs after adjusting for rs17863783, using a simplified approach of treating that SNV like a covariate. All analyses were also repeated after adjusting for the covariates age, gender, smoking status, and study center.

Results

Simulations

We first compared the power to detect associations between disease status and simulated genes containing either 40-independent or 80-dependent SNVs.

(1) The magnitude of for uncorrelated SNVs

Statistics that include cross-product terms are more advantageous when a large proportion of SNVs are associated with the outcome. In our simulations assuming independence, the power for the SUM test, which includes cross-product terms, exceeds that for the MDF test when at least 15 of the 40 SNVs influence disease risk (Figure 1). If we increase β in equation (15), then that intersection point, when the study power becomes higher for the SUM test, increases slightly (Supplementary material). If we increase the total number of SNVs beyond 40, the absolute number of influential SNVs needed for SUM to have higher power will increase, but the proportion of SNVs, relative to the total number, decreases (Figure 3).

Figure 1
figure 1

The power (y-axis) to detect an association between a gene with 40-independent SNPs and a disease is illustrated for four different test statistics (ROV, HOI, MDF, SUM), as a function of the number (x-axis) of those SNPs that increase disease risk.

Our simulations where SNVs within a gene are in linkage disequilibrium show that the SUM test becomes the more powerful test at a comparatively smaller number of influential SNVs. Table 3 shows that the advantage for those tests without cross-product terms is only slight when 4 out of the 80 SNVs are influential. Because of linkage disequilibrium, the number of SNVs associated with the disease is considerably larger than four. Even after varying other parameters, such as the number of subjects in the study, the odds ratio for the disease alleles, and the relationship between MAF and the magnitude of the odds ratio, these general trends still hold.

Table 3 The statistical tests were compared when 4 and 16 out of the 80-dependent SNVs were influential

Until now, all SNVs have increased disease risk. In simulations where the strength of the association is identical among all influential SNVs, but 50% reduce disease risk, the two tests that include cross-product terms, SUM and STZ, will have nearly no power to detect an association. The cost can be understood by either noting that the cross-product terms, , are negative when the effects go in opposite directions, thereby reducing the test statistic, or by rewriting the test statistic as the square of a variable with an expected value of 0 (eg, STZ, ().

(2) The relationship between wj and MAF

The relationship between the weights and MAF is mediated by √Vjj, where Vjj is the variance of the genotype at SNV j and is proportional to MAF (1-MAF). Clearly, compared with the STZ and HOI statistics, the statistics that include √Vjj are upweighting SNVs with a higher MAF. In our simulations where SNVs are independent, the HOI statistic, which omits √Vjj from the weights, slightly outperforms the MDF test after adding a small stabilizing constant, 0.005, to the denominator of Δj (Figure 1). Whereas the magnitude of Δ is effectively independent of MAF, multiplying by √Vjj makes the contributions proportional to the number of rare variants: (see equation (4)). In our simulations where SNVs are in linkage disequilibrium, the two statistics, HOI and MDF, performed nearly identically. When we allow the odds ratio (ie βj in equation (15)) to be inversely proportional to MAF, reflecting the hypothesis that low MAF SNVs will have larger effects, the loss of power by including Vjj must, therefore, be larger (Table 3).

To explore the data-adaptive weights, we consider statistics introduced more recently. The STZ+31 and MDF+ tests modify their original test statistics by including only those variants with a positive Δj. These statistics were designed to gain power when mutations increase disease risk. ROV includes weights, wj, that decrease with , and was designed to gain power when only a small proportion of SNVs are influential.

(3) The relationship between the weights and the signs of Δ

All simulations show that without exception, STZ+ and MDF+ must outperform their all-inclusive counterparts when all SNVs increase the risk of disease (Figure 2, Table 3). Figure 2 shows, in our example where 10 out of 40-independent SNVs are influential, that the power for STZ+ and MDF+ decrease quickly as the proportion of SNVs, which reduce risk increases. If three SNVs reduce disease risk, then MDF and MDF+ statistics perform similarly, whereas if more than three SNVs reduce risk, the MDF test outperforms MDF+. Comparing STZ and STZ+ tests confirms that, in general, STZ+ will outperform STZ because as we discussed in section 1, STZ already performs poorly when SNVs have opposing effects. Clearly, when all variants are protective, the power for the MDF+ and STZ+ tests are no larger than the α-level. Table 3 shows that trends observed in the independent simulations still hold when there is LD. Although the MDF+ and STZ+ have the highest, or nearly the highest, power when all 4 or 16 influential SNVs increase risk, the power for either the MDF+ or STZ+ is only ≈10–20% higher than their counterparts. Table 3 further shows that when half of the SNVs reduce risk, their power is 50% of that for the MDF test.

Figure 2
figure 2

The power (y-axis) to detect an association between a gene with 40-independent SNPs and a disease is illustrated for four different test statistics (MDF+, MDF, STZ+, STZ), as a function of the proportion (x-axis) of the 10-associated SNPs that reduce disease risk. The remaining associated SNPs increase risk.

(4) The relationship between wj and Δj

In our simulations with 40-independent SNVs, Figure 1 shows that ROV outperformed all other statistics so long as fewer than 10 of those SNVs were influential. The exact intersection depends on odds ratios of the influential SNVs (see Supplementary material). In our simulations with linkage disequilibrium, ROV only outperforms other statistics when <4 out of the 80 SNVs are influential (Table 3). With linkage disequilibrium, non-influential SNVs will still be associated with the outcome and shrinkage is less helpful (Table 3). Table 3, however, shows that ROV, a form of soft-thresholding, may still slightly outperform methods that remove all Δ with a P-value below a hard threshold of 0.05 (ie, DAS and THR).

Bladder cancer

Among the sets of 49 exonic and 1170 total SNVs covering the UGT1A region, 24 and 556 had MAF below 0.1. In the fine-mapping study, which looked only at SNVs individually, the variant at rs17863783 was found to decrease the risk of bladder cancer (P=5 × 10−7). Table 4 shows that had we not examined each SNV individually and only examined the region as a whole, no test would have found the regional association to be statistically significant after adjusting for testing 20 000 genes and using a significance threshold of 5 × 10−7=0.01/20 000. However, among all tests, ROV, HOI, and HOT resulted in the lowest P-values. Impressively, these statistics generally performed better when the 556 SNVs were examined jointly, as opposed to only 24.

Table 4 Testing the association between the UGT1A region and bladder cancer with multiple DAWGSS

The purpose of the joint examination was to determine whether the remaining SNVs, as a group, show an association with bladder cancer. First, all tests still suggested a possible association after removing rs17863783, because surrounding SNVs were in LD with rs17863783 (Table 4). Therefore, only a test statistic that can condition on the associated SNV can be used for this analysis. Even after conditioning on an individual’s genotype at rs17863783, a few tests generally suggested an association, with, for example, the HOT and MDF having P-values of 0.0004 and 0.073. Here, after identifying a single gene, adjusting for multiple comparisons is unnecessary. However, by adjusting for the study center, a surrogate for population structure, tests suggested that there was no additional association between the UGT1A cluster and bladder cancer, with the HOT and MDF now having P-values of 0.58 and 0.11. Therefore, in this region, which is highly associated with multiple cancers, only a single SNV appears to directly influence the risk of bladder cancer.

Discussion

We examined a class of statistics, DAWGSS, that test for an association between a group of genetic variants and a phenotype and show that the majority of statistics that are currently available for testing associations with groups of variants, despite the diversity of their original presentations, can be rewritten as DAWGSS. Even the linear kernel methods, which generalize a large subset of statistics,23 are, in turn, generalized by DAWGSS when we restrict to their preferred metric of IBS. Within this shared framework, it is clear that the differences among these statistics are wholly encompassed by the weights multiplying the terms in the sum from equation (5). In the case of independent SNVs, the four classical and most ubiquitous statistics (STZ, SUM, HOI, MDF) differ only by whether they include cross-product terms and/or incorporate the SD, √V, of the number of variants into their weights. When comparing a broader range of statistics, we found five global features that distinguish many of the known tests. Furthermore, we demonstrated that the desirable properties for these features depend on the behavior of rare variants.

This paper was focused on showing the similarity of rare-variant test statistics, and in the Supplementary material, we show how specific tests can be reformulated as DAWGSS. However, our discussions are not intended to be inclusive. Foremost, DAWGSS are limited to describing statistics that have an additive effect across the SNVs. Therefore, rank-based approaches,7 methods that allow for interactions,6 methods with non-additive effects,18 and methods that compare all subsets of SNVs33 are outside of the DAWGSS framework. However, we believe these limitations have minimal practical implications. Currently, studies are underpowered to detect most interactions, especially among rare variants, and, if desired, DAWGSS can easily be extended to account for interaction terms. Moreover, among methods based on distance matrices, such as similarity regression, kernel-based approaches, and variance components, the additive model outperformed other options.17, 34

The association between bladder cancer and the UGT1A region appears to be caused by a single genetic variant. Considering the importance of the UGT family in a number of cancers, we expected that variants, which affected one cancer would also affect others, even if to a lesser extent, resulting in multiple variants associated with disease risk. However, in our tests of association, there was no evidence to support such a hypothesis as there was no regional association after adjusting for rs17863783.

The DAWGSS framework offers a practical means for performing association tests. To facilite its use, we have provided the software, CRaVe, on the author’s website that can efficiently perform all tests that can be described with the DAWGSS framework. The program inputs standard data formats (eg, vcf, tped) and can accomodate covariates and bioinformatic weights. CRaVe can perform all tests defined within this manuscript and allows the user to new tests within this framework. We are currently using this software to study the fifth global feature, the relationship between and LD, and aim to provide results evaluating tests, such as HOT, that adjust for LD in the near future.

Figure 3
figure 3

The lines illustrate the number of SNVs that need to be influential in order for the SUM test to have higher power than the MDF test, as a function of the total number of independent SNVs in the tested region. Results are based on simulations where SNVs have MAF equally spaced between 0.005 and 0.02, and the influential SNVs are randomly distributed, have equal effect size, and increase risk. The effect size (β in equation (15)) was chosen so the MDF test had a specified power: 0.2 (brown), 0.5 (red), or 0.8 (orange).