Statistical tests for detecting associations with groups of genetic variants: generalization, evaluation, and implementation

Ferguson, John; Wheeler, William; Fu, YiPing; Prokunina-Olsson, Ludmila; Zhao, Hongyu; Sampson, Joshua

doi:10.1038/ejhg.2012.220

Download PDF

Article
Published: 24 October 2012

Statistical tests for detecting associations with groups of genetic variants: generalization, evaluation, and implementation

John Ferguson¹,
William Wheeler²,
YiPing Fu³,
Ludmila Prokunina-Olsson³,
Hongyu Zhao¹ &
…
Joshua Sampson⁴

European Journal of Human Genetics volume 21, pages 680–686 (2013)Cite this article

1136 Accesses
3 Citations
1 Altmetric
Metrics details

Subjects

Abstract

With recent advances in sequencing, genotyping arrays, and imputation, GWAS now aim to identify associations with rare and uncommon genetic variants. Here, we describe and evaluate a class of statistics, generalized score statistics (GSS), that can test for an association between a group of genetic variants and a phenotype. GSS are a simple weighted sum of single-variant statistics and their cross-products. We show that the majority of statistics currently used to detect associations with rare variants are equivalent to choosing a specific set of weights within this framework. We then evaluate the power of various weighting schemes as a function of variant characteristics, such as MAF, the proportion associated with the phenotype, and the direction of effect. Ultimately, we find that two classical tests are robust and powerful, but details are provided as to when other GSS may perform favorably. The software package CRaVe is available at our website (http://dceg.cancer.gov/bb/tools/crave).

Genome-wide association studies

Article 26 August 2021

Rare-variant collapsing analyses for complex traits: guidelines and applications

Article 11 October 2019

Controlling for background genetic effects using polygenic scores improves the power of genome-wide association studies

Article Open access 01 October 2021

Introduction

The search for rare variants associated with common diseases, and traits in general, has already started.^{1, 2, 3, 4, 5} As these variants are rare, most studies will be inadequately powered to detect an association with any single variant.^{6, 7, 8} When only a handful of minor alleles are observed for any single-nucleotide variant (SNV), obtaining statistical significance, especially at traditional genome wide levels of 10⁻⁸, can be near impossible. Therefore, instead of trying to identify associations with individual variants, the goal has been to identify associations with a group of rare variants in a shared region (eg, exons, genes) or pathway, effectively increasing power by pooling information across SNVs.⁹ Numerous statistical tests that can search for these regional associations have already been introduced, developed, and compared.^{7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

Our three goals in this paper are to (1) Unify (2) Identify, and (3) Modify association tests for rare variants. First, we introduce a simple statistical framework and show that the majority of rare-variant association tests can be reformulated within this framework. Second, we show that within this framework, we can easily identify the relationship between a statistic’s performance and the genetic characteristics of the tested SNVs, such as the proportion of SNVs associated with an outcome, direction of effects, and the relationship between effect size and MAF. Third, we show that the standard test statistics can be further tailored to the specifics of a given study or an investigator’s prior beliefs.

To achieve our objective of unification, we first revisit the standard association test for a single uncommon SNV. The standard approach would be to divide study participants into two groups, those with and without the minor allele, and then measure the difference in the average phenotype between those two groups. This approach applies equally for continuous and dichotomous variables (eg, disease), where for the latter case, we would measure the difference in disease prevalence. The resulting difference, at least for many scenarios, completely captures the available information for detecting an association. When testing a group of SNVs, the relevant information from each SNV is still only this difference, and, therefore, joint tests only vary by how they combine this information.

Most test statistics combine these differences in a very specific way. For describing the unified framework, let individual i be in the study, Y_i be the phenotype (eg, weight, height, and disease status) and let G_ij be the number of rare variants at SNV j. Therefore, for SNV j, we can calculate the difference between the average of the phenotype values in subjects with a minor allele, , and the average in subjects without a minor allele, .

The majority of statistics proposed for testing associations are a weighted sum of the squared differences, , and their cross-products, . As the weights are allowed to depend on Δ, all statistics cannot necessarily be formulated as a second-degree polynomial. We refer to this broader class of statistics as Data-Adaptively Weighted Generalized Score Statistics (DAWGSS, pronounced dogs). The full definition of DAWGSS, which can accommodate covariates and population stratification, will be provided later.

Standard rare-variant tests, such as the Sum test,¹¹ Hotelling’s T²-test,²⁰ Stouffer’s Z-test,¹⁵ Data Adaptive Sum test,²¹ C-alpha,¹⁶ similarity regression,¹⁷ variance components,¹⁷ CMC,²² and SKAT,²³ to name a few, only vary by their chosen set of weights. We specify five global features of the weights that vary among common test statistics. We then identify the properties of these global features that are desirable, or provide high power, when genetic variants have certain behaviors. For example, one feature is the relationship between the weights and the magnitude of Δ_j. We show that the power of association tests can be increased by shrinking the weights when Δ_j is small if only a minority of rare variants are truly associated with the outcome. By observing this connection, we develop a new test statistic, ROVER, that performs well in such scenarios.

The idea of a general framework for rare-variant tests has been proposed previously.^{12, 13} Our objective is to propose a similar framework, but one where a rare-variant test can be recast as a function of the single SNV test statistics. This framework illuminates similarities with classical statistical tests, offers a clear means for performing meta-analyses when studies genotype different sets of SNVs, and facilitates interpretable modifications.

We use these test statistics to better understand the association between bladder cancer and the UGT1A gene locus on chromosome 2q37. UGTs facilitate cellular detoxification of multiple exogenous and endogenous substrates.²⁴ Specifically, UGTs participate in the removal of aromatic amines, which are the main risk factors for bladder cancer found in tobacco smoke and industrial chemicals.²⁴ This locus has been associated with colorectal cancer,^{25, 26} pancreatic cancer,²⁷ liver cancer,²⁸ and most recently bladder cancer.²⁹ Following-up the GWAS hit, rs1189203, for bladder cancer, we performed targeted resequencing to identify variants within the UGT1A region and then a focused association study in 9319 individuals.³⁰ After the study identified one variant that was highly associated with bladder cancer, we now use ROVER and the other test statistics described here to determine if the remaining variants, many with low MAF, were enriched for associations.

In the methods section, we provide a more complete definition of DAWGSS and provide the details about our simulations and the study of bladder cancer. In the results section, we compare the performance of these statistics for our two types of data. Importantly, we compare the statistics within their DAWGSS framework so we can demonstrate how the choice of weights determines the properties of the test statistics. Note, the common framework allows rapid computation of multiple-test statistics simultaneously, enabling our broad comparison. Our software, CRaVe, is available as both a stand-alone Unix program and an R function. In the discussion section, we summarize our conclusions. The Supplementary material shows how to reformulate many commonly used test statistics in their DAWGSS format.

Methods

DAWGSS: definition

We let n be the number of subjects, Y_i be the normalized phenotype , and G_ij be the number of rare variants at SNV j, j∈{1,…,J}. We denote the vector of all genotypes by G_i=[G_i1 G_i2…G_iJ]^t.

We then define the genetic covariance, V, and correlation, σ, matrices from a group, S, containing n_S individuals by

where , and D(V) is the matrix containing only the diagonal elements of V, with the off-diagonal elements set to 0. We will denote the element in row j₁, column j₂ of V and σ by , respectively. Furthermore, let σ⁻¹ be the inverse of σ so that σσ⁻¹=I, where I is the J × J identity matrix.

Our definition of Δ_j in the introduction was slightly simplified, in that it only allowed two genotypes. We redefine it here as the score statistic (or, equivalently, as the correlation between Y and G_·j multiplied by a normalizing factor of √n),

and let Δ=[Δ₁ Δ₂…Δ_J]^t. Note, for any SNV, Δ_j is asymptotically distributed as a normal variable with mean 0 and variance 1 under the null hypothesis.

As stated in the introduction, DAWGSS have the form

When there are covariates, we extend the definition of Δ_j as follows. Let us expand our notation. For subject i, let be the outcome and X_i=[X_i1 X_i2…X_iT]^t be a set of T-covariates. We estimate by either linear or logistic regression as appropriate. We define . Usually the values of are defined so and can be ignored.

We can then use this redefined Y_i in equation (4).

DAWGSS: global features

Within the DAWGSS framework, most tests can be distinguished by five global features.

1
The magnitude of for uncorrelated SNVs (ie, are cross-product terms included when SNVs are in linkage equilibrium?)
2
The relationship between w_j and minor allele frequency (ie, are the contributions of SNVs weighted by their MAF?)
3
The relationship between the weights and the signs of Δ (ie, is preference given to variants that appear harmful?)
4
The relationship between w_j and Δ_j (ie, are small Δ_js shrunk on the presumption that most SNVs have no association?)
5
The relationship between and linkage disequilibrium (ie, are independent and correlated SNVs treated equally?)

The results section will discuss how the behavior of rare variants determines the desirable properties of features 1–4. Feature 5 is listed here for completeness, but will only be addressed briefly in the discussion.

Statistics

We compare the performance of multiple-test statistics, including

SUM: Sum test¹¹
HOI: Hotelling’s T²-test assuming independence²⁰
STZ: Stouffer’s Z-test¹⁵
MDF: MultiDegree of Freedom test^{16, 17}

These four tests are fully defined in Table 1. Note that HOI sets σ to be the identity matrix, allowing for more direct comparisons with the other tests, which omit any reference to σ. We focus on these four tests because they highlight the importance of the first two global features and because they are the most ubiquitous, noting that the recommended versions of tests using C-alpha, similarity regression, variance components, and kernels are all equivalent to MDF. However, for the larger simulations and bladder cancer data, we show the results for six other tests that can be described within the DAWGSS framework.

HOT: Hotelling’s T²-test

Table 1 Four test statistics – MultiDegree of Freedom Test (MDF), modified Hotelling’s T-test (HOI), Stouffer’s Z-Test (STZ), and the Sum test differ only by whether the statistics include cross-product terms, , and whether the weights incorporate √V, where V_jj is the variance of the genotype (ie, ∝MAF(1−MAF))

HOT accounts for the correlation between SNVs,

STZ+: Positive Stouffer’s Z-test
MDF+: Positive MultiDegree of Freedom test

STZ+³¹ and MDF+ tests modify their original test statistics by including only those variants with a positive Δ_j. We define the binary function, 1(·) by 1(·)=1 if the enclosed statement is true, 0 otherwise.

THR: Threshold test

THR is designed to reflect the statistic from Hoffmann et al,¹² only including variants with . In practice, α=0.05.

DAS: Data Adaptive Sum test

DAS, defined by Han and Pan,²¹ also requires the definition of a threshold, which we similarly set at α₀=0.05.

where

ROV: Rover

ROVER shrinks the weights for SNVs with low signal

Table 2 summarizes the global features for each of these ten statistics, with Supplementary tables summarizing additional statistics.

Table 2 Summary of the global features

Full size table

Simulations

Independence

We simulated genes from a case/control study with a total of n=2000 individuals, equally divided between the two groups, under the null and multiple alternative hypotheses. For each simulated gene, containing 40-independent SNVs with MAF equally spaced between 0.005 and 0.02, we calculated values for six-test statistics: SUM, HOI, STZ, MDF, MDF+, and ROV. Genes simulated under the null hypothesis were used to estimate the threshold for rejection, given a particular significance level, α. Power was then estimated as the proportion of genes simulated under an alternative, which exceeded that threshold. Under the alternative, the relationship between the probability of disease and genotype was defined by

The total number of influential SNVs , the proportion of influential SNVs that reduced risk , and the log odds ratio (β_j) were varied. The values of β_j, constant across all SNVs, were chosen by one of two methods. In the main text, we primarily discuss the example where ). In the Supplementary material, we discuss different effect sizes.

Linkage disequilibrium

We simulated genes from a case/control study with a total of 2000 or 10 000 individuals under the null and alternative hypotheses. Here, haplotypes were chosen to mimic those observed in the 1000 genomes project,³² and each gene contained 80 SNVs with MAF between 0.005 and 0.10. The MAF was increased because most SNVs with MAF<0.02 appeared independent of each other, while the number of SNVs was increased because, in the presence of linkage disequilibrium (LD), one influential SNV creates multiple-associated SNVs. For each alternative hypothesis, power was defined as the proportion of simulated genes with a permutation-based P-value below 10⁻³. Null simulations were only used to confirm that α-levels were correctly calibrated. Under the alternative, the relationship between the probability of disease and genotype was again defined by equation (15).

For the alternative hypotheses, simulations varied by the following parameters. (1) The number of subjects (n). (2) The number of influential SNVs (N_I). (3) The relationship between MAF and β_j. Either β_j was constant for all associated variants or inversely proportional to MAF. (4) Effect direction: either 50% or all of the associated variants increased risk . (5) The magnitude of the effect size. Although the odds ratios were generally set so the power of the MDF test was ≈0.4–0.5, a ‘large’ effect size, where the power of the MDF test was inflated to ≈0.8, was also examined. Details about the simulation process are available in the Supplementary material.

Bladder cancer study

The association studies have been described elsewhere, so we only offer a brief summary here.^{27, 30} The original GWAS contained 3532 cases and 5120 controls of self-described European descent.²⁹ Among the 591 637 SNVs passing quality-control, 166 SNVs were within the UGT1A region, defined as the 158 Kb of the UGT1A cluster +/100 Kb (chr2:234 091 000–234 447 000, hg18). A promising association between SNV rs11892031 in the cluster and bladder cancer (P=7.7 × 10⁻⁵) suggested additional examination. In the first step of the follow-up study, we generated highly specific long-range amplicons and sequenced the alternative first exons in each of the UGT1A genes in 44 bladder cancer cases and 30 trios from the HapMap CEU set, detecting 43 known exonic SNVs. In the second step, we selected 18 SNVs, based on LD and functional annotation, and genotyped these in a set of 1055 cases and 962 controls from the Spanish Bladder Cancer Study (SBCS), a component of the original GWAS. In step 3, based on the SBCS data enriched across the region, we imputed these additional exonic variants for the remaining samples from the stage 1 GWAS (2477 cases/4158 controls). The final data set contained 49 exonic SNVs in 3532 cases and 5120 controls. Using imputation based on the combined reference panels of HapMap 3 CEU and 1000 Genomes data, an extended data set was created with 1170 SNVs in the same group of individuals.

We performed the previously described association tests on subgroups of the 49 and 1170 SNVs, including only SNVs with a MAF below 0.1. As rs17863783 is significant by itself, we repeated these tests on the remaining SNVs after adjusting for rs17863783, using a simplified approach of treating that SNV like a covariate. All analyses were also repeated after adjusting for the covariates age, gender, smoking status, and study center.

Results

Simulations

We first compared the power to detect associations between disease status and simulated genes containing either 40-independent or 80-dependent SNVs.

(1) The magnitude of for uncorrelated SNVs

Statistics that include cross-product terms are more advantageous when a large proportion of SNVs are associated with the outcome. In our simulations assuming independence, the power for the SUM test, which includes cross-product terms, exceeds that for the MDF test when at least 15 of the 40 SNVs influence disease risk (Figure 1). If we increase β in equation (15), then that intersection point, when the study power becomes higher for the SUM test, increases slightly (Supplementary material). If we increase the total number of SNVs beyond 40, the absolute number of influential SNVs needed for SUM to have higher power will increase, but the proportion of SNVs, relative to the total number, decreases (Figure 3).

Our simulations where SNVs within a gene are in linkage disequilibrium show that the SUM test becomes the more powerful test at a comparatively smaller number of influential SNVs. Table 3 shows that the advantage for those tests without cross-product terms is only slight when 4 out of the 80 SNVs are influential. Because of linkage disequilibrium, the number of SNVs associated with the disease is considerably larger than four. Even after varying other parameters, such as the number of subjects in the study, the odds ratio for the disease alleles, and the relationship between MAF and the magnitude of the odds ratio, these general trends still hold.

Table 3 The statistical tests were compared when 4 and 16 out of the 80-dependent SNVs were influential

Full size table

Until now, all SNVs have increased disease risk. In simulations where the strength of the association is identical among all influential SNVs, but 50% reduce disease risk, the two tests that include cross-product terms, SUM and STZ, will have nearly no power to detect an association. The cost can be understood by either noting that the cross-product terms, , are negative when the effects go in opposite directions, thereby reducing the test statistic, or by rewriting the test statistic as the square of a variable with an expected value of 0 (eg, STZ, ().

(2) The relationship between w_j and MAF

The relationship between the weights and MAF is mediated by √V_jj, where V_jj is the variance of the genotype at SNV j and is proportional to MAF (1-MAF). Clearly, compared with the STZ and HOI statistics, the statistics that include √V_jj are upweighting SNVs with a higher MAF. In our simulations where SNVs are independent, the HOI statistic, which omits √V_jj from the weights, slightly outperforms the MDF test after adding a small stabilizing constant, 0.005, to the denominator of Δ_j (Figure 1). Whereas the magnitude of Δ is effectively independent of MAF, multiplying by √V_jj makes the contributions proportional to the number of rare variants: (see equation (4)). In our simulations where SNVs are in linkage disequilibrium, the two statistics, HOI and MDF, performed nearly identically. When we allow the odds ratio (ie β_j in equation (15)) to be inversely proportional to MAF, reflecting the hypothesis that low MAF SNVs will have larger effects, the loss of power by including V_jj must, therefore, be larger (Table 3).

To explore the data-adaptive weights, we consider statistics introduced more recently. The STZ+³¹ and MDF+ tests modify their original test statistics by including only those variants with a positive Δ_j. These statistics were designed to gain power when mutations increase disease risk. ROV includes weights, w_j, that decrease with , and was designed to gain power when only a small proportion of SNVs are influential.

(3) The relationship between the weights and the signs of Δ

All simulations show that without exception, STZ+ and MDF+ must outperform their all-inclusive counterparts when all SNVs increase the risk of disease (Figure 2, Table 3). Figure 2 shows, in our example where 10 out of 40-independent SNVs are influential, that the power for STZ+ and MDF+ decrease quickly as the proportion of SNVs, which reduce risk increases. If three SNVs reduce disease risk, then MDF and MDF+ statistics perform similarly, whereas if more than three SNVs reduce risk, the MDF test outperforms MDF+. Comparing STZ and STZ+ tests confirms that, in general, STZ+ will outperform STZ because as we discussed in section 1, STZ already performs poorly when SNVs have opposing effects. Clearly, when all variants are protective, the power for the MDF+ and STZ+ tests are no larger than the α-level. Table 3 shows that trends observed in the independent simulations still hold when there is LD. Although the MDF+ and STZ+ have the highest, or nearly the highest, power when all 4 or 16 influential SNVs increase risk, the power for either the MDF+ or STZ+ is only ≈10–20% higher than their counterparts. Table 3 further shows that when half of the SNVs reduce risk, their power is ∼50% of that for the MDF test.

(4) The relationship between w_j and Δ_j

In our simulations with 40-independent SNVs, Figure 1 shows that ROV outperformed all other statistics so long as fewer than ∼10 of those SNVs were influential. The exact intersection depends on odds ratios of the influential SNVs (see Supplementary material). In our simulations with linkage disequilibrium, ROV only outperforms other statistics when <4 out of the 80 SNVs are influential (Table 3). With linkage disequilibrium, non-influential SNVs will still be associated with the outcome and shrinkage is less helpful (Table 3). Table 3, however, shows that ROV, a form of soft-thresholding, may still slightly outperform methods that remove all Δ with a P-value below a hard threshold of 0.05 (ie, DAS and THR).

Bladder cancer

Among the sets of 49 exonic and 1170 total SNVs covering the UGT1A region, 24 and 556 had MAF below 0.1. In the fine-mapping study, which looked only at SNVs individually, the variant at rs17863783 was found to decrease the risk of bladder cancer (P=5 × 10⁻⁷). Table 4 shows that had we not examined each SNV individually and only examined the region as a whole, no test would have found the regional association to be statistically significant after adjusting for testing 20 000 genes and using a significance threshold of 5 × 10⁻⁷=0.01/20 000. However, among all tests, ROV, HOI, and HOT resulted in the lowest P-values. Impressively, these statistics generally performed better when the 556 SNVs were examined jointly, as opposed to only 24.

Table 4 Testing the association between the UGT1A region and bladder cancer with multiple DAWGSS

Full size table

The purpose of the joint examination was to determine whether the remaining SNVs, as a group, show an association with bladder cancer. First, all tests still suggested a possible association after removing rs17863783, because surrounding SNVs were in LD with rs17863783 (Table 4). Therefore, only a test statistic that can condition on the associated SNV can be used for this analysis. Even after conditioning on an individual’s genotype at rs17863783, a few tests generally suggested an association, with, for example, the HOT and MDF having P-values of 0.0004 and 0.073. Here, after identifying a single gene, adjusting for multiple comparisons is unnecessary. However, by adjusting for the study center, a surrogate for population structure, tests suggested that there was no additional association between the UGT1A cluster and bladder cancer, with the HOT and MDF now having P-values of 0.58 and 0.11. Therefore, in this region, which is highly associated with multiple cancers, only a single SNV appears to directly influence the risk of bladder cancer.

Discussion

We examined a class of statistics, DAWGSS, that test for an association between a group of genetic variants and a phenotype and show that the majority of statistics that are currently available for testing associations with groups of variants, despite the diversity of their original presentations, can be rewritten as DAWGSS. Even the linear kernel methods, which generalize a large subset of statistics,²³ are, in turn, generalized by DAWGSS when we restrict to their preferred metric of IBS. Within this shared framework, it is clear that the differences among these statistics are wholly encompassed by the weights multiplying the terms in the sum from equation (5). In the case of independent SNVs, the four classical and most ubiquitous statistics (STZ, SUM, HOI, MDF) differ only by whether they include cross-product terms and/or incorporate the SD, √V, of the number of variants into their weights. When comparing a broader range of statistics, we found five global features that distinguish many of the known tests. Furthermore, we demonstrated that the desirable properties for these features depend on the behavior of rare variants.

This paper was focused on showing the similarity of rare-variant test statistics, and in the Supplementary material, we show how specific tests can be reformulated as DAWGSS. However, our discussions are not intended to be inclusive. Foremost, DAWGSS are limited to describing statistics that have an additive effect across the SNVs. Therefore, rank-based approaches,⁷ methods that allow for interactions,⁶ methods with non-additive effects,¹⁸ and methods that compare all subsets of SNVs³³ are outside of the DAWGSS framework. However, we believe these limitations have minimal practical implications. Currently, studies are underpowered to detect most interactions, especially among rare variants, and, if desired, DAWGSS can easily be extended to account for interaction terms. Moreover, among methods based on distance matrices, such as similarity regression, kernel-based approaches, and variance components, the additive model outperformed other options.^{17, 34}

The association between bladder cancer and the UGT1A region appears to be caused by a single genetic variant. Considering the importance of the UGT family in a number of cancers, we expected that variants, which affected one cancer would also affect others, even if to a lesser extent, resulting in multiple variants associated with disease risk. However, in our tests of association, there was no evidence to support such a hypothesis as there was no regional association after adjusting for rs17863783.

The DAWGSS framework offers a practical means for performing association tests. To facilite its use, we have provided the software, CRaVe, on the author’s website that can efficiently perform all tests that can be described with the DAWGSS framework. The program inputs standard data formats (eg, vcf, tped) and can accomodate covariates and bioinformatic weights. CRaVe can perform all tests defined within this manuscript and allows the user to new tests within this framework. We are currently using this software to study the fifth global feature, the relationship between and LD, and aim to provide results evaluating tests, such as HOT, that adjust for LD in the near future.

References

Ahituv N, Kavaslar N, Schackwitz W et al: Medical sequencing at the extremes of human body mass. Am J Hum Genet 2007; 80: 779–791.
Article CAS Google Scholar
Cohen JC, Boerwinkle E, Mosley TH, Hobbs HH : Sequence variations in pcsk9, low ldl, and protection against coronary heart disease. N Engl J Med 2006; 354: 1264–1272.
Article CAS Google Scholar
Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH : Multiple rare alleles contribute to low plasma levels of hdl cholesterol. Science 2004; 5685: 869–872.
Article Google Scholar
Nejentsev S, Walker N, Riches D, Egholm M, Todd JA : Rare variants of ifih1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science 2009; 5925: 387–389.
Article Google Scholar
Romeo S, Pennacchio LA, Fu Y et al: Population based resequencing of angptl4 uncovers variations that reduce triglycerides and increase hdl. Nat Genet 2007; 4: 513–516.
Article Google Scholar
Liu DJ, Leal SM : A novel adaptive method for the analysis of next generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet 2010; 10: e1001156.
Article Google Scholar
Madsen BE, Browning SR : A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 2009; 5: e1000384.
Article Google Scholar
Morgenthaler S, Thilly WG : A strategy to discover genes that carry multiallelic or monoallelic risk for common diseases: a cohort allelic sums test (cast). Mutat Res 2007; 12: 28–56.
Article Google Scholar
Wang K, Li M, Hakonarson H : Annovar: functional annotation of genetic variants from highthroughput sequencing data. Nucleic Acids Res 2010; 38: e164–e164.
Article Google Scholar
Basu S, Pan W, Shen X, Oetting WS : Multilocus association testing with penalized regression. Genet Epidemiol 2011; 35: 755–765.
Article Google Scholar
Chapman J, Whittaker J : Analysis of multiple snps in a candidate gene or region. Genet Epidemiol 32, 2008; 6: 560–566.
Article Google Scholar
Hoffmann TJ, Marini NJ, Witte JS : Comprehensive approach to analyzing rare genetic variants. PLoS ONE 2010; 11: e13584.
Article Google Scholar
Lin DY, Tang ZZ : A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet 2011; 3: 354–367.
Article Google Scholar
Luedtke A, Powers S, Petersen A, Sitarik A, Bekmetjev A, Tintle N : Evaluating methods for the analysis of rare variants in sequence data. BMC Proc 2011; 5: S119.
Article Google Scholar
Mosteller F, Fisher RA : Questions and answers. Am Stat 1948; 5: 30–31.
Google Scholar
Neale BM, Rivas MA, Voight BF et al: Testing for an unusual distribution of rare variants. PLoS Genet 2011; 7: e1001322.
Article CAS Google Scholar
Tzeng JY, Zhang D, Chang SM, Thomas DC, Davidian M : Genetrait similarity regression for multimarkerbased association analysis. Biometrics 2009; 65: 822–832.
Article Google Scholar
Wessel J, Schork NJ : Generalized genomic distance based regression methodology for multilocus association analysis. Am J Hum Genet 2006; 79: 792–806.
Article CAS Google Scholar
Xu X, Tian L, Wei LJ : Combining dependent tests for linkage or association across multiple phenotypic traits. Biostatistics 2003; 2: 223–229.
Article Google Scholar
Hotelling H : The generalization of student’s ratio. Ann Math Stat 1931; 3: 360–378.
Article Google Scholar
Han F, Pan W : A data adaptive sum test for disease association with multiple common or rare variants. Hum Hered 2010; 70: 42–54.
Article Google Scholar
Li B, Leal SM : Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 2008; 83: 311–321.
Article CAS Google Scholar
Wu M, Lee S, Cai T, Li Y, Boehnke M, Lin X : Rare variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011; 89: 82–93.
Article CAS Google Scholar
Tukey RH, Strassburg CP : Human udpglucuronosyltransferases: metabolism, expression, and disease. Annu Rev Pharmacol Toxicol 2000; 40: 581–616.
Article CAS Google Scholar
Chan AT, Tranah GJ, Giovannucci EL, Hunter DJ, Fuchs CS : Genetic variants in the ugt1a6 enzyme, aspirin use, and the risk of colorectal adenoma. J Natl Cancer Inst 2005; 6: 457–460.
Article Google Scholar
Strassburg CP, Vogel A, Kneip S, Tukey RH, Manns MP : Polymorphisms of the human udpglucuronosyltransferase (ugt) 1a7 gene in colorectal cancer. Gut 2002; 60: 851–856.
Article Google Scholar
Ockenga J, Vogel A, Teich N, Keim V, Manns MP, Strassburg CP : Udp glucuronosyltransferase (ugt1a7) gene polymorphisms increase the risk of chronic pancreatitis and pancreatic cancer. Gastroenterology 2003; 7: 1802–1808.
Article Google Scholar
Vogel A, Kneip S, Barut A et al: Genetic link of hepatocellular carcinoma with polymorphisms of the udpglucuronosyltransferase ugt1a7 gene. Gastroenterology 2001; 121: 1136–1144.
Article CAS Google Scholar
Rothman N, GarciaClosas M, Chatterjee N et al: A multistage genomewide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet 2010; 11: 978–984.
Article Google Scholar
Tang W, Fu YP, Figueroa J et al: An uncommon synonymous humanspecific coding variant within the ugt1a6 gene affects mrna expression and protects from bladder cancer. Genome Biol 2011; 12: 1–27.
Article Google Scholar
IonitaLaza I, Buxbaum JD, Laird NM, Lange C : A new testing strategy to identify rare variants with either risk or protective effect on disease. PLoS Genet 2011; 7: e1001289.
Article CAS Google Scholar
1000 Genome Consortium: A map of human genome variation from population scale sequencing. Nature 2010; 467: 1061–1073.
Article Google Scholar
Yu K, Li Q, Bergen AW et al: Pathway analysis by adaptive combination of P-values. Genet Epidemiol 2009; 33: 700–709.
Article Google Scholar
Mukhopadhyay I, Feingold E, Weeks DE, Thalamuthu A : Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genet Epidemiol 2010; 34: 213–221.
PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The research of Dr Joshua Sampson was supported by the intramural program of the National Institute of Cancer. The research of this study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD. (http://biowulf.nih.gov).

Author information

Authors and Affiliations

Division of Biostatistics, Yale School of Public Health, New Haven, CT, USA
John Ferguson & Hongyu Zhao
Information Management Services, Rockville, MD, USA
William Wheeler
Laboratory of Translational Genomics, DCEG, National Cancer Institute, Bethesda, MD, USA
YiPing Fu & Ludmila Prokunina-Olsson
Biostatistics Branch, DCEG, National Cancer Institute, Rockville, MD, USA
Joshua Sampson

Authors

John Ferguson
View author publications
You can also search for this author in PubMed Google Scholar
William Wheeler
View author publications
You can also search for this author in PubMed Google Scholar
YiPing Fu
View author publications
You can also search for this author in PubMed Google Scholar
Ludmila Prokunina-Olsson
View author publications
You can also search for this author in PubMed Google Scholar
Hongyu Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Joshua Sampson
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Joshua Sampson.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies the paper on European Journal of Human Genetics website

Supplementary information

Supplementary material (PDF 172 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ferguson, J., Wheeler, W., Fu, Y. et al. Statistical tests for detecting associations with groups of genetic variants: generalization, evaluation, and implementation. Eur J Hum Genet 21, 680–686 (2013). https://doi.org/10.1038/ejhg.2012.220

Download citation

Received: 26 January 2012
Revised: 31 May 2012
Accepted: 21 July 2012
Published: 24 October 2012
Issue Date: June 2013
DOI: https://doi.org/10.1038/ejhg.2012.220

Keywords

This article is cited by

A new multivariate test formulation: theory, implementation, and applications to genome-scale sequencing and expression
- Lei Xu
Applied Informatics (2016)

Subjects

Abstract

Similar content being viewed by others

Genome-wide association studies

Rare-variant collapsing analyses for complex traits: guidelines and applications

Controlling for background genetic effects using polygenic scores improves the power of genome-wide association studies

Introduction

Methods

DAWGSS: definition

DAWGSS: global features

Statistics

Simulations

Independence

Linkage disequilibrium

Bladder cancer study

Results

Simulations

(1) The magnitude of for uncorrelated SNVs

(2) The relationship between wj and MAF

(3) The relationship between the weights and the signs of Δ

(4) The relationship between wj and Δj

Bladder cancer

Discussion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Competing interests

Additional information

Supplementary information

Supplementary material (PDF 172 kb)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

This article is cited by

A new multivariate test formulation: theory, implementation, and applications to genome-scale sequencing and expression

Search

Quick links

(2) The relationship between w_j and MAF

(4) The relationship between w_j and Δ_j