Gene-set analysis (GSA) evaluates the overall evidence of association between a phenotype and all genotyped single nucleotide polymorphisms (SNPs) in a set of genes, as opposed to testing for association between a phenotype and each SNP individually. We propose using the Gamma Method (GM) to combine gene-level P-values for assessing the significance of GS association. We performed simulations to compare the GM with several other self-contained GSA strategies, including both one-step and two-step GSA approaches, in a variety of scenarios. We denote a ‘one-step’ GSA approach to be one in which all SNPs in a GS are used to derive a test of GS association without consideration of gene-level effects, and a ‘two-step’ approach to be one in which all genotyped SNPs in a gene are first used to evaluate association of the phenotype with all measured variation in the gene and then the gene-level tests of association are aggregated to assess the GS association with the phenotype. The simulations suggest that, overall, two-step methods provide higher power than one-step approaches and that combining gene-level P-values using the GM with a soft truncation threshold between 0.05 and 0.20 is a powerful approach for conducting GSA, relative to the competing approaches assessed. We also applied all of the considered GSA methods to data from a pharmacogenomic study of cisplatin, and obtained evidence suggesting that the glutathione metabolism GS is associated with cisplatin drug response.
Genetic association studies, in particular genome-wide association studies (GWAS), are a powerful approach in the search for common alleles with moderate effects on phenotypic traits. Over the last few years, GWAS have identified loci associated with numerous complex diseases.1 However, the GWAS approach has limitations. Individual single nucleotide polymorphism (SNP) effects tend to be small and explain only a small proportion of the heritable variation in a phenotype,2 making most SNP associations difficult to detect using the GWAS approach. To overcome these limitations of single SNP analysis, pathway or gene-set analysis (GSA) methods for SNP data evaluate the overall evidence of association of a phenotype with SNPs in all genes in a given molecular pathway or GS.3, 4 Such methods may enable the detection of subtle effects of multiple genes in the same pathway that may be missed by assessing each gene individually.
GSA methods were first introduced in the context of gene expression data analysis.5, 6, 7, 8, 9 Many of these methods were subsequently extended for the analysis of SNP data.10, 11, 12 Methods for GSA (for both expression and SNP studies) can be divided into two types: competitive and self-contained.6 Competitive or ‘enrichment’ methods compare the results for genes within the GS with results for genes outside the GS (complement) to test the hypothesis that genes within the GS are associated with the phenotype more than genes outside the GS, whereas self-contained methods only consider results within a GS of interest to test the hypothesis that SNPs/genes in the GS are associated with the phenotype. For more details on competitive and self-contained GS methods, the reader is referred to Fridley and Biernacka3 and Wang et al.4 In this study, we have focused on only self-contained GSA methods to ensure fair comparison of methods testing the same null hypothesis.
In this manuscript we propose the use of the Gamma Method13 (GM) for GSA testing as part of either a one-step or two-step analysis strategy. We denote a ‘one-step’ GSA approach to be one in which all SNPs in a GS are used to derive a test of GS association without consideration of gene-level effects; and a ‘two-step’ approach to be one in which all genotyped SNPs in a gene are first used to evaluate association of the gene with the phenotype and then the gene-level associations are aggregated to test for association of the GS with the phenotype. A simulation study was completed to compare the use of the GM for GSA to several other self-contained GSA strategies, including both one-step and two-step GSA approaches, in a variety of scenarios. All of the methods considered are self-contained methods that can be utilized for binary, quantitative or time-to-event phenotypes. In addition to the simulation study, we performed GSA of data from a pharmacogenomic study of cisplatin drug response.
MATERIALS AND METHODS
The GM GSA approach
Self-contained GSA of SNP data can be performed using a ‘one-step’ or a ‘two-step’ approach. One-step analysis can be based on combining SNP-specific P-values to formulate a test of association of the GS with the phenotype, whereas a two-step analysis can be completed by performing gene-level tests of association and then combining the gene-level P-values to evaluate the association of the GS with the phenotype.
One of the most commonly used approaches for combining independent P-values is Fisher's method (FM).14, 15 Several extensions and modifications to FM have been proposed for summarizing results from genetic association studies.13, 16, 17, 18, 19 The FM can be shown to be a special case of the GM previously described by Zaykin et al.13 The GM is based on summing P-values transformed using an inverse Gamma(ω, 1) transformation. For a particular shape parameter ω and L P-values $P_1, \ldots, P_L$ to be combined, the test statistic is defined as $T_\omega = \sum_{i=1}^{L} G^{-1}_{\omega,1}(1 - P_i)$, where $G^{-1}_{\omega,1}$ is the inverse of a Gamma(ω, 1) cumulative distribution function.13 Application of different transformations to P-values before combining them into a test statistic varies the emphasis given to individual P-values, with more emphasis being given to P-values below a particular threshold. This threshold level, which has been referred to as the soft truncation threshold (STT), is controlled by the shape parameter ω.13 When ω is 1, each transformed P-value equals −ln(P), which under the null hypothesis is half of a $\chi^2$-distributed variable with 2 degrees of freedom, and the GM becomes equivalent to FM with an STT value of 1/e. The shape parameter ω corresponding to a particular STT value can be calculated by solving $G^{-1}_{\omega,1}(1 - \mathrm{STT}) = \omega$. By varying the shape parameter of the Gamma distribution, different transformations of P-values can be achieved, and thus the GM is a family of related methods with FM as a special case. Other P-value combination methods such as the truncated product method and rank truncated product method could also be considered for GSA. However, Zaykin et al13 found that the GM provided overall higher power in a simulation study. We therefore focus on the GM, including FM as a special case, for GSA as described below.
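The transformation and the STT relation can be illustrated with a short standard-library Python sketch. It assumes the statistic is the sum of $G^{-1}_{\omega,1}(1 - P_i)$ and that the STT equals $1 - G_{\omega,1}(\omega)$, which reproduces FM and the value 1/e when ω = 1. All function names are hypothetical, and the P-value returned by `gamma_method` assumes independent inputs (T then follows a Gamma(Lω, 1) distribution), whereas the analyses in this paper rely on permutations.

```python
import math

def gamma_sf(shape, x, eps=1e-14, max_iter=10000):
    """Upper tail P(X > x) for X ~ Gamma(shape, 1)."""
    if x <= 0.0:
        return 1.0
    log_prefactor = -x + shape * math.log(x) - math.lgamma(shape)
    if x < shape + 1.0:
        # Series for the lower tail, then take the complement.
        term = total = 1.0 / shape
        for n in range(1, max_iter):
            term *= x / (shape + n)
            total += term
            if abs(term) < abs(total) * eps:
                break
        return min(1.0, max(0.0, 1.0 - total * math.exp(log_prefactor)))
    # Continued fraction (modified Lentz algorithm) for the upper tail.
    tiny = 1e-300
    b = x + 1.0 - shape
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        an = -i * (i - shape)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(log_prefactor)

def gamma_isf(shape, p):
    """Solve P(X > x) = p for x by bisection, i.e. x = G^{-1}(1 - p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    lo, hi = 0.0, 1.0
    while gamma_sf(shape, hi) > p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_sf(shape, mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gamma_method(pvalues, shape):
    """Return (T, P-value); the P-value assumes independent input P-values."""
    stat = sum(gamma_isf(shape, p) for p in pvalues)
    return stat, gamma_sf(shape * len(pvalues), stat)

def stt_from_shape(shape):
    """Soft truncation threshold implied by a shape parameter."""
    return gamma_sf(shape, shape)

def shape_from_stt(stt):
    """Shape parameter for a given STT (bisection; assumes the STT
    increases with the shape over the searched range)."""
    if not 0.0 < stt < 0.5:
        raise ValueError("stt must lie in (0, 0.5)")
    lo, hi = 1e-8, 1.0
    while stt_from_shape(hi) < stt:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if stt_from_shape(mid) < stt:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative gene-level P-values (made up for this example).
example_pvalues = [0.01, 0.2, 0.5]
fisher_stat, fisher_p = gamma_method(example_pvalues, 1.0)
small_stt_shape = shape_from_stt(0.05)
```

With a shape of 1 the statistic is the sum of −ln(P) over the inputs, which is half of Fisher's statistic, and `shape_from_stt(0.05)` returns a shape below 1, so that small P-values receive relatively more weight.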
For application of the GM to GSA, we investigated the use of the GM for combining SNP P-values for a one-step GSA or gene-level P-values for a two-step GSA. For the GM, we considered values of STT ranging from 0.01 to 1/e (ie, FM). For the two-step GSA, gene-level tests were performed using several different methods, before combining the gene-level P-values using the GM to evaluate association of the GS with the phenotype. Specifically, four commonly used methods for gene-level testing were assessed, including a global model with fixed effects (GMFE), global model with random effects (GMRE),20 principal components (PCs) analysis,21 and the minimum P-value (MinP) approach.
A limitation of the GMFE approach is that the model is only estimable when the number of predictor variables (eg, SNPs) is smaller than the number of subjects in the study (sample size). In contrast, the GMRE proposed for gene expression GSA by Goeman et al20 is based on a random effects model that can accommodate a large number of SNPs. A continuous phenotype, Y, is modeled as $Y \mid X \sim N(\alpha\mathbf{1} + X\beta, \sigma^{2}I)$, where X represents a matrix containing the N SNP genotypes, coded in terms of the number of minor alleles, and β represents a vector containing the effects of the N SNPs, with each of the $\beta_j$, j=1,…, N, having a common distribution with mean 0 and variance $\tau^2$. Under the null hypothesis of no association, the variance of the random effects is zero ($\tau^2 = 0$), which can be tested with a score test.20 GMRE has been extensively utilized and shown to outperform other methods for GSA of mRNA expression data.22
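As a rough illustration of the quantity behind such a score test, the sketch below computes the sum over SNPs of squared genotype-by-residual cross-products for an intercept-only null model. This is a score-type quadratic form with scaling constants omitted, not the `globaltest` implementation; the function name and the toy data are hypothetical.

```python
def score_statistic(genotypes, phenotype):
    """Sum over SNPs of (sum_i x_ij * (y_i - mean(y)))^2.

    genotypes: one list of minor-allele counts (0, 1, 2) per subject.
    phenotype: one continuous value per subject.
    Scaling constants are omitted, so only comparisons against a
    reference distribution (eg, from permutations) are meaningful.
    """
    n = len(phenotype)
    mean_y = sum(phenotype) / n
    residuals = [y - mean_y for y in phenotype]
    n_snps = len(genotypes[0])
    total = 0.0
    for j in range(n_snps):
        u_j = sum(genotypes[i][j] * residuals[i] for i in range(n))
        total += u_j * u_j
    return total

# Toy data: three subjects, two SNPs (made up for illustration).
toy_genotypes = [[0, 2], [1, 1], [2, 0]]
toy_phenotype = [0.0, 1.0, 2.0]
toy_stat = score_statistic(toy_genotypes, toy_phenotype)  # 2^2 + (-2)^2 = 8.0
```

Because the statistic sums squared per-SNP contributions, effects in opposite directions do not cancel, and the number of SNPs may exceed the number of subjects.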
We also considered PC analysis for gene-level tests based on SNP data.21 In this approach, PCs are created using a linear combination of centered SNP genotypes (based on the number of minor alleles), with a subset of the PCs included as predictors in a regression model (eg, components that explain 80% of the variation). A gene-level test can then be based on a global test of association of the PCs with the phenotype.
Finally, the MinP, or maximum test statistic, over all SNPs in a gene is often used to represent the evidence of association with the gene in GS analyses.10, 11 This approach requires correctly accounting for gene size (number of SNPs) and LD between SNPs, as genes with more SNPs in lower LD are expected to have smaller MinP by chance, even in the absence of association between the genotypes and phenotypes.
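One way to account for gene size and LD is to compare the observed maximum statistic with its permutation distribution, as in the following sketch. It assumes every SNP is tested with the same 1-df statistic (here the squared correlation between genotype and phenotype), so that the minimum P-value corresponds to the maximum statistic. The function names and the add-one form of the empirical P-value are assumptions, not details taken from the study.

```python
import random

def squared_correlation(x, y):
    """Squared Pearson correlation; 0.0 if either variable is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def minp_gene_pvalue(snp_columns, phenotype, n_perm=1000, seed=1):
    """Permutation P-value for the maximum per-SNP statistic in a gene.

    snp_columns: one list of genotypes per SNP (each of length n subjects).
    Permuting the phenotype keeps the LD structure among SNPs intact,
    so genes with many weakly correlated SNPs are not favoured.
    """
    rng = random.Random(seed)
    observed = max(squared_correlation(col, phenotype) for col in snp_columns)
    permuted = list(phenotype)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        stat = max(squared_correlation(col, permuted) for col in snp_columns)
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the empirical P-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

The same permuted phenotypes can be reused across genes so that the resulting gene-level P-values retain the correlation between genes.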
In the second step of the two-step GSA methods, we combine the gene-level P-values using the GM with STT ranging from 0.01 to 1/e (ie, FM). All GS association P-values were determined empirically based on K permutations of the phenotype. This procedure leads to valid tests in the presence of differences in gene size and LD between SNPs or genes.
Other self-contained GSA approaches
All of the methods considered within the study are self-contained GSA approaches that can be applied to any type of phenotype (eg, binary case–control status, quantitative phenotype, time-to-event phenotype). Many of these methods were selected based on a prior study of GSA for mRNA expression data, which demonstrated that the GMRE, FM and PC approaches generally had the highest power among a number of self-contained GSA methods.22
In addition to the one-step and two-step GM approaches described above, we also studied the performance of other one-step GSA methods. In particular, one-step GSA was also performed using PC analysis with components that explain 80% of the SNP variation and the GMRE method. For all one-step GSA methods, permutations were used to determine the empirical P-value for testing the association of the GS with the phenotype.
Case study: cisplatin pharmacogenomic analysis
The platinum agent cisplatin (CDDP) is a commonly used treatment for ovarian and lung cancer. To understand the pharmacogenomics of CDDP drug therapy and the role genetic variation has on the response to CDDP, the Coriell Human Variation Panel (HVP) lymphoblastoid cell lines (LCLs) from three racial groups were studied as described previously.23, 24 The quantitative drug response phenotype CDDP IC50 (effective dose that kills 50% of the cells) was estimated using a four-parameter logistic model per cell line.25
SNP genotyping was completed on the Illumina (San Diego, CA, USA) HumanHap 550K and HumanHap510S for the LCLs at the Genotyping Shared Resources at the Mayo Clinic in Rochester, MN, USA. In addition, publicly available SNP data from the Affymetrix (Santa Clara, CA, USA) SNP Array 6.0 Chips were obtained for these cell lines. In total, before completing quality control, there were 1 698 648 unique SNPs on the three arrays, with 1328 SNPs mapping within 50 kb of the 27 genes in the glutathione metabolism pathway. After removing SNPs that failed quality control, 1272 SNPs in 27 genes remained for GSA of the glutathione metabolism pathway. Table 1 shows the number of SNPs in each of the 27 genes included in the analysis. Missing genotypes were imputed before analysis using the program fastPHASE.26 The association of CDDP IC50 and the glutathione metabolism pathway was assessed using the one-step or two-step GM approach with various STT values, along with the other self-contained GSA methods. The quantitative phenotype IC50 and the genotype–phenotype association models were adjusted for gender, race, and five PCs within each race group to correct for possible population stratification effects. Empirical P-values were based on 1000 permutations.
Simulation study for assessing GSA methods
Genotypes were simulated based on the observed SNP data in the glutathione metabolism pathway for the HVP LCLs from subjects of European descent. The 27 genes within the pathway were mapped to chromosomes, and haplotypes were phased using the program fastPHASE.26 These haplotype frequencies were used to represent the underlying population. Three thousand haplotypes were simulated using the hapsim library in R (http://cran.r-project.org/web/packages/hapsim/index.html) based on these haplotype frequencies. Pairs of haplotypes were then assigned in a sequential manner to the 1500 individuals.
Case–control data sets with 500 cases and 500 controls were generated to evaluate GSA in the commonly used case–control study design. Using the simulated genotypic data for markers within the glutathione metabolism pathway, a binary phenotype ($Z_i$) for each subject i was generated conditional on their genotypic values from a Bernoulli distribution, $Z_i \sim \mathrm{Ber}(p_i)$ with $\log\{p_i/(1-p_i)\} = X_i^{T}\beta$. To generate data sets with 500 cases and 500 controls, the intercept in this model was selected such that the average probability of being a case was 0.50. From the cohort of 1500 subjects with a simulated binary phenotype ($Z_i$), 500 cases ($Z_i=1$) and 500 controls ($Z_i=0$) were randomly selected for analysis. The disease/phenotype models varied in the number and the size of the genetic effects, with odds ratios for individual SNP effects being 1.2 and 1.5 for small and moderate effects, respectively. All causal variants were observed in the data sets.
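The phenotype-generation step can be sketched as follows. The intercept is written as a separate term and calibrated by bisection so that the mean case probability is 0.50. The function names, the independent toy genotypes (minor allele frequency 0.3, five SNPs) and the choice of two causal SNPs are illustrative assumptions; in the study itself, genotypes came from haplotypes simulated with hapsim.

```python
import math
import random

def case_probability(genotype, betas, intercept):
    """Logistic model: P(case) given one subject's SNP genotypes."""
    eta = intercept + sum(b * g for b, g in zip(betas, genotype))
    return 1.0 / (1.0 + math.exp(-eta))

def calibrate_intercept(genotypes, betas, target=0.5):
    """Bisection for the intercept giving the target mean case probability."""
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        mean_p = sum(case_probability(g, betas, mid) for g in genotypes) / len(genotypes)
        if mean_p < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_case_control(genotypes, betas, n_cases, n_controls, seed=1):
    """Simulate case status for a cohort, then draw cases and controls."""
    rng = random.Random(seed)
    intercept = calibrate_intercept(genotypes, betas)
    status = [1 if rng.random() < case_probability(g, betas, intercept) else 0
              for g in genotypes]
    cases = [i for i, z in enumerate(status) if z == 1]
    controls = [i for i, z in enumerate(status) if z == 0]
    if len(cases) < n_cases or len(controls) < n_controls:
        raise ValueError("cohort too small for the requested design")
    chosen = rng.sample(cases, n_cases) + rng.sample(controls, n_controls)
    return [(genotypes[i], status[i]) for i in chosen]

# Toy cohort: 1500 subjects, 5 independent SNPs (illustrative only).
toy_rng = random.Random(0)
maf = 0.3
cohort = [[(toy_rng.random() < maf) + (toy_rng.random() < maf) for _ in range(5)]
          for _ in range(1500)]
# One small (OR 1.2) and one moderate (OR 1.5) effect; other SNPs are null.
betas = [math.log(1.2), math.log(1.5), 0.0, 0.0, 0.0]
data_set = sample_case_control(cohort, betas, n_cases=500, n_controls=500)
```

Setting all entries of `betas` to zero gives a null data set of the kind used to estimate type I error rates.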
To assess the impact of the number of genes within a GS on the power and type I error rate, we also varied the size of the GS by removing 10 genes from the pathway for some simulations, so that the GS size was 17 or 27. LD between SNPs within the genes was also varied by tagging each gene at an $r^2$ of 0.60 or 0.90. The different simulation scenarios are listed in Table 2. Four ‘null’ scenarios (all $\beta_i=0$) with no association of SNPs within the GS or pathway and 20 ‘non-null’ scenarios (some $\beta_i \neq 0$) were simulated. In total, 1000 data sets were generated for each scenario, and all simulated data sets were analyzed with the GM one-step and two-step approaches, along with the other approaches. Individual SNP association P-values were based on the Armitage trend test. For GSA using the PC approach, the top k PCs needed to explain 80% of the variation in the SNP genotypes within each gene (for the two-step GSA), or GS (for the one-step GSA), were used as predictors of case–control status in the logistic regression model. The R library ‘globaltest’ with the logistic model option was used to fit the GMRE (http://bioconductor.org/packages/2.6/bioc/html/globaltest.html). Empirical gene-set association P-values were based on 1000 permutations of the phenotype. Power and type I error rates were estimated based on a 0.05 significance level.
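For reference, the single-SNP trend test and the rejection-rate estimate can be sketched in a few lines. The sketch assumes additive genotype scores (0, 1, 2) and uses the identity that the Cochran-Armitage trend statistic equals N times the squared correlation between genotype score and case status, with an asymptotic $\chi^2$ P-value on 1 degree of freedom. Function names are hypothetical.

```python
import math

def armitage_trend_test(genotypes, status):
    """Cochran-Armitage trend test with additive scores 0/1/2.

    Returns (chi-square statistic, asymptotic 1-df P-value).
    A monomorphic SNP or a single-class phenotype yields (0.0, 1.0).
    """
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(status) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, status))
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    syy = sum((y - mean_y) ** 2 for y in status)
    if sxx == 0 or syy == 0:
        return 0.0, 1.0
    chi2 = n * sxy * sxy / (sxx * syy)
    # Upper tail of a 1-df chi-square: P(Z^2 > c) = erfc(sqrt(c / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

def rejection_rate(pvalues, alpha=0.05):
    """Fraction of data sets with P <= alpha: power under a non-null
    scenario, type I error rate under a null scenario."""
    return sum(1 for p in pvalues if p <= alpha) / len(pvalues)

# Six subjects: genotype scores and case (1) / control (0) status.
chi2, p_value = armitage_trend_test([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1])
# chi2 is 4.0 (up to rounding) and p_value is about 0.0455.
```

Applying `rejection_rate` to the 1000 P-values obtained for one scenario gives the estimated power or type I error rate at the 0.05 level.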
RESULTS
All methods had correct type I error rates (Supplementary Table 1). Summaries of the power for the various methods across different simulation scenarios are presented in Tables 3 and 4, and Figure 1. Supplementary Table 2 presents the entire set of results for all simulation scenarios. The distribution of power for each method over all investigated scenarios (disease association models 1–5 with different levels of LD and GS size) is summarized in Table 3, whereas the mean power of each method by scenario is shown in Table 4. The results show that, on average across the considered scenarios, the two-step approaches had higher power than the one-step approaches. The one-step FM (GM with STT=1/e), PC, and GMRE approaches had the lowest average power (mean power=0.57, 0.58 and 0.60, respectively). Their power was low compared with the two-step methods especially under scenarios with a smaller number of genes in the GS (ie, for the reduced GSs with 17 rather than 27 genes).
A comparison of a range of STT values for the GM for performing the second step of the two-step GSA (ie, summarization of the gene-level association P-values to a gene-set P-value) found that power was improved when a smaller STT was used, with STT between 0.05 and 0.20 providing the highest power for our simulation scenarios (Figure 2). On average, there was little difference in power between the four approaches (PC, GMRE, GMFE and MinP) for obtaining a gene-level P-value in step one of the two-step methods, with slightly higher mean power across scenarios for the PC approach over the fixed-effects (GMFE), random effects (GMRE) or MinP approaches. For the scenarios investigated, the level of LD used for SNP selection (and thus number of SNPs per gene) had little effect on the power of the GSA methods. In general, the various GSA methods were more powerful under scenarios with a smaller number of genes in the GS (ie, reduced GSs with 17 rather than 27 genes); however, this power increase was only observed for the two-step methods, and not when one-step analyses were performed.
Comparing the power across scenarios (Table 4) indicates that the power of the one-step GSA methods and the MinP-GM two-step method depended strongly on the true underlying disease model. In contrast, the other two-step approaches, such as the PC-GM approach, had consistently good power, relative to other approaches, for all scenarios assessed. Nevertheless, the one-step approaches performed very well for scenarios 3 and 5, with average power ranging from 0.986 to 1.0 and 0.925 to 1.0, respectively. These scenarios represent the case in which there are three moderate effects in three genes (one large and two smaller genes) (scenario 3) and the setting in which there are two small effects in each of three genes (scenario 5).
CDDP pharmacogenomic study
Results from the application of the one-step and two-step GM approaches as well as the other investigated GSA methods to the glutathione metabolism GS are presented in Table 5. The only method that produced a P-value less than 0.05 was the PC-GM approach with STT of 0.01 or 0.05 (PC.GM_0.01 P-value=0.023, PC.GM_0.05 P-value=0.043). This is consistent with the idea that the two-step GM method with small STT is generally more powerful than the other methods, as we had found in the simulation study. The one-step approaches resulted in the largest P-values, ranging from 0.23 to 1.0. In addition to the one-step approaches producing large GS P-values, the two-step approaches that used a full model with fixed effects to determine the gene-level P-values for association with IC50 followed by the GM also produced large P-values for association of the glutathione metabolism GS with IC50 (P-values ranging from 0.443 to 0.627).
Discussion and conclusions
In this manuscript we propose a novel GSA approach that uses the GM to combine gene-level P-values to determine the association of the GS with a phenotype. In our simulations the two-step GM approach, with either PC analysis or GMRE for determining gene-level P-values, followed by GM with a STT value between 0.05 and 0.20 for combining the gene-level P-values, had the best power across a range of disease models. The GM was previously proposed by Zaykin et al13 as a method for combining single SNP P-values in the context of genetic association studies. Here we extended this idea by considering the GM in combination with various gene-level tests of association, including fixed and random effects models and PC analysis, for a two-step GSA. We compared this approach with alternatives, including the GM applied to individual SNP P-values for a one-step GSA.
The presented simulation results showed that among the two-step GSA methods, the PC-GM, GMRE-GM and GMFE-GM performed similarly, regardless of disease model (scenario); however, the performance of the MinP-GM approach depended greatly on the true underlying disease model (eg, high power when there is one moderate SNP effect within a gene and low power when there is a small SNP effect within a gene). Under the scenarios considered in our simulation study, for the second step in the two-step GSA, combining gene-level P-values using the GM with STT values between 0.05 and 0.20 was more powerful than GM with STT=1/e (ie, FM). However, depending on the true underlying disease risk model, other STT values may lead to higher power. One option, therefore, is to consider a range of shape parameters for the Gamma transformation when combining P-values with the GM, selecting the minimum GSA P-value, and correcting for multiple testing. However, such an approach would introduce new challenges (eg, deciding on an appropriate correction for multiple testing) and may actually reduce power as a result of running more analyses requiring a correction for multiple testing.
The results of our simulation study also indicate that two-step methods are generally more powerful for detecting GS association as compared with one-step methods. For two of our simulated scenarios (scenarios 3 and 5), the one-step PC and one-step GMRE analyses were more powerful than the two-step analyses. In one of these scenarios there were three moderate effects in three genes, whereas in the other scenario, there were two small effects in each of three genes. For the remaining scenarios, the two-step GSA approaches were much more powerful than the one-step GSA approaches. Under the scenarios for which the one-step PC and GMRE methods were more powerful (scenarios 3 and 5), univariate analysis of individual SNPs with a Bonferroni correction for the number of SNPs in the GS had very high power to detect the SNP effects. Thus, the genetic effects in these two scenarios could have been detected by a typical GWAS single SNP analysis. In the remaining scenarios, where the two-step GSA approaches were much more powerful than the one-step approaches, the analysis of individual SNPs had low power. These are scenarios for which analysis of each SNP individually may not have detected any significant association, but aggregation of the small effects via GSA may have identified significant GSs. These represent the situations that motivate GSA, and in these situations the two-step GSA was particularly advantageous.
GSA of data from the CDDP pharmacogenomic study using the two-step GM approach with gene-level P-values determined by PC analysis (PC-GM) suggested the glutathione metabolism GS is associated with the IC50 phenotype (P<0.05). Although analysis of a single data set cannot be used to compare power of alternative approaches, the fact that these analyses provided stronger evidence for the association than did the other methods is consistent with the idea that the two-step GM approach, in particular the PC-GM method, is more powerful than other GSA approaches, as suggested by our simulation study.
In summary, GSA is a compelling approach for analysis of complex genetic data. On the basis of this study, we found that a two-step GM approach with STT between 0.05 and 0.20, in particular the PC-GM or GMRE-GM approach, is a powerful approach for GSA. Once a GS is shown to be associated with a complex phenotype, further research is needed to assess the relationships between the SNPs and genes within the GS and the phenotype, and to reveal the biological pathways underpinning this relationship. GSA of existing GWAS data is expected to contribute to insights into the complex relationships between genomic variation and clinical phenotypes.
We thank Krishna (Rani) Kalari for the mapping of SNPs to genes within the glutathione metabolism gene set. The research was supported by the US National Institutes of Health (GM61388, CA140879, AA019570, CA130828, GM86689), a pilot project award from the Mayo Clinic SPORE in Ovarian Cancer (CA136393) and a Minnesota Partnership for Biotechnology and Medical Genomics grant. The funders had no role in study design, data collection and analysis, decision to publish or in preparation of the manuscript.
About this article
BLF, GDJ and JMB conceived and designed the experiments. GDJ simulated and analyzed the simulated and real data. LW and AMM performed the pharmacogenomic cytotoxicity study. BLF, GDJ and JMB wrote the paper.
Supplementary Information accompanies the paper on the European Journal of Human Genetics website (http://www.nature.com/ejhg).