GSAASeqSP: A Toolset for Gene Set Association Analysis of RNA-Seq Data

RNA-Seq is quickly becoming the preferred method for comprehensively characterizing whole transcriptome activity, and the analysis of count data from RNA-Seq requires new computational tools. We developed GSAASeqSP, a novel toolset for genome-wide gene set association analysis of sequence count data. This toolset offers a variety of statistical procedures via combinations of multiple gene-level and gene set-level statistics, each having their own strengths under different sample and experimental conditions. These methods can be employed independently, or results generated from multiple or all methods can be integrated to determine more robust profiles of significantly altered biological pathways. Using simulations, we demonstrate the ability of these methods to identify association signals and to measure the strength of the association. We show that GSAASeqSP analyses of RNA-Seq data from diverse tissue samples provide meaningful insights into the biological mechanisms that differentiate these samples. GSAASeqSP is a powerful platform for investigating molecular underpinnings of complex traits and diseases arising from differential activity within the biological pathways. GSAASeqSP is available at http://gsaa.unc.edu.

one statistic (Weighted_KS) for gene set association analysis.
Proc Natl Acad Sci U S A. (Subramanian, et al., 2005) 7 SeqGSEA Integrative gene set enrichment analysis of differential expression and splicing for RNA-Seq data. SeqGSEA designed an additional tool for gene set enrichment analysis purely based on differential expression -the DE-only analysis. This tool uses an existing R package, DESeq 2 , for gene differential expression analysis, and uses the Weighted_KS statistic for gene set enrichment analysis.

Differential expression analysis
In GSAASeqSP, we provide three statistics for differential expression analysis of individual genes: Signal2Noise, log2Ratio, and Signal2Noise_log2Ratio. GSAASeqSP employs a sample-based permutation procedure to assess the association significance, so in this step a differential expression score and a p-value are computed for each gene for both the observed data and permutations.
Consider two phenotype classes, and : (1) Signal2Noise is the absolute value of the difference of the class means scaled by the standard deviation Where and are the means and standard deviations of expression values of gene in classes and , respectively. Each s-score is normalized by the sum of s-scores of all genes in the data set. The value of the statistic represents the extent to which a gene is differentially expressed between two phenotypic classes; bigger value indicates higher differential expression.
To estimate p-values, we first generate the gene-wise null distribution of the Signal2Noise statistic by a   sample-based permutation procedure, and we then calculate a p-value for the test statistic based on its null   distribution. Suppose   is the value of the Signal2Noise statistic for the observed data and  are the values for permutations . The p-value for the Signal2Noise statistic is computed as Where is an indicator variable that is one if and is otherwise zero. Smaller p-value indicates higher probability that a gene is differentially expressed between two phenotypic classes.
(2) log2Ratio is the absolute value of the log2 ratio of the class means Where are the means of expression values of gene in classes and , respectively. Each r-score is normalized by the sum of r-scores of all genes in the data set. The value of the statistic represents the extent to which a gene is differentially expressed between two phenotypic classes; bigger value indicates higher differential expression.
We first generate the gene-wise null distribution of the log2Ratio statistic by a sample-based permutation procedure, and we then estimate a p-value for the test statistic based on its null distribution. Suppose is the value of the log2Ratio statistic for the observed data and are the values for permutations . The p-value for the log2Ratio statistic is computed as Where is an indicator variable that is one if and is otherwise zero. Smaller p-value indicates higher probability that a gene is differentially expressed between two phenotypic classes.
(3) Signal2Noise_log2Ratio is the mean of Signal2Noise and log2Ratio We first generate the gene-wise joint null distribution of the Signal2Noise statistic and log2Ratio statistic by a sample-based permutation procedure, and we then calculate a gene's differential expression score and its p-value based on this null distribution. Suppose and are the values of the Signal2Noise statistic and log2Ratio statistic for the observed data, respectively. and are the values for permutations . These s-scores and r-scores are from different types of tests and thus have different scales. In order to bring these scores to a common scale, we convert them to standard scores. The standard s-scores are computed as (5) and the standard r-scores are similarly computed as (6) Where and are the means and standard deviations of the null distributions corresponding to and , respectively. The Signal2Noise_log2Ratio statistic is then defined by Where is the minimum score of all standard s-scores and standard r-scores of all genes over the observed data and all permutations. is used to convert all z-scores to positive values. The value of the statistic represents the extent to which a gene is differentially expressed between two phenotypic classes; bigger value indicates higher differential expression. The p-value from the joint distribution is computed as Where is an indicator variable that is one if and , and is otherwise zero. This statistic was created by modifying an existing statistic introduced by NOISeq 3 . Smaller p-value indicates higher probability that a gene is differentially expressed between two phenotypic classes.

Gene set association analysis 2.2.1 Computation of gene set association scores
Ten statistics are evaluated for gene set association analysis of RNA-Seq data: Weighted_KS, L2Norm, Mean, WeightedSigRatio, SigRatio, GeometricMean, TruncatedProduct, FisherMethod, MinP, and RankSum. These statistics can be divided into three categories: score based (Weighted_KS, L2Norm, Mean, WeightedSigRatio, SigRatio), p-value based (GeometricMean, TruncatedProduct, FisherMethod, MinP) and rank based (RankSum). In this step, for a particular gene set including genes, given the differential expression scores and the corresponding p-values for all genes in the gene set, a gene set association score (AS) is computed for both the observed data and permutations based on any of the following gene set-level statistics. The differential expression scores and p-values can be computed by any of the above three gene-level statistics: Signal2Noise, log2Ratio, or Signal2Noise_log2Ratio.

(1) Weighted Kolmogorov-Smirnov test (Weighted_KS)
Given the differential expression scores of all genes in the data set, a weighted Kolmogorov-Smirnov (KS) test is used to determine the extent to which a gene set is associated with a given phenotype. This test was originally proposed by Subramanian et al 7 . Essentially, the weighted KS test determines for each gene set whether the genes belonging to that gene set are preferentially near the top of the ranked ordered list based on differential expression scores. More specifically, for gene set , given the differential expression scores for all genes in the data set, a running association score for the rank ordered genes in positions is computed as Where is an indicator variable that is one if the j th gene in the rank ordered list is in gene set and is otherwise zero. Similarly, takes the value of zero if the j th gene is in the gene set and is otherwise one. The association score of gene set , , is the maximum deviation from zero of the running association score over the positions Finally, if then the final gene set association score , otherwise .
The AS indicates the strength of the association between the gene set and the phenotype, bigger value indicates stronger association. The differential expression scores we used lack directionality, so a negative AS(S) means there is no association between the gene set and the phenotype. We set AS(S) = 0.0001 if AS(S)<0 so that negative AS will not confuse the interpretation.
(2) L 2 -norm (L2Norm) We here use the L 2 -norm of the differential expression scores of all genes in a gene set as the test statistic to evaluate the association of that gene set with the phenotype. This statistic has been adopted as the gene set-level statistic for analyzing microarray data by SAM-GS 10 , The association score of the gene set based on the L 2 -norm of differential expressions is computed as The AS indicates the strength of the association between the gene set and the phenotype, bigger value indicates stronger association.

(3) Mean (Mean)
The simple mean statistic tests whether the mean of differential expression scores in the gene set from the observed data is bigger than what would be expected from random samples. The association score of the gene set based on the mean test is computed as The AS indicates the strength of the association between the gene set and the phenotype, bigger value indicates stronger association.

(4) Weighted Significance Ratio (WeightedSigRatio)
The Weighted Significance Ratio (WSR) is defined as the ratio of the sum of significant differential expression scores to the sum of all differential expression scores in the gene set. The association score of the gene set based on the WSR test is computed as (13) Where is the p-value threshold that is used as a cutoff for determining significance. is a parameter that can be specified by users in GSAASeqSP. In this study, we set ; is an indicator variable that is one if , and is otherwise zero. The AS indicates the strength of the association between the gene set and the phenotype, bigger value indicates stronger association.

(5) Significance Ratio (SigRatio)
The Significance Ratio (SR) test compares the proportion of significant genes to all genes in the gene set. The association score of the gene set based on the SR test is computed as (14) Where is the p-value cutoff, it was set to 0.05 in this study. is a parameter that can be specified by users in GSAASeqSP. is an indicator variable that is one if , and is otherwise zero. The AS indicates the strength of the association between the gene set and the phenotype, bigger value indicates stronger association.

(6) Geometric mean (GeometricMean)
The geometric mean (GM) is defined as the nth root of the product of n numbers. Here we use the geometric mean of the p-values over all genes in a gene set as the test statistic to evaluate the association of that gene set with the phenotype. The association score of the gene set based on the GM test is computed as The AS indicates the strength of the association between the gene set and the phenotype, smaller value indicates stronger association.

(7) Truncated product method (TruncatedProduct)
Truncated product method (TPM) 11 is a global test that combines p-values from several hypothesis tests to provide a p-value for the overall hypothesis that all single hypotheses are true. It has several advantages over the ordinary Fisher product test. Here we use TPM test statistic to combine p-values of genes in a gene set into an overall significance level to investigate the association of that gene set with the phenotype. The association score of the gene set based on the TPM is computed as Where is the p-value cutoff, it was set to 0.05 in this study. is an indicator variable that is one if , and is otherwise zero. The AS indicates the strength of the association between the gene set and the phenotype, smaller value indicates stronger association.

(8) Fisher's method (FisherMethod)
Fisher's method is widely used to combine p-values from multiple independent tests. Here we use Fisher's method to summarize p-value evidence from all genes in a gene set into an overall significance level. The association score of the gene set based on the Fisher product test is computed as The AS indicates the strength of the association between the gene set and the phenotype, smaller value indicates stronger association.

(9) Minimum p-value (MinP)
We modified a MinP statistic 12 and use it as the test statistic for gene set association analysis. For gene set , given the p-value matrix of all genes in calculated from the observed data and permutations , the association score based on the MinP method is computed as Where is an indicator variable that is one if , and is otherwise zero.
The AS indicates the strength of the association between the gene set and the phenotype, smaller value indicates stronger association.

(10) Rank Sum (RankSum)
The Rank Sum is the sum of ranks of all gens in the gene set. We first transform the differential expression scores of all genes in the data set into ranks , the association score of the gene set based on the Rank Sum statistic is therefore defined as The AS indicates the strength of the association between the gene set and the phenotype, smaller value indicates stronger association.

Normalization of gene set association scores
To correct for possible heterogeneity of information at each gene set, for example differences in the gene set size or correlation structure, we normalize the AS by the mean of its null distribution generated by permutations. For a particular gene set , given its actual AS and ASs calculated from permutations , the normalized association score (NAS) is computed as This normalization method was originally introduced by GSEA 7 . The normalized association scores are then used for downstream analyses.

Generation of simulated data
To evaluate the effectiveness of different gene-level and gene set-level statistics, we conducted a comprehensive simulation study. We designed 6 scenarios of differential expression. For each scenario, 200 case-control data sets were independently generated from the same statistical model. In each data set, we simulated 200 cases and 200 controls, and for each case and control, we simulated the read counts of 1000 genes. It is well known that genetic variants are one of the major determinants that cause differential expression 13,14 . Therefore, in our simulations, we assume that the expression difference observed between cases and controls are the results from genotypic difference. Based on this assumption, we first simulated the genetic association between gene sets and phenotype then simulated the differential expression corresponding to the genetic association.

Gene set data
For each simulation we generated 100 gene sets. Only the first gene set (causal gene set) included risk genes or differentially expressed genes. We randomly chose a pathway, P53PATHWAY that contains 16 genes from MSigDB as a prototype to simulate the causal gene set. The gene expression and genotype information of P53PATHWAY were obtained from breast invasive carcinoma (BRCA) RNA-Seq data and glioblastoma data generated through The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov) project. The remaining 99 gene sets were simulated from null models, namely none of genes in these gene sets were associated with the phenotype of interest with respect to gene expression profiles or genotypes. The sizes of null gene sets were randomly drawn from U [15,30]. Genes within null gene sets were randomly drawn from a pool of 984 non-causal genes.

SNP data
Each simulated SNP data set included 1000 genes, each gene with one genotyped SNP for a total of 1000  15 . Based on these parameter settings, the genotype data were generated by PLINK 16 (http://pngu.mgh.harvard.edu/purcell/plink/). We then assigned the case-control status based on the model .
Where is the number of causal SNPs in the causal gene set. denotes the coding of the genotype at causal SNP for sample with effect size that is the log odds ratio at SNP . denotes a random sample-specific error term for sample , is sampled from a standard normal distribution.

Gene expression data
Each simulated gene expression data set consisted of 1000 genes corresponding to the 1000 genes in the SNP data set. Some of these genes were considered risk genes that were differentially expressed in cases and controls. We first generated the baseline expression levels for all genes in the data set from the negative binomial (NB) distribution. The NB-distributed read counts were generated by a function in the R Stats Package -rnbinom (n, size, prob, mu). The parameters mu and size were estimated from the normal samples in the TCGA BRCA RNA-Seq data set. Next, we added disease effect to the causal genes in the causal gene set based on the model , where is the expression level of gene in sample , is the baseline expression level of gene in sample , denotes the coding of the genotype at SNP for sample in the SNP data. is the effect size of the genotype on gene expression and reflects the degree to which the gene expression is correlated with the genotype of tag SNP of the same gene. was drawn from either U [0.8, 1], U [1,3], or U [2,4].