Integrative differential expression and gene set enrichment analysis using summary statistics for scRNA-seq studies

Differential expression (DE) analysis and gene set enrichment (GSE) analysis are commonly applied in single cell RNA sequencing (scRNA-seq) studies. Here, we develop an integrative and scalable computational method, iDEA, to perform joint DE and GSE analysis through a hierarchical Bayesian framework. By integrating DE and GSE analyses, iDEA can improve the power and consistency of DE analysis and the accuracy of GSE analysis. Importantly, iDEA uses only DE summary statistics as input, enabling effective data modeling through complementing and pairing with various existing DE methods. We illustrate the benefits of iDEA with extensive simulations. We also apply iDEA to analyze three scRNA-seq data sets, where iDEA achieves up to five-fold power gain over existing GSE methods and up to 64% power gain over existing DE methods. The power gain brought by iDEA allows us to identify many pathways that would not be identified by existing approaches in these data.

Simulations were performed on one fixed scRNA-seq data set with 0 = −2, varying 1 and CR. 1 is set to be 0.25, 0.5,1.0 or 5.0 and CR is set to be 1%, 2%, 5%, 10% respectively. In each simulation setting, power of DE results between common DE method (zingeR (blue), MAST (green), edgeR (purple)) and iDEA (orange) with summary statistics obtained from that corresponding DE method when adding simulated gene set (filling color) or not (not filling color) is plotted. Here, power was calculated based on an FDR of 5%. Figure 6. iDEA provides the powerful performance on DE analysis when varying gene set enrichment coefficient and coverage rate CR than zingeR especially when gene set enrichment parameter is higher. The data were simulated based on the parameter setting 0 = −2, 1 = 0, 0.25, 0.5, 1.0 or 5.0 and CR = 1%, 2%, 5% or 10%. iDEA identifies more significant gene sets on simulation studies when varying parameters. CR represents the percentage of genes inside the gene set. Here, power was calculated based on an FDR of 5%.

Supplementary Figure 7. Distribution of marginal DE p-values from common DE methods.
The data were simulated based on the parameter setting 0 = −2, 1 = 0.5 and CR = 10%. Pvalues from zingeR (blue) and MAST (green) follow approximately a uniform distribution under the null while P-values from edgeR (purple) does not.

Supplementary Figure 8. iDEA produces calibrated (or slightly conservative) FDR estimates.
Simulations were performed on one fixed scRNA-seq data set with 0 = −2, varying 1 and CR. 1 is set to be 0, 0.25, 0.5,1.0 or 5.0 and CR is set to be 1%, 2%, 5%, 10% respectively. In each simulation setting, the scatterplot plot showed the number of detected signals based on true FDR (y-axis) versus the number of detected signals based on estimated FDR (x-axis). The yellow line is the reference line which represents y = x. Figure 9. iDEA displays high consistency in detecting DE genes in simulations. The data were simulated based on the parameter setting 0 = −2, 1 = 5, and CR = 10%. The plot shows the Jaccard index for top DE genes between zingeR, edgeR and MAST (blue) and the Jaccard index for top DE genes between iDEA when using summary statistics from zingeR, edgeR and MAST respectively(orange). CR represents the percentage of genes inside the gene set, 0 represents number of DE genes and 1 represents the gene set enrichment coefficient.  Figure 14. iDEA displays high consistency in detecting DE genes in human embryonic stem cell scRNA-seq data. iDEA displays higher Jaccard index in the common DE genes. Jaccard index for top DE genes at an FDR of 1% between zingeR, MAST and edgeR, Jaccard index for top DE genes at an FDR of 1% between iDEA when using summary statistics from zingeR, MAST and edgeR, respectively. (B); Overlap in top DE genes at an FDR of 1% between zingeR, MAST and edgeR (A); Overlap in top DE genes at an FDR of 1% between iDEA when using summary statistics from zingeR, MAST and edgeR (C); (D) and (E) are just another visualization of the overlap corresponding to (A) and (C) by UpSetR 1 . Figure 15. iDEA displays high consistency in detecting DE genes in human embryonic stem cell scRNA-seq data. Overlap in top DE genes at an FDR of 0.1% between zingeR, MAST and edgeR (A); Overlap in top DE genes at an FDR of 0.1% between iDEA when using summary statistics from zingeR, MAST and edgeR respectively (B); Overlap in top DE genes at an FDR of 5% between zingeR, MAST and edgeR (C); Overlap in top DE genes at an FDR of 5% between iDEA when using summary statistics from zingeR, MAST and edgeR, respectively (D);  Figure 20. iDEA displays high consistency in detecting DE genes in mouse neuronal cell scRNA-seq data. iDEA displays higher Jaccard index in the common DE genes. Jaccard index for top DE genes at an FDR of 1% between zingeR, MAST and edgeR, Jaccard index for top DE genes at an FDR of 1% between iDEA when using summary statistics from zingeR, MAST and edgeR respectively(red) (B); Overlap in top DE genes at an FDR of 1% between zingeR, MAST and edgeR (A); Overlap top DE genes at an FDR of 1% between iDEA when using summary statistics from zingeR, MAST and edgeR (C); (D) and (E) are just another visualization of the overlap corresponding to (A) and (C) by UpSetR 1 . Figure 21. iDEA displays high consistency in detecting DE genes in mouse neuronal cell scRNA-seq data. Overlap in top DE genes at an FDR of 0.1% between zingeR, MAST and edgeR (A); Overlap in top DE genes at an FDR of 0.1% between iDEA when using summary statistics from zingeR, MAST and edgeR respectively (B); Overlap in top DE genes at an FDR of 5% between zingeR, MAST and edgeR (C); Overlap in top DE genes at an FDR of 5% between iDEA when using summary statistics from zingeR, MAST and edgeR respectively (D); Simulations were performed on one fixed scRNA-seq data set with 0 = −2, varying 1 and CR. 1 is set to be 0.25, 0.5,1.0 or 5.0 and CR is set to be 1%, 2%, 5%, 10% respectively. In each simulation setting, power of DE results between common DE method (zingeR (blue), MAST (green), edgeR (purple)) and iDEA (orange) with summary statistics obtained from that corresponding DE method when adding simulated gene set (filling color) or not (not filling color) is plotted. The power was calculated as the percentage of truly DE genes detected in each method. Here, power was calculated based on an FDR of 5%. ; respectively under the null that simulated one fixed scRNA-seq data set and permute the gene set 10,000 times. Here, the other parameters are set to be 0 = −2, 1 = 0 and CR = 10%. CR represents the percentage of genes inside the gene set. gc is genomic control factor.

Supplementary Note 1. EM-MCMC Inference Algorithm
The iDEA model is described in detail in the Methods. Here, we describe the detailed algorithm for inference. As explained in the main text, our goal is to infer the posterior probability of = 1 as evidence for j-th gene being DE and test the null hypothesis 0 : 1 = 0 that DE genes are not enriched in the gene set. To achieve both goals, we develop an efficient expectation maximization (EM)-Markov chain Monte Carlo (MCMC) algorithm. To simplify notation, we denote as the p-vector of the underlying true effect sizes, or = ( 1 , 2 , ⋯ , ) . We denote as the p-vector of the indicator variables, or = ( 1 , 2 , ⋯ , ) . We denote 2 as the variance of the marginal DE effect size estimate for j-th gene, or 2 = se 2 (̂). We treat both and as missing data and write out the complete likelihood as where we have also ignored the constant terms in the above equation and = exp( 0 + 1 ) 1+exp( 0 + 1 ) . With the above complete likelihood, we can derive the expectation step (E-Step) and maximization step (M-Step) as follows.

Expectation Step (E-Step)
In the E-Step, we obtain the expectation of equation (1) which involves evaluating the expectations ( ), ( ) and ( 2 ) . These expectations are obtained under the conditional distributions P( , |̂, 0 ( ) , 1 ( ) , ( ( ) ) β 2 ), with 0 ( ) , 1 ( ) and ( ( ) ) β 2 being the estimates from the previous iteration t. These conditional distributions are unfortunately not available in analytic forms. Therefore, we use Markov Chain Monte Carlo (MCMC) to obtain these expectations. Specifically, we develop a Gibbs sampling to sample the posterior distributions for and in an alternate fashion. Afterwards, we use these posterior samples to evaluate the above expectations. To do so, we first integrate out from the complete likelihood and obtain the conditional distribution for as Then posterior distribution of is, where =1 Next, we recognize from the complete likelihood that the conditional distribution of given = 1 is normal: Certainly, = 0 if = 0.

Maximization Step (M-Step)
In the M-Step, we obtain the parameter estimates for 0 , 1 and β 2 that maximize the Q function obtained in the E-Step. For 0 and 1 , we obtain the first derivatives of the Q function with respect to each parameter as We also obtain the second derivatives as where is calculated as the expectation of the indicator variable E ( |̂, 0 ( ) , 1 ( ) , ( ( ) ) β 2 ) . And is used in the following Newton-Raphson algorithm to obtain the parameter estimate of the intercept 0 and gene set coefficient 1 . Afterwards, we use the Newton-Raphson algorithm for optimization and obtain estimates of 0 ( +1) and 1 ( +1) .
For β 2 , we obtain the first derivatives of the Q function with respect to β 2 as which leads to an analytical update for β 2 as The EM-MCMC algorithm thus iterates between the E-step and the M-step until converge. The EM-MCMC algorithm allows us to directly obtain the parameter estimate ( ), which is the posterior probability of j-th gene being a DE gene. This posterior probability is also commonly referred to as the posterior inclusion probability (PIP) in other settings. We use these posterior probabilities to serve as DE evidence. In addition, the EM-MCMC algorithm also provides an estimate for

Supplementary Note 2. Louis Method for p-value Computation
Here, we describe the details of the Louis method for computing the standard error of 1. In the EM-MCMC algorithm described in the previous section, we can obtain the information matrix for ( 0 , 1 ) based on the log complete likelihood logPr(̂, , | 0 , 1 , β 2 ) as described in equation (1). For completeness, we re-write the information matrix in the complete likelihood as a 2 by 2 matrix where ̂= is computed based on the ̂ estimates from the last EM step and is the annotation for j-th gene. Our goal, however, is to obtain the information matrix for ( 0 , 1 ) based on the marginal log likelihood logPr(̂| 0 , 1 , β 2 ) , also known as the observed likelihood. Such marginal information matrix can be obtained based on the complete information matrix through an adjustment using the Louis method 2,3 . Specifically, with the posterior inclusion probably PIP for j-th gene obtained from EM steps, we compute the information matrix in the incomplete likelihood as a 2 by 2 matrix Finally, the observed information matrix is adjusted by

= −
Once we compute the marginal information matrix, we can obtain the standard error se 2 (1) as the corresponding element in the inverse of the information matrix .

Supplementary Note 3. Application to an oral carcinoma bulk RNAseq dataset
To illustrate the flexibility of the modeling framework in iDEA, we applied iDEA to analyze a publicly available bulk RNASeq dataset from Tuch et al 4 Figure 30C). For example, at an empirical FDR of 5%, iDEA identified 2075 significantly enriched gene sets, which is 17%, 80%, 20% higher than fGSEA (1777), CAMERA (1154), and GSEA (1733) Figure 30D). The 50 selected important DE genes identified by iDEA clearly distinguishes the normal tissue and cancer tissue (Supplementary Figure 30F). Importantly, using the key markers provided by the original study 9 16 . Among them, CRNN has been studied to be the potential prognostic marker of OSCC due to its downregulation in oral squamous cell carcinoma samples, WNT10A plays an important role in accelerating of the progression of carcinomas via activating EMTs and local invasiveness 11 , PTHLH is indispensable for the pathogenesis of oral squamous cell carcinoma by affecting cell proliferation and cell cycle 12 . TGFBR3 is an important activator of GDF10, which is downregulated during oral carcinogenesis and involved in the suppression of cell survival 16 .
Further, we also evaluated the GC content and gene length effect. For all the genes in the dataset, we first calculated the GC content and gene length and then we create two gene sets corresponding to the levels of GC content and gene length. Specifically, for the GC content, we use the continuous value of GC content for each gene as the gene set. For the gene length, we created a binary gene set if gene length is higher than the average of the gene length, than this gene is in this gene set and annotated as 1 otherwise 0. Finally, by adding these two gene sets into iDEA, we calculated the p-values for these two gene set to represent the significance of GC content effect and gene length effect correspondingly. In the analyses, we did not observe obvious GC content effect (p-value = 0.31) and gene length effect (p-value = 0.08) in this dataset.

Model and statistical inference
As explained in the Discussion, the iDEA model makes an implicit assumption that the prior depends on the sample size. Here, we provide an iDEA variant with alternative modeling assumption that does not have the prior dependence on the sample size. Here, the input summary statistics are again in the form of marginal DE effect size estimate ̂ and its standard error se(̂), = 1,2, ⋯ , where p is the number of genes. We assume that the true effect size follows a mixture of two distributions depending on whether j-th gene is a DE gene or not: where is the prior probability of being a DE gene; β 2 is a scaling factor that determines the DE effect size strength; and 0 is the Dirac function that represents a point mass at zero. Therefore, with proportion , j-th gene is a DE gene and its only maintain the β 2 as the variance in the prior distribution of true effect size. In this way, the effect size is no longer depend on the standard error se(̂) and thus the sample size. For the rest of the algorithm and methodology, we just followed the same procedure as the original model. Here, we mainly report the modified parts in the EM algorithm when using this model. The complete likelihood changed to be the following logPr(̂, , | 0 , 1 , β 2 ) = log{Pr(̂| , ) Pr( | , β 2 ) Pr( | 0 , 1 ) Pr( β 2 | β, β )} In the E-step: We obtained the expectation of log likelihood, = [logPr(̂, , | , β 2 )].
In the M-step, estimation of intercept 0 and gene set enrichment parameter 1 is just following the same Newton-Raphson algorithm. For the estimation of β 2 , we obtain the first derivatives of the Q function with respect to β 2 as which leads to an analytical update for β 2 as

Simulations and real data applications in iDEA variant
Following the same procedures of simulations in manuscript, we compared the performance of iDEA modeling on beta effect size with the commonly used GSE analysis methods fGSEA, CAMERA, PAGE and GSEA. We found that consistent with the results generated from modeling the z score, iDEA produces wellcalibrated p-values under the null in different simulation scenarios (Supplementary Figure 31) despite slight inflation when the number of DE genes is small. Besides type I error control, we found that iDEA is more powerful than the other GSE methods across a range of alternative scenarios at a fixed false discovery rate (FDR) of 5% (Supplementary Figure 32). For DE analysis, we found that iDEA can improve DE analysis power regardless of whether the summary statistics are from MAST, edgeR or zingeR (Supplementary Figure 33). For example, with 1 = 5 and CR = 10%, iDEA achieves a power of 84%, 61% and 84% at a true FDR of 5%, when it uses the input summary statistics obtained from zingeR, MAST and edgeR, respectively. In contrast, the power of these three different DE methods are 66%, 53%, and 67%, respectively (Supplementary Figure 33).
We also compared the performance of iDEA modeling on beta effect size in DE analysis and GSE analysis in the mouse neuronal scRNAseq dataset (Supplementary Figure 34). Consistent with simulations, the GSE p-values in the permuted data from iDEA ( gc = 1.  Figure 34B). For example, at an FDR of 5%, iDEA identified 1205 enriched gene sets, which is four times higher than the second-best method (GSEA, 246). In contrast, fGSEA, CAMERA and PAGE identified 236, 134, and 205 enriched gene sets, respectively. Many of the top 1% enriched gene sets are associated with neuron structures and functions such as neuron projection (GO:0043005) 17 , neuron part (GO:0097458) 18 axon (GO:0030424) 19 , synapse (GO:0045202) 20 and ion transport (GO:0006811) 20 . In addition, use of iDEA recovered 98 out of the 237 gene sets known to be involved in inflammatory itch as provided by the original paper 9 . In contrast, fGSEA, CAMERA, PAGE, and fGSEA identified 31, 20, 19, and 29 gene sets among them, respectively. The recovery of more enriched gene sets relevant to inflammation and itch by iDEA compared to the other GSE methods provides convergent support for the greater power of iDEA when modeling on beta effect size directly compared to the other methods for GSE analysis.
We next applied iDEA for DE analysis to identify DE genes with adding the gene set neuron part (GO:0097458). Again, iDEA identified more DE genes than zingeR (Supplementary Figure 34C). At an FDR of 1%, iDEA detected 1,222 DE genes, which is 23.0% higher than zingeR (993). Importantly, consistent with the power gain brought by iDEA, it detected 77 DE genes out of top 100 previously known NP1 DE genes listed in the original study, while zingeR detected 75 DE genes.

Supplementary Note 5. Bayesian model averaging (BMA) approach
Besides performing DE analysis in iDEA in the real data based on a pre-selected gene set, we also developed a new strategy to aggregate DE evidence on a particular gene across all gene sets through Bayesian model averaging (BMA). Specifically, for the given gene, we denote its posterior inclusion probability (PIP) obtained using the gene set k as PIP . The corresponding Bayes factor quantifying its DE evidence based on the gene set k is BF = PIP /(1 − PIP ). With equal prior weights on different gene sets, the average Bayes factor quantifying its DE evidence based on all K gene sets is thus ABF = 1 ∑ BF =1 , which can be converted back to a posterior inclusion probability as PIP = ABF/(1 + ABF). We found that PIPs computed this way is highly correlated with the PIPs computed based on the pre-selected gene set (Supplementary Figure 35). We now provide both options for computing PIPs for quantifying DE evidence: biologists can choose to use pre-selected gene sets that are known to be relevant to the particular experiments, as is the case for all the real data applications; alternatively, biologists also have the option of using the Bayesian model averaging when such prior knowledge is not available.

Supplementary Note 6. Gene set overlap
Previously we followed most existing GSE approaches and accounted for test nonindependence due to gene set overlap through permutation. In addition, we also performed new analysis to further examine the issue of gene set overlap in the real data applications. We adopted the method proposed by Jiang and Gentleman 21 to examines pairs of gene sets one at a time. For each pair of gene set, Jiang's method divides genes into three categories: one category of genes that are only in the first gene set, one category of genes that are only in the second gene set, and one category of genes that are common in both gene sets. Afterwards, Jiang's method calculates three p-values, one for each category of genes. By computing p-values in each set, we can explicitly deconvolute the results in the presence of gene set overlap. Here, we mainly applied Jiang's method to analyze the top 50 gene sets identified by iDEA in human embryonic data and mouse neuron cell data in order to further dissect particular set of genes that drive the enrichment signal.
(Note that we did not apply to all significant gene sets due to the heavy computational burden of Jiang's method and the gene set overlap is moderate compare to the gene set size). Specifically, there are 1,225 pairwise combinations among top 50 gene sets. For each real data we checked, we first construct the pairwise combinations of gene sets among top 50 significant gene sets identified by iDEA and for each pair, and then we filtered out gene set pairs which has less than 20 genes overlap (due to computational stability). For each pair which has larger than 20 genes in overlap, we calculated above mentioned three categories of p-values.

Supplementary Note 7. Cell types identification in the three scRNA-seq datasets
For all the real datasets we analyzed, one of our real data contains cell types that are known a priori and not inferred from the whole expression matrix, while the other two data contain cell types that are extensively validated through approaches other than inferring based on the whole expression matrix. Specifically, for the human embryonic stem cell scRNAseq dataset, the cell types are obtained from fluorescence-activated cell sorting (FACS) analysis before mixing for scRNA-seq. FACS relies on known cell type markers and represents a somewhat unbiased strategy for cell type clustering 22 . For the mouse neuronal scRNAseq dataset, the cell types are initially inferred through an iterative PCA-based procedure and are further validated by comparing the hierarchical relationship of the neuronal types with the known developmental origin of sensory neuron types, as well as by comparing neurons with distinct and characteristic soma sizes in their identified neuronal class. In addition, the inferred neuronal cell types are further confirmed by double and triple immunohistochemical staining (e.g. NP1 cell type by staining of PLXNC1). For the 10x Genomics PBMC scRNASeq dataset, the identity of cell types was inferred by aligning cluster-specific genes to known markers of distinct PBMC populations as well as comparing against the transcriptomes of the purified populations in PBMC subsets. Their approach has been found to be largely consistent with conventional marker-based methods and the major cell types reach to the expected ratios in PBMCs. We have also displayed t-SNE plot in Supplementary Figure 22, which clearly shows distinct cell clusters. Because the cell types in these data are validated through various approaches, the DE analysis results are less likely influenced by the cell type inference step as compared to other data that are fully relying on the whole gene expression matrix for cell type inference.

Supplementary Tables
Supplementary Computation was performed on a single core of an Intel Xeon L5420 2.50 GHz processor. n is number of genes analyzed in the dataset; m is number of gene sets.
From the second gene set to the 50 th gene set, we calculate the adjusted p-values for their intersection with the top first gene set as well as the disjoint parts. P-values were determined by two-sided Wald test and adjusted by Bonferroni correction.