Data-driven detection of subtype-specific and differentially expressed genes

Tissue or cell subtype-specific and differentially-expressed genes (SDEGs) are defined as being differentially expressed in a particular tissue or cell subtype among multiple subtypes. Detecting SDEGs plays a critical rolse in molecularly characterizing and identifying tissue or cell subtypes, and facilitating supervised deconvolution of complex tissues. Unfortunately, classic differential analysis assumes a convenient null hypothesis and associated test statistic that is subtype-non-specific and thus, resulting in a high false positive rate and/or lower detection power with respect to particular subtypes. Here we introduce One-Versus-Everyone Fold Change (OVE-FC) test for detecting SDEGs. To assess the statistical significance of such test, we also propose the scaled test statistic OVE-sFC together with a mixture null distribution model and a tailored permutation scheme. Validated with realistic synthetic data sets on both type 1 error and detection power, OVE-FC/sFC test applied to two benchmark gene expression data sets detects many known and de novo SDEGs. Subsequent supervised deconvolution results, obtained using the SDEGs detected by OVE-FC/sFC test, showed superior performance in deconvolution accuracy when compared with popular peer methods.

Tissue or cell subtype-specific and differentially-expressed genes (SDEGs) are defined as being differentially expressed in a particular tissue or cell subtype among multiple subtypes. Detecting SDEGs plays a critical rolse in molecularly characterizing and identifying tissue or cell subtypes, and facilitating supervised deconvolution of complex tissues. Unfortunately, classic differential analysis assumes a convenient null hypothesis and associated test statistic that is subtype-non-specific and thus, resulting in a high false positive rate and/or lower detection power with respect to particular subtypes.

Here we introduce One-Versus-Everyone Fold Change (OVE-FC) test for detecting
SDEGs. To assess the statistical significance of such test, we also propose the scaled test statistic OVE-sFC together with a mixture null distribution model and a tailored permutation scheme. Validated with realistic synthetic data sets on both type 1 error and detection power, OVE-FC/sFC test applied to two benchmark gene expression data sets detects many known and de novo SDEGs. Subsequent supervised deconvolution results, obtained using the SDEGs detected by OVE-FC/sFC test, showed superior performance in deconvolution accuracy when compared with popular peer methods.
Molecular characterization (e.g. gene expression profile) of a complex biologic system often includes features that are ubiquitously expressed by all cell or tissue types in the system (e.g. housekeeping genes) 1 , and expressed features that are specific to one or more cell or tissue subtypes present in the system (marker genes or differentially-expressed genes) [2][3][4] . An important but frequently underappreciated issue is how a "cell or tissue subtype-specific expression pattern" is defined. Ideally, a specific expression pattern would be composed of individual features that are specifically and differentially expressed in the cognate cell or tissue subtype of interest in relation to every othersso-called subtype-specifc and differentially expressed genes (SDEGs) [5][6][7][8] .
SDEGs play a critical role in molecularly characterizing and identifying tissue or cell subtypes, and facilitating the supervised deconvolution of complex tissues 5,8,9 . However, detecting SDEGs using tissue or cell-specific molecular expression profiles remains a challenging task 10 . For example, the most frequently used methods rely on an extension of ANOVA that identifies genes differentially expressed across any of the relevant cell or tissue subtypes. In this case, the null hypothesis is that samples in all subtypes are drawn from the same population, resulting in the selection of genes that may not conform to the SDEG definition. One-Versus-Rest Fold Change (OVR-FC) is another popular method based on the ratio of the average expression in a particular subtype to that of the average expression in the rest samples [10][11][12] , and OVR t-test is occasionally used to assess the statistical significance of detected genes 13 . However, a gene with low average expression in the rest is not necessarily expressed at a low level in every subtype in the rest. Expectedly, simulation studies show that Marker Gene Finder in Microarray data (MGFM) outperforms OVR t-test 14 . Alternative strategies include One-Versus-One (OVO) t-test and Multiple Comparisons with the Best (MCB) 15 that use additional pairwise significance testing or the confidence intervals of OVO statistics 2,16 .
To address the critical problem of the absence of a detection method explicitly matched to the definition of SDEGs, here we introduce One-Versus-Everyone Fold Change (OVE-FC) test to detect SDEGs among many subtypes. OVE-FC test has previously been proposed to detect SDEGs to improve multiclass classification, where the selection is based on whether the mean of one subtype is significantly higher or lower than the mean from each of the other subtypes 5,6 . To assess the statistical significance of such test, we also propose the scaled test statistic OVE-sFC together with a mixture null distribution model. Because the expression patterns of non-SDEGs can be highly complex, a tailored permutation scheme is used to estimate the corresponding distribution under the null hypothesis.
We first validate the performance of OVE-sFC test on extensive simulation data, in terms of type 1 error rate and False Discovery Rate (FDR) control. We then demonstrate the detection power of the OVE-FC/sFC test over a comprehensive set of scenarios, in terms of partial area under the receiver operating characteristic curve (pAUC), and in comparison with top peer methods. We present the utility of OVE-FC/sFC test by applying it to benchmark public data, and assessing the performance by comparing with known SDEGs and by the accuracy of supervised deconvolution that uses the expression patterns of de novo SDEGs detected by OVE-FC/sFC test.

Validation of OVE-sFC test on type 1 error using simulation data sets
To test whether our OVE-sFC test can detect SDEGs at appropriate significance levels, we assessed the type 1 error using simulation studies under the null hypothesis (Methods).
Accuracy of type 1 error is crucial for any hypothesis testing methods that detect SDEGs based on their p-values because if the type 1 error is either too conservative or too liberal, the p-value is inflated by either too many false positive or false negative estimates, loses its intended meaning, and becomes difficult to interpret.
The simulation data contain 10,000 genes whose baseline expression levels are sampled from the real benchmark microarray gene expression data with replicates of purified cell subtypes (GSE19380 8 ). Using realistic simulation data sets with various parameter settings, we show that in all scenarios the empirical type 1 error produced by OVE-sFC test closely approximates the expected type 1 error (Figure 1a, Figure S2, Figure 2a-b). The pvalues associated with OVE-sFC test statistics exhibit a uniform distribution as expected.
Moreover, even with unbalanced sample sizes among the subtypes, the mixture null distribution estimated by our posterior-weighted permutation scheme produces the expected empirical type 1 error rate ( Figure S2 and Figure 2a). In contrast, the empirical type 1 error produced by OVR t-test and OVO t-test either over-estimates or under-estimates the expected type 1 error. The p-values associated with OVR t-test and OVO t-test deviate from a uniform distribution (Figure 1b). We also evaluate the type 1 error associated with individual subtypes under high noise levels and small sample sizes. For each of the subtypes, experimental results show that the empirical type 1 error produced by OVE-sFC test closely matches the expected type 1 error (Figure 1b and Supplementary Information).
We conducted similar validation studies involving five subtypes over a wide range of simulation scenarios (Error! Reference source not found.). Experimental results again show that OVE-sFC test produces the empirical type 1 error rates that match the expected type 1 error rates. Furthermore, subtype-specific p-value estimates effectively balance the uneven type 1 error rates among the subtypes with different numbers of upregulated genes (Methods, Figure 2b, and Supplementary Information).

Comparative assessment of OVE-FC/sFC test on power of detecting SDEGs using simulation data sets
Using realistic simulation data sets, we simulated a comprehensive set of scenarios to In large-scale multiple testing, False Discovery Rate (FDR) control is an important issue when assessing the detection power. For a well-designed significance test, the objective is to maximize power while assuring FDR below an allowable level. To test whether the qvalue reflects the actual FDR, 'fdrtool' package is used to estimate the q-value for each gene 17 , where the empirical FDR with an estimated q-value of 0.05 is expected to be around 0.05. Another informative criterion is the pAUC that emphasizes the leftmost partial area under the receiver operating characteristic curve, focusing on the sensitivity at lower False Positive Rates (FPR) 18 .
Experimental results show that both overall and subtype-specific OVE-sFC test achieve a well-controlled FDR that matches the q-value cutoff (Figure S5-S6 (Figure 3, Figure S7-S8, Table S1-S3). More specifically, for more realistic SDEGs (with sufficiently large fold change), OVE-sFC test shows the best performance; for the ideal SDEGs (i.e., marker genes with significantly large fold change 8,19 ), both OVE-FC and OVE-sFC achieve the best performance with slight outperformance by OVE-FC. In comparison with the peer methods, OVE-sFC test consistently outperforms OVO t-test in these challenging experiments involving more subtypes and using RNAseq data. More importantly, the outperformance of OVE-FC/sFC over the peer methods at a stringent FPR range in ROC analysis is significant, because the related FDR is problematic in many real-world applications where large scale multiple comparisons are involved. In contrast, all three OVR methods exhibit lower detection power, and ANOVA has the lowest detection power (Supplementary Information).
It is worth mentioning that when sample size is small, OVE-sFC test statistic borrows information across genes in estimating a priori variance via the limma method, thus stabilizing variance estimate for each gene. Furthermore, OVE-sFC test statistic estimates the parameters of the limma model from all subtypes, producing better results than that applying t-test independently with the limma model for each subtype pair. Indeed, for small sample size cases, our experimental results show that OVE-sFC test clearly outperforms OVO t-test ( Figure 3, Figure 5c and Tables S2-S3). Note that when a large number of genes is involved, a more stringent multiple comparison correction or FPR/FDR control is applied.

SDEGs (human immune cells)
We applied OVE-sFC test to two real microarray gene expression data sets, GSE28490 (Roche) and GSE28491 (HUG), to detect SDEGs associated with human immune cells 20 . In these data sets, the constituent subtypes are composed of seven different human immune cells that were isolated from healthy human blood: B cells, CD4+ T cells, CD8+ T cells, NK cells, monocytes, neutrophils, and eosinophils. Because Roche and HUG used the same protocols for cell isolation and sample processing from two independent panels of donors, the derived gene expression profiles enable use of a cross-validation strategy.
With an FDR control of q-value < 0.05 applied to both data sets, OVE-sFC test detects n=28 CD4+ T cell marker genes, n=7 CD8+ T cell marker genes, and multiple marker genes for other more distinctive cell types (Table S4-S6). Between the two data sets, we obtain a Jaccard index (intersection over union) of 36.8% for all SDEGs across all seven cell types. Overlap of monocyte and neutrophil marker genes detected from the two datasets is >40% (Figure 4). The number of SDEGs accounts for about one-third of all probesets (Roche: 39%, HUG: 34%). This result is expected because these subtypes are pure cell types and so more distinctive than would be seen with samples from multicellular complex tissues 9,21,22 . We also applied a Bonferroni multiple testing correction and a more stringent pvalue < 0.001; the number of SDEGs account for 10.7% and 2.7% of all probesets in Roche and HUG data sets, respectively (Table S4)

Evaluation of ideal SDEGs detected by OVE-FC/sFC test via supervised deconvolution
Accurate and reliable detection of ideal SDEGs has significant impact on the performance of many supervised deconvolution methods that use the expression patterns of ideal SDEGs to score constituent subtypes in heterogeneous samples 21,23,24 . We adopted a Convex Analysis of Mixtures (CAM) score calculated from ideal SDEGs-guided supervised deconvolution to quantify the proportional aboundance of each subtype (Supplementary Information). The correlation coefficient between estimated scores and true proportions was used to assess the accuracy of various SDEGs selection methods.
Both OVE-FC and OVE-sFC tests are applied to three independent data sets acquired from the purified subtype expression profiles (GSE28490 Roche), purified subtype RNAseq profiles (GSE60424), and classified single-cell RNAseq profiles (GSE72056), respectively.
The idea SDEGs are detected by six different methods including OVE-FC, OVE-sFC, OVR-FC, OVR t-stat, OVR t-test, and OVO t-test, and then used to supervise the deconvolution of realistically synthesized mixtures with ground truth.
The proportions of constituent subtypes are estimated by the CAM scores derived from expression levels of top-ranked SDEGs for each subtype. Supervised deconvolution results show that OVE-sFC test, OVE-FC test and OVO t-test methods achieved the highest correlation coefficients between CAM score and true proportions when compared with the performance of other methods (Figure 5a, Figure S10).
As a more biologically realistic case involving higher between-sample variations, we synthesize a set of n=50 in silico mixtures by combining the subtype expression profiles from bootstrapped samples in the RNAseq data set according to pre-determined proportions.
Again, supervised deconvolution results show that the ideal SDEGs detected by OVE-FC test or OVE-sFC test or OVO t-test achieved superior deconvolution performance (Figure 5b,   Figure S11).
As a more challenging case of RNAseq data with lower signal-to-noise ratio and small sample size, we repeated the simulations where in silico mixtures were synthesized by combining subtype mean expressions (GSE28491 HUG) and ideal SDEGs were detected from downsampled RNAseq profiles (GSE60424, n=3). Three purified samples were randomly selected for each subtype and analyzed by the six methods. The experimental results show that, in terms of ideal SDEG-guided deconvolution performance, OVE-sFC test strongly defeats OVO t-test. Moreover, OVE-sFC test outperforms OVE-FC test for phenotypically closer cell types (CD4+ T and CD8+ T cell types) (Figure 5c).
Across the varying number of ideal SDEGs (5~200) being selected, Figure 5 shows cell or monocyte versus CD4+ T cell or CD8+ T cell, the fundamental working principle of various tissue deconvolution methods is that there is a proper small number of ideal SDEGs to be expressed exclusively in only one particular subtype. Thus applying a stringent OVE-sFC test p-value threshold, e.g., < 0.001 after correction (Table S4) is a good option, because suitable number of ideal SDEGs for CD4+ or CD8+ T cell is 5~20, while B cells or monocytes often allows more ideal SDEGs to be used in supervised deconvolution.

Discussions
Interpreting an expression profile of complex tissues requires both the knowledge of the Expression patterns of SDEGs for the relevant cells or tissues can be used to support supervised deconvolution to estimate the relative prevalence of the contributing cell or tissue subtypes. Our present work is focused on SDEGs, i.e. restricted to the SDEG definition widely adopted 7,8,19,25 . Indeed, the work here is motivated by the need to obtain SDEGs for supervised in silico tissue deconvolution 21  OVE-sFC test makes a few assumptions and works best when all assumptions are valid. For example, while the proposed permutation scheme does not require the data to be normally distributed, under the null hypothesis, OVE-sFC test assumes that samples are drawn from the distribution of the same 'shape' for different genes, ensuring that the null distributions across genes can be aggregated together with variance-based standardization.
Practically, when data distributions deviate significantly from a common shape, the limmavoom/vooma/voomaByGroup's variance models can be used to accommodate unequal variances by appropriate observational-level weights 26 ; when data distributions deviate significantly from normality, a permutational ANOVA can be used to estimate the null hypothesis components of the mixture distribution. Our experimental results show that with the mean-variance relationship estimated by limma-voom on RNASeq data, the OVE-sFC test can maintain the expected type 1 error rates or specified FDR (Figure S6). For outliers and drop-out zero values in RNAseq data, when needed, state-of-the-art two-group test methods designed specifically for RNAseq will be exploited and adopted, e.g. edgeR 27 (Figure S8) or phenotypically closer cell types (Figure 5c). Nevertheless, OVE-sFC test may become unstable when the scaling factor is too small or inaccurately estimated, and OVE-FC will not work well when pre-exclusion of extremely lowly-expressed genes is not done properly.
Though ANOVA has been the most commonly used method to test differences among the means of multiple subtypes, often in conjunction with a post-hoc Tukey HSD comparing all possible pairs of means 29 , it is not suitable for detecting SDEGs because the null hypothesis used by ANOVA does not truly enforce the definition of SDEGs. ANOVA detects differentially expressed genes rather than SDEGs and therefore produces too many false positives with respect to individual subtypes.
In addition to the SDEGs discussed here (genes uniquely up-regulated in a particular subtype), the counterpart subtype-specific down-regulated genes (genes uniquely downregulated in a particular subtype) are also of biological interest 5 . OVE-FC/sFC test is principally applicable to detecting down-regulated SDEGs by reversing the comparison rule 5 . There are certainly other alternative definitions of 'informative genes' for different analytical purposes, e.g., sample classification. In our earlier work on multiclass classification 5,6 , we have shown that upregulated SDEGs selected by OVE-FC are sufficient to achieve multiclass classification and can often improve classifier performance over alternative informative gene subsets of the same size.
Lastly, when subtype-specific expression patterns are unknown, unsupervised deconvolution techniques (e.g., CAM 21 ) are required. A theoretical advantage of unsupervised deconvolution is that it can identify both the cell/tissue subtype proportions and their specific expression patterns, albeit with possibly less fidelity than when neither is known a priori or measured from the same sample.

Basic OVE-FC test. Consider the measured expression level
where subscript ( ) indicates the subtype with the maximum mean among all subtypes. Note that OVE-FC has previously been proposed for multiclass classification 13,14 , and matches well the definition of SDEGs 5,8,19,25 . Conceptually, the null hypothesis for non-SDEGs, and the alternative hypothesis for SDEGs, can be described as It is worth reiterating that the ideal SDEGs detected by OVE strategy with a stringent threshold are also termed the marker genes for supervised deconvolution 8,19 , and are principally similar to what detected by the Convex Analysis of Mixtures (CAM) method for fully unsupervised deconvolution 9,21 , i.e. the marker genes resided near the vertices of the scatter simplex.

OVE-sFC test statistic and null distribution modeling.
To assess statistical significance of OVE-FC and to cross-fertilize the information among genes, we assume that log ( , )~( ( ), 2 ( ) ), and further define the scaled test statistic OVE-sFC as where ( ) and are the numbers of samples in subtypes ( ) and , respectively. However, for more than two subtypes ≥ 3, modeling the distribution of under the null hypothesis is challenging because the expression patterns of non-SDEGs are highly complex, i.e., non-SDEGs include both housekeeping genes and various combinatorial forms of differentiallyexpressed genes among the subtypes.
We propose the following mixture distribution of OVE-sFC test statistic under the null hypothesis (Figure 6) where fdr non-SDEG, 0 ( ) is the local FDR associated with ANOVA on all subtypes, and fdr non-SDEG, ( ) is the local FDR associated with ANOVA on the top ( − ) subtypes, estimated using R package "fdrtool" 17 where P is the number of permutations, J is the number of participating genes, (. ) is the indicator function, and , is the OVE-sFC test statistic in the th permutation on th gene. . (8) Lastly, substituting (7) and (8) into (6), the p-value associated with gene j is calculated by: with a lower bound of min{∑ non-SDEG, ( ) where N is the total number of samples, and ̂2( ) is the pooled variance estimator, given by The where ̅ ( ) and ̅ -( ) are the geometric means of the th gene expressions within subtype and associated with the combined remaining subtypes, respectively. The OVR t-test uses a statistical test given by  where the prior degree of freedom 0 takes 5 or 40, and 0 takes 0.2, 0.5, or 0.8

(Supplementary Information).
Gene expression data of human immune cells (GSE28490 and GSE28491). In these data sets, each cell subtype consists of at least five samples, excluding few outliers (Table S7).