A multivariate statistical test for differential expression analysis

Statistical tests of differential expression usually suffer from two problems. Firstly, their statistical power is often limited when applied to small and skewed data sets. Secondly, gene expression data are usually discretized by applying arbitrary criteria to limit the number of false positives. In this work, a new statistical test obtained from a convolution of multivariate hypergeometric distributions, the Hy-test, is proposed to address these issues. Hy-test has been carried out on transcriptomic data from breast and kidney cancer tissues, and it has been compared with other differential expression analysis methods. Hy-test allows implicit discretization of the expression profiles and is more selective in retrieving both differential expressed genes and terms of Gene Ontology. Hy-test can be adopted together with other tests to retrieve information that would remain hidden otherwise, e.g., terms of (1) cell cycle deregulation for breast cancer﻿ and (2) “programmed cell death” for kidney cancer.

Differential expression analysis (DEA) is a large-scale inference procedure used to identify genes whose expression differs under different biological conditions. Several variants of the t-test have been developed to perform DEA 1,2 . However, the small and skewed data typically analysed make the parametric assumptions rarely satisfied and, therefore, t-test p-values are often unreliable 3 . The easiest solution to small data size would be to increase the number of experiments, which, however, would increase experimental costs accordingly. Furthermore, data collected for poorly expressed genes are characterized by several zeros in the data. This evidence violates the typical assumptions under which t-test statistics are reliable. As a result, t-tests tend to increase type I errors and overestimate the number of significant genes. Alternative definitions of the t-test have been proposed to reduce the impact of small samples and low expression variability, e.g., moderated t-test 4 and Significance Analysis of Microarray (SAM) 5 . Indeed, we compare the performance of the proposed test for differential expression with the one of moderated t-test and SAM. Conversely, t-tests applied to large data sets also produce too many significant genes; this depends on the fact that average expression differences may be significantly different from zero from a statistical point of view but are not large enough to be biologically meaningful.
A common strategy to reduce the number of selected differentially expressed genes is to discretize the gene expression. The discretization of gene expression data (GED) is widely used in genomics analysis. Despite a certain loss of information, GED discretization is often used as a preprocessing step to reduce raw data noise and facilitate the interpretation of data 6 . Several algorithms require data discretization during the preprocessing, e.g., the biclustering method 7 . Moreover, many network models require discrete data as input, e.g., Bayesian Networks and logical networks 8,9 . Despite the importance of discretization in transcriptomics, the criteria behind discretization methods are always arbitrary: the log2-Fold Change (FC)-discretization 10 depends on an arbitrary set threshold, usually equal to 1, 1.5 or 2; the Equal Width discretization 11 depends on a tuning parameter; a simple rank-based discretization depends on the Xth percentile that identifies the top-X% genes.
We propose a novel statistical test for DEA based on a convolution of multivariate hypergeometric distributions (Hy-test), which addresses both issues of t-test methods discussed before. Moreover, the method implicitly comprises a novel approach for data discretization, which is free from arbitrary parameters. At the price of a slight loss of information, Hy-test presents the following advantages with respect to the currently used methods: (1) It is free from parametric assumptions; (2) It allows implicitly provides a discretization of the expression profiles; (3) It is more conservative than the t-tests, reducing type I errors.
(4) It can be integrated with other methods. www.nature.com/scientificreports/ In this paper, the Hy-test has been applied to investigate breast and kidney cancer tissues, and results have been compared to those obtained through the t-test approach. The results indicate that the joint use of the Hytest and moderated t-test allows one to understand the biological implications of DEA better.

Methods
Algorithm. Let's consider a gene expression profile recorded in two experimental conditions, e.g., normal and cancer tissues, for n pairs of tissues. We estimate a threshold couple able to discretize gene expression as "downregulated", "upregulated", and "no-changed". The optimum thresholds are obtained by maximizing the disagreement between the discretized levels of the two different experimental conditions. Applying the thresholds k 1 , k 2 on the whole expression of a single gene, we obtain two discretized vectors, one for healthy tissues, say v H , and one for diseased tissues, say v D , with entries that take values {-1,0,1}, which means "downregulated", "no-changed", and "upregulated", respectively. The thresholds k 1 , k 2 are estimated by maximizing the quantity where n +,− n −,+ is the number of tissue couples that present upregulated normal (cancer) tissues paired with downregulated cancer (normal) tissues. Optimization research has been carried out by using a genetic algorithm 12 . A threshold has been estimated for each gene of the dataset. However, this method can also be easily adapted to extract a single cut-off couple for all genes.
As soon as optimal values for the thresholds, k 1 and k 2 , are determined, we calculate a p-value to assess if gene expression is significantly different between cancer and normal tissues. To associate a p-value with H( v H , v D ) it's necessary, as a preliminary step, to evaluate the probability that a value of H(� v H , � v D ) = n +,− + n −,+ occurs by chance. For the sake of readability, we describe the analysis in two steps. In the first one, we set constraints on the total number of positive, negative, and null signs on both vectors in the null hypothesis, then we describe the distribution of the null model after relaxing these constraints. Specifically, in the first step, the null model depends on the external parameters � where n is the total number of tissue couples in the dataset. We are interested in calculating the probability that matrix occurs by chance, subject to the aforementioned constraints. An entry n i,j of C represents the number of tissues that display sign i in vector v H and sign j in v D . Notation C is used here because sometimes matrices such as the one above are indicated as "confusion" matrices. Entries of matrix C are not independent due to the constraints on the number of positive, negative, and null signs described above. Specifically, they are linearly dependent according to the following six equations: This linear system has rank equal to 5, because of the linear relationship between parameters: Therefore, it can be solved as This result indicates that matrix C is fully determined by the knowledge of n −,− , n −,+ , n +,− , and n +,+ . Therefore, the probability where according to a simple combinatorial analysis of the problem, and (1) (2) C = n +,+ n +,− n +,0 n −,+ n −,− n −,0 n 0,+ n 0,− n 0,0   The distribution of C allows to calculate the probability As According to this distribution, the p-value associated with an observation x =n −,+ +n +,− is : In the second step, we relax the constraints on the total number of positive, negative, and null signs in both the vectors associated with healthy (H) and diseased tissues (D). This is done by only assuming that the overall (across H and D tissues) number of positive, K+ , negative, K-, and null signs, K0, are set. In this case, we have to modify the previous formula. Specifically, let's indicate with G the total number of positive, negative and null signs across the 2n = K + + K − + K 0 samples, that is, two times the number of paired tissues. In this case, the null hypothesis is attained by assuming that n tissues are randomly selected to be pathological, and paired with the others, which are supposed to be the healthy ones. Therefore: where Q = {K + D , K − D , n +,+ , n −,− , n −,+ }, such that x ≥x . Therefore, at difference with Eq. (9), quantities K + D and K − D can vary, and the sum is carried over all possible values of parameters such that x ≥x , under the constrain In this manuscript, the Hy-test refers to Eq. (10). We use this test on a large set of genes, therefore a multiple comparison correction is required. In all subsequent analysis statistical significance indicates that a Bonferroni corrected p-value is below the 5% level 13 .
Preprocessing procedure for microarray data. To test the effectiveness of the proposed method, we consider gene expression profiles of breast cancer (BRCA) cells in a pattern of paired tissues; 17.632 genes have been recorded in 75 tumour tissues and in the 75 paired normal tissues. Then the analysis has also been performed by considering 67 kidneys with renal clear cell carcinoma-KIRC-paired with 67 normal tissues. Data has been downloaded from The Cancer Genome Atlas (TCGA) database using the TCGA-assembler tool 14 . The expression profiles of duplicated genes have been replaced by their mean expression. Moreover, the expression of each gene has been normalized using a quantile normalization procedure implemented in R package preprocessCore 15 . Finally, gene expression values were log2-transformed.
Quantitative analysis of GO-terms. The performance of the Hy-test has been compared to one of two classical methods of differential expression analysis, i.e., moderated t-test 4 in combination with fold change larger than 2 and significance analysis of microarray 5 . Both tests are available from the Bioconductor repository and are implemented in the packages "limma" and "siggenes", respectively. According to the three methods, genes that turned out to be significant were also compared by exploiting their functional roles with a Gene Ontology (GO) enrichment analysis 16 . We obtained three separate lists of significant GO-terms from the three sets of differentially expressed genes. GO-analysis has been done using the topGO package from Bioconductor, focusing on biological process terms. Fisher exact p-values have been associated with each GO-term. To identify GO-terms (e.g., cell cycle) conceptually associated with a specific cell line (for example, breast cancer), we have www.nature.com/scientificreports/ defined a novel procedure that counts the PubMed articles related to the biological concepts under exam, for example, breast cancer and cell cycle. We assume that more articles related to both concepts indicate a stronger conceptual association between them. The automated PubMed search has been carried out using the R package RISmed 17 . The used query considers articles published between January 2000 and December 2020. The probability of observing n C,T PubMed articles with both keywords "breast cancer" and "cell cycle" is where N is the number of articles available on PubMed, N C is the number of articles with the keyword "breast cancer" and N T is the number of articles with "cell cycle" as keywords. Using a hypergeometric test we have associated a p-value of conceptual association with each GO term as

Results
The three methods, i.e. Hy-test, moderated t-test and SAM, have been compared. Venn diagrams reported in Fig. 1 clearly show the differences between the outcomes of the three considered methods. Considering breast (kidney) tissues, the Hy-test identifies 1.304 (2.720) significant genes, whereas both SAM and the moderated t-test select many more genes: 7.620 (8.347) and 3.362 (4.192) significant genes, respectively. More importantly, panels A (breast cancer) and D (kidney cancer) of Fig. 1 clearly show that the Hy-test mostly identifies differentially expressed genes also identified by both the other methods. These results indicate that the Hy-test is more conservative than the other two tests. According to a GO-enrichment analysis of the lists of differentially expressed genes, 109 (245) significant terms result from the Hy-test gene list, 230 (457) from the moderated t-test list, and 66 (162) from the SAM list. The intersections among the detected lists of terms (12) Pr   Fig. 1C, F. It's worth noticing that SAM analysis provides such a large number of differentially expressed genes, more than 5000 in both the applications, that it is reasonable to assume the presence of many false positives, while the Hy-test alone or the combined use of Hy-test and moderate t-test suggest better recovery of significant terms associated with both types of cancer.
A crucial issue in interpreting results from transcriptomics studies is the bias due to the significantly high and increasing number of cancer-related studies with respect to any other disease 18 . The consequence is that almost any gene has been (or will be) associated with cancer. Evaluating the performance of our algorithm by measuring its ability to retrieve cancer-related genes might not be sufficient. On the other hand, several different perturbations can trigger concerted "expression waves" marking state transitions that could cause global transcriptomic changes with common underlying characteristics 19 . The consequence, in this case, is the reported presence of a "generic signature" of differentially expressed genes, i.e. genes that are frequently detected as differentially expressed, despite the comparison performed 20 . Therefore, we evaluated the algorithms by considering Table 1. GO-terms significantly associated with breast cancer among significant GO-terms found using Hy-test, moderated t-test and both procedures. Term size is the number of genes that compose a GO-term; BR term size is the number of GO-term genes associated with breast cancer; p-value is computed by using the hypergeometric distribution.  Table 2. GO-terms significantly associated with "kidney cancer" among significant GO-terms found using Hy-test, t-test and both procedures. Term size is the number of genes that compose a GO-term; KIRC term size is the number of GO-term genes associated with kidney cancer; p-value is computed by using the hypergeometric distribution.

Sign. GO-term GO ID Analysis Term size KIRC term size p value
Apoptotic www.nature.com/scientificreports/ their ability to avoid the selection of the generic signature, not because the genes selected are not related to the comparisons we are performing, but by testing which algorithm can retrieve more specific features of the system under investigation and not the effect of the generic perturbation. To measure the condition specificity of the used tests, i.e., the ability to select differentially expressed genes specifically related to the performed sample's comparisons, we used the DE prior score defined and computed in 20 . The genes selected with the SAM test show a DE prior score cumulative distribution very close to the diagonal, explained by selecting a high number of genes, most of which are probably false positive (Supplementary Fig. 1). The DE prior scores of the genes selected as differentially expressed in breast cancer tissues with the Hy-test and the moderated t-test are similarly distributed.
On the other hand, the Hy-test in kidney data analysis selects differentially expressed genes with significantly lower DE prior scores. Even though the Hy-test selects a smaller number of differentially expressed genes, its focus is not on the genes that appear differentially expressed in any condition of comparison but, at least in these examples, on genes more peculiar to the system under investigation.
Correlation structures and spectral analysis. Besides using statistical techniques to identify differentially expressed genes, it is also important to use statistical charts to detect normalization problems, differential expression designation problems, and common analysis errors. For example, as shown in Fig. 2 (Fig. 3) for breast (kidney)-cancer data, a simple comparison between the correlation matrices of tissues is constructed by using (1) all available genes, (2) the genes selected by the moderate t-test and (3) the ones selected by the Hy-test, allows one to perform a quality check on the two analyses of differential expression. Specifically, it is possible to observe how the panel of genes selected by the Hy-test can be considered a better filter than the one obtained Many times genes do not work in isolation, but their "effect" is organised into "eigengene" modes, which one can study by performing a Principal Component Analysis (PCA) 21 . The first component typically reflects the batch effect corresponding to the "average expression profile" of genes, whereas minor components may identify disease (or any other perturbation) effects 22 . The dimensionality reduction obtained by considering principal components provides relevant insights into the considered selection procedures of differentially expressed genes. Using all the genes, the first eigenvector, which explains about 90% of the total variance, also captures the differential effect of the two types of paired tissues as a background effect, making it impossible to use it to identify the gene-disease association. However, analysing the two reduced sets of genes that we identified through the moderated t-test and the Hy-test, we observe that the two effects (background and difference between healthy and cancer tissues) are split into the first two principal components. The variance explained by the two components together is the same as the one explained by the first component obtained from the whole dataset, i.e., about 90%. When looking at the distribution of gene scores projected on the first component (top panels of Supplementary  Figs. 2a and 3a), we note a peak in the right tail of the distribution, which smooths out if one considers only genes selected through the t-test, and eventually disappears if one only focuses on genes selected through the Hy-test (batch effect). This evidence suggests using the second principal component (bottom panels of Supplementary  Figs. 2a and 3a) to obtain more insights into the involvement of the selected genes in the differentiation between  Figs. 2b and 3b). This unsupervised procedure can be used in conjunction with ours to visualise high-dimensional space better and investigate the structure of several complex systems in biology 23 .

Robustness analysis.
To evaluate the finite sample properties of our test, we perform a Monte Carlo simulation in various scenarios. We assessed the performance of both the Hy-test and moderated t-test in terms of power functions (i.e., rejection rate) under the null (a) and alternative (b) hypotheses, respectively. Simulations are performed by generating paired vectors (Y 1 , Y 2 ) of length n ∈ {50, 75} of synthetic expression profiles, once log-normally distributed and once power-law distributed. ( Y 1 ∼ PL(x min = 20, α = 3.5) and Y 2 ∼ PL(x min = 40, α = 3.5) ) Under the null hypothesis we simulated two independent samples,  Table 3 shows the mean rejection rate after adjusting the p-values with the Bonferroni correction. Results of simulations under the null hypothesis of no differential expression block (a) of simulations are not shown because both tests were robust in detecting true negatives. According to the results reported in Table 3, the Hy-test method shows greater robustness than the moderate t-test in identifying true positives, even in low correlation and especially with highly leptokurtic distributions, such as the power-law distribution.

Discussion
DEA plays a central role in comparative transcriptomic studies, which represent the vast majority of gene expression analyses. The core action that defines a transcriptomic comparative study is the definition and retrieval of differentially expressed genes in different conditions. Working with data generated by a plethora of procedures in a very noisy and variable system, such as a biological one, requires one to adopt different approaches to analyse a given phenomenon. We provide a biological interpretation of the results obtained by performing a differential expression analysis of breast and kidney cancer genes through the moderated t-test and our Hy-test.
In the case of the real breast cancer profiles analysed, both moderated t-test and Hy-test reveal that DE genes are enriched in functions involved in tissue development and cell proliferation, as expected 24 . While the t-test approach focuses on signal transduction [25][26][27] , the Hy-test highlights a central role in regulating the cell cycle in breast cancer, as strongly supported by recent literature 28,29 .
In detail, the mammary gland is a tissue characterised by a high proliferation rate, and the developmental programs are prompt to be subverted to promote cancer progression. In the gland, many cells are extremely polarised. When extrinsic or intrinsic factors disrupt the maintenance of this organisation, this disruption may act as a promoter of hyperplasia and transformation 30 . Several studies also suggest that the disruption of the typical apical-basal polarity may contribute to the metastatic event 31 . The deregulation of extracellular matrix proteins and signalling is sufficient to promote breast cancer development and progression 24 . Signal transduction has a central role in breast cancer; indeed, breast cancer molecular classification usually follows the presence or absence of specific hormone and growth factor receptors 25,26 with direct implications in diagnosis, prognosis, and therapy. Both tissue development and signal transduction have a central role in breast cancer. However, the moderated t-test is not efficient in retrieving the cell-intrinsic cell cycle deregulation GO terms that the Hy-test has pinpointed. Indeed, cell cycle deregulation is crucial to breast cancer development and cell cycle control machinery targets novel therapeutic strategies, such as CDK4/6 inhibitors 28,29 .
In the case of kidney cancers, the differences between the Hy-test results and those from the moderated t-test are even more apparent. Both approaches retrieve an enrichment in cell signalling, particularly in the contest of the immunological microenvironment 32,33 , and the t-test only finds the involvement of functions related to  34 . However, Hy-test only points to "programmed cell death" , which is a central mechanism in kidney cancer, targeted by some therapeutic approaches to the disease 33 . In detail, it is known that the reshaping of the metabolism is one of the key steps that kidney tumour cells must undergo during cancer progression. This metabolic reshape strongly relies on the cross-talk between cancer cells and the tumour microenvironment 35 . In particular, the inflammatory microenvironment is involved in developing of both pre-neoplastic alterations and kidney cancer 36 . To further support our findings, we can also mention that, for patients with renal clear cell carcinoma, a model has been proposed based on a few immunerelated genes that can predict the prognosis based on tumour immune microenvironments 37 . Considering that the programmed cell death subversion plays a central role in kidney cancer development, it is intriguing to ascertain that only the Hy-test leads to retrieving this GO term from the enrichment analysis, strongly suggesting that a dual approach using both the Hy-test and moderated t-test can be even more suitable than single methods alone to investigate the biological meaning of a DEA on real data.

Conclusions
Hy-test can be adopted alone or jointly with other existing DEA tests to identify differentially expressed genes in a very conservative way, as confirmed by the analyses of real data of breast and kidney cancers reported in this paper. Such robust information would remain otherwise hidden within the extremely large number of genes identified by standard DEA tests as differentially expressed, likely including many false positives. According to our results, the moderated t-test increases substantially the number of significant genes retrieved from DEA with respect to the Hy-test, broadening the differential gene ontology enrichment. Consequently, the Hy-test is more selective than moderated t-test in both retrieving DE genes and relevant terms of GO. On the other end, the SAM test detects even more statistically significant genes than the moderated t-test, leading to apparent issues in identifying of enriched GO terms. To evaluate the performance of the analysed DEA tests in detecting cancer-related genes, we have focused on the enriched ontology terms validated through the automated PubMed-search procedure described in the "Methods"section. In this way, we can focus our attention only on terms with a widely established involvement in cancer diseases. The excluded terms might also carry important cancer information, but their analysis goes beyond the purpose of the present performance evaluation. Hy-test is not only able to narrow the window of selected genes but focusing the functional analysis. It can also retrieve specific terms of GO that would be otherwise missing. This is particularly evident in the breast cancer dataset, where the moderated t-test also collects the vast majority of DE genes retrieved by the Hy-test. However, the enrichment analysis shows only a moderate overlapping, strongly suggesting that Hy-test can retrieve a different set of genes that points to functions of biological relevance that would be otherwise missed. This is also true to a lower extent for the kidney dataset.