Introduction

Selecting informative genes, including individually discriminant genes and synergic genes, from expression data has been useful for medical diagnosis and prognosis. Individual gene ranking techniques such as the t-test1 typically produce a “list of genes” that are correlated with disease2. However, they cannot provide insight into the interactions among these genes. According to information theory, the pair-wise interaction I(X1; X2; Y)3 is defined as

$$I({X}_{1};{X}_{2};Y)=I({X}_{1},{X}_{2};Y)-I({X}_{1};Y)-I({X}_{2};Y)$$
(1)

where I denotes mutual information, I(X1; Y) is the individual effect of gene X1 relative to phenotype Y, I(X2; Y) is the individual effect of gene X2 relative to Y, and I(X1, X2; Y) is the joint effect of X1 and X2 relative to Y. A positive value of I(X1; X2; Y) indicates synergy, while a negative value indicates redundancy.
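As an illustration of Eq. (1), the following minimal sketch computes the interaction term for already-discretized variables using plug-in entropy estimates. The function names and the two toy examples (an XOR phenotype and fully redundant genes) are assumptions made for illustration; they are not taken from the cited methods, which also have to discretize continuous expression values first.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def mutual_information(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B) for two equally long discrete sequences."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def interaction_information(x1, x2, y):
    """Eq. (1): I(X1; X2; Y) = I(X1, X2; Y) - I(X1; Y) - I(X2; Y)."""
    joint = list(zip(x1, x2))  # treat the pair (X1, X2) as one variable
    return (mutual_information(joint, y)
            - mutual_information(x1, y)
            - mutual_information(x2, y))

# Toy example 1 (assumed): Y = X1 XOR X2. Neither gene alone informs Y.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y_xor = [a ^ b for a, b in zip(x1, x2)]
synergy_value = interaction_information(x1, x2, y_xor)      # 1.0 bit: synergy

# Toy example 2 (assumed): both genes are copies of Y.
redundancy_value = interaction_information(x1, x1, x1)      # -1.0 bit: redundancy

print(synergy_value, redundancy_value)
```

In the XOR case each individual term I(X1; Y) and I(X2; Y) is zero, so the whole bit of information about Y is attributed to the pair, which is the situation depicted for synergic gene pairs.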

Figure 1 illustrates four typical examples of pair-wise synergy from Watkinson et al.4 (Fig. 1A,B) and Chen et al.5 (Fig. 1C,D). Figure 1A–C are generated from simulated data, and Fig. 1D is generated from real-world data. As an example, when RGS9 or DIAPH2 is evaluated individually, neither gene is correlated with cancer. Therefore, RGS9 and DIAPH2 would not be present in the output of any “individual gene ranking” technique. However, when the pair-wise interaction is evaluated, the gene pair RGS9-DIAPH2 is sufficient to distinguish cancer from normal samples (Fig. 1D).

Figure 1

Four typical examples of pair-wise synergy. Red and green dots represent cancer and normal samples, respectively.

Detecting synergic genes is more difficult than selecting individually discriminant genes. Several efforts have recently been made to detect gene–gene synergies, and they generally follow one of two strategies. The first is the non-conversion strategy, which uses formula (1) directly to measure I(X1; X2; Y)4 or uses the maximal information coefficient directly to measure MIC(X1; X2; Y)5. How the continuous variables are discretized is the key to estimating mutual information. Binarization, such as the dendrogram-based4 technique, simplifies the estimation and yields simple logical functions connecting the genes. However, it may result in information loss and estimation error. Although MIC(X1; X2; Y)5 can capture a wide range of interactions, it has a high computational cost because of its 3-D search. The second is the conversion strategy, exemplified by doublets6 and top scoring pair (TSP)7. These methods construct a new variable Z from a combination of X1 and X2 (e.g. for the sum type of doublets, Z = X1 + X2) and measure I(Z; Y) instead of I(X1; X2; Y). This strategy has a low computational cost because the search space is reduced from 3-D to 2-D. However, it is unclear whether it can capture synergies8 efficiently.

Inspecting Fig. 1A–C, we found that they share the same pattern and can be characterized by the same function, Y = |X1 − X2|. The only difference between them is the value ranges of the independent variables. Although doublets6 includes the sum, diff, mul and sign conversion types (TSP is similar to sign), it unfortunately omits the abs conversion type.

In this work, we developed a simple and fast approach, based on the abs conversion type and the t-test, to discover pair-wise synergic genes that are related to cancer. Furthermore, we validated these synergic genes by their classification performance on simulated and real-world datasets. Our results show that these synergic genes can enhance the individually discriminant model and improve prediction performance. We also demonstrate that these synergic genes should be converted into new variables (Z) prior to being used as input features for classifiers, especially when there are many pairs of synergic genes.

Datasets and Methods

Datasets

Four binary class datasets are involved in this work. The reference, sample size, number of genes in each dataset, and the number of samples in each class are summarized in Table 1. All gene expression data have been normalized by using the RMA method9.

Table 1 Four binary class gene expression datasets.

Conversion types and pair-wise gene rank

Suppose that a dataset has n samples and m genes, and can be denoted as {Yi, Xij}, i = 1,2,…,n; j = 1,2,…,m. Xij represents the expression value of the j th gene (Gj) in the i th sample, and Yi represents the class label of the i th sample, with Yi ∈ {0, 1}, where 0 denotes cancerous and 1 denotes normal tissue samples. Rank-based methods7 are robust to quantization effects and can overcome background differences between gene pairs. Therefore, letting Rij denote the rank of the i th sample within the j th gene, we replace the expression values Xij by their ranks Rij and obtain a new data matrix {Yi, Rij}.

For two genes Gp and Gq, doublets6 defines four conversion types.

$$\mathrm{Sum\ conversion\ type:}\quad Z_{is}=R_{ip}+R_{iq}$$
(2)
$$\mathrm{Diff\ conversion\ type:}\quad Z_{is}=R_{ip}-R_{iq}$$
(3)
$$\mathrm{Mul\ conversion\ type:}\quad Z_{is}=R_{ip}\times R_{iq}$$
(4)
$$\mathrm{Sign\ conversion\ type:}\quad Z_{is}=\left\{\begin{array}{ll}1, & \mathrm{if}\ R_{ip}\ge R_{iq}\\ 0, & \mathrm{if}\ R_{ip}<R_{iq}\end{array}\right.$$
(5)

We add a new conversion type:

$$\mathrm{Abs\ conversion\ type:}\quad Z_{is}=|R_{ip}-R_{iq}|$$
(6)
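The rank transform and the five conversion types in Eqs (2)–(6) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names `to_ranks` and `convert`, the toy expression values, and the choice to give tied values their average rank are all assumptions (the text does not state how ties are handled).

```python
def to_ranks(values):
    """Rank each sample within one gene (1 = lowest expression).
    Assumption: tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the block while the next value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks

# One function per conversion type, applied sample by sample to (R_ip, R_iq).
CONVERSIONS = {
    "sum":  lambda rp, rq: rp + rq,            # Eq. (2)
    "diff": lambda rp, rq: rp - rq,            # Eq. (3)
    "mul":  lambda rp, rq: rp * rq,            # Eq. (4)
    "sign": lambda rp, rq: 1 if rp >= rq else 0,  # Eq. (5)
    "abs":  lambda rp, rq: abs(rp - rq),       # Eq. (6)
}

def convert(r_p, r_q, kind):
    """Build the converted feature Z for one gene pair."""
    f = CONVERSIONS[kind]
    return [f(a, b) for a, b in zip(r_p, r_q)]

# Hypothetical expression values of two genes over four samples.
gene_p = [5.1, 7.3, 6.0, 9.2]
gene_q = [8.4, 2.2, 6.5, 3.0]
r_p = to_ranks(gene_p)   # [1.0, 3.0, 2.0, 4.0]
r_q = to_ranks(gene_q)   # [4.0, 1.0, 3.0, 2.0]

for kind in CONVERSIONS:
    print(kind, convert(r_p, r_q, kind))
```

For this toy pair the abs conversion gives Z = [3.0, 2.0, 1.0, 2.0], whereas diff gives [-3.0, 2.0, -1.0, 2.0]: the two differ only in whether the sign of the rank difference is kept.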

Here, i = 1,2,…,n; p = 1,2,…,m; q = 1,2,…,m; p ≠ q; s = 1,2,…,m(m−1)/2. Again, we obtain a new data matrix {Yi, Zis}. For each converted feature Zs, we use the t-score, instead of I(Z; Y), to rank the association between Z and Y, since Y ∈ {0, 1}.

The individually discriminant genes are also ranked by t-score.
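The pair-ranking step can be sketched as below: every unordered gene pair is converted with the abs type and scored by the absolute t-score between the two classes. Several details are assumptions made only for this illustration: the Welch form of the t-statistic (the text says only "t-score"), the synthetic six-gene dataset in which G1 and G2 form a planted synergic pair, and all function names.

```python
import random
from itertools import combinations
from statistics import mean, variance

def to_ranks(values):
    """Rank samples within one gene (1 = lowest); ties are not expected here."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def t_score(values, labels):
    """Two-sample t-statistic between class 0 and class 1 (Welch form, assumed)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def rank_pairs_by_abs(ranks, labels, top_n):
    """Score all m(m-1)/2 pairs with Z = |R_p - R_q| and return the top_n."""
    scored = []
    for p, q in combinations(sorted(ranks), 2):
        z = [abs(a - b) for a, b in zip(ranks[p], ranks[q])]
        scored.append((abs(t_score(z, labels)), p, q))
    scored.sort(reverse=True)
    return scored[:top_n]

# Synthetic data (assumed): 200 samples, 6 genes, class decided by |G1 - G2|.
random.seed(0)
n = 200
expr = {f"G{j}": [random.random() for _ in range(n)] for j in range(1, 7)}
labels = [0 if abs(a - b) > 0.3 else 1 for a, b in zip(expr["G1"], expr["G2"])]
ranks = {gene: to_ranks(values) for gene, values in expr.items()}

top = rank_pairs_by_abs(ranks, labels, top_n=3)
individual_g1 = abs(t_score(ranks["G1"], labels))
print(top[0], individual_g1)
```

In this toy setting the planted pair should come out on top while G1 alone receives a much smaller score, which mirrors the behaviour described for synergic genes. The double loop over pairs is quadratic in the number of genes, so a real dataset with thousands of genes would need a vectorized or parallel implementation.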

Support Vector Machine Classifier and performance evaluation

Each gene pair and each individually discriminant gene is ranked by t-score based on all samples. The Top N gene pairs and/or the Top N individually discriminant genes are selected as input features. The Support Vector Machine (SVM) classifier is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/ 10. Because the datasets used in this paper have balanced numbers of positive and negative samples, we simply use the average accuracy of five-fold cross-validation (CV) to evaluate classifier performance.

$$Accuracy=\frac{{\rm{TP}}+{\rm{TN}}}{{\rm{TP}}+{\rm{FP}}+{\rm{TN}}+{\rm{FN}}}\times 100 \% $$
(7)

Here TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
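The evaluation protocol, Eq. (7) averaged over five folds, can be sketched as follows. To stay self-contained, the sketch replaces the SVM with a hypothetical one-feature midpoint-threshold classifier; that stand-in, the toy data, the fold assignment and the choice of label 1 as the "positive" class are assumptions for illustration only (accuracy itself does not depend on which class is called positive).

```python
import random

def accuracy(tp, tn, fp, fn):
    """Eq. (7): percentage of correctly classified samples."""
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_midpoint(z, y):
    """Stand-in classifier (not an SVM): threshold halfway between class means."""
    m0 = sum(v for v, c in zip(z, y) if c == 0) / y.count(0)
    m1 = sum(v for v, c in zip(z, y) if c == 1) / y.count(1)
    return (m0 + m1) / 2, m1 > m0

def predict_midpoint(model, v):
    threshold, one_is_high = model
    return int((v > threshold) == one_is_high)

def cv_accuracy(z, y, k=5, seed=0):
    """Average accuracy over k folds, fitting only on the training part."""
    scores = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit_midpoint([z[i] for i in train_idx], [y[i] for i in train_idx])
        tp = tn = fp = fn = 0
        for i in test_idx:
            pred = predict_midpoint(model, z[i])
            if pred == 1 and y[i] == 1:
                tp += 1
            elif pred == 0 and y[i] == 0:
                tn += 1
            elif pred == 1 and y[i] == 0:
                fp += 1
            else:
                fn += 1
        scores.append(accuracy(tp, tn, fp, fn))
    return sum(scores) / len(scores)

# Toy converted feature (assumed): perfectly separable, 10 samples per class.
z = list(range(1, 11)) + list(range(21, 31))
y = [1] * 10 + [0] * 10
print(accuracy(40, 45, 5, 10), cv_accuracy(z, y))
```

Note that in this sketch only the classifier is refit inside each fold. The text states that the t-score ranking itself is computed on all samples, so the sketch does not attempt to reproduce the feature-selection step.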

Results and Discussion

Comparing gene pairs selected by different methods

Figure 2 shows scatterplots of the top two gene pairs selected by the abs conversion type and six reference methods in the Prostate1 dataset11. In Fig. 2A,B,M and N, although the top two synergic gene pairs selected by the abs conversion type and MIC(X1; X2; Y) are different, they share the same pattern: each gene is unrelated to cancer when evaluated individually, but the pair is sufficient to distinguish cancer from normal samples. Figure 2C–L show the top two gene pairs selected by the sum, diff, mul, sign and dendrogram-based I(X1; X2; Y) methods. As an example (Fig. 2C), the higher the expression level of gene PWP2, the more likely the sample is cancerous. Gene MNAT1 shows a pattern similar to that of PWP2. Thus, these two genes (PWP2 and MNAT1) are directly related to cancer; however, they are individually discriminant rather than synergic genes. In short, only the abs conversion type and MIC(X1; X2; Y) capture typical pair-wise synergies, whereas dendrogram-based I(X1; X2; Y) and doublets fail to discover pair-wise gene interactions.

Figure 2

Top two gene pairs selected by different methods in the Prostate1 dataset. Red and green dots represent cancer and control, respectively. Gene expression levels are represented by the ranked values. K and L are from dendrogram-based I(X1; X2; Y)4; M and N are from MIC(X1; X2; Y)5.

We then compared the overlaps among the informative genes selected by the Ind, Sum, Diff, Mul, Sign and Abs methods (Table 2). Clearly, the first five methods detect a considerable number of the same informative genes. In contrast, the informative genes selected by the Abs method have little overlap with those selected by the others.

Table 2 Overlaps among the informative genes selected by different methods in the Prostate1 dataset.

Given the top 10 pairs of synergic genes (16 genes) selected by the abs conversion type, Fig. 3 shows the heat maps generated from these genes with different conversion types. Only the heat maps with the abs conversion type (Fig. 3A) and the diff conversion type (Fig. 3C) can distinguish cancer from normal samples. With the diff conversion type, the Z values are medium in cancer samples but either low or high in normal samples, or vice versa. Because the class means of Z are then similar, the pair-wise synergic genes converted by diff receive low t-scores and are not highlighted.

Figure 3

The heat maps generated by the same top 10 pairs of synergic genes, which were selected by the abs conversion type. Each row corresponds to a pair of genes (A–E) or a gene (F), and each column corresponds to a sample. Gene expression levels are represented by the ranked values, normalized to [−1, 1].

To assess whether the synergic genes selected by the abs conversion type have any biological relevance to cancer, we further validated the top 10 gene pairs (16 genes) against the UniHI12 database (http://www.unihi.org/) and PubMed (Table 3). UniHI is an enhanced database for retrieval and interactive analysis of human molecular interaction networks. Among the top 10 gene pairs, we have so far found two (PARP1-HMGB1 and CCHCR1-GRAP) that are recorded as interactions in UniHI. The interaction between PARP1 and HMGB1 has been verified by Dara et al. (2007)13: the activation of PARP1 induces release of the pro-inflammatory mediator HMGB1 from the nucleus13,14,15. Of the 16 genes, 15 have been reported to be related to cancer, and four of them have been reported to be directly related to prostate cancer. Although LINC01278 has not yet been reported to be related to cancer, the abs conversion type suggests that it is an important informative gene: LINC01278 occurs three times in the top 10 gene pairs (Table 3) and deserves attention.

Table 3 The top10 synergic genes selected by abs conversion type in Prostate1 dataset.

Classifiers cannot learn well if synergic genes have not been converted properly

Although we obtain the pair-wise synergic genes based on the abs conversion type, Fig. 3F suggests that the unconverted features (X or R) cannot distinguish cancer from normal samples. This indicates that the input features for classifiers should be the converted feature Z (Fig. 3A). We therefore conducted an experiment to validate this hypothesis. Ten simulation datasets were generated according to Table 4; their prediction accuracies under five-fold cross-validation are listed in Table 5.

Table 4 Ten simulation datasets and their input features.
Table 5 Prediction accuracy with converted and not converted input features.

With few input features (e.g. dataset1 and dataset2) (Table 5), all seven models perform well with the converted features, whereas only two models (SVM-RBF and ANNs) perform well with the unconverted features. With more input features (e.g. dataset9 and dataset10) (Table 5), four models (SVM-RBF, SVM-poly, SVM-sig and ANNs) still perform well with the converted features, but none of the seven models performs well with the unconverted features. Thus, we conclude that pair-wise synergic genes should be converted into new variables (Z) prior to being used as input features for classifiers, especially when there are many pairs of synergic genes.

This is a surprising and important finding. Suppose phenotype Y is determined by the individually discriminant genes X1 and X2 and by the pair-wise synergic gene pairs (X3, X4) and (X5, X6). In other words, the true genetic model is \(Y={X}_{1}+{X}_{2}+|{X}_{3}-{X}_{4}|+|{X}_{5}-{X}_{6}|\), the true optimal subset is {X1, X2, X3, X4, X5, X6}, and X7 to X1000 are genes unrelated to Y. Now suppose we are given the dataset {Y, X1, X2,…, X1000} and want to construct a genomic prediction model16 based on machine learning, without knowing the true genetic model. Even though the individually discriminant genes X1 and X2 can be highlighted by t-test, and the synergic genes X3, X4, X5 and X6 can be highlighted by the Abs conversion type or MIC(X1; X2; Y), a classifier cannot learn well when the input feature space is {X1, X2, X3, X4, X5, X6}. This means that a learning machine cannot reveal the true optimal subset if the synergic genes have not been converted properly. It illustrates the complexity of genomic prediction and also offers a new explanation for “missing heritability” in GWAS.
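The effect of conversion can be illustrated with a deliberately simplified simulation. The sketch below keeps only one synergic pair (X3, X4), draws expression values uniformly from [0, 1], defines the class by thresholding |X3 − X4| at 0.3, and uses a nearest-centroid rule as a simple linear stand-in for the classifiers. All of these choices, and the function names, are assumptions for illustration; they do not reproduce the simulation design of Table 4 or the seven models of Table 5.

```python
import random
from statistics import mean

def nearest_centroid_accuracy(train, test):
    """train/test: lists of (feature_tuple, label). Returns test accuracy in %."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [mean(column) for column in zip(*rows)]

    def predict(x):
        # assign the sample to the class whose centroid is closest
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    return 100.0 * sum(predict(x) == y for x, y in test) / len(test)

# Simulated synergic pair (assumed model): class depends only on |X3 - X4|.
rng = random.Random(1)
samples = []
for _ in range(1000):
    x3, x4 = rng.random(), rng.random()
    samples.append((x3, x4, 1 if abs(x3 - x4) > 0.3 else 0))

train, test = samples[:500], samples[500:]

def raw(rows):
    """Unconverted input features (X3, X4)."""
    return [((x3, x4), y) for x3, x4, y in rows]

def converted(rows):
    """Converted input feature Z = |X3 - X4|."""
    return [((abs(x3 - x4),), y) for x3, x4, y in rows]

acc_raw = nearest_centroid_accuracy(raw(train), raw(test))
acc_converted = nearest_centroid_accuracy(converted(train), converted(test))
print(f"unconverted: {acc_raw:.1f}%  converted: {acc_converted:.1f}%")
```

With the unconverted features both class centroids sit near the centre of the unit square, so the linear rule stays close to chance level, whereas the single converted feature separates the classes well. A sufficiently flexible nonlinear model could in principle recover |X3 − X4| from the raw inputs, which is consistent with the observation that SVM-RBF and ANNs still cope when only a few pairs are involved.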

Combining individually discriminant and synergic genes can improve prediction performance

To further validate the reliability of the synergic genes selected by the abs conversion type, we also evaluated the prediction performance of individually discriminant and synergic genes on three more recent and larger publicly available datasets (Lung, Prostate2 and Cardiovascular) (see Table 1). Label randomization tests were also performed. The top individually discriminant genes are selected by t-test, and the top synergic genes are selected by the abs conversion type followed by t-test. Here, we take the individually discriminant genes and/or the converted synergic genes as the input features for the SVM-RBF classifier.
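A label randomization test can be sketched as follows: the class labels are shuffled so that any real association with the features is destroyed, and the same train-and-test procedure is repeated. The toy feature, the single train/test split and the nearest-class-mean rule used here are assumptions for illustration and stand in for the SVM-RBF classifier and five-fold CV used in the paper.

```python
import random
from statistics import mean

def nearest_mean_accuracy(z_train, y_train, z_test, y_test):
    """Stand-in classifier: assign each test value to the class with the nearer
    training mean, then return the test accuracy in %."""
    m0 = mean(z for z, y in zip(z_train, y_train) if y == 0)
    m1 = mean(z for z, y in zip(z_train, y_train) if y == 1)
    predictions = [0 if abs(z - m0) <= abs(z - m1) else 1 for z in z_test]
    return 100.0 * sum(p == y for p, y in zip(predictions, y_test)) / len(y_test)

rng = random.Random(7)

# Toy converted feature (assumed) that genuinely determines the class.
z = [rng.random() for _ in range(600)]
y_true = [1 if value > 0.5 else 0 for value in z]

# Randomized labels: same class proportions, association with z removed.
y_random = y_true[:]
rng.shuffle(y_random)

half = 300
acc_true = nearest_mean_accuracy(z[:half], y_true[:half], z[half:], y_true[half:])
acc_random = nearest_mean_accuracy(z[:half], y_random[:half], z[half:], y_random[half:])
print(f"true labels: {acc_true:.1f}%  randomized labels: {acc_random:.1f}%")
```

If the accuracy obtained with randomized labels stayed high, that would point to information leaking into the evaluation rather than to a real signal; a drop to roughly 50% on balanced classes is the expected outcome for a sound pipeline.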

Table 6 shows the prediction accuracies for different schemes of input features. The results show that: 1) Using the individually discriminant genes alone as input features, the average accuracies for Top10_Ind, Top20_Ind and Top40_Ind are 77.30%, 78.74% and 80.36%, respectively. Using the synergic genes alone as input features, the average accuracies for Top5_Syn, Top10_Syn and Top20_Syn are 75.58%, 81.67% and 84.63%, respectively. These results indicate that the synergic genes achieve accuracy comparable to that of the individually discriminant genes when the same number of genes is used. 2) When the input features involve 20 genes, the average accuracies for Top20_Ind, Top10_Syn and Top10_Ind + Top5_Syn are 78.74%, 81.67% and 83.74%, respectively. When the input features involve 40 genes, the average accuracies for Top40_Ind, Top20_Syn and Top20_Ind + Top10_Syn are 80.36%, 84.63% and 85.75%, respectively. These results indicate that combining individually discriminant and synergic genes, rather than using either type alone, achieves better prediction accuracy. 3) The classification performance in the label randomization tests drops to chance level, which supports the reliability of the synergic genes selected by the abs conversion type.

Table 6 Prediction accuracies of 5-fold CV in different schemes of input features (%).

The minimum number of individually discriminant and synergic genes required in the optimal subset remains to be determined by further research.

We also compared the prediction performance of the five conversion types (Table 7). The results show that the genes selected by the Abs conversion type improve the prediction performance of the individually discriminant model more than the genes selected by the other conversion types.

Table 7 Prediction accuracies of 5-fold CV in different conversion types (%).

Conclusion

In this paper, we propose a fast approach that combines the abs conversion type with the t-test to detect gene–gene synergy. We find that dendrogram-based I(X1; X2; Y) and doublets fail to discover pair-wise gene interactions, whereas the synergic genes selected by our method and by the MIC(X1; X2; Y) method are consistent with typical pair-wise synergy. However, MIC(X1; X2; Y) has a higher computational cost. For example, the running time of the entire process on the Prostate1 dataset (12,600 × 12,599/2 gene pairs) is approximately 20 hours with the MIC(X1; X2; Y) method (Intel Core i5-4590 @ 3.3 GHz), whereas it is only 47 minutes with our method. Experiments on simulated and real-world data showed that combining the individually discriminant genes selected by t-test and the synergic genes selected by our method can improve prediction performance. These synergic genes should be converted into new variables (Z) prior to being used as input features for classifiers.
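For reference, the size of the search space and the approximate speed-up implied by the running times quoted above work out as follows (the speed-up is a simple ratio of the two reported times, not a separately measured quantity):

$$\binom{12600}{2}=\frac{12600\times 12599}{2}=79{,}373{,}700,\qquad \frac{20\ \mathrm{h}}{47\ \mathrm{min}}=\frac{1200\ \mathrm{min}}{47\ \mathrm{min}}\approx 25.5$$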