
# A fast approach to detect gene–gene synergy

## Abstract

Selecting informative genes, including individually discriminant genes and synergic genes, from expression data has been useful for medical diagnosis and prognosis. Detecting synergic genes is more difficult than selecting individually discriminant genes. Several efforts have recently been made to detect gene–gene synergies, such as dendrogram-based I(X 1; X 2; Y) (mutual information), doublets (gene pairs) and MIC(X 1; X 2; Y) based on the maximal information coefficient. It is unclear whether dendrogram-based I(X 1; X 2; Y) and doublets can capture synergies efficiently. Although MIC(X 1; X 2; Y) can capture a wide range of interactions, it has a high computational cost caused by its 3-D search. In this paper, we developed a simple and fast approach based on the abs conversion type (i.e. Z = |X 1 − X 2|) and the t-test to detect interactions in simulated and real-world datasets. Our results showed that dendrogram-based I(X 1; X 2; Y) and doublets fail to discover pair-wise gene interactions, whereas our approach discovers typical pair-wise synergic genes efficiently. These synergic genes reach accuracy comparable to that of the individually discriminant genes when the same number of genes is used. Classifiers cannot learn well if synergic genes have not been converted properly. Combining individually discriminant and synergic genes can improve prediction performance.

## Introduction

Selection of informative genes, including individually discriminant genes and synergic genes, from expression data has been useful for medical diagnosis and prognosis. Individual gene ranking techniques such as the t-test1 can typically produce a “list of genes” that are correlated with disease2. However, they cannot provide insight into the interactions among these genes. According to information theory, the pair-wise interaction I(X 1; X 2; Y)3 is defined as

$$I({X}_{1};{X}_{2};Y)=I({X}_{1},{X}_{2};Y)-I({X}_{1};Y)-I({X}_{2};Y)$$
(1)

where I is the symbol for mutual information, I (X 1; Y) is the individual effect of gene X 1 relative to phenotype Y, I (X 2; Y) is the individual effect of gene X 2 relative to Y, and I (X 1, X 2; Y) is the joint effect of X 1 and X 2 relative to Y. A positive value of I (X 1; X 2; Y) indicates synergy, while a negative value of I (X 1; X 2; Y) indicates redundancy.
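To make the sign convention of Eq. (1) concrete, the following minimal Python sketch estimates the three mutual-information terms from the empirical frequencies of discrete variables. The XOR and duplicated-gene toy data, as well as the function names, are illustrative assumptions and are not taken from the cited studies.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def mutual_information(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B) for two equally long discrete sequences."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def synergy(x1, x2, y):
    """Equation (1): I(X1, X2; Y) - I(X1; Y) - I(X2; Y)."""
    joint = mutual_information(list(zip(x1, x2)), y)
    return joint - mutual_information(x1, y) - mutual_information(x2, y)

# Hypothetical toy data: every (x1, x2) combination occurs once.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y_xor = [a ^ b for a, b in zip(x1, x2)]  # Y depends only on the pair
y_copy = list(x1)                        # Y simply repeats X1

syn_xor = synergy(x1, x2, y_xor)         # positive: synergy
syn_redundant = synergy(x1, x1, y_copy)  # X2 duplicates X1, negative: redundancy
print(syn_xor, syn_redundant)
```

Under these assumptions the XOR case gives +1 bit (neither variable is informative alone, but the pair determines Y), and the duplicated-gene case gives −1 bit.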

Figure 1 illustrates four typical examples of pair-wise synergy from Watkinson et al.4 (Fig. 1A,B) and Chen et al.5 (Fig. 1C,D). Figure 1A–C are generated from simulated data, and Fig. 1D from real-world data. As an example, when RGS9 and DIAPH2 are evaluated individually, neither gene is correlated with cancer. Therefore, RGS9 and DIAPH2 would not appear in the output of any “individual gene ranking” technique. However, when their pair-wise interaction is evaluated, the gene pair RGS9–DIAPH2 is sufficient to distinguish cancer from normal samples (Fig. 1D).

Detecting synergic genes is more difficult than selecting individually discriminant genes. Several efforts have recently been made to detect gene–gene synergies, and they generally follow one of two strategies. The first is the non-conversion strategy, which uses formula (1) directly to measure I(X 1; X 2; Y)4 or uses the maximal information coefficient directly to measure MIC(X 1; X 2; Y)5. How continuous variables are discretized is key to estimating mutual information. Binarization, such as the dendrogram-based4 technique, simplifies the estimation and yields simple logical functions connecting the genes. However, it may result in information loss and estimation error. Although MIC(X 1; X 2; Y)5 can capture a wide range of interactions, it has a high computational cost caused by its 3-D search. The second is the conversion strategy, such as doublets6 and top scoring pair (TSP)7. These methods derive a new variable Z from a combination of X 1 and X 2 (e.g. for the sum type of doublets, Z = X 1 + X 2) and measure I(Z; Y) instead of I(X 1; X 2; Y). This strategy has a low computational cost because the search space is reduced from 3-D to 2-D. However, it is unclear whether this conversion strategy can capture synergies8 efficiently.

Inspecting Fig. 1A–C, we found that they share the same pattern and can be characterized by the same function, Y = |X 1 − X 2|. The only difference between them is the value range of the independent variables. Although doublets6 include the sum, diff, mul and sign conversion types (TSP is similar to sign), they omit the abs conversion type.

In this work, we developed a simple and fast approach based on the abs conversion type and the t-test to discover pair-wise synergic genes that are related to cancer. Furthermore, we validated these synergic genes by their classification performance on simulated and real-world datasets. Our results show that these synergic genes can enhance the individually discriminant model and improve prediction performance. We also demonstrate that these synergic genes should be converted into new variables (Z) before being used as input features for classifiers, especially when there are many pairs of synergic genes.

## Datasets and Methods

### Datasets

Four binary class datasets are involved in this work. The reference, sample size, number of genes in each dataset, and the number of samples in each class are summarized in Table 1. All gene expression data have been normalized by using the RMA method9.

### Conversion types and pair-wise gene rank

Suppose that a dataset has n samples and m genes and is denoted {Y i , X ij}, i = 1,2,…,n; j = 1,2,…,m. X ij represents the expression value of the j th gene (G j) in the i th sample, and Y i represents the class label of the i th sample, with Y i ∈ {0, 1}, where 0 denotes cancerous and 1 denotes normal tissue samples. Rank-based methods7 are robust to quantization effects and can overcome background differences between gene pairs. Therefore, letting R ij denote the rank of the i th sample within the j th gene, we replace the expression values X ij by their ranks R ij and obtain a new data matrix {Y i , R ij}.
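The rank replacement described above can be sketched in a few lines of Python. This is only an illustration: the text does not say how tied expression values are handled, so assigning tied samples the average of their ranks is an assumption here, and the function names and the small matrix `x` are hypothetical.

```python
def rank_column(values):
    """Return 1-based ranks of `values`; ties get the average of their ranks (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks

def rank_matrix(x):
    """Replace X_ij by R_ij: rank the samples (rows) within each gene (column)."""
    columns = [rank_column(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]

# Hypothetical expression matrix: 3 samples (rows) x 2 genes (columns).
x = [[5.1, 0.2],
     [3.3, 0.9],
     [4.0, 0.2]]
r = rank_matrix(x)
print(r)  # [[3.0, 1.5], [1.0, 3.0], [2.0, 1.5]]
```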

For two genes G p and G q , Doublets 6 lists four conversion types.

$$\text{Sum conversion type: } Z_{is} = R_{ip} + R_{iq}$$
(2)
$$\text{Diff conversion type: } Z_{is} = R_{ip} - R_{iq}$$
(3)
$$\text{Mul conversion type: } Z_{is} = R_{ip} \times R_{iq}$$
(4)
$$\text{Sign conversion type: } Z_{is} = \begin{cases} 1, & \text{if } R_{ip} \ge R_{iq} \\ 0, & \text{if } R_{ip} < R_{iq} \end{cases}$$
(5)

We add a new conversion type:

$$\text{Abs conversion type: } Z_{is} = |R_{ip} - R_{iq}|$$
(6)

Here, i = 1,2,…,n; p = 1,2,…, m; q = 1,2,…, m; p ≠ q; s = 1,2,…, m(m−1)/2. Again, we get a new data matrix {Y i , Z is }. For each converted feature Z s , we use the t-score, instead of I (Z; Y), to rank the association between Z and Y, since Y ∈ {0, 1}.

The individually discriminant genes are also ranked by t-score.
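As a rough illustration of Eqs. (2) to (6) combined with t-score ranking, the sketch below scores every gene pair of a small rank matrix. Several details are assumptions rather than the method as implemented in the paper: the text does not state which t-test variant is used, so Welch's unequal-variance statistic (in absolute value) is taken here, and the toy ranks, labels and function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance

def t_score(z, y):
    """Absolute Welch t statistic of feature z between classes y == 0 and y == 1."""
    g0 = [v for v, label in zip(z, y) if label == 0]
    g1 = [v for v, label in zip(z, y) if label == 1]
    se = math.sqrt(variance(g0) / len(g0) + variance(g1) / len(g1))
    return abs(mean(g0) - mean(g1)) / se if se > 0 else 0.0

CONVERSIONS = {
    "sum": lambda a, b: a + b,               # Eq. (2)
    "diff": lambda a, b: a - b,              # Eq. (3)
    "mul": lambda a, b: a * b,               # Eq. (4)
    "sign": lambda a, b: 1 if a >= b else 0, # Eq. (5)
    "abs": lambda a, b: abs(a - b),          # Eq. (6)
}

def rank_pairs(r, y, kind="abs"):
    """Score every gene pair (p, q) with p < q and sort by descending t-score."""
    convert = CONVERSIONS[kind]
    genes = list(zip(*r))  # genes[j] holds the ranks of gene j over all samples
    scored = []
    for p, q in combinations(range(len(genes)), 2):
        z = [convert(a, b) for a, b in zip(genes[p], genes[q])]
        scored.append((t_score(z, y), p, q))
    return sorted(scored, reverse=True)

# Hypothetical ranks: 8 samples (rows) x 3 genes (columns); genes 0 and 1 form a synergic pair.
r = [[1, 2, 4], [3, 4, 5], [5, 6, 1], [7, 8, 8],
     [2, 7, 3], [8, 1, 6], [4, 5, 7], [6, 3, 2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

best = rank_pairs(r, y, kind="abs")[0]
best_single = max(t_score(gene, y) for gene in zip(*r))
print(best, best_single)
```

In this toy example the pair (0, 1) receives the highest abs-type t-score (about 2.32), while no single gene scores above about 0.55, which mirrors the situation in which individual ranking misses a synergic pair.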

### Support Vector Machine Classifier and performance evaluation

Each gene pair and each individual gene is ranked by t-score based on all samples. The top N gene pairs and/or the top N individually discriminant genes are selected as input features. The Support Vector Machine (SVM) classifier used is LIBSVM, available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/ 10. We simply use the average accuracy of five-fold cross-validation (CV) to evaluate classifier performance, as the datasets involved in this paper have balanced numbers of positive and negative samples.

$$Accuracy=\frac{{\rm{TP}}+{\rm{TN}}}{{\rm{TP}}+{\rm{FP}}+{\rm{TN}}+{\rm{FN}}}\times 100 \%$$
(7)

Here TP, TN, FP, FN denote true positives, true negatives, false positives and false negatives respectively.
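The evaluation protocol can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: a one-feature threshold rule stands in for the LIBSVM classifier, the folds are formed by a seeded shuffle without stratification (the text does not specify how folds are drawn), and all names and data are hypothetical.

```python
import random
from statistics import mean

def accuracy(y_true, y_pred, positive=1):
    """Equation (7): (TP + TN) / (TP + FP + TN + FN) x 100%."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)

def five_fold_accuracy(z, y, fit_predict, seed=0):
    """Average test accuracy over five folds of a shuffled sample index."""
    index = list(range(len(y)))
    random.Random(seed).shuffle(index)
    folds = [index[k::5] for k in range(5)]
    scores = []
    for test in folds:
        held_out = set(test)
        train = [i for i in index if i not in held_out]
        y_pred = fit_predict([z[i] for i in train], [y[i] for i in train],
                             [z[i] for i in test])
        scores.append(accuracy([y[i] for i in test], y_pred))
    return mean(scores)

def threshold_classifier(z_train, y_train, z_test):
    """Stand-in for the SVM: split one feature at the midpoint of the class means."""
    m0 = mean(v for v, label in zip(z_train, y_train) if label == 0)
    m1 = mean(v for v, label in zip(z_train, y_train) if label == 1)
    cut = (m0 + m1) / 2
    return [int((v > cut) == (m1 > m0)) for v in z_test]

# Hypothetical converted feature Z for 20 balanced samples.
z = [1, 2, 1, 3, 2, 1, 2, 3, 1, 2, 8, 9, 7, 8, 9, 7, 8, 9, 8, 7]
y = [0] * 10 + [1] * 10
cv_accuracy = five_fold_accuracy(z, y, threshold_classifier)
print(cv_accuracy)
```

Because the toy feature separates the two classes completely, every fold is classified correctly; real data would of course give lower and fold-dependent values.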

## Results and Discussion

### Comparing gene pairs selected by different methods

Figure 2 shows scatterplots of the top two gene pairs selected by the abs conversion type and six reference methods in the Prostate1 dataset11. In Fig. 2A,B,M and N, although the top two synergic gene pairs selected by the abs conversion type and MIC(X 1; X 2; Y) are different, they share the same pattern: each gene is unrelated to cancer when evaluated individually, but the gene pair is sufficient to distinguish cancer from normal samples. Figure 2C–L show the top two gene pairs selected by the sum, diff, mul, sign and dendrogram-based I(X 1; X 2; Y) methods. As an example (Fig. 2C), the higher the expression level of PWP2, the more likely the sample is cancerous. MNAT1 shows a pattern similar to that of PWP2. Thus, these two genes (PWP2 and MNAT1) are directly related to cancer; however, they are individually discriminant rather than synergic genes. In short, only the abs conversion type and MIC(X 1; X 2; Y) capture typical pair-wise synergies, whereas dendrogram-based I(X 1; X 2; Y) and doublets fail to discover pair-wise gene interactions.

We then compared the overlaps among the informative genes selected by the Ind, Sum, Diff, Mul, Sign and Abs methods (Table 2). Clearly, the first five methods detect a considerable number of the same informative genes. In contrast, the informative genes selected by the Abs method have little overlap with those selected by the other methods.

Given the top 10 pair-wise synergic gene pairs (16 genes) selected by the abs conversion type, Fig. 3 shows the heat maps generated from these genes under different conversion types. Only the heat maps for the abs conversion type (Fig. 3A) and the diff conversion type (Fig. 3C) can distinguish cancer from normal samples. Under the diff conversion type, the Z values are medium in cancer samples but either low or high in normal samples, or vice versa. Because the two class means of Z are therefore similar, the pair-wise synergic genes converted by diff receive low t-scores and are not highlighted.
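A small numerical illustration may help explain why the diff type yields a low t-score for such pairs. The rank differences below are hypothetical values chosen to match the described pattern (medium in one class, either low or high in the other); only the numerator of the t-score, the gap between class means, is computed.

```python
from statistics import mean

# Hypothetical rank differences R_p - R_q for one synergic gene pair.
diff_cancer = [-1, 0, 1, 0, -1, 1]   # medium values, centred on zero
diff_normal = [-6, 5, -5, 6, -4, 4]  # either low or high, also centred on zero

# Diff conversion: positive and negative values cancel within each class.
gap_diff = abs(mean(diff_cancer) - mean(diff_normal))

# Abs conversion: taking |R_p - R_q| first removes the cancellation.
abs_cancer = [abs(v) for v in diff_cancer]
abs_normal = [abs(v) for v in diff_normal]
gap_abs = abs(mean(abs_cancer) - mean(abs_normal))

print(gap_diff, gap_abs)
```

With these values the class means under diff coincide (gap of 0), whereas under abs they differ by about 4.33, so only the abs-converted feature can receive a high t-score.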

To assess whether the synergic genes selected by the abs conversion type have biological relevance to cancer, we further checked the top 10 gene pairs (16 genes) against the UniHI12 database (http://www.unihi.org/) and PubMed (Table 3). UniHI is an enhanced database for retrieval and interactive analysis of human molecular interaction networks. Among the top 10 gene pairs, we have so far found two (PARP1–HMGB1 and CCHCR1–GRAP) that are recorded as interacting in UniHI. The interaction between PARP1 and HMGB1 has been verified by Dara et al. (2007)13: the activation of PARP1 induces release of the pro-inflammatory mediator HMGB1 from the nucleus13,14,15. Of the 16 genes, 15 have been reported to be related to cancer, and four of them to prostate cancer directly. Although LINC01278 has not yet been reported to be related to cancer, the abs conversion type suggests that it is an important informative gene: LINC01278 occurs three times in the top 10 gene pairs (Table 3) and deserves further attention.

### Classifier cannot learn well if synergic genes have not been converted properly

Although we obtained the pair-wise synergic genes with the abs conversion type, Fig. 3F suggests that the unconverted features (X or R) cannot distinguish cancer from normal samples. This indicates that the input features for classifiers should be the converted features Z (Fig. 3A). We therefore conducted an experiment to test this hypothesis. Ten simulation datasets were generated according to Table 4; their five-fold cross-validation prediction accuracies are listed in Table 5.

With few input features (e.g. dataset1 and dataset2; Table 5), all seven models perform well on the converted features, whereas only two models (SVM-RBF and ANNs) perform well on the unconverted features. With more input features (e.g. dataset9 and dataset10; Table 5), four models (SVM-RBF, SVM-poly, SVM-sig and ANNs) still perform well on the converted features, but none of the seven models performs well on the unconverted features. Thus, we conclude that pair-wise synergic genes should be converted into new variables (Z) before being used as input features for classifiers, especially when there are many pairs of synergic genes.
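The effect of conversion on a simple classifier can be illustrated with a toy example. The sketch below is not one of the seven models evaluated in Table 5: it uses a nearest-centroid rule as a stand-in, reports training accuracy rather than cross-validated accuracy, and works on a hypothetical 10 × 10 grid of values for a single gene pair.

```python
from statistics import mean

def nearest_centroid_accuracy(features, labels):
    """Training accuracy (%) of a nearest-centroid rule; ties go to class 0."""
    centroids = {}
    for c in (0, 1):
        rows = [f for f, label in zip(features, labels) if label == c]
        centroids[c] = [mean(col) for col in zip(*rows)]

    def sq_dist(f, c):
        return sum((a - b) ** 2 for a, b in zip(f, centroids[c]))

    predicted = [1 if sq_dist(f, 1) < sq_dist(f, 0) else 0 for f in features]
    hits = sum(p == label for p, label in zip(predicted, labels))
    return 100.0 * hits / len(labels)

# Hypothetical gene pair on a 10 x 10 grid: class 1 when the two values differ strongly.
raw = [(a, b) for a in range(10) for b in range(10)]
labels = [1 if abs(a - b) >= 4 else 0 for a, b in raw]
converted = [(abs(a - b),) for a, b in raw]  # abs conversion, Z = |X1 - X2|

acc_raw = nearest_centroid_accuracy(raw, labels)
acc_converted = nearest_centroid_accuracy(converted, labels)
print(acc_raw, acc_converted)
```

On the raw pair both class centroids fall at the same point (4.5, 4.5), so the rule can only predict the majority class (58% here), while on the converted feature Z it separates the classes completely. More flexible models may still cope with a few unconverted pairs, consistent with the SVM-RBF and ANN results reported above.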

This is a surprising and important finding. Suppose phenotype Y is determined by the individually discriminant genes X 1 and X 2, and by the pair-wise synergic genes X 3–X 4 and X 5–X 6. In other words, the true genetic model is $$Y={X}_{1}+{X}_{2}+|{X}_{3}-{X}_{4}|+|{X}_{5}-{X}_{6}|$$, the true optimal subset is {X 1, X 2, X 3, X 4, X 5, X 6}, and X 7–X 1000 are genes unrelated to Y. Now suppose we have the dataset {Y, X 1, X 2,…, X 1000} and want to construct a genomic prediction model16 based on machine learning, but do not know the true genetic model. Even if the individually discriminant genes X 1 and X 2 are highlighted by the t-test, and the synergic genes X 3, X 4, X 5 and X 6 are highlighted by the abs conversion type or MIC(X 1; X 2; Y), the classifier cannot learn well when the input feature space is {X 1, X 2, X 3, X 4, X 5, X 6}. That is, a learning machine cannot tell us the true optimal subset if synergic genes have not been converted properly. This illustrates the complexity of genomic prediction and also offers a new explanation for “missing heritability” in GWAS.

### Combining individually discriminant and synergic genes can improve prediction performance

To further validate the reliability of the synergic genes selected by the abs conversion type, we also evaluated the prediction performance of individually discriminant and synergic genes on three more recent and larger publicly available datasets (Lung, Prostate2 and Cardiovascular; see Table 1). Label randomization tests were also performed. The top individually discriminant genes are selected by the t-test, and the top synergic genes are selected by the abs conversion type followed by the t-test. Here, we take the individually discriminant genes and/or converted synergic genes as the input features for the SVM-RBF classifier.

Table 6 reports the prediction accuracy for different input-feature schemes. The results show that: 1) Using the individually discriminant genes alone as input features, the average accuracies for Top10_Ind, Top20_Ind and Top40_Ind are 77.30%, 78.74% and 80.36%, respectively. Using the synergic genes alone as input features, the average accuracies for Top5_Syn, Top10_Syn and Top20_Syn are 75.58%, 81.67% and 84.63%, respectively. These results indicate that the synergic genes achieve accuracy comparable to that of the individually discriminant genes when the same number of genes is used. 2) When the input features involve 20 genes, the average accuracies for Top20_Ind, Top10_Syn and Top10_Ind + Top5_Syn are 78.74%, 81.67%, and 83.74%, respectively. When the input features involve 40 genes, the average accuracies for Top40_Ind, Top20_Syn and Top20_Ind + Top10_Syn are 80.36%, 84.63%, and 85.75%, respectively. These results indicate that combining individually discriminant and synergic genes, rather than using only one of the two kinds, gives better prediction accuracy. 3) The classification performance in the label randomization tests drops to chance level, which supports the reliability of the synergic genes selected by the abs conversion type.

The minimum number of individually discriminant and synergic genes required in the optimal subset remains to be determined by further research.

We also compared the prediction performance of the five conversion types (Table 7). The results show that the genes selected by the abs conversion type improve the prediction performance of the individually discriminant model more than the genes selected by the other conversion types.

## Conclusion

In this paper, we propose a fast approach based on the combination of the abs conversion type and the t-test to detect gene–gene synergy. We find that dendrogram-based I(X 1; X 2; Y) and doublets fail to discover pair-wise gene interactions, whereas the synergic genes selected by our method and by the MIC(X 1; X 2; Y) method are consistent with typical pair-wise synergy. However, MIC(X 1; X 2; Y) has a higher computational cost. For example, the running time of the entire process on the Prostate1 dataset (12,600 × 12,599/2 gene pairs) is approximately 20 hours with the MIC(X 1; X 2; Y) method (Intel Core i5-4590@3.3 GHz), whereas it is only 47 minutes with our method. Experiments on simulated and real-world data showed that combining the individually discriminant genes selected by the t-test with the synergic genes selected by our method can improve prediction performance. These synergic genes should be converted into new variables (Z) before being used as input features for classifiers.
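For orientation, the size of the pair-wise search and the reported running times can be put into numbers with a short calculation. The figures are those quoted above; the variable names are illustrative, and the throughput is only a back-of-the-envelope estimate that assumes the 47 minutes are spent uniformly over all pairs.

```python
m = 12_600                    # genes in Prostate1, as quoted above
pairs = m * (m - 1) // 2      # number of unordered gene pairs

mic_minutes = 20 * 60         # "approximately 20 hours" for MIC(X1; X2; Y)
abs_minutes = 47              # abs conversion type + t-test

speedup = mic_minutes / abs_minutes
pairs_per_second = pairs / (abs_minutes * 60)

print(pairs)                      # 79373700
print(round(speedup, 1))          # 25.5
print(round(pairs_per_second))    # 28147
```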

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

1. Jafari, P. & Azuaje, F. An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Medical Informatics and Decision Making 6, 27 (2006).
2. Neumann, U., Genze, N. & Heider, D. EFS: an ensemble feature selection tool implemented as R-package and web-application. BioData Mining 10, 21 (2017).
3. Anastassiou, D. Computational analysis of the synergy among multiple interacting genes. Molecular Systems Biology 3, 83 (2007).
4. Watkinson, J., Wang, X., Tian, Z. & Anastassiou, D. Identification of gene interactions associated with disease from gene expression data using synergy networks. BMC Systems Biology 2, 1–16 (2008).
5. Chen, Y. et al. Discovering Pair-wise Synergies in Microarray Data. Scientific Reports 6, 30672 (2016).
6. Chopra, P., Lee, J., Kang, J. & Lee, S. Improving Cancer Classification Accuracy Using Gene Pairs. PLoS One 5, e14305 (2010).
7. Geman, D. et al. Classifying gene expression profiles from pairwise mRNA comparisons. Statistical Applications in Genetics & Molecular Biology 3, Article19 (2004).
8. Chen, Y. et al. Informative gene selection and the direct classification of tumors based on relative simplicity. BMC Bioinformatics 17, 1–16 (2016).
9. Irizarry, R. A. et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003).
10. Chang, C. & Lin, C. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems & Technology 2, 389–96 (2011).
11. Singh, D. et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203 (2002).
12. Kalathur, R. K. R. et al. UniHI 7: an enhanced database for retrieval and interactive analysis of human molecular interaction networks. Nucleic Acids Research 42(Database issue), D408 (2014).
13. Dara, D. W.-X. Z. & Craig, B. Thompson. Activation of Poly(ADP)-ribose Polymerase (PARP-1) Induces Release of the Pro-inflammatory Mediator HMGB1 from the Nucleus. Journal of Biological Chemistry 282, 17845 (2007).
14. Sharma, A. et al. Overexpression of high mobility group (HMG) B1 and B2 proteins directly correlates with the progression of squamous cell carcinoma in skin. Cancer Investigation 26, 43–51 (2008).
15. Gnanasekar, M. et al. HMGB1: A Promising Therapeutic Target for Prostate Cancer. Prostate Cancer 10, 157103 (2013).
16. Bermingham, M. L. et al. Application of high-dimensional feature selection: evaluation for genomic prediction in man. Scientific Reports 5, 10312 (2015).
17. Spira, A. et al. Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nature Medicine 13, 361–366 (2007).
18. Penney, K. L. et al. Association of Prostate Cancer Risk Variants with Gene Expression in Normal and Tumor Tissue. Cancer Epidemiology, Biomarkers & Prevention 24, 255–260 (2015).
19. Ellsworth, D. L. et al. Intensive Cardiovascular Risk Reduction Induces Sustainable Changes in Expression of Genes and Pathways Important to Vascular Function. Circulation: Cardiovascular Genetics 7, 151–160 (2014).
20. Lacroix, M. Significance, detection and markers of disseminated breast cancer cells. Endocrine-Related Cancer 13, 1033 (2006).
21. Xia, G. et al. EphB4 expression and biological significance in prostate cancer. Cancer Research 65, 4623–32 (2005).
22. Neil, J. R. et al. TAB1:IκB Kinase Interaction Promotes Transforming Growth Factor β–Mediated Nuclear Factor-κB Activation during Breast Cancer Progression. Cancer Research 68, 1462–70 (2008).
23. Zhou, J. et al. Over-Expression of CDH22 Is Associated with Tumor Progression in Colorectal Cancer. Tumor Biology 30, 130–40 (2009).
24. Ding, X. et al. KLF7 overexpression in human oral squamous cell carcinoma promotes migration and epithelial-mesenchymal transition. Oncology Letters 13, 2281–2289 (2017).
25. McCormick, C. et al. The putative tumour suppressor EXT1 alters the expression of cell-surface heparan sulfate. Nature Genetics 19, 158 (1998).
26. Jönsson, G. et al. Genomic subtypes of breast cancer identified by array-comparative genomic hybridization display distinct molecular and clinical characteristics. Breast Cancer Research 12, R42 (2010).
27. Hammerman, P. S. et al. Mutations in the DDR2 Kinase Gene Identify a Novel Therapeutic Target in Squamous Cell Lung Cancer. Cancer Discovery 1, 78 (2011).
28. Matullo, G. et al. Abstract 778: DNA repair capacity, chromosomal damage, methylation and gene expression levels in bladder cancer: An integrated analysis 76, 778–778 (2016).
29. Larsson. et al. DIP2C regulates expression of the tumor suppressor gene CDKN2A. Genomics (2014).
30. Kim, Y. R. et al. Differential CARM1 expression in prostate and colorectal cancers. BMC Cancer 10, 1–13 (2010).
31. Suomela, S. et al. CCHCR1 Is Up-Regulated in Skin Cancer and Associated with EGFR Expression. PLoS One 4, e6030 (2009).
32. Ludwig, L. et al. Expression of the Grb2-related RET adapter protein Grap-2 in human medullary thyroid carcinoma. Cancer Letters 275, 194–7 (2009).
33. Schiewer, M. J. et al. Dual roles of PARP-1 promote cancer growth and progression. Cancer Discovery 2, 1134 (2012).

## Acknowledgements

This work was supported by the National Natural Science Foundation of China (61701177 to Y.C.) and the Science Research Projects of Hunan Provincial Department of Education (1071 to Z.Y.). We thank Dr. Alicia K. Byrd for helpful suggestions.

## Author information

### Author notes

1. Pengwei Xing and Yuan Chen contributed equally to this work.

### Affiliations

1. Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha, Hunan, 410128, China

• Pengwei Xing, Yuan Chen & Zheming Yuan

2. Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, Hunan, 410128, China

• Pengwei Xing, Yuan Chen & Zheming Yuan

• Jun Gao

4. Biotechnology Research Center, Hunan Academy of Agricultural Sciences, Changsha, Hunan, 410125, China

• Lianyang Bai

### Contributions

P.X., Y.C., L.B. and Z.Y. conceived and designed the experiments. P.X. and Y.C. performed the experiments. P.X., Y.C., J.G., L.B. and Z.Y. analyzed the data. P.X., J.G. and Z.Y. wrote the paper. All the authors reviewed the manuscript.

### Competing Interests

The authors declare that they have no competing interests.

### Corresponding authors

Correspondence to Lianyang Bai or Zheming Yuan.