Discovering Pair-wise Synergies in Microarray Data

Informative gene selection can have important implications for the improvement of cancer diagnosis and the identification of new drug targets. Individual-gene-ranking methods ignore interactions between genes. Furthermore, popular pair-wise gene evaluation methods, e.g. TSP and TSG, are helpless for discovering pair-wise interactions. Several efforts to discover pair-wise synergy have been made based on the information approach, such as EMBP and FeatKNN. However, the methods which are employed to estimate mutual information, e.g. binarization, histogram-based and KNN estimators, depend on known data or domain characteristics. Recently, Reshef et al. proposed a novel maximal information coefficient (MIC) measure to capture a wide range of associations between two variables that has the property of generality. An extension from MIC(X; Y) to MIC(X1; X2; Y) is therefore desired. We developed an approximation algorithm for estimating MIC(X1; X2; Y) where Y is a discrete variable. MIC(X1; X2; Y) is employed to detect pair-wise synergy in simulation and cancer microarray data. The results indicate that MIC(X1; X2; Y) also has the property of generality. It can discover synergic genes that are undetectable by reference feature selection methods such as MIC(X; Y) and TSG. Synergic genes can distinguish different phenotypes. Finally, the biological relevance of these synergic genes is validated with GO annotation and OUgene database.

Cancer tissue sample microarray expression data typically possess a common property-the number of samples is much smaller than the number of features-here those features are genes 1 . Informative gene selection has important implications for the improvement of cancer diagnosis, the selection of targeted therapeutics, and the identification of new drug targets 2,3 . Individual-gene-ranking methods, such as the t test for binary class differentiation 4 and the F test for multi-class differentiation rank genes by comparing the expression values of the same individual gene between different classes. Although these individual-gene methods may discover individual effect genes efficiently, they may have ignored interactions (i.e., redundancy and synergy) between genes [4][5][6] . The interactions between genes are critical in pathway dysregulations which trigger carcinogenesis 7 . Table 1 illustrates an example case of synergy between Gene X 1 and Gene X 2 : 1) Knowledge regarding the state of only one of these two variables leaves the state of Y uncertain. 2) When states of both X 1 and X 2 are known, then the state of Y becomes certain.
Pair-wise gene evaluation has been implemented in several popular algorithms, including top scoring pair (TSP) 8,9 , top scoring genes (TSG) 2 , and doublets (sum, diff, mul and sign) 7 , which all compare expression values of the same sample between two different genes. However, these methods are incapable of discovering pair-wise interactions efficiently. For example, let X 1 and X 2 be two independent random variables; Y equals |X 1 -X 2 | and is binarized with a median (Fig. 1). Then, the Δ -score for TSP is 0.04, the χ 2 -score for TSG is 0.18, and the t-score is 0.04, 0.18, 3.42, and 0.56 for sum, diff, mul, and sign, respectively. The synergic pairs, X 1 and X 2 , cannot be highlighted with these low scores calculated by these methods.
Based on information theory, the measure of I(X 1 ; X 2 ; Y) 10,11 can be used to identify pair-wise interactions [12][13][14] . The interaction of a gene pair with respect to cancer is defined as Where I is the symbol for mutual information (MI), X 1 and X 2 are random variables representing the expression levels of the two genes and Y is a binary random variable representing the presence or absence of cancer 15 . A positive value of I(X 1 ; X 2 ; Y) indicates synergistic interactions, while a negative value of I(X 1 ; X 2 ; Y) indicates redundant interactions. Several efforts have recently been made to discover pair-wise synergy even multivariate synergy among interacting genes on experimental biological data. The Anastassiou group proposed a systems-based approach called Entropy Minimization and Boolean Parsimony (EMBP) to identify modules of genes that are jointly associated with a phenotype from gene expression data 15 and SNP data 16 . Anastassiou 11 emphasized the significance of multivariate analysis such as EMBP for molecular systems biology and clarified the fundamental concepts by explaining the precise physical meaning. Watkinson et al. 17 presented a novel dendrogram-based technique to identify synergies of pairwise genes. Hanczar et al. 18 devised a histogram-based method called FeatKNN to detect the joint effect I(X 1 , X 2 ; Y). Park et al. 19 proposed a new approach for inferring combinatorial Boolean rules of gene sets for cancer classification by using a synergy network. Shiraishi et al. 20 presented a rank-based non-parametric statistical test for measuring synergistic combinations between two gene sets. Ignac et al. 21 used interaction distances (ID) to identify the most synergic pairs of markers such as SNPs.
Binarization of continuous expression data simplifies the estimation of MI and provides simple logical functions connecting the genes within the found modules 2,15 . However, there are multitype complicated patterns in both real-world data ( Fig. 2A,B) and simulation data (Fig. 2C,D); binarization might lead to loss of information 11,21 . For example, the IGLC1 gene for the prostate dataset must be trinarized, rather than binarized (Fig. 2C). Several methods have been proposed for the MI estimation, such as kernel density estimation 22 , histogram-based technique 23 , k-nearest-neighbor estimator 24 , B-spline functions 25 , Edgeworth 26 , adaptive partitioning 27,28 and dendrogram-based method 17 . Khan et al. 29 evaluated the relative performance of several MI estimation methods, and suggested that the most suitable estimation procedure would depend on known data or domain characteristics and exploratory data analysis. Recently, Reshef et al. 30 presented a novel estimator for two variables called maximal information coefficient (MIC). MIC explores various binning strategies with different numbers of bins, and can capture a wide range of associations, both functional and non-functional, regardless of linear or non-linear relationships. Due to its generality, MIC is becoming widely accepted in scientific research fields 31 . Therefore, there is a large demand for extending MIC from two variables to three variables, even multivariate, to capture a wide range of synergistic interactions 32 .
In this paper, we first developed and described an algorithm to compute MIC(X 1 ; X 2 ; Y). We demonstrated the generality of MIC(X 1 ; X 2 ; Y) with simulation data. We identified the most synergic pairs of genes (not discovered by popular feature selection approaches) using MIC(X 1 ; X 2 ; Y) with several real-world, cancer gene expression profile datasets. Finally, we validated these synergic genes using classification performance, Gene Ontology annotation (GO), and the OUgene database 33 . Table 1. A typical pair-wise synergy between X 1 and X 2 . ⊕ is an exclusive-or operation. Calculation of MIC(X 1 ; X 2 ; Y) where Y is a discrete variable Preliminary. Given a finite set D n × 3 = {(x 1 , x 2 , y)| x 1 ∈ X 1 , x 2 ∈ X 2 , y ∈ Y}, where n is the sample size, X 1 and X 2 are two continuous independent variables, Y is the discrete dependent variable Y = {class 1 , class 2 ,..., class P }, and P is the number of classes, we can partition X 1 , X 2 , and Y into x 1 bins, x 2 bins, and y bins, respectively. Here, y is fixed as P, because Y is a discrete variable. We denote such a partition x 1 -by-x 2 -by-y as grid G, and the distribution of the data points in D on the cells of G as D | G .

Definition 1
For a finite set ⊂ R D 3 and positive integers x 1 , x 2 , y, define where the maximum is over all grids G with x 1 -by-x 2 -by-y, and I(D| G ) is the interaction defined in formula (1).

Definitions 3
The maximal interaction coefficient MIC(X 1 ; X 2 ; Y) of a set D of three-variable data with sample size n and grid size less than B(n) is defined as  In this paper a equals 0.6, the default setting suggested by Reshef et al. 30 .
The maximal grid size B(n) and normalization of MIC(X 1 ; X 2 ; Y). Formula (1) can be rewritten as Here I(X 2 , Y|X 1 ) and I(X 1 , Y|X 2 )are conditional mutual information. According to formula (5) and knowing that the X 1 , x 1 -axis partition is fixed, i.e. that X 1 is equipartitioned with x 1 bins, the set D of three-variable data with sample size n can be subdivided into x 1 subsets, and each subset has only two-variable ( X 2 and Y) and n/x 1 samples. The mutual information for each subset can be normalized with log(min{x 2 , y}) and the maximal grid size B(n) for each subset should be (n/x 1 ) a . Therefore, for set D, while the x 1 -axis partition is fixed, the normalization benchmark and B(n) are log (min{x 2 , y}) and (n/x 1 ) a , respectively.
Similarly, for set D where the x 2 -axis partition is fixed, the normalization benchmark and B(n) are log (min{x 1 , y}) and (n/x 2 ) a , respectively.
Approximation algorithm for MIC(X 1 ; X 2 ; Y). Here, we describe the heuristic algorithm, ApproxCharateristicMatrix_3D, for approximating the optimal MIC(X 1 ; X 2 ; Y). It includes four sub-algorithms: EquipartitionX1Axis, SortInIncreasingOrderByX2Value, GetSuperclumpsPartition_3D, and ApproxOptimizeX2Axis. In the dataset D, the first and second columns represent X 1 and X 2 respectively; the last column represents Y. n is the sample size. B defines the maximal grid size. The symbol "⊥ " represents the dataset which is changed from (a 1 , b 1 , z 1 ) to (b 1 , a 1 , z 1 ). c represents the candidate partition point for x-axis. "log" is base-2 logarithm. x fix , representing the corresponding x-axis partition, is fixed (x fix ∈ {x 1 , x 2 }). The symbol " ← " is an assignment operator.

Algorithm ApproxCharacteristicMatrix_3D(D, B, c)
Require: D={(a 1 , b 1 , z 1 ), … , (a n , b n , z n )| z ∈ Y} is a set of ordered 3D vector sorted in increasing order by the First column-values Require: B is an integer greater than 3, and B(n) = (n/x fix ) 0.6 Require: c is greater than 0   EquipartitionX1Axis, SortInIncreasingOrderByX2Value and ApproxOptimizeX2Axis are nearly the same as EquipartitionYAxis, SortInIncreasingOrderByXValue, and ApproxOptimizeXAxis in Reshef et al. 30 , respectively, except that ApproxOptimizeX2Axis uses I(X 1 ; X 2 ; Y) in place of I(X; Y). Here we demonstrate an example of a superclumps partition (see Fig. 3) and list only the pseudo-code of GetSuperclumpsPartition_3D, which is our core algorithm for calculating interactions. The algorithm includes three steps: 1) divide the data into P parts according to Y; 2) fix an equipartition of size x 1 on x 1 -axis; and 3) ensure points in the same superclump to be a unit in the same class, with the rank of x 2 -axis.

Results
Generality of MIC(X 1 ; X 2 ; Y) according to simulation analysis. If X 1 and X 2 are statistically independent of Y, MIC(X 1 ; X 2 ; Y) should be close to 0. For example, let X and Y be two independent, random variables and Y is binarized with a median (sample size n = 200 and 500 replicates), then MIC(X; Y) = 0.1702 ± 0.0292. Similarly, let X 1 , X 2 and Y be three independent, random variables, then MIC(X 1 ; X 2 ; Y) = 0.1562 ± 0.0230. MIC(X 1 ; X 2 ; Y) is reasonable in scope compared with MIC(X; Y), and decreases as the sample size grows (0.0596 ± 0.0012, n = 20000) and finally converges to 0.
If the state of Y is completely determined by the synergy between X 1 and X 2 , then MIC(X 1 ; X 2 ; Y) should be 1, and MIC(X; Y) should be close to 0. As shown in Fig. 4, MIC(X 1 ; X 2 ; Y) = 1, MIC(X 1 ;Y) = 0.0379 and MIC(X 2 ; Y) = 0.0533. If Y is a noiseless function of X 1 and X 2 , and X 1 is fully redundant of X 2 , then MIC(X 1 ; X 2 ; Y) should be − 1.  Table 2. All of the joint effects are close to 1 (0.9672~1.1675). This indicates that the value of MIC(X 1 ; X 2 ; Y) calculated with ApproxCharateristicMatrix_3D is credible, while the value of MIC(X; Y) calculated with ApproxMaxMI 30 has been widely accepted. From all of the above, we deduce that MIC(X 1 ; X 2 ; Y) can capture a wide range of interactions, not limited to specific function types. That is, MIC(X 1 ; X 2 ; Y) has the property of generality.
Informative genes of synergy pairs discovered by MIC(X 1 ; X 2 ; Y). We employ MIC(X 1 ; X 2 ; Y) to detect pair-wise synergic genes in three real-world datasets. The literature resources, sample size, number of genes, and the number samples of each class in each dataset are summarized in Table 3.
Each reference method ranks the top 200 genes (Top200s) for each dataset (Top200s are shown in the Supplementary Material Table S1-S3). The Top200s identified by different reference methods are compared with each other. We can observe significant overlaps between the Top 200s selected by the four reference methods, as shown in Figs 6, 7 and 8. This indicates that a considerable number of similar informative genes can be detected by these reference methods. MIC(X; Y) is an individual-gene-filter method and can only highlight genes that are individually discriminant. Although mRMR, SVM-RFE and TSG are not individual-gene-filter methods; the Top200s selected by them have considerable similarities to the Top200s selected by MIC(X; Y). This indicates that these methods can efficiently discover genes that are individually discriminant, but not specific to the genes have pair-wise synergy effects. Now, we employ MIC(X 1 ; X 2 ; Y) to detect pair-wise synergic genes. MIC(X 1 ; X 2 ; Y) ranks the top 117, 117 and 110 pair-wise genes for Prostate, DLBCL and Lung1, respectively. After removing repeated genes, we obtain three Top200s (Top200s are shown in the Supplementary Material Table S1-S3). We compare our MIC(X 1 ; X 2 ; Y) results with the results from four above mentioned reference selection methods. Clearly, the Top200s selected by MIC(X 1 ; X 2 ; Y) has little overlap with the Top200s selected by the others (Figs 9, 10 and 11). We, therefore, deduce    that MIC(X 1 ; X 2 ; Y) can discover new synergic genes and that the other four reference feature selection methods can only discover genes that are individually discriminant. Synergic gene justification. We initially validate these synergic genes according to their prediction performance with a supported vector classifier (SVC). SVC is available at http://prtools.org/ software/. Fig. 12, illustrates the 10-fold cross-validation prediction accuracies using genes from Top1 to the Top200 selected by MIC(X 1 ; X 2 ; Y), as well as by MIC(X; Y), MRMR, SVM-RFE and TSG. MIC(X 1 ; X 2 ; Y) receives comparable accuracies. This indicates that these synergic genes have sufficient ability to distinguish tissue and cancer types, from the perspective of machine learning.
Do the synergic genes selected by MIC(X 1 ; X 2 ; Y) have any biological relevance to tissue or cancer type? This is particularly relevant considering that even a random set of genes may be a good predictor of cancer sample definition 37 . Therefore, we further validated these synergic genes, using the Prostate dataset as an example, according to GO annotation and OUgene database.
We used the GATHER system 38 (http://gather.genome.duke.edu/) to query GO annotations associated with the Top200s selected by the five methods, as shown in Fig. 13. Although there is little overlap between the genes selected by MIC(X 1 ; X 2 ; Y) and the genes selected by the four reference methods (Figs 9, 10 and 11), synergic genes share the same four heavily marked terms with genes that are individually discriminant (Fig. 13). These four heavily marked GO terms are "cellular macromolecule metabolism, " "nucleobase, nucleoside, nucleotide and nucleic acid metabolism, " "protein metabolism, " and "regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism".
The current version of OUgene, a disease associated, over-expressed and under-expressed gene database, includes 7,238 gene entries, 1,480 diseases entries, and 56,442 PubMed links. We ranked the Top200 synergic genes out of the 12,600 genes in the Prostate dataset using MIC(X 1 ; X 2 ; Y). Of these Top200, 67 tumorigenesis genes were queried against OUgene, and 18 of them have been reported related to prostate cancer [39][40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][55][56] (Table 4) Prostate  12600  102  52  50  74   Lung  12533  181  150  31  76   DLBCL  7129  77  58  19  77   Table 3. Three binary-class gene expression datasets. of microarray-based predictive models 57 . We use the Breast Cancer dataset from MAQC-II to further evaluate the reliability of MIC(X 1 ; X 2 ; Y). This dataset is used to predict the pre-operative treatment response (pCR) and estrogen receptor status (erpos). It was originally grouped into two groups: a training set containing 130 samples (33 positivesand 97 negatives for pCR, 80 positives and 50 negatives for erpos), and a validation set containing 100 Here TP, TN, FP, FN denote true positives, true negatives, false positives and false negatives respectively. Greater accuracy and MCC represent better prediction ability of a model. As shown in Table 5, for Breast erops, the accuracies of individual model and synergic model are 89% and 90%, the MCCs are 0.77 and 0.79, respectively. If we integrate the two models, the accuracy and MCC of combined model are improved into 92% and 0.83, respectively (Better results may be achieved while the redundancies among genes are removed). Similar improved effects are observed in the "Breast pCR" dataset analysis. These results demonstrate that synergic genes selected by MIC(X 1 ; X 2 ; Y) enhance the individually discriminant model for improving prediction performance.

Discussion
We scanned the Top200s genes selected by MIC(X 1 ; X 2 ; Y) on Prostate and Breast cancer datasets, and summarized three representative patterns of pair-wise synergy and their corresponding theoretic distribution (Fig. 14). Pattern I (Fig. 14A,B,F) results from the typical synergy of Fig. 4, Pattern II (Fig. 14C,D,G) results from the function y = x 1 -x 2 (Fig. 5B), and Pattern III (Fig. 14E,H) results from the function y = |x 1 -x 2 | (Fig. 5C). These patterns offer an efficient tool to infer pathogenic mechanism, even to provide a quantitative model, of pair-wise synergy genes. For Pattern I, Gene A and Gene B both could be on-off oncogenes (Fig. 14A) or tumor suppressor genes (Fig. 14B) which inhibit each other. For Pattern II, one could be an oncogene, and the other could be a tumor suppressor gene. Pattern III is similar to Pattern I, but Gene A and Gene B both could be non on-off oncogenes.
The results indicate that although the synergy pattern is diversified in real-world datasets, the MIC(X 1 ; X 2 ; Y) method can explore them well. For the pair-wise synergy ERBB2-PAPSS1, they have been widely reported to correlate with breast cancer [59][60][61][62] , as well as the ENO1-PTP4A2 pair [63][64][65][66] . For the BRF2-LIPIN1 pair, BRF2 is related to tumor angiogenesis 67 . LIPIN1 has been reported to correlate with non-tumorous diseases such as rhabdomyolysis 68 , Type 2 diabetes 69 , metabolic syndrome 70 and acute myoglobinuria 71 . Recently, LIPIN1 was reported to regulate breast adenocarcinoma cell proliferation rate 72 . For the SDC4-LINC01278 pair, SDC4 has been reported to correlate with tumors 73 , but LINC01278 has not. For the RGS9-DIAPH2 pair, neither of them has been reported to correlate with cancer. However, MIC(X 1 ; X 2 ; Y) suggests that LINC01278, RGS9 and DIAPH2 are important informative genes for prostate tumors, and should be given proper attention. "MIC is a great step forward, but there are many more steps to take" 32 . In this article we took such a stepthe extension of two variables to three variables which consider pair-wise interaction. Based on "exploring various binning strategies with different number of bins", Reshef et al. 30 employed a clump (points in the same   Genes  Related tumors   ABCB1, AMACR, CAV1, CCND1, CSF2, DPT, E2F3,  ETV4, GOT2, GREB1, HBP1, HCLS1, HMGA1, PAX2,  SFRP1, SOX9, TRAF4, ZNF143   Prostate   ABCA4, CASC3, CD81, COMP, MAP1LC3B, PPP3CA,  SLN, TFAP2C, TRO  Breast cancer   DSC2, EDG4, FBLN1, GALNT3, KRT10, Table 5. Results of independent test for erpos and pCR of Breast cancer. clump to be a unit) partition technique to reduce computing time and improve estimation accuracy of MI in a two-dimensional space. This technique does not work in a three-dimensional space, because the definition of clump/superclump has changed. We re-defined superclumps as "points in the same superclump to be a unit in the same class, with the rank of x 2 -axis" for considering three variables as a whole, and designed a novel algorithm illustrated in Fig. 3 to overcome this barrier. However, complicated diseases such as cancer are often related to collaborative effects involving interactions of multiple genes. Multivariate analysis, just as Anastassiou group 11,15-17 , Park et al. 19 and Shiraishi et al. 20 did, is going to be the trend. An extension from MIC(X 1 ; X 2 ; Y) to MIC-based multivariate association networks is therefore still desire.