Network-aided Bi-Clustering for discovering cancer subtypes

Bi-clustering is a widely used data mining technique for analyzing gene expression data. It simultaneously groups genes and samples of an input gene expression data matrix to discover bi-clusters that relevant samples exhibit similar gene expression profiles over a subset of genes. The discovered bi-clusters bring insights for categorization of cancer subtypes, gene treatments and others. Most existing bi-clustering approaches can only enumerate bi-clusters with constant values. Gene interaction networks can help to understand the pattern of cancer subtypes, but they are rarely integrated with gene expression data for exploring cancer subtypes. In this paper, we propose a novel method called Network-aided Bi-Clustering (NetBC). NetBC assigns weights to genes based on the structure of gene interaction network, and it iteratively optimizes sum-squared residue to obtain the row and column indicative matrices of bi-clusters by matrix factorization. NetBC can not only efficiently discover bi-clusters with constant values, but also bi-clusters with coherent trends. Empirical study on large-scale cancer gene expression datasets demonstrates that NetBC can more accurately discover cancer subtypes than other related algorithms.


Results and Discussion
To comparatively evaluate the performance of the proposed NetBC, we compared NetBC with NCIS 38 , MSSRCC 29 , BCC 22 , Cheng and Church (CC) 12 , BiMax 14 , Biforce 24 , xMOTIFs 26 , FABIA 27 , and Plaid 28 . Since NCIS, MSSRCC and BCC aim to extract non-overlapping bi-clusters with checkerboard structure, we compare NetBC with NCIS, MSSRCC and BCC on separating samples on two large scale cancer gene expression datasets from TCGA 34 and several cancer gene expression datasets with known subtypes. CC, BiMax, Biforce, xMOTIFs, FABIA, and Plaid aim to extract arbitrarily positioned overlapping bi-clusters, we compare NetBC with CC, BiMax, Biforce, xMO-TIFs, FABIA, and Plaid by relevance and recovery on synthetic datasets with implanted bi-clusters.
Determining the number of Gene clusters (k) and sample clusters (d). Determining the number of clusters is a challenge for most clustering methods. Here we adopt a widely used method to find the number of gene clusters k (or sample clusters d) that best fits the gene expression data matrix 38,40 . If the number of gene clusters k (or sample clusters d) is suitable, we would expect that the gene separation (or sample separation) would change very little in different runs. For each run, we define an adjacency matrix of genes M g with size m×m and an adjacency matrix of samples M s with size n×n, Results on TCGA cancer gene expression data. To comparatively evaluate the performance of the proposed NetBC, we compare NetBC with other related and representative bi-clustering methods on separating samples, including NCIS 38 , MSSRCC 29 , and BCC 22 on two large scale cancer gene expression data from TCGA 34 . Since all the selected comparing aim to extract non-overlapping bi-clusters with checkerboard structure, we can use the dependence test between different clinical features and the discovered subtypes to evaluate their performance.
The lung cancer gene expression data contains 1298 patients (samples) with gene expression profiles of 20530 genes. The cancer subtypes of these samples are unknown. For comparison, we adopts the clinical features to study the performance of NetBC and these comparing methods. The clinical features are survival analysis, percent lymphocyte, eml4 alk translocation performed, pathologic stage, percent tumor cells stage, percent tumor nuclei. We choose relapse-free survival (RFS) for survival analysis. RFS means the length of time after primary treatment to a cancer patient that survives without any signs or symptoms of that cancer. RFS is one way to measure how well the treatment works. Percent lymphocyte means different percentages of infiltration of lymphocyte. Eml4 alk translocation clinical feature means whether Eml4 gene and alk gene are fused, the fusion of these two genes can lead to lung cancer. Pathologic stage represents different stages of the cancer pathologic. Percent tumor cell stage represents the percentages of tumor cells in the total cells. Percent tumor nuclei stage represents the percentages of tumor nuclei in a malignant neoplasm specimen. After removing samples with missing data of clinical features, 486 samples with 20530 genes are remained in the lung cancer dataset.
The breast cancer gene expression dataset contains 1241 patients with 17814 genes. Removing the samples that lack of clinical data, we finally retain 416 samples with 17372 genes. The clinical features to evaluate these bi-clustering methods contain AJCC Stage, Converted stage, Node coded, Tumor coded, Percent normal cells, and Percent tumor nuclei. We also make survival analysis for breast cancer, but no method has a p-value smaller than 0.05, one possible reason is the insufficient clinical data. The AJCC stage represents the different stages of the cancer based on the present lymph nodes. Converted stage represents different stages of the cancer. Node coded means different Node status of patients. Tumor coded means different types of tumor. Percent normal cells represents different percentages of normal cells in the malignant neoplasm specimen. Percent tumor nuclei represents different percentages of the tumor nuclei in the malignant neoplasm specimen.
The gene interaction network used for experiments are combined with networks collected from BioGRID 41 (version: 3.4.138, access date: July 1, 2016), HPRD 42 (version: 9, access date: February 15, 2017) and STRING 43 (version: 10.0, access date: February 15, 2017). To fairly compare NetBC with NCIS, the collected gene interaction network for both NetBC and NCIS is directed. Since the TCGA cancer gene expression data is too large, we use k-means to initialize the indicative matrix R and C of NetBC, NCIS and MSSRCC. The number of iterations for these methods is set as 300. We set k = 7 and d = 6 for the TCGA lung cancer gene expression data and k = 11 and d = 6 for the TCGA breast cancer gene expression data based on the cophenetic correlation coefficient over a range of combinations of k and d (k from 1 to 12, and d from 4 to 6).
θ is a scalar parameter to balance the contribution of gene interaction network and the deviation of gene expression profiles among samples when assigning genes weights. The significance levels of the difference between different clinical features and subtypes discovered by NetBC and NCIS with a range over different θ are given in Fig. 1 (Lung) and Fig. 2 (Breast). The p-value is adjusted by Benjamini & Hochberg method 44 . From Fig. 1 and Fig. 2, we can see that the input value of θ affects the experimental results of NetBC and NCIS. θ ∈ (0,1) means assigning weights to genes according to both the variation of gene expression levels and gene interaction network. We can find that NetBC and NCIS with θ ∈ (0.5, 1) show better performance than their cousins θ ∈ (0,0.5) in most  cases. This observation demonstrates that assigning weights to genes according to both the variation of gene expression levels and gene interaction network can improve the performance of bi-clustering than using gene expression profiles along. NetBC also outperforms NCIS in majority cases.
To fairly compare NetBC, NCIS with BCC and MSSRCC, θ is setting as 0 for both NetBC and NCIS. Parameters of BCC are the number of gene clusters (or samples clusters), and the initialization of Gaussian distribution (μ, σ). μ and σ are fixed as random values provided in their demo codes, since there is no prior knowledge about the distribution of samples. The significance levels of the difference between subtypes discovered by these bi-clustering methods and the clinical features are given in Table 1 (Lung) and Table 2 (Breast). The p-value is adjusted by Benjamini & Hochberg method 44 . In these tables, a smaller p value indicates better results. From Table 1 and Table 2, we can see that NetBC has smaller p-value than other methods on most clinical features. Given the p-value threshold 0.05, NetBC successfully divides cancer patients into subtypes according to clinical features. BCC can not separate the cancer subtypes as well as others. That is principally because it assumes that patients are generated by a finite mixture of underlying probability distributions. Since the cancer gene expression data has limit samples with a large amount of genes, it is difficult to well estimate these underlying distributions. Although both NetBC and NCIS assign weights to genes as the importance indicator of genes, NetBC performs much better than NCIS on most clinical features. The main difference between them is that the objective function of NetBC is to minimize the sum-squared residue, while NCIS is to minimize the sum-squared distance between entries and centroids of bi-clusters. NCIS can only discover bi-cluster with constant values, NetBC can not only discover bi-clusters with constant values but also bi-clusters with coherent trend values. We can also observe that NetBC outperforms MSSRCC. MSSRCC utilizes a similar objective function as NetBC to minimize the sum-squared residue, the main difference between them is that NetBC assigns weights to genes, but MSSRCC does not. NetBC uses matrix factorization to get row and column indicative matrices and MSSRCC iteratively obtains row and column clusters by a k-means like algorithm on row and column dimensions. This observation shows that assigning weights to genes can improve the performance of bi-clustering.
Results on cancer gene expression data with known subtypes. We also apply NetBC, NCIS, MSSRCC and BCC in clustering cancer gene expression datasets with known subtypes. Table 3    Since the ground truth sample clusters of these datasets are known, we adopt two widely used metrics: rand index (RI) 45 and F1-measure 46  . np 1 represents the number of pairs of samples that are both in the same clusters of  and also both in the same clusters of ′; np 2 represents the number of pairs of samples that are in the same clusters of  but in different clusters of ′; np 3 represents the number of pairs of samples that are in different clusters of  but in the same clusters of ′; np 4 represents the number of pairs of samples that are in different clusters of  and also in different clusters of ′  . RI is defined as follows: F1-measure is the harmonic mean of precision and recall and is defined as follows: where Pr and Re are precision and recall, defined as follows: We compare the performance of NetBC with NCIS, MSSRCC and BCC. We perform the experiment by randomly initializing the row and column indicative matrices R and C of NetBC, NCIS and MSSRCC. The number of iterations for NetBC and NCIS is set as 300, and θ = 0 is used for both NetBC and NCIS. The parameter setting of BCC is similar to the experiment on TCGA cancer gene expression data. We perform 10 independent experiments and report the average results for each method on a particular dataset. We determine the number of gene clusters k based on the cophenetic correlation coefficient and set d as the ground truth number of cancer subtypes. Figure 3 provide the experimental results of these comparing methods with respect to RI and F1-measure. We can observe that NetBC performs better than other methods under both RI and F1-measure. This observation indicates that NetBC is an effective bi-clustering approach to identify cancer subtypes.  Table 3. Details of 8 cancer gene expression datasets. #Subtypes is the number of cancer subtypes (or clusters), #samples is the number of samples, and #genes is the number of genes. We also compare NetBC and NCIS when incorporating gene interaction network on real cancer gene expression data with known subtypes. θ controls the balance of referring to gene interaction network and the variation of gene expression profiles when assigning weights to genes. We analyze the influence of θ by varying its value from 0 to 1 with stepsize 0.1. We perform the experiment by randomly initializing the indicative matrices R and C of NetBC, NCIS, and fix the number of iterations for both NetBC and NCIS as 300. Figures 4 and 5 reveal the RI and F1-measure of NetBC and NCIS on eight cancer gene expression datasets. The reported experimental results are the average of ten independent runs for each particular dataset under each input value of θ. We can see that NetBC consistently outperforms NCIS over a range of θ values in most cases. This experimental results again demonstrate that NetBC improves the performance of cancer subtypes discovery. We can observe that NetBC (0.1 ≤ θ ≤ 0.9) generally has better performance than NetBC when θ = 0 (or θ = 1). From these observations, we can conclude that assigning weights to genes by referring to both gene expression profiles and gene interaction network shows advantage than assigning weights to genes by using gene interaction network (or by  gene expression profiles) alone. However, there is no clear pattern to choose the most suitable θ. The possible reason is that θ is not only related to the deviation of expression profiles, but also the quality of gene interaction network and gene expression data. Adaptively choosing a suitable θ is an important future pursue. In summary, these empirical study shows that integrating gene interaction networks with gene expression profiles can generally boost the performance of bi-clustering in discovering cancer subtypes.
Results on synthetic datasets. Let  denote the set of implanted bi-clusters and ′ denote the set of bi-clusters discovered by a bi-clustering method. We can measure the similarity between the implanted bi-clusters and discovered bi-clusters by using the average bi-cluster relevance score 14 as follows: where each ′  (or ) contains a gene set ′ (or ) and a sample set ′  (or ). The relevance score reflects to what extent the discovered bi-clusters represent implanted bi-clusters in the gene dimension. Similarly, average bi-cluster recovery is defined as  ′ S ( , ) G , which evaluates how well implanted bi-clusters are covered by a bi-clustering method.
Here, we compare NetBC with several bi-clustering methods on synthetic datasets with known bi-clusters. The codes of these comparing methods are online available, including Cheng and Church (CC) 12 , BiMax 14 , FABIA 27 , Plaid 28 , xMOTIFs 26 and Biforce 24 .
Since NetBC and NCIS aim to discover non-overlapping bi-clusters, the synthetic datasets are generated without overlapping bi-clusters in the same way as that done by Prelic et al. 14 and Wang et al. 47 Particularly, these synthetic datasets are generated with five different types of bi-clusters, each synthetic dataset contains one type of bi-clusters. The five types of bi-clusters are (i) constant bi-cluster; (ii) row-constant bi-cluster; (iii) column-constant bi-cluster; (iv) additive coherent bi-cluster (or shift bi-cluster); (v) multiplicative coherent bi-cluster (or scale bi-cluster). The background matrices of these synthetic datasets are with entries randomly chosen from Gaussian distribution N(0, 1). Each bi-cluster is generated by choosing a submatrix from the background matrix with the entries modified according to one of the five rules: (i) constant bi-cluster is generated by randomly selecting an entry of a submatrix and replacing other entry values with this entry value; (ii) row-constant bi-cluster is generated by randomly selecting a base column within a selected submatrix and copying it to other columns in this submatrix; (iii) column-constant bi-cluster is generated by randomly selecting a base row within a selected submatrix and copying it to other rows in this submatrix; (iv) additive coherent bi-cluster is generated by randomly selecting a base row within a selected submatrix and replacing other rows in this submatrix by shifting the base rows; (v) multiplicative coherent bi-cluster is generated by randomly selecting a base row within a selected submatrix and replacing other rows in this submatrix by scaling the base rows. Then, we also add noise to these synthetic datasets to study the robustness of these comparing methods. Noise is simulated by adding random value from normal distribution to each entry of the synthetic gene expression data matrix. The noise level is increased by enlarging the standard deviation σ from 0.05 to 0.25 with stepsize 0.05.
It is crucial to select suitable parameters for bi-clustering tools. We follow the solution used by Eren et al. 48 and Sun et al. 24,49 to select major parameters of these comparing bi-clustering tools over a range of values when they perform the best in the specified range. We set the number of the bi-clusters as the ground truth number of implanted bi-clusters for all bi-clustering approaches. δ and α are critical to the accuracy and runtime of CC. δ controls the maximum mean squared-residue in a bi-cluster. By default δ = 1, we run CC with different δ between 0 and 1 with stepsize 0.01. We set δ = 0.03, since CC performs the best when δ = 0.03. α controls the tradeoff between running time and accuracy. By default α=1, a larger α produces higher accuracy but asks for more runtime, we set α = 1.5 since the synthetic datasets are not too large. The number of max iterations for each layer in Plaid affects its results and its default value is 20, and we set it as 50 since Plaid performs best with the number of max iterations fixed as 50. We also select the row.release and column.release of Plaid in [0.5, 1] and set them as 0.6. BiMax and xMOTIFs are dependent on how the data are discretized. We performs xMOTIFs on synthetic data discretized with number of levels from 0 and 50 with stepsize 1. xMOTIFs performs best on synthetic data with the number of levels as 5. BiMax requires binary input data, we perform BiMax on synthetic data that are binary discretized with different thresholds from 0.05 to 1 with stepsize 0.05. BiMax performs best with threshold 0.1. We run Biforce with parameter t 0 (edge threshold) from 0.05 to 1 with setpsize 0.05. Biforce performs best with t 0 = 0.1. NetBC and NCIS depend on the initialization of indicative matrices R and C, we randomly initialize them for both NetBC and NCIS for all experiments on synthetic datasets. The number of iterations for NetBC and NCIS is set as 300. θ is set as 0, since the gene interaction networks of these synthetic datasets are not available.
From Figs 6 and 7, we can see that, even without incorporating gene interaction networks, NetBC still achieves better performance than most bi-clustering methods on the constant, row constant, column constant, coherent additive bi-clusters. CC achieves relatively better performance on constant, row constant and column constant bi-clusters. Plaid achieves better performance on coherent additive bi-clusters than other methods. xMOTIFs is skilled in extracting additive bi-clusters. Bimax is good at extracting column constant bi-clusters. On the coherent multiplicative bi-clusters, Biforce performs better than NetBC. Furthermore, we can see that NetBC outperforms NCIS on all these five types of bi-clusters and NetBC is generally robust to noise.
To further assess the performance of NetBC and NCIS, we compared their performance on separating samples in Leukemia cancer gene expression data with simulated noisy genes. Leukemia contains 38 samples and 4412 genes. The gene interaction network (obtained from BioGrid, HPRD and STRING) of Leukemia contains 162987 edges. We assign weights to genes according to the variation of gene expression profiles across samples and gene interaction network. Next, we choose genes with lowest weight as 'uninformative' genes and tag them as noisy genes. We permute the expression levels of these noisy genes with random numerics between the maximum and minimum value of the Leukemia gene expression data matrix. Figure 8 reports the results of NetBC and NCIS under different number of noisy genes.
From this Figure, we can see that NetBC almost always outperforms NCIS, but the performance of NetBC and NCIS with θ = 0, and of NCIS with θ = 0.85 downgrades as the number of noisy genes increase. The reason is that noisy genes not only mislead the variance of gene expression profiles, but also successively mislead the weights assigned to genes. We can find that NetBC and NCIS with θ = 0.85 generally have better performance than with θ = 0, and NetBC with θ = 0.85 is much less affected by noisy genes than NCIS with θ = 0.85. These experimental results demonstrate that incorporating gene interaction network to assign weight to genes can improve the  performance of bi-clustering than assigning weights to genes based on the variation of gene expression profiles alone, and also show that NetBC can more effectively incorporating gene interaction network than NCIS.
We also explore the performance of NetBC and NCIS over random perturbations in the gene interaction network. Figure 9 reports the experimental results of NetBC and NCIS with θ = 0.85 on Leukemia dataset with randomly added, deleted and rewired edges between genes in the network. Here, NetBC and NCIS with θ = 0 are not considered, since θ = 0 means the network is excluded. By comparing the results in Fig. 8 with those in Fig. 9, we can see that, even with some fluctuations both NetBC and NCIS are relatively robust to perturbations of network. NetBC sometimes even obtains better results as more edges added or deleted. This observation is accountable, since the edges in the original network are not complete but with some noisy (missing) edges, and some randomly added (deleted or perturbed) edges just coincide with missing (or noisy) edges. We believe the performance of NetBC and NCIS can be further improved with the improved quality of gene interaction network.

Runtime analysis.
Bi-clustering is often involved with long runtime costs, especially when the gene expression data matrix is large. Therefore, it is highly challenging to develop an efficient bi-clustering algorithm. NCIS also works on the same size matrices, it has similar time complexity as NetBC.
Here, we record the runtime change of NetBC and of several widely used bi-clustering methods as the number of genes increases from 1000 to 5000 and fixed the number samples as 200, and report the recorded results in Fig. 10. All the comparing methods are implemented on a desktop computer (Windows 7, 8GB RAM, Intel(R) Core(TM) i5-4590). The parameters of these methods are sets as default values. From Fig. 10, we can see that Fabia runs faster than other comparing methods. The runtime cost of NetBC is slightly larger than MSSRCC, since NetBC additionally utilizes gene interaction networks to assign weights to genes. NetBC is slightly faster than NCIS and they both increase relative slowly with the increase number of genes. The runtime cost of BCC is the largest and increases sharply, since BCC assumes the genes (or samples) of gene expression data are generated by a finite mixture of underlying probability distributions and it takes much time to estimate these distributions. Similarly, Plaid assumes the values of bi-clusters can be explained by additive model, it also needs relatively larger runtime cost than other comparing methods. The runtime cost of Biforce is larger than Plaid. In summary, NetBC can hold comparable efficiency with state-of-the-art bi-clustering methods and it generally obtains better performance than related methods.

Let
 ∈ × G m n represent the gene expression data for m genes and n samples, the (i, j)-th entry of G is given by g ij . A bi-cluster is a subset of rows that exhibit similar behavior across a subset of columns, and it can be described as a sub-matrix of G. Let  represent a set of row indices for a row cluster and  represent a set of column indices for a column cluster. A sub-matrix IJ G of G determined by  and  encodes a bi-cluster.
Objective. The target of bi-clustering is to discover multiple bi-clusters from a gene expression data matrix, all the entries of a bi-cluster exhibit similar numerical values as much as possible. This target can be achieved by reordering the rows and columns of the matrix to group similar rows and similar columns together 13 . Traditional clustering algorithms group the similar genes (or samples) together by assigning genes (or samples) to their nearest cluster centroids. Ideally, entries are with constant value in the entire bi-cluster. For real world gene expression data, constant bi-clusters are usually distorted by noises. A common criterion to evaluate a bi-cluster is the sum of squared differences between each entry of a bi-cluster and the mean of that bi-cluster 17 . The squared difference between an entry g ij and the mean of the corresponding bi-cluster is computed as below: is the mean of all entries in the bi-cluster,  is the cardinality of  . Bi-clustering tries to find all combinations of  and  , each combination has the minimum ∑ ∈ ∈ h i j ij I J . In fact, the criterion to evaluate a bi-clustering method depends on the types of bi-clusters that can be identified by that method 15     To discover bi-clusters of these patterns, we use the squared-residue to quantify the difference between an entry g ij of a bi-cluster I J ( , ) and the row mean, column mean and bi-cluster mean of that bi-cluster. Here we adopt a formula suggested by Cheng et al. 12 as follows: is the bi-cluster mean. The smaller the squared residue h ij , the larger the coherence is.
Suppose G is partitioned into k row (gene) clusters and d column (sample) clusters. We use  . Since there are k row clusters, we use ′ k  to represent the row index set of ′ k -th row cluster. Then, we can obtain: . To explore multiple bi-clusters, NetBC minimizes the sum-squared residue over all genes and samples. Based on the above analysis, we can measure the overall sum-squared residue of multiple bi-clusters discovered by NetBC using G, R and C as follow: is the column mean, and is the bi-cluster mean of (i, j) entry of G in its corresponding bi-cluster, respectively. The above objective function of NetBC can also be reformulated as: Assigning weights to genes. Gene expression data usually has a large amount of genes but with a few samples. The similarity between samples turns to be isometric as the gene dimensionality increasing 30 . Usually, genes are selected based on the absolute deviation value of the gene expression profile among samples to reduce the gene dimension. Feature selection methods, like principle component analysis (PCA) 50 , are applied to select genes to reduce the gene dimension. But selecting a subset of genes may result in information loss on clustering gene expression data especially when the biological sense is usually not straight. For this reason, assigning weights to genes to indicate the importance of the gene is more reasonable 38 . NetBC assigns weights to genes by using both the gene expression profiles and gene interaction network. Genes, who regulate more genes in the gene Figure 11. Example of the five types of bi-clusters.
interaction network and show larger expression variations across samples than other genes, are viewed more important to identify cancer subtypes, and will be given larger weights. NetBC uses the GeneRank algorithm 51 to assign weights to genes. Suppose interactions between m genes are encoded by  ∈ × P m m . If there is directed (or undirected) an interaction from gene i to gene j, = P 1 ij (or = = P P 1 ij ji ); otherwise, = P 0 ij . The importance of a gene depends on the quantity and importance of its interacting partners, a gene that interacts with more genes and shows larger variation of expression profiles across samples should be assigned with a larger weight. Based on this assumption, the weight is set as follows: 1 means the total number of genes that have interactions with gene j. θ balances the weight from gene expression profiles and gene interaction network, it also enables isolated genes to be accessed and it is usually set as 0.85. Eq. (7) is guaranteed to convergence when 0 ≤ θ ≤ 1 38 . The final optimized weights for all genes in Eq. (7) can be computed as follows: m m is a diagonal matrix with the j-th diagonal element equal to deg j . When θ < < 0 1 , the weights of the genes are assigned according to Eq. (8). When θ = 0, the weights of genes are completely dependent on the deviation of gene expression profiles. When θ = 1, the weights of genes are assigned only based on the interacting partners of genes in the interaction network.
Optimization. After assigned weights to genes, the objective function of NetBC can be rewritten as: be the Lagrangian multipliers for ⩾ R R ( 0), then the Lagrangian function for R is:  Let  Λ ∈ × n d 2 be the Lagrangian multiplier for C, then the Lagrangian function for C is as below: Similarly, we can obtain the optimal formula of C.