Computational analysis of fused co-expression networks for the identification of candidate cancer gene biomarkers

The complexity of cancer has always been a major obstacle to understanding the origin of this disease. However, by embracing this complexity, we can shed some light on crucial gene associations both across and within specific cancer types. In this study, we develop a general framework to infer relevant gene biomarkers and their gene-to-gene associations using multiple gene co-expression networks for each cancer type. Specifically, we infer computationally and biologically interesting communities of genes from the kidney renal clear cell carcinoma, liver hepatocellular carcinoma, and prostate adenocarcinoma data sets of The Cancer Genome Atlas (TCGA) database. The gene communities are extracted through a data-driven pipeline and then evaluated through both functional analyses and literature findings. Furthermore, we provide a computational validation of their relevance for each cancer type by comparing the normal/cancer classification performance of our identified gene sets with that of other gene signatures, including the typically used differentially expressed genes. The hallmark of this study is its approach based on gene co-expression networks from different similarity measures: by combining multiple gene networks and then fusing the normal and cancer networks for each cancer type, we gain better insight into the overall structure of the cancer-type-specific network.

algorithm that leverages pre-defined non-linear interactions.
Bayesian methods are a major class of methods that have long been applied to network inference [11,12]. The guiding idea of these methods is that the relationship between two genes is expressed by their conditional probability based on their expression profiles. Thus, the inferred networks (Bayesian networks, in this case) have genes as nodes and conditional probabilities as edge weights. Multiple Bayesian networks can be inferred from a gene expression dataset; the best Bayesian network is then the one that maximizes the likelihood estimation of the edge weights [1].
Disadvantages of this approach include the very high computational cost and the poor prediction of loop motifs in the networks [1,6].
The last major class of network inference methods comprises the information-theoretic methods.
Entropy measures the uncertainty of a random variable; for the purpose of network inference, it is used to find the mutual information shared by two variables or genes [1]. The mutual information is obtained by comparing the entropy of the joint distribution with the individual entropies of the two variables; thus, it can be seen as a measure of dependency between variables [13]. Inferred networks have genes as nodes and their mutual information as edge weights. A limitation of these methods is the constraint of having to bin continuous expression data, which remains an open problem for this approach [14]; unlike in our approach, this binning can lead to a loss of information in the inferred networks. A well-known example of this class of algorithms is the Algorithm for the Reconstruction of Accurate Cellular Networks, or ARACNe [15], in which the networks are obtained by combining the computation of the mutual information with two different pruning steps. Other similar methods include those by Butte et al. [16], Faith et al. [17], and Meyer et al. [18].
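To make the relation between entropy and mutual information concrete, the following is a minimal sketch of a histogram-based estimate, I(X;Y) = H(X) + H(Y) - H(X,Y), for two continuous expression profiles. It is only an illustration under stated assumptions: the function names, the choice of 8 equal-width bins, and the synthetic "gene" vectors are hypothetical, and the sketch reproduces neither ARACNe nor the pipeline used in this study. It also shows where the binning step enters, which is the source of the information loss noted above.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a histogram given as an array of counts."""
    p = counts[counts > 0] / counts.sum()  # drop empty bins, normalize
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y).

    The continuous profiles are discretized into equal-width bins
    (bins=8 is an arbitrary illustrative choice, not a recommendation).
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    h_x = entropy(joint.sum(axis=1))   # marginal histogram of x
    h_y = entropy(joint.sum(axis=0))   # marginal histogram of y
    h_xy = entropy(joint.ravel())      # joint histogram
    return h_x + h_y - h_xy

# Synthetic expression profiles (assumed data, for illustration only).
rng = np.random.default_rng(0)
gene_a = rng.normal(size=500)
gene_b = gene_a + 0.3 * rng.normal(size=500)  # depends on gene_a
gene_c = rng.normal(size=500)                 # independent of gene_a

mi_dependent = mutual_information(gene_a, gene_b)
mi_independent = mutual_information(gene_a, gene_c)
```

In this sketch the dependent pair should receive a clearly larger mutual information than the independent pair, although the estimate for the independent pair is generally not exactly zero: the plug-in estimator is biased upward for finite samples, and the result changes with the number of bins.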

Robustness of IC genes using different numbers of LIHC cancer samples
We evaluated the robustness of our approach by sub-sampling an increasing number of LIHC tumor samples. In particular, keeping all 50 normal samples, which are available only in limited number, we randomly extracted 50, 100, 200, and 300 cancer samples and applied to each of these sample sets the same pipeline for the identification of the IC genes that we used for all LIHC samples (370 tumoral and 50 normal samples), as reported in the main manuscript. We noticed that the number of gene communities extracted from the networks in each case is similar to the number extracted in our originally evaluated case with all 370 LIHC tumoral samples, i.e., 2 or 3 communities. Moreover, we compared the IC genes extracted from the networks in the different cases with those extracted from the original one. For each case, Supplementary Table 1 3 show that the number of edges remaining after the permutation tests is the same with 10 to 100 shuffles; thus, to speed up the process, we used 10 shuffles. Supplementary Tables 2,3 also show the thresholds used in the permutation tests with the different numbers of shuffles. In particular, the low and high thresholds indicate the lower and upper limits of the weights of the removed network edges: if an edge has a Pearson's correlation weight lower than the high threshold and higher than the low threshold, it is removed from the network. Supplementary