Discovery of Bladder Cancer-related Genes Using Integrative Heterogeneous Network Modeling of Multi-omics Data

In human health, a fundamental challenge is the identification of disease-related genes. Bladder cancer (BC) is a worldwide malignant tumor, which has resulted in 170,000 deaths in 2010 up from 114,000 in 1990. Moreover, with the emergence of multi-omics data, more comprehensive analysis of human diseases become possible. In this study, we propose a multi-step approach for the identification of BC-related genes by using integrative Heterogeneous Network Modeling of Multi-Omics data (iHNMMO). The heterogeneous network model properly and comprehensively reflects the multiple kinds of relationships between genes in the multi-omics data of BC, including general relationships, unique relationships under BC condition, correlational relationships within each omics and regulatory relationships between different omics. Besides, a network-based propagation algorithm with resistance is utilized to quantize the relationships between genes and BC precisely. The results of comprehensive performance evaluation suggest that iHNMMO significantly outperforms other approaches. Moreover, further analysis suggests that the top ranked genes may be functionally implicated in BC, which also confirms the superiority of iHNMMO. In summary, this study shows that disease-related genes can be better identified through reasonable integration of multi-omics data.


Heterogeneous network model for BC-related gene identification
As a widely used measurement of correlational relationships 1 , Pearson correlation coefficient (Pcc) is employed to reflect the correlational relationships between the seeds in the omics of gene expression, CNV, methylation and miRNA expression, respectively. For a seed in a certain omics, we calculate the Pcc values and corresponding t-test p-values between this seed and all other genes listed in this omics.
Only gene pair with p-value no larger than 0.01 and absolute Pcc value ranked in the top 0.5 percent 1 is kept as the correlational relationship of this seed. Then we extract the correlational relationships of the seeds in the omics of miRNA expression similarly, except that the threshold of Pcc is set to top 5 percent since the number of miRNAs is small. The neighbors of these seeds in the correlational relationships are regarded as candidates and the correlational relationships between them are also explored in the same way as the seeds. Totally we get 9089 genes and 426 miRNAs in the omics of gene expression and miRNA expression, respectively. For all 9,089 genes in the omics of gene expression, there are 741 and 149 genes lacking the information of CNV and methylation, respectively.
Therefore, these genes are abandoned and we regard the remaining 8,348 and 8,940 genes as the final elements in the omics of CNV and methylation, which include 9 and 27 seeds, respectively. The correlational relationships of these genes in each omics are extracted with the same method as

Details of the performance comparison with existing approaches
To perform a comprehensive comparison of the proposed method with existing approaches, we implement four network-based approaches for disease-related gene identification: the original random walk algorithm 2 , PRINCE 3 , PageRank algorithm 4 and HNP 5 . Here we download totally 4,850,628 protein-protein interactions (PPIs) from STRING database 6 and transform these PPIs into matrix WPPI.
The matrix is also normalized as follows: Then the weighted PPI network is constructed. We further implement the first three approaches on this PPI network and their performance are evaluated by treating all BC-related genes in the omics of gene expression, CNV and methylation as true positives. HNP is implemented and evaluated in the same way as we described in 5 .

Supplement of the performance comparison with existing approaches
To show the advantage of the modified propagation algorithm, we also implement the original random walk algorithm on the heterogeneous network model in this study and compare its performance with iHNMMO by ROC curves, precision-recall curves and rank cutoff curves with the threshold varying from 200 to 10000. The results are depicted in Figure S1 and the better performance of iHNMMO indicates the superiority of the heterogeneous network model, which is based on the integration of multi-omics data. Besides, we also implement the modified propagation algorithm on the PPI network and the performance comparison with iHNMMO is shown in Figure S2. We can see that the performance is obviously developed by using the modified propagation algorithm.

Generalization of iHNMMO to GBM
We take GBM as an example to show the generalization ability of iHNMMO to other diseases. Here we utilize the known GBM-related genes and multi-omics data of GBM that processed in our previous study 7 . As shown in Figure S3, the AUC value of our method reaches 94.2% and the Sn values at three levels of Sp are all high. Especially, when Sp is 99.0%, our method can still achieve a high Sn value.
These results of performance evaluation indicate the good generalization ability of our method to other diseases.

Analysis of top ranked miRNAs
We extract target genes of the top 10 ranked miRNAs from the heterogeneous model (Supplementary   Table S1). Specifically, one third of the target genes of the first ranked miRNA mir-138-2 are seed genes, which means that more information can be directly propagated from seed genes to mir-138-2. To further explore how iHNMMO helps detect mir-138-2, the intermediate results of the algorithm are shown in Figure S4. In detail, Figure S4A shows the prior information of mir-138-2 and another 10 randomly chosen miRNAs that are not ranked in top 10, i.e., the numbers of seeds connecting to them in the heterogeneous network. It is obvious that mir-138-2 has more seed neighbors, which means more information can be directly propagated from seeds to mir-138-2. The weights of the associations between mir-138-2 and its neighbors are exhibited in Figure S4B, in which some associations are relatively larger such as HIF1, TERT and ZEB2. In addition, Figure S4C indicates that mir-138-2 has a significantly larger number of neighbors in the heterogeneous network. Based on the prior information in Figure S4A and S4C, the learned local models of these miRNAs are shown in Figure S4D, which represent their prior information scores. Using these models, iHNMMO successfully retrieves mir-138-2 based on the transition probabilities, i.e., the edge weights in the heterogeneous network. In the same manner, other novel BC-related genes/miRNAs can be discovered. The interactions between mir-138-2 and its target genes are shown in Figure S5. miRNAs  Target genes   mir-138- Table S1. The target genes of top 10 ranked miRNAs. The bold gene names represent seeds.