Prediction of potential disease-associated microRNAs by composite network based inference

MicroRNAs (miRNAs) play a significant role in multiple biological processes and are closely associated with the development of many complex diseases. In biology, medicine and bioinformatics, the rapid prediction of potential miRNA-disease associations (MDAs) from a variety of heterogeneous biological datasets is an important subject. We therefore proposed the model of Composite Network based inference for MiRNA-Disease Association prediction (CNMDA), which applies random walk to a multi-level composite network constructed from heterogeneous datasets of diseases, long noncoding RNAs (lncRNAs) and miRNAs. CNMDA achieved an AUC of 0.8547 in leave-one-out cross validation and an AUC of 0.8533 +/− 0.0009 in 5-fold cross validation. In addition, we employed CNMDA to infer novel miRNAs for kidney neoplasms, breast neoplasms and lung neoplasms on the basis of HMDD v2.0. We also applied the approach to lung neoplasms on the basis of HMDD v1.0 and to breast neoplasms with all known related miRNAs removed. The results suggest that CNMDA is an applicable tool for the prediction of potential MDAs.

miRNAs functional similarity (MFS), Xuan et al. 22 proposed a prediction model that implements random walk on a constructed miRNA functional similarity network, in which larger transition weights are assigned to marked nodes. Finally, association probability scores for each disease-miRNA pair are obtained and ranked. Chen et al. 17 further built a computational model in which the k-nearest neighbors (KNNs) of miRNAs and of diseases are searched separately and then ranked by a support vector machine; all potential MDAs are finally obtained by weighted voting. Under the framework of semi-supervised learning, a novel model 23 was presented for MDA prediction by combining the optimal solutions in the miRNA space and the disease space. Recently, Chen et al. 24 proposed another prediction model that calculates a within-score and a between-score for both miRNAs and diseases, which are then combined to obtain the final MDA scores. Researchers have also put forward other computational approaches that use relevant genes or proteins as a bridge to predict novel MDAs. For example, Jiang et al. 25 presented a prediction model that applies a hypergeometric distribution to a constructed integrated network. Mork et al. 26 connected miRNAs to diseases through proteins acting as a bridge and employed a scoring scheme, which can greatly increase the model's efficiency. Furthermore, Shi et al. 27 implemented random walk on a constructed protein similarity network to identify MDAs.
By combining the known MDA network and the MFS network, Chen et al. 28 developed a new computational method based on random walk with restart (RWR). It is worth noting that RWR is a very effective model for MDA prediction. In this work, we adopted RWR and presented a novel model named Composite Network based inference for MiRNA-Disease Association prediction (CNMDA). The model operates on a multi-level network built by combining Gaussian interaction profile kernel similarity (GIPKS) for lncRNAs, integrated similarity for miRNAs (ISM) and diseases (ISD), known MDAs, lncRNA-disease associations (LDAs) and miRNA-lncRNA interactions (MLIs). Leave-one-out cross validation (LOOCV) and 5-fold cross validation were adopted to assess CNMDA's effectiveness, yielding AUCs of 0.8547 and 0.8533 +/− 0.0009, respectively. As case studies, CNMDA was applied to kidney neoplasms (KN), breast neoplasms (BN) and lung neoplasms (LN) to infer their associated miRNAs based on HMDD v2.0 29. Also based on HMDD v2.0, we inferred novel miRNAs for BN after hiding its known associated miRNAs. Finally, we carried out a case study based on HMDD v1.0 30 to infer LN-related miRNAs. These results validate the effectiveness of CNMDA for MDA prediction.

Results
Cross validation. In this paper, we carried out LOOCV and 5-fold cross validation based on HMDD v2.0 29 to assess CNMDA's prediction accuracy, and compared CNMDA with four other classical computational models: RLSMDA 23, HDMP 21, WBSMDA 24 and RKNNMDA 17 (see Fig. 1). In LOOCV, each of the 5430 known MDAs is taken in turn as the test sample, the remaining 5429 known MDAs serve as training samples, and the 184155 unlabeled miRNA-disease pairs serve as candidate samples. For each test sample, we obtained association scores for all miRNA-disease pairs by running CNMDA and then ranked the test sample among the candidate samples according to these scores. The model is considered to have made a correct prediction if the test sample ranks higher than the set threshold.
Case studies. Three different kinds of case studies were also implemented to assess CNMDA's performance. In the first case study, CNMDA was employed to predict KN-related miRNAs based on HMDD v2.0, and two other reliable MDA databases (dbDEMC and miR2Disease) were used to validate the top 50 predictions. In the second case study, we inferred BN-associated miRNAs both with the known associations in place and after removing all known BN-associated miRNAs from HMDD v2.0. In the third kind of case study, CNMDA was used to predict LN-associated miRNAs based on the associations in HMDD v1.0 and v2.0, respectively.
KN is a disease caused by cellular metabolic disorders 31. If kidney tumors are detected and treated early, while still localized in the kidney, patients have a good disease-specific survival rate. Otherwise, patients who present with terminal disease have only an 18% two-year survival rate 32. According to recent studies, about 250,000 renal tumor patients are newly diagnosed each year, and the morbidity and mortality of KN continue to increase 33.
Many miRNAs related to KN have been identified in a large number of biological experiments. For example, in renal cell carcinoma (RCC), upregulation of miR-21 is associated with a lower survival rate 34. By targeting MMP-9 in RCC, miRNA-133b can suppress cell proliferation, migration and invasion 35. We implemented CNMDA to predict potential KN-related miRNAs and found that 8 of the top 10 and 37 of the top 50 predicted miRNAs were verified (see Supplementary Table 1). We also provide the complete list of scores for potential MDAs on the basis of HMDD v2.0 (see Supplementary Table 2).
BN is a major chronic disease affecting adult women, and detected breast neoplasms can be removed surgically 36. However, if BN goes undetected, it may develop into a life-threatening clinical recurrence in the next 5, 10, 15 or more years 37. Recent experimental studies have provided evidence that miRNA-195 may serve as a latent biomarker for early BN detection 38. Finding novel biomarkers for BN is therefore significant for the treatment of the disease. In the second case study, we employed CNMDA to predict potential BN-related miRNAs and found that 5 of the top 10 and 31 of the top 50 predicted miRNAs were verified (see Supplementary Table 3). We also implemented CNMDA for BN after hiding all of its confirmed associations in HMDD v2.0. That is, we removed all known BN-associated miRNAs and predicted potential BN-associated miRNAs based only on the other known associations and the corresponding similarity information. As a result, 9 of the top 10 and 41 of the top 50 predicted miRNAs were confirmed; Supplementary Table 4 presents these top 50 predictions together with their verification evidence.
LN is the leading cause of cancer deaths worldwide 39. Genetic and epigenetic damage caused by tobacco smoke is the main cause of the disease 40. It is therefore urgent to find more effective systemic therapies 39. In squamous cell carcinoma, miR-126 has been verified to be downregulated, and two miRNAs, miR-185* and miR-125a-5p, to be upregulated 39. MiR-205 is differentially expressed in non-small cell lung carcinoma (NSCLC) 40. To test the stability of CNMDA, we applied it based on the associations in HMDD v2.0 and HMDD v1.0 30, respectively. We found that 20 and 28 of the top 50 predicted LN-associated miRNAs were verified, respectively (see Supplementary Tables 5 and 6).
From the results above, we conclude that CNMDA shows excellent performance in predicting novel MDAs.
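The LOOCV ranking step described under Cross validation can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `loocv_rank` and `is_correct_prediction`, the dictionary layout of the scores, the toy values and the tie-handling rule are all assumptions. In the actual procedure the scores would come from rerunning CNMDA with the test association hidden.

```python
def loocv_rank(scores, test_pair, candidate_pairs):
    """Rank of the held-out test pair among the unlabeled candidate pairs.

    scores: dict mapping (miRNA, disease) -> association score
            (hypothetical layout, chosen for this sketch).
    Rank 1 means the test pair scored higher than every candidate.
    Ties are counted against the test pair (a conservative assumption).
    """
    test_score = scores[test_pair]
    better_or_equal = sum(
        1 for pair in candidate_pairs if scores[pair] >= test_score
    )
    return better_or_equal + 1


def is_correct_prediction(rank, threshold):
    """Assumed reading of 'ranks higher than the threshold': rank <= threshold."""
    return rank <= threshold


# Toy example with made-up scores for a single disease.
toy_scores = {
    ("miR-a", "disease-1"): 0.90,  # held-out known association (test sample)
    ("miR-b", "disease-1"): 0.95,  # unlabeled candidate
    ("miR-c", "disease-1"): 0.40,  # unlabeled candidate
    ("miR-d", "disease-1"): 0.10,  # unlabeled candidate
}
test = ("miR-a", "disease-1")
candidates = [pair for pair in toy_scores if pair != test]
rank = loocv_rank(toy_scores, test, candidates)
```

Sweeping the threshold over all possible ranks and recording the true and false positive rates at each value is what produces the ROC curve from which an AUC is computed.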

Discussions
Overwhelming evidence shows that miRNAs participate in all sorts of diseases. The development of new computational approaches for rapidly predicting MDAs is important for guiding further experimental validation. With such predictions, novel MDAs can be confirmed by biological experiments at lower time and cost. Existing models are usually based on four different calculation mechanisms 41. Score function-based models prioritize disease-related miRNAs through scoring functions built on probability distributions. Complex network algorithm-based models establish complex networks from various data that are collected or calculated from different perspectives. Machine learning-based models rely on powerful machine learning algorithms. Finally, multiple biological information-based models construct intermediate associations from various biological datasets. We put forward CNMDA to infer novel MDAs. In the model, we implemented RWR on a multi-level composite network built by combining collected and calculated data (ISD, ISM, GIPKS for lncRNAs, experimentally validated MDAs, MLIs and LDAs). The evaluation results show that the accuracy of our model was superior to that of the four other models.
The effective performance of CNMDA can be attributed to the following merits. First, CNMDA integrates multi-source information from reliable databases, which helps it predict potential MDAs effectively. Second, in contrast to approaches that use only local network information, RWR is an iterative process that exploits global network information for MDA prediction. The benefits of global network information have been demonstrated in the identification of potential disease-gene associations, MDAs 41,42, LDAs 43 and drug-target interactions 44. Third, CNMDA can be applied to novel diseases that have no known associated miRNAs. Finally, CNMDA needs only positive samples as training data. Since no known negative samples are available, this makes the prediction performance of CNMDA more convincing. However, CNMDA also has some limitations. For example, the number of experimentally determined MDAs, LDAs and MLIs is insufficient; only 5430 known MDAs were collected. The more known MDAs are available, the higher the prediction accuracy of the model is expected to be. Importantly, according to the LOOCV evaluation, the current prediction accuracy still needs to be improved.

Methods
MiRNA-disease associations. The experimentally confirmed MDAs used in this paper were obtained from a high-quality database 29. We constructed an adjacency matrix W_dm to represent the 5430 known MDAs, and used the variables nm and nd to denote the total numbers of miRNAs and diseases in the known MDA dataset, respectively.
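As an illustration of how such an adjacency matrix can be assembled, the sketch below builds a small binary matrix from a list of known pairs. The function name `build_adjacency`, the toy identifiers and the convention that rows index diseases and columns index miRNAs are assumptions made for this example.

```python
def build_adjacency(pairs, diseases, mirnas):
    """Binary adjacency matrix with one row per disease and one column per miRNA.

    Entry [i][j] is 1 if disease i and miRNA j form a known association, else 0.
    The row/column convention is an assumption of this sketch.
    """
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    matrix = [[0] * len(mirnas) for _ in diseases]
    for disease, mirna in pairs:
        matrix[d_index[disease]][m_index[mirna]] = 1
    return matrix


# Toy data: nd = 3 diseases, nm = 4 miRNAs, 4 known associations.
toy_diseases = ["disease-1", "disease-2", "disease-3"]
toy_mirnas = ["miR-a", "miR-b", "miR-c", "miR-d"]
toy_pairs = [
    ("disease-1", "miR-a"),
    ("disease-1", "miR-b"),
    ("disease-2", "miR-b"),
    ("disease-3", "miR-d"),
]
w_dm = build_adjacency(toy_pairs, toy_diseases, toy_mirnas)
known = sum(sum(row) for row in w_dm)
unlabeled = len(toy_diseases) * len(toy_mirnas) - known
```

In the same way, the 5430 known MDAs and the 184155 unlabeled pairs reported in the Results together account for 189585 matrix entries.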
MiRNA-lncRNA interactions. The known MLIs were obtained from starBase v2.0 46. In the same way, we deleted the excess MLIs whose miRNAs and lncRNAs do not appear in the 5430 known MDAs and 250 known LDAs. In the end, 9088 known MLIs were retained, and an adjacency matrix W_ml was used to represent them.
where Δ is the semantic contribution decay factor. It is worth mentioning that the contribution of disease D to its own semantic value is 1. The semantic value of disease D can then be defined.
Finally, DSS1 between d(i) and d(j) can be described.
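For reference, the DAG-based semantic similarity that this model describes is commonly written as follows. This is a sketch of the widely used formulation and not necessarily the exact notation of the original: T(D), the node set of the DAG of disease D including D itself, and the symbols D_D(d) and DV(D) are assumptions introduced here. A decay factor of Δ = 0.5 is a common choice in the literature, but the value used here is not stated in the text.

```latex
% Contribution of disease d to the semantic value of disease D
D_D(d) =
\begin{cases}
1, & d = D \\
\max\{\Delta \cdot D_D(d') \mid d' \in \text{children of } d\}, & d \neq D
\end{cases}

% Semantic value of disease D
DV(D) = \sum_{d \in T(D)} D_D(d)

% Semantic similarity between diseases d(i) and d(j)
DSS1\big(d(i), d(j)\big) =
\frac{\sum_{t \in T(d(i)) \cap T(d(j))} \big(D_{d(i)}(t) + D_{d(j)}(t)\big)}
     {DV\big(d(i)\big) + DV\big(d(j)\big)}
```

Under this formulation, two diseases that share a larger part of their DAGs, and share it closer to themselves, receive a higher similarity.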

Disease semantic similarity model 2 (DSS2).
In DSS2 48, a more specific disease d that appears in fewer DAGs is considered to contribute more to the semantic value of disease D. Accordingly, the contribution of d to the semantic value of D can be described.
Gaussian interaction profile kernel similarity. For disease d(u), we used IP(d(u)) to denote the u-th row vector of W_dm, which is built from the known MDAs. By observing whether d(u) is related to each miRNA, we computed the GIPKS between diseases d(u) and d(v) 50. For lncRNAs l(p) and l(q), the GIPKS between them can be constructed. Similarly, the ISM between miRNAs m(i) and m(j) is obtained by integrating the GIPKS for miRNAs and the MFS 24.
Following the studies of Yao et al. 51, global information on the multi-level network is captured through the RWR algorithm. At each step, the walker moves from the current node to one of its immediate neighbors with probability 1 − δ, or goes back to the seed nodes with restart probability δ. P_0 denotes the original probability vector, and P_{t+1} denotes the probability vector over the nodes at step t + 1, which is computed from P_t by the RWR update with restart probability δ ∈ (0, 1). In the multi-level network, the initial seed node probability is given by P_0, and M(i,j) represents the transition probability from node i to node j. In the network of GIPKS for lncRNAs, the transition probability from lncRNA i (l_i) to lncRNA j (l_j) was put forward. In the MDA network, the transition probability from disease i (d_i) to miRNA j (m_j) was put forward, where x, y and z are the jumping probabilities between the network of GIPKS for lncRNAs and the ISD network, between the network of GIPKS for lncRNAs and the ISM network, and between the ISD network and the ISM network, respectively.
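The Gaussian interaction profile kernel described above can be sketched in a few lines. The code below uses the formulation that is standard in the literature, K(u, v) = exp(−γ ‖IP(u) − IP(v)‖²), with the bandwidth γ normalised by the mean squared norm of the profiles. The function name `gip_kernel`, the parameter `gamma_prime` with its default of 1 and the toy matrix are assumptions of this sketch, not details taken from the text.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    profiles: list of equal-length binary rows (one interaction profile per entity).
    gamma_prime: bandwidth before normalisation (assumed default of 1.0).
    """
    n = len(profiles)
    sq_norms = [sum(x * x for x in row) for row in profiles]
    mean_sq_norm = sum(sq_norms) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    # Normalise the bandwidth by the average squared profile norm.
    gamma = gamma_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[u], profiles[v]))
            kernel[u][v] = math.exp(-gamma * sq_dist)
    return kernel


# Toy disease-by-miRNA matrix: rows 0 and 1 share the same profile.
toy_w_dm = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
kd = gip_kernel(toy_w_dm)
```

Applying the same function to the columns of the matrix (its transpose) would give the kernel for miRNAs, and applying it to lncRNA interaction profiles would give the kernel for lncRNAs.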
CNMDA is iterated until the probabilities reach a steady state, i.e. until the change between P_{t+1} and P_t measured by the L1 norm is smaller than 10^−6. The candidate miRNAs can then be ranked according to w_∞. By incorporating MLIs and LDAs into MDA prediction, RWR was implemented on the constructed multi-level network to infer novel MDAs. Because the initial MLIs, LDAs and MDAs are more credible, they are all used as weights in the RWR equations. The one interaction type and the two association types play equally important parts in the network in disseminating information about miRNAs, diseases and lncRNAs for novel MDA prediction. In this study, we chose the same parameters as in the previous literature 51, which used RWR on the same multi-level composite network. Therefore, we set the parameter δ to 0.7 and x, y, z, α and β to 1/3.
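The iteration can be illustrated with a minimal random walk with restart on a toy network. The sketch assumes the standard update P_{t+1} = (1 − δ) Mᵀ P_t + δ P_0 with a row-normalised transition matrix M, and takes the stopping test to be the L1 distance between successive probability vectors. The function name, the `max_iter` safeguard and the three-node network are illustrative assumptions; the actual multi-level transition matrix of CNMDA is not reproduced here.

```python
def random_walk_with_restart(m, p0, delta=0.7, tol=1e-6, max_iter=10000):
    """Iterate P_{t+1} = (1 - delta) * M^T P_t + delta * P_0 until convergence.

    m[i][j]: transition probability from node i to node j (each row sums to 1).
    p0: initial (seed) probability vector.
    delta: restart probability (0.7, the value stated in the text).
    tol: L1 threshold on the change between successive iterates.
    """
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        p_next = [
            (1 - delta) * sum(m[i][j] * p[i] for i in range(n)) + delta * p0[j]
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(p_next, p))
        p = p_next
        if change < tol:
            break
    return p


# Toy network: node 0 is the seed and links to nodes 1 and 2.
toy_m = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
toy_p0 = [1.0, 0.0, 0.0]
steady = random_walk_with_restart(toy_m, toy_p0)
```

Nodes would then be ranked by their steady-state probability. With a restart probability as high as 0.7 the walk stays close to the seed, so the iteration converges within a handful of steps on this toy example.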