Predicting miRNA–disease associations using improved random walk with restart and integrating multiple similarities

Predicting beneficial and valuable miRNA–disease associations (MDAs) through biological laboratory experiments is costly and time-consuming. Developing an effective computational method for predicting MDAs is therefore essential and has attracted many computer scientists in recent years. In this paper, we propose a new computational method to predict miRNA–disease associations using improved random walk with restart and integrating multiple similarities (RWRMMDA). We used the WKNKN algorithm as a pre-processing step to address the sparsity and incompleteness of the data and thereby reduce the negative impact of the large number of missing associations. Two heterogeneous networks, one in disease space and one in miRNA space, were built by integrating multiple similarity networks, so that different walk probabilities could be assigned to each linked neighbor node of a disease or miRNA node in line with its degree in the respective network. Finally, an improved extended random walk with restart algorithm based on the miRNA similarity-based and disease similarity-based heterogeneous networks was used to calculate miRNA–disease association prediction probabilities. The experiments showed that our proposed method achieved strong performance, with global LOOCV AUC (Area Under the ROC Curve) and AUPR (Area Under the Precision-Recall Curve) values of 0.9882 and 0.9066, respectively. Its best AUC and AUPR values under fivefold cross-validation, supported by statistical tests, were 0.9855 and 0.8642, respectively. It outperformed the previous related methods NTSHMDA, PMFMDA, IMCMDA and MCLPMDA in both AUC and AUPR. In case studies of Breast Neoplasms, Carcinoma Hepatocellular and Stomach Neoplasms, it inferred 1, 12 and 7 new associations, respectively, among the top 40 predicted associated miRNAs for each disease. All of these newly inferred associations have been confirmed in different databases or in the literature.


Scientific Reports
| (2021) 11:21071 | https://doi.org/10.1038/s41598-021-00677-w | www.nature.com/scientificreports/

contributions of our study. First, we integrated multiple similarity networks to build two heterogeneous networks, one in disease space and one in miRNA space, so that different walk probabilities can be assigned to each related neighbor node of a disease or miRNA node in line with its degree in the respective space. Second, we addressed the sparsity and incompleteness of the data, and so reduced the negative impact of the large number of missing associations, by using the WKNKN algorithm as a pre-processing step. Finally, we improved the extended random walk with restart algorithm based on the miRNA similarity-based and disease similarity-based heterogeneous networks to calculate miRNA-disease association prediction probabilities. The experiments used miRNA functional similarities, disease semantic similarities and the miRNA-disease association dataset downloaded from the HMDD V2.0 database 30, which contains 5430 experimentally verified associations between 383 diseases and 495 miRNAs, as in PMFMDA 4. They showed that our proposed method (RWRMMDA) achieved strong performance. In detail, RWRMMDA achieved global LOOCV AUC (Area Under the ROC Curve) and AUPR (Area Under the Precision-Recall Curve) values of 0.9882 and 0.9066, respectively. Additionally, its best AUC and AUPR values under fivefold cross-validation, supported by statistical tests, are 0.9855 and 0.8642, respectively. Its performance is superior to that of other state-of-the-art methods such as NTSHMDA 29, PMFMDA 4, IMCMDA 13 and MCLPMDA 14. It can be considered an effective and valuable tool for inferring miRNA-disease associations.

Materials and methods
Method overview. In this paper, we propose a new method to predict potential miRNA-disease associations using improved random walk with restart and integrating multiple similarities (RWRMMDA). The workflow of RWRMMDA is shown in Fig. 1. RWRMMDA is based on the known miRNA-disease associations, miRNA functional similarity and disease semantic similarity information, and it contains six stages. At the first stage, we calculated the Gaussian interaction profile kernel similarity for miRNAs and diseases. At the second stage, we computed the integrated similarity for miRNAs and diseases. At the third stage, we applied a weighted K-nearest known neighbors (WKNKN) algorithm as a pre-processing step to replace unknown (missing) values in the miRNA-disease association set with estimated association likelihoods, which reduces the impact of data sparsity. At the fourth stage, we constructed two heterogeneous networks, one miRNA similarity-based and one disease similarity-based. Next, we ran an improved random walk with restart algorithm on these two heterogeneous networks to calculate the final prediction probabilities. Finally, we ranked the prediction scores in descending order to obtain the most likely disease-associated miRNAs.

Human miRNA-disease associations. We used an adjacency matrix A_DM to express the known miRNA-disease associations, which were downloaded from the HMDD V2.0 database 30 and comprise 5430 experimentally verified associations between 383 diseases and 495 miRNAs. Specifically, if the association between disease d_i and miRNA m_j has been experimentally verified, the element A_DM(i, j) is set to 1; otherwise A_DM(i, j) is set to 0. Hence, the ith row of A_DM is a binary vector indicating the associations between disease d_i and each miRNA, and the jth column of A_DM is a binary vector indicating the associations between miRNA m_j and each disease.

Disease semantic similarity. Disease semantic similarity was estimated according to the literature 4,17,31.
We gathered the relationships among diseases as hierarchical directed acyclic graphs (DAGs) built from MeSH descriptors downloaded from the National Library of Medicine (http://www.ncbi.nlm.nih.gov/). DAGs are commonly used to measure the similarity among diseases. For a disease d, its directed acyclic graph is given by DAG(d) = (d, TA_d, EC_d), where TA_d is the set containing d and all of its ancestors, and EC_d is the set of edges that point from parent nodes to child nodes in the MeSH tree. The semantic contribution of disease t to disease d is given by the following equation:

$$D_d(t) = \begin{cases} 1, & \text{if } t = d \\ \max\{\Delta \cdot D_d(t') \mid t' \in \text{children of } t\}, & \text{if } t \neq d \end{cases} \tag{1}$$

where Δ is a predefined semantic contribution factor with values ranging from 0 to 1. Following Wang et al. 31, Xu et al. 4 and Chen et al. 17, we set Δ to 0.5. We calculated the semantic similarity between diseases on the assumption that two diseases sharing larger parts of their DAGs tend to have higher semantic similarity, as in formula (2):

$$DSS(d_i, d_j) = \frac{\sum_{t \in TA_{d_i} \cap TA_{d_j}} \left( D_{d_i}(t) + D_{d_j}(t) \right)}{\sum_{t \in TA_{d_i}} D_{d_i}(t) + \sum_{t \in TA_{d_j}} D_{d_j}(t)} \tag{2}$$

miRNA functional similarity. As in previous studies 4,31, functional similarity measurements were used to represent the functional similarities among miRNAs. Specifically, let the disease sets associated with any two miRNAs m_i and m_j be DTT_i = {d_i1, d_i2, ..., d_ik} and DTT_j = {d_j1, d_j2, ..., d_jl}, respectively. Similar to Wang et al. 31 and Xu et al. 4, we first used SS(d, DTT) = max_{d_i ∈ DTT} DSS(d, d_i) to describe the similarity between a disease d and a disease set DTT. Then, the similarity between m_i and m_j was computed as follows:

$$MFS(m_i, m_j) = \frac{\sum_{p=1}^{k} SS(d_{ip}, DTT_j) + \sum_{q=1}^{l} SS(d_{jq}, DTT_i)}{k + l} \tag{3}$$

The calculation of miRNA functional similarity is illustrated in Fig. 2.

The Gaussian interaction profile kernel similarity between disease d_i and disease d_j was calculated from their interaction profiles, that is, the ith and jth rows of A_DM, as follows:

$$GD(d_i, d_j) = \exp\left( -\gamma_d \left\| A_{DM}(d_i) - A_{DM}(d_j) \right\|^2 \right) \tag{4}$$

where γ_d is a kernel bandwidth adjustment parameter, updated as follows:

$$\gamma_d = \gamma'_d \Big/ \left( \frac{1}{n_d} \sum_{i=1}^{n_d} \left\| A_{DM}(d_i) \right\|^2 \right) \tag{5}$$

Here γ′_d is commonly set to 1, as in previous studies 4,17.
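As an illustration of the Gaussian interaction profile kernel just described, the following minimal Python sketch computes the kernel between the rows of a binary association matrix, with the bandwidth set to γ′ divided by the mean squared norm of the profiles. The function name `gip_kernel`, the variable names `GD` and `GM`, and the toy matrix are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def gip_kernel(A, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of A."""
    A = np.asarray(A, dtype=float)
    sq_norms = (A ** 2).sum(axis=1)
    # Bandwidth: gamma' divided by the mean squared norm of the row profiles.
    gamma = gamma_prime / sq_norms.mean()
    # Squared Euclidean distance between every pair of row profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A @ A.T)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

# Toy association matrix (hypothetical): 3 diseases x 3 miRNAs.
A_toy = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0]])
GD = gip_kernel(A_toy)      # disease-side kernel (rows of A)
GM = gip_kernel(A_toy.T)    # miRNA-side kernel (columns of A)
```

Applying the same function to the transposed matrix gives the miRNA-side kernel, since the miRNA profiles are the columns of the association matrix.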
In a similar way, we calculated the Gaussian interaction profile kernel similarity between miRNA m_i and miRNA m_j from their interaction profiles, that is, the ith and jth columns of A_DM, as follows:

$$GM(m_i, m_j) = \exp\left( -\gamma_m \left\| A_{DM}(m_i) - A_{DM}(m_j) \right\|^2 \right) \tag{6}$$

where γ_m is a kernel bandwidth adjustment parameter, updated as follows:

$$\gamma_m = \gamma'_m \Big/ \left( \frac{1}{n_m} \sum_{j=1}^{n_m} \left\| A_{DM}(m_j) \right\|^2 \right) \tag{7}$$

Here γ′_m is commonly set to 1, as in previous studies 4,17.

Integrated similarity for miRNAs and diseases. Disease semantic similarity is determined from DAGs, as described above, but DAGs could not be obtained for all diseases, so semantic similarity could not be assessed for diseases without DAGs. Consequently, to obtain similarity information for all diseases, we combined disease semantic similarity with the Gaussian interaction profile kernel similarity, following previous studies 4,32, to give the integrated disease similarity ISD. Similarly, the integrated miRNA similarity ISM was computed according to previous studies 4,32.

Weighted K-nearest known neighbors algorithm. We used the WKNKN algorithm introduced in 25,28 as a pre-processing step to replace unknown values in the miRNA-disease association set with estimated association likelihoods. It relies on the information of known neighbors and on the fact that many of the non-interacting miRNA-disease pairs in A_DM are unknown cases that could be true associations. In particular, WKNKN replaces each A_DM(i, j) = 0 with a continuous interaction likelihood value in the range from 0 to 1, as follows. Firstly, for each disease d_i, we selected the K known diseases nearest to d_i according to semantic similarity, and used these similarities and the corresponding interaction profiles to estimate the interaction likelihood profile of d_i. Secondly, for each miRNA m_j, we selected the K known miRNAs nearest to m_j according to functional similarity, and used these similarities and the corresponding interaction profiles to estimate the interaction likelihood profile of m_j. Finally, if A_DM(i, j) = 0, we replaced it with the average of the two interaction likelihood values.
Figure 3 contains the pseudocode that describes the above steps in detail in which r is a decay term where r ≤ 1, and KNN() returns the K-nearest known neighbors in descending order based on their similarities to d i or m j .
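The pre-processing steps above can be sketched in Python as follows. This is a minimal sketch of a WKNKN-style procedure under stated assumptions, not the pseudocode of Fig. 3: neighbors are taken from all other rows, weights are the similarity multiplied by r raised to the neighbor's rank, and the weighted sum is normalized by the sum of the neighbor similarities. The function name `wknkn` and the toy matrices are hypothetical; the defaults K = 5 and r = 0.7 are the values reported as optimal in this paper.

```python
import numpy as np

def wknkn(A, S_d, S_m, K=5, r=0.7):
    """Replace zeros in A with likelihoods estimated from nearest neighbors."""
    A = np.asarray(A, dtype=float)
    S_d = np.asarray(S_d, dtype=float)
    S_m = np.asarray(S_m, dtype=float)

    def likelihood(Y, S):
        out = np.zeros_like(Y)
        n = Y.shape[0]
        for i in range(n):
            others = np.delete(np.arange(n), i)
            # K nearest neighbors, most similar first.
            nn = others[np.argsort(-S[i, others], kind="stable")][:K]
            sims = S[i, nn]
            weights = (r ** np.arange(len(nn))) * sims  # decayed weights
            z = sims.sum()                              # normalization term
            if z > 0:
                out[i] = weights @ Y[nn] / z
        return out

    Y_d = likelihood(A, S_d)        # disease-side likelihood profiles
    Y_m = likelihood(A.T, S_m).T    # miRNA-side likelihood profiles
    # Known associations (1) are kept; zeros receive the averaged likelihood.
    return np.maximum(A, (Y_d + Y_m) / 2.0)

# Toy data (hypothetical): 3 diseases x 2 miRNAs.
A_toy = np.array([[1, 0], [1, 1], [0, 0]])
S_d_toy = np.array([[1.0, 0.8, 0.5], [0.8, 1.0, 0.4], [0.5, 0.4, 1.0]])
S_m_toy = np.array([[1.0, 0.6], [0.6, 1.0]])
A_new = wknkn(A_toy, S_d_toy, S_m_toy, K=2, r=0.7)
```

Because every weight is at most the corresponding similarity, each estimated value stays within the range from 0 to 1 when the similarities are non-negative.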
Construct miRNA similarity-based and disease similarity-based heterogeneous networks. In common random walk with restart (RWR) algorithms 18–20, the transition probabilities from a disease (miRNA) node to its related neighbor miRNAs (diseases) are allocated equally and sum to 1. However, the degree to which a given disease (miRNA) tends to be related to each of its associated miRNAs (diseases) actually differs 29,33. For instance, some of the miRNAs associated with a given disease d_i are similar to many of the other d_i-associated miRNAs, whereas the remaining d_i-associated miRNAs have few or no similarities to the other miRNAs associated with d_i. Therefore, we suppose that a disease (miRNA) has a stronger relation with an associated miRNA (disease) to which a larger number of the other miRNAs (diseases) associated with that disease (miRNA) are similar 29. Based on this hypothesis, we combined topological similarity with semantic similarity for a disease, or with functional similarity for a miRNA, to measure the degree to which a disease (miRNA) tends to be related to a miRNA (disease) 29,33. We determined the edge weights in the miRNA-disease association network, which reflect the degree of relatedness of each actual association, based on the integrated similarity for diseases and the integrated similarity for miRNAs, respectively, as follows. Firstly, a bipartite graph consisting of disease nodes and miRNA nodes was constructed. Secondly, when the walker moves from the disease network to the miRNA network, the probability of selecting a target miRNA node m_j (j = 1, 2, …, n_m) from a specific disease node d_i (i = 1, 2, …, n_d) depends entirely on the similarities between m_j and all neighboring d_i-related miRNA nodes, including m_j itself 29.

Analogously, when the walker moves from the miRNA network to the disease network, the probability of selecting a target disease node d_i (i = 1, 2, …, n_d) from a specific miRNA node m_j (j = 1, 2, …, n_m) depends entirely on the similarities between d_i and all neighboring m_j-related disease nodes, including d_i itself 29. Figure 4 illustrates a simple example of the weight assignment process in disease space and in miRNA space. Finally, we redefined two new integrated adjacency matrices, A_DMdiseasebase and A_DMmirnabase, based on the integrated similarity matrix ISD for diseases, the integrated similarity matrix ISM for miRNAs and the adjacency matrix A_DM_new.

Improved random walk with restart to predict miRNA-disease associations. Firstly, based on the two new integrated adjacency matrices defined above, we defined a transition probability matrix T_DM from the disease network to the miRNA network and a transition probability matrix T_MD from the miRNA network to the disease network, where ϕ ∈ (0, 1) is the probability that the random walker jumps between the two networks 29. Secondly, we defined a disease transition probability matrix W_d to represent the transition probabilities from a disease node to all neighboring disease nodes in the disease network, in which the element W_d(i, j) is the jumping probability from disease d_i to disease d_j, as in Eq. (14).
Furthermore, the miRNA network transition probability matrix W_m can be constructed analogously. Thirdly, instead of using the vector form of the initial probability as in common RWR algorithms 18–20, and inspired by the extended RWR proposed by Luo and Long 29, we defined an initial probability matrix of the heterogeneous network, so that the improved random walk with restart produces all miRNA-disease associations concurrently. In this matrix, PD_0 and PM_0 are diagonal matrices with PD_0(i, i) = 1/n_d and PM_0(j, j) = 1/n_m, which serve as the normalized probabilities of the disease and miRNA seed nodes, and δ is the weight factor indicating the relative importance of the two sub-networks represented by the matrices A_DMdiseasebase and A_DMmirnabase.

We then defined a new transition probability matrix W_newTP_DM of the heterogeneous network based on the disease similarity-based network, and a new transition probability matrix W_newTP_MD of the heterogeneous network based on the miRNA similarity-based network. Both definitions involve the transposes of T_DM and T_MD. From the new transition probability matrices and the initial probability matrix, the improved random walk with restart is defined iteratively, where P1_t and P2_t are prediction matrices holding the probability values of all miRNA-disease associations at time step t, and γ ∈ (0, 1) is the restart probability. We repeated the improved random walk process on the heterogeneous network until convergence; in general, the number of time steps t is set to 10, as in 29. Finally, the final prediction matrix P is defined, in which the elements give the association scores between disease nodes and miRNA nodes, all produced simultaneously.
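To make the iteration concrete, here is a minimal Python sketch of a generic matrix-form random walk with restart on a disease and miRNA heterogeneous network. It is not the authors' exact formulation: the block layout of the transition matrix, the similarity-based edge weighting, and all function names are assumptions for illustration, and the weight factor δ and the two separate prediction matrices are omitted. Only the defaults ϕ = 0.9, γ = 0.7 and 10 time steps come from the text.

```python
import numpy as np

def row_normalize(M):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    M = np.asarray(M, dtype=float)
    s = M.sum(axis=1, keepdims=True)
    return np.divide(M, s, out=np.zeros_like(M), where=s > 0)

def similarity_weighted(A, S):
    """Weight edge (i, j) by the total similarity between column node j
    and all column nodes linked to row node i (including j itself)."""
    A = np.asarray(A, dtype=float)
    return A * (A @ S)

def build_transition(A, S_d, S_m, phi=0.9):
    """Transition matrix over [diseases, miRNAs]; rows sum to 1 when
    every node has at least one association."""
    T_dm = row_normalize(similarity_weighted(A, S_m))    # disease -> miRNA
    T_md = row_normalize(similarity_weighted(A.T, S_d))  # miRNA -> disease
    W_d = row_normalize(S_d)                             # disease -> disease
    W_m = row_normalize(S_m)                             # miRNA -> miRNA
    return np.block([[(1 - phi) * W_d, phi * T_dm],
                     [phi * T_md, (1 - phi) * W_m]])

def rwr(W, P0, gamma=0.7, steps=10):
    """Iterate P <- (1 - gamma) * W^T P + gamma * P0 for a fixed number of steps."""
    P = P0.copy()
    for _ in range(steps):
        P = (1 - gamma) * (W.T @ P) + gamma * P0
    return P

# Toy data (hypothetical): 2 diseases x 3 miRNAs.
A_toy = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
S_d_toy = np.array([[1.0, 0.4], [0.4, 1.0]])
S_m_toy = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])
n_d, n_m = A_toy.shape

W = build_transition(A_toy, S_d_toy, S_m_toy)
# Every node is a seed: diagonal entries 1/n_d for diseases, 1/n_m for miRNAs.
P0 = np.diag(np.r_[np.full(n_d, 1.0 / n_d), np.full(n_m, 1.0 / n_m)])
P = rwr(W, P0)
scores = P[n_d:, :n_d].T   # rows: diseases, columns: miRNAs
```

Using a matrix of seeds rather than a single seed vector means one run yields scores for all disease and miRNA pairs at once, which is the property the text attributes to the extended RWR.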
Rank the final prediction score of associations to obtain predicted miRNA-disease associations. For a given disease, we ranked the association scores of all candidate miRNAs in descending order to obtain the most likely miRNA-disease associations. A candidate with a higher score is more likely to be verified in the future.

Ethics approval and consent to participate. Not applicable. The study does not involve human subjects and uses only public data.
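The ranking step amounts to a descending sort of one row of the prediction matrix. A minimal Python sketch follows; the function name `top_mirnas`, the miRNA labels and the scores are hypothetical values chosen for illustration.

```python
import numpy as np

def top_mirnas(P, disease_index, mirna_names, top_k=40):
    """Return the top_k (miRNA, score) pairs for one disease, best first."""
    scores = np.asarray(P, dtype=float)[disease_index]
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(mirna_names[j], float(scores[j])) for j in order]

# Hypothetical scores for one disease and three candidate miRNAs.
P_toy = np.array([[0.1, 0.9, 0.5]])
names = ["mir-A", "mir-B", "mir-C"]
ranked = top_mirnas(P_toy, 0, names, top_k=2)
```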

Results
Performance measures. We assessed our method's performance in inferring miRNA-disease associations with fivefold cross-validation experiments and global LOOCV, measuring the area under the ROC curve (AUC) 34 and the area under the precision-recall curve (AUPR) 35. The true positive rate (TPR) and false positive rate (FPR) are computed as follows:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

where TP (true positive) means that a positive sample is correctly predicted as positive; FN (false negative) means that a positive sample is falsely predicted as negative; FP (false positive) means that a negative sample is wrongly predicted as positive; and TN (true negative) means that a negative sample is correctly predicted as negative. We plotted the receiver operating characteristic (ROC) curve 34 with TPR on the vertical axis and FPR on the horizontal axis.
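As a worked illustration of the ROC construction, the following Python sketch sweeps a threshold over the sorted prediction scores, accumulates TPR and FPR, and integrates the curve with the trapezoidal rule. It is a simplified sketch that assumes no tied scores and at least one positive and one negative sample; the function name `roc_auc` and the example data are hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels and prediction scores."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest score first
    y = labels[order]
    tp = np.cumsum(y)          # true positives at each threshold
    fp = np.cumsum(1.0 - y)    # false positives at each threshold
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    # Trapezoidal rule over the (FPR, TPR) points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

auc_perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
auc_mixed = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In the second example, three of the four positive and negative pairs are ordered correctly, which matches the pairwise interpretation of AUC.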
As noted by Saito and Rehmsmeier 35, when evaluating binary classifiers on imbalanced datasets, the Precision-Recall curve is more informative than the ROC curve. Therefore, we also drew the Precision-Recall curve and calculated the AUPR value to evaluate prediction performance. Precision is the proportion of correctly predicted positive samples among all predicted positive samples, whereas Recall is the proportion of correctly predicted positive samples among all real positive samples. Precision and Recall are computed as follows:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

Evaluating the AUC and AUPR under fivefold cross validation. In the fivefold cross-validation experiments, we first treated the known miRNA-disease associations as positive samples and the remaining unknown associations as negative samples. Secondly, we randomly partitioned all positive and negative samples in the known adjacency matrix A_DM into five equal parts. Thirdly, in each run, we took four parts of the positive and negative samples for training and the remaining part for testing; the elements equal to 1 in the testing part were changed to 0. Fourthly, we recalculated Final_score in each run.

Two parameters from WKNKN. Because the matrix A_DM contains unknown miRNA-disease associations, the WKNKN algorithm was used as a pre-processing step to replace unknown values in the miRNA-disease association set with likelihoods estimated from their known neighbors. The parameter K is the number of nearest known neighbors, and r is a decay term with r ≤ 1. In this study, we mainly focus on how the number of nearest known neighbors influences the reduction of the data sparsity problem. The more nearest known neighbors are chosen, the more associations between diseases and miRNAs are added to the heterogeneous network, and the more the impact of data sparsity is reduced.

However, when the number of added associations becomes too large, the imbalanced data problem appears again. Therefore, the two parameters were set to their optimal values before performing the improved random walk on the heterogeneous networks. In our experiments, we repeatedly varied the values of K and r to choose the optimal ones. The results showed that AUC and AUPR achieve their best values when K = 5 and r = 0.7, which is similar to the result for the NPCMF method 26. Table 2 shows how the evaluation indices change, when evaluating prediction performance over all samples, with K fixed to 5 and r ranging from 0.1 to 0.9, and with r fixed to 0.7 and K ranging from 1 to 9.
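The fivefold cross-validation protocol behind these evaluations can be sketched in Python as follows. This is a minimal sketch under the assumption that all entries of the association matrix, positive and negative, are shuffled together and split into five nearly equal parts; the function name `fivefold_splits`, the seed and the toy matrix are hypothetical.

```python
import numpy as np

def fivefold_splits(A, n_folds=5, seed=0):
    """Yield (training matrix, test mask) pairs; test entries are zeroed."""
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(A.size)          # shuffle all entry indices
    for test_idx in np.array_split(shuffled, n_folds):
        test_mask = np.zeros(A.shape, dtype=bool)
        test_mask[np.unravel_index(test_idx, A.shape)] = True
        A_train = A.copy()
        A_train[test_mask] = 0.0                # hide test associations
        yield A_train, test_mask

# Toy association matrix (hypothetical): 4 diseases x 5 miRNAs.
A_toy = (np.arange(20).reshape(4, 5) % 3 == 0).astype(float)
splits = list(fivefold_splits(A_toy))
```

In each fold, the prediction scores at the positions marked by the test mask would be compared with the original labels at those positions to compute AUC and AUPR.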
Three parameters from improved random walk with restart. When performing the improved random walk with restart on the heterogeneous networks, three parameters can affect the resulting performance. The parameter ϕ ∈ (0, 1) is the probability that the random walker jumps between the two networks. The parameter δ ∈ (0, 1) is the weight factor indicating the relative importance of the two sub-networks. The parameter γ ∈ (0, 1) is the restart probability. We examined the influence of the three parameters by adjusting them over repeated experiments and then selected ϕ = 0.9, δ = 0.7 and γ = 0.7 as the optimal combination for our proposed method.

Performance comparison with other related models. To demonstrate the performance of our model relative to other related approaches, we compared it with the NTSHMDA 29, PMFMDA 4, IMCMDA 13 and MCLPMDA 14 models, using the best averaged results under fivefold cross-validation. The NTSHMDA method contains the extended random walk with restart algorithm that we used in our method. The PMFMDA, IMCMDA and MCLPMDA methods used the same miRNA-disease association dataset as our experiments. The performance of these methods in terms of AUC and AUPR is shown in Fig. 7. As can be seen, our proposed approach exceeds NTSHMDA, PMFMDA, IMCMDA and MCLPMDA in AUC by 0.61%, 0.6%, 14.5% and 7.5%, respectively, and in AUPR by 13.62%, 35.04%, 60.44% and 53.52%, respectively. These differences indicate that our proposed method outperforms all of the previous related methods. In particular, for imbalanced datasets of this kind, the marked improvement in AUPR suggests that our proposed method can be considered more informative and reliable than the previous related methods.
Additionally, to understand the separate effects of using WKNKN and of integrating multiple similarities, we also drew ROC curves and Precision-Recall curves for random walk with restart in two cases: (1) using WKNKN as a pre-processing step without integrated similarities, and (2) using integrated similarities without WKNKN as a pre-processing step. As shown in Fig. 8a, the AUC value of the proposed method appears to be the average of the AUC values of cases (1) and (2). As illustrated in Fig. 8b, the AUPR value of the proposed method is the highest of the three. This means that using the WKNKN algorithm as a pre-processing step and using integrated similarities can each increase the AUPR value, while using the WKNKN algorithm as a pre-processing step can reduce the impact of the data sparsity problem when evaluating AUC values.

Breast neoplasms. Breast Neoplasms, also known as Breast Cancer, is the leading cause of cancer death in women worldwide. MicroRNAs (miRNAs) have been found to play an important role in breast cancer 37,38. For example, miR-34 family members are involved in regulating the proliferation, apoptosis, invasion and metastasis of breast cancer cells 39, and miR-34a inhibits the proliferation and migration of breast cancer through down-regulation of Bcl-2 and SIRT1 40. In this paper, we selected Breast Neoplasms as a case study to demonstrate the ability of our method to infer miRNA-disease associations. As can be seen in Table 3, among the top 40 predicted Breast Neoplasms-associated miRNAs there is one new miRNA-disease association. This new association has been verified in the dbDEMC V2.0 database.

Hepatocellular carcinoma. Hepatocellular carcinoma (HCC) is the most common primary liver malignancy and a leading cause of cancer-related death globally 41. In the United States, HCC is the ninth leading cause of cancer deaths 42,43. MiRNAs are essential participants and regulators, and they play important roles in the development and progression of HCC 41. For instance, microRNA-146a inhibits cancer metastasis by downregulating VEGF through dual pathways in hepatocellular carcinoma 44, and miRNA-21 contributes to tumor progression by converting hepatocyte stellate cells to cancer-associated fibroblasts in HCC 45. With HCC selected as a case study to illustrate the ability of our approach, the method discovered 12 new associations among the top 40 predicted Hepatocellular Carcinoma-associated miRNAs, as can be seen in Table 4. To increase the reliability of the predicted results, we checked the evidence for these new predicted associations in the dbDEMC V2.0, mirCancer and mirdb (http://mirdb.org/) databases as well as in the literature. For example, the new predicted association between hsa-mir-452 and Hepatocellular carcinoma has been verified in the dbDEMC V2.0 database and in several published papers 46–48. For the new predicted association between hsa-mir-454 and Hepatocellular carcinoma, Yu et al. 49 proved that miR-454 functions as an oncogene by inhibiting CHD5

Predicting new disease-related miRNAs. The dataset used in this study does not contain any new disease or new miRNA; that is, every disease or miRNA in the dataset has at least one known association with a miRNA or disease. Therefore, to demonstrate the proposed method's performance in predicting miRNAs related to new diseases, we conducted two simulated experiments, on Lung Neoplasms and Ovarian Neoplasms. The first simulated experiment was conducted on Lung Neoplasms.
It is also known as Lung Cancer and is the leading cause of cancer deaths worldwide 58. The clinical applications of miRNAs in lung cancer diagnosis and prognosis have been indicated in many studies 58,59. In this study, the dataset contained 132 associations between Lung Neoplasms and miRNAs. We removed all known associations related to Lung Neoplasms to perform the simulated experiment of predicting new disease-related miRNAs. After performing the simulated experiment, we selected the top ten predicted miRNAs for Lung Neoplasms to report the performance of our method. As can be seen in Table 6, among the top ten predicted miRNAs, our method successfully predicted four known associations and inferred six new associations. All six new predicted associations have been confirmed in other databases or in the literature.

The second simulated experiment was performed on Ovarian Neoplasms. It is also known as Ovarian Cancer and has the highest mortality rate among gynecological cancers 60. miRNAs have been indicated to be promising biomarkers for Ovarian Cancer 60–62. The dataset in this study included 114 known associations between miRNAs and Ovarian Neoplasms. We performed the simulated experiment by removing all known associations related to Ovarian Neoplasms and treating them as unknown. The result showed that among the top ten predicted miRNAs for Ovarian Neoplasms, three known associations were successfully predicted and seven new associations were reported. All seven new predicted associations have been confirmed (Table 7).

Conclusion and discussions
Inferring potential miRNA-disease associations by integrating various types of prior information is a challenging and meaningful task for disease-related research. In this paper, we proposed a new method to infer miRNA-disease associations using improved random walk with restart and integrating multiple similarities (RWRMMDA), namely miRNA functional similarity, disease semantic similarity and the network topological similarities of the miRNA-disease association network. Global LOOCV AUC (Area Under the ROC Curve) and AUPR (Area Under the Precision-Recall Curve) values of 0.9882 and 0.9066, respectively, and AUC and AUPR values of 0.9855 and 0.8642, respectively, under fivefold cross-validation show that our proposed method achieved reliable performance. It outperformed the previous related methods NTSHMDA, PMFMDA, IMCMDA and MCLPMDA in both AUC and AUPR. In case studies of Breast Neoplasms, Carcinoma Hepatocellular and Stomach Neoplasms, it inferred 1, 12 and 7 new associations, respectively, among the top 40 predicted associations. All of these new predicted associations have been confirmed in different databases or in the literature. Therefore, our proposed method can be considered a useful and meaningful tool for inferring potential miRNA-disease associations.

Several factors contribute to the performance of our proposed method. Firstly, the known miRNA-disease associations, comprising 5430 experimentally verified associations between 383 diseases and 495 miRNAs gathered from the HMDD V2.0 database, are reliable and have been used in many recent studies 4,14,27. Secondly, both the AUC and AUPR values of the proposed method were increased by using integrated similarities, although this did not reduce the effect of the data sparsity problem.

Thirdly, the impact of the data sparsity problem was reduced by applying the WKNKN algorithm as a pre-processing step to replace unknown values in the miRNA-disease association set with likelihoods estimated from their known neighbors. As a result, the prediction performance becomes more informative. Finally, and most importantly, the improved random walk with restart algorithm in our method differs from common random walk with restart algorithms 18–20. On the assumption that a disease (miRNA) has a different relevance probability for each associated miRNA (disease), each miRNA-disease association was assigned a different weight in the different heterogeneous network spaces built by integrating multiple similarities. As a result, the extended random walk with restart algorithm tends to select actual miRNA-disease association pairs with higher probability, which limits prediction bias.

Our proposed approach achieves reliable prediction performance and can infer new disease-related miRNAs, as indicated by the results of the simulated experiments on Lung Neoplasms and Ovarian Neoplasms in the section on predicting new disease-related miRNAs. However, subjectively choosing a new disease for a simulated experiment and removing all of its known associations can bias the prediction. Further research, or the integration of more biological information, is therefore required to increase the reliability of prediction for new diseases or new miRNAs.