Introduction

MicroRNAs (miRNAs) are a class of small endogenous single-stranded non-coding RNAs (~22 nt) that normally suppress gene expression and protein production post-transcriptionally by base pairing to the 3′ untranslated regions (UTRs) of their target messenger RNAs (mRNAs)1,2,3,4. In some cases, miRNAs may also function as positive regulators5,6. Many miRNAs have been shown to be highly conserved7, and some of them are even lineage specific. After the discovery of the first two well-known miRNAs, Caenorhabditis elegans (C. elegans) lin-4 and let-7, by conventional forward genetic screens8,9,10, thousands of miRNAs (for example, more than 1400 human miRNAs according to miRBase11) have been discovered in eukaryotic organisms ranging from nematodes to humans in the past few years12. It is estimated that 1–4% of the genes in the human genome are miRNAs13. MiRNAs recognize their targets primarily through sequence complementarity between the seed region of the miRNA and the binding sites on its target mRNAs14. It has been conjectured that a single miRNA can regulate as many as 200 mRNAs13 and that about one third of human genes can be targeted by miRNAs12,15. Therefore, one miRNA can regulate many target genes and one target gene can be targeted by multiple miRNAs15. These miRNA-mRNA interactions form an important post-transcriptional regulatory network which plays critical roles in various biological processes16,17,18,19. It has been observed that miRNA-mediated regulation is evolutionarily conserved19,20,21, and hence rare sequence variants that disrupt miRNA regulation are often related to human diseases19,22,23,24.

Accumulating evidence indicates that miRNAs are among the most important components of the cell, playing critical roles in many significant biological processes, including cell development25, proliferation26, differentiation27 and apoptosis28, signal transduction16, viral infection27 and so on. Therefore, the dysregulation of miRNAs is related to many diseases, and miRNAs play important roles in the development, progression13,29,30, prognosis, diagnosis and treatment response evaluation of human disease31,32,33,34,35,36,37,38.

In the last few years especially, many studies have demonstrated that numerous miRNAs are associated with the initiation and development of various cancers and cancer-related processes39,40,41,42. Abnormality of miRNAs leads to the dysfunction of downstream target genes, which can in turn lead to the development of cancer42. MiRNAs have become an important part of the field of human molecular oncology40. Another well-known example is that mir-375 can regulate insulin secretion43,44. Therefore, identifying disease-related miRNAs is one of the most important goals of biomedical research; it can benefit the understanding of disease pathogenesis at the molecular level and the design of molecular tools for disease diagnosis, treatment and prevention31,32,33,34,36,45,46. Searching for disease-miRNA associations with experimental methods is expensive and time-consuming45,46. Encouragingly, plenty of biological data about miRNAs has been generated. Therefore, there is a strong incentive to develop powerful computational methods for predicting potential disease-related miRNAs on a large scale47. Computational methods are an essential complement for prioritizing disease-related miRNAs: they can benefit the understanding of miRNA function, decrease the number of biological experiments and select the most promising miRNAs for further experimental validation45,47.

To provide a comprehensive resource of experimentally verified miRNA-disease associations, Lu, et al.30 and Jiang, et al.48 successively constructed two publicly available and manually curated databases, i.e. the Human MicroRNA Disease Database (HMDD) and miR2Disease. Focusing on cancer-related miRNAs, Yang, et al.49 developed a manually curated database of Differentially Expressed MiRNAs in human Cancer (dbDEMC). The establishment of these disease-related miRNA databases laid a solid data foundation for predictive research. Lu, et al.30 integrated and analyzed these disease-miRNA associations to obtain some important patterns between human diseases and miRNAs, which not only benefited the understanding of human diseases at the miRNA level, but also laid a solid theoretical foundation for the identification of novel disease-related miRNAs. The most important conclusion of that paper is that miRNAs related to phenotypically similar diseases tend to be functionally related, which has been treated as the basic assumption of many current disease-miRNA association prediction methods30.

Several bioinformatics methods have been developed for predicting novel disease-miRNA associations, mostly based on the aforementioned assumption30. Jiang, et al.45 logically extended previous disease gene prioritization methods and developed a computational model based on the hypergeometric distribution to prioritize the entire microRNAome for a disease of interest. This method integrated the miRNA functional interaction network, the disease similarity network and the known phenome-microRNAome network constructed from miR2Disease. However, this method only adopts a local similarity measure and strongly relies on predicted miRNA-target interactions, which have high false-positive and false-negative rates. Other limitations lie in the construction of the miRNA functional similarity network (two miRNAs may be functionally related when their target genes are located in the same functional modules or pathways, rather than when they significantly share common target genes) and in the use of the disease phenotypic similarity network (only the information of whether or not two phenotypes are similar was used, rather than similarity scores). As a result, the prediction accuracy of this method is not high. Based on the assumption that most miRNAs associated with a given disease regulate genes associated with this disease, or genes functionally related to these known disease genes, Jiang, et al.50 proposed a computational method based on genomic data fusion in the framework of naïve Bayes. Recently, Shi et al.51 developed a computational framework to identify miRNA-disease associations by focusing on the functional link between miRNA targets and disease genes in protein-protein interaction networks. These two methods strongly rely on known disease-gene associations and miRNA-target interactions. However, the molecular bases of as many as 60% of human diseases are unknown. The problems with miRNA-target interactions noted above have also limited the application of these methods.

Jiang, et al.46 and Xu, et al.40 each extracted different feature vectors and developed support vector machine classifiers to distinguish positive disease miRNAs from negative ones. As is well known, selecting negative disease-related miRNAs is currently difficult or even impossible. Hence, these methods selected unlabeled disease-miRNA pairs as negative samples, which would largely influence the predictive accuracy. Based on the assumption that global network similarity measures are better suited to capture the associations between diseases and miRNAs than traditional local network similarity measures, Chen, et al.47 first adopted global network similarity and developed the method of Random Walk with Restart for MiRNA–Disease Association (RWRMDA). Also, Xuan et al.52 developed a new prediction method, HDMP, based on the weighted k most similar neighbors; it calculates the functional similarity between miRNAs from the information content of disease terms and the phenotype similarity between diseases, and assigns higher weights to members of the same miRNA family or cluster. RWRMDA and HDMP obtained excellent predictive accuracy in cross validation and case studies. However, they do not work for diseases without any known associated miRNAs. Furthermore, the selection of the parameter k is critical to the performance of HDMP, and different values of this parameter are needed when different diseases are investigated. Recently, Chen and Zhang53 adopted the method of Network-Consistency-Based Inference (Net-CBI) to infer potential disease-miRNA associations based on the idea of network consistency and the integration of the miRNA functional similarity network, the disease similarity network and known miRNA-disease associations. Although Net-CBI can work for diseases not linked with any known miRNAs, its performance in cross validation is significantly worse than that of RWRMDA.

Taken together, the limitations of previous methods can be summarized as follows. Firstly, some methods strongly rely on incomplete and inaccurate datasets such as miRNA-target interactions and disease-related genes; secondly, some methods need negative disease-miRNA associations; thirdly, although methods such as RWRMDA have obtained reliable predictive accuracy, they cannot predict novel miRNAs for diseases which do not have any known associated miRNAs; finally, methods such as Net-CBI can work for diseases without known related miRNAs, but their performance is unsatisfactory. To solve these problems, we developed the method of Regularized Least Squares for MiRNA-Disease Association (RLSMDA), which integrates known disease-miRNA associations, the disease-disease similarity dataset and the miRNA-miRNA functional similarity network to uncover potential disease-miRNA associations. RLSMDA can predict novel miRNAs for diseases which do not have any known related miRNAs. More importantly, it is developed in the framework of a semi-supervised classifier, so it does not need negative miRNA-disease associations. Furthermore, unlike RWRMDA, RLSMDA is a global approach which can reconstruct the missing associations for all the diseases simultaneously. Cross validation, case studies of several important diseases, global prediction for all the diseases simultaneously and independent prediction for diseases without any known related miRNAs have demonstrated the superior performance of RLSMDA over previous methods.

Results

Leave-one-out cross validation

Here, we implemented leave-one-out cross validation (LOOCV) on the known experimentally verified miRNA-disease associations to evaluate the predictive performance of RLSMDA. To our knowledge, RWRMDA47, HDMP52 and the global network algorithm developed by Shi et al.51 are the state-of-the-art approaches for the computational prediction of disease-related miRNAs. However, the algorithm of Shi et al.51 focuses on the functional connectivity between miRNA targets and disease genes in the protein-protein interaction (PPI) network. It therefore integrates disease-gene associations, miRNA-target interactions and protein interactions, which are entirely different from the datasets used in RLSMDA. Furthermore, it does not use known disease-miRNA associations, so the cross validation implemented in this paper, which splits known samples into test and training samples, cannot be applied to it. Therefore, the performance of this method and RLSMDA cannot be compared in a fair and reasonable way. Based on the above considerations, we compare the performance of RLSMDA with RWRMDA and HDMP.

For simplicity, we chose ηM = 1 and ηD = 1 for the trade-off parameters in the cost functions according to the previous literature54, and weight parameter w = 0.9 in the final classifier, considering that miRNA functional similarity has played a critical role in disease-related miRNA prediction, as shown by RWRMDA. Both the trade-off parameters in the cost functions and the weight parameter in the final classifier could be better selected by further cross validation.

LOOCV can be implemented in the following two ways. (1) For the ith disease, each known miRNA associated with disease i was left out in turn as the test miRNA. The entry F(i,j) in row i and column j of the matrix F reflects the probability that miRNA j is related to disease i. How well this test miRNA was ranked relative to the candidate miRNAs was evaluated based on the ith row of the matrix F (seed miRNAs: the other known disease-miRNA associations; candidate miRNAs: all the miRNAs without evidence of an association with disease i). If the test miRNA was ranked above the given threshold, the model was considered to have successfully predicted this miRNA–disease association. We call this local LOOCV. (2) Unlike local LOOCV, no disease was fixed and all the diseases were considered simultaneously. Each known disease-miRNA association was left out in turn as the test association, and how well it was ranked relative to the candidate associations was evaluated based on the whole matrix F (seed associations: the other known disease-miRNA associations; candidate associations: all the disease-miRNA pairs without evidence confirming the association). If the test association was ranked above the given threshold, the model was considered to have successfully predicted this association. We call this global LOOCV. The difference between local and global LOOCV is whether all the diseases are considered simultaneously. Because RWRMDA cannot uncover the missing associations for all the diseases simultaneously, global LOOCV cannot be implemented for RWRMDA. For HDMP, global LOOCV can be implemented. As a global predictive approach, RLSMDA can be evaluated in both local and global LOOCV.
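The two protocols can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the callable `predict`, the toy matrices and the tie-handling rule are assumptions made for illustration only. `predict` stands for any model that maps a training association matrix to a score matrix F of the same shape.

```python
import numpy as np

def loocv_ranks(A, predict, mode="local"):
    """Rank each left-out known association among the unlabeled candidates.

    A       : (n_diseases, n_mirnas) 0/1 association matrix.
    predict : hypothetical callable, training matrix -> score matrix F
              (higher score = more likely related).
    mode    : "local" ranks within the disease's own row of F;
              "global" ranks against every unlabeled disease-miRNA pair.
    """
    A = np.asarray(A, dtype=float)
    candidates = (A == 0)                  # pairs without known evidence
    ranks = []
    for i, j in zip(*np.nonzero(A)):       # each known association in turn
        train = A.copy()
        train[i, j] = 0.0                  # hide the test association
        F = predict(train)
        pool = F[i, candidates[i]] if mode == "local" else F[candidates]
        # Rank 1 is best; candidates that tie with the test item
        # do not push it down (an assumption of this sketch).
        ranks.append(1 + int(np.sum(pool > F[i, j])))
    return np.array(ranks)

# Toy data (assumed): 2 diseases x 3 miRNAs and a miRNA similarity matrix.
A_demo = np.array([[1, 1, 0],
                   [0, 1, 0]])
S_demo = np.array([[1.0, 0.8, 0.1],
                   [0.8, 1.0, 0.2],
                   [0.1, 0.2, 1.0]])

def toy_predict(train):
    """Stand-in scorer: propagate known labels through miRNA similarity."""
    return train @ S_demo

local_ranks = loocv_ranks(A_demo, toy_predict, mode="local")
global_ranks = loocv_ranks(A_demo, toy_predict, mode="global")
```

In this toy example the third association (disease 2, miRNA 2) keeps rank 1 locally but drops to rank 2 globally, because once it is hidden disease 2 has no seed miRNAs left and a candidate pair of disease 1 outscores it.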

The receiver operating characteristic (ROC) curve was drawn and the area under the curve (AUC) was calculated to evaluate the performance of the predictive methods. The ROC curve plots the true positive rate (sensitivity) versus the false positive rate (1-specificity) at different thresholds. Sensitivity refers to the percentage of the test samples whose ranking is higher than a given threshold. Specificity refers to the percentage of samples that are below the threshold. AUC = 1 indicates perfect performance and AUC = 0.5 indicates random performance.
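As a rough illustration of how leave-one-out ranks can be turned into a ROC curve and an AUC, consider the sketch below. It assumes that every trial has the same number of candidates and that a rank is counted among the test item plus its candidates; the function name and these conventions are assumptions of this example, and the exact thresholding used in this paper may differ.

```python
import numpy as np

def roc_from_ranks(ranks, n_candidates):
    """ROC points and AUC from leave-one-out ranks.

    ranks        : 1-based rank of each left-out item among itself plus
                   n_candidates unlabeled items (1 = top of the list).
    n_candidates : number of unlabeled candidates per trial, assumed to be
                   the same for every trial in this sketch.
    """
    ranks = np.asarray(ranks)
    ks = np.arange(0, n_candidates + 2)          # "top-k" thresholds
    # hit[k, t] is 1 when trial t's test item is within the top k.
    hit = (ranks[None, :] <= ks[:, None]).astype(int)
    tpr = hit.mean(axis=1)                       # sensitivity
    # Of the k items above the threshold, those that are not the test
    # item are false positives.
    fpr = (ks[:, None] - hit).mean(axis=1) / n_candidates
    # Trapezoidal area under the (fpr, tpr) curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

fpr_demo, tpr_demo, auc_demo = roc_from_ranks([1, 2], n_candidates=2)
_, _, auc_perfect = roc_from_ranks([1, 1, 1], n_candidates=5)
_, _, auc_worst = roc_from_ranks([6, 6], n_candidates=5)
```

With two trials ranked 1st and 2nd among two candidates each, the curve passes through (0.25, 0.5) and (0.5, 1), giving an area of 0.75; ranking every test item first gives 1, and ranking every test item last gives 0.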

According to the literature47, the AUC of RWRMDA is 0.8617, a significant improvement over the previous computational method based on the hypergeometric distribution45. However, for diseases which have only one known miRNA, LOOCV cannot be implemented. To be fair, we assume that the left-out known association obtains a random rank in that case, i.e. for N candidate miRNAs, we take the rank of the left-out known miRNA to be (N+1)/2. The recalculated AUC for RWRMDA was 0.8473. For global LOOCV, HDMP obtained an AUC of 0.9431. For RLSMDA, the AUCs in local and global LOOCV are 0.8450 and 0.9511, respectively (see Figure 1). We conclude that the performance of RLSMDA is comparable to that of RWRMDA and slightly better than that of HDMP. However, RWRMDA and HDMP cannot predict potential miRNAs for diseases which do not have known related miRNAs, which is the major defect of these methods. Furthermore, RWRMDA is a local approach which cannot uncover the missing associations for all the diseases simultaneously, i.e. its scores for one miRNA and two different diseases cannot be compared. Although there is no significant improvement in terms of AUC, RLSMDA solves both of these problems. Furthermore, HDMP introduces additional information on miRNA families and clusters, which benefits its performance. It is likely that the performance of RLSMDA would be further improved by introducing miRNA family and cluster information into its model. This performance demonstrates that RLSMDA can recover known experimentally verified miRNA–disease associations and hence has the potential to predict novel associations.

Figure 1

Method comparison: (left) Comparison between RLSMDA and RWRMDA proposed by Chen, et al.47 in terms of ROC curve and AUC based on local leave-one-out cross validation on 1394 known experimentally verified miRNA–disease associations.

RLSMDA obtained performance comparable to RWRMDA in the local LOOCV, while RWRMDA cannot predict disease-related miRNAs for diseases without known related miRNAs or for all the diseases simultaneously. RLSMDA overcomes these two critical shortcomings of RWRMDA. (right) Comparison between RLSMDA and HDMP in terms of global LOOCV. RLSMDA and HDMP obtained AUCs of 0.9511 and 0.9431, respectively. Although only a slight improvement has been obtained here, RLSMDA can predict potential miRNAs for diseases which do not have known related miRNAs, which addresses the most critical limitation of HDMP. The performance of RLSMDA could be further improved by introducing miRNA family and cluster information, as has been done in HDMP.

Parameter effect

In the above cross validation, we placed more emphasis on the miRNA space classifier (the classifier based on the miRNA functional similarity dataset) in the final classifier, based on the fact that miRNA functional similarity has played a critical role in disease-related miRNA prediction. However, we cannot rely entirely on the results from the miRNA space, because in that case we could not predict potential miRNAs for diseases which do not have any known related miRNAs. Therefore, we chose weight parameter w = 0.9 in the final classifier. We also assigned different weights to the classifier constructed in the miRNA space and calculated the corresponding AUCs. The results are shown in Supplementary Figure 1; a higher weight improves the final performance of RLSMDA.
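Why the miRNA space alone cannot score a disease with no known miRNAs can be seen in a small numerical sketch. The code below uses a generic kernel regularized least squares (ridge) solver with each similarity matrix as the kernel. This is a swapped-in simplification, not the cost function of RLSMDA, which is defined in the Methods. The function names, the toy matrices, the use of `eta` as a plain ridge parameter and the linear combination `w * F_m + (1 - w) * F_d` are all assumptions made for illustration.

```python
import numpy as np

def rls_scores(S, Y, eta=1.0):
    """Generic kernel regularized least squares: F = S (S + eta*I)^-1 Y."""
    S = np.asarray(S, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return S @ np.linalg.solve(S + eta * np.eye(S.shape[0]), Y)

def combined_scores(A, S_m, S_d, w=0.9, eta_m=1.0, eta_d=1.0):
    """Weighted sum of a miRNA-space and a disease-space classifier.

    A   : (n_diseases, n_mirnas) 0/1 association matrix.
    S_m : (n_mirnas, n_mirnas) miRNA similarity; S_d : disease similarity.
    """
    A = np.asarray(A, dtype=float)
    # miRNA space: one label vector over miRNAs per disease (columns of A.T).
    F_m = rls_scores(S_m, A.T, eta_m).T
    # Disease space: one label vector over diseases per miRNA (columns of A).
    F_d = rls_scores(S_d, A, eta_d)
    return w * F_m + (1.0 - w) * F_d, F_m, F_d

# Toy data (assumed): disease 1 has no known related miRNAs at all.
A_demo = np.array([[1, 0, 1],
                   [0, 0, 0]])
S_d_demo = np.array([[1.0, 0.6],
                     [0.6, 1.0]])
S_m_demo = np.array([[1.0, 0.2, 0.5],
                     [0.2, 1.0, 0.1],
                     [0.5, 0.1, 1.0]])
F, F_m, F_d = combined_scores(A_demo, S_m_demo, S_d_demo)
```

In this toy case the miRNA-space scores for the second disease are all zero, because its label vector is empty, whereas the disease-space scores borrow information from the similar first disease. Any w below 1 therefore lets the combined score rank candidates for such a disease.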

Case studies

It has been demonstrated that many miRNAs are associated with various human cancers12,13,38,55,56,57 and that almost half of miRNAs are located in cancer-associated genomic regions or fragile sites12,55. Here, case studies of several important diseases were implemented to evaluate the independent predictive ability of RLSMDA. Predictive results were confirmed based on the update of HMDD and the datasets in miR2Disease and dbDEMC.

Hepatocellular cancer (hepatocellular carcinoma, malignant hepatoma, HCC) is currently the third leading cause of cancer deaths worldwide, with over 500,000 people affected (http://emedicine.medscape.com/article/197319-overview). HCC is the most common type of liver cancer, and most affected people come from Asia and Africa, where the high prevalence of hepatitis B and hepatitis C strongly contributes to the development of chronic liver disease and HCC (http://emedicine.medscape.com/article/197319-overview). In the gold-standard data, 34 miRNAs have been related to the development of HCC. For example, independent experimental observations showed that the expression of miRNAs let-7e, 125a and 99b was considerably lower in HCC than in normal liver58. MiRNAs without known relevance to HCC were prioritized based on the predictive results of RLSMDA. Among the top 50 predicted HCC-related miRNAs, 40 have been confirmed by the aforementioned databases. In particular, the top 20 potential miRNAs are all confirmed. The top 50 potential HCC-related miRNAs and the evidence for their associations with HCC are listed in Table 1. The unconfirmed potential miRNA with the highest rank is miR-34b (ranked 22nd). However, recent findings59 showed that the potentially functional SNP rs4938723 in the promoter region of pri-miR-34b/c may lead to the development of HCC in the investigated Chinese population, which establishes a connection between HCC and miR-34b. All the datasets used in this paper were generated before the publication of that study. Therefore, this successful independent literature validation gives further strong support to the reliable performance of RLSMDA. We did not further check whether the associations between the other unconfirmed potential miRNAs and HCC can be verified based on the recent experimental literature. However, the performance of RLSMDA in cross validation and in this case study leads us to believe that RLSMDA can predict more disease-related miRNAs.

Table 1 The top 50 potential Hepatocellular cancer (HCC) related miRNAs predicted by RLSMDA and the confirmation for their associations by various databases are listed here (1st column: top 1–25; 2nd column: top 26–50). Forty of top 50 miRNAs have been confirmed to be related with HCC

In our previous paper on RWRMDA47, 98% (Breast cancer), 74% (Colon cancer) and 88% (Lung cancer) of the top 50 predicted miRNAs were confirmed by published experiments. The predictive accuracy for Breast cancer and Lung cancer was thus already satisfactory. Hence, we implemented a case study of Colon cancer here to see whether RLSMDA can further improve on that performance. Colon cancer is the third most common cancer in the world, and more than half of the people who die of it come from developed countries (http://en.wikipedia.org/wiki/Colonic_cancer). Colon cancer usually strikes without symptoms, so screening tests are important. If colon cancer is found early, the doctor can use surgery, radiation and/or chemotherapy for effective treatment (http://www.webmd.com/colorectal-cancer/default.htm). There are thirty-seven known colon cancer-related miRNAs in the gold standard dataset. For example, miR-200b and miR-141 have been shown to be highly overexpressed in colon carcinoma60. Candidate miRNAs were prioritized by the scores obtained from RLSMDA. Forty-two of the top fifty predicted colon cancer-related miRNAs have been confirmed by various databases and the literature12,61,62. The top 50 potential colon cancer-related miRNAs and the evidence confirming the associations are listed in Supplementary Table 1. A typical example is miR-18b, which is ranked 24th in the predicted list. Recent experimental literature confirms its connection to colon cancer62: in that study, the expression of miR-18b was upregulated in colon cancer tissues compared with the para-cancerous control. Therefore, miR-18b is expected to participate in colon cancer and to play a critical role in colon carcinogenesis. As mentioned, the dataset used in this paper for potential miRNA prediction was generated before the publication of that study. This independent validation further supports the performance of RLSMDA.

As mentioned, RLSMDA can reconstruct the missing associations for all the diseases simultaneously. The top 20 potential disease-miRNA associations predicted by RLSMDA and their confirmation based on various databases are listed in Table 2. Fifteen of the top 20 potential disease-miRNA associations have been confirmed. The top 100 potential disease-miRNA associations are shown in Supplementary Table 2 and were verified based on various databases and the literature12,61. These 100 potential associations involve various diseases, including breast cancer, colonic cancer, brain cancer, type 2 diabetes and so on. In total, 61 of the top 100 potential associations were confirmed.

Table 2 The top 20 potential disease related miRNAs predicted by RLSMDA in the global ranking and the confirmation for their associations by various databases are listed here. Fifteen of top 20 disease-miRNA associations have been confirmed

Applicability of RLSMDA to diseases without any known related microRNAs

To demonstrate that RLSMDA is applicable to diseases without any known associated miRNAs, we implemented case studies for the diseases discussed in the above section after removing all the known verified miRNAs related to the disease in question. This operation ensured that prioritizing candidate miRNAs for the given disease only made use of the information on other diseases with known related miRNAs and the similarity information. Note that we selected the same candidate miRNA set as in the normal case study for a given disease, i.e. the removed known seed miRNAs were not regarded as candidate miRNAs.
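The masking step described above can be sketched as follows. The helper names and the toy matrices are hypothetical. The point of the sketch is that the candidate set is fixed before the disease's row is cleared, so the removed seed miRNAs never re-enter the ranking.

```python
import numpy as np

def mask_disease(A, disease_idx):
    """Remove every known association of one disease.

    Returns the masked copy of A and the indices of the miRNAs that were
    candidates (unlabeled) for this disease before masking.
    """
    A = np.asarray(A)
    candidates = np.flatnonzero(A[disease_idx] == 0)
    masked = A.copy()
    masked[disease_idx, :] = 0
    return masked, candidates

def top_k_candidates(F, disease_idx, candidates, k=50):
    """Indices of the k best-scoring original candidates for the disease."""
    order = np.argsort(-F[disease_idx, candidates], kind="stable")
    return candidates[order[:k]]

# Toy example (assumed): miRNAs 0 and 2 are the known seeds of disease 0.
A_demo = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0]])
masked_demo, cand_demo = mask_disease(A_demo, disease_idx=0)

# Stand-in score matrix; a real one would come from a model trained on
# masked_demo.
F_demo = np.array([[0.9, 0.2, 0.8, 0.5],
                   [0.1, 0.7, 0.6, 0.3]])
top_demo = top_k_candidates(F_demo, 0, cand_demo, k=2)
```

Here miRNAs 0 and 2 receive the highest scores for disease 0 but are excluded from the ranked list, which contains only the original candidates 3 and 1.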

For Hepatocellular cancer, we removed the 34 known HCC-related miRNAs and prioritized candidate miRNAs based on the predictive results of RLSMDA. Among the top 50 predictions, 36 miRNAs have been confirmed by various databases. The top 50 potential HCC-related miRNAs obtained when the information about known HCC-related miRNAs is removed, and the evidence for their associations with HCC, are listed in Supplementary Table 3. The aforementioned independently validated association between HCC and miR-34b was also ranked in the top 50. For colon cancer, after removing the 37 known seed miRNAs, RLSMDA was applied to uncover potential connections between colon cancer and candidate miRNAs. As a result, 36 of the top 50 miRNAs are confirmed by various databases and the literature12,61,62. The top 50 potential miRNAs and the evidence are listed in Supplementary Table 4. Notably, the independently validated association between miR-18b and colon cancer is ranked 1st by RLSMDA when the known colon cancer-related miRNAs are removed.

In addition to the above simulation experiments, RLSMDA was also applied to diseases without any known related miRNAs in our gold standard dataset. In this setting, when we prioritize candidate miRNAs for a given disease, only the disease-miRNA associations of other diseases and the similarity information between these diseases are used. The prediction results were verified based on the recent experimental literature. As a result, among the top 3 potential related miRNAs predicted by RLSMDA for each of the 32 diseases investigated here, 34 disease-miRNA associations were successfully confirmed by biological experiments63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95 (See Table 3).

Table 3 Confirmed disease-miRNA associations predicted by RLSMDA for diseases without known related miRNAs in our gold standard dataset

For example, hsa-mir-21 has been shown to play a critical role in various cellular processes including maturation, migration, proliferation and survival. Accumulated evidence has linked mir-21 to many complex human diseases, and its associations with many diseases, such as Breast cancer, Brain cancer, Lung cancer, Stomach cancer and so on, have been collected in the gold standard dataset. Here, we predicted mir-21 as the most likely related miRNA for Abdominal Aortic Aneurysm (AAA), Thoracic Aortic Aneurysm (TAA), Sezary Syndrome (SS) and Vascular Diseases. These predictions were all confirmed by biological experiments. Maegdefessel et al. identified mir-21 as a key modulator of proliferation and apoptosis of vascular wall smooth muscle cells during development of AAA and provided a new therapeutic pathway that could be targeted to treat AAA95. Jones et al. observed decreased expression of mir-21 in TAA compared to normal aortic samples and further identified a significant relationship between its expression level and aortic diameter65. Narducci et al. profiled the expression of miRNAs in a cohort of 22 SS patients and identified differential expression of mir-21 between SS and controls75. Cheng and Zhang pointed out that mir-21 plays important roles in biological processes such as vascular smooth muscle cell proliferation and apoptosis, cardiac cell growth and death, and cardiac fibroblast functions. Furthermore, they showed that mir-21 is involved in the pathogenesis of cardiovascular diseases76. These successful predictions demonstrate that RLSMDA has the potential to provide high-quality disease-miRNA associations for diseases without any known related miRNAs, addressing a critical deficiency of previous methods.

Predicting novel human miRNAs-disease associations

Here, after confirming the reliable performance of RLSMDA in terms of cross validation and case studies, we further applied RLSMDA to predict potential human disease-miRNA associations. All the known disease-miRNA associations in the gold-standard dataset were used as positive samples. We publicly released the list of potential human disease-miRNA associations to facilitate biological experimental validation (see Supplementary Table 5). It is anticipated that the potential disease-miRNA associations predicted here could be validated by further biological experiments and prove useful for biomedical research.

Discussions

Identifying potential disease-miRNA associations is critical for understanding the pathogenesis of disease at the miRNA level and further improving human medicine. In this paper, RLSMDA was developed to identify disease-related miRNAs on a large scale by integrating disease-disease semantic similarity information, miRNA-miRNA functional similarity information and known human miRNA-disease associations. RLSMDA is built on the framework of regularized least squares and the basic assumption that functionally related miRNAs tend to be related to phenotypically similar diseases. Compared with previous methods, RLSMDA can identify related miRNAs for diseases without any known associated miRNAs. Furthermore, RLSMDA does not need the selection of negative samples and can reconstruct the missing associations for all the diseases simultaneously. Cross validation and case studies of Hepatocellular cancer and Colon cancer have demonstrated the reliable performance of RLSMDA. Furthermore, we implemented simulated case studies for Hepatocellular cancer and Colon cancer after removing all the known verified miRNAs related to the disease in question. Many of the prediction results were confirmed by various databases and the literature. More importantly, when we applied RLSMDA to diseases without any known related miRNAs in our gold standard dataset, 34 disease-miRNA associations, ranked among the top 3 potential related miRNAs predicted by RLSMDA for the 32 diseases investigated here, were successfully confirmed by biological experiments.

These examples demonstrate that RLSMDA is applicable to diseases without any known associated miRNAs. Because RLSMDA can reconstruct the missing associations for all the diseases simultaneously, we also applied it to global prediction across all the diseases. As a result, 15 of the top 20 potential disease-miRNA associations have been confirmed. Also, 61 of the top 100 potential disease-miRNA associations were confirmed, involving various diseases including breast cancer, colonic cancer, brain cancer, type 2 diabetes and so on. We publicly released potential miRNA lists for the 137 diseases investigated in this paper to guide biological experiments. It is anticipated that RLSMDA will be a useful resource for research on the associations between miRNAs and human diseases.

The reliable performance of RLSMDA can largely be attributed to the following factors. Firstly, heterogeneous datasets (known disease-miRNA associations, miRNA functional similarity and disease semantic similarity) were integrated to capture the potential associations between diseases and miRNAs. In particular, by introducing disease similarity information, RLSMDA can predict potential related miRNAs for diseases without any known associated miRNAs. Secondly, RLSMDA is a semi-supervised method, which overcomes the difficulty of obtaining negative disease-miRNA association samples in practice. Finally, RLSMDA is a global approach, which can predict the scores between miRNAs and diseases for all the diseases simultaneously. These three critical success factors also constitute the novelties of RLSMDA. Hence, RLSMDA represents a novel, useful and important biomedical resource for miRNA-disease association identification.

Although there are several important novelties in the development of RLSMDA, some limitations also exist. Firstly, how to choose the parameter values in RLSMDA is still not well solved. In particular, we need to integrate the predictive results from the disease space and the miRNA space using weight parameters. How to directly obtain a single classifier, or how to reasonably integrate results from different spaces, is a critical problem for future research. Secondly, more reliable construction of disease similarity and miRNA similarity would further improve the predictive ability. We plan to integrate more biologically relevant information to define miRNA similarity and disease similarity. Thirdly, more experimentally verified human disease-miRNA associations would promote the development and the performance of computational methods for identifying human disease-miRNA associations.

Methods

Human miRNA-disease associations

The human miRNA-disease association dataset used as the gold standard in this paper was downloaded from the supplementary material of a previous study96 (obtained from HMDD in September 2009). Because we wanted to confirm our prediction lists against later updates of HMDD and against other databases, we did not use the newest association data in HMDD or the data in the other databases. The gold standard includes 1616 distinct, high-quality, experimentally verified human miRNA-disease associations. After merging different miRNA copies that produce the same mature miRNA and unifying the names of mature miRNAs and diseases, 1395 miRNA-disease associations, involving 271 miRNAs and 137 diseases, were used in this paper (see Supplementary table 6). We denote the number of diseases by nd and the number of miRNAs by nm. Matrix A denotes the adjacency matrix of disease-miRNA associations, where the entry A(i,j) in row i and column j is 1 if miRNA j is related to disease i, and 0 otherwise.
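As an illustration of how the adjacency matrix A can be assembled, the following minimal Python sketch builds it from a list of (disease, miRNA) pairs. The function name `build_adjacency` and the toy pairs are hypothetical and are not taken from the dataset; name unification and merging of miRNA copies are assumed to have been done beforehand.

```python
import numpy as np

def build_adjacency(pairs):
    """Return (A, diseases, mirnas) with A[i, j] = 1 if miRNA j is linked to disease i."""
    diseases = sorted({d for d, _ in pairs})
    mirnas = sorted({m for _, m in pairs})
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    # Rows are diseases (nd), columns are miRNAs (nm).
    A = np.zeros((len(diseases), len(mirnas)), dtype=int)
    for d, m in pairs:
        A[d_index[d], m_index[m]] = 1  # repeated records collapse onto one cell
    return A, diseases, mirnas

# Toy pairs invented for illustration only.
toy_pairs = [
    ("disease_a", "mir-1"),
    ("disease_a", "mir-2"),
    ("disease_b", "mir-2"),
    ("disease_a", "mir-1"),  # duplicate record
]
A, diseases, mirnas = build_adjacency(toy_pairs)
```

Here the duplicate record does not change A, which mirrors the idea of counting only distinct associations.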

MiRNA functional similarity

In a previous study96, the functional similarity score for each miRNA pair was calculated based on the assumption that miRNAs with similar functions tend to be related to similar diseases. We downloaded the miRNA functional similarity scores from http://cmbi.bjmu.edu.cn/misim/ in January 2010 (see Supplementary table 7). Matrix SM denotes the miRNA functional similarity matrix, where the entry SM(i,j) in row i and column j is the functional similarity between miRNA i and miRNA j. This miRNA functional similarity has been used to predict disease-related miRNAs and environmental factor-miRNA interactions, with excellent performance47,97.

Disease semantic similarity

Here, we calculated disease similarity in the same way as a previous study96. The basic idea of the disease semantic similarity calculation is illustrated in Figure 2. The relationships between diseases can be obtained from the MeSH database (http://www.ncbi.nlm.nih.gov/), which provides a strict system for disease classification. A disease can be described as a directed acyclic graph (DAG), in which the nodes represent the disease itself and its ancestor diseases, and a link from a parent node to a child node represents the relationship between these two nodes. For example, disease A can be described as a graph DAG(A) = (A,T(A),E(A)), where T(A) is the node set including node A itself and all ancestor nodes of A, and E(A) is the corresponding link set. The contribution of disease t in DAG(A) to the semantics of disease A is defined as follows:

$$D_A(t) = \begin{cases} 1, & t = A \\ \max\{\Delta \cdot D_A(t') \mid t' \in \text{children of } t\}, & t \neq A \end{cases}$$

where Δ is the semantic contribution factor. The contribution of disease A to its own semantic value is one, while the contributions of the ancestor diseases to the semantic value of disease A decrease with their distance from disease A. Therefore, we can define the semantic value of disease A based on the contributions of its ancestor diseases and of disease A itself, i.e.

$$DV(A) = \sum_{t \in T(A)} D_A(t)$$

Based on the assumption that disease pairs sharing a larger part of their DAGs are more similar, we defined the semantic similarity between two diseases A and B as follows:

$$S(A,B) = \frac{\sum_{t \in T(A) \cap T(B)} \left( D_A(t) + D_B(t) \right)}{DV(A) + DV(B)}$$

Matrix SD denotes the disease semantic similarity matrix, where the entry SD(i,j) in row i and column j is the disease semantic similarity between disease i and disease j (see Supplementary table 8).
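The semantic similarity calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each disease contributes 1 to itself, each ancestor receives the largest value of Δ times the contribution of one of its children, and the similarity is the shared contribution divided by the sum of the two semantic values. The value Δ = 0.5, the function names and the toy hierarchy are assumptions for illustration; they are not taken from MeSH.

```python
def contributions(disease, parents, delta=0.5):
    """Map each node t in T(disease) to its semantic contribution to `disease`."""
    contrib = {disease: 1.0}  # a disease contributes 1 to itself
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            candidate = delta * contrib[node]
            # Keep the largest contribution over all child paths.
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                stack.append(parent)
    return contrib

def semantic_similarity(a, b, parents, delta=0.5):
    """Shared contribution of common ancestors over the two semantic values."""
    da = contributions(a, parents, delta)
    db = contributions(b, parents, delta)
    shared = da.keys() & db.keys()
    numerator = sum(da[t] + db[t] for t in shared)
    return numerator / (sum(da.values()) + sum(db.values()))

# Toy hierarchy (child -> list of parents), invented for illustration.
parents = {"A": ["P"], "B": ["P"], "P": ["R"]}
sim_ab = semantic_similarity("A", "B", parents)
print(round(sim_ab, 4))  # 0.4286
```

In this toy case, A and B each have a semantic value of 1.75 and share the ancestors P and R, so the similarity is 1.5 / 3.5 = 3/7.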

Figure 2. The basic idea of disease semantic similarity calculation.

Regularized Least Squares for MiRNA-Disease Association (RLSMDA)

Based on the underlying assumption that miRNAs associated with more similar diseases are more similar and vice versa, we developed the method of Regularized Least Squares for MiRNA-Disease Association (RLSMDA) to uncover potential miRNAs associated with various diseases (see Figure 3). RLSMDA is designed to construct a continuous classification function that reflects the probability that each miRNA is related to a given disease. The function should meet the following two criteria: (1) it complies with the known disease-related miRNA information; (2) it is smooth over the miRNA space and the disease space, i.e. for a given disease (miRNA), similar miRNAs (diseases) obtain similar scores, which matches the basic assumption of our method. Considering the difficulty of obtaining negative samples, a semi-supervised classifier is constructed under the framework of Regularized Least Squares (RLS) by defining a cost function and minimizing it. Cost functions can be developed in the miRNA space and the disease space, respectively. Taking the miRNA space as an example, the optimal classification function can be obtained by solving the following optimization problem:

where ‖·‖_F is the Frobenius norm and ηM is the trade-off parameter. The solution of this optimization problem is:

where IM is the identity matrix with the same size as matrix SM.
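For readers who want the miRNA-space step written out, one standard way to express it is the kernel RLS formulation below, with SM playing the role of the kernel. This is a sketch under that assumption and not necessarily the exact cost function used here: the coefficient matrix C and the symbols F_M and F_M^* are names introduced for illustration. The dimensions follow from the definitions above (A is nd × nm, so the miRNA-space classifier is nm × nd).

```latex
\begin{align}
  \min_{C}\;& \lVert A^{T} - S_M C \rVert_F^{2}
              + \eta_M \operatorname{tr}\!\left( C^{T} S_M C \right),
              \qquad F_M = S_M C \\
  F_M^{*} &= S_M \left( S_M + \eta_M I_M \right)^{-1} A^{T}
\end{align}
```

Setting the gradient with respect to C to zero gives S_M [(S_M + η_M I_M) C − A^T] = 0, which is satisfied by C = (S_M + η_M I_M)^{-1} A^T.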

Figure 3. The flowchart of RLSMDA, which includes three steps: solving the optimization problem; obtaining the optimal classifier in the disease space and the miRNA space, respectively; and combining the classifiers from the two spaces to obtain the final predictive result.

In a similar way, we can obtain the optimal classification function in the disease space as follows:

where ID is the identity matrix with the same size as matrix SD.
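By symmetry, and assuming the standard kernel RLS closed form with SD as the kernel and A as the label matrix, the disease-space classifier can be sketched as below. The symbols F_D^* and η_D (a trade-off parameter for the disease space) are names introduced here by analogy with the miRNA space.

```latex
\begin{equation}
  F_D^{*} = S_D \left( S_D + \eta_D I_D \right)^{-1} A
\end{equation}
```

Since S_D is nd × nd and A is nd × nm, F_D^* has the same shape as A, with rows indexed by diseases and columns by miRNAs.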

Finally, the optimal classifiers in the two spaces are combined by a simple weighted average to give the final solution, i.e.

where the entry F(i,j) in row i and column j reflects the probability that miRNA j is related to disease i.
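The three steps can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard kernel RLS closed form S(S + ηI)^{-1}Y in each space and a weighted average w·F_M^T + (1 − w)·F_D for the combination. The function names, the parameter names `eta_m`, `eta_d` and `w`, their default values, and the toy matrices are all hypothetical placeholders rather than the settings used in this paper.

```python
import numpy as np

def rls_scores(similarity, labels, eta):
    """Closed-form kernel RLS scores: S (S + eta * I)^{-1} Y."""
    n = similarity.shape[0]
    # Solve the linear system instead of forming the inverse explicitly.
    return similarity @ np.linalg.solve(similarity + eta * np.eye(n), labels)

def rlsmda(A, SM, SD, eta_m=1.0, eta_d=1.0, w=0.5):
    """Combine miRNA-space and disease-space scores into an nd x nm matrix."""
    FM = rls_scores(SM, A.T, eta_m)   # miRNA space, shape nm x nd
    FD = rls_scores(SD, A, eta_d)     # disease space, shape nd x nm
    return w * FM.T + (1.0 - w) * FD  # weighted average, shape nd x nm

# Toy data: 2 diseases x 3 miRNAs; the second disease has no known miRNA.
A = np.array([[1, 0, 1],
              [0, 0, 0]])
SM = np.array([[1.0, 0.6, 0.2],
               [0.6, 1.0, 0.3],
               [0.2, 0.3, 1.0]])
SD = np.array([[1.0, 0.5],
               [0.5, 1.0]])
F = rlsmda(A, SM, SD)
```

In this toy example the second disease has no known associations, so its miRNA-space scores are zero, yet it still receives positive scores through the disease-space term because it is similar to the first disease. This illustrates why disease similarity allows predictions for diseases without any known associated miRNAs.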