Global network random walk for predicting potential human lncRNA-disease associations

There is more and more evidence that the mutation and dysregulation of long non-coding RNA (lncRNA) are associated with numerous diseases, including cancers. However, experimental methods to identify associations between lncRNAs and diseases are expensive and time-consuming. Effective computational approaches to identify disease-related lncRNAs are in high demand; and would benefit the detection of lncRNA biomarkers for disease diagnosis, treatment, and prevention. In light of some limitations of existing computational methods, we developed a global network random walk model for predicting lncRNA-disease associations (GrwLDA) to reveal the potential associations between lncRNAs and diseases. GrwLDA is a universal network-based method and does not require negative samples. This method can be applied to a disease with no known associated lncRNA (isolated disease) and to lncRNA with no known associated disease (novel lncRNA). The leave-one-out cross validation (LOOCV) method was implemented to evaluate the predicted performance of GrwLDA. As a result, GrwLDA obtained reliable AUCs of 0.9449, 0.8562, and 0.8374 for overall, novel lncRNA and isolated disease prediction, respectively, significantly outperforming previous methods. Case studies of colon, gastric, and kidney cancers were also implemented, and the top 5 disease-lncRNA associations were reported for each disease. Interestingly, 13 (out of the 15) associations were confirmed by literature mining.

Several powerful computational methods for predicting new human lncRNA-disease associations have been developed in recent years. According to their implementation strategy, these methods can be further classified as machine-learning-based methods and network-based methods.
The former class of computational methods predicts lncRNA-disease associations based on training datasets (i.e., known lncRNA-disease associations) and testing datasets (i.e., unknown lncRNA-disease associations) 8 . For example, Chen et al. 19 developed LRLSLDA (laplacian regularized least squares for lncRNA-disease association) model to predict potential disease-related lncRNAs. LRLSLDA reveals the potential lncRNA-disease associations by integrating known lncRNA-disease associations with lncRNA expression profiles. LRLSLDA is a semi-supervised classification algorithm that does not require negative training samples. A major issue of LRLSLDA is how to select optimal parameters to obtain the best predicted performance. Subsequently, Chen et al. 20 proposed a novel lncRNA similarity calculation method, namely, LNCSIM, and then used it to evaluate the predicted performance. LNCSIM showed a significant improvement for lncRNA-disease association prediction in a LOOCV process. By integrating genome, regulome, and transcriptome data, Zhao et al. 21 proposed a naive Bayesian classifier to identify cancer-related lncRNAs. The results of ten-fold cross validation showed good performance, and 707 potential cancer-related lncRNAs were identified by this method. However, it is difficult to infer negative samples from this kind of method, which has become a key bottleneck to further research.
The latter class of computational methods predicts lncRNA-disease associations based on lncRNA similarities and disease similarities. The lncRNA and disease similarity network are connected by the known lncRNA-disease associations to form a heterogeneous network, based on the network to uncover the potential lncRNA-disease associations. A common assumption of this kind of method is that functionally similar lncRNAs tend to be associated with phenotypically similar diseases, and vice versa. Several researchers implemented random walks on heterogeneous networks to uncover the potential associations between lncRNAs and diseases [22][23][24] . For instance, Zhou et al. 24 integrated a miRNA-associated lncRNA-lncRNA crosstalk network, disease-disease similarity network, and known lncRNA-disease association network into a heterogeneous network. Then a random walk was implemented with a restart on this heterogeneous network to prioritize candidate lncRNA-disease associations (RWRHLD). They used LOOCV to evaluate the predicted performance and obtained a reliable area under the curve (AUC) value of 0.871. RWRHLD implements a random walk from disease-related seed lncRNAs to other nodes; thus, this approach cannot be applied to predict isolated disease-related lncRNAs. By integrating a wide variety of biological data (disease semantic, lncRNA expression profiles, and known lncRNA-disease associations) into a heterogeneous network, Chen 25 proposed the model of KATZ measure for lncRNA-disease association prediction (KATZLDA) to predict disease-related lncRNAs. Cross validation and case studies showed that KATZLDA offers good predicted performance. In particular, KATZLDA can predict isolated disease-related lncRNAs. However, the method relies excessively on a network topology structure; and may cause bias to diseases with more known related lncRNAs and lncRNAs with more known associated diseases.
In summary, existing computation methods for predicting lncRNA-disease associations have several limitations: (1) some approaches are unable to predict isolated disease-related lncRNAs; (2) some machine-learning-based methods require negative samples that are difficult to obtain; and (3) other approaches may be biased towards well-known lncRNAs and diseases. To overcome these limitations, we proposed a global network random walk for potential human lncRNA-disease association prediction (GrwLDA) to reveal the potential associations between lncRNAs and diseases. GrwLDA integrates disease semantic similarities, lncRNA functional similarities, and known lncRNA-disease associations to discover the potential associations.
The main contributions of the paper are summarized as follows.
(2) GrwLDA is a universal network-based method and does not require negative samples.

Results
Performance evaluation. LOOCV was implemented on the benchmark dataset to evaluate the predicted performance of GrwLDA and two state-of-the-art computational models: LRLSLDA 19 and KATZLDA 25 . One lncRNA-disease association was excluded (set to 0), and the predictor score was recovered by remaining associations. All predictor scores were sorted, and a special ranking position was selected as a threshold. True positives (TP) were the number of the known associations above the threshold, whereas false positives (FP) were the number of the unknown associations above the threshold. True negatives (TN) were the number of the unknown associations below the threshold, whereas false negatives (FN) were the number of the known associations below the threshold. The receiver operating characteristic (ROC) curve plotted the test sensitivity or true-positive rate at different thresholds. Specifically, the area under the ROC curve (AUC) and the area under the PR curve (AUPR) were adopted to evaluate the performances.
The three approaches can reconstruct missing associations for all the diseases simultaneously; and can predict potential lncRNA-disease associations for novel lncRNA and isolated disease. To comprehensively compare the predicted performance of the above three methods, we implement LOOCV on the benchmark dataset while considering the following aspects: (1) the overall performance evaluation; (2) the predicted performance of novel lncRNA-associated diseases prediction (when calculating the predictor score between lncRNA i and disease j; all associations between lncRNA i and all diseases are excluded and its score is recovered by remaining associations); and (3) the predicted performance of isolated disease-related lncRNAs prediction (all associations among all lncRNAs and disease j are excluded, and the predictor score between lncRNA i and disease j is recovered by the remaining associations). Four parameters, namely, γ, the restart probability of RWR, the two balance parameters α and β, and the integrate parameter η were employed in our model. We obtained the optimal parameters by experiments. We implemented the LOOCV of GrwLDA method by setting the four parameters from 0.1 to 0.9; the optimal parameters are γ = 0.9, α = 0.1, β = 0.1, and η = 0.7. The optimal parameters were selected for LRLSLDA and KATZLDA as described in the literature. The ROC curves and PR curves of the previously mentioned features were plotted and are shown in Fig. 1 and Fig. 2, respectively, and the AUC values and AUPR values are shown in their legends.  As seen from the figures, for the overall performance evaluation, GrwLDA has an AUC of 0.9449 and AUPR of 0.5869, which are better than those of LRLSLDA and KATZLDA; and for the predicted performance of isolated disease-related lncRNAs prediction, GrwLDA has an AUC of 0.8374 and AUPR of 0.2184, also ahead of those of LRLSLDA and KATZLDA. Although the LRLSLDA method obtained the best AUPR value of novel lncRNA-associated diseases prediction, its AUC value was significantly lower than that of GrwLDA.
The comparison among the above three methods based on 5-fold cross validation was implemented to further demonstrate the predictive ability of GrwLDA. The benchmark dataset was randomly divided into five parts, one for testing and the rest as a training set. In other words, all associations in the testing set were removed, and their predictor scores were regenerated by other associations. After all the predictor scores were obtained, the ROC curve was drawn, and the AUC value was calculated. The 5-fold cross validation was performed 10 times, and the average AUC value was adopted to evaluate the performances. As a result, GrwLDA had an average AUC of 0.9201, and those of LRLSLDA and KATZLDA were 0.8585 and 0.8145, respectively.
In conclusion, GrwLDA demonstrated significant performance improvements over previous computational models in the evaluation framework of LOOCV and 5-fold cross validation.
Case study. Evidence from a wide range of sources suggests that lncRNAs play critical roles in the development of various cancers. To further evaluate the performance of GrwLDA in predicting potential disease-related lncRNAs, colon cancer, kidney cancer and gastric cancer were chosen as case studies. All known associations were used as the training set, and the unknown associations were assigned as the testing set. Then, the unknown lncRNA-disease associations of each disease were ranked according to the predicted results of GrwLDA, and the top five were selected for further validation. The predicted results were verified based on newly updated disease-lncRNA associations in the LncRNADisease database and in a few recently published studies. The predicted results and verified evidence are listed in Table 1.
Colon cancer is one of the most common malignant tumors worldwide, killing almost 700,000 people every year. This cancer is a disease of modernity, with the highest rates of incidence being recorded in developed countries. Biological experiments have demonstrated several important associations between colon cancer and the dysregulation of lncRNAs. The potential colon cancer-related lncRNAs were predicted by GrwLDA. As a result, the associations between colon cancer and HOTAIR, CRNDE, and MALAT1 (top 3 predictions) were verified by the updates in the LncRNADisease database. Furthermore, Tseng et al. 26 showed that ablation of PVT1 (ranked fourth) from the MYC-driven colon cancer line HCT116 diminishes tumorigenic potency. Although there is no direct evidence validating that KCNQ1OT1 is associated with colon cancer, it has been considered as an effective biomarker for disease diagnosis 27 due to the high frequency of the loss of KCNQ1OT1imprinting in colon cancer.
Kidney cancer, also known as renal cancer, is a disease that starts in the kidneys. Kidney cancer occurs when healthy cells in one or both kidneys grow out of control and form a lump (called a tumor). Kidney cancer is the 12 th most common cancer worldwide, with more incidences in men than women; in addition, it is more prevalent in developed countries, with the highest rates being observed in North America and Europe, while the lowest are found in Africa and Asia 28 . GrwLDA was implemented to identify kidney cancer-related lncRNAs. The predicted kidney-related lncRNAs, H19 and PVT1 (ranked first and third in the predicted results, respectively) have already been validated according to the LncRNADisease database. Furthermore, Dallosso et al. 29 and Xin et al. 30 respectively inferred that WT1-AS (ranked 4 th ) and KCNQ1DN (ranked 5 th ) are associated with Wilms' tumor, a cancer of the kidneys that usually affects newborns and the very young.
Gastric cancer is the third most common cause of cancer-related deaths in the world. This type of cancer remains difficult to cure in Western countries, primarily because most patients present with an advanced disease stage. Therefore, the identification of novel molecules associated with gastric cancer is beneficial to the diagnosis and treatment of gastric cancer. We also implemented GrwLDA to identify potential gastric cancer-related rank disease lncRNA evidence  31 reported that MALAT1 (ranked 5 th ) promotes cell proliferation in gastric cancer by recruiting SF2/ASF. To further compare the predicted performance of GrwLDA, LRLSLDA, and KATZLDA, we curated a list of newly collected lncRNAs associated with colon, gastric and kidney cancers by the updates of the LncRNADisease database, which were considered as the ground truth; we then ranked three methods based on the list (Table 2). Finally, we calculated the average ranking of the three diseases together. With an average ranking of 19 distinct, experimentally confirmed lncRNA-disease associations for these three important diseases, GrwLDA outperformed LRLSLDA and KATZLDA.
Application of GrwLDA to predict isolated disease-related lncRNAs and novel lncRNA-associated diseases. Research into lncRNA is still in its infancy, and numerous diseases associated with lncRNAs have yet to be confirmed. Therefore, the prediction and identification of isolated disease-related lncRNAs has become an important task in lncRNA research. GrwLDA was implemented to predict isolated disease-related lncRNAs. We removed the known and verified lncRNA-disease associations related to predictive diseases. This operation ensures that we only use similarity information and known lncRNA-disease associations of the other diseases to predict disease-related lncRNAs. Isolated disease-related lncRNAs prediction was implemented for colon, kidney and gastric cancers, and the top five predicted results of each disease were listed in Table 3. As a result, 13 of the 15 predicted results were confirmed by the updates of the LncRNADisease database and by some recent literatures.
Novel lncRNAs are a class of lncRNAs that target unavailable disease association information. To verify that our method is able to prioritize diseases for novel lncRNAs, we removed all experimentally verified associations related to lncRNA. This step ensured that only similarity information and known lncRNA-disease associations of the other lncRNAs were used to predict potential associations. GrwLDA was also implemented to predict novel lncRNA-associated diseases. Novel lncRNA-associated disease prediction was implemented for H19, HOTAIR, and MALAT1; the top five of each lncRNA predicted results are listed in Table 4. As a result, 14 of the 15 predicted results are confirmed by the updates of the LncRNADisease database and a few recent journal articles.
In conclusion, GrwLDA exhibits good performance in inferring isolated disease-related lncRNAs and novel lncRNA-associated diseases.

Discussion
Accumulating evidence has indicated that lncRNAs play important roles in the development of diseases. Identification of disease-related lncRNAs will be beneficial to gain a deeper understanding of disease mechanisms at the molecular level. As valuable complements to experimental studies, computational models used to identify associations between lncRNAs and diseases are in high demand.
In this article, by integrating known lncRNA-disease associations, disease semantic similarities, and lncRNA functional similarities, a method called GrwLDA was developed to predict potential lncRNA-disease associations on a large scale. GrwLDA is a universal network-based method can be applied to predict isolated diseases and novel lncRNAs without any known associations. GrwLDA achieved LOOCV AUCs of 0.9449, 0.8562, and 0.8374  Table 2. Performance comparisons of GrwLDA, LRLSLDA and KATZLDA methods based on the newly collected lncRNAs associated with colon, gastric and kidney cancer by the updates of LncRNADisease database and their ranking of the three methods.
for overall, novel lncRNA and isolated disease prediction, respectively, which were considerably higher values than those obtained by existing computational models. Furthermore, by applying the GrwLDA method to colon, gastric, and kidney cancer as case studies, 13 potential associations in the top five predictions for these important diseases were confirmed by recent studies. Despite the favorable results obtained using GrwLDA, this study presents certain limitations. First, given that available experimentally validated lncRNA-disease associations are still relatively rare and the lncRNA similarities are calculated based on them, GrwLDA will probably produce biased predictions. This problem is common to predicting lncRNA-disease associations. With the development of lncRNA-related research, more comprehensive data will be obtained and will improve the prediction performance of the GrwLDA method. Second, more reliable information sources, such as lncRNA-miRNA interactions and lncRNA expression profiles can be integrated to measure the lncRNA functional similarities. Third, parameter selection of GrwLDA is difficult, and we selected the optimal parameters by experience. Therefore, the parameter optimization method for GrwLDA should be studied in the future. Finally, GrwLDA implements random walk with restart from lncRNA seed nodes and disease seed nodes, which will result in two stable transition probabilities. Researching how to obtain the final score with a single measurement or a more reliable integration method should be prioritized in future studies.

Materials and Methods
LncRNA-disease associations. The known human lncRNA-disease association dataset is downloaded from LncRNADisease database (http://www.cuilab.cn/lncrnadisease) in October 2012. The data preprocessing process is as follows. (1) The diseases of the dataset are mapped to MeSH description by the MeSH database (https://www.ncbi.nlm.nih.gov/mesh). (2) The repeated associations and several diseases without any MeSH descriptors or tree numbers are removed. (3) All data are screened through homo sapiens. (4) The classification of the gene sequence is determined by querying the Nucleotide database (https://www.ncbi.nlm.nih.gov/nuccore); if the gene class is not lncRNA, such as 7SK (class is snRNA) and 7SL (class is scRNA), the gene will be removed. After pretreatment, 210 distinct high-quality experimental verified lncRNA-disease associations are obtained, including 78 lncRNAs and 113 diseases. We use this dataset as the benchmark dataset and variables nl and nd to represent the number of lncRNAs and diseases, respectively. We let matrix AS denote the adjacency matrix of lncRNA-disease associations, where the entity AS(i, j) in row i and column j is 1 if lncRNA i is associated with disease j, otherwise 0.
Disease semantic similarities. According to the disease tree numbers and disease semantic terms, each disease can be described as a directed acyclic graph (DAG). The DAG is the vertex set including the disease D and its ancestor nodes, and E(D) is the set of connecting edges including the direct edges from parent nodes to child nodes. In the same manner as described in the literature 32 , the contribution of each disease semantic term of disease D is numerically investigated as follows: where ∆ ∈ [0, 1] is the semantic contribution decay factor. In the DAG of disease D, disease D is the most specific disease, and its contribution to its own semantic score is defined 1. The disease term located far from disease D is considered to be a more general disease, and its contribution is multiplied by the semantic contribution decay factor. The semantic score of disease D is defined in Equation (2): on the basis of the shared nodes in two disease DAGs, we calculate disease semantic similarity between disease A and disease B defined as Equation (3): where DD is the disease semantic similarity matrix and DD(i, j) in row i, and column j represents the semantic similarity between diseases i and j. The disease semantic similarity between diseases i and j is measured based on both the addresses of these diseases in DAGs and their semantic relations with their ancestor diseases.
LncRNA functional similarities. The lncRNA functional similarities are calculated using LNCSIM model 20 . The LNCSIM model quantitatively calculates the functional similarities between two lnRNAs by measuring the semantic similarity between the two lncRNA-related disease groups. LNCSIM defines D(u) and D(v) as the disease groups associated with lncRNAs u and v, respectively; and calculates the similarity between D(u) and D(v) as the functional similarity of lncRNAs u and v. LNCSIM first calculates the similarity between one disease and a disease group. For example, the similarity between disease d1 (a member of D(u)) and disease group D(v) was calculated as follows: and the similarity between lncRNA u and v was defined as Equation (5): ( ) are the numbers of diseases associated with lncRNAs u and v, respectively. We use matrix LL to denote the lncRNA functional similarities, where the variable LL(i, j) in row i and column j is the functional similarity between lncRNA i and lncRNA j. Based on the common assumption that the more similar the two lncRNA associated diseases are, then the more similar their functions are, Equation (5) calculates the functional similarity of two lncRNAs based on their respective associated disease group.
Constructing probability transfer matrix. We define Equation (6)  The matrices LL, DD, AS, and AS T (transpose matrix of AS) are normalized by Equation (6), and we obtain the normalized matrices WL, WD, WA1, and WA2, respectively.
Global network random walk model for predicting potential human lncRNA-disease (GrwLDA). The flowchart of the GrwLDA method is shown in Fig. 3.
Random walk with restart (RWR) algorithms are derived from graph theory and randomly simulates a random walker's transition from its current nodes to its neighbors in the network starting at several given seed nodes. Many researchers have successfully applied the RWR algorithm in their specific application [33][34][35][36] . For instance, Sebastian et al. 33 implemented the RWR algorithm on a global protein-protein interaction (PPI) network for prioritizing candidate disease genes; and focused on the functional link between miRNA targets and disease genes in a PPI network. Shi et al. 34 used the RWR algorithm for predicting potential miRNA-disease associations. In this work, inspired by previous studies, we propose a global network random walk model for predicting potential human lncRNA-disease. The GrwLDA method is implemented in three steps, as follows: (1) RWR is restarted from lncRNA seed nodes associated with query disease. (2) RWR is restarted from disease seed nodes associated with query lncRNA. (3) The potential lncRNA-disease associations are predicted by integrating results of step (1) and step (2).
The detailed implementation procedure of the GrwLDA method to calculate the predictor score between lncRNA i and disease j is as follows.
First, based on the common assumption that lncRNAs with similar functions are normally associated with phenotypically similar diseases and vice versa, the GrwLDA method implements RWR from lncRNA seed nodes associated with disease j, and the transition probability from lncRNA i to disease j is obtained. We let pl j 0 be the initial probability vector of disease j and pl j t be a vector consisting of the transition probability from all lncRNAs to disease j at step t. Therefore, the probability vector at step t + 1 can be iteratively calculated by Equation (7): where γ (0, 1)  indicates the restart probability, and WL is the probabilistic weight network of lncRNAs. To make full use of global network similarity information, the global relevance score between disease j and all diseases is approximately calculated as follows: where d(j) is a binary column vector of length nd, with a jth element of 1 and other elements being 0. Vector LPd(j) is the Laplacian score vector of query disease j, and α ∈ (0, 1) is a balance parameter. To force connected diseases to receive similar scores and ensure the consistency with the query disease, the Laplacian score vector of query disease j is smoothed by parameter α. Unlike traditional RWR, we construct the initial probability vector Figure 3. Flowchart of GrwLDA. The GrwLDA method is implemented in three steps as follows: (1) RWR is restarted from lncRNA seed nodes associated with query disease; (2) RWR is restarted from disease seed nodes associated with query lncRNA; and (3) the potential lncRNA-disease associations are predicted by integrating the results of step (1) and step (2).
Scientific REPORTs | 7: 12442 | DOI:10.1038/s41598-017-12763-z of disease j considering the associated lncRNA seed nodes and the Laplacian score vector of query disease j simultaneously: j 0 and then normalize by Equation (6). Then, the transition probability of initial seed nodes of disease j is acquired. We then implement RWR by Equation (7), and after several steps, the steady probability ∞ pl j is obtained when the change between + pl j t 1 and pl j t is less than 10 −6 , and the transition probability from lncRNA i to disease j is obtained using Equation (10): j According to Equation (7), random walk from the disease j-related lncRNA seed nodes, for any node, will be probabilistic 1 − γ transferred to its neighbor nodes and probabilistic γ back to the seed nodes. The greater the similarity between the nodes is, the greater the transition probability will be. At the end of the iteration, s1(i, j) is the probability of lncRNA i to disease j, and the greater the value is, the greater is the likelihood of the association.
Second, based on the assumption that phenotypically similar diseases are normally associated with functional similarity lncRNAs and vice versa, the GrwLDA method implements RWR from disease seed nodes associated with lncRNA i to obtain the transition probability from disease j to lncRNA i. Similar to the first step, graph Laplacian scores can be derived to measure the global relevance between lncRNA i and all lncRNAs as follows: where l(i) is a binary column vector of length nl, with an ith element of 1 and other elements being 0. Vector LPl(i) is the Laplacian score vector of query lncRNA i, and β ∈ (0, 1) is a balance parameter. To focus on lncRNA-disease associations and the global lncRNA-lncRNA similarities simultaneously, we defined the initial transition probability from lncRNA i to all diseases as: i 0 and then we implemented RWR from disease seed nodes associated with lncRNA i as follows: , pd i t and pd i 0 are the transition probability vector from lncRNA i to all disease nodes at (t + 1), (t)th and (0)th step of iteration, separately. After several steps, the steady probability ∞ pd i and the transition probability from lncRNA i to disease j were obtained as: i For similar reasons, the greater the value of s i j 2( , ) is, the greater the likelihood that lncRNA i and disease j are associated.
Finally, the transition probability from lncRNA i to disease j obtained from the previous two steps is integrated as the final score (FS) to predict the potential lncRNA-disease associations: Parameter η ∈ [0, 1] is the integrated parameter and ∈ FS i j ( , ) [0, 1] is the final prediction score between lncRNA i to disease j.