KATZLDA: KATZ measure for the lncRNA-disease association prediction

Accumulating experimental studies have demonstrated important associations between alterations and dysregulations of lncRNAs and the development and progression of various complex human diseases. Developing effective computational models that integrate vast amounts of heterogeneous biological data to identify potential disease-lncRNA associations has become a hot topic in the fields of human complex diseases and lncRNAs, and could benefit lncRNA biomarker detection for disease diagnosis, treatment, and prevention. Considering the limitations of previous computational methods, the model of KATZ measure for LncRNA-Disease Association prediction (KATZLDA) was developed to uncover potential lncRNA-disease associations by integrating known lncRNA-disease associations, lncRNA expression profiles, lncRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity. KATZLDA can work for diseases without known related lncRNAs and lncRNAs without known associated diseases. KATZLDA obtained reliable AUCs of 0.7175, 0.7886, and 0.7719 in local leave-one-out cross validation, global leave-one-out cross validation, and 5-fold cross validation, respectively, significantly improving on previous classical methods. Furthermore, case studies of colon, gastric, and renal cancer were implemented, and 60% of the top 10 predictions have been confirmed by recent biological experiments. It is anticipated that KATZLDA could be an important resource with potential value for biomedical research.

Scientific Reports | 5:16840 | DOI: 10.1038/srep16840

Results
Model design. KATZLDA was developed to predict potential disease-related lncRNAs by measuring the importance of candidate nodes relative to given seed nodes and identifying nodes similar to seed nodes (motivated by the literature 42,43; see Fig. 1). In the context of lncRNA-disease association prediction, KATZLDA computes the similarity scores between candidate lncRNAs and investigated diseases by integrating walks of different lengths between the corresponding lncRNA and disease nodes (see the Methods section for the details of KATZLDA) in the heterogeneous network consisting of the known disease-lncRNA association network, the disease similarity network, and the lncRNA similarity network. The novelty of KATZLDA can be largely attributed to the combination of the following factors. Firstly, various types of biological datasets were integrated to implement the prediction, such as disease semantic similarity, lncRNA expression similarity, and lncRNA functional similarity (see the Methods section for the details of the datasets used in this paper). Secondly, new diseases (diseases without any known related lncRNAs) and new lncRNAs (lncRNAs without any known associated diseases) are discovered each year. However, it is not clear whether a newly discovered disease is correlated with some lncRNAs or with none. For these new diseases, KATZLDA can be used to quantify lncRNA-disease association probability and provide the lncRNA-disease pairs with higher association probability for biological experimental validation. If a new disease is indeed related to some lncRNAs, KATZLDA can predict its potential related lncRNAs. The same holds for newly discovered lncRNAs. Therefore, KATZLDA can work for both new diseases and new lncRNAs. Finally, KATZLDA is a global method, which can reconstruct potential lncRNA-disease associations for all the diseases simultaneously.
Therefore, KATZLDA represents an important and effective computational tool for biomedical research. Here, 293 distinct experimentally confirmed lncRNA-disease associations downloaded from the LncRNADisease database were used as the gold-standard dataset in the cross validation for model evaluation and as the training dataset in the potential disease-lncRNA association prediction.
Performance evaluation. Global and local LOOCV were implemented based on known experimentally verified lncRNA-disease associations in the LncRNADisease database to evaluate the performance of KATZLDA. When LOOCV was implemented, each known disease-lncRNA association was left out in turn as the test sample and the other known disease-lncRNA associations were regarded as training samples for model learning. The only difference between global and local LOOCV is the selection of candidate samples, which depends on whether all the diseases are investigated simultaneously. For the global LOOCV, all the disease-lncRNA pairs without known relevance evidence are considered as candidate samples. For the local LOOCV, attention is paid only to the disease in the test sample, so only the lncRNAs without known associations with this disease are regarded as candidate samples. How well the left-out test sample was ranked relative to the candidate samples was then evaluated. If the test sample ranks above the given threshold, the model is considered to have made a successful prediction. For different thresholds, the corresponding true positive rates (TPR, sensitivity) and false positive rates (FPR, 1-specificity) can be obtained. Here, sensitivity is the percentage of the test samples ranked higher than the given threshold and specificity is the percentage of samples ranked below this threshold. The receiver operating characteristic (ROC) curve can then be drawn by plotting TPR versus FPR at different thresholds. The area under the ROC curve (AUC) was calculated to evaluate the prediction performance of KATZLDA. AUC = 1 indicates perfect performance and AUC = 0.5 indicates random performance.
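The rank-based evaluation described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names (`rank_of_test_sample`, `roc_points`, `auc`) are hypothetical, every left-out sample is assumed to face the same number of candidate samples, and ties are assumed to count against the test sample.

```python
def rank_of_test_sample(test_score, candidate_scores):
    """1-based rank of the left-out pair among itself plus its candidates.

    Higher score means better rank; ties count against the test sample
    (an assumption made for this sketch).
    """
    return 1 + sum(1 for s in candidate_scores if s >= test_score)


def roc_points(ranks, n_candidates):
    """ROC points obtained by sweeping a rank threshold k.

    ranks: rank of each left-out test sample in its own LOOCV round.
    n_candidates: number of candidate samples per round (assumed equal).
    """
    n_tests = len(ranks)
    points = []
    for k in range(n_candidates + 2):
        # True positive rate: share of test samples ranked within the top k.
        tpr = sum(1 for r in ranks if r <= k) / n_tests
        # The top k positions hold k candidates, or k - 1 if the test
        # sample itself occupies one of those positions.
        fp = sum(k - (1 if r <= k else 0) for r in ranks)
        fpr = fp / (n_tests * n_candidates)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    toy_ranks = [1, 1, 2, 3]  # toy LOOCV ranks, not data from the paper
    print(auc(roc_points(toy_ranks, n_candidates=4)))
```

In global LOOCV the candidates would be all unlabeled disease-lncRNA pairs, while in local LOOCV they would be only the unlabeled lncRNAs of the disease in the test sample; the sketch is agnostic to that choice.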
KATZLDA was compared with the following three state-of-the-art computational models in the framework of LOOCV: LRLSLDA 34, RWRlncD 36, and NRWRH 44. LRLSLDA can reconstruct the missing associations for all the diseases simultaneously. Therefore, both global and local LOOCV could be implemented for LRLSLDA. However, global LOOCV cannot be implemented for RWRlncD and NRWRH because they only predict associated lncRNAs for a given disease. As a result, KATZLDA achieved AUCs of 0.7886 and 0.7175 for the global and local LOOCV, respectively (see Fig. 2). KATZLDA significantly outperformed all the previous classical models in both global and local LOOCV. LRLSLDA and RWRlncD cannot work for diseases without any known associated lncRNAs. Furthermore, RWRlncD and NRWRH cannot uncover the missing associations for all the diseases simultaneously. Therefore, in addition to the significant improvement in terms of LOOCV, KATZLDA effectively overcomes these important limitations of the previous models.
Furthermore, 5-fold cross validation was implemented for KATZLDA. In the known lncRNA-disease association dataset, there were on average only about 1.75 known related lncRNAs for each disease and 2.48 known associated diseases for each lncRNA. Therefore, 5-fold cross validation was implemented on the pooled set of all known lncRNA-disease associations. All the known associations were randomly divided into five folds, i.e., 80% of the known associations were used as training samples for model learning, and the remaining 20% were used as test samples for model evaluation. All the disease-lncRNA pairs without known association evidence were regarded as candidate samples.
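The random division into five folds can be sketched as below. This is only an illustration under stated assumptions: the function name `five_fold_splits` and the use of an integer seed to obtain different random divisions are hypothetical choices, not details given in the paper.

```python
import random


def five_fold_splits(associations, seed):
    """Yield (train, test) pairs for one random 5-fold division.

    associations: list of known (lncRNA, disease) pairs.
    seed: integer controlling the shuffle, so that repeated calls with
          different seeds give different random divisions.
    """
    rng = random.Random(seed)
    shuffled = list(associations)
    rng.shuffle(shuffled)
    # Deal the shuffled pairs into five folds of near-equal size.
    folds = [shuffled[i::5] for i in range(5)]
    for i in range(5):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


if __name__ == "__main__":
    toy_pairs = [(f"lnc{i}", f"dis{i % 7}") for i in range(20)]  # toy data
    for train, test in five_fold_splits(toy_pairs, seed=0):
        print(len(train), len(test))
```

Repeating the division with different seeds and averaging the resulting AUCs would reduce the influence of any single random split.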
As mentioned above, RWRlncD and NRWRH can only predict associated lncRNAs for a given disease and cannot infer the missing associations for all the diseases simultaneously. Therefore, 5-fold cross validation could not be implemented for these two computational models. Here, the comparison between KATZLDA and LRLSLDA based on 5-fold cross validation was implemented to further demonstrate the predictive ability of KATZLDA. To minimize the influence of sample division, the performance was evaluated under 100 different random divisions of the known lncRNA-disease associations. ROC curves were drawn and AUCs were calculated for all 100 experiments in the same way as for LOOCV. As a result, the mean and standard deviation of the AUCs for KATZLDA and LRLSLDA were 0.7719 ± 0.0084 and 0.7295 ± 0.0089, respectively. In conclusion, KATZLDA has demonstrated significant performance improvements over previous computational models in the evaluation frameworks of local LOOCV, global LOOCV, and 5-fold cross validation.

Figure 2. Performance comparisons between KATZLDA and three state-of-the-art disease-lncRNA association prediction models (LRLSLDA, RWRlncD, and NRWRH) in terms of ROC curve and AUC based on LOOCV. As a result, KATZLDA achieved AUCs of 0.7886 and 0.7175 for the global and local LOOCV, respectively, which significantly improved on all the previous classical models and effectively demonstrated its reliable predictive ability.
Case studies. To further evaluate the predictive performance of KATZLDA, the model was applied to three important cancers to predict potentially associated lncRNAs, using all the known disease-lncRNA associations as training samples for model learning. Prediction results were verified against recent updates of the LncRNADisease database and recently published experimental literature. Validating prediction results in this way has been frequently adopted for previous computational models of disease-related lncRNA prediction 34-38,41. Almost all the previous computational models reviewed in the Introduction section have been evaluated within this framework. Furthermore, a performance comparison between KATZLDA and LRLSLDA was implemented based on newly updated disease-lncRNA associations in the LncRNADisease database. All the updated associations for these three cancers have been checked and all the corresponding ranking results are listed in Table 1.
Colon cancer is one of the most common malignant tumors worldwide and a great threat to public health 45, with a disease-specific mortality rate of nearly 33% in the developed world 46. In China, the prevalence of colon cancer has increased dramatically in recent years due to changes in lifestyle 45. Biological experiments have discovered important associations between the development and progression of colon cancer and mutations and dysregulations of lncRNAs 35. KATZLDA was implemented to predict potential colon cancer-related lncRNAs. As a result, seven of the top ten potential related lncRNAs have been validated by the updates of the LncRNADisease database 21, the MNDR database 47, and recent experimental literature 48. For example, the associations between colon cancer and MALAT1, HOTAIR, UCA1, KCNQ1OT1, and CRNDE (ranked 2nd, 4th, 6th, 7th, and 9th in the prediction results, respectively) were validated by the LncRNADisease database or the MNDR database. Furthermore, according to The Cancer Network Galaxy (http://tcng.hgc.jp/index.html?t=gene&id=100048912), CDKN2B-AS1, ranked 1st in the prediction results, has been included in many colon cancer-related networks constructed based on the expression data of primary colorectal cancers. In addition, real-time PCR has indicated that the expression level of PVT1 (3rd in the prediction results) in colon cancer tissues was higher than in normal tissues and that PVT1 was functionally correlated with the proliferation and invasion of colon cancer cells 46,48. Therefore, it has been considered a new oncogene in colon cancer tissues and an independent risk biomarker for overall survival of colon cancer patients 46,48.

Gastric cancer is the second leading cause of cancer-related death and the fourth most common cancer worldwide 49. Therefore, it is imperative to identify novel molecules for early diagnosis, prognosis, and treatment of gastric cancer. Accumulating evidence has demonstrated that lncRNAs play critical roles in the development and progression of gastric cancer 50. KATZLDA was further implemented to identify lncRNAs potentially associated with gastric cancer. As a result, six of the top ten predicted lncRNAs have been validated by the updates of the LncRNADisease database and recent experimental literature 51. H19, CDKN2B-AS1, MEG3, PVT1, and HOTAIR, which were ranked 1st, 2nd, 3rd, 4th, and 7th in the prediction results, respectively, have been validated by the LncRNADisease database. For example, both microarray and qRT-PCR have indicated that H19 was the most upregulated lncRNA among 135 differentially expressed lncRNAs in gastric cancer tissues relative to adjacent normal gastric mucosa 49. In gastric cancer tissues, HOTAIR was also confirmed to exhibit an abnormally high expression level relative to adjacent normal tissues 52. The association between MALAT1 (5th in the prediction results) and gastric cancer has also been confirmed by experimental observations that MALAT1 was frequently upregulated in gastric cancer cell lines and could induce gastric cancer cell proliferation 51.

Among the urinary system tumors, renal cancer has the third highest incidence, with more than 250,000 new cases diagnosed each year worldwide 53. Biological experiments have further discovered associations between the development and progression of renal cancer and the mutations and dysregulations of some lncRNAs 53. KATZLDA was applied to renal cancer to predict potentially related lncRNAs. As a result, five of the top ten predicted renal cancer-related lncRNAs have been validated by the update of the LncRNADisease database and recent experimental literature. For example, H19, MEG3, PVT1, and MALAT1, ranked 1st, 3rd, 4th, and 6th in the prediction results, were validated by the LncRNADisease database. Another confirmed lncRNA is UCA1, which was ranked 8th in the prediction results. Biological experiments have shown that the expression level of UCA1 in renal cancer tissue was significantly higher than in normal tissues (http://www.cnki.com.cn/Article/CJFDTotal-ZLYD201507007.htm).
In addition, performance comparisons between KATZLDA and LRLSLDA were implemented based on the rankings of lncRNAs associated with colon, gastric, and renal cancer according to the updates of the LncRNADisease database made after the gold-standard associations in this paper were downloaded (see Table 1). After removing duplicate associations with different evidence and lncRNA-disease associations involving lncRNAs not investigated in this paper, there were 19 distinct experimentally confirmed lncRNA-disease associations for these three important diseases. The observed results further indicated that KATZLDA is more effective than LRLSLDA at inferring potential lncRNA-disease associations.

Discussions
As valuable complements to experimental studies, computational models are urgently needed to effectively identify potential disease-related lncRNAs and lncRNA signatures for disease diagnosis, therapeutic effect prediction, and treatment evaluation, considering the limitations of experimental methods and the generation of vast amounts of biological data. In this article, KATZLDA was developed to predict potential lncRNA-disease associations on a large scale by integrating known lncRNA-disease associations, lncRNA expression profiles, lncRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for diseases and lncRNAs to measure the importance of candidate lncRNAs relative to known disease-related lncRNAs. KATZLDA can be applied to new diseases and lncRNAs without any known associations. To validate the prediction performance of KATZLDA and demonstrate its advantage over previous classical models, local LOOCV, global LOOCV, and 5-fold cross validation were implemented based on known lncRNA-disease associations. Furthermore, case studies of colon cancer, gastric cancer, and renal cancer were implemented, and 18 potential associations in the top 10 predictions for these three important diseases have been confirmed by recent experimental results. In the future, it is anticipated that KATZLDA could play an important role in potential lncRNA-disease association identification and disease biomarker detection.
Some limitations exist in the current version of KATZLDA. Firstly, although KATZLDA significantly improves on previous methods, its performance is still not fully satisfactory, especially in the local LOOCV. Further data integration would improve its predictive ability. For example, disease phenotypic similarity, known disease-gene and disease-miRNA associations, and various lncRNA-related interactions could be introduced into this model. Meanwhile, it is also very important to develop more effective similarity integration methods. Secondly, since Gaussian interaction profile kernel similarity and lncRNA functional similarity were calculated based on known lncRNA-disease associations, miRNA-disease associations, and lncRNA-miRNA interactions, KATZLDA may be biased toward diseases with more known related lncRNAs and toward lncRNAs with more known associated diseases and/or more known miRNA interaction partners. Data integration would also help to reduce this prediction bias. Thirdly, how to reasonably select nonnegative coefficients to differentiate the contributions of walks with different lengths is still not well solved. Finally, the new era of personalized medicine has dawned, so it is very important to design different models and different lncRNA biomarkers for different patients 54-56.

Methods
LncRNA-disease associations. Known lncRNA-disease associations were downloaded from the LncRNADisease database in October 2012 21. After removing duplicate associations with different evidence, there were 293 distinct experimentally confirmed lncRNA-disease associations involving 118 lncRNAs and 167 diseases (see Supplementary Table 1). In order to use new associations added to this database after October 2012 for the validation of potential lncRNA-disease associations predicted by KATZLDA, the latest version of the lncRNA-disease association dataset in the LncRNADisease database was not used as the gold-standard dataset in this paper. Variables nl and nd denote the numbers of lncRNAs and diseases, respectively. Furthermore, matrix A is the adjacency matrix of the lncRNA-disease association network: if lncRNA l(i) is related to disease d(j), A(i,j) is 1, otherwise 0.
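As a small illustration of how the adjacency matrix A could be assembled from a list of association records, consider the sketch below. The function name `build_adjacency`, the alphabetical ordering of rows and columns, and the toy pairs are assumptions made for this example; they are not taken from the paper or from the LncRNADisease database format.

```python
def build_adjacency(pairs):
    """Build the nl x nd adjacency matrix A from (lncRNA, disease) pairs.

    Duplicate pairs (for example, the same association reported with
    different evidence) collapse to a single entry of 1.
    """
    lncrnas = sorted({lnc for lnc, _ in pairs})
    diseases = sorted({dis for _, dis in pairs})
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    A = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in pairs:
        A[row[lnc]][col[dis]] = 1
    return lncrnas, diseases, A


if __name__ == "__main__":
    # Toy records for illustration only, including one duplicate.
    toy = [
        ("H19", "gastric cancer"),
        ("H19", "renal cancer"),
        ("MALAT1", "colon cancer"),
        ("H19", "gastric cancer"),
    ]
    lncrnas, diseases, A = build_adjacency(toy)
    print(len(lncrnas), len(diseases), A)
```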
Disease semantic similarity. Furthermore, disease semantic similarity was calculated according to newly developed methods of constructing large-scale lncRNA functional similarity network 35 . Disease semantic similarity has been widely applied to identify disease-related ncRNAs and its effective performance has been fully demonstrated in plenty of previous studies 35,39,57,58 .
Disease semantic similarity was calculated based on disease MeSH descriptors and their corresponding directed acyclic graphs (DAGs). Disease A can be described as DAG(A) = (D(A), E(A)), where D(A) is composed of the nodes of this disease itself and its ancestor diseases, and E(A) consists of all the direct edges from parent nodes to child nodes. In the traditional disease semantic similarity calculation model 35, the disease terms in the same layer have the same contribution to the semantic value of disease A. However, considering that two diseases in the same layer of DAG(A) may appear in different numbers of disease DAGs, it is less accurate to assign the same contribution value to them. Based on the assumption that a more specific disease should have a greater contribution to the semantic value of disease A, the contribution of disease term t in DAG(A) was defined as follows:

$$C_A(t) = -\log\left[\frac{\text{the number of DAGs including } t}{\text{the number of diseases}}\right] \tag{1}$$

Therefore, the semantic value of disease A was obtained by summing all the contributions from ancestor diseases and disease A itself as follows:

$$C(A) = \sum_{t \in D(A)} C_A(t) \tag{2}$$

Furthermore, disease semantic similarity between disease A and B could be defined as follows by paying attention to the nodes shared by their corresponding disease DAGs:

$$SS(A, B) = \frac{\sum_{t \in D(A) \cap D(B)} \left(C_A(t) + C_B(t)\right)}{C(A) + C(B)} \tag{3}$$

In this way, the disease semantic similarity matrix SS could be constructed, where the entry SS(i, j) in row i and column j is the disease semantic similarity between diseases d(i) and d(j).
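The semantic similarity calculation can be illustrated with a short sketch. It assumes that the contribution of a term is the negative logarithm of the fraction of disease DAGs that contain it, that the semantic value of a disease is the sum of the contributions over its DAG, and that similarity is the shared contribution divided by the sum of the two semantic values. The function name `semantic_similarity`, the dictionary representation of the DAG node sets, and the toy hierarchy are hypothetical.

```python
import math


def semantic_similarity(dag_nodes):
    """Return a function ss(a, b) giving the semantic similarity of two diseases.

    dag_nodes: dict mapping each disease to the set of terms in its DAG
               (the disease itself plus all of its ancestors).
    """
    n_diseases = len(dag_nodes)
    # Count in how many disease DAGs each term appears.
    counts = {}
    for nodes in dag_nodes.values():
        for term in nodes:
            counts[term] = counts.get(term, 0) + 1
    # Rarer (more specific) terms receive a larger contribution.
    contrib = {t: -math.log(c / n_diseases) for t, c in counts.items()}
    # Semantic value of each disease: sum of contributions over its DAG.
    value = {d: sum(contrib[t] for t in nodes) for d, nodes in dag_nodes.items()}

    def ss(a, b):
        shared = dag_nodes[a] & dag_nodes[b]
        denom = value[a] + value[b]
        if denom == 0:
            return 0.0  # both DAGs contain only terms shared by every disease
        # The contribution depends only on the term here, so the two
        # per-disease contributions of a shared term are equal.
        return 2 * sum(contrib[t] for t in shared) / denom

    return ss


if __name__ == "__main__":
    # Toy hierarchy for illustration only (not real MeSH data).
    toy = {
        "disease": {"disease"},
        "cancer": {"cancer", "disease"},
        "colon cancer": {"colon cancer", "cancer", "disease"},
        "gastric cancer": {"gastric cancer", "cancer", "disease"},
    }
    ss = semantic_similarity(toy)
    print(ss("colon cancer", "gastric cancer"))
```

A term that appears in every DAG, such as the root of the toy hierarchy, contributes nothing under this definition, so only the more specific shared ancestors raise the similarity.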
LncRNA expression similarity. Considering the fact that comprehensive lncRNA expression data has been unavailable till now and long intergenic non-coding RNA (lincRNA) occupies a large part of the whole lncRNA set, lincRNA expression profiles were downloaded from UCSC Genome Bioinformatics (http://genome.ucsc.edu/) in October, 2012, which included 21626 lincRNAs' expression profiles across 22 human tissues or cell types (Supplementary Table 2). Then, lincRNA expression similarity was defined by calculating the Spearman correlation coefficient between the expression profiles of each lincRNA pair. Matrix ES represents the lncRNA expression similarity matrix, where ES(i, j) is the expression similarity between lncRNA l(i) and l(j) if they are both lincRNA, otherwise 0.
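A minimal sketch of the expression similarity matrix is given below. It computes the Spearman correlation as the Pearson correlation of average ranks and sets the similarity to 0 whenever either lncRNA has no expression profile. The function names, the dictionary of profiles, and the treatment of constant profiles (similarity 0) are assumptions for illustration, not details from the paper.

```python
def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def expression_similarity(lncrnas, profiles):
    """ES(i, j): Spearman correlation if both lncRNAs have a profile, else 0."""
    n = len(lncrnas)
    ES = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a = profiles.get(lncrnas[i])
            b = profiles.get(lncrnas[j])
            if a is not None and b is not None:
                ES[i][j] = spearman(a, b)
    return ES
```

Here `profiles` would map each lincRNA to its expression values across the tissues or cell types, and lncRNAs absent from `profiles` play the role of non-lincRNA entries.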
LncRNA functional similarity. In a previous study, based on the assumption that lncRNAs with similar functions tend to interact with similar miRNAs and similar miRNAs tend to be associated with similar diseases, the model of LFSCM was developed to calculate lncRNA functional similarity by integrating disease semantic similarity, miRNA-disease associations, and lncRNA-miRNA interactions 39.
Here, the lncRNA functional similarity results of that study were introduced into the current study. Therefore, the lncRNA functional similarity matrix FS could be obtained, where the entry FS(i,j) in row i and column j is the functional similarity between lncRNAs l(i) and l(j) according to the similarity calculation model of LFSCM.
Gaussian interaction profile kernel similarity for diseases and lncRNAs. Based on the topology information of known lncRNA-disease association network and the assumption that similar diseases tend to show a similar interaction and non-interaction pattern with the lncRNAs, Gaussian interaction ...

... on integrated lncRNA similarity, and known disease-lncRNA association network constructed based on known associations downloaded from the lncRNADisease database.
KATZLDA based only on the known disease-lncRNA association network is introduced first. The number of walks connecting lncRNA node l(i) and disease node d(j) in the known lncRNA-disease association network is calculated. It is easy to see that $A^{l}(i, j)$ is exactly the number of walks of length l that link lncRNA node l(i) and disease node d(j). In order to obtain a single similarity measure between these two nodes as the potential association probability between the corresponding lncRNA and disease, walks of different lengths are integrated. To differentiate the contributions of walks of different lengths, based on the assumption that shorter walks tend to contribute more to the similarity between two nodes, nonnegative coefficients $\beta_l$ are introduced to dampen the contributions from longer walks by ensuring that $\beta_{l_1}$ is smaller than $\beta_{l_2}$ when $l_1$ is larger than $l_2$. In this way, the potential association probability between lncRNA l(i) and disease d(j) could be calculated based on the following formula:

$$S(l(i), d(j)) = \sum_{l \geq 1} \beta_l A^{l}(i, j)$$

where the matrix S denotes the similarities between all the lncRNA-disease pairs. The above model only uses known disease-lncRNA associations. To make full use of the heterogeneous network constructed before, the integrated disease similarity matrix DS and the integrated lncRNA similarity matrix LS are further introduced into this computational model by replacing the adjacency matrix A by the following form:

$$\begin{bmatrix} LS & A \\ A^{T} & DS \end{bmatrix} \tag{12}$$

By integrating lncRNA and disease similarity, KATZLDA could be applied to new diseases and lncRNAs without any known associations.
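The walk-based scoring on the heterogeneous network can be sketched as follows. This is an illustration under explicit assumptions, not the authors' implementation: the coefficients are taken to be powers of a single damping factor (beta to the power l, the usual Katz choice), the sum is truncated at a maximum walk length `k`, and the default values `beta=0.01` and `k=3` as well as all function names are hypothetical. The block matrix places LS and A in the top row and the transpose of A and DS in the bottom row; the lncRNA-disease scores are read from the upper-right block.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def block_matrix(LS, A, DS):
    """Assemble the (nl + nd) x (nl + nd) matrix [[LS, A], [A^T, DS]]."""
    At = [list(col) for col in zip(*A)]
    top = [list(ls_row) + list(a_row) for ls_row, a_row in zip(LS, A)]
    bottom = [list(at_row) + list(ds_row) for at_row, ds_row in zip(At, DS)]
    return top + bottom


def katz_scores(LS, A, DS, beta=0.01, k=3):
    """Truncated Katz scores for all lncRNA-disease pairs.

    Sums beta**l * M**l for l = 1..k, where M is the block matrix, and
    returns the nl x nd upper-right block. beta and k are assumed values.
    """
    nl = len(A)
    M = block_matrix(LS, A, DS)
    n = len(M)
    total = [[0.0] * n for _ in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # M^0
    for l in range(1, k + 1):
        power = matmul(power, M)  # now holds M^l
        weight = beta ** l
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
    return [row[nl:] for row in total[:nl]]


if __name__ == "__main__":
    # Toy example: lncRNA 1 has no known associations but is similar to lncRNA 0.
    LS = [[1.0, 0.5], [0.5, 1.0]]
    DS = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1, 0], [0, 0]]
    print(katz_scores(LS, A, DS, beta=0.01, k=2))
```

In the toy example, the lncRNA without known associations still receives a nonzero score for disease 0 once walks of length 2 are included, because it can reach that disease through its similar neighbor. For an untruncated sum, the damping factor would need to be small enough for the series to converge.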