lncRNA-disease association prediction based on latent factor model and projection

Computer-aided research on lncRNA-disease associations is an important way to study how lncRNAs contribute to disease. Analyzing the correlations in existing data, building a prediction model, and predicting unknown lncRNA-disease associations make biological experiments more targeted and more accurate. In this paper, we propose LFMP, a lncRNA-disease association prediction model based on a latent factor model and projection. The method predicts unknown lncRNA-disease associations from lncRNA-miRNA association data and miRNA-disease association data, so it does not require known lncRNA-disease association data. Simulation results show that, under the leave-one-out cross-validation (LOOCV) framework, LFMP reaches an AUC of 0.8964, which is better than the latest results. Case studies of lung and colorectal tumors show that LFMP can effectively infer undetected lncRNA-disease associations.

Long non-coding RNAs (lncRNAs) are non-coding RNAs with a length of more than 200 nucleotides. In the past, it was thought that lncRNAs had little effect on gene expression 1. However, recent studies have shown that lncRNAs are closely related to various human diseases, which has triggered a surge of bioinformatics research on the association between lncRNAs and diseases 2. Studies have shown that lncRNAs are involved in diseases through abnormal sequence 3 and spatial structure 4, abnormal expression level 5, and abnormal interaction with binding proteins 6, thus affecting human health, including diabetes 7, cardiovascular disease 8, and various types of cancer 9. With the development of computing, big data technology has gradually matured. Applying artificial intelligence technology to the study of lncRNA-disease associations can accelerate the discovery of potential associations between lncRNAs and diseases, improve the accuracy of biological experiments, and reduce the effort of bioinformatics researchers and the cost of biological experiments. In medicine, lncRNA-disease associations can help doctors improve the early detection of diseases and the targeted treatment of some diseases 10; in biology, they can help researchers systematically understand the pathogenesis of complex diseases 11. Therefore, it is necessary to analyze the existing data with big data technology and to establish a model that predicts lncRNA-disease associations.
At present, lncRNA-disease association prediction models can be roughly divided into two categories. The first category is based on a single type of association data. For example, Chen et al. proposed a lncRNA-disease prediction method (LRLSSP) 12 based on Laplacian regularized least squares and space projection. First, by integrating the above information and Gaussian kernel similarity to make up for the lack of disease semantic similarity, an accurate lncRNA-disease similarity network was reconstructed. Then the Laplacian regularized least squares method was used to estimate lncRNA-disease associations and to address the sparsity of lncRNA-disease data. However, this model has some disadvantages, such as requiring a large amount of combined data and relying too much on known lncRNA-disease association data. In view of these problems, later scholars established models that reduce this reliance. Xie et al. proposed a novel prediction method for human lncRNA-disease associations (NCPHLDA) 13 based on network consistency projection. The model integrates the above information, including the lncRNA cosine similarity network and the disease cosine similarity network. NCPHLDA requires no parameters and has good prediction performance. However, it has some limitations: if few lncRNA-disease associations are known, the prediction results will be biased. To solve the problem of insufficient lncRNA-disease association data, Zhang et al. constructed a lncRNA-disease association prediction model based on integrated space projection scores (LDAI-ISPS) 14. In addition, Li et al. proposed a network consistency projection model for lncRNA-disease association prediction (NCPLDA) 15. The probability matrix of lncRNA-disease associations is calculated by integrating the above information.
Then the lncRNA similarity and disease similarity are obtained based on Gaussian kernel similarity. Finally, the lncRNA-disease association score is obtained by combining the disease space projection score and the lncRNA space projection score. The disadvantage is that this method depends on the quality of the data. Overall, the above methods have achieved good prediction results. Zeng et al. proposed a hybrid computing framework (SDLDA) 16, a lncRNA-disease association prediction model based on singular value decomposition and deep learning. The model uses singular value decomposition and deep learning to extract the linear and nonlinear features of lncRNAs and diseases, respectively, and combines them to train SDLDA. The linear and nonlinear features complement each other to yield relatively high-quality features, and the concatenated vectors are used for lncRNA-disease association prediction. The performance of the prediction model is greatly improved. The disadvantage is that the parameters of SDLDA are difficult to determine. However, biological association information is generally affected by a variety of factors 17, so prediction from a single type of data has certain limitations.
The other category uses multiple types of association data for prediction. Ding et al. proposed a novel lncRNA-disease association prediction method (TPGLDA) 18. By integrating gene-disease associations and lncRNA-disease associations, it better describes the heterogeneity of coding and non-coding gene-disease associations and effectively identifies potential lncRNA-disease associations. Fu et al. proposed matrix factorization-based data fusion for the prediction of lncRNA-disease associations (MFLDA) 19.
In this way, weights can be assigned to the data sources, with smaller weights given to less relevant sources, in order to uncover potential lncRNA-disease associations. The biggest advantage of this model is that, by organizing a variety of heterogeneous data sources, it can readily predict associations between different types of research objects. However, MFLDA is biased toward sparse data matrices, and its performance can be compromised by low-quality and irrelevant intra-relational data sources. Considering the different relevance of the inter-relational matrix and the multiple intra-relational matrices, Wang et al. improved MFLDA and proposed WMFLDA 20, a model based on weighted matrix factorization of multi-relational data. First, the model constructs a heterogeneous network for different types of entities and multiple intra-relational networks for entities of the same type. Then weights are assigned to these networks and a collaborative low-rank matrix factorization is performed. Finally, lncRNA-disease associations are predicted from the optimized low-rank matrices. The WMFLDA model can be applied to all kinds of link prediction problems and can integrate both inter-relational and intra-relational data sources. However, this model ignores the different relevance of the multiple relational matrices to the target prediction task. In addition, Liu et al. proposed a weighted graph regularized collaborative matrix factorization method for predicting novel lncRNA-disease associations (WGRCMF) 21. When the known information is insufficient, the performance of matrix factorization methods decreases significantly. Xuan et al. developed a probabilistic matrix factorization method for identifying lncRNA-disease associations (PMFLDA) 22, which establishes a new weighted lncRNA-disease association network from three association networks: lncRNA-miRNA, miRNA-disease, and lncRNA-disease.
The network is then further updated using a KNN algorithm based on disease semantic similarity and lncRNA functional similarity. Finally, potential lncRNA-disease associations are inferred by probabilistic matrix factorization. However, this model relies not only on lncRNA-miRNA association data and miRNA-disease association data, but also on lncRNA-disease association data. The above methods use multi-source data to predict lncRNA-disease associations, but they still require known lncRNA-disease associations, and such data are too sparse.
To solve these problems, this paper proposes a new lncRNA-disease association prediction method, LFMP. lncRNA-miRNA association data and miRNA-disease association data are used to calculate lncRNA similarity and disease similarity, and the potential lncRNA-disease associations are constructed from these two data sets. Unknown lncRNA-disease associations are thus predicted without any known lncRNA-disease association data. Simulation results show that the AUC of LFMP reaches 0.8964 under the LOOCV framework, which is better than the latest results. Case studies of lung and colorectal tumors show that LFMP can effectively infer undetected lncRNA-disease associations.

Results
Evaluation metrics. To evaluate the performance of the LFMP model, we used the ROC curve and the AUC value generated by leave-one-out cross-validation (LOOCV) as evaluation measures, and compared LFMP with other advanced models, namely CFNBC 23 and NBCLDA 24. Under the LOOCV framework, each lncRNA-disease association is taken in turn as the test set. By comparing the calculated results with a given threshold, we get four evaluation indexes: True

Analysis of parameters.
In this model, we introduce the parameter ω, whose value range is [0, 1]. This parameter adjusts the ratio of the lncRNA projection score to the disease projection score in the calculation of the final result. We varied ω from 0 to 1 in increments of 0.1, and the results are shown in Fig. 3. When ω = 0, only lncRNA-miRNA is used to calculate functional similarity, and the AUC is 0.8892; when ω = 1, only lncRNA-disease is used to calculate functional similarity, and the AUC is 0.8693. With the fused lncRNA similarity matrix, the AUC reaches 0.8964 at ω = 0.3, which shows that the fused functional similarity has certain advantages.
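The parameter sweep described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names `auc_score` and `sweep_omega` are hypothetical, the two score matrices `lp` and `dp` are assumed to be already computed, and ω is assumed to act as a convex weight, `omega * lp + (1 - omega) * dp`, which may differ from the exact fusion formula of LFMP. The AUC is computed with the rank-based (Mann-Whitney) formula rather than by tracing the ROC curve.

```python
import numpy as np

def auc_score(labels, scores):
    """Rank-based AUC (Mann-Whitney U statistic); tied scores get average ranks."""
    labels = np.asarray(labels).ravel().astype(bool)
    scores = np.asarray(scores, dtype=float).ravel()
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    i = 0
    while i < len(scores):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank
        i = j + 1
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def sweep_omega(lp, dp, truth, step=0.1):
    """AUC for each omega in [0, 1].

    ASSUMPTION: the fused score is omega * lp + (1 - omega) * dp.
    """
    results = {}
    for omega in np.round(np.arange(0.0, 1.0 + step / 2, step), 10):
        fused = omega * lp + (1.0 - omega) * dp
        results[float(omega)] = auc_score(truth, fused)
    return results
```

Calling `sweep_omega(lp, dp, truth)` returns a dictionary mapping each ω value to its AUC, from which the best-performing ω can be read off.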
Case studies. To further demonstrate the ability of LFMP to detect potential disease-associated lncRNAs, several common diseases were analyzed. For each disease, we ranked the candidate lncRNAs by their predicted scores. We verified the top 15 lncRNAs by searching the literature, selected the verified lncRNAs, and attached the PMID (the literature identifier used in the life sciences and medicine). lncRNA OIP5-AS1 was strongly expressed in lung cancer tissues, and its expression was correlated with tumor size and tumor growth rate. Overexpression of OIP5-AS1 increased the proliferation of lung cancer cells in vitro 32. Colorectal cancer (CRC) is also among the top three cancers in the world: it is the third most common cancer in men (746,000 cases, 10.0% of the total) and the second most common cancer in women (614,000 cases, 9.2% of the total) 33. Among the top 15 candidate lncRNAs in our prediction results, 9 have been shown to be associated with colorectal neoplasms. For example, a MALAT1 polymorphism inhibits the binding of miR-194-5p, leading to the risk, growth and metastasis of colorectal cancer 34; the long non-coding RNA HCG18 promotes the growth

Discussion
Computational models for predicting lncRNA-disease associations have become a research hot spot. Using computational models to predict associations between lncRNAs and diseases can accelerate the discovery of potential associations, improve the accuracy of biological experiments, reduce the effort of bioinformatics researchers and the cost of biological experiments, and help doctors improve the early detection and targeted treatment of some diseases. At present, there are a large number of lncRNA-disease prediction models. Most of them use known lncRNA-disease association information to predict unknown associations, and the most important step is the calculation of lncRNA-lncRNA similarity and disease-disease similarity. These similarities are commonly calculated from lncRNA-disease association information. This approach has both advantages and disadvantages. The advantage is that lncRNA-lncRNA similarity calculated directly from lncRNA-disease association information is more credible for predicting lncRNA-disease associations. The disadvantage is that the known lncRNA-disease association information is too sparse, and this lack of known information reduces credibility. Therefore, we use lncRNA-miRNA association information to calculate lncRNA-lncRNA similarity and miRNA-disease association information to calculate disease-disease similarity. Introducing miRNA as an intermediate variable lowers the credibility of the calculated similarities for predicting lncRNA-disease associations. However, because the known lncRNA-miRNA and miRNA-disease association information is more complete, the credibility of the calculated lncRNA-lncRNA similarity is improved. Moreover, introducing miRNA solves the problem of missing lncRNA-disease association information and greatly helps the prediction of unknown lncRNA-disease associations.

Conclusion
In this study, we propose LFMP, a lncRNA-disease association prediction model based on a latent factor model and projection. The model integrates multiple data sources, namely lncRNA-miRNA association data and miRNA-disease association data, and predicts lncRNA-disease associations indirectly; that is, it does not need known lncRNA-disease association data to predict the association between lncRNAs and diseases. Comparison with other models and verification of the prediction results against the literature show that LFMP has a certain reliability and good prediction ability. Our model also has some limitations. Using multivariate data is a double-edged sword: it helps to improve the reliability of prediction, but it also makes the data harder to obtain. Compared with prediction from a single type of association data, this model needs more stringent data preprocessing, and it relies heavily on the known lncRNA-miRNA association data and miRNA-disease association data. If these two data sets are too sparse, the prediction performance of the model will be affected.
The data obtained were cleaned, and the final data are shown in Table 1. The lncRNA-miRNA adjacency matrix $A_{LM} = \{a_{lm}\}_{m \times n}$ and the miRNA-disease adjacency matrix $A_{MD} = \{a_{md}\}_{n \times e}$ are constructed from the lncRNA-miRNA association data set and the miRNA-disease association data set, respectively. The construction of the adjacency matrices is shown in Fig. 4, and the experimental data are shown in Table 2.
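As an illustration of how such adjacency matrices can be built from lists of known association pairs, a minimal Python sketch is given below. The helper name `build_adjacency` and all identifiers in the toy data are hypothetical and are not taken from the data sets used in this work; the sketch assumes binary entries (1 for a known association, 0 otherwise).

```python
import numpy as np

def build_adjacency(pairs, row_names, col_names):
    """Binary adjacency matrix: entry (r, c) is 1 if (row, col) is a known pair."""
    row_index = {name: i for i, name in enumerate(row_names)}
    col_index = {name: j for j, name in enumerate(col_names)}
    adj = np.zeros((len(row_names), len(col_names)), dtype=int)
    for row, col in pairs:
        adj[row_index[row], col_index[col]] = 1
    return adj

# Toy, made-up identifiers for illustration only.
lncrnas = ["lnc_a", "lnc_b"]            # m = 2
mirnas = ["mir_1", "mir_2", "mir_3"]    # n = 3
diseases = ["dis_x", "dis_y"]           # e = 2

A_LM = build_adjacency([("lnc_a", "mir_1"), ("lnc_b", "mir_3")], lncrnas, mirnas)
A_MD = build_adjacency(
    [("mir_1", "dis_x"), ("mir_2", "dis_x"), ("mir_3", "dis_y")], mirnas, diseases
)
```

Here `A_LM` has shape m × n and `A_MD` has shape n × e, matching the dimensions given for the two adjacency matrices.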

Methods
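Cosine similarity for diseases. The cosine similarity between diseases i and j is calculated from the miRNA-disease adjacency matrix:

$$\cos\big(A_{MD}(:, i),\, A_{MD}(:, j)\big) = \frac{A_{MD}(:, i) \cdot A_{MD}(:, j)}{\|A_{MD}(:, i)\|\,\|A_{MD}(:, j)\|},$$

where $A_{MD}(:, i)$ is the i-th column vector of the miRNA-disease adjacency matrix, which represents the association features of disease i.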
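The cosine similarity between every pair of disease columns can be computed in one step. The following Python sketch is illustrative only: the function name `cosine_similarity_columns` is hypothetical, and the choice to assign similarity 0 to diseases with no associated miRNA (all-zero columns) is an assumption made here to avoid division by zero.

```python
import numpy as np

def cosine_similarity_columns(adj):
    """Pairwise cosine similarity between the columns of an adjacency matrix.

    ASSUMPTION: pairs involving an all-zero column get similarity 0.
    """
    adj = np.asarray(adj, dtype=float)
    norms = np.linalg.norm(adj, axis=0)   # Euclidean norm of each column
    dots = adj.T @ adj                    # dot products of all column pairs
    denom = np.outer(norms, norms)
    sim = np.zeros_like(dots)
    np.divide(dots, denom, out=sim, where=denom > 0)
    return sim
```

Applied to the miRNA-disease adjacency matrix, the result is an e × e symmetric matrix whose (i, j) entry is the cosine similarity between diseases i and j.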
Jaccard similarity for diseases. The calculation of similarity is an important part of gene association prediction. At present, most articles calculate similarity with the Gaussian interaction profile kernel. In contrast, we use the Jaccard similarity. The Jaccard similarity between diseases i and j is calculated from the miRNA-disease adjacency matrix as the number of miRNAs associated with both diseases divided by the number of miRNAs associated with at least one of them:

$$\frac{|M_i \cap M_j|}{|M_i \cup M_j|},$$

where $M_i$ denotes the set of miRNAs with a non-zero entry in $A_{MD}(:, i)$.

Calculation of latent factor model. Compared with previous studies 41,42, the preliminary lncRNA-disease association matrix is calculated from the adjacency matrix $A_{LM} = \{a_{lm}\}_{m \times n}$, built from lncRNA-miRNA association information, and the adjacency matrix $A_{MD} = \{a_{md}\}_{n \times e}$, built from miRNA-disease association information:

$$A_{LD} = A_{LM} \times A_{MD}.$$

The matrix $A_{LD} = \{a_{ld}\}_{m \times e}$ represents the preliminary correlation score between lncRNAs and diseases. However, this matrix is still too sparse. To solve this problem, we use the latent factor model to calculate potential scores. $A_{LD}$ can be approximated by the product ψ of two matrices X and Y:

$$A_{LD} \approx \psi = XY, \qquad \psi_{ij} = \sum_{k} x_{ik}\, y_{kj},$$

where X is the lncRNA feature matrix, Y is the disease feature matrix, and k indexes the latent (implicit) classes. X and Y are obtained by decomposing $A_{LD}$. Conversely, multiplying the lncRNA feature matrix X by the disease feature matrix Y gives the lncRNA-disease score matrix ψ. Compared with $A_{LD}$, ψ assigns a score to the zero entries of $A_{LD}$, while for the non-zero entries of $A_{LD}$ the corresponding entries of ψ are approximately equal to those of $A_{LD}$. Each element of ψ is the dot product of the corresponding feature vectors in X and Y, and it reflects the fit between the lncRNA features and the disease features. Therefore, the larger the value in ψ, the stronger the association between the lncRNA and the disease.
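Two of the building blocks described above, the Jaccard similarity between disease columns and the preliminary score matrix obtained by multiplying the two adjacency matrices, can be sketched in Python as follows. The function names `jaccard_similarity_columns` and `preliminary_scores` are hypothetical, and assigning similarity 0 when both columns are empty is an assumption of this sketch.

```python
import numpy as np

def jaccard_similarity_columns(adj):
    """Pairwise Jaccard similarity between the columns of a binary matrix.

    ASSUMPTION: if neither column has any non-zero entry, the similarity is 0.
    """
    b = (np.asarray(adj) > 0).astype(int)
    inter = (b.T @ b).astype(float)            # |M_i intersect M_j|
    counts = b.sum(axis=0)                     # |M_i|
    union = counts[:, None] + counts[None, :] - inter
    sim = np.zeros_like(inter)
    np.divide(inter, union, out=sim, where=union > 0)
    return sim

def preliminary_scores(a_lm, a_md):
    """Preliminary lncRNA-disease scores: entry (l, d) counts shared miRNAs."""
    return np.asarray(a_lm) @ np.asarray(a_md)
```

With binary adjacency matrices, each entry of the product counts the miRNAs linked to both the lncRNA and the disease, which is why the resulting matrix remains sparse when few miRNAs are shared.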
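To obtain X and Y, we minimize a loss function by gradient descent. With $a_{ij}$ denoting the (i, j) entry of $A_{LD}$ and $\psi_{ij} = \sum_{k} x_{ik}\, y_{kj}$, the loss function is defined as:

$$L = \sum_{(i,j)} \Big[ \big(a_{ij} - \psi_{ij}\big)^2 + \lambda \|X_i\|^2 + \lambda \|Y_j\|^2 \Big].$$

Here, $\|X_i\|$ and $\|Y_j\|$ are regularization terms used to prevent overfitting, and λ is determined experimentally. For a single entry (i, j), the partial derivative with respect to $x_{ik}$ is:

$$\frac{\partial L}{\partial x_{ik}} = -2\big(a_{ij} - \psi_{ij}\big)\, y_{kj} + 2\lambda\, x_{ik}.$$

According to the stochastic gradient descent method, the parameters are moved along the direction of steepest descent, which gives the recurrence formula:

$$x_{ik} \leftarrow x_{ik} - \alpha \frac{\partial L}{\partial x_{ik}},$$

where α is the learning rate. Substituting the partial derivative into the recurrence formula gives:

$$x_{ik} \leftarrow x_{ik} + 2\alpha \Big[ \big(a_{ij} - \psi_{ij}\big)\, y_{kj} - \lambda\, x_{ik} \Big].$$

Similarly, we can get:

$$y_{kj} \leftarrow y_{kj} + 2\alpha \Big[ \big(a_{ij} - \psi_{ij}\big)\, x_{ik} - \lambda\, y_{kj} \Big].$$

In our experiment, α is set to 0.0002 and λ is set to 0.004.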
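The training procedure can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than the exact implementation used in the experiments: the function name `latent_factor_model`, the number of latent classes `k`, the number of epochs, the random initialization, and the choice to fit only the non-zero entries of the preliminary score matrix are all assumptions. The update follows a regularized squared-error loss, with the factor 2 from the derivative kept explicit.

```python
import numpy as np

def latent_factor_model(a_ld, k=10, alpha=0.0002, lam=0.004, epochs=200, seed=0):
    """Factorize a_ld (m x e) into x (m x k) and y (k x e) by SGD.

    ASSUMPTIONS: only non-zero entries of a_ld are fitted; k, epochs and the
    small random initialization are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    a_ld = np.asarray(a_ld, dtype=float)
    m, e = a_ld.shape
    x = rng.random((m, k)) * 0.1
    y = rng.random((k, e)) * 0.1
    rows, cols = np.nonzero(a_ld)
    for _ in range(epochs):
        for idx in rng.permutation(len(rows)):
            i, j = rows[idx], cols[idx]
            err = a_ld[i, j] - x[i, :] @ y[:, j]
            x_i_old = x[i, :].copy()  # use the pre-update x_i when updating y_j
            x[i, :] += 2 * alpha * (err * y[:, j] - lam * x[i, :])
            y[:, j] += 2 * alpha * (err * x_i_old - lam * y[:, j])
    return x, y, x @ y
```

The third return value is the dense score matrix, which carries a score for every lncRNA-disease pair, including those whose preliminary score was zero.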
Establishment of LFMP prediction model. This paper proposes a new LFMP prediction model by combining the latent factor model with projection. The flow chart of the LFMP model is shown in Fig. 4. Compared with previous studies 43, we extend network consistency projection from single lncRNA-disease association data to multivariate data, such as lncRNA-miRNA association data and miRNA-disease association data. The lncRNA-disease potential score matrix ψ is calculated by the latent factor model. On this potential score matrix, the fused functional similarity of lncRNAs and the integrated semantic similarity of diseases are used to project lncRNAs and diseases, respectively. The lncRNA projection is defined in terms of the following quantities. ILS(i, :) is the vector of similarities between lncRNA i and all other lncRNAs. ψ(j, :) holds the potential scores between lncRNA j and all diseases. ||ILS(i, :)|| is the 2-norm (Euclidean norm) of the vector ILS(i, :) of the integrated lncRNA similarity matrix. LP(i, j) is the projection score, and m is the number of lncRNAs. The disease projection is defined analogously. IDS(:, j) is the vector of similarities between disease j and all other diseases. ||ψ(:, i)|| is the 2-norm of the corresponding vector of the lncRNA-disease potential score matrix. DP(i, j) is the projection score, and e is the number of diseases.
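The two projections can be sketched in Python as follows. The normalization used here is an assumption: it follows the way network consistency projection is commonly written, where each similarity vector is projected onto a score vector and divided by the norm of that score vector. The exact formulas of LFMP may normalize differently (for example by ||ILS(i, :)||), and the function names are hypothetical.

```python
import numpy as np

def lncrna_projection(ils, psi):
    """LP(i, j): ILS(i, :) projected onto column j of psi.

    ASSUMPTION: normalized by the 2-norm of psi(:, j); zero columns give 0.
    """
    ils = np.asarray(ils, dtype=float)   # m x m integrated lncRNA similarity
    psi = np.asarray(psi, dtype=float)   # m x e potential score matrix
    col_norms = np.linalg.norm(psi, axis=0)[None, :]
    raw = ils @ psi
    out = np.zeros_like(raw)
    np.divide(raw, col_norms, out=out, where=col_norms > 0)
    return out

def disease_projection(ids, psi):
    """DP(i, j): IDS(:, j) projected onto row i of psi.

    ASSUMPTION: normalized by the 2-norm of psi(i, :); zero rows give 0.
    """
    ids = np.asarray(ids, dtype=float)   # e x e integrated disease similarity
    psi = np.asarray(psi, dtype=float)
    row_norms = np.linalg.norm(psi, axis=1)[:, None]
    raw = psi @ ids
    out = np.zeros_like(raw)
    np.divide(raw, row_norms, out=out, where=row_norms > 0)
    return out
```

Both functions return m × e matrices, so the lncRNA and disease projection scores can be combined entry by entry.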
The final lncRNA-disease potential association prediction score matrix is formed by fusing the lncRNA projection score with the disease projection score. LFMP(i, j) is the final association score between lncRNA i and disease j, and ω regulates the relative contributions of the lncRNA projection and the disease projection to the final result.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.