Abstract
Identifying lncRNA-disease associations is conducive to the diagnosis, treatment and prevention of diseases. Because verification by biological experiments is expensive and time-consuming, prediction methods based on computational models have gradually become an important means of discovering lncRNA-disease associations. However, existing methods still struggle to make full use of network topology information to identify potential associations between lncRNAs and diseases in multi-source data. In this study, we propose a novel method called HGNNLDA for lncRNA-disease association prediction. First, HGNNLDA constructs a heterogeneous network composed of the lncRNA similarity network, the lncRNA-disease association network and the lncRNA-miRNA association network. Then, on this heterogeneous network, various types of strongly correlated neighbors of fixed size are sampled for each node by restart random walk. Next, the embedding of the lncRNA and disease in each lncRNA-disease association pair is obtained by type-based neighbor aggregation and all-type combination through a heterogeneous graph neural network, in which an attention mechanism is introduced because different types of neighbors make different contributions to the prediction of lncRNA-disease associations. As a result, the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR) under fivefold cross-validation (5FCV) are 0.9786 and 0.8891, respectively. Compared with five state-of-the-art prediction models, HGNNLDA has better prediction performance. In addition, two types of case studies further verify that our method can effectively predict potential lncRNA-disease associations and can make predictions for new diseases without any known lncRNAs.
Introduction
Long non-coding RNAs (lncRNAs) are non-coding RNAs more than 200 nt (nucleotides) in length^{1}. More and more studies have shown that lncRNAs participate in many important biological processes, including gene transcription, cell differentiation and genetic regulation^{2}. Moreover, complex diseases that seriously endanger human health are also inseparable from the abnormal expression of lncRNAs, including diabetes^{3}, cardiovascular diseases^{4}, HIV^{5}, mental disorders^{6} and some cancers such as lung cancer^{7}, breast cancer^{8} and prostate cancer^{9}. Therefore, identifying the associations between lncRNAs and diseases contributes to understanding the pathogenesis and principles of diseases, and also helps in the diagnosis, treatment and prevention of human diseases. However, traditional biological experiments are time-consuming, costly and somewhat blind, all of which hinder the research process. In recent years, established lncRNA databases such as LncRNADisease 2.0^{10}, Lnc2Cancer v2.0^{11}, NRED^{12}, MNDR^{13}, and GeneRIF^{14} have made it possible to develop computational methods for predicting potential lncRNA-disease associations. According to their underlying ideas, existing methods for predicting lncRNA-disease associations can be broadly classified into two categories: methods based on biological networks and methods based on machine learning.
Computational methods based on biological networks often rely on known association information between lncRNAs and diseases to build heterogeneous networks, on which lncRNA-disease association prediction is then carried out. For example, Sun et al.^{15} proposed a network-based computational model, RWRlncD, in which the known lncRNA-disease association network is used to calculate lncRNA similarity and predict the disease relevance of lncRNAs. Gu et al.^{16} proposed a global network random walk model (GrwLDA) that uses random walks on the lncRNA similarity network and the disease similarity network to predict potential lncRNA-disease associations. However, GrwLDA had difficulty optimizing its model parameters. Wen et al.^{17} proposed the LapBiRWRHLDA model, which Laplacian-normalizes the similarity matrices before constructing the lncRNA-disease networks, integrates the two similarity networks through known lncRNA-disease associations, and then predicts lncRNA-disease associations using a bi-random walk on the resulting heterogeneous network. Zhang et al.^{18} proposed LncRDNetFlow, a global network framework that integrates multi-source networks, including the lncRNA similarity network, the protein interaction network, the disease similarity network, and the association information among heterogeneous nodes; the model was able to predict potential associations for an isolated disease. Zhao et al.^{19} developed a new random walk method, MHRWR, based on multi-source networks. This method introduced the disease-gene network and the lncRNA-gene network to build a multi-layer network so as to extract more potential information, and finally used a multi-layer random walk to predict lncRNA-disease associations.
Computational methods based on machine learning predict potential associations between lncRNAs and diseases by building lncRNA-disease association models and training them on known lncRNA-disease association data to improve accuracy. Chen et al.^{20} assumed that similar diseases are often associated with functionally similar lncRNAs, and developed LRLSLDA, a model based on a semi-supervised learning framework that effectively predicted potential lncRNA-disease associations by integrating known lncRNA-disease associations and lncRNA expression profiles. Nonetheless, LRLSLDA had the problem of optimizing its model parameters. Subsequently, Chen et al.^{21} proposed a new lncRNA-disease prediction model named LNCSIM, which further improved the LRLSLDA model by introducing an lncRNA-disease similarity prediction score. However, this method still could not solve the problem of selecting the semantic contribution factor parameters. Zhao et al.^{22} developed a naive-Bayesian-based computational approach that integrated various information on disease-related lncRNAs, covering the genome, regulome and transcriptome, and successfully predicted 707 potential cancer-associated lncRNAs. Lan et al.^{23} proposed a novel computational method that used Katcher means to fuse the lncRNA and disease similarity matrices of multiple data sources and predicted potential lncRNA-disease associations with an SVM classifier. Sheng et al.^{24} used random walk and convolutional autoencoders to obtain new feature distributions and then input them into the model to reveal potential associations between lncRNAs and diseases.
These two types of approaches still have methodological weaknesses. Methods based on biological networks rely heavily on the constructed lncRNA-disease heterogeneous network and cannot cope effectively when the network structure changes. For methods based on machine learning, the problem is how to select the optimal features; most existing machine learning methods do not take full advantage of the rich topological information contained in heterogeneous networks. To make full use of the feature information of lncRNAs and diseases and the local and global information in the lncRNA-disease association data, graph neural network approaches have appeared in some recent studies. For example, Xuan et al.^{25} used a graph convolutional network and a convolutional neural network to learn the network structure information and the local network features of lncRNA-disease association pairs. Wu et al.^{26} used a graph convolutional network (GCN) as an encoder to obtain the features of lncRNAs and diseases on the heterogeneous network, and then calculated the interaction score between an lncRNA and a disease using the inner product of the two latent factor vectors. Zhang et al.^{27} utilized meta-paths to represent complex semantic information between entities in the network, introduced attention mechanisms to learn the weights of each neighborhood under the meta-paths, and finally aggregated the potential features obtained from the GCN model. Another approach leveraged a graph autoencoder to acquire low-dimensional features and finally used a random forest classifier for lncRNA-disease prediction^{28}. Zhao et al.^{29} proposed HGATLDA, a deep learning algorithm based on a heterogeneous graph attention network, which learns node embeddings from homogeneous and heterogeneous subgraphs. In addition, a computational model based on a graph attention network and a multi-layer perceptron (MLP) was proposed for association prediction^{30}.
However, these methods directly use graph attention networks to extract features and do not take into account the varying numbers of neighbor nodes or the heterogeneity of node types. PANDA applied a graph convolutional autoencoder for feature extraction and utilized a neural network to predict LDAs^{31}. These graph neural network methods capture and exploit the topological information in heterogeneous networks, but ignore the heterogeneity of nodes and edges in heterogeneous graphs.
Inspired by Zhang et al.^{32}, we consider the heterogeneity of structure and content in the heterogeneous graph and propose a novel method for lncRNA-disease association prediction called HGNNLDA. First, a heterogeneous network is constructed, composed of the lncRNA similarity network, the known lncRNA-disease association network and the known lncRNA-miRNA association network. Then, a fixed-size sample of strongly correlated neighbors is drawn by restart random walk for each lncRNA and disease, and the sampled neighbors are grouped according to node type. Next, the feature vectors of the sampled lncRNAs, diseases and miRNAs are obtained by word2vec. The final embedding of each lncRNA and disease is extracted by aggregating the sampled neighbors per type and fusing the different types, where an attention mechanism is introduced to indicate the importance of different types of neighbors. Finally, the embeddings obtained from the above steps for each lncRNA-disease association pair are used as the input of a classifier, and the prediction score of the association pair is calculated. The experimental results show that the AUC and AUPR values of HGNNLDA under fivefold cross-validation (5FCV) are 0.9786 and 0.8891, respectively, which is superior to other state-of-the-art methods. In addition, two case studies show that HGNNLDA has the ability to predict disease-related lncRNAs without any known association.
Results
Performance evaluation
We considered the 2697 known lncRNA-disease associations as positive samples. However, the positive samples account for only 2.7\(\%\) of the total number of samples, so some previous studies^{33,34,35,36} selected the same number of negative samples as positive samples from all unknown association pairs. We followed the same strategy and randomly selected 2697 lncRNA-disease pairs from all the unknown lncRNA-disease associations as negative samples. After constructing the training set of the model, fivefold cross-validation (5FCV) was used to evaluate the prediction performance of HGNNLDA. For 5FCV, the sample set was evenly divided into 5 disjoint subsets; in each round, 4 subsets were used to train the model and the remaining subset was used for testing. Then, the trained HGNNLDA model was used to obtain the score of each test sample. The higher the score, the more likely the lncRNA is related to the disease. Next, all test samples were sorted in descending order of their scores. On this basis, we calculated the true positive rate (TPR), false positive rate (FPR), precision and recall under different thresholds. The specific calculations are as follows:
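These metrics follow their standard definitions, consistent with the term descriptions below:

\[
TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN}
\]

\[
Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}
\]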
where TP (true positive) means a positive sample correctly predicted as positive; FN (false negative) means a positive sample erroneously predicted as negative; FP (false positive) means a negative sample erroneously predicted as positive; and TN (true negative) means a negative sample correctly predicted as negative. The ROC curve was then drawn with TPR on the vertical axis and FPR on the horizontal axis, and the area under the ROC curve (AUC) was used as the index to evaluate the prediction performance of the model; the larger the AUC, the better the model's prediction performance. To better evaluate model performance when the positive and negative samples are seriously unbalanced, we also calculated the AUPR to assess the overall performance of the model.
Comparison with other models
To further evaluate the prediction performance of HGNNLDA, we compared it with five state-of-the-art lncRNA-disease association prediction models: SIMCLDA^{37}, MFLDA^{38}, LDAP^{23}, CNNLDA^{39} and GCNLDA^{25}. Under 5FCV, the average AUCs and AUPRs of all lncRNA-disease association prediction models are shown in Table 1. All experimental results of the compared models come from Yao et al.'s previous study on lncRNA-disease association prediction, and those models were tested on the same datasets^{40}. The ROC curve of each cross-validation fold of HGNNLDA is shown in Fig. 1. The other models did not take into account that the neighbors of some nodes may not contain all node types, and assumed that each node type contributes equally to the prediction of lncRNA-disease associations. HGNNLDA addresses these problems through restart random walk and the attention mechanism.
Ablation study
To analyze the necessity of each component of our model, we adopt two variants of HGNNLDA (HGNNLDA-noNeigh and HGNNLDA-noAttention) as comparison methods. Specifically, in HGNNLDA-noNeigh, the embedding of each node is obtained only by word2vec, without aggregating the information of any neighboring nodes. HGNNLDA-noAttention uses a fully connected neural network instead of the attention mechanism to aggregate the embeddings of different types of neighbors, which means that different types of neighbor nodes are equally important for the final embedding of lncRNAs and diseases. Figure 2 shows the average AUC obtained using HGNNLDA and the two variant models. HGNNLDA performs better than HGNNLDA-noNeigh, which indicates that aggregating the information of neighboring nodes better generates node embeddings. HGNNLDA also outperforms HGNNLDA-noAttention, which shows that the attention mechanism can capture the influence of different types of nodes.
The effects of embedding size
The embedding size plays an important role in HGNNLDA and directly affects the performance of the model. In the experiment, we set different embedding dimensions d (i.e., 8, 16, 32, 64, 128, 256) and evaluated the prediction performance under each setting. As can be seen from Fig. 3, within a certain range, the larger the embedding dimension, the better the node representation that can be learned, and the higher the AUC. However, when the embedding dimension increases further, the AUC becomes stable or slightly worse, which may be caused by overfitting. Accounting for this, the embedding size is set to 128 in this paper.
Case studies
To further verify the accuracy and effectiveness of HGNNLDA, we conducted two types of case studies.
For the first type of case study, we applied our proposed method to predict the potential lncRNA-disease associations of three common diseases (lung cancer, colon cancer and osteosarcoma). First, for a specific disease, we regarded all known associations between lncRNAs and diseases as training samples and the unknown associations with this disease as candidate samples. Then, we scored all unknown candidate lncRNAs for lung cancer, colon cancer and osteosarcoma, sorted the scores in descending order and selected the top 10 candidate associations for each disease. The prediction results were verified using two databases (the LncRNADisease^{10} database and the Lnc2Cancer^{11} database). Table 2 shows the top 10 results of predicting the potential associations with lung cancer, colon cancer and osteosarcoma; the accuracy reached 100\(\%\), 80\(\%\) and 80\(\%\), respectively. The results show that our method can effectively predict potential lncRNA-disease associations.
For the second type of case study, we evaluated the ability of our proposed method to predict new associations for diseases without any known related lncRNA, taking breast cancer as an example. First, we set the known associations of breast cancer as unknown, and considered all lncRNAs as candidate lncRNAs. HGNNLDA was then used to score these candidate lncRNAs for association with breast cancer. We found that 27 of the top 30 lncRNAs were confirmed by the LncRNADisease or Lnc2Cancer database, as shown in Table 3. This result shows that HGNNLDA can effectively predict potential associations for diseases without any known related lncRNAs.
Discussion
Identifying associations between lncRNAs and diseases has a huge impact on the treatment and prevention of some diseases. Therefore, we propose a novel method, HGNNLDA, to predict potential associations between lncRNAs and diseases. The comparison of experimental results shows that HGNNLDA has superior performance for predicting lncRNA-disease associations. In addition, the two types of case studies verify that HGNNLDA can identify potential lncRNA-disease associations and can effectively make predictions for new diseases without any known lncRNA.
The reliable performance of HGNNLDA is related to the following factors. First, the model integrates multiple sources of heterogeneous data to build a heterogeneous network. Second, HGNNLDA obtains all types of strongly correlated neighbors of fixed size for each node by restart random walk, which overcomes the shortcoming that the directly connected neighbors of some nodes are not representative enough. In addition, HGNNLDA is able to capture the strongly correlated neighbor features of each node in the heterogeneous network, fully exploiting its topological information. Finally, HGNNLDA employs the attention mechanism to account for the differential impact of different node types on lncRNA-disease association prediction. To sum up, HGNNLDA makes full use of the complex structural and semantic information of the heterogeneous network and thus achieves good prediction of lncRNA-disease associations.
However, our method still has some limitations. First, the data we use to build the heterogeneous network may contain noise and outliers. Second, we randomly select unknown lncRNA-disease pairs as negative samples for training, which cannot guarantee that the lncRNA and disease in an unknown pair are completely unrelated, and this affects the prediction performance. Our future research will therefore focus on how to overcome these problems.
Methods
The general overview of our proposed HGNNLDA framework is shown in Fig. 4, which consists of five key parts: (1) Construction of the heterogeneous network. We downloaded lncRNA-disease associations and lncRNA-miRNA associations, calculated the similarity between lncRNAs, and constructed a heterogeneous network containing three types of nodes: lncRNA, disease and miRNA. (2) Sampling strongly correlated neighbors and representing the features of each neighbor. We sampled fixed-size neighbors of all types for each lncRNA and disease by restart random walk, and then extracted the features of each neighbor node by word2vec. (3) Embedding learning. We used BiLSTM to obtain the embeddings of the three types of neighbors: lncRNA, disease and miRNA. (4) Updating the node embedding. We introduced the attention mechanism and aggregated the embeddings of the three neighbor types and the nodes' own embeddings according to the learned weights. (5) LncRNA-disease association prediction. The embeddings of an lncRNA and a disease were concatenated to obtain the embedding of the lncRNA-disease association pair; the prediction score between the lncRNA and the disease was then obtained using fully connected and softmax layers, and the model was finally optimized with the cross-entropy loss.
Datasets for lncRNA-disease association prediction
Studies have shown that lncRNAs can interact with corresponding miRNAs and perform biological functions together with them^{41}. Therefore, all useful biological information can be assembled to construct a heterogeneous network including the lncRNA-lncRNA similarity network, the experimentally validated lncRNA-disease association network, and the lncRNA-miRNA association network. The data used in this paper were obtained from the previous lncRNA-disease association prediction study by Fu et al.^{38}. This dataset includes 240 lncRNAs, 412 diseases, and 495 miRNAs. Among them, 2697 verified lncRNA-disease associations are derived from the LncRNADisease^{10}, Lnc2Cancer^{11} and GeneRIF^{14} databases. In addition, 1002 lncRNA-miRNA associations come from the starBase database^{42}.
LncRNA functional similarity network
In this paper, the functional similarity of lncRNAs is calculated by the method of Chen et al.^{21}, in which lncRNA similarity is expressed through the similarity of their related diseases. Suppose lncRNA \(l\left( 1 \right)\) is associated with a group of diseases \(D\left( 1 \right) = \left\{ {d\left( {11} \right) ,d\left( {12} \right) , \ldots ,d(1m)} \right\}\), and lncRNA \(l\left( 2 \right)\) is associated with a group of diseases \(D\left( 2 \right) = \left\{ {d\left( {21} \right) ,d\left( {22} \right) , \ldots ,d\left( {2n} \right) } \right\}\). Then the functional similarity between lncRNAs \(l\left( 1 \right)\) and \(l\left( 2 \right)\), denoted \({S_{l\left( 1 \right) ,l\left( 2 \right) }}\), is calculated as follows:
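Following the formulation of Chen et al.^{21}, this similarity averages the best-matching disease semantic similarities in both directions:

\[
{S_{l\left( 1 \right),l\left( 2 \right)}} = \frac{\sum\nolimits_{1 \le i \le m} {\mathop {\max }\limits_{1 \le j \le n} DSS\left( {d\left( {1i} \right),d\left( {2j} \right)} \right)}  + \sum\nolimits_{1 \le j \le n} {\mathop {\max }\limits_{1 \le i \le m} DSS\left( {d\left( {2j} \right),d\left( {1i} \right)} \right)} }{m + n}
\]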
where \(DSS\left( {d\left( {1i} \right) ,d\left( {2j} \right) } \right)\) represents the semantic similarity between disease \(d\left( {1i} \right)\) and disease \(d\left( {2j} \right)\), calculated by the method of Wang et al.^{43}; m and n represent the numbers of diseases in disease groups \(D\left( 1 \right)\) and \(D\left( 2 \right)\), respectively; and LFS is the 240 × 240 lncRNA functional similarity matrix, where 240 is the number of lncRNAs.
LncRNA-disease associations and lncRNA-miRNA associations
The datasets include 2697 experimentally verified lncRNA-disease associations and 1002 experimentally verified lncRNA-miRNA associations^{38}. The associations between lncRNAs and diseases are expressed by a 240 × 412 adjacency matrix LD, with \(LD\left( {l\left( i \right) ,d\left( j \right) } \right) = 1\) if lncRNA \(l\left( i \right)\) is related to disease \(d\left( j \right)\), and 0 otherwise. Similarly, the associations between lncRNAs and miRNAs are represented by a 240 × 495 adjacency matrix LM, with \(LM\left( {l\left( i \right) ,m\left( j \right) } \right) = 1\) if lncRNA \(l\left( i \right)\) is related to miRNA \(m\left( j \right)\), and 0 otherwise.
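As a concrete illustration, such adjacency matrices can be built from lists of index pairs; the helper name and index pairs below are toy values for illustration, not the real dataset or the authors' code.

```python
# Sketch: build a binary adjacency matrix from (row, col) index pairs.
# The index pairs used below are illustrative, not real associations.
def build_adjacency(pairs, n_rows, n_cols):
    A = [[0] * n_cols for _ in range(n_rows)]
    for i, j in pairs:
        A[i][j] = 1                      # mark a verified association
    return A

# 240 lncRNAs x 412 diseases, and 240 lncRNAs x 495 miRNAs, as in the dataset
LD = build_adjacency([(0, 3), (5, 10)], 240, 412)
LM = build_adjacency([(0, 7)], 240, 495)
```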
Heterogeneous network construction
As shown in Fig. 4a, we construct a heterogeneous network based on the lncRNA functional similarity network LFS, the lncRNA-disease association network LD and the lncRNA-miRNA association network LM. The heterogeneous network can be expressed as:
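In standard graph notation:

\[
G = \left( {N,E} \right)
\]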
where N represents the node set, which contains three types of nodes, namely \(NT = \left\{ {{\text{lncRNA}},{\text{disease}},{\text{miRNA}}} \right\}\), and E represents the edge set, which contains three types of edges, namely \(ET = \left\{ {{\text{lncRNA-disease}},{\text{lncRNA-lncRNA}},{\text{lncRNA-miRNA}}} \right\}\).
Sampling heterogeneous neighbors with restart random walk
In heterogeneous networks, the neighbors of many nodes do not include all node types, and the number of neighbor nodes varies^{32}. For example, in Fig. 4a, no disease node is directly connected to a miRNA node; \({d_1}\) has two neighbor nodes, while \({l_2}\) has seven. Therefore, to make full use of the information in the heterogeneous network, we introduce restart random walk (RRW) to sample three types of strongly correlated neighbors for each node. The sampling operation of RRW in the lncRNA-disease heterogeneous network includes two steps:

1. Selecting a fixed sampling length for RRW. Starting a random walk from node \(v \in N\), the walk returns to the starting node with probability p or iteratively moves to a neighbor of the current node, where the probability q controls whether the walk is depth-first or breadth-first: when \(q > 1\), the random walk prefers breadth; when \(q < 1\), it prefers depth. RRW runs until a fixed number of nodes are successfully collected, and the set of sampled nodes is denoted \(RRW\left( v \right)\). Moreover, the number of nodes of each type in \(RRW\left( v \right)\) is constrained, which ensures that all types of nodes are sampled.

2. Grouping neighbor nodes into lncRNA, disease and miRNA types. For each node type t, the top \({k_t}\) nodes are selected based on their frequency of occurrence and taken as the set of t-type correlated neighbors of node v.
In this way, three types of neighbors can be collected for each node, and grouping them by type facilitates the subsequent learning of per-type embeddings.
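The two steps above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the q-biased depth/breadth step choice is replaced by a uniform neighbor choice, and `rrw_sample`, the toy graph and all parameter values are assumptions for the example.

```python
import random
from collections import Counter

def rrw_sample(graph, node_type, start, walk_len=100, p=0.5, k_per_type=5, seed=0):
    """Walk with restart from `start`, then keep the top-k most frequently
    visited nodes of each type (simplified sketch of steps 1-2)."""
    rng = random.Random(seed)
    cur, visited = start, []
    while len(visited) < walk_len:
        if rng.random() < p:             # restart: jump back to the start node
            cur = start
        else:                            # step to a uniformly random neighbor
            cur = rng.choice(graph[cur])
        visited.append(cur)
    # Group visited nodes by type and keep the k most frequent per type
    groups = {}
    for t in ("lncRNA", "disease", "miRNA"):
        counts = Counter(v for v in visited if node_type[v] == t)
        groups[t] = [v for v, _ in counts.most_common(k_per_type)]
    return groups

# Toy heterogeneous graph mirroring the node types of Fig. 4a (names made up)
graph = {"l1": ["d1", "m1", "l2"], "l2": ["l1", "d1"],
         "d1": ["l1", "l2"], "m1": ["l1"]}
ntype = {"l1": "lncRNA", "l2": "lncRNA", "d1": "disease", "m1": "miRNA"}
nbrs = rrw_sample(graph, ntype, "l1")
```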
Embedding learning
Word2vec is a word embedding technique proposed by Mikolov et al.^{44} that provides a vectorized representation for each word or sentence while preserving semantic and contextual integrity. In the last step, each node obtained a fixed-size sampled sequence using the RRW-based strategy, so word2vec can be used to get the embedding of each node. Next, to obtain the per-type embedding, we aggregate all sampled neighbors of the same type using BiLSTM^{45}. For example, we denote the disease-type neighbors of node \(v \in N\) in the heterogeneous network as \({N_d}\left( v \right)\). The embedding of each disease-type neighbor node is obtained by word2vec^{44}, as shown in Fig. 4b. Then we utilize BiLSTM to aggregate the embeddings of all disease-type neighbors, as shown in Fig. 4c. When aggregating lncRNA-type, disease-type and miRNA-type nodes, different BiLSTMs are used to distinguish them. The BiLSTM consists of a forward LSTM layer and a backward LSTM layer. The main structure of an LSTM layer can be expressed as follows:
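Denoting by \(x_s\) the word2vec embedding of the s-th sampled neighbor (notation introduced here for clarity), the gates take the standard LSTM form, consistent with the symbols defined below:

\[
\begin{aligned}
{i_s} &= \sigma \left( {{W_i}\left[ {{h_{s - 1}} \oplus {x_s}} \right] + {b_i}} \right)\\
{f_s} &= \sigma \left( {{W_f}\left[ {{h_{s - 1}} \oplus {x_s}} \right] + {b_f}} \right)\\
{o_s} &= \sigma \left( {{W_o}\left[ {{h_{s - 1}} \oplus {x_s}} \right] + {b_o}} \right)\\
{c_s} &= {f_s} \otimes {c_{s - 1}} + {i_s} \otimes \tanh \left( {{W_c}\left[ {{h_{s - 1}} \oplus {x_s}} \right] + {b_c}} \right)\\
{h_s} &= {o_s} \otimes \tanh \left( {{c_s}} \right)
\end{aligned}
\]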
where \(\sigma\) is the sigmoid activation function; i, f, o and c represent the input gate vector, forget gate vector, output gate vector and memory cell, respectively; \({h_s}\) represents the hidden vector output for the s-th node; W and b represent learnable parameters; and \(\otimes\) represents element-wise multiplication. Two different middle-layer representations are obtained through this calculation. Then, after splicing the two middle layers, the general embedding of all disease-type neighbor nodes of node v is obtained through an average pooling layer, as follows:
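A reconstruction consistent with the terms defined below is mean pooling over the spliced forward and backward states:

\[
{f^d}\left( v \right) = \frac{1}{{\left| {{N_d}\left( v \right)} \right|}}\sum\limits_{s \in {N_d}\left( v \right)} {\left( {\overrightarrow {{h_s}}  \oplus \overleftarrow {{h_s}} } \right)}
\]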
where \({f^d}\left( v \right) \in {\mathrm{{R}}^{d \times 1}}\) is the general embedding of all disease-type neighbors of node v; \(\overrightarrow{{h_s}}\) and \(\overleftarrow{{h_s}}\) represent the forward and backward LSTM representations of the s-th node, respectively; and the symbol \(\oplus\) indicates the concatenation operation.
Updating the node embedding with attention mechanism
The previous step generates the general embeddings of the lncRNA type, disease type and miRNA type. Different types of neighbors have different influences on the final embedding of node v^{32}; for example, lncRNA-type and disease-type nodes usually play a more important role in lncRNA-disease association prediction. To combine the lncRNA-type, disease-type and miRNA-type general embeddings with the embedding of node v itself, we introduce the attention mechanism^{46}. First, the importance of each type is learned, and then all heterogeneous node types (including node v itself) are aggregated to form the final embedding of node v. For any \(t \in N\left( v \right)\), where \(N\left( v \right) = \left\{ {v \cup NT} \right\}\), the importance \(\beta _v^t\) of the t-type relative to node v is expressed as:
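Concretely, consistent with the terms defined below, the type importance is a softmax over activated attention scores:

\[
\beta _v^t = \frac{{\exp \left( {\sigma \left( {{\mathrm{{q}}^{\mathrm{{T}}}}\left[ {f\left( v \right)\parallel {f^t}\left( v \right)} \right]} \right)} \right)}}{{\sum\nolimits_{k \in N\left( v \right)} {\exp \left( {\sigma \left( {{\mathrm{{q}}^{\mathrm{{T}}}}\left[ {f\left( v \right)\parallel {f^k}\left( v \right)} \right]} \right)} \right)} }}
\]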
where \(\sigma\) is the ReLU activation function; \({\mathrm{{q}}^\mathrm{{T}}} \in {\mathrm{{R}}^{2d \times 1}}\) represents the attention vector; f(v) is the embedding of v obtained by word2vec; \({f^t}\left( v \right)\) is the general embedding aggregated for the t-type; \(\parallel\) indicates the concatenation operation; and \({f^k}\left( v \right) = f\left( v \right)\) when k equals v. Then, the final embedding of node v is aggregated from the various type embeddings weighted by the corresponding importance coefficients, as follows:
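That is, the final embedding is the importance-weighted sum over all types (including the node itself):

\[
z\left( v \right) = \sum\limits_{t \in N\left( v \right)} {\beta _v^t\,{f^t}\left( v \right)}
\]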
where \(z\left( v \right) \in {\mathrm{{R}}^{d \times 1}}\) represents the final embedding. The aggregation process over the various node types is illustrated in Fig. 4d.
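The weighting-and-summing described in this section can be sketched in plain Python. This is a toy illustration only: `attention_aggregate`, its inputs and the vector values are assumptions for the example, and no parameters are actually learned here.

```python
import math

def relu(x):
    return max(0.0, x)

def attention_aggregate(f_v, type_embs, q):
    """Apply softmax(ReLU(q . [f(v) || f^t(v)])) weights over the node's own
    embedding and the per-type general embeddings, then sum (toy sketch)."""
    candidates = [f_v] + type_embs        # N(v) = {v} U NT
    scores = []
    for f_t in candidates:
        concat = f_v + f_t                # list concatenation = [f(v) || f^t(v)]
        scores.append(relu(sum(qi * ci for qi, ci in zip(q, concat))))
    exp_s = [math.exp(s) for s in scores]
    total = sum(exp_s)
    betas = [e / total for e in exp_s]    # importance coefficients, sum to 1
    d = len(f_v)
    z = [sum(b * f_t[i] for b, f_t in zip(betas, candidates)) for i in range(d)]
    return z, betas

# d = 2; three type embeddings plus the node's own embedding
z, betas = attention_aggregate([1.0, 0.0],
                               [[0.5, 0.5], [0.0, 1.0], [0.2, 0.8]],
                               q=[0.1, 0.1, 0.1, 0.1])
```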
LncRNA-disease association prediction
The final embedding of lncRNA \({l_i}\) and the final embedding of disease \({d_j}\) are spliced to form the vector representation \({x_{i,j}}\in {\mathrm{{R}}^{2d \times 1}}\) of the association pair \({l_i} - {d_j}\):
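Written out, this is:

\[
{x_{i,j}} = z\left( {{l_i}} \right) \otimes z\left( {{d_j}} \right)
\]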
where \(\otimes\) represents the splicing operation. Each positive sample (an lncRNA-disease pair with a known association) is labeled 1, and each negative sample (a pair without a known association) is labeled 0. Then, we feed the embedding of the association pair \({l_i} - {d_j}\) into the fully connected layer and the softmax layer, and the association score \({s_{i,j}} \in \left[ {0,1} \right]\) between lncRNA \({l_i}\) and disease \({d_j}\) is obtained. Specifically, \({s_{i,j}}\) is expressed as follows:
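Consistent with the parameter shapes given below, the score is:

\[
{s_{i,j}} = softmax\left( {W{x_{i,j}} + b} \right)
\]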
where \(W \in {\mathrm{{R}}^{2 \times 2d}}\) is the parameter of the fully connected layer and b is the bias; the larger the score \({s_{i,j}}\), the greater the possibility of an association between lncRNA \({l_i}\) and disease \({d_j}\). In our model, the cross-entropy loss between the predictions and the real associations is defined as follows:
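The binary cross-entropy takes the usual form:

\[
Loss =  - \sum\limits_{i = 1}^T {\left[ {{y_i}\log {s_i} + \left( {1 - {y_i}} \right)\log \left( {1 - {s_i}} \right)} \right]}
\]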
where T is the number of training samples, \({s_i}\) is the predicted association score of the i-th training sample, and \({y_i}\) is its real association label.
Data availability
The original datasets of our study were downloaded from another lncRNA-disease association prediction study and are available at https://github.com/ydkvictory/RFLDA. The processed data along with the code are available at https://github.com/hongshi940/HGNNLDA.
References
Kapranov, P. et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316, 1484–1488 (2007).
Mercer, T. R., Dinger, M. E. & Mattick, J. S. Long non-coding RNAs: Insights into functions. Nat. Rev. Genet. 10, 155–159 (2009).
Pasmant, E., Sabbagh, A., Vidaud, M. & Bièche, I. ANRIL, a long, noncoding RNA, is an unexpected major hotspot in GWAS. FASEB J. 25, 444–448 (2011).
Congrains, A. et al. Genetic variants at the 9p21 locus contribute to atherosclerosis through modulation of ANRIL and CDKN2A/B. Atherosclerosis 220, 449–455 (2012).
Zhang, Q., Chen, C.-Y., Yedavalli, V. S. & Jeang, K.-T. NEAT1 long noncoding RNA and paraspeckle bodies modulate HIV-1 posttranscriptional expression. MBio 4, e00596-12 (2013).
Johnson, R. Long noncoding RNAs in Huntington’s disease neurodegeneration. Neurobiol. Dis. 46, 245–254 (2012).
Ji, P. et al. MALAT-1, a novel noncoding RNA, and thymosin β4 predict metastasis and survival in early-stage non-small cell lung cancer. Oncogene 22, 8031–8041 (2003).
Barsyte-Lovejoy, D. et al. The c-Myc oncogene directly induces the H19 noncoding RNA by allele-specific binding to potentiate tumorigenesis. Cancer Res. 66, 5330–5337 (2006).
De Kok, J. B. et al. DD3(PCA3), a very sensitive and specific marker to detect prostate tumors. Cancer Res. 62, 2695–2698 (2002).
Bao, Z. et al. LncRNADisease 2.0: An updated database of long non-coding RNA-associated diseases. Nucleic Acids Res. 47, D1034–D1037 (2019).
Ning, S. et al. Lnc2Cancer: A manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 44, D980–D985 (2016).
Dinger, M. E. et al. NRED: A database of long noncoding RNA expression. Nucleic Acids Res. 37, D122–D126 (2009).
Wang, Y. et al. Mammalian ncRNA-disease repository: A global view of ncRNA-mediated disease network. Cell Death Dis. 4, e765–e765 (2013).
Lu, Z., Bretonnel Cohen, K. & Hunter, L. GeneRIF quality assurance as summary revision. In Biocomputing 2007, 269–280 (World Scientific, 2007).
Sun, J. et al. Inferring novel lncRNA-disease associations based on a random walk model of a lncRNA functional similarity network. Mol. BioSyst. 10, 2074–2081 (2014).
Gu, C. et al. Global network random walk for predicting potential human lncRNA-disease associations. Sci. Rep. 7, 1–11 (2017).
Wen, Y., Han, G. & Anh, V. V. Laplacian normalization and bi-random walks on heterogeneous networks for predicting lncRNA-disease associations. BMC Syst. Biol. 12, 11–19 (2018).
Zhang, J., Zhang, Z., Chen, Z. & Deng, L. Integrating multiple heterogeneous networks for novel lncRNA-disease association inference. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 396–406 (2017).
Zhao, X., Yang, Y. & Yin, M. MHRWR: Prediction of lncRNA-disease associations based on multiple heterogeneous networks. IEEE/ACM Trans. Comput. Biol. Bioinform. (2020).
Chen, X. & Yan, G.-Y. Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics 29, 2617–2624 (2013).
Chen, X. et al. Constructing lncRNA functional similarity network based on lncRNA-disease associations and disease semantic similarity. Sci. Rep. 5, 1–12 (2015).
Zhao, T. et al. Identification of cancer-related lncRNAs through integrating genome, regulome and transcriptome features. Mol. BioSyst. 11, 126–136 (2015).
Lan, W. et al. LDAP: A web server for lncRNA-disease association prediction. Bioinformatics 33, 458–460 (2017).
Sheng, N., Cui, H., Zhang, T. & Xuan, P. Attentional multi-level representation encoding based on convolutional and variance autoencoders for lncRNA-disease association prediction. Brief. Bioinform. 22, bbaa067 (2021).
Xuan, P., Pan, S., Zhang, T., Liu, Y. & Sun, H. Graph convolutional network and convolutional neural network based method for predicting lncRNA-disease associations. Cells 8, 1012 (2019).
Wu, X. et al. Inferring lncRNA-disease associations based on graph autoencoder matrix completion. Comput. Biol. Chem. 87, 107282 (2020).
Zhang, J., Jiang, Z., Hu, X. & Song, B. A novel graph attention adversarial network for predicting disease-related associations. Methods 179, 81–88 (2020).
Wu, Q.-W., Xia, J.-F., Ni, J.-C. & Zheng, C.-H. GAERF: Predicting lncRNA-disease associations by graph autoencoder and random forest. Brief. Bioinform. 22, bbaa391 (2021).
Zhao, X., Zhao, X. & Yin, M. Heterogeneous graph attention network based on meta-paths for lncRNA-disease association prediction. Brief. Bioinform. 23, bbab407 (2022).
Lan, W. et al. GANLDA: Graph attention network for lncRNA-disease associations prediction. Neurocomputing 469, 384–393 (2022).
Silva, A. B. O. V. & Spinosa, E. J. Graph convolutional auto-encoders for predicting novel lncRNA-disease associations. IEEE/ACM Trans. Comput. Biol. Bioinform. (2021).
Zhang, C., Song, D., Huang, C., Swami, A. & Chawla, N. V. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 793–803 (2019).
Guo, Z.-H., You, Z.-H., Wang, Y.-B., Yi, H.-C. & Chen, Z.-H. A learning-based method for lncRNA-disease association identification combing similarity information and rotation forest. iScience 19, 786–795 (2019).
Zhang, Y., Ye, F., Xiong, D. & Gao, X. LDNFSGB: Prediction of long noncoding RNA and disease association using network feature similarity and gradient boosting. BMC Bioinform. 21, 1–27 (2020).
Madhavan, M. et al. Deep belief network based representation learning for lncRNA-disease association prediction. arXiv preprint arXiv:2006.12534 (2020).
Zhu, R., Wang, Y., Liu, J.-X. & Dai, L.-Y. IPCARF: Improving lncRNA-disease association prediction using incremental principal component analysis feature selection and a random forest classifier. BMC Bioinform. 22, 1–17 (2021).
Lu, C. et al. Prediction of lncRNA-disease associations based on inductive matrix completion. Bioinformatics 34, 3357–3364 (2018).
Fu, G., Wang, J., Domeniconi, C. & Yu, G. Matrix factorization-based data fusion for the prediction of lncRNA-disease associations. Bioinformatics 34, 1529–1537 (2018).
Xuan, P., Cao, Y., Zhang, T., Kong, R. & Zhang, Z. Dual convolutional neural networks with attention mechanisms based method for predicting disease-related lncRNA genes. Front. Genet. 10, 416 (2019).
Yao, D. et al. A random forest based computational model for predicting novel lncRNA-disease associations. BMC Bioinform. 21, 1–18 (2020).
Yang, G., Lu, X. & Yuan, L. LncRNA: A link between RNA and cancer. Biochim. Biophys. Acta Gene Regul. Mech. 1839, 1097–1109 (2014).
Li, J.-H., Liu, S., Zhou, H., Qu, L.-H. & Yang, J.-H. starBase v2.0: Decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 42, D92–D97 (2014).
Wang, J. Z., Du, Z., Payattakool, R., Yu, P. S. & Chen, C.-F. A new method to measure the semantic similarity of GO terms. Bioinformatics 23, 1274–1281 (2007).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26 (2013).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Veličković, P. et al. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 61862067), the Applied Basic Research Project in Yunnan Province (No. 202201AT070042) and the NSFC-Yunnan Union Key Grant (No. U1902201).
Author information
Authors and Affiliations
Contributions
L.T., L.L. and H.S. conceived the presented idea. H.S. and X.M.Z. carried out the experiments and wrote the draft. L.T. and L.L. helped shape the research, analysis and manuscript. All authors discussed the results and contributed to the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Shi, H., Zhang, X., Tang, L. et al. Heterogeneous graph neural network for lncRNA-disease association prediction. Sci Rep 12, 17519 (2022). https://doi.org/10.1038/s41598-022-22447-y