Abstract
Long noncoding RNAs (lncRNAs) regulate many biological processes by interacting with corresponding RNA-binding proteins. Identifying lncRNA–protein interactions (LPIs) is important for characterizing the biological functions and mechanisms of lncRNAs. Existing computational methods have been effectively applied to LPI prediction. However, the majority of them were evaluated on only one LPI dataset, which may bias the reported performance. More importantly, some models cannot discover possible LPIs for new lncRNAs (or proteins). In addition, prediction performance remains limited. To address these problems, we develop a Deep Forest-based LPI prediction method (LPIDF). First, five LPI datasets are obtained and the corresponding sequence information of lncRNAs and proteins is collected. Second, features of lncRNAs and proteins are constructed based on four-nucleotide composition and BioSeq2vec with encoder-decoder structure, respectively. Finally, a deep forest model with cascade forest structure is developed to find new LPIs. We compare LPIDF with four classical association prediction models under three fivefold cross validations on lncRNAs, proteins, and LPIs. LPIDF obtains better average AUCs of 0.9012, 0.6937 and 0.9457, and the best average AUPRs of 0.9022, 0.6860, and 0.9382, respectively, for the three CVs, outperforming the other methods overall. The results suggest that the lncRNA FTX may interact with the protein P35637, which needs further validation.
Introduction
Noncoding RNAs regulate the majority of biological processes associated with development, differentiation, and metabolism in organisms1. In contrast to small noncoding RNAs (e.g., miRNAs), which are highly conserved and regulate transcriptional and posttranscriptional gene silencing2,3, long noncoding RNAs (lncRNAs), one type of transcribed RNA molecule, are poorly conserved and control gene expression through various mechanisms4,5,6. lncRNAs are closely linked to posttranscriptional gene regulation: they regulate biological processes including protein synthesis, RNA maturation and transportation, and transcriptional gene silencing7,8. Although a few lncRNAs have been well studied, the biological functions of the majority of lncRNAs remain enigmatic9. Recent studies demonstrate that most lncRNAs regulate various biological activities through specific associations with chromatin, for example, by interacting with corresponding RNA-binding proteins10,11,12. Therefore, identifying potential lncRNA–protein interactions (LPIs) is vital to understanding the biological functions and mechanisms of lncRNAs.
To find new LPIs, many experimental methods have been designed13,14. However, wet-lab experiments for finding possible LPIs are costly and time-consuming. Computational methods have therefore been developed as a complementary approach to LPI prediction. These methods fall into two main categories: network-based methods and machine learning-based methods15,16.
Network-based LPI prediction methods include the random walk with restart-based model17, the linear neighborhood propagation algorithm18, bipartite network projection-based recommendation methods19,20,21, and the HeteSim algorithm22. These methods first compute lncRNA similarity and protein similarity from related biological data, then integrate the similarity matrices into a heterogeneous lncRNA–protein network, and finally apply network propagation algorithms to score unknown lncRNA–protein pairs. Network-based methods have successfully found some LPIs; however, they cannot predict interactions for an orphan lncRNA or protein.
Machine learning-based LPI identification methods first extract features of lncRNAs and proteins and then apply a machine learning model to compute interaction probabilities for lncRNA–protein pairs. Classical machine learning-based LPI prediction models include matrix factorization-based methods and ensemble learning-based methods. Matrix factorization-based methods treat LPI prediction as a recommender task and use diverse matrix factorization models to discover unobserved LPIs, for example, gradient boosted regression trees23, graph regularized nonnegative matrix factorization24, and neighborhood regularized logistic matrix factorization25,26. Ensemble learning-based methods construct ensemble models to identify new LPIs27,28, for example, a random forest-based ensemble framework29, a sequence feature projection-based ensemble algorithm30, a broad learning system-based stacked ensemble classifier31, and a graph attention-based deep learning model32.
Although computational methods have effectively identified potential linkages between lncRNAs and proteins, most of the above models share the following limitations. First, their performance was evaluated on only one dataset, which may bias the reported results. Second, the vast majority of models cannot find possible associated proteins (or lncRNAs) for lncRNAs (or proteins) without any interaction information. Third, the performance needs to be further improved. To address these three problems, in this study, known LPI data are first integrated and five different LPI datasets are collected. Second, the features of lncRNAs and proteins are extracted based on four-nucleotide composition and the BioSeq2vec method, respectively. Finally, a Deep Forest model (LPIDF) with cascade forest structure is designed to find LPI candidates. We compare the proposed LPIDF method with four classical LPI prediction models under three different cross validations. The results show that LPIDF obtains better average AUCs and the best average AUPRs on the five datasets under the three cross validations. More importantly, case studies demonstrate that most of our predicted lncRNA–protein pairs with higher interaction probabilities are true LPIs and that the remaining pairs need further experimental validation.
Results
We perform a series of experiments to investigate the prediction performance of our proposed LPIDF method.
Evaluation metrics
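In this study, precision, recall, accuracy, F1-score, AUC and AUPR are used to evaluate the performance of LPIDF. Precision, recall, accuracy, and F1-score are defined as follows:
\[\mathrm{Precision}=\frac{TP}{TP+FP},\qquad \mathrm{Recall}=\frac{TP}{TP+FN},\]
\[\mathrm{Accuracy}=\frac{TP+TN}{TP+FP+TN+FN},\qquad \mathrm{F1\text{-}score}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},\]
where TP, FP, TN, and FN denote the numbers of correctly predicted LPIs, incorrectly predicted LPIs, correctly predicted non-LPIs, and incorrectly predicted non-LPIs, respectively. AUC and AUPR denote the average areas under the ROC curve and the precision-recall curve, respectively. The experiments are repeated 20 times and the average performance over the 20 rounds is reported as the final performance.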
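As an illustration of how the four threshold-based metrics follow from the counts TP, FP, TN, and FN, a minimal Python sketch is given below. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the LPIDF source code or results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy and F1-score from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
```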
Experimental settings
In the study, we conduct three different experimental settings.
Five-fold Cross Validation 1 (CV1): Cross validation on lncRNAs, that is, random rows (i.e., lncRNAs) in an LPI matrix \(Y\) are masked for testing.
Five-fold Cross Validation 2 (CV2): Cross validation on proteins, that is, random columns (i.e., proteins) in an LPI matrix \(Y\) are masked for testing.
Five-fold Cross Validation 3 (CV3): Cross validation on lncRNA–protein pairs, that is, random lncRNA–protein pairs in an LPI matrix \(Y\) are masked for testing.
Under CV1, in each round, 80% of the lncRNAs in an LPI network \(Y\) are selected as the training set and the remaining 20% are used as the testing set. Under CV2, in each round, 80% of the proteins in \(Y\) are selected as the training set and the remaining 20% are used as the testing set. Under CV3, in each round, 80% of the lncRNA–protein pairs in \(Y\) are selected as the training set and the remaining 20% are used as the testing set. The three cross validations correspond to LPI identification for (1) new lncRNAs (lncRNAs whose interaction information is unknown), (2) new proteins, and (3) lncRNA–protein pairs, respectively.
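The three masking schemes can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names `five_fold_indices` and `cv_test_pairs` are hypothetical, and the way folds are drawn (shuffling indices and dealing them into five groups) is an assumption.

```python
import random


def five_fold_indices(n_items, seed=0):
    """Shuffle item indices and deal them into five roughly equal folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[k::5] for k in range(5)]


def cv_test_pairs(n_lnc, n_prot, setting, fold, seed=0):
    """Return the (lncRNA, protein) index pairs masked for testing in one fold."""
    if setting == "CV1":  # mask whole rows, i.e. lncRNAs
        rows = five_fold_indices(n_lnc, seed)[fold]
        return {(i, j) for i in rows for j in range(n_prot)}
    if setting == "CV2":  # mask whole columns, i.e. proteins
        cols = five_fold_indices(n_prot, seed)[fold]
        return {(i, j) for i in range(n_lnc) for j in cols}
    if setting == "CV3":  # mask individual lncRNA-protein pairs
        cells = five_fold_indices(n_lnc * n_prot, seed)[fold]
        return {divmod(c, n_prot) for c in cells}
    raise ValueError(f"unknown setting: {setting}")


# Toy matrix with 10 lncRNAs and 5 proteins: each fold masks 20% of the entries.
test_cv1 = cv_test_pairs(10, 5, "CV1", fold=0)
```

Under CV1 and CV2 every pair of a masked lncRNA or protein is hidden from training, which is what makes these settings a test of prediction for new entities.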
Comparison with four state-of-the-art methods
To evaluate the prediction ability and robustness of LPIDF, we compare it with four state-of-the-art association identification methods: XGBoost33,34, Categorical Boosting (CatBoost)35, random forest36,37, and DRPLPI38. These methods are classical machine learning models that have been widely applied in various areas. XGBoost33,34 is a scalable, end-to-end tree boosting model. CatBoost35 is a gradient boosting technique that combines ordered boosting with native processing of categorical features. Random forest36,37 is composed of multiple decision trees, each trained independently on a random subset. DRPLPI38 exploits a multi-head self-attention model to extract high-quality LPI features based on a long short-term memory encoder-decoder mechanism. In the experiments, we randomly select the same number of negative LPIs as positive LPIs from the unknown lncRNA–protein pairs to reduce the overfitting caused by data imbalance.
In random forest, the number of trees is set to 70 and the minimum number of samples required to split a node is set to 5. In CatBoost, the maximum number of trees is set to 150, the maximum depth to 15, and the learning rate to 0.5. Other parameters are set to the values provided in the corresponding publications. XGBoost is implemented based on the scikit-learn package39.
Table 1 shows the precision, recall, accuracy, F1-score, AUC and AUPR values computed by LPIDF and the other four methods under CV1. As shown in Table 1, LPIDF achieves the highest average precision, accuracy, F1-score, and AUPR over all datasets, outperforming the other four competing LPI prediction methods on these metrics. Although the average recall and AUC computed by LPIDF are slightly lower than those of random forest and DRPLPI, LPIDF obtains the best average AUPR. The average AUPR obtained by LPIDF is 0.9022, which is 0.96%, 2.10%, 0.02% and 0.63% higher than XGBoost, CatBoost, random forest, and DRPLPI, respectively. AUPR is a more important evaluation metric than AUC. Therefore, LPIDF can effectively find potential proteins interacting with a new lncRNA.
Table 2 gives the comparison results under CV2. LPIDF obtains the best average precision, recall, accuracy, F1-score, AUC and AUPR over all datasets. It achieves the best average AUC of 0.6937, which is 4.80%, 10.81%, 1.17% and 0.91% better than XGBoost, CatBoost, random forest, and DRPLPI, respectively. More importantly, LPIDF obtains the highest average AUPR of 0.6860, which is 2.17% and 2.65% higher than the second-best and third-best methods, respectively. In summary, under CV2, LPIDF improves LPI prediction performance compared to the other four prediction methods and is effective in identifying possible lncRNAs for a new protein.
The prediction results computed under CV3 are shown in Table 3. LPIDF outperforms the other LPI prediction methods over all datasets in terms of all six measurements. For example, LPIDF achieves the best average AUC of 0.9457, which is 1.72%, 6.39%, 0.87%, and 0.97% better than XGBoost, CatBoost, random forest, and DRPLPI, respectively. In addition, LPIDF obtains the best average AUPR of 0.9382, which is 0.88% and 1.20% higher than the second-best and third-best methods, respectively. These results indicate that LPIDF can effectively predict potential LPIs.
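The balanced negative sampling described above can be sketched as follows. This is an assumption-laden illustration, not the authors' code: the function name `sample_balanced_pairs`, the list-of-lists representation of \(Y\), and the toy matrix are all hypothetical.

```python
import random


def sample_balanced_pairs(Y, seed=0):
    """Return all positive pairs and an equally sized random sample of unknown pairs.

    Y is a 0/1 matrix (list of rows); rows are lncRNAs and columns are proteins.
    """
    positives = [(i, j) for i, row in enumerate(Y)
                 for j, y in enumerate(row) if y == 1]
    unknown = [(i, j) for i, row in enumerate(Y)
               for j, y in enumerate(row) if y == 0]
    # Never ask for more negatives than there are unknown pairs.
    k = min(len(positives), len(unknown))
    negatives = random.Random(seed).sample(unknown, k)
    return positives, negatives


# Toy interaction matrix, for illustration only.
Y_toy = [[1, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
pos, neg = sample_balanced_pairs(Y_toy)
```

Note that sampled negatives are unknown pairs, not confirmed non-interactions, so some of them may be true LPIs that have not yet been observed.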
Case study
After confirming the performance of our proposed LPIDF method, we use it to identify possible LPIs, in particular to predict interactions for new lncRNAs and proteins.
Finding possible proteins interacting with new lncRNAs
In this section, we aim to find potential proteins interacting with new lncRNAs. All association information for Small Nucleolar RNA Host Gene 3 (SNHG3) and Growth Arrest-Specific 5 (GAS5) is masked, and the two are treated as new lncRNAs. LPIDF is then applied to identify possible proteins interacting with the two lncRNAs.
SNHG3 is an RNA gene affiliated with the lncRNA class. It may be closely associated with various diseases, for example, hepatocellular carcinoma40, non-small-cell lung cancer41, clear cell renal cell carcinoma42, gastric cancer43, hypoxic-ischemic brain damage44, papillary thyroid carcinoma45, ovarian cancer46,47, bladder cancer48, and acute myeloid leukemia49. Table 4 shows the predicted top 5 proteins related to SNHG3 with the highest interaction probabilities on three human datasets.
The results in Table 4 show that the SNHG3–protein interaction pairs predicted by LPIDF are also ranked highly by the other four methods. We predict that O00425 may interact with SNHG3 (ranked 4) in dataset 3, and this interaction has been validated in dataset 1. In addition, we observe that Q9NUL5 and Q13148 may interact with SNHG3. Among all 27 possible proteins, the interaction between Q9NUL5 and SNHG3 is ranked 1 by all five LPI prediction methods. The association between Q13148 and SNHG3 is ranked 5, 7, 8, 5, and 4 by LPIDF, XGBoost, random forest, CatBoost, and DRPLPI, respectively. These results demonstrate the prediction performance of LPIDF.
GAS5 can prevent glucocorticoid receptors from being activated and thus control the transcriptional activity of its target genes. It is inferred to be a potential tumor suppressor and is closely associated with coronary artery disease50,52,53, cirrhotic livers51, rheumatoid arthritis54, Parkinson’s disease55, and primary glioblastoma56.
Table 5 lists the predicted top 5 proteins interacting with GAS5 with the highest association scores on three human datasets. In dataset 3, although the interactions between GAS5 and Q9NZI8 and Q9Y6M1 are unknown, the two LPIs are ranked 5 and 4 by LPIDF, respectively. More importantly, in datasets 1 and 2, Q9NZI8 and Q9Y6M1 show higher interaction probabilities with GAS5 and the two LPIs have been reported. In addition, O00425 is inferred to interact with GAS5 with a rank of 2 in dataset 3, and this interaction has been validated in dataset 1. These results again suggest that LPIDF can effectively find possible proteins associated with a new lncRNA.
Finding potential lncRNAs interacting with new proteins
We next uncover lncRNAs interacting with a new protein on three human datasets. All associated lncRNAs of Q13148 and Q9HCK5 are masked, and the two are treated as new proteins. LPIDF is then used to find possible associated lncRNAs for the two proteins.
Q13148 is an RNA-binding protein involved in RNA biogenesis and processing and in various neurodegenerative diseases57,58,59,60. In addition, it participates in the formation and regeneration of normal skeletal muscles and plays an important role in maintaining circadian clock periodicity59,60. Its second RNA recognition motif has been reported as a major promoter of aggregation and the resultant toxicity61. Frontotemporal lobar degeneration associated with Q13148 aggregation is characterized by progressive neuronal atrophy in the cerebral cortex62. Table 6 lists the predicted top 5 lncRNAs associated with Q13148 on three human datasets. From Table 6, we observe that all predicted top 5 lncRNAs interacting with Q13148 are known in the three datasets.
Table 7 lists the identified top 5 lncRNAs associated with Q9HCK5 on three human datasets. Q9HCK5 is required for RNA-mediated gene silencing, RNA-directed transcription and human hepatitis delta virus replication63. Table 7 shows that all predicted top 5 LPIs for Q9HCK5 are known in the three datasets. In summary, LPIDF may be appropriate for LPI identification for a new protein.
Finding new LPIs based on known LPIs
The number of lncRNA–protein pairs with unknown interaction information is 51,686, 71,075, 22,572, 2,867 and 49,435 on the five datasets, respectively. We rank these unknown lncRNA–protein pairs by their interaction probabilities computed by LPIDF and list the predicted top 100 lncRNA–protein pairs. The results are shown in Fig. 1. In Fig. 1, black dotted lines and sky blue solid lines represent unknown and known LPIs predicted by LPIDF, respectively. Tan hexagons and light sky blue circles denote lncRNAs whose interactions with proteins are unknown and known, respectively. Yellow diamonds denote proteins.
We observe that some identified lncRNA–protein pairs have higher interaction probabilities. For example, the interactions between NONHSAT137627 and P35637, n344749 and Q15717, NONHSAT119864 and Q15717, AthIncRNA18 and Q9LES2, and ZmaLncRNA38 and C4J594 are ranked 33, 97, 85, 161, and 215, respectively. The highly ranked lncRNA–protein pairs need further experimental validation.
The lncRNA FTX (NONHSAT137627) can positively regulate the expression and function of ALG3 in AML cells, especially cell growth and apoptosis related to ADR-resistance. FTX could thus probably be used to reduce therapeutic resistance in AML64. P35637 plays a key role in RNA transport, mRNA stability and synaptic homeostasis in neuronal cells63. The protein has been validated as a therapeutic target for cancers, amyotrophic lateral sclerosis, and Alzheimer’s disease65.
In dataset 2, FTX is observed to interact with Q15717, Q9NZI8, and P26599. Q15717 helps increase the stability of leptin mRNA. Q9NZI8 can regulate neurite outgrowth and neuronal cell migration, promote the adhesion and movement of tumor-derived cells, and prevent the formation of infectious HIV-1 particles64. P26599 can bind to the viral internal ribosome entry site and stimulate the translation mediated by the picornaviruses’ infection site. P35637 has functions similar to those of Q15717, Q9NZI8, and P26599. Based on the “guilt-by-association” theory, we infer that FTX may associate with P35637.
Fractions of true LPIs among the predicted top N LPIs
In addition, we consider the fractions of true LPIs among the inferred top \(N\) LPIs. The results are shown in Table 8, where \(N\) is set to 10, 30, and 50. From Table 8, all the top 10 LPIs predicted by LPIDF are labeled as 1 on the five datasets. The same holds for the predicted top 30 LPIs. For the top 50 LPIs predicted by LPIDF, although only 94% of LPIs are labeled as 1 in dataset 1, all the top 50 LPIs are known on the other four datasets. In summary, LPIDF obtains the best prediction performance based on the fractions of true LPIs among the top 10, 30, and 50 LPIs.
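The fraction of true LPIs among the top \(N\) predictions can be computed as in the following sketch. The function name `fraction_true_in_top_n` and the toy scores are hypothetical and serve only to illustrate the calculation.

```python
def fraction_true_in_top_n(scored_pairs, n):
    """Fraction of known LPIs (label 1) among the n highest-scoring pairs.

    scored_pairs is an iterable of (interaction_probability, label) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)[:n]
    if not ranked:
        return 0.0
    return sum(label for _, label in ranked) / len(ranked)


# Toy predictions, for illustration only.
toy_scores = [(0.99, 1), (0.95, 1), (0.90, 0), (0.80, 1), (0.10, 0)]
top2 = fraction_true_in_top_n(toy_scores, 2)
top4 = fraction_true_in_top_n(toy_scores, 4)
```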
Discussion and conclusion
lncRNAs are widely distributed in various organisms and regulate gene expression at the transcriptional and post-transcriptional levels. However, lncRNAs are difficult to crystallize and only a few have been investigated in detail. Since lncRNAs play an important regulatory role through their interactions with proteins, discovering the proteins that bind to specific lncRNAs is key to identifying the functions and mechanisms of lncRNAs.
In this study, first, we integrate five LPI datasets, of which three are from human and the remaining two are from plants. Second, features of lncRNAs and proteins are extracted from their sequences by four-nucleotide composition and BioSeq2vec, respectively. Finally, a deep forest model with cascade forest structure, LPIDF, is developed to predict LPI candidates. To evaluate the performance of LPIDF, we compare it with four other LPI prediction models on five datasets under three cross validations. The results suggest that LPIDF outperforms the four competing LPI identification methods. We further conduct a series of case studies to find possible associated proteins (or lncRNAs) for new lncRNAs (or proteins) and potential LPIs. The results of the case analyses again demonstrate that LPIDF is a powerful LPI identification method.
LPIDF obtains the best precision, recall, accuracy, F1-score, AUC and AUPR in most settings. We attribute this to the following advantages. First, LPIDF extracts high-quality features of lncRNAs and proteins based on four-nucleotide composition and BioSeq2vec, respectively. Second, deep forest with cascade forest structure can automatically determine the depth of the cascade, thereby reducing the prediction bias produced by parameter tuning. Finally, each layer in the cascade forest receives LPI features from the previous layer and sends its result to the next layer. Since all layers are generated automatically, LPIDF does not require many hyperparameters to be set. The experimental results indicate that LPIDF has a strong ability to find new LPIs.
In addition, the running time of the proposed LPIDF model and the other methods is investigated. The details are shown in Table 9. The time required by LPIDF is much lower than that of CatBoost and DRPLPI.
However, our work has a few limitations. We only consider LPI prediction on human and plant LPI-related datasets. Other species that are evolutionarily closer to human than plants should be investigated. In addition, the predicted LPIs with the highest interaction probabilities should be experimentally validated.
In the future, first, we will integrate more biological information, for example, disease symptom information, drug chemical structures, and miRNA-lncRNA interactions. Second, we will evaluate the prediction performance of the proposed model on other species that are evolutionarily closer to human than plants. Third, CD-Hit66 is a broadly used software tool for reducing sequence redundancy. To improve the performance of sequence analysis algorithms, we will further remove proteins with high sequence similarity in larger datasets based on CD-Hit. Finally, we will conduct experimental validation for the predicted RNA-binding proteins.
Materials and methods
Data preparation
In this study, we integrate five different LPI datasets. Dataset 1 was provided by Li et al.17. Noncoding RNA–protein interaction data were firstly downloaded from the NPInter 2.0 database67. lncRNA and protein sequences were extracted from the NONCODE database 4.068 and the UniProt65 database, respectively. 3,487 LPIs from 938 lncRNAs and 59 proteins were obtained. We then remove lncRNAs and proteins whose sequences are unknown in the UniProt65, NPInter67 and NONCODE68 databases. Finally, we obtain 3,479 LPIs from 935 lncRNAs and 59 proteins.
Dataset 2 was provided by Zheng et al.22. Noncoding RNA–protein interaction, lncRNA and protein sequences were downloaded from NPInter 2.067, NONCODE 4.068, and UniProt65, respectively. They obtained 4,467 LPIs between 1,050 lncRNAs and 84 proteins. Similar to dataset 1, we further remove the lncRNAs and proteins whose sequences are unknown in the NONCODE68, UniProt65, and NPInter67 databases and obtain 3,265 LPIs from 885 lncRNAs and 84 proteins.
Dataset 3 was provided by Zhang et al.18. Experimentally validated LPIs between 1,114 lncRNAs and 96 proteins were extracted based on data resources compiled by Ge et al.69. The sequence and expression data of lncRNAs in 24 human tissues or cell types were downloaded from the NONCODE 4.0 database68. The sequence data of proteins were obtained from the SUPERFAMILY database70. lncRNAs without sequence or expression information and proteins without sequence information were removed. lncRNAs (or proteins) with only one associated protein (or lncRNA) were also removed. Finally, 4,158 LPIs from 990 lncRNAs and 27 proteins were selected.
Dataset 4 contains sequence information of lncRNAs and proteins about Arabidopsis thaliana from the plant lncRNA database (PlncRNADB71). LPI data can be obtained from http://bis.zju.edu.cn/plncRNADB. The dataset contains 948 LPIs from 109 lncRNAs and 35 proteins.
Dataset 5 contains sequence data of lncRNAs and proteins about Zea mays from the PlncRNADB database71. LPI data can be downloaded from http://bis.zju.edu.cn/plncRNADB. The dataset contains 22,133 LPIs from 1,704 lncRNAs and 42 proteins. Table 10 describes the details about the five datasets.
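We describe an LPI network as a matrix \(Y\), whose rows correspond to lncRNAs and whose columns correspond to proteins:
\[{y}_{ij}=\begin{cases}1, & \text{if lncRNA } i \text{ is known to interact with protein } j,\\ 0, & \text{otherwise.}\end{cases}\]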
Feature selection
Feature selection of lncRNAs
Tri-nucleotide composition has been effectively applied to characterize lncRNA sequences72. In this section, we use four-nucleotide composition to extract lncRNA features. Given an lncRNA sequence \(L\) of length \(x\), where \({l}_{i}\in \{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\}\) and \(i=1,2,\ldots ,x\), we use four-tuple letter arrangements, for example, (A, A, A, A), (A, A, A, C), (A, A, A, G), …, (T, T, T, T), giving \({4}^{4}=256\) possible arrangements, to compute the numeric matrix from \(L\).
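A minimal sketch of four-nucleotide composition is shown below. It is not the authors' implementation: the function name `four_nucleotide_composition`, the overlapping sliding window with step 1, the normalization of counts to frequencies, and the conversion of U to T are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"
# All 4^4 = 256 four-tuples in lexicographic order: AAAA, AAAC, ..., TTTT.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=4)]


def four_nucleotide_composition(seq):
    """Return a 256-dimensional vector of four-nucleotide frequencies."""
    seq = seq.upper().replace("U", "T")  # assumption: treat RNA U as T
    counts = dict.fromkeys(KMERS, 0)
    total = 0
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip windows containing ambiguous letters such as N
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in KMERS]


# Toy sequence: the windows are AAAA and AAAC, each with frequency 0.5.
vec = four_nucleotide_composition("AAAAC")
```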
Feature selection of proteins
The encoder-decoder structure can better describe sequence-to-sequence features73,74. Inspired by the sequence representation techniques provided by Sutskever et al.74 and Yi et al.75, we use Biological Sequence-to-vector (BioSeq2vec) representation learning method75 with encoder-decoder structure to characterize amino acids of a protein.
For a protein with sequence length \(L\), first, a sliding window of size \(K\) is used to divide the sequence into \(L-K+1\) segments. Second, the segments are converted into hash values. Finally, the hash values are used as the input of an autoencoder. As shown in Fig. 2, an input vector composed of the hash values is first mapped into a low-dimensional feature vector by an encoder. Second, a decoder reconstructs the input vector from the low-dimensional feature vector. Finally, the low-dimensional feature vector in the final intermediate layer is used as the features of the protein.
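The first two steps (sliding window and hashing) can be sketched as follows; the autoencoder itself is omitted. This is only an illustration: the function name `protein_segment_hashes`, the window size `k=3`, the use of MD5, and the number of hash buckets are assumptions, since the text does not specify which hash function BioSeq2vec uses.

```python
import hashlib


def protein_segment_hashes(seq, k=3, n_buckets=1024):
    """Split a protein sequence into L-K+1 overlapping segments and hash each one."""
    segments = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Assumption: a stable hash (MD5) reduced modulo n_buckets; any
    # deterministic hash function would serve for this illustration.
    return [int(hashlib.md5(s.encode("ascii")).hexdigest(), 16) % n_buckets
            for s in segments]


# Toy sequence of length 8 with K = 3 gives 8 - 3 + 1 = 6 segments.
hashes = protein_segment_hashes("MKVLAAGM", k=3)
```

A stable hash is used instead of Python's built-in `hash` because the latter is randomized per process for strings, which would make the features irreproducible between runs.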
Deep forest with cascade forest structure
In this study, we utilize a Deep Forest with cascade forest structure (LPIDF) to find new LPIs. Deep forest with cascade forest structure, integrating deep forest and ensemble learning, exploits an ensemble-of-ensembles architecture. In the model, deep forest76 conducts layer-by-layer propagation and feature transformation. An ensemble learning-based model, composed of multiple single classifiers, improves LPI prediction more effectively than a single classifier77. For ensemble learning, greater diversity among the single classifiers leads to better improvement. To ensure diversity, in this study, four different types of classifiers, logistic regression, XGBoost Classifier, random forest, and extra trees, are utilized to learn the model.
In the model, class vectors denoting the class distribution are obtained through the four basic classifiers. For a given LPI feature, the class distribution is computed from the proportions with which the feature classifies an lncRNA–protein pair into each of the two classes (positive class and negative class). Suppose that there are three trees in a random forest. As shown in Fig. 3, for an LPI feature \({f}_{i}\), the probabilities that \({f}_{i}\) classifies an lncRNA–protein pair into the two classes (positive class and negative class) in the three trees are \({(0.3750, 0.6250)}^{T}\), \({(0.5556, 0.4444)}^{T}\) and \({(1.0000, 0.0000)}^{T}\), respectively. The probabilities are then averaged, and the final class distribution \({(0.6435, 0.3565)}^{T}\) is obtained for the feature \({f}_{i}\). That is, the probability that \({f}_{i}\) classifies the lncRNA–protein pair as a positive example is \((0.3750+0.5556+1.0000)/3=0.6435\) and the probability that \({f}_{i}\) classifies the lncRNA–protein pair as a negative example is \((0.6250+0.4444+0.0000)/3=0.3565\).
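The averaging in this example can be reproduced with a few lines of Python. The function name `average_class_distribution` is hypothetical; the three per-tree distributions are the ones quoted in the text.

```python
def average_class_distribution(tree_outputs):
    """Average per-tree (positive, negative) probabilities, rounded to 4 decimals."""
    n = len(tree_outputs)
    return tuple(round(sum(col) / n, 4) for col in zip(*tree_outputs))


# Per-tree class distributions from the three-tree example.
trees = [(0.3750, 0.6250), (0.5556, 0.4444), (1.0000, 0.0000)]
class_vector = average_class_distribution(trees)
print(class_vector)  # (0.6435, 0.3565)
```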
Similarly, at each layer, for each LPI feature, logistic regression, XGBoost Classifier, random forest, and extra trees are trained. An 8-dimensional class vector is generated based on two classes and four types of classifiers.
Figure 4 shows a deep forest with cascade structure. As illustrated in Fig. 4, an 800-dimensional feature vector is used as the initial input to the cascade forest. After each layer, the generated eight-dimensional class vector, which carries the most important information, is combined with the original 800-dimensional features and used as the input to the next layer. The details are as follows. First, four different types of classifiers, logistic regression, XGBoost Classifier, random forest, and extra trees, are utilized to train the model. Second, the resulting eight-dimensional class vector is concatenated with the original 800-dimensional feature vector to generate an 808-dimensional vector. Third, this 808-dimensional vector is used as the input to the second layer. Similarly, the second layer produces an eight-dimensional class vector, which is concatenated with the 800-dimensional feature vector, and the resulting 808-dimensional vector is used as the input to the third layer. Finally, when training a new layer, a training set is used to tune the parameters and a validation set is used to evaluate the performance. Feature importance is evaluated through the prediction difference between the original LPI features and the learned ones in the four different types of classifiers. The training process is terminated when the performance is no longer significantly improved. After training, LPI features with zero importance values are removed and the features with valid importance values are kept. A test example (an LPI feature) is passed through each level in turn until the last level.
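The layer-to-layer data flow described above can be sketched as follows. This is a simplified illustration of inference only, not the LPIDF implementation: training, validation-based early stopping, and feature-importance pruning are omitted; the stub classifiers returned by `make_stub` stand in for trained logistic regression, XGBoost, random forest, and extra trees models; and averaging the positive-class probabilities of the last layer to obtain the final score is an assumption.

```python
def cascade_predict(features, layers):
    """Pass one feature vector through a cascade of layers.

    Each layer is a list of callables mapping an input vector to a
    (p_positive, p_negative) pair.
    """
    original = list(features)
    layer_input = original
    class_vector = []
    for layer in layers:
        # Four classifiers x two classes give an eight-dimensional class vector.
        class_vector = [p for clf in layer for p in clf(layer_input)]
        # Concatenate with the original features, e.g. 800 + 8 = 808 dimensions.
        layer_input = original + class_vector
    positives = class_vector[0::2]
    return sum(positives) / len(positives)  # assumed final aggregation


seen_dims = []  # records the input dimension seen by every classifier call


def make_stub(p_pos):
    """Hypothetical stand-in for a trained classifier with a fixed output."""
    def clf(x):
        seen_dims.append(len(x))
        return (p_pos, 1.0 - p_pos)
    return clf


layers = [[make_stub(p) for p in (0.6, 0.7, 0.8, 0.9)] for _ in range(3)]
score = cascade_predict([0.0] * 800, layers)
```

In this toy run the first layer receives 800-dimensional inputs and the second and third layers receive 808-dimensional inputs, matching the dimensions given in the text.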
Figure 5 demonstrates the pipeline of LPIDF. First, five LPI datasets are obtained based on the existing resources. Second, for an lncRNA–protein pair, lncRNA and protein sequences are characterized and concatenated as a vector based on four-nucleotide composition and BioSeq2vec with encoder–decoder structure. Third, the concatenated vector is used as the input to the cascade forest. Finally, the most important features are selected based on layer-to-layer propagation and label of each lncRNA–protein pair is computed.
Data availability
Source codes and datasets are freely available for download at https://github.com/plhhnu/LPIDF.
References
Zhang, W. et al. LncRNA-miRNA interaction prediction through sequence-derived linear neighborhood propagation method with information combination. BMC Genomics 20(11), 1–12 (2019).
Chen, X., Zhu, C. C. & Yin, J. Ensemble of decision tree reveals potential miRNA-disease associations. PLoS Comput. Biol. 15(7), e1007209 (2019).
Chen, X. et al. MicroRNAs and complex diseases: From experimental results to computational models. Brief. Bioinform. 20(2), 515–539 (2019).
Wang, K. C. et al. A long noncoding RNA maintains active chromatin to coordinate homeotic gene expression. Nature 472(7341), 120–124 (2011).
Chen, X. et al. Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief. Bioinform. 18(4), 558–576 (2017).
Ponting, C. P., Oliver, P. L. & Reik, W. Evolution and functions of long noncoding RNAs. Cell 136(4), 629–641 (2009).
Deng, L. et al. Accurate prediction of protein-lncRNA interactions by diffusion and HeteSim features across heterogeneous network. BMC Bioinform. 19(1), 1–11 (2018).
Liu, H. et al. Predicting lncRNA–miRNA interactions based on logistic matrix factorization with neighborhood regularized. Knowl.-Based Syst. 191, 105261 (2020).
Chen, X. et al. Computational models for lncRNA function prediction and functional similarity calculation. Brief. Funct. Genomics 18(1), 58–82 (2019).
Li, G. et al. Prediction of lncRNA-disease associations based on network consistency projection. IEEE Access 7, 58849–58856 (2019).
Wang, B. et al. Imbalance data processing strategy for protein interaction sites prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. (2019).
Zhang, Z. et al. KATZLGO: Large-scale prediction of LncRNA functions by using the KATZ measure based on multiple networks. IEEE/ACM Trans. Comput. Biol. Bioinf. 16(2), 407–416 (2017).
Wang, K. C. & Chang, H. Y. Molecular mechanisms of long noncoding RNAs. Mol. Cell 43(6), 904–914 (2011).
Kopp, F. & Mendell, J. T. Functional classification and experimental dissection of long noncoding RNAs. Cell 172(3), 393–407 (2018).
Peng, L. et al. Probing lncRNA–protein interactions: Data repositories, models, and algorithms. Front. Genet. 10, 11 (2019).
Ferre, F., Colantoni, A. & Helmer-Citterich, M. Revealing protein–lncRNA interaction. Brief. Bioinform. 17(1), 106–116 (2016).
Li, A., Ge, M., Zhang, Y., et al. Predicting long noncoding RNA and protein interactions using heterogeneous network model. BioMed. Res. Int. 2015 (2015).
Zhang, W. et al. The linear neighborhood propagation method for predicting long non-coding RNA–protein interactions. Neurocomputing 273, 526–534 (2018).
Zhao, Q. et al. The bipartite network projection-recommended algorithm for predicting long non-coding RNA-protein interactions. Mol. Ther.-Nucleic Acids 13, 464–471 (2018).
Xie, G. et al. LPI-IBNRA: Long non-coding RNA–protein interaction prediction based on improved bipartite network recommender algorithm. Front. Genet. 10, 343 (2019).
Zhu, R. et al. ACCBN: Ant-colony-clustering-based bipartite network method for predicting long non-coding RNA–protein interactions. BMC Bioinform. 20(1), 16 (2019).
Zheng, X. et al. Fusing multiple protein-protein similarity networks to effectively predict lncRNA–protein interactions. BMC Bioinform. 18(12), 420 (2017).
Deng, L., Yang, W. & Liu, H. PredPRBA: Prediction of protein-RNA binding affinity using gradient boosted regression trees. Front. Genet. 10, 637 (2019).
Zhang, T., Wang, M., Xi, J., et al. LPGNMF: Predicting long non-coding RNA and protein interaction using graph regularized nonnegative matrix factorization. IEEE/ACM Trans. Comput. Biol. Bioinform. (2018).
Liu, H. et al. LPI-NRLMF: lncRNA–protein interaction prediction by neighborhood regularized logistic matrix factorization. Oncotarget 8(61), 103975 (2017).
Zhao, Q. et al. IRWNRLPI: Integrating random walk and neighborhood regularized logistic matrix factorization for lncRNA–protein interaction prediction. Front. Genet. 9, 239 (2018).
Liu, Q. et al. Hot spot prediction in protein-protein interactions by an ensemble system. BMC Syst. Biol. 12(9), 89–99 (2018).
Shen, C. et al. LPI-KTASLP: Prediction of lncRNA–protein interaction by semi-supervised link learning with multivariate information. IEEE Access 7, 13486–13496 (2019).
Hu, H. et al. HLPI-ensemble: Prediction of human lncRNA–protein interactions based on ensemble strategy. RNA Biol. 15(6), 797–806 (2018).
Zhang, W. et al. SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting lncRNA–protein interactions. PLoS Comput. Biol. 14(12), e1006616 (2018).
Fan, X. N. & Zhang, S. W. LPI-BLS: Predicting lncRNA–protein interactions with a broad learning system-based stacked ensemble classifier. Neurocomputing 370, 88–93 (2019).
Wekesa, J. S., Meng, J. & Luan, Y. A deep learning model for plant lncRNA–protein interaction prediction with graph attention. Mol. Genet. Genomics 2020, 1–12 (2020).
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794 (2016).
Hasan, M. M. et al. Meta-i6mA: An interspecies predictor for identifying DNA N6-methyladenine sites of plant genomes by exploiting informative features in an integrative machine-learning framework. Brief. Bioinform. 22(3), bbaa202 (2021).
Prokhorenkova, L., Gusev, G., Vorobev, A., et al. CatBoost: Unbiased boosting with categorical features. in Advances in Neural Information Processing Systems. 6638–6648 (2018).
Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 26(1), 217–222 (2005).
Hasan, M. M. et al. HLPpred-Fuse: Improved and robust prediction of hemolytic peptide and its activity by fusing multiple feature representation. Bioinformatics 36(11), 3350–3356 (2020).
Wekesa, J. S., Meng, J. & Luan, Y. Multi-feature fusion for deep learning to predict plant lncRNA–protein interaction. Genomics (2020).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Zhang, T. et al. SNHG3 correlates with malignant status and poor prognosis in hepatocellular carcinoma. Tumor Biol. 37(2), 2379–2385 (2016).
Liu, L., Ni, J. & He, X. Upregulation of the long noncoding RNA SNHG3 promotes lung adenocarcinoma proliferation. Dis. Mark. 2018 (2018).
Zhang, C. et al. LncRNA SNHG3 promotes clear cell renal cell carcinoma proliferation and migration by upregulating TOP2A. Exp. Cell Res. 384(1), 111595 (2019).
Sun, B. et al. Long non-coding RNA SNHG3, induced by IL-6/STAT3 transactivation, promotes stem cell-like properties of gastric cancer cells by regulating the miR-3619-5p/ARL2 axis. Cell Oncol. 44(1), 179–192 (2021).
Yang, Q. et al. Long non-coding RNA Snhg3 protects against hypoxia/ischemia-induced neonatal brain injury. Exp. Mol. Pathol. 112, 104343 (2020).
Duan, Y. et al. lncRNA SNHG3 acts as a novel tumor suppressor and regulates tumor proliferation and metastasis via AKT/mTOR/ERK pathway in papillary thyroid carcinoma. J. Cancer 11(12), 3492 (2020).
Hong, L. et al. Upregulation of SNHG3 expression associated with poor prognosis and enhances malignant progression of ovarian cancer. Cancer Biomark. 22(3), 367–374 (2018).
Li, N. A., Zhan, X. & Zhan, X. The lncRNA SNHG3 regulates energy metabolism of ovarian cancer by an analysis of mitochondrial proteomes. Gynecol. Oncol. 150(2), 343–354 (2018).
Dai, G. et al. LncRNA SNHG3 promotes bladder cancer proliferation and metastasis through miR-515-5p/GINS2 axis. J. Cell Mol. Med. 24(16), 9231–9243 (2020).
Peng, L., Zhang, Y. & Xin, H. lncRNA SNHG3 facilitates acute myeloid leukemia cell growth via the regulation of miR-758-3p/SRGN axis. J. Cell. Biochem. 121(2), 1023–1031 (2020).
Yin, Q., Wu, A. & Liu, M. Plasma long non-coding RNA (lncRNA) GAS5 is a new biomarker for coronary artery disease. Med. Sci. Monit. 23, 6042 (2017).
Han, M. H. et al. Expression of the long noncoding RNA GAS5 correlates with liver fibrosis in patients with nonalcoholic fatty liver disease. Genes 11(5), 545 (2020).
Li, X. et al. Overexpression of GAS5 inhibits abnormal activation of Wnt/β-catenin signaling pathway in myocardial tissues of rats with coronary artery disease. J. Cell Physiol. 234(7), 11348–11359 (2019).
Li, H. et al. Association of genetic variants in lncRNA GAS5/miR-21/mTOR axis with risk and prognosis of coronary artery disease among a Chinese population. J. Clin. Lab. Anal. 34(10), e23430 (2020).
Moharamoghli, M. et al. The expression of GAS5, THRIL, and RMRP lncRNAs is increased in T cells of patients with rheumatoid arthritis. Clin. Rheumatol. 38(11), 3073–3080 (2019).
Xu, W. et al. Long noncoding RNA GAS5 promotes microglial inflammatory response in Parkinson’s disease by regulating NLRP3 pathway through sponging miR-223-3p. Int. Immunopharmacol. 85, 106614 (2020).
Shen, J. et al. Serum HOTAIR and GAS5 levels as predictors of survival in patients with glioblastoma. Mol. Carcinog. 57(1), 137–141 (2018).
Salvatori, I. et al. Differential toxicity of TAR DNA-binding protein 43 isoforms depends on their submitochondrial localization in neuronal cells. J. Neurochem. 146(5), 585–597 (2018).
Kino, T. et al. Noncoding RNA gas5 is a growth arrest- and starvation-associated repressor of the glucocorticoid receptor. Sci. Signal. 3(107), ra8 (2010).
Zhang, X. F., Ye, Y. & Zhao, S. J. LncRNA Gas5 acts as a ceRNA to regulate PTEN expression by sponging miR-222-3p in papillary thyroid carcinoma. Oncotarget 9(3), 3519–3530 (2017).
Bhardwaj, A. et al. Characterizing TDP-43 interaction with its RNA targets. Nucleic Acids Res. 41(9), 5062–5074 (2013).
Prakash, A. et al. Structural heterogeneity in RNA recognition motif 2 (RRM2) of TAR DNA-binding protein 43 (TDP-43): Clue to amyotrophic lateral sclerosis. J. Biomol. Struct. Dyn. 39(1), 357–367 (2021).
Endo, R. et al. TAR DNA-binding protein 43 and disrupted in schizophrenia 1 coaggregation disrupts dendritic local translation and mental function in frontotemporal lobar degeneration. Biol. Psychiat. 84(7), 509–521 (2018).
Tollervey, J. R. et al. Characterizing the RNA targets and position-dependent splicing regulation by TDP-43. Nat. Neurosci. 14(4), 452–458 (2011).
Wang, A. et al. A single N-terminal phosphomimic disrupts TDP-43 polymerization, phase separation, and RNA splicing. EMBO J. 37(5), e97452 (2018).
UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 47(D1), D506–D515 (2019).
Fu, L. et al. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23), 3150–3152 (2012).
Yuan, J. et al. NPInter v2.0: An updated database of ncRNA interactions. Nucleic Acids Res. 42(D1), D104–D108 (2014).
Xie, C. et al. NONCODEv4: Exploring the world of long non-coding RNA genes. Nucleic Acids Res. 42(D1), D98–D103 (2014).
Ge, M., Li, A. & Wang, M. A bipartite network-based method for prediction of long non-coding RNA–protein interactions. Genomics Proteomics Bioinform. 14(1), 62–71 (2016).
Pandurangan, A. P. et al. The SUPERFAMILY 2.0 database: A significant proteome update and a new webserver. Nucleic Acids Res. 47(D1), D490–D494 (2019).
Bai, Y. et al. PlncRNADB: A repository of plant lncRNAs and lncRNA-RBP protein interactions. Curr. Bioinform. 14(7), 621–627 (2019).
Jani, M. R. et al. iRecSpot-EF: Effective sequence based features for recombination hotspot prediction. Comput. Biol. Med. 103, 17–23 (2018).
Cho, K., Van Merriënboer, B., Gulcehre, C., et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. Adv. Neural. Inf. Process. Syst. 27, 3104–3112 (2014).
Yi, H. C., You, Z. H., Su, X. R., et al. A unified deep biological sequence representation learning with pretrained encoder-decoder model. in International Conference on Intelligent Computing. 339–347 (Springer, 2020).
Zhou, Z. H. & Feng, J. Deep forest. Natl. Sci. Rev. 6(1), 74–86 (2019).
Breiman, L. Random forests. Mach. Learn. 45(1), 5–32 (2001).
Acknowledgements
We would like to thank all authors of the cited references.
Funding
This research was funded by the Natural Science Foundation of China (Grants 62072172 and 61803151).
Author information
Contributions
X.T., L.S., Z.W., L.Z., and L.P. designed the LPIDF method. X.T. and L.S. ran LPIDF. L.P. wrote the original manuscript. L.P. and L.Z. revised the original draft, discussed the proposed method, and contributed further research. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Tian, X., Shen, L., Wang, Z. et al. A novel lncRNA–protein interaction prediction method based on deep forest with cascade forest structure. Sci Rep 11, 18881 (2021). https://doi.org/10.1038/s41598-021-98277-1
DOI: https://doi.org/10.1038/s41598-021-98277-1