A novel lncRNA–protein interaction prediction method based on deep forest with cascade forest structure

Long noncoding RNAs (lncRNAs) regulate many biological processes by interacting with corresponding RNA-binding proteins. The identification of lncRNA–protein Interactions (LPIs) is significantly important to well characterize the biological functions and mechanisms of lncRNAs. Existing computational methods have been effectively applied to LPI prediction. However, the majority of them were evaluated only on one LPI dataset, thereby resulting in prediction bias. More importantly, part of models did not discover possible LPIs for new lncRNAs (or proteins). In addition, the prediction performance remains limited. To solve with the above problems, in this study, we develop a Deep Forest-based LPI prediction method (LPIDF). First, five LPI datasets are obtained and the corresponding sequence information of lncRNAs and proteins are collected. Second, features of lncRNAs and proteins are constructed based on four-nucleotide composition and BioSeq2vec with encoder-decoder structure, respectively. Finally, a deep forest model with cascade forest structure is developed to find new LPIs. We compare LPIDF with four classical association prediction models based on three fivefold cross validations on lncRNAs, proteins, and LPIs. LPIDF obtains better average AUCs of 0.9012, 0.6937 and 0.9457, and the best average AUPRs of 0.9022, 0.6860, and 0.9382, respectively, for the three CVs, significantly outperforming other methods. The results show that the lncRNA FTX may interact with the protein P35637 and needs further validation.

Noncoding RNAs regulate the majority of biological processes associated with development, differentiation, and metabolism in organisms 1 . In contrast to small noncoding RNAs (i.e., miRNAs), which are highly conserved and regulate transcriptional and posttranscriptional gene silencing 2,3 , long noncoding RNAs (lncRNAs), as one type of transcribed RNA molecules, are poorly conserved and control gene expression based on various mechanisms [4][5][6] . lncRNAs have close linkages with posttranscriptional gene regulation by regulating biological processes including protein synthesis, RNA maturation and transportation, and transcriptional gene silencing 7,8 . Although a few lncRNAs have been well studied, the biological functions of the majority of lncRNAs remain enigmatic 9 . Recent studies demonstrate that most of lncRNAs regulate various biological activities through specific associations with chromatin, for example, interacting with corresponding RNA-binding proteins [10][11][12] . Therefore, identification of potential lncRNA-protein Interactions (LPIs) is vital to understand lncRNAs' biological functions and mechanisms.
To find new LPIs, many experimental methods were designed 13,14 . However, wet experiments for finding possible LPIs are costly and time-consuming. Computational methods are thus developed as a silver-bullet solution to LPI prediction. This type of methods is classified into two main categories: network-based methods and machine learning-based methods 15,16 .
Network-based LPI prediction methods, for example, random walk with restart-based model 17 , linear neighborhood propagation algorithm 18 , bipartite network projection-based recommendation method [19][20][21] , HeteSim algorithm 22 , firstly computed lncRNA similarity and protein similarity based on related biological data, and then integrated similarity matrix to heterogeneous lncRNA-protein network, finally designed network propagation algorithms to score for unknown lncRNA-protein pairs. Network-based LPI prediction methods successfully found part of LPIs, however, the type of methods cannot be applied to predict linkage information for an orphan lncRNA or protein.
Machine learning-based LPI identification methods first extracted features of lncRNAs and proteins and then designed a novel machine learning model to compute interaction probabilities for lncRNA-protein pairs. Classical machine learning-based LPI prediction models include matrix factorization-based methods and ensemble learning-based methods. Matrix factorization-based methods represented LPI prediction as a recommender OPEN School of Computer Science, Hunan University of Technology, Zhuzhou 412007, China. * email: zhoulq11@163. com; plhhnu@163.com

Results
We perform a series of experiments to investigate the prediction performance of our proposed LPIDF method.
Evaluation metrics. In this study, precision, recall, accuracy, F1-score, AUC and AUPR are used to evaluate the performance of LPIDF. Precision, recall, accuracy, and F1-score are defined as follows.
where TP, FP, TN, and FN denote the predicted number of true LPIs, false LPIs, true non-LPIs, and false non-LPIs. AUC and AUPR denote the average areas under the ROC curve and the precision-recall curve, respectively. The experiments are repeated for 20 times and the average performance from the 20 rounds is computed as the final performance.
Experimental settings. In the study, we conduct three different experimental settings.
Five-fold Cross Validation 1 (CV1): Cross validation on lncRNAs, that is, random rows (i.e., lncRNAs) in an LPI matrix Y are masked for testing.
Five-fold Cross Validation 2 (CV2): Cross validation on proteins, that is, random columns (i.e., proteins) in an LPI matrix Y are masked for testing.
Five-fold Cross validation 3 (CV3): Cross validation on lncRNA-protein pairs, that is, random lncRNA-protein pairs in an LPI matrix Y are masked for testing.
Under CV1, in each round, 80% of lncRNAs in an LPI network Y are screened as training set and the remaining is represented as testing set. Under CV2, in each round, 80% of proteins in Y are screened as training set and the remaining is represented as testing set. Under CV3, in each round, 80% of lncRNA-protein pairs in Y are represented as training set and the remaining is represented as testing set. The three cross validations refer to LPI identification for (1) new (unknown) lncRNAs (lncRNAs whose interaction information is unknown), (2) new proteins, and (3) lncRNA-protein pairs, respectively.
Comparison with four state-of-the-art methods. We compare our proposed LPIDF method with four state-of-the-art association identification methods to evaluate the prediction ability and robustness of LPIDF, that is, XGBoost 33,34 , Categorical Boosting (CatBoost) 35 , random forest 36,37 , and DRPLPI 38 . The above methods are classical machine learning models and obtained wide applications in various areas. XGBoost 33,34 is a scalable and end-to-end tree boosting-based model. CatBoost 35 is a novel gradient boosting-based technique and can effectively integrate ordered boosting and processing categorical features. Random forest 36,37 is composed of multiple decision trees and each tree is independently trained on a random subset. DRPLPI 38 exploited a multi-head self-attention model to extract high quality LPI features based on long short-term memory encoderdecoder mechanism. In the experiments, we randomly select the same number of negative LPIs as positive LPIs from unknown lncRNA-protein pairs to decrease the overfitting problem produced by data imbalance.
In random forest, the number of trees is set as 70, and the minimum number used to split samples is set as 5. In CatBoost, the maximum number of trees is set as 150, the maximum depth as 15, and the learning rate as  39 . Table 1 shows the precision, recall, accuracy, F1-score, AUC and AUPR values computed by LPIDF and other four methods under CV1. As shown in Table1, LPIDF achieves the highest average precision, accuracy, F1-score, and AUPR over all datasets, remarkably outperformed other four competing LPI prediction methods. Although the average recall and AUC computed by LPIDF are slightly lower than random forest and DRPLPI, LPIDF obtains the best average AUPR. The computed average AUPR obtained by LPIDF is 0.9022, which is 0.96%, 2.10%, 0.02% and 0.63% higher than XGBoost, CatBoost, random forest, and DRPLPI, respectively. Compared to AUC, AUPR is one more important measurement metric. Therefore, LPIDF can effectively find potential proteins interacting with a new lncRNA. Table 2 gives the comparison results under CV2. In particular, LPIDF computes the best average precision, recall, accuracy, F1-score, AUC and AUPR over all datasets. Over all datasets, LPIDF investigates the best average AUC value of 0.6937, which is 4.80%, 10.81%, 1.17% and 0.91% better than XGBoost, CatBoost, random forest, and DRPLPI, respectively. More importantly, LPIDF calculates the highest average AUPR value of 0.6860, which is 2.17% and 2.65% higher than the second-best and third-best methods, respectively. In summary, under CV2, LPIDF remarkably improves LPI prediction performance compared to the other four prediction methods and is statistically significant in identifying possible lncRNAs for a new protein.
The prediction results computed under CV3 are shown in Table 3. In particular, LPIDF outperforms other LPI prediction methods over all datasets in terms of all six measurements. For example, LPIDF achieves the best www.nature.com/scientificreports/ average AUC value of 0.9457, which is 1.72%, 6.39%, 0.87%, and 0.97% better than XGBoost, CatBoost, random forest, and DRPLPI, respectively. In addition, for the AUPR metric, LPIDF obtains the best average AUPR of 0.9382, which is 0.88% and 1.20% superior to the second-best and third-best methods, respectively. It can be seen that the LPIDF can effectively predict potential LPIs.
Case study. After confirming the performance of our proposed LPIDF method, we further identify possible LPIs, especially predict interaction information for new lncRNAs and proteins.
Finding possible proteins interacting with new lncRNAs. In this section, we intend to find potential proteins interacting with new lncRNAs. Small Nucleolar RNA Host Gene 3 (SNHG3) and Growth Arrest-Special transcript 5 (GAS5) are masked all association information and taken as new lncRNAs. LPIDF is then applied to identify possible proteins interacting with the two lncRNAs. SNHG3 is an RNA Gene affiliated with the lncRNA class. It may have dense correlation with various cancers, for example, hepatocellular carcinoma 40 , non-small-cell lung cancer 41 , clear cell renal cell carcinoma 42 , gastric cancer 43 , hypoxic-ischemic brain damage 44 , papillary thyroid carcinoma 45 , ovarian cancer 46,47 , bladder cancer 48 , and acute myeloid leukemia 49 . Table 4 shows the predicted top 5 proteins related to SNHG3 with the highest interaction probabilities on three human datasets.  Table 4 show that SNHG3-protein interaction pairs predicted by LPIDF are rank advanced in all other four methods. We predict that O00425 may interact with SNHG3 (ranked as 4) in dataset 3, which has been validated in dataset 1. In addition, we observe that Q9NUL5 and Q13148 may interact with SNHG3. Among all possible 27 proteins, the interaction between Q9NUL5 and SNHG3 is ranked as 1 by all five LPI prediction methods. The association between Q13148 and SNHG3 is ranked as 5, 7, 8, 5, and 4 by LPIDF, XGBoost, random forest, CatBoost, and DRPLIP, respectively. The facts demonstrate the powerful prediction performance of LPIDF.
GAS5 can prevent glucocorticoid receptors from being activated and thus control transcriptional activities from its target genes. It is inferred as a potential tumor suppressor and has close correlations with coronary artery disease 50 , cirrhotic livers 51 , coronary artery disease 52,53 , rheumatoid arthritis 54 , Parkinson's disease 55 , and primary glioblastoma 56 . Table 5 lists the predicted top 5 proteins interacting with GAS3 with the highest association scores on three human datasets. In dataset 3, although the interactions between GAS5 and Q9NZI8 and Q9Y6M1 are unknown, we find that the two LPIs are ranked as 5 and 4 by LPIDF, respectively. More importantly, in datasets 1 and 2, it can be seen that Q9NZI8 and Q9Y6M1 show higher interaction probabilities with GAS5 and the two LPIs have been reported. In addition, O00425 is inferred to interact with GAS5 with the ranking of 2 in dataset 3 and has been validated in dataset 1. These facts again suggest that LPIDF can effectively find possible proteins associated with a new lncRNA. www.nature.com/scientificreports/ Finding potential lncRNAs interacting with new proteins. We continue to uncover lncRNAs interacting with a new protein on three human datasets. Q13148 and Q9HCK5 are masked all associated lncRNAs and taken as new proteins. LPIDF is then used to find possible associated lncRNAs for the two proteins. Q13148 is an RNA-binding protein involved in RNA biogenesis and processing and various neurodegenerative diseases [57][58][59][60] . In addition, it also participates in the formation and regeneration of normal skeletal muscles and plays an important role in keeping the circadian clock periodicity 59,60 . Its second RNA recognition motif has been reported as a major promoter towards aggregation and resultant toxicity 61 . Frontotemporal lobar degeneration associated with Q13148 aggregation is depicted as progressive neuronal atrophy in cerebral cortex 62 . Table 6 illustrates the predicted top 5 lncRNAs associated with Q13148 on three human datasets. From Table 6, we can investigate that all predicted top 5 lncRNAs interacting with Q13148 are known in the three datasets. Table 7 lists the identified top 5 lncRNAs associated with Q9HCK5 on three human datasets. Q9HCK5 is required for RNA-mediated genes' silencing, RNA-directed transcription and human hepatitis delta virus replication 63 . Table 7 demonstrates that all predicted top 5 LPIs for Q9NCK5 are given in the three datasets. In summary, LPIDF may be appropriate for LPI identification for a new protein.
Finding new LPIs based on known LPIs. The number of lncRNA-protein pairs with unknown interaction information is 51,686, 71,075, 22,572, 2,867 and 49,435 on the five datasets, respectively. We rank these unknown lncRNA-protein pairs based on their interaction probabilities computed by LPIDF and list the predicted top 100 lncRNA-protein pairs. The results are shown in Fig. 1. In Fig. 1, black dotted lines and sky blue solid lines represent unknown and known LPIs predicted by LPIDF, respectively. Tan hexagons and light sky blue circu-  www.nature.com/scientificreports/ lars denote lncRNAs whose interactions with proteins are unknown and known, respectively. Yellow diamonds denote proteins. We observe that some identified lncRNA-protein pairs have higher interaction probabilities. For example, the interactions between NONHSAT137627 and P35637, n344749 and Q15717, NONHSAT119864 and Q15717, AthIncRNA18 and Q9LES2, and ZmaLncRNA38 and C4J594 are ranked as 33, 97, 85, 161, and 215, respectively. The lncRNA-protein pairs with advanced ranks need further experimental validation.
The lncRNA FTX (NONHSAT137627) can positively regulate the expression and function of ALG3 in AML cells, especially cell growth and apoptosis related to ADR-resistance. FTX could thus probably be applied to reduce therapeutic resistance in AML 64 . P35637 plays a key role in RNA transport, mRNA stability and synaptic homeostasis in neuronal cells 63 . The protein has been validated to be target of the treatment of cancers, amyotrophic lateral sclerosis, and Alzheimer's disease 65 .
In dataset 2, it is observed that FTX interacts with Q15717, Q9NZI8, and P26599. Q15717 helps in increasing the leptin mRNA's stability. Q9NZI8 can regulate neurite outgrowth and neuronal cell migration, promote tumor-derived cells' adhesion and movement, and prevent infectious HIV-1 particles' formation 64 . P26599 can bind to the viral internal ribosome entry site and stimulate the translation mediated by the picornaviruses' infection site. Q35637 has similar functions with Q15717, Q9NZI8, and P26599. Based on the "guilt-by-association" theory, we infer that FTX may associate with P35637.
Fractions of true LPIs among the predicted top N LPIs. In addition, we consider the fractions of true LPIs among the inferred top N LPIs. The results are shown in Table 8. N is selected as 10, 30, and 50, respectively. From Table 8, we can find that all the predicted top 10 LPIs by LPIDF have been labeled as 1 on five datasets. Similar  www.nature.com/scientificreports/

Discussion and conclusion
lncRNAs are widely distributed in various organisms and regulate gene expression on transcriptome and posttranscriptome. However, lncRNAs are difficult to crystallize and only several lncRNAs have been investigated. Since lncRNAs play an important regulatory role in protein molecules, the discovery of proteins binding to specific lncRNAs becomes an issue to identify lncRNAs' functions and mechanisms. In this study, first, we integrate five LPI datasets where three datasets are from human and the remaining is from plants. Second, features of lncRNAs and proteins are selected by four-nucleotide composition and BioSe-q2vec based on their sequences, respectively. Finally, a deep forest model with cascade forest structure, LPIDF, is developed to predict LPI candidates. To evaluate the performance of LPIDF, we compare our proposed LPIDF method with other four LPI prediction models on five datasets under three cross validations. The results suggest that LPIDF remarkably outperforms other four competing LPI identification methods. We further conduct a series of case studies to find possible associated proteins (or lncRNAs) for new lncRNAs (or proteins) and potential LPIs. The results from case analyses again demonstrate that LPIDF is a powerful LPI identification method.
LPIDF can compute the optimal precision, recall, accuracy, F1-score, AUC and AUPR. We think that it may be attribute to the following advantages. First, LPIDF selects high quality features of lncRNAs and proteins based on four-nucleotide composition and BioSeq2vec, respectively. Second, deep forest with cascade forest structure could automatically determine the depths of cascade forest, thereby reducing prediction bias produced by parameter tuning. Finally, each layer in the cascade forest receives LPI features from the last layer and sends its result to the next layer. Since all layers are automatically generated, LPIDF need not set too many hyperparameters. The predominant experimental consequences indicate that LPIDF has a powerful ability in excavating new LPIs.
In addition, the time required for the proposed LPIDF model and other methods is investigated. The details are shown in Table 9. It can be seen that the time required for LPIDF is much lower than ones of CatBoost and DRPLPI.
However, our work has a few limitations. We only consider LPI prediction on human and plant LPI-related datasets. Indeed, other species closer human evolutionarily than plants should be investigated. In addition, the predicted LPIs with the highest interaction probabilities should be experimentally validated.   Top 10 60  100  90  100  100   Top 30 70  90  90  100  100   Top 50 74  88  88  94  94   Dataset 2   Top 10 70  70  90  90  100   Top 30 77  87  87  93  100   Top 50 80  86  86  92  100   Dataset 3   Top 10 80  90  90  100  100   Top 30   www.nature.com/scientificreports/ In the future, first, we will integrate more biological information, for example, disease symptom information, drug chemical structure, miRNA-lncRNA interactions. Second, we will consider the prediction performance of the proposed model on other species closer human evolutionarily than plants. Third, CD-Hit 66 is one broadly used software for reducing sequence redundancy. To improve the performance of sequence analyses algorithms, we will further remove proteins with high sequence similarity in larger datasets based on CD-Hit. Finally, we will further conduct experimental validation for the predicted RNA-binding proteins.

Materials and methods
Data preparation. In this study, we integrate five different LPI datasets. Dataset 1 was provided by Li et al. 17 .
Noncoding RNA-protein interaction data were firstly downloaded from the NPInter 2.0 database 67 . lncRNA and protein sequences were extracted from the NONCODE database 4.0 68 and the UniProt 65 database, respectively. 3,487 LPIs from 938 lncRNAs and 59 proteins were obtained. We then remove lncRNAs and proteins whose sequences are unknown in the UniProt 65 , NPInter 67 and NONCODE 68 databases. Finally, we obtain 3,479 LPIs from 935 lncRNAs and 59 proteins. Dataset 2 was provided by Zheng et al. 22 . Noncoding RNA-protein interaction, lncRNA and protein sequences were downloaded from NPInter 2.0 67 , NONCODE 4.0 68 , and UniProt 65 , respectively. They obtained 4,467 LPIs between 1,050 lncRNAs and 84 proteins. Similar to dataset 1, we further remove the lncRNAs and proteins whose sequences are unknown in the NONCODE 68 , UniProt 65 , and NPInter 67 databases and obtain 3,265 LPIs from 885 lncRNAs and 84 proteins. Dataset 3 was provided by Zhang et al. 18 . Experimentally validated LPIs between 1,114 lncRNAs and 96 proteins were extracted based on data resources compiled by Ge et al. 69 . The sequence and expression data of lncRNAs in 24 human tissues or cell types were downloaded from the NONCODE 4.0 database 68 . The sequence data of proteins were obtained from the SUPERFAMILY database 70 . lncRNAs without sequence or expression information and proteins without sequence information were removed. lncRNA (or protein) with only one associated protein (or lncRNA) were still removed. Finally, 4,158 LPIs from 990 lncRNAs and 27 proteins were selected.
Dataset 5 contains sequence data of lncRNAs and proteins about Zea mays from the PlncRNADB database 71 . LPI data can be downloaded from http:// bis. zju. edu. cn/ plncR NADB. The dataset contains 22,133 LPIs from 1,704 lncRNAs and 42 proteins. Table 10 describes the details about the five datasets.
We describe an LPI network as a matrix Y: Feature selection of proteins. The encoder-decoder structure can better describe sequence-to-sequence features 73,74 . Inspired by the sequence representation techniques provided by Sutskever et al. 74 and Yi et al. 75 , we use Biological Sequence-to-vector (BioSeq2vec) representation learning method 75 with encoder-decoder structure to characterize amino acids of a protein.
For a protein with sequence length of L , first, a sliding window of size K is used to divide the sequence into L − K + 1 segments. Second, the segments are converted into hash values. Finally, the hash values are used as input of an autoencoder. As shown in Fig. 2, an input vector composed of the hash values is first mapped into a low-dimensional feature vector by an encoder. Second, the low-dimensional feature vector is reproduced as an (5) Y ij = 1 if lncRNAl i interacts with protein p j 0 otherwise In the model, class vectors used to denote the class distribution are obtained through the four basic classifiers. For a given LPI feature, the class distribution first calculated the proportions that the feature classifies an lncRNA-protein pair as two classes (positive class and negative class), respectively. Suppose that there are three trees in a random forest. As shown in Fig. 3, for a LPI feature f i , the probabilities that f i classify an lncRNA-protein pair as two classes (positive class and negative class) in the three trees are (0.3750, 0.6250) T , (0.5556, 0.4444) T and (1.0000, 0.0000) T , respectively. The probabilities are then summed up and averaged and thus the final class distribution (0.6435, 0.3565) T can be computed based on the feature f i . That is, the probability that f i classify the lncRNA-protein pair as positive example is (0.3750 + 0.5556 + 1.0000)/3 = 0.6435 and the probability that f i classify the lncRNA-protein pair as negative sample is (0.6250 + 0.4444 + 0.0000)/3 = 0.3565.
Similarly, at each layer, for each LPI feature, logistic regression, XGBoost Classifier, random forest, and extra trees are trained. An 8-dimensional class vector is generated based on two classes and four types of classifiers. Figure 4 shows a deep forest with cascade structure. As illustrates in Fig. 4, an 800-dimensional feature vector is used as the initial input to the cascade forest. After each layer, the generated eight-dimensional class vector with the most important information combining the old 800-dimensional features are used as the input at the next layer. The details are shown as follows. First, four different types of classifiers, logistic regression, XGBoost Classifier, random forest, and extra trees, are utilized to train the model. Second, an eight-dimensional class vector is picked and concatenated with the original 800-dimensional feature vector to generate an 808-dimensional vector. Third, an 808-dimensional class vector is used as the input at the second layer. Similarly, the second layer produces an eight-dimensional class vector, which will be concatenated with the 800-dimensional feature vector. And another 808-dimensional class vector is applied as the input at the third layer. Finally, when training on a new layer, a training set is used to tune the parameters and a validation set is utilized to evaluate the performance. The feature importance will be evaluated through the prediction difference between the original LPI features and the learned ones in the four different types of classifiers. The training process will be terminated when the performance is not significantly improved. After training, LPI features with zero importance values are removed and the features with valid importance values are kept. For a test example (an LPI feature), it will be represented by each level until the last level. Figure 5 demonstrates the pipeline of LPIDF. First, five LPI datasets are obtained based on the existing resources. Second, for an lncRNA-protein pair, lncRNA and protein sequences are characterized and concatenated as a vector based on four-nucleotide composition and BioSeq2vec with encoder-decoder structure. Third, the concatenated vector is used as the input to the cascade forest. Finally, the most important features are selected based on layer-to-layer propagation and label of each lncRNA-protein pair is computed.