Abstract
Virus‒host protein‒lncRNA interaction (VHPLI) predictions are critical for decoding the molecular mechanisms of viral pathogens and host immune processes. Although protein‒lncRNA interactions have been predicted in both plants and animals, they have not been extensively studied in viruses. For the first time, we propose a deep learning-based approach named CBIL‒VHPLI, which consists mainly of convolutional neural network and bidirectional long short-term memory network modules combined with transfer learning, to predict viral–host protein‒lncRNA interactions. The models were first trained on large and diverse datasets (including plants, animals, etc.). Protein sequence features were extracted using a k-mer method combined with the one-hot encoding and composition–transition–distribution (CTD) methods, and lncRNA sequence features were extracted using a k-mer method combined with the one-hot encoding and Z curve methods. The results obtained on three independent external validation datasets showed that the pre-trained CBIL‒VHPLI model performed the best, with an accuracy of approximately 0.9. Pretraining was followed by transfer learning on a viral protein–human lncRNA dataset, and the fine-tuning results showed that the accuracy of CBIL‒VHPLI was 0.946, which was significantly greater than that of the previous models. The final case study results showed that CBIL‒VHPLI achieved a prediction reproducibility rate of 91.7% for the RIP-Seq experimental screening results. The model was then used to predict the interaction between the human lncRNA PIK3CD-AS2 and the nonstructural protein 1 (NS1) of the H5N1 virus, and RNA pull-down experiments were used to verify the reliability of this prediction. The source code of CBIL‒VHPLI and the datasets used in this work are available at https://github.com/Liu-Lab-Lnu/CBIL-VHPLI for academic usage.
Introduction
Long noncoding RNAs (lncRNAs) form a diverse class of endogenous single-stranded polynucleotide noncoding transcripts with sequence lengths exceeding 200 nucleotides1. Although the expression levels of lncRNAs are much lower than those of protein-coding genes, these RNAs play important roles in many key biological processes, such as cell differentiation, gene expression, and developmental and tissue-specific expression patterns. Thus, lncRNAs not only have important biological roles but also have the potential to participate in life activities as regulatory factors and molecular switches2,3,4. Over the past decade, increasing evidence has shown that viral proteins interact with host lncRNAs to establish persistent infection5,6. Thus, effectively characterizing the interactions between lncRNAs and viral proteins is essential for advancing the current understanding of the role of host lncRNAs in viral infection and host immune functions. However, the current knowledge concerning host lncRNAs remains limited.
Multiple experimental techniques, such as the RNA pull-down method7, chromatin isolation by RNA purification (ChIRP)8, capture hybridization analysis of RNA targets (CHART)9, and RNA immunoprecipitation (RIP)10, have been utilized to identify virus‒host protein‒lncRNA interactions (VHPLIs). However, these experimental techniques are time-consuming and laborious, and experiments have demonstrated that lncRNAs are less strongly associated with proteins. Many effective and novel computational methods have therefore been developed using the available interaction databases. Generally, these methods can be divided into two categories: machine learning-based methods and network-based methods. Machine learning-based methods are used to build binary classifiers that distinguish whether an lncRNA interacts with a protein. For example, Muppirala et al. proposed a model for predicting RNA–protein interactions using only sequence information (RPISeq) in 2015. RPISeq uses support vector machines (SVMs) and random forests (RFs) as classifiers11. In 2019, Yi et al. developed a method named lncRNA–protein interaction prediction (LPI-Pred) that exploits the analogy between biological sequences and natural language. LPI-Pred uses the word2vec natural language processing (NLP) method to learn high-level embeddings of protein and RNA sequences12.
Network-based approaches construct a relational network using known lncRNA‒protein interactions. The interactions between lncRNAs and proteins are predicted by analyzing the topology of the constructed network. For example, Zhang et al. proposed a path-based lncRNA–protein interaction (PBLPI) prediction method, which uses a depth-first search algorithm to reveal novel lncRNA‒protein interactions on three interconnected subgraphs13. Zhang et al. implemented label propagation on a directed graph with linear neighborhood similarity14. Although many computational methods have been developed to predict lncRNA‒protein interactions, these computational models mainly focus on the prediction of lncRNA‒protein interactions in plants or animals, whereas fewer studies have been conducted on the prediction of lncRNA‒protein interactions with viral proteins; thus, the predictions of such models are not satisfactory.
It is well known that RNA/protein sequences carry important information for predicting RNA‒protein interactions15,16, and many studies have shown that convolutional neural network (CNN) models outperform other deep learning architectures in feature extraction tasks and that bidirectional long short-term memory (BiLSTM) networks can learn hidden information17,18,20. Li et al. proposed a new convolutional layer for deep neural networks that can efficiently identify motifs in high-throughput histological data by adaptively learning kernel lengths from the data19. In this work, we propose a sequence-based method using a deep learning model that combines a CNN with BiLSTM. We used k-mer sparse matrices to represent lncRNA and protein sequences and then extracted feature vectors from these matrices via one-hot encoding. To obtain additional biological information, we used the CTDD method to obtain deep sequence features for protein sequences and the Z-curve method to extract features from lncRNA sequences. Variable-length sequences were converted into fixed-length sequences using both zero-padding and cropping techniques and then fed into a CNN. The model further uses the BiLSTM layer to learn the hidden information. Finally, the model was fine-tuned by applying transfer learning to a viral protein–host lncRNA dataset, and the results showed that the proposed method performs stably and well on the virus test set. A comparison of CBIL‒VHPLI with existing methods showed that CBIL‒VHPLI is an effective deep-learning method for predicting lncRNA‒protein interactions.
Materials and Methods
This study aimed to develop a machine learning model for predicting protein‒RNA interactions. To achieve this goal, numerous data processing and modeling steps were carried out. The overall design flow of the model is shown in Fig. 1.
Dataset sources
Five datasets were used in this paper: one dataset, the RPI18072 dataset, to pretrain the CBIL‒VHPLI model (abbreviated as Pre_CBIL‒VHPLI), three datasets, the RPI2241, RPI1807, and RPI488 datasets, for fivefold cross–validation (5CV) on Pre_CBIL‒VHPLI, and one dataset, the vhRPI286 dataset, for fine-tuning Pre_CBIL‒VHPLI to obtain CBIL‒VHPLI. An overview of these five datasets is shown in Table 1.
The RPI18072 dataset was used for model pretraining. Human long noncoding RNA‒protein interaction data were extracted from the NPInter v2.0 database21 and the lncRNome database22. The corresponding lncRNA sequence information was extracted from the NONCODE v3.0 database, and the corresponding protein sequence information was extracted from the UniProt database. The overlapping information contained in the two databases was removed. Considering that our model is based only on lncRNA and protein sequences, and that high sequence similarity may lead to biased results for machine learning approaches, we performed redundant sequence elimination on all datasets (except for the publicly available RPI2241, RPI1807, and RPI488 datasets) using the CD-HIT package23 with a threshold of 0.9. The dataset downloaded from the NPInter v2.0 database contained 2780 positive samples (interacting pairs), with 2725 pairs remaining after removing the redundant and incomplete data. The dataset downloaded from the lncRNome database was also processed and filtered to yield 8110 positive samples. The lncRNA‒protein interaction data obtained from these two databases were combined and once again subjected to redundancy removal, resulting in 9036 positive samples. Finally, negative samples were generated by random pairing. To ensure that the data were balanced, the number of negative samples generated by random pairing was the same as the number of positive samples. Thus, we constructed the nonredundant RPI18072 dataset as a pretraining dataset for the model, and the final 20% of the data were used as a pre- and posttraining test set. Three datasets, the RPI224124, RPI180725, and RPI48826 datasets, were obtained from published papers and used for 5CV purposes after pretraining. These three lncRNA‒protein interaction datasets are more reliable and were built by calculating the atomic distances between the RNAs and proteins in RNA‒protein complexes from the Protein Data Bank (PDB)27.
We obtained viral protein and lncRNA interaction data from the RAID v2.028, RNAInter v4.029, and VirBase v3.030 databases and augmented the fine-tuning dataset via the random pairing method to make the numbers of positive and negative samples more balanced. The fine-tuning dataset (vhRPI286) was constructed with 87 protein sequences and 123 lncRNA sequences.
Overall, our dataset consisted of 30,142 samples, including 15,270 positive samples and 14,872 negative samples. It covered wide ranges of species and types of lncRNAs and proteins and provided a diverse and comprehensive basis for subsequent analyses.
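The random pairing step used to generate negative samples can be illustrated with a minimal sketch. The function name `sample_negative_pairs`, the fixed seed, and the toy identifiers are hypothetical and are not taken from the released code; the sketch only assumes that a negative sample is any lncRNA‒protein combination absent from the positive set.

```python
import random


def sample_negative_pairs(positive_pairs, seed=0):
    """Draw as many unobserved (lncRNA, protein) pairs as there are positives."""
    positives = set(positive_pairs)
    rnas = sorted({rna for rna, _ in positives})
    proteins = sorted({protein for _, protein in positives})
    # Guard against an endless loop when too few unobserved pairs exist.
    if len(rnas) * len(proteins) - len(positives) < len(positives):
        raise ValueError("not enough unobserved pairs for a balanced set")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < len(positives):
        pair = (rng.choice(rnas), rng.choice(proteins))
        if pair not in positives:  # keep only pairs with no known interaction
            negatives.add(pair)
    return sorted(negatives)


# Toy identifiers, for illustration only.
positive_pairs = [("lnc1", "protA"), ("lnc2", "protB"), ("lnc3", "protC")]
negative_pairs = sample_negative_pairs(positive_pairs)
```

Pairs produced this way are only presumed to be noninteracting, which is the usual caveat of random pairing.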
The zero-padding and cropping tricks
Since the CNN model requires fixed-length sequences as inputs and different lncRNA and protein sequences have significantly different lengths, we converted all sequences into fixed-length sequences. Two tricks, zero-padding and cropping, are commonly used for this purpose. The zero-padding trick31 generates fixed-length sequences by simply appending zero values at one or both ends of each sequence until the lengths of all sequences are equal to that of the longest sequence contained in the training and testing datasets. The cropping trick cuts a long sequence down to a fixed length32. Since some of the lncRNA and protein sequences were too long or too short, we set maximum lengths for the lncRNA sequences and protein sequences separately. In this study, we set lncRNA_max_length = 2500 nt and protein_max_length = 1000 aa because approximately 90% of the lncRNAs were ≤ 2500 nt in length and approximately 90% of the proteins were ≤ 1000 aa in length. After setting these maximum length thresholds, we encoded the sequences and passed them through the pad_sequences function, which transformed the variable-length sequences into fixed-length feature matrices as inputs for the subsequent feature extraction and model learning processes. For both lncRNA and protein sequences, a sequence longer than the fixed length was trimmed to the fixed length, and a shorter sequence was extended to the fixed length with zeros. Finally, the padded sequences were passed to the model for training or testing.
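The padding and cropping rule can be sketched in plain Python as follows. This is an illustration rather than the Keras `pad_sequences` call itself: the helper name `pad_or_crop` is hypothetical, and padding and cropping at the end of the sequence is an assumption, since the text does not state which end is used.

```python
def pad_or_crop(encoded, max_length, pad_value=0):
    """Crop sequences longer than max_length and right-pad shorter ones."""
    cropped = list(encoded[:max_length])
    return cropped + [pad_value] * (max_length - len(cropped))


LNCRNA_MAX_LENGTH = 2500   # nt, as set in this study
PROTEIN_MAX_LENGTH = 1000  # aa, as set in this study

short_rna = [1, 2, 3, 4]              # toy integer-encoded lncRNA
long_protein = list(range(1, 1201))   # toy integer-encoded protein, 1200 aa
padded_rna = pad_or_crop(short_rna, LNCRNA_MAX_LENGTH)
cropped_protein = pad_or_crop(long_protein, PROTEIN_MAX_LENGTH)
```

Note that the Keras `pad_sequences` function pads and truncates at the start of a sequence by default (`padding="pre"`, `truncating="pre"`), so the end-based behavior shown here would require `padding="post"` and `truncating="post"`.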
One-hot encoding method for encoding lncRNA and protein sequences
One-hot methods are currently very common approaches for encoding RNA and protein sequences. Recent studies have shown that considering the dependencies between nucleotides or amino acids can improve the performance of predictors. Here, we also adopted the high-order one-hot encoding to transform the lncRNA and protein sequences into matrices.
For a given lncRNA sequence \(S={N}_{1}{N}_{2}\cdots {N}_{{L}_{lnc}}\) with \({L}_{lnc}\) nucleotides, the one-hot encoding matrix R of this sequence can be formulated as33:
\[{R}_{i,j}=\begin{cases}1, & \text{if } {N}_{j}{N}_{j+1}\cdots {N}_{j+k-1} \text{ is the } i\text{th } k\text{-mer}\\ 0, & \text{otherwise}\end{cases}\]
where \(i\in \left[1,2,\cdots ,{4}^{k}\right]\) is the index of the k-mer nucleotides, \(j\) is the position in the sequence, and \(k\) denotes the order of the k-mer nucleotides. In this paper, we set \(k=4\), and each lncRNA sequence was converted into a \({4}^{k}\times max{L}_{lnc}\) numerical matrix.
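A minimal sketch of this high-order one-hot encoding is given below. The function name `kmer_one_hot` and the lexicographic ordering of the k-mers over the alphabet A, C, G, U are assumptions made for illustration; any fixed ordering of the \({4}^{k}\) k-mers would serve equally well.

```python
from itertools import product


def kmer_one_hot(sequence, k, alphabet, max_length):
    """Return a (len(alphabet)**k) x max_length 0/1 matrix as nested lists."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    matrix = [[0] * max_length for _ in range(len(index))]
    n_kmers = min(len(sequence) - k + 1, max_length)
    for j in range(max(n_kmers, 0)):
        row = index.get(sequence[j:j + k])
        if row is not None:  # k-mers containing unknown symbols stay all-zero
            matrix[row][j] = 1
    return matrix


# Toy example: an 8-nt sequence encoded with k = 4 and 10 columns.
R = kmer_one_hot("ACGUACGU", k=4, alphabet="ACGU", max_length=10)
```

With \(k=4\) the matrix has 256 rows, and each column that holds a complete 4-mer contains exactly one nonzero entry; the remaining columns are zero, which matches the zero-padding described earlier.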
A protein sequence is composed of 20 kinds of amino acids. For \({\text{P}}_{\text{mer}}\), the amino acids were divided into seven groups according to the dipole moments and volumes of their side chains34. Each group is represented by a symbol \({F}_{f}\) \((f=1, 2, \cdots ,7)\): \({F}_{1}=\{A, G, V\}\), \({F}_{2}=\{I, L, F, P\}\), \({F}_{3}=\{Y, M, T, S\}\), \({F}_{4}=\{H, N, Q, W\}\), \({F}_{5}=\{R, K\}\), \({F}_{6}=\{D, E\}\), and \({F}_{7}=\{C\}\). Thus, a protein sequence with 20 amino acid symbols can be reduced to a sequence with 7 symbols, i.e., \(Q={p}_{1}{p}_{2}\cdots {p}_{l}\cdots {p}_{L}\) with \({p}_{l}\in \{{F}_{1}, {F}_{2},\cdots ,{F}_{7}\}\). Then, the one-hot encoding matrix P of a protein sequence \(Q={p}_{1}{p}_{2}\cdots {p}_{l}\cdots {p}_{L}\) can be formulated as:
\[{P}_{i,l}=\begin{cases}1, & \text{if } {p}_{l}{p}_{l+1}\cdots {p}_{l+h-1} \text{ is the } i\text{th } h\text{-mer}\\ 0, & \text{otherwise}\end{cases}\]
where \(i\in \left[1,2,\cdots ,{7}^{h}\right]\) is the index of the h-mer of \({F}_{f}\) symbols, and h denotes the high-order degree. In this paper, we set \(h=3\), and each protein sequence was converted into a \({7}^{h}\times max{L}_{pro}\) numerical matrix, where \(max{L}_{pro}\) is the fixed protein length.
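The reduction to the seven-symbol alphabet and the lookup of the matrix row for each 3-mer can be sketched as follows. The helper names, the use of the digits 1 to 7 as group labels, and the lexicographic ordering of the \({7}^{3}=343\) 3-mers are assumptions made for illustration.

```python
from itertools import product

# Seven amino acid groups F1..F7 as listed in the text.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(g + 1) for g, members in enumerate(GROUPS) for aa in members}


def reduce_alphabet(protein):
    """Map each amino acid to its group label '1'..'7' ('0' if unknown)."""
    return "".join(AA_TO_GROUP.get(aa, "0") for aa in protein)


def hmer_rows(protein, h=3):
    """Return, for each position, the row index of its h-mer (None if unknown)."""
    index = {"".join(p): i for i, p in enumerate(product("1234567", repeat=h))}
    reduced = reduce_alphabet(protein)
    return [index.get(reduced[l:l + h]) for l in range(len(reduced) - h + 1)]


reduced = reduce_alphabet("AGRKDC")  # toy peptide
rows = hmer_rows("AGRKDC")
```

Each returned index marks the single nonzero row in the corresponding column of P, so the full matrix can be filled in the same way as for the lncRNA encoding.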
Protein feature CTDD
The composition–transition–distribution (CTD) method was utilized to extract protein sequence features in addition to the k-mer and one-hot encoding methods. The distribution descriptor (CTDD) indicates the distribution pattern of amino acids with specific structural or physicochemical properties along a protein or peptide sequence35. Starting from the first residue of the protein sequence, we located, for any given group of amino acid residues, the positions at which 25%, 50%, 75%, and 100% of the occurrences of that group were reached. These positions were then divided by the length of the entire sequence to obtain a numerical representation suitable for machine learning.
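The distribution calculation can be sketched as follows. The function name `ctdd`, the two toy residue groups, and the choice to round the occurrence rank upwards are assumptions; the physicochemical groupings actually used for CTDD are not listed in the text, and published implementations may round differently.

```python
import math


def ctdd(sequence, groups, fractions=(0.25, 0.50, 0.75, 1.00)):
    """Relative positions at which given fractions of each group's residues occur."""
    features = []
    for members in groups:
        # 1-based positions of the residues that belong to this group.
        positions = [i + 1 for i, aa in enumerate(sequence) if aa in members]
        for fraction in fractions:
            if not positions:
                features.append(0.0)  # group absent from the sequence
                continue
            rank = max(1, math.ceil(fraction * len(positions)))
            features.append(positions[rank - 1] / len(sequence))
    return features


# Toy peptide with two hypothetical groups.
features = ctdd("ARAKAA", groups=("A", "RK"))
```

For the toy peptide, the four alanine occurrences sit at positions 1, 3, 5, and 6, so the first four features are 1/6, 3/6, 5/6, and 6/6.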
lncRNA feature Z-curve
For the lncRNA sequences, we also used the Z-curve parameters for the frequencies of phase-specific trinucleotides (Z_curve_144bit) to extract computable features from the lncRNA sequence. The Z_curve_144bit descriptor transforms the frequencies of the nucleotides (A, U, G, and C) of lncRNA sequences into a three-dimensional space using the Z-transformation technique36. In this part of the study, we extracted \(3\times 3\times 4\times 4=144\) feature descriptors using the Z_curve_144bit method. This descriptor can be calculated as follows:
\[\left\{\begin{array}{l}{x}_{XY}^{k}=\left[{p}^{k}\left(XYA\right)+{p}^{k}\left(XYG\right)\right]-\left[{p}^{k}\left(XYC\right)+{p}^{k}\left(XYU\right)\right]\\ {y}_{XY}^{k}=\left[{p}^{k}\left(XYA\right)+{p}^{k}\left(XYC\right)\right]-\left[{p}^{k}\left(XYG\right)+{p}^{k}\left(XYU\right)\right]\\ {z}_{XY}^{k}=\left[{p}^{k}\left(XYA\right)+{p}^{k}\left(XYU\right)\right]-\left[{p}^{k}\left(XYG\right)+{p}^{k}\left(XYC\right)\right]\end{array}\right.\]
where \({x}_{XY}^{k}\), \({y}_{XY}^{k}\) and \({z}_{XY}^{k}\) are the coordinates of a point in a three-dimensional space. The frequencies of the trinucleotides \(XYA\), \(XYG\), \(XYC\) and \(XYU\) at phase \(k\) are denoted by \({p}^{k}\left(XYA\right)\), \({p}^{k}\left(XYG\right)\), \({p}^{k}\left(XYC\right)\), and \({p}^{k}\left(XYU\right)\), where X, Y = A, C, G and U. Here, \(k = 1, 2, 3\) means that the nucleotides are situated at the 1st, 2nd, and 3rd codon positions.
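A minimal sketch of a 144-dimensional phase-specific trinucleotide Z-curve descriptor is shown below. It uses the standard Z-curve axes (purine versus pyrimidine, amino versus keto, and weak versus strong hydrogen bonding) and assumes that the phase-specific frequency is the share of each trinucleotide among all trinucleotides that start at that phase; the function name `z_curve_144` and this normalization are assumptions rather than details taken from the text.

```python
from itertools import product

BASES = "ACGU"


def z_curve_144(sequence):
    """Phase-specific trinucleotide Z-curve features (3 phases x 16 XY x 3 axes)."""
    sequence = sequence.upper().replace("T", "U")
    features = []
    for phase in range(3):
        counts, total = {}, 0
        # Trinucleotides that start at positions phase, phase + 3, phase + 6, ...
        for i in range(phase, len(sequence) - 2, 3):
            tri = sequence[i:i + 3]
            if all(base in BASES for base in tri):
                counts[tri] = counts.get(tri, 0) + 1
                total += 1
        for x, y in product(BASES, repeat=2):
            a, c, g, u = (
                counts.get(x + y + base, 0) / total if total else 0.0
                for base in BASES
            )
            features.extend([(a + g) - (c + u),   # x: purine vs pyrimidine
                             (a + c) - (g + u),   # y: amino vs keto
                             (a + u) - (g + c)])  # z: weak vs strong H-bond
    return features


features = z_curve_144("AAAAAA")  # toy sequence
```

For the toy poly-A sequence, only the trinucleotide AAA occurs, so the three coordinates for XY = AA equal 1 in every phase and all other features are 0.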
Model construction
CBIL‒VHPLI was built with the Keras application programming interface (API), using TensorFlow as the backend. CBIL‒VHPLI is composed of a Conv1D layer, a max pooling layer, a bidirectional LSTM layer, and a final dense layer. First, features were extracted from RPI18072 through four different pipelines to obtain four different types of features. Next, the input lncRNA/protein sequence feature matrix was convolved by the Conv1D layer, which uses the ReLU function as its activation function, and the result was downsampled by the max pooling layer. The pooled feature vectors were then passed to the bidirectional LSTM layer. The BiLSTM layer outputs a one-dimensional feature vector, after which the output features are fused together. Finally, the prediction results are output through the MLP layer (Fig. 2).
Conv1D Layer
The Conv1D layer is a one-dimensional convolutional layer that performs convolution on the input sequence with filters to learn local patterns or features. In this study, we employ a Conv1D layer as a feature extractor whose inputs are the sequence codes, i.e., matrix R and matrix P as described in Section "One-hot encoding method for encoding lncRNA and protein sequences". The protein sequence descriptors extracted by CTDD and the lncRNA sequence feature descriptors extracted by the Z-curve method are likewise fed into their own Conv1D layers for processing. Each Conv1D layer has 64 filters, which means that it learns 64 different patterns or features from the input sequence. For lncRNA sequences, the convolution kernel size is set to 3, which means that each filter spans three consecutive positions of the input sequence. For protein sequences, the convolution kernel size is set to 5. The padding parameter is set to "same", which means that the input sequence is padded with zeros at both ends so that the output sequence has the same length as the input sequence. The activation function used in the Conv1D layer is the rectified linear unit (\(ReLU\)), which is a commonly used activation function in neural networks.
MaxPooling1D Layer
The output of the convolutional layer is fed to a MaxPooling1D layer, which downsamples the input sequence by taking the maximum value in each window of a specified size. The pool size is set to 2, so each window covers two adjacent values and the maximum of each window forms the output sequence. No padding is applied in the MaxPooling1D layer, so the output sequence is shorter than the input sequence by a factor of the pool size.
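The effect of these two layers on the sequence length can be checked with a short sketch. It assumes that the time axis entering the Conv1D layer has the fixed lengths set during padding (2500 for lncRNAs and 1000 for proteins); the helper names are hypothetical.

```python
def max_pool_1d(values, pool_size=2):
    """Non-overlapping max pooling without padding (trailing remainder dropped)."""
    n_windows = len(values) // pool_size
    return [max(values[i * pool_size:(i + 1) * pool_size]) for i in range(n_windows)]


def length_after_conv_and_pool(length, pool_size=2):
    """A 'same' convolution keeps the length; pooling divides it by pool_size."""
    return length // pool_size


pooled = max_pool_1d([1, 3, 2, 5, 4])             # windows (1, 3) and (2, 5)
lnc_steps = length_after_conv_and_pool(2500)      # lncRNA branch
protein_steps = length_after_conv_and_pool(1000)  # protein branch
```

Under these assumptions the BiLSTM layer would receive 1250 time steps from the lncRNA branch and 500 from the protein branch, each with 64 channels.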
Bidirectional LSTM Layer
Through a convolutional filter and a maximum pooling layer, the CNN module learns the sequence information of lncRNA and proteins. Then, all channels of each subunit are separated into a new feature vector. To further exploit the sequence information, we adopt a BiLSTM neural network. The BiLSTM neural network is based on a recurrent neural network (RNN) and adds memory units in each neural unit of the hidden layer to make the memory information on the time series controllable and thus has a long-term memory function37,38,39. The Bidirectional LSTM layer has 32 units, which means that it has 32 LSTM cells. The default activation function for LSTM cells in Keras is the hyperbolic tangent \((tanh)\) function, and the recurrent activation function for LSTM cells in Keras is the sigmoid function40. The Bidirectional LSTM layer uses the default merge mode, which concatenates the outputs of the forward and backward LSTM layers along the last dimension.
Dense Layer
The output from the bidirectional LSTM layer is fed to a fully connected dense layer with 64 units. The activation function used in this layer is \(ReLU\).
Output Layer
The final layer in the model is a dense layer with a single unit and a sigmoid activation function. This layer produces a binary classification output, indicating the probability of the input sequence belonging to one of two classes. A threshold value can be applied to this output to make a binary classification decision. The formula can be expressed as follows:
\[y=\sigma \left({W}_{n+1}\cdot {f}_{n-1}\left({W}_{n}\cdot {h}_{bi}\right)\right)\]
where y is the output, σ is the sigmoid activation function, \({W}_{n+1}\) is the weight matrix of the last dense layer, \({W}_{n}\) is the weight matrix of the dense layer that precedes it, \({f}_{n-1}\) is the activation function of that dense layer (in this case, the rectified linear unit (ReLU)), and \({h}_{bi}\) is the output of the BiLSTM layer. Bias terms are omitted for brevity.
During training, the model is fed batches of input sequences along with their corresponding labels. The model calculates the loss between the predicted output and the true label and then updates its weights using the backpropagation algorithm. The loss function used in the CBIL‒VHPLI model is binary cross-entropy, and the optimizer used is Adam. For each dataset, we held out 20% of the data as a test set, and after pretraining the model we performed 5CV on three independent datasets (RPI2241, RPI1807, and RPI488) to comprehensively evaluate its performance. The batch size was 256, the number of training epochs was 100, and the learning rate was 0.001.
Fine-tuning
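For reference, the binary cross-entropy loss applied to sigmoid outputs can be written out in a few lines. This is a plain-Python illustration of the standard definition, not the Keras implementation; the logits and labels are toy values, and the clipping constant is an assumption.

```python
import math


def sigmoid(z):
    """Logistic function used by the single-unit output layer."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


probs = [sigmoid(z) for z in (2.0, -1.0, 0.0)]  # toy logits
loss = binary_cross_entropy([1, 0, 1], probs)   # toy labels
```

The loss shrinks as the predicted probability for an interacting pair approaches 1 and that for a noninteracting pair approaches 0.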
Transfer learning can solve the target data scarcity problem by applying the knowledge learned from a data-rich source task to a target task with a small body of data41.
After conducting transfer learning on human lncRNA‒viral protein interaction pairs, the fine-tuned model was compared with some existing algorithms: IPMiner, RPISeq-RF, and RPITER. During the fine-tuning process, we selected 80% of the samples from the fine-tuning dataset as the transfer learning training set and used the remaining 20% of the data as the validation set. To address the problem that a data imbalance between the fine-tuning dataset and the training set may affect model performance, we utilized an upsampling technique to increase the size of the test set so that the ratio of positive to negative samples was more balanced. Specifically, we randomly sampled RNA and protein sequences without pairwise information from the test set and treated them as noninteracting (negative) data. By doing so, we were able to increase the size of the test set and ensure that the proportions of positive and negative samples in the test set were similar to those in the training set.
Evaluation Metrics
In this study, we used the accuracy (ACC), Matthews correlation coefficient (MCC), F1 score (F1), recall, specificity (SPE), and precision (PRE), which is also known as the positive predictive value (PPV), to measure the performance of CBIL‒VHPLI. The formulae for these indicators are shown below.
\[ACC=\frac{TP+TN}{TP+TN+FP+FN}\]
\[Recall=\frac{TP}{TP+FN}\]
\[SPE=\frac{TN}{TN+FP}\]
\[PRE=PPV=\frac{TP}{TP+FP}\]
\[F1=\frac{2\times PRE\times Recall}{PRE+Recall}\]
\[MCC=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}\]
where TP and TN represent the numbers of correctly predicted lncRNA‒protein interaction pairs and noninteraction pairs, respectively, and FP and FN represent the numbers of falsely predicted lncRNA‒protein interaction pairs and noninteraction pairs, respectively. Furthermore, we calculated the area under the ROC curve (AUC) to measure the performance of CBIL‒VHPLI.
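These indicators follow directly from the four confusion-matrix counts, as the sketch below shows. The function name and the counts passed to it are hypothetical and are not results from this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, recall, SPE, PRE, F1, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    spe = tn / (tn + fp) if tn + fp else 0.0
    pre = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * pre * recall / (pre + recall) if pre + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Recall": recall, "SPE": spe,
            "PRE": pre, "F1": f1, "MCC": mcc}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
```

Unlike ACC, the MCC uses all four counts in both numerator and denominator, which is why it is often reported alongside ACC when the classes are not perfectly balanced.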
Case study
Plasmids and antibodies
A full-length human PIK3CD-AS2 (ENSG00000231789) sequence was synthesized, subcloned and inserted into the pcDNA3.1 vector by Sangon Biotech. The pEGFP-N1 vector was provided by the Key Laboratory of the Ministry of Education for Protein Science, Tsinghua University. Escherichia coli strains (DH5α and BL21) were preserved in our laboratory. NS1 fragments were first derived from the A/Goose/Guangdong/1/96 H5N1 influenza virus42. The NS1 antibodies were purchased from Santa Cruz Biotechnology (Cat. No. sc-130568, USA).
Cell culture
A549 human lung adenocarcinoma cells were purchased from Dalian Meilun Biological Company (product code: PWE-HU008). The A549 cells were cultured in an F12K basal medium (Beijing Solarbio Technology Co., Ltd., China) with 10% fetal bovine serum (FBS, Meiluncell) and inoculated in 6-well plates containing F12K medium for 12 h; DNA plasmids were transfected for 4 h. The cell cultures were centrifuged at 6,000 rpm for 10 min at 4 °C. The collected sediments were washed and resuspended in PBS with Triton X-100 and the protease inhibitor PMSF. After performing sonication, the cells were centrifuged at 4 °C and 12,000 rpm for 20 min, after which the supernatant was collected.
Cell transfection
After 48 h of culture, the cells were transfected with DNA plasmids using Lipofectamine™ 2000 (Invitrogen, Carlsbad, CA, USA). DNA plasmid transfection was performed using the Lipofectamine RNAiMAX reagent (Invitrogen, Carlsbad, CA, USA) according to instructions provided by the manufacturer.
Western blotting
Total protein was extracted from the A549 cells using an RIPA lysis buffer (Servicebio). The lysates were centrifuged at 12,000 rpm for 15 min at 4 °C, and the supernatant was collected for electrophoresis. The protein extracts (20 μg protein/lane) were separated by 10% SDS‒PAGE and then transferred to a PVDF membrane. The membrane was then incubated in Tris-buffered saline containing 5% skim milk and 0.1% Tween 20 for 2 h at room temperature. Next, the membrane was incubated with a primary antibody overnight at 4 °C. After washing, the membrane was incubated with a secondary antibody for 1 h at room temperature. After rinsing, the proteins were detected via enhanced chemiluminescence (Servicebio), and X-ray films were pressed and sequentially placed in a developer solution for development and a fixer solution for fixation. After the films were rinsed, dried, and scanned, their grey values were analyzed with IPP.
Real-time PCR
The full-length sequence of the constructed PIK3CD-AS2 overexpression plasmid was used as a template for PCR amplification to obtain the sense and antisense strands. The PCR mix consisted of 2 μL of template, 1 μL of primer, 12.5 μL of Taq polymerase, and 8.5 μL of sterilized water. The PCR cycling conditions consisted of an initial denaturation for 5 min at 95 °C, followed by 40 cycles of denaturation at 95 °C for 30 s, annealing at 55–60 °C for 40 s, and elongation at 72 °C for 40 s, and a final 5-min extension at 72 °C. The sizes of the PCR products were confirmed by gel electrophoresis on a 1% agarose gel.
RIP-Seq
The pEGFP-N1-NS1 plasmid-liposome complex was added in a dropwise manner to the A549 cells in the logarithmic growth phase, and the A549 cell precipitates were collected after 48 h. An RIP lysis buffer (4 µl of PMSF) was added, and the lysed cells were vortexed and shaken. Then, 10 µl of PIC and 2.5 µl of RNase inhibitor were added. After sufficient lysis, the cells were centrifuged at 12,000 rpm for 10 min at 4 °C. Fifty microlitres of cell lysate were used as the input positive control, and the remaining lysate sample was used for NS1 antibody enrichment via RIP assays. The cell lysates were incubated with 50 µl of Protein G magnetic beads containing the NS1 antibody mixture in EP tubes, and nonspecifically bound RNA was removed by washing after 4 h of incubation. The enriched lncRNAs were subjected to fragmentation, double-stranded cDNA synthesis, end repair, junction ligation, magnetic bead purification, and PCR amplification to obtain sequencing libraries that were suitable for the Illumina platform.
RNA pull-down
lncRNA-PIK3CD-AS2 RNA was synthesized from PIK3CD-AS2 plasmid DNA using a T7 In Vitro Transcription Kit (Thermo Scientific, Waltham, MA, USA) and labelled using an RNA 3'-end desthiobiotinylation kit (Thermo Scientific). RNA pull-down was performed using the Magnetic RNA‒Protein Pull-down Kit (Thermo Scientific) according to the instructions provided by the manufacturer. Briefly, A549 cells transfected with NS1 for 48 h were harvested in a lysis buffer (25 mM Tris, pH 7.4; 150 mM NaCl; 1% NP-40; 1 mM EDTA; and 5% glycerol). Fifty pmol of biotinylated RNA was incubated with streptavidin magnetic beads for 30 min at room temperature. The cell lysates were then incubated with RNA-conjugated beads for 60 min at 4 °C with rotation. After three washes, the bead-associated proteins were eluted and analysed by Western blotting with an NS1 antibody (sc-130568; 1:200), which was purchased from Santa Cruz Biotechnology.
Results
CNNs34, RNNs43, BiLSTM37, and LSTM44 are the four commonly used methods for predicting protein‒RNA interactions using deep learning techniques. Therefore, we trained Pre_CBIL‒VHPLI on the RPI18072 pretraining dataset. We then compared Pre_CBIL‒VHPLI with four other methods on the RPI2241, RPI1807, and RPI488 datasets to test the reliability and robustness of Pre_CBIL‒VHPLI. Next, the performance of the fine-tuned CBIL‒VHPLI model was compared with that of several existing algorithms, IPMiner, RPISeq-RF, and RPITER, after implementing transfer learning on the vhRPI286 human lncRNA and viral protein interaction dataset. During the experiment, we selected 80% of these datasets as the training set and the remaining 20% as the validation set. Finally, we conducted an auxiliary validation on the prediction results of the CBIL‒VHPLI model through RIP-Seq technology in combination with RNA pull-down experiments.
Feature analysis
To verify the effectiveness of the hybrid feature extraction method, a comprehensive experimental analysis was performed. We compared the results obtained in the following four cases. Table 2 presents the results produced under the combined scheme of the four feature extraction approaches mentioned above.
The worst prediction results were obtained when protein sequence and lncRNA sequence features were extracted using the k-mer method alone. When the data were additionally processed using CTDD or the Z-curve method, the prediction results improved to some extent, indicating that these feature extraction methods are effective at extracting features from sequence data. Our proposed method, which combines protein sequence feature extraction conducted via the k-mer method and CTDD with lncRNA sequence feature extraction via the k-mer and Z-curve approaches, achieved the best results, demonstrating the effectiveness of the hybrid feature extraction scheme.
Performance of Pre_CBIL‒VHPLI in predicting lncRNA–protein interactions
To evaluate the performance of Pre_CBIL‒VHPLI, we first compared Pre_CBIL‒VHPLI with RNN, CNN, BiLSTM, and LSTM on the benchmark RPI18072 dataset. The results showed that our Pre_CBIL‒VHPLI outperforms the other four methods (Table 3).
On the RPI18072 dataset, Pre_CBIL‒VHPLI yielded an accuracy of 94.23%, which was 6.58%, 5.15%, 4.49%, and 6.31% higher than those of RNN, CNN, BiLSTM, and LSTM, respectively. The recall of Pre_CBIL‒VHPLI was 0.9202, which was 3.63%, 0.94%, 1.82%, and 1.68% higher than those of RNN, CNN, BiLSTM, and LSTM, respectively. The MCC of Pre_CBIL‒VHPLI was 0.8948, which was 2.23%, 5.29%, 1.02%, and 4.81% higher than those of RNN, CNN, BiLSTM, and LSTM, respectively. The PRE of Pre_CBIL‒VHPLI was 0.9371, which was 3.8%, 4.98%, 3.81%, and 5.77% higher than those of RNN, CNN, BiLSTM, and LSTM, respectively. The F1 of Pre_CBIL‒VHPLI was 0.9169, which was 1.39%, 2.5%, 0.18%, and 0.3% higher than those of RNN, CNN, BiLSTM, and LSTM, respectively. The SPE of Pre_CBIL‒VHPLI was 0.9283, which was 7.09%, 6.42%, 2.25%, and 3.53% higher than those of RNN, CNN, BiLSTM, and LSTM, respectively.
The ROC curves of Pre_CBIL‒VHPLI and the other four methods on RPI18072 are shown in Fig. 3. From this figure, it can be observed that the AUC score of Pre_CBIL‒VHPLI reached 0.9522, which was higher than that of CNN, RNN, LSTM, and BiLSTM. These results indicate that Pre_CBIL‒VHPLI performs well in predicting lncRNA–protein interactions.
To further test the reliability and robustness of CBIL‒VHPLI after pretraining, we also compared Pre_CBIL‒VHPLI with four other methods in a 5CV on the RPI488, RPI1807 and RPI2241 datasets. To evaluate the model performance, we removed the same samples in these three datasets, after which we predicted the remaining samples using each of the five models (Table 4).
Performance of CBIL‒VHPLI for predicting virus-host protein-lncRNA interactions
To validate the performance of the CBIL‒VHPLI model in predicting lncRNA‒protein interactions after fine-tuning by transfer learning, we first compared the fine-tuned CBIL‒VHPLI model with the four previous models, the CNN, RNN, BiLSTM, and LSTM models, in terms of ACC, MCC, F1, and PRE. The results showed that the CBIL‒VHPLI model achieved an ACC value of 0.946 and an MCC of 0.884 on the vhRPI286 dataset. As shown in Fig. 4, the CBIL‒VHPLI model performs better after transfer learning on the virus-host protein‒lncRNA data.
Our method was then compared with three other state-of-the-art algorithms. catRAPID does not provide a stand-alone package, a working link to RPI-Pred is not available, and lncPro only provides source code for predictive models trained on its own dataset; therefore, we selected IPMiner, RPISeq-RF, and RPITER for comparison on independent test sets of viral proteins and human lncRNAs under 5CV. On the vhRPI286 fine-tuning dataset, the CBIL‒VHPLI model achieved the highest accuracy (94.56%), precision (89.7%), and sensitivity (80.8%). The accuracies of IPMiner, RPISeq-RF, and RPITER were 0.875, 0.834, and 0.776, respectively. These results indicate that CBIL‒VHPLI outperforms the other existing algorithms in predicting viral–host protein–lncRNA interactions.
Case study
To assess the predictive performance of CBIL-VHPLI for viral–host protein–lncRNA interactions, we performed a case study, the workflow of which is shown in Fig. 5a. We used the NS1 protein of an avian influenza virus (H5N1) as the experimental research material. The constructed H5N1 NS1 protein overexpression plasmid (pEGFP-N1-NS1) was transfected into lung epithelial A549 cells for 48 h, and cells transfected with the blank plasmid served as the control group. NS1 protein expression in the cells was examined by Western blotting (Fig. 5b). RNA immunoprecipitation was used to enrich the lncRNAs that interacted with the NS1 protein in A549 cells. Finally, the enriched lncRNAs were sequenced and identified via RNA sequencing. After setting the enrichment fold threshold to greater than 3, 880 host lncRNAs that might interact with the NS1 protein were identified (Supplementary Table S1). These lncRNAs were used as inputs to the CBIL-VHPLI model for predictive classification. We set the scoring threshold of the model to 0.5, so that a prediction score greater than 0.5 was considered a possible interaction. The model predicted that 807 of the 880 lncRNAs might interact with the NS1 protein, a replication rate of approximately 91.7% relative to the RIP-Seq results (Supplementary Table S2). The human lncRNA PIK3CD-AS2 was among the top three in both the model scores and the RIP-Seq experiments. Therefore, we used the full-length sequence in the constructed PIK3CD-AS2 overexpression plasmid as a template for PCR amplification to obtain the sense and antisense strands (Fig. 5c). Table 5 shows the sequences of the PCR primers used, and Table 6 shows the PCR reaction system. Finally, the in vitro interaction of the NS1 protein with PIK3CD-AS2 was investigated by an RNA pull-down assay to validate the predictions of the CBIL-VHPLI model. As shown in Fig. 5d, there was a direct interaction between the NS1 protein and the lncRNA PIK3CD-AS2. Together, these results support the accuracy and reliability of the CBIL-VHPLI model predictions.
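The replication rate in the case study follows from thresholding the model scores of the RIP-Seq candidates. The sketch below reproduces the arithmetic with placeholder scores; only the counts (807 of 880) and the 0.5 threshold come from the text, while the function name and the individual score values are made up for illustration.

```python
def replication_rate(scores, threshold=0.5):
    """Count scores strictly above the threshold and return (count, fraction)."""
    predicted = sum(1 for s in scores if s > threshold)
    return predicted, predicted / len(scores)


# Placeholder scores standing in for the 880 RIP-Seq candidates:
# 807 above the threshold and 73 below it.
scores = [0.9] * 807 + [0.1] * 73
count, rate = replication_rate(scores, threshold=0.5)
print(count, f"{rate:.1%}")  # 807 91.7%
```

A score of exactly 0.5 is not counted as an interaction here, matching the "greater than 0.5" criterion.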
Discussion
In this study, we proposed a new approach based on deep learning and used multiple sequence feature extraction methods to predict human lncRNA and viral protein interactions. On the RPI18072 dataset, Pre_CBIL‒VHPLI achieved an accuracy of 0.9413, an MCC of 0.8948, a recall of 0.9202, a PRE of 0.9371, an F1 score of 0.9169, an SPE of 0.9283, and a PPV of 0.9194. The results obtained on the RPI2241, RPI1807, and RPI488 independent test sets also demonstrated the superior performance of Pre_CBIL‒VHPLI to that of the previously developed methods. The results produced on the fine-tuned vhRPI286 dataset also demonstrated the good validity and robustness of CBIL‒VHPLI in terms of predicting human lncRNA‒viral protein interactions. A final case study also helped validate the predictive reliability of the CBIL‒VHPLI model.
CBIL-VHPLI performs well in predicting lncRNA-protein interactions, and we believe there are several reasons for this. First, all the sequences in the training and fine-tuning datasets were made non-redundant using CD-HIT, so that the remaining proteins are dissimilar in sequence and are therefore unlikely to interact with the same lncRNA. Second, we applied a variety of feature extraction approaches to the raw sequences to account for the key role of deeper amino acid/nucleotide prior-dependent effects in lncRNA-protein interactions. Third, we compared various combinations of lncRNA and protein features to obtain a more comprehensive picture of lncRNA-protein interactions. Fourth, the combination of BiLSTM and CNN networks allows the model to capture both local and global dependencies in the data, thus improving its ability to predict outcomes accurately. Finally, fine-tuning the network by transfer learning on datasets of viral proteins and human lncRNAs may allow the model to adapt to the specific characteristics of those data, thus improving performance. However, it is worth noting that the performance of hybrid models may vary depending on the specific combination of dataset and network used. Therefore, it is critical to conduct rigorous experiments and evaluations to determine the best model for a given task. Overall, the architecture and input coding scheme of the hybrid model give it an advantage over other models, resulting in better predictive performance.
Although the CBIL-VHPLI model achieved good results in predicting viral protein and human lncRNA interactions, some limitations remain. First, research into the interactions between viral proteins and human lncRNAs is limited, and consequently few data on lncRNA binding to viral proteins are available. The resulting information bias can distort estimates of the probability of interaction between viral proteins and human lncRNAs. Additional data sources and experimental evidence have the potential to further improve model performance. Second, exploring only sequence features gives a one-sided view of the interaction between viral proteins and lncRNAs. Agliano et al. summarised the functional role of lncRNAs in regulating the mammalian immune response during host–pathogen interactions, demonstrating that many lncRNAs are emerging as modulators of the inflammatory response of immune cells and of host–pathogen interactions45. Therefore, finding better sequence and structural features, exploring better network structures, improving feature and network structure performance, and predicting the biological functions of lncRNAs to increase the prediction accuracy of the models will be the focus of our future work.
Conclusion
To understand the various regulatory and pathogenic mechanisms in which lncRNAs participate through their interactions with proteins, several computational methods have been developed for predicting such interactions46,47. However, existing methods are not very effective at predicting the interactions between viral proteins and human lncRNAs. In this study, we present CBIL‒VHPLI, a deep learning-based framework that combines BiLSTM with a CNN to predict interactions between proteins and lncRNAs. This method also includes a model checkpoint callback that saves the best model weights during training based on the validation accuracy. This is a useful technique for limiting overfitting and preserving the best model for later use. Another important aspect is that the model is trained for a fixed number of epochs (100 in this case) with a fixed batch size (256). These hyperparameters can significantly affect the performance of the model, and tuning them is an important step in the development of deep learning models. The prediction results show that the method is more effective than other traditional methods.
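The checkpointing behaviour described above can be illustrated without any deep learning framework. The sketch below shows the logic such a callback typically implements: keep a copy of the weights from the epoch with the highest validation accuracy. The class name, the dictionary standing in for model weights, and the validation accuracies are all hypothetical; the actual implementation in the CBIL‒VHPLI repository may differ.

```python
import copy


class BestCheckpoint:
    """Keep a snapshot of the weights with the best validation accuracy."""

    def __init__(self):
        self.best_acc = float("-inf")
        self.best_epoch = None
        self.best_weights = None

    def update(self, epoch, val_acc, weights):
        """Store a deep copy of `weights` if `val_acc` is a strict improvement."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.best_epoch = epoch
            self.best_weights = copy.deepcopy(weights)
            return True
        return False


# Made-up validation accuracies for four epochs (the paper trains for 100).
val_history = [0.80, 0.85, 0.83, 0.85]
ckpt = BestCheckpoint()
for epoch, val_acc in enumerate(val_history, start=1):
    weights = {"trained_epochs": epoch}  # stand-in for real model weights
    ckpt.update(epoch, val_acc, weights)

print(ckpt.best_epoch, ckpt.best_acc)  # 2 0.85
```

Because only strict improvements are saved, a later epoch that merely ties the best accuracy does not overwrite the earlier snapshot.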
Supplementary data information
Publicly available datasets were analysed in this study. These data can be found at the links listed below.
NPInter v2.0 database: http://bigdata.ibp.ac.cn/npinter4/download/
lncRNome database: http://genome.igib.res.in/lncRNome/
NONCODE v3.0 database: http://www.noncode.org/index.php
UniProt database: https://www.uniprot.org
Data availability
The source code of CBIL‒VHPLI and the datasets used in this work are available at https://github.com/Liu-Lab-Lnu/CBIL-VHPLI for academic usage.
Abbreviations
- lncRNAs: Long noncoding RNAs
- VHPLIs: Virus‒host protein‒lncRNA interactions
- CNN: Convolutional neural network
- RNN: Recurrent neural network
- LSTM: Long short-term memory
- BiLSTM: Bidirectional long short-term memory
- CTD: Composition, transition and distribution
- PDB: Protein data bank
- API: Application programming interface
- ACC: Accuracy
- MCC: Matthew’s correlation coefficient
- F1: F1_score
- SPE: Specificity
- PRE: Precision
- PPV: Positive predictive value
- RIP: RNA immunoprecipitation
- NS1: Nonstructural protein 1
- pEGFP-N1-NS1: H5N1 NS1 protein overexpression plasmid
- pEGFP-N1: Blank vector plasmid
References
Khalil, A. M. & Rinn, J. L. RNA–protein interactions in human health and disease. Semin. Cell Dev. Biol. 22, 359–365 (2011).
Statello, L., Guo, C. J., Chen, L. L. & Huarte, M. Gene regulation by long non-coding RNAs and its biological functions. Nat. Rev. Mol. Cell Biol. 22, 96–118 (2021).
Li, C. H. & Chen, Y. Targeting long non-coding RNAs in cancers: Progress and prospects. Int. J. Biochem. Cell Biol. 45, 1895–1910 (2013).
Spizzo, R., Almeida, M. I., Colombatti, A. & Calin, G. A. Long non-coding RNAs and cancer: A new frontier of translational research?. Oncogene. 31, 4577–4587 (2012).
Wang, J. et al. Host long noncoding RNA LncRNA-PAAN regulates the replication of influenza A virus. Viruses. 10, 330 (2018).
More, S. et al. Long non-coding RNA PSMB8-AS1 regulates influenza virus replication. RNA Biol. 16, 340–353 (2019).
Huarte, M. et al. A large intergenic noncoding RNA induced by p53 mediates global gene repression in the p53 response. Cell. 142, 409–419 (2010).
Chu, C., Qu, K., Zhong, F. L., Artandi, S. E. & Chang, H. Y. Genomic maps of long noncoding RNA occupancy reveal principles of RNA-chromatin interactions. Mol. Cell. 44, 667–678 (2011).
Simon, M. D. et al. High-resolution Xist binding maps reveal two-step spreading during X-chromosome inactivation. Nature. 504, 465–469 (2013).
Wessels, H. H., Hirsekorn, A., Ohler, U. & Mukherjee, N. Identifying RBP targets with RIP-seq. Methods Mol. Biol. Clifton N. J. 1358, 141–152 (2016).
Muppirala, U. K., Honavar, V. G. & Dobbs, D. Predicting RNA-protein interactions using only sequence information. BMC Bioinform. 12, 1–11 (2011).
Yi, H. C. et al. Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions. Comput. Struct. Biotechnol. J. 18, 20–26 (2019).
Zhang, H., Ming, Z., Fan, C., Zhao, Q. & Liu, H. A path-based computational model for long non-coding RNA-protein interaction prediction. Genomics. 112, 1754–1760 (2020).
Zhang, W., Qu, Q., Zhang, Y. & Wang, W. The linear neighborhood propagation method for predicting long non-coding RNA-protein interactions. Neurocomputing. 273, 526–534 (2018).
Wang, L., You, Z. H., Huang, D. S. & Zhou, F. Combining high speed ELM learning with a deep convolutional neural network feature encoding for predicting protein-RNA interactions. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 972–980 (2018).
Ray, D. et al. Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat. Biotechnol. 27, 667–670 (2009).
Tallam, K. et al. Identification of snails and schistosoma of medical importance via convolutional neural networks: A proof-of-concept application for human schistosomiasis. Front. Public Health. 9, 642–655 (2021).
Huang, L. et al. LGFC-CNN: Prediction of lncRNA-protein interactions by using multiple types of features through deep learning. Genes. 12, 1675–1689 (2021).
Li, J. Y., Jin, S., Tu, X. M., Ding, Y. & Gao, G. Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network. Brief. Bioinform. 22, 220–233 (2021).
Xuan, P., Ye, Y., Zhang, T., Zhao, L. & Sun, C. Convolutional neural network and bidirectional long short-term memory-based method for predicting drug-disease associations. Cells. 8, 705 (2019).
Yuan, J. et al. NPInter v2.0: An updated database of ncRNA interactions. Nucleic Acids Res. 42, 54–104 (2014).
Bhartiya, D. et al. lncRNome: A comprehensive knowledgebase of human long noncoding RNAs. Database (Oxford). 34, 14–33 (2013).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics. 28, 3150–3152 (2012).
Muppirala, U. K., Honavar, V. G. & Dobbs, D. Predicting RNA-protein interactions using only sequence information. BMC Bioinformatics. 12, 489 (2011).
Suresh, V., Liu, L., Adjeroh, D. & Zhou, X. RPI-Pred: Predicting ncRNA-protein interaction using sequence and structural information. Nucleic Acids Res. 43, 1370–1379 (2015).
Pan, X., Fan, Y. X., Yan, J. & Shen, H. B. IPMiner: Hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction. BMC Genomics. 17, 582 (2016).
Wang, Y. et al. De novo prediction of RNA-protein interactions from sequence information. Mol. Biosyst. 9, 133–142 (2013).
Yi, Y. et al. RAID v2.0: An updated resource of RNA-associated interactions across organisms. Nucleic Acids Res. 45, 115–118 (2017).
Kang, J. et al. RNAInter v4.0: RNA interactome repository with redefined confidence scoring system and improved accessibility. Nucleic Acids Res. 50, 326–332 (2022).
Cheng, J. et al. ViRBase v3.0: A virus and host ncRNA-associated interaction repository with increased coverage and annotation. Nucleic Acids Res. 50, 928–933 (2022).
Yin, C. & Yau, S. S. A coevolution analysis for identifying protein-protein interactions by Fourier transform. PloS One. 12, 0174862 (2017).
Hashemifar, S., Neyshabur, B., Khan, A. A. & Xu, J. Predicting protein-protein interactions through sequence-based deep learning. Bioinformatics. 34, 802–810 (2018).
Zhang, Q., Zhu, L. & Huang, D. S. High-order convolutional neural network architecture for predicting DNA-protein binding sites. IEEE ACM. Trans. Comput. Biol. 16, 1184–1192 (2019).
Liu, B. BioSeq-Analysis: A platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Briefings Bioinform. 20, 1280–1294 (2017).
Govindan, G. & Nair, A. S. Composition, transition and distribution (CTD)—A dynamic feature for predictions based on hierarchical structure of cellular sorting. Ann. IEEE India Conf. 26, 1–6 (2011).
Zhang, R. & Zhang, C. T. Z curves, an intutive tool for visualizing and analyzing the DNA sequences. J. Biomol. Struct. Dyn. 11, 767–782 (1994).
Pan, X. & Shen, H. B. Predicting RNA-protein binding sites and motifs through combining local and global deep convolutional neural networks. Bioinformatics. 34, 3427–3436 (2018).
Cornegruta, S., Bakewell, R., Withey, S. & Montana, G. Modelling radiological language with bidirectional long short-term memory. arXiv preprint arXiv:1609.08409 (2016).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Sarlin, P. E., DeTone, D., Malisiewicz, T. & Rabinovich, A. SuperGlue: Learning feature matching with graph neural network. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recogn. 43, 4938–4947 (2020).
Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data En. 22, 1345–1359 (2010).
Zhu, C. et al. Interaction of avian influenza virus NS1 protein and nucleolar and coiled-body phosphoprotein 1. Virus Genes. 46, 287–292 (2013).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning internal representations by error propagation. Technical report, California 318 Univ San Diego La Jolla Inst for Cognitive Science, Vol 71, pp. 599–607 (1986).
Jordan, M. I. Serial order: A parallel distributed processing approach. Adv. Psychol. 121, 471–495 (1997).
Agliano, F., Rathinam, V. A., Medvedev, A. E., Vanaja, S. K. & Vella, A. T. Long noncoding RNAs in host–pathogen interactions. Trends Immunol. 40, 492–510 (2019).
Zhang, W., Qu, Q., Zhang, Y. & Wang, W. The linear neighborhood propagation method for predicting long non-coding RNA–protein interactions. Neurocomputing. 273, 526–534 (2018).
Zhu, R., Li, G., Liu, J. X., Dai, L. Y. & Guo, Y. ACCBN: Ant-Colony-clustering-based bipartite network method for predicting long non-coding RNA–protein interactions. BMC Bioinformatics. 20, 34–16 (2019).
Acknowledgements
The authors are grateful for the support from the Technology Innovation Center for the Computer Simulating and Information Processing of Bio-macromolecules of Liaoning Province and the Engineering Laboratory for the Molecular Simulation and Design of Drug Molecules of Liaoning Province.
Funding
This work was supported by the Liaoning Province Science and Technology Innovation Leading Talent "Xing Liao Talents Program" Project (Grant number XLYC2002045), the National Natural Science Foundation of China (Grant number 82003655), the Shenyang Science and Technology Plan Project (Grant number 21-116-3-48), the Shenyang Young and Middle‐aged Science and Technology Innovation Talent Support Program (Grant numbers RC220277, RC210216), the Doctoral Research Startup Fund Guidance Program of Liaoning Province (Grant number 2023-BS-084), and the Scientific Research Project from the Department of Education of Liaoning Province (Grant numbers LJKZ0088, LQN201906, LJKMZ20220455, JYTQN2023187).
Author information
Contributions
Man Zhang: Methodology, Data Curation, Software, Writing—Original Draft, Conceptualization. Li Zhang: Conceptualization, Formal Analysis, Resources, Funding Acquisition. Ting Liu: Investigation, Validation. Huawei Feng: Writing—Review & Editing. Zhe He: Data Curation. Feng Li: Formal Analysis. Jian Zhao: Visualization. Hongsheng Liu: Conceptualization, Supervision, Funding Acquisition.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, M., Zhang, L., Liu, T. et al. CBIL-VHPLI: a model for predicting viral-host protein-lncRNA interactions based on machine learning and transfer learning. Sci Rep 14, 17549 (2024). https://doi.org/10.1038/s41598-024-68750-8
DOI: https://doi.org/10.1038/s41598-024-68750-8