Introduction

In a genome, most of the genes are transcribed into non-coding RNAs (ncRNAs)1,2. ncRNAs are involved in many important biological processes, such as protein synthesis, gene regulation, and immunoregulation3,4, and play an important role in many diseases, such as cancer, diabetes, and atherosclerosis4,5. The function of an RNA is associated with its spatial structure, which forms through a hierarchical folding process: the secondary structure forms first, and further folding then yields the tertiary (3D) structure4. The secondary structure forms rapidly with a large energy change, is rich in conformations, and is stable and independent of the tertiary structure6. These features make the secondary structure useful for the functional inference of non-coding RNAs, drug target discovery, and anti-RNA-virus drug design7.

An RNA secondary structure is composed of the base pairs formed via hydrogen bonds within the RNA sequence, including canonical base pairs (A-U, C-G, and G-U) and special base pairs (non-canonical base pairs, pseudoknots, and base triples). At present, two main approaches are used to obtain RNA secondary structures: the experimental approach and the computational approach. Although the experimental approach provides highly accurate RNA secondary structures8,9 based on 3D structures determined by wet-lab experiments (X-ray crystallography and nuclear magnetic resonance), the instability of RNA molecules and the difficulty of crystallizing them make this method hard to apply in practice. In addition, the experimental approach is often expensive as well as time- and labor-intensive, and is therefore difficult to generalize. Secondary structure probing by chemical reagents (SHAPE, DMS) is another class of methods for obtaining RNA secondary structures10. However, these methods are generally used to probe RNA structures in vitro, and structures in vitro are not always consistent with those in living cells11. To date, the structures of fewer than one in ten thousand known RNAs have been determined by these approaches12. Thus, this task largely relies on the computational approach.

In general, the computational approach can be classified into two categories: comparative sequence analysis13,14 and de novo folding algorithms15,16. Comparative sequence analysis is based on the covariant alignment of complementary bases among RNA sequences, and the structure is determined via homologous sequences13. Although this method is highly accurate, it requires a set of homologous sequences; because only a limited number of RNA families have been identified, homologous sequences are unavailable for most RNAs, which limits the use of comparative sequence analysis.

De novo folding algorithms are designed on assumptions about the RNA folding mechanism, from which structure partition functions and optimization goals are derived. The secondary structure of an RNA sequence can then be obtained using optimization algorithms, such as dynamic programming, which find the globally optimal or expectation-maximized structure in the structure space15,16,17,18,19,20,21. Therefore, the secondary structure can be predicted from a single RNA sequence22,23,24,25. The base-pairing accuracy of known de novo folding algorithms on datasets composed of short RNAs (< 300 nt) is 71%26; however, for long RNAs (> 1000 nt, such as long non-coding RNA (lncRNA) or messenger RNA (mRNA)), most known de novo folding algorithms are not applicable due to their low precision and high computational cost26.

Recently, rapid advances have been made in RNA secondary structure prediction owing to the application of deep neural networks27. Singh et al.28 proposed the first hybrid deep neural network, the SPOT-RNA model, which combines a residual network and long short-term memory (LSTM) to achieve high accuracy in predicting RNA secondary structures. Subsequently, Sato et al.29 proposed a deep learning-based approach that combines a fairly deep model with a dynamic programming algorithm. However, these two models are inadequate for predicting the secondary structures of long RNAs. Lu et al.30 proposed an LSTM network that can predict the secondary structure of an RNA of any length, yet its prediction accuracy needs further improvement. In short, predicting the secondary structure of long RNAs remains a significant challenge.

An RNA exterior loop31 (or external loop, Fig. 1) is a type of RNA subsequence that meets the following conditions:

(1) all of its bases are unpaired (bases in non-canonical base pairs, triplets, and pseudoknots are regarded as paired bases);

(2) it is not located between two paired bases.

Figure 1

Exterior loop and i-fragment. Each dot (black or blue) in the line represents a base, and the two dots connected by an arc are paired bases. The blue dots stand for the bases in exterior loops. The red lines represent the i-fragments.

The independent fragments (i-fragments) are the subsequences separated by the exterior loops. Hence, the secondary structure of an i-fragment can be predicted independently of the other fragments, and the simple assembly of the i-fragment structures forms the secondary structure of the complete RNA. In this study, we propose a deep learning model, RNA-par, which predicts the bases in the exterior loops of a given RNA sequence. With these predictions, the RNA sequence can be partitioned into several short i-fragments, whose secondary structures can be predicted separately and then simply assembled into the secondary structure of the complete RNA, as illustrated by the sketch below. With the proposed RNA-par model, the task of long-sequence RNA structure prediction can thus be addressed via the prediction of several short RNA fragments, which can be handled easily by common RNA prediction methods.
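The following is a minimal sketch (not the authors' code) of this partition-and-assemble idea: per-base exterior-loop labels split an RNA into maximal runs of non-exterior-loop bases, which become the i-fragments to be folded separately.

```python
def split_into_i_fragments(sequence: str, exterior: list) -> list:
    """Return (start, end, subsequence) triples for each i-fragment.

    exterior[i] is True if base i is predicted to lie in an exterior loop.
    Coordinates are 0-based and end-exclusive.
    """
    fragments, start = [], None
    for i, is_ext in enumerate(exterior):
        if not is_ext and start is None:
            start = i                      # a new i-fragment begins
        elif is_ext and start is not None:
            fragments.append((start, i, sequence[start:i]))
            start = None                   # the i-fragment ends at an exterior loop
    if start is not None:                  # a fragment running to the 3' end
        fragments.append((start, len(sequence), sequence[start:]))
    return fragments

# Hypothetical example: bases 5-7 form an exterior loop between two i-fragments.
seq = "GGGACUUUUGUCCCAAAGGGUUCCC"
ext = [False] * 5 + [True] * 3 + [False] * (len(seq) - 8)
print(split_into_i_fragments(seq, ext))
```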

Materials and methods

Data and datasets

In this study, all base pairs (including canonical base pairs, non-canonical base pairs, triplets, and pseudoknots) were regarded as part of the secondary structure. The RNA sequence and structure data used in this research were collected from bpRNA-1m32, RNA Strand33, PDB34, Archive II35, RNAStralign36, and RMDB37 up to January 2021. All these data can be downloaded from the corresponding databases. Structural data in RNA Strand were classified into two categories: data obtained by experiments (RNA Strand-experiment) and data obtained by prediction (RNA Strand-prediction). We collected RNA data from PDB with resolution better than 3.5 Å. As PDB offers only tertiary structures, they were converted to secondary structures using RNApdbee38.

All structural data in these datasets could be classified into three groups (Fig. 2) as follows:

(1) Data obtained by experiments (PDB, RMDB, and RNA Strand-experiment data);

(2) Data obtained by comparative sequence analysis that are widely used as benchmarks in other studies (Archive II and RNAStralign);

(3) Data obtained by comparative sequence analysis or other methods with uncertain accuracy (bpRNA-1m and RNA Strand-prediction).

Figure 2

Data sources and data processing flow. The number of RNAs in each dataset is shown with dataset names.

The structural data in the first group are the most accurate but comprise only a small number of entries (Fig. 2); hence, they were used as the independent test set for evaluating our model. The data in Archive II and RNAStralign are widely used as ground truth in RNA structure prediction studies; hence, we used the data in group two as training data. The third group contains a large amount of RNA secondary structure data; however, these structures were obtained using computational methods and have not been carefully screened, so they are not reliable enough for directly training our model. To this end, transfer learning (which focuses on storing knowledge gained while solving one problem and applying it to a different but related problem)39 was used: the RNA-par model was first pre-trained with the data in group three and then trans-trained with the data in group two. The data in group one were used to evaluate the performance of RNA-par while avoiding possible bias across the databases.

The data processing flow is described in Fig. 2. For the raw RNA sequences collected from all datasets, sequences containing letters corresponding to bases other than A, C, G, and U were removed. Thereafter, duplicated or similar sequences (similarity > 80%, the lowest cutoff allowed by CD-HIT-EST) within each group were removed separately using CD-HIT-EST40 (the unduplicated sequences and the corresponding structures in group two were used for statistical analysis because of their large number and high accuracy). The unduplicated sequences were partitioned into subsequences of equal length (200 nt) by a sliding window (window length = 200, step = 200); if the last subsequence was shorter than 200 nt, '-' was used to pad the gap. Subsequently, CD-HIT-EST was applied again with the same parameters to remove duplicated or similar sequences within and across the subsequence datasets. During this process, we retained the cross-dataset duplicates in group two while eliminating their counterparts in the other groups (bottom left part of Fig. 2), for two reasons: first, the structures in group two are more accurate than those in group three, so retaining the higher-accuracy samples for training should lead to better model performance; second, group one served as an independent test set for further evaluation of the model, and the benefit of increasing the test-set sample size is limited.
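The sliding-window partition can be sketched as follows (a minimal illustration, assuming a plain Python string; with step = window there is no overlap, while a smaller step would produce the overlapping windows whose predictions are averaged later):

```python
def window_partition(seq: str, window: int = 200, step: int = 200) -> list:
    """Cut a sequence into fixed-length windows, padding the last one with '-'."""
    subsequences = []
    for start in range(0, len(seq), step):
        sub = seq[start:start + window]
        sub += "-" * (window - len(sub))   # pad a short final window with '-'
        subsequences.append(sub)
    return subsequences
```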

After data preprocessing, we obtained 128 subsequences in group one, 3,635 subsequences in group two, and 17,927 subsequences in group three. Most subsequences were removed from the datasets due to high similarity within or across datasets. These datasets were shuffled and divided into training, validation, and test sets as shown in Table 1. Every RNA subsequence was encoded in a one-hot fashion, i.e., '00001' for 'A', '00010' for 'C', '00100' for 'G', '01000' for 'U', and '10000' for '-'.
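The encoding step can be sketched as follows (a minimal illustration using NumPy; the bit layout follows the mapping given above, and the 5*L matrix shape matches the model input described in Fig. 3):

```python
import numpy as np

ONE_HOT = {
    "A": [0, 0, 0, 0, 1],   # '00001'
    "C": [0, 0, 0, 1, 0],   # '00010'
    "G": [0, 0, 1, 0, 0],   # '00100'
    "U": [0, 1, 0, 0, 0],   # '01000'
    "-": [1, 0, 0, 0, 0],   # '10000'
}

def encode(subseq: str) -> np.ndarray:
    """Encode a subsequence of length L as a 5 x L one-hot matrix."""
    return np.array([ONE_HOT[base] for base in subseq]).T
```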

Table 1 Data and datasets.

To analyze the differences between the sequence patterns around exterior-loop bases and those around non-exterior-loop bases, two further sub-datasets were obtained from the unduplicated RNAs in group two. Two patterns of sequence were sampled: a nonexternal-loop group and an external-loop group. Both patterns are 31-nt RNA subsequences; the only difference is whether the base at the middle of the sequence belongs to an exterior loop (external-loop group) or not (nonexternal-loop group).

We built labels for each RNA subsequence in the training, validation, and test sets in accordance with the RNA secondary structure data. Each RNA subsequence label was composed of the labels of the individual bases in its sequence: a base was labeled '1' if it belongs to an exterior loop (Fig. 1) in the corresponding secondary structure data, and '0' otherwise (including '-').
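Under this definition, label construction can be sketched as follows (a hedged illustration, assuming the secondary structure is given as a list of 0-based (i, j) base pairs covering canonical and special pairs alike):

```python
def exterior_loop_labels(length: int, pairs: list) -> list:
    """Label each base 1 if it is unpaired and not enclosed by any base pair."""
    paired = set()
    for i, j in pairs:
        paired.update((i, j))
    labels = []
    for k in range(length):
        unpaired = k not in paired
        enclosed = any(i < k < j for i, j in pairs)   # lies between two paired bases
        labels.append(1 if unpaired and not enclosed else 0)
    return labels
```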

Model architecture of RNA-par

Our RNA-par model is composed of four blocks (Fig. 3). The first block, a 4-layer one-dimensional convolutional neural network (1D-CNN) module (green box) with the same kernel size \(K\), \(C\) channels, and ReLU activation in each layer, extracts features from the input data. The second block is a 1-layer Bi-LSTM module (blue box) with \(U\) units per cell and tanh activation, followed by batch normalization. The third block is a 2-layer ResNet41 (yellow box) with \(N\) nodes and ReLU activation, also followed by batch normalization; it consists of two ResNet layers, each composed of two FC layers and a shortcut connection (connecting the outputs of the two FC layers). Dropout with a rate of 0.1 was employed in all layers of these three blocks. The last block is a fully-connected output layer (white box) with two nodes and a softmax activation function. For the inputs of all these layers, masks were used to eliminate the effects of '-'.

Figure 3

Model architecture. The input of our model is a one-hot matrix of size 5*L, which encodes a subsequence of length L. This matrix is transformed into matrices of size C*L and U*L by the 1D-CNN and Bi-LSTM layers, respectively; that is, every base in the input RNA is encoded by U units. Each base, encoded by U units, is then transformed by the following FC layers, resulting in an output of size 1*L.
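A minimal Keras sketch of this four-block architecture is shown below (an illustration under stated assumptions, not the authors' exact implementation: the default hyper-parameter values are placeholders, since \(K\), \(C\), \(U\), and \(N\) were chosen by BO, and the masking of '-' positions is omitted for brevity; the 5*L one-hot matrix is transposed to length-first (L, 5) for Keras):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_rna_par(L=200, K=5, C=64, U=128, N=64):
    x = inputs = layers.Input(shape=(L, 5))            # one-hot subsequence
    for _ in range(4):                                 # block 1: 4-layer 1D-CNN
        x = layers.Conv1D(C, K, padding="same", activation="relu")(x)
        x = layers.Dropout(0.1)(x)
    x = layers.Bidirectional(                          # block 2: Bi-LSTM + BN
        layers.LSTM(U, activation="tanh", return_sequences=True))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.1)(x)
    for _ in range(2):                                 # block 3: 2-layer ResNet
        shortcut = x
        y = layers.Dense(N, activation="relu")(x)
        y = layers.Dense(N, activation="relu")(y)
        if shortcut.shape[-1] != N:                    # project to match dimensions
            shortcut = layers.Dense(N)(shortcut)
        x = layers.Add()([y, shortcut])
        x = layers.Dropout(0.1)(x)
    x = layers.BatchNormalization()(x)
    outputs = layers.Dense(2, activation="softmax")(x) # block 4: per-base output
    return tf.keras.Model(inputs, outputs)
```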

We employed transfer learning39 to train our RNA-par model. The training process comprised two stages: pre-training and trans-training. In the pre-training stage, the large but less accurate data in group three were used, while the limited but more precise data in group two were used in the trans-training stage. The same training scheme was used in both stages, which differs from the common approach of lowering the learning rate or freezing shallow layers during trans-training. To avoid overfitting, early stopping42 with a patience of 10 was employed. We also employed Bayesian Optimization (BO)43 to find the best combinations of the undecided hyper-parameters (\(K\), \(C\), \(U\), and \(N\); the search ranges of these hyper-parameters are shown in Table S1). The BO was run 100 times, yielding 100 models with different hyper-parameters.
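The two-stage scheme can be sketched as follows (a hedged illustration: the optimizer, loss, and epoch count are assumptions, as the source specifies only that the same scheme, with early stopping at patience 10, is used in both stages; the dataset arrays are random placeholders standing in for the encoded T1/V1 and T2/V2 data):

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Placeholder data with the expected shapes (replace with the real datasets).
x_t1, y_t1 = np.random.rand(512, 200, 5), np.random.randint(0, 2, (512, 200))
x_v1, y_v1 = np.random.rand(64, 200, 5), np.random.randint(0, 2, (64, 200))
x_t2, y_t2 = np.random.rand(256, 200, 5), np.random.randint(0, 2, (256, 200))
x_v2, y_v2 = np.random.rand(32, 200, 5), np.random.randint(0, 2, (32, 200))

model = build_rna_par()
model.compile(optimizer="adam",                     # assumed optimizer and loss
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
stop = EarlyStopping(patience=10, restore_best_weights=True)

# Stage 1: pre-training on the large, coarse group-three data (T1/V1).
model.fit(x_t1, y_t1, validation_data=(x_v1, y_v1), epochs=200, callbacks=[stop])
# Stage 2: trans-training on the smaller, accurate group-two data (T2/V2),
# with no frozen layers and no reduced learning rate.
model.fit(x_t2, y_t2, validation_data=(x_v2, y_v2), epochs=200, callbacks=[stop])
```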

To increase the generalization ability of our model, we employed an ensemble strategy, as used in previous studies28,44. Specifically, we selected the top 3 models according to the Matthews correlation coefficient (MCC) on validation set V2 (see Table 1), and the final prediction for each base was the average of the predictions of these 3 models.
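The ensemble step amounts to averaging the per-base exterior-loop probabilities of the selected models, for example:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average P(exterior loop) per base over the top-3 models selected by MCC."""
    probs = [m.predict(x)[..., 1] for m in models]
    return np.mean(probs, axis=0)
```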

Model performance metrics

For each input subsequence (200 nt), the RNA-par model outputs a prediction (a value between 0 and 1) for each base in the sequence (for a base covered by two overlapping sliding windows, the final prediction is the average of the two predictions). To analyze the performance of RNA-par, the commonly used metrics sensitivity (SEN), precision (PRE), accuracy (ACC), and MCC were calculated (formulas are shown in Table 2), where true positives (TPs, the number of correctly predicted exterior-loop bases), false positives (FPs, the number of incorrectly predicted exterior-loop bases), true negatives (TNs, the number of correctly predicted non-exterior-loop bases), and false negatives (FNs, the number of incorrectly predicted non-exterior-loop bases) were defined based on the model prediction and the ground truth of each base. Apart from these base-based metrics, we proposed segment-based metrics to better evaluate how accurately an RNA sequence is partitioned (Fig. 4), where TPs, FPs, TNs, and FNs are defined based on the prediction and ground truth of subsequences.

Table 2 Metrics formulas.
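For reference, these metrics follow their standard definitions (which Table 2 lists):

\[
\mathrm{SEN} = \frac{TP}{TP+FN}, \quad
\mathrm{PRE} = \frac{TP}{TP+FP}, \quad
\mathrm{ACC} = \frac{TP+TN}{TP+TN+FP+FN},
\]

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \quad
\mathrm{F1} = \frac{2 \times \mathrm{PRE} \times \mathrm{SEN}}{\mathrm{PRE}+\mathrm{SEN}}
\]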
Figure 4

Segment-based metrics. A segment is considered as a TP if one or more bases are correctly predicted in the region of an exterior loop, otherwise, it is considered as an FN. A segment is considered as an FP if one or more bases are incorrectly predicted in the region of the non-exterior loop, otherwise, a TN is obtained.
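The counting rule in Fig. 4 can be sketched as follows (a minimal illustration: ground-truth labels are split into maximal runs of identical labels, and each run contributes one segment-level count):

```python
def segment_counts(y_true, y_pred):
    """Count segment-based TP/FP/TN/FN from per-base 0/1 labels."""
    tp = fp = tn = fn = 0
    i = 0
    while i < len(y_true):
        j = i
        while j < len(y_true) and y_true[j] == y_true[i]:
            j += 1                                    # [i, j) is one segment
        hit = any(y_pred[k] == 1 for k in range(i, j))
        if y_true[i] == 1:                            # exterior-loop segment
            tp, fn = (tp + 1, fn) if hit else (tp, fn + 1)
        else:                                         # non-exterior-loop segment
            fp, tn = (fp + 1, tn) if hit else (fp, tn + 1)
        i = j
    return tp, fp, tn, fn
```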

We hypothesized that the RNA-par model could serve as a preprocessing step for RNA secondary structure prediction methods to improve their performance. To test this hypothesis, we combined the RNA-par model with five state-of-the-art RNA secondary structure prediction methods and evaluated the performance of this two-stage approach. The evaluation metrics were SEN, PRE, ACC, MCC, and the F1 score (F1, Table 2), which are commonly employed in RNA secondary structure prediction studies. In these metrics, TP denotes the number of true-positive base pairs, FP the number of false-positive base pairs, TN the number of true-negative base pairs, and FN the number of false-negative base pairs.

Results

Statistical analysis of RNAs and i-fragments

We first analyzed the RNAs and their corresponding i-fragments using the dataset composed of 2,847 unduplicated RNAs (indicated by the red circle in Fig. 2) obtained by removing duplicated or similar sequences from group two33. The length distribution of the RNA sequences in this dataset is shown in Fig. 5 (Supplemental Fig. 1). A total of 11.67% of these sequences were longer than 400 nt (the longest being 1800 nt), and their secondary structures are difficult to predict with most existing prediction algorithms. With increasing RNA length, the number of i-fragments per RNA sequence increased significantly (Fig. 5, correlation coefficient = 0.617, p ≪ 0.01). In this dataset, 17.1% of RNAs had more than one (2–10) i-fragment (Fig. 6). The correlation between the length of i-fragments and the length of RNAs, and that between the number of i-fragments and the length of i-fragments, were weak (r = 0.198 and -0.166, respectively, p ≪ 0.01, Supplemental Fig. 2, Fig. 6). The average length of i-fragments was 129 nt (shortest: 4 nt, longest: 1139 nt), which is significantly shorter than the lengths of complete RNA sequences. Long RNAs could thus be split into shorter i-fragments (Fig. 7, Supplemental Fig. 3) whose secondary structures can be effectively predicted by most existing algorithms.

Figure 5

RNA length distribution and the average number of i-fragments for RNAs of different lengths. The black bars correspond to the black y-axis on the left, while the blue dots correspond to the blue y-axis on the right. The employed dataset was composed of 2,847 unduplicated RNAs (indicated by the red circle in Fig. 2) obtained by removing duplicated or similar sequences from group two.

Figure 6

Distribution of the number of i-fragments and the average length of i-fragments in RNAs. The black bars correspond to the black y-axis on the left, while the blue dots correspond to the blue y-axis on the right. The employed dataset was composed of 2,847 unduplicated RNAs (indicated by the red circle in Fig. 2) obtained by removing duplicated or similar sequences from group two.

Figure 7

Contrast in length distribution of whole RNAs and i-fragments.

To examine differences between the bases in exterior loops and those in non-exterior loops, we compared the sequences in the nonexternal-loop group and the external-loop group (2,000 sequences for each pattern; see "Methods"). The base composition differed significantly between the two patterns of sequences (Figure S4, p ≪ 0.01): the C and G contents in the external-loop group were significantly lower than those in the nonexternal-loop group. This is consistent with the lower energy of G-C pairs, which favors the formation of stable secondary structure in C/G-rich regions. To better distinguish the bases in exterior loops from those in non-exterior loops, and to accurately predict the potential i-fragments, a more complex binary-classification model, RNA-par, was built.

The RNA-par model proposed in this study is composed of four blocks (Fig. 3): a 4-layer 1D-CNN block to extract features of the input sequences, a Bi-LSTM block to capture information from both sides of the sequences, a 2-layer ResNet block to transform the features, and a fully-connected layer to provide the output predictions. We also used BO to find the best hyper-parameter combinations (Supplemental Table 1), resulting in multiple models with the same architecture but different hyper-parameters. The details of the RNA-par model are described in the "Methods" section.

Pre-training and trans-training

We used the transfer learning approach to train the RNA-par model. The training process consisted of two steps: pre-training and trans-training. In pre-training, the coarse datasets T1 and V1 from group three were used as the training and validation sets, respectively. In trans-training, the pre-trained model was trained and validated again with the accurate datasets T2 and V2 derived from group two (Table 1). The performance on TS in both steps is shown in Table 3, averaged over the best three models obtained via BO. The PRE, SEN, ACC, and MCC improved significantly when the transfer learning strategy was applied, suggesting that the trans-training step is critical for further improving the performance of our model. To better evaluate how accurately the predicted exterior-loop bases partition an RNA sequence into i-fragments, we proposed segment-based metrics (see "Methods"). The segment-based metrics (Table 2) differ from the traditional (base-based) metrics and better reflect the performance of RNA-par.

Table 3 Performance of RNA-par trained with different modes on TS dataset. The best performance was highlighted in bold.

To further evaluate the benefits of the transfer learning strategy, we trained the same model using two traditional approaches without transfer learning. In the first approach (Coarse + accurate), the model was trained with a dataset containing both T1 and T2 and validated with a dataset containing both V1 and V2; in the second approach (Only accurate), the model was trained and validated directly with T2 and V2, respectively. We compared the model performance under the different training modes on the TS dataset in terms of PRE, SEN, ACC, and MCC (Table 3). In all metrics except SEN (which was comparable between the two modes), the model trained in 'Only accurate' mode performed better than that trained in 'Coarse + accurate' mode. We then compared the traditional training modes with the transfer learning mode. The model trained only with pre-training underperformed both traditionally trained models on all metrics, except SEN (comparable to 'Coarse + accurate' mode) and segment-based SEN (equal to 'Coarse + accurate' mode); however, after trans-training it outperformed both traditionally trained models.

Performance in independent test set

To further evaluate the performance of the RNA-par model, we used another independent test set, TS' (see "Methods"), from group one. All 128 subsequences in TS' were obtained by experiments and differ from those used in the training, validation, and test sets. Their labels were predicted by the best three models obtained through BO. The results, shown in Table 4, are slightly lower than those on TS (Table 3) except for segment-based SEN, but remain satisfactory for practical application. In addition, predicting each subsequence took only 0.015 s, which is fast enough for practical applications. These results indicate the high reliability and applicability of the RNA-par model.

Table 4 Performance of RNA-par on TS’ (see Table 1).

Furthermore, the lengths of the i-fragments were determined from the labels predicted by RNA-par. When one base or a run of successive bases was predicted to be in an exterior loop, those bases were regarded as an exterior loop (requiring no further structure prediction), thereby producing several short i-fragments. To illustrate the ability to partition long sequences effectively, RNAs longer than 200 nt were analyzed. For these RNAs, the i-fragments were significantly shorter than the RNAs themselves (average 453 nt compared to 848 nt), as illustrated in Fig. 8. The secondary structures of these i-fragments can be easily predicted by known algorithms, and their simple assembly yields the secondary structure of the complete RNA.

Figure 8

Length comparison between complete RNAs and i-fragments determined by RNA-par.

RNA secondary structure prediction with RNA-par

To evaluate the benefits of employing RNA-par as a preprocessing step for RNA secondary structure prediction, we combined RNA-par with five state-of-the-art RNA secondary structure prediction methods (RNAfold18, CONTRAfold, Linearfold20, SPOT-RNA28, and mfold45) and validated them on the independent test set. The performance of these methods with and without RNA-par preprocessing is shown in Table 5. For prediction with RNA-par, we first predicted the substructures of the i-fragments determined by RNA-par and then assembled these substructures directly to obtain the complete structures. The performance metrics of these methods with RNA-par were higher than those without RNA-par, except the SEN and F1 of mfold and the SEN of SPOT-RNA (which were comparable with or without RNA-par). This finding suggests that RNA-par can be combined with various prediction methods to improve their performance. According to the improvements in ACC, MCC, and F1 (three comprehensive metrics), RNAfold and CONTRAfold showed the largest improvement, while Linearfold and SPOT-RNA showed minor improvement. On our test set, the highest values in all metrics (with RNA-par) except PRE were obtained with RNAfold. When the RNA length was shorter than 200 nt, RNA-par could hardly improve the accuracy (Fig. 9) or computational efficiency (Fig. 10); in contrast, for much longer sequences, the advantage of RNA-par in both accuracy and runtime became obvious. Hence, RNA-par is more suitable for processing long RNAs.

Table 5 Performance comparison among different methods on dataset TS’ (see Table 1).
Figure 9

Accuracy (ACC) comparison between RNAfold with or without RNA-par in different ranges of RNA length. The dataset used was TS’ (See Table 1).

Figure 10

Runtime comparison between RNAfold with or without RNA-par in different ranges of RNA length. The dataset used was TS' (see Table 1). Runtime is measured in seconds; the time for loading deep learning libraries was not taken into account. Our model was built under the Keras framework. On our server, the GPU was an NVIDIA GeForce RTX 2080 Ti and the CPU was an Intel(R) Xeon(R) Platinum 8164; only one GPU and one CPU were used. The memory capacity was 128 GB.

A comparison of the structures of three RNAs (PDB00409, PDB3SUH, and PDB01118) predicted with and without RNA-par is shown in Fig. 11. The secondary structure of PDB00409 was predicted by RNAfold. In the reference structure, PDB00409 (from RMDB) contains four i-fragments; RNA-par successfully predicted the two i-fragments near the 3' end but failed to predict the two near the 5' end (although the predicted start and end points were very close to those in the experimentally determined reference structure). With RNA-par, the F1 improved by 0.079. PDB3SUH (from PDB) contains only one i-fragment; although RNA-par predicted this i-fragment correctly, the performance was primarily determined by the prediction method. If the i-fragments cannot be reliably predicted by RNA-par, the performance of the prediction method can drop remarkably: for example, RNA-par wrongly predicted the i-fragment of PDB01118 (from RMDB, whose reference structure contains only one i-fragment), and the F1 of SPOT-RNA decreased by 0.036 when combined with RNA-par.

Figure 11

Comparison of RNA structures predicted with and without RNA-par. Structures in the first column were drawn from the reference structures of the corresponding RNAs; structures in the second column were drawn from the structures directly predicted by RNA secondary structure prediction methods (RNAfold for PDB00409 and PDB3SUH, SPOT-RNA for PDB01118); structures in the third column were drawn from the assembled structures of i-fragments (structures were predicted by RNA secondary structure prediction methods, and i-fragments were obtained by RNA-par). The numbers in the bottom right corner of the cells in the first and third columns are the regions of i-fragments identified by RNA-par; numbers in green are correct predictions and those in black are incorrect predictions. The structure plots were drawn by RNApdbee 2.038 (http://rnapdbee.cs.put.poznan.pl/).

Discussion

In bioinformatics, predicting the secondary structures of long RNA sequences remains a significant challenge, one made more pressing by the discovery that the structures of long lncRNAs, mRNAs, and viral RNAs are critical for their functions. Here, we proposed a deep learning model, RNA-par, to predict the bases in the exterior loops of an RNA sequence and thereby partition it into i-fragments. Thus, the secondary structure prediction of long RNAs can be turned into several sub-tasks that can be easily addressed by existing algorithms, i.e., predicting the secondary structures of i-fragments. RNA-par makes many existing prediction algorithms applicable to long RNAs for which they would otherwise be inadequate due to their computational cost or inferior performance.

In the early days, most discovered functional RNAs, such as tRNAs, were very short. With deepening research, especially the discovery of lncRNAs, more and more long functional RNAs have been found, with lengths ranging from 200 to several thousand nucleotides. However, the number of long RNAs with determined secondary structures is still very limited due to the difficulty of structure determination experiments. In our study, we collected as much data as possible (from bpRNA-1m, Archive II, RNAStralign, PDB, RMDB, and the RNA Strand database) to train and test our model, but the number of long RNAs (especially those longer than 500 nt) is limited (Fig. 5). RNA-par focuses on preprocessing RNAs to improve the performance of prediction methods on long RNAs. In our results, RNA-par was shown to improve prediction performance for RNAs of any length, especially RNAs longer than 200 nt (Fig. 5). The performance of RNA-par can be further clarified as more long RNA samples become available.

Besides 200 nt, we also considered three other subsequence lengths (60 nt, 120 nt, and 280 nt), and trained and validated four models in the transfer learning mode. Two comprehensive indexes, ACC and MCC, were then measured for the four models on the TS dataset (Table 6). The results revealed that model performance first climbed with increasing subsequence length and then degraded after reaching its peak at 200 nt. We speculate that a subsequence that is too short does not contain enough information for the prediction, while for one that is too long, the Bi-LSTM module cannot handle the long-distance dependence well. Therefore, we finally selected 200 nt as the subsequence length for the RNA-par model.

Table 6 Performance of RNA-par models with different subsequence lengths on the TS dataset. The best performance was highlighted in bold.

In our independent test set, the performance improvement from RNA-par was limited (Table 5), which can be attributed to the short lengths of the experimentally determined sequences in the test set (86.7% of sequences were shorter than 150 nt). Such short RNA sequences generally contain only one i-fragment; hence, RNA-par has no effect, and no improvement in structure accuracy or running time can be obtained. In addition, the accuracy of RNA-par itself is still limited: for example, RNA-par failed to partition the longest RNA in our independent test set, PDB_00791 (1533 nt, 4 i-fragments). Improving RNA-par could further enhance the accuracy of the assembled structures.

Because RNA-par is a neural network model, it runs very fast (0.015 s) and accounts for only a small part of the entire structure prediction time (a few seconds to a few hours); hence, RNA-par scarcely increases the time complexity of structure prediction. For most RNA secondary structure prediction methods, the time complexity is at least \(O(n^{2})\), where n is the length of the input RNA. Because an RNA is partitioned into shorter i-fragments, the time consumed by a prediction method combined with RNA-par can be less than that of the method alone, as the following example illustrates.
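As an illustrative (hypothetical) calculation under the \(O(n^{2})\) assumption: a 1000-nt RNA split into four 250-nt i-fragments costs on the order of

\[
\sum_{k=1}^{4} n_{k}^{2} = 4 \times 250^{2} = 2.5 \times 10^{5}
\]

elementary steps instead of \(1000^{2} = 10^{6}\), a roughly four-fold saving; the numbers are illustrative, not measurements.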

In our study, the amount of training data in group three was large enough for training our deep learning model RNA-par; however, these structure data were predicted using comparative sequence analysis, which showed the following two problems:

(1) The structures may not be reliable at the single base-pair level, especially for long-range base pairs in long RNAs.

(2) The structures may be incomplete.

These two problems are hard to overcome completely and may limit the performance of the RNA-par model. To make full use of the large amount of data in group three while reducing the impact of inaccurate or incomplete structures, we employed the transfer learning method: a pre-training step was performed before the trans-training step. All data in group three were used only in the pre-training step (Table 1) to build a rough model, and the data with relatively high quality in group two were used in the trans-training step. Our results showed that this transfer learning strategy significantly improved the performance of RNA-par.

Unlike for proteins, high-precision RNA structure data are scarce. Many machine learning-based RNA secondary structure prediction models, especially deep learning models with large numbers of parameters, use structure data obtained by comparative sequence analysis as training sets. The RNA-par model does not predict the complete RNA secondary structure but only the bases in exterior loops; hence, it does not need a large training set. We also believe that our RNA-par model and machine learning-based RNA structure prediction methods will be further improved as reliable experiment-based RNA structure data accumulate.

RNA-par is composed of several blocks (CNN, Bi-LSTM, and ResNet) that have been successfully applied in other fields, such as image recognition and translation46,47. Building new models by integrating mature blocks according to their characteristics can aid the search for a suitable architecture for a specific task, and addressing a problem with a mature block is also of significant value. Beyond secondary structure prediction, deep learning has been widely used in many other RNA-related fields: for example, an LSTM model48 and a dilated convolutional neural network49 were used to predict RNA solvent accessibility, and a multi-layer stacked autoencoder50 was used to predict the subcellular localization of lncRNAs.

In the transfer learning process of RNA-par, neither were the weights of the shallow blocks frozen nor was the learning rate lowered; the model already trained on the coarse dataset (T1) was simply retrained on the more precise dataset (T2). The results showed the effectiveness of this transfer learning strategy, which has also proven more effective than traditional transfer learning approaches in previous studies28,51. Thus, dividing the dataset into several subsets by data quality and training the model in order from the coarse subset to the precise subset is an effective way to obtain a well-performing model.

To avoid overfitting when training RNA-par, the early stopping42 strategy was employed, with the patience set to 10 by experience: when no improvement was observed on the validation set within the last p (patience) epochs, training was stopped rather than running all epochs.

BO was used for model hyper-parameter selection. The BO was run 100 times, and the best three optimized models (according to MCC) with different hyper-parameters were selected. These models predict the label of a base from different perspectives, and the prediction results improved upon averaging the three models. We also tested the performance of RNA-par under different combinations of the number of 1D-CNN layers (LoC) and the number of Bi-LSTM layers (LoL); the results (Table S2) showed that LoC = 4 and LoL = 1 was the best combination.

As a preprocessing step, RNA-par can process RNAs of any length. However, before an RNA is input into RNA-par, it must be cut into 200-nt subsequences by a sliding window (see Section "Data and datasets"). After prediction by RNA-par, the predicted results of the subsequences are assembled into the labels for the entire RNA (the label for a base is defined in Section "Data and datasets"). Some machine learning models are capable of handling inputs of any length, and RNA-par could be further refined with such models to simplify the preprocessing.

For cases where long-range interactions exist between two identified i-fragments, the proposed solution will give false predictions. However, RNA-par was trained to avoid such cases: when preparing the training samples, all bases located between two paired bases (including non-canonical base pairs, triplets, and pseudoknots) were labeled as not belonging to an exterior loop. Nevertheless, an FP predicted by RNA-par may falsely partition a single i-fragment into two, and interactions may exist between the two falsely partitioned parts; our solution cannot predict such interactions and will give false-negative predictions. RNA-par achieved an S-PRE of 0.9998 on the TS dataset and 0.9474 on the TS' dataset with the trans-training strategy, indicating that such cases are rare. Existing results also showed that our solution can improve the structure prediction performance for long RNAs (Table 5, Fig. 9 and Fig. 11). In addition, the results may be improved by structure adjustment after assembling the predicted i-fragment structures: for example, in the predicted structure of PDB01118 (bottom right cell of Fig. 11) preprocessed with RNA-par, it is not difficult to see that the first three bases should pair with the 52nd, 53rd, and 54th bases to further reduce the free energy.

We believe that developing a framework combining RNA-par with existing RNA secondary structure prediction algorithms will advance the structure prediction of long RNAs. In addition, if an algorithm were developed to predict the outermost base pairs of each i-fragment, RNA-par could be used iteratively to partition an RNA into ever shorter subsequences.

Conclusions

Here, we proposed a deep learning model, RNA-par, to predict the bases in the exterior loops of an RNA sequence. Using these bases, the RNA sequence can be partitioned into i-fragments, turning long-sequence secondary structure prediction into the prediction of much shorter i-fragments. The secondary structure of each i-fragment, predicted individually, can then be assembled to obtain the complete RNA secondary structure with high accuracy. We believe that a framework combining RNA-par with existing RNA secondary structure prediction algorithms can improve the structure prediction of long RNAs.