Abstract
Accurate predictions of RNA secondary structures can help uncover the roles of functional noncoding RNAs. Although machine learningbased models have achieved high performance in terms of prediction accuracy, overfitting is a common risk for such highly parameterized models. Here we show that overfitting can be minimized when RNA folding scores learnt using a deep neural network are integrated together with Turner’s nearestneighbor free energy parameters. Training the model with thermodynamic regularization ensures that folding scores and the calculated free energy are as close as possible. In computational experiments designed for newly discovered noncoding RNAs, our algorithm (MXfold2) achieves the most robust and accurate predictions of RNA secondary structures without sacrificing computational efficiency compared to several other algorithms. The results suggest that integrating thermodynamic information could help improve the robustness of deep learningbased predictions of RNA secondary structure.
Introduction
Recent studies have revealed that functional noncoding RNAs (ncRNAs) play essential roles, including transcriptional regulation and guiding modification, participating in various biological processes ranging from development to cell differentiation, with defective functionality being involved in various diseases^{1}. Because it is wellknown that the functions of ncRNAs are deeply related to their structures rather than their primary sequences, discovering the structures of ncRNAs can elucidate the functions of ncRNAs. However, there are major difficulties in determining RNA tertiary structures through experimental assays such as nuclear magnetic resonance and Xray crystal structure analysis, because of the high experimental costs and resolution limits on measurements of RNA. Although considerable advances in cryoelectron microscopy research on RNA tertiary structure determination have been achieved in recent years^{2}, these limitations have not yet been completely overcome. Therefore, instead of conducting such experimental assays, we frequently perform computational prediction of RNA secondary structures, defined as sets of basepairs with hydrogen bonds between the nucleotides.
The most popular approach for predicting RNA secondary structures is based on thermodynamic models, such as Turner’s nearestneighbor model^{3,4}, in which a secondary structure is decomposed into several characteristic substructures, called nearestneighbor loops, such as hairpin loops, internal loops, bulge loops, basepair stackings, multibranch loops, and external loops, as shown in Fig. 1. The free energy of each nearestneighbor loop can be calculated by summing the free energy parameters that characterize the loop. The free energy parameters have been determined in advance by experimental methods such as optical melting experiments^{3}. The free energy of an entire RNA secondary structure is calculated by summing the free energy of the decomposed nearestneighbor loops. We can efficiently calculate an optimal secondary structure that has the minimum free energy using a dynamic programming (DP) technique, the wellknown Zuker algorithm^{5}. A number of tools, including Mfold/UNAfold^{6,7}, RNAfold^{8,9}, and RNAstructure^{10,11}, have adopted this approach.
An alternative approach utilizes machine learning techniques, which train scoring parameters for decomposed substructures from reference structures, rather than experimental techniques. This approach has successfully been adopted by CONTRAfold^{12,13}, ContextFold^{14}, and other methods, and has enabled us to more accurately predict RNA secondary structures. However, rich parameterization can easily cause overfitting to the training data, thus preventing robust predictions for a wide range of RNA sequences^{15}. Probabilistic generative models such as stochastic contextfree grammars (SCFGs) have also been applied to predicting RNA secondary structures. Recently, TORNADO^{15} implemented an application of SCFGs to the nearestneighbor model, thus achieving performance comparable with that of other machine learningbased methods.
Hybrid methods that combine thermodynamic and machine learningbased approaches to compensate for each other’s shortcomings have been developed. SimFold^{16,17} more accurately estimates thermodynamic parameters from training data, including RNA sequences and their known secondary structures, as well as the free energy of their known secondary structures. We previously proposed MXfold^{18}, which combines thermodynamic energy parameters and richparameterized weight parameters. The model learns more precise parameters for substructures observed in training data and avoids overfitting richparameterized weight parameters to the training data by resorting to the thermodynamic parameters for assessing previously unobserved substructures.
In recent years, deep learning has made remarkable progress in a wide range of fields, including bioinformatics and life sciences. SPOTRNA^{19} and E2Efold^{20} have been developed for RNA secondary structure prediction using deep neural networks (DNNs). Both algorithms formulate RNA secondary structure prediction as multiple binary classification problems that predict whether each pair of nucleotides forms a base pair or not, using a deep neural network trained with a large amount of training data. As with the other machine learningbased methods described above, concerns about overfitting caused by rich parameterization remain.
Inspired by MXfold and the DNNbased RNA secondary structure prediction methods, in this paper, we propose an algorithm for predicting RNA secondary structures using deep learning. Similar to MXfold, we integrate folding scores, which are calculated by a deep neural network, with Turner’s nearestneighbor free energy parameters. The deep neural network is trained using the maxmargin framework with thermodynamic regularization, which is proposed in this paper to make our model robust by maintaining that the folding score calculated by our method and the free energy calculated by the thermodynamic parameters are as close as possible. To confirm the robustness of our algorithm, we conducted computational experiments with two types of crossvalidation: sequencewise crossvalidation, with test datasets structurally similar to training datasets, and familywise crossvalidation, with test datasets structurally dissimilar to training datasets. The resulting method, called MXfold2, achieved the best prediction accuracy based not only on the sequencewise crossvalidation, but also on the familywise crossvalidation, indicating the robustness of our algorithm. In addition, the computational experiment conducted on a dataset with sequencestructureenergy triplets shows a high correlation between folding scores predicted by MXfold2 and the free energy derived from optical melting experiments.
Results
Overview of our algorithm
Our algorithm computes four types of folding scores for each pair of nucleotides by using a deep neural network, as shown in Fig. 2. The folding scores are used to calculate scores of the nearestneighbor loops. Similar to MXfold, we integrate the folding scores, which are calculated using a deep neural network, with Turner’s nearestneighbor free energy parameters. Then, our algorithm predicts an optimal secondary structure that maximizes the sum of the scores of the nearestneighbor loops using Zukerstyle dynamic programming (DP)^{5}. The deep neural network is trained using the maxmargin framework, also known as structured support vector machine (SSVM), to minimize the structured hinge loss function with thermodynamic regularization, which prevents the folding score of the secondary structure from differing significantly from the free energy of the thermodynamic parameters.
Effectiveness of two thermodynamicrelated techniques
To investigate the effectiveness of the thermodynamicintegrated folding scores and thermodynamic regularization, which are the main contribution of this study, we conducted an experiment to compare the cases with and without the thermodynamicrelated techniques. We implemented a base model that employs only the deep neural network to compute the folding score; that is, it does not use the thermodynamic parameters for the folding score and the thermodynamic regularization. Then, we compared the base model with the use of the thermodynamicintegrated folding score and/or the thermodynamic regularization. For this experiment, we employed TrainSetA, TestSetA, and TestSetB, which have been established by Rivas et al.^{15}. Note that TrainSetA and TestSetB are structurally dissimilar, whereas TrainSetA and TestSetA have some structural similarity. All the models were trained with TrainSetA and were then evaluated using TestSetA and TestSetB. The results in Table 1 and Supplementary Fig. 1 show that the base model achieved high accuracy for TestSetA, but quite poor accuracy for TestSetB, suggesting that the base model might be categorized as being prone to heavy overfitting. In contrast, the model using the complete thermodynamicrelated techniques (the final model) showed much better accuracy for TestSetB compared with the base model, indicating that the two thermodynamicrelated techniques enable robust predictions.
Comparison with the existing methods
We compared our algorithm with nine available folding algorithms: MXfold version 0.0.2^{18}, LinearFoldV (committed on Feb 5, 2020)^{21}, CONTRAfold version 2.02^{12,13}, ContextFold version 1.00^{14}, CentroidFold version 0.0.16^{22,23}, TORNADO version 0.3^{15} with ViennaRNA grammar, RNAfold in ViennaRNA package version 2.4.14^{8,9}, SimFold version 2.1^{16,17}, and RNAstructure version 6.2^{10,11}. The parameters of four trainable algorithms, namely, MXfold, CONTRAfold, ContextFold and TORNADO, were trained with TrainSetA. For the other algorithms, default parameters were used.
Figures 3a and 3b show PPVSEN plots of prediction accuracy for TestSetA and TestSetB, respectively. The maximum expected accuracy (MEA)based secondary structure predictions, TORNADO, CONTRAfold and CentroidFold, can control PPV and SEN through the parameter γ, and therefore, their accuracies are shown as the curves for multiple γ ∈ {2^{n}∣ − 5 ≤ n ≤ 10}.
For the predictions on TestSetA, our algorithm, MXfold2, achieved the best accuracy (F = 0.761), followed by ContextFold (F = 0.759). While the difference in Fvalue between MXfold2 and ContextFold was not significant (p = 0.49, onesided Wilcoxon singedrank test), MXfold2 was significantly more accurate than the other methods, including TORNADO (F = 0.746 at γ = 4.0), MXfold (F = 0.739), and CONTRAfold (F = 0.719 at γ = 4.0) (p < 0.001). On the other hand, for the predictions on TestSetB, which is structurally dissimilar from TrainSetA, we observe that ContextFold achieved the worst accuracy (F = 0.502), consistent with the possible proneness of ContextFold to overfitting, as noted by Rivas et al.^{15}. In contrast, we do not observe heavy overfitting, as occurred with ContextFold, with MXfold2 (F = 0.601), which is also significantly more accurate than the other methods, including MXfold (F = 0.581), CONTRAfold (F = 0.573 at γ = 4.0), and TORNADO (F = 0.552 at γ = 4.0) (p < 0.001).
Figures 3c and 3d show the PPVSEN plot and the distribution of Fvalues, respectively, on the combined dataset comprising TestSetA and TestSetB. These results indicate that MXfold2 (F = 0.693) achieved the best accuracy, followed by the trainable methods, namely, MXfold (F = 0.673), TORNADO (F = 0.664 at γ = 4.0), CONTRAfold (F = 0.658 at γ = 4.0), and ContextFold (F = 0.651), and MXfold2 also outperformed the thermodynamicsbased methods (p < 0.001, onesided Wilcoxon singedrank test).
Comparison with DNNbased methods
We compared our algorithm, MXfold2, with two recently published DNNbased methods: SPOTRNA^{19} and E2Efold^{20}. We trained MXfold2 using the available SPOTRNA and E2Efold training datasets and tested the prediction accuracy on their test datasets for direct comparison because SPOTRNA does not provide training modules. Furthermore, because their datasets were designed only for sequencewise crossvalidation, for the purpose of familywise crossvalidation, we developed the bpRNAnew dataset, which does not contain any families present in their training datasets.
Table 2 and Supplementary Fig. 2 show the comparison of the prediction accuracy among MXfold2, SPOTRNA, TORNADO, ContextFold, and RNAfold, with MXfold2 and ContextFold trained on the TR0 dataset, which is a subset of the bpRNA1m dataset^{24}. In contrast, SPOTRNA used the pretrained model trained on the TR0 dataset and an additional dataset for transfer learning. All of these methods were tested using the TS0 dataset, which is also a subset of the bpRNA1m dataset, for sequencewise crossvalidation, after which the bpRNAnew dataset was used for familywise crossvalidation.
We observe that MXfold2 performs slightly worse in sequencewise crossvalidation relative to SPOTRNA, but the difference is not significant (p = 0.31, onesided Wilcoxon signedrank test). On the other hand, in terms of familywise crossvalidation, MXfold2 is significantly more accurate than SPOTRNA (p < 0.001). In addition, as shown in Supplementary Fig. 3, MXfold2 is, on average, approximately 15 and 36 times faster than SPOTRNA when using GPU and CPU, respectively.
Figures 4 and 5 show typical examples of successful and failed predictions by MXfold2. We find in Fig. 4 that our method, which integrates the thermodynamic model, enables accurate predictions even when machine learningbased methods cannot be sufficiently trained. However, Fig. 5 suggests that if neither machine learningbased methods nor the thermodynamicbased method can accurately predict the secondary structure, MXfold2 may also be difficult to predict it successfully.
We also compared MXfold2 with E2Efold, TORNADO, ContextFold, and RNAfold (Supplementary Table 5 and Supplementary Fig. 4), in which E2Efold used the pretrained model trained on a subset of the RNAStrAlign dataset^{25}, and MXfold2 and ContextFold were trained on the same dataset. All of these methods were tested on a subset of the ArchiveII dataset^{26} for sequencewise crossvalidation and the bpRNAnew dataset for familywise crossvalidation. We observe that E2Efold almost completely failed to predict secondary structures for unseen families (F = 0.0361), whereas MXfold2 accurately predicted their secondary structures (F = 0.628). Furthermore, E2Efold trained with TrainSetA also failed to predict secondary structures for TestSetB (F = 0.0322), which is outside of Fig. 3. These results suggest that E2Efold might be categorized as being prone to heavy overfitting.
Correlation with free energy
We investigated the correlation between the free energy and the predicted folding score using the TFull dataset^{17}, which contains sequencestructureenergy triplets. Table 3 shows the root mean square error (RMSE) and the Spearman’s rank correlation coefficient (ρ) between the free energy of the reference structure and the folding score of the predicted structure for MXfold2 with and without thermodynamic regularization compared with CONTRAfold and RNAfold. The folding score of MXfold2 is highly correlated with the true free energy, though it is not as high as that of RNAfold. In contrast, even though MXfold2 without thermodynamic regularization is able to predict secondary structures as accurately as MXfold2 with thermodynamic regularization, its folding score is not highly correlated with the true free energy. This suggests that high accuracy in the secondary structure prediction does not directly lead to the reliable evaluation of thermodynamic stability.
Discussion
We have proposed a deep learningbased algorithm for predicting RNA secondary structures with thermodynamic integration. Our experiments showed that thermodynamic integration, consisting of thermodynamicintegrated folding scores and thermodynamic regularization combined, substantially improved the robustness of predictions. We conducted benchmarks that compare our algorithm with conventional algorithms, using not only sequencewise crossvalidation, but also familywise crossvalidation, which assumes one of the most practical situations for RNA secondary structure prediction from single sequences. In this comparison, our algorithm achieved the best prediction accuracy without overfitting to the training data, as was the case with other richparameterized methods such as ContextFold and E2Efold. The proposed MXfold2 algorithm should be useful for improving RNA structure modeling, especially for newly discovered RNAs.
Our experiments showed that several methods with rich parameterization achieved excellent performance for test datasets structurally similar to training datasets, but poor performance for test datasets structurally dissimilar from training datasets. Rivas et al.^{15} have already pointed out that richparameterized methods might easily tend towards overfitting. Such secondary structure prediction methods would be useful if we could know a priori whether the sequence we want to predict is a member of a family included in the training data. However, if we have knowledge about the family of the sequences whose structures are to be predicted, homology searches for RNA sequences, such as Infernal^{27}, are a better choice. In other words, the lack of prior knowledge about sequences is a major reason the prediction of secondary structures from single sequences is an important goal. In principle, regardless of whether a given sequence is within a family included in a training dataset or not, robust methods should be able to properly predict its secondary structure because its secondary structure is formed under the same physical laws. In general, machine learningbased methods can be expected to improve prediction accuracy by increasing the number of training samples. Although the development of sequencing techniques has made it easier to obtain sequence data, it is still not easy to increase the amount of training data because of the significant effort required to obtain those empirically determined secondary structures. Therefore, it is impractical to predict secondary structures of newly discovered sequences using methods that are not generalizable to a wide variety of sequences, such as those methods that rely on only memorizing sequences and their secondary structures within a large number of neural network parameters. We designed our experiments to predict secondary structures of newly discovered RNAs using test datasets structurally dissimilar from training datasets, e.g., TestSetB for TrainSetA (originally developed by Rivas et al.^{15}) and the bpRNAnew dataset for the TR0 dataset. These conditions are among the most important circumstances for RNA secondary structure prediction from single sequences, and our method, which is able to achieve high accuracy under these conditions, is useful in practice.
Deep neural networkbased methods, including our method, improve prediction accuracy by optimizing a larger number of parameters for training data relative to conventional methods. For example, our algorithm employs 803k parameters, whereas SPOTRNA and E2Efold use 1746k and 719k parameters, respectively. Therefore, as shown so far, such richparametrized models need to be trained more carefully because they are prone to overfitting. SPOTRNA mitigates overfitting by building an ensemble of five deep models with transfer learning. To do that, our MXfold2 integrates the folding scores, which are calculated by a deep neural network, with Turner’s nearestneighbor free energy parameters, and employs training using thermodynamic regularization. Whereas MXfold2 is limited to pseudoknotfree secondary structure prediction owing to its use of the nearestneighbor model, SPOTRNA and E2Efold are able to predict pseudoknotted structures because their RNA secondary structure prediction is formulated as multiple binary classification problems that predict whether each pair of nucleotides forms a base pair or not.
Estimation of free energies is important for applications other than structure prediction, such as small interfering RNA selection using hybridization thermodynamics^{28}. Table 3 indicates that RNAfold and MXfold2 with thermodynamic regularization can calculate folding scores that are highly correlated with true free energy estimates, at least for sequences for which secondary structures can be predicted with high accuracy. Given that MXfold2 is more accurate in secondary structure prediction compared with RNAfold, as mentioned above, MXfold2 can be expected to evaluate thermodynamic stability with high accuracy for more sequences compared with RNAfold.
Methods
Using a deep neural network to compute folding scores
Figure 2 shows the overview of our deep neural network, which is used to calculate four types of folding scores for each pair of nucleotides in a given sequence: helix stacking, helix opening, helix closing, and unpaired region.
The input of the network is an RNA sequence of length L. Each base (A, C, G, or U) is encoded into ddimensional embeddings using a trainable embedding function.
The L × ddimensional sequence embedding is entered into 1D convolution blocks (Block 1) repeated N_{1} times. Each 1D convolution block consists of two convolutional layers, with kernel sizes of 5 and 3, respectively, and a depth of d. Layer normalization^{29} and continuously differentiable exponential linear units (CELU) activation functions^{30} are employed. A dropout rate of 50% is used to avoid overfitting^{31}. To efficiently train deep neural networks, a residual connection^{32} that skips each layer is employed.
The next block (Block 2) contains N_{2} layers of the bidirectional long shortterm memory (BiLSTM) network^{33}. The number of hidden units in each direction of the BiLSTM network is d/2. Each BiLSTM layer is followed by the layer normalization and CELU activation functions. A dropout rate of 50% is also used in BiLSTM layers.
The L × d matrix obtained so far is divided into two matrices of size L × d/2; one of the two matrices expands in the column direction and the other in the row direction, and then the two matrices are concatenated into an L × L × d matrix, as shown in the “Concat 2D” operation in Fig. 2. This matrix is entered into 2D convolution blocks (Block 3) N_{3} times. Each 2D convolution block consists of two convolutional layers, each with a kernel size of 5 × 5 and 3 × 3, respectively, and a depth of d. The layer normalization and CELU functions with a dropout rate of 50% are again utilized.
Finally, four types of folding scores for each pair of ith and jth nucleotides are calculated by 3layer multilayer perceptron (MLP) with h hidden units.
We used the following values for the above hyperparameters: d = 64, N_{1} = 4, N_{2} = 2, N_{3} = 4, and h = 32.
Using DP to predicting RNA secondary structure
Similar to traditional folding methods, such as RNAfold and CONTRAfold, our algorithm adopts the nearestneighbor model^{4,34} as an RNA secondary structure model without pseudoknots. As an alternative to the thermodynamic parameters, our algorithm calculates the free energy of the nearestneighbor loops using four types of folding scores computed by the deep neural network described in the previous section. The four types of the folding scores are helix stacking scores, helix opening scores, helix closing scores, and unpaired region scores (Fig. 2). The helix stacking scores characterize the stacking of basepairs that mainly contributes to the thermodynamic stability of RNA structures. The helix opening scores and helix closing scores characterize each end of helix regions, corresponding to the terminal mismatch in Turner’s thermodynamic model. The unpaired region scores characterize unstructured sequences that contribute to the thermodynamic instability of RNA structures. Our algorithm calculates the folding score for a nearestneighbor loop by summing up the four types of the scores comprising the loop. For example, the folding score of the internal loop consisting of bases 25–29 and 43–47 in Fig. 1 is calculated by summing the helix closing score for (25, 47), the helix opening score for (28, 44), and the unpaired region scores for (26, 28) and (44, 46). Similarly, the folding score of the basepair stacking consisting of 23–24 and 48–49 in Fig. 1 is calculated from the helix stacking score for (23, 49).
Similar to MXfold, we integrate our model with the thermodynamic model by adding the negative value of the Turner free energy^{4} to the folding scores calculated by the deep neural network for each nearestneighbor loop. We define our scoring function of a secondary structure y given an RNA sequence x as
where f_{T}(x, y) is the contribution of the thermodynamic parameters (i.e., the negative value of the free energy of y) and f_{W}(x, y) is the contribution of the deep neural network, which is calculated as the sum of the folding scores over the decomposed nearestneighbor loops as described above. The thermodynamicintegrated folding scores defined by Eq. (1) can also be decomposed into nearestneighbor loops as each term of Eq. (1) can be decomposed in the same manner. Therefore, we can efficiently calculate an optimal secondary structure that maximizes the scoring function (Eq. 1) using the Zukerstyle dynamic programming (DP) algorithm^{5}. See Supplementary Methods for more details of the Zukerstyle DP algorithm.
Training parameters with the maxmargin framework
To optimize the network parameters λ, we employed a maxmargin framework called the structured support vector machine (SSVM)^{35}. Given a training dataset \({\mathcal{D}}={\{({x}^{(k)},{y}^{(k)})\}}_{k = 1}^{K}\), where x^{(k)} is the kth RNA sequence and y^{(k)} is the reference secondary structure for the kth sequence x^{(k)}, we aim to find a set of parameters λ that minimizes the objective function
where \({\mathcal{S}}(x)\) is a set of all possible secondary structures of x. The first term is the structured hinge loss function^{35}. Here, \({{\Delta }}(y,\hat{y})\) is a margin term of \(\hat{y}\) for y defined as
where δ^{FN} and δ^{FP} are tunable hyperparameters that control the tradeoff between sensitivity and specificity for learning the parameters. We used δ^{FN} = 0.5 and δ^{FP} = 0.005 by default. The margin term for structured models enables robust predictions by maximizing the margin between f(x, y) and \(f(x,\hat{y})\) for \(y\, \ne\, \hat{y}\). We can calculate the first term of Eq. (2) using Zukerstyle dynamic programming^{5} modified by the use of lossaugmented inference^{35}. The second and third terms of Eq. (2) are regularization terms that penalize parameter values that are extreme relative to a postulated distribution. The second term is our proposed thermodynamic regularization, which prevents the folding score of the secondary structure from differing significantly from the free energy of the thermodynamic parameters. The third term of Eq. (2) is the ℓ_{2} regularization term. We used C_{1} = 0.125 and C_{2} = 0.01 by default. To minimize the objective function (2), we employ the Adam optimizer^{36}.
Datasets
To evaluate our algorithm, we performed computational experiments on several datasets. Supplementary Table 1 shows a summary of the datasets used in our experiments.
The first dataset, which has been carefully established by Rivas et al.^{15}, includes TrainSetA, TestSetA, TrainSetB, and TestSetB. TrainSetA and TestSetA were collected from the literature^{12,16,17,37,38,39}. TrainSetB and TestSetB, which contain 22 families with 3D structure annotations, were extracted from Rfam 10.0^{40}. The sequences in Train/TestSetB share less than 70% sequence identity with the sequences in TrainSetA. We excluded a number of sequences that contain pseudoknotted secondary structures in the original data sources from all four subdatasets since all algorithms evaluated in this study were designed for predicting RNA secondary structures without pseudoknots. It is important to note that literaturebased TrainSetA and Rfambased TestSetB are structurally dissimilar, whereas TrainSetA and TestSetA have some structural similarity.
To compare our algorithm with SPOTRNA^{19}, we employed the same dataset extracted from the bpRNA1m dataset^{24}, which is based on Rfam 12.2^{41} with 2588 families. After removing redundant sequences based on sequence identity, the bpRNA1m dataset was randomly split into three subdatasets, namely, TR0, VL0, and TS0, for training, validation, and testing, respectively. Note that this partition is not familywise, but sequencewise. SPOTRNA performed the initial training and validation using TR0 and VL0, which was followed by testing using TS0 and then transfer learning using other PDBbased datasets.
To confirm the robustness against “unseen” families, we built a dataset that includes families from the most recent Rfam database, Rfam 14.2^{42}. Since the release of Rfam 12.2, from which bpRNA1m is derived, the Rfam project has been actively collecting about 1,500 RNA families, including families detected by newly developed techniques^{43}. We first extracted these newly discovered families. Then, as with SPOTRNA, we removed redundant sequences by CDHITEST^{44}, with a cutoff threshold of 80%, and discarded sequences whose lengths exceeded 500 bp. Our dataset is referred to as bpRNAnew.
We also compared our algorithm with E2Efold^{20}, using the same experimental conditions in which the RNAStrAlign dataset^{25} and the ArchiveII dataset^{26} were employed for training and testing, respectively. Note that this setting is also not familywise, but sequencewise, because both datasets contain the same families.
To investigate the correlation between folding score and free energy, we employed the TFull dataset, which was compiled by Andronescu et al.^{17}. Each sequence in the TFull dataset has a corresponding reference secondary structure, as well as its free energy change derived from optical melting experiments.
Performance measure
We evaluated the accuracy of predicting RNA secondary structures through the positive predictive value (PPV) and the sensitivity (SEN) of the base pairs, defined as
where TP is the number of correctly predicted base pairs (true positives), FP is the number of incorrectly predicted base pairs (false positives), and FN is the number of base pairs in the reference structure that were not predicted (false negatives). We also used the Fvalue as the balanced measure between PPV and SEN, which is defined as their harmonic mean:
For the experiment on the TFull dataset, we evaluated the root mean square error (RMSE)
where e_{i} and \({\hat{e}}_{i}\) are the free energy of the reference secondary structure and the negative value of the folding score of the predicted secondary structure for the ith sequence, respectively. We also evaluated the Spearman’s rank correlation coefficient (ρ) between e and \(\hat{e}\).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The datasets used in this study are available at https://doi.org/10.5281/zenodo.4430150.
Code availability
The MXfold2 source code is available at https://github.com/keiobioinformatics/mxfold2/, and the MXfold2 web server is available for use at http://www.dna.bio.keio.ac.jp/mxfold2/.
References
Hirose, T., Mishima, Y. & Tomari, Y. Elements and machinery of noncoding RNAs: toward their taxonomy. EMBO Rep. 15, 489–507 (2014).
Kappel, K. et al. Accelerated cryoEMguided determination of threedimensional RNAonly structures. Nat. Methods 17, 699–707 (2020).
Schroeder, S. J. & Turner, D. H. Optical melting measurements of nucleic acid thermodynamics. Methods Enzymol. 468, 371–387 (2009).
Turner, D. H. & Mathews, D. H. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 38, D280–2 (2010).
Zuker, M. & Stiegler, P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9, 133–148 (1981).
Zuker, M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31, 3406–3415 (2003).
Markham, N. R. & Zuker, M. UNAFold: software for nucleic acid folding and hybridization. Methods Mol. Biol. 453, 3–31 (2008).
Hofacker, I. L. Vienna RNA secondary structure server. Nucleic Acids Res. 31, 3429–3431 (2003).
Lorenz, R. et al. ViennaRNA package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
Mathews, D. H., Andre, T. C., Kim, J., Turner, D. H. & Zuker, M. An updated recursive algorithm for RNA secondary structure prediction with improved thermodynamic parameters. (eds. Leontis & John, N) Molecular Modeling of Nucleic Acids. Vol. 682 of ACS Symposium Series. 246–257 (American Chemical Society, 1997).
Reuter, J. S. & Mathews, D. H. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinform. 11, 129 (2010).
Do, C. B., Woods, D. A. & Batzoglou, S. CONTRAfold: RNA secondary structure prediction without physicsbased models. Bioinformatics 22, e90–8 (2006).
Foo, C.S., Foo, C. S. & Ng, A. Y. Efficient multiple hyperparameter learning for loglinear models. (eds. Platt, J. C., Koller, D., Singer, Y. & Roweis, S. T.) Advances in Neural Information Processing Systems 20, 377–384 (Curran Associates, Inc., 2008).
Zakov, S., Goldberg, Y., Elhadad, M. & ZivUkelson, M. Rich parameterization improves RNA structure prediction. J. Comput. Biol. 18, 1525–1542 (2011).
Rivas, E., Lang, R. & Eddy, S. R. A range of complex probabilistic models for RNA secondary structure prediction that includes the nearestneighbor model and more. RNA 18, 193–212 (2012).
Andronescu, M., Condon, A., Hoos, H. H., Mathews, D. H. & Murphy, K. P. Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics 23, i19–28 (2007).
Andronescu, M., Condon, A., Hoos, H. H., Mathews, D. H. & Murphy, K. P. Computational approaches for RNA energy parameter estimation. RNA 16, 2304–2318 (2010).
Akiyama, M., Sato, K. & Sakakibara, Y. A maxmargin training of RNA secondary structure prediction integrated with the thermodynamic model. J. Bioinform. Comput. Biol. 16, 1840025 (2018).
Singh, J., Hanson, J., Paliwal, K. & Zhou, Y. RNA secondary structure prediction using an ensemble of twodimensional deep neural networks and transfer learning. Nat. Commun. 10, 5407 (2019).
Chen, X., Li, Y., Umarov, R., Gao, X. & Song, L. RNA secondary structure prediction by learning unrolled algorithms. In Proceedings of the 8th International Conference on Learning Representations https://doi.org/10.6084/m9.figshare.hgv.1920 (2020).
Huang, L. et al. LinearFold: lineartime approximate RNA folding by 5’to3’ dynamic programming and beam search. Bioinformatics 35, i295–i304 (2019).
Hamada, M., Kiryu, H., Sato, K., Mituyama, T. & Asai, K. Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 25, 465–473 (2009).
Sato, K., Hamada, M., Asai, K. & Mituyama, T. CentroidFold: a web server for RNA secondary structure prediction. Nucleic Acids Res. 37, 277–280 (2009).
Danaee, P. et al. bpRNA: largescale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res. 46, 5381–5394 (2018).
Tan, Z., Fu, Y., Sharma, G. & Mathews, D. H. TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs. Nucleic Acids Res. 45, 11570–11581 (2017).
Sloma, M. F. & Mathews, D. H. Exact calculation of loop formation probability identifies folding motifs in RNA secondary structures. RNA 22, 1808–1818 (2016).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Lu, Z. J. & Mathews, D. H. Efficient siRNA selection using hybridization thermodynamics. Nucleic Acids Res. 36, 640–647 (2008).
Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at arXiv:1607.06450 [stat.ML] (2016).
Barron, J. T. Continuously differentiable exponential linear units. Preprint at arXiv:1704.07483 [cs.LG] (2017).
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. R. Improving neural networks by preventing coadaptation of feature detectors. Preprint at arXiv:1207.0580 [cs.NE] (2012).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Preprint at arXiv:1512.03385 [cs.CV] (2015).
Hochreiter, S. & Schmidhuber, J. Long shortterm memory. Neural Comput. 9, 1735–1780 (1997).
Tinoco, I. Jr. et al. Improved estimation of secondary structure in ribonucleic acids. Nat. New Biol. 246, 40–41 (1973).
Tsochantaridis, I., Joachims, T., Hofmann, T. & Altun, Y. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–1484 (2005).
Kingma, D. P. & Ba, J. (2014) Adam: A Method for Stochastic Optimization. Preprint at https://arxiv.org/abs/1412.6980.
Steinberg, S., Misch, A. & Sprinzl, M. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 21, 3011–3015 (1993).
Dowell, R. D. & Eddy, S. R. Evaluation of several lightweight stochastic contextfree grammars for RNA secondary structure prediction. BMC Bioinform. 5, 71 (2004).
Lu, Z. J., Gloor, J. W. & Mathews, D. H. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA 15, 1805–1813 (2009).
Gardner, P. P. et al. Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res. 39, D141–D145 (2011).
Nawrocki, E. P. et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 43, D130–7 (2015).
Kalvari, I. et al. Rfam 13.0: shifting to a genomecentric resource for noncoding RNA families. Nucleic Acids Res. 46, D335–D342 (2018).
Weinberg, Z. et al. Detection of 224 candidate structured RNAs by comparative analysis of specific subsets of intergenic regions. Nucleic Acids Res. 45, 10811–10823 (2017).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CDHIT: accelerated for clustering the nextgeneration sequencing data. Bioinformatics 28, 3150–3152 (2012).
Darty, K., Denise, A. & Ponty, Y. VARNA: interactive drawing and editing of the RNA secondary structure. Bioinformatics 25, 1974–1975 (2009).
Acknowledgements
This work was partially supported by a GrantinAid for Scientific Research (B) (No. 19H04210) and Challenging Research (Exploratory) (No. 19K22897) from the Japan Society for the Promotion of Science (JSPS) to K.S. and a GrantinAid for JSPS Fellows (No. 18J21767) from JSPS to M.A. This work was also supported by a GrantinAid for Scientific Research on Innovative Areas “Frontier Research on Chemical Communications.” The supercomputer system used for this research was provided by the National Institute of Genetics (NIG), Research Organization of Information and Systems (ROIS).
Author information
Authors and Affiliations
Contributions
K.S. conceived the study, implemented the algorithm, conducted experiments, and drafted the manuscript. K.S. and M.A. discussed the network architectures, designed the experiments, and collected the datasets. K.S. and Y.S. discussed the training algorithm. All authors read, contributed to the discussion of, and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Elena Rivas and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sato, K., Akiyama, M. & Sakakibara, Y. RNA secondary structure prediction using deep learning with thermodynamic integration. Nat Commun 12, 941 (2021). https://doi.org/10.1038/s41467021211944
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467021211944
This article is cited by

REDfold: accurate RNA secondary structure prediction using residual encoderdecoder network
BMC Bioinformatics (2023)

Automated design of dynamic programming schemes for RNA folding with pseudoknots
Algorithms for Molecular Biology (2023)

The oncogenic circular RNA circ_63706 is a potential therapeutic target in sonic hedgehogsubtype childhood medulloblastomas
Acta Neuropathologica Communications (2023)

Categorizing 161 plant (streptophyte) mitochondrial group II introns into 29 families of related paralogues finds only limited links between intron mobility and intronborne maturases
BMC Ecology and Evolution (2023)

Comparative Study of Singlestranded Oligonucleotides Secondary Structure Prediction Tools
BMC Bioinformatics (2023)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.