## Main

RNA molecules perform essential roles in cells, including regulating transcription, translation and molecular interactions and performing catalysis1. Synthetic RNA molecules are gaining increasing interest for a variety of applications, including genome editing2, biosensing3 and vaccination4. Characterizing RNA secondary structure, the collection of base pairs present in the molecule, is typically necessary for understanding the function of natural RNA molecules and is of crucial importance for designing better synthetic molecules. Some of the most widely used packages use a physics-based approach5 that assigns thermodynamic values to a set of structural features (ViennaRNA6, NUPACK7 and RNAstructure8), with parameters traditionally characterized via optical melting experiments and then generalized by expert intuition9. However, a number of other approaches have also been developed that use statistical learning methods to derive parameters for structural features (RNAsoft10, CONTRAfold11, CycleFold12, LearnToFold13, MXfold14, SPOT-RNA15).

Secondary structure modeling packages are typically evaluated by comparing single predicted structures to secondary structures of natural RNAs16. While important, this practice has limitations for accurately assessing packages, including bias toward structures more abundant in the most well-studied RNAs (transfer RNAs, ribosomal RNA and so on) and neglect of energetic effects from these natural RNAs’ tertiary contacts or binding partners. Furthermore, scoring on single structures fails to assess the accuracy of ensemble-averaged RNA structural observables, such as base-pairing probabilities, affinities for proteins and ligand-dependent structural rearrangements, which are particularly relevant for the study and design of riboswitches17,18, ribozymes, pre-mRNA transcripts and therapeutics19 that occupy more than one structure as part of their functional cycles. Existing packages are, in theory, capable of predicting ensemble properties through so-called partition function calculations and, in practice, are used to guide RNA ensemble-based design, despite not being validated for these applications.

Data from high-throughput RNA structure experiments, such as high-throughput chemical mapping20,21,22 and RNA-MaP experiments23,24, offer the opportunity to make incisive tests of secondary structure models with orders-of-magnitude more constructs than previously. Unlike datasets of single secondary structures, both of these experiments provide ensemble-averaged structural properties, which allow for directly evaluating the full ensemble calculation of secondary structure algorithms, obviating the need to also evaluate the further nontrivial inference of a most-likely structure from the calculated ensemble. Furthermore, experimental data on human-designed synthetic RNA libraries have the potential to mitigate effects of bias incurred in natural RNA datasets.

In this work, we evaluate the performance of commonly used packages capable of making thermodynamic predictions in two tasks for which large datasets of synthetic RNAs have been collected via the RNA design crowdsourcing platform Eterna25: (1) predicting chemical reactivity data through calculating probabilities that nucleotides are unpaired, and (2) predicting relative stabilities of multiple structural states that underlie the functions of riboswitch molecules, a task that involves predicting affinities of both small molecules and proteins of interest. We find striking, consistent differences in package performance across these quantitative tasks, with the packages CONTRAfold and RNAsoft performing better than packages that are in wider use.

We hypothesized that these data, although shorter than many natural RNAs of interest and not designed to bear similarity to natural RNAs, might still sufficiently represent RNA thermodynamics to allow for developing an improved algorithm. We tested this by developing a multitask-learning-based framework to train a thermodynamic model on these tasks concurrently with the task of single-structure prediction. The resulting multitask-trained model, called EternaFold, did indeed demonstrate increased accuracy both on held-out data from Eterna as well as a collection of 31 independent datasets gathered from other literature sources, which encompass viral genomes, mRNAs and other small synthetic RNAs, probed with distinct methods and under distinct solution and cellular conditions. Compared to prior studies, this represents a very large collection of datasets used to evaluate RNA secondary structure algorithms.

## Results

### Evaluated packages

We initially evaluated commonly used secondary structure modeling packages in their ability to make thermodynamic predictions on two datasets of diverse synthetic molecules from Eterna: EternaBench-ChemMapping (n = 12,711) and EternaBench-Switch (n = 7,228). The packages ViennaRNA (v.1.8.5, 2.4.10), NUPACK (v.3.2.2), RNAstructure (v.6.2), RNAsoft (v.2.0) and CONTRAfold (v.1.0, 2.02) were analyzed across different package versions, parameter sets and modeling options where available (Supplementary Table 1). We also evaluated packages trained more recently through a varied set of statistical or deep learning methods (LearnToFold13, SPOT-RNA15, MXfold14, CycleFold12 and CROSS26), but these packages demonstrated poor performance on a subset of chemical mapping data (Extended Data Fig. 1a) and, due to their intensive runtimes, were omitted from further comparison.

### Package ranking based on RNA chemical mapping predictions

Our first ensemble-based structure prediction task investigates the capability of these packages to predict chemical mapping reactivities. Chemical mapping is a widely used readout of RNA secondary structure20,21,22 and has served as a high-throughput structural readout for experiments performed in the Eterna massive open online laboratory25. A nucleotide’s reactivity in a chemical mapping experiment depends on the availability of the nucleotide to be chemically modified, and hence provides an ensemble-averaged readout of the nucleotide’s deprotection from base pairing or other binding partners27. We wished to investigate whether current secondary structure packages differed in their ability to recapitulate information about the ensembles of misfolded states that are captured in chemical mapping experiments.

To make this comparison, we used the Eterna ‘Cloud Labs’: 24 datasets of 38,846 player-designed constructs, ranging from 80 to 130 nucleotides in length (dataset statistics in Supplementary Table 2, participant information in Supplementary Table 3). These constructs were designed in iterative cycles on the Eterna platform (Fig. 1a). Participants launched ‘projects’, each of which contained one ‘target structure’, and posed a design challenge or tested a hypothesis about RNA structure (project information in Supplementary Table 4). The constructs designed in these laboratories were periodically collected and mapped in vitro using selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) and the multiplexed accessibility probing read out through sequencing (MAP-seq) chemical mapping protocol28. These data were returned to participants, and the results guided future laboratory development and construct design29.

The community of Eterna participants collectively developed highly diverse sequence libraries across target structures ranging from 0 to 12 loops (a proxy for design complexity30), as assessed by analyzing the positional sequence entropy of collected constructs as grouped by project (Fig. 1b). Example project target structures, colored by the mean reactivity of the probed solutions, are shown in Fig. 1b (inset). Some projects sought to design intricate structures, for example, ‘The Nonesuch by rnjensen45’ and ‘Robot serial killer 1’, while other participant projects focused more on better understanding experimental signals from particular structure motifs, for example, ‘SHAPE Profile U-U mismatch’, which consisted of a single stem and a U-U mismatch.

Figure 1a depicts an example heatmap of SHAPE data for Eterna-player-designed synthetic RNA molecules from the project ‘Aires’ by participant wateronthemoon. Figure 1c depicts calculated ensemble-averaged unpaired probabilities per nucleotide, P(unpaired), for five example package options, plotted in the same heatmap arrangement as the experimental data in Fig. 1a (see Extended Data Fig. 2 for heatmaps from all package options tested). In this subset of constructs, all packages are largely able to identify which regions are completely paired (P(unpaired) equal to 0, white) or unpaired (P(unpaired) equal to 1, black), but some packages predict P(unpaired) values between 0 and 1 that more accurately reflect intermediate reactivity levels. Arrows (blue, green and magenta) indicate intermediate reactivity values that are captured by predictions from CONTRAfold and RNAsoft but not ViennaRNA, NUPACK and RNAstructure. We quantified similarity between reactivity and P(unpaired) by calculating the Pearson correlation coefficient between the experimental reactivity values and P(unpaired) values (Methods). As an example, predictions from CONTRAfold 2 and RNAsoft BLstar for Cloud Lab round 1 (1,088 constructs) demonstrate improved correlation of R = 0.718(2) and 0.724(3) (respectively) over Vienna 2, RNAstructure and NUPACK (0.673(2), 0.671(2) and 0.667(2), respectively) (Supplementary Table 5). Noting that some projects had low sequence diversity, and to make the dataset a more manageable size for benchmarking while maintaining the same degree of sequence diversity, we filtered constructs to remove highly similar sequences (Methods and Extended Data Fig. 3). Clustering the resulting sequences per project (Fig. 1d) demonstrates that low-entropy projects were reduced in size. The final 24 EternaBench-CM datasets comprised 12,711 individual constructs.

We observed that CONTRAfold and RNAsoft generally predict that the constructs studied are more melted than the other packages predict at their default temperatures of 37 °C, even though the actual chemical mapping experiments were carried out at lower temperature (24 °C; Methods). Motivated by this observation, we wished to ascertain whether a simple change in temperature might account for differences in performance between packages. Because ViennaRNA, NUPACK and RNAstructure packages include parameters for both enthalpy and entropy, we calculated correlations across predictions from a range of temperatures (Extended Data Fig. 1b). We found that increasing the temperature from the default value of 37 °C used in these packages to 60 °C improved the correlation to experimental data for ViennaRNA (R = 0.708(2)) and RNAstructure (R = 0.707(2)), but not NUPACK (R = 0.639(2)). We hence included each of these packages also at 60 °C as options to test.

We established a ranking of all package options for each dataset (Fig. 1e, Supplementary Table 5 and representative heatmaps for all datasets in Extended Data Fig. 4) by computing the z-score for each package correlation in comparison to all packages tested, and averaging over all datasets (Fig. 1f). The top three package options were CONTRAfold 2, ViennaRNA at 60 °C and RNAsoft with ‘BLstar’ parameters. Using a Pearson correlation assumes a linear relationship between P(unpaired) and reactivity and relies on a two-state model with inherent limitations (Methods). We therefore also ranked all packages with a Spearman rank correlation coefficient and found a similar global overall ranking (Extended Data Fig. 1c). Overall package performance and the resulting ranking was not strongly dependent on guanine-cytosine (GC) content, sequence length or total number of loops in the project target structure, which was investigated by calculating correlations and rankings when grouping constructs by project (Methods and Extended Data Fig. 1d).
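The z-score ranking described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in this work: the function name `rank_by_mean_zscore` is hypothetical, and the use of the population standard deviation across packages is an assumption. The example input uses only the Cloud Lab round 1 Pearson correlations quoted in the text; with more datasets, each package's z-scores would be averaged across them.

```python
from statistics import mean, pstdev


def rank_by_mean_zscore(correlations):
    """Rank packages by their z-scored correlation, averaged over datasets.

    correlations: {dataset_name: {package_name: correlation}}
    Returns a list of (package_name, mean_z) sorted from best to worst.
    """
    z_by_package = {}
    for by_package in correlations.values():
        values = list(by_package.values())
        mu = mean(values)
        sigma = pstdev(values)  # population s.d. across packages (assumption)
        for package, r in by_package.items():
            z = (r - mu) / sigma if sigma > 0 else 0.0
            z_by_package.setdefault(package, []).append(z)
    ranking = [(package, mean(zs)) for package, zs in z_by_package.items()]
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Pearson correlations quoted in the text for Cloud Lab round 1.
round_1 = {
    "CONTRAfold 2": 0.718,
    "RNAsoft BLstar": 0.724,
    "ViennaRNA 2": 0.673,
    "RNAstructure": 0.671,
    "NUPACK": 0.667,
}
ranking = rank_by_mean_zscore({"Cloud Lab round 1": round_1})
for package, z in ranking:
    print(f"{package:15s} {z:+.2f}")
```

Because z-scores are computed within each dataset before averaging, a dataset on which all packages correlate poorly contributes to the ranking on the same footing as one on which all correlate well.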

### Package ranking based on riboswitch affinity predictions

Our second ensemble-based structure prediction task involved predicting the relative populations of states occupied by riboswitch molecules. Riboswitches are RNA molecules that alter their structure on binding of an input ligand, which effects an output action such as regulating transcription, translation, splicing or the binding of a reporter molecule18,31,32. We compared these packages in their ability to predict the relative binding affinity of synthetic riboswitches to their output reporter, fluorescently tagged MS2 viral coat protein in the absence of input ligand, $$K_{{\mathrm{MS2}}}^{ - {\mathrm{lig}}}$$ (Methods and Extended Data Fig. 5a). As with the chemical mapping datasets, each riboswitch dataset was filtered to exclude highly similar sequences (Extended Data Fig. 3 and Supplementary Table 6). These riboswitches came from two sources: the first consisted of 4,849 riboswitches (after filtering) designed by citizen scientists on Eterna33. The second consisted of 2,509 riboswitches (after filtering) designed fully computationally using the RiboLogic package34, probed concomitantly with Eterna riboswitches. These riboswitches were designed using aptamers for three small molecules: flavin mononucleotide (FMN), theophylline and tryptophan.

Figure 2a depicts experimental values for $$\log K_{{\mathrm{MS2}}}^{ - {\mathrm{lig}}}$$ for FMN riboswitches from the RiboLogic dataset versus predicted $$\log K_{{\mathrm{MS2}}}^{ - {\mathrm{lig}}}$$ values. Again, CONTRAfold and RNAsoft BLstar packages exhibit higher correlations to the experimental data (Pearson R = 0.50(2) and 0.51(2), respectively) than ViennaRNA, NUPACK and RNAstructure (R = 0.37(2), 0.34(2), 0.36(2), respectively). Example predictions for all package options tested are in Extended Data Fig. 6. We evaluated performance across 12 independent experimental datasets (Fig. 2b, Supplementary Table 7 and representative predictions in Extended Data Fig. 7), and obtained a ranking (Fig. 2c) similar to the ranking obtained from chemical mapping data. CONTRAfold 2, RNAsoft (model ‘BL, no dangles’, equivalent to BLstar but without dangles) and RNAstructure 60 °C were ranked as the top three out of the package options tested. The top ranking of CONTRAfold 2 matches the entirely independent ranking based on chemical mapping measurements of distinct RNA sequences described in the previous section. Calculating z-scores over each individual aptamer subset (FMN, theophylline and tryptophan) resulted in slightly differing rankings but consistently favored CONTRAfold methods (Extended Data Fig. 5b). Predicting MS2 binding affinity in the presence of the riboswitch input ligand, $$K_{{\mathrm{MS2}}}^{ + {\mathrm{lig}}}$$, as well as the activation ratio requires computing constrained-partition functions, a capability limited to Vienna RNAfold, RNAstructure and CONTRAfold. Rankings for predicting $$K_{{\mathrm{MS2}}}^{ + {\mathrm{lig}}}$$ and activation ratio followed the same trends (Extended Data Fig. 5d,e and Methods).

### EternaFold gives best-of-class performance in multiple tasks

We hypothesized that performance in both secondary structure prediction tasks above might be improved by incorporating these tasks in the process of training a secondary structure package. The RNAsoft10,35 and CONTRAfold11 packages, which performed well in both tasks (Table 1), both take advantage of the property that the gradient of the partition function with respect to any feature is related to the expected counts of that feature10, which can be readily computed in a dynamic programming scheme. We generalized this framework beyond maximizing the likelihood of one single structure to matching the experimentally determined probability of a particular structural motif in the ensemble through minimizing the root-mean-squared error to the logarithm of riboswitch affinities for MS2 protein (Methods). We used the CONTRAfold code as a framework to explore multitask learning on RNA structural data, since it has previously been extended to train on chemical mapping data to maximize the expected likelihood of chemical mapping data36.
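The gradient property mentioned above can be checked numerically on a toy ensemble. The sketch below assumes a log-linear scoring model in which a structure's score is the sum of feature weights times feature counts; the feature names, weights and three-structure ensemble are hypothetical, and structures are enumerated explicitly rather than by dynamic programming. Under that assumption, the derivative of log Z with respect to a feature weight equals the ensemble-averaged count of that feature.

```python
import math


def structure_score(weights, counts):
    """Log-linear score: sum over features of weight * count."""
    return sum(weights[f] * c for f, c in counts.items())


def log_partition(weights, structures):
    """log Z over an explicitly enumerated set of structures."""
    return math.log(sum(math.exp(structure_score(weights, s)) for s in structures))


def expected_counts(weights, structures):
    """Ensemble-averaged count of each feature."""
    boltzmann = [math.exp(structure_score(weights, s)) for s in structures]
    z = sum(boltzmann)
    expected = {f: 0.0 for f in weights}
    for w, counts in zip(boltzmann, structures):
        for f, c in counts.items():
            expected[f] += c * w / z
    return expected


def numerical_gradient(weights, structures, feature, eps=1e-6):
    """Central finite difference of log Z with respect to one weight."""
    up, down = dict(weights), dict(weights)
    up[feature] += eps
    down[feature] -= eps
    return (log_partition(up, structures) - log_partition(down, structures)) / (2 * eps)


# Hypothetical features and ensemble, for illustration only.
weights = {"stack": 1.2, "hairpin_loop": -0.8}
structures = [
    {},                                # unfolded: no features
    {"stack": 2, "hairpin_loop": 1},
    {"stack": 3, "hairpin_loop": 1},
]
analytic = expected_counts(weights, structures)
numeric = {f: numerical_gradient(weights, structures, f) for f in weights}
print(analytic, numeric)
```

In an actual folding package the expected counts come from the inside-outside recursions instead of enumeration, which is what makes gradient-based training tractable for full-length sequences.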

We tested training from three data types: secondary structures, chemical mapping reactivity and riboswitch affinities. We used the STRAND S-Processed dataset for secondary structures (n = 3,439), which was the same data used to train RNAsoft and CONTRAfold10. The chemical mapping training data (n = 2,603) came from Cloud Lab datasets used in previous model development36. We used riboswitches designed by the automated RiboLogic34 algorithm for riboswitch training data (n = 1,295). We trained models with a variety of combinations of data types to explore interactions in multitask training (Fig. 3a), used holdout sets to determine hyperparameter weights (Methods) and evaluated performance on separate test sets for single-structure prediction accuracy37, chemical mapping prediction accuracy and riboswitch affinity prediction. To ensure a rigorous separation of training and test data, each test dataset was filtered for sequence similarity to all training data at 80% using a windowed Levenshtein metric (Methods). Marked sequence similarity overlap between the S-Processed train and test sets motivated us to develop an orthogonal dataset for secondary structure prediction testing based on the dataset ArchiveII38. Test sets for chemical mapping and riboswitch data came from completely different experimental rounds than those used in training to avoid learning experiment-specific biases.
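As a rough illustration of similarity-based test-set filtering, the sketch below removes any test sequence whose best-matching window against any training sequence reaches the 80% cutoff. The windowing scheme shown here (sliding the shorter sequence along the longer one), the normalization by window length and the function names are all assumptions made for this example; the procedure actually used is the one specified in the Methods.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    """1 - normalized edit distance, in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def windowed_similarity(seq_a, seq_b):
    """Best similarity of the shorter sequence against same-length windows
    of the longer one (hypothetical windowing scheme)."""
    short, long_ = sorted((seq_a, seq_b), key=len)
    n = len(short)
    if n == 0:
        return 0.0
    return max(similarity(short, long_[k:k + n])
               for k in range(len(long_) - n + 1))


def filter_test_set(test_seqs, train_seqs, cutoff=0.8):
    """Keep test sequences that stay below the cutoff against all training data."""
    return [t for t in test_seqs
            if all(windowed_similarity(t, r) < cutoff for r in train_seqs)]


train = ["GGGAAACCC"]                      # toy sequences, illustration only
test = ["AGGGAAACCCA", "AUAUAUAUAU"]
kept = filter_test_set(test, train)
print(kept)
```

The first toy test sequence embeds the training sequence exactly and is discarded; the second shares no G or C residues with it and is retained.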

Comparing performance across models trained with different types of input data indicates some tradeoffs in performance. ‘Model S’, trained only on single-structure prediction training data, exhibited the highest accuracy on the separate single-structure prediction test set (Fig. 3b, F-score = 0.56(0.22)). Incorporating other data types in model training resulted in F-scores worse than Model S on the ArchiveII-NR single-structure prediction test set but within error of CONTRAfold 2 (Fig. 3b). Model ‘SCRR’, trained on four data types (single-structure data, chemical mapping, riboswitch $$K_{{\mathrm{MS2}}}^{ - {\mathrm{lig}}}$$ and $$K_{{\mathrm{MS2}}}^{ + {\mathrm{lig}}}$$) exhibited the highest performance on separate test sets for chemical mapping (Fig. 3c) and riboswitch $$K_{{\mathrm{MS2}}}^{ - {\mathrm{lig}}}$$ prediction (Fig. 3d, data for all test sets in Supplementary Table 8). We termed this SCRR model ‘EternaFold’.

### Independent tests confirm EternaFold performance

We wished to test whether EternaFold’s improvements in correlating P(unpaired) values to chemical mapping and protein-binding data generalized to improvement in predictions for datasets from other groups, experimental protocols and RNA molecules. We compiled 31 datasets of chemical mapping data for molecules including viral genomes39,40,41,42,43,44,45,46,47,48,49 in cells and in virions, ribosomal RNAs44,50,51 both in cells and extracted from cells, synthetic mRNAs and RNA fragments designed to improve protein expression and in vitro stability19,52, and mRNAs probed in various subcellular compartments and extracted from human embryonic kidney 293 (HEK293) cells53 (Fig. 4a and Supplementary Table 9). These datasets spanned structure probing methods different from those used in the Eterna Cloud Labs (SHAPE-CE, SHAPE-MaP, DMS-MaP-seq versus MAP-seq) as well as a variety of chemical modifications (DMS, icSHAPE, NAI). Most of these test molecules were much longer (thousands of nucleotides) than the 85-nucleotide RNAs used as the primary training data for EternaFold. Notably, six of these involved the SARS-CoV-2 genome46,47,48, which came into prevalence after the development of the EternaFold model, and represented a test of new data. The following results are for P(unpaired) values calculated for overlapping windows of size 900, but other window sizes and Levenshtein distance metrics gave qualitatively similar results (Extended Data Fig. 8). We wished to ascertain that the sequences in these datasets did not overlap with sequences that EternaFold had been trained on, so we also filtered these data using a windowed Levenshtein distance metric at a cutoff of 60% sequence similarity. This removed 37% of the originally collected sequences for a dataset size of 8,734 sequences (Supplementary Table 10).

For 15 out of 31 datasets across all categories, EternaFold exhibited the highest correlation coefficient (with P < 0.05, determined by 95% overlapping confidence intervals; Methods), and had the highest average z-score (Fig. 4b and Table 2). For the other 16 datasets, EternaFold was tied with other packages for having the highest correlation. EternaFold showed significant improvement (P < 0.05) in datasets from varying sources including RNAs probed in cell (5 of 7 in cell datasets), extracted from cells (6 of 8), in virion (1 of 3), extracted from viral particles (1 of 2) and with other modifiers, including DMS (2 of 5) and icSHAPE (8 of 11). EternaFold was the top-scoring package (P < 0.05) in five of the six datasets of new SARS-CoV-2 data.

We were curious as to whether the differences in packages arose from consistent accuracy differences across all regions of these RNAs or from a net balance of increased and decreased accuracies at specific subregions of the RNAs, which might reflect particular motifs that are handled better or worse by the different packages. We calculated correlations along the length of example constructs—the Zika ILM genome probed in virion45 (Fig. 4c), HEK293 mRNA for gene RPS27A, extracted from chromatin and probed ex vivo53 (Fig. 4d)—and observed that EternaFold correlations generally demonstrated a fixed improvement across compared packages across all regions, supporting a consistent accuracy improvement by this package.

We also tested the ability of EternaFold to predict the thermodynamics of binding of human Pumilio proteins 1 and 2 in a dataset of 1,405 constructs54. EternaFold showed no significant increase or decrease in predictive ability (P > 0.05) when compared to CONTRAfold or ViennaRNA 2 at 37 °C (Extended Data Fig. 9a and Supplementary Table 11).

## Discussion

In this work, we have established EternaBench, benchmark datasets and analysis methods for evaluating package accuracy for two modeling tasks important in RNA structural characterization and design. These include (1) predicting unpaired probabilities, as measured through chemical mapping experiments, and (2) predicting relative stabilities of different conformational states, as exhibited in riboswitch systems. In contrast to what is seen in single secondary structure prediction tasks, both widely used and state-of-the-art machine-learning algorithms show a wide range in performance on these tasks. We averaged both rankings to acquire a final ranking of the tested external packages in Table 1.

We discovered that CONTRAfold 2, which inferred thermodynamic parameters by feature representation in datasets of natural RNA secondary structures, performed best in this ranking and performed significantly better than Vienna RNAfold, NUPACK and RNAstructure, packages with parameters derived from thermodynamic experiments9. The results were particularly notable since the probed RNA molecules were designed for two distinct tasks (chemical mapping and riboswitch binding affinities), with no relationship between these two sets of sequences and no relationship between the synthetic sequences and natural sequences. We further investigated whether combining these tasks in a multitask-learning framework could improve performance. We found that models trained on four types of data—single structures, chemical mapping data and riboswitch affinities for an output protein with and without an input ligand—showed improved performance in predictions for held-out subsets of EternaBench datasets as well as improvements in datasets involving virus RNA genomes and mRNAs collected by independent groups.

The improved performance of CONTRAfold and RNAsoft—two packages developed by maximum likelihood training approaches—was not obvious prospectively. Statistically learned packages could incorporate bias toward common motifs in the RNA structures that they were trained on and might overstabilize motifs simply due to their increased frequency rather than actual thermodynamic stability. Indeed, methods developed with a variety of more recent methodological advances, including machine learning from chemical mapping datasets (CROSS), deep learning methods for secondary structure prediction (SPOT-RNA), extended parameter sets (CONTRAfold-noncomplementary, CycleFold, MXfold) or accelerated folding packages (LearnToFold), demonstrated diminished performance in the EternaBench tasks (Extended Data Fig. 1a). It was surprising that well-developed and more widely used packages such as ViennaRNA and RNAstructure gave worse performances than CONTRAfold and RNAsoft across all tasks, but that predictions from ViennaRNA and RNAstructure at 60 °C showed notable improvement over the default of 37 °C. This observation might be rationalized by discrepancies in ionic conditions used to measure these packages’ thermodynamic parameters, and the in vitro and in vivo conditions tested here.

We used the EternaBench datasets to train a thermodynamic model via multitask learning on secondary structure prediction, chemical mapping signal likelihood maximization and minimizing error for riboswitch protein-binding prediction. The resulting model, termed EternaFold, performed best across 31 external datasets in four categories of natural and synthetic RNAs (Table 2) in a variety of cellular contexts, including RNAs probed in and extracted from cells and viral particles. It was not obvious that a model trained on datasets collected in vitro would demonstrate improvement on the variety of contexts for which we collected datasets. Although many factors influence RNA structure in cells beyond thermodynamic base pairing55, this demonstrates that existing natural RNA datasets are indeed capable of discriminating between ensemble-averaged base-pairing predictions and that accurate prediction of chemical mapping signal presents an ensemble-aware target for RNA secondary structure algorithm improvement.

The improvements from multitask training in EternaFold indicated that the nearest-neighbor model encoded in CONTRAfold had sufficient representational capacity to gain improvement on the chemical mapping and riboswitch prediction tasks. A notable area of algorithm development and potential improvement is the systematic evaluation of structure prediction methods that incorporate structure mapping data8,55,56. We implemented data-driven folding in EternaFold and tested on a collection of 13 structured RNAs as well as three other independent datasets. We found that EternaFold-SHAPE resulted in the highest mean Matthews correlation coefficient (MCC) over all these datasets (0.842), but this improvement was not statistically significant over several other algorithms in use for SHAPE-directed folding, such as SHAPEknots57 and the heuristic developed by Zarringhalam et al.58 implemented in ViennaRNA (mean MCCs of 0.820 and 0.830, respectively, Extended Data Fig. 9c and Supplementary Table 12), indicating potential for improvement. Another limitation of the resulting EternaFold algorithm is that it does not contain distinct terms for entropy, enthalpy and ionic concentrations. Future work creating temperature and salt-dependent models may benefit from analogous ensemble-aware fitting procedures collected at varying temperatures and ionic concentrations. Further improvements in modeling may arise from applying more sophisticated graph-59 and language-based60 architectures to predicting RNA thermodynamics. Further investigations will also be necessary to identify which aspects of the model need to be expanded to improve performance; these may include noncanonical pairs12, more sophisticated treatment of junctions61, next-nearest-neighbor effects14 and chemically modified nucleotides62. Orthogonal 3D structure methods such as nuclear magnetic resonance spectroscopy63 and cryogenic-electron microscopy64 will likely be instrumental to these pursuits.
Taken together, the datasets presented here serve as an important starting point for evaluating and improving future RNA structure prediction algorithms.

## Methods

The algorithms evaluated in this work model secondary structure in the following manner. Given a model Θ, which is composed of a set of structural features {θ}, the partition function of an RNA sequence x is computed as

$$Z(x|{\Theta}) = \mathop {\sum}\limits_{s \in \{ S\} } {\mathop {\prod}\limits_{k \in s} {\exp } } \left( { - \frac{{{\Delta}G\left( {\theta _k} \right)}}{{k_{\mathrm{B}}T}}} \right),$$
(1)

where ΔG(θk) is the free energy contribution of structural feature k, kB is Boltzmann’s constant and T is temperature. Z represents a sum over the set of all possible structures {S} (ref. 65), and the Boltzmann weight of each structure is the product of the weights of its features, that is, the exponential of its summed feature free energies. From this expression, the probability of any particular structure s is defined as

$$P(s|x,{\Theta}) = Z^{ - 1}\mathop {\prod}\limits_{k \in s} {\exp } \left( { - \frac{{{\Delta}G\left( {\theta _k} \right)}}{{k_{\mathrm{B}}T}}} \right).$$
(2)
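As a minimal sketch of equations (1) and (2), the Python below computes Z and the structure probabilities for an explicitly enumerated toy ensemble, taking each structure's free energy as the sum of its feature free energies. The structure names and energy values are hypothetical and chosen only for illustration; real packages sum over all structures with dynamic programming instead of enumeration.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K); molar form of Boltzmann's constant


def boltzmann_ensemble(structures, temperature_k=310.15):
    """Return (Z, {name: probability}) for an explicit set of structures.

    structures: {name: [free energy of each structural feature, kcal/mol]}
    """
    kt = GAS_CONSTANT * temperature_k
    # A structure's free energy is the sum of its feature free energies.
    weights = {
        name: math.exp(-sum(feature_energies) / kt)
        for name, feature_energies in structures.items()
    }
    z = sum(weights.values())
    return z, {name: w / z for name, w in weights.items()}


# Hypothetical three-structure ensemble; energies are illustrative only.
toy = {
    "unfolded": [],                    # reference state, 0 kcal/mol in total
    "hairpin_a": [-2.1, -1.4, 2.9],    # two stacks plus a loop penalty
    "hairpin_b": [-2.1, -3.3, 4.5],
}
z, probs = boltzmann_ensemble(toy)
for name, p in probs.items():
    print(f"{name:10s} {p:.3f}")
```

Even in this toy case no single structure dominates, which is the situation in which ensemble-averaged observables carry information that a single predicted structure does not.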

### Chemical mapping prediction theoretical basis

Structure prediction algorithms are able to estimate the ensemble-averaged probability that a nucleotide is paired or unpaired. Let $$P(i:j|x,{\Theta})$$ be the probability of bases i and j being paired, given sequence x and model Θ. For simplifying notation, we continue with implicit x and Θ, that is, $$P\left( {i:j|x,{{{\mathrm{{\Theta}}}}}} \right) = P(i:j)$$. This is computed as

$$P(i:j) = \mathop {\sum}\limits_{s_{i:j} \in \{ S\} } {P(s_{i:j})} ,$$
(3)

where si:j denotes a structure containing the base pair i:j, and {S} is the full set of possible structures. These posterior probabilities are analytically calculated by all the algorithms tested here. The probability of any single base being unpaired can be computed as

$$P(i\,{\mathrm{unpaired}}) = 1 - \mathop {\sum}\limits_j {P(i:j)} .$$
(4)
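Equations (3) and (4) can be illustrated with a small enumerated ensemble. In the sketch below, the three structures and their probabilities are hypothetical; positions are 0-based and each base pair is written as a tuple (i, j) with i < j. Packages compute the same quantities analytically for the full ensemble.

```python
def pair_probabilities(ensemble):
    """Eq. (3): P(i:j) is the summed probability of structures containing i:j.

    ensemble: list of (structure probability, set of (i, j) base pairs).
    """
    probs = {}
    for p_structure, pairs in ensemble:
        for pair in pairs:
            probs[pair] = probs.get(pair, 0.0) + p_structure
    return probs


def unpaired_probabilities(pair_probs, length):
    """Eq. (4): P(i unpaired) = 1 - sum over j of P(i:j)."""
    p_unpaired = [1.0] * length
    for (i, j), p in pair_probs.items():
        p_unpaired[i] -= p
        p_unpaired[j] -= p
    return p_unpaired


# Hypothetical 8-nt ensemble: unfolded plus two hairpins sharing pair (1, 6).
ensemble = [
    (0.2, set()),
    (0.5, {(0, 7), (1, 6)}),
    (0.3, {(1, 6), (2, 5)}),
]
pair_probs = pair_probabilities(ensemble)
p_unp = unpaired_probabilities(pair_probs, length=8)
print([round(p, 2) for p in p_unp])
```

Nucleotides 1 and 6 are paired in both hairpins and so have the lowest P(unpaired), while nucleotides 3 and 4 are never paired and have P(unpaired) equal to 1.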

The relationship between the probability of a nucleotide being unpaired and its experimentally measured reactivity has served as a locus for efforts to improve structure prediction of RNA constructs incorporating chemical mapping data from those constructs, and several functional forms have been used to describe the relationship between unpaired probability and chemical mapping reactivity27,66,67. In this work, we use the linear Pearson correlation coefficient between unpaired probability and experimentally measured reactivity as a measure of model quality. In the following, we describe the simple model under which this linear assumption holds. We write the probability that nucleotide i (nti) is modified at time t as

$$P\left( {{\mathrm{nt}}\,_i\,{\mathrm{modified}},{\mathrm{time}}\,t} \right) = 1 - {\mathrm{e}}^{ - k_{{\mathrm{mod}}}\left( i \right)t},$$
(5)

where kmod(i) is the rate of modification for nucleotide i. The measured chemical modification signal is an ensemble population average, where the time exposure of the ensemble to the modifier has been limited to aim to achieve ‘single-hit kinetics’ with single-hit frequency, so that the degree of modification in experiment is proportional to the rate of modification68. In other words, because kmod(i)t ≪ 1, we can approximate

$$P\left( {{\mathrm{nt}}\,_i\,{\mathrm{modified}},{\mathrm{time}}\,t} \right) \approx k_{{\mathrm{mod}}}\left( i \right)t \propto k_{{\mathrm{mod}}}(i).$$
(6)

This expression assumes that each RNA molecule is not heavily modified, such that kmod(i) for each nucleotide is independent of the modification state of other nucleotides. If we assume that the timescale of chemical modification is much slower than the timescale of fluctuation between structural ensemble states, then we may write the overall modification rate for each nucleotide i as averaged over the equilibrated structure ensemble of the RNA,

$$k_{{\mathrm{mod}}}\left( i \right) = \mathop {\sum}\limits_{s \in \left\{ S \right\}} P \left( s \right)k_{{\mathrm{mod}}}(i|s)$$
(7)

If we consider a simplest two-state model for each nucleotide, with modification rate kpr if paired and a rate kunp if unpaired, then this reduces to

$$\begin{array}{rcl}k_{{\mathrm{mod}}}\left( i \right) & = & k_{{\mathrm{unp}}}P\left( {i\,{\mathrm{unpaired}}} \right) + k_{{\mathrm{pr}}}P\left( {i\,{\mathrm{paired}}} \right) \\ & = & k_{{\mathrm{pr}}} + (k_{{\mathrm{unp}}} - k_{{\mathrm{pr}}})P\left( {i\,{\mathrm{unpaired}}} \right),\end{array}$$
(8)

which demonstrates that under this simple model, the modification rate is linear with respect to P(unpaired). The model above is limited in its assumption of two states and does not account for reactivity effects caused by sequence and local environment. For instance, Hoogsteen conformations in G–A and G–G mismatches expose the Watson–Crick faces of purine nucleobases, resulting in higher DMS reactivity69. A Spearman rank correlation (Extended Data Fig. 1c), which will be more dominated by relative rankings, results in a similar overall ranking.
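The equivalence between the ensemble average in equation (7) and the linear form in equation (8) can be verified on a toy ensemble. In the sketch below, the three dot-bracket structures, their probabilities and the rate constants are hypothetical values chosen only for illustration.

```python
def k_mod_from_ensemble(ensemble, i, k_unp, k_pr):
    """Equation (7) under the two-state model: average the per-structure
    rate over an ensemble given as (probability, dot-bracket) tuples."""
    return sum(p * (k_unp if s[i] == "." else k_pr) for p, s in ensemble)


def k_mod_linear(p_unpaired, k_unp, k_pr):
    """Equation (8): rate as a linear function of P(i unpaired)."""
    return k_pr + (k_unp - k_pr) * p_unpaired


# Hypothetical ensemble of three structures whose probabilities sum to 1.
ensemble = [(0.6, "((..))"), (0.3, ".(..)."), (0.1, "......")]
k_unp, k_pr = 1.0, 0.05  # hypothetical rate constants

averaged = []
linear = []
for i in range(6):
    p_unp = sum(p for p, s in ensemble if s[i] == ".")
    averaged.append(k_mod_from_ensemble(ensemble, i, k_unp, k_pr))
    linear.append(k_mod_linear(p_unp, k_unp, k_pr))
```

The two lists agree position by position up to floating-point rounding, because P(i paired) = 1 − P(i unpaired) in the two-state model.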

### Chemical mapping data

Chemical mapping data for the Eterna Cloud Lab experiments were downloaded from the RNA Mapping DataBase (RMDB)28 and processed with RDATKit (https://ribokit.github.io/RDATKit/). The RNA was probed with the MAP-seq protocol with a coloaded standard molecule (P4-P6-2HP RNA) to enable normalization, as described in ref. 70. Measurements were carried out at ambient temperature (24 °C) with 10 mM MgCl2 and 50 mM Na-HEPES, pH 8.0. Data were processed using MAPseeker71 with standard settings.

Within each chemical mapping dataset, CD-HIT-EST72 was used to filter sequences with greater than 80% redundancy (excluding a shared 3′ primer binding site). From each sequence cluster identified, the sequence with the highest signal-to-noise ratio from chemical mapping experiments was selected as the representative sequence. These datasets ranged in size from 605 (round 15) to 3,378 constructs (round 23), with a median size of 1,577; after filtering, they ranged from 101 (round 12) to 1,088 (round 1), with a median size of 562 (Extended Data Fig. 3 and Supplementary Table 2). The filtered 24 datasets comprised 12,711 individual constructs, and distributions of GC content, average sequence length and number of loops in the target structures were not significantly affected (Extended Data Fig. 3).

Nucleotides with reactivities less than or equal to zero or greater than the 95th percentile of the dataset were removed from analysis. Cloud Lab round 2 was filtered to exclude experiments that had FMN present, which pertained to Eterna Cloud Lab challenges to design riboswitches. Adenosine nucleotides preceded by six or more As were also removed due to evidence of anomalous reverse transcription effects in such stretches73. External chemical mapping datasets were obtained from the supplementary information from the papers and processed similarly (outliers, nucleotides in poly-A stretches removed).
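The nucleotide-level filters described above can be sketched as follows. This is an illustrative reading of the text, not the EternaBench implementation: the function names are hypothetical, the 95th percentile is assumed to be computed over all reactivities pooled across the dataset using linear interpolation, and "preceded by six or more As" is interpreted as the six immediately preceding nucleotides all being A.

```python
import statistics


def percentile_95(values):
    """95th percentile with linear interpolation (assumed convention)."""
    return statistics.quantiles(values, n=100, method="inclusive")[94]


def keep_mask(sequence, reactivities, cutoff):
    """Return a list of booleans marking nucleotides retained for analysis."""
    keep = []
    for i, (nt, r) in enumerate(zip(sequence, reactivities)):
        # Drop non-positive reactivities and values above the percentile cutoff.
        in_range = 0.0 < r <= cutoff
        # Drop an A whose six immediately preceding nucleotides are all A.
        in_poly_a = nt == "A" and i >= 6 and sequence[i - 6:i] == "AAAAAA"
        keep.append(in_range and not in_poly_a)
    return keep
```

A typical use would compute `cutoff = percentile_95(all_reactivities)` once per dataset and then apply `keep_mask` to each construct.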

### Analyzing package performance by the Cloud Lab project

We wished to understand whether factors such as target structure complexity, GC content and sequence length influenced package predictions. We performed the same package ranking analysis, grouping constructs by their projects instead of by the 24 datasets. Because grouping constructs into projects sometimes resulted in a small number of nucleotides over which to calculate correlations, we omitted package predictions where the standard error of the calculated Pearson correlation was greater than 0.05. This resulted in a total of 612 project groupings remaining, names and calculated metrics for which are contained in Supplementary Table 4.

We found weak correlations between the per-project z-score of the top-performing package, CONTRAfold 2, and GC content (Spearman R = 0.15), sequence length (R = 0.07) and total loops in the target structure (R = 0.16). There were also weak correlations between the average Pearson correlation for all packages and GC content (Spearman R = 0.10), sequence length (R = −0.24) and total target structure loops (R = −0.01) (Extended Data Fig. 1d).

### Riboswitch activity prediction theoretical basis

A thermodynamic framework discussed in greater detail in ref. 17 allows us to relate the observed binding affinity of an output molecule to the relative populations of a riboswitch molecule in different states. In the absence of input ligand, we may relate the probability that a riboswitch adopts a structural feature that can bind its output, P(out), to an experimentally measured binding affinity, $${{{{K}}}}_{{{{\mathrm{obs}}}}}^{ - {\mathrm{lig}}}$$, via the relative ratios of both values to those of a reference state:

$$\frac{{K_{{\mathrm{obs}}}^{ - {\mathrm{lig}}}}}{{K_{{\mathrm{obs}}}^{{\mathrm{ref}}}}} = \frac{{P^{{\mathrm{ref}}}\left( {{\mathrm{out}}} \right)}}{{P\left( {{\mathrm{out}}} \right)}} \equiv K_{{\mathrm{MS2}}}^{ - {\mathrm{lig}}}.$$
(9)

We selected the MS2 hairpin aptamer as a reference state whose probability of forming, Pref(out), can be estimated by the secondary structure algorithm. For each separate independent experimental dataset, $$K_{\mathrm{{obs}}}^{\mathrm{{ref}}}$$ is estimated as the strongest affinity measured (Extended Data Fig. 10a). We refer to the estimated ratio $$\frac{{P^{{\mathrm{ref}}}\left( {{\mathrm{out}}} \right)}}{{P\left( {{\mathrm{out}}} \right)}}$$ as $$K_{\mathrm{{MS2}}}^{ - {\mathrm{lig}}}$$ in the main text, as the equilibrium constant of forming the MS2 hairpin as normalized to the reference state.

Although there may be error introduced in which experimental point is selected to be $$K_{\mathrm{{obs}}}^{{\mathrm{ref}}}$$, relative error should be constant when comparing packages on the same dataset. To compare packages, we therefore report the correlation between $$\log (K_{{\mathrm{obs}}}^{ \pm {\mathrm{lig}}}/K_{\mathrm{{obs}}}^{{\mathrm{ref}}})$$ and $$\log (K_{{\mathrm{MS2}}}^{ \pm {\mathrm{lig}}})$$, which excludes the effect of selection for $${{{{K}}}}_{{{{\mathrm{obs}}}}}^{{\mathrm{ref}}}$$.

In general, the probability of an RNA molecule forming any structure motif is computed as

$$P({\mathrm{motif}}|x,\theta ) = \mathop {\sum}\limits_{s_{{\mathrm{motif}}} \in \left\{ S \right\}} {P(s_{{\mathrm{motif}}})} ,$$
(10)

where smotif denotes a structure containing that motif. Computing this probability requires a dynamic programming routine that is able to constrain the sampled structure space to only structures containing that motif to estimate a so-called ‘constrained-partition function’. However, not all secondary structure algorithms have implemented constrained-partition function estimation. Because the MS2 aptamer is a hairpin, we can approximate its probability of forming as the probability of forming the final base pair of the MS2 hairpin aptamer, an experimental observable that can be estimated by all the packages tested here. Thus, our prediction of interest is

$$\frac{{K_{{\mathrm{pred}}}^{ - {\mathrm{lig}}}}}{{K_{{\mathrm{pred}}}^{{\mathrm{ref}}}}} = \frac{{{{{{P}}}}^{{{{\mathrm{ref}}}}}\left( {i:j} \right)}}{{P\left( {i:j} \right)}},$$
(11)

where i and j are the nucleotides forming the terminal base pair in the MS2 aptamer stem. The value Pref(i:j) is accordingly computed as the probability of closing the base pair in the reference sequence. We confirmed that calculations using equations (9) and (11) agree for Vienna, RNAstructure and CONTRAfold packages.
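Equation (11) amounts to a ratio of two base-pair probabilities. A minimal sketch is given below, assuming the closing-pair probabilities have already been extracted from each package's base-pairing probability matrix; the function name and the example probabilities in the usage note are hypothetical.

```python
def k_pred_ratio(p_ref_pair, p_pair):
    """Equation (11): P_ref(i:j) / P(i:j) for the terminal MS2 base pair.

    p_ref_pair: closing-pair probability in the reference sequence.
    p_pair: closing-pair probability in the riboswitch construct.
    """
    if p_pair <= 0.0:
        raise ValueError("P(i:j) must be positive to form the ratio")
    return p_ref_pair / p_pair
```

For instance, a construct that closes the MS2 pair with probability 0.09 against a reference probability of 0.9 gives `k_pred_ratio(0.9, 0.09)`, a roughly tenfold weaker predicted affinity than the reference.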

### Predicting protein-binding affinities with input ligand bound

The estimation of $${{{{K}}}}_{{{{\mathrm{fold}}}}}^{ + {\mathrm{lig}}}$$ follows similarly to the above but accounts for increased thermodynamic weights for states that correctly display the aptamer of the input small molecule ligand. Therefore, it cannot be estimated via the simplified single base-pair calculation and must make use of constrained-partition functions (equation (10)).

Analogously to equation (9), we define $$K_{{\mathrm{MS2}}}^{ + {\mathrm{lig}}}$$ as

$$K_{{\mathrm{MS2}}}^{ + {\mathrm{lig}}} = \frac{{K_{{\mathrm{obs}}}^{ + {\mathrm{lig}}}}}{{K_{{\mathrm{obs}}}^{{\mathrm{ref}}}}}$$
(12)

which is calculated as

$$K_{{\mathrm{MS2}}}^{ + {\mathrm{lig}}} = \frac{{Z + bZ_{{\mathrm{lig}}}}}{{Z_{{\mathrm{MS2}}} + bZ_{{\mathrm{lig}},{\mathrm{MS2}}}}}$$
(13)

where Zlig is the constrained-partition function of the state including the ligand aptamer (calculated in each algorithm as described in the next section), ZMS2 is the partition function for the state including the MS2 aptamer and Zlig,MS2 is the partition function of the state including both ligand aptamer and MS2 aptamer. The constant $$b = \frac{{\left[ {{\mathrm{ligand}}} \right]}}{{K_{d,{\mathrm{ligand}}}}}$$ is the Boltzmann weight of binding the ligand when the bulk concentration of the ligand is [ligand]. Values used for calculating b are in Supplementary Table 14. Representative predictions of $$K_{{\mathrm{MS2}}}^{ + {\mathrm{lig}}}$$ versus experimental $$K_{{\mathrm{MS2}}}^{ + {\mathrm{lig}}}$$ values are in Extended Data Fig. 10b.
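Equation (13) can be written as a short function of the four partition functions and the ligand terms. The sketch below follows the equation as printed; the argument names are illustrative, and the partition functions are assumed to have been computed beforehand by a package that supports constrained-partition function calculations.

```python
def k_ms2_plus_lig(z, z_lig, z_ms2, z_lig_ms2, ligand_conc, kd_ligand):
    """Equation (13): (Z + b * Z_lig) / (Z_MS2 + b * Z_lig,MS2),
    with b = [ligand] / K_d,ligand (same concentration units for both)."""
    b = ligand_conc / kd_ligand
    return (z + b * z_lig) / (z_ms2 + b * z_lig_ms2)
```

Setting `ligand_conc` to zero removes the ligand-bound terms, so the expression reduces to Z / Z_MS2.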

### Riboswitch data

Riboswitch data were downloaded from supplementary materials from refs. 33,34. In brief, measurements were carried out at 37 °C in 100 mM Tris-HCl, pH 7.5, 80 mM KCl, 4 mM MgCl2, 0.1 mg ml−1 BSA, 1 mM DTT, 0.01 mg ml−1 yeast tRNA, 0.01% Tween-20 and varying concentrations of small molecule ligand (FMN, theophylline, tryptophan) and MS2 coat protein. Datasets were filtered to include only constructs with more than 50 copies of the sequence represented in the RNA-MaP experiment and constructs that included the canonical MS2 and small molecule aptamers, and were then filtered using CD-HIT-EST72 to remove sequence redundancy over 80%. As per the CD-HIT-EST algorithm default, the longest sequence per cluster was maintained. If all sequences were the same length, the first sequence was used. After filtering, the riboswitch datasets comprised 7,228 constructs in total. Scripts to replicate data processing from refs. 33,34 are included in the EternaBench software repository.

For all constructs as well as the reference MS2 hairpin construct, we performed $$K_{{\mathrm{MS2}}}^{{\mathrm{lig}}}$$ estimations including a flanking hairpin included in the Illumina array experiments, described in ref. 33. As an example, the full reference MS2 hairpin construct, as well as the constraint used for estimating $$K_{{\mathrm{pred}}}^{{\mathrm{ref}}}$$ with constrained-partition-function-based estimation, is reproduced below. The MS2 hairpin construct is underlined and the nucleotides in the base pair used for base-pair-based prediction are bold.

Sequence: GGGUAUGUCGCAGAAACAUGAGGAUCACCCAUGUAACUGCGACAUACCC

Structure: ...............(((((x((xxxx)))))))...............

The riboswitches in EternaBench-Switch are controlled by the small molecules FMN, tryptophan or theophylline. Motifs, concentrations and intrinsic Kd values used for $$K_{{\mathrm{MS2}}}^{ + {\mathrm{lig}}}$$ prediction, taken from refs. 33,34, are provided in Supplementary Table 14.

### EternaFold multitask learning

The CONTRAfold11 loss function optimizes the conditional log-likelihood of ground-truth structure s(i) given sequence x(i) over dataset D:

$$L_{{\mathrm{CONTRAfold}}} = L_{{\mathrm{Struct}}}(\theta ) = \mathop {\sum}\limits_{i \in D} {{\mathrm{log}}} P(s^{(i)}|x^{(i)},\{ \theta \} ).$$
(14)

In CONTRAfold-SE36, the authors include a term to also use chemical mapping data to optimize structure prediction by maximizing the likelihood of observing the included chemical mapping dataset. The loss function then becomes

$$\begin{array}{rcl}L_{{\mathrm{CONTRAfold}} - {\mathrm{SE}}} & = & L_{{\mathrm{Struct}}} + w_{{\mathrm{CM}}}L_{{\mathrm{CM}}}, \\ L_{{\mathrm{CM}}}(\theta ,\phi ) & = & \mathop {\sum}\limits_{i \in D} {\log } \mathop {\sum}\limits_s P (s,{{{{\mathbf{d}}}}}|x,\{ \theta \} ,\phi ),\end{array}$$
(15)

where d are the chemical mapping datapoints from construct x. CONTRAfold-SE fits reactivity signals to gamma distributions for each nucleotide type (A, C, G, U) and whether the base is paired or unpaired, parameters for which are represented by ϕ.

We further included a term to minimize the mean squared error of predicted $$\log K_{{\mathrm{fold}}}^{ - {\mathrm{lig}}}$$ and $$\log K_{{\mathrm{fold}}}^{ + {\mathrm{lig}}}$$:

$$\begin{array}{rcl}L_{{\mathrm{MS2}}} &=& w_{ - {\mathrm{lig}}}\left[ {\log K_{{\mathrm{MS2}}}^{{\mathrm{exp}}}( - {\mathrm{lig}}) - \log K_{{\mathrm{MS2}}}^{{\mathrm{pred}}}( - {\mathrm{lig}})} \right]^2 \\ && + w_{ + {\mathrm{lig}}}\left[ {\log K_{{\mathrm{MS2}}}^{{\mathrm{exp}}}( + {\mathrm{lig}}) - \log K_{{\mathrm{MS2}}}^{{\mathrm{pred}}}( + {\mathrm{lig}})} \right]^2.\end{array}$$
(16)

The full loss function for EternaFold is thus written as

$$L_{{\mathrm{EternaFold}}} = L_{{\mathrm{Struct}}} + L_{{\mathrm{CM}}} + L_{{\mathrm{MS2}}}.$$
(17)

The hyperparameters $$w_{{\mathrm{CM}}},w_{ - {\mathrm{lig}}},w_{ + {\mathrm{lig}}}$$, corresponding to the relative weights placed on different data types, were selected through a grid search on the holdout sets STRAND-holdout, EternaBench-CM-holdout and EternaBench-Switch-holdout (data not shown). The final values used for training were $$w_{{\mathrm{CM}}} = 0.5,w_{ - {\mathrm{lig}}} = 30,w_{ + {\mathrm{lig}}} = 30$$.
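The riboswitch term in equation (16) is a weighted sum of two squared errors. A minimal per-construct sketch is shown below; the function name is hypothetical, the inputs are assumed to be log-transformed already (the base of the logarithm is not specified here), and the default weights are the final values reported above.

```python
def ms2_loss(log_k_exp_minus, log_k_pred_minus,
             log_k_exp_plus, log_k_pred_plus,
             w_minus_lig=30.0, w_plus_lig=30.0):
    """Equation (16): weighted squared errors of log K_MS2 without and
    with ligand for a single construct."""
    err_minus = (log_k_exp_minus - log_k_pred_minus) ** 2
    err_plus = (log_k_exp_plus - log_k_pred_plus) ** 2
    return w_minus_lig * err_minus + w_plus_lig * err_plus
```

With the default weights, an error of 0.5 log units in the ligand-free prediction and a perfect ligand-bound prediction contribute 30 × 0.25 = 7.5 to the loss.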

### Dataset selection for training and testing EternaFold

#### Single-structure data

For training EternaFold, we used the S-Processed dataset37 train and holdout sets used previously in training CONTRAfold 2 and RNAsoft10, to keep the same datasets consistent with these algorithms. However, we found that the S-Processed test set had 68 and 52% redundancy to the S-Processed train and holdout sets, respectively, using CD-HIT-EST-2D. We therefore created a new secondary structure test set by filtering the more recent ArchiveII dataset38 for constructs with <80% sequence similarity to any sequence across all three data types used in EternaFold training. We also evaluated EternaFold performance on structure prediction for the S-Processed test set, and found qualitatively similar results to the ArchiveII-NR test set (Extended Data Fig. 9b, compare to Fig. 3b).

#### Cloud Lab chemical mapping data

We used rounds 3, 4, 5, 7, 10 and 11 as training and holdout data. This was to be consistent with the training data used in CONTRAfold-SE36, and to reserve rounds 0 and 1 as test rounds, given their large size and high signal-to-noise ratio. GC content, sequence length, total loops in the target structure and signal-to-noise ratio were equivalent across train, holdout and test rounds (Extended Data Fig. 3c).

#### Riboswitch data

We partitioned the RiboLogic dataset into our training, holdout and test sets due to the high signal-noise ratio and diversity of structures, subdividing the riboswitches so that each split contained identical fractions of FMN-, theophylline- and tryptophan-responsive riboswitches. This left the rest of the Eterna riboswitch rounds as test sets (Extended Data Fig. 3d).

### Test dataset filtering

To filter test datasets based on sequence similarity to the EternaFold training data, we implemented a ‘windowed Levenshtein distance’. We calculated the Levenshtein distance across sliding windows of the longer sequence, each the length of the shorter sequence. A sequence was counted as redundant at X% cutoff if any window had a Levenshtein edit distance smaller than (100 − X)% of the window size. Supplementary Table 10 contains test dataset sizes before and after filtering at a windowed Levenshtein distance cutoff of 80, 60 and 40%. As a point of comparison, uniformly distributed, randomly generated 50-mers, 100-mers and 200-mers were calculated to have average Levenshtein distances of 42, 44 and 45%, respectively.
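The windowed Levenshtein criterion can be sketched in pure Python as follows. This is an illustrative implementation of the description above rather than the code used in EternaBench; the function names are hypothetical, and the standard dynamic-programming edit distance (unit cost for insertion, deletion and substitution) is assumed.

```python
def levenshtein(a, b):
    """Edit distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def is_redundant(seq_a, seq_b, cutoff_pct):
    """True if any window of the longer sequence (window length equal to
    the shorter sequence) is within (100 - cutoff_pct)% edits of it."""
    short, long_ = sorted((seq_a, seq_b), key=len)
    w = len(short)
    threshold = (100 - cutoff_pct) / 100 * w
    for start in range(len(long_) - w + 1):
        if levenshtein(short, long_[start:start + w]) < threshold:
            return True
    return False
```

The quadratic cost per window makes this sketch slow for long sequences; a production implementation would likely use an optimized edit-distance library.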

### Evaluating base pair probabilities for external datasets

For comparing P(unpaired) calculations to natural RNAs, many of which are thousands of nucleotides long, we compared several practices for calculating P(unpaired): predicting base pair probabilities from overlapping windows, constraining the nucleotides under consideration using a beam search algorithm implemented in LinearPartition74, and conventional folding of the entire RNA. We evaluated windows of length 300, 600, 900 and 1,200 with 25-nt overlap. Results from length 900 are shown in the main text, although results are similar for other window sizes (Extended Data Fig. 8a).
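One way to lay out overlapping windows over a long RNA is sketched below. The text does not specify how window boundaries were placed or how predictions in overlapping regions were merged, so the tiling scheme here (fixed step of window length minus overlap, with a shorter final window) is an assumption for illustration only.

```python
def overlapping_windows(seq_len, window, overlap):
    """Return (start, end) half-open intervals tiling a sequence of length
    seq_len with windows of the given length and overlap."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window length")
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, seq_len)
        windows.append((start, end))
        if end == seq_len:
            break
        start += step
    return windows
```

For a hypothetical 2,000-nt RNA with 900-nt windows and 25-nt overlap, this yields three windows, the last of which is truncated at the 3′ end.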

### SHAPE-directed folding evaluation

We implemented SHAPE-directed folding in EternaFold in the following way: for an RNA sequence x with length L, let dj be the probing signal at nucleotide j in the sequence. The joint probability for structure s and the vector of reactivities d is given as

$$P\left( {s,{{{{\mathbf{d}}}}}|x;\theta ,\phi } \right) = P\left( {s|x;\theta } \right)\mathop {\prod}\limits_{j = 1}^L P \left( {d_j|x_j,s;\phi } \right)^\kappa$$
(18)

where θ represents the learned set of thermodynamic parameters, ϕ represents the parameters learned for eight gamma distributions defining the reactivities of A, C, G and U being paired or unpaired (Extended Data Fig. 9d), and κ is a parameter specifying the relative weight of the evidence. The maximum likelihood structure given an observed reactivity vector d is calculated as

$$s_{{\mathrm{MLE}}} = \mathop {{\mathrm{argmax}}}\limits_{\hat s \in \{ S\} } P(\hat s,{\mathbf{d}}|x;\theta ,\phi ).$$
(19)
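The joint probability in equation (18) is conveniently evaluated in log space, where the product over nucleotides becomes a κ-weighted sum of gamma log-densities. The sketch below assumes a shape and scale parameterization of the gamma distribution and a dictionary keyed by nucleotide and pairing state; both the parameterization and the names are illustrative assumptions, and the structure log-probability is taken as a precomputed input.

```python
import math


def gamma_logpdf(x, shape, scale):
    """Log-density of a gamma distribution (shape/scale form), x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))


def joint_log_prob(log_p_structure, sequence, structure, reactivities,
                   gamma_params, kappa=1.0):
    """Log of equation (18): log P(s|x) + kappa * sum_j log P(d_j | x_j, s).

    gamma_params maps (nucleotide, is_paired) to a (shape, scale) tuple.
    """
    total = log_p_structure
    for nt, symbol, d in zip(sequence, structure, reactivities):
        is_paired = symbol != "."
        shape, scale = gamma_params[(nt, is_paired)]
        total += kappa * gamma_logpdf(d, shape, scale)
    return total
```

Because reactivities must be strictly positive for the gamma log-density to be defined, non-positive values would need to be excluded or clipped before calling this function.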

The maximum expected accuracy structure is calculated using the same SHAPE-weighted partition function and the expression

$$s_{{\mathrm{mea}}} = \mathop {{{{{\mathrm{argmax}}}}}}\limits_{\hat s} {\Bbb E}_s\left[ {{\mathrm{Acc}}\left( {\hat s,s^ \ast } \right)} \right]$$
(20)

where $${\mathrm{Acc}}\left( {\hat s,s^ \ast } \right)$$ is the pseudo-accuracy measure described in detail in ref. 11 and s* is the (unknown) true structure.

When the EternaFold parameters were initially trained, κ was set to 1. To fit κ in the context of SHAPE-directed folding, we used the SHAPEknots training dataset and calculated the MCC. This dataset consists of 16 RNAs with known 3D structure and was used similarly to tune parameters in SHAPEknots57 and for the default settings of three formulas present in the ViennaRNA package. We refer to this model as EternaFold-SHAPE.

We compared EternaFold-SHAPE to SHAPEknots57, RNAstructure with structure probing (but not pseudoknots as in SHAPEknots), three algorithms implemented in ViennaRNA from Washietl66, Deigan50 and Zarringhalam58, as well as RNAstructure, ViennaRNA and EternaFold predictions without reactivity data. We also evaluated the algorithms on the SHAPEknots-TEST dataset, as well as datasets from Chen and Kappel that included DMS probing data for RNAs with secondary structures validated by other methods (Extended Data Fig. 9c, full dataset in Supplementary Table 12). In addition, 13 further RNA constructs were probed by SHAPE and DMS as described in the following section.

We calculated mean MCC across datasets and averaged these values. We found that EternaFold-SHAPE resulted in the highest mean MCC over test constructs of 0.842, but this improvement was not statistically significant (significance threshold P < 0.05) over SHAPEknots (MCC = 0.818), EternaFold without SHAPE data (MCC = 0.814), ViennaRNA with the heuristic developed by Zarringhalam (MCC = 0.828), RNAstructure with SHAPE data (MCC = 0.803) or Vienna RNAfold 2 (MCC = 0.801). Statistical significance was evaluated using a two-sided t-test for related values. Supplementary Table 12 contains predicted SHAPE- or DMS-directed MFE structures for the dataset in all evaluated algorithms.
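For readers unfamiliar with the metric, a common way to compute MCC between a predicted and a reference secondary structure is sketched below. The exact convention used for the values above is not stated in this section, so this sketch assumes pseudoknot-free dot-bracket strings and counts true negatives over all possible nucleotide pairs; the function names are illustrative.

```python
import math


def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs in a pseudoknot-free dot-bracket string."""
    stack, pairs = [], set()
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.add((stack.pop(), i))
    return pairs


def mcc(predicted, reference):
    """Matthews correlation coefficient over base pairs."""
    n = len(reference)
    pred, ref = base_pairs(predicted), base_pairs(reference)
    tp = len(pred & ref)
    fp = len(pred - ref)
    fn = len(ref - pred)
    tn = n * (n - 1) // 2 - tp - fp - fn  # all remaining candidate pairs
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

An identical prediction scores 1, and a prediction with no base pairs is assigned 0 here because the denominator vanishes.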

### SHAPE and DMS probing by capillary electrophoresis of 13 structured RNAs for SHAPE-directed folding evaluation

#### DNA template preparation

DNA templates were designed to include the 20-nt T7 RNA polymerase promoter sequence followed by a sequence encoding the desired RNA flanked by two hairpins used to normalize the resulting signal70. Double-stranded templates were prepared by the extension of 60-nt DNA oligomers (Integrated DNA Technologies) with Phusion polymerase, using the following thermocycler protocol: denaturation for 30 s at 98 °C, 35 cycles of denaturation for 10 s at 98 °C, annealing for 30 s at 60 to 64 °C, extension for 30 s at 72 °C, final extension for 10 min at 72 °C and cooling to 4 °C. DNA samples were purified with AMPure XP beads (Beckman Coulter), following the manufacturer’s instructions. Sample concentrations were estimated based on ultraviolet absorbance at 260 nm measured on a Nanodrop spectrophotometer. Verification of template length was accomplished by electrophoresis of all samples and 10- and 20-bp ladder length standards (Thermo Scientific O’RangeRuler SM1313 and SM1323) in 4% agarose gels (containing 0.5 mg ml−1 ethidium bromide) and 1× TBE (100 mM Tris, 83 mM boric acid, 1 mM disodium EDTA).

#### Preparation of RNA templates

In vitro transcription reactions were carried out in 40 µl volumes with 10 pmol of DNA template, using the TranscriptAid T7 High Yield Transcription Kit (Thermo Fisher). Reactions were incubated for 3 h at 37 °C, followed by degradation of DNA template with 2 µl of DNase I at 37 °C for 30 min. RNA samples were purified using the Zymo RNA Clean and Concentrator-25 kit (Zymo Research). Concentrations were measured by absorbance at 260 nm on Nanodrop spectrophotometers.

#### SHAPE mapping

1.2 pmol of purified RNA was added to 2 µl of 500 mM Na-HEPES buffer (pH 8.0) and denatured at 90 °C for 3 min. The reaction was then cooled down to room temperature over 10 min. Then 2 µl of 100 mM MgCl2 was added, followed by incubation at 50 °C for 30 min. The sample was cooled down to room temperature over 20 min before addition of 5 µl of nuclease-free water (negative control) or 1-methyl-7-nitroisatoic anhydride (8.48 mg ml−1 of dimethylsulfoxide) followed by incubation at room temperature for 15 min and brought to a final volume of 20 µl with nuclease-free water. The SHAPE-RNA sample was further purified by incubating the sample with 5.0 µl of Na-MES, pH 6.0, 3.0 µl of 5 M NaCl, 1.5 µl of Oligo dT beads, 0.25 µl of 10 µM FAM-A20-Tail2 and brought to a final volume of 10 µl with nuclease-free water. The reaction mixture was incubated at room temperature for 15 min, pulled down by 96-post magnetic stand for 10 min, washed twice with 70% ethanol and allowed to dry, before adding 2.5 µl of nuclease-free water.

#### DMS mapping

5 µl of RNA stock in H2O containing 12.5 pmol of RNA was mixed with 5 µl of 1× TE (Ambion) and denatured by incubating at 95 °C for 2 min, and then cooling on ice for 1 min. Then 12.5 µl of 2× buffer (600 mM Na-cacodylate, pH 7.0 and 20 mM MgCl2) was added, and the RNA was incubated at 37 °C for 30 min to fold. RNAs were modified by adding 2.5 µl of DMS (1.7 M in 100% ethanol); for no-modification controls, 2.5 µl of 100% ethanol was added instead. Reactions were incubated at 37 °C for 6 min, and then quenched with 25 µl of 2-mercaptoethanol.

#### Preparing samples for capillary electrophoresis

Complementary DNA (cDNA) was prepared from in-line probing and SHAPE-RNA samples as follows (note that above procedures leave RNA bound to FAM-A20-Tail2 reverse-transcription primers that are in turn bound to Oligo dT beads). Next, 2.5 µl of purified RNA was added to a reaction mixture containing 1× First Strand buffer (Thermo Fisher), 5 mM dithiothreitol (DTT), 0.8 mM dNTPs, 0.2 µl of SS-III RTase (Thermo Fisher) to a final volume of 5.0 µl. The reaction was incubated at 48 °C for 40 min, and stopped with 5 µl of 0.4 M sodium hydroxide. The reaction was then incubated at 90 °C for 3 min, cooled on ice for 3 min and neutralized with 2 µl of quench mix (2 ml of 5 M sodium chloride, 3 ml of 3 M sodium acetate, 2 ml of 2 M hydrochloric acid). For four cDNA reference ladders, each of four ddNTPs (GE Healthcare 27-2045-01) with a ddNTP:dNTP ratio of 1.25 (0.1:0.08 mM) was used in the reverse-transcription reaction.

cDNA was pulled down on a 96-post magnetic stand and washed twice with 100 μl of 70% ethanol. To elute the bound cDNA, the magnetic beads were resuspended in 10.0625 μl of ROX350 (Thermo Fisher Scientific 401735)/Hi-Di (0.0625 μl of ROX350 ladder in 10 μl of Hi-Di formamide) and incubated at room temperature for 20 min. The cDNA was further diluted by 1/3 and 1/10 in ROX350/Hi-Di, and samples were loaded onto capillary electrophoresis sequencers (ABI-3730) through capillary electrophoresis services provided by ELIM Biopharmaceuticals. Capillary electrophoresis data were analyzed using the HiTRACE v.2.0 package (https://github.com/ribokit/HiTRACE), following the recommended steps for sequence assignment, peak fitting, background subtraction of the no-modification control, correction for signal attenuation and reactivity profile normalization.

### Error and significance estimation

We estimated confidence intervals on reported Pearson correlation values by bootstrapping the datapoints under consideration and reporting the 2.5th and 97.5th percentile over 1,000 rounds of bootstrapping. Reported standard error values are estimated by calculating the standard deviation across bootstrapping rounds. We inferred significance in differences between package correlations by analyzing overlap between 95% confidence interval estimates75,76. All code to reproduce significance analyses is included in the EternaBench repository.
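The bootstrap procedure described above can be sketched with the Python standard library alone. This is an illustrative version, not the code in the EternaBench repository: the function names, the fixed random seed and the percentile interpolation method are assumptions.

```python
import math
import random
import statistics


def pearson(x, y):
    """Linear Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_pearson(x, y, n_rounds=1000, seed=0):
    """Return (2.5th percentile, 97.5th percentile, standard error) of the
    Pearson correlation over n_rounds resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    values = []
    while len(values) < n_rounds:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            values.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
        except ValueError:
            continue  # skip degenerate resamples with zero variance
    # n=40 yields cut points every 2.5%: first is 2.5%, last is 97.5%.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1], statistics.stdev(values)
```

Two packages would then be judged significantly different on a dataset if their 95% intervals returned by `bootstrap_pearson` do not overlap.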

### Package predictions

All base-pairing probability calculations and constrained-partition function calculations were performed using standardized system calls through Python wrappers developed in Arnie (www.github.com/DasLab/arnie). Example command-line calls for each package option evaluated are provided in Supplementary Table 1. Datasets were processed with Pandas (https://github.com/pandas-dev/pandas) and visualized with Seaborn (https://seaborn.pydata.org/).

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.