## Main

Therapeutics based on messenger RNA (mRNA) have shown immense promise as a modular therapeutic platform, allowing potentially any protein to be delivered and translated1,2, as evidenced by the rapid deployment of mRNA-based vaccines against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)3,4,5. However, the chemical instability of RNA sets a fundamental limit on the stability of RNA-based therapeutics1,6,7,8, with RNA hydrolysis specifically setting a limiting factor on stability in lipid nanoparticle (LNP)-based formulations9,10. Hydrolysis in LNP formulations degrades the amount of mRNA remaining during shipping and storage, and hydrolysis in vivo after vaccine injection limits the amount of resulting protein produced over time9. Better methods to develop thermostable RNA therapeutics would allow for increasing the equitability of their distribution, reducing their cost and possibly increasing their potency10,11.

An underexplored path to more shelf-stable mRNA therapeutics lies in the prospect of synonymous sequence design. A simple calculation reveals that there exist 10633 mRNA sequences that all code for the SARS-CoV-2 spike protein antigen. With an astronomical number of mRNA sequences available for a given therapeutic target, it is likely that some of these sequences may harbour structural features that make them more resistant to hydrolysis than first-generation mRNA vaccine formulations. Indeed, initial results have demonstrated that more stable mRNAs for model protein systems can be designed by optimizing candidate RNA sequences, scored with a model for RNA hydrolysis12,13. These initial studies indicate that stabilized mRNAs can produce equivalent, and for some designs, more protein compared with non-optimized mRNAs13. These design strategies are predicted to be able to produce mRNAs that do not activate double-stranded RNA immune sensors12 such as RIG-I14. These strategies have also demonstrated compatibility with mRNAs synthesized from modified nucleotides including pseudouridine13, which are used in mRNA vaccine formulations15.

However, the potential of any such mRNA design algorithm is limited by the accuracy of the underlying model in predicting RNA degradation. Previous models for RNA degradation have assumed that the probability of any RNA nucleotide linkage being cleaved is proportional to the probability of the 5′ nucleotide being unpaired12. Computational studies with this model suggested that at least a twofold increase in stability could be achieved through sequence design, while maintaining a wide diversity of sequences and features related to translatability, immunogenicity and global structure13. However, it is unlikely that degradation depends only on the probability of a nucleotide being unpaired: local sequence- and structure-specific contexts may vary widely, as evidenced by ribozyme RNAs found in nature, whose sequences adopt specific structures that undergo self-scission16.

We wished to understand the maximum predictive power achievable for RNA degradation on a short timescale for model development. To do this, we combined two crowdsourcing platforms: Eterna, an RNA design platform, and Kaggle, a platform for machine learning competitions. The problem of ‘RNA design’ involves designing RNA sequences with specific target properties such as a particular overall structure17,18, a target function such as sensor activity19, or, in this case, high chemical stability13. We used degradation data from short RNA fragments designed on the Eterna platform, which comprised a wide diversity of sequences and structures, and hypothesized that crowdsourcing the problem of obtaining a machine learning architecture would result in a model capable of expressing the resulting complexity of sequence- and structure-dependent degradation patterns (Fig. 1a). We hypothesized that this ‘dual crowdsourcing’ would lead to stringent and independent tests of the models developed, minimizing sharing of assumptions between the individuals designing the constructs to test (Eterna participants) and the individuals building the models (Kaggle participants) and leading to better generalizability on independent datasets.

The resulting models were subjected to two blind prediction challenges. The first was in the context of the Kaggle competition, where the RNA structure probing and degradation data that participants would be aiming to predict was not acquired until after the competition was announced. The experimental method used for these data, In-line-seq, allowed for measuring the degradation rate of individual nucleotide linkages. However, this method relies on probing short RNA fragments and is unable to scale to make single-nucleotide degradation measurements of full-length mRNAs for protein targets of interest. Other experimental methods such as PERSIST-seq13 have been developed to characterize the overall degradation rates per mRNA molecule, which is the primary value of interest to minimize when designing stabilized RNA-based therapeutics. In principle, the overall degradation rate of an mRNA molecule of length N is equivalent to the sum of degradation rates at each dinucleotide linkage in the backbone12:

$$k_\mathrm{deg}^\mathrm{mRNA} = \mathop {\sum}\nolimits_{i = 1}^{N - 1} {k_\mathrm{deg}^i,}$$
(1)

where $$k_\mathrm{deg}^i$$ is the degradation of nucleotide linkage i. The half-life of the mRNA is calculated as

$$t_{1/2} = \frac{{\ln 2}}{{k_\mathrm{deg}^\mathrm{mRNA}}}.$$
(2)

We tested the above model empirically by comparing the summed degradation rates per nucleotide to the abundance of the entire construct remaining from sequencing and found high agreement (Extended Data Fig. 1). Using the above ansatz, the resulting models were tested in a second blind challenge of predicting the overall degradation of full-length mRNAs encoding a variety of model proteins, experimentally tested using PERSIST-seq. The models also demonstrated increased predictive power over existing methods in predicting these overall degradation rates. These models therefore appear immediately useful for guiding design of low-degradation mRNA molecules. Analysis of model performance suggests that the task of predicting RNA degradation patterns is limited by both the amount of data available as well as the accuracy of the structure prediction tools used to create input features. Further developments in experimental data and secondary structure prediction, when combined with network architectures such as those developed here, will further advance RNA degradation prediction and therapeutic design.

## Results

### Dual-crowdsourced competition design and assessment

The aim of the OpenVaccine Kaggle competition (Fig. 1b) was to develop computational models for predicting RNA degradation patterns. We asked participants on the Eterna platform to submit RNA designs using a web-browser design window (Fig. 1c), which resulted in a diversity of sequences and structures (Fig. 1d). In total, 150 participants (Supplementary Table 1) submitted sequences. A secondary motivation was an opportunity for participants to receive feedback on RNA fragments they may wish to use in mRNA design challenges described by Leppek et al.13 In total, 3,029 RNA designs of length 107 nt were collected in the first ‘Roll Your Own Structure’ round I (RYOS-I), which opened on 26 March 2020 and closed on 19 June 2020 (Fig. 1e).

We then obtained nucleotide-level degradation profiles for the first 68 nucleotides of these RNAs using In-line-seq13, a method for characterizing in-line RNA degradation in high throughput for the purposes of designing stabilized RNA therapeutics. In brief, a library of short RNA fragments was produced from a DNA library via in vitro transcription, each of which contained a unique barcode at the 3′ terminus. The RNA library was subjected to one of several accelerated degradation conditions, which included combinations of increases in Mg2+ concentration, basicity and temperature. The resulting fragmented RNA was reverse transcribed and the complementary DNA was sequenced. The base-pairing structures of the constructs were also characterized via selective 2′ hydroxyl acylation with primer extension (SHAPE; termed ‘Reactivity’ below)20,21, a technique to characterize RNA secondary structure. SHAPE experiments were performed analogously to the In-line-seq experiments described above, but instead of degradation conditions, the RNA was subjected to a chemical modifier (1-methyl-7-nitroisatoic anhydride, 1M7) which acylates the 2′-OH group. When the RNA is reverse transcribed, such an acylation causes the reverse transcriptase enzyme to terminate. The resulting cDNA fragments were used to create a ‘reactivity profile’ for each molecule.

The Kaggle competition was designed to create models that would have predictive power for three of these data types, given RNA sequence and secondary structure as input (Fig. 1f). In addition to scoring two types of degradation data, we also scored predictions for SHAPE data, hypothesizing that models would be more accurate if able to learn shared underlying features between degradation data and SHAPE data as a form of multi-task training. Nucleotides that are more reactive to the SHAPE reagent would be predicted to also have dinucleotide linkages with higher degradation rates.

In total, each independent construct of length N required predicting 3 × N values for the three data types. In addition to these experimental data, Kaggle participants were also provided with features related to RNA secondary structure computed from available biophysical models to use if they wished. These features included N × N base-pairing probability matrices from EternaFold22, a recently developed package with state-of-the-art performance on RNA structural ensembles; dot-parenthesis notated minimum free energy (MFE) RNA secondary structure from the ViennaRNA package23; and a six-character featurization of the MFE structure calculated using bpRNA24.

We developed training and ‘public test’ datasets from the RYOS-I dataset (Fig. 2). The public test dataset was used to rank submissions during the competition. The 3,029 constructs were filtered for those with mean signal-to-noise values greater than 1, resulting in 2,218 constructs (Fig. 2, dark blue track, Methods). These constructs were segmented into splits of 1,179 in the public training dataset, 400 constructs in the public test set and 639 for the ‘private test’ dataset, the set which would be used in the final evaluation. The sequences that did not pass the signal-to-noise filter were also provided to Kaggle participants with the according description. The RYOS-I data contained some ‘clusters’ of sequences where Eterna players included many small variations on a single sequence (clusters visible in Fig. 1d). To mitigate the possibility of sequence motifs in these clusters biasing evaluation, we segmented the RYOS-I data into a public training, public test and private test sets by clustering the sequences and including only sequences that were singly, doubly or triply clustered in the private test set (Methods). This strategy was described to Kaggle participants during the competition.

To ensure that the majority of the data used for the private test set were fully blind, we initiated a second ‘Roll Your Own Structure’ challenge that was launched for Eterna design collection on 18 August 2020. Given that useful models for degradation should be agnostic to RNA length, we designed the constructs in RYOS-II to be 34 nucleotides longer (102 versus 68 nt) than the constructs in RYOS-I to discourage modelling methods that would overfit to constructs of length 68. Design collection was closed on 7 September, 3 days before the launch of the Kaggle challenge on 10 September. The RYOS-II wet-lab experiments were conducted concurrently with the Kaggle challenge, enabling a completely blind test for the models developed on Kaggle. The Kaggle competition was closed on 6 October. The RYOS-II was similarly clustered and filtered to ensure that the test set used for scoring consisted primarily of singly and doubly clustered constructs. Three data types were used to score models: SHAPE; 10 mM Mg2+, pH 10, 1 day, 24 °C; and 10 mM Mg2+, pH 7.2, 1 day, 50 °C. Models were scored using the mean column RMSE (MCRMSE) across three data types, defined as

$$\mathrm{MCRMSE} = \frac{1}{{N_t}}\mathop {\sum}\nolimits_{j = 1}^{N_t} {\sqrt {\frac{1}{n}\mathop {\sum}\nolimits_{i = 1}^n {\left( {y_{ij} - \hat y_{ij}} \right)^2} } ,}$$
(3)

where Nt is the number of scored data types, n is the number of nucleotides in the dataset, yij is the measured data value, and $$\hat y_{ij}$$ is the predicted data value for nucleotide i in sequence j. Two additional data types were included in the training data corresponding to RNAs degraded for 7 days without Mg2+ rather than 1 day with Mg2+: (pH 10, 7 days, 24 °C; and pH 7.2, 7 days, 50 °C). However, these data were not collected for the second round to accelerate competition turnaround.

### Kaggle team performance and common attributes of top models

During the 3 week competition period, 1,636 teams submitted 35,806 solutions. The overall performance of teams compared to baseline models for RNA degradation is depicted in Fig. 3a. Kaggle entries significantly outperformed the ‘DegScore’ linear regression model for RNA degradation13 by 37% in MCRMSE for the public test set and 25% for the private test set (Fig. 3a). We found that for predictions from the top 100 teams (Extended Data Fig. 2) as well as amongst predictions between individual constructs (Extended Data Fig. 3), performance between data types was highly correlated in the public dataset and less strongly correlated in the private dataset. Overall, the weakest correlation was between SHAPE and degradation data types.

An additional benchmark model that used the DegScore windowed featurization (see Methods) with improved XGBoost25 training, termed the DegScore-XGB model, resulted in moderate improvement (public MCRMSE 0.35854, private MCRMSE 0.43850 compared to public MCRMSE 0.39219, private MCRMSE 0.47197 for the original DegScore). Kaggle participants developed feature encodings beyond what was provided. One of the most widely used community-developed featurizations was a graph-based distance embedding depicted in Fig. 3b. Several teams, including the top three teams, used a publicly shared autoencoder/GNN/GRU kernel (https://www.kaggle.com/code/mrkmakr/covid-ae-pretrain-gnn-attn-cnn/), which alone achieved a MCRMSE of 0.24860 on the public and 0.36106 on the private test set (Fig. 3a). This notebook was the most forked (forked 936 times as of March 2022) and upvoted (upvoted 386 times). The architecture of the winning ‘Nullrecurrent’ model (Fig. 3c) depicts the architecture of this shared kernel, which feeds 1D and 2D features, including adjacency matrices based on the secondary structure of the RNA inputs, into a multi-head attention network, the output of which is then fed into convolutional neural net layers. Many teams additionally cited pseudo-labelling and generating additional mock data as being integral to their solutions. The architecture of the second-placed team (Fig. 3d) demonstrates an example implementation of using pseudo-labelling. The practice of pseudo-labelling, which is similar to the student-teacher learning paradigm,26 involves using predictions from one model as ‘mock ground truth’ labels for another model. To generate additional mock data, participants generated random RNAs and structure featurizations from five different secondary structure prediction algorithms using the package Arnie (https://github.com/DasLab/arnie) and used these RNAs in training as well (see Supplementary Information for more detailed descriptions of solutions from Kaggle teams).

We explored whether increased accuracy in modelling could be achieved by ensembling models, that is, combining predictions from multiple models; a common feature of Kaggle competitions is that winning solutions are dissimilar enough that ensembled models frequently improve predictive ability. We found that ensembling resulted in only modest improvements (Methods), suggesting the majority of signal had been captured by the top two models.

### Top models are capable of deep representation of RNA motifs

We analysed predictions from the first-placed model (‘Nullrecurrent’) in greater depth to better understand its performance. Across all nucleotides in the private test set, 41% of nucleotide-level predictions for SHAPE reactivity agreed with experimental measurements with an error margin that was lower than experimental uncertainty; for comparison, if experimental errors are distributed as normal distributions, a perfect predictor would agree with experimental values over 68% of data points. For Deg_Mg_pH10 and Deg_Mg_50C, 28% and 42% of predictions were within error, respectively.

The nucleotides with the highest RMSE in the Deg_Mg_pH10 data type were any nucleotide type in bulges, and U’s in any unpaired context. Figure 4a depicts representative constructs with the lowest RMSE for the Deg_Mg_pH10 data type out of the private test data, demonstrating that diverse structures and structure motifs were capable of being predicted correctly. Aggregating predictions from the Nullrecurrent model over secondary structure motifs (Fig. 4b) demonstrates that the Nullrecurrent model captured patterns previously observed in the experimental signal13. The most reactive RNA structure motifs were triloops, a previously unknown biological finding. Another unexpected finding from these data was that symmetric internal loops were more stable against degradation than internal loops with asymmetric lengths. The fact that the Nullrecurrent model was able to capture this trend indicates that using such models within a design algorithm would allow for an automated way to model such biochemical attributes within a designed mRNA. Constructs with the highest RMSE highlight instances in which the provided structure features were incorrect. Figure 4c depicts two constructs with the highest RMSE for the SHAPE modification prediction. The SHAPE data for the first construct, ‘2204Sept042020’, has high reactivity in predicted stem areas, indicating that the stems were unfolded in solution. By contrast, construct ‘Triple UUUU Tetraloops’ has low reactivity in the exterior loop, suggesting that those nucleotides were paired rather than unpaired. These examples notwithstanding, we found no correlation between the EternaScore, a metric indicating how closely the experimental reactivity signal matches the predicted structure17, and RMSE summed per construct, suggesting that, in general, quality of the input structure features was not a limitation in model training (Extended Data Fig. 4).

### Kaggle models improve prediction of mRNA degradation

As an independent test, we assessed the ability of the top two Kaggle models to predict the overall degradation rates for a dataset of full-length mRNAs that were not publicly available at the time of the Kaggle competition. Because the throughput of the In-line-seq experimental method is limited to RNA lengths easily accessible by Illumina sequencers (500 nt), these mRNAs could not be probed at a per-nucleotide level akin to the datasets used in the Kaggle experiments. However, their overall degradation rates (related to the per-nucleotide degradation rate via equation (1)), were characterized using PERSIST-seq13. In brief, the PERSIST-seq technique measured the overall degradation rate of a mRNA by monitoring the mRNA’s abundance using reverse transcription followed by polymerase chain reaction amplification (RT-PCR) at varying timepoints after degradation was initiated. The lengths of these mRNAs ranged from 504 to 1,588 with a median length of 928 (Fig. 5a), nearly tenfold longer than the longest RNA fragments used in the OpenVaccine Kaggle challenge (full mRNA dataset, attributes and calculations in Supplementary Table 2). The experimentally determined structures of two example mRNAs designed by Eterna participants13 are depicted in Fig. 5b. Both code for Nanoluciferase but have a 2.5-fold difference in hydrolysis lifetime. ‘Yellowstone’ was designed by an Eterna participant using codons that mimic nucleotide frequencies from organisms in Yellowstone hot springs27; ‘LinearDesign-1’ was designed by an Eterna participant using an initial sequence from the LinearDesign mRNA structure optimization server28.

To compare the Kaggle predictors to the single overall degradation rate from PERSIST-seq, we made predictions for all nucleotides in the full mRNA constructs and summed the predictions from the region that was captured by RT-PCR in PERSIST-seq which, in most cases, included the mRNA’s 5′ untranslated region (UTR) and coding sequence (CDS) (Fig. 5c). Carrying out predictions on the full RNA sequence and then summing over the probed window accounts for interactions between the untranslated regions and CDS, as can be seen for two example constructs in Fig. 5b–nucleotides in the 5′ and 3′ UTRs are predicted to pair with the CDS. We made predictions for 188 mRNAs in four classes of protein targets: a short multi-epitope vaccine (MEV), the model protein nanoluciferase, with one class consisting of varied UTRs and a second consisting of varied CDSs, and enhanced green fluorescent protein (eGFP). We found that the Kaggle second-placed ‘Kazuki2’ model exhibited the highest correlation to experimentally determined degradation rates, followed by the Kaggle first-placed ‘Nullrecurrent’ model (Fig. 5c), with Spearman correlation coefficients of 0.48 (p = 3.3 × 10–12) and 0.43 (p = 9.5 × 1010), respectively. Both Kaggle models outperformed unpaired probability values from ViennaRNA RNAfold v. 2.4.1423 (R = 0.25, p = 5.4 × 104), the DegScore linear regression model (R = 0.36, p = 2.9 × 107) and the DegScore-XGBoost model (R = 0.42, p = 1.8 × 109). An ensemble of the Nullrecurrent and Kazuki2 models did not outperform the Kazuki2 model (R = 0.47, p = 1.4 × 1011), again suggesting that the models themselves had reached their predictive potential. To estimate an upper limit for correlation considering experimental error, we resampled the measured degradation rates from within experimental error and calculated the correlation to the mean degradation rate. This resulted in a Spearman correlation of 0.88 (Table 1).

## Discussion

The OpenVaccine competition uniquely leveraged resources from two complementary crowdsourcing platforms: Kaggle and Eterna. The participants in the Kaggle competition were tasked with predicting stability measurements of individual RNA nucleotides. The urgency of timely development of a stable COVID-19 mRNA vaccine necessitated that the competition be run on a relatively short timeframe of three weeks, as opposed to three months, which is more common with Kaggle competitions.

The models presented here are immediately useful for mRNA design in that they could be called within a stochastic mRNA design algorithm12 to minimize the predicted degradation. There is likely further opportunity to leverage advancements in natural language processing to use datasets such as the ones presented here to generate mRNA designs using text-generation approaches29,30,31. The degradation data used in this competition were from RNA synthesized with unmodified nucleotides, but mRNA vaccines are being formulated with modified nucleotides including pseudouridine or N-1-methyl-pseudouridine15. Modified nucleotides in general will have differing underlying thermodynamics32, and there is a need to develop datasets and predictive models to predict structures and resulting stabilization of mRNAs formulated with modified nucleotides. The In-line-seq method can be performed using RNA with modified nucleotides, and the resulting data could be used to re-train models with architectures such as the ones presented here. Short of developing complete new thermodynamic parameters for modified nucleotides, it may be possible to develop principled heuristics to adapt models to mRNAs synthesized with modified nucleotides. For instance, Leppek et al. modified the DegScore model for pseudouridine by setting all uridine degradation measurements to zero to mimic the stabilization effect of pseudouridine, and saw moderate improvement in correlation13.

An under-investigated aspect of the models presented here is the effect of training on multiple data types. We speculate that because SHAPE reactivity has higher signal-to-noise than the degradation data types (Extended Data Fig. 5), models with architectures that allowed for weight-sharing between data types benefitted from learning to predict SHAPE reactivity as well. Directly predicting RNA degradation without concurrently training on SHAPE data may result in worse model performance. Conversely, the model architectures presented here may also prove to have useful biological applications in predicting only SHAPE reactivity data. Future directions for model development include training such models on larger chemical mapping datasets from more diverse experimental sources22 and integrating into inference frameworks for RNA structure prediction22,33.

Finally, the models for predicting RNA hydrolysis developed in this work may prove useful in computationally identifying classes of natural RNAs that have evolved to be resistant to degradation34. Such future bioinformatic analysis may suggest entirely new biologically inspired approaches for designing hydrolysis-resistant RNA therapeutics. More immediately, it will be of strong interest to computationally design mRNA sequences that optimize the predicted degradation stability discovered in this study, and to experimentally test if such sequences are indeed sufficiently stable to enable wider distribution of mRNA vaccines. In silico design of neural-network-predicted properties is an active area of research, and we speculate that further dual-crowdsourcing studies may help accelerate progress.

## Methods

### Initial feature generation

As a starting point for Kaggle teams, we supplied a collection of features for each RNA sequence, including the minimum free energy (MFE) structure according to the ViennaRNA 2 energy model23, ‘loop type’, or secondary structure type assignments generated with bpRNA24 (S = Stem, I = Internal loop, B = Bulge, H = Hairpin, M = Multiloop, X = external loop, E = end, terminology adopted from bpRNA) and the base-pair probability matrix according to the EternaFold22 energy model. These features were generated using Arnie (https://github.com/DasLab/arnie).

### Experimental data generation

The first experimental dataset used in this work, for the public training and test set, resulted from the ‘Roll-Your-Own-Structure’ Round I lab on Eterna, and had been generated previously by Leppek et al.13

The second experimental dataset used in this work, for the private test set, was generated for this work specifically. To produce these data, and for precise consistency with the public training and test set, In-line-seq was carried out as described by Leppek et al.13 In brief, DNA templates were ordered via custom oligonucleotide pool from Custom Array/Genscript, prepended by the T7 RNA polymerase promoter. Templates were amplified via PCR, transcribed to RNA via the TranscriptAid T7 High Yield Transcription Kit (Thermofisher, K0441), and the purified RNA was subjected to degradation conditions: (1) 50 mM Na-CHES buffer (pH 10.0) at room temperature without added MgCl2; (2) 50 mM Na-CHES buffer (pH 10.0) at room temperature with 10 mM MgCl2; (3) phosphate-buffered saline (PBS, pH 7.2; Thermo Fisher Scientific-Gibco 20012027) at 50 °C without added MgCl2; and (4) PBS (pH 7.2) at 50 °C with 10 mM MgCl2. In parallel, purified RNA was subjected to SHAPE structure-probing conditions, and one sample was subjected to the SHAPE protocol absent addition of the 1-methyl-7-nitroisatoic anhydride reagent.

cDNA was prepared from the six RNA samples (SHAPE probed, control reaction and four degradation conditions). We pooled 1.5 μl of each cDNA sample together, ligated with an Illumina adapter, washed and resuspended the ligated product, which was quantified by qPCR, sequenced using an Illumina Miseq. Resulting reads were analysed using MAPseeker (https://eternagame.org/software) following the recommended steps for sequence assignment, background subtraction of the no-modification control, correction for signal attenuation and reactivity profile normalization as described previously20.

### Signal-to-noise filtering

Data were filtered to include RNAs with a minimum value >0.5, maximum value <20 across five RNA degradation conditions and RNAs with a signal/noise ratio for SHAPE reactivity greater than 1.0. Signal/noise ratio for each construct was calculated as

$${\mathrm{SN}}\;{\mathrm{ratio}} = \frac{1}{M}\mathop {\sum}\nolimits_{i = 1}^M {\frac{1}{N}\mathop {\sum}\nolimits_{j = 1}^N {\frac{{\mu _{i,j}}}{{\sigma _{i,j}}}} } ,$$
(4)

where μi,j is the mean value of data type i at nucleotide j, and σi,j is standard deviation of data type i at nucleotide j, as calculated by MAPseeker. The data that did not pass the above filters were also provided to participants to give the option to use in training, and was flagged with the variable ‘SN_filter = 0’. Applying the above filter did not significantly alter the distribution of the median reactivity or signal/noise of any data type (SHAPE reactivity, Deg_Mg_pH10, Deg_Mg_50C) within either dataset (RYOS 1 or RYOS 2; Extended Data Fig. 6). However, average signal/noise of the Round II constructs was higher than the Round 1 constructs. Average signal/noise ratio for SHAPE reactivity across each dataset increased from 5.3±2.4 (mean ± standard deviation) to 6.2±3.5; for deg_Mg_pH10, 4.1±2.0 to 6.4±3.8 for Rounds 1 and 2; and for deg_Mg_50C 3.87±1.8 to 5.3±3.1 (Extended Data Fig. 5).

We wished to ascertain if the measured reactivities and degradation from Round 2 needed to be rescaled to match Round 1. To assess this, we compared distributions of nucleotide reactivities from nucleotide types. We found that for each data type and nucleotide type, the median values for Round 2 were within the 50% interquartile range of Round 1 (Extended Data Fig. 7a). We also compared distributions of reactivity from the first five nucleotides, which are a constant ‘GGAAA’ for each construct. The median values for each nucleotide in this were within the 50% interquartile range for all except for the first two GG’s in the ‘Deg_Mg_pH10’ data type (Extended Data Fig. 7b). We elected to not rescale the data from Round 2.

### Private test set curation

The private test set was curated to avoid bias toward more highly represented sequence motifs from the Eterna designs. Sequences that passed the above filters were clustered hierarchically using the ‘ward’ method in scikit-learn35 and then clustered at a cophenetic distance of 0.5. That is, sequences within the same cluster have <50% sequence similarity. All sequences that were in clusters with one, two or three members were included in the private test set, as well as one cluster member from other randomly selected clusters to attain the desired number of test set constructs.

### Comparing to the DegScore model

We compared Kaggle models to the ‘DegScore’ linear model13, which models degradation at a given nucleotide i as a linear function of nucleotides surrounding i:

$$\begin{array}{l}Y_i = \mathop {\sum}\nolimits_{k = - w}^w \left[ \mathop {\sum}\nolimits_{n \in \{\mathrm{A,C,G,U}\} } \left( \beta _{k,n}I_{i + k,n} \right) \right] \\+ \mathop {\sum}\nolimits_{k = - w}^w \left[ \mathop {\sum}\nolimits_{s \in \{\mathrm{H,E,I,M,B,S}\} } \left( {\beta _{k,s}I_{i + k,s}} \right) \right] + \beta _0 ,\end{array}$$
(5)

where β represents learned coefficients and I is an indicator function corresponding to the identity of nucleotide i+k. I accounts for sequence identity n(A,C,G,U) and its secondary structure type assignment s (S = stem, E = external loop, I = internal loop, B = bulge, H = hairpin, M = multiloop). The secondary structure type assignment ‘X’ for external loop was replaced with ‘E’ in the DegScore model in ref. 13, to reflect the biophysical similarity between the two categories. w is the maximum window distance, set to be 12 by Leppek et al.13. For a window size of w = 12, there are 251 parameters (25 positions with 4 sequence indicators and 6 secondary structure indicators for each position, and 1 intercept parameter).

### Ensembling models

We explored whether increased accuracy in modelling could be achieved by combining models. We used a genetic algorithm to ensemble maximally 10 of the top 100 models. The score on the public dataset was used to optimize, with the final ensembled model evaluated on the private dataset. With this method, ensembling achieved a public MCRMSE of 0.2237 (compared to the best public MCRMSE of 0.2276) and a private MCRMSE of 0.3397 (compared to the best private test set MCRMSE of 0.3420). By comparison, averaging the outputs of the top two models gave a result of 0.2244 public, 0.33788 private. Blending the top two solutions with the third solution did not improve the result. An estimated bound of ensembling can be found by optimizing directly to the private ensemble score. With this method, it was possible to achieve a private ensemble score of 0.3382 (again, versus best Leaderboard MCRMSE 0.3420). The improvement of 0.0038 over the leaderboard for this last approach is about the distance between the first-placed and tenth-placed teams, and the ‘correct’ way gives an improvement that is the distance between the first- and fifth-placed teams. All these experiments suggest that most of the signal has been captured by the top two models, and that the use of further ensembling provides, at best, modest improvements. The seemingly puzzling result that the simple ensemble of the top two models outperforms the genetic algorithm blend of the top 10 (on the private test set) suggests that the genetic algorithm did not find a global minimum for model weights.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.