The formation of intramolecular secondary structure brings mRNA ends in close proximity

A number of protein factors regulate protein synthesis by bridging mRNA ends or untranslated regions (UTRs). Using experimental and computational approaches, we show that mRNAs from various organisms, including humans, have an intrinsic propensity to fold into structures in which the 5’ end and 3’ end are ≤ 7 nm apart irrespective of mRNA length. Computational estimates performed for ∼22,000 human transcripts indicate that the inherent proximity of the ends is a universal property of most, if not all, mRNA sequences. Only specific RNA sequences, which have low sequence complexity and are devoid of guanosines, are unstructured and exhibit end-to-end distances expected for the random coil conformation of RNA. Our results suggest that the intrinsic proximity of mRNA ends may facilitate binding of translation factors that bridge mRNA 5’ and 3’ UTRs. Furthermore, our studies provide the basis for measuring, computing and manipulating end-to-end distances and secondary structure in mRNAs in research and biotechnology.


INTRODUCTION
Regulation of mRNA translation in eukaryotes involves protein-mediated interactions between mRNA ends. Translation initiation requires the recruitment of the small ribosomal subunit to the 5' end of the mRNA 1. The formation of the initiation complex is stimulated by the interaction between the 5' mRNA cap-binding protein eIF4E and the 3' end poly(A) tail binding protein PABP, which is mediated through their binding to different parts of the translational factor eIF4G 2,3. The eIF4E•eIF4G•PABP complex is thought to enhance translation initiation by circularizing the mRNA and forming the "closed-loop" structure [4][5][6]. The mechanism by which the mRNA closed loop enhances protein synthesis is not well understood.
Remarkably, translation initiation of many eukaryotic mRNAs is also regulated by sequences in their 3' UTRs and controlled by the formation of protein bridges between the 5' and 3' UTRs. For example, the 3' UTR regulatory sequences recruit protein complexes (e.g. CPEB•Maskin, Bruno•Cup, or GAIT complex), which inhibit translation by interacting with either eIF4E or eIF4E•eIF4G bound to the 5' end of mRNA 7 . The pervasiveness of protein bridges between mRNA UTRs in the evolution of translation regulation is puzzling because of the significant entropic cost expected for protein-mediated mRNA circularization 8 .
The entropic penalty for the formation of protein bridges between mRNA ends may be partially mitigated by mRNA compaction through intramolecular basepairing interactions.
Recent theoretical analyses suggested that the 5' and 3' ends of long (1,000-10,000 nucleotide-long) RNAs are always brought within a few nanometers of each other regardless of RNA length and sequence because of the intrinsic propensity of RNA to form widespread intramolecular basepairing interactions [8][9][10]. One study predicted that the 5' to 3' end distance in RNAs is 3 nm, on average 8. These theoretical predictions were tested by single-molecule Förster resonance energy transfer (smFRET) measurements of end-to-end distances in several viral RNAs and mRNAs from the fungus Trichoderma atroviride, which varied in length between 500 and 5,000 nucleotides and were folded in vitro in the absence of any protein factors 11.
Experimentally-derived end-to-end distances in RNA molecules, in which FRET was detected, ranged between 5 and 9 nanometers 11. However, this study did not determine the average end-to-end distance in each transcript since only molecules showing FRET were detected. Thus, it is possible that RNA molecules with 5-9 nm-long end-to-end distances account for only a minor fraction of each examined transcript.
The hypothesis that the closeness of RNA ends is a universal property of all natural transcripts remains to be systematically tested. The propensity of human mRNAs to fold into structures with short end-to-end distances has not been examined. It is unclear to what extent end-to-end distance may vary between different transcripts. Sequence features that define RNA potential to fold into structures with short end-to-end distances are unknown. Elucidating whether closeness of the 5' and 3' ends is an intrinsic propensity of all mRNAs may have important implications for various aspects of mRNA metabolism including translation, splicing or degradation. For example, the closeness of mRNA ends may underlie translation regulation mediated by various protein complexes that bridge mRNA UTRs.
Here, we use FRET measurements and computational analysis of RNA structure to examine the end-to-end distances in mRNAs from several species, including humans. We find that most, if not all, mRNAs have an intrinsic propensity to fold into structures with short end-to-end distances.

RESULTS
The 5' end of the 5' UTR and 3' end of the 3' UTR of human mRNAs are intrinsically close. We experimentally determined the end-to-end distance in a number of mRNAs using FRET between fluorophores introduced at each end of the mRNAs (Fig. 1a). The range of FRET sensitivity to distance changes (1 to 10 nm for the Cy3-Cy5 pair 12) matches the theoretically predicted range of distances between the 5' and 3' ends of structured RNAs 8,10. We selected yeast and human mRNAs that encode abundant housekeeping proteins and have well-annotated 5' and 3' UTR sequences, such as yeast RPL41A (ribosomal protein L41A) and human GAPDH (glyceraldehyde-3-phosphate dehydrogenase) (Supplementary Table 1). In addition, we used rabbit β-globin and firefly luciferase (Fluc) mRNAs that have been used as canonical "standard" mRNAs in many previous mechanistic studies of eukaryotic translation.
Selected mRNAs were less than 2,000 nucleotides long to ensure high yields of in vitro run-off transcription by T7 RNA polymerase (Supplementary Table 1).
We labeled the 5' and 3' ends of mRNAs, which lacked the 5' cap and poly(A) tail, with donor (Cy3) and acceptor (Cy5) fluorescent dyes, respectively. Computational prediction of RNA secondary structure suggests that all examined mRNAs can form extensive intramolecular basepairing interactions (Fig. 1a, Supplementary Fig. 1a). mRNAs were refolded in the absence of protein factors and the presence of 100 mM KCl and 1 mM MgCl2. These ionic conditions are considered to be optimal for translation in eukaryotic in vitro translation systems 13,14. Furthermore, the 1 mM concentration of Mg2+ used in our experiments is similar to concentrations of free (unbound) cytoplasmic Mg2+ in human cells (0.5-1 mM) 15. Energy transfer between fluorophores attached to the 5' end of the 5' UTR and 3' end of the 3' UTR was detected in all eight tested mRNAs. The average end-to-end distances, which were determined for each transcript from ensemble FRET data, were in the range of 5-7 nm irrespective of mRNA length (Fig. 1b, Supplementary Table 1). These distances are two to ten times shorter than those predicted for unstructured RNA by the freely jointed chain model 16,17 of polymer theory (Fig. 1b).
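As an illustration of how a FRET efficiency maps onto a distance, the sketch below inverts the standard Förster relation E = 1 / (1 + (r/R0)^6). The function names and the R0 value are assumptions made for this example only; the Förster radius used for the Cy3-Cy5 pair is not stated in this section, so the constant is a placeholder. An ensemble-averaged efficiency yields an apparent distance, not the distance of any single conformation.

```python
def fret_to_distance(efficiency: float, r0_nm: float) -> float:
    """Invert E = 1 / (1 + (r / R0)**6) to obtain the donor-acceptor distance r."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return r0_nm * (1.0 / efficiency - 1.0) ** (1.0 / 6.0)


def distance_to_fret(distance_nm: float, r0_nm: float) -> float:
    """Forward Förster relation: efficiency expected at a given distance."""
    if distance_nm < 0.0 or r0_nm <= 0.0:
        raise ValueError("distance must be non-negative and R0 positive")
    return 1.0 / (1.0 + (distance_nm / r0_nm) ** 6)


# Placeholder Förster radius (assumption, not taken from the text).
R0_ASSUMED_NM = 6.0
```

At E = 0.5 the computed distance equals R0 by construction, which is a quick sanity check on any chosen R0.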
We next tested whether mRNA ends are brought into close proximity by basepairing interactions. Refolding of human GAPDH and rabbit β-globin mRNAs in the presence of a 50 nucleotide-long DNA oligonucleotide complementary to the 3' end of the respective mRNA led to a dramatic reduction in the efficiency of energy transfer between fluorophores attached to RNA ends. The observed decrease in FRET efficiency is presumably due to annealing of the DNA oligonucleotide to the 3' end of the mRNA and disruption of the intramolecular secondary structure (Fig. 1c, Supplementary Fig. 1b-c).
To further test the effect of intramolecular secondary structure on mRNA end-to-end distance, we replaced 106 nucleotides at the 5' end of the 116 nucleotide-long 5' UTR of GAPDH mRNA with 53 CA repeats, which have low basepairing potential, to create the 5'UTR(CA)53 GAPDH mRNA variant. Likewise, 53 CA repeats were inserted at the 3' end of the 202 nucleotide-long 3' UTR of GAPDH mRNA in place of 106 nucleotides of the original sequence, to make the 3'UTR(CA)53 GAPDH mRNA variant. No energy transfer between the mRNA ends was detected in either of the GAPDH mRNA variants containing CA repeats, i.e. in 5'UTR(CA)53 GAPDH and 3'UTR(CA)53 GAPDH mRNAs (Fig. 1c). These results indicate that the 5' and 3' ends of wild-type GAPDH mRNA were brought within FRET distance via the formation of intramolecular basepairing interactions.

The 3' poly(A) tail is not involved in intramolecular basepairing interactions, which
bring the ends of the 5' and 3' UTRs in close proximity. mRNAs in eukaryotic cells undergo 5' capping (attachment of 7-methyl-guanosine to the 5' end) and polyadenylation of the 3' end.
In the experiments described above, we measured the distance between the 5' end of the 5' UTR and the 3' end of the 3' UTR of model mRNAs in the absence of the 5' cap and the poly(A) tail because neither the 5' cap nor adenosine repeats are likely to significantly affect secondary structure of mRNAs that lack extended uridine repeats. We tested the validity of this assumption by attaching donor and acceptor fluorophores to the ends of β-globin and GAPDH mRNAs transcribed with a 30 nucleotide-long poly(A) tail. Addition of a poly(A) tail led to a significant reduction in FRET efficiency (Fig. 1c), corresponding to an increase of end-to-end mRNA distance in both GAPDH and β-globin mRNAs by ~5 nm (Supplementary Table 1). This value is consistent with the ~5 nm end-to-end distance predicted for the 30 nt-long RNA segment in random-coil conformation 18. Therefore, the poly(A) tail is unstructured and not involved in basepairing interactions with the 5' UTR.

mRNAs fold into a dynamic ensemble of structures. Computational predictions suggest that mRNAs fold into an ensemble of structures with comparable thermodynamic stabilities rather than a single structure. To test this prediction, we examined end-to-end distance in individual GAPDH, β-globin and MIF mRNA molecules by measuring single-molecule (sm)FRET using total internal reflection fluorescence (TIRF) microscopy. smFRET reveals the structural dynamics of individual molecules that are masked in ensemble (bulk) FRET measurements because of signal averaging in the heterogeneous and non-synchronized population 12. To immobilize mRNAs to the surface of the microscope slide, a 20-nucleotide-long DNA oligonucleotide conjugated to biotin was annealed in the middle of the RNA where it was computationally predicted to have a minimal effect on the overall secondary structure and end-to-end distance of the RNA 19 (Supplementary Fig. 1b-c).
No significant decrease in energy transfer in ensemble FRET experiments was observed when Cy3/Cy5-labeled GAPDH, β-globin and MIF mRNAs lacking the poly(A) tail were folded in the presence of biotin-labeled DNA oligonucleotides (Supplementary Fig. 2). Hence, annealing of biotin-labeled DNA oligonucleotides neither affected end-to-end distance nor disrupted the overall RNA structure. mRNAs were tethered to the surface of microscope slides coated with BSA-biotin/neutravidin and then imaged by exciting the donor (Cy3) fluorescence with the green (532 nm) laser. smFRET traces acquired for GAPDH, β-globin and MIF mRNAs exhibited single-step photobleaching of both donor and acceptor fluorophores, indicating that we observe intramolecular rather than intermolecular energy transfer between mRNA ends (Supplementary Fig. 3). Because efficiencies of labeling of the 5' end with Cy5 (100%) and the 3' end of mRNA with Cy3 (~20-30%) markedly differed, we also imaged mRNAs by exciting the acceptor (Cy5) fluorescence with the red (642 nm) laser. Single-step Cy5 photobleaching was observed in 97-99% of single-molecule traces of GAPDH, β-globin and MIF mRNAs, indicating that mRNA dimers or higher-order oligomers were essentially absent.
FRET distribution histograms, which were constructed by compiling 300-1,200 smFRET traces, are best fit to a sum of four (GAPDH mRNA) or three (β-globin and MIF mRNAs) Gaussians (Fig. 2a-c). The distinct FRET peaks in distribution histograms correspond to different FRET states and, thus, different mRNA end-to-end distances. During run-off transcription, T7 RNA polymerase can add one, two or three non-templated nucleotides to the 3' RNA end in a fraction of the transcripts. To test whether the presence of multiple FRET states in FRET histograms corresponds to sequence or secondary structure heterogeneity, we varied the concentration of Mg2+, which is known to stabilize the secondary and tertiary structure of RNA. Consistent with the idea that mRNAs fold into a dynamic ensemble of several structural states with multiple end-to-end distances, individual smFRET traces in GAPDH, β-globin and MIF mRNAs showed spontaneous fluctuations between distinct FRET states (Fig. 2d, Supplementary Fig. 3). Using GAPDH mRNA as an example, we further explored the statistics of fluctuations between FRET states via Hidden Markov Model (HMM) and Transition Density Plot analyses 20,21. Consistent with FRET distribution histograms, individual GAPDH mRNA molecules predominantly fluctuated between ~0.4, 0.6 and 0.8 FRET states at frequencies of ~0.03-0.1 s⁻¹ (Fig. 2e, Supplementary Table 2). These rates are similar to previously measured kinetics of the spontaneous transition between two alternative 5 basepair-long RNA

Algorithm for the prediction of end-to-end distances in natural mRNAs through computation. Using the RNAstructure software package 23 and a freely jointed chain polymer theory 18, we developed a new algorithm for modeling the distribution of end-to-end distances for the folding ensemble in natural mRNAs to test the hypothesis about the proximity of mRNA ends at a transcriptome-wide level.
In this algorithm, a representative thermodynamic ensemble of structures is selected by stochastic sampling 24 , and then the distance between the 5' and 3' ends is estimated for each member of the sample in nanometers. The calculation employs two segment sizes (unpaired nucleotides and helix ends), which are estimated based on the freely jointed chain model of polymer theory [16][17][18] . We do not consider the presence of the poly(A) tail of the mRNA because the poly(A) tail is unlikely to make base-pairing interactions with the rest of the mRNA (Fig. 1c). Thus, we estimate the distance between the 5' end of mRNA and the 3' end of the 3' UTR at the junction with poly(A) tail. Our algorithm generates a histogram of estimated distances and, thus, examines both average end-to-end distance and the distribution of end-to-end distances in the population of RNA structures.
Average end-to-end distances derived from our ensemble FRET measurements correlate reasonably well with distances predicted for the same mRNAs using computation with a linear regression coefficient, r², of 0.67 (Supplementary Table 1, Supplementary Fig. 6). Deviations between predicted and experimentally measured end-to-end distances do not exceed 3 nm and, at least in part, may result from perturbations of fluorescent properties of the donor and acceptor fluorophores due to local environmental effects, which may lead to a 0.5-1 nm error in determination of FRET-derived distances 12,25. Furthermore, FRET might overestimate the average end-to-end distance because a fraction of mRNA may be misfolded or unfolded under chosen experimental conditions. Hence, our computational algorithm adequately predicts the end-to-end distance in the ensemble of folded RNA molecules and can be used to examine end-to-end distances in the human transcriptome.
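For readers who want to repeat this comparison on their own measured and predicted distances, the following sketch computes r² for a simple linear fit. It assumes that the reported r² is the squared Pearson correlation between the two sets of distances, which is how r² is usually defined for a one-variable linear regression; the function name is hypothetical and no values from Supplementary Table 1 are reproduced here.

```python
def r_squared(measured, predicted):
    """Squared Pearson correlation, equal to r^2 of a simple linear fit."""
    n = len(measured)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(measured) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in measured)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(measured, predicted))
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("r^2 is undefined when one series is constant")
    return (sxy * sxy) / (sxx * syy)
```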

The inherent closeness of the ends is a universal property of most, if not all, human
mRNA sequences. We used our algorithm to predict the end-to-end distance in ~22,000 transcripts of the HeLa human cell transcriptome. The predicted end-to-end distances were relatively narrowly distributed with a population mean of ~4 nm (Fig. 3a). Hence, the propensity of folding into structures with short end-to-end distances is common to all human mRNAs.
Furthermore, closeness of mRNA ends appears to be largely independent of nucleotide sequence and mRNA length.
To further explore the dependence of the end-to-end distance on RNA sequence, we estimated end-to-end distances in 10,000 variants of GAPDH mRNA, in which a segment of 106 nucleotides at the 3' end of the 3' UTR was shuffled while preserving the original adenosine/guanosine/cytosine/uracil ratio (Fig. 3b). Similar to the distribution of end-to-end distances in the HeLa cell transcriptome, end-to-end distances in GAPDH variants with a shuffled sequence in the 3' UTR were narrowly distributed with a population mean of ~3.8 nm (Fig. 3b). To experimentally test these computational estimates, we cloned one of the shuffled GAPDH variants ("Shuffled_1"), transcribed it without the 3' poly(A) tail and labeled it with Cy3/Cy5 fluorophores. The original wild-type and Shuffled_1 GAPDH mRNAs were predicted to have equal end-to-end distances. Consistent with computational prediction, ensemble FRET values measured in wild-type and Shuffled_1 GAPDH mRNA variants were essentially indistinguishable (Fig. 3c, Supplementary Table 1). This result provides additional evidence that mRNA end-to-end distance is largely sequence independent. Hence, random RNA sequences tend to form secondary structure.
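The shuffling step can be sketched as follows. This is a minimal illustration, assuming the shuffled segment is the last `tail_length` nucleotides of a transcript without a poly(A) tail; the function name is hypothetical and the authors' actual shuffling code is not shown in this section. A plain permutation of the segment preserves its nucleotide composition exactly.

```python
import random
from collections import Counter


def shuffle_tail(sequence: str, tail_length: int, rng: random.Random) -> str:
    """Permute the last `tail_length` nucleotides, leaving the rest untouched."""
    if not 0 <= tail_length <= len(sequence):
        raise ValueError("tail_length must be between 0 and the sequence length")
    split = len(sequence) - tail_length
    head = sequence[:split]
    tail = list(sequence[split:])
    rng.shuffle(tail)  # in-place permutation keeps the A/C/G/U counts fixed
    return head + "".join(tail)


def composition(sequence: str) -> Counter:
    """Nucleotide counts, useful for checking that shuffling preserved them."""
    return Counter(sequence)
```

Passing a seeded `random.Random` instance makes a set of variants reproducible, for example when generating the 10,000 shuffled sequences in a loop.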

Unstructured RNAs are devoid of guanosines and have low sequence complexity.
Although we find that the ends of most RNA sequences are inherently close, we have also demonstrated that the introduction of CA-repeats, which are known to have low basepairing potential, increases end-to-end distance in RNA. To further investigate the relationships between sequence properties, basepairing potential and end-to-end distance of RNA, we evolved the human GAPDH mRNA sequence in silico using a genetic algorithm. In this newly-developed algorithm, populations of sequences are evolved either by random mutation or by crossover (combination of two sequences from the population), and then sequences with the lowest mean basepairing probabilities are selected for subsequent iterative refinement (details can be found in Methods). We performed 500 independent in silico evolution transformations of the GAPDH mRNA sequence; each of these transformations entailed 1,000 mutation/sequence selection rounds. For each iteration of sequence evolution, we estimated RNA end-to-end distance and sequence linguistic complexity 26,27. This quantity is greater than zero and less than or equal to 1, where higher complexities reflect a larger diversity of oligonucleotide sequences within a sequence (the quantitative definition can be found in Methods).
In silico evolution transformations of the GAPDH mRNA sequence revealed that a reduction in average basepairing probability leads to an increase in end-to-end distance of RNA (Fig. 4a).
In addition, the RNA sequence becomes enriched with cytosines and depleted of guanosines (Fig. 4b). Guanosines are likely depleted because, in addition to Watson-Crick G-C base pairs, they can form wobble base pairs with uracils. G-U wobble base pairs have comparable thermodynamic stability to Watson-Crick A-U base pairs and are nearly isosteric to them 28,29.
Reduction in average basepairing probability is also accompanied by the decrease in sequence linguistic complexity (Fig. 4a) leading to the emergence of degenerate and repetitive sequences. The heat map showing alterations in the distribution of sequence complexities and end-to-end distances over 500 independent in silico evolution transformations of the GAPDH mRNA sequence indicates that the sequence complexity and end-to-end distance undergo anticorrelated changes (Fig. 4c).
Although the decrease in sequence complexity levels off and sequences become completely depleted of guanosines after ~200 iterations, basepairing probability and end-to-end distance continue evolving until they plateau after ~500 iterations. The maximal value of end-to-end distance (46 nm) achieved during in silico sequence evolution of the 1327 nt-long GAPDH mRNA is equal to the end-to-end distance predicted for RNA of this length in the random-coil conformation. Therefore, guanosine depletion and diminishing of sequence complexity are necessary but not sufficient to convert structured RNA into a completely unstructured conformation. Only specific, low-complexity sequences of adenosines, cytosines and uracils adopt the random coil conformation. Hence, if intrinsically unstructured RNA sequences occur in organisms, then they likely play an important biological role and emerged as a result of intense natural selection.
Rational design of non-repetitive sequences with low basepairing potential and long end-to-end distances. We and others find that long repetitive sequences, such as CA and CAA repeats, which are commonly introduced into RNA to disrupt RNA secondary structure, are notoriously difficult to maintain and propagate in live cells 30. To overcome this problem, we employed our genetic algorithm to generate 500 sequences of human GAPDH mRNA, in which the last 106 nucleotides of the 3' UTR were evolved into non-repetitive, intrinsically-unstructured sequences (Fig. 5a-b). During sequence evolution, the selection criteria were changed to consider both linguistic complexity and mean basepairing probability to increase end-to-end distance and also avoid highly-repetitive sequences. One of the resulting GAPDH mRNA sequences was cloned, transcribed without the 3' poly(A) tail and fluorescently labeled for FRET measurements of end-to-end distance. Consistent with computational prediction, introduction of the non-repetitive, unstructured sequence into the 3' end of the 3' UTR of GAPDH mRNA led to a dramatic decrease in energy transfer between fluorophores attached to mRNA ends (Fig. 5c).
This proof-of-principle experiment demonstrates that our new genetic algorithm for sequence evolution can help to design non-repetitive unstructured RNA sequences, which may be employed to study roles of RNA secondary structure in different aspects of RNA function.

DISCUSSION
Taken together, our data strongly support the hypothesis 8-10 that RNA as a polymer has an intrinsic propensity to fold into structures in which the 5' and 3' ends are just a few nm apart.
Furthermore, we show that the ends of natural human mRNA sequences, folded in the absence of protein factors, are universally close. This occurs not only because of base pairs between nucleotides in the 5' and 3' UTRs but also because stem loop formation across whole sequences tends to shorten the end-to-end distance (Fig. 1a, Supplementary Fig. 1). Our computation and smFRET studies also show that each mRNA sequence folds into a dynamic ensemble of structures with distinct but nevertheless short end-to-end distances. Hence, in the ensemble of structures, mRNA ends are brought in close proximity by a number of alternative helices formed between the 5' and 3' UTRs rather than by one specific set of base pairs between mRNA ends.
At least to some degree, the intrinsic mRNA propensity of folding into structures with short end-to-end distances is likely realized in live cells. In vivo, mRNA secondary structure may be disrupted by RNA binding proteins and RNA helicases 31,32. Furthermore, the ribosome efficiently
In the course of this work, we developed approaches and tools for measuring, computing, and manipulating mRNA end-to-end distances and the secondary structure of mRNA UTRs.
This methodology can now be utilized to study unknown roles of mRNA end-to-end distance and secondary structure in mRNA UTRs in protein synthesis and other facets of mRNA biology.

FRET values were measured between fluorophores attached to the 5' and 3' ends of the following GAPDH and β-globin mRNAs: mRNA lacking poly(A) tail (blue); mRNA lacking poly(A) tail folded in the presence of a 50-nucleotide long DNA oligonucleotide complementary to the 3' end of mRNA (green); mRNA with poly(A) tail (red). FRET could not be detected (n.d.) in GAPDH variants, which lacked poly(A) tail and contained 53 CA repeats introduced into the 5' or 3' UTR of GAPDH mRNA. Each FRET value represents the mean ± standard deviation (SD) of three independent experiments. A star indicates that FRET values are significantly different, as p-values determined by the Student t-test were below 0.05.

Sequence features of intrinsically-unstructured RNA sequences. The entire 1327-nt long GAPDH mRNA sequence was evolved in silico by a genetic algorithm to minimize average basepairing probability and produce intrinsically unstructured sequences. (a) End-to-end distance (blue), sequence complexity (red), and mean basepair probability (green) as functions of iteration number are shown for a single representative in silico sequence evolution experiment. The distance predicted for a 1327-nt long RNA in a random coil conformation is shown by the magenta line. (b) Evolution of nucleotide composition in a single representative in silico sequence transformation experiment shown in (a). Frequency of adenosine (A), cytidine (C), guanosine (G), and uridine (U) are shown in magenta, blue, red, and green, respectively. (c) Surface contour plots generated from 500 independent in silico sequence evolution experiments show changes of sequence complexity (y-axis) as a function of end-to-end distance (x-axis). The range of sequence complexity from 0 to 0.6 was separated into 2,000 bins. The range of end-to-end distance from 1.6 to 46 nm was separated into 500 bins. The resulting heat map shows the frequency count.

Cloning of mRNA-encoding sequences
Human cDNA was used to clone mRNA-encoding sequences. To prepare human cDNA, total RNA was first extracted from HeLa cells using TRIzol reagent (Invitrogen Life Technologies) according to the manufacturer's protocol. Genomic DNA was removed from the sample by DNase treatment (NEB). RT-PCR was performed to synthesize cDNA using 5 μg of RNA, SuperScript III Reverse Transcriptase (Invitrogen Life Technologies) and oligo dT 23 (Sigma), following the manufacturers' protocols. The target genes were amplified by PCR using Q5 DNA polymerase (NEB) and 5' and 3' primers listed in Table 1. Primers were designed based on general guidelines using the IDT OligoAnalyzer tool. The optimal annealing temperature for each primer pair was determined using the NEB Tm Calculator to set the PCR conditions. A 30 second annealing step 3°C above the Tm was used after the initial denaturation step of 30 seconds at 95°C. The extension temperature was set to 72°C for 1 min per kb. The PCR products were cloned into polylinker sites of the pSP64A vector

Construction of GAPDH variants
To obtain the construct containing 53 CA repeats in the 5' UTR of the GAPDH mRNA  Table 2). The latter fragment was PCR amplified using forward (5' AAGAGAGGTACCCTCACTGCT 3') and reverse (5' GGAAACAGCTATGAGAGCTC 3') primers and digested using KpnI and SacI. The GAPDH constructs containing randomized or genetic sequences [GAPDH_3' UTR shuffle; GAPDH_3' UTR Genetic] in the GAPDH 3' UTR were generated as described above by replacing the 152 bp KpnI-SacI sequence at the 3' UTR of GAPDH with the fragments indicated in Table 2.

RNA folding
To measure the end-to-end distances of mRNAs by ensemble FRET, 300 nM doubly-labeled  to a final concentration of 50 pM and immobilized on quartz slides coated with biotinylated BSA (0.2 mg/mL, Sigma) and pre-treated with NeutrAvidin (0.2 mg/mL, Thermo Scientific). Imaging buffer with an oxygen-scavenging system (0.8 mg/mL glucose oxidase and 0.02 mg/mL catalase) was injected into the slide chambers before imaging to prevent photo-bleaching.
smFRET traces were recorded using a prism-based total internal reflection fluorescence (TIRF) microscope as previously described 54

Computational procedures
Estimating end-to-end distance
To estimate end-to-end distance distributions and mean end-to-end distances for each RNA sequence, we used a two-scale freely jointed chain approximation 18 for each structure in a Boltzmann ensemble of structures. We generate 1000 structures using stochastic sampling 24 (program stochastic) in RNAstructure 23. In stochastic sampling, structures are selected at random with the probability equal to their Boltzmann probability. Because the sample is Boltzmann weighted, the mean of a quantity across the sample has the proper Boltzmann weighting. For each structure, we count the number of branches and unpaired nucleotides in the exterior loop, i.e. the loop that contains the 5' and 3' ends, and use: where D is the end-to-end distance, n is the number of unpaired nucleotides, m is the number of helical branches, a = 6.2 Å, and b = 15 Å, where a and b were from a previous parameterization 18. The mean end-to-end distance is the arithmetic mean across structures in the sample.
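The counting step and the averaging can be sketched as below. The displayed formula is not legible in this copy of the text, so the sketch assumes the usual root-mean-square form for a freely jointed chain with two segment types, D = sqrt(n·a² + m·b²); that expression, the function names, and the dot-bracket input format are assumptions for illustration and may differ from the authors' implementation. The structures themselves are taken as given (for example, converted from a stochastic sample), and pseudoknot brackets are not handled.

```python
import math

A_ANGSTROM = 6.2   # segment length per unpaired nucleotide (from the text)
B_ANGSTROM = 15.0  # segment length per helical branch (from the text)


def exterior_loop_counts(dot_bracket: str):
    """Return (unpaired nucleotides, helical branches) in the exterior loop."""
    depth = 0
    unpaired = 0
    branches = 0
    for symbol in dot_bracket:
        if symbol == "(":
            if depth == 0:
                branches += 1  # an outermost pair opens one exterior branch
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure")
        elif symbol == ".":
            if depth == 0:
                unpaired += 1  # not enclosed by any base pair
        else:
            raise ValueError(f"unsupported symbol: {symbol!r}")
    if depth != 0:
        raise ValueError("unbalanced structure")
    return unpaired, branches


def end_to_end_nm(n: int, m: int) -> float:
    """Assumed two-scale freely jointed chain estimate, converted to nm."""
    return math.sqrt(n * A_ANGSTROM ** 2 + m * B_ANGSTROM ** 2) / 10.0


def mean_end_to_end_nm(structures) -> float:
    """Arithmetic mean over a (Boltzmann-weighted) sample of structures."""
    distances = [end_to_end_nm(*exterior_loop_counts(s)) for s in structures]
    return sum(distances) / len(distances)
```

Because the sample is already Boltzmann weighted, a plain arithmetic mean over the sampled structures is sufficient, as the text notes; the same list of per-structure distances can be binned to obtain the distance histogram.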

Sequence Complexity
Sequence complexity is a measure of diversity for the nucleotide content of a sequence. In this work, we use linguistic complexity as introduced by Trifonov 56, calculated using the algorithm from Gabrielian et al 26. The complexity, C, is the product of vocabulary size across k-mers:
$C = \prod_{k=1}^{w} U_k$
where the vocabulary size, U_k, is the fraction of possible sequences observed for that k-mer, and w is the maximum k-mer size.
The number of possible sequences for a k-mer is the minimum of 4^k or N-k+1, where N is the sequence length. For example, k = 3 has a possible sequence space of 4^3 for sequences of 64 or more nucleotides, and U_3 is the fraction of these 3-mer sequences observed across the sequence. The maximum k-mer size, w, is a function of length. Here we used w = 5 for the 106 nucleotide region of GAPDH mRNA and w = 7 for the full length GAPDH mRNA, following Gabrielian et al 26.
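A compact sketch of this calculation follows, assuming a four-letter alphabet and the definition given above (product of the vocabulary usage over k = 1 to w, with the number of possible k-mers capped at N-k+1). The function names are hypothetical, and details of the published algorithm from Gabrielian et al that are not described in this section are not reproduced.

```python
def vocabulary_usage(sequence: str, k: int) -> float:
    """Fraction of possible k-mers that actually occur in the sequence."""
    windows = len(sequence) - k + 1
    if k < 1 or windows < 1:
        raise ValueError("k must be at least 1 and no longer than the sequence")
    observed = {sequence[i:i + k] for i in range(windows)}
    possible = min(4 ** k, windows)  # cannot see more k-mers than windows
    return len(observed) / possible


def linguistic_complexity(sequence: str, w: int) -> float:
    """Product of vocabulary usage over k-mer sizes 1..w."""
    complexity = 1.0
    for k in range(1, w + 1):
        complexity *= vocabulary_usage(sequence, k)
    return complexity
```

A homopolymer scores close to zero under this measure, while a sequence that uses every available k-mer at each size scores 1, which matches the bounds stated in the Results.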

Genetic Algorithm
We developed a genetic algorithm program to optimize features in a given RNA sequence.
In this work, our goal was to evolve sequences to increase the end-to-end distance of the input sequences.
The genetic algorithm is an iterative process inspired by evolution in which an initial population is evolved to optimize features represented in the objective function 57. A population of 10 sequences was used in this work, and all ten sequences were initialized to the starting sequence. In each iteration, 10 new sequences are generated either by mutating the sequences in the population or by recombining pairs of sequences (called crossover). The optimal 10 sequences (from the set of 10 at the start of the iteration and the 10 new sequences) are kept for subsequent iterations, where optimality is defined as maximizing the value of the objective function. In the mutation steps, each of the ten sequences was mutated independently. Sweeping along the portion of the sequence that is being evolved, each nucleotide has a probability of 0.03 of being mutated, with the new nucleotide drawn with equal probability from A, C, G, and U. In our algorithm, crossover occurred every 6 steps. For crossover, 5 pairs of sequences are selected at random without replacement from the population of 10 sequences. For each sequence pair, the algorithm scans through the portion of the sequence that is being evolved and each nucleotide position has a probability of 0.03 to be selected as a recombination marker; therefore, on average, the number of recombination markers is 0.03×N. Then, the pair of sequences is recombined by the exchange of homologous segments to make two new sequences. The generation of the two sequences by crossover from the sequence pair is illustrated by the schematics shown below.
We used two objective functions in this work. In calculations shown in Fig. 4, the objective function was the mean probability of each nucleotide being unpaired as determined with a partition function calculation 58 . The mean is taken across only nucleotides that are in the region of the sequence being evolved. In calculations shown in Fig. 5, we summed the mean probability of each nucleotide being unpaired and the sequence complexity.
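The procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: "crossover every 6 steps" is read as every sixth iteration, segments between successive recombination markers are swapped alternately, and `toy_objective` is a stand-in (fraction of non-G nucleotides) because the real objective requires a partition function calculation that is not reproduced here. All function and constant names are hypothetical.

```python
import random

ALPHABET = "ACGU"
MUTATION_RATE = 0.03      # per-nucleotide mutation probability (from the text)
MARKER_RATE = 0.03        # per-position recombination marker probability
POPULATION_SIZE = 10
CROSSOVER_INTERVAL = 6    # assumed: crossover on every sixth iteration


def mutate(seq, start, end, rng):
    """Mutate positions in [start, end); replacement is uniform over ACGU."""
    chars = list(seq)
    for i in range(start, end):
        if rng.random() < MUTATION_RATE:
            chars[i] = rng.choice(ALPHABET)
    return "".join(chars)


def crossover(a, b, start, end, rng):
    """Exchange homologous segments between markers placed in [start, end)."""
    child_a, child_b = list(a), list(b)
    swapping = False
    for i in range(start, end):
        if rng.random() < MARKER_RATE:
            swapping = not swapping  # each marker toggles the exchange
        if swapping:
            child_a[i], child_b[i] = b[i], a[i]
    return "".join(child_a), "".join(child_b)


def evolve(start_seq, objective, start, end, iterations, rng):
    """Keep the best 10 of parents plus offspring after every iteration."""
    population = [start_seq] * POPULATION_SIZE
    for step in range(1, iterations + 1):
        if step % CROSSOVER_INTERVAL == 0:
            order = rng.sample(range(POPULATION_SIZE), POPULATION_SIZE)
            offspring = []
            for j in range(0, POPULATION_SIZE, 2):
                pair = population[order[j]], population[order[j + 1]]
                offspring.extend(crossover(pair[0], pair[1], start, end, rng))
        else:
            offspring = [mutate(s, start, end, rng) for s in population]
        ranked = sorted(population + offspring, key=objective, reverse=True)
        population = ranked[:POPULATION_SIZE]
    return population


def toy_objective(seq):
    """Stand-in objective for illustration only: fraction of non-G positions."""
    return 1.0 - seq.count("G") / len(seq)
```

Because parents compete with their offspring in every iteration, the best objective value in the population never decreases. To mimic the second objective described above, the callable passed as `objective` could return the sum of a mean unpaired probability and a sequence complexity, each computed by whatever tools are available.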