A possible universal role for mRNA secondary structure in bacterial translation revealed using a synthetic operon

In bacteria, translation re-initiation is crucial for synthesizing proteins encoded by genes that are organized into operons. The mechanisms regulating translation re-initiation remain, however, poorly understood. We now describe the ribosome termination structure (RTS), a conserved and stable mRNA secondary structure localized immediately downstream of stop codons, and provide experimental evidence for its role in governing re-initiation efficiency in a synthetic Escherichia coli operon. We further report that RTSs are abundant, being associated with 18%–65% of genes in 128 analyzed bacterial genomes representing all phyla, and are selectively depleted when translation re-initiation is advantageous yet selectively enriched so as to insulate translation when re-initiation is deleterious. Our results support a potentially universal role for the RTS in controlling translation termination-insulation and re-initiation across bacteria.

T o initiate protein translation, a ribosome binds and assembles an initiation complex in the area of the gene start codon 1 . When monocistronic mRNA encoding a single gene is translated, spatial considerations that could interfere with ribosome binding are largely irrelevant. However, in bacteria, where a single mRNA transcript can contain several genes clustered into an operon, translation initiation must account for the space between genes. Specifically, how does translation initiation of a downstream operon gene occur without interference from the translating ribosome of the upstream gene? Despite our considerable understanding of protein translation in bacteria, this largely remains an unanswered question. Indeed, the mechanisms which control translation initiation in operons remain a matter of debate.
In bacterial operons, the intergenic distance between most of the neighboring cistrons is shorter than 25-30 nucleotides 2,3 . This distance is too small to simultaneously accommodate one ribosome terminating on the stop codon of the proximal gene and a second ribosome initiating de novo translation on the start codon of the distal gene 3 . Translation re-initiation, a scenario whereby the terminating proximal-ribosome does not dissociate from the mRNA after termination and instead re-initiates translation on the neighboring distal cistron, alleviates this problem. Presently, the mechanisms regulating translation reinitiation are not well understood [3][4][5] . Specifically, regulators that determine whether a ribosome dissociates from the mRNA or remains bound to re-initiate translation have yet to be discovered. We thus considered whether mRNA secondary structure could serve this role, given how mRNA structure can affect translation at the de novo initiation 6,7 and elongation 8,9 steps, and can also affect translational coupling between two neighboring genes on the same operon 5,10,11 .
Using Escherichia coli transformed with a synthetic operon as a model system, we discover a stable mRNA secondary structure found near the stop codon, termed the ribosome termination structure (RTS), that controls the efficiency of translation reinitiation. We further report, on the basis of large-scale computational analysis, that such structures are abundant throughout bacteria. Finally, we show that RTSs are positively selected to insulate translation when re-initiation-avoidance is beneficial, yet are depleted where re-initiation could prove useful, principally in operon-clustered genes.
Results mRNA structure drives distal gene expression in a synthetic operon. To test the relation between mRNA secondary structure and translation re-initiation, a library of operons based on the pRXG plasmid 12 was assembled (Fig. 1a). These synthetic operons comprise a proximal gene encoding red fluorescent protein (RFP) and a distal gene encoding polyhistidine-tagged green fluorescent protein (GFP), separated by a stretch of 24 random nucleotides in the inter-cistronic region, downstream of the RFP stop codon. The library was transformed into Escherichia coli MG1655 cells and sorted according to GFP expression levels into eight bins spanning three orders of magnitude (Fig. 1b), using flow cytometry (Fig. 1c). Each bin was barcoded, sequenced, and the weighted Gibbs free energy average of mRNA secondary structure (ΔG fold ) in the variable sequence region in that bin was calculated.
The first two bins (P1 and P2) exhibited GFP expression levels that were not higher than those in the negative wild-type bacteria controls ( Supplementary Fig. 1). As such, bins P1 and P2 were labeled as non-producing populations and not further analyzed. The results from the other bins (P3-P8), however, revealed significant correlation between observed GFP levels and the calculated mean ΔG fold of the~3 × 10 3 unique sequences in each bin (Spearman correlation ρ = 1, n = 6, p value = 0.0028; Fig. 1d). These results illustrate the inverse correlation between expression levels of the distal gene-encoded GFP and mRNA folding stability, such that sequences with lower stability in the variable region were significantly enriched in high GFP-producing populations, and vice versa (Supplementary Fig. 1e).
Next, individual clones from each bin were sorted and sequenced. Thirty-three clones in which the variable intercistronic sequence encoded at least one of the six most abundant start codons for translation initiation 13 also lacked additional inframe stop codons and presented a unique ΔG fold . These clones were isolated, and their GFP expression levels were quantified (Supplementary Table 1). Upon assessing the relation between ΔG fold of the variable sequence and GFP expression, clear correlation was revealed (Spearman correlation ρ = 0.78, n = 33, p value < 10 −7 ; Fig. 1e). Such correlation was independent of mRNA abundance ( Supplementary Fig. 2), expression of the upstream RFP gene ( Supplementary Fig. 3), or of the location or identity of the start codon and adjacent SD sequence in the downstream GFP gene to which the ribosome binds 14 (Supplementary Table 2). No significant effect on growth rate was observed among the clones. Rather, the character of the clone-specific intergenic sequence had a significant impact on GFP levels but not on growth ( Supplementary Fig. 4).
In a distinct subset of eight clones where variability in the start codon was further limited to only one of the three most used GFP-start codons (AUG, GUG, UUG), and variability in their position was limited to only three or four codons downstream of the RFP stop codon, the correlation was strengthened (Spearman correlation ρ = 0.98, n = 8, p value = 4 × 10 −4 ; Fig. 1f). In this subset, in which the SD sequence was identical for all clones, the GFP expression trend was confirmed at the population level using fluorescence-activated cell sorting (FACS) analysis ( Supplementary Fig. 1e). The results thus showed that distal operonic GFP gene expression is negatively affected by a stable mRNA secondary structure in the region directly downstream of the stop codon of the preceding gene ( Fig. 1g and Supplementary data file 3). This structure was termed the Ribosome Termination Structure (RTS), with the likelihood of RTS presence and its strength being defined by the magnitude of ΔG fold (Fig. 1h).
The RTS is conserved across bacterial genomes. To assess the generality of the RTS, mRNA secondary structure stability (ΔG fold ) was calculated in a region spanning 100 nucleotides on either side of each of the~4200 annotated E. coli stop codons using a 40 nucleotide-long sliding window, allowing for the calculation of the mean ΔG fold at each position in a genome-wide manner (Fig. 2a). Such analysis revealed an extreme drop in ΔG fold (reflecting stronger mRNA folding), with a global minimum of À7:94 kcal mol À1window À1 centered five nucleotides downstream of the last nucleotide of a stop codon (Fig. 2b, blue line), corresponding to the expected position and magnitude of an RTS. This demonstrates that RTS-like signals are apparent throughout the E. coli genome.
To confirm that the RTS is directly under selection and as a control for other mRNA-stability factors, the ΔG fold value of each sequence (Fig. 2b, blue line), minus the ΔG fold value of a shuffled version in which nucleotide and codon content but not their order are preserved, was calculated (Fig. 2b, green line). This was repeated for each position across all E. coli genes, providing an average selection landscape of mRNA structure (Fig. 2b, orange line). If only nucleotide or codon content was under selection, then the difference in local folding energy (ΔLFE) between native and randomized sequences should equal zero. Hence, increased ΔLFE deviation in the negative direction indicates direct selection for enhanced secondary structure stability (and vice versa). The results revealed extreme selection for stable structure directly downstream of stop codons (Fig. 2b, orange line) (Wilcoxon test, p value < 10 −30 ), irrespective of the stop codon used ( Supplementary Fig. 4). The global minimum of ΔLFE (À2:67 kcal mol À1window À1 ) represents strong selection for the RTS structure directly downstream of stop codons. The same signal was seen in an average of 128 other bacterial strains representing all phyla (Fig. 2c, blue line), including the evolutionary distant Gram-positive Bacillus subtilis (Fig. 2c,   If RTS presence is indeed under selection, correlation to the level of gene expression would be expected, with genes encoding more abundant proteins being subjected to stronger selection pressure. To test this hypothesis, E. coli genes were grouped according to protein abundance, and the ΔLFE landscape of each was determined (Fig. 2d). Clear and significant correlation between protein abundance and ΔLFE was noted (Mann-Whitney test, p value < 10 −30 ), demonstrating the RTS to be an adaptive trait, possibly controlling distal operon gene translation. This relation also holds true in B. subtilis and all 11 other bacteria for which data is available (Fig. 2e).
Lastly, RTS presence was quantified genome-wide across bacteria. This revealed that an RTS signal, defined as an mRNA structure (ΔG fold ≤ À6 kcal mol À1window À1 ) directly downstream of the stop codon that is significantly more stable than the surrounding sequences (see Methods section), is present in 18%-66% of all genes, depending on the species (Fig. 2f, Supplementary Fig. 6, Supplementary data files 5-7). Genomewide variability between species reflects a combination of selection for structural stability and the fraction of genes that are followed by an RTS.
Translation re-initiation is controlled by RTS. The precise role of the RTS was considered by examining variability in ΔLFE, distinguishing between genes followed by an RTS or not. Such analysis showed the standard deviation of ΔLFE to spike in the vicinity of the stop codon (Fig. 3a), yielding a bi-modal pattern of gene distribution only around the stop codon (Fig. 3b). The parameter best-defining the two groups of gene distribution is the inter-cistronic distance separating neighboring genes (Fig. 3b, inset). E. coli gene pairs separated by shorter distances (<25 nucleotides, N = 1537) were significantly depleted of RTSs (mean ΔLFE ¼ þ0:4 kcal mol À1window À1 , Wilcoxon test, p value = 5 × 10 −19 ); for further-separated neighboring genes (≥25 nucleotides, N = 2,581), RTSs were significantly enriched (mean ΔLFE ¼ À4:0 kcal mol À1window À1 , Wilcoxon test, p value < 10 −30 ).
When the ΔLFE landscape around the stop codon between gene pairs in each group was charted (Fig. 3c), RTS depletion was noted when the intergenic distance is short, or when the two consecutive cistrons overlap. Conversely, when the intergenic distance exceeds 25 nucleotides, an RTS is present (Mann-Whitney, p value < 10 −30 ). This trend is conserved in 128 bacterial species analyzed (Fig. 3d). Considering that 25 nucleotides are the intergenic distance below which translation re-initiation is considered to be advantageous over de novo initiation 3 , and the above-identified correlation between RTS presence and expression of the distal operonic GFP gene ( Fig. 1), the RTS can be linked to translation re-initiation. We thus propose that RTS enrichment in the ≥25 nucleotides group and depletion from the <25 nucleotides group reflects how RTS presence serves to inhibit translation re-initiation when it is not advantageous, while its absence enables this event.
Translation of the distal partner of any operon-based gene pair can be realized by de novo initiation, translation reinitiation, or stop codon read-through. Thus, discounting a link between the RTS and de novo initiation or stop codon readthrough would further support a role for the RTS in translation re-initiation. Accordingly, experiments involving the synthetic operon described above (Fig. 1a) were performed, given how expression of the distal GFP gene could result from any of the above-mentioned processes.
The link between the RTS and stop codon read-through was tested by Western blot analysis of a subgroup of clones described above (Fig. 1f) expressing the RFP-GFP synthetic operon, normalized by OD 600 , using antibodies against the GFP Cterminal polyhistidine tag. The 55 kDa RFP + GFP product resulting from stop codon read-through was barely detectable, compared to the 28 kDa GFP product resulting from de novo initiation or re-initiation (Fig. 3e). The intensities of these SDS-PAGE protein bands obtained from these clones, as well as those from other randomly selected clones, were quantified by densitometry. This confirmed that correlation between the level of the 28 kDa product and ΔG fold was maintained (Spearman correlation ρ = 0.80, n = 58, S = 6479, p value < 10 −13 ; Supplementary Fig. 7). Lastly, exact product masses were verified by mass spectrometry to reveal the initiation codon and its location (  Table 1). These findings thus discount linkage between RTS presence and stop codon read-through.
To determine whether the RTS is linked to de novo initiation or translation re-initiation, the manner of GFP translation initiation was assessed using the release factor 1 (RF1)-deficient E. coli C321.ΔprfA EXP strain 15 and Western blot analysis of random clones, as above. In the absence of RF1, the ribosome cannot efficiently terminate translation at the RFP UAG stop codon, thereby precluding translation re-initiation, which depends on such termination. Instead, GFP expression can only be driven by read-through or de novo initiation in the mutant strain. Western blot analysis detected only the read-through RFP + GFP product (Fig. 3g, Supplementary Fig. 8). This serves as evidence that de novo initiation does not drive GFP translation. Still, the apparent lack of de novo GFP translation initiation in the deletion strain could result from physical interference of the initiation site by RFP-translating ribosomes and increased readthrough. To discount this possibility, the RFP UAG stop codon in E. coli MG1655 was suppressed (see "Methods"" section) so as to mimic conditions of ribosomal occupancy that may occur in RF1-deficient cells. Under these conditions, isolated GFP was produced only in the E. coli MG1655 strain but not in RF1depleted cells (Fig. 3h).
Next, to directly test the ability of the intergenic region to guide de novo initiation of translation, the RFP gene and its ribosomebinding site were deleted from the operons in six selected clones. In the resulting monocistronic GFP construct, only the 18 terminal nucleobases of the RFP gene, the fixed and variable intergenic regions, and the GFP gene remain downstream of the Fig. 1 mRNA secondary structure (ΔG fold ) controls distal operon gene expression. a Synthetic operon design and the FACS scheme employed. b GFP and RFP fluorescence of 10 5 cells. c Sorting of 10 6 cells into color-coded bins with constant RFP and variable GFP levels (top); GFP distribution in 3000 cells from each bin after sorting (bottom). d Correlation between the population mean GFP expression levels and the weighted mean of ΔG fold of 3 × 10 3 unique sequences in each bin. The x and y axes error bars represent the 99% confidence interval and relative standard deviation, respectively. Spearman correlation was performed on the weighted averages of the six bins (n = 6, ρ = 1, p value = 0.0028). Correlation between GFP expression and ΔG fold of (e) all (n = 33) isolated variants, and (f) a subset (n = 8) presenting an AUG start codon at position +3 or +4. g ΔG fold landscape around the stop codon and the mRNA secondary structure presented in the first window outside the stop codon-occupying ribosome footprint of two selected clones (111, 207). The red dot represents the RFP stop codon. Secondary mRNA structures of all clones are available in Supplementary data file 3. h Schematic representation of the role of the RTS in distal operon gene translation (ribosomes are not drawn to scale).  lac operator (Fig. 3i). The 18 terminal nucleobases of the RFP gene were not removed to mimic the exact mRNA sequencecontext encountered by initiating ribosomes in all clones. GFP levels were then compared between the monocistronic and operonic constructs of each clone, using both Western blot analysis (Fig. 3i) and fluorescence measurements (Fig. 3j).
The results revealed that when strong RTSs are present, both constructs exhibit similarly low levels of GFP expression, with the  All bacteria (N = 128) E. coli de novo initiation re-initiation ratio of expression by the two being close to one. Conversely, in clones with weak RTSs, the operonic constructs showed significantly higher levels of GFP expression, reaching levels over five-fold higher than that of the monocistronic constructs. This observation correlates well with the ΔG fold of each pair of clones ( Fig. 3k) (Spearman correlation ρ = 0.94, S = 2, n = 6, p = 0.017). Such correlation indicates that when the RTS is less stable, the difference in GFP expression between monocistronic and operonic constructs increases, as expected according to the hypothesis that a weak RTS allows for increased translation reinitiation. These results thus demonstrate how de novo initiation is not affected by the RTS in the same manner as is translation reinitiation. Moreover, they show that the monocistronic clones recruited new ribosomes for translation initiation with very low efficiency. This low efficiency confirms that a significant part of the observed GFP expression phenotype is dependent on the presence of the upstream RFP gene and, as such, is not likely a result of de novo initiation.
Given that de novo initiation does not correlate with RTS strength, does not result in efficient expression in the monocistronic clones tested, and could not be detected when RF1 was knocked out, argue against de novo initiation as a viable mechanism to explain the dependence of operonic distal GFP expression on the RTS. As such, we conclude that translation reinitiation remains the most likely process by which the RTS controls expression of the operonic distal GFP gene.
RTS is dependent on the operonic position of a gene. Finally, to determine whether the translation re-initiation-controlling role assigned to the RTS can be generalized, "transcriptional unit" data 16 cataloging the arrangement of E. coli genes into operons were assessed (Fig. 4a).
Such analysis revealed that downstream of all operon terminal genes, where re-initiation is deleterious, the presence of an RTS after the stop codon, possibly insulating against re-initiation, is favorable. In contrast, RTSs are depleted after the stop codon of all other operonic genes, possibly encouraging re-initiation (Mann-Whitney, p value < 10 −30 ). These results were strengthened by observing that RTS presence after terminal operonic genes is independent of the presence or absence of start codons in the 50 nucleotide-long stretch downstream of the stop codon, while significant such dependence was seen for other operon genes ( Supplementary Fig. 10). The same held true in B. subtilis and four other bacterial species for which experimental operon arrangement data exists (Fig. 4a).
Gene annotations in 128 bacterial species were analyzed for RTS presence as a function of neighboring gene strand directionality. Such analysis allowed for assessing operons in genomes where operons are not annotated, based on the assumption that neighboring genes on opposite DNA strands are less likely to be on the same operon than are gene pairs on the same strand. Accordingly, pairs of neighboring genes on the same strand, where re-initiation on mRNA is possible, were compared to pairs on opposite strands, where such re-initiation would be useless as the two genes cannot be transcribed as a single mRNA (Fig. 4b). As expected, RTS presence was significantly higher within gene pairs found on opposite strands, where insulation against re-initiation could help avoid translation of the 3′ UTR.
With this understanding, the source of variability between species in terms of the strength of selection for the RTS (i.e., ΔLFE values) was explored. This was performed for each of the 128 bacterial species considered, by distinguishing between gene pairs presenting intergenic distances of less than 25 nucleotides or which are on the same strand (i.e., where an RTS is less likely), and gene pairs separated by larger intergenic distances or found on opposite strands (i.e., where an RTS is more likely).
Three genome-specific parameters were examined, namely, % GC content, the number of gene pairs on opposing strands, and the average intergenic length (Supplementary Fig. 11). Although inter-species variance in RTS selection was found to be correlated to all three parameters, it is of note that the high positive correlation between ΔLFE and genomic %GC content was only seen in gene pairs where an RTS is less likely to occur (Pearson, n = 128, r = 0.546, p value < 10 −10 ; Fig. S11). Such correlation reflects stronger selection for RTS depletion in mid-operonic genes in organisms with higher %GC content. Considering that when %GC content is high, spontaneous mRNA secondary structures are more likely to appear, we expected and indeed observed, that more substantial purifying selection is required for RTS depletion 17 .
Lastly, we explored whether RTS regions in the E. coli genome are enriched in any sequence motifs. Two uncharacterized motifs were identified but only in a small subset of genes, and as such, are unlikely to control re-initiation or account for RTS selection (Table S4). These results, together with the demonstrated lack of RTS linkage to transcription termination ( Supplementary Figs. 2  and 12), are all consistent with the RTS playing a major role in bacterial translation re-initiation.

Discussion
Translation re-initiation affords bacteria the ability to translate operon-clustered genes with minimal interference between terminating and initiating ribosomes. However, the capacity for translation re-initiation also carries risk. Uncontrolled re-initiated translation could evoke high fitness costs due to ribosomes devoting more time scanning for translation re-initiation sites or because of unintended translation re-initiation events. Indeed, as the ribosome can re-initiate in all possible frames and recognizes Fig. 3 The RTS controls translation re-initiation. a ΔLFE standard deviation landscape around stop codons. b E. coli gene density plot (Z-axis) versus ΔLFE (X-axis) and distance from a stop codon (Y-axis). Different colors are used for improved visualization. Inset shows gene density at position zero. Gene pairs separated by an intergenic distance larger or smaller than 25 nucleotides are in cyan and red, respectively. Gray represents the intersection of the two groups. The RTS profile around the stop codon depends on the inter-cistronic distance before the downstream gene in (c) E. coli and (d) 128 bacterial species. All parameters used to calculate ΔLFE are constant across all figures, and relied on a window size of 40 nucleotides. e Representative anti-His-tag Western blot (top) and the mean of n = 3 fluorescence measurements (error bars represents standard error; bottom) of eight AUG (+3/+4) clones, with ΔG fold indicated. f Mass spectrometry analysis of GFP from selected library clones, with the codon and location used for re-initiation indicated. Representative cropped Western blots of seven random E. coli clones (g) without or (h) with stop codon reassignment, each in the presence (left) or absence (right) of RF1. i Genetic constructs of operonic and monocistronic GFP. Each anti-His-tag Western blot represents a comparison, normalized to OD, between the two constructs for each of six tested clones. j The mean fluorescence measurements comparing the two constructs. Error bars represent standard deviation. Significance was determined by Welch two-sample t-tests (from left to right; df = 22.0, p = 0.4164; df = 4.5, p = 0.1091; df = 6.3, p value = 0.0854; df = 20.9, p value = 0.0397; df = 16.3, p value = 0.00061; df = 4.3, p value = 0.0067). k Spearman correlation (n = 6, ρ = 0.94, p value = 0.017), between the ratio of operonic to monocistronic GFP levels and ΔG fold of each clone. Uncropped Western blots are available (Fig. S9). Ribosomes are not drawn to scale. NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18577-4 ARTICLE NATURE COMMUNICATIONS | (2020) 11:4827 | https://doi.org/10.1038/s41467-020-18577-4 | www.nature.com/naturecommunications several start codons 13 (Fig. 3f, Supplementary Table 2) even on mRNAs with alternative or lacking SD sequences 10,11 (Supplementary Table 2), unintended translation re-initiation is of real concern. For example, if one considers the median 3′ UTR length of all E. coli genes (50 nucleotides; Supplementary Fig. S13d), the probability of an efficient start codon being present in the sequence, in any frame, is higher than 90% ( Supplementary  Fig. 13a). In agreement with this assessment, E. coli genome analysis reveals that~88% of genes are followed by an efficient start codon within the 50 nucleotides downstream of their stop codon, with an average of 2.1 start codons per gene. As such, control over translation re-initiation is likely to be essential.
Here, we identified a stable mRNA secondary structure downstream of the stop codon (termed the RTS), and experimentally showed that the RTS likely controls translation reinitiation in a synthetic operon in E. coli. We further revealed that robust signals corresponding to RTS presence are found across the E. coli genome, in agreement with recently published transcriptome-wide mRNA stability data 18,19 . We also showed the RTS to be conserved across bacterial phyla, with an RTS signal peaking around a position that correlates with the edge of the mRNA stretch that is shielded by a terminating ribosome, alluding to a possible RTS-ribosome interaction. Indeed, the functional computational analyses and experiments performed here all support the RTS as acting as a translational insulator, inhibiting translation re-initiation.
This claim, however, is based on a synthetic experimental setup. Therefore, at this time, we can only speculate that the interpreted role of the RTS in genetic regulation also holds true in natural bacterial genomes. Future validation of RTS function should entail perturbation and characterization of native RTS sequences in bacterial genomes, as well as defining RTS sequences in genomes and systematically characterized these entities in synthetic reporter operons.
Our findings, moreover, do not exclude additional RTS functions. For example, we cannot exclude that in some contexts, the RTS could serve both as a Rho-independent transcription terminator and as an inhibitor of de novo initiation that can mask 3′ UTRs from unintended translation initiation. Our experimental and computational results did, however, reveal a direct link only between the RTS and translation re-initiation; no such relationship could be detected with transcription.
Further support for the role of the RTS in translation reinitiation comes from the fact that our results do not support a connection of the RTS to de novo initiation, which could not be observed with our synthetic operon in the absence of RF1, nor correlate with RTS stability (Fig. 3). At the same time, de novo initiation model predictions also did not correlate with our results (Supplementary Table 2). In addition, the expression of upstream RFP in the random library clones was not correlated with the strength of downstream GFP expression ( Supplementary Fig. 3), yet significantly correlated with RTS strength. The latter would be unexpected were GFP translation de novo-initiated, as the distance between the RFP stop codon and the GFP start codon is too short (6-24 nucleotides) to allow these genes to simultaneously bear terminating and initiating ribosomes, respectively. Instead, these ribosomes must compete and be inter-dependent for binding. The expression of both genes, however, appears to be independent, as opposed to the dependency of GFP expression on the RTS.
Currently, two competing models explain re-initiation, namely the classic 30S binding model, where ribosomes dissociate from polycistronic mRNA upon gene translation termination only to immediately re-bind, as in de novo initiation, and translate the downstream cistron 14 . In this mode, one would expect translation of a distal cistron by both re-initiating and de novo initiating ribosomes, which would compete for the ribosome-binding domain. The second model is the recently demonstrated 70S scanning model, where the ribosome does not dissociate but instead scans the downstream mRNA for a re-initiation site 3,20 . Our results provide support for the latter model, as de novo initiation was not observed. Moreover, the observed existence of an RTS in terminal genes is more parsimonious when scanning-based re-initiation occurs. Although the molecular mechanism by which the RTS controls ribosomal re-initiation remains unknown, we can conjecture, given earlier reports, that it acts as an energy barrier for the scanning ribosome, which unlike the actively elongating ribosome, does not possess an energy source 3,20,21 .
In summary, the discovery of the ribosome termination structure, a possible translation re-initiation insulator, raises new questions on the function and evolution of operons and could lead to exploitation of this remarkably conserved structural moiety for better control over genetic design.

Methods
Strains and plasmids. The bacterial strains used in this study were E. coli K-12 MG1655 (Yale stock CGSC#6300) and C321.ΔprfA EXP 15 (Addgene #48998). For stop codon suppression by genetic code expansion, experimental strains were transformed with a pEVOL plasmid harboring the Methanosarcina mazei (Mm) orthogonal pair of Mm-PylRS/Mm-tRNA CUA (Pyl-OTS) 22,23 . The synthetic operon plasmid was adapted from the pRXG dual reporter plasmid 12 (Addgene Plasmid #113643), and the random sequence was inserted using random primer amplification followed by Gibson assembly. For this assembly, appropriate forward [TGGCTCCGCTGCTGGTTCTGGCGAATAGACTAGTNNNNNNNN NNNNN NNNNNNNNNNNAAGGGCGAGGAGCTCTTTACTG] and reverse [GGAGTC CAAGC TCAGCTAATTAAGCTTGGCTGCAGGTCGACCCGGGGTACCGA GC] primers were used. Expression of the synthetic operon was controlled by the lac operator so as not to affect bacterial fitness, given the variability of the random sequence, which is only expressed when IPTG (1 mM) is added to the growth media. To control for known stop codon context effects 24 , the first six nucleotides in this variable region (ACUAGU) were fixed. After assembly, the library was transformed into E. coli DH5α, where library complexity was measured as~10 4 by counting colony-forming units. The plasmid library was then purified using a Miniprep kit [Promega] and transformed into the E. coli MG1655 and C321 strains mentioned above. All E. coli MG1655 clones were subjected to FACS [FACSAria III, BD Biosciences]. In addition, individual clones were isolated using agar plating, and their plasmids were purified and sequenced (Supplementary Table 2). Each variable sequence that did not present an additional stop codon in the variable region was named pRXNG and given a running number name (i.e., pRXNG 60 is clone #60) and its RFP and GFP expression levels were measured. Deletion of the RFP gene for the experiments detailed in Fig. 3i, j was achieved by Gibson assembly using the following primers, forward: [ATAACAATTTCACACAGAAACAGAAG CTGGTTCTGGCGAATAGACTAG], reverse: [TTCTGTTTCTGTGTGAAATTG TTATCCG].
Fluorescence-activated cell sorting. Bacterial cells were grown overnight induced with 1 mM IPTG, washed with PBS, and sorted by FACS [FACSAria III, BD Biosciences]. The entire cell population was sorted into eight bins based on constant mRFP1 fluorescence and varying Superfolder GFP (sfGFP) fluorescence, thereby normalizing sfGFP levels to those of mRFP1. Each bin, generated using an 85-micron nozzle at minimal flow, accounted for~12.5% of the entire population. The 8 sorted bins were re-run to map sorting accuracy, which was found to be high (~90% of cells were distributed within 3 bins around any selected bin). Controls consisted of bacterial cells that did not contain the synthetic operon plasmid. Analysis was performed, and figures were created using FlowJo software version 10.6.1.
The gating strategy was as follows: The preliminary FSC-A/SSC-A gates were 630-17,000 and 60-3000, respectively, the SSC-W/SSC-H gates were 0-110,000 and 450-45,000, respectively, and the FSC-W/FSC-H gates were 12,000-62,000 and 200-4000, respectively. Cells that expressed RFP, which served as the positive and normalizing control with levels between 3500 and 15,000, were further gated. Next, the resulting population (49.7% of the total population) was gated into 8~equal groups divided and defined by GFP expression. Each group was intended to represent~12.5% of the parent population. Statistical parameters used are detailed in Supplementary Table 3.
Library construction, next-generation sequencing, and data analysis. Isolated bacteria from each bin were transferred into LB media, grown for 8 h at 37 o C, harvested, and subjected to plasmid extraction using a Miniprep kit [Promega]. Library construction for Illumina MiSeq next-generation sequencing was performed according to the Illumina metagenomic protocol 25 , with adapter and primer sequences detailed in Supplementary Table 1. In each bin, a 118 bp synthetic operon amplicon, which includes the variable region, was PCR-amplified. After two rounds of amplification, the Illumina primer sequence, unique hepta-nucleotide indexes, and adapters were added to each amplicon library. The libraries were then sequenced using the Illumina MiSeq V2 reagent (300 cycles) kit. The resulting sequencing data were processed and parsed with the DADA2 package for R 26 . All identical sequence reads in each bin were aggregated, and the 10,000 most abundant sequences of each bin were obtained (Supplementary data file 1). In the eight bins, the minimal sequence depth was 2-10 reads. From the 10,000 unique sequences of each bin, all sequences that contained an additional stop codon in the variable region were removed, and the remaining sequences were filtered to include only sequences with one of the three efficient start codons (ATG, GTG, TTG) 13 in any in-frame position of the variable region. This process resulted in N = 2580-2694 unique sequences in each bin (Supplementary data file 2). Notably, these unique sequences overlapped between bins, although their frequency in each bin varied. The weighted mean of ΔG fold and the 99% confidence interval were calculated for each bin (see computational method for calculation), and the statistical significance comparing each pair of consecutive bins was determined using a two-tail Wilcoxon rank test.
RFP and GFP expression from the dual reporter of the random library. Measurements from triplicate bacterial cultures grown in a 96-well plate [Thermo Scientific] covered with Breathe-Easy seals [Diversified Biotech] were recorded overnight using a 37 o C incubated plate reader [Tecan]. RFP (excitation: 584 nm; emission: 607 nm) and sfGFP (excitation: 488 nm; emission: 507 nm) expression levels and OD 600 were measured every 15 min. The values presented the plateau value of each clone, which was measured in at least three experimental repeats (n≥3). We reasoned a priori that normalizing fluorescence levels to OD was appropriate, as over-expression of the reporters between clones could have led to changes in total protein amounts among clones. Normalizing to OD, as a proxy for cell number per well, was more relevant for comparing GFP expression and for comparison between the Western blots and fluorescent measurement, which were also normalized to OD. Quantitative PCR. Quantitative PCR was performed according to MIQE guidelines 28 . E. coli MG1655 cells were transformed with the pRXNG clones and grown to logarithmic phase (OD 600 of 0.4-0.5), harvested, and extracted with a GeneJET RNA purification kit [Thermo Scientific] for total RNA extraction, yielding 50 μL of RNA with a concentration of~400 ng μL −1 and of high purify (A 260 /A 280 = 2.1). This step was followed by DNase (RNase free) [Thermo Scientific] digestion using the kit protocol and guidelines. RNA was immediately reverse-transcribed into cDNA with an iScript cDNA Synthesis kit [Biorad], under kit guidelines with 1 μg RNA. Real-time PCR was performed using a KAPA SYBR FAST qPCR reagent Calculation of ΔG fold for synthetic operon clones. All calculations were made using the Vienna package 29 (ViennaRNA version 2.4.9, default settings), with the extracted mRNA sequence window upon which ΔG fold calculations were made for each clone obeying the two following constraints: First, the start of the window was +9 nucleotides from the first nucleotide of the UAG stop codon. This was done to simulate mRNA secondary structure, which exists outside the ribosomal entry tunnel. Second, the window size used was experimentally determined in the limited range between 30-50 nucleotides (length of the random region of interest = 24 nucleotides). However, our analysis was not sensitive to this parameter, and the results were robust in all window sizes across the entire reasonable range. Optimal correlation between ΔG fold and GFP expression was found with a window size of 37 nucleotides. As such, this window size was used to generate the results presented.
Simulation of theoretical ΔG fold of random library clones. Each set of 10 6 random sequences was sampled from a population of uniform nucleotide distribution and filtered as follows. ΔLFE (folding bias) calculations. To estimate the tendency of short-range interactions within the mRNA strand to form stable secondary structures, i.e., Local Fold Energy (LFE), sequences were broken into 40 nucleotide-long windows, and the minimum folding energy was calculated using RNAfold from the Vienna package 29 (using default settings). To identify regions where strong or weak secondary structure may be functional, rather than reflecting a side effect of selection acting on the amino acid sequence, or nucleotide or codon composition (see Randomization, below), the influence of these factors was controlled by comparing LFE of the native sequence to that of a set of randomized sequences maintaining these factors. The difference between the LFE of the native and randomized sequences is denoted as ΔLFE or local folding bias. If only the amino acid sequence, nucleotide composition, and codon composition are under selection at a given position, one expects ΔLFE to be close to 0. Any statistically significant deviation from this value indicates that additional factors maintained under selection are needed to explain the measured native LFE value.
Since this study focused on mRNA, only those regions surrounding proteincoding genes were included; genes shorter than 40 nucleotides were excluded. Genes with a length that is not a multiple of 3, those containing an internal stop codon or where the last codon is not a stop codon were also excluded. To identify features related to translation termination, ΔLFE for all included genes from a given species was averaged at each position relative to the stop codon. All E. coli gene results are available in Supplementary data file 4, and results for the 128 species analyzed are available in Supplementary data files 5a-c. Table  parameter annotations are detailed in the Supplementary information.
Randomization. The randomized sequences were sampled from the distribution representing the null hypothesis, namely that only the amino acid sequence, nucleotide, and codon composition (see below) are under selection at a given position in the coding sequence. To produce random sequences maintaining these properties, synonymous codons within each coding sequence were randomly permutated, and the nucleotides of each UTR were randomly permutated. Regions overlapping multiple coding sequences were maintained without permutations. Codons containing one or more ambiguous nucleotides (N bases) were likewise maintained without permutation. Synonymous codons were identified according to the gene translation table for each species. Randomization of the non-coding UTR regions were randomized by permutating only the nucleotide composition.
RTS model. To estimate the number of genes within each species likely to present an RTS after its stop codon, we examined each gene in all species. The RTS was defined and deemed present if three conditions were met: 1. The gene is separated from its successor by an annotated intergenic region of 25 nucleotides or more, or the next gene is on the opposite DNA strand; 2. At least five consecutive windows opening in the range of −10 to +20 nucleotides (meaning that the windows cover the region of between the −10 to +59 nucleotides, as the window size is 40, relative to the end of the stop codon), and that the ΔLFE is negative; and 3. A threshold of ΔG fold ≤ À6 kcal mol À1window À1 must be crossed in at least one of the five or more negative ΔLFE windows. If all conditions are met, the longest consecutive stretch of windows (5 or more) would be defined as a putative RTS, and the gene will be counted as being followed by an RTS. By repeating this process for all annotated genes of a given species, the fraction of genes followed by an RTS can be calculated. All parameter values used to define an RTS in this model are preliminary, but the parameter sensitivity of the model is low, and the results are robust in large parameter space. Results for all E. coli genes are available in Supplementary data file 6, and file parameter annotations are available in the Supplementary information section.
Plotting. Distributions of multiple genes or averages for multiple species are presented using statistics commonly used for boxplots, as follows. The shaded region spans the 25th and 75th percentiles, with the median plotted as a darker line. Elements outside this region are presented according to their density (blue shading in the background). Densities are shown as kernel density estimates (KDEs), computed separately at each position, using a Gaussian kernel with a bandwidth of 0.5. Plots were created using Scikit Learn (version 1.3.2) 31 and Matplotlib (version 3.1.1) 32 . Taxonomic trees are based on NCBI taxonomy 33 and were plotted using the ETE3 toolkit (version 3.1.1) 34 .
Statistical analysis. All statistical analysis was performed under the guidelines of the tests described in-text. The minimal p value noted in the text was selected to be 10 −30 . In all cases where the precise p value calculated was smaller (i.e., more significant), the test-statistic score is given. To test whether ΔLFE values for a onesample group of genes are statistically different, as compared to a reference value (e.g., for the RTS model), the Wilcoxon signed-ranks test was used on the ΔLFE (randomized ΔG-native ΔG) values for all genes (20 randomization repetitions for each gene). To test whether ΔLFE values for two-sample groups of genes are statistically different from each other, the Mann-Whitney U test was used on the ΔLFE (randomized ΔG-native ΔG) values for all genes (with 20 randomization repetitions for each gene). As such, the test N was 20 times the number of data points of the original sample. The p values and test statistics are reported for the position of the most extreme test-statistic, whereas the surrounding regions showed consistent and significant results. Detailed statistical parameters are available in Table S3.
Code writing and computational tools throughout the computational analysis. For code writing, simulation, and analysis thereof throughout this work, the following packages were used: R package ggplot (version 3.2.1), R package data2 (version 1.14), Python (version 3.7. Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Experimentally determined operonic positions were obtained from ODB4 35 . Protein abundance data were obtained from PaxDb 36 . Experimentally determined 3′-UTR lengths were obtained from regulondb 37 . Termination type data for E. coli genes were obtained from WebGesTer 38 . Source data are provided with this paper.