Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli

Cambray, Guillaume; Guimaraes, Joao C; Arkin, Adam Paul

doi:10.1038/nbt.4238

Resource
Published: 24 September 2018

Nature Biotechnology volume 36, pages 1005–1015 (2018)

Abstract

Comparative analyses of natural and mutated sequences have been used to probe mechanisms of gene expression, but small sample sizes may produce biased outcomes. We applied an unbiased design-of-experiments approach to disentangle factors suspected to affect translation efficiency in E. coli. We precisely designed 244,000 DNA sequences implementing 56 replicates of a full factorial design to evaluate nucleotide, secondary structure, codon and amino acid properties in combination. For each sequence, we measured reporter transcript abundance and decay, polysome profiles, protein production and growth rates. Associations between designed sequence properties and these consequent phenotypes were dominated by secondary structures and their interactions within transcripts. We confirmed that transcript structure generally limits translation initiation and demonstrated its physiological cost using an epigenetic assay. Codon composition has a sizable impact on translatability, but only in comparatively rare elongation-limited transcripts. We propose a set of design principles to improve translation efficiency that would benefit from more accurate prediction of secondary structures in vivo.

**Figure 1: High-throughput design of experiments.**

**Figure 2: Protein production under normal and facilitated conditions of translation initiation.**

**Figure 3: Dynamic structure interactions hinder functional predictions.**

**Figure 4: Unexpected growth defects associated with reduced translation initiation.**

**Figure 5: Pathological accumulation of stable transcripts inhibits initiation rate.**

**Figure 6: Impact of archetypal sequences on translation efficiency and physiological cost.**

Accession codes

Primary accessions

Sequence Read Archive

SRP086076

Referenced accessions

NCBI Reference Sequence

U00096.2

Acknowledgements

We thank V. Mutalik, C. Liu, L. Jacob, M. Price, A. Deutschbauer, M. Samoilov, P. Shah, J. Plotkin, J. Savitskaya and L. Ciandrini for discussions. We are grateful to the Agilent Laboratories and the Synthetic Biology Institute (SBI) for providing the OLS array. We thank J. Sampson, P. Anderson and S. Laderman from Agilent Laboratories for discussing OLS setup and processing. G.C. was funded by the Human Frontier Science Program (LT000873/2011-l), J.C.G. by the Portuguese Fundação para a Ciência e Tecnologia (SFRH/BD/47819/2008). We acknowledge financial support by the Synthetic Biology Engineering Research Center (SynBERC under National Science Foundation grant 04-570/0540879). This work used the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley (NIH S10 Instrumentation Grants S10RR029668 and S10RR027303).

Author information

Authors and Affiliations

California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, California, USA
Guillaume Cambray & Joao C Guimaraes
DGIMI, Univ. Montpellier, INRA, Montpellier, France
Guillaume Cambray
Department of Bioengineering, University of California, Berkeley, Berkeley, California, USA
Joao C Guimaraes & Adam Paul Arkin
Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
Adam Paul Arkin

Authors

Guillaume Cambray
Joao C Guimaraes
Adam Paul Arkin

Contributions

G.C. and A.P.A. conceived the work; G.C. and J.C.G. designed sequences; G.C. performed experiments and processed data; G.C. and A.P.A. analyzed the data and J.C.G. contributed post hoc secondary structure analyses; G.C. and A.P.A. wrote the manuscript.

Corresponding authors

Correspondence to Guillaume Cambray or Adam Paul Arkin.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Reanalysis of natural sequences to define properties of interest

All analyses used protein abundance data from Taniguchi et al.¹ (n=575 genes) and the genome sequence of E. coli MG1655 (GI:48994873; see Supplementary Code 25; Supplementary Data 1).

(A) Nucleotide composition biases in coding sequences are related to protein expression. Plots show Pearson correlation coefficients between various nucleotide contents and protein abundances for windows of varying sizes and positions, as shown. Colors correspond to different nucleotide combinations (see bottom right legend). Grey background shadings separate subpanels that correspond to increasing starting positions of the windows (numbering below bottom panel). Within subpanels, consecutive points correspond to an increase of the window size by one codon from a fixed starting position. Within each window, the three within-codon positions have been analyzed separately, as indicated. Given the redundancy of the genetic code, the third codon position is less constrained and should provide a less biased indication of nucleotide influences on protein production. These data highlight the contribution of AT content (see panels B and C), as previously noted by Allert et al.². The strongest correlations are found at the second codon position for %A, %T and %C, but not %G. According to Sjöström and Wold³, this particular pattern strongly suggests a contribution of the hydropathic properties of the corresponding amino acids (see panels D and E).
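
As an illustration of the scan described in panel A, the following minimal Python sketch computes the content of chosen bases at one within-codon position over a codon window, then correlates it with protein abundance. The function names, the toy sequences and the abundance values are hypothetical and are not taken from the study; the authors' own analysis is provided as Supplementary Code 25.

```python
from math import sqrt

def position_content(seq, start_codon, n_codons, codon_pos, bases="AT"):
    """Fraction of codons in a window whose base at codon_pos (0, 1 or 2)
    belongs to `bases`. start_codon is 0-based (codon 0 is the start codon)."""
    codons = [seq[3 * i:3 * i + 3] for i in range(start_codon, start_codon + n_codons)]
    return sum(1 for c in codons if c[codon_pos] in bases) / n_codons

def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy data: three short coding sequences and made-up abundances.
genes = ["ATGAAATTTGCA", "ATGGCCGCGGGC", "ATGAAAGCGTTA"]
abundance = [10.0, 2.0, 6.0]

# %AT at the third codon position, window of 3 codons starting after ATG.
at3 = [position_content(g, start_codon=1, n_codons=3, codon_pos=2) for g in genes]
r = pearson(at3, abundance)
print(at3, round(r, 3))
```

Repeating the last two lines over a grid of `start_codon` and `n_codons` values, and over each `codon_pos`, would give one correlation coefficient per window, which is the kind of quantity plotted in the panel.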

(B) Scatter plot of protein abundances against the AT content in the window +4 to +21 used for further design (%AT). Although sizable when only the third codon position is considered (see A), Pearson's correlations with protein abundances are relatively weak when all three codon positions are considered in the calculation of AT content.

(C) Distribution of %AT binned by categories of protein abundances, as shown. No striking pattern differentiates the distributions. A single threshold—corresponding to the average %AT over all natural coding sequences in the reference E. coli genome—was chosen for the discretization of this property into 2 ordinal levels (white line).

(D) Hydropathy is correlated with protein expression. The red line shows the average hydropathy index over a sliding window of 11 amino acids (Supplementary Data 2). The blue line shows corresponding correlations with protein abundances. Positions correspond to amino acids. The grey vertical line marks the window chosen for design of the MHI property.
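
The sliding-window average in panel D can be sketched as follows. This is only an illustration: the legend does not name the hydropathy scale, so the Kyte-Doolittle scale is assumed here, and the function name and example peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the legend does not specify one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def sliding_hydropathy(protein, window=11):
    """Mean hydropathy over every window of `window` consecutive residues.

    Returns one value per window start, so len(protein) - window + 1 values.
    """
    scores = [KD[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

# Hypothetical 16-residue peptide: yields 6 windows of 11 residues.
profile = sliding_hydropathy("MKTAYIAKQRQISFVK")
print([round(v, 2) for v in profile])
```

Averaging such profiles across genes, position by position, would give a curve comparable to the red line in the panel.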

(E) Distribution of MHI binned by categories of protein abundances, as shown. The low protein bin has a clear bimodal distribution. Two thresholds (corresponding to the 15th and 75th percentiles of MHI over all natural coding sequences in the reference E. coli genome) were chosen for the discretization of this property into 3 ordinal levels (white lines).

(F) Scatter plot of protein abundance against CAI of whole coding sequences. The regression line is shown in red (Pearson's correlation r=0.54). Grey background shadings mark the 20th and 80th percentiles of protein abundances used for categorization in the distributions (see G).

(G) Distribution of CAI binned by categories of protein abundances, as shown. Two thresholds (corresponding to the 20th and 80th percentiles of CAI over all natural coding sequences in the reference E. coli genome) were chosen for the discretization of this property into 3 ordinal levels (white lines).

(H) Distribution of codon ramp properties binned by categories of protein abundances, as shown. Plotted are absolute bottleneck positions (Btl_P, left) and bottleneck relative strengths (Btl_S, middle) for all natural coding sequences in the reference E. coli genome. The distribution of Btl_S for sequences with Btl_P downstream of codon 33 (the design threshold dictated by construction constraints; see panels I and J) is shown on the right. This latter plot guided the definition of a nested threshold for Btl_S, corresponding to the 70th percentile for this property (white line).

(I) Engineering codon ramp bottlenecks in the sfGFP reporter. The profile of relative bottleneck strength for the original sfGFP reporter is shown in grey (20-codon sliding window; Supplementary Data 3). To engineer conditions wherein a variable sequence of 96 nts fused to the reporter could influence bottleneck properties, a total of 22 codons clustered in 3 different regions of the reporter sequence were mutated. The resulting profile features a strong C-terminal bottleneck at position 232 and a moderate bottleneck at the beginning of the reporter (bold green line). The strength of the latter can be modulated by the nature of the upstream designed sequence (see J).

(J) Possible bottlenecks in the engineered reporter. Shown is a scatter plot of bottleneck positions and strengths realized for a million random sequences of 32 codons fused to the engineered reporter (Supplementary Data 4). Bottleneck positions are located within the first 33 codons or at position 232, as intended. The nested threshold for Btl_S (red line) is not exceeded by C-terminal bottlenecks.

(K) Smooth variations in secondary structure strength around the start codon of natural coding sequences. Shown are boxplots of predicted minimum free energy for a window of 60 nts slid by steps of 5 nts around the start codon. Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges and middle lines mark medians. Points outside of the whiskers are not plotted for clarity. Colored boxes highlight the windows chosen for design. Background shadings mark the 10th, 25th, 50th, 75th and 90th structure percentiles for randomly generated sequences. While structures in 5'UTRs tend to be less stable than expected by chance, structures within genes tend to be more stable.

(L) Distribution of predicted minimum free energies of structures binned by categories of protein abundances, as shown. Two thresholds (corresponding to the 25th and 75th percentiles of the properties over all natural coding sequences in the reference E. coli genome) were chosen for the discretization of these properties into 3 ordinal levels (white lines).
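
The percentile-based discretization used in panel L (and, with other cut points, in panels C, E and G) can be sketched as below. This is an illustrative sketch only: the linear-interpolation percentile, the helper names and the reference values are assumptions, not the authors' implementation.

```python
from bisect import bisect_right

def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..100) of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def make_levels(reference, cuts=(25, 75)):
    """Derive thresholds from a reference distribution and return them with a
    function mapping any score to an ordinal level (0, 1, ..., len(cuts))."""
    ref = sorted(reference)
    thresholds = [percentile(ref, q) for q in cuts]

    def level(score):
        # Number of thresholds at or below the score gives the ordinal level.
        return bisect_right(thresholds, score)

    return thresholds, level

# Hypothetical reference distribution of predicted free energies (kcal/mol).
reference_mfe = [-20.0, -15.5, -12.0, -10.5, -9.0, -7.5, -6.0, -4.5, -3.0]
thresholds, level = make_levels(reference_mfe)
# Level 0 holds the most negative (most stable) structures in this convention.
print(thresholds, [level(x) for x in (-18.0, -9.0, -2.0)])
```

Calling `make_levels(values, cuts=(20, 80))` would reproduce the cut points quoted for CAI in panel G, and a single cut point yields the 2-level split described for %AT in panel C.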

1. Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538 (2010).

2. Allert, M., Cox, J. C. & Hellinga, H. W. Multifactorial determinants of protein expression in prokaryotic open reading frames. J. Mol. Biol. 402, 905–918 (2010).

3. Sjöström, M. & Wold, S. A multivariate study of the relationship between the genetic code and the physical-chemical properties of amino acids. J. Mol. Evol. 22, 272–277 (1985).

Supplementary Figure 2 Advantages of carefully designed over natural or random sequences

(A) Uneven property distributions in natural and random sequences. The black profile shows the ranked distribution of property combinations obtained by generating 244,000 sequences at random (Supplementary Data 6). Grey bars show the corresponding distribution in natural E. coli genes. Both distributions are highly skewed compared to our systematic design (blue line).

(B) Occurrences in random and natural sequences are correlated (n=244,000 and 2,580 sequences, respectively; Pearson's correlation r=0.52). Properties of natural sequences are partly shaped by inherent constraints that make certain combinations hard to obtain (e.g. high %AT content and strong structure). As a result, natural processes have likely evolved to avoid requiring combinations of incompatible properties.

(C) Focal sampling of sequence space by replicate series. Shown are the mean pairwise sequence identities within factorial series (error bars show standard deviation across series) and between factorial series (error bars show standard deviation of mean identities between pairs of factorial series) at the nucleotide and amino acid levels (n=56 series and 1,540 pairs of series, respectively). Red lines mark random expectations. The 56 full-factorial series were constructed to maximize within-series while minimizing between-series identities (Supplementary Code 26; Supplementary Data 7).
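
The within-series and between-series identities of panel C can be illustrated with the short sketch below. The function names and toy sequences are hypothetical; the authors' actual procedure is in Supplementary Code 26. The last line checks that 56 series give 1,540 unordered pairs of series, as quoted in the legend.

```python
from itertools import combinations
from math import comb

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mean_within(series):
    """Mean pairwise identity over all unordered pairs inside one series."""
    pairs = list(combinations(series, 2))
    return sum(identity(a, b) for a, b in pairs) / len(pairs)

def mean_between(series_a, series_b):
    """Mean identity over all cross pairs between two series."""
    total = sum(identity(a, b) for a in series_a for b in series_b)
    return total / (len(series_a) * len(series_b))

# Hypothetical toy series (real designed sequences are 96 nts long).
s1 = ["ATGAAA", "ATGAAG", "ATGAAC"]
s2 = ["GCCTTT", "GCCTTC"]

print(round(mean_within(s1), 3), round(mean_between(s1, s2), 3))
print(comb(56, 2))  # number of unordered pairs among 56 series
```

The same functions apply unchanged to amino acid sequences, since they only compare characters position by position.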

(D) Distributions of designed property scores. Designed scores (black) are representative of wild-type E. coli distributions (red lines). Background shadings mark the separation between ordinal levels used for design (see Figure 1B; Supplementary Data 5). Continuous scores cluster at level boundaries because extreme levels are usually populated by mutations from medium-level sequences during the design process. Btl_S nested within C-terminal Btl_P are shown in dark grey.

(E) Correlations between property scores. Pairwise Pearson correlation coefficients between design scores in the whole library (blue dots; n=244,000 sequences) or within each factorial series (grey dots and boxplots; n=4,374 sequences) are considerably lower than those observed in the natural genome (red dots, n=1,540 genes). Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges and middle lines mark medians.

Supplementary Figure 3 Sequence logos of replicate factorial series

The 96 positions of the designed sequences are shown as a sequence logo for each of the 56 independent factorial series. Series identification numbers are shown on top. At each position, bases are arranged by decreasing frequencies from top to bottom, with sizes proportional to their frequency. Histograms show the distribution of pairwise differences between sequences in the series at nucleotide (red) and amino acid (blue) level (Supplementary Data 8). As intended by design, the consensus sequence is distinctly different for each series. Variations are well distributed all over the sequence, with some positions more variable than others. Contrasting with nucleotide differences, the distribution of pairwise amino acid differences is often multimodal. This behavior stems from the initial enforcement and eventual relaxation of constraints to favor synonymous mutations during the design process. A sizeable number of sequence variants within each series are synonymous.

Supplementary Figure 4 Library coverage by high-throughput sequencing

(A) Distribution of read counts per strain aggregated over all sequencing libraries (n=745,595,539 reads). The bulk of the library (90%) produced between 10³ and 10⁴ reads per strain (Supplementary Data 9). One construct was never observed by sequencing, and 134 others produced fewer than a hundred reads each. In contrast, some constructs are highly enriched (26 produced more than 10⁵ reads).

(B) Library multiplexing map. Multiple libraries were pooled on the same sequencing lane and demultiplexed using barcodes. In all, 166 libraries were loaded on 9 lanes of Illumina flowcells and run on a HiSeq 2500. Asterisks denote enumerations of libraries with names derived from the same root (Supplementary Data 10).

(C) Distribution of read counts per strain for each library. Library name, total read counts after demultiplexing and mapping, as well as fit parameters for a negative binomial density (shown in red) are shown. For clarity, axis names and labels are drawn once at the bottom right. Backgrounds are color coded according to read number (see thermometer on bottom right). Bars exceeding the range of the graph are colored in dark gray. The most informative library for determining the native composition of the unscreened library is FIT-SEQ_NoCpl_Gen0_round1 (right column, fifth row), in which 242,516 strains (99.4% of the library) are covered by >10 reads.
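
As a rough illustration of the summaries in panel C, the sketch below estimates negative binomial parameters from per-strain read counts and computes the fraction of strains covered by more than 10 reads. The legend does not state how the fits were obtained, so the method-of-moments estimator used here is an assumption, and the function names and toy counts are hypothetical. The parameters follow the (size, prob) convention, in which mean = size(1 - prob)/prob.

```python
def nb_moments(counts):
    """Method-of-moments estimate of negative binomial (size, prob).

    Assumes overdispersed counts (sample variance greater than the mean).
    """
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        raise ValueError("counts are not overdispersed; moment fit undefined")
    size = mean ** 2 / (var - mean)
    prob = mean / var
    return size, prob

def coverage(counts, min_reads=10):
    """Fraction of strains with strictly more than min_reads reads."""
    return sum(c > min_reads for c in counts) / len(counts)

# Hypothetical toy read counts for eight strains.
counts = [0, 5, 12, 30, 45, 80, 150, 400]
size, prob = nb_moments(counts)
print(round(size, 3), round(prob, 4), coverage(counts))

# Coverage quoted in the legend: 242,516 of 244,000 strains.
print(round(100 * 242516 / 244000, 1))
```

A maximum-likelihood fit would generally be preferable for real data; the moment estimator is shown only because it needs no external library.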

Supplementary Figure 5 Inducible translational coupling device permits tunable control of translation initiation

(A) Influence of amber codon number and position on the induction of translational coupling. Population average fluorescence signals were measured by flow cytometry at mid-exponential growth under increasing dilution of unnatural amino acid (AcF; Supplementary Data 11). The position and number of amber stop codons were varied in a development version of the reporter system showing poor translation in the absence of coupling. Points and shaded backgrounds show the means and standard deviations from 3 biological replicates (color as shown). The construct pGC4470, which bears a single amber codon at the fifth codon of the leader sequence, provides greater induction though slightly lower repression (green line). Since ribosomes terminating at this position show minimal interference with STR_−30:+30 (Figure 3A), this version of the device was retained for the final reporter.

(B) Inducible translation coupling enables quantitative control of translation rate. Distribution of cellular fluorescence measured by flow cytometry under increasing dilution of AcF (color as shown) for construct pGC4470 (green line in panel A).

(C) The unnatural suppressor system recapitulates the effect of sense and stop codons. The amber stop codon (TAG) was replaced by the ochre stop (TAA) and other sense point-mutants (AAG, TAC and TTG) in the context of 10 reporter variants differing in sequence over the first 10 codons after the start codon. The variants exhibit different expression patterns and are shown in order of increasing expression ratio (full over no induction). In the absence of AcF, amber behaves comparably to ochre, demonstrating little leakage and efficient termination. Expression levels attained under induction by 2.5 nM AcF are almost as high as those obtained with sense codons, demonstrating the high read-through efficiency of the system (Supplementary Data 12).

(D) The early amber codon in the leader does not trigger global translation shutdown. Shown are growth curves for constructs yielding comparably low (4822) or high (4787) protein expression across variants of the amber stop codon (Supplementary Data 13). Comparable growth rates across these strains show that the 5-amino-acid minigene produced from the leader sequence under normal conditions of reporter initiation does not adversely affect cellular growth or global expression¹. Because the minigene and its immediate context are invariant across the library, we expect these observations to hold true for all strains in this study.

1. Dinçbas, V., Heurgué-Hamard, V. et al. Shutdown in protein synthesis due to the expression of mini-genes in bacteria. J. Mol. Biol. (1999). doi:10.1006/jmbi.1999.3028

Supplementary Figure 6 Measurements of protein production under normal conditions of initiation and relationship to design factors

(A) High-throughput measurements of protein production are comparable to individually measured performances. Shown is a scatter plot of flow-cytometry data measured on individual cultures versus FACS-Seq data under conditions of normal initiation. Points mark the mean of at least 3 biological replicates and grey arrows their standard deviation (n=310 strains). The red line is a linear regression fit, excluding outliers (grey data points). We find excellent agreement between the two types of data (Pearson's correlation r=0.95). The compression on the low end reflects weaker sensitivity of the benchtop flow cytometer used for individual measurements as compared to the more sophisticated FACS machines used for the high-throughput experiments. Most outliers show large standard deviation and probably correspond to the rise of mutations outside of the sequenced region in either assay.

(B) High-throughput measurements of protein production are highly reproducible. Shown are pairwise scatterplots of processed protein production for 4 biological replicates (Supplementary Data 15). Points are first plotted in solid grey to render isolated outliers and then as transparent black to provide a sense of data density (sample size as shown). Cells from replicates 1-2 and 3-4 were sorted with different FACS machines (see Online Methods). Replicates 1-3 and replicates 2-4 were pooled for sequencing. The reproducibility of the measurements is excellent (r=0.99 on average; individual correlation coefficients as shown).

(C) Sizeable design error in the molecular Design of Experiment. Shown are the cumulative distributions of the coefficients of variation in P_NI amongst experimental replicates (red, experimental error) and the 3 close design replicates within each series (sequences with identical factorial properties and 1-4 nts differences; blue, design error). The design error is distinctly larger than the experimental error, testifying to the inability of the factorial categorization to fully capture functional variations between highly related sequences.
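
The comparison in panel C rests on two simple quantities: a coefficient of variation per group of replicates and the cumulative distribution of those coefficients. A minimal sketch follows, assuming the sample standard deviation is used; the function names and the measurements are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def empirical_cdf(values):
    """Return (value, cumulative fraction) pairs in increasing order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


# Hypothetical P_NI values: 4 experimental replicates of one strain per row,
# and 3 close design replicates (different strains) per row.
experimental = [[10.0, 11.0, 9.5, 10.5], [50.0, 52.0, 49.0, 51.0]]
design = [[10.0, 14.0, 7.0], [50.0, 65.0, 38.0]]

cv_experimental = [coefficient_of_variation(g) for g in experimental]
cv_design = [coefficient_of_variation(g) for g in design]

print(empirical_cdf(cv_experimental))
print(empirical_cdf(cv_design))
```

In this toy example the design groups have the larger coefficients of variation, so their cumulative curve would sit to the right of the experimental one.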

(D) Series-wise decomposition of explainable variance by linear regression. Top: same plot as Figure 2B but derived from multiple linear regressions (MLR) on continuous property scores, as opposed to ANOVA on discretized score levels (Supplementary Code 8; Supplementary Data 17; n∈[4,429; 4,372] strains for each series, except n=3,418 for incomplete series #136). Series order and color scheme are maintained for comparison. Bottom: MLR and ANOVA yield comparable results. Shown are scatter plots of series-wise explanatory powers obtained through MLRs versus ANOVAs (n=56 series). Left: total explanatory powers; Right: effect sizes for each design property and their second-order interactions (log scale; n=35 properties). Largest contributions are highly correlated, although MLRs consistently provide slightly better results than ANOVAs.

(E) Recursive regression tree resolves hierarchical dependencies between properties. At each node, data are split according to the rule shown in the colored box, heuristically chosen to maximize the explained variance (Supplementary Code 9; Supplementary Data 18; n=242,269 strains). Box colors mark properties concerned with the rule at a given node, following the color code in panel D and Figure 2B. R² values are shown below boxes and summarized in the upper-right pie. Average protein productions within each branch are shown above boxes and color-coded according to the upper-left thermometer.

(F) Design factors better describe mutational series characterized by higher phenotypic diversity. The series-wise mean (left scatter plot) and variance (middle scatter plot) in P_NI are plotted against the explanatory power (R²) achieved by all design factors and their second-order interactions in ANOVA (Supplementary Code 7; Supplementary Data 16; n=56 series). Red lines show linear regression fits (coefficients as shown). Higher mean P_NI is associated with lower design factor contributions to the observed variance. In contrast, higher variance is associated with higher explanatory power of the design factors. Mean and variance in P_NI are moderately correlated (right scatter plot). Series not well explained by design factors fail to implement the intended phenotypic variability. In particular, too high mean P_NI is likely symptomatic of failure to design functionally relevant secondary structure in the initiation region.
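
To make the splitting rule in panel E concrete, the sketch below searches a single property for the threshold that maximizes the fraction of variance explained by a two-branch split. It is a generic CART-style step written for illustration; it is not taken from Supplementary Code 9, and the function name and data are hypothetical.

```python
def best_split(x, y):
    """Return (threshold, r2) for the binary split on x that best explains y.

    r2 is 1 - (within-branch sum of squares) / (total sum of squares).
    Returns (None, 0.0) if no split improves on the grand mean.
    """
    n = len(y)
    grand_mean = sum(y) / n
    total_ss = sum((v - grand_mean) ** 2 for v in y)
    if total_ss == 0:
        return None, 0.0
    pairs = sorted(zip(x, y))
    best_threshold, best_r2 = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        within_ss = (sum((v - mean_left) ** 2 for v in left)
                     + sum((v - mean_right) ** 2 for v in right))
        r2 = 1.0 - within_ss / total_ss
        if r2 > best_r2:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_r2 = r2
    return best_threshold, best_r2


# Hypothetical structure scores (x) and protein production (y)
structure = [-20, -18, -15, -5, -3, -1]
production = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
threshold, r2 = best_split(structure, production)
print(threshold, round(r2, 3))
```

A full tree would repeat this search over every property at each node and recurse into the two branches until a stopping criterion is met.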

(G) Enrichment of codon-adapted sequences amongst highest protein producers. Left: Scatter plot of CAI versus P_NI, with data points colored by STR_-30:+30, as shown. Dark lines represent quartiles of CAI for every percentile of P_NI. Grey lines show the same quantities calculated over the whole library. Blue and red lines show linear regressions using data below and above the top P_NI pentile, respectively (coefficients as shown). Right: Scatter plot of P_NI against CAI colored by STR_-30:+30, for the highest pentile of P_NI (red regression on left panel). The transparent dark line is a linear regression (coefficient as shown). Grey lines mark the quartiles of P_NI for every percentile of CAI. Number of strains as shown.
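
Several panels summarize one variable by quantile bins of another, as with the quartiles of CAI for every percentile of P_NI above. A minimal sketch of that binning is given below, assuming equal-sized rank bins and the "inclusive" quantile method of the Python standard library; both choices, the function name and the data are illustrative assumptions.

```python
from statistics import quantiles


def quartiles_by_bin(x, y, n_bins=100):
    """Rank observations by x, cut them into n_bins equal-sized groups,
    and return (bin index, Q1, median, Q3) of y for each group."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    n = len(order)
    result = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        ys = [y[i] for i in idx]
        if len(ys) < 2:
            continue  # statistics.quantiles needs at least two points
        q1, q2, q3 = quantiles(ys, n=4, method="inclusive")
        result.append((b, q1, q2, q3))
    return result


# Hypothetical data: 20 strains, binned into 4 groups instead of 100
p_ni = list(range(20))
cai = [2 * v for v in p_ni]
summary = quartiles_by_bin(p_ni, cai, n_bins=4)
for row in summary:
    print(row)
```

With n_bins=100 the same function yields one set of quartiles per percentile, which is what the dark lines in the scatter plot trace.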

Supplementary Figure 7 Measurements of protein production under conditions of facilitated initiation and relationship to design factors

(A) Bulk high-throughput measurements of protein production are comparable to individually measured performances. Shown is a scatter plot of individual flow-cytometry data versus FACS-Seq data under conditions of facilitated initiation. Points mark the mean of at least 3 biological replicates and grey arrows their standard deviation (n=310 strains). The red line is a linear regression fit, excluding outliers (grey data points). We find good agreement between the two types of data (r=0.90).

(B) High-throughput measurements of protein production are reproducible. Shown are pairwise scatterplots of processed protein production for 4 biological replicates (Supplementary Data 15). Points are first plotted in solid grey to render isolated outliers and then as transparent black to provide a sense of data density (sample size as shown). Cells from replicates 1-2 and 3-4 were sorted with different FACS machines (see Online Methods). Replicates 1-3 and replicates 2-4 were pooled for sequencing. The reproducibility of the measurements is generally good, although the first replicate shows a somewhat inconsistent signal (r=0.87 on average; r=0.91, excluding replicate #1; individual correlation coefficients as shown). We retained that replicate for the calculation of P_FI because it nonetheless provided valuable information.

(C) Lower design error under facilitated initiation. Shown are the cumulative distributions of the coefficients of variation in P_FI amongst experimental replicates (red) and the 3 close design replicates within each series (sequences with identical factorial properties and 1-4 nts differences; blue). Unlike the situation under normal initiation (Supplementary Fig. 6C), the design error is hardly distinguishable from the experimental error under coupling. At least in part, this behavior arises from the combination of lesser experimental reproducibility and lower variance in measured fluorescence across the library. Facilitating initiation may also directly mitigate the impact of the original factors underlying the Design Error (e.g. misprediction of secondary structures).

(D) Series-wise decomposition of explainable variance. Top: same plot as Figure 2B but derived from multiple linear regressions (MLR) on continuous property scores, as opposed to ANOVA on score levels (Supplementary Code 11; Supplementary Data 20; n=238,458 for the whole dataset and n∈[3,093; 4,368] strains for each series). Series order and color scheme are maintained for comparison. Bottom: MLR and ANOVA yield comparable results under facilitated condition of initiation. Shown are scatter plots of series-wise explanatory powers obtained through MLRs versus ANOVAs. Left: total explanatory powers (n=56 series); Right: effect sizes for each design property and their second-order interactions (log scale; n=35 properties). The largest contributions are highly correlated, although MLRs consistently provide slightly better results than ANOVAs.

(E) Recursive regression tree resolves hierarchical dependencies between properties. At each node, data are split according to the rule shown in the colored box, heuristically chosen to maximize the explained variance (Supplementary Code 12; Supplementary Data 21; n=238,458 strains). Box colors mark properties concerned with the rule at a given node, following the color code in panel D and Figure 2B. R² values are shown below boxes and summarized in the upper-right pie. Average protein productions within each branch are shown above boxes and color-coded according to the upper-left thermometer. Unlike for P_NI, CAI shows sizable contributions to larger P_FI.

(F) Codon usage modulates protein production and is subordinate to non-limiting translation initiation. Left: Scatter plot of CAI versus P_FI colored by STR_+01:+60, as shown. The median CAI for each percentile of P_FI is plotted in yellow. Scaled equivalents for the medians of STR_-30:+30 (red), STR_+01:+60 (blue) and STR_+31:+90 (purple) are shown for comparison. Past a production threshold (dashed vertical line), increasingly faster elongation rates corresponding to higher P_FI are only permitted in strains with commensurate improvement in CAI. Below the threshold, P_FI remains fully limited by initiation, as determined by strong STR_+01:+60 that are not well unfolded by the coupling mechanism. Right: Scatter plot of P_FI against CAI colored by P_NI, excluding the lowest decile of P_FI (red regression on left panel). The transparent dark line is a linear regression (Pearson's coefficient as shown). Grey lines mark the quartiles of P_NI for every percentile of CAI. Number of strains as shown.

Supplementary Figure 8 Impact of other codon and amino acid metrics on protein production under conditions of normal and facilitated translation initiation

(A) Manipulation of the codon ramp does not impact protein production. Left: Distribution of P_NI (top; n=242,269 strains) and P_FI (bottom; n=238,458 strains) according to the predicted position (Btl_P) and strength (Btl_S) of the translation bottleneck. Boxplots over light grey background show distributions of production by amino-acid position along the designed sequence. At each position, blue and red boxes show lower and higher levels of Btl_S strains, respectively. Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges and middle lines mark medians. Outliers beyond whiskers are not plotted. Box widths are quadratically related to the sample size. No systematic trend is apparent across N-terminal positions. The two colored boxplots on medium grey background show pooled data across all N-terminal positions, broken down by Btl_S level. The two grey boxplots show distributions binned by Btl_P level. We observe no differences in protein production between these groups. Right: Scatter plot of Btl_S versus P_NI (top) and P_FI (bottom) for N-terminal (black) and C-terminal (red) levels of Btl_P. Grey and dark red lines show corresponding quartiles for each percentile of protein production. Unlike CAI, the strength of the codon ramp does not correlate with variation in the translation regime.

(B) The codon ramp does not explain the relationship between protein productions under normal and facilitated initiation. Far left: Scatter plot between P_NI and P_FI colored by the ramp bottleneck position, as shown. Middle left: correlations between P_FI and Btl_P for each percentile of P_NI. Middle right: Scatter plot between P_NI and P_FI colored by the ramp bottleneck strength, as shown (n=237,644 strains for all panels). Since designed Btl_S values are nested into the low Btl_P level, only constructs with an N-terminal bottleneck are shown. Far right: correlation between P_FI and Btl_S for each percentile of P_FI. These plots indicate no systematic associations between bottleneck strength and protein production.

(C) Optimization of codon indices to explain the relationship between protein production under normal and facilitated initiation. Barplots show the semi-partial correlation between P_FI and various codon metrics controlling for the effect of P_NI on P_FI for all sequences in the library, using parametric (Pearson) and non-parametric (Spearman) methods, as shown (Supplementary Code 17; Supplementary Data 30; n=237,644 strains). Amongst the metrics investigated, the original CAI¹ provides the best correlations. We tested the tAI, which weights codons according to tRNA abundances²,³; RSCU, a simple measure of relative synonymous codon usage⁴,⁵; the frequency of codons in highly expressed genes (heg-fb)⁵; a measure of codon decoding rate by Dana and Tuller⁶; a codon index recently developed by Boël et al., based on the study of a large gene library⁷; an in vivo measure of codon decoding time by Chevance et al. that used a reporter assay based on an anti-terminator function⁸; and the hydropathy⁹ averaged over the full designed sequences (Full HI, as opposed to the HI used for the factorial design, which is defined on a more restricted region). Using a heuristic optimization procedure, we derived codon indices that maximize the partial correlation of interest. We did so starting from codons with equal weights (opt. index) or from the original CAI (opt. CAI). Optimized metrics are highly similar and vastly outperform other metrics. However, they show medium capacity to explain gene expression in E. coli (see panel D).

(D) Correlation between codon metrics and gene expression in E. coli. Barplots show Pearson and Spearman correlations of the various codon metrics with protein abundance (left) and mRNA abundance (right), as reported by Taniguchi et al.¹⁰ (n=575 genes). The heg-fb provides the best correlations, followed closely by CAI, tAI and RSCU. Indices derived from experimental studies tend to perform poorly. Indices derived from this work perform better but not as well as those derived from codon frequencies observed in natural genes.
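
The semi-partial correlation used in panel C can be sketched as follows, assuming the Pearson variant: P_FI is first regressed on P_NI, and the residuals are then correlated with the codon metric. This is a generic illustration rather than a reproduction of Supplementary Code 17, and all names and values are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def residuals(y, z):
    """Residuals of the simple least-squares regression of y on z."""
    n = len(y)
    mean_y, mean_z = sum(y) / n, sum(z) / n
    slope = (sum((u - mean_z) * (v - mean_y) for u, v in zip(z, y))
             / sum((u - mean_z) ** 2 for u in z))
    return [v - (mean_y + slope * (u - mean_z)) for u, v in zip(z, y)]


def semi_partial_correlation(p_fi, metric, p_ni):
    """Correlate the metric with the part of P_FI not explained by P_NI."""
    return pearson(residuals(p_fi, p_ni), metric)


# Hypothetical measurements for six strains
p_ni = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
p_fi = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]
cai = [0.3, 0.1, 0.5, 0.2, 0.6, 0.4]
sr = semi_partial_correlation(p_fi, cai, p_ni)
print(round(sr, 4))
```

A Spearman version would apply the same steps to ranks instead of raw values.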

1. Sharp, P. M. & Li, W. H. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987).

2. Reis, M. D. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res. 32, 5036–5044 (2004).

3. Tuller, T. et al. An Evolutionarily Conserved Mechanism for Controlling the Efficiency of Protein Translation. Cell 141, 344–354 (2010).

4. Sharp, P. M., Tuohy, T. M. & Mosurski, K. R. Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res. 14, 5125–5143 (1986).

5. Hilterbrand, A., Saelens, J. & Putonti, C. CBDB: The codon bias database. BMC Bioinformatics 13, 62 (2012).

6. Dana, A. & Tuller, T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 42, 9171–9181 (2014).

7. Boël, G. et al. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529, 358–363 (2016).

8. Chevance, F. F. V., Le Guyon, S. & Hughes, K. T. The effects of codon context on in vivo translation speed. PLoS Genet. 10, e1004392 (2014).

9. Kyte, J. & Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132 (1982).

10. Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538 (2010).

Supplementary Figure 9 Factors confounding the impact of amino acid hydropathy on protein expression

(A) Hydropathy is related to translation speed. Left: scatter plot between P_NI and P_FI colored by Full HI (the mean hydropathy of the full designed sequence), as shown. Right: Pearson's correlations between P_FI and Full HI for each percentile of P_NI. The association with Full HI is comparable to that of CAI, though weaker. This suggests that amino-acid composition has a small effect on elongation that may feed back on initiation at the beginning of coding sequences. Nonetheless, this effect is much less important than expected from the analysis of genome-wide functional data.

(B) The hydropathy signal that guided the factorial design may be confounded by functional localization of the proteins. Left: Boxplots of protein abundance in E. coli partitioned into cytoplasmic (n=438 genes) versus other subcellular localization of the proteins (n=121 genes). Cytoplasmic proteins tend to be more abundant. Middle: Boxplots of average hydropathy index as used in the design properties (amino acid positions 10 to 20) for the same partition (n=2,317 and 1,356 genes, respectively). Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges, central lines mark the medians and notches their 95% confidence intervals. Membrane-associated and periplasmic proteins tend to show higher hydropathy, probably due to the necessity to cross or be included into biological membranes¹. Right: Scatter plot of protein abundance versus hydropathy index. Data points are colored according to the subcellular localization of the corresponding protein, as shown. The relationship that initially motivated inclusion of the hydropathy property in the experimental design is largely driven by data points corresponding to non-cytoplasmic proteins (black line; see also Supplementary Fig. 1D), so that removing these largely decreases the observed Pearson's correlation (grey line, coefficients as shown). Protein abundance data are taken from Taniguchi et al.². Protein localization data are from Han et al.³.

(C) Mutational series #344 shows an unusual impact of MHI. Shown is a scatter plot of explainable effect sizes of MHI as calculated by ANOVA on P_NI and P_FI (n=56 series).

(D) Variation of amino-acid composition between extreme MHI in series #344 points to a double proline. Sequence logos of the top and bottom deciles of MHI show the frequency of amino acids at each position. Constructs with the lowest MHI are notably characterized by the presence of a double proline at positions 17-18 (red dashed box). Double prolines are known to be problematic for translation⁴.

(E) The presence of double proline in series #344 is highly correlated with MHI. Shown are boxplots of MHI binned by the number of double prolines in the sequences. Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges and central lines mark the medians. The strong apparent effect of MHI in this series might be confounded by the presence of double prolines in more than a third of the series. Number of sequences n as shown.

(F) Double prolines are linked to low protein production. Scatter plot of P_NI (left) or P_FI (right) as a function of MHI. Sequences with one and two double prolines are highlighted in red and blue, respectively. These are associated with much lower MHI and protein productions.

(G) Mutations in the double proline increase protein production. Scatter plot of P_NI (left) or P_FI (right) for pairs of sequences that differ only by one amino acid mutation in the double proline. Points highlighted in red differ only by a single nucleotide. Mutants tend to show increased protein production, further strengthening the role of the double proline in this case.

1. Charneski, C. A. & Hurst, L. D. Positive charge loading at protein termini is due to membrane protein topology, not a translational ramp. Molecular Biology and Evolution 31, 70–84 (2014).

2. Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538 (2010).

3. Han, M.-J. et al. Genome-wide identification of the subcellular localization of the Escherichia coli B proteome using experimental and computational methods. Proteomics 11, 1213–1227 (2011).

4. Doerfel, L. K. et al. EF-P Is Essential for Rapid Synthesis of Proteins Containing Consecutive Proline Residues. Science 339, 85–88 (2013).

Supplementary Figure 10 Growth measurements and impact of coding sequences properties

(A) Growth measurements are reproducible. Pairwise comparisons of triplicate estimates of relative growth rates after competition in non-coupling conditions for ca. 60 generations (Supplementary Data 15). Due to the low number of associated sequencing reads, growth estimates are more variable in slow growing strains. Number of observations (n) and pairwise Pearson correlations (r) as shown.

(B) Shorter competitions limit the loss of diversity. The fraction of the total library that can be measured in samples from different time points decreases with competition time. The three replicates shown in panel A are averaged and compared to a fourth population sampled after ca. 13 and 28 generations (color as in C).

(C) Shorter competition increases the dynamic range of growth measurements. The distribution of growth rates estimated from the different samples are shown as histograms (color as shown). Slower growing strains become undetectable in longer competitions.

(D) Competition time impacts growth estimates. Scatter plots show pairwise comparison of growth rates between the three sampled time points. Strains missing in one sample are shown in blue with an arbitrary low value in place of the missing datum. The slowest growers are strongly counter-selected and only observed at earlier time points before they go extinct. Number of observations (n, excluding blue points) and pairwise Pearson correlations (r) as shown.

(E) Codon adaptation has little impact on growth. Far Left: Scatter plot of W_NI as a function of P_NI colored by CAI, as shown. Middle Right: Scatter plot of W_FI as a function of P_FI colored by CAI, as shown. Dark lines show median of growth for each percentile of protein production and a loess smoother. Yellow and cyan lines show the same information for the top and bottom deciles of CAI, respectively. Middle Left: Pearson's correlations between W_NI and CAI for every percentile of P_NI. Far Right: Pearson's correlations between W_FI and CAI for every percentile of P_FI. Color and size of the points convey the mean and variance of CAI, respectively. The grey line is a loess smoother highlighting the trend in the correlations. Higher CAI is weakly associated with a small growth improvement at higher protein production.

(F) Strong interactions between secondary structures impact growth. Interaction plots for the effect of the three designed secondary structures on growth under non-coupling conditions in rich (Left) and minimal (Right) media. Red lines mark the medians per level of STR_-30:+30. Blue and green lines show the medians for combinations of STR_+01:+60_STR_-30:+30 and STR_+31:+90_STR_-30:+30_STR_+01:+60, respectively. Structure strengths are depicted below the boxplots for clarity. Boxes mark interquartile ranges, whiskers measure 1.5 times these ranges (n∈[8,335; 8,905] and [6,090; 8,213] strains for each plot in the left and right panel, respectively).

(G) Secondary structures affect growth beyond their impact on protein production. Correlations between STR_+01:+60 and W_NI (Far Left) and W_FI (Middle Left) for every percentile of protein production in the respective initiation conditions. Correlations between STR_+31:+90 and W_NI (Middle Right) and W_FI (Far Right) for every percentile of protein production. Color and size of the points convey the mean and variance of the structures, respectively. The grey line is a loess smoother highlighting the trend in the correlations. Weaker structures are generally associated with faster growth, especially at low protein production.

Supplementary Figure 11 mRNA measurements and impact of coding sequences properties

(A) High-throughput assay of RNA decay. Constant quantities of standard strains are spiked into library samples. Standard sequencing reads increase as reporter transcripts decay, defining corrective coefficients (Supplementary Data 33). Corrected time series are fit to an exponential decay model to estimate RNA abundance at steady state (RNA_SS; t=0), transcript half-life (RNA_HL) and the final fraction of protected RNA (RNA_PTX; t=+∞) for each strain.

(B) Time-series of RNA measurements are noisy but reproducible. Pairwise comparisons of read counts obtained from two biological replicates at different time points (Supplementary Data 15). RNA-Seq counts are normalized by DNA-Seq counts to account for variations in strain abundances, but not yet corrected using decay standards. One unit represents 10³ RNA-Seq read counts. The number of complete observations (n strains) and pairwise Pearson's correlations (r) between replicates are as shown. Correlations tend to diminish at later time points.
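
One functional form consistent with the three parameters named in panel A is an exponential decay toward a protected plateau. The sketch below is an assumption made for illustration: the legend does not give the exact parameterization, and the function name and numerical values are hypothetical.

```python
def decay_model(t, rna_ss, rna_hl, rna_ptx):
    """Assumed decay curve: abundance at time t.

    rna_ss  : abundance at t = 0 (steady state before the assay starts)
    rna_hl  : half-life of the decaying fraction (same time unit as t)
    rna_ptx : fraction of the initial abundance that remains as t grows
    """
    decaying = (1.0 - rna_ptx) * 0.5 ** (t / rna_hl)
    return rna_ss * (rna_ptx + decaying)


# Hypothetical parameters: 1000 units at t=0, 5 min half-life, 10% protected
for minutes in (0, 5, 10, 40):
    print(minutes, round(decay_model(minutes, 1000.0, 5.0, 0.1), 2))
```

Under this form the curve starts at RNA_SS, loses half of its decaying fraction every RNA_HL, and levels off at RNA_SS × RNA_PTX.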

(C) Exponential decay fit captures relevant parameters. For each replicate, the distribution of the sum-of-squares explained by the fit divided by the total sum-of-squares (akin to R²) is shown to provide an estimate of the fits' quality (left; Supplementary Code 21; Supplementary Data 34). The middle and right panels are scatter plots of estimated RNA_SS and RNA_PTX plotted against their nearest measured equivalent i.e. read counts at t=0 and the ratio of read counts at t=40 over t=0, respectively (number of strains as shown). Identity lines are shown in red. Correlations between estimated and observed values are excellent (Pearson's coefficients r as shown).
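
The fit-quality score described in panel C can be sketched as below, assuming that the explained sum-of-squares is taken as the total minus the residual sum-of-squares. For nonlinear fits other definitions exist, so this choice, the function name and the data are illustrative assumptions.

```python
def explained_fraction(observed, fitted):
    """(total SS - residual SS) / total SS, an R²-like score."""
    mean_obs = sum(observed) / len(observed)
    total_ss = sum((o - mean_obs) ** 2 for o in observed)
    residual_ss = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return (total_ss - residual_ss) / total_ss


# Hypothetical corrected read counts and fitted values at five time points
observed = [1000.0, 560.0, 320.0, 150.0, 110.0]
fitted = [1000.0, 550.0, 325.0, 156.0, 100.0]
score = explained_fraction(observed, fitted)
print(round(score, 4))
```

A score near 1 indicates that the fitted curve accounts for almost all of the variation across time points; a fit no better than the mean scores 0.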

(D) Pairwise comparison of estimated decay parameters between biological replicates. Due to the sensitivity of the fitting procedure, the initial (RNA_SS) and final (RNA_PTX) states of the system are more reproducible than its dynamic component (RNA_HL). We therefore refrained from analyzing RNA_HL. Number of strains as shown.

(E) Diversity of RNA_SS profiles between replicate series. Red and grey dots mark medians and interquartile ranges, respectively. The red line highlights the median of the whole dataset (n=233,487 for the whole dataset and n∈[3,259; 4,351] strains for each series).

(F) Decay assay artifacts expose complex interactions between degradation and translation. Scatter plot of RNA_HL (Left) and RNA_PTX (Middle) versus P_NI, colored by RNA_SS as shown. Dark lines mark median RNA_HL or RNA_PTX for each percentile of P_NI and the corresponding loess smoothers. Yellow and cyan lines show the top and bottom deciles of RNA_SS, respectively. Number of strains as shown. Various correlations for each percentile of P_NI are shown (Right). Positive association between RNA_HL and RNA_SS is restricted to low P_NI and progressively inverted, reflecting swifter protection of fast initiated transcripts. RNA_PTX is linearly related to P_NI and modulated by RNA_SS. Stronger structures are associated with increased RNA_HL and RNA_SS, especially at low P_NI.

(G) Mediation by translation mechanisms blurs the relationship between transcript stability and abundance. Scatter plot of RNA_SS as a function of RNA_HL colored by P_NI, as shown. Thick and thin dark lines show medians of RNA_SS per percentile of RNA_HL and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of RNA_HL, respectively. Number of strains as shown. The expected positive correlation between RNA_HL and RNA_SS is confined to low protein production regimes and becomes negative with increasing P_NI (correlations as shown).

(H) The association between mRNA stability and distal structure is most visible at low protein production. Scatter plot of RNA_HL (Left) and RNA_SS (Right) as a function of STR_+31:+90 colored by RNA_SS and RNA_HL, respectively, as shown. Only strains from the lowest P_NI decile (cyan regression line in panel I) are plotted. Black and grey lines mark the median and quartiles of the y-axis measurements for each percentile of STR_+31:+90. Number of observations (n strains) and Pearson's correlation coefficients (r) as shown.

Supplementary Figure 12 Polysome profiling and impact of coding sequences properties

(A) High-throughput targeted polysome profiling. The first five polysome fractions were extracted from a sucrose gradient, barcoded and sequenced in multiplex.

(B) Codon adaptation improves mRNA protection by translating ribosomes. Scatter plot of RNA_PTX as a function of MRD colored by CAI, as shown (upper panel). Thick and thin dark lines show medians of RNA_PTX per percentile of MRD and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of CAI, respectively. Pearson's correlation between RNA_PTX and CAI increases with increasing percentiles of MRD (lower panel). Higher CAI leads to increased protection at higher translation regime, presumably by ensuring smoother ribosome flow over the transcripts.

(C) Codon adaptation faintly increases RNA abundance. Scatter plot of RNA_SS as a function of MRD colored by CAI, as shown (upper panel). Thick and thin dark lines show medians of RNA_SS per percentile of MRD and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of CAI, respectively. Although modest, Pearson's correlation between RNA_SS and CAI increases with MRD (lower panel). This behavior probably reflects the effect of CAI on RNA protection.

(D) Apparent impact of strong distal structure on protein production. Scatter plot of P_NI as a function of MRD colored by STR_+31:+90, as shown (upper panel). Thick and thin dark lines show medians of RNA_SS per percentile of MRD and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of STR_+31:+90, respectively. Negative Pearson's correlation between STR_+31:+90 and P_NI for a given MRD peaks at intermediate MRD (lower panel), even though strong STR_+31:+90 should slow ribosome progression. This relationship probably reflects the impact of STR_+31:+90 on RNA abundance and stability (Supplementary Fig. 10F), which in turn affects the MRD of single transcripts and the collegial protein production (see E,F).

(E) Ribosomal density and RNA abundance show similar relationships with distal structure. Data points show Pearson's correlations between STR_+31:+90 and either MRD (red) or RNA_SS (blue) for each percentile of P_NI. Solid lines are loess smoothers highlighting the trends in the correlations. The relationship between STR_+31:+90 and MRD is probably a consequence of that linking STR_+31:+90 to RNA_SS.

(F) Ribosomal density is driven by transcript abundance rather than distal structures. Scatter plot of RNA_SS as a function of STR_+31:+90 colored by MRD, as shown. Thick and thin dark lines show medians of RNA_SS per percentile of STR_+31:+90 and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of MRD, respectively. RNA_SS decreases faintly with increasing STR_+31:+90, irrespective of MRD levels. In contrast, low MRD is strongly associated with high RNA_SS (see also G).

(G) Variation of design properties and measurements in protein-growth phenotypic space. Coarse-grained grids of W_NI versus P_NI. Data are binned first by deciles of P_NI and then by deciles of W_NI as in Fig. 5H. In each grid, bins are color-coded to show the range of means of the parameter indicated in each vignette. Smoother variations across the grid are indicative of larger dynamic range and truer effects.
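The per-percentile correlations reported in the lower panels of B to D can be illustrated with a minimal sketch. It assumes three equal-length lists of per-sequence values (for example RNA_PTX, CAI and MRD), sorts sequences by the conditioning variable, splits them into equal-sized bins and computes Pearson's correlation within each bin. The function names and the binning rule are illustrative assumptions, not the authors' analysis code.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def binned_correlation(x, y, by, n_bins=100):
    """Correlation between x and y within successive quantile bins of `by`."""
    order = sorted(range(len(by)), key=lambda i: by[i])
    size = len(order) / n_bins
    results = []
    for b in range(n_bins):
        idx = order[int(b * size):int((b + 1) * size)]
        if len(idx) < 2:
            results.append(float("nan"))  # too few points in this bin
        else:
            results.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    return results
```

With `n_bins=100` each bin corresponds to one percentile of the conditioning variable; `n_bins=10` would give deciles.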

Supplementary Figure 13 Extensive phenotypic diversity between replicate factorial series

(A) Principal component analysis highlights the phenotypic spread of factorial series. The analysis is based on the correlation matrix between series-wise means of shown phenotypic variables (n=56 series).

(B) Visualization of the ensemble phenotypic differences between series. Spider plots are grouped by clusters separated by alternating white and grey backgrounds. Series' identification numbers are given in the bottom right corner of each plot. Although each series explores the same property space, small initial phenotypic differences may cascade into the observed diversity. Understanding these differences represents a challenge for predictive biology.

Supplementary Figure 14 Examples of comparable structure profiles leading to different protein productions

Minimum free energies of predicted secondary structures are plotted as a function of window position for windows of different length, as shown. Constructs with similar structure profiles (same row) can be found in distinct regions of the protein production space, as indicated by the red gates in a scatter plot of P_FI versus P_NI (top). Conversely, very different structure profiles can yield the same production phenotypes (same column). These profiles also exemplify that different window lengths can yield quite dissimilar profiles for a given construct (e.g. 50 and 70 nucleotide-long windows on construct 71_33111112_1, middle-left plot).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14 (PDF 5936 kb)

Life Sciences Reporting Summary (PDF 165 kb)

Supplementary Tables

Supplementary Tables 1–3 (PDF 299 kb)

Supplementary Notes

Supplementary Note 1 (PDF 267 kb)

Supplementary Code 1

Parameter file. Used to parameterize python scripts involved with processing of sequencing data (Supplementary Code 2–5). (TXT 1 kb)

Supplementary Code 2

De-multiplex fastq. Python script to identify and trim custom sequencing barcodes. Supports parallelization. Outputs a separate fastq file for each barcode. (TXT 12 kb)
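As a rough illustration of what barcode de-multiplexing involves, the sketch below assigns reads to barcodes and trims the barcode from both sequence and quality strings. It assumes barcodes sit at the 5' end of each read and must match exactly; the record layout and function name are hypothetical, and the actual script may handle mismatches, barcode position and file output differently.

```python
def demultiplex(records, barcodes):
    """Group reads by leading barcode and trim it off.

    records  : iterable of (header, sequence, quality) tuples
    barcodes : dict mapping a sample name to its barcode sequence
    Returns a dict mapping each sample name (plus "unassigned") to a
    list of trimmed (header, sequence, quality) tuples.
    """
    bins = {name: [] for name in barcodes}
    bins["unassigned"] = []
    for header, seq, qual in records:
        for name, barcode in barcodes.items():
            if seq.startswith(barcode):  # exact 5' match only (assumption)
                cut = len(barcode)
                bins[name].append((header, seq[cut:], qual[cut:]))
                break
        else:
            # No barcode matched: keep the read untouched for inspection
            bins["unassigned"].append((header, seq, qual))
    return bins
```

Each list in the returned dictionary could then be written to its own fastq file, one per barcode.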

Supplementary Code 3

Python wrapper for BWA and samtools. Produces mapping and quality check of the reads by calling BWA and samtools. Supports parallelization. (TXT 3 kb)

Supplementary Code 4

Read counter. Python script to summarize the number of reads mapping to each target sequence from the bam files generated by Supplementary Code 3. Supports parallelization. (TXT 21 kb)

Supplementary Code 5

Count aggregator. Python script to aggregate count tables generated by Supplementary Code 4. (TXT 1 kb)

Supplementary Code 6

Processing of protein production under regular and facilitated initiation from FACS-seq data. R script to normalize, rescale and aggregate read count data from multiple FACS-seq replicate experiments. Converts digital read distribution into a continuous linear measure of protein production ranging between 1 and 100 (P_NI and P_FI). (TXT 14 kb)
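One way to picture the conversion of a binned read distribution into a continuous score is sketched below. It assumes that, for one sequence, read counts per sorting bin are normalized by the total reads of each bin's library and scaled by the number of cells sorted into that bin, and that the weighted mean bin index is then rescaled linearly to the 1 to 100 range. All names and the exact normalization are assumptions for illustration; the R script may proceed differently.

```python
def facs_seq_score(read_counts, library_reads, cells_sorted, lo=1.0, hi=100.0):
    """Weighted mean bin position of one sequence, rescaled to [lo, hi].

    read_counts   : reads of this sequence in each sorting bin
    library_reads : total reads sequenced for each bin
    cells_sorted  : number of cells sorted into each bin
    Returns None when the sequence has no reads in any bin.
    """
    n_bins = len(read_counts)
    # Estimated number of cells carrying this sequence in each bin
    weights = [r / t * c for r, t, c in zip(read_counts, library_reads, cells_sorted)]
    total = sum(weights)
    if total == 0:
        return None
    mean_bin = sum(i * w for i, w in enumerate(weights)) / total
    return lo + (hi - lo) * mean_bin / (n_bins - 1)
```

A sequence found only in the lowest bin scores 1, one found only in the highest bin scores 100, and mixed distributions fall in between.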

Supplementary Code 7

Computation of ANOVA's sum of squares for P_NI. R script to run an ANOVA on P_NI data and extract the sum of squares accounted for by design properties and their first-order interactions. (TXT 6 kb)

Supplementary Code 8

Computation of sum of squares for multiple linear regression of P_NI on design properties. R script to run a multiple linear regression on P_NI data and extract the ANOVA-like sum of squares accounted for by design properties and their first-order interactions. (TXT 5 kb)

Supplementary Code 9

Regression tree analysis for P_NI. R script to run a CART analysis on P_NI. (TXT 0 kb)

Supplementary Code 10

Computation of ANOVA's sum of squares for P_FI. R script to run an ANOVA on P_FI data and extract the sum of squares accounted for by design properties and their first-order interactions. (TXT 6 kb)

Supplementary Code 11

Computation of sum of squares for multiple linear regression of P_FI on design properties. R script to run a multiple linear regression on P_FI data and extract the ANOVA-like sum of squares accounted for by design properties and their first-order interactions. (TXT 5 kb)

Supplementary Code 12

Regression tree analysis for P_FI. R script to run a CART analysis on P_FI. (TXT 0 kb)

Supplementary Code 13

Effect of structure strength predicted across sliding windows of different sizes. R script to run linear regression of P_NI and P_FI against minimal free energies computed over sliding windows of different lengths. Reports the ANOVA-like sum of squares. (TXT 3 kb)

Supplementary Code 14

Multiple linear regression of P_NI and P_FI on predicted nucleotide accessibilities. R script to run a multiple linear regression of protein production data on predicted nucleotide accessibilities. Reports the ANOVA-like sum of squares accounted for by every position. (TXT 4 kb)

Supplementary Code 15

Call to the RBS calculator web service. Python script to remotely run the RBS calculator on designed sequences. (TXT 7 kb)

Supplementary Code 16

Effect of predictions from the RBS calculator. R script to run linear regression of P_NI and P_FI against RBS calculator outputs. Reports the ANOVA-like sum of squares. (TXT 1 kb)

Supplementary Code 17

Partial correlation between P_FI and various codon metrics, given P_NI. R script to compute various alternative codon metrics for the codon sequence and determine their partial correlations with P_FI accounting for P_NI. (TXT 19 kb)
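For readers unfamiliar with partial correlation, the sketch below computes the first-order partial correlation between two variables given a third, using the standard formula based on the three pairwise Pearson coefficients. Here `x` could stand for a codon metric, `y` for P_FI and `z` for P_NI; these names are illustrative, and the authors' R script may compute the quantity by another route.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation between x and y after accounting for z.

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

When `z` is uncorrelated with both `x` and `y`, the partial correlation reduces to the ordinary correlation between `x` and `y`.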

Supplementary Code 18

Processing of growth measurements from FIT-seq data collected under various conditions. R script to convert differential enrichment of read count data over time into an integrated measure of cell growth. Processes read count data from multiple replicate experiments. Converts read count ratios into aggregated measures of relative growth in a given environment (W_NI, W_FI, W_UTX, W_M). (TXT 22 kb)

Supplementary Code 19

Computation of sum of squares for multiple linear regression of W_NI on P_NI and design properties. R script to run a multiple linear regression of W_NI against P_NI, P_NI² and design properties. Reports ANOVA-like sum of squares. (TXT 2 kb)

Supplementary Code 20

Computation of sum of squares for multiple linear regression of W_FI on P_FI and design properties. R script to run a multiple linear regression of W_FI against P_FI, P_FI² and design properties. Reports ANOVA-like sum of squares. (TXT 2 kb)

Supplementary Code 21

Processing of RNA abundance and decay measurements from serial RNA-seq. R script to compute RNA decay after transcription arrest. Sample read counts are corrected using coefficients derived from the ratios of spiked-in RNA standard counts over time. Performs a nonlinear decay fit to the corrected count frequencies to estimate RNA abundance at steady state (RNA_SS), RNA half-life (RNA_HL) and RNA protection (W_PTX). (TXT 10 kb)
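To make the link between a decay curve, steady-state abundance and half-life concrete, the sketch below fits a single-exponential model N(t) = N0·exp(−k·t) to corrected counts. Note the substitution: it uses ordinary least squares on log-transformed counts rather than a nonlinear fit, and it ignores any protected (non-decaying) fraction, so it is a simplified stand-in for the procedure described above, with hypothetical names throughout.

```python
import math

def fit_exponential_decay(times, counts):
    """Fit N(t) = N0 * exp(-k * t) by linear regression of log(counts) on time.

    times  : sampling times after transcription arrest
    counts : corrected, strictly positive abundances at those times
    Returns (N0, half_life), with half_life = ln(2) / k.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    mean_t, mean_l = sum(times) / n, sum(logs) / n
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
             / sum((t - mean_t) ** 2 for t in times))
    intercept = mean_l - slope * mean_t
    decay_rate = -slope                   # k, per unit of time
    n0 = math.exp(intercept)              # abundance extrapolated to t = 0
    return n0, math.log(2) / decay_rate
```

The intercept plays the role of the abundance at the time of transcription arrest, and the half-life follows directly from the fitted decay rate.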

Supplementary Code 22

Compute 3D animation of the data. R scripts to produce the images necessary for Supplementary Video 1. (TXT 10 kb)

Supplementary Code 23

Processing of polysome profiles from DNA-seq of separate polysome fractions. R script to compute the distribution of polysomes (up to the fifth fraction) for each design sequence from read counts. (TXT 2 kb)

Supplementary Code 24

Definition of sequence archetypes. R script to categorize sequences into the most relevant combinations of sequence properties. Calculate the series-wise means of various phenotypes for sequences belonging to these archetypes. (TXT 5 kb)

Supplementary Code 25

GenBank parser. A script to parse coding sequence from GenBank file using BioPython. (TXT 1 kb)

Supplementary Code 26

D-Tailor module. Links to specific D-Tailor modules used in this work. (TXT 0 kb)

Supplementary Code 27

Genome randomization. Perl modules to produce random genome variants that retain codon usage and the amino acid composition of proteins. (ZIP 1262 kb)

Supplementary Code 28

Seed generator for D-Tailor. Python script to generate a random input sequence for D-Tailor that maximizes the distance to other input sequences. (TXT 0 kb)

Supplementary Data 1

E. coli's features and measurements. Dataset aggregating various measures of sequence property for every gene in a reference E. coli and corresponding expression data for a subset (Taniguchi, 2009). (ZIP 1914 kb)

Supplementary Data 2

Mean hydropathy index over sliding windows. Calculation of the MHI over sliding windows for every gene in the reference E. coli genome. (ZIP 3239 kb)
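A sliding-window mean hydropathy calculation can be sketched as follows. The Kyte-Doolittle hydropathy scale is used here as an assumption, since the scale is not named in this listing; the window length and function name are likewise illustrative.

```python
# Kyte-Doolittle hydropathy values per amino acid (assumed scale)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy_windows(protein, window):
    """Mean hydropathy index for every window of `window` residues.

    Returns one value per window start position; an empty list if the
    protein is shorter than the window.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
```

For a protein of length L and a window of w residues this returns L − w + 1 values, one per window start position.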

Supplementary Data 3

tAI profiles for sfGFP and a designed variant. Calculates tAI over a sliding window. (CSV 9 kb)

Supplementary Data 4

Accessible bottleneck strengths. Calculation of bottleneck strength for random sequence cloned in the translation reporter. (ZIP 36624 kb)

Supplementary Data 5

E. coli's features and levels. Calculation of property scores and discrete categorisation for every gene in the E. coli genome, based on the properties and thresholds set for the Design of Experiments. (ZIP 854 kb)

Supplementary Data 6

Random solutions. Calculation of property scores and categorization for random sequences cloned in the translation reporter context, based on the properties and thresholds set for the Design of Experiments. (ZIP 13803 kb)

Supplementary Data 7

Intra-series distance. Collection of tables reporting Hamming distances between every pair of sequences within the same series. (XLSX 10 kb)
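The pairwise Hamming distances reported in these tables count the positions at which two equal-length sequences differ. A minimal sketch, with a hypothetical input layout of sequence names mapped to sequences:

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def pairwise_hamming(sequences):
    """Hamming distance for every pair of sequences in a series.

    sequences : dict mapping a sequence name to its sequence
    Returns a dict keyed by (name_1, name_2) tuples.
    """
    return {
        (m, n): hamming(sequences[m], sequences[n])
        for m, n in combinations(sequences, 2)
    }
```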

Supplementary Data 8

Series logo. Position-wise nucleotide and amino acid frequency matrices for each series. (ZIP 7341 kb)

Supplementary Data 9

Sequencing count summary. A table reporting the number of counts associated with each design sequence for every sequencing library in this work. (ZIP 29419 kb)

Supplementary Data 10

Illumina lane description. Mapping of the different sequencing libraries on Illumina sequencing lane. (CSV 4 kb)

Supplementary Data 11

TAG coupling upon activation by unnatural amino acids. Table reporting the mean fluorescence observed upon induction by increasing concentration of the unnatural amino acid pAcF. (CSV 1 kb)

Supplementary Data 12

TAG coupling mutants. Table reporting the mean fluorescence observed in various mutants of the TAG position. (CSV 0 kb)

Supplementary Data 13

Growth of TAG mutants. Density of cell culture (OD₆₀₀) over time for various mutants at the TAG position. (CSV 10 kb)

Supplementary Data 14

Number of cells sorted during FACS-seq. Report the number of cells sorted in each bin during the FACS-seq experiments. Used to normalize read counts upon sequencing. (CSV 5 kb)

Supplementary Data 15

Integrated phenotypic measurements. Consolidated dataset comprising design information, intermediates and fully processed phenotypic measurements for all 244,000 synthetic sequences. (ZIP 127072 kb)

Supplementary Data 16

ANOVA on P_NI. An R object containing the sum of squares computed by running ANOVAs on the full dataset and independent series (Supplementary Code 7). (ZIP 67 kb)

Supplementary Data 17

MLR on P_NI. An R object containing the sum of squares computed by running multiple linear regressions on the full dataset and independent series (Supplementary Code 8). (ZIP 64 kb)

Supplementary Data 18

CART on P_NI. An R object containing the result of CART analysis (Supplementary Code 9). (ZIP 2935 kb)

Supplementary Data 19

ANOVA on P_FI. An R object containing the sum of squares computed by running ANOVAs on the full dataset and independent series (output of Supplementary Code 10). (ZIP 67 kb)

Supplementary Data 20

MLR on P_FI. An R object containing the sum of squares computed by running multiple linear regressions on the full dataset and independent series (output of Supplementary Code 11). (ZIP 64 kb)

Supplementary Data 21

CART on P_FI. An R object containing the result of CART analysis (output of Supplementary Code 12). (ZIP 2907 kb)

Supplementary Data 22

Effect of minimum free energy over sliding windows. MFE predicted for sliding windows of different length on each designed sequence. (ZIP 38075 kb)

Supplementary Data 23

Sum of squares corresponding to regression of P_NI on MFE over sliding windows (output of Supplementary Code 13). (ZIP 108 kb)

Supplementary Data 24

Sum of squares corresponding to regression of P_FI to the residuals of P_NI's regression on MFEs (output of Supplementary Code 13). (ZIP 108 kb)

Supplementary Data 25

Single nucleotide accessibilities. Predicted accessibilities at every position of each designed sequence. (ZIP 68145 kb)

Supplementary Data 26

Sum of squares for multiple linear regression of P_NI on accessibilities (output of Supplementary Code 14). (ZIP 82 kb)

Supplementary Data 27

Sum of squares for multiple linear regression of P_FI on accessibilities (output of Supplementary Code 14). (ZIP 82 kb)

Supplementary Data 28

RBS calculator predictions. Aggregation of outputs obtained by running each designed sequence in reporter context in the RBS calculator (output of Supplementary Code 15). (ZIP 3410 kb)

Supplementary Data 29

Sum of squares corresponding to the regression of P_NI on RBS calculator's predictions (output of Supplementary Code 16). (ZIP 2 kb)

Supplementary Data 30

Partial correlation of various codon-based metrics with P_FI, given P_NI (output of Supplementary Code 17). (ZIP 0 kb)

Supplementary Data 31

Sum of squares for multiple linear regression of W_NI on design properties and P_NI (output of Supplementary Code 19). (ZIP 48 kb)

Supplementary Data 32

Sum of squares for multiple linear regression of W_FI on design properties and P_FI (output of Supplementary Code 20). (ZIP 48 kb)

Supplementary Data 33

RNA standards. Counts of reads mapping to RNA standard sequences in RNA decay libraries. (CSV 1 kb)

Supplementary Data 34

Nonlinear decay fit. An R object containing fit data (output of Supplementary Code 21). (ZIP 13532 kb)

Supplementary Data 35

Phenotypic archetypes. Quartiles of series-wise mean for various phenotypes (output of Supplementary Code 24). (ZIP 1 kb)

Supplementary Data 36

Random E. coli genomes. Result of constrained genome randomization (output of Supplementary Code 27). (ZIP 12592 kb)

Supplementary Video 1

3D animation of the data in RNA–Protein–Fitness space. (AVI 44746 kb)

About this article

Cite this article

Cambray, G., Guimaraes, J. & Arkin, A. Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli. Nat Biotechnol 36, 1005–1015 (2018). https://doi.org/10.1038/nbt.4238

Received: 24 November 2017
Accepted: 02 August 2018
Published: 24 September 2018
Issue Date: November 2018
DOI: https://doi.org/10.1038/nbt.4238

This article is cited by

Selection on synonymous sites: the unwanted transcript hypothesis
- Sofia Radrizzani
- Grzegorz Kudla
- Laurence D. Hurst
Nature Reviews Genetics (2024)

Start codon-associated ribosomal frameshifting mediates nutrient stress adaptation
- Yuanhui Mao
- Longfei Jia
- Shu-Bing Qian
Nature Structural & Molecular Biology (2023)

Genome-wide promoter responses to CRISPR perturbations of regulators reveal regulatory networks in Escherichia coli
- Yichao Han
- Wanji Li
- Fuzhong Zhang
Nature Communications (2023)

Crude enzyme immobilization-based cell-free system for efficient N-acetylneuraminic acid biosynthesis aided by N-terminal coding sequence screening
- Peng Wen
- Xueqin Lv
- Yanfeng Liu
Systems Microbiology and Biomanufacturing (2023)

Tobacco as green bioreactor for therapeutic protein production: latest breakthroughs and optimization strategies
- Muhammad Naeem
- Rong Han
- Lingxia Zhao
Plant Growth Regulation (2023)