  • Resource

Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli

Abstract

Comparative analyses of natural and mutated sequences have been used to probe mechanisms of gene expression, but small sample sizes may produce biased outcomes. We applied an unbiased design-of-experiments approach to disentangle factors suspected to affect translation efficiency in E. coli. We precisely designed 244,000 DNA sequences implementing 56 replicates of a full factorial design to evaluate nucleotide, secondary structure, codon and amino acid properties in combination. For each sequence, we measured reporter transcript abundance and decay, polysome profiles, protein production and growth rates. Associations between designed sequence properties and these consequent phenotypes were dominated by secondary structures and their interactions within transcripts. We confirmed that transcript structure generally limits translation initiation and demonstrated its physiological cost using an epigenetic assay. Codon composition has a sizable impact on translatability, but only in comparatively rare elongation-limited transcripts. We propose a set of design principles to improve translation efficiency that would benefit from more accurate prediction of secondary structures in vivo.


Figure 1: High-throughput design of experiments.
Figure 2: Protein production under normal and facilitated conditions of translation initiation.
Figure 3: Dynamic structure interactions hinder functional predictions.
Figure 4: Unexpected growth defects associated with reduced translation initiation.
Figure 5: Pathological accumulation of stable transcripts inhibits initiation rate.
Figure 6: Impact of archetypal sequences on translation efficiency and physiological cost.

Accession codes

Primary accessions

Sequence Read Archive

Referenced accessions

NCBI Reference Sequence

References

  1. Li, G.-W., Burkhardt, D., Gross, C. & Weissman, J.S. Quantifying absolute protein synthesis rates reveals principles underlying allocation of cellular resources. Cell 157, 624–635 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Andersson, S.G. & Kurland, C.G. Codon preferences in free-living microorganisms. Microbiol. Rev. 54, 198–210 (1990).

    CAS  PubMed  PubMed Central  Google Scholar 

  3. Scott, M., Gunderson, C.W., Mateescu, E.M., Zhang, Z. & Hwa, T. Interdependence of cell growth and gene expression: origins and consequences. Science 330, 1099–1102 (2010).

    Article  CAS  PubMed  Google Scholar 

  4. Ceroni, F., Algar, R., Stan, G.-B. & Ellis, T. Quantifying cellular capacity identifies gene expression designs with reduced burden. Nat. Methods 12, 415–418 (2015).

    Article  CAS  PubMed  Google Scholar 

  5. Frumkin, I. et al. Gene architectures that minimize cost of gene expression. Mol. Cell 65, 142–153 (2017).

    Article  CAS  PubMed  Google Scholar 

  6. Ikemura, T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151, 389–409 (1981).

    Article  CAS  PubMed  Google Scholar 

  7. Sharp, P.M. & Li, W.H. The codon adaptation index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Cannarozzi, G.M. & Schneider, A. Codon Evolution (Oxford Univ. Press, 2012).

  9. Mitarai, N., Sneppen, K. & Pedersen, S. Ribosome collisions and translation efficiency: optimization by codon usage and mRNA destabilization. J. Mol. Biol. 382, 236–245 (2008).

    Article  CAS  PubMed  Google Scholar 

  10. Charneski, C.A. & Hurst, L.D. Positively charged residues are the major determinants of ribosomal velocity. PLoS Biol. 11, e1001508 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Pop, C. et al. Causal signals between codon bias, mRNA structure, and the efficiency of translation and elongation. Mol. Syst. Biol. 10, 770 (2014).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  12. Del Campo, C., Bartholomäus, A., Fedyunin, I. & Ignatova, Z. Secondary structure across the bacterial transcriptome reveals versatile roles in mRNA regulation and function. PLoS Genet. 11, e1005613 (2015).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  13. Adhin, M.R. & van Duin, J. Scanning model for translational reinitiation in eubacteria. J. Mol. Biol. 213, 811–818 (1990).

    Article  CAS  PubMed  Google Scholar 

  14. Kudla, G., Murray, A.W., Tollervey, D. & Plotkin, J.B. Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255–258 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Mutalik, V.K. et al. Precise and reliable gene expression via standard transcription and translation initiation elements. Nat. Methods 10, 354–360 (2013).

    Article  CAS  PubMed  Google Scholar 

  16. Espah Borujeni, A., Channarasappa, A.S. & Salis, H.M. Translation rate is controlled by coupled trade-offs between site accessibility, selective RNA unfolding and sliding at upstream standby sites. Nucleic Acids Res. 42, 2646–2659 (2014).

    Article  CAS  PubMed  Google Scholar 

  17. Tuller, T. & Zur, H. Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic Acids Res. 43, 13–28 (2015).

    Article  CAS  PubMed  Google Scholar 

  18. Tuller, T. et al. An evolutionarily conserved mechanism for controlling the efficiency of protein translation. Cell 141, 344–354 (2010).

    Article  CAS  PubMed  Google Scholar 

  19. Tuller, T. et al. Composite effects of gene determinants on the translation speed and density of ribosomes. Genome Biol. 12, R110 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Charneski, C.A. & Hurst, L.D. Positive charge loading at protein termini is due to membrane protein topology, not a translational ramp. Mol. Biol. Evol. 31, 70–84 (2014).

    Article  CAS  PubMed  Google Scholar 

  21. Goodman, D.B., Church, G.M. & Kosuri, S. Causes and effects of N-terminal codon bias in bacterial genes. Science 342, 475–479 (2013).

    Article  CAS  PubMed  Google Scholar 

  22. Allert, M., Cox, J.C. & Hellinga, H.W. Multifactorial determinants of protein expression in prokaryotic open reading frames. J. Mol. Biol. 402, 905–918 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Ilzarbe, L., Álvarez, M.J., Viles, E. & Tanco, M. Practical applications of design of experiments in the field of engineering: a bibliographical review. Qual. Reliab. Eng. Int. 24, 417–428 (2008).

    Article  Google Scholar 

  24. Montgomery, D.C. Design and Analysis of Experiments (Wiley, 2017).

  25. Zhou, H., Vonk, B., Roubos, J.A., Bovenberg, R.A.L. & Voigt, C.A. Algorithmic co-optimization of genetic constructs and growth conditions: application to 6-ACA, a potential nylon-6 precursor. Nucleic Acids Res. 43, 10560–10570 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  26. Zhang, C., Zou, R., Chen, X., Stephanopoulos, G. & Too, H.-P. Experimental design-aided systematic pathway optimization of glucose uptake and deoxyxylulose phosphate pathway for improved amorphadiene production. Appl. Microbiol. Biotechnol. 99, 3825–3837 (2015).

    Article  CAS  PubMed  Google Scholar 

  27. Mutalik, V.K. et al. Quantitative estimation of activity and quality for collections of functional genetic elements. Nat. Methods 10, 347–353 (2013).

    Article  CAS  PubMed  Google Scholar 

  28. Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl. Acad. Sci. USA 110, 14024–14029 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Guimaraes, J.C., Rocha, M., Arkin, A.P. & Cambray, G. D-Tailor: automated analysis and design of DNA sequences. Bioinformatics 30, 1087–1094 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Pédelacq, J.-D., Cabantous, S., Tran, T., Terwilliger, T.C. & Waldo, G.S. Engineering and characterization of a superfolder green fluorescent protein. Nat. Biotechnol. 24, 79–88 (2006).

    Article  PubMed  CAS  Google Scholar 

  32. Young, T.S., Ahmad, I., Yin, J.A. & Schultz, P.G. An enhanced system for unnatural amino acid mutagenesis in E. coli. J. Mol. Biol. 395, 361–374 (2010).

    Article  CAS  PubMed  Google Scholar 

  33. Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Rouskin, S., Zubradt, M., Washietl, S., Kellis, M. & Weissman, J.S. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature 505, 701–705 (2014).

    Article  CAS  PubMed  Google Scholar 

  35. Yoo, J.-H. & RajBhandary, U.L. Requirements for translation re-initiation in Escherichia coli: roles of initiator tRNA and initiation factors IF2 and IF3. Mol. Microbiol. 67, 1012–1026 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Kelsic, E.D. et al. RNA structural determinants of optimal codons revealed by MAGE-seq. Cell Syst. 3, 563–571.e6 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. dos Reis, M., Savva, R. & Wernisch, L. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res. 32, 5036–5044 (2004).

    Article  CAS  PubMed  Google Scholar 

  38. Hilterbrand, A., Saelens, J. & Putonti, C. CBDB: the codon bias database. BMC Bioinformatics 13, 62 (2012).

    Article  PubMed  PubMed Central  Google Scholar 

  39. Dana, A. & Tuller, T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 42, 9171–9181 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Boël, G. et al. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529, 358–363 (2016).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  41. Chevance, F.F.V., Le Guyon, S. & Hughes, K.T. The effects of codon context on in vivo translation speed. PLoS Genet. 10, e1004392 (2014).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  42. van Opijnen, T. & Camilli, A. Transposon insertion sequencing: a new tool for systems-level analysis of microorganisms. Nat. Rev. Microbiol. 11, 435–442 (2013).

    Article  CAS  PubMed  Google Scholar 

  43. Dekel, E. & Alon, U. Optimality and evolutionary tuning of the expression level of a protein. Nature 436, 588–592 (2005).

    Article  CAS  PubMed  Google Scholar 

  44. Schaechter, M., MaalOe, O. & Kjeldgaard, N.O. Dependency on medium and temperature of cell size and chemical composition during balanced growth of Salmonellatyphimurium. J. Gen. Microbiol. 19, 592–606 (1958).

    Article  CAS  PubMed  Google Scholar 

  45. Li, G.-W.Howdo bacteria tune translation efficiency? Curr. Opin. Microbiol. 24, 66–71 (2015).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  46. Deana, A. & Belasco, J.G. Lost in translation: the influence of ribosomes on bacterial mRNA decay. Genes Dev. 19, 2526–2533 (2005).

    Article  CAS  PubMed  Google Scholar 

  47. Presnyak, V. et al. Codon optimality is a major determinant of mRNA stability. Cell 160, 1111–1124 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Hui, M.P., Foley, P.L. & Belasco, J.G. Messenger RNA degradation in bacterial cells. Annu. Rev. Genet. 48, 537–559 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Dinçbas, V. & Heurgué-Hamard, V. Shutdown in protein synthesis due to the expression of mini-genes in bacteria. J. Mol. Biol. 291, 745–759 (1999).

    Article  PubMed  Google Scholar 

  50. Jaillard, M. et al. A fast and agnostic method for bacterial genome-wide association studies: bridging the gap between kmers and genetic events. Preprint at bioRxiv https://doi.org/10.1101/297754 (2018).

  51. Bulmer, M. The selection-mutation-drift theory of synonymous codon usage. Genetics 129, 897–907 (1991).

    CAS  PubMed  PubMed Central  Google Scholar 

  52. Shah, P., Ding, Y., Niemczyk, M., Kudla, G. & Plotkin, J.B. Rate-limiting steps in yeast protein translation. Cell 153, 1589–1601 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Ciandrini, L., Stansfield, I. & Romano, M.C. Ribosome traffic on mRNAs maps to gene ontology: genome-wide quantification of translation initiation rates and polysome size regulation. PLoS Comput. Biol. 9, e1002866 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Duval, M. et al. Escherichia coli ribosomal protein S1 unfolds structured mRNAs onto the ribosome for active translation initiation. PLoS Biol. 11, e1001731 (2013).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  55. Marzi, S. et al. Structured mRNAs regulate translation initiation by binding to the platform of the ribosome. Cell 130, 1019–1031 (2007).

    Article  CAS  PubMed  Google Scholar 

  56. Qu, X. et al. The ribosome uses two active mechanisms to unwind messenger RNA during translation. Nature 475, 118–121 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. Takahashi, M.K. et al. Using in-cell SHAPE-Seq and simulations to probe structure-function design principles of RNA transcriptional regulators. RNA 22, 920–933 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Ding, Y., Kwok, C.K., Tang, Y., Bevilacqua, P.C. & Assmann, S.M. Genome-wide profiling of in vivo RNA structure at single-nucleotide resolution using structure-seq. Nat. Protoc. 10, 1050–1066 (2015).

    Article  CAS  PubMed  Google Scholar 

  59. Kosuri, S. & Church, G.M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Miller, W.G., Leveau, J.H.J. & Lindow, S.E. Improved gfp and inaZ broad-host-range promoter-probe vectors. Mol. Plant Microbe Interact. 13, 1243–1250 (2000).

    Article  CAS  PubMed  Google Scholar 

  61. Lee, T.S. et al. BglBrick vectors and datasheets: a synthetic biology platform for gene expression. J. Biol. Eng. 5, 12 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. Kapust, R.B. & Waugh, D.S. Controlled intracellular processing of fusion proteins by TEV protease. Protein Expr. Purif. 19, 312–318 (2000).

    Article  CAS  PubMed  Google Scholar 

  63. Kapust, R.B., Tözsér, J., Copeland, T.D. & Waugh, D.S. The P1′ specificity of tobacco etch virus protease. Biochem. Biophys. Res. Commun. 294, 949–955 (2002).

    Article  CAS  PubMed  Google Scholar 

  64. Cambray, G. et al. Measurement and modeling of intrinsic transcription terminators. Nucleic Acids Res. 41, 5139–5148 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Glascock, C.B. & Weickert, M.J. Using chromosomal lacIQ1 to control expression of genes on high-copy-number plasmids in Escherichia coli. Gene 223, 221–231 (1998).

    Article  CAS  PubMed  Google Scholar 

  66. Elowitz, M.B., Levine, A.J., Siggia, E.D. & Swain, P.S. Stochastic gene expression in a single cell. Science 297, 1183–1186 (2002).

    Article  CAS  PubMed  Google Scholar 

  67. Liang, J.C., Chang, A.L., Kennedy, A.B. & Smolke, C.D. A high-throughput, quantitative cell-based screen for efficient tailoring of RNA device activity. Nucleic Acids Res. 40, e154 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  68. Pósfai, G. et al. Emergent properties of reduced-genome Escherichia coli. Science 312, 1044–1046 (2006).

    Article  PubMed  CAS  Google Scholar 

  69. Csörgo, B., Fehér, T., Tímár, E., Blattner, F.R. & Pósfai, G. Low-mutation-rate, reduced-genome Escherichia coli: an improved host for faithful maintenance of engineered genetic constructs. Microb. Cell Fact. 11, 11 (2012).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  70. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  71. Bolstad, B.M., Irizarry, R.A., Astrand, M. & Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193 (2003).

    Article  CAS  PubMed  Google Scholar 

  72. van Opijnen, T., Bodi, K.L. & Camilli, A. Tn-seq: high-throughput parallel sequencing for fitness and genetic interaction studies in microorganisms. Nat. Methods 6, 767–772 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  73. Oh, E. et al. Selective ribosome profiling reveals the cotranslational chaperone action of trigger factor in vivo. Cell 147, 1295–1308 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  74. Qin, D. & Fredrick, K. Analysis of polysomes from bacteria. Methods Enzymol. 530, 159–172 (2013).

    Article  CAS  PubMed  Google Scholar 

  75. R Core Team. R: a language and environment for statistical computing https://www.R-project.org/ (2017).

  76. Sullivan, G.M. & Feinn, R. Using effect size—or why the P value is not enough. J. Grad. Med. Educ. 4, 279–282 (2012).

    Article  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

We thank V. Mutalik, C. Liu, L. Jacob, M. Price, A. Deutschbauer, M. Samoilov, P. Shah, J. Plotkin, J. Savitskaya and L. Ciandrini for discussions. We are grateful to the Agilent Laboratories and the Synthetic Biology Institute (SBI) for providing the OLS array. We thank J. Sampson, P. Anderson and S. Laderman from Agilent Laboratories for discussing OLS setup and processing. G.C. was funded by the Human Frontier Science Program (LT000873/2011-l), J.C.G. by the Portuguese Fundação para a Ciência e Tecnologia (SFRH/BD/47819/2008). We acknowledge financial support by the Synthetic Biology Engineering Research Center (SynBERC under National Science Foundation grant 04-570/0540879). This work used the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley (NIH S10 Instrumentation Grants S10RR029668 and S10RR027303).

Author information

Contributions

G.C. and A.P.A. conceived the work; G.C. and J.C.G. designed sequences; G.C. performed experiments and processed data; G.C. and A.P.A. analyzed the data and J.C.G. contributed post hoc secondary structure analyses; G.C. and A.P.A. wrote the manuscript.

Corresponding authors

Correspondence to Guillaume Cambray or Adam Paul Arkin.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Reanalysis of natural sequences to define properties of interest

All analyses used protein abundance data from Taniguchi et al.¹ (n=575 genes) and the genome sequence of E. coli MG1655 (GI:48994873; see Supplementary Code 25; Supplementary Data 1).

(A) Nucleotide composition biases in coding sequences are related to protein expression. Plots show Pearson correlation coefficients between various nucleotide contents and protein abundances for windows of varying sizes and positions, as shown. Colors correspond to different nucleotide combinations (see bottom right legend). Grey background shadings separate subpanels that correspond to increasing starting positions of the windows (numbering below bottom panel). Within subpanels, consecutive points correspond to increases of the window size by one codon from a fixed starting position. Within each window, the three within-codon positions have been analyzed separately, as indicated. Given the redundancy of the genetic code, the third codon position is less constrained and should provide a less biased indication of nucleotide influences on protein production. These data highlight the contribution of AT content (see panels B and C), as previously noted by Allert et al.² The strongest correlations are found at the second codon position for %A, %T and %C, but not %G. According to Sjöström and Wold³, this particular pattern strongly suggests a contribution of the hydropathic properties of the corresponding amino acids (see panels D and E).

(B) Scatter plot of protein abundances against the AT content in the window +4 to +21 used for further design (%AT). Although sizable when only the third codon position is considered (see A), Pearson's correlations with protein abundances are relatively weak when all three codon positions are considered in the calculation of AT content.
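As a rough illustration of the quantity plotted in panel B, the following Python sketch computes %AT over a nucleotide window and its Pearson correlation with abundance. The coordinate convention (1-based, inclusive, with +1 at the A of the start codon), the function names and the three toy genes are assumptions made for this example; they are not taken from the Supplementary Code.

```python
from math import sqrt

def at_content(seq, start=4, end=21):
    """Percent A/T over nucleotides start..end (assumed 1-based, inclusive)."""
    window = seq[start - 1:end].upper()
    return 100.0 * sum(base in "AT" for base in window) / len(window)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: start of a coding sequence and a made-up abundance.
genes = {
    "geneA": ("ATGAAAATTAATTTAAAAGCGCTG", 900.0),
    "geneB": ("ATGGCGCCGGGCCGCGGCACCCTG", 40.0),
    "geneC": ("ATGAAAGCGATTACCGGTAAACTG", 300.0),
}
at = [at_content(seq) for seq, _ in genes.values()]
abundance = [ab for _, ab in genes.values()]
r = pearson(at, abundance)
print(f"Pearson r between %AT(+4..+21) and abundance: {r:.2f}")
```

Restricting the count to every third nucleotide of the window would give the per-codon-position variant analyzed in panel A.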

(C) Distribution of %AT binned by categories of protein abundances, as shown. No striking pattern differentiates the distributions. A single threshold—corresponding to the average %AT over all natural coding sequences in the reference E. coli genome—was chosen for the discretization of this property into 2 ordinal levels (white line).

(D) Hydropathy is correlated with protein expression. The red line shows the average hydropathy index over a sliding window of 11 amino acids (Supplementary Data 2). The blue line shows corresponding correlations with protein abundances. Positions correspond to amino acids. The grey vertical line marks the window chosen for design of the MHI property.
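The sliding-window average described in panel D can be sketched as follows. The legend does not state which hydropathy scale was used; the Kyte-Doolittle values below are an assumption for illustration, as are the function name and the example peptide.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the legend does not name one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def sliding_hydropathy(protein, window=11):
    """Mean hydropathy for every full window of `window` residues."""
    scores = [KD[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

# Hypothetical 22-residue peptide, giving 12 overlapping windows.
profile = sliding_hydropathy("MKKLLPTAAAGLLLLAAQPAMA")
print([round(value, 2) for value in profile])
```

A single window of this profile, at the position marked by the grey line, would correspond to one MHI score per sequence.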

(E) Distribution of MHI binned by categories of protein abundances, as shown. The low protein bin has a clear bimodal distribution. Two thresholds—corresponding to the 15th and 75th percentiles of MHI over all natural coding sequences in the reference E. coli genome—were chosen for the discretization of this property into 3 ordinal levels (white lines).

(F) Scatter plot of protein abundance against CAI of whole coding sequences. Regression line is shown in red (Pearson's correlation r=0.54). Grey background shadings mark the 20th and 80th percentiles of protein abundances used for categorization in the distributions (see G).
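For readers unfamiliar with the codon adaptation index (CAI), a minimal sketch of the standard Sharp and Li definition follows: each codon receives a relative adaptiveness w (its count divided by the count of the most used synonymous codon in a reference set), and CAI is the geometric mean of w over the codons of a sequence. The codon counts and the two synonymous families below are invented toy values, not E. coli data, and the function names are hypothetical.

```python
from math import exp, log

def relative_adaptiveness(codon_counts, synonymous_families):
    """w = count / count of the most frequent synonymous codon."""
    w = {}
    for family in synonymous_families:
        top = max(codon_counts[codon] for codon in family)
        for codon in family:
            w[codon] = codon_counts[codon] / top
    return w

def cai(cds, w):
    """Geometric mean of w over codons; codons absent from w are skipped
    (as is done for single-codon amino acids such as Met and Trp)."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [log(w[codon]) for codon in codons if codon in w]
    return exp(sum(logs) / len(logs))

# Toy reference usage (hypothetical counts): Lys and Leu families only.
families = [("AAA", "AAG"), ("CTG", "CTT")]
counts = {"AAA": 80, "AAG": 20, "CTG": 100, "CTT": 25}
weights = relative_adaptiveness(counts, families)

print(cai("AAACTG", weights))  # only preferred codons
print(cai("AAACTT", weights))  # one preferred, one rare codon
```

A real calculation would need a complete reference codon table and a rule for codons with zero counts, which this sketch does not handle.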

(G) Distribution of CAI binned by categories of protein abundances, as shown. Two thresholds, corresponding to the 20th and 80th percentiles of CAI over all natural coding sequences in the reference E. coli genome, were chosen for the discretization of this property into 3 ordinal levels (white lines).
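The percentile-based discretization used for CAI (and, with other cut points, for the other properties) might look like the sketch below. The percentile interpolation method and the handling of scores that fall exactly on a threshold are assumptions; the legend does not specify them. The reference scores are a synthetic 0 to 100 ramp so that the thresholds are easy to check.

```python
from bisect import bisect_right
from statistics import quantiles

def percentile_thresholds(reference_scores, percentiles=(20, 80)):
    """Cut points at the requested percentiles of a reference distribution."""
    cuts = quantiles(reference_scores, n=100, method="inclusive")  # 99 cut points
    return [cuts[p - 1] for p in percentiles]

def ordinal_level(score, thresholds):
    """0 = low, 1 = medium, 2 = high (a score equal to a threshold goes up)."""
    return bisect_right(thresholds, score)

# Synthetic reference distribution standing in for natural-gene scores.
reference = list(range(101))
thresholds = percentile_thresholds(reference)
levels = [ordinal_level(score, thresholds) for score in (10, 50, 90)]
print(thresholds, levels)
```

Passing a single percentile yields the 2-level case used for %AT, and other pairs (for example 25 and 75) give the cut points quoted for the structure properties.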

(H) Distribution of codon ramp properties binned by categories of protein abundances, as shown. Plotted are absolute bottleneck positions (BtlP, left) and bottleneck relative strengths (BtlS, middle) for all natural coding sequences in the reference E. coli genome. Distribution of BtlS for sequences with BtlP downstream of codon 33 (the design threshold dictated by construction constraints; see I and J) is shown on the right. This latter plot guided the definition of a nested threshold for BtlS, corresponding to the 70th percentile for this property (white line).

(I) Engineering codon ramp bottlenecks in the sfGFP reporter. The profile of relative bottleneck strength for the original sfGFP reporter is shown in grey (20-codon sliding window; Supplementary Data 3). To engineer conditions wherein a variable sequence of 96 nts fused to the reporter could influence bottleneck properties, a total of 22 codons clustered in 3 different regions of the reporter sequence were mutated. The resulting profile features a strong C-terminal bottleneck at position 232 and a moderate bottleneck at the beginning of the reporter (bold green line). The strength of the latter can be modulated by the nature of the upstream designed sequence (see J).

(J) Possible bottlenecks in the engineered reporter. Shown is a scatter plot of bottleneck positions and strengths realized for a million random sequences of 32 codons fused to the engineered reporter (Supplementary Data 4). Bottleneck positions are located within the first 33 codons or at position 232, as intended. The nested threshold for BtlS (red line) is not exceeded by C-terminal bottlenecks.

(K) Smooth variations in secondary structure strength around the start codon of natural coding sequences. Shown are boxplots of predicted minimum free energy for a window of 60 nts slid by steps of 5 nts around the start codon. Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges and middle lines mark medians. Points outside of the whiskers are not plotted for clarity. Colored boxes highlight the windows chosen for design. Background shadings mark the 10th, 25th, 50th, 75th and 90th structure percentiles for randomly generated sequences. While structures in 5′ UTRs tend to be less stable than expected by chance, structures within genes tend to be more stable.

(L) Distribution of predicted minimum free energies of the structures binned by categories of protein abundances, as shown. Two thresholds, corresponding to the 25th and 75th percentiles of the properties over all natural coding sequences in the reference E. coli genome, were chosen for the discretization of these properties into 3 ordinal levels (white lines).

1. Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538 (2010).

2. Allert, M., Cox, J. C. & Hellinga, H. W. Multifactorial determinants of protein expression in prokaryotic open reading frames. J. Mol. Biol. 402, 905–918 (2010).

3. Sjöström, M. & Wold, S. A multivariate study of the relationship between the genetic code and the physical-chemical properties of amino acids. J. Mol. Evol. 22, 272–277 (1985).

Supplementary Figure 2 Advantages of carefully designed over natural or random sequences

(A) Uneven property distributions in natural and random sequences. The black profile shows the ranked distribution of property combinations obtained by generating 244,000 sequences at random (Supplementary Data 6). Grey bars show the corresponding distribution in natural E. coli genes. Both distributions are highly skewed compared to our systematic design (blue line).

(B) Occurrences in random and natural sequences are correlated (n=244,000 and 2580 sequences, respectively; Pearson's correlation r=0.52). Properties of natural sequences are partly shaped by inherent constraints that make certain combinations hard to obtain (e.g. high %AT content and strong structure). As a result, natural processes have likely evolved to avoid requiring combinations of incompatible properties.

(C) Focal sampling of sequence space by replicate series. Shown are the mean pairwise sequence identities within (error bars show standard deviation across series) and between factorial series (error bars show standard deviation of mean identities between pairs of factorial series) at the nucleotide and amino acid levels (n=56 series and 1540 pairs of series, respectively). Red lines mark random expectations. The 56 full-factorial series were constructed to maximize within-series while minimizing between-series identities (Supplementary Code 26; Supplementary Data 7).
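The statistic in panel C can be sketched as a mean over all pairs of equal-length sequences. The three 6-nt sequences are hypothetical stand-ins for a series of 96-nt designs, and the helper names are invented for this example. The last line simply checks that 56 series give the 1540 between-series pairs quoted above.

```python
from itertools import combinations
from math import comb

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mean_pairwise_identity(seqs):
    """Average identity over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(identity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical miniature "series" of three aligned sequences.
series = ["ATGAAA", "ATGAAG", "ATGCAG"]
within = mean_pairwise_identity(series)
print(f"mean within-series identity: {within:.3f}")

# Number of unordered pairs among 56 factorial series.
print(comb(56, 2))
```

The same function applied to translated sequences would give the amino acid level identities.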

(D) Distributions of designed property scores. Designed scores (black) are representative of wild-type E. coli distributions (red lines). Background shadings mark the separation between ordinal levels used for design (see Figure 1B; Supplementary Data 5). Continuous scores cluster at level boundaries because extreme levels are usually populated by mutations from medium-level sequences during the design process. BtlS nested within C-terminal BtlP are shown in dark grey.

(E) Correlations between property scores. Pairwise Pearson correlation coefficients between design scores in the whole library (blue dots; n=244,000 sequences) or within each factorial series (grey dots and boxplots; n=4,374 sequences) are considerably lower than those observed in the natural genome (red dots, n=1540 genes). Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges and middle lines mark medians.
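To make the notion of a full-factorial series concrete, the sketch below enumerates every combination of ordinal levels for a deliberately abridged set of factors. The number of levels per factor follows the legends of Supplementary Figure 1 (2 for %AT, 3 for MHI, CAI and a structure property), but the factor names, the level labels and the reduction to four factors are assumptions for illustration; the sketch does not reproduce the 4,374 combinations of an actual series.

```python
from itertools import product

# Abridged, illustrative factor set (names and labels are hypothetical).
levels = {
    "AT": ("low", "high"),
    "MHI": ("low", "medium", "high"),
    "CAI": ("low", "medium", "high"),
    "STR": ("low", "medium", "high"),
}

# A full factorial design contains every combination of levels exactly once.
design = [dict(zip(levels, combo)) for combo in product(*levels.values())]

print(len(design))  # 2 * 3 * 3 * 3 combinations
print(design[0])
```

Because each level of one factor is paired equally often with each level of every other factor, designed properties are uncorrelated by construction, which is the behavior panel E contrasts with natural genes.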

Supplementary Figure 3 Sequence logos of replicate factorial series

The 96 positions of the designed sequences are shown as a sequence logo for each of the 56 independent factorial series. Series identification numbers are shown on top. At each position, bases are arranged by decreasing frequencies from top to bottom, with sizes proportional to their frequency. Histograms show the distribution of pairwise differences between sequences in the series at the nucleotide (red) and amino acid (blue) levels (Supplementary Data 8). As intended by design, the consensus sequence is distinctly different for each series. Variations are well distributed all over the sequence, with some positions more variable than others. Contrasting with nucleotide differences, the distribution of pairwise amino acid differences is often multimodal. This behavior stems from the initial enforcement and eventual relaxation of constraints to favor synonymous mutations during the design process. A sizeable number of sequence variants within each series are synonymous.

Supplementary Figure 4 Library coverage by high-throughput sequencing

(A) Distribution of read counts per strain aggregated over all sequencing libraries (n=745,595,539 reads). The bulk of the library (90%) produced between 10³ and 10⁴ reads per strain (Supplementary Data 9). One construct was never observed by sequencing, and 134 others produced fewer than a hundred reads each. In contrast, some constructs are highly enriched (26 produced more than 10⁵ reads).

(B) Library multiplexing map. Multiple libraries were pooled on the same sequencing lane and demultiplexed using barcodes. In all, 166 libraries were loaded on 9 lanes of Illumina flow cells and run on a HiSeq 2500. Asterisks denote enumerations of libraries with names derived from the same root (Supplementary Data 10).

(C) Distribution of read counts per strain for each library. Library name, total read counts after demultiplexing and mapping, as well as fit parameters for a negative binomial density (shown in red) are shown. For clarity, axes' names and labels are drawn once at the bottom right. Backgrounds are color coded according to read number (see thermometer on bottom right). Bars exceeding the range of the graph are colored in dark gray. The most informative library for determining the native composition of the unscreened library is FIT-SEQ_NoCpl_Gen0_round1 (right column, fifth row), in which 242,516 strains (99.4% of the library) are covered by >10 reads.
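The legend does not say how the negative binomial parameters were estimated. One simple possibility, shown here purely as an assumption, is a method-of-moments fit: with size r and success probability p, the mean is r(1−p)/p and the variance is r(1−p)/p², so p = mean/variance and r = mean²/(variance − mean). The read counts are invented toy values.

```python
from math import exp, lgamma, log
from statistics import mean, variance

def fit_negative_binomial(counts):
    """Method-of-moments estimates (r, p); requires variance > mean."""
    m, v = mean(counts), variance(counts)
    if v <= m:
        raise ValueError("counts are not overdispersed")
    return m * m / (v - m), m / v

def nb_pmf(k, r, p):
    """P(X = k) = Gamma(k + r) / (Gamma(r) k!) * p**r * (1 - p)**k."""
    return exp(
        lgamma(k + r) - lgamma(r) - lgamma(k + 1)
        + r * log(p) + k * log(1 - p)
    )

# Hypothetical read counts per strain for one small library.
counts = [1, 3, 5, 7, 24]
r, p = fit_negative_binomial(counts)
print(f"r = {r:.3f}, p = {p:.3f}")
```

A maximum-likelihood fit would generally be preferable for real libraries; the moment estimates are mainly useful as a quick check or as starting values.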

Supplementary Figure 5 Inducible translational coupling device permits tunable control of translation initiation

(A) Influence of amber codon number and position on induction of translational coupling. Population average fluorescence signals were measured by flow cytometry at mid-exponential growth under increasing dilution of unnatural amino acid (AcF; Supplementary Data 11). The position and number of amber stop codons were varied in a development version of the reporter system showing poor translation in the absence of coupling. Points and shaded backgrounds show the means and standard deviations from 3 biological replicates (color as shown). The construct pGC4470, which bears a single amber codon at the fifth codon of the leader sequence, provides greater induction though slightly lower repression (green line). Since ribosomes terminating at this position show minimal interference with STR−30:+30 (Figure 3A), this version of the device was retained for the final reporter.

(B) Inducible translation coupling enables quantitative control of translation rate. Distribution of cellular fluorescence measured by flow cytometry under increasing dilution of AcF (color as shown) for construct pGC4470 (green line in panel A).

(C) The unnatural suppressor system recapitulates the effect of sense and stop codons. The amber stop codon (TAG) was replaced by the ochre stop (TAA) and other sense point-mutants (AAG, TAC and TTG) in the context of 10 reporter variants differing in sequence over the first 10 codons after the start codon. The variants exhibit different expression patterns and are shown in order of increasing expression ratio (full over no induction). In the absence of AcF, amber behaves comparably to ochre, demonstrating little leakage and efficient termination. Expression levels attained under induction by 2.5 nM AcF are almost as high as those obtained with sense codons, demonstrating the high read-through efficiency of the system (Supplementary Data 12).

(D) The early amber codon in the leader does not trigger global translation shutdown. Shown are growth curves for constructs yielding comparably low (4822) or high (4787) protein expression across variants of the amber stop codon (Supplementary Data 13). Comparable growth rates across these strains show that the 5-amino-acid minigene produced from the leader sequence under normal conditions of reporter initiation does not adversely affect cellular growth or global expression1. Because the minigene and its immediate context are invariant across the library, we expect these observations to hold true for all strains in this study.

1. Dinçbas, V., Heurgué-Hamard, V. et al. Shutdown in protein synthesis due to the expression of mini-genes in bacteria. J. Mol. Biol. (1999). doi:10.1006/jmbi.1999.3028

Supplementary Figure 6 Measurements of protein production under normal conditions of initiation and relationship to design factors

(A) High-throughput measurements of protein production are comparable to individually measured performances. Shown is a scatter plot of flow-cytometry data measured on individual cultures versus FACS-Seq data under conditions of normal initiation. Points mark the mean of at least 3 biological replicates and grey arrows their standard deviation (n=310 strains). The red line is a linear regression fit, excluding outliers (grey data points). We find excellent agreement between the two types of data (Pearson's correlation r=0.95). The compression on the low end reflects the weaker sensitivity of the benchtop flow cytometer used for individual measurements as compared to the more sophisticated FACS machines used for the high-throughput experiments. Most outliers show large standard deviations and probably correspond to the rise of mutations outside of the sequenced region in either assay.

(B) High-throughput measurements of protein production are highly reproducible. Shown are pairwise scatterplots of processed protein production for 4 biological replicates (Supplementary Data 15). Points are first plotted in solid grey to render isolated outliers and then as transparent black to provide a sense of data density (sample size as shown). Cells from replicates 1-2 and 3-4 were sorted with different FACS machines (see Online Methods). Replicates 1-3 and replicates 2-4 were pooled for sequencing. The reproducibility of the measurements is excellent (r=0.99 on average; individual correlation coefficients as shown).

(C) Sizeable design error in the molecular Design of Experiment. Shown are the cumulative distributions of the coefficients of variation in PNI amongst experimental replicates (red, experimental error) and the 3 close design replicates within each series (sequences with identical factorial properties and 1-4 nt differences; blue, design error). The design error is distinctly larger than the experimental error, testifying to the inability of the factorial categorization to fully capture functional variations between highly related sequences.
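The comparison in this panel rests on two simple quantities, which the following sketch spells out: the coefficient of variation of a small group of measurements and the empirical cumulative distribution of those coefficients. Function names and the toy values are hypothetical; the study's own processing may differ.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def empirical_cdf(values):
    """Return (value, cumulative fraction) pairs in increasing order."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


# Hypothetical PNI values: replicate measurements of one strain versus
# three closely related design replicates.
experimental_cv = coefficient_of_variation([9.8, 10.0, 10.2])
design_cv = coefficient_of_variation([8.0, 10.0, 12.0])
print(experimental_cv < design_cv)
print(empirical_cdf([experimental_cv, design_cv]))
```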

(D) Series-wise decomposition of explainable variance by linear regression. Top: same plot as Figure 2B but derived from multiple linear regressions (MLR) on continuous property scores, as opposed to ANOVA on discretized score levels (Supplementary Code 8; Supplementary Data 17; n∈[4,429; 4,372] strains for each series, except n=3,418 for incomplete series #136). Series order and color scheme are maintained for comparison. Bottom: MLR and ANOVA yield comparable results. Shown are scatter plots of series-wise explanatory powers obtained through MLRs versus ANOVAs (n=56 series). Left: total explanatory powers; Right: effect sizes for each design property and their second-order interactions (log scale; n=35 properties). The largest contributions are highly correlated, although MLRs consistently provide slightly better results than ANOVAs.
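To make the distinction between the two approaches concrete, here is a single-factor sketch: the fraction of variance explained by a regression on a continuous score versus an ANOVA on the same score after discretization into levels. It is a deliberately reduced illustration (one property, no interactions), not the Supplementary Code; the function names are assumptions.

```python
def r2_linear(x, y):
    """Variance explained by simple linear regression of y on a continuous score x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def r2_anova(levels, y):
    """Variance explained by a one-way ANOVA: between-level over total sum of squares."""
    grand = sum(y) / len(y)
    sst = sum((v - grand) ** 2 for v in y)
    groups = {}
    for level, v in zip(levels, y):
        groups.setdefault(level, []).append(v)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    return ssb / sst


# Toy data: the response tracks the continuous score exactly, so
# discretizing the score into two levels discards some information.
score = [1, 2, 3, 4]
level = ["low", "low", "high", "high"]
response = [2, 4, 6, 8]
print(r2_linear(score, response), r2_anova(level, response))
```

In this toy case the continuous regression explains all of the variance while the two-level ANOVA explains 80%, which is one way a regression on continuous scores can edge out an ANOVA on discretized levels.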

(E) Recursive regression tree resolves hierarchical dependencies between properties. At each node, data are split according to the rule shown in the colored box, heuristically chosen to maximize the explained variance (Supplementary Code 9; Supplementary Data 18; n=242,269 strains). Box colors mark properties concerned with the rule at a given node, following the color code in panel D and Figure 2B. R² are shown below boxes and summarized in the upper-right pie. Average protein productions within each branch are shown above boxes and color-coded according to the upper-left thermometer.
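The splitting rule at a single node of such a tree can be sketched as an exhaustive search for the threshold that maximizes explained variance. This is a generic, minimal version for one property; the function name `best_split` is hypothetical and the actual tree was built with the Supplementary Code, which may use a different heuristic.

```python
def _sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, r2) for the rule `x <= threshold` that explains most variance in y."""
    sst = _sum_sq(y)
    if sst == 0:
        return None, 0.0  # nothing to explain
    pairs = sorted(zip(x, y))
    best_threshold, best_r2 = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal property values cannot be separated
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        r2 = 1 - (_sum_sq(left) + _sum_sq(right)) / sst
        if r2 > best_r2:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_r2 = r2
    return best_threshold, best_r2


# Toy example: protein production jumps once the property exceeds 2.5.
print(best_split([1, 2, 3, 4], [0, 0, 10, 10]))
```

A full tree applies the same search recursively within each branch and across all properties, keeping the best rule at every node.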

(F) Design factors better describe mutational series characterized by higher phenotypic diversity. The series-wise mean (left scatter plot) and variance (middle scatter plot) in PNI are plotted against the explanatory power (R²) achieved by all design factors and their second-order interactions in ANOVA (Supplementary Code 7; Supplementary Data 16; n=56 series). Red lines show linear regression fits (coefficients as shown). Higher mean PNI is associated with lower design factor contributions to the observed variance. In contrast, higher variance is associated with higher explanatory power of the design factors. Mean and variance in PNI are moderately correlated (right scatter plot). Series not well explained by design factors fail to implement the intended phenotypic variability. In particular, an excessively high mean PNI is likely symptomatic of a failure to design functionally relevant secondary structure in the initiation region.

(G) Enrichment of codon-adapted sequences amongst the highest protein producers. Left: Scatter plot of CAI versus PNI, with data points colored by STR-30:+30, as shown. Dark lines represent quartiles of CAI for every percentile of PNI. Grey lines show the same quantities calculated over the whole library. Blue and red lines show linear regressions using data below and above the top PNI quintile, respectively (coefficients as shown). Right: Scatter plot of PNI against CAI colored by STR-30:+30, for the highest quintile of PNI (red regression on left panel). The transparent dark line is a linear regression (coefficient as shown). Grey lines mark the quartiles of PNI for every percentile of CAI. Number of strains as shown.
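Several panels summarize one variable for every percentile of another (for example, CAI per percentile of PNI). A minimal sketch of that binning is given below for the median; quartiles follow the same pattern. The function name and the equal-count binning scheme are assumptions for illustration.

```python
from statistics import median


def binned_medians(x, y, n_bins=100):
    """Median of y within consecutive equal-count bins of x (percentiles when n_bins=100)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    n = len(order)
    medians = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if idx:  # bins can be empty when there are fewer points than bins
            medians.append(median(y[i] for i in idx))
    return medians


# Toy data: ten strains, five bins of two strains each.
pni = list(range(10))
cai = [2 * v for v in pni]
print(binned_medians(pni, cai, n_bins=5))
```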

Supplementary Figure 7 Measurements of protein production under conditions of facilitated initiation and relationship to design factors

(A) Bulk high-throughput measurements of protein production are comparable to individually measured performances. Shown is a scatter plot of individual flow-cytometry data versus FACS-Seq data under conditions of facilitated initiation. Points mark the mean of at least 3 biological replicates and grey arrows their standard deviation (n=310 strains). The red line is a linear regression fit, excluding outliers (grey data points). We find good agreement between the two types of data (r=0.90).

(B) High-throughput measurements of protein production are reproducible. Shown are pairwise scatterplots of processed protein production for 4 biological replicates (Supplementary Data 15). Points are first plotted in solid grey to render isolated outliers and then as transparent black to provide a sense of data density (sample size as shown). Cells from replicates 1-2 and 3-4 were sorted with different FACS machines (see Online Methods). Replicates 1-3 and replicates 2-4 were pooled for sequencing. The reproducibility of the measurements is generally good, although the first replicate shows a somewhat inconsistent signal (r=0.87 on average; r=0.91, excluding replicate #1; individual correlation coefficients as shown). We retained that replicate for the calculation of PFI because it nonetheless provided valuable information.

(C) Lower design error under facilitated initiation. Shown are the cumulative distributions of the coefficients of variation in PFI amongst experimental replicates (red) and the 3 close design replicates within each series (sequences with identical factorial properties and 1-4 nt differences; blue). Unlike the situation under normal initiation (Supplementary Fig. 6C), the design error is hardly distinguishable from the experimental error under coupling. At least in part, this behavior arises from the combination of lesser experimental reproducibility and lower variance in measured fluorescence across the library. Facilitating initiation may also directly mitigate the impact of the original factors underlying the Design Error (e.g. misprediction of secondary structures).

(D) Series-wise decomposition of explainable variance. Top: same plot as Figure 2B but derived from multiple linear regressions (MLR) on continuous property scores, as opposed to ANOVA on score levels (Supplementary Code 11; Supplementary Data 20; n=238,458 for the whole dataset and n∈[3,093; 4,368] strains for each series). Series order and color scheme are maintained for comparison. Bottom: MLR and ANOVA yield comparable results under facilitated conditions of initiation. Shown are scatter plots of series-wise explanatory powers obtained through MLRs versus ANOVAs. Left: total explanatory powers (n=56 series); Right: effect sizes for each design property and their second-order interactions (log scale; n=35 properties). The largest contributions are highly correlated, although MLRs consistently provide slightly better results than ANOVAs.

(E) Recursive regression tree resolves hierarchical dependencies between properties. At each node, data are split according to the rule shown in the colored box, heuristically chosen to maximize the explained variance (Supplementary Code 12; Supplementary Data 21; n=238,458 strains). Box colors mark properties concerned with the rule at a given node, following the color code in panel D and Figure 2B. R² are shown below boxes and summarized in the upper-right pie. Average protein productions within each branch are shown above boxes and color-coded according to the upper-left thermometer. Unlike for PNI, CAI shows sizable contributions to larger PFI.

(F) Codon usage modulates protein production and is subordinate to non-limiting translation initiation. Left: Scatter plot of CAI versus PFI colored by STR+01:+60, as shown. The median CAI for each percentile of PFI is plotted in yellow. Scaled equivalents for the medians of STR-30:+30 (red), STR+01:+60 (blue) and STR+31:+90 (purple) are shown for comparison. Past a production threshold (dashed vertical line), increasingly faster elongation rates corresponding to higher PFI are only permitted in strains with commensurate improvement in CAI. Below the threshold, PFI remains fully limited by initiation, as determined by strong STR+01:+60 that are not well unfolded by the coupling mechanism. Right: Scatter plot of PFI against CAI colored by PNI, excluding the lowest decile of PFI (red regression on left panel). The transparent dark line is a linear regression (Pearson's coefficient as shown). Grey lines mark the quartiles of PNI for every percentile of CAI. Number of strains as shown.

Supplementary Figure 8 Impact of other codon and amino acid metrics on protein production under conditions of normal and facilitated translation initiation

(A) Manipulation of the codon ramp does not impact protein production. Left: Distribution of PNI (top; n=242,269 strains) and PFI (bottom; n=238,458 strains) according to the predicted position (BtlP) and strength (BtlS) of the translation bottleneck. Boxplots over light grey background show distributions of production by amino-acid position along the designed sequence. At each position, blue and red boxes show lower and higher levels of BtlS strains, respectively. Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges and middle lines mark medians. Outliers beyond whiskers are not plotted. Box widths are quadratically related to the sample size. No systematic trend is apparent across N-terminal positions. The two colored boxplots on medium grey background show pooled data across all N-terminal positions, broken down by BtlS level. The two grey boxplots show distributions binned by BtlP level. We observe no differences in protein production between these groups. Right: Scatter plot of BtlS versus PNI (top) and PFI (bottom) for N-terminal (black) and C-terminal (red) levels of BtlP. Grey and dark red lines show corresponding quartiles for each percentile of protein production. Unlike CAI, the strength of the codon ramp does not correlate with variation in the translation regime.
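The boxplot elements described here (interquartile box, median line, whiskers at 1.5 times the interquartile range, unplotted outliers) can be computed as in the sketch below. It assumes the common Tukey convention in which each whisker stops at the most extreme data point within 1.5 interquartile ranges of the box; the legend does not state the exact convention or quantile method, so both are assumptions.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Quartiles, whisker ends and outlier count under the 1.5 x IQR convention."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "n_outliers": len(values) - len(inside),
    }


# Toy production values with one extreme strain.
print(boxplot_stats([1, 2, 3, 4, 5, 100]))
```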

(B) The codon ramp does not explain the relationship between protein production under normal and facilitated initiation. Far left: Scatter plot between PNI and PFI colored by the ramp bottleneck position, as shown. Middle left: correlations between PFI and BtlP for each percentile of PNI. Middle right: Scatter plot between PNI and PFI colored by the ramp bottleneck strength, as shown (n=237,644 strains for all panels). Since designed BtlS values are nested into the low BtlP level, only constructs with an N-terminal bottleneck are shown. Far right: correlation between PFI and BtlS for each percentile of PFI. These plots indicate no systematic associations between bottleneck strength and protein production.

(C) Optimization of codon indices to explain the relationship between protein production under normal and facilitated initiation. Barplots show the semi-partial correlation between PFI and various codon metrics controlling for the effect of PNI on PFI for all sequences in the library, using parametric (Pearson) and non-parametric (Spearman) methods, as shown (Supplementary Code 17; Supplementary Data 30; n=237,644 strains). Amongst the metrics investigated, the original CAI1 provides the best correlations. We tested the tAI, which weights codons according to tRNA abundances2,3; RSCU, a simple measure of relative synonymous codon usage4,5; the frequency of codons in highly expressed genes (heg-fb)5; a measure of codon decoding rate by Dana and Tuller6; a codon index recently developed by Boël et al., based on the study of a large gene library7; an in vivo measure of codon decoding time by Chevance et al. that used a reporter assay based on an anti-terminator function8; and the hydropathy9 averaged over the full designed sequences (Full HI, as opposed to the HI used for the factorial design, which is defined on a more restricted region). Using a heuristic optimization procedure, we derived codon indices that maximize the partial correlation of interest. We did so starting from codons with equal weights (opt. index) or from the original CAI (opt. CAI). Optimized metrics are highly similar and vastly outperform other metrics. However, they show only medium capacity to explain gene expression in E. coli (see panel D).
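A semi-partial correlation of this kind can be sketched as follows: PFI is first regressed on PNI, and the residuals are then correlated with the codon metric. This is a generic Pearson version written for illustration (the Spearman variant would rank-transform the inputs first); the helper names are hypothetical and the Supplementary Code may proceed differently.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def residuals(y, x):
    """Residuals of the simple linear regression of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
    intercept = my - slope * mx
    return [v - (intercept + slope * u) for u, v in zip(x, y)]


def semipartial_correlation(pfi, metric, pni):
    """Correlation between the metric and the part of PFI not explained by PNI."""
    return pearson(residuals(pfi, pni), metric)


# Toy data: PFI = 2 * PNI + metric, with the metric uncorrelated with PNI.
pni = [1, 2, 3, 4]
metric = [1, -1, -1, 1]
pfi = [2 * p + m for p, m in zip(pni, metric)]
print(semipartial_correlation(pfi, metric, pni))
```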

(D) Correlation between codon metrics and gene expression in E. coli. Barplots show Pearson and Spearman correlations of the various codon metrics with protein abundance (left) and mRNA abundance (right), as reported by Taniguchi et al.10 (n=575 genes). The heg-fb provides the best correlations, followed closely by CAI, tAI and RSCU. Indices derived from experimental studies tend to perform poorly. Indices derived from this work perform better but not as well as those derived from codon frequencies observed in natural genes.

1. Sharp, P. M. & Li, W. H. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987).

2. Reis, M. D. Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res. 32, 5036–5044 (2004).

3. Tuller, T. et al. An Evolutionarily Conserved Mechanism for Controlling the Efficiency of Protein Translation. Cell 141, 344–354 (2010).

4. Sharp, P. M., Tuohy, T. M. & Mosurski, K. R. Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res. 14, 5125–5143 (1986).

5. Hilterbrand, A., Saelens, J. & Putonti, C. CBDB: The codon bias database. BMC Bioinformatics 13, 62 (2012).

6. Dana, A. & Tuller, T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 42, 9171–9181 (2014).

7. Boël, G. et al. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529, 358–363 (2016).

8. Chevance, F. F. V., Le Guyon, S. & Hughes, K. T. The effects of codon context on in vivo translation speed. PLoS Genet. 10, e1004392 (2014).

9. Kyte, J. & Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132 (1982).

10. Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538 (2010).

Supplementary Figure 9 Factors confounding the impact of amino acid hydropathy on protein expression

(A) Hydropathy is related to translation speed. Left: scatter plot between PNI and PFI colored by Full HI (the mean hydropathy of the full designed sequence), as shown. Right: Pearson's correlations between PFI and Full HI for each percentile of PNI. The association with Full HI is comparable to that of CAI, though weaker. This suggests that amino-acid composition has a small effect on elongation that may feed back on initiation at the beginning of coding sequences. Nonetheless, this effect is much less important than expected from the analysis of genome-wide functional data.

(B) The hydropathy signal that guided the factorial design may be confounded by functional localization of the proteins. Left: Boxplots of protein abundance in E. coli partitioned into cytoplasmic (n=438 genes) versus other subcellular localization of the proteins (n=121 genes). Cytoplasmic proteins tend to be more abundant. Middle: Boxplots of average hydropathy index as used in the design properties (amino acid positions 10 to 20) for the same partition (n=2,317 and 1,356 genes, respectively). Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges, central lines mark the medians and notches their 95% confidence intervals. Membrane-associated and periplasmic proteins tend to show higher hydropathy, probably due to the necessity to cross or be included into biological membranes1. Right: Scatter plot of protein abundance versus hydropathy index. Data points are colored according to the subcellular localization of the corresponding protein, as shown. The relationship that initially motivated inclusion of the hydropathy property in the experimental design is largely driven by data points corresponding to non-cytoplasmic proteins (black line; see also Supplementary Fig. 1D), so that removing these largely decreases the observed Pearson's correlation (grey line, coefficients as shown). Protein abundance data are taken from Taniguchi et al.2. Protein localization data are from Han et al.3.

(C) Mutational series #344 shows an unusual impact of MHI. Shown is a scatter plot of explainable effect sizes of MHI as calculated by ANOVA on PNI and PFI (n=56 series).

(D) Variation of amino-acid composition between extreme MHI in series #344 points to a double proline. Sequence logos of the top and bottom deciles of MHI show the frequency of amino acids at each position. Constructs with the lowest MHI are notably characterized by the presence of a double proline at positions 17-18 (red dashed box). Double prolines are known to be problematic for translation4.

(E) The presence of double prolines in series #344 is highly correlated with MHI. Shown are boxplots of MHI binned by the number of double prolines in the sequences. Boxes span interquartile ranges, whiskers correspond to 1.5 times these ranges and central lines mark the medians. The strong apparent effect of MHI in this series might be confounded by the presence of double prolines in more than a third of the series. Number of sequences (n) as shown.

(F) Double prolines are linked to low protein production. Scatter plot of PNI (left) or PFI (right) as a function of MHI. Sequences with one and two double prolines are highlighted in red and blue, respectively. These are associated with much lower MHI and protein production.

(G) Mutations in the double proline increase protein production. Scatter plot of PNI (left) or PFI (right) for pairs of sequences that differ only by one amino acid mutation in the double proline. Points highlighted in red differ only by a single nucleotide. Mutants tend to show increased protein production, further strengthening the role of the double proline in this case.

1. Charneski, C. A. & Hurst, L. D. Positive charge loading at protein termini is due to membrane protein topology, not a translational ramp. Molecular Biology and Evolution 31, 70–84 (2014).

2. Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329, 533–538 (2010).

3. Han, M.-J. et al. Genome-wide identification of the subcellular localization of the Escherichia coli B proteome using experimental and computational methods. Proteomics 11, 1213–1227 (2011).

4. Doerfel, L. K. et al. EF-P Is Essential for Rapid Synthesis of Proteins Containing Consecutive Proline Residues. Science 339, 85–88 (2013).

Supplementary Figure 10 Growth measurements and impact of coding sequences properties

(A) Growth measurements are reproducible. Pairwise comparisons of triplicate estimates of relative growth rates after competition in non-coupling conditions for ca. 60 generations (Supplementary Data 15). Due to the low number of associated sequencing reads, growth estimates are more variable in slow growing strains. Number of observations (n) and pairwise Pearson correlations (r) as shown.
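One simple way to turn competition read counts into a relative growth estimate is the per-generation log2 change in a strain's frequency, sketched below. The exact estimator used in the study is described in its methods and is not reproduced here; this formula, the function name and the handling of undetected strains are assumptions for illustration.

```python
import math


def relative_growth(reads_start, reads_end, total_start, total_end, generations):
    """Per-generation log2 change in strain frequency; None if the strain is undetected."""
    if reads_start == 0 or reads_end == 0:
        return None  # strains lost from a sample cannot be estimated
    f_start = reads_start / total_start
    f_end = reads_end / total_end
    return math.log2(f_end / f_start) / generations


# Hypothetical counts: a neutral strain, a depleted strain and an extinct strain.
print(relative_growth(100, 100, 1000, 1000, 60))
print(relative_growth(100, 25, 1000, 1000, 2))
print(relative_growth(100, 0, 1000, 1000, 60))
```

Because the estimate depends on a ratio of read counts, strains with few reads yield noisier values, in line with the greater variability noted for slow growers.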

(B) Shorter competitions limit the loss of diversity. The fraction of the total library that can be measured in samples from different time points decreases with competition time. The three replicates shown in panel A are averaged and compared to a fourth population sampled after ca. 13 and 28 generations (color as in C).

(C) Shorter competition increases the dynamic range of growth measurements. The distribution of growth rates estimated from the different samples are shown as histograms (color as shown). Slower growing strains become undetectable in longer competitions.

(D) Competition time impacts growth estimates. Scatter plots show pairwise comparison of growth rates between the three sampled time points. Strains missing in one sample are shown in blue with an arbitrary low value in place of the missing datum. The slowest growers are strongly counter-selected and only observed at earlier time points before they go extinct. Number of observations (n, excluding blue points) and pairwise Pearson correlations (r) as shown.

(E) Codon adaptation has little impact on growth. Far Left: Scatter plot of WNI as a function of PNI colored by CAI, as shown. Middle Right: Scatter plot of WFI as a function of PFI colored by CAI, as shown. Dark lines show median of growth for each percentile of protein production and a loess smoother. Yellow and cyan lines show the same information for the top and bottom deciles of CAI, respectively. Middle Left: Pearson's correlations between WNI and CAI for every percentile of PNI. Far Right: Pearson's correlations between WFI and CAI for every percentile of PFI. Color and size of the points convey the mean and variance of CAI, respectively. The grey line is a loess smoother highlighting the trend in the correlations. Higher CAI is weakly associated with a small growth improvement at higher protein production.

(F) Strong interactions between secondary structures impact growth. Interaction plots for the effect of the three designed secondary structures on growth under non-coupling conditions in rich (Left) and minimal (Right) media. Red lines mark the medians per level of STR-30:+30. Blue and green lines show the medians for combinations of STR+01:+60_STR-30:+30 and STR+31:+90_STR-30:+30_STR+01:+60, respectively. Structure strengths are depicted below the boxplots for clarity. Boxes mark interquartile ranges, whiskers measure 1.5 times these ranges (n∈[8,335; 8,905] and n∈[6,090; 8,213] strains for each plot in the left and right panel, respectively).

(G) Secondary structures affect growth beyond their impact on protein production. Correlations between STR+01:+60 and WNI (Far Left) and WFI (Middle Left) for every percentile of protein production in the respective initiation conditions. Correlations between STR+31:+90 and WNI (Middle Right) and WFI (Far Right) for every percentile of protein production. Color and size of the points convey the mean and variance of the structures, respectively. The grey line is a loess smoother highlighting the trend in the correlations. Weaker structures are generally associated with faster growth, especially at low protein production.

Supplementary Figure 11 mRNA measurements and impact of coding sequences properties

(A) High-throughput assay of RNA decay. Constant quantities of standard strains are spiked into library samples. Standard sequencing reads increase as reporter transcripts decay, defining corrective coefficients (Supplementary Data 33). Corrected time series are fit to an exponential decay model to estimate RNA abundance at steady state (RNASS; t=0), transcript half-life (RNAHL) and the final fraction of protected RNA (RNAPTX; t=+∞) for each strain.
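The spike-in logic can be illustrated with a small sketch: because the standards are added in constant quantity, any rise in their read counts over time reflects the shrinking pool of reporter RNA, and rescaling by that rise restores comparable units across time points. The specific form of the coefficients below (first time point as reference) is an assumption, not the published procedure.

```python
def correction_coefficients(standard_reads):
    """Scale factors that bring the standards' reads back to their first-time-point level."""
    reference = standard_reads[0]
    return [reference / s for s in standard_reads]


def correct_series(reporter_reads, standard_reads):
    """Apply the spike-in coefficients to one strain's time series of read counts."""
    coefficients = correction_coefficients(standard_reads)
    return [r * c for r, c in zip(reporter_reads, coefficients)]


# Hypothetical counts: the strain's raw reads look constant, but the standards
# double at each time point, so the corrected abundance actually halves.
print(correct_series([1000, 1000, 1000], [100, 200, 400]))
```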

(B) Time-series of RNA measurements are noisy but reproducible. Pairwise comparisons of read counts obtained from two biological replicates at different time points (Supplementary Data 15). RNA-Seq counts are normalized by DNA-Seq counts to account for variations in strain abundances, but not yet corrected using decay standards. One unit represents 10³ RNA-Seq read counts. The number of complete observations (n strains) and pairwise Pearson's correlations (r) between replicates are as shown. Correlations tend to diminish at later time points.

(C) Exponential decay fit captures relevant parameters. For each replicate, the distribution of the sum-of-squares explained by the fit divided by the total sum-of-squares (akin to R²) is shown to provide an estimate of the fits' quality (left; Supplementary Code 21; Supplementary Data 34). The middle and right panels are scatter plots of estimated RNASS and RNAPTX plotted against their nearest measured equivalent, i.e. read counts at t=0 and the ratio of read counts at t=40 over t=0, respectively (number of strains as shown). Identity lines are shown in red. Correlations between estimated and observed values are excellent (Pearson's coefficients r as shown).
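A decay model with a protected plateau can be sketched as y(t) = RNASS × [RNAPTX + (1 − RNAPTX) × 2^(−t/RNAHL)]. The exact model and fitting routine are in the Supplementary Code; the functional form above, the grid search over candidate half-lives and the function name `fit_decay` are assumptions chosen to keep the example dependency-free.

```python
import math


def fit_decay(times, values, half_lives):
    """Fit values ~ a + b * exp(-k t) by least squares over a grid of half-lives.

    Returns RNASS (= a + b), RNAHL, RNAPTX (= a / (a + b)) and the fraction of
    the total sum of squares explained by the best fit.
    """
    n = len(times)
    y_mean = sum(values) / n
    sst = sum((y - y_mean) ** 2 for y in values)
    best = None
    for hl in half_lives:
        k = math.log(2) / hl
        e = [math.exp(-k * t) for t in times]
        e_mean = sum(e) / n
        # For a fixed half-life the model is linear in a and b.
        b = sum((ei - e_mean) * (y - y_mean) for ei, y in zip(e, values)) / sum(
            (ei - e_mean) ** 2 for ei in e
        )
        a = y_mean - b * e_mean
        sse = sum((y - (a + b * ei)) ** 2 for ei, y in zip(e, values))
        if best is None or sse < best[0]:
            best = (sse, hl, a, b)
    sse, hl, a, b = best
    return {"RNASS": a + b, "RNAHL": hl, "RNAPTX": a / (a + b), "explained": 1 - sse / sst}


# Synthetic series: steady state 100, half-life 5, 10% protected (time units arbitrary).
times = [0, 5, 10, 20, 40]
values = [10 + 90 * math.exp(-math.log(2) / 5 * t) for t in times]
print(fit_decay(times, values, half_lives=range(1, 11)))
```

With few and noisy time points, the half-life is the least constrained of the three parameters, which is consistent with the lower reproducibility reported for RNAHL.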

(D) Pairwise comparison of estimated decay parameters between biological replicates. Due to the sensitivity of the fitting procedure, the initial (RNASS) and final (RNAPTX) states of the system are more reproducible than its dynamic component (RNAHL). We therefore refrained from analyzing RNAHL. Number of strains as shown.

(E) Diversity of RNASS profiles between replicate series. Red and grey dots mark medians and interquartile ranges, respectively. The red line highlights the median of the whole dataset (n=233,487 for the whole dataset and n∈[3,259; 4,351] strains for each series).

(F) Decay assay artifacts expose complex interactions between degradation and translation. Scatter plot of RNAHL (Left) and RNAPTX (Middle) versus PNI, colored by RNASS as shown. Dark lines mark median RNAHL or RNAPTX for each percentile of PNI and the corresponding loess smoothers. Yellow and cyan lines show the top and bottom deciles of RNASS, respectively. Number of strains as shown. Various correlations for each percentile of PNI are shown (Right). The positive association between RNAHL and RNASS is restricted to low PNI and progressively inverted, reflecting swifter protection of rapidly initiated transcripts. RNAPTX is linearly related to PNI and modulated by RNASS. Stronger structures are associated with increased RNAHL and RNASS, especially at low PNI.

(G) Mediation by translation mechanisms blurs the relationship between transcript stability and abundance. Scatter plot of RNASS as a function of RNAHL colored by PNI, as shown. Thick and thin dark lines show medians of RNASS per percentile of RNAHL and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of RNAHL, respectively. Number of strains as shown. The expected positive correlation between RNAHL and RNASS is confined to low protein production regimes and becomes negative with increasing PNI (correlations as shown).

(H) The association between mRNA stability and distal structure is most visible at low protein production. Scatter plot of RNAHL (Left) and RNASS (Right) as a function of STR+31:+90 colored by RNASS and RNAHL, respectively, as shown. Only strains from the lowest PNI decile (cyan regression line in panel I) are plotted. Black and grey lines mark the median and quartiles of the y-axis measurements for each percentile of STR+31:+90. Number of observations (n strains) and Pearson's correlation coefficients (r) as shown.

Supplementary Figure 12 Polysome profiling and impact of coding sequences properties

(A) High-throughput targeted polysome profiling. The first five polysome fractions were extracted from a sucrose gradient, barcoded and sequenced in multiplex.

(B) Codon adaptation improves mRNA protection by translating ribosomes. Scatter plot of RNAPTX as a function of MRD colored by CAI, as shown (upper panel). Thick and thin dark lines show medians of RNAPTX per percentile of MRD and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of CAI, respectively. Pearson's correlation between RNAPTX and CAI increases with increasing percentiles of MRD (lower panel). Higher CAI leads to increased protection at higher translation regimes, presumably by ensuring smoother ribosome flow over the transcripts.

(C) Codon adaptation faintly increases RNA abundance. Scatter plot of RNASS as a function of MRD colored by CAI, as shown (upper panel). Thick and thin dark lines show medians of RNASS per percentile of MRD and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of CAI, respectively. Although modest, Pearson's correlation between RNASS and CAI increases with MRD (lower panel). This behavior probably reflects the effect of CAI on RNA protection.

(D) Apparent impact of strong distal structure on protein production. Scatter plot of PNI as a function of MRD colored by STR+31:+90, as shown (upper panel). Thick and thin dark lines show medians of PNI per percentile of MRD and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of STR+31:+90, respectively. Negative Pearson's correlation between STR+31:+90 and PNI for a given MRD peaks at intermediate MRD (lower panel), even though strong STR+31:+90 should slow ribosome progression. This relationship probably reflects the impact of STR+31:+90 on RNA abundance and stability (Supplementary Fig. 11F), which in turn affects the MRD of single transcripts and the collective protein production (see E,F).

(E) Ribosomal density and RNA abundance show similar relationships with distal structure. Data points show Pearson's correlations between STR+31:+90 and either MRD (red) or RNASS (blue) for each percentile of PNI. Solid lines are loess smoothers highlighting the trends in the correlations. The relationship between STR+31:+90 and MRD is probably a consequence of that linking STR+31:+90 to RNASS.

(F) Ribosomal density is driven by transcript abundance rather than distal structures. Scatter plot of RNASS as a function of STR+31:+90 colored by MRD, as shown. Thick and thin dark lines show medians of RNASS per percentile of STR+31:+90 and a loess smoother, respectively. Yellow and blue lines show the same information for the top and bottom deciles of MRD, respectively. RNASS decreases faintly with increasing STR+31:+90, irrespective of MRD levels. In contrast, low MRD is strongly associated with high RNASS (see also G).

(G) Variation of design properties and measurements in protein-growth phenotypic space. Coarse-grained grids of WNI versus PNI. Data are binned first by deciles of PNI and then by deciles of WNI as in Fig. 5H. In each grid, bins are color-coded to show the range of means of the parameter indicated in each vignette. Smoother variations across the grid are indicative of larger dynamic range and truer effects.
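The per-percentile correlations reported in the lower panels of B to E can be illustrated with a minimal sketch. It assumes that records are ranked by a binning variable, split into equal-sized bins, and that Pearson's correlation is computed within each bin. The function names and the toy values are hypothetical and are not taken from the authors' scripts.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def binned_correlation(binner, x, y, n_bins=100):
    """Rank records by `binner`, cut them into n_bins bins, return Pearson(x, y) per bin."""
    order = sorted(range(len(binner)), key=lambda i: binner[i])
    size = len(order) // n_bins
    result = []
    for b in range(n_bins):
        # The last bin absorbs any remainder when the data do not divide evenly.
        idx = order[b * size:(b + 1) * size] if b < n_bins - 1 else order[b * size:]
        result.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    return result

# Toy data (assumed): the binning variable plays the role of MRD,
# x and y the roles of CAI and RNAPTX.
mrd = [0, 1, 2, 3, 4, 5, 6, 7]
cai = [1, 2, 3, 4, 1, 2, 3, 4]
rna_ptx = [2, 4, 6, 8, 4, 3, 2, 1]
r_per_bin = binned_correlation(mrd, cai, rna_ptx, n_bins=2)
```

With 100 bins the same function yields one correlation per percentile, which can then be plotted against the bin rank.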

Supplementary Figure 13 Extensive phenotypic diversity between replicate factorial series

(A) Principal component analysis highlights the phenotypic spread of factorial series. The analysis is based on the correlation matrix between series-wise means of shown phenotypic variables (n=56 series).

(B) Visualization of the ensemble phenotypic differences between series. Spider plots are grouped by clusters separated by alternating white and grey backgrounds. Series identification numbers are given in the bottom right corner of each plot. Although each series explores the same property space, small initial phenotypic differences may cascade into the observed diversity. Understanding these differences represents a challenge for predictive biology.

Supplementary Figure 14 Examples of comparable structure profiles leading to different protein productions

Minimum free energies of predicted secondary structures are plotted as a function of window position for windows of different length, as shown. Constructs with similar structure profiles (same row) can be found in distinct regions of the protein production space, as indicated by the red gates in a scatter plot of PFI versus PNI (top). Conversely, very different structure profiles can yield the same production phenotypes (same column). These profiles also exemplify that different window lengths can yield quite dissimilar profiles for a given construct (e.g. 50 and 70 nucleotide-long windows on construct 71_33111112_1, middle-left plot).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14 (PDF 5936 kb)

Life Sciences Reporting Summary (PDF 165 kb)

Supplementary Tables

Supplementary Tables 1–3 (PDF 299 kb)

Supplementary Notes

Supplementary Note 1 (PDF 267 kb)

Supplementary Code 1

Parameter file. Used to parameterize python scripts involved with processing of sequencing data (Supplementary Code 2–5). (TXT 1 kb)

Supplementary Code 2

De-multiplex fastq. Python script to identify and trim custom sequencing barcodes. Supports parallelization. Outputs a separate fastq file for each barcode. (TXT 12 kb)
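As an illustration of the de-multiplexing step, the following minimal sketch groups reads by barcode and trims the barcode from each read. It assumes exact-match barcodes at the 5' end of well-formed four-line FASTQ records. The barcode names and sequences are hypothetical, and the authors' script may differ (for example in mismatch tolerance and in how parallelization is done).

```python
from collections import defaultdict

def read_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ lines, four lines per record."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual

def demultiplex(lines, barcodes):
    """Assign each read to the first barcode that prefixes it and trim that barcode."""
    bins = defaultdict(list)
    for header, seq, qual in read_fastq(lines):
        for name, bc in barcodes.items():
            if seq.startswith(bc):
                # Trim the barcode from both the sequence and its quality string.
                bins[name].append((header, seq[len(bc):], qual[len(bc):]))
                break
        else:
            bins["unassigned"].append((header, seq, qual))
    return bins

# Toy input (assumed): three reads and two hypothetical barcodes.
fastq_lines = [
    "@r1", "ACGTTTGACC", "+", "IIIIIIIIII",
    "@r2", "TTAGGGCATG", "+", "IIIIIIIIII",
    "@r3", "GGGGCCCCAA", "+", "IIIIIIIIII",
]
barcodes = {"fraction1": "ACGT", "fraction2": "TTAG"}
bins = demultiplex(fastq_lines, barcodes)
```

Each entry of `bins` could then be written to its own fastq file, one per barcode.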

Supplementary Code 3

Python wrapper for BWA and samtools. Maps the reads and runs quality checks by calling BWA and samtools. Supports parallelization. (TXT 3 kb)

Supplementary Code 4

Read counter. Python script to summarize the number of reads mapping to each target sequence from the bam files generated by Supplementary Code 3. Supports parallelization. (TXT 21 kb)

Supplementary Code 5

Count aggregator. Python script to aggregate count tables generated by Supplementary Code 4. (TXT 1 kb)
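A minimal sketch of what such an aggregation might look like is given below. It assumes that each library provides a mapping from sequence identifier to read count and that sequences missing from a library count as zero. The identifiers and library names are hypothetical.

```python
def aggregate_counts(tables):
    """Merge {library: {sequence_id: count}} into {sequence_id: {library: count}}."""
    libraries = sorted(tables)
    ids = sorted({sid for counts in tables.values() for sid in counts})
    # Sequences absent from a library are reported with a count of 0.
    return {sid: {lib: tables[lib].get(sid, 0) for lib in libraries} for sid in ids}

# Toy per-library count tables (assumed values).
tables = {
    "lib_A": {"seq_1": 12, "seq_2": 3},
    "lib_B": {"seq_2": 7, "seq_3": 1},
}
merged = aggregate_counts(tables)
```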

Supplementary Code 6

Processing of protein production under regular and facilitated initiation from FACS-seq data. R script to normalize, rescale and aggregate read count data from multiple FACS-seq replicate experiments. Converts digital read distributions into a continuous linear measure of protein production ranging between 1 and 100 (PNI and PFI). (TXT 14 kb)

Supplementary Code 7

Computation of ANOVA's sum of squares for PNI. R script to run an ANOVA on PNI data and extract the sum of squares accounted for by design properties and their first-order interactions. (TXT 6 kb)
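To make the notion of sum of squares attributed to properties and their first-order interactions concrete, the sketch below decomposes the total sum of squares for a balanced design with two factors. It is a textbook two-way decomposition written in Python for illustration. It is not the authors' R script, which handles more factors, and the factor levels and response values are hypothetical.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def two_way_ss(records):
    """Sum-of-squares decomposition for a balanced two-factor design.

    records: list of (level_a, level_b, y), same number of replicates in every cell.
    """
    ys = [y for _, _, y in records]
    grand = mean(ys)
    by_a, by_b, by_ab = defaultdict(list), defaultdict(list), defaultdict(list)
    for a, b, y in records:
        by_a[a].append(y)
        by_b[b].append(y)
        by_ab[(a, b)].append(y)
    # Main effects: squared deviation of each level mean from the grand mean.
    ss_a = sum(len(v) * (mean(v) - grand) ** 2 for v in by_a.values())
    ss_b = sum(len(v) * (mean(v) - grand) ** 2 for v in by_b.values())
    # Interaction: what the cell means retain after removing both main effects.
    ss_ab = sum(
        len(v) * (mean(v) - mean(by_a[a]) - mean(by_b[b]) + grand) ** 2
        for (a, b), v in by_ab.items()
    )
    ss_res = sum((y - mean(by_ab[(a, b)])) ** 2 for a, b, y in records)
    ss_total = sum((y - grand) ** 2 for y in ys)
    return {"A": ss_a, "B": ss_b, "A:B": ss_ab, "residual": ss_res, "total": ss_total}

# Toy 2 x 2 design with two replicates per cell (assumed values).
records = [
    ("lo", "lo", 1), ("lo", "lo", 3), ("lo", "hi", 5), ("lo", "hi", 7),
    ("hi", "lo", 2), ("hi", "lo", 4), ("hi", "hi", 10), ("hi", "hi", 12),
]
ss = two_way_ss(records)
```

In a balanced design the four components add up to the total, so each can be read as a share of the observed variation.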

Supplementary Code 8

Computation of sum of squares for multiple linear regression of PNI on design properties. R script to run a multiple linear regression on PNI data and extract the ANOVA-like sum of squares accounted for by design properties and their first-order interactions. (TXT 5 kb)

Supplementary Code 9

Regression tree analysis for PNI. R script to run a CART analysis on PNI. (TXT 0 kb)

Supplementary Code 10

Computation of ANOVA's sum of squares for PFI. R script to run an ANOVA on PFI data and extract the sum of squares accounted for by design properties and their first-order interactions. (TXT 6 kb)

Supplementary Code 11

Computation of sum of squares for multiple linear regression of PFI on design properties. R script to run a multiple linear regression on PFI data and extract the ANOVA-like sum of squares accounted for by design properties and their first-order interactions. (TXT 5 kb)

Supplementary Code 12

Regression tree analysis for PFI. R script to run a CART analysis on PFI. (TXT 0 kb)

Supplementary Code 13

Effect of structure strength predicted across sliding windows of different sizes. R script to run linear regression of PNI and PFI against minimal free energies computed over sliding windows of different lengths. Reports the ANOVA-like sum of squares. (TXT 3 kb)

Supplementary Code 14

Multiple linear regression of PNI and PFI on predicted nucleotide accessibilities. R script to run a multiple linear regression of protein production data on predicted nucleotide accessibilities. Reports the ANOVA-like sum of squares accounted for by every position. (TXT 4 kb)

Supplementary Code 15

Call to the RBS calculator web service. Python script to remotely run the RBS calculator on designed sequences. (TXT 7 kb)

Supplementary Code 16

Effect of predictions from the RBS calculator. R script to run linear regression of PNI and PFI against RBS calculator outputs. Reports the ANOVA-like sum of squares. (TXT 1 kb)

Supplementary Code 17

Partial correlation between PFI and various codon metrics, given PNI. R script to compute various alternative codon metrics for the codon sequence and determine their partial correlations with PFI accounting for PNI. (TXT 19 kb)
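For readers unfamiliar with partial correlation, the sketch below uses the standard first-order formula r(x,y|z) = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)), where x stands for a codon metric, y for PFI and z for PNI. Whether the authors' script uses this closed form or a residual-based computation is not stated in the source, and the values are toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z on both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data (assumed): PFI is built as metric + PNI, so once PNI is
# accounted for, the metric explains PFI perfectly.
codon_metric = [1, 2, 3, 4]
pni = [1, -1, -1, 1]
pfi = [m + p for m, p in zip(codon_metric, pni)]

raw_r = pearson(codon_metric, pfi)
partial_r = partial_correlation(codon_metric, pfi, pni)
```

Here the raw correlation is about 0.75 while the partial correlation is 1, showing how conditioning on PNI can reveal an association that the raw correlation understates.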

Supplementary Code 18

Processing of growth measurements from FIT-seq data collected under various conditions. R script to convert differential enrichment of read count data over time into an integrated measure of cell growth. Processes read count data from multiple replicate experiments. Converts read count ratios into aggregated measures of relative growth in a given environment (WNI, WFI, WUTX, WM). (TXT 22 kb)

Supplementary Code 19

Computation of sum of squares for multiple linear regression of WNI on PNI and design properties. R script to run a multiple linear regression of WNI against PNI, PNI² and design properties. Reports ANOVA-like sum of squares. (TXT 2 kb)

Supplementary Code 20

Computation of sum of squares for multiple linear regression of WFI on PFI and design properties. R script to run a multiple linear regression of WFI against PFI, PFI² and design properties. Reports ANOVA-like sum of squares. (TXT 2 kb)

Supplementary Code 21

Processing of RNA abundance and decay measurements from serial RNA-seq. R script to compute RNA decay after transcription arrest. Sample read counts are corrected using coefficients derived from the ratios of spiked-in RNA standard counts over time. Performs a nonlinear decay fit to the corrected count frequencies to estimate RNA abundance at steady state (RNASS), RNA half-life (RNAHL) and RNA protection (WPTX). (TXT 10 kb)
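The two steps described above, spike-in correction followed by a decay fit, can be sketched as follows. This is a deliberate simplification: it assumes a single-exponential decay fitted by least squares on log-transformed values, whereas the authors perform a nonlinear fit whose exact model is not given here. Time points, counts and function names are hypothetical.

```python
from math import exp, log

def fit_exponential_decay(times, values):
    """Least-squares fit of log(value) = log(a0) - k * t; returns (a0, k, half_life)."""
    n = len(times)
    logs = [log(v) for v in values]
    mt, ml = sum(times) / n, sum(logs) / n
    slope = sum((t - mt) * (l - ml) for t, l in zip(times, logs)) / sum(
        (t - mt) ** 2 for t in times
    )
    k = -slope                # decay rate
    a0 = exp(ml + k * mt)     # abundance extrapolated to t = 0
    return a0, k, log(2) / k

# Toy time course (assumed): minutes after transcription arrest.
times = [0, 2, 4, 8]
raw_counts = [100.0, 25.0, 50.0, 6.25]   # reads for one design
spike_counts = [1000, 500, 2000, 1000]   # reads for a spiked-in standard

# Rescale each time point so that the spike-in standard stays constant.
corrected = [c * spike_counts[0] / s for c, s in zip(raw_counts, spike_counts)]
abundance, rate, half_life = fit_exponential_decay(times, corrected)
```

In this toy example the raw counts look erratic, but the corrected series halves every 2 minutes, so the fitted half-life is 2 and the initial abundance is 100.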

Supplementary Code 22

Compute 3D animation of the data. R scripts to produce the images necessary for Supplementary Video 1. (TXT 10 kb)

Supplementary Code 23

Processing of polysome profiles from DNA-seq of separate polysome fractions. R script to compute the distribution of polysomes (up to the fifth fraction) for each design sequence from read counts. (TXT 2 kb)
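A minimal sketch of turning per-fraction read counts into a distribution is shown below. It assumes the counts have already been normalized for sequencing depth across fractions. The weighted mean fraction index is a hypothetical summary added for illustration and is not necessarily the MRD measure used in the paper.

```python
def polysome_distribution(fraction_counts):
    """Convert read counts across polysome fractions into proportions summing to 1."""
    total = sum(fraction_counts)
    return [c / total for c in fraction_counts]

def mean_fraction_index(distribution):
    """Average fraction number (1-based), weighted by the share of reads per fraction."""
    return sum(i * p for i, p in enumerate(distribution, start=1))

# Toy counts (assumed) for one design across the first five polysome fractions.
counts = [10, 20, 40, 20, 10]
distribution = polysome_distribution(counts)
average_fraction = mean_fraction_index(distribution)
```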

Supplementary Code 24

Definition of sequence archetypes. R script to categorize sequences into the most relevant combinations of sequence properties. Calculate the series-wise means of various phenotypes for sequences belonging to these archetypes. (TXT 5 kb)

Supplementary Code 25

GenBank parser. A script to parse coding sequence from GenBank file using BioPython. (TXT 1 kb)

Supplementary Code 26

D-Tailor module. Links to specific D-Tailor modules used in this work. (TXT 0 kb)

Supplementary Code 27

Genome randomization. Perl modules to produce random genome variants that retain codon usage and the proteins' amino acid composition. (ZIP 1262 kb)
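One possible way to randomize a coding sequence under such constraints is to shuffle synonymous codons among the positions that encode the same amino acid. The sketch below does this for a single coding sequence using the standard genetic code. It is an illustration in Python under that assumption, not a port of the authors' Perl modules, whose randomization scheme may differ. The example sequence is hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def split_codons(cds):
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

def translate(cds):
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))

def shuffle_synonymous(cds, rng):
    """Permute codons among positions encoding the same amino acid."""
    codons = split_codons(cds)
    positions = defaultdict(list)
    for i, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(i)
    shuffled = list(codons)
    for idx in positions.values():
        pool = [codons[i] for i in idx]
        rng.shuffle(pool)
        for i, codon in zip(idx, pool):
            shuffled[i] = codon
    return "".join(shuffled)

# Toy coding sequence (assumed): M L L K A L K A stop.
cds = "ATGCTGTTAAAAGCTCTCAAGGCGTAA"
variant = shuffle_synonymous(cds, random.Random(0))
```

By construction the variant encodes the same protein and uses exactly the same set of codons as the input, only at different positions.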

Supplementary Code 28

Seed generator for D-Tailor. Python script to generate a random input sequence for D-Tailor that maximizes the distance to other input sequences. (TXT 0 kb)

Supplementary Data 1

E. coli's features and measurements. Dataset aggregating various measures of sequence property for every gene in a reference E. coli and corresponding expression data for a subset (Taniguchi, 2009). (ZIP 1914 kb)

Supplementary Data 2

Mean hydropathy index over sliding windows. Calculation of the MHI over sliding windows for every gene in the reference E. coli genome. (ZIP 3239 kb)

Supplementary Data 3

tAI profiles for sfGFP and a designed variant. Calculates tAI over a sliding window. (CSV 9 kb)

Supplementary Data 4

Accessible bottleneck strengths. Calculation of bottleneck strength for random sequence cloned in the translation reporter. (ZIP 36624 kb)

Supplementary Data 5

E. coli's features and levels. Calculation of property scores and discrete categorisation for every gene in the E. coli genome, based on the properties and thresholds set for the Design of Experiments. (ZIP 854 kb)

Supplementary Data 6

Random solutions. Calculation of property scores and categorization for random sequences cloned in the translation reporter context, based on the properties and thresholds set for the Design of Experiments. (ZIP 13803 kb)

Supplementary Data 7

Intra-series distance. Collection of tables reporting Hamming distances between every pair of sequences within the same series. (XLSX 10 kb)
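For reference, the Hamming distance between two sequences of equal length is the number of positions at which they differ. A minimal sketch of the pairwise computation within one series follows. The sequence names and sequences are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

def pairwise_hamming(seqs):
    """Hamming distance for every unordered pair in a {name: sequence} mapping."""
    return {(i, j): hamming(seqs[i], seqs[j]) for i, j in combinations(sorted(seqs), 2)}

# Toy series (assumed) with three short sequences.
series = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA"}
distances = pairwise_hamming(series)
```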

Supplementary Data 8

Series logo. Position-wise nucleotide and amino acid frequency matrices for each series. (ZIP 7341 kb)
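A position-wise frequency matrix of the kind used to draw sequence logos can be computed as in the sketch below, which assumes aligned sequences of equal length. The sequences are hypothetical, and the same function applies to amino acid sequences when given a different alphabet.

```python
from collections import Counter

def frequency_matrix(seqs, alphabet="ACGT"):
    """Return one {symbol: frequency} dictionary per alignment position."""
    length = len(seqs[0])
    matrix = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        matrix.append({sym: counts[sym] / len(seqs) for sym in alphabet})
    return matrix

# Toy series (assumed) of four aligned sequences.
sequences = ["ACGT", "ACGA", "TCGA", "ACTA"]
matrix = frequency_matrix(sequences)
```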

Supplementary Data 9

Sequencing count summary. A table reporting the number of counts associated with each design sequence for every sequencing library in this work. (ZIP 29419 kb)

Supplementary Data 10

Illumina lane description. Mapping of the different sequencing libraries on Illumina sequencing lane. (CSV 4 kb)

Supplementary Data 11

TAG coupling upon activation by unnatural amino acids. Table reporting the mean fluorescence observed upon induction by increasing concentration of the unnatural amino acid pAcF. (CSV 1 kb)

Supplementary Data 12

TAG coupling mutants. Table reporting the mean fluorescence observed in various mutants of the TAG position. (CSV 0 kb)

Supplementary Data 13

Growth of TAG mutants. Density of cell culture (OD600) over time for various mutants at the TAG position. (CSV 10 kb)

Supplementary Data 14

Number of cells sorted during FACS-seq. Report the number of cells sorted in each bin during the FACS-seq experiments. Used to normalize read counts upon sequencing. (CSV 5 kb)

Supplementary Data 15

Integrated phenotypic measurements. Consolidated dataset comprising design information, intermediates and fully processed phenotypic measurements for all 244,000 synthetic sequences. (ZIP 127072 kb)

Supplementary Data 16

ANOVA on PNI. An R object containing the sum of squares computed by running ANOVAs on the full dataset and independent series (Supplementary Code 7). (ZIP 67 kb)

Supplementary Data 17

MLR on PNI. An R object containing the sum of squares computed by running multiple linear regressions on the full dataset and independent series (Supplementary Code 8). (ZIP 64 kb)

Supplementary Data 18

CART on PNI. An R object containing the result of CART analysis (Supplementary Code 9). (ZIP 2935 kb)

Supplementary Data 19

ANOVA on PFI. An R object containing the sum of squares computed by running ANOVAs on the full dataset and independent series (output of Supplementary Code 10). (ZIP 67 kb)

Supplementary Data 20

MLR on PFI. An R object containing the sum of squares computed by running multiple linear regressions on the full dataset and independent series (output of Supplementary Code 11). (ZIP 64 kb)

Supplementary Data 21

CART on PFI. An R object containing the result of CART analysis (output of Supplementary Code 12). (ZIP 2907 kb)

Supplementary Data 22

Effect of minimum free energy over sliding windows. MFE predicted for sliding windows of different length on each designed sequence. (ZIP 38075 kb)

Supplementary Data 23

Sum of squares corresponding to regression of PNI on MFE over sliding windows (output of Supplementary Code 13). (ZIP 108 kb)

Supplementary Data 24

Sum of squares corresponding to regression of PFI to the residuals of PNI's regression on MFEs (output of Supplementary Code 13). (ZIP 108 kb)

Supplementary Data 25

Single nucleotide accessibilities. Predicted accessibilities at every position of each designed sequence. (ZIP 68145 kb)

Supplementary Data 26

Sum of squares for multiple linear regression of PNI on accessibilities (output of Supplementary Code 14). (ZIP 82 kb)

Supplementary Data 27

Sum of squares for multiple linear regression of PFI on accessibilities (output of Supplementary Code 14). (ZIP 82 kb)

Supplementary Data 28

RBS calculator predictions. Aggregation of outputs obtained by running each designed sequence in reporter context in the RBS calculator (output of Supplementary Code 15). (ZIP 3410 kb)

Supplementary Data 29

Sum of squares corresponding to the regression of PNI on RBS calculator's predictions (output of Supplementary Code 16). (ZIP 2 kb)

Supplementary Data 30

Partial correlation of various codon-based metrics with PFI, given PNI (output of Supplementary Code 17). (ZIP 0 kb)

Supplementary Data 31

Sum of squares for multiple linear regression of WNI on design properties and PNI (output of Supplementary Code 19). (ZIP 48 kb)

Supplementary Data 32

Sum of squares for multiple linear regression of WFI on design properties and PFI (output of Supplementary Code 20). (ZIP 48 kb)

Supplementary Data 33

RNA standards. Counts of reads mapping to RNA standard sequences in RNA decay libraries. (CSV 1 kb)

Supplementary Data 34

Nonlinear decay fit. An R object containing fit data (output of Supplementary Code 21). (ZIP 13532 kb)

Supplementary Data 35

Phenotypic archetypes. Quartiles of series-wise mean for various phenotypes (output of Supplementary Code 24). (ZIP 1 kb)

Supplementary Data 36

Random E. coli genomes. Result of constrained genome randomization (output of Supplementary Code 27). (ZIP 12592 kb)

Supplementary Video 1

3D animation of the data in RNA–Protein–Fitness space. (AVI 44746 kb)

About this article

Cite this article

Cambray, G., Guimaraes, J. & Arkin, A. Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli. Nat Biotechnol 36, 1005–1015 (2018). https://doi.org/10.1038/nbt.4238

  • DOI: https://doi.org/10.1038/nbt.4238
