Codon influence on protein expression in E. coli correlates with mRNA levels

Abstract

Degeneracy in the genetic code, which enables a single protein to be encoded by a multitude of synonymous gene sequences, has an important role in regulating protein expression, but substantial uncertainty exists concerning the details of this phenomenon. Here we analyse the sequence features influencing protein expression levels in 6,348 experiments using bacteriophage T7 polymerase to synthesize messenger RNA in Escherichia coli. Logistic regression yields a new codon-influence metric that correlates only weakly with genomic codon-usage frequency, but strongly with global physiological protein concentrations and also mRNA concentrations and lifetimes in vivo. Overall, the codon content influences protein expression more strongly than mRNA-folding parameters, although the latter dominate in the initial ~16 codons. Genes redesigned based on our analyses are transcribed with unaltered efficiency but translated with higher efficiency in vitro. The less efficiently translated native sequences show greatly reduced mRNA levels in vivo. Our results suggest that codon content modulates a kinetic competition between protein elongation and mRNA degradation that is a central feature of the physiology and also possibly the regulation of translation in E. coli.

Figure 1: Distributions of representative RNA sequence parameters in protein-expression categories in the large-scale data set.
Figure 2: Log-odds ratio for E5 versus E0 expression categories for proteins encoded by each nucleotide base at positions 4–96.
Figure 3: Codon influence on protein expression in the large-scale data set.
Figure 4: Contributions of physicochemical factors and regions of the coding sequence to protein expression level.
Figure 5: Analyses of synthetic genes designed to enhance protein expression.
Figure 6: Codon influence on protein expression correlates with endogenous E. coli protein levels and mRNA levels and lifetimes.

Accession codes

Data deposits

Microarray data and the pMGK sequence were deposited in the Gene Expression Omnibus and GenBank under accessions GSE73416 and KT203761, respectively.

References

  1. Chen, G. T. & Inouye, M. Role of the AGA/AGG codons, the rarest codons in global gene expression in Escherichia coli. Genes Dev. 8, 2641–2652 (1994)

  2. Deana, A., Ehrlich, R. & Reiss, C. Synonymous codon selection controls in vivo turnover and amount of mRNA in Escherichia coli bla and ompA genes. J. Bacteriol. 178, 2718–2720 (1996)

  3. Kudla, G., Murray, A. W., Tollervey, D. & Plotkin, J. B. Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255–258 (2009)

  4. Tuller, T., Waldman, Y. Y., Kupiec, M. & Ruppin, E. Translation efficiency is determined by both codon bias and folding energy. Proc. Natl Acad. Sci. USA 107, 3645–3650 (2010)

  5. Goodman, D. B., Church, G. M. & Kosuri, S. Causes and effects of N-terminal codon bias in bacterial genes. Science 342, 475–479 (2013)

  6. Castillo-Méndez, M. A., Jacinto-Loeza, E., Olivares-Trejo, J. J., Guarneros-Pena, G. & Hernandez-Sanchez, J. Adenine-containing codons enhance protein synthesis by promoting mRNA binding to ribosomal 30S subunits provided that specific tRNAs are not exhausted. Biochimie 94, 662–672 (2012)

  7. Bentele, K., Saffert, P., Rauscher, R., Ignatova, Z. & Bluthgen, N. Efficient translation initiation dictates codon usage at gene start. Mol. Syst. Biol. 9, 675 (2013)

  8. Hunt, R. C., Simhadri, V. L., Iandoli, M., Sauna, Z. E. & Kimchi-Sarfaty, C. Exposing synonymous mutations. Trends Genet. 30, 308–321 (2014)

  9. Spencer, P. S., Siller, E., Anderson, J. F. & Barral, J. M. Silent substitutions predictably alter translation elongation rates and protein folding efficiencies. J. Mol. Biol. 422, 328–335 (2012)

  10. Li, G. W., Burkhardt, D., Gross, C. & Weissman, J. S. Quantifying absolute protein synthesis rates reveals principles underlying allocation of cellular resources. Cell 157, 624–635 (2014)

  11. Li, G.-W., Oh, E. & Weissman, J. S. The anti-Shine–Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature 484, 538–541 (2012)

  12. Gingold, H. & Pilpel, Y. Determinants of translation efficiency and accuracy. Mol. Syst. Biol. 7, 481 (2011)

  13. Cannarozzi, G. et al. A role for codon order in translation dynamics. Cell 141, 355–367 (2010)

  14. Sharp, P. M. & Li, W. H. The codon adaptation index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987)

  15. Ninio, J. Fine tuning of ribosomal accuracy. FEBS Lett. 196, 1–4 (1986)

  16. Tuller, T. et al. An evolutionarily conserved mechanism for controlling the efficiency of protein translation. Cell 141, 344–354 (2010)

  17. Wallace, E. W., Airoldi, E. M. & Drummond, D. A. Estimating selection on synonymous codon usage from noisy experimental data. Mol. Biol. Evol. 30, 1438–1453 (2013)

  18. Caskey, C. T., Beaudet, A. & Nirenberg, M. RNA codons and protein synthesis. 15. Dissimilar responses of mammalian and bacterial transfer RNA fractions to messenger RNA codons. J. Mol. Biol. 37, 99–118 (1968)

  19. Ikemura, T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151, 389–409 (1981)

  20. Muramatsu, T. et al. Codon and amino-acid specificities of a transfer RNA are both converted by a single post-transcriptional modification. Nature 336, 179–181 (1988)

  21. Zhang, S. P., Zubay, G. & Goldman, E. Low-usage codons in Escherichia coli, yeast, fruit fly and primates. Gene 105, 61–72 (1991)

  22. Bulmer, M. The selection-mutation-drift theory of synonymous codon usage. Genetics 129, 897–907 (1991)

  23. Dong, H., Nilsson, L. & Kurland, C. G. Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J. Mol. Biol. 260, 649–663 (1996)

  24. Elf, J., Nilsson, D., Tenson, T. & Ehrenberg, M. Selective charging of tRNA isoacceptors explains patterns of codon usage. Science 300, 1718–1722 (2003)

  25. Dittmar, K. A., Sorensen, M. A., Elf, J., Ehrenberg, M. & Pan, T. Selective charging of tRNA isoacceptors induced by amino-acid starvation. EMBO Rep. 6, 151–157 (2005)

  26. Zhang, F., Saha, S., Shabalina, S. A. & Kashina, A. Differential arginylation of actin isoforms is regulated by coding sequence-dependent degradation. Science 329, 1534–1537 (2010)

  27. Vivanco-Domínguez, S. et al. Protein synthesis factors (RF1, RF2, RF3, RRF, and tmRNA) and peptidyl-tRNA hydrolase rescue stalled ribosomes at sense codons. J. Mol. Biol. 417, 425–439 (2012)

  28. Dana, A. & Tuller, T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 42, 9171–9181 (2014)

  29. Pelechano, V., Wei, W. & Steinmetz, L. M. Widespread co-translational RNA decay reveals ribosome dynamics. Cell 161, 1400–1412 (2015)

  30. Presnyak, V. et al. Codon optimality is a major determinant of mRNA stability. Cell 160, 1111–1124 (2015)

  31. Drummond, D. A. & Wilke, C. O. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell 134, 341–352 (2008)

  32. Shakin-Eshleman, S. H. & Liebhaber, S. A. Influence of duplexes 3′ to the mRNA initiation codon on the efficiency of monosome formation. Biochemistry 27, 3975–3982 (1988)

  33. Quax, T. E. et al. Differential translation tunes uneven production of operon-encoded proteins. Cell Rep. 4, 938–944 (2013)

  34. Letzring, D. P., Wolf, A. S., Brule, C. E. & Grayhack, E. J. Translation of CGA codon repeats in yeast involves quality control components and ribosomal protein L1. RNA 19, 1208–1217 (2013)

  35. Ude, S. et al. Translation elongation factor EF-P alleviates ribosome stalling at polyproline stretches. Science 339, 82–85 (2013)

  36. Iost, I. & Dreyfus, M. The stability of Escherichia coli lacZ mRNA depends upon the simultaneity of its synthesis and translation. EMBO J. 14, 3252–3261 (1995)

  37. Iost, I., Guillerez, J. & Dreyfus, M. Bacteriophage T7 RNA polymerase travels far ahead of ribosomes in vivo. J. Bacteriol. 174, 619–622 (1992)

  38. Acton, T. B. et al. Robotic cloning and protein production platform of the Northeast Structural Genomics Consortium. Methods Enzymol. 394, 210–243 (2005)

  39. Price, W. N. et al. Large-scale experimental studies show unexpected amino acid effects on protein expression and solubility in vivo in E. coli. Microb. Inform. Exp. 1, 6 (2011)

  40. Duval, M. et al. Escherichia coli ribosomal protein S1 unfolds structured mRNAs onto the ribosome for active translation initiation. PLoS Biol. 11, e1001731 (2013)

  41. Reuter, J. S. & Mathews, D. H. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11, 129 (2010)

  42. Lu, J. & Deutsch, C. Electrostatics in the ribosomal tunnel modulate chain elongation rates. J. Mol. Biol. 384, 73–86 (2008)

  43. Ishihama, Y. et al. Protein abundance profiling of the Escherichia coli cytosol. BMC Genomics 9, 102 (2008)

  44. Chen, H., Shiroguchi, K., Ge, H. & Xie, X. S. Genome-wide study of mRNA degradation and transcript elongation in Escherichia coli. Mol. Syst. Biol. 11, 781 (2015)

  45. dos Reis, M. Unexpected correlations between gene expression and codon usage bias from microarray data for the whole Escherichia coli K-12 genome. Nucleic Acids Res. 31, 6976–6985 (2003)

  46. Nogueira, T., de Smit, M., Graffe, M. & Springer, M. The relationship between translational control and mRNA degradation for the Escherichia coli threonyl-tRNA synthetase gene. J. Mol. Biol. 310, 709–722 (2001)

  47. Richards, J., Sundermeier, T., Svetlanov, A. & Karzai, A. W. Quality control of bacterial mRNA decoding and decay. Biochim. Biophys. Acta 1779, 574–582 (2008)

  48. Ivanova, N., Pavlov, M. Y. & Ehrenberg, M. tmRNA-induced release of messenger RNA from stalled ribosomes. J. Mol. Biol. 350, 897–905 (2005)

  49. Shoemaker, C. J., Eyler, D. E. & Green, R. Dom34:Hbs1 promotes subunit dissociation and peptidyl-tRNA drop-off to initiate no-go decay. Science 330, 369–372 (2010)

  50. Chadani, Y., Ono, K., Kutsukake, K. & Abo, T. Escherichia coli YaeJ protein mediates a novel ribosome-rescue pathway distinct from SsrA- and ArfA-mediated pathways. Mol. Microbiol. 80, 772–785 (2011)

  51. Xiao, R. et al. The high-throughput protein sample production platform of the Northeast Structural Genomics Consortium. J. Struct. Biol. 172, 21–33 (2010)

  52. Acton, T. B. et al. Preparation of protein samples for NMR structure, function, and small-molecule screening studies. Methods Enzymol. 493, 21–60 (2011)

  53. R Development Core Team. R: A Language and Environment for Statistical Computing; http://www.r-project.org/ (2012)

  54. Akaike, H. A new look at the statistical model identification. IEEE Trans. Auto. Con. 19, 716–723 (1974)

  55. Harrell, F. E. Jr. rms: Regression Modeling Strategies. R package version 4.2-0; http://CRAN.R-project.org/package=rms (2014)

  56. Jansson, M. et al. High-level production of uniformly 15N- and 13C-enriched fusion proteins in Escherichia coli. J. Biomol. NMR 7, 131–141 (1996)

  57. Keseler, I. M. et al. EcoCyc: fusing model organism databases with systems biology. Nucleic Acids Res. 41, D605–D612 (2013)

  58. Juncker, A. S. et al. Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci. 12, 1652–1662 (2003)

  59. Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567–580 (2001)

  60. Novick, A. & Weiner, M. Enzyme induction as an all-or-none phenomenon. Proc. Natl Acad. Sci. USA 43, 553–566 (1957)

  61. Jensen, P. R., Westerhoff, H. V. & Michelsen, O. The use of lac-type promoters in control analysis. Eur. J. Biochem. 211, 181–191 (1993)

  62. Guzman, L. M., Belin, D., Carson, M. J. & Beckwith, J. Tight regulation, modulation, and high-level expression by vectors containing the arabinose PBAD promoter. J. Bacteriol. 177, 4121–4130 (1995)


Acknowledgements

This work was supported by NIGMS Protein Structure Initiative grant U54-GM094597 to the Northeast Structural Genomics Consortium to J.F.H. and G.T.M., and NIH grant GM106372 to D.P.A. We thank B. Klingenberg, R. Gonzalez, M. Gottesman and V. de Crécy-Lagard for advice.

Author information

Contributions

T.B.A., R.X., J.K.E., J.F.H. and G.T.M. developed the protein expression platform. T.B.A., R.X. and G.T.M. generated the expression data set. W.N.P. initiated and M.S., M.V., J.D.L., G.B., J.F.H. and D.P.A. completed the computational analyses. H.N., K.-H.W. and R.X. constructed genes that R.L., K.-H.W. and G.B. used for biochemical studies. G.B., G.T.M., D.P.A. and J.F.H. designed the research and wrote the paper.

Corresponding authors

Correspondence to Daniel P. Aalberts or John F. Hunt.

Ethics declarations

Competing interests

Two patent applications have been submitted related to results reported in this paper. G.T.M. is affiliated with Nexomics Inc., and G.B., G.T.M., D.P.A. and J.F.H. are affiliated with OPTimum Protein Technologies.

Extended data figures and tables

Extended Data Figure 1 Phylogenic distribution of the proteins in the large-scale protein expression data set.

The colours in the cladogram encode the number of genes/proteins from each organism, as indicated by the legend. The data set includes 47 from eukaryotes (45 from humans and 2 from mouse), 809 from archaebacteria, and 96 from E. coli, with the remainder coming from other eubacteria. The organism contributing the largest number of proteins to the data set is the eubacterium Bacteroides thetaiotaomicron (150 proteins).

Extended Data Figure 2 Relationships between additional mRNA sequence parameters and results in the large-scale protein expression data set.

a, i, k, Histograms showing for each expression score the distribution of the overall G + C frequency (a), the frequency in all reading frames of the AGGA core sequence of the Shine–Dalgarno ribosome-binding sequence (i), and the amino acid repetition rate r (k; see Methods for definition). The parameter distributions in the E = 5 and E = 0 categories (n = 3,727 for both combined) are shown in a in dark and light blue, respectively, and in i and k in red and black, respectively. The symbols used for the histograms for the intermediate expression scores (n = 2,621 for all combined) are indicated in the legend for each panel. b–h, j, l–o, Plots showing the logarithm of the ratio of the number of proteins with E = 5 versus E = 0 scores as a function of parameter value. b, Data for the overall frequencies of the four individual nucleotide bases as well as the combined G + C frequency (labelled GC). c–e, The equivalent data separately for the first (c), second (d) and third (e) positions in the codons in the genes. f, Data for genes either not containing or containing at least one occurrence of the ATA–ATA di-codon (P = 2 × 10⁻³²). The error bars in this panel represent 95% confidence limits calculated from bootstrapping; the error bars for the genes without any occurrence of this di-codon are smaller than the size of the symbol. g, h, Data for the codon adaptation index (g; ref. 14) and tRNA adaptation index (h; ref. 16). j, Data for the frequency in all reading frames of the sequence AGGA. l, m, Data for the amino acid repetition rate r (l) and the codon repetition rate (m). n, o, Data for the statistical entropy of the amino acid (n) and codon sequences (o). The data in a–e, i and k are binned in equal ranges of the parameter value, while the data in g, h, j and l–o are binned in deciles containing equal populations.
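The log-ratio plots binned in equal-population deciles can be illustrated with a short Python sketch. It is not the authors' code: the function name and the choice of the natural logarithm are assumptions (the caption does not state the base), and genes with intermediate scores are simply ignored.

```python
import math

def log_odds_by_decile(values, scores, n_bins=10):
    """Return (mean parameter value, ln(#E5 / #E0)) for equal-population bins.

    values: one sequence parameter per gene (for example, G + C fraction).
    scores: matching expression scores; only scores 0 and 5 are used.
    """
    pairs = sorted((v, s) for v, s in zip(values, scores) if s in (0, 5))
    n = len(pairs)
    result = []
    for b in range(n_bins):
        # Integer slicing gives bins whose sizes differ by at most one gene
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        n5 = sum(1 for _, s in chunk if s == 5)
        n0 = len(chunk) - n5
        mean_v = sum(v for v, _ in chunk) / len(chunk)
        # The ratio is undefined when a bin lacks E = 0 or E = 5 genes
        ratio = math.log(n5 / n0) if n5 and n0 else float("nan")
        result.append((mean_v, ratio))
    return result
```

Positive values mark bins enriched in E = 5 genes, negative values bins enriched in E = 0 genes.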

Extended Data Figure 3 Correlations between sequence parameters in the genes included in the large-scale protein expression data set.

a–c, Corrgrams representing the signed Pearson correlation coefficients between different mRNA sequence parameters in the genes in the E = 0 plus E = 5 categories in the data set (n = 3,727 for the two combined). The colour-coding is defined schematically on the left in a, with blue being used for positively correlated variables, red for negatively correlated variables, and white for uncorrelated variables. In a, E represents the expression score in the binary categories (0, 5), s_all represents the mean value of our new codon-influence metric (coloured symbols in Fig. 3a) over the entire gene (without the LEHHHHH tag), s_7–16 and s_17–32 represent the mean values of this metric for codons 7–16 and 17–32, respectively, ΔG_UH represents the predicted free energy of mRNA folding for the 5′-UTR from the pET21 expression vector plus the first 48 nucleotides in the gene, ⟨ΔG_T⟩_96 represents the mean value in the remainder of the gene of the predicted free energy of folding in 50% overlapping windows of 96 nucleotides, I represents an indicator variable that assumes a value of 1 if (ΔG_UH < −39 kcal mol⁻¹) and (%GC_2–6 > 0.65) and 0 otherwise, d_AUA assumes a value of 1 if there is at least one occurrence of the ATA–ATA di-codon and 0 otherwise, r represents the codon repetition rate (see Methods), and %GC represents the percentage content of G plus C bases in the gene. The variables a_H, a_H^2, g_H^2 and u_3H represent monomial functions of the fractional content of A, G and U bases in codons 2–6; the correlation coefficient for these nucleotide-composition terms was calculated using their sum weighted by their optimized coefficients from model M (Fig. 4 and Extended Data Table 1a), as given in the equation in the main text. b, Data for the frequencies of the codons positively correlated with expression score E. c, Data for the frequencies of the codons negatively correlated with expression score E.
d–g, Two-dimensional histograms illustrating the dependence of results in the large-scale protein-expression data set on pairs of sequence parameters. The colours encode the fractional excess of proteins with E = 5 versus E = 0 scores (that is, (#E5 − #E0)/(#E5 + #E0)), as calibrated by the scale bar on the right. The area of each square is proportional to the number of proteins in that bin in the two-dimensional parameter space. The variables s_all, s_7–16 and s_tail represent, respectively, the mean values of our new codon-influence metric for the entire gene, for codons 7–16, and for all of the remaining codons downstream in the gene. ΔG_UH represents the predicted free energy of mRNA folding for the 5′-UTR from the pET21 expression vector plus the first 48 nucleotides in the gene, ⟨ΔG_T⟩_96 represents the mean value in the remainder of the gene of the predicted free energy of folding in 50% overlapping windows of 96 nucleotides, and r represents the amino acid repetition rate (as defined in Methods).
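The fractional excess used for the colour scale and the windowed means of the codon-influence metric are simple to compute. The following Python sketch assumes 1-based codon numbering with the start codon as codon 1; the function names are hypothetical, and any slope values passed in would come from the fitted model rather than from this example.

```python
def fractional_excess(n_e5, n_e0):
    """(#E5 - #E0) / (#E5 + #E0); ranges from -1 (all E = 0) to +1 (all E = 5)."""
    total = n_e5 + n_e0
    if total == 0:
        raise ValueError("bin contains no E = 0 or E = 5 proteins")
    return (n_e5 - n_e0) / total

def split_codons(cds):
    """Split a coding sequence into codons, dropping any incomplete trailing codon."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

def mean_slope(codons, slopes, first, last=None):
    """Mean codon slope over 1-based codon positions first..last (inclusive).

    slopes maps each codon to its slope; with last=None the window runs
    to the end of the gene (for example, a tail term).
    """
    window = codons[first - 1:last]
    return sum(slopes[c] for c in window) / len(window)
```

With this numbering, mean_slope(codons, slopes, 1) corresponds to a whole-gene mean and mean_slope(codons, slopes, 7, 16) to the mean over codons 7–16.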

Extended Data Figure 4 Relationship of the new codon-influence metric to parameters assumed to influence translation efficiency in previous literature.

a, Average frequency of each non-stop codon in the genes in just the E = 0 plus E = 5 categories (dark grey) or in the E = 0 through E = 5 categories (light grey), with error bars representing the s.d. of the frequency among the genes in each set. b, Codon slopes from single-variable binary logistic regressions (dark grey symbols in Fig. 3a) segregated according to the identity of the nucleotide at each of the three positions in the codon. These slopes come from single-variable linear logistic regressions that were performed separately for each of the individual 61 non-stop codons. c, Codon slopes from the simultaneous multi-parameter binary logistic regression model M (Extended Data Table 1a and coloured symbols in Fig. 3a) segregated according to the identity of the nucleotide at each of the three positions in the codon. d–h, The codon slopes from model M plotted versus the codon sensitivity (ref. 24) in E. coli K12 (d), the relative synonymous codon usage (RSCU) in E. coli BL21 (e), the codon adaptation index (ref. 14) in E. coli K12 (f), the tRNA adaptation index (ref. 16) in E. coli K12 (g), and the concentration of exactly cognate tRNAs (ref. 23) in E. coli K12 (h). The shapes and colour-coding of the symbols in b–h, which are the same as in Fig. 3, encode structural and qualitative chemical characteristics of the amino acids.
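A single-variable binary logistic regression of the kind described for panel b can be sketched in plain Python. This is an illustration only, not the authors' pipeline (the reference list cites R and the rms package); the function name, the Newton–Raphson fitting scheme and the coding of E = 5 as 1 and E = 0 as 0 are assumptions made here.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson.

    x: frequency of one codon in each gene; y: 1 for E = 5, 0 for E = 0.
    Returns (intercept b0, slope b1); b1 plays the role of the codon slope.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        # Gradient (g0, g1) of the log-likelihood and entries of X^T W X
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break  # singular system, e.g. all x identical
        # Newton step: solve the 2 x 2 system (X^T W X) * delta = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

Running this once per codon, with x set to that codon's frequency in each gene, would give one slope per codon. Perfectly separable data make the estimates diverge, so a production analysis would rely on an established statistics package instead.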

Extended Data Figure 5 Variation in codon influence as a function of position in the coding sequence.

Plots showing the reduction in the deviance of the computational model resulting from adding a term representing the average value of the codon slope (coloured symbols in Fig. 3a) in a window 5, 10 or 16 codons wide starting at the position indicated on the abscissa (that is, c through (c + 4) in blue, c through (c + 9) in red, or c through (c + 15) in purple, respectively, with c representing the number of the first codon in the window). The reduction in deviance was calculated relative to a base model containing codon frequencies in the entire coding sequence, head nucleotide composition terms (a_H, a_H^2, u_3H and g_H^2), the predicted free energy of RNA folding in the head plus the 5′-UTR (ΔG_UH), the binary indicator variable for head folding effects I, the binary variable indicating the occurrence of an AUA–AUA di-codon d_AUA, and the codon repetition rate r (n = 3,727). The mean slope of codons 2–6 presumably does not improve the model because the head-composition terms rather than codon content dominate the influence of this region on protein-expression level. This effect also probably accounts for the peaks in the s_c–(c+9) and s_c–(c+15) plots for windows starting at codon 7. For reference, adding s_7–16 and s_17–32 terms to model M contributes 29.7 points (P = 5 × 10⁻⁸) and 12 points (P = 5 × 10⁻⁴) of model deviance, respectively (Extended Data Table 1 and Fig. 4a). Dropping out terms to measure their influence (Fig. 4a) shows every codon contributes on average (423.7/270) = 1.6 deviance units, while codons 7–16 each contribute on average an additional (29.6/10) = 3.0 deviance units. Therefore, individual codons at positions 7–16 are approximately three times more influential than those in the tail of the gene.
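The per-codon arithmetic at the end of this caption can be checked directly. The sketch below takes the deviance values and codon counts from the caption; reading the "three times" comparison as (baseline + additional)/baseline is one interpretation, labelled here as an assumption.

```python
def per_codon_deviance(total_deviance, n_codons):
    """Average deviance contribution per codon."""
    return total_deviance / n_codons

baseline = per_codon_deviance(423.7, 270)  # every codon, on average
extra = per_codon_deviance(29.6, 10)       # additional amount for codons 7-16

# Assumed reading: a codon at positions 7-16 carries baseline + extra,
# compared with baseline alone for a codon in the tail.
ratio = (baseline + extra) / baseline

print(f"{baseline:.1f} {extra:.1f} {ratio:.1f}")  # prints "1.6 3.0 2.9"
```

Under this reading the ratio is about 2.9, consistent with the stated factor of approximately three.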

Extended Data Figure 6 Further experiments on synthetic genes designed to enhance protein expression.

a–d, Data for three additional proteins equivalent to the data presented in Fig. 5. The in vivo and in vitro expression properties from pET vectors are compared for inefficiently translated native (WT) genes and synonymous genes redesigned in the head or the tail or both using the 6AA, 31C-FO or 31C-FD methods. The type of sequence in the head (H) is indicated separately from that in the tail (T), and the name of the target protein is indicated on the left on each row. a, E. coli BL21(DE3) host cell growth curves at room temperature after induction of the target gene at time zero in chemically defined MJ9 medium. b, Coomassie-blue-stained SDS–PAGE gels of whole cells after overnight induction at 17 °C, with the amount loaded in each lane normalized to the A600 nm of the culture at the time of harvest. Black arrows indicate the migration positions of the target proteins. c, Autoradiographs of SDS–PAGE gels of in vitro translation reactions using fully purified translation components in the presence of [³⁵S]methionine. Each reaction contained an equal amount of purified mRNA that was transcribed in vitro using T7 RNA polymerase. d, Northern blot analyses of the mRNA for the target protein after induction of expression in vivo. An equal amount of total RNA was loaded in each lane, and blots were hybridized with a probe matching the 5′-UTR. e, f, Coomassie-blue-stained SDS–PAGE gels (e) and anti-tetrahistidine western blots (f) showing that gene optimization has equivalent effects at physiological protein expression levels. Pairs of synonymous native (WT) and codon-optimized 31C-FOH/T genes with C-terminal hexahistidine tags were re-cloned under control of the arabinose-inducible promoter in a pBAD vector (ref. 62), and the concentration of arabinose in the growth medium was adjusted so the 31C-FOH/T genes yielded protein expression in the physiological range as assessed from Coomassie-blue-stained SDS–PAGE gels of whole cell extracts. Black arrows indicate locations of the induced target proteins. Substantially lower protein expression from the wild-type genes compared to the synonymous 31C-FOH/T genes in these experiments demonstrates that equivalent codon-usage effects are observed when proteins are overexpressed using a pET vector or expressed at roughly physiological levels using a pBAD vector, despite changes explained in the online Methods in the polymerase used to transcribe the genes, the medium used to grow the cells, and the timescale and temperature of the protein-induction process. The constitutively expressed ~25-kDa protein that reacts with the anti-tetrahistidine antibody in the cells containing the 31C-FOH/T gene for YcaQ is probably an amino-terminally truncated protein synthesized from a 5′-truncated mRNA transcribed from an internal promoter sequence fortuitously introduced into this synthetic gene. Uncropped scans of the gels shown here are included in Supplementary Fig. 1.

Extended Data Figure 7 In vivo expression of synthetic genes with sequences optimized using the 31C-FO method.

a, Coomassie-blue-stained SDS–PAGE gels of whole-cell extracts after overnight induction at 17 °C of synthetic genes designed using the 31C-FOH method to encode 17 different proteins. All genes were cloned in-frame with a C-terminal hexahistidine tag in the same pET21 plasmid derivative used to generate our large-scale protein-expression data set (ref. 38). Equal volumes of induced cultures were loaded in all lanes. b, Coomassie-blue-stained SDS–PAGE gels of whole-cell extracts (top) and the corresponding soluble fractions (bottom) after overnight induction at 17 °C of 14 of the synthetic genes fused in-frame at the C terminus of the gene for the E. coli maltose-binding protein (MBP). The protein sequences come from the following source organisms: LCABL_04230 from Lactobacillus casei BL23; VIPARP466_2889 from Vibrio parahaemolyticus; AM1_4824 from Acaryochloris marina MBIC11017; CLO_0718 from Clostridium botulinum E1; ESAG_04692 from Escherichia sp. 3_2_53FAA; FTCG_00666 and FTCG_01175 from Francisella tularensis subsp. novicida GA99-3549; FTE_1275, FTE_1608, FTE_0420 and FTE_1020 from Francisella tularensis subsp. novicida FTE; FRANO wbtG and A1DS62_FRANO from Francisella novicida; FTBG_00988 and A7JEH2_FRATL from Francisella tularensis subsp. tularensis FSC033; FTN_1238 from Francisella tularensis subsp. novicida U112; O1O_09285 from Pseudomonas aeruginosa MPAO1/P1; Sthe_2331 from Sphaerobacter thermophilus DSM20745/S6022; SEVCU126_0606 from Staphylococcus epidermidis VCU126; and Y007_20720 from Salmonella enterica subsp. enterica serovar Montevideo 507440-20.

Extended Data Figure 8 Yield of mRNA from in vitro transcription using purified T7 RNA polymerase.

a, Final yield of mRNA purified from reactions conducted under identical conditions, as described in the Methods. The yields were calculated from the optical density at 260 nm. b–e, Kinetic analyses of in vitro transcription reactions using formaldehyde-agarose gel electrophoresis. Samples were taken at 0, 5, 10 and 30 min. The gels were stained with ethidium bromide. The ‘standard’ lane contains 1 μg of the same mRNA after purification to enable calibration for differences in the sensitivity of the molecules to staining. Reactions were started by addition of the wild-type or 31C-FOH/31C-FOT (31C-FOH/T) linearized plasmids encoding SRU_1983 (b), APE_0230.1 (c), SCO1897 (d), or Eco-YcaQ (e).
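Converting an optical density at 260 nm into an RNA yield is a short calculation. The caption does not give the conversion factor used, so the sketch below assumes the common convention that one A260 unit corresponds to about 40 μg ml⁻¹ of single-stranded RNA in a 1 cm path; the function name and parameters are hypothetical.

```python
# Assumed convention, not stated in the source: 1 A260 unit ~ 40 ug/ml ssRNA (1 cm path)
RNA_UG_PER_ML_PER_A260 = 40.0

def rna_yield_ug(a260, dilution_factor, volume_ml, path_cm=1.0):
    """Estimate total RNA (in micrograms) from an absorbance reading.

    a260: absorbance of the diluted sample at 260 nm.
    dilution_factor: fold dilution applied before the measurement.
    volume_ml: volume of the undiluted RNA preparation.
    """
    conc_ug_per_ml = a260 / path_cm * RNA_UG_PER_ML_PER_A260 * dilution_factor
    return conc_ug_per_ml * volume_ml
```

For example, a reading of 0.5 on a 100-fold dilution of a 50 μl preparation corresponds to roughly 100 μg of RNA under these assumptions.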

Extended Data Table 1 Development and analysis of the simultaneous multi-parameter binary logistic regression model
Extended Data Table 2 Codons used for synonymous gene design

Supplementary information

Supplementary Information

This file contains Supplementary Text, Supplementary References and Supplementary Figure 1, the uncropped gel presented in Extended Data Figures 6 and 7. (PDF 2305 kb)

Supplementary Data 1

This file contains the value, P value and standard deviation for the single-parameter regressions and the multi-parameter model M. (XLSX 60 kb)

Supplementary Data 2

This file contains the expression values, sequences and calculated parameters for the 6,348-protein data set. (XLSX 2158 kb)

Supplementary Data 3

This file contains the sequences and parameters of the optimized genes. (XLSX 66 kb)

Cite this article

Boël, G., Letso, R., Neely, H. et al. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529, 358–363 (2016). https://doi.org/10.1038/nature16509

  • DOI: https://doi.org/10.1038/nature16509
