Codon influence on protein expression in E. coli correlates with mRNA levels

Boël, Grégory; Letso, Reka; Neely, Helen; Price, W. Nicholson; Wong, Kam-Ho; Su, Min; Luff, Jon D.; Valecha, Mayank; Everett, John K.; Acton, Thomas B.; Xiao, Rong; Montelione, Gaetano T.; Aalberts, Daniel P.; Hunt, John F.

doi:10.1038/nature16509

Article
Published: 13 January 2016

Nature volume 529, pages 358–363 (2016)

Abstract

Degeneracy in the genetic code, which enables a single protein to be encoded by a multitude of synonymous gene sequences, has an important role in regulating protein expression, but substantial uncertainty exists concerning the details of this phenomenon. Here we analyse the sequence features influencing protein expression levels in 6,348 experiments using bacteriophage T7 polymerase to synthesize messenger RNA in Escherichia coli. Logistic regression yields a new codon-influence metric that correlates only weakly with genomic codon-usage frequency, but strongly with global physiological protein concentrations and also mRNA concentrations and lifetimes in vivo. Overall, the codon content influences protein expression more strongly than mRNA-folding parameters, although the latter dominate in the initial ~16 codons. Genes redesigned based on our analyses are transcribed with unaltered efficiency but translated with higher efficiency in vitro. The less efficiently translated native sequences show greatly reduced mRNA levels in vivo. Our results suggest that codon content modulates a kinetic competition between protein elongation and mRNA degradation that is a central feature of the physiology and also possibly the regulation of translation in E. coli.

**Figure 1: Distributions of representative RNA sequence parameters in protein-expression categories in the large-scale data set.**

**Figure 2: Log-odds ratio for E5 versus E0 expression categories for proteins encoded by each nucleotide base at positions 4–96.**

**Figure 3: Codon influence on protein expression in the large-scale data set.**

**Figure 4: Contributions of physicochemical factors and regions of the coding sequence to protein expression level.**

**Figure 5: Analyses of synthetic genes designed to enhance protein expression.**

**Figure 6: Codon influence on protein expression correlates with endogenous *E. coli* protein levels and mRNA levels and lifetimes.**

Accession codes

Data deposits

Microarray data and the pMGK sequence were deposited in the Gene Expression Omnibus and GenBank under accessions GSE73416 and KT203761, respectively.

References

1. Chen, G. T. & Inouye, M. Role of the AGA/AGG codons, the rarest codons in global gene expression in Escherichia coli. Genes Dev. 8, 2641–2652 (1994)
2. Deana, A., Ehrlich, R. & Reiss, C. Synonymous codon selection controls in vivo turnover and amount of mRNA in Escherichia coli bla and ompA genes. J. Bacteriol. 178, 2718–2720 (1996)
3. Kudla, G., Murray, A. W., Tollervey, D. & Plotkin, J. B. Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255–258 (2009)
4. Tuller, T., Waldman, Y. Y., Kupiec, M. & Ruppin, E. Translation efficiency is determined by both codon bias and folding energy. Proc. Natl Acad. Sci. USA 107, 3645–3650 (2010)
5. Goodman, D. B., Church, G. M. & Kosuri, S. Causes and effects of N-terminal codon bias in bacterial genes. Science 342, 475–479 (2013)
6. Castillo-Méndez, M. A., Jacinto-Loeza, E., Olivares-Trejo, J. J., Guarneros-Pena, G. & Hernandez-Sanchez, J. Adenine-containing codons enhance protein synthesis by promoting mRNA binding to ribosomal 30S subunits provided that specific tRNAs are not exhausted. Biochimie 94, 662–672 (2012)
7. Bentele, K., Saffert, P., Rauscher, R., Ignatova, Z. & Bluthgen, N. Efficient translation initiation dictates codon usage at gene start. Mol. Syst. Biol. 9, 675 (2013)
8. Hunt, R. C., Simhadri, V. L., Iandoli, M., Sauna, Z. E. & Kimchi-Sarfaty, C. Exposing synonymous mutations. Trends Genet. 30, 308–321 (2014)
9. Spencer, P. S., Siller, E., Anderson, J. F. & Barral, J. M. Silent substitutions predictably alter translation elongation rates and protein folding efficiencies. J. Mol. Biol. 422, 328–335 (2012)
10. Li, G. W., Burkhardt, D., Gross, C. & Weissman, J. S. Quantifying absolute protein synthesis rates reveals principles underlying allocation of cellular resources. Cell 157, 624–635 (2014)
11. Li, G.-W., Oh, E. & Weissman, J. S. The anti-Shine–Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature 484, 538–541 (2012)
12. Gingold, H. & Pilpel, Y. Determinants of translation efficiency and accuracy. Mol. Syst. Biol. 7, 481 (2011)
13. Cannarozzi, G. et al. A role for codon order in translation dynamics. Cell 141, 355–367 (2010)
14. Sharp, P. M. & Li, W. H. The codon adaptation index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295 (1987)
15. Ninio, J. Fine tuning of ribosomal accuracy. FEBS Lett. 196, 1–4 (1986)
16. Tuller, T. et al. An evolutionarily conserved mechanism for controlling the efficiency of protein translation. Cell 141, 344–354 (2010)
17. Wallace, E. W., Airoldi, E. M. & Drummond, D. A. Estimating selection on synonymous codon usage from noisy experimental data. Mol. Biol. Evol. 30, 1438–1453 (2013)
18. Caskey, C. T., Beaudet, A. & Nirenberg, M. RNA codons and protein synthesis. 15. Dissimilar responses of mammalian and bacterial transfer RNA fractions to messenger RNA codons. J. Mol. Biol. 37, 99–118 (1968)
19. Ikemura, T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151, 389–409 (1981)
20. Muramatsu, T. et al. Codon and amino-acid specificities of a transfer RNA are both converted by a single post-transcriptional modification. Nature 336, 179–181 (1988)
21. Zhang, S. P., Zubay, G. & Goldman, E. Low-usage codons in Escherichia coli, yeast, fruit fly and primates. Gene 105, 61–72 (1991)
22. Bulmer, M. The selection-mutation-drift theory of synonymous codon usage. Genetics 129, 897–907 (1991)
23. Dong, H., Nilsson, L. & Kurland, C. G. Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J. Mol. Biol. 260, 649–663 (1996)
24. Elf, J., Nilsson, D., Tenson, T. & Ehrenberg, M. Selective charging of tRNA isoacceptors explains patterns of codon usage. Science 300, 1718–1722 (2003)
25. Dittmar, K. A., Sorensen, M. A., Elf, J., Ehrenberg, M. & Pan, T. Selective charging of tRNA isoacceptors induced by amino-acid starvation. EMBO Rep. 6, 151–157 (2005)
26. Zhang, F., Saha, S., Shabalina, S. A. & Kashina, A. Differential arginylation of actin isoforms is regulated by coding sequence-dependent degradation. Science 329, 1534–1537 (2010)
27. Vivanco-Domínguez, S. et al. Protein synthesis factors (RF1, RF2, RF3, RRF, and tmRNA) and peptidyl-tRNA hydrolase rescue stalled ribosomes at sense codons. J. Mol. Biol. 417, 425–439 (2012)
28. Dana, A. & Tuller, T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 42, 9171–9181 (2014)
29. Pelechano, V., Wei, W. & Steinmetz, L. M. Widespread co-translational RNA decay reveals ribosome dynamics. Cell 161, 1400–1412 (2015)
30. Presnyak, V. et al. Codon optimality is a major determinant of mRNA stability. Cell 160, 1111–1124 (2015)
31. Drummond, D. A. & Wilke, C. O. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell 134, 341–352 (2008)
32. Shakin-Eshleman, S. H. & Liebhaber, S. A. Influence of duplexes 3′ to the mRNA initiation codon on the efficiency of monosome formation. Biochemistry 27, 3975–3982 (1988)
33. Quax, T. E. et al. Differential translation tunes uneven production of operon-encoded proteins. Cell Rep. 4, 938–944 (2013)
34. Letzring, D. P., Wolf, A. S., Brule, C. E. & Grayhack, E. J. Translation of CGA codon repeats in yeast involves quality control components and ribosomal protein L1. RNA 19, 1208–1217 (2013)
35. Ude, S. et al. Translation elongation factor EF-P alleviates ribosome stalling at polyproline stretches. Science 339, 82–85 (2013)
36. Iost, I. & Dreyfus, M. The stability of Escherichia coli lacZ mRNA depends upon the simultaneity of its synthesis and translation. EMBO J. 14, 3252–3261 (1995)
37. Iost, I., Guillerez, J. & Dreyfus, M. Bacteriophage T7 RNA polymerase travels far ahead of ribosomes in vivo. J. Bacteriol. 174, 619–622 (1992)
38. Acton, T. B. et al. Robotic cloning and protein production platform of the Northeast Structural Genomics Consortium. Methods Enzymol. 394, 210–243 (2005)
39. Price, W. N. et al. Large-scale experimental studies show unexpected amino acid effects on protein expression and solubility in vivo in E. coli. Microb. Inform. Exp. 1, 6 (2011)
40. Duval, M. et al. Escherichia coli ribosomal protein S1 unfolds structured mRNAs onto the ribosome for active translation initiation. PLoS Biol. 11, e1001731 (2013)
41. Reuter, J. S. & Mathews, D. H. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11, 129 (2010)
42. Lu, J. & Deutsch, C. Electrostatics in the ribosomal tunnel modulate chain elongation rates. J. Mol. Biol. 384, 73–86 (2008)
43. Ishihama, Y. et al. Protein abundance profiling of the Escherichia coli cytosol. BMC Genomics 9, 102 (2008)
44. Chen, H., Shiroguchi, K., Ge, H. & Xie, X. S. Genome-wide study of mRNA degradation and transcript elongation in Escherichia coli. Mol. Syst. Biol. 11, 781 (2015)
45. dos Reis, M. Unexpected correlations between gene expression and codon usage bias from microarray data for the whole Escherichia coli K-12 genome. Nucleic Acids Res. 31, 6976–6985 (2003)
46. Nogueira, T., de Smit, M., Graffe, M. & Springer, M. The relationship between translational control and mRNA degradation for the Escherichia coli threonyl-tRNA synthetase gene. J. Mol. Biol. 310, 709–722 (2001)
47. Richards, J., Sundermeier, T., Svetlanov, A. & Karzai, A. W. Quality control of bacterial mRNA decoding and decay. Biochim. Biophys. Acta 1779, 574–582 (2008)
48. Ivanova, N., Pavlov, M. Y. & Ehrenberg, M. tmRNA-induced release of messenger RNA from stalled ribosomes. J. Mol. Biol. 350, 897–905 (2005)
49. Shoemaker, C. J., Eyler, D. E. & Green, R. Dom34:Hbs1 promotes subunit dissociation and peptidyl-tRNA drop-off to initiate no-go decay. Science 330, 369–372 (2010)
50. Chadani, Y., Ono, K., Kutsukake, K. & Abo, T. Escherichia coli YaeJ protein mediates a novel ribosome-rescue pathway distinct from SsrA- and ArfA-mediated pathways. Mol. Microbiol. 80, 772–785 (2011)
51. Xiao, R. et al. The high-throughput protein sample production platform of the Northeast Structural Genomics Consortium. J. Struct. Biol. 172, 21–33 (2010)
52. Acton, T. B. et al. Preparation of protein samples for NMR structure, function, and small-molecule screening studies. Methods Enzymol. 493, 21–60 (2011)
53. R Development Core Team. A Language and Environment for Statistical Computing; http://www.r-project.org/ (2012)
54. Akaike, H. A new look at the statistical model identification. IEEE Trans. Auto. Con. 19, 716–723 (1974)
55. Harrell, F. E. Jr. R package version 4.2-0; http://CRAN.R-project.org/package=rms (2014)
56. Jansson, M. et al. High-level production of uniformly ¹⁵N- and ¹³C-enriched fusion proteins in Escherichia coli. J. Biomol. NMR 7, 131–141 (1996)
57. Keseler, I. M. et al. EcoCyc: fusing model organism databases with systems biology. Nucleic Acids Res. 41, D605–D612 (2013)
58. Juncker, A. S. et al. Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci. 12, 1652–1662 (2003)
59. Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567–580 (2001)
60. Novick, A. & Weiner, M. Enzyme induction as an all-or-none phenomenon. Proc. Natl Acad. Sci. USA 43, 553–566 (1957)
61. Jensen, P. R., Westerhoff, H. V. & Michelsen, O. The use of lac-type promoters in control analysis. Eur. J. Biochem. 211, 181–191 (1993)
62. Guzman, L. M., Belin, D., Carson, M. J. & Beckwith, J. Tight regulation, modulation, and high-level expression by vectors containing the arabinose PBAD promoter. J. Bacteriol. 177, 4121–4130 (1995)

Acknowledgements

This work was supported by NIGMS Protein Structure Initiative grant U54-GM094597 to the Northeast Structural Genomics Consortium to J.F.H. and G.T.M., and NIH grant GM106372 to D.P.A. We thank B. Klingenberg, R. Gonzalez, M. Gottesman and V. de Crécy-Lagard for advice.

Author information

W. Nicholson Price
Present address: University of New Hampshire School of Law, 2 White Street, Concord, New Hampshire 03301, USA.
Reka Letso, Helen Neely and W. Nicholson Price: These authors contributed equally to this work.

Authors and Affiliations

Department of Biological Sciences and Northeast Structural Genomics Consortium, 702 Fairchild Center, MC2434, Columbia University, New York, 10027, New York, USA
Grégory Boël, Reka Letso, Helen Neely, W. Nicholson Price, Kam-Ho Wong, Min Su, Jon D. Luff, Mayank Valecha & John F. Hunt
CNRS UMR8261, Institut de Biologie Physico-Chimique, 13-rue Pierre et Marie Curie, Paris, 75005, France
Grégory Boël
Department of Molecular Biology and Biochemistry and Northeast Structural Genomics Consortium, Center for Advanced Biotechnology and Medicine, Rutgers, the State University of New Jersey, Piscataway, 08854, New Jersey, USA
John K. Everett, Thomas B. Acton, Rong Xiao & Gaetano T. Montelione
Department of Biochemistry, Robert Wood Johnson Medical School, Rutgers, the State University of New Jersey, Piscataway, 08854, New Jersey, USA
Gaetano T. Montelione
Department of Physics, Williams College, Williamstown, 01267, Massachusetts, USA
Daniel P. Aalberts

Contributions

T.B.A., R.X., J.K.E., J.F.H. and G.T.M. developed the protein expression platform. T.B.A., R.X. and G.T.M. generated the expression data set. W.N.P. initiated and M.S., M.V., J.D.L., G.B., J.F.H. and D.P.A. completed the computational analyses. H.N., K.-H.W. and R.X. constructed genes that R.L., K.-H.W. and G.B. used for biochemical studies. G.B., G.T.M., D.P.A. and J.F.H. designed the research and wrote the paper.

Corresponding authors

Correspondence to Daniel P. Aalberts or John F. Hunt.

Ethics declarations

Competing interests

Two patent applications have been submitted related to results reported in this paper. G.T.M. is affiliated with Nexomics Inc., and G.B., G.T.M., D.P.A. and J.F.H. are affiliated with OPTimum Protein Technologies.

Extended data figures and tables

Extended Data Figure 1 Phylogenic distribution of the proteins in the large-scale protein expression data set.

The colours in the cladogram encode the number of genes/proteins from each organism, as indicated by the legend. The data set includes 47 from eukaryotes (45 from humans and 2 from mouse), 809 from archaebacteria, and 96 from E. coli, with the remainder coming from other eubacteria. The organism contributing the largest number of proteins to the data set is the eubacterium Bacteroides thetaiotaomicron (150 proteins).

Extended Data Figure 2 Relationships between additional mRNA sequence parameters and results in the large-scale protein expression data set.

a, i, k, Histograms showing for each expression score the distribution of the overall G+C frequency (a), the frequency in all reading frames of the AGGA core sequence of the Shine–Dalgarno ribosome-binding sequence (i), and the amino acid repetition rate r (k; see Methods for definition). The parameter distributions in the E = 5 and E = 0 categories (n = 3,727 for both combined) are shown in a in dark and light blue, respectively, and in i and k in red and black, respectively. The symbols used for the histograms for the intermediate expression scores (n = 2,621 for all combined) are indicated in the legend for each panel. b–h, j, l–o, Plots showing the logarithm of the ratio of the number of proteins with E = 5 versus E = 0 scores as a function of parameter value. b, Data for the overall frequencies of the four individual nucleotide bases as well as the combined G + C frequency (labelled GC). c–e, The equivalent data separately for the first (c), second (d) and third (e) positions in the codons in the genes. f, Data for genes either not containing or containing at least one occurrence of the ATA–ATA di-codon (P = 2 × 10⁻³²). The error bars in this panel represent 95% confidence limits calculated from bootstrapping; the error bars for the genes without any occurrence of this di-codon are smaller than the size of the symbol. g, h, Data for the codon adaptation index¹⁴ (g) and tRNA adaptation index¹⁶ (h). j, Data for the frequency in all reading frames of the sequence AGGA. l, m, Data for the amino acid repetition rate r (l) and the codon repetition rate (m). n, o, Data for the statistical entropy of the amino acid (n) and codon sequences (o). The data in a–e, i and k are binned in equal ranges of the parameter value, while the data in g, h, j and l–o are binned in deciles containing equal populations.
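
The decile-binned log-ratio plots described in this legend can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `log_ratio_by_decile`, the use of the natural logarithm, and the choice to report each bin by its mean parameter value are all assumptions, since the legend does not specify them.

```python
import math

def log_ratio_by_decile(values, scores, n_bins=10):
    """Return (bin_mean, ln(#E5 / #E0)) for equal-population bins of `values`.

    `values` holds one sequence parameter per gene and `scores` the matching
    expression score; only genes scored 0 or 5 are used. Hypothetical helper.
    """
    pairs = sorted((v, s) for v, s in zip(values, scores) if s in (0, 5))
    n = len(pairs)
    out = []
    for b in range(n_bins):
        # Integer arithmetic gives bins whose sizes differ by at most one gene.
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        n5 = sum(1 for _, s in chunk if s == 5)
        n0 = sum(1 for _, s in chunk if s == 0)
        if n5 == 0 or n0 == 0:
            continue  # the log ratio is undefined when one category is empty
        mean_v = sum(v for v, _ in chunk) / len(chunk)
        out.append((mean_v, math.log(n5 / n0)))
    return out
```

Passing `n_bins=10` gives the equal-population deciles mentioned for panels g, h, j and l–o; the equal-range binning used for the other panels would need a different split.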

Extended Data Figure 3 Correlations between sequence parameters in the genes included in the large-scale protein expression data set.

a–c, Corrgrams representing the signed Pearson correlation coefficients between different mRNA sequence parameters in the genes in the E = 0 plus E = 5 categories in the data set (n = 3,727 for the two combined). The colour-coding is defined schematically on the left in a, with blue being used for positively correlated variables, red for negatively correlated variables, and white for uncorrelated variables. In a, E represents the expression score in the binary categories (0, 5), s_all represents the mean value of our new codon-influence metric (coloured symbols in Fig. 3a) over the entire gene (without the LEHHHHH tag), s_7–16 and s_17–32 represent the mean values of this metric for codons 7–16 and 17–32, respectively, ΔG_UH represents the predicted free energy of mRNA folding for the 5′-UTR from the pET21 expression vector plus the first 48 nucleotides in the gene, <∆G_T>₉₆ represents the mean value in the remainder of the gene of the predicted free energy of folding in 50% overlapping windows of 96 nucleotides, I represents an indicator variable that assumes a value of 0 or 1 if (ΔG_UH < −39 kcal mol⁻¹) and (%GC_2–6 > 0.65), d_AUA assumes a value of 0 or 1 if there is at least one occurrence of the ATA–ATA di-codon, r represents the codon repetition rate (see Methods), and %GC represents the percentage content of G plus C bases in the gene. The variables a_H, a_H², g_H² and u_3H represent monomial functions of the fractional content of A, G and U bases in codons 2–6; the correlation coefficient for these nucleotide-composition terms was calculated using their sum weighted by their optimized coefficients from model M (Fig. 4 and Extended Data Table 1a), as given in the equation in the main text. b, Data for the frequencies of the codons positively correlated with expression score E. c, Data for the frequencies of the codons negatively correlated with expression score E.
d–g, Two-dimensional histograms illustrating the dependence of results in the large-scale protein-expression data set on pairs of sequence parameters. The colours encode the fractional excess of proteins with E = 5 versus E = 0 scores (that is, (#E5 − #E0)/(#E5 + #E0)), as calibrated by the scale bar on the right. The area of each square is proportional to the number of proteins in that bin in the two-dimensional parameter space. The variables s_all, s_7–16 and s_tail represent, respectively, the mean values of our new codon-influence metric for the entire gene, for codons 7–16, and for all of the remaining codons downstream in the gene. ΔG_UH represents the predicted free energy of mRNA folding for the 5′-UTR from the pET21 expression vector plus the first 48 nucleotides in the gene, <∆G_T>₉₆ represents the mean value in the remainder of the gene of the predicted free energy of folding in 50% overlapping windows of 96 nucleotides, and r represents the amino acid repetition rate (as defined in Methods).
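
Three of the quantities defined in this legend are simple enough to write down directly. The sketch below is illustrative only: the function names are hypothetical, the indicator is assumed to equal 1 when both conditions hold (the legend leaves the direction implicit), codon positions are assumed to be 1-based with the start codon as codon 1, and the per-codon slope values must be supplied by the caller because they are not reproduced in this excerpt.

```python
def fractional_excess(n_e5, n_e0):
    """(#E5 - #E0) / (#E5 + #E0): -1 if every gene is E0, +1 if every gene is E5."""
    total = n_e5 + n_e0
    if total == 0:
        raise ValueError("bin is empty")
    return (n_e5 - n_e0) / total

def head_indicator(dg_uh, gc_2_6):
    """Indicator I, assumed to be 1 when dG_UH < -39 kcal/mol and the G+C
    fraction of codons 2-6 exceeds 0.65, and 0 otherwise."""
    return 1 if (dg_uh < -39.0 and gc_2_6 > 0.65) else 0

def mean_codon_slope(codons, slopes, first, last):
    """Mean codon-influence value over codons first..last (1-based, inclusive).

    `codons` is the coding sequence split into triplets and `slopes` maps each
    codon to its metric value (caller-supplied; hypothetical input format).
    """
    window = codons[first - 1:last]
    return sum(slopes[c] for c in window) / len(window)
```

With these helpers, s_7–16 for a gene would be `mean_codon_slope(codons, slopes, 7, 16)`, and s_all would use the window from the first to the last codon.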

Extended Data Figure 4 Relationship of the new codon-influence metric to parameters assumed to influence translation efficiency in previous literature.

a, Average frequency of each non-stop codon in the genes in just the E = 0 plus E = 5 categories (dark grey) or in the E = 0 through E = 5 categories (light grey), with error bars representing the s.d. of the frequency among the genes in each set. b, Codon slopes from single-variable binary logistic regressions (dark grey symbols in Fig. 3a) segregated according to the identity of the nucleotide at each of the three positions in the codon. These slopes come from single-variable linear logistic regressions that were performed separately for each of the individual 61 non-stop codons. c, Codon slopes from the simultaneous multi-parameter binary logistic regression model M (Extended Data Table 1a and coloured symbols in Fig. 3a) segregated according to the identity of the nucleotide at each of the three positions in the codon. d–h, The codon slopes from model M plotted versus the relative synonymous codon usage (RSCU) in E. coli BL21 (e), the codon adaptation index¹⁴ in E. coli K12 (f), the codon sensitivity²⁴ in E. coli K12 (d), the tRNA adaptation index¹⁶ in E. coli K12 (g), and the concentration of exactly cognate tRNAs²³ in E. coli K12 (h). The shapes and colour-coding of the symbols in b–h, which are the same as in Fig. 3, encode structural and qualitative chemical characteristics of the amino acids.
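
The single-variable binary logistic regressions in panel b fit, for one codon at a time, the probability of an E = 5 outcome as a function of that codon's frequency in the gene. The sketch below is a pure-Python stand-in using Newton-Raphson iteration, not the authors' implementation: the reference list cites R and the rms package, which suggests the original fits were run in R, and the function name and iteration count here are assumptions.

```python
import math

def fit_logistic_1d(x, y, n_iter=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson.

    x: predictor per gene (for example, the frequency of one codon).
    y: 1 for E = 5 genes, 0 for E = 0 genes.
    Returns (b0, b1); b1 plays the role of the codon slope.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p           # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w               # entries of the negative Hessian
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break                  # singular system, e.g. constant predictor
        # Newton step: solve the 2x2 system (negative Hessian) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

A positive `b1` indicates that higher frequency of the codon goes with higher odds of an E = 5 score. The simultaneous multi-parameter model M in panel c would need a general matrix solver and is outside the scope of this sketch.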

Extended Data Figure 5 Variation in codon influence as a function of position in the coding sequence.

Plots showing the reduction in the deviance of the computational model resulting from adding a term representing the average value of the codon slope (coloured symbols in Fig. 3a) in a window 5, 10 or 16 codons wide starting at the position indicated on the abscissa (that is, c through (c + 4) in blue, c through (c + 9) in red, or c through (c + 15) in purple, respectively, with c representing the number of the first codon in the window). The reduction in deviance was calculated relative to a base model containing codon frequencies in the entire coding sequence, head nucleotide composition terms (a_H, a_H², u_3H and g_H²), the predicted free energy of RNA folding in the head plus the 5′-UTR (ΔG_UH), the binary indicator variable for head folding effects I, the binary variable indicating the occurrence of an AUAAUA di-codon d_AUA, and the codon repetition rate r (n = 3,727). The mean slope of codons 2–6 presumably does not improve the model because the head-composition terms rather than codon content dominate the influence of this region on protein-expression level. This effect also probably accounts for the peaks in the s_{c − (c + 9)} and s_{c − (c + 15)} plots for windows starting at codon 7. For reference, adding s_7–16 and s_16–32 terms to model M contributes 29.7 points (P = 5 × 10⁻⁸) and 12 points (P = 5 × 10⁻⁴) of model deviance, respectively (Extended Data Table 1 and Fig. 4a). Dropping out terms to measure their influence (Fig. 4a) shows every codon contributes on average (423.7/270) = 1.6 deviance units, while codons 7–16 each contribute on average an additional (29.6/10) = 3.0 deviance units. Therefore, individual codons at positions 7–16 are approximately three times more influential than those in the tail of the gene.
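
The arithmetic at the end of this legend can be checked in a few lines. Two assumptions are made here and are not stated in the legend: that "approximately three times more influential" compares the total per-codon deviance at positions 7–16 (baseline plus additional) with the baseline alone, and that the quoted P values come from a likelihood-ratio chi-squared test with one degree of freedom.

```python
import math

def chi2_sf_1df(deviance):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(deviance / 2.0))

per_codon_all = 423.7 / 270   # average deviance per codon over the whole gene
extra_7_16 = 29.6 / 10        # additional deviance per codon at positions 7-16
# Assumed reading of "three times": (baseline + additional) / baseline.
ratio = (per_codon_all + extra_7_16) / per_codon_all

p_first = chi2_sf_1df(29.7)   # deviance quoted for the s_7-16 term
p_second = chi2_sf_1df(12.0)  # deviance quoted for the second window term

print(round(per_codon_all, 1), round(extra_7_16, 1), round(ratio, 1))
print(f"{p_first:.1e} {p_second:.1e}")
```

Under these assumptions the ratio comes out slightly below 3, and the two tail probabilities land near the quoted 5 × 10⁻⁸ and 5 × 10⁻⁴, which is consistent with a one-degree-of-freedom test.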

Extended Data Figure 6 Further experiments on synthetic genes designed to enhance protein expression.

a–d, Data for three additional proteins equivalent to the data presented in Fig. 5. The in vivo and in vitro expression properties from pET vectors are compared for inefficiently translated native (WT) genes and synonymous genes redesigned in the head or the tail or both using the 6AA, 31C-FO or 31C-FD methods. The type of sequence in the head (_H) is indicated separately from that in the tail (_T), and the name of the target protein is indicated on the left on each row. a, E. coli BL21(DE3) host cell growth curves at room temperature after induction of the target gene at time zero in chemically defined MJ9 medium. b, Coomassie-blue-stained SDS–PAGE gels of whole cells after overnight induction at 17 °C, with the amount loaded in each lane normalized to the A_{600 nm} of the culture at the time of harvest. Black arrows indicate the migration positions of the target proteins. c, Autoradiographs of SDS–PAGE gels of in vitro translation reactions using fully purified translation components in the presence of [³⁵S]methionine. Each reaction contained an equal amount of purified mRNA that was transcribed in vitro using T7 RNA polymerase. d, Northern blot analyses of the mRNA for the target protein after induction of expression in vivo. An equal amount of total RNA was loaded in each lane, and blots were hybridized with a probe matching the 5′-UTR. e, f, Coomassie blue stained SDS–PAGE gels (e) and anti-tetrahistidine western blots (f) showing that gene optimization has equivalent effects at physiological protein expression levels. Pairs of synonymous native (WT) and codon-optimized 31C-FO_H/T genes with C-terminal hexahistidine tags were re-cloned under control of the arabinose-inducible promoter in a pBAD vector⁶², and the concentration of arabinose in the growth medium was adjusted so the 31C-FO_H/T genes yielded protein expression in the physiological range as assessed from Coomassie blue stained SDS-PAGE gels of whole cell extracts. 
Black arrows indicate locations of the induced target proteins. Substantially lower protein expression from the wild-type genes compared to the synonymous 31C-FO_H/T genes in these experiments demonstrates that equivalent codon-usage effects are observed when proteins are overexpressed using a pET vector or expressed at roughly physiological levels using a pBAD vector, despite changes explained in the online Methods in the polymerase used to transcribe the genes, the medium used to grow the cells, and the timescale and temperature of the protein-induction process. The constitutively expressed ~25-kDa protein that reacts with the anti-tetrahistidine antibody in the cells containing the 31C-FO_H/T gene for YcaQ is probably an amino-terminally truncated protein synthesized from a 5′-truncated mRNA transcribed from an internal promoter sequence fortuitously introduced into this synthetic gene. Uncropped scans of the gels shown here are included in Supplementary Fig. 1.

Extended Data Figure 7 In vivo expression of synthetic genes with sequences optimized using the 31C-FO method.

a, Coomassie-blue-stained SDS–PAGE gels of whole-cell extracts after overnight induction at 17 °C of synthetic genes designed using the 31C-FO_H method to encode 17 different proteins. All genes were cloned in-frame with a C-terminal hexa-histidine tag in the same pET21 plasmid derivative used to generate our large-scale protein-expression data set³⁸. Equal volumes of induced cultures were loaded in all lanes. b, Coomassie-blue-stained SDS–PAGE gels of whole-cell extracts (top) and the corresponding soluble fractions (bottom) after overnight induction at 17 °C of 14 of the synthetic genes fused in-frame at the C terminus of the gene for the E. coli maltose-binding protein (MBP). The protein sequences come from the following source organisms: LCABL_04230 from Lactobacillus casei BL23; VIPARP466_2889 from Vibrio parahaemolyticus; AM1_4824 from Acaryochloris marina MBIC11017; CLO_0718 from Clostridium botulinum E1; ESAG_04692 from Escherichia sp. 3_2_53FAA; FTCG_00666 and FTCG_01175 from Francisella tularensis subsp. novicida GA99-3549; FTE_1275, FTE_1608, FTE_0420 and FTE_1020 from Francisella tularensis subsp. novicida FTE; FRANO wbtG and A1DS62_FRANO from Francisella novicida; FTBG_00988 and A7JEH2_FRATL from Francisella tularensis subsp. tularensis FSC033; FTN_1238 from Francisella tularensis subsp. novicida U112; O1O_09285 from Pseudomonas aeruginosa MPAO1/P1; Sthe_2331 from Sphaerobacter thermophilus DSM20745/S6022; SEVCU126_0606 from Staphylococcus epidermidis VCU126; and Y007_20720 from Salmonella enterica subsp. enterica serovar Montevideo 507440-20.

Extended Data Figure 8 Yield of mRNA from in vitro transcription using purified T7 RNA polymerase.

a, Final yield of mRNA purified from reactions conducted under identical conditions, as described in the Methods. The yields were calculated from the optical density at 260 nm. b–e, Kinetic analyses of in vitro transcription reactions using formaldehyde-agarose gel electrophoresis. Samples were taken at 0, 5, 10 and 30 min. The gels were stained with ethidium bromide. The ‘standard’ lane contains 1 μg of the same mRNA after purification to enable calibration for differences in the sensitivity of the molecules to staining. Reactions were started by addition of the wild-type or 31C-FO_H/31C-FO_T (31C-FO_H/T) linearized plasmids encoding SRU_1983 (b), APE_0230.1 (c), SCO1897 (d), or Eco-YcaQ (e).
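As an illustration of how a yield can be derived from an optical density reading at 260 nm, the following is a minimal sketch rather than the procedure used for panel a. It assumes the widely used approximation that one A260 unit in a 1-cm path corresponds to about 40 μg ml⁻¹ of single-stranded RNA; the function name, parameters and example numbers are hypothetical and are not taken from the Methods.

```python
# Minimal sketch: estimate RNA yield from an absorbance reading at 260 nm.
# ASSUMPTION (not from the paper): 1 A260 unit (1-cm path length) ~ 40 ug/ml
# of single-stranded RNA. All names below are hypothetical.

RNA_UG_PER_ML_PER_A260 = 40.0


def rna_yield_ug(a260, dilution_factor, volume_ml, path_length_cm=1.0):
    """Return the estimated RNA mass (micrograms) in the undiluted sample.

    a260            -- absorbance at 260 nm measured on the diluted sample
    dilution_factor -- fold dilution applied before measurement (e.g. 10)
    volume_ml       -- volume of the undiluted, purified mRNA sample in ml
    path_length_cm  -- optical path length of the cuvette in cm
    """
    if a260 < 0 or volume_ml < 0:
        raise ValueError("absorbance and volume must be non-negative")
    if dilution_factor <= 0 or path_length_cm <= 0:
        raise ValueError("dilution factor and path length must be positive")
    # Normalise to a 1-cm path, convert to ug/ml, then undo the dilution.
    conc_ug_per_ml = (a260 / path_length_cm) * RNA_UG_PER_ML_PER_A260 * dilution_factor
    return conc_ug_per_ml * volume_ml


if __name__ == "__main__":
    # Hypothetical reading: A260 = 0.5 on a 10-fold dilution of a 50-ul sample.
    print(rna_yield_ug(0.5, 10, 0.05))
```

The conversion factor depends on base composition and buffer, so absolute yields computed this way are approximate; comparisons between transcripts measured under identical conditions are less sensitive to that assumption.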

Extended Data Table 1 Development and analysis of the simultaneous multi-parameter binary logistic regression model

Extended Data Table 2 Codons used for synonymous gene design

Supplementary information

Supplementary Information

This file contains Supplementary Text, Supplementary References and Supplementary Figure 1, which shows the uncropped gels presented in Extended Data Figures 6 and 7. (PDF 2305 kb)

Supplementary Data 1

This file contains the value, P value and standard deviation for the single-parameter regressions and the multi-parameter model M. (XLSX 60 kb)

Supplementary Data 2

This file contains the expression values, sequences and calculated parameters for the 6,348-protein data set. (XLSX 2158 kb)

Supplementary Data 3

This file contains the sequences and parameters of the optimized genes. (XLSX 66 kb)

About this article

Cite this article

Boël, G., Letso, R., Neely, H. et al. Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529, 358–363 (2016). https://doi.org/10.1038/nature16509

Received: 13 November 2014
Accepted: 01 December 2015
Published: 13 January 2016
Issue Date: 21 January 2016
DOI: https://doi.org/10.1038/nature16509
