Cotranslational protein folding can facilitate rapid formation of functional structures. However, it can also cause premature assembly of protein complexes, if two interacting nascent chains are in close proximity. By analyzing known protein structures, we show that homomeric protein contacts are enriched toward the C termini of polypeptide chains across diverse proteomes. We hypothesize that this is the result of evolutionary constraints for folding to occur before assembly. Using high-throughput imaging of protein homomers in Escherichia coli and engineered protein constructs with N- and C-terminal oligomerization domains, we show that, indeed, proteins with C-terminal homomeric interface residues consistently assemble more efficiently than those with N-terminal interface residues. Using in vivo, in vitro and in silico experiments, we identify features that govern successful assembly of homomers, which have implications for protein design and expression optimization.
We are grateful to G. Kramer and B. Bukau (Deutsches Krebsforschungszentrum, Heidelberg, Germany) for their generous gift of trigger factor protein and A. Drummond (Department of Biochemistry & Molecular Biology, University of Chicago) for the generous gift of plasmids. We also thank L. Byung-Gil for useful advice and N. Sanchez De Groot for technical support. We thank C. Vogel, M.T. Burgas and E. Arbely for helpful suggestions and critical reading. E.N. thanks N. Weiner and the ISEF foundation for their support. M.M.B., T.F. and G.C. are supported by the Medical Research Council (MC_U105185859). T.F. was also supported by the Boehringer Ingelheim Fond. B.P. and C.P. thank ‘Lendület’ Programme of the Hungarian Academy of Sciences and the Wellcome Trust for supporting this work and the European Research Council (C.P.). B.K. is supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences and NKFI 120220. Z.M. is supported by GINOP-2.3.2-15-2016-00001. P.H. thanks the National Brain Research Programme and the TEKES Finland Distinguished Professor Grant for their support. S.A.T. thanks the Lister Institute, the MRC, the EMBL-European Bioinformatics Institute and the Wellcome Trust Sanger Institute. N.S. and T.E. were partly supported by Grants-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), mostly Innovative Areas of “Chemistry for Multimolecular Crowding in Biosystems” (JSPS KAKENHI Grant No. JP17H06351) and MEXT-Supported Program for the Strategic Research Foundation at Private Universities (2014-2019) and The Hirao Taro Foundation of KONAN GAKUEN for Academic Research. J.M. is supported by an MRC Career Development Award (MR/M02122X/1). C.R. is supported by the Medical Research Council, Grant Reference MR/N020413/1. L.H.V. was supported by EMBO (award number ALTF 698-2012), Directorate-General for Research and Innovation (FP7-PEOPLE-2010-IEF, ThPLAST 274192) and an EMBL Interdisciplinary Postdoctoral fellowship, supported by H2020 Marie Sklodowska Curie Actions. B.P. and H.P. acknowledge funding from GINOP-2.3.2-15-2016-00026. A.H.E.'s work was supported by the National Institutes of Health through grant R01 GM099865. This work is dedicated to Jakob Natan and Shalom Marciano.
Integrated supplementary information
(a-b) Distribution of interface-forming residues for homomers, as in Fig. 2, divided into bacteria, eukaryotes and archaea. (a) Including all homomers in our dataset. (b) Including only full-length or nearly full-length homomers, in which the crystallized construct contains >90% of the residues present in the UniProt sequence of the full protein. There is no apparent C-terminal interface enrichment in archaea; because this is also by far the smallest group, it is difficult to say whether this reflects genuine biological differences or small numbers. (c) Distribution of interface-forming residues for all heteromeric subunits as a control, showing no C-terminal enrichment. (d) Relative enrichment in interface in the C-terminal halves of proteins compared to the N-terminal halves for all species with >100 non-redundant heteromeric subunit structures in our dataset. No significant enrichment for heteromers is evident, in contrast to homomers. (e) Distribution of interface-forming residues for heteromeric subunits from bacteria, eukaryotes and archaea. Again, no significant enrichment is evident. (f) Given the notable difference between the enrichment observed for humans and rats in Fig. 2b, we compared the C-terminal interface enrichment in homomers from humans and rats, considering only those structures that are closely related (>70% sequence identity) between the two groups. There is very little difference between the enrichment seen in human vs. rat structures. Error bars for all plots are calculated the same as in Fig. 2 with 10⁴ bootstrapping replicates.
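The comparison of C-terminal and N-terminal halves, with bootstrapped error bars, can be illustrated with a minimal sketch. It is not the authors' pipeline: the pooled-ratio definition of enrichment, the dictionary keys, the default number of replicates and the toy values are all assumptions made for illustration, with each record holding the interface and solvent-accessible surface areas of the two halves of one protein.

```python
import random


def c_terminal_enrichment(proteins):
    """Pooled interface density of C-terminal halves over N-terminal halves.

    Assumed definition: (interface_C / SASA_C) / (interface_N / SASA_N),
    with areas summed over all proteins. Key names are hypothetical.
    """
    int_n = sum(p["interface_n"] for p in proteins)
    int_c = sum(p["interface_c"] for p in proteins)
    sasa_n = sum(p["sasa_n"] for p in proteins)
    sasa_c = sum(p["sasa_c"] for p in proteins)
    return (int_c / sasa_c) / (int_n / sasa_n)


def bootstrap_se(proteins, n_boot=1000, seed=0):
    """Standard error of the enrichment, resampling proteins with replacement.

    n_boot is an arbitrary placeholder, not a value taken from the study.
    """
    rng = random.Random(seed)
    n = len(proteins)
    stats = []
    for _ in range(n_boot):
        sample = [proteins[rng.randrange(n)] for _ in range(n)]
        stats.append(c_terminal_enrichment(sample))
    mean = sum(stats) / n_boot
    variance = sum((s - mean) ** 2 for s in stats) / (n_boot - 1)
    return variance ** 0.5


# Toy records (areas in Å^2); values are invented for illustration only.
toy = [
    {"interface_n": 100, "interface_c": 200, "sasa_n": 1000, "sasa_c": 1000},
    {"interface_n": 50, "interface_c": 50, "sasa_n": 500, "sasa_c": 500},
]
enrichment = c_terminal_enrichment(toy)
standard_error = bootstrap_se(toy, n_boot=200, seed=1)
```

A value above 1 would indicate more interface per unit of accessible surface in the C-terminal halves. Resampling whole proteins rather than residues keeps each structure's two halves paired.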
Supplementary Figure 2 Comparison of the C-terminal interface enrichment for homomers of different lengths and from different symmetry groups
(a) Homomers were split into three equally sized groups of short, medium-length and long proteins, and the interface enrichment was plotted as in Fig. 2a. (b) Homomers were grouped on the basis of the most common symmetry types, and the interface enrichment was plotted as in Fig. 2a. (c) Relative enrichment of interface and solvent-accessible surface area across the length of proteins. These plots are analogous to Fig. 2a, except interface enrichment is not normalized by solvent-accessible surface area; instead, both are shown separately. Solvent-accessible surface area is calculated only considering monomeric subunits, neglecting interactions.
(a) 611 native E. coli homomers with C-terminal GFP labels were compiled from the complete E. coli K-12 ASKA library. All proteins were over-expressed in 96-well plates and cells were imaged to determine the GFP signal. Each cell was classified into a phenotype using a supervised machine-learning algorithm. Two types of cells were selected: (i) ‘Green cells’, which have a homogeneous and high GFP signal along the cell, and (ii) ‘Dark cells’, which have a GFP signal at background levels. Finally, each homomer was classified into one of these groups, depending on which phenotype was predominant in the corresponding cell population. (b-c) N-terminal regions of the ‘Dark’ homomers are enriched in interface-forming residues as compared to ‘Green’ homomers across all length and relative interface size categories. (b) Homomers were split into two equally sized groups of short (left) and long (right) proteins. The relative enrichment of interface-forming residues along the protein length is shown in green and grey for ‘Green’ and ‘Dark’ cells, respectively, as in Fig. 2c. Error bars represent standard errors calculated from 10⁴ bootstrapping replicates as before. In the ‘Dark’ group, N-terminal regions with significant interface enrichment (indicated with *) were observed as compared to ‘Green’ proteins, in both the long and short protein groups. (c) As for the length-based analysis, homomers were split into two equally sized groups based on relative interface size. Relative interface size was calculated for each protein by dividing the size of the homomer interface by the total available surface area of the protein. In both the small and large relative interface size categories, significant enrichment of interface-forming regions was observed for ‘Dark’ homomers as compared to the ‘Green’ ones. (d) The sub-group of cytoplasmic-only proteins was analyzed separately, and the observed enrichment trend was maintained. The dataset for membrane proteins was too small and is therefore not presented.
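Three steps of this caption lend themselves to a short sketch: assigning each homomer the predominant phenotype of its cell population, computing relative interface size as interface area divided by total available surface area, and splitting the homomers into two equally sized groups. The function names, record layout and toy values below are hypothetical; the per-cell classifier itself is not reproduced.

```python
from collections import Counter


def classify_homomer(cell_labels):
    """Return the predominant per-cell phenotype ('Green' or 'Dark').

    Ties go to whichever label was seen first; how the study handled
    ties is not stated, so this is an assumption.
    """
    label, _count = Counter(cell_labels).most_common(1)[0]
    return label


def relative_interface_size(interface_area, total_surface_area):
    """Homomer interface area divided by total available surface area."""
    return interface_area / total_surface_area


def split_in_half(items, key):
    """Rank items by key and cut the ranking into two equal-sized groups."""
    ranked = sorted(items, key=key)
    mid = len(ranked) // 2
    return ranked[:mid], ranked[mid:]


# Toy records: (name, interface area, total surface area), areas in Å^2.
homomers = [
    ("A", 300, 3000),
    ("B", 900, 3000),
    ("C", 100, 4000),
    ("D", 2000, 5000),
]
small, large = split_in_half(
    homomers, key=lambda h: relative_interface_size(h[1], h[2])
)
phenotype = classify_homomer(["Dark", "Green", "Dark", "Dark"])
```

With an odd number of items the second group receives the extra one, so "equally sized" holds only up to one protein.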
Supplementary Figure 4 Constructs of the YFP sublibrary and flow cytometry and ESI-MS characterization
(a) The different constructs are identical or almost identical in sequence composition. Three N-terminal variants were used in this work (monomeric, dimeric and tetrameric), differing by a single amino acid residue. In addition, a construct with a tetrameric oligomerization domain positioned at the C terminus was also used. (b) Flow cytometry measurements of all four variants. The lines along the dot-plots indicate the intensity of the tetrameric (@N) and monomeric variants. The N-terminal tetrameric variants always show the lowest fluorescence level. The ratio is shown in (c). (d) Western blot of Tet-SL-YFP, YFP-SL-Tet and empty vector, as shown in Fig. 3b. The blot is a representative blot, where each variant was expressed from three different colonies (n=3) and three times from each colony (total n=9). (e-h) Verification of the oligomeric state of the following four YFP sub-library constructs using ESI-MS: (e) Tet-SL-YFP, (f) YFP-SL-Tet, (g) YFP-LL-Tet and (h) Mono-SL-YFP. All spectra possess charge state distributions with deconvoluted masses in agreement with the theoretical masses calculated from their amino acid sequence. Importantly, tandem MS (MSMS) experiments confirmed that the first three constructs (e–g) are tetramers: under high-energy conditions each dissociated into a monomeric subunit and a trimeric complex, as shown in the insets.
Supplementary Figure 5 Flow cytometry of constructs with tetrameric (Tet@N) or monomeric (Mono@N) oligomerization domain
(a) Flow cytometry of YFP constructs with short, medium or long linker. The lines along the dot-plots indicate the intensity of the tetrameric and monomeric variants. The ratio calculated is shown in Fig. 4b. The plots are representative of all experiments (n>5). (b-c) Flow cytometry of long-linker GFP or fGFP constructs with tetrameric (Tet@N) or monomeric (Mono@N) oligomerization domain at 37 °C and 18 °C. (b) The lines along the dot-plots indicate the intensity of the monomeric constructs. As shown by confocal microscopy (Fig. 4c), there is only a very small difference between the fGFP variants. In contrast, the GFP tetrameric variant has a significantly lower fluorescence than the monomeric variant. This difference is diminished if the strains are grown at 18 °C. (c) Analysis and ratios calculated from data presented in (b). (Independent cell culture replicates; **p-value < 0.01, *p-value < 0.05, two-sided t-test. Error bars represent s.d.)
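The tetramer-to-monomer fluorescence ratio and its spread across independent cultures can be sketched as follows. The caption does not state how the ratio was computed, so summarizing each culture by its median per-cell fluorescence is an assumption, and the function names and toy intensities are hypothetical.

```python
from statistics import mean, median, stdev


def tet_to_mono_ratio(tet_events, mono_events):
    """Ratio of median per-cell fluorescence, tetrameric over monomeric.

    Using the median as the per-culture summary is an assumption.
    """
    return median(tet_events) / median(mono_events)


def summarize_replicates(replicates):
    """Mean and sample s.d. of the ratio across independent cultures.

    replicates is a list of (tet_events, mono_events) pairs, one pair
    per culture; at least two pairs are needed for the s.d.
    """
    ratios = [tet_to_mono_ratio(tet, mono) for tet, mono in replicates]
    return mean(ratios), stdev(ratios)


# Toy per-cell intensities (arbitrary units), invented for illustration.
replicates = [
    ([10, 20, 30], [40, 40, 40]),
    ([15, 30, 45], [100, 100, 100]),
]
ratio_mean, ratio_sd = summarize_replicates(replicates)
```

A ratio well below 1 would correspond to the tetrameric variant fluorescing less than its monomeric counterpart.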
Supplementary Figure 6 Luciferase misassembly in vivo and in vitro is similar to the slow-folding GFP rather than the fast-folding fGFP and YFP
The Luciferase (Luc) reporter was chosen because its fold differs significantly from that of YFP and GFP, and because it folds more slowly. (a) In vivo assay of luminescence level after normalization to the number of cells. The Tet-LL-Luc shows almost no signal. A single amino-acid substitution to generate a monomeric variant increases the signal dramatically. Increasing the linker length of the tetrameric variant from short to long also increases the signal dramatically. Similarly to the short-linker variants, the monomeric long-linker variant had a much higher luminescence signal than the tetrameric variant. (b) In vitro results using the polysomic conditions in the PURE system. The results were in agreement with the in vivo experiments. (c) Comparison of the monosomic conditions with the polysomic conditions (as presented in (b)). According to our hypothesis, reducing the local ribosome concentration will decrease the frequency of cotranslational assembly events, thus decreasing misassembly. The results align with the hypothesis. Moreover, the C-terminal tetrameric construct, which cannot assemble cotranslationally, did not show a significant difference between the monosomic and polysomic conditions. This further supports our hypothesis. (d-e) Similarly to Fig. 5, we tested the same three chaperone groups: “KJE mix”, which includes DnaK, DnaJ and GrpE; “GroE mix”, which includes GroEL and GroES; and Trigger Factor (TF). Overall, the effect of chaperones was similar to, and even stronger than, that seen for the GFP sub-library (*p-value < 0.05, **p-value < 0.01; results are presented as the mean of the replicates and error bars represent s.d.).
(a) Western Blot of the different GFP constructs examined. For each construct, both the polysomic and monosomic conditions are shown. (b) Average quantification of (a). (Error bars represent s.d.).
(a–c) Summary of GFP, fGFP and Luc sub-library expression, with or without chaperones, using the PURE system. In each row, the tetrameric and monomeric constructs of one sub-library are examined: (a) GFP, (b) fGFP and (c) Luc. Overall, the effect of chaperones correlated with oligomeric state (tetramer versus monomer) and with folding rate (fast- versus slow-folding proteins). The highest rescue effect was achieved by the KJE mix, particularly with the tetrameric slow-folding Luc and GFP. (Results are presented as the mean of the replicates and error bars represent s.d.) (d–g) Analysis of the influence of chaperones on homomeric and heteromeric complexes in E. coli from ref. 25. (d) Depletion of misfolded homomeric and heteromeric protein complexes from the soluble fraction of an E. coli mutant with ΔKJT deletion (DnaK/DnaJ and TF are deleted). (e) The change in abundance of homomeric and heteromeric protein complexes in the insoluble fraction is shown for the same E. coli mutant strain. (f) Interaction of homomeric and heteromeric complex proteins with DnaK (PD/BG ratio). The relative frequencies were normalised to account for the number of homomeric and heteromeric complexes. (g) Histograms of absolute numbers of interactions of homomeric and heteromeric complex proteins with DnaK (PD/BG ratio).
The snapshots show the three main constructs of the YFP sub-library: Tet-SL-YFP, YFP-SL-Tet and Tet-LL-YFP. Representative simulations can be watched in Movies S1-S3. (a) Endpoints of twenty simulations of Tet-SL-YFP. (b) Endpoints of ten simulations of YFP-SL-Tet. (c) Endpoints of twenty simulations of Tet-LL-YFP. A symbol at the bottom left of a snapshot indicates that the Tet assembled in that simulation.
(a) Real-time growth rate of strains that express the different constructs used in this work. Each curve represents the average of three replicates of the same culture. The plot is representative of three such averaged curves. All experiments show the same trend: N-terminal tetrameric constructs consistently grow more slowly than the other variants. Error bars represent s.d. (b) All available E. coli homomeric protein structures were analyzed to create a library of structures of proteins with (i) a discrete oligomerization domain and (ii) data predicting whether the protein folds post- or cotranslationally. Three such protein structures were found. Oligomerization domains are shown in red, domain linkers in cyan, and other domains in yellow. CTP Synthetase (PDB:1s1m) has an N-terminal oligomerization domain, which may be compensated by a linker (cyan). The other two proteins, aspartate-semialdehyde dehydrogenase (PDB:1t4b) and alpha-N-Acetyl-galactosaminidase (PDB:2p53), have C-terminal oligomerization domains and no linkers. More information about these proteins is provided in Supplementary Data Set 4.
Supplementary Figures 1–10, Supplementary Table 1 and Supplementary Note 1
‘Dark’ and ‘Green’ homomers from the genome wide in vivo screen. The table describes for each homomer the parameters from the in vivo screen, the structural parameters of the interface location and the Western Blot analysis.
Western blot analysis to test the expression of all 136 ‘Dark’ homomers and a selected set of 25 ‘Green’ homomers. The homomers were detected using a GFP-specific antibody. Both GFP-negative (C–) and GFP-positive (C+) samples were loaded on each gel. To verify loading, either separate gels stained with Coomassie Brilliant Blue (CBB) (#1–5) or the membranes used for the Western blot assays (#6–14) are shown above the corresponding Western blot image. Asterisks show the expressed GFP-tagged proteins. Molecular masses (in kDa) are indicated on the left.
Protein complex immunoprecipitation (Co-IP). A strain with an empty vector and strains that express the tetrameric N-terminus (Tet-SL-YFP) and tetrameric C-terminus (YFP-SL-Tet) constructs were harvested a few hours after induction. Then, the cells’ contents were mixed with magnetic anti-HA antibody beads. The samples were washed and eluted. The eluted samples were run on an SDS gel, and selected bands were analyzed by MS. A list of the different proteins that were identified and their fold changes are indicated.
Characterization of examples of representative proteins as shown in Figure S10. The table describes E. coli structures that have oligomerization domains and have a known full-length protein structure as well as associated folding parameters.
Nonredundant sets of homomer structures. These are split into the sets of all complexes filtered for sequence redundancy across all structures, or at the species level, and the set of only full-length structures. The total amount of interface and monomer accessible surface area (in Å²) is given for the N-terminal and C-terminal halves of each protein. The same data is provided for heteromers.
Simulation of Tet@N with short linker. For all movies, the red segment is the Tet, and the yellow segment is the YFP β-barrel. Both cotranslational folding and misassembly take place once the Tet domains emerge from the ribosome tunnels.
Simulation of Tet@C with short linker. There was no cotranslational assembly as the Tet of the leading ribosome leaves the ribosomal tunnel prior to the translation of the second ribosome, which allows it time to diffuse before the second Tet leaves the ribosome tunnel.
Simulation of Tet@N with long linker. Cotranslational assembly takes place, but not misassembly. As in Movies S1-S2, the red segment is the Tet and the yellow segment is the YFP reporter; the cyan segment is the long linker. The less frequent misassembly events are consistent with the observed in vivo and in vitro data.