Protein design under competing conditions for the availability of amino acids


Isolating the properties of proteins that allow them to convert sequence into structure is a long-standing biophysical problem. In particular, studies have focused extensively on the effect of a reduced alphabet size on the folding properties. However, the natural alphabet is a compromise between versatility and optimisation of the available resources. Here, for the first time, we include the impact of the relative availability of the amino acids to extract from the 20 letters the core necessary for protein stability. We present a computational protein design scheme that involves the competition for resources between a protein and a potential interaction partner and that, additionally, gives us the chance to investigate the effect of the reduced alphabet on protein-protein interactions. We devise a scheme that automatically identifies the optimal reduced set of letters for the design of the protein, and we observe that even alphabets reduced down to 4 letters allow for single protein folding. However, it is only with 6 letters that we achieve optimal folding, thus recovering experimental observations. Additionally, we notice that the binding between the protein and a potential interaction partner could not be avoided with the investigated reduced alphabets. Therefore, we suggest that aggregation could have been a driving force in the evolution of the large protein alphabet.


The amino acid alphabet encoding the protein function is common to all living organisms and is the result of millions of years of evolution. It is composed of 20 letters, in contrast to those of other biopolymers, such as DNA and RNA, which possess only 4 letters. Such a large alphabet gives proteins the vast variety of configurations and functions known so far.

The advent of computational protein evolution (also known as protein design)1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 opens the possibility to address fundamental questions about the nature of the amino acid alphabet17,18,19,20. Protein design consists in searching for protein sequences capable of folding into a given backbone conformation. The search is usually done by point mutations while keeping the backbone structure fixed. In addition to several applications to medicine12,14,21,22,23 and material science15,24,25,26,27, protein design offers the possibility to explore fundamental problems of protein evolution.

One of the questions that most attracts the attention of the scientific community concerns the universality of the 20 letters. Of course, the complex spectrum of protein functionalities calls for a wide range of building blocks. However, could it be possible to design proteins to fold using a reduced alphabet? And, if yes, why not simply stick with such a reduced alphabet?

The early work on protein design with alphabets of different sizes was carried out for protein lattice models in which the protein chain is constrained to be on a cubic lattice. With such models it was possible to design heteropolymers with a large variety of alphabets defined by the amino acid interactions28,29,30,31,32,33,34,35,36,37. It rapidly became apparent that even in such simplified systems it is necessary to have a minimum number of residue types to encode the target configurations38. Moreover, such simple models made it possible to explore the related question of how the alphabet size influences protein-protein interactions39,40,41,42. Finally, works done on realistic models offer substantial evidence that protein design with a minimalistic alphabet is possible43,44,45,46,47. In particular, statistical analysis of protein databases48,49,50,51,52,53,54 demonstrated that a considerable fraction of the information encoded in natural proteins could be packed into smaller efficient alphabets, from 12 (ref. 54) all the way down to just 5 residue types43,45,54,55,56,57. However, all the mentioned studies completely neglect the possibility that a competition for the availability of amino acids may have played a role in the evolution of the protein alphabet size.

In this work, we devised a design strategy that includes such a competition to spontaneously drive the selection towards the minimal subset of residues essential for protein folding.

Our principal result is the identification of an optimal protein alphabet with the minimum number of letters, without the need to impose either its size or its composition. The results show that for the folding of a small protein the minimum number of amino acid types needed is just 4. Incidentally, 4 is also the alphabet size of RNA, which has been hypothesized to be a precursor of proteins during the early stages of life. Additionally, by having a binary system, we can explore the effect of the alphabet reduction on aggregation in different protein-protein binding scenarios. From our simulations we observe that the alphabet reduction compromises the heterogeneity of the protein-protein interactions28,36,40,41,42 and that binding cannot be avoided.

These results have interesting implications for the understanding of the evolution of protein sequences and structures when the amino acid availability is taken into account. In fact, living systems are under constant pressure to use the smallest possible variety of amino acids, e.g. to limit the resources needed to construct the specialised tRNA molecules necessary for the translation process58. Hence, it is reasonable to assume that during the early stages of life, proteins capable of being designed with a smaller alphabet could have been advantageous. If protein aggregation was not crucial at that stage, then our results demonstrate that protein-based life could have started with an alphabet size compatible with that of DNA and RNA. On the other hand, the simple condition of avoiding protein aggregation could be a strong driving force against alphabet reduction.


We consider systems composed of the natural protein G structure (already successfully redesigned with several protein models3,7) and a competing element (a mould of a part of protein G that mimics, with a surface-like shape, a potential binding site of a larger protein). Both proteins are represented with the caterpillar coarse-grained model, which has been successfully tested to design and refold natural and artificial proteins7,9, including protein G.

In the following we will use the denominations: protein G referring to both natural structure and sequence as stored in the PDB with the ID 1pgb; protein \(\bar{G}\) referring to an artificial sequence designed for the natural protein G structure; protein \(\Gamma \) referring to the surface-like competing protein partner.

The protein \(\Gamma \) is created by immersing the protein G structure into a flat surface until its centre of mass (CM) reaches the desired relative height \(\zeta \) with respect to it. The flat surface is pushed down, creating a mould that is kept at a fixed distance \(\mu =13\,{\rm{\AA }}\) from the surface of the protein G. Then, the protein G is rotated around its CM to maximise the mould surface area, which represents the binding site of a second protein. We create four moulds, each corresponding to a different value of \(\zeta \) and composed of a different number of amino acids, labelled as \({C}_{{\rm{surf}}}\). The systems are characterised by \(\zeta \) = (0.20, 0.40, 0.60, 0.80), leading to surface areas of (4717.5, 3842.2, 3051.5, 2320.5) Å² and \({C}_{{\rm{surf}}}\) = (158, 127, 100, 78) residues, respectively (see the section Modelling protein \(\Gamma \) of the Supplementary Materials (SM) for details). For the sake of simplicity, we call sequence the amino acid identities of protein \(\Gamma \), although its surface-like structure is frozen and far from a polymeric chain of beads.

The procedure employed in the present work follows the steps pictorially represented in Fig. 1.

Figure 1

Pictorial representation of the steps employed to enforce a competition for amino acid availability between protein \(\bar{G}\) and a protein \(\Gamma \), and to test its effect on the folding ability of protein \(\bar{G}\) in the presence and absence of the artificial partner. (I) Create a Caterpillar version of the experimentally determined crystal structure of protein G. (II) Shape four competing partner proteins \(\Gamma \) modelled as moulds of increasing portions of the protein G. The size of the mould will influence the competition for resources, as further explained in the following sections. The larger the surface, the higher the competition. (III) Design each of the four systems considering simultaneously the proteins \(\bar{G}\) and \(\Gamma \). The procedure consists in searching for the ensemble of sequences that minimise the energy of both protein \(\bar{G}\) and \(\Gamma \) while keeping the system conformation frozen in space. The competition for the amino acids is created at this stage of our simulations. (IV) After selecting the best designed sequence (see the Design subsection for details about the criterion) for each system, isolate the portion relative to the protein \(\bar{G}\) and test its folding ability in a single-protein folding simulation. (V) Check how the folding of the latter sequences is influenced by the presence of protein \(\Gamma \) frozen in the simulation box (bearing the sequence designed concurrently with protein \(\bar{G}\)).

Once the protein \(\Gamma \) modelling is complete, the structures of both proteins are frozen, with the protein G immersed into the mould \(\Gamma \) and kept at distance μ from it (as represented in Fig. S1b). The design scheme consists of a computational exploration of the sequence space via point mutations, looking for the sequences that minimise the total energy among those that maximise the number of permutations \({N}_{P}=\frac{N!}{{n}_{A}!{n}_{B}!{n}_{C}!\ldots }\) of the total amino acid composition (N is the total number of amino acids and \([{n}_{A},{n}_{B},{n}_{C},\ldots ]\) are the abundances of amino acids of type \(A,B,C,\ldots \), respectively). See the subsection Design of Technical aspects of the methodology in the SM for details. It is important to stress that \({N}_{P} > {N}_{P}^{\bar{G}}{N}_{P}^{\Gamma }\), where \({N}_{P}^{\bar{G}}\) and \({N}_{P}^{\Gamma }\) are the numbers of permutations of protein \(\bar{G}\) and \(\Gamma \), respectively. This inequality implies that the sequences of \(\bar{G}\) and \(\Gamma \) are correlated, since the most heterogeneous sequence is not the one that maximises \({N}_{P}^{\bar{G}}\) and \({N}_{P}^{\Gamma }\) separately. In turn, it also means that \({N}_{P}\) can be maximised without maximising \({N}_{P}^{\bar{G}}\) and \({N}_{P}^{\Gamma }\) separately, and that the residues can be distributed inhomogeneously between protein and substrate.
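As a minimal illustration of the permutation count (a sketch, not the code used in the simulations), the following Python snippet evaluates \({N}_{P}\) for a sequence and compares the joint count with the product of the separate counts. The two toy sequences and the function name `permutations_count` are hypothetical and chosen only for the example.

```python
from collections import Counter
from math import factorial


def permutations_count(sequence: str) -> int:
    """Number of distinct permutations N_P = N! / (n_A! n_B! n_C! ...)."""
    n_p = factorial(len(sequence))
    for count in Counter(sequence).values():
        n_p //= factorial(count)  # each division is exact
    return n_p


# Hypothetical toy sequences (not from the paper): the two chains
# use disjoint sets of letters, i.e. the alphabet is fully segregated.
seq_g = "GKVYGKVY"       # stands in for protein G-bar
seq_gamma = "ADEFADEF"   # stands in for protein Gamma

joint = permutations_count(seq_g + seq_gamma)
separate = permutations_count(seq_g) * permutations_count(seq_gamma)

# The joint count exceeds the product of the separate counts.
print(joint > separate)
```

In this toy case the joint count is larger than the product because residues may also be exchanged between the two chains, which is the origin of the coupling discussed above.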

The choice of the distance μ between the two proteins guarantees that, during the design, the protein-protein interaction energy is negligible. Under such conditions, the design scheme leads inherently to sequences that minimise the energy of the protein \(\bar{G}\) and optimise the exposure to the solvent of each residue of protein \(\Gamma \). Since protein \(\bar{G}\) and \(\Gamma \) are energetically uncorrelated, the coupling between the proteins is then only through the maximisation of the total permutations \({N}_{P}\).


For each scenario, i.e. for each \(\zeta \in (0.2,0.4,0.6,0.8)\), the design algorithm generates a basin of solutions containing approximately \({10}^{5}\) sequences. From each basin, we select the sequence with the highest permutation number and lowest energy, considering it as representative of the whole basin, and use it to test the folding and binding properties. The selected protein \(\bar{G}\) sequences for each scenario are shown in Table S1, while Table S2 shows how much they differ from each other. To search for the smallest alphabet, we decided to focus on a single sequence instead of an average over a basin. Taking the centroid of the basin as a reference would have shifted the solution space towards higher-energy sequences that tend to have larger alphabets.
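One possible reading of this selection rule can be sketched as follows, assuming that the basin is available as a list of (sequence, energy) pairs and that the permutation number takes priority over the energy, as in the design criterion stated earlier. The function names and the toy basin are hypothetical; the actual criterion is detailed in the SM.

```python
from collections import Counter
from math import factorial


def permutations_count(sequence: str) -> int:
    """Number of distinct permutations N_P = N! / (n_A! n_B! ...)."""
    n_p = factorial(len(sequence))
    for count in Counter(sequence).values():
        n_p //= factorial(count)
    return n_p


def select_representative(basin):
    """Return the (sequence, energy) pair with the largest N_P;
    ties on N_P are broken by the lowest energy (assumed ordering)."""
    return max(basin, key=lambda item: (permutations_count(item[0]), -item[1]))


# Hypothetical toy basin: sequences with made-up energies.
basin = [
    ("GKVYGKVY", -10.0),  # N_P = 2520
    ("GGGGKKVY", -12.0),  # N_P = 840, lower energy but less heterogeneous
    ("GKVYYVKG", -11.5),  # N_P = 2520, lower energy than the first entry
]
best_sequence, best_energy = select_representative(basin)
print(best_sequence, best_energy)
```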

We observe that the residues of protein \(\bar{G}\) tend to adopt a limited set of letters. Moreover, increasing the protein \(\Gamma \) area (and hence the number of amino acids belonging to it) reduces de facto the amino acids accessible to the protein \(\bar{G}\) to minimise its energy. Hence, the fractionation of the alphabet is not caused by specific interactions between the residues but by the coupling through the maximisation of the total permutations \({N}_{P}\).

We can control the competition pressure by changing the size of protein \(\Gamma \). This competition leads to an effective reduced alphabet used by the protein \(\bar{G}\). We observe that the effective alphabet grows from 4 to 6 letters going from larger (ζ = 0.20 and 0.40) to smaller \(\Gamma \) proteins (ζ = 0.60 and 0.80), respectively. It is interesting to notice that the alphabets are made of amino acids with an average attractive pair-interaction energy and high variability in terms of the residue-solvent interactions (see Table S1 in ref. 9). Moreover, the alphabets differ from each other (letters GKVY and GKRY corresponding to \(\zeta =(0.20,0.40)\), and FGHKRY common to both \(\zeta =(0.60,0.80)\)), and for each scenario the protein amino acids are not present in the corresponding protein \(\Gamma \) sequence (see SM Fig. S11). Therefore, part of the 20 letters is segregated on the protein \(\Gamma \) sequence.

Our finding shows that the design process indeed mimics a process under competition for available amino acids. It is important to stress that such competition is the result of the coupling alone, as we impose neither the size nor the composition of the reduced alphabet. Hence, the particular letters that the design process chooses for protein \(\bar{G}\) are presumably optimal to stabilise the folded structure. This feature is the crucial element of our design scheme that allows us to isolate the critical set of residues in our alphabet for design and folding.

Finally, we test the folding and binding properties of the designed sequences. To this end, we perform Monte Carlo simulations keeping the amino acid sequence generated for each scenario fixed, and extensively exploring the conformational space of the protein \(\bar{G}\).

To test the selected sequences, we first examine the folding stability of the protein \(\bar{G}\) alone by performing a folding simulation in an empty box starting from a fully stretched configuration. Figure 2 shows the free energy profiles as a function of the distance root mean square displacement (DRMSD, defined in Eq. S10 of the SM). From previous works7,9, the criterion for assessing a stable fold is to observe a funnel shape of the free energy profile and a global free energy minimum for \(DRMSD\le 2\,{\rm{\AA }}\). Using this criterion, we can say that all protein sequences fold back into the target configuration, although with different precision. Sequences with a larger effective alphabet fold with higher precision, as can be seen from the DRMSD value of the configurations corresponding to the global free energy minimum for each system (the DRMSD values correspond to 4.9, 5.5, 2.4 and 2.7 Å in RMSD, respectively). The sequence optimised at \(\zeta =0.40\) shows a secondary minimum in the free energy, corresponding to misfolded compact structures, which makes it the least stable system for folding in the bulk. A possible explanation of such a behaviour is that the effective 4-letter protein \(\bar{G}\) alphabet for \(\zeta =0.40\) involves only hydrophilic residues (GKRY), thus leading to a lower stability.
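For readers unfamiliar with the order parameter, the snippet below sketches a DRMSD computation. It assumes the common all-pairs form, \(DRMSD=\sqrt{\frac{1}{{N}_{{\rm{pairs}}}}{\sum }_{i < j}{({d}_{ij}-{d}_{ij}^{T})}^{2}}\), where \({d}_{ij}\) and \({d}_{ij}^{T}\) are the inter-residue distances in the sampled and target structures; the exact definition used in this work is the one given in Eq. S10 of the SM and may differ, for instance in the set of pairs included. The function name and the toy coordinates are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def drmsd(coords, target):
    """All-pairs DRMSD between two structures given as lists of (x, y, z).

    Assumed definition: root mean square difference between the
    inter-residue distances of the two structures (pairs i < j).
    """
    if len(coords) != len(target):
        raise ValueError("structures must have the same number of residues")
    pairs = list(combinations(range(len(target)), 2))
    squared = sum(
        (dist(coords[i], coords[j]) - dist(target[i], target[j])) ** 2
        for i, j in pairs
    )
    return sqrt(squared / len(pairs))


# Toy example: a 3-bead target and a rigidly translated copy of it.
target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 1.0, y - 2.0, z + 5.0) for x, y, z in target]
print(drmsd(shifted, target))
```

Because only internal distances enter, the measure is invariant under rigid translations and rotations, so no structural superposition is needed before comparing a sampled configuration with the target.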

Figure 2

Folding free energy profiles F/kBT of single protein (only protein \(\bar{G}\), no protein \(\Gamma \)) at reduced temperature 0.55 as a function of DRMSD from the native target structure (protein G structure, PDB ID: 1pgb). Different colours correspond to protein \(\bar{G}\) sequences obtained via the design procedure in the presence of the protein \(\Gamma \) characterised by the \(\zeta \) value specified in the key. Right hand side: configurations corresponding to the free energy minimum for each system are represented in red, compared to the native protein G (in green). \(DRMSD=2.1\,{\rm{\AA }}\) for \(\zeta =0.20\); \(DRMSD=1.9\,{\rm{\AA }}\) for \(\zeta =0.40\); \(DRMSD=1.3\,{\rm{\AA }}\) for \(\zeta =0.60\) and \(DRMSD=1.5\,{\rm{\AA }}\) for \(\zeta =0.80\).

From the described scenario, we can draw two important conclusions: firstly, design with a limited alphabet of 4 letters can produce a funnel-like folding free energy landscape; secondly, with 6 letters we recover the folding precision of previous caterpillar designs made with 20 letters9. Our results are consistent with the experimental observation that 6 letters are a minimal set necessary to maintain protein structure and function43,45,54,55,56,57.

The Random Energy Model59,60,61 provides a criterion for a heteropolymer to be designable: it has to satisfy the relation \(q > \exp (\omega )\), where q is the alphabet size and \(\omega \) the conformational entropy per residue. Hence, a 4-letter alphabet gives an upper bound to the conformational entropy \(\omega \) of the caterpillar backbone and therefore of the more restricted natural protein backbones. Such a result is compatible with the recent observations of Cardelli et al.62, who mapped the designability phase space for a general heteropolymer decorated with directional interactions similar to the hydrogen bonds present along the protein backbone. For polymers with two directional interactions per particle the minimum alphabet measured was four, like the one presented here.
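Written out explicitly, the bound implied by the designability criterion reads as follows (with \(\omega \) expressed in units of \({k}_{B}\) per residue; the value for \(q=20\) is added only for comparison):

```latex
q > \exp(\omega) \;\Longleftrightarrow\; \omega < \ln q
\qquad\Rightarrow\qquad
q = 4:\ \omega < \ln 4 \approx 1.39,
\qquad
q = 20:\ \omega < \ln 20 \approx 3.00
```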

To test the effect of the alphabet reduction on protein-protein interaction, we also perform folding simulations in the presence of the protein \(\Gamma \), which represents a potential binding site. In Fig. 3 we plot the free energy landscape as a function of \(DRMS{D}_{{intra}}\) and \(DRMS{D}_{{inter}}\). \(DRMS{D}_{{intra}}\) is the DRMSD within protein \(\bar{G}\), and uses the native protein G structure as target configuration. \(DRMS{D}_{{inter}}\) is the DRMSD between protein \(\bar{G}\) and protein \(\Gamma \), and uses the folded bound configuration (shown in the insets of Fig. 3 for each scenario) as a target. This choice allows us to monitor the folding and binding properties of the system independently. Conformations that are folded and bound can be found in the bottom left corner, while folded unbound ones are in the top left corner.
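A landscape of this kind can be estimated from sampled configurations as \(F/{k}_{B}T=-\mathrm{ln}\,P\), where P is the probability of observing a given pair of order parameters. The sketch below assumes unbiased, equally weighted samples of (\(DRMS{D}_{{intra}}\), \(DRMS{D}_{{inter}}\)) and a uniform bin width; the function name, bin width and toy samples are hypothetical, and any reweighting required by the sampling scheme actually used is not included.

```python
from collections import Counter
from math import log


def free_energy_landscape(samples, bin_width=0.5):
    """Histogram (intra, inter) samples and return {bin: F/kBT}.

    F/kBT = -ln P(bin), shifted so that the global minimum is zero.
    Bins are labelled by integer indices (i, j) = floor(value / bin_width).
    """
    counts = Counter(
        (int(x // bin_width), int(y // bin_width)) for x, y in samples
    )
    total = sum(counts.values())
    landscape = {b: -log(c / total) for b, c in counts.items()}
    f_min = min(landscape.values())
    return {b: f - f_min for b, f in landscape.items()}


# Toy samples: three folded-and-bound configurations, one unfolded-unbound.
samples = [(1.0, 1.0), (1.1, 1.2), (1.2, 1.1), (6.0, 9.0)]
landscape = free_energy_landscape(samples)
for bin_index, f in sorted(landscape.items()):
    print(bin_index, round(f, 3))
```

With this convention, the folded and bound region corresponds to low bin indices in both coordinates, matching the bottom left corner of the plot.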

Figure 3

Folding free energy landscapes F/kBT at reduced temperature 0.76 as a function of the \(DRMS{D}_{{intra}}\) distance from the native protein G as target and the \(DRMS{D}_{{inter}}\) inter-protein distance from the folded protein bound to protein \(\Gamma \) (configurations depicted in the panels). The binding affinity decreases along with the protein \(\Gamma \) surface size, as shown by the value of the association constants Ka in the plot key.

Additionally, we separately check the free energy profiles as a function of \(DRMS{D}_{{intra}}\) for conformations with protein \(\bar{G}\) in contact with protein \(\Gamma \) (see Fig. S9 in the SM) and in the bulk solution (i.e. where no inter-protein contacts are possible; see Fig. S10 in the SM). For a sketch of the definition of contact and bulk solution configurations see Fig. S6 in the SM. To verify the consistency of the two different folding simulations, we checked that configurations in the latter region correctly fold into the target structure (Fig. S10), reproducing the behaviour observed in the isolated protein folding simulations (Fig. 2).

For all scenarios, upon binding to protein \(\Gamma \), we observe a significant enhancement of misfolded configurations with respect to what is observed in the bulk solution (compare Figs. S9(a) and S10 in the SM). In particular, there is a considerable shift in the equilibrium towards states at \(DRMSD\sim 3\,{\rm{\AA }}\) that have a free energy that is now comparable to that of properly folded configurations. It should be noticed that natural binding sites expose much smaller surface areas than the one modelled with protein \(\Gamma \). Hence, the latter effect might be mitigated by considering smaller surfaces for protein \(\Gamma \).

Analysing the behaviour of the binding process as a function of temperature, we find that the random binding is overall very strong and that it decreases with increasing temperature. The van’t Hoff plot63,64 shows positive binding affinities and an exothermic process above the folding temperature (Fig. S7 SM; see Fig. S6 SM for details about the evaluation of the association constant and Fig. S8 SM for the folding temperature evaluation). At the same time, with increasing temperature, the equilibrium shifts from partially-misfolded to fully-misfolded, indicating that the unfolding process takes place at the surface while the protein remains bound (see Fig. S9(b)). This is particularly evident for extended protein \(\Gamma \) surfaces, i.e. systems characterised by \(\zeta =(0.20,0.40)\). Hence, we observe a strong tendency of the protein \(\bar{G}\) designed with 4 letters to adsorb and aggregate on protein \(\Gamma \).
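The reasoning behind the van’t Hoff analysis can be illustrated with a short sketch. It assumes the standard linear form \(\mathrm{ln}\,{K}_{a}=-\Delta H/({k}_{B}T)+\Delta S/{k}_{B}\) in reduced units, so that a positive slope of \(\mathrm{ln}\,{K}_{a}\) versus 1/T signals an exothermic process (\(\Delta H < 0\)). The function name and the synthetic data are hypothetical and are not the values of Fig. S7.

```python
from math import exp, log


def vant_hoff_fit(temperatures, association_constants):
    """Least-squares fit of ln K_a = -dH/(kB T) + dS/kB.

    Returns (dH, dS) in reduced units (kB = 1).
    """
    x = [1.0 / t for t in temperatures]
    y = [log(k) for k in association_constants]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    slope = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y)) / sum(
        (a - x_mean) ** 2 for a in x
    )
    intercept = y_mean - slope * x_mean
    return -slope, intercept


# Synthetic data generated with dH = -5 and dS = -3 (made-up values).
temperatures = [0.60, 0.70, 0.80, 0.90]
k_a = [exp(5.0 / t - 3.0) for t in temperatures]

delta_h, delta_s = vant_hoff_fit(temperatures, k_a)
print("exothermic" if delta_h < 0 else "endothermic")
```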

Overall, this binding behaviour is an unexpected result. In the crowded cellular environment, natural proteins designed by evolution with the 20-letter alphabet do not aggregate. As such, in the present work, protein \(\bar{G}\) and \(\Gamma \) should not aggregate, since they interact through the total caterpillar alphabet of 18 letters. However, our design scheme imposes a segregation of few letters on the protein \(\bar{G}\) sequence. We identify the following observation as a possible cause. The 4-letter alphabets (GKVY and GKRY corresponding to \(\zeta =(0.20,0.40)\)) have an average intra-protein residue interaction of −0.2 kBT, while the average interaction of the single protein \(\bar{G}\) letters with all the others, i.e. the inter-protein interaction, is much lower, −0.3 kBT. This makes it impossible for the protein \(\bar{G}\) to stabilise the folded state in contact with protein \(\Gamma \). Conversely, the 6-letter alphabet (FGHKRY common to both \(\zeta =(0.60,0.80)\)) has an average intra-protein residue interaction of −0.4 kBT, which is lower than the inter-protein one of −0.3 kBT. This helps in stabilising the folded structure upon binding. If, on the other hand, the residues had been properly mixed, there would be no difference between inter and intra averages, and the random interactions should be washed out by thermal fluctuations28. Hence, there is a fundamental pressure to increase the alphabet size and fully use it to achieve folding and avoid strong adsorption.
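The comparison between intra- and inter-protein averages can be sketched as follows. The snippet assumes that the intra average runs over all unordered pairs of letters within the reduced alphabet (self-pairs included) and that the inter average runs over pairs formed by one letter of the reduced alphabet and one of the remaining letters; this is one plausible reading of the averages quoted above, not necessarily the exact procedure used. The four-letter toy matrix and its values are entirely hypothetical and are not the caterpillar interaction parameters.

```python
from itertools import combinations_with_replacement, product


def average_interactions(eps, alphabet, all_letters):
    """Return (intra, inter) average pair energies for a reduced alphabet.

    eps maps unordered letter pairs, stored as tuples, to energies in kBT.
    """
    def energy(a, b):
        return eps[(a, b)] if (a, b) in eps else eps[(b, a)]

    inside = sorted(alphabet)
    outside = [letter for letter in all_letters if letter not in alphabet]
    intra_pairs = list(combinations_with_replacement(inside, 2))
    inter_pairs = list(product(inside, outside))
    intra = sum(energy(a, b) for a, b in intra_pairs) / len(intra_pairs)
    inter = sum(energy(a, b) for a, b in inter_pairs) / len(inter_pairs)
    return intra, inter


# Hypothetical toy interaction matrix over four letters (values in kBT).
eps = {
    ("A", "A"): -0.1, ("A", "B"): -0.2, ("B", "B"): -0.3,
    ("A", "C"): -0.4, ("A", "D"): -0.2, ("B", "C"): -0.3, ("B", "D"): -0.3,
    ("C", "C"): 0.0, ("C", "D"): 0.1, ("D", "D"): 0.0,
}
intra, inter = average_interactions(eps, alphabet={"A", "B"}, all_letters="ABCD")
print(inter < intra)  # True here: binding to the partner is favoured over folding
```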

This is an essential factor that could explain why natural proteins tend to have and use a larger alphabet than 6 letters. However, the origin of the 20 letters is still only a matter of speculation. In fact, many molecular processes require additional chemical modifications of the proteins, such as glycosylation, that effectively increase the available pool of potential letters. Hence, it is not even accurate to consider 20 as the upper limit, which is why in this study we focused on the lower limit, which has a clearer definition.

In conclusion, the design procedure employed in our work has a significant segregation effect on the alphabet letters used in the protein \(\bar{G}\) sequence. The larger the number of residues on the competing protein \(\Gamma \), the smaller the effective alphabet available for the protein \(\bar{G}\) sequence. On the one hand, the design is capable of selecting a subset of letters that still allows the folding of the protein in the bulk solution even for the smallest effective alphabet (4 letters). The precision of the folding increases with the effective alphabet size. Interestingly, the experimentally determined minimum alphabet size of 6 letters is also what we identify as the minimum alphabet that recovers the design accuracy commonly obtained with a 20-letter alphabet. This implies that functionality will push the alphabet to grow. This trend could explain why reduced alphabets obtained from the analysis of natural proteins tend to be larger54.

It is important to stress that the reduced alphabet presented here might not be the only possible solution. It would be interesting to perform a larger study of the folding sequences to generate a spectrum of possible 4-letter alphabets, and to repeat it with models that include amino acid charges more explicitly.

Our results have far-reaching implications both in the field of protein design and for the understanding of protein evolution. In protein design, the possibility of using a reduced alphabet would considerably accelerate the search of the sequence space for good folders. In the field of protein evolution, instead, the smallest alphabet necessary for accurate protein design is still an open question. To the best of our knowledge, this study represents the first successful design of a full natural protein structure with a reduced alphabet of just 4 letters. Moreover, such a result offers an interesting parallel with the 4-letter alphabet of RNA, which, studies speculate, had a role in the early stages of life before the advent of proteins.


1. Gutin, A. M. & Shakhnovich, E. Ground-state of random copolymers and the discrete Random Energy-model. J. Chem. Phys. 98, 8174–8177 (1993).
2. Dahiyat, B. I. & Mayo, S. De Novo Protein Design: Fully Automated Sequence Selection. Science 278, 82–87 (1997).
3. Koehl, P. & Levitt, M. De novo protein design. I. In search of stability and specificity. J. Mol. Biol. 293, 1161–1181 (1999).
4. Kortemme, T. & Baker, D. Computational design of protein-protein interactions. Curr. Opin. Chem. Biol. 8, 91–97 (2004).
5. Fung, H. K., Welsh, W. J. & Floudas, C. A. Computational de novo peptide and protein design: Rigid templates versus flexible templates. Ind. Eng. Chem. Res. 47, 993–1001 (2008).
6. Samish, I., Macdermaid, C., Perez-Aguilar, J. & Saven, J. Theoretical and computational protein design. Annu. Rev. Phys. Chem. 62, 129–149 (2011).
7. Coluzza, I. A coarse-grained approach to protein design: learning from design to understand folding. PLoS One 6, e20853 (2011).
8. Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).
9. Coluzza, I. Transferable Coarse-Grained Potential for De Novo Protein Folding and Design. PLoS One 9, e112852, arXiv:1406.4373v1 (2014).
10. Thomson, A. R. et al. Computational design of water-soluble α-helical barrels. Science 346, 485–488 (2014).
11. Sevy, A. M., Jacobs, T. M., Crowe, J. E. & Meiler, J. Design of Protein Multi-specificity Using an Independent Sequence Search Reduces the Barrier to Low Energy Sequences. PLoS Comput. Biol. 11, e1004300 (2015).
12. Pelay-Gimeno, M., Glas, A., Koch, O. & Grossmann, T. N. Structure-Based Design of Inhibitors of Protein-Protein Interactions: Mimicking Peptide Binding Epitopes. Angew. Chemie - Int. Ed. 54, 8896–8927 (2015).
13. Chevalier, A. et al. Massively parallel de novo protein design for targeted therapeutics. Nature 550, 74–79 (2017).
14. Marcos, E. et al. Principles for designing proteins with cavities formed by curved β sheets. Science 355, 201–206 (2017).
15. Coluzza, I. et al. Perspectives on the future of ice nucleation research: Research needs and Unanswered questions identified from two international workshops. Atmosphere (Basel) 8 (2017).
16. Bianco, V., Pagès-Gelabert, N., Coluzza, I. & Franzese, G. How the stability of a folded protein depends on interfacial water properties and residue-residue interactions. J. Mol. Liq. 245, 129–139 (2017).
17. Davidson, A. R. & Sauer, R. T. Folded proteins occur frequently in libraries of random amino acid sequences. Proc. Natl. Acad. Sci. 91, 2146–2150 (1994).
18. Riddle, D. S. et al. Functional rapidly folding proteins from simplified amino acid sequences. Nat. Struct. Biol. 4, 805–809 (1997).
19. Cordes, M. H. J., Davidson, A. R. & Sauer, R. T. Sequence space, folding and protein design. Curr. Opin. Struct. Biol. 6, 3–10 (1996).
20. Davidson, A. R., Lumb, K. J. & Sauer, R. T. Cooperatively folded proteins in random sequence libraries. Nat. Struct. Biol. 2, 856 (1995).
21. Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320–327 (2016).
22. Parmeggiani, F. & Huang, P.-S. Designing repeat proteins: a modular approach to protein design. Curr. Opin. Struct. Biol. 45, 116–123 (2017).
23. Baran, D. et al. Principles for computational design of binding antibodies. Proc. Natl. Acad. Sci. 114, 10900–10905 (2017).
24. Mejias, S. H. et al. Repeat protein scaffolds: ordering photo- and electroactive molecules in solution and solid state. Chem. Sci. 7, 4842–4847 (2016).
25. Cortajarena, A. L., Liu, T. Y., Hochstrasser, M. & Regan, L. Designed Proteins To Modulate Cellular Networks. ACS Chem. Biol. 5, 545–552 (2010).
26. Mejias, S. H., Aires, A., Couleaud, P. & Cortajarena, A. L. Designed Repeat Proteins as Building Blocks for Nanofabrication. In Cortajarena, A. L. & Grove, T. Z. (eds) Adv. Exp. Med. Biol., vol. 940, chap. Protein-ba, 61–81 (Springer International Publishing, Cham, 2016).
27. Bianchi, E., Capone, B., Coluzza, I., Rovigatti, L. & van Oostrum, P. D. J. Limiting the valence: advancements and new perspectives on patchy colloids, soft functionalized nanoparticles and biomolecules. Phys. Chem. Chem. Phys. 19, 19847–19868, arXiv:1705.04383 (2017).
28. Coluzza, I. & Frenkel, D. Designing specificity of protein-substrate interactions. Phys. Rev. E 70, 051917 (2004).
29. Coluzza, I., Muller, H. G. & Frenkel, D. Designing refoldable model molecules. Phys. Rev. E 68, 046703 (2003).
30. Salvi, G., Mölbert, S. & De Los Rios, P. Design of lattice proteins with explicit solvent. Phys. Rev. E 66, 061911 (2002).
  31. 31.

    Wang, T. R., Miller, J., Wingreen, N. S., Tang, C. & Dill, K. A. Symmetry and designability for lattice protein models. J. Chem. Phys. 113, 8329–8336,, 0006372 (2000).

    ADS  CAS  Article  Google Scholar 

  32.

    Deutsch, J. M. & Kurosky, T. A New Algorithm for Protein Design. Phys. Rev. Lett. 76, 10, 9508127 (1995).

  33.

    Shakhnovich, E. I. & Gutin, A. M. Engineering of stable and fast-folding sequences of model proteins. Proc. Natl. Acad. Sci. 90, 7195–7199 (1993).

  34.

    Yue, K. & Dill, K. A. Inverse protein folding problem: designing polymer sequences. Proc. Natl. Acad. Sci. USA 89, 4163–4167 (1992).

  35.

    Bryngelson, J. D. & Wolynes, P. G. Spin glasses and the statistical mechanics of protein folding. Proc. Natl. Acad. Sci. USA 84, 7524–7528 (1987).

  36.

    Coluzza, I. & Frenkel, D. Monte Carlo study of substrate-induced folding and refolding of lattice proteins. Biophys. J. 92, 1150–1156 (2007).

  37.

    Abeln, S. & Frenkel, D. Disordered Flanks Prevent Peptide Aggregation. Plos Comput. Biol. 4, e1000241 (2008).

  38.

    Chan, H. S. & Dill, K. A. Comparing folding codes for proteins and polymers. Proteins Struct. Funct. Genet. 24, 335–344 (1996).

  39.

    Sear, R. P. & Cuesta, J. A. Instabilities in Complex Mixtures with a Large Number of Components. Phys. Rev. Lett. 91, 245701, 0307326 (2003).

  40.

    Sear, R. P. Specific protein–protein binding in many-component mixtures of proteins. Phys. Biol. 1, 53–60, 0312033 (2004).

  41.

    Sear, R. P. Highly specific protein–protein interactions, evolution and negative design. Phys. Biol. 1, 166–172 (2004).

  42.

    Madge, J. & Miller, M. A. Design strategies for self-assembly of discrete targets. J. Chem. Phys. 143, 044905 (2015).

  43.

    Plaxco, K. W., Riddle, D. S., Grantcharova, V. & Baker, D. Simplified proteins: Minimalist solutions to the ‘protein folding problem’. Curr. Opin. Struct. Biol. 8, 80–85 (1998).

  44.

    Walter, K. U., Vamvaca, K. & Hilvert, D. An active enzyme constructed from a 9-amino acid alphabet. J. Biol. Chem. 280, 37742–37746, jbc.M507210200 (2005).

  45.

    Reetz, M. T. & Wu, S. Greatly reduced amino acid alphabets in directed evolution: making the right choice for saturation mutagenesis at homologous enzyme positions. Chem. Commun. 5499 (2008).

  46.

    Liu, B. et al. iDNA-Prot|dis: Identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. Plos One 9 (2014).

  47.

    Sun, Z. et al. Reshaping an Enzyme Binding Pocket for Enhanced and Inverted Stereoselectivity: Use of Smallest Amino Acid Alphabets in Directed Evolution. Angew. Chemie - Int. Ed. 54, 12410–12415 (2015).

  48.

    Wang, J. & Wang, W. Simplification of complexity in protein molecular systems by grouping amino acids: a view from physics. Adv. Phys. X 1, 444–466 (2016).

  49.

    Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2014).

  50.

    Ferreiro, D. U., Komives, E. A. & Wolynes, P. G. Frustration in biomolecules. Q. Rev. Biophys. 47, 285–363 (2014).

  51.

    Uversky, V. N. A decade and a half of protein intrinsic disorder: Biology still waits for physics. Protein Sci. 22, 693–724 (2013).

  52.

    Longo, L. M. & Blaber, M. Protein design at the interface of the pre-biotic and biotic worlds. Arch. Biochem. Biophys. 526, 16–21 (2012).

  53.

    Li, T., Fan, K., Wang, J. & Wang, W. Reduction of protein sequence complexity by residue grouping. Protein Eng. 16, 323–330 (2003).

  54.

    Murphy, L. R., Wallqvist, A. & Levy, R. M. Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. 13, 149–152 (2000).

  55.

    Chan, H. S. Folding alphabets. Nat. Struct. Biol. 6, 994–6 (1999).

  56.

    Wang, J. & Wang, W. A computational approach to simplifying the protein folding alphabet. Nat. Struct. Biol. 6, 1033–1038 (1999).

  57.

    Solis, A. D. Amino acid alphabet reduction preserves fold information contained in contact interactions in proteins. Proteins Struct. Funct. Bioinforma. 83, 2198–2216 (2015).

  58.

    Alberts, B. et al. Molecular Biology of the Cell (Garland Science, 2002).

  59.

    Derrida, B. Phenomenological Renormalization Of The Self Avoiding Walk In 2 Dimensions. J. Phys. A-Mathematical Gen. 14, L5–L9 (1981).

  60.

    Pande, V. S., Grosberg, A. Y. & Tanaka, T. Heteropolymer freezing and design: Towards physical models of protein folding. Rev. Mod. Phys. 72, 259–314 (2000).

  61.

    Pande, V. S., Grosberg, A. Y. & Tanaka, T. Statistical mechanics of simple models of protein folding and design. Biophys. J. 73, 3192–3210 (1997).

  62.

    Cardelli, C. et al. The role of directional interactions in the designability of generalized heteropolymers. Sci. Rep. 7, 4986 (2017).

  63.

    Lim, C. W. & Kim, T. W. Dynamic [2]Catenation of Pd(II) Self-assembled Macrocycles in Water. Chem. Lett. 41, 70–72 (2012).

  64.

    Hino, S., Ichikawa, T. & Kojima, Y. Thermodynamic properties of metal amides determined by ammonia pressure-composition isotherms. J. Chem. Thermodyn. 42, 140–143 (2010).



All simulations presented in this paper were carried out on the Vienna Scientific Cluster (VSC). We acknowledge support from the VSC School, as well as from the Austrian Science Fund (FWF) project 26253-N27. V.B. acknowledges the support from FWF Grant No. M 2150-N36. I.C. gratefully acknowledges support from the Ministerio de Economía y Competitividad (MINECO) (FIS2017-89471-R). This work was performed under the Maria de Maeztu Units of Excellence Program from the Spanish State Research Agency – Grant No. MDM-2017-0720.

Author information




I.C. designed the research, F.N. performed the simulations, F.N. and I.C. performed the data analysis. C.C., F.N., V.B., L.T., I.C. and C.D. wrote the manuscript and discussed the research.

Corresponding author

Correspondence to Ivan Coluzza.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit

About this article

Cite this article

Nerattini, F., Tubiana, L., Cardelli, C. et al. Protein design under competing conditions for the availability of amino acids. Sci Rep 10, 2684 (2020).
