Protein design under competing conditions for the availability of amino acids

Isolating the properties of proteins that allow them to convert sequence into the structure is a long-lasting biophysical problem. In particular, studies focused extensively on the effect of a reduced alphabet size on the folding properties. However, the natural alphabet is a compromise between versatility and optimisation of the available resources. Here, for the first time, we include the impact of the relative availability of the amino acids to extract from the 20 letters the core necessary for protein stability. We present a computational protein design scheme that involves the competition for resources between a protein and a potential interaction partner that, additionally, gives us the chance to investigate the effect of the reduced alphabet on protein-protein interactions. We devise a scheme that automatically identifies the optimal reduced set of letters for the design of the protein, and we observe that even alphabets reduced down to 4 letters allow for single protein folding. However, it is only with 6 letters that we achieve optimal folding, thus recovering experimental observations. Additionally, we notice that the binding between the protein and a potential interaction partner could not be avoided with the investigated reduced alphabets. Therefore, we suggest that aggregation could have been a driving force in the evolution of the large protein alphabet.


TECHNICAL ASPECTS OF THE METHODOLOGY
In what follows we will describe the procedure used to simulate the competition for amino acid types between two proteins, one of which is protein G. We make use of the de novo design and folding techniques as described in Ref.s [1][2][3]. Protein design consists in the sampling of the sequence space, to the aim of predicting the ensemble of sequences that will fold in a specific target structure. Protein folding, instead, is the process by which a specific protein sequence assumes a well defined 3D structure. We tackle the problem computationally, via Monte Carlo simulations. The proteins are represented with a coarsegrained model, namely the caterpillar [1,2], that is capable of refolding the native sequence of several proteins including protein G. The competition is implemented including in the design a protein partner, that is a surface representing an exposed region of a larger protein, not fully considered to reduce the computational cost of the simulations. To give to the surface a protein-like shape, we moulded it around the protein G.
Our choice of the procedure is motivated by the intent of investigating the competition for resources, under which the system can spontaneously select an optimal subset of amino acids, i.e. reduce the alphabet size, for the design of protein G's structure. The design scheme explained in the following subsection searches for sequences that minimise the energy of the protein and maximise the heterogeneity of the sequence relative to the surface-like partner.
Indeed, this results in a high variability for the potential partner and a reduced alphabet for protein G. To the solely aim of investigating the effect of a reduced alphabet on folding properties, one could have a priori imposed a specific subset of amino acids to be used in the single-protein design. However, our method goes beyond such an approach, since gives the opportunity of letting the system spontaneously select both the optimal alphabet size and the best subset of amino acids among all possible combinations. On the other hand, one could have achieved the same goal without the need of an explicit partner, letting the protein compete with a reservoir as in a grand canonical Monte Carlo simulation. However, we decided to include such a partner, to have the additional benefit of further investigate the influence of it in the folding ability of artificial protein G.

Protein model
The caterpillar protein model reduces the amino acid structure and represent it by the backbone atoms: C, O, C α , N , H ( Fig. S1(a)). The intramolecular energy for a protein of length N has the form: where α = 1/4 and E HOH = 0.75 are added to balance the relative weight of different energy terms. E hb (θ 1 , θ 2 , r OH ) is a 10 − 12 Lennard-Jones potential commonly used to represent hydrogen bonds [4]: being r OH the distance between the hydrogen atom of the amide group and the oxygen atom of the carboxyl group of the main chain; θ 1 , θ 2 the angles between the atoms COH and OHN respectively ( Fig. S1(a)), and account for the hydrogen bonds directionality; ν = 2; σ = 2Å and ϵ H = 3.1K B T .
E sc (r Cα i Cα j ) mimics the side chain-side chain interaction via a smoothed square-well-like isotropic potential: where r Cα i Cα j is the distance between the Cα atoms; d HC = 4Å is the hard core diameter; r h = 12Å. Eq. S3 is a sigmoid function with a flex at r h = r Cα i Cα j , where E sc (r Cα i Cα j ) = ϵ Cα i Cα j /2. The terms ϵ Cα i Cα j are the residue-residue interaction parameters of the interaction matrix and their value is taken from Tab. S1 of Ref. [1].
E sol (Ω − Ω i ) is an implicit solvent energy term that acts as an energy penalty if a hydrophobic (hydrophilic) amino acid is exposed (buried), and has the form: where Ω = 21 is a threshold for the number of contacts in the native structure above which the amino acid is considered to be fully buried, and ϵ i sol is the Dolittle hydrophobicity index [5]. For more details about the model see Ref. [1,2]. For the protein Γ amino acids, instead, we use a single atom representation and make use of the E sol and E sc terms only.

Modelling protein Γ
The protein Γ is modelled as a fixed layer of amino acids, idealizing a small pocket on the surface of a much larger globular protein. The artificial protein Γ is a mould obtained by pushing the folded protein conformation on a planar mesh of self avoiding beads (self avoiding radius = 2Å), as illustrated in Fig. S1(b). The high density of beads in the mesh prevents the protein from passing through it.
To obtain the mould, we initially generate a dense mesh in the z = 0 plane by placing the beads on a 2D square lattice with step 0.5Å.
We then identify the centre of mass of the protein and use it to measure the height of the protein with respect to the plane, which is one of our main control parameters. Specifically, we consider the CM height z CM normalised over the maximum distance,r M AX , of a C α from the CM of the protein, so that ζ = z CM r M AX ranges between 0 (when the CM lies on the z = 0 plane) and 1.
We consider several values of ζ and for each of them we identify a protein orientation which maximise the contact surface with protein Γ according to the following procedure. We begin by aligning the minor inertia axis of the protein with the z axis and then perform a discrete set of rotations around the x and y axes. For each of these rotations, we recompute the mould, map it to a triangular mesh, and calculate the surface of the latter. The area of the surface is obtained by summing the area of each triangle formed by all the triplets of the mesh points at z ̸ = 0.
The shape of protein Γ is obtained by setting a minimum protein-mesh bead distance µ = 13Å and using it to inflate the radius of the protein C α s, thus generating a mould by pushing downwards all mesh beads within the inflated radius. To avoid large gaps within protein Γ, we perform an iterative smoothing procedure. For each bead of the surface we compute its distance to all its neighbouring beads on the square lattice, and if this is larger than the self-avoiding radius, we shift the vertical position of the bead so to fill the gap. The value µ = 13Å ensures that protein Γ and protein G are energetically decoupled during the design process.
In order to define the sequence of protein Γ, a homogeneously distributed set of mesh points are "activated", i.e. switched from pure self avoiding beads to Cα atoms. In order to do so, we firstly switch all the beads of the mesh to C α , then we set a minimum surface C α − C α distance δ = 5Å and loop over all bead points starting from a corner of the mesh.
For each activated atom in the loop we deactivate all points within a sphere of radius δ.
Clearly, after the first atom, the loop will incur both in activated and de-activated beads.
The latter are simply skipped by the procedure. The value of δ is derived from the typical nearest distance between two residues in natural proteins, as explained in Fig. S2. Finally, we deactivate all the beads at z = 0, so that only the pocket contains C α beads.
Since the amino acids in the C Surf set are represented by C α atoms, the inter-protein interaction energy includes the caterpillar E sc and E sol terms only (see Eqs. S3 and S4).
Given that protein Γ representation is limited to the surface residues, such an approximation affects the correct evaluation of the protein Γ solvent exposure term E sol . Since in turns E sol will influence the protein binding properties, we add an offset to the number of contacts for each amino acid of protein Γ (∼ 6), to correctly account for the solvent exposure term E sol .
The offset was calculated in such a way that the maximal amino acid exposure is compatible to the one of natural proteins (e.g. the protein G itself described in Fig. S5). Finally, since protein Γ's conformation is fixed, we neglect all the interactions between all the pairs of the C surf set.

Design
In order to enforce a competition for resources, we apply the previously mentioned design scheme. During the design, the configuration of the proteins are frozen in the geometry defined during the modelling procedure of the protein Γ. The sequences of protein G and Γ are optimized simultaneously with mutation moves affecting all residues of the global system. The design consists in the optimisation of the residue-residue interactions for the target configuration, thus leading to the need of extensively investigate the sequence space.
To this aim, we perform Virtual Move Parallel Tempering (VMPT) with adaptive umbrella sampling [3] Monte Carlo simulations, with swap and single point amino acid mutation moves along the sequence, a procedure that has been already successfully employed in protein design [1,2,6]. We simulate the same system at different temperatures, and swap sequences between the replicas on the fly (acceptance probability in Eq. S9), thus enhancing the overcoming of energy barriers and, therefore, improving the sampling. Moreover, at each temperature, we collect statistics using the information coming from all other replicas, according to the virtual move scheme described in Ref. [3]. In our implementation we simu- The best candidate sequences for the folding are the ones which minimise the total energy of the target structure among the ones maximising the total number of permutations N p , given by: In Eq. S5 q = 18 is the alphabet size (proline and cysteine are not included in the design, due to their peculiar role in protein structure, which is beyond the scope of our model); N is the total number of monomers in both proteins; n i is the total number of monomers of type i.
It is important to stress that the Hamiltonian used for the design is a simplification of Eq. S1, since none of the Monte Carlo moves involve a change in the geometry of the target conformation. Since we are not interested in the folding properties of the protein Γ itself, we neglect the intra-protein Γ interactions. It reduces to: where the sum of the residue-residue interactions runs on intra-protein and inter-proteins only. Moreover, the minimum inter-protein distance has been chosen high enough to keep the proteins energetically decoupled, leaving protein Γ optimisation on the solvent interaction only. In this framework, Eq. S5 becomes a crucial ingredient since the constraint of maximising the number of permutations along the jointed sequence is the only coupling between the partners, therefore enforcing the competition during the design. Given the complex residue-residue interaction network of the compact structure of the protein, it is hard to find highly heterogeneous protein sequences with very low energy. Therefore, the design spontaneously isolate an optimal subset of amino acids for the protein design problem, thus leading as a consequence to extremely heterogeneous sequences for the protein Γ, so to maximise the overall permutation index.
associated to the amino acid replacement and amino acids swap moves, respectively. In the latter equations ∆W , ∆E and ln E p = 20 K B T is a scaling factor set to a value high enough to generate highly heterogeneous sequences (as in Ref. [2]).
The acceptance probability for the adaptive parallel tempering is: where ∆W , ∆E and ln N new p N old p refer to the differences of the relative quantities between the sequences belonging to replicas at different temperatures, and ∆T is the temperature difference.

Folding
We perform two different kind of folding simulations: the single-protein G folding (step IV in Fig.1) and the folding in presence of protein Γ (step V in Fig.1).
For the single-protein simulations, we apply the method fully described in Ref. [1,2] in temperature. We sample the protein conformations using crankshaft and pivot moves, to change the protein geometry.
As for the adaptive umbrella sampling, we construct a bias potential depending on two variables: the distance root mean square displacement DRM SD and the number of hydrogen bonds HB. For HB we consider all the pairs of oxygen and hydrogen atoms that are closer than 2.5Å, are at a distance along the chain larger than second neighbours and their respective residues are facing each other (i.e. (cos(θ 1 ) cos(θ 2 )) ν in Eq. S2 differs from zero).
The DRM SD is a measure of the distance of a configuration from a target structure: where the sum runs over the ij pairs in contact in the target structure, namely the native contacts C, identified using a cut off of 17Å according to Ref. [1,2]. ∆ ⃗ r ij is the distance between the residues i and j and ∆ ⃗ r ij T is the same distance calculated over the target structure. As target, we use the crystallographic structure from the Protein Data Bank (PDB ID: 1pgb). We preferred the DRMSD measure instead of the classical RMSD, because it is more efficient in a Monte Carlo based code where few atoms are updated at each move.
The values in Eq. S10 can be updated on the flight providing only the new distances.
For the simulations in presence of protein Γ, we adapted the procedure in order to handle a system composed of interacting partners. The simulation is performed in a cubic box (as shown in Fig. S6) containing two replicas of protein Γ (box side ∼ 360Å), that are the mirror image of the other with respect to the xy plane passing through the centre of the box.
The proteins Γ are frozen in the box, and protein G starts the folding from a fully stretched conformation. An impenetrable slab region is defined between the proteins Γ to prevent the protein G to approach them from the convex side instead of the concave one generated with the moulding procedure.
Additionally to crankshaft and pivot, we perform rotation, translation and mirroring moves, to let the protein G reorient, diffuse in the box and switch from right to left handed conformations. The simulation is performed in parallel at 32 different temperatures and the replicas exchange information through the VMPT bias scheme [3]. Each replica sampling is enhanced by a bias potential depending on two collective variables, namely the distance root mean square displacement within the protein DRM SD intra , and between proteins, DRM SD inter . Both the DRM SDs are evaluated according to Eq. S10, with the following specifications: accordingly, the DRM SD intra is computed adopting the native and isolated protein conformation as a target structure, while the DRM SD inter is computed using as a target structure the native protein bound to the protein Γ; the normalisation factor C is the number of native contacts for DRM SD inter and twice the number of native contacts for DRM SD intra ; ∆ ⃗ r ij for DRM SD inter is evaluated with respect to the protein Γ closest to the protein G for each ij pair.

Association constant
For the evaluation of the association constant n , we need to assess the number of protein Γ in the box n = 2, define the accessible simulation volume V box and discern bulk solution configurations (contributing to Q f , the partition function of the free protein) from the ones where the protein G is directly in contact with the protein Γ (contributing to Q b , the partition function of the bound configurations). The accessible volume is easily obtained by subtracting the hard wall contribution to the box volume. The bound configurations are the ones characterised by a protein-protein interaction energy E inter ̸ = 0. The configurations contributing to Q f , instead, are the ones for which the central amino acid of the protein is at a distance r > H lenght + R int with respect to all the amino acids of the surface of protein Γ (being H lenght half of the stretched protein length; R int the interaction radius) and, therefore, where the interaction with protein Γ is not possible. We refer to the latter region as the bulk solution.

SOLUTION BASINS
For all scenarios, we generated a large number of sequences that we group in solution basins. In Fig. S4 we plot the distribution of the Hamming distances calculated for all the possible pairs between the basins. The latter is measured using the Hamming distance that determines the difference between two sequences of equal length by counting the number of substitutions needed to make them coincide. It is often more convenient to normalise the Hamming distance by dividing by the protein length N . For all scenario we observe large number of solutions that differ for more than 50% of the residues.
A crucial feature of the distributions in Fig. S4, are the peaks at high Hamming distances.
They demonstrate that the protein-protein coupling induces the design process to explore distinct solution basins. Given that the proteins are energetically decoupled (evident from the low value of E sc -inter that ranges from −1.4 to −2.7 k B T ), the influence of the protein Γ can be attributed to the maximisation of the letter permutations expressed by Eq. (S5).
Such coupling enforces an increasing competition between the proteins for the alphabet available while increasing the size of protein Γ.
However, ζ= 0.60 and 0.80 have also a significant number of related solutions with sequences differing for 20% only (see Fig. S4(f)). To check that nevertheless the ζ= 0.60 and ζ= 0.80 basins are distinct, we also calculated the self-overlap distributions for ζ= 0.60 and ζ= 0.80 (see Fig. S3) and we obtained profiles that are different from each other. More importantly, they differ from the one plotted in Fig. S4(f), demonstrating that although there are similarities, the two basins are distinct.
It is important to notice that the significant difference between the sequences is an evidence of the substantial effect of the coupling established by the maximisation of the total permutations N P . The distance between the ζ 0.60 and ζ 0.80 solutions is considerably smaller than between the others. However, we verified that they correspond to distinct solution basins.
We select one sequence from each basin, that we consider representative of the whole basin. The selected sequences are shown in Tab. S1. In Fig. S11 we plot the occurrence frequency of each amino acid type in the selected sequences.
Given that the caterpillar alphabet is composed by 18 letters, the frequency of each amino acid in a maximally heterogeneous sequence would be 1/18 ∼ 6%. Therefore, we set a threshold at a slightly higher value=10%, to safely discern the contribution of dominating amino acids from the random occurrences. From the plot, it becomes apparent that the protein G designed sequence has a restricted composition of amino acids leading to a reduced effective protein alphabet.
. a) Caterpillar protein model: the green circle represents the amino acid self-avoidance volume, which has a radius of 2Å and is centred on the position of the C α atom. Each amino acid is represented through backbone atoms only. Side chain-side chain interactions are represented via a square-well-like potential (Eq. S3).For the hydrogen bonds, the model include a 10 − 12 Lennard-Jones potential, that is a function of the distance r OH , and tunes it with a multiplicative factor that involves the angles θ 1 and θ 2 , so to account for the directionality of the bond. b) Artificial protein Γ parameters. Gray dots are self-avoiding beads; blue spheres are C Surf atoms, i.e. C α atoms of the protein Γ. The red spot represents the centre of mass of the protein, z CM is its height with respect to the z = 0 plane and r M AX is the maximum CM-C α distance. µ is the minimum C α protein-C α surface distance, i.e. the value one has to use to inflate the protein for shaping protein Γ. δ is the minimum distance between two activated C α of protein Γ.
S2. Radial distribution function g(r) calculated over the active residues C Surf of the protein Γ placed at the minimum distance δ = 5Å from each other. The latter was determined by identifying the value of δ that yields the mean nearest distance ∫ 7.5 0 ρ4πr 2 g(r)r dr closest to 5.7Å obtained by averaging over 145 protein structures taken from the PDB (black line in the inset). The integral interval was chosen visually taking a region large enough to fully contain the first peak in the g(r).
It is important to stress that in the protein g(r) the selected distribution includes the contribution from all secondary structure elements [7]. S5. Water exposure profiles of protein G in presence of proteins Γ constructed with different ζ values. The colour scheme of the protein is related to the number of total contacts: solvent exposed regions in red, buried in blue. The plot shows the number of contacts for each protein residue. The grey dashed line defines the threshold Ω = 21 on the number of neighbours above which an amino acid is considered buried in the protein core. We used the information relative to the exposure of protein G to set the radius of the probe sphere used in the evaluation of the number of contacts for each active C α of the protein Γ (the blue spheres in the figure). We counted the offset for the number of neighbours for each of the protein Γ C α by multiplying the fraction of the volume of a sphere of radius 6Å that lies under the mesh surface by the average amino acid density in globular proteins (ρ = 0.011687 aa/Å 3 , evaluated over a set of 145 globular proteins). The radius of the sphere was optimized to reach an average exposure of each residue similar to the one of the most exposed residues of protein G (corresponding to the minima in the plot).

Simulation Box
S6. Simulation box for protein folding in presence of two copies of the protein Γ. The protein Γ is a surface decorated with Cα atoms, i.e. the coloured spheres in the image. Each colour, both on the protein backbone and on the Cα of the surface of protein Γ, represents a different amino acid type. The proteins sequence are designed simultaneously with the procedure described in the Design sub-section. The protein is a flexible polymer diffusing in the box under Periodic Boundary Conditions. A mirroring move provides a swap between opposite chiralities, since the caterpillar model does not take into account the amino acid chirality. The cubic box contains two proteins Γ, one the mirror image of the other, the position of which is fixed during the simulation. The proteins Γ are far apart enough to prevent the protein G to interact with both at the same time. A hard wall between the protein Γ prevents the protein G to approach the surfaces from the convex side.