Introduction

The biological function of proteins is closely linked to their folded, three-dimensional structure1,2. Thus, the ability to predict the folded structure of a protein from knowledge of the sequence of amino acids opens many possibilities – such as predicting de-stabilizing mutations3 and design of synthetic proteins4.

The thermodynamic stability of proteins is primarily attributed to the hydrophobic effect5,6. Mutations swapping one amino acid with another of a similar hydrophobic character have been shown to frequently result in a similar folded state7,8,9. Guided by these observations, models of proteins have been proposed in which amino acids are simply classified as hydrophobic, polar or neutral5,10. To further reduce complexity, the solvent is usually treated implicitly, including the hydrophobic effect as an effective attraction between hydrophobic amino acid residues. These models are computationally fast and have played an important role in developing an understanding of the thermodynamics of protein folding11, mis-folding12,13, aggregation14,15,16 and adsorption17,18. However, because of the absence of explicit solvent, they can only use empirical parameterizations to capture the temperature and pressure-dependent changes in the hydrophobic effect, which is thought to be responsible for phenomena like cold and pressure denaturation19. Simulation studies also show that the solvent plays an important role in the kinetics of the hydrophobic collapse20. All-atom simulations (e.g. Ref. 21,22,23,24,25) of proteins in water can circumvent these issues, but at substantial computational cost. There is clearly a need for intermediate-level coarse-grained models that are computationally tractable, while explicitly accounting for water and thus capturing the behavior of real proteins with fidelity.

Protein molecules are special heteropolymers26,27. Although the outer surface of a protein is patchy, with both hydrophobic and polar amino acids28 and there are stable charged groups in the protein's interior29, the molecules fold so as to predominantly shield their hydrophobic amino acids while exposing the polar residues to the solvent30. This kind of spatial arrangement of amino acids would be expected to prevent protein aggregation, unlike the behavior of oil molecules in water31.

The folding of a protein molecule involves a global hydrophobic collapse of the entire protein chain and the formation of local secondary structures32,33,34. The kind of secondary structure that a protein segment forms is a function of the sequence of hydrophobic and polar amino acids in that segment35,36. The sequence-dependence of kinetics and thermodynamics of the hydrophobic collapse of a heteropolymer chain has not been systematically studied in experiments and computationally has been restricted to implicit solvent models37,38,39 with a few exceptions40.

The goal of this work is to design a protein-like heteropolymer (composed of hydrophobic and polar-like groups) in an explicit water-like medium that has the potential to capture the essential features of hydrophobic-driven collapse and the predominant exposure of polar groups to the solvent. Many lattice-based algorithms have been proposed for designing sequences that are capable of folding into a unique or a low degeneracy conformation in an implicit solvent41,42,43,44,45. These ideas are not applicable to an off-lattice chain in explicit solvent, so we instead use a combination of real-space Monte Carlo (MC) and sequence-space simulated annealing MC simulations (see Methods). This methodology, which is inspired by early ideas of Khokhlov and Khalatur45, allows us to design sequences that have many protein-like features, such as a collapsed state with a hydrophobic core and a polar exterior. It will be seen that this protein model exhibits cold and heat denaturation, as found in real proteins. Further, we show that the hydrophobic collapse is a function of the sequence of amino acids in the protein chain. As a caveat we emphasize that no attempt has been made to capture the well-known fact that proteins develop secondary structures and we defer this important development to future work.

Results

Fig. 1 shows how the root mean squared radius of gyration, <Rg2>1/2, of the designed and the random 1000-mer heteropolymers varies as a function of x ( = Vpolar/Vatt). Each value was determined by averaging the <Rg2>1/2 from MC simulations of three different sequences for both the designed and the random heteropolymer (for the definition of the model parameters, see the Methods section). We find that simulations of the random heteropolymer starting with a collapsed conformation equilibrate to an open conformation for x > 0.82. The designed heteropolymer, on the other hand, can attain a collapsed conformation for x ≤ 0.94. Further, we find that the collapsed state of the designed heteropolymer always has a hydrophobic core and a polar exterior (Figs. 2(b) and 2(d)), while the random sequence does not show such segregation (Figs. 2(a) and 2(c)). To understand the random sequence data we note that a chain comprised purely of hydrophobic monomers is collapsed under these state conditions46. Since the collapsed state of the designed sequence is stable for a broader range of x than that of the random sequence, we speculate that the tendency of the designed heteropolymer to collapse for x ≤ 0.94 reflects a micro-phase separation of the chain caused by the partial segregation of the polar and hydrophobic monomers in the sequence. These interaction-dependent results are consistent with the expectation that the hydrophobic collapse of a protein is a function of its sequence.
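The collapse criterion used in this paragraph can be illustrated with a minimal sketch. It assumes monomer coordinates have already been unwrapped from the periodic box and that all monomers carry equal weight; the function names are hypothetical and are not taken from the simulation code used in this work.

```python
import math

COLLAPSE_THRESHOLD = 9.53  # dotted line in Fig. 1, in units of sigma


def radius_of_gyration(coords):
    """Rg of one conformation: RMS distance of monomers from their centroid."""
    n = len(coords)
    com = [sum(c[k] for c in coords) / n for k in range(3)]
    rg2 = sum(sum((c[k] - com[k]) ** 2 for k in range(3)) for c in coords) / n
    return math.sqrt(rg2)


def rms_radius_of_gyration(conformations):
    """<Rg^2>^(1/2): square root of the mean of Rg^2 over sampled conformations."""
    mean_rg2 = sum(radius_of_gyration(c) ** 2 for c in conformations) / len(conformations)
    return math.sqrt(mean_rg2)


def is_collapsed(conformations, threshold=COLLAPSE_THRESHOLD):
    """Apply the size-based criterion: collapsed if <Rg^2>^(1/2) <= threshold."""
    return rms_radius_of_gyration(conformations) <= threshold
```

Note that the average is taken over Rg squared before the square root, which is what the notation <Rg2>1/2 implies.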

Figure 1

Root-mean-squared radius of gyration, <Rg2>1/2 of designed (red) and random (blue) heteropolymers as a function of x at T = 0.7 and ρ = 0.257 determined using canonical ensemble MC simulations.

x is the ratio of the potential well depth of the polar monomers with respect to the solvent monomers (Vpolar/Vatt). Three different simulations with different starting conformations were averaged for the calculation of <Rg2>1/2. For both the designed and random heteropolymer, three different sequences were chosen for the simulations. The dotted line is at <Rg2>1/2 = 9.53. The configurations for which the <Rg2>1/2 lies below the dotted line are considered collapsed (see text). The error bars represent the standard deviation of the <Rg2>1/2 from the three simulations. The lines are guides to the eye.

Figure 2

Comparison of the spatial arrangement of monomers in the collapsed state of a designed sequence and of a random sequence at x = 0.8, T = 0.7 and ρ = 0.257 obtained from a canonical ensemble MC simulation.

Similar spatial profiles were obtained from simulations with different starting conformations and different designed and random sequences. The radial number density of the hydrophobic monomers (red) and the polar monomers (black) for (a) the random heteropolymer and (b) the designed heteropolymer as a function of the distance from their center of mass. The mole fraction of polar monomers in (c) the random heteropolymer and (d) the designed heteropolymer as a function of the distance from their center of mass.

Fig. 2 shows the spatial distribution of the hard sphere (“hydrophobic”) and Jagla (“polar”) monomers of a random sequence and a designed heteropolymer, with x = 0.8, in their collapsed conformations after long canonical ensemble MC simulations. Let us first examine the random sequence (Figs. 2(a) and 2(c)). Polar monomers are preferred at the surface, but it is apparent that they are not excluded from the core. This supports the picture that the chain with a random sequence collapses non-specifically, as may be expected from a homopolymer chain in poor solvent conditions. This behavior contrasts strongly with that of the designed sequence, in which the polar monomers are completely excluded from the core (Figs. 2(b) and 2(d)). While both polar and hydrophobic monomers are present on the surface, there is a strong segregation of polar monomers to the surface of the heteropolymer. Further, this sequence can be collapsed from a random, open conformation, to a spatial distribution similar to that represented in Figs. 2b and 2d. This supports the assertion that the collapsed state of this sequence is the thermodynamic equilibrium state under these conditions. We note that this collapsed conformation is not unique, and, as noted, it does not acquire secondary structure, because our model only employs flexible bond angles, has no torsional potentials and does not have any side groups which can direct this process. However, the ensemble of conformations in the collapsed state always has a hydrophobic core and a polar exterior and therefore has the basic structural feature of protein molecules in their native state.
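The profiles discussed above (radial number density and polar mole fraction around the center of mass) could be computed along the following lines. This is only a sketch: it assumes equal monomer masses, coordinates unwrapped from the periodic box, and a hypothetical bin width `dr`; it is not the analysis code used for Fig. 2.

```python
import math


def radial_profiles(coords, is_polar, dr=0.5, r_max=20.0):
    """Bin monomers in spherical shells around the chain's center of mass.

    Returns one tuple per shell:
    (shell midpoint, hydrophobic number density, polar number density,
     polar mole fraction or None if the shell is empty).
    """
    n = len(coords)
    com = [sum(c[k] for c in coords) / n for k in range(3)]
    nbins = int(round(r_max / dr))
    counts_h = [0] * nbins
    counts_p = [0] * nbins
    for c, polar in zip(coords, is_polar):
        b = int(math.dist(c, com) / dr)
        if b < nbins:
            if polar:
                counts_p[b] += 1
            else:
                counts_h[b] += 1
    profiles = []
    for b in range(nbins):
        r_in, r_out = b * dr, (b + 1) * dr
        shell_volume = 4.0 / 3.0 * math.pi * (r_out ** 3 - r_in ** 3)
        total = counts_h[b] + counts_p[b]
        polar_fraction = counts_p[b] / total if total else None
        profiles.append((0.5 * (r_in + r_out),
                         counts_h[b] / shell_volume,
                         counts_p[b] / shell_volume,
                         polar_fraction))
    return profiles
```

In practice the counts would be accumulated over many sampled conformations before normalizing, so that the inner shells, which contain few monomers, are not dominated by noise.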

The designed heteropolymer displays both cold and heat denaturation (Fig. 3). For T ≤ 0.5 and T > 0.9, the designed sequence adopts an open conformation. Upon heat denaturation, the <Rg2>1/2 of the designed heteropolymer becomes greater than the <Rg2>1/2 of a 1000-mer Gaussian chain (~12.9). Upon cold denaturation, the <Rg2>1/2 of the heteropolymer increases but does not attain a random walk state. This behavior is analogous to that of a homopolymer47. It has been shown that at pressure P close to 0.0, the solubility of hard spheres in the Jagla solvent increases and the solvent-separated configuration of two hard spheres becomes the most stable configuration as T is decreased below 0.75 46,48. The cold-denatured state is also consistent with the peaked distribution of the angle between two subsequent bonds of a swollen homopolymer, which has a maximum near 60 degrees47. The increase in <Rg2>1/2 of the heteropolymer for T ≤ 0.5 is a consequence of the increase in solubility of hard spheres in the Jagla solvent. We have defined the formation of the ground state of the designed heteropolymer simply by the value of <Rg2>1/2 and not necessarily by the loss of secondary structure or native contacts, as is more typical for experiment and more detailed models. It is now well accepted that proteins can lose secondary structure, but still maintain a molten globule shape over a broader temperature range. Based on these facts, the stability range of the native state of a protein as defined by its size is expected to be broader than would be the case when using a metric that only incorporates secondary structure changes. This might explain the relatively large temperature range over which the collapsed state of the heteropolymer is stabilized.

Figure 3

Root-mean-squared radius of gyration, <Rg2>1/2, of the designed heteropolymer as a function of temperature for a number density ρ = 0.257 and x = 0.86.

The data are averaged over four simulations with different designed sequences, each started from a different random conformation. The designed heteropolymer displays cold and heat denaturation. The error bars represent the standard deviation of <Rg2>1/2 from the four simulations. The line is a guide to the eye.

We now explore the molecular origins of these results. We define a hydrophobic segment length as the number of hydrophobic monomers that separate two polar monomers along the chain backbone. The segment length distribution, S(L), (Fig. 4) was calculated by multiplying the length of a segment, L, by its probability of occurring in a sequence. The S(L) of the random sequence of polar and hydrophobic monomers was determined by assuming that we have xN beads of P-H type and (1 − 2x)N beads of H type, which are mixed randomly in the sequence (in this paragraph x denotes the fraction of polar monomers, 0.25, and not the well-depth ratio Vpolar/Vatt). Since there are (1 − x)N beads in total, random sequences of length N can be generated by randomly selecting beads of type P-H with probability x/(1 − x) and beads of type H with probability (1 − 2x)/(1 − x). The probability to find a sequence of (L − 1) H in a row (which corresponds to distance L between two neighboring P) is then a pure geometric distribution, P(L) = [(1 − 2x)/(1 − x)]^(L−2) x/(1 − x). S(L) for random sequences is simply L P(L). The actual distribution of S(L) for random sequences, determined by generating 10^4 random sequences, deviates slightly from the above theoretical distribution because of the finite size of the sequence. The S(L) of the designed sequence was computed from configurations that were generated during the sequence-space MC simulation when the θ = 0 condition was reached. As compared to the S(L) of a random heteropolymer, the S(L) of the designed sequences has a larger peak at small L and consequently a broader tail. This implies that the designed sequence contains statistically both smaller and larger blocks of hydrophobic units than a random sequence. This result is similar to that found by Khokhlov and Khalatur45. We rationalize it as follows. If a polar monomer is at the surface of the collapsed protein, then the next polar monomer should be relatively close to it in the sequence so that it too can be placed at the surface without much entropic loss. On the other hand, if a hydrophobic sequence goes into the core of the collapsed chain, then it must be a long sequence since polar units are preferentially excluded from the core. The formation of a protein-like structure with a hydrophobic core and the preferential placement of polar monomers at the surface therefore appears to be at the heart of the behavior of S(L).
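The random-sequence reference distribution described above can be checked with a short sketch. It assumes the P-H/H bead construction from the text, writes the polar fraction as `f` (to avoid confusion with the well-depth ratio x) and measures L as the distance along the chain between neighboring polar monomers; all function names are hypothetical.

```python
import random
from collections import Counter


def theoretical_S(L, f=0.25):
    """S(L) = L * P(L) for an infinite random chain with polar fraction f."""
    if L < 2:
        return 0.0  # adjacent polar monomers are excluded by construction
    p = f / (1.0 - f)                # probability of drawing a P-H bead
    q = (1.0 - 2.0 * f) / (1.0 - f)  # probability of drawing an H bead
    return L * q ** (L - 2) * p


def random_sequence(n=1000, f=0.25, rng=random):
    """Shuffle P-H and H beads; no two P can end up adjacent."""
    n_ph = int(round(f * n))
    beads = ["PH"] * n_ph + ["H"] * (n - 2 * n_ph)
    rng.shuffle(beads)
    return "".join(beads)


def polar_spacings(seq):
    """Distances along the chain between consecutive polar monomers."""
    pos = [i for i, m in enumerate(seq) if m == "P"]
    return [b - a for a, b in zip(pos, pos[1:])]


def empirical_S(sequences):
    """Estimate S(L) = L * P(L) from a collection of sequences."""
    counts = Counter(d for s in sequences for d in polar_spacings(s))
    total = sum(counts.values())
    return {L: L * c / total for L, c in sorted(counts.items())}
```

Calling `empirical_S` on many sequences from `random_sequence` gives the finite-chain estimate to compare against `theoretical_S`. With f = 0.25 the theoretical S(L) sums to 4, the mean spacing between polar monomers, which is a quick consistency check on the geometric form.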

Figure 4

Segment length distribution, S(L) of the designed sequence (red circles) as compared to a random sequence (blue squares).

The designed sequence is more blocky in hydrophobic monomers. That is, the designed sequence has a larger number of very small segments and large segments (see text for the definition of a segment). The S(L) of random sequences of polar and hydrophobic monomers was determined theoretically by assuming a chain of infinite length. The theoretical distribution deviates only slightly from S(L) of random sequences computed by randomly generating 10^4 sequences of length 1000. The S(L) of the designed sequence was computed from configurations that were generated during the sequence-space MC simulation when the θ = 0 condition was reached. The lines are guides to the eye.

We also explored the possibility of observing pressure denaturation in our model. We know from previous work46,47 that a hydrophobic homopolymer can show pressure induced swelling in a Jagla solvent, thus implying that pressure induced denaturation should also be observed for these designed heteropolymers. For the pressure range 0.0 ≤ P ≤ 1.4, we find that a swollen chain undergoes a collapse implying that the collapsed state is the thermodynamically stable state at these conditions. However, in the simulations with P > 1.4, within the length of our simulations, a swollen chain does not undergo any collapse and at the same time, a collapsed chain does not attain a swollen state. The hysteresis in the results for P > 1.4 implies that as the density of the solvent increases, the activation barriers between the “folded” and “unfolded” states of the heteropolymer become large, thus preventing us from clearly identifying the pressure denaturation transition without a more sophisticated or intensive sampling. More work needs to be performed to explore this interesting finding and quantify the free energy barrier that apparently becomes large at high pressure.

Discussion

We have developed a simple model of a protein-like heteropolymer in an explicit solvent medium having water-like behavior. In contrast to “real” water, which has three atoms and explicit hydrogen bonding, our model represents “water” as a single sphere that interacts with all other atoms through a spatially isotropic potential called the Jagla potential. The Jagla potential has been shown previously to be quite successful in displaying water-like thermodynamic, dynamic and structural anomalies49,50 as well as water-like solvation thermodynamics46,48. The Jagla potential has effectively two length scales – a repulsive ramp and a hard-sphere core. At high densities and temperatures, the hard-sphere core defines the effective radius of the Jagla particles. Upon reducing pressure or decreasing temperature, the interparticle separation of Jagla pairs changes from close to the hard-core diameter to a distance more comparable to the repulsive ramp diameter, thus allowing the system to achieve a more open structure. This behavior is reminiscent of water at low temperature and density, wherein an open-structured liquid is formed because of the formation of H-bonds. These simplifications yield significant computational advantages, but without the loss of qualitative hydrophobic behavior of solutes. Because of the presence of explicit solvent, the model is able to capture a priori the phenomena of cold and heat denaturation reminiscent of real proteins. The model results show that a sequence that is able to attain a spatial arrangement in which, predominantly, hydrophobic monomers are in the interior and polar monomers are on the surface of the collapsed state has a much stronger tendency to undergo a hydrophobic collapse than does a random heteropolymer. This is consistent with the expectation that hydrophobic collapse in natural proteins is aided by their sequence. 
The results also show that the evolved sequences producing collapse have longer than statistical runs of hydrophobic groups, consistent with the needed topology of a quasi-spherical collapsed state with a predominantly solvophilic surface. In this study, we did not attempt to include features in the model that would lead to secondary structure and we have only touched on the influence of high pressure. Both of these aspects will be topics of future studies.

Methods

Model

We model water molecules through a spherically symmetric Jagla potential46,51. The Jagla potential has a hard sphere core of radius σ/2 (defining the characteristic length scale for the model), a repulsive, linear ramp from r = σ/2 to r = a/2 and an attractive tail from r = a/2 to r = b/2. At r = σ/2+, the potential of the repulsive ramp is a positive value, Vrep, and at r = a/2 the potential is chosen to be Vatt, defining the characteristic energy scale for the model. In this work, we measure length in units of σ, energy in units of |Vatt|, pressure in units of |Vatt|σ^−3, temperature in units of |Vatt|kB^−1 and number density, ρ, in units of σ^−3. From r = a/2 to r = b/2 the attractive tail increases linearly to zero. The potential is zero for r > b/2. The parameter values invoked here are those used by Buldyrev et al.46 (a = 1.72, b = 3 and Vrep = 3.5). The Jagla model has been shown to display water-like thermodynamic, dynamic and structural anomalies49 and water-like solvation thermodynamics46,52. The water-like anomalies of the Jagla potential are due to its two length scales – the repulsive ramp and the hard-sphere core. At high densities and temperatures, the hard-sphere core defines the effective radius of the Jagla particles. Upon reducing pressure or decreasing temperature, the inter-particle separation of Jagla nearest-neighbor pairs changes from close to the hard-core diameter (r = 1) to a distance more comparable to the repulsive ramp diameter (r = 1.72). This allows the system to achieve a more open structure, a behavior that is reminiscent of water at low temperature and density, wherein an open-structured liquid is formed because of the formation of H-bonds. This provides one interpretation for the success of the Jagla model in describing water-like anomalies50,53.
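The piecewise-linear form described above can be written as a short function. This sketch assumes that r is the center-to-center separation of two particles, so that the hard core sits at r = σ = 1, the potential minimum at r = a = 1.72 and the cutoff at r = b = 3 (matching the hard-core and ramp diameters quoted in the text), and that Vatt = −1 in reduced units. The function name and signature are hypothetical.

```python
import math


def jagla(r, sigma=1.0, a=1.72, b=3.0, v_rep=3.5, v_att=-1.0):
    """Jagla pair potential in reduced units (energy in |Vatt|, length in sigma)."""
    if r < sigma:
        return math.inf  # hard core
    if r < a:
        # repulsive ramp: falls linearly from v_rep at sigma to v_att at a
        return v_rep + (v_att - v_rep) * (r - sigma) / (a - sigma)
    if r < b:
        # attractive tail: rises linearly from v_att at a to zero at b
        return v_att * (b - r) / (b - a)
    return 0.0
```

The two characteristic distances (σ and a) are what give the model its two competing length scales; changing `v_att` rescales only the well depth, which is how the polar-monomer interactions are varied in this work.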

We model a protein molecule as a linear flexible chain of polar and hydrophobic monomers. The polar monomers interact with each other through the Jagla potential. The parameters of the Jagla potential between a pair of polar monomers are taken to be the same as those of the solvent particles except for the potential well depth, which can vary with a scaling parameter x; Vpolar = x Vatt. The interaction potential between a solvent molecule and a polar monomer is taken to be a Jagla potential with a well depth of x^0.5 Vatt, as determined using the Lorentz-Berthelot mixing rule. The hydrophobic monomers interact with all the particles, including each other, through a hard sphere potential defined by a hydrophobic particle hard core diameter equal to 1.0. The bond length is fixed to 1.0. No torsional or bond-angle potentials are included in the present model. The polymer molecule is taken to be of length N = 1000 monomers.
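The pairwise interaction rules in this paragraph can be summarized in a small lookup, sketched here under the assumption that Vatt = −1 in reduced units. The type labels and the function name are hypothetical.

```python
import math

PARTICLE_KINDS = {"solvent", "polar", "hydrophobic"}


def well_depth(kind_i, kind_j, x, v_att=-1.0):
    """Attractive well depth for a pair, or None for a pure hard-sphere pair."""
    if kind_i not in PARTICLE_KINDS or kind_j not in PARTICLE_KINDS:
        raise ValueError("unknown particle kind")
    kinds = {kind_i, kind_j}
    if "hydrophobic" in kinds:
        return None                  # hard-sphere interaction with everything
    if kinds == {"solvent"}:
        return v_att                 # solvent-solvent Jagla well
    if kinds == {"polar"}:
        return x * v_att             # polar-polar well scaled by x
    return math.sqrt(x) * v_att      # polar-solvent: geometric-mean mixing
```

For example, at x = 0.81 the polar-polar well is 0.81|Vatt| deep and the polar-solvent well is 0.9|Vatt| deep, so polar monomers are slightly more attracted to solvent than to each other.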

We pick this relatively long chain so that a clear distinction can be made between the surface and the core of the chain in the collapsed state. Chains of length 176, for example, are too short to make this distinction54. We choose to limit this study of sequence to a fixed fraction of polar monomers in the “protein” chain of 0.25.

Finally, in the present study, a design constraint was imposed in which polar monomers were not allowed to be adjacent to each other in the sequence. This constraint was imposed to ensure that a complete segregation of polar and hydrophobic monomers does not occur. This is a strong constraint as polar monomers do occur sequentially adjacent to each other in proteins. A less stringent constraint could be to have a finite but decreasing probability for having Np polar monomers adjacent to each other in sequence. Such alternatives are appropriate for further investigations.

Designing a protein-like sequence

Simulations were performed in a box of size 36 × 36 × 36 (in reduced units; see Model section). Periodic boundary conditions were used in the x, y and z directions. The temperature was taken to be T = 0.7 and the total number density, ρ, of the system was 0.257. The pressure P for this system was found to be 0.01 ± 0.01. Insensitivity to finite size effects was demonstrated by performing simulations in a box of size 50 × 50 × 50 at the same ρ of 0.257 (number of solvent particles = 31,125). The results from the larger box were found to match those from the smaller box. A freely jointed chain of N = 1000 monomers has a root mean squared radius of gyration (RMSRG), <Rg2>1/2 = (N/6)^(1/2) = 12.9. At T = 0.7, a 1000-mer hard-sphere homopolymer collapses to a state with an internal (or “core”) density ρ = 0.257 and an <Rg2>1/2 of 7.4 ± 0.1, which is consistent with a simple spherical geometric estimate of <Rg2>1/2 = (3/5)^(1/2) (3N/4πρ)^(1/3) σ = 7.55. We can make the result more relevant to our context by realizing that the effective size of a Jagla monomer (≈a/2 = 0.86) is larger than that of the hydrophobic monomers (0.5). After accounting for these size differences and for the probability of the occurrence of a Jagla monomer in a peptide chain, p = 0.25 (see Model section), we can then write a better estimate for the size of the “protein” chain if fully collapsed:

We thus expect that the <Rg2>1/2 of the heteropolymer will fall somewhere between 7.55 and 9.53. Thus, the chain size corresponding to the collapsed state is small enough, relative to the dimensions of the simulation cell, to avoid a collapsed chain interacting with its own image. The chain dimensions obtained in the collapsed state are thus expected to be reliable, as is the predicted unfolding temperature, while the dimensions of less ordered states may be quantitatively less reliable.
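The two reference sizes quoted above (12.9 for a freely jointed chain and 7.55 for a uniform sphere at the core density) and the solvent count for the larger box follow from simple arithmetic, sketched below. The helper names are hypothetical, and the solvent count assumes that the total number density ρ includes the 1000 chain monomers.

```python
import math


def freely_jointed_rg(n, bond=1.0):
    """<Rg^2>^(1/2) of an ideal freely jointed chain of n monomers."""
    return bond * math.sqrt(n / 6.0)


def compact_sphere_rg(n, rho):
    """Rg of a uniform sphere holding n monomers at number density rho."""
    radius = (3.0 * n / (4.0 * math.pi * rho)) ** (1.0 / 3.0)
    return math.sqrt(3.0 / 5.0) * radius


def n_solvent(box_length, rho, n_monomers):
    """Solvent particles needed so that the total density equals rho."""
    return round(rho * box_length ** 3) - n_monomers


print(round(freely_jointed_rg(1000), 1))         # 12.9
print(round(compact_sphere_rg(1000, 0.257), 2))  # 7.56
print(n_solvent(50, 0.257, 1000))                # 31125
```

The compact-sphere value is 7.56 to two decimals, in agreement with the 7.55 quoted in the text to within rounding, and the solvent count reproduces the 31,125 particles stated for the 50 × 50 × 50 box.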

A protein-like sequence was designed using a two-step procedure. In the first step, random heteropolymers containing 25% of polar monomers were generated by randomly selecting monomers from the sequence and labeling them as polar. The remaining monomers were labeled as hydrophobic (or hard core monomers). The random heteropolymers with x varying from 0 to 1 (recall that x is defined as Vpolar/Vatt) were investigated using canonical ensemble Monte Carlo (MC) simulation at T = 0.7 and ρ = 0.257. It was found that for x < 0.82, random heteropolymers adopt collapsed equilibrium configurations, i.e., configurations characterized by <Rg2>1/2 ≤ 9.53. In the second step, we chose x = 0.8 and, starting from a random sequence and a random conformation, performed a canonical ensemble MC simulation (T = 0.7 and ρ = 0.257) combined with a simulated annealing MC simulation in sequence space to generate the optimal sequences of a collapsed “protein”-like sequence. In the sequence space MC, a polar and a hydrophobic monomer were chosen at random and an attempt was made to swap their positions in the sequence, consistent with the design constraint that polar monomers cannot be adjacent to each other in the sequence. The attempt to swap was accepted with the probability min[1, exp(−ΔE/θ)], where θ is the sequence-space “temperature” and ΔE is the change in energy accompanying the swap. The sequence space MC was started with a large value of θ ( = 0.8). After 4000 attempted swap moves, θ was decreased by 0.1. The above steps were continued until θ = 0. The sequence space MC moves were attempted with a probability of 0.02 after every real space canonical ensemble MC cycle. Overall, 6 × 10^6 real space MC cycles were performed. Once this simulation was complete, we found that the heteropolymer always attained a collapsed state with hydrophobic groups predominantly in the core and the polar groups close to the surface. 
The surface of the heteropolymer in its collapsed state was identified by determining the Gibbs surface as follows: The spherically averaged spatial density profile of monomers around the center of mass of the heteropolymer in its collapsed state was determined. We assumed a “vapor” density of zero and a “liquid” density equal to the monomer density in the inner core of the protein. The “mid” plane where the net surface depletion of monomers on the liquid side exactly compensates the surface excess on the vapor side is the location of the Gibbs plane, rGibbs. rGibbs was found by solving the below equation,

In the above equation, the center of mass of the protein is at r = 0 and ρ (r) is the number density of the monomers as a function of r. The sequence of the heteropolymer achieved after this simulation was called the “designed heteropolymer”.
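The Gibbs-surface construction described in words above can be sketched numerically. The sketch assumes one particular reading of that description: a spherical geometry with r² weighting and zero vapor density, so that the depletion inside rGibbs balances the excess outside it, which reduces to rGibbs = [3 ∫ ρ(r) r² dr / ρ_liquid]^(1/3). The binning, the way the core density is estimated and the function name are all hypothetical choices.

```python
def gibbs_radius(r_mid, rho, dr, n_core_bins=3):
    """Equimolar (Gibbs) dividing radius from a binned radial density profile.

    r_mid: shell midpoints, rho: number density in each shell, dr: shell width.
    The "liquid" density is taken as the mean over the innermost n_core_bins.
    """
    rho_liquid = sum(rho[:n_core_bins]) / n_core_bins
    # Midpoint-rule estimate of the integral of rho(r) * r^2 over all shells
    integral = sum(d * r * r * dr for r, d in zip(r_mid, rho))
    return (3.0 * integral / rho_liquid) ** (1.0 / 3.0)
```

For an ideal step profile that is uniform out to radius R and zero beyond, this returns R, as expected for an equimolar dividing surface. In practice the innermost shells hold few monomers, so the core density estimate should be averaged over enough bins and conformations to be stable.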

Several realizations of the above simulation were performed. The final sequence of the designed heteropolymer was not unique, but the ensemble of sequences thus obtained showed similar properties with respect to the distribution of monomers in the sequence and the collapse tendency. In the above sequence design procedure, x was chosen to be 0.8, as this ensures that the heteropolymer attains a collapsed state during the design procedure (Fig. 1). If too large a value of x is used during the sequence design procedure, the heteropolymer attains an open conformation and hence good sequence design is not achieved. However, as described below, this design procedure is capable of generating sequences that remain stable in the collapsed state even for the range of x where a random heteropolymer is not (x ≥ 0.82). The behavior of the designed heteropolymer was studied as a function of x, T and P using real space canonical ensemble MC simulations.
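The sequence-space part of the design procedure can be sketched in isolation as follows. This is a simplified illustration, not the code used in this work: it assumes a Metropolis acceptance rule, treats the energy as a user-supplied function of the sequence alone (in the actual procedure the energy depends on the current conformation and solvent, and swap moves are interleaved with real-space MC cycles), and performs a final stage at θ = 0 in which only swaps that do not raise the energy are accepted. All names are hypothetical.

```python
import math
import random


def violates_constraint(seq):
    """True if any two polar monomers are adjacent in the sequence."""
    return any(a == "P" and b == "P" for a, b in zip(seq, seq[1:]))


def attempt_swap(seq, energy, theta, rng):
    """Try to exchange one random polar and one random hydrophobic monomer."""
    polar = [i for i, m in enumerate(seq) if m == "P"]
    hydro = [i for i, m in enumerate(seq) if m == "H"]
    i, j = rng.choice(polar), rng.choice(hydro)
    trial = list(seq)
    trial[i], trial[j] = trial[j], trial[i]
    if violates_constraint(trial):
        return seq  # design constraint: reject without evaluating the energy
    delta_e = energy(trial) - energy(seq)
    # Metropolis rule (assumed); at theta = 0 only non-increasing moves pass
    if delta_e <= 0 or (theta > 0 and rng.random() < math.exp(-delta_e / theta)):
        return trial
    return seq


def anneal(seq, energy, theta0=0.8, d_theta=0.1, swaps_per_theta=4000, seed=0):
    """Lower theta from theta0 to 0 in steps of d_theta."""
    rng = random.Random(seed)
    seq = list(seq)
    steps = int(round(theta0 / d_theta))
    for k in range(steps + 1):
        theta = theta0 * (1.0 - k / steps)  # exactly 0.0 at the last stage
        for _ in range(swaps_per_theta):
            seq = attempt_swap(seq, energy, theta, rng)
    return "".join(seq)
```

Because every move is a swap of one polar and one hydrophobic monomer, the polar fraction is conserved throughout, and rejecting constraint-violating trials before the energy evaluation keeps forbidden sequences from ever being visited.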