Introduction

Protein solubility and propensity to aggregate have been central to biotechnology and biosciences through the era of recombinant protein expression. It is also becoming increasingly important in the area of formulation and preparation of biologics (therapeutic proteins) and in consideration of disorders arising from misfolding1. A common view of protein aggregation at relatively high concentration holds that partial unfolding (a structural feature) leads to association of non-polar stretches of amino acids (a sequence feature). Whilst structural and sequence properties combine in this view of protein aggregation, computational algorithms that attempt to predict solubility largely divide into those based on sequence and those based on structure, although some features (e.g. net charge) span this division.

A key question relates to how solubility is defined in benchmark sets. Early work2 distinguishes between proteins that form inclusion bodies (IBs) and those that do not, with a study of sequence features. The two properties correlating best with IB formation were found to be average charge (more net charge, less IB) and turn-forming residue fraction (more gives IBs, perhaps due to slow folding e.g. with prolines). Other work3 also uses the IB/non-IB distinction, together with sequence and structure-based correlations with solubility, including thermostability and relative lack of β-sheet. Some reports define soluble proteins as those for which a structure has been solved and deposited in the protein data bank, PDB4. This definition is used alongside resources that record progress in protein expression for structural genomics5,6, such as the TargetDB database7. Machine-learning techniques are then employed to optimise distinction between soluble and insoluble proteins, although it can be difficult to extract physico-chemical interpretation from such methods. Other work combines machine-learning with soluble/insoluble datasets obtained through keyword searching in the literature8. The relationship between mRNA levels and protein solubility in E. coli has been examined9. Proteins with sequence more prone to aggregation are generally expressed at lower levels, where amino acid polarity is used to indicate aggregation potential, i.e. proteins with a more non-polar sequence have lower mRNA levels. The REFOLD database10 annotates proteins as soluble or insoluble, but in practice all of these proteins have been expressed through IB formation.

There are a number of aggregation prediction schemes based on the experimental observation that many proteins can be induced to adopt an amyloid, β-rich conformation11. These include TANGO12, PASTA13 and Zyggregator14. Such schemes can include many factors, but generally, the β-forming propensity for linear segments of amino acid sequence is an important element. A 3D surface polarity approach has been adopted in the redesigning of protein surface to improve solubility15, with the introduction of groups to break up non-polar patches. This is reminiscent of the discovery that charges on the surfaces of hyperthermophile proteins are more closely packed, on average, than those in mesophile proteins16. It was assumed that the higher temperature environment of hyperthermophiles increases the strength of hydrophobic interactions, leading to the requirement for a more stringent breaking up of non-polar patches with charges. A charge influence has also appeared in the context of translation rate, a property that will impact on protein production and therefore potentially solubility. A dependence of ribosomal velocity on positively charged residues in newly synthesised proteins has been found, due to interaction with the negatively-charged ribosomal exit tunnel17. More generally, translation rate has a well-studied correlation with codon bias18.

Methods for predicting protein solubility have been reviewed19. The availability of experimental data, where proteins have been expressed in consistent conditions, continues to present a significant problem with assessing prediction schemes. A significant study addressing this point used a high throughput cell-free system for classification of E. coli protein solubility20. The authors of this work concluded that factors correlating to some degree with solubility include charge and structural class, whilst algorithms based largely on propensity to form β-structure/amyloid performed less well, although a machine-learning study subsequently identified a correlation between sequence-based calculation of physico-chemical properties and measured solubility for this dataset21.

In the current work, computational methods for characterising charge and potential distributions in proteins22 have been used alongside patch-based calculations of surface properties23 to analyse the properties of soluble and insoluble subsets of proteins. The experimental data used in this study derive from cell-free expression20 using the PURE system of E. coli factors, lacking chaperones24. Encouraged by a study in which computation over many proteins revealed a correlation between electrostatic properties and subcellular location25, a similar approach was used in respect of solubility. Whilst some correlation is found between insolubility and larger non-polar patches, by far the most significant relationship associates insolubility with large positively-charged patches. The pattern underlying this unexpected result is similar to that which separates nucleic acid (NA)-binding from non-NA-binding proteins.

Results

Surface potential patches and solubility

At neutral pH most of the insoluble and soluble dataset proteins are predicted to be moderately negatively-charged and there is no significant separation of the distributions (Fig. 1a, p = 0.872 for a Mann-Whitney test of subsets being sampled from the same underlying distribution). The maximal positive and negative potential patches for each protein show quite different behaviour, with no significant separation for negative potential (p = 0.227), but clear separation for positive potential (p = 7.1 × 10⁻¹³, Fig. 1b). A patch analysis of charge clustering (with 13 Å patch radius) was performed in order to establish whether the positive potential patches, based on contours, were mirrored in charge geometry. This is the case, with the largest net positive charge on a patch also distinguishing soluble and insoluble protein datasets (p = 2.2 × 10⁻⁴, Fig. 1c).

Figure 1

Cumulative fractions of soluble (SOL) and insoluble (INS) protein datasets, upon calculation of particular features.

(a) Net charge, predicted at pH 7.0. (b) Grid points within the largest positive (pos) and largest negative (neg) contours of electrostatic potential. (c) Maximum net positive charge in a geometric patch (13 Å radius). (d) The maximum ratio (for each protein) of non-polar to polar patch SASA. (e) Largest positive patch contours are re-plotted, now as a ratio to a 3000 grid point threshold, alongside calculations with DNA-binding and non-DNA-binding datasets. (f) Separation according to the geometrical patch with the largest Arg content.

Surface polarity and solubility

We next examined a potential role for the association of proteins via non-polar surfaces, through calculation of non-polar to polar solvent accessible surface area (SASA) ratios, for patches of radius 13 Å centred on each atom. The maximum of this ratio was identified for each protein. Fig. 1d shows the separation of soluble and insoluble subsets (p = 2.3 × 10⁻³). Whilst there is some correlation between increased non-polarity and insolubility, it is far smaller than that exhibited by positive potential. ROC plot analysis demonstrates this distinction (Fig. 2). An area under the curve (AUC) of 0.85 for the positive potential features (Fig. 2a) compares with an AUC of 0.62 for the non-polar to polar surface area ratio (Fig. 2b). A threshold of 3000 grid points (in the contours of positive potential) gives the best separation between soluble and insoluble datasets. Positive patch size can be reported as a ratio to this value.

Figure 2

ROC plots for insoluble and soluble subset separation.

(a) ROC plot (AUC = 0.85) showing separation by positive potential. TPR is true positive rate and FPR false positive rate. (b) ROC plot (AUC = 0.62) quantifying the separation by non-polar to polar surface ratio (13 Å radius patch).

Comparison with discrimination for DNA-binding proteins

Positively-charged surfaces are implicated in nucleic acid binding26 and contribute to prediction schemes for NA binding27. The same potential patch analysis applied to the solubility data was used to examine DNA-binding and non-DNA-binding protein datasets. Separation of the DNA/non-DNA-binding subsets is strikingly similar to that for the insoluble/soluble subsets (Fig. 1e). There is some enrichment for known DNA-binding proteins in the insoluble subset, 13 of 56, as compared with 16 of 111 in the soluble subset. Clearly, though, not all DNA-binding proteins are present in the insoluble subset.

In NA binding, non-specific charge interactions typically function alongside more specific interactions arising from hydrogen-bonding and shape complementarity. Such additional interactions may play a role in distinguishing an interesting pair of DNA-binding homologues, IHF in the insoluble subset and HUα in the soluble subset. Maximal positive patches (measured as a ratio to the threshold) are consistent with the subset membership, whether calculated for the monomer (IHF 1.38, HUα 0.26) or the dimeric biological units (IHF 3.31, HUα 0.45). Although these two proteins are closely related structurally, they have very different positive potential distributions and solubility in the cell-free system. Functionally, HUα and IHF have divergent DNA substrate preferences28, which may be related to their positive charge distributions.

It appears that larger positive patches exert an influence towards protein aggregation in the cell-free expression system20. If this is also the case for intracellular expression, then it might be anticipated that it would be countered by a cell maintaining lower levels of proteins with larger positive potential patches. Abundance of mRNA is not entirely representative of protein level29. Indeed, at a fixed time point for a single cell, correlation between protein and mRNA level can be absent, due to the much shorter lifetime of mRNAs as compared with proteins30. Currently though, mRNA levels measured from populations of cells, which are correlated with protein level30, provide the most extensive data. An anti-correlation (R = 0.283, p = 5.75 × 10⁻³, not shown) was found between largest positive patch size and a log measure of mRNA levels in E. coli31. Again the positive potential is differentiated from negative potential, for which the largest patch gives no significant relationship with mRNA level (R = 0.124, p = 0.138, not shown).

Positive and negative charge, arginine and lysine

Sequence-based calculation of the fraction of charged groups that are either positively-charged or negatively-charged at neutral pH separates soluble and insoluble subsets (p = 1.417 × 10⁻³, not shown). Consistent with the patch calculations, a higher fraction of positive charge tends towards insolubility. With a greater separation of soluble and insoluble subsets for the (3D) patch-based property, relative to the sequence-based charge fraction, the structural property appears to be a crucial component in a physico-chemical understanding of the cell-free expression data.

Thus far, positively-charged amino acid sidechains, at neutral pH, have been combined. Taking the maximum positive charge patches of Fig. 1c, calculation of Arg enrichment in these patches (compared with the Arg to Lys content overall for each protein) gives a separation (p = 0.023, not shown), with more Arg in the maximum positive patches of the insoluble subset. Next, looking at the maximum number of Arg in any geometric patch of radius 13 Å, there is also separation (p = 3.884 × 10⁻⁴, Fig. 1f). Only a small number of the insoluble proteins have a geometric patch containing fewer than 4 Arg. The Lys to Arg ratio calculated directly from protein sequence also separates (p = 0.037, not shown), with very few of the insoluble dataset having this ratio greater than 1.
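The two sequence-based measures discussed in this section can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes that only Lys and Arg count as positive and only Asp and Glu as negative at neutral pH; chain termini and His are ignored, which may differ from the calculation actually used.

```python
def positive_charge_fraction(sequence: str) -> float:
    """Fraction of charged sidechains (K, R, D, E) that are positive (K, R).

    Assumption: termini and His are not counted.
    """
    seq = sequence.upper()
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    total = positive + negative
    if total == 0:
        raise ValueError("sequence contains no K, R, D or E residues")
    return positive / total


def lys_to_arg_ratio(sequence: str) -> float:
    """Number of Lys divided by number of Arg; infinity if there is no Arg."""
    seq = sequence.upper()
    n_arg = seq.count("R")
    if n_arg == 0:
        return float("inf")
    return seq.count("K") / n_arg
```

For a hypothetical sequence such as "MKKRDE", the first function counts three positive and two negative sidechains, and the second reports two Lys per Arg.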

Discussion

Our results indicate that the factors contributing on average to separation of the structurally annotated soluble and insoluble subsets in cell-free expression20 are non-polar surface (moderate contribution) and positively-charged patches (major contribution, particularly where Arg is more prevalent than Lys). Correlation between largest positive patch and insolubility implies that this property, or another feature to which it is strongly related, acts in some direct or indirect way to promote protein-protein interactions. It could be argued that a concentration of positive charge may tend towards lower folded state stability through unfavourable charge interactions and thus influence solubility via (partial) unfolding. However, a similar influence would be expected for negatively-charged patches, which is absent. Through what other mechanisms could positive charge clustering contribute to insolubility? Given that the characteristic for insolubility observed in the current work closely matches that for NA-binding proteins, a hypothetical mechanism based on an intermediate step of binding to NA is presented. This model applies only to media rich in NA, such as during expression. A second area of discussion is around the growing literature on negative charge and the solubility of purified proteins.

Fig. 3a shows an equilibrium model for protein binding to nucleic acid, based on a relatively weak interaction of 15 kJ/mole. This corresponds to 3 close salt-bridges or greater than 3 flexible +/− charge interactions, consistent with a net charge threshold of about 4.5 (Fig. 1c). Briefly, having estimated maximal concentrations of charged protein (positive) and NA (negative) sites each at 4 mM, concentration ramps up to these values are used to account for subsets of protein and NA sites possessing the appropriate unshielded charge (see Methods for more detail). This simple calculation (Fig. 3a) shows that a weak interaction, coupled with relatively high concentrations, leads to a substantial fraction of (transient) complexes. The diagonal drawn on Fig. 3a relates to equal concentrations of interacting components and thus also applies to the case of direct protein-protein interactions.

Figure 3

Weak interactions and association in a crowded environment.

(a) Two species interact with an energy of 15 kJ/mole. Concentrations are varied (0 to 4 mM) for protein interacting sites (horizontally) and NA interacting sites (vertically). The heat map shows the proportion of interacting protein sites that are complexed (scale bar under the map). See text for more detail. (b) A hypothetical scheme is drawn in which protein-NA interactions are mediated by charge interactions (upper left), followed by partial unfolding concomitant with NA base – protein interactions (upper right), then protein-protein association through non-polar interactions (lower right) and finally dissociation of protein from NA (lower left).

Fig. 3a does not address the issue of how charge-based complexation between protein and nucleic acid might lead to protein insolubility. Available data indicate that nucleic acid constitutes at most a small fraction of inclusion body material for proteins expressed in E. coli32, although nucleic acid can affect the rate of aggregation33. In Fig. 3b, a scheme is outlined that indicates how transient protein-nucleic acid interactions could lead to a lowering of the activation energy for folding/unfolding transitions, thereby accelerating protein-protein complexation and insolubility if this complexed state ultimately leads to kinetically trapped aggregates. Increased polyanion hydrophobicity leads to a reduction in protein stability in protein-polyanion complexes34. Nucleic acids have a substantial non-polar component and Fig. 3b schematises non-polar interactions between bases and partially unfolded regions of protein. This could lead to protein-protein interactions if partially unfolded proteins transiently bound to the polyanion are adjacent to each other. Such a mechanism could contribute to seeding protein aggregation, effectively concentrating a population of proteins undergoing folding transitions, through transient condensation onto polyanions.

A study of net charge, within sequence windows of 21 amino acids in the yeast proteome, found that larger net positive charge was substantially under-represented in comparison with the equivalent net negative charge35 and when present was often associated with NA binding. This is consistent with net positive charge on proteins being moderated unless functionally associated with nucleic acid binding, perhaps to avoid pathways such as that hypothesised in Fig. 3b. Generally, NA-binding proteins such as transcription factors can be difficult to express36,37.

It is worth stating the fundamental points that relate protein surface charges to protein solubility. The hydration of charged groups is correlated with protein solubility in aqueous solutions38. Beyond this, solubility often decreases near to the isoelectric point as net charge and electrostatic repulsion decrease, allowing non-specific attractive interactions to form. Additionally, near to the pI, proteins with anisotropic charge distributions sample attractive interactions between patches of opposite charge, which are screened with increasing ionic strength39. Within this general framework, there are several reports that bear on the relative role of positive and negative surface charges in protein solubility. A strong preference for Asp/Glu over Lys/Arg was observed in a phage display screen for substitutions that enhance resistance to aggregation in human antibody variable domains40. Addition of an acidic tag to a positively-charged intrabody enhances expression41. It has been reported that many chaperones possess regions of negative charge and that acidic regions modulate the anti-aggregation activity of Hsp9042. The current work suggests that surface charge chaperoning may be a contributing factor. A growing body of data is becoming available with which to test such hypotheses43.

Solubility measurements for 7 proteins in different precipitants show that negative surface charge correlates with increased solubility, independent of the nature of the precipitant44. These experiments, which reflect protein-protein interactions between folded (purified) proteins, are quite different to the cell-free translation study20 on which the current work is based, but given the importance of charge, we made patch calculations for these 7 proteins. No correlation is seen for maximum positive patch size and solubility (R = 0.393, p = 0.191, not shown), but a relationship may be present between overall Lys to Arg ratio and solubility (R = 0.720, p = 0.034, not shown), with (again) a higher ratio tending towards greater solubility. One interpretation is that the Arg sidechain is particularly prone to interactions. Solubility in the cell-free system could be related more to the avoidance of basic clusters (and perhaps NA binding), whereas the purified protein experiments may be probing, in part, a more general stickiness associated with the Arg sidechain. The authors of the precipitant-based solubility study concluded that strong water binding by acidic amino acids may underpin the results44. The balance between negative and positive amino acid sidechain charges in fine-tuning solubility remains to be established.

It is of interest that Arg has been identified as a ubiquitous interacting amino acid in informatics studies, with an elevated propensity (relative to average surface occurrence) for interfaces in both protein-protein and protein-NA complexes45,46. Cation – π interactions, involving Arg, are common at protein-protein interfaces47 and Arg is also common in protein crystal contacts at low ionic strength48. Arginine content of antigen-combining sites in antibodies is correlated with increased non-specific binding49. The excipient properties of Arg are also of interest. Solutions of Arg can be effective in solubilising proteins50, an effect that becomes more pronounced in mixtures with Glu51. This solubility enhancement is related to an increase in the number of Arg and Glu molecules forming interactions with the protein52. The interacting properties of Arg cover a range of systems. We suggest that such diversity may lead to a correlation of Arg enrichment with insolubility, whether clustering into patches (with Lys) for polyanion binding, or more generally over a protein surface.

Reduction of positive patches should be of use as a design tool for expression systems and substituting Arg with other charges could aid the maintenance of high concentrations of purified protein in solution. The hypothesis of protein basic patch interactions with NA in expression systems could be investigated with uncoupling of transcription and translation, to vary relative mRNA and protein levels53. There is much yet to establish about the association between basic clusters and Arg enrichment and insolubility, given for example the report that green fluorescent protein engineered to bear high net positive or negative charge expresses in E. coli and is much more soluble than wild-type protein54. In this case perhaps the extreme net charges provide sufficient repulsive interactions to overcome other effects.

Methods

Soluble and insoluble datasets and DNA-binding/non-binding protein datasets

Subsets for soluble and insoluble E. coli protein expression in the cell-free system were defined following the authors' description20. Specifically, soluble proteins are those with a solubility of more than 70% and insoluble with a solubility of less than 30%. Percentage solubilities had been obtained, following cell-free expression of radiolabelled protein, as the ratio of soluble protein (supernatant from a centrifugation step) and total protein20. Members of the soluble and insoluble subsets with structures in the PDB4 were obtained through cross-referencing with UniProt55. A further filtering step was applied with a cull for sequence identity at a 90% identity threshold, using the PISCES tool56. This procedure allows the retention of homologues, since they may have different surface charge and polarity distributions. Final subsets of 111 (soluble) and 56 (insoluble) E. coli proteins were available for processing.
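The subset definition above can be written as a short Python sketch. The function name and the "SOL"/"INS" labels (borrowed from the Fig. 1 legend) are illustrative choices, and the treatment of values lying exactly on a threshold is an assumption, since the text specifies only "more than 70%" and "less than 30%".

```python
from typing import Optional


def classify_solubility(percent_soluble: float) -> Optional[str]:
    """Assign a protein to the soluble or insoluble subset.

    Returns "SOL" above 70%, "INS" below 30% and None for the
    intermediate range, which is excluded from both subsets.
    """
    if not 0.0 <= percent_soluble <= 100.0:
        raise ValueError("percentage solubility must lie between 0 and 100")
    if percent_soluble > 70.0:
        return "SOL"
    if percent_soluble < 30.0:
        return "INS"
    return None
```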

Sets of DNA-binding and non-binding proteins were obtained from earlier work57,58. Most of these PDB ids were accessible and ran successfully through the electrostatic potential patch analysis (128 DNA-binding proteins, 108 non-binders). Calculations were also made for a set of 7 proteins for which solubility data were available in precipitant studies44, using the same PDB ids specified in that work.

Charge, potential and polarity calculations

For polarity analysis, a sphere of radius 13 Å was centred on each non-hydrogen atom. Polar and non-polar solvent accessible surface area was then summed for all non-hydrogen atoms within that sphere, using a 1.4 Å radius solvent probe and polar/non-polar character assigned according to atom type and functional group23. The relative polarity of a patch is then calculated as the ratio of non-polar SASA to polar SASA and the maximum value of this ratio (i.e. most non-polar region) recorded for each protein. When the polar and non-polar SASAs are summed for each patch, the average of this distribution over patches is about 1300 Å². In comparison, a typical evolved interface between proteins buries about 1600 Å² in total, although this is quite variable59. Considering that the entirety of each of the two contributing surfaces will not be buried in an association, a patch radius of 13 Å seems reasonable in generating a footprint for non-specific protein-protein interactions.
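A minimal Python sketch of the patch polarity calculation follows. It assumes that per-atom polar and non-polar SASA values have already been computed by a separate program; the Atom record and function name are hypothetical, the all-pairs loop is chosen for clarity rather than speed, and patches with zero polar SASA are skipped, which is an assumption not taken from the text.

```python
from typing import NamedTuple, Sequence


class Atom(NamedTuple):
    """A non-hydrogen atom with precomputed SASA contributions (Å^2)."""
    x: float
    y: float
    z: float
    polar_sasa: float
    nonpolar_sasa: float


def max_nonpolar_to_polar_ratio(atoms: Sequence[Atom], radius: float = 13.0) -> float:
    """Largest non-polar/polar SASA ratio over spheres centred on each atom."""
    radius_sq = radius * radius
    best = 0.0
    for centre in atoms:
        polar = 0.0
        nonpolar = 0.0
        for atom in atoms:
            dist_sq = ((atom.x - centre.x) ** 2
                       + (atom.y - centre.y) ** 2
                       + (atom.z - centre.z) ** 2)
            if dist_sq <= radius_sq:
                polar += atom.polar_sasa
                nonpolar += atom.nonpolar_sasa
        # Assumption: a patch with no polar surface is skipped, not scored.
        if polar > 0.0:
            best = max(best, nonpolar / polar)
    return best
```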

Electrostatic potential was calculated around each protein using a Finite Difference Poisson-Boltzmann methodology22, with negatively-charged Asp, Glu sidechains and C-termini and positively-charged Lys, Arg and N-termini. Ionic strength was 0.15 M in the calculations. The resulting potential map was contoured at thresholds of +/− kT/e. Importantly, the contours were drawn on a single shell of the calculation grid (on the solvent side of the protein), so that the number of grid points in each contoured patch effectively represents the size of that patch. Grid step for electrostatic potential calculation was a constant 0.6 Å, independent of protein. A parallel approach was introduced to confirm that positive charge location underpins the contours of positive potential. The patch analysis described for surface polarity was also used to record the maximum net charge within a geometrical patch. For this purpose, sidechain charges were approximated at Cβ atoms to minimise the effects of sidechain conformational variation.
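The geometric patch charge calculation can be sketched in Python as below. This is an illustration only: the function name and input layout are hypothetical, patch centres are passed in explicitly (for example the non-hydrogen atom coordinates), and each charged group is represented by a unit charge at a single point, in the spirit of the Cβ approximation described above.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
ChargeSite = Tuple[float, float, float, float]  # x, y, z, charge (e.g. +1 or -1)


def max_net_charge_in_patch(centres: Sequence[Point],
                            charges: Sequence[ChargeSite],
                            radius: float = 13.0) -> float:
    """Maximum net charge found in any sphere of the given radius."""
    if not centres:
        raise ValueError("at least one patch centre is required")
    radius_sq = radius * radius
    best = float("-inf")
    for cx, cy, cz in centres:
        # Sum the charges whose sites fall inside this sphere.
        net = sum(q for x, y, z, q in charges
                  if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= radius_sq)
        best = max(best, net)
    return best
```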

Receiver Operator Characteristic (ROC) plots were generated for the ability of calculated features to discriminate between soluble and insoluble proteins. As the numerical value of a feature is varied and applied as a threshold to the datasets, corresponding true positive rates (TPRs) and false positive rates (FPRs) are calculated and given in a ROC plot. Area Under the Curve (AUC) is used to estimate effectiveness for separating datasets, with 1.0 equating to complete separation and 0.5 to random. The Mann-Whitney U test was applied to the calculated feature subsets. The probability of occurrence of these particular feature values, if there is no difference in the underlying distributions, is given. A significant difference is inferred if this probability is < 0.05.
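The ROC and Mann-Whitney procedures described above can be sketched in plain Python. This is not the code used for the analysis, and all function names are hypothetical. The "high" group is the one expected to have larger feature values (for example, the insoluble subset for positive patch size). The p-value uses the large-sample normal approximation without tie or continuity corrections, which is an assumption; an exact or corrected test may give slightly different values.

```python
import math
from typing import List, Sequence, Tuple


def roc_points(high: Sequence[float], low: Sequence[float]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs obtained by sweeping a threshold over all feature values."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(high) | set(low), reverse=True):
        tpr = sum(v >= threshold for v in high) / len(high)
        fpr = sum(v >= threshold for v in low) / len(low)
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def u_statistic(high: Sequence[float], low: Sequence[float]) -> float:
    """Mann-Whitney U: pairs with high > low, counting ties as one half."""
    u = 0.0
    for h in high:
        for l in low:
            if h > l:
                u += 1.0
            elif h == l:
                u += 0.5
    return u


def mann_whitney_p(high: Sequence[float], low: Sequence[float]) -> float:
    """Two-sided p-value from the normal approximation (no tie correction)."""
    n1, n2 = len(high), len(low)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u_statistic(high, low) - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))
```

The trapezoidal AUC equals U divided by the product of the two sample sizes, so the ROC and Mann-Whitney analyses are two views of the same rank-based separation.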

To investigate whether a relationship exists between calculated features and expression at the mRNA level, protein IDs for soluble and insoluble datasets were mapped to mRNA abundances for E. coli proteins31.

A model for non-specific protein-nucleic acid charge interactions

The model of Fig. 3a is based on a 15 kJ/mole interaction, or about 3 salt-bridges, since typical pKa shifts for a surface salt-bridge are 1 pH unit or 5–6 kJ/mole60. Total concentration of protein was calculated for an estimate of 2.35 × 10⁶ protein molecules61 in an E. coli cell of side 1 μm, giving 4 mM. Summing the estimated contributions of tRNAs and mRNAs61 and comparing with protein molecular weight, gives a ratio of about 1:10, nucleic acid to protein. Each protein molecule, of average molecular weight 40 kD61, might bind to a polyanion through a single positive patch, whereas each polyanion nucleic acid molecule has multiple binding sites. With a single base molecular weight of about 0.32 kD, assuming that binding sites could recur approximately every 12 bases, then two factors of 10 approximately cancel and the maximum concentrations of positively and negatively-charged binding sites are roughly equal. Although 4 mM is set as this maximum, only a subset of proteins will exhibit positive patches above a certain threshold, whilst there are many factors that will contribute to structuring of nucleic acids and the masking of negative charge. Thus linear concentration ramps, up to 4 mM, are applied in Fig. 3a for interacting subsets of protein and nucleic acid. The heat map is then generated as the proportion of the interacting subset of proteins that is bound to nucleic acid, as the concentrations are altered, given the 15 kJ/mole interaction.
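The arithmetic of this model can be reproduced with a short Python sketch. Several elements are assumptions rather than details given in the text: a simple 1:1 binding equilibrium between protein and nucleic acid sites, a temperature of 298.15 K, a 1 M standard state for converting the interaction energy into a dissociation constant, and all function names.

```python
import math
from typing import List

GAS_CONSTANT = 8.314462618   # J / (mol K)
AVOGADRO = 6.02214076e23     # 1 / mol


def molar_concentration(n_molecules: float, volume_litres: float) -> float:
    """Concentration (mol/L) of a number of molecules in a given volume."""
    return n_molecules / AVOGADRO / volume_litres


def dissociation_constant(energy_kj_per_mol: float, temperature_k: float = 298.15) -> float:
    """K_d (mol/L) for a favourable interaction of the given magnitude."""
    return math.exp(-energy_kj_per_mol * 1000.0 / (GAS_CONSTANT * temperature_k))


def fraction_protein_bound(protein_total: float, nucleic_total: float, kd: float) -> float:
    """Fraction of protein sites complexed at 1:1 equilibrium (mol/L inputs)."""
    if protein_total < 0.0 or nucleic_total < 0.0:
        raise ValueError("concentrations must be non-negative")
    if protein_total == 0.0:
        # Limiting value as the protein concentration tends to zero.
        return nucleic_total / (nucleic_total + kd)
    s = protein_total + nucleic_total + kd
    complex_conc = (s - math.sqrt(s * s - 4.0 * protein_total * nucleic_total)) / 2.0
    return complex_conc / protein_total


def heat_map(max_conc: float = 4.0e-3, steps: int = 5,
             energy_kj_per_mol: float = 15.0) -> List[List[float]]:
    """Bound fraction on linear ramps: rows are NA sites, columns protein sites."""
    kd = dissociation_constant(energy_kj_per_mol)
    ramp = [max_conc * i / (steps - 1) for i in range(steps)]
    return [[fraction_protein_bound(p, n, kd) for p in ramp] for n in ramp]
```

With these assumptions, 2.35 × 10⁶ molecules in a 1 μm cube (10⁻¹⁵ L) gives a concentration close to 4 mM, and at 4 mM of both protein and nucleic acid sites a little under half of the protein sites are complexed, in line with the substantial fraction of transient complexes described for Fig. 3a.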