Soluble expression of proteins correlates with a lack of positively-charged surface

Prediction of protein solubility is gaining importance with the growing use of protein molecules as therapeutics, and ongoing requirements for high level expression. We have investigated protein surface features that correlate with insolubility. Non-polar surface patches associate to some degree with insolubility, but this is far exceeded by the association with positively-charged patches. Negatively-charged patches do not separate insoluble/soluble subsets. The separation of soluble and insoluble subsets by positive charge clustering (area under the curve for a ROC plot is 0.85) has a striking parallel with the separation that delineates nucleic acid-binding proteins, although most of the insoluble dataset are not known to bind nucleic acid. Additionally, these basic patches are enriched for arginine, relative to lysine. The results are discussed in the context of expression systems and downstream processing, contributing to a view of protein solubility in which the molecular interactions of charged groups are far from equivalent.

the context of translation rate, a property that will impact on protein production and therefore potentially solubility. A dependence of ribosomal velocity on positively charged residues in newly synthesised proteins has been found, due to interaction with the negatively-charged ribosomal exit tunnel 17 . More generally, translation rate has a well-studied correlation with codon bias 18 .
Methods for predicting protein solubility have been reviewed 19 . The availability of experimental data, where proteins have been expressed in consistent conditions, continues to present a significant problem with assessing prediction schemes. A significant study addressing this point used a high throughput cell-free system for classification of E. coli protein solubility 20 . The authors of this work concluded that factors correlating to some degree with solubility include charge and structural class, whilst algorithms based largely on propensity to form β-structure/amyloid performed less well, although a machine-learning study subsequently identified a correlation between sequence-based calculation of physico-chemical properties and measured solubility for this dataset 21 .
In the current work, computational methods for characterising charge and potential distributions in proteins 22 have been used alongside patch-based calculations of surface properties 23 to analyse the properties of soluble and insoluble subsets of proteins. The experimental data used in this study derive from cell-free expression 20 using the PURE system of E. coli factors, lacking chaperones 24 . Encouraged by a study in which computation over many proteins revealed a correlation between electrostatic properties and subcellular location 25 , a similar approach was used in respect of solubility. Whilst some correlation is found between insolubility and larger non-polar patches, by far the most significant relationship associates insolubility with large positively-charged patches. The pattern underlying this unexpected result is similar to that which separates nucleic acid (NA)-binding from non-NA-binding proteins.

Results
Surface potential patches and solubility. At neutral pH most of the insoluble and soluble dataset proteins are predicted to be moderately negatively-charged, and there is no significant separation of the distributions (Fig. 1a, p = 0.872 for a Mann-Whitney test of subsets being sampled from the same underlying distribution). The maximal positive and negative potential patches for each protein show quite different behaviour, with no significant separation for negative potential (p = 0.227), but clear separation for positive potential (p = 7.1 × 10⁻¹³, Fig. 1b). A patch analysis of charge clustering (with 13 Å patch radius) was performed in order to establish whether the positive potential patches, based on contours, were mirrored in charge geometry. This is the case, with the largest net positive charge on a patch also distinguishing soluble and insoluble protein datasets (p = 2.2 × 10⁻⁴, Fig. 1c).
Surface polarity and solubility. We next examined a potential role for the association of proteins via non-polar surfaces, through calculation of non-polar to polar solvent accessible surface area (SASA) ratios, for patches of radius 13 Å centred on each atom. The maximum of this ratio was identified for each protein. Fig. 1d shows the separation of soluble and insoluble subsets (p = 2.3 × 10⁻³). Whilst there is some correlation between increased non-polarity and insolubility, it is far smaller than that exhibited by positive potential. ROC plot analysis demonstrates this distinction (Fig. 2). An area under the curve (AUC) of 0.85 for the positive potential features (Fig. 2a) compares with an AUC of 0.62 for the non-polar to polar surface area ratio (Fig. 2b). A threshold of 3000 grid points (in the contours of positive potential) gives the best separation between soluble and insoluble datasets. Positive patch size can be reported as a ratio to this value.
Comparison with discrimination for DNA-binding proteins. Positively-charged surfaces are implicated in nucleic acid binding 26 , and contribute to prediction schemes for NA binding 27 . The same potential patch analysis that was applied to the solubility data was used to examine DNA-binding and non-DNA-binding protein datasets. Separation of the DNA/non-DNA-binding subsets is strikingly similar to that for the insoluble/soluble subsets (Fig. 1e). There is some enrichment for known DNA-binding proteins in the insoluble subset, 13 of 56, as compared with 16 of 111 in the soluble subset. Clearly, though, not all DNA-binding proteins are present in the insoluble subset.
In NA binding, non-specific charge interactions typically function alongside more specific interactions arising from hydrogen-bonding and shape complementarity. Such additional interactions may play a role in distinguishing an interesting pair of DNA-binding homologues, IHF in the insoluble subset and HUα in the soluble subset. Maximal positive patches (measured as a ratio to the threshold) are consistent with the subset membership, whether calculated for the monomer (IHF 1.38, HUα 0.26) or the dimeric biological units (IHF 3.31, HUα 0.45). Although these two proteins are closely related structurally, they have very different positive potential distributions and solubility in the cell-free system. Functionally, HUα and IHF have divergent DNA substrate preferences 28 , which may be related to their positive charge distributions.
It appears that larger positive patches exert an influence towards protein aggregation in the cell-free expression system 20 . If this is also the case for intracellular expression, then it might be anticipated that it would be countered by a cell maintaining lower levels of proteins with larger positive potential patches. Abundance of mRNA is not entirely representative of protein level 29 . Indeed, at a fixed time point for a single cell, correlation between protein and mRNA level can be absent, due to the much shorter lifetime of mRNAs as compared with proteins 30 . Currently though, mRNA levels measured from populations of cells, which are correlated with protein level 30 , provide the most extensive data. An anti-correlation (R = 0.283, p = 5.75 × 10⁻³, not shown) was found between largest positive patch size and a log measure of mRNA levels in E. coli 31 . Again the positive potential is differentiated from negative potential, for which the largest patch gives no significant relationship with mRNA level (R = 0.124, p = 0.138, not shown).
Positive and negative charge, arginine and lysine. Sequence-based calculation of the fraction of charged groups that are either positively-charged or negatively-charged at neutral pH separates soluble and insoluble subsets (p = 1.417 × 10⁻³, not shown). Consistent with the patch calculations, a higher fraction of positive charge tends towards insolubility. With a greater separation of soluble and insoluble subsets for the (3D) patch-based property, relative to the sequence-based charge fraction, the structural property appears to be a crucial component in a physico-chemical understanding of the cell-free expression data.
Thus far, positively-charged amino acid sidechains, at neutral pH, have been combined. Taking the maximum positive charge patches of Fig. 1c, calculation of Arg enrichment in these patches (compared with the Arg to Lys content overall for each protein) gives a separation (p = 0.023, not shown), with more Arg in the maximum positive patches of the insoluble subset. Next, looking at the maximum number of Arg in any geometric patch of radius 13 Å, there is also separation (p = 3.884 × 10⁻⁴, Fig. 1f). Only a small number of the insoluble proteins have a geometric patch containing fewer than 4 Arg. The Lys to Arg ratio calculated directly from protein sequence also separates (p = 0.037, not shown), with very few of the insoluble dataset having this ratio greater than 1.
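The two sequence-based quantities referred to above can be illustrated with a minimal sketch. The function names are hypothetical, and the sketch assumes that only Lys/Arg (positive) and Asp/Glu (negative) sidechains are counted, ignoring termini and His; the exact counting used for the reported p-values is not specified beyond the description in the text.

```python
def positive_charge_fraction(sequence):
    """Fraction of charged sidechains that are positive (K, R) rather than negative (D, E).

    Assumption: termini and His are ignored in this sketch.
    Returns None if the sequence has no charged residues.
    """
    seq = sequence.upper()
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    charged = positive + negative
    if charged == 0:
        return None
    return positive / charged


def lys_to_arg_ratio(sequence):
    """Lys:Arg ratio from sequence; None when there is no Arg to divide by."""
    seq = sequence.upper()
    n_arg = seq.count("R")
    if n_arg == 0:
        return None
    return seq.count("K") / n_arg
```

On this reading, a Lys to Arg ratio below 1 (more Arg than Lys) is the regime in which most of the insoluble dataset falls.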

Discussion
Our results indicate that factors contributing on average to separation of the structurally annotated soluble and insoluble subsets in cell-free expression 20 , are non-polar surface (moderate contribution), and positively-charged patches (major contribution, particularly where Arg is more prevalent than Lys). Correlation between largest positive patch and insolubility implies that this property, or another feature to which it is strongly related, acts in some direct or indirect way to promote protein-protein interactions. It could be argued that a concentration of positive charge may tend towards lower folded state stability through unfavourable charge interactions, and thus influence solubility via (partial) unfolding. However, a similar influence would be expected for negatively-charged patches, which is absent. Through what other mechanisms could positive charge clustering contribute to insolubility? Given that the characteristic for insolubility observed in the current work closely matches that for NA-binding proteins, a hypothetical mechanism based on an intermediate step of binding to NA is presented. This model applies only to media rich in NA, such as during expression. A second area of discussion is around the growing literature on negative charge and the solubility of purified proteins.
Fig. 3a shows an equilibrium model for protein binding to nucleic acid, based on a relatively weak interaction of 15 kJ/mole. This corresponds to 3 close salt-bridges or more than 3 flexible +/- charge interactions, consistent with a net charge threshold of about 4.5 (Fig. 1c). Briefly, having estimated maximal concentrations of charged protein (positive) and NA (negative) sites each at 4 mM, concentration ramps up to these values are used to account for subsets of protein and NA sites possessing the appropriate unshielded charge (see Methods for more detail). This simple calculation (Fig. 3a) shows that a weak interaction, coupled with relatively high concentrations, leads to a substantial fraction of (transient) complexes. The diagonal drawn on Fig. 3a relates to equal concentrations of interacting components, and thus also applies to the case of direct protein-protein interactions. Fig. 3a does not address the issue of how charge-based complexation between protein and nucleic acid might lead to protein insolubility. Available data indicate that nucleic acid constitutes at most a small fraction of inclusion body material for proteins expressed in E. coli 32 , although nucleic acid can affect the rate of aggregation 33 . In Fig. 3b, a scheme is outlined that indicates how transient protein-nucleic acid interactions could lead to a lowering of the activation energy for folding/unfolding transitions, thereby accelerating protein-protein complexation and insolubility if this complexed state ultimately leads to kinetically trapped aggregates. Increased polyanion hydrophobicity leads to a reduction in protein stability in protein-polyanion complexes 34 . Nucleic acids have a substantial non-polar component and Fig. 3b schematises non-polar interactions between bases and partially unfolded regions of protein. This could lead to protein-protein interactions if partially unfolded proteins transiently bound to the polyanion are adjacent to each other. Such a mechanism could contribute to seeding protein aggregation, effectively concentrating a population of proteins undergoing folding transitions, through transient condensation onto polyanions.
A study of net charge, within sequence windows of 21 amino acids in the yeast proteome, found that larger net positive charge was substantially under-represented in comparison with the equivalent net negative charge 35 , and when present was often associated with NA binding. This is consistent with net positive charge on proteins being moderated unless functionally associated with nucleic acid binding, perhaps to avoid pathways such as that hypothesised in Fig. 3b. Generally, NA-binding proteins such as transcription factors can be difficult to express 36,37 .
It is worth stating the fundamental points that relate protein surface charges to protein solubility. The hydration of charged groups is correlated with protein solubility in aqueous solutions 38 . Beyond this, solubility often decreases near to the isoelectric point as net charge and electrostatic repulsion decrease, allowing non-specific attractive interactions to form. Additionally, near to the pI, proteins with anisotropic charge distributions sample attractive interactions between patches of opposite charge, which are screened with increasing ionic strength 39 . Within this general framework, there are several reports that bear on the relative role of positive and negative surface charges in protein solubility. A strong preference for Asp/Glu over Lys/Arg was observed in a phage display screen for substitutions that enhance resistance to aggregation in human antibody variable domains 40 . Addition of an acidic tag to a positively-charged intrabody enhances expression 41 . It has been reported that many chaperones possess regions of negative charge, and that acidic regions modulate the anti-aggregation activity of Hsp90 42 . The current work suggests that surface charge chaperoning may be a contributing factor. A growing body of data is becoming available with which to test such hypotheses 43 .
Solubility measurements for 7 proteins in different precipitants show that negative surface charge correlates with increased solubility, independent of the nature of the precipitant 44 . These experiments, which reflect protein-protein interactions between folded (purified) proteins, are quite different to the cell-free translation study 20 on which the current work is based, but given the importance of charge, we made patch calculations for these 7 proteins. No correlation is seen for maximum positive patch size and solubility (R = 0.393, p = 0.191, not shown), but a relationship may be present between overall Lys to Arg ratio and solubility (R = 0.720, p = 0.034, not shown), with (again) a higher ratio tending towards greater solubility. One interpretation is that the Arg sidechain is particularly prone to interactions. Solubility in the cell-free system could be related more to the avoidance of basic clusters (and perhaps NA binding), whereas the purified protein experiments may be probing, in part, a more general stickiness associated with the Arg sidechain. The authors of the precipitant-based solubility study concluded that strong water binding by acidic amino acids may underpin the results 44 . The balance between negative and positive amino acid sidechain charges in fine-tuning solubility remains to be established.
It is of interest that Arg has been identified as a ubiquitous interacting amino acid in informatics studies, with an elevated propensity (relative to average surface occurrence) for interfaces in both protein-protein and protein-NA complexes 45,46 . Cation-π interactions, involving Arg, are common at protein-protein interfaces 47 , and Arg is also common in protein crystal contacts at low ionic strength 48 . Arginine content of antigen-combining sites in antibodies is correlated with increased non-specific binding 49 . The excipient properties of Arg are also of interest. Solutions of Arg can be effective in solubilising proteins 50 , an effect that becomes more pronounced in mixtures with Glu 51 . This solubility enhancement is related to an increase in the number of Arg and Glu molecules forming interactions with the protein 52 . The interacting properties of Arg cover a range of systems. We suggest that such diversity may lead to a correlation of Arg enrichment with insolubility, whether clustering into patches (with Lys) for polyanion binding, or more generally over a protein surface.
Reduction of positive patches should be of use as a design tool for expression systems, and substituting Arg with other charges could aid the maintenance of high concentrations of purified protein in solution. The hypothesis of protein basic patch interactions with NA in expression systems could be investigated with uncoupling of transcription and translation, to vary relative mRNA and protein levels 53 . There is much yet to establish about the association between basic clusters, and Arg enrichment, and insolubility, given for example the report that green fluorescent protein engineered to bear high net positive or negative charge expresses in E. coli and is much more soluble than wild-type protein 54 . In this case perhaps the extreme net charges provide sufficient repulsive interactions to overcome other effects.

Methods
Soluble and insoluble datasets and DNA-binding/non-binding protein datasets. Subsets for soluble and insoluble E. coli protein expression in the cell-free system were defined following the authors' description 20 . Specifically, soluble proteins are those with a solubility of more than 70%, and insoluble proteins those with a solubility of less than 30%. Percentage solubilities had been obtained, following cell-free expression of radiolabelled protein, as the ratio of soluble protein (supernatant from a centrifugation step) to total protein 20 . Members of the soluble and insoluble subsets with structures in the PDB 4 were obtained through cross-referencing with UniProt 55 . A further filtering step was applied with a cull for sequence identity at a 90% identity threshold, using the PISCES tool 56 . This procedure allows the retention of homologues, since they may have different surface charge and polarity distributions. Final subsets of 111 (soluble) and 56 (insoluble) E. coli proteins were available for processing.
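The subset definition above can be written as a short sketch. The function and parameter names are hypothetical; only the 70% and 30% cut-offs and the supernatant-to-total ratio come from the description, and proteins of intermediate solubility are assumed to fall into neither subset.

```python
def percent_solubility(supernatant_signal, total_signal):
    """Percentage solubility as soluble (supernatant) protein over total protein."""
    if total_signal <= 0:
        raise ValueError("total protein signal must be positive")
    return 100.0 * supernatant_signal / total_signal


def classify_solubility(percent, soluble_above=70.0, insoluble_below=30.0):
    """Assign a protein to the soluble or insoluble subset.

    Returns None for intermediate values, which belong to neither subset.
    """
    if percent > soluble_above:
        return "soluble"
    if percent < insoluble_below:
        return "insoluble"
    return None
```

Leaving the 30-70% band unassigned keeps the two subsets well separated, at the cost of discarding proteins with intermediate behaviour.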
Sets of DNA-binding and non-binding proteins were obtained from earlier work 57,58 . Most of these PDB ids were accessible and ran successfully through the electrostatic potential patch analysis (128 DNA-binding proteins, 108 non-binders). Calculations were also made for a set of 7 proteins for which solubility data were available in precipitant studies 44 , using the same PDB ids specified in that work.
Charge, potential and polarity calculations. For polarity analysis, a sphere of radius 13 Å was centred on each non-hydrogen atom. Polar and non-polar solvent accessible surface area was then summed for all non-hydrogen atoms within that sphere, using a 1.4 Å radius solvent probe and polar/non-polar character assigned according to atom type and functional group 23 . The relative polarity of a patch is then calculated as the ratio of non-polar SASA to polar SASA, and the maximum value of this ratio (i.e. most non-polar region) recorded for each protein. When the polar and non-polar SASAs are summed for each patch, the average of this distribution over patches is about 1300 Å². In comparison, a typical evolved interface between proteins buries about 1600 Å² in total, although this is quite variable 59 . Considering that the entirety of each of the two contributing surfaces will not be buried in an association, a patch radius of 13 Å seems reasonable in generating a footprint for non-specific protein-protein interactions.
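As an illustration of the patch polarity measure, the following is a minimal sketch, not the implementation of ref. 23. It assumes that per-atom polar and non-polar SASA values have already been computed elsewhere; the input layout and function name are hypothetical, and patches with zero polar area are simply skipped.

```python
def max_nonpolar_to_polar_ratio(atoms, radius=13.0):
    """Maximum non-polar/polar SASA ratio over spherical patches.

    atoms: list of (x, y, z, polar_sasa, nonpolar_sasa) tuples for
    non-hydrogen atoms (coordinates in Angstrom, areas in Angstrom^2).
    A patch is centred on every atom in turn and includes all atoms
    within `radius` of that centre. Returns None if no patch has polar area.
    """
    radius_sq = radius * radius
    best = None
    for cx, cy, cz, _, _ in atoms:
        polar = 0.0
        nonpolar = 0.0
        for x, y, z, p_area, np_area in atoms:
            dist_sq = (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
            if dist_sq <= radius_sq:
                polar += p_area
                nonpolar += np_area
        if polar > 0.0:
            ratio = nonpolar / polar
            if best is None or ratio > best:
                best = ratio
    return best
```

The double loop is quadratic in the number of atoms, which is adequate for a sketch; a spatial grid or tree would be the usual choice for large structures.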
Electrostatic potential was calculated around each protein using a Finite Difference Poisson-Boltzmann methodology 22 , with negatively-charged Asp, Glu sidechains and C-termini, and positively-charged Lys, Arg and N-termini. Ionic strength was 0.15 M in the calculations. The resulting potential map was contoured at thresholds of +/- kT/e. Importantly, the contours were drawn on a single shell of the calculation grid (on the solvent side of the protein), so that the number of grid points in each contoured patch effectively represents the size of that patch. Grid step for electrostatic potential calculation was a constant 0.6 Å, independent of protein. A parallel approach was introduced to confirm that positive charge location underpins the contours of positive potential. The patch analysis described for surface polarity was also used to record the maximum net charge within a geometrical patch. For this purpose, sidechain charges were approximated at Cβ atoms to minimise the effects of sidechain conformational variation.
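The geometric charge-patch measure can be sketched in the same spirit. This is an illustrative reading only: charges are taken as +1 for Lys/Arg and -1 for Asp/Glu placed at C-beta (Cβ) coordinates, termini are omitted, and the function name and input layout are hypothetical.

```python
def max_net_charge_in_patch(centres, charged_sites, radius=13.0):
    """Largest net charge found in any spherical patch.

    centres: list of (x, y, z) patch centres, e.g. all non-hydrogen atoms.
    charged_sites: list of (x, y, z, charge), with each sidechain charge
    placed at the C-beta position (an approximation, as in the text).
    Returns None when no centres are supplied.
    """
    radius_sq = radius * radius
    best = None
    for cx, cy, cz in centres:
        net = 0.0
        for x, y, z, charge in charged_sites:
            if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= radius_sq:
                net += charge
        if best is None or net > best:
            best = net
    return best
```

Restricting `charged_sites` to Arg only, with a charge of 1 each, gives the maximum Arg count per patch used for the Arg-specific comparison.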
Receiver Operator Characteristic (ROC) plots were generated for the ability of calculated features to discriminate between soluble and insoluble proteins. As the numerical value of a feature is varied and applied as a threshold to the datasets, corresponding true positive rates (TPRs) and false positive rates (FPRs) are calculated and given in a ROC plot. Area Under the Curve (AUC) is used to estimate effectiveness for separating datasets, with 1.0 equating to complete separation and 0.5 to random. The Mann-Whitney U test was applied to the calculated feature subsets. The probability of occurrence of these particular feature values, if there is no difference in the underlying distributions, is given. A significant difference is inferred if this probability is < 0.05.
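The threshold sweep behind a ROC plot can be sketched as follows. This is a generic illustration, not the code used for Fig. 2; it assumes that the insoluble subset is treated as the positive class and that a protein is called positive when its feature value is at or above the threshold. Function names are hypothetical.

```python
def roc_points(insoluble_values, soluble_values):
    """(FPR, TPR) pairs obtained by sweeping a threshold over all observed values.

    A protein is predicted insoluble when its feature value >= threshold.
    """
    thresholds = sorted(set(insoluble_values) | set(soluble_values), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every value: nothing predicted positive
    for t in thresholds:
        tpr = sum(v >= t for v in insoluble_values) / len(insoluble_values)
        fpr = sum(v >= t for v in soluble_values) / len(soluble_values)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as ordered (FPR, TPR) pairs."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Computed this way, the AUC equals the Mann-Whitney U statistic divided by the product of the two subset sizes, which is why the two analyses rank features consistently.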
To investigate whether a relationship exists between calculated features and expression at the mRNA level, protein IDs for soluble and insoluble datasets were mapped to mRNA abundances for E. coli proteins 31 .
A model for non-specific protein-nucleic acid charge interactions. The model of Fig. 3a is based on a 15 kJ/mole interaction, or about 3 salt-bridges, since typical pKa shifts for a surface salt-bridge are 1 pH unit or 5-6 kJ/mole 60 . Total concentration of protein was calculated for an estimate of 2.35 × 10⁶ protein molecules 61 in an E. coli cell of side 1 μm, giving 4 mM. Summing the estimated contributions of tRNAs and mRNAs 61 and comparing with protein molecular weight gives a ratio of about 1:10, nucleic acid to protein. Each protein molecule, of average molecular weight 40 kD 61 , might bind to a polyanion through a single positive patch, whereas each polyanion nucleic acid molecule has multiple binding sites. With a single base molecular weight of about 0.32 kD, and assuming that binding sites could recur approximately every 12 bases, two factors of 10 approximately cancel and the maximum concentrations of positively and negatively-charged binding sites are roughly equal. Although 4 mM is set as this maximum, only a subset of proteins will exhibit positive patches above a certain threshold, whilst there are many factors that will contribute to structuring of nucleic acids and the masking of negative charge. Thus linear concentration ramps, up to 4 mM, are applied in Fig. 3a for interacting subsets of protein and nucleic acid. The heat map is then generated as the proportion of the interacting subset of proteins that is bound to nucleic acid, as the concentrations are altered, given the 15 kJ/mole interaction.
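The arithmetic of this model can be reproduced with a short sketch. Several choices here are assumptions rather than details given in the text: the cell is treated as a cube of side 1 μm (the reading that reproduces the quoted 4 mM), the temperature is taken as 298.15 K, the binding energy is converted to a dissociation constant with a 1 M standard state, and binding is treated as simple 1:1 site association. All function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23      # molecules per mole
GAS_CONSTANT = 8.314462618    # J / (mol K)


def molar_concentration(n_molecules, cube_side_m):
    """Molar concentration of n molecules in a cube of the given side (metres)."""
    volume_litres = cube_side_m ** 3 * 1000.0
    return n_molecules / AVOGADRO / volume_litres


def dissociation_constant(binding_energy_j_per_mol, temperature_k=298.15):
    """Kd in M for a favourable binding energy, assuming a 1 M standard state."""
    return math.exp(-binding_energy_j_per_mol / (GAS_CONSTANT * temperature_k))


def fraction_protein_bound(protein_m, na_m, kd_m):
    """Fraction of protein sites in complex for 1:1 binding (exact quadratic solution)."""
    if protein_m <= 0.0:
        return 0.0
    total = protein_m + na_m + kd_m
    complex_m = (total - math.sqrt(total * total - 4.0 * protein_m * na_m)) / 2.0
    return complex_m / protein_m


def bound_fraction_grid(max_conc_m=4.0e-3, steps=5, kd_m=None):
    """Heat-map values over linear ramps of protein (rows) and NA (columns) sites."""
    if kd_m is None:
        kd_m = dissociation_constant(15000.0)
    ramp = [max_conc_m * i / (steps - 1) for i in range(steps)]
    return [[fraction_protein_bound(p, n, kd_m) for n in ramp] for p in ramp]
```

Under these assumptions the dissociation constant is a little over 2 mM, so at 4 mM of each component slightly under half of the protein sites are in complex, in line with the statement that a weak interaction at high concentration gives a substantial bound fraction.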