Introduction

Interactions among proteins constitute the molecular basis of most processes in living organisms, and their deregulation or disruption often leads to disease1,2,3. Among other things, such interactions may differ in the number (dimers, tetrameters, etc) and kind of involved proteins (homo or hetero complexes), the stability of the binding (transient/permanent bindings), and the type of the binding process, i.e. lock and key, induced fit and conformational selection. While it has been estimated that over 80% of proteins operate in molecular complexes4, detailed comprehension of the mechanism behind the protein binding process and the stability of the resulting protein complexes is still incomplete. At a qualitative level, binding involves a recognition phase where distant molecules have to recognize themselves in the crowded cellular environment, followed by a docking process where the two molecules reorient/adapt to binding in specific regions. Despite this, complex formation is often highly specific: a binding partner could be recognized by only one of the members in a protein family even if they all have the same folds5. This compatibility is determined by an interplay between various contributions on the molecular surface and can either (i) be present from the beginning, when the two proteins are far apart (lock and key model), or (ii) be assumed by the proteins while exploring their conformational landscape (conformational selection model) or be gained while interacting with the partner (induced fit model)6,7,8. Once the proteins are bound, their binding regions are known to display a combination of geometrical and chemical complementarities, which ultimately reflect on the binding stability9,10,11,12,13,14.

At the level of amino acid composition, it is widely known that the composition of binding regions differs with respect to the rest of the solvent-exposed region: while the latter is preferentially populated by hydrophilic residues, binding regions have a higher number of hydrophobic residues, like Val and Leu, that tend to establish stronger van der Waals interactions10,15. From a geometrical point of view, the optimization of short-ranged interactions between atoms at the interface leads to a local shape complementarity of the proteins’ molecular surfaces. Indeed, the side chain rearrangements minimize the van der Waals interaction, thus determining shape complementarity at the interfaces, which is typically evaluated by geometrical approaches16,17,18,19,20,21.

Conversely, there is still no full consensus on the role played by electrostatic interactions, including hydrogen bonding, ionic/Coulombic, cation− \(\pi\), \(\pi -\pi\), lone-pair sigma hole, and orthogonal multipolar interactions5,22,23,24. In fact, acting at longer distances, it is unanimously understood that electrostatic compatibility plays a role at the beginning of the recognition process when partners are far away from each other25; indeed, proteins move in a very crowded environment and since electrostatic interactions are the most long-ranged ones, they can produce a drift in the Brownian motion of the two binding proteins. However, while this could be true for heterodimers (that may possess opposite charges), homodimers have the same net charge, thus attractive interactions can only take place between parts of the proteins (at the most)26.

Therefore, many studies are focusing on assessing the electrostatic match of protein complexes, to better understand why and how binding happens9,26,27,28,29,30,31. In particular, McCoy et al.32 found that binding sites are characterized by significant electrostatic complementarity, if defined as the correlation of surface electrostatic potential at binding sites on a small number of protein complexes. Another study discussed the importance of electrostatic interactions in the binding adaptation33. Shashikala and coworkers25 investigated the role of electrostatic interactions in diseases, finding that disease-causing mutations frequently alter wild-type electrostatic interactions. Moreover, electrostatic turned out to be a key feature even for machine learning methods that look at the identification of protein-protein binding sites18. Similarly, electrostatic and shape complementarity turned out to be sufficient to predict the DNA-binding sites on proteins with 80% accuracy29.

Figure 1
figure 1

Amino acid composition, charge properties, and classification of the dataset. (a) The complexes in the dataset are divided into heterodimers and IBR, SBR and nIBR homodimers. The colored boxes report an example for each category. The same colors are used to indicate in the pie chart each class abundance in the dataset. (b) For each protein, the sum of the charges of all its residues and only the interacting residues on the surface is computed. For each complex, these total and interacting charges from the two interacting partners are multiplied. The bar plot shows, for the whole dataset and each class, the percentage of complexes whose total (in orange) and interacting (in blue) products are negative. (c) The relative abundances of each of the twenty natural amino acids considering all the residues (orange), only the interacting ones (in green), and only the solvent-exposed residues (brown) are shown. The results are divided into the four classes.

Since electrostatic interactions can act both at short and long distances34, the interaction region that should be considered is of non-trivial definition. In fact, smaller regions are able to capture the binding properties due to van der Waals forces, which lead to a shape complementarity between the two interacting molecular surfaces. On the other hand, larger regions lose shape complementarity but involve more charged residues, which are typically excluded from the binding regions but are widely present in the other exposed regions, playing a crucial role in the thermal stability of the protein structure35. It is in fact known that electrostatic interactions between 9 and 12 Å are of crucial importance to distinguish obligate from non-obligate complexes36, thus playing a key role in the characterization of the binding. For a quantitative description of the Lennard-Jones potential at the interfaces of protein complexes, we have recently shown that shape complementarity is maximized for interacting patches less than 9 Å16, highlighting thus the effect of van der Waals interactions at the binding interface10. For a patch on the interface of this size, where the involvement of long-range electrostatic interactions is reduced, the quantification of the contribution of electrostatic complementarity remains unclear.

Here, we characterize the role of electrostatic interaction in protein binding and quantitatively measure the electrostatic complementarity at the interface of the molecular complexes, defining the binding interfaces in a 9 Å radius sphere. With this aim, we collect a dataset of protein complexes (see Methods for details) and characterize the complexes in terms of interface type, amino acid composition, and charge properties. To study the relationship between binding affinity and electrostatic complementarity we consider a second dataset, the ‘Affinity’ dataset (see Methods for details). Next, we analyze the contribution to the binding of electrostatic, by comparing the potential values of mirroring points on binding regions. We show that taking into account properly rescaled values of the electrostatic potential, we obtain a negative correlation between the complexes binding affinity and electrostatic complementarity. Following these results, we propose a new method able to quickly distinguish between interacting and non-interacting patches by describing their electrostatic potential projections with vectors and looking at the difference between these descriptors. This computational approach has been developed starting from the 2D Zernike method, that we proposed to quickly evaluate the shape complementarity at interfaces16,37,38,39,40,41; for what concerns shape complementarity, our method has already been demonstrated to be able to efficiently identify interacting regions by measuring the shape complementarity in terms of the Euclidean distance between the Zernike invariant descriptors associated with the projections of the molecular surfaces patches (see Methods for more details).

Here, the Zernike invariant vectors describe the electrostatic potential by considering in the same function both positive and negative values. As a final step, we show that the Zernike descriptions of the electrostatic allow for fast and superposition-free discrimination between transient and permanent interactions.

Results

Charges distribution and compatibility

To characterize the role of electrostatic complementarity in protein binding, we collected a balanced dataset of human protein complexes for which structural data were available (‘Human’ dataset). The dataset is composed of 164 homodimers, which can be divided into 44 dimers with an Identical Binding Region (IBR-hom), 66 Shifted Binding Region (SBR-hom), and 54 non-Identical Binding Region (nIBR-hom), depending on the similarity of the interacting patches; finally, the dataset includes 35 heterodimers (nIBR-het). In addition to this dimer classification, the same complexes can also be structurally classified looking at the prevailing secondary structure of the proteins: the same dataset includes 133 complexes where both binding partners have a prevalence of helices residues (HH), 57 where both proteins have mostly strands residues (SS) and 9 complexes where one of the partner has more strand residues while the other one has more helices (SH). See Figs. 1a, S2a and Methods section for more details. To begin with, we investigated the charge distribution of the proteins in the ‘Human’ dataset and their amino acid composition with respect to their dimer classification. In Fig. 1b, the percentage of complexes that have a total sum of the charges (considering either all the residues of each protein or only the interacting ones) with opposite signs is displayed. According to this analysis, heterodimers are the only class whose interacting patches (\(\sim 4\)%) have a discordant sum of the charges among binding partners. Looking more specifically at the charge of interacting residues, as shown in Figure S1a of the Supplementary, we observe that only \(\sim 1.5\)% of the negative interacting residues are close to other negative residues on the other protein in the complex. Positive residues instead can be found in proximity more frequently, between 1.6% and 3.2% of the time, depending on the complex category and on radius defining the surrounding of a residue. This condition is particularly common for nIBR homodimers. Oppositely charged residues can be rarely found (0.15-0.25% of the time) among the interacting patches of SBR homodimers as well. On the other hand, heterodimers and IBR homodimers have a higher percentage, up to 3.6% and 3.2% respectively, of opposite charges facing each other in binding regions. Figure S1b in the Supplementary characterizes the residues surrounding non-charged amino acids as well. It can be seen that, in general, all complexes tend to have non-charged residue facing each other, confirming the predominantly hydrophobic nature of the interacting regions of the protein-protein complexes42.

Figure 1c shows a general overview of the amino acid abundances, computed considering (i) all the residues in a protein, (ii) only the solvent-exposed ones, and (iii) only the ones included in the binding regions (see Methods for details about the definition of solvent-exposed residues and binding regions). The analysis of the amino acid composition confirms the (well-known) result10 that hydrophobic amino acids, such as Ile or Met, are uncommon in the solvent-exposed surface of proteins. However, when one of them is present in the exposed regions, it is more likely to find it in a binding site rather than on the rest of the surface. On the contrary, charged amino acids, such as Lys or Glu, are more present on the surface, but the fraction taking part in the binding is relatively small. Figure 1c also shows that IBR homodimers tend to have less charged amino acid on the interacting regions, compared to heterodimers and nIBR homodimers.

Finally, we performed the same analyses for the three classes in which the dataset is divided when considering the secondary structures. The results are shown in Figure S2. In particular, Figure S2b shows that the heterodimers whose interacting patches have a discordant sum of the charges are classified as SS or HH. However, as shown in Figure S2c, when considering neighboring charges, SH complexes tend to have more negative interacting residues close to positive residues on other protein’s surface, even reaching  3.3% for a small patch radius. For all three classes the hydrophobic nature and the amino acid composition of the binding sites are confirmed by Figure S2d and S2e.

Figure 2
figure 2

Electrostatic complementarity contribution in protein-protein complexes (a) On the left, the 3D representation of two proteins forming a complex and their electrostatic potential surfaces. In the center, the EMs of two interacting patches. Each pixel of the matrices is colored according to the electrostatic potential value of the surface points projected in that region. On the right, the SEMs of the same two patches: red and blue pixels correspond respectively to positive and negative values of the electrostatic potential. (b) Distributions of the F values computed for interacting (orange) and random (grey) patches taken from the ’Human’ dataset. In the insert the corresponding ROC curve. (c) Distributions of the F values of the interacting patches in complexes from the nIBR-het (red) and SBR-hom (yellow) classes. In the insert the corresponding ROC curves. (d) The distributions of the F values computed for interacting patches in a) are classified in pH ranges: low in light purple, physiological in green, and high in light green. In the insert the corresponding ROC curves. (e) Fraction of concordant regions as a function of the pH and computed correlation (in the legend). From left to right the considered complexes are the whole ’Human’ dataset, the nIBR-het, the IBR-hom, the SBR-hom, and the nIBR-hom.

The specificity of electrostatic complementarity depends on both complex kinds and environmental factors

Next, we moved to analyze the spatial disposition of the electrostatic interactions. To do so, we evaluated the electrostatic potential generated by the protein charges on the molecular surface43. Being a high-level representation of the external protein structure, such representation allows for an efficient evaluation of both the geometrical and chemical complementarities. In fact, the amino acids composition analysis provides information only about the chemical and physical properties of the residues belonging to the two binding sites, while the resolution of the Poisson-Boltzmann equation provides information on the electrostatic potential distribution at the interfaces, due to all the atoms of the protein, even those not in close proximity with the binding site.

To analyze the contribution of electrostatic to the binding between two proteins, we define a quantitative measure of electrostatic complementarity able to differentiate between interacting and non-interacting regions. We start by analyzing the ’Human’ dataset and comparing the electrostatic potential values of the binding regions with those of non-interacting regions. To measure the complementarity, we describe each patch with a Simplified Electrostatic Matrix (SEM). The latter is computed starting from the Electrostatic Matrix (EM), obtained by projecting the considered region of the electrostatic potential surface on the x-y plane, which is defined as the best fit of the surface points belonging to the specific patch. The matrix is then built on the plane and each pixel is associated with the average value of the electrostatic potential of the points projected into that pixel (see Fig. 2a and Methods section for more details). To obtain the SEM, we assign +1 and -1 to all the positive and negative pixels respectively. Figure 2a shows on the right the SEMs of two interacting patches, and in the center the corresponding EMs. To evaluate the electrostatic complementarity between two patches, we compare each pixel of the SEM describing the first patch with the pixel at the same position on the SEM describing the second one. In this way, we can check if surface points that face each other on the binding partners have electrostatic potentials with opposite signs. Operatively, we defined F as the fraction of pixel pairs in which the two pixels have the same sign: patches with a high electrostatic complementarity, i.e. a high number of opposite neighbor points with discordant electrostatic potentials, should have a low value of F. The measured complementarity is compared with that one would obtain by chance, randomly selecting surface points and building around each one a patch on the surface of the partner protein.

Figure 2b shows the distribution of F scores for the whole ‘Human’ dataset. In particular, one can see that random patches (i.e. decoys) have a gaussian-like distribution with a mode of \(\tilde{0}.49\), as we expect that for two random patches the probability that a spatial corresponding region has an opposite sign is 0.5. The distribution of all complexes instead is shifted toward values lower than 0.5. In particular, it has a mode of 0.38, indicating that protein binding regions have a degree of electrostatic complementarity higher than what one would expect by chance. Indeed, this can be quantified by computing the ROC curve (see inset Fig. 2b) and evaluating the Area Under the Curve (AUC), which is 0.55 in this case.

Looking at the shape of the distribution, one can see that it appears to be composed of different populations of proteins; in fact, it presents bi/tri modalities. We thus proceeded to separate the dataset according to the complex classes.

Doing so, we found different behaviors for the various classes. To quantify the differences between the distributions, we evaluated (see Fig. 2c and Figure S3 in the Supplementary) the ROC curves of each distribution with respect to the decoy’s one and computed the corresponding AUC. The classes with the lowest AUC of the ROC curve are SBR and nIBR homodimers (at 0.51 and 0.53 respectively), for which the classification performance can not be distinguished from that of random decoys. A slightly better classification is obtained for the IBR homodimers, with an AUC of the ROC curve of 0.56. On the other hand, heterodimers present an AUC of the ROC curve of 0.64.

Interestingly, a trend is observed also stratifying the dataset according to the pH of each complex. Figure 2d shows the distributions and relative ROC curves for three ranges of pH values, i.e. high (pH\(>7.5\)), low (pH\(<5.5\)), and physiological (\(5.5<\)pH\(<7.5\)). Complexes in the latter range have a high degree of electrostatic complementarity, having an AUC of the ROC curve of 0.58, whereas the F score of interacting patches in the low pH range is shifted to higher values (resulting in an AUC of the ROC curve of 0.33), meaning that in this case, the binding regions have a higher fraction of concordant points facing each other.

Moreover, in Fig. 2e, we show the fraction of concordant regions as a function of the experimental pH value for every complex of the ‘Human’ dataset and each subclass. It is interesting to note that while the whole Human dataset does not show a strong correlation with the pH (-0.29), the correlation values vastly differ among the subclasses. The F values computed for SBR and nIBR homodimers have a correlation of -0.11 and -0.29 respectively. On the other hand, IBR homodimers and heterodimers anti-correlate with the pH value, correlating -0.4 and -0.6. Nevertheless, Table 1 shows that independently from the class, the interacting patches of complexes in the low range have an AUC lower than 0.5, meaning that for low pH the electrostatic potential in points facing each other on interacting patches has the same sign. Table 1 reports as well how the performances of the F score for increasing radius R of the patch: after \(R=12\)Å  its characterization of interacting patches does not improve or even worsen. Even if larger regions include more charged residues by extending out of the hydrophobic binding sites, the complementarity of the charges is lost when the surfaces of the complex are not interacting. This analysis confirms our choice of a 9 Å radius to define the interacting patches.

Low-affinity interactions use electrostatic complementarity to achieve specificity

Since both the stratification by classes and pH did not fully account for the observed shape of the distribution, we look for binding affinity data. To do so, we collected a dataset of complexes with known structure and experimental dissociation constant, \(K_d\). In particular, we took the dataset proposed by Desantis et al., which we refer to as the ‘Affinity’ dataset (see Methods for details), which is exclusively composed of 123 heterodimers.

Figure 3a shows the F score distribution of interacting and random patches for the whole ‘Affinity’ dataset. The latter have a gaussian-like distribution with a mode of 0.49, as for the random patches of the ’Human’ dataset. The former instead is shifted to lower values, having a mode of 0.31 and an AUC of the ROC curve of 0.69 (as shown in the insert of Fig. 3a).

Stratifying the dataset in three groups according to the complex binding affinities [\(B_a=\log _{10}(K_d)\)], we obtain the results shown in Fig.  3b,c. The three distributions, corresponding to high (\(B_a<-9.0\)), medium (\(-9.0<B_a<-6.0\)) and low (\(B_a>-6.0\)) binding affinity are well separated and shifted on different ranges of F scores. Low-affinity complexes are moved to lower values of F, resulting in an AUC of the ROC curve of 0.81, while the ones with high binding affinity can not be distinguished from random decoys, having an AUC of the ROC curve of 0.55. The medium binding affinity complexes cover an intermediate range of F values and have an AUC of 0.67. Interestingly, if we look at the F value of each complex as a function of its binding affinity, we get a negative correlation ( -0.38) with the binding affinity, as shown in Fig. 3d. Table 2 shows how the performance of the F score changes for increasing radius R of the patch: as already discussed for Table I, after R = 12 Å its characterization of interacting patches does not improve.

Table 1 AUC of the ROC curves of the F score for varying patch radius and pH. The AUC of the ROC curves are computed using the random distribution and the distributions of the ’Human’ dataset interacting patches, divided according to the pH. Increasing values of the radius R defining the patches are tested.
Table 2 AUC of the ROC curves of F for varying patch radius and binding affinity. The AUC of the ROC curves are computed using the binding and random regions Fs distributions. The binding region distributions are divided into three groups, according to the binding affinity \(B_a\) of their complex. Increasing values of the radius R defining the patches are tested. The complexes are part of the ’Affinity’ dataset.
Figure 3
figure 3

Electrostatic complementarity contribution in the binding stability of complexes. (a) Distributions of the F values computed for interacting (violet) and random (grey) patches taken from the ’Affinity’ dataset. In the insert the corresponding ROC curve. (b) The distribution of the F values of the interacting patches in a) is divided according to the binding affinity of the complex: high affinity in red, medium in yellow, and low in green. In grey is the distribution of the random decoys. (c) ROC curves and corresponding AUC (in the legend) of the distributions of the interacting patches in b), computed against the random distribution. (d) Fraction of concordant regions as a function of the binding affinity and computed correlation (in the legend). Each point is colored according to the pH value, as indicated by the color bar.

Projecting the molecular surface on an orthogonal basis allows to compactly describe the electrostatic contribution to the interface of complexes

Leveraging on the results of the previous sections, we looked for a compact method to simultaneously measure both electrostatic and shape complementarity between protein patches.

To describe and compare the electrostatic surfaces more efficiently, we apply the 2D Zernike polynomials, which constitute a complete basis in which any function of two variables defined in a unitary disk can be decomposed. The Zernike expansion associates each portion of a surface to an ordered set of numerical descriptors, invariant under rotation, allowing an easy and fast metric comparison between different protein regions for complementarity evaluation (see Methods for details). The rotational invariance is a fundamental property in the blind search for interacting patches. The complementarity of the binding regions can then be evaluated in terms of the euclidean distance between their corresponding Zernike vectors. In particular, we measure how much the distance between the Zernike descriptors of a pair of interacting sites is smaller than the distances between random patches.

Figure 4 (panels from a to d) shows a schematic representation of the computational protocol for comparing, in terms of shape and electrostatic, interacting proteins. For each protein, the molecular surface and the electrostatic potential surface are built. The former corresponds to the solvent-accessible surface, the latter is obtained by assigning to each point of the molecular surface the value of the electrostatic potential computed in that region as obtained by solving the Poisson-Boltzmann equation44. On each surface, a patch is iteratively selected, and the corresponding regions of both the molecular and electrostatic surfaces are separately projected onto a plane. An example of both projections for two interacting patches is shown in Fig. 4e. More details can be found in the Methods. We assess the shape and electrostatic complementarity between the patches by expanding in terms of Zernike polynomials the 2D projections of the molecular and electrostatic potential surfaces respectively. The distance between the Zernike vectors of interacting patches is smaller than the one between the vectors of two random patches, as shown in Fig. 4f. As shown in Fig. 4g,h and in Figure S4a,b in the Supplementary our results are in line with what has been observed in16: interacting patches are efficiently distinguished from random decoys, with an AUC of the ROC curve close to \(\sim\)0.8. The class whose interacting sites can be better identified includes IBR homodimers (with a success rate of 0.96), whereas the lowest efficiency is obtained for nIBR homodimers (AUC at 0.72). Next, we extended the Zernike method for the study of electrostatic complementarity. Since Zernike coefficients can represent only real-valued functions over the unit disk we define the Confined Electrostatic Matrix (CEM). CEMs are obtained by capping the EMs pixels above +30 and below \(-30\)45. This allows us to obtain Zernike-expandable functions. We then define electrostatic complementarity as the distance between the Zernike vectors associated with the CEMs. Figure 4i-j and Figure S4c,d in the Supplementary show that this definition of complementarity reaches an efficiency in distinguishing interacting and random patches comparable with the one obtained with F values, with an AUC of the ROC close to 0.60. IBR and nIBR homodimers correspond to the best (0.68) and worst (0.55) performances. As previously assessed, complexes with low pH have a low electrostatic complementarity and can not be easily distinguished from random decoys, reaching an AUC of 0.55, as shown in Table 3. Interestingly, the opposite behavior can be observed for the shape complementarity, which is higher when the pH is low.

Figure 4
figure 4

Schematic representation of the computational protocol and Zernike evaluation of complementarity. (a) Molecular representations of the surfaces of two proteins forming a complex. (b) Molecular representation of the surface of one of the proteins depicted in a). A sphere is used to select a possible patch on the surface: the dark shadow highlights the selected points. (c) Electrostatic potential surface, where each point is colored according to its electrostatic potential value. In the zoom, the region of the electrostatic potential surface corresponding to the patch selected in b).(d) 2D projections of the patch. In the blue scale the shape projection, for which the colors in the plane are determined by the distance of the surface points from a predefined origin (see Methods for details). In the blue-red scale the electrostatic projection, where the colors are determined by the electrostatic potential values of the above points. (e) Comparison between the shape and electrostatic projections of two binding regions. (f) In grey the distributions of the Zernike shape (top) and electrostatic (bottom) distances between random patches. The blue and red lines correspond to the distances between the Zernike vectors describing the two patches on top and bottom respectively in e). (g) Distributions of the distances between the Zernike vectors describing the molecular surface of IBR-hom and nIBR-hom interacting (blue and green respectively) and random (grey) patches in the ’Human’ dataset. (h) ROC curves of the distributions in g) and corresponding AUC (in the legend) computed against the random distribution. (i) For each patch the distance between the Zernike vectors describing the electrostatic potential surface in that region is computed. Then the same analysis and classification as in g) is performed. (j) ROC curves of the distributions in i) and corresponding AUC (in the legend) computed against the random distribution.

Table 3 Discriminating power of the Zernike-based expansion of molecular and electrostatic potential surface for different pH ranges. In the \(Zernike_{Shape}\) column, the AUC of the ROC curves is computed from the distribution of the Zernike distances between the molecular surface of interacting patches and the random decoys. In the \(Zernike_{Electrostatic}\) column the AUC of the ROC curves is computed considering the electrostatic potential description of the patches. The results are divided according to the pH of the complexes.

Finally, to evaluate the impact of the structural characteristics on electrostatic complementarity, we stratified our results -both concerning the F value and the Zernike distance- according to the prevailing secondary structures of the complex.

As shown in Fig. 5a, the F value distribution does not significantly vary between the three structural classes. Moreover, when the complexes are divided in SS, HH and SH the classification performance of the F value can not be distinguished from that of random decoys, reaching a value of the ROC AUC of 0.54, 0.56 and 0.5 respectively.

When looking instead at the fraction of concordant regions as a function of the experimental pH value for each of the three structural classes of the ‘Human’ dataset, more interesting observations can be done. Figure 5b confirms that complexes with a higher pH value tend to have a higher degree of electrostatic complementarity. This is particularly true for SH complexes, which have a correlation of -0.88 (p-value at 0.002). The F values of SS and SH complexes are more randomly distributed (correlation at -0.25 and -0.27 respectively), nevertheless this division results in an overall better correlation of the single classes with the pH, compared to what is obtained when considering the dimer class.

Next, we stratified the Zernike distances between the molecular surface patches according to this division. As shown in Figs. 5c and d, when considering shape complementarity SS complexes are the most easy to distinguish from random decoys (ROC AUC of 0.79). Note that face-to-face interactions between \(\beta\)-strands are usually characterized by a high shape complementarity46. SH complexes reach a ROC AUC of 0.76, instead.

The opposite trend can be observed for the electrostatic complementarity studied with the Zernike method: in this case, as depicted in Fig. 5e and f, SH complexes are the most distinguishable from random decoys (ROC AUC at 0.68), whereas SS complexes are the most difficult to classify (ROC AUC at 0.6).

Figure 5
figure 5

Electrostatic complementarity contribution in protein-protein complexes divided according to their secondary structure (a) Distributions of the F values of the interacting patches in complexes from the SS (red), HH (yellow) and SH (blue) classes. In the insert the corresponding ROC curves. (b) Fraction of concordant regions as a function of the pH and computed correlation (in the legend). From left to right the considered complexes are the SS, HH and SH complexes. (c) Distributions of the distances between the Zernike vectors describing the molecular surface of SS, HH and SH interacting (red, yellow and blue respectively) and random (grey) patches in the ’Human’ dataset. (d) ROC curves of the distributions in c) and corresponding AUC (in the legend) computed against the random distribution. (e) For each patch the distance between the Zernike vectors describing the electrostatic potential surface in that region is computed. Then the same analysis and classification as in c) is performed. (f) ROC curves of the distributions in e) and corresponding AUC (in the legend) computed against the random distribution.

Transient from permanent interactions can be distinguished solely based on the electrostatic complementarity

Figure 6
figure 6

Superposition-free classification of transient and permanent interactions. (a) Probability density functions of the Zernike electrostatic distances of the ’Affinity’ dataset complexes. Green (respectively red) distribution corresponds to complexes having \(B_a\) values higher (respectively lower) than -6, corresponding to transient and permanent interactions, respectively. Grey curves correspond to the distances of random decoy patches on the protein surfaces. (b) ROC curves of the transient (green) and permanent (red) interactions with respect to the decoy distribution, together with the ROC curve of the transient distribution with respect to the permanent one (brown). (c) Same as in a) but considering transient complexes with pH higher than 7 and permanent interactions with pH lower than 7, in maroon and light green respectively. (d) ROC curves of the two distributions displayed in panel c).

At last, we apply our method to the ’Affinity’ dataset to test the ability of our descriptor to distinguish between permanent and transient interactions, as this property has important effects on biological functions47. In particular, defining permanent (respectively transient) interactions based on the binding affinity48 being lower (resp. higher) than \(B_a = -6\), we obtained the distributions shown in Fig. 6a. Interestingly, transient interactions display higher than random electrostatic complementarity values (green distribution), while permanent interactions (red distribution) have Zernike distances slightly higher than that one would expect by chance. This can be quantified again by looking at the ROC curves in Fig. 6b and evaluating the AUC values. Indeed, transient interactions display an AUC of 0.69 with respect to the decoy distribution, while permanent interactions have an AUC of 0.43. Permanent interactions can be distinguished from transient ones with an AUC of the ROC of 0.78. Notably, knowing the pH of the considered complexes allows for an even better classification as one can see from Fig. 6c, d.

Finally, we tested the capability of the Zernike method to quantify the effect of point mutations at the interface on the binding affinity of the complex. Results, discussed in the SI, indicate that the electrostatic complementarity evaluated on five mutants with known mutations at the binding sites and experimentally determined dissociation constant is higher the lower is the stability of the complex.

Discussion

The full mapping of the organisms’ interactomes is fundamental for understanding molecular interactions and their many physiological and pathological implications. The well-tested toolbox of experimental techniques we dispose of, such as X-ray crystallography49, NMR50,51 and cryo-EM52,53, is allowing for the detection of protein-protein binding and the determination of the complexes atomistic structure. However, all these techniques are expensive and time-consuming54 so that up to now only small fractions of the organisms’ interactomes have been experimentally determined at the structural level55,56,57. In this respect, computational methods represent a powerful tool to unveil the uncharted landscape of protein complexes58 by predicting protein-protein associations in normal conditions and under mutations/modifications59,60,61,62, which further complicate the compilation of the interactomes by increasing the number of matches to probe.

To predict the protein complex, the identification of putative binding interfaces plays a key role and most of the proposed strategies identify the interfaces as those showing some geometrical/chemical complementary between the molecular partners. In particular, the side chain rearrangements minimize the van der Waals interaction thus determining shape complementarity at the interfaces, which is typically evaluated by geometrical approaches requiring structural alignment between the two interacting molecules.

Notably, the geometric complementarity of the final complexes does not depend on the dynamical specifics of the binding process. In fact, partners can undergo very few changes upon binding (i.e they follow a “lock-and-key” model) or the interactions between two approaching structures can induce conformational changes (“induced fit”) or the protein conformation suitable for binding (bound state) can be explored by the protein even in the absence of the molecular partner (the “conformational selection” model); these three views suggest different key contributors to the conformational changes between the unbound and bound structures, but for all of them shape complementarity is a necessary condition for the complex formation.

Usually, by including the electrostatic contribution to the binding process investigation, in addition to the van der Waals forces, one aims at more precise discrimination of the biological interfaces.

In this respect, computational methods can be divided into two categories: model-based and feature-based approaches. The former exploits the residue-conservation found between similar proteins, the latter is based on local features of protein sequences and/or structures. Feature-based approaches are more general and can work on any type of protein. Even if the availability of protein structures is less abundant than sequences, structural features are fundamental for understanding binding between proteins. Moreover, the recent advances in the field of protein structure prediction starting only from its amino acid sequence63, vouch for an even more important role of the structural-based method than in the past years. Nonetheless, even using structural information, the identification of interfaces remains a challenge in structural biology. Machine learning-based approaches give promising results, but they require the definition and training of several parameters and lack a clear physical-chemical interpretation. Here, we proposed an approach for the rapid and quantitative evaluation of electrostatic complementarity and we probed its role in the identification of binding regions and complexes’ stability. Analysis of the electrostatic potential of protein-protein complexes has led to the general assertion that electrostatic complementarity must be of importance at the interfaces of biological complexes22; nevertheless, a well-settled definition of how electrostatic complementarity should be quantified and what is its role was still missing.

With this aim, we collected two large datasets of protein dimers with known structural information stratified by dimer type, prevailing secondary structure and stability (quantified by means of experimental binding affinity). At first, we analyzed the amino acid composition of the binding region with respect to those of the proteins’ cores and solvent-exposed regions. Next, we looked at the presence and disposition of the charged residues on the binding regions finding that different classes of dimers and structures have slightly different disposition/abundances of charged-charged interactions. Finally, we further increase the complexity of the electrostatic description, considering the full electrostatic potential generated by the protein partial charges on the solvent-exposed molecular surface. This representation allows for a high-level measurement of the electrostatic complementary at the interface of the interacting molecules. Indeed, comparing the spatial correspondence of the potential sign (see F descriptor), we found that the binding regions exhibit a complementary higher than the one we could expect by chance. Notably, the signal is influenced both by the complex class and the experimental pH and binding affinity. In particular, we observed that the maximum complementarity is shown by low-affinity complexes (\(B_a>-6.0\)), while SH complexes and homodimers sharing some residues on the binding regions (SBR-hom) exhibit a nearly random match. Finally, we propose a novel method to assess electrostatic complementarity without the need of having complex structures. Indeed, we already developed a novel computational protocol based on the Zernike polynomials to describe the shape of portions of the molecular surface in the form of a vector of numbers16,37,38,39,40. Here, the method is extended to molecular surfaces for which the electrostatic potential has been calculated through the Poisson-Boltzmann equation44. Indeed, after a proper projection of the electrostatic potential surfaces on the 2D plane, electrostatic complementarity can then be defined again as the Euclidean distance between these new Zernike invariant descriptors (see Methods).

Comparing the electrostatic complementarities at the complex interface via the Zernike method, we found that we are able to discriminate between transient and permanent interactions with an AUC of the ROC of 0.8. Interestingly, the electrostatic complementarity evaluated with the Zernike method on five mutants with known mutations at the binding sites and experimentally determined dissociation constant seems to indicate that our method is able to capture to a certain extent also the effect of point mutations on the complex binding affinity.

In conclusion, we found that electrostatic complementarity in the binding region is efficiently measured simply requiring a spatial match between the signs of the electrostatic potentials. Moreover, such complementarity strongly depends on both the kind of the considered complex, the pH of the environment, and the transient/permanent nature of the binding. In particular, we observe an evident inversely proportional relationship between electrostatic complementarity and the experimental binding affinity. Our results thus help shed light on the often contrasting conclusions of previous works that measured electrostatic complementarity using large datasets. Leveraging on our findings, we adapted the Zernike formalism to measure both shape and electrostatic complementarity in a fast and superposition-free manner. Finally, we note that our findings could be used to reinforce the docking algorithm, or/and to perform pose selection. Moreover, our method could be adapted to other properties that can be described with numerical values assigned to each surface point, since the Zernike expansion can be applied to any function.

Methods

Protein complex datasets

To probe the degree of electrostatic complementarity in protein-protein binding regions, we collect a dataset of protein-protein dimers for which structure information was available from the 3D complex database64. Selecting only non-redundant human dimers, with an x-ray crystal resolution better than 3.0 Å and no missing residues in the binding region, we ended up with 199 human protein complexes in PDB format65. We opted to restrict to only one organism to avoid spurious effects on the charges distribution in the protein structure, due for instance to thermal adaptation10,13,66.

Looking at the dimer composition and spatial orientation, we classify the dataset, that we call ‘Human‘ dataset, into four groups:

  • 44 homodimers with Identical Binding Regions (IBR-hom), i.e. binding regions that have at least 70% of common residues.

  • 66 homodimers with Shifted Binding Regions (SBR-hom), i.e. interacting patches that have between 30% and 70% of common residues.

  • 54 homodimers with non-Identical Binding Regions (nIBR-hom), i.e. binding regions that share less than 30% of the residues.

  • 35 heterodimers (nIBR-het), where two different proteins are interacting.

To gain more insights into the structural dependence of electrostatic complementarity, we considered a second independent classification of the same dataset. For this classification we looked at the secondary structure of each protein. Per residue secondary structure assignment was done using the DSSP67 module implemented in Python. Proteins with a prevalence of residues associated to helices are classified as H, otherwise as S. Looking at the structural composition of the binding partners, the ’Human’ dataset is classified in three classes:

  • 133 HH complexes, i.e. complexes where both partners have a prevalence of helices over strands.

  • 57 SS complexes, i.e. complexes where both partners have a prevalence of strands over helices.

  • 9 SH complexes, i.e. complexes where one of the partners has more helices residues and the other more strands residues.

Table SI shows the list of the PDB id of the complexes in the ’Human’ dataset, together with their dimer and structural classification. Figure S2 shows the amino acid composition and charge properties of the three classes.

To analyze the correlation between electrostatic complementarity and binding stability we consider a second dataset, composed of 123 complexes extracted from the dataset used in10. To our knowledge, that dataset is the largest available collection of complexes with experimental data of binding affinity \(B_a\), defined as the \(log_{10}\) of the equilibrium dissociation constant \(K_d\)68. We then select the complexes with known pH and no missing atoms or residues and call the resulting collection ’Affinity’ dataset. Two of these complexes are also part of the ‘Human‘ dataset as heterodimers: 2HTH and 3MZG. The list of the complexes together with their \(B_a\) is reported in Table SII.

Computation of the surfaces and surface residues definition

The solvent-accessible surface for each structure of the dataset was computed using DMS69, with a density of 5 points per Å\(^2\) and a water probe radius of 1.4 Å. For each surface point, the unit normal vector was calculated with the flag −n. Starting from these surfaces, the electrostatic potential of each protein was calculated independently from the partner using the APBS code43, considering the experimental pH. The electrostatic potential surface was then defined by building a grid and selecting the values of the electrostatic potential in the grid cells corresponding to each surface point.

To select among the residues included in the surface the mainly superficial ones, we computed the Relative Solvent Accessibility as the ratio between Solvent Accessibility and the maximum Solvent Accessible Surface Area of the considered amino acid. The Solvent Accessibility is calculated with DMS by computing the portion exposed to the solvent of each residue involved in the interaction, while the maximum Solvent Accessible Surface Area of the twenty natural amino acids was taken from70. A residue is considered superficial if it has a Relative Solvent Accessibility higher than 0.25. The interacting regions were defined as the points on a protein surface closer than 6Å to its partner surface.

Patch definition and projection

To define a surface patch, we use a spherical region with radius R centered at one point of the surface. This point is randomly extracted for the decoy random patches, while to study the binding regions the geometrical center of the experimental interacting regions is considered. For this study, we chose \(R=9\)Å to be able to study simultaneously the shape and electrostatic complementarity with the 2D Zernike-based method. Indeed, in a previous work, we discussed the range of R values resulting in the best identification of binding regions when considering shape complementarity16.

Once the patch has been selected, we re-orient the coordinates. When two random patches are compared, for each patch we build a plane passing through it and we orient the coordinates so that the z-axis is perpendicular to the plane. It must be remembered that when comparing the shape of patches, their relative orientation must be evaluated: to assess their shape complementarity, we have to orient the patches contrariwise, i.e. one patch with the solvent-exposed part toward the positive z-axis (‘up’) and one toward the negative z-axis (‘down’).

On the other hand, to compare the EMs and SEMs of interacting patches we compute the mean of the normal vector of the first partner and the inverse of the normal vector of the second one. The binding patches are then rotated so that this averaged vector is along the z-axis. This step results again in two patches contrariwise oriented, but in addition to this, we can preserve the spatial correspondence of the surface points after the rotation. We want to remark here how this correspondence is not necessary when the projections are decomposed in the Zernike basis and compared with the Zernike protocol, giving the rotation invariance of the Zernike polynomials. Therefore, to study the patches in terms of the Zernike polynomials expansion we reorient each binding site along the z-axis independently from its partner.

Once the patches have been rotated, two protocols can be implemented. The first one is used to obtain the projections of the corresponding regions of the electrostatic potential surface, whereas the second one provides the projections of the molecular surfaces.

Electrostatic projection

Each point of the re-oriented electrostatic surface is projected on the x-y plane. Next, we build a square grid (25\(\times\)25 pixels) and associate each pixel with the mean value of the electrostatic potential of the points projected inside of it, and call it the Electrostatic Matrix (EM).

Shape projection

Once the patch has been rotated, given a point C on the z-axis we define the angle \(\theta\) as the largest angle between the z-axis and a secant connecting C to any point of the patch. C is then set so that \(\theta =45^\circ\).

To study the shape of the patch, each surface point is labeled with its distance r to C. We then build a square grid (25\(\times\)25 pixels), associating each pixel with the mean r value calculated on the points inside it.

Zernike 2D protocol

Each function of two variables \(f(r,\psi )\) defined in polar coordinates inside the region of the unitary circle (\(r<1\)) can be decomposed in the Zernike basis as

$$\begin{aligned} f(r,\psi )=\sum _{n'=0}^\infty \sum _{m=0}^{n'}c_{n'm}Z_{n'm}(r,\psi ), \end{aligned}$$
(1)

where

$$\begin{aligned} c_{n'm}=\frac{n'+1}{\pi }\int _0^1dr\,r\int _0^{2\pi }d\psi Z_{n'm}^*(r,\psi )f(r,\psi ) \end{aligned}$$
(2)

and

$$\begin{aligned} Z_{n'm}=R_{n'm}(r)e^{im\psi }. \end{aligned}$$
(3)

\(c_{n'm}\) are the expansion coefficients, while the complex functions \(Z_{n'm}(r,\psi )\) are the Zernike polynomials. The radial part \(R_{n'm}\) is given by

$$\begin{aligned} R_{n'm}(r)=\sum _{k=0}^{\frac{n'-m}{2}}\frac{(-1)^k(n'-k)!}{k!\big (\frac{n'+m}{2}-k\big )!\big (\frac{n'-m}{2}-k\big )!}. \end{aligned}$$
(4)

Since for each couple of polynomials, it is true that

$$\begin{aligned} \langle {Z_{n'm}|Z_{n''m'}\rangle }=\frac{\pi }{n'+1}\delta _{n'n''}\delta _{mm'}, \end{aligned}$$
(5)

the complete sets of polynomials form a basis, and knowing the set of complex coefficients \({c_{n'm}}\) allows for a univocal reconstruction of the original patch. The resolution of this reconstruction depends on the order of expansion \(N=max(n^{'}).\)

The norm of the coefficients \(z_{n'm}=|c_{n'm}|\) defines the Zernike invariant descriptor, which is invariant for rotations around the origin of the unitary circle.

The complementarity between two given patches defined with a sphere of radius R can then be measured as the Euclidean distance between the two corresponding invariant vectors: the more the complementary the smaller the distance between their corresponding Zernike vectors. This evaluation can be applied to any properties of the patches that can be described by assigning a numerical value to each surface point.

The efficiency of this method depends on two key parameters: the radius R and the Zernike maximum expansion order N. When R is too low, the patches lack sufficient surface to distinguish the compatibility between interacting regions, whereas too-large patches would include non-interacting regions that have a low complementarity per se. N, on the other hand, determines the level of details captured: too low orders could confuse interacting and random patches because the surfaces are excessively “smoothed”, while an excessively accurate level of description would model unnecessary (and time-consuming) details.

In this study, we performed the Zernike protocol using \(R=9\)Å and \(N=20\), in accordance with the most efficient parameters identified in the previously mentioned work16.