Introduction

Knowledge about protein–protein interactions (PPI) is critical to understanding the molecular mechanisms of biochemical processes and cellular pathways. The advent of high-throughput techniques has enabled genome-wide identification of PPIs for quite a few model organisms1,2,3,4,5,6,7,8,9,10,11,12. This large number of experimentally verified as well as computationally predicted interactions is collected and systematically stored in various PPI databases, such as the molecular interaction database (MINT)13, the Human Protein Reference Database (HPRD)14, STRING15, the database of interacting proteins (DIP)16, the protein interaction database (IntAct)17, etc. These databases contain important information for thousands of interactions, which are regularly used in network based ‘omics’ data analysis. However, most of these interactions lack detailed structural information, which makes them therapeutically non-viable targets. Under this scenario, computational approaches capable of generating reliable models of protein complexes using protein–protein docking tools can play an important role in complementing the experimental initiatives. However, as these complexes are generated using predictive approaches, objective tests and evaluation tools are required to determine their reliability.

PPI interfaces have been studied extensively to analyze and understand the critical characteristic features that provide affinity, stability and specificity to the complexes. Properties like accessible surface area (ASA) and buried surface area (BSA), interface residue conservation, hydrogen bonds, and electrostatic and hydrophobic interactions play major roles in determining the nature of the protein interfaces. Interface characteristics have been studied and further used to identify protein interfaces or to predict binding specificity18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38. Similarly, numerous scoring functions and schemas were developed for improved prediction of protein–protein interfaces, of which only a few can be mentioned here due to space constraints39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74. Despite all these works, a gap still remains between the optimally scoring solutions and the biologically active complexes75,76,77,78.

Here, we attempt to utilize protein–protein interface properties to establish discernible differences between native-like protein complexes and non-native ones. We fed the interface properties to a support vector machine (SVM) based classification scheme and trained models to differentiate between native and non-native like complexes derived by protein–protein docking. Machine learning based techniques have been used previously to analyze and predict protein–protein interactions78,79,80,81,82,83,84,85,86. Our exhaustive testing and benchmarking suggest very good performance of our SVM models in distinguishing native and non-native like interfaces for homo and hetero complexes. We also applied this approach to validate protein interactions that are proven via experimental findings but for which the three-dimensional (3D) structure of the complexes and the subsequent interface(s) are yet to be discovered. Finally, we provide a web server platform, namely PCPIP, to predict whether the interacting interface of a given protein–protein dimer complex significantly resembles known protein interfaces.

Materials and methods

Collection and generation of non-redundant protein–protein complex dataset

A total of 989 protein–protein dimer complex structures were derived from the protein data bank (PDB)87 and categorized into homo (560) and hetero (429) dimers. An exhaustive redundancy check and filter was applied to these datasets using the CD-HIT88 and BLASTp89 programs so that no protein complex (both chains) is more than 40% identical to any other complex within the homo or hetero categories, respectively (Table S1, Figure S1). After the redundancy check, 371 homo and 346 hetero complexes were obtained, which were further filtered by successful docking and interface generation criteria (see later for docking and interface filters).

To validate our machine learning based classification system, we built a separate validation dataset for benchmarking in which both the dimer complex and the individual monomer structures are available separately (Apo-Holo validation set). This dataset of complex (holo) and non-complex (apo) forms of proteins was collected from a recent report by Viswanathan et al.90. Initially, this dataset contained 95 protein–protein hetero complexes (holo-complexes) and their respective monomer structures. These 95 complexes were screened to identify native and non-native like interfaces via the fraction of conserved native contacts (FNAT) based definition (please see later for details), and we could retrieve docking decoys that passed the FNAT filtration criteria for 32 such complexes (Table S1). The resulting FNAT dataset therefore contained 32 protein–protein hetero complexes (holo-complexes) and their respective monomer structures that are separately available as apo proteins (Table S1). Similarly, the 95 complexes were also screened to identify native and non-native interfaces via the interface root mean square deviation (iRMSD) based definition (please see later for details), where 68 complexes passed the iRMSD filtration criteria.

We have also created a negative dataset of 130 protein–protein complexes and subsequent interfaces for proteins that are not supposed to interact according to the Negatome database91,92 (Negatome validation set). Further details about the dataset collection are provided in the supplementary information file.

Protein and/or domain structures for which an experimentally validated protein–protein interaction is reported in the STRING database (version 10.5)15 were also collected individually from the PDB. Protein–protein interaction pairs for which individual monomer structures are available were selected randomly and were further utilized to generate probable dimer structures using protein–protein docking via the PatchDock program93. Docked model complexes for 32 such pairs (STRING dataset; ten for each complex; total: 320) were generated and further evaluated by our machine learning based protein–protein interface prediction algorithm.

Generation of native and non-native like protein–protein complexes and their interfaces

Known protein dimer complexes were utilized to create native and non-native like interfaces. Constituent monomers of the dimer complexes were separated and docked using the PatchDock protein–protein docking software, and the resultant docked solutions were screened to create native and non-native like complexes based on the following criteria. The fraction of conserved native contacts (FNAT) gives the fraction of the interface contacts of the original complex that are reproduced at the docked interface, whereas the interface root mean square deviation (iRMSD) compares the actual orientation of the interface forming residues between the docked and original complexes. Hence, FNAT and iRMSD complement each other in evaluating a predicted interface.

FNAT based categorization

Fraction of conserved native contacts (FNAT) is the number of native (correct) residue–residue contacts in the docked complex divided by the number of contacts in the original complex. FNAT thus reflects the overlap between the original and docked complex interfaces, and an FNAT value of 1.0 indicates complete overlap between the two. In our training and testing models, the original complex and the docked complexes with FNAT > 0.8 were regarded as true or native like complexes for each dimer, whereas false/non-native like complexes were identified using four separate FNAT thresholds: (a) FNAT ≤ 0.25 (highly distinguishable from the native like complexes), (b) FNAT > 0.25 and ≤ 0.5 (moderately distinguishable from the native like complexes), (c) FNAT > 0.5 and ≤ 0.8 (weakly distinguishable from the native like complexes), and (d) FNAT ≤ 0.8 (mixed).
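As a minimal sketch of the FNAT based categorization described above, the following Python snippet computes FNAT from two sets of residue–residue contacts and assigns the corresponding class. The contact representation (pairs of residue identifiers), the function names, and the category labels are illustrative assumptions rather than part of the original pipeline; how contacts are detected (for example, the distance cutoff) is deliberately left outside the sketch.

```python
from typing import Set, Tuple

# Hypothetical representation: (residue id on chain A, residue id on chain B)
Contact = Tuple[str, str]


def fnat(native: Set[Contact], docked: Set[Contact]) -> float:
    """Fraction of the native contacts that are also present in the docked pose."""
    if not native:
        raise ValueError("the original complex has no interface contacts")
    return len(native & docked) / len(native)


def fnat_category(value: float) -> str:
    """Map an FNAT value onto the thresholds given in the text."""
    if value > 0.8:
        return "native-like"
    if value > 0.5:
        return "non-native (weak)"      # 0.5 < FNAT <= 0.8
    if value > 0.25:
        return "non-native (moderate)"  # 0.25 < FNAT <= 0.5
    return "non-native (high)"          # FNAT <= 0.25
```

Under this reading, the mixed set (d) is simply the union of the three non-native categories, i.e. every pose for which `fnat_category` does not return `"native-like"`.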

As with the training–testing dataset, we used the same four FNAT thresholds to define the non-native like complexes for the benchmarking dataset (Apo-Holo validation set).

iRMSD based categorization

Interface root mean square deviation (iRMSD) is the root mean square deviation between the residues of both chains at the interface region. Similar to the FNAT based protocol, native and non-native sets were also identified based on iRMSD: the original complex along with one docked complex with iRMSD < 5 Å with respect to the original complex were regarded as native like complexes for each dimer, whereas non-native like complexes were identified with four different thresholds: (a) iRMSD ≥ 15 Å (highly distinguishable from the native like complexes), (b) iRMSD ≥ 10 Å and < 15 Å (moderately distinguishable from the native like complexes), (c) iRMSD ≥ 5 Å and < 10 Å (weakly distinguishable from the native like complexes), and (d) iRMSD ≥ 5 Å (mixed).
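The iRMSD thresholds can be illustrated in the same way. The sketch below is only an approximation of the idea: it assumes that the interface atoms of the docked and original complexes are already paired one-to-one and already superimposed, so the optimal superposition step that a full iRMSD calculation requires is omitted. Function names and category labels are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]  # x, y, z coordinates in angstroms


def rmsd(reference: Sequence[Point], model: Sequence[Point]) -> float:
    """RMSD over paired atoms; assumes both sets are already superimposed."""
    if len(reference) != len(model) or not reference:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(reference, model)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(reference))


def irmsd_category(value: float) -> str:
    """Map an iRMSD value (in angstroms) onto the thresholds given in the text."""
    if value < 5.0:
        return "native-like"
    if value < 10.0:
        return "non-native (weak)"      # 5 <= iRMSD < 10
    if value < 15.0:
        return "non-native (moderate)"  # 10 <= iRMSD < 15
    return "non-native (high)"          # iRMSD >= 15
```

Because the native bound is strict (iRMSD < 5 Å), a pose at exactly 5 Å falls into the weakly distinguishable non-native class.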

Native and non-native like complexes (Table S1 and Figure S2) categorized according to the criteria explained above with 1:5 ratio (2 native and 10 non-native samples, respectively) were further used for classification purpose.

Calculation of PPI interface properties

PISA software94,95 was used to calculate structural and chemical properties of the macromolecular interfaces such as accessible/buried surface area, free energy of dissociation, presence/absence of hydrogen bonds and salt bridges, etc. Further details are provided in the supplementary information file. Please check Table S2 for a full list of features that were used for machine learning based classification between native and non-native like protein–protein interaction complexes.

Classification of the PPI interfaces via SVM

Support vector machine (SVM) is a supervised learning model that classifies samples based on their features. Here, we used a radial basis function (RBF) kernel with a 100-fold randomized cross validation scheme: in each of 100 trials, a randomly selected 80% of the whole data was used to train the SVM model and the remaining 20% was used as test data. Performance was measured as the average over the hundred random trials on the test data. LibSVM96 was used to build the classifier models.
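The repeated 80%/20% splitting protocol can be sketched as follows. This is an illustration of the evaluation loop only: the SVM itself is replaced by a toy one-feature threshold classifier so that the example stays self-contained, and all names (`repeated_random_split`, `toy_threshold_classifier`, the `Sample` type, the seed) are assumptions made for this sketch rather than details of the original LibSVM setup.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical sample: (single toy feature, label: 1 = native-like, 0 = non-native)
Sample = Tuple[float, int]


def repeated_random_split(
    samples: Sequence[Sample],
    fit_and_score: Callable[[List[Sample], List[Sample]], float],
    n_trials: int = 100,
    train_fraction: float = 0.8,
    seed: int = 0,
) -> float:
    """Average test score over n_trials random train/test partitions."""
    rng = random.Random(seed)
    n_train = round(train_fraction * len(samples))
    scores = []
    for _ in range(n_trials):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        # First 80% trains the model, the remaining 20% is held out for testing.
        scores.append(fit_and_score(shuffled[:n_train], shuffled[n_train:]))
    return mean(scores)


def toy_threshold_classifier(train: List[Sample], test: List[Sample]) -> float:
    """Stand-in for the SVM: cut halfway between the two class means; returns accuracy."""
    native_mean = mean(x for x, y in train if y == 1)
    other_mean = mean(x for x, y in train if y == 0)
    cut = (native_mean + other_mean) / 2
    correct = sum((x > cut) == (y == 1) for x, y in test)
    return correct / len(test)
```

In practice `fit_and_score` would train an RBF-kernel SVM on the training partition and return whichever test metric is being averaged.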

For both FNAT and iRMSD based categorizations, 100 fold randomized selection of the training (80% data) and testing (20% data) samples followed by SVM based classification trials were performed for each of the native and non-native complex threshold criteria. Figure S3 provides an overview of the various training and testing SVM runs employed in this study using multiple categories of native and non-native like protein–protein complexes, while Table S3 provides the kernel function and other relevant parameters of the various SVM models.

Benchmarking and comparison of performance

We used a separate set of complexes to check the efficacy of our SVM models in correctly identifying native and non-native complexes. The previously described native and non-native complexes within the Apo-Holo validation set were tested against the 100 SVM models built with 80% training data. Based on the definition of native and non-native like complexes using FNAT and iRMSD categorizations, each set containing highly, moderately, weakly, and mixed distinguishable non-native like complexes was tested against the corresponding 100 models of the training data sets. Average sensitivity, specificity, precision, and F1 score from the 100 runs were recorded.

The separate datasets of Apo-Holo heterodimers using the mixed distinguishable non-native like complex thresholds [FNAT-native (FNAT > 0.8): 64 and non-native (FNAT ≤ 0.8): 320; iRMSD-native (iRMSD < 5 Å): 134 and non-native (iRMSD ≥ 5 Å): 680] (Table S1) were used as input to the CCharPPI online server97, where 10 different composite scoring functions such as ZRANK, ZRANK2, ROSETTADOCK, PYDOCK, FIREDOCK, PISA score, CP_PIE, and SIPPER were applied to each complex and the corresponding scores were obtained. An individual receiver operating characteristic (ROC) plot was created for each scoring function by calculating the true positive rate (TPR; sensitivity) and false positive rate (FPR; 1-specificity).
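For illustration, a ROC curve of this kind can be built from per-complex scores as sketched below. The snippet assumes that a higher score means "more native-like"; for energy-like scoring functions where lower is better, the scores would have to be negated first. The function names are hypothetical, and the area under the curve is obtained with the trapezoidal rule.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points; labels are 1 for native-like, 0 for non-native."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Process tied scores together so that ties yield a single ROC point.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

Each lowered score threshold adds one point to the curve, so the same routine can be applied unchanged to classifier probabilities and to docking scores once they share the "higher is better" orientation.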

Additionally, 130 protein pairs extracted from the Negatome database91,92 were subjected to PatchDock to generate biologically non-feasible protein complexes. Five docked decoy complexes were collected for each query pair following the criteria described in the supplementary information file. This dataset (Negatome validation set) was used to validate the false prediction rate of the SVM classifier.

In the absence of a reference complex structure, we selected the top ten docking solutions according to the PatchDock docking score for each of the 32 protein–protein interactions listed within the STRING dataset. These solutions were further subjected to interface feature generation followed by SVM prediction using the heterodimer training model.

Standard performance metrics such as sensitivity, specificity, precision, F1 score, Matthews correlation coefficient (MCC), etc. were calculated along with the ROC statistics to measure the performance of our models in different scenarios. Please see the supplementary information file for more details.
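For reference, the listed metrics follow directly from the confusion matrix counts. The sketch below uses the standard textbook definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division; returns 0.0 when the denominator is zero (a convention chosen here)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics; 'positive' means predicted or actual native-like."""
    sensitivity = _ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = _ratio(tn, tn + fp)   # true negative rate
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }
```

With imbalanced classes, as in a 1:5 native to non-native ratio, MCC and F1 are generally more informative than plain accuracy because they are not dominated by the majority class.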

Development of the webserver

We have developed a web based server named “PCPIP (Protein Complex Prediction by Interface Properties)”, in which the classification and prediction schemes are embedded within a web module. Given a protein–protein complex, the PCPIP server predicts whether the interacting interface significantly resembles known protein interfaces. The server is available via http://www.hpppi.iicb.res.in/pcpip/ and is developed on a PHP and CGI-PERL platform.

The server has two input options, single and batch mode. In the single mode option, a single protein–protein complex (homo or hetero dimer) file saved in standard PDB format can be uploaded to check whether the interacting interface, if there is any, resembles the interfaces extracted from the native protein–protein complexes.

SVM models for both homo and hetero dimers, along with their interface features, are kept as background search models against which the uploaded protein complex can be tested based on FNAT and iRMSD criteria.

Results

Protein–protein interface properties to differentiate between native and non-native like interfaces

PPI interface properties that showed statistically significant (p ≤ 0.01) differences between the native and non-native like complexes, categorized either by FNAT or iRMSD criteria, were compared (Fig. 1A). 62 such features were common to homo and heterodimers. A large fraction of the distinguishable features represent the accessible surface area (ASA) of amino acids located at the native and non-native like interfaces (Fig. 1B and Figure S4). However, only phenylalanine (PHE), tyrosine (TYR), and isoleucine (ILE) possess significantly higher buried surface area (BSA) in native interfaces, whereas lysine (LYS) possesses significantly lower BSA in native interfaces (Fig. 1C,D). Hydrogen bonds between aspartate-arginine (ASP-ARG) and glutamate-arginine (GLU-ARG) were found to be significantly more frequent in native interfaces (Fig. 1E,F), and all the native interfaces were found to be significantly more stable with respect to binding energy (Fig. 1G,H). The abundance of negatively charged amino acids (ASP and GLU), serine (SER), threonine (THR) and cysteine (CYS) is relatively lower at the native interfaces of hetero complexes with respect to non-native interfaces, whereas positively charged ARG and HIS are relatively more abundant at the native interfaces of homo complexes (Figure S5). In addition, frequencies of ALA, GLY, PRO, ASN, GLN, LYS, HIS, PHE, TRP, and TYR are also found to be different between native and non-native interfaces (Figure S5). These observations indicate the presence of discernible differences between native and non-native interfaces and advocate the utilization of the interface features to classify and predict native PPI interfaces.

Figure 1

Comparison of protein–protein interaction interface properties. (A) The overlap among interface properties that showed statistically significant (p ≤ 0.01) differences between the native and non-native like complexes, categorized either by FNAT or iRMSD criteria. HETERO_FNAT and HETERO_iRMSD provide the numbers of significantly different interface properties between native and non-native like complexes for heterodimers, while HOMO_FNAT and HOMO_iRMSD provide the corresponding numbers for homodimers. FNAT, fraction of conserved native contacts. iRMSD, interface root mean square deviation. (B) The distribution of the common interface properties that showed statistically significant (p ≤ 0.01) differences between all the native and non-native like complexes. ASA, accessible surface area. BSA, buried surface area. H-bond, hydrogen bonds. (C,D) plot the buried surface area (BSA) of the two amino acids that possessed significantly different BSA at the native interfaces compared to the non-native ones identified based on FNAT (C) and iRMSD (D) definitions, respectively. (E,F) show the hydrogen bond forming amino acid pairs that are found to be significantly more frequent at the native interfaces compared to the non-native ones identified based on FNAT (E) and iRMSD (F) definitions, respectively. (G,H) plot the average binding energy represented by ΔG for the native and non-native interfaces identified based on FNAT (G) and iRMSD (H) definitions, respectively.

Classification of protein–protein interaction interfaces via SVM

Support vector machine (SVM) was used for classification with the 100-fold cross validation approach. Table 1 provides mean test accuracies for homodimer and heterodimer complexes, respectively. 100 fold randomized selection of the training (80% data) and testing (20% data) samples followed by SVM based classification trials were performed for each of the native and non-native complex threshold criteria. Classifications were performed with different training models where non-native like interfaces were selected with various FNAT thresholds with respect to the original complex: (a) FNAT ≤ 0.25 (highly distinguishable from the native like complexes), (b) FNAT > 0.25 and ≤ 0.5 (moderately distinguishable from the native like complexes), and (c) FNAT > 0.5 and ≤ 0.8 (weakly distinguishable from the native like complexes). Similar 100 fold classification was also performed using native and non-native like complexes defined by iRMSD, where non-native like complexes were identified with three thresholds: (a) iRMSD ≥ 15 Å (highly distinguishable from the native like complexes), (b) iRMSD ≥ 10 Å and < 15 Å (moderately distinguishable from the native like complexes), and (c) iRMSD ≥ 5 Å and < 10 Å (weakly distinguishable from the native like complexes).

Table 1 Mean test and train accuracies for prediction of protein–protein interaction interfaces.

Our SVM models performed well with all categories of non-native like protein–protein complexes defined by both FNAT and iRMSD criteria (Table 1). Using both criteria adds reliability to the evaluation of PPI interfaces.

Benchmarking and comparative validation

As mentioned before, for benchmarking and validation we used the Apo-Holo validation dataset comprising 32 (FNAT) and 68 (iRMSD) dimer complexes for which individual monomer structures are also available separately. Monomers were docked using the PatchDock protein–protein docking software93, and native and non-native like complexes were generated using various ranges of FNAT and iRMSD based criteria. The original complex and all docking solutions with FNAT > 0.8 were regarded as native like complexes, whereas solutions with FNAT ≤ 0.25, FNAT > 0.25 and ≤ 0.5, FNAT > 0.5 and ≤ 0.8, and FNAT ≤ 0.8, respectively, were considered as non-native like complexes. Similarly, the original complex and all docking solutions with iRMSD < 5 Å were regarded as native like complexes, whereas solutions with iRMSD ≥ 15 Å, iRMSD ≥ 10 Å and < 15 Å, iRMSD ≥ 5 Å and < 10 Å, and iRMSD ≥ 5 Å, respectively, were considered as non-native like complexes. Native and non-native like complexes from these validation datasets were mixed together and the classifier was asked to differentiate the native and non-native ones correctly based on their interface features. Performances of the classifiers were measured using standard parameters. Table 2 shows the performance measure values estimated for a range of probability thresholds (0.50–0.95). The results indicate good performance of the individual models.

Table 2 Benchmarking results for Apo-Holo datasets using the Hetero FNAT and iRMSD models.

The datasets shown in Table 2 were submitted to the CCharPPI online server97, where 10 different composite scoring functions were applied to each complex and the corresponding scores were obtained. ROC plots were created (Fig. 2) by calculating the true positive rate (TPR) and false positive rate (FPR; 1-specificity). Figure 2 demonstrates the better performance of our FNAT (PCPIP_FNAT) and iRMSD (PCPIP_iRMSD) based prediction models in predicting complexes correctly. Hence, on this apparently difficult dataset our method performed much better than the other methods.

Figure 2

Comparison of prediction performances. The prediction performances of the SVM based prediction models (PCPIP_FNAT and PCPIP_iRMSD) for native and non-native like complexes from the Apo-Holo dataset were compared against 10 different types of scoring functions. Receiver operating characteristic (ROC) plots were created by calculating the true positive rate (TPR; Y axes) and false positive rate (FPR; X axes). PCPIP stands for Protein Complex Prediction by Interface Properties. Area under curve (AUC) values for each of the methods are also provided. Benchmarking was performed using the FNAT (A) and iRMSD (B) definition based sub-datasets from the Apo-Holo validation set, where native-like complexes were defined by FNAT > 0.8 and iRMSD < 5 Å, respectively, and non-native like complexes were identified using FNAT ≤ 0.8 and iRMSD ≥ 15 Å, respectively.

Further, the accuracy of all 100 randomly generated training models was tested using a dataset of complexes that are not supposed to form physiologically. Analyzing the top 25 cases, it is evident that at all probability thresholds (0.5–0.95) more than 90% of the Negatome complexes were predicted as false by the FNAT model, whereas the iRMSD model requires a probability threshold of 0.85 to reach 90% accuracy (Fig. 3).
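The threshold sweep behind this check can be sketched as follows. The snippet assumes that a complex is called native-like when its predicted probability is at or above the threshold, and that thresholds are sampled in steps of 0.05; the step size, the function names, and the decision rule are assumptions of this example, since the text only gives the range 0.5–0.95.

```python
from typing import Dict, Sequence


def rejection_rate(native_probabilities: Sequence[float], threshold: float) -> float:
    """Fraction of (truly non-interacting) complexes predicted as non-native."""
    if not native_probabilities:
        raise ValueError("no predictions supplied")
    rejected = sum(1 for p in native_probabilities if p < threshold)
    return rejected / len(native_probabilities)


def sweep_thresholds(native_probabilities: Sequence[float]) -> Dict[float, float]:
    """Rejection rate at 0.50, 0.55, ..., 0.95 (assumed step of 0.05)."""
    thresholds = [t / 100 for t in range(50, 100, 5)]
    return {t: rejection_rate(native_probabilities, t) for t in thresholds}
```

Because every complex in a negative set is non-native by construction, the rejection rate equals the accuracy on that set, and it can only stay the same or increase as the threshold is raised.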

Figure 3

Verification of prediction accuracy. The percentage of correctly predicted non-native hetero complexes extracted from the Negatome dataset is plotted for both FNAT and iRMSD definitions. Accuracies are plotted as bar diagrams for each probability threshold cutoff, marked by different colors.

Applications of the prediction algorithm

We collected 32 STRING suggested, experimentally verified protein–protein interactions, and the complexes were modeled using the PatchDock based protein docking algorithm, where monomer structures/domains were collected from the PDB87. The docking solution with the highest probability threshold score was considered as the top ranked prediction based on the FNAT and iRMSD models, respectively. Distributions of probability score thresholds within the FNAT and iRMSD based top ranked predictions and the PatchDock based top solutions are plotted (Fig. 4A,B), which suggest that solutions ranked by docking score are less likely to contain the native like complexes than those selected by FNAT and iRMSD based prediction. Out of the 320 interfaces (10 solutions for each complex), 12 interfaces were commonly predicted by FNAT and iRMSD models with highest reliability (probability threshold ≥ 0.95). ΔGs of binding for these predicted complexes are comparable with those of known heterodimer complexes (Fig. 4C), indicating reliability of the predicted poses. Out of these 12 predicted complexes we showcase three complexes formed by glyceraldehyde 3-phosphate dehydrogenase (GAPDH) with phosphoglycerate kinase (PGK1), enolase 1 (ENO1) and triose-phosphate isomerase (TIM), respectively (Fig. 4D–F). Figure S6 provides the mode of interaction and the interface parameters for all the 12 complexes that were commonly predicted by FNAT and iRMSD models with highest reliability (probability threshold ≥ 0.95).
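The selection logic in this paragraph, picking the top ranked pose per complex and keeping poses on which both models agree at probability ≥ 0.95, can be sketched as below. The data layout (dictionaries keyed by complex name and docking solution index) and the function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (complex name, docking solution index)
PoseId = Tuple[str, int]


def top_ranked(probabilities: Dict[PoseId, float]) -> Dict[str, PoseId]:
    """For each complex, return the pose with the highest predicted probability."""
    best: Dict[str, PoseId] = {}
    for pose, probability in probabilities.items():
        name = pose[0]
        if name not in best or probability > probabilities[best[name]]:
            best[name] = pose
    return best


def consensus_poses(
    fnat_prob: Dict[PoseId, float],
    irmsd_prob: Dict[PoseId, float],
    cutoff: float = 0.95,
) -> List[PoseId]:
    """Poses that both models call native-like at or above the cutoff."""
    return sorted(
        pose
        for pose, probability in fnat_prob.items()
        if probability >= cutoff and irmsd_prob.get(pose, 0.0) >= cutoff
    )
```

Requiring agreement between the two models is stricter than using either alone, which is consistent with only a small subset of the docked interfaces passing this filter.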

Figure 4

Prediction of probable interaction surface. (A,B) The frequency of the probability threshold scores within the FNAT (A) and iRMSD (B) based top ranked solutions in comparison with the same derived from PatchDock based top ranked solutions. (C) Box plot representation of the binding energy of the protein–protein interaction interface (represented via ΔG) of the 12 docked complexes that were commonly predicted by both FNAT and iRMSD models with highest reliability (probability threshold ≥ 0.95), along with the same obtained from the known 3D structures of the heterodimer complexes. ΔGs of the three representative complexes GAPDH-PGK1, GAPDH-ENO1, and GAPDH-TIM are also plotted. (D–F) show the 3D cartoon representations of the complexes, where GAPDH is shown in cyan and PGK1, ENO1, and TIM are shown in purple, orange, and blue, respectively.

A web based server platform, namely Protein Complex Prediction by Interface Properties (PCPIP), was developed to predict whether the interacting interface of a given protein–protein dimer complex significantly resembles known protein interfaces. PCPIP predicts whether the submitted interface(s) is likely to be native like or not. This prediction server would be particularly useful in identifying correct docking poses out of the numerous solutions that standard protein docking programs offer. Figure S7 provides a snapshot of the input and output options of the PCPIP server.

Discussion

Protein–protein interactions (PPI) are crucial for intra and inter-cellular functions and inter-molecular connectivity. Due to significant improvements in experimental techniques, large numbers of protein structures are now available. Similarly, improved high-throughput methods like the yeast two-hybrid system (Y2H), mass spectrometry (MS), and tandem affinity purification (TAP) have identified numerous PPIs that were previously unknown. However, these efforts are expensive and time consuming, and have covered only a small portion of the complete PPI networks. Hence, the need has increased for computational techniques to augment experimentally identified PPIs and provide a larger repertoire of cellular PPIs. Another daunting challenge is to physically construct and map these large numbers of PPI complexes and identify the mode of interaction. Computationally generated PPI complexes could be quite useful and may expedite the experiments that are required to validate the binding interface and critical residues for the interactions. However, these predictions are generally error prone and therefore need to be validated very carefully. Even though various protein–protein docking programs are available, methods for systematic evaluation of the predicted PPI complexes are limited.

In this work, we studied known protein–protein interface properties and applied the knowledge of native PPI interface properties to a standard machine learning technique, the support vector machine (SVM), to delineate native-like complexes from non-native like complexes. Interestingly, this simple approach turned out to be quite effective, as suggested by the very good performance metrics of our SVM models in distinguishing native and non-native like interfaces for homo and hetero complexes. Our exhaustive testing and benchmarking exercises using a completely non-redundant training–testing dataset and various degrees of distinguishing thresholds between native and non-native like complexes suggest a very high accuracy of the models. The categorizations of non-native instances were implemented using strict, moderate and lenient definitions to consider many intermediate docking models with partial overlap with native interfaces. The performance of the models demonstrates the high efficiency of the approach in distinguishing native like complexes from non-native like complexes having high, moderate and low overlap with the actual interfaces (Table 1).

We also evaluated the performance of our method in a further validation test, in which it had to differentiate native like complexes from non-native complexes prepared from individually solved monomer structures of known complex structures. This apparently difficult dataset rules out the possibility that the monomer structures are primed to form native complexes when subjected to protein docking approaches. In this Apo-Holo validation test, our method performs reasonably well, especially for the highly distinguished native and non-native like complexes (Table 2). Moreover, comparative analysis using this dataset shows much better performance with respect to other available methods (Fig. 2).

We applied this approach in a real scenario where protein interaction information is proven via experimental findings but the three-dimensional (3D) structure of the complexes and the subsequent interface(s) are yet to be discovered. We generated such complexes using 32 high-confidence STRING protein–protein interactions and identified the most likely interaction modes for 12 complexes. Such filtered structural models could be very useful for designing subsequent experiments to validate the actual mode of interaction even without attempting to solve the entire complex structures.

Finally, we converted the methodology into a user-friendly web server platform, namely PCPIP, to predict whether the interacting interface of a given protein–protein dimer complex significantly resembles known protein interfaces. We believe that this resource could be a useful tool for biologists to evaluate protein–protein docking derived results and gain helpful knowledge to design confirmatory experiments.