Classification and prediction of protein–protein interaction interface using machine learning algorithm

Structural insight into the protein–protein interaction (PPI) interface can provide knowledge about the kinetics, thermodynamics and molecular functions of the complex, while elucidating its role in diseases and enabling it as a potential therapeutic target. However, because experimental determination of protein–protein complex structures lags behind, three-dimensional (3D) knowledge of PPI interfaces can be gained via computational approaches such as molecular docking and post-docking analyses. Despite the development of numerous docking tools and techniques, success in identifying native-like interfaces based on docking score functions is limited. Hence, we carried out an in-depth investigation of the structural features of the interface that might successfully delineate native complexes from non-native ones. We identify interface properties that show statistically significant differences between native and non-native interfaces of homo and hetero protein–protein complexes. Utilizing these properties, a support vector machine (SVM) based classification scheme has been implemented to differentiate native-like and non-native-like complexes generated as docking decoys. Benchmarking and comparative analyses suggest very good performance of our SVM classifiers. Further, protein interactions that are proven experimentally but not resolved structurally were subjected to this approach: 3D models of the complexes were generated and the most likely interfaces were predicted. A web server called Protein Complex Prediction by Interface Properties (PCPIP) has been developed to predict whether the interface of a given protein–protein dimer complex resembles known protein interfaces. The server is freely available at http://www.hpppi.iicb.res.in/pcpip/.

Classification of the PPI interfaces via SVM. A support vector machine (SVM) is a supervised learning model that classifies samples based on their features. Here, we used a radial basis function (RBF) kernel with a 100-fold randomized cross-validation scheme: in each of 100 trials, a randomly selected 80% of the whole data was used to train the SVM model and the remaining 20% was used as test data. Performance on the test data was measured as the average over the hundred random trials. LibSVM 96 was used to build the classifier models.
For both FNAT and iRMSD based categorizations, the 100 randomized selections of training (80% of the data) and testing (20% of the data) sets, each followed by an SVM-based classification trial, were performed for each of the native and non-native complex threshold criteria. Figure S3 provides an overview of the various training and testing SVM runs employed in this study using multiple categories of native and non-native like protein-protein complexes, while Table S3 provides the kernel function and other relevant parameters of the various SVM models.
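The resampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `repeated_holdout`, `nearest_centroid_accuracy` and `make_toy_data`, the two-feature toy data, and the random seeds are all hypothetical, and a simple nearest-centroid rule stands in for the LibSVM RBF classifier so that the example runs with the Python standard library alone.

```python
import random
from statistics import mean


def repeated_holdout(samples, labels, train_and_score,
                     n_trials=100, train_frac=0.8, seed=0):
    """Average a test score over n_trials random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_train = round(train_frac * len(indices))
    scores = []
    for _ in range(n_trials):
        rng.shuffle(indices)  # a fresh random 80/20 partition in every trial
        train, test = indices[:n_train], indices[n_train:]
        scores.append(train_and_score(
            [samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test]))
    return mean(scores), scores


def nearest_centroid_accuracy(x_train, y_train, x_test, y_test):
    """Placeholder classifier (NOT the LibSVM RBF model); returns test accuracy."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, y in zip(x_train, y_train) if y == label]
        centroids[label] = [mean(column) for column in zip(*rows)]

    def predict(x):
        # Assign the label whose centroid is closest in squared distance.
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[c])))

    hits = sum(predict(x) == y for x, y in zip(x_test, y_test))
    return hits / len(y_test)


def make_toy_data(n_per_class=50, seed=1):
    """Hypothetical two-feature data; real inputs would be interface features."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label, center in (("native", 1.0), ("non-native", -1.0)):
        for _ in range(n_per_class):
            samples.append([rng.gauss(center, 0.3), rng.gauss(center, 0.3)])
            labels.append(label)
    return samples, labels


samples, labels = make_toy_data()
mean_accuracy, per_trial = repeated_holdout(
    samples, labels, nearest_centroid_accuracy)
print(f"mean test accuracy over {len(per_trial)} trials: {mean_accuracy:.3f}")
```

Because each trial draws its own random partition, the same complex can appear in the test set of several trials; this differs from conventional k-fold cross-validation, where every sample is tested exactly once.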
Benchmarking and comparison of performance. We used a separate set of complexes to check the efficacy of our SVM models in correctly identifying native and non-native complexes. The previously described native and non-native complexes within the Apo-Holo validation set were tested against the 100 SVM models built with 80% training data. Based on the FNAT and iRMSD definitions of native and non-native like complexes, each set containing highly, moderately, weakly, or mixed distinguishable non-native like complexes together with native like complexes was tested against the corresponding 100 models of the training data sets. The average sensitivity, specificity, precision, and F1 score over the 100 runs were recorded.
Additionally, 130 protein pairs extracted from the Negatome database 91,92 were subjected to PatchDock to generate biologically non-feasible protein complexes. Five docked decoy complexes were collected for each query pair following the criteria described in the supplementary information file. This dataset (Negatome validation set) was used to assess the false prediction rate of the SVM classifier.
In the absence of reference complex structures, we selected the top ten docking solutions according to the PatchDock docking score for each of the 31 protein-protein interactions listed within the STRING dataset. These solutions were then subjected to interface feature generation followed by SVM prediction using the heterodimer training model.
Standard performance metrics such as sensitivity, specificity, precision, F1 score, and Matthews correlation coefficient (MCC) were calculated along with ROC statistics to measure the performance of our models in different scenarios. Please see the supplementary information file for more details.
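The metrics named above follow from the four confusion-matrix counts using their standard definitions. The sketch below is illustrative only: the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study, and returning 0.0 for an undefined ratio is a convention chosen here for convenience.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics; native-like complexes are the positive class."""
    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when a metric is undefined.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity or precision alone, MCC uses all four counts, which makes it informative when native-like and non-native-like decoys are present in unequal numbers.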
Development of the webserver. We have developed a web-based server named "PCPIP (Protein Complex Prediction by Interface Properties)", in which the classification and prediction schemes are embedded within a web module. Given a protein-protein complex, the PCPIP server predicts whether the interacting interface significantly resembles known protein interfaces. The server is available at http://www.hpppi.iicb.res.in/pcpip/ and is developed on a PHP and CGI-PERL platform.
The server has two input options: single mode and batch mode. In the single mode, a single protein-protein complex (homo or hetero dimer) file saved in standard PDB format can be uploaded to check whether the interacting interface, if there is any, resembles the interfaces extracted from native protein-protein complexes.
SVM models for both homo and hetero dimers, along with their interface features, are kept as background search models against which the uploaded protein complex can be tested based on the FNAT and iRMSD criteria.

Results
Protein-protein interface properties to differentiate between native and non-native like interfaces. PPI interface properties that showed statistically significant (p ≤ 0.01) differences between the native and non-native like complexes, categorized by either FNAT or iRMSD criteria, were compared (Fig. 1A). Sixty-two such features were common to homodimers and heterodimers. A large fraction of the distinguishing features represent the accessible surface area (ASA) of amino acids located at the native and non-native like interfaces (Fig. 1B and Figure S4). However, only phenylalanine (PHE), tyrosine (TYR), and isoleucine (ILE) possess significantly higher buried surface area (BSA) in native interfaces, whereas lysine (LYS) possesses significantly lower BSA in native interfaces (Fig. 1C,D). Hydrogen bonds between aspartate-arginine (ASP-ARG) and glutamate-arginine (GLU-ARG) were found to be significantly more frequent in native interfaces (Fig. 1E,F), and all the native interfaces were found to be significantly more stable with respect to binding energy (Fig. 1G,H). The abundance of negatively charged amino acids (ASP and GLU), serine (SER), threonine (THR) and cysteine (CYS) is relatively lower at the native interfaces of hetero complexes than at non-native interfaces, whereas the abundance of positively charged ARG and HIS is relatively higher at the native interfaces of homo complexes (Figure S5). In addition, the frequencies of ALA, GLY, PRO, ASN, GLN, LYS, HIS, PHE, TRP, and TYR are also found to differ between native and non-native interfaces (Figure S5). These observations indicate the presence of discernible differences between native and non-native interfaces and advocate the utilization of the interface features to classify and predict native PPI interfaces.

Classification of protein-protein interaction interfaces via SVM. A support vector machine (SVM) was used for classification with a 100-fold randomized cross-validation approach. Table 1 provides the mean test accuracies for homodimer and heterodimer complexes. For each native and non-native complex threshold criterion, 100 randomized selections of training (80% of the data) and testing (20% of the data) samples were made, each followed by an SVM-based classification trial. Classifications were performed with different training models in which the non-native like interfaces were selected, with respect to the original complex, using three FNAT thresholds: (a) FNAT ≤ 0.25 (highly distinguishable from the native like complexes), (b) FNAT ≥ 0.25 and ≤ 0.5 (moderately distinguishable from the native like complexes), and (c) FNAT ≥ 0.5 and ≤ 0.8 (weakly distinguishable from the native like complexes). A similar 100-fold classification was also performed using native and non-native like complexes defined by iRMSD, where the non-native like complexes were identified with three thresholds: (a) iRMSD ≥ 15 Å (highly distinguishable from the native like complexes), (b) iRMSD ≥ 10 Å and ≤ 15 Å (moderately distinguishable from the native like complexes), and (c) iRMSD ≥ 5 Å and ≤ 10 Å (weakly distinguishable from the native like complexes). Our SVM models performed well with all categories of non-native like protein-protein complexes defined by both FNAT and iRMSD (Table 1). Using both criteria adds reliability to the evaluation of PPI interfaces.
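The threshold categories above can be written as two small labelling functions. This is a sketch under stated assumptions rather than the authors' code: the function names and label strings are hypothetical, the native-like cut-offs (FNAT ≥ 0.8, iRMSD < 5 Å) are the ones given in the benchmarking description, and because the published ranges share their end points, the assignment of boundary values (0.25 and 0.5 for FNAT, 10 Å and 15 Å for iRMSD) is a choice made here.

```python
def fnat_category(fnat):
    """Label a docking solution by its fraction of native contacts (FNAT)."""
    if not 0.0 <= fnat <= 1.0:
        raise ValueError("FNAT must lie between 0 and 1")
    if fnat >= 0.8:
        return "native-like"
    if fnat <= 0.25:   # assumption: the shared boundary 0.25 counts as 'highly'
        return "highly distinguishable"
    if fnat <= 0.5:    # assumption: the shared boundary 0.5 counts as 'moderately'
        return "moderately distinguishable"
    return "weakly distinguishable"


def irmsd_category(irmsd):
    """Label a docking solution by its interface RMSD in angstroms."""
    if irmsd < 0.0:
        raise ValueError("iRMSD cannot be negative")
    if irmsd < 5.0:
        return "native-like"
    if irmsd >= 15.0:  # assumption: the shared boundary 15 counts as 'highly'
        return "highly distinguishable"
    if irmsd >= 10.0:  # assumption: the shared boundary 10 counts as 'moderately'
        return "moderately distinguishable"
    return "weakly distinguishable"


# Illustrative values only.
for fnat, irmsd in [(0.92, 1.8), (0.60, 7.5), (0.40, 12.0), (0.05, 22.4)]:
    print(fnat, fnat_category(fnat), "|", irmsd, irmsd_category(irmsd))
```

Note that the two measures run in opposite directions: a high FNAT and a low iRMSD both indicate a pose close to the original complex.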
Benchmarking and comparative validation. As mentioned before, for benchmarking and validation we used the Apo-Holo validation dataset comprising 32 (FNAT) and 68 (iRMSD) dimer complexes for which the individual monomer structures are also available separately. Monomers were docked using the PatchDock protein-protein docking software 93, and native and non-native like complexes were generated using various ranges of FNAT and iRMSD based criteria. The original complex and all docking solutions with FNAT ≥ 0.8 were regarded as native like complexes, whereas solutions with FNAT ≤ 0.25, FNAT ≥ 0.25 and ≤ 0.5, FNAT ≥ 0.50 and ≤ 0.80, and FNAT < 0.80, respectively, were considered as non-native like complexes. Similarly, the original complex and all docking solutions with iRMSD < 5 Å were regarded as native like complexes, whereas solutions with iRMSD ≥ 15 Å, iRMSD ≥ 10 Å and ≤ 15 Å, and iRMSD ≥ 5 Å and ≤ 10 Å, respectively, were considered as non-native like complexes. Native and non-native like complexes from these validation datasets were mixed together and the classifier was asked to differentiate the native and non-native ones correctly based on their interface features. The performance of the classifiers was measured using standard parameters. Table 2 shows the performance measures estimated over a range of probability thresholds (0.50-0.95) and indicates that the individual models perform well.
The datasets shown in Table 2 were submitted to the CCharPPI online server 97, where 10 different composite scoring functions were applied to each complex and the corresponding scores were obtained. ROC plots were created (Fig. 2) by calculating the true positive rate (TPR) and false positive rate (FPR; 1 - specificity). Figure 2 demonstrates much better performance of our FNAT (PCPIP_FNAT) and iRMSD (PCPIP_iRMSD) based prediction models in predicting complexes correctly. Hence, on these apparently difficult datasets our method performed much better than the other methods.
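An ROC curve of the kind described above is obtained by sweeping a decision threshold over the scores and recording the (FPR, TPR) pair at each step. The following is a minimal sketch, assuming that a higher score means "more native-like"; the function names `roc_points` and `auc` and the six example scores are hypothetical and unrelated to the data behind Fig. 2.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points; labels are 1 (native-like) or 0 (non-native)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are required to draw an ROC curve")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for six decoys, three of them native-like.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
example_labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(example_scores, example_labels)
print(curve)
print(f"AUC = {auc(curve):.3f}")
```

Scoring functions for which a lower value indicates a better pose would need their sign flipped before being passed to `roc_points`.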
Further, the accuracy of all 100 randomly generated training models was tested using a dataset of complexes that are not expected to form physiologically. Analyzing the top 25 cases, it is evident that at all probability thresholds (0.5-0.95) more than 90% of the Negatome complexes were predicted as false by the FNAT model, whereas the iRMSD model required a probability threshold of 0.85 to reach 90% accuracy (Fig. 3).
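The threshold dependence reported here can be illustrated with a short sketch. It assumes that a decoy is "predicted as false" when its predicted probability of being native-like falls below the chosen threshold; that reading, the function name `rejection_rate`, and the four example probabilities are assumptions made for illustration and are not taken from the Negatome results.

```python
def rejection_rate(native_probabilities, threshold):
    """Fraction of decoys whose native-like probability is below the threshold."""
    if not native_probabilities:
        raise ValueError("at least one prediction is required")
    rejected = sum(p < threshold for p in native_probabilities)
    return rejected / len(native_probabilities)


# Probability thresholds 0.50, 0.55, ..., 0.95, matching the range in the text.
thresholds = [round(0.50 + 0.05 * k, 2) for k in range(10)]

# Hypothetical native-like probabilities for four Negatome-style decoys.
decoy_probabilities = [0.10, 0.20, 0.55, 0.90]
for t in thresholds:
    print(f"threshold {t:.2f}: {rejection_rate(decoy_probabilities, t):.2f}")
```

Since every complex in such a set is a true negative, the rejection rate at a given threshold equals the specificity of the classifier on that set.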
Applications of the prediction algorithm. We collected 32 STRING-suggested, experimentally verified protein-protein interactions and the complexes were modeled using the PatchDock based protein docking algorithm 87. The docking solution with the highest probability threshold score was considered as the top ranked prediction based on the FNAT and iRMSD models, respectively. The distributions of probability score thresholds within the FNAT and iRMSD based top ranked predictions and the PatchDock based top solutions are plotted (Fig. 4A,B); they suggest that docking score based ranked solutions are less likely to contain native like complexes than those selected by FNAT and iRMSD based prediction. Out of the 320 interfaces (10 solutions for each complex), 12 interfaces were commonly predicted by the FNAT and iRMSD models with the highest reliability (probability threshold ≥ 0.95). The ΔGs of binding for these predicted complexes are comparable with those of known heterodimer complexes (Fig. 4C), indicating reliability of the predicted poses. Out of these 12 predicted complexes we showcase three complexes formed by glyceraldehyde 3-phosphate dehydrogenase (GAPDH) with phosphoglycerate kinase (PGK1), enolase 1 (ENO1) and triose-phosphate isomerase (TIM), respectively (Fig. 4D,F). Figure S6 provides the mode of interaction and the interface parameters for all 12 complexes that were commonly predicted by the FNAT and iRMSD models with the highest reliability (probability threshold ≥ 0.95).
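The selection logic in this paragraph (take the highest-probability pose per complex for each model, then keep poses on which both models agree at ≥ 0.95) can be sketched as below. The record layout, the field names `complex`, `patchdock_rank`, `p_fnat` and `p_irmsd`, the function names, and all numbers are hypothetical placeholders; they are not the study's data.

```python
def top_by_model(solutions, score_key):
    """For each complex, keep the docking solution with the highest score."""
    best = {}
    for solution in solutions:
        current = best.get(solution["complex"])
        if current is None or solution[score_key] > current[score_key]:
            best[solution["complex"]] = solution
    return best


def consensus_predictions(solutions, cutoff=0.95):
    """Solutions that both models score at or above the cutoff."""
    return [s for s in solutions
            if s["p_fnat"] >= cutoff and s["p_irmsd"] >= cutoff]


# Hypothetical docking solutions for two complexes (placeholder names).
solutions = [
    {"complex": "A-B", "patchdock_rank": 1, "p_fnat": 0.62, "p_irmsd": 0.58},
    {"complex": "A-B", "patchdock_rank": 2, "p_fnat": 0.97, "p_irmsd": 0.96},
    {"complex": "A-B", "patchdock_rank": 3, "p_fnat": 0.40, "p_irmsd": 0.99},
    {"complex": "C-D", "patchdock_rank": 1, "p_fnat": 0.91, "p_irmsd": 0.80},
    {"complex": "C-D", "patchdock_rank": 2, "p_fnat": 0.30, "p_irmsd": 0.85},
]

best_fnat = top_by_model(solutions, "p_fnat")
best_irmsd = top_by_model(solutions, "p_irmsd")
agreed = consensus_predictions(solutions)
print({name: s["patchdock_rank"] for name, s in best_fnat.items()})
print({name: s["patchdock_rank"] for name, s in best_irmsd.items()})
print([(s["complex"], s["patchdock_rank"]) for s in agreed])
```

In this toy example the pose ranked first by docking score for complex A-B is not the one favoured by either classifier, which mirrors the kind of re-ranking the paragraph describes.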
A web-based server platform named Protein Complex Prediction by Interface Properties (PCPIP) has been developed to predict whether the interacting interface of a given protein-protein dimer complex significantly resembles known protein interfaces. PCPIP predicts whether each submitted interface is likely to be native like or not. This prediction server would be particularly useful in identifying correct docking poses among the numerous solutions that standard protein docking programs offer. Figure S7 provides a snapshot of the input and output options of the PCPIP server.

Discussion
Protein-protein interactions (PPI) are crucial for intra- and inter-cellular functions and inter-molecular connectivity. Owing to significant improvements in experimental techniques, large numbers of protein structures are now available. Similarly, improved high-throughput methods such as the yeast two-hybrid system (Y2H), mass spectrometry (MS), and tandem affinity purification (TAP) have identified numerous PPIs that were previously unknown. However, these efforts are expensive and time consuming, and have covered only a small portion of the complete PPI networks. Hence, the need for computational techniques has increased, both to augment experimentally identified PPIs and to provide a larger repertoire of cellular PPIs. Another daunting challenge is to physically construct and map these large numbers of PPI complexes and identify the mode of interaction. Computationally generated PPI complexes could be quite useful and may expedite the experiments that are required to validate the binding interface and the residues critical for the interactions. However, these predictions are generally error prone and therefore need to be validated very carefully. Even though various protein-protein docking programs are available, methods for systematic evaluation of the predicted PPI complexes are limited. In our effort, we studied the properties of known protein-protein interfaces and applied this knowledge of native PPI interface properties to a standard machine learning technique, the support vector machine (SVM), to delineate native-like complexes from non-native like complexes. Interestingly, this simple approach turned out to be quite effective, as suggested by the very good performance metrics of our SVM models in distinguishing native and non-native like interfaces for homo and hetero complexes.
Our exhaustive testing and benchmarking exercises, using a completely non-redundant training-testing dataset and various degrees of distinguishing thresholds between native and non-native like complexes, suggest a very high accuracy of the models. The categorizations of non-native instances were implemented using strict, moderate and lenient definitions in order to consider the many intermediate docking models that partially overlap with native interfaces. The performance of the models demonstrates the high efficiency of the approach in distinguishing native like complexes from non-native like complexes having high, moderate and low overlap with the actual interfaces (Table 1).
We also evaluated the performance of our method in a further validation test, in which it was asked to differentiate native like complexes from non-native complexes prepared from individually solved monomer structures of known complex structures. This apparently difficult dataset rules out the possibility that the monomer structures are primed to form native complexes when subjected to protein docking approaches. In this Apo-Holo validation test, our method performs reasonably well, especially for the highly distinguishable native and non-native like complexes (Table 2). Moreover, comparative analysis using this dataset shows much better performance of our method with respect to other available methods (Fig. 2). We applied this approach in a real scenario where protein interaction information is proven via experimental findings but the three-dimensional (3D) structures of the complexes and the corresponding interface(s) are yet to be determined. We generated such complexes using 32 high-confidence STRING protein-protein interactions and identified the most likely interaction modes for 12 complexes. Such filtered structural models could be very useful for designing subsequent experiments to validate the actual mode of interaction even without attempting to solve the entire complex structures.
Finally, we converted the methodology into a user-friendly, easy-to-use web server platform namely PCPIP to predict whether the interacting interface of a given protein-protein dimer complex significantly resembles known protein interfaces. We believe that this resource could be a useful tool for biologists to evaluate protein-protein docking derived results and gain helpful knowledge to design confirmatory experiments.