Classification and prediction of protein–protein interaction interface using machine learning algorithm

Das, Subhrangshu; Chakrabarti, Saikat

doi:10.1038/s41598-020-80900-2

Article
Open access
Published: 19 January 2021

Scientific Reports volume 11, Article number: 1761 (2021)

Abstract

Structural insight into the protein–protein interaction (PPI) interface can provide knowledge about the kinetics, thermodynamics and molecular functions of the complex while elucidating its role in diseases and further enabling it as a potential therapeutic target. However, owing to the experimental lag in solving protein–protein complex structures, three-dimensional (3D) knowledge of PPI interfaces is often gained via computational approaches such as molecular docking and post-docking analyses. Despite the development of numerous docking tools and techniques, success in identifying native-like interfaces based on docking score functions is limited. Hence, we conducted an in-depth investigation of the structural features of the interface that might successfully delineate native complexes from non-native ones. We identify interface properties that show statistically significant differences between native and non-native interfaces of homo and hetero protein–protein complexes. Utilizing these properties, a support vector machine (SVM) based classification scheme has been implemented to differentiate native-like and non-native-like complexes generated using docking decoys. Benchmarking and comparative analyses suggest very good performance of our SVM classifiers. Further, protein interactions that are proven via experimental findings but not resolved structurally were subjected to this approach: 3D models of the complexes were generated and the most likely interfaces were predicted. A web server called Protein Complex Prediction by Interface Properties (PCPIP) has been developed to predict whether the interface of a given protein–protein dimer complex resembles known protein interfaces. The server is freely available at http://www.hpppi.iicb.res.in/pcpip/.

Introduction

Knowledge of protein–protein interactions (PPIs) is critical for understanding the molecular mechanisms of biochemical processes and cellular pathways. The advent of high-throughput techniques has enabled genome-wide identification of PPIs for quite a few model organisms^{1,2,3,4,5,6,7,8,9,10,11,12}. This large number of experimentally verified as well as computationally predicted interactions is collected and systematically stored in various PPI databases, such as the molecular interaction database (MINT)¹³, the Human Protein Reference Database (HPRD)¹⁴, STRING¹⁵, the database of interacting proteins (DIP)¹⁶, the protein interaction database (IntAct)¹⁷, etc. These databases contain important information on thousands of interactions, which is regularly used in network based ‘omics’ data analysis. However, most of these interactions lack detailed structural information, which makes them non-viable as therapeutic targets. In this scenario, computational approaches capable of generating reliable models of protein complexes using protein–protein docking tools can play an important role in complementing the experimental initiatives. However, as these complexes are generated using predictive approaches, objective tests and evaluation tools are required to determine their reliability.

PPI interfaces have been studied extensively to analyze and understand the critical characteristic features that confer affinity, stability and specificity on the complexes. Properties like accessible surface area (ASA) and buried surface area (BSA), interface residue conservation, hydrogen bonds, and electrostatic and hydrophobic interactions play major roles in determining the nature of protein interfaces. Knowledge of interface characteristics has been used to identify protein interfaces and to predict binding specificity^{18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38}. Similarly, numerous scoring functions and schemas have been developed for improved prediction of protein–protein interfaces, of which only a few can be mentioned due to space constraints^{39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74}. Despite all this work, a gap still remains between the optimally scoring solutions and the biologically active complexes^{75,76,77,78}.

Here, we attempt to utilize protein–protein interface properties to establish discernible differences between native-like protein complexes and non-native ones. We fed the interface properties to a support vector machine (SVM) based classification scheme and trained models to differentiate between native-like and non-native-like complexes derived by protein–protein docking. Machine learning based techniques have been used previously to analyze and predict protein–protein interactions^{78,79,80,81,82,83,84,85,86}. Our exhaustive testing and benchmarking suggest very good performance of our SVM models in distinguishing native-like and non-native-like interfaces for homo and hetero complexes. We also applied this approach to protein interactions that are proven via experimental findings but for which the three-dimensional (3D) structure of the complex and its interface(s) are yet to be determined. Finally, we provide a web server platform named PCPIP to predict whether the interacting interface of a given protein–protein dimer complex significantly resembles known protein interfaces.

Materials and methods

Collection and generation of non-redundant protein–protein complex dataset

A total of 989 protein–protein dimer complex structures were derived from the Protein Data Bank (PDB)⁸⁷ and categorized into homo (560) and hetero (429) dimers. An exhaustive redundancy check and filter was applied to these datasets using the CD-HIT⁸⁸ and BLASTp⁸⁹ programs so that no protein complex (both chains) is more than 40% identical to any other complex within the homo or hetero category, respectively (Table S1, Figure S1). After the redundancy check, 371 homo and 346 hetero complexes remained, which were further filtered by successful docking and interface generation criteria (see below for the docking and interface filters).

To validate our machine learning based classification system, we built a separate validation dataset for benchmarking, in which both the dimer complex and the individual monomer structures are available separately (Apo-Holo validation set). This dataset of complex (holo) and non-complex (apo) forms of proteins was collected from a recent report by Viswanathan et al.⁹⁰. Initially, the dataset contained 95 protein–protein hetero complexes (holo-complexes) and their respective monomer structures, which are separately available as apo proteins. These 95 complexes were screened to identify native-like and non-native-like interfaces via the fraction of conserved native contacts (FNAT) based definition (see below for details), and docking decoys that passed the FNAT filtration criteria could be retrieved for 32 complexes (Table S1). Similarly, the 95 complexes were screened via the interface root mean square deviation (iRMSD) based definition (see below for details), and 68 complexes passed the iRMSD filtration criteria.

We also created a negative dataset of 130 protein–protein complexes, and their interfaces, for proteins that are not supposed to interact according to the Negatome database^{91,92} (Negatome validation set). Further details about the dataset collection are provided in the supplementary information file.

Protein and/or domain structures for which an experimentally validated protein–protein interaction is reported in the STRING database (version 10.5)¹⁵ were also collected individually from the PDB. Protein–protein interaction pairs for which individual monomer structures are available were selected randomly and used to generate probable dimer structures by protein–protein docking with the PatchDock program⁹³. Docked model complexes were generated for 32 such interaction pairs (STRING dataset; ten models per pair; 320 in total) and further evaluated by our machine learning based protein–protein interface prediction algorithm.

Generation of native and non-native like protein–protein complexes and their interfaces

Known protein dimer complexes were used to create native-like and non-native-like interfaces. The constituent monomers of each dimer complex were separated and re-docked using the PatchDock protein–protein docking software, and the resulting docked solutions were screened to define native-like and non-native-like complexes based on the following criteria. The fraction of conserved native contacts (FNAT) measures how much of the interface of the original complex is recovered in the docked interface, whereas the interface root mean square deviation (iRMSD) compares the actual orientation of the interface-forming residues between the docked and original complexes. Hence, FNAT and iRMSD complement each other in evaluating a predicted interface.

FNAT based categorization

Fraction of conserved native contacts (FNAT) is the number of native (correct) residue–residue contacts in the docked complex divided by the number of contacts in the original complex. FNAT thus reflects the overlap between the original and docked complex interfaces, and an FNAT value of 1.0 indicates complete overlap between the two. In our training and testing models, the original complex and the docked complexes with FNAT > 0.8 were regarded as true or native-like complexes for each dimer, whereas false/non-native-like complexes were identified using four separate FNAT thresholds: (a) FNAT ≤ 0.25 (highly distinguishable from the native-like complexes), (b) FNAT > 0.25 and ≤ 0.5 (moderately distinguishable from the native-like complexes), (c) FNAT > 0.5 and ≤ 0.8 (weakly distinguishable from the native-like complexes), and (d) FNAT ≤ 0.8 (mixed).
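The FNAT definition and thresholds above can be illustrated with a minimal Python sketch. It assumes that interface contacts have already been extracted as pairs of residue identifiers; the contact representation and the function names `fnat` and `fnat_category` are hypothetical and are not taken from the PCPIP implementation.

```python
from typing import Iterable, Tuple

# Hypothetical representation: one contact = (residue id on chain A, residue id on chain B).
Contact = Tuple[str, str]


def fnat(native_contacts: Iterable[Contact], docked_contacts: Iterable[Contact]) -> float:
    """Fraction of the original (native) contacts that reappear in the docked complex."""
    native = set(native_contacts)
    if not native:
        raise ValueError("the original complex has no interface contacts")
    return len(native & set(docked_contacts)) / len(native)


def fnat_category(value: float) -> str:
    """Map an FNAT value to the classes (a)-(c) described in the text."""
    if value > 0.8:
        return "native-like"
    if value > 0.5:
        return "non-native-like (weakly distinguishable)"
    if value > 0.25:
        return "non-native-like (moderately distinguishable)"
    return "non-native-like (highly distinguishable)"
```

The mixed category (d) is simply the union of the three non-native classes, so any value for which `fnat_category` does not return "native-like" belongs to it.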

As for the training–testing dataset, the same four FNAT thresholds were used to define the non-native-like complexes for the benchmarking dataset (Apo-Holo validation set).

iRMSD based categorization

Interface root mean square deviation (iRMSD) is the root mean square deviation between the residues of both chains at the interface region. Similar to the FNAT based protocol, native and non-native sets were also identified based on iRMSD: for each dimer, the original complex along with one docked complex with iRMSD < 5 Å with respect to the original complex were regarded as native-like complexes, whereas non-native-like complexes were identified with four different thresholds: (a) iRMSD ≥ 15 Å (highly distinguishable from the native-like complexes), (b) iRMSD ≥ 10 Å and < 15 Å (moderately distinguishable from the native-like complexes), (c) iRMSD ≥ 5 Å and < 10 Å (weakly distinguishable from the native-like complexes), and (d) iRMSD ≥ 5 Å (mixed).
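A minimal Python sketch of the iRMSD based categorization is given below. It assumes that the docked and original complexes have already been superimposed and that the interface atoms are supplied as matched coordinate lists in Å; the superposition step, the choice of atoms, and the function names are assumptions made for illustration only.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]  # x, y, z in Angstrom


def interface_rmsd(reference: Sequence[Point], docked: Sequence[Point]) -> float:
    """RMSD over matched interface atoms (assumes the structures are already superimposed)."""
    if not reference or len(reference) != len(docked):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for p, q in zip(reference, docked) for a, b in zip(p, q))
    return math.sqrt(squared / len(reference))


def irmsd_category(value: float) -> str:
    """Map an iRMSD value (Angstrom) to the classes (a)-(c) described in the text."""
    if value < 5.0:
        return "native-like"
    if value < 10.0:
        return "non-native-like (weakly distinguishable)"
    if value < 15.0:
        return "non-native-like (moderately distinguishable)"
    return "non-native-like (highly distinguishable)"
```

As with FNAT, the mixed category (d) covers every value of 5 Å or more.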

Native-like and non-native-like complexes (Table S1 and Figure S2), categorized according to the criteria explained above in a 1:5 ratio (2 native and 10 non-native samples, respectively), were then used for classification.

Calculation of PPI interface properties

PISA software^{94,95} was used to calculate structural and chemical properties of the macromolecular interfaces, such as accessible/buried surface area, free energy of dissociation, presence/absence of hydrogen bonds and salt bridges, etc. Further details are provided in the supplementary information file. Table S2 gives the full list of features that were used for machine learning based classification between native-like and non-native-like protein–protein interaction complexes.

Classification of the PPI interfaces via SVM

Support vector machine (SVM) is a supervised learning model that classifies samples on the basis of their features. Here, we used a radial basis function (RBF) kernel with a 100-fold randomized cross-validation scheme: in each of 100 rounds, a randomly selected 80% of the whole data was used to train the SVM model and the remaining 20% was used as test data. Performance was measured as the average over the hundred random trials. LibSVM⁹⁶ was used to build the classifier models.

For both FNAT and iRMSD based categorizations, the 100-fold randomized selection of training (80% of the data) and testing (20% of the data) samples, followed by SVM based classification trials, was performed for each native and non-native complex threshold criterion. Figure S3 provides an overview of the various training and testing SVM runs employed in this study using multiple categories of native-like and non-native-like protein–protein complexes, while Table S3 provides the kernel function and other relevant parameters of the various SVM models.
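The resampling protocol described above (100 random 80%/20% splits with averaged test performance) can be sketched in plain Python as follows. This is only an illustration of the splitting and averaging logic: the `nearest_centroid` classifier is a deliberately simple placeholder, not the RBF-kernel SVM built with LibSVM, and all function names and the fixed random seed are assumptions.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# (interface feature vector, label); 1 = native-like, 0 = non-native-like
Sample = Tuple[Sequence[float], int]
Predictor = Callable[[Sequence[float]], int]


def nearest_centroid(train: List[Sample]) -> Predictor:
    """Placeholder classifier (NOT an SVM): predict the class with the closest mean vector."""
    centroids: Dict[int, List[float]] = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(col) / len(col) for col in zip(*rows)]

    def predict(x: Sequence[float]) -> int:
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict


def repeated_holdout_accuracy(
    data: Sequence[Sample],
    fit: Callable[[List[Sample]], Predictor] = nearest_centroid,
    rounds: int = 100,
    train_fraction: float = 0.8,
    seed: int = 0,
) -> float:
    """Average test accuracy over `rounds` random train/test splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(rounds):
        shuffled = list(data)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        predict = fit(train)
        correct = sum(1 for x, y in test if predict(x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

To approximate the setup in the text, `fit` would be replaced by a function that trains an RBF-kernel SVM on the training split and returns its prediction function.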

Benchmarking and comparison of performance

We used a separate set of complexes to check the efficacy of our SVM models in correctly identifying native and non-native complexes. The previously described native-like and non-native-like complexes within the Apo-Holo validation set were tested against the 100 SVM models built with 80% training data. Based on the definition of native-like and non-native-like complexes using the FNAT and iRMSD categorizations, each set containing highly, moderately, weakly, or mixed distinguishable non-native-like complexes was tested against the corresponding 100 models of the training datasets. The average sensitivity, specificity, precision, and F1 score over the 100 runs were recorded.

The separate datasets of Apo-Holo heterodimers defined with the mixed distinguishable non-native-like complex thresholds [FNAT-native (FNAT > 0.8): 64 and non-native (FNAT ≤ 0.8): 320; iRMSD-native (iRMSD < 5 Å): 134 and non-native (iRMSD ≥ 5 Å): 680] (Table S1) were used as input to the CCharPPI online server⁹⁷, where 10 different composite scoring functions, such as ZRANK, ZRANK2, ROSETTADOCK, PYDOCK, FIREDOCK, PISA score, CP_PIE, and SIPPER, were applied to each complex and the corresponding scores were obtained. An individual receiver operating characteristic (ROC) plot was created for each scoring function by calculating the true positive rate (TPR; sensitivity) and false positive rate (FPR; 1-specificity).
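The ROC construction described above can be sketched as follows. The sketch assumes that a higher score means a more native-like complex (energy-like scores, where lower is better, would need to be negated first) and that labels are 1 for native-like and 0 for non-native-like complexes; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the decision threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both native-like and non-native-like examples are required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last member of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as (FPR, TPR) points sorted by FPR."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

Applying `roc_points` once per scoring function gives one curve per method, which is the kind of comparison the text describes.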

Additionally, 130 protein pairs extracted from the Negatome database^{91,92} were subjected to PatchDock to generate biologically non-feasible protein complexes. Five docked decoy complexes were collected for each query pair following the criteria described in the supplementary information file. This dataset (Negatome validation set) was used to assess the false prediction rate of the SVM classifier.

In the absence of a reference complex structure, we selected the top ten docking solutions according to the PatchDock docking score for each of the 32 protein–protein interactions in the STRING dataset. These solutions were subjected to interface feature generation followed by SVM prediction using the heterodimer training model.

Standard performance metrics such as sensitivity, specificity, precision, F1 score, Matthews correlation coefficient (MCC), etc. were calculated along with the ROC statistics to measure the performance of our models in different scenarios. Please see the supplementary information file for more details.
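For reference, the metrics named above follow from the four confusion-matrix counts using their standard definitions, as in the sketch below. The function name and the convention of returning 0.0 when a denominator is zero are choices made here for illustration; the exact formulas used in the study are given in its supplementary information.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division; returns 0.0 when the denominator is zero (a convention chosen here)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics, treating native-like complexes as the positive class."""
    sensitivity = _ratio(tp, tp + fn)          # true positive rate (recall)
    specificity = _ratio(tn, tn + fp)          # true negative rate
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }
```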

Development of the webserver

We have developed a web based server named PCPIP (Protein Complex Prediction by Interface Properties), in which the classification and prediction schemes are embedded within a web module. Given a protein–protein complex, the PCPIP server predicts whether the interacting interface significantly resembles known protein interfaces. The server is available at http://www.hpppi.iicb.res.in/pcpip/ and is developed on a PHP and CGI-PERL platform.

The server has two input options: single mode and batch mode. In single mode, one protein–protein complex (homo or hetero dimer) file saved in standard PDB format can be uploaded to check whether the interacting interface, if there is any, resembles the interfaces extracted from native protein–protein complexes.

SVM models for both homo and hetero dimers, along with their interface features, are kept as background search models against which the uploaded protein complex is tested based on the FNAT and iRMSD criteria.

Results

Protein–protein interface properties to differentiate between native and non-native like interfaces

PPI interface properties that showed statistically significant (p ≤ 0.01) differences between the native-like and non-native-like complexes, categorized by either FNAT or iRMSD criteria, were compared (Fig. 1A). Sixty-two such features were common to homo and hetero dimers. A large fraction of the distinguishing features represent the accessible surface area (ASA) of amino acids located at the native and non-native-like interfaces (Fig. 1B and Figure S4). However, only phenylalanine (PHE), tyrosine (TYR), and isoleucine (ILE) possess significantly higher buried surface area (BSA) in native interfaces, whereas lysine (LYS) possesses significantly lower BSA in native interfaces (Fig. 1C,D). Hydrogen bonds between aspartate and arginine (ASP-ARG) and between glutamate and arginine (GLU-ARG) were significantly more frequent in native interfaces (Fig. 1E,F), and all the native interfaces were found to be significantly more stable with respect to binding energy (Fig. 1G,H). The abundances of negatively charged amino acids (ASP and GLU), serine (SER), threonine (THR) and cysteine (CYS) are relatively lower at the native interfaces of hetero complexes than at non-native interfaces, whereas the abundances of positively charged ARG and HIS are relatively higher at the native interfaces of homo complexes (Figure S5). In addition, the frequencies of ALA, GLY, PRO, ASN, GLN, LYS, HIS, PHE, TRP, and TYR also differ between native and non-native interfaces (Figure S5). These observations indicate discernible differences between native and non-native interfaces and support the use of interface features to classify and predict native PPI interfaces.

Classification of protein–protein interaction interfaces via SVM

Support vector machine (SVM) was used for classification with the 100-fold randomized cross-validation approach. Table 1 provides mean test accuracies for homodimer and heterodimer complexes. The 100-fold randomized selection of training (80% of the data) and testing (20% of the data) samples, followed by SVM based classification trials, was performed for each native and non-native complex threshold criterion. Classifications were performed with different training models in which non-native-like interfaces were selected with various FNAT thresholds with respect to the original complex: (a) FNAT ≤ 0.25 (highly distinguishable from the native-like complexes), (b) FNAT > 0.25 and ≤ 0.5 (moderately distinguishable from the native-like complexes), and (c) FNAT > 0.5 and ≤ 0.8 (weakly distinguishable from the native-like complexes). A similar 100-fold classification was also performed using native-like and non-native-like complexes defined by iRMSD, where non-native-like complexes were identified with three thresholds: (a) iRMSD ≥ 15 Å (highly distinguishable from the native-like complexes), (b) iRMSD ≥ 10 Å and < 15 Å (moderately distinguishable from the native-like complexes), and (c) iRMSD ≥ 5 Å and < 10 Å (weakly distinguishable from the native-like complexes).

Table 1 Mean test and train accuracies for prediction of protein–protein interaction interfaces.

Our SVM models performed very well with all categories of non-native-like protein–protein complexes, whether categorized by the FNAT or the iRMSD definition (Table 1). Using both criteria makes the evaluation of PPI interfaces more reliable.

Benchmarking and comparative validation

As mentioned before, for benchmarking and validation we used the Apo-Holo validation dataset comprising 32 (FNAT) and 68 (iRMSD) dimer complexes for which individual monomer structures are also available separately. Monomers were docked using the PatchDock protein–protein docking software⁹³, and native-like and non-native-like complexes were generated using various ranges of FNAT and iRMSD based criteria. The original complex and all docking solutions with FNAT > 0.8 were regarded as native-like complexes, whereas solutions with FNAT ≤ 0.25, FNAT > 0.25 and ≤ 0.5, FNAT > 0.5 and ≤ 0.8, and FNAT ≤ 0.8, respectively, were considered non-native-like complexes. Similarly, the original complex and all docking solutions with iRMSD < 5 Å were regarded as native-like complexes, whereas solutions with iRMSD ≥ 15 Å, iRMSD ≥ 10 Å and < 15 Å, and iRMSD ≥ 5 Å and < 10 Å, respectively, were considered non-native-like complexes. Native-like and non-native-like complexes from these validation datasets were mixed together and the classifier was asked to differentiate them correctly based on their interface features. The performance of the classifiers was measured using standard parameters. Table 2 shows the performance measures estimated for a range of probability thresholds (0.50–0.95). The individual models clearly perform quite well.

Table 2 Benchmarking results for Apo-Holo datasets using the Hetero FNAT and iRMSD models.

The datasets shown in Table 2 were submitted to the CCharPPI online server⁹⁷, where 10 different composite scoring functions were applied to each complex and the corresponding scores were obtained. ROC plots were created (Fig. 2) by calculating the true positive rate (TPR) and false positive rate (FPR; 1-specificity). Figure 2 clearly demonstrates the much better performance of our FNAT (PCPIP_FNAT) and iRMSD (PCPIP_iRMSD) based prediction models in predicting complexes correctly. Hence, on these apparently difficult datasets our method performed much better than the other methods.

Further, the accuracy of all 100 randomly generated training models was tested using a dataset of complexes that are not supposed to form physiologically. Analyzing the top 25 cases, it is evident that at all probability thresholds (0.5–0.95) more than 90% of the Negatome complexes were predicted as false by the FNAT model, whereas the iRMSD model requires a probability threshold of 0.85 to reach 90% accuracy (Fig. 3).
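The threshold analysis on the Negatome set can be sketched as below. The sketch assumes that the classifier returns, for every decoy complex, a probability of being native-like and that a complex counts as "predicted as false" when this probability is below the chosen threshold; the 0.05 step between thresholds and the function names are assumptions, since the text only gives the range 0.5–0.95.

```python
from typing import Dict, Sequence

# 0.50, 0.55, ..., 0.95 (the 0.05 step is an assumption; the text only states the range)
THRESHOLDS = tuple(round(0.50 + 0.05 * i, 2) for i in range(10))


def rejection_rate(native_probabilities: Sequence[float], threshold: float) -> float:
    """Fraction of complexes whose predicted native-like probability is below the threshold."""
    if not native_probabilities:
        raise ValueError("at least one prediction is required")
    rejected = sum(1 for p in native_probabilities if p < threshold)
    return rejected / len(native_probabilities)


def rejection_curve(
    native_probabilities: Sequence[float],
    thresholds: Sequence[float] = THRESHOLDS,
) -> Dict[float, float]:
    """Rejection rate at each probability threshold."""
    return {t: rejection_rate(native_probabilities, t) for t in thresholds}
```

For a negative set such as Negatome, every complex is expected to be non-native, so the rejection rate at a given threshold is the accuracy on that set.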

Applications of the prediction algorithm

We collected 32 STRING-suggested, experimentally verified protein–protein interactions, and the complexes were modeled using the PatchDock protein docking algorithm with monomer structures/domains collected from the PDB⁸⁷. For the FNAT and iRMSD models, respectively, the docking solution with the highest probability score was considered the top-ranked prediction. The distributions of probability scores for the FNAT and iRMSD based top-ranked predictions and for the PatchDock based top solutions are plotted in Fig. 4A,B; they suggest that solutions ranked by docking score are less likely to contain native-like complexes than those selected by FNAT and iRMSD based prediction. Out of the 320 interfaces (10 solutions for each complex), 12 interfaces were commonly predicted by the FNAT and iRMSD models with the highest reliability (probability threshold ≥ 0.95). The ΔGs of binding for these predicted complexes are comparable with those of known heterodimer complexes (Fig. 4C), indicating the reliability of the predicted poses. Out of these 12 predicted complexes, we showcase three complexes formed by glyceraldehyde 3-phosphate dehydrogenase (GAPDH) with phosphoglycerate kinase (PGK1), enolase 1 (ENO1) and triose-phosphate isomerase (TIM), respectively (Fig. 4D,F). Figure S6 provides the mode of interaction and the interface parameters for all 12 complexes.
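The selection logic described above (top-ranked pose per complex, and poses supported by both models at a probability of at least 0.95) can be sketched as follows. The pose representation, the dictionary inputs and the function names are hypothetical; the probabilities would come from the FNAT and iRMSD based classifiers.

```python
from typing import Dict, Set, Tuple

# Hypothetical pose identifier: (complex name, index of the docking solution)
Pose = Tuple[str, int]


def top_ranked(probabilities: Dict[Pose, float]) -> Dict[str, Pose]:
    """For each complex, return the docking solution with the highest predicted probability."""
    best: Dict[str, Pose] = {}
    for pose, probability in probabilities.items():
        name = pose[0]
        if name not in best or probability > probabilities[best[name]]:
            best[name] = pose
    return best


def consensus_poses(
    fnat_probabilities: Dict[Pose, float],
    irmsd_probabilities: Dict[Pose, float],
    threshold: float = 0.95,
) -> Set[Pose]:
    """Poses predicted as native-like by both models at or above the threshold."""
    return {
        pose
        for pose, probability in fnat_probabilities.items()
        if probability >= threshold and irmsd_probabilities.get(pose, 0.0) >= threshold
    }
```

With ten docking solutions for each of 32 complexes, both dictionaries would hold 320 entries, and `consensus_poses` would return the subset on which the two models agree.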

A web based server platform named Protein Complex Prediction by Interface Properties (PCPIP) has been developed to predict whether the interacting interface of a given protein–protein dimer complex significantly resembles known protein interfaces. PCPIP predicts whether the submitted interface(s) is likely to be native-like or not. This prediction server should be particularly useful for identifying correct docking poses among the numerous solutions that standard protein docking programs offer. Figure S7 provides a snapshot of the input and output options of the PCPIP server.

Discussion

Protein–protein interactions (PPIs) are crucial for intra- and inter-cellular functions and inter-molecular connectivity. Owing to significant improvements in experimental techniques, large numbers of protein structures are now available. Similarly, improved high-throughput approaches such as the yeast two-hybrid system (Y2H), mass spectrometry (MS), and tandem affinity purification (TAP) have identified numerous PPIs that were previously unknown. However, these efforts are expensive and time consuming, and have covered only a small portion of the complete PPI networks. Hence, the need for computational techniques has increased, to augment experimentally identified PPIs and provide a larger repertoire of cellular PPIs. Another daunting challenge is to physically construct and map these large numbers of PPI complexes and identify their mode of interaction. Computationally generated PPI complexes could be quite useful and may expedite the experiments required to validate the binding interface and the residues critical for the interactions. However, these predictions are generally error prone and therefore need to be validated very carefully. Even though various protein–protein docking programs are available, methods for systematic evaluation of the predicted PPI complexes are limited.

In this work, we studied known protein–protein interface properties and applied this knowledge of native PPI interfaces to a standard machine learning technique, the support vector machine (SVM), to delineate native-like complexes from non-native-like complexes. Interestingly, this simple approach turned out to be quite effective, as suggested by the very good performance metrics of our SVM models in distinguishing native-like and non-native-like interfaces for homo and hetero complexes. Our exhaustive testing and benchmarking exercises, using a completely non-redundant training–testing dataset and various degrees of distinguishing thresholds between native-like and non-native-like complexes, suggest a very high accuracy of the models. The non-native instances were categorized using strict, moderate and lenient definitions in order to consider the many intermediate docking models that partially overlap with native interfaces. The performance of the models demonstrates the high efficiency of the approach in distinguishing native-like complexes from non-native-like complexes having high, moderate and low overlap with the actual interfaces (Table 1).

We also evaluated the performance of our method in a further validation test, in which it had to differentiate native-like complexes from non-native complexes prepared from individually solved monomer structures of known complex structures. This apparently difficult dataset rules out the possibility that the monomer structures are already primed to form native complexes when subjected to protein docking. In this Apo-Holo validation test, our method performs reasonably well, especially for the highly distinguishable native-like and non-native-like complexes (Table 2). Moreover, comparative analysis on this dataset shows that it performs much better than other available methods (Fig. 2).

We applied this approach to a real scenario in which protein interactions are proven via experimental findings but the three-dimensional (3D) structure of the complexes and their interface(s) are yet to be determined. We generated such complexes for 32 high-confidence STRING protein–protein interactions and identified the most likely interaction modes for 12 complexes. Such filtered structural models could be very useful for designing subsequent experiments to validate the actual mode of interaction, even without attempting to solve the entire complex structures.

Finally, we converted the methodology into a user-friendly web server platform named PCPIP to predict whether the interacting interface of a given protein–protein dimer complex significantly resembles known protein interfaces. We believe that this resource could be a useful tool for biologists to evaluate protein–protein docking derived results and to gain knowledge helpful for designing confirmatory experiments.

References

Butland, G. et al. Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 433(7025), 531–537 (2005).
Article ADS CAS PubMed Google Scholar
Kühner, S. et al. Proteome organization in a genome-reduced bacterium. Science 326(5957), 1235–1240 (2009).
Article ADS PubMed CAS Google Scholar
Uetz, P. et al. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403(6770), 623–627 (2000).
Article ADS CAS PubMed Google Scholar
Krogan, N. J. et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084), 637–643 (2006).
Article ADS CAS PubMed Google Scholar
Yu, H. et al. High-quality binary protein interaction map of the yeast interactome network. Science 322(5898), 104–110 (2008).
Article ADS CAS PubMed PubMed Central Google Scholar
Tarassov, K. et al. An in vivo map of the yeast protein interactome. Science 320(5882), 1465–1470 (2008).
Article ADS CAS PubMed Google Scholar
Giot, L. et al. A protein interaction map of Drosophila melanogaster. Science 302(5651), 1727–1736 (2003).
Article ADS CAS PubMed Google Scholar
Li, S. et al. A map of the interactome network of the metazoan C. elegans. Science 303(5657), 540–543 (2004).
Article ADS CAS PubMed PubMed Central Google Scholar
Rual, J. F. et al. Towards a proteome-scale map of the human protein–protein interaction network. Nature 437(7062), 1173–1178 (2005).
Article ADS CAS PubMed Google Scholar
Stelzl, U. et al. A human protein-protein interaction network: a resource for annotating the proteome. Cell 122(6), 957–968 (2005).
Article CAS PubMed Google Scholar
Ewing, R. M. et al. Large-scale mapping of human protein–protein interactions by mass spectrometry. Mol. Syst. Biol. 3(1), 89 (2007).
Article PubMed PubMed Central CAS Google Scholar
Arabidopsis Interactome Mapping Consortium. Evidence for network evolution in an Arabidopsis interactome map. Science 333(6042), 601–607 (2011).
Article PubMed Central CAS Google Scholar
Chatr-Aryamontri, A. et al. MINT: the Molecular INTeraction database. Nucleic Acids Res. 35(suppl 1), D572–D574 (2007).
Keshava Prasad, T. S. et al. Human protein reference database—2009 update. Nucleic Acids Res. 37(suppl 1), D767–D772 (2009).
Von Mering, C. et al. STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res. 33(suppl 1), D433–D437 (2005).
Salwinski, L. et al. The database of interacting proteins: 2004 update. Nucleic Acids Res. 32(suppl 1), D449–D451 (2004).
Kerrien, S. et al. IntAct—open source resource for molecular interaction data. Nucleic Acids Res. 35(suppl 1), D561–D565 (2007).
Chothia, C. & Janin, J. Principles of protein–protein recognition. Nature 256(5520), 705–708 (1975).
Janin, J., Miller, S. & Chothia, C. Surface, subunit interfaces and interior of oligomeric proteins. J. Mol. Biol. 204(1), 155–164 (1988).
Hu, Z., Ma, B., Wolfson, H. & Nussinov, R. Conservation of polar residues as hot spots at protein interfaces. Proteins Struct. Funct. Bioinform. 39(4), 331–342 (2000).
Valdar, W. S. & Thornton, J. M. Protein–protein interfaces: analysis of amino acid conservation in homodimers. Proteins Struct. Funct. Bioinform. 42(1), 108–124 (2001).
Jones, S. & Thornton, J. M. Principles of protein–protein interactions. Proc. Natl. Acad. Sci. 93(1), 13–20 (1996).
Jones, S. & Thornton, J. M. Analysis of protein–protein interaction sites using surface patches. J. Mol. Biol. 272(1), 121–132 (1997).
Palma, P. N., Krippahl, L., Wampler, J. E. & Moura, J. J. BiGGER: a new (soft) docking algorithm for predicting protein interactions. Proteins Struct. Funct. Bioinform. 39(4), 372–384 (2000).
Xu, D., Tsai, C. J. & Nussinov, R. Hydrogen bonds and salt bridges across protein–protein interfaces. Protein Eng. 10(9), 999–1012 (1997).
Jones, S. & Thornton, J. M. Prediction of protein–protein interaction sites using patch analysis. J. Mol. Biol. 272(1), 133–143 (1997).
Hoskins, J., Lovell, S. & Blundell, T. L. An algorithm for predicting protein–protein interaction sites: abnormally exposed amino acid residues and secondary structure elements. Protein Sci. 15(5), 1017–1029 (2006).
De Vries, S. J. & Bonvin, A. M. Intramolecular surface contacts contain information about protein–protein interface regions. Bioinformatics 22(17), 2094–2098 (2006).
Aloy, P., Ceulemans, H., Stark, A. & Russell, R. B. The relationship between sequence and interaction divergence in proteins. J. Mol. Biol. 332(5), 989–998 (2003).
Hakes, L., Lovell, S. C., Oliver, S. G. & Robertson, D. L. Specificity in protein interactions and its relationship with sequence diversity and coevolution. Proc. Natl. Acad. Sci. 104(19), 7999–8004 (2007).
Chakrabarti, P. & Janin, J. Dissecting protein–protein recognition sites. Proteins Struct. Funct. Bioinform. 47(3), 334–343 (2002).
Bahadur, R. P., Chakrabarti, P., Rodier, F. & Janin, J. Dissecting subunit interfaces in homodimeric proteins. Proteins Struct. Funct. Bioinform. 53(3), 708–719 (2003).
Ofran, Y. & Rost, B. Analysing six types of protein–protein interfaces. J. Mol. Biol. 325(2), 377–387 (2003).
Crowley, P. B. & Golovin, A. Cation–π interactions in protein–protein interfaces. Proteins Struct. Funct. Bioinform. 59(2), 231–239 (2005).
Kufareva, I., Budagyan, L., Raush, E., Totrov, M. & Abagyan, R. PIER: protein interface recognition for structural proteomics. Proteins Struct. Funct. Bioinform. 67(2), 400–417 (2007).
Cazals, F. Revisiting the Voronoi description of protein–protein interfaces: algorithms. In IAPR International Conference on Pattern Recognition in Bioinformatics 419–430 (Springer, Berlin, 2010).
de Vries, S. J. & Bonvin, A. M. How proteins get in touch: interface prediction in the study of biomolecular complexes. Curr. Protein Pept. Sci. 9(4), 394–406 (2008).
Tuncbag, N., Kar, G., Keskin, O., Gursoy, A. & Nussinov, R. A survey of available tools and web servers for analysis of protein–protein interactions and interfaces. Brief. Bioinform. 10(3), 217–232 (2009).
Pallara, C. et al. Expanding the frontiers of protein–protein modeling: from docking and scoring to binding affinity predictions and other challenges. Proteins Struct. Funct. Bioinform. 81(12), 2192–2200 (2013).
Sukhwal, A. & Sowdhamini, R. Oligomerisation status and evolutionary conservation of interfaces of protein structural domain superfamilies. Mol. BioSyst. 9(7), 1652–1661 (2013).
Liu, S. & Vakser, I. A. DECK: Distance and environment-dependent, coarse-grained, knowledge-based potentials for protein–protein docking. BMC Bioinform. 12(1), 1–7 (2011).
Feliu, E., Aloy, P. & Oliva, B. On the analysis of protein–protein interactions via knowledge-based potentials for the prediction of protein–protein docking. Protein Sci. 20(3), 529–541 (2011).
Mitra, P. & Pal, D. Using correlated parameters for improved ranking of protein–protein docking decoys. J. Comput. Chem. 32(5), 787–796 (2011).
Pons, C., Talavera, D., De La Cruz, X., Orozco, M. & Fernandez-Recio, J. Scoring by intermolecular pairwise propensities of exposed residues (SIPPER): a new efficient potential for protein−protein docking. J. Chem. Inf. Model. 51(2), 370–377 (2011).
Demir-Kavuk, O., Krull, F., Chae, M. H. & Knapp, E. W. Predicting protein complex geometries with linear scoring functions. In Genome Informatics 2010: Genome Informatics Series, Vol. 24 21–30 (2010).
Cheng, T. M., Blundell, T. L. & Fernandez-Recio, J. pyDock: electrostatics and desolvation for effective scoring of rigid-body protein–protein docking. Proteins Struct. Funct. Bioinform. 68(2), 503–515 (2007).
Lyskov, S. & Gray, J. J. The RosettaDock server for local protein–protein docking. Nucleic Acids Res. 36(suppl 2), W233–W238 (2008).
Dominguez, C., Boelens, R. & Bonvin, A. M. HADDOCK: a protein−protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. 125(7), 1731–1737 (2003).
Pierce, B. & Weng, Z. ZRANK: reranking protein docking predictions with an optimized energy function. Proteins Struct. Funct. Bioinform. 67(4), 1078–1086 (2007).
Pierce, B. & Weng, Z. A combination of rescoring and refinement significantly improves protein docking performance. Proteins Struct. Funct. Bioinform. 72(1), 270–279 (2008).
Andrusier, N., Nussinov, R. & Wolfson, H. J. FireDock: fast interaction refinement in molecular docking. Proteins Struct. Funct. Bioinform. 69(1), 139–159 (2007).
Mashiach, E., Nussinov, R. & Wolfson, H. J. FiberDock: flexible induced-fit backbone refinement in molecular docking. Proteins Struct. Funct. Bioinform. 78(6), 1503–1519 (2010).
Zacharias, M. Protein–protein docking with a reduced protein model accounting for side-chain flexibility. Protein Sci. 12(6), 1271–1282 (2003).
Tovchigrechko, A. & Vakser, I. A. Development and testing of an automated approach to protein docking. Proteins Struct. Funct. Bioinform. 60(2), 296–301 (2005).
Liang, S., Liu, S., Zhang, C. & Zhou, Y. A simple reference state makes a significant improvement in near-native selections from structurally refined docking decoys. Proteins Struct. Funct. Bioinform. 69(2), 244–253 (2007).
Murphy, J., Gatchell, D. W., Prasad, J. C. & Vajda, S. Combination of scoring functions improves discrimination in protein–protein docking. Proteins Struct. Funct. Bioinform. 53(4), 840–854 (2003).
Huang, S. Y. & Zou, X. An iterative knowledge-based scoring function for protein–protein recognition. Proteins Struct. Funct. Bioinform. 72(2), 557–579 (2008).
Moont, G., Gabb, H. A. & Sternberg, M. J. Use of pair potentials across protein interfaces in screening predicted docked complexes. Proteins Struct. Funct. Bioinform. 35(3), 364–373 (1999).
Geppert, T., Proschak, E. & Schneider, G. Protein–protein docking by shape-complementarity and property matching. J. Comput. Chem. 31(9), 1919–1928 (2010).
Müller, W. & Sticht, H. A protein-specifically adapted scoring function for the reranking of docking solutions. Proteins Struct. Funct. Bioinform. 67(1), 98–111 (2007).
Xu, X., Yan, C., Wohlhueter, R. & Ivanov, I. Integrative modeling of macromolecular assemblies from low to near-atomic resolution. Comput. Struct. Biotechnol. J. 13, 492–503 (2015).
Aumentado-Armstrong, T. T. et al. Algorithmic approaches to protein–protein interaction site prediction. Algorithms Mol. Biol. 10, 7 (2015).
Sukhwal, A. & Sowdhamini, R. PPcheck: a webserver for the quantitative analysis of protein–protein interfaces and prediction of residue hotspots. Bioinform. Biol. Insights 9, 141–151 (2015).
Malhotra, S., Mathew, O. K. & Sowdhamini, R. DOCKSCORE: a webserver for ranking protein–protein docked poses. BMC Bioinform. 16, 127 (2015).
Zhang, Z., Schindler, C. E. M., Lange, O. F. & Zacharias, M. Application of enhanced sampling Monte Carlo methods for high-resolution protein–protein docking in Rosetta. PLoS ONE 10, e0125941 (2015).
Levy, E. D. & Teichmann, S. Structural, evolutionary, and assembly principles of protein oligomerization. Prog. Mol. Biol. Transl. Sci. 117, 25–51 (2013).
Soni, N. & Madhusudhan, M. S. Computational modeling of protein assemblies. Curr. Opin. Struct. Biol. 44, 179–189 (2017).
Guo, F. et al. Identifying protein–protein interface via a novel multi-scale local sequence and structural representation. BMC Bioinform. 20, 483 (2019).
Li, B. Q., Zhang, Y. H., Jin, M. L., Huang, T. & Cai, Y. D. Prediction of protein-peptide interactions with a nearest neighbor algorithm. Curr. Bioinform. 13(1), 14–24 (2018).
Ramírez-Aportela, E., Lopez-Blanco, J. & Chacon, P. FRODOCK 2.0: fast protein–protein docking server. Bioinformatics 32, 2386–2388 (2016).
Nadalin, F. & Carbone, A. Protein–protein interaction specificity is captured by contact preferences and interface composition. Bioinformatics 34, 459–468 (2018).
Guo, F. et al. Structural neighboring property for identifying protein–protein binding sites. BMC Syst. Biol. 9(Suppl 5), S3 (2015).
Guo, F., Ding, Y., Li, S. C., Shen, C. & Wang, L. Protein–protein interface prediction based on hexagon structure similarity. Comput. Biol. Chem. 63, 83–88 (2016).
Esmaielbeiki, R., Krawczyk, K., Knapp, B., Nebel, J. C. & Deane, C. M. Progress and challenges in predicting protein interfaces. Brief. Bioinform. 17(1), 117–131 (2016).
Moal, I. H., Moretti, R., Baker, D. & Fernandez-Recio, J. Scoring functions for protein–protein interactions. Curr. Opin. Struct. Biol. 23(6), 862–867 (2013).
Kastritis, P. L. & Bonvin, A. M. Are scoring functions in protein−protein docking ready to predict interactomes? Clues from a novel binding affinity benchmark. J. Proteome Res. 9(5), 2216–2225 (2010).
Liu, S., Liu, C. & Deng, L. Machine learning approaches for protein–protein interaction hot spot prediction: progress and comparative assessment. Molecules 23(10), 2535 (2018).
Chae, M. H., Krull, F., Lorenzen, S. & Knapp, E. W. Predicting protein complex geometries with a neural network. Proteins Struct. Funct. Bioinform. 78(4), 1026–1039 (2010).
Bourquard, T., Bernauer, J., Azé, J. & Poupon, A. A collaborative filtering approach for protein–protein docking scoring functions. PLoS ONE 6(4), e18541 (2011).
Bernauer, J., Azé, J., Janin, J. & Poupon, A. A new protein–protein docking scoring function based on interface residue properties. Bioinformatics 23(5), 555–562 (2007).
Fink, F., Hochrein, J., Wolowski, V., Merkl, R. & Gronwald, W. PROCOS: computational analysis of protein–protein complexes. J. Comput. Chem. 32(12), 2575–2586 (2011).
Bordner, A. J. & Gorin, A. A. Protein docking using surface matching and supervised machine learning. Proteins Struct. Funct. Bioinform. 68(2), 488–502 (2007).
Torchala, M., Moal, I. H., Chaleil, R. A., Agius, R. & Bates, P. A. A Markov-chain model description of binding funnels to enhance the ranking of docked solutions. Proteins Struct. Funct. Bioinform. 81(12), 2143–2149 (2013).
Patel, S., Tripathi, R., Kumari, V. & Varadwaj, P. Deepinteract: deep neural network based protein–protein interaction prediction tool. Curr. Bioinform. 12(6), 551–557 (2017).
Wei, L. et al. Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artif. Intell. Med. 83, 67–74 (2017).
Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28(1), 235–242 (2000).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23), 3150–3152 (2012).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997).
Viswanathan, R., Fajardo, E., Steinberg, G., Haller, M. & Fiser, A. Protein–protein binding supersites. PLoS Comput. Biol. 15(1), e1006704 (2019).
Blohm, P. et al. Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis. Nucleic Acids Res. 42(D1), D396–D400 (2014).
Smialowski, P. et al. The Negatome database: a reference set of non-interacting protein pairs. Nucleic Acids Res. 38(suppl 1), D540–D544 (2010).
Schneidman-Duhovny, D., Inbar, Y., Nussinov, R. & Wolfson, H. J. PatchDock and SymmDock: servers for rigid and symmetric docking. Nucleic Acids Res. 33(suppl 2), W363–W367 (2005).
Krissinel, E. & Henrick, K. Inference of macromolecular assemblies from crystalline state. J. Mol. Biol. 372(3), 774–797 (2007).
Krissinel, E. Crystal contacts as nature’s docking solutions. J. Comput. Chem. 31(1), 133–143 (2010).
Chang, C. C. & Lin, C. J. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27 (2011).
Moal, I. H., Jiménez-García, B. & Fernández-Recio, J. CCharPPI web server: computational characterization of protein–protein interactions from structure. Bioinformatics 31(1), 123–125 (2015).

Acknowledgements

The authors thank Shreemoyee Dutta Majumdar for her initial involvement in the project. SC acknowledges CSIR-Indian Institute of Chemical Biology (IICB) for infrastructural and financial support. SD thanks CSIR for a fellowship.

Author information

Authors and Affiliations

Structural Biology and Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, Kolkata, WB, India
Subhrangshu Das & Saikat Chakrabarti

Authors

Subhrangshu Das
Saikat Chakrabarti

Contributions

S.C. conceptualized the project. S.D. did the analysis and created the server. S.C. wrote the paper.

Corresponding author

Correspondence to Saikat Chakrabarti.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Das, S., Chakrabarti, S. Classification and prediction of protein–protein interaction interface using machine learning algorithm. Sci Rep 11, 1761 (2021). https://doi.org/10.1038/s41598-020-80900-2

Received: 04 August 2020
Accepted: 15 December 2020
Published: 19 January 2021
DOI: https://doi.org/10.1038/s41598-020-80900-2
