Pattern to Knowledge: Deep Knowledge-Directed Machine Learning for Residue-Residue Interaction Prediction

Residue-residue close contact (R2R-C) data procured from three-dimensional protein-protein interaction (PPI) experiments is currently used for predicting residue-residue interaction (R2R-I) in PPI. However, due to complex physicochemical environments, R2R-I incidences, facilitated by multiple factors, are usually entangled in the source environment and masked in the acquired data. Here we present a novel method, P2K (Pattern to Knowledge), to disentangle R2R-I patterns and render much more succinct discriminative information expressed in different specific R2R-I statistical/functional spaces. Since such knowledge is not visible in the data acquired, we refer to it as deep knowledge. The deep knowledge discovered is leveraged to construct machine learning models for sequence-based R2R-I prediction, without trial-and-error combination of features over external knowledge of sequences. Our R2R-I predictor was validated under stringent leave-one-complex-out-alone cross-validation on a benchmark dataset, and was demonstrated to perform better than an existing sequence-based R2R-I predictor by 28% (p: 1.9E-08). P2K is accessible via our web server at https://p2k.uwaterloo.ca.


Introduction
In Protein-Protein interaction (PPI), residue-residue interaction (R2R-I) prediction refers to the identification of pairs of interacting residues, usually under close contact, residing on separate interacting proteins. Identification of R2R-I is critical as it enhances our understanding of PPI and furnishes potential targets for inhibiting the PPI 1 . One example of the utility of R2R-I is the use of small molecules to inhibit the interaction between p53 and MDM2 2 , a potential cancer treatment. However, despite its importance, the identification of R2R-I in PPI is hampered by expensive, labor-intensive and time-consuming experiments, such as X-ray crystallography, nuclear magnetic resonance or mutagenesis assays 3 .
Over the years, computational R2R-I prediction methods have been developed. However, their application is still limited as these methods often require additional data beyond sequence information. Methods 4,5 based on computational docking need unbound structures or their template-based structures 6 . Methods based on the co-evolution conjecture require homologous sequences of the given protein sequences 7,8 to conduct multiple sequence alignment (MSA). Methods 9,10 based on motifs rely on external motif databases. Methods 11,12 based on interaction profiles of Hidden Markov Models (imHMMs) 13 , which describe domain-domain interaction, have also been used to predict R2R-I in PPI, where these imHMMs are obtained from an external database termed 3DID 14 .
On the other hand, there are a few methods 3,15 requiring only sequence information. However, existing methods based on sequence information 3,15 rely on feature engineering over external knowledge of the sequences, such as using external software to obtain features of the input sequences. The key drawback is that a large amount of time is needed in the trial-and-error process of engineering features, i.e. selecting the optimum combinations of features 16 (e.g. surface accessibility, hydrophobicity, charge…). There is a recent review 16 on the commonly used features for R2R-I prediction. One recent attempt 17 is to use different R2R-I predictors for different types of PPI, but it requires the users to have prior knowledge of the type of PPI of the input protein sequences. Another recent attempt 18 is to adopt deep learning using a graph convolutional neural network, where advanced programming skills and high-end graphical processing units are required during its development.
Our method, Pattern-to-Knowledge (P2K), moves in a new direction: it predicts R2R-I between two proteins based only on sequence information, discovering deep knowledge from R2R contact (R2R-C) data and using that knowledge to acquire features for building machine learning models. The entire process requires neither time-consuming feature engineering over external knowledge of the sequences nor high-end hardware equipment during the construction of the machine learning models.

P2K Overview with Illustrative Results on Dataset 618
P2K is a software system that discovers deep knowledge from R2R-C data in PPI structures acquired from the Protein Data Bank (PDB) 19 and leverages that knowledge to construct a sequence-based R2R-I predictor. What exactly is deep knowledge? From a general perspective, deep knowledge refers to physical occurrences that are masked or entangled by subtle or unknown factors. In other words, it is subtle, non-obvious knowledge that cannot be obtained from a superficial inspection of the data: certain happenings observed in the data can be misleading because they are masked or overwhelmed by happenings caused by other unknown or multiple entwining factors. For example, two positively charged residues should not interact, yet their close contacts are frequently recorded in the surface values of the R2R-C data (Fig. S1.2). This implies that, quite frequently, their close contacts observed in 3-D space are caused by other nearby stereo-structural or interacting factors. However, the fact that residues of opposite charge attract and residues of the same charge repel still governs R2R-I and should be embedded somewhere in the data. Such statistics, if disentangled, could be unveiled, extracted and used for predicting R2R-I. Can such a deeper level of knowledge, not obvious in the observed data, be discovered and disentangled from the statistics inherent in the given data? This question motivated the development of P2K. Fig. S1.1 furnishes a schematic overview of P2K, with procedures represented by arrows and procedure outcomes presented in blocks. The definitions are given in Section 3 of Supplement Note 1. A detailed exposition of the major steps, corresponding to the labelled steps in Fig. S1.1, is given below.
(1) Data Acquisition. R2R-C data in PPI structures are acquired from the Protein Data Bank (PDB) 19 . 618 non-redundant PPI structures used in a previous study 20 were acquired from PDB. Previous studies 3,15,20 suggested that two residues are considered an R2R contact pair (R2R-C) (Definition 1) if their contact distance is below 6Å 19 . 17,278 R2R-C pairs were acquired from the 618 non-redundant 3D PPI complexes collected in a previous study 20 and this dataset was designated as Dataset 618.
(2) R2R Contact Frequency Matrix (R2RCFM) Construction. R2RCFM is constructed from the frequency count of contact between residues obtained from the R2R-C data. Each contact frequency in R2RCFM is converted into a statistical residual (SR) accounting for the deviation of that frequency from what would be expected if the contact were a random happening. We denote the matrix after conversion as the R2R Statistical Residual Matrix (R2RSRM). For mathematical transformation, R2RSRM is also considered as an SR vector space (SRV) such that each row denotes a residue-vector (r-vector) with coordinates representing the SR of that residue interacting with other residues.
(3) Statistical Residual Vector Space (SRV) Conversion. To extract the deviation from randomness, the R2RCFM is converted to a matrix by replacing each matrix entry with a statistical measure, known as the statistical residual (SR) (Definition 4), which accounts for the deviation of the observed frequency of the R2R-C pair from what would be expected if it were a random happening. We refer to this matrix as the R2R Statistical Residual Matrix (R2RSRM) (Definition 5). For later mathematical transformation to reveal deep knowledge, we treat R2RSRM as a vector space, called the Statistical Residual Vector Space (SRV) (Definition 6), such that each row is considered as a residue-vector (r-vector) (Definition 7), in which each vector coordinate represents the SR of that residue interacting with the residue corresponding to the column. Fig. S1.3 provides an illustration of SRV.
Nevertheless, at this stage, we still observe a problem in the SRV. While the 15.35 value of SR (>> 1.96) between C-C (disulfide bond) is reasonable, the SR value of 0.73 between R-R is questionable because R is positively charged and its SR should be below -1.96.
(4) Principal Component Decomposition (PCD) and Projection. PC Decomposition (PCD) 21 (Definition 8) was conducted on SRV to obtain PCs 21 sorted by their eigenvalues in descending order. PCD is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called Principal Components (PCs). We use PCD to sort out correlated r-vectors with strong association, and later use the RSRV to reveal the SR of their R2R-I associations with other residues as captured by each PC. We selected the top 6 PCs so that the total data variance coverage was almost 80% (Definition 8). We then conducted PC Projection (Definition 9) of the r-vectors in SRV onto these 6 PCs, shown in the top panels of Fig. S1.4 (a)-(f). We observed that each PC reveals a type of molecular interaction property of residues. For example, PC5 reveals whether a residue is charged. As shown in the top panel of Fig. S1.4 (e), the distinctive groups discovered are the green group (R) and the red group (W, E, D). Note that residue R is positively charged and residues E and D are negatively charged. While residue W is not usually listed as negatively charged, a recent finding 22 reported that its surface is negatively charged. We also observed that the green group (R) is projected at the left end, while the red group (W, E, D) is projected at the right end. This indicates that the projected coordinates reflect (through the correlated SR coordinates in the SRV) the statistical interacting strength of the molecular interaction property, i.e. whether the charged residue is positively or negatively charged.
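Step (4) can be sketched in a few lines of NumPy. This is a minimal illustration, not the P2K implementation: the function name `pc_decomposition`, the use of the column-centered covariance matrix, and the rule "smallest number of PCs whose cumulative variance reaches the requested coverage" are assumptions made for the example (the text only states that the top 6 PCs cover almost 80% of the variance).

```python
import numpy as np

def pc_decomposition(srv, coverage=0.8):
    """Hypothetical helper: PCA of an SRV matrix (rows = r-vectors).

    Returns the selected PCs (as columns), their eigenvalues and the
    projections of every r-vector onto those PCs.
    """
    srv = np.asarray(srv, dtype=float)
    centered = srv - srv.mean(axis=0)           # subtract column means
    cov = np.cov(centered, rowvar=False)        # covariance between columns
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric matrices
    order = np.argsort(eigvals)[::-1]           # sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()  # cumulative variance coverage
    k = int(np.searchsorted(ratio, coverage) + 1)  # assumed selection rule
    projections = centered @ eigvecs[:, :k]     # one column per selected PC
    return eigvecs[:, :k], eigvals[:k], projections
```

Each column of `projections` corresponds to one top panel of Fig. S1.4: residues whose coordinate lies far from zero are the ones singled out by that PC.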

(5) Construction of Re-projected SRV (RSRV).
On the PC, we observe that the projection of an r-vector, if located far away from the mean (zero), signifies its strong SR association with other residues, but we do not know exactly which one(s) or the corresponding SR values. For example, a distinct group of residues such as (W, E, D), with projections of their r-vectors on PC5 close to each other in Fig. S1.4 (e), suggests that their correlated R2R-I are strong, but we do not know the SR strength of the R2R-I pairs. To find that out, a re-projection procedure (Definition 10) is introduced. It maps the r-vector projections on the PC back to the SRV with new SR coordinates reflecting the R2R-I captured by the PC. We refer to this SRV as the Re-projected SRV (RSRV) (Definition 10). As shown in the bottom panels of Fig. S1.4 (a)-(f), RSRV1 to RSRV6 were obtained by mapping the r-vector projections on PC1 to PC6 (the top panels in Fig. S1.4 (a)-(f)) onto SRV, respectively. In Fig. S1.4 (e), the matrix in the bottom panel is the RSRV5 of PC5. We observed that R-R is negatively significant, while R-D, R-E and R-W are positively significant, consistent with our knowledge that residues of the same charge repel while those of opposite charge attract. While W, E and D are away from the mean and close to each other on PC5, the SR values of their interactions are succinctly revealed in RSRV5. In Fig. S1.4 (a) to (f), we observe that the statistical strength of specific R2R-I of various residue groups is reflected in the PC projections as well as in the SR values in the RSRV. Hence both PCs and RSRVs bring out the statistical significance and functional relevancy of R2R-I.

(6) R2R-I Predictor Construction.
Leveraging the projected coordinates in PCs and the re-projected coordinates in RSRVs, a sequence-based R2R-I predictor can be constructed. The predictor uses the deep knowledge discovered in the PCs and RSRVs to construct feature vectors (FVs) for machine learning (ML) based R2R-I prediction (see Section 5 for details).

(7) R2R-I Prediction.
Given two protein sequences, P2K constructs FVs for all residue pairs and inputs them to the R2R-I predictor to predict R2R-I (see Section 5 for details).

Definitions of P2K and Details of the Illustrative Experiment on Dataset 618
Definition 1 - Residue-Residue Contact (R2R-C). R2R-C refers to an event where two residues are considered to be in close contact in the 3D coordinate space, i.e. when the closest Euclidean distance between their C-Beta atoms (C-Alpha atoms for Gly (G)) 20 is less than 6Å 3,15,20 . Though other distance-based definitions have been used, the results are consistent with those obtained with the 6Å distance 23 . An R2R-C pair is a pair of residues $(r_i, r_j)$ in close contact in the 3D coordinate space, where $r_i, r_j \in \Sigma_R$.
From this R2RCFM (Figure S1.2), note that the frequency count of the C-C contact, which potentially attracts through a disulfide bond, is 43. This number is close to the value of 39 observed for the R-R contact, which potentially repels through positively charged electrostatic force. This observation indicates that the frequency count needs to be converted into a statistical measure for fair comparison.
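The contact rule of Definition 1 and the counting that leads to an R2RCFM can be sketched as follows. This is an illustration under stated assumptions: the residue alphabet is taken to be the 20 standard amino acids, each residue is represented by a single pre-extracted coordinate (C-Beta, or C-Alpha for Gly), and the names `is_contact` and `contact_frequency_matrix` are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue alphabet
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def is_contact(coord_a, coord_b, threshold=6.0):
    """True if the two representative atoms are closer than `threshold` angstroms."""
    diff = np.asarray(coord_a, dtype=float) - np.asarray(coord_b, dtype=float)
    return float(np.linalg.norm(diff)) < threshold

def contact_frequency_matrix(chain_a, chain_b, threshold=6.0):
    """Count inter-chain contacts per residue-type pair.

    chain_a, chain_b: lists of (one_letter_code, (x, y, z)) tuples, one
    entry per residue, from the two interacting proteins.
    """
    counts = np.zeros((len(AMINO_ACIDS), len(AMINO_ACIDS)), dtype=int)
    for aa_a, xyz_a in chain_a:
        for aa_b, xyz_b in chain_b:
            if is_contact(xyz_a, xyz_b, threshold):
                i, j = INDEX[aa_a], INDEX[aa_b]
                counts[i, j] += 1
                if i != j:
                    counts[j, i] += 1  # keep the matrix symmetric
    return counts
```

Summing the matrices returned for every complex in a dataset would give a frequency matrix of the kind shown in Figure S1.2.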
Definition 4 - Statistical residual (SR). The statistical residual $SR_{i,j}$ of a residue pair $(r_i, r_j)$ is defined as the deviation of the observed frequency of $(r_i, r_j)$ in R2R contact from the default case where the happening is random. Assuming the occurrence of contact follows a normal distribution, at the 95% confidence level the observed frequency of a residue pair $(r_i, r_j)$ in R2R contact is considered positively or negatively statistically significant if its SR is >1.96 or < -1.96 respectively, while those with SR between -1.96 and 1.96 are considered chance, random or irrelevant happenings. $SR_{i,j}$ is defined as
$$SR_{i,j} = \frac{o_{i,j} - N\,p(r_i, r_j)}{\sqrt{N\,p(r_i, r_j)\,\bigl(1 - p(r_i, r_j)\bigr)}}$$
where $o_{i,j}$ is the observed number of a residue pair $(r_i, r_j)$ in R2R contact, $N$ is the total observed number of residue pairs in R2R contact, and $p(r_i, r_j)$ is the probability of observing $(r_i, r_j)$ in R2R contact if it is a random happening. The formal derivation of $SR_{i,j}$ is provided in Section 4.
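As a rough numerical sketch of Definition 4, the function below converts a symmetric contact frequency matrix into adjusted residuals of the form (observed - expected) / sqrt(expected * (1 - p)). Two points are assumptions of this example rather than statements from the text: the single-residue probability is estimated as the share of contact "ends" occupied by that residue, and the function name `statistical_residuals` is hypothetical.

```python
import numpy as np

def statistical_residuals(freq):
    """Hypothetical helper: adjusted residuals from a symmetric count matrix.

    freq[i, j] is the number of contacts between residue types i and j,
    stored symmetrically (freq[i, j] == freq[j, i]).
    """
    freq = np.asarray(freq, dtype=float)
    n_types = freq.shape[0]
    # N: total number of unordered contact pairs (diagonal + upper triangle)
    total = np.trace(freq) + np.triu(freq, k=1).sum()
    # Each contact has two ends; a self-pair (i, i) contributes both ends to i.
    ends = freq.sum(axis=1) + np.diag(freq)
    p_res = ends / (2.0 * total)            # assumed estimate of p(r_i)
    p_pair = np.outer(p_res, p_res)         # p(r_i) * p(r_j)
    off_diagonal = ~np.eye(n_types, dtype=bool)
    p_pair[off_diagonal] *= 2.0             # (i, j) and (j, i) are pooled
    expected = total * p_pair
    # Note: a residue type that never occurs gives expected == 0 (NaN result).
    return (freq - expected) / np.sqrt(expected * (1.0 - p_pair))
```

Entries above 1.96 or below -1.96 would then be flagged as positively or negatively significant at the 95% level.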

Definition 5 - R2R Statistical Residual Matrix (R2RSRM). R2RSRM is a $|\Sigma_R| \times |\Sigma_R|$ matrix derived from R2RCFM such that $(\mathrm{R2RSRM})_{i,j} = SR_{i,j}$, the statistical residual of Definition 4.
Figure S1.3. Statistical Residual Vector Space (SRV). The figure was obtained from the R2RCFM in Figure S1.2 by converting each of its frequency counts into a Statistical Residual (SR). Entries with SR > 1.96 are colored yellow and entries with SR < -1.96 are colored green. The 15.35 value of SR between C-C (>> 1.96) is reasonable, but the SR value of 0.73 between R-R is questionable since R, a positively charged residue, is unlikely to interact with another positive residue R. Thus, we would expect its SR to be less than -1.96. Here we speculated that there are additional factors influencing this SRV, such as multiple types of physicochemical binding forces adding up to bring two residues into close contact, and external binding forces brought by water molecules and/or physicochemical entanglement. This motivated the use of Principal Component Decomposition to disentangle the underlying factors.
$PC_k$ is the $k$th eigenvector, represented as a column vector with $|\Sigma_R|$ rows, and $\lambda_k$ is its corresponding eigenvalue. The eigenvectors are sorted in descending order of their eigenvalues, i.e. $\lambda_k \geq \lambda_{k+1}$. In practice, we determine an appropriate value of $k$ such that the data variance coverage is almost 0.8.
Definition 9 - PC Projection. Given SRV and its $k$th PC, $PC_k$, we can obtain a column vector with $|\Sigma_R|$ real values 21 , where each real value represents the projected coordinate (corresponding to the $k$th PC of SRV) of an r-vector in SRV. This column vector is also denoted as the $k$th PC projection, containing the projections of all the r-vectors in SRV.
Here, $\overline{SRV}$ is defined as a matrix of dimension $|\Sigma_R| \times |\Sigma_R|$, where each matrix entry $\overline{SRV}_{1,j} = \overline{SRV}_{2,j} = \cdots = \overline{SRV}_{|\Sigma_R|,j}$ is set as the mean of the $j$-th column of SRV, for $j = 1, 2, \ldots, |\Sigma_R|$.
We also denote the matrix of dimension $|\Sigma_R| \times |\Sigma_R|$ obtained by the re-projection of the $k$th PC projection as the $k$th re-projected SRV ($RSRV_k$).
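The re-projection step can be illustrated with the usual single-component PCA reconstruction: project the centered r-vectors onto one PC, map the projections back along that PC, and add the column means back. That this matches the paper's re-projection exactly is an assumption of the sketch, as are the function name `reproject` and the zero-based PC index.

```python
import numpy as np

def reproject(srv, k):
    """Hypothetical helper: re-projected SRV for the k-th PC (k = 0 is PC1).

    Computes outer(projection_k, PC_k) + column means, i.e. the part of
    SRV explained by a single principal component.
    """
    srv = np.asarray(srv, dtype=float)
    mean = srv.mean(axis=0)                        # column means of SRV
    centered = srv - mean
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]              # descending eigenvalues
    pc = eigvecs[:, order[k]]                      # k-th principal component
    projection = centered @ pc                     # one coordinate per r-vector
    return np.outer(projection, pc) + mean         # back in SRV coordinates
```

A useful sanity check of this construction is that the centered re-projections over all PCs add up to the centered SRV, so each RSRV isolates the contribution of one PC.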

Formal derivation of the statistical residual of the observed frequency of a residue pair
As stated in Definition 4, the SR, $SR_{i,j}$, is a statistical residual that accounts for the deviation of the observed frequency of a residue pair $(r_i, r_j)$ in R2R contact from what would be expected if the contact were a random happening. Technically, it takes the form of an adjusted standard residual 24 . At the 95% confidence level, the observed frequency of a residue pair $(r_i, r_j)$ in R2R contact is positively or negatively statistically significant if its SR is >1.96 or < -1.96 respectively. If the SR lies strictly between -1.96 and 1.96, the observed frequency of the residue pair $(r_i, r_j)$ in R2R contact is statistically considered a chance, random or irrelevant happening. $SR_{i,j}$ is defined as
$$SR_{i,j} = \frac{o_{i,j} - N\,p(r_i, r_j)}{\sqrt{N\,p(r_i, r_j)\,\bigl(1 - p(r_i, r_j)\bigr)}}$$
where $o_{i,j}$ is the observed number of the residue pair $(r_i, r_j)$ in R2R contact and $N$ is the total observed number of residue pairs in R2R contact. Here we provide a formal derivation of $SR_{i,j}$, leveraging the (adjusted) standard residual 24 . For easier presentation, we refer to $r_i$ as the $i$th element of the residue alphabet set $\Sigma_R$. We first denote by $o_i$ the observed frequency of residue pairs in R2R contact involving $r_i$, and by $p(r_i)$ the probability of observing a residue pair in R2R contact involving $r_i$. Next, if the occurrence of a residue pair in R2R contact is random, we define the probability of observing a residue pair $(r_i, r_j)$ in R2R contact as:
$$p(r_i, r_j) = \begin{cases} p(r_i)\,p(r_j), & i = j \\ 2\,p(r_i)\,p(r_j), & i \neq j \end{cases}$$
In the case of $i \neq j$, the frequency counts of the two residue pairs $(r_i, r_j)$ and $(r_j, r_i)$ in R2R contact are summed and considered to be the frequency count of the residue pair $(r_i, r_j)$ in R2R contact. Hence, in calculating the probability of observing a residue pair $(r_i, r_j)$ in R2R contact, we need to sum the probabilities of observing the two residue pairs $(r_i, r_j)$ and $(r_j, r_i)$ in R2R contact.
Here, the set of all mutually exclusive residue pairs is $S = \{(r_i, r_j) \mid 1 \leq i \leq |\Sigma_R|,\ 1 \leq j \leq i\}$. Note that each type of residue pair is mutually independent. Let $X_{i,j}$ be a random variable representing the occurrence of a residue pair $(r_i, r_j)$ in R2R contact. The standard residual 24 , $z_{i,j}$, is thus defined as follows:
$$z_{i,j} = \frac{o_{i,j} - E[X_{i,j}]}{\sqrt{E[X_{i,j}]}}$$
We model the random process of observations of residue pairs in R2R contact with $|S|$ types as if extracting balls (with $m = |S|$ types) from a bag with replacement. In other words, it is a multinomial distribution, so that $E[X_{i,j}] = N\,p(r_i, r_j)$.
$z_{i,j}$ is thus written as follows:
$$z_{i,j} = \frac{o_{i,j} - N\,p(r_i, r_j)}{\sqrt{N\,p(r_i, r_j)}}$$
For a more precise analysis, the standard residual has to be adjusted by its variance 24 . The adjusted standard residual, $d_{i,j}$, is defined as:
$$d_{i,j} = \frac{z_{i,j}}{\sqrt{\mathrm{Var}(z_{i,j})}}$$
We compute $\mathrm{Var}(z_{i,j})$ as follows:
$$\mathrm{Var}(z_{i,j}) = \frac{\mathrm{Var}(X_{i,j})}{N\,p(r_i, r_j)} = \frac{N\,p(r_i, r_j)\,\bigl(1 - p(r_i, r_j)\bigr)}{N\,p(r_i, r_j)} = 1 - p(r_i, r_j)$$
Hence, the adjusted standard residual is derived as:
$$d_{i,j} = \frac{z_{i,j}}{\sqrt{1 - p(r_i, r_j)}}$$
Putting it all together, the adjusted standard residual is derived as:
$$SR_{i,j} = d_{i,j} = \frac{o_{i,j} - N\,p(r_i, r_j)}{\sqrt{N\,p(r_i, r_j)\,\bigl(1 - p(r_i, r_j)\bigr)}}$$

R2R-I Predictor Construction and Operation
Given two protein sequences, a sequence based R2R-I prediction is the process of predicting all R2R-I between two interacting proteins. An illustration is given in Fig. S1.5.
R2R-I refers to a residue in one protein interacting with a residue in another protein. R2R-I usually occurs when the two residues are in close contact. Previous studies 3,15,20 suggested that if the contact distance between two residues is below a certain threshold (such as 6Å 3,15,20 ), R2R-I is likely to occur. Hence, in our machine learning experiments, all R2R-C pairs, i.e. those with contact distance < 6Å, were marked as positive (+ve) R2R-I, and all others (≥ 6Å) as negative (-ve) R2R-I. For each PPI pair with only sequence information available, all residue pairs between the two proteins are denoted as candidate R2R-I.

R2R-I Predictor Construction (Training phase)
Intrinsically, given two residues located on two protein sequences, the R2R-I predictor is a binary classifier to predict whether they can interact. Its construction (Fig. S1.6) is described as below.


(1) Collection of Positive/Negative R2R-I in PPI Complexes from PDB for Prediction Experiments.
R2R pairs (a pair of residues residing on two different protein sequences) in PPI complexes were obtained from the PDB. In our experiments, we used the protein-protein docking benchmark dataset (DBD) version 4.0 20 . Previous studies 1,2,20 suggested that if the contact distance between two residues is below a threshold (such as 6Å 1,2,20 ), R2R-I is likely to occur in the R2R pair. Hence, in our experiments, all R2R pairs with contact distance < 6Å were marked as having positive (+ve) R2R-I, while those with contact distance ≥ 6Å were considered to have negative (-ve) R2R-I.
(2) Acquisition of Deep Knowledge Discovered via P2K. We first obtained via P2K the deep knowledge, consisting of PCs and their corresponding RSRVs with top variance, on the +ve R2R-I pairs. In this study, the top six PCs with a total variance of approximately 80% were selected to direct the predictor construction.

(3) Extraction of Neighbors of Positive/Negative R2R-I Pairs.
For each positive/negative R2R-I pair chosen in step 1, we exploited the R2R-I knowledge of its neighbors via feature vector construction, following common practice in machine learning.
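One way to picture feature vector construction with neighbors is sketched below. The text does not specify the exact scheme, so everything concrete here is an assumption for illustration only: the window of sequence neighbors, the zero-padding past sequence ends, the choice of features (PC projections of both residues plus their RSRV entry for each PC), and the names `pair_features`, `projections` and `rsrvs`.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue alphabet
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def pair_features(seq_a, pos_a, seq_b, pos_b, projections, rsrvs, window=1):
    """Hypothetical feature vector for the residue pair (seq_a[pos_a], seq_b[pos_b]).

    projections: array (20, K), PC projection of each residue type.
    rsrvs:       array (K, 20, 20), one re-projected SRV per PC.
    For the pair and each neighboring pair within `window` positions, the
    features are both residues' projections and their RSRV entries.
    """
    n_pcs = projections.shape[1]
    blocks = []
    for offset in range(-window, window + 1):
        ia, ib = pos_a + offset, pos_b + offset
        if 0 <= ia < len(seq_a) and 0 <= ib < len(seq_b):
            a, b = INDEX[seq_a[ia]], INDEX[seq_b[ib]]
            block = np.concatenate([projections[a], projections[b], rsrvs[:, a, b]])
        else:
            block = np.zeros(3 * n_pcs)  # zero-pad past the sequence ends
        blocks.append(block)
    return np.concatenate(blocks)
```

With six PCs and `window=1`, this yields a vector of 3 x 18 = 54 values per candidate pair; during training each vector would carry a +ve or -ve class label.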

R2R-I Predictor Operation in the Testing phase.
The constructed R2R-I predictor can be used for predicting R2R-I between two input protein sequences. This testing phase is described in the following steps.
(7) Protein Pairs for Testing. Given two input protein sequences, we took all R2R pairs between them for R2R-I prediction.
(8) Obtain and Input the FVs for Prediction. Each R2R pair between the two proteins was transformed into an FV (like that in Fig. S1.7 (b)), but without class labels. We inputted these FVs into the R2R-I predictor.
(9) The R2R-I predictor then assigned each FV a score. We then outputted the R2R pairs with top scores as the predicted R2R-Is. An illustration is given in Fig. S1.8 in Supplement Note 1.
An illustrative example is provided in Fig. S1.8. A pair of input proteins, say Protein A: SC and Protein B: KLC, is given. All R2R pairs between them are enumerated: (S, K), (S, L), (S, C), (C, K), (C, L) and (C, C). Each of these R2R pairs is then transformed into an FV, and the FVs are inputted to the R2R-I predictor to obtain a score for each pair. The R2R pairs with top scores are outputted.
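The enumeration in this example is simply the Cartesian product of the residue positions of the two sequences. A minimal sketch follows; the function name `enumerate_r2r_pairs` is hypothetical.

```python
from itertools import product

def enumerate_r2r_pairs(seq_a, seq_b):
    """List every candidate R2R pair as (pos_a, residue_a, pos_b, residue_b)."""
    return [
        (i, a, j, b)
        for (i, a), (j, b) in product(enumerate(seq_a), enumerate(seq_b))
    ]

pairs = enumerate_r2r_pairs("SC", "KLC")
letters = [(a, b) for _, a, _, b in pairs]
# letters == [('S','K'), ('S','L'), ('S','C'), ('C','K'), ('C','L'), ('C','C')]
```

Positions are kept alongside the residue letters because the same residue type can occur several times in a real sequence.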

Illustration of the Input and Output of the R2R-I Predictor.
A real case study is conducted on the target complex 1GL1-A:I, which is one of the 52 PPI complexes newly introduced in protein-protein docking benchmark dataset version 4.0 (abbreviated as DBD 4.0 4 ).

Input:
The R2R-I predictor takes two protein sequences as input. Hence, in the case study, only the two protein sequence chains of the target complex 1GL1-A:I were inputted into the R2R-I predictor. It should be noted that the R2R-I predictor does not require users to input any structures.

Output:
The R2R-I predictor outputs a score for every pair of residues between the two protein sequences. The higher the score, the more likely the pair is to interact. The score lies between 0 and 1 inclusive. The top 10 outputs of R2R-I prediction on the two protein sequence chains of the target complex 1GL1-A:I are shown in Table S1.2.
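Selecting the top-scoring pairs from the predictor's output amounts to a sort on the score. The sketch below assumes the scores are already available as (pair, score) tuples; the function name `top_predictions` and the toy scores in the usage lines are invented for illustration and are not results from the paper.

```python
def top_predictions(scored_pairs, n=10):
    """Return the n highest-scoring (pair, score) tuples, best first."""
    for _, score in scored_pairs:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are expected to lie in [0, 1]")
    return sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:n]

# Toy usage with made-up scores:
toy = [(("S", "K"), 0.2), (("C", "C"), 0.9), (("S", "L"), 0.5)]
best = top_predictions(toy, n=2)
```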

Supplement Note 2: Related work
In this study, R2R-I refers to residues binding between two separate proteins in PPI. In practice, the binding in R2R-I is often considered that the residues in the two proteins are in close contact 1,2 . A high-level definition of R2R-I prediction is: "Given two proteins A and B, predict which residues in protein A in close contact with residues in protein B, assuming proteins A and B can interact" 1,2 . Such prediction is also referred to as partner-specific R2R-I prediction. There are four types of computational methods geared towards predicting R2R-I.
The first type is computational docking 3 , which requires unbound structures of the target proteins to simulate whether they can interact based on physicochemical properties such as shape complementarity, electrostatics, and biochemical information 4 . However, computational docking is applicable only if the protein structures are available. It should be noted that not many protein sequences have corresponding unbound structures: the structure-to-sequence ratio was found to be 0.13% 5 . Recently, template-based methods have been described 6 that map protein sequences to homologous structures to perform docking.
The second type 7,8 is based on the co-evolution conjecture, which requires external software such as PSI-Blast 9 to create a multiple sequence alignment (MSA) separately for both proteins. It conjectures that, in PPI between two protein chains, mutations on one protein chain are often compensated by correlated mutations in the other protein chain. By creating an MSA separately for both proteins, statistically associated columns are then predicted to be in spatial proximity. It should be noted that the quality of the MSA relies on whether or not there is adequate sequence homology.
The third type 10,11 leverages statistical measures derived from external protein databases. InSite 10 takes a library of conserved sequence motifs and a library of motif-motif interaction as input to predict site-to-site interaction between proteins. Sites are considered as interacting if their motifs are similar to interacting motifs in the motif interacting library. PIPE-Sites 11 takes a dataset of PPI as input and predicts site-to-site interaction if those sites are found to be frequently co-occurring in the dataset; the effectiveness of this method depends on the quality of the database. It should be noted that in this context 10,11 , a site is referred to as a region on a protein sequence instead of between two interacting residues. Hence, an R2R-I pair is unable to be pinpointed by these methods.
The fourth type 1,2,12,13 is based on machine learning. Taking a dataset of PPI complexes as input for both interacting pairs and non-interacting pairs in building the prediction models, the machine learning method first derives a variety of features, either from structures or from an MSA created with the external software PSI-Blast 9 . Prediction is then achieved if the same type of features can be extracted from the two input proteins. A method 2 is structure-based if it requires structures of the two input proteins. On the other hand, a method 1,2 is sequence-based if it requires only sequences of the two input proteins. Recently, methods 12,13 that leverage interaction profiles of Hidden Markov Models (imHMMs) 14 , which describe domain-domain interaction, have been used to predict R2R-I in PPI, where these imHMMs are obtained from an external database termed 3DID 15 . All these existing methods based on sequence information 1,2 rely on feature engineering over external knowledge of the sequences, such as using external software to obtain features of the input sequences. The key drawback is that a large amount of time is needed for trial-and-error in the engineering of features, i.e. selecting the optimum combinations of features 16 (e.g. surface accessibility, hydrophobicity, charge…). There is a recent review 16 on the commonly used features for R2R-I prediction. One recent attempt 17 is to use different R2R-I predictors for different types of PPI, but it requires the users to have prior knowledge of the type of PPI of the input protein sequences. Another recent attempt 18 is to adopt deep learning using a graph convolutional neural network, where high-end graphical processing units are required during its development.
In this study, Pattern-to-Knowledge (P2K) moves in a new direction to predict R2R-I between two proteins based only on sequence information, leveraging the deep knowledge discovered from R2R contact (R2R-C) data as features to build machine learning models.
The entire process requires neither time-consuming feature engineering over external knowledge of the sequences nor high-end hardware equipment during the construction of the machine learning models.
In this paper, we compared P2K with PPiPP 1 in our experiments because it is the closest counterpart available: it 1) requires only sequence input; 2) does not require the users to have prior information about the type of PPI of the input protein sequences; 3) does not require high-end graphical processing units in either constructing or using the machine learning models; and 4) provides an easy-to-use web interface. PPiPP 1 is the only available existing software meeting these four criteria.

Supplement Note 3: Exploratory Experiment on Dataset 618
While the main text furnishes a brief analysis of the exploratory experiment on Dataset 618 (Definition 2, Supplement Note 1), this document provides a comprehensive analysis. For ease of understanding and completeness, we restate certain parts as presented in the main text.
The structure of Supplement Note 3 is described as below.

Introduction
The projections on PCs 1-6, covering about 80% of the data variance of SRV, are shown in Fig. S3.1-3.6 (i). For each PC, the projections on the negative side that lie at least one threshold distance away from the mean are colored green, while those on the positive side that lie at least one threshold distance away from the mean are colored red. The remaining projections are colored blue. The threshold is chosen to be the standard deviation of the projections on PC1. The corresponding RSRVs are shown in Fig. S3.1-3.6 (ii). The yellow-shaded values signify positive statistical significance at the 95% confidence level (>1.96), while green-shaded values signify negative statistical significance at the 95% confidence level (<-1.96). Unshaded values signify statistical insignificance. We enclosed the significant residue(s) (r-vector(s)) in the PC and the corresponding RSRVs in boxes (referred to as C-boxes) of the same color.
For each significant residue in the PCs, its statistical R2R-I preference in RSRVs is enclosed by boxes with the same colored border as the one enclosing it in the PC. We denote these boxes with colored borders as colored boxes (C-boxes).
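The coloring rule for the projections can be written as a small thresholding function. This is a sketch: the function name `color_projections` is hypothetical, and treating a projection exactly at the threshold as significant is an assumption.

```python
import numpy as np

def color_projections(projections, threshold):
    """Label each projection green, red or blue relative to the mean.

    green: at least `threshold` below the mean
    red:   at least `threshold` above the mean
    blue:  everything else
    """
    projections = np.asarray(projections, dtype=float)
    centered = projections - projections.mean()
    colors = np.full(projections.shape, "blue", dtype=object)
    colors[centered <= -threshold] = "green"
    colors[centered >= threshold] = "red"
    return colors
```

Following the text, the same `threshold` would be used for all six PCs, for instance `np.std(pc1_projections)` where `pc1_projections` holds the projections on PC1 (whether the population or sample standard deviation was used is not stated).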
Though most of the distinct residues in the top 6 PCs are found significant in only one PC and its corresponding RSRV, we noted some interesting exceptions (PC3 and RSRV3). We also noticed that a few residues (like V, E and W) appear in more than one RSRV, an indication of multiple interacting functionality in the 3-D environment. Here we list the major findings. Experimental and observational details are elaborated later in the discussion.

Analysis of the projections of r-vectors on PC1 and their re-projections on RSRV1
The projections of r-vectors in SRV on PC1, covering a variance of 31.51%, are shown in Fig. S3.1 (i). We found that the green groups (L V I F M A) are associated with a high hydrophobicity scale (or hydropathy index) 1 and the red groups (T S N D) are associated with a low hydrophobicity scale. The higher the hydrophobicity scale a residue has, the more hydrophobic it is 1 . We also enclosed M with a purple C-box, as M is close to the green group and the hydrophobicity scale of M is also relatively high. Table S3.1 summarizes the hydrophobicity scale of the residues in these two groups. It should be noted that hydrophobicity reflects the tendency for a residue to be found on the surface of a protein 2,3 . Studies reported that hydrophobicity plays an important role in intermolecular recognition processes 4 such as PPI 5 , and hydrophobicity has been used for predicting protein interface residues [6][7][8][9] . When we compared the projection coordinates of this group of residues in PC1, a disentangled statistical/functional subspace, with their hydrophobicity scale in Table S3.1 taken from the literature 1 , we were surprised to find that the statistical strength and the functional strength astoundingly follow the same order. This is a perfect case to show that physical knowledge could be reflected by statistics obtained from disentangled subspace from R2R-C data.
RSRV1 shows the re-projections of the projections of r-vectors in SRV on PC1 in Fig. S3.1 (ii). We observed that the members of each group are statistically preferred by members of their own group but not by those of the other group. The positive significant statistical association among the red group (D S T N) could be attributed to hydrophilic-hydrophilic interaction 10 . Likewise, the association among the green group (L V I F M A) could be attributed to hydrophobic-hydrophobic interaction 10 . It has also been reported that surface patches with high hydrophobicity are energetically unfavorable in an aqueous solution, but favorable when in contact with other hydrophobic surfaces 11 . We would also like to point out that, as shown in the SRV (Fig. 3), the SR of S-T is 1.23 (insignificant), whereas in RSRV1 (Fig. S3.1 (ii)) the SRs of S-T and T-S are 2.02 (significant) and 1.91 (almost significant), respectively. This indicates that disentanglement helps to reveal that S-T does interact in a hydrophilic setting.
Figure S3.1 (i). The projections of r-vectors in SRV on PC1 with variance = 31.51%. We observed two groups: hydrophilic (red) and hydrophobic (green). (ii) In RSRV1, those SRs that are positively statistically significant are colored yellow, and those that are negatively significant are colored green. In this RSRV, residues in each group interact only with members of their own group but not with members of the other group; all the SRs not in these two groups have low values since they are statistically insignificant and/or functionally orthogonal to the corresponding PC.
The Pearson correlation between the PC1 projection and the hydropathy scale has been computed. The value of R is -0.8063, indicating a strong negative correlation. In other words, the projected coordinates in PC1 are negatively correlated with the hydropathy scale. The value of R^2, the coefficient of determination, is 0.6501. The P-value of the Pearson correlation is 1.8E-05 < 0.05. This indicates that the correlation between the PC1 projection and the hydropathy scale is statistically significant. In addition, the Spearman correlation between the PC1 projection and the hydropathy scale has a value of R = -0.71774 and the two-tailed P-value is 3.7E-04 < 0.05.
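The relation between R and R^2 reported above can be checked with a short computation. The following is a minimal sketch in plain Python; the function name pearson_r and the toy values are illustrative assumptions and are not the data of Table S3.1.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values (assumption): NOT the PC1 projections or the
# hydropathy indices from the paper, only a stand-in with opposite trends.
pc1_projection = [2.0, 1.0, 0.0, -1.0, -2.0]
hydropathy = [-3.5, -0.8, 0.1, 2.8, 4.5]

r = pearson_r(pc1_projection, hydropathy)
r_squared = r ** 2  # coefficient of determination for a simple linear fit

# Consistency check on the reported figures: R = -0.8063 implies R^2 of about 0.6501.
reported_r_squared = (-0.8063) ** 2
```

The last line only confirms that the two reported values agree with each other; the P-values would additionally require the distribution of R under the null hypothesis, which is not reproduced here.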

Analysis of the projections of r-vectors on PC2 and their re-projections on RSRV2
The projections of r-vectors in SRV on PC2, covering a data variance of 14.77%, are shown in Fig. S3.2 (i). We observed that the projection of C (in green) is highly distinctive. We also enclosed the projection of S in purple. In RSRV2 (Fig. S3.2 (ii)), the SRs of R2R pairs C-C and C-S are both positively significant, and the C-C pair is also highly distinctive. This could be attributed to the fact that C-C could form a disulfide bond, which plays a major role in PPI 12 . This is also consistent with the observation in a previous study 10 . In addition, the finding that C is likely to interact with S via a C-S hydrogen bond (H-bond) was reported only in a recent study 13 . We observed that C and S are not very close to each other in PC2. This suggests that they may not be that close in functionality. Note that the SR of C-S in SRV as shown in Fig. 3 is 1.53, which shows no statistical significance, but its significance is revealed in RSRV2, where the SR is 2.47. Furthermore, this PC yields a very significant and revealing observation. We found that the projected coordinates of the r-vectors on it are distinct from those on other PCs in that there is no residue with a projected coordinate opposite that of "C" on the other side of the mean. This result suggests that the disulfide bond may be the dominant molecular force between C and C, with no obvious counterparts. Since this is the natural result of functional disentanglement, it provides another case showing that deep knowledge can be unveiled by P2K from R2R-C data. It also furnishes additional information that biochemists and other scientists can use to acquire a more complete picture of R2R-I.
Figure S3.2 (i). The projections of r-vectors in SRV on PC2 cover a data variance of 14.77%. There are two distinctive projected r-vectors corresponding to C and S (green and purple); (ii) In RSRV2, only C-C (distinct) and C-S (less distinct) are boxed. While C interacting with C via a disulfide bond is known to be important in PPI 12 , the finding that C is able to interact with S was confirmed only in a recent study 13 . Both C and S are on the same side in the PC box, as they are related through their interaction with C and they are not opposite R2R-I. However, the finding that they are not very close in the PC might suggest that they may not be that close in functionality.

Analysis of the projections of r-vectors on PC3 and their re-projections on RSRV3
The projections of r-vectors in SRV on PC3, covering a data variance of 9.65%, are shown in Fig. S3.3 (i). We observed that there are two groups, the green group (A, V) and the red group (P, Y, W). We conjectured that these two groups correspond to aliphatic-hydrophobic and aromatic groups, respectively, where only residues within, but not between groups could interact 14 . In RSRV3 (Fig. S3.3 (ii)), most of the SRs in the C-boxes are slightly below 1.96, except for that between A-A, which is conjectured to be an aliphatic-aliphatic interaction 14 . We reasoned that the statistical preference between A and V could be attributed to aliphatic-hydrophobic interaction 14 , while those between Y and W could be attributed to aromatic-aromatic interaction 14 . As for proline (P), with SR > 1.62, we found that it can interact with aromatic residues through CH/π interaction 15 . As for isoleucine (I), although it is aliphatic, its SRs are low in PC3 and RSRV3. In RSRV3, we observed that only residues within, but not between groups, could interact. Note that the majority of their SRs are below 1.96 but above 1.62, indicating that their interactions are weaker. We conjectured that these two groups (A, V) and (P, Y, W) correspond to aliphatic-hydrophobic and aromatic groups, respectively 14 .

Analysis of the projections of r-vectors on PC4 and their re-projections on RSRV4
The projections of r-vectors in SRV on PC4, covering a data variance of 9.43%, are shown in Fig. S3.4 (i). We observed that there are two groups, green (H, L) and red (K, E, V), whose members could interact with their own group members but not with those of the other group. In RSRV4 (Fig. S3.4 (ii)), the SRs of V-V and V-E are strong (SR = 2.15 and 2.03, respectively). With SR = 2.77 and 2.70, respectively, we observed strong contact preferences of H-L and H-H. We speculated that the interactions within the (K, E, V) group and the (H, L) group are due to hydrogen bonding (H-bonding). For both groups, we found examples of 3D PPI complexes that support this speculation. For the (K, E, V) group, they are 1) 1pma-C159E and 1pma-D60E, 2) 1gua-A37E and 1gua-B69V, 3) 1hsl-A25E and 1hsl-B39K, 4) 1mdt-A13V and 1mdt-B13V, 5) 1cdc-A39V and 1cdc-B51K, 6) 1dyn-A15K and 1dyn-B15K; while the group (H, L) is related to another type of H-bond, as found in 1) 1bhm-A121H and 1bhm-B121H, and 2) 1sft-A5H and 1sft-B67L. Our reasoning is that H-bonds are essential in determining binding specificity 16,17 and provide favorable free energy for the binding 18 , while unfulfilled bonds, introduced by the presence of hydrogen bonding residues without a bonding partner, could destabilize binding 19 . The contrast in energetics contributes to a high selectivity in matching the H-bonds between proteins, and confers binding specificity. Hence, we conjectured that H-bonds determine the binding specificity within these groups 16 .
Figure S3.4 (i). The projections of r-vectors in SRV on PC4 covering a data variance of 9.43%. There are two conjectured groups: H-bond 1 (red) and H-bond 2 (green). (ii) RSRV4 shows that only members in the same group could interact but not those between groups.

Analysis of the projections of r-vectors on PC5 and their re-projections on RSRV5
The projections of r-vectors in SRV on PC5, covering a data variance of 7.30%, are shown in Fig. S3.5 (i). We observed that there are two distinctive groups: the green group (R) and the red group (W, E, D), which correspond to positive and negative charge, respectively. While the residue W is not usually listed as a residue with negative charge, a recent finding reported that its surface is negatively charged 15 , complying with the R2R-I knowledge (SR=2.08) we discovered. Note that the projection of K is close to that of R, and K also carries a positive charge. In RSRV5 (Fig. S3.5 (ii)), it is obvious that residues of opposite charges attract and those of the same charge repel (reflected by the yellow and green cells, respectively). We noticed that in PC5 and RSRV5, which are dominated by charged residues, the SR values for all other non-charged residues are insignificant and irrelevant, an obvious manifestation of the result of disentanglement.
Figure S3.5 (i). The projections of r-vectors in SRV on PC5 covering a data variance of 7.30%. There are two distinct groups: the green group (R) and the red group (W, E, D), which are found to have positive and negative charge, respectively. While the residue W is not listed as a residue with negative charge in most textbooks, it was identified as a negative residue due to its negative surface charge, as reported in a recent study 15 . (ii) In RSRV5, residues with opposite charges attract and those with the same charge repel (shown by SR in yellow and green cells, respectively).

Analysis of the projections of r-vectors on PC6 and their re-projections on RSRV6
The projections of r-vectors in SRV on PC6, covering a data variance of 6.97%, are shown in Fig. S3.6 (i). We observed that there are two distinctive groups, the green group (N) and the red group (G). In RSRV6 (Fig. S3.6 (ii)), G-G is statistically preferred and G-N is not. Though not significant, the SR of N-N is quite large (1.88). Note that it is reported that interaction involving G is not rationalized 10 .
Figure S3.6 (i). The projections of r-vectors in SRV on PC6 covering a data variance of 6.97%. There are two distinct groups: N (in green) and G (in red). (ii) In RSRV6, we observed that each group preferred to interact with others in the same group, but not with those in the other group.

Discussion
Here we discuss some exceptional and interesting cases that resulted from our observations above.
In the PC1 projection, we observed that there are two strong groups: the hydrophobic group (I, L, V, F, M, A) and the hydrophilic group (T, S, N, D). We observed that the last two residues in the hydrophobic group, apart from the M-L interaction, do not generally exhibit very strong interactions. The most amazing finding is that the PC coordinates of the residue projections in PC1 are astoundingly correlated with their hydrophobic indices (Table S3.1). Another surprising observation is related to the isoleucine residue (I).
While both the disentangled statistic results and the chemical findings show prominent associations, it is surprising to find in the report 20 that "its side chain is very non-reactive and rarely directly involved in protein function such as catalysis, but can be involved in binding/recognition." Hence further study seems to be necessary to validate our finding.
We observed that on the whole, the statistically distinctive residue projections on the PC comply with special R2R-I functionality, as reflected by their appearance in unique RSRVs. Examples are residues "C" in PC2/RSRV2 and most of the charged residues (W, D, R; except E in PC5). We also observed that a few other residues distinctly appear in more than one of the PCs and RSRVs. Examples are residue V, which appears in PC1 (hydrophobic contact) with (I, V, L) and PC4 (special H-bond) with (V, E); residue E with PC5 (charged) and PC4 (special H-bond) with (V, E); and residue W in PC5 (with negative surface charge) and in PC3 (aromatic contact) with (W, Y, P), though not very strong. These are indications of multiple interacting functionalities in the 3-D environment, and their statistical association with different functionality can be brought forth via disentanglement from the R2R-C data.
As mentioned in the Analysis of R2R-I of residues on PC2 and their re-projections on RSRV2, we found that the projected coordinates of the r-vectors C and S on PC2 are distinct from other PCs since there is no residue with a projected coordinate opposite to that of "C" on the other side of the mean, suggesting that the disulfide bond in C-C has no obvious counterparts. It also revealed the contact preference of S with C, which was not discovered until recently 13 . It is interesting to observe that the R2R-I of S with C is on the same side of the mean in PC2 but not too close to C, suggesting their R2R-I differences. These observations furnish additional information that biochemists and other investigators can use to acquire a more complete picture of R2R interactions.
Of all these experimental results, the most intriguing case can be found in PC3 and RSRV3. Although its eigenvalue is ranked third among all PCs, a closer look at the distribution of the projections on PC3 shows that the distribution is quite broad. This may explain its high eigenvalue. It looks as though the contribution to its high variance is actually due to the aromatic group (W, Y, P) and the aliphatic-hydrophobic residue A (due to strong A-A association). Nevertheless, their overall R2R-I SR values are relatively weak compared with other RSRVs. In RSRV3, the SR for V interacting with others in this space is weak.

Supplement Note 4: Benchmarking experiments
While the main text furnishes a brief description of the benchmarking experiments on protein-protein docking benchmark dataset version 4.0 (abbreviated as DBD 4.0 1 ), this document provides a comprehensive explanation. For ease of understanding and completeness, we restate certain parts as presented in the main text.
The structure of Supplement Note 4 is shown below.

Introduction
We have achieved the discovery of deep knowledge from R2R contact data procured from existing PPI complexes. However, could such discovered knowledge be used for R2R-I prediction? To answer this question, we developed an R2R-I predictor utilizing such information. We devised a new learning method to compose the deep knowledge discovered by P2K in the form of a feature vector for machine learning.
Through extensive experiments, we aim to show that we can leverage the deep knowledge discovered to construct machine learning models for R2R-I prediction without feature engineering over external knowledge of the sequences.

Experiments
To evaluate the effectiveness of statistical measures reflected by PC Projections and RSRVs discovered by P2K in predicting R2R-I between two protein sequences, we conducted benchmark experiments as described below.

Problem definition
As mentioned in Section 5 in Supplement Note 1, the problem of sequence-based R2R-I prediction is defined as follows. Given two protein sequences, sequence-based R2R-I prediction is the process of predicting all R2R-I between the two interacting proteins.

Data preprocessing
As mentioned in Definition 2 in Supplement Note 1, an R2R-C is a pair of residues in separate protein chains whose Euclidean distance in 3D coordinate space between their C-Beta atoms (C-Alpha atoms for Gly (G) 2 ) is less than 6 Å 2-4 . Hence, in our machine learning experiments, all residue pairs with a contact distance < 6 Å were marked as positive (+ve) R2R-I, and all others (≥ 6 Å) as negative (-ve) R2R-I. For each PPI pair with only sequence information available, all residue pairs between the two proteins are denoted as candidate R2R-I.
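The labeling rule above can be sketched as follows. This is a minimal illustration, assuming that each residue is already reduced to one representative coordinate (C-Beta, or C-Alpha for Gly); the function name, the residue identifiers and the coordinates are hypothetical and are not taken from any PDB entry.

```python
import math

CONTACT_THRESHOLD = 6.0  # angstroms, the cutoff stated in the text


def label_candidate_pairs(chain_a, chain_b, threshold=CONTACT_THRESHOLD):
    """Label every residue pair between two chains as 1 (+ve) or 0 (-ve).

    chain_a, chain_b: lists of (residue_id, (x, y, z)), where the coordinate
    is assumed to be the C-beta atom (C-alpha for Gly).
    """
    labels = {}
    for id_a, xyz_a in chain_a:
        for id_b, xyz_b in chain_b:
            distance = math.dist(xyz_a, xyz_b)  # Euclidean distance in 3D
            labels[(id_a, id_b)] = 1 if distance < threshold else 0
    return labels


# Hypothetical coordinates (assumption), chosen only to exercise the rule.
chain_a = [("A1", (0.0, 0.0, 0.0)), ("A2", (10.0, 0.0, 0.0))]
chain_b = [("B1", (3.0, 4.0, 0.0)), ("B2", (0.0, 6.0, 0.0))]

labels = label_candidate_pairs(chain_a, chain_b)
```

Note that a pair at exactly 6 Å is labeled negative, since the rule uses a strict inequality for positives.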

Experimental configuration
In our experimental configuration, we used the protein-protein docking benchmark dataset version 4.0 (abbreviated as DBD 4.0 1 ), which contains 176 non-redundant PPI complexes, to evaluate and compare our results with those obtained by other methods considered the best to date. We divided the benchmark data into two sets: one was used for training and the other was retained for testing. The training data consisted of the first 124 non-redundant PPI complexes accessible from the Protein Data Bank (PDB) 5 , as summarized in Supplement Table 4.1. This set of data is equivalent to protein-protein docking benchmark dataset version 3.0 (abbreviated as DBD 3.0 6 ). The testing data was the last 52 non-redundant PPI complexes in DBD 4.0 1 , as summarized in Supplement Table 4.2.
Following a previous study 3 , in our experiment, given a PPI complex, all pairwise ligand/receptor protein chains were considered. For example, for the case of 1FCC-AB:C, the R2R-I predictor was requested to make predictions on both 1FCC-A:C and 1FCC-B:C, and the prediction performance was evaluated together as a whole for 1FCC-AB:C. Here, 1FCC is the PDB PPI Complex ID. It has three protein chains: A, B, C. "AB:C" means that there could only be 2 possible PPI cases, which are A:C and B:C.
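The expansion of a chain specification such as "AB:C" into pairwise cases can be sketched as below. This is an illustrative helper, assuming that every chain identifier is a single character; the function name chain_pairs is hypothetical.

```python
from itertools import product


def chain_pairs(spec):
    """Expand a specification such as 'AB:C' into pairwise chain cases.

    Assumes single-character chain identifiers on both sides of the colon.
    """
    left, right = spec.split(":")
    return list(product(left, right))


pairs = chain_pairs("AB:C")  # the 1FCC example from the text
```

For "AB:C" this gives the two cases A:C and B:C, whose predictions are then evaluated together for the complex.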
In machine learning, the term AUC (Area Under Curve) refers to the area under a receiver operating characteristic (ROC) curve 7 , which is often used for the evaluation of a binary classification system. The higher the AUC value, the better the prediction performance. The maximum value of AUC is 1.0, whereas the minimum value is 0.0. In our experiments, following a previous study 3 , we evaluated the prediction performance of the sequence-based R2R-I predictor based on AUC.
For training P2K, an extra-trees classifier 8 (a variant of the widely used random forest classifier) was trained with 1000 trees under default parameters, except that the class weight was set to balanced, using the machine learning package scikit-learn 0.18.2 9 .
For our leave-one-complex-out-alone experiment, we trained P2K on the training set and validated it on the validation set. By leave-one-complex-out-alone cross-validation, we mean that in one fold of an experiment, one case in Supplement Table 4.1 is considered to be the validation set, while the other cases are considered to be the training set. The leave-one-complex-out-alone cross-validation has 124 folds, such that each case serves as the validation set exactly once. An AUC value is obtained in each fold. The final output is the average of the AUC values obtained over the 124 folds.
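As an illustration of the metric, AUC can be computed as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below is a plain-Python illustration with hypothetical scores; it is not the evaluation code used in our experiments, and its quadratic cost would be impractical for millions of candidate pairs.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of predicted scores (higher means more likely positive).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores (assumption), for illustration only.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

In this toy case three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75. A predictor that assigns the same score to every pair obtains 0.5, which matches the random baseline.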
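The classifier configuration described above may be sketched with scikit-learn as follows. This is a minimal sketch, assuming that ExtraTreesClassifier is the class referred to; the feature values are hypothetical toy data, not P2K feature vectors, and the random_state argument is an assumption added only for reproducibility.

```python
from sklearn.ensemble import ExtraTreesClassifier


def build_classifier(n_trees=1000, seed=0):
    """Extra-trees classifier with balanced class weights, other parameters default.

    The seed is an assumption for reproducibility; the text does not state one.
    """
    return ExtraTreesClassifier(
        n_estimators=n_trees,
        class_weight="balanced",  # compensates for the scarcity of positive pairs
        random_state=seed,
    )


# Hypothetical toy features and labels (assumption), heavily imbalanced on purpose.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.9, 1.0], [0.3, 0.2], [0.0, 0.4]]
y = [0, 0, 0, 1, 0, 0]

model = build_classifier(n_trees=50)  # fewer trees to keep the toy run fast
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]  # estimated probability of the positive class
```

The balanced class weight matters here because candidate R2R-I pairs are dominated by negatives; the scores, rather than hard labels, are what an AUC evaluation consumes.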
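The fold structure of the leave-one-complex-out-alone procedure can be sketched as below. The complex identifiers and the per-fold scoring function are hypothetical placeholders; in practice the scoring function would train the model on the training complexes and return the AUC on the held-out complex.

```python
from statistics import mean


def leave_one_complex_out(complex_ids):
    """Yield (training_ids, validation_id), holding out each complex once."""
    ids = list(complex_ids)
    for i, held_out in enumerate(ids):
        yield ids[:i] + ids[i + 1:], held_out


def cross_validate(complex_ids, fit_and_score):
    """Average the per-fold AUC returned by fit_and_score(train_ids, val_id)."""
    fold_aucs = [
        fit_and_score(train_ids, val_id)
        for train_ids, val_id in leave_one_complex_out(complex_ids)
    ]
    return mean(fold_aucs), fold_aucs


# Hypothetical per-complex AUC values (assumption) standing in for real training.
toy_auc = {"complex_1": 0.6, "complex_2": 0.7, "complex_3": 0.8}
average_auc, per_fold = cross_validate(
    list(toy_auc), lambda train_ids, val_id: toy_auc[val_id]
)
```

With the 124 complexes of Supplement Table 4.1 the generator would yield 124 folds, and no residue pair of the held-out complex would be seen during training in that fold.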
In this paper, we compared P2K with PPiPP 3 in our experiments because it is the closest counterpart available: it 1) requires only sequence input; 2) does not require users to have prior information about the type of PPI of the input protein sequences; 3) does not require high-end graphical processing units for either constructing or using the machine learning models; and 4) provides an easy-to-use web interface. PPiPP 3 is the only existing software available that meets these four criteria.
For our comparison with PPiPP 3 , we trained P2K on the first 124 PPI complexes in DBD 4.0 1 containing a total of 8,497,005 training samples, as summarized in Supplement Table 4.1. We then applied P2K to predict R2R-I on the last 52 PPI complexes in DBD 4.0 1 , as summarized in Supplement Table 4.2. For benchmarking purposes, we compared our performance with PPiPP 3 , which was also trained on the same 124 non-redundant PPI complexes and provides a web interface to access the predictor. It should also be noted that the primary objective of the comparison is only to demonstrate that the PC projections and RSRVs discovered by P2K are effective for R2R-I prediction between two protein sequences.

Experiment rationale
As demonstrated in Supplement Note 3, the R2R-C data procured from 618 protein interacting complexes with 17,278 contact pairs had captured various physiochemical characteristics of the target R2R-I. Our next question is whether the physiochemical characteristics captured can be applied to the neighboring residues of the target R2R-I. The underlying hypothesis is that the neighboring residues could affect the target R2R-I pair. As it is difficult to prove such a hypothesis, we turned to an empirical statistical approach using machine learning. Thus, we designed an experiment to test how much the discovered deep knowledge could help to acquire more R2R-I statistical support from additional data in the cloud. We then incrementally applied the discovered deep knowledge to the neighboring residues of the target candidate R2R-I pair from the PPI data acquired from the PDB in the construction of machine-learning prediction models, and investigated empirically whether additional statistical information could improve the prediction performance.
Table 4.1: A summary of the first 124 non-redundant PPI complexes in protein-protein docking benchmark dataset version 4.0 (abbreviated as DBD 4.0 1 ). This set of data is equivalent to protein-protein docking benchmark dataset version 3.0 (abbreviated as DBD 3.0 6 ), and was the training data in our experiments. Here the complex column records the PDB PPI Complex ID. For example, 1FCC is the PDB ID. It has three protein chains: A, B, C. "AB:C" means that there could only be 2 possible PPI cases, which are A:C and B:C.
12 AUC (0.59600±0.01531) is still much higher than that of a Random Predictor (0.50000±0.00000). This strongly indicates that the deep knowledge discovered from R2R-C data is effective for R2R-I prediction.
We further compared the R2R-I prediction results of P2K with those of PPiPP 3 , which is an existing sequence-based R2R-I prediction software that leverages feature engineering over external knowledge of sequences, on the 52 PPI complexes newly introduced in protein-protein docking benchmark dataset version 4.0 (abbreviated as DBD 4.0 1 ). As shown in Supplement Table 4.5, P2K achieved a higher average AUC (0.64317±0.04159) than that of PPiPP 3 (0.50112±0.00257). All the experimental results are summarized in Supplement Table 4.6. We conducted a two-tailed paired Student's t-test between the AUC values obtained by PPiPP 3 and those obtained by P2K (no. of neighboring residues considered=3). The p-value was 1.9E-08 < 0.05, demonstrating strong statistical significance. This indicates that the deep knowledge discovered from R2R-C data is effective for R2R-I prediction. To ensure robustness, experiments were repeated using the machine learning package scikit-learn 0.19.2 9 to obtain the results shown in Supplement Tables 4.5B and 4.6B. The results show that there is no significant difference between using scikit-learn 0.18.2 9 and scikit-learn 0.19.2 9 .
Table 4.6B: A summary of AUC achieved by comparing P2K with PPiPP 3 for the 52 PPI complexes newly introduced in protein-protein docking benchmark dataset version 4.0 (abbreviated as DBD 4.0 1 ) using the latest scikit-learn machine learning package 0.19.2 9 . Here the complex column records the PDB PPI Complex ID. For example, 1FCC is the PDB ID. It has three protein chains: A, B, C. "AB:C" means that there could only be 2 possible PPI cases, which are A:C and B:C.
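The comparison above rests on two small computations: the relative difference between the average AUC values and the paired t statistic over per-complex AUC values. The sketch below illustrates both in plain Python. The per-complex AUC lists are hypothetical toy values, not the entries of Supplement Table 4.6; only the two average AUC values are taken from the text.

```python
import math
from statistics import mean, stdev


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom of a paired Student's t-test."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))  # stdev is the sample std (n - 1)
    return t, n - 1


# Average AUC values reported in the text for P2K and PPiPP.
improvement = relative_improvement(0.64317, 0.50112)  # about 0.283

# Hypothetical per-complex AUC values (assumption), for illustration only.
p2k_auc = [0.70, 0.60, 0.65, 0.62]
ppipp_auc = [0.50, 0.51, 0.50, 0.49]
t_value, dof = paired_t_statistic(p2k_auc, ppipp_auc)
```

The relative difference of the two averages is about 28%, which appears consistent with the improvement quoted in the abstract. Converting the t statistic into a two-tailed p-value requires the Student t distribution, for example via scipy.stats.ttest_rel, which is not reproduced here.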

Case Study
To illustrate the effectiveness of P2K, we conducted a case study on the target complex 1GL1-A:I, which is one of the 52 PPI complexes newly introduced in protein-protein docking benchmark dataset version 4.0 (abbreviated as DBD 4.0 1 ). This target complex has the PDB 5 ID 1GL1, which describes the PPI between bovine alpha-chymotrypsin and PMP-C, an inhibitor from the insect Locusta migratoria. According to DBD 4.0 1 , R2R-I occurs between protein sequence chains A and I. Following the procedure mentioned, we trained P2K over the first 124 PPI complexes in DBD 4.0 1 under the default parameter setting, except that the no. of neighboring residues considered was set to 3. Also, the latest scikit-learn machine learning package 0.19.2 9 was adopted. We then applied P2K to predict R2R-I between the chains A and I of the target complex 1GL1. A working practice of an experimental biologist is to focus on the top predictions. Hence, in this experiment, only the top 5 positive predictions were retained and the other positive predictions were treated as negative predictions. As shown in Supplement Table 4.7, P2K outperformed PPiPP 3 , where P2K had a precision of 80% among the top 5 predictions while PPiPP 3 had a precision of 0% among the top 5 predictions. Fig. S4.2 provides a 3D illustration of the predictions.
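The top-5 precision used in this case study can be sketched as follows. The residue pair identifiers and scores are hypothetical placeholders, not the predictions of Supplement Table 4.7; they are arranged so that four of the five highest-scoring pairs are true contacts.

```python
def precision_at_k(scored_pairs, true_pairs, k=5):
    """Fraction of the k highest-scoring candidate pairs that are true R2R-I.

    scored_pairs: list of (pair, score); true_pairs: set of pairs in contact.
    """
    if not scored_pairs or k < 1:
        raise ValueError("need at least one scored pair and k >= 1")
    top = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:k]
    hits = sum(1 for pair, _score in top if pair in true_pairs)
    return hits / len(top)


# Hypothetical candidate pairs and scores (assumption), for illustration only.
scored_pairs = [
    (("A1", "I1"), 0.95),
    (("A1", "I2"), 0.90),
    (("A2", "I3"), 0.85),
    (("A3", "I1"), 0.80),  # the only false positive among the top 5
    (("A2", "I2"), 0.75),
    (("A4", "I4"), 0.10),  # a true contact ranked too low to be retained
]
true_pairs = {("A1", "I1"), ("A1", "I2"), ("A2", "I3"), ("A2", "I2"), ("A4", "I4")}

precision = precision_at_k(scored_pairs, true_pairs, k=5)
```

Here four hits out of five retained predictions give a precision of 0.8, the same ratio as the 80% reported for P2K.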

Figure S4.2.
A demonstration of the R2R-I prediction between protein sequence chains A and I on the target complex 1GL1. The residues on protein chain A are enclosed by purple circles while the residues on protein chain I are enclosed by yellow circles. The 4 true positive predictions (out of the top 5) of P2K are shown in the figure and are connected by blue dashed lines. They are 1: (S195 on chain A, K31 on chain I), 2: (S195 on chain A, L30 on chain I), 3: (F41 on chain A, A32 on chain I) and 4: (F41 on chain A, A32 on chain I), while the false positive prediction is (S195 on chain A, S25 on chain I). As shown in Supplement Table 4.7, P2K outperformed PPiPP 3 , where P2K had a precision of 80% among the top 5 predictions while PPiPP 3 had a precision of 0% among the top 5 predictions.