An Intriguing Correlation Based on the Superimposition of Residue Pairs with Inhibitors that Target Protein-Protein Interfaces

Druggable sites on protein-protein interfaces are difficult to predict. To survey inhibitor-binding sites onto which residues are superimposed at protein-protein interfaces, we analyzed publicly available information for 39 inhibitors that target the protein-protein interfaces of 8 drug targets. By focusing on the differences between residues that were superimposed with inhibitors and non-superimposed residues, we observed clear differences in the distances and changes in the solvent-accessible surface areas (∆SASA). Based on the observation that two or more residues were superimposed onto inhibitors in 37 (95%) of 39 protein-inhibitor complexes, we focused on the two-residue relationships. Application of a cross-validation procedure confirmed a linear negative correlation between the absolute value of the dihedral angle and the sum of the ∆SASAs of the residues. Finally, we applied the regression equation of this correlation to four inhibitors that bind to new sites not bound by the 39 inhibitors as well as additional inhibitors of different targets. Our results shed light on the two-residue correlation between the absolute value of the dihedral angle and the sum of the ∆SASA, which may be a useful relationship for identifying the key two-residues as potential targets of protein-protein interfaces.

In this article, we studied a method for identifying the key two-residues (residue pairs) to rationally design inhibitors that target protein-protein interfaces. Our analysis was based on the differences between residues that were superimposed onto small molecule inhibitors (SIRs) and non-superimposed residues (non-SIRs). Publicly available information for 8 drug targets, which included 39 inhibitors that target the protein-protein interfaces of those drug targets and 64 hot spot residues on the interfaces, was obtained. To determine the entry angles of the residues into small pockets on the interfaces and the spatial relationships between the pharmacophores of the PPIs, we focused on two-residue relationships and the dihedral angle (DA) and measured the distances for every two-residue combination. We evaluated shape-related descriptors (i.e., distance, DA) and binding-related descriptors (i.e., hydrophobic interaction, ∆SASA, binding free energy [∆G]) of the residues that were like anchor residues that provided clues for identifying key residue pairs superimposed with the inhibitors targeting the protein-protein interfaces. Finally, we applied the regression equation of this correlation to 4 inhibitors that bind to new sites not bound by the 39 inhibitors as well as additional inhibitors of different targets. Our results shed light on the two-residue correlation between the absolute value of the DA and the sum of the ∆SASAs, which may be a useful signature for identifying key residue pairs as potential targets of protein-protein interfaces. In this report, the protein to which small molecules bind is referred to as the "target protein", whereas the protein that interacts with the target protein is referred to as the "partner protein".

Results
Basic data: 8 target-partner protein combinations, 39 inhibitors, and 64 residues. To extract solid structural information regarding the target-partner protein combinations from the Protein Data Bank (PDB) database, we used the following four criteria: 1) target proteins for which inhibitor-protein complexes were reported after 2005 28 ; 2) basic data of inhibitor-protein complexes and corresponding protein-protein complexes were available from the PDB; 3) inhibitors (small compounds) directly bound to the interface of the target protein; and 4) at least two different crystal structures of the inhibitor-protein complexes were available as of March 31, 2015. Eight target-partner protein combinations were selected according to these criteria and used for further analysis. This information enabled us to compare the properties of the residues of the protein-protein complexes, such as descriptors of their shapes and binding-related parameters, with those of the protein-small molecule (inhibitor) complexes (Table 1). In addition, 39 protein-inhibitor complexes, in which most of the inhibitors were bound to the target proteins in different positions (as shown in Table 1 as PDB IDs of protein-inhibitor complexes), were preferentially selected to avoid structural redundancy between the protein-inhibitor complexes, although this selection method limited the number of protein-inhibitor complexes available for analysis. Sixty-four residues of the eight partner proteins with a ∆SASA (the change in solvent accessible area for each side-chain upon binding 27 ) greater than 5 Å 2 and a predicted -∆Gi value (estimated free-energy-based scoring function 27 ) greater than 1 kcal/mol were selected for further analysis from the ANCHOR database (Table 1, Supplementary Table 1) 27,29,30 . These residues and inhibitors were on the same interface of the corresponding target proteins. We performed structural alignments between the structures of the target protein in the native protein-protein complexes and the structures of the corresponding protein-inhibitor complexes (Fig. 1a,b). Of the 64 residues, 26 were classified as SIRs based on the thresholds described in the Methods ( Supplementary Fig. 1), whereas the remaining residues were classified as non-SIRs. When the secondary structures of the 64 residues were analyzed, 34 residues (53%) belonged to α -helices. This finding is consistent with a previous report showing that most interfaces of the reported PPIs for inhibitor-protein complexes are α -helices (Table 2) [15][16][17] . The difference in ∆Gi between the SIRs and non-SIRs was 0.1 kcal/mol, whereas the difference in the hydrophobic effect (HE) of each residue 30 between the SIRs and non-SIRs was 0.5 kcal/mol. The mean Δ SASA of the SIRs (85.4 Å 2 ) was significantly larger than the mean Δ SASA of the non-SIRs (61.9 Å 2 ; p = 0.00624; t-test). It seems reasonable that the key descriptors of SIRs are similar to those of the anchor residues in vivo because Δ SASA is important to both SIRs and anchor residues 27 . Correlations between DA and ∑∆SASA of the superimposed residue pairs. Based on reports about fragment based drug discovery and ligand efficiency, it is assumed that a hit compound with a binding free energy value of -6.9 kcal/mol is generally effective at a concentration of 10 μ M [31][32][33][34] . The mean ∆Gi of the 64 single residues was only -3.5 kcal/mol, however, which does not seem to be enough energy to obtain high-throughput screening hit compounds. We calculated that 2 or more residues were superimposed onto 37 inhibitors (95%), based on the 39 protein-inhibitor complexes (Fig. 1c). This finding led us to hypothesize that correlations exist between two or more residues that might be informative in PPI research. Therefore, we then analyzed the two residues on the same interface of the drug target (i.e., residue pair). To provide a method for measuring the spatial position of a first residue relative to a second residue, we measured three structural parameters, i) distances between C α (alpha carbon atoms of an amino acid) of the first residue and C α of the second residue (C α -C α ), ii) β -turn 9 4 5 loop d 21 10 11  Table 2. Difference between the superimposed residues a (n = 26) and non-superimposed residues b (n = 38). a Residues that were superimposed onto the inhibitors. b Residues that were not superimposed onto the inhibitors. c Number of secondary structures to which the residues belonged. d Loop or strand.

Mean (median) SD Mean (median) SD Mean (median) SD
distances between the C ω (basically, the farthest carbon atom from the C α carbon of an amino acid) of the first residue and the C ω of the second residue pair (C ω -C ω ), and iii) the DA of C ω -C α -C α -C ω for each residue pair (Fig. 1d). We then classified each residue pair into residue pairs that were both SIRs (superimposed residue pair, SIRP), residue pairs with both non-SIR (non-superimposed residue pair, nonSIR-nonSIR), and residue pairs in which one was superimposed with an inhibitor and the other was not (SIR-nonSIR). 26 SIRs and 38 non-SIRs on 8 target proteins (Table 1) resulted in 35 SIRPs, 90 nonSIR-nonSIRs and 116 SIR-nonSIRs (Supplementary Table 2, note in Supplementary information). To evaluate the effects of binding-related parameters, we calculated the sum of HE (∑HE), ∆Gi (Σ ∆Gi), and ∆SASA (∑∆SASA) for each residue pair. The mean distances between the atoms (C ω -C ω , C α -C α ) of the SIRPs were shorter than those of the nonSIR-nonSIR (4.4 Å for C ω -C ω , Table 3). As expected, the mean ∑∆SASA of the SIRPs was 168 Å 2 , which was significantly larger than that of the nonSIR-nonSIR (131 Å 2 , p = 0.000090, t-test) and the SIR-nonSIR (144 Å 2 , p = 0.0078, t-test Considering that visualizing the DA of four atoms (C ω -C α -C α -C ω ) and distances (C α -C α , C ω -C ω ) is equivalent to the Sawhorse projections and the Newman projections in chemistry, we further investigated the correlation between shape-related descriptors (i.e., distances, DAs) and binding-related descriptors (i.e., hydrophobic interaction, ∆SASA, ∆Gi) in the SIRPs and the non-SIRPs, and found strong correlations between the DA (x-axis) and ∑∆SASA (y-axis) in only the SIRPs (n = 35; Fig. 2a). Clear correlations between the DA and ∑∆SASA were observed for the positive DA values (DA > 0; r = − 0.61, p < 0.035, n = 12) and negative DA values (DA < 0; r = 0.70, p < 0.00021, n = 23). Considering that the largest ∑∆SASA in both DA > 0 and DA < 0 increased as the DAs approached the zero degree (Fig. 2a), the absolute value of the DA (|DA|) was used instead of the DA (Fig. 2b). Once again, a clear correlation between |DA| and ∑∆SASA was observed (r = -0.68 with p < 0.00001, y = − 0.57 x + 211, Fig. 2b). The correlation between |DA| and ∑∆SASA implied that not only ∑∆SASA (an interaction descriptor) but also |DA| (a shape-descriptor) can be used to distinguish SIRPs from non-SIRPs (Fig. 2b,c).

Feasibility of the correlation using other inhibitors and another target protein.
To demonstrate the feasibility of our hypothesis that the correlation between |DA| and ∑∆SASA could be useful for distinguishing SIRPs from non-SIRPs, we applied this correlation to an additional target-inhibitor dataset. First, we tested four additional inhibitors targeting three of the eight previously used target proteins (Supplementary Table 3). Notably, these inhibitors bind to different positions than the previous 39 inhibitors. From an inhibitor of BclxL (pdb: 4C52), we obtained three new SIRPs (I90-A91, I90-L94, and I90-I97) 35 . Two Mcl inhibitors (pdb: 4OQ5 and 4WGI) provided two additional SIRPs (I58-L62 and A59-L62) 36,37 . An integrase inhibitor (pdb: 3ZT1) provided two new SIRPs (K364-I365 and K364-D366) 38 . All seven new SIRPs were in the range of the regression equation ± SE (n = 35, Σ Δ SASA = − 0.57 |DA| + 211 ± SE, SE = 32.4; Fig. 3a). This finding suggests that the correlation can be used for new inhibitors, even when they bind to different positions on the same interfaces of their targets.
We then performed the Leave-One-Out Cross Validation (LOOCV) method using the 42 samples to properly and strongly validate the correlation between |DA| and ∑∆SASA because of the limited amount of data used. No statistically significant differences in the gradient or in the intercept of the regression equation were detected between the results of the 35 training samples and those of the LOOCV (Fig. 3b). Also, there was no difference between the estimated errors of the seven tested samples and those of the LOOCV, suggesting that the correlation between |DA| and ∑∆SASA was not incidental, but intrinsic.
We further tested the regression equation, which was based on the parameters obtained through the LOOCV process with 42 samples, using additional validation data of 10 new SIRPs, including novel target-partner protein combinations, such as Keap1/Nrf2 (pdb: 1× 2R, Supplementary  Table 3. Comparison among the superimposed residue pairs a , non-superimposed residue pairs b , and superimposed residue-non-superimposed residue pairs c . a Pairs of residues that were superimposed onto the inhibitors. b Pairs of residues that were not superimposed onto the inhibitors. c Pairs that one residue was superimposed onto an inhibitor and another was not superimposed onto inhibitors. The shortest SIRPs of each inhibitor can be used as a filter for extracting plausible SIRPs. The shortest SIRP distances (C α -C α , C ω -C ω ) of the 48 inhibitors were selected to identify the distance necessary to inhibit the PPIs on the interfaces. Redundant SIRPs were removed, leaving 23 SIRPs (Supplementary Table 5). For these 23 SIRPs, the mean distances + 1.96 SD were 8.89 Å (C α -C α ) and 11.20 Å (C ω -C ω ) (C α : mean = 5.29 Å, SD = 1.84; C ω : mean = 7.22 Å, SD = 2.03). By contrast, the mean distances + 1.96 SD for all 52 SIRPs were 14.3 Å (C α -C α ) and 16.1 Å (C ω -C ω ) (C α : mean = 7.16 Å, SD = 3.66; C ω : mean = 9.33 Å, SD = 3.56). These findings, including those of the LOOCV, validation with an additional dataset, and shorter distances, suggested that the correlation between the absolute value of the DA and the sum of the ∆SASAs of the residues could be applied to new inhibitors and unknown targets.

Discussion
Considering that most of the SIRPs were non-polar residues and almost half (49%) were on α -helix motifs ( Supplementary Fig. 2), we analyzed the effects of both the polarity and secondary structure to investigate whether the relation between |DA| and ∑∆SASA was intrinsic to SIRPs. First, the residue pairs were classified into three groups: two non-polar residues (group 1); one non-polar residue and one polar residue (group 2); and two polar residues (group 3). The polar character of each group was in the order group1 < group2 < group3 (Supplementary Table 6). When the residue pairs were classified into three groups, group 3 had the smallest number of residue pairs and no SIRPs (nonSIR-nonSIR 14, SIR-SIR 8). Although there was no correlation between |DA| and ∑∆SASA between any of the nonSIR-nonSIR pairs, two SIRP groups showed correlations between |DA|(x-axis) and ∑∆SASA (y-axis) (group1: r = − 0.67, p < 0.000014, n = 27, y = -0.55x + 212; group2: r = -0.54, p < 0.17, n = 8, y = − 0.49x + 194). The slope and y-intercepts of groups 1 and 2 were similar to those of the pre-classification correlation, indicating that this correlation was not affected by differences in residue polarity. On the other hand, only one pair of SIR-nonSIR in group 3 showed a correlation between |DA| and ∑∆SASA (r = − 0.75, p = 0.0319, n = 8, y = − 0.584x+ 148). The difference between correlations of the SIRPs and the SIR-nonSIR pairs in group 3 was the y-intercept, indicating that polarity of the residue might affect the SIR-nonSIR pairs.
We then classified the residue pairs into nine groups, according to the combination of secondary structures between the residue pairs ( Supplementary Fig. 2). There were eight combinations of secondary structures for non-SIRPs (n = 208) because none of the residue pairs were on different α -helices. When classified into the combinations, more than half of both the nonSIR-nonSIR and SIR-nonSIR pairs were on the same α -helix. One possible reason for this tendency is that an α -helix on the PPI interface is long enough to gain binding energy or ∑∆SASA from many residues that act as anchors on the interface.
Although there were no correlations between |DA| and ∑∆SASA for any of the secondary structure combinations of the non-SIRPs, SIRPs on the same α -helix (n = 17) and on the same loop or strand (n = 12) showed correlations between |DA| (x-axis) and ∑∆SASA (y-axis) (α -helix: r = − 0.72, p < 0.0011, n = 17, y = − 0.59x + 209; loop or strand: r = − 0.43, p < 0.17, n = 12, y = − 0.38x + 184). The slopes and y-intercepts of the two groups were also similar to those of the non-classified correlation. No other secondary structure combination was suitable for investigating the correlation because there were only two pairs of residues on the same β -turn and four pairs on the β -turn and loop (strand). These findings suggest that this correlation is not affected by different combinations of secondary structures to which the residues belong. Further study with large number of data should be performed to validate these findings.
Based on the definition of Δ Gi 27 , it may be that Δ Gi cannot explain the difference between SIRs and non-SIRs (Table 2). We think, however, that there may be no difference in Δ Gi between SIRs and non-SIRs because we did not use Δ Gi to select the SIRs. Instead, by focusing on the SIRs, we noticed that the Δ SASA of SIRs was different from that of the non-SIRs, leading to further studies of the Σ Δ SASA of SIRPs.
Recently, Moreira and colleagues reported a method for predicting hot spots in protein-protein and protein-nucleic acid interfaces based on the SASA) 44 , which is consistent with a previous report demonstrating that ∆SASA is important to anchor residues 27 . Our results, extracted utilizing three publicly available databases, demonstrated that features of SIRPs correlated between |DA| and ∑∆SASA and the distances between residue pairs (Fig. 4a,b). These findings could be applied to novel inhibitors and a novel target (Fig. 4c). One example of the application is to filter out non-SIRPs and select plausible SIRPs for novel targets (noted in Supplementary information, Supplementary Table 7).
Although we determined the contribution of the residue using the ANCHOR database, there are other published methods for determining the contribution of a residue at a PPI, such as Rosetta scanning 45 and mCSM-PPI 46 . Therefore, we used the mCSM-PPI method with a single mutation. When the 64 residues shown in Table 2 were mutated to alanine, we observed a slight but nonsignificant difference between SIRs (n = 26, mean ∆∆G − 1.687 kcal/mol SD 0.845) and non-SIRs (n = 34, mean ∆∆G − 1.286 kcal/mol SD 0.858; p = 0.071). Further studies are needed to determine the contribution of a residue at a PPI.
In summary, we focused on two-residue relationships and found a linear negative correlation between |DA| and ∑∆SASA for SIRPs based on a comparison of the protein-protein complexes and the protein-inhibitor complexes. This correlation was successfully applied to five additional inhibitors of different targets. Our results shed light on the two-residue correlation between the absolute value of the DA and the sum of the ∆SASAs. Further studies should be performed to evaluate multi-residue correlations by focusing on three-(or more) residue relationships.

Methods
Data collection. The set of complex structures evaluated in this study is listed in Table 1. The structures of the complexes were obtained from the PDB and TIMBAL. All structural figures were generated using PyMOL (http://www.pymol.org). The predicted values of ∆SASA and ∆Gi are publicly available from the ANCHOR database (http://structure.pitt.edu/anchor) 27,29,30 . The ∆SASA for each side-chain upon binding and an estimate of its contribution to the ∆Gi are listed in the database. Rajamani et al. calculated the conformation-dependent portion of the empirical ∆Gi using the expression ∆Gi = ∆ E elec (i) + ∆Gdes(i), where ∆E elec (i) denotes the electrostatic interaction energy between atoms in the ligand residue i and the receptor, and ∆Gdes(i) is an estimation of the desolvation free energy of residue i 27 , To estimate the hydrophobic interactions of each selected residue, we used the estimated values for the HE of the amino acid residues reported by Karpus 54 . These data for the 64 selected residues are summarized in Supplementary Table 1. Polar and non-polar side chains of amino acids were classified by Perutz's method 55 . Determination of the superimposed residues. Every residue that was superimposed onto an inhibitor (SIR) was selected using the same method used for the Mcl/p53 complex (PDB:1YCR) and a corresponding protein-inhibitor complex (PDB:1RV1), which is shown as an example in Supplementary Fig. 1. First, we used the "align" command in PyMOL to perform structural alignments between the structures of the target protein in the native protein-protein complexes and the structures of the corresponding protein-inhibitor complexes ( Supplementary Fig. 1a). The native sequence of the target protein and the corresponding sequence of the inhibitor-bound target protein were considered to have few differences. The average root-mean-square deviation was 0.866 Å (Supplementary Table 9). The bonds of the selective residues of the partner protein were then drawn as sticks, whereas the inhibitors were drawn as spheres with radii equal to the van der Waals radii ( Supplementary  Fig. 1b,c). Finally, when a residue containing at least two heavy atoms (other than the atoms of the amide bond) whose centers were superimposed onto the sphere of the inhibitor was observed, the residue was considered to be superimposed onto the inhibitor and was thus defined as a SIR. Residue pairs were determined in the same way. After the redundant residue pairs were removed, 35 residue pairs that were superimposed onto 39 inhibitors were found (Supplementary Table 10).
Structure of the residue pairs. In this report, every combination of two residues on the same drug target was defined as a residue pair. The total number of residue pairs was the sum of the combination of n selected residues taken two at a time in each target (∑{ n1 C 2 (target1) + n2 C 2 (targt2) + n3 C 2 (target3) + … + n8 C 2 (tar-get8)} = 243 [residue pairs]). A total of 243 residue pairs were found in the partner proteins of those proteins targeted by the 8 drugs. PyMOL was used to measure the distances and DAs between the residue pairs. The distance between the C α (alpha carbon of an amino acid) of one residue and C α of the second residue (C α -C α ) was measured for each residue pair. The distance between the C ω (the farthest carbon atom from C α or C β carbon of an amino acid) of one residue and the C ω of the second residue (C ω -C ω ) was also measured. The C ω -C α -C α -C ω DAs of each residue pair were also measured. Basically, C ω was defined as either the farthest heavy atom from the C β of the side chain of an aromatic amino acid or the end heavy atoms of the side chain of a non-aromatic amino acid. For the branched end amino acids (Val, Leu, Glu, Gln, Asp, Asn, and Arg), the carbon atoms that branched before the end atoms were assumed to be the farthest atoms (C ω ). With Pro, C4 was assumed to be the farthest atom from C α . The sum of HE (∑HE), ∆Gi (Σ ∆Gi), and ∆SASA (∑∆SASA) for each residue pair was calculated. All residue pairs (n = 243) were investigated (Supplementary Table 2). Statistical analysis. Two-tailed Student's t-test assuming equal variances was used in this study to compare the mean Δ SASA and Δ Δ G of the SIRs vs. non-SIRs as well as to compare the mean distances between the atoms (C ω -C ω , C α -C α ), Σ Δ SASA and Σ HE of the SIRPs vs. nonSIR-nonSIR and SIR-nonSIR. The Pearson correlation coefficient was calculated to measure the linear relationship between DA and Σ Δ SASA as well as between |DA| and Σ Δ SASA. Two-tailed p-value for the correlation coefficient was calculated using Student's t-distribution.