Modeling and fitting protein-protein complexes to predict change of binding energy

It is possible to accurately and economically predict change in protein-protein interaction energy upon mutation (ΔΔG), when a high-resolution structure of the complex is available. This is of growing usefulness for design of high-affinity or otherwise modified binding proteins for therapeutic, diagnostic, industrial, and basic science applications. Recently the field has begun to pursue ΔΔG prediction for homology modeled complexes, but so far this has worked mostly for cases of high sequence identity. If the interacting proteins have been crystallized in free (uncomplexed) form, in a majority of cases it is possible to find a structurally similar complex which can be used as the basis for template-based modeling. We describe how to use MMB to create such models, and then use them to predict ΔΔG, using a dataset consisting of free target structures, co-crystallized template complexes with sequence identify with respect to the targets as low as 44%, and experimental ΔΔG measurements. We obtain similar results by fitting to a low-resolution Cryo-EM density map. Results suggest that other structural constraints may lead to a similar outcome, making the method even more broadly applicable.


Results
The template-based docking protocol introduced here results in good Δ Δ G precision for homologous templates, those which (in this work) have sequence identities (vs. the targets) in the range of 44% to 51%. It is naturally more precise for self-templates, meaning those which have high sequence identity ( > 93%) to their targets. In this work when we provide RMSD (Root Mean Square Deviation) we refer in all cases to the discrepancy in backbone 3D atomic structure of modeled vs. template complexes. When we provide RMSE (Root Mean Square Error) and correlation, we are comparing experimental to computed Δ Δ G.
Double-free models are made by docking two targets onto the template complex, while single-free models are made by docking one target onto the template (see Methods). For the double-free models based on self-templates the Root Mean Square Deviations (RMSDs) range from 0.71 to 3.42 Å ( Table 1). The exception is the TGF-β Type II Receptor/TGF-β 3 complex for which the RMSD is 22.84 Å (Table 1). If we analyze the structures of the model and template in detail (Fig. S1) we can observe that distal region of the co-crystal chain A (TGF-β 3) is poorly resolved. In fact, if we omit template chain A residues 40 to 80 the RMSD decreases from 21.7826 Å to just 0.8819 Å, which is in line with RMSD found for the other complexes (Table 1). Since these poorly resolved residues are placed on a distal region of chain A, far from the interface, the Δ Δ G prediction precision with this model is in line with that of the rest of the dataset. We describe this complex in more detail later. For the singleand double-free models based on homologous templates the RMSD ranges from 2.76 to 4.86 Å. As expected the RMSDs of this sub-group are higher than the ones observed for the double-free models based on self-templates. However the differences are relatively small, which is in line with idea that structure is more conserved than sequence 27 . We discuss this point in more detail below.
Based on the validation dataset described in Table 1, we compared the performance of ZEMu to FoldX-only (meaning FoldX with no prior MMB equilibration) in predicting Δ Δ G. For the entire dataset the correlation between FoldX-only and experimental Δ Δ G (Δ Δ G FoldX-only and Δ Δ G exp , respectively) is 0.12, while the Root Mean Square Error (RMSE) is 1.83 kcal/mol. To our knowledge this is the first time FoldX is evaluated with modeled complexes; this also serves as an external (non-ZEMu) validation of our modeling protocol. For ZEMu the correlation improves to 0.34 (p-value < 0.00001); the RMSE also improves, to 1.54 kcal/mol ( Table 2, Fig. 1, Tables S1 and S2).
The improvement over FoldX-only is reflected in the complete dataset as well as in the single and multiple mutant sub-groups (Table 2). For the single mutants, FoldX-only shows an RMSE of 1.89 kcal/mol and correlation of 0.06, while for ZEMu we obtain an RMSE of just 1.54 kcal/mol and a higher correlation of 0.24 (p-value = 0.00004) ( Table 2, Fig. 1, Tables S1 and S2). In the case of the multiple mutants FoldX-only achieves an RMSE of 1.79 kcal/mol and correlation of 0.15, and ZEMu an RMSE of 1.54 kcal/mol and a correlation of 0.37 (p-value < 0.00001) ( Table 2, Fig. 1, Tables S1 and S2).
From the main dataset we also created a sub-group comprising mutants for which self-template structures are available (Table 3). Based on this sub-group we compared the performance of ZEMu for modeled vs. crystallographic complexes. As expected, performance was better for crystallographic than for modelled complexes, but only moderately (RMSE of 1.58 vs. 1.76, Correlation of 0.61 vs. 0.38, respectively). This further highlights the quality of the modeling protocol. In the particular case of the TGF-β Type II Receptor / TGF-β 3 complex model, the RMSD vs. its self-template is 22.84 Å, when computed based on all resolved residues. As explained above the huge RMSD value found is due to poorly resolved residues in a distal region of chain A co-crystal (Table 1, Fig. S1) and so does not affect the quality and performance of the model at the interface. The RMSEs are 1.68 kcal/mol and 1.19 kcal/mol, and the correlations are is 0.28 and 0.83, for the model and co-crystal, respectively. The performance is thus in line with the other models.
We then tested a ZEMu variant with an additional flexibility radius (0-14 Å) (Fig. 2) centered on the mutation site, which can include residues from the binding partner. When a radius of 4 Å is used an RMSE very slightly higher than that of regular ZEMu is obtained, also an added computational cost is incurred, so there is no reason to use the added flexibility. Nonetheless, ZEMu variants with an extra flexibility radius of 4 to 10 Å still show better performance than FoldX-only.
The application to the Fcγ RI/IgG1 complex illustrates some of the advantages of ZEMu. We created single-free and double-free template models, and a fitted model, of this complex ( IgG1 co-crystal was available, but one was reported more recently 34 . With the Fcγ RI/IgG1 single-free template model, FoldX-only yields an RMSE of 2.33 and a correlation of 0.12 while ZEMu RMSE is just 0.92 and correlation is 0.42. The difference in performance between the two protocols is overwhelmingly due to three mutations involving N-terminus residue G236. When the three mutants are left out, FoldX-only RMSE drops substantially to 0.32 and the correlation increases to 0.32. Though this is a small sample, it illustrates even more strikingly that FoldX's rigid backbone is particularly unsuitable for modeling the terminal region, which has characteristically high mobility. For the double-free and fitted Fcγ RI/IgG1 models the three mutations involving N-terminus residue G236 are immediately adjacent to unresolved residues. The free IgG1 structure (3DNK) was missing several residues from the N-terminus (residues 229 to 235, part of the lower hinge connecting Fc to Fab in a full-length IgG1), which were resolved in the template complex (1E4K). These three mutants plus one mutation in the missing region therefore could not be modeled in the double-free and fitted models. ZEMu still outperforms FoldX-only but by a smaller margin (Table 4). Directly comparing against the double-free and fitted models, we can conclude that for both FoldX-only and ZEMu the best performance was obtained for the fitted model (Table 4). If we analyze the RMSD of both models with respect to the co-crystal structure we can observe that the fitted model has a lower RMSD by ~0.32 Å. In the fitted model the hinges between the D1 and D2 domains on Fcγ RI, and between the CH2 and CH3 domains on Fc, were made flexible, to allow domain motions, explaining the better RMSD of chains A and C. This highlights the efficacy of the fitting protocol.
ZEMu and MMB performance for models of Fcγ RIII/IgG1 and Fcγ RII/IgG1 based on homologous templates further demonstrates the efficacy of the protocol in cases where the self-template is not available. For instance, for the Fcγ RIII/IgG1 model based on its self-template the RMSE is 0.85 kcal/mol and the RMSD is 1.98 Å. In the case of the Fcγ RIII/IgG1 model based on the Fcγ RII/IgG1 template (PDB:3RY6) 32   . In order to compute statistical quantities like this we would need a random sample of possible mutations. However many of the available experimental Δ Δ G's are the result of alanine scanning mutagenesis experiments and although 36% of the dataset have experimental Δ Δ G < 0 kcal/mol, only 5% have experimental Δ Δ G < − 0.5 kcal/mol. Also given the interest in affinity maturation 36 it is likely that there are more affinity-improving mutations in the peer-reviewed literature and in SKEMPI than would occur through random mutagenesis. It is thus doubtful that SKEMPI, or our dataset, contains a representative sample of substituted residue types or a representative ratio of affinity increasing vs. affinity reducing mutations. We nonetheless computed the PPV which gives the odds of obtaining Δ Δ G exp ≤ c exp for c exp = 0, − 0.5 and − 1.0 kcal/mol, given Δ Δ G ZEMu ≤ c ZEMu , for several c ZEMu 's. Considering the entire dataset, for Δ Δ G ZEMu ≤ − 0.5 kcal/mol, the probability of satisfying Δ Δ G exp ≤ − 0.5 kcal/mol is 0.43. We emphasize strongly that this PPV is not applicable to the case that Δ Δ G ZEMu is computed for all possible point mutations at an interface, as would be done in a likely practical application.

Figure 2. Results of flexibilizing a spatial neighbourhood of the mutation sites for a dataset composed of 687 mutants (not includes the template models of FcγRIII/IgG1 and FcγRII/IgG1 based on a homologous co-crystal complex). Ordinary ZEMu has only five flexible residues about each mutation site flexibilized.
Here we also flexibilize all residues within a radius (0-14 Å) of the mutation sites. Radius of 0 Å corresponds to ordinary ZEMu.

Discussion
In prior work 1 we described a means to predict Δ Δ G in crystallographically observed PPIs. A key goal in Structural Bioinformatics is the ability to compute Δ Δ G even for the very common case of proteins whose structure is known crystallographically only in the free form. In many cases evolutionary information 17 does not exist or is not applicable. We formed template models based on an available cocrystallized complex (which in principle could be e.g. isoforms or homologous of the free structures), with a new template modelling protocol that we describe. We here demonstrate that the protocol does not require a high sequence identity for building significant template models based on a homologous template (recall, sequence identity for homologous proteins ranged between 44% and 51%), whereas existing methods work only with high sequence identity templates 17,18 . For low sequence identity, our method is only moderately less precise, again related to the idea that structure is more conserved than sequence 27 . We also assembled protein complexes by flexibly fitting to a 10 Å resolution density map of an isoform protein 21 .
We emphasize that all results labelled "FoldX-only" were obtained on complexes modelled by us (albeit without subsequent ZEMu processing of the mutation sites), providing independent (non-ZEMu) validation of our template-docking protocol.
Globally, our success using template-docked and fitted models suggests that other means of docking under constraints, e.g. using biochemical or bioinformatical data, may lead to comparable results. We further suggest that if experimental Δ Δ G data is available, ZEMu could be used to validate and/or refine the docked, modeled, or fitted structures. This could significantly improve docking results 37 but is important even for template-based modeling or fitting when the constraints come from an isoform or homolog, since the binding mode may not be conserved.
Our main validation dataset consists of template modeled complexes. We used several variations of ZEMu on these complexes and evaluated the accuracy of Δ Δ G prediction. More-sophisticated variants of ZEMu, which flexibilized various spatial regions, had very similar results on the main dataset when compared to the published method 1 , indicating that it is best to limit the flexibility to the immediate sequence neighborhood of the mutated residues. This validates the perturbative assumption underlying ZEMu, namely that the substitution mutations have the largest effect in the near neighborhood of the mutation site, and less effect farther from that position. Anecdotal observations suggest MD potentials introduce structural artifacts, so leaving as much as possible of the crystallographic structure unmodified may be key to ZEMu's success. We introduce an MMB modeling protocol and show that it leads to a versatile method to predict Δ Δ G on modeled and fitted protein-protein complexes.

Methods
We built models of protein-protein complexes using MMB template modeling to align the free protein structures to relevant protein crystal complexes and also by fitting to a low-resolution density maps using ICFF 21 (described below). We then used ZEMu 1 to predict Δ Δ G upon mutation and compared to the results of using FoldX-only 13 . Note that in all cases FoldX-only analysis was conducted with the MMB modeled or fitted structures. ZEMu was mostly used as described in [1][2][3] . Though we also tested the effect of flexibilizing additional residues in the spatial neighborhood (up to 14 Å) of the mutation sites.
Dataset. The validation dataset comprises 846 mutants (each with 1-6 simultaneous substitutions) of 11 different protein-protein complexes models for which Δ Δ G exp is available (Table 1) 36, . The dataset consists mostly of double-free template models (two targets). In some cases the templates were the same proteins as the targets (self-templates, Table S1); such systems are useful for benchmarking purposes. In other cases we used homologous templates ( Fig. 3A; Table S2). We also created a single-free template model, for which only one of the two targets comes from a free structure, while the other is retained from the template ( Fig. 3B; Table S2). Finally, we also generated a model of the biomedically important Fcγ RI /IgG1 61 by fitting to a low-resolution synthetic density map of Fcγ RIII/IgG1 33 using ICFF 21 .
Template modeling in MMB. Several good template-based modeling methods exist [62][63][64] . Our procedure starts with a sequence alignment 65 between unbound (target) and bound (template) target and template residues, followed by structural alignment. We then resolve any steric clashes, after which the model is ready to be used for Δ Δ G prediction with ZEMu or potentially other purposes. The procedure (described in detail below) is straightforward and convenient to do in MMB.
Structural alignment based on gapped sequence alignment. We have previously described morphing 21,66 and homology modeling of RNA 20 and ribonucleoprotein complexes 19 , using springs which connect residues in a rigid molecule of known structure, with corresponding residues in a flexible molecule of unknown structure. The mentioned correspondence is determined by a simple SeqAn 65 gapped sequence alignment now available in MMB, with a mismatch and gap opening penalty of − 1. In contrast with our homology modeling work (in which a fully-flexible unstructured target chain is aligned with a rigid template) 19 , or our morphing work (in which a partially-flexible structured target is aligned with a rigid template) 21,66 , here those springs align one or more unbound targets (or their binding domains) onto corresponding domains in the template, (Fig. 3) with both the targets and the template remaining fully rigid during the process. The template is then deleted, leaving the modeled targets in their place (Fig. 3A,B).
Declashing. The thus-modeled complex typically has a small number of clashing residues, as the binding interface of the target was previously exposed to solvent allowing greater freedom to the side chains and to some extent the backbone. These clashes must be removed for accurate Δ Δ G calculation (Fig. 3A,B). The problem of side chains that clash due to modeling is not unlike that of side chains that clash due to in silico mutagenesis, the problem that is addressed by ZEMu. Therefore we use a ZEMu-like method 1 to declash. We flexibilize only Flexible fitting to low-resolution density maps. In addition to the template models described above, we also flexibly fitted free Fcγ RI and IgG1 structures into a synthetic 10 Å-resolution density map (based on PDB ID 1E4K) of the Fcγ RIII-IgG1 complex. We previously described Internal Coordinate Flexible Fitting (ICFF). In ICFF, each atom in the molecule (or relevant fragment thereof) perceives a force which is proportional to the electronic density gradient at the nuclear position 21 . The molecule in question can have hinge and interface flexibility, and MD forces can be applied about such points of flexibility. Hinges between the D1 and D2 domains on Fcγ RI, and between the CH2 and CH3 domains on Fc, were made flexible, to allow domain motions. Interface residues 1236 and 1298 (on Fc) and 148 (on Fcγ RI) were granted side-chain flexibility, to relieve clashes which would otherwise occur as the two subunits come into contact to form the complex (Fig. 3C).
Evaluating ΔΔG for the modeled complexes. ZEMu involves first specifying a flexibility zone comprising five residues consecutive in sequence, centered on the mutation site. The flexibility zone is treated in torsion space, leaving the remainder of the protein rigid and fixed. Also, a physics zone of 12 Å around each flexible residue is defined, inside of which electrostatic and van der Waals forces are active. The system is then minimized by dynamics 1 .
The interaction energy of the equilibrated complex is evaluated with the Knowledge Based (KB) potential FoldX 13 . We conduct the calculation for the MMB-equilibrated wild type and mutant complexes to obtain Δ G wild type and Δ G mutant , respectively. An estimate of Δ Δ G exp is obtained as follows 68 : Δ Δ G ZEMu ≡ Δ G mutant -Δ G wild type ≈ Δ Δ G exp ZEMu+ additional flexibility radius. We also tested a variation of ZEMu which grants flexibility to residues in the spatial neighborhood of the mutation sites. This was defined as all residues within a certain radius of the mutation site, potentially including residues of the adjacent protein. We evaluated radii ranging from 0 (equivalent to ordinary ZEMu) to 14 Å. After equilibration we evaluated Δ Δ G with FoldX 13 as before.