Predicting resistance of clinical Abl mutations to targeted kinase inhibitors using alchemical free-energy calculations

The therapeutic effect of targeted kinase inhibitors can be significantly reduced by intrinsic or acquired resistance mutations that modulate the affinity of the drug for the kinase. In cancer, the majority of missense mutations are rare, making it difficult to predict their impact on inhibitor affinity. We examine the potential for alchemical free-energy calculations to predict how kinase mutations modulate inhibitor affinities to Abl, a major target in chronic myelogenous leukemia (CML). These calculations have useful accuracy in predicting resistance for eight FDA-approved kinase inhibitors across 144 clinically identified point mutations, with a root mean square error in binding free-energy changes of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.1_{0.9}^{1.3}$$\end{document}1.10.91.3 kcal mol−1 (95% confidence interval) and correctly classifying mutations as resistant or susceptible with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$88_{82}^{93}$$\end{document}888293% accuracy. This benchmark establishes the potential for physical modeling to collaboratively support the assessment and anticipation of patient mutations to affect drug potency in clinical applications.

T argeted kinase inhibitors are a major therapeutic class in the treatment of cancer. A total of 38 selective smallmolecule kinase inhibitors have now been approved by the FDA 1 , including 34 approved to treat cancer, and perhaps 50% of all current drugs in development target kinases 2 . Despite the success of selective inhibitors, the emergence of drug resistance remains a challenge in the treatment of cancer [3][4][5][6][7][8][9][10] and has motivated the development of second-and then third-generation inhibitors aimed at overcoming recurrent resistance mutations [11][12][13][14][15] .
While a number of drug resistance mechanisms have been identified in cancer (e.g., induction of splice variants 16 , or alleviation of feedback 17 ), inherent or acquired missense mutations in the kinase domain of the target of therapy are a major form of resistance to tyrosine kinase inhibitors (TKI) 10,18,19 . Oncology is entering a new era with major cancer centers now deep sequencing tumors to reveal genetic alterations that may render subclonal populations susceptible or resistant to targeted inhibitors 20 , but the use of this information in precision medicine has lagged behind. It would be of enormous value in clinical practice if an oncologist could reliably ascertain whether these mutations render the target of therapy resistant or susceptible to available inhibitors; such tools would facilitate the enrollment of patients in mechanism-based basket trials 21,22 , help prioritize candidate compounds for clinical trials, and aid the development of nextgeneration inhibitors.
While some cancer missense mutations are highly recurrent and have been characterized clinically or biochemically, a long tail of rare mutations collectively accounts for the majority of clinically observed missense mutations (Fig. 1a), leaving clinicians and researchers without knowledge of whether these uncharacterized mutations might lead to resistance. While rules-based and machine learning schemes are still being assessed in oncology contexts, work in predicting drug response to microbial resistance has shown that rare mutations present a significant challenge to approaches that seek to predict resistance to therapy 23 . Clinical cancer mutations may impact drug response through a variety of mechanisms by altering kinase activity, ATP affinity, substrate specificities, and the ability to participate in regulatory interactions, compounding the difficulties associated with limited datasets that machine learning approaches face. In parallel with computational approaches, high-throughput experimental techniques such as MITE-Seq 24 have been developed to assess the impact of point mutations on drug response. However, the complexity of defining selection schemes that reliably correlate with in vivo drug effectiveness and long turn-around times might limit their ability to rapidly and reliably impact clinical decisionmaking.
Physics-based approaches could be complementary to machine-learning and experimental techniques in predicting changes in TKI affinity due to mutations with few or no prior clinical observations. Alchemical free-energy methods permit receptor-ligand binding energies to be computed rigorously, including all relevant entropic and enthalpic contributions 25 . Encouragingly, kinase:inhibitor binding affinities have been predicted using alchemical free-energy methods with mean unsigned errors of 1.0 kcal mol −1 for CDK2, JNK1, p38, and Tyk2 [26][27][28][29][30][31][32][33] . Recently, one study has hinted at the potential utility of alchemical free-energy calculations in oncology by predicting the impact of a single clinical mutation on the binding free energies of the TKIs dasatinib and RL45 34 .
Here, we ask whether physical modeling techniques may be useful in predicting whether clinically identified kinase mutations lead to drug resistance or drug sensitivity. We perform state-ofthe-art relative alchemical free-energy calculations using FEP+ 26 , recently demonstrated to achieve sufficiently good accuracy to drive the design of small-molecule inhibitors for a broad range of   20 show that 68.5% of missense kinase mutations in cancer patients have never been observed previously, while 87.4% have been observed no more than ten times; the vast majority of clinically observed missense kinase mutations are unique to each patient. b To compute the impact of a clinical point mutation on inhibitor binding free energy, a thermodynamic cycle can be used to relate the free energy of the wild-type and mutant kinase in the absence (top) and presence (bottom) of the inhibitor. c Summary of mutations studied in this work. Frequency of the wild-type (dark green) and mutant (green) residues for the 144 clinically-identified Abl mutations used in this study (see Table 1 for data sources). Also shown is the frequency of residues within 5 Å (light blue) and 8 Å (blue) of the binding pocket. The ordering of residues along the x-axis corresponds to the increasing occurrence of residues within 5 Å of the binding pocket. The number of wild-type Phe residues (n = 45) and mutant Val residues (n = 31) exceeded the limits of the y-axis targets during lead optimization [25][26][27]35 , to calculate the effect of point mutation on the binding free energy between the inhibitor and the kinase receptor (Fig. 1b, c). We compare this approach against a fast but approximate physical modeling method implemented in Prime 36 (an MM-GBSA approach) in which an implicit solvent model is used to assess the change in minimized interaction energy of the ligand with the mutant and wild-type kinase. We consider whether these methods can predict a ten-fold reduction in inhibitor affinity (corresponding to a binding freeenergy change of 1.36 kcal mol −1 ) to assess baseline utility. As a benchmark, we compile a set of reliable inhibitor ΔpIC 50 data for 144 clinically identified mutants of the human kinase Abl, an important oncology target dysregulated in cancers like chronic myelogenous leukemia (CML), for which six 1 Table 1). Due to the change in bond topology required by mutations involving proline, which is not currently supported by the FEP+ technology for protein residue mutations, the three mutations H396P (axitinib, gefitinib, erlotinib) were excluded from our assessment. As single-point mutations were highly represented in the Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT) study analyzed in Fig. 1a, we excluded double mutations from this work. However, the impact of mutations from multiple sites can potentially be modeled by sequentially mutating each site and this will be addressed in future work. Experimental ΔpIC 50 measurements for wild-type and mutant Abl were converted to ΔΔG in order to make direct comparisons between physics-based models and experiment. However, computation of experimental uncertainties were required to understand the degree to which differences between predictions and experimental data were significant. Since experimental error estimates for measured IC 50 s were not available for the data in Table 1, we compared that data to other sources that have published IC 50 s for the same mutations in the presence of the same TKIs (Fig. 2a-c). Cross-comparison of 97 experimentally measured ΔΔGs derived from cell viability assay IC 50 data led to an estimate of experimental variability of 0:32 0:36 0:28 kcal mol −1 root mean square error (RMSE) that described the expected repeatability of the measurements. Because multiple factors influence the IC 50 aside from direct effects on the binding affinity we also compared ΔΔGs derived from ΔpIC 50 s with those derived from binding affinity measurements (ΔK d ) for which data for a set of 27 mutations was available (Fig. 2d). The larger computed RMSE of 0:81 1:04 0:59 kcal mol −1 represents an estimate of the lower bound of the RMSE to the IC 50 -derived ΔΔGs that we might hope to achieve with FEP+ or Prime, which were performed using non-phosphorylated models, when comparing sample statistics directly. Comparing 31 mutations for which phosphorylated and non-phosphorylated ΔK d s were available, we found a strong correlation between the ΔΔGs derived from those data (r = 0.94, Supplementary Figure 1).
Most mutations do not significantly reduce TKI potency. The majority of mutations do not lead to resistance by our 10-fold affinity loss threshold: 86.3% of the co-crystal set (n = 113) and 86.8% of the total set (n = 125). Resistance mutations, which are likely to result in a failure of therapy, constitute 13.7% of the co-crystal set (n = 18) and 13.2% of the total set of mutations (n = 19). The ΔpIC 50 s for all 144 mutations are summarized in Supplementary Tables 2-7. Two mutations exceeded the dynamic range of the assays (IC 50 > 10,000 nM); as these two mutations  (Fig. 1c). From prior experience with relative alchemical free-energy calculations for ligand design, good initial receptor-ligand geometry was critical to obtaining accurate and reliable free-energy predictions 26 , so we first focused on the 131 mutations in Abl kinase across six TKIs for which wild-type Abl: TKI co-crystal structures were available. Figure 3 summarizes the performance of predicted binding free-energy changes (ΔΔG) for all 131 mutants in this set for both a fast MM-GBSA physics-based method that only captures interaction energies for a single structure (Prime) and rigorous alchemical free-energy calculations (FEP+). Scatter plots compare experimental and predicted free-energy changes (ΔΔG) and characterize the ability of these two techniques to predict experimental measurements. Statistical uncertainty in the predictions and experiment-to-experiment variability in the experimental values are shown as ellipse height and widths, respectively. The value for experimental variability was 0.32 kcal mol −1 , which was the standard error computed from the cross-comparison in Fig. 2. For FEP+, the uncertainty was taken to be the standard error of the average from three independent runs for a particular mutation, while Prime results are deterministic and are not contaminated by statistical uncertainty.
To better assess whether discrepancies between experimental and computed ΔΔGs simply arise for known forcefield limitations or might indicate more significant effects, we incorporated an additional error model in which the forcefield error was taken to be a random error σ FF ≈ 0.9 kcal mol −1 , a value established form previous benchmarks on small molecules absent conformational sampling or protonation state issues 37 . Thin error bars in Fig. 2 represent the overall estimated error due to both this forcefield error and experimental variability or statistical uncertainty 38,39 .
To assess overall quantitative accuracy, we computed both RMSE-which is rather sensitive to outliers, and mean unsigned error (MUE). For Prime, the MUE was 1:  While quantitative accuracy (MUE, RMSE) is a principle metric of model performance, an application of potential interest is the ability to classify mutations as raising resistance to a specific TKI. To characterize the accuracy with which Prime and FEP+ classified mutations in a manner that might be therapeutically relevant, we classified mutations by their experimental impact on the binding affinity as susceptible (affinity for mutant is diminished by no more than 10-fold, ΔΔG ≤ 1.36 kcal mol −1 ) or as resistant (affinity for mutant is diminished by least 10-fold, ΔΔG > 1.36 kcal mol −1 ). Summary statistics of experimental and computational predictions of these classes are shown in Fig. 2 (bottom) as truth tables (also known as confusion matrices).
The simple minimum-energy scoring method Prime correctly classified 9 of the 18 resistance mutations in the dataset while merely 85 of the 113 susceptible mutations were correctly classified (28 false positives). In comparison, the alchemical freeenergy method FEP+, which includes entropic and enthalpic contributions as well as explicit representation of solvent, correctly  Fig. 2 Cross-comparison of the experimentally measured effects that mutations in Abl kinase have on ligand binding, performed by different labs. ΔΔG was computed from publicly available ΔpIC 50 or ΔpK d measurements and these values of ΔΔG were then plotted and the RMSE between them reported. a ΔpIC 50 measurements (X-axis) from ref. 79 compared with ΔpIC 50 measurements (Y-axis) from ref. 81 . b ΔpIC 50 measurements (X-axis) from ref. 79 compared with ΔpIC 50 measurements (Y-axis) from ref. 80 . c ΔpIC 50 measurements (X-axis) from ref. 81 compared with ΔpIC 50 measurements (Y-axis) from ref. 80 . d ΔpIC 50 measurements (X-axis) from ref. 79  patient bearing a resistance mutation in the kinase domain of Abl has an equal chance of Prime correctly predicting this mutation would be resistant to one of the TKIs considered here, while if the mutation was susceptible, the chance of correct prediction would be~75%. By contrast, the classification specificity of FEP+ was substantially better. For FEP+, the sensitivity was 0:50 0:74 0:29 while the specificity was 0:93 0:97 0:88 . There is a very high probability that FEP+ will correctly predict that one of the eight TKIs studied here will remain effective for a patient bearing a susceptible mutation.
How reliant are classification results on choice of cutoff? Previous work by O'Hare et al. utilized TKI-specific thresholds for dasatinib, imatinib, and nilotinib 40 , which were~2 kcal mol −1 .
Supplementary Figure 2 shows that when our classification threshold was increased to a 20-fold change in binding (1.77 kcal mol −1 ), FEP+ correctly classified 8 of the 13 resistant mutations and with a threshold of 100-fold change in binding (2.72 kcal mol −1 ), FEP+ correctly classified the only two resistant mutations (T315I/dasatinib and T315I/nilotinib). With the extant multilayered and multinodal decision-making algorithms used by experienced oncologists to manage their patients' treatment, or by medicinal chemists to propose candidate compounds for clinical trials, the resistant or susceptible cutoffs could be selected with more nuance than the simple 10-fold affinity threshold we consider here. With a larger affinity change cutoff, for example, the accuracy with which physical models predict resistance mutations increases beyond 90% (Supplementary Figure 2). For the alchemical  To better highlight true outliers unlikely to simply result from expected forcefield error, we presume forcefield error (σ FF ≈ 0.9 kcal mol −137 ) also behaves as a random error, and represent the total estimated statistical and forcefield error ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi σ 2 FF þ σ 2 exp=cal q as vertical error bars. The yellow region indicates area in which predicted ΔΔG is within 1.36 kcal mol −1 of experiment. Two mutations were beyond the concentration limit of the assay and were not plotted; N = 129. Bottom panel: Truth tables and classification results include T315I/dasatinib and L248R/imatinib; 131 points were used. Truth tables of classification accuracy, sensitivity and specificity using two-classes (resistant: ΔΔG > 1.36 kcal/mol; ΔΔG ≤ 1.36 kcal/mol). For MUE, RMSE, and classification statistics, sub/superscripts denote 95 % CIs. For Prime, *MUE highlights that the Bayesian model yields a value for MUE that is noticeably larger than MUE for observed data due to the non-Gaussian error distribution of Prime approach, classification accuracy was 0:92 0:96 0:87 when an affinity change cutoff of 20-fold was used while using an affinity change cutoff of 100-fold further improved the accuracy to 0:98 1:00 0:96 .
Bayesian analysis can estimate the true error. The statistical metrics-MUE, RMSE, accuracy, specificity, and sensitivitydiscussed above are based on analysis of the apparent performance of the observed modeling results compared with the  observed experimental data via sample statistics. However, this analysis considers a limited number of mutants, and both measurements and computed values are contaminated with experimental or statistical error. To obtain an estimate of the intrinsic performance of our physical modeling approaches, accounting for known properties of the experimental variability and statistical uncertainties, we used a hierarchical Bayesian model to infer posterior predictive distributions from which expectations and 95% predictive intervals could be obtained.
The results of this analysis are presented in Fig. 3 (central tables). FEP+ is significantly better than Prime at predicting the impact of mutations on TKI binding affinities, as the apparent performance as well as the intrinsic performance were well-separated outside their 95% CI in nearly all metrics. Applying the Bayesian model, the MUE and RMSE for FEP+ was 0: Understanding the origin of mispredictions. Resistance mutations that are mispredicted as susceptible are particularly critical because they might mislead the clinician or drug designer into believing the inhibitor will remain effective against the target. Which resistance mutations did FEP+ mispredict as susceptible? Nine mutations were classified by FEP+ to be susceptible when experimentally measured ΔpIC 50 data indicate the mutations should have increased resistance according to our 10-fold affinity cutoff for resistance. Notably, the 95% CI for five of these mutations included the 1. Four mutations, however, were misclassified to a degree that is statistically significant: dasatinib/T315A, bosutinib/T315I, imatinib/E255V, and nilotinib/E255V. For dasatinib/T315A, although the T315A mutations for bosutinib, imatinib, nilotinib, and ponatinib were correctly classified as susceptible, the predicted free-energy changes for these four TKIs were consistently more negative than the corresponding experimental measurements, like dasatinib/T315A, indicating there might be a generic driving force contributing to the errors in T315A mutations for these five TKIs. Abl is known to be able to adopt many different conformations (including DFG-in and DFG-out), and it is very likely that the T315A mutation induces conformational changes in the apo protein 41 , the inadequate sampling of which may have led to the errors for the T315A mutation. By comparison, the T315I mutations for axitinib, bosutinib, imatinib, nilotinib, and ponatinib were all accurately predicted with the exception of bosutinib/T315I being the only misprediction, suggesting an issue specific to bosutinib. The interactions between the 2,4-dichloro-5methoxyphenyl ring in bosutinib and the positively charged amine of the catalytic Lys271 may not be accurately captured by the fixed-charge OPLS3 force field, possibly leading to the misprediction for bosutinib/T315I mutation.
Insufficient sampling might also belie the imatinib/E255V and nilotinib/E255V mispredictions because they reside in the highly flexible P-loop. Since E255V was a charge change mutation, we utilized a workflow that included a transmutable explicit ion (see Methods). The distribution of these ions in the simulation box around the solute might not have converged to their equilibrium state on the relatively short timescale of our simulations (5 ns), experimental classes using a 1.36 kcal mol −1 (10-fold change) threshold; columns denote predicted susceptible (s, ΔΔG ≤ kcal mol −1 ) or resistant (r, ΔΔG > kcal mol −1 ). Correct predictions populate diagonal elements (orange text), incorrect predictions populate off-diagonals. Accuracy, specificity, and sensitivity for two-class classification are shown below the truth table. Elliptical point sizes and error bars in the scatter plots depict estimated uncertainty/variability and error, respectively, (±σ) of FEP+ values (vertical size) and experimental values (horizontal size). Note: The sensitivity for axitinib and ponatinib is NA, because there is no resistant mutation for these two drugs and the insufficient sampling of ion distributions coupled with P-loop motions might lead to misprediction of these two mutations.
How strongly is accuracy affected for docked TKIs? To assess the potential for utilizing physics-based approaches in the absence of a high-resolution experimental structure, we generated models of Abl bound to two TKIs-erlotinib and gefinitib-for which co-crystal structures with wild-type kinase are not currently available. In Fig. 5, we show the Abl:erlotinib and Abl: gefitinib complexes that were generated using a docking approach (Glide-SP, see Methods). These two structures were aligned against the co-crystal structures of EGFR:erlotinib and EGFR: gefinitib to highlight the structural similarities between the binding pockets of Abl and EGFR and the TKI binding mode in Abl versus EGFR. As an additional test of the sensitivity of FEP+ to system preparation, a second set of Abl:erlotinib and Abl: gefitinib complexes was generated in which crystallographic water coordinates were transferred to the docked inhibitor structures (see Methods).
Alchemical free-energy simulations were performed on 13 mutations between the two complexes; 7 mutations for erlotinib and 6 mutations for gefitinib. The quantitative accuracy of FEP+ in predicting the value of ΔΔG was excellent-MUE and RMSE of 0:58 0:86 0:33 and 0:80 1:09 0:44 kcal mol −1 , respectively, if crystal waters are omitted, and 0:50 0:78 0:26 kcal mol −1 and 0:69 0:97 0:35 kcal mol −1 if crystal waters were restored after docking. Encouragingly, these results indicate that our initial models of Abl bound to erlotinib and gefitinib were reliable because the accuracy and dependability of our FEP+ calculations were not sensitive to crystallographic waters. Our secondary concern was the accuracy with which the approach classified mutations as resistant or susceptible.
While the results presented in (Fig. 5) indicate that FEP+ is capable of achieving good quantitative accuracy when a co-crystal structure is unavailable, it is important to understand why a mutation was predicted to be susceptible but was determined experimentally to be resistant. F317I was the one mutation that increased resistance to erlotinib (or gefitinib) because it destabilized binding by more than 1.36 kcal mol −1 -1:35 1:67 1:03 kcal mol −1 (gefitinib) and 1:58 1:90 1:26 kcal mol −1 (erlotinib), but the magnitude of the experimental uncertainty means we are unable to confidently discern whether this mutation induces more than 10-fold resistance to either TKI. Therefore, the one misclassification by FEP+ in Fig. 5 is not statistically significant and the classification metrics presented there underestimate the nominal performance of this alchemical free-energy method.

Discussion
The results presented in this work are summarized in Table 2. The performance metrics summarized in Table 2 indicates that the set of 131 mutations for the six TKIs in which co-crystal structures were available is on par with the complete set (144 mutations), which included results based on Abl:TKI complexes generated from docking models. The performance results for the 13 mutations for the two TKIs (erlotinib and gefitinib) in which co-crystal structures were unavailable exhibited good quantitative accuracy (MUE and RMSE) and good classification power.
Overall  From the perspective of a clinician, classification rate would be an important metric to measure the predictive power of technologies such as Prime and FEP+. To test the hypothesis that reducing the large spread in Prime predictions could improve its classification rate, we scaled the computed relative free energies and recalculated the performance metrics (Supplementary Table 8). As expected, the MUE and RMSE were improved but the specificity of Prime was drastically diminished. Scaling FEP+ eliminated its sensitivity and a naive model (all ΔΔGs = 0.00 kcal mol −1 ) had zero sensitivity. Lastly, we constructed a consensus model in which free energies were a weighted average of scaled Prime and FEP+. This model also had zero sensitivity.
To address the impact of picking a cutoff to classify predicted free energies as resistant or sensitizing, we computed ROC curves for the various predicted datasets: Prime, FEP+, naive model, and consensus model (Supplementary Figure 3 A hierarchical Bayesian approach was developed to estimate the intrinsic accuracy of the models when the noise in the experimental and predicted values of ΔΔG was accounted for. Utilizing this approach, the MUE and RMSE for Prime was found to be 1:  (N = 142), respectively, which is significantly better than Prime. Likewise, a clearer picture of the true classification accuracy, specificity, and sensitivity of FEP+ was found-0:90 0:93 0:86 , 0:92 0:95 0:90 , and 0:68 1:00 0:46 , respectively. The high accuracy of FEP+ is very encouraging, and the accuracy can be further improved with more accurate modeling of a number of physical chemical effects not currently considered by the method. While highly optimized, the fixed-charged OPLS3 37 force field can be further improved by explicit consideration of polarizability effects 42 , as hinted by some small-scale benchmarks 43 . These features could be especially important for bosutinib, whose 2,4-dichloro-5-methoxyphenyl ring is adjacent to the positively charged amine of the catalytic Lys271. Many simulation programs also utilize a long-range isotropic analytical dispersion correction intended to correct for the truncation of dispersion interactions at finite cutoff, which can induce an error in protein-ligand binding free energies that depends on the number of ligand heavy atoms being modified; 44 recently, efficient Lennard-Jones PME methods 45,46 and perturbation schemes 44 have been developed that can eliminate the errors associated with this truncation. While the currently employed methodology for alchemical transformations involving a change in system charge reduces artifacts that depend on the simulation box size and periodic boundary conditions, the explicit ions that were included in these simulations may not have sufficiently converged to their equilibrium distributions in these relatively short simulations. Kinases and their inhibitors are known to possess multiple titratable sites with either intrinsic or effective pK a s near physiological pH, while the simulations here treat protonation states and proton tautomers fixed throughout the bound and unbound states; the accuracy of the model can be further improved with the protonation states or tautomers shift upon binding or mutation considered 47,48 . Similarly, some systems display significant salt concentration dependence 49 , while the simulations for some systems reported here did not rigorously mimic all aspects of the experimental conditions of the cell viability assays.
While we have shown that predicting the direct impact of mutations on the binding affinity of ATP-competitive TKIs for a single kinase conformation has useful predictive capacity, many additional physical effects that can contribute to cell viability are not currently captured by examining only the predicted change in inhibitor binding affinity. For example, kinase missense mutations can also shift the populations of kinase conformations (which may affect ATP and inhibitor affinities differentially), modulate ATP affinity, modulate affinity for protein substrate, or modulate the ability of the kinase to be regulated or bounded by scaffolding proteins. While many of these effects are in principle tractable by physical modeling in general it is valuable to examine   Accuracy, specificity, and sensitivity were computed to assess two-class prediction performance: resistant (ΔΔG > 1.36 kcal mol −1 ) or susceptible (ΔΔG ≤ 1.36 kcal mol −1 ) 95% CIs (sub-/superscripts) were estimated from 1000 bootstrap replicates. The sensitivity for axitinib and ponatinib is NA, because there is no resistant mutation for these two drugs N quant Number of mutations for which quantitative metrics were evaluated, N class number mutations for which classification metrics were evaluated, All all mutations, xtals all mutations for which co-crystal structures were available, Glide erlotinib and gefitinib our mispredictions and outliers to identify whether any of these cases are likely to induce resistance (as observed by ΔpIC 50 shifts) by one of these alternative mechanisms.
A simple threshold of 10-fold TKI affinity change is a crude metric for classifying resistance or susceptibility due to the myriad biological factors that contribute to the efficacy of a drug in a person. In addition to affecting the binding affinity of inhibitors, missense mutations can also cause drug resistance through other physical mechanisms including induction of splice variants or alleviation of feedback. While the current study only focused on the effect of mutation on drug binding affinity, resistance from these other physical mechanisms could be similarly computed using physical modeling. For example, some mutations are known to activate the kinase by increasing affinity to ATP, which could be computed using free-energy methods like FEP.
In this communication, we tested the hypothesis that FEP+, a fully automated relative-alchemical free-energy workflow, had reached the point where it can accurately and reliably predict how clinically observed mutations in Abl kinase alter the binding affinity of eight FDA-approved TKIs. To establish the potential predictive impact of current-generation alchemical free-energy calculations-which incorporate entropic and enthalpic effects and the discrete nature of aqueous solvation-compared to a simpler physics-based approach that also uses modern forcefields but scores a single minimized conformation, we employed a second physics-based approach (Prime). This simpler physicsbased model was able to capture a useful amount of information to achieve substantial predictiveness with an MUE of 1:14 1 N = 144). While future enhancements to the workflows for Prime and FEP + to account for additional physical and chemical effects are likely to improve predictive performance further, the present results are of sufficient quality and achievable on a sufficiently rapid timescale (with turn-around times~6 h/calculation) to impact research projects in drug discovery and the life sciences. This work illustrates how the domain of applicability for alchemical free-energy methods is much larger than previously appreciated, and might further be found to include new areas as research progresses: aiding clinical decision-making in the selection of firstor second-line therapeutics guided by knowledge of likely subclonal resistance; identifying other selective kinase inhibitors (or combination therapies) to which the mutant kinase is susceptible; supporting the selection of candidate molecules to advance to clinical trials based on anticipated activity against likely mutations; facilitating the enrollments of patients in mechanism-based basket trials; and generally augmenting the armamentarium of precision oncology.

Methods
System preparation. All system preparation utilized the Maestro Suite (Schrödinger) version 2016-4. Comparative modeling to add missing residues using a homologous template made use of the Splicer tool, while missing loops modeled without a template used Prime. All tools employed default settings unless otherwise noted. The Abl wild-type sequence used in building all Abl kinase domain models utilized the ABL1_HUMAN Isoform IA (P00519-1) UniProt gene sequence spanning S229-K512. Models were prepared in non-phosphorylated form. We used a residue indexing convention that places the Thr gatekeeper residue at position 315 to match common usage; an alternate indexing convention utilized in experimental X-ray structures for Abl:imatinib (PDB: 1OPJ) 50 and Abl:dasatinib (PDB: 4XEY) 51 was adjusted to match our convention.
Complexes with co-crystal structures. Chain B of the experimental structure of Abl:axitinib (PDB: 4WA9) 52 was used, and four missing residues at the N and C termini were added using homology modeling with PDB 3IK3 53 as the template following alignment of the respective termini of the kinase domain. Chain B was selected because chain A was missing an additional 3 and 4 residues at the N and C termini, respectively, in addition to 3-and 20-residue loops, both of which were resolved in chain B. All missing side chains were added with Prime. The co-crystal structure of Abl:bosutinib (PDB: 3UE4) 54 was missing 4 and 10 N-and C-terminal residues, respectively, in chain A that were built using homology modeling with 3IK3 as the template. All loops were resolved in chain A (chain B was missing two residues in the P-loop, Q252 and Y253). All missing side chains were added with Prime. The co-crystal structure of Abl:dasatinib (PDB: 4XEY) 51 was missing 2 and 9 N-and C-terminal residues, respectively, that were built via homology modeling using 3IK3 as the template. A 3 residue loop was absent in chain B but present in chain A; chain A was chosen. The co-crystal structure of Abl:imatinib (PDB: 1OPJ) 50 had no missing loops. Chain B was used because chain A was missing two C-terminal residues that were resolved in chain B. A serine was present at position 336 (index 355 in the PDB file) and was mutated to asparagine using Prime to match the human wild-type reference sequence (P00519-1). The co-crystal structure of Abl:nilotinib (PDB: 3CS9) 55 contained four chains in the asymmetric unit all of which were missing at least one loop. Chain A was selected because its one missing loop involved the fewest number of residues of the four chains; chain A was missing 4 and 12 N-and C-terminal residues, respectively, that were built using homology modeling with 3IK3 as the template. A 4-residue loop was missing in chain A (chain B and C were missing two loops, chain D was missing a five residue loop) that was built using Prime. The co-crystal structure of Abl:ponatinib (PDB: 3OXZ) 56 contained only one chain in the asymmetric unit. It had two missing loops, one 4 residues (built using Prime) and one 12 residues (built using homology modeling with 3OY3 56 as the template). Serine was present at position 336 and was mutated to Asn using Prime to match the human wild-type reference sequence (P00519-1). Once the residue composition of the six Abl:TKI complexes were normalized to have the same sequence, the models were prepared using Protein Preparation Wizard. Bond orders were assigned using the Chemical Components Dictionary and hydrogen atoms were added. Missing side chain atoms were built using Prime. Termini were capped with N-acetyl (N terminus) and N-methyl amide (C terminus). If present, crystallographic water molecules were retained. Residue protonation states (e.g., Asp381 and Asp421) were determined using PROPKA 57 with a pH range of 5-9. Ligand protonation state was assigned using PROPKA with pH equal to the experimental assay. Hydrogen bonds were assigned by sampling the orientation of crystallographic water, Asn and Gln flips, and His protonation state. The positions of hydrogen atoms were minimized while constraining heavy atoms coordinates. Finally, restrained minimization of all atoms was performed in which a harmonic positional restraint (25.0 kcal mol −1 Å −2 ) was applied only to heavy atoms. Supplementary Table 9 summarizes the composition of the final models used for FEP.
Complexes without co-crystal structures. Co-crystal structures of Abl bound to erlotinib or gefitinib were not publicly available. To generate models of these complexes, Glide-SP 58 was utilized to dock these two compounds into an Abl receptor structure. Co-crystal structures of these two compounds bound to EGFR were publicly available and this information was used to obtain initial ligand geometries and to establish a reference binding mode against which our docking results could be structurally scored. The Abl receptor structure bound to bosutinib was used for docking because its structure was structurally similar to that of EGFR in the erlotinib-(PDB: 4HJO) 59 and gefitinib-bound (PDB: 4WKQ) 60 co-crystal structures. Abl was prepared for docking by using the Protein Preparation Wizard (PPW) with default parameters. Crystallographic waters were removed but their coordinates retained for a subsequent step in which they were optionally reintroduced. Erlotinib and gefitinib protonation states at pH 7 ± 2 were determined using Epik 61 . Docking was performed using the Glide-SP workflow. The receptor grid was centered on bosutinib. The backbone NH of Met318 was chosen to participate in a hydrogen bonding constraint with any hydrogen bond donor on the ligand. The hydroxyl of T315 was allowed to rotate in an otherwise rigid receptor. Ligand docking was performed with enhanced sampling; otherwise default settings were used. Epik state penalties were included in the scoring. The 16 highest ranked (Glide-SP score) poses were retained for subsequent scoring. To determine the docked pose that would be subsequently used for free-energy calculations, the ligand heavy-atom RMSD between the 16 poses and the EGFR co-crystal structures (PDB IDs 4HJO and 4WKQ) was determined. The pose in which erlotinib or gefitinib most structurally resembled the EGFR co-crystal structure (lowest heavyatom RMSD) was chosen as the pose for subsequent FEP+. Two sets of complex structures were subjected to free-energy calculations to determine the effect of crystal waters: In the first set, without crystallographic waters, the complexes were prepared using Protein Prep Wizard as above. In the second set, the crystallographic waters removed prior to docking were added back, and waters in the binding pocket that clashed with the ligand were removed.
Force field parameter assignment. The OPLS3 forcefield 37 version that shipped with Schrödinger Suite release 2016-4 was used to parameterize the protein and ligand. Torsion parameter coverage was checked for all ligand fragments using Force Field Builder. The two ligands that contained a fragment with a torsion parameter not covered by OPLS3 were axitinib and bosutinib; Force Field Builder was used to obtain these parameters. SPC parameters 62 were used for water. For mutations that change the net change of the system, counterions were included to neutralize the system with additional Na+ and Cl− ions added to achieve 0.15 M excess to mimic the solution conditions of the experimental assay.
Prime (MM-GBSA). Prime was used to predict the geometry of mutant side chains and to calculate relative changes in free energy using MM-GBSA single-point estimates 36 . VSGB 63 was used as the implicit solvent model to calculate the solvation free energies for the four states (complex/wild-type, complex/mutant, apo protein/wild-type, and apo protein/mutant) and ΔΔG calculated using the thermodynamic cycle depicted in Fig. 1b. Unlike FEP (see below), which simulates the horizontal legs of the thermodynamic cycle, MM-GBSA models the vertical legs by computing the interaction energy between the ligand and protein in both wild-type and mutant states, subtracting these to obtain the ΔΔG of mutation on the binding free energy.
Alchemical free-energy perturbation calculations using FEP+. Alchemical free-energy calculations were performed using the FEP+ tool in the Schrödinger Suite version 2016-4, which offers a fully automated workflow requiring only an input structure (wild-type complex) and specification of the desired mutation. The default protocol was used throughout: It assigns protein and ligand force field parameters (as above), generates a dual-topology 64 alchemical system for transforming wild-type into mutant protein (whose initial structure is modeled using Prime), generates the solvent-leg endpoints (wild-type and mutant apo protein), and constructs intermediate windows spanning wild-type and mutant states. Simulations of the apo protein were setup by removing the ligand from the prepared complex (see System Preparation) followed by an identical simulation protocol as that used for the complex. Charge-conserving mutations utilized 12 λ windows (24 systems) while charge-changing mutations utilized 24 λ windows (48 systems). Each system was solvated in an orthogonal box of explicit solvent (SPC water 62 ) with box size determined to ensure that solute atoms were no less than 5 Å (complex leg) or 10 Å (solvent leg) from an edge of the box. For mutations that change the net charge of the system, counterions were included to neutralize the charge of the system, and additional Na+ and Cl− ions added to achieve 0.15 M excess NaCl to mimic the solution conditions of the experimental assay. The artifact in electrostatic interactions for charge change perturbations due to periodic boundary conditions in MD simulations are corrected based on the method proposed by Rocklin et al. 65 , where the difference in solvation free energy of the solute under non-periodic boundary condition and that under periodic boundary condition is approximated by Poisson-Boltzmann method and serves as the correction term for each system.
System equilibration was automated. It followed the default 5-stage Desmond protocol: (i) 100 ps with 1 fs time steps of Brownian dynamics with positional restraints of solute heavy atoms to their initial geometry using a restraint force constant of 50 kcal mol −1 Å −2 ; this Brownian dynamics integrator corresponds to a Langevin integrator in the limit when τ → 0, modified to stabilize equilibration of starting configurations with high potential energies; particle and piston velocities were clipped so that particle displacements were limited to 0.1 Å, in any direction. (ii) 12 ps MD simulations with 1 fs time step using Langevin thermostat at 10 K with constant volume, using the same restraints; (iii) 12 ps MD simulations with 1 fs time step using Langevin thermostat and barostat 66 at 10 K and constant pressure of 1 atmosphere, using the same restraints; (iv) 12 ps MD simulations with 1 fs time step using Langevin thermostat and barostat at 300 K and constant pressure of 1 atmosphere, using the same restraints; (v) a final unrestrained equilibration MD simulation of 240 ps with 2 fs time step using Langevin thermostat and barostat at 300 K and constant pressure of 1 atmosphere. Electrostatic interactions were computed with particle-mesh Ewald (PME) 45 and a 9 Å cutoff distance was used for van der Waals interactions. The production MD simulation was performed in the NPT ensemble using the MTK method 67 with integration time steps of 4, 4, and 8 fs, respectively, for the bonded, near, and far interactions following the RESPA method 68 through hydrogen mass repartitioning 69 . Production FEP+ calculations utilized Hamiltonian replica exchange with solute tempering (REST) 70 , with automated definition of the REST region. Dynamics were performed with constant pressure of 1 atmosphere and constant temperature of 300 K for 5 ns in which exchanges between windows was attempted every 1.2 ps.
Because cycle closure could not be used to reduce statistical errors via path redundancy 70 , we instead performed mutational free-energy calculations in triplicate by initializing dynamics with different random seeds. The relative free energies for each mutation in each independent run were calculated using BAR 71,72 . The reported ΔΔG was computed as the mean of the computed ΔΔG from three independent simulations. Triplicate simulations were performed in parallel using four NIVIDA Pascal Architecture GPUs per alchemical free-energy simulation (12 GPUs in total), requiring~6 h in total to compute ΔΔG.
Obtaining ΔΔG from ΔpIC 50 benchmark set data. Reference relative free energies were obtained from three publicly available sources of ΔpIC 50 data (Table 1). Under the assumption of Michaelis-Menten binding kinetics (pseudo first-order, but relative free energies are likely consistent), the inhibitor is competitive with ATP (eq:ic50). This assumption has been successfully used to estimate relative free energies 34,[73][74][75] using the relationship between IC 50 and competitive inhibitor affinity K i , If the Michaelis constant for ATP (K M ) is much larger than the initial ATP concentration S 0 , the relation in eq:ic50 will tend towards the equality IC 50 = K i . The relative change in binding free energy of Abl:TKI binding due to protein mutation is simply, where IC 50,WT is the IC 50 value for the TKI binding to the wild-type protein and IC 50,mut is the IC 50 value for the mutant protein. R is the ideal gas constant and T is taken to be room temperature (300 K). As alluded to above, relating ΔpIC 50 s to ΔΔGs assumes that the Michaelis constant for ATP is much larger than the initial concentration of ATP, and that the experimentally observed ΔpIC 50 change is solely from changes in kinase:TKI binding affinity. In practice, not all of these assumptions may hold. For example, the experimentally observed ΔpIC 50 might depend on the metabolism of drugs, and for drugs with different mechanisms of action than directly binding to the kinase binding pocket (e.g., binding to the transition structures of kinases, target gene amplification, up/downregulation of positive-/negative-feedback effectors, diminished synergism of pro-apoptotic machinery, decoupling of the target from cell survival circuits) 76,77 , their inhibition ability might not correlate well with binding affinity. However, the comparison between ΔpIC 50 and ΔK D is presented in Fig. 2d, and this comparison indicates the assumptions we used to relate ΔpIC 50 to ΔΔG are reasonable for the dataset we studied.
Quantitative accuracy metrics. MUE was calculated by taking the average absolute difference between predicted and experimental estimates of ΔΔG. RMSE was calculated by taking the square root of the average squared difference between predicted and experimental estimates of ΔΔG. MUE depends linearly on errors such that large and small errors contribute equally to the average value, while RMSE depends quadratically on errors, magnifying their effect on the average value.
Truth tables. Two-class truth tables were constructed to characterize the ability of Prime and FEP+ to correctly classify mutations as susceptible (ΔΔG ≤ 1.36 kcal mol −1 ) or resistant (ΔΔG > 1.36 kcal mol −1 ), where the 1.36 kcal mol −1 threshold represents a 10-fold change in affinity. Accuracy was calculated as the fraction of all predictions that were correctly classified as sensitizing, neutral, or resistant. Sensitivity and specificity were calculated using a binary classification of resistant (ΔΔG > 1.36 kcal mol −1 ) or susceptible (ΔΔG ≤ 1.36 kcal mol −1 ). Specificity was calculated as the fraction of correctly predicted non-resistant mutations out of all truly susceptible mutations S. Sensitivity was calculated as the fraction of correctly predicted resistant mutations out of all truly resistant mutations, R. The number of susceptible mutations was 113 for axitinib, bosutinib, dasatinib, imatinib, nilotinib and ponatinib, and 12 for erlotinib and gefitinib; the number of resistant mutations R was 18 for axitinib, bosutinib, dasatinib, imatinib, nilotinib, and ponatinib, and 1 for erlotinib and gefitinib.
Consensus model. First, Prime and FEP+ (n = 142) were scaled by minimizing their RMSE to experiment by optimizing slope using linear regression. The resulting (minimum) RMSE was used in a subsequent step to combine the scaled FEP+ and scaled Prime free energies with inverse-variance weighted averaging.

ROC.
A ROC curve was generated by computing the true positive rate (sensitivity) and the true negative rate (specificity) when the classification cutoff differentiating resistant from sensitizing mutations is changed for (only) the predicted values of ΔΔG. Cutoffs were chosen by taking the minimum and maximum value of ΔΔG for a dataset (Prime or FEP+), and iteratively computing specificity and sensitivity in steps of 0.001 kcal mol −1 , which by this definition will be in the range [0,1]. Experimental positives and negatives were classified with the 1.36 kcal mol −1 cutoff. ROC-AUC was computed using the trapezoidal rule.
Estimating uncertainties of physical-modeling results. 95% symmetric CI (95%) for all performance metrics were calculated using bootstrap by resampling all datasets with replacement, with 1000 resampling events. Confidence intervals were estimated for all performance metrics and reported as x x high x low where x is the mean statistic calculated from the complete dataset (e.g., RMSE), and x low and x high are the values of the statistic at the 2.5th and 97.5th percentiles of the value-sorted list of the bootstrap samples. Uncertainty for ΔΔGs was computed by the standard deviation between three independent runs (using different random seeds to set initial velocities), where the 95% CI was [ΔΔG − 1.96 × σ FEP+ , ΔΔG + 1.96 × σ FEP+ ] kcal mol −1 . 1σ used in plots for FEP+ and experiment; 0σ for Prime.
Bayesian hierarchical model to estimate intrinsic error. We used Bayesian inference to estimate the true underlying prediction error of Prime and FEP+ by making use of known properties of the experimental variability (characterized in Fig. 2) and statistical uncertainty estimates generated by our calculations under weak assumptions about the character of the error.
We presume the true free-energy differences of mutation i, ΔΔG true i , comes from a normal background distribution of unknown mean and variance, where there are M mutations in our dataset. We assign weak priors to the mean and variance μ mut $ U À6; þ6 ð Þ ð 4Þ where we limit σ > 0. We presume the true computational predictions (absent statistical error) differ from the (unknown) true free-energy difference of mutation ΔΔG true i by normally distributed errors with zero bias but standard deviation equal to the RMSE for either Prime or FEP+, the quantity we are focused on estimating: In the case of Prime, since the computation is deterministic, we actually calculate ΔΔG true Prime for each mutant. For FEP+, however, the computed free-energy changes are corrupted by statistical error, which we also presume to be normally distributed with standard deviation σ calc,i , where ΔΔG i,FEP+ is the free energy computed for mutant i by FEP+, and σ i,FEP+ is the corresponding statistical error estimate. The experimental data we observe is also corrupted by error, which we presume to be normally distributed with standard deviation σ exp : Here, we used an estimate of K d − and IC 50 -derived ΔΔG variation derived from the empirical RMSE of 0.81 kcal mol −1 , where we took σ exp % 0:81= ffiffi ffi 2 p ¼ 0:57 kcal mol −1 to ensure the difference between two random measurements of the same mutant would have an empirical RMSE of 0.81 kcal mol −1 .
Under the assumption that the true ΔΔG is normally distributed and the calculated value differs from the true value via a normal error model, it can easily be shown that the MUE is related to the RMSE via x calc À x true j j ð11Þ The model was implemented using PyMC3 78 , observable quantities were set to their computed or experimental values, and 5000 samples drawn from the posterior (after discarding an initial 500 samples to burn-in) using the default NUTS sampler. Expectations and posterior predictive intervals were computed from the marginal distributions obtained from the resulting traces.
Data availability. All relevant data are publicly available: compiled experimental datasets, input files for Prime and FEP+, and computational results that support our findings can be found at GitHub by following the URL: https://github.com/ kehauser/Predicting-resistance-of-clinical-Abl-mutations-to-targeted-kinaseinhibitors-using-FEP.