Application of ESMACS binding free energy protocols to diverse datasets: Bromodomain-containing protein 4

As the application of computational methods in drug discovery pipelines becomes more widespread it is increasingly important to understand how reproducible their results are and how sensitive they are to choices made in simulation setup and analysis. Here we use ensemble simulation protocols, termed ESMACS (enhanced sampling of molecular dynamics with approximation of continuum solvent), to investigate the sensitivity of the popular molecular mechanics Poisson-Boltzmann surface area (MMPBSA) methodology. Using the bromodomain-containing protein 4 (BRD4) system bound to a diverse set of ligands as our target, we show that robust rankings can be produced only through combining ensemble sampling with multiple trajectories and enhanced solvation via an explicit ligand hydration shell.

complex r eceptor l igand where 〈 〉 G complex , 〈 〉 G receptor and 〈 〉 G ligand are the average values of the Gibbs free energy for the complex, receptor (protein) and ligand respectively.
Sampling of the complex and its two components can be performed independently or conformations of the receptor and ligand extracted from simulation of the complex. The latter approach is more commonly used due to its improved convergence behaviour, a consequence of cancellation between the noisy terms describing the internal energy of the ligand, receptor and complex 30 . However, recent work has indicated that adaptation energies associated with confining the receptor and ligand in a complex can differ significantly even for closely related complexes 9 . Here we investigate a range of ESMACS protocols incorporating different component sampling strategies. When both the receptor and ligand contributions are computed from the complex trajectory we designate this a "1traj protocol". When all three derive from independent trajectories we refer to this as the "3traj protocol" and when only one or other of the receptor or ligand contributions do so a "2traj protocol". A suffix (either -fl or -fr, for flexible ligand and receptor respectively) is added to the protocol name to signify which component is derived from the independent simulation. Additional variants involve the use of the average receptor contribution across the complex simulations for all comparable ligands, which is indicated with an -ar (averaged receptor) suffix in the protocol name. A summary of all of the protocols, describing from which simulation component data is obtained, is given in Table 1. It should be noticed that the statistical performance of the pair of protocols 1traj-ar and 2traj-fr, and 2traj-ar and 3traj are the same, as the receptor contribution in all cases is constant. Consequently, we do not analyze the 3traj or 2traj-fr protocols explicitly.
The binding free energy change calculated by MMPBSA (ΔG MMPBSA ) can be broken down into a number of components:  Table 1. Summary of the origin of component contributions in 6 ESMACS protocols indicating whether they come from the ensemble of simulations run for the complex (C) or separate ensembles performed for the receptor (R) and ligands (L). Constant refers to the use of a constant, usually the average value across the studied systems.
www.nature.com/scientificreports www.nature.com/scientificreports/ where ΔG ele MM , ΔG vdW MM and ΔG int MM are the electrostatic, van der Waals and the internal bonded contributions to the molecular mechanics free energy difference, respectively, and ΔG pol sol and ΔG nonpol sol are the polar and non-polar solvation terms, respectively.
The MMPBSA.py 31 program, provided as part of the AmberTools 14 package 32 , was used in the evaluation of all components of the MMPBSA calculation. The electrostatic free energy of solvation, ΔG pol sol , is the part of the calculation described by the Poisson-Boltzmann (PB) calculation. Default values were used for the PB calculation (grid spacing of 0.5 Å, internal and external dielectric constants of 1 and 80, respectively). The non-polar solvation free energy calculation is calculated from the solvent accessible surface area using the traditional one component method (specified using inp = 1 in the input file). In this approach the surface tension, γ, is set to 0.00542 kcal mol −1 Å −2 ) and the off-set, β, to 0.92 kcal mol −1 . The fill ratio parameter was set to 4.0 which does not impact the results but ensures the stability of the calculations. For calculations in which explicit water molecules were incorporated as part of the receptor, the closest N molecules to the ligand were chosen for inclusion.
Entropic contribution to binding free energies. A variety of options are available to incorporate entropic contributions to ΔG. The most common approach is normal mode analysis 33,34 but it can require similar computational effort to the underlying simulations in order to obtain converged results 18 . Consequently, here we explore the use of another, more computationally efficient, alternative approach proposed by Duan et al. 35 . In their formulation the "variational entropy" can be derived from the fluctuations of the receptor-ligand interaction energy, E inter . This energy can be calculated using components of the MMPBSA calculation: The fluctuation in interaction energy is then given by:

inter i nter inter
where angle braces indicate an ensemble average. This is then used to compute the entropic contribution to binding via: where k B is the Boltzmann constant and β = k T 1/ B .

Figure 1.
Overview of the ESMACS workflow. The 1traj protocol is shown in (a) consisting of an ensemble of 1 to N (25 in this study) simulations of the protein-ligand complex. Each simulation is made up of (min)imization and two (eq)uilibration steps and a single production NAMD run which are each analyzed independently using the MMPBSA.py script. The output of the analysis is then collated and bootstrap statistics produced. The multiple trajectory approaches, shown in (b) follow a similar outline but with independent trajectories also run of the ligand system alone.
www.nature.com/scientificreports www.nature.com/scientificreports/ simulation setup. Ensembles of 25 replica MD simulations were conducted using the package NAMD 2.11 36 for each system (complex, receptor or ligand) studied. All simulations were conducted using the protocol incorporated into BAC 23 . We have previously shown that the use of 25 replica ensembles provides a good balance of computational cost and calculation uncertainty for a number of varied systems 8,17,18 .
Each system was minimized with all heavy protein atoms restrained at their initial positions (with a restraining force constant of 4 kcal mol −1 Å −2 ). Initial velocities were then generated independently for each replica from a Maxwell-Boltzmann distribution at 50 K. Each system was virtually heated to 300 K over 60 ps and subsequently maintained at this temperature using a thermostat (employing a coupling coefficient of 1 ps −1 ) during which time the restraints applied during minimization were retained. Once the system reached the correct temperature the pressure was maintained at 1 bar using a Berendesen barostat (with a pressure coupling constant of 0.1 ps). Subsequent to the heating, a series of equilibration runs, totaling 2 ns, were conducted, during which the restraints on heavy atoms were gradually reduced. The restraint reduction occurs in ten 100 ps steps, after each one the force constant was halved. Finally, 4 ns production simulations were executed with snapshots output for analysis every 100 ps. A 2 fs time step was used for all MD simulation steps. The workflow of the ESMACS protocols is shown in Fig. 1. For each system run through the 1traj protocol an ensemble of independent NAMD simulations is executed, consisting of four steps. The first minimization (min), which is followed by two equilibration steps (labelled eq1 and eq2 respectively). In Eq. 1 the system is heated while restraints are applied to heavy atoms. In Eq. 2 restraints are gradually reduced before free simulation is undertaken. After 2 ns of aggregate equilibration the 4 ns production phase is initiated. It is the production trajectory which is analysed by MMPBSA.py. A script is then run to aggregate these results from the ensemble of simulations and values of ΔG MMPBSA computed along with bootstrap statistics. In multiple trajectory approaches a second ensemble of ligand-only simulations is conducted and fed into the aggregation and bootstrapping script. Full simulation details are provided in the main text. experimental Datasets. This study investigates a combination of BRD4 ligand binding datasets which have been the subjects of earlier studies. The first, previously studied by Aldeghi et al. 15 using a combination of FEP based absolute binding free energy and MMPBSA techniques, contains a diverse set of 11 ligands which will be referred to as the diverse (DIV) dataset. The second was recently studied by our group in collaboration with GlaxoSmithKline 9 (using a combination of ESMACS and ensemble thermodynamic integration approaches) and contains 16 ligands, all based on a single tetrahydroquinoline (THQ) template (consequently we identify this as the THQ dataset). The compounds were selected to represent a range of chemical functionality and binding affinities, despite their shared scaffold. The first 11 compounds are labeled 1 to 9 according to the scheme used by Aldeghi et al. 15 , the THQ based ligands are labeled THQ1 to THQ16 (the numbers correspond to those used in Wan et al. 9 ). The chemical structure of the first 11 compounds and the THQ scaffold are shown in Fig. 2. Details of the groups found at positions R1 to R4 in the THQ based ligands are detailed in Fig. 3 and Table 2. Ligand 4 was parameterized with a charge of +1. Compounds THQ10 to THQ12 and THQ16 are positively charged (+1), and compounds THQ13 to THQ15 are negatively charged (−1).
Experimental binding free energies (ΔG expt ) for the first dataset were obtained from a combination of SPR, Alphascreen and Isothermal Titration Calorimetry (ITC) experiments 15 , whereas those for the THQ dataset are derived from IC 50 values from FRET 9 . These techniques are very different from one another and will necessarily introduce varying levels of uncertainty into the data they provide. The divergence in the origin of the measurements is representative of the sources of experimental data to which free energy calculations are typically compared. This, alongside the lack  15 (1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11) and the tetrahydroquinoline (THQ) scaffold. Full R-group information for the THQ dataset ligands is provided in Fig. 3 and Table 2. www.nature.com/scientificreports www.nature.com/scientificreports/ of rigorously derived uncertainty estimates in the experimental data, must be borne in mind when assessing protocol performance. In Table 3 we provide the full experimental binding affinities for both the diverse (DIV) and tetrahydroquinoline scaffold (THQ) datasets. structural models. The ligands from both datasets were simulated bound to the two BRD4 structural models based on PDBs 2OSS and 4BJX respectively (these are the initial structures used in Aldeghi et al. 15 and Wan et al. 9 ). The former represents the apo BRD4 and the latter the protein bound to a THQ based ligand. The secondary structure of both models is very similar (see Fig. 4a) and the RMSD between the two structures is 0.44 Å. All crystallographic water molecules were retained, including four which are conserved in the binding site of both models. The poses of the ligands in the DIV dataset were extracted from crystal structures (PDBs: 3U5J, 3U5L, 4OGI, 4OGJ, 3MXF, 4MR3, 4MR4, 3SVG, 4J0R and 4HBV), except for one ligand, labelled 10, which was modeled (based on PDB 3SVG) and docked into 2OSS as two conformers. These are the same two conformers used in Aldeghi et al. 15 , differing by a 180° flip of the trifluorophenyl moiety. The modelled poses were aligned and copied into the 4BJX based models. Poses of the THQ ligands were based on that of I-BET726 as found in the 4BJX structure.
System setup, including the creation of a water box and addition of neutralizing ions, was performed using AmberTools 17 37,38 . The majority of simulations were conducted using protein parameters taken from the standard Amber force field for bioorganic systems (ff14SB) 39 . Reproducibility studies of the THQ ligands were conducted using an earlier version of the forcefield, ff99SBildn 40 .
Drug parameters were produced using the general Amber force field (GAFF) 41 . The majority of the simulations presented here employ ligands prepared using the Gaussian/RESP protocol. In this approach, Gaussian 98 42 was used to perform geometric optimization of the inhibitor with 6-31G** basis functions, and the restrained electrostatic potential (RESP) procedure was used to calculate the partial atomic charges. Reproducibility studies Table 2. Composition of the ligands of the THQ dataset. The groups found at R4 are shown in Fig. 3(a-h), with the common THQ scaffold in Fig Table 2.
www.nature.com/scientificreports www.nature.com/scientificreports/ of the DIV dataset were conducted using AM1-BCC 43 derived charges. All charge assignment and input file generation was performed in the Antechamber component of AmberTools.
statistics and uncertainties. All statistics presented use their standard definitions with the exception of the mean unsigned error (MUE). It is well known that MMPBSA results have a significant offset from experimental values (typically of the order of 15 to 25 kcal mol −1 ) due to a range of factors, in particularly the neglect of entropic contributions 33,34 . Consequently we present values corrected for the systematic (mean signed) error and designate them cMUE.
We compute uncertainties for all metrics through bootstrapping analysis. This method involves resampling with replacement the N input data points (in this case, the replica averages of ΔG MMPBSA ) to provide a new bootstrap sample also containing N data points. This process is repeated many times (in our case 5000 times) and the statistic of interest of each bootstrap population calculated. The standard deviation of these values provides an estimate of the uncertainty associated with an average derived from a given sample; this is what is quoted as the bootstrap error measure of our statistics. For correlation coefficients samples are drawn from the overall averages for each ligand paired with the relevant experimental value. In addition to this metric, when making a direct comparison of specific correlation coefficients we will also quote 95% confidence intervals. These intervals are calculated by sorting the bootstrap sample distribution of correlation coefficients and taking the values falling at the 2.5 and 97.5 percentiles.

Results
Here we evaluate the performance of a range of ESMACS protocols in reproducing the experimental rankings across the full diverse ligand dataset, the robustness of this ranking to choices in system setup and the influence of non-standard MMPBSA components. In the DIV dataset better ranking can be obtained using receptor flexibility alone, but in order to obtain good rankings for THQ both additional contributions are required. This is the same behaviour observed in the simulation results for the THQ dataset in Wan et al. 9 ; however the overall ranking is worse (the original study obtained an r s of 0.78 [CI: 0.53-0.92]), primarily due to the stronger predicted binding affinity for the experimentally least potent drug, THQ16, in the present study.
The improvement between 1traj and 2traj-ar is illustrated in Fig. 5, which shows that outliers are moved closer to the overall trend line (particularly apparent for the DIV ligands 3, 4 and 5 which were also outliers in Aldeghi et al. 15 ). These three ligands have similar experimental binding energies but a difference of 15 kcal mol −1 in 1traj and 10 kcal mol −1 in 2traj-ar is seen in ΔG MMPBSA The ranking improvement is larger for the THQ ligands than the DIV dataset, with the 1traj results exhibiting little if any correlation with experiment. The main THQ outliers in the 1traj results are THQ12, THQ13 and THQ9. The first two are moved closer to the trend in the 2traj-ar results but THQ9 remains more negative than might be expected. Another feature of the 2traj-ar data here is that greater separation is seen between the results obtained from the two BRD4 structures for TH12, THQ13 and most It can also be seen in Table 4 that the impact of the incorporation of receptor 'strain' in the 1traj-ar and 2traj-ar protocols is different in the DIV and THQ subsets. In the 4BJX simulations the DIV rankings are notably less good than that in the 1traj, whilst they are fairly similar in the 2OSS case. Whereas for THQ, we find that accounting for the receptor and ligand flexibility is necessary to obtain a good ranking in both cases. Overall the results from the 2OSS structure are better than those from 4BJX. However, it should be noted that the ΔG MMPBSA values for all drugs using the 1traj protocol agree within error (see Fig. 5a).

Robustness of Ranking to parameterization.
Two of the key decisions in ligand binding free energy calculations are the choices of the forcefield and how small molecules are parameterized. For simulations using Amber forcefields the choice of procedures for ligand preparation is usually whether to use AM1-BCC or Gaussian/RESP based protocols to determine atom charges in combination with the GAFF general purpose forcefield parameters. Following the choice in Wan et al. 9 we used Gaussian/RESP for the majority of simulations in this work, but to evaluate the influence of this we re-ran the DIV dataset in the 2OSS model using the AM1-BCC methodology. Figure 6a shows that the ΔG MMPBSA values for the large majority of the ligands are highly correlated between the two schemes (within 1-2 kcal mol −1 ). This and the similar correlation with experiment (shown in Table 5) indicates that our results are robust with respect to this choice.
The Wan et al. 9 study employed the Amber ff99ildn forcefield for the protein, whilst in this study we have used ff14. In general the results obtained for all ligands are consistent but two ligands at either end of the rankings, THQ9 and THQ15, differ significantly as shown in Fig. 6b. The ranking performance with ff99ildn is described in Table 6. Comparing to those for ff14 (the THQ subset values in Table 4) shows ff99ildn provides better results, especially those for 2traj-ar in the 4BJX model (r s of 0.80 compared to 0.46). There are many factors which may cause this difference but one we identified was the possibility that the balance between direct and water mediated interactions might be altered by modifications to the amino acid side chain parameters. This in part motivated our investigation of the impact of including explicit water molecules in the receptor component of our calculations (see the following section).  www.nature.com/scientificreports www.nature.com/scientificreports/ Inclusion of explicit Water. Aldeghi et al. 22 found that the inclusion of explicit water molecules as part of the receptor in MMPBSA calculations improved the correlation with experiment in the DIV dataset. Here we explore whether this finding is reproducible using ensemble simulations and is robust to the addition of THQ ligands to the dataset under investigation. We use the same strategy in selecting water molecules for inclusion as the previous work, namely using the closest N to the ligand in each frame of the simulation trajectory.
We found a large difference in the impact of explicit water molecules between the combined DIV + THQ and DIV alone datasets. The correlations within the THQ dataset do not benefit from the inclusion of the additional water molecules in any protocol. For the combined dataset we find that up to around 5 explicit water molecules improves the rankings for all protocols (see Fig. 7a). After 50 water molecules are included 1traj performance drops to show no significant correlation with experiment and is only slightly improved as more molecules are added. A similar pattern is observed for the 1traj-ar and 2traj-ar results although, after the initial improvements, performance is more stable until 100 water molecules are included when an even sharper fall off is observed.
For the DIV dataset as shown in Fig. 7b the improvements are yet more marked. The biggest improvement is seen in the 1traj results. Furthermore, the MUE for these rankings does not increase with adding more water near peak performance 3.38/3.54 for 0 and 3.08/3.08 for 5 water molecules (2OSS/4BJX   www.nature.com/scientificreports www.nature.com/scientificreports/ the peak performance has > . r 0 9 s ; however, unlike in the previous work here we see this at 2 water molecules included with a decline after 5 (as opposed to a peak at 20 and consistent performance thereafter). A number of factors could impact this including our use of ensembles of 4 ns trajectories (compared to single 16 ns runs) and Gaussian/RESP charges (as opposed to AM1-BCC). Overall though, it is important to retain at least four of the conserved water molecules in the binding site for the ESMACS calculations in order to obtain consistently good rankings across datasets. Moreover, the impact of adding water molecules differs between runs initiated with different starting structures, as shown in Table 7.  www.nature.com/scientificreports www.nature.com/scientificreports/ The combined DIV + THQ 4BJX 1traj ranking shows only a consistent result, with no improvement, as the first 5 water molecules were incorporated, whereas in 2OSS the ranking improves from an r s of 0. 46 Table 7. Performance of 1traj and 2traj-ar MMPBSA based ESMACS protocols in reproducing experimental binding free energies incorporating different numbers of explicit water molecules. Performance is measured by mean unsigned error (MUE) and Spearman's rank coefficient (r s ) (bootstrapped error provided in brackets). Results are provided for the combined diverse (DIV) and THQ datasets bound to protein models based on both PDBs 2OSS and 4BJX. *MUE corrected for mean signed error in kcal mol −1 . Figure 6. Comparison of 1traj ESMACS results using different forcefield choices. In (a) the ranking of the DIV dataset is shown with the same protein forcefield (ff14) but different ligand parameterization methods (Gaussian-RESP in dark green, AM1-BCC in lighter green). In (b) results for the THQ dataset are compared using the ff14 (dark green is based on PDB 2OSS, purple on 4BJX) and ff99ildn (cyan based on PDB 2OSS, orange on 4BJX) forcefields. Solid lines represent lines of best fit, dashed ones optimal correlations. www.nature.com/scientificreports www.nature.com/scientificreports/ from a lower baseline rapidly whilst those from 2OSS remain consistent until 5 water molecules are added, at which point the results from both structures give an r s of around 0.6. A similar pattern is seen in 2traj-ar, but with the peak performance at 2 water molecules of 0.70 [CI: 0.54-0.97]/0.67 [CI: 0.49-0.96] (2OSS/4BJX) as shown in Table 7. The increase in MUE which accompanies the improvement in correlation indicates that the effects are not uniform across all ligands. Marginal gains in correlation coefficient should not be over emphasized (as can be seen in Table 7, improvements are often within error); we rather wish to draw attention to the trend that inclusion of water molecules likely to be involved in mediating stable ligand-protein interactions improves (or at least does not degrade) calculation performance. The most important observation is that the addition of explicit water molecules improves the reproducibility of the ranking when using different starting models.  www.nature.com/scientificreports www.nature.com/scientificreports/ Variational entropy. Accounting correctly, and computationally efficiently, for the entropic component of binding free energies remains a challenge for MMPBSA based computations. Here we investigated the use of the variational entropy technique on the ranking of different ESMACS protocols. In all cases the variational entropy was computed using the fluctuations from the 1traj simulations. As shown in Table 8 the inclusion of this term results in a reduction in the performance of all protocols in simulations based on both initial models. Furthermore, the incorporation of explicit water molecules into the receptor reduces this to an even greater extent. Looking in more detail we see that some compounds suffer a deterioration in prediction whilst others manifest an improvement. For instance, Fig. 8 shows that the three DIV outliers 3, 4 and 5 are closer to the trend line than in Fig. 5, whereas THQ12 and THQ13 are more poorly predicted. The entropic term is based on the variation in interaction energy during the complex simulation. As it compares versus the average it captures properties of the interaction energy surface. For molecules such as 6, 8, 9 and 10 that have few degrees of freedom, the interaction energy surface is likely to be steep, with small changes in conformation or translations leading to a rapid loss of interaction energy. Meanwhile, larger more flexible compounds such as 4 and 5 (which has a flexible benzhydryl core) can adapt to conformational changes of the receptor and maintain a favourable interaction energy, leading to a flatter potential surface. The results suggest that this entropic term is suited to the latter but not the former examples. Correctly capturing entropic contributions is key to obtaining truly reliable rankings in diverse datasets and further work in this area is required. Also, components of the MMPBSA calculation (particularly the surface area term) incorporate some entropic contributions and such double counting may account at least in part for the poor performance of variational entropy here.

Discussion
In summary, we have investigated the influence of different analysis choices on the results of ensemble MMPBSA based free energy calculations. The basis of our tests are two datasets which cover common computational chemistry challenges -one which is based on a set of related ligands and the other a highly diverse set of ligands with differing binding modes. In order to obtain successful rankings across the two datasets we found it necessary to incorporate receptor and ligand strains.  www.nature.com/scientificreports www.nature.com/scientificreports/ significant despite the relatively modest size of the dataset (which contains a total of 27 ligands). It should be noted that increase in computational cost is minimal here as the only additional simulations required are of the ligand (which are much smaller than either complex or receptor) with the receptor energy replaced by a constant. Hence, for prospective day to day applications, we recommend accounting for both ligand and receptor strain through independent ligand simulations and either further simulation of the apo receptor (as in the 3 traj ESMACS protocol) or the use of an average value for the receptor energies (2traj-ar).
A key consideration in the use of binding free energy calculations in real world (industrial or clinical) settings is the reproducibility of the results. Other considerations include computational cost and calculation stability. ESMACS protocols offer advantages in both these regards as they make use of relatively simple and fast classical MD simulations compared to many parallel simulations of intermediate states as required in alchemical calculations of absolute binding free energies 44 . We have shown that the results obtained in this study are robust to changing the ligand charge generation protocol (to use AM1-BCC instead of Gaussian/RESP) and the forcefield used to parameterize the protein (from Amber ff14SB to ff99SBildn). The use of ensemble simulations is the key to obtaining this reproducibility as individual replicas in ensembles varied by as much as 15 kcal mol −1 (which is in line with our own and other groups previous results 9, 16,18,45 ). Despite this, performance differences were found for all initial protocols when simulations were initiated from different crystal structures.
This observation, along with the fact that some ligands which have very similar experimental binding energies were widely separated even using protocols which accounted for receptor flexibility (1traj-ar and 2traj-ar), prompted us to investigate potential enhancements of the pure MMPBSA protocol. Specifically, we looked at the inclusion of an explicit ligand hydration shell in the receptor and variational entropy which had previously been investigated for single replica simulations by Aldeghi et al. 22 (though they also only investigated what we would term "1traj" calculations). These additional components capture chemical and physical features of the system neglected by MMPBSA but at minimal computational cost, a key consideration for practical binding affinity calculation applications. The entropy term reduced extreme outliers but at the expense of decreased overall ranking performance. This observation replicates that obtained by Aldeghi et al. 22 for the DIV dataset bound to BRD4, although they found the term improved results for sensitivity based datasets including multiple proteins. When less than five water molecules were incorporated into the receptor our rankings were improved with the best ranking across the full dataset obtained using this in combination with the 2traj-ar protocol. The most important observation of our work, however, is that the inclusion of these bound water molecules considerably reduced the performance difference between simulations initiated from models based on different crystal structures. A criticism of continuum based methods is that they are incapable of capturing the effect of crucial water molecules, possible activity cliffs, etc, that are now a well understood feature of structure-activity relationship (SAR) landscapes and medicinal chemistry lead optimization. Here it is shown again how this challenge can be met, with the simple inclusion of explicit water molecules. Future work should address how to consider this in prospective application scenarios and in a wider range of protein targets.
The reason for the improved performance observed for the diverse datasets in this study is presumably due to the capture of interactions between the ligand and the closest of the four conserved water molecules also found in the binding site. This observation is in line with other work in which system dependent numbers of water molecules were found to improve rankings [46][47][48][49] and the broader phenomenon of the impact of crucial water molecules on SAR landscapes. Incorporation of the water molecules was highly effective in differentiating the ligands with diverse binding modes but less effective in the set of related THQ-scaffold based compounds. The fact that our observations fit a general pattern, and that the level of explicit water hydration which improves results is similar to the number of conserved water molecules suggests that the approach can be applied more generally.
Overall we have shown that, for a diverse set of ligands, in order to deliver reproducible results from ESMACS (MMPBSA) calculations it is necessary to account for receptor and ligand strain and account explicitly for water molecules bound alongside ligands. Essential to obtaining these results is the use of ensemble simulations to generate meaningfully quantified uncertainties.

Data Availability
Simulation input topologies and coordinates (alongside ligand parameters) for all protein-ligand systems and collated MMPBSA results are made available via Zenodo, 10.5281/zenodo.1484050. Trajectories are available from the corresponding author on reasonable request.