The role of water in host-guest interaction

One of the main applications of atomistic computer simulations is the calculation of ligand binding free energies. The accuracy of these calculations depends on the force field quality and on the thoroughness of configuration sampling. Sampling is an obstacle in simulations due to the frequent appearance of kinetic bottlenecks in the free energy landscape. Very often this difficulty is circumvented by enhanced sampling techniques. Typically, these techniques depend on the introduction of appropriate collective variables that are meant to capture the system’s degrees of freedom. In ligand binding, water has long been known to play a key role, but its complex behaviour has proven difficult to fully capture. In this paper we combine machine learning with physical intuition to build a non-local and highly efficient water-describing collective variable. We use it to study a set of host-guest systems from the SAMPL5 challenge. We obtain highly accurate binding free energies and good agreement with experiments. The role of water during the binding process is then analysed in some detail.

of these water-related CVs has been able to accurately describe the highly non-local changes in water structure that take place during binding, both in the vicinity of the ligand and in and around the binding pocket.
In order to succeed in our endeavour, we rely on a combination of physical considerations and modern machine learning (ML) techniques.In particular, we use a method that we have recently developed that goes under the name of Deep Linear Discriminant Analysis (Deep-LDA) [24].Deep-LDA builds efficient CVs from the equilibrium fluctuations of a large set of descriptors, expressing them as a neural network (NN).In this context, the choice of descriptors is essential and we appeal to our physical understanding to introduce one such set that is capable of characterising not only the ligand solvation shell but also the water structure inside and outside the binding cavity.After building such a CV, we use it in OPES for accelerating the sampling of binding-unbinding events.
We measure the performance of our approach on a set of test systems taken from the SAMPL5 competition [25,26] and study the interaction of six ligands with an octa-acid calixarene host (OAMe) (see Fig. 1).We choose this system because, despite its relative simplicity, it retains most of the key features of a biologically relevant protein-ligand system.Very recently, a closely related system has been used to investigate how water flows in and out of the system in the absence of a ligand [27].Furthermore, its symmetry simplifies the analysis and comparison can be made to existing experiments [28] and theoretical calculations [29,30,31].

Collective variables from equilibrium fluctuations with Deep-LDA
In this work, we are mainly interested in computing the free energy difference ∆G between the bound state (B) in which the ligand sits in the lowest free energy binding pose and the unbound state (U) where the ligand is solvated in water and free to diffuse.In order to obtain a CV able to capture water behaviour we use the recently developed machine learning Deep-LDA method [24].
Deep-LDA is a non-linear evolution of the time-honoured Linear Discriminant Analysis (LDA) classification method [32].In LDA, one takes two sets of data, in our case the configurations visited in short unbiased simulations in B and U, and defines a set of N d descriptors d that are able to distinguish between the two.The aim of LDA is to find the linear combination of descriptors s = w T d that best separates the two sets of data, w being a N d -dimensional vector.
To this effect, one calculates for each set of data the vectors of the average descriptors values µ B , µ U and their variance matrices S B , S U .With these quantities, one then computes the so-called Fisher's ratio: where one has defined the within scatter matrix S w = S B + S U and the between one The w that maximises this ratio is the direction that optimally discriminates the two states and gives the best separated projection of the data in the one-dimensional s space.The variable thus obtained has been shown to perform well as CV in many cases, especially if one uses its Harmonic LDA variant [33,34].
In Deep-LDA, a similar paradigm applies with the key difference that LDA is performed on a nonlinear transformation of the descriptors.The non-linearity is introduced by a neural network (NN) (see Fig. 2) whose input is the set of N d descriptors d and the outputs are the N h components of the last hidden layer h.LDA is performed on the components of h, so that, after determining the corresponding S w and S b , the NN is optimised using J (w) as loss function.At convergence, one determines the weights of the NN and the N h -dimensional optimal vector w that produces the Deep-LDA projection: Deep-LDA is a powerful classifier that tends to compress the data into very sharp distributions which are unsuitable for enhanced sampling applications.To address this issue, we smooth the distributions by applying the following cubic transformation s w = s + s 3 , in the spirit of what done in Ref. [35].The CV thus obtained will be used to describe water behaviour in our simulations.

Including water in the model
The choice of the descriptors d is of paramount importance since it implies the physics that we want to describe.In our case, we are interested in capturing the role of water in the binding process.To this effect, we choose two sets of points around which we compute the water coordination number.One set is located on the ligand, while the second one is fixed along the host's axis z at regular intervals (see Fig. 1 and the Supporting Information (SI)).
The first set of coordination numbers {L i } describes water solvation around the ligand and is similar in spirit to the ligand solvation variables that have been used in the past [4,23].The second one {V i } is aimed instead at capturing the water arrangement inside and outside the binding pocket without any explicit reference to the ligand.The whole set of descriptors {L i , V i } gives information on the structure of water and its non-local changes on a small to medium length scale during the binding-unbinding process.The use of these descriptors is one of the elements of novelty in our approach and one of the keys to its success.
Figure 3: Free energy surfaces projected along the host-guest distance.For each of the six ligands, we compute the free energy along the s z variable using a standard umbrella-sampling-like reweighting formula to recover the unbiased distribution [11].The shaded areas indicate the errors, whose calculation is detailed in the SI.To ensure that the results do not depend on a specific realisation of the Deep-LDA CV, we repeat the training three times by using different initial weights of the NN.The resulting CVs are denoted as s a w , s b w and s c w and the corresponding FES are indicated respectively by dashed, dotted and dash-dotted lines.For clarity, curves related to the same ligand but with different CVs are shifted by 1 kcal/mol, while the shift between different ligand curves is 5 kcal/mol.

Binding free energies from enhanced sampling simulations
We perform OPES simulations to estimate the binding free energies of all the six ligands of Fig. 1.We use the Deep-LDA CV s w together with a second CV s z , that is the projection of the ligand centre of mass on the binding axis z.In the ligand binding context, using the latter is a natural choice [4,31] as it has a clear physical interpretation and helps in clearly distinguishing B from U. Furthermore, we employ a funnel-like restraint potential [3] to encourage the ligand to find its way back to the binding site once it is out in the solution.The entropic correction to the free energy due to the imposition of the funnel can be calculated analytically (see Eq. S-4) and is taken into account when computing the binding free energies ∆G.We refer the interested reader to the SI for further details.
The combined use of these two CVs leads to a very efficient sampling, which is reflected in a high number of binding-unbinding events per unit time (see for example Fig. S-16).We notice a clear improvement over a more standard set of CVs [31], namely s z itself and the cosine of the angle θ between the binding axis z and the ligand orientation (see Fig. S-17).The introduction of a water-based CV in enhanced sampling simulations allows the system to reach a regime where it diffuses effortlessly from one metastable state to another, yielding a high accuracy in estimating ensemble averages of physical quantities.In (b), we report their difference with the experimental values and compare them with other computational results performed using the same simulation setup.Results from [31] are indicated with red circles, from [30] in green diamonds and from [29] in yellow squares.Performing enhanced sampling simulations allows retrieving the equilibrium distribution P(s) of any collective variable s [12].Here we focus on the free energy surface (FES), defined as FES(s) = −k B T log P(s) where k B is the Boltzmann constant and T is the temperature of the system.In the context of ligand binding, it is customary to look at the FES as a function of the host-guest distance s z .For each of the six ligands we compute the FES and estimate the errors with a block average analysis.We report these results in Fig. 3 in which we also assess the robustness of the Deep-LDA CV by showing the results corresponding to three different Deep-LDA training.
We then report the binding energies ∆G corrected for the presence of the funnel in Tab. 1.In Fig. 4 we compare them with experimental values and theoretical calculations performed on the same model but with different sampling techniques [29,30,31].Our results are by and large in agreement with a previous MetaD calculation [31].However in our case there is a dramatic reduction in the errors.

Water behaviour in the ligand-free state
To gain a better understanding of the role that the water plays in host-guest interaction, we first investigate how water interacts with the host in the absence of a ligand.For such analysis, we run a plain molecular dynamics (MD) simulation and find that the number of water molecules inside the cavity fluctuates with a bi-modal distribution between wet and dry states (see Fig. 5 (b)), as observed in a similar system [27].A typical wet configuration is the one in which three water molecules form a linear cluster inside the cavity, in agreement with results on an analogous system [16].
Another way of representing cavity solvation is to calculate the water oxygen density averaged over the angles around the binding axis (see Fig. 5 (a)), taking advantage of the host's symmetry (see Fig. 1).We observe that there is a high probability of finding a water molecule at the centre of the cavity in proximity of the 8 equatorial oxygen atoms.An analysis of the charge distribution shows that this position is a minimum of the electrostatic potential (see Fig S-2 in the SI).Starting from this position a short wire of hydrogen-bonded water molecules can form inside the cavity.This wire can possibly link up with water outside the pocket as indicated by the density bands in Fig. 5 (a).

The role of water in ligand binding: the case of G4
The use of the Deep-LDA CV s w not only allows us to obtain accurate binding energies but also a detailed insight into water behaviour during the binding process.We now describe the results obtained for the case of the host in the presence of the ligands.We illustrate here the case of G4, the guest that shows the most complex behaviour, and refer the interested reader to the SI for a detailed analysis of all the other ligands.
In Fig. 6 we show the FES of G4 and the cylindrically averaged water density in the different metastable states.We find that the system presents two binding poses B and B1.The lowest free energy binding pose B is the same as the one found in the experiments and contains no water.Our simulation discovered a second binding pose B1 that differs from B for the presence of a water molecule at the centre of the cavity.This second pose is ≈ 2 k B T higher in free energy and thus it is occupied with a much lower probability.
When the ligand exits the pocket, before being fully solvated, it can pass through two intermediate short lived states I and I1.In I, the cavity is dry and the ligand is free to rotate in front of the cavity entrance.In I1, the ligand sits again in front of the host entrance but its rotations favour configurations in which the ligand bromine atom points towards the cavity forming a linear arrangement where a water at the centre of the cavity is bridged by another water to the Br − anion (see Fig. S-20 in SI).We underline that neither B1 nor I and I1 were part of the Deep-LDA training.
The ability of the Deep-LDA CV s w to capture the non-local water structural changes that appear in our system is the main reason behind our capability to study the system's FES and its metastable states at this level of detail.Non-locality manifests itself in a collective action at a distance on the water in the enhanced sampling simulations, allowing the water to be moved in and out of the pocket even while the ligand is fully solvated and far from the host.Local CVs that only describe the average ligand solvation can only partially take into account these non-local effects.Moreover, the use of CVs that concentrate solely on the position of the ligand with respect to the binding site such as s z would clearly lead to an incomplete picture.In fact, B and B1 (and similarly I and I1) cannot be distinguished properly by s z alone and, without the presence of a bias pushing the bound water out of the cavity, B1 would erroneously be over represented in the sampling.

Conclusions
We have shown that, even in the relatively simple systems studied here, a complex and subtle reorganisation of water structure takes place and our strategy is able to capture it.Our calculations not only lead to binding free energies of remarkable accuracy but also offer a powerful analysis tool, proving once more that choosing the right CV is not a mere technical issue but is, in a sense, the solution of the problem.Having been able to reduce this much the sampling error, we might even be tempted to claim that the discrepancies with respect to experiments can be blamed mainly on the inaccuracy of the force field.The method is very robust and defines a protocol that can be naturally applied to larger and more complex systems.In fact, the sampling proficiency of our method will prove even more crucial in complex scenarios where a large number of water molecules can be trapped in multiple pocket locations.
Figure 6: Binding FES of ligand G4 with a study of the water presence in the visited states.We show the two-dimensional FES of the ligand G4 with respect to s z and Deep-LDA CV s w .Different adjacent colours corresponds to a free energy difference of 1 k B T ≈ 0.6 kcal/mol.We highlight some relevant states over which we perform plain MD simulations to measure the presence of water.We show histograms of the water oxygen atoms density in cylindrical coordinates z, r.Each histogram is normalised by the density value in its top right corner and darker colours correspond to higher water density regions.The position of the ligand in these plots is illustrative.

Figure 1 :
Figure 1: Sketch of the octa-acid host OAMe with the funnel restraint geometry and the guest molecules from the SAMPL5 challenge.We indicate the position of the points where the descriptors are centred and hint at their spatial outreach by drawing surfaces at a constant radius around some of them.

Figure 2 :
Figure 2: Schematics of the Deep-LDA architecture used in this work.The descriptors d are fed to a NN that generates s as a linear combination of the last NN hidden layer h and the LDA eigenvector w.The Deep-LDA CV is then s w = s + s 3 .

Figure 4 :
Figure 4: Comparison of the binding free energies with experiments and other calculations.In (a), we plot the value of ∆G obtained from the Deep-LDA simulations (in blue crosses) for every ligand versus the experimental values.In (b), we report their difference with the experimental values and compare them with other computational results performed using the same simulation setup.Results from[31] are indicated with red circles, from[30] in green diamonds and from[29] in yellow squares.

Figure 5 :
Figure 5: Water distribution analysis in the presence of the host without a guest.In (a), histogram in cylindrical coordinate z, r representing the presence of the water oxygen atoms around the host without any guest molecule being present.Darker colours correspond to a higher water density.In (b), probability distribution of the number of water molecules inside the pocket in the absence of a guest molecule.We show the snapshots of typical configurations for each case.

Table 1 : Binding free energies.We show
the mean binding energy ∆G (kcal/mol) for every ligand and the corresponding experimental value.We calculate ∆G as a weighted block average over the simulations with all Deep-LDA CVs (see SI for further details).