Past–future information bottleneck for sampling molecular reaction coordinate simultaneously with thermodynamics and kinetics

The ability to rapidly learn from high-dimensional data to make reliable bets about the future is crucial in many contexts. This could be a fly avoiding predators, or the retina processing gigabytes of data to guide human actions. In this work we draw parallels between these and the efficient sampling of biomolecules with hundreds of thousands of atoms. For this we use the Predictive Information Bottleneck framework developed for the first two problems, and re-formulate it for the sampling of biomolecules, especially when plagued with rare events. Our method uses a deep neural network to learn the minimally complex yet most predictive aspects of a given biomolecular trajectory. This information is used to perform iteratively biased simulations that enhance the sampling and directly obtain associated thermodynamic and kinetic information. We demonstrate the method on two test systems, studying processes slower than milliseconds, calculating free energies, kinetics, and critical mutations.

Supplemental information to "Past-future information bottleneck framework for sampling molecular reaction coordinate, thermodynamics and kinetics"

SUPPLEMENTARY NOTES

Note 1
In this note we define various terms introduced in the main text. Throughout, we use P(X) to denote the probability distribution of a random variable X and P(X, Y) the joint probability distribution of two random variables X and Y. All other symbols are as defined in the main text.

Mutual information
This is a commonly used information-theoretic measure of how much information is shared between two random variables X and Y. It is defined as:

I(X; Y) = ∫ dx dy P(x, y) log [ P(x, y) / (P(x) P(y)) ]

Shannon entropy
The Shannon entropy of a random variable X is defined as:

H(P(X)) = − ∫ dx P(x) log P(x)

Cross entropy
The cross entropy between two probability distributions P(X) and Q(X) is given by:

C(P(X), Q(X)) = − ∫ dx P(x) log Q(x)

Kullback-Leibler Divergence
The Kullback-Leibler (KL) divergence D_KL(P||Q) between two probability distributions P(X) and Q(X) is given by:

D_KL(P||Q) = ∫ dx P(x) log [ P(x) / Q(x) ] = C(P(X), Q(X)) − H(P(X))
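The four quantities above can be checked numerically on small discrete distributions. The sketch below (assuming NumPy; the example distributions are arbitrary) implements each definition and verifies the identity D_KL(P||Q) = C(P, Q) − H(P).

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross entropy C(P, Q) = -sum_x P(x) log Q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return -np.sum(p[m] * np.log(q[m]))

def kl_divergence(p, q):
    """D_KL(P||Q) = C(P, Q) - H(P), non-negative by Gibbs's inequality."""
    return cross_entropy(p, q) - entropy(p)

def mutual_information(pxy):
    """I(X; Y) computed from a joint probability table P(X, Y)."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return np.sum(pxy[m] * np.log(pxy[m] / (px * py)[m]))

# independent variables share no information
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # → 0.0
```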

Exact decoder definition
The exact decoder can be defined as the conditional probability distribution of X_∆t given an RC χ. It can be calculated using Bayes' theorem:

P(X_∆t | χ) = P(χ | X_∆t) P(X_∆t) / P(χ)
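On a discretized state space the inverse-Bayes decoder can be tabulated directly. The minimal sketch below (assuming NumPy; the count table is made up) builds P(X_∆t | χ) from joint counts of time-lagged pairs.

```python
import numpy as np

# hypothetical joint counts N[chi_bin, x_bin] accumulated from
# time-lagged pairs (chi_t, X_{t+Delta t}) of a discretized trajectory
counts = np.array([[8.0, 2.0],
                   [1.0, 9.0]])

p_joint = counts / counts.sum()              # P(chi, X_dt)
p_chi = p_joint.sum(axis=1, keepdims=True)   # P(chi)
decoder = p_joint / p_chi                    # Bayes: P(X_dt | chi)

print(decoder[0])  # → [0.8 0.2]
```

Each row of `decoder` is a normalized conditional distribution over future states for one χ bin.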

Acceleration factor
As shown in [1], and under the conditions detailed there, it is possible to recover the unbiased timescales from biased simulations through the calculation of a simple acceleration factor. If the biased simulation time is t, the associated unbiased timescale τ can be calculated by:

τ = ∫₀ᵗ dt′ e^{βV(χ(t′))}

where V(χ(t′)) is the bias experienced by the system at time t′, constructed as a function of the RC χ, and β = 1/(k_B T). This can equivalently be calculated as a discrete sum over the M integration time-steps:

τ = ∆t_MD Σ_{n=1}^{M} e^{βV(χ(t_n))}
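The discrete sum over time-steps can be sketched as below (assuming NumPy; the function name, units, and temperature default are illustrative choices, not part of the original protocol).

```python
import numpy as np

def unbiased_time(bias_kj, dt_ps, temperature_k=300.0):
    """tau = dt * sum_n exp(beta * V_n): unbiased timescale recovered from
    the bias V(chi(t_n)) experienced at each saved step (in kJ/mol)."""
    kB = 0.008314462618               # Boltzmann constant, kJ/(mol K)
    beta = 1.0 / (kB * temperature_k)
    return dt_ps * np.sum(np.exp(beta * np.asarray(bias_kj, dtype=float)))

# with zero bias the acceleration factor is 1 and tau equals the simulation time
print(unbiased_time(np.zeros(1000), dt_ps=0.002))  # → 2.0
```

Any positive bias along the trajectory makes τ larger than the biased simulation time, which is the acceleration being quantified.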

Note 2
In this note we provide the missing details of various derivations given in the main text.

Gibbs's inequality and variational lower bound
The bottleneck function L defined in the main text for our neural network architecture is given by a difference of two Shannon entropies [2]:

L = H(P(X_∆t)) − H(P_θ(X_∆t | χ))    (7)

Gibbs's inequality [2] guarantees that the KL divergence between two probability distributions is never negative. We thus have:

D_KL(P_θ ||Q_φ) = C(P_θ(X_∆t | χ), Q_φ(X_∆t | χ)) − H(P_θ(X_∆t | χ)) ≥ 0    (8)

Only in the limit that our approximate decoder is exactly the same as the exact inverse-Bayes decoder does D_KL(P_θ ||Q_φ) equal 0. By combining Eq. 7 and Eq. 8, we obtain the relationship:

L ≥ H(P(X_∆t)) − C(P_θ(X_∆t | χ), Q_φ(X_∆t | χ))    (9)

Here, H(P(X_∆t)) depends only on the data set and is independent of the parametrization of the encoder and the decoder. The term H(P(X_∆t)) can thus be completely ignored while optimizing the parameters θ and φ.
Maximizing the objective function −C(P_θ(X_∆t | χ), Q_φ(X_∆t | χ)) is then equivalent to maximizing the lower bound of L, which is the expression stated in Eq. 4 in the main text.
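The bound above can be verified numerically on toy discrete distributions. In this sketch (assuming NumPy; the distributions are arbitrary stand-ins for the exact and approximate decoders), the cross entropy is never below the entropy and matches it only when the two distributions coincide.

```python
import numpy as np

def H(p):
    """Shannon entropy of a discrete distribution, in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def C(p, q):
    """Cross entropy between discrete distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = np.array([0.6, 0.3, 0.1])     # stand-in for the exact decoder P_theta
q = np.array([0.4, 0.4, 0.2])     # stand-in for the approximate decoder Q_phi
assert C(p, q) >= H(p)            # Gibbs's inequality: D_KL = C - H >= 0
assert np.isclose(C(p, p), H(p))  # the bound is tight only when Q_phi = P_theta
```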

PIB objective L for unbiased trajectory
For an unbiased trajectory X_1, X_2, ..., X_{M+k}, and for a given θ, we obtain the corresponding χ_1, χ_2, ..., χ_M using χ_i = Σ_j c_j s_j(X_i). We also have a sampling of the states of the system after the corresponding ∆t intervals: X_{1+k}, X_{2+k}, ..., X_{M+k}. Together, the pairs (χ_i, X_{i+k}) sampled from the data set follow the distribution P_θ(X_∆t | χ). With this, we have:

−C(P_θ(X_∆t | χ), Q_φ(X_∆t | χ)) ≈ (1/M) Σ_{n=1}^{M} log Q_φ(X_{n+k} | χ_n)    (10)
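Equation (10) amounts to averaging the decoder log-likelihood over time-lagged pairs. A minimal sketch (assuming NumPy; the 1-D toy trajectory and the Gaussian decoder `log_q` are hypothetical, not the network used in this work) is:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1000)   # toy 1-D trajectory X_1 ... X_{M+k}
k = 5                           # lag in saved frames; Delta t = k * stride
c = np.array([1.0])             # RC weights: chi = sum_j c_j s_j(X)
chi = c[0] * X                  # here the single order parameter is X itself

def log_q(x_future, chi_now, sigma=1.0):
    """Hypothetical Gaussian decoder log Q_phi(X_dt | chi)."""
    return (-0.5 * np.log(2 * np.pi * sigma**2)
            - 0.5 * (x_future - chi_now)**2 / sigma**2)

pairs_chi, pairs_x = chi[:-k], X[k:]   # the pairs (chi_n, X_{n+k})
L_estimate = np.mean(log_q(pairs_x, pairs_chi))
print(L_estimate)   # higher is better; maximized over theta and phi
```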

PIB objective L for biased trajectory
For a biased trajectory X_1, X_2, ..., X_M with corresponding biasing potential values V_1, V_2, ..., V_M, the unbiased probability distribution of X can be calculated by reweighting each biased sample with the weight:

w_n = e^{βV_n} / Σ_{m=1}^{M} e^{βV_m}

The encoder P(χ|X) and decoder P(X_∆t|χ) are taken to be independent of the bias. The first assumption is strictly true, while the second is valid for small enough ∆t, as explained in the main text. Therefore:

−C(P_θ(X_∆t | χ), Q_φ(X_∆t | χ)) ≈ [ Σ_{n=1}^{M−k} e^{βV_n} log Q_φ(X_{n+k} | χ_n) ] / [ Σ_{n=1}^{M−k} e^{βV_n} ]

Note 3
In this note we provide details of the mutual information calculation for critical residue prediction. We use the backbone dihedral angles to describe the motion of each residue, denoting them φ_i and ψ_i, where i is the index of a particular residue. We use I(θ, χ), the mutual information between a dihedral angle θ (where θ can be φ or ψ) and the RC χ, to quantify the correlation between that dihedral angle and the unbinding process.
For each residue we have two dihedral angles, and the mutual information of each dihedral angle can be computed with either χ_1 or χ_2. This gives four quantities per residue: I(φ, χ_1), I(ψ, χ_1), I(φ, χ_2) and I(ψ, χ_2). We rank the importance of each residue by the maximum of these four quantities. Here we only consider the parts of the trajectory that are bias-free (χ_1 > 0.4 and χ_2 > 0.3), because we require the energy barriers between metastable states to be bias-free in order to obtain the correct reweighted dynamics of the unbinding process. However, during the biased MD simulation we only guarantee that the main energy barrier between the bound and unbound states carries zero bias; bias can still be deposited on barriers between other metastable states. Restricting the analysis to the unbiased region reduces the influence of the biasing potential on the dynamics of the system and focuses attention on the transition states.
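The per-residue ranking described above can be sketched as follows (assuming NumPy; the dihedral and RC arrays below are synthetic placeholders, and the histogram-based mutual information estimator is one common choice, not necessarily the estimator used in this work).

```python
import numpy as np

def mutual_information(a, b, bins=30):
    """Histogram estimate of the mutual information I(a; b), in nats."""
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    denom = px * py
    return np.sum(pxy[m] * np.log(pxy[m] / denom[m]))

def rank_residue(phi, psi, chi1, chi2):
    """Rank a residue by the maximum of I(phi,chi1), I(psi,chi1),
    I(phi,chi2), I(psi,chi2), using only the bias-free frames
    (chi1 > 0.4 and chi2 > 0.3)."""
    keep = (chi1 > 0.4) & (chi2 > 0.3)
    return max(mutual_information(d[keep], c[keep])
               for d in (phi, psi) for c in (chi1, chi2))

# toy demo: a "dihedral" correlated with chi_1 versus one that is pure noise
rng = np.random.default_rng(1)
chi1 = rng.uniform(0.0, 1.0, 5000)
chi2 = rng.uniform(0.0, 1.0, 5000)
phi = chi1 + 0.1 * rng.standard_normal(5000)   # correlated with the RC
psi = rng.standard_normal(5000)                # uncorrelated
print(rank_residue(phi, psi, chi1, chi2))
```

Applying `rank_residue` to the φ_i, ψ_i of every residue and sorting the scores reproduces the ranking procedure described in the note.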

Simulation set-up for Alanine dipeptide in vacuum
We follow [3] to set up our simulation of alanine dipeptide in vacuum. The simulations are performed with the software GROMACS 2016/GROMACS 5.0 [4,5], patched with PLUMED 2.4 [6]. We constrain bonds involving hydrogens using the LINCS algorithm [7] and employ an integration time-step of 0.002 ps. The temperature is kept constant at 300 K using the velocity rescaling thermostat (relaxation time of 0.1 ps) [8]. We employ no periodic boundary conditions and no cut-offs for the electrostatic and non-bonded van der Waals interactions.

Simulation set-up for L99A T4L-benzene
We follow [9] to set up our simulation of L99A T4L-benzene. The simulations are performed with the software GROMACS 2016/GROMACS 5.0 [4,5], patched with PLUMED 2.4 [6]. The simulations are done in the constant number, pressure, and temperature (NPT) ensemble at a temperature of 298 K and a pressure of 1.0 bar. Constant pressure is maintained using the Parrinello-Rahman barostat, while constant temperature is maintained using the v-rescale thermostat (modified Berendsen thermostat) [10]. The simulation box, with periodic boundary conditions, is filled with TIP3P water; the side lengths of the box are 10 Å and there are around 10,000 water molecules. The interactions are described by the CHARMM22* force field. The integration time-step here as well is 2 fs.
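For concreteness, a static bias V(χ) on a learned linear RC can be applied through PLUMED roughly as in the hypothetical input below; the atom indices, coefficients, and file names are placeholders for illustration, not the values used in this work.

```
# plumed.dat (illustrative sketch only)
d1: DISTANCE ATOMS=150,2600          # hypothetical protein-ligand distance
d2: DISTANCE ATOMS=800,2600
# linear RC chi = c1*d1 + c2*d2, with weights learned by the network
chi: COMBINE ARG=d1,d2 COEFFICIENTS=0.8,0.6 PERIODIC=NO
# static external bias V(chi), tabulated on a grid from the learned P(chi)
ext: EXTERNAL ARG=chi FILE=bias.grid
PRINT ARG=d1,d2,chi STRIDE=100 FILE=COLVAR
```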

SUPPLEMENTARY DISCUSSION
To discuss the flexibility in choosing basis functions, we apply the same framework with another set of order parameters. As shown in Supplementary Table 1 and Supplementary Table 2, there are two central differences between this new set of order parameters and the set discussed in the main text: (a) all helix-helix distances have been removed, and (b) protein-ligand distances are defined by the distance between the ligand and a new set of residues. In Supplementary Figure 6(a), the alpha carbons used to define the protein-ligand distances in the two basis function sets are colored differently, and our results demonstrate that the choices of distances are fairly arbitrary. We followed the protocol discussed in the main text. Only two minor changes were made, demonstrating further robustness of our protocol. First, we extended the simulation time from 4 ns to 10 ns in each trial. Second, the energy was increased by 8 kJ in each round instead of 5 kJ. We stress that these two parameters can be chosen from a relatively wide range, as long as the simulation time is long enough to sample local configurations and the bias increment is not so large that poorly sampled regions become included.
Order parameter weights in each RC are shown in Supplementary Figure 6(b). With these, we were able to obtain unbinding trajectories and perform the critical residue calculation as discussed before. Since the RCs were constructed from a basis function set that cannot fully describe the whole system, they only contain partial information about the unbinding process. With this new set of basis functions, the critical residue calculation does not give exactly the same ranking as in the main text; however, as shown in Supplementary Figure 6(c), the key residues are preserved. Some of the important residues are: Phe114, Gly110, Val111, Ala112, Thr109, Ile29, Glu108, Ser136, Lys135 and Trp138. These residues can still be classified into two broad groups: (a) residues 114, 135, 136, 138 and residues 108-112 contribute to the breathing movement; (b) residue 29 lies in a disordered region and may be picked up due to noise or the existence of a likely allosteric communication pathway. These results show that the method is robust with respect to the choice of order parameters and has the potential to be applied to other complex systems. The k_off reported in the main text is calculated as the inverse of the fitted time constant here, 303 ± 76 ms. In (b), we provide the fitted time constants with associated error bars for 3 different biasing protocols, demonstrating convergence of the estimated time constant, at least on a log-scale. It is worth noting that for the weakest bias we had the lowest number of dissociation events, so this time constant is a lower bound, as not enough slow events were captured. Hence the agreement should in principle be even better.