Introduction

Structural information is crucial for understanding functional mechanisms of ligands that bind proteins to modulate biochemical signals. In particular, rational drug design requires accurate information about ligand binding poses within the protein. X-ray crystallography gives the most detailed experimental structural information, but can be difficult to produce, particularly for membrane-bound protein-ligand complexes1 or for low affinity ligands2. Molecular dynamics (MD) simulations use physically rigorous models of protein, ligand and solution to give an atomistic and dynamic description of ligand binding and significantly inform drug design without the inaccuracies that plague simplified approaches3,4. Ligand binding events have been captured using long timescale MD simulations performed on specialized hardware5,6 and for only one ligand to a protein7,8. We adapt an approach that has been successful in studies of protein folding9 and conformational change10,11 in which we build a statistical Markov state model (MSM) from ligand binding MD simulations performed on cloud computing architectures. Our model gives accurate ab initio predictions of ligand binding poses for five ligands labeled as L2, L3, L6, L9 and 5 Androstan-3α-ol in Fig. 1, with a range of binding affinities for the FK506 binding protein 12 (FKBP12), a class of immunophilins with peptidyl-prolyl isomerase (PPI) activity12 and diverse roles in cellular signaling, particularly in immunosuppression13 and neurological function14,15. Furthermore, our approach allows a kinetic description of the binding mechanism, including association rates, protein-ligand encounter complexes, secondary binding sites or “hot spots” and druggable cryptic sites on the protein16. A summary of the steps involved in our approach is illustrated in Fig. 2.

Figure 1

FKBP12 ligands used for predictions in this study.

The chemical structures highlight the common core in the ligands L2, L3, L6 and L9 derived from FK506 from Holt, et al. Ki values are also listed19,43. A structure is available for L943, but structure factors for computing electron densities are only available for FK50625.

Figure 2

Overview of the scheme used for FKBP12 ligand predictions.

The scheme for the approach described in this study is illustrated, with steps as follows: (1) select diverse structures for the protein-ligand complex, (2) perform extensive MD simulations on cloud computing architectures, (3) construct a ligand-binding MSM at the end of a simulation period, (4) evaluate the MSM convergence, repeating steps 1–4 until convergence is reached, and (5) obtain reliable predictions. In step 4, we show the convergence of the lowest free energy state for L2, selected from MSMs built at ≈ 10 μs intervals and plotted as the state RMSD to the reference pose, shown as the rolling mean with standard deviations over 2 data points (≈ 20 μs).

Results

MSM-guided scheme for improved sampling of ligand binding events

To generate data for our model, we set up an initial round of MD simulations with diverse starting structures for the ligand and protein. These starting structures include unbound states, near-bound states and initial predicted bound states from small molecule docking to the FK506 binding site. With predicted near-bound states, we aim to enhance the sampling of binding events by incorporating available experimental information, which can include binding residues identified from structural or mutagenic studies. We drive the ligand binding MD simulations with statistical MSMs, which are constructed by first clustering the aggregate simulation data by the protein-aligned positions of the ligand, using root mean squared deviation (RMSD) as a metric. Then, transitions between these structurally defined states are counted from the raw trajectory data in order to build a transition matrix at a particular time unit that lumps the clusters into Markovian states, such that intra-state transitions are faster than inter-state transitions (see Methods). We run new simulations using an adaptive sampling approach that has been shown to improve conformational sampling by seeding new simulations based on existing MSM states17. For this study, we launched three adaptive sampling rounds, detailed in Supplementary Table 1, with the initial round performed on local computing clusters at Stanford and subsequent rounds on the distributed computing network Folding@home18. The number of adaptive sampling rounds can be chosen based on availability of resources as well as convergence plots, described below. Our results illustrate the power of MSMs, which can stitch together events that occur to different extents in many independent, parallel simulations, but that would take a long wall clock time to occur altogether in any single simulation. For example, in seven long timescale simulations performed in Shan, et al.6, only three binding events for dasatinib to Src kinase were observed in 115 μs, while we observe hundreds of binding events for our ligands (Supplementary Table 1).

MSM-derived equilibrium populations find accurate ligand binding poses and druggable binding sites

We build MSMs from simulation data at approximately 10 μs intervals and analyze metastable ligand states that can be ranked according to the maximum likelihood estimate of the equilibrium population derived from the MSM transition matrix. The final rankings are listed for the FKBP12 ligands in Supplementary Table 2. The top populated pose is monitored over time for convergence behavior, plotted as the pose RMSD to the reference structures available for the FK506-derived ligands, described in Fig. 1. In the case of the blind prediction for 5 Androstan-3α-ol, the RMSD to the final predicted pose, or distances to known key binding residues, is evaluated for convergence. The convergence plot is shown for L2 in the scheme of Fig. 2 and shown for all ligands in Supplementary Fig. 1. These plots help us to evaluate our confidence in the predictions and guide the iterative process of adaptive sampling. The ligand poses converge to < 3 Å RMSD to the available experimental structure after ≈ 100 μs of aggregate simulation time. Our final predictions for the FK506-derived ligands have close overlap with the FK506-derived electron density, shown for L2 in Fig. 3a, as well as with the reference poses in Fig. 3b, which are in agreement at < 1.3 Å for three out of four ligands.
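The ranking of metastable states by equilibrium population can be illustrated with a minimal sketch. It assumes a small, hypothetical row-stochastic transition matrix and takes the stationary distribution as the left eigenvector with eigenvalue 1; this is a simplified stand-in for the maximum likelihood estimate used in the study, not the MSMBuilder2 implementation.

```python
import numpy as np

def equilibrium_populations(T):
    """Stationary distribution of a row-stochastic transition matrix T."""
    # The left eigenvector of T with eigenvalue 1 is a right eigenvector of T^T.
    evals, evecs = np.linalg.eig(T.T)
    idx = np.argmax(evals.real)          # eigenvalue closest to 1 is the largest
    pi = np.abs(evecs[:, idx].real)      # remove the arbitrary overall sign
    return pi / pi.sum()                 # normalize to a probability vector

# Hypothetical 3-state transition matrix (values are illustrative only).
T = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.90, 0.05],
              [0.01, 0.04, 0.95]])

pi = equilibrium_populations(T)
ranking = np.argsort(pi)[::-1]           # state indices, most populated first
print(ranking, pi[ranking])
```

In this toy example the most populated state plays the role of the top-ranked ligand pose whose RMSD to a reference would then be monitored for convergence.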

Figure 3

Comparison of FKBP12 ligand pose predictions with experiment.

(a) The available electron density for FK50625 is shown in the left panel, compared with the adjusted density that corresponds to the common scaffold for L2, in the right panel. The density is shown at the 1σ contour level of the 2Fo–Fc difference map computed in PHENIX44. The predicted L2 pose is shown in cyan stick representation to illustrate overlap with the available density. (b) Overlap of the predicted pose from the MSM (cyan), determined as the converged, highly populated ligand MSM state, with the validation pose (green). The validation pose is derived from crystallography experiments for L943, or from overlap and minimization of common scaffolds for L2, L3 and L6 with the L9 and FK506 structures, as done previously for accurate binding free energy predictions for these ligands28,41. The RMSD between the structures is listed. (c) The 1.0 kcal/mol contour of the 3-D MSM-weighted free energy map within the active site is shown for L2 and 5 Androstan-3α-ol, with key binding residues labeled and a solid bar with a star denoting the 80's loop region. The free energy minimum surface is shown for all ligands in Supplementary Information Fig. 3. The L2 surface can be closely compared to the electron density contour in (a).

The entire converged MSM can be incorporated into a 3-D free energy map (see Methods) that reveals both low and high free energy ligand binding sites on the protein. The minimal free energy surface in the active site of FKBP12 is illustrated in Fig. 3c for L2 and 5 Androstan-3α-ol and shown for all ligands in Supplementary Fig. 2. These surfaces are strikingly similar to the FK506-derived electron densities (observed for L2 in Fig. 3a and for all ligands in Supplementary Fig. 2). The surfaces also overlap very well with the correctly predicted binding pose, shown in cyan stick representation. The similarity of the predicted free energy surfaces and experimental electron densities demonstrates the utility of our approach for ascribing atomistic detail to complement, or use in place of, difficult crystallography experiments. We see that similar interactions are maintained with FKBP12 in the binding poses for L2 and the blindly predicted 5 Androstan-3α-ol in Fig. 3c. Four of these interactions, hydrophobic contacts with F36, I91 and F99, as well as a hydrogen bond to I56, are experimentally validated binding residues for this steroidal class19.

In addition to prediction of the lowest free energy binding pose, the MSM free energy map can be contoured to identify other druggable sites on the protein and guide structure-activity relationship experiments. Low free energy surfaces extend into regions beyond the top population binding pose, revealing un-utilized but favorable regions of the binding site that could be targeted with new chemical moieties or incorporated in the design of novel ligand scaffolds. Our results directly support this application. The low free energy surface of the MSM 3-D free energy map, illustrated in Supplementary Fig. 3 for L2, allows identification of the region occupied by ring groups in L6, L9 and FK506, which contributes an approximately two orders of magnitude increase in binding affinity. Scaling the free energy surface to high free energy surfaces reveals lower probability sites, which were found in this study to be nonspecific binding sites, labeled A, B and C in Supplementary Fig. 4 and are visited by all ligands in the simulations. In other proteins, these surfaces could identify druggable cryptic or allosteric binding sites.

Characterization of binding pathways using kinetic information from the MSM

We also gain unbiased kinetic information about the binding process from the MSM. Calculated association rates for the ligands, listed in Supplementary Table 3, compare favorably, to within the expected order of magnitude, with the available experimental information for FKBP12 kinetics. Using transition path theory20,21, we find high probability binding pathways, which provide an atomistic description of predominant encounter complexes and ligand transition states. The FKBP12 ligand binding pathways are characterized by the encounter complex in Supplementary Table 4, with many pathways proceeding through the nonspecific sites A, B and C, as well as through metastable conformations near the C- and N-terminal portions of the 80's loop. This loop region is labeled with a star in Fig. 3c and has been experimentally shown to mediate binding of FK506 as well as other FKBP signaling partners22. We also characterize intermediate ligand transition states; the FK506-derived ligands form a common flipped state, shown for L2 and L6 in Supplementary Fig. 5. Key binding residues I56 and Y82 participate in a hydrogen bond exchange with ligand carbonyl groups to convert the ligand to the fully bound state. This transition requires rotation about the proline-mimetic ϕ and ψ angles of the ligands and corresponds to the FKBP12 enzymatic conversion of proline from a trans to cis conformation23. Our analysis of the FKBP12 ligand binding pathways has biochemical validation and allows testable predictions for optimizing binding kinetics via ligand changes, in the case of drug design, or protein mutations, in the case of protein engineering.

Discussion

Altogether, we provide a well-defined protocol for incorporating receptor and ligand flexibility into a mechanistic or drug design study, particularly when structural information about the system of interest may be difficult to obtain or incomplete. Until now, ligand binding events have only been captured with very long timescale simulations. Our MSMs aggregate information from simulations on diverse computing architectures, which is ideal for data generated from growing cloud computing resources. As computing power advances, we can more quickly and easily create accurate models of biological interactions and produce extensive datasets that inform predictions. Our approach aims to improve efficiency of the generation of “Big Data” on protein-ligand dynamics and to create a human-readable view of ligand binding from data analysis. The goal is to increase the interface between theory and experiment, for understanding biological mechanisms and improving efficiency of structure-based drug design.

Methods

Building the initial protein-ligand ensembles

To create predicted near-bound starting states, small molecule docking was performed using the program Surflex24 and the crystal structure of FK506-bound FKBP1225 to identify and target the FKBP active site. Information about the binding poses of the specific ligands in this study was not used. The option pgeomx was employed to dock each protonated (pH = 7) ligand to a protomol, which is the inverse 3-D representation of the binding site generated by Surflex from the FK506-bound FKBP12 structure. The top 20 binding poses from the docking were used as initial predicted bound states. Near bound and unbound states were generated by translating these poses in a 20 Å grid around the binding site using the program VMD, removing poses that produced steric clashes with the protein when preparing the complex with AMBER12 leap (see below). These configurations were used to initialize ≈ 50 independent trajectories on local Stanford clusters (Supplementary Table 1). Further simulations were launched according to an adaptive sampling scheme, described below.

Setup and performance of MD simulations

Molecular dynamics (MD) simulations were performed using GROMACS 4.526 on CPU resources from both local Stanford clusters and the Folding@home distributed computing platform. The protein-ligand complexes were set up in AMBER12 leap using the AMBER99SB-ILDN parameters27 and ligands were parametrized as in a previous work28 using the Generalized Amber Force Field (GAFF)29. All parameters were ported to GROMACS with the program acpype30. The complexes were solvated using GROMACS tools in a triclinic solvent box with dimensions 68 × 68 × 48 Å³ with ≈ 7000 TIP3P31 water molecules such that water extended at least 10 Å away from the surface of the protein. Four chloride ions were added to neutralize the charge. For performing simulations, GROMACS input files were edited such that covalent bonds involving hydrogen atoms were constrained with LINCS32, and particle mesh Ewald33 with cubic interpolation and a 1.2 Å grid spacing for the Fast Fourier Transform was used to treat long-range electrostatic interactions. The neighbor list was updated with a grid search; van der Waals interactions were treated with a switching function and a cutoff of 9 Å, and the short-range neighbor list and electrostatic cutoffs were set to 10 Å. The starting structures were initially minimized for 50,000 steps with steepest descent, equilibrated for 200 ps in the NVT ensemble and then equilibrated for 5 ns in the NPT ensemble. Production simulations were run in the NPT ensemble at 300 K with a Nose-Hoover thermostat34 and at a constant pressure of 1 atm, with semi-isotropic coupling to a Parrinello-Rahman barostat35 with a time constant of 5 ps and a compressibility of 4.5 × 10⁻⁵ bar⁻¹. Periodic boundary conditions were used for all simulations and randomized starting velocities to initialize the simulations were assigned from a Maxwell-Boltzmann distribution.

Building Markov state models

All MSMs were built by first clustering the simulation data, at an interval of 10 ns in the final MSM down to 100 ps intervals for MSMs with less data (Supplementary Table 5). We used a k-centers clustering algorithm, followed by the hybrid k-medoids algorithm in MSMBuilder236, with the metric of root mean square deviation (RMSD) of protein backbone aligned ligand coordinates and a cutoff of 3 Å for cluster similarity. The simulation data were then assigned to these clusters and, in MSMBuilder2, used to construct a transition count matrix $C_{ij}$, the number of observed transitions from state i at time t to state j at time t + τ, where τ is the lag time of the model, and the corresponding transition probability matrix $P_{ij}$, the probability of transitioning from state i at time t to state j at time t + τ. The Markov lag time τ is the smallest time interval in which the data can be demonstrated as Markovian and was determined by plotting rates k from eigenvalues μ of the transition probability matrix at varied lag times τ as $k = -\ln \mu(\tau)/\tau$. This equation comes from the equivalence between discrete time MSMs and the continuous time master equation37,38,39. These rates, whose inverses $-\tau/\ln \mu(\tau)$ are called implied timescales, should be unchanged with lag time when a system is Markovian, to satisfy the Chapman-Kolmogorov test40, and were monitored for all models. If a model for a particular aggregate simulation time did not indicate this behavior, it was discarded and more simulation data was added. The implied timescales for the final MSM are shown in Supplementary Fig. 6. Lag times varied among the datasets and are listed in Supplementary Table 6, with τ = 10 ns used for all ligands in the final MSM.
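The construction of a transition count matrix, a transition probability matrix and implied timescales can be sketched as follows. This is a minimal illustration, not the MSMBuilder2 code: the function names and the toy state trajectory are hypothetical, and the transition matrix is estimated by simple count symmetrization rather than by the estimator used in the study. Implied timescales are computed as $-\tau/\ln \mu(\tau)$.

```python
import numpy as np

def count_matrix(dtrajs, n_states, lag):
    """C[i, j] = number of observed transitions i -> j separated by `lag` frames."""
    C = np.zeros((n_states, n_states))
    for d in dtrajs:
        d = np.asarray(d)
        # np.add.at accumulates correctly when index pairs repeat
        np.add.at(C, (d[:-lag], d[lag:]), 1)
    return C

def transition_matrix(C):
    """Row-normalized transition probabilities from symmetrized counts.
    Symmetrization is a simple way to enforce detailed balance (assumption);
    every state must have at least one count."""
    Cs = 0.5 * (C + C.T)
    return Cs / Cs.sum(axis=1, keepdims=True)

def implied_timescales(T, lag, k=1):
    """The k slowest implied timescales, -lag / ln(mu), in units of `lag`.
    Assumes the selected eigenvalues are positive."""
    evals = np.sort(np.linalg.eigvals(T).real)[::-1]
    mu = evals[1:k + 1]                  # skip the stationary eigenvalue (1)
    return -lag / np.log(mu)

# Toy trajectory of cluster assignments (hypothetical data).
dtrajs = [[0, 0, 0, 1, 1, 1, 0, 0, 1, 1]]
C = count_matrix(dtrajs, n_states=2, lag=1)
T = transition_matrix(C)
its = implied_timescales(T, lag=1, k=1)
print(C, T, its, sep="\n")
```

Repeating the last three calls for a range of lag values and plotting `its` against the lag would give the kind of implied timescale plot used to choose τ.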

Adaptive Sampling

Adaptive sampling was performed first on local computing clusters at Stanford and in subsequent rounds on the distributed computing network Folding@home18. New rounds were launched after monitoring the convergence plots in Supplementary Fig. 1 and are detailed in Supplementary Table 1, along with the total binding and unbinding events observed for the final dataset. Convergence of the RMSD of the top populated ligand MSM state with respect to a reference structure, evaluated at 10 μs intervals of aggregate simulation time, was not seen until the third round of adaptive sampling. Experimentally derived reference structures were available for the FK506-derived ligands, but distance to binding residues can be used for ligands without experimental information. The reference structure corresponds to the available crystal structure for L9 and, for L2, L3 and L6, is generated by overlap and minimization of the common scaffolds with the L9 and FK506 structures, as done previously for accurate binding free energy predictions for these ligands28,41.
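Convergence monitoring of this kind can be sketched as a rolling mean and standard deviation over 2 consecutive data points, as in the convergence plots. The RMSD series, the standard deviation cutoff and the number of trailing windows checked are hypothetical choices made for illustration; only the 2-point window and the 3 Å RMSD level come from the text.

```python
import numpy as np

def rolling_mean_std(values, window=2):
    """Rolling mean and (population) standard deviation over `window` points."""
    v = np.asarray(values, dtype=float)
    n = len(v) - window + 1
    means = np.array([v[i:i + window].mean() for i in range(n)])
    stds = np.array([v[i:i + window].std() for i in range(n)])
    return means, stds

def is_converged(values, window=2, rmsd_cutoff=3.0, std_cutoff=0.5, n_last=3):
    """True if the last `n_last` rolling windows are below both cutoffs.
    std_cutoff and n_last are illustrative assumptions, not values from the study."""
    means, stds = rolling_mean_std(values, window)
    if len(means) < n_last:
        return False
    return bool(np.all(means[-n_last:] < rmsd_cutoff)
                and np.all(stds[-n_last:] < std_cutoff))

# Hypothetical RMSD (Å) of the top populated state, one value per ~10 μs MSM.
rmsd = [9.0, 7.5, 8.0, 4.0, 2.5, 2.0, 1.8, 1.9]
means, stds = rolling_mean_std(rmsd)
print(is_converged(rmsd[:5]), is_converged(rmsd))
```

A check like this could inform the decision to launch another adaptive sampling round, with the early part of the series flagged as unconverged and the full series as converged.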

Mapping the Markov model derived free energies to 3-D space

The final ligand MSM equilibrium state probabilities were mapped to a 1 Å grid space centered on the FKBP protein using a conditional probability assignment for each 1 Å³ grid cell $c_i$. Occupancy of each cubic cell of the grid by the ligand in a Markov state $m_j$ is evaluated, with an occupancy of 1 assigned to the cell if any heavy atom of the ligand is found in the cell, given its conformation relative to the protein in the Markov state $m_j$. Then, the MSM-derived equilibrium probabilities $P(m_j)$ are mapped onto the occupied cells as $P(c_i) = \sum_j P(c_i \mid m_j)\,P(m_j)$, where $P(c_i \mid m_j)$ is the occupancy (1 or 0) of cell $c_i$ in state $m_j$. The probabilities were converted to free energies as $-k_B T \ln P(c_i)$ and the free energy minimum was set to zero by subtracting the minimal value from all free energy data. This 3-D map was converted to OpenDX format for easy visualization of isosurfaces in VMD42. The code for converting the MSM into the 3-D map is provided at https://github.com/mlawrenz/LigandPMF3D.git and is compatible with MSMBuilder2 output.
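The grid mapping can be sketched in a few lines. This is an independent illustration under stated assumptions, not the LigandPMF3D code: the function name, the coordinates and the populations are hypothetical, each state is represented by a single set of ligand heavy-atom coordinates in the protein-aligned frame, and $k_B T$ is taken as 0.596 kcal/mol at 300 K.

```python
import numpy as np

KB_T = 0.596  # kcal/mol at 300 K

def free_energy_grid(state_coords, populations, origin, shape, spacing=1.0):
    """Map state populations onto grid cells occupied by ligand heavy atoms.

    state_coords: list of (n_atoms, 3) arrays, one per Markov state
    populations:  equilibrium probability P(m_j) of each state
    Returns (P, F): per-cell probability and free energy (minimum set to zero,
    unvisited cells set to +inf).
    """
    P = np.zeros(shape)
    origin = np.asarray(origin, dtype=float)
    for coords, p in zip(state_coords, populations):
        idx = np.floor((np.asarray(coords) - origin) / spacing).astype(int)
        inside = np.all((idx >= 0) & (idx < np.array(shape)), axis=1)
        if not inside.any():
            continue
        # Occupancy is 1 per cell per state, however many atoms fall in it.
        cells = np.unique(idx[inside], axis=0)
        P[cells[:, 0], cells[:, 1], cells[:, 2]] += p
    F = np.full(shape, np.inf)
    occupied = P > 0
    F[occupied] = -KB_T * np.log(P[occupied])
    F[occupied] -= F[occupied].min()
    return P, F

# Two hypothetical states on a 3 x 3 x 3 Å grid.
state_coords = [np.array([[0.5, 0.5, 0.5], [0.7, 0.2, 0.1], [1.5, 0.5, 0.5]]),
                np.array([[1.2, 0.3, 0.9]])]
populations = [0.7, 0.3]
P, F = free_energy_grid(state_coords, populations, origin=(0, 0, 0), shape=(3, 3, 3))
print(P[0, 0, 0], P[1, 0, 0], F[0, 0, 0])
```

Contouring `F` at a chosen level, for example 1.0 kcal/mol, would then give isosurfaces of the kind described for the active site.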

Transition path theory

The implementation of transition path theory20,21 in MSMBuilder2 was used to trace high flux paths from unbound ligand states, defined as ligand positions > 20 Å from any protein surface atom, to bound ligand states, defined as those with ligand RMSD < 3 Å to the predicted crystallographic pose. Characterized pathways were those carrying at least 50% of the flux of the maximum flux pathway. Association rates were computed from the average mean first passage time (MFPT) from unbound to bound states, defined with the same criteria.
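The mean first passage time calculation can be sketched for a small transition matrix. This is a generic linear-algebra formulation, not the MSMBuilder2 routine: the MFPT m_i to a set of bound states satisfies m_i = τ + Σ_j P_ij m_j for every state outside that set, with m_i = 0 inside it. The matrix, state labels and function name are hypothetical, and the conversion from MFPT to an association rate is not shown.

```python
import numpy as np

def mfpt_to_targets(T, targets, lag=1.0):
    """Mean first passage time from every state to the target set.

    Solves (I - T_AA) m_A = lag * 1, where A is the set of non-target states.
    """
    n = T.shape[0]
    targets = np.asarray(targets)
    others = np.setdiff1d(np.arange(n), targets)
    A = np.eye(len(others)) - T[np.ix_(others, others)]
    m = np.zeros(n)                      # MFPT is zero inside the target set
    m[others] = np.linalg.solve(A, lag * np.ones(len(others)))
    return m

# Hypothetical 3-state model: 0 = unbound, 1 = intermediate, 2 = bound.
T = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.90, 0.05],
              [0.01, 0.04, 0.95]])
m = mfpt_to_targets(T, targets=[2], lag=10.0)   # lag time of 10 ns
print(m)   # MFPT in ns from each state to the bound state
```

With several unbound states, the values of `m` for those states would be averaged before being used in a rate estimate.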