Introduction

Protein kinases are regulators of biochemical pathways in eukaryotic cells1,2,3 responsible for initializing and controlling signaling cascades3,4 by catalyzing the transfer of ATP’s gamma phosphate group to target residues on other enzymes. Given their critical involvement in cellular processes, their functionality is tightly controlled through a combination of regulatory domains2,5,6 and post-translational modifications3 that modulate their multi-state behavior1,6,7,8,9. Sequence mutations, truncations, and over-expression of various kinases have been phenotypically linked to various cancers3,4,10 and other diseases11.

Bruton tyrosine kinase (BTK)12,13,14,15, part of the TEC family of kinases, is involved in T-cell and B-cell development. In humans, poor B-cell maturation leads to severe immune deficiencies, including increased susceptibility to bacterial infections16. Therefore, BTK’s catalytic domain is a pharmaceutical target with several inhibitors, including FDA-approved drugs17. Similar to other kinases, BTK’s catalytic domain (Fig. 1a) is bi-lobal with a \(\beta \)-sheet heavy N-lobe and an \(\alpha \)-helical C-lobe. ATP and magnesium ions bind in the active site between the two lobes (Fig. 1a). The activation loop (A-loop, Fig. 1a, red) connects the two lobes and modulates substrate binding. Phosphorylation of Tyr551 at the C-terminal end of the A-loop increases BTK’s activity tenfold15. The N-terminal end of the A-loop contains the highly conserved aspartate-phenylalanine-glycine motif (DFG, Fig. 1b, blue), which samples several pharmacologically relevant states18. The aspartate of the DFG can be protonated9, modulating drug binding19. In the N-lobe, the glycine-rich phosphate-positioning loop (P-loop, Fig. 1a, green) helps to position ATP’s gamma phosphate group. One \(\beta \)-sheet over, the N-lobe contains a conserved lysine (Lys430, Fig. 1b) that hydrogen bonds with ATP’s \(\alpha \)-phosphate group. The N-lobe also has the catalytic helix (C-helix, Fig. 1b, orange), which contains a conserved glutamate residue (Glu445, Fig. 1b) that forms critical salt bridges to Lys430 in the active state and Arg544 in the inactive form.

Figure 1

BTK exists in several thermodynamically stable states. Within the MD ensemble, the BTK catalytic domain (a) samples several states, including active (DFGin/C-helixin) (b), Src-like (A-loop folded/DFGin/C-helixout) (c) and DFGout (d). The transition from active (b) to Src-like (c) is defined by the outward rotation of the C-helix (orange) and folding of the A-loop (red). The C-helix rotation breaks a critical salt bridge between Glu445-Lys430 and forms salt bridges between Glu439-Arg468 (yellow) and Glu445-Arg544 (orange). In the DFGout state, Phe540 (purple), part of the DFG motif (blue), rotates away from the core of the protein towards the ATP binding site. The R-spine (purple surface) forms continuous hydrophobic contacts in the active state but is broken in the other states.

Crystallographic and biochemical studies on BTK13,14,15 and other kinases3,20 have already provided a considerable amount of insight into their thermodynamically accessible enzymatic states. BTK can exist in active14, inactive21, and DFGout21 states. In the putative active state (Fig. 1b), the DFG-aspartate residue moves towards the ATP binding site (DFGin) to chelate magnesium; the C-helix rotates into the protein core (C-helixin)14; and the A-loop is unfolded and transiently samples a \(\beta \)-sheet secondary structure. In the DFGin inactive state, the C-helix rotates outwards (C-helixout), and the A-loop folds into a double helix (Fig. 1c)14. We refer to this double helical inactive state as Src-like due to its topological similarity to the Src kinase’s inactive state22. In the DFGout state, the DFG-Asp rotates towards the core of the protein (Fig. 1d, Supplementary Table 2).

While the crystallographic coordinates for BTK provide us with structural insights, these models do not provide information about unrealized thermodynamically stable states or the pathways connecting them. Molecular dynamics (MD) is a computational modeling technique used to complement experimental work in biophysical systems. MD7,9 provides atomistic insight into complex processes, and has led to the proposal of several new kinase structural intermediates7,23,24,25 and allosteric pathways8.

In this paper, we performed an aggregate of 1.7 milliseconds of MD simulations on the DFG-deprotonated (BTK-ASP) and DFG-protonated (BTK-ASH) forms of the unliganded BTK catalytic domain on the massively distributed Folding@home26 computing platform (see Methods for details regarding the homology modeling). The aggregate simulation times make this study three times longer than the largest reported MD results on kinases7 and three orders of magnitude larger than any computational investigation into BTK’s dynamics12,27. We characterized kinome-wide structural plasticity within the C-helix and DFG motifs, identifying a number of conformations as viable pharmaceutical targets. We used Markov state models28,29 (MSMs) to gain atomistic insight into the thermodynamics and kinetics of BTK’s conformational ensemble, identifying a structurally diverse intermediate state that links the active, Src-like, and DFGout states.

BTK kinase domain samples kinome-wide conformational space

We began our analysis by comparing the structural heterogeneity in the MD BTK-ASP dataset to the kinome-wide PDB classification of Möbitz et al.18 (Fig. 2). In that paper, the authors classified kinase structures along variations of the conserved DFG motif, the C-helix, and the A-loop. Starting from several publicly available BTK protein coordinates (Supplementary Table 3), our simulations capture kinome-wide crystallographically observed states that were previously stabilized via a combination of sequence, drugs, small peptides, and crystallographic conditions.

Figure 2

BTK’s apo domain contains kinome-wide conformational plasticity. Comparison of 9% of MD generated structures (a) against publicly available kinase domain structures (b) projected along three key degrees of freedom as outlined in Möbitz et al.18. We used the data and classification scheme provided in ref. 18 to generate (b). The top y-coordinate tracks the C-helixin to C-helixout transition while the bottom y-coordinate tracks the DFGin to DFGout transition. The common x-axis subdivides the conformations into pharmacologically relevant states of the DFG motif. The white circles in (a) correspond to the starting configurations for the MD simulations. The points are colored according to their Möbitz classification and detailed in Supplementary Fig. 16. For BTK’s free energies along these coordinates, see Supplementary Fig. 9.

For example, our simulations predict several configurations that Möbitz classified as the binding mode of the leukemia drug Imatinib to Abl kinase19,30,31 (Supplementary Fig. 3). In this pose, the DFG-Phe540 residue (Fig. 1d) rotates towards the ATP binding site (Fig. 1a), creating a back pocket capable of accepting an aromatic moiety. It is worth emphasizing that the A-loop (including DFG-Phe540) was not resolved in similar BTK21 structures, creating difficulties for structure-based drug discovery. This result indicates the increasing ability of MD simulations to predict physiologically and pharmacologically relevant positioning of critical structural motifs. Intriguingly, BTK’s plasticity supports the model that all kinases sample a single conformational landscape whose topology is modulated by their sequence and/or chemical environment.

BTK’s apo domain is primarily inactive

To gain insight into the thermodynamics and kinetics of BTK, we built a statistically robust MSM for the hundreds of collected MD trajectories (Supplementary Figs 1 and 2). Markov modeling28 involves Voronoi partitioning of the accessible phase space into states and counting the transitions between the states. The metastable states are defined using a kinetically relevant distance metric (see Methods) that is learnt via sparse time structure-based independent component analysis (sparse-tICA)32,33,34,35. tICA finds linear combinations of input MD features that de-correlate the slowest within the given dataset. The dominant components – tICs – relate the slow structural changes to long timescale protein dynamics. After performing this dimensionality reduction, we built an optimized36,37 MSM (see Methods Table 1) whose transition matrix reflects the ensemble equilibrium populations and long timescale processes.

Table 1 Selected hyper parameters for the best model. See the detailed methods in the SI outlining the scoring function used to select the best model.

We projected the BTK-ASP and BTK-ASH MD ensembles onto the dominant tICs (Fig. 3a) and several structural order parameters (Supplementary Figs 9–11). These structural projections help us to understand the relative thermodynamic stability of various states as a function of the chosen order parameter. Our analysis indicates that the two principal tICs correspond to (1) the flipping of the DFG motif (Supplementary Fig. 6) and (2) the unfolding of the A-loop coupled with the rotation of the C-helix to the protein core, respectively (Supplementary Fig. 7). The free energy surface (Fig. 3a) lets us define a four-state model (Fig. 3b, Supplementary Fig. 8) in which a structurally heterogeneous intermediate hub (I1, Supplementary Fig. 5) controls access to active, Src-like, and DFGout states. This intermediate contains a partially or fully unfolded A-loop, DFGin, and C-helixout. Our simulations predict the Src-like state to be the most stable state of BTK-ASP (Fig. 3a) with the active and DFGout states both being 1-2 kcal/mol above the minimum energy observed. Within the BTK-ASP model, these states (Supplementary Figs 8, 10 and 12) have populations of \({52}_{51}^{53}\)%, \({7}_{6.5}^{7.5}\)%, and \({1}_{0.9}^{1.6}\)% respectively. The sub-script and super-script indicate the 95% confidence interval. See Supplementary Fig. 2 for the complete distribution. The rest of the population exists in the intermediate state. The low active state population lines up with the large number of inactive crystal structures (Fig. 2a, dark magenta region), and biochemical studies showing that BTK’s Tyr551’s phosphorylation15 is required for full activation. The number of minor and major populated states, and their micro to millisecond exchange timescales, are also consistent with NMR studies on other kinases. For example, the DFG-Phe in apo-p38α38 is unobserved in NMR due to line broadening, attributed to DFG-flip conformational exchange. 
Similarly, dual phosphorylation on ERK239 shifts the equilibrium to its active state by about 3 kcal/mol while ligand binding to protein kinase A38 induces slow inter-domain motion. This suggests that the solvated kinase catalytic domain samples a diffuse free energy landscape whose topology can be modulated by a number of factors. MSMs present a natural framework to handle these perturbations, and we next focus on the effects of one of them namely protonation of the DFG-Asp539.
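As a rough illustration of how the quoted populations relate to free energy differences, the sketch below converts the BTK-ASP macrostate populations into relative free energies using \(\Delta G_i = k_BT\,\ln(p_{ref}/p_i)\). The temperature of 300 K and the function name are assumptions made for this example and are not taken from the simulations. Macrostate populations integrate over whole basins, so these values need not coincide with the minima read off the free energy surface in Fig. 3a.

```python
import math

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def relative_free_energies(populations, temperature_k=300.0):
    """Return kT*ln(p_max / p_i) for each state, in kcal/mol.

    temperature_k is an assumed value, not one reported in the text.
    """
    kt = KB_KCAL_PER_MOL_K * temperature_k
    p_ref = max(populations.values())
    return {name: kt * math.log(p_ref / p) for name, p in populations.items()}

# BTK-ASP macrostate populations quoted in the text (central estimates only);
# the intermediate state holds the remainder of the population.
btk_asp = {"Src-like": 0.52, "active": 0.07, "DFGout": 0.01}
btk_asp["intermediate"] = 1.0 - sum(btk_asp.values())

dg = relative_free_energies(btk_asp)
for state, value in dg.items():
    print(f"{state:12s} {value:5.2f} kcal/mol")
```

With these inputs the active and DFGout macrostates come out roughly 1.2 and 2.4 kcal/mol above the Src-like state.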

Figure 3

MSMs predict a multistate ensemble whose populations are modulated via DFG protonation. Thermodynamics of the BTK-ASP and BTK-ASH ensembles projected along the two dominant tICs (a) show a stable Src-like state. For standard errors along each coordinate, see Supplementary Fig. 12. Simple four state cartoon model (b) of the kinase dynamics. Kinetics of several molecular switches as a function of time along a MSM trajectory for BTK-ASP (c). The MSM trajectory was generated using a Monte Carlo algorithm to simulate a trajectory of 800 μs from the Markovian transition matrix. At each step, we randomly selected a simulation structure assigned to that state to report the instantaneous observables. The root mean squared deviation (RMSD) of the A-loop is calculated using the heavy atoms of residue Asp539-Phe559. For A-loop RMSD to the extended state, see Supplementary Fig. 17. We used the delta carbon of the Glu439 and zeta carbon of Arg468, and the delta carbon of Glu445 and zeta nitrogen of catalytic Lys430 to calculate the distances in the next two panels to quantify C-helix in to out transition. We used Thr410-Val415, and Phe540, Met449, His519, and Leu460 heavy atoms to quantify the P-loop, and R-spine RMSD. The R-spine is only completely formed when the C-helix (orange trace) is rotated inwards. The DFG RMSD is calculated using heavy atoms from Asp539-Gly541. For all RMSD calculations, we used a double helical inactive state as the reference state. The lighter color traces give the instantaneous value for the observable and the dark traces provide moving averages across 10 frames. The color corresponds to the color scheme used in Fig. 1 to highlight structural motifs in BTK.

Based upon pKa calculations40 (Supplementary Note 2), and past MD studies on EGFR, Src, and Abl kinases9,19,41, we next selectively protonated DFG-Asp539 (BTK-ASH) to quantify its effects on thermodynamics and kinetics. Our models predict that the DFGout state is stabilized by approximately 1 kcal/mol (relative to BTK-ASP’s DFGout) upon protonating the aspartate (Fig. 3a). Compared to BTK-ASP, the increase in DFGout population comes from the reduced free energy cost19 of putting a neutral protonated DFG-Asp539 in a hydrophobic environment (Fig. 1d, Supplementary Fig. 19). Within the BTK-ASH simulation set, we find that the Src-like, active, and DFGout states have populations of \({47}_{46}^{48}\)%, \({6}_{5.9}^{6.6}\)%, and \({9}_{7.5}^{10}\)%, respectively. Furthermore, we observed that protonation accelerates the DFG flip. To quantify this effect, we calculated the median value for the mean first passage time (MFPT)42,43 between DFGin and DFGout states (Supplementary Figs 2e and 8). Starting from the DFGin states, the median value for the MFPT to the DFGout state reduces from ~\({1.2}_{0.7}^{1.8}\) ms in the BTK-ASP ensemble to ~\({300}_{180}^{400}\,\mu s\) in the protonated BTK-ASH ensemble. The reverse value remains relatively similar (~\({20}_{10}^{40}\,\mu s\) and ~\({30}_{20}^{40}\,\mu s\)). See Supplementary Fig. 2e,f for the full distribution of macrostate populations and median DFG MFPTs across several hundred rounds of bootstrapping.
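To make the MFPT calculation concrete, the following minimal sketch computes mean first passage times from a transition matrix by solving the standard linear system \(m_i = \tau + \sum_{j \notin A} T_{ij} m_j\) for all states outside the target set \(A\). The three-state matrix, the state labels, and the function name are hypothetical and chosen only for illustration; they are not the BTK model.

```python
import numpy as np

def mfpt_to_targets(T, targets, lag_time):
    """Mean first passage time from every state into the `targets` set.

    Solves (I - T_QQ) m_Q = lag_time * 1, where Q are the non-target states.
    The result carries the units of `lag_time`; target states get 0.
    """
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    target_set = set(targets)
    others = [i for i in range(n) if i not in target_set]
    A = np.eye(len(others)) - T[np.ix_(others, others)]
    m = np.zeros(n)
    m[others] = np.linalg.solve(A, np.full(len(others), lag_time))
    return m

# Hypothetical 3-state model: 0 and 1 stand in for DFGin-like states,
# 2 stands in for a DFGout-like state. Rows sum to one.
toy_T = [[0.90, 0.09, 0.01],
         [0.10, 0.88, 0.02],
         [0.05, 0.05, 0.90]]
mfpt = mfpt_to_targets(toy_T, targets=[2], lag_time=80.0)  # lag time in ns
print(mfpt)
```

For a two-state matrix the same routine reduces to the familiar result that the MFPT equals the lag time divided by the one-step transition probability.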

To understand BTK-ASP’s kinetics, we sampled multiple long trajectories from the BTK-ASP microstate Markovian transition matrix using a kinetic Monte Carlo algorithm (see Methods). The stochastic algorithm allows us to stitch the shorter trajectories (hundreds of nanoseconds) into a longer “mock” trajectory that would be otherwise inaccessible by traditional sampling. A representative 800 \(\mu s\) trajectory projected along several of the dynamical regions within the kinase domain is illustrated in Fig. 3c. The model predicts that the A-loop is quite flexible, sampling conformations spanning RMSDs on the order of tens of angstroms (Supplementary Fig. 17). Such structural heterogeneity within the A-loop has been widely reported in kinase crystal structures6 and MD models of apo9 and ATP-bound7 kinases. We find that these transitions are coupled to the outward rotation of the C-helix and flipping of the DFG motif. The outward transition of the C-helix also breaks the regulatory spine (R-spine)1,44, which consists of the four conserved residues Met449 (part of the C-helix), Leu460, Phe540 (part of the conserved DFG motif) and His519 (part of the conserved HRD motif). Over the course of the simulation, the R-spine samples three distinct macrostates (Fig. 3c, purple trace) corresponding to the Src-like, active, and DFGout states. In the active state, the R-spine forms a rigid, continuous hydrophobic surface stabilized via multiple van der Waals interactions. These interactions are broken in the inactive states (Supplementary Table 2), where Met449 moves out of the core of the protein and Phe540 adopts a range of configurations (Fig. 1).
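The stitching procedure described above can be sketched as a simple random walk over the rows of a transition matrix. The example below is a minimal illustration, assuming a small hypothetical three-state matrix and a hypothetical function name; only the frame count (10,000) and the 80 ns lag time are taken from the Methods, which together give the 800 \(\mu s\) trajectory length.

```python
import random

def sample_msm_trajectory(T, start, n_frames, seed=0):
    """Draw a state sequence of length n_frames from row-stochastic matrix T."""
    rng = random.Random(seed)
    states = list(range(len(T)))
    traj = [start]
    for _ in range(n_frames - 1):
        current = traj[-1]
        # The next state is drawn with the probabilities in the current row.
        traj.append(rng.choices(states, weights=T[current])[0])
    return traj

# Hypothetical 3-state transition matrix (not the BTK microstate model).
toy_T = [[0.95, 0.05, 0.00],
         [0.10, 0.85, 0.05],
         [0.00, 0.20, 0.80]]

LAG_NS = 80        # Markovian lag time from the Methods
N_FRAMES = 10_000  # number of kinetic Monte Carlo frames from the Methods

traj = sample_msm_trajectory(toy_T, start=0, n_frames=N_FRAMES)
total_us = N_FRAMES * LAG_NS / 1000.0  # trajectory length in microseconds
print(len(traj), total_us)
```

In the full procedure, each sampled state would additionally be mapped to a randomly chosen simulation frame assigned to that state, so that structural observables can be reported along the synthetic trajectory.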

Interestingly, our model predicts that the kinase deactivation to a Src-like state (A-loop folded, DFGin, & C-helixout, Fig. 3c between the 350–450 \(\mu s\) mark) quenches motions in the dynamic glycine-rich phosphate positioning loop (P-loop, Fig. 3c green trace). The P-loop samples open and closed configurations in the active and DFGout states. However, the P-loop closes when the A-loop folds. Our model indicates this rigidity is due to both the formation of a new backbone hydrogen bond (Supplementary Fig. 14) between Lys433 and Phe413 and favorable π-stacking (Supplementary Fig. 14) between Leu542 and Phe413.

Deactivation proceeds through an intermediate state

We now turn to the structural changes involved in the two dominant apo BTK transitions, namely active (C-helixin/ DFGin) to Src-like (A-loop folded/C-helixout/ DFGin) and DFGout to DFGin. Like most kinases44,45,46, BTK’s active state features the C-helix rotated towards the core of the protein, enabling the formation of a critical7,47 salt bridge between the Glu445 and Lys430 (Fig. 4a). The A-loop is extended and transiently forms a beta-sheet.

Figure 4

BTK’s deactivation proceeds via an intermediate state. Starting from the active state (a), the C-helix swings out to form an intermediate (b) characterized by a disordered A-loop and a stable Arg468-Glu439 salt bridge. The activation loop then folds into a Src-like double helical inactive state (c). The double helical state is stabilized by a secondary salt bridge between the catalytic Glu445 and Arg544. The P-loop has been omitted in all three panels for the sake of clarity. The heat map (d) shows the projection of the centroids of these states onto our free energy landscape. Panel (d) has been reproduced from Fig. 3 for clarity.

Deactivation to a Src-like state exposes an allosteric binding pocket. In both the BTK-ASP and BTK-ASH ensembles, deactivation to a Src-like state follows a two-step7,9,19,24,25 process. In the first step, the C-helix rotates out of the core of the protein (Supplementary Fig. 13) to a metastable intermediate conformation (Fig. 4b). In several of our trajectories the C-helix rotation was preceded by a backbone shift at the N-terminus region of the A-loop. This intermediate is stabilized by a salt bridge between Arg468 and Glu439. The A-loop is still relatively unstructured but transiently samples partially helical states (Supplementary Fig. 5). The outward rotation of the C-helix to the catalytically inactive intermediate state opens an allosteric pocket (Supplementary Fig. 4)7, which can potentially be used to design selective BTK inhibitors. Supplementary Movie 1 contains an example of one of our BTK-ASP trajectories that spontaneously goes from the active to the intermediate.

Starting from the intermediate state, the A-loop folds into a double helix. This state forms a deep free energy basin in our MSM (Fig. 4c,d). The inactive state is stabilized by the presence of two complementary salt bridges (Glu439-Arg468 and Glu445-Arg5445,7,8, Fig. 4c). Within several milliseconds of aggregate sampling, we do not observe a deactivation event where the A-loop folds prior to the outward rotation of the C-helix, suggesting that this pathway is thermodynamically inaccessible.

DFG flip occurs via the C-lobe

We observed several partial and complete DFG transitions for the BTK-ASP and BTK-ASH ensembles in trajectories that lasted hundreds of nanoseconds to a few microseconds (Supplementary Movies 2–4). For both ensembles, our MSMs inferred the equilibrium populations and kinetics using all of the trajectories, regardless of whether they contained a crossover event (Fig. 3 and Fig. 5d).

Figure 5

BTK DFG flips via the C-lobe, and proceeds after the formation of a helical intermediate state (a). Snapshots (b) going from red to white to blue, from the DFGout to DFGin trajectory showing the transient outward rotation of Met449 and Phe517 for the DFG flip. The DFGout to DFGin cross-over (c, panel 1) is preceded by the folding of residues Ser543-Leu547 (c, panel 2) and transient outward rotation of both Met449 and Phe517 (c, panel 3). Projection of the 3 selected frames from (b) onto the top two tICs (d) gives us the approximate free energies of the DFGout, intermediate and DFGin states.

Figure 5 shows the details of one of the transition trajectories from the BTK-ASH ensemble starting from the 3OCT crystal structure with an unfolded A-loop and DFGout state (Supplementary Movie 2). Within the trajectory, the A-loop first folds into the ATP binding site, forming a helical intermediate (Fig. 5c, Supplementary Fig. 18). In this intermediate, residues Leu542 to Leu547 form a helical turn that folds into the kinase’s core, though the rest of the A-loop remains mobile. This intermediate has been previously reported9,19 for the EGFR kinase. Furthermore, Kuglstatter et al.21 reported a DFGin crystal structure for BTK where the A-loop folds into the ATP binding site, demonstrating its [meta]stability. Within our simulation, the BTK-ASH molecule samples this intermediate until the DFG-Phe rotates towards the core of the protein (Fig. 5b, white), coming on to the same side as the DFG-Asp. In the last step, the DFG-Asp moves into the ATP binding site, completing the crossover transition.

While no unbiased MD simulation results exist for the DFG flip for either BTK-ASP or BTK-ASH molecule, the pathway presented here diverges from the previously reported DFG-flip pathways for Src48, Abl19,48, and EGFR9 kinases in two aspects. In the previous simulations, the DFG-Phe residue flips by moving across the N-lobe of the kinase. Within our model, the phenylalanine residue exclusively moves via the C-lobe of the kinase while the aspartate moves via the N-lobe (Supplementary Fig. 15). This sequence is conserved in both the full and partial trajectories (Supplementary Movies 2–4) that transition from the DFGin state to the DFGout state and vice versa. To our knowledge, this mechanism has never been proposed, though that is likely due to the computational difficulties of simulating the DFG flip transition. It is worth noting that our distributed computing approach26 allowed us to sample the DFG flip in an unbiased fashion using commodity GPUs, and our MSM was able to capture the DFG transition as the slowest mode within our tICA model. Secondly, while we do observe the presence of a helical intermediate state in several of the transitions, we also observed a DFGout to DFGin transition for the BTK-ASP molecule in which the A-loop remains unstructured (Supplementary Movie 3). The litany of flipping pathways follows from the inherent stochasticity of molecular conformational change and emphasizes the need for extensive sampling and robust statistical modeling.

Three residues sterically and chemically hinder the DFG transition. The conserved catalytic Lys430 hydrogen bonds with the DFG-Asp539 while the conserved Met449 sterically prevents rotation of the DFG-Asp539 towards the ATP binding site. On the other side, Phe517 (Fig. 5b-c) hinders the rotation of the DFG-Phe540 towards the protein core. Previous MD studies9,19 of the DFG flip observed spontaneous DFG transitions upon in-silico mutations of Met449, or its equivalent residue, and protonation of the DFG-Asp. As previously proposed, our data supports that protonation of the DFG-Asp can increase the likelihood of a spontaneous DFG flip. Our comparison of BTK-ASH to BTK-ASP showed that the DFGout state is stabilized by 1 kcal/mol upon DFG-Asp539 protonation. However, the combined effect of both DFG protonation and Met449 mutation remains to be observed. Lastly, while our simulations showed that the collapse of the folded A-loop into the kinase core predominantly precedes the DFG flip, it is not necessarily required, highlighting the ensemble nature of the pathway.

To summarize, the present results offer a detailed atomistic description of the thermodynamics and kinetics of the protonated and deprotonated forms of the apo BTK catalytic domain. Our model predicts that the apo kinase domain samples a range of conformational states that are yet to be crystallized but for which equivalent structures from other kinase domains exist. We complete structural modeling of a DFGout binding pocket for BTK, which could potentially be used to design a new class of BTK inhibitors. Furthermore, our model indicates that a structurally diverse intermediate state connects the active, Src-like, and DFGout states. For the first time, our results provide estimates for the equilibrium populations of all three dominant kinase states within a single model.

While we have chosen to separately analyze the BTK-ASP and BTK-ASH ensembles, the BTK ensemble in solution is a combination of both modulated by the DFG-Aspartate’s pKa in each microstate. Modeling this ensemble coupling would ideally require the use of constant pH simulations but can also be done post-hoc by analytical mixing of the parameterized transition matrices. This would entail running short constant pH (or QM/MM49,50) simulations to link the microstates, and is the subject of future research. Perhaps more interestingly, our BTK-ASP and BTK-ASH MSMs use a single set of state definitions, allowing us to explicitly compare relative free energies of differing kinase states. This can be extended to understanding the thermodynamic and kinetic effects of small molecule binding, mutations, regulatory domains, and post-translational modifications. Given enough computational resources, it is theoretically possible to model all 602 known BTK mutations in the X-linked agammaglobulinemia database16. Such detailed atomistic characterization could be used to design the next generation of personalized and specific kinase inhibitors while increasing our understanding of the fundamental interplay between sequence and function.

Methods

The Supplementary information contains a more detailed methods section.

Simulation setup

Briefly, we downloaded 23 publicly available BTK structures from the Protein Data Bank51, and used Modeller52 with default parameters to mutate all the sequences to the human sequence for both BTK-ASP and BTK-ASH ensembles. We only kept the protein coordinates and removed all ligands (21 of the 23 structures had a co-crystal ligand). In cases where the P-loop, C-helix or A-loop was unresolved, we modeled them in as an extended chain. The AmberTools suite53,54,55 was used to solvate the protein structures in a water box and add counter ions. The Amber99sb-ildn56 force field was used to model protein dynamics in conjunction with the TIP3P57 water model. The structures were minimized in two steps using Amber and then loaded into OpenMM58 for NPT production runs on Folding@home26. Overall, we generated 1.7 ms of aggregate data for both ensembles.

Markov state model

Building a MSM requires identification of metastable kinetically similar states. This splitting of the phase space is followed by counting the transitions between those states as observed in our trajectories at a Markovian (memory free) lag time. This transition model can be summarized using the following equation:

$${\bf{p}}({\bf{t}}+{\boldsymbol{\tau }})=\,{\bf{p}}({\bf{t}})T(\tau )\,$$
(1)

where \({\bf{p}}({\bf{t}})\) is the probability distribution at time “t” while \({\bf{p}}({\bf{t}}+{\boldsymbol{\tau }})\) is the probability distribution after a Markovian lag time \(\tau \). Spectral decomposition of the MSM transition matrix was used to estimate the equilibrium populations and dynamical processes connecting those Markov states. The relaxation timescales for these dynamical processes can be obtained by using the following transformation on the associated eigenvalue \(\mu \):

$$Relaxation\,timescale=-\,\frac{\tau }{\mathrm{ln}(\mu )}$$
(2)
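Equations 1 and 2 can be illustrated numerically. The sketch below counts transitions at a given lag, row-normalizes them into a transition matrix, propagates a probability vector one lag time forward, and converts eigenvalues into relaxation timescales. It is a simplified stand-in under stated assumptions: reversibility is imposed by naive count symmetrization rather than maximum likelihood estimation, every state is assumed to be visited, and the two-state matrix and function names are hypothetical.

```python
import numpy as np

def estimate_msm(dtrajs, n_states, lag):
    """Count transitions at `lag` frames and row-normalize into T(tau)."""
    counts = np.zeros((n_states, n_states))
    for d in dtrajs:
        for a, b in zip(d[:-lag], d[lag:]):
            counts[a, b] += 1
    # Naive symmetrization to enforce reversibility; a simplification,
    # not a maximum likelihood estimator.
    counts = 0.5 * (counts + counts.T)
    return counts / counts.sum(axis=1, keepdims=True)

def populations_and_timescales(T, lag_time):
    """Stationary distribution and relaxation timescales (Eq. 2) of T."""
    T = np.asarray(T, dtype=float)
    evals, evecs = np.linalg.eig(T.T)      # left eigenvectors of T
    order = np.argsort(-evals.real)
    evals = evals.real[order]
    pi = evecs[:, order[0]].real
    pi = pi / pi.sum()                     # normalize (also fixes the sign)
    slow = evals[1:][evals[1:] > 0]        # Eq. 2 needs 0 < mu < 1
    return pi, -lag_time / np.log(slow)

# Hypothetical two-state transition matrix at an 80 ns lag time.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
p0 = np.array([1.0, 0.0])
p1 = p0 @ T                                # Eq. 1: p(t + tau) = p(t) T(tau)
pi, timescales = populations_and_timescales(T, lag_time=80.0)
print(p1, pi, timescales)
```

The stationary vector returned here satisfies \(\pi T = \pi\), which is the equilibrium population estimate referred to in the text.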

After sampling the MD trajectories using Folding@home, a total of 2,140 trajectories were vectorized using the protein dihedrals and selected closest heavy atom distances. This feature selection led to each frame being represented as a feature vector of length 5,532. We normalized the data and reduced its dimensionality using time structure-based independent component analysis (tICA)32,35. tICA seeks to find a set of linear combinations of features that de-correlate the slowest (at a certain lag time) while minimizing their correlation. This is done by solving the following generalized eigenvalue problem:

$$C(\tau ){\boldsymbol{\nu }}={\boldsymbol{\lambda }}\Sigma {\boldsymbol{\nu }}\,$$
(3)

where \({\boldsymbol{\nu }}\) are the associated eigenvectors (tICs), \({\boldsymbol{\lambda }}\) the eigenvalues, \(\Sigma \) is the covariance matrix

$${\Sigma }_{ij}={\rm{{\rm E}}}\,[{{\bf{X}}}_{{\rm{i}}}(t){{\bf{X}}}_{j}(t)]\,$$
(4)

and C(τ) is the time lagged correlation matrix whose ij element is defined as

$$C{(\tau )}_{ij}={\rm{{\rm E}}}\,[{{\bf{X}}}_{{\rm{i}}}(t){{\bf{X}}}_{j}(t+\tau )]$$
(5)
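As a minimal numerical sketch of equations 3 to 5, the code below estimates \(\Sigma \) and \(C(\tau )\) from a mean-free feature matrix and solves the generalized eigenvalue problem by whitening with the Cholesky factor of \(\Sigma \). The function name and the synthetic two-feature dataset (one slowly relaxing coordinate, one high-variance white-noise coordinate) are assumptions for illustration. This is plain tICA with a symmetrized \(C(\tau )\), not the sparse variant.

```python
import numpy as np

def tica(X, lag, n_components=2):
    """Solve C(tau) v = lambda Sigma v for a (frames x features) array X."""
    X = X - X.mean(axis=0)
    X0, Xt = X[:-lag], X[lag:]
    sigma = X.T @ X / len(X)                 # Eq. 4: covariance matrix
    c_tau = X0.T @ Xt / len(X0)              # Eq. 5: time-lagged correlation
    c_tau = 0.5 * (c_tau + c_tau.T)          # symmetrize (reversibility assumption)
    # Whitening: with Sigma = L L^T, the problem becomes a standard
    # symmetric eigenproblem for L^-1 C L^-T.
    L = np.linalg.cholesky(sigma)
    L_inv = np.linalg.inv(L)
    evals, W = np.linalg.eigh(L_inv @ c_tau @ L_inv.T)
    order = np.argsort(-evals)[:n_components]
    V = L_inv.T @ W[:, order]                # back-transform to feature space
    return evals[order], V

# Synthetic data: feature 0 is a slow AR(1) process, feature 1 is fast
# white noise with larger variance.
rng = np.random.default_rng(0)
n = 5000
slow = np.zeros(n)
for t in range(1, n):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()
fast = 20.0 * rng.normal(size=n)
X = np.column_stack([slow, fast])

evals, V = tica(X, lag=5)
print(evals)
print(V[:, 0])  # leading tIC: weight concentrated on the slow feature
```

The leading component picks out the slow coordinate even though the noise coordinate has the larger variance, which is the property that distinguishes tICA from a variance-based method such as PCA.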

In equations 4 and 5, \({\rm{{\rm E}}}[\ldots ]\) is the average/expectation over the entire ensemble. The aim of equation 3 is to find the slowest/most highly auto-correlated set of coordinates \(({\boldsymbol{\nu }})\) within our dataset at a certain lag time. Here, \(\tau \) is the tICA lag time and can be different from the Markovian lag time. The tICA-transformed dataset was clustered using the K-means algorithm. We then used the cluster-labeled dataset to build an MSM. For all projections, the deprotonated (BTK-ASP) ensemble’s highest populated state was assigned an absolute free energy of 0 kcal/mol and all other free energies were reported relative to that state. Based upon previous work7,59 and the convergence of the implied timescales plot (Supplementary Fig. 1) for 50–500 state models, we chose a Markovian lag time of 80 ns. For all the other hyper-parameters within the model, including the tICA lag time, choice of kinetic mapping, number of tICA components, and number of cluster states, we turned to cross validation36,37. The parameters for the best model are given in Table 1.

After we determined the optimal model given the current amount of sampling, we retrained the model on the entire set of trajectories. For the reported tICA model, we used a sparse variant of tICA34 for increased interpretability (Supplementary Figs 6–7). The Markov transition matrix was fit via maximum likelihood estimation (MLE) with reversibility and ergodicity constraints. To obtain error bars for the equilibrium populations, 200 rounds of bootstrapping were performed over the original set of trajectories. The models were primarily analyzed using techniques laid out in previous papers35,60. To further query the model, we sampled an 800 \({\rm{\mu }}s\) long kinetic Monte Carlo trajectory (10,000 frames at a lagtime of 80 ns) from the Markovian transition matrix.
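The trajectory-level bootstrap used for the error bars can be sketched as follows. Whole trajectories are resampled with replacement, a statistic is recomputed on each resample, and a percentile interval is reported. The statistic used here (the raw fraction of frames in a state) is a simple stand-in for refitting an MSM on each resample, and the toy trajectories and function names are hypothetical.

```python
import random

def nearest_rank(sorted_vals, q):
    """Nearest-rank percentile of a sorted list for q in [0, 1]."""
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[idx]

def bootstrap_ci(trajs, statistic, n_rounds=200, alpha=0.05, seed=0):
    """Percentile confidence interval from resampling whole trajectories."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_rounds):
        # Resample trajectories (not frames) with replacement.
        resample = [rng.choice(trajs) for _ in trajs]
        values.append(statistic(resample))
    values.sort()
    return nearest_rank(values, alpha / 2), nearest_rank(values, 1 - alpha / 2)

def fraction_in_state(trajs, state=0):
    """Stand-in statistic: fraction of all frames assigned to `state`."""
    frames = [s for t in trajs for s in t]
    return frames.count(state) / len(frames)

# Hypothetical discrete trajectories over two states.
toy_trajs = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
lo, hi = bootstrap_ci(toy_trajs, fraction_in_state)
print(lo, hi)
```

Resampling at the trajectory level keeps the time correlation within each trajectory intact, which frame-level resampling would destroy.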

The trajectories were featurized and analyzed using the MDTraj61 package while tICA dimensionality reduction and Markov modeling were performed using MSMBuilder62. Most of the analysis was performed within the IPython/Jupyter scientific environment63 with extensive use of the matplotlib64 and scikit-learn65 libraries. All protein images were generated using visual molecular dynamics (VMD)66, all protein surfaces were rendered using SURF67, and secondary structure was assigned using STRIDE68 as implemented in VMD.

Data Availability

The simulation data and modeling results that support the findings of this study are available from the corresponding author (V.S.P) upon reasonable request.