Enhanced unbiased sampling of protein dynamics using evolutionary coupling information

One of the major challenges in atomistic simulations of proteins is efficient sampling of pathways associated with rare conformational transitions. Recent developments in statistical methods for computation of direct evolutionary couplings between amino acids within and across polypeptide chains have allowed for inference of native residue contacts, informing accurate prediction of protein folds and multimeric structures. In this study, we assess the use of distances between evolutionarily coupled residues as natural choices for reaction coordinates which can be incorporated into Markov state model-based adaptive sampling schemes and potentially used to predict not only functional conformations but also pathways of conformational change, protein folding, and protein-protein association. We demonstrate the utility of evolutionary couplings in sampling and predicting activation pathways of the β 2-adrenergic receptor (β 2-AR), folding of the FiP35 WW domain, and dimerization of the E. coli molybdopterin synthase subunits. We find that the time required for β 2-AR activation and folding of the WW domain are greatly diminished using evolutionary couplings-guided adaptive sampling. Additionally, we were able to identify putative molybdopterin synthase association pathways and near-crystal structure complexes from protein-protein association simulations.


Brownian particle simulations
The external potential on all simulated two-dimensional brownian particles was chosen to be a sum of n two-dimensional Gaussians of the form: where the parameters used in simulations are listed in Supplementary Table S1. Additionally, a half-cubic potential was added to constrain the particle to the rugged portion of the potential energy landscape as follows: 3 , if |x| > |x i | and sgn(x) = sgn(x i ) 0, otherwise v j (y) = k y, j 3 (y + y j ) 3 , if |y| > |y i | and sgn(y) = sgn(y j ) 0, otherwise (2) where parameters used in simulations are listed in Supplementary Table S2.

MoaD-MoaE MSM construction
Trajectories from simulation of MoaD-MoaE dimerization were featurized by their centers of mass using the MDTraj Python package. 1 For each frame, the MoaD center of mass vector was subtracted from the MoaE center of mass vector and a set of normalized basis vectors were defined from MoaD atom coordinates as follows: where v F75Cα (n) is the position of the F75 Cα atom in the n th frame andx(n),ŷ(n),ẑ(n) are orthonormal basis vectors spanning IR 3 in the n th frame. MoaD was chosen to define the coordinate system due to the relatively narrow and left-shifted distribution of its RMSD with respect to its crystal structure. The center of mass trajectories were then projected onto the MoaD basis sets and the new trajectories were clustered.

λ -repressor MSM construction and kinetic Monte Carlo
We featurized previously published MD simulations 2 by the distribution of reciprocal interatomic distances (DRID), 3 performed time-lagged independent components analysis (tICA), 4, 5 projected the DRID coordinates onto the 4 slowest decorrelating tICs, and created a 300 state MSM with a lag time of 3000 ns. The number of tICs to project onto and the number of MSM states were chosen so as to maximize the variational cross-validation score calculated using Osprey. 6,7 Evolutionary couplings guided adaptive sampling (ECAS) was performed in the same way as for β 2 -AR and the FiP35 WW domain. 270 evolutionarily coupled residue pairs (the number of pairs with coupling scores over an arbitrarily chosen cutoff score of 0.012, where the number of couplings was rounded) were used.
Testing the effects of multiple sequence alignment size and number of evolutionarily coupled residue pairs used on ECAS performance For sampling on β 2 -AR and the FiP35 WW domain, we took the original multiple sequence alignments (MSAs) returned by the EVCouplings web server 8 and randomly chose 20%, 40%, 60%, and 80% of the sequences (along with the target sequence) to use for calculation of evolutionary couplings, in order to test the effects of progressively poorer evolutionary coupling quality, and therefore the choice of residue pairs for adaptive sampling, on sampling efficacy. ECAS sampling was done with each resulting set of evolutionarily coupled residue pairs chosen according to the poorer quality couplings. 800 residue pairs were used for β 2 -AR and 70 pairs were used for the WW domain. The sizes of MSAs used are listed in Supplementary Table S3. The number of evolutionarily coupled residue pairs was chosen by taking all residue pairs with coupling scores above an arbitrarily chosen cutoff of 0.01, amounting to 800 pairs for β 2 -AR, 70 pairs for the FiP35 WW domain, and 270 pairs for the λ -repressor. We then tested ECAS with a range of the top evolutionarily coupled residue pairs (50, 400, 800, 1200, and 1600 for β 2 -AR, 10, 30, 50, 70, 90, 110, and 272 for WW domain, and 230, 250, 270, 290, and 310 for the λ -repressor) to determine the effects of residue pair numbers on sampling performance.

Biasing of simulations through adaptive sampling
Our method employs adaptive seeding of new trajectories without any restraints; the protein dynamics is entirely produced by the native forcefield of the protein and random forces from the solvent or from use of a Langevin integrator. Adaptive sampling, as used here, periodically resets the simulation to a structure fulfilling some condition after a deterministic, constant waiting time. If we consider κ to be the number of simulation steps between resetting events, in the limit κ → ∞ we recover unbiased sampling. Conversely, for κ = 1 we realize a method reminiscent of Monte Carlo simulation, where a single step of simulation is performed and the starting structure of the next round of sampling is chosen as follows: if the change in the metric guiding sampling has the desired sign, select the newly generated conformation. Otherwise, start the next simulation from the same structure as before. In the κ ∼ 1 regime, adaptive sampling imposes a near monotonically change of the chosen metric towards the desired value, very possibly yielding low probability pathways. However, for κ >> 1, resetting is far less dominant and should provide a comparatively gentle bias towards the desired values of the adaptive sampling metric. As κ → ∞, the disparities in path probabilities estimated from adaptive sampling and unbiased sampling will vanish with sufficient sampling.
As κ is clearly a finite number in any implementation of our method, the probability distribution over protein conformations will be biased by resetting. However, by building an MSM, one can effectively remove the bias introduced by resetting. This is possible because the only MSM parameters estimated from simulations are conditional transition probabilities between states, so artificially starting a simulation from a structure within a state does not directly bias the result. Transitions between states are entirely determined by the stochastic dynamics if the Markov property holds for the MSM, and should therefore be faithful to the unbiased dynamics if the system remains in local equilibrium 9,10 . If the transition probabilities are accurate, then because MSMs are irreducible, reversible Markov chains there will be a unique stationary distribution over the states, which is also accurate. If there is high statistical error in some transition probabilities, the mode of sampling can be changed so that trajectories are seeded from states with high-error transition probabilities in order to refine the model 11 . Additionally, after a path for folding or conformational change is found, further sampling can be initiated from states along the path in order to discover potential alternate routes.
There is no guarantee that the path found by ECAS will be the lowest free energy path of folding or conformational change. However, this is also true for unbiased sampling. By using large κ values and constructing Markov state models to debias simulations, we believe that pathways of folding and conformational change found using ECAS should not significantly differ from unbiased molecular dynamics simulations. Our method is not intended to guarantee optimality; rather, it is intended as a useful heuristic to improve sampling without altering the potential energy function. It is also not intended purely as a structure prediction method; there are far more efficient methods of structure prediction using evolutionary couplings 12 . S2/S18 Supplementary Table S1. Parameters for each of the 9 Gaussians used for the external potential in brownian dynamics simulations.   Figure S1. and number of trajectories is the total number of trajectories run for each specific scheme, given by the product of the number of parallel trajectories and the number of sampling rounds. S15/S18 Supplementary Figure S10. Travel time from the inactive to active state of β 2 -AR using evolutionary couplings-guided adaptive sampling with couplings calculated from truncated alingments. Time to the active state in kinetic Monte Carlo simulation on the β 2 -AR MSM using evolutionary couplings calculated from truncated multiple sequence alignments generated from the full alignment returned by the EVCouplings web server. The truncated alignments were generated by randomly choosing sets of sequences from the full alignment of varying size, given by the percent of the number of sequences in the full alignment: a) 20% b) 40% c) 60% d) 80% and e) 100%. Scaled trajectory length is the length of each trajectory in a specific sampling scheme in terms of the model lag time (τ) and number of trajectories is the total number of trajectories run for each specific scheme, given by the product of the number of parallel trajectories and the number of sampling rounds. S16/S18 Supplementary Figure S11. Folding time of the FiP35 WW domain using evolutionary couplings-guided adaptive sampling with couplings calculated from truncated alingments. Folding time in kinetic Monte Carlo simulation on the FiP35 WW domain MSM using evolutionary couplings calculated from truncated multiple sequence alignments generated from the full alignment returned by the EVCouplings web server. The truncated alignments were generated by randomly choosing sets of sequences from the full alignment of varying size, given by the percent of the number of sequences in the full alignment: a) 20% b) 40% c) 60% d) 80% and e) 100%. Scaled trajectory length is the length of each trajectory in a specific sampling scheme in terms of the model lag time (τ) and number of trajectories is the total number of trajectories run for each specific scheme, given by the product of the number of parallel trajectories and the number of sampling rounds. S17/S18 Supplementary Figure S12. Kernel density estimate of the backbone atom RMSD with respect to the crystal structure (PDB ID: 1FM0 15 ) from 427 ns of simulation of the MoaD-MoaE dimer starting from the crystal structure.