Learning molecular dynamics with simple language model built upon long short-term memory neural network

Recurrent neural networks have led to breakthroughs in natural language processing and speech recognition. Here we show that recurrent networks, specifically long short-term memory networks can also capture the temporal evolution of chemical/biophysical trajectories. Our character-level language model learns a probabilistic model of 1-dimensional stochastic trajectories generated from higher-dimensional dynamics. The model captures Boltzmann statistics and also reproduces kinetics across a spectrum of timescales. We demonstrate how training the long short-term memory network is equivalent to learning a path entropy, and that its embedding layer, instead of representing contextual meaning of characters, here exhibits a nontrivial connectivity between different metastable states in the underlying physical system. We demonstrate our model’s reliability through different benchmark systems and a force spectroscopy trajectory for multi-state riboswitch. We anticipate that our work represents a stepping stone in the understanding and use of recurrent neural networks for understanding the dynamics of complex stochastic molecular systems.

R ecurrent neural networks (RNN) are a machine learning/ artificial intelligence (AI) technique developed for modeling temporal sequences, with demonstrated successes including but not limited to modeling human languages [1][2][3][4][5][6][7] . A specific and extremely popular instance of RNNs are long shortterm memory (LSTM) 8 neural networks, which possess more flexibility and can be used for challenging tasks such as language modeling, machine translation, and weather forecasting 6,9,10 . LSTMs were developed to alleviate the limitation of previously existing RNN architectures wherein they could not learn information originating from far past in time. This is known as the vanishing gradient problem, a term that captures how the gradient or force experienced by the RNN parameters vanishes as a function of how long ago did the change happen in the underlying data 11,12 . LSTMs deal with this problem by controlling flows of gradients through a so-called gating mechanism where the gates can open or close determined by their values learned for each input. The gradients can now be preserved for longer sequences by deliberately gating out some of the effects. This way it has been shown that LSTMs can accumulate information for a long period of time by allowing the network to dynamically learn to forget aspects of information. Very recently LSTMs have also been shown to have the potential to mimic trajectories produced by experiments or simulations 13 , making accurate predictions about a short time into the future, given access to a large amount of data in the past. Similarly, another RNN variant named reservoir computing 14 has been recently applied to learn and predict chaotic systems 15 . Such a capability is already useful for instance in weather forecasting, where one needs extremely accurate predictions valid for a short period of time. In this work, we consider an alternate and arguably novel use of RNNs, specifically LSTMs, in making predictions that in contrast to previous work 13,15 , are valid for very long periods of time but only in a statistical sense. Unlike domains such as weather forecasting or speech recognition where LSTMs have allowed very accurate predictions albeit valid only for short duration of time, here we are interested in problems from chemical and biological physics, where the emphasis is more on making statistically valid predictions valid for extremely long duration of time. This is typified for example through the use of the ubiquitous notion of rate constant for activated barrier crossing, where short-time movements are typically treated as noise, and are not of interest for being captured through a dynamical model.
Here we suggest an alternative way to use LSTM-based language model to learn a probabilistic model from the time sequence along some low-dimensional order parameters produced by computer simulations or experiments of a highdimensional system. We also show by our computer simulations of different model systems that the language model can produce the correct Boltzmann statistics (as can other AI methods such as refs. 16,17 ) but also the kinetics over a large spectrum of modes characterizing the dynamics in the underlying data. We highlight here a unique aspect of this calculation that the order parameter our framework needs could be arbitrarily far from the true underlying slow mode, often called reaction coordinate. This in turn dictates how long of a memory kernel must be captured which is in general a very hard problem to solve 18,19 . Our framework is agnostic to proximity from the true reaction coordinate and reconstructs statistically accurate dynamics in a wide range of order parameters. We also show how the minimization of loss function leads to learning the path entropy of a physical system, and establish a connection between the embedding layer and transition probability. Followed by this connection, we also show how we can define a transition probability through embedding vectors. We provide tests for Boltzmann statistics and kinetics for Langevin dynamics of model potentials, MD simulation of alanine dipeptide, and trajectory from single molecule force spectroscopy experiment on a multi-state riboswitch 20 , respectively. We also compare our protocol with alternate approaches including Hidden Markov Models. Our work thus represents a new usage of a popular AI framework to perform dynamical reconstruction in a domain of potentially high fundamental and practical relevance, including materials and drug design.

Results
Molecular dynamics can be mapped into a sequence of characters. Our central rationale in this work is that molecular dynamics (MD) trajectories, adequately discretized in space and time, can be mapped into a sequence of characters in some languages. By using a character-level language model that is effective in predicting future characters given the characters so far in a sequence, we can learn the evolution of the MD trajectory that was mapped into the characters. The model we use is stochastic since it learns each character through the probability they appear in a corpus used for training. This language model consists of three sequential parts shown schematically in Fig. 1. First, there is an embedding layer mapping one-hot vectors to dense vectors, followed by an LSTM layer which connects input states and hidden states at different time steps through a trainable recursive function, and finally a dense layer to transform the output of LSTM to the categorical probability vector.
Specifically, here we consider as input a one-dimensional time series produced by a physical system, for instance through Langevin dynamics being undergone by a complex molecular system. The time series consist of data points {ξ (t) }, where t labels the time step and ξ 2 R is some one-dimensional collective variable or order parameter for the high-dimensional molecular system. In line with standard practice for probabilistic models, we LSTM layer

Actual value
Loss function Embedding layer Dense layer Fig. 1 Neural network schematic. The schematic plot of the simple character-level language model used in this work. The model consists of three main parts: The embedding layer, the LSTM layer, and a dense output layer. The embedding layer is a linear layer which multiplies the one-hot input s (t) by a matrix and produces an embedding vector x (t) . The x (t) is then used as the input of LSTM network, in which the forget gate f (t) , the input gate i (t) , the output gate o (t) , and the candidate valuec ðtÞ are all controlled by (x (t) , h (t−1) ). The forget gate and input gate are then used to produce the update equation of cell state c (t) . The output gate decides how much information propagates to the next time step. The output layer predicts the probabilitiesŷ ðtÞ by parametrizing the transformation from h (t) toŷ with learned weights D d and learned biases b d . Finally, we can compute the cross entropy between the predicted probability distributionŷ ðtÞ and the true probability distribution y (t) = s (t+1) .
convert the data points to one-hot encoded representations that implement spatial discretization. Thus each data point {ξ (t) } is represented by a N-dimensional binary vector s (t) , where N is the number of discrete grid-points. An entry of one stands for the representative value and all the other entries are set to zeros. The representative values are in general finite if the order parameter is bounded, and are equally spaced in R with in total N representative values. Note that the time series {ξ (t) } does not have to be one-dimensional. For a higher-dimensional series, we can always choose a set of representative values corresponding to locations in the higher-dimensional space visited trajectory. This would typically lead to a larger N in the one-hot encoded representations, but the training set size itself will naturally stay the same. We find that the computational effort only depends on the size of training set and very weakly on N, and thus the time spent for learning a higher dimensional time series does not increase much relative to a one-dimensional series.
In the sense of modeling languages, the one-hot representation on its own cannot capture the relation between different characters. Take for instance that there is no word in the English language where the character c is followed by x, unless of course one allows for the possibility of a space or some other letter in between. To deal with this, computational linguists make use of an embedding layer. The embedding layer works as a look-up table which converts each one-hot vector s (t) to a dense vector x ðtÞ 2 R M by the multiplication of a matrix Λ which is called the embedding matrix, where M is called the embedding dimension The sequence of dense representation x (t) accounts for the relation between different characters as seen in the training time series. x (t) is then used as the input of the LSTM layer. Each x (t) generates an output h ðtÞ 2 R L from LSTM layer, where L is a tunable hyperparameter. Larger L generally gives better learning capability but needs more computational resources. The LSTM itself consists of the following elements: the input gate i (t) , the forget gate f (t) , the output gate o (t) the cell state c (t) , the candidate valuec ðtÞ , and h (t) which is the hidden state vector and the final output from the LSTM. Each gate processes information in different aspects. 8 Briefly, the input gate decides which information to be written, the forget gate decides which information to be erased, and the output gate decides which information to be read from the cell state to the hidden state. The update equation of these elements can be written as follows: c ðtÞ ¼ f ðtÞ c ðtÀ1Þ þ i ðtÞ c ðtÞ ð6Þ where W and b are the corresponding weight matrices and bias vectors. The tanhðvÞ operates piecewise on each element of the vector v. The operation ∘ is the Hadamard product 21 .
The final layer in Fig. 1 is a simple dense layer with fully connected neurons which converts the output h (t) of the LSTM to a vector y (t) in which each entry denotes the categorical probability of the representative value for the next time step t + 1. The loss function J for minimization during training at every timestep t is then defined as the cross entropy between the output of the modelŷ ðtÞ and the actual probability for the next timestep y ðtÞ which is just the one-hot vector s t+1 where T is the total length of trajectory, and the final loss function is the sum over the whole time series. The softmaxðxÞ i ¼ expðx i Þ= P j expðx j Þ is a softmax function mapping x to a probability vectorŷ.
Training the network is equivalent to learning path entropy. The central finding of this work, which we demonstrate through numerical results for different systems, is that a LSTM framework used to model languages can also be used to capture kinetic and thermodynamic aspects of dynamical trajectories prevalent in chemical and biological physics. In this section we demonstrate theoretically as to why LSTMs possess such a capability. Before we get into the mathematical reasoning detailed here, as well as in Supplementary Note 1, we first state our key idea. Minimizing the loss function J in LSTM (Eq. (9)), which trains the model at time t to generate outputŷ ðtÞ resembling the target output s t+1 , is equivalent to minimizing the difference between the actual and LSTM-learned path probabilities. This difference between path probabilities can be calculated as a cross-entropy J 0 defined as: Pðx ðTÞ :::x ð0Þ Þln Qðx ðTÞ :::x ð0Þ Þ ð10Þ where P(x (t+1) , . . . , x (0) ) and Q(x (t+1) , . . . , x (0) ) are the corresponding true and neural network learned path probabilities of the system. Equation (10) can be rewritten 22 as the sum of path entropy H(P) for the true distribution P and Kullback-Liebler distance D KL between P and Q: J 0 ¼ HðPÞ þ D KL ðPjjQÞ. Since D KL is strictly non-negative 22 attaining the value of 0 iff Q = P, the global minimum of J 0 happens when Q = P and J 0 equals the path entropy H(P) of the system. 23 Thus we claim that minimizing the loss function in LSTM is equivalent to learning the path entropy of the underlying physical model, which is what makes it capable of capturing kinetic information of the dynamical trajectory.
To prove this claim we start with rewriting J in Eq. (9). For a long enough observation period T or for a very large number of trajectories, J can be expressed as the cross entropy between conditional probabilities: Pðx ðtþ1Þ jx ðtÞ :::x ð0Þ Þ ln Qðx ðtþ1Þ jx ðtÞ :::x ð0Þ Þ ð11Þ where P(x (t+1) |x (t) . . . x (0) ) is the true conditional probability for the physical system, and Q(x (t+1) |x (t) . . . x (0) ) is the conditional probability learned by the neural network. The minimization of Eq. (11) leads to minimization of the cross entropy J 0 as shown in the SI. Here we conversely show how Eq. (10) reduces to Eq. (9) by assuming a stationary first-order Markov process as in ref. 23 : where Pðx ðtþ1Þ j jx ðtÞ i Þ P ij is the transition probability from state x i to state x j and Pðx ð0Þ k Þ P k is the occupation probability for the single state x k . Plugging Eq. (12) into Eq. (10), and following the derivation in ref. 23 with the constraints X j P ij ¼ 1 we arrive at an expression for the cross-entropy J, which is very similar to the path entropy type expressions derived for instance in the framework of Maximum Caliber 23 : In Eq. (14) as the trajectory length T increases, the second term dominates in the estimate of J leading to Eq. (15). This second term is the ensemble average of a time-dependent quantitỹ Jðx from which we directly obtain Eq. (9). Therefore, under firstorder Markovianity and ergodicity, minimizing the loss function J of Eq. (9) is equivalent to minimizing J 0 and thereby learning the path entropy. In the SI we provide a proof for this statement that lifts the Markovianity assumption as well-the central idea there is similar to what we showed here.
Embedding layer captures kinetic distances. In word embedding theory, the embedding layer provides a measure of similarity between words. However, from the path probability representation, it is unclear how the embedding layer works since the derivation can be done without embedding vectors x. To have an understanding to Q lm in the first-order Markov process, we first write the conditional probability Q lm ¼ Qðx where f is the recursive function which is defined with the update equation in Eq.
(2)-(7). In Eq. (17), θ denotes various parameters including all weight matrices and biases, and the summation index k runs over all possible states. Now we can use multivariable Taylor's theorem to approximate f θ as the linear term around a point a as long as a is not at any local minimum of f θ : Then Eq. (17) becomes where C We can see in Eq. (19) how the embedding vectors come into the transition probability. Specifically, there is a symmetric form between output one-hot vectors s ðtþ1Þ m and the input one-hot vectors s (t) , in which x (t) = Λs (t) and Λ is the input embedding matrix, D d A θ can be seen as the output embedding matrix, and C ðtþ1Þ i is the correction of time lag effect. While we do not have an explicit way to calculate the output embedding matrix so defined, Eq. (19) motivates us to define the following ansatz for the transition probability: where x m and x l are both calculated by the input embedding matrix Λ. The expression in Eq. (20) is thus a tractable approximation to the more exact transition probability in Eq. (19). Furthermore, we show through numerical examples of test systems that our ansatz for Q lm does correspond to the kinetic connectivity between states. That is, the LSTM embedding layer with the transition probability through Eq. (20) can capture the average commute time between two states in the original physical system, irrespective of the quality of low-dimensional projection fed to the LSTM [25][26][27] .
Test systems. To demonstrate our ideas, here we consider a range of different dynamical trajectories. These include three model potentials, the popular model molecule alanine dipeptide, and trajectory from single molecule force spectroscopy experiments on a multi-state riboswitch. 20 The sample trajectories of these test systems and the data preprocessing strategies are put in the Supplementary Note 5 and Supplementary Figs. 14-18 When applying our neural network to the model systems, the embedding dimension M is set to 8 and LSTM unit L set to 64. When learning trajectories for alanine dipeptide and riboswitch, we took M = 128 and L = 1024. All time series were batched into sequences with a sequence length of 100 and the batch size of 64. For each model potential, the neural network was trained using the method of stochastic gradient descent for 20 epochs until the training loss becomes smaller than the validation loss, which means an appropriate training has been reached. For alanine dipeptide, 40 training epochs were used. Our neural network was built using TensorFlow version 1.10. Further system details are provided in "Methods" section.
Boltzmann statistics and kinetics for model potentials. The first test we perform for our LSTM set-up is its ability to capture the Boltzmann weighted statistics for the different states in each model potential. This is the probability distribution P or equivalently the related free energy F ¼ À 1 β log P, and can be calculated by direct counting from the trajectory. As can be seen in Fig. 2, the LSTM does an excellent job of recovering the Boltzmann probability within error bars.
Next we describe our LSTM deals with a well-known problem in analyzing high-dimensional data sets through low-dimensional projections. One can project the high-dimensional data along many different possible low-dimensional order parameters, for instance x, y, or a combination thereof in Fig. 2. However most such projections will end up not being kinetically truthful and give a wrong impression of how distant the metastable states actually are from each other in the underlying high-dimensional space. It is in general hard to come up with a projection that preserves the kinetic properties of the high-dimensional space. Consequently, it is hard to design analysis or sampling methods that even when giving a time-series along a sub-optimal projection, still capture the true kinetic distance in the underlying high-dimensional space.
Here we show how our LSTM model is agnostic to the quality of the low-dimensional projection in capturing accurate kinetics. Given that for each of the 3 potentials the LSTM was provided only the x−trajectory, we can expect that the chosen model potentials constitute different levels of difficulties in generating correct kinetics. Specifically, a one-dimensional projection along x is kinetically truthful for the linear 3-state potential in Fig. 2a but not for the triangular 3-state and the 4-state potentials in Fig. 2b and c, respectively. For instance, Fig. 2e gives the impression that state C is kinetically very distant from state A, while in reality for this potential all 3 pairs of states are equally close to each other. Similar concerns apply to the 4-state potential.
In Figs. 3 and 4a-c and d-f we compare the actual versus LSTM-predicted kinetics for moving between different metastable states for different model potentials, for all pairs of transitions in both directions (i.e., for instance A to B and B to A). Specifically, Fig. 3a-c and 3d-f shows results for moving between the 3 pairs of states in the linear and triangular 3-state potentials, respectively. Figure 4 shows results for the 6 pairs of states in the 4-state potential. Furthermore, for every pair of state, we analyze the transition time between those states as a function of different minimum commitment or commit time, i.e., the minimum time that must be spent by the trajectory in a given state to be classified as having committed to it. A limiting value, and more specifically the rate at which the population decays to attain to such a limiting value, corresponds to the inverse of the rate constant for moving between those states 28,29 . Thus here we show how our LSTM captures not just the rate constant, but timedependent fluctuations in the population in a given metastable state as equilibrium is attained. The results are averaged over 20 independent segments taken from the trajectories of different trials of training for the 3-state potentials and 10 independent segments for the 4-state potential.
As can be seen in Figs. 3 and 4, the LSTM model does an excellent job of reproducing well within errorbars the transition times between different metastable states for different model potentials irrespective of the quality of the low-dimensional projection. Firstly, our model does tell the differences between linear and triangular 3-state models (Fig. 3) even though the projected free energies along the x variable input into LSTM are same (Fig. 2). The number of transitions between states A and C is less than the others; while for triangular configuration, the numbers of transitions between all pairs of states are similar. The rates at which the transition count decays as a function of commitment time is also preserved between the input data and the LSTM prediction.
The next part of our second test is the 4-state model potential. In Fig. 4 we show comparisons for all 6 pairs of transitions in both forward and reverse directions. A few features are immediately striking here. Firstly, even though states B and C are perceived to be kinetically proximal from the free energy (Fig. 2), the LSTM captures that they are distal from each other and correctly assigns similar kinetic distance to the pairs B, C as it does to A, D. Secondly, there is asymmetry between the forward and backward directions (for e.g., A to D and D to A, indicating that the input trajectory itself has not yet sufficiently sampled the slow transitions in this potential. As can be seen from Fig. 2c the input trajectory has barely 1 or 2 direct transitions for the very high barrier A to D or B to C. This is a likely explanation for why our LSTM model does a bit worse than in the other two model potentials in capturing the slowest transition rates, as well as the higher error bars we see here. In other words, so far we can conclude that while our LSTM model can capture equilibrium probabilities and transition rates for different model potentials irrespective of the input projection direction or order parameter, it is still not a panacea for insufficient sampling itself, as one would expect.
Boltzmann statistics and kinetics for alanine dipeptide. Finally, we apply our LSTM model to the study of conformational transitions in alanine dipeptide, a model biomolecular system comprising 22 atoms, experiencing thermal fluctuations when coupled to a heat bath. The structure of alanine dipeptide is shown in Fig. 5a. While the full system comprises around 63 degrees of freedom, typically the torsional angles ϕ and ψ are used to identify the conformations of this peptide. Over the years a large number of methods have been tested on this system in order to perform enhanced sampling of these torsions, as well as to construct optimal reaction coordinates [30][31][32][33] . Here we show that our LSTM model can very accurately capture the correct Boltzmann statistics, as well as transition rates for moving between the two dominant metastable states known as C 7eq and C 7ax . Importantly, the reconstruction of the equilibrium probability and transition kinetics, as shown in Fig. 5 and Table 1 is extremely accurate irrespective of the choice of one-dimensional projection time series fed into the LSTM. Specifically, we do this along sin ϕ and sin ψ, both of which are known to quite distant from an optimized kinetically truthful reaction coordinate 19,34 , where again we have excellent agreement between input and LSTM-predicted results.
Learning from single molecule force spectroscopy trajectory. In this section, we use our LSTM model to learn from single molecule force spectroscopy experiments of a multi-state riboswitch performed with a constant force of 10.9 pN. The data points are measured at 10 kHz (i.e., every 100 μs). Other details of the experiments can be found in ref. 20 . The trajectory for a wide range of extensions starting 685 nm up to 735 nm was first spatially discretized into 34 labels, and then converted to a time series of one hot vectors, before being fed into the LSTM model. The results are shown in Fig. 6. In Fig. 6a, we have shown an agreement between a profile of probability density averaged over 5 independent training sets with the probability density calculated from the experimental data. Starting from the highest extension, the states are fully unfolded (U), longer intermediate (P3) and shorter intermediate (P2P3) 20 . From Fig. 6b-c, we see that the LSTM model captures the kinetics for moving between all 3 pairs of states for a very wide range of commitment times.   Embedding layer based kinetic distance. In Eq. (19), we derived a non-tractable relation for conditional transition probability in the embedding layer, and then through Eq. (20) we introduced a tractable ansatz in the spirit of Eq. (19). Here we revisit and numerically validate Eq. (20). Specifically, given any two embedding vectors x l and x m calculated from any two states l and m, we estimate the conditional probability Q lm using Eq. (20). We use Q i to denotes the Boltzmann probability predicted by the LSTM model. We then write down the interconversion probability k lm between states l and m as: From inverting this rate we then calculate an LSTM-kinetic time as t lm ≡ 1/k lm = 1/(Q l Q lm + Q m Q ml ). In Fig. 7, we compare t lm with the actual transition time τ lm obtained from the input data, defined as Here N lm is the mean number of transitions between state l and m. As this number varies with the precise value of commitment time, we average N lm over all commit times to get 〈N lm 〉. These two timescales t lm and τ lm thus represent the average commute time or kinetic distance 25,26 between two states l and m. To facilitate the comparison between these two very differently derived timescales or kinetic distances, we rescale and shift them to lie between 0 and 1. The results in Fig. 7 show that the embedding vectors display the connectivity corresponding to the original high-dimensional configuration space rather than those corresponding to the one-dimensional projection. The model captures the correct connectivity by learning kinetics, which is clear evidence that it is able to bypass the projection error along any degree of freedom. The result also explains how is it that no matter what degree of freedom we use, our LSTM model still gives correct transition times. As long as the degree of freedom we choose to train the model can be used to discern all metastable states, we can even use Eq. (20) to see the underlying connectivity. Therefore, the embedding vectors in LSTM can define a useful distance metric which can be used to understand and model dynamics, and are possibly part of the reason why LSTMs can model kinetics accurately inspite of quality of projection and associated non-Markvoian effects.
Comparing with Markov state model and Hidden Markov Model. In this section, we briefly compare our LSTM model with  always generates the data points with the same temporal precision as it has in the training data irrespective of the intrinsic timescales it learns from the system. In Fig. 8, we provide the results of using HMM and MSM for the riboswitch trajectory with the same binning method and one-hot encoded input, to be contrasted with similar plots using LSTM in Fig. 6. Indeed both MSM and HMM achieve decent agreement with the true kinetics only if the commit time is increased approximately beyond 10 ms, while LSTM as shown in Fig. 6 achieved perfect agreement for all commit times. From this figure, it can be seen that the LSTM model achieves an expected agreement with as fine of a temporal precision as desired, even though we use 20 labels for alanine dipeptide and 34 labels for experimental data to represent the states. The computational efforts needed for the various approaches (LSTM, MSM, and HMM) are also provided in the  Supplementary Note 3 and Supplementary Table 2

Discussion
In summary we believe this work demonstrates potential for using AI approaches developed for natural language processing such as speech recognition and machine translation, in unrelated domains such as chemical and biological physics. This work represents a first step in this direction, wherein we used AI, specifically LSTM flavor of recurrent neural networks, to perform kinetic reconstruction tasks that other methods 41,42 could have also performed. We would like to argue that demonstrating the ability of AI approaches to perform tasks that one could have done otherwise is a crucial first step. In future works we will exploring different directions in which the AI protocol developed here could be used to perform tasks which were increasingly nontrivial in non-AI setups. More specifically, in this work we have shown that a simple character-level language model based on LSTM neural network can learn a probabilistic model of a time series generated from a physical system such as an evolution of Langevin dynamics or MD simulation of complex molecular models. We show that the probabilistic model can not only learn the Boltzmann statistics but also capture a large spectrum of kinetics. The embedding layer which is designed for encoding the contextual meaning of words and characters displays a nontrivial connectivity and has been shown to correlate with the kinetic map defined for reversible Markov chains 25,26,43 . An interesting future line of work for the embedding layer can be to uncover different states when they are incorrectly represented by the same reaction coordinate value, which is similar to finding different contextual meaning of the same word or character. For different model systems considered here, we could obtain correct timescales and rate constants irrespective of the quality of order parameter fed into the LSTM. As a result, we believe this kind of model outperforms traditional approaches for learning thermodynamics and kinetics, which can often be very sensitive to the choice of projection. Finally, the embedding layer can be used to define a new type of distance metric for high-dimensional data when one has access to only some low-dimensional projection. We hope that this work represents a first step in the use of RNNs for modeling, understanding and predicting the dynamics of complex stochastic systems found in biology, chemistry and physics.

Methods
Model potential details. All model potentials have two degrees of freedom x and y. Our first two models (shown in Fig. 2a and b) have three metastable states with governing potential U(x, y) given by Uðx; yÞ ¼ Wðx 6 þ y 6 Þ À Gðx; x 1 ÞGðy; y 1 Þ À Gðx; x 2 ÞGðy; y 2 Þ À Gðx; x 3 ÞGðy; y 3 Þ ð23Þ  where W = 0.0001 and Gðx; x 0 Þ ¼ e À ðxÀx 0 Þ 2 2σ 2 denotes a Gaussian function centered at x 0 with width σ = 0.8. We also build a 4-state model system with governing interaction potential: Uðx; yÞ ¼ Wðx 4 þ y 4 Þ þ Gðx; 0:0ÞGðy; 0:0Þ À Gðx; 2:0ÞGðy; À1:0Þ À Gðx; 0:5ÞGðy; 2:0Þ À Gðx; À0:5ÞGðy; À2:0Þ À Gðx; À2:0ÞGðy; 1:0Þ The different local minima corresponding to the model potentials in Eq. (23) and Eq. (24) are illustrated in Fig. 2. We call these as linear 3-state, triangular 3-state, and 4-state models, respectively. The free energy surfaces generated from the simulation of Langevin dynamics 44 with these model potentials are shown in Fig. 2a-c. Molecular dynamics details. The integration timestep for the Langevin dynamics simulation was 0.01 units, and the simulation was performed at β = 9.5 for linear 3-state and 4-state potentials and β = 9.0 for triangular 3-state potential, where β = 1/k B T. The MD trajectory for alanine dipeptide was obtained using the software GROMACS 5.0.4 45 6 Boltzmann statistics and kinetics for riboswitch. Using LSTM model to learn thermodynamics and kinetics from a folding and unfolding trajectory taken from a single molecule force spectroscopy measurement 20 : a Comparison between the probability density learned by the LSTM model and calculated from the experimental data. The regions between errorbars defined as standard errors are filled with blue color. b-d Commit time plots calculated by counting the transitions in the trajectory generated by LSTM and the experimental trajectory. The commit time is the minimum time that must be spent by the trajectory in a given state to be classified as having committed to it. Error bars are illustrated and were calculated as standard errors. In a, we use solid circle and empty square markers, respectively to represent linear and triangular 3-state model potentials. In each plot, the data points are shifted slightly to the right for clarity. The distances marked actual and LSTM represent rescaled mean transition times as per Eqs. (22) and (21), respectively. Error bars were calculated as standard errors over 50 different trajectories.

Data availability
The single-molecule force spectroscopy experiment data for riboswitch was obtained from the authors of ref. 20 and they can be contacted for the same. All the other data associated with this work is available from the corresponding author on request.