The understanding of protein function is often interlinked with understanding protein dynamics. Molecular dynamics (MD) simulations are a valuable tool to study these dynamics on an atomistic level1,2,3,4,5,6. However, further methods are necessary to extract the statistically relevant information and to help overcome the discrepancy between feasible simulation length and the timescales of relevant processes. A common approach to enhance sampling of a specific process of interest is to bias the simulation along a reaction coordinate aligning with the process7,8,9,10,11,12,13. In comparison, the Markov modeling approach14,15,16,17,18,19,20 extracts kinetic information and tackles the sampling problem without requiring the definition of a few predefined reaction coordinates by combining arbitrary numbers of short unbiased distributed simulations to model the long-timescale behavior of target systems. Consequently, multiple software packages21,22 have been developed over the last decade providing assistance in estimating these models. They often include a pipeline for feature selection21,22,23,24, dimension reduction25,26,27,28,29,30,31, clustering32,33,34,35, transition matrix estimation15,19,36,37, and coarse graining38,39,40,41,42,43,44. Markov state models (MSMs) have been applied to a wide range of molecular biology problems such as protein aggregation45,46,47 or ligand binding48,49,50 and can be a valuable tool to understand experimental data on the atomistic scale51,52.

The necessity to assess a model’s performance and thereby rank its quality encouraged the development of variational methods53,54, in particular the variational approach for Markov processes (VAMP)55. This variational formulation has allowed us to replace the aforementioned pipeline with an end-to-end deep learning framework called VAMPnet56, which simultaneously learns a dimension reduction of the molecular system to the collective variables best describing the rare event processes and an MSM on these variables. The framework can be used to further drive MD simulations along these learned collective variables57,58. We can also use this framework to estimate statistically reversible MSMs and incorporate constraints from experimental observables59,60,61.

Despite these developments, there is a fundamental scaling problem in describing MD in terms of transitions between global system states: While the assignment of MD configurations to discrete global states representing the metastable groups of structures is an excellent model for small cooperative molecular systems, such as small to medium proteins, larger molecular systems (e.g., proteins with hundreds of amino acids) have an increasing number of subsystems whose dynamics are (nearly) independent62 (Fig. 1). Consider, for example, a solution of N proteins which undergo transitions between their open and closed states independently when these proteins are dissociated and these transitions only (partially) couple when they are associated with other proteins. The number of global system states is \({2}^{N}\), i.e., it grows exponentially with the number of subsystems N (refs. 63,64). This means any form of simulation or analysis which explicitly distinguishes global system states will not scale to large molecular systems.

Fig. 1: The iVAMP concept as visualized by modeling dynamics of a protein that has two independent, flexible regions separated by a rigid barrel.

iVAMPnets learn an assignment of the C- (blue/top) and N-termini (green/bottom) into independent subsystems from molecular dynamics trajectories (left column). Moreover, the dynamics of both termini are modeled with statistically independent VAMPnets (right column).

At the same time, the (approximate) independence between subsystems is also key to the solution of the problem. A scalable solution needs to address two separate issues: (a) dividing the protein system into approximately Markovian subsystems and (b) learning the coupling between them. Olsson & Noé63 made a first attempt at (b), by learning a dynamic graphical model between predefined subsystems. This approach leads to a graphical model, or Markov random field, resembling Ising or Potts models in physics, with the key difference that both the definition of the individual subsystems or spins and their transition dynamics need to be learned. In contrast, Hempel et al.64 proposed a solution for (a) by approximating the global system dynamics as a set of independent (uncoupled) Markov models (termed Independent Markov decomposition, IMD). They furthermore proposed a pairwise independence score of features, which allows the detection of nearly uncoupled regions in which independent Markov state models can subsequently be estimated.

In this manuscript, we present a joint IMD and VAMP approach (termed independent VAMPnet, or shorthand iVAMPnet) that significantly advances our ability to identify approximately independent Markovian subsystems (issue a) by generalizing IMD to neural network basis functions. iVAMPnets are an integrated end-to-end learning approach that decomposes the macromolecular structure into subsystems that are dynamically weakly coupled, and estimates a VAMPnet for each of these subsystems to promote a comprehensible analysis of the subsystem dynamics (Fig. 1). In comparison to previous implementations of IMD, our approach learns an optimal decomposition into independent subsystems and can find collective variables that are nonlinear combinations of the input features.


Markov state models and Koopman models

Markovian dynamics can be modeled by the transition density:

$${p}_{\tau }({{{{{{{\bf{y}}}}}}}}|{{{{{{{\bf{x}}}}}}}})={\mathbb{P}}({{{{{{{{\bf{x}}}}}}}}}_{t+\tau }={{{{{{{\bf{y}}}}}}}}|{{{{{{{{\bf{x}}}}}}}}}_{t}={{{{{{{\bf{x}}}}}}}}),$$

which is the probability density to observe configuration y at time t + τ given that the system was in configuration x at time t. Based on the transition density we can characterize the time evolution of a probability density χ as:

$${\chi }_{t+\tau }({{{{{{{\bf{y}}}}}}}})=\int{p}_{\tau }({{{{{{{\bf{y}}}}}}}}|{{{{{{{\bf{x}}}}}}}}){\chi }_{t}({{{{{{{\bf{x}}}}}}}}){{{{{{{\rm{d}}}}}}}}{{{{{{{\bf{x}}}}}}}}.$$

By discretizing the molecular state space in a suitable way and defining a transition matrix T between discrete states, we can linearize this equation as:

$${{{{{{{{\boldsymbol{\chi }}}}}}}}}_{t+\tau }({{{{{{{\bf{y}}}}}}}})={{{{{{{{\bf{T}}}}}}}}}_{\tau }^{T}{{{{{{{{\boldsymbol{\chi }}}}}}}}}_{t}({{{{{{{\bf{x}}}}}}}})$$

This is the equation of a Markov state model, where the element i of the vector χt+τ(y) is the probability to be in the discrete state i at time t + τ. Furthermore, the transition matrix elements \({({{{{{{{{\bf{T}}}}}}}}}_{\tau })}_{ij}\) describe the transition probabilities for jumping to state j given state i within a time τ. In the case of fuzzy state assignments, e.g., as with VAMPnets, Eq. (3) describes the more general Koopman model65 and Tτ becomes the Koopman matrix. This means that probability densities are still propagated but the matrix elements cannot be interpreted as transition probabilities.

The lag time τ is common to all Markovian models and is usually chosen with the aid of an implied timescales test66. If τ is chosen too small, the resulting model is not a valid Markov model (resulting in errors of the predicted variables); if τ is chosen too large, the model unnecessarily discards kinetic information. We therefore usually choose the smallest lag time above which the implied timescales are approximately constant.
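As a minimal sketch of Eq. (3) and of the implied timescales test, consider a hypothetical two-state transition matrix (the numbers are illustrative assumptions and are not taken from this work). The sketch uses the standard relation \({t}_{i}=-\tau /\ln | {\lambda }_{i}(\tau )|\) between an eigenvalue of \({{\bf{T}}}_{\tau }\) and its implied timescale; for an exactly Markovian process the timescale should not change when the lag time is doubled.

```python
import math

def propagate(T, chi):
    """One step of Eq. (3): chi_{t+tau} = T^T chi_t for a row-stochastic T."""
    n = len(T)
    return [sum(T[i][j] * chi[i] for i in range(n)) for j in range(n)]

def matmul(A, B):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def implied_timescale_2state(T, tau):
    """Implied timescale of a 2-state model.

    A 2x2 stochastic matrix has eigenvalues 1 and trace(T) - 1.
    """
    lam2 = T[0][0] + T[1][1] - 1.0
    return -tau / math.log(abs(lam2))

T1 = [[0.95, 0.05],
      [0.10, 0.90]]          # hypothetical transition matrix at lag tau = 1
T2 = matmul(T1, T1)          # the same Markov process observed at lag tau = 2

chi = propagate(T1, [1.0, 0.0])          # start with all probability in state 0
its1 = implied_timescale_2state(T1, 1)
its2 = implied_timescale_2state(T2, 2)
print(chi, its1, its2)
```

For real simulation data the timescales typically become constant only above some minimal lag time, which is the lag time one would then select.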

We now seek to find a state assignment χ and model matrix T that satisfy Eq. (3) and also succeed in predicting the long-time behavior, i.e., for multiples of the lag time τ. Formally, χ are (initially unknown) basis functions, i.e., we assume that the relevant dynamic features can be expressed by a linear combination of them. VAMP55 tells us that an optimal solution is reached when χ can span the left \({({\psi }_{1},...,{\psi }_{k})}^{T}\) and right singular functions \({({\phi }_{1},...,{\phi }_{k})}^{T}\) of the transition operator. They can be found by maximizing the singular values of a matrix that can be estimated from simulation data (see Eqs. (9)–(13) in “Methods”). In the case of a VAMPnet56, deep neural networks are trained by maximizing the VAMP score, so as to represent optimal fuzzy state assignments. In equilibrium, the singular functions correspond to the eigenfunctions of the Markov state model and the singular values to its eigenvalues. As the Koopman model still propagates densities, it is instructive to inspect the eigenfunctions and implied timescales of T since they describe the slow dynamics of a given system.

iVAMPnets and iVAMP-score

To implement iVAMPnets, we need to bridge the gap between the deep neural networks of VAMPnets and the spatial decomposition of independent Markov models. The general idea is to set up multiple parallel VAMPnets, each modeling the Markovian dynamics of a separate, independent subsystem of the molecule, together with an attention mechanism that identifies these subsystems. Thus, each independent VAMPnet should only receive the time dependent molecular geometry features representing its specific subsystem. For example, such an attention mechanism could separate different protein domains and channel the data of individual domains to separate VAMPnets. We, therefore, develop an architecture that combines a meaningful attention mechanism and parallel VAMPnets and trains them with a loss function that simultaneously promotes dynamic independence between the subsystems and slow kinetics within each subsystem (Fig. 2). iVAMPnets are designed to optimize both these objectives simultaneously.

Fig. 2: Architecture of an iVAMPnet for N subsystems, where trainable parts are shaded green.

Two lobes are given for configuration pairs xt (light) and xt+τ (dark) where the weights are shared. Firstly, the input features are element-wise weighted \({\bar{{{{{{{{\bf{Y}}}}}}}}}}_{t}={{{{{{{\bf{G}}}}}}}}\odot {{{{{{{{\bf{x}}}}}}}}}_{t}\) with a mask \({{{{{{{\bf{G}}}}}}}}\in {{\mathbb{R}}}^{D\times N}\), where each subsystem learns its individual weighting. The mask values can be interpreted as probabilities to which subsystem the input feature belongs. In order to prevent the subsequent neural network from reversing the effects of the mask, we draw for each input feature i and subsystem j an independent, normally distributed random variable \({\epsilon }_{ij} \sim {{{{{{{\mathcal{N}}}}}}}}(0,\sigma (1-{G}_{ij}))\). This noise is added to the weighted features \({{{{{{{{\bf{Y}}}}}}}}}_{t}={\bar{{{{{{{{\bf{Y}}}}}}}}}}_{t}+{{{{{{{\boldsymbol{\epsilon }}}}}}}}\). Thereby, the attention weight linearly interpolates between input feature and Gaussian noise, i.e., if the attention weight Gij = 1, Yij carries exclusively the input feature xi; if Gij = 0, Yij is pure Gaussian noise. Afterwards, the transformed feature vector is split for each individual subsystem \({{{{{{{{\bf{Y}}}}}}}}}_{t}=[{{{{{{{{\bf{Y}}}}}}}}}_{t}^{1},...,{{{{{{{{\bf{Y}}}}}}}}}_{t}^{N}]\) and passed through the subsystem specific neural network ηi. We call the whole transformation for a subsystem i the fuzzy state assignment \({{{{{{{{\boldsymbol{\chi }}}}}}}}}^{i}({{{{{{{{\bf{x}}}}}}}}}_{t})={{{{{{{{\boldsymbol{\eta }}}}}}}}}^{i}({{{{{{{{\bf{Y}}}}}}}}}_{t}^{i})\).

In practice, we extract all time-lagged data pairs xt, xt+τ that contain all molecular geometry features (e.g., distances, contacts, torsions) of our simulation data and pass them through the architecture presented in Fig. 2. The data is fed through an attention mechanism (represented by the matrix G) that yields subsystem specific vectors \({{{{{{{{\bf{Y}}}}}}}}}_{t}^{i}\), each of which attends to features relevant for subsystem i. These vectors then serve as inputs to N parallel feature transformations ηi (parallel VAMPnets) which transform those into output features χ1, …, χN (with \({{{{{{{{\boldsymbol{\chi }}}}}}}}}^{i}({{{{{{{{\bf{x}}}}}}}}}_{t})={{{{{{{{\boldsymbol{\eta }}}}}}}}}^{i}({{{{{{{{\bf{Y}}}}}}}}}_{t}^{i}({{{{{{{{\bf{x}}}}}}}}}_{t}))\)) that represent slow collective coordinates or directly fuzzy assignments to metastable Markov states of each molecular subsystem. Equipped with the state assignments, we can compute correlation matrices (Eq. (9)) and derive a Koopman model matrix from those (Eq. (10)). As in VAMPnets, the feature transformations η1, …, ηN are represented by deep neural networks. In the present study, we use multilayer perceptrons with a SoftMax output layer representing fuzzy state assignments. However, other architectures could be chosen, e.g., graph convolution networks when parameter sharing is desired67,68, and a linear output layer could be chosen if the aim is to represent slow collective variables rather than discrete states57,58. The parameters of the feature transformations η and the attention matrix are learned end-to-end via backpropagation.
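The masking step described in the caption of Fig. 2 can be sketched in a few lines of plain Python. This is an illustration only and not the PyTorch implementation used in this work: the function name `masked_features` is hypothetical, the mask is taken as given rather than learned, and \(\sigma (1-{G}_{ij})\) is assumed to act as the standard deviation of the noise.

```python
import random

def masked_features(x, G, sigma=1.0, rng=random):
    """Return Y[j][i] = G[i][j] * x[i] + eps_ij for each subsystem j.

    x: list of D input features.
    G: D x N mask with entries in [0, 1] (given here, learned in practice).
    Assumption: sigma * (1 - G_ij) is used as the standard deviation of eps_ij.
    """
    D, N = len(G), len(G[0])
    Y = []
    for j in range(N):
        row = []
        for i in range(D):
            eps = rng.gauss(0.0, sigma * (1.0 - G[i][j]))
            row.append(G[i][j] * x[i] + eps)
        Y.append(row)
    return Y

x = [0.3, -1.2]            # two hypothetical input features
G = [[1.0, 0.0],           # feature 0 belongs to subsystem 0
     [0.0, 1.0]]           # feature 1 belongs to subsystem 1
Y = masked_features(x, G, sigma=1.0, rng=random.Random(0))
print(Y)
```

With this hard mask, subsystem 0 receives feature 0 unchanged and only noise in place of feature 1, so a downstream network cannot recover the masked feature by rescaling its input.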

In more detail, given N individual subsystem models, the global system state can be given by the Kronecker product of all subsystem states:

$${{{{{{{{\boldsymbol{\chi }}}}}}}}}^{G}({{{{{{{{\bf{x}}}}}}}}}_{t})=\mathop{\bigotimes}\limits_{i}{{{{{{{{\boldsymbol{\chi }}}}}}}}}^{i}({{{{{{{{\bf{x}}}}}}}}}_{t})$$

The global correlation matrices \(({{{{{{{{\bf{C}}}}}}}}}_{00}^{G},{{{{{{{{\bf{C}}}}}}}}}_{0\tau }^{G},{{{{{{{{\bf{C}}}}}}}}}_{\tau \tau }^{G})\) are then computed from Eq. (9) using χG. We note that this step does not require that we have independent Markovian models; it is simply a formalism to express global states in terms of a combination of local states.
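To make Eq. (4) concrete, the following sketch builds a global fuzzy state assignment from two subsystem assignments. The membership values are hypothetical; the helper names `kron_vec` and `global_assignment` are introduced for illustration only.

```python
def kron_vec(a, b):
    """Kronecker product of two vectors given as lists."""
    return [ai * bj for ai in a for bj in b]

def global_assignment(chis):
    """Kronecker product over all subsystem state assignments (Eq. (4))."""
    out = [1.0]
    for chi in chis:
        out = kron_vec(out, chi)
    return out

chi_1 = [0.9, 0.1]         # hypothetical fuzzy assignment, 2-state subsystem
chi_2 = [0.2, 0.3, 0.5]    # hypothetical fuzzy assignment, 3-state subsystem
chi_G = global_assignment([chi_1, chi_2])
print(len(chi_G), sum(chi_G))
```

Because each subsystem assignment sums to one, the six global memberships also sum to one, so the result can again be read as a fuzzy assignment to global states.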

Furthermore, we construct a candidate for the global Koopman model from the subsystem models by combining the individual singular values and vectors with a Kronecker product64:

$${\hat{{{{{{{{\bf{K}}}}}}}}}}^{G}=\mathop{\bigotimes}\limits_{i}{{{{{{{{\bf{K}}}}}}}}}^{i}\qquad {\hat{{{{{{{{\bf{U}}}}}}}}}}^{G}=\mathop{\bigotimes}\limits_{i}{{{{{{{{\bf{U}}}}}}}}}^{i}\qquad {\hat{{{{{{{{\bf{V}}}}}}}}}}^{G}=\mathop{\bigotimes}\limits_{i}{{{{{{{{\bf{V}}}}}}}}}^{i}.$$

The matrices \({\hat{{{{{{{{\bf{U}}}}}}}}}}^{G}\) and \({\hat{{{{{{{{\bf{V}}}}}}}}}}^{G}\) map the global state assignments onto the constructed singular functions and are computed from the local matrices as defined in Eqs. (11) and (12). The diagonal matrix \({\hat{{{{{{{{\bf{K}}}}}}}}}}^{G}\) encodes the singular values and is computed from the subsystem singular value matrices via Eq. (10).

In order to evaluate how well the constructed model predicts the dynamics in the global state space, the VAMP-E validation score55 can be used:

$$\begin{array}{l}{{{{{{{{\mathcal{R}}}}}}}}}_{E}^{G}={{{{{{{\rm{tr}}}}}}}}\left[2{\hat{{{{{{{{\bf{K}}}}}}}}}}^{G}{({\hat{{{{{{{{\bf{U}}}}}}}}}}^{G})}^{T}{{{{{{{{\bf{C}}}}}}}}}_{0\tau }^{G}{\hat{{{{{{{{\bf{V}}}}}}}}}}^{G}-\right.\\ \left.{\hat{{{{{{{{\bf{K}}}}}}}}}}^{G}{({\hat{{{{{{{{\bf{U}}}}}}}}}}^{G})}^{T}{{{{{{{{\bf{C}}}}}}}}}_{00}^{G}{\hat{{{{{{{{\bf{U}}}}}}}}}}^{G}{\hat{{{{{{{{\bf{K}}}}}}}}}}^{G}{({\hat{{{{{{{{\bf{V}}}}}}}}}}^{G})}^{T}{{{{{{{{\bf{C}}}}}}}}}_{\tau \tau }^{G}{\hat{{{{{{{{\bf{V}}}}}}}}}}^{G}\right].\end{array}$$

The VAMP-E score measures the difference between the estimated Koopman model and the true dynamics. Here, it is evaluated for the global state assignments \({\bigotimes }_{i}{{\boldsymbol{\chi }}}^{i}\) (as encoded in \({{{{{{{{\bf{C}}}}}}}}}_{00}^{G},{{{{{{{{\bf{C}}}}}}}}}_{0\tau }^{G},{{{{{{{{\bf{C}}}}}}}}}_{\tau \tau }^{G}\)) mapped on the constructed singular functions (as encoded in \({\hat{{{{{{{{\bf{U}}}}}}}}}}^{G},{\hat{{{{{{{{\bf{V}}}}}}}}}}^{G}\)). If the subsystems are independent, the constructed singular functions are optimal and the singular values of the global system are indeed the product of singular values of the subsystems (as formalized in Conditions for independent systems, also see Supplementary Note 1). In this case, the global VAMP-E score Eq. (6) has a product form in the subsystem scores \({{\mathcal{R}}}_{E}^{i}\),

$${{\mathcal{R}}}_{E}^{G}=\mathop{\prod}\limits_{i}{{\mathcal{R}}}_{E}^{i},$$

that poses a necessary condition for subsystem independence.

To finally train the model, we develop a loss function that (i) maximizes the global VAMP-E score, assuming that the subsystems describe independent dynamics (Eqs. (4)–(6)), and (ii) minimizes a term that penalizes statistical dependence between these subsystems (Eq. (7)) scaled by a weighting factor ξ.

We evaluate the scores only pairwise, to escape the growth of the global state space, and sum over all possible pairs i, j:

$$L=-\mathop{\sum}\limits_{i < j}{{{{{{{{\mathcal{R}}}}}}}}}_{E}^{ij}+\xi \mathop{\sum}\limits_{i < j}\frac{||{{{{{{{{\mathcal{R}}}}}}}}}_{E}^{ij}-{{{{{{{{\mathcal{R}}}}}}}}}_{E}^{i}{{{{{{{{\mathcal{R}}}}}}}}}_{E}^{j}||}{{{{{{{{{\mathcal{R}}}}}}}}}_{E}^{ij}}.$$

Here, \({{{{{{{{\mathcal{R}}}}}}}}}_{E}^{ij}\) measures the quality of the constructed Koopman model of subsystems i and j and is computed using Eq. (6). The weighting factor ξ is a hyperparameter that should be chosen large enough to find decoupled systems and small enough to not interfere with the subsystem dynamics. Even though the choice of an appropriate ξ depends on the nature of the dynamics and the coupling, it is directly related to the training procedure, as it balances the focus of the optimizer between kinetics and decoupling. Further conditions (Eq. (18)), which evaluate the independence of the singular functions and values, can be used as post-training validation metrics for adjusting ξ and for testing to which degree dynamically independent subsystems were found.
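The structure of the pairwise loss Eq. (8) can be illustrated with scalar scores. In the sketch below the VAMP-E scores are hypothetical numbers supplied by hand rather than estimated from data, and the function name `ivamp_loss` is an assumption made for this example; since the scores are scalars, the norm reduces to an absolute value.

```python
from itertools import combinations

def ivamp_loss(R_single, R_pair, xi):
    """Eq. (8): L = -sum_{i<j} R_ij + xi * sum_{i<j} |R_ij - R_i * R_j| / R_ij.

    R_single: list of subsystem scores R_i.
    R_pair:   dict mapping (i, j) with i < j to the pair score R_ij.
    xi:       weighting factor of the independence penalty.
    """
    loss = 0.0
    for i, j in combinations(range(len(R_single)), 2):
        r_ij = R_pair[(i, j)]
        penalty = abs(r_ij - R_single[i] * R_single[j]) / r_ij
        loss += -r_ij + xi * penalty
    return loss

R_single = [1.5, 2.0]                                  # hypothetical R_i
independent = ivamp_loss(R_single, {(0, 1): 3.0}, xi=2.0)
coupled = ivamp_loss(R_single, {(0, 1): 2.4}, xi=2.0)
print(independent, coupled)
```

When the pair score equals the product of the subsystem scores, the penalty vanishes and the loss reduces to the negative pair score; any deviation from the product form is penalized in proportion to ξ.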

Benchmark model with two independent subsystems

The iVAMPnet architecture, which is implemented using PyTorch69, is depicted in Fig. 2. Generally, various neural network architectures are possible; we here choose fully connected feed forward neural networks with up to 5 hidden layers with 100 nodes each. The scripts to reproduce the results including the details for the training routine, choice of hyperparameters, and network architecture can be found in our GitHub repository. We note that an implementation of VAMPnets is available in the current version of DeepTime70.

We first demonstrate that iVAMPnets are capable of decomposing a dynamical system into its independent Markovian subsystems based on observed trajectory data using an exactly decomposable benchmark model (Fig. 3).

Fig. 3: Hidden Markov state model as a benchmark example for independent subsystems.

a 2 subsystems with 2 and 3 states emit independently to an x and y axis, respectively. The corresponding 2D space embeds all 6 global states. b The learned mask, depicted in gray-scale from 0 (white) to 1 (black), shows that each subsystem focuses on one input dimension. c The estimated subsystem transition matrices are compared with the ground truth (in percent). d Subsystem eigenfunctions (color-coded) and corresponding eigenvalues (number prints) as found by iVAMPnet. Independent processes are recovered from the 2D data.

Akin to the protein illustrated in Fig. 1, we define a system that consists of two independent subsystems with two and three states, respectively. It is modeled by two transition matrices with the corresponding number of states. We sample a discrete trajectory with each matrix (100k steps)70. The global state is defined as a combination of these discrete states. The discrete subsystem states are now interpreted as the hidden states of hidden Markov models71 that emit to separate, subsystem-specific dimensions of a 2D space. The output of each subsystem is modeled with Gaussian noise \(N({\mu }_{i},\tilde{\sigma })\in {\mathbb{R}}\) that is specific to the state that the system is in, specified by the mean μi, and a constant \(\tilde{\sigma }\). The two state subsystem, therefore, describes a jump process between Gaussian basins along the x-axis and the three state subsystem along the y-axis, respectively (Fig. 3a). These variables compare to collective variables of the green (x) and blue (y) system depicted in Fig. 1. Please note that while in this benchmark system the relevant slow collective variables are known, iVAMPnets are generally capable of finding them (cf. 10D hypercube benchmark model and Synaptotagmin-C2A).
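A benchmark of this kind could be generated along the following lines. This is a sketch under stated assumptions: the transition matrices, emission means, noise width, and random seed are hypothetical (the values used in this work are not reproduced here), and the trajectory is kept short (1000 steps instead of 100k) so that the example runs quickly.

```python
import random

def sample_chain(T, n_steps, rng):
    """Sample a discrete trajectory from a row-stochastic matrix T, starting in state 0."""
    state, traj = 0, []
    for _ in range(n_steps):
        traj.append(state)
        state = rng.choices(range(len(T)), weights=T[state])[0]
    return traj

def emit(traj, means, sigma, rng):
    """Emit one Gaussian-distributed observation per hidden state."""
    return [rng.gauss(means[s], sigma) for s in traj]

rng = random.Random(42)
T_x = [[0.95, 0.05],
       [0.05, 0.95]]                     # hypothetical 2-state subsystem
T_y = [[0.90, 0.05, 0.05],
       [0.05, 0.90, 0.05],
       [0.05, 0.05, 0.90]]               # hypothetical 3-state subsystem
n = 1000

s_x = sample_chain(T_x, n, rng)          # hidden states, sampled independently
s_y = sample_chain(T_y, n, rng)
x = emit(s_x, [-1.0, 1.0], 0.3, rng)     # 2-state subsystem emits along x
y = emit(s_y, [-2.0, 0.0, 2.0], 0.3, rng)  # 3-state subsystem emits along y

data = list(zip(x, y))                            # 2D observations
global_state = [3 * a + b for a, b in zip(s_x, s_y)]   # one of 6 global states
print(len(data), sorted(set(global_state)))
```

Only `data` would be passed to the model; the hidden trajectories and the global state index serve as ground truth for validation.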

Since the generative benchmark model consists of perfectly independent subsystems and the pair already describes the global system, our method can simply be optimized for the global VAMP-E score (Eq. (6)) without the need for any further constraints. We train a model with a two and three state subsystem at a lag time of τ = 1 step.

Once trained, the iVAMPnet yields a model of the dynamics in each of the identified subsystems. As expected, we find that the estimated transition matrices for both subsystems closely agree with the ground truth (Fig. 3c). To additionally assess the slow subsystem dynamics in more detail, we borrow concepts from MSM analysis and conduct an eigenvalue decomposition of the iVAMPnet models (cf. VAMPnets). The analysis of the eigenfunctions demonstrates that, by construction, the system exhibits one independent process along the x-axis (λ1 = 0.90) and two along the y-axis (λ2 = 0.89 and λ4 = 0.66) (Fig. 3d). In contrast, we note that in the picture of global states, two additional processes would appear as a result of mixing the independent processes (cf. Supplementary Note 2), which makes the combined dynamical model more challenging to analyze, whereas the iVAMPnet analysis remains straightforward and simple.
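The two additional processes of the global picture can be made explicit with a short calculation. For independent subsystems the global transition matrix is the Kronecker product of the subsystem matrices, so its eigenvalues are all pairwise products of subsystem eigenvalues. The sketch below uses the eigenvalues quoted above together with the stationary eigenvalue 1 of each subsystem.

```python
lam_x = [1.0, 0.90]          # 2-state subsystem: stationary + one process
lam_y = [1.0, 0.89, 0.66]    # 3-state subsystem: stationary + two processes

# Eigenvalues of the 6-state global model: all pairwise products.
global_eigs = sorted((round(a * b, 4) for a in lam_x for b in lam_y),
                     reverse=True)

# Processes that do not belong to a single subsystem.
mixed = [lam for lam in global_eigs if lam not in lam_x + lam_y]

print(global_eigs)   # [1.0, 0.9, 0.89, 0.801, 0.66, 0.594]
print(mixed)         # [0.801, 0.594]
```

In this global ordering the eigenvalue 0.66 is the fourth non-stationary one, which would be consistent with the label λ4 if the indices refer to the global model; this reading is our interpretation of the numbering.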

Besides the dynamical models, our iVAMPnet yields assignments between input features and subsystems. We find that the method correctly identifies the two state system as the x-axis and the three states as the y-axis feature, respectively (Fig. 3b).

10D hypercube benchmark model

In a next step we test the iVAMPnet approach with ten 2-state subsystems, which corresponds to 1024 global states (Fig. 4a, b). As before, the dynamics is generated by ten independent hidden Markov state models with unique timescales. The system is split into five pairs of subsystems, and the two coordinates governing the transition dynamics of each pair are rotated in order to make them more difficult to separate (Fig. 4a). Additionally, we make the learning problem harder by adding ten noise dimensions such that the global system lives on a 10-dimensional hypercube embedded in a 20-dimensional space.

Fig. 4: Hidden Markov state model with 1024 global states forming a 10D hypercube embedded in a 20D space.

a The hypercube is composed of ten independent 2-state subsystems. A pair of two subsystems always lives in a common rotated 2D-manifold. Therefore, two subsystems need the same input features to be well approximated. b 2D depiction of the hypercube in an orthographic projection89,90, where the global system can jump freely between all 1024 vertices, and the ten 2-state models retrieved from it by the iVAMPnet (colors denote subsystem identity). c Learned mask, depicted in gray-scale from 0 (white) to 1 (black), assigning inputs to subsystems (color-coded). It shows that for each subsystem, the network assigns two highly important input features which are shared with exactly one other subsystem, mirroring the rotated input space. Noise dimensions (x10-x19) are assigned low importance values. d Implied timescales as a function of the model lag time (both in arbitrary units, a.u.) of all ten subsystems learned by our method (dots) approximate the underlying true timescales (lines). Time scales are color-coded by index.

Although the subsystems are perfectly independent, we estimate an iVAMPnet with the VAMP-E score in a pairwise fashion, thereby avoiding the estimation of prohibitively large correlation matrices in \({{\mathbb{R}}}^{1024\times 1024}\). As this is only justified if all systems are independent, we additionally enforce Eq. (7) during training by minimizing Eq. (8) and thereby rule out that any two subsystems approximate the same process.
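The saving of the pairwise evaluation can be quantified with a short count of correlation matrix entries. The helper name `correlation_matrix_entries` is hypothetical, and the count considers a single correlation matrix per model (the same ratio applies to each of the three matrices).

```python
from math import comb

def correlation_matrix_entries(n_subsystems, n_states=2):
    """Entries of one global correlation matrix vs. all pairwise ones."""
    global_dim = n_states ** n_subsystems      # number of global states
    pair_dim = n_states ** 2                   # states of one subsystem pair
    n_pairs = comb(n_subsystems, 2)            # number of pairs i < j
    return global_dim ** 2, n_pairs * pair_dim ** 2

global_entries, pairwise_entries = correlation_matrix_entries(10)
print(global_entries, pairwise_entries)   # 1048576 720
```

For ten 2-state subsystems, the 45 pairwise matrices of size 4 × 4 together hold 720 entries, compared with roughly one million for a single 1024 × 1024 matrix.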

The iVAMPnet estimation yields subsystem models which, as common in MSM analysis, can be validated by testing whether their implied relaxation timescales are converged in the model lag time τ. We find that the implied timescales learned by the iVAMPnet are indeed converged and accurately reproduce the ground truth (Fig. 4d). We note that in addition to the timescales of the individual subsystems that are identified by the iVAMPnet, a global model would also contain all timescales that result from products of eigenvalues, resulting in a total of 1024 timescales. Thus, the iVAMPnet analysis provides a much simpler and more concise model than a global MSM or VAMPnet would.

Furthermore, the subsystem assignment mask indicates that the method correctly assigns high importance weight to two input features for each model (Fig. 4c). Therefore, the method proves its capability of decomposing a noisy, high dimensional global system into its independent sub-processes in a data efficient way.

We have generalized the 10-cube system to a variable number of subsystems (N-cube) to conduct a performance benchmark, finding that iVAMPnets outperform VAMPnets for this particular system. We note, however, that this result may not generalize to arbitrary systems, as the N-cube features truly independent 2-state subsystems (see Supplementary Note 6 for details).


Synaptotagmin-C2A

Finally, we test iVAMPnets on an all-atom protein system. In comparison to our benchmark examples, we expect the underlying global dynamics to be only approximately decomposable into independent subsystems. Our test data consist of 184 μs of aggregate MD data (92 trajectories of 2 μs each) of the C2A domain of synaptotagmin (Supplementary Note 7) that was described previously72; synaptotagmin plays a crucial role in the regulation of neurotransmitter release73. The domain was shown to consist of approximately uncoupled subsystems containing the calcium binding region (CBR) and the C78 loop, respectively64.

First, we attempted to model the protein with a global model, i.e., with a single (regular) VAMPnet. Indeed, this approach failed because there were not enough simulation statistics to estimate a reversibly connected transition model between all global metastable states, resulting in diverging implied timescales (Supplementary Note 3 and Supplementary Fig. 2). This is exactly the scenario where iVAMPnets should provide an advantage, by only relying on locally rather than globally converged transition statistics.

Next, we train an iVAMPnet to seek two subsystems of twelve and six states, respectively, each at a lag time of τ = 10 ns where we enforce constraint Eq. (7) to find uncoupled subsystems.

The trained iVAMPnet identifies one subsystem comprising all three CBR loops (CBR-1, CBR-2, CBR-3; Fig. 5a). The second subsystem consists not only of the aforementioned C78 loop but also of the loop connecting beta sheets 3 and 4 (ref. 74; termed C34 henceforth). When mapping the residue positions on the protein structure it becomes obvious that the two subsystems are physically well separated (Fig. 5a), supporting the conclusion that both regions are only weakly coupled64.

Fig. 5: iVAMPnet of synaptotagmin-C2A with two subsystems and twelve and six states, respectively.

a Importance values of the trainable mask depicted as color-coded protein secondary structure, indicating assignment to subsystem I (II) in green (blue). b Implied timescales of the two subsystems with a 90% percentile over 20 runs (dot markers denote means), color-coded by index. c Superposed representative structures of both extrema of the slowest resolved eigenfunctions of each subsystem (residues not assigned a high importance value or not showing significant movement are omitted for clarity). The slowest process of subsystem I changes between green and gray structures showing an orchestrated movement of the full Calcium Binding Region (CBR1, CBR2, and CBR3). The slowest process of subsystem II occurs between the blue and gray structures and describes a combined movement of the loops C78 and C34.

The implied timescales of both systems are approximately constant in the model lag time τ. Most timescales are in the range of 1–10 μs, with the exception of one much slower process with a 100 μs relaxation time found in the first subsystem (Fig. 5b), which has not been found previously. Analysis of the structural changes governing this process reveals that it involves an orchestrated transition of all CBR loops (Fig. 5c). Such a process could however not be resolved by the previous study72 where the CBR was modeled as individual loops. The process of the second system involves a simultaneous movement of the C78 and C34 loops (Fig. 5c).

iVAMPnets find metastable structures in the local features that are comparable to the ones described in our previous work72. Specifically, α-helices in two distinct positions and a state burying a methionine residue (Met173) can be found in the CBR1. In the adjacent CBR2 site, both tightly bound and loose configurations are identified, and the C78 site features all three previously described valine residue conformations (Val250, Val255). In addition to the features modeled in our preceding study72, iVAMPnets identify dynamics in a lysine rich cluster (Lys189-192) that was previously reported as important for membrane interaction75. See Supplementary Note 4 for a detailed view on the metastable states and exchange kinetics. In contrast to our previous work, the kinetic models in the local subsystems are more complex and incorporate a larger number of dynamic processes, providing a more comprehensive picture without the need to define a partitioning manually. In fact, conducting domain-decomposition and local kinetic modeling simultaneously has enabled the identification of very subtle dynamical features as long as they contribute significantly to the local VAMP-scores.

Although estimating a global VAMPnet model for synaptotagmin was not feasible given the sparse data sample, iVAMPnets use the same data efficiently and estimate a statistically valid dynamical model. This result is especially striking because the iVAMPnet approach also simplifies the subsequent task of interpreting models by separating dynamically independent protein domains.

Counterexample: folding of the villin miniprotein

Finally, we conducted an experiment on a villin protein folding trajectory of 125 μs length76 as a negative example (Supplementary Note 7). Small proteins such as villin are typically cooperative, i.e., the slowest processes related to folding involve all residues (Supplementary Note 5). Thus, these processes cannot be resolved when decomposing the system into several subsystems. Indeed, we find that a splitting into two subsystems with two states each results in timescales that are not converged, and whose relaxation processes approximate a partial folding on disjoint areas (cf. Supplementary Fig. 6).

Testing statistical independence of the learned dynamical subsystems

As constraint Eq. (7) was used as a penalty during training (as independence score Eq. (19)), we assess the validity of an estimated subsystem assignment by evaluating the constraints that were not enforced during training (Eq. (17)) as post-training independence scores MU, MV, and MUV (defined in Eq. (18)). Low values for MU and MV imply that the constructed left and right singular functions are indeed valid candidates for singular functions in the global state space. A small value for MUV indicates that the kinetics in the global state space is well predicted by the Kronecker product of subsystem models. We find that the three metrics are well suited to indicate independence of the learned subsystems (Table 1). Out of the tested systems only villin cannot be split into independent parts (all scores > 0.1). In comparison, the benchmark models and synaptotagmin can be decomposed into statistically uncoupled subsystems (all scores < 0.01). The slightly increased MR-value for synaptotagmin suggests that its subsystems might be weakly coupled.

Table 1 Post-training independence validation


We have proposed an unsupervised deep learning framework that, using only molecular dynamics simulation data, learns to decompose a complex molecular system into subsystems which behave as approximately independent Markov models. Thereby, iVAMPnet is an end-to-end learning framework that points a way out of the exponentially growing demand for simulation data that is required to sample increasingly large biomolecular complexes. Specifically, we have developed and demonstrated iVAMPnets for molecular dynamics, but the approach is, in principle, also applicable to different application areas, such as fluid dynamics. The specific implementation, such as the representation of the input vectors xt and the neural network architecture of the χ-functions, depends on the application and can be adapted as needed.

We now have a hierarchy of increasingly powerful models ranging from MSMs over VAMPnets to iVAMPnets. MSMs always consist of (1) a state space decomposition and (2) a Markovian transition matrix governing the dynamics between these states. VAMPnets provide a deep learning framework for MSMs, and thereby (3) learn the collective coordinates in which the state space discretization (1) is best made. iVAMPnets additionally learn (4) a physical separation of the molecular system into subsystems, each of which has its own slow coordinates, Markov states, and transition matrix.

We have demonstrated that iVAMPnets are a powerful multiscale learning method that succeeds in finding and modeling molecular subsystems when these subsystems indeed evolve statistically independently. Additionally, iVAMPnets are capable of learning from high-dimensional MD data. To prove that point, we have demonstrated that the synaptotagmin C2A domain is decomposable into two almost independent Markov state models. Importantly, we have shown that this dynamical decomposition of synaptotagmin C2A succeeds while an attempt to model the system with a global Markov state model fails due to poor sampling. This is a direct demonstration that iVAMPnets are statistically more efficient than VAMPnets, MSMs, or other global-state models and may indeed scale to much larger systems.

We note, however, that iVAMPnets do not learn how the subsystems are coupled, and are, therefore, in their current form, only applicable to molecular systems that consist of uncoupled or weakly coupled subsystems. Although most biomolecular complexes are known to be cooperative, there are examples that have been modeled very successfully using independent subsystems, such as the Hodgkin-Huxley model of voltage-gated channel proteins77,78. For other systems, the degree of coupling is a matter of debate, for example, the C2-tandem (C2A and C2B domains) in synaptotagmins79,80. Since isolated domains are known to conduct function by themselves in many cases, we believe that discarding couplings is a first-order modeling assumption that is suitable to identify these domains and their relevant metastable states.

Following up on ref. 63 by introducing coupling parameters that describe how the learned MSMs are coupled is subject to ongoing research. Furthermore, the weak-coupling assumption is made for the timescale of the investigated molecular processes and may not be generalizable to arbitrary times. For example, the degree of coupling between domains found in an MD simulation of a folded protein state may be very different in its unfolded state, which will eventually be encountered for a long enough simulation time.

Besides the usual hyperparameter choices in deep learning approaches, iVAMPnets require the specification of the number of sought subsystems. This choice can be guided by training an iVAMPnet for different numbers of subsystems and then interrogating the independence scores (Eqs. (18) and (19)) to choose a decomposition where statistical independence is optimal. We suggest starting with a decomposition into two subsystems and subsequently increasing this number. Non-optimal choices may, e.g., be reflected in non-converged implied timescales (possibly a manifestation of the sampling problem that may be mitigated by increasing the number of subsystems) or high independence scores (the system cannot be split because too many subsystems, or a non-optimal number of them, were chosen). The choice of the number of subsystems can additionally be guided by the number of structural domains in a protein (complex) or by using the network-based approach presented in ref. 64. Furthermore, the number of states in each subsystem needs to balance (a) the quality of the singular function approximation (higher for few states) and (b) model resolution (higher for more states). Ultimately, different choices may yield converged validation measures, and the number of states may be chosen to yield the desired model resolution in this case.

iVAMPnets can be improved and further developed in multiple ways, e.g., by employing more advanced network architectures such as graph neural networks, where parameters could be shared across subsystems. This might result in higher quality models and a greater robustness against the hyperparameter choice. Very recently, graph neural networks were indeed successfully combined with VAMPnets, showing that the resulting method (GraphVAMPnets) is applicable to MD data and that the estimated models are of high quality81.

In summary, iVAMPnets pave a possible path for modeling the kinetics of large biological systems in a data-efficient and interpretable manner.



Since an iVAMPnet implements multiple parallel VAMPnets representing the kinetics of separate independent subsystems, we will introduce VAMPnets first56. VAMPnets are multilayer perceptrons that represent feature functions χ (we omit the upper subsystem index i for the sake of clarity here). Their last layer is often chosen to be a SoftMax function, i.e., all outputs are non-negative and sum to 1. Therefore, the output of a VAMPnet can be interpreted as a fuzzy assignment to a metastable state. Taking the linear combination of states with equal weights results in the constant singular function with the singular value 1, which will be reflected by the singular values of the Koopman matrix (Eq. (10) with the normalized correlation matrix). Given the feature functions χ, we can compute the following correlation matrices:

$$\begin{aligned}
{\bf{C}}_{00} &=\frac{1}{L}\sum\limits_{t}{\boldsymbol{\chi}}({\bf{x}}_{t}){\boldsymbol{\chi}}({\bf{x}}_{t})^{T}\\
{\bf{C}}_{0\tau } &=\frac{1}{L}\sum\limits_{t}{\boldsymbol{\chi}}({\bf{x}}_{t}){\boldsymbol{\chi}}({\bf{x}}_{t+\tau })^{T}\\
{\bf{C}}_{\tau \tau } &=\frac{1}{L}\sum\limits_{t}{\boldsymbol{\chi}}({\bf{x}}_{t+\tau }){\boldsymbol{\chi}}({\bf{x}}_{t+\tau })^{T},
\end{aligned}$$

where L is the number of collected data pairs in the simulations.
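As an illustration, the three estimators above can be sketched in NumPy as follows. This is a minimal sketch and not the implementation used in this work; the function name `correlation_matrices`, the array layout (one row per time step) and the random toy features are assumptions made for the example.

```python
import numpy as np

def correlation_matrices(chi, tau):
    """Estimate C00, C0tau and Ctautau from features chi of shape (T, n).

    tau is the lag time in frames and must be >= 1.
    """
    chi_0, chi_t = chi[:-tau], chi[tau:]   # pairs (x_t, x_{t+tau})
    L = chi_0.shape[0]                     # number of data pairs
    C00 = chi_0.T @ chi_0 / L
    C0t = chi_0.T @ chi_t / L
    Ctt = chi_t.T @ chi_t / L
    return C00, C0t, Ctt

# Toy features (assumption): softmax outputs of random logits, 3 fuzzy states.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 3))
chi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

C00, C0t, Ctt = correlation_matrices(chi, tau=5)
```

Because each row of `chi` sums to one, the entries of `C0t` sum to one as well, which provides a quick sanity check on the estimator.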

Training VAMPnets or iVAMPnets involves the computation of covariance matrices over minibatches. We, therefore, need to choose the batch size to balance the large estimator variance obtained for small batches against the high memory requirements of large batches. Instead of using the trivial covariance estimator (Eq. (9)), which is asymptotically unbiased55 but has a high variance, one can employ a shrinkage estimator82,83, which reduces the overall estimator error by trading larger bias for lower variance. For the current study, we assume that our benchmark and MD data have been sufficiently sampled to yield adequate approximations with the estimator given in Eq. (9).

Approximations of the singular functions and values can be obtained via the singular value decomposition (SVD) of the following matrix \(\bar{\bf{K}}\):

$$\bar{\bf{K}}={\bf{C}}_{00}^{-1/2}{\bf{C}}_{0\tau }{\bf{C}}_{\tau \tau }^{-1/2}={\bf{A}}{\bf{K}}{\bf{B}}^{T}$$

K is the diagonal matrix of approximated singular values corresponding to the left and right singular functions:

$${\bf{f}}^{T}({\bf{x}}_{t})={\boldsymbol{\chi}}({\bf{x}}_{t})^{T}{\bf{U}}={\boldsymbol{\chi}}({\bf{x}}_{t})^{T}{\bf{C}}_{00}^{-1/2}{\bf{A}}$$
$${\bf{g}}^{T}({\bf{x}}_{t+\tau })={\boldsymbol{\chi}}({\bf{x}}_{t+\tau })^{T}{\bf{V}}={\boldsymbol{\chi}}({\bf{x}}_{t+\tau })^{T}{\bf{C}}_{\tau \tau }^{-1/2}{\bf{B}}.$$

The matrices U and V construct the left and right singular functions from the individual state assignments. The optimal state assignments can be found by maximizing the VAMP-E score:

$${\mathcal{R}}_{E}={\rm{tr}}\left[2{\bf{K}}{\bf{U}}^{T}{\bf{C}}_{0\tau }{\bf{V}}-{\bf{K}}{\bf{U}}^{T}{\bf{C}}_{00}{\bf{U}}{\bf{K}}{\bf{V}}^{T}{\bf{C}}_{\tau \tau }{\bf{V}}\right].$$
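The SVD-based construction of U, V and K and the VAMP-E score can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the training code of this work: the helper names `inv_sqrt` and `vamp_model`, the eigenvalue floor `eps` and the random toy features are hypothetical, and the score is evaluated on the same data that were used to build the model.

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """Symmetric inverse square root; eps (assumption) floors tiny eigenvalues."""
    w, v = np.linalg.eigh(C)
    return (v / np.sqrt(np.maximum(w, eps))) @ v.T

def vamp_model(chi_0, chi_t):
    """Return U, singular values, V and the VAMP-E score for feature pairs."""
    L = chi_0.shape[0]
    C00 = chi_0.T @ chi_0 / L
    C0t = chi_0.T @ chi_t / L
    Ctt = chi_t.T @ chi_t / L
    C00_is, Ctt_is = inv_sqrt(C00), inv_sqrt(Ctt)
    # K_bar = C00^{-1/2} C0t Ctt^{-1/2} = A K B^T
    A, s, Bt = np.linalg.svd(C00_is @ C0t @ Ctt_is)
    U, V, K = C00_is @ A, Ctt_is @ Bt.T, np.diag(s)
    score = np.trace(2 * K @ U.T @ C0t @ V - K @ U.T @ C00 @ U @ K @ V.T @ Ctt @ V)
    return U, s, V, score

# Toy features (assumption): softmax outputs of random logits, 3 fuzzy states.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2000, 3))
chi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
tau = 1
U, s, V, score = vamp_model(chi[:-tau], chi[tau:])
```

When the score is evaluated on the data used for estimation, U and V are orthonormal with respect to C00 and Cττ, so the score reduces to the sum of squared singular values. For features that sum to one, the leading singular value equals 1, in line with the constant singular function mentioned above.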

Given trained state assignments χ(xt) and correlation matrices Eq. (9), the Koopman matrix T can then be evaluated as:

$${\bf{T}}={\bf{C}}_{00}^{-1}{\bf{C}}_{0\tau }.$$

Furthermore, we can estimate the eigenfunctions φ and timescales ti of T from its eigendecomposition T = QΛQ−1:

$${\boldsymbol{\varphi}}({\bf{x}})={\bf{Q}}^{T}{\boldsymbol{\chi}}({\bf{x}}),$$
$${t}_{i}=\frac{-\tau }{\log (|{\Lambda }_{ii}|)}.$$

Please note that this operation is only possible if the eigendecomposition is (approximately) real-valued, a condition that is met for the presented application cases.
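The Koopman matrix and the implied timescales can be sketched in NumPy as follows. The sketch is illustrative only: the function names, the two-state toy chain with transition matrix `P` and the use of one-hot state assignments are assumptions for this example, and the stationary eigenvalue is skipped because its timescale diverges.

```python
import numpy as np

def koopman_matrix(chi, tau):
    """T = C00^{-1} C0tau; the 1/L normalization cancels."""
    chi_0, chi_t = chi[:-tau], chi[tau:]
    return np.linalg.solve(chi_0.T @ chi_0, chi_0.T @ chi_t)

def implied_timescales(T, tau):
    """Timescales t_i = -tau / log|lambda_i|, skipping the stationary process."""
    lam, Q = np.linalg.eig(T)
    order = np.argsort(-np.abs(lam))       # sort by decreasing magnitude
    lam, Q = lam[order], Q[:, order]
    ts = -tau / np.log(np.abs(lam[1:]))    # lam[0] is approximately 1
    return ts, lam, Q

def eigenfunctions(chi, Q):
    """Evaluate phi(x) = Q^T chi(x) for every row of chi."""
    return np.real_if_close(chi @ Q)

# Toy two-state Markov chain (assumption) with one-hot state assignments.
rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1], [0.2, 0.8]])
states = np.zeros(5000, dtype=int)
for t in range(1, len(states)):
    states[t] = int(rng.random() < P[states[t - 1], 1])
chi = np.eye(2)[states]

T_est = koopman_matrix(chi, tau=1)
ts_est, lam_est, Q_est = implied_timescales(T_est, tau=1)
phi = eigenfunctions(chi, Q_est)
```

For one-hot assignments, `T_est` is the row-normalized count matrix and therefore approximates `P`; its second eigenvalue approximates 0.7.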

Conditions for independent systems

For Markov independent systems, the singular values and functions that are constructed by the Kronecker product match the true global ones,

$$\begin{aligned}
(\hat{\bf{U}}^{G})^{T}{\bf{C}}_{00}^{G}\hat{\bf{U}}^{G} &={\bf{1}}\\
(\hat{\bf{V}}^{G})^{T}{\bf{C}}_{\tau \tau }^{G}\hat{\bf{V}}^{G} &={\bf{1}}\\
(\hat{\bf{U}}^{G})^{T}{\bf{C}}_{0\tau }^{G}\hat{\bf{V}}^{G} &=\hat{\bf{K}}^{G},
\end{aligned}$$

where the first two equations guarantee the orthonormality of the constructed singular functions. The latter verifies that the left and right singular functions correlate as predicted by the Kronecker product of the singular values. These conditions can be translated to the following scores:

$$\begin{aligned}
{M}_{U} &=|(\hat{\bf{U}}^{G})^{T}{\bf{C}}_{00}^{G}\hat{\bf{U}}^{G}-{\bf{1}}|\\
{M}_{V} &=|(\hat{\bf{V}}^{G})^{T}{\bf{C}}_{\tau \tau }^{G}\hat{\bf{V}}^{G}-{\bf{1}}|\\
{M}_{UV} &=|(\hat{\bf{U}}^{G})^{T}{\bf{C}}_{0\tau }^{G}\hat{\bf{V}}^{G}-\hat{\bf{K}}^{G}|
\end{aligned}$$

Furthermore, using the identities Eq. (17) and the definition of the VAMP-E score Eq. (13) yields

$${M}_{R}=\frac{|{\mathcal{R}}_{E}^{G}-\prod_{i}{\mathcal{R}}_{E}^{i}|}{{\mathcal{R}}_{E}^{G}}.$$

The norms denote simple means. The last score, MR, is enforced during training in a pairwise fashion (cf. Eq. (8)).
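For two subsystems, the post-training scores MU, MV and MUV can be sketched in NumPy as follows. This is a hedged illustration rather than the implementation used in this work: it assumes that the global features are the row-wise Kronecker product of the subsystem features, that the global matrices are the Kronecker products of the subsystem matrices, and that the norm is the mean absolute entry. All function names and the toy Markov chains are hypothetical.

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    w, v = np.linalg.eigh(C)
    return (v / np.sqrt(np.maximum(w, eps))) @ v.T

def covariances(a, b):
    L = a.shape[0]
    return a.T @ a / L, a.T @ b / L, b.T @ b / L

def singular_model(chi_0, chi_t):
    """Subsystem U, singular values and V from the SVD of K_bar."""
    C00, C0t, Ctt = covariances(chi_0, chi_t)
    C00_is, Ctt_is = inv_sqrt(C00), inv_sqrt(Ctt)
    A, s, Bt = np.linalg.svd(C00_is @ C0t @ Ctt_is)
    return C00_is @ A, s, Ctt_is @ Bt.T

def row_kron(a, b):
    """Row-wise Kronecker product: global features from two subsystems."""
    return np.einsum("ti,tj->tij", a, b).reshape(a.shape[0], -1)

def independence_scores(chi1, chi2, tau):
    U1, s1, V1 = singular_model(chi1[:-tau], chi1[tau:])
    U2, s2, V2 = singular_model(chi2[:-tau], chi2[tau:])
    UG, VG = np.kron(U1, U2), np.kron(V1, V2)
    KG = np.diag(np.kron(s1, s2))
    chiG = row_kron(chi1, chi2)
    C00, C0t, Ctt = covariances(chiG[:-tau], chiG[tau:])
    eye = np.eye(KG.shape[0])
    return {
        "M_U": np.mean(np.abs(UG.T @ C00 @ UG - eye)),
        "M_V": np.mean(np.abs(VG.T @ Ctt @ VG - eye)),
        "M_UV": np.mean(np.abs(UG.T @ C0t @ VG - KG)),
    }

def simulate(P, n, rng):
    """One-hot trajectory of a two-state Markov chain (toy data)."""
    s = np.zeros(n, dtype=int)
    for t in range(1, n):
        s[t] = int(rng.random() < P[s[t - 1], 1])
    return np.eye(2)[s]

rng = np.random.default_rng(0)
chi_a = simulate(np.array([[0.9, 0.1], [0.2, 0.8]]), 50000, rng)
chi_b = simulate(np.array([[0.95, 0.05], [0.1, 0.9]]), 50000, rng)

independent = independence_scores(chi_a, chi_b, tau=1)   # separate chains
coupled = independence_scores(chi_a, chi_a, tau=1)       # fully coupled copy
```

In this toy setting, the scores of the two independently simulated chains are close to zero, whereas treating a chain and its exact copy as two subsystems yields clearly larger values of MU and MV.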

Network architecture

Given a global system, which we want to decompose into N subsystems, and a time series of input features \({\{{\bf{x}}_{t}\}}_{t=1,\ldots ,T}\), \({\bf{x}}_{t}\in {\mathbb{R}}^{D\times 1}\), we pass the features through a mask \({\bf{G}}\in {\mathbb{R}}^{D\times N}\), which weights each input differently for each subsystem, before the results are transformed individually by the N independent state assignment functions ηi. It should be mentioned that the mask is merely introduced for interpretability reasons and is not essential to find independent subsystems. If the mask were omitted, the extraction of the relevant features would simply be transferred to the downstream neural networks, remaining hidden to the practitioner.

The weighted input is obtained by an element-wise multiplication \({\bar{\bf{Y}}}_{t}={\bf{G}}\odot {\bf{x}}_{t}\). To prevent the neural networks from reversing the weighting of the mask in their subsequent layers, we draw for each input feature i and subsystem j an independent, normally distributed random variable \({\epsilon }_{ij} \sim {\mathcal{N}}(0,\sigma (1-{G}_{ij}))\). This noise is added to the weighted features:

$${\bf{Y}}_{t}={\bar{\bf{Y}}}_{t}+{\boldsymbol{\epsilon}}.$$

Thereby, the attention weight linearly interpolates between input feature and Gaussian noise: if the attention weight Gij = 1, Yij carries exclusively the input feature xi; if Gij = 0, Yij is pure Gaussian noise. By tuning the noise scaling σ, a harder assignment by G can be enforced. This hyperparameter should be adjusted so that the resulting mask yields clear subsystem assignments without being binary. Subsequently, the transformed feature vector is split by subsystem, \({\bf{Y}}_{t}=[{\bf{Y}}_{t}^{1},\ldots ,{\bf{Y}}_{t}^{N}]\), and each part is passed through the subsystem-specific neural network ηi, resulting in feature transformations \({\boldsymbol{\chi}}^{i}({\bf{x}}_{t})={\boldsymbol{\eta}}^{i}({\bf{Y}}_{t}^{i})\). These features are then used to estimate the Koopman models.
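The masking step with additive noise can be sketched in NumPy as follows. This is a minimal sketch and not the training code of this work; the function name `apply_mask` is hypothetical, and σ(1 − Gij) is interpreted here as the standard deviation of the noise, which is an assumption because the notation above does not state whether it denotes a standard deviation or a variance.

```python
import numpy as np

def apply_mask(x, G, sigma, rng):
    """Weight features x (shape (D,) or (T, D)) by the mask G (shape (D, N))
    and add Gaussian noise whose scale grows as the mask weight shrinks."""
    Y_bar = x[..., None] * G                       # element-wise weighting
    # Assumption: sigma * (1 - G) is used as the standard deviation.
    eps = rng.normal(size=Y_bar.shape) * sigma * (1.0 - G)
    return Y_bar + eps                             # shape (..., D, N)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))                       # 10 frames, D = 4 features
G = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])                         # N = 2 subsystems
Y = apply_mask(x, G, sigma=1.0, rng=rng)
Y_sub = [Y[..., n] for n in range(G.shape[1])]     # per-subsystem inputs
```

Entries with Gij = 1 pass the feature unchanged, while entries with Gij = 0 contain only noise, so a downstream network cannot recover a feature that the mask has removed.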

The training framework and neural network architecture were implemented in the Python 3 programming language using NumPy84 and PyTorch69; benchmark system data was generated using DeepTime70; data visualization was performed using Matplotlib85 and VMD24.

Constructing the mask

To train an interpretable mask, we use the following three premises:

  1. A single subsystem should not focus on all input features.

  2. Different subsystems compete for high weights for the same feature.

  3. All weights should be in the range [0, 1] and the matrix should be sparse.

Therefore, the mask is constructed from trainable weights \({\bf{g}}\in {\mathbb{R}}^{D\times N}\), which are first processed by a softmax function that normalizes along the input feature axis, \({\bf{g}}_{1}={\rm{softmax}}({\bf{g}},\dim=0)\). Thereby, if a subsystem focuses on one part of the features, a lower weight for the other parts is expected, following the first premise.

In a next step, weights which are lower than a threshold θ are clipped to zero g2 = relu(g1 − θ) to guarantee sparsity. The threshold θ is a hyperparameter that can be optimized by starting with comparably small values (i.e., very little cutoff) and subsequently increasing it without further training—a reasonable cutoff does not alter the results in this case, as the downstream neural networks still obtain all relevant information.

Since input features could be negligible for all subsystems, a dummy system is added which has a constant value \({\bf{c}}\in {\mathbb{R}}^{D\times 1}\) for all features, g3 = [g2, c]. Subsequently, the weights of all subsystems and the dummy system are normalized for each feature, \({\bf{g}}_{4}={\bf{g}}_{3}/{\rm{sum}}({\bf{g}}_{3},\dim=1)\), which together with the clipping fulfills premises 2 and 3.

Finally, the mask G is obtained by truncating the dummy system from \({\bf{g}}_{4}=[{\bf{G}},\bar{\bf{c}}]\). Note that only g4 is normalized along the system axis, so the rows of G need not sum to 1.
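The four construction steps above (softmax, clipping, dummy system, normalization and truncation) can be sketched in NumPy as follows. The sketch is illustrative only: the function names are hypothetical, the constant c is taken to be a single positive scalar shared by all features, and the numerical values of θ and c are arbitrary choices for the example.

```python
import numpy as np

def softmax(g, axis):
    e = np.exp(g - g.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def build_mask(g, theta, c):
    """Construct the mask G (D, N) from trainable weights g (D, N)."""
    g1 = softmax(g, axis=0)                      # normalize along feature axis
    g2 = np.maximum(g1 - theta, 0.0)             # relu(g1 - theta): sparsity
    dummy = np.full((g.shape[0], 1), c)          # assumption: scalar c > 0
    g3 = np.concatenate([g2, dummy], axis=1)     # append dummy system
    g4 = g3 / g3.sum(axis=1, keepdims=True)      # normalize along system axis
    return g4[:, :-1]                            # truncate dummy system

# Toy weights: feature 0 favors subsystem 0, feature 1 favors subsystem 1,
# and feature 2 is irrelevant for both.
g = np.array([[5.0, 0.0],
              [0.0, 5.0],
              [0.0, 0.0]])
G = build_mask(g, theta=0.05, c=0.1)
```

In this toy example, the third feature is assigned entirely to the dummy system, so its row in G is zero and the rows of G do not all sum to 1.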

Application to protein dynamics

Since for proteins the final model is often expected to be invariant with respect to rotations and translations, internal coordinates are employed as input features. For Markov state modeling, the minimal heavy atom distance dij between residues i, j has proven to be a good descriptor56,86. However, for interpretability, mask weights for each residue are preferable. Therefore, the mask is of size \({\bf{G}}\in {\mathbb{R}}^{R\times N}\) with the number of residues R. The input features are then scaled as \({x}_{ij}={G}_{i}{G}_{j}\exp (-{d}_{ij})\).
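The residue-wise scaling of the distance features can be sketched in NumPy as follows. This is a hypothetical illustration: the function name `scaled_features`, the restriction to the upper triangle of residue pairs and the toy distance values (whose unit is not specified here) are assumptions made for the example.

```python
import numpy as np

def scaled_features(d, G):
    """Return x_ij = G_i G_j exp(-d_ij) for every subsystem.

    d: (R, R) symmetric matrix of minimal heavy atom distances.
    G: (R, N) residue mask. Output shape: (N, R * (R - 1) / 2).
    """
    R = d.shape[0]
    pair_weight = np.einsum("in,jn->nij", G, G)   # G_i * G_j per subsystem
    x = pair_weight * np.exp(-d)[None]
    i, j = np.triu_indices(R, k=1)                # assumption: pairs with i < j
    return x[:, i, j]

# Toy input: 3 residues, 2 subsystems.
d = np.array([[0.0, 0.4, 1.2],
              [0.4, 0.0, 0.7],
              [1.2, 0.7, 0.0]])
G = np.array([[1.0, 0.0],
              [0.8, 0.2],
              [0.0, 1.0]])
x = scaled_features(d, G)
```

A residue pair contributes to a subsystem only if both residues carry a nonzero mask weight in that subsystem.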

Furthermore, a smoothing routine is implemented such that neighboring residues along the chain have similar importance weights. W windows of size B are placed along the chain with step size s. Each window has a trainable weight \({\bf{g}}\in {\mathbb{R}}^{W\times N}\). The softmax function is then taken along the window axis, \(\bar{\bf{g}}={\rm{softmax}}({\bf{g}},\dim=0)\). However, before the clipping is applied as described above, the weight for each residue \({\bf{g}}_{1}\in {\mathbb{R}}^{R\times N}\) is calculated as the product of the weights of all windows the residue is part of (Fig. 6).

Fig. 6: Attention scheme for amino acid chain.

Windows of size B are placed along the chain with a step size of s, resulting in W windows. In each subsystem, every window is assigned a trainable weight \({\bf{g}}\in {\mathbb{R}}^{W\times N}\); these weights are made positive and normalized along the window axis through a softmax \(\bar{\bf{g}}={\rm{softmax}}({\bf{g}},\dim=0)\). Here, a window size of B = 4 and a step size of s = 2 are chosen. As a consequence, the weight of the amino acid glutamine (Q) is given as the product of the two windows it is part of, \({\bf{g}}_{1}(Q)={\bar{\bf{g}}}_{i}\odot {\bar{\bf{g}}}_{i+1}\), where the multiplication is executed element-wise for each subsystem. The choice of the step size determines how many neighboring amino acids have the exact same weight within a subsystem, which applies here for the tyrosine (Y). Together with the window size, it regulates how many residues share parts of their weights. Hence, the serine (S) shares the weight \({\bar{\bf{g}}}_{i+1}\) with the previous two amino acids, \({\bf{g}}_{1}(S)={\bar{\bf{g}}}_{i+1}{\bar{\bf{g}}}_{i+2}\), which has a smoothing effect on the attention mechanism along the chain.
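The window-based smoothing described above and illustrated in Fig. 6 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions and not the implementation used in this work: the function names are hypothetical, and the chain length is taken to be R = (W − 1)s + B so that the windows cover the chain exactly, which means that residues at the chain ends belong to fewer windows.

```python
import numpy as np

def softmax(g, axis):
    e = np.exp(g - g.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residue_weights(g, B, s):
    """Per-residue weights g1 (R, N) from window weights g (W, N).

    Each residue weight is the product of the normalized weights of all
    windows that contain the residue.
    """
    W, N = g.shape
    g_bar = softmax(g, axis=0)            # normalize along the window axis
    R = (W - 1) * s + B                   # assumption: exact chain coverage
    g1 = np.ones((R, N))
    for w in range(W):
        g1[w * s : w * s + B] *= g_bar[w]  # window w covers B residues
    return g1

# Window size B = 4 and step size s = 2 as in Fig. 6; W = 3 windows, N = 2.
rng = np.random.default_rng(0)
g = rng.normal(size=(3, 2))
g1 = residue_weights(g, B=4, s=2)
g_bar = softmax(g, axis=0)
```

With s = 2, consecutive pairs of residues share exactly the same weight, and neighboring pairs share one window factor, which produces the smoothing along the chain.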

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.