Main

Structure determination of biological macromolecules by single-particle analysis of cryogenic electron microscopy (cryo-EM) images is, at heart, a single-molecule imaging technique. Together, the many images of individual complexes in a cryo-EM dataset contain information about the full extent of molecular dynamics that existed in the sample when it was plunge frozen. However, the stringent low-dose imaging conditions that are necessary to limit radiation damage lead to high levels of experimental noise. Averaging over multiple individual images is thus necessary to extract detailed information about the underlying three-dimensional (3D) structures of the macromolecules. Because averaging projection images of distinct structures leads to blurring in the corresponding 3D reconstruction, image classification algorithms are often used to separate cryo-EM datasets into a user-defined number of structurally homogeneous subsets1. Despite their effectiveness in handling cryo-EM datasets with a discrete number of conformations, classification algorithms face challenges when continuous molecular motion is present in the sample. Therefore, continuous molecular motion in cryo-EM datasets is often considered a nuisance, rather than a rich source of information about protein dynamics.

Manifold embedding2 represented an early attempt to describe continuous molecular motions in cryo-EM datasets, although application of this approach has been limited to a few macromolecular complexes3,4. A more widely used approach to deal with continuously flexing complexes has been multi-body refinement5. Multi-body refinement divides complexes into independently moving rigid bodies through partial signal subtraction6,7,8. Independent image alignment and reconstruction for each of the individual bodies leads to better maps than a reconstruction of the entire complex that does not take the structural variability into account. A minimum size of the individual bodies, required for their alignment, limits the applicability of multi-body refinement to relatively large complexes. More recently, deep convolutional neural networks in the form of variational autoencoders (VAEs) have been proposed to map projection images into a continuous multi-dimensional latent space9,10,11. This mapping no longer assumes the presence of a discrete, user-defined number of structures in the data. Moreover, a corresponding decoder network can be used to reconstruct 3D structures for each point in latent space, allowing the creation of movies that describe 3D protein motions by traversing latent space. These approaches have proved useful in exploring continuous molecular motions. However, in contrast to multi-body refinement, most of them do not lead to improved reconstructed densities for the moving parts.

Two methods have been proposed that aim to analyze continuous molecular motions, while also improving the reconstructed density of the underlying consensus structure. 3D flexible refinement in cryoSPARC uses an autodecoder to learn deformations that are applied directly to the cryo-EM map12. A quasi-Newtonian optimization algorithm then uses the learned deformations to improve a reconstruction of the consensus structure. Alternatively, the Zernike3D approach expresses the deformation field of a cryo-EM map in a basis of 3D Zernike polynomials and uses Powell optimization to find the deformations for each individual particle image13. These deformations are then used in a modified algebraic reconstruction technique (ART) algorithm to obtain an improved reconstruction for the consensus structure.

In this study, we present an approach, coined DynaMight (for ‘exploring protein dynamics that might improve your map’). Inspired by the approach in e2gmm10, DynaMight uses Gaussian pseudo-atoms to model the cryo-EM density. The conformational variability in the cryo-EM dataset is estimated by a VAE, in which an encoder maps individual cryo-EM images to latent space and a decoder outputs 3D deformations of the Gaussian pseudo-atoms to infer the different conformational states. We introduce a decoder architecture that takes the latent vector alongside spatial coordinates as input and outputs actual displacements (Fig. 1). Compared to e2gmm10, given a latent representation, the decoder directly represents the function of interest, namely a deformation field. This makes it possible to impose prior knowledge directly on the deformation field in the form of regularization potentials, for which we explore both benefits and pitfalls. A modified filtered backprojection algorithm that back-projects individual particle images along curves derived from these deformations then yields an improved density map of the consensus structure.

Fig. 1: Schematic illustration of DynaMight.

Two separate encoders take experimental images from each half set as input, and output a latent vector describing their conformational state. The decoders take the latent vectors together with the coordinates of Gaussian models for the consensus structures for each half set and generate a 3D deformation field for those Gaussians. The deformed models are then projected and compared to the experimental image in the loss function. At the end of the procedure, an approximation to the inverse deformation is used for reconstruction of an improved consensus map for each half set.

Results

Description of conformational variability

We describe the ith of Nd particle images, yi, with the following forward model:

$${y}_{i}={{{{\mathcal{C}}}}}_{i}* {{{{\bf{P}}}}}_{{\phi }_{i}\,}f({{{\varGamma }}}_{\!{z}_{i}}({{{\bf{x}}}})),$$
(1)

where \({{{{\mathcal{C}}}}}_{i} \ast\) denotes convolution with the contrast transfer function (CTF) and \({{{{\bf{P}}}}}_{{\phi }_{i}}\) the projection of a particle that is rotated and shifted by its pose ϕi ∈ SE(3). We choose to represent the function f by a sum of Ng 3D Gaussian basis functions, or pseudo-atoms:

$$f({{{\bf{x}}}})\approx \hat{f}({{{\bf{x}}}}):=\sum\limits_{j=1}^{{N}_{\mathrm{g}}}{a}_{j}{{{{\mathcal{G}}}}}_{{s}_{j}}({{{\bf{x}}}}-{c}_{j})$$
(2)

where \({{{{\mathcal{G}}}}}_{{s}_{j}}:{{\mathbb{R}}}^{3}\to {\mathbb{R}}\) is \({{{{\mathcal{G}}}}}_{{s}_{j}}({{{\bf{x}}}})=\exp \left(-\parallel {{{\bf{x}}}}{\parallel }^{2}/{s}_{j}\right)\). Here, aj > 0 denote the amplitudes, sj > 0 the widths and cj the central positions of the Gaussian functions.
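As a concrete illustration of this representation, the minimal sketch below (in PyTorch, the framework in which DynaMight is implemented) evaluates a sum of Gaussian pseudo-atoms on a Cartesian grid. The function name, grid size and parameter values are illustrative only and are not taken from the DynaMight code.

```python
import torch

def evaluate_gaussian_model(grid, centers, amplitudes, widths):
    """Evaluate a sum of isotropic Gaussian pseudo-atoms at grid points.

    grid:       (M, 3) Cartesian coordinates at which to evaluate the density
    centers:    (Ng, 3) pseudo-atom positions c_j
    amplitudes: (Ng,)   amplitudes a_j > 0
    widths:     (Ng,)   widths s_j > 0, entering as exp(-||x - c_j||^2 / s_j)
    returns:    (M,)    density values
    """
    # Squared distances between every grid point and every Gaussian center.
    d2 = torch.cdist(grid, centers) ** 2                 # (M, Ng)
    basis = torch.exp(-d2 / widths.unsqueeze(0))         # (M, Ng)
    return basis @ amplitudes                            # (M,)

# Toy usage: three pseudo-atoms evaluated on a small 3D grid.
ax = torch.linspace(-1.0, 1.0, 16)
grid = torch.stack(torch.meshgrid(ax, ax, ax, indexing="ij"), dim=-1).reshape(-1, 3)
centers = torch.tensor([[0.0, 0.0, 0.0], [0.3, 0.1, -0.2], [-0.4, 0.2, 0.0]])
density = evaluate_gaussian_model(grid, centers, torch.ones(3),
                                  torch.full((3,), 0.05)).reshape(16, 16, 16)
```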

We assume that all particle images are conformational variations of a single, consensus structure that is described by the Ng 3D Gaussian basis functions and zi in equation (1) is the conformational encoding for the ith image. We describe the deformation of individual particles as a deviation from the consensus coordinates x: Γ(x) = x − δ(x), so that:

$$\begin{array}{ll}\hat{f}({{\varGamma }}({{{\bf{x}}}}))&=\sum\limits_{j=1}^{{N}_{\mathrm{g}}}{a}_{j}{{{{\mathcal{G}}}}}_{{s}_{j}}\big({{\varGamma }}({{{\bf{x}}}})-{c}_{j}\big)\\ &=\sum\limits_{j=1}^{{N}_{\mathrm{g}}}{a}_{j}{{{{\mathcal{G}}}}}_{{s}_{j}}\big({{{\bf{x}}}}-\delta ({{{\bf{x}}}})-{c}_{j}\big)\\ &\approx \sum\limits_{j=1}^{{N}_{\mathrm{g}}}{a}_{j}{{{{\mathcal{G}}}}}_{{s}_{j}}\left({{{\bf{x}}}}-\left({c}_{j}+\delta ({c}_{j})\right)\right),\end{array}$$
(3)

where the last approximation assumes that the deformation field is locally constant and that the density surrounding cj moves in a similar manner. This enables us to describe the deformations as displacements of the Gaussian centers, which is a computationally tractable representation. Furthermore, the widths sj and amplitudes aj of all Gaussian pseudo-atoms are kept the same for the entire dataset. This means that DynaMight is by design constrained to only model mass-conserving heterogeneity and cannot handle nonstoichiometric mixtures. Therefore, compositional heterogeneity should be removed from the dataset by alternative approaches before running DynaMight.

Estimation of conformational variability

For learning the deformations, we use a VAE that consists of two neural networks, namely an encoder \({{{\mathcal{E}}}}\) that predicts an l-dimensional latent representation zi per particle image, and a decoder \({{{\mathcal{D}}}}\) that predicts the displacement of all Gaussian pseudo-atoms in the model. The encoder is a fully connected neural network with three linear layers and rectified linear unit activation functions. The input is a (real-space) experimental image yi and the outputs are two vectors \(({\mu }_{i},{\sigma }_{i})\in {{\mathbb{R}}}^{{N}_{l}}\times {{\mathbb{R}}}^{{N}_{l}}\), which describe the mean and standard deviation used to generate a sample zi that serves as input for the decoder.
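A minimal sketch of such an encoder is shown below. The hidden width, the log-variance head and the reparameterization step are illustrative assumptions; only the overall layout (three linear layers with rectified linear unit activations, outputs μ and σ) follows the description above.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Fully connected encoder: flattened real-space image -> (z, mu, sigma)."""

    def __init__(self, n_pixels: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        # Three linear layers with ReLU activations, as described in the text;
        # the hidden width of 256 and the two output heads are illustrative.
        self.backbone = nn.Sequential(
            nn.Linear(n_pixels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, images: torch.Tensor):
        h = self.backbone(images.flatten(start_dim=1))
        mu = self.to_mu(h)
        sigma = torch.exp(0.5 * self.to_logvar(h))
        # Reparameterization: sample the latent vector passed to the decoder.
        z = mu + sigma * torch.randn_like(sigma)
        return z, mu, sigma

# Toy usage on a batch of four 128 x 128 "images".
encoder = Encoder(n_pixels=128 * 128, latent_dim=10)
z, mu, sigma = encoder(torch.randn(4, 128, 128))
```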

The decoder \({{{\mathcal{D}}}}({z}_{i},{c}_{j})\) then approximates the term cj + δ(cj) for each zi. We define the decoder for the entire set of Ng positions as:

$${{{\mathcal{D}}}}({z}_{i},{{{{\bf{c}}}}}^{{{{\bf{0}}}}})={{{{\bf{c}}}}}^{{{{\bf{0}}}}}+{\delta }_{\theta }({z}_{i},{{{{\bf{c}}}}}^{{{{\bf{0}}}}})$$
(4)

In the above, c0 denotes the full set of consensus positions and δθ is a differentiable function, \({\delta }_{\theta }({z}_{i},{{{{\bf{c}}}}}^{{{{\bf{0}}}}})=[{\delta }_{\theta }({z}_{i},{c}_{1}),\ldots ,{\delta }_{\theta }({z}_{i},{c}_{{N}_{g}})]\), with parameters θ, that approximates δ at each position (Extended Data Fig. 1). In practice, we evaluate the decoder for each position cj and query δθ with a positional encoding of cj, concatenated with the latent representation zi that describes the conformation of each particle.
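The sketch below illustrates this decoder design: the latent vector is broadcast to every pseudo-atom, concatenated with its coordinates and mapped to a 3D displacement that is added back to the consensus position (equation (4)). The layer count and hidden width are placeholders, and raw coordinates are used here instead of the positional encoding employed in practice (see Methods).

```python
import torch
import torch.nn as nn

class DeformationDecoder(nn.Module):
    """Coordinate-based decoder: (latent z, position c) -> deformed position."""

    def __init__(self, latent_dim: int, hidden: int = 128):
        super().__init__()
        # Raw 3D coordinates are used for brevity; DynaMight concatenates a
        # positional encoding of the coordinates instead (see Methods).
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, z: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # z: (latent_dim,), positions: (Ng, 3); broadcast z to every pseudo-atom.
        z_rep = z.unsqueeze(0).expand(positions.shape[0], -1)
        delta = self.mlp(torch.cat([z_rep, positions], dim=-1))
        # Equation (4): D(z, c0) = c0 + delta_theta(z, c0).
        return positions + delta

# Toy usage: deform 500 pseudo-atoms for one latent vector.
decoder = DeformationDecoder(latent_dim=10)
deformed = decoder(torch.randn(10), torch.randn(500, 3))
```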

The output positions are used to generate a projection image pi of the deformed model in the pose of the particle, and the difference with the experimental image \(\parallel {p}_{i}-{y}_{i}{\parallel }_{{{\Sigma }}}^{2}\) is minimized during training of the neural networks. Once trained, for a latent embedding of the whole dataset, one obtains a family of deformation fields \({{{\mathcal{D}}}}({z}_{i},\mathbf{x} )\approx {{{\varGamma }}}_{{z}_{i}}(\mathbf{x} )\) that is defined over the entire 3D space.

Regularization and model bias

Because of high levels of experimental noise, cryo-EM reconstruction is an ill-posed problem. Even for standard, structurally homogeneous refinement, there are many possible rotational and translational assignments for each image. When estimating conformational variability, the poses are known, but many deformed density maps may explain each experimental image equally well. Therefore, in both cases regularization is essential for robust reconstruction.

The most common form of regularization in VAEs is to constrain the distribution of latent variables to follow a Gaussian distribution, which leads to the model learning more meaningful and structured representations. The design of the decoder in Fig. 1 allows an additional form of regularization that imposes prior knowledge on its output of real-space deformation fields. A wide range of physically and biologically inspired penalties can be incorporated as priors on the deformations (see also refs. 12,14,15). A potentially powerful source of prior information would be an atomic model of the consensus structure, which could provide constraints on chemical bonds, maintain secondary structure elements and so on.

To explore direct regularization of the deformation fields, we tested two approaches. The first approach aims to use prior information from an atomic model that is built in the consensus map, before running DynaMight. It generates a coarse-grained Gaussian representation of the atom positions, and then minimizes changes in the distances between these Gaussians according to the bonds that exist in the atomic model:

$${{{\mathcal{R}}}}(E\,)=\sum\limits_{\{(i,\,j):{E}_{ij}=1\}}{\big| d({c}_{i},{c}_{j})-d({{{\mathcal{D}}}}({c}_{i},z),{{{\mathcal{D}}}}({c}_{j},z))\big|} ^{2},$$
(5)

where Eij = 1 if there is a bond between the two pseudo-atoms ci and cj and d denotes the Euclidean distance. The deformations estimated with this regularization scheme result in Gaussians that remain close to a coarse-grained representation of the original atomic model.

The second regularization approach uses less prior information and does not require an atomic model. Instead, Gaussians are placed randomly to fill densities in the consensus map, and connections E in equation (5) are made for all pairs of Gaussians that are within a distance of 1.5 times the average distance between all Gaussians and their two nearest neighbors. This regularization enforces overall smoothness in the deformations. Additional penalties that prevent Gaussians from coming too close to each other, or from moving too far away from other Gaussians, are also applied to ensure a physically plausible distribution of Gaussians.

Improved 3D reconstruction

We propose an algorithm that uses the estimated deformation fields Γ to obtain an improved reconstruction of the consensus structure that incorporates information from all experimental images. To map back individual particle images to a hypothetical consensus state, one needs to estimate the inverse deformations, which represents a challenge. Whereas the inverse deformation on the displaced Gaussians is given by the negative displacement vector, that is Γ−1(Γ(ci)) = ci, the inverse deformation field needs to be inferred at all Cartesian grid positions of the improved reconstruction. We train a neural network as a regression function to estimate a deformation field that coincides with the inverse deformation at the given sampling points Γ(ci), but can be evaluated at arbitrary positions. This network consists of a multilayer perceptron with six layers and a single additive residual connection to the original coordinates of the consensus model c0. Similar to the forward deformation model, the network takes the latent code zi and the deformed positions Γ(ci) as inputs and aims to output the original positions ci. In addition to the inversion of the forward fields on the sampling points, we force the inverse field to be smooth by adding a regularization term to the loss function.

The algorithm aims to improve the reconstruction of the density f, using the known deformations Γ, that is we aim to find the minimizer \(\hat{f}\) of the data fidelity

$${\hat{f}} = \mathop{{\rm{argmin}}}\limits_{f} \sum\limits_i\|{\mathcal{C}}_i(P_{{\mathrm{\Gamma}}_i}\,f\,) -y_i\|^2.$$
(6)

This minimizer can be computed using the reconstruction formula

$$\begin{array}{rlr}\hat{f}&={\left[\sum\limits_{i}{P}_{{{{\Gamma }}}_{i}}^{* }\circ {{{{\mathcal{C}}}}}_{i}^{2}\circ {P}_{{{{\Gamma }}}_{i}}\right]}^{-1}\left[\sum\limits_{i}{P}_{{{{\Gamma }}}_{i}}^{* }({{{{\mathcal{C}}}}}_{i}^{* }{y}_{i})\right]&\\ &={D}^{-1}\left[\sum\limits_{i}{P}_{{{{\Gamma }}}_{i}}^{* }({{{{\mathcal{C}}}}}_{i}^{* }{y}_{i})\right],\end{array}$$
(7)

to get an estimate of the unknown density f. Here D is a matrix that depends on the estimated deformations, and \({P}_{{{{\Gamma }}}_{i}}^{* }\) is the composition of the backprojection operator and the inverse deformation corresponding to the ith particle (Fig. 1). For the structurally homogeneous case, Γ is the identity operator and D is diagonal in Fourier space, so the inverse can be computed simply by division, provided that the distribution of projection directions covers the whole frequency domain and D has no zeros on the diagonal. In the presence of deformations, this matrix is no longer diagonal and would be too expensive to compute or store. We approximate equation (7) by using the filter that would correspond to the homogeneous case, without deformations. Although even in the optimal scenario of complete, clean projection data this method does not yield a minimum of the functional in equation (6), it still allows correction for the deformation to some degree. When the deformation fields are not smooth, for example when two nearby domains move in opposite directions, reconstruction with the proposed algorithm may introduce artifacts at the interface between the domains.

Implementation details

The initial positions of the Gaussians for the VAE are obtained by approximating a map from a consensus refinement with a Gaussian model. This initial consensus map does not correspond to an actual state of the complex, but rather to a mixture of different conformations. Therefore, parts of the map will have regions of poorly defined density, and correspondingly fewer Gaussians. To overcome this limitation, we update the positions of the consensus Gaussian model throughout the estimation of the deformations, such that the positions cj may correspond to a single conformation at the end of the iterative process. We recommend using two Gaussians per residue, but a smaller number can be chosen if computational resources are limited or if only a low-resolution estimate of the motion is required.

After initialization of the Gaussians, in the first epochs of the training of the VAE, we only optimize the global Gaussian parameters, that is, their widths, amplitudes and positions. These parameters are optimized with the ADAM optimizer and a learning rate of 0.0001. After this initial warm-up phase, we start optimization of the network parameters of the VAEs, again using the ADAM optimizer with a learning rate of 0.0001. During the second phase, the parameters of the Gaussians continue to be updated. Training of the VAEs is stopped when the updates of the consensus model no longer yield improvements or when a fixed, user-defined number of epochs has been completed.
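This two-phase schedule can be pictured as two separate ADAM optimizers sharing the same loss, with the network optimizer only stepping after the warm-up. The sketch below uses toy stand-ins for the Gaussian parameters, the VAE and the loss; none of the names correspond to the actual DynaMight code.

```python
import torch
import torch.nn as nn

# Toy stand-ins: in DynaMight the first group would be the Gaussian model
# parameters (positions, widths, amplitudes) and the second the VAE weights.
gaussian_positions = torch.randn(1000, 3, requires_grad=True)
vae = nn.Sequential(nn.Linear(10, 64), nn.ELU(), nn.Linear(64, 3))

opt_gaussians = torch.optim.Adam([gaussian_positions], lr=1e-4)
opt_networks = torch.optim.Adam(vae.parameters(), lr=1e-4)

warmup_epochs = 10            # during warm-up only the Gaussian model is fitted
for epoch in range(30):
    z = torch.randn(16, 10)   # placeholder latent batch
    # Placeholder objective; the real loss compares projections of the deformed
    # Gaussian model with experimental images.
    loss = (gaussian_positions.mean() - vae(z).mean()) ** 2
    opt_gaussians.zero_grad()
    opt_networks.zero_grad()
    loss.backward()
    opt_gaussians.step()                      # Gaussian parameters always update
    if epoch >= warmup_epochs:
        opt_networks.step()                   # network weights only after warm-up
```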

Training of the VAE is performed on two half sets, where two encoder–decoder pairs are trained independently, as illustrated in Fig. 1. This procedure yields two independent families of deformation fields, one for each half set. The approximate inverses of these deformations are then used by the deformed weighted backprojection algorithm to generate two independent maps with improved estimates for the consensus structure. These half-maps can then be used in conventional postprocessing and resolution estimation routines. As described in the ‘Discussion’ section, by setting aside a small validation set of images, the two independent decoders also allow an error estimation of the displacement fields.

DynaMight has been implemented in pyTorch16, and is accessible as a separate job type from the RELION-5 graphical user interface. Because, as we will show below, the direct regularization of the deformation fields using atomic models may lead to overfitting, only the approach that enforces smoothness on the deformations, without the use of an atomic model, is exposed to the user on the graphical user interface. DynaMight uses the Napari viewer17 to visualize the distribution of particles in latent space, as well as the corresponding deformation fields. The same viewer also allows real-time generation of densities from points in latent space, movie generation, and the selection of particle subsets in latent space.

Further implementation details are given in the Methods.

Regularization can lead to model bias

We first analyzed the different options for regularization of the deformations on a well-characterized dataset of the yeast Saccharomyces cerevisiae precatalytic B complex spliceosome18 (EMPIAR-10180, ref. 19). The same data, or subsets of it, have also been analyzed using multi-body refinement5, cryoDRGN9, Zernike3D13 and e2gmm10. To minimize computational costs and to ensure structural homogeneity9, we used 3D classification in RELION20 to select ~45,000 particles with reasonable density for the head region. Training of the VAEs on this subset with a box size of 320 took about 2.5 minutes per epoch on a single NVIDIA A100 GPU. This resulted in training times between 8 and 12 hours for estimating the deformations. Further estimation of the inverse deformations took ~4 hours and reconstruction with the deformed backprojection ~3 hours on the same GPU.

Without any regularization of the deformations, estimated deformation fields displayed rapidly changing directions for neighboring Gaussians, and deformed backprojection yielded reconstructions for which the local resolution did not improve with respect to the original consensus reconstruction (Fig. 2a,b). A consensus reconstruction with better local resolutions was obtained using the regularization scheme that enforces smoothness in the deformations, but without using an atomic model (Fig. 2c). The map with the highest local resolutions was obtained using the regularization scheme that enforces distances between bonded atoms of an atomic model (Protein Data Bank (PDB) ID 5nrl) (Fig. 2d). It thus appeared that incorporation of prior knowledge from the atomic model into the VAE had been beneficial.

Fig. 2: DynaMight reconstructions of the spliceosome subset.

a, Standard RELION consensus refinement. b, DynaMight without regularization. c, DynaMight with smoothness regularization on the Gaussians. d, DynaMight with regularization from an atomic model. All maps are colored according to local resolution, as indicated by the color bar.

However, because the neural networks in our approach comprise many parameters, we were worried that there would be scope for ‘Einstein-from-noise’ artifacts, similar to those described for orientational assignments in single-particle analysis21,22,23. To test this, we performed two control experiments.

In the first control experiment, we replaced the atomic model of the U2 3′ domain/SF3a domain with a different protein domain of similar size (PDB 7YUY, ref. 24). The U2 3′ domain/SF3a showed only weak density in the consensus map, indicating large amounts of structural heterogeneity in this region. Although using the incorrect atomic model to estimate the deformation fields led to a similar improvement in local resolution compared to using the correct model (Fig. 3a,b), the reconstructed density from the deformed backprojection resembled the incorrect model, rather than the correct model (Fig. 3c and Supplementary Video 1).

Fig. 3: Using incorrect atomic models in DynaMight.

a, Reconstruction after deformed backprojection using the correct atomic model for the SF3a region, colored by local resolution (right). The correct atomic model for the SF3a region is shown in green on the top left; an overlay of that model with the reconstructed density after deformed backprojection is shown on the bottom left. b, As in a, but using an incorrect atomic model for the SF3a region (shown in red). c, Fourier shell correlation (FSC) curves between the maps in a or b, masked around the SF3a region, and the correct (green) or incorrect (red) atomic models. d–f, As in a–c, but using different atomic models for the SF3b region: reconstruction after deformed backprojection using the correct atomic model for the SF3b region (d), using an incorrect atomic model for the SF3b region (e) and FSC curves between maps in d or e, masked around the SF3b region, and the correct or incorrect atomic models (f).

In the second control experiment, we replaced the atomic model of the SF3b domain with PDB 1G88 (ref. 25). The density for the SF3b domain in the consensus map was stronger than the density for the SF3a region, indicating that this region in the spliceosome is less flexible. In this case, using the incorrect atomic model yielded a map with lower local resolutions in the SF3b region than using the correct model (Fig. 3d,e). But still, the reconstructed density from the deformed backprojection resembled the incorrect model more than the correct model (Fig. 3f and Supplementary Video 2).

These results indicate that estimation of deformation fields may lead to model bias, to the extent that reconstructed density may reproduce features of an incorrect atomic model. The scope for model bias to affect the deformed backprojection reconstruction is larger in regions of the map with higher levels of structural heterogeneity. Because it would be difficult to distinguish correct atomic models from incorrect ones, we caution against the use of this type of regularization in DynaMight. Therefore, in what follows, we only used the less informative, smoothness prior on the deformations. Using this prior, the deformations estimated by DynaMight are qualitatively similar to those observed for the same dataset using e2gmm10 (Extended Data Fig. 2 and Supplementary Video 3). For a different dataset (EMPIAR-10073, on the U4/U6.U5 tri-snRNP complex26), using the less informative smoothness prior in DynaMight led to an improved reconstruction with better map features and higher local-resolution estimates than reported for 3DFlex12 (Extended Data Fig. 3 and Supplementary Video 4), even though 3D classification in RELION-5 selected a structurally homogeneous subset of only 86,624 particles, compared to the 102,500 particles used for 3DFlex.

DynaMight improves inner kinetochore maps

Next, we demonstrate the usefulness of DynaMight on two cryo-EM datasets of the yeast inner kinetochore27. Training of the VAEs took 17 and 27 hours on an NVIDIA A100 GPU for the two respective datasets described below, with particle box sizes of 320 and 360. Estimating the inverse deformations took ~6 hours for both datasets. The deformed reconstructions took 9 and 13 hours, respectively.

The first dataset (EMPIAR-11910) comprises 100,311 particles of the monomeric constitutive centromere-associated network complex bound to a CENP-A nucleosome (CCAN–CENP-A). For this dataset, we trained the half-set VAEs for 220 epochs and we used a ten-dimensional latent space. The estimated 3D deformations are distributed uniformly in latent space (Fig. 4a), without specifically clustered conformational states, suggesting that the motions in the dataset are mainly of a continuous nature. Analysis of the motions revealed that the nucleosome rotates in different directions relative to the rest of the complex, and that these rotations coexist with up and down bending of the Nkp1, Nkp2, CENP-Q and CENP-U subunits (arrows in Fig. 4b and Supplementary Video 5). The reconstruction from deformed backprojection improved local resolutions compared to the consensus map from standard RELION refinement, with clear improvements in the features for both protein and DNA (Fig. 4c,d and Extended Data Fig. 4).

Fig. 4: DynaMight results for the CCAN–CENP-A complex.

a, Principal components analysis (PCA) of the conformational latent space, with colored dots indicating the positions of the five maps in b. (Only the latent space for one of the two half sets is shown.) b, Five conformational states of the complex. One state, in red, is shown in all four panels. The colors of the five maps are the same as the colors of their corresponding dots in a. c, Reconstructions from standard RELION consensus refinement. d, The improved reconstruction using DynaMight. The maps in c and d are colored according to local resolution, as indicated by the color bar.

The second dataset (EMPIAR-11890) comprises 108,672 particles of the complete yeast inner kinetochore complex assembled onto the CENP-A nucleosome. Training of the VAE was done for 290 epochs, and the dimensionality of the latent space was again set to ten. Again, a continuous distribution of deformations in latent space suggests continuous structural flexibility (Fig. 5a). Analysis of the deformations revealed large relative motions between different regions of the complex (root-mean-square deviations and additional details are given in Supplementary Table 1). Different states of the complex are depicted in Fig. 5a and Supplementary Video 6. Deformed backprojection resulted in a map with improved local resolution and protein and DNA features compared to the map from consensus refinement (Fig. 5b,c and Extended Data Fig. 4).

Fig. 5: DynaMight results for the complete kinetochore complex.

a, PCA of the conformational space (on the left) with highlighted positions of five conformational states, the maps of which are shown in the same colors on the right. (Only the latent space for one of the two half sets is shown.) b, Maps from standard RELION consensus refinement. c, DynaMight reconstruction. d, Reconstruction using RELION multi-body refinement. The outlined regions in the latter show the four bodies that were used for multi-body refinement. The maps in b–d are colored by local resolution, as indicated by the color bar.

Because this complex, with a molecular weight of 1.5 MDa, is large enough to divide into multiple independently moving rigid bodies, we also applied multi-body refinement5 to this dataset. We used the four bodies illustrated in Fig. 5d: body 1 (orange): CCAN\({}^{{{{\rm{Topo}}}}}\); body 2 (light green): \({\rm{CCAN}}^{{{{\rm{Non}}}}-{{{{\rm{topo}}}}}_{\Delta }{{{\rm{CENP}}}}-{{{\rm{I}}}}({{{\rm{Body}}}})}\); body 3 (yellow): \({\rm{CBF}}3^{{{{\rm{Core}}}}}\)+CENP-I\({}^{{{{\rm{Body}}}}}\); and body 4 (dark green): CENP-A\({}^{{{{\rm{Nuc}}}}}\). The local resolutions resulting from multi-body refinement (Fig. 5d) are better than those from the deformed backprojection reconstruction of DynaMight, illustrating that there is still room for further development of the latter. Nevertheless, the DynaMight map had better protein and nucleic acid features than a map obtained for the same dataset with 3DFlex, using default parameters12 (Extended Data Fig. 5). The DynaMight map also correlated better than the map from 3DFlex with atomic models that were built in the maps from multi-body refinement. Despite these observations, resolution estimates calculated from the 3DFlex half-maps were higher than those calculated from the DynaMight half-maps. This suggests that using a single 3D deformation model in 3DFlex, rather than two separate models as done in DynaMight, could potentially result in over-estimation of local resolution.

Discussion

How to deal with continuous conformational heterogeneity remains a rapidly developing topic in cryo-EM single-particle analysis. As outlined in the main text, and recently reviewed in ref. 28, multiple approaches from different laboratories have been proposed. In this paper we present an approach, called DynaMight, which consists of two VAEs that are trained independently on half sets to estimate displacements of a Gaussian model, and a modified weighted backprojection algorithm to correct for the estimated deformations. To avoid deformations being described by the disappearance of Gaussians in one place and the appearance of Gaussians in another, and to limit the number of model parameters, DynaMight does not refine an occupancy factor for each Gaussian. Consequently, DynaMight cannot model compositional heterogeneity and it is unclear how it will perform on datasets with such heterogeneity. Compositional heterogeneity should thus be removed using existing discrete classification methods1 before the application of DynaMight. We show for two datasets on the yeast inner kinetochore that DynaMight is useful in improving cryo-EM maps of macromolecular complexes that exhibit large amounts of flexibility, although scope remains for further improvements, both of DynaMight in particular and of methods to deal with continuous structural heterogeneity in general.

Because of the high levels of experimental noise and the large number of parameters needed to describe continuous structural flexibility in the particles, an obvious way to improve these methods is the incorporation of prior knowledge. However, our results on the spliceosomal B complex show that such approaches are not without risk. We observe that there are enough parameters in DynaMight’s neural networks to result in deformation fields that, when used in deformed backprojection, will reproduce incorrect features from the consensus model that is used to regularize these deformations. That model bias may play a role is perhaps not surprising, given that similar observations have been made for standard (structurally homogeneous) refinement, where only five parameters (three rotations and two translations) are used for every particle. The total number of parameters in DynaMight’s VAE is approximately 10 million, which results in a considerably higher number of parameters per particle for typical datasets. We do not believe that the risk of overfitting exists only in DynaMight. Other approaches that describe structural heterogeneity in the dataset with large neural networks, or other approaches with high numbers of parameters per particle, such as cryoDRGN9, Zernike3D13 and 3DFlex12, will probably also be susceptible to these problems. The development of validation procedures will thus be important. In DynaMight, we chose not to expose the usage of atomic models for regularization of the deformations to the user, as potential model bias toward those models takes away the possibility of validating the map by the appearance of protein-like features. The exploration of more sophisticated methods, where part of the information of atomic models is used and other parts are set aside for validation, may yield better methods, while still allowing proper validation.

Because model bias may affect the estimation of deformation fields, over-estimation of the resolution of reconstructions that correct for these deformations may represent another pitfall. Resolutions are typically measured by Fourier shell correlation between two half sets. However, if deformations have been estimated jointly for both half sets, with the same reference map as origin, then incorrect features from the reference model may be reproduced in both half-reconstructions, resulting in inflated Fourier shell correlation curves and over-estimation of resolution. Our results with the yeast kinetochore complex (Extended Data Fig. 5) indicate that 3DFlex12 may suffer from such over-estimation of resolution. By training two independent VAEs with separate consensus models for both half sets, similar to ‘gold-standard’ approaches in standard refinement29,30, this risk is avoided in DynaMight.

Training two VAEs independently on two half sets of the data also offers an opportunity to estimate the uncertainty in the estimated deformations. Although in recent years multiple methods have been proposed to analyze molecular motions in cryo-EM datasets, less consideration has been given to what extent these motions can be trusted. Error estimates on the deformations can be obtained for a subset of the particles (we used 10% in Fig. 6), by excluding this subset from the training of the decoders and only using it for training its embedding to latent space. Each particle in this subset is passed through both encoders to obtain a latent representation for each of the corresponding decoders. Applying both decoders to get the displacements of either of the consensus models then leads to two independent estimates of the deformations for the particles in the subset. The difference between these two estimates provides an estimate of the errors in them. We illustrate this procedure in Fig. 6b, where we observe that the errors in the deformations vary among particles and among different regions of the CCAN–CENP-A complex. Future developments in regularization methods as described above may benefit from considering estimated errors in the deformations.
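The sketch below illustrates this error estimate for a single validation particle: the displacement fields predicted by the two half-set decoders are compared per pseudo-atom. For simplicity, both decoders are applied to the same consensus positions here, whereas DynaMight evaluates the displacements on either of the two half-set consensus models; all names are illustrative.

```python
import torch

def deformation_error(decoder_half1, decoder_half2, z_half1, z_half2, consensus):
    """Per-Gaussian disagreement between the two half-set deformation models.

    decoder_half1/2: callables mapping (latent, positions) -> deformed positions
    z_half1/2:       latent codes of the same validation particle from each encoder
    consensus:       (Ng, 3) consensus pseudo-atom positions
    returns:         (Ng,) norm of the difference between the two displacements
    """
    with torch.no_grad():
        d1 = decoder_half1(z_half1, consensus) - consensus   # displacement, half set 1
        d2 = decoder_half2(z_half2, consensus) - consensus   # displacement, half set 2
    return torch.linalg.norm(d1 - d2, dim=-1)

# Toy usage with placeholder "decoders" that ignore the latent code.
consensus = torch.randn(100, 3)
dec1 = lambda z, c: c + 0.1 * torch.randn_like(c)
dec2 = lambda z, c: c + 0.1 * torch.randn_like(c)
errors = deformation_error(dec1, dec2, torch.zeros(10), torch.zeros(10), consensus)
```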

Fig. 6: Error estimation for the deformations.

a, Particles of a validation subset (here 10% of the particles) are fed into both encoders. The encoders are updated, whereas these images are not used for training the decoder. At evaluation time, both decoders can be evaluated for the consensus models (purple for the consensus model of half set 1 and blue for the consensus model of half set 2). The resulting displacements can be compared. b, Example deformation fields for four particles. The radius of the sphere (colored by size from blue to pink) at the end of the deformations (black arrows) is determined by the norm of the difference of the deformations from the two decoders.

Besides estimation of deformations, DynaMight also implements a reconstruction algorithm that aims to correct for the deformations through the reconstruction of an improved consensus map. Reconstruction via equation (7) only gives an approximation of the minimizer of the convex problem in equation (6). Although it is therefore not guaranteed to yield a useful solution, in practice we observe that DynaMight results in maps with improved local resolutions compared to the standard RELION reconstruction algorithm that assumes structural homogeneity. The improvements in the reconstructed maps provide some level of validation of the estimated deformation fields. Nevertheless, our observations that multi-body refinement yields better local resolutions for the complete inner kinetochore complex suggest that there is room for further improvement. It is possible that iterative real-space methods, such as those implemented in 3DFlex12 or Zernike3D13, may yield better results. But the iterative approaches would be even more computationally expensive than our weighted backprojection approach, as they may require multiple sweeps through the data and optimization of hyperparameters, such as the step size. Alternatively, the results with multi-body refinement suggest that it may be possible to divide each particle into many smaller ‘bodies’, and to insert Fourier slices of each of these bodies using orientations that are a combination of the consensus orientation and the average deformation field at that region.

Although opportunities for further improvements exist, we believe that the current implementation of DynaMight will already be useful. Unlike multi-body refinement, there is no need for the design of masks that delineate the bodies. In fact, analysis of deformations estimated by DynaMight may assist users to define those masks for subsequent multi-body refinements. The implementation inside RELION-5 will make DynaMight easily accessible to many users, and its wider application will provide feedback for future developments of even better tools to analyze molecular motions in biological macromolecules. The unresolved challenges, as explored in this paper, of how to exploit more prior knowledge while preventing the pitfalls of model bias, and of how to validate the estimated deformations, imply that this topic will remain an active area of research.

Methods

Initialization of the reference model

We model the 3D cryo-EM density map \(f:{{\mathbb{R}}}^{3}\to {\mathbb{R}}\) by a sum of Ng Gaussian functions. The density f is defined by

$$f({{{\bf{x}}}})=\sum\limits_{j=1}^{{N}_{\mathrm{c}}}\left(\sum\limits_{i=1}^{{N}_{\mathrm{g}}}{d}_{j,i}{a}_{j}\exp \left(-\frac{\parallel {{{\bf{x}}}}-{c}_{i}{\parallel }^{2}}{{s}_{j}}\right)\right).$$
(8)

Here Nc is a fixed number defining how many distinct widths are used in the Gaussian model. For the ith Gaussian the vector d–,i satisfies \(\sum\nolimits_{j=1}^{{N}_{\mathrm{c}}}{d}_{j,i}=1\) and dj,i ≥ 0 for all j {1, …, Nc}. This weight vector continuously classifies the type of Gaussian that is selected for a certain position of the Gaussian model. Although we used Nc = 1 in all our results, using more classes could be helpful for cases where the consensus map contains large variations in local resolution and the same width of all Gaussians does not give a reasonable representation of the map. The learnable parameters in this model are the widths \(({s}_{1},\ldots ,{s}_{{N}_{\mathrm{c}}})\), composition vectors d and amplitudes \(({a}_{1},\ldots ,{a}_{{N}_{\mathrm{c}}})\). These parameters are optimized globally, meaning that they are independent of the projection image, and stay the same over the whole dataset. Whereas a per-Gaussian amplitude parameter would be possible and would enable the representation of compositional heterogeneity, we decided to use the same amplitudes for all Gaussians. The reason for this is that otherwise movement could also be represented by Gaussian densities vanishing and reappearing at different places. We call the parameters (a, s, d, c) of the Gaussian model the reference parameters and we use a separate optimizer (ADAM) to update them. The total number of reference parameters is Ng × 3 + Nc × (Ng + 2). For our experiments, we used only one class of Gaussians, resulting in Ng × 3 + 2 parameters. The consensus model serves as the starting point for the decoder that predicts how every Gaussian in the model moves to explain the corresponding experimental image.

In the recommended way of running DynaMight, the initial reference map, that is, the reconstruction from the consensus refinement, is thresholded and randomly filled with Ng Gaussians that are within the region of the map exceeding this threshold. The threshold should be chosen such that density in the flexible regions remains, but no noise is visible in the solvent region. The parameters a and s are initialized to reasonable numbers such that the norm of the Gaussian model equals the norm of the consensus reconstruction and the classification weights are initialized randomly. Once the reference parameters are initialized, we optimize the reference model using gradient descent (that is, without any networks), minimizing the mean squared error to the experimental images.
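A minimal sketch of this initialization, assuming the consensus map is available as a 3D array, is given below; the sub-voxel jitter and the coordinate convention are illustrative choices and do not reproduce the actual DynaMight code.

```python
import torch

def initialize_positions(volume: torch.Tensor, threshold: float, n_gaussians: int,
                         pixel_size: float = 1.0) -> torch.Tensor:
    """Place pseudo-atoms at randomly chosen voxels above a density threshold.

    volume:    (D, D, D) consensus reconstruction
    threshold: density threshold separating molecule from solvent
    returns:   (n_gaussians, 3) coordinates in Angstrom, centered on the box
    """
    mask = volume > threshold
    candidates = torch.nonzero(mask).float()                  # voxel indices inside the mask
    idx = torch.randint(0, candidates.shape[0], (n_gaussians,))
    positions = candidates[idx]
    # Add sub-voxel jitter and convert to centered physical coordinates.
    positions = positions + torch.rand_like(positions) - 0.5
    return (positions - volume.shape[0] / 2) * pixel_size

# Toy usage on a random "map".
vol = torch.rand(64, 64, 64)
centers = initialize_positions(vol, threshold=0.9, n_gaussians=500, pixel_size=1.2)
```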

Alternatively, Gaussians may be initialized from the positions of an atomic model that is rigid-body fitted into the consensus map. For our experiments with atomic models for the spliceosome dataset, we used the deposited atomic model (PDB 5nrl). Instead of using one Gaussian per atom, we coarse-grained the atomic models. For every amino acid we used one main-chain Gaussian that was located at the barycenter of the N, C and O atoms. Subsequent main-chain Gaussians were connected by an edge in the graph used for regularization. The number of Gaussians used to represent the side chains varied for different amino acids. We placed one additional Gaussian at the barycenter of the side-chain atoms at the α, β and γ positions of all amino acids, except for ‘PRO’, where we took the barycenter of atoms at the α, β, γ and δ positions, and for ‘SER’, ‘CYS’, ‘ALA’, ‘GLY’, ‘VAL’ and ‘THR’, where we placed a Gaussian at the β position. For larger amino acids, we placed additional side-chain Gaussians at the barycenter of the remaining side-chain atoms, except for ‘TYR’ and ‘TRP’, where we used two additional Gaussians. Subsequent Gaussians from the side chains were connected to each other and then to the corresponding main-chain Gaussian with edges for the regularization functional. The amplitudes of the Gaussians were chosen to be proportional to the combined atomic number of all (nonhydrogen) atoms grouped together for the corresponding Gaussian. For nucleic acids we used four Gaussians: one at the phosphate position and one at the barycenter of the sugar, which together form the main chain of the nucleic acid, and two Gaussians at the bases. Again, the amplitudes were set to be proportional to the combined atomic number within each group.
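As a small illustration of the coarse graining, the sketch below computes one main-chain pseudo-atom for a residue: its center is the barycenter (here, the unweighted mean) of the backbone N, C and O atoms, and its amplitude is proportional to their combined atomic number. Parsing of the atomic model itself is omitted and the coordinates are placeholders.

```python
import torch

ATOMIC_NUMBER = {"N": 7, "C": 6, "O": 8}

def main_chain_pseudo_atom(atom_names, coords):
    """Barycenter and amplitude for one main-chain Gaussian of a residue.

    atom_names: element symbols of the backbone N, C and O atoms
    coords:     (n_atoms, 3) tensor of their coordinates
    returns:    (center, amplitude), with the amplitude proportional to the
                summed atomic number of the grouped (nonhydrogen) atoms
    """
    center = coords.mean(dim=0)
    amplitude = float(sum(ATOMIC_NUMBER[a] for a in atom_names))
    return center, amplitude

# Toy usage with placeholder backbone coordinates of one residue.
center, amplitude = main_chain_pseudo_atom(
    ["N", "C", "O"],
    torch.tensor([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.1, 1.1, 0.0]]))
```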

The VAE

A VAE estimates displacements of the Gaussians from the reference model. An encoder learns an embedding to a low-dimensional latent space that describes the conformational landscape of the dataset. The decoder estimates a deformation, given a point in that latent space and a position in the 3D reference.

The input to the encoder is a flattened (real-space) experimental image yi and the outputs are two vectors \(({\mu }_{i},{\sigma }_{i})\in {{\mathbb{R}}}^{{N}_{l}}\times {{\mathbb{R}}}^{{N}_{l}}\), which describe the mean and standard deviation used to generate a sample, which serves as an input for the decoder. The encoder is a fully connected neural network with three linear layers and rectified linear unit activation functions. To optimize the weights of the encoder we used the ADAM optimizer with a learning rate of 0.001. We tried alternative encoder architectures using residual connections, more linear layers and convolutional neural networks, but did not observe relevant improvements in performance. Even when substituting the input images with a different unique signal (we used a random vector per image), the deformations were no worse. We conclude that the encoder does not effectively use the information that is present in the images, suggesting that one could optimize the latent representation itself via an autodecoder12.

The decoder is at the heart of our approach. Given a conformational representation, it estimates a deformation for the corresponding particle image. It takes the latent representation zi and a spatial position, and outputs the displacement that is predicted at this spatial position. During training, the positions where the decoder is evaluated are the Gaussian positions in the reference model. Compared to ref. 10, we use a coordinate-based network that takes the spatial position itself as an input. To augment the 3D coordinates, we use positional encoding with ten encoding dimensions, which has been shown to resolve higher-resolution information in coordinate-based networks31. We use sine and cosine functions for lifting the 3D position to a higher-dimensional space as described in ref. 32. We observed that without the positional encoding of the input coordinate the deformations are too smooth and localized motion is not captured well. The use of a coordinate-based network results in a network that approximates a deformation field that can be evaluated at any position in \({{\mathbb{R}}}^{3}\).
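A common form of such a positional encoding, sketched below, lifts each coordinate with sines and cosines at ten frequencies; the exact frequency scaling used in DynaMight may differ from the powers of two assumed here.

```python
import torch

def positional_encoding(x: torch.Tensor, n_freqs: int = 10) -> torch.Tensor:
    """Sine/cosine lifting of 3D coordinates to a higher-dimensional space.

    x:       (..., 3) coordinates, assumed roughly within [-1, 1]
    returns: (..., 3 * 2 * n_freqs) encoded coordinates
    """
    freqs = 2.0 ** torch.arange(n_freqs, dtype=x.dtype, device=x.device)
    angles = x.unsqueeze(-1) * freqs                      # (..., 3, n_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(-2)                                # concatenate per coordinate

encoded = positional_encoding(torch.randn(100, 3))        # -> shape (100, 60)
```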

The decoder itself is a fully connected network δ with exponential linear unit (ELU) activation functions and an additive residual connection (Extended Data Fig. 1). We use eight linear layers to obtain for a given spatial position \({{{\bf{x}}}}\in {{\mathbb{R}}}^{3}\) the deformed position:

$$\begin{array}{r}{{{\mathcal{D}}}}({z}_{i},{{{\bf{x}}}})={{{\bf{x}}}}+\delta ({z}_{i},{{{\bf{x}}}}).\end{array}$$
(9)

In the training phase, we evaluate the decoder for all the positions c0 in the reference model. We then model the forward operator of cryo-EM by projecting the center points of the deformed Gaussian reference model using the orientation of the particle, resulting in 2D coordinates ξi. These coordinates are then placed into an (oversampled) 2D grid using bilinear interpolation. Then we compute the 2D Fourier transform, approximating the Fourier transform of the sum of deltas. Subsequently, we multiply the resulting Fourier-space image with the Gaussian basis function Gs and the CTF \({{{{\mathcal{C}}}}}_{i}\) resulting in the projection image gi of the deformed Gaussian model

$${g}_{i}\approx {{{\mathcal{F}}}}\left(\sum\limits_{j=1}^{{N}_{\mathrm{g}}}a{\delta }_{{\xi }_{i}^{j}}\right)\cdot {{{{\mathcal{C}}}}}_{i}\cdot {G}_{\mathrm{s}}.$$
(10)

If more than one type of Gaussian exists, the same operation is repeated for all types and weighted by the class assignment vector d. The resulting reference projection image gi is then compared to the experimental image, using a mean squared error as the loss function (see below).
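The sketch below follows equation (10) in simplified form: the deformed centers are rotated and projected to 2D, splatted onto a grid (nearest-voxel here, rather than the bilinear interpolation onto an oversampled grid used in practice), Fourier transformed, and multiplied by the CTF and an isotropic Gaussian envelope standing in for the Fourier transform of the basis function. The envelope form, parameter names and class weighting (omitted) are illustrative assumptions.

```python
import torch

def project_deformed_model(deformed, rotation, amplitude, width_pixels, ctf, box):
    """Simplified sketch of equation (10): project deformed centers and filter.

    deformed:     (Ng, 3) deformed Gaussian centers, in pixels, centered at 0
    rotation:     (3, 3) rotation matrix of the particle pose
    amplitude:    scalar Gaussian amplitude a
    width_pixels: width controlling the Fourier-space Gaussian envelope
    ctf:          (box, box) CTF evaluated on the (fftshifted) 2D Fourier grid
    returns:      (box, box) real-space projection of the deformed model
    """
    # Rotate and keep the first two coordinates: 2D positions xi of the deltas.
    xy = (deformed @ rotation.T)[:, :2] + box / 2

    # Nearest-voxel splatting of delta functions.
    grid = torch.zeros(box, box)
    ij = xy.round().long().clamp(0, box - 1)
    grid.index_put_((ij[:, 1], ij[:, 0]),
                    torch.full((xy.shape[0],), float(amplitude)), accumulate=True)

    # Fourier transform of the sum of deltas, multiplied by the CTF and a
    # Gaussian envelope (the Fourier transform of the basis function).
    fourier = torch.fft.fftshift(torch.fft.fft2(grid))
    k = torch.fft.fftshift(torch.fft.fftfreq(box)).to(grid.dtype)
    k2 = k[:, None] ** 2 + k[None, :] ** 2
    envelope = torch.exp(-(torch.pi * width_pixels) ** 2 * k2)
    fourier = fourier * ctf * envelope
    return torch.fft.ifft2(torch.fft.ifftshift(fourier)).real

# Toy usage with an all-pass "CTF".
proj = project_deformed_model(torch.randn(200, 3) * 20, torch.eye(3), 1.0, 2.0,
                              torch.ones(128, 128), box=128)
```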

Training

After initialization of the Gaussians in the consensus reconstruction, during the first epochs (that is, sweeps over the two half sets for both models) of training we only optimize the Gaussian parameters, that is, their widths, amplitudes and positions. After this initial phase, we also start optimizing the network parameters of the two independent VAEs, which are initially assigned random values. Both phases of training use the ADAM optimizer with a learning rate of 0.0001.

To get physically meaningful deformations, the reference model itself should lie within the distribution of all the conformations estimated by the decoder, rather than being a nonexisting average of conformations (as the reconstruction from the consensus refinement is). To achieve this, we apply two heuristic strategies that gradually improve the reference model. First, after every 30 epochs, we fix the encoder and decoder for five epochs and only adjust the Gaussian parameters. Second, at every tenth epoch where the decoder is not fixed, we replace the positions of the Gaussians of the reference model by the predicted Gaussian positions with the smallest displacement from the current reference model. The latter ensures that the reference model is in the distribution of deformed models. Without this replacement strategy, we observed that the reference model can move out of distribution, sometimes even to a point where the structure is completely distorted. As long as the deformations satisfy the regularization constraints, this should not change the value of the loss function, but we observed that it can lead to unphysical displacements of the Gaussians and suboptimal reconstructions. To also ensure that the reference models of the two independent half sets are in the same conformation, we generate a binary mask around the Gaussian positions of one half set and substitute the Gaussian positions of the other half set with the average over 100 predictions for which the number of Gaussians inside this mask is the highest. The binary mask covers all voxels that have a Gaussian within a distance of 6 Å from the voxel center. Fourier shell correlations of the Gaussian model to the consensus reconstruction and of the final Gaussian model to the final reconstruction are displayed in Supplementary Fig. 1.

Training is stopped when the updates of the consensus model no longer yield improvements in the data loss mean squared error (MSE; see below). More specifically, we stop training when the MSE loss has increased for the kth time. In our experiments we used the default value of k = 40.

Loss functions and regularization

Denoting by gi the reference projection image generated by the current VAE, the main loss function is the data loss, which for a batch \({{{\mathcal{B}}}}:={({g}_{i},{y}_{i})}_{i\in {{{\bf{B}}}}}\) is computed in Fourier space as

$$\begin{array}{r}{{{\mathcal{F}}}}({{{\mathcal{B}}}}):=\frac{1}{| {{{\mathcal{B}}}}| }\sum\limits_{i\in {{{\mathcal{B}}}}}\parallel {g}_{i}-{y}_{i}{\parallel }_{{{\varSigma }}}^{2},\end{array}$$
(11)

where the resolution-dependent noise weights Σ are estimated by the radially averaged power of the error on a subset of particle images.
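A sketch of this weighted data loss is given below, interpreting ‖·‖²_Σ as a per-frequency-shell weighting by the inverse of the estimated noise power; how the shells are indexed and how Σ is estimated in DynaMight may differ from this toy construction.

```python
import torch

def weighted_data_loss(g_fourier, y_fourier, sigma2_radial, freq_radius):
    """Fourier-space MSE with resolution-dependent noise weights (equation (11)).

    g_fourier, y_fourier: (B, box, box) Fourier transforms of model and data images
    sigma2_radial:        (n_shells,) radially averaged noise power per shell
    freq_radius:          (box, box) integer shell index of every Fourier pixel
    """
    weights = 1.0 / sigma2_radial[freq_radius]          # broadcast shell weights to 2D
    residual2 = (g_fourier - y_fourier).abs() ** 2
    return (weights * residual2).sum(dim=(-2, -1)).mean()

# Toy usage.
box, shells = 64, 33
freq = torch.fft.fftfreq(box)
radius = torch.sqrt(freq[:, None] ** 2 + freq[None, :] ** 2)
freq_radius = (radius * box).round().long().clamp(max=shells - 1)
loss = weighted_data_loss(torch.randn(4, box, box, dtype=torch.complex64),
                          torch.randn(4, box, box, dtype=torch.complex64),
                          torch.ones(shells), freq_radius)
```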

Auxiliary losses are used to regularize the deformations of the Gaussian model. In the recommended way of running DynaMight, a graph is constructed by connecting Gaussians that are within a certain distance with edges. The set of edges is defined by

$$\begin{array}{r}{E}_{ij}=\left\{\begin{array}{ll}1\quad \quad &\parallel {c}_{i}-{c}_{j}\parallel < 1.5\,{c}_{{{{\rm{mean}}}}},\\ 0\quad &{{{\rm{else.}}}}\end{array}\right.\end{array}$$
(12)

Here cmean is the mean distance in the graph Fij, which is created by connecting every point to its two nearest neighbors. These graphs are recalculated from the reference model after every epoch. For the deformation of the kth image Γk the following regularization functional then preserves distances after displacement, enforcing local isometry:

$${{{{\mathcal{R}}}}}_{d}({{{\varGamma }}}_{k})=\sum\limits_{\{(i,\,j):{E}_{ij}=1\}}{\left\vert \parallel {c}_{i}-{c}_{j}\parallel -\parallel {{{\varGamma }}}_{k}({c}_{i})-{{{\varGamma }}}_{k}({c}_{j})\parallel \right\vert }^{2},$$
(13)

Additionally, we use a repulsion loss penalizing Gaussians that are too close to each other

$${{{{\mathcal{R}}}}}_{r}({{{\varGamma }}}_{k})=\sum\limits_{\{(i,\,j):{E}_{ij}=1\}}{\chi }_{\parallel {{{\varGamma }}}_{k}({c}_{i})-{{{\varGamma }}}_{k}({c}_{j})\parallel < \tau }{(\parallel {{{\varGamma }}}_{k}({c}_{i})-{{{\varGamma }}}_{k}({c}_{j})\parallel -\tau )}^{2},$$
(14)

where \({\chi }_{\parallel {{\varGamma }}({c}_{i})-{{\varGamma }}({c}_{j})\parallel < \tau }\) = 1 if the distance between neighboring Gaussians is less than τ. We set τ to cmean for all our results.
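The following sketch implements the graph construction of equation (12) and the two regularization terms of equations (13) and (14) for a single deformed model; the tensor names and toy parameters are illustrative.

```python
import torch

def build_edges(centers: torch.Tensor) -> torch.Tensor:
    """Equation (12): connect pairs closer than 1.5 x the mean 2-NN distance."""
    n = centers.shape[0]
    dist = torch.cdist(centers, centers)
    dist.fill_diagonal_(float("inf"))
    two_nn, _ = dist.topk(2, dim=1, largest=False)     # two nearest neighbors
    c_mean = two_nn.mean()
    iu = torch.triu_indices(n, n, offset=1)            # each pair counted once
    keep = dist[iu[0], iu[1]] < 1.5 * c_mean
    return torch.stack([iu[0][keep], iu[1][keep]], dim=1)

def regularization(centers, deformed, edges, tau):
    """Equations (13) and (14): local isometry plus short-range repulsion."""
    i, j = edges[:, 0], edges[:, 1]
    d0 = torch.linalg.norm(centers[i] - centers[j], dim=-1)    # reference distances
    d1 = torch.linalg.norm(deformed[i] - deformed[j], dim=-1)  # deformed distances
    isometry = ((d0 - d1) ** 2).sum()
    repulsion = (torch.clamp(tau - d1, min=0.0) ** 2).sum()    # active only where d1 < tau
    return isometry + repulsion

# Toy usage.
c = torch.randn(300, 3)
edges = build_edges(c)
loss = regularization(c, c + 0.05 * torch.randn_like(c), edges, tau=0.3)
```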

The total loss function is then given by

$${{{\mathcal{L}}}}({{{\mathcal{B}}}})={{{\mathcal{F}}}}({{{\mathcal{B}}}})+\lambda \frac{1}{| {{{\mathcal{B}}}}| }\sum\limits_{i\in {{{\mathcal{B}}}}}\left[{{{{\mathcal{R}}}}}_{d}({{{\varGamma }}}_{i})+{{{{\mathcal{R}}}}}_{r}({{{\varGamma }}}_{i})\right]=:{{{\mathcal{F}}}}({{{\mathcal{B}}}})+\lambda {{{\mathcal{R}}}}({{{\mathcal{B}}}}).$$

The parameter λ is a dynamic regularization parameter that is recalculated after every epoch. We do this by calculating the norms of the gradients of the two loss terms \({{{\mathcal{F}}}}\) and \({{{\mathcal{R}}}}\) and defining λ such that the ratio of these norms equals a user-defined number. When this number is set to 1, the gradient norms of both terms are equal. For all our results we set this value to 0.9, which gives slightly more influence to the data term \({{{\mathcal{F}}}}\).
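The sketch below illustrates one way to compute such a dynamic λ from the gradient norms of the two loss terms; the choice of parameters over which the norms are taken, and the exact update schedule, are assumptions made for the example.

```python
import torch

def update_lambda(data_loss, reg_loss, params, target_ratio=0.9):
    """Set lambda so that lambda * ||grad(reg)|| = target_ratio * ||grad(data)||.

    data_loss, reg_loss: scalar tensors computed on the current batch
    params:              iterable of tensors that both losses depend on
    """
    params = list(params)
    g_data = torch.autograd.grad(data_loss, params, retain_graph=True,
                                 allow_unused=True)
    g_reg = torch.autograd.grad(reg_loss, params, retain_graph=True,
                                allow_unused=True)

    def grad_norm(grads):
        return torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None))

    # With target_ratio = 0.9 the scaled regularization gradient is slightly
    # weaker than the data gradient, giving the data term more influence.
    return (target_ratio * grad_norm(g_data) / grad_norm(g_reg)).item()

# Toy usage.
w = torch.randn(5, requires_grad=True)
lam = update_lambda((w ** 2).sum(), w.abs().sum(), [w])
```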

For the results where we used the coarse-grained atomic model as a reference, we used the same data loss function \({{{\mathcal{F}}}}\) (equation (11)), but in contrast to the heuristic method described above to construct the edges between the Gaussians, the graph E is obtained from the coarse graining of the atomic model. The regularization that preserves distances is applied in the same way (equation (13)), with the fixed graph from the coarse graining. The second regularization functional (equation (14)) is not used in this case, since the distances in the reference model are fixed.

Improved reconstruction

To calculate an improved reconstruction from the estimated deformations, we use a network \({{{{\mathcal{D}}}}}^{-1}\) with the same architecture as the decoder to estimate a deformation field that maps back a deformed position to its original location. Again this network is coordinate-based and can be evaluated on an arbitrary position \({{{\bf{x}}}}\in {{\mathbb{R}}}^{3}\). Given the latent representation of each particle we train the neural network \({{{{\mathcal{D}}}}}^{-1}\) to map back the positions predicted by the trained VAE to the positions of the reference model. Since the model should estimate the inverse deformation of the decoder \({{{\mathcal{D}}}}\), it should satisfy

$$\begin{array}{r}{{{{\mathcal{D}}}}}^{-1}\left(\,{\mu }_{i},{{{\mathcal{D}}}}\left({z}_{i},{c}_{j}^{0}\right)\right)={c}_{j}^{0}.\end{array}$$

For each image gi the neural network takes as input the latent representation μi from the previously trained encoder \({{{\mathcal{E}}}}\) and a positional encoding of the deformed Gaussian positions \({{{\mathcal{D}}}}({z}_{i},{{{{\bf{c}}}}}^{{{{{0}}}}})\). The concatenated positional encoding and latent representation are then mapped by a multilayer perceptron with six layers and a single additive residual connection to the original coordinates of the consensus model c0. The loss function is the L2 distance between the positions

$$\begin{array}{r}\frac{1}{{N}_{d}{N}_{\mathrm{g}}}\sum\limits_{i=1}^{{N}_{d}}\sum\limits_{j=1}^{{N}_{\mathrm{g}}}\left\Vert{{{{\mathcal{D}}}}}^{-1}\left(\,{\mu }_{i},{{{\mathcal{D}}}}\left({z}_{i},{c}_{j}^{0}\right)\right)-{c}_{j}^{0}\right\Vert^{2}\end{array}$$

We optimized the weights of the inverse deformation network for 200 epochs with the ADAM optimizer for all our results. Once the network has been trained, the backprojection algorithm evaluates it for the latent representation of every particle on a 3D grid and applies the deformation to the CTF-multiplied, backprojected image. For computational speed, we evaluated the inverse deformation on a twofold coarser grid and then up-sampled the deformation fields to the original box size using bilinear interpolation. The resulting volumes are then summed up and divided by the backprojected squared CTFs, as illustrated in Fig. 1.
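The warping step can be sketched with grid sampling, as below: each CTF-multiplied backprojected volume is resampled along the inverse deformation field evaluated on the output grid, the warped volumes are summed, and the sum is divided by the accumulated squared-CTF backprojections. The shapes, the normalized-coordinate convention and the real-space division are simplifications of the actual implementation, and all backprojected volumes here are random placeholders.

```python
import torch
import torch.nn.functional as F

def warp_volume(volume: torch.Tensor, inverse_field: torch.Tensor) -> torch.Tensor:
    """Resample a backprojected volume along an inverse deformation field.

    volume:        (D, D, D) CTF-multiplied backprojection of one particle
    inverse_field: (D, D, D, 3) sampling positions in normalized [-1, 1] coords,
                   i.e. the inverse deformation evaluated on the output grid
    """
    return F.grid_sample(volume[None, None], inverse_field[None],
                         mode="bilinear", align_corners=True)[0, 0]

# Accumulation over particles (toy shapes): the numerator collects warped,
# CTF-multiplied backprojections; the denominator collects backprojected CTF^2.
D = 32
numerator = torch.zeros(D, D, D)
denominator = torch.zeros(D, D, D)
identity = F.affine_grid(torch.eye(3, 4)[None], (1, 1, D, D, D),
                         align_corners=True)[0]           # identity sampling grid
for _ in range(5):                                        # loop over particles
    backprojection = torch.randn(D, D, D)                 # placeholder for P*(C* y_i)
    ctf2_backprojection = torch.rand(D, D, D)             # placeholder for P*(C^2)
    field = identity + 0.02 * torch.randn(D, D, D, 3)     # placeholder inverse field
    numerator += warp_volume(backprojection, field)
    denominator += ctf2_backprojection
reconstruction = numerator / (denominator + 1e-3)         # regularized division
```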

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.