Main

Knowledge of the three-dimensional atomic structures of proteins and nucleic acids is essential for our understanding of the molecular processes of life. In recent years, considerable advances have been made in the determination of structures of biological macromolecules using electron cryo-microscopy (cryo-EM), culminating in cryo-EM maps of proteins with sufficient resolution to resolve individual atoms3,4. Accordingly, the number of new cryo-EM structures in the Electron Microscopy Data Bank (EMDB)5 is growing exponentially. If this trend continues, approximately 100,000 cryo-EM structures will be determined in the next 5 years6.

Over two-thirds of the structures reported in 2022 had resolutions better than 4 Å. Although individual atoms are not resolved at resolutions between 2 and 4 Å, reliable atomic models can be built by exploiting previous knowledge of the chemical structures of the proteins and nucleic acids in the sample, including their amino acid and nucleic acid sequences. Typically, atomic model building in cryo-EM maps is performed using manual procedures in three-dimensional computer graphics programs1,2. Atomic model building is often time-consuming and requires substantial levels of expertise to produce accurate models. At resolutions better than 3 Å, experts can build atomic models with few errors, whereas, at resolutions worse than 4 Å, avoiding mistakes is challenging. It is therefore not uncommon for atomic models of biological complexes to contain errors7, with potentially serious consequences8.

Structure determination using cryo-EM is also an increasingly important tool for the discovery of new subunits in biological complexes. Owing to its relaxed requirements for sample quantity and purity compared with other structural biology techniques, cryo-EM can determine structures of complexes purified from endogenous sources. Many such complexes contain subunits of unknown identities. Without previous knowledge of the amino acid sequence, identifying the chemical identity of individual amino acids in cryo-EM maps is difficult, and requires relatively high resolutions. Yet, provided that one can build stretches of several consecutive amino acids, database searches with the sequence fragments can lead to the identification of the corresponding protein. Recent examples include the identification of TMEM106B in amyloid filaments from human brains9,10,11 and the detection of subunits of axonemal complexes12,13.

Here we introduce a machine-learning approach called ModelAngelo for the automated building of atomic models and the identification of proteins in cryo-EM maps. Machine-learning approaches often require large amounts of training data. For example, recent protein language models were trained on tens of millions of sequences14 and AlphaFold2 was trained on more than 200,000 structures15. By contrast, fewer than 13,000 cryo-EM structures with resolutions better than 4 Å have been determined to date and many of these are redundant. The limited amount of available training data prompted us to design a multimodal machine-learning approach that combines local information from the cryo-EM map surrounding each protein or nucleic acid residue with additional information from the protein sequences in the sample and the local geometry of the structure. Similar sources of information are used by human experts when manually building atomic models in cryo-EM maps.

The sudden availability of atomic models for millions of proteins from protein structure prediction by AlphaFold2 (refs. 15,16) has helped to guide and accelerate model building17. However, previous attempts to fully automate atomic modelling18,19,20,21,22,23,24 or the identification of unknown proteins25,26,27 have not become mainstream, although DeepTracer21,24 and findMySequence25 have gained some traction. As a result, atomic modelling remains a time-consuming and expert-dependent process in many structure determination projects. With the ongoing exponential growth in cryo-EM structures and the continuing influx of newcomers to the cryo-EM field, automation will be key in removing bottlenecks and replacing the dependence on human experts with objective methods that are accessible to all. We demonstrate that ModelAngelo can meet this need. Although subsequent error checking and refinement remain necessary, ModelAngelo outperforms human experts in identifying unknown proteins and produces initial atomic models of comparable completeness to those obtained by human experts.

A multimodal approach to model building

Automated model building of proteins and nucleic acids in ModelAngelo comprises three steps (Fig. 1a). Details about the network architectures that underlie these steps and how they are trained have been described previously28.

Fig. 1: Atomic modelling in ModelAngelo.

a, ModelAngelo builds atomic models in three steps: (1) a CNN predicts protein and nucleic acid residue positions; (2) a GNN optimizes these positions and orientations (shown in b); (3) post-processing of the optimized graph leads to a complete atomic model. b, The GNN, which is arranged in eight layers with three modules, uses a feature vector per residue that is passed through an MLP and integrated with additional data through attention mechanisms that have query (Q), key (K) and value (V) vectors. The cryo-EM module also produces a feature vector (C) used for residue prediction. The IPA module uses query points (Qpoints) and their distances to the neighbouring residues (Dq) for attention. Stable gradient propagation is ensured by residual connections with layer norms (Add LN)51. Residue feature vectors are used to update residue positions and orientations. They are also used to predict torsion angles, confidence scores and residue identities at the end of each layer.

In the first step, positions for the backbone Cα atom of amino acids and the phosphorus atom of nucleic acids are predicted using a convolutional neural network (CNN). This CNN is a modified feature-pyramid network29 that predicts whether each voxel in the cryo-EM map contains the Cα atom of an amino acid, the phosphorus atom of a nucleic acid residue or neither. A graph is then constructed in which each residue is a node, and edges are formed between each residue and its 20 nearest neighbours.
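
The graph construction described above can be illustrated with a minimal sketch. It assumes that the predicted Cα and phosphorus positions are available as an N × 3 NumPy array; the function name `knn_edges` and the brute-force distance calculation are illustrative choices and are not taken from the ModelAngelo code base.

```python
import numpy as np

def knn_edges(positions: np.ndarray, k: int = 20) -> np.ndarray:
    """Return an (N, k') array of neighbour indices, with k' = min(k, N - 1).

    positions: (N, 3) array of predicted residue positions in angstroms.
    Row i lists the indices of the residues closest to residue i, nearest first.
    """
    n = positions.shape[0]
    k_eff = min(k, n - 1)
    # Pairwise squared Euclidean distances, shape (N, N).
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    # A residue is never its own neighbour.
    np.fill_diagonal(dist2, np.inf)
    # Sort each row by distance and keep the k_eff closest residues.
    return np.argsort(dist2, axis=1)[:, :k_eff]

# Four residues on a line; with k = 2 each node keeps its two closest residues.
toy_positions = np.array([[0.0, 0, 0], [1.0, 0, 0], [3.0, 0, 0], [7.0, 0, 0]])
toy_edges = knn_edges(toy_positions, k=2)
```

The all-pairs distance matrix needs memory that grows quadratically with the number of residues, so for very large complexes a spatial index such as a k-d tree would probably be a better choice than this sketch.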

In the second step, a graph neural network (GNN) is used to optimize the positions and orientations of the residues, to predict their amino or nucleic acid identity, and to predict torsion angles for their side chains or bases. The GNN consists of three modules: a cryo-EM module, a sequence module and an invariant point attention (IPA) module (Fig. 1b). Each node of the graph is associated with a residue feature vector. Each module takes the residue feature vector as input, combines it with new information and outputs an updated residue feature vector that is passed to the next module. The sequential application of the three modules in eight layers (Fig. 1b) enables the gradual extraction of more information from the different inputs.

The cryo-EM module incorporates information from the cryo-EM map and comprises two parts. First, the input feature vector is passed through a multilayer perceptron (MLP) network to generate query and value vectors. These vectors are used for cross-attention30 with key vectors that are calculated by a CNN from rectangular boxes extracted from the cryo-EM density map, each pointing from the current residue to one of its 20 nearest neighbours. Intuitively, the cross-attention mechanism allows mixing information from each residue with that of its 20 nearest neighbours, depending on whether the cryo-EM density between them looks connected. Second, a cubic box is extracted from the cryo-EM map around the position of the current residue and passed through another CNN. The resulting vector is used in two ways: to generate amino and nucleic acid identity predictions through an MLP; and, after concatenation with the vector from the cross-attention, it is passed through another MLP to generate the output residue feature vector of the cryo-EM module.

The sequence module performs cross-attention for each residue with the user-provided amino acid sequences, which are embedded using the pretrained protein language model ESM-1b (ref. 31). This incorporates information that is learned by the language model from many amino acid sequences, including multiple homologues. The information in protein language models has been shown to be sufficient for protein structure prediction14. The vector from the cross-attention is used in two ways: a first MLP is used to generate amino and nucleic acid identity predictions; a second MLP generates the output residue feature vector of the sequence module. For nucleic acid residues, the sequence module is not used.
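
To make the cross-attention between residues and sequence embeddings concrete, the following is a minimal single-head sketch of scaled dot-product cross-attention. It is not the ModelAngelo implementation: the function names, the use of a single head and the random projection matrices (which stand in for learned parameters) are all assumptions made for illustration, and the embedding dimensions are arbitrary rather than those of ESM-1b.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(residue_feats, seq_embed, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only).

    residue_feats: (N, d_res) one feature vector per graph node.
    seq_embed:     (L, d_seq) one embedding per position of the input sequences.
    w_q: (d_res, d); w_k, w_v: (d_seq, d) hypothetical projection matrices.
    Returns the (N, d) attended vectors and the (N, L) attention weights.
    """
    q = residue_feats @ w_q            # queries come from the residues
    k = seq_embed @ w_k                # keys come from the sequence positions
    v = seq_embed @ w_v                # values come from the sequence positions
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1) # each residue distributes weight over positions
    return weights @ v, weights

# Toy usage with random numbers in place of learned features and embeddings.
rng = np.random.default_rng(0)
res = rng.normal(size=(6, 16))
seq = rng.normal(size=(50, 32))
w_q = rng.normal(size=(16, 8))
w_k = rng.normal(size=(32, 8))
w_v = rng.normal(size=(32, 8))
out, attn = cross_attention(res, seq, w_q, w_k, w_v)
```

Each row of `attn` sums to one and can be read as a soft alignment of one residue in the map against all positions of the input sequences.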

The IPA module incorporates information from the geometry of the nodes in the graph and was inspired by the module with the same name in AlphaFold2 (ref. 15). An MLP calculates four query points per residue, and the Euclidean distance between these query points and the locations of the neighbouring nodes replaces the cosine similarity between the query and key vectors in the attention algorithm. Intuitively, this enables the model to learn information about the topology of neighbouring residues, for example, about secondary structure. In fact, disabling this module in an ablation study led to atomic models with incorrect secondary structure geometry28.
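
The idea of replacing a dot-product similarity with a distance can be sketched as follows. The text does not specify how the distances from the four query points are combined or scaled, so the negated mean distance used here, like the function name `distance_attention_weights`, is an assumption chosen only to show the principle that closer neighbours receive larger attention weights.

```python
import numpy as np

def distance_attention_weights(query_points: np.ndarray,
                               neighbour_pos: np.ndarray) -> np.ndarray:
    """Attention weights from distances instead of dot products (illustrative).

    query_points:  (N, P, 3) P query points per residue (P = 4 in the text).
    neighbour_pos: (N, K, 3) positions of the K neighbours of each residue.
    Returns an (N, K) array whose rows sum to one.
    """
    # Distance from every query point to every neighbour, shape (N, P, K).
    diff = query_points[:, :, None, :] - neighbour_pos[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Assumed scoring rule: closer neighbours score higher.
    scores = -dist.mean(axis=1)
    # Softmax over the neighbours of each residue.
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
qp = rng.normal(size=(5, 4, 3))
nb = rng.normal(size=(5, 20, 3))
ipa_w = distance_attention_weights(qp, nb)
```

Because only distances enter the scores, translating (or rotating) the query points and the neighbour positions together leaves the weights unchanged, which is the sense in which this kind of attention is invariant to the global pose of the structure.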

In the third and final step, the residue feature vectors are post-processed to generate an atomic model. The feature vectors are used as inputs into two separate MLPs to predict new positions and orientations for each residue, as well as torsion angles for amino acid side chains and nucleic acid bases. They are also used to predict a confidence score for each residue, which is based on the network’s predicted root-mean-squared deviation (r.m.s.d.) for the backbone atoms with the deposited structure. Moreover, the predictions for the amino or nucleic acid identities from the cryo-EM and sequence modules are averaged to generate probabilities for each possible identity for all residues. These vectors are converted into a hidden Markov model (HMM) profile that is used for a search against the input sequences using HMMER (ref. 32). A profile HMM is a probabilistic model representing the multiple-sequence alignment (MSA) of a set of related sequences. The parameters of a profile HMM are normally estimated from the MSA that it strives to model; however, here they are instead estimated from ModelAngelo predictions. There are three types of state in the profile HMM. For each position of the MSA’s consensus sequence, there is a match (M), a delete (D) and an insert (I) state with respect to the query sequences33. There are two types of probabilities in a profile HMM: transition and emission. The transition probabilities reflect the probability of a sequence going between the M, I and D states from one position of the profile to the next. ModelAngelo uses the confidence metric, \({c}^{\left(i\right)}\), that it predicts for each residue i to construct the transition probabilities as follows:

$$\begin{array}{ccc}{P}_{{\rm{M\to M}}}^{\left(i\right)}=\max \left({c}^{\left(i\right)}-d,0.5\right) & {P}_{{\rm{D\to M}}}^{\left(i\right)}=1-d & {P}_{{\rm{I\to M}}}^{\left(i\right)}=1-d\\ {P}_{{\rm{M\to D}}}^{\left(i\right)}=\frac{1-{P}_{{\rm{M\to M}}}^{\left(i\right)}}{2} & {P}_{{\rm{D\to D}}}^{\left(i\right)}=d & {P}_{{\rm{I\to D}}}^{\left(i\right)}=0\\ {P}_{{\rm{M\to I}}}^{\left(i\right)}=\frac{1-{P}_{{\rm{M\to M}}}^{\left(i\right)}}{2} & {P}_{{\rm{D\to I}}}^{\left(i\right)}=0 & {P}_{{\rm{I\to I}}}^{\left(i\right)}=d\end{array}$$

The strategy to set \({P}_{{\rm{M\to I}}}^{\left(i\right)}={P}_{{\rm{M\to D}}}^{\left(i\right)}\), the constant d = 0.5 and the minimum value of \({P}_{{\rm{M\to M}}}^{\left(i\right)}=0.5\) were chosen arbitrarily and these values were never optimized. The emission probabilities represent the probability of each amino acid being produced in an M or I state. For these, ModelAngelo uses its predicted probability distribution of the amino acids for each residue. The resulting HMM profiles are compatible with HMMER3 (ref. 34) and HHblits (ref. 35). Matched residues are mutated to the corresponding amino or nucleic acid in the input sequences, and separate chains are connected on the basis of their assigned sequences and proximity. Finally, chains shorter than four residues are pruned from the model, and a full atomic model is generated from the predicted positions and orientations of each residue and their corresponding amino acid side-chain or nucleic acid base torsion angle predictions using idealized geometries. The predicted backbone r.m.s.d. values are mapped to a score between 0 and 1, corresponding to a linear range for r.m.s.d. values between 1.2 and 0.5 Å, respectively. This score is stored in the B-factor column of the output coordinate file as a measure of local confidence in the backbone geometry.
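
The transition probabilities and the confidence mapping described above are simple enough to write out directly. The sketch below transcribes the nine expressions for the transition probabilities and the linear r.m.s.d.-to-score mapping; the function names are hypothetical, and clipping the score to the interval [0, 1] outside the stated r.m.s.d. range is an assumption. Note that if the confidence lies in [0, 1] and d = 0.5, the maximum in the M→M expression always evaluates to its floor of 0.5, so d is kept as an argument to show how the expression behaves for other values.

```python
import numpy as np

def transition_probabilities(confidence, d: float = 0.5, floor: float = 0.5):
    """Per-residue profile-HMM transition probabilities from confidences c(i)."""
    c = np.asarray(confidence, dtype=float)
    p_mm = np.maximum(c - d, floor)          # P(M->M) = max(c - d, 0.5)
    p_md = (1.0 - p_mm) / 2.0                # remaining mass split equally
    p_mi = (1.0 - p_mm) / 2.0                # between delete and insert
    const = lambda value: np.full_like(c, value)
    return {
        "M->M": p_mm, "M->D": p_md, "M->I": p_mi,
        "D->M": const(1.0 - d), "D->D": const(d), "D->I": const(0.0),
        "I->M": const(1.0 - d), "I->D": const(0.0), "I->I": const(d),
    }

def confidence_from_rmsd(rmsd, worst: float = 1.2, best: float = 0.5):
    """Map predicted backbone r.m.s.d. (angstroms) linearly onto a 0-1 score.

    An r.m.s.d. of `worst` maps to 0 and `best` maps to 1; values outside
    that range are clipped (an assumption, not stated in the text).
    """
    r = np.asarray(rmsd, dtype=float)
    return np.clip((worst - r) / (worst - best), 0.0, 1.0)

trans = transition_probabilities([0.9, 0.2], d=0.1)
scores = confidence_from_rmsd([1.2, 0.5, 0.85, 2.0, 0.1])
```

For every residue, the three probabilities leaving each of the M, D and I states sum to one, as required for a valid set of transition probabilities.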

Inspired by AlphaFold2 (ref. 15), we recycle the post-processed model from one round of the GNN as the starting point of a subsequent round of graph optimization. For this purpose, ModelAngelo was trained with a random number of 1–3 recycling steps. During inference, we perform three rounds of recycling, as the performance plateaus after three rounds.

We trained ModelAngelo on maps deposited in the EMDB5 before 1 April 2022 with resolutions better than 4 Å and paired with models in the Protein Data Bank (PDB)36 that cover the entire map correctly, as described previously28. PDB files that included insertion codes, that is, additional residues relative to the reference sequence, were removed. This resulted in 3,715 map–model pairs that were used during training. All cryo-EM maps were resampled to a common pixel size of 1 Å. For comparison, findMySequence uses only 117 pairs, while DeepTracer uses approximately 1,400 (refs. 21,25).

To enable model building for structures with unknown sequences, we also trained a version of ModelAngelo without its sequence module. Still, for each protein residue, ModelAngelo predicts probabilities for all 20 amino acids. Within ModelAngelo, these probabilities are converted into HMM profiles and used for searches in HMMER3 (ref. 34) as described above, but using a larger proteome, rather than only the sequences known to be present in the structure.

Protein modelling is on par with humans

To test ModelAngelo, we first considered all cryo-EM structures determined to at least 4 Å resolution and released from the EMDB between the cut-off date for training, 1 April 2022, and 9 February 2023. To reduce the computational costs, we excluded structures with more than 30,000 protein residues. We also removed viruses with icosahedral symmetry, for which typically only the asymmetric unit was built. To ensure that none of the sequences were seen before during training, we removed structures that had protein chains with more than 10% sequence identity to any of the proteins in the training set. Finally, we removed structures with insertion codes and other irregularities. This resulted in a test set of 177 structures (Supplementary Information), on which we ran ModelAngelo. Using a single A100 GPU, the smallest structure (PDB: 8DWI; molecular mass, 54.7 kDa) took 2 min; the largest structure (PDB: 7UMS; molecular mass, 1.85 MDa) took 53 min. The output coordinates from ModelAngelo were refined against the cryo-EM map using a standard refinement cycle in Servalcat37, and the refined models were compared to the deposited ones.

To assess the quality of the models generated by ModelAngelo, we analysed the Q-scores38 of all of the structures in the test set. The Q-score measures the resolvability of individual atoms in cryo-EM maps, therefore reflecting the quality of the built model. Provided that the model is built well, Q-scores also correlate with the local resolution, which can vary in cryo-EM maps: Q-scores of 0.4 are typical for cryo-EM maps at 4 Å resolution, values of 0.6 are typical for maps at 3 Å resolution and values better than 0.7 are typical for maps beyond 2 Å resolution38. We implemented Q-score calculation in ModelAngelo and calculated the average Q-scores for all atoms in each residue of both the deposited models and those built by ModelAngelo. We next calculated backbone r.m.s.d. values between the protein models built by ModelAngelo and those deposited and plotted these against the Q-scores of the deposited residues (Fig. 2a (pink line)). As expected, ModelAngelo builds models with lower r.m.s.d. values for residues with higher (better) Q-scores. Even for residues with Q-scores as low as 0.4, ModelAngelo builds models with backbone r.m.s.d. values lower than 1.0 Å. We also measured the completeness of the models built by ModelAngelo. We define completeness as the fraction of residues that are built with their Cα atom within 3 Å of the deposited model and with the correct amino acid assignment. As with backbone r.m.s.d., completeness improves for residues with higher Q-scores (Fig. 2a (blue line)). Overall, ModelAngelo built 77% of all 410,585 residues in the test set. Analysis of the deposited Q-scores shows that those residues not built by ModelAngelo have lower Q-scores than those that are built (Fig. 2b). In the deposited models, many of the residues with the lowest Q-scores were probably obtained by rigid-body docking of protein domains into poorly resolved regions of the cryo-EM maps.
Excluding the 51,446 residues with Q-scores below 0.4, ModelAngelo built 85% of the residues in the test set. A comparison of Q-scores calculated for the models built by ModelAngelo with those calculated for the deposited models shows that models from ModelAngelo are of similar quality to the deposited ones (Fig. 2c). The same is also true for overall Fourier shell correlation values between the cryo-EM maps and those parts of the models that were both built by ModelAngelo and present in the deposited models (Fig. 2d).
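
The completeness metric defined above can be sketched in a few lines. The definition fixes the 3 Å cutoff and the requirement of a correct amino acid assignment, but not how deposited and built residues are paired; the nearest-neighbour pairing used here, and the function name `completeness`, are therefore assumptions for illustration.

```python
import numpy as np

def completeness(dep_ca, dep_aa, built_ca, built_aa, cutoff: float = 3.0) -> float:
    """Fraction of deposited residues recovered by a built model.

    dep_ca, built_ca: (N, 3) and (M, 3) C-alpha coordinates in angstroms.
    dep_aa, built_aa: sequences of residue names of length N and M.
    A deposited residue counts as built if its nearest built C-alpha lies
    within `cutoff` and carries the same amino acid assignment.
    """
    dep_ca = np.asarray(dep_ca, dtype=float)
    built_ca = np.asarray(built_ca, dtype=float)
    if len(dep_ca) == 0 or len(built_ca) == 0:
        return 0.0
    # All deposited-to-built C-alpha distances, shape (N, M).
    dist = np.linalg.norm(dep_ca[:, None, :] - built_ca[None, :, :], axis=-1)
    nearest = dist.argmin(axis=1)
    close = dist[np.arange(len(dep_ca)), nearest] <= cutoff
    same = np.array([dep_aa[i] == built_aa[j] for i, j in enumerate(nearest)])
    return float(np.mean(close & same))

# Toy example: two residues are recovered, one has the wrong identity
# and one has no built C-alpha within 3 angstroms.
dep_xyz = [[0.0, 0, 0], [3.8, 0, 0], [7.6, 0, 0], [11.4, 0, 0]]
dep_names = ["ALA", "GLY", "SER", "LEU"]
built_xyz = [[0.5, 0, 0], [4.0, 0, 0], [7.0, 0, 0]]
built_names = ["ALA", "GLY", "THR"]
toy_completeness = completeness(dep_xyz, dep_names, built_xyz, built_names)
```

In this toy case the function returns 0.5: the third deposited residue is matched positionally but assigned the wrong amino acid, and the fourth has no built counterpart within the cutoff.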

Fig. 2: Performance of ModelAngelo for proteins.

a, The backbone r.m.s.d. and model completeness plotted as a function of the target model Q-scores. b, Histograms of the Q-scores of residues in the deposited models, comparing those built by ModelAngelo with those not built. c, Q-score comparison between ModelAngelo-predicted models and the deposited models. d, Model-to-map Fourier shell correlation (FSC), as calculated by Servalcat37 after refining both models and using only residues present in both ModelAngelo and deposited models. e, Model completeness for various automated model-building software for different local-resolution ranges in the maps. f, Model completeness for ModelAngelo and versions of ModelAngelo in which its sequence and/or IPA modules were ablated. For a–d, the data relate to the test set of 177 structures; for e and f, the data relate to the subset of 27 structures.

In a second test, we compared the performance of ModelAngelo with existing approaches for automated model building in cryo-EM maps. For this test, we used a subset of 27 protein structures from the 177 structures described above (Supplementary Information). We selected nine single-chain structures, nine homo-oligomeric structures and nine hetero-oligomeric structures. For each of these types of structures, we selected three structures with overall resolutions worse than 3.3 Å, three structures with resolutions between 3.3 and 2.8 Å, and three structures with resolutions better than 2.8 Å. For all 27 structures, unfiltered half-maps were available for download from the EMDB, and we used these to calculate local resolutions in ResMap39. We then used Phenix40, Demo-EM41, Buccaneer20 and DeepTracer21 for automated model building in these maps and compared the completeness of the resulting models with those obtained using ModelAngelo (Fig. 2e and Extended Data Table 1). The best alternative approach, DeepTracer, built approximately 80% of the deposited residues in regions of the maps with local resolutions in the range of 2.5–3 Å; the remaining approaches built models with considerably lower completeness. By contrast, ModelAngelo built up to 80% of the deposited residues in regions of the maps with local resolutions down to 3.5–4 Å, reflecting the observation that manual building by human experts also becomes prone to errors at resolutions worse than 4 Å. Tests in which we ran ModelAngelo without one or more of its modules indicate that its performance comes from a combination of all three modules (Fig. 2f), which is consistent with previous observations28.

Building good nucleic acid backbones

The test set of 177 structures described above contained only 103 nucleic acid chains, many with just a few nucleotides. Thus, instead of conducting a systematic analysis as done for the proteins, we present a few test cases to illustrate the quality of nucleotide building (Fig. 3). We applied ModelAngelo to 11 different ribosome structures that were determined to resolutions ranging from 1.98 to 3.80 Å (Fig. 3a,b), as well as a CRISPR-associated transpososome from Scytonema hofmanni42 (Fig. 3c,d). Although ribosome structures were included in ModelAngelo’s training set, the nucleotide sequences were not. When plotting backbone r.m.s.d. values and backbone completeness against the Q-scores of the deposited nucleotide coordinates (Fig. 3e), we observed similar trends to those for the protein chains. Backbone r.m.s.d. values range from 2 Å in the worst regions of the map to values better than 0.5 Å in the best regions. Likewise, near-complete backbones are built in the best regions, while backbone completeness drops to below 80% for the worst regions. However, ModelAngelo struggles to distinguish between the two purines or the two pyrimidines, echoing the difficulty that humans face in building nucleotide sequences based solely on the cryo-EM density, if the resolution does not extend beyond 2.5 Å. Consequently, when considering only correctly built sequences, the completeness of the models built by ModelAngelo drops to 80% for the best parts of the map, and to as low as 20% for the worst parts (Fig. 3e). Users should therefore carefully validate the nucleotide chains of models built by ModelAngelo, for example, by using nucleotide secondary structure predictors43. Nonetheless, ModelAngelo considerably accelerates the process of building the nucleotide backbone, as subsequent nucleotide base changes can be made with minimal manual intervention. 
For the CRISPR-associated transpososome and 3 out of the 11 ribosomes described above, we also used DeepTracer26 and CryoREAD44. ModelAngelo produced nucleotide models that were more complete and more accurate than these alternative approaches (Extended Data Table 2).

Fig. 3: Performance of ModelAngelo for nucleic acids.

a, Escherichia coli ribosome built by ModelAngelo (with ribosomal RNA in green and proteins in blue) compared with the deposited model (PDB: 7S1G, black outline)52. b, Magnified view with nucleotide bases showing high accuracy compared with the deposited model (orange). c, ModelAngelo model of the V-K CAST transpososome from S. hofmanni compared with the deposited model (PDB: 8EA4)42. Sections that were not built by ModelAngelo (black outline) are in regions of low Q-score (as shown in g). d, Magnified view comparing the nucleotide bases of both models, showing a sequence that was incorrectly identified by ModelAngelo. e, Backbone r.m.s.d., backbone completeness and sequence completeness were plotted against the deposited Q-score for six ribosome structures. f,g, Deposited models for the structures in a and c, respectively, coloured by Q-score, with low-Q-score regions indicated by boxes.

Identifying novel proteins

To illustrate the performance of ModelAngelo in identifying protein chains in cryo-EM maps, we applied ModelAngelo to two examples of large cryo-EM structures that were recently determined from endogenous sources. The first example is a structure of the supercomplex of the phycobilisome (PBS), photosystem I and II (PSI and PSII) and the transmembrane light-harvesting complexes (LHCs) that was imaged in situ in the red alga Porphyridium purpureum45. The second example is a structure of the ciliary central apparatus and radial spokes of the green alga Chlamydomonas reinhardtii that was obtained by single-particle analysis after purification from cilia12,13.

At 16.7 MDa, the PBS–PSII–PSI–LHC supercomplex is one of the largest complexes determined using single-particle cryo-EM. The deposited model (PDB: 7Y5E) consists of 158,730 residues in 81 unique protein chains, including six chains for which the authors were unable to identify the corresponding protein. The unidentified chains were termed LPP1 (linker of PBS–PSII 1); CNT (for connector); PsbW and Psb34 (two of the core subunits of PSII); LRH (a linker protein); and LPS1 (photosystem linker protein 1). To identify these chains, we ran ModelAngelo without using its sequence module (using the build_no_seq option) to calculate an initial atomic model with HMM profiles for all chains, and we searched these profiles against the proteome constructed in ref. 46 (using the hmm_search option). Due to local pseudosymmetry, all six unidentified proteins occur more than once in the cryo-EM map. This enables us to bootstrap weaker individual hits by cross-referencing their matches to the other instances. Specifically, the same six protein chains were identified for all instances, with E-values in the range of 5.8 × 10⁻⁶⁶ to 6.4 × 10⁻². Using the backbone traces in the deposited model, findMySequence25 identified only two of the unassigned proteins (Psb34 and PsbW). Using the backbone traces generated by ModelAngelo, it also found LRH. We next constructed an input sequence file that included all chains in the deposited model plus the six newly identified chains and ran ModelAngelo again. This calculation took 23 h on an A100 GPU. The resulting model, containing 110,742 residues, is shown in Fig. 4a. For most sections of the unidentified chains, ModelAngelo built better models than those in the deposited structure, most notably for LRH and CNT. ModelAngelo did not build models for parts of the unidentified proteins that were in regions of poor cryo-EM density.
Besides the excellent agreement between side-chain densities in the cryo-EM map and the predicted sequences (Extended Data Fig. 1), the structures built by ModelAngelo were also highly similar to AlphaFold2 predictions for the unidentified chains15,47 (Extended Data Fig. 2). ModelAngelo did not attempt to build amino acid or nucleotide residues in the densities for phycocyanobilin or phycoerythrobilin cofactors (Extended Data Fig. 3). Because the cryo-EM maps on which ModelAngelo was trained contained cofactor densities, whereas the network was trained to build only protein and nucleic acid residues, ModelAngelo has been incentivized to ignore cofactor densities.

Fig. 4: Examples of protein identification using ModelAngelo.

a, The ModelAngelo model of the single-PBS–PSII–PSI–LHC supercomplex (grey) showing the positions, models and map densities of six newly identified proteins (green). Backbone traces in the deposited model (PDB: 7Y5E) are shown in orange. b, Atomic model of the central apparatus microtubule C1 showing the positions, models and map densities of two identified proteins—FAP92 and FAP374. The orange cartoons represent poly(UNK) chains deposited in the original model (PDB: 7SQC). c, An atomic model of radial spokes 1 and 2 (RS1 and RS2) bound to a doublet microtubule (grey) showing the positions, models and map densities of four proteins (RSP24–27, green) identified by ModelAngelo. Only RSP27 had a backbone trace in the deposited model (orange). C, C terminus; N, N terminus.

Like the PBS–PSII–PSI–LHC supercomplex, the central apparatus and radial spoke complexes isolated from C. reinhardtii ciliary axonemes are large complexes with poorly characterized subunit compositions. Although recent cryo-EM structures had identified 23 different radial spoke proteins (RSPs) and 48 different central apparatus proteins12,13, the deposited maps (EMDB: EMD-22475, EMD-24481 and EMD-25381) contained densities that were left unassigned despite considerable manual effort. To identify these proteins, we applied ModelAngelo without using its sequence module to the deposited maps and searched the resulting HMM profiles against the latest version of the C. reinhardtii predicted proteome48 (Fig. 4b and Methods). This approach identified four additional radial spoke proteins (FAP109, Cre05.g240450, Cre08.g800895 and Cre17.g802036), which we rename RSP24, RSP25, RSP26 and RSP27, respectively, and two additional central apparatus proteins (FAP92 and FAP374) (Extended Data Table 3). Using ModelAngelo’s backbone traces, findMySequence25 was unable to identify any of these proteins. Neither RSP24 nor RSP26 was annotated in earlier versions of the C. reinhardtii genome, explaining their absence from proteomic studies, and demonstrating the importance of high-quality genome annotations for de novo identification of proteins by cryo-EM. RSP27 was identified from a fragment of just 33 residues, demonstrating the power of ModelAngelo to identify proteins from small sections of well-resolved density. Both central apparatus proteins (FAP92 and FAP374) bind directly to the microtubule surface and have tertiary structures that are poorly predicted by AlphaFold2 (Extended Data Fig. 4); side-chain density was therefore essential for their successful identification (Extended Data Fig. 5). The identification of these proteins will allow their functional relevance to the regulation of ciliary motility to be investigated through targeted genetic manipulation.

Discussion

ModelAngelo automates atomic modelling in cryo-EM maps, building protein models of comparable quality to those built by human experts and nucleic acid models with near-complete and accurate backbones. ModelAngelo outperforms existing approaches for the automated modelling of both proteins and nucleotides. Furthermore, ModelAngelo builds these models within hours on a modern GPU, thereby removing an important bottleneck in cryo-EM structure determination. Future incorporation of ModelAngelo into automated cryo-EM image-processing pipelines will enable users to go from data acquisition to atomic models in a single automated procedure.

By introducing objectivity in the model-building process, ModelAngelo also informs which parts of the map can be confidently interpreted with an atomic model and which should be left uninterpreted. In this way, ModelAngelo will not only reduce the number of errors in atomic models but also have a role in making cryo-EM structure determination more accessible to the large numbers of newcomers that the field has experienced in recent years. Still, some degree of human supervision and intervention will remain necessary. Models from ModelAngelo will still need refinement, for example, in Servalcat37 or Phenix40, to optimize their stereochemistry and fit to the cryo-EM map. Users are also strongly encouraged to manually check the output of ModelAngelo, particularly for those parts of cryo-EM maps with resolutions worse than 3.5–4.0 Å, as rigid-body fitting of known domains or connecting loops in lower-resolution map regions to obtain a more complete model falls outside the scope of ModelAngelo. Colouring the model by its predicted confidence in backbone geometry, as stored in the B-factor column of the coordinate file, may guide the user towards parts of the model that are less reliable. ModelAngelo was trained with augmentation through a variety of positive and negative B-factors. It should therefore be relatively stable to local variations in B-factor. It is possible that combining ModelAngelo with neural networks that make cryo-EM maps look more like proteins49,50 could lead to further improvements, although this would probably require retraining of ModelAngelo to reach its full potential.

Besides accelerating cryo-EM structure determination and providing objectivity in atomic modelling, ModelAngelo also identifies protein chains in cryo-EM maps better than human experts. The reason why ModelAngelo outperforms the human expert in this task probably lies in the implementation of its sequence searches. While human experts typically base their identifications on discrete assignments of individual amino acids to various residues in unknown chains, ModelAngelo exploits predicted probabilities for all 20 amino acids for every protein residue and combines this information with its predicted confidence in each residue in a full HMM search. This not only allows better identification of unknown chains but also helps ModelAngelo during the building of atomic models with known sequences, where it may potentially outperform human experts in placing protein chains for which ambiguity exists, for example, when multiple homologous chains coexist in a single structure. The ability to identify proteins in cryo-EM maps will increase in importance as ongoing advances in sample preparation, microscopy and image processing enable ever more structures to be determined for samples purified from native sources or visualized in situ by electron tomography of frozen cells or thin tissue sections.
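The difference between discrete residue assignments and probabilistic ones can be sketched with a deliberately simplified scoring scheme. The Python example below scores candidate sequences against a per-residue probability profile, weighting each position by a confidence value, and slides the profile over a toy database without gaps. It is not the full HMM search used by ModelAngelo (there are no insertion or deletion states); the function names, the uniform background frequency, the confidence weighting and the toy sequences are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peaked(aa, p):
    """Toy per-residue distribution: probability p on aa, rest spread evenly."""
    rest = (1.0 - p) / (len(AMINO_ACIDS) - 1)
    return {a: (p if a == aa else rest) for a in AMINO_ACIDS}


def profile_log_odds(profile, confidences, segment, background=1.0 / 20):
    """Confidence-weighted log-odds of a sequence segment given the profile."""
    score = 0.0
    for probs, conf, aa in zip(profile, confidences, segment):
        p = max(probs.get(aa, 0.0), 1e-6)  # floor avoids log(0)
        score += conf * math.log(p / background)
    return score


def best_hit(profile, confidences, database):
    """Ungapped sliding-window search; returns (score, name, offset)."""
    n = len(profile)
    best = None
    for name, seq in database.items():
        for off in range(len(seq) - n + 1):
            s = profile_log_odds(profile, confidences, seq[off:off + n])
            if best is None or s > best[0]:
                best = (s, name, off)
    return best


# Three built residues; the middle one is ambiguous and has low confidence.
profile = [peaked("M", 0.9), peaked("K", 0.5), peaked("L", 0.9)]
confidences = [1.0, 0.5, 1.0]
database = {"protA": "GGMKLGG", "protB": "GGMRVGG"}  # hypothetical sequences
hit = best_hit(profile, confidences, database)
print(hit[1], hit[2])
```

Because every position contributes a graded score, an uncertain residue lowers the total only slightly instead of producing a hard mismatch, which is the intuition behind using probabilities for all 20 amino acids.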

Methods

Changes in ModelAngelo 1.0

We previously described an early (beta) version of ModelAngelo28. Here we introduce the first stable release of ModelAngelo (v.1.0), which extends the beta version by adding the ability to build nucleotides and by updating the HMM algorithm, as described in the main text. We also made minor changes to the GNN to meet the additional demands of building nucleotides. Whereas the beta version of ModelAngelo used cryo-EM maps to a maximum spatial frequency of 3 Å, ModelAngelo v.1.0 uses information up to 2 Å resolution. To capture the same context radius, the regions that are sampled around each residue in the cryo-EM module were therefore increased from 17 to 23 voxels for the cubes and from 5 to 7 voxels for the rectangle lengths. We improved the training of the model by using the Lion optimizer53 and changing the dropout probability from 0 to 0.1 (ref. 54). To compensate for the increased computational costs of these changes, we also implemented several approaches to speed up calculations. In particular, ModelAngelo can now be run using multiple GPUs simultaneously, node updates are performed more efficiently and we use larger batch sizes in training. Furthermore, we confirmed that half-precision inference (running the model with two-byte floating-point precision rather than the default four-byte precision) does not affect the outcome in the GNN. As a result of these changes, ModelAngelo 1.0 runs faster than the beta version, even though it uses a larger network.
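To make the two-byte versus four-byte distinction concrete, the short Python sketch below round-trips a number through IEEE 754 half and single precision using the standard-library struct module and compares the rounding errors. It only illustrates the storage formats; it is not ModelAngelo's inference code, the helper names and the test value are arbitrary, and whether the reduced precision affects a given network has to be checked empirically, as described above.

```python
import struct


def to_half_and_back(x):
    """Round-trip a float through IEEE 754 half precision (2 bytes)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single_and_back(x):
    """Round-trip a float through IEEE 754 single precision (4 bytes)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


half_bytes = struct.calcsize("<e")    # storage size of one half-precision value
single_bytes = struct.calcsize("<f")  # storage size of one single-precision value

value = 0.1234567  # arbitrary example number
half_error = abs(to_half_and_back(value) - value)
single_error = abs(to_single_and_back(value) - value)

print(half_bytes, single_bytes)
print(half_error > single_error)
```

Half precision halves the memory per value at the cost of a larger rounding error per number, which is why its effect on the final output needs to be confirmed rather than assumed.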

Radial spoke and central apparatus

The structure of radial spoke 1 (RS1) from C. reinhardtii (EMD-22475)12 contained unassigned proteins that were either left unmodelled or tentatively interpreted with a poly(UNK) model. To identify these proteins, we ran ModelAngelo without using its sequence module to calculate an initial atomic model with HMM profiles for all chains. We subsequently searched the HMM profiles against the latest version of the C. reinhardtii genome48, which was not available at the time of the original publication. For a known radial spoke protein, RSP6, ModelAngelo correctly predicted 67% of all residues even without knowledge of its sequence. This approach also unambiguously identified three unassigned proteins: FAP109, Cre17.g802036 and Cre05.g240450, which we reassign as RSP25, RSP26 and RSP27, respectively. RSP27 was identified from a fragment of just 33 residues, demonstrating ModelAngelo’s ability to identify proteins from minimal information, given well-resolved side-chain densities.

RSP25 and RSP26 form a heterodimer in the neck of RS1. These structurally similar proteins each have an N-terminal RIIa domain (similar to the dimerization-anchoring domain of cAMP-dependent protein kinase regulatory subunit) followed by two C-terminal EF-hand motifs. The proteins were identified on the basis of sequence differences between their better-resolved RIIa domains, demonstrating ModelAngelo’s ability to distinguish between similar proteins. RSP25 (FAP109) had been detected by mass spectrometry analysis of RS1 purified from C. reinhardtii axonemes12, providing confidence to the assignment. RSP26 (Cre17.g802036) was not annotated in earlier versions of the C. reinhardtii genome, explaining its absence from proteomic studies. RSP27 (Cre05.g240450) forms a small, L-shaped helix in the centre of the RS1 stalk.

After identification, we constructed an input sequence file that included all of the chains in the deposited model along with the three newly identified chains and ran ModelAngelo again. This approach identified and built extensions of RSP16 that had been left unassigned in the deposited model. We then extended the models of RSP25 and RSP26 using AlphaFold2 predictions for the EF-hand motifs, which have relatively poor cryo-EM density, demonstrating how ModelAngelo and AI-based structure prediction methods can be used together to build more complete atomic models.

The microtubule-bound stalk of radial spoke 2 (RS2), which is structurally and compositionally different from RS1, also contained unassigned proteins in the deposited map (EMD-22481)12. We therefore applied the same process to identify one additional protein, Cre08.g800895, which we rename RSP24. RSP24 is a 25 kDa bilobal protein with an N-terminal ubiquitin-like domain. An LC8-interacting protein in the stalk of RS2 remains unassigned due to too few resolved side chains.

In the axoneme, radial spokes interact transiently with the central apparatus. Structures of the two microtubules (C1 and C2) that together form the C. reinhardtii central apparatus have recently become available (EMD-25381 and EMD-25361)13. The map of the C1 microtubule (EMD-25381) contained a number of unassigned densities. We therefore applied ModelAngelo without using its sequence module to a Phenix auto-sharpened version of the map, as the original map was post-processed using DeepEMhancer49. This approach identified two new proteins: FAP92 and FAP374. FAP92 is a microtubule-associated protein that binds in the interprotofilament cleft between protofilaments 3 and 4 and repeats with 32 nm periodicity, whereas FAP374 is a microtubule inner protein that repeats with 16 nm periodicity. Neither protein has a globular fold or is fully resolved in the map, demonstrating ModelAngelo’s ability to identify ordered fragments of proteins. The final models of FAP92 and FAP374 were extended manually using Coot1 through regions of less-well-resolved density and refined in Phenix55.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.