Brief Communication | Published:

# A fully automatic method yielding initial models from high-resolution cryo-electron microscopy maps

## Abstract

We report a fully automated procedure for the optimization and interpretation of reconstructions from cryo-electron microscopy (cryo-EM) data, available in Phenix as phenix.map_to_model. We applied our approach to 476 datasets with resolution of 4.5 Å or better, including reconstructions of 47 ribosomes and 32 other protein–RNA complexes. The median fraction of residues in the deposited structures reproduced automatically was 71% for reconstructions determined at resolutions of 3 Å or better and 47% for those at resolutions worse than 3 Å.

## Main

Improvements to methods for cryo-EM data collection and image reconstruction have made it possible to obtain 3D images of macromolecules at resolutions that allow ready visualization of structural details such as the locations of side chains1,2. A key limiting factor in structure determination by cryo-EM is now the effort required for interpretation of these reconstructed images in terms of an atomic model. Algorithms for de novo model building using cryo-EM maps have been reported recently3,4,5,6,7,8 and are capable of building complete models in some cases, but they are not fully automated. None, for example, can carry out map segmentation; automatically sharpen a map; identify what parts of the map correspond to protein, RNA, or DNA when more than one type of chain is present in the same map; or take reconstruction symmetry into account.

Here we report an integrated procedure for map interpretation without user intervention. The most important new algorithms include automatic map segmentation, automatic map sharpening, automatic interpretation of multiple chain types, and automatic application of reconstruction symmetry. The procedure requires as input only a cryo-EM map, the nominal resolution of the reconstruction, the sequences of the molecules present, and any symmetry used in the reconstruction. Map interpretation begins with automatic image sharpening using an algorithm that maximizes connectivity and detail in the map9. The unique parts of the structure in the sharpened map are then identified with a segmentation algorithm that extends previous methods10 by taking into account reconstruction symmetry and choosing a contour level based on the expected contents of the structure (details in Methods and Supplementary Fig. 1; examples in ref. 11). For each part of the structure and for each type of macromolecule present, atomic models are generated via several independent methods, and then extensive real-space refinement12 is applied with restraints based on automatically identified secondary structure elements. Finally, our automated procedure carries out assembly and refinement of the entire structure, including any symmetry that is present. The procedure is carried out by the Phenix tool phenix.map_to_model, available at https://www.phenix-online.org.

One of the strengths of our automated model-building procedure is that a novice user will generally obtain the same result as an expert. Whereas most automatic tools have many parameters that the user can adjust to improve results in difficult cases, our approach, generally speaking, does not. Each time we identify a parameter that needs to be adjusted depending on the situation, that parameter is automatically varied and optimized in our procedure. For example, the sharpening of a map is a very important adjustable parameter that is set by the user in most approaches. We developed a metric that we could use to evaluate map sharpening and automatically optimize the sharpening of the map during the analysis. The result of this strategy is that our procedure has just one overall parameter (“quick” or “not quick”) that a user might normally adjust for a difficult case. Another strength of automated procedures such as this is that they conduct analyses that would be challenging to carry out with tools that require manual input. With automated interpretation of a map, for example, one can obtain error estimates by repeating the interpretation using different algorithms or different random seeds in appropriate stages of the analysis (Supplementary Results).

The steps in map interpretation are illustrated in Fig. 1. Part of a deposited cryo-EM reconstruction of lactate dehydrogenase (EMDB13 entry 8191; ref. 14) obtained at a resolution of 2.8 Å is illustrated in Fig. 1a, along with an atomic model (PDB15 entry 5K0Z; ref. 14) representing the authors’ interpretation of the reconstructed map. Our procedure optimizes map sharpening and yields an atomic model as an interpretation of the map (Fig. 1b). Features such as side chains can be clearly visualized without the need for manual adjustment of the map sharpening. In this case (Fig. 1b) our automated interpretation did not determine the identity of the side chains in this region of the map, but in other cases, such as the reconstruction of β-galactosidase16 at a resolution of 2.2 Å (Fig. 1c), side chains were identified automatically for most of the structure.

Figure 1d illustrates the deposited cryo-EM reconstruction of glutamate dehydrogenase (EMDB-6630; ref. 17) obtained at a resolution of 3.3 Å. Using the automatically sharpened map, we created three models, each with a different Phenix algorithm (tracing polypeptide backbones by following tubes of density, feature-based identification of secondary-structure elements, and template matching followed by extension with fragments from known structures; Methods) so as to minimize correlated errors. For each model-building algorithm, the main chain was built first, then the sequence was assigned and side chains were added on the basis of the fit to the density18. These three models are superimposed on the map in Fig. 1d (after application of reconstruction symmetry). The best-fitting parts of each model were combined and extended to fill gaps, and the resulting composite model was refined to yield an interpretation of the reconstruction (Fig. 1e). For this glutamate dehydrogenase structure, 66% of the residues in the deposited model (PDB 3JCZ; ref. 17) were reproduced in the automatically generated model (Cα atoms in the deposited model matched within 3 Å), with a root-mean-square (r.m.s.) coordinate difference of 0.8 Å for matching Cα atoms.

We used a similar procedure to automatically generate interpretations of cryo-EM reconstructions of RNA–protein complexes. In addition to generating multiple models interpreting each part of a map as protein, we generated additional models interpreting each part of the map as RNA, and combined the best-fitting parts of each model to represent the protein–RNA complex. Figure 1f shows such an automatically interpreted reconstruction of the Mycobacterium smegmatis ribosome19 (EMDB-6789) analyzed at a resolution of 3.1 Å and compared with the deposited model (PDB 5XYM). Figure 1h,i compares a portion of the automatically generated and deposited models in detail. The automatically generated model represents 60% of the RNA and 48% of the protein in the deposited model.

At lower resolution, a smaller fraction of the structures is typically reproduced by our methods, but secondary structural elements such as protein and RNA helices are often identifiable. Figure 1g shows the automatically sharpened map and compares the automatic interpretation for the protein-conducting ERAD channel HRD1 (ref. 20; EMDB-8637) with the deposited model (PDB 5V6P). This reconstruction at a resolution of 4.1 Å was previously interpreted through a combination of helix-fitting into density, manual model building, and Rosetta20 modeling with distance restraints from evolutionary analysis.

We evaluated the overall effectiveness of our procedure by applying it to all 476 high-resolution cryo-EM reconstructions in the EMDB that we could extract with simple tools and match to an entry in the PDB. For chains in maps reconstructed at resolutions of 3 Å or better (24 structures containing protein and 6 containing RNA), the median fractions of protein and RNA residues in the deposited models reproduced by our approach were 71% and 45%, respectively (Fig. 2a). At resolutions lower than 3 Å (452 structures with protein and 73 with RNA), 47% of protein residues and 34% of RNA residues were reproduced. The median r.m.s. coordinate differences for matching Cα atoms in protein chains and for matching P atoms in RNA chains were each about one-third the resolution of the reconstructions (Fig. 2b). The median fraction of the sequence of the deposited structure that could be reproduced (Fig. 2c) was 28% for protein chains at higher resolution and 9% at lower resolution (the expected fraction matching based on random sequence assignment was roughly 6% for protein with 20 amino acids with frequencies in eukaryotes and roughly 25% for RNA with 4 bases and similar frequencies). For RNA chains, the sequence match was 49% at higher resolution and 42% at lower resolution. An analysis of the geometries of the models is presented in Supplementary Figs. 2 and 3. We found that substantial structural information could be obtained even at resolutions lower than the resolution of 4.5 Å for which the procedure was designed. Figure 2d,e shows a comparison of our automatic analysis of horse spleen apoferritin at a resolution of 4.7 Å with the deposited model.

We compared our automatic model-building approach with two other recently developed methods: MAINMAST8 and de novo Rosetta21 model building. For our analysis we chose maps for which these two methods had already been applied8. Because the other methods are not fully automated, we prepared maps suitable for them by cutting out just the region surrounding a single chain, and we used these artificial maps to test each approach. We applied our automated approach to the 22 unique maps used in the previous work8 with a matching entry in the PDB, segmented to show a single chain as described8, and compared the models (Fig. 2f and Supplementary Table 1) with those obtained previously8. Our method yielded an average coverage higher than that of Rosetta and lower than that of MAINMAST. The models built by our automated procedure had accuracy (r.m.s. difference from deposited models of matching Cα atoms of 1.31 Å) the same as or better than that of the MAINMAST models (r.m.s. deviation of 1.51 Å) and the Rosetta models (r.m.s. deviation of 1.33 Å).

With our automated procedure, essentially any high-resolution reconstruction with suitable metadata describing the reconstruction, its resolution, and reconstruction symmetry can be interpreted and a first atomic model can be generated without any manual intervention. The models produced are not complete at this stage. We anticipate that combination of the integration available in our approach with other algorithms3,4,5,6,7,8 could lead to both a high degree of automation and high model completeness.

## Methods

### Summary of automated modeling procedure

The first step in our automated modeling procedure is automatic map sharpening. This yields an optimized map suitable for interpretation. The second step is segmentation of the map, in which the unique part of the map is identified and further split into parts containing regions of contiguous density. The third step is modeling of each region in the unique part of the map as protein (and RNA/DNA if present in the sequence file) via several independent map-interpretation methods, which yields overlapping interpretations of each part of the map. The fourth step is selection of the best-fitting interpretation of each region of the map and creation of a composite model representing the entire unique part of the map. The final step entails application of the reconstruction symmetry of the map and refinement of the resulting full model.

### Map sharpening

Maps are automatically sharpened (or blurred) with phenix.auto_sharpen9, which maximizes the level of detail in the map and the connectivity of the map by optimizing an overall sharpening factor23 Bsharpen applied to Fourier coefficients representing the map up to its effective resolution. Beyond that resolution, a blurring exponential factor Bblur with a value of 200 Å² is applied. This blurring procedure enables the user to dampen high-resolution information without precise knowledge of the optimal resolution cutoff.

### Map segmentation

Map segmentation consists of the identification of all regions with density above an automatically determined threshold, followed by selection of a unique set of density regions that maximizes connectivity and compactness, taking into account the symmetry present.

Contiguous regions above a threshold in a map are identified by a region-finding algorithm. This algorithm chooses all the grid points in a map that are above a given threshold. Then it groups these grid points into regions in which every point is above the threshold and is connected to every other point in that region through adjacent grid points that are also above the threshold.
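The region-finding step described above can be sketched as a flood fill over the grid. This is a minimal illustration, not the Phenix implementation: the function name `find_regions`, the nested-list map layout, and the use of face adjacency (six neighbors, no periodic wrapping) are all assumptions made for the example.

```python
from collections import deque

def find_regions(grid, threshold):
    """Group grid points above `threshold` into face-connected regions.

    `grid` is a nested list indexed as grid[x][y][z] (hypothetical layout).
    Returns a list of regions, each a list of (x, y, z) tuples, largest first.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    regions = []
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if grid[x][y][z] <= threshold or (x, y, z) in seen:
                    continue
                # Breadth-first flood fill starting from this seed point.
                queue = deque([(x, y, z)])
                seen.add((x, y, z))
                region = []
                while queue:
                    px, py, pz = queue.popleft()
                    region.append((px, py, pz))
                    for dx, dy, dz in steps:
                        qx, qy, qz = px + dx, py + dy, pz + dz
                        inside = 0 <= qx < nx and 0 <= qy < ny and 0 <= qz < nz
                        if inside and (qx, qy, qz) not in seen \
                                and grid[qx][qy][qz] > threshold:
                            seen.add((qx, qy, qz))
                            queue.append((qx, qy, qz))
                regions.append(region)
    regions.sort(key=len, reverse=True)
    return regions

# Toy 4x1x1 "map": two blobs separated by one low-density grid point.
toy_map = [[[2.0]], [[1.5]], [[0.1]], [[3.0]]]
regions = find_regions(toy_map, threshold=1.0)
```

Every point in a returned region is above the threshold and reachable from every other point in that region through adjacent above-threshold points, which is the grouping criterion stated in the text.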

The choice of threshold for defining regions of density is a critical parameter in segmentation10. We set this value automatically by finding the threshold that optimizes a target function that is based on three factors. One factor targets a specific volume of the map above the threshold. The second factor targets the expectation that if n-fold symmetry is present, then groups of n regions should have approximately equal volumes. The third factor targets regions of density of a specific size. The desired volume above the threshold is chosen on the basis of the molecular volumes of the molecules expected to be present in the structure and the assumption that a fraction f (typically 0.2) of the volume inside a molecule will have high density (the parts very near atoms) and that only these high-density locations should be above the threshold. The desired size of individual regions of density is about the size occupied by 50 residues of the macromolecule, as this is a suitable size for model building of one or a few segments of a macromolecule. The exact size of regions is not crucial.

The details of setting this threshold depend on n, the number of symmetry copies in the reconstruction; nres, the total number of residues in the reconstruction; vtotal, the total volume of the reconstruction; vprotein, the volume occupied by the macromolecule; f, a target fraction of grid points inside the macromolecular regions desired to be above the threshold; and r, a target number of residues in each region. Prior to segmentation, the map is normalized, taking into account the fraction vprotein/vtotal of the map that contains macromolecule. The map is first transformed by subtraction of its mean followed by division by its standard deviation to yield a map with a mean value of zero and variance of unity. Then a multiplicative scale factor of

$$s = \left( {v_{\rm{protein}}{\mathrm{/}}v_{\rm{total}}} \right)^{1{\mathrm{/}}2}$$
(1)

is applied. This transformation has the property that if the density for the molecule has uniform variance everywhere inside the molecule, removal of part of the molecule from the map will lead to a transformed map that is unchanged for the remainder of the molecule.
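The invariance property can be checked on a toy example. The sketch below assumes the map is a flat list of density values and divides by the standard deviation (the step that yields unit variance) before applying s; the function name `normalize_map` and the toy values are illustrative only.

```python
import math

def normalize_map(values, v_protein, v_total):
    """Zero mean, unit variance, then scale by s = sqrt(v_protein / v_total)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    s = math.sqrt(v_protein / v_total)
    return [s * (v - mean) / sd for v in values]

# Toy one-dimensional "maps": density of +1/-1 inside the molecule, 0 in solvent.
full_map = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]  # molecule fills 4 of 8 points
cut_map = [1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # half of the molecule removed

full_scaled = normalize_map(full_map, v_protein=4, v_total=8)
cut_scaled = normalize_map(cut_map, v_protein=2, v_total=8)
```

In both cases the remaining molecular density comes out at +1/-1, so the part of the molecule that is still present looks the same to a fixed threshold regardless of how much of the map is solvent.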

The desired total volume vtarget corresponding to high density within the macromolecule is given by the product of the total volume within the macromolecule, vprotein, and the desired fraction of grid points within the macromolecule that are to be above the threshold, f.

$$v_{\rm{target}} = fv_{\rm{protein}}$$
(2)

The number of desired regions mtarget is given by the number of residues in the macromolecule divided by the desired number of residues in each region,

$$m_{\rm{target}} = n_{\rm{res}}{\mathrm{/}}r$$
(3)

The desired volume per region vregion_target is the ratio of the total target volume to the total number of regions,

$$v_{\rm{region\_target}} = v_{\rm{target}}{{/}}m_{\rm{target}}$$
(4)
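As a worked illustration of the three target quantities, the sketch below uses the typical values quoted earlier (f = 0.2, about 50 residues per region). The molecular volume and residue count are made-up numbers, and `segmentation_targets` is a hypothetical helper name.

```python
def segmentation_targets(v_macromolecule, n_res, f=0.2, r=50):
    """Return (v_target, m_target, v_region_target).

    v_macromolecule: total volume within the macromolecule (e.g. in cubic angstroms)
    n_res: number of residues; f: fraction expected above threshold;
    r: desired residues per region.
    """
    v_target = f * v_macromolecule          # high-density volume wanted above threshold
    m_target = n_res / r                    # number of regions wanted
    v_region_target = v_target / m_target   # volume wanted per region
    return v_target, m_target, v_region_target

# Hypothetical structure: 500 residues occupying 60,000 cubic angstroms.
v_target, m_target, v_region_target = segmentation_targets(60000.0, 500)
```

For these inputs the targets are roughly 12,000 cubic angstroms above the threshold, split over 10 regions of about 1,200 cubic angstroms each.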

The desired volume ratio of the nth region, vn, to that of the first region, v1, is unity, and the value of this ratio is

$$v_{\rm{ratio}} = v_n{\mathrm{/}}v_1$$
(5)

For a specific threshold, the volumes of regions above the threshold and the median volume vmedian of the first mtarget of these regions, after sorting from largest to smallest, are noted.

The desired median volume vmedian is vregion_target. We use the target function

$$v_{\rm{ratio\_median}} = a \times \begin{cases} v_{\rm{median}}/v_{\rm{region\_target}} & \mathrm{if}\; v_{\rm{median}}/v_{\rm{region\_target}} < 1 \\ v_{\rm{region\_target}}/v_{\rm{median}} & \mathrm{otherwise} \end{cases}$$
(6)

to express this, so that a high value of vratio_median is always preferred. If all regions are about equal in size, then this volume ratio vratio_median is not informative. The weight on the volume ratio is therefore scaled, increasing with variation in the size of regions, using the formula

$$a = v_1{\mathrm{/}}v_{\rm{median}}$$
(7)

where a is expected to increase from a value of 1 if all regions are the same size to larger values if regions are of different sizes, as the largest regions will have volumes greater than the median.

The desired volume of the largest region, v1, is also vregion_target. The target function

$$v_{\rm{ratio\_1}} = \begin{cases} v_1/v_{\rm{region\_target}} & \mathrm{if}\; v_1/v_{\rm{region\_target}} < 1 \\ v_{\rm{region\_target}}/v_1 & \mathrm{otherwise} \end{cases}$$
(8)

expresses this.

Finally, empirically we find that a threshold t on the order of unity is typically optimal (after scaling of the map as described above). We express this with a final ratio,

$$v_{\rm{ratio\_threshold}} = \begin{cases} t & \mathrm{if}\; t < 1 \\ 1/t & \mathrm{otherwise} \end{cases}$$
(9)

where larger values are again desired.

The total score Q for a threshold t is given by

$$Q = Av_{\rm{ratio}} + B \times \left( v_{\rm{ratio\_median}} + v_{\rm{ratio\_1}} \right) \times 2 + Cv_{\rm{ratio\_threshold}}$$
(10)

where A, B, and C have default values that we set by limited experimentation using a few test cases. Values of the threshold t are automatically tested, and the value that maximizes the total score is used.
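The threshold scan can be sketched as follows. This transcribes the expressions for the individual ratios and for Q as written above; it is not the Phenix code. The weights A, B, and C are placeholders (the defaults are not given in the text), the weight a is passed in by the caller, and the table of region volumes per trial threshold is invented for the example.

```python
def fold(x):
    """Map a positive ratio onto (0, 1]: 1 means on target, smaller is worse."""
    return x if x < 1 else 1.0 / x

def total_score(t, v_n, v_1, v_median, v_region_target, a, A=1.0, B=1.0, C=1.0):
    """Score Q for threshold t; A, B, C are placeholder weights."""
    v_ratio = v_n / v_1                                  # symmetry-copy size ratio
    v_ratio_median = a * fold(v_median / v_region_target)
    v_ratio_1 = fold(v_1 / v_region_target)
    v_ratio_threshold = fold(t)
    return (A * v_ratio
            + B * (v_ratio_median + v_ratio_1) * 2
            + C * v_ratio_threshold)

def best_threshold(candidates, score_at):
    """Return the candidate threshold with the highest score."""
    return max(candidates, key=score_at)

# Hypothetical region statistics at three trial thresholds:
# t -> (v_n, v_1, v_median), with a target region volume of 100 (arbitrary units).
observed = {
    0.5: (300.0, 400.0, 350.0),   # threshold too low: regions merge and are too big
    1.0: (95.0, 105.0, 100.0),    # regions close to the target size
    2.0: (10.0, 40.0, 20.0),      # threshold too high: regions fragment
}

def score_at(t):
    v_n, v_1, v_median = observed[t]
    # a is held at 1 here purely to keep the toy example simple.
    return total_score(t, v_n, v_1, v_median, v_region_target=100.0, a=1.0)

best = best_threshold(observed, score_at)
```

With these made-up numbers the middle threshold wins because every folded ratio is close to 1 there; in practice the region volumes at each trial threshold would come from a region-finding pass over the map.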

Once a threshold is chosen and the resulting set of regions of connected density above that threshold are found, these regions are assembled into groups with members related by symmetry (if any is present).

A unique set of density regions is chosen through selection of one region from each symmetry-related group. The choice of regions is optimized to yield a compact structure and high connectivity. The compactness of the structure is represented by the radius of gyration of randomly sampled points from all chosen regions. The connectivity of a set of regions is calculated on the basis of the r.m.s. of the maximum gaps that would have to be spanned to connect each region to one central region, going through any number of regions in between. For any pair of regions, the gap is defined as the smallest distance between randomly sampled points in the two regions. For a set of regions, the overall gap is the largest of the individual gaps that would have to be crossed to go from one region to another, going through any other regions in the process. The overall connectivity score is the r.m.s. of these gaps for connections between each region and one central region.
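The two quantities used to choose the unique set of regions can be sketched as below. The sketch assumes each region is represented by a list of sampled (x, y, z) points and reads "going through any number of regions in between" as choosing the path whose largest single gap is smallest; both are interpretations for illustration, and the function names are hypothetical.

```python
import math

def radius_of_gyration(points):
    """R.m.s. distance of the sampled points from their centroid."""
    n = len(points)
    center = [sum(p[k] for p in points) / n for k in range(3)]
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in points) / n)

def pair_gap(region_a, region_b):
    """Smallest distance between sampled points of two regions."""
    return min(math.dist(p, q) for p in region_a for q in region_b)

def connectivity_score(regions, central=0):
    """R.m.s. of the largest gaps crossed when linking each region to `central`."""
    n = len(regions)
    gap = [[0.0 if i == j else pair_gap(regions[i], regions[j])
            for j in range(n)] for i in range(n)]
    # Floyd-Warshall variant: minimize the largest gap along any path.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(gap[i][k], gap[k][j])
                if via_k < gap[i][j]:
                    gap[i][j] = via_k
    others = [gap[central][j] for j in range(n) if j != central]
    if not others:
        return 0.0
    return math.sqrt(sum(g * g for g in others) / len(others))

# Three single-point regions on a line at x = 0, 3 and 7.
toy_regions = [[(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(7.0, 0.0, 0.0)]]
score = connectivity_score(toy_regions, central=0)
rg = radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
```

In the toy case the third region is reached through the second, so its gap is 4 rather than 7, and the score is the r.m.s. of 3 and 4.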

The central region and all the other regions are chosen to minimize both the connectivity score and the radius of gyration. The relative weighting of the two scores is determined by a user-definable weight on the radius of gyration, weight_rad_gyr, which has a default value of 0.1. This weight is then normalized to the size of the molecule by multiplication by the largest cell dimension divided by 300 Å. (The value of 300 Å is arbitrary; the key relationship is that the radius of gyration scales with the size of the molecule.)

The goal of the segmentation procedure described so far is to yield density corresponding to a single molecular unit. If the segmentation procedure yields only density corresponding to parts of several molecular units, then complete chain tracing will not be possible. To increase the proportion of complete chains, regions neighboring the initial regions that would lead to an increase in the overall radius of gyration of 1 Å or less are added to the segmented region.

### Chain types to be examined

The chain types (protein, RNA, DNA) to be tested in model building were automatically deduced from the contents of the sequence file with the Phenix method phenix.guess_chain_types_from_sequences.

### Protein model building

In our new core method for protein model building, the phenix.trace_chain algorithm24 is used to build a polypeptide backbone through a map following high contour levels. These preliminary models are then improved by automatic iterative identification of secondary structure and refinement of the models including hydrogen-bond restraints representing this secondary structure with the phenix.real_space_refine approach12. Because the connectivity in a map can sometimes be more evident at lower resolution, a series of maps blurred with different values of a blurring exponential factor Bblur ranging from 0 to 105 Å² (eight choices in increments of 15 Å²) are created, and each one is used in chain tracing.
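The series of blurring factors, and their effect on Fourier amplitudes, can be illustrated as follows. The attenuation formula is the conventional isotropic B-factor form exp(-B / (4 d²)) for a coefficient at resolution d; whether Phenix applies exactly this expression is an assumption here, and `blur_factor` is a hypothetical helper.

```python
import math

# The eight blurring factors: 0, 15, ..., 105 (square angstroms).
blur_values = list(range(0, 106, 15))

def blur_factor(b_blur, d_spacing):
    """Amplitude scale exp(-B / (4 d^2)) at resolution d (angstroms)."""
    return math.exp(-b_blur / (4.0 * d_spacing ** 2))

# Attenuation of a 3 A Fourier coefficient for each member of the series.
attenuation_at_3A = [blur_factor(b, 3.0) for b in blur_values]
```

Larger blurring factors suppress high-resolution terms more strongly, which is why the more heavily blurred maps emphasize overall connectivity over fine detail.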

Automatic identification of secondary structure is carried out via a feature-based method that is relatively insensitive to large errors. Helices are identified from a helical geometry of Cα positions. Segments six or more residues in length are considered, and Cα,i → Cα,i+3 and Cα,i → Cα,i+4 vectors are calculated. The helical rise from a Cα atom is taken as the mean value of the length of the corresponding Cα,i → Cα,i+3 and Cα,i → Cα,i+4 vectors, divided by the mean number of intervening residues (3.5). The overall helical rise is taken as the average rise over all suitable Cα atoms, and the segment is rejected if the helical rise is not within 0.5 Å of the target of 1.54 Å. Then the mean of all Cα,i → Cα,i+3 vectors is computed, and the segment is considered helical if the mean dot product of individual vectors with the overall mean vector is at least 0.9 and no individual dot product is less than 0.3 (throughout this work, parameters are user-adjustable, but the values specified were used for all the work described here).
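The helical-rise criterion can be sketched as below. Only the rise test is shown (segment length of at least six, target 1.54 Å, tolerance 0.5 Å); the direction test based on dot products is omitted. The ideal-helix generator uses standard α-helix geometry (radius 2.3 Å, rise 1.5 Å, 100° per residue) purely to provide test coordinates, and all function names are hypothetical.

```python
import math

def helical_rise(ca):
    """Mean rise per residue from i->i+3 and i->i+4 C-alpha distances."""
    rises = []
    for i in range(len(ca) - 4):
        d3 = math.dist(ca[i], ca[i + 3])
        d4 = math.dist(ca[i], ca[i + 4])
        # Mean of the two lengths divided by the mean residue offset (3.5).
        rises.append(0.5 * (d3 + d4) / 3.5)
    return sum(rises) / len(rises)

def passes_rise_test(ca, target=1.54, tolerance=0.5, min_length=6):
    """True if the segment is long enough and its rise is near the target."""
    if len(ca) < min_length:
        return False
    return abs(helical_rise(ca) - target) <= tolerance

def ideal_helix(n, radius=2.3, rise=1.5, twist_deg=100.0):
    """C-alpha coordinates of an idealized alpha-helix along the z axis."""
    return [(radius * math.cos(math.radians(i * twist_deg)),
             radius * math.sin(math.radians(i * twist_deg)),
             i * rise) for i in range(n)]

helix = ideal_helix(8)
extended = [(3.3 * i, 0.0, 0.0) for i in range(8)]  # fully extended chain
```

The idealized helix gives an estimated rise of about 1.6 Å and passes, whereas the extended chain gives 3.3 Å and is rejected.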

β-sheets are found by identification of parallel or antiparallel extended structure with at least four residues in each segment. For suitable Cα atoms within a single segment, Cα,i → Cα,i+3 vectors are calculated. If the mean length of these vectors is within 1.5 Å of the target of 10 Å, the mean dot product of individual vectors with the overall average is at least 0.75, and no individual dot product is less than 0.5, the segment is considered as a possible strand. Two strands are considered part of the same sheet if the Cα in the strands can be paired 1:1 with r.m.s. side-to-side Cα,i → Cα,i distances of 6 Å or less. As above, all parameters are user-adjustable, but default values were used throughout this work.
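The single-strand test can be sketched in the same way. The thresholds are those quoted above; taking the dot products between unit vectors (rather than raw vectors) is an assumption, strand pairing into sheets is not shown, and `is_possible_strand` is a hypothetical name.

```python
import math

def _unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else [0.0, 0.0, 0.0]

def is_possible_strand(ca, target=10.0, tolerance=1.5,
                       min_mean_dot=0.75, min_dot=0.5, min_length=4):
    """Apply the i->i+3 length and direction tests to C-alpha coordinates."""
    if len(ca) < min_length:
        return False
    vectors = [[b - a for a, b in zip(ca[i], ca[i + 3])]
               for i in range(len(ca) - 3)]
    lengths = [math.sqrt(sum(c * c for c in v)) for v in vectors]
    if abs(sum(lengths) / len(lengths) - target) > tolerance:
        return False
    units = [_unit(v) for v in vectors]
    mean_dir = _unit([sum(u[k] for u in units) / len(units) for k in range(3)])
    dots = [sum(a * b for a, b in zip(u, mean_dir)) for u in units]
    return sum(dots) / len(dots) >= min_mean_dot and min(dots) >= min_dot

# Zigzag approximating an extended strand: 3.3 A per residue along x.
zigzag = [(3.3 * i, 0.9 if i % 2 == 0 else -0.9, 0.0) for i in range(6)]
compressed = [(1.5 * i, 0.0, 0.0) for i in range(6)]  # spacing too short
```

The zigzag has i→i+3 vectors of about 10 Å that all point nearly the same way, so it qualifies; the compressed chain fails the length test.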

Once helices and parallel or antiparallel strands are identified, the corresponding hydrogen-bonding pattern is used to generate restraints that are used in real-space refinement.

A second method for model-building of proteins, previously used for crystallographic model building, is based on examination of a map, typically at lower resolution, to identify features diagnostic of specific types of secondary structures25.

A third method used to build protein models into segmented regions was the RESOLVE model-building algorithm26 as implemented in phenix.resolve. This model-building method uses template matching to identify secondary structure elements (which can be protein helices or strands). Once secondary structure elements are identified, they are extended with tripeptide (or trinucleotide) fragments to create a full model.

Sequence assignment (matching of the sequence to residues in the model) for protein model building is carried out with the tool phenix.assign_sequence, which was described recently27.

### Nucleic acid model building

We have developed two approaches for nucleic acid model building. The first is related to the template-matching methods for protein model building described above26. Regular A-form RNA helices (or B-form DNA helices) are identified through a convolution-based six-dimensional search of a density map using regular base-paired templates 4 bases in length. Longer single strands from base-paired templates of up to 19 bases are then superimposed on the templates that have been identified, and portions that best match the density map are kept. These helical fragments are extended using libraries of trinucleotides based on 749 nucleotides in six RNA structures determined at resolutions of 2.5 Å or better (PDB 1GID, 1HR2, 3D2G, 4P95, 4PQV, and 4Y1J, using only one chain in each case) and filtered to retain only nucleotides for which the average B-value (atomic displacement parameter) was 50 Å² or lower. We extended in the 5′ direction using trinucleotides by superimposing the C4′, C3′, and C2′ atoms of the 3′ nucleotide base of the trinucleotide to be tested on the corresponding atoms of the 5′ nucleotide in a placed segment, examining the map–model correlation for each trinucleotide, and choosing the one that best matched the map. We used a corresponding procedure for extension in the 3′ direction.

The matching of nucleic acid sequence to nucleotides in the model was carried out with an algorithm similar to the one used in the tool phenix.resolve for protein sequence assignment18. Four to six conformations of each of the bases were identified from the six RNA structures described above. For each conformation of each base, the average density calculated at a resolution of 3 Å from the examples in these structures (after superimposition of nucleotides using the C4′, C3′, and C2′ atoms) was used as a density template for that base conformation. After a nucleotide chain has been built with our algorithm, the map correlation between each of these templates and each position along the nucleotide chain is calculated and used to estimate the probability of each base at each position18. All possible alignments of the supplied sequence and each nucleotide chain built are considered, and the best matches with a confidence of 95% or greater are considered matched. For these matched nucleotides, the corresponding conformation of the matched base is then placed in the model. For residues where no match is found, the best-fitting nucleotide is used. For base-paired nucleotides, the same procedure is carried out, except that pairs of base-paired nucleotides are considered together, which essentially doubles the amount of density information available for sequence identification.

Using a second approach for nucleic acid model building, we built duplex RNA helices directly into the density map with the tool phenix.build_rna_helices. The motivation for this algorithm was that the nucleic-acid-model-building approach described above, which builds the two chains of duplexes separately, frequently resulted in poorly base-paired strands. To build RNA helices directly, we used very similar overall strategies, except that the templates were all base-paired, and base-paired nucleotides were always considered as a single unit. This automatically led to the same favorable base-pairing found in the structures used to derive the templates. The atoms used to superimpose chains and bases were the O4′ and C3′ atoms of one nucleotide and the C1′ atom of the base-paired nucleotide. As in the previous method for sequence assignment, both bases in each base-paired set were considered together, which led again to a substantial increase in map-based information about the identities of bases in the model. Supplementary Fig. 4 illustrates the models obtained with the two methods for a small region of RNA density from the Leishmania ribosome28 at a resolution of 2.5 Å (EMDB-7025).

### Combining model information from different sources and removing overlapping fragments

Model-building into local regions of density and into the entire asymmetric unit of the map normally yielded a set of partially overlapping segments. These segments were refined on the basis of the sharpened map with the Phenix tool phenix.real_space_refine12. The refined segments were scored on the basis of map–model correlation multiplied by the square root of the number of atoms in the segment (related to fragment scoring in RESOLVE model building, except that density at the coordinates of atoms was used instead of map–model correlation in that work26). Then a composite model was created from these fragments, starting from the highest-scoring one and working down, and including only nonoverlapping parts of each new fragment considered, as implemented in the Phenix tool phenix.combine_models. When symmetry was used in reconstruction, all the symmetry-related copies of each fragment were considered for evaluation of whether a particular part of a new fragment would overlap with the existing composite model.
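The scoring and greedy assembly can be sketched as below. Overlap is simplified here to shared entries in a set of occupied "sites" per fragment, which is a stand-in invented for the example; the real overlap test works on atomic positions and symmetry copies. The fragment names and numbers are made up, and this is not the phenix.combine_models implementation.

```python
import math

def fragment_score(cc, n_atoms):
    """Map-model correlation weighted by the square root of the atom count."""
    return cc * math.sqrt(n_atoms)

def build_composite(fragments):
    """Greedy assembly: best-scoring fragment first, keep only new sites.

    Each fragment is a dict with 'name', 'cc', 'n_atoms' and 'sites'
    (a set of hypothetical site identifiers covered by the fragment).
    Returns a list of (name, sites_kept) in the order they were added.
    """
    ranked = sorted(fragments,
                    key=lambda f: fragment_score(f["cc"], f["n_atoms"]),
                    reverse=True)
    occupied = set()
    composite = []
    for frag in ranked:
        new_sites = frag["sites"] - occupied
        if new_sites:
            composite.append((frag["name"], new_sites))
            occupied |= new_sites
    return composite

fragments = [
    {"name": "B", "cc": 0.9, "n_atoms": 25, "sites": set(range(8, 15))},
    {"name": "A", "cc": 0.8, "n_atoms": 100, "sites": set(range(1, 11))},
    {"name": "C", "cc": 0.5, "n_atoms": 16, "sites": {1, 2, 3}},
]
composite = build_composite(fragments)
```

Fragment A scores highest (0.8 × 10 = 8.0) despite its lower correlation because it is larger; B then contributes only the sites A did not cover, and C adds nothing.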

### Construction and refinement of full model including reconstruction symmetry

In cases where symmetry was used during the reconstruction process, we assumed that this symmetry was nearly perfect and applied it to the model that we generated. We began with our model that represented the asymmetric unit of the reconstruction. Then we applied reconstruction symmetry and refined the model against the sharpened map with the Phenix tool phenix.real_space_refine. Finally, we extracted one asymmetric unit of this final model to represent the unique part of the molecule, and wrote out both the entire molecule with symmetry and the unique part.

### Evaluation of model similarity to deposited structures

We developed the Phenix tool phenix.chain_comparison as a way of comparing the overall backbone (Cα or P atoms only) similarity of two models. The unique feature of this tool is that it considers each segment of each model separately so that it does not matter whether the chain is complete or broken into segments. Additionally, the tool can separately identify segments that have matching Cα or P atoms that proceed in the same direction and those that are reversed, as well as those that have insertions and deletions, as is common in low-resolution model building. The phenix.chain_comparison tool also identifies whether the sequences of the two models match by counting the number of matching Cα or P atoms that are associated with matching residue names. These analyses are carried out with a default criterion that Cα or P atoms that are within 3 Å are matching and those further apart are not. This distance is arbitrary but was chosen to allow atoms to match in chains that superimpose secondary structure elements such as helices even if the secondary structure elements do not superimpose exactly.
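The basic matching criterion can be illustrated with a much-simplified sketch: a reference atom counts as reproduced if any model atom lies within 3 Å of it. This ignores segment direction, insertions and deletions, sequence, and symmetry, all of which phenix.chain_comparison handles; the function name and coordinates are invented for the example.

```python
import math

def compare_backbones(reference, model, max_dist=3.0):
    """Return (fraction of reference atoms matched, r.m.s. distance of matches).

    `reference` and `model` are lists of (x, y, z) C-alpha or P positions.
    The r.m.s. value is None when nothing matches.
    """
    if not reference or not model:
        return 0.0, None
    matched = []
    for p in reference:
        nearest = min(math.dist(p, q) for q in model)
        if nearest <= max_dist:
            matched.append(nearest)
    fraction = len(matched) / len(reference)
    rms = math.sqrt(sum(d * d for d in matched) / len(matched)) if matched else None
    return fraction, rms

# Four reference atoms spaced 3.8 A apart; the model reproduces two of them.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, 0.0, 1.0), (7.6, 5.0, 0.0)]
fraction, rms = compare_backbones(reference, model)
```

Here half of the reference atoms are matched, each at a distance of 1 Å, so the fraction reproduced is 0.5 and the r.m.s. difference for matching atoms is 1 Å.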

In comparisons of models corresponding to a reconstruction that has internal symmetry, the appropriate pairs of matching atoms may require application of that symmetry. The phenix.chain_comparison tool allows the inclusion of symmetry in the analysis.

### Datasets used

We selected reconstructions for analysis on the basis of the following:

1. Availability of a reconstruction in the EMDB as of November 2017
2. Resolution of the reconstruction of 4.5 Å or better
3. Presence of a unique deposited model in the PDB matching the reconstruction
4. Consistent resolution in PDB and EMDB
5. Ability to use Phenix tools to automatically extract model and map from PDB and EMDB, apply symmetry if present in the metadata, and write the model

This resulted in 502 map–model pairs extracted from a total of 882 single-particle and helical reconstructions in the EMDB in this resolution range. (Note that only 660 of the 882 have one or more associated PDB entries.)

After our initial analysis, we further excluded reconstructions that had the following characteristics:

1. Map–model correlation for the deposited map and deposited model of less than 0.3 after extraction of map and model and analysis with phenix.map_model_cc (18 reconstructions)
2. Deposited model in the PDB represents less than half of the structure (9 reconstructions)

This yielded the 476 map–model pairs described in this work (Supplementary Data 1). We downloaded the maps from the EMDB13 and used them directly in phenix.map_to_model, with the exception of EMDB-6351. For that map, a pseudo-helical reconstruction29, we were able to deduce reconstruction symmetry only for the part of the map corresponding to the deposited model (PDB 3JAR), so we used the phenix.map_box tool to cut out a box of density around the region defined by the deposited model, analyzed this map, and at the conclusion of the process translated the automatically generated models to match the deposited map.
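The numerical selection and exclusion criteria listed above can be summarized as a simple filter. The sketch below is only an illustration: the `Entry` record, its field names, and the example entries are hypothetical and are not taken from Supplementary Data 1, and the criterion concerning automatic extraction with Phenix tools is not modeled.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    emdb_id: str
    resolution: float        # Å, as reported in the EMDB
    n_pdb_models: int        # deposited PDB models matching the reconstruction
    pdb_resolution: float    # Å, as reported in the PDB
    map_model_cc: float      # deposited map vs deposited model
    fraction_modeled: float  # fraction of the structure in the deposited model


def selected(e, max_resolution=4.5, min_cc=0.3, min_fraction=0.5):
    """Apply the resolution, uniqueness, consistency, and exclusion criteria."""
    return (
        e.resolution <= max_resolution
        and e.n_pdb_models == 1
        and math.isclose(e.pdb_resolution, e.resolution)
        and e.map_model_cc >= min_cc
        and e.fraction_modeled >= min_fraction
    )


# Hypothetical entries, one per outcome.
entries = [
    Entry("hypothetical-1", 3.2, 1, 3.2, 0.75, 0.95),  # kept
    Entry("hypothetical-2", 5.0, 1, 5.0, 0.80, 0.90),  # resolution worse than 4.5 Å
    Entry("hypothetical-3", 3.8, 1, 3.8, 0.21, 0.90),  # map-model correlation below 0.3
    Entry("hypothetical-4", 4.1, 1, 4.1, 0.60, 0.40),  # model covers less than half
    Entry("hypothetical-5", 3.0, 2, 3.0, 0.70, 0.90),  # no unique deposited model
]

kept = [e.emdb_id for e in entries if selected(e)]
print(kept)
```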

### Parameters used when running phenix.map_to_model

All of the reconstructions selected were analyzed with official version 1.13-3015 of Phenix, with all default values of parameters except those specifying file names for the reconstructed map, sequence file, and symmetry information; the resolution of the reconstruction; and any control parameters specific to the computing system and processing approach (for example, the number of processors to use, queue commands, parameters specifying what parts of the calculation to carry out or what parts to combine in a particular job, and the level of verbosity in the output).

The phenix.map_to_model tool allows an analysis to be broken up into smaller tasks, after which all the results are combined to produce essentially the same result as would be obtained if the entire procedure were run in one step. We used this approach for two reasons. First, some of the datasets we analyzed required a very large amount of computer memory in certain stages of the analysis, particularly in the map-segmentation and final model-construction steps. Second, for other datasets such as ribosomes, we were able to speed up the analysis substantially by running smaller tasks on many processors.
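The general pattern of splitting an analysis into independent tasks and combining the results can be sketched as follows. This is a generic illustration using the Python standard library, not the mechanism or parameters that phenix.map_to_model uses; `build_segment` is a hypothetical stand-in for the work done on one part of a map.

```python
from concurrent.futures import ThreadPoolExecutor


def build_segment(segment_id):
    """Hypothetical stand-in for building a model into one map segment."""
    return f"model_for_segment_{segment_id}"


def run_in_one_step(segment_ids):
    """Process every segment sequentially in a single job."""
    return [build_segment(s) for s in segment_ids]


def run_as_tasks(segment_ids, n_workers=4):
    """Process segments as separate tasks, then combine the results.

    Executor.map returns results in input order, so the combined list
    does not depend on which task happens to finish first.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(build_segment, segment_ids))


segments = list(range(6))
combined = run_as_tasks(segments)
print(combined == run_in_one_step(segments))
```

Because each task depends only on its own segment, the combined result here is identical to the single-step result; in a real analysis the final assembly step operates on the whole structure, which is why the text describes the outcome as essentially, rather than exactly, the same.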

### Computation

We carried out most of the analyses in this work on the Grizzly high-performance computing cluster at Los Alamos National Laboratory, typically with one task per node. Some of the analyses were carried out on a dedicated computing cluster at Lawrence Berkeley National Laboratory, and some large analyses in particular were carried out on a machine with 1 TB of memory.

We monitored the CPU use for 175 of our analyses (generally of smaller structures) that were each carried out in a single step on a single machine using a single processor. These analyses took from 15 min to 12 h to complete on the Grizzly cluster. For example, the analysis of EMDB-6630 in Fig. 2c required 3 CPU hours. We also monitored the CPU use for one of the largest structures (EMDB-9565), which required 129 CPU hours to complete.

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

### Code availability

The phenix.map_to_model tool and all other Phenix software are available along with full documentation in source and binary forms from the Phenix website at https://www.phenix-online.org as part of the Phenix software suite30.

## Data availability

The datasets generated and/or analyzed during the current study are available in the Phenix data repository at https://phenix-online.org/phenix_data/terwilliger/map_to_model_2018. All the parameters, including resolution and specifications of reconstruction symmetry used in this work, are available on this site, along with all of the models produced.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

1. Kühlbrandt, W. Science 343, 1443–1444 (2014).
2. Henderson, R. Arch. Biochem. Biophys. 581, 19–24 (2015).
3. Wang, R. Y. et al. Nat. Methods 12, 335–338 (2015).
4. Frenz, B., Walls, A. C., Egelman, E. H., Veesler, D. & DiMaio, F. Nat. Methods 14, 797–800 (2017).
5. Chen, M., Baldwin, P. R., Ludtke, S. J. & Baker, M. L. J. Struct. Biol. 196, 289–298 (2016).
6. DiMaio, F. & Chiu, W. Methods Enzymol. 579, 255–276 (2016).
7. Zhou, N., Wang, H. & Wang, J. Sci. Rep. 7, 2664 (2017).
8. Terashi, G. & Kihara, D. Nat. Commun. 9, 1618 (2018).
9. Terwilliger, T. C., Sobolev, O. V., Afonine, P. V. & Adams, P. D. Acta Crystallogr. D Struct. Biol. 74, 545–559 (2018).
10. Pintilie, G. D., Zhang, J., Goddard, T. D., Chiu, W. & Gossard, D. C. J. Struct. Biol. 170, 427–438 (2010).
11. Terwilliger, T. C., Adams, P. D., Afonine, P. V. & Sobolev, O. V. J. Struct. Biol. 204, 338–343 (2018).
12. Afonine, P. V. et al. Acta Crystallogr. D Struct. Biol. 74, 531–544 (2018).
13. Lawson, C. L. et al. Nucleic Acids Res. 44, D396–D403 (2016).
14. Merk, A. et al. Cell 165, 1698–1707 (2016).
15. Berman, H. M. et al. Nucleic Acids Res. 28, 235–242 (2000).
16. Bartesaghi, A. et al. Science 348, 1147–1151 (2015).
17. Borgnia, M. J. et al. Mol. Pharmacol. 89, 645–651 (2016).
18. Terwilliger, T. C. Acta Crystallogr. D Biol. Crystallogr. 59, 45–49 (2003).
19. Li, Z. et al. Protein Cell 9, 384–388 (2018).
20. Schoebel, S. et al. Nature 548, 352–355 (2017).
21. Wang, R. Y. R. et al. Nat. Methods 12, 335–338 (2015).
22. Pettersen, E. F. et al. J. Comput. Chem. 25, 1605–1612 (2004).
23. DeLaBarre, B. & Brunger, A. T. Acta Crystallogr. D Biol. Crystallogr. 62, 923–932 (2006).
24. Terwilliger, T. C. Acta Crystallogr. D Biol. Crystallogr. 66, 285–294 (2010).
25. Terwilliger, T. C. Acta Crystallogr. D Biol. Crystallogr. 66, 268–275 (2010).
26. Terwilliger, T. C. Acta Crystallogr. D Biol. Crystallogr. 59, 38–44 (2003).
27. Terwilliger, T. C. et al. Acta Crystallogr. D Biol. Crystallogr. 69, 2244–2250 (2013).
28. Shalev-Benami, M. et al. Nat. Commun. 8, 1589 (2017).
29. Zhang, R., Alushin, G. M., Brown, A. & Nogales, E. Cell 162, 849–859 (2015).
30. Adams, P. D. et al. Acta Crystallogr. D Biol. Crystallogr. 66, 213–221 (2010).

## Acknowledgements

This work was supported by the NIH (grant GM063210 to P.D.A. and T.C.T.) and the Phenix Industrial Consortium. This work was supported in part by the US Department of Energy under Contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory. This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the US Department of Energy National Nuclear Security Administration under Contract No. DE-AC52-06NA25396. We thank G. Terashi and D. Kihara for providing the MAINMAST and Rosetta models built into segmented maps.

## Author information

### Affiliations

1. #### Los Alamos National Laboratory, Los Alamos, NM, USA

• Thomas C. Terwilliger
2. #### New Mexico Consortium, Los Alamos, NM, USA

• Thomas C. Terwilliger
3. #### Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

• Pavel V. Afonine & Oleg V. Sobolev

5. #### Department of Physics and International Centre for Quantum and Molecular Structures, Shanghai University, Shanghai, China

• Pavel V. Afonine

### Contributions

P.V.A. and O.V.S. developed core tools for map segmentation and secondary-structure restraints; P.D.A. and P.V.A. developed the real space refinement tools; T.C.T. developed the phenix.map_to_model tool and carried out the analyses; and P.D.A. and T.C.T. supervised the work.

### Competing interests

The authors declare no competing interests.

### Corresponding author

Correspondence to Thomas C. Terwilliger.

## Integrated supplementary information

1. ### Supplementary Figure 1 Automatic map segmentation.

A. Automatic segmentation of the map for β-galactosidase1 at a resolution of 2.2 Å (EMDB entry 2984). Symmetry was identified using the Phenix tool phenix.map_symmetry and specifying the symmetry as D2. The map, the sequence of β-galactosidase and the symmetry file were supplied as inputs to the Phenix tool phenix.map_box with the keyword extract_unique=True. The automatically extracted unique part of the map is shown in red and the entire map is shown in yellow. B. Automatic segmentation of the map for group II chaperonin2 at a resolution of 4.3 Å (EMDB entry 5137). The map, the sequence and the symmetry (D8) were supplied as inputs to the Phenix tool phenix.map_box along with the keyword extract_unique=True. The automatically extracted unique part of the map is shown in red, and chains A (green) and H (blue) of the deposited model (PDB 3LOS) are shown. Graphics created with Chimera3. References: 1. Bartesaghi, A. et al. Science 348, 1147–1151 (2015). 2. Zhang, J. et al. Nature 463, 379–383 (2010). 3. Pettersen, E.F. et al. J. Comput. Chem. 25, 1605–1612 (2004).

2. ### Supplementary Figure 2 Evaluation of model quality, and comparison with deposited models.

Validation measures for deposited models from the PDB (blue columns), automatically generated models (orange columns), and automatically generated models after removal of clashing atoms with the Phenix tool phenix.remove_clashes (gray columns; on average, 0.4% of side chains and 0.4% of full residues were removed). A. Clashscore (bad contacts per 1,000 atoms). B. Map correlation. C. Ramachandran outliers (%). D. Rotamer outliers (%). E. Bond rmsd (Å). F. Angle rmsd (degrees). G. C-β deviations.

3. ### Supplementary Figure 3 Evaluation of model quality as a function of map resolution.

Validation measures for automatically generated models at resolutions of 3.5–4.5 Å (blue) and at 2.2–3.49 Å (orange). A. Clashscore (bad contacts per 1,000 atoms). B. Map correlation. C. Ramachandran outliers (%). D. Rotamer outliers (%). E. Bond rmsd (Å). F. Angle rmsd (degrees). G. C-β deviations.

4. ### Supplementary Figure 4 RNA model building.

RNA model-building into a cut-out region of RNA density from Leishmania ribosome1 at a resolution of 2.5 Å (EMDB entry 7025). A. Model building and real-space refinement using base-paired RNA helices. B. Model building and real-space refinement with the same map using standard nucleic acid model building. Potential hydrogen bonds indicated by dotted red lines. Graphics created with Chimera2. References: 1. Shalev-Benami, M. et al. Nat. Commun. 8, 1589 (2017). 2. Pettersen, E.F. et al. J. Comput. Chem. 25, 1605–1612 (2004).

## Supplementary information

1. ### Supplementary Text and Figures

Supplementary Figures 1–4, Supplementary Table 1 and Supplementary Results

3. ### Supplementary Data 1

Summary data for all map_to_model results

4. ### Supplementary Data 2

Summary data for Figure 2 map_to_model results

### DOI

https://doi.org/10.1038/s41592-018-0173-1