De novo main-chain modeling for EM maps using MAINMAST

An increasing number of protein structures are determined by cryo-electron microscopy (cryo-EM) at near-atomic resolution. However, tracing the main-chains and building full-atom models from EM maps of ~4–5 Å is still not trivial and remains a time-consuming task. Here, we introduce a fully automated de novo structure modeling method, MAINMAST, which builds three-dimensional models of a protein from a near-atomic resolution EM map. The method directly traces the protein’s main-chain and identifies Cα positions as tree-graph structures in the EM map. MAINMAST performs significantly better than existing software in building global protein structure models on data sets of 40 simulated density maps at 5 Å resolution and 30 experimentally determined maps at 2.6–4.8 Å resolution. In another benchmark of building missing fragments in protein models for EM maps, MAINMAST builds fragments 11–161 residues long with an average RMSD of 2.68 Å.


Supplementary
The contour level of a map is the density contour recommended by the map's authors for capturing the shape of the target protein. a) The RMSD of the top 1 model and of the best among the top 10 main-chain models from MAINMAST, as well as the results after the PULCHRA-MDFF refinement, are shown. The coverage value is for the top 1 model after the refinement. Coverage is defined as the fraction of residues in the native structure that are closer than 3.0 Å to the model when superimposed. b) Results of Rosetta with a 0.8 consensus setting are shown, because this setting gave overall better results than the default setting, as shown in Figure S2.
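The two RMSD summaries in note a) can be illustrated with a minimal sketch. It assumes the models are already ordered from best-scoring to worst-scoring and that an RMSD to the native structure has been computed for each; the function name and the example values are hypothetical and are not taken from the benchmark.

```python
def summarize_ranked_rmsds(rmsds_by_rank, top_n=10):
    """Return (top-1 RMSD, best RMSD among the first top_n models).

    rmsds_by_rank: RMSD values in Å, ordered from the best-scoring
    model to the worst-scoring one. The scoring itself is outside
    this sketch.
    """
    if not rmsds_by_rank:
        raise ValueError("need at least one model")
    top1 = rmsds_by_rank[0]
    # Lowest RMSD among the top_n highest-ranked models only.
    best_of_top_n = min(rmsds_by_rank[:top_n])
    return top1, best_of_top_n


# Hypothetical RMSDs for 11 ranked models; the 11th is outside the top 10.
example = [4.2, 3.1, 5.0, 2.7, 6.3, 4.8, 3.9, 5.5, 4.4, 3.3, 1.9]
top1, best10 = summarize_ranked_rmsds(example)
print(top1, best10)
```

The last model has the lowest RMSD overall but is ranked 11th, so it does not count toward the "best among top 10" value.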

Supplementary Figure 1. Comparison of MAINMAST and Pathwalking models for the 40 simulated maps.
Modeling results for the 40 simulated maps by MAINMAST in comparison with the 2012 and 2016 versions of Pathwalking. a, local RMSD and b, structure overlap of the MAINMAST models compared with the Pathwalking models, computed with the CLICK server. For the Pathwalking algorithm, data are taken from the two publications, in 2012 and 2016. For the MAINMAST results, the model with the best threading score among the 2688 generated models was used. Structure overlap by CLICK in panel b is defined as the percentage of residues in a structure placed within 3.5 Å of residues in the other superimposed structure.

Supplementary Figure 2. Top 1 Rosetta models for the 40 simulated maps.
a, coverage (the fraction of residues in the native structure that have some residues in the corresponding model within 2.0 Å) of the models relative to the precision, which is defined as the fraction of residues in a model that are placed within 2.0 Å of some residues in the native structure; b, the model coverage plotted against the percentage of the full protein length covered by the fragments assigned in a map. The observed strong correlation indicates that most of the assigned fragments are accurate but that regions modelled to fill gaps between the fragments tend not to be accurate. c, precision (within 2.0 Å) of full-residue models relative to the precision (within 2.0 Å) of assigned fragments. The diagonal line is y=x. Most of the fragments are assigned with relatively high accuracy, over 0.8. However, the subsequent full-residue modeling step could not make good models because it failed to fill the gaps precisely. It was also observed that the subsequent modeling step moved some precisely assigned fragments to a wrong place while building a full-residue model. d, Cα RMSD (Å) of the full-residue model plotted relative to the model coverage.
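The coverage and precision definitions in panel a can be written as one asymmetric distance test applied in both directions. The following is a minimal sketch, assuming each structure is given as a list of (x, y, z) coordinates in Å, one per residue, and that the two structures are already superimposed; the function names and coordinates are illustrative only.

```python
import math


def fraction_within(query, reference, cutoff):
    """Fraction of points in `query` with some `reference` point within cutoff (Å)."""
    if not query:
        return 0.0
    hits = sum(
        1 for q in query
        if any(math.dist(q, r) <= cutoff for r in reference)
    )
    return hits / len(query)


def coverage_and_precision(native, model, cutoff=2.0):
    """Coverage counts native residues near the model; precision counts
    model residues near the native structure."""
    coverage = fraction_within(native, model, cutoff)
    precision = fraction_within(model, native, cutoff)
    return coverage, precision


# Hypothetical coordinates: four native residues, three model residues.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.5, 0.0, 0.0), (3.8, 1.0, 0.0), (20.0, 0.0, 0.0)]
coverage, precision = coverage_and_precision(native, model)
print(coverage, precision)
```

A short but accurate model scores high precision and low coverage, which is why the two quantities are plotted against each other.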

Supplementary Figure 3. C RMSD (Å) of models by Rosetta for the 30 real maps.
Modeling results for the 30 maps using Rosetta in its default setting (x-axis) and Rosetta (0.8 consensus), where in the Monte Carlo fragment assembly, structures of local regions are kept if 80% of the assigned fragments are placed at the same position. The best RMSD model among all produced (filled circles) and the best RMSD model among top 10 scoring models (empty circles) are shown. Two empty and two filled circles on the right upper corner indicate that there were two structures that were not modelled (no output structures) by both default (filled circles) and with the 0.8 consensus (empty circles) Rosetta. Two more circles filled and empty each indicate that Rosetta (default) did not produce models but Rosetta with 0.8 consensus produced models with an RMSD of about 30 Å for the targets.
In the main text we showed the results by Rosetta with 0.8 consensus because overall this setting produced better (lower RMSD models) for more cases, which are represented by points below the diagonal line.
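The 0.8 consensus rule described in the caption can be illustrated with a small voting sketch. This is not Rosetta code: the data layout (one dictionary per Monte Carlo trajectory mapping a local region to a placement label) and all names are assumptions made for illustration, and the fraction is taken over the trajectories in which the region was assigned.

```python
from collections import Counter


def consensus_regions(placements_per_trajectory, threshold=0.8):
    """Keep a region if at least `threshold` of its assigned fragments
    share the same placement.

    placements_per_trajectory: list of dicts, each mapping a region id
    to a hashable placement label (hypothetical representation).
    Returns a dict of region id -> agreed placement.
    """
    counts = {}
    for trajectory in placements_per_trajectory:
        for region, placement in trajectory.items():
            counts.setdefault(region, Counter())[placement] += 1

    kept = {}
    for region, counter in counts.items():
        placement, votes = counter.most_common(1)[0]
        total = sum(counter.values())  # assignments made for this region
        if votes / total >= threshold:
            kept[region] = placement
    return kept


# Five hypothetical trajectories: region "A" agrees in 4 of 5, "B" in 3 of 5.
trajectories = [
    {"A": "p1", "B": "p1"},
    {"A": "p1", "B": "p1"},
    {"A": "p1", "B": "p2"},
    {"A": "p1", "B": "p2"},
    {"A": "p2", "B": "p1"},
]
result = consensus_regions(trajectories)
print(result)
```

Region "A" reaches exactly 80% agreement and is kept, while region "B" reaches only 60% and is dropped.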
First, fragment structures for the query protein were generated on the Robetta website: http://robetta.bakerlab.org/fragmentqueue.jsp

1. Local fragment search in an input EM map using denovo_density:
This procedure searches the density map for each sequence-predicted backbone fragment generated in the previous step.

3. Monte Carlo fragment assembly using denovo_density:
This step generates a "maximally consistent" fragment assembly in the map.

Consensus assignment using denovo_density:
This step identifies the consensus assignment from the lower-scoring Monte Carlo trajectories.
If fewer than 70% of the backbone residues of the target protein are assigned, or the coverage has not converged, we iterate the four (1-4) steps.
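The iteration rule can be sketched as a simple predicate. The 70% threshold comes from the text; the convergence test (change in coverage between the last two rounds below a tolerance) and the tolerance value are assumptions, since the text does not state how convergence was judged.

```python
def needs_another_round(assigned_residues, target_length, coverage_history,
                        min_assigned_fraction=0.70, tolerance=0.01):
    """Return True if the four denovo_density steps should be repeated.

    coverage_history: assigned fraction after each completed round.
    tolerance: hypothetical convergence criterion, not from the source.
    """
    assigned_fraction = assigned_residues / target_length
    if assigned_fraction < min_assigned_fraction:
        return True
    if len(coverage_history) < 2:
        # Convergence cannot be judged from a single round.
        return True
    change = abs(coverage_history[-1] - coverage_history[-2])
    return change > tolerance


print(needs_another_round(60, 100, [0.50, 0.60]))   # too few residues assigned
print(needs_another_round(80, 100, [0.60, 0.80]))   # coverage still changing
print(needs_another_round(80, 100, [0.80, 0.80]))   # enough assigned and stable
```

Either condition alone triggers another round, matching the "or" in the rule above.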

Running RosettaCM
Following Wang, R. Y. R., Kudryashev, M., Li, X., Egelman, E. H., Basler, M., Cheng, Y., Baker, D., & DiMaio, F. (2015), this step fills the gaps where denovo_density did not assign fragments, which completes the model, and refines the overall model structure.