Automatic protein structure solution from weak X-ray data

Skubák, Pavol; Pannu, Navraj S.

doi:10.1038/ncomms3777


Article
Open access
Published: 15 November 2013

Nature Communications volume 4, Article number: 2777 (2013)

Subjects

Materials science

Abstract

Determining new protein structures from X-ray diffraction data at low resolution or with a weak anomalous signal is a difficult and often an impossible task. Here we propose a multivariate algorithm that simultaneously combines the structure determination steps. In tests on over 140 real data sets from the protein data bank, we show that this combined approach can automatically build models where current algorithms fail, including an anisotropically diffracting 3.88 Å RNA polymerase II data set. The method seamlessly automates the process, is ideal for non-specialists and provides a mathematical framework for successfully combining various sources of information in image processing.

Introduction

X-ray diffraction of macromolecular crystals does not provide a direct image of a molecule. The macromolecule’s electron density can be computationally constructed by exploiting the anomalous signal from heavy atoms, such as selenium atoms incorporated into a molecule of unknown fold. For data sets with a strong anomalous signal diffracting to resolutions better than 3 Å, current computational methods can usually build an atomic model automatically. Yet determining crystal structures of large macromolecular assemblies or membrane proteins, which tend to diffract to lower resolutions, is difficult: it involves manually iterating over the different steps in the structure solution process¹ and may still not lead to an interpretable electron density map. Even at higher resolutions, diffraction data containing a weak anomalous signal can elude current computational methods and may require more data from other crystals². Here we propose a new method that combines the traditional structure solution steps to push the limits of computational techniques.

Currently, the process of solving a macromolecular crystal structure of unknown fold from X-ray data consists of distinct steps (Fig. 1a). In experimental phasing, crystallographic phase estimates are calculated by exploiting the signal from an anomalous substructure. An initial experimental electron density is constructed from these phase estimates and the X-ray data. Next, expected features of macromolecular electron density, such as the flatness of solvent regions, are imposed on the experimental electron density to improve its quality. This density-modified map is typically combined with the initial experimental density map in phase combination. Finally, the resulting electron density is used to iteratively build and refine a model of the macromolecule.

**Figure 1: The current and new combined approach for structure solution.**

After the experimental density is constructed, information about the unknown phase and its accuracy is often ignored or approximated, and statically propagated to the steps of phase combination and model refinement via Hendrickson–Lattman coefficients³. We have previously demonstrated that using the experimental data and anomalous substructure directly in phase combination⁴ and model refinement⁵ via step-specific multivariate distributions can improve the individual steps.

Here we present a novel combined multivariate probability function (Equation (2), see Methods section) that directly considers phase information from the experimentally collected X-ray data, and simultaneously combines it with the information from density modification and model building into a single unified process (Fig. 1b). The unified process consists of iterative minimization of the minus log-likelihood of the new combined probability distribution in reciprocal space, followed by current density modification and model-building procedures in crystal space. Thus, the structure solution process no longer relies on successive stepwise approximations of the experimental data. The full power of the new method is obtained by simultaneously considering the anomalous substructure, density-modified electron density map and partial protein model. If only the substructure is available, the new combined function elegantly reduces to the previously described experimental phasing function (Equation (3)). Similarly, when only the substructure and electron density are available, the combined function simplifies to the step-specific phase combination function. After a partial protein model has been built, the full combined probability distribution is used and all the information is exploited simultaneously. Results from our large collection of real data sets, described below, show that the best performance and efficiency of the new algorithm can be achieved by skipping the model-building step for some iterations (Fig. 2); currently, model building is first performed after 20 iterations and then repeated every eighth iteration. The total number of iterations is chosen automatically by the algorithm; more iterations are run for weaker signals.
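The iteration schedule described above can be sketched in a few lines of Python. This is only an illustration of the stated defaults (first model building after 20 iterations, then every eighth iteration); the function name `combined_schedule`, its parameter names and the step labels are hypothetical, and reading "after 20 iterations" as "at iteration 20" is an assumption. The automatic choice of the total number of iterations is not modelled here.

```python
def combined_schedule(total_iterations, first_build=20, build_every=8):
    """Return a list of (iteration, steps) pairs for an illustrative schedule.

    Every iteration minimises the combined minus log-likelihood in reciprocal
    space and then runs density modification in real space. Model building is
    added at iteration `first_build` and every `build_every` iterations after
    that (assumed interpretation of the text).
    """
    plan = []
    for it in range(1, total_iterations + 1):
        steps = [
            "minimise combined -log likelihood (reciprocal space)",
            "density modification (real space)",
        ]
        if it >= first_build and (it - first_build) % build_every == 0:
            steps.append("model building (real space)")
        plan.append((it, steps))
    return plan


if __name__ == "__main__":
    # Iterations at which model building would run in a 50-iteration job.
    building = [it for it, steps in combined_schedule(50) if len(steps) == 3]
    print(building)
```

With these assumed defaults, a 50-iteration run would build a model at iterations 20, 28, 36 and 44, so most iterations spend their time only on the cheaper reciprocal-space and density modification steps.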

**Figure 2: Detailed diagram for the combined algorithm.**

Current automated structure solution systems use a stepwise approach (Fig. 1a) for structure solution, but can use different programmes or parameters for experimental phasing, density modification or model building. To assess the power of the new algorithm (Fig. 1b) objectively and in a controlled fashion, we compare it against the stepwise algorithm using the same programmes and parameters. For both approaches, real-space density modification is performed in PARROT⁶ (version 1.0.2) and automated model building is performed in BUCCANEER⁷ (version 1.5.2). The new combined function, implemented in REFMAC⁸, simultaneously uses the information from real-space density modification and model building, whereas the current approach uses this information separately in stepwise phase combination and model refinement functions, also implemented in REFMAC. The automated structure solution package CRANK⁹ (version 2.0.0) is used to link these programmes for both approaches in this test. To assess the performance of the combined structure solution approach against another automated package, we also compare the new method against the default, recursive, stepwise approach of PHENIX AutoSol¹⁰ (version 1.8.2-1309). We find that the new combined algorithm performs significantly better both in the controlled test and in the comparison with PHENIX, and that it builds many models automatically where the current approaches fail.

Results

Large-scale and controlled comparison

We test the performance and robustness of the new method for combined structure solution against the current stepwise approach on 147 single-wavelength anomalous diffraction (SAD) data sets spanning a wide range of resolutions, from 0.94 to 3.88 Å, and of anomalous scatterers, including selenium, sulphur, chloride, iodide, bromide, calcium and zinc. Figure 3 compares the fraction of each of the 147 models automatically built within 1 Å of the deposited structure by the combined method (y axis) and by the stepwise approach (x axis), both implemented in CRANK. The cluster of points in the lower left corner of the plot represents the data sets for which no model can be built, usually because the heavy atom substructure could not be found. The data sets providing 85–100% complete models for both methods are depicted in the upper right corner. Finally, the ‘Pushing the limits’ cluster shows the numerous data sets for which a partial model or no model was built using the current stepwise method, but a nearly complete model was built with the new combined method.

**Figure 3: Comparison of CRANK's stepwise and combined approach.**

For all data sets, the average fraction of model correctly built increases from 60% to 74%. If we exclude the data sets built to at least 85% completeness by the stepwise method and data sets where the heavy atom substructure could not be found, 45 data sets remain with 28% of the model correctly built on average by the stepwise approach and 77% by the combined algorithm.

Large-scale comparison with PHENIX

Figure 4 shows the fraction of each of the 147 models automatically built to within 1 Å of the deposited structure by the combined method and by the PHENIX AutoSol software. The results are similar to those of Fig. 3, showing many data sets significantly above the diagonal line, for which no model or only a partial model is built using PHENIX but a nearly complete model is built with the combined method. Unlike in the controlled comparison of the stepwise and combined algorithms shown in Fig. 3, we cannot draw direct conclusions about the performance of the underlying algorithms: although PHENIX also uses a stepwise approach, it employs different programmes and parameters from those used in CRANK. We can only conclude that, with the default settings, more structures are built automatically with the combined approach in CRANK than with PHENIX for the random sample of 147 data sets: the average fraction correctly built increases from 59% to 74%.

**Figure 4: Comparison of PHENIX and CRANK’s combined approach.**

RNA polymerase II data set (3.88 Å)

The performance of the new method at low resolution can be demonstrated on the 12-subunit RNA polymerase II SAD data set, which diffracts anisotropically to 3.88 Å (ref. 11) and contains 3,950 residues in the asymmetric unit. The original authors could build the structure neither automatically nor manually from the SAD data set collected: structure solution was performed by a combination of multicrystal, multiple-wavelength anomalous diffraction phasing from five crystals, molecular replacement from a partial model and manual iterative model building and refinement. The combined method automatically builds a majority of the protein backbone solely from the single SAD data set and the anomalous signal of eight intrinsic zinc atoms. The quality of the automatically built structure is evident from the R-free value¹² of 37.6%. Figure 5 shows the agreement between the final and automatically built models, and the resulting electron density for a part of the RNA polymerase II molecule. Of the Cα positions, 67% were traced to 2 Å precision and 82% to 2.5 Å precision, whereas 9% were placed incorrectly (see the Methods section for the definition we use to assess model-building quality, which also requires a neighbouring Cα atom to be correctly placed).

**Figure 5: RNA polymerase II electron density.**

RNA polymerase clamp domain–Spt4/5 data set (3.3 Å)

An RNA polymerase clamp domain–Spt4/5 complex¹³ was built manually from a partial molecular replacement model using a 3.3 Å SAD data set containing an anomalous signal from intrinsic zinc atoms. The original authors could not obtain an interpretable electron density map from the anomalous signal alone. However, when using the combined algorithm, 77% of the deposited model backbone residues were automatically and correctly built according to the 2-Å criterion. Figure 6a,b shows the deposited and automatically built structures, respectively. Table 1 shows model refinement and building statistics for these two low-resolution data sets.

**Figure 6: Deposited and automatically built clamp domain–Spt4/5 models.**

**Table 1: Model-building statistics for the highlighted data sets.**

Discussion

The presented results demonstrate that the current limits of X-ray crystallography can be significantly extended by the synergistic effect of simultaneously combining the steps. Although the use of the combined method does not improve the automated structure solution if a substructure could not be found or if a nearly complete structure can be built by the current methods, its use substantially improves the automated model building limited by a weak anomalous signal or a low resolution.

The mathematical framework presented here is not limited to X-ray crystallography, but can be applied to other techniques such as cryo-electron microscopy, where a related maximum likelihood analysis¹⁴ can be generalized and combined with, for example, model building¹⁵, while considering the observed experimental data/images directly. Both CRANK and REFMAC are open-source packages and these latest developments will be available from CCP4 (http://www.ccp4.ac.uk/).

Methods

Testing methodology

The new function and algorithm have been tested on 147 real SAD data sets mainly composed of the same data sets used previously⁴: all data sets are listed in Supplementary Table S1. The diffraction data, the sequence of the protein monomer, the f′ and f′′ values for the substructure atoms and the substructure as determined by SHELXC and SHELXD¹⁶, or AFRO⁹ and CRUNCH2 (ref. 17), were input to PHENIX and to CRANK’s stepwise and combined pipelines. All three approaches were run with default settings.

The combined algorithm and the PHENIX AutoSol software automatically choose the number of model-building cycles (the current defaults for the CRANK implementation of the combined algorithm are a minimum of 5 and a maximum of 50 building cycles). In CRANK’s stepwise approach, the density-modified map from PARROT is input to model building by BUCCANEER, which is iterated 50 times with refinement by the multivariate SAD function⁵ in REFMAC (Equation (3)). If the fraction of model built after the first 5 building cycles was higher than after 50 cycles, it was used for comparison with the combined algorithm’s results; otherwise, the fraction built after 50 cycles was used.

The quality of the protein models built is expressed as the fraction of the Protein Data Bank-deposited model backbone that is ‘correctly built’. In the large-scale testing on 147 data sets, a residue is considered correctly built if its Cα position is at most 1 Å from a deposited model Cα (Cα-deposited) position. For the highlighted low-resolution cases, a 2-Å criterion¹⁸ is used, as 1 Å is a minimal estimate of the coordinate uncertainty at 4 Å resolution¹⁹. However, we add the requirement that a neighbouring Cα position must be at most 2 Å from a neighbour of the Cα-deposited position for the residue to be considered correctly built. A residue is considered incorrectly built if it is >2.5 Å from the nearest Cα-deposited position. A residue is also considered incorrectly built if it is <2.5 Å from the Cα-deposited position but none of its neighbouring Cα positions is closer than 2.5 Å to a neighbour of that Cα-deposited position.
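The ‘correctly built’ criteria can be illustrated with a short, self-contained Python sketch. It is not the code used for the tests: the function names are hypothetical, each model is assumed to be a single chain given as an ordered list of Cα coordinates in ångströms, neighbours are assumed to be the sequence-adjacent residues, and crystallographic symmetry and origin shifts are ignored. Counting each deposited residue at most once is also an assumption.

```python
import math


def nearest(point, deposited):
    """Return (index, distance) of the deposited C-alpha nearest to `point`."""
    idx = min(range(len(deposited)), key=lambda i: math.dist(point, deposited[i]))
    return idx, math.dist(point, deposited[idx])


def neighbours(chain, i):
    """Sequence-adjacent C-alpha positions of residue i (assumed definition)."""
    return [chain[j] for j in (i - 1, i + 1) if 0 <= j < len(chain)]


def correctly_built(built, deposited, cutoff, need_neighbour):
    """Indices of deposited residues matched by a correctly built residue."""
    matched = set()
    for i, ca in enumerate(built):
        j, d = nearest(ca, deposited)
        if d > cutoff:
            continue
        if need_neighbour:
            # At least one built neighbour must lie within the cutoff of a
            # neighbour of the matched deposited residue.
            ok = any(
                math.dist(nb, dn) <= cutoff
                for nb in neighbours(built, i)
                for dn in neighbours(deposited, j)
            )
            if not ok:
                continue
        matched.add(j)
    return matched


def fraction_built(built, deposited, cutoff=1.0, need_neighbour=False):
    """Fraction of the deposited backbone counted as correctly built."""
    return len(correctly_built(built, deposited, cutoff, need_neighbour)) / len(deposited)


if __name__ == "__main__":
    deposited = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    built = [(0.5, 0.0, 0.0), (4.0, 0.3, 0.0), (20.0, 0.0, 0.0)]
    print(fraction_built(built, deposited))                                   # 1 A criterion
    print(fraction_built(built, deposited, cutoff=2.0, need_neighbour=True))  # 2 A + neighbour
```

In this toy example two of the four deposited residues are matched under either criterion, giving a fraction of 0.5. The neighbour requirement matters for isolated residues: a single well-placed Cα with no correctly placed neighbour counts under the plain distance test but not under the low-resolution criterion.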

The combined likelihood function

To apply a maximum likelihood analysis that combines the information from the different steps in macromolecular X-ray crystallography and incorporates the observed experimental diffraction data directly, the multivariate probability distribution of the observed SAD structure factor amplitudes (|F_O⁺|, |F_O⁻|), given the partial (anomalous and/or non-anomalous) calculated structure factors (F_C⁺, F_C⁻) and density modification structure factors (F_DM = |F_DM| exp(iα_DM)), is required. Here the subscripts O, C and DM denote observed, partial anomalous and/or non-anomalous calculated, and density modification structure factors, respectively, and the + and − superscripts denote the Friedel pairs. To derive the above distribution, the starting point is the multivariate distribution of structure factors:

The distribution for equation (1) is well approximated by a complex multivariate Gaussian distribution via the Central Limit Theorem. After transforming the multivariate complex Gaussian to polar coordinates and integrating out the unknown ‘observed’ structure factor phases, the required distribution is obtained:

In equation (2), a_ij is the ijth element of the inverse of the full 5 × 5 covariance matrix Σ₅ and c_ij is the ijth element of the model 3 × 3 (Σ₃) submatrix of Σ₅. If the density modification structure factor, F_DM, is not available, the equation reduces to the previously described multivariate function for SAD-based model refinement⁵:

In equation (3), a_ij is the ijth element of the inverse of the 4 × 4 covariance matrix Σ₄ that is a submatrix of Σ₅ and c_ij is the ijth element of the model 2 × 2 (Σ₂) submatrix of Σ₄. If the partial structure factors consist only or mainly of contributions from anomalous atoms, such as those found in substructure detection, equation (3) reduces to the previously described function for multivariate heavy atom refinement and phasing, differing only in the definition of the covariance matrix Σ₄. Similarly, if only density modification structure factors and anomalous atoms are available for calculation of F_C⁺ and F_C⁻, equation (2) reduces to the previously described multivariate phase combination function⁴, differing in the definition of the covariance matrix Σ₅. These special cases have all been implemented in the programme REFMAC.
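The general shape of the derivation described in this section can be sketched with the standard results for a zero-mean circular complex Gaussian. The block below is a generic outline and not a reproduction of equations (1)–(3): the generic labels F_1, ..., F_5 are introduced here for illustration, their mapping onto (F_O⁺, F_O⁻) and (F_C⁺, F_C⁻, F_DM), including any complex conjugation of the Friedel mates, is an assumption, and acentric reflections are assumed.

```latex
% Zero-mean circular complex Gaussian for n structure factors F = (F_1, ..., F_n)^T
P(\mathbf{F}) = \frac{1}{\pi^{n}\det\Sigma_n}
  \exp\!\left(-\mathbf{F}^{\dagger}\Sigma_n^{-1}\mathbf{F}\right),
\qquad \Sigma_n = \langle \mathbf{F}\mathbf{F}^{\dagger} \rangle

% Condition the two "observed" entries on the three known ones, G = (F_3, F_4, F_5)^T,
% where \Sigma_3 is the covariance of G (a 3 x 3 submatrix of \Sigma_5)
P(F_1, F_2 \mid \mathbf{G})
  = \frac{P(F_1,\dots,F_5)}{P(\mathbf{G})}
  = \frac{\det\Sigma_3}{\pi^{2}\det\Sigma_5}
    \exp\!\left(-\mathbf{F}^{\dagger}\Sigma_5^{-1}\mathbf{F}
                + \mathbf{G}^{\dagger}\Sigma_3^{-1}\mathbf{G}\right)

% Polar coordinates F_k = |F_k| e^{i\alpha_k}; integrate out the unknown phases
P(|F_1|, |F_2| \mid \mathbf{G})
  = |F_1|\,|F_2| \int_0^{2\pi}\!\!\int_0^{2\pi}
    P(F_1, F_2 \mid \mathbf{G})\, d\alpha_1\, d\alpha_2
```

In this outline the exponent is a quadratic form in the elements of the inverse covariance matrices, which is where coefficients such as a_ij enter, and dropping F_5 from the conditioning set gives the analogous ratio with Σ₄ and Σ₂.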

Additional information

How to cite this article: Skubák, P. & Pannu, N. S. Automatic protein structure solution from weak X-ray data. Nat. Commun. 4:2777 doi: 10.1038/ncomms3777 (2013).

References

1. Schroder, G., Levitt, M. & Brunger, A. T. Super-resolution biomolecular crystallography with low-resolution data. Nature 464, 1218–1222 (2010).
2. Liu, Q. et al. Structures from anomalous diffraction data of native biological macromolecules. Science 336, 1033–1037 (2012).
3. Hendrickson, W. A. & Lattman, E. E. Representation of phase probability distributions for simplified combination of independent phase information. Acta Cryst. B26, 136–143 (1970).
4. Skubak, P., Waterreus, W. J. & Pannu, N. S. Multivariate phase combination improves automated crystallographic model building. Acta Cryst. D66, 783–788 (2010).
5. Skubak, P., Murshudov, G. N. & Pannu, N. S. Direct incorporation of experimental phase information in model refinement. Acta Cryst. D60, 2196–2201 (2004).
6. Cowtan, K. Recent developments in classical density modification. Acta Cryst. D66, 470–478 (2010).
7. Cowtan, K. The Buccaneer software for automated model building. 1. Tracing protein chains. Acta Cryst. D62, 1002–1011 (2006).
8. Murshudov, G. N. et al. REFMAC5 for the refinement of macromolecular crystal structures. Acta Cryst. D67, 355–367 (2011).
9. Pannu, N. S. et al. Recent advances in the CRANK software suite for experimental phasing. Acta Cryst. D67, 331–337 (2011).
10. Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Cryst. D66, 213–221 (2010).
11. Meyer, P. A., Ye, P., Zhang, M., Suh, M. H. & Fu, J. Phasing RNA polymerase II using intrinsically bound Zn atoms: an updated structural model. Structure 14, 973–982 (2006).
12. Brunger, A. T. The free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature 355, 472–475 (1992).
13. Martinez-Rucobo, F. W., Sainsbury, S., Cheung, A. C. M. & Cramer, P. Architecture of the RNA polymerase–Spt4/5 complex and basis of universal transcription processivity. EMBO J. 30, 1302–1310 (2011).
14. Sigworth, F. J., Doerschuk, P. C., Carazo, J. M. & Scheres, S. H. An introduction to maximum likelihood methods in cryo-EM. Methods Enzymol. 482, 263–294 (2010).
15. Liu, H. et al. Atomic structure of human adenovirus by cryo-EM reveals interactions among protein networks. Science 329, 1038–1043 (2010).
16. Sheldrick, G. M. A short history of SHELX. Acta Cryst. A64, 112–122 (2008).
17. de Graaff, R. A. G., Hilge, M., van der Plas, J. L. & Abrahams, J. P. Matrix methods for solving protein substructures of chlorine and sulfur from anomalous data. Acta Cryst. D57, 1857–1862 (2001).
18. Brunger, A. T., Adams, P. D., Fromme, P., Fromme, R., Levitt, M. & Schroeder, G. F. Improving the accuracy of macromolecular structure refinement at 7 Å resolution. Structure 20, 957–966 (2012).
19. Brunger, A. T. Free R value: cross-validation in crystallography. Methods Enzymol. 277, 366–396 (1997).
20. Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Cryst. D66, 486–501 (2010).
21. Kraulis, P. J. MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24, 946–950 (1991).

Acknowledgements

Funding for this work was provided by the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO), grant number 700.55.425, Cyttron II (LSH framework: FES0908) and the Collaborative Computational Project 4 (CCP4). P.S. and N.S.P. thank all data set contributors and B. Zagrovic, J.P. Abrahams, A.T. Brunger, K. Cowtan and A. Kuzmanic for useful discussions.

Author information

Authors and Affiliations

Department of Biophysical Structural Chemistry, Leiden University, Einsteinweg 55, Leiden, 2333 CC, The Netherlands
Pavol Skubák & Navraj S. Pannu

Contributions

P.S. and N.S.P. designed the research, analysed the results and wrote the manuscript. P.S. wrote the computer source code and ran the test cases.

Corresponding authors

Correspondence to Pavol Skubák or Navraj S. Pannu.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Table

Supplementary Table S1 (PDF 37 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/

About this article

Received: 21 August 2013
Accepted: 15 October 2013
Published: 15 November 2013
DOI: https://doi.org/10.1038/ncomms3777
