Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs

Abstract

Cryo-electron microscopy is a popular method for the determination of protein structures; however, identifying a sufficient number of particles for analysis can take months of manual effort. Current computational approaches find many false positives and require ad hoc postprocessing, especially for unusually shaped particles. To address these shortcomings, we develop Topaz, an efficient and accurate particle-picking pipeline using neural networks trained with a general-purpose positive-unlabeled learning method. This framework enables particle detection models to be trained with few sparsely labeled particles and no labeled negatives. Topaz retrieves many more real particles than conventional picking methods while maintaining low false-positive rates, is capable of picking challenging unusually shaped proteins (for example, small, non-globular and asymmetric particles), produces more representative particle sets and does not require post hoc curation. We demonstrate the performance of Topaz on two difficult datasets and three conventional datasets. Topaz is modular, standalone, free and open source (http://topaz.csail.mit.edu).

Fig. 1: Topaz particle-picking pipeline using CNNs trained with positive and unlabeled data.
Fig. 2: Reconstructions of the Toll receptor using particles picked by Topaz and DoG and template-based methods.
Fig. 3: Single particle reconstructions from published particles, Topaz particles and Topaz particles with published particles removed.
Fig. 4: Reconstruction resolution and 2D class averages for Topaz particles at decreasing log-likelihood ratio thresholds.
Fig. 5: Comparison of models trained using different objective functions with varying numbers of labeled positives on the EMPIAR-10096 and EMPIAR-10234 datasets.

Data availability

Single-particle half maps, full sharpened maps and masks for T20S proteasome, 80S ribosome, rabbit muscle aldolase and the Toll receptor (DoG, template and Topaz picks) have been deposited in the Electron Microscopy Data Bank (EMDB) under accessions EMD-9194, EMD-9201, EMD-9202, EMD-9206, EMD-9207, EMD-9208, EMD-9209, EMD-9210, EMD-9211, EMD-20529, EMD-20531 and EMD-20532. The full rabbit muscle aldolase dataset has been deposited in the Electron Microscopy Pilot Image Archive (EMPIAR) under accession EMPIAR-10215.

Code availability

Source code for Topaz is publicly available via Code Ocean (ref. 43) and on GitHub at https://github.com/tbepler/topaz. Updates to Topaz will be posted at http://topaz.csail.mit.edu. Topaz is licensed under the GNU General Public License v.3.0.

Change history

  • 11 October 2019

    In the version of this article originally published, scale bars were missing from Supplementary Figs 1–3, 8–10 and 14. This has now been amended and the Supplementary Information file has been updated.

References

  1. Cheng, Y., Grigorieff, N., Penczek, P. A. & Walz, T. A primer to single-particle cryo-electron microscopy. Cell 161, 438–449 (2015).

  2. Stagg, S. M., Noble, A. J., Spilman, M. & Chapman, M. S. ResLog plots as an empirical metric of the quality of cryo-EM reconstructions. J. Struct. Biol. 185, 418–426 (2014).

  3. Rosenthal, P. B. & Henderson, R. Optimal determination of particle orientation, absolute hand, and contrast loss in single-particle electron cryomicroscopy. J. Mol. Biol. 333, 721–745 (2003).

  4. Scheres, S. H. W. Semi-automated selection of cryo-EM particles in RELION-1.3. J. Struct. Biol. 189, 114–122 (2015).

  5. Tang, G. et al. EMAN2: an extensible image processing suite for electron microscopy. J. Struct. Biol. 157, 38–46 (2007).

  6. Roseman, A. M. Particle finding in electron micrographs using a fast local correlation algorithm. Ultramicroscopy 94, 225–236 (2003).

  7. Voss, N. R., Yoshioka, C. K., Radermacher, M., Potter, C. S. & Carragher, B. DoG Picker and TiltPicker: software tools to facilitate particle selection in single particle electron microscopy. J. Struct. Biol. 166, 205–213 (2009).

  8. Zhang, K., Li, M. & Sun, F. Gautomatch: an efficient and convenient GPU-based automatic particle selection program. https://www.mrc-lmb.cam.ac.uk/kzhang/ (2011).

  9. Henderson, R. Avoiding the pitfalls of single particle cryo-electron microscopy: Einstein from noise. Proc. Natl Acad. Sci. USA 110, 18037–18041 (2013).

  10. Subramaniam, S. Structure of trimeric HIV-1 envelope glycoproteins. Proc. Natl Acad. Sci. USA 110, E4172–E4174 (2013).

  11. van Heel, M. Finding trimeric HIV-1 envelope glycoproteins in random noise. Proc. Natl Acad. Sci. USA 110, E4175–E4177 (2013).

  12. Wang, F. et al. DeepPicker: a deep learning approach for fully automated particle picking in cryo-EM. J. Struct. Biol. 195, 325–336 (2016).

  13. Zhu, Y., Ouyang, Q. & Mao, Y. A deep convolutional neural network approach to single-particle recognition in cryo-electron microscopy. BMC Bioinformatics 18, 348 (2017).

  14. Xiao, Y. & Yang, G. A fast method for particle picking in cryo-electron micrographs based on fast R-CNN. AIP Conf. Proc. 1836, 020080 (2017).

  15. Chen, M. et al. Convolutional neural networks for automated annotation of cellular cryo-electron tomograms. Nat. Methods 14, 983–985 (2017).

  16. Li, X.-L. & Liu, B. in Machine Learning: ECML 2005 (eds Gama, J. et al.) 218–229 (Springer, 2005).

  17. Nguyen, M. N., Li, X.-L. & Ng, S.-K. Positive unlabeled learning for time series classification. IJCAI 11, 1421–1426 (2011).

  18. Zhang, J., Wang, Z., Yuan, J. & Tan, Y.-P. Positive and unlabeled learning for anomaly detection with multi-features. in Proc. 2017 ACM on Multimedia Conference 854–862 (ACM, 2017).

  19. Kiryo, R., Niu, G., du Plessis, M. C. & Sugiyama, M. in Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 1675–1685 (Curran Associates, 2017).

  20. Mann, G. S. & McCallum, A. Generalized expectation criteria for semi-supervised learning with weakly labeled data. J. Mach. Learn. Res. 11, 955–984 (2010).

  21. Brasch, J. et al. Visualization of clustered protocadherin neuronal self-recognition complexes. Nature 569, 280–283 (2019).

  22. Morin, A. et al. Cutting edge: collaboration gets the most out of software. eLife 2, e01456 (2013).

  23. Lander, G. C. et al. Appion: an integrated, database-driven pipeline to facilitate EM image processing. J. Struct. Biol. 166, 95–102 (2009).

  24. Scheres, S. H. W. RELION: implementation of a Bayesian approach to cryo-EM structure determination. J. Struct. Biol. 180, 519–530 (2012).

  25. Punjani, A., Rubinstein, J. L., Fleet, D. J. & Brubaker, M. A. cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nat. Methods 14, 290–296 (2017).

  26. de la Rosa-Trevín, J. M. et al. Scipion: a software framework toward integration, reproducibility and validation in 3D electron microscopy. J. Struct. Biol. 195, 93–99 (2016).

  27. Biyani, N. et al. Focus: the interface between data collection and data processing in cryo-EM. J. Struct. Biol. 198, 124–133 (2017).

  28. Dutta, A. & Zisserman, A. The VIA annotation software for images, audio and video. Preprint at https://arxiv.org/abs/1904.10699 (2019).

  29. Wagner, T. et al. SPHIRE-crYOLO: a fast and well-centering automated particle picker for cryo-EM. Commun. Biol. 2, 218 (2019).

  30. Bepler, T. et al. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. in Proc. 22nd Annual International Conference on Research in Computational Molecular Biology (ed. Raphael, B. J.) 245–247 (Springer, 2018).

  31. Tegunov, D. & Cramer, P. Real-time cryo-EM data pre-processing with Warp. Preprint at https://doi.org/10.1101/338558 (2018).

  32. Campbell, M. G., Veesler, D., Cheng, A., Potter, C. S. & Carragher, B. 2.8 Å resolution reconstruction of the Thermoplasma acidophilum 20S proteasome using cryo-electron microscopy. eLife 4, e06380 (2015).

  33. Wong, W. et al. Cryo-EM structure of the Plasmodium falciparum 80S ribosome bound to the anti-protozoan drug emetine. eLife 3, e03080 (2014).

  34. Tan, Y. Z. et al. Addressing preferred specimen orientation in single-particle cryo-EM through tilting. Nat. Methods 14, 793–796 (2017).

  35. Xu, H. et al. Structural basis of Nav1.7 inhibition by a gating-modifier spider toxin. Cell 176, 702–715 (2019).

  36. Ioffe, S. & Szegedy, C. in Proc. 32nd International Conference on Machine Learning (eds Bach, F. & Blei, D.) 448–456 (PMLR, 2015).

  37. Radford, A., Metz, L. & Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. Preprint at https://arxiv.org/abs/1511.06434 (2015).

  38. Zheng, S. Q. et al. MotionCor2: anisotropic correction of beam-induced motion for improved cryo-electron microscopy. Nat. Methods 14, 331–332 (2017).

  39. Carragher, B. et al. Leginon: an automated system for acquisition of images from vitreous ice specimens. J. Struct. Biol. 132, 33–45 (2000).

  40. Rohou, A. & Grigorieff, N. CTFFIND4: fast and accurate defocus estimation from electron micrographs. J. Struct. Biol. 192, 216–221 (2015).

  41. Pettersen, E. F. et al. UCSF Chimera—a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605–1612 (2004).

  42. Roseman, A. M. FindEM—a fast, efficient program for automatic selection of particles from electron micrographs. J. Struct. Biol. 145, 91–99 (2004).

  43. Bepler, T. et al. Topaz: positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Code Ocean https://doi.org/10.24433/CO.1911124.v1 (2019).

Acknowledgements

The authors wish to thank Simons Electron Microscopy Center (SEMC) OPs for the aldolase sample preparation and collection, Y. Z. Tan for SPA discussion, and the Electron Microscopy Group at the New York Structural Biology Center (NYSBC) for microscope calibration and assistance. We thank J. Sampson (Columbia University) for expressing the Toll receptor. We would also like to thank T. Jaakkola for his valuable feedback on the machine learning methods. We thank the developers of Relion, cryoSPARC, Appion, EMAN2, Scipion and Focus for their efforts in integrating Topaz. The Topaz GUI is based on VGG Image Annotator (VIA), which is developed and maintained with the support of EPSRC program grant Seebibyte: Visual Search for the Era of Big Data (EP/M013774/1). T.B., A.M. and B.B. were supported by NIH grant R01-GM081871. M.R. was supported by NSF GRFP (DGE-1644869). L.S. was supported by NIH grant R01-MH114817. A.J.N. was supported by a grant from the NIH National Institute of General Medical Sciences (NIGMS) (F32GM128303). The cryo-EM work was performed at the SEMC and National Resource for Automated Molecular Microscopy located at NYSBC, supported by grants from the Simons Foundation (SF349247), NYSTAR and the NIH NIGMS (GM103310) with additional support from the Agouron Institute (F00316) and NIH (OD019994).

Author information

T.B., A.M. and B.B. conceived the project. T.B. developed the PU learning methods and implemented Topaz, processed and analyzed single particle datasets, and carried out the computational experiments under the guidance of B.B. M.R. prepared and collected the Toll receptor dataset. J.B. prepared and collected the clustered protocadherin dataset. A.J.N. analyzed the single particle cryo-EM reconstructions. A.J.N. developed the Topaz GUI based on VIA. T.B., A.M., M.R., J.B., L.S., A.J.N. and B.B. designed the experiments. T.B., M.R., A.J.N. and B.B. wrote the manuscript.

Correspondence to Alex J. Noble or Bonnie Berger.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Peer review information: Allison Doerr was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Normalization methods on EMPIAR-10261.

Comparison of standard affine normalization and our proposed mixture model normalization on EMPIAR-10261 micrographs downsampled 4x. For affine normalization, micrographs are transformed by subtracting the mean and dividing by the standard deviation of the pixel values. (a) Visualization of three example micrographs with either affine (top) or GMM (bottom) normalization. Affine normalized micrographs are washed out when there are dark grid regions present in the micrographs. (b) Histograms of the pixel intensities of the same three micrographs after affine or GMM normalization. GMM normalization correctly centers the pixel intensities around the high intensity peak. Results are consistent across >20 micrographs examined.
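The two normalization schemes compared in this caption can be sketched in a few lines of Python. Affine normalization follows the caption directly. The mixture-model variant is only an illustrative assumption: it fits a two-component one-dimensional Gaussian mixture with plain EM and rescales by the mean and standard deviation of the brighter component, so that intensities are centered on the high-intensity peak. The function names, the quantile-based initialization and the fixed iteration count are hypothetical and are not taken from the Topaz source; a micrograph with no grid region would need extra care that this sketch does not attempt.

```python
import numpy as np


def affine_normalize(micrograph):
    """Subtract the mean and divide by the standard deviation of all pixels."""
    x = np.asarray(micrograph, dtype=np.float64)
    return (x - x.mean()) / x.std()


def fit_two_gaussians(values, n_iter=100):
    """Fit a two-component 1D Gaussian mixture with plain EM (illustrative)."""
    v = np.asarray(values, dtype=np.float64).ravel()
    # Assumed initialization: start the components at the 5th and 95th percentiles.
    mu = np.percentile(v, [5.0, 95.0])
    var = np.full(2, v.var() + 1e-8)
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each pixel.
        log_p = (np.log(weight)
                 - 0.5 * np.log(2.0 * np.pi * var)
                 - 0.5 * (v[:, None] - mu) ** 2 / var)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        weight = nk / nk.sum()
        mu = (resp * v[:, None]).sum(axis=0) / nk
        var = (resp * (v[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-8
    return weight, mu, var


def gmm_normalize(micrograph):
    """Center and scale by the brighter mixture component (an assumption)."""
    x = np.asarray(micrograph, dtype=np.float64)
    _, mu, var = fit_two_gaussians(x)
    k = int(np.argmax(mu))  # index of the high-intensity component
    return (x - mu[k]) / np.sqrt(var[k])
```

On a synthetic image in which a dark strip stands in for a grid region, the affine version shifts the bright pixels away from zero because the dark pixels drag the global mean down, whereas the mixture-based version keeps the bright peak centered near zero.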

Supplementary Figure 2 Normalization methods on EMPIAR-10234.

Comparison of standard affine normalization and our proposed mixture model normalization on EMPIAR-10234 micrographs downsampled 8x. For affine normalization, micrographs are transformed by subtracting the mean and dividing by the standard deviation of the pixel values. (a) Visualization of three example micrographs with either affine (top) or GMM (bottom) normalization. The left-most micrograph contains light carbon grid that is correctly removed by GMM normalization. (b) Histograms of the pixel intensities of the same three micrographs after affine or GMM normalization. Results are consistent across >20 micrographs examined.

Supplementary Figure 3 Normalization methods on EMPIAR-10096.

Comparison of standard affine normalization and our proposed mixture model normalization on EMPIAR-10096 micrographs downsampled 8x. For affine normalization, micrographs are transformed by subtracting the mean and dividing by the standard deviation of the pixel values. (a) Visualization of three example micrographs with either affine (top) or GMM (bottom) normalization. No grid is present in these micrographs, so affine and GMM normalization give nearly identical results. (b) Histograms of the pixel intensities of the same three micrographs after affine or GMM normalization. Results are consistent across >20 micrographs examined.

Supplementary Figure 4 Toll receptor example micrograph with picks.

Example micrograph for the Toll receptor dataset with the unlabeled micrograph (top left), training picks (top right), DoG Picker picks (middle left), Topaz picks (middle right), FindEM picks (bottom left), and crYOLO picks (bottom right). This result is consistent over >100 micrographs examined.

Supplementary Figure 5 Toll receptor 3DFSC curves and crYOLO reconstruction.

Toll receptor 3DFSC plots and 3D reconstruction of the Toll receptor using 131,300 particles picked using crYOLO. (a) 3DFSC plots for Toll receptor structures solved using particles from Topaz, DoG, Template, and crYOLO. (b) Density map of the crYOLO structure. The crYOLO structure reaches a resolution of 6.83 Å at FSC0.143 with a sphericity of 0.734.

Supplementary Figure 6 Sparse picking in aggregates: Topaz vs crYOLO.

Example micrograph with a large amount of aggregate but still a number of real particles, comparing Topaz with crYOLO after following crYOLO’s sparse picking procedure. Topaz (top left) picks 40 particles while avoiding aggregation and ice contamination at the default threshold of 0.0. crYOLO picks 1 particle at its default threshold of 0.3. Lowering the threshold to 0.07 (top right) yields 26 particles with a small cluster of picks in the aggregation near the bottom right hand corner of the image. Further decreasing the threshold to 0.06 (bottom left) yields even more particles, but now it is clear the network is picking a large number of pixels within the aggregation. At a threshold of 0.05 (bottom right), crYOLO is no longer able to avoid aggregation or ice contamination. Note that the particle picking threshold increment in crYOLO is 0.01. We found these results to be consistent across >100 micrographs examined.

Supplementary Figure 7 Sparse picking with low SNR: Topaz vs. crYOLO.

Example micrograph with real particles but thicker ice and thus lower contrast, comparing Topaz with crYOLO after following crYOLO’s sparse picking procedure. Topaz (top left) picks 127 particles at the default threshold of 0.0. crYOLO picks 0 particles at its default threshold of 0.3. Lowering the threshold to 0.07 (top right) yields 37 particles. Further decreasing the threshold to 0.06 (bottom left) yields more particles, but the network is still missing real particles while starting to select some background pixels. At a threshold of 0.05 (bottom right), there are clear artifacts at the edges of the micrograph. We found these results to be consistent across >100 micrographs examined.

Supplementary Figure 8 T20S proteasome example micrographs.

Two example micrographs for the T20S proteasome dataset (EMPIAR-10025) with (top) published particles circled in blue, (middle) training particles sampled from the published particles circled in blue, and (bottom) Topaz particles circled in red. Each column is a different micrograph. The circled training particles illustrate how sparse the Topaz picks are for this dataset. The PU learning framework allows picking to be performed with high accuracy despite the sparsity of examples, as seen by the Topaz picks in red. Furthermore, Topaz recovers many more real particles than are present in the published set. We found this result to be consistent across >100 micrographs examined.

Supplementary Figure 9 80S ribosome example micrographs.

Two example micrographs for the 80S ribosome dataset (EMPIAR-10028) with (top) published particles circled in blue, (middle) training particles sampled from the published particles circled in blue, and (bottom) Topaz particles circled in red. Each column is a different micrograph. The circled training particles illustrate how sparse the Topaz picks are for this dataset. The PU learning framework allows picking to be performed with high accuracy despite the sparsity of examples, as seen by the Topaz picks in red. Furthermore, Topaz recovers many more real particles than are present in the published set. We found this result to be consistent across >100 micrographs examined.

Supplementary Figure 10 Rabbit muscle aldolase example micrographs.

Two example micrographs for the rabbit muscle aldolase dataset (EMPIAR-10215) with (top) published particles circled in blue, (middle) training particles sampled from the published particles circled in blue, and (bottom) Topaz particles circled in red. Each column is a different micrograph. The circled training particles illustrate how sparse the Topaz picks are for this dataset. The PU learning framework allows picking to be performed with high accuracy despite the sparsity of examples, as seen by the Topaz picks in red. Furthermore, Topaz recovers many more real particles than are present in the published set. In this aldolase dataset, the particles are tightly packed, but Topaz correctly identifies and centers the particles. We found this result to be consistent across >100 micrographs examined.

Supplementary Figure 11 EMPIAR-10215 2D class averages.

2D class averages of Topaz particles with decreasing score threshold for the aldolase dataset. Classes identified as false positives for quantification in Fig. 5 are indicated by orange boxes.

Supplementary Figure 12 EMPIAR-10028 2D class averages.

2D class averages of Topaz particles with decreasing score threshold for the 80S ribosome. Classes identified as false positives for quantification in Fig. 5 are indicated by orange boxes.

Supplementary Figure 13 Precision-recall and F1 curves for EMPIAR-10025, EMPIAR-10028, and EMPIAR-10215.

Precision-recall curves and threshold vs precision, recall, and F1 score curves for classifiers trained on the EMPIAR-10025, EMPIAR-10028, and EMPIAR-10215 datasets. Curves were calculated by matching the particles predicted by the Topaz models on the test set micrographs of each dataset to the published particle annotations on those micrographs. We note that the precision and average-precision scores are underestimates of the true precision and true average-precision of the Topaz models due to incompleteness of the published particle set.
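The matching-based evaluation described in this caption can be illustrated with a minimal Python sketch. It assumes a greedy matching rule, in which predictions are visited from highest to lowest score and each is counted as a true positive if an unmatched published particle lies within a fixed radius. The radius, the greedy rule and all function names are assumptions made for illustration; the caption does not specify how matches were assigned. The sketch also assumes at least one published particle, since recall is undefined otherwise.

```python
import numpy as np


def match_predictions(pred_xy, pred_scores, true_xy, radius):
    """Greedily match predictions (highest score first) to annotations.

    Returns the scores in descending order and a boolean array marking
    which of those predictions hit a not-yet-matched annotation.
    """
    pred_xy = np.asarray(pred_xy, dtype=np.float64).reshape(-1, 2)
    true_xy = np.asarray(true_xy, dtype=np.float64).reshape(-1, 2)
    scores = np.asarray(pred_scores, dtype=np.float64)
    order = np.argsort(-scores, kind="stable")
    taken = np.zeros(len(true_xy), dtype=bool)
    hits = np.zeros(len(order), dtype=bool)
    for rank, i in enumerate(order):
        if len(true_xy) == 0:
            break
        dist = np.linalg.norm(true_xy - pred_xy[i], axis=1)
        dist[taken] = np.inf  # each annotation can be matched only once
        j = int(np.argmin(dist))
        if dist[j] <= radius:
            taken[j] = True
            hits[rank] = True
    return scores[order], hits


def precision_recall(hits, n_true):
    """Precision and recall after each prediction in score order."""
    tp = np.cumsum(hits)
    k = np.arange(1, len(hits) + 1)
    return tp / k, tp / n_true


def average_precision(precision, recall):
    """Sum of precision weighted by the recall gained at each rank."""
    delta = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(precision * delta))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 where both are 0."""
    denom = precision + recall
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, 2.0 * precision * recall / safe, 0.0)
```

Under this scheme a correct prediction of a real particle that is absent from the published annotations is counted as a false positive, which is why precision computed against an incomplete reference set underestimates the true precision.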

Supplementary Figure 14 EMPIAR-10096 and EMPIAR-10234 example micrographs with Topaz picks.

Representative micrographs from (a) the EMPIAR-10096 and (b) the EMPIAR-10234 test sets. For EMPIAR-10096, curated particles from EMPIAR (blue) and particles predicted by Topaz (red) are circled. For EMPIAR-10234, manually selected (blue) and predicted (red) particles are circled. Topaz avoids ice chunks, particles in proximity to the edge of the hole, and particles on carbon and correctly identifies many particles missing from the manually labeled/curated particle sets. We found this result to be consistent over >20 micrographs examined.

Supplementary Figure 15 Sensitivity to π of GE-binomial.

Sensitivity of the GE-binomial objective function to the setting of π. For the EMPIAR-10096 and EMPIAR-10234 datasets, we report average-precision scores for classifiers trained with 100 and 1,000 labeled particles and values of π varying from 0.5x to 1.5x the values reported in Table 1. We report the mean and standard deviation of 10 runs.

Supplementary information

Supplementary Information

Supplementary Figs. 1–15.

Reporting Summary

About this article

Cite this article

Bepler, T., Morin, A., Rapp, M. et al. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Nat Methods 16, 1153–1160 (2019). https://doi.org/10.1038/s41592-019-0575-8
