A neural network for large-scale clustering of peptide mass spectra

Repository-scale analysis of hundreds of millions to billions of mass spectra is challenging because of the complexity and volume of the data. A deep neural network embedding method is presented that enables large-scale investigation of repeatedly observed yet consistently unidentified mass spectra.

Fig. 1: GLEAMS deep neural network for clustering hundreds of millions of mass spectra.


  1. Perez-Riverol, Y. et al. The PRIDE database and related tools in 2019: improving support for quantification data. Nucleic Acids Res. 47, D442–D450 (2019). This paper describes the increase in publicly available proteomics data in the PRIDE database.

  2. Frank, A. M. et al. Clustering millions of tandem mass spectra. J. Proteome Res. 7, 113–122 (2008). This paper describes MS-Cluster, the first large-scale clustering algorithm for mass spectra.

  3. Griss, J. et al. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods 13, 651–656 (2016). This paper describes a commonly used spectral clustering algorithm.

  4. Wang, M. et al. Assembling the community-scale discoverable human proteome. Cell Syst. 7, 412–421.e5 (2018). This paper describes the MassIVE-KB resource that provided training data for GLEAMS.

This is a summary of: Bittremieux, W., May, D. H., Bilmes, J. & Noble, W. S. A learned embedding for efficient joint analysis of millions of mass spectra. Nat. Methods (2021).

A neural network for large-scale clustering of peptide mass spectra. Nat. Methods 19, 658–659 (2022).

