
Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets


Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average, 75% of spectra analyzed in an MS experiment remain unidentified. We propose to use spectrum clustering at a large scale to shed light on these unidentified spectra. The Proteomics Identifications (PRIDE) Database Archive is one of the largest MS proteomics public data repositories worldwide. By clustering all tandem MS spectra publicly available in the PRIDE Archive, coming from hundreds of data sets, we were able to consistently characterize spectra into three distinct groups: (1) incorrectly identified, (2) correctly identified but below the set scoring threshold, and (3) truly unidentified. Using multiple complementary analysis approaches, we were able to identify 20% of the consistently unidentified spectra. The complete spectrum-clustering results are available through the new version of the PRIDE Cluster resource. This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.


Figure 1: Accuracy of the spectra-cluster algorithm compared with the MSCluster (ref. 16) and MaRaCluster (ref. 17) algorithms.
Figure 2: Overview of the results of the analysis to highlight commonly found incorrect peptide identifications in the PRIDE Archive.
Figure 3: Identified spectra from a diverse range of data sets, including spectra from experiments in other species, led to newly identified phosphorylated peptides in the Chromosome-Centric HPP data sets (PXD000529, PXD000533 and PXD000535).
Figure 4: Overview of the results of the analysis of clusters containing only unidentified spectra.
Figure 5: Delta masses observed for the 5,560 large human unidentified clusters whose consensus spectra were identified using an open modification search.


  1. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
  2. Chick, J.M. et al. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 33, 743–749 (2015).
  3. Eng, J.K., McCormack, A.L. & Yates, J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).
  4. Perkins, D.N., Pappin, D.J., Creasy, D.M. & Cottrell, J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).
  5. Craig, R. & Beavis, R.C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20, 1466–1467 (2004).
  6. Frank, A. & Pevzner, P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal. Chem. 77, 964–973 (2005).
  7. Tabb, D.L., Ma, Z.Q., Martin, D.B., Ham, A.J. & Chambers, M.C. DirecTag: accurate sequence tags from peptide MS/MS through statistical scoring. J. Proteome Res. 7, 3838–3846 (2008).
  8. Lam, H. et al. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 7, 655–667 (2007).
  9. Ma, C.W. & Lam, H. Hunting for unexpected post-translational modifications by spectral library searching with tier-wise scoring. J. Proteome Res. 13, 2262–2271 (2014).
  10. Vizcaíno, J.A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, D447–D456 (2016).
  11. Vizcaíno, J.A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 32, 223–226 (2014).
  12. Griss, J., Foster, J.M., Hermjakob, H. & Vizcaíno, J.A. PRIDE Cluster: building a consensus of proteomics data. Nat. Methods 10, 95–96 (2013).
  13. Yao, Q. et al. Design and development of a medical big data processing system based on Hadoop. J. Med. Syst. 39, 23 (2015).
  14. Hodor, P., Chawla, A., Clark, A. & Neal, L. cl-dash: rapid configuration and deployment of Hadoop clusters for bioinformatics research in the cloud. Bioinformatics 32, 301–303 (2016).
  15. Dasari, S. et al. Pepitome: evaluating improved spectral library search for identification complementarity and quality assessment. J. Proteome Res. 11, 1686–1695 (2012).
  16. Frank, A.M. et al. Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat. Methods 8, 587–591 (2011).
  17. The, M. & Käll, L. MaRaCluster: a fragment rarity metric for clustering fragment spectra in shotgun proteomics. J. Proteome Res. 15, 713–720 (2016).
  18. Ternent, T. et al. How to submit MS proteomics data to ProteomeXchange via the PRIDE database. Proteomics 14, 2233–2241 (2014).
  19. Desiere, F. et al. The PeptideAtlas project. Nucleic Acids Res. 34, D655–D658 (2006).
  20. Craig, R., Cortens, J.P. & Beavis, R.C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 3, 1234–1242 (2004).
  21. Omenn, G.S. et al. Metrics for the Human Proteome Project 2015: progress on the human proteome and guidelines for high-confidence protein identification. J. Proteome Res. 14, 3452–3460 (2015).
  22. Hu, Y. & Lam, H. Expanding tandem mass spectral libraries of phosphorylated peptides: advances and applications. J. Proteome Res. 12, 5971–5977 (2013).
  23. Liu, Y. et al. Chromosome-8-coded proteome of Chinese Chromosome Proteome Data set (CCPD) 2.0 with partial immunohistochemical verifications. J. Proteome Res. 13, 126–136 (2014).
  24. Tsai, C.F. et al. Sequential phosphoproteomic enrichment through complementary metal-directed immobilized metal ion affinity chromatography. Anal. Chem. 86, 685–693 (2014).
  25. Ye, X. & Li, L. Macroporous reversed-phase separation of proteins combined with reversed-phase separation of phosphopeptides and tandem mass spectrometry for profiling the phosphoproteome of MDA-MB-231 cells. Electrophoresis 35, 3479–3486 (2014).
  26. Mancuso, F., Bunkenborg, J., Wierer, M. & Molina, H. Data extraction from proteomics raw data: an evaluation of nine tandem MS tools using a large Orbitrap data set. J. Proteomics 75, 5293–5303 (2012).
  27. Raijmakers, R., Kraiczek, K., de Jong, A.P., Mohammed, S. & Heck, A.J. Exploring the human leukocyte phosphoproteome using a microfluidic reversed-phase-TiO2-reversed-phase high-performance liquid chromatography phosphochip coupled to a quadrupole time-of-flight mass spectrometer. Anal. Chem. 82, 824–832 (2010).
  28. Casado, P. et al. Kinase-substrate enrichment analysis provides insights into the heterogeneity of signaling pathway activation in leukemia cells. Sci. Signal. 6, rs6 (2013).
  29. Menschaert, G. et al. Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events. Mol. Cell. Proteomics 12, 1780–1790 (2013).
  30. Casado, P., Bilanges, B., Rajeeve, V., Vanhaesebroeck, B. & Cutillas, P.R. Environmental stress affects the activity of metabolic and growth factor signaling networks and induces autophagy markers in MCF7 breast cancer cells. Mol. Cell. Proteomics 13, 836–848 (2014).
  31. Collins, M.O., Wright, J.C., Jones, M., Rayner, J.C. & Choudhary, J.S. Confident and sensitive phosphoproteomics using combinations of collision induced dissociation and electron transfer dissociation. J. Proteomics 103, 1–14 (2014).
  32. van Gestel, R.A. et al. Quantitative erythrocyte membrane proteome analysis with Blue-native/SDS PAGE. J. Proteomics 73, 456–465 (2010).
  33. Sleno, L. The use of mass defect in modern mass spectrometry. J. Mass Spectrom. 47, 226–236 (2012).
  34. Sturm, M. et al. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics 9, 163 (2008).
  35. Wang, J., Pérez-Santiago, J., Katz, J.E., Mallick, P. & Bandeira, N. Peptide identification from mixture tandem mass spectra. Mol. Cell. Proteomics 9, 1476–1485 (2010).
  36. Schittmayer, M., Fritz, K., Liesinger, L., Griss, J. & Birner-Gruenberger, R. Cleaning out the litterbox of proteomic scientists' favorite pet: optimized data analysis avoiding trypsin artifacts. J. Proteome Res. 15, 1222–1229 (2016).
  37. Lam, H. Spectral archives: a vision for future proteomics data repositories. Nat. Methods 8, 546–548 (2011).
  38. Mosteller, F., Winsor, C.P. & Fisher, C.H. Questions and Answers. Am. Stat. 2, 18–19 (1948).
  39. Mi, H., Muruganujan, A. & Thomas, P.D. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 41, D377–D386 (2013).



This work was supported by the Vienna Science and Technology Fund (WWTF, grant LS11-045; grant was awarded to S.N. Wagner (Medical University of Vienna, Division of Immunology, Allergy and Infectious Diseases) and used to fund J.G.), the Wellcome Trust (grant WT101477MA to H.H. and J.A.V.), the BBSRC ('PROCESS' grant BB/K01997X/1 to H.H. and J.A.V., 'Quantitative Proteomics' grant BB/I00095X/1 to H.H.), the Deutsche Forschungsgemeinschaft (grant SFB685/B1 to O.K.), and the BMBF (grant 01ZX1301F to O.K.). We would like to acknowledge the attendees of the Midwinter Proteomics Bioinformatics Seminar 2015 at Semmering (Austria) and the Bioinformatics Hub at the HUPO conference 2015 at Vancouver (Canada), who provided valuable feedback on the data analysis. Finally, we want to acknowledge M. The and L. Käll for their support during the benchmarking of their MaRaCluster algorithm.

Author information




J.G. developed the clustering algorithm, ran the experiments, and performed the data analysis. D.L.T. contributed to the development of the probabilistic scoring approach. Y.P.-R. contributed to the data analysis. J.G. and R.W. developed the Java APIs for the spectrum-clustering-analysis pipeline. S.L., R.W., and J.G. developed the Hadoop implementation. J.A.D., N.d.-T., Y.P.-R., and R.W. created the web interface and the API of the PRIDE Cluster resource. M.R., M.W., and O.K. performed the metabolite search. J.G., R.W., H.H., and J.A.V. supervised the project. J.G. and J.A.V. wrote the manuscript, with contributions from the rest of the authors.

Corresponding authors

Correspondence to Johannes Griss or Juan Antonio Vizcaíno.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Relative Proportion of Unidentified Spectra in datasets submitted to PRIDE Archive

Box plots representing the relative proportion of unidentified spectra in the PRoteomics IDEntifications (PRIDE) Archive database. Overall, 75% of spectra in datasets submitted to PRIDE Archive are unidentified. Submitted datasets without identified spectra as well as datasets that only contained identified spectra were excluded from this calculation. Submissions to PRIDE (first box plot) represent those datasets submitted until mid-2012. Submissions to ProteomeXchange (second box plot) represent datasets submitted afterwards, once the ProteomeXchange data workflows were started. For the latter, only “complete” submissions are considered.

Supplementary Figure 2 Peptide Evidence per Proteomics Repository

Venn diagrams demonstrating that reliable peptide identifications based on PRIDE Cluster provide additional MS/MS evidence for peptides not found in the two other major MS-based data repositories (PeptideAtlas and GPMDB). Data are shown for (a) human, (b) mouse, (c) Arabidopsis thaliana, and (d) rat.

Supplementary Figure 3 PRIDE Cluster Provides MS/MS Evidence for Proteins without Experimental Evidence Annotated

Validated peptide spectrum matches based on PRIDE Cluster provide experimental evidence for the existence of a considerable number of proteins for which there is no experimental evidence at the protein level (PE=1) in UniProt. The plot presents the proteins that can be identified with at least 2 unique peptides of at least 9 amino acids each, as agreed in the latest guidelines of the Human Proteome Project. The categories represented are: only evidence at the transcript level (PE=2), proteins inferred from homology (PE=3), and predicted proteins (PE=4) in the human UniProtKB/SwissProt (release 2016-03) database.
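
The peptide-count rule quoted in this caption can be illustrated with a short sketch. This is not the authors' pipeline: the input format (pairs of protein accession and peptide sequence), the function name, and the reading of "unique" as "distinct sequence that maps to exactly one protein" are all assumptions made for illustration.

```python
from collections import defaultdict

MIN_PEPTIDES = 2  # at least 2 unique peptides per protein (from the caption)
MIN_LENGTH = 9    # each peptide at least 9 amino acids long (from the caption)

def proteins_meeting_guideline(psms):
    """Return accessions supported by enough long, protein-specific peptides.

    psms: iterable of (protein_accession, peptide_sequence) pairs.
    This input format is hypothetical.
    """
    psms = list(psms)

    # Map every peptide to the set of proteins it was matched to.
    proteins_per_peptide = defaultdict(set)
    for protein, peptide in psms:
        proteins_per_peptide[peptide].add(protein)

    # Collect, per protein, the distinct peptides that are long enough
    # and map to that protein only (assumed meaning of "unique").
    supporting = defaultdict(set)
    for protein, peptide in psms:
        if len(peptide) >= MIN_LENGTH and len(proteins_per_peptide[peptide]) == 1:
            supporting[protein].add(peptide)

    return sorted(p for p, seqs in supporting.items() if len(seqs) >= MIN_PEPTIDES)

example = [
    ("P1", "AAAAAAAAAK"), ("P1", "CCCCCCCCR"),    # two long, specific peptides
    ("P2", "EEEEEEEEEK"), ("P2", "EEEEEEEEEK"),   # same peptide seen twice
    ("P3", "GASPK"), ("P3", "DDDDDDDDDDK"),       # one peptide is too short
    ("P4", "GGGGGGGGGK"), ("P4", "LLLLLLLLLK"),   # LLLLLLLLLK is shared with P5
    ("P5", "LLLLLLLLLK"), ("P5", "MMMMMMMMMK"), ("P5", "NNNNNNNNNK"),
]
qualifying = proteins_meeting_guideline(example)
```

In this toy input only P1 and P5 qualify; P2, P3 and P4 each retain a single supporting peptide once duplicates, short sequences and shared sequences are discounted.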

Supplementary Figure 4 Overall workflow representing the “identification pipeline”.

Workflow representing the “identification pipeline” used to identify originally submitted spectra of incorrectly identified and unidentified clusters.

Supplementary Figure 5 Overall workflow representing the PRIDE Cluster analysis process.

Flow chart summarizing all analysis steps performed during the analysis of the spectrum clustering results, as described in the main manuscript.

Supplementary Figure 6 Open modification search of unidentified mouse clusters.

Summary of results for the analysis of mouse clusters containing only unidentified spectra. The vast majority of delta masses observed in the open modification search were between -2 and +4 Da (top left panel). After adjusting the y-axis, several other delta masses were observed at a high frequency (top right panel). When limiting these delta masses to those observed for at least ten different clusters, the vast majority could be mapped to known PTMs, as well as to one potential amino acid substitution (lower panel). For the complete list of the delta masses found, see Supplementary Table 4.
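
The filtering step described in this caption (keeping only delta masses seen in at least ten different clusters) might be sketched as follows. The input tuples, the function name and the two-decimal binning are assumptions for illustration; the caption does not state how delta masses were binned, and mapping the surviving values to known PTMs is not shown.

```python
def frequent_delta_masses(matches, min_clusters=10, decimals=2):
    """Count distinct clusters per delta mass and keep the frequent ones.

    matches: iterable of (cluster_id, observed_mass, matched_peptide_mass),
    masses in Da. This input format is hypothetical.
    """
    clusters_per_delta = {}
    for cluster_id, observed, theoretical in matches:
        # Delta mass = observed precursor mass minus mass of the matched peptide.
        # Rounding acts as a simple binning scheme (an assumption).
        delta = round(observed - theoretical, decimals)
        clusters_per_delta.setdefault(delta, set()).add(cluster_id)
    return {
        delta: len(clusters)
        for delta, clusters in sorted(clusters_per_delta.items())
        if len(clusters) >= min_clusters
    }

# Toy data: 12 clusters shifted by about +79.97 Da (close to the mass of a
# phosphorylation) and 3 clusters shifted by about +15.99 Da.
toy = [(f"c{i}", 1000.0 + i + 79.97, 1000.0 + i) for i in range(12)]
toy += [(f"d{i}", 1500.0 + i + 15.99, 1500.0 + i) for i in range(3)]
frequent = frequent_delta_masses(toy)
```

With the default threshold of ten clusters, only the delta mass near +79.97 Da survives in this toy data set; lowering `min_clusters` to 3 would also retain the one near +15.99 Da.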

Supplementary Figure 7 Open modification search of unidentified Arabidopsis thaliana clusters.

Summary of results for the analysis of Arabidopsis thaliana clusters containing only unidentified spectra. In contrast to the human and mouse data, consensus spectra of unidentified A. thaliana clusters were searched against the PRIDE Cluster spectral library for A. thaliana (version 2015-04). Again, most delta masses were found between -1 and 1 Da (top left panel). After adjusting the y-axis, several other delta masses were observed (top right panel). When limiting these delta masses to those observed for at least five different clusters, three known PTMs could be identified, even though the spectral library used in this search was derived from the same dataset (lower panel). For the complete list of the delta masses found, see Supplementary Table 4.

Supplementary Figure 8 Identification of unidentified mouse clusters

Overview of the results of the analysis of clusters containing only unidentified spectra from mouse. (a) Venn diagram representing that 122 (15%) of the large unidentified mouse clusters were identified using SpectraST, X!Tandem and PepNovo. (b) In contrast to the results in human data, around 50% of identified proteins could not be classified as albumin, keratin, trypsin or haemoglobin. (c) Similarly to the human data, only trypsin peptides were commonly modified (e.g. dimethylated). In the box plot, the center line marks the median, edges mark the first and third quartiles, whiskers extend to +/-1.58 times the interquartile range divided by the square root of the number of observations, and single points denote measurements outside this range.
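
As a worked illustration of the range quoted in this caption, the sketch below computes the median and the interval of +/-1.58 times the interquartile range divided by the square root of the number of observations. The function name is hypothetical, and the quartile definition (Python's "inclusive" method) is an assumption; the plotting software behind the figure may compute quartiles slightly differently.

```python
import math
import statistics

def median_interval(values):
    """Return (lower, median, upper) for the caption's +/-1.58 * IQR / sqrt(n) range."""
    n = len(values)
    # Three cut points: first quartile, median, third quartile.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(n)
    return median - half_width, median, median + half_width

lower, median, upper = median_interval([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

For the nine values above, the quartiles are 3, 5 and 7, so the half-width is 1.58 × 4 / 3 ≈ 2.11 and the interval runs from about 2.89 to about 7.11.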

Supplementary Figure 9 Identification of unidentified Arabidopsis thaliana clusters

Overview of the results of the analysis of clusters containing only unidentified spectra from A. thaliana data. (a) Venn diagram representing that 50 (9%) of the large unidentified A. thaliana clusters were identified using SpectraST, X!Tandem and PepNovo. (b) In contrast to mouse and human data, no haemoglobin-associated proteins were identified. Similarly to the identified proteins in the mouse dataset, most proteins could not be classified as albumin, keratin, or trypsin. All identified peptides corresponding to albumin were matches against bovine albumin and are most likely experimental contaminants. (c) Similarly to the human and mouse data, trypsin peptides were commonly modified (e.g. dimethylated). Additionally, in this case the majority of albumin peptides were also modified. In the box plot, the center line marks the median, edges mark the first and third quartiles, whiskers extend to +/-1.58 times the interquartile range divided by the square root of the number of observations, and single points denote measurements outside this range.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–9 and Supplementary Notes 1–8 (PDF 3504 kb)

Supplementary Table 1

List of processed human PRIDE Archive submissions as part of the test dataset (XLS 94 kb)

Supplementary Table 2

List of analysed phosphorylation studies submitted to PRIDE Archive. (XLS 123 kb)

Supplementary Table 3

List of phosphorylated peptides identified in the three examples presented in the manuscript. (XLS 79 kb)

Supplementary Table 4

List of commonly observed mass deltas when processing unidentified clusters using an open modification search. Results are given for human, mouse and Arabidopsis. (XLS 28 kb)


Cite this article

Griss, J., Perez-Riverol, Y., Lewis, S. et al. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat Methods 13, 651–656 (2016).
