Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra


Tandem mass spectrometry (MS/MS) experiments yield multiple, nearly identical spectra of the same peptide in various laboratories, but proteomics researchers typically do not leverage the unidentified spectra produced in other labs to decode spectra they generate. We propose a spectral archives approach that clusters MS/MS datasets, representing similar spectra by a single consensus spectrum. Spectral archives extend spectral libraries by analyzing both identified and unidentified spectra in the same way and maintaining information about peptide spectra that are common across species and conditions. Thus, archives offer both traditional library spectrum similarity-based search capabilities and new ways to analyze the data. By developing a clustering tool, MS-Cluster, we generated a spectral archive from 1.18 billion spectra that greatly exceeds the size of existing spectral repositories. We advocate that publicly available data should be organized into spectral archives rather than be analyzed as disparate datasets, as is mostly the case today.
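The idea of representing similar spectra by a single consensus spectrum can be illustrated with a small sketch. This is not the MS-Cluster algorithm, whose details are not given in the abstract; it is a generic greedy clustering by cosine similarity, and the function names, the 1.0 bin width, the 0.8 similarity threshold and the 2.0 precursor m/z tolerance are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Convert (m/z, intensity) peaks into a unit-length sparse vector."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm > 0 else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def consensus(vectors):
    """Sum the member vectors bin by bin and renormalize to unit length."""
    total = defaultdict(float)
    for vec in vectors:
        for k, v in vec.items():
            total[k] += v
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {k: v / norm for k, v in total.items()} if norm > 0 else {}


def cluster_spectra(spectra, threshold=0.8, precursor_tol=2.0):
    """Greedy single-pass clustering (illustrative assumption, not MS-Cluster).

    spectra: list of (precursor_mz, peaks) tuples.
    Each spectrum joins the most similar existing cluster whose precursor m/z
    is within precursor_tol and whose consensus similarity is >= threshold;
    otherwise it starts a new cluster.
    """
    clusters = []
    for idx, (precursor_mz, peaks) in enumerate(spectra):
        vec = bin_spectrum(peaks)
        best, best_sim = None, threshold
        for c in clusters:
            if abs(c["precursor_mz"] - precursor_mz) > precursor_tol:
                continue
            sim = cosine(vec, c["consensus"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({
                "precursor_mz": precursor_mz,
                "vectors": [vec],
                "members": [idx],
                "consensus": vec,
            })
        else:
            best["vectors"].append(vec)
            best["members"].append(idx)
            best["consensus"] = consensus(best["vectors"])
    return clusters


# Toy data: two near-identical spectra and one unrelated spectrum.
toy = [
    (500.3, [(175.1, 10), (262.1, 50), (375.2, 100)]),
    (500.3, [(175.1, 12), (262.2, 45), (375.2, 95)]),
    (612.8, [(147.1, 80), (300.2, 20), (450.3, 60)]),
]
for c in cluster_spectra(toy):
    print(c["precursor_mz"], c["members"])
```

In this sketch identified and unidentified spectra are treated the same way, since clustering uses only peak lists and precursor m/z; any peptide annotation would be attached to a cluster afterwards. A quadratic scan over clusters like this one would not scale to an archive of the size described above.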


Figure 1: Clustering of the PNNL dataset.
Figure 2: Identification of peptides across different species.




We thank I. Kaufman for his assistance in running the experiments on the computational grid. This work was supported by US National Institutes of Health grant 1-P41-RR024851 from the National Center for Research Resources. This work used measurements based upon capabilities developed by the Department of Energy, Office of Biological and Environmental Research, and National Center for Research Resources (grant RR18522) conducted at the Environmental Molecular Sciences Laboratory, a national scientific user facility located at Pacific Northwest National Laboratory in Richland, Washington, USA.

Author information




A.M.F. designed and implemented the algorithms, designed and ran the experiments and wrote the paper. P.A.P. designed the algorithms and the experiments and wrote the paper. R.D.S. developed the measurement capabilities. R.J.M. was responsible for the measurements. M.E.M. and G.A.A. developed protocols and did the proteomics data acquisition and processing. A.R.S. assisted in designing the experiments. J.J.C. and N.B. designed and implemented the web-based archive searching tool. All authors discussed, commented on and contributed to writing the paper.

Corresponding author

Correspondence to Pavel A Pevzner.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1–2 and Supplementary Notes 1–6 (PDF 544 kb)


About this article

Cite this article

Frank, A., Monroe, M., Shah, A. et al. Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat Methods 8, 587–591 (2011).
