Article

Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra

Nature Methods volume 8, pages 587–591 (2011)

Abstract

Tandem mass spectrometry (MS/MS) experiments in different laboratories yield multiple, nearly identical spectra of the same peptide, but proteomics researchers typically do not leverage the unidentified spectra produced in other labs to decode the spectra they generate. We propose a spectral archives approach that clusters MS/MS datasets, representing similar spectra by a single consensus spectrum. Spectral archives extend spectral libraries by analyzing identified and unidentified spectra in the same way and by maintaining information about peptide spectra that are common across species and conditions. Thus, archives offer both traditional library-style search based on spectrum similarity and new ways to analyze the data. By developing a clustering tool, MS-Cluster, we generated a spectral archive from 1.18 billion spectra that greatly exceeds the size of existing spectral repositories. We advocate that publicly available data be organized into spectral archives rather than analyzed as disparate datasets, as is mostly the case today.
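The abstract does not spell out how MS-Cluster works, so the following is only a minimal sketch of the general idea it describes: grouping similar spectra and representing each group by one consensus spectrum. Everything specific here is an assumption made for illustration and is not taken from the paper: the 1.0 m/z bin width, the normalized dot product as the similarity measure, the 0.7 similarity threshold, the 2.0 m/z precursor tolerance, the greedy single-pass strategy, and the function names `bin_spectrum`, `similarity`, `consensus` and `cluster_spectra`.

```python
import math
from collections import defaultdict

BIN_WIDTH = 1.0  # m/z bin width; an illustrative choice, not from the paper


def bin_spectrum(peaks, bin_width=BIN_WIDTH):
    """Turn a list of (m/z, intensity) peaks into a unit-length sparse vector."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {b: v / norm for b, v in vec.items()} if norm > 0 else {}


def similarity(a, b):
    """Normalized dot product (cosine) of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def consensus(members):
    """Sum the member vectors bin by bin and rescale to unit length."""
    total = defaultdict(float)
    for vec in members:
        for b, v in vec.items():
            total[b] += v
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {b: v / norm for b, v in total.items()} if norm > 0 else {}


def cluster_spectra(spectra, threshold=0.7, mz_tolerance=2.0):
    """Greedy single-pass clustering (hypothetical, not MS-Cluster itself).

    spectra: list of (precursor_mz, peaks) pairs.
    Returns a list of clusters; each is a dict holding the precursor m/z of
    its first member, the member indices, and the current consensus vector.
    """
    clusters = []
    for idx, (precursor_mz, peaks) in enumerate(spectra):
        vec = bin_spectrum(peaks)
        best, best_sim = None, threshold
        for c in clusters:
            # Only compare spectra whose precursor m/z values are close.
            if abs(c["precursor_mz"] - precursor_mz) > mz_tolerance:
                continue
            s = similarity(vec, c["consensus"])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            # No sufficiently similar cluster: the spectrum starts a new one.
            clusters.append({"precursor_mz": precursor_mz, "members": [idx],
                             "vectors": [vec], "consensus": vec})
        else:
            best["members"].append(idx)
            best["vectors"].append(vec)
            best["consensus"] = consensus(best["vectors"])
    return clusters
```

In this sketch, identified and unidentified spectra are treated the same way, since no peptide annotation is needed to compute similarity. A new spectrum could then be searched against the consensus vectors with the same `similarity` function, which is the library-style use the abstract mentions. A tool operating on billions of spectra would need far more care with indexing, noise filtering and memory than this toy version shows.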



Acknowledgements

We thank I. Kaufman for his assistance in running the experiments on the computational grid. This work was supported by US National Institutes of Health grant 1-P41-RR024851 from the National Center for Research Resources. This work used measurements based upon capabilities developed by the Department of Energy, Office of Biological and Environmental Research, and National Center for Research Resources (grant RR18522) conducted at the Environmental Molecular Sciences Laboratory, a national scientific user facility located at Pacific Northwest National Laboratory in Richland, Washington, USA.

Author information

Affiliations

  1. Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, USA.

    • Ari M Frank
    • Jeremy J Carver
    • Nuno Bandeira
    • Pavel A Pevzner
  2. Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington, USA.

    • Matthew E Monroe
    • Anuj R Shah
    • Ronald J Moore
    • Gordon A Anderson
    • Richard D Smith
  3. Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, California, USA.

    • Nuno Bandeira

Authors

  1. Ari M Frank

  2. Matthew E Monroe

  3. Anuj R Shah

  4. Jeremy J Carver

  5. Nuno Bandeira

  6. Ronald J Moore

  7. Gordon A Anderson

  8. Richard D Smith

  9. Pavel A Pevzner

Contributions

A.M.F. designed and implemented the algorithms, designed and ran the experiments and wrote the paper. P.A.P. designed the algorithms and the experiments and wrote the paper. R.D.S. developed the measurement capabilities. R.J.M. was responsible for the measurements. M.E.M. and G.A.A. developed protocols and did the proteomics data acquisition and processing. A.R.S. assisted in designing the experiments. J.J.C. and N.B. designed and implemented the web-based archive searching tool. All authors discussed, commented on and contributed to writing the paper.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Pavel A Pevzner.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Tables 1–2 and Supplementary Notes 1–6

About this article

DOI

https://doi.org/10.1038/nmeth.1609
