Tandem mass spectrometry (MS/MS) experiments in different laboratories yield multiple, nearly identical spectra of the same peptide, but proteomics researchers typically do not leverage the unidentified spectra produced in other labs to decode the spectra they generate themselves. We propose a spectral archives approach that clusters MS/MS datasets, representing similar spectra by a single consensus spectrum. Spectral archives extend spectral libraries by analyzing identified and unidentified spectra in the same way and by maintaining information about peptide spectra that are common across species and conditions. Archives thus offer both the traditional similarity-based search capabilities of spectral libraries and new ways to analyze the data. By developing a clustering tool, MS-Cluster, we generated a spectral archive from ∼1.18 billion spectra that greatly exceeds the size of existing spectral repositories. We advocate that publicly available data be organized into spectral archives rather than analyzed as disparate datasets, as is mostly the case today.
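The idea of clustering similar spectra and representing each cluster by a consensus spectrum can be illustrated with a minimal sketch. This is not the MS-Cluster algorithm. It assumes a simple scheme in which each spectrum is binned by m/z, spectra are compared by cosine similarity, and the consensus is the renormalized average of the members. The function names, the bin width of 1.0 and the similarity threshold of 0.8 are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Convert (m/z, intensity) peaks into a sparse unit-length vector keyed by bin index."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {b: v / norm for b, v in vec.items()} if norm > 0 else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def consensus(members):
    """Sum the member vectors bin by bin and renormalize to unit length."""
    total = defaultdict(float)
    for vec in members:
        for b, v in vec.items():
            total[b] += v
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {b: v / norm for b, v in total.items()} if norm > 0 else {}


def cluster_spectra(spectra, threshold=0.8, bin_width=1.0):
    """Greedy single-pass clustering (an assumed, simplified scheme).

    Each spectrum joins the first cluster whose consensus is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []
    for idx, peaks in enumerate(spectra):
        vec = bin_spectrum(peaks, bin_width)
        for c in clusters:
            if cosine(vec, c["consensus"]) >= threshold:
                c["members"].append(vec)
                c["indices"].append(idx)
                c["consensus"] = consensus(c["members"])
                break
        else:
            # No sufficiently similar cluster was found
            clusters.append({"members": [vec], "indices": [idx], "consensus": vec})
    return clusters


# Toy data: the first two spectra share all peaks; the third shares none.
spectra = [
    [(175.1, 30.0), (276.2, 100.0), (389.3, 55.0)],
    [(175.1, 28.0), (276.2, 95.0), (389.3, 60.0)],
    [(147.1, 80.0), (260.2, 40.0), (503.3, 100.0)],
]
clusters = cluster_spectra(spectra)
print([c["indices"] for c in clusters])
```

In this toy example the first two spectra fall into one cluster and the third forms its own. Neither cluster needs a peptide identification, which is the property that lets an archive keep unidentified spectra alongside identified ones.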
We thank I. Kaufman for his assistance in running the experiments on the computational grid. This work was supported by US National Institutes of Health grant 1-P41-RR024851 from the National Center for Research Resources. This work used measurements based upon capabilities developed by the Department of Energy, Office of Biological and Environmental Research, and National Center for Research Resources (grant RR18522) conducted at the Environmental Molecular Sciences Laboratory, a national scientific user facility located at Pacific Northwest National Laboratory in Richland, Washington, USA.
The authors declare no competing financial interests.
Cite this article
Frank, A., Monroe, M., Shah, A. et al. Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat Methods 8, 587–591 (2011). https://doi.org/10.1038/nmeth.1609