Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra


Tandem mass spectrometry (MS/MS) experiments yield multiple, nearly identical spectra of the same peptide in various laboratories, but proteomics researchers typically do not leverage the unidentified spectra produced in other labs to decode the spectra they generate. We propose a spectral archives approach that clusters MS/MS datasets, representing similar spectra by a single consensus spectrum. Spectral archives extend spectral libraries by analyzing both identified and unidentified spectra in the same way and by maintaining information about peptide spectra that are common across species and conditions. Thus, archives offer both the traditional similarity-based search capabilities of spectral libraries and new ways to analyze the data. By developing a clustering tool, MS-Cluster, we generated a spectral archive from 1.18 billion spectra that greatly exceeds the size of existing spectral repositories. We advocate that publicly available data should be organized into spectral archives rather than analyzed as disparate datasets, as is mostly the case today.
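To illustrate the idea of representing similar spectra by a single consensus spectrum, the following is a minimal sketch and not the MS-Cluster algorithm itself. It assumes each spectrum is a list of (m/z, intensity) peaks, bins the peaks at a fixed width, compares spectra by cosine similarity and assigns each spectrum greedily to the most similar existing cluster. The function names, the bin width of 1.0 and the similarity threshold of 0.8 are hypothetical choices made for illustration only; a practical tool would also take precursor mass and charge into account.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict of bin -> value) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm > 0 else {}


def bin_spectrum(peaks, bin_width=1.0):
    """Sum (m/z, intensity) peaks into fixed-width bins and normalize."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return normalize(binned)


def cosine(a, b):
    """Cosine similarity of two sparse unit-length vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def cluster_spectra(spectra, threshold=0.8, bin_width=1.0):
    """Greedy single-pass clustering (illustrative assumption, not MS-Cluster)."""
    clusters = []  # each cluster: member indices plus a running sum of vectors
    for index, peaks in enumerate(spectra):
        vec = bin_spectrum(peaks, bin_width)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, normalize(cluster["sum"]))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No sufficiently similar cluster: start a new one.
            clusters.append({"members": [index], "sum": dict(vec)})
        else:
            # Join the best cluster and fold this spectrum into its sum.
            best["members"].append(index)
            for k, v in vec.items():
                best["sum"][k] = best["sum"].get(k, 0.0) + v
    return clusters


def consensus(cluster):
    """Consensus spectrum: the normalized sum of the cluster's member vectors."""
    return normalize(cluster["sum"])


# Toy data: the first two spectra share peaks, the third does not.
example = [
    [(100.1, 10.0), (200.2, 20.0), (300.3, 5.0)],
    [(100.3, 11.0), (200.4, 19.0), (300.1, 6.0)],
    [(150.0, 10.0), (250.0, 20.0)],
]
print([c["members"] for c in cluster_spectra(example)])  # [[0, 1], [2]]
```

In this sketch, identified and unidentified spectra are treated identically: a cluster needs no peptide annotation to exist, which mirrors the way an archive keeps unidentified spectra alongside identified ones.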

Figure 1: Clustering of the PNNL dataset.
Figure 2: Identification of peptides across different species.


We thank I. Kaufman for his assistance in running the experiments on the computational grid. This work was supported by US National Institutes of Health grant 1-P41-RR024851 from the National Center for Research Resources. This work used measurements based upon capabilities developed by the Department of Energy, Office of Biological and Environmental Research, and National Center for Research Resources (grant RR18522) conducted at the Environmental Molecular Sciences Laboratory, a national scientific user facility located at Pacific Northwest National Laboratory in Richland, Washington, USA.

A.M.F. designed and implemented the algorithms, designed and ran the experiments and wrote the paper. P.A.P. designed the algorithms and the experiments and wrote the paper. R.D.S. developed the measurement capabilities. R.J.M. was responsible for the measurements. M.E.M. and G.A.A. developed protocols and did the proteomics data acquisition and processing. A.R.S. assisted in designing the experiments. J.J.C. and N.B. designed and implemented the web-based archive searching tool. All authors discussed, commented on and contributed to writing the paper.

The authors declare no competing financial interests.

Supplementary Tables 1–2 and Supplementary Notes 1–6 (PDF 544 kb)

