Abstract
A broad range of mass spectrometers are used in mass spectrometry (MS)-based proteomics research. Each type of instrument possesses a unique design, data system and performance specifications, resulting in strengths and weaknesses for different types of experiments. Unfortunately, the native binary data formats produced by each type of mass spectrometer also differ and are usually proprietary. The diverse, nontransparent nature of the data structure complicates the integration of new instruments into preexisting infrastructure, impedes the analysis, exchange, comparison and publication of results from different experiments and laboratories, and prevents the bioinformatics community from accessing data sets required for software development. Here, we introduce the 'mzXML' format, an open, generic XML (extensible markup language) representation of MS data. We have also developed an accompanying suite of supporting programs. We expect that this format will facilitate data management, interpretation and dissemination in proteomics research.
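To make the idea of an open XML representation of MS data concrete, the following minimal sketch builds and reads back a single scan in an mzXML-like fragment. It is an illustration under stated assumptions, not the authors' tooling: the element and attribute names (`scan`, `peaks`, `precision`, `byteOrder`, `pairOrder`) and the base64 encoding of big-endian 32-bit m/z-intensity pairs follow the commonly described mzXML layout, the fragment is heavily simplified (no namespace, header or index), and the helper functions `encode_peaks` and `decode_peaks` are hypothetical names introduced here.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def encode_peaks(pairs):
    """Pack (m/z, intensity) pairs as big-endian 32-bit floats, then base64-encode."""
    flat = [value for pair in pairs for value in pair]
    raw = struct.pack(f">{len(flat)}f", *flat)
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text):
    """Reverse encode_peaks: base64 text back to a list of (m/z, intensity) tuples."""
    raw = base64.b64decode(text)
    count = len(raw) // 4  # 4 bytes per 32-bit float
    values = struct.unpack(f">{count}f", raw)
    return list(zip(values[0::2], values[1::2]))


# Hypothetical, simplified fragment; a real file carries a namespace,
# instrument metadata and a scan index around elements like this one.
peaks = [(100.0, 10.0), (200.5, 20.0)]
fragment = (
    '<scan num="1" msLevel="1" peaksCount="2">'
    '<peaks precision="32" byteOrder="network" pairOrder="m/z-int">'
    f"{encode_peaks(peaks)}"
    "</peaks></scan>"
)

scan = ET.fromstring(fragment)
decoded = decode_peaks(scan.find("peaks").text)
print(scan.get("num"), scan.get("msLevel"), decoded)
```

Because the fragment is plain XML, any standard XML parser can read the scan metadata without vendor software; only the peak list needs the extra base64 and byte-order decoding step.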
Acknowledgements
This project was funded in part by federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, under contract no. N01-HV-28179 and by grant no. 1R33CA93302 from the National Cancer Institute. The Institute for Systems Biology is supported by a generous gift from Merck and Co. We are grateful to SourceForge for hosting the project and Eugene Yi for providing the seven-protein mix data set. We would also like to acknowledge the following for endorsing the mzXML format: Philip C. Andrews, Tom Blackwell, Daniel Burns, Jayson Falkner, Panagiotis Papoulias, Abhik Shah, Peter Ulintz, Al Burlingame, Robert Chalkley, Karl Clauser, Bruno Domon, James Eddes, Robert Moritz, Daniel Figeys, Barry L. Karger, William Hancock, Tomas Rejtar, Peter James, Matthias Mann, Sanford Markey, Matthias Wilm, Ken Williams and Kratos Analytical Limited (a Shimadzu Group Company).
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Fig. 1
The PEDRo model and the mzXML format. (PDF 420 kb)
Supplementary Fig. 2
The mzXML index. (PDF 19 kb)
Supplementary Notes
Quick introduction to XML. (PDF 21 kb)
About this article
Cite this article
Pedrioli, P., Eng, J., Hubley, R. et al. A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 22, 1459–1466 (2004). https://doi.org/10.1038/nbt1031
DOI: https://doi.org/10.1038/nbt1031