A broad range of mass spectrometers are used in mass spectrometry (MS)-based proteomics research. Each type of instrument possesses a unique design, data system and performance specifications, resulting in strengths and weaknesses for different types of experiments. Unfortunately, the native binary data formats produced by each type of mass spectrometer also differ and are usually proprietary. The diverse, nontransparent nature of the data structure complicates the integration of new instruments into preexisting infrastructure, impedes the analysis, exchange, comparison and publication of results from different experiments and laboratories, and prevents the bioinformatics community from accessing data sets required for software development. Here, we introduce the 'mzXML' format, an open, generic XML (extensible markup language) representation of MS data. We have also developed an accompanying suite of supporting programs. We expect that this format will facilitate data management, interpretation and dissemination in proteomics research.
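To make the idea of an open XML representation of MS data concrete, here is a minimal sketch of writing and reading spectra in a toy XML document. It is not the mzXML schema: the element and attribute names (`msRun`, `scan`, `peaks`, `num`, `msLevel`, `peaksCount`, `precision`) and the peak encoding (base64 text holding big-endian 32-bit floats as m/z, intensity pairs) are assumptions chosen for illustration, and all function names are hypothetical.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def encode_peaks(pairs):
    """Pack (m/z, intensity) pairs as big-endian 32-bit floats, then base64 (assumed encoding)."""
    flat = [value for pair in pairs for value in pair]
    raw = struct.pack(f">{len(flat)}f", *flat)
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text):
    """Reverse encode_peaks: base64 text back to a list of (m/z, intensity) tuples."""
    raw = base64.b64decode(text or "")  # an empty element parses with text=None
    count = len(raw) // 4  # 4 bytes per 32-bit float
    values = struct.unpack(f">{count}f", raw)
    return list(zip(values[0::2], values[1::2]))


def build_toy_document(scans):
    """Serialize a list of (ms_level, pairs) into a toy XML string (illustrative layout only)."""
    root = ET.Element("msRun")
    for num, (ms_level, pairs) in enumerate(scans, start=1):
        scan = ET.SubElement(
            root,
            "scan",
            num=str(num),
            msLevel=str(ms_level),
            peaksCount=str(len(pairs)),
        )
        peaks = ET.SubElement(scan, "peaks", precision="32")
        peaks.text = encode_peaks(pairs)
    return ET.tostring(root, encoding="unicode")


def read_toy_document(xml_text):
    """Parse the toy XML back into (scan number, MS level, peak list) tuples."""
    root = ET.fromstring(xml_text)
    spectra = []
    for scan in root.iter("scan"):
        peaks = decode_peaks(scan.find("peaks").text)
        spectra.append((int(scan.get("num")), int(scan.get("msLevel")), peaks))
    return spectra


if __name__ == "__main__":
    document = build_toy_document([
        (1, [(100.5, 2000.0), (250.25, 1500.0)]),
        (2, [(75.125, 10.0)]),
    ])
    for num, level, peaks in read_toy_document(document):
        print(num, level, peaks)
```

Because the container is plain XML and the peak encoding is stated rather than hidden, any tool with an XML parser and a base64 decoder can read the file without vendor libraries, which is the kind of instrument-independent access the abstract argues for.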





This project was funded in part by federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, under contract no. N01-HV-28179 and by grant no. 1R33CA93302 from the National Cancer Institute. The Institute for Systems Biology is supported by a generous gift from Merck and Co. We are grateful to SourceForge for hosting the project and Eugene Yi for providing the seven-protein mix data set. We would also like to acknowledge the following for endorsing the mzXML format: Philip C. Andrews, Tom Blackwell, Daniel Burns, Jayson Falkner, Panagiotis Papoulias, Abhik Shah, Peter Ulintz, Al Burlingame, Robert Chalkley, Karl Clauser, Bruno Domon, James Eddes, Robert Moritz, Daniel Figeys, Barry L. Karger, William Hancock, Tomas Rejtar, Peter James, Matthias Mann, Sanford Markey, Matthias Wilm, Ken Williams and Kratos Analytical Limited (a Shimadzu Group Company).

Author information


  1. Institute for Systems Biology, 1441 North 34 Street, Seattle, Washington 98103-8904, USA.

    • Patrick G A Pedrioli
    • Jimmy K Eng
    • Robert Hubley
    • Mathijs Vogelzang
    • Eric W Deutsch
    • Brian Raught
    • Ruedi Aebersold
  2. Insilicos LLC, 4509 Interlake Avenue North, no. 223, Seattle, Washington 98103-6773, USA.

    • Brian Pratt
    • Erik Nilsson
  3. Albert Einstein College of Medicine, LMAP Room 405, Ullman Bldg., 1300 Morris Park Avenue, Bronx, New York 10461, USA.

    • Ruth H Angeletti
  4. EMBL Outstation European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    • Rolf Apweiler
    • Henning Hermjakob
    • Chris F Taylor
    • Weimin Zhu
  5. Center for Medical Informatics, Department of Anesthesiology, Yale University School of Medicine, PO Box 208009, New Haven, Connecticut 06520, USA.

    • Kei Cheung
  6. Boston University School of Medicine, 715 Albany Street, R-806, Boston, Massachusetts 02118-2526, USA.

    • Catherine E Costello
    • Sequin Huang
    • Mark E McComb
  7. Lilly Research Laboratories, One Lilly Corporate Center, Indianapolis, Indiana 46285, USA.

    • Randall K Julian Jr
  8. Joint Proteomics Laboratory, Ludwig Institute For Cancer Research & The Walter and Eliza Hall Institute of Medical Research, Royal Melbourne Hospital, Parkville, Victoria, Australia 3050.

    • Eugene Kapp
    • Richard Simpson
  9. School of Biological Sciences, University of Manchester, The Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK.

    • Stephen G Oliver
  10. The University of Michigan Medical School, 1150 W. Medical Center Drive, Ann Arbor, Michigan 48109-0656, USA.

    • Gilbert Omenn
  11. Department of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK.

    • Norman W Paton
  12. Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, PO Box 999, Richland, Washington 99352, USA.

    • Richard Smith



Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Ruedi Aebersold.

Supplementary information

PDF files

  1. Supplementary Fig. 1

    The PEDRo model and the mzXML format.

  2. Supplementary Fig. 2

    The mzXML index.

  3. Supplementary Methods

  4. Supplementary Notes

    Quick introduction to XML.
