ProteomeXchange provides globally coordinated proteomics data submission and dissemination

Vizcaíno, Juan A; Deutsch, Eric W; Wang, Rui; Csordas, Attila; Reisinger, Florian; Ríos, Daniel; Dianes, José A; Sun, Zhi; Farrah, Terry; Bandeira, Nuno; Binz, Pierre-Alain; Xenarios, Ioannis; Eisenacher, Martin; Mayer, Gerhard; Gatto, Laurent; Campos, Alex; Chalkley, Robert J; Kraus, Hans-Joachim; Albar, Juan Pablo; Martinez-Bartolomé, Salvador; Apweiler, Rolf; Omenn, Gilbert S; Martens, Lennart; Jones, Andrew R; Hermjakob, Henning

doi:10.1038/nbt.2839

Correspondence
Published: 10 March 2014

Nature Biotechnology volume 32, pages 223–226 (2014)

To the Editor

There is a growing trend toward public dissemination of proteomics data, which is facilitating the assessment, reuse, comparative analyses and extraction of new findings from published data¹,². This process has been mainly driven by journal publication guidelines and funding agencies. However, there is a need for better integration of public repositories and coordinated sharing of all the pieces of information needed to represent a full mass spectrometry (MS)–based proteomics experiment. An editorial in your journal in 2009, 'Credit where credit is overdue'³, exposed the situation in the proteomics field, where full data disclosure is still not common practice. Olsen and Mann⁴ identified different levels of information in the typical experiment: raw data, peptide identification and quantification, protein identifications and protein ratios, and the resulting biological conclusions. All of these levels should be captured and properly annotated in public databases, using the existing MS proteomics repositories for the MS data (raw data, identification and quantification results) and metadata, whereas the resulting biological information should be integrated in protein knowledge bases, such as UniProt⁵. A recent editorial⁶ in Nature Methods again highlighted the need for a stable repository for raw MS proteomics data. In this Correspondence, we report the first implementation of the ProteomeXchange consortium, an integrated framework for submission and dissemination of MS-based proteomics data.

Among the existing MS proteomics repositories with a broad target audience, the PRIDE (PRoteomics IDEntifications) database⁷ (European Bioinformatics Institute, EBI, Cambridge, UK; http://www.ebi.ac.uk/pride) and PeptideAtlas⁸ (Institute for Systems Biology, ISB, Seattle, USA; http://www.peptideatlas.org) are two of the most prominent. Both are mainly focused on tandem MS (MS/MS) data storage. Whereas PRIDE represents the information as originally analyzed by the researcher (thus constituting a primary resource), data in PeptideAtlas are reprocessed through a common pipeline (the Trans-Proteomic Pipeline) to provide a uniformly analyzed view of the data with a focus on low protein false discovery rates (constituting a secondary resource). In addition, ISB has set up the first repository for selected reaction monitoring (SRM) data, PASSEL⁹ (PeptideAtlas SRM Experiment Library, http://www.peptideatlas.org/passel/). There are other resources dedicated to storing MS proteomics data, each of them with different focuses and functionalities, for instance the Global Proteome Machine Database (GPMDB; where data are reprocessed using the search engine X!Tandem)¹⁰. At a higher abstraction level, resources such as UniProt and neXtProt are integrating proteomics results into a wider context of functional annotation from many different sources, including antibody-based methods.

Although most of the proteomics resources mentioned have existed for a long time, they have acted independently with limited coordination of their activities. As a result, data providers were unsure which repository they should submit their data set to, and in what form, with choices ranging from full raw data to highly processed identifications and quantifications. In addition, no repository could store both raw data and processed results. Similar issues arose for data consumers, who could not always find the data supporting a protein modification in UniProt, or know whether a particular data set from PRIDE had been integrated into PeptideAtlas.

The ProteomeXchange consortium (http://www.proteomexchange.org) was formed in 2006 (ref. 11) to overcome these challenges, developing from a loose collaboration into an international consortium of major stakeholders in the domain, comprising, among others, primary (PRIDE and PASSEL) and secondary resources (PeptideAtlas and UniProt), proteomics bioinformaticians, investigators (including some involved in the HUPO Human Proteome Project), and representatives from journals regularly publishing proteomics data (Supplementary Note, section 6). The aim of the ProteomeXchange consortium is to provide a common framework and infrastructure for the cooperation of proteomics resources by defining and implementing consistent, harmonized, user-friendly data deposition and exchange procedures among the major public proteomics repositories.

ProteomeXchange provides unified data submission for multiple MS data types and delivers different 'views' of the deposited data, such as the raw data suitable for reprocessing, the author-generated identifications and highly filtered composite results in resources like UniProt, all linked by a universal shared identifier. Authors are able to cite the resulting ProteomeXchange accession number for data sets reported in their publications. As such, a data set (with appropriate metadata) is becoming publishable per se and can be tracked if used by various consumers in different publications.

Individual resources can join ProteomeXchange by implementing the ProteomeXchange data submission and dissemination guidelines, and metadata requirements. In the current version (http://www.proteomexchange.org/concept), the mandatory information includes the following: first, mass spectrometer output files (raw data, either in a binary format, or in a standard open format such as mzML); second, processed identification results (two submission modes are available, see below); and third, sufficient metadata to provide a suitable biological and technological background, including method information such as transition lists in the case of SRM data. Other types of information, such as peak list files (processed versions of mass spectra most often used in the identification process) and quantification results can also be provided.
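The three mandatory components listed above can be expressed as a simple pre-submission check. The following Python sketch is illustrative only: the category labels (`raw`, `identification`, `metadata`, and so on) and the function name are hypothetical and are not taken from the ProteomeXchange guidelines or the submission tool.

```python
from typing import Dict, List

# Hypothetical category labels chosen for this sketch; the actual
# guidelines and tools may name these components differently.
REQUIRED_CATEGORIES = ("raw", "identification", "metadata")
OPTIONAL_CATEGORIES = ("peak_list", "quantification")


def missing_components(files_by_category: Dict[str, List[str]]) -> List[str]:
    """Return the mandatory categories that have no files attached."""
    return [
        category
        for category in REQUIRED_CATEGORIES
        if not files_by_category.get(category)
    ]


# Example: a draft submission that still lacks identification results.
draft = {
    "raw": ["run01.mzML"],
    "metadata": ["experiment_description.txt"],
    "peak_list": ["run01.mgf"],
}
```

Here `missing_components(draft)` reports that only the identification results are absent; optional categories such as peak lists never cause a failure.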

Two main MS proteomics workflows are now fully supported: tandem MS and SRM data (Fig. 1 and Supplementary Fig. 1). PRIDE acts as the initial submission point for MS/MS data, whereas PASSEL is the initial submission point for SRM data. It is expected that, in most cases, one ProteomeXchange data set will correspond to data from one publication, and it will be clearly linked to it. However, this concept is flexible and a mechanism for grouping different ProteomeXchange data sets is also available, for example, for large-scale collaborative studies. At present, two different submission modes ('complete' and 'partial') are available for MS/MS data.

**Figure 1: Representation of the ProteomeXchange workflow for MS/MS and SRM data.**

Complete submission requires peptide and protein identification results to be fully supported and integrated in the receiving repository (PRIDE at present). The search engine output files (plus the associated spectra) must therefore first be converted to PRIDE XML or mzIdentML format (a process supported by several popular and user-friendly tools; Supplementary Note, section 5). Complete submissions make the data fully available for querying, and thus maximize the potential for data re-use in MS. This in turn increases the visibility of the associated publication. A DOI (digital object identifier) is assigned to each data set, allowing formalized credit to be given to submitters and their principal investigators, through a citation index, as proposed in your editorial³.

In a partial submission, peptide or protein identification results cannot be integrated in PRIDE because data converters and exporters to the supported formats are not yet available. In this case, search engine output files can be directly provided in their original format. Although partial submissions are searchable by their metadata, they are not fully searchable by results, such as protein identifiers, and will not receive a DOI. However, partial submissions are important as they allow data from newly developed experimental approaches to be deposited into the ProteomeXchange resources, rather than having to reject these until the workflows have been mapped into a representation in PRIDE or another ProteomeXchange partner.
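The routing and submission-mode rules described above can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: the function names and string labels are hypothetical, the mode decision is reduced to a check on the identification result format, and the DOI rule reflects only what is stated here for MS/MS data sets.

```python
# Formats named in the text as supported for complete submissions.
SUPPORTED_RESULT_FORMATS = {"PRIDE XML", "mzIdentML"}

# Initial submission points named in the text for each workflow.
INITIAL_REPOSITORY = {"MS/MS": "PRIDE", "SRM": "PASSEL"}


def initial_repository(data_type: str) -> str:
    """Return the repository that first receives a data set of this type."""
    try:
        return INITIAL_REPOSITORY[data_type]
    except KeyError:
        raise ValueError(f"unsupported data type: {data_type!r}")


def submission_mode(result_format: str) -> str:
    """Classify an MS/MS submission as 'complete' or 'partial'.

    Simplifying assumption: the mode depends only on whether the
    identification results are in a format the repository can integrate.
    """
    return "complete" if result_format in SUPPORTED_RESULT_FORMATS else "partial"


def receives_doi(mode: str) -> bool:
    """Per the text, partial MS/MS submissions do not receive a DOI."""
    return mode == "complete"
```

For example, results converted to mzIdentML would be classified as a complete submission, whereas search engine output left in its original format would be classified as partial.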

For the submission of MS/MS data sets, a stand-alone, open-source Java tool has been made available, the 'ProteomeXchange submission tool' (http://www.proteomexchange.org/submission) (Supplementary Figs. 2–10 and Supplementary Note, section 5). The tool allows interactive submission of small data sets as well as large-scale batch submissions.

For SRM data sets, a web form (http://www.peptideatlas.org/submit) can be used for submission to PASSEL. Similar to the guidelines stated above for MS/MS data sets, PASSEL submissions require mass spectrometer output files, study metadata, peptide reagents, analysis result files and the actual SRM transition lists, the information that drives the instrument data acquisition. Once data sets are submitted, they are checked by a curator and then loaded into the main PASSEL database, which facilitates interactive exploration of the data and results.

The submitted information and files can selectively be made available to journal editors and reviewers during manuscript peer review. Once the manuscript is accepted for publication or the submitter informs the receiving repository directly, the data will be publicly released (Fig. 1). At this point, the availability of the data set, as well as basic metadata, will be disseminated through a public RSS feed (http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml). The RSS feed includes a link to an XML message (ProteomeXchange XML), which is created by the receiving repository (Supplementary Note, section 3), and made available from ProteomeCentral, the portal for all public ProteomeXchange data sets (http://proteomecentral.proteomexchange.org) (Supplementary Note, section 2). Repositories, such as PeptideAtlas or GPMDB, as well as any interested end users can subscribe to this RSS feed and trigger actions, including incorporation of the data into local resources, re-processing or biological analysis. This reprocessing is already occurring in practice. For example, two ProteomeXchange data sets (PXD000134 and PXD000157) have been used in the latest build of the human proteome in PeptideAtlas, and PXD000013 (ref. 12) was reprocessed and nominated as technical data set of the year 2012 by GPMDB (http://www.thegpm.org/dsotw_2012.html#201210071).
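A consumer of the RSS feed might extract newly announced accessions along the following lines. This Python sketch makes two assumptions that are not taken from the feed itself: that accessions follow the pattern `PXD` plus six digits (inferred from the examples cited above), and that they appear in the title, description or link of each item. The sample feed is invented for illustration and parsed from a string, so no network access is involved.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Assumed accession pattern, inferred from identifiers such as PXD000134.
PXD_PATTERN = re.compile(r"PXD\d{6}")


def announced_accessions(rss_text: str) -> List[str]:
    """Return unique accessions found in the feed items, in feed order."""
    root = ET.fromstring(rss_text)
    found: List[str] = []
    for item in root.iter("item"):
        # Search the fields most likely to carry the accession.
        text = " ".join(
            item.findtext(tag) or "" for tag in ("title", "description", "link")
        )
        for accession in PXD_PATTERN.findall(text):
            if accession not in found:
                found.append(accession)
    return found


# Invented sample feed; the real feed layout may differ.
SAMPLE_FEED = """<rss version="2.0"><channel><title>demo feed</title>
<item><title>New data set PXD000134</title>
<link>http://example.org/PXD000134</link></item>
<item><title>New data set PXD000157</title>
<link>http://example.org/PXD000157</link></item>
</channel></rss>"""
```

A repository could run such a function periodically and start reprocessing or import only for accessions it has not yet seen.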

ProteomeXchange started to accept regular submissions in June 2012. As of the beginning of February 2014, 685 ProteomeXchange data sets have been submitted (consisting of 656 tandem MS and 29 SRM data sets; Fig. 2), a total of ∼32 TB of data. The largest submission so far (data sets PXD000320–PXD000324) comprised 5 TB of data. For a current list of the publicly available data sets, see http://proteomecentral.proteomexchange.org/.
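Accession ranges such as PXD000320–PXD000324 can be expanded programmatically, and the reported totals can be checked with simple arithmetic (656 + 29 = 685). The helper below is a hypothetical sketch; it assumes that accessions consist of the prefix `PXD` followed by a zero-padded integer, as the identifiers quoted in this Correspondence suggest.

```python
from typing import List

PREFIX = "PXD"


def expand_accession_range(first: str, last: str) -> List[str]:
    """List every accession from first to last, inclusive."""
    if not (first.startswith(PREFIX) and last.startswith(PREFIX)):
        raise ValueError("accessions must start with 'PXD'")
    width = len(first) - len(PREFIX)  # keep the original zero padding
    start, end = int(first[len(PREFIX):]), int(last[len(PREFIX):])
    if end < start:
        raise ValueError("range end precedes range start")
    return [f"{PREFIX}{number:0{width}d}" for number in range(start, end + 1)]


# Counts reported in the text for February 2014.
TANDEM_MS_DATA_SETS = 656
SRM_DATA_SETS = 29
TOTAL_DATA_SETS = TANDEM_MS_DATA_SETS + SRM_DATA_SETS
```

Applied to the largest submission mentioned above, the helper yields the five accessions PXD000320 to PXD000324.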

**Figure 2: Summary of the main metrics of ProteomeXchange submissions (as of February 2014).**

In summary, ProteomeXchange provides an infrastructure for efficient and reliable public dissemination of proteomics data, supporting crucial validation, analysis and re-use. By providing and linking different interpretations of the data, we aim to maximize data set visibility as well as their potential benefit to different communities. Citability and traceability are addressed through the assignment of DOIs and a common identifier space. The consortium is open to the participation of additional resources (Supplementary Note, section 9). Although all repositories depend on continuous funding for continuous operation, the ProteomeXchange core repositories PRIDE and PeptideAtlas are well established, with first publications in 2005 (refs. 7,8), and have strong institutional backing (Supplementary Note, section 8), ensuring that the data will remain reliably available for the foreseeable future. We are confident that the ProteomeXchange infrastructure will support the growing trend toward public availability of proteomics data, maximizing its benefit to the scientific community through increased ease of access, greater ability to re-assess interpretations and extract further biological insight, and greater citation rates for the submitters.

Author Contributions

J.A.V., H.H. and E.W.D. led the current implementation of the ProteomeXchange data workflow, guidelines and related software. R.W. developed the ProteomeXchange submission tool. The remaining authors (A.C., F.R., D.R., J.A.D., Z.S., T.F., N.B., P.-A.B., I.X., M.E., G.M., L.G., A.C., R.J.C., H.-J.K., J.P.A., S.M.-B., R.A., G.S.O., L.M. and A.R.J.) contributed to the development of the ProteomeXchange consortium in different ways, for example, by contributing to the initial ProteomeXchange prototypes in the past, developing software and data standards, or contributing in different aspects to the implementation of the guidelines and the data workflow. J.A.V., E.W.D. and H.H. wrote the manuscript. All authors have agreed to all the content in the manuscript, including the data as presented.

References

1. Hahne, H. & Kuster, B. Mol. Cell. Proteomics 11, 1063–1069 (2012).
2. Matic, I., Ahel, I. & Hay, R.T. Nat. Methods 9, 771–772 (2012).
3. Editors. Nat. Biotechnol. 27, 579 (2009).
4. Olsen, J.V. & Mann, M. Sci. Signal. 4, pe7 (2011).
5. The UniProt Consortium. Nucleic Acids Res. 40, D71–D75 (2012).
6. Editors. Nat. Methods 9, 419 (2012).
7. Martens, L. et al. Proteomics 5, 3537–3545 (2005).
8. Deutsch, E.W. et al. Proteomics 5, 3497–3500 (2005).
9. Farrah, T. et al. Proteomics 12, 1170–1175 (2012).
10. Craig, R., Cortens, J.P. & Beavis, R.C. J. Proteome Res. 3, 1234–1242 (2004).
11. Hermjakob, H. & Apweiler, R. Expert Rev. Proteomics 3, 1–3 (2006).
12. Vaudel, M. et al. J. Proteome Res. 11, 5072–5080 (2012).

Acknowledgements

We thank all the members of the community who participated as stakeholders in the ProteomeXchange meetings. This work was supported by the EU FP7 grant ProteomeXchange (grant number 260558). J.A.V., A.C., F.R. and D.R. were also funded by the Wellcome Trust (grant number WT085949MA). E.W.D., Z.S. and T.F. are also funded in part by US National Institutes of Health/National Institute of General Medical Sciences grant no. R01 GM087221, US National Science Foundation Major Research Instrumentation Program (grant number 0923536), and the Luxembourg Centre for Systems Biomedicine and the University of Luxembourg. M.E. is funded by Protein Unit for Research in Europe (PURE), a project of Nordrhein-Westfalen. L.G. was supported by the EU FP7 PRIME-XS project (grant number 262067). R.W. was supported by the UK Biotechnology and Biological Sciences Research Council 'PRIDE Converter' grant (reference BB/I024204/1). G.S.O. acknowledges support from US National Institutes of Health grants RM-08-029, P30U54ES017885 and UL1RR24986.

Author information

Juan A Vizcaíno and Eric W Deutsch: These authors contributed equally to this work.

Authors and Affiliations

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Juan A Vizcaíno, Rui Wang, Attila Csordas, Florian Reisinger, Daniel Ríos, José A Dianes, Rolf Apweiler & Henning Hermjakob
Institute for Systems Biology, Seattle, Washington, USA
Eric W Deutsch, Zhi Sun, Terry Farrah & Gilbert S Omenn
Center for Computational Mass Spectrometry, University of California, San Diego, La Jolla, California, USA
Nuno Bandeira
Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
Pierre-Alain Binz & Ioannis Xenarios
University of Lausanne, Lausanne, Switzerland, and Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
Ioannis Xenarios
Vital-IT Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
Ioannis Xenarios
Medizinisches Proteom-Center, Ruhr-Universität Bochum, Bochum, Germany
Martin Eisenacher & Gerhard Mayer
Department of Biochemistry, Computational Proteomics Unit and Cambridge Centre for Proteomics, University of Cambridge, Cambridge, United Kingdom
Laurent Gatto
Integromics SL, Santiago Grisolia, Madrid, Spain
Alex Campos
Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, USA
Robert J Chalkley
Wiley-VCH Verlag, Weinheim, Germany
Hans-Joachim Kraus
ProteoRed-ISCIII, National Center for Biotechnology-CSIC, Madrid, Spain
Juan Pablo Albar & Salvador Martinez-Bartolomé
Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
Gilbert S Omenn
Department of Medical Protein Research, VIB, Ghent, Belgium
Lennart Martens
Department of Biochemistry, Ghent University, Ghent, Belgium
Lennart Martens
Institute of Integrative Biology, University of Liverpool, UK
Andrew R Jones

Corresponding author

Correspondence to Juan A Vizcaíno.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10 and Supplementary Notes (PDF 4310 kb)

About this article

Cite this article

Vizcaíno, J., Deutsch, E., Wang, R. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat Biotechnol 32, 223–226 (2014). https://doi.org/10.1038/nbt.2839

Published: 10 March 2014
Issue Date: March 2014
DOI: https://doi.org/10.1038/nbt.2839
