Introduction
To the Editor:
Your editorial on 'Democratizing Proteomics Data'1 correctly addressed the increasing importance of making proteomics data publicly available so that it can be audited, reanalyzed or reused. To make global data-sharing in the field work, however, it is important to minimize the burden of uploading data into publicly available databases, such as PRIDE2. To this end, we have written a freely available, open source tool called PRIDE Converter that makes it straightforward to submit proteomics data to PRIDE from most common data formats.
Public availability of data is the standard modus operandi for most of the life sciences, ranging from genome sequences through microarray data to protein information. Some of the best known examples in the field of proteomics include protein sequences in UniProt (http://www.uniprot.org/), protein structures in the Protein Data Bank (http://www.rcsb.org/) and protein modifications in UniMod and RESID (http://www.unimod.org/ and http://www.ebi.ac.uk/RESID). As highlighted in your 2007 editorial1, making data publicly available in a standardized and structured way enables other researchers to access and reanalyze the data, and to use the collected results in novel ways.
Indeed, much of the progress over the past years in emerging fields, such as mass spectrometry (MS)-based proteomics, is directly related to the public availability of data obtained in earlier efforts3, specifically the genome sequencing projects. Not surprisingly, the need for data-sharing in the field of proteomics itself was quickly pointed out4. Several proteomics MS data repositories have since been established, with GPMDB, PRIDE, PeptideAtlas and Proteinpedia among the most prominent5. With this infrastructure in place, journals have followed suit by starting to request deposition of MS-related data in these databases1, 6.
The PRIDE repository at the European Bioinformatics Institute (http://www.ebi.ac.uk/pride/) occupies a special place in the list of proteomics databases, in that it constitutes an actual data repository and does not assume editorial control over submitted data1. Additionally, it provides a simple yet powerful infrastructure to support anonymous peer review of submitted data while maintaining the submission as private in the system1. The PRIDE database has so far accumulated data on more than 9,500 experiments, collectively containing more than 40 million mass spectra, identifying well over 1.4 million unique peptide sequences, which in turn infer more than 100,000 unique Ensembl proteins across all species.
Submitting an MS-based proteomics data set to a structured repository, such as PRIDE, has many advantages over alternative ways of making peptide and protein identifications publicly available, such as uploading raw data files on a web page7 or submitting text or PDF files as supplementary information to a journal2. Furthermore, centralized repositories can also offer additional services and tools to the scientific community, based on uploaded data. PRIDE for instance includes tools for (i) visualizing protein coverage, peptide modifications and spectrum annotations, (ii) automatic mapping of protein accession numbers to identifiers from all other commonly used proteomics databases using the PICR service8 and (iii) comprehensive protein list comparisons (through Venn diagrams)9.
Submitting data to PRIDE could be challenging for some users, however. PRIDE relies on an XML-based data format for submissions, which is built around the Proteomics Standards Initiative mzData standard for mass spectrometry (http://www.ebi.ac.uk/pride/schemaXmlspyDocumentation.do)10. Although the PRIDE XML format is well documented, converting proteomics data to PRIDE XML could present difficulties, especially for wet-lab scientists without a strong bioinformatics background or informatics support. To alleviate this problem, two tools for converting data into PRIDE XML have already been developed: the ProteomeHarvest PRIDE Submission Spreadsheet, which is Microsoft Excel-based (http://www.ebi.ac.uk/pride/proteomeharvest/), and the PRIDE Wizard for Mascot result files (http://www.mcisb.org/resources/PrideWizard/). Both tools come with important limitations, however: ProteomeHarvest cannot easily accommodate peak lists and requires substantial manual effort from the user, whereas the PRIDE Wizard accepts only Mascot result files as input. In addition, both were developed for dealing with relatively small volumes of data.
We therefore developed PRIDE Converter, a tool that dramatically improves on the existing ones in three crucial aspects: (i) it can accommodate a large variety of input formats, (ii) it is suitable for both small and large data submissions and (iii) having a wizard-like graphical user interface, it is very intuitive and easy to use. PRIDE Converter is platform independent, is written in Java and is open source under the permissive Apache 2 license. At the time of writing, PRIDE Converter supports the conversion of fifteen different input formats into PRIDE XML (Fig. 1). The wizard divides the conversion process into eight simple steps. In each of these steps, the user is requested to provide appropriate metadata using controlled vocabulary terms that are retrieved through an online connection to the Ontology Lookup Service11 (Fig. 2). Like the PRIDE repository itself, PRIDE Converter can fully accommodate Minimum Information About A Proteomics Experiment (MIAPE)-compliant data reporting, although it does not rigorously enforce it. The PRIDE Converter application is freely available (http://code.google.com/p/pride-converter/), and further support is available through the PRIDE support team (pride-support@ebi.ac.uk).
Even though PRIDE Converter has been available for only a few months, it has already been used successfully by a variety of researchers to submit their data to PRIDE, so far resulting in 557 distinct submissions, containing >230,000 identified proteins, close to 1.5 million peptides and well over 23 million spectra. From these numbers, it is clear that PRIDE Converter already plays a substantial role in easing the burden of data submission to PRIDE. Interestingly, PRIDE Converter has also been used to make publicly available one of the most discussed proteomics data sets published to date12. As such, we are confident that PRIDE Converter will be a key asset in allowing authors to efficiently comply with the data submission guidelines proposed by Nature journals1, 6 and others (http://www3.interscience.wiley.com/homepages/76510741/2120_instruc.pdf). Although the completion of ongoing efforts to develop community standards will eventually result in the replacement of PRIDE XML by these standard formats, it will take some time before the new standards are in daily use in the laboratory. In addition, the ability of PRIDE Converter to act as an annotation tool will remain important even when the new standards enjoy widespread adoption, as important high-level metadata will remain outside the scope of these data formats. We therefore envision that future versions of PRIDE Converter will continue to be a key part of the efficient dissemination of mass spectrometry–based proteomics data for quite some time to come, especially with regard to empowering those labs that lack strong informatics support.


