To the Editor:
Your journal has published standards, such as the HUPO (Human Proteome Organization; Montreal, QC, Canada) Proteomics Standards Initiative–Molecular Interaction (PSI-MI) controlled vocabulary and data structure1, which, together with the continuing efforts of the International Molecular Exchange (IMEx) consortium2, have made it possible to aggregate protein-protein interaction (PPI) data from multiple sources into larger networks amenable to systematic analysis. Although the available aggregated data are useful, they are only partially consolidated because many issues remain unresolved. These include the endemic problem of matching gene and protein identifiers across databases and varying practices in recording the organism in which an interaction was observed. Databases also tend to use different conventions for representing multiprotein complexes identified by various detection methods. Likewise, high-throughput studies may report raw unprocessed data in addition to a high-confidence subset of the data, but databases do not generally agree on which of these is better suited for redistribution.
Two recent reports by our laboratories3, 4 and the iRefWeb interface (http://wodaklab.org/iRefWeb/) have brought these issues to the forefront, making them more transparent to both data 'consumers' and data providers. Here, we briefly summarize our findings and suggest how this increased transparency will raise awareness among end users and encourage all stakeholders, which include not only the databases but also the journals and authors2, 5, to move toward greater standardization of data archiving and curation practices. Such standardization will make it possible to focus on the more fundamental challenges of curating and gaining insight from physical interactions between proteins, which should help unravel the complexity of cellular processes and predict disease outcomes.
iRefWeb3 is a web resource that consolidates PPI data from ten major public databases (BIND, BioGRID, CORUM, DIP, IntAct, HPRD, MINT, MPact, MPPI and OPHID), which each curate and archive PPIs from the scientific literature (references to the individual databases can be found in the Supplementary Note). Previous consolidation efforts have focused primarily on physical protein interactions6, 7, 8, although some projects also integrate additional types of data9. The iRefIndex consolidation procedure behind iRefWeb is one of the most rigorous and thorough to date. It is unique among PPI data integrators in using a well-defined and universal method to assign identifiers to both interaction records and their participants10 (http://irefindex.uio.no/wiki/iRefIndex). The system also records and distributes process-provenance related to this assignment and the data it operates on (for further details on provenance, see Supplementary Note). As a result, the integration method, which includes isoform normalization, enables data tracking and auditing in a manner that is transparent, reversible, reproducible and universally accessible. These features played a critical role in enabling our studies and allowed us to provide detailed feedback to the source databases.
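The idea of a universal, reproducible identifier for participants and interaction records can be illustrated with a minimal sketch. It assumes that a participant key is a SHA-1 digest of the normalized amino acid sequence together with the NCBI taxonomy identifier, and that an interaction key is a digest of the sorted participant keys. The function names, the separator and the hex encoding are hypothetical choices made for illustration and are not the exact iRefIndex format.

```python
import hashlib


def participant_id(sequence: str, taxon_id: int) -> str:
    """Hypothetical participant key: digest of sequence plus taxon identifier."""
    # Normalize so that whitespace and letter case do not change the key.
    normalized = "".join(sequence.split()).upper()
    payload = f"{normalized}|{taxon_id}".encode("ascii")
    return hashlib.sha1(payload).hexdigest()


def interaction_id(participant_ids) -> str:
    """Hypothetical interaction key: digest of the sorted participant keys."""
    # Sorting makes the key independent of the order in which a source
    # database lists the participants; duplicates (homomers) are kept.
    payload = "|".join(sorted(participant_ids)).encode("ascii")
    return hashlib.sha1(payload).hexdigest()


# Toy sequences, not real proteins.
a = participant_id("MKTAYIAKQR", 9606)
b = participant_id("MSDNELQQLL", 9606)
record_key = interaction_id([a, b])
```

Because the keys depend only on the underlying sequence, organism and participant set, two databases that describe the same proteins under different accession schemes would map to the same record under these assumptions, which is what makes the consolidation auditable and reproducible.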
In a follow-up study4, we used iRefWeb to systematically compare the interactions and proteins curated by different databases from the same publication. An interaction and the proteins that form it are two basic descriptors that should ideally be specified unambiguously, and can be readily compared using completely automatic procedures. A total of 15,471 shared publications were analyzed, revealing that, on average, two databases fully agreed on only 42% of the interactions and 62% of the proteins curated from the same publication. Agreement varied for different organisms (Fig. 1a) and different databases (Fig. 1b).
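A pairwise comparison of this kind can be sketched as follows. The sketch assumes that each database's curation of a publication is available as a list of interactions, each given as a tuple of consolidated protein identifiers, and it scores agreement with the Sørensen-Dice coefficient. The data layout, the function names and the choice of metric are assumptions made for illustration; this letter does not specify the measure used.

```python
def dice(a: set, b: set) -> float:
    """Sørensen-Dice overlap: 1.0 for identical sets, 0.0 for disjoint sets."""
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


def compare_curation(records_a, records_b):
    """Return (interaction agreement, protein agreement) for one publication.

    Each argument is an iterable of interactions, and each interaction is an
    iterable of protein identifiers (hypothetical layout).
    """
    # frozenset makes an interaction independent of participant order.
    interactions_a = {frozenset(r) for r in records_a}
    interactions_b = {frozenset(r) for r in records_b}
    # Proteins are all participants mentioned in any interaction.
    proteins_a = set().union(*interactions_a)
    proteins_b = set().union(*interactions_b)
    return dice(interactions_a, interactions_b), dice(proteins_a, proteins_b)


# Toy example: two databases curating the same publication.
db_a = [("P1", "P2"), ("P2", "P3")]
db_b = [("P2", "P1"), ("P3", "P4")]
interaction_score, protein_score = compare_curation(db_a, db_b)
```

In this toy case the two databases share one of two interactions each, so the interaction score is 0.5, whereas the protein score is higher because three of the four proteins appear in both annotations. Averaging such scores over all shared publications would give summary figures of the kind reported above.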