Main

Deciphering the molecular mechanisms of cell function relies to a large extent on tracing the multitude of interactions between the numerous components of living cells, and between these molecules and any entity or compound of interest to the scientist, such as pharmaceutical agents or environmental contaminants. Molecular interactions may be direct, with two molecules in contact with each other, or the molecules may be in the same affinity complex, purifying together without a physical interaction between them. Several public databases strive to capture the ever increasing amount of published molecular interaction data, which are generated by a broad range of biophysical, biochemical, genetic or predictive methods. During the process of manual curation, the raw data are extracted from a published paper or from a submitted manuscript and systematically transferred into a database.

Initially, interaction databases such as BIND1 and DIP2 worked in isolation and according to their own internal standards and data formats. Because no one database can achieve complete coverage of all known molecular interactions, the user may need to download and combine datasets from two or more databases to answer a specific question. Until recently, this could not be done without first transforming the data into a common format, using a different parser for each database. In 2004, however, several major databases jointly published a community-standard data model for the representation and exchange of protein interaction data3. This data model, developed by members of the Molecular Interaction (MI) group of the Proteomics Standards Initiative (PSI), a work group of the Human Proteome Organization (HUPO)4, has already been adopted by major public interaction databases. Data sets can be downloaded from many of these databases in PSI-MI extensible markup language (XML) interchange format and further analyzed using a number of PSI-MI compatible tools, such as Cytoscape5, ProViz6 and PIMWalker7.

Building on the PSI-MI standard, several public interaction databases have formed the International Molecular Interaction Exchange consortium (IMEx; http://imex.sf.net). The consortium, originally founded by BIND1, DIP2, IntAct8, MINT9 and MPact (MIPS)10, has started to share the curation load and aims to regularly interchange data curated to the same common standards, in a manner similar to the well established pattern followed by the nucleotide sequence databases. However, the consortium's goal of achieving as near complete coverage as possible of interaction data in the literature is greatly hindered by inconsistencies and missing information in published papers. The absence of key pieces of information can lead both to misinterpretation of the paper by scientists and to a time-consuming, error-prone attempt to derive the missing information by a database curation team. Often, the reason for such information deficits is simply the lack of a community consensus on what information is required to appropriately describe a molecular interaction.

To address this issue, we have developed MIMIx as a basis for discussion. MIMIx represents a compromise between the depth of information necessary to describe all relevant aspects of an interaction experiment and the reporting burden placed on scientists who generate data. Its purpose is to ensure that the bench scientist has a checklist (Box 1) of the information to be supplied when describing experimental molecular interaction data in a journal article, displaying data on a website or depositing data directly into a public database (Box 2).

A MIMIx-compliant dataset is not intended to allow an interaction experiment to be reproduced from a database record, but to enable database users to quickly assess and focus on data relevant to them and then link to the source publications for the full experimental context. At the other end of the complexity scale, the PSI-MI XML interchange format, which has been adopted by all IMEx partners, provides for a much richer representation of a molecular interaction experiment than that required by MIMIx. IMEx partners also welcome data submissions that use the full complexity of the PSI-MI format.

Molecules

The single greatest source of data loss in transferring interaction data into a database is the use of ambiguous molecule identifiers, such as gene names. According to anecdotal estimates from database curators, as much as 70% of overall curation time is spent mapping molecule identifiers unambiguously to well characterized database entries. For example, a paper may not indicate both the gene name and the species from which the gene originated. This information is implicit in the molecule identifiers generated by the major databases. The description 'lck cloned in a mammalian expression vector' gives no indication as to whether the protein source is human, mouse, bovine or rat. 'Human p56lck protein' gives information about the species but not about the splice isoform, whereas both species and sequence are provided by the accession numbers UniProtKB11 P06239 and RefSeq12 NP_005347, and P06239-1 gives a full description of a specific isoform. 'Human PI3-kinase p85 subunit' may appear to be a unique reference, but does it refer to the alpha subunit (P27986) or the beta subunit (O00459), which are two distinct gene products? Such ambiguities will almost certainly result in the paper in question not being added to a curated dataset and may also mislead the reader regarding the actual construction of the experiment. Similarly, it is important for authors to state whether an interaction described in one organism was modeled from an interaction detected between similar molecules in a related organism; for example, an interaction between a rat and a human protein being used to infer a human-human protein interaction. The constructs used, including the organism of origin of the sequence and the splice variant, should be clearly described.

We therefore request that all molecules be identified by a database accession number from a public database. For proteins, UniProt or RefSeq are strongly recommended; for genes, Ensembl13 or Entrez Gene14; for chemical entities, PubChem14 or ChEBI15. Nucleotide sequence database accession numbers (DDBJ, EMBL or GenBank, http://www.insdc.org) identify specific transcripts and give additional information as to the source and the class of nucleic acid under investigation. Where a molecule description is not available from these databases, identifiers from other public databases, such as model-organism databases, may be used. For a full list of recommended databases, please refer to the relevant section of the PSI-MI controlled vocabulary (see below), which also provides unified names for these resources.

An annotated protein or nucleic acid sequence may vary with time as the original submitters update their coding sequence prediction programs, frameshifts are identified, and correction or resequencing is undertaken. This may invalidate the mapping of specific sequence positions; for example, those where deletion mutants or binding domains are described. We therefore request the addition of version numbers, either of the molecule (for example, P06239.5) or of the database, to the MIMIx record.

Although the identification of molecules by accession number is precise, it may be unwieldy to refer to 'UniProt:P06239.5' instead of 'lck' in the text of a paper. To satisfy the need for both precision and readability, we recommend that the accession number and the molecule name used in the text be associated either in the submitted database record or at least at the first occurrence in the paper (for example, “...lck (UniProt:P06239.5)...”).
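As an illustration of how such references can be handled by software, the following minimal Python sketch splits a reference such as 'UniProt:P06239.5' into database, accession, isoform and version. The function name, the record type and the combined 'Database:accession-isoform.version' layout are assumptions made for this example; they are not part of the MIMIx guidelines or of any database's official syntax.

```python
import re
from typing import NamedTuple, Optional


class MoleculeRef(NamedTuple):
    """Hypothetical container for one parsed molecule reference."""
    database: str
    accession: str
    isoform: Optional[int]
    version: Optional[int]


# Assumed layout: Database:ACCESSION[-isoform][.version]
_REF_PATTERN = re.compile(
    r"^(?P<database>[A-Za-z][A-Za-z0-9]*):"
    r"(?P<accession>[A-Za-z0-9_]+)"
    r"(?:-(?P<isoform>\d+))?"
    r"(?:\.(?P<version>\d+))?$"
)


def parse_molecule_ref(text: str) -> MoleculeRef:
    """Parse a database-qualified reference; reject bare names such as 'lck'."""
    match = _REF_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a database-qualified accession: {text!r}")
    isoform = match.group("isoform")
    version = match.group("version")
    return MoleculeRef(
        database=match.group("database"),
        accession=match.group("accession"),
        isoform=int(isoform) if isoform is not None else None,
        version=int(version) if version is not None else None,
    )
```

Under these assumptions, 'UniProt:P06239.5' yields accession P06239 with version 5, 'UniProt:P06239-1' yields isoform 1 with no version, and a bare gene name such as 'lck' is rejected because it carries no database qualifier.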

A key element in the description of an interaction experiment is the role a molecule has in the interaction. MIMIx requests the classification of the molecule role in two ways: the biological role, for example, enzyme or enzyme target; and the experimental role, for example, bait or prey. For both of these, the PSI-MI standard defines a comprehensive controlled vocabulary, ensuring that the same term, rather than synonyms or alternative spellings, is used throughout a paper and that the interpretation of the meaning of that term remains fixed. A list of controlled vocabulary terms that describe the various methods used to detect molecular interactions, current as Nature Biotechnology went to press, is available in Supplementary Note 1 online. This list undergoes continual revision as technologies evolve; the latest version is available at http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI.

Finally, it should be noted that databases describe the canonical form of a molecule. The actual participant in a molecular interaction may have been altered, either naturally by the cell (e.g., by cleavage of a bioactive peptide from a precursor protein) or by engineering (e.g., by addition of a tag or creation of a deletion mutant). Terms to describe the 'participant' in an interaction as a derivative of the 'molecule' in the database entry are available in the PSI-MI controlled vocabularies.

Experiment

The MIMIx experiment description implements the core requirements of PSI's “minimum information about a proteomics experiment” (MIAPE) guidelines16 (see p. 887) and aims to capture the aspects of an interaction experiment that are necessary to classify and critically assess the results and the interpretation of the results. It is likely to be further refined in the future as other technology-specific MIAPE modules evolve. The attributes we consider essential at present are as follows:

The 'host organism' describes the system in which the interactions were detected. The host organism should be described by a National Center for Biotechnology Information (NCBI; Bethesda, Maryland, USA) taxonomy identifier and should contain further specification, such as cell-line or tissue descriptors. When the experiment was performed in vitro, this should be described as free text.

The 'interaction detection method' describes the method by which the interaction was determined (for example, tandem affinity purification (MI:0676)).

The 'participant detection method' names the experimental procedure for the detection of the molecules participating in the interaction (for example, peptide mass fingerprinting (MI:0082)).

Beyond these essential requirements, we recommend that authors provide additional detail on molecule sources, sample preparation and further relevant experimental parameters using the detailed controlled vocabularies provided by the PSI-MI standard.

Interaction

The PSI-MI standard provides a formal frame for a detailed description of an interaction, including both qualitative parameters, such as details of mutations, and quantitative parameters, such as dissociation constants. However, these data are often not available, and, thus, MIMIx requires only one element for the description of an interaction: the list of molecules participating in it, characterized as above. If a quality assessment was carried out, the confidence value assigned to the interaction and the confidence attribution system must also be included in the manuscript. Particularly in large-scale experiments, interactions are usually assigned a quality score, which might be derived from data collected in the experiment itself or from additional data outside the experiment. Inclusion of interaction data in public databases requires that this reliability score be easily accessible. Ideally, not only the score but also the raw data used to derive the score should be reported so that users can perform alternative quality assessments.
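The requirements described in the Molecules, Experiment and Interaction sections can be pictured as a simple record with a completeness check. The Python sketch below is one possible reading of the text and is not the official checklist of Box 1: the class names, field names and the check function are hypothetical, the 'MI:' plus four digits pattern is inferred from the examples MI:0676 and MI:0082, and the participant pairing in the example is a placeholder, not a reported interaction.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed shape of a PSI-MI controlled vocabulary identifier, e.g. "MI:0676".
MI_TERM = re.compile(r"^MI:\d{4}$")


@dataclass
class Participant:
    identifier: str         # database-qualified accession, e.g. "UniProt:P06239.5"
    biological_role: str    # e.g. "enzyme" or "enzyme target"
    experimental_role: str  # e.g. "bait" or "prey"


@dataclass
class InteractionRecord:
    participants: List[Participant]
    host_organism: str                 # NCBI taxonomy identifier, or free text for in vitro
    interaction_detection_method: str  # PSI-MI term identifier
    participant_detection_method: str  # PSI-MI term identifier
    confidence_value: Optional[str] = None
    confidence_system: Optional[str] = None


def checklist_problems(record: InteractionRecord) -> List[str]:
    """Return human-readable descriptions of missing or malformed items."""
    problems: List[str] = []
    if not record.participants:
        problems.append("no participants listed")
    for p in record.participants:
        if ":" not in p.identifier:
            problems.append(f"{p.identifier!r} lacks a database-qualified accession")
        if not p.biological_role:
            problems.append(f"{p.identifier!r} has no biological role")
        if not p.experimental_role:
            problems.append(f"{p.identifier!r} has no experimental role")
    if not record.host_organism.strip():
        problems.append("host organism missing")
    for label, term in (
        ("interaction detection method", record.interaction_detection_method),
        ("participant detection method", record.participant_detection_method),
    ):
        if not MI_TERM.match(term):
            problems.append(f"{label} is not an MI term: {term!r}")
    # A confidence value is only interpretable together with its scoring system.
    if (record.confidence_value is None) != (record.confidence_system is None):
        problems.append("confidence value and scoring system must be given together")
    return problems


# Illustrative record only; "9606" is the NCBI taxonomy identifier for human.
example = InteractionRecord(
    participants=[
        Participant("UniProt:P06239.5", "enzyme", "bait"),
        Participant("UniProt:P27986", "enzyme target", "prey"),
    ],
    host_organism="9606",
    interaction_detection_method="MI:0676",  # tandem affinity purification
    participant_detection_method="MI:0082",  # peptide mass fingerprinting
)
print(checklist_problems(example))
```

Returning a list of problems instead of raising on the first one mirrors the way a checklist is used: the author sees every gap at once and can fix them before submission.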

Relationships to other biological standards

The MIMIx guidelines have been developed in close collaboration with related standards bodies within both the HUPO-PSI and the wider community, in consultation with contributors to the MIAME microarray standards17. MIMIx is one of a series of modules developed within the framework of the MIAPE guidelines16. When an interaction experiment encompasses experimental data that are more fully described by other modules, authors should refer to the relevant guidelines when preparing their data for submission to a journal or a database (Fig. 1). For example, identification of prey proteins in a tandem affinity purification (TAP) pulldown by mass spectrometry should be described according to the MIAPE-MS guidelines (C.F. Taylor et al., unpublished).

Figure 1: The relationship of MIMIx to two guidelines that may be relevant to molecular interaction studies, MIAME17 and MIAPE16/MIAPE-MS.

Almost all interaction data may be described using MIMIx; however, MIAME provides guidelines for describing a microarray experiment, and MIAPE allows the submitter to supply details of the peptides and underlying spectra when mass spectrometry has been used to identify protein participants.

Similarly, the HUPO-PSI and the Microarray Gene Expression Data (MGED) consortium are jointly working to provide guidelines for the annotation of array data that will ensure a smooth path for the annotation and submission of such data. Overarching all of these standards is the “functional genomics experiment” (FuGE) model18, which can be used to provide protocols and data flow models should the user wish to annotate such detail. All these guidelines are being managed through a central repository of standards, as described in the “minimum information for biological and biomedical investigations” (MIBBI)19, to ensure that they are complementary and nonoverlapping.

Data deposition

Curators of the main molecular interaction databases work to collect and archive data from journal publications. Although systematic reporting of published interaction data according to the above guidelines would greatly increase the efficiency of the curation task, literature curation after publication is only a second-best option. We therefore recommend that all reported interaction data be deposited in a publicly available molecular interaction database before publication.

Data deposition has benefits for all parties involved. The databases will be able to work more efficiently and will have more direct access to the data producer to resolve unclear issues. The scientific community will benefit from more, and more precise, information in the databases, as database records can be checked directly by the data producer. Journals and data producers will benefit from consistently formatted database records, which can be included in the supplementary material of a publication. Accession numbers issued by a database and included in the journal publication will allow direct access to the data in the database and a quick connection to related data in the database, such as other records on the same molecules. Finally, data producers and journals will gain exposure for the publication through cross-references from the database records.

IMEx databases offer several options for data deposition (http://imex.sf.net/deposition.html). The submission of fully formatted PSI-MI XML files is recommended for large-scale data producers, who usually have the data available in in-house databases anyway. For smaller-scale experiments, a preformatted Microsoft Excel spreadsheet file is available, with instructions on how to complete it. In addition to technical systems, such as the Ontology Look-up Service (OLS) browser20 and a system for the automatic validation of PSI-MI XML files (http://www.ebi.ac.uk/intact/validator), database curation teams provide assistance in all stages of the data deposition process, for example, in the correct use of the detailed controlled vocabularies used to characterize an interaction. We particularly encourage early contacts with database curation teams, to embed appropriate data collection protocols into the experiment-planning stage.

In addition to the biological data, each data deposition must be accompanied by the minimal administrative data, namely contact email, publication title, first author and the publication identifier, usually a PubMed or Digital Object (http://www.doi.org) identifier. In the prepublication stage, a journal-specific identifier can be used to provide a unique identification of the manuscript accompanying the data deposition; before manuscript submission, the authors may use their own in-house identifier.
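The administrative data listed above lend themselves to a simple pre-submission check. The following Python sketch is illustrative only: the field names and both functions are hypothetical and do not reflect any IMEx submission interface. It assumes that a PubMed identifier consists solely of digits and that a Digital Object Identifier begins with '10.' followed by a registrant code and a slash; anything else is treated as a journal-specific or in-house identifier.

```python
import re
from typing import Dict, List

# Hypothetical field names for the four administrative items named in the text.
REQUIRED_ADMIN_FIELDS = (
    "contact_email",
    "publication_title",
    "first_author",
    "publication_identifier",
)


def missing_admin_fields(metadata: Dict[str, str]) -> List[str]:
    """List the required administrative fields that are absent or blank."""
    return [
        name for name in REQUIRED_ADMIN_FIELDS
        if not metadata.get(name, "").strip()
    ]


def identifier_kind(identifier: str) -> str:
    """Roughly classify a publication identifier by its shape."""
    value = identifier.strip()
    if re.fullmatch(r"\d+", value):
        return "pubmed"
    if re.fullmatch(r"10\.\d{4,9}/\S+", value):
        return "doi"
    # Journal-specific or in-house identifiers used before publication.
    return "other"
```

Keeping the 'other' category open reflects the prepublication case described above, in which a journal-specific or in-house identifier stands in until a PubMed or Digital Object identifier exists.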

To optimize the use of public resources, IMEx partners have developed common curation guidelines and have agreed to synchronize their curation work and exchange all user-submitted data so as to build up a network of stable, well coordinated molecular interaction databases freely accessible to the community. Although accession numbers for deposited interactions will be issued within five working days of the provision of all necessary data, deposited data will be released only upon publication of the associated manuscript or at the request of the data provider.

Conclusion

The MIMIx guidelines presented here will not be static. They will evolve based on community requirements in the context of a rapidly developing science. This document has been assembled by a large number of experts and subjected to public review both on the PSI website and through Nature Biotechnology community review. At all stages, we have discussed input and fed it back into the document. The MIMIx guidelines, PSI-MI XML interchange format and the corresponding controlled vocabularies are all maintained and updated through the PSI-MI workgroup using mailing lists, issue trackers and annual workshops. If you wish to make specific comments on the MIMIx guidelines, please use the issue tracker at http://www.psidev.info/index.php?q=node/279 or, for a wider involvement, refer to the mailing lists at http://www.psidev.info/.

Note: Supplementary information is available on the Nature Biotechnology website.