In an 'omics' world, a concept as fundamental to scientific practice as sharing data meets serious logistical challenges. Traditionally, publication has ensured that one's results could be viewed, scrutinized and understood by others, but at the omics scale the sheer volume of data and the multiplicity of their formats are major hurdles. To facilitate data sharing, several groups are engaged in the long and tedious process of standardization. The larger community of experimentalists should support and take an active part in these standardization initiatives, but should also think carefully about what these standards can and cannot achieve.
The Microarray Gene Expression Data (MGED) society, a grassroots movement created in 1999, paved the way for omics standardization efforts. In its footsteps, the Proteomics Standards Initiative (PSI), sponsored by the Human Proteome Organization, has expanded the standardization movement to the proteomics arena (http://psidev.sourceforge.net/). Other groups are now following suit in a variety of disciplines.
Overall, these organizations have three major action points on their agenda. First, they develop data formats compatible with the majority of instrument outputs and analysis software. Second, they establish reporting standards, listing the minimum information that adequately describes the samples and how the experiment was performed. These reporting standards are captured in documents like Minimum Information About a Microarray Experiment (MIAME) and its proteomics equivalent, MIAPE, currently in preparation. Finally, the standardization groups are helping codify these descriptions by defining controlled vocabularies.
In parallel with these efforts, databases that accommodate the standards are being created. Thus, it appears that in proteomics, too, in the not too distant future, the authors of a paper should be able to deposit their raw data, in a standardized format and with adequate minimum information, in centralized repositories. Databases like GenBank and Protein Data Bank (PDB) have demonstrated the advantages of this approach for sharing large amounts of information.
Moreover, such a system could start addressing the concerns voiced by reviewers of omics papers, who are growing frustrated at the superhuman task that is asked of them (Nature 440, 992; 2006). Offering the data for the community's scrutiny upon publication, and defining more clearly what the peer review can achieve, could somewhat relieve the reviewers' burden while maintaining quality checks.
But beyond this, the primary goal of the standardization initiatives is to enable data mining at a large scale. By standardizing formats and ontologies, it becomes possible to use computers to mine a cumulative data set produced in hundreds of experiments performed around the world. This is a tantalizing vision, but it is important to keep in mind that its success depends critically on scale.
Indeed, most of the analytical tools in question are extremely sensitive to small variations; the samples are complex biological samples (often extensively manipulated), and the observations, unlike gene or protein sequences, are highly context-dependent: this is the quintessence of variability. Under these circumstances, the size of the data set used in a meta-analysis becomes critical for distinguishing robust observations with true biological meaning from all other sources of variation.
In the case of microarrays, for example, sources of variability are such that, at this stage, studies involving multiple laboratories need to be planned so that all procedures upstream and downstream of the microarray tests are identical (Nat. Methods 2, 337–343; 2005; Nat. Methods 2, 345–349; 2005; and Nat. Methods 2, 351–356; 2005). This is likely to be the case for other technologies as well.
Standards such as MIAME or MIAPE are reporting standards; they are not going to solve the issue of variability. They cannot possibly describe all confounding factors, and it would be futile to attempt a comprehensive listing. Rather, it is vital to maintain a compromise between detail and practicality in reporting, so that compliance with the standards is not so onerous as to inhibit their adoption.
Experimentalists must keep these considerations in mind and support the standardization initiatives, so that the point at which the volume of standardized data permits meaningful examination can be reached rapidly. For the standardization initiatives to be successful, they need the contribution of all strata of the community: investigators, informatics developers, database curators, instrument vendors, journals and funding agencies.
At this stage, the best support that experimentalists can provide is to add their voice to the definition of standards, provide feedback to organizations like MGED and PSI, and ensure that the standards that emerge are practical, efficient and realistic to implement. After all, they will most likely affect the way you do research tomorrow.