Adoption of easy-to-follow standards will vastly improve our ability to interpret data from genomes, metagenomes and marker studies

Interest in sampling diverse environments, combined with advances in high-throughput sequencing, has vastly accelerated the pace at which new genomes and metagenomes are generated. For example, as of January 2011, 12 500 user-generated metagenomes have been submitted to the public MG-RAST Annotation server (http://metagenomics.nmpdr.org; Meyer et al., 2008), >90% of which were produced using high-throughput sequencing methodologies. We have entered into an era of ‘mega-sequencing projects’ that include the Genomic Encyclopaedia of Bacteria and Archaea project (http://www.jgi.doe.gov/programs/GEBA), the Microbial Earth Project (http://genome.jgi-psf.org/programs/bacteria-archaea/MEP/index.jsf), the Human Microbiome Project (http://nihroadmap.nih.gov/hmp), the Metagenomics of the Human Intestinal Tract consortium (http://www.metahit.eu), the Terragenome Initiative (http://www.terragenome.org), the Tara Oceans Expedition (http://oceans.taraexpeditions.org), the National Ecological Observatory Network (NEON-http://www.neoninc.org), the International Census of Marine Microbes (ICoMM-http://icomm.mbl.edu), Microbial Inventory Research Across Diverse Aquatic Long-Term Ecological Research Sites (http://amarallab.mbl.edu/mirada/mirada.html), the Earth Microbiome Project (http://www.earthmicrobiome.org) and other funded and unfunded projects, with many more visionary projects on the horizon.

Additionally, studies of emerging metatranscriptomes (community transcript profiles), metaproteomes (community protein profiles) and metametabolomes (community metabolite profiles) now complement genomes and metagenomes. Comparative studies of multi-omic data sets from the same community hold the promise of unparalleled insights into fundamental questions across a range of fields including evolution, ecology, environmental science, physiology and medicine. Advances stem from improvements in the annotation and quantification of genes, pathways, organisms and consortia within these communities. We are just starting to exploit these technologies to understand the microbial world, and have only scratched the surface in terms of sampling microbial diversity across temporal and spatial scales (Delmotte et al., 2009; Gilbert et al., 2010a). To fully exploit the promise of these data, we need both scientific innovation and community agreement on how to provide appropriate stewardship of these resources for the benefit of all.

Although we have collected billions of nucleic-acid sequences from thousands of ecosystems, illuminating uncharacterized microbial lifestyles remains far from trivial. For example, in each analyzed genome or metagenome, about 40% of the putative protein-coding genes cannot be assigned to any known function or taxon. Only 42% of the 61 known bacterial phyla have even a single cultured representative (Hugenholtz and Kyrpides, 2009), with the remainder being known only from 16S rRNA gene environmental surveys. Surprisingly, only 14% of cultured bacterial taxa have even a single complete genome sequenced. Holistic approaches that centralize (meta)omics data are needed to allow investigators to analyze these data within the context of space, time, habitat and characteristics of the environment. Networks of information arising from these studies will allow us to describe and predict ecological patterns of organisms, genes, transcripts and proteins.

One key clue to the function of a gene or organism is the environment where it occurs. Collection of contextual (meta)data, which delineate the source of a sequence in terms of the space, time, habitat and characteristics of the environment, is thus essential for interpreting these unknown genes and species, as well as for gaining new insights into the known fraction. Although early comparative studies of metagenomes (Tringe et al., 2005) relied on a few, deeply sequenced samples, the experience from 16S rRNA gene surveys suggests that additional insight is gained from observing spatial and temporal variation across hundreds of samples, whether examining the distribution of bacteria in soils across a continent (Lauber et al., 2009) or various skin sites from many subjects (Grice et al., 2009).

At present, the valuable contextual data halo is often missing for sequences deposited in the International Nucleotide Sequence Database Collaboration (INSDC; GenBank, European Nucleotide Archive (ENA, including EMBL-Bank) and the DNA Databank of Japan (DDBJ)). This leaves researchers searching electronic resources and the literature, or contacting the authors, for even the most basic contextual data, such as geographic location, date and time of sampling or the habitat where the sample was obtained. Molecular ecologists should immediately recognize the inherent value of these data to the community, because without them their own sequence data sets will have extremely limited comparability with the wealth of other data available. Sequences without contextual data are like unlabeled cans in a supermarket: you do not know what you are purchasing until you open one and examine its contents. The present inability to automatically retrieve rich contextual data hampers comparative research, and constitutes a considerable misuse of the vast global resources currently being applied to microbial ecology. Just as food-safety laws emphasize clear and accurate labeling based on the product, process and producer, so should sequence data be properly annotated.

Standardization of the required information will greatly facilitate the annotation of sequence data. To achieve this, we must first have community collaboration and participation. Second, as a result of this collaboration, a contextual data set must be standardized in terms of content, syntax and terminology to which the community can adhere. In 2005, members of the community came together to form the Genomic Standards Consortium (GSC), an open-membership working body with the stated mission of working towards better descriptions of our genomes, metagenomes and related data (http://www.gensc.org). Supported by the expertise of members involved in many of the aforementioned mega-sequencing projects, the GSC has formalized contextual data requirements for genomes and metagenomes as the Minimum Information about a Genome/Metagenome Sequence checklist (MIGS/MIMS) (Field et al., 2008). Furthermore, to cover the description of phylogenetic and functional marker genes, an extended standard, the Minimum Information about a MARKer gene Sequence (MIMARKS) checklist (http://gensc.org/gc_wiki/index.php/MIMARKS), has been developed (Yilmaz et al., 2011). This family of minimum information checklists provides researchers with a condensed set of contextual data requirements, which range from description of the environment to sampling and sequencing procedures. The GSC is also driving the evolution of omics data sharing in a broader context through participation in the BioSharing (http://biosharing.org) portal. This forum aims to enable a broader dialog among funders, journals, standards and technology developers, and researchers on the critical issue of data sharing within the metagenomics community and beyond (Field et al., 2009). It provides an example of what an infrastructure to support standards-compliant reporting of contextual data might look like, as well as encouraging and enabling curation at community level (Rocca-Serra et al., 2010; http://isatab.sourceforge.net).
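To make the idea of a minimum information checklist concrete, the following is a minimal sketch of how a sequence record might be checked for basic contextual data before submission. It is an illustration only: the field names (`geographic_location`, `collection_date`, `habitat`), the helper functions and the two example records are hypothetical, chosen to mirror the basic contextual data mentioned above, and are not the official MIGS/MIMS/MIMARKS items or any GSC or INSDC tool.

```python
# Illustrative sketch only: these field names are hypothetical placeholders,
# not the official MIGS/MIMS/MIMARKS checklist items.
REQUIRED_FIELDS = ("geographic_location", "collection_date", "habitat")


def missing_fields(record):
    """Return the required contextual fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


def is_compliant(record):
    """A record passes only if every required field carries a value."""
    return not missing_fields(record)


# Made-up records for illustration; neither describes a real sample.
labelled = {
    "sequence_id": "example_001",
    "geographic_location": "54.18 N, 7.90 E",
    "collection_date": "2010-06-15",
    "habitat": "coastal sea water",
}
unlabelled = {"sequence_id": "example_002", "habitat": ""}

for record in (labelled, unlabelled):
    print(record["sequence_id"], is_compliant(record), missing_fields(record))
```

A real checklist would also constrain syntax and terminology (for instance, date formats and controlled vocabulary terms), which this sketch deliberately leaves out.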

The primary sequence databases’ adoption of these standards is integral to their success. The INSDC partners have recognized this and support the submission of compliant data sets by adopting an official keyword for the family of minimum standards, reserved for compliant INSDC sequence records. Additionally, specialized genomics and metagenomics resources are developing a number of tools and formats to aid data exchange (Kottmann et al., 2008) and compliance with these standards during sequence submission.

The application of high-throughput sequencing technologies has transformed the way microbial ecologists approach questions in their field (Gilbert et al., 2010b). The shift of sequencing capacity to individual labs is creating a data bonanza. With appropriate contextual information, these data sets could herald a new era of discovery for microbial ecology. This will only be possible if each study, from each environment and from each lab, maintains at the very least a minimum contextual data standard to facilitate cross-comparison and meta-analysis of global microbial communities. Inadequate implementation of these standards threatens progress in our field of research, as we will lose the best opportunity to produce a complete mechanistic understanding of microbial life. Every investigator will benefit immensely by being able to obtain a rapid, comprehensive answer to the question ‘Have my microbes been seen before, and, if so, where, with whom, and what were they doing?’ Only by accepting the relatively small responsibility of entering their own contextual data into a global system will they realize this dream. Just as standardized deposition of sequence data contributed an immensely valuable resource, standardization of contextual data will allow us to reap vast dividends for decades to come and enable us to finally escape the burden of ‘my sequence matches 1500 uncultured environmental isolates: now what?’

To provide a better understanding of the requirements, we have included three examples of MIGS-, MIMS- and MIMARKS-compliant data sets in Supplementary Table 1. Supplementary File 2 provides links to detailed submission and compliance guidelines.

With this open letter to the ISME community, we hope not only to advertise the existence of the GSC and invite more microbial ecologists investigating marker genes and doing ‘omics’ work to join us, but also to call for compliance with current and future GSC standards. To learn how to describe your data according to MIGS/MIMS/MIMARKS (MIxS) standards, please visit the GSC website for details and options for submitting compliant data sets into public domain databases (http://gensc.org/gc_wiki/index.php/MIGS/MIMS/MIMARKS).