Burgeoning microbial gene data require coherent efforts to make them readily usable.
Microbes don't subscribe to the single life. They live in complex ecosystems of diverse, mutually dependent species. This complexity and the vast numbers of microbes in the ocean, in the soil, in our gut and almost everywhere else pose a challenge to those seeking to understand microbial ecology.
In the 1980s, surveying the microbial world by sequencing the collective ribosomal RNA opened up new avenues. For the first time it was possible to get a glimpse of the make-up of complex microbial systems. It's a reasonable assumption that the more similar these sequences are, the more closely related the microbes are, and the more closely related their lifestyles must be — hence the pursuit of insights into what microbes might be doing in their environments.
But this assumption turned out to be fragile, as it emerged that microbes frequently shuffle their genes around, both within and between species. Similarity in one gene does not necessarily predict the presence or absence of other genes in the genome.
Fortunately, the continuous decrease in sequencing costs allows today's microbiologists to sequence not only a single gene from each of the most abundant species in a microbial ecosystem, but also, at least in theory, all the genes present. These composite genomes, or 'metagenomes', provide a wealth of information that could only have been dreamt of even a couple of years ago. With sequencing facilities continuing to increase their capacities by applying new technologies, and funding agencies supplying the necessary resources, sequencing the ocean or the contents of the human gut has become relatively easy. But how can one extract meaningful information from a metagenome, and gain insight into both the individual species' impact on the microbial community and the impact of this community on the ecosystem?
We can hope to unravel the function of every gene when individual species can be cultivated and genetically manipulated in the laboratory, but this is impossible when dealing with a complex community containing hundreds or thousands of species. Functional assignment of genes needs to be performed, even when the only information available is a string of nucleotide bases.
There are numerous databases and websites, public and not-so-public, some adhering to an easily understandable framework of standards and regulation, and some not so transparent. Five years ago it was a big disappointment to compare one's chosen sequence with the GenBank database and not find a 'hit'. Today there is a feeling of sheer inadequacy in the face of vast quantities of sequence and annotation information — and an acute need for a degree in bioinformatics.
Publication in most cases (including the Nature journals) requires the deposition of sequence data into the GenBank or EMBL databases. Much less effort is spent depositing unpublished data or updating information that is already published. In all probability, in the not too distant future, metagenomic studies will be done not only by the big sequencing centres, but by anybody with a reasonable research budget and university support. To make all the data more easily accessible, it would be desirable to have a collaborative effort of genome centres and funding agencies to build a universal microbial-sequence database, with a readily comprehensible framework for sequencing and annotation standards and regulations.
Microbiology has come a long way from investigating the easily cultured individual microbe from a rich microbial community and describing what is out there, and is now starting to get a grip on what microbes actually do. With the intrinsic difficulties of dealing with complex systems, it is good to see a field galvanized by new technologies and scientific daring. But more infrastructural order is required, to prevent the discipline from getting ahead of itself.