Recently, a special Oceanic Metagenomics Collection of articles from the J. Craig Venter Institute was published in PLoS Biology. At first glance, the publication represents a very large (and very welcome) addition of data to the nascent field of marine microbial metagenomics. These data, consisting of more than 7.7 million sequencing reads (>6 billion base pairs of sequence), reveal more new genes, more new proteins, more diversity and a more complex ocean than might have been thought; yet they do not begin to touch the real complexity of the ocean ecosystem(s). The data are gathered from 41 sites, primarily marine, covering a transect that includes a sample about every 330 km for more than 8000 km, from the North Atlantic, southwards along the eastern edge of North America, through the Panama Canal, and onward towards the South Pacific. In addition, there is extensive coverage near and around the Galapagos Islands. Included in the dataset are previously studied samples from the Sargasso Sea (Venter et al., 2004).

A deeper look, however, reveals that these impressive numbers are the tip of an intellectual iceberg of fascinating inconsistencies with regard to marine microbial diversity. Indeed, it may well be that what is not in the dataset offers opportunities for future studies that transcend those lying in the dataset itself. To understand what is not there, one needs to keep in mind where and how the samples were collected: these are all near-surface (within a few meters) samples that were filtered multiple times to yield a size fraction in the 0.2–0.8 μm range. Thus, the sample can be aptly characterized as the near-surface marine planktonic niche, consisting mostly of unattached, single cells. Other organisms should have been removed on the larger 0.8 μm filters, which remain as a resource for further study.

As for what is contained in the dataset, there is something for almost everyone. Rusch et al. (2007) lead off with a synopsis of the gene data – new genes galore, new phylotypes galore and the conclusion that in this niche there is still to be found an impressive array of diversity at both the taxonomic and biochemical levels. This being said, however, the dominant species are remarkably few in number. If one simply removes all ‘abundant’ species that occur at only one site, as well as those that are found only in the non-marine (hypersaline, mangrove and freshwater) sites, the number of dominant groups that characterize this marine planktonic niche decreases to about 10–20 (depending on whether you are a splitter or a grouper). Of these, only three (Synechococcus, Prochlorococcus and Pelagibacter ubique, a SAR-11 type) have been cultivated and have genomic sequences available. This is quite remarkable: perhaps the paradox of the plankton is not a paradox at all, but is hidden in the way that microbiologists define diversity and in our understanding of what is being competed for in the so-called uniform ocean.
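The filtering rule described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the site names, habitat labels and group-to-site table below are hypothetical and are not taken from Rusch et al. (2007), and the function name is invented for this sketch.

```python
# Hypothetical habitat label for each sampling site (not real GOS metadata).
SITE_HABITAT = {
    "site_A": "marine",
    "site_B": "marine",
    "site_C": "marine",
    "site_D": "hypersaline",
    "site_E": "freshwater",
}

# Hypothetical table: abundant group -> set of sites where it is abundant.
ABUNDANT_AT = {
    "group_1": {"site_A", "site_B", "site_C"},
    "group_2": {"site_A", "site_D"},
    "group_3": {"site_B"},              # abundant at a single site only
    "group_4": {"site_D", "site_E"},    # abundant at non-marine sites only
}


def widespread_marine_groups(abundant_at, site_habitat):
    """Keep groups abundant at more than one site, at least one of them marine."""
    kept = []
    for group, sites in abundant_at.items():
        if len(sites) <= 1:
            continue  # drop groups that are abundant at only one site
        if not any(site_habitat[s] == "marine" for s in sites):
            continue  # drop groups found only in non-marine sites
        kept.append(group)
    return sorted(kept)


if __name__ == "__main__":
    print(widespread_marine_groups(ABUNDANT_AT, SITE_HABITAT))
```

With real abundance tables, the count returned by such a filter would still depend on how groups are split or lumped, which is the point of the 10–20 range quoted above.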

However, among these abundant species can be found an impressive array of diversity – so impressive that in no case was it possible to assemble a genome from any of them. Thus, while taxonomic/phylogenetic diversity was quite limited, the diversity at the gene level was remarkably high, an observation that fits with several previous studies of localized sites but now appears to be a general feature of the marine planktonic environment. Given these challenges, some new approaches were adopted to try to understand this immense diversity. For example, 584 sequenced genomes in finished or draft form were used for ‘fragment recruitment’ of the entire database. Remarkably, only 30% of the database revealed recruitment to any of the 584 genomes: 15% recruited to three genomes of the ‘marine planktonic niche’ (Pelagibacter, Prochlorococcus and Synechococcus), while 15% recruited to two genomes that appeared at only one site in the global ocean survey (GOS) (Shewanella and Burkholderia). In terms of understanding the nature of diversity in the marine planktonic niche, such information tells us that the sequencing of the other dominant species should be a high-priority item – one that will allow retrospective fragment recruitment studies that will begin to unravel this conundrum.
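The bookkeeping behind such recruitment percentages can be illustrated with a toy tally in Python. The sketch assumes that each read has already been assigned a best-matching reference genome (or None when it recruits to nothing); the read assignments are invented for illustration, the function name is hypothetical, and the alignment step itself is not shown.

```python
from collections import Counter


def recruitment_summary(best_hits):
    """Return (fraction of reads recruited, per-genome fractions).

    best_hits holds one entry per read: the name of the reference genome
    the read recruited to, or None if it recruited to no genome.
    """
    total = len(best_hits)
    if total == 0:
        return 0.0, {}
    counts = Counter(hit for hit in best_hits if hit is not None)
    recruited = sum(counts.values())
    per_genome = {genome: n / total for genome, n in counts.items()}
    return recruited / total, per_genome


if __name__ == "__main__":
    # Invented assignments for 20 reads; 6 of them recruit to some genome.
    hits = (["Pelagibacter"] * 2 + ["Prochlorococcus"] * 1
            + ["Shewanella"] * 3 + [None] * 14)
    overall, by_genome = recruitment_summary(hits)
    print(f"recruited: {overall:.0%}")
    for genome, frac in sorted(by_genome.items()):
        print(f"  {genome}: {frac:.0%}")
```

Adding newly sequenced genomes to the reference set and re-running such a tally is, in miniature, what a retrospective recruitment study would do.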

Yooseph et al. (2007) then present a paper dealing with the study of protein families gleaned from analysis of the dataset. In this study, intensive analysis of protein sequences led to the conclusion that at this level there is immense diversity and variation; 1700 new protein families were found with no apparent homology to existing protein groups. The study not only identifies new proteins, but also adds a much-needed input of data with regard to diversity of known protein families. What can be done with a dataset like this is then illustrated in the paper by Kannan et al. (2007), in which diversity of protein kinases was studied, resulting in a tripling of information with regard to ELK (eukaryotic protein kinase-like) proteins. The intriguing observation that prokaryotic ELKs are now more numerous than the prokaryotic histidine kinases, which have been considered to be the major regulatory elements for prokaryotic metabolism, raises the question of whether there are completely new regulatory pathways waiting to be discovered in this very interesting realm.

Finally, an overview article by Eisen (2007) discusses the ups and downs of the various approaches to studying microbial communities – a nice article to read before diving into the three articles discussed above. In addition, Seshadri et al. (2007) present an introduction and description of CAMERA (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis), a community database system for the deposition and analysis of data related to marine microbial ecology.

So, this would seem to be the end – a fantastic journey into the world of bioinformatics, an immense amount of data being made available to the community for detailed work on their systems of interest, and a workable interactive data system with which to do it. However, as mentioned above, the Rusch et al. (2007) paper suggests that there is much more, and that this may be lurking in some of the things that are not seen here. What do we mean by this?

The GOS survey, as noted above, focuses on the planktonic niche, and as such, misses certain parts of the marine microbial ecosystem, notably the larger single cells (small eukaryotes and large prokaryotes), multicellular forms, attached cells and symbionts, to name a few. Yet plating of seawater often yields low but consistent numbers of such microbes that for many years were known as the dominant oceanic species – genera like Vibrio, Shewanella (a.k.a. Alteromonas) and Pseudomonas, to name a few. In many cases, these bacteria have well-defined niches – disease causation, gut symbionts of marine fish, light organ symbionts of fish and squids, food spoilage, and so on, and there is no doubt that they play a role in marine ecosystems, almost certainly as attached forms (Visick and Ruby, 2006). In fact, on the basis of many studies of such genera (all of which were used for recruitment studies, and all of which proved negative with regard to recruitment) a model similar to that shown in Figure 1 can be proposed, in which the planktonic populations are simply a reflection of the various high-density niches for the attached forms. Such a picture stresses the importance of examining a variety of size classes, looking, perhaps, for those organisms that might account for those occasionally abundant microbes that are clearly not part of the planktonic niche.

Figure 1

A simplistic diagram of the distribution of some marine Vibrios. This shows the simple case for the coast of California, where the major niches for these organisms involve heterotrophic passage through the stomachs of marine fish, where they act as catalysts for chitin degradation (Visick and Ruby, 2006). Moving to other locations could add more high-concentration environments such as the light organs of fishes and/or squids, where cell populations reach 10^10 ml^-1 or higher and constantly inoculate the waters.

A careful look at the genes needed for each niche might be very revealing in terms of distinguishing what defines planktonic versus attached lifestyles. With regard to the example chosen here, the luminous Vibrios were surely present in the samples analyzed, but were cryptic to the methods used, and in sufficiently low abundance that not a single luxABCD or E gene sequence was seen in the database. It is an interesting exercise for each of us to take our own idea of where our organism fits into the marine ecosystem and ask where one might look for evidence in the data or samples of the GOS expedition.

In closing this brief overview of the GOS volume, one does not want to detract from what has been (and will be) learned from this magnificent dataset. Each of us should sit down with the data and add our private interests and expertise to its analyses, thus using this as a landmark system for marine microbiological systems studies. This being said, one must keep in mind that this is one niche of perhaps hundreds in the ocean, and similar studies will be needed for each of them. These niches can be defined by size fractionation, by physicochemical properties of the environment, by depth of samples and perhaps by many other parameters. The important thing now is to seize the moment and move forward gathering more data, depositing it in the central CAMERA databank, and working to describe the marine ecosystem as the complex system it is.