Craig Venter is still sailing his Sorcerer II sloop around the world, but now he has got 6.3 billion base pairs of microbial DNA to show for it. The pioneer of large-scale genome sequencing has published the first data set from his global quest to sequence microbes from all the oceans.

Sea of change: the Sorcerer II is obtaining DNA samples from oceans around the world. Credit: VENTER INST.

In a series of papers published this week in PLoS Biology, an international team of researchers presents the first results from Venter's Global Ocean Sampling expedition. It follows an earlier, smaller study that uncovered rich microbial diversity in the Sargasso Sea (J. C. Venter et al. Science 304, 66–74; 2004). The analyses of the first set of 7.7 million genetic sequences from the expedition reveal that the upper limit on ocean diversity has yet to be set.

The papers point to 1,700 new protein families. Surprisingly, the rate of discovery stayed more or less the same as the number of new sequences grew, suggesting that the number of new protein families will continue to increase. The question, said Venter in a telephone interview from Sorcerer II as it bobbed its way through the Sea of Cortez, is how long will that increase continue? And how does one extract meaning from a pile of 6.3 billion As, Ts, Cs and Gs?

The Global Ocean Sampling results reported this week include sequences gathered from 41 locations in 2003 and 2004, from the northwest Atlantic to the eastern tropical Pacific. “The next phase is to try with a much larger data set from the entire circumnavigation,” said Venter. “We want to find out if it starts to show any degree of saturation, or is the number of independent protein and gene families so vast that we are still at the earliest stage.”
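The saturation question Venter raises can be pictured as a discovery curve: count how many distinct protein families have been seen as more sequences are added, and check whether each new batch still contributes as many new families as the last. The Python sketch below is only an illustration of that idea. The function names and the toy family labels are assumptions made for this example; they are not the expedition's data or the team's actual method.

```python
def discovery_curve(family_labels):
    """Return the running count of distinct families after each sequence."""
    seen = set()
    curve = []
    for label in family_labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


def new_families_per_batch(family_labels, batch_size):
    """Count how many previously unseen families each full batch contributes."""
    curve = discovery_curve(family_labels)
    counts = []
    previous = 0
    for end in range(batch_size, len(curve) + 1, batch_size):
        counts.append(curve[end - 1] - previous)
        previous = curve[end - 1]
    return counts


# Toy labels (hypothetical): one family label per sequenced fragment.
steady = ["a", "b", "a", "c", "d", "d", "e", "f"]
saturating = ["a", "b", "c", "d", "a", "b", "c", "d"]

print(new_families_per_batch(steady, 4))      # each batch keeps adding families
print(new_families_per_batch(saturating, 4))  # second batch adds nothing new
```

A roughly constant count per batch corresponds to the steady discovery rate described above; counts that fall towards zero would indicate saturation.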

The team has faced some challenges. It has to predict protein sequences on the basis of the DNA sequences it retrieves from its samples. This can be tricky, because not all DNA codes for protein, and some bits of DNA can be read in different ways to produce different protein sequences. It will be important to go back and confirm that the predicted proteins are actually made, says Monica Orellana, a systems biologist at the Institute for Systems Biology in Seattle, Washington. And Brian Palenik, a microbiologist at the Scripps Institution of Oceanography in La Jolla, California, cautions that many sequences that seem to belong to new protein families may in fact be members of known families that contain highly divergent sequences. Still, although the hard numbers might be toned down in the future, Orellana and Palenik don't question that the Global Ocean Sampling database contains a wealth of new proteins.
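To see why the same stretch of DNA can yield different protein sequences, consider that it can be read in three frames on each of its two strands. The Python sketch below translates a short sequence in all six frames using the standard genetic code. It is a minimal illustration, not the gene-prediction pipeline used by the team; the helper names and the example sequence are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' is a stop, 'X' an unrecognised codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate a sequence in three frames on each strand."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical nine-base example: every frame gives a different reading.
for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Only some of these readings, if any, correspond to a protein the cell actually makes, which is why predicted proteins need experimental confirmation.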

Already, researchers are tracing the course of protein evolution using the database. Others will search it for new enzymes that might have technological applications, or use it to learn more about microbial ecology. For instance, Jonathan Eisen, a study co-author and microbiologist at the University of California, Davis, will work to match gene fragments in the database with their host organism, a substantial technical challenge when dealing with unfamiliar microbes.

Such studies need to be followed by experiments to establish the function that the gene sequence actually has in the organism, warns Eugene Madsen, a microbiologist at Cornell University in Ithaca, New York. “You can over-interpret DNA sequence data,” he says, “but if you're careful, you just use them as a clue and then they either lead to solving the mystery or they don't.”

Dennis Hansell, a microbiologist at the University of Miami in Florida, adds that although his research won't entail direct analysis of the Global Ocean Sampling sequence data, the newfound wealth of microbial diversity has made him re-evaluate his picture of how microbes leave their chemical fingerprints in the ocean. He likens the collection of genomic sequences to characterizing all of the pigments an artist could use to paint a portrait. “It's blending the pigments and applying them that results in a Mona Lisa,” Hansell says. “I'm looking at the Mona Lisa in my data, and this shows me what the pigments are.”