Now that the race to obtain a draft sequence of the human genome has been declared an honourable draw, attention will switch to the task of finishing the sequence and ‘annotating’ the entire genome, that is, characterizing all its genes and working out their functions. The annotation task is so formidable that it may need the largest Internet ‘collaboratory’ yet attempted.

Carry on sequencing: genome scientists, here preparing DNA for analysis, will get no respite. Credit: BOB BOSTON/WASHINGTON UNI., ST LOUIS

Given that Celera has now stopped sequencing, the task of finishing the genome — in which, to ensure accuracy, each base has been sequenced 10 times over (10X coverage) — will fall to the public Human Genome Project (HGP). In that regard, says Tim Hubbard of the Sanger Centre at Hinxton, near Cambridge, the HGP got a pleasant surprise last weekend, when its data were subjected to a “brute force” computer analysis. Hubbard had expected to find that the HGP had sequenced the genome to an average depth of 5X, but instead, a figure of 7X emerged. This, and the fact that the draft seems to contain fewer gaps than expected, bodes well for finishing the genome ahead of the stated 2003 deadline, says Hubbard.
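To see why a jump from 5X to 7X coverage matters for gaps, a rough back-of-the-envelope sketch may help. It assumes an idealized model in which reads land uniformly at random, so that the fraction of bases hit by no read is approximately e^(-depth). The HGP's clone-by-clone strategy does not follow this model exactly, and the 3-billion-base genome size and the function names below are illustrative assumptions rather than figures from Hubbard's analysis.

```python
import math


def coverage_depth(total_bases_sequenced: float, genome_size: float) -> float:
    """Average depth: total sequenced bases divided by genome length."""
    return total_bases_sequenced / genome_size


def expected_uncovered_fraction(depth: float) -> float:
    """Idealized Poisson estimate of the fraction of bases covered by no read."""
    return math.exp(-depth)


# Assumed round figure for the human genome size, in bases (illustrative only).
GENOME_SIZE = 3.0e9

if __name__ == "__main__":
    for depth in (5, 7, 10):
        missing = expected_uncovered_fraction(depth)
        print(
            f"{depth}X: about {missing:.4%} of bases uncovered, "
            f"roughly {missing * GENOME_SIZE:,.0f} bases"
        )
```

Under these assumptions, each extra unit of depth cuts the expected uncovered fraction by a factor of e, which is consistent with a 7X draft containing fewer gaps than a 5X draft would.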

But annotation poses a much bigger challenge. The first step is to identify all of the protein-coding regions, which will give a good idea of how many genes there are. Most geneticists think the figure lies somewhere between 35,000 and 150,000. Beyond that will come detailed studies of the structure of individual genes, including their regulatory elements, and attempts to assign functions to them.

David Lipman, director of the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, believes that the draft sequence will allow researchers to use computational tools to pinpoint the position of many of the gene fragments catalogued in cDNA libraries of expressed genes. In many cases, it will then be possible to extract an entire gene from the draft sequence — and by comparison with other genes, begin to establish its function. But many biologists are unconvinced. “The current perception is that annotating finished sequence is much less difficult than annotating ‘sequence in progress’,” says Richard Gibbs of Baylor College of Medicine in Houston. “And no matter how you cut it, the draft is sequence in progress.”

Even with the finished sequence in hand, experience with the two human chromosomes for which this has been achieved — numbers 21 and 22 — indicates that annotating the genome will be a mammoth task. “With 21 and 22 it was not possible to reliably identify and delineate all of the genes,” says Philip Green, a biocomputing expert at the University of Washington in Seattle.

In the case of the genome of the fruitfly Drosophila, annotation was kick-started by a two-week ‘jamboree’ held at Celera. This brought together more than 40 academic fly geneticists and 50 Celera scientists to compare the outcomes of dozens of different annotation techniques. This experience should serve Celera well. “We basically trained their annotation team to annotate the human genome,” observes Martin Reese, formerly of the Drosophila Genome Center at the Lawrence Berkeley National Laboratory in California and now with ValiGen, a company near Paris.

The news that Celera and HGP researchers will hold a joint scientific meeting after simultaneously publishing papers on their draft sequences (see lead story) initially raised hopes of a similar human jamboree. However, as HGP head Francis Collins pointed out to Nature, Celera cannot really share its annotation, because that annotation will be the core product it sells to its subscribers. Rather, the meeting is expected to look at discrepancies between the public and private sequences, with the goal of ‘cleaning up’ one another's data.

Celera has said little publicly about its annotation capacity, but it uses specialized software to combine the output of multiple gene-finding tools, mostly those available to the public sector. And although Celera's annotation team is at the cutting edge, many experts argue that no single team is currently in a position to annotate the entire genome. “No one really knows how to do it completely,” says John Quackenbush of The Institute for Genomic Research in Rockville, Maryland.
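Celera has not described how its software merges predictions, so the following is only a toy illustration of one generic way to combine gene finders: a simple vote, in which a base is kept as 'coding' when enough tools agree on it. The tool names, the interval format and the function `consensus_regions` are all hypothetical, and nothing here should be read as Celera's actual method.

```python
from collections import Counter


def consensus_regions(predictions, min_votes):
    """Merge predictions from several gene finders by per-base voting.

    predictions: dict mapping a (hypothetical) tool name to a list of
                 (start, end) half-open intervals it predicts as coding.
    min_votes:   number of tools that must agree on a base to keep it.
    Returns a sorted list of merged (start, end) half-open intervals.
    """
    votes = Counter()
    for intervals in predictions.values():
        covered = set()  # each tool votes at most once per base
        for start, end in intervals:
            covered.update(range(start, end))
        votes.update(covered)

    supported = sorted(pos for pos, n in votes.items() if n >= min_votes)

    merged = []
    for pos in supported:
        if merged and pos == merged[-1][1]:
            merged[-1][1] = pos + 1  # extend the current run of agreed bases
        else:
            merged.append([pos, pos + 1])  # start a new run
    return [tuple(interval) for interval in merged]


if __name__ == "__main__":
    # Made-up coordinates for three hypothetical tools.
    example = {
        "tool_a": [(0, 10)],
        "tool_b": [(5, 15)],
        "tool_c": [(8, 20)],
    }
    print(consensus_regions(example, min_votes=2))
```

Voting base by base is far too slow for a whole genome and ignores exon boundaries and reading frames; it is shown only to make the idea of pooling several imperfect predictors concrete.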

On the public side, annotating the genome might mean a rethink of how the HGP's data are organized. Lipman acknowledges that the main sequence database, NCBI's GenBank, has its limitations. “It does not represent what we know of biology at any given time,” he says. “It only represents what the author put in.” Indeed, while scientists deposit data in GenBank because many journals make this a condition of publication, some do not bother to correct or update their entries afterwards.

“With annotation we will need much more active curation,” says Lipman. Many experts believe this may require a ‘collaboratory’ approach, using the Internet to leverage the talent of biologists worldwide. The NCBI intends to set up a system in which named biologists around the world will ‘adopt’ a gene or gene family, becoming the curators responsible for gathering information from the wider research community. But Lipman remains against the idea of a free-for-all in which any biologist can annotate the genome — the problem, he says, is that most do not fully understand database syntax, and so tend to make errors when they input data. “What we really want is their knowledge,” says Lipman.

The Ensembl annotation project, run by the Sanger Centre and the European Bioinformatics Institute, is plotting a genuinely distributed effort. Hubbard foresees a system where a geneticist in Germany could annotate a gene online, and have his or her interpretation challenged almost in real time by a biologist in Boston. Ensembl's vision has been inspired by a radical suggestion, made by Tom Slezak of the Lawrence Livermore National Laboratory in California and Lincoln Stein of the Cold Spring Harbor Laboratory on Long Island, to use ‘Napster’ technology for genome annotation. This allows computer users worldwide to share MP3 music files, and could, in theory, let biologists share and annotate genome data (see Nature 404, 694; 2000). If these ideas catch on, the genome project's future could be one of annotation by anarchy.