Sir

Jean-Michel Claverie¹ writes in Correspondence about the problems of annotating the whole human genome sequence, given that a draft form will be available in a few months. While we agree with many of his points, we disagree with what he says about the lack of bioinformatics capacity to provide a useful basic analysis. The Sanger laboratories, with the European Molecular Biology Laboratory's European Bioinformatics Institute, have been developing an automatic analysis system for some months; the results of the first full release of Ensembl can be seen at http://www.ensembl.org/. The system now tracks the daily output of human genomic sequence in real time. It is based on confirming ab initio predictions by homology and providing functional annotation via Pfam². So far 17,045 gene fragments have been annotated from the 1,405,539,258 bases processed.

We agree with Claverie about the limitations of any automatic analysis system, having ourselves worked on the semi-manual analysis of the human chromosome 22 sequence. However, a large subset of genes can already be predicted accurately, which will be very useful as a way into this huge volume of data. A key aspect of the system is its ability to keep track of genes despite revisions to the sequence. This will be important as the genome is completely sequenced over the next couple of years. Ensembl accession numbers assigned to genes are permanent identifiers that will refer to the same genes throughout this process.
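
As a rough illustration of what keeping track of genes across sequence revisions involves, the following Python sketch shifts gene coordinates when bases are inserted into the sequence while leaving the identifiers untouched. It is not Ensembl's implementation: the function name apply_insertion, the identifiers and the coordinates are all hypothetical, and a real revision would involve far more than a single insertion.

```python
def apply_insertion(genes, position, length):
    """Return updated gene coordinates after `length` bases are inserted
    immediately before `position` (1-based, inclusive coordinates).

    `genes` maps a permanent identifier to a (start, end) pair.
    The identifiers are never changed; only the coordinates move.
    """
    updated = {}
    for gene_id, (start, end) in genes.items():
        if start >= position:
            # Gene lies entirely downstream of the insertion: shift it.
            start, end = start + length, end + length
        elif end >= position:
            # Insertion falls inside the gene: only its end moves.
            end += length
        updated[gene_id] = (start, end)
    return updated


# Hypothetical identifiers and coordinates, for illustration only.
genes_v1 = {"GENE0001": (100, 500), "GENE0002": (900, 1400)}
genes_v2 = apply_insertion(genes_v1, position=600, length=50)
```

The point of the sketch is that a user who recorded an identifier against the first version of the sequence can still look up the same gene in the second, even though its coordinates have changed.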

How can we go beyond this baseline automatic annotation? Claverie points out the chaos that would result from duplicated annotation efforts, each with different standards and different ways of presenting the data. He is also correct in arguing that no single collaborative group will be capable of annotating the entire genome consistently and to a high standard. One way to deal with this is to have a single monolithic entity that invests 300 person-years in annotating the genome. A better one is ‘open annotation’, where the annotation required is distributed across a highly motivated community of biologists.

We believe that many of the problems with open annotation are technical ones, which can be and are being addressed. The web allows different data sources to be readily crosslinked, but different websites have different formats and interfaces. An alternative, particularly appropriate for sequence data, is for a browser to merge annotation from multiple data sources on top of a baseline coordinate system to provide the user with a single annotation view. Lincoln Stein and colleagues are developing such a system, the distributed annotation system (DAS), based on XML (see http://stein.cshl.org/das/). All that is then required for any centre to contribute annotation of all or part of the genome is to synchronize its coordinate system with that of the baseline server. Maintaining the coordinate system across a changing genome does require substantial resources, but staying synchronized with it need not. Ensembl is an open-source project and will provide both a common object framework for annotation and the synchronization tools needed for anyone to set up a server offering annotation for all to see and use.
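
The idea of merging annotation from several sources over one coordinate system can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DAS protocol or any Ensembl code: the Feature record, the merge_view function, the source names and all coordinates are hypothetical, and the sources are plain in-memory lists rather than remote servers exchanging XML.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    source: str  # hypothetical name of the annotation source
    start: int   # 1-based inclusive, in baseline coordinates
    end: int
    label: str


def merge_view(sources, region_start, region_end, selected=None):
    """Merge features from several sources into one view of a region.

    `sources` maps a source name to a list of Feature records that are
    already expressed in the shared baseline coordinate system.
    `selected`, if given, restricts the view to the named sources.
    """
    view = []
    for name, features in sources.items():
        if selected is not None and name not in selected:
            continue
        for feature in features:
            # Keep any feature that overlaps the requested region.
            if feature.end >= region_start and feature.start <= region_end:
                view.append(feature)
    return sorted(view, key=lambda f: (f.start, f.end, f.source))


# Hypothetical sources and coordinates, for illustration only.
sources = {
    "baseline": [
        Feature("baseline", 1000, 5000, "predicted gene"),
        Feature("baseline", 20000, 26000, "predicted gene"),
    ],
    "lab_a": [Feature("lab_a", 1200, 1350, "curated exon")],
    "lab_b": [Feature("lab_b", 4000, 4500, "protein domain hit")],
}
view = merge_view(sources, 1, 10000, selected={"baseline", "lab_a"})
```

Because every source reports positions in the same baseline coordinates, combining them reduces to filtering by overlap and sorting; the selected argument stands in for a user choosing which sources to display.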

The power of open-source software is well recognized³, although it could be feared that open annotation will swamp biologists with alternative, contradictory views of the sequence. We are more optimistic. Browsers will allow biologists to select only the data sources they wish to view. Just as some websites become popular, word of useful annotation will spread quickly, since selecting it will be as easy as bookmarking a new website. Software development has been democratized by open-source projects such as Linux, which have allowed everyone the opportunity to contribute. Open annotation provides the same opportunity for genomes, and so should speed our collective decoding of genetics without centralized annotation centres or commercial monopolies.