The term 'computational genomics' is known to elicit a feeling of unease, apparent in a jest made by Edward Uberbacher at the close of a recent workshop* on the subject: "there's a big data wave coming, and anyone with any sanity should get out of its way." With more than 739 million base pairs of human sequence in public databases, the completion of the genome sequence of Drosophila melanogaster, and billions of microarray data points, the wave seems to be breaking already. Clearly, there are substantive issues that have yet to be resolved. Knowing what these are (even if it is not yet possible to assess their magnitude) and the ways in which others are approaching the 'gap' between sequence and function helps one to navigate, or ponder how to go about navigating, sequence data.

Included in the workshop was a session on genome annotation, a topic central to the foci of many geneticists, for the dual goal of genome annotation is to identify gene sequences and assign probable function to sequence. The identification of start codons, stop codons and splice sites is fundamental to gene-finding, but not straightforward, judging from presentations made by a number of investigators. Comments suggest that the level of artefacts masquerading as 'coding' sequence may be high, although its estimation is difficult. It is clear that gene-finding programs vary. A recent 'competition', in which 12 research teams applied their own gene-finding programs to a 3-Mb sequence of the Drosophila genome1, resulted in the identification of between 238 and 248 genes within the region; 218 were originally identified by a group led by Gerald Rubin2 (see http://www.bdgp.org/). Corroboration that a supposed gene is, indeed, a gene (in the absence of experimental data) seems best obtained by a corresponding complementary DNA (cDNA) transcript or expressed sequence tags (ESTs).

The task of gene-finding exemplifies the task of the computational biologist: to distinguish true signal (be it gene sequence, expression or effect on phenotype) from noise. The only way to definitively establish whether a gene sequence is bona fide is to establish its function through traditional 'wet' experiment, and so far, there have been no large-scale studies that attempt to determine the proportion of in silico genes that are 'real'. A promising start, however, is represented by a pair of studies by Rubin and colleagues3. In these, nearly 3 Mb of Drosophila sequence2 was annotated and a collection of transposon-induced mutations presented4; the mutations affect approximately one-quarter of Drosophila genes that lead to an obviously aberrant phenotype. They found that genes that generate such phenotypes when mutated are more likely to be represented in cDNA libraries, and to have identifiable homologues in other species, than those that do not.

In the absence of 'wet' data that underscore the identity of a gene, clues can be gleaned from other sources. John Quackenbush (of The Institute for Genomic Research) described a method whereby 'tentative consensus' (TC) sequences are constructed from EST sequences available from dbEST and coding sequences annotated in GenBank records. The process is an iterative one in which new sequences are used to update existing TCs so as to continually optimize the integrity of hypothetical gene sequences. The TCs are then annotated according to gene content and searched against a protein database. This approach implicitly acknowledges the dynamic nature of genomic information resulting from the continuous acquisition of new sequence, analyses and experimental data.
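
The iterative character of such a process can be illustrated with a toy sketch. It is not the method Quackenbush described: the function names, the exact-match overlap rule and the `min_len` threshold are assumptions made for illustration, and a real assembly would have to tolerate sequencing errors and chimaeric ESTs.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def add_sequence(tcs, seq, min_len=4):
    """Fold one new sequence into a list of tentative consensus strings.

    The sequence extends the first TC it overlaps; if it overlaps none,
    it starts a new TC. (Hypothetical rule, exact matches only.)
    """
    for i, tc in enumerate(tcs):
        if seq in tc:                      # already covered by this TC
            return tcs
        if tc in seq:                      # new sequence supersedes the TC
            tcs[i] = seq
            return tcs
        n = overlap(tc, seq, min_len)      # seq extends the TC to the right
        if n:
            tcs[i] = tc + seq[n:]
            return tcs
        n = overlap(seq, tc, min_len)      # seq extends the TC to the left
        if n:
            tcs[i] = seq + tc[n:]
            return tcs
    tcs.append(seq)                        # no overlap: start a new TC
    return tcs


# Made-up fragments, added one at a time as they might arrive.
tcs = []
for est in ["ATGGCGTAC", "GTACCTGA", "TTTTGGGG", "CCCATGG"]:
    add_sequence(tcs, est)
```

Each arriving fragment either lengthens an existing TC or founds a new one, so the set of hypothetical genes is revised whenever new sequence appears. The sketch does not handle a fragment that bridges two existing TCs, which a working system would need to merge.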

The elastic nature of genome data has implications for annotation. What might be assigned an orthologue on one day could be designated a homologue the next. Similarly, what appears to be a phosphorylase according to sequence similarity may, in fact, oppose phosphorylation. Or, as Peer Bork (of the European Molecular Biology Laboratory (EMBL)) pointed out, orthologues may catalyse the same reaction, but in different pathways. It may be discovered that what holds true in the test tube may be a rare event in vivo. Or that a protein has several functions, rather than one. While it is gratifying to have anonymous sequence blessed with sudden function, courtesy of a BLAST search or one of its relations, many are uneasy with the surety with which function is assigned, and rightly so. The theoretical 'transitive annotation catastrophe' fuels this concern and can be described as follows. Suppose a gene in Haemophilus influenzae is annotated according to its homology with a gene in Escherichia coli. In turn, a gene in Bacillus subtilis may be annotated based on its homology with the annotated gene in H. influenzae. In essence, it would inherit the annotation of the original E. coli gene, whether or not they actually share any significant homology and without assuring that they share the functional domain that was used to determine the original assignment. Errors in annotation may also confound. One can take precautions to minimize the risk of erroneous conclusion5, but these are not foolproof, nor are they universally appreciated.
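
The chain of inheritance in this scenario can be made explicit in a short sketch. It assumes a hypothetical record that remembers which annotation it was copied from, and a made-up rule (`max_depth`) that refuses transfers more than a set number of steps from the original evidence; neither is drawn from any existing annotation pipeline, and the function label is invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    gene: str
    function: str
    source: Optional["Annotation"] = None  # annotation this one was copied from

    def chain(self) -> List[str]:
        """Genes from this annotation back to the original evidence."""
        node, path = self, []
        while node is not None:
            path.append(node.gene)
            node = node.source
        return path

    def depth(self) -> int:
        """Number of homology transfers separating this from the original."""
        return len(self.chain()) - 1


def transfer(source: Annotation, target_gene: str,
             max_depth: int = 1) -> Optional[Annotation]:
    """Copy a function label to a homologue, unless the resulting chain
    would be longer than `max_depth` transfers (hypothetical safeguard)."""
    if source.depth() + 1 > max_depth:
        return None
    return Annotation(target_gene, source.function, source)


# The three-species scenario, with an invented function label.
ecoli = Annotation("E. coli gene", "example function")
hinf = transfer(ecoli, "H. influenzae gene")   # one step from the evidence
bsub = transfer(hinf, "B. subtilis gene")      # two steps: refused by default
```

With the default limit, the B. subtilis gene is left unannotated rather than silently inheriting the E. coli label. A depth limit is only one conceivable precaution; it records how far an assignment has travelled but says nothing about whether the functional domain is shared.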

Genomic annotation should adhere to a system that accommodates the dynamic nature of knowledge. The static aspect of GenBank is not to be emulated (that is, it should be possible to rectify errors with ease). Users should be able to fathom how the status and assigned function of a gene have been determined. For instance, is its protein product characterized, or is its presence revealed by an EST, or a cluster of tentatively linked sequences? Is its function assigned through homology, biochemical assay, structural determination or a combination thereof? (See page 151 for a Progress article on structural genomics.) Links to the relevant literature would also be helpful. As Sarah Wheelan and Mark Boguski have noted6, relevant literature could be linked to specific nucleotide coordinates on a reference sequence and vice versa, with the support of editors, publishers and the prospective e-print server PubMed Central.
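
What such a record might look like can be sketched briefly. The class and field names below are assumptions made for illustration and do not correspond to any existing database schema; the point is only that each assignment carries its evidence and references, and that a correction is appended rather than overwriting what came before.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import FrozenSet, List, Optional, Tuple


class Evidence(Enum):
    HOMOLOGY = "sequence homology"
    BIOCHEMICAL_ASSAY = "biochemical assay"
    STRUCTURE = "structural determination"
    EST = "expressed sequence tag"


@dataclass(frozen=True)
class Assignment:
    function: str
    evidence: FrozenSet[Evidence]        # one kind of evidence or a combination
    references: Tuple[str, ...] = ()     # links to the relevant literature


@dataclass
class GeneRecord:
    gene_id: str
    history: List[Assignment] = field(default_factory=list)

    def assign(self, function, evidence, references=()):
        """Record a new assignment; earlier ones remain in the history."""
        self.history.append(
            Assignment(function, frozenset(evidence), tuple(references)))

    @property
    def current(self) -> Optional[Assignment]:
        """Most recent assignment, or None if the gene is unannotated."""
        return self.history[-1] if self.history else None


# Invented identifiers and labels, for illustration only.
record = GeneRecord("example-gene-1")
record.assign("putative function A", {Evidence.HOMOLOGY})
record.assign("function B", {Evidence.HOMOLOGY, Evidence.BIOCHEMICAL_ASSAY},
              references=["placeholder citation"])
```

A user consulting `record.current` sees both the assigned function and how it was determined, while `record.history` shows that the earlier assignment rested on homology alone.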

The fragmentary nature of the 'annotation' community (as evidenced by presentations at the workshop) ensures a diversity in approach. An awareness of the need for integration was demonstrated by Lincoln Stein (of the Cold Spring Harbor Laboratory), who presented software that fetches annotated data from different sites and integrates it into a single, unified view. Just as this type of program may aid the interpretation of in silico analyses, so will efforts to ensure that sequence is consistently and appropriately annotated. To this end, the formation of a central, prescriptive body, composed in part of those who administer the EMBL and GenBank databases, seems a good idea.