“Number is the ruler of forms and ideas, and is the cause of gods and demons.” –Pythagoras

Perhaps this goes some way to explaining a fixation with the total number of genes in the human genome: that, in having a grip on the size of our genetic complement and complexity, we are better acquainted with our genomes or, some might even say, ourselves. On a more practical level, such knowledge grants a means of gauging the degree to which high-throughput expression analyses capture the ‘global’ picture, rather than a fraction of it.

As indicated by the vim with which participants at a recent meeting* voted on the likely number of human genes (see inset), there is also the allure of competition and the inestimable pleasure of representing the balance of one's own calculations and assumptions with a number that can be compared with those of others. Three papers1,2,3 in this issue of Nature Genetics propose limits to the human gene complement: these estimates range from 30,000 to 120,000. An inspection of the methods of estimation—described and discussed on page 129 of this issue4—discloses a dichotomy that is informed by the type of analysis.

The bets are on. Genesweep 2000, the competition launched by Ewan Birney (of the European Bioinformatics Institute) at a recent meeting on genome sequencing and biology*, involves estimates of human gene number (see http://www.ensembl.org/genesweep.html).

Brent Ewing and Philip Green1 derive estimates of 34,700 and 33,630 by multiplying a representative sample of genes (for example, the number of genes identified on chromosome 22) by the number of times another sample (say, a collection of EST sequences) can be divided by the number of common matches between the two. A similarly low estimate of 30,000 genes is made by Jean Weissenbach and colleagues2 on the basis of a comparative study involving genomic sequence of the pufferfish and human draft sequence. In contrast, John Quackenbush and colleagues3 seem to have hit on a motherlode of human genes—in the order of 120,000—by ridding GenBank EST sequences of perceived artefact (for example, solitary ESTs that lack a poly(A) tail), assembling and collapsing the remaining sequences into contigs, and comparing these with curated collections of protein-encoding genes.
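The Ewing and Green calculation is, in essence, a capture-recapture (Lincoln-Petersen) estimate: multiply the sizes of two independently drawn gene samples and divide by the number of genes they share. A minimal sketch in Python, with invented sample sets for illustration (the function name and toy data are not from the paper):

```python
def capture_recapture_estimate(sample1, sample2):
    """Lincoln-Petersen estimate of the total: |S1| * |S2| / |S1 & S2|.

    sample1, sample2: sets of gene identifiers drawn independently
    (e.g. genes annotated on chromosome 22 vs. genes hit by an EST
    collection).
    """
    overlap = len(sample1 & sample2)
    if overlap == 0:
        raise ValueError("samples share no genes; estimate is undefined")
    return len(sample1) * len(sample2) / overlap

# Toy example: samples of 4 and 6 'genes' sharing 2 members
s1 = {"g1", "g2", "g3", "g4"}
s2 = {"g3", "g4", "g5", "g6", "g7", "g8"}
print(capture_recapture_estimate(s1, s2))  # 4 * 6 / 2 = 12.0
```

The estimate assumes the two samples are independent draws from the same underlying gene set; biases in either sample (for instance, ESTs favouring highly expressed genes) skew the result.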

Analyses by Green and Quackenbush rely on a comparison between annotated genes and the transcriptome—which is assumed to be an accurate representation of the genome. Gene identity is determined by a combination of ab initio gene prediction software and similarity searches against the contents of available DNA and protein databases. The capacity of the software and the content of databases have limitations whose magnitude is presently difficult to determine. A similar fuzziness clouds the degree to which one may confidently assume that ESTs represent transcribed genes, even when they are clustered into a consensus sequence. Illegitimate transcription primed from cryptic tracts, such as those present in Alu sequences, may generate ‘bogus’ consensus contigs. Moreover, alternative splice variants, unspliced or incorrectly spliced mRNA fragments and initiation of transcription from widely spaced 3′ poly(A) tracts common to the same gene may artificially inflate gene estimates based on EST representation. Current efforts to collapse the Unigene set of cDNAs indicate that these concerns are warranted; that the spectre of the artefact, fed through the over-sampling of libraries in the quest for rare transcripts, looms large. The degree to which collections of EST sequences are ‘groomed’—that is, voided of artefact—will have an obvious effect on the number of ‘genes’ that they identify.
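The sensitivity to ‘grooming’ is easy to see in a toy model: the apparent gene count is simply the number of EST clusters, so the filtering criterion moves the count directly. A hypothetical sketch (the labels and the singleton threshold are invented for illustration, not taken from any of the cited analyses):

```python
from collections import Counter

def apparent_gene_count(est_labels, min_cluster_size=1):
    """Count apparent 'genes' as EST clusters of at least min_cluster_size.

    est_labels: one cluster/contig label per EST. Singleton clusters are
    the ones most likely to be artefactual (cf. solitary ESTs lacking a
    poly(A) tail), so raising the threshold 'grooms' the collection.
    """
    cluster_sizes = Counter(est_labels)
    return sum(1 for n in cluster_sizes.values() if n >= min_cluster_size)

# Two well-supported clusters plus two singleton (possibly bogus) contigs
ests = ["g1", "g1", "g1", "g2", "g2", "alu1", "alu2"]
print(apparent_gene_count(ests))                      # 4, ungroomed
print(apparent_gene_count(ests, min_cluster_size=2))  # 2, singletons dropped
```

Here a single filtering decision halves the count, which is the scale of disagreement the editorial describes between the 30,000-range and 120,000-range estimates.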

The trio of estimates presented in this issue is in good company5,6,7: the sport of predicting the total number of human genes has enjoyed a long tradition. Many estimates are based on an extrapolation from the numbers gleaned from a small segment of the genome, taking into account estimates of gene density across the genome and its total size—these tend to converge on a gene number in the region of 65,000. Recent estimates from Incyte, DoubleTwist and Human Genome Sciences, however, are significantly higher, with assertions that the human genome contains over 100,000 genes. Without access to data that support these estimates, however, it is impossible to comment upon their computation. Releasing a subset of these data into the public domain would enable open assessment—of the kind demonstrated by the recent Drosophila Genome Annotation Assessment Project9—and potentially mute insinuations4,9 that commercial interests factor into the equation.
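The extrapolation underlying the older estimates is a one-line calculation: scale a locally observed gene density up to the full genome. A sketch with round, assumed figures (the 3,200-Mb genome size and the sample numbers are placeholders chosen to land near the 65,000 figure, not values from the cited papers):

```python
def extrapolate_gene_count(genes_in_sample, sample_size_mb, genome_size_mb=3200.0):
    """Scale the gene density of a sequenced segment up to the whole genome."""
    density = genes_in_sample / sample_size_mb  # genes per megabase
    return density * genome_size_mb

# e.g. 650 genes found in a 32-Mb segment -> ~65,000 genes genome-wide
print(round(extrapolate_gene_count(650, 32.0)))  # 65000
```

The weak point, as the divergent published numbers suggest, is the assumption that the sampled segment's gene density is typical of the genome as a whole.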

The brouhaha over human gene number should not eclipse the higher goal of understanding how gene action directs the development and biology of the organism, be it human, mouse or Amphioxus. Comparison of gene sequences of Caenorhabditis elegans and Drosophila melanogaster indicates that biological complexity is orchestrated at the level of the protein—for example, through regulation of expression levels, or splicing patterns—rather than gene number. Noting the known protein families of Drosophila, Saccharomyces cerevisiae and humans, Gerald Rubin (who heads the Berkeley Genome Project) predicts approximately 10,000 human protein families. (Drosophila has 8,000 and yeast, 4,300.) These observations underscore yet again the importance of determining gene function in model organisms, and of close collaboration between those at the bench, who design and carry out screens for mutant phenotypes, and those at the monitor. They also bring home the point that, even with a complete, completely annotated genome to hand, we will still face the task of fathoming how it works.