“Number is the ruler of forms and ideas, and is the cause of gods and demons.” –Pythagoras

Perhaps this goes some way to explaining a fixation with the total number of genes in the human genome: that, in having a grip on the size of our genetic complement and complexity, we are better acquainted with our genomes or, some might even say, ourselves. On a more practical level, such knowledge grants a means of gauging the degree to which high-throughput expression analyses capture the ‘global’ picture, rather than a fraction of it.

As indicated by the vim with which participants at a recent meeting* voted on the likely number of human genes (see inset), there is also the allure of competition and the inestimable pleasure of representing the balance of one's own calculations and assumptions with a number that can be compared with those of others. Three papers1,2,3 in this issue of Nature Genetics propose limits to the human gene complement: these estimates range from 30,000 to 120,000. An inspection of the methods of estimation—described and discussed on page 129 of this issue4—discloses a dichotomy that is informed by the type of analysis.

The bets are on. Genesweep 2000, the competition launched by Ewan Birney (of the European Bioinformatics Institute) at a recent meeting on genome sequencing and biology*, involves estimates of human gene number (see http://www.ensembl.org/genesweep.html).

Brent Ewing and Philip Green1 derive estimates of 34,700 and 33,630 by multiplying a representative sample of genes (for example, the number of genes identified on chromosome 22) by the number of times another sample (say, a collection of EST sequences) can be divided by the number of common matches between the two. A similarly low estimate of 30,000 genes is made by Jean Weissenbach and colleagues2 on the basis of a comparative study involving genomic sequence of the pufferfish and human draft sequence. In contrast, John Quackenbush and colleagues3 seem to have hit on a motherlode of human genes—in the order of 120,000—by ridding GenBank EST sequences of perceived artefact (for example, solitary ESTs that lack a poly(A) tail), assembling and collapsing the remaining sequences into contigs, and comparing these with curated collections of protein-encoding genes.
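The Ewing and Green calculation is, in essence, a capture-recapture (Lincoln-Petersen) estimate: multiply the sizes of two independently drawn gene samples and divide by the number of genes they share. A minimal sketch in Python, with invented sample sets for illustration (the function name and toy data are not from the paper):

```python
def capture_recapture_estimate(sample1, sample2):
    """Lincoln-Petersen estimate of the total: |S1| * |S2| / |S1 & S2|.

    sample1, sample2: sets of gene identifiers drawn independently
    (e.g. genes annotated on chromosome 22 vs. genes hit by an EST
    collection).
    """
    overlap = len(sample1 & sample2)
    if overlap == 0:
        raise ValueError("samples share no genes; estimate is undefined")
    return len(sample1) * len(sample2) / overlap

# Toy example: samples of 4 and 6 'genes' sharing 2 members
s1 = {"g1", "g2", "g3", "g4"}
s2 = {"g3", "g4", "g5", "g6", "g7", "g8"}
print(capture_recapture_estimate(s1, s2))  # 4 * 6 / 2 = 12.0
```

The estimate assumes the two samples are independent draws from the same underlying gene set; biases in either sample (for instance, ESTs favouring highly expressed genes) skew the result.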

Analyses by Green and Quackenbush rely on a comparison between annotated genes and the transcriptome—which is assumed to be an accurate representation of the genome. Gene identity is determined by a combination of ab initio gene prediction software and similarity searches against the contents of available DNA and protein databases. The capacity of the software and the content of databases have limitations whose magnitude is presently difficult to determine. A similar fuzziness clouds the degree to which one may confidently assume that ESTs represent transcribed genes, even when they are clustered into a consensus sequence. Illegitimate transcription primed from cryptic tracts, such as those present in Alu sequences, may generate ‘bogus’ consensus contigs. Moreover, alternative splice variants, unspliced or incorrectly spliced mRNA fragments and initiation of transcription from widely spaced 3′ poly(A) tracts common to the same gene may artificially inflate gene estimates based on EST representation. Current efforts to collapse the Unigene set of cDNAs indicate that these concerns are warranted; that the spectre of the artefact, fed through the over-sampling of libraries in the quest for rare transcripts, looms large. The degree to which collections of EST sequences are ‘groomed’—that is, voided of artefact—will have an obvious effect on the number of ‘genes’ that they identify.
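The sensitivity to ‘grooming’ is easy to see in a toy model: the apparent gene count is simply the number of EST clusters, so the filtering criterion moves the count directly. A hypothetical sketch (the labels and the singleton threshold are invented for illustration, not taken from any of the cited analyses):

```python
from collections import Counter

def apparent_gene_count(est_labels, min_cluster_size=1):
    """Count apparent 'genes' as EST clusters of at least min_cluster_size.

    est_labels: one cluster/contig label per EST. Singleton clusters are
    the ones most likely to be artefactual (cf. solitary ESTs lacking a
    poly(A) tail), so raising the threshold 'grooms' the collection.
    """
    cluster_sizes = Counter(est_labels)
    return sum(1 for n in cluster_sizes.values() if n >= min_cluster_size)

# Two well-supported clusters plus two singleton (possibly bogus) contigs
ests = ["g1", "g1", "g1", "g2", "g2", "alu1", "alu2"]
print(apparent_gene_count(ests))                      # 4, ungroomed
print(apparent_gene_count(ests, min_cluster_size=2))  # 2, singletons dropped
```

Here a single filtering decision halves the count, which is the scale of disagreement the editorial describes between the 30,000-range and 120,000-range estimates.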

The trio of estimates presented in this issue is in good company5,6,7: the sport of predicting the total number of human genes has enjoyed a long tradition. Many estimates are based on an extrapolation from the numbers gleaned from a small segment of the genome, taking into account estimates of gene density across the genome and its total size—these tend to converge on a gene number in the region of 65,000. Recent estimates from Incyte, DoubleTwist and Human Genome Sciences, however, are significantly higher, with assertions that the human genome contains over 100,000 genes. Without access to data that support these estimates, however, it is impossible to comment upon their computation. Releasing a subset of these data into the public domain would enable open assessment—of the kind demonstrated by the recent Drosophila Genome Annotation Assessment Project9—and potentially mute insinuations4,9 that commercial interests factor into the equation.
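The extrapolation underlying the older estimates is a one-line calculation: scale a locally observed gene density up to the full genome. A sketch with round, assumed figures (the 3,200-Mb genome size and the sample numbers are placeholders chosen to land near the 65,000 figure, not values from the cited papers):

```python
def extrapolate_gene_count(genes_in_sample, sample_size_mb, genome_size_mb=3200.0):
    """Scale the gene density of a sequenced segment up to the whole genome."""
    density = genes_in_sample / sample_size_mb  # genes per megabase
    return density * genome_size_mb

# e.g. 650 genes found in a 32-Mb segment -> ~65,000 genes genome-wide
print(round(extrapolate_gene_count(650, 32.0)))  # 65000
```

The weak point, as the divergent published numbers suggest, is the assumption that the sampled segment's gene density is typical of the genome as a whole.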

The brouhaha over human gene number should not eclipse the higher goal of understanding how gene action directs the development and biology of the organism, be it human, mouse or Amphioxus. Comparison of gene sequences of Caenorhabditis elegans and Drosophila melanogaster indicates that biological complexity is orchestrated at the level of the protein—for example, through regulation of expression levels, or splicing patterns—rather than gene number. Noting the known protein families of Drosophila, Saccharomyces cerevisiae and humans, Gerald Rubin (who heads the Berkeley Genome Project) predicts approximately 10,000 human protein families. (Drosophila has 8,000 and yeast, 4,300.) These observations underscore yet again the importance of determining gene function in model organisms, and of close collaboration between those at the bench, who design and carry out screens for mutant phenotypes, and those at the monitor. They also bring home the point that, even with a complete, completely annotated genome to hand, we will still face the task of fathoming how it works.