Washington

Biologists are taking bets — literally — on the number of genes in the human genome. Enterprising attendees at the annual Cold Spring Harbor genome meeting last week opened a book, taking bets at $1 a time. The results will be known in three years' time, when sequencing is completed.

The spread of bets placed so far — from 27,462 genes at the low end to 153,478 at the high — represents two very different approaches to gene counting. Techniques that extrapolate from manually annotated portions of the genome that have already been fully sequenced yield estimates of around 35,000–40,000 genes. Those that use computer algorithms to scour random expressed sequence tags (ESTs) from the whole genome predict 100,000 or more.

Bets cost $1 this year, $5 at next year’s meeting and $20 in 2002. At stake — beyond a cash prize that will be awarded at the 2003 meeting — is an improved understanding of the complexity of the genome, and the relative importance of genes and regulatory regions. There is still confusion over how to count genes that don’t code for protein, and uncertainty about what biological roles these pieces of DNA play.

Perhaps reflecting the absence of widely expected announcements on new sequencing milestones, the issue became the focus of heated discussion at Cold Spring Harbor last week. "Sequencing is like digging a gold mine," says John Quackenbush, a computational biologist with The Institute for Genomic Research (TIGR) in Rockville, Maryland. "How much gold is there to find?"

Quackenbush’s own prediction of 120,000 genes, to be published in the June issue of Nature Genetics, is at the high end of the scale. DoubleTwist (formerly Pangea), a bioinformatics company in Oakland, California, last week announced that its own algorithms predicted a total of 100,000 genes.

Both use gene-hunting programs and DNA and protein homology searches to find candidate genes in GenBank. But each uses different software. Another genomics company, Incyte, predicts over 100,000 genes, but has used its proprietary EST database, rather than the public GenBank (see Nature 401, 311; 1999).

Tim Hubbard, who heads Ensembl, an automated annotation program similar to TIGR’s and DoubleTwist’s, has doubts. "Automatic annotation over-predicts the number of genes," says Hubbard, who is based at Britain’s Sanger Centre. "This is due to false positives and cases where multiple genes are annotated when there is really only one."

The Ensembl total mirrors that of groups led by Philip Green, of the University of Washington, Seattle, and Jean Weissenbach, of the Centre National de Séquençage at Evry, France, both of which have separate papers appearing in June’s Nature Genetics. Each used different techniques that closely scrutinized a portion of the genome before scaling up their counts.

Green’s group used either the curated, annotated genes from chromosome 22 or a filtered set of near full-length mRNA sequences from GenBank as a starting point. Weissenbach’s team compared bacterial artificial chromosome clone ends from a pufferfish genome with a known, curated set of protein-encoding genes, then used algorithms to compare protein-coding regions from the fish’s genome with human sequence.

Quackenbush notes that extrapolation has its weaknesses, too. For example, chromosomes 22 and 21 — whose sequence is published this week (see page 311) — are similar in size, but chromosome 21 has 225 genes, compared with 545 on chromosome 22. Using either total on its own to estimate the number of genes in the genome could be misleading.

Francis Collins, director of the US National Human Genome Research Institute, notes that totalling the genes in chromosomes 21 and 22, then scaling that figure up to account for the size of the whole genome, results in an estimate of 40,000 genes. That’s close to the figure arrived at by the extrapolators and Ensembl. But he cautions not to put too much faith in this approach.
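The scale-up Collins describes can be sketched as simple proportional arithmetic. The gene counts (225 and 545) come from the article; the chromosome and genome sizes below are assumed round figures for illustration, not values given in the article, and the result shifts with whichever sizes are plugged in.

```python
# Back-of-envelope extrapolation: gene density on chromosomes 21 and 22,
# scaled up to the whole genome.

genes_chr21 = 225      # from the article
genes_chr22 = 545      # from the article

size_chr21_mb = 33.5   # assumed sequenced length, in megabases
size_chr22_mb = 33.5   # assumed
genome_size_mb = 3000  # assumed whole-genome size, in megabases

density = (genes_chr21 + genes_chr22) / (size_chr21_mb + size_chr22_mb)
estimate = density * genome_size_mb
print(round(estimate))  # in the mid-30,000s with these assumed sizes
```

With slightly different assumed sizes the figure lands near Collins's quoted 40,000 — which is exactly why he cautions against putting too much faith in the approach: the two chromosomes' very different gene densities show that any small sample can bias the extrapolation.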

http://www.ensembl.org/genesweep.html