Guessing the number of human genes is a speculative enterprise. Even with the publication of the draft human genome sequence (see pages 158–159), one can only have cautious confidence in the estimated count of 30,000–40,000. More accurate estimates require a high-throughput method for validation of gene predictions, and Mark Boguski and colleagues now report an approach to do just that using microarray technology.

Traditional computational methods for predicting gene number integrate information from numerous sources, such as sequence signatures of gene structure, similarity to genes in other organisms and evidence that a DNA sequence is expressed. Although collectively powerful, each type of information is subject to error. Furthermore, computational approaches overlook the complexity of gene expression, such as the tremendous diversity that arises from alternative splicing of gene transcripts.

The premise of Boguski and colleagues is that exons of the same gene will show similar expression patterns across a range of cell types and experimental conditions. To test this, they designed 'exon arrays' comprising 60-mer oligonucleotide probes, each of which corresponds to a predicted exon, printed on glass slides. As proof of principle, the authors analysed gene predictions for chromosome 22, the first chromosome to be fully sequenced and exhaustively annotated. Although the method validated most known genes on chromosome 22, about 15% were missed, indicating that it needs further refinement. Intriguingly, over half of the genes predicted solely by ab initio computational methods were confirmed, far exceeding earlier expectations, which illustrates that this is an effective means of quickly assessing the validity of computational predictions.
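The co-expression premise can be illustrated with a minimal sketch. This is not the authors' analysis pipeline: the Pearson correlation measure, the 0.9 threshold, the rule of comparing each exon only with its upstream neighbour, and the toy expression values are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def group_coexpressed_exons(profiles, threshold=0.9):
    """Group neighbouring exons whose expression profiles correlate.

    profiles: list of (exon_id, [expression value per condition]),
              ordered by genomic position.
    threshold: hypothetical cut-off, not a value from the study.
    """
    groups = []
    for exon_id, values in profiles:
        # Extend the current group if this exon tracks the previous one.
        if groups and pearson(groups[-1][-1][1], values) >= threshold:
            groups[-1].append((exon_id, values))
        else:
            groups.append([(exon_id, values)])
    return [[exon_id for exon_id, _ in group] for group in groups]


# Toy data: four predicted exons measured under four conditions.
profiles = [
    ("exon_a", [1.0, 2.0, 3.0, 4.0]),
    ("exon_b", [1.1, 2.1, 2.9, 4.2]),
    ("exon_c", [4.0, 1.0, 3.5, 0.5]),
    ("exon_d", [3.9, 1.2, 3.4, 0.7]),
]
grouped = group_coexpressed_exons(profiles)
print(grouped)
```

In this toy example the first two exons rise together across conditions and the last two share a different pattern, so the sketch proposes two candidate genes. Real data would need many more conditions and a measure of statistical support.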

To define gene structure more accurately, Boguski and colleagues used higher-resolution 'tiling arrays', in which 60-mer probes are tiled at 10-base-pair intervals across a genomic region of interest. The tiling approach enabled the gene structure of a novel testis transcript to be refined, clarifying the exact exon–intron boundaries and precise transcript length.
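The tiling layout itself is simple to sketch: 60-base windows whose start positions advance by 10 bases. The function name, the placeholder sequence and the choice to drop any incomplete final window are assumptions for illustration, not details from the study.

```python
def tile_probes(sequence, probe_length=60, step=10):
    """Return (start, probe) pairs covering a region at fixed intervals.

    Only full-length probes are kept, so a trailing stretch shorter
    than probe_length is not covered by a window of its own.
    """
    last_start = len(sequence) - probe_length
    return [
        (start, sequence[start:start + probe_length])
        for start in range(0, last_start + 1, step)
    ]


# Placeholder 100-base region (not a real genomic sequence).
region = "ACGT" * 25
probes = tile_probes(region)
print(len(probes), [start for start, _ in probes])
```

With these settings, adjacent probes overlap by 50 bases, so every position away from the edges of the region is interrogated by several probes. That redundancy is what lets the signal drop-off at an exon–intron boundary be located more precisely than with one probe per exon.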

As an initial step towards whole-genome analysis, the authors constructed 50 arrays that contained over one million probes representing more than 400,000 predicted human exons. They detected 58% of confirmed exons and 34% of predicted exons (from the Ensembl human genome annotation data set), although only two cell lines were used for the analysis. Although it is still a long way from comprehensively defining the structure of every gene in the human genome, this new approach offers a rapid means of validating computational predictions and training the next generation of gene-hunting programs.