Main

Using GlimmerM, we found 25 possible new genes on chromosome 3 in regions currently annotated as non-coding1 (Table 1). Although some of these are relatively short and may not be genuine coding sequences, six of them are longer than 400 base pairs (bp) and one is 2,187 bp. This last gene encodes a 729-amino-acid protein and consists of one very large exon and one shorter exon. Open reading frames of this length in a chromosome with 80% A+T content are virtually certain to represent real genes. Three of the table entries, G802, G803 and G740, have detectable homology to var gene fragments on chromosome 2; G802 and G803 are close enough that they might represent two portions of the same gene.
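The claim that long open reading frames are unlikely to arise by chance in A+T-rich sequence can be illustrated with a rough calculation. The sketch below is not part of GlimmerM. It assumes a deliberately simple null model in which bases are independent, A and T are equally frequent, and G and C are equally frequent. It also treats the 729 codons as a single uninterrupted frame, although the predicted gene has two exons. The function names are hypothetical.

```python
# Null model (an assumption, not GlimmerM's model): independent bases,
# with A = T and G = C at frequencies fixed by the A+T content.
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_probability(at_content: float) -> float:
    """Chance that a random codon is a stop codon under the null model."""
    a_or_t = at_content / 2
    g_or_c = (1 - at_content) / 2
    freq = {"A": a_or_t, "T": a_or_t, "G": g_or_c, "C": g_or_c}
    return sum(freq[x] * freq[y] * freq[z] for x, y, z in STOP_CODONS)


def open_frame_probability(n_codons: int, at_content: float) -> float:
    """Chance that n_codons consecutive random codons contain no stop."""
    return (1 - stop_codon_probability(at_content)) ** n_codons


p_stop = stop_codon_probability(0.80)
p_open = open_frame_probability(2187 // 3, 0.80)  # 2,187 bp = 729 codons
print(f"P(stop codon) = {p_stop:.3f}")
print(f"P(729 codons with no stop) = {p_open:.2e}")
```

Under these assumptions a stop codon is expected roughly once every ten codons, because the stop codons TAA, TAG and TGA are themselves A+T-rich. A stop-free run of several hundred codons is therefore extremely improbable by chance alone.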

Table 1 Possible new genes on chromosome 3

GlimmerM automatically finds 214 of the 215 protein-coding regions reported by Bowman et al.1. (An earlier version of GlimmerM was provided to the annotators of chromosome 3.) The only gene missed is a short hypothetical protein (PFC0360w, 114 amino acids) with no homology to any known gene1. Finally, given that chromosome 3 is 12% larger than chromosome 2, an extrapolation based on gene density would predict 234 genes on chromosome 3, consistent with our finding that additional genes could be present.
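The gene-density extrapolation can be reproduced with a few lines of arithmetic. The text does not state the chromosome 2 gene count. The value of 209 used below is an assumption, taken to be the figure from the published chromosome 2 annotation, and should be checked against that source. The function name is hypothetical.

```python
def extrapolate_gene_count(reference_genes: int, size_ratio: float) -> int:
    """Scale a reference gene count by relative chromosome size,
    assuming equal gene density on both chromosomes."""
    return round(reference_genes * size_ratio)


CHR2_GENES = 209      # assumption: not stated in the text
SIZE_RATIO = 1.12     # chromosome 3 is 12% larger than chromosome 2
CHR3_ANNOTATED = 215  # protein-coding regions reported for chromosome 3

predicted = extrapolate_gene_count(CHR2_GENES, SIZE_RATIO)
shortfall = predicted - CHR3_ANNOTATED
print(f"Predicted genes on chromosome 3: {predicted}")
print(f"Predicted minus annotated: {shortfall}")
```

A positive shortfall means the existing annotation contains fewer genes than equal gene density would predict, which leaves room for some of the 25 candidates in Table 1 to be genuine.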

GlimmerM is very accurate at identifying splice sites, having been trained on a carefully curated set of experimentally confirmed introns from the P. falciparum genome2. GlimmerM's predictions suggest different splice sites for 49 of the 215 genes annotated on chromosome 3; all of these encode hypothetical proteins. As with the annotation of hypothetical proteins for chromosome 2, substantial additional laboratory studies are needed to determine the exon structure of these genes with confidence.

For chromosome 2, we used the polymerase chain reaction with reverse transcription to test 13 hypothetical genes predicted by GlimmerM; all 13 predictions were confirmed2,3. Of course, as emphasized previously2,3, GlimmerM is only one step, albeit an important one, in a process that should involve many other computational methods as well as careful human curation of the results produced by those methods.

To achieve the highest quality in analysing genome sequences, peer-reviewed bioinformatics methods should be used when they are available. We consider it an oversight that the chromosome 3 annotation effort neglected to consider the predictions of GlimmerM, especially as a substantial part of the chromosome 3 analysis involves comparing the two chromosomes (see, for example, Table 2 of Bowman et al.1).