To find unknown protein-coding genes, annotation pipelines use a combination of ab initio gene prediction and similarity to experimentally confirmed genes or proteins. Here, we show that although the ab initio predictions have an intrinsically high false-positive rate, they also have a consistently low false-negative rate. The incorporation of similarity information is meant to reduce the false-positive rate, but in doing so it increases the false-negative rate. The crucial variable is gene size (including introns) — genes of the most extreme sizes, especially very large genes, are most likely to be incorrectly predicted.
We thank E. Eyras at the Sanger Center, UK, for explaining the details of the Ensembl procedures to us. This work was sponsored by the Chinese Academy of Sciences, Commission for Economy Planning, Ministry of Science and Technology, National Natural Science Foundation of China, Beijing Municipal Government, Zhejiang Provincial Government and Hangzhou Municipal Government. Some of this work was also supported by the National Human Genome Research Institute.
- AB INITIO GENE PREDICTION
The identification of protein-coding genes in genomic sequence, using no prior knowledge other than the signal and content terms.
- ANNOTATION PIPELINES
A series of computer procedures that is used to identify the biological contents of a sequenced genome. Gene finding is only the first of many steps. Subsequent steps might include the identification of homologous genes, the assignment of biological function and so on.
- CDS SIZE
The size of the spliced transcript, excluding introns. As gene-prediction programs do not detect untranslated regions, we do not include them in this definition.
- COMPLETE MISS
(CM). The probability that fewer than 100 bp of the protein-coding sequence of a gene are correctly predicted.
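As a rough illustration of the complete-miss definition, the sketch below estimates CM as the fraction of genes with fewer than 100 correctly predicted coding base pairs. The function names, the half-open `(start, end)` coordinate convention and the toy gene list are assumptions made for this example; they are not taken from the article.

```python
def correctly_predicted_bp(known_exons, predicted_exons):
    """Count known coding bases that are also predicted to be coding.

    Exons are half-open (start, end) intervals on one strand (an assumption).
    """
    known = {pos for start, end in known_exons for pos in range(start, end)}
    predicted = {pos for start, end in predicted_exons for pos in range(start, end)}
    return len(known & predicted)


def complete_miss_rate(genes, threshold_bp=100):
    """Fraction of genes with fewer than threshold_bp correctly predicted bases."""
    misses = sum(
        1
        for known, predicted in genes
        if correctly_predicted_bp(known, predicted) < threshold_bp
    )
    return misses / len(genes)


# Hypothetical data: (known exons, predicted exons) for three genes.
genes = [
    ([(0, 300)], [(0, 300)]),     # fully recovered
    ([(0, 300)], [(250, 400)]),   # only 50 bp recovered: a complete miss
    ([(0, 300)], []),             # nothing predicted: a complete miss
]
print(complete_miss_rate(genes))
```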
- CONTENT TERMS
Patterns of codon usage, which are unique to each species, that allow protein-coding sequences to be distinguished from surrounding non-coding sequence.
- FALSE DESERT
(FD). The fraction of a gene's sequence, including its introns, that is not covered by any of the gene predictions.
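One way to make the false-desert measure concrete is sketched below. It assumes that a prediction "covers" every base between its first and last predicted exon, and that all coordinates are half-open `(start, end)` pairs; both conventions, and the name `false_desert`, are assumptions made for illustration.

```python
def false_desert(gene_span, prediction_spans):
    """Fraction of the gene span (introns included) covered by no prediction."""
    gene_start, gene_end = gene_span
    covered = set()
    for pred_start, pred_end in prediction_spans:
        # Clip each prediction to the gene span before recording its bases.
        covered.update(range(max(gene_start, pred_start), min(gene_end, pred_end)))
    gene_length = gene_end - gene_start
    return (gene_length - len(covered)) / gene_length


# Hypothetical 1,000-bp gene with three overlapping or partial predictions.
print(false_desert((0, 1000), [(0, 300), (250, 500), (900, 1200)]))  # 0.4
```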
- FALSE NEGATIVE
(FN). The probability that a segment that is known to code for protein is not correctly predicted to be coding, specified as a per-base pair or per-amino acid rate.
- FALSE POSITIVE
(FP). The probability that a segment that is predicted to code for protein is not in fact known to be coding, given as a per-base pair or per-amino acid rate. Note that we only count those exons that have some overlap with the region of the genome that is defined by the cDNA alignment. Exons that lie outside this region are relegated to the over-predictions.
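The false-negative and false-positive definitions can be sketched at the per-base pair level as follows. This is a minimal example under stated assumptions: exons are half-open `(start, end)` intervals, the cDNA-defined region is approximated by the span from the first to the last known coding base, and the function names are hypothetical.

```python
def to_bases(exons):
    """Expand half-open (start, end) intervals into a set of base positions."""
    return {pos for start, end in exons for pos in range(start, end)}


def per_bp_rates(known_exons, predicted_exons):
    """Return (false-positive rate, false-negative rate) per base pair."""
    known = to_bases(known_exons)
    region_start = min(start for start, _ in known_exons)
    region_end = max(end for _, end in known_exons)
    # Predicted exons with no overlap with the region are over-predictions
    # and are left out of the false-positive count.
    counted = [(s, e) for s, e in predicted_exons if s < region_end and e > region_start]
    predicted = to_bases(counted)
    fn = len(known - predicted) / len(known)
    fp = len(predicted - known) / len(predicted) if predicted else 0.0
    return fp, fn


known = [(100, 200), (300, 400)]
predicted = [(100, 200), (320, 420), (1000, 1100)]  # last exon is an over-prediction
print(per_bp_rates(known, predicted))
```

In this toy case 20 of the 200 known coding bases are missed and 20 of the 200 counted predicted bases are not known to be coding, so both rates come out at 10%.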
- GENE SIZE
The size of the unspliced transcript, including introns. As gene-prediction programs do not detect untranslated regions, we do not include them in this definition.
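The difference between gene size and CDS size is easy to see from exon coordinates. The sketch below assumes coding exons given as half-open `(start, end)` intervals with untranslated regions already excluded; the coordinates are invented for illustration.

```python
def cds_size(coding_exons):
    """Total length of the coding exons (introns excluded)."""
    return sum(end - start for start, end in coding_exons)


def gene_size(coding_exons):
    """Span from the first coding base to the last (introns included)."""
    return max(end for _, end in coding_exons) - min(start for start, _ in coding_exons)


# Hypothetical three-exon gene with one very long intron.
exons = [(1000, 1200), (5000, 5150), (90000, 90300)]
print(cds_size(exons))   # 650
print(gene_size(exons))  # 89300
```

A gene like this one has a modest CDS size but a large gene size, which is the class of gene the abstract identifies as most likely to be predicted incorrectly.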
- OUTLIER GENES
Genes whose sequence characteristics are sufficiently outside the normal range to create problems for ab initio gene prediction.
- OVER-PREDICTIONS
Predicted exons that lie entirely outside the region of the genome that is defined by the complementary DNA alignment, but which are part of a prediction that has some overlap with this region. Note the distinction between this and false positives.
- PER-AMINO ACID RATE
(Per-aa rate). In computing FPs and FNs, this is the method in which we also insist that the correct amino acids are predicted, which requires that the reading frame is correctly assigned.
- PER-BASE PAIR RATE
(Per-bp rate). In computing FPs and FNs, this is the method in which we only ask that the correct nucleotides are predicted, without checking if the reading frame is correctly assigned.
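The contrast between the two rates can be illustrated with a small sketch. It approximates the per-amino acid check at the base level: a known coding base counts as recovered only if the prediction places it at the same codon position. This approximation, the forward-strand assumption, the half-open coordinates and the function names are all assumptions for this example and are not necessarily how the authors computed their rates.

```python
def codon_positions(exons):
    """Map each coding base to its codon position (0, 1 or 2), forward strand."""
    position = {}
    offset = 0
    for start, end in sorted(exons):
        for pos in range(start, end):
            position[pos] = offset % 3
            offset += 1
    return position


def false_negative_rates(known_exons, predicted_exons):
    """Return (per-bp FN rate, frame-aware FN rate) for one gene."""
    known = codon_positions(known_exons)
    predicted = codon_positions(predicted_exons)
    total = len(known)
    found_bp = sum(1 for pos in known if pos in predicted)
    found_in_frame = sum(1 for pos, cp in known.items() if predicted.get(pos) == cp)
    return (total - found_bp) / total, (total - found_in_frame) / total


# The second predicted exon starts 1 bp late, which shifts its reading frame.
known = [(0, 90), (200, 290)]
predicted = [(0, 90), (201, 290)]
per_bp, per_aa = false_negative_rates(known, predicted)
print(per_bp, per_aa)
```

Here a single missing base barely changes the per-bp rate, but the frame-aware rate rises to one half because every base of the second exon is read in the wrong frame.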
- REFSEQ
The division of GenBank that is devoted to full-length reference sequences for experimentally confirmed genes.
- SENSITIVITY
A measure of prediction that is equivalent to one minus the false-negative rate.
- SERIAL ANALYSIS OF GENE EXPRESSION
(SAGE). A quantitative expression assay that is based on tags that are 10–20 bp in length, which are derived from mRNAs.
- SIGNAL TERMS
Short sequence motifs, such as splice sites, branch points, polypyrimidine tracts, start codons and stop codons, that are used to detect exon boundaries.
- SPECIFICITY
A measure of prediction that is equivalent to one minus the false-positive rate.
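The two "one minus" measures in this glossary, conventionally called sensitivity and specificity in the gene-prediction literature, can be written out as follows. The symbols are assumptions introduced here: $r_{\mathrm{FN}}$ and $r_{\mathrm{FP}}$ are the false-negative and false-positive rates as defined in this glossary, and $N_{\mathrm{TP}}$, $N_{\mathrm{FN}}$ and $N_{\mathrm{FP}}$ are counts of correctly predicted, missed and wrongly predicted coding bases (or amino acids).

```latex
\begin{align}
\mathrm{Sn} &= 1 - r_{\mathrm{FN}}
  = 1 - \frac{N_{\mathrm{FN}}}{N_{\mathrm{TP}} + N_{\mathrm{FN}}}
  = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FN}}}, \\
\mathrm{Sp} &= 1 - r_{\mathrm{FP}}
  = 1 - \frac{N_{\mathrm{FP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}}
  = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}}.
\end{align}
```

Because the false-positive rate here is taken relative to what was predicted, this specificity corresponds to what other fields call precision, rather than to a measure based on true negatives.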
- TRAINING SET
A set of known protein-coding sequences that is used to teach the ab initio gene-prediction program what the codon-usage patterns look like for a given species.
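To give a sense of what a training set supplies, the sketch below tabulates codon-usage frequencies from a handful of coding sequences. It is a toy illustration only: real ab initio programs fit much richer statistical models, and the function name and example sequences are hypothetical.

```python
from collections import Counter


def codon_usage(coding_sequences):
    """Relative frequency of each codon across in-frame coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
        counts.update(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical training set of two short coding sequences.
training_set = ["ATGGCCGCCTAA", "ATGGCCTGA"]
usage = codon_usage(training_set)
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 3))
```

Frequencies of this kind are the raw material for the content terms: a candidate region whose codons resemble the training-set profile is more likely to be scored as coding.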
Cite this article
Wang, J., Li, S., Zhang, Y. et al. Vertebrate gene predictions and the problem of large genes. Nat Rev Genet 4, 741–749 (2003). https://doi.org/10.1038/nrg1160