Abstract
To find unknown protein-coding genes, annotation pipelines combine ab initio gene prediction with similarity to experimentally confirmed genes or proteins. Here, we show that although ab initio predictions have an intrinsically high false-positive rate, they also have a consistently low false-negative rate. Incorporating similarity information is meant to reduce the false-positive rate, but in doing so it increases the false-negative rate. The crucial variable is gene size (including introns): genes of the most extreme sizes, especially very large genes, are the most likely to be incorrectly predicted.
Acknowledgements
We thank E. Eyras at the Sanger Center, UK, for explaining the details of the Ensembl procedures to us. This work was sponsored by the Chinese Academy of Sciences, Commission for Economy Planning, Ministry of Science and Technology, National Natural Science Foundation of China, Beijing Municipal Government, Zhejiang Provincial Government and Hangzhou Municipal Government. Some of this work was also supported by the National Human Genome Research Institute.
Glossary
- AB INITIO GENE PREDICTION
The identification of protein-coding genes in genomic sequence, using no prior knowledge other than the signal and content terms.
- AMYGDALA
An almond-shaped neurostructure that is involved in the production of, and response to, non-verbal signs of anger, avoidance, defensiveness and fear.
- ANNOTATION PIPELINES
A series of computer procedures that is used to identify the biological contents of a sequenced genome. Gene finding is only the first of many steps. Subsequent steps might include the identification of homologous genes, the assignment of biological function and so on.
- CDS SIZE
The size of the spliced transcript, excluding introns. As gene-prediction programs do not detect untranslated regions, we do not include them in this definition.
- COMPLETE MISS
(CM). The probability that fewer than 100 bp of the protein-coding sequence of a gene are correctly predicted.
- CONTENT TERMS
Patterns of codon usage, which are unique to each species, that allow protein-coding sequences to be distinguished from surrounding non-coding sequence.
- FALSE DESERT
(FD). The fraction of a gene's sequence, including its introns, that is not covered by any of the gene predictions.
- FALSE NEGATIVE
(FN). The probability that a segment that is known to code for protein is not correctly predicted to be coding, specified as a per-base pair or per-amino acid rate.
- FALSE POSITIVE
(FP). The probability that a segment that is predicted to code for protein is not in fact known to be coding, given as a per-base pair or per-amino acid rate. Note that we only count those exons that have some overlap with the region of the genome that is defined by the cDNA alignment. Exons that lie outside this region are relegated to the over-predictions.
- GENE SIZE
The size of the unspliced transcript, including introns. As gene-prediction programs do not detect untranslated regions, we do not include them in this definition.
- OUTLIER GENES
Genes whose sequence characteristics are sufficiently outside the normal range to create problems for ab initio gene prediction.
- OVER-PREDICTION
Predicted exons that lie entirely outside the region of the genome that is defined by the complementary DNA alignment, but which are part of a prediction that has some overlap with this region. Note the distinction between this and false positives.
- PER-AMINO ACID RATE
(Per-aa rate). In computing FPs and FNs, this is the method in which we also insist that the correct amino acids are predicted, which requires that the reading frame is correctly assigned.
- PER-BASE PAIR RATE
(Per-bp rate). In computing FPs and FNs, this is the method in which we only ask that the correct nucleotides are predicted, without checking whether the reading frame is correctly assigned.
- REFSEQ
The division of GenBank that is devoted to full-length reference sequences for experimentally confirmed genes.
- SENSITIVITY
A measure of prediction that is equivalent to one minus the false-negative rate.
- SERIAL ANALYSIS OF GENE EXPRESSION
(SAGE). A quantitative expression assay that is based on tags that are 10–20 bp in length, which are derived from mRNAs.
- SIGNAL TERMS
Short sequence motifs, such as splice sites, branch points, polypyrimidine tracts, start codons and stop codons, that are used to detect exon boundaries.
- SPECIFICITY
A measure of prediction that is equivalent to one minus the false-positive rate.
- TRAINING SET
A set of known protein-coding sequences that is used to teach the ab initio gene-prediction program what the codon-usage patterns look like for a given species.
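As an illustration of how the per-bp glossary measures relate to one another, the following Python sketch computes them for a single gene. It is a minimal reading of the definitions above, not the authors' evaluation code: the function and field names are hypothetical, coordinates are assumed to be half-open intervals, and the choice of denominators (known coding bases for FN, counted predicted bases for FP, gene size for FD) is an assumption. The glossary defines FN, FP and CM as probabilities, so in practice these per-gene values would be aggregated over many genes.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # assumed half-open [start, end) genomic coordinates


@dataclass
class GeneMetrics:
    false_negative: float    # per-bp FN for this gene
    false_positive: float    # per-bp FP for this gene
    sensitivity: float       # 1 - FN
    specificity: float       # 1 - FP
    false_desert: float      # fraction of gene size not covered by any prediction
    complete_miss: bool      # fewer than cm_threshold coding bp correctly predicted
    over_predicted_exons: int


def covered_bases(intervals: List[Interval]) -> Set[int]:
    """Expand intervals into the set of base positions they cover."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def per_bp_metrics(known_exons: List[Interval],
                   predictions: List[List[Interval]],
                   cm_threshold: int = 100) -> GeneMetrics:
    """known_exons: coding exons of one gene, taken from the cDNA alignment.
    predictions: predicted genes, each given as a list of predicted exons."""
    # Region defined by the cDNA alignment (gene size, including introns).
    gene_start = min(s for s, _ in known_exons)
    gene_end = max(e for _, e in known_exons)
    known = covered_bases(known_exons)

    predicted: Set[int] = set()   # bases in predicted exons that are counted
    spanned: Set[int] = set()     # gene-region bases covered by any prediction
    over_predicted = 0
    for exons in predictions:
        p_start = min(s for s, _ in exons)
        p_end = max(e for _, e in exons)
        if p_end <= gene_start or p_start >= gene_end:
            continue  # this prediction does not overlap the gene region at all
        spanned.update(range(max(p_start, gene_start), min(p_end, gene_end)))
        for s, e in exons:
            if e <= gene_start or s >= gene_end:
                over_predicted += 1          # exon entirely outside the region
            else:
                predicted.update(range(s, e))  # exon counts toward FP/FN

    correct = len(known & predicted)
    fn = 1.0 - correct / len(known)
    fp = 1.0 - correct / len(predicted) if predicted else 0.0
    fd = 1.0 - len(spanned) / (gene_end - gene_start)
    return GeneMetrics(fn, fp, 1.0 - fn, 1.0 - fp, fd,
                       correct < cm_threshold, over_predicted)


# Toy example: a two-exon gene and one prediction with four exons.
example = per_bp_metrics(
    known_exons=[(100, 200), (500, 650)],
    predictions=[[(120, 200), (300, 340), (500, 600), (900, 950)]],
)
print(example)
```

In the toy example, the predicted exon at 900 to 950 lies entirely outside the gene region and is therefore counted as an over-prediction rather than as a false positive, whereas the predicted exon at 300 to 340 falls inside an intron and contributes to the false-positive rate.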
Cite this article
Wang, J., Li, S., Zhang, Y. et al. Vertebrate gene predictions and the problem of large genes. Nat Rev Genet 4, 741–749 (2003). https://doi.org/10.1038/nrg1160