Vertebrate gene predictions and the problem of large genes

Wang, Jun; Li, ShengTing; Zhang, Yong; Zheng, HongKun; Xu, Zhao; Ye, Jia; Yu, Jun; Wong, Gane Ka-Shu

doi:10.1038/nrg1160

Opinion
Published: 01 September 2003

Nature Reviews Genetics volume 4, pages 741–749 (2003)

Abstract

To find unknown protein-coding genes, annotation pipelines use a combination of ab initio gene prediction and similarity to experimentally confirmed genes or proteins. Here, we show that although the ab initio predictions have an intrinsically high false-positive rate, they also have a consistently low false-negative rate. The incorporation of similarity information is meant to reduce the false-positive rate, but in doing so it increases the false-negative rate. The crucial variable is gene size (including introns) — genes of the most extreme sizes, especially very large genes, are most likely to be incorrectly predicted.

**Figure 1: Actual versus predicted exons in a known gene: TEA domain family member 1 (SV40 transcriptional enhancer factor on human chromosome 11).**

**Figure 2: Correlation between gene size and intron size.**

**Figure 3: Size dependencies for false-positive and false-negative rates.**

**Figure 4: Size dependency for gene fragmentation problem.**

**Figure 5: Detection of erroneous predictions using gene size.**

**Figure 6: Size dependency in tissue-specific expression.**

**Figure 7: Size independence of the over-prediction problem.**

**Figure 8: Complete and partial failure to detect a gene.**

Acknowledgements

We thank E. Eyras at the Sanger Center, UK, for explaining the details of the Ensembl procedures to us. This work was sponsored by the Chinese Academy of Sciences, Commission for Economy Planning, Ministry of Science and Technology, National Natural Science Foundation of China, Beijing Municipal Government, Zhejiang Provincial Government and Hangzhou Municipal Government. Some of this work was also supported by the National Human Genome Research Institute.

Author information

Jun Wang, ShengTing Li and Yong Zhang contributed equally to this work.

Authors and Affiliations

Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 101300, China
Jun Wang, ShengTing Li, Yong Zhang, HongKun Zheng, Zhao Xu, Jia Ye, Jun Yu & Gane Ka-Shu Wong
James D. Watson Institute of Zhejiang University, Hangzhou Genomics Institute, Key Laboratory of Bioinformatics of Zhejiang Province, Hangzhou, 310007, China
Jun Wang, Jun Yu & Gane Ka-Shu Wong
College of Life Sciences, Peking University, Beijing, 100871, China
Yong Zhang
Department of Medicine, UW Genome Center, University of Washington, Seattle, Washington, 98195, USA
Jun Yu & Gane Ka-Shu Wong

Corresponding author

Correspondence to Gane Ka-Shu Wong.

Glossary

AB INITIO GENE PREDICTION: The identification of protein-coding genes in genomic sequence, using no prior knowledge other than the signal and content terms.
AMYGDALA: An almond-shaped neurostructure that is involved in the production of, and response to, non-verbal signs of anger, avoidance, defensiveness and fear.
ANNOTATION PIPELINES: A series of computer procedures that is used to identify the biological contents of a sequenced genome. Gene finding is only the first of many steps. Subsequent steps might include the identification of homologous genes, the assignment of biological function and so on.
CDS SIZE: The size of the spliced transcript, excluding introns. As gene-prediction programs do not detect untranslated regions, we do not include them in this definition.
COMPLETE MISS: (CM). The probability that fewer than 100 bp of the protein-coding sequence of a gene are correctly predicted.
CONTENT TERMS: Patterns of codon usage, which are unique to each species, that allow protein-coding sequences to be distinguished from surrounding non-coding sequence.
FALSE DESERT: (FD). The fraction of the sequence of a gene, including its introns, that is not covered by any of the gene predictions.
FALSE NEGATIVE: (FN). The probability that a segment that is known to code for protein is not correctly predicted to be coding, specified as a per-base pair or per-amino acid rate.
FALSE POSITIVE: (FP). The probability that a segment that is predicted to code for protein is not in fact known to be coding, given as a per-base pair or per-amino acid rate. Note that we only count those exons that have some overlap to the region of the genome that is defined by the cDNA alignment. Exons that lie outside this region are relegated to the over-predictions.
GENE SIZE: The size of the unspliced transcript, including introns. As gene-prediction programs do not detect untranslated regions, we do not include them in this definition.
OUTLIER GENES: Genes whose sequence characteristics are sufficiently outside the normal range to create problems for ab initio gene prediction.
OVER-PREDICTION: Predicted exons that lie entirely outside the region of the genome that is defined by the complementary DNA alignment, but which are part of a prediction that has some overlap with this region. Note the distinction between this and false positives.
PER-AMINO ACID RATE: (Per-aa rate). In computing FPs and FNs, this is the method in which we also insist that the correct amino acids are predicted, which requires that the reading frame is correctly assigned.
PER-BASE PAIR RATE: (Per-bp rate). In computing FPs and FNs, this is the method in which we only ask that the correct nucleotides are predicted, without checking if the reading frame is correctly assigned.
REFSEQ: The division of GenBank that is devoted to full-length reference sequences for experimentally confirmed genes.
SENSITIVITY: A measure of prediction that is equivalent to one minus the false-negative rate.
SERIAL ANALYSIS OF GENE EXPRESSION: (SAGE). A quantitative expression assay that is based on tags that are 10–20 bp in length, which are derived from mRNAs.
SIGNAL TERMS: Short sequence motifs, such as splice sites, branch points, polypyrimidine tracts, start codons and stop codons, that are used to detect exon boundaries.
SPECIFICITY: A measure of prediction that is equivalent to one minus the false-positive rate.
TRAINING SET: A set of known protein-coding sequences that is used to teach the ab initio gene-prediction program what the codon-usage patterns look like for a given species.
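
The glossary definitions of the per-base pair false-negative and false-positive rates, sensitivity, specificity and over-prediction can be illustrated with a small sketch. This is not the authors' code. The function name `per_bp_rates`, the half-open interval convention and the choice to approximate the region defined by the cDNA alignment as the span from the first to the last known coding base are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates (assumed convention)


def bases(exons: List[Interval]) -> set:
    """Expand exon intervals into the set of genomic positions they cover."""
    return {pos for start, end in exons for pos in range(start, end)}


def per_bp_rates(known_exons: List[Interval],
                 predicted_exons: List[Interval]) -> Dict[str, object]:
    """Per-bp FN/FP rates for one gene; known_exons must be non-empty."""
    # Assumption: the cDNA-aligned region runs from the first to the last
    # known coding base (untranslated regions are ignored, as in the glossary).
    region_start = min(start for start, _ in known_exons)
    region_end = max(end for _, end in known_exons)

    # Predicted exons with some overlap to the region count towards FP;
    # exons entirely outside it are set aside as over-predictions.
    inside = [(s, e) for s, e in predicted_exons
              if s < region_end and e > region_start]
    over_predicted = [ex for ex in predicted_exons if ex not in inside]

    known = bases(known_exons)
    predicted = bases(inside)

    # FN: share of known coding bases that were not predicted as coding.
    false_negative = len(known - predicted) / len(known)
    # FP: share of predicted coding bases that are not known to be coding.
    false_positive = len(predicted - known) / len(predicted) if predicted else 0.0

    return {
        "false_negative": false_negative,
        "false_positive": false_positive,
        "sensitivity": 1.0 - false_negative,
        "specificity": 1.0 - false_positive,
        "over_predicted": over_predicted,
    }


# Hypothetical toy gene: two known exons, four predicted exons.
result = per_bp_rates(
    known_exons=[(100, 200), (500, 600)],
    predicted_exons=[(100, 180), (300, 350), (500, 600), (900, 950)],
)
print(result)
```

In this toy case the predicted exon at 300–350 lies inside the gene region but in an intron, so it contributes to the false-positive rate, whereas the exon at 900–950 lies entirely outside the region and is counted only as an over-prediction. A per-amino acid version would additionally need to check that the reading frame of each predicted base matches the known one, which this sketch does not attempt.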

About this article

Cite this article

Wang, J., Li, S., Zhang, Y. et al. Vertebrate gene predictions and the problem of large genes. Nat Rev Genet 4, 741–749 (2003). https://doi.org/10.1038/nrg1160

Issue Date: 01 September 2003
DOI: https://doi.org/10.1038/nrg1160
