Vertebrate gene predictions and the problem of large genes

Abstract

To find unknown protein-coding genes, annotation pipelines use a combination of ab initio gene prediction and similarity to experimentally confirmed genes or proteins. Here, we show that although the ab initio predictions have an intrinsically high false-positive rate, they also have a consistently low false-negative rate. The incorporation of similarity information is meant to reduce the false-positive rate, but in doing so it increases the false-negative rate. The crucial variable is gene size (including introns): genes of the most extreme sizes, especially very large genes, are most likely to be incorrectly predicted.

Figure 1: Actual versus predicted exons in a known gene: TEA domain family member 1 (SV40 transcriptional enhancer factor on human chromosome 11).
Figure 2: Correlation between gene size and intron size.
Figure 3: Size dependencies for false-positive and false-negative rates.
Figure 4: Size dependency for gene fragmentation problem.
Figure 5: Detection of erroneous predictions using gene size.
Figure 6: Size dependency in tissue-specific expression.
Figure 7: Size independence of the over-prediction problem.
Figure 8: Complete and partial failure to detect a gene.

Acknowledgements

We thank E. Eyras at the Sanger Center, UK, for explaining the details of the Ensembl procedures to us. This work was sponsored by the Chinese Academy of Sciences, Commission for Economy Planning, Ministry of Science and Technology, National Natural Science Foundation of China, Beijing Municipal Government, Zhejiang Provincial Government and Hangzhou Municipal Government. Some of this work was also supported by the National Human Genome Research Institute.

Author information

Corresponding author

Correspondence to Gane Ka-Shu Wong.

Related links

FURTHER INFORMATION

BLAST

BLAT

Ensembl

FANTOM

FgeneSH

Gene Ontology

GeneMark

GenScan

RefSeq

SGP2

TwinScan

UCSC Human Genome Browser

Glossary

AB INITIO GENE PREDICTION

The identification of protein-coding genes in genomic sequence, using no prior knowledge other than the signal and content terms.

AMYGDALA

An almond-shaped neurostructure that is involved in the production of, and response to, non-verbal signs of anger, avoidance, defensiveness and fear.

ANNOTATION PIPELINES

A series of computer procedures that is used to identify the biological contents of a sequenced genome. Gene finding is only the first of many steps. Subsequent steps might include the identification of homologous genes, the assignment of biological function and so on.

CDS SIZE

The size of the spliced transcript, excluding introns. As gene-prediction programs do not detect untranslated regions, we do not include them in this definition.

COMPLETE MISS

(CM). The probability that fewer than 100 bp of the protein-coding sequence of a gene are correctly predicted.

CONTENT TERMS

Patterns of codon usage, which are unique to each species, that allow protein-coding sequences to be distinguished from surrounding non-coding sequence.

FALSE DESERT

(FD). The fraction of the sequence of a gene, including its introns, that is not covered by any of the gene predictions.
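The complete-miss and false-desert definitions can be made concrete with interval arithmetic. The following is a minimal sketch, not the authors' implementation. It assumes half-open `(start, end)` genomic coordinates on a single strand, and every function name (`merge_intervals`, `overlap_bp`, `is_complete_miss`, `complete_miss_rate`, `false_desert_fraction`) is hypothetical. The 100-bp threshold comes from the complete-miss definition. Treating each prediction as its full genomic span when measuring the false desert is an assumption.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def overlap_bp(a, b):
    """Number of base pairs shared by two interval sets."""
    total = 0
    for s1, e1 in merge_intervals(a):
        for s2, e2 in merge_intervals(b):
            total += max(0, min(e1, e2) - max(s1, s2))
    return total


def is_complete_miss(true_cds_exons, predicted_exons, min_bp=100):
    """True if fewer than min_bp of the known coding sequence is predicted."""
    return overlap_bp(true_cds_exons, predicted_exons) < min_bp


def complete_miss_rate(genes):
    """Fraction of genes that are complete misses.

    genes: list of (true_cds_exons, predicted_exons) pairs.
    """
    misses = sum(1 for true, pred in genes if is_complete_miss(true, pred))
    return misses / len(genes)


def false_desert_fraction(gene_span, prediction_spans):
    """Fraction of the gene (introns included) covered by no prediction.

    gene_span: (start, end) of the known gene.
    prediction_spans: (start, end) of each predicted gene (assumed to be
    the full span from first to last predicted exon).
    """
    g_start, g_end = gene_span
    covered = overlap_bp([gene_span], prediction_spans)
    length = g_end - g_start
    return (length - covered) / length
```

For example, a 1,000-bp gene whose only overlapping predictions span positions 100 to 400 and 900 to 1,200 has 400 bp covered, so its false-desert fraction under these assumptions is 0.6.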

FALSE NEGATIVE

(FN). The probability that a segment that is known to code for protein is not correctly predicted to be coding, specified as a per-base pair or per-amino acid rate.

FALSE POSITIVE

(FP). The probability that a segment that is predicted to code for protein is not in fact known to be coding, given as a per-base pair or per-amino acid rate. Note that we only count those exons that have some overlap with the region of the genome that is defined by the cDNA alignment. Exons that lie outside this region are relegated to the over-predictions.

GENE SIZE

The size of the unspliced transcript, including introns. As gene-prediction programs do not detect untranslated regions, we do not include them in this definition.

OUTLIER GENES

Genes whose sequence characteristics are sufficiently outside the normal range to create problems for ab initio gene prediction.

OVER-PREDICTION

Predicted exons that lie entirely outside the region of the genome that is defined by the complementary DNA alignment, but which are part of a prediction that has some overlap with this region. Note the distinction between this and false positives.
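The distinction between over-predictions and false positives amounts to partitioning the exons of one prediction by whether they touch the cDNA-aligned region. The sketch below illustrates that partition. It assumes half-open `(start, end)` coordinates, and the function name `partition_predicted_exons` is hypothetical. It also assumes that "some overlap with this region" is judged from the full span of the prediction, which the definition does not spell out.

```python
def partition_predicted_exons(cdna_region, predicted_exons):
    """Split one prediction's exons into evaluated exons and over-predictions.

    cdna_region: (start, end) of the region defined by the cDNA alignment.
    predicted_exons: (start, end) exons belonging to a single prediction.

    Returns (evaluated, over_predicted). Evaluated exons overlap the region
    and feed into the false-positive rate. Over-predicted exons lie entirely
    outside it. A prediction whose span never touches the region is not
    associated with this gene, so both lists are empty.
    """
    if not predicted_exons:
        return [], []
    r_start, r_end = cdna_region
    span_start = min(s for s, _ in predicted_exons)
    span_end = max(e for _, e in predicted_exons)
    if span_end <= r_start or span_start >= r_end:
        return [], []

    evaluated, over_predicted = [], []
    for start, end in sorted(predicted_exons):
        if start < r_end and end > r_start:
            evaluated.append((start, end))
        else:
            over_predicted.append((start, end))
    return evaluated, over_predicted
```

With a cDNA-aligned region of `(1000, 5000)`, the exons `(900, 1100)` and `(2000, 2100)` would be evaluated for false positives, whereas `(200, 300)` and `(6000, 6100)` from the same prediction would be counted as over-predictions.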

PER-AMINO ACID RATE

(Per-aa rate). In computing FPs and FNs, this is the method in which we also insist that the correct amino acids are predicted, which requires that the reading frame is correctly assigned.

PER-BASE PAIR RATE

(Per-bp rate). In computing FPs and FNs, this is the method in which we only ask that the correct nucleotides are predicted, without checking if the reading frame is correctly assigned.
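The difference between the per-bp and per-aa methods can be illustrated with a small sketch, which is not the authors' code. It assumes forward-strand, half-open `(start, end)` exons, and it assumes the predicted exons have already been restricted to those overlapping the cDNA-aligned region. The names `coding_frames` and `error_rates` are hypothetical. The per-aa rate is approximated here by counting bases whose codon position matches the known one, which is a proxy for predicting the correct amino acids.

```python
def coding_frames(exons):
    """Map each coding position to its codon position (0, 1 or 2)."""
    frames = {}
    offset = 0
    for start, end in sorted(exons):
        for pos in range(start, end):
            frames[pos] = offset % 3
            offset += 1
    return frames


def error_rates(true_exons, predicted_exons, per_aa=False):
    """Return (false_negative_rate, false_positive_rate).

    per_aa=False: a base is correct if it is coding in both sets.
    per_aa=True: the base must also sit in the same reading frame
    (an approximation of the per-amino acid rate).
    """
    truth = coding_frames(true_exons)
    pred = coding_frames(predicted_exons)
    if not truth or not pred:
        raise ValueError("both exon sets must contain at least one base")

    if per_aa:
        correct = sum(1 for pos, frame in pred.items()
                      if truth.get(pos) == frame)
    else:
        correct = sum(1 for pos in pred if pos in truth)

    fn = (len(truth) - correct) / len(truth)  # known coding, not predicted
    fp = (len(pred) - correct) / len(pred)    # predicted, not known coding
    return fn, fp
```

A one-base shift in an exon boundary shows why the two rates can diverge. If the known exons are `(0, 90)` and `(200, 290)` and the prediction is `(0, 90)`, `(201, 290)` and `(400, 430)`, the per-bp false-negative rate is 1/180, but the frame-aware rate is 0.5 because every base of the second exon is read in the wrong frame.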

REFSEQ

The division of GenBank that is devoted to full-length reference sequences for experimentally confirmed genes.

SENSITIVITY

A measure of prediction that is equivalent to one minus the false-negative rate.

SERIAL ANALYSIS OF GENE EXPRESSION

(SAGE). A quantitative expression assay that is based on tags that are 10–20 bp in length, which are derived from mRNAs.

SIGNAL TERMS

Short sequence motifs, such as splice sites, branch points, polypyrimidine tracts, start codons and stop codons, that are used to detect exon boundaries.

SPECIFICITY

A measure of prediction that is equivalent to one minus the false-positive rate.

TRAINING SET

A set of known protein-coding sequences that is used to teach the ab initio gene-prediction program what the codon-usage patterns look like for a given species.

Cite this article

Wang, J., Li, S., Zhang, Y. et al. Vertebrate gene predictions and the problem of large genes. Nat Rev Genet 4, 741–749 (2003). https://doi.org/10.1038/nrg1160
