  • Review Article
Steady progress and recent breakthroughs in the accuracy of automated genome annotation

Key Points

  • It is not currently possible to determine the precise structure of every protein-coding gene in a complex, eukaryotic genome. However, the past 10 years have seen steady progress in the accuracy and completeness of methods for automated genome annotation.

  • Currently, the gold standard in the annotation of exon–intron structures is the alignment of a full-length cDNA sequence to the sequence of the genomic region from which it was transcribed.

  • For a significant fraction of genes, it is not practical to obtain full-length cDNA sequences by sequencing randomly selected cDNA clones or by screening clone libraries.

  • Some of these genes can be accurately annotated by aligning the sequence of a cDNA (or its translation) to a very similar genomic region other than the one from which it was transcribed.

  • The first driver of recent improvements in annotation is the sequencing of many genomes that can be compared with one another, a trend that is likely to continue.

  • A second source of improvement is the development of better probability models for de novo gene prediction, most recently those based on the conditional random field modelling framework.

  • A third significant source of improvement in mammalian genome annotation has been the development of software for automatically detecting processed pseudogenes.

  • By designing PCR primers for predicted cDNA sequences, it is possible to specifically amplify and sequence thousands of cDNAs, the sequences of which could not be obtained by traditional methods.

  • By using a combiner program to adjudicate among predictions and alignments produced by several methods, one can now come closer than ever before to producing complete and accurate gene catalogues.
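
As a rough illustration of what a combiner program does, the following Python sketch adjudicates among several sets of predicted exons by weighted per-base voting. It is a toy scheme under stated assumptions, not the algorithm used by Jigsaw, GLEAN or any other published combiner. The function name `combine_predictions`, the weights and the example intervals are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates of a predicted exon


def combine_predictions(
    predictions: Dict[str, List[Interval]],
    weights: Dict[str, float],
    length: int,
) -> List[Interval]:
    """Toy combiner: a base is called exonic when its weighted vote
    exceeds half of the total weight of all evidence sources."""
    votes = [0.0] * length
    for source, exons in predictions.items():
        for start, end in exons:
            for pos in range(max(start, 0), min(end, length)):
                votes[pos] += weights[source]

    threshold = sum(weights[source] for source in predictions) / 2.0

    # Merge consecutive exonic bases back into intervals.
    combined: List[Interval] = []
    run_start: Optional[int] = None
    for pos, vote in enumerate(votes):
        if vote > threshold and run_start is None:
            run_start = pos
        elif vote <= threshold and run_start is not None:
            combined.append((run_start, pos))
            run_start = None
    if run_start is not None:
        combined.append((run_start, length))
    return combined


# Hypothetical inputs: three evidence sources over a 100-base region.
example = {
    "de_novo": [(10, 30), (60, 80)],
    "cdna_alignment": [(12, 30), (60, 82)],
    "protein_alignment": [(12, 28)],
}
example_weights = {"de_novo": 1.0, "cdna_alignment": 2.0, "protein_alignment": 1.0}
consensus = combine_predictions(example, example_weights, 100)
```

Here the cDNA alignment is given twice the weight of the other sources, reflecting the idea that some kinds of evidence are more trustworthy than others. A real combiner would typically learn such weights from training data and would reason about whole gene structures rather than individual bases.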

Abstract

The sequencing of large, complex genomes has become routine, but understanding how sequences relate to biological function is less straightforward. Although much attention is focused on how to annotate genomic features such as developmental enhancers and non-coding RNAs, there is still no higher eukaryote for which we know the correct exon–intron structure of at least one ORF for each gene. Despite this uncomfortable truth, genome annotation has made remarkable progress since the first drafts of the human genome were analysed. By combining several computational and experimental methods, we are now closer to producing complete and accurate gene catalogues than ever before.

Figure 1: Performance of GeneWise, a trans-alignment program.
Figure 2: The steadily increasing accuracy of de novo gene prediction algorithms.
Figure 3: Criteria for selecting the best informant genome.

Acknowledgements

I am deeply grateful to M. J. van Baren, M. Schuster and E. Birney for help with the GeneWise analysis, R. Brown for analysis of informant genome utility, M. J. van Baren and S. Gross for comments on the manuscript, and L. Kyro and L. Langton for help with figures. M.R.B. is supported in part by grants from the National Institutes of Health (HG002278, HG003700, HG004271) and Monsanto.

Related links

FURTHER INFORMATION

Michael R. Brent's homepage

AUGUSTUS

CONRAD

CONTRAST

CRAIG

ENSEMBL

EST_GENOME

Exonerate

EXONIPHY

GAZE

GeneID

GeneWise

Genomix

GENSCAN

GENSCAN parameter files

GLEAN

GlimmerM

GMAP

HAVANA

Jigsaw

N-SCAN

PPFINDER

SGP2

TWINSCAN

UNIPROT

Glossary

cDNA library

A collection of clones that propagate and amplify copies of diverse (usually random) cDNA sequences.

Cis alignment

The alignment of a cDNA sequence to the locus that matches it best in its source genome — the presumed template for its transcription.

Trans alignment

The alignment of a cDNA or protein sequence to a homologous locus other than the one from which it was transcribed.

De novo gene prediction

An approach to gene prediction in which the only inputs are genome sequences; no evidence derived from RNA is used.

Target genome

The genome to be annotated, as opposed to informant genomes or other supporting sequences. In gene prediction, informant genomes are genome sequences that are aligned to the target genome and used as auxiliary information for annotating it.

Conditional random field

A type of discriminative model that is used for assigning probabilities to possible annotations of a sequence. A discriminative model is a probability model in which the most likely values of hidden variables (for example, annotations of DNA segments) are calculated directly from the observed variable values (for example, the DNA sequences) without using the probability of the observed values.
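
For concreteness, one commonly used form is the linear-chain conditional random field, sketched below. The notation is an assumption introduced here and does not come from the article: x is the observed DNA sequence of length n, y is a labelling of its positions, the f_k are feature functions and the w_k are their learned weights. Gene predictors generally use richer variants, so this should be read as an illustration of the idea rather than the model of any particular program.

```latex
\Pr(y \mid x) = \frac{1}{Z(x)}
  \exp\!\left( \sum_{i=1}^{n} \sum_{k} w_k \, f_k(y_{i-1}, y_i, x, i) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{i=1}^{n} \sum_{k} w_k \, f_k(y'_{i-1}, y'_i, x, i) \right)
```

Note that the probability of the observed sequence, Pr(x), appears nowhere in the expression; only the normalizer Z(x), a sum over candidate labellings, is needed.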

Shotgun mass spectrometry

A method for simultaneously identifying many of the protein species present in a complex mixture by fragmenting them and precisely measuring the mass-to-charge ratios of the fragments in a mass spectrometer.

Processivity

The tendency of a polymerase to continue to move along a template molecule rather than falling off prematurely.

Robustness

The ability of a method or program to function well in difficult circumstances, or in unexpected circumstances for which it was not designed.

Nearly full-length (NFL) protein alignment

Alignment of a protein sequence to a genome in which the alignment extends to the ends of the protein, or nearly so.

Profile hidden Markov model

A mathematical model that represents the conserved elements of an entire family of related proteins or a family of conserved functional domains.

Training data

In de novo gene prediction, a set of known gene structures with the corresponding genomic sequence (and alignments to informant genomes, if available). Training data are used to specialize the probability model to fit the characteristics of a particular genome.

Parse

A segmentation of a string of letters together with a labelling of the segments.
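
A parse can be represented very simply in code. The Python sketch below stores a parse as a list of (label, substring) pairs and checks that the segments tile the sequence exactly. The toy sequence, the labels and the helper name `is_valid_parse` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # (label, substring of the sequence)


def is_valid_parse(sequence: str, parse: List[Segment]) -> bool:
    """A parse must cover the string exactly: its segments,
    concatenated in order, reproduce the original sequence."""
    return "".join(piece for _, piece in parse) == sequence


# Hypothetical toy sequence: two exons separated by one intron.
sequence = "ATGGCCGTAAGTTTTCAGGAATAA"
parse = [
    ("exon", "ATGGCC"),
    ("intron", "GTAAGTTTTCAG"),
    ("exon", "GAATAA"),
]
parse_is_valid = is_valid_parse(sequence, parse)
```

Gene prediction can then be viewed as a search over all valid parses of the genomic sequence for the one that scores best under the chosen probability model.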

Bayes' rule

A mathematical identity (Pr(x|y) = Pr(y|x) Pr(x)/Pr(y)) that allows one to swap variables in a conditional probability expression.
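
A small worked example may help. The Python sketch below applies the identity with Pr(y) expanded by the law of total probability. All numbers are hypothetical: x stands for "this window is protein coding" and y for "this window shows a particular conservation pattern".

```python
def posterior(prior_x: float, lik_y_given_x: float, lik_y_given_not_x: float) -> float:
    """Pr(x|y) = Pr(y|x) Pr(x) / Pr(y), where
    Pr(y) = Pr(y|x) Pr(x) + Pr(y|not x) Pr(not x)."""
    evidence = lik_y_given_x * prior_x + lik_y_given_not_x * (1.0 - prior_x)
    return lik_y_given_x * prior_x / evidence


# Hypothetical values: 2% of windows are coding; the pattern appears in
# 90% of coding windows and in 10% of non-coding windows.
p_coding_given_pattern = posterior(prior_x=0.02, lik_y_given_x=0.9, lik_y_given_not_x=0.1)
```

With these made-up numbers the posterior is 0.018/0.116, or roughly 0.155: even a fairly specific signal leaves most matching windows non-coding when coding sequence is rare.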

Negative selection

Sequences are under negative selection when mutations are deleterious to fitness and hence tend to be weeded out over time.

Substitutions per synonymous site

An estimate of evolutionary distance that makes use of silent substitutions in protein-coding regions, similar to the rate of substitutions in fourfold degenerate sites.

Generative model

A probability model in which, to calculate the most likely values of hidden variables (annotations of DNA segments), one must also calculate the probability of the observed variable values (the DNA sequence).

Generalized hidden Markov model

A type of generative model that is used for assigning probabilities to possible annotations of a sequence. Generalized hidden Markov models are preferred over ordinary hidden Markov models for gene prediction because they make it possible to model the distribution of exon lengths.
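
The point about exon lengths can be made concrete. In an ordinary hidden Markov model, a state that is re-entered with a fixed self-transition probability implies a geometric length distribution, in which shorter segments are always more probable than longer ones. A generalized model can instead attach an arbitrary length distribution to a state. The Python sketch below contrasts the two; the self-transition probability and the lookup table are hypothetical values, not parameters of any real gene predictor.

```python
from typing import Dict


def geometric_length_prob(length: int, stay_prob: float) -> float:
    """Length distribution implied by an ordinary HMM state that
    loops back to itself with probability stay_prob."""
    return stay_prob ** (length - 1) * (1.0 - stay_prob)


def explicit_length_prob(length: int, table: Dict[int, float]) -> float:
    """A generalized HMM can use any explicit length distribution,
    represented here as a simple lookup table."""
    return table.get(length, 0.0)


# Hypothetical parameters chosen only to show the shapes of the two models.
stay = 0.99
hypothetical_exon_lengths = {90: 0.2, 120: 0.5, 150: 0.3}

geometric = [geometric_length_prob(d, stay) for d in (1, 120, 300)]
explicit = [explicit_length_prob(d, hypothetical_exon_lengths) for d in (1, 120, 300)]
```

Under the geometric model a one-base exon is the single most probable length, whereas the explicit table can place its peak at a biologically plausible length and assign zero probability to implausible ones.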

About this article

Cite this article

Brent, M. Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9, 62–73 (2008). https://doi.org/10.1038/nrg2220

  • DOI: https://doi.org/10.1038/nrg2220
