A beginner's guide to eukaryotic genome annotation

Key Points

  • Sequencing costs have fallen so dramatically that a single laboratory can now afford to sequence even large genomes.

  • Genome annotation pipelines synthesize alignment-based evidence with ab initio gene predictions to obtain a final set of gene annotations.

  • The exotic nature of many of the genomes that are currently being sequenced complicates annotation efforts.

  • Genome annotation has moved beyond merely identifying protein-coding genes to include the annotation of transposons, regulatory regions, pseudogenes and non-coding RNA genes.

  • Another new challenge is the need to incorporate RNA-seq data into the annotation process.

  • Annotation quality control and management are becoming major bottlenecks.

  • Periodic updates to the annotations of every genome are necessary as new data and techniques become available.

  • Incorrect and incomplete annotations poison every experiment that makes use of them. Providing accurate and up-to-date annotations is therefore essential.

Abstract

The falling cost of genome sequencing is having a marked impact on the research community with respect to which genomes are sequenced and how and where they are annotated. Genome annotation projects have generally become small-scale affairs that are often carried out by an individual laboratory. Although annotating a eukaryotic genome assembly is now within the reach of non-experts, it remains a challenging task. Here we provide an overview of the genome annotation process and the available tools and describe some best-practice approaches.

Figure 1: Genome and gene sizes for a representative set of genomes.
Figure 2: Three basic approaches to genome annotation and some common variations.

Acknowledgements

The authors would like to thank P. Flicek, B. Haas, N. Jiang, D. Lipman, A. Mackey, K. Pruitt, Y. Sun and J. Stajich for reading an earlier version of this manuscript and for their many helpful suggestions. This work was supported by the US National Institutes of Health grants R01GM099939 and R01-HG004694 and by the US National Science Foundation IOS-1126998 to M.Y.

Author information

Corresponding author

Correspondence to Mark Yandell.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

Mark Yandell's homepage

ABySS

Apollo

The Arabidopsis Information Resource (TAIR)

Argo

Artemis

Augustus

BeeBase

BioPerl

BLAST

The Brent Lab software (for TwinScan)

CEGMA

CHADO

Crossmatch

Cufflinks

EMBL

Ensembl

Ensembl Genome Annotation

EVidenceModeler

Evigan

Exonerate

FlyBase

GAZE

GBrowse

GenBank homepage

GenBank submission guide for eukaryotic genomes

GeneMark-ES

GFF3

GLEAN

Generic Model Organism Database (GMOD) overview

Gnomon

GSNAP

Gramene

GTF

Infernal

JBrowse

JIGSAW

MAKER

Nature Reviews Genetics article series on Study designs

NCBI taxonomy browser

PASA

Phytozome

PlantGDB

qRNA

RepeatMasker

Rfam

Saccharomyces Genome Database

Scripture

Sequence Ontology Project

sim4

Spidey

Splign

SNAP

Snoscan

SOAPdenovo

SoftBerry

SoftBerry products (for FGENESH)

Stemloc

TopHat

Trinity

tRNAscan-SE

UniProtKB/SwissProt

University of California Santa Cruz (UCSC) Genome Browser

VectorBase

WormBase

Glossary

Genome annotation

A term used to describe two distinct processes. 'Structural' genome annotation is the process of identifying genes and their intron–exon structures. 'Functional' genome annotation is the process of attaching meta-data such as gene ontology terms to structural annotations. This Review focuses on structural annotation.

RNA-sequencing data

(RNA-seq data). Data sets derived from the shotgun sequencing of a whole transcriptome using next-generation sequencing (NGS) techniques. RNA-seq data are the NGS equivalent of expressed sequence tags generated by the Sanger sequencing method.

N50

A basic statistic for describing the contiguity of a genome assembly. In general, the larger the N50, the more contiguous the assembly. See box 1 for details.
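
As a rough illustration of this statistic, the following Python sketch computes N50 under the commonly used definition: sort the contig (or scaffold) lengths from longest to shortest, accumulate them, and report the length at which the running total first reaches half of the total assembly length. Box 1 is not reproduced here, so this definition and the function name n50 are assumptions rather than the authors' exact formulation.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: the length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("need at least one contig length")
    if any(length <= 0 for length in lengths):
        raise ValueError("contig lengths must be positive")

    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

For the hypothetical lengths above, the total is 54, so the running sum 10 + 9 + 8 = 27 reaches the halfway point at the contig of length 8, which is the N50.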

Long interspersed nuclear elements

(LINEs). Retrotransposons that encode reverse transcriptase and that make up a substantial fraction of many eukaryotic genomes.

Short interspersed nuclear elements

(SINEs). Retrotransposons that do not encode reverse transcriptase and that parasitize LINE elements. Alu elements, which are very common in the human genome, are one example of a SINE.

Percent similarity

The percent similarity of a sequence alignment refers to the percentage of positive scoring aligned bases or amino acids in a nucleotide or protein alignment, respectively. The term positive scoring refers to the score assigned to the paired nucleotides or amino acids by the scoring matrix that is used to align the sequences.

Percent identity

The percent identity of a sequence alignment refers to the percentage of identical aligned bases or amino acids in a nucleotide or protein alignment, respectively.
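
The two alignment measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes a pairwise alignment given as two equal-length strings with '-' marking gaps, and it assumes the denominator is the total number of alignment columns (conventions differ between tools). The names percent_identity, percent_similarity and toy_score are hypothetical, and the toy scoring function stands in for a real scoring matrix.

```python
GAP = "-"

# Hypothetical toy scoring scheme, NOT a real substitution matrix:
# identical residues and a few chemically similar pairs score positive.
TOY_POSITIVE_PAIRS = {frozenset("IL"), frozenset("KR"), frozenset("DE")}


def toy_score(x, y):
    """Return a positive score for identical or 'similar' residues, else negative."""
    if x == y or frozenset((x, y)) in TOY_POSITIVE_PAIRS:
        return 1
    return -1


def _check(seq_a, seq_b):
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        raise ValueError("alignment is empty")


def percent_identity(seq_a, seq_b):
    """Percentage of alignment columns holding identical residues."""
    _check(seq_a, seq_b)
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y and x != GAP)
    return 100.0 * matches / len(seq_a)


def percent_similarity(seq_a, seq_b, score=toy_score):
    """Percentage of alignment columns whose residue pair scores positively."""
    _check(seq_a, seq_b)
    positives = sum(
        1
        for x, y in zip(seq_a, seq_b)
        if GAP not in (x, y) and score(x, y) > 0
    )
    return 100.0 * positives / len(seq_a)


if __name__ == "__main__":
    a, b = "MKIDE-", "MRLDQA"  # hypothetical aligned protein fragments
    print(round(percent_identity(a, b), 1), round(percent_similarity(a, b), 1))
```

In this made-up alignment, two of six columns are identical and two more are positive-scoring under the toy scheme, so percent similarity is at least as large as percent identity, as expected when identical pairs score positively.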

Unsupervised learning methods

Refers to methods that can be trained using unlabelled data. One example is a gene prediction algorithm that can be trained without a reference set of correct gene models; instead, the algorithm is trained using a collection of annotations, not all of which might be correct.

Data-mart

Provides users with online access to the contents of a data warehouse through user-configurable queries. A data-mart allows users to download data that meet their particular needs: for example, all transcripts from all annotated genes on human chromosome 3.

About this article

Cite this article

Yandell, M., Ence, D. A beginner's guide to eukaryotic genome annotation. Nat Rev Genet 13, 329–342 (2012). https://doi.org/10.1038/nrg3174

  • DOI: https://doi.org/10.1038/nrg3174
