Genome annotation: from sequence to biology

Stein, Lincoln

doi:10.1038/35080529

Review Article
Published: 01 July 2001

Nature Reviews Genetics volume 2, pages 493–503 (2001)

Key Points

Now that many genome sequences are available, attention is shifting towards developing and improving approaches for genome annotation.
Genome annotation can be classified into three levels: the nucleotide, protein and process levels.
Gene finding is a chief aspect of nucleotide-level annotation. For complex genomes, the most successful methods use a combination of ab initio gene prediction and sequence comparison with expressed sequence databases and other organisms. Nucleotide-level annotation also allows the integration of genome sequence with other genetic and physical maps of the genome.
The principal aim of protein-level annotation is to assign function to the products of the genome. Databases of protein sequences and functional domains and motifs are powerful resources for this type of annotation. Nevertheless, half of the predicted proteins in a new genome sequence tend to have no obvious function.
Understanding the function of genes and their products in the context of cellular and organismal physiology is the goal of process-level annotation. One of the obstacles to this level of annotation has been the inconsistency of terms used by different model systems. The Gene Ontology Consortium is helping to solve this problem.
There are several approaches to genome annotation: the factory (reliance on automation), museum (manual curation), cottage industry (exemplified by Proteome, Inc.) and party (the Celera Drosophila annotation jamboree).
As more scientists come to rely on genome annotation, it will become more important for the scientific community as a whole to contribute to this continuing process.

Abstract

The genome sequence of an organism is an information resource unlike any that biologists have previously had access to. But the value of the genome is only as good as its annotation. It is the annotation that bridges the gap from the sequence to the biology of the organism. The aim of high-quality annotation is to identify the key features of the genome — in particular, the genes and their products. The tools and resources for annotation are developing rapidly, and the scientific community is becoming increasingly reliant on this information for all aspects of biological research.

**Figure 2: A hidden Markov model explicitly models the probabilities for the transition from one part of a gene to another.**
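
The idea in the Figure 2 caption can be illustrated with a small sketch. The three states, the start, transition and emission probabilities, and the function name `viterbi` below are all hypothetical values chosen for illustration; they are not taken from the article or from any real gene finder such as GENSCAN. The sketch uses the standard Viterbi algorithm to recover the most probable sequence of hidden states (parts of a gene) for a short DNA string.

```python
import math

# Hypothetical three-state model. Every probability here is illustrative,
# not a trained value from any real gene-prediction program.
STATES = ("intergenic", "exon", "intron")

START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}

# TRANS[a][b] is the probability of moving from state a to state b.
# A zero forbids the move (e.g. intergenic DNA cannot lead directly into an intron).
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1, "intron": 0.0},
    "exon": {"intergenic": 0.1, "exon": 0.8, "intron": 0.1},
    "intron": {"intergenic": 0.0, "exon": 0.1, "intron": 0.9},
}

# EMIT[s][base] is the probability that state s emits the given nucleotide.
EMIT = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def _log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    if not seq:
        return []
    # Scores for the first base: start probability times emission probability.
    scores = [{s: _log(START[s]) + _log(EMIT[s][seq[0]]) for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + _log(TRANS[p][s]))
            col[s] = prev[best] + _log(TRANS[best][s]) + _log(EMIT[s][base])
            ptr[s] = best
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state to recover the whole path.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Calling `viterbi("ACGT")` returns one state label per nucleotide. Real gene finders use far richer state sets (splice sites, codon phase, exon length distributions) and parameters estimated from annotated training data.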

**Figure 4: An example of genome annotation.**

Acknowledgements

I acknowledge R. Durbin, S. Eddy, E. Birney and A. Neuwald for helpful discussions during the preparation of this review. A portion of this work was supported by the National Human Genome Research Institute at the US National Institutes of Health.

Author information

Authors and Affiliations

Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, 11724, New York, USA
Lincoln Stein

Glossary

WORKING DRAFT: A 'working draft' sequence has come to mean a genomic sequence before it is finished. Working draft sequences contain multiple gaps, unrepresented areas and misassemblies. In addition, the error rate of working draft sequence is higher than the 1 in 10,000 error rate that is standard for finished sequence.
FASTA FILE: A common file format used for the storage and transfer of sequence data. It contains raw DNA or protein sequence, but no annotation information.
RADIATION HYBRID MAPPING: An experimental technique that uses radiation-induced chromosomal breakpoints in somatic-cell hybrids to map the positions of sequence tagged sites (STSs).
CLONE ENDS: Genomic sequencing projects typically sequence the ends of bacterial artificial chromosome and plasmid clones, in addition to shotgun sequencing entire clones. During assembly, the clone end sequences are used to create a scaffold on which the genome sequence is pieced together.
SENSORS: Algorithms that are each specialized to identify one feature of a sequence, such as a possible splice site.
NEURAL NETWORK: Neural networks are analytical techniques modelled after the (proposed) processes of learning in cognitive systems and the neurological functions of the brain. Neural networks use a data 'training set' to build rules that can make predictions or classifications on data sets.
RULE-BASED SYSTEM: A type of computer algorithm that uses an explicit set of rules to make decisions.
HIDDEN MARKOV MODEL: A type of computer algorithm that represents a system as a set of discrete states and transitions between those states. Each transition has an associated probability. Markov models are 'hidden' when one or more of the states cannot be directly observed.
AB INITIO GENE PREDICTION: A class of software that attempts to predict genes from sequence data without the use of prior knowledge about similarities to other genes.
ALU SEQUENCE: A dispersed, intermediately repetitive DNA sequence found in the human genome in about 300,000 copies. The sequence is about 300 base pairs long. The name Alu comes from the restriction endonuclease (AluI) that cleaves it.
ANGIOSPERM: Flowering seed plant.
MONOCOTYLEDON: One of the two principal classes of flowering plant, monocots are characterized by a single cotyledon (primitive leaf) in the embryonic plant. Maize, rice, wheat and other grasses are common monocots.
DICOTYLEDON: One of the two principal classes of flowering plant, dicots are characterized by two cotyledons (primitive leaves) in the embryonic plant. Tomatoes, maple trees and mustard are common dicots.
CHAPERONINS: A class of ring-shaped, heat-shock proteins that have a key role in protein folding and protection from stress.
DIRECTED ACYCLIC GRAPH (DAG): A type of hierarchy similar to the outline of a paper in that it has headers, subheaders and sub-subheaders. The main difference from a strict hierarchy is that each topic in a DAG is allowed to have more than one parent topic.
RNA INTERFERENCE: A phenomenon in which the expression of a gene is temporarily inhibited when a double-stranded complementary RNA is introduced into the organism.
SELEX: This is an in vitro selection method in which very large collections of oligonucleotides can be screened for specific functions.
OPEN SOURCE: A type of software distribution in which the source code (the human-readable instructions) is made freely available. The Linux operating system is open source. The Microsoft Windows and Macintosh operating systems are not.
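
The directed acyclic graph entry in this glossary can be made concrete with a minimal sketch. The term names and the `PARENTS` table below are hypothetical and chosen only for illustration; they are not drawn from any actual Gene Ontology release. The sketch shows the one property that separates a DAG from a strict hierarchy: a term may have more than one parent, so collecting its ancestors means following every parent link upward.

```python
# Hypothetical terms for illustration only; not taken from a real ontology.
# Each term maps to the list of its direct parents.
PARENTS = {
    "cellular process": [],
    "metabolism": [],
    # Two parents: permitted in a DAG, impossible in a strict tree.
    "protein folding": ["cellular process", "metabolism"],
    "chaperonin-mediated folding": ["protein folding"],
}


def ancestors(term, parents=PARENTS):
    """Return the set of all terms reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen
```

With this table, `ancestors("chaperonin-mediated folding")` contains "protein folding" and both of its parents. An annotation attached to a specific term is therefore also findable under each of its more general ancestors.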

About this article

Cite this article

Stein, L. Genome annotation: from sequence to biology. Nat Rev Genet 2, 493–503 (2001). https://doi.org/10.1038/35080529

Issue Date: 01 July 2001
DOI: https://doi.org/10.1038/35080529
