Genomics and natural language processing

Key Points

  • Today, the computational exploration and management of large text repositories are usually accomplished with search engines and databases that are based on a suite of text processing, indexing and search tools that are referred to collectively as 'natural language processing' (NLP) technologies.

  • There are three fundamental aspects to NLP: information retrieval, semantics and information extraction.

  • Exploring and managing the biomedical literature with these technologies, however, presents some interesting challenges, primarily because of the relationships between biomedical texts and biological sequences.

  • The associations between biological sequences and texts are unique to the biomedical literature. However, understanding the complex associations that exist between genes, sequences and texts is a daunting task.

  • The flood of sequence information produced by the rapid advances in genomics is creating new ways to explore texts and is blurring the traditional lines that separate bioinformatics and NLP.

  • Biological NLP (bio-NLP) is an emerging field of research that seeks to create tools and methodologies for sequence and textual analysis that combine bioinformatics and NLP technologies in a synergistic fashion.

  • Some bio-NLP researchers are focusing on texts as a means to discover information about protein interactions, and are wrestling with how best to adapt traditional NLP technologies to this task. Others, taking a more sequence-centred approach, are exploring the use of texts as a means to improve sequence-retrieval algorithms and as an aid to sequence annotation.

  • If bio-NLP is to achieve its full potential, it will have to move beyond information management and generate specific predictions pertaining to gene function that can be verified at the bench. The synergistic use of sequence and text to extract latent information from the biomedical literature holds much promise in this regard. Realizing this potential, however, will require more and better ontologies, software that is able to make inferences using sequence and textual information, and access to the full text of articles.


The Human Genome and MEDLINE are both the foci of intense data-mining efforts worldwide. The biomedical literature has much to say about sequence, but it also seems that sequence can tell us much about the biomedical literature. Biological natural language processing is an emerging field of research that seeks to explore systematically the relationships between genes, sequences and the biomedical literature as a basis for a new generation of data-mining tools.

Figure 1: Correlation between sequence similarity and document similarity.
Figure 2: Semantic classification and definition of terms using a lexicon, thesaurus and a hierarchical ontology.
Figure 3: HMMs are used for part-of-speech tagging, as well as for gene prediction.
Figure 4: Information extraction.




The authors thank P. Li and S. Lewis for many stimulating discussions on the role of ontologies in biology and natural language processing, G. Marth for many useful comments on the manuscript and R. Mural for professional encouragement.

Author information

Corresponding author

Correspondence to Mark D. Yandell.


A collection of documents that are used for searching or data mining.


This frequently used term also has a formal definition. The accuracy of an algorithm is often defined as (2 × precision × recall)/(precision + recall), that is, the harmonic mean of precision and recall (a quantity also known as the F-measure).
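As a worked illustration of this formula, the following minimal Python sketch computes the score from raw counts. The function name `f_measure` and the example counts are hypothetical and chosen only for illustration; precision is taken as the fraction of reported items that are correct, and recall as the fraction of correct items that are reported.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return 2 * precision * recall / (precision + recall)."""
    predicted = true_positives + false_positives  # everything the algorithm reported
    actual = true_positives + false_negatives     # everything it should have reported
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct extractions, 2 spurious ones, 8 missed.
# This gives precision 0.8 and recall 0.5.
score = f_measure(8, 2, 8)
print(round(score, 3))
```

Because the measure is a harmonic mean, it stays low unless precision and recall are both high.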


A variation of BLAST that uses profiles that are based on sequence multiple-alignments to improve the sensitivity of protein database searches.


A well-annotated database of protein sequences.


These are approaches for identifying terms in a text that belong to a particular semantic class. Gene names in Caenorhabditis elegans, for example, are denoted with three letters followed by a dash and a number, for example, 'dbl-1'. So, one such approach to identifying C. elegans genes might consist of searching a text with a regular expression that matches three letters, a dash and a number. Such approaches do not work equally well for identifying all genes and generally are not very precise.


The ratio between the observed frequency at which an event occurred and the expected frequency of that event given some statistical model. A term that occurs more frequently in a text, or collection of texts, than would be expected based on its frequency in a corpus will therefore have an odds ratio >1.
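A minimal Python sketch of this calculation follows. It assumes the simplest possible statistical model, in which the expected frequency of a term is its relative frequency in the corpus; the function name `odds_ratio` and the toy token lists are hypothetical.

```python
from collections import Counter


def odds_ratio(term, text_tokens, corpus_tokens):
    """Observed relative frequency in the text divided by the expected
    relative frequency, here estimated from the corpus (an assumption)."""
    observed = Counter(text_tokens)[term] / len(text_tokens)
    expected = Counter(corpus_tokens)[term] / len(corpus_tokens)
    if expected == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return observed / expected


# Toy data: the text is one document, and the corpus is that document plus
# six further tokens of unrelated material.
text_tokens = "the kinase binds the kinase inhibitor".split()
corpus_tokens = text_tokens + "the protein binds the membrane and".split()

kinase_ratio = odds_ratio("kinase", text_tokens, corpus_tokens)  # enriched in the text
the_ratio = odds_ratio("the", text_tokens, corpus_tokens)        # as common as expected
print(kinase_ratio, the_ratio)
```

In this toy case 'kinase' scores above 1 because it is concentrated in the text, whereas 'the' scores about 1 because it is equally common in the text and the corpus.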


Gene Ontology (GO). A hierarchical organization of concepts (ontology) with three organizing principles: molecular function (the tasks done by individual gene products, an example of which is 'transcription factor'); biological process (broad biological goals, such as mitosis, that are accomplished by ordered assemblies of molecular functions); and cellular component (subcellular structures, locations and macromolecular complexes; examples include the nucleus and the telomere).


A hierarchical organization of concepts, typically used to denote 'more-general-than' and/or 'part-of' relationships.
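One way to picture such a hierarchy is as a mapping from each concept to its parents, with every link labelled by its relationship type. The Python sketch below is a toy illustration only: the terms, the `is_a` and `part_of` labels and the `ancestors` helper are hypothetical and are not taken from any real ontology release.

```python
# Toy hierarchy: each term maps to a list of (relationship, parent) pairs.
# "is_a" stands for 'more-general-than' links and "part_of" for 'part-of' links.
ONTOLOGY = {
    "telomere": [("part_of", "chromosome")],
    "chromosome": [("is_a", "cellular component")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
}


def ancestors(term, ontology):
    """Collect every concept reachable by following parent links upwards."""
    found = []
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in ontology.get(current, []):
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


telomere_ancestors = ancestors("telomere", ONTOLOGY)
nucleus_ancestors = ancestors("nucleus", ONTOLOGY)
print(telomere_ancestors)
print(nucleus_ancestors)
```

Walking up the hierarchy in this way lets a search for a general concept also retrieve items that are annotated with its more specific descendants.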


Homologous genes that originated through speciation (for example, human β-globin and mouse β-globin).


An algorithm that identifies the nouns, verbs and other functional word classes among the words that make up a sentence.


Homologous genes that originated by gene duplication (for example, human β-globin and human α-globin).


Computer science parlance for an abstract definition that embodies some common and essential syntactic characteristic that belongs to a set of terms. For example, in the popular Perl programming language, the regular expression '\s*\w+-\d+\s*' will match any word in a text that consists of one or more letters, numbers or underscores, followed by a dash and then one or more numbers. This regular expression will identify Caenorhabditis elegans gene names, although it will also match other terms of the same form.
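The behaviour of such a pattern can be sketched in Python, used here in place of Perl; its `re` module treats `\w` and `\d` in the same way for this purpose. The sample sentence and the names `LOOSE` and `STRICT` are hypothetical. As a simplifying assumption, the surrounding `\s*` is replaced by `\b` word boundaries so that only the term itself is returned, and the stricter variant encodes the convention of three lowercase letters, a dash and a number.

```python
import re

# Loose pattern: one or more word characters, a dash, then one or more digits.
LOOSE = re.compile(r"\b\w+-\d+\b")

# Stricter pattern: exactly three lowercase letters, a dash, then digits.
STRICT = re.compile(r"\b[a-z]{3}-\d+\b")

text = "Mutations in dbl-1 and unc-54 were mapped; IL-2 and p53 were not examined."

loose_hits = LOOSE.findall(text)    # also picks up the non-worm name 'IL-2'
strict_hits = STRICT.findall(text)  # keeps only the three-letter names

print(loose_hits)
print(strict_hits)
```

The loose pattern returns 'IL-2' alongside the two C. elegans names, which shows why purely pattern-based matching tends to be imprecise. The stricter pattern removes that false positive here, but it would still match any unrelated term that happens to have the same shape.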

About this article

Cite this article

Yandell, M., Majoros, W. Genomics and natural language processing. Nat Rev Genet 3, 601–610 (2002).
