Genomics and natural language processing

Yandell, Mark D.; Majoros, William H.

doi:10.1038/nrg861

Review Article
Published: 01 August 2002

Mark D. Yandell^1,2 & William H. Majoros^3

Nature Reviews Genetics volume 3, pages 601–610 (2002)

Key Points

Today, the computational exploration and management of large text repositories are usually accomplished with search engines and databases that are based on a suite of text processing, indexing and search tools that are referred to collectively as 'natural language processing' (NLP) technologies.
There are three fundamental aspects to NLP: information retrieval, semantics and information extraction.
Exploring and managing the biomedical literature with these technologies, however, presents some interesting challenges, primarily because of the relationships between biomedical texts and biological sequences.
The associations between biological sequences and texts are a unique aspect of the biomedical literature. However, understanding the complex associations that exist between genes, sequences and texts is a daunting task.
The flood of sequence information produced by the rapid advances in genomics is creating new ways to explore texts and is blurring the traditional lines that separate bioinformatics and NLP.
Biological NLP (bio-NLP) is an emerging field of research that seeks to create tools and methodologies for sequence and textual analysis that combine bioinformatics and NLP technologies in a synergistic fashion.
Some bio-NLP researchers are focusing on texts as a means to discover information about protein interactions, and are wrestling with how best to adapt traditional NLP technologies to this task. Others, taking a more sequence-centred approach, are exploring the use of texts as a means to improve sequence-retrieval algorithms and as an aid to sequence annotation.
If bio-NLP is to achieve its full potential, it will have to move beyond information management and generate specific predictions pertaining to gene function that can be verified at the bench. The synergistic use of sequence and text to extract latent information from the biomedical literature holds much promise in this regard. Realizing this potential, however, will require more and better ontologies, software that is able to make inferences using sequence and textual information, and access to the full text of articles.

Abstract

The Human Genome and MEDLINE are both the foci of intense data-mining efforts worldwide. The biomedical literature has much to say about sequence, but it also seems that sequence can tell us much about the biomedical literature. Biological natural language processing is an emerging field of research that seeks to explore systematically the relationships between genes, sequences and the biomedical literature as a basis for a new generation of data-mining tools.

**Figure 1: Correlation between sequence similarity and document similarity.**

**Figure 2: Semantic classification and definition of terms using a lexicon, thesaurus and a hierarchical ontology.**

**Figure 3: HMMs are used for part-of-speech tagging, as well as for gene prediction.**

Acknowledgements

The authors thank P. Li and S. Lewis for many stimulating discussions on the role of ontologies in biology and natural language processing, G. Marth for many useful comments on the manuscript and R. Mural for professional encouragement.

Author information

Authors and Affiliations

Department of Molecular and Cell Biology, Howard Hughes Medical Institute, Room 545, LSA Building No. 3200
Mark D. Yandell
University of California, Berkeley, 94720-3200, California, USA
Mark D. Yandell
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, 20850, Maryland, USA
William H. Majoros

Corresponding author

Correspondence to Mark D. Yandell.

Glossary

CORPUS: A collection of documents that are used for searching or data mining.
ACCURACY: This frequently used term also has a formal definition. The accuracy of an algorithm is often defined as (2 × precision × recall)/(precision + recall), which is the harmonic mean of precision and recall and is also known as the F-measure.
PSI-BLAST: A variation of BLAST that uses profiles that are based on sequence multiple-alignments to improve the sensitivity of protein database searches.
SWISS-PROT: A well-annotated database of protein sequences.
AD HOC RULE-BASED APPROACHES: Approaches for identifying terms in a text that belong to a particular semantic class. Gene names in Caenorhabditis elegans, for example, are denoted by three letters followed by a dash and a number, as in 'dbl-1'. An ad hoc approach to identifying C. elegans genes might therefore consist of searching a text with a regular expression that matches three letters, a dash and a number. Such approaches do not work equally well for identifying all genes and generally are not very precise.
ODDS RATIO: The ratio between the observed frequency at which an event occurred and the expected frequency of that event given some statistical model. A term that occurs more frequently in a text, or collection of texts, than would be expected based on its frequency in a corpus will therefore have an odds ratio >1.
GENE ONTOLOGY (GO): A hierarchical organization of concepts (ontology) with three organizing principles: molecular function, the tasks done by individual gene products (for example, 'transcription factor'); biological process, broad biological goals, such as mitosis, that are accomplished by ordered assemblies of molecular functions; and cellular component, subcellular structures, locations and macromolecular complexes (for example, the nucleus and the telomere).
ONTOLOGY: A hierarchical organization of concepts, typically used to denote 'more-general-than' and/or 'part-of' relationships.
ORTHOLOGUES: Homologous genes that originated through speciation (for example, human β-globin and mouse β-globin).
PART-OF-SPEECH TAGGER: An algorithm that identifies the nouns, verbs and other functional word classes among the words that make up a sentence.
PARALOGUES: Homologous genes that originated by gene duplication (for example, human β-globin and human α-globin).
REGULAR EXPRESSION: Computer science parlance for an abstract definition that embodies some common and essential syntactic characteristic that belongs to a set of terms. For example, in the popular Perl programming language, the regular expression '\s*\w+\-\d+\s*' will identify any word in a text that consists of one or more letters (or numbers), followed by a dash, and followed by one or more numbers. This regular expression will identify Caenorhabditis elegans gene names.
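
The measure given under ACCURACY in the glossary can be sketched in a few lines of Python. This is only an illustration: the function names and the counts of true positives, false positives and false negatives are hypothetical and are not taken from the article.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported items that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of correct items that were reported."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def harmonic_accuracy(p: float, r: float) -> float:
    """2 * precision * recall / (precision + recall), as in the glossary."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 8 correct hits, 2 false hits, 8 missed items.
p = precision(tp=8, fp=2)   # 8 / 10
r = recall(tp=8, fn=8)      # 8 / 16
score = harmonic_accuracy(p, r)
print(f"precision={p:.2f} recall={r:.2f} score={score:.3f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a system cannot reach a high score by maximizing precision alone or recall alone.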
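
The ODDS RATIO entry can be illustrated in the same spirit. The sketch below assumes the simplest possible statistical model, in which the expected frequency of a term is its relative frequency in a background corpus. The function names and the toy token lists are hypothetical.

```python
from collections import Counter


def term_frequencies(tokens):
    """Relative frequency of each term in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def odds_ratio(term, text_tokens, corpus_tokens):
    """Observed frequency in the text divided by expected frequency.

    Assumption: the expected frequency is the term's relative
    frequency in the background corpus (a simple unigram model).
    """
    observed = term_frequencies(text_tokens).get(term, 0.0)
    expected = term_frequencies(corpus_tokens).get(term, 0.0)
    if expected == 0.0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return observed / expected


# Toy data, invented for illustration only.
corpus = "the gene encodes a kinase the protein binds dna the cell divides".split()
text = "the kinase phosphorylates the kinase substrate".split()

kinase_ratio = odds_ratio("kinase", text, corpus)
print(f"odds ratio for 'kinase': {kinase_ratio:.2f}")
```

In this toy example 'kinase' makes up a third of the text but only a twelfth of the corpus, so its odds ratio is well above 1, which marks it as characteristic of the text.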
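
The AD HOC RULE-BASED APPROACHES and REGULAR EXPRESSION entries can be tried out with Python's standard `re` module, whose `\w`, `\d` and `\b` classes behave like their Perl counterparts for this purpose. The sketch compares a broad pattern in the style of the glossary with a stricter pattern that follows the glossary's description of three letters, a dash and a number. The sample sentence and both pattern names are invented for illustration.

```python
import re

# Broad pattern in the style of the glossary entry:
# word characters, a dash, then digits.
BROAD = re.compile(r"\w+-\d+")

# Stricter pattern following the glossary's description of
# C. elegans names: exactly three lowercase letters, a dash, a number.
STRICT = re.compile(r"\b[a-z]{3}-\d+\b")

# Invented sample sentence.
sentence = (
    "Mutations in dbl-1 and unc-54 were compared with "
    "human IL-2 over the 1998-2002 period."
)

broad_hits = BROAD.findall(sentence)
strict_hits = STRICT.findall(sentence)

print("broad: ", broad_hits)
print("strict:", strict_hits)
```

The broad pattern also picks up 'IL-2' and '1998-2002', which shows why the glossary describes such ad hoc rules as not very precise. The stricter pattern avoids those false hits but would miss any gene name that departs from the three-letter convention.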

About this article

Cite this article

Yandell, M., Majoros, W. Genomics and natural language processing. Nat Rev Genet 3, 601–610 (2002). https://doi.org/10.1038/nrg861

Issue Date: 01 August 2002
DOI: https://doi.org/10.1038/nrg861
