Nature 448, 130 (12 July 2007) | doi:10.1038/448130b; Published online 11 July 2007

Text mining: powering the database revolution

Udo Hahn1, Joachim Wermter1, Rainer Blasczyk2 & Peter A. Horn2

  1. Friedrich Schiller University Jena, Computational Linguistics Group – JULIE Lab, 07743 Jena, Germany
  2. Institute for Transfusion Medicine, Hannover Medical School, 30625 Hannover, Germany


Mark Gerstein and colleagues in Correspondence (Nature 447, 142; doi:10.1038/447142a 2007) propose that journals should require authors to manually provide structured abstracts to facilitate text mining of biological information. There are three main difficulties in implementing such a proposal.

First, life-science terminologies are huge, diversified and complex. This means that identifying the correct content descriptors is almost impossible for inexperienced users of online term repositories. For example, Medical Subject Headings, the International Classification of Diseases and Gene Ontology are high-volume (tens of thousands of terms) and structurally complicated terminological systems, each with different design rationales, naming conventions and principles of structural organization. Even human indexers, search specialists and database curators with routine exposure to these resources have to invest much effort in understanding and keeping track of their content as well as terminological updates and revisions. Will scientists find the time to dive so deeply into this alien terminological territory, and be capable of finding exactly what they are looking for?

Second, the coverage of existing terminologies for the many subdomains in the life sciences is incomplete. The two main terminological umbrella systems for the life sciences, the Unified Medical Language System and the Open Biomedical Ontologies, contain impressive numbers of individual terminologies, but their coverage of the life sciences is still fragmentary and suffers from varying depths of description. The terminology gap is likely to be even wider if authors are required to encode relational descriptions, for example indicating a binding relation between two specific proteins, P1 and P2, by Bind(P1, P2), because such a vocabulary has not yet been determined.

Third, the quality and reliability of author-supplied content descriptions are a considerable hurdle. Even if the first and second problems were solved, human indexers, even professional ones, are liable to error as well as to intrinsic subjective bias (M. E. Funk and C. A. Reid Bull. Med. Libr. Assoc. 71, 176–183; 1983). This is not to say that authors of a structured abstract would consciously cheat, but rather that there is a grey area of overstatement and overestimation of one's own results in a highly competitive scientific environment. If authors' structured entries were subject to peer review together with the submitted article, this would mean more work for the reviewers as well as the authors, neither of whom is likely to have been trained as a terminologist.

As an alternative, we suggest automated procedures for knowledge capture in which neither the authors nor the reviewers are in the loop. There has been significant progress in automatic text mining and information extraction as well as in the methodological foundations of life-science terminologies in terms of ontologies, knowledge representation languages and semantic encoding standards. These efforts in automating the generation of content descriptions and linking them directly to biological databases are strongly experimentally founded and would help to avoid additional workload and subjectivity (see, for example, the BioCreAtIvE competition results). Applying automated mechanisms for content analysis would also increase the coverage and the recency of the literature entered into biological databases, as human input is complemented by computationally generated content.
