Text-mining solutions for biomedical research: enabling integrative biology

Rebholz-Schuhmann, Dietrich; Oellrich, Anika; Hoehndorf, Robert

doi:10.1038/nrg3337

Review Article
Published: 14 November 2012

Dietrich Rebholz-Schuhmann^1,2, Anika Oellrich^1 & Robert Hoehndorf^3,4

Nature Reviews Genetics volume 13, pages 829–839 (2012)

Key Points

Text mining is a means of processing the scientific literature at a large scale and of making documents and their content more accessible.
Literature repositories, such as PubMed Central and UK PubMed Central, are data collections just like the scientific biomedical databases. They require special techniques to parse the text and to deliver the facts for further analysis.
Data integration, such as the normalization of named entities in the text to database entries, is an essential step towards integrative biology using semantic Web technology.
Knowledge discovery is the ultimate goal of any researcher when exploiting integrated biomedical resources. The scientific literature contributes novel hypotheses and facts.
The use of formal knowledge representations, such as ontologies and fact data repositories, is paramount for efficient hypothesis generation and validation.
Solutions are emerging that provide intelligent but automated systems to assist biomedical researchers, particularly those dealing with high-throughput data.
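
The normalization step named in the key points above can be illustrated with a minimal dictionary-based sketch. It is not the method of any tool discussed in the article: the lexicon, the `EX:` identifiers and the function name `normalize_entities` are all hypothetical, and real systems additionally need disambiguation and contextual analysis.

```python
import re

# Hypothetical lexicon: surface forms mapped to made-up identifiers.
# A real system would load synonyms from a database or an ontology.
LEXICON = {
    "insulin": "EX:0001",
    "ins": "EX:0001",
    "leptin": "EX:0002",
    "type 2 diabetes": "EX:0003",
}


def normalize_entities(text, lexicon=LEXICON):
    """Return (surface form, start, end, identifier) for each lexicon match."""
    # Try longer terms first so multi-word terms win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(0), m.start(), m.end(), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Leptin and insulin levels are altered in type 2 diabetes."
mentions = normalize_entities(sentence)
for surface, start, end, identifier in mentions:
    print(f"{surface!r} [{start}:{end}] -> {identifier}")
```

Mapping every mention to an identifier, rather than keeping the raw string, is what allows text-derived facts to be joined with database entries later on.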

Abstract

In response to the unbridled growth of information in literature and biomedical databases, researchers require efficient means of handling and extracting information. As well as providing background information for research, scientific publications can be processed to transform textual information into database content or complex networks and can be integrated with existing knowledge resources to suggest novel hypotheses. Information extraction and text data analysis can be particularly relevant and helpful in genetics and biomedical research, in which up-to-date information about complex processes involving genes, proteins and phenotypes is crucial. Here we explore the latest advancements in automated literature analysis and its contribution to innovative research approaches.

**Figure 1: Categories of text-mining solutions.**

Author information

Authors and Affiliations

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Dietrich Rebholz-Schuhmann & Anika Oellrich
Institut für Computerlinguistik, Universität Zürich, Binzmühlestrasse 14, Zürich, 8050, Switzerland
Dietrich Rebholz-Schuhmann
Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
Robert Hoehndorf
Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, CB2 3EG, UK
Robert Hoehndorf

Corresponding author

Correspondence to Dietrich Rebholz-Schuhmann.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Hypotheses: Testable statements that, if true, may explain an observed phenomenon.
Knowledge bases: Databases of statements covering a knowledge domain. Often, statements are represented in a form that permits the automated or manual inference of statements that are not explicitly stated using inference rules.
Facts: Objective and (experimentally) verifiable ways in which the world is structured.
Information retrieval: The process of selecting information or documents from a collection in response to a submitted query.
Information extraction: The process of automatically assessing documents, data or knowledge bases to extract statements that are likely to be true given the available information. Information extraction can be based on defined patterns, machine-learning techniques, statistical analyses or automated reasoning.
Knowledge discovery: The process of analysing a set of statements to identify new statements that are true. To discover new knowledge, evidence must have already been gathered in support of the identified statements.
Statements: Declarative sentences that can be said to be either true or false. True statements express facts.
Evidence: The information that has been gathered to demonstrate that the statement is true (that is, it corresponds to a fact); in science, evidence usually contains experimental results.
Provenance: A reference to the literature from which a statement or its supporting evidence was derived.
Terms: Single words or compositions of words with well-defined meanings.
Types: The conceptualization of categories of entities or conceptual instances, represented by a unique identifier, a label and a definition.
Entity recognition: The extraction of text constituents representing a specific type, preferably entities with a name such as a protein.
Entity normalization: The mapping of a named entity or type in the text to a unique identifier, possibly requiring disambiguation and contextual analysis.
Ontology: A representation of a conceptualization of a domain of knowledge, characterizing the classes and relations that exist in the domain. Commonly, ontologies are represented as a graph structure that encodes a taxonomy.
Features: Any constituents of the text — such as tokens, words, complex terms or representations of a concept — that serve as an input to a text-mining solution.
Assertions: Statements that are represented in a formal language to denote the properties or relations of an entity (or concept).
Semantic Web: The extension of the World Wide Web to provide, simultaneously, human- and computer-readable semantics through references to well-defined resources.
Syntactic parsing: Processing of the sentence structure using statistics or grammar rules to produce an electronic representation that delivers logical components (for example, a 'noun phrase'), their roles (for example, the 'subject') and dependencies.
Semantic resources: Biomedical ontologies and databases serve as semantic resources, as they define and describe concepts and entities.
Automated reasoning: The use of software to derive statements automatically from a knowledge base using inference rules.
Micro-theories: Sets of assertions that share the same topic or that result from the same source. The assertions must be conflict-free within a micro-theory but can contradict other micro-theories.
Hypothesis generation: The selection or creation of hypotheses that can explain a given phenomenon. Commonly, selection criteria regarding relevance, parsimony or consistency with existing knowledge are applied to select the most viable hypotheses for a given phenomenon.
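
As a rough illustration of how the glossary terms "statements", "knowledge bases" and "automated reasoning" fit together, the sketch below applies one inference rule to a toy knowledge base until no new statements appear. The triples, the class labels and the function name `infer_closure` are assumptions made for this example; the rule used (transitivity of `is_a`) is only one simple kind of inference and is not taken from the article.

```python
def infer_closure(statements):
    """Apply the rule (x is_a y) and (y is_a z) => (x is_a z) to a fixed point.

    Returns only the statements that were not explicitly stated.
    """
    known = set(statements)
    changed = True
    while changed:
        changed = False
        # Iterate over snapshots so that `known` can grow inside the loop.
        for (x, rel1, y) in list(known):
            for (y2, rel2, z) in list(known):
                if rel1 == rel2 == "is_a" and y == y2:
                    candidate = (x, "is_a", z)
                    if candidate not in known:
                        known.add(candidate)
                        changed = True
    return known - set(statements)


# Toy knowledge base of explicit statements (illustrative labels only).
explicit = {
    ("insulin secretion", "is_a", "hormone secretion"),
    ("hormone secretion", "is_a", "secretion"),
    ("secretion", "is_a", "biological process"),
}

inferred = infer_closure(explicit)
for statement in sorted(inferred):
    print(statement)
```

The three derived statements are not explicitly stated in the input, which is the sense in which a knowledge base "permits the automated or manual inference of statements that are not explicitly stated". Production systems would normally rely on an ontology reasoner rather than a hand-written loop.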

About this article

Cite this article

Rebholz-Schuhmann, D., Oellrich, A. & Hoehndorf, R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 13, 829–839 (2012). https://doi.org/10.1038/nrg3337

Published: 14 November 2012
Issue Date: December 2012
DOI: https://doi.org/10.1038/nrg3337
