Key Points
- Text mining is a means of processing the scientific literature at a large scale and of making documents and their content more accessible.
- Literature repositories, such as PubMed Central and UK PubMed Central, are data collections just like the scientific biomedical databases. They require special techniques to parse the text and to deliver the facts for further analysis.
- Data integration, such as the normalization of named entities in the text to database entries, is an essential step towards integrative biology using semantic Web technology.
- Knowledge discovery is the ultimate goal of any researcher when exploiting integrated biomedical resources. The scientific literature contributes novel hypotheses and facts.
- The use of formal knowledge representations, such as ontologies and fact data repositories, is paramount for efficient hypothesis generation and validation.
- Solutions are emerging that provide intelligent but automated systems to assist biomedical researchers, particularly those dealing with high-throughput data.
Abstract
In response to the unbridled growth of information in literature and biomedical databases, researchers require efficient means of handling and extracting information. As well as providing background information for research, scientific publications can be processed to transform textual information into database content or complex networks and can be integrated with existing knowledge resources to suggest novel hypotheses. Information extraction and text data analysis can be particularly relevant and helpful in genetics and biomedical research, in which up-to-date information about complex processes involving genes, proteins and phenotypes is crucial. Here we explore the latest advancements in automated literature analysis and its contribution to innovative research approaches.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Related links
FURTHER INFORMATION
European Bioinformatics Institute
National Agricultural Laboratory Catalog
National Center for Biotechnology Information
Nature Reviews Genetics Series on Computational tools
OBO Flatfile Format Syntax and Semantics
Glossary
- Hypotheses
  Testable statements that, if true, may explain an observed phenomenon.
- Knowledge bases
  Databases of statements covering a knowledge domain. Often, statements are represented in a form that permits inference rules to be applied, automatically or manually, to derive statements that are not explicitly stated.
- Facts
  Objective and (experimentally) verifiable ways in which the world is structured.
- Information retrieval
  The process of selecting information or documents from a collection in response to a submitted query.
- Information extraction
  The process of automatically assessing documents, data or knowledge bases to extract statements that are likely to be true given the available information. Information extraction can be based on defined patterns, machine-learning techniques, statistical analyses or automated reasoning.
- Knowledge discovery
  The process of analysing a set of statements to identify new statements that are true. To discover new knowledge, evidence must have already been gathered in support of the identified statements.
- Statements
  Declarative sentences that can be said to be either true or false. True statements express facts.
- Evidence
  The information that has been gathered to demonstrate that a statement is true (that is, it corresponds to a fact); in science, evidence usually contains experimental results.
- Provenance
  A reference to the literature from which a statement or its supporting evidence was derived.
- Terms
  Single words or compositions of words with well-defined meanings.
- Types
  The conceptualization of categories of entities or conceptual instances, represented by a unique identifier, a label and a definition.
- Entity recognition
  The extraction of text constituents representing a specific type, preferably entities with a name such as a protein.
- Entity normalization
  The mapping of a named entity or type in the text to a unique identifier, possibly requiring disambiguation and contextual analysis.
- Ontology
  A representation of a conceptualization of a domain of knowledge, characterizing the classes and relations that exist in the domain. Commonly, ontologies are represented as a graph structure that encodes a taxonomy.
- Features
  Any constituents of the text (such as tokens, words, complex terms or representations of a concept) that serve as an input to a text-mining solution.
- Assertions
  Statements that are represented in a formal language to denote the properties or relations of an entity (or concept).
- Semantic Web
  The extension of the World Wide Web to provide, simultaneously, human- and computer-readable semantics through references to well-defined resources.
- Syntactic parsing
  The processing of sentence structure using statistics or grammar rules to produce an electronic representation that delivers logical components (for example, a 'noun phrase'), their roles (for example, the 'subject') and dependencies.
- Semantic resources
  Resources, such as biomedical ontologies and databases, that define and describe concepts and entities.
- Automated reasoning
  The use of software to derive statements automatically from a knowledge base using inference rules.
- Micro-theories
  Sets of assertions that share the same topic or that result from the same source. The assertions must be conflict-free within a micro-theory but can contradict other micro-theories.
- Hypothesis generation
  The selection or creation of hypotheses that can explain a given phenomenon. Commonly, selection criteria regarding relevance, parsimony or consistency with existing knowledge are applied to select the most viable hypotheses for a given phenomenon.
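As an illustration of the glossary entries on entity recognition and entity normalization, the following minimal Python sketch finds mentions in a sentence with a dictionary lookup and maps each one to an identifier. It is not the method of any tool discussed in the article: the lexicon, the `EX:` identifiers and the names `Mention`, `build_pattern` and `recognize_and_normalize` are all hypothetical and chosen only for this example. The sketch also skips the disambiguation and contextual analysis that the glossary notes may be required in practice.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon: surface forms mapped to made-up identifiers.
# A real system would load its terms from a curated database or ontology.
LEXICON = {
    "insulin receptor": "EX:0001",
    "insulin": "EX:0002",
    "type 2 diabetes": "EX:0003",
    "t2d": "EX:0003",  # a synonym that normalizes to the same identifier
}


@dataclass(frozen=True)
class Mention:
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    text: str        # surface form as it appears in the sentence
    identifier: str  # identifier assigned by normalization


def build_pattern(lexicon):
    # Put longer terms first so "insulin receptor" is preferred over "insulin".
    terms = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def recognize_and_normalize(text, lexicon=LEXICON):
    # Recognition: locate lexicon terms in the text (case-insensitive).
    # Normalization: look up the identifier of each matched surface form.
    pattern = build_pattern(lexicon)
    mentions = []
    for match in pattern.finditer(text):
        surface = match.group(0)
        identifier = lexicon[surface.lower()]
        mentions.append(Mention(match.start(), match.end(), surface, identifier))
    return mentions


sentence = "Insulin receptor variants are linked to T2D, and insulin dosing matters."
mentions = recognize_and_normalize(sentence)
for mention in mentions:
    print(mention.text, "->", mention.identifier)
```

Sorting the lexicon by term length is a simple way to prefer the most specific match; a production system would more likely use a trained recognizer together with context-aware disambiguation.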
About this article
Cite this article
Rebholz-Schuhmann, D., Oellrich, A. & Hoehndorf, R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 13, 829–839 (2012). https://doi.org/10.1038/nrg3337
DOI: https://doi.org/10.1038/nrg3337