  • Review Article

Text-mining solutions for biomedical research: enabling integrative biology

Key Points

  • Text mining is a means of processing the scientific literature at a large scale and of making documents and their content more accessible.

  • Literature repositories, such as PubMed Central and UK PubMed Central, are data collections just like the scientific biomedical databases. They require special techniques to parse the text and to deliver the facts for further analysis.

  • Data integration, such as the normalization of named entities in the text to database entries, is an essential step towards integrative biology using semantic Web technology.

  • Knowledge discovery is the ultimate goal of any researcher when exploiting integrated biomedical resources. The scientific literature contributes novel hypotheses and facts.

  • The use of formal knowledge representations, such as ontologies and fact data repositories, is paramount for efficient hypothesis generation and validation.

  • Solutions are emerging that provide intelligent but automated systems to assist biomedical researchers, particularly those dealing with high-throughput data.

Abstract

In response to the unbridled growth of information in literature and biomedical databases, researchers require efficient means of handling and extracting information. As well as providing background information for research, scientific publications can be processed to transform textual information into database content or complex networks and can be integrated with existing knowledge resources to suggest novel hypotheses. Information extraction and text data analysis can be particularly relevant and helpful in genetics and biomedical research, in which up-to-date information about complex processes involving genes, proteins and phenotypes is crucial. Here we explore the latest advancements in automated literature analysis and its contribution to innovative research approaches.

Figure 1: Categories of text-mining solutions.

Author information

Corresponding author

Correspondence to Dietrich Rebholz-Schuhmann.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

BioCaster

CoPub

CiteULike

Diseasome

European Bioinformatics Institute

European Patent Office

Google Scholar

GoPubMed

iHOP

Mendeley

National Agricultural Laboratory Catalog

National Center for Biotechnology Information

Nature Reviews Genetics Series on Computational tools

OBO Flatfile Format Syntax and Semantics

Open Biomedical Annotator

Paperpile

Papers

PLoS Neglected Tropical Diseases: Impact of environment and social gradient on Leptospira infection in urban slums

PharmGKB

PolySearch

PubMed

Reflect

RefMED

ScienceDirect.com

SIDER — Side Effect Resource

Textpresso

Transcript Based Isoform Interaction Database (TBIID)

UK PubMed Central

Glossary

Hypotheses

Testable statements that, if true, may explain an observed phenomenon.

Knowledge bases

Databases of statements covering a knowledge domain. Often, statements are represented in a form that permits the automated or manual inference of statements that are not explicitly stated using inference rules.

Facts

Objective and (experimentally) verifiable ways in which the world is structured.

Information retrieval

The process of selecting information or documents from a collection in response to a submitted query.
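
As a minimal sketch of this definition, the following Python example ranks a toy document collection against a query using a simple TF-IDF score. The documents, the `tokenize` and `rank` helpers and the smoothing constant are assumptions made for illustration; they are not taken from any of the retrieval services discussed in this article.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace-separated tokens with surrounding punctuation removed."""
    tokens = (tok.strip(".,;:()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def rank(query, documents):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score.

    Documents that share no term with the query are omitted.
    """
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term]) + 1.0  # illustrative smoothing
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy collection, invented for illustration only.
DOCS = {
    "doc1": "Insulin resistance is a hallmark of type 2 diabetes.",
    "doc2": "The gene TP53 is frequently mutated in cancer.",
    "doc3": "Obesity increases the risk of type 2 diabetes and insulin resistance.",
}

results = rank("insulin diabetes", DOCS)
print([doc_id for doc_id, _ in results])
```

Both matching documents contain each query term once, so the shorter document ranks first because term frequency is normalized by document length.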

Information extraction

The process of automatically assessing documents, data or knowledge bases to extract statements that are likely to be true given the available information. Information extraction can be based on defined patterns, machine-learning techniques, statistical analyses or automated reasoning.
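
The pattern-based variant mentioned above can be sketched in a few lines of Python. The regular expression, the list of trigger verbs and the entity names (`KinaseA`, `ProteinB`, `ProteinC`) are hypothetical and chosen only for illustration; real systems typically combine such patterns with entity recognition, syntactic parsing or machine learning.

```python
import re

# Hypothetical pattern: a capitalized name, a trigger verb, a capitalized name.
PATTERN = re.compile(
    r"\b([A-Z][A-Za-z0-9-]*)\s+"
    r"(phosphorylates|inhibits|activates|binds)\s+"
    r"([A-Z][A-Za-z0-9-]*)"
)

def extract_relations(sentence):
    """Return (agent, relation, target) triples matched by the pattern."""
    return [(m.group(1), m.group(2), m.group(3)) for m in PATTERN.finditer(sentence)]

# Invented sentence with placeholder entity names.
text = "In our toy example, KinaseA phosphorylates ProteinB, and ProteinC inhibits KinaseA."
relations = extract_relations(text)
print(relations)
# [('KinaseA', 'phosphorylates', 'ProteinB'), ('ProteinC', 'inhibits', 'KinaseA')]
```

Each extracted triple is only a candidate statement; whether it is likely to be true still depends on the evidence in the source sentence.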

Knowledge discovery

The process of analysing a set of statements to identify new statements that are true. To discover new knowledge, evidence must have already been gathered in support of the identified statements.

Statements

Declarative sentences that can be said to be either true or false. True statements express facts.

Evidence

The information that has been gathered to demonstrate that the statement is true (that is, it corresponds to a fact); in science, evidence usually contains experimental results.

Provenance

A reference to the literature from which a statement or its supporting evidence was derived.

Terms

Single words or compositions of words with well-defined meanings.

Types

The conceptualization of categories of entities or conceptual instances, represented by a unique identifier, a label and a definition.

Entity recognition

The extraction of text constituents representing a specific type, preferably entities with a name such as a protein.

Entity normalization

The mapping of a named entity or type in the text to a unique identifier, possibly requiring disambiguation and contextual analysis.
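
A dictionary lookup with a simple context check illustrates the idea. In this hedged Python sketch, the lexicon, the `EX:` identifiers and the context cue words are all invented placeholders rather than entries from a real database; the disambiguation rule (counting cue words in the surrounding sentence) is one simple option among many.

```python
# Hypothetical lexicon mapping lowercase mentions to candidate identifiers.
LEXICON = {
    "p53": ["EX:GENE_0001"],
    "tp53": ["EX:GENE_0001"],
    "tumor protein p53": ["EX:GENE_0001"],
    "cat": ["EX:GENE_0002", "EX:TAXON_0001"],  # ambiguous: gene symbol or species
}

# Hypothetical cue words that support each ambiguous candidate.
CONTEXT_CUES = {
    "EX:GENE_0002": {"gene", "enzyme", "expression"},
    "EX:TAXON_0001": {"animal", "species", "feline"},
}

def normalize(mention, context=""):
    """Map a mention to one identifier, or None if unknown or unresolved."""
    candidates = LEXICON.get(mention.lower(), [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # Disambiguate by counting cue words that appear in the context.
    context_tokens = set(context.lower().split())
    scores = {c: len(CONTEXT_CUES.get(c, set()) & context_tokens) for c in candidates}
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return winners[0] if best > 0 and len(winners) == 1 else None

print(normalize("TP53"))
print(normalize("CAT", "CAT gene expression was measured"))
print(normalize("cat", "the cat is a feline species"))
```

Returning `None` when the context gives no clear winner keeps ambiguous mentions out of downstream data integration instead of guessing.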

Ontology

A representation of a conceptualization of a domain of knowledge, characterizing the classes and relations that exist in the domain. Commonly, ontologies are represented as a graph structure that encodes a taxonomy.

Features

Any constituents of the text — such as tokens, words, complex terms or representations of a concept — that serve as an input to a text-mining solution.

Assertions

Statements that are represented in a formal language to denote the properties or relations of an entity (or concept).

Semantic Web

The extension of the World Wide Web to provide, simultaneously, human- and computer-readable semantics through references to well-defined resources.

Syntactic parsing

Processing of the sentence structure using statistics or grammar rules to produce an electronic representation that delivers logical components (for example, a 'noun phrase'), their roles (for example, the 'subject') and dependencies.

Semantic resources

Biomedical ontologies and databases serve as semantic resources, as they define and describe concepts and entities.

Automated reasoning

The use of software to derive statements automatically from a knowledge base using inference rules.
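
One of the simplest inference rules is transitivity over a taxonomic relation. The following Python sketch applies that single rule by forward chaining until no new statement appears; the `is_a` triples are a toy hierarchy written for illustration, not an excerpt from a published ontology, and the function name `infer_transitive` is an assumption.

```python
def infer_transitive(assertions, relation="is_a"):
    """Forward-chain one transitivity rule and return only the derived triples.

    Rule: (a, relation, b) and (b, relation, c) imply (a, relation, c).
    """
    known = set(assertions)
    changed = True
    while changed:
        changed = False
        snapshot = list(known)
        for (a, r1, b) in snapshot:
            for (c, r2, d) in snapshot:
                if r1 == r2 == relation and b == c:
                    derived = (a, relation, d)
                    if derived not in known:
                        known.add(derived)
                        changed = True
    return known - set(assertions)

# Toy knowledge base of (subject, relation, object) triples.
KB = {
    ("type 2 diabetes", "is_a", "diabetes mellitus"),
    ("diabetes mellitus", "is_a", "metabolic disease"),
    ("metabolic disease", "is_a", "disease"),
}

inferred = infer_transitive(KB)
for triple in sorted(inferred):
    print(triple)
```

The three derived triples are not stated explicitly in the knowledge base, which is what distinguishes reasoning from lookup.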

Micro-theories

Sets of assertions that share the same topic or that result from the same source. The assertions must be conflict-free within a micro-theory but can contradict other micro-theories.
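
The consistency requirement can be illustrated with a short Python sketch. Here an assertion is modelled as a triple plus a `negated` flag, and a conflict is taken to mean that the same triple is both asserted and negated; this representation, the `Assertion` type and the entity names are assumptions made for the example, and richer logics would need a real reasoner.

```python
from collections import namedtuple

# Hypothetical representation: a triple plus a flag for negation.
Assertion = namedtuple("Assertion", ["subject", "relation", "obj", "negated"])

def is_conflict_free(micro_theory):
    """True if no triple is both asserted and negated in the same set."""
    polarity = {}
    for a in micro_theory:
        key = (a.subject, a.relation, a.obj)
        if key in polarity and polarity[key] != a.negated:
            return False
        polarity[key] = a.negated
    return True

# Two toy micro-theories, for example derived from two different sources.
source_one = [Assertion("ProteinX", "activates", "GeneY", negated=False)]
source_two = [Assertion("ProteinX", "activates", "GeneY", negated=True)]

print(is_conflict_free(source_one))               # True
print(is_conflict_free(source_two))               # True
print(is_conflict_free(source_one + source_two))  # False
```

Each micro-theory is consistent on its own, whereas merging the two exposes the contradiction between them.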

Hypothesis generation

The selection or creation of hypotheses that can explain a given phenomenon. Commonly, selection criteria regarding relevance, parsimony or consistency with existing knowledge are applied to select the most viable hypotheses for a given phenomenon.

About this article

Cite this article

Rebholz-Schuhmann, D., Oellrich, A. & Hoehndorf, R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 13, 829–839 (2012). https://doi.org/10.1038/nrg3337

  • DOI: https://doi.org/10.1038/nrg3337
