Box 1 | Natural language processing

From the following article:

Mining electronic health records: towards better research applications and clinical care

Peter B. Jensen, Lars J. Jensen & Søren Brunak

Nature Reviews Genetics 13, 395-405 (June 2012)

doi:10.1038/nrg3208

Typical natural language processing (NLP) steps exemplified by the clinical Text Analysis and Knowledge Extraction System (cTAKES) clinical text-mining pipeline27. First, sentence boundary detection splits the text into units of individual sentences. This is followed by tokenization, which splits the text using space and punctuation as a guide to identify individual tokens (typically individual words), with rules for handling special cases such as dates. Tokens are reduced to a base form by normalizing, for example, case, inflection or spelling variants. The next step assigns part-of-speech tags to each token to identify its grammatical category in the context (for example, NN for noun, IN for preposition or JJ for adjective). This is not a trivial task as many words have ambiguous meaning. After the tokens have been tagged, the shallow parsing step identifies syntactic units, most importantly noun phrases (NPs), which are grammatical units, built from a noun with optional modifiers such as adjectives. In the entity recognition step, NPs and various lexical permutations are then mapped to controlled vocabularies using tools such as MetaMap106. Importantly, such systems also identify the presence of negating terms, such as 'no' or 'never', near identified entities. The various steps are typically implemented using combinations of logical rules (and their exceptions) and machine-learning methods. For example, a full stop (period) followed by a space and a capital letter indicates a sentence boundary. In the figure, two disorders are identified as well as one anatomical structure. Both disorders are tagged as relating to family history (Fx), and, in the case of coronary artery disease, the preceding word 'no' tags the term as negated. Clinical information extraction systems generally perform best when fine-tuned for specific tasks or clinical domains, such as identifying smoking status or analysing radiology reports. Vocabularies can be customized for a task with domain-specific terms, and the rules and training can be focused. The annual Informatics for Integrating Biology and the Bedside (i2b2) NLP shared tasks107, 108, 109, 110 meeting provides a good demonstration of state-of-the-art practice in clinical NLP applied to increasingly difficult challenges. The 2010 challenge prompted participants to extract concepts, assertions and relations from clinical text110. CC, coordinated conjuction; DT, determiner; NNS, plural noun.

Mining electronic health records: towards better research applications and clinical care