Ontology-driven weak supervision for clinical entity classification in electronic health records

In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g. the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time consuming and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove’s ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.


Introduction
Analyzing text to identify concepts such as disease names and their associated attributes like negation is a foundational task in medical natural language processing (NLP). Traditionally, training classifiers for named entity recognition (NER) and cue-based entity classification has relied on hand-labeled training data. However, annotating medical corpora requires considerable domain expertise and money, creating barriers to using machine learning in critical applications [1,2]. Moreover, hand-labeled datasets are static artifacts that are expensive to change. The recent COVID-19 pandemic highlights the need for machine learning tools that enable faster, more flexible analysis of clinical and scientific documents in response to rapidly unfolding events [3].
To address the scarcity of hand-labeled training data, machine learning practitioners increasingly turn to lower cost, less accurate label sources to rapidly build classifiers. Instead of requiring hand-labeled training data, weakly supervised learning relies on task-specific rules and other imperfect labeling strategies to programmatically generate training data. This approach combines the benefits of rule-based systems, which are easily shared, inspected and modified, with machine learning, which typically improves performance and generalization properties. Weakly supervised methods have demonstrated success across a range of NLP and other settings [4,5,6,7,8].
Knowledge bases and ontologies provide a compelling foundation for building weakly supervised entity classifiers. Ontologies codify a vast amount of medical knowledge via taxonomies and example instances for millions of medical concepts. However, repurposing ontologies for weak supervision creates challenges when combining label information from multiple sources without access to ground truth labels. The hundreds of terminologies found in the Unified Medical Language System (UMLS) Metathesaurus [9] and other sources [10] typify the highly redundant, conflicting, and imperfect entity definitions found across medical ontologies. Naively combining such conflicting label assignments can cause substantial performance drops in weakly supervised classification [11]; therefore, a key challenge is correcting for labeling errors made by individual ontologies when combining label information.
In this work, we explore how ontology-driven weak supervision can be used to train medical entity classifiers without hand-labeled training data. Prior research on weakly supervised medical NER has required complex preprocessing to identify possible entity spans [12], generated labels from a single source rather than combining multiple sources [13], or relied on ad hoc rule engineering [14]. High impact application areas, such as clinical NER using weak supervision, are largely unstudied. Key questions remain about the extent to which we can automate weak supervision using existing medical ontologies and how much additional task-specific rule engineering is required for state-of-the-art performance. It is also unclear whether, and by how much, pre-trained language models such as BioBERT [15] improve the ability to generalize from weakly labeled data and reduce the need for task-specific labeling rules.
We present Trove, a framework for training weakly supervised medical entity classifiers using off-the-shelf ontologies. The overall pipeline is shown in Figure 1. We focus on the challenge of building classifiers without hand-labeled training data by unifying: (1) imperfect labels generated by multiple ontologies and (2) task-specific rules. Our main hypothesis is that ontology-only weak supervision, coupled with recent pre-trained language models such as BioBERT, substantially reduces the engineering cost of creating entity classifiers while matching the performance of prior, more expensive, weakly supervised methods. The central intuition of this work is that individual ontologies and task-specific rules each make systematic labeling errors. By observing the rates of agreement and disagreement across labeling rules, and without requiring ground truth labels, we can learn each source's accuracy and correct for label noise to generate "denoised", probabilistic training data [16,17]. These data are then used to train deep learning models to generalize beyond the concepts found in ontologies alone.
We conduct experiments on six benchmark tasks for clinical and scientific text, reporting state-of-the-art weakly supervised performance (i.e., using no hand-labeled training data) on NER datasets for chemical/disease and drug tagging. We further present new weakly supervised baselines for two tasks in clinical text: disorder tagging and event temporality classification. Our study includes ablation analyses exploring the performance trade-offs of training models with labels generated from easily automated ontology-based weak supervision vs. more expensive, task-specific rules. Finally, we present a case study deploying Trove for COVID-19 symptom tagging and risk factor monitoring using a near-realtime feed of Stanford Health Care emergency department notes.

Related work
Rule-based systems for NER [18] and cue detection [19,20] are common in clinical text processing, where labeled corpora are difficult to share due to privacy concerns. Generating imperfect training labels from indirect sources (e.g., patient notes) is often used in analyzing medical images [21,22,23,24,25,26] and text processing [27]. Recent work has explored learning the accuracies of sources to correct for label noise in rule-based systems for text classification [28,29,4,17]. However, these focus on sentence or document classification via task-specific labeling rules and do not explore NER or automating labeling via multiple ontologies.
Weakly supervised learning is an umbrella term referring to methods for training classifiers using imperfect, indirect, or limited labeled data and includes techniques such as distant supervision [30,31], co-training [32] and others [33]. Prior approaches for weakly supervised NER such as co-training use a small set of labeled seed examples [34] which are iteratively expanded through bootstrapping or self-training [35]. Semi-supervised methods also use some amount of labeled training data and incorporate unlabeled data by imposing constraints on properties such as expected label distributions [36]. Distant supervision requires no labeled training data, but typically focuses on a single source for labels [13], rather than unifying labels assigned using heterogeneous sources of unknown quality. Crowdsourcing methods combine labels from multiple human annotators with unknown accuracy [37]. However, compared to human labelers, programmatic label assignment has different correlation and scaling properties which create technical challenges when combining sources.
Data programming [16,11,17] formalizes theory for combining multiple label sources with different coverage and unknown accuracy as well as correlation structure to correct for labeling errors. This approach is used in SwellShark [12] where a generative model is trained using labels from multiple dictionary and rule-based sources. However, this approach required task-specific preprocessing to identify candidate entities a priori to achieve competitive performance. Safranchik et al. [14] presented WISER, a linked hidden Markov model where weak supervision was defined separately over tags and tag transitions using linking rules derived from language models, ngram statistics, mined phrases and custom heuristics to train a BiLSTM-CRF.
Our work advances these prior approaches by: (1) eliminating the requirement for identifying probable entity spans a priori by combining word-level weak supervision with contextualized word embeddings; (2) using ontology-only supervision; and (3) quantifying the relative contributions of sources of label assignment, such as pre-existing ontologies from the UMLS (low cost) and task-specific rule engineering (high cost), to the achieved performance for a task.

Datasets and tasks
We analyze two categories of medical tasks using six datasets: (1) NER; and (2) span classification, where entities are identified a priori and classified for cue-driven attributes such as negation or document relative time, i.e., the order of an event entity relative to the parent document's timestamp. Both categories of tasks are formalized as token classification problems, either tagging all words in a sequence (NER) or just the head words for an entity set (span classification). Documents were preprocessed using a spaCy [38] pipeline optimized for medical tokenization and sentence boundary detection [29].
We used 99 label sources covering a broad range of medical ontologies. We used the 2018AA release of the UMLS Metathesaurus, removing non-English and zoonotic source terminologies as well as sources containing fewer than 500 terms, resulting in 92 sources. Additional sources included the 2019 SPECIALIST abbreviations [43]; Disease Ontology [44]; Chemical Entities of Biological Interest (ChEBI) [45]; Comparative Toxicogenomics Database (CTD) [46]; the seed vocabulary used in AutoNER [13]; ADAM abbreviations database [47]; and word sense abbreviation dictionaries used by the clinical abbreviation system CARD [48]. We applied minimal preprocessing to all source ontologies, filtering out English stopwords [49] and applying a letter case normalization heuristic to preserve abbreviations.

Formulation of the labeling problem
We assume a sequence labeling problem formulation, where we are given a dataset $D = \{X_i\}_{i=1}^{N}$ of $N$ sequences $X_i = (x_{i,1}, \ldots, x_{i,t})$ consisting of words $x$ from a fixed vocabulary. Each sequence is mapped to a corresponding sequence of latent class variables $Y_i = (y_{i,1}, \ldots, y_{i,t})$, where $y \in \{0, \ldots, k\}$ for $k$ tag classes. Since $Y$ is not observable, our primary technical challenge is estimating $Y$ from multiple, potentially conflicting label sources of unknown quality to construct a probabilistically labeled dataset $D = \{X_i, \hat{Y}_i\}_{i=1}^{N}$. This dataset can then be used for training classification models such as deep neural networks. Such a labeling regimen is typically low-cost, but less accurate than the hand-curated labels used in traditional supervised learning, hence this paradigm is referred to as weakly supervised learning.

Unifying and denoising sources with a label model
Combining labels assigned via term-matching using multiple ontologies and task-specific rules is challenging because the different sources have unknown, task-dependent accuracies and can disagree on the correct (unobserved) label, introducing noise into the labeling process. To correct for such label noise, we use data programming [16] to estimate accuracies of each source and ensemble the sources via a label model which assigns a consensus probabilistic label per word.
To learn the label model, $m$ different label sources are parameterized as labeling functions $\lambda_1, \ldots, \lambda_m$. Labeling functions assign a label given an input instance (e.g., a document or entity span) and an underlying heuristic such as matching strings against a dictionary. The output of a labeling function is in the domain $\{-1, 0, \ldots, k\}$ where $-1$ denotes ABSTAIN, i.e., not assigning any class label. The vector of $m$ labeling functions applied to $n$ instances forms the label matrix $\Lambda \in \{-1, 0, \ldots, k\}^{m \times n}$. A key finding of data programming is that we can use $\Lambda$ to recover the latent class-conditional accuracy of each label source without ground truth labels by observing the rates of agreement and disagreement across all pairs of labeling functions $\lambda_i, \lambda_j$ [16].
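As a minimal sketch of how a label matrix of this shape could be assembled, the snippet below applies two toy word-level labeling functions to one token sequence. The function names, the tiny dictionaries, and the choice of one column per word are illustrative assumptions, not part of Trove.

```python
ABSTAIN = -1  # a labeling function assigns no class label

def lf_disease_terms(tokens):
    """Hypothetical dictionary rule: vote 1 (entity) on known disease words."""
    terms = {"diabetes", "cancer"}
    return [1 if tok.lower() in terms else ABSTAIN for tok in tokens]

def lf_stopwords(tokens):
    """Hypothetical negative rule: vote 0 (not an entity) on stopwords."""
    stopwords = {"the", "of", "with", "has"}
    return [0 if tok.lower() in stopwords else ABSTAIN for tok in tokens]

def build_label_matrix(labeling_functions, tokens):
    """Return an m x n matrix: one row per labeling function, one column per word."""
    return [lf(tokens) for lf in labeling_functions]

tokens = ["Patient", "has", "diabetes"]
label_matrix = build_label_matrix([lf_disease_terms, lf_stopwords], tokens)
```

Each column of `label_matrix` collects the votes of all sources for one word, which is the raw material from which agreement and disagreement rates are observed.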
We use the weak supervision framework Snorkel [11] to train a probabilistic label model which captures the relationship between the true label and label sources $P(Y, \Lambda)$. Here the training input is only the label matrix $\Lambda$, generated by applying labeling functions $\lambda_1, \ldots, \lambda_m$ to the unlabeled dataset $D$. Formally, $P(Y, \Lambda)$ can be encoded as a factor graph-based model with $m$ accuracy factors between $\lambda_1, \ldots, \lambda_m$ and our true (unobserved) label $y$. Snorkel implements a matrix completion formulation of data programming which enables faster estimation of model parameters $\theta$ using stochastic gradient descent rather than relying on Gibbs sampling-based approaches [17]. The label model estimates $P(Y \mid \Lambda)$ to provide "denoised" consensus label predictions $\hat{Y}$ and generates our probabilistically labeled dataset $D$.
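To illustrate why source accuracies matter when forming a consensus label, the sketch below computes $P(y = 1 \mid \Lambda)$ for a single word in a binary task. It is a simplification rather than Snorkel's estimator: the accuracies are supplied by hand instead of learned, sources are assumed conditionally independent given $y$, and abstentions are assumed uninformative.

```python
import math

ABSTAIN = -1

def posterior_positive(votes, accuracies, prior=0.5, eps=1e-6):
    """P(y=1 | votes) for one word, given one assumed accuracy per source."""
    log_pos = math.log(prior)
    log_neg = math.log(1.0 - prior)
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # abstaining sources contribute no evidence
        acc = min(max(acc, eps), 1.0 - eps)  # keep the logarithms finite
        log_pos += math.log(acc if vote == 1 else 1.0 - acc)
        log_neg += math.log(acc if vote == 0 else 1.0 - acc)
    shift = max(log_pos, log_neg)  # subtract the max for numerical stability
    pos = math.exp(log_pos - shift)
    neg = math.exp(log_neg - shift)
    return pos / (pos + neg)
```

With votes `[1, 0, 0]` and assumed accuracies `[0.9, 0.6, 0.6]`, an unweighted majority vote would pick class 0, whereas the accuracy-weighted posterior favors class 1 because the single positive vote comes from the most reliable source.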

Labeling function templates
In this work, a labeling function $\lambda_j$ accepts an unlabeled sequence $X_i$ as input and emits a vector of predicted labels $\tilde{Y}_{i,j} = (\tilde{y}_{j,1}, \ldots, \tilde{y}_{j,t})$, i.e., a label $\tilde{y}_j \in \{-1, 0, \ldots, k\}$ for each word in $X_i$. A typical labeling function serves as a wrapper for an underlying, potentially task-specific labeling heuristic such as pattern matching with a regular expression. Since these labeling functions are not easily automated and require hand coding, we refer to them as task-specific labeling functions.
In contrast, medical ontologies are easily transformed into labeling functions by defining reusable labeling function templates. Templates only require specifying a target entity taxonomy and providing a collection of terminologies mapped to that taxonomy. These mappings are common in knowledge bases such as the UMLS Metathesaurus, where the UMLS Semantic Network [50] provides a shared taxonomy for over a hundred medical terminologies. We utilize two ontology-based labeling functions in this work.
Taxonomy labeling functions require a set of terms (single or multi-word entities) $t \in T$ mapped to a taxonomy, where a term may be mapped to multiple entity classes. This mapping is converted to a $k$-dimensional probability vector where $k$ is the number of entity classes. Given input sequence $X_i$, use string matching to find all longest term matches (in token length) and assign each match to its most probable entity class $\tilde{y} = \arg\max(t_i)$, abstaining on ties. Using the longest match is a heuristic which helps disambiguate nested terms ("lung" as anatomy vs "lung cancer" as disease). Matching optionally includes a set of slot-filled patterns to capture simple compositional mentions (e.g., "{*} ({*})" → "Tylenol (Acetaminophen)").
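The longest-match heuristic can be sketched as follows. This is an illustrative approximation that scans greedily from left to right; the `term_classes` structure (lowercased token tuples mapped to class probabilities), the `max_len` window, and the class ids in the example are assumptions made for the sketch.

```python
ABSTAIN = -1

def taxonomy_lf(tokens, term_classes, max_len=4):
    """Word-level labels from greedy longest-match dictionary lookup."""
    labels = [ABSTAIN] * len(tokens)
    lowered = [tok.lower() for tok in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so nested terms are not split.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            probs = term_classes.get(tuple(lowered[i:i + n]))
            if probs is None:
                continue
            best = max(probs.values())
            winners = [cls for cls, p in probs.items() if p == best]
            if len(winners) == 1:  # abstain when classes tie
                labels[i:i + n] = [winners[0]] * n
            i += n
            matched = True
            break
        if not matched:
            i += 1
    return labels

# Toy mapping: class 1 = disease, class 0 = other (e.g., anatomy).
term_classes = {
    ("lung",): {0: 1.0},
    ("lung", "cancer"): {1: 1.0},
    ("cold",): {0: 0.5, 1: 0.5},  # ambiguous term: a tie, so the rule abstains
}
```

Here "lung cancer" is labeled as a single disease span rather than as anatomy followed by an unmatched word, while the ambiguous term "cold" is left unlabeled.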
Synonym (synset) labeling functions require synsets (collections of synonymous terms) $\{t_1, \ldots, t_n\} \in T$ and terms $T$ mapped to a taxonomy. Given input sequence $X_i$ and its parent context (e.g., document), search for >1 unique synonym matches from a target synset and label all matches $\tilde{y} = \arg\max(t_i)$. This is useful for disambiguating abbreviations (e.g., "Duchenne muscular dystrophy" → "DMD"), where a long form of an abbreviated term appears elsewhere in a document. Matches can be unconstrained, e.g., any tuple found anywhere in a context, or subject to matching rules, e.g., using Schwartz-Hearst abbreviation disambiguation [51] to identify out-of-dictionary abbreviations.
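A minimal sketch of the unconstrained variant is given below: mentions are labeled only when more than one distinct synonym from the synset occurs somewhere in the same document. The function names, the whitespace tokenization of synonyms, and the fixed `label` argument are assumptions for illustration.

```python
ABSTAIN = -1

def find_matches(tokens, terms):
    """Return (start, end, term) for each occurrence of a term in one sentence."""
    lowered = [tok.lower() for tok in tokens]
    hits = []
    for term in terms:
        n = len(term)
        for i in range(len(lowered) - n + 1):
            if tuple(lowered[i:i + n]) == term:
                hits.append((i, i + n, term))
    return hits

def synset_lf(sentences, synset, label):
    """Label synonym mentions only if >1 unique synonyms occur in the document."""
    terms = {tuple(s.lower().split()) for s in synset}
    per_sentence = [find_matches(tokens, terms) for tokens in sentences]
    unique = {term for hits in per_sentence for (_, _, term) in hits}
    labels = [[ABSTAIN] * len(tokens) for tokens in sentences]
    if len(unique) > 1:
        for sent_labels, hits in zip(labels, per_sentence):
            for start, end, _ in hits:
                sent_labels[start:end] = [label] * (end - start)
    return labels

synset = {"Duchenne muscular dystrophy", "DMD"}
doc = [["Duchenne", "muscular", "dystrophy", "was", "diagnosed"], ["DMD", "progressed"]]
```

A document that mentions only "DMD" receives no labels from this rule, which is the intended behavior: the abbreviation alone is treated as too ambiguous to label.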
Our labeling functions generate word-level labels. Figure 2 shows how this provides a principled way to synthesize a label when there is disagreement across label sources about what constitutes an entity span. Here the disease mention "diabetes type 2" is not found in Metathesaurus Names (MTH) or SNOMED Clinical Terms (SNOMEDCT) [52], which leads to disagreement and label errors. Using a soft majority vote of labeling functions misses the complete entity span, while the label model learns to account for systematic errors made by each ontology to generate a more accurate consensus label prediction.

Training the BioBERT end model
The output of the label model is a set of probabilistically labeled words, which we transform back into sequences $D = \{X_i, \hat{Y}_i\}_{i=1}^{N}$. While probabilistic labels may be used directly for classification, this suffers from a key limitation: the label model cannot generalize beyond the direct output of labeling functions. Rules alone can miss common error cases such as out-of-dictionary synonyms or misspellings. Therefore, to improve coverage we train a discriminative end model, in this case a deep neural network, to transform the output of labeling functions into learned feature representations. Doing so leverages the inductive bias of pre-trained language models [53] and provides additional opportunities for injecting domain knowledge via data augmentation [54] and multi-task learning [55] to improve classification performance. We use the transformer-based BioBERT [15], a language model fine-tuned on medical text. We also evaluated ClinicalBERT [56] for clinical tasks, and found its performance to be the same as that of BioBERT. BioBERT is trained as a token-level classifier with a max sequence length of 512 tokens. We follow Devlin et al. [53] for sequence labeling formulation, using the last BERT layer of each word's head wordpiece token as the contextualized embedding. Since sequence labels may be incomplete (i.e., cases where all labeling functions abstain on a word), we mask all abstained tokens when computing the loss during training. We modified BioBERT to support a noise-aware binary cross entropy loss function [16] which minimizes the expected value with respect to $\hat{Y}$ to take advantage of the more informative probabilistic labels.
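The masked, noise-aware loss can be sketched in a few lines for the binary case. This is an illustrative pure-Python version rather than the modified BioBERT training code; representing a fully abstained token by `None` and averaging over unmasked tokens are assumptions of the sketch.

```python
import math

def noise_aware_bce(pred_probs, soft_labels, eps=1e-12):
    """Expected binary cross entropy under probabilistic labels.

    pred_probs:  model estimates of P(y=1) per token.
    soft_labels: label model estimates of P(y=1) per token, or None when
                 every labeling function abstained (token is masked).
    """
    total, count = 0.0, 0
    for p, y in zip(pred_probs, soft_labels):
        if y is None:
            continue  # masked tokens do not contribute to the loss
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0
```

A soft label of 0.8 penalizes a confident wrong prediction less than a hard label of 1 would, which is how the uncertainty of the label model is carried into end model training.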

Hyperparameter tuning for the label and end models
All models were trained using weakly-labeled versions of the original training splits, i.e., no hand-labeled instances. We used a hand-labeled validation and test set for hyperparameter tuning and model evaluation, respectively. Result metrics are reported using the test set. The label model was tuned for learning rate, training epochs, L2 regularization, and a uniform accuracy prior used to initialize labeling function accuracies. BioBERT weights were fine-tuned, and end models were tuned for learning rate and training epochs. We used a linear decay learning rate schedule with a 10% warmup period.
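One common reading of a linear decay schedule with a 10% warmup is sketched below: the learning rate rises linearly from zero to its peak over the first tenth of training and then decays linearly back to zero. The exact form used in the experiments is not specified here, so the function and its parameter names are assumptions.

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_frac=0.1):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```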

Metrics
We report precision, recall, and F1-score for all tasks. DocRelaTime is reported using micro-averaging. NER metrics are computed using exact span matching [57]. Each NER task is trained separately as a binary classifier using IO (inside, outside) tagging to simplify labeling function design, with predicted tags converted to BIO (beginning, inside, outside) to properly count errors detecting head words. Span task metrics are calculated assuming access to gold test set spans, as per the evaluation protocol of the original challenges. Label model and BioBERT scores are reported as the mean and standard deviation of five runs with different random seeds.
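The IO-to-BIO conversion and exact span matching described above can be sketched as follows. The helper names are hypothetical and the scoring is a simplified single-sequence version, not the evaluation script used for the reported results.

```python
def io_to_bio(tags):
    """Convert IO tags to BIO: the first I of each run becomes B."""
    bio, prev = [], "O"
    for tag in tags:
        if tag == "I":
            bio.append("B" if prev == "O" else "I")
        else:
            bio.append("O")
        prev = tag
    return bio

def spans(bio):
    """Return the set of (start, end) entity spans in a BIO sequence."""
    found, start = set(), None
    for i, tag in enumerate(list(bio) + ["O"]):  # sentinel closes a final span
        if tag in ("B", "O") and start is not None:
            found.add((start, i))
            start = None
        if tag == "B":
            start = i
    return found

def exact_match_prf(gold_bio, pred_bio):
    """Precision, recall, and F1 where a span counts only if both ends match."""
    gold, pred = spans(gold_bio), spans(pred_bio)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

One consequence of IO tagging is that two entities that are directly adjacent cannot be separated after conversion, since the run of I tags is read as a single span.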

Experiment overview
After quantifying the performance of ontology-driven weak supervision in all our tasks, we performed four experiments. First, we examined performance differences by label source ablations, which compared ontology-based labeling functions against those incorporating task-specific rules. Second, we compared Trove to existing weakly supervised tagging methods. Third, we examined learning source accuracies for UMLS terminologies. Finally, we report on a case study that used Trove to monitor emergency department notes for symptoms and risk factors associated with patients tested for COVID-19. Expanded experimental details, tuning experiments, and performance measures are provided in supplemental materials.

Labeling source ablations
For NER tasks, we examined five ablations, ordered by increasing cost of labeling effort.
1. Guidelines: A dictionary of all positive and negative examples explicitly provided in annotation guidelines, including dictionaries for punctuation, numbers, and English stopwords.
2. + UMLS: Terminologies from the UMLS Metathesaurus.
3. + Other: Additional ontologies or existing dictionaries not included in the UMLS.
4. + Rules: Task-specific rules including regular expressions, small dictionaries, and other heuristics.
5. Hand-labeled: Supervised learning using the expert-labeled training split.
Tiers 1-4 are additive and include all prior levels. We initialized labeling function templates as follows:
Ontology-based Labeling Functions: We used the UMLS Semantic Network as our entity taxonomy and defined a mapping of semantic types (STYs) to target class labels $y \in \{0, 1\}$. Non-UMLS ontologies that did not provide semantic type assignments (e.g., ChEBI) were mapped to a single class label. All UMLS terminologies $v$ were ranked by term coverage on the unlabeled training set, defined as each term's document frequency summed by terminology, and the top $s$ terminologies were used to initialize templates, where $s$ was tuned with a validation set. The remaining $(v_{s+1}, \ldots, v_{92})$ UMLS terminologies were merged into a single labeling function to ensure all terms in the UMLS were included. UMLS synsets were constructed using concept unique identifiers (CUIs) and templates were initialized with the union of all terminologies and fixed across all NER tasks.
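The coverage-based ranking and top-$s$ partitioning can be sketched as below. For brevity the sketch assumes single-token terms and breaks coverage ties alphabetically; both choices, along with the function name, are assumptions for illustration.

```python
from collections import Counter

def partition_terminologies(terminologies, documents, s):
    """Rank terminologies by summed document frequency; keep top s, merge the rest.

    terminologies: dict mapping terminology name -> set of lowercase terms.
    documents:     list of token lists from the unlabeled training set.
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update({tok.lower() for tok in tokens})  # count each term once per document
    coverage = {
        name: sum(doc_freq[term] for term in terms)
        for name, terms in terminologies.items()
    }
    ranked = sorted(terminologies, key=lambda name: (-coverage[name], name))
    top = {name: terminologies[name] for name in ranked[:s]}
    merged_rest = set().union(*(terminologies[name] for name in ranked[s:]))
    return top, merged_rest

terminologies = {"A": {"fever", "cough"}, "B": {"fever"}, "C": {"rash"}}
documents = [["fever", "and", "cough"], ["Fever"], ["cough"]]
top, merged_rest = partition_terminologies(terminologies, documents, s=1)
```

Each entry of `top` would back its own labeling function, while `merged_rest` would back a single additional function so that no terms are dropped.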
Task-specific Labeling Functions: All task-specific labeling functions were developed by inspecting unlabeled training set documents. For NER, we used three general rule types to label concepts: regular expressions to detect out-of-ontology mentions; small dictionaries of related terms (e.g., illegal drugs); and bigram word co-occurrence graphs from ontologies to support fuzzy span matching. For negation, we built on NegEx [19], which uses regular expressions to search left and right context windows for negation cues. For DocRelaTime, we used a heuristic based on the nearest explicit datetime mention (in token distance) to an event mention [59]. Additional regular expression-based rules were added to detect other cues of event temporality.

[Figure 3: F1 scores by ablation tier. Legend: Soft Majority Vote (SMV), Weakly Supervised BioBERT, Fully Supervised BioBERT.]
Figure 3 reports F1 scores across all ablation tiers. In all settings, the weakly supervised BioBERT models outperformed SMV. Gains of 6.3 to 33.4 F1 points are seen in the guideline-only tier and 1.5 to 9.9 points in other tiers. Incorporating source accuracies into BioBERT training provided significant benefits when combining high precision sources with low precision/high recall sources. In the case of chemical tagging with SMV, the UMLS tier (light green bar) outperformed UMLS+Other (pink bar) by 3.2 F1 points (81.6 vs. 78.4). This was due to adding the ChEBI ontology, which increased recall but only had 65% word-level precision. Soft majority vote cannot learn or utilize this information, so naively adding ChEBI labels hurt performance. However, the label model learned ChEBI's accuracy to take advantage of the noisier, but higher coverage signal; thus the weakly supervised BioBERT UMLS+Other tier (red bar) outperformed UMLS (green bar) by 2.4 F1 points (88.2 vs. 85.8).

Comparing Trove with existing weakly supervised methods
We compared Trove to three existing weakly supervised methods for NER and sequence labeling: SwellShark [12], AutoNER [13], and WISER [14]. We compared performance on BC5CDR (the combination of disease and chemical tasks) against all methods and on the i2b2 drug task for SwellShark.
Table 3: Precision (P), recall (R), and F1 scores for the BC5CDR task using state-of-the-art weakly supervised NER methods. Underlined numbers indicate the best weakly supervised score using only dictionaries/ontologies and bold indicates the best score using custom rules.

Learning source accuracies for UMLS terminologies
Estimating accuracies with the label model requires observing agreement and disagreement among multiple label sources. However, it is non-obvious how to partition the UMLS, which contains many terminologies, into labeling functions. The naive extremes are to either create a single labeling function from the union of all terminologies or include all terminologies as individual labeling functions. To explore how partitioning choices impact label model performance, we held all non-UMLS labeling functions fixed across all ablation tiers and computed performance across $s = (1, \ldots, 92)$ partitions of the UMLS by terminology. All scores were normalized to the best global soft majority vote (SMV) score per tier to assess the impact of correcting for label noise.

Case study in rapidly building clinical classifiers
We deployed Trove to monitor emergency departments for patients undergoing COVID-19 testing, analyzing clinical notes for presenting symptoms and risk factors [60]. This required identifying disorders and defining a novel classification task for exposure to a confirmed COVID-19 positive individual, a risk factor informing patient contact tracing. The dataset consisted of hourly dumps of emergency department notes from Stanford Health Care (SHC), beginning in March 2020. We manually annotated a gold test set of 20 notes for all mentions of disorders and 776 notes for mentions of a positive COVID exposure. Two clinical experts generated gold annotations which were adjudicated for disagreements by authors AC and JF. As a baseline for disorder tagging, we used the fully supervised ShARe/CLEF disorder tagger. This reflects a readily available, but out-of-distribution training set (MIMIC-II [61] vs. SHC). We used the same disorder labeling function set as our prior experiments, adding one additional dictionary of COVID terms [62]. BioBERT was trained using 2482 weakly-labeled documents. Custom labeling functions were written for the exposure task and models were trained on 14k sentences.

Discussion
Our experiments demonstrate the effectiveness of using weakly supervised methods to train entity classifiers using off-the-shelf ontologies and without requiring hand-labeled training data. Medical ontologies are freely available sources of weak supervision for NLP applications [63], and in several NER tasks, our ontology-only weakly supervised models matched or outperformed more complex weak supervision methods in the literature.
Our work also highlights how domain-aware language models, such as BioBERT, can be combined with weak supervision to build low-cost and highly performant medical NLP classifiers.
Rule-based approaches are common tools in scientific literature analysis and clinical text processing [64,65,66,67]. Our results suggest that engineering task-specific rules in addition to labels provided by ontologies provides strong performance for several NER tasks, in some cases approaching the performance of systems built using hand-labeled data. We further demonstrated how leveraging the structure inherent in knowledge bases such as the UMLS to estimate source accuracies and correct for label noise provides substantial performance benefits. We find that the classification performance of the label model alone is strong, with BioBERT providing modest gains of 0.9 F1 points on average. Since the label model is orders-of-magnitude more computationally efficient to train than BERT-based models, in many settings (e.g., limited access to high-end GPU hardware) the label model alone may suffice.
Our tasks reflect a wide range of difficulty. Clinical tasks required more task-specific rules to address the increased complexity of entity definitions and other non-grammatical, sub-language phenomena [68]. Here custom rules improved clinical tasks by an average of 9.1 F1 points vs. 2.3 points for scientific literature. Moreover, adding non-UMLS ontologies to PubMed tasks consistently improved overall performance while providing little-to-no benefit for our clinical tasks. Annotation guidelines for our clinical tasks also increased complexity. The i2b2 drug task combines several underlying classification problems (e.g., filtering out negated medications, patient allergies, and historical medications) into a single tagging formulation. This extends beyond entity typing and requires more complex, cue-driven rule design.
Manually labeling training data is time consuming and expensive, creating barriers to using machine learning for new medical classification tasks. Sometimes, there is a critical need to rapidly analyze both scientific literature and unstructured electronic health record data, as in the case of the COVID-19 pandemic, when we need to understand the full repertoire of symptoms, outcomes, and risk factors at short notice [60,69,70]. However, sharing patient notes and constructing labeled training sets presents logistical challenges, both in terms of patient privacy and in developing infrastructure to aggregate patient records [71]. In contrast, labeling functions can be easily shared, edited, and applied to data across sites in a privacy preserving manner to rapidly construct classifiers for symptom tagging and risk factor monitoring.
This work has several limitations. Our task-specific labeling functions were not exhaustive and only reflect low-cost rules easily generated by domain experts. Additional rule development could lead to improved performance.
In addition, we did not explore data augmentation or multi-task learning in the BioBERT model, which may further mitigate the need to engineer task-specific rules. There is considerable prior work developing machine learning models for tagging disease, drug, and chemical entities [72,13] that could be incorporated as labeling functions. However, our goal was to explore performance tradeoffs in settings where existing machine learning models are not available. Our framework leverages the wide range of medical ontologies available for English language settings, which provides considerable advantages for weakly supervised methods. Additional work is needed to characterize the extent to which the framework can benefit tasks in non-English settings [73].
Combining labels from multiple ontology sources violates an independence assumption of data programming as used in this work, because for any pair of source ontologies we may have correlated noise. This restriction applies to all label sources, but is more prevalent in cases with extremely similar label sources, as can occur with ontologies. In our experiments, for a small number of sources, the impact was minor; however, performance tended to decrease after including more than 20 ontologies. Additional research into unsupervised methods for structure learning [74,75], i.e., learning dependencies among sources from unlabeled data, could further improve performance or mitigate the need to limit the number of included ontologies.

Conclusion
Identifying named entities and attributes such as negation are critical tasks in medical natural language processing. Manually labeling training data for these tasks is time consuming and expensive, creating a barrier to building classifiers for new tasks. The Trove framework provides ontology-driven weak supervision for medical entity classification and achieves state-of-the-art weakly supervised performance in the NER tasks of recognizing chemicals, diseases, and drugs. We further establish weakly supervised baselines for disorder tagging and classifying the temporal order of an event entity relative to its document timestamp. The weakly supervised NER classifiers perform within 1.4 to 4.8 F1 points of classifiers trained with hand-labeled data. Modeling the accuracies of individual ontologies and rules to correct for label noise improved performance in all of our entity classification tasks. Combining pre-trained language models such as BioBERT with weak supervision results in an additional improvement in most tasks.
The Trove framework demonstrates how classifiers for a wide range of medical NLP tasks can be quickly constructed by leveraging medical ontologies and weak supervision without requiring manually labeled training data. Weakly supervised learning provides a mechanism for combining the generalization capabilities of state-of-the-art machine learning with the flexibility and inspectability of rule-based approaches.

Figure 1: Trove pipeline for ontology-driven weak supervision for medical entity classification; dotted boxes/lines indicate optional steps. Users specify: A) a mapping of an ontology's class taxonomy to entity classes; B) a set of label sources (e.g., ontologies, task-specific rules) for weak supervision; and C) a collection of unlabeled document sentences with which to build a training set. Ontologies instantiate labeling function templates, which are applied to sentences to generate a label matrix. This matrix is used to train the label model, which learns source accuracies and corrects for label noise to predict a consensus probability per word. Consensus labels are transformed into the probabilistic sequence label dataset, which is used as training data for an end model (e.g., BioBERT). Alternatively, the label model can be used as the final classifier.
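The first stages of the pipeline in Figure 1 can be sketched as follows: a term-to-class mapping instantiates a labeling function, and applying several such functions to a sentence yields a word-level label matrix. This is a minimal sketch under stated assumptions, not Trove's API; `make_ontology_lf`, `label_matrix`, the label encoding, and the toy ontologies are hypothetical.

```python
# Illustrative sketch only: names and data are hypothetical, not Trove's API.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def make_ontology_lf(terms_to_class):
    """Build a word-level labeling function from a term -> class mapping."""
    lexicon = {tuple(t.lower().split()): c for t, c in terms_to_class.items()}
    max_len = max((len(k) for k in lexicon), default=0)

    def lf(words):
        labels = [ABSTAIN] * len(words)
        lowered = [w.lower() for w in words]
        i = 0
        while i < len(words):
            # Prefer the longest term that matches starting at position i.
            for n in range(min(max_len, len(words) - i), 0, -1):
                cls = lexicon.get(tuple(lowered[i:i + n]))
                if cls is not None:
                    labels[i:i + n] = [cls] * n
                    i += n
                    break
            else:
                i += 1  # no term starts here; leave the word as ABSTAIN
        return labels

    return lf


def label_matrix(words, lfs):
    """Rows are words, columns are labeling functions."""
    votes = [lf(words) for lf in lfs]
    return [list(row) for row in zip(*votes)]


# Two toy "ontologies" with different coverage of the same entity.
toy_ontology_a = {"diabetes type 2": POSITIVE, "patient": NEGATIVE}
toy_ontology_b = {"diabetes": POSITIVE}

sentence = "Patient has diabetes type 2".split()
L = label_matrix(
    sentence,
    [make_ontology_lf(toy_ontology_a), make_ontology_lf(toy_ontology_b)],
)
```

Each row of `L` holds the votes for one word, so the two sources agree on "diabetes" while only the first covers "type 2". A label model would consume a matrix of this shape.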

Figure 2: An example of labeling functions. Here four ontology labeling functions are used to label a sequence of words X_i containing the entity "diabetes type 2". Soft majority vote estimates Y_i as a word-level sum of positive class labels. The label model learns a latent class-conditional accuracy parameter for each ontology, which is used to reweight the original labels to generate a more accurate consensus prediction of Y_i.
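To make the contrast in Figure 2 concrete, the sketch below compares a soft majority vote with an accuracy-weighted vote for a single word. It simplifies in two ways that should be read as assumptions: the accuracies are supplied by hand rather than learned from unlabeled data, and each source has one symmetric accuracy instead of class-conditional parameters. The weighting is a naive-Bayes-style log-odds combination chosen for illustration, not the label model used in this work.

```python
import math

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1  # assumed label encoding


def soft_majority_vote(votes):
    """Fraction of non-abstaining votes that are positive (0.5 if none vote)."""
    active = [v for v in votes if v != ABSTAIN]
    if not active:
        return 0.5
    return sum(v == POSITIVE for v in active) / len(active)


def accuracy_weighted_vote(votes, accuracies):
    """Combine votes as log-odds, weighting each source by its accuracy.

    Each accuracy must lie strictly between 0 and 1. Sources are treated as
    conditionally independent, which is itself an assumption.
    """
    log_odds = 0.0
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue
        weight = math.log(acc / (1.0 - acc))
        log_odds += weight if vote == POSITIVE else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))


# Four hypothetical ontologies vote on one word; two say positive, two negative.
votes = [POSITIVE, POSITIVE, NEGATIVE, NEGATIVE]
assumed_accuracies = [0.9, 0.8, 0.6, 0.55]  # hand-picked, not learned

smv = soft_majority_vote(votes)
weighted = accuracy_weighted_vote(votes, assumed_accuracies)
```

With a tied vote, the soft majority vote is uninformative, while the weighted vote sides with the two sources assumed to be more accurate. This is the intuition behind reweighting labels by source accuracy.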

Figure 4 shows the impact of partitioning the UMLS into s different labeling functions. Modeling source accuracy consistently outperformed SMV across all tiers, in some cases by 2-8 F1 points. The best-performing partition size s ranged from 1 to 10 by task. Two naive baseline approaches (collapsing the UMLS into a single labeling function, or treating all terminologies as individual labeling functions) generally did not perform best overall.
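One way to picture a partition into s labeling functions is sketched below: rank the terminologies by some score, keep the top s-1 as individual labeling functions, and merge the remainder into one. The ranking criterion (term count here), the function name, and the toy terminologies are assumptions for illustration; the text above does not specify how the partition is formed.

```python
def partition_terminologies(terminologies, s, score=len):
    """Split a {name: set_of_terms} mapping into at most s groups.

    The s-1 highest-scoring terminologies stay separate; all others are
    merged into a single group. score defaults to the number of terms,
    which is an assumed ranking criterion.
    """
    if s < 1:
        raise ValueError("s must be at least 1")
    ranked = sorted(
        terminologies.items(), key=lambda kv: score(kv[1]), reverse=True
    )
    head, tail = ranked[:s - 1], ranked[s - 1:]
    groups = [(name, set(terms)) for name, terms in head]
    if tail:
        merged = set().union(*(terms for _, terms in tail))
        groups.append(("+".join(name for name, _ in tail), merged))
    return groups


# Hypothetical terminologies with toy term sets.
toy_terminologies = {
    "A": {"diabetes", "asthma", "fever"},
    "B": {"diabetes", "asthma"},
    "C": {"fever"},
}

single = partition_terminologies(toy_terminologies, s=1)    # one merged source
two = partition_terminologies(toy_terminologies, s=2)       # A kept, B and C merged
separate = partition_terminologies(toy_terminologies, s=3)  # every source on its own
```

Setting s=1 and s equal to the number of terminologies reproduces the two naive baselines, and intermediate values trade off between them.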

Table 1 contains summary statistics for all six datasets. All documents were
Table 1: Dataset summary statistics. There are k classes per task. The Documents and Entities columns indicate counts for train/validation/test splits.

Table 2 reports F1 performance for weak supervision using ontology-based labeling functions and those incorporating additional, task-specific rules. For NER tasks, models with added task-specific rules performed within 1.4-4.8 F1 points (4.1%) of models trained on hand-labeled data, and for span tasks within 3.4-13.3 F1 points. The total number of task-specific labeling functions used ranged from 9 to 27. For ontology-based supervision, the label model improved performance over SMV by 4.4 F1 points on average. BioBERT provided an additional average increase of 0.4 F1 points.

Table 2: F1 scores for ontology and task-specific rule-based weak supervision categories. Models are soft majority vote (SMV); label model (LM); weakly supervised BioBERT (WS); and fully supervised BioBERT (FS). LFs denote labeling function counts or total added task-specific rules. Bold indicates the best weakly supervised score for each approach and task. Scores are the mean and ±1 SD of five random weight initializations.

Table 3 compares Trove with these existing weakly supervised methods. Our ontology-based approach outperformed AutoNER by 1.7 F1 points. For models incorporating task-specific rules, we outperformed the best weakly supervised model, SwellShark, by 1.9 F1 points. SwellShark reported F1 scores on the i2b2 drug task of 78.3 for dictionaries and 83.4 for task-specific rules. Our best models achieved 79.1 and 88.4 F1, respectively.

Table 4: COVID-19 presenting symptoms and risk factors evaluated on Stanford Health Care emergency department notes. Bold and underlined scores indicate the best score in symptom/disorder tagging and COVID exposure classification, respectively.