DeepTag: inferring diagnoses from veterinary clinical notes

Large-scale veterinary clinical records can become a powerful resource for patient care and research. However, clinicians lack the time and resources to annotate patient records with standard medical diagnostic codes, and most veterinary visits are captured in free-text notes. The lack of standard coding makes it challenging to use the clinical data to improve patient care. It is also a major impediment to cross-species translational research, which relies on the ability to accurately identify patient cohorts with specific diagnostic criteria in humans and animals. To reduce the coding burden for veterinary clinical practice and aid translational research, we have developed a deep learning algorithm, DeepTag, which automatically infers diagnostic codes from veterinary free-text notes. DeepTag is trained on a newly curated dataset of 112,558 veterinary notes manually annotated by experts. DeepTag extends multitask LSTM with an improved hierarchical objective that captures the semantic structures between diseases. To foster human-machine collaboration, DeepTag also learns to abstain on examples when it is uncertain and defers them to human experts, resulting in improved performance. DeepTag accurately infers disease codes from free text even in challenging cross-hospital settings where the text comes from different clinical settings than the ones used for training. It enables automated disease annotation across a broad range of clinical diagnoses with minimal preprocessing. The technical framework in this work can be applied in other medical domains that currently lack medical coding resources.

Top-left: learning-to-reject model with the confidence score as input, estimating accuracy or loss. Top-right: learning-to-reject model with the post-sigmoid probabilities $\hat{y}$ as input, estimating accuracy or loss. Bottom-left: learning-to-reject model with the pre-sigmoid logits as input, estimating accuracy or loss. Bottom-right: learning-to-reject model with the global max-pooled hidden states $c$ as input, estimating accuracy or loss.

CSU Discharge Summary Format
The Colorado State University (CSU) discharge summaries contain multiple data fields, including: History, Assessment, Diagnosis, Prognosis, FollowUpPlan, ProceduresAndTreatments, PendingDiagnostics, PendingDiagnosticsComments, Diet, Exercise, DischargeStatus, DischargeDate, Medications, AdditionalInstructions, DrugWithdrawal, RecheckVisits, Complications, MedicalComplications, SurgicalComplications and AnesthesiaComplications. We filtered out fields with many null entries as well as the diagnosis-related fields, since the latter are not present in the private practice data. The remaining fields (History, Assessment, Prognosis, DischargeStatus and Medications) were used as the input to train the models.
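The field selection above can be sketched as follows. This is a minimal illustration, assuming each record is a dictionary keyed by field name and that the retained fields are simply joined with spaces; the helper name `build_model_input` and the joining scheme are assumptions, not details given in the text.

```python
# Field names are taken from the text above; everything else is illustrative.
INPUT_FIELDS = ["History", "Assessment", "Prognosis", "DischargeStatus", "Medications"]

def build_model_input(record):
    """Concatenate the retained fields of one discharge summary into one string."""
    parts = []
    for field in INPUT_FIELDS:
        value = record.get(field)
        if value:  # skip null or empty fields
            parts.append(value.strip())
    return " ".join(parts)

example = {
    "History": "Two-day history of vomiting.",
    "Diagnosis": "Gastroenteritis",  # diagnosis-related field, never used as input
    "Assessment": "Mild dehydration.",
    "Medications": None,
}
text = build_model_input(example)
```

Excluding the Diagnosis field keeps the label information out of the model input, so the CSU input resembles the private practice notes, which have no such field.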

Model Description
We formulate veterinary disease tagging as a multi-label classification problem. Given a veterinary record $X$, which contains a detailed description of the diagnosis, we try to infer a subset of diseases $y \subseteq Y$ from a pre-defined set of diseases $Y$. Inferring a subset of disease codes can be viewed as a series of independent binary prediction problems [12]. The $i$-th binary classifier learns to predict whether disease code $y_i$ is present, for $i = 1, ..., m$, where $m = |Y|$.
Our learning system has two components: a text processing module and a disease code prediction module. The text processing module uses a long short-term memory (LSTM) network, which has demonstrated its effectiveness in learning implicit language patterns from text [9]. The disease code prediction module consists of binary classifiers that are parameterized independently. An LSTM network is a recurrent neural network with a long short-term memory cell. At each time step it takes one word as input, together with the previous cell state and hidden state. Given a sequence of word embeddings $x_1, ..., x_T$, the recurrent computation of the LSTM network at time step $t$ is described in Eq 1, where $\sigma$ is the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ and $\tanh$ is the hyperbolic tangent function. We use $\odot$ to indicate the Hadamard product.
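For concreteness, one step of the recurrence can be sketched in plain Python. The sketch uses the standard LSTM formulation (input, forget and output gates plus a tanh candidate), which we assume matches Eq 1; the gate names, the parameter layout and the toy dimensions are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, b, x, h):
    """Return W x + U h + b as a list (plain Python, no external libraries)."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + b_j
        for w_row, u_row, b_j in zip(W, U, b)
    ]

def lstm_step(params, x_t, h_prev, c_prev):
    """One step of a standard LSTM cell; params maps gate name -> (W, U, b)."""
    i = [sigmoid(v) for v in affine(*params["input"], x_t, h_prev)]
    f = [sigmoid(v) for v in affine(*params["forget"], x_t, h_prev)]
    o = [sigmoid(v) for v in affine(*params["output"], x_t, h_prev)]
    g = [math.tanh(v) for v in affine(*params["candidate"], x_t, h_prev)]
    # Element-wise (Hadamard) products combine the gates with the cell state.
    c_t = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    h_t = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c_t)]
    return h_t, c_t

def zero_gate(hidden, embed):
    """All-zero (W, U, b) for one gate, used only for the toy check below."""
    return ([[0.0] * embed for _ in range(hidden)],
            [[0.0] * hidden for _ in range(hidden)],
            [0.0] * hidden)

# Toy check: with zero weights every sigmoid gate is 0.5 and the candidate is 0,
# so the new cell state is half the previous one.
params = {name: zero_gate(2, 3) for name in ("input", "forget", "output", "candidate")}
h, c = lstm_step(params, [1.0, -1.0, 0.5], [0.0, 0.0], [1.0, -2.0])
```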
An extension of this recurrent neural network with an LSTM cell is to introduce bidirectional passes [4]. Graves et al. show that introducing bidirectional passes can effectively address problems such as retaining long-term dependencies when the document is very long. We parameterize two LSTM cells with different sets of parameters: one cell is used in the forward pass, where the sequence is passed in from the beginning, $\{x_1, ..., x_T\}$, and the other is used in the backward pass, where the sequence is passed in reversed order, $\{x_T, ..., x_1\}$. At the end of both passes, the bidirectional LSTM outputs two hidden states for each input $x_t$, and we stack these two hidden states as the new hidden state for this input. After computing hidden states over the entire document, we apply global max pooling over the hidden states, as suggested by Collobert & Weston [2], so that the resulting vector aggregates information from the entire document. Assuming the dimension of the hidden state is $d$, global max pooling applies an element-wise maximum over the temporal dimension of the hidden state matrix, as described in Eq 2.
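The stacking and pooling steps can be illustrated with a short sketch. It assumes the backward-pass states have already been re-aligned so that index t refers to input position t; the function names and toy values are illustrative.

```python
def stack_bidirectional(forward_states, backward_states):
    """Concatenate the forward and backward hidden state for each position t."""
    return [f + b for f, b in zip(forward_states, backward_states)]

def global_max_pool(hidden_states):
    """Element-wise maximum over the temporal dimension of a T x d matrix."""
    return [max(column) for column in zip(*hidden_states)]

# Toy document with T = 3 positions and 2 hidden units per direction.
forward = [[0.1, -0.2], [0.4, 0.0], [-0.3, 0.5]]
backward = [[0.9, 0.0], [0.2, -0.1], [0.0, 0.3]]
H = stack_bidirectional(forward, backward)  # 3 x 4 hidden state matrix
c = global_max_pool(H)                      # document vector of length 4
```

Because the maximum is taken independently for each dimension, the pooled vector has a fixed length regardless of how long the document is.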
We then define a binary classifier for each of the 42 disease codes in our pre-defined set. Each binary classifier takes in a vector $c$ that represents the veterinary record and outputs a sufficient statistic for the Bernoulli distribution that gives the probability that the tag should be predicted. For $i = 1, ..., m$, classifier $i$ has its own weight parameter $\theta_i$ and outputs $\hat{y}_i = \sigma(\theta_i c)$ (Eq 3). We use the binary cross entropy loss averaged across all labels as the training loss. Given the model's predicted probabilities $\hat{y} \in [0, 1]^m$ and the correct binary label vector $y \in \{0, 1\}^m$, the binary cross entropy loss is $L(\hat{y}, y) = -\frac{1}{m} \sum_{i=1}^{m} [y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)]$. The decision boundary in our model is set to 0.5.
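A minimal sketch of the prediction module and its loss follows. It assumes each classifier is a single weight vector with no bias term and uses three toy disease codes instead of 42; the function names and the small clipping constant `eps` are illustrative additions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_probs(theta, c):
    """One independent classifier per disease code: sigmoid of theta_i . c."""
    return [sigmoid(sum(w * v for w, v in zip(theta_i, c))) for theta_i in theta]

def bce_loss(y_hat, y, eps=1e-12):
    """Binary cross entropy averaged over the m labels."""
    total = 0.0
    for p, t in zip(y_hat, y):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y)

def decide(y_hat, threshold=0.5):
    """Apply the 0.5 decision boundary to each code independently."""
    return [1 if p > threshold else 0 for p in y_hat]

theta = [[2.0, 0.0], [0.0, -2.0], [0.0, 0.0]]  # toy weights for m = 3 codes
c = [1.0, 1.0]                                  # toy document vector
y_hat = predict_probs(theta, c)
tags = decide(y_hat)
loss = bce_loss(y_hat, [1, 0, 0])
```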

Leveraging disease similarity
We introduce two penalties inspired by the implicit relationships between disease codes, which we capture by grouping the codes into meta-diseases. By augmenting our loss with these two penalties, we aim to increase the model's ability to predict codes that have fewer instances. We refer to them as the DeepTag meta-disease objective and the DeepTag-M meta-disease objective.
DeepTag meta-disease objective. After defining the meta-diseases for the disease codes, we can use techniques from the multi-task learning literature. Each task corresponds to the binary prediction of one of the 42 disease codes. Jacob et al. [5] hypothesized that if two tasks are similar, the task-specific parameters for these two tasks (i.e. the corresponding weights in the final neural network layer) should be close in vector space, and vice versa.
We first compute the mean vector of all disease code embeddings, $\bar{\theta} = \frac{1}{m} \sum_{i=1}^{m} \theta_i$. Each disease embedding is a weight parameter $\theta_i$ defined in Eq 3. We define $J(k) \subset \{1, ..., m\}$ as the set of disease codes that belong to meta-disease $k$. We then compute a vector for each meta-disease: for $k = 1, ..., K$, $\bar{\theta}_k = \frac{1}{|J(k)|} \sum_{i \in J(k)} \theta_i$. The within-meta-disease closeness constraint $\Omega_{within}$ is computed as the distance between the disease code embeddings and their meta-disease vector $\bar{\theta}_k$. $\Omega_{between}$ is computed as the distance between $\bar{\theta}_k$ and $\bar{\theta}$. We formulate this as an additional loss term $\Omega(\Theta)$, with three hyperparameters $\gamma_{norm}$, $\gamma_{within}$ and $\gamma_{between}$ controlling the strength of this penalty.
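The within and between terms can be sketched as follows. The sketch assumes squared Euclidean distance and an unweighted sum over meta-diseases, neither of which is fixed by the description above, and it leaves out the term scaled by the norm hyperparameter because its form is not given here. The function names and toy weights are illustrative.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def meta_disease_penalties(theta, groups):
    """theta: list of m weight vectors; groups: dict k -> list of code indices J(k)."""
    theta_bar = mean_vector(theta)
    omega_within = 0.0
    omega_between = 0.0
    for members in groups.values():
        theta_bar_k = mean_vector([theta[i] for i in members])
        # Codes in the same meta-disease are pulled toward their group mean.
        omega_within += sum(sq_dist(theta[i], theta_bar_k) for i in members)
        # Group means are compared with the global mean.
        omega_between += sq_dist(theta_bar_k, theta_bar)
    return omega_within, omega_between

theta = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]  # toy weights for 3 codes
groups = {0: [0, 1], 1: [2]}                  # two toy meta-diseases
within, between = meta_disease_penalties(theta, groups)
penalty = 1e-4 * within + 1e-4 * between      # toy strengths for the two terms
```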
DeepTag-M meta-disease objective. We propose an additional penalty following the intuition that the model should make accurate predictions for the meta-disease even when it makes mistakes on individual disease codes. Meta-disease training labels are created by checking whether any of the disease codes under the meta-disease has been tagged. Following the same logic, since the disease codes are predicted independently, we can compute the probability of the presence of a meta-disease, $\tilde{y}_k$, from the probabilities of the disease codes that belong to this meta-disease.
After computing the probability of the presence of each meta-disease, and given the set of meta-diseases $\tilde{y}$ created from our true set of disease codes $y$, we compute the binary cross entropy loss between the model's estimated meta-disease probabilities and the true meta-diseases (Eq 7). We use $\beta$ to adjust the strength of this penalty.
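One way to realize this penalty is sketched below. It assumes the "any code present" rule translates, under independence, into one minus the product of the per-code absence probabilities; that closed form is our reading of the description rather than a formula quoted from it. The function names and toy values are illustrative.

```python
import math

def meta_labels(y, groups):
    """A meta-disease is present if any of its member codes is tagged."""
    return [1 if any(y[i] for i in members) else 0 for members in groups]

def meta_probs(y_hat, groups):
    """P(meta-disease present) = 1 - prod(1 - p_i), assuming independent codes."""
    probs = []
    for members in groups:
        none_present = 1.0
        for i in members:
            none_present *= 1.0 - y_hat[i]
        probs.append(1.0 - none_present)
    return probs

def meta_penalty(y_hat, y, groups, beta, eps=1e-12):
    """beta-weighted binary cross entropy on the meta-disease level."""
    total = 0.0
    labels = meta_labels(y, groups)
    for p, t in zip(meta_probs(y_hat, groups), labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return beta * (-total / len(labels))

groups = [[0, 1], [2]]     # toy meta-diseases over 3 codes
y_hat = [0.5, 0.5, 0.1]    # predicted code probabilities
y = [0, 1, 0]              # true code labels
penalty = meta_penalty(y_hat, y, groups, beta=0.001)
```

In this toy case the model is undecided on both codes of the first meta-disease, yet the meta-level probability is 0.75, so the meta-level loss is small even though the code-level predictions are uncertain.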

Learning to abstain
In practice, it is often desirable for the model to forfeit a prediction that is likely to be incorrect. When the method is used in collaboration with human experts, the model can defer difficult cases to them, fostering human-computer collaboration. However, this is still an under-explored area of machine learning, and previous research has focused largely on binary single-label classification [3]. We formally describe the set-up and our learning-based approach in the following sections, and extend the discussion to a multi-label setting. We propose two abstention settings. Each setting computes a score $\alpha$ for each document, which we refer to as the abstention priority score. We then rank the documents by this score. When a user specifies a percentage of documents to be dropped, documents with high $\alpha$ are dropped first.
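The ranking-and-dropping step can be sketched as below. The function name and the choice to round the number of dropped documents down are assumptions made for illustration.

```python
def split_by_abstention(alphas, drop_fraction):
    """Return (kept, dropped) document indices given abstention priority scores."""
    n_drop = int(drop_fraction * len(alphas))  # round down the requested share
    # Highest alpha first: these documents are deferred to human experts.
    ranked = sorted(range(len(alphas)), key=lambda i: alphas[i], reverse=True)
    dropped = ranked[:n_drop]
    kept = sorted(ranked[n_drop:])
    return kept, dropped

alphas = [0.1, 0.9, 0.5, 0.7, 0.3]  # toy scores for five documents
kept, dropped = split_by_abstention(alphas, drop_fraction=0.4)
```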
Confidence-based abstention. Our model already outputs a probability for each disease code. If the model is well-calibrated, meaning that the output probability satisfies the constraint in Eq 8, then this probability reflects how uncertain the model is about the output: $P_{x, y \sim D}[y = 1 \mid f_t(x) = p] = p$, $\forall p \in [0, 1]$ and $\forall t$ (Eq 8). Calibration means that when the model assigns probability $p$ to a prediction being correct, then among all instances that receive this probability the model is correct a fraction $p$ of the time. A well-calibrated model's output probability therefore corresponds to the model's confidence in the correctness of its prediction. Previous research has shown that binary classifiers with a sigmoid scoring function and cross-entropy loss are often well-calibrated [10].
With calibrated $\{p(y_1), ..., p(y_m)\}$, we want to compute how confident the model is in these predictions. For each prediction, the model is more confident if $p(y_i)$ is close to 0 or close to 1. Based on this observation, we convert the probability into a confidence score with the function $g$: $g(p(y_i)) = \max\{p(y_i), 1 - p(y_i)\}$.
We can now compute the probability of the model getting $k$ disease codes correct on a single example. We enumerate all subsets of size $k$ from the entire disease code set, and for each subset compute the probability that the chosen $k$ disease codes are correct and the remaining $(m - k)$ disease codes are incorrect.
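The subset enumeration can be sketched as follows, alongside an equivalent dynamic-programming form that avoids enumerating subsets. The sketch assumes that each thresholded decision is correct with probability equal to its confidence score and that codes are independent. It only computes the distribution over the number of correct codes; how that distribution is reduced to a single abstention score is not shown.

```python
from itertools import combinations

def confidence(p):
    """Confidence score g(p) = max(p, 1 - p)."""
    return max(p, 1.0 - p)

def prob_k_correct(q):
    """dist[k] = probability that exactly k of the m decisions are correct."""
    dist = [1.0]
    for q_i in q:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - q_i)  # decision i is wrong
            nxt[k + 1] += mass * q_i      # decision i is right
        dist = nxt
    return dist

def prob_k_correct_bruteforce(q, k):
    """Same quantity by summing over every subset of k correct codes."""
    m = len(q)
    total = 0.0
    for chosen in combinations(range(m), k):
        term = 1.0
        for i in range(m):
            term *= q[i] if i in chosen else 1.0 - q[i]
        total += term
    return total

y_hat = [0.9, 0.2, 0.6]              # toy probabilities for m = 3 codes
q = [confidence(p) for p in y_hat]   # per-code probability of being correct
dist = prob_k_correct(q)
```

With 42 disease codes the number of subsets grows combinatorially, so the dynamic-programming form is the practical way to evaluate the same probabilities.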
The score $\alpha_{conf}$ is an abstention priority score because it is a valid indication of how confident the model's overall output is. We refer to this scheme as the confidence-based abstention module ("CB" in Supplemental Figure 2, "Baseline" in main manuscript Figure 4).
Learning-based abstention. Instead of computing $\alpha$ from a fixed formula, we can link the abstention priority score to a value that we care about. For example, we want to drop examples that will induce high loss or, equivalently, examples on which the predicted result has low accuracy. However, we do not have access to ground-truth answers in the real world. Instead, we propose that if the data distribution $D$ is consistent between training and deployment ($x_{test}, y_{test} \sim D$, which is the underlying assumption in calibration), then we can learn to estimate the loss or accuracy for each example. We compute a regression target for the learned abstention module from the accuracy and loss of each example in the training dataset (Eq 10), where $d(p) = 1(p > 0.5)$.
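The two regression targets can be sketched per example as below. The sketch assumes the accuracy target is the fraction of codes whose thresholded prediction matches the label and that the loss target is the label-averaged binary cross entropy; the function names are illustrative.

```python
import math

def d(p):
    """Decision function: 1 if p > 0.5, else 0."""
    return 1 if p > 0.5 else 0

def accuracy_target(y_hat, y):
    """Fraction of disease codes predicted correctly for one example."""
    return sum(1 for p, t in zip(y_hat, y) if d(p) == t) / len(y)

def loss_target(y_hat, y, eps=1e-12):
    """Label-averaged binary cross entropy for one example."""
    total = 0.0
    for p, t in zip(y_hat, y):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y)

y_hat = [0.9, 0.2, 0.6]  # toy predicted probabilities
y = [1, 0, 0]            # toy ground-truth labels, available only in training
acc = accuracy_target(y_hat, y)
loss = loss_target(y_hat, y)
```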
The abstention learning module $A$ takes an input $z$ and outputs an estimated abstention score $\alpha$. We train this module by minimizing the mean squared error with respect to the regression target. We consider four possible inputs from various parts of the DeepTag model that the abstention module can use to predict accuracy or loss without knowing the ground-truth disease codes. Two choices are obvious: the confidence scores $g(\hat{y})$ used to compute the confidence-based abstention priority score in the previous section, and the estimated probability of the presence of each disease code, $\hat{y}$, from which the confidence scores are computed via $g(\cdot)$. Since $\hat{y}$ is obtained by applying a sigmoid function to the output of the classifier, $\hat{y}_i = \sigma(\theta_i c)$, we can also use the pre-sigmoid value $\theta_i c$ as input. Finally, we hypothesize that the document representation $c$ might also contain information that helps model $A$ determine whether the document is difficult to process. We fit model $A$ to estimate $\alpha_{learn}$ on the training set of our data, using the same split as the one used to train the overall model. We then evaluate on a previously unseen test set.
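The four candidate inputs can be derived from a single forward pass, as in the sketch below. The dictionary keys and the function name are illustrative, and the classifiers are assumed to have no bias term.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def candidate_inputs(theta, c):
    """Collect the four possible inputs z for the abstention module."""
    logits = [sum(w * v for w, v in zip(theta_i, c)) for theta_i in theta]
    probs = [sigmoid(z) for z in logits]
    conf = [max(p, 1.0 - p) for p in probs]
    return {
        "confidence": conf,      # confidence scores g(y_hat)
        "probabilities": probs,  # post-sigmoid probabilities y_hat
        "logits": logits,        # pre-sigmoid values theta_i . c
        "document": list(c),     # pooled document representation c
    }

theta = [[1.0, 0.0], [0.0, -3.0]]  # toy weights for two codes
c = [2.0, 1.0]                     # toy document vector
inputs = candidate_inputs(theta, c)
```

The first three inputs have one entry per disease code, while the document representation has the dimension of the pooled hidden state, so the abstention module's input size depends on which choice is used.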

MetaMap
MetaMap is a program developed by the National Library of Medicine (NLM) [1]. It processes a document and outputs a list of the medically relevant keywords matched in that document. We use these keywords as features and map each document into a frequency-encoded bag-of-words vector. The final feature vector size is 57,235. We perform the multi-label classification task with a multi-layer perceptron (MLP) and a support vector machine (SVM) with a linear kernel 1 on these feature vectors.
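The frequency encoding can be sketched as follows. The keyword lists below are hand-written stand-ins for MetaMap output (the sketch does not call MetaMap), and the function names are illustrative.

```python
from collections import Counter

def build_vocabulary(keyword_lists):
    """Map every distinct keyword to a fixed column index."""
    vocab = sorted({kw for kws in keyword_lists for kw in kws})
    return {kw: idx for idx, kw in enumerate(vocab)}

def frequency_vector(keywords, vocab):
    """Frequency-encoded bag-of-words vector for one document."""
    vec = [0] * len(vocab)
    for kw, n in Counter(keywords).items():
        if kw in vocab:  # ignore keywords never seen when building the vocabulary
            vec[vocab[kw]] = n
    return vec

# Hypothetical keyword lists standing in for the extracted keywords.
docs = [["vomiting", "dehydration", "vomiting"], ["otitis externa", "pruritus"]]
vocab = build_vocabulary(docs)
vectors = [frequency_vector(kws, vocab) for kws in docs]
```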

Text CNN
For the convolutional neural network baseline, we use filter windows of 3, 4, and 5, each with 340 feature maps. We apply a rectified linear unit after the convolution, followed by max pooling over time. We concatenate the final representations from all filter window sizes, which results in a sentence vector of dimension 1020, comparable to the 1024-dimensional sentence vector generated by the BLSTM model. The details of our setup follow directly from the implementation of Kim et al. [6].
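The convolution, ReLU and max-over-time steps can be sketched in plain Python. The sketch uses toy window sizes and one or two feature maps per window instead of 340, and assumes the document is at least as long as the largest window; names and weights are illustrative.

```python
def relu(v):
    return v if v > 0.0 else 0.0

def conv_max_over_time(embeddings, filters, bias, window):
    """One window size: convolve, apply ReLU, then max-pool over positions."""
    pooled = []
    for weights, b in zip(filters, bias):
        best = 0.0  # ReLU outputs are non-negative, so 0 is a safe floor
        for start in range(len(embeddings) - window + 1):
            flat = [v for vec in embeddings[start:start + window] for v in vec]
            best = max(best, relu(sum(w * v for w, v in zip(weights, flat)) + b))
        pooled.append(best)
    return pooled

def text_cnn_features(embeddings, filters_by_window):
    """Concatenate the pooled feature maps of every window size."""
    features = []
    for window in sorted(filters_by_window):
        filters, bias = filters_by_window[window]
        features.extend(conv_max_over_time(embeddings, filters, bias, window))
    return features

embeddings = [[1.0], [2.0], [3.0], [-1.0]]  # T = 4 tokens, embedding size 1
filters_by_window = {
    2: ([[1.0, 1.0], [-1.0, -1.0]], [0.0, 0.0]),  # two maps for window 2
    3: ([[1.0, 0.0, -1.0]], [0.5]),               # one map for window 3
}
features = text_cnn_features(embeddings, filters_by_window)
sentence_dim = 3 * 340  # three window sizes with 340 feature maps each
```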

Main Experiment
We initialize our model with 100-dimensional pretrained GloVe word vectors [11], and we initialize words in the CSU training data that have no pretrained vector with vectors sampled from a multivariate normal distribution. We allow all word embeddings to be updated during training. We use a recurrent neural network with a 512-dimensional LSTM cell and set the feed-forward dropout rate to 20%. We use a batch size of 32 and clip gradients at 5. We use the Adam optimizer [7] with a learning rate of 0.001.
We trained all models for a maximum of 5 epochs with early stopping; the maximum number of epochs was chosen by observing performance on the validation dataset. After selecting the best hyperparameters on the validation set, we evaluate each model's in-domain generalization performance on the CSU test dataset and its out-of-domain generalization performance on the PP dataset.
After the hyperparameter search, we report models with the hyperparameters that perform well on each dataset. We train each model five times and report the averaged result. For the CSU dataset, we find that $\beta = 0.001$ works best for DeepTag-M, and $\gamma_{norm} = 10^{-5}$, $\gamma_{between} = 10^{-4}$, $\gamma_{within} = 10^{-4}$ work best for DeepTag. For the PP dataset, we find that $\beta = 0.0001$ works best for DeepTag-M, and $\gamma_{norm} = 10^{-4}$, $\gamma_{between} = 10^{-3}$, $\gamma_{within} = 10^{-3}$ work best for DeepTag. We report these results in Table 2 in the main manuscript.

Abstention Experiment
We use a 3-layer neural network with SELU activation [8] to parameterize the abstention model $A$. The learning-to-abstain model is trained on various outputs generated by the DeepTag system. All configurations of the learning-to-abstain model are trained for 3 epochs on the training set and evaluated on the unseen test set.
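A forward pass of such a network can be sketched as below. The sketch reads "3-layer" as three linear layers with SELU between them and a scalar output; the hidden sizes, the initialization and the function names are assumptions, and only the SELU constants are the standard published values.

```python
import math
import random

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * (x if x > 0.0 else SELU_ALPHA * (math.exp(x) - 1.0))

def dense(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]

def init_layer(rng, n_in, n_out):
    """Gaussian weights with variance 1 / n_in and zero biases (assumed init)."""
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    return W, [0.0] * n_out

def abstention_score(layers, z):
    """SELU after every layer except the last, which outputs a scalar score."""
    h = z
    for W, b in layers[:-1]:
        h = [selu(v) for v in dense(W, b, h)]
    W, b = layers[-1]
    return dense(W, b, h)[0]

def squared_error(pred, target):
    """Per-example regression loss against the accuracy or loss target."""
    return (pred - target) ** 2

rng = random.Random(0)
sizes = [42, 16, 16, 1]  # hypothetical: 42 inputs, two hidden layers, one output
layers = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
z = [0.5] * 42           # toy input, e.g. one value per disease code
alpha_hat = abstention_score(layers, z)
```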

Coding of Private Practice (PP) notes
Our guidelines for applying diagnostic codes to the private practice dataset were derived following consultations and a review of coding guidelines from CSU, as well as consultation with an additional coding professional who helps maintain the SNOMED-veterinary extension. They are summarized below:
1. Implied assessments/problems are not coded unless there is direct evidence to support those diagnoses in the record or they are noted in the assessment or diagnosis fields. At minimum, diagnoses are applied if there is support from the physical exam and the primary clinician considers it a problem in the patient's "Assessment" section of the notes. It is preferred that the assessment/problem is also addressed in the "Plan" section of the note by way of treatments or results from additional diagnostic tests (which are in the "Plan"), but not all diagnoses are addressed there. For example, if the clinician applies a free-text diagnosis of obesity, the plan includes weight loss and there is a 7/8 BCS, it is appropriate to code obesity as a problem.
2. Tentative diagnoses are not coded.
3. Historical findings or diagnoses are not coded on a particular visit unless they represent an active problem.
4. Only diagnoses are coded, not signs, symptoms or presenting complaints.