Question-answering system extracts information on injection drug use from clinical notes

Background Injection drug use (IDU) can increase mortality and morbidity. Therefore, identifying IDU early and initiating harm reduction interventions can benefit individuals at risk. However, extracting IDU behaviors from patients' electronic health records (EHR) is difficult because IDU is not captured in structured data, such as International Classification of Disease (ICD) codes, and is most often documented in unstructured free-text clinical notes. Although natural language processing can efficiently extract this information from unstructured data, no validated tools exist. Methods To address this gap in clinical information, we design a question-answering (QA) framework to extract information on IDU from clinical notes for use in clinical operations. Our framework involves two main steps: (1) generating a gold-standard QA dataset and (2) developing and testing the QA model. We use 2323 clinical notes of 1145 patients curated from the US Department of Veterans Affairs (VA) Corporate Data Warehouse to construct the gold-standard dataset for developing and evaluating the QA model. We also demonstrate the QA model's ability to extract IDU-related information from temporally out-of-distribution data. Results Here, we show that for a strict match between gold-standard and predicted answers, the QA model achieves a 51.65% F1 score. For a relaxed match between the gold-standard and predicted answers, the QA model obtains a 78.03% F1 score, along with 85.38% precision and 79.02% recall scores. Moreover, the QA model demonstrates consistent performance when subjected to temporally out-of-distribution data. Conclusions Our study introduces a QA framework designed to extract IDU information from clinical notes, aiming to enhance the accurate and efficient detection of people who inject drugs, extract relevant information, and ultimately facilitate informed patient care.


Introduction
Injection drug use (IDU) is a critical health concern in the United States and internationally [1]. Most people begin using illicit drugs through other modes of administration, such as smoking, intranasal absorption, or oral ingestion. As dependence grows, individuals tend to prefer the intravenous (IV) route of drug administration, injecting drugs directly into the veins, as it offers stronger and more immediate effects [2]. The number of people who inject drugs increased almost fivefold from 2011 to 2018 according to published estimates [3], whereas the number of IDU-related overdoses increased eightfold from 2000 to 2018 [4].
IDU is a highly dangerous practice that can lead to serious medical conditions such as abscesses and cutaneous infections, scarring and needle tracks, endocarditis, HIV/AIDS, Hepatitis C, overdose, and death [1,5,6,7,8]. An increase in IDU is also associated with an increase in morbidity and mortality [9,10,11].
Accurately identifying IDU behaviors in people who inject drugs is crucial for risk assessment and for detecting patients who can benefit from harm reduction interventions to potentially prevent IDU-related morbidity and mortality [12,13]. In the literature, IDU-related information extraction has been studied alongside other socio-behavioral determinants of health (SBDH). Considering SBDH such as prior incarceration, substance use (regardless of administration mode), treatment attitude, psychological distress, and interpersonal violence improves the prediction of patient mortality, medication adherence, hospital readmission, and suicide attempts [14,15].
Despite the growing interest, SBDH such as IDU are not identifiable in patients' electronic health records (EHRs) through ICD codes; although not systematically assessed, they can be documented in clinical notes [16,17]. While structured data fields derived from EHRs may provide some information about risky drug use behaviors and morbidities related to IDU, the clinical note is the only place where IDU can be explicitly documented [12]. Despite being clinically meaningful and having the potential to identify patients who can benefit from harm reduction interventions, these data points are often difficult for care providers to retrieve from EHRs, and their exclusion may result in an overall reduced quality of care [18,19].
Natural language processing (NLP) can help extract SBDH-related information from clinical notes and expand the utility of such information in patient care [20,21,22]. NLP is a branch of computer science that involves automated learning, understanding, and generation of natural languages, enabling interactions between machines and human languages. Although NLP covers a variety of tasks involving unstructured text data (e.g., event prediction [23], entity recognition [24], question-answering (QA) for information extraction [25], and relation extraction [26]), in this article we use the extractive question-answering (extractive QA) task to automatically extract information related to IDU from clinical notes in EHRs. To avoid redundancy, for the rest of the paper we use "QA" in place of "extractive QA". In this QA task, given a query and a clinical note, a QA model returns the relevant answer verbatim from the note as the extracted information. Thus, given a query, a QA system must read and comprehend the clinical note and then extract a span of consecutive words relevant to the query from that note (Figure 1).
We use QA to address the information extraction problem for the following reasons. Text in clinical notes is highly unstructured; for example, information about injection drug names can appear in the notes in many forms, such as "opioids: denies recent use, hx ivdu, claims last use years ago. other drugs: hx methamphetamine use, has been using daily via injecting since relapse in December", "ivdu (cocaine/methamphetamine)", "reports using iv meth", "iv cocaine mixed with heroin use", "used meth by iv drug use", or "history of daily heroin use, prior ivdu". Given the demonstrated success of QA models in extracting information of diverse forms from clinical notes [27], we chose to focus on the QA task in NLP. Moreover, one potential implementation of this work would be to incorporate the developed model into a chatbot framework, enabling clinicians to inquire about IDU behavior in people who inject drugs at the point of care by posing questions with various syntactic structures. This would help clinicians identify people who inject drugs and pinpoint their IDU-related status.
Although not specific to IDU, several studies have focused on identifying clinical concepts or information on substance use disorders (SUD) using NLP [28,22]. In these studies, various NLP techniques have been used to extract SUD-related information. Stemming algorithms have been used to identify words and phrases associated with mental illness and substance use in clinical notes [29,30]. Dependency structures have been utilized to capture relationships between phrases and tokens in substance use statements [28]. Word-embedding models have been employed to identify alcohol and substance abuse status [21]. Machine reading comprehension has been applied to extract clinical concept categories and relation categories, such as relations of medications with adverse drug events and SBDH [22]. Multi-label text classification and sequence labeling have been used to identify sentences containing labeled arguments about drug use [31].
Topic modeling and keyword matching techniques have been leveraged to extract drug use-related information [32]. Techniques such as active learning [33], multi-label classification [34,35,36,37], concept extraction, and joint extraction of entities and relations have been employed to extract information about drug use [38]. Researchers have also used NLP-specific techniques to detect opioid use disorder and predict overdose [39,40,41,42,43,44,45,46,47,48,49,50]. In the literature, we came across one research study that focused exclusively on IDU. That study utilized rule-based algorithms, such as regular expressions (RegEx), NegEx [51], and N-grams, to search for a very limited set of IDU-related terms, with the objective of identifying people who inject drugs (PWIDs) [12]. In our study, on the other hand, we focus on extracting a broad spectrum of information on injection drug use from clinical notes. This encompasses details such as drug names, active/historical use, frequency of use, risky needle-using behavior, visible signs of IDU, last use, skin popping, harm reduction interventions, and existence of IDU. Since evidence of IDU cannot be found in structured EHR data and must therefore be inferred from clinical notes, this study's sole focus on IDU aims to help understand how this phenomenon is represented in unstructured notes and to complement prior approaches whose NLP techniques are less generalizable to this population. To the best of our knowledge, to date, there has been no published attempt at developing a QA algorithm to extract IDU-related information from clinical notes.
Figure 1: A sample clinical note featuring questions about the IDU behavior in people who inject drugs, with extracted IDU-related information color-coded in the note.
To solve the QA task, we use transformer-based deep learning models [52,53], which are among the most streamlined ways to solve QA tasks and achieve competitive performance in extracting targeted information from different types of biomedical documents, such as scholarly articles [52,54], clinical practice guidelines [25], and electronic medical records [27]. Nonetheless, evidence suggests that supervised deep learning models require high-quality, large-scale annotated datasets to achieve good performance in any task [55,56,53], and the absence of such a dataset for our targeted QA task poses a critical challenge. An annotated QA dataset comprises data samples, each containing a context (e.g., a clinical note), a question, and an answer extracted verbatim from the context (i.e., the extracted information). In addressing the challenge posed by the limited availability of annotated QA data for constructing an effective QA model, our study takes a two-fold approach. First, we built a high-quality gold-standard QA dataset in collaboration with a subject matter expert (SME), facilitating model training and testing. The dataset includes clinical notes as contexts and question-answer pairs specific to IDU. Then, using this meticulously curated gold-standard dataset, we pursue the primary objective of this study: developing and assessing the QA system for IDU-related information extraction from clinical notes. We also perform an error analysis to identify the strengths and weaknesses of our QA system, providing valuable insights to guide future research endeavors.

Methods
In this section, we elaborate on the formulation of this study and its two components: (i) gold-standard dataset generation and (ii) modeling (Figure 2). Furthermore, we outline the specifications of the gold-standard dataset, the experimental setup, and the metrics used to assess the performance of the QA models.

Problem formulation
We formulate the information extraction task as a QA problem in NLP in the following manner: Given a question on patients' behavior about IDU and a clinical note with IDU-related information (i.e., the context), a QA system retrieves the relevant information (i.e., the answer) from the provided note.
For example, given the question "Does the patient have a history of IDU?" and the clinical note "pt X, 200 yrs old . . . he has a history of smoking with 50 pack years, quit 10 years ago . . . social ethanol user . . . no history of idu . . . remote history of marijuana use . . . family hx: . . . physical exam: . . . provider: name.", the QA system is expected to return the answer "no history of idu" verbatim from the note.

Gold-standard dataset generation
QA is a supervised NLP task and as such requires an annotated gold-standard dataset for model development and inference. In a QA dataset, each sample consists of the context, a question, and an answer, with the question-answer pairs serving as annotations. To generate a gold-standard dataset from clinical notes, which serve as the context, we employ a three-stage process outlined in Figure 2: (1) question collection, (2) note enrichment, and (3) gold-standard answer extraction.

Question collection
We initialize the process of question collection by asking SMEs what kinds of IDU information they are interested in extracting from the clinical notes. We then generate a set of questions based on their interests. Table 1 shows the nine categories of interest. In the rest of the paper, we use the term "Query Group" to refer to these categories of interest. Each query group targets one category of information in the notes. For example, the query group "drug names" targets any information about IV drug names in the notes. In our gold-standard dataset, we include multiple variations of questions for each query group. For example, for the query group "drug names", we have five different variations of questions: "To what IV drugs has the patient been exposed?", "Which IV drugs has the pt used?", "Which intravenous drugs has the patient used?", "Which injection drugs?", and "Which illicit drugs has the patient injected?".
We do this for the following reasons. We anticipate our system being used as a standalone application, a more user-friendly QA tool for collecting IDU evidence, capable of handling different variations of questions posed by clinicians. Furthermore, we expect that different variations of questions for each query group will increase the QA model's flexibility, comprehensiveness, and robustness, ultimately enhancing its performance in real-world applications, as follows: (i) Users may pose questions in different ways based on their preferences or understanding. A QA model trained with diverse question variations is more adaptable and capable of accommodating the linguistic diversity inherent in user queries. (ii) Including variations of questions during training helps the QA model become more robust by exposing it to the diverse ways the same question can be asked, preparing the model to handle real-world scenarios where questions may be phrased differently but still seek the same information. (iii) Variations of questions during training enable the QA model to generalize its understanding. Instead of memorizing specific phrasings, the model learns the underlying patterns and associations between questions and answers, improving its ability to respond accurately to novel queries.
We use abbreviations, synonyms, and syntactical variations to introduce variations in the questions for each query group, as follows:
Abbreviations: "Is the patient actively using intravenous drugs?" → "Is the pt actively using intravenous drugs?", "Is the patient actively using intravenous drugs?" → "Is the patient actively using iv drugs?", etc.
Synonyms: "Does the pt have a history of using intravenous drugs?" → "Does the pt have a history of using injection drugs?", "Does the pt have a history of IDU?" → "Does the pt have a history of IVDU?".
Syntactical variations: "Which iv drugs has the patient used?" → "To which iv drugs has the patient been exposed?", "Does the pt have a history of IVDU?" → "Has the pt ever used IV drugs?".
It should be noted that when identifying abbreviations and synonyms to be used in questions, we only choose terms and variants that clinicians commonly use. Examples include "patient" and "pt", "intravenous" and "iv", "history" and "hx", and "IVDU" and "IDU". And, to ensure that we accurately captured the nuances of possible language usage in the syntactical variations of the questions, we sought the guidance of SMEs.

Note enrichment
The contexts in the gold-standard dataset are clinical notes that contain some IDU-related information. As such, we select a cohort of patients whose notes have a higher chance of containing IDU-related information, such as patients who have been diagnosed with Hepatitis C. To ensure that the clinical notes include information relevant to IDU and to narrow down the notes accordingly, we use a list of keywords/phrases indicative of IDU (refer to Table 2) that was developed by SMEs. The SMEs followed an iterative approach to create this list. They began by compiling a list of common terms related to IDU, which they then refined by reviewing the associated snippets. They removed terms that caused excessive noise, such as "slamming" and "drug paraphernalia", and added terms like "skin popping" to enhance granularity. The experts received extensive training to sort and/or define the snippet categories, and they validated the terms to ensure their accuracy.
For our study, we assumed that the presence of any of these IDU-related keywords indicates the presence of relevant IDU information in the note. Hence, we discard the notes that do not contain any of the words/phrases provided in Table 2, as this suggests the likely absence of any IDU-related information in those notes. To enhance the readability of clinical notes and make them more suitable for automated processing, we conduct a rigorous manual exploration of the final set of notes, identifying some common patterns that can help clean them using RegEx. It is important to note that, to preserve crucial information in the clinical notes, we perform minimal data cleaning, as follows: (i) Remove newlines following within-sentence punctuation marks, such as a comma, semicolon, or colon. For instance, removing the newline ("\n") highlighted in the sentence "Veteran reported using iv meth,\n iv cocaine and etoh.". (ii) Remove newlines appearing before punctuation marks, such as a period, comma, or semicolon. For example, removing the newline ("\n") highlighted in the sentence "Veteran reported using iv meth, iv cocaine and etoh\n.". (iii) Remove newlines positioned between words within the same sentence. For example, removing the newline ("\n") highlighted in the sentence "Veteran reported\n using iv meth, iv cocaine and etoh.". (iv) Consolidate multiple consecutive occurrences of newlines, white spaces, or punctuation marks into single instances. For example, replacing multiple periods with a single period in "Veteran reported using iv meth, iv cocaine and etoh.............". We perform these steps to clean all the notes used for training, validation, and testing.
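As a rough illustration, the four cleaning steps above can be sketched with RegEx substitutions. This is a minimal sketch, not the study's actual cleaning code; in particular, step (iii) is simplified, since reliably deciding whether two words belong to the same sentence is harder than a single pattern.

```python
import re

def clean_note(text: str) -> str:
    """Minimal note cleaning, mirroring steps (i)-(iv) described above."""
    # (i) remove newlines following within-sentence punctuation (, ; :)
    text = re.sub(r"([,;:])[ \t]*\n+", r"\1 ", text)
    # (ii) remove newlines appearing just before punctuation (. , ;)
    text = re.sub(r"[ \t]*\n+[ \t]*([.,;])", r"\1", text)
    # (iii) remove newlines between words (simplified same-sentence check)
    text = re.sub(r"(\w)[ \t]*\n+[ \t]*(\w)", r"\1 \2", text)
    # (iv) consolidate repeated periods and whitespace into single instances
    text = re.sub(r"\.{2,}", ".", text)
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text

print(clean_note("Veteran reported using iv meth,\n iv cocaine and etoh............."))
# → Veteran reported using iv meth, iv cocaine and etoh.
```

Applying the function to the paper's other examples ("etoh\n." and "reported\n using") yields the same cleaned sentence.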

Gold-standard answer extraction
The next step in our dataset generation process is to extract gold-standard answers (i.e., information related to IDU) from the clinical notes. Clinical notes are inherently lengthy, and manually extracting the gold-standard answers from them requires a substantial amount of time, rendering the process unfeasible. Therefore, we devise a pre-annotation strategy involving an automated step-by-step answer extraction process that integrates rule-based NLP techniques. The primary objective of this phase is to substantially reduce the manual annotation/review effort. Nevertheless, to ensure the utmost quality of the gold-standard dataset, the outputs from this pre-annotation phase, along with the associated questions, underwent subsequent manual review and correction by a subject-matter expert with a PhD in Psychology and an extensive background in substance use disorder, counseling, and treatment. Our pre-annotation strategy is based on three assumptions:
Assumption 1: Our QA task only tackles information extraction (i.e., answering questions) from one single place (a sentence) in the note at a time.
Assumption 2: The inquired information can be found in a single sentence in the note. This assumption stems from our rigorous manual exploration of the notes during the note enrichment step, where we identified RegEx patterns. Our observation indicates that, in most instances, a single sentence per question suffices to capture the relevant answer. Nonetheless, we acknowledge that this straightforward sentence selection process may not always be optimal. Unstructured clinical notes often deviate from grammatical rules. Additionally, information presentation in these notes may vary, adopting styles such as questionnaires or bulleted lists. As a result, a single sentence in the traditional sense occasionally leads to either a larger text segment or a fragmented part of a single piece of information. These instances lead to the inclusion of irrelevant or incomplete information in the answers, and we address and rectify these issues during our manual review phase.
Assumption 3: If the note contains IDU-related information in multiple locations, each is considered a separate answer string. Furthermore, multiple answer strings from the same note are expected to contain different kinds of information that should be answered by different questions. For example, in the note snippet "pt has a history of smoking with 50 pack years, quit 10 years ago . . . social ethanol user . . . has h/o ivdu . . . remote history of marijuana use . . . last used iv meth 2 years ago . . .", there are two locations where IDU-related information can be found: "has h/o ivdu" and "last used iv meth 2 years ago". In such cases, we consider them as separate answers that are retrieved when asked the following questions: "Does the pt have a history of IDU?" and "When did the pt last use IV drugs?".
Given the clinical notes, we extract the automated gold-standard answers using rule-based NLP techniques as follows:
Step 1: Tokenize the sentences in the notes. Here, we define "sentence" in the traditional sense, ending with a period. Therefore, for sentence tokenization, we use periods to indicate the end of a sentence.
Step 2: Identify sentences that contain any of the IDU keywords from Table 2 using regular expression string matching and discard the rest.
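Steps 1 and 2 can be sketched as follows. This is a minimal sketch; the keyword list below is a small illustrative subset, not the full list in Table 2.

```python
import re

# Illustrative subset of IDU-indicative keywords (the full list is in Table 2)
IDU_KEYWORDS = ["ivdu", "idu", "iv drug", "injection drug", "skin popping", "track marks"]
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in IDU_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def candidate_sentences(note: str) -> list:
    # Step 1: tokenize on periods, following the paper's end-of-sentence rule
    sentences = [s.strip() for s in note.split(".") if s.strip()]
    # Step 2: keep only sentences containing at least one IDU keyword
    return [s for s in sentences if KEYWORD_RE.search(s)]

note = ("pt lives with family. quit smoking 10 y ago. "
        "hx ivdu, last use 2 years ago. occ etoh use.")
print(candidate_sentences(note))  # → ['hx ivdu, last use 2 years ago']
```

The word boundaries (`\b`) prevent short keywords such as "idu" from matching inside unrelated words.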
Step 3: At this point, the sentences containing the IDU keywords could ideally be considered the gold-standard answers (i.e., extracted information relevant to IDU). Nonetheless, while our primary aim is to extract IDU-related information from the notes, we also want the extracted information to be as precise as possible, containing less nonessential information. A full-sentence answer is likely to include nonessential information, which can be further reduced by using parsing rules. Parsing rules refer to NLP techniques that identify specific patterns of text within a string that represent the concepts of interest, while ignoring the remaining text. An example of removing nonessential information from the answer is transforming the sentence "social history: pt lives with family in [location], quit smoking 10 y ago, occ etoh, .... hx methamphetamine use, has been using daily via injecting since relapse in December." into the phrase "hx methamphetamine use, has been using daily via injecting since relapse in December".
To create the parsing rules in this study, we randomly sample a set of sentences and focus on identifying specific phrases that occur together before or after the IDU keywords and that modify or provide information crucial to the IDU-related history of the patient (refer to Table 1). These phrases can be adjacent to or distant from the keywords, for example, "pt lives with family, denies ivdu." vs. "pt lives with family, denies any tobacco, etoh or ivdu." In both examples, the phrase "denies" provides crucial information on the IDU behavior of the patient.
In Table 3, we provide a detailed list of these phrases along with the targeted pattern type, parsing rules, and examples of how they help reduce the nonessential information from the answers.The parsing rules mainly focus on identifying patterns stating negative IDU mentions, temporal information, opioid use disorder specific to IDU, and status of track marks.
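By way of illustration only, one rule in this style for negation patterns might look as follows. The cue phrases and the 40-character window are hypothetical choices for this sketch, not the actual rules listed in Table 3.

```python
import re

# Hypothetical negation rule: a cue such as "denies" or "no history of"
# followed, within a short window, by an IDU keyword.
NEGATION_RULE = re.compile(
    r"(?:denies|no history of|no h/o)\b[^.]{0,40}?\b(?:ivdu|idu|iv drugs?)",
    re.IGNORECASE,
)

def trim_answer(sentence: str) -> str:
    """Return the matched concise span, or the full sentence as a fallback."""
    match = NEGATION_RULE.search(sentence)
    return match.group(0) if match else sentence

print(trim_answer("pt lives with family, denies any tobacco, etoh or ivdu"))
# → denies any tobacco, etoh or ivdu
```

When no rule fires, the full sentence is kept, matching the paper's default of full-sentence answers that are later corrected during manual review.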
Although these parsing rules extract correct, concise gold-standard answers from the clinical notes in numerous cases, manual review reveals instances where the rules fail to accurately identify these answers. This discrepancy is primarily attributable to the unstructured nature of information within the notes.

Question-to-answer mapping
Finally, to generate the labels (question-answer pairs) of our gold-standard dataset, we create mappings between the questions from Section Question collection and the gold-standard answers from Section Gold-standard answer extraction. We achieve this by considering the query groups in Table 1. For each query group, we identify a group of words in the gold-standard answers that are most likely to provide the information inquired by that query group. To compile this group of words, we engage in meticulous manual exploration, reading sentences containing IDU keywords. Depending on the kind of information we are interested in (reflected by the query groups), these words can be either the keywords in Table 2 or the words (Table 3) that co-occur with the keywords and help convey the information inquired by the user. For example, the co-occurring words "daily" and "last" describe the "frequency of use" and the "last use" of IDU, respectively.
Thus, for each query group, we decide on a group of words that are most likely to help convey the inquired information and map the answers that contain these words to the questions in that query group (Table 1). The resulting compilation is presented in the "Words in Gold-standard Answers Most Likely to Provide Inquired Information" column of Table 4. It is important to note, however, that this list is not exhaustive and represents only what we observed during our exploration, not an all-encompassing collection of potential phrases indicating the inquired information. Considering this, in our manual review phase, we manually correct annotations that are overlooked or mislabeled by these rules.
In Table 4, we present the mappings between the query groups and the words in gold-standard answers that are most likely to provide the inquired information. We also demonstrate sample answers for each mapping. Note that the answers in one query group and the answers in a different query group may not be mutually exclusive. This is because, if we find words in an answer that belong to multiple query groups, then that answer is mapped to all questions from these query groups. For example, the first sample answer from Table 4, "recent ivdu with meth and heroin", contains the words "recent" and "heroin"/"meth" from the query groups "active/historical use" and "drug names", respectively. Hence, this answer is mapped to all questions in these two query groups.
The well-known ConText rules [57] in the literature use a similar rule-based approach to identify the negation or temporality of a condition. They used a specific set of words tailored to the types of notes used in their study. Although the words utilized in our study share some commonalities, they exhibit notable differences from those employed in the ConText algorithm. This distinction arises from variations in the notes used in our experiments and the specific information we target to extract from the notes. Our study exclusively focuses on injection drug use. In contrast, the error analysis of ConText indicates its unsatisfactory performance in identifying temporality related to "chronic conditions and risk factors, i.e., alcohol, drug" in clinical notes. Additionally, while ConText explicitly identifies historical versus recent conditions, our question-answering system concentrates on extracting any temporal information regarding injection drug use, leaving the determination of whether the status is recent or historical to clinicians.
Regarding the query group "last use", it is crucial to note that a patient may have multiple note entries, each with its own last use.Given our study's emphasis on extracting information from one clinical note at a time, the definition of "last use" is confined to "last use per note".

Table 4: Mappings between the query groups and the words in gold-standard answers most likely to provide the inquired information, with example answers for each mapping.

After generating the labels (i.e., question-answer pairs), we manually review the whole dataset in collaboration with a subject-matter expert to ensure that our gold-standard dataset is of high quality and accuracy.

Modeling with question-answering system
In the next step of our study, we develop the QA model for extracting IDU-related information using the gold-standard QA dataset from Section Gold-standard dataset generation. We use Bidirectional Encoder Representations from Transformers (BERT) [53]-based deep learning QA models, where the feature extractor is a trainable pre-trained BERT-based language model and the QA task layer is a single-layer feed-forward neural network.
Provided a sequence of tokens (words or pieces of words) in a question and a clinical note, the QA model returns the start and end tokens of the answer span. Any text between and including the start and end tokens is then considered the answer (i.e., the extracted information). Together, the question and the note must fit within the maximum allowable number of input tokens in these BERT-based QA models, which is 512. To handle samples with longer clinical notes, we follow a widely known technique in QA modeling: a sliding window with a document stride [53].
Below we provide a brief description of this technique: Given an input question consisting of 20 tokens, the remaining allowable number of input tokens for the note is limited to 492 (512 minus the 20 tokens in the question). If the note exceeds this limit, we employ a sliding window technique to split it into multiple chunks using a document stride of 128 tokens. The document stride determines the starting token of each subsequent chunk. After this preprocessing step, each chunk prepended with the original question tokens is considered a separate data sample.
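The chunking described above can be sketched at the token level as follows. This is a minimal illustration; in practice, tokenizers such as those in the huggingface library perform this windowing internally.

```python
def sliding_window_chunks(note_tokens, question_tokens, max_len=512, stride=128):
    """Split an over-long note into overlapping chunks for BERT-style QA.

    Each chunk is prepended with the question tokens so every resulting
    sample fits within the model's maximum input length.
    """
    budget = max_len - len(question_tokens)  # tokens left for the note
    chunks, start = [], 0
    while start < len(note_tokens):
        chunk = note_tokens[start:start + budget]
        chunks.append(question_tokens + chunk)
        if start + budget >= len(note_tokens):  # note fully covered
            break
        start += stride  # each subsequent chunk starts `stride` tokens later
    return chunks

# A 20-token question with a 700-token note yields 3 overlapping samples
question = ["q"] * 20
note = [f"t{i}" for i in range(700)]
samples = sliding_window_chunks(note, question)
print(len(samples))  # → 3
```

Because the stride (128) is smaller than the note budget (492), consecutive chunks overlap, so an answer span cut off at one chunk boundary is still seen whole in a later chunk.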

Dataset statistics
We use clinical notes sourced from the VA Corporate Data Warehouse (CDW) to construct the gold-standard dataset. The selected notes correspond to the period of January 2022 and belong to patients with a Hepatitis C diagnosis. The identification of Hepatitis C positive patients is performed using ICD-10 codes. We select the cohort of patients with Hepatitis C because their clinical notes are more likely to include information related to IDU. As explained in Section Note enrichment, we narrow down the clinical notes using a list of keywords/phrases indicative of IDU (refer to Table 2).
To reduce computational overhead during training, and because unusually large notes (determined by the outliers in the distribution of note lengths) may contain templated nonessential information that is not relevant to any specific patient, we remove some outlier notes based on the interquartile range of the note lengths. We later show in Section Error analysis that note length does not affect the performance of the model at the time of inference.
We also analyze the types of notes included in this study. Our analysis reveals that there are 411 different types of notes. Figure 3 displays the 20 most frequently encountered note types in this study. Notably, internal medicine notes and primary care notes emerge as the two most prevalent types. We also find that addendum notes rank third in frequency. Addendum notes serve as supplements to notes of other types.
Table 5 shows the statistics of our gold-standard dataset. Our cohort consists of 1145 patients with a total of 2323 notes that have an average length of 1013 words. Words are identified based on whitespace. In addition, we examine the distribution of the query groups outlined in Tables 1 and 4 within the gold-standard dataset. This analysis is illustrated by the pie chart depicted in Figure 4. As shown, the dataset is dominated by QA pairs related to "active/historical use". Following closely behind are QA pairs about "existence of IDU" and "drug names", whereas the least frequent QA pairs in the dataset are those pertaining to "skin popping" and "harm reduction interventions".

Experimental setup
For experimentation, we divide our gold-standard dataset into train, validation, and test sets using a 70-10-20 split based on patients to avoid any data leakage. To implement the QA models, we use PyTorch [60] and the pre-trained language models from the Hugging Face API [61].
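A patient-level split of the kind described above can be sketched as follows; the function name, the random seed, and the dictionary-based note-to-patient mapping are illustrative choices, not the study's actual implementation. The key property is that every note of a given patient lands in exactly one split.

```python
import random

def split_by_patient(note_to_patient, seed=13, fractions=(0.7, 0.1, 0.2)):
    """Split notes into train/validation/test sets by PATIENT, so that all
    notes of one patient fall into exactly one split (no data leakage).

    `note_to_patient` maps note_id -> patient_id. The fractions follow the
    paper's 70-10-20 split; the seed is an arbitrary illustration."""
    patients = sorted(set(note_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n = len(patients)
    n_train, n_val = int(fractions[0] * n), int(fractions[1] * n)
    train_p = set(patients[:n_train])
    val_p = set(patients[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for note, pid in note_to_patient.items():
        key = "train" if pid in train_p else ("val" if pid in val_p else "test")
        splits[key].append(note)
    return splits
```

Splitting by note instead of by patient would let near-duplicate notes of the same patient appear in both train and test sets, inflating scores.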
Based on the statistics of our gold-standard dataset, we choose 512 as the maximum sequence length, 20 as the query length, and 100 as the answer length. After reviewing the hyperparameters utilized in various QA tasks as outlined in [62,63,64,52,65,66,67,68,53,25], we set the document stride to 128 and opt for a batch size of 32, a learning rate of 3e-5, and a training epoch count of 5. We performed all experiments using a single GPU on a Linux virtual machine equipped with two GRID V100-32C GPUs.
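The document stride governs how a note longer than the 512-token maximum is broken into overlapping windows. A minimal sketch of this windowing is shown below, interpreting the stride of 128 as the step between successive window starts (the classic SQuAD preprocessing convention; note that the Hugging Face tokenizer's `stride` argument instead counts the number of overlapping tokens, so this is one of two possible readings).

```python
def sliding_windows(tokens, max_len=512, doc_stride=128):
    """Break a tokenized note into overlapping windows of at most
    max_len tokens, advancing the window start by doc_stride tokens,
    so each window overlaps the previous one by max_len - doc_stride."""
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window reaches the end of the note
        start += doc_stride
    return windows
```

Each window is then paired with the question, and the model's best span across all windows is taken as the answer.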

Metrics
To assess the performance of QA models in extracting IDU-related information, we use a strict matching criterion to compute the F1 score [69]. It involves verifying whether the prediction precisely matches the gold-standard answer character by character, resulting in a strict F1 score per sample that can be either 1 or 0. Additionally, we use a relaxed matching criterion to measure the F1, precision, and recall scores [69]. A relaxed match determines whether there is any overlap between the prediction and the gold-standard answer. The recall or sensitivity score per sample reveals the proportion of words in the gold-standard answer that are identified correctly in the predicted answer. The precision or positive predictive value (PPV) score per sample informs us about the proportion of words in the predicted answer that are actually correct. In the context of the QA problem, when calculating these metrics, a true positive refers to the count of tokens shared by the predicted answer and the gold-standard answer, a false positive represents the number of tokens found solely in the predicted answer, and a false negative indicates the number of tokens found only in the gold-standard answer and not in the predicted one. The relaxed F1, precision, and recall scores per sample can range from 0 to 1. Following [55], we report the macro-averaged F1 score, accompanied by macro-averaged precision and recall scores, on the test sets.
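The per-sample metrics described above can be sketched in a few lines; this mirrors the SQuAD-style token-overlap evaluation, with whitespace tokenization assumed for illustration.

```python
from collections import Counter

def strict_f1(pred, gold):
    """1 if the prediction matches the gold-standard answer exactly, else 0."""
    return float(pred == gold)

def relaxed_scores(pred, gold):
    """Token-overlap (precision, recall, F1): true positives are the
    tokens shared by the predicted and gold-standard answers."""
    pred_toks, gold_toks = pred.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    tp = sum(common.values())
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / len(pred_toks)   # correct fraction of predicted tokens
    recall = tp / len(gold_toks)      # recovered fraction of gold tokens
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The macro-averaged scores reported in the paper are then the means of these per-sample values over a test set.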

Results and Discussion
In this section, we report and discuss the findings from the experiments with QA models. Furthermore, we conduct a comprehensive error analysis to demonstrate the capabilities and limitations of the QA models in extracting information related to IDU from clinical notes.

Results on gold-standard test set
This section examines the experimental outcomes of the QA models and demonstrates their performance on the test set of our gold-standard dataset. As shown in Table 6, ClinicalBERT outperforms the other BERT-based QA models. A strict F1 score of 52% for ClinicalBERT implies that the QA model can extract IDU-related information 52% of the time with a strict match to the gold-standard answers. A relaxed recall score of 79% on the test set suggests that, overall, there is a substantial degree of word overlap between the predicted answers and gold-standard answers. We further analyze the recall score in Section Analysis of recall score. On the other hand, a relaxed precision score of 85% on the test set indicates that a high percentage of the terms retrieved as answers by the QA model are included in the gold-standard answers. A relaxed F1 score of 78% indicates that the ClinicalBERT model can extract a high percentage of correct information while achieving high precision in those extracted answers.

Temporal out-of-distribution testing
The writing style of clinical notes may change over time because of changes in clinicians, health care facilities, patients, etc. [70]. Given the purpose of our QA model, it is imperative to examine whether the performance of our QA models is retained over time. Therefore, we perform additional testing of the models from Table 6 on unseen data. We examine the QA models' short-term and long-term information extraction capabilities by testing on clinical notes from two additional cohorts. For testing the short-term capability, we randomly select 100 patients and use their notes from February 2022. Similarly, for testing the longer-term capability, we randomly select 100 patients and use their notes from November 2022. Due to the limitations in our data availability at the time of this study, we were unable to include clinical notes beyond November 2022 for testing the longer-term information extraction capability of the QA models. In future endeavors, we aim to assess the performance of QA models on more recent notes as part of ongoing research.
To avoid data leakage, we use patients and their notes that did not appear in the gold-standard dataset, which was generated using notes from January 2022. We use the method described in Section Gold-standard dataset generation for building the test datasets from these notes. Similar to the gold-standard dataset, we manually review these test datasets in collaboration with a subject-matter expert. For the rest of the paper, we use the terms "Cohort-Short" and "Cohort-Long" to represent temporally out-of-distribution notes from February and November, respectively. Table 7 shows the statistics of the test datasets built using Cohort-Short and Cohort-Long. We also show the distribution of query groups in these test datasets in Figure 5. As shown, the distribution of the query groups is similar for the additional test sets and our original gold-standard dataset (refer to Figure 4). Table 8 shows the performance of the QA models. As shown, for both test datasets, the ClinicalBERT model performs with overall high scores, reflecting its competence in extracting information over time.

Error analysis
In this section, we provide a comprehensive analysis of the strengths and weaknesses of our best-performing model, the ClinicalBERT QA model, in extracting IDU-related information. We perform a fivefold analysis: we examine (i) the confidence intervals of the performance scores, the effect of (ii) note length, (iii) question length, and (iv) gold-standard answer length on the performance of the QA model, and (v) the performance of the QA model for each query group. Furthermore, by analyzing the recall scores, we showcase the proficiency of the QA model in retrieving IDU-related information. For our error analysis, we consider all three of our test sets: the test set in our gold-standard dataset and the test datasets from Cohort-Short and Cohort-Long.
Confidence Intervals of Performance Scores: We calculate confidence intervals (CI) for the strict F1 score and the relaxed F1, precision, and recall scores achieved by the best-performing QA model to represent how "good" these estimates are and thus quantify their uncertainty. The small confidence intervals in Table 9 indicate that our estimates are precise at a high (95%) level of confidence.
Analysis of Recall Score: In this part of the discussion, we analyze the recall scores of the QA model to shed light on its overall capability to extract gold-standard answers. In cases where the strict F1 score for the predicted answer is 0, the recall score can demonstrate the overlap between the gold-standard and predicted answers. For the test set in our gold-standard dataset, our QA model achieved a strict F1 score of approximately 52%. For the remaining 48%, we examine the recall scores by binning them into 12 intervals (shown in Table 10). We also perform similar analyses for Cohort-Short and Cohort-Long. As indicated in Table 10, 14% of the predictions for the gold-standard test set, although lacking a strict match, exhibit a 100% overlap with the gold-standard answers. Similarly, for Cohort-Short and Cohort-Long, respectively, 7% and 15% of the predicted answers have a 100% overlap with the gold-standard answers while not having a strict match. One potential issue when considering 100% overlap without a strict match is the predicted answer being the entire context. To address this concern, we compare the ratio of the predicted answers (that do not have a strict F1 score of 1) to the contexts with the ratio of the gold-standard answers to the contexts. Figure 7 shows that the distribution of the percentage ratios of the predicted answers to the contexts is similar to that of the gold-standard answers to the contexts.
Analysis of Model's Capability to Identify Whether a Note Contains IDU-related Information or Not: Our study focuses on extracting IDU-related information from clinical notes, but ideally, we also want our QA model to identify whether a note contains IDU-related information at all. As such, as an additional analysis, we examine the QA model's ability to
identify clinical notes that do not contain any mention of IDU keywords (Table 2) and as such are assumed to have no information about IDU.We hypothesize that given a clinical note with no mentions of IDU, the QA model should return an empty string because it could not find the information it was asked to retrieve.
To test this, we use patients from the test set in the gold-standard dataset. Recall that in our context-processing step in Section Note enrichment, we remove notes that do not contain any IDU keywords. For this analysis, we incorporate 443 notes from 226 patients with no mentions of IDU keywords. We ensure that the notes only belong to the patients in the test set.
To annotate these notes, we use the query group "existence of IDU" as questions and empty strings as answers.For example, given a note with no mentions of IDU and the question "Has the pt ever injected drugs?", the QA model should return an empty string.
To measure the performance, we consider only the strict F1 score. Thus, if the predicted answer matches the empty string, we consider that a success (strict F1 score = 1) and otherwise a failure (strict F1 score = 0). We find that our QA model can identify approximately 88% of the clinical notes that do not contain any IDU-related information. Interestingly, we find that for 10% of the mispredicted answers, the model returned the string "empty". Additionally, we observe that a string consisting of a single period (".") constitutes the second most frequently mispredicted answer, accounting for 0.5% of the predictions. Therefore, while our QA model can extract IDU-related information from clinical notes, it also has the potential to identify the notes that do not contain any.
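A minimal sketch of this success criterion, under the strict interpretation described above (the helper names are illustrative):

```python
def screens_negative(predicted_answer):
    """A note with no IDU keywords is correctly screened only when the
    model returns an exactly empty string; near-misses such as the
    literal string "empty" or a lone period count as failures."""
    return predicted_answer == ""

def screening_accuracy(predicted_answers):
    """Fraction of IDU-free notes for which the model returned an empty
    string (i.e., strict F1 = 1 against the empty gold-standard answer)."""
    return sum(screens_negative(p) for p in predicted_answers) / len(predicted_answers)
```

Under this criterion, the near-miss predictions "empty" and "." reported above are counted as failures even though they signal the model's intent to return nothing.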

Study limitations
This study has some limitations. First, the QA model was trained and tested on a dataset that had already undergone a fair amount of NLP pre-processing. Therefore, the model's performance may be limited when generalized to raw, source clinical notes, and further evaluation is needed to prove otherwise. Second, in many cases, we have noticed the use of the phrases "patient denied" or "veteran tells me" around IDU-related information in the clinical notes. The QA model's capabilities are limited to the text from which it can extract the pertinent information; therefore, the QA model must be implemented with supervision in the clinical setting. Third, our list of IDU keywords/phrases, although developed using NLP techniques and subject-matter expert validation, may not be exhaustive.
We also perform an additional evaluation to examine the capability of our QA model to extract information from temporally out-of-distribution notes. We then investigate the strengths and limitations of the QA model and identify potential avenues for future research by performing rigorous error analysis.
We have identified the following next steps for this research: (i) Examine the QA model's capability to extract information from temporally out-of-distribution clinical notes by testing the model on a more recent set of clinical notes. (ii) Examine/enhance the QA model's capability to handle raw clinical notes without the data-cleaning steps. (iii) Examine/enhance the QA model's capability to extract information on illicit injection drugs that are not covered in this study, for example, xylazine. (iv) The extractive QA problem may benefit from the named entity recognition (NER) task [71,72]; subsequent research could explore the integration of NER into the QA task for further investigation. (v) Expand the applications of QA tasks to extract other types of information from clinical notes, such as information related to alcohol use disorder and substance use disorder. We hope this method can support the accurate and efficient detection of people who inject drugs and relevant information extraction using their clinical notes.
General acknowledgements: The authors wish to acknowledge the support of the larger partnership. Most importantly, the authors would like to thank and acknowledge the veterans who chose to get their care at the VA.
Notice: This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains, and the publisher, by accepting the article for publication, acknowledges that the US government retains, a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://www.energy.gov/doe-public-access-plan).

Figure 2:
Figure 2: Framework for a two-part study for extracting information on IDU behavior in people who inject drugs from clinical notes. The first part consists of gold-standard dataset generation in three primary steps, and the second part consists of QA model development from implementation to inference.

Figure 3:
Figure 3: Twenty most frequently encountered clinical note types in this study, along with their frequency distribution.

Figure 4:
Figure 4: Distribution of query groups within the gold-standard dataset.

Figure 5:
Figure 5: Distribution of query groups in additional test datasets built using Cohort-Short (a) and Cohort-Long (b).

Figure 6:
Figure 6: Error analysis. Effect of context (note) length (a), question length (b), and gold-standard answer length (c) on the QA model's performance. (d) Performance of the QA model for each query group.

Figure 7:
Figure 7: Distributions of the ratios of the predicted answers to the contexts and the ratios of the gold-standard answers to the contexts. The ratios here are presented as percentages.

As shown in Table 2, this list can be categorized into the following groups: IV drug names, visible signs of IDU, risky needle-using behavior, skin popping, harm reduction interventions, and generic IDU terms.

Table 2:
A list of IDU keywords/phrases provided by SMEs. Abbreviations: ssp, syringe services programs; ivda, intravenous drug abuse; ris4e, resists infection by sterile syringe safe sex and education; PWID, people who inject drugs.

Table 3:
A detailed list of co-occurring phrases before/after IDU keywords, along with the parsing rules, targeted pattern type, and examples of answers before and after parsing.

Table 6:
Performance scores of QA models on the test set.

Table 7:
Statistics of the additional test datasets built using Cohort-Short and Cohort-Long.

Table 8:
Performance scores of the QA models on the additional test datasets built using Cohort-Short and Cohort-Long.

Table 10:
Analysis of recall scores for cases where the predicted answers do not have a strict match with the gold-standard answer. We demonstrate the capability of the QA model by showing some randomly selected examples of the predicted answers, along with the questions and gold-standard answers, in Table 11.