Introduction

Injection drug use (IDU) is a critical health concern in the United States and internationally1. Most people begin using illicit drugs through other modes of administration, such as smoking, intranasal absorption, or oral ingestion. As dependence grows, individuals tend to prefer the intravenous (IV) route of drug administration, injecting drugs directly into the veins, as it offers stronger and more immediate effects2. The number of people who inject drugs increased almost fivefold from 2011 to 2018, according to estimates in3, whereas the number of IDU-related overdoses increased eightfold from 2000 to 20184.

IDU can lead to complicated medical conditions such as abscesses and cutaneous infections, scarring and needle tracks, endocarditis, HIV/AIDS, Hepatitis C, overdose, and death1,5,6,7,8. An increase in IDU is also associated with an increase in morbidity and mortality9,10,11.

Accurately identifying IDU behaviors in people who inject drugs is crucial for risk assessment and detection of patients who can benefit from harm reduction interventions to potentially prevent IDU-related morbidity and mortality12,13. In the literature, the study of IDU-related information extraction has been performed along with other socio-behavioral determinants of health (SBDH). Considering and including SBDH such as prior incarceration, substance use (regardless of administration mode), treatment attitude, psychological distress, and interpersonal violence improves patient mortality outcomes and enhances the prediction of medication adherence, hospital readmission, and suicide attempts14,15.

Despite the growing interest, SBDH such as IDU are not identifiable in patients’ electronic health records (EHRs) through ICD codes; although not systematically assessed, IDU can be documented in clinical notes16,17. While structured data fields derived from EHRs may provide some information about risky drug use behaviors and morbidities related to IDU, the clinical note is the only place where IDU can be explicitly documented12. Although this information is clinically meaningful and can identify patients who may benefit from harm reduction interventions, care providers often struggle to retrieve it from EHRs, and its exclusion may reduce the overall quality of care18,19.

Natural language processing (NLP) can help extract SBDH-related information from clinical notes and expand the utility of such information in patient care20,21,22. NLP is a branch of computer science that involves automated learning, understanding, and generation of natural languages, enabling interactions between machines and human languages. Although NLP covers a variety of tasks involving unstructured text data (e.g., event prediction23, entity recognition24, question-answering (QA) for information extraction25, and relation extraction26), in this article, we use the extractive question-answering (extractive QA) task to automatically extract IDU-related information from clinical notes in EHRs. To avoid redundancy, for the rest of the paper, we use QA in place of extractive QA. In this QA task, given a query and a clinical note, a QA model returns the relevant answer verbatim from the note as the extracted information. Thus, a QA system learns to read and comprehend a clinical note given a query and then extracts a span of consecutive words from that note that is relevant to the query (Fig. 1).

Fig. 1: A sample clinical note featuring questions about the IDU behavior in people who inject drugs, with extracted IDU-related information color-coded in the note.
figure 1

pt patient, IDU injection drug use, iv intravenous.

We use QA to address the information extraction problem for the following reasons. The text in clinical notes is highly unstructured. For example, information about injection drug names can appear in the notes in multiple forms, such as—opioids: denies recent use, hx ivdu, claims last use years ago. other drugs: hx methamphetamine use, has been using daily via injecting since a relapse in December; ivdu (cocaine/methamphetamine); reports using iv meth; iv cocaine mixed with heroine use; used meth by iv drug use; or history of daily heroine use, prior ivdu. Here, ivdu refers to intravenous drug use. Given the demonstrated success of QA models in extracting information of diverse forms from clinical notes27, we chose to focus on the QA task in NLP. Moreover, one potential implementation of this work would be to incorporate the developed model into a chatbot framework, enabling clinicians to inquire about IDU behavior in people who inject drugs at the point of care by posing questions with various syntactic structures. Such a tool would help clinicians identify people who inject drugs and pinpoint their IDU-related status.

Although not specific to IDU, several studies have focused on identifying clinical concepts or information on substance use disorders (SUD) using NLP22,28. In these studies, various NLP techniques have been used to extract SUD-related information. Stemming algorithms have been used to identify words and phrases associated with mental illness and substance use in clinical notes29,30. Dependency structures have been utilized to capture relationships between phrases and tokens in substance use statements28. Word-embedding models have been employed to identify alcohol and substance use status21. Machine reading comprehension has been applied to extract clinical concept categories and relation categories, such as relations of medications with adverse drug events and SBDH22. Multi-label text classification and sequence labeling have been used to identify sentences containing labeled arguments about drug use31. Topic modeling and keyword matching techniques have been leveraged to extract drug use–related information32. Techniques such as active learning33, multi-label classification34,35,36,37, concept extraction, and joint extraction of entities and relations have been employed to extract information about drug use38. Researchers have also focused on identifying drug use information by using NLP-specific techniques to detect opioid use disorder and predict overdose39,40,41,42,43,44,45,46,47,48,49,50. In the literature, we came across one research study that has focused exclusively on IDU. That study utilized rule-based algorithms, such as regular expressions (RegEx), NegEx51, and N-grams, to search for a very limited set of IDU-related terms, with the objective of identifying people who inject drugs (PWIDs)12. In our study, on the other hand, we focus on extracting a broad spectrum of information on injection drug use from clinical notes. This encompasses details such as drug names, active/historical use, frequency of use, risky needle-using behavior, visible signs of IDU, last use, skin popping, harm reduction interventions, and the existence of IDU. Since evidence of IDU cannot be found in structured EHR data and therefore must be inferred from clinical notes, this study’s sole focus on IDU aims to help us understand how this phenomenon is represented in unstructured notes and to augment prior approaches that relied on NLP techniques less generalizable to this population. To the best of our knowledge, to date, there has been no published attempt at developing a QA algorithm to extract IDU-related information from clinical notes.

To solve the QA task, we use transformer-based deep learning models52,53, which are among the most streamlined ways to solve QA tasks and achieve competitive performance in extracting targeted information from different types of biomedical documents, such as scholarly articles52,54, clinical practice guidelines25, and electronic medical records27. Nonetheless, evidence suggests that supervised deep learning models require high-quality and large-scale annotated datasets to achieve good performance in any task53,55,56, and the absence of such a dataset for our targeted QA task poses a critical challenge. An annotated QA dataset comprises data samples, with each sample containing a context (e.g., a clinical note), a question, and an answer extracted verbatim from the context (i.e., the extracted information). In addressing the challenge posed by the limited availability of annotated QA data for constructing an effective QA model, our study takes a two-fold approach. First, we built a high-quality gold-standard QA dataset in collaboration with a subject matter expert (SME), facilitating model training and testing. The dataset includes clinical notes as contexts and question-answer pairs specific to IDU. Then, using this meticulously curated gold-standard dataset, we pursue the primary objective of this study—developing and assessing a QA system for IDU-related information extraction from clinical notes. We also perform an error analysis to identify the strengths and weaknesses of our QA system, providing valuable insights to guide future research endeavors. The QA model achieves noteworthy performance, demonstrated by an F1 score of 51.65% for a strict match between gold-standard and predicted answers, as well as F1, Precision, and Recall scores of 78.03%, 85.38%, and 79.02%, respectively, for a relaxed match. These findings hold promising implications for the precise and efficient identification of injection drug use, enabling the extraction of relevant information from clinical notes.

Methods

In this section, we elaborate on the formulation of this study and its two components: (i) Gold-standard dataset generation and (ii) modeling (Fig. 2). Furthermore, we outline the specifications of the gold-standard dataset, the experimental setup and the metrics used to assess the performance of the QA models.

Fig. 2: Framework for a two-part study for extracting information on IDU behavior in people who inject drugs from clinical notes.
figure 2

The first part consists of gold-standard dataset generation in three primary steps, and the second part consists of QA model development from implementation to inference. SMEs subject–matter experts, IDU injection drug use, QA question–answering.

Problem formulation

We formulate the information extraction task as a QA problem in NLP in the following manner: Given a question on patients’ behavior about IDU and a clinical note with IDU-related information (i.e., the context), a QA system retrieves the relevant information (i.e., the answer) from the provided note.

For example, given the question—does the patient have a history of IDU?—and the clinical note—pt X, 200 yrs old … he has a history of smoking with 50 pack years, quit 10 years ago … social ethanol user … no history of idu … remote history of marijuana use … family hx: … physical exam: … provider: name.—the QA system is expected to return the answer—no history of idu—verbatim from the note.
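To make the formulation concrete, the following minimal sketch shows how an off-the-shelf extractive QA model can be queried in this manner with the Hugging Face transformers library; the checkpoint name and the note text are illustrative placeholders, not the models or data used in this study.

```python
from transformers import pipeline

# Minimal sketch; the checkpoint and note text are illustrative placeholders,
# not the models or data used in this study.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

note = ("pt X ... he has a history of smoking with 50 pack years, quit 10 years ago ... "
        "social ethanol user ... no history of idu ... remote history of marijuana use ...")
question = "Does the patient have a history of IDU?"

result = qa(question=question, context=note)
# The answer is a verbatim span of the note, returned with character offsets and a score.
print(result["answer"], result["start"], result["end"], result["score"])
```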

Gold-standard dataset generation

QA is a supervised NLP learning task and, as such, requires an annotated gold-standard dataset for model development and inference. In a QA dataset, each sample consists of the context, a question, and an answer, with the question-answer pairs serving as annotations. To generate a gold-standard dataset from clinical notes, which serve as the context, we employ a three-stage process outlined in Fig. 2: (1) question collection, (2) note enrichment, and (3) gold-standard answer extraction.

Question collection

We initialize the process of question collection for the dataset by asking SMEs about the kinds of information on IDU they would like to extract from the clinical notes. We then generate a set of questions based on their interests. Table 1 shows the nine categories of interest. In the rest of the paper, we use the term Query Group to refer to these categories of interest. Table 1 also provides sample questions and answers for each query group.

Table 1 Types of information about IDU that are most likely to be inquired by clinicians from the clinical notes, categorized into nine query groups

Each query group targets one category of information to be extracted from the notes. For example, the query group—drug names—targets any information about IV drug names in the notes. In our gold-standard dataset, we include multiple variations of questions for each query group. For example, for the query group—drug names, we have five different variations of questions as follows: To what IV drugs has the patient been exposed?, Which IV drugs have the pt used?, Which intravenous drugs has the patient used?, Which injection drugs?, Which illicit drugs has the patient injected?.

We do this for the following reasons. We anticipate that our system will be used as a standalone application—a more user-friendly QA tool to collect IDU evidence—and that it will be capable of handling different variations of questions posed by clinicians. Furthermore, we hope that different variations of questions for each query group will help increase the QA model’s user flexibility, comprehensiveness, and robustness, ultimately enhancing its performance in real-world applications, as follows: (i) Users may pose questions in different ways based on their preferences or understanding. A QA model trained with diverse question variations is more adaptable and capable of accommodating the linguistic diversity inherent in user queries. (ii) Including variations of questions during training helps the QA model become more robust by exposing it to the diverse ways the same question can be asked, preparing the model to handle real-world scenarios where questions may be phrased differently but still seek the same information. (iii) Variations of questions during training enable the QA model to generalize its understanding. Instead of memorizing specific phrasings, the model learns the underlying patterns and associations between questions and answers, improving its ability to respond accurately to novel queries.

We use abbreviations, synonyms, and syntactical variations to introduce variations in the questions for each query group, as follows: (i) abbreviations: Is the patient actively using intravenous drugs? → Is the pt actively using intravenous drugs?, Is the patient actively using intravenous drugs? → Is the patient actively using iv drugs?. (ii) synonyms: Does the pt have a history of using intravenous drugs? → Does the pt have a history of using injection drugs?, Does the pt have a history of IDU? → Does the pt have a history of IVDU?. (iii) syntactical variations: Which iv drugs has the patient used? → To which iv drugs has the patient been exposed?, Does the pt have a history of IVDU? → Has the pt ever used IV drugs?

It should be noted that when identifying abbreviations and synonyms to be used in questions, we only choose terms and variants that clinicians commonly use. Examples of these terms and variants include patient and pt, intravenous and iv, history and hx, and IVDU and IDU. To ensure that we accurately captured the nuances of possible language usage in the syntactical variations of the questions, we sought the guidance of SMEs.

Note enrichment

The contexts in the gold-standard dataset are clinical notes that contain some IDU-related information. As such, we select a cohort of patients whose notes have a higher chance of containing IDU-related information, such as patients who have been diagnosed with Hepatitis C. To guarantee that the clinical notes include information relevant to IDU and to narrow down the notes accordingly, we use a list of keywords/phrases indicative of IDU (refer to Table 2) that was developed by SMEs. The SMEs followed an iterative approach to create this list. They began by compiling a list of common terms related to IDU, which they then refined by reviewing the associated snippets. They removed terms that caused excessive noise, such as—slamming and drug paraphernalia—and added terms like—skin popping—to enhance granularity. The experts received extensive training to sort and/or define the snippet categories, and they validated the terms to ensure their accuracy.

Table 2 A list of IDU keywords/phrases provided by SMEs

For our study, we assumed that the presence of any of these IDU-related keywords indicates the presence of relevant information pertaining to IDU in the note. Hence, we discard notes that do not contain any of the words/phrases provided in Table 2, as their absence suggests that the note likely contains no IDU-related information. As shown in Table 2, this list can be categorized into the following groups: IV drug names, visible signs of IDU, risky needle-using behavior, skin popping, harm reduction interventions, and generic IDU terms.

To enhance the readability of clinical notes and make them more suitable for automated processing, we conduct rigorous manual exploration of the final set of notes, identifying some common patterns that can help clean them using RegEx. It is important to note that to preserve crucial information in the clinical notes, we perform minimal data cleaning, as follows: (i) Remove newlines following within-sentence punctuation marks, such as commas, semicolons, or colons. For instance, removing the newline (\n) highlighted in the sentence—Veteran reported using iv meth,\n iv cocaine, and etoh. (ii) Remove newlines appearing before punctuation marks, such as period, comma, or semicolon. For example, removing the newline (\n) highlighted in the sentence—Veteran reported using iv meth, iv cocaine, and etoh\n. (iii) Remove newlines positioned between words within the same sentence. For example, removing the newline (\n) highlighted in the sentence—Veteran reported\n using iv meth, iv cocaine, and etoh. (iv) Consolidate multiple consecutive occurrences of newlines, white spaces, or punctuations into single instances. For example, replacing multiple periods with a single period in the sentence—Veteran reported using iv meth, iv cocaine, and etoh.............. We perform these steps to clean all the notes used for training, validation, and testing.
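The cleaning steps above can be sketched with regular expressions as follows; the patterns shown are illustrative approximations, not the study's exact rules.

```python
import re

def clean_note(text: str) -> str:
    """Lightweight note cleaning; patterns are illustrative, not the study's exact rules."""
    # (i) remove newlines following within-sentence punctuation (comma, semicolon, colon)
    text = re.sub(r"([,;:])[ \t]*\n[ \t]*", r"\1 ", text)
    # (ii) remove newlines appearing before punctuation (period, comma, semicolon)
    text = re.sub(r"[ \t]*\n[ \t]*([.,;])", r"\1", text)
    # (iii) remove newlines positioned between words within the same sentence
    text = re.sub(r"(\w)[ \t]*\n[ \t]*(\w)", r"\1 \2", text)
    # (iv) collapse repeated newlines, spaces, and periods into single instances
    text = re.sub(r"\n{2,}", "\n", text)
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = re.sub(r"\.{2,}", ".", text)
    return text

print(clean_note("Veteran reported\n using iv meth,\n iv cocaine, and etoh.............."))
# Veteran reported using iv meth, iv cocaine, and etoh.
```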

Gold-standard answer extraction

The next step in our dataset generation process is to extract gold-standard answers (i.e., information related to IDU) from the clinical notes. Clinical notes are inherently lengthy, and manually extracting the gold-standard answers from them requires a substantial amount of time, rendering the process unfeasible. Therefore, we devise a pre-annotation strategy involving an automated step-by-step answer extraction process that integrates rule-based NLP techniques. The primary objective of this phase is to substantially reduce the manual annotation/review effort. Nevertheless, to ensure the utmost quality of the gold-standard dataset, the outputs from this pre-annotation phase, along with the associated questions, underwent subsequent manual review and correction by a subject-matter expert with a Ph.D. in Psychology and an extensive background in substance use disorder, counseling, and treatment. Our pre-annotation strategy is based on three assumptions:

Assumption 1

Our QA task only tackles information extraction (i.e., answering questions) from one single place (a sentence) in the note at a time.

Assumption 2

The inquired information can be found in a single sentence in the note. This assumption stems from our rigorous manual exploration of the notes during the note enrichment step, where we find RegEx patterns. Our observation indicates that, in most instances, a single sentence per question suffices to capture the relevant answer. Nonetheless, we acknowledge that this straightforward sentence selection process may not always be optimal. Unstructured clinical notes often deviate from grammatical rules. Additionally, information presentation in these notes may vary, adopting styles such as questionnaires or bulleted lists. As a result, a single sentence in the traditional sense occasionally leads to either a larger text segment or a fragmented part of a single piece of information. These instances lead to the inclusion of irrelevant or incomplete information in the answers, and we address and rectify these issues during our manual review phase.

Assumption 3

If the note contains IDU-related information in multiple locations, each is considered a separate answer string. Furthermore, multiple answer strings from the same note are expected to contain different kinds of information that should be answered by different questions. For example, in the note snippet—pt has a history of smoking with 50 pack years, quit 10 years ago … social ethanol user … has h/o ivdu … remote history of marijuana use … last used iv meth 2 years ago …—there are two locations where IDU-related information can be found—has h/o ivdu and last used iv meth 2 years ago. In such cases, we consider them as separate answers that are retrieved when asked the following questions: Does the pt have a history of IDU? and When did the pt last use IV drugs?.

Given the clinical notes, we extract the automated gold-standard answers using rule-based NLP techniques in the following steps (a minimal code sketch of Steps 1 and 2 follows the list):

  • Step 1: Tokenize the sentences in the notes. Here, we define a sentence in the traditional sense, ending with a period. Therefore, for the sentence tokenization, we use periods to indicate the end-of-sentence.

  • Step 2: Identify sentences that contain any of the IDU keywords from Table 2 using regular expression string matching and discard the rest.

  • Step 3: At this point, the sentences containing the IDU keywords could ideally be considered gold-standard answers (i.e., extracted information relevant to IDU). Nonetheless, while our primary aim is to extract IDU-related information from the notes, we also want the extracted information to be as precise as possible, containing as little nonessential information as possible. A full-sentence answer is most likely to include nonessential information, which can be further reduced by using parsing rules. Parsing rules refer to NLP techniques that can identify specific patterns of text within a string that represent the concepts of interest while ignoring the remaining text. An example of removing nonessential information from the answer can be transforming the sentence—social history: pt lives with family in [location], quit smoking 10 y ago, occ etoh, .... hx methamphetamine use, has been using daily via injecting since a relapse in December.—into the phrase—hx methamphetamine use, has been using daily via injecting since a relapse in December.
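A minimal sketch of Steps 1 and 2, with an illustrative subset of keywords standing in for the full SME-curated list in Table 2:

```python
import re
from typing import List

# Illustrative subset of IDU keywords/phrases; the full SME-curated list is in Table 2.
IDU_KEYWORDS = [
    "ivdu", "idu", "iv drugs", "injection drug", "track marks",
    "skin popping", "methamphetamine", "heroin",
]
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in IDU_KEYWORDS) + r")\b",
                        re.IGNORECASE)

def extract_candidate_answers(note: str) -> List[str]:
    """Step 1: tokenize sentences on periods. Step 2: keep sentences with an IDU keyword."""
    sentences = [s.strip() for s in note.split(".") if s.strip()]
    return [s for s in sentences if KEYWORD_RE.search(s)]

note = ("pt lives with family. quit smoking 10 y ago. "
        "hx methamphetamine use, has been using daily via injecting. denies ivdu.")
print(extract_candidate_answers(note))
# ['hx methamphetamine use, has been using daily via injecting', 'denies ivdu']
```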

To create the parsing rules in this study, we randomly sample a set of sentences and focus on identifying specific phrases that occur together before or after the IDU keywords and modify or provide information that is crucial to the IDU-related history of the patient (refer to Table 1). These phrases can be adjacent to or distant from the keywords. For example, pt lives with family, denies ivdu—versus—pt lives with family, denies any tobacco, etoh or ivdu. In this example, the term—denies—provides crucial information on the IDU behavior of the patient.

In Supplementary Table 1, we provide a detailed list of these phrases along with the targeted pattern type, parsing rules, and examples of how they help reduce the nonessential information from the answers. The parsing rules mainly focus on identifying patterns stating negative IDU mentions, temporal information, opioid use disorder specific to IDU, and status of track marks.
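To illustrate the flavor of such rules, the following is a hypothetical negation pattern, not one of the study's actual rules from Supplementary Table 1, that keeps only the clause coupling a negation cue with an IDU keyword:

```python
import re

# Hypothetical parsing rule: capture the negation cue and everything up to the IDU keyword,
# trimming the rest of the sentence (a sketch; the study's rules are in Supplementary Table 1).
NEGATED_IDU = re.compile(r"\b(denies|denied|no history of)\b[^.;]*?\b(ivdu|idu|iv drug use)\b",
                         re.IGNORECASE)

sentence = "pt lives with family, denies any tobacco, etoh or ivdu"
match = NEGATED_IDU.search(sentence)
if match:
    print(match.group(0))  # denies any tobacco, etoh or ivdu
```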

Although these parsing rules can extract the correct concise gold-standard answers from the clinical notes in numerous cases, manual review reveals instances where the rules failed to accurately identify these answers. This discrepancy was primarily attributed to the unstructured nature of information within the notes.

Question-to-answer mapping

Finally, to generate the labels (question-answer pairs) of our gold-standard dataset, we create mappings between the questions from Section Question collection and the gold-standard answers from Section Gold-standard answer extraction. We achieve this by considering the query groups in Table 1. For each query group, we identify a group of words in the gold-standard answers that are most likely to provide the information inquired by that query group. To compile this group of words, we engage in meticulous manual exploration, reading sentences containing IDU keywords. Depending on the kind of information we are interested in (reflected by the query groups), these words can be either the keywords in Table 2 or the words (Supplementary Table 1) that co-occur with the keywords and can help convey the information inquired by the user. For example, co-occurring words—daily and last—describe the frequency of use and the last use of IDU, respectively.

Thus, for each query group, we decide on a group of words that are most likely to help convey the inquired information and map the answers that contain these words to the questions in that query group (Table 1). The resulting compilation is presented in the—Words in Gold-standard Answers Most Likely to Provide Inquired Information—column of Supplementary Table 2. It is important to note, however, that this list is not exhaustive and represents only what we observe during our exploration, not an all-encompassing collection of potential phrases indicating the inquired information. Considering this, in our manual review phase, we manually correct annotations that are overlooked or mislabeled by these rules.

In Supplementary Table 2, we present the mappings between the query groups and the words in gold-standard answers that are most likely to provide the inquired information. We also demonstrate sample answers for each mapping. Note that the answers in one query group and the answers in a different query group may not be mutually exclusive. This is because if we find words in an answer that belong to multiple query groups, then that answer is mapped to all questions from these query groups. For example, the first sample answer from Supplementary Table 2—recent ivdu with meth and heroin—contains the words—recent and heroin/meth—from the query groups active/historical use and drug names, respectively. Hence, this answer is mapped to all questions in these two query groups.
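A minimal sketch of this mapping logic, using hypothetical trigger-word lists (the study's full lists appear in Supplementary Table 2):

```python
import re
from typing import Dict, List

# Hypothetical trigger-word lists per query group; an answer is mapped to every
# group whose words it contains (the study's full mapping is in Supplementary Table 2).
QUERY_GROUP_WORDS: Dict[str, List[str]] = {
    "drug names": ["heroin", "meth", "cocaine"],
    "active/historical use": ["recent", "hx", "history", "current"],
    "last use": ["last"],
    "frequency of use": ["daily", "weekly"],
}

def map_answer_to_query_groups(answer: str) -> List[str]:
    """Return every query group whose trigger words appear in the answer."""
    groups = []
    for group, words in QUERY_GROUP_WORDS.items():
        if any(re.search(rf"\b{re.escape(w)}", answer, re.IGNORECASE) for w in words):
            groups.append(group)
    return groups

print(map_answer_to_query_groups("recent ivdu with meth and heroin"))
# ['drug names', 'active/historical use']
```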

The well-known ConText rules57 in the literature use a similar rule-based approach to identify the negation or temporality of a condition. They use a specific set of words tailored to the types of notes used in their study. Although the words utilized in our study share some commonalities with those in ConText, they exhibit notable differences. This distinction arises from variations in the notes used in our experiments and the specific information we target to extract from them. Our study exclusively focuses on injection drug use, whereas the error analysis of ConText indicates unsatisfactory performance in identifying the temporality of chronic conditions and risk factors, i.e., alcohol and drugs, in clinical notes. Additionally, while ConText explicitly distinguishes historical from recent conditions, our question-answering system concentrates on extracting any temporal information regarding injection drug use, leaving the determination of whether the status is recent or historical to clinicians.

Regarding the query group—last use, it is crucial to note that a patient may have multiple note entries, each with its own last use. Given our study’s emphasis on extracting information from one clinical note at a time, the definition of—last use—is confined to—last use per note.

After generating the labels (i.e., question-answer pairs), we manually review the whole dataset in collaboration with a subject-matter expert to ensure that our gold-standard dataset is of high quality and accuracy.

Modeling with question–answering system

In the next step of our study, we develop the QA model for extracting IDU-related information using the gold-standard QA dataset from Section Gold-standard dataset generation. We use Bidirectional Encoder Representations from Transformers (BERT)53-based deep learning QA models where the feature extractor is a trainable pre-trained BERT-based language model, and the QA task layer is a single-layer feed-forward neural network.

We experiment with four state-of-the-art pre-trained language models—BERT53, BioBERT52, BlueBERT58, and ClinicalBERT59—as trainable feature extractors and develop four QA models.

Provided a sequence of tokens (words or pieces of words) from a question and a clinical note, the QA model returns the start and end tokens of the answer span. Any text between the start and end tokens, inclusive, is then considered the answer (i.e., the extracted information). The maximum allowable number of input tokens in these BERT-based QA models, counting both the question and the note, is 512. To handle samples with longer clinical notes, we follow a widely known technique in QA modeling—the sliding window with a document stride53.

Below we provide a brief description of this technique: Given an input question consisting of 20 tokens, the remaining allowable number of input tokens for the note is limited to 492 (which is 512 minus the 20 tokens in the question). If the note exceeds this limit, we employ a sliding window technique to split it into multiple chunks using a document stride of 128 tokens. The document stride determines the starting token of each subsequent chunk. After this preprocessing step, each chunk prepended with the original question tokens is considered a separate data sample.
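A sketch of this chunking step, assuming the Hugging Face tokenizer's overflow support; the checkpoint name is illustrative, and note that the tokenizer's stride argument specifies the overlap between consecutive chunks:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any BERT-family tokenizer behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

question = "Does the pt have a history of IDU?"
long_note = "social history: ... " * 400  # stands in for a long clinical note

encodings = tokenizer(
    question,
    long_note,
    max_length=512,
    truncation="only_second",        # split the note, never the question
    stride=128,                      # overlap between consecutive chunks
    return_overflowing_tokens=True,  # emit one sample per chunk
    return_offsets_mapping=True,     # map token spans back to note characters
)
print(len(encodings["input_ids"]))   # number of (question + chunk) samples produced
```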

Dataset statistics

We use clinical notes sourced from the VA Corporate Data Warehouse (CDW) to construct the gold-standard dataset. The clinical notes in the VA CDW are fully identified. The selected notes correspond to the period of January 2022 and belong to patients with the Hepatitis C diagnosis. The identification of Hepatitis C-positive patients is performed using ICD-10 codes. We select the cohort of patients with Hepatitis C as their clinical notes are more likely to include information related to IDU. As explained in Section Note enrichment, we narrow down the clinical notes using a list of keywords/phrases indicative of IDU (refer to Table 2).

To reduce computational overhead during training and because unusually large notes (determined by the outliers in the distribution of note lengths) may contain templated nonessential information that is not relevant to any specific patient, we remove some outlier notes based on the interquartile range of the note lengths. We later show in Section Error analysis that note length does not affect the performance of the model at the time of inference.

We also analyze the types of notes included in this study. Our analysis reveals that there are 411 different types of notes. Supplementary Fig. 1 displays the 20 most frequently encountered note types in this study. Notably, internal medicine notes and primary care notes emerge as the two most prevalent types. We also find that addendum notes rank third in frequency. Addendum notes serve as supplements to notes of other types.

Supplementary Table 3 shows the statistics of our gold-standard dataset. Our cohort consists of 1145 patients with a total of 2323 notes that have an average length of 1013 words; words are identified based on whitespace. In addition, we examine the distribution of the query groups outlined in Table 1 and Supplementary Table 2 within the gold-standard dataset. This analysis is illustrated by the pie chart depicted in Supplementary Fig. 2. As shown, the dataset is dominated by QA pairs related to active/historical use. Following closely behind are QA pairs about the existence of IDU and drug names, whereas the least frequent QA pairs in the dataset are those pertaining to skin popping and harm reduction interventions.

Ethics: This project was conducted as a national quality improvement effort to improve care for Veterans with substance use being treated in the Veterans Health Administration (VHA). Models were designed to be implemented into VHA decision support systems and are not expected to be generalizable or valid for application outside of notes from the VHA Computerized Patient Record System (CPRS). As such, this work is considered non-research by VHA (as per ProgramGuide-1200-21-VHA-Operations-Activities.pdf (va.gov)). However, Oak Ridge National Laboratory (ORNL) required additional oversight of this VHA clinical quality improvement project as local standard practice for all uses of patient medical record data within their institution, with approval of the project by the Oak Ridge National Laboratory IRB. The need for the veterans whose medical records were used in the study to give informed consent for the study was waived by the ORNL IRB.

Experimental setup

For experimentation, we divide our gold-standard dataset into train, validation, and test sets using a 70-10-20 split based on patients to avoid any data leakage. To implement the QA models, we use PyTorch60. We use the pre-trained language models from the huggingface API61.

Based on the statistics of our gold-standard dataset, we choose 512 as the maximum sequence length, 20 as the query length, and 100 as the answer length. After reviewing the hyperparameters utilized in various QA tasks as outlined in25,52,53,58,62,63,64,65,66,67, we set the document stride to 128 and opted for a batch size of 32, a learning rate of 3e−5, and a training epoch count of 5 for the training configurations. We performed all experiments using a single GPU on a Linux virtual machine with two GRID V100-32C GPUs.
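A sketch of this configuration using the Hugging Face Trainer API, which we assume here for illustration (the paper specifies PyTorch, the pre-trained checkpoints, and the hyperparameters, but not the exact training loop); train_split and val_split are hypothetical tokenized dataset objects:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, Trainer, TrainingArguments

model_name = "emilyalsentzer/Bio_ClinicalBERT"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Adds a randomly initialized span-prediction head (the single-layer QA task layer)
# on top of the pre-trained encoder; both are trained during fine-tuning.
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Hyperparameters reported above; the dataset objects are assumed to be the tokenized
# train/validation splits (512-token inputs, document stride 128, start/end positions).
args = TrainingArguments(
    output_dir="idu-qa-clinicalbert",
    per_device_train_batch_size=32,
    learning_rate=3e-5,
    num_train_epochs=5,
)
# trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
#                   train_dataset=train_split, eval_dataset=val_split)
# trainer.train()
```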

Metrics

To assess the performance of QA models in extracting IDU-related information, we utilize strict matching criteria to compute the F1 score68. This involves verifying whether the prediction precisely matches the gold-standard answer character by character, resulting in a strict F1 score per sample that can be either 1 or 0. Additionally, we use relaxed matching criteria to measure the F1, precision, and recall scores68. A relaxed match determines whether there is any overlap between the prediction and the gold-standard answer. The recall or sensitivity score per sample reveals the proportion of words in the gold-standard answer that are identified correctly in the predicted answer. The precision or positive predictive value (PPV) score per sample informs us about the proportion of words in the predicted answer that are actually correct. In the context of the QA problem, when calculating these metrics, a true positive refers to the count of tokens that both the predicted answer and the gold-standard answer share, a false positive represents the number of tokens found solely in the predicted answer, and a false negative indicates the number of tokens only in the gold-standard answer and not in the predicted one. The relaxed F1, precision, and recall scores per sample can range from 0 to 1. Following55, we report the macro-averaged F1 score, accompanied by macro-averaged precision and recall scores, on the test sets.
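A per-sample sketch of these metrics, following the common SQuAD-style token-overlap formulation described above (whitespace tokenization is assumed; the reported scores are macro-averages of these per-sample values):

```python
from collections import Counter

def strict_f1(pred: str, gold: str) -> float:
    """Strict match: 1 if the prediction equals the gold answer character for character."""
    return float(pred == gold)

def relaxed_scores(pred: str, gold: str):
    """Token-overlap precision, recall, and F1 between the prediction and the gold answer."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    tp = sum(common.values())             # tokens shared by both answers
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / len(pred_tokens)     # false positives: tokens only in the prediction
    recall = tp / len(gold_tokens)        # false negatives: tokens only in the gold answer
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(relaxed_scores("no history of idu", "denies history of idu"))  # (0.75, 0.75, 0.75)
```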

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results and discussion

In this section, we report and discuss the findings from the experiments with QA models. Furthermore, we conduct a comprehensive error analysis to demonstrate the capabilities and limitations of the QA models in extracting information related to IDU from clinical notes.

Results on gold-standard test set

This section focuses on examining the experimental outcomes of the QA models and demonstrates their performance on the test set of our gold-standard dataset. As shown in Table 3, ClinicalBERT outperforms the other BERT-based QA models. A strict F1 score of 52% for ClinicalBERT implies that the QA model can extract IDU-related information 52% of the time with a strict match to the gold-standard answers. A relaxed recall score of 79% on the test set suggests that, overall, there is a substantial degree of word overlap between the predicted answers and the gold-standard answers. We further analyze the recall score in Section Analysis of recall score. On the other hand, a relaxed precision score of 85% on the test set indicates that a high percentage of the terms retrieved as answers by the QA model are included in the gold-standard answers. A relaxed F1 score of 78% indicates that the ClinicalBERT model can extract a high percentage of correct information while achieving high precision in those extracted answers.

Table 3 Performance scores of QA models on the test set

Temporal out-of-distribution testing

The writing style of clinical notes may change over time because of changes in clinicians, health care facilities, patients, etc.69. Given the purpose of our QA model, it is imperative to examine whether the performance of our QA models is retained over time. Therefore, we perform additional testing of the models from Table 3 on unseen data. We examine the QA models’ short-term and long-term information extraction capabilities by testing on clinical notes from two additional cohorts. For testing the short-term capability, we randomly select 100 patients and use their notes from February 2022. Similarly, for testing the longer-term capability, we randomly select 100 patients and use their notes from November 2022. Due to the limitations in our data availability at the time of this study, we were unable to include clinical notes beyond November 2022 for testing the longer-term information extraction capability of the QA models. In future endeavors, we aim to assess the performance of QA models on more recent notes as part of ongoing research.

To avoid data leakage, we use patients and their notes that did not appear in the gold-standard dataset generated by using notes from January 2022. We use the method described in Section Gold-standard dataset generation for building the test datasets using these notes. Similar to the gold-standard dataset, we manually review these test datasets in collaboration with a subject-matter expert. For the rest of the paper, we use the terms Cohort-Short and Cohort-Long to represent temporally out-of-distribution notes in February and November, respectively. Supplementary Table 4 shows the statistics of the test datasets built using Cohort-Short and Cohort-Long. We also show the distribution of query groups in these test datasets in Supplementary Fig. 3. As shown, the distribution of the query groups is similar for the additional test sets and our original gold-standard dataset (refer to Supplementary Fig. 2).

Table 4 shows the performance of the QA models. As shown, for both test datasets, the ClinicalBERT model performs with overall high scores, reflecting its competence in extracting information over time.

Table 4 Performance scores of the QA models on the additional test datasets built using Cohort-Short and Cohort-Long

Error analysis

In this section, we provide a comprehensive analysis of the strengths and weaknesses of our best-performing model, which is the ClinicalBERT QA model, in extracting IDU-related information. We perform a fivefold analysis as follows: Examine the (i) confidence intervals of the performance scores, the effect of (ii) note length, (iii) question length, and (iv) gold-standard answer length on the performance of the QA model, and (v) the performance of the QA model for each query group. Furthermore, by analyzing the recall scores, we showcase the proficiency of the QA model in retrieving IDU-related information. For our error analysis, we consider all three of our test sets—the test set in our gold-standard dataset and the test datasets from Cohort-Short and Cohort-Long.

Confidence intervals of performance scores

We calculate the confidence intervals (CI) for strict F1 score and relaxed F1, precision, and recall scores achieved by the best-performing QA model to represent how good these estimates are and thus quantify their uncertainty. Smaller confidence intervals demonstrated in Table 5 indicate that our estimates are precise with a high level (95%) of confidence.

Table 5 Performance scores (with 95% confidence intervals) of the best-performing ClinicalBERT QA model on the test set in the gold-standard dataset and test datasets from cohort-short and cohort-long
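The paper does not specify how the intervals are estimated; one common approach, assumed here purely for illustration, is a nonparametric percentile bootstrap over the per-sample scores:

```python
import numpy as np

def bootstrap_ci(per_sample_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI over per-sample scores (an assumed method,
    not necessarily the one used in this study)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_sample_scores, dtype=float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_resamples)]
    return (np.percentile(means, 100 * alpha / 2),
            np.percentile(means, 100 * (1 - alpha / 2)))

# Example with made-up per-sample strict F1 scores (each 0 or 1).
print(bootstrap_ci([1, 0, 1, 1, 0, 1, 0, 1]))
```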

Effect of note length

Clinical notes have varying lengths, ranging from as few as 30 words to as many as 5747 words in our test datasets. Therefore, we want the QA model to perform consistently well across all lengths of clinical notes. To identify the effect of note length on the QA model’s performance, we calculate the length of the contexts (i.e., notes) in the three test sets and bin them into four quartiles based on their ascending lengths. Supplementary Data 1 and the x-axis in Fig. 3a show the length range of these bins, whereas the green bars with the right y-axis show the sample count for each bin. We find that note length does not have any notable effect on the model’s performance scores, as demonstrated in Supplementary Data 1 and on the left y-axis of Fig. 3a.

Fig. 3: Error analysis of the QA model. Blue bars refer to the sample count.
figure 3

Effect of context (note) length (a), question length (b), and gold-standard answer length (c) on QA model’s performance. d Performance of QA model for each query group. IDU injection drug use.
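The quartile binning used in this and the following analyses can be sketched as follows, with made-up note lengths and scores standing in for the test-set statistics:

```python
import pandas as pd

# Illustrative only: bin test samples into four quartiles of context (note) length
# and average a per-sample score within each bin.
df = pd.DataFrame({
    "note_length": [120, 450, 980, 2300, 310, 5100, 760, 1600],  # words per note (made up)
    "relaxed_f1":  [0.9, 0.7, 0.8, 0.75, 1.0, 0.6, 0.85, 0.7],   # per-sample scores (made up)
})
df["length_bin"] = pd.qcut(df["note_length"], q=4)  # four quartile bins
print(df.groupby("length_bin", observed=True)["relaxed_f1"].agg(["count", "mean"]))
```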

Effect of question length

We also examine the effect of question length on the performance of the QA model. For this analysis, we adopt the same binning approach as the one on note length. Figure 3b and Supplementary Data 1 show that, similar to note length, question length also has no effect on the model’s performance scores.

Effect of gold-standard answer length

In our test sets, we have varying lengths for the gold-standard answers (i.e., extracted information). For successful implementation, it is essential for the QA model to be able to extract different lengths of information from the clinical notes. Using the binning approach described earlier for the analysis on note length, we find that the QA model struggles to extract longer gold-standard answers with a strict match—demonstrated by the strict F1 score in Fig. 3c and Supplementary Data 1. Nevertheless, higher relaxed metric scores demonstrated by the QA model indicate its capability to identify the location of the correct answers. To improve the QA model’s proficiency in extracting longer answers with a strict match, additional research is required.

Performance for query groups

Based on the information we are interested in extracting from the clinical notes, we create nine query groups, as shown in Table 1. Supplementary Data 1 and the green bars along with the right y-axis in Fig. 3d show the sample count (in log scale) for the query groups in the test sets. The query group active/historical use dominates the datasets, followed by the query groups existence of IDU and drug names. Interestingly, we find that the model performs best on the query groups visible signs of IDU and existence of IDU. The query group visible signs of IDU presumably achieves higher overall performance, despite having the third lowest sample count in the test sets and the fifth lowest in the gold-standard dataset, because the information it queries usually contains consistent terms, such as track marks or needle track marks, along with limited other relevant information, for example, fresh track marks on his forearms. We hypothesize that the information extracted by this query group may be easier for the QA model to comprehend. However, further evaluation of the QA model is necessary to corroborate this hypothesis. Figure 3d also shows that the QA model struggles the most with the query group harm reduction interventions. This may be because harm reduction interventions has the fewest samples in the gold-standard dataset, making it difficult for the model to learn from the training samples; it also has the fewest samples in the test sets, limiting a comprehensive overview of the model’s performance.

Analysis of recall score

In this part of the discussion, we analyze the recall scores of the QA model to shed light on its overall capability to extract gold-standard answers. In cases where the strict F1 score for the predicted answer is 0, the recall score can demonstrate the overlap between the gold-standard and predicted answers. For the test set in our gold-standard dataset, our QA model achieved a strict F1 score of approximately 52%. For the remaining 48%, we examine the recall scores by binning them into 12 intervals (shown in Table 6). We also perform similar analyses for Cohort-Short and Cohort-Long. As indicated in Table 6, 14% of the predictions for the gold-standard test set, although lacking a strict match, exhibit a 100% overlap with the gold-standard answers. Similarly, for Cohort-Short and Cohort-Long, respectively, 7% and 15% of the predicted answers have a 100% overlap with the gold-standard answers while not having a strict match. One potential issue when considering 100% overlap without a strict match is that the predicted answer may be the entire context. To address this concern, we compare the ratio of the predicted answers (that do not have a strict F1 score of 1) to the contexts with the ratio of the gold-standard answers to the contexts. Figure 4 and Supplementary Data 2 show that the distribution of the percentage ratios of the predicted answers to the contexts is similar to that of the gold-standard answers to the contexts.

Table 6 Analysis of recall scores for cases where the predicted answers do not have a strict match with the gold-standard answer
Fig. 4: Distributions of the ratios of the predicted answers to the contexts and the ratios of the gold-standard answers to the contexts across 747 QA samples.
figure 4

This analysis specifically focuses on cases where the predicted answers exhibit 100% overlap with gold-standard answers without adhering to strict matching criteria. The ratios here are presented in the form of percentages.

Examples of predicted answers

We demonstrate the capability of the QA model by showing some randomly selected examples of the predicted answers along with the questions and gold-standard answers in Supplementary Table 5.

Analysis of model’s capability to identify whether a note contains IDU-related information or not

Our study focuses on extracting IDU-related information from clinical notes, but ideally, we also want our QA model to identify whether the note contains IDU-related information or not. As such, as an additional analysis, we examine the QA model’s ability to identify clinical notes that do not contain any mention of IDU keywords (Table 2) and, as such, are assumed to have no information about IDU. We hypothesize that given a clinical note with no mentions of IDU, the QA model should return an empty string because it could not find the information it was asked to retrieve.

To test this, we use patients from the test set in the gold-standard dataset. Recall that in our context processing step in Section Note enrichment, we remove notes that do not contain any IDU keywords. For this analysis, we incorporate 443 notes from 226 patients with no mentions of IDU keywords. We ensure that the notes only belong to the patients in the test set.

To annotate these notes, we use the query group—existence of IDU—as questions and empty strings as answers. For example, given a note with no mentions of IDU and the question—Has the pt ever injected drugs?, the QA model should return an empty string.

To measure the performance, we consider only the strict F1 score. Thus, if the predicted answer matches the empty string, we consider that a success (strict F1 score = 1) and otherwise a failure (strict F1 score = 0). We find that our QA model can identify approximately 88% of the clinical notes that do not contain any IDU-related information. Interestingly, we find that for 10% of the mispredicted answers, the model returned the string—empty. Additionally, we observe that the model returned a string consisting of a single period as the second most frequently mispredicted answer, accounting for 0.5% of the predictions. Therefore, we can say that while our QA model can extract IDU-related information from clinical notes, it also has the potential to identify notes that do not contain any.

Study limitations

This study has some limitations. First, the QA model was trained and tested on a dataset that had already undergone a fair amount of NLP pre-processing. Therefore, the model’s performance may be limited when generalized to raw, source clinical notes, and further evaluation is needed to establish its performance on such notes. Second, in many cases, we have noticed the use of terms such as—patient denied, or veteran tells me—for IDU-related information in the clinical notes. The QA model’s capabilities are limited to what is documented in the text from which it extracts the pertinent information. Therefore, the QA model must be implemented with supervision in the clinical setting. Third, our list of IDU keywords/phrases provided by SMEs to filter notes for generating gold-standard datasets is not exhaustive. Notably, drug names such as fentanyl or xylazine are absent from the list. Further assessment is required to measure the QA model’s capability to extract information related to these substances. Fourth, the datasets used in this study were manually reviewed by one reviewer. Including a second reviewer in the manual review process may provide more diverse perspectives, reducing the likelihood of individual biases or errors.

Conclusion

Detection of injection drug use (IDU) behavior among patients is crucial for informed patient care. In this paper, we tackle the challenging task of IDU-related information extraction from clinical notes. We build a QA system that takes in a clinical note and an end-user query on IDU and returns the information on IDU extracted from the note. We hope to integrate the QA model from this study into a user-friendly chatbot framework, enabling clinicians to inquire about information related to the nine categories identified in this study, with a view to collecting IDU evidence through an interactive platform. We evaluate our QA system on a gold-standard dataset built using clinical notes from the VA CDW and a combination of manual exploration, rule-based NLP techniques, and subject-matter expert validation. We also perform an additional evaluation to examine the capability of our QA model to extract information from temporally out-of-distribution notes. We then investigate the strengths and limitations of the QA model and identify potential avenues for future research by performing rigorous error analysis.

We have identified the following next steps for this research: (i) Examine the QA model’s capability to extract information from temporally out-of-distribution clinical notes by testing the model on a more recent set of clinical notes. (ii) Examine/enhance the QA model’s capability to handle raw clinical notes without the data-cleaning steps. (iii) Examine/enhance the QA model’s capability to extract information on illicit injection drugs that are not covered in this study, for example, xylazine. (iv) The extractive QA problem may benefit from the named entity recognition (NER) task70,71. Subsequent research could explore the integration of NER into the QA task for further investigation. (v) Expand the applications of QA tasks to extract other types of information from clinical notes, such as information related to alcohol use disorder and substance use disorder.

This method can support the accurate and efficient identification of people who inject drugs and relevant information extraction using their clinical notes to facilitate harm-reduction interventions and care.