Large language models to identify social determinants of health in electronic health records

Social determinants of health (SDoH) play a critical role in patient outcomes, yet their documentation is often missing or incomplete in the structured data of electronic health records (EHRs). Large language models (LLMs) could enable high-throughput extraction of SDoH from the EHR to support research and clinical care. However, class imbalance and data limitations present challenges for this sparsely documented yet critical information. Here, we investigated the optimal methods for using LLMs to extract six SDoH categories from narrative text in the EHR: employment, housing, transportation, parental status, relationship, and social support. The best-performing models were fine-tuned Flan-T5 XL for any SDoH mentions (macro-F1 0.71) and Flan-T5 XXL for adverse SDoH mentions (macro-F1 0.70). The benefit of adding LLM-generated synthetic data to training varied across models and architectures, but synthetic data improved the performance of smaller Flan-T5 models (delta F1 +0.12 to +0.23). Our best fine-tuned models outperformed ChatGPT-family models in the zero- and few-shot settings, except GPT4 with 10-shot prompting for adverse SDoH. Fine-tuned models were less likely than ChatGPT to change their predictions when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p < 0.05). Our models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured 2.0%. These results demonstrate the potential of LLMs in improving real-world evidence on SDoH and in assisting with the identification of patients who could benefit from resource support.


INTRODUCTION
Our ability to address health disparities remains limited by an insufficient understanding of their contributing factors. [2][3] Social determinants of health (SDoH) are defined by the World Health Organization as "the conditions in which people are born, grow, live, work, and age [...] shaped by the distribution of money, power, and resources at global, national, and local levels". 4 SDoH may be adverse or protective, impacting health outcomes at multiple levels, and they likely play a major role in disparities by determining access to and quality of medical care. For example, a patient cannot benefit from an effective treatment if they do not have transportation to the clinic. [6][7][8] In fact, SDoH are estimated to account for 80-90% of modifiable factors impacting health outcomes. 9 However, SDoH are rarely documented comprehensively in structured data in the electronic health records (EHRs), [10][11][12] creating an obstacle to research and clinical care. Instead, issues related to SDoH are most frequently described in the free text of clinic notes. This creates a bottleneck for incorporating these critical factors into databases to research the full impact and drivers of SDoH, and for proactively identifying patients who may benefit from additional social work and resource support.
Natural language processing (NLP) could address these challenges by automating the abstraction of these data from clinical texts. [14][15][16][17][18][19][20][21][22][23] Yet, there remains a need to optimize performance for the high-stakes medical domain and to evaluate state-of-the-art language models (LMs) for this task. In addition to performance changes anticipated to scale with model size, large LMs may support EHR mining via data augmentation. [25][26] The advanced capabilities of state-of-the-art large LMs to generate coherent text open new avenues for data augmentation through synthetic text generation. However, the optimal methods for generating and utilizing such data remain uncertain. Large LM-generated synthetic data may also be a means to distill knowledge represented in larger LMs into more computationally accessible smaller LMs. 27 In addition, few studies assess the potential bias of SDoH information extraction methods across patient populations. LMs could contribute to health inequities if they perform differently in diverse populations and/or recapitulate societal prejudices. 28 Therefore, understanding bias is critical for future development and deployment decisions.
In this study, we aimed to characterize the optimal methods and the role of synthetic clinical text for SDoH extraction in the large LM era. Specifically, we used LMs to extract 6 key SDoH: employment status, housing issues, transportation issues, parental status, relationship, and social support. Because SDoH data are sparsely documented, we assessed the value of adding large LM-generated synthetic SDoH data at the fine-tuning stage. Using a synthetic dataset, we evaluated the performance of state-of-the-art large LMs, including GPT3.5 and GPT4, to identify SDoH in the zero- and few-shot settings. We also explored the potential for algorithmic bias to impact LM predictions. Our results could yield real-world evidence on SDoH, assist in identifying patients who could benefit from resource and social work support, and raise awareness of this under-documented yet crucial topic.

Data
Table 1 describes the patient populations of the datasets used in this study. Our primary dataset consisted of a corpus of 800 clinic notes from 770 patients with cancer who received radiotherapy (RT) at the Department of Radiation Oncology at Brigham and Women's Hospital/Dana-Farber Cancer Institute in Boston, Massachusetts from 2015-2022. We also created two validation datasets. First, we collected 200 clinic notes from 170 patients with cancer treated with immunotherapy at Dana-Farber Cancer Institute who were not present in the RT dataset. Second, we collected 200 notes from 183 patients in the MIMIC (Medical Information Mart for Intensive Care)-III database 29,30 , which includes data associated with patients admitted to the critical care units at Beth Israel Deaconess Medical Center in Boston, Massachusetts from 2001-2008. This study was approved by the Mass General Brigham institutional review board, and consent was waived as this was deemed exempt human subjects research.
Only notes written by physicians, physician assistants, nurse practitioners, registered nurses, and social workers were included. To maintain a minimum threshold of information, we excluded notes with fewer than 150 tokens across all provider types; this helped ensure that the selected notes contained sufficient textual content. For notes written by all provider types except social workers, we excluded notes containing any section longer than 500 tokens, to avoid excessively lengthy sections that might have included less relevant or redundant information. For physician, physician assistant, and nurse practitioner notes, we used a customized medSpacy 31,32 sectionizer to include only notes that contained at least one of the following sections: Assessment and Plan, Social History, and History/Subjective. Please refer to Supplemental Materials, Appendix A, for more details on note selection for each dataset.
Prior to annotation, all notes were segmented into sentences using the syntok 33 sentence segmenter, and additionally split on bullet points ("•"). This method was used for all notes in the radiotherapy, immunotherapy, and MIMIC datasets for sentence-level annotation and subsequent classification.
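As an illustrative sketch of this preprocessing step, bullet-point splitting can be combined with sentence segmentation as follows; a naive regex-based splitter stands in for the syntok segmenter, whose output may differ on real clinical text:

```python
import re

def segment_note(text):
    """Split a note into sentence-like units: a naive punctuation-based
    sentence split (standing in for the syntok segmenter used in the study),
    followed by splitting on bullet points."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    units = []
    for sent in sentences:
        # Additionally split on the bullet character, dropping empty fragments.
        for part in sent.split("\u2022"):
            part = part.strip()
            if part:
                units.append(part)
    return units

note = "Social history: \u2022 Lives alone \u2022 Retired teacher. Daughter visits weekly."
print(segment_note(note))
# → ['Social history:', 'Lives alone', 'Retired teacher.', 'Daughter visits weekly.']
```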

Task definition and data labeling
We defined our label schema and classification tasks by first carrying out interviews with subject matter experts, including social workers, resource specialists, and oncologists, to determine SDoH that are clinically relevant but not already readily available as structured data in the EHR, especially as dynamic features over time. After the initial interviews, a set of exploratory pilot annotations was conducted on a subset of clinical notes and preliminary annotation guidelines were developed. The guidelines were then iteratively refined and finalized based on the pilot annotations and additional input from subject matter experts. The following SDoH categories and their attributes were selected for inclusion in the project: Employment status (employed, unemployed, underemployed, retired, disability, student), Housing issue (financial status, undomiciled, other), Transportation issue (distance, resource, other), Parental status (if the patient has a child under 18 years old), Relationship (married, partnered, widowed, divorced, single), and Social support (presence or absence of social support).
We defined two multilabel sentence-level classification tasks: 1. Any SDoH mentions: the presence of language describing an SDoH category as defined above, regardless of the attribute. 2. Adverse SDoH mentions: the presence of language describing an adverse SDoH attribute (Supplemental Materials, A1-2). A single annotator then annotated the remaining radiotherapy notes, the immunotherapy dataset, and the MIMIC-III dataset. Supplemental Materials, Appendix A, includes more details on the annotation process. Table 2 describes the distribution of labels across the datasets and the label-level inter-annotator agreement on the radiotherapy dataset.

Data augmentation
We employed synthetic data generation methods to assess the impact of data augmentation for the positive class, and also to enable an exploratory evaluation of proprietary large LMs that could not be used with protected health information. In round 1, the GPT-turbo-0301 (ChatGPT) version of GPT3.5, accessed via the OpenAI 34 API, was prompted to generate new sentences for each SDoH category, using sentences from the annotation guidelines as references. In round 2, to generate more linguistic diversity, the synthetic sentences output from round 1 were taken as references to generate another set of synthetic sentences. One hundred sentences per category were generated in each round. Supplemental Material, Table A4, provides full details of the prompting methods.
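The round-based generation can be sketched as below. The prompt wording paraphrases the style of prompt shown in Supplemental Table A4, and the `client` argument is assumed to follow the OpenAI Python SDK's `chat.completions.create` interface; the study's exact prompts and parameters may differ:

```python
def build_generation_prompt(category, reference_sentences, n=100):
    """Assemble a chat-style prompt asking the model for n new synthetic
    sentences about one SDoH category, seeded with reference sentences.
    The wording paraphrases Table A4 and is illustrative only."""
    examples = "\n".join(f"- {s}" for s in reference_sentences)
    user_msg = (
        "Imagine you are a physician. Please give me "
        f"{n} sentences from your clinic notes about patients' {category} "
        f"similar to the examples below.\n{examples}"
    )
    return [{"role": "user", "content": user_msg}]

def generate_round(client, category, references, model="gpt-3.5-turbo-0301"):
    """One generation round; round 2 simply passes round-1 outputs back in
    as the reference sentences. `client` is any object exposing an
    OpenAI-style chat.completions.create call."""
    messages = build_generation_prompt(category, references)
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```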

Synthetic test set generation
This process yielded 538 synthetic SDoH sentences, which were manually validated to create a test set for evaluating ChatGPT, which cannot be used with protected health information. Of these, 480 were confirmed to have any SDoH mention, and 289 to have an adverse SDoH mention (Table 2). For all synthetic data generation methods, no real patient data were used in prompt development or fine-tuning.
Table 2. Distribution of documents and sentence labels in each dataset. a Synthetic Validated = sentences used to evaluate GPT models; thus there is no demographic information for this dataset. b Synthetic Demo = sentences used for bias evaluation, where demographic descriptors were inserted. All data presented as n (%) unless otherwise noted. N.B. Labels sum to >100% because some sentences had more than one SDoH label. SDoH = social determinants of health; N/A = not applicable.

Model development
The radiotherapy corpus was split 60%/20%/20% into training, development, and test sets, respectively. The entire immunotherapy and MIMIC-III corpora were held out for validation and were not used during model development.
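A minimal sketch of such a split (illustrative only; the study's exact procedure, e.g., any patient-level grouping, is not specified here):

```python
import random

def split_corpus(note_ids, seed=0):
    """Shuffle note identifiers and split 60%/20%/20% into
    train/dev/test pools."""
    ids = list(note_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for this sketch
    n = len(ids)
    n_train, n_dev = int(0.6 * n), int(0.2 * n)
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]

train, dev, test = split_corpus(range(800))
print(len(train), len(dev), len(test))  # → 480 160 160
```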
The experimental phase of this study focused on investigating the effectiveness of different machine learning models and data settings for the classification of SDoH. We explored one multi-label BERT model as a baseline, namely bert-base-uncased 35 , as well as a range of Flan-T5 models 36,37 , including Flan-T5 base, large, XL, and XXL, where XL and XXL used a parameter-efficient tuning method (low-rank adaptation, LoRA 38 ). Binary cross-entropy loss with logits was used for BERT, and cross-entropy loss for the Flan-T5 models. Because Flan-T5 is a sequence-to-sequence architecture, we predicted our label space as the target vocabulary and post-processed the output with a simple dictionary mapping (e.g., 'RELAT' → 'RELATIONSHIP'). Given the large class imbalance, non-SDoH sentences were undersampled during training. We assessed the impact of adding synthetic data on model performance.
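The dictionary-mapping post-processing step can be sketched as follows. Only the 'RELAT' → 'RELATIONSHIP' mapping is given in the text, so the other target tokens and the separator used here are assumptions for illustration:

```python
# Hypothetical target-vocabulary tokens; only 'RELAT' -> 'RELATIONSHIP'
# is confirmed by the text.
LABEL_MAP = {
    "EMPLOY": "EMPLOYMENT",
    "HOUS": "HOUSING",
    "TRANS": "TRANSPORTATION",
    "PARENT": "PARENT",
    "RELAT": "RELATIONSHIP",
    "SUPPORT": "SUPPORT",
}

def decode_labels(generated, sep=";"):
    """Post-process a seq2seq output string into a set of SDoH labels,
    ignoring any token not in the mapping."""
    labels = set()
    for token in generated.split(sep):
        token = token.strip().upper()
        if token in LABEL_MAP:
            labels.add(LABEL_MAP[token])
    return labels

print(decode_labels("RELAT; SUPPORT"))
```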

Ablation studies
Ablation studies were carried out to understand the impact of manually labeled training data quantity on performance when synthetic SDoH data are included in the training dataset. Models were trained using 10%, 25%, 40%, 50%, 70%, 75%, and 90% of manually labeled sentences; both SDoH and non-SDoH sentences were reduced at the same rate.
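Reducing both sentence pools at the same rate can be sketched with a hypothetical helper (not the study's code):

```python
import random

def subsample_at_rate(sdoh_sentences, non_sdoh_sentences, keep_frac, seed=0):
    """Reduce gold SDoH and non-SDoH sentences at the same rate, as in the
    ablation studies; returns the combined reduced training pool."""
    rng = random.Random(seed)
    kept_sdoh = rng.sample(sdoh_sentences, int(len(sdoh_sentences) * keep_frac))
    kept_non = rng.sample(non_sdoh_sentences, int(len(non_sdoh_sentences) * keep_frac))
    return kept_sdoh + kept_non

# E.g., keeping 50% of 100 SDoH and 1,000 non-SDoH sentences leaves 550 total.
pool = subsample_at_rate(list(range(100)), list(range(100, 1100)), 0.5)
print(len(pool))  # → 550
```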

Evaluation
During training and fine-tuning, we evaluated all models using the development set and assessed their final performance on the held-out test set. For each classification task, we calculated precision/positive predictive value, recall/sensitivity, and F1 (the harmonic mean of precision and recall) (Supplemental Materials, Appendix A). Manual error analysis was conducted on the radiotherapy dataset using the best-performing model.
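These metrics can be computed from per-class counts as in the sketch below, with macro-F1 as the unweighted mean of per-class F1 scores; the example counts are illustrative, not study results:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from per-class true/false positive and
    false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def macro_f1(counts):
    """Unweighted mean of per-class F1 scores; `counts` maps each label
    to a (tp, fp, fn) tuple."""
    scores = [prf1(*c)[2] for c in counts.values()]
    return sum(scores) / len(scores)

counts = {"EMPLOYMENT": (8, 2, 2), "HOUSING": (1, 1, 1)}  # illustrative
print(round(macro_f1(counts), 2))  # → 0.65
```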

ChatGPT-family model evaluation
To evaluate ChatGPT, the Scikit-LLM 39 multi-label zero-shot classifier and few-shot binary classifier were adapted to form a multi-label zero- and few-shot classifier (Figure 1). A subset of 364 unique synthetic sentences, whose labels were manually validated, was used for testing. Test sentences were inserted into the prompt template, which instructs ChatGPT to act as a multi-label classifier and to label the sentences accordingly. Of note, because we were unable to generate high-quality synthetic non-SDoH sentences, these classifiers did not include a negative class. We evaluated the most current ChatGPT model freely available at the time of this work, GPT-turbo-0613, as well as GPT4-0314, via the OpenAI API.
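A hedged sketch of what such a multi-label zero-shot prompt might look like; the wording is illustrative and is not the exact adapted Scikit-LLM template shown in Figure 1:

```python
def zero_shot_prompt(sentence, labels):
    """Build a multi-label zero-shot classification prompt in the spirit
    of the adapted Scikit-LLM template (illustrative wording only)."""
    label_list = ", ".join(labels)
    return (
        "You are a multi-label text classifier for social determinants of "
        f"health. Possible labels: {label_list}. "
        "Return every label that applies to the sentence, as a JSON list.\n"
        f"Sentence: {sentence}"
    )

SDOH_LABELS = ["EMPLOYMENT", "HOUSING", "TRANSPORTATION",
               "PARENT", "RELATIONSHIP", "SUPPORT"]
prompt = zero_shot_prompt("He is retired and lives with his wife.", SDOH_LABELS)
print(prompt)
```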

Language model bias evaluation
To test for bias in our best-performing models and in large LMs pre-trained on general text, we used GPT4 to insert demographic descriptors into our synthetic data, as illustrated in Figure 2. GPT4 was supplied with our synthetically generated test sentences and prompted to insert demographic information into them (Supplemental Material, Appendix A). For example, a sentence starting with "Widower admits fears surrounding potential judgment…" might become "Hispanic widower admits fears surrounding potential judgment…". These sentences were then manually validated; 419 had any SDoH mention, and 253 had an adverse SDoH mention. The rates of discrepant SDoH classifications with and without the injection of demographic information were compared between the best-performing fine-tuned models and ChatGPT using chi-squared tests for multi-class comparisons and 2-proportion z-tests for binary comparisons. A 2-sided P ≤ 0.05 was considered statistically significant. Statistical analyses were carried out using the SciPy Python package (scipy.org).
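The 2-proportion z-test can be sketched as follows. The example counts (60/419 vs. 90/419 discrepant pairs) are assumptions chosen to roughly match the 14.3% vs. 21.5% rates reported in the results, not the study's exact counts:

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided 2-proportion z-test using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * P(Z > |z|) for a standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed counts approximating the reported 14.3% vs. 21.5% discrepancy rates.
z, p = two_proportion_z_test(60, 419, 90, 419)
print(round(p, 3))  # → 0.007
```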

Comparison with structured EHR data
To assess the completeness of SDoH documentation in structured versus unstructured EHR data, we collected Z-codes for all patients in our test set. Z-codes are SDoH-related ICD-10-CM diagnostic codes; we used the Z-codes that mapped most closely to our SDoH categories and were present as structured data for the radiotherapy dataset (Supplemental Materials, Table A3). Text-extracted patient-level SDoH information was defined as the presence of one or more labels in any note. We compared these patient-level labels to structured Z-codes entered in the EHR during the same time frame.
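The patient-level aggregation rule (a patient is positive for a label if it appears in any of their notes) can be sketched as:

```python
from collections import defaultdict

def patient_level_labels(note_predictions):
    """Aggregate note-level label predictions to the patient level: a
    patient is positive for a label if it appears in any of their notes.
    `note_predictions` is an iterable of (patient_id, labels) pairs."""
    by_patient = defaultdict(set)
    for patient_id, labels in note_predictions:
        by_patient[patient_id].update(labels)
    return dict(by_patient)

# Illustrative predictions across notes for two hypothetical patients.
preds = [("p1", {"HOUSING"}), ("p1", set()), ("p2", {"SUPPORT", "RELATIONSHIP"})]
print(patient_level_labels(preds))
```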
The final annotation guidelines, analytic code, and all synthetic datasets used in this study are available at https://github.com/AIM-Harvard/SDoH. Python version 3.9.16 (Python Software Foundation) was used to carry out this work.

Model performance
Table 3 shows the performance of fine-tuned models for both SDoH tasks on the radiotherapy test set. The best-performing model for the any SDoH mention task was Flan-T5 XL with synthetic data augmentation (macro-F1 0.71), and the best-performing model for the adverse SDoH mention task was Flan-T5 XXL without synthetic data (macro-F1 0.70). In general, the Flan-T5 models outperformed BERT, and model performance scaled with size. However, although the Flan-T5 XL and XXL models were the largest models evaluated in terms of total parameters, because LoRA was used for their fine-tuning, the fewest parameters were tuned for these models: 9.5M and 18M for Flan-T5 XL and XXL, respectively, compared with 110M for BERT. The negative class generally had the best performance overall, followed by Relationship and Employment; performance varied considerably across models for the other classes. For both tasks, the best-performing models with synthetic data augmentation used sentences from both rounds of GPT3.5 prompting. Synthetic data augmentation tended to lead to the largest performance improvements for classes with few instances in the training dataset and for which the model trained on gold-only data had very low performance (Housing, Parent, and Transportation).
The performance of the best-performing models for each task on the immunotherapy and MIMIC-III datasets is shown in Table 4. Performance was similar on the immunotherapy dataset, which represents a separate but similar patient population treated within the same hospital system. We observed a performance decrement on the MIMIC-III dataset, which represents a more dissimilar patient population from a different hospital system. Performance was similar between models developed with and without synthetic data.

Ablation studies
The ablation studies showed a consistent deterioration in model performance across all SDoH tasks and categories as the volume of real gold SDoH sentences progressively decreased, although models that included synthetic data maintained performance at higher levels throughout and were less sensitive to decreases in gold data (Figure 3, Supplemental Material Table B1). When synthetic data were included in training, performance was maintained until approximately 50% of the gold data were removed from the training set. Conversely, without synthetic data, performance dropped after only about 10-20% of the gold data were removed from the training set, mimicking a true low-resource setting.

Error analysis
The leading discrepancies between ground truth and model predictions for each task are shown in Supplemental Material, Table B2. Qualitative analysis revealed four distinct error patterns: human annotator error; false positives and false negatives for the Relationship and Support labels in the presence of any family mentions; incorrect labels due to information present in the note but external to the sentence, and therefore not accessible to the model; and incorrect labeling of a non-adverse SDoH as an adverse SDoH.

ChatGPT-family model performance
When evaluating our fine-tuned Flan-T5 models against GPT-turbo-0613 and GPT4-0314 on the synthetic validation dataset, our models surpassed the performance of the top-performing 10-shot GPT model by a margin of macro-F1 0.05 (Figure 4).

Language model bias evaluation
Both the fine-tuned Flan-T5 models and ChatGPT provided discrepant classifications for sentence pairs with and without demographic information injected (Figure 5). However, the discrepancy rate of our fine-tuned models was nearly half that of ChatGPT: 14.3% vs. 21.5% of sentence pairs for any SDoH (P = 0.007) and 9.9% vs. 18.2% of sentence pairs for adverse SDoH (P = 0.005) for fine-tuned Flan-T5 vs. ChatGPT, respectively. ChatGPT was significantly more likely to change its classification when a female gender was injected compared to a male gender for the any SDoH task (P = 0.01); no other within-model comparisons were statistically significant. Sentences gold-labeled as Support were most likely to lead to discrepant predictions for ChatGPT for both any SDoH and adverse SDoH mentions (56.3% (27/48) and 21.0% (9/29), respectively). Employment gold-labeled sentences were most likely to lead to discrepant predictions for the any SDoH mention fine-tuned model (14.4% (13/90)), and Transportation for the adverse SDoH mention fine-tuned model (12.2% (6/49)).

Comparison with structured EHR data
Our best-performing models correctly identified 95.7% (89/93) of patients with at least one SDoH mention and 93.8% (45/48) of patients with at least one adverse SDoH mention (Supplemental Material, Tables B3-4). SDoH entered as structured Z-codes in the EHR during the same timespan identified 2.0% (1/48) of patients with at least one adverse SDoH mention (all mapped Z-codes were adverse) (Supplemental Material, Table B5). Supplemental Material, Figures B1-2, show that patient-level performance using model predictions outperformed Z-codes by a factor of at least 3 for every label in each task (macro-F1 0.78 vs. 0.17 for any SDoH mention and 0.71 vs. 0.17 for adverse SDoH mention).

DISCUSSION
We developed multilabel classifiers to identify the presence of 6 different SDoH documented in clinical notes, demonstrating the potential of large LMs to improve the collection of real-world data on SDoH and to support appropriate allocation of resource support to the patients who need it most. We identified a substantial performance gap between a more traditional BERT classifier and the larger Flan-T5 XL and XXL models. Our fine-tuned models outperformed ChatGPT-family models with zero- and few-shot learning, and were less sensitive to the injection of demographic descriptors. Compared to diagnostic codes entered as structured data, text-extracted data identified 91.8% more patients with an adverse SDoH. We also contribute new annotation guidelines as well as synthetic SDoH datasets to the research community.
All of our models performed well at identifying sentences that do not contain SDoH mentions (F1 ≥ 0.99 for all). For any SDoH mentions, performance was worst for parental status and transportation issues. For adverse SDoH mentions, performance was worst for parental status and social support. These findings are unsurprising given the marked class imbalance for all SDoH labels: only 3% of sentences in our training set contained any SDoH mention. Given this imbalance, our models' ability to identify sentences that contain SDoH language is impressive. In addition, these SDoH descriptions are semantically and linguistically complex. In particular, sentences describing social support are highly variable, given the variety of ways individuals can receive support from their social systems during care. Interestingly, our best-performing models demonstrated strong performance in classifying housing issues (macro-F1 0.67), which was our scarcest label, with only 20 instances in the training dataset. This speaks to the potential of large LMs for improved real-world data collection of very sparsely documented information, which is the most likely to be missed by manual review.
The recent advancements in large LMs have opened a pathway for synthetic text generation that may improve model performance via data augmentation and enable experiments that better protect patient privacy. 40 This is an emerging area of research that falls within a larger body of work on synthetic patient data across a range of data types and end uses. 41,42 Our study is among the first to evaluate the role of contemporary generative large LMs in producing synthetic clinical text to help unlock the value of unstructured data within the EHR. We were particularly interested in synthetic clinical data as a means to address the aforementioned scarcity of SDoH documentation, and our findings may provide generalizable insights for the common clinical NLP challenge of class imbalance: many clinically important data points are difficult to identify among the huge amounts of text in a patient's EHR. We found variable benefits of synthetic data augmentation across model architectures and sizes; the strategy was most beneficial for the smaller Flan-T5 models and for the rarest classes, where performance was dismal using gold data alone. Importantly, the ablation studies demonstrated that only approximately half of the gold-labeled dataset was needed to maintain performance when synthetic data were included in training, although synthetic data alone did not produce high-quality models. Of note, we aimed to understand whether synthetic data for augmentation could be automatically generated using ChatGPT-family models without additional human annotation, so it is possible that manual gold-labeling could further enhance the value of these data. However, this would decrease the value of synthetic data in terms of reducing annotation effort.
Our novel approach to generating synthetic clinical sentences also enabled us to explore the potential of ChatGPT-family models, GPT3.5 and GPT4, for supporting the collection of SDoH information from the EHR. We found that fine-tuning LMs that are orders of magnitude smaller than ChatGPT-family models, even with our relatively small dataset, outperformed zero-shot and few-shot learning with ChatGPT-family models, consistent with prior work evaluating large LMs for clinical uses. 43,44 Nevertheless, GPT3.5 in particular showed promising performance for a model that was not explicitly trained for clinical tasks.
We were especially concerned that SDoH-containing language may be particularly prone to eliciting the societal biases encoded in LMs. Both our fine-tuned models and ChatGPT altered their SDoH classification predictions when race/ethnicity and gender descriptors were injected into sentences, although the fine-tuned models were significantly more robust than ChatGPT. Although the differences were not statistically significant, it is worth noting that for both the fine-tuned models and ChatGPT, Hispanic and Black descriptors were most likely to change the classification for any SDoH and adverse SDoH mentions, respectively. This lack of significance may be due to the small numbers in this evaluation, and future work is critically needed to further evaluate bias in clinical LMs. We have made our paired demographic-injected sentences openly available for future efforts on LM bias evaluation.
Our findings that text-extracted SDoH information was better able to identify patients with adverse SDoH than relevant billing codes are in agreement with prior work showing under-utilization of Z-codes. 10,11 Most EMR systems have other ways to enter SDoH information as structured data, which may have more complete documentation; however, these did not exist for most of our target SDoH. Lyberger et al. evaluated other EHR sources of structured SDoH data and similarly found that NLP methods are a complementary source of SDoH information, identifying 10-30% of patients with tobacco, alcohol, and homelessness risk factors documented only in unstructured text. 22
The most common SDoH targeted in prior efforts include smoking history, substance use, alcohol use, and homelessness. 23 In addition, many prior efforts focus only on text in the Social History section of notes. In a recent shared task on alcohol, drug, tobacco, employment, and living situation event extraction from Social History sections, pre-trained LMs similarly provided the best performance. 51 We also developed methods that can mine information from full clinic notes, not only Social History sections, which is a fundamentally more challenging task with a much larger class imbalance. Clinically impactful SDoH information is often scattered throughout other note sections, and many note types, such as inpatient progress notes and notes written by nurses and social workers, do not consistently contain Social History sections.
Our study has limitations. First, our training and validation datasets come from a predominantly white population treated at hospitals in Boston, Massachusetts in the United States of America, which limits the generalizability of our findings. Second, we could not exhaustively assess the many methods to generate synthetic data from ChatGPT. Instead, we chose to investigate prompting methods that could be easily reproduced by others and did not require extensive task-specific optimization, as such optimization is likely not feasible for the many clinical NLP tasks for which one may wish to generate synthetic data. Incorporating real clinical examples in the prompt would likely improve the quality of the synthetic data; this is an area of future research for when large generative LMs become more widely available for use with protected health information and within the resource constraints of academic researchers and healthcare systems. Finally, because we could not evaluate ChatGPT-family models using protected health information, our evaluations are limited to manually verified synthetic sentences. Thus, our reported performance may not completely reflect true performance on real clinical text. Because the synthetic sentences were generated using ChatGPT itself, and ChatGPT presumably has not been trained on clinical text, we hypothesize that, if anything, performance would be worse on real clinical data.

CONCLUSIONS
Our findings highlight the potential of large LMs to improve real-world data collection and identification of SDoH from the EHR. In addition, synthetic clinical text generated by large LMs may enable better identification of rare events documented in the EHR, although more work is needed to optimize generation methods. Our fine-tuned models outperformed and were less prone to bias than ChatGPT-family models, despite being orders of magnitude smaller. In the future, these models could improve our understanding of the drivers of health disparities by improving real-world evidence, and could directly support patient care by flagging patients who may benefit most from proactive resource and social work referral.

Table 1 .
Patient demographics across datasets. a Synthetic Validated = sentences used to evaluate GPT models; thus there is no demographic information for this dataset. b Synthetic Demo = sentences used for bias evaluation, where demographic descriptors were inserted. All data presented as n (%) unless otherwise noted. N/A = not applicable.

Figure 1 .
Figure 1. Example of prompt templates used in the SKLLM package for GPT-turbo-0301 (GPT3.5) and GPT4 to classify our labeled synthetic data. {labels} and {training_data} were sampled from a separate synthetic dataset, which was not human-annotated. The final label output is shown highlighted in green.

Figure 2 .
Figure 2. Illustration of generating and comparing synthetic demographic-injected SDoH language pairs to assess how adding race/ethnicity and gender information to a sentence may impact model performance. FT = fine-tuned.

a
Delta F1 score is the change in macro-F1 when synthetic data are added to the fine-tuning data. Bolded text indicates the best performance with and without synthetic data augmentation. SDoH = social determinants of health.

a
Delta F1 score is the change in macro-F1 when synthetic data are added to the fine-tuning data. Bolded text indicates the best performance with and without synthetic data augmentation. SDoH = social determinants of health. Best models are determined by performance on the RT test set.

Figure 3 .
Figure 3. Performance (macro-F1) of Flan-T5 XL models fine-tuned using gold data only (orange line) and gold plus synthetic data (blue line), as gold-labeled sentences are gradually removed from the training dataset, for the (a) any social determinant of health (SDoH) mention task and (b) adverse SDoH mention task. The full gold-labeled training set comprises 29,869 sentences, augmented with 1,800 synthetic SDoH sentences.

Figure 4 .
Figure 4. Comparison of model performance between our fine-tuned Flan-T5 models and zero- and 10-shot GPT. Macro-F1 was measured using our manually validated synthetic dataset. The GPT-turbo-0613 version of GPT3.5 and the GPT4-0314 version of GPT4 were used. The red dashed lines indicate the performance of the best-performing fine-tuned Flan-T5 models for each task.

Figure 5 .
Figure 5. The proportion of synthetic sentence pairs, with and without demographics injected, that led to a classification mismatch, meaning that the model predicted a different SDoH label for each sentence in the pair. Results are shown across race/ethnicity and gender for the (a) adverse SDoH mention task and (b) any SDoH mention task. Asterisks indicate statistical significance (P ≤ 0.05).

Figure B1 .
Figure B1. Class-wise and macro-F1 scores of our best-performing model against mapped Z-codes at the patient level (on the test and dev sets).

Table 3 .
Model performance on the RT dataset

Table A2 .
Inter-annotator agreement for higher-level SDoH mention labels. SDoH = social determinants of health.

Table A3 .
Z-Code to SDoH label mappings

Table A4 .
Prompts used to generate synthetic SDoH sentences using GPT3.5: {"role": "user", "content": "Imagine you are a physician. Please give me 100 sentences from your clinic notes about various patient's social support similar to the examples."}