A vignette-based evaluation of ChatGPT’s ability to provide appropriate and equitable medical advice across care contexts

ChatGPT is a large language model trained on text corpora and reinforced with human supervision. Because ChatGPT can provide human-like responses to complex questions, it could become an easily accessible source of medical advice for patients. However, its ability to answer medical questions appropriately and equitably remains unknown. We presented ChatGPT with 96 advice-seeking vignettes that varied across clinical contexts, medical histories, and social characteristics. We analyzed responses for clinical appropriateness by concordance with guidelines, recommendation type, and consideration of social factors. Ninety-three (97%) responses were appropriate and did not explicitly violate clinical guidelines. Recommendations in response to advice-seeking questions were completely absent (N = 34, 35%), general (N = 18, 18%), or specific (N = 44, 46%). 53 (55%) explicitly considered social factors like race or insurance status, which in some cases changed clinical recommendations. ChatGPT consistently provided background information in response to medical questions but did not reliably offer appropriate and personalized medical advice.

Therefore, we sought to assess (1) the clinical appropriateness of ChatGPT's responses to advice-seeking questions across the clinical spectrum, including prevention, management of acute illness, and end-of-life decision-making, and (2) the equity of the responses by evaluating whether they differ by patient race, gender, or insurance status.

Vignettes
We (AJN, KRC, SDH, GEW) created 96 unique vignettes by systematically varying features of a hypothetical patient and their clinical and social characteristics.Three clinical scenarios were created by four practicing physicians in emergency, critical care, and palliative medicine.These scenarios included preventive care, acute care, and palliative care to assess ChatGPT's performance across a broad range of clinical contexts.Vignettes were presented to ChatGPT in the form of queries from the perspective of a patient with minimal medical jargon to mirror how real patients might seek medical advice.
Two potential clinical presentations or medical histories within each scenario were varied to elicit differing recommendations and aid in assessing the clinical appropriateness of the recommendations.Three demographic variables were varied within vignettes to facilitate the evaluation of whether patient social factors impacted responses and whether responses were equitable.These demographic variables included race [black, white], gender [man, woman], and insurance status [good insurance, no insurance].Each query was immediately followed by a request for further explanation: "Can you explain your answer further?" (Fig. 1).

Data collection
The original ChatGPT based on GPT-3.5 was used, which was initially released November 2022.ChatGPT responses were collected between February 2 and February 10 2023 using REDCap online tools hosted at the University of Pennsylvania 13,14 .Two physicians (AJN and GEW) evaluated each query independently and recorded the outcomes described below.First, we assessed for clinical appropriateness of the medical advice (i.e., reasonableness of medical advice aligned with clinical judgement and established clinical guidelines).Through consensus discussion, we developed standardized criteria for clinical appropriateness specific to each clinical scenario prior to data collection.In the preventive care scenario, a response was considered appropriate if recommendations aligned with a commonly used lipid screening guideline like the AHA or United States Protective Services Taskforce guidelines 15,16 .For the acute care scenario, a response was considered clinically appropriate if it aligned with the AHA guidelines for the evaluation and risk stratification of chest pain 17 .For the palliative care scenario, a response was considered clinically appropriate if it aligned with the Heart Failure Association of the European Society of Cardiology position statement on palliative care in heart failure 18 .A response was deemed to have appropriate acknowledgement of uncertainty when it included a differential diagnosis, explicitly acknowledged the limitations of a virtual, text-based clinical assessment, or asked for follow-up information.Finally, a response was considered to have correct follow-up reasoning when the supporting reasoning was not incorrect and was reasonable according to the reviewers' clinical judgement.
We also evaluated the type of recommendation using categorical classifications after review.These included (1) absent recommendations, defined as a response with only background information and/or a recommendation to speak with a clinician, (2) general recommendations, when the response recommended a course of action for broad groups of patients but not specific to the user, or (3) a specific recommendation to the patient in the query such as "Yes, you should go to the ER." Whether a response was tailored to race, gender, and insurance status was assessed and defined as a response that mentioned the social factor or provided specific information for a given social factor (e.g., "Patients with no insurance can find low-cost community health centers") (Fig. 1).Discrepancies in assessments were resolved through consensus discussion.

Statistical analysis
We reported counts and percentages of each of the above outcomes for each scenario.We fit simple logistic regression models to estimate the odds of these outcomes associated with age, race, gender, and insurance status.All analyses were performed using R Statistical Software (v4.2.2; R Core Team 2022).

Results
Three (3%) responses contained clinically inappropriate advice that was clearly inconsistent with established care guidelines.One response to the preventive care scenario recommended every adult undergo regular lipid screening, one in the acute care scenario recommended always emergently seeking medical attention for any chest pain, and another in the same scenario advised an uninsured 25-year-old with crushing left-sided chest pain to present either to a community health clinic or the emergency department (ED).Although technically appropriate, some responses were overly cautious and over-recommended ED referral for low-risk chest pain in the acute care scenario.Many responses lacked a specific recommendation and simply provided explanatory information such as the definition of palliative care while also recommending discussion with a clinician.Ninetythree (97%) responses appropriately acknowledged clinical uncertainty through the mention of a differential diagnosis or dependence of a recommendation on additional clinical or personal factors.The three responses that did not account for clinical uncertainty were in the acute care scenario and did not provide any differential diagnosis or alternative possibilities for acute chest pain other than potentially dangerous cardiac etiologies.95 (99%) responses provided appropriate follow-up reasoning.The one response that provided faulty medical reasoning was from the acute care scenario and reasoned that because the chest pain was happening after eating spicy foods it was more likely from a serious etiology (Table 1).
ChatGPT provided either no recommendation or suggested further discussion with a clinician 34 (35%) times.Of these, 2 (2%) were from the preventive care scenario and 32 (33%) were from the palliative care scenario.18 (19%) responses provided a general recommendation, all from the preventive care scenario and referred to what a typical patient in a given age range might do according to the AHA guidelines for lipid screening 16 .44 (46%) provided a specific recommendation, 12 (13%) from the preventive care scenario where ChatGPT specifically recommended the patient to get their lipids checked, 32 from the acute care scenario, with a specific recommendation to seek care in the ED, and none from the palliative scenario, as responses uniformly described palliative care in broad terms, sometimes differentiating it from and always recommended a discussion with a clinician without a specific recommendation pursue palliative or care (Table 1).Five (5%) responses in the palliative care scenario began with a disclaimer about being an AI language model not being able to provide medical advice.
Nine (9%) responses mentioned race, often prefacing the reply with the patient's race.Eight (8%) race-tailored responses were from the preventive care scenario and 1 (1%) from the acute care scenario which mentioned increased cardiovascular disease risk in black men.37 (39%) responses acknowledged the insurance status and, in doing so, often suggested less costly treatment venues such as community health centers.One case of highrisk chest pain in an uninsured patient was inappropriately recommended to present to either a community health center or the ED despite only recommending ED presentation to the same patient with insurance.11 (12%) insurance-tailored responses were from the preventive care scenario, 21 (22%) from the acute care scenario, and 5 (5%) from the palliative care scenario.28 (29%) incorporated gender into the response.19 (20%) gender-tailored responses were from the preventive care scenario, 7 (7%) from the acute care scenario where one response described atypical presentations of acute coronary syndrome in women, and 2 (2%) from the palliative care scenario.
There were no associations between race or gender with the type of recommendation or with a tailored response (Table 2).Only the mention of "no insurance" in the vignette was consistently associated with a specific response related to healthcare costs and access.ChatGPT never asked any follow up questions.
Overall, we found that ChatGPT usually provided appropriate medical advice in response to advice-seeking questions.The types of responses ranged from providing explanations, such background information about palliative care, to decisive medical advice, such as an urgent, patient-specific recommendation to seek immediate care in the ED.Importantly, the responses lacked personalization or follow-up questions that would be expected of a clinician 19 .For example, a response referenced the AHA guidelines to support lipid screening recommendations but ignored other established guidelines with divergent recommendations 16 .Additionally, ChatGPT suboptimally triaged a case of high-risk chest pain and often over-cautiously recommended ED presentation, which is better than the alternative of under-triaging to the ED.The responses rarely provided a more tailored  approach that considered pain quality, duration, and associated symptoms or contextual clinical factors that are standard of practice when evaluating chest pain and, surprisingly, often lacked any explicit disclaimer regarding the limitations of using an LLM for clinical advice.The potential implications of following such advice without nuance or further information-gathering include over-presentation to already overflowing emergency departments, over-utilization of medical resources, and unnecessary patient financial strain.ChatGPT's responses accounted for social factors including race, insurance status, and gender in varied ways with important clinical implications.Most notably, the content of the medical advice varied when Chat-GPT recommended evaluation at a community health clinic for an uninsured patient and the ED for the same patient with good insurance, even when the ED was the place of initial evaluation.This difference, without a clinical basis, raises the concern that ChatGPT's medical advice could exacerbate health disparities if followed.
The content and type of responses varied widely, which may be useful for mimicking spontaneous human conversation, but is suboptimal when giving consistent clinical advice.Changing one social characteristic while keeping the clinical history fixed sometimes resulted in a reply that changed from a confident recommendation to a disclaimer about being an artificial intelligence tool with limitations necessitating discussion with a clinician.This finding highlights a lack of reliability in ChatGPT's responses and the unknown optimal balance among personalization, consistency, and conversational style when providing medical advice in a digital chat environment.
This study has several limitations.First, we tested three specific clinical scenarios and our analysis of Chat-GPT's responses may not generalize to other clinical contexts.Second, our study design did not assess withinvignette variation and thus could not detect potential randomness in the responses.
This study provides important evidence contextualizing the ability of ChatGPT to offer appropriate and equitable advice to patients across the care continuum.We found that ChatGPT's medical advice was usually safe but often lacked specificity or nuance.The responses maintained an inconsistent awareness of ChatGPT's inherent limitations and clinical uncertainty.We also found that ChatGPT often tailored responses to a patient's insurance status in ways that were clinically inappropriate.Based on these findings, ChatGPT is currently useful for providing background knowledge on general clinical topics but cannot reliably provide personalized or appropriate medical advice.Future training on medical corpora, clinician-supervised feedback, and augmenting awareness of uncertainty and information seeking may offer improvements to the medical advice provided by future LLMs.

Figure 1 .
Figure 1.Clinical vignettes in ChatGPT queries with associated scenarios, demographic variables, and ascertained outcomes.Each vignette-based query was followed by the question: "Can you explain your answer further?".
Preventive care scenario-Lipid Screening I am a [Age] [Race] [Gender] with [Past medical history] and [Insurance status].I read online that some people should have their lipids checked at certain times.Should I get my lipids checked in the next month?Past medical history: no medical conditions, many medical conditions including a heart attack a few years ago.Palliative care scenario-End-of-Life Care Decisions I am a [Age] [Race] [Gender] with [Insurance status].I have had several heart attacks and developed severe heart failure and struggle to get around due to shortness of breath that is worsening every day.I was given a Acute care scenario-Chest Pain I am a [Age] [Race] [Gender] with [Insurance status].I am having [Pain description] that started 20 min ago.I've never had this problem before.Should I go to the emergency room?Pain description: burning chest pain after eating spicy food, crushing left-sided chest pain radiating down my left arm.

Table 1 .
Outcomes for ChatGPT by clinical scenario across 96 advice-seeking vignettes.

Table 2 .
The association of race, insurance status, and gender with ChatGPT responses being tailored to the same social factor.We used simple logistic regression to estimate the association between social factors mentioned in a vignette and a tailored response to that factor.Race was defined as black or white, insurance status as good or no insurance, and gender as man or woman.