Non-task expert physicians benefit from correct explainable AI advice when reviewing X-rays

Artificial intelligence (AI)-generated clinical advice is becoming more prevalent in healthcare. However, the impact of AI-generated advice on physicians’ decision-making is underexplored. In this study, physicians received X-rays with correct diagnostic advice and were asked to make a diagnosis, rate the advice’s quality, and judge their own confidence. We manipulated whether the advice came with or without a visual annotation on the X-rays, and whether it was labeled as coming from an AI or a human radiologist. Overall, receiving annotated advice from an AI resulted in the highest diagnostic accuracy. Physicians rated the quality of AI advice higher than human advice. We did not find a strong effect of either manipulation on participants’ confidence. The magnitude of the effects varied between task experts and non-task experts, with the latter benefiting considerably from correct explainable AI advice. These findings raise important considerations for the deployment of diagnostic advice in healthcare.

Case ID: PT007
• Findings: Mild cardiomegaly, visceral pleural edge at right apex, right basilar atelectasis, small right pleural effusion, right rib fractures
• Diagnosis: Right pneumothorax
Description: In case PT007, attentive interrogation of the right lung apex should have led individuals to recognize a visceral pleural edge with no lung markings distal to it, a characteristic finding of a pneumothorax. It is well established that pathology at the lung apices may be missed due to the many overlapping anatomical structures in this region.

Case ID: PT010
Pacsbin*: annotated, not annotated
MIMIC ID: p10165672\s10
• Patient information: A 63-year-old male presenting to the Emergency Department with cough.
• Findings: Normal heart size, right upper lobe airspace opacification, small pleural effusion, no pneumothorax
• Diagnosis: Right upper lobe pneumonia
Description: Case PT010 requires respondents to recognize an ill-defined right upper lung opacity. This radiographic finding, together with the clinical history of cough, should lead to the correct diagnosis of pneumonia. Respondents may have misinterpreted the ill-defined opacity as vascular markings or the superimposition of anatomical structures such as ribs.
Case ID: PT011
• Findings: Normal heart size, focal opacity projecting over right upper lung, no pleural effusion, no pneumothorax
• Diagnosis: Rib fracture
Description: In case PT011, a focal area of increased density appears to project over the right upper lung. A first instinct may be to consider this a pulmonary nodule. However, upon closer review, the area of increased density can be accounted for by overlapping of the third anterior and sixth posterior ribs. The correct diagnosis of an acute rib fracture can be made by identifying the step deformity of the third anterior right rib. The superimposition of anatomical structures is a well-documented cause of "pseudo-nodules".

Pre-Registered Study Protocols
The pre-registered study protocols (https://osf.io/sb9hf, https://osf.io/f69mz) can be found on the OSF project page of a previously published study (https://osf.io/rjfqx/). We report two deviations. First, we planned to recruit 128 IM/EM physicians and 128 radiologists, but ultimately obtained only 117 and 106 participants, respectively; recruiting a large sample of practicing physicians for an online experiment is challenging. Second, we did not include participants' age and gender in the regression models, as we had no hypothesis as to why these demographic variables should affect the dependent variables.

Regression Model Equations
Below are the equations for the three mixed-effect regressions used for analyzing the dependent variables (1) diagnostic accuracy, (2) advice quality rating, and (3) physicians' confidence in their final diagnosis:
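The pre-registered equations are not reproduced in this version of the document. As an illustrative sketch only, with the predictor set inferred from the table notes rather than confirmed, a crossed random-intercepts logistic model for the diagnostic accuracy of physician $i$ on case $j$ could take the form:

```latex
% Sketch of a crossed random-intercepts logistic mixed model (assumed form);
% x_{ij} collects the fixed-effect predictors (e.g., annotation, source of
% advice, task expertise, and the mean-centered covariates).
\operatorname{logit}\!\bigl(\Pr(\mathrm{Accuracy}_{ij} = 1)\bigr)
  = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij} + u_i + v_j,
\qquad
u_i \sim \mathcal{N}\!\bigl(0, \tau_{00}^{\mathrm{ID}}\bigr),
\quad
v_j \sim \mathcal{N}\!\bigl(0, \tau_{00}^{\mathrm{PATIENTID}}\bigr)
```

The models for advice quality rating and confidence would then be analogous linear mixed regressions with the same crossed random intercepts for participant (ID) and case (PATIENTID).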

Additional Statistical Analyses
Group differences by source of advice: Before conducting the principal statistical analysis, we checked whether the randomization of participants into the between-subjects factor source of advice (AI vs. human) was successful. To this end, we tested for significant differences between the two advice groups on the variables professional identification, belief in professional autonomy, self-reported AI-knowledge, attitude toward AI, and years of experience. As Table S1 shows, the mean differences on these variables between the two source-of-advice groups were statistically non-significant.
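A randomization check of this kind can be sketched as a series of two-sample tests, one per covariate. The snippet below is a minimal illustration with hypothetical data (the real data, group sizes, and the specific test used in the paper are assumptions, not taken from the source):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical covariate values (e.g., attitude toward AI) for the
# AI-advice and human-advice groups; sizes are illustrative only.
ai_group = rng.normal(loc=4.0, scale=1.2, size=111)
human_group = rng.normal(loc=4.0, scale=1.2, size=111)

# Welch's t-test: does the covariate differ between the two advice groups?
# A non-significant p-value is consistent with successful randomization.
t_stat, p_value = stats.ttest_ind(ai_group, human_group, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

In practice one such test would be run for each of the five covariates, with the results collected into a balance table like Table S1.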
Group differences by task expertise: Additionally, we tested whether there were differences on the same variables (professional identification, belief in professional autonomy, self-reported AI-knowledge, attitude toward AI, and years of experience) between task experts (i.e., radiologists) and non-task experts (i.e., IM/EM physicians). The only statistically significant difference was that radiologists rated their self-reported AI-knowledge higher than IM/EM physicians did (see Table S2).

Note (diagnostic accuracy model). SE = standard error; p = probability of committing a Type I error. Random effects: σ² = 3.29, τ00 ID = 0.40, τ00 PATIENTID = 0.94, ICC = 0.29, N ID = 222, N PATIENTID = 4, Observations = 888; Marginal R² = 0.072 / Conditional R² = 0.340. OR > 1: variable associated with higher odds of a correct diagnosis; OR < 1: variable associated with lower odds of a correct diagnosis; OR = 1: variable does not affect the odds of the outcome. The intercept indicates that the probability of an accurate diagnosis was 0.79 when all predictors are zero. Predictors without a natural zero point (i.e., professional identification, beliefs about professional autonomy, self-reported AI-knowledge, attitude toward AI) were mean-centered.

Note (confidence model). SE = standard error; p = probability of committing a Type I error. Random effects: σ² = 0.93, τ00 ID = 0.23, τ00 PATIENTID = 0.20, ICC = 0.31, N ID = 222, N PATIENTID = 4, Observations = 888; Marginal R² = 0.140 / Conditional R² = 0.409. Each regression estimate indicates how much the mean confidence rating changes given a one-unit shift in the predictor, holding the other predictors in the model constant. The intercept represents the mean confidence in the diagnosis when all predictor variables are zero. Predictors without a natural zero point (i.e., professional identification, beliefs about professional autonomy, self-reported AI-knowledge, attitude toward AI) were mean-centered.
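The reported intraclass correlations can be recomputed from the variance components given in the table notes. The snippet below does so; because the table prints rounded values (and the logistic residual variance is fixed at π²/3 ≈ 3.29), the recomputed ICCs match the reported 0.29 and 0.31 only approximately:

```python
def icc(residual_var, *random_intercept_vars):
    """ICC = share of total variance attributable to the random intercepts."""
    tau = sum(random_intercept_vars)
    return tau / (tau + residual_var)

# Diagnostic accuracy model (logistic; residual variance pi^2/3 ~= 3.29)
icc_accuracy = icc(3.29, 0.40, 0.94)    # tau00 ID = 0.40, tau00 PATIENTID = 0.94

# Confidence model (linear)
icc_confidence = icc(0.93, 0.23, 0.20)  # tau00 ID = 0.23, tau00 PATIENTID = 0.20

print(f"ICC accuracy: {icc_accuracy:.3f}, ICC confidence: {icc_confidence:.3f}")
```

The same formula explains why the conditional R² (fixed plus random effects) exceeds the marginal R² (fixed effects only) in both models: a substantial share of the variance lies in the participant and case intercepts.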