Do as AI say: susceptibility in deployment of clinical decision-aids

Artificial intelligence (AI) models for decision support have been developed for clinical settings such as radiology, but little work evaluates the potential impact of such systems. In this study, physicians received chest X-rays and diagnostic advice, some of which was inaccurate, and were asked to evaluate advice quality and make diagnoses. All advice was generated by human experts, but some was labeled as coming from an AI system. As a group, radiologists rated advice as lower quality when it appeared to come from an AI system; physicians with less task-expertise did not. Diagnostic accuracy was significantly worse when participants received inaccurate advice, regardless of the purported source. This work raises important considerations for how advice, AI and non-AI, should be deployed in clinical environments.

Diagnosis Accuracy: As reported in the main article, we then tested whether the correctness of participants' final diagnoses was affected by the source or accuracy of advice. There was a significant main effect of the accuracy of advice both among task experts, χ2(1) = 187.40, p < 0.001, and non-experts, χ2(1) = 115.94, p < 0.001 (see Manuscript Fig. 3a). In neither group did the source of the advice affect participants' performance: task experts: χ2(1) = 1.07, p = 0.301; non-experts: χ2(1) = 1.12, p = 0.289 (see Manuscript Fig. 3b). Again, there was no significant interaction between the accuracy and the source of advice among either the task experts, χ2(1) = 0.728, p = 0.393, or the non-experts, χ2(1) = 2.63, p = 0.105. None of the main effects changed after controlling for the same covariates as above (see Supplementary Table 2). Including the covariates improved the model fit significantly (p = 0.001) among the task experts: higher levels of professional identification and more work experience among the radiologists were associated with higher diagnostic accuracy. Among the non-experts, including the covariates did not improve the model fit significantly (p = 0.614), and no covariate was significantly associated with diagnostic accuracy.
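
To illustrate how such likelihood-ratio χ2 tests can be carried out, the sketch below fits nested logistic regression models in Python with statsmodels and compares them. The data file, column names, and model structure are hypothetical; the models reported above may differ (for instance, they may include random effects for participants or cases and the covariates mentioned in the text).

    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import chi2

    def lr_test(full, reduced):
        # Likelihood-ratio statistic between two nested fitted models.
        stat = 2 * (full.llf - reduced.llf)
        dof = full.df_model - reduced.df_model
        return stat, chi2.sf(stat, dof)

    # Hypothetical long-format data: one row per participant-case, with
    # correct (0/1): final diagnosis correct; accurate (0/1): advice accurate;
    # ai_source (0/1): advice labeled as coming from an AI.
    df = pd.read_csv("diagnoses.csv")

    full = smf.logit("correct ~ accurate + ai_source", data=df).fit(disp=False)
    no_accuracy = smf.logit("correct ~ ai_source", data=df).fit(disp=False)
    no_source = smf.logit("correct ~ accurate", data=df).fit(disp=False)

    print("main effect of advice accuracy:", lr_test(full, no_accuracy))
    print("main effect of advice source:  ", lr_test(full, no_source))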

Recruitment Process and Response Rate
During recruitment, we sent emails to residency directors/coordinators at most institutions in the US and Canada that had residency programs in radiology (183 institutions), internal medicine (IM; 479), and emergency medicine (EM; 238), and asked the directors/coordinators to forward the email to residents and staff in that field. In addition, when physician emails were available on an institution's website, we sent recruitment emails to those physicians directly. This process led to approximately 1850 emails sent, resulting in 425 people opening the link to the Qualtrics survey page. Of these, 361 people met our inclusion criteria, consented to participate, and started to look at cases. Finally, 265 people finished looking at the cases and answered all of the post-survey questions about demographics, attitudes, and professional identity/autonomy; these are the final participants included in the analysis. Given our initial ~1850 recruitment emails, this corresponds to a response rate of about 14.3%.
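
The recruitment funnel and response-rate arithmetic can be summarized in a few lines of Python; the numbers below are taken directly from the text.

    emails_sent   = 1850  # approximate number of recruitment emails
    opened_link   = 425   # opened the Qualtrics survey link
    started_cases = 361   # met inclusion criteria, consented, started cases
    completed     = 265   # finished all cases and post-survey questions

    response_rate = completed / emails_sent
    print(f"Response rate: {response_rate:.1%}")  # -> 14.3%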

DICOM viewer
Participants were able to see the chest x-rays in the Digital Imaging and Communications in Medicine (DICOM) format, which is commonly used for radiologic images and allows grouping multiple images and metadata together in a lossless format. The DICOM files were accessible in a fully functional external DICOM viewer called Pacsbin: "Pacsbin is a platform developed by radiologists as an attempt to make HIPAA compliant radiology teaching cases easier to create, view, and share. Purpose: The primary goal of Pacsbin is to bring a fully-featured PACS environment to the web for teaching cases and research. Modern web technologies, in particular Javascript, HTML5, and CSS3 have enabled the creation of fast, highly functional radiology imaging programs in the browser. Pacsbin leverages the fantastic open source Cornerstone library from Chris Hafey, adding education specific tools, an anonymization pathway, and a way to organize your cases and save them forever. Features: Full DICOM images, with support for the most common image manipulation tools. Editing interface for creators of case content, including annotations and case notes. Create links from case notes to specific images and window/level settings, allowing a guided tour through pertinent imaging findings. Automated anonymization pipeline, using DICOM standard anonymization techniques. Upload images directly, or link to an enterprise PACS for single click case creation." (Information from https://www.pacsbin.com/docs/about)
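
As a concrete illustration of the DICOM format itself (not of Pacsbin, which runs in the browser), here is a minimal sketch of reading a DICOM file with the pydicom library in Python, showing how pixel data and metadata travel together in the same file. The file name is hypothetical.

    import pydicom

    ds = pydicom.dcmread("chest_xray.dcm")  # hypothetical local file

    # Metadata elements are stored in the same file as the image itself.
    print(ds.get("Modality", "<missing>"))          # e.g., "CR" or "DX" for chest x-rays
    print(ds.get("StudyDescription", "<missing>"))  # free-text study description, if present
    print(ds.get("Rows", "?"), ds.get("Columns", "?"))  # image dimensions

    # Pixel data decodes to a NumPy array (losslessly for lossless transfer syntaxes).
    pixels = ds.pixel_array
    print(pixels.shape, pixels.dtype)
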
• Belief in professional autonomy: A four-item scale (e.g., "Individual physicians should make their own decisions in regard to what is to be done in their work." [2]) answered on a 7-point Likert scale from 1 (strongly disagree) to 7 (strongly agree); Cronbach's α = 0.68 (a sketch of the α computation appears after this list).
• Self-reported AI knowledge: "How would you consider your own general knowledge of artificial intelligence (AI)?" (I have no knowledge; Novice: I have heard of AI; Intermediate: I have read media articles or have listened to news about AI technologies; Advanced: I have used AI-based tools and have some understanding of how they work; Expert: For example, I am an academic or industry researcher in AI).
• Attitude toward AI: A three-item scale ("How much do you agree with the following statements? - AI will make most people's lives better."; "How much do you agree with the following statements? - AI is dangerous to society."; "How much do you agree with the following statements? - AI poses a threat to my career.") answered on a 7-point Likert scale from 1 (strongly disagree) to 7 (strongly agree); Cronbach's α = 0.57.
• Years of experience: "How many years of experience do you have as a physician (starting from your first year of residency)?" (open answer format).
• Gender: "What is your gender?" (male, female, other, prefer not to answer).
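
Both multi-item scales above report Cronbach's α as a measure of internal consistency. For reference, a minimal sketch of the standard formula, α = k/(k-1) * (1 - Σ var(item_i) / var(total)), in Python with NumPy; the example responses are toy values for illustration only, not study data.

    import numpy as np

    def cronbach_alpha(responses: np.ndarray) -> float:
        # responses: (n_respondents x k_items) array of Likert scores.
        k = responses.shape[1]                         # number of items
        item_vars = responses.var(axis=0, ddof=1)      # per-item sample variances
        total_var = responses.sum(axis=1).var(ddof=1)  # variance of the summed scale
        return k / (k - 1) * (1 - item_vars.sum() / total_var)

    # Toy usage: 5 respondents answering a 4-item scale on a 7-point Likert scale.
    responses = np.array([
        [5, 6, 5, 4],
        [3, 3, 4, 3],
        [6, 7, 6, 6],
        [2, 3, 2, 3],
        [4, 4, 5, 4],
    ])
    print(round(cronbach_alpha(responses), 2))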

Supplementary Note 3. Patient Case Information
• Case 1 is a normal chest x-ray. The potential pitfall is misinterpreting the left breast shadow as pneumonia.
• Case 2 is an uncommon fracture-dislocation injury. A systematic search pattern in interpreting chest x-rays should include the sternoclavicular joint. Less experienced physicians may be unaware of this injury and misinterpret the fracture fragment as a pleural plaque, which is a more commonly encountered finding.
• Case 3 illustrates the importance of adjusting window levels and width when assessing the retrocardiac space. Normally the retrocardiac region contains traversing pulmonary vessels that taper peripherally. Careful scrutiny of Case 3 reveals an air and soft tissue density mass in the retrocardiac region. This finding is characteristic of a hiatal hernia, a benign but potentially symptomatic entity. The cardiac silhouette on frontal chest radiographs can decrease the conspicuity of important pathology that can occur in the retrocardiac space, such as lung cancers, enlarged lymph nodes, and aortic aneurysms.
• In Case 4, attentive interrogation of the right lung apex should have led individuals to recognize a visceral pleural edge with no distal lung markings, which are characteristic findings of a pneumothorax. It has been well established that pathology at the lung apices may be missed due to the many overlapping anatomical structures in this region (25,26).
• Case 5 requires respondents to recognize an ill-defined right upper lung opacity. This radiographic finding and the clinical history of cough should lead to the correct diagnosis of pneumonia. Respondents may have misinterpreted the ill-defined opacity as vascular markings or the superimposition of anatomical structures such as ribs.
• In Case 6, a focal area of increased density appears to project over the right upper lung. A first instinct may be to consider this a pulmonary nodule (which was provided as the incorrect advice in the experiment). However, upon closer review, the area of increased density can be accounted for by the overlapping of the third anterior and sixth posterior ribs. The correct diagnosis of an acute rib fracture can be made by identifying the step deformity of the third anterior right rib. The superimposition of anatomical structures is a well-documented cause of "pseudo-nodules" (27).
• Case 7 requires respondents to integrate the clinical history and multiple radiographic findings to arrive at the correct diagnosis of pulmonary edema.

We show the individual performance of radiologists and IM/EM physicians, sorted in increasing order by the number of cases they correctly diagnosed and split up by whether they received advice labeled as coming from an AI or a human source. Each physician's individual performance is split up by their performance on cases with accurate advice (the lower, blue part of the bar) and inaccurate advice (the upper, red part of the bar). We further indicate Critical Performers, who always recognize inaccurate advice, and Susceptible Performers, who never do. While the distribution of performance levels is quite different across expertise groups, it does not differ much by the source of advice within an expertise group.

Note. β = estimated coefficient; SE = standard error; z = z-value; p = probability of committing a Type I error; IM = internal medicine; EM = emergency medicine.

Note. The values in the "Link to Pacsbin" column are links to view the x-rays in a web-based DICOM viewer. The x-rays are better viewed in this way, since they lose quality when converted to JPG files and made smaller to fit on a page.