Diagnostic decisions of specialist optometrists exposed to ambiguous deep-learning outputs

Artificial intelligence (AI) has great potential in ophthalmology. We investigated how ambiguous outputs from an AI diagnostic support system (AI-DSS) affected diagnostic responses from optometrists when assessing cases of suspected retinal disease. Thirty optometrists (15 more experienced, 15 less) assessed 30 clinical cases. For ten, participants saw an optical coherence tomography (OCT) scan, basic clinical information and retinal photography (‘no AI’). For another ten, they were also given AI-generated OCT-based probabilistic diagnoses (‘AI diagnosis’); and for the remaining ten, both the AI diagnosis and AI-generated OCT segmentations (‘AI diagnosis + segmentation’) were provided. Cases were matched across the three types of presentation and were selected to include 40% ambiguous and 20% incorrect AI outputs. Optometrist diagnostic agreement with the predefined reference standard was lowest for ‘AI diagnosis + segmentation’ (204/300, 68%) compared to ‘AI diagnosis’ (224/300, 75%, p = 0.010) and ‘no AI’ (242/300, 81%, p < 0.001). Agreement with AI diagnosis consistent with the reference standard decreased (174/210 vs 199/210, p = 0.003), but participants trusted the AI more (p = 0.029) with segmentations. Practitioner experience did not affect diagnostic responses (p = 0.24). More experienced participants were more confident (p = 0.012) and trusted the AI less (p = 0.038). Our findings also highlight issues around reference standard definition.

their level of experience in medical retina (MR), which was used as a surrogate for their familiarity with interpreting retinal OCT scans. The group allocation criteria are displayed in Supplementary Figure 1. If a participant was currently working in an MR clinic, and had been there for more than 1 year, they were allocated to the more experienced group. All others were allocated to the less experienced group, including those who had never worked in MR, those who had not worked in MR in the past year, and those who had worked in MR for less than a year. This time period was decided with a consultant optometrist specialising in MR, as most optometrists work in MR for only 1 or 2 sessions per week and require supervision for roughly the first 4-6 months. In addition, OCT interpretation skills are likely to have degraded in those who have not worked in the clinic for over a year. It is acknowledged that this does not provide a distinct divide between the more and less experienced groups, as optometrists may also have some knowledge of retinal OCT scans from outside MR clinics. However, these classification rules were chosen as a reasonable measure of level of experience.
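The allocation rule described above can be sketched as a small function. This is an illustrative reading of the text only, not the study's own tooling: the `MRHistory` record and its field names are hypothetical, and the handling of exactly one year of MR experience (treated here as less experienced, since the text says "more than 1 year") is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MRHistory:
    """Hypothetical record of a participant's medical retina (MR) history."""
    currently_in_mr_clinic: bool  # working in an MR clinic at the time of the study
    years_in_current_mr_role: float  # 0.0 if the participant has never worked in MR


def allocate_group(history: MRHistory) -> str:
    """Return the experience group implied by the allocation rule in the text."""
    # More experienced: currently in an MR clinic AND there for more than 1 year.
    if history.currently_in_mr_clinic and history.years_in_current_mr_role > 1:
        return "more experienced"
    # Everyone else: never worked in MR, not in MR during the past year,
    # or in MR for less than a year.
    return "less experienced"
```

For example, a participant who left MR two years ago falls in the less experienced group regardless of how long they previously worked there, because the rule requires current MR work.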

Participant Training
Clear instructions were provided for how to navigate through the survey and how to clearly view the OCT volume scans prior to any study cases being presented. Participants were shown an example of an AI segmentation map along with the diagnosis probability percentages. This example was annotated with each aspect clearly explained. If the participant indicated that they were still unclear about what the AI segmentation and outputs represented, they were unable to complete the study at that point and were encouraged to contact the study investigator. All 30 participants indicated that they understood what the AI displayed. No information was given about the algorithms' diagnostic accuracy.

Participant Training - Segmentation Overlays
The following was shown to all participants during the training phase of the study:
You will also be provided with 'segmentation maps' produced using artificial intelligence (AI) algorithms. These maps display identified features within the OCT scan (for example intraretinal fluid (IRF)). Segmentations are presented as overlays, covering the OCT scan. If a specific feature is identified, it is colour coded, based on a key that will be provided to

Participant Training - AI Diagnostic Outputs
The following was shown to all participants during the training phase of the study:
You will also be provided with bar charts, presenting the output from an algorithm designed to suggest the most probable diagnosis as well as a referral suggestion. This algorithm uses the results from the OCT segmentation maps to determine the most likely diagnosis or pathology present. The following image is an example of how this output will be presented:
Each percentage is out of 100 and is the algorithm's output probability of each diagnosis being present. This example demonstrates a 97.08% probability that the OCT scan is normal.
The percentage for each diagnosis can be between 0-100%. The presence of each condition is assessed independently of the other diagnoses.
The AI may not always be as confident in its diagnosis. For example, consider the AI predicted a diagnosis of CNV with a value of 55% probability, but at the same time also predicted the diagnosis was MRO with 55%. For the two conditions considered independently, the AI predicts the same probability that both are present.
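Because each diagnosis is scored independently, the percentages need not sum to 100, and two diagnoses can tie for the highest value. The sketch below illustrates this with the CNV/MRO example from the training text. The "Normal" value, the `top_diagnoses` helper and its `margin` parameter are hypothetical, and this is not the definition of an "ambiguous" output used to select study cases.

```python
def top_diagnoses(probabilities, margin=0.0):
    """Return the diagnoses whose percentage is within `margin` of the highest.

    `probabilities` maps a diagnosis name to an independent percentage (0-100).
    More than one returned name means the output has no single leading diagnosis.
    """
    best = max(probabilities.values())
    return sorted(name for name, p in probabilities.items() if best - p <= margin)


# CNV and MRO values come from the training example; "Normal" is made up.
example = {"CNV": 55.0, "MRO": 55.0, "Normal": 3.0}

# Independent per-diagnosis scores are not a distribution, so the total may exceed 100.
total = sum(example.values())
leaders = top_diagnoses(example)
```

Here `leaders` contains both CNV and MRO, and `total` is 113, which would be impossible if the percentages were mutually exclusive class probabilities.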

Statistical Methods
As our data did not meet the ANOVA assumptions, we used non-parametric tests for analysis. In particular, we used the Aligned Rank Transform (ART) for factorial data to assess the presence of interactions between N different factors. ART relies on a pre-processing step that aligns the data before applying averaged ranks. After this step, common ANOVA and post-hoc analysis can be performed. By carrying out the pre-processing step, ART can be used in the same circumstances as the parametric ANOVA, even when the dependent variable is continuous or ordinal and not normally distributed. Appropriate post-hoc statistical comparisons with Bonferroni correction were used when ANOVA p values were significant between three or more groups.
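The align-then-rank pre-processing step can be sketched as follows for a simplified two-factor design. This is a minimal illustration under stated assumptions, not the analysis code used in the study: it handles only two between-subject factors (the study had more factors and repeated measures), the function names are hypothetical, and the subsequent ANOVA on the ranks is not shown. For each effect of interest, the response is stripped of all effects by subtracting its cell mean, the estimated effect of interest is added back, and the aligned values are ranked with average ranks for ties.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def aligned_ranks(rows, effect):
    """Align and rank responses for one effect in a two-factor design.

    rows: list of (level_a, level_b, response) tuples.
    effect: "A", "B" or "AxB" (the effect to be tested on the returned ranks).
    """
    grand = mean([y for _, _, y in rows])
    by_a, by_b, by_cell = defaultdict(list), defaultdict(list), defaultdict(list)
    for a, b, y in rows:
        by_a[a].append(y)
        by_b[b].append(y)
        by_cell[(a, b)].append(y)
    mean_a = {k: mean(v) for k, v in by_a.items()}
    mean_b = {k: mean(v) for k, v in by_b.items()}
    mean_cell = {k: mean(v) for k, v in by_cell.items()}

    aligned = []
    for a, b, y in rows:
        residual = y - mean_cell[(a, b)]  # removes all effects
        if effect == "A":
            estimate = mean_a[a] - grand
        elif effect == "B":
            estimate = mean_b[b] - grand
        elif effect == "AxB":
            estimate = mean_cell[(a, b)] - mean_a[a] - mean_b[b] + grand
        else:
            raise ValueError("effect must be 'A', 'B' or 'AxB'")
        # Rounding guards against floating-point noise breaking genuine ties.
        aligned.append(round(residual + estimate, 9))
    return average_ranks(aligned)
```

A separate set of aligned ranks is produced for each main effect and interaction, and in the ANOVA run on each set only the effect it was aligned for is interpreted.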

Supplementary Exploratory Analysis
After running the analysis reported in the paper, we noticed that three cases across the conditions (n=1 'no AI', n=1 'AI diagnosis' and n=1 'AI diagnosis + segmentation') displayed very subtle epiretinal membranes on OCT imaging that would not be considered clinically significant. To assess whether our results for diagnostic accuracy and agreement with AI were significantly affected by these three cases, we repeated the analysis excluding them. Thus, for each of the three case presentation formats, 270 diagnostic responses were assessed. An ANOVA with ART adjustment revealed significant differences in correct responses for the same factors as the original analysis; there was a significant difference across the three presentation formats (p<0.001) (Supplementary Table 1). A significant effect of the order of case presentation was again found (p=0.007). There was no significant effect of experience on the number of correct responses. When testing interactions between factors, a significant interaction between order and presentation ('no AI', 'AI diagnosis', 'AI diagnosis + segmentation') was found (p=0.006). All other interactions showed no significant effect.
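The figure of 270 responses per format follows from the study design: 30 participants each assessed 10 cases per presentation format (300 responses), and one case was excluded from each format. A short arithmetic check, with hypothetical variable and function names:

```python
PARTICIPANTS = 30          # optometrists in the study
CASES_PER_FORMAT = 10      # cases shown in each presentation format
EXCLUDED_PER_FORMAT = 1    # subtle epiretinal membrane case removed per format


def responses_per_format(participants, cases, excluded=0):
    """Number of diagnostic responses analysed for one presentation format."""
    return participants * (cases - excluded)


original = responses_per_format(PARTICIPANTS, CASES_PER_FORMAT)
exploratory = responses_per_format(PARTICIPANTS, CASES_PER_FORMAT, EXCLUDED_PER_FORMAT)
```

`original` is 300, matching the denominators reported in the main analysis, and `exploratory` is 270.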

Effect of presentation
The participants' responses were divided into 3 classes, based on the presentation of
This change from significant to non-significant is likely due to the smaller sample size creating less statistical power, as the difference in correct responses between these two conditions changed by just one response in the new analysis.

Participants' level of agreement with AI
We also assessed whether excluding the three cases affected the results for agreement with AI outputs with (AI diagnosis + segmentation) or without (AI diagnosis) segmentation overlays. The results again matched the original analysis whereby there was a significant effect of presentation format (p=0.006) (Supplementary

PED = Pigment epithelial detachment) In this example, the segmentation has identified numerous large pockets of intra-retinal fluid. It has also identified a fibrovascular PED and sub-retinal hyper-reflective material. Other colour coded areas represent anatomical structures. The results displayed in this segmentation map are then used by a separate AI algorithm to determine a suggested probable diagnosis.

Supplementary Table 1: Results from ANOVA testing on number of correct diagnoses. ANOVA performed on results using aligned rank transform (ART). Results for factors 1-3 represent the effect of a single factor on diagnosis. Results for factors 4-7 represent the effect of two or more factors interacting. Values in bold represent statistically significant results.
* p values considered statistically significant

Supplementary Table 2: Results from ANOVA testing on number of responses in agreement with AI outputs. ANOVA performed on results using aligned rank transform (ART). Results for factors 1-3 represent the effect of a single factor on diagnosis. Results for factors 4-7 represent the effect of two or more factors interacting. Values in bold represent statistically significant results.
* p values considered statistically significant