Blinded, randomized trial of sonographer versus AI cardiac function assessment

Artificial intelligence (AI) has been developed for echocardiography1–3, although it has not yet been tested with blinding and randomization. Here we designed a blinded, randomized non-inferiority clinical trial (ClinicalTrials.gov ID: NCT05140642; no outside funding) of AI versus sonographer initial assessment of left ventricular ejection fraction (LVEF) to evaluate the impact of AI in the interpretation workflow. The primary end point was the change in the LVEF between initial AI or sonographer assessment and final cardiologist assessment, evaluated by the proportion of studies with substantial change (more than 5% change). From 3,769 echocardiographic studies screened, 274 studies were excluded owing to poor image quality. The proportion of studies substantially changed was 16.8% in the AI group and 27.2% in the sonographer group (difference of −10.4%, 95% confidence interval: −13.2% to −7.7%, P < 0.001 for non-inferiority, P < 0.001 for superiority). The mean absolute difference between final cardiologist assessment and independent previous cardiologist assessment was 6.29% in the AI group and 7.23% in the sonographer group (difference of −0.96%, 95% confidence interval: −1.34% to −0.54%, P < 0.001 for superiority). The AI-guided workflow saved time for both sonographers and cardiologists, and cardiologists were not able to distinguish between the initial assessments by AI versus the sonographer (blinding index of 0.088). For patients undergoing echocardiographic quantification of cardiac function, initial assessment of LVEF by AI was non-inferior to assessment by sonographers.
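For a non-expert reader, the reported 95% confidence interval for the difference in proportions can be reproduced approximately with a standard Wald interval. The sketch below is illustrative only: the per-arm sample sizes are an assumption (an even split of the 3,495 analyzed studies), not the trial's actual arm counts.

```python
import math

def diff_proportion_ci(p1, n1, p2, n2, z=1.959964):
    """Wald 95% CI for the difference of two independent proportions.
    For a non-inferiority margin m, non-inferiority holds if the
    upper bound of the CI lies below m."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Proportions of substantially changed studies from the abstract;
# per-arm sizes are an assumption (roughly half of 3,495 each).
diff, lo, hi = diff_proportion_ci(0.168, 1747, 0.272, 1748)
print(f"{diff:.3f} ({lo:.3f} to {hi:.3f})")
```

With these assumed arm sizes the interval comes out close to the reported −13.2% to −7.7%, which is why the abstract can claim both non-inferiority and superiority: the entire interval sits below zero.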


We appreciate the valuable feedback from the Editor and Reviewers. We detail below the changes to our manuscript that we have made in response to the helpful comments and suggestions provided. We believe the manuscript is substantially improved as a result.

Comments from the Reviewer 1
This is a retrospective, single-center, randomized study that compared AI vs sonographer determined ejection fraction and how that impacted the cardiologist's final reporting of this metric. The AI led to less change of the cardiologist's assessment and saved ~2 minutes compared with the sonographer's determination. This non-inferiority and superiority of the primary endpoint essentially validates the AI tool for initial screening of ejection fraction, which has been done via other means without this trial construct, such as the NPJ Digital Medicine 2020 report on "Deep learning interpretation of echocardiograms" by Zhou and colleagues, a co-author of this paper.

Reply:
We appreciate the thoughtful comments provided by the Reviewer. We would like to clarify that this was a randomized, blinded trial that involved 25 sonographers and 10 cardiologists with pre-specified endpoints (available online prior to trial initiation: https://clinicaltrials.gov/ct2/show/NCT05140642). Specifically, this trial was presented as a late-breaking clinical trial at the European Society of Cardiology 2022 meeting (https://www.escardio.org/The-ESC/Press-Office/Press-releases/Artificial-intelligence-assessment-of-heart-function-is-superior-to-sonographer-assessment), highlighting its reception as a blinded, randomized clinical trial.
An important aspect the reviewer brings up is that the patient videos were collected originally from 2019 while the sonographer re-evaluation and cardiologist evaluation was done in 2022.
From the results: 3,769 transthoracic echocardiogram studies originally performed at an academic medical center between June 1, 2019 and August 8, 2019 were prospectively re-evaluated by twenty-five cardiac sonographers (mean of 14.1 years of practice) and ten cardiologists (mean of 12.7 years of practice).
This is an important feature of our study, and we hope to clarify that it was mandated by our IRB: the review board considered AI technology investigational and did not allow the study team to incorporate contemporaneous echocardiograms. We were asked to use previously collected echocardiograms so that we could have the safety endpoint (the same images are evaluated by cardiologists twice, and the sonographer arm constitutes the largest test-retest of human clinician performance) in anticipation of future clinical deployment. However, even then, both sonographer and cardiologist assessments were done prospectively and within the pre-specified trial set-up. We do not expect the results would change significantly if the scanning had been contemporaneous, as assessments in clinical practice are independent and stepwise (the way it was done in the trial), and in this trial, assessment of LVEF after scanning by both sonographer and cardiologist was done prospectively and in a pre-specified fashion.
We recognize the confusion with the descriptors prospective and retrospective, and to help clarify the manuscript, we removed references to prospective and focus only on the trial protocol. In addition to the particular question and technology studied, we think there is interest in the trial design with regard to randomization and blinding. In particular, it is difficult to blind AI technologies, and prior trials of AI do not use an active comparator (comparing only with nothing). For our trial, to blind and randomize with sonographers (the active comparator), we had to build substantial infrastructure to embed the AI system within the clinical reporting system. This is in contrast with our prior work with purely in silico validation published in Nature 2020 and npj Digital Medicine, which was not blinded or randomized. Even in other prospective AI trials, there is no similar ability to blind, and as such, randomization is often cluster-randomized by site rather than by individual study.
We clarify that we are not claiming to be the first randomized trial of AI technology in cardiology, but the first blinded randomized trial of AI technology in cardiology. This is worthy of interest, as blinding minimizes participant bias and allows for the introduction of an active comparator. EchoNet-RCT is the first trial in which the AI technology is compared head-to-head with a clinician, while prior trials compared with the absence of interpretation. We clarify and reframe the discussion to describe this point. In the discussion: To our knowledge, this study represents the first blinded and randomized trial of AI technology applied to clinical cardiology.

Has been changed to: While not the first trial of AI technology in clinical cardiology 22 [citation of the Yao et al and Noseworthy et al clinical trials], to our knowledge, this study represents the first blinded implementation of a randomized trial in this space.
There are some grand claims that need to be toned down. This is certainly not the first randomized trial of AI technology applied to clinical cardiology. Beyond the fact that it is retrospective, there are prospective randomized trials such as the Mayo Clinic ECG (Yao et al, 2001) and others in acute heart failure, blood pressure management (reviewed in Plana, JAMA Network Open, Sept 2022). Moreover, a retrospective in silico study is hardly a randomized, real-world assessment of AI. The authors emphasize the "blinding" aspect, but that would be far more important in a prospective trial.
Reply: Thank you for the important comments provided by the Reviewer and the opportunity to revise our paper to reflect important precedents and context for our trial. In this revision, we clarify that we are not claiming to be the first randomized trial of AI technology in cardiology, but the first blinded randomized trial of AI technology in cardiology. As the reviewer mentions, blinding is an important aspect of clinical trials (one that is particularly hard to achieve for diagnostic tools, as it is hard to find a fair active comparator). In both studies described by the reviewer, blood pressure management and ECG screening of low EF, the interventions were open label and unblinded, which can introduce bias, as study participants might have either favorable or unfavorable impressions of the intervention.
With the introduction of blinding, trials are able to introduce an active comparator (for this trial, the comparison with human sonographer annotation). Without an active comparator, it is difficult to assess whether the effect of the intervention is due to 1) the AI technology, 2) the act of being observed (Hawthorne effect), or 3) the result of more intensive follow-up (for example, lead time bias). For example, our prior work showed that simply interacting with a system known to be an AI system changes the behavior of clinicians (https://arxiv.org/abs/2107.07015). By introducing blinding, the relative effect of being in a trial versus the actual AI technology can be better teased out.
In the discussion: This was a randomized, blinded trial that prospectively involved 25 sonographers and 10 cardiologists with pre-specified endpoints, and informed consent was obtained from all clinicians.
In particular, we think the design is interesting with the use of an active comparator (the sonographer) which is often not done in prospective trials of AI. With the use of an active comparator, we are able to do study-level randomization (in contrast to site-level cluster randomization which needs to be done when there is no active comparator).
Many echo labs do not have the sonographer compute ejection fraction and that metric is solely read out by the cardiologist. This practical point is not commented upon in the paper. For that common practice, a direct comparison of AI with cardiologist assessment of EF would be more meaningful.
Reply: We appreciate the thoughtful comments provided by the Reviewer. Across the world, there are different practice set-ups, and in Europe and Asia, the cardiologist often evaluates ejection fraction alone, without an aid. We particularly chose the design of the American model (where sonographers initially interpret and cardiologists finalize) because it facilitated blinding and randomization, since there are two independent points of expert clinician contact, and it allows for comparison with an active comparator. Prior trials in AI only compare with the lack of assistance or "standard of care", but the introduction of an active comparator provides more insight into the impact of the AI technology than open-label trials.
We sought a strong benchmark of comparison, using a group of highly experienced sonographers as a high bar to compare against. We would imagine that less experienced sonographers or a stronger cardiologist comparator could change the difference between AI and non-AI assistance. In this study, the sonographers had an average of 14 years of experience, by which point many clinicians would say they have become quite good at assessing LVEF; however, we should recognize the heterogeneity and variation that can occur by provider.
In the discussion: In addition to prospectively evaluating the impact of AI in a clinical trial, our study represents the largest test-retest assessment to date of clinician variability in assessing LVEF. The degree of human variability between repeated LVEF assessments in our study is consistent with prior studies, 8,9,21 and the introduction of AI guidance decreased variance between independent clinician assessments. In this trial, we utilized experienced sonographers as an active comparator versus the AI for the initial assessment of LVEF; different levels of experience and types of training can change the relative impact of AI compared to clinician judgement.
To evaluate heterogeneity among providers, in our supplemental results, we describe how much variation there is for individual clinicians, highlighting the reviewer's important point about the choice of clinician and evaluation.
Supplemental Figure 3: Performance of each individual sonographer vs. AI initial assessment compared to historical assessment. Boxplot of interquartile range (IQR) and median. Whiskers truncated beyond 20% difference.

Comments from the Reviewer 2
The paper clearly presents the results of a clinical trial to assess the non-inferiority of an AI-based method for initial assessment of left ventricular ejection fraction, one step to facilitate the work of cardiologists in interpreting an echocardiography study. The AI method was previously developed and published. The work is novel and an important step towards validating existing technology for use in the clinical space.
Reply: Thank you for your review of our trial and we appreciate your feedback on how to improve the presentation of the results. We think there is particular value in evaluating AI technologies in blinded and randomized fashion.

The size of the experiment is appropriate, and the quality of the presentation is good.
There is a concern regarding the term prospectively re-evaluated used in the Results. I would define this study, from the title, as a retrospective study as no new data is collected with the study. Once this is clearly stated, the term "prospectively reevaluated" will be clear and correct in that context.
Reply: Thank you for your review of our trial and we appreciate your feedback. An important aspect the reviewer brings up is that the patient videos were collected originally from 2019 while the sonographer re-evaluation and cardiologist evaluation was done in 2022.

From the results: 3,769 transthoracic echocardiogram studies originally performed at an academic medical center between June 1, 2019 and August 8, 2019 were prospectively re-evaluated by twenty-five cardiac sonographers (mean of 14.1 years of practice) and ten cardiologists (mean of 12.7 years of practice).
This is an important feature of our study (in fact, mandated by our IRB, as the review board considered AI technology investigational and did not allow the study team to incorporate contemporaneous echocardiograms); however, both sonographer and cardiologist assessments were done prospectively and within the pre-specified trial set-up. Notably, this allows us to have the safety endpoint (since the same images are evaluated by cardiologists twice, and the sonographer arm constitutes the largest test-retest of human clinician performance).
We recognize the confusion with the descriptors prospective and retrospective, and to help clarify the manuscript, we removed references to prospective and focus only on the trial protocol.

In the discussion: Several limitations of our trial should be mentioned. First, our study was single center, reflecting the demographics and clinical practices of a particular population. […] Second, the study was not powered to assess long term outcomes based on differences in LVEF assessment. […]. Third, this trial used previously acquired echocardiogram studies, and while prospectively evaluated by sonographers and cardiologists, there can be bias when a different sonographer than the scanning sonographer interprets the images.
We also changed the title to highlight the use of clinical echocardiograms, reflecting the difference between the prospective assessment and the historically obtained imaging.

Assessment of Cardiac Function in Clinically Acquired Echocardiograms
No major concern on the standard statistical methods applied.

Reply: Thank you for your feedback.
The impact of this paper should be clarified. While this is an important step towards validating this AI technique, the actual AI engine has already been presented. This study is not the final step, since it does not demonstrate the utility of this AI technique in a real clinical setting prospectively. Thus, the impact of this publication may be limited.
Reply: Thank you for your feedback; we certainly agree with the challenges presented given the evolving landscape of evaluating AI technologies. For our trial, we had initially proposed an entirely prospective evaluation; however, the IRB recommended revision because of concern over deployment of an entirely new technology and asked us to consider the current design, which allows us to evaluate the safety of the technology (as well as assess human clinician test-retest variation). We see this as a necessary intermediate step, as many medical centers are not yet comfortable with the idea of AI technology, even with clinician oversight.
For our trial, to blind and randomize with sonographers (the active comparator), we had to build substantial infrastructure to embed the AI system within the clinical reporting system. This is something we hope is worthy of interest to the general AI and cardiology audience and an additional necessary step for final deployment. As an aside, we note that after we presented the integration at ESC, clinical software system vendors approached the study team to ask about integration of the AI technology, because this is an area of active interest and a necessary consideration prior to deployment. Given the combination of regulatory as well as technical deployment challenges, we hope our study is a step towards ultimate deployment of AI technology in cardiology, and we seek to clarify this in the revised discussion.

Another concern, to be better discussed, regards assessment of blindness: although the cardiologists were not always able to determine whether the initial assessment was done by AI or a sonographer, they were at least more likely to be correct than wrong (not perfectly blind to it). Not a big concern, but this should be better discussed in the Discussion.
Reply: Thank you for the feedback and the opportunity to improve the presentation of the results. Blinding is an important aspect of a randomized trial, and we placed particular emphasis on maximizing blinding (by showing the same proportion of similar types of annotation and displaying them in the clinical software system to minimize variation from sonographer annotations). Additionally, we note that there are no prior blinded clinical trials of AI technology, and many prospective clinical trials of even therapeutics do not formally assess blinding (e.g., in vaccine trials, the presence of side effects such as a sore arm likely creates a small degree of unblinding that could bias the participants).
Our blinding index was between -0.2 and 0.2, which is typically considered good blinding (Bang et al., Control Clin Trials, 2004) and within the range of statistical noise if guesses were made randomly. For example, even when guessing heads or tails, one will not always be incorrect, and there is a statistically acceptable range of correct guesses that can occur from random sampling and noise alone.
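The blinding index referenced above can be computed directly from guess counts. The sketch below is a minimal implementation of the Bang index for a single trial arm; the counts used are hypothetical, not the trial's actual guess tallies.

```python
def bang_blinding_index(correct, incorrect, dont_know):
    """Bang blinding index for one trial arm (Bang et al.,
    Control Clin Trials, 2004). Ranges from -1 to 1; 0 indicates
    guessing consistent with chance (ideal blinding), and values
    within roughly +/-0.2 are commonly read as adequate blinding."""
    n = correct + incorrect + dont_know
    guessers = correct + incorrect
    if guessers == 0:
        return 0.0
    # (2 * P(correct | guessed) - 1), weighted by the fraction who guessed
    return (2 * correct / guessers - 1) * (guessers / n)

# Hypothetical counts: 120 correct guesses, 95 incorrect,
# and 85 "don't know" responses out of 300 assessments.
bi = bang_blinding_index(120, 95, 85)
print(round(bi, 3))
```

Note that the weighted form simplifies to (correct − incorrect) / n, which makes clear why a slight excess of correct guesses yields a small positive index rather than evidence of broken blinding.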
We believe there is particular value in evaluating AI technologies in a blinded and randomized fashion, and we discuss this in more detail in the revised discussion: In the discussion: By integrating the AI into the reporting software, our study sought to minimize bias in assessing the effect size of the AI intervention. To enable effective blinding, we implemented a single cardiac cycle annotation workflow representative of many real-world high-volume echocardiography laboratories. Despite this framework, there was a small signal for cardiologists to be more likely to be correct than incorrect in guessing the agent of initial assessment. However, the blinding index is within the range typically described as good blinding, and regardless of whether the cardiologist thought the initial agent was AI, a sonographer, or was uncertain, the results trended towards improved performance in the AI arm.
Additionally, in new analyses, we show that irrespective of whether the cardiologists correctly or incorrectly guessed the initial agent of interpretation, the trend was towards improved performance by AI. These subset analyses are new/ad hoc, so not powered for significance, but they trend in the same direction and show minimal heterogeneity.
Similarly, the anchor effect seems to be strong from the presented results, and given the non-perfect blindness, this may be a concern in interpreting the results. At least a deeper discussion on this point would be needed.
Reply: Thank you for your review of our trial; we appreciate the feedback. We recognize that there is a small amount of signal that could result from imperfect blinding, but it is within the range of possible statistical noise. Even in this setting, we note that there was no significant difference between subgroups by whether the cardiologist thought the assessment was by AI, a sonographer, or was uncertain, and all groups trended towards improved performance in the AI arm (although not statistically powered in the subgroup analysis), as well as by whether the cardiologist was correct or incorrect in guessing the initial annotator.
Additionally, the results trended in the same direction with anchoring (the primary result gave the cardiologist the initial interpretation) as without anchoring (the key safety endpoint compared with the standard clinical measurement, which would not have anchoring from AI assistance). Human clinician variation, and anchoring in particular, is an important issue, often understudied in clinical research, to the degree that there are even limited studies of variance in unanchored human performance. Our study (by comparing just the cardiologist assessment in the sonographer arm with the historical cardiologist assessment) can potentially be useful to the cardiology literature as the largest clinician test-retest evaluation of LVEF. We seek to discuss these important points more broadly in the discussion and are open to feedback. In the discussion: In this trial, we utilized experienced sonographers as an active comparator versus the AI for the initial assessment of LVEF; different levels of experience and types of training can change the relative impact of AI compared to clinician judgement. The smaller difference between final and initial assessment seen in this study for both methods of initial assessment, compared to the difference between final and prior cardiologist assessment, highlights the anchoring effect of an initial assessment in practice, and the importance of blinding for quantifying effect size in clinical trials of diagnostic imaging. In both the anchored outcome (comparison of preliminary to final assessment) and the independent outcome (comparison of final assessment in the trial versus historical cardiologist assessment), the AI arm showed less variation and more precision in the assessment of LVEF.
Additionally, we further discuss inter-clinician variability and provide discussion that we observed no differences in model performance by image quality, inpatient vs. outpatient status, or single plane vs. biplane assessment. To clarify these findings, we include additional analyses demonstrating that the AI algorithm's median absolute difference from historical LVEF is smaller than the median absolute difference for 22 of the 27 individual sonographers who provided measurements for the study. In this setting, we believe historical LVEF is the best comparison without anchoring.
Supplemental Figure 3: Performance of each individual sonographer vs. AI initial assessment compared to historical assessment. Boxplot of interquartile range (IQR) and median. Whiskers truncated beyond 20% difference.
Minor comments: -in Abstract: "The mean absolute difference from between final and prior...": sentence to be re-phrased and clarified; the "prior cardiology assessment" was clear to the reader only after reading the Methods. It should be clarified what is meant, for a non-expert, when these results are presented in the Abstract.
Reply: Thank you for the feedback; we hope this is clarified by changing the sentence to: In the abstract: The mean absolute difference between final cardiologist assessment and independent prior cardiologist assessment was 6.29% in the AI group and 7.23% in the sonographer group (difference -0.96%, 95% CI -1.34% to -0.54%, P < 0.001 for superiority).
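To make this endpoint concrete for a non-expert reader, the mean absolute difference is simply the average of the absolute gaps between paired LVEF readings. The minimal sketch below uses hypothetical values, not trial data.

```python
def mean_absolute_difference(final, prior):
    """Mean absolute difference (in LVEF percentage points)
    between paired final and prior cardiologist assessments."""
    assert len(final) == len(prior)
    return sum(abs(f - p) for f, p in zip(final, prior)) / len(final)

# Hypothetical paired LVEF readings (percent), for illustration only.
final_lvef = [55, 60, 35, 48, 62]
prior_lvef = [50, 66, 40, 45, 60]
print(mean_absolute_difference(final_lvef, prior_lvef))  # 4.2
```

A lower value indicates that the final reading sits closer, on average, to the independent prior reading, which is how the abstract's 6.29% (AI) versus 7.23% (sonographer) comparison should be read.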
-"as be subject to heterogeneity" -> "as being subject to"?
Reply: Thank you for the feedback; that has been changed.
In the discussion: Despite the importance of LVEF assessment in daily clinical practice and clinical research protocols, conventional approaches to measuring LVEF are well recognized as being subject to heterogeneity and variance given that they rely on manual and subjective human tracings. 5,6

Comments from the Reviewer 3
The paper describes a prospective, blinded randomized controlled trial to evaluate a previously published AI algorithm for assessing left ventricular ejection fraction (LVEF), by comparing it against sonographer initial assessment. The study is well designed and well carried out. The statistical analyses are appropriate and results are convincing. The paper is very well written, with clearly laid out motivations, clearly explained approach and conclusions, as well as thoughtful descriptions of study limitations.
Reply: Thank you for your review of our trial and we appreciate your feedback. We think there is particular value in evaluating AI technologies in blinded and randomized fashion. We recognize that this is an evolving landscape and seek to integrate the feedback on future directions and additional experiments remaining to be done in the field.
Even though the trial being single center is a major limitation, and furthermore it was limited to one particular AI model, the fact that the model was trained on data from a completely different population does lend value to the results. As the authors recognized, more trials should be conducted in the near future, on more diverse trial populations, and to evaluate AI models trained on a broader set of data. Nonetheless, as the first study of its kind, the findings presented in this paper represent a significant initial step forward and are of great importance to the field.
Reply: Thank you for your review of our trial and we appreciate your feedback on how to improve the presentation of the results. We seek to present the trial fairly and integrate discussion of some of the challenges in the field and necessary next steps for deployment.
In the discussion: Several limitations of our trial should be mentioned. First, our study was single center, reflecting the demographics and clinical practices of a particular population. Nevertheless, the AI model was trained on example images from another center and the clinical trial was performed as prospective external validation, suggesting