Main

Accurate quantification of cardiac function is necessary for disease diagnosis, risk stratification and assessment of treatment response4,5. LVEF, in particular, is routinely used to guide clinical decisions regarding patient appropriateness for a wide range of medical and device therapies as well as interventions, including surgery4,6,7. Despite the importance of LVEF assessment in daily clinical practice and clinical research protocols, conventional approaches to measuring LVEF are well recognized to be subject to heterogeneity and variance because they rely on manual, subjective human tracings8,9.

Clinical practice guidelines recommend that, when LVEF is assessed by cardiac imaging—most commonly echocardiography—measurements be repeated over multiple cardiac cycles to improve precision and account for arrhythmic or haemodynamic sources of variation10. Unfortunately, such repeated human measurements are rarely performed in practice given the logistical constraints of most clinical imaging laboratories, and a single tracing or a visual estimation of LVEF is often used as a pragmatic alternative11,12. Such an approach is suboptimal for detecting the subtle changes in LVEF that inform important therapeutic decisions (for example, eligibility for continued chemotherapy or defibrillator implantation)13.

Building on tremendous progress in the field of AI over the past decade14, numerous algorithms have been developed with the goal of automating assessment of LVEF in real-world patient care settings1,3,15. Although such AI algorithms have demonstrated improved precision on limited retrospective datasets, to date no cardiovascular AI technology has been validated in a blinded, randomized clinical trial16,17,18,19. In addition, human–computer interaction and the effect of AI prompting on clinical interpretations remain underexplored in clinical studies. To address this need, we conducted a blinded, randomized non-inferiority clinical trial to prospectively assess the effect of initial assessment by AI versus conventional initial assessment by a sonographer on the final cardiologist interpretation of LVEF.

Cohort characteristics

We enrolled 3,769 transthoracic echocardiogram studies originally performed at an academic medical centre between 1 June 2019 and 8 August 2019; these studies were prospectively re-evaluated by 25 cardiac sonographers (mean of 14.1 years of practice) and 10 cardiologists (mean of 12.7 years of practice). In total, 3,495 studies from 3,035 patients could be annotated by sonographers using Simpson’s method-of-discs calculation of LVEF, and 274 studies were excluded for being of insufficient image quality to contour the left ventricle (Fig. 1). The eligible studies were randomized 1:1 to AI versus sonographer for initial evaluation, with 1,740 studies assigned to the AI group and 1,755 studies assigned to the sonographer group. The baseline characteristics of the two groups were well balanced (Table 1). At a separate workstation and a different time, cardiologists were presented with the full echocardiogram study with initial annotations for the final blinded assessment of LVEF.

Table 1 Demographic and imaging study characteristics

Assessment of blinding

After completing each study, cardiologists were asked to predict the agent of initial interpretation. Cardiologists correctly predicted the method of initial assessment for 1,130 (32.3%) studies, guessed incorrectly for 845 (24.2%) studies and were unsure whether the initial assessment was provided by AI or a sonographer for 1,520 (43.4%) studies. Bang’s blinding index, a metric of blinding for which 0 represents perfect blinding and −1 or 1 represents complete unblinding, was used to assess blinding in the trial20. An index between −0.2 and 0.2 is typically considered to indicate good blinding; the index was 0.075 for the sonographer group and 0.088 for the AI group, and the bootstrapped confidence intervals were consistently within −0.2 to 0.2 (P > 0.99).
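For illustration, Bang's index for an arm can be computed as correct guesses minus incorrect guesses divided by all responses in that arm, with 'unsure' responses pulling the index towards 0. A minimal Python sketch, applied to the pooled counts reported above (the per-arm splits are not given here), is:

```python
# Minimal sketch of Bang's blinding index: (correct - incorrect) / total responses.
# 0 indicates perfect blinding; -1 or 1 indicates complete unblinding.

def bang_blinding_index(n_correct: int, n_incorrect: int, n_unsure: int) -> float:
    # 'Unsure' responses enter the denominator only, diluting the index towards 0.
    return (n_correct - n_incorrect) / (n_correct + n_incorrect + n_unsure)

# Pooled counts from the trial: 1,130 correct, 845 incorrect, 1,520 unsure.
print(round(bang_blinding_index(1130, 845, 1520), 3))
# 0.082, which sits between the reported per-arm values of 0.075 and 0.088
```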

Primary outcome

The primary outcome of substantial change between the initial and final assessments occurred in 292 (16.8%) studies in the AI group compared with 478 (27.2%) studies in the sonographer group (difference of −10.4%, 95% confidence interval: −13.2% to −7.7%, P < 0.001 for non-inferiority, P < 0.001 for superiority) (Table 2). The mean absolute difference between the initial and final assessments of LVEF was 2.79% in the AI group compared with 3.77% in the sonographer group (difference −0.97%, 95% confidence interval: −1.33% to −0.54%, P < 0.001 for superiority; Fig. 2).
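These between-group differences can be reproduced with a standard Wald confidence interval for a difference in proportions; a minimal sketch (the trial's exact interval method is not specified here) is:

```python
from math import sqrt

# Minimal sketch: Wald 95% CI for the difference in proportions (AI minus sonographer).
def risk_difference_ci(x1: int, n1: int, x2: int, n2: int, z: float = 1.96):
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled standard error
    return diff, diff - z * se, diff + z * se

diff, lo, hi = risk_difference_ci(292, 1740, 478, 1755)
print(f"{diff:.1%} ({lo:.1%} to {hi:.1%})")
# -10.5% (-13.2% to -7.7%); the reported -10.4% reflects subtraction of rounded percentages
```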

Table 2 Efficacy and safety outcomes
Fig. 1: CONSORT diagram.

Screening, randomization and follow-up.

Secondary safety outcome

The secondary safety outcome of substantial difference between final cardiologist-adjudicated LVEF compared with the previously clinically reported LVEF occurred in 871 (50.1%) studies in the AI group compared with 957 (54.5%) studies in the sonographer group (difference of −4.5%, 95% confidence interval: −7.8% to −1.2%, P = 0.008). The mean absolute difference between previous and final cardiologist assessments was 6.29% in the AI group compared with 7.23% in the sonographer group (difference −0.94%, 95% confidence interval: −1.34% to −0.54%, P < 0.001 for superiority; Fig. 2).

Other outcomes and subgroup analyses

The reduction in the primary end point in the AI group was consistent across all major subgroups (Table 3). Between the initial and final assessments, 1,100 (63.2%) studies in the AI group and 1,218 (69.4%) studies in the sonographer group had any change (difference of −6.2%, 95% confidence interval: −9.3% to −3.1%, P < 0.001 for superiority). Sonographers took a median of 119 s (interquartile range (IQR): 77–173 s) to assess and annotate LVEF. Cardiologists took a median of 54 s (IQR: 31–95 s) to adjudicate initial assessments in the AI group and a median of 64 s (IQR: 36–108 s) in the sonographer group. The mean difference in sonographer time between the AI and sonographer groups was −131 s (95% confidence interval: −134 to −127 s, P < 0.001), and the mean difference in cardiologist time was −8 s (95% confidence interval: −12 to −4 s, P < 0.001).

Fig. 2: Comparison of AI versus sonographer guidance on cardiologist assessment and difference between final versus previous cardiologist assessments.

Dots represent individual studies and lines indicate lines of best fit. MAD, mean absolute difference.

We additionally performed a post-hoc assessment of the frequency of changes from initial to final assessment that crossed a clinically meaningful threshold (that is, an LVEF of 35% for consideration of implantable defibrillator therapy). In the AI group, 22 of 1,740 (1.3%) studies crossed the 35% threshold between the initial and final cardiologist assessments, compared with 54 of 1,755 (3.1%) studies in the sonographer group (P = 0.0004, Chi-squared test).
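This comparison corresponds to a standard chi-squared test on a 2 × 2 contingency table; a minimal sketch (assuming the default Yates continuity correction, which reproduces the reported P value) is:

```python
from scipy.stats import chi2_contingency

# Minimal sketch of the post-hoc test: arms (rows) versus threshold crossing (columns).
table = [[22, 1740 - 22],   # AI group: crossed the 35% LVEF threshold vs did not
         [54, 1755 - 54]]   # sonographer group
chi2, p, dof, expected = chi2_contingency(table)  # Yates correction applied by default
print(f"chi2 = {chi2:.2f}, P = {p:.4f}")  # P = 0.0004, matching the reported value
```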

Table 3 Subgroup analysis

Discussion

In this trial of board-certified cardiologists adjudicating clinical transthoracic echocardiographic examinations, AI-guided initial evaluation of LVEF was found to be non-inferior and even superior to sonographer-guided initial evaluation. After blinded review of AI versus sonographer-guided LVEF assessment, cardiologists were less likely to substantially change the LVEF assessment for their final report when the initial assessment was provided by AI. Furthermore, the AI-guided assessment took less time for cardiologists to overread and was more consistent with the cardiologist assessment from the previous clinical report. Although not the first trial of AI technology in clinical cardiology21,22,23, to our knowledge, this study represents the first blinded, randomized trial in this space.

In addition to prospectively evaluating the impact of AI in a clinical trial, our study represents the largest test–retest assessment of clinician variability in assessing LVEF to date. The degree of human variability between repeated LVEF assessments in our study is consistent with previous studies11,12,24, and the introduction of AI guidance decreased variance between independent clinician assessments. In this trial, we utilized experienced sonographers as an active comparator versus the AI for the initial assessment of LVEF; however, different levels of experience and types of training could change the relative impact of AI compared with clinician judgement. For both methods of initial assessment, the difference between final and initial assessments was smaller than the difference between final and previous cardiologist assessments, highlighting the anchoring effect of an initial assessment in practice and the importance of blinding for quantifying effect size in clinical trials of diagnostic imaging. In both the anchored outcome (comparison of preliminary assessment with final assessment) and the independent outcome (comparison of final assessment in the trial with the historical cardiologist assessment), the AI arm showed less variation and more precision in the assessment of LVEF.

Notwithstanding tremendous interest in AI technologies, there have been few prospective trials evaluating their efficacy and effect on clinician assessments. Important clinical trials of AI technology have already shown the efficacy of AI in cardiology21,25; however, given the difficulty of blinding a diagnostic tool, previous trials have often been open-label, with comparison against placebo or no diagnostic assistance. Previous work has shown that there can be a Hawthorne effect when studying novel technologies such as AI systems26,27. By introducing blinding with an active comparator arm, studies can better distinguish the effect of the AI technology itself from the impact of being observed or of introducing an intervention. Current FDA-approved technologies for LVEF assessment were not prospectively evaluated with randomization and blinding16. By integrating AI into the reporting software, our study sought to minimize bias in assessing the effect size of the AI intervention.

To enable effective blinding, we implemented a single-cardiac-cycle annotation workflow representative of many real-world high-volume echocardiography laboratories. Despite this framework, there was a small signal that cardiologists were more likely to guess the agent of initial assessment correctly than incorrectly. However, the blinding index was within the range typically described as good blinding, and regardless of whether the cardiologist thought the initial agent was AI, a sonographer or uncertain, the results trended towards improved performance in the AI arm. Our findings of non-inferiority and even superiority of initial AI assessment are encouraging given that AI assessment reduces the time and effort of the tedious manual processing typically required by routine clinical workflows. Given these promising results, further developments of AI could eventually facilitate additional workflows that are required for conducting comprehensive cardiac assessments in routine clinical practice and in accordance with guideline recommendations28,29,30,31.

Several limitations of our trial should be mentioned. First, our study was single centre, reflecting the demographics and clinical practices of a particular population. Nevertheless, the AI model was trained on example images from another centre and the clinical trial was performed as prospective external validation, suggesting generalizability of the AI techniques and workflow. Second, the study was not powered to assess long-term outcomes based on differences in LVEF assessment. Although the results were consistent across subgroups, further analyses are needed to evaluate the impact of video selection, frame selection and intra-provider variability. Third, this trial used previously acquired echocardiogram studies, and although these were prospectively evaluated by sonographers and cardiologists, there can be bias when a sonographer other than the scanning sonographer interprets the images. Last, consistent with findings from most AI studies, we found that model performance scales with the number of training examples. Thus, we anticipate that future studies could improve on the AI performance observed in the current study by implementing AI models trained on an even greater number of examples derived from a broad and diverse cohort of patients. Of note, this clinical trial utilized an AI model trained entirely at an independent site, representing external validation of the model. Effective deployment of AI models in cardiology clinical practice will require additional regulatory oversight, adoption and appropriate use by clinicians, and functional integration with clinical systems, all of which need to be carefully considered and further studied.

In summary, we found that an AI-guided workflow for the initial assessment of cardiac function in echocardiography was non-inferior and even superior to the initial assessment by the sonographer. Cardiologists required less time, substantially changed the initial assessment less frequently and were more consistent with previous clinical assessments by the cardiologist when using an AI-guided workflow. This finding was consistent across subgroups of different demographic and imaging characteristics. In the context of an ongoing need for precision phenotyping, our trial results suggest that AI tools can improve efficacy as well as efficiency in assessing cardiac function. Next steps include studying the effect of AI guidance on cardiac function assessment across multiple centres.

Methods

Study design and oversight

Cardiologists with board certification in echocardiography were assigned to read independent transthoracic echocardiogram studies randomized to initial assessment by AI versus sonographer. Imaging studies were initially acquired and interpreted clinically by a board-certified cardiologist between 1 June 2019 and 8 August 2019 at Cedars-Sinai Medical Center. Studies were randomly sampled within the time range without regard to patient identity, so that multiple studies from the same patient would be randomized and assessed independently. Sonographers were asked to use their standard clinical practice to annotate the left ventricle for either single-plane or biplane method-of-discs calculation of LVEF10. Studies were excluded from randomization if the sonographer was unable to quantify the LVEF owing to inadequate image quality.

Eligible studies were randomly assigned, in a 1:1 ratio, to initial assessment by AI or sonographer and presented to the cardiologists in the standard clinical reporting workflow and software (Siemens Syngo Dynamics VA40D) for adjustment of the left ventricular (LV) annotation and calculation of LVEF. Although the AI model could annotate every frame of every cardiac cycle, to facilitate blinding, one representative cardiac cycle was annotated and presented to the cardiologist in the AI group. To preserve blinding, the same proportion of single-plane-annotated and biplane-annotated studies was generated for the AI group as for the sonographer group.

The trial was designed as a blinded, randomized non-inferiority trial with a prespecified margin by academic study investigators, without industry sponsorship or representation in trial design. Approval by the Cedars-Sinai Medical Center Institutional Review Board was obtained before the start of the study. All reading echocardiographers gave informed consent and were excluded from the data analysis. The last author prepared the first draft of the manuscript. The first and last authors had full access to the data. All the authors vouch for the accuracy and completeness of the data and analyses and for the fidelity of the study to the protocol. This study satisfies the requirements set forth by the CONSORT-AI and SPIRIT-AI reporting guidelines27,31.

Model design and clinical integration

The architecture and design of the AI model have been previously described1. The AI model was trained using 144,184 echocardiogram videos from Stanford Healthcare and was never exposed to videos from the Cedars-Sinai Medical Center. An end-to-end workflow was developed, including view classification, LV annotation and LVEF assessment. The AI model was fully embedded within the clinical reporting system at the start of the trial, without subsequent changes or adjustments. An in-depth technical description can be found in the Supplementary Appendix.
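A minimal sketch of this three-stage workflow is shown below; the function names and return values are hypothetical stand-ins for the trained networks described in ref. 1 and the Supplementary Appendix.

```python
from typing import List, Optional

# Hypothetical stand-ins for the trained networks; each would be a deep model in practice.
def classify_view(video) -> str:
    """Stage 1: view classification (for example, 'A4C', 'A2C' or 'other')."""
    return "A4C"

def segment_left_ventricle(video) -> dict:
    """Stage 2: LV annotation, returning endocardial tracings for a representative cycle."""
    return {"tracings": []}

def estimate_lvef(video) -> float:
    """Stage 3: video-based LVEF estimation."""
    return 60.0

def assess_study(videos: List) -> Optional[dict]:
    # Keep only apical views suitable for LVEF assessment.
    apical = [v for v in videos if classify_view(v) in ("A4C", "A2C")]
    if not apical:
        return None
    # Annotate one representative cardiac cycle (as done in the trial to preserve blinding).
    tracing = segment_left_ventricle(apical[0])
    return {"tracing": tracing, "lvef": estimate_lvef(apical[0])}
```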

For both sonographers and cardiologists, the entire study (most often between 60 and 120 videos) was shown in the standard clinical reporting software. The study was shown without any annotations to the sonographer, who chose the apical-4-chamber and apical-2-chamber videos and traced the endocardium to assess LVEF. For the cardiologist, the study was shown with one set of annotations (provided by either the AI or the sonographer), and the cardiologist could adjust the endocardial tracing to change the reported LVEF (example video 1). Standard method-of-discs evaluation of the left ventricle, either biplane or single plane depending on sonographer input, was used to calculate LVEF.
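For reference, the biplane method of discs models the ventricle as a stack of elliptical discs whose diameters come from the paired apical tracings; a minimal sketch, assuming the conventional 20-disc summation, is:

```python
import numpy as np

# Minimal sketch of the biplane method of discs (Simpson's rule), assuming 20 discs:
# V = (pi / 4) * sum(a_i * b_i) * (L / n), where a_i and b_i are matched disc diameters
# from the apical-4-chamber and apical-2-chamber tracings and L is the LV long-axis length.

def biplane_volume(a_diams, b_diams, lv_length: float) -> float:
    a, b = np.asarray(a_diams), np.asarray(b_diams)
    return np.pi / 4 * np.sum(a * b) * (lv_length / len(a))

def lvef(edv: float, esv: float) -> float:
    """Ejection fraction (%) from end-diastolic and end-systolic volumes."""
    return 100 * (edv - esv) / edv

# Illustrative values only: uniform disc diameters (cm) at end diastole and end systole.
edv = biplane_volume([4.0] * 20, [4.0] * 20, 8.0)  # ~100.5 ml
esv = biplane_volume([3.0] * 20, [3.0] * 20, 7.0)  # ~49.5 ml
print(round(lvef(edv, esv), 1))                    # ~50.8% LVEF
```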

Outcomes assessment

The primary outcome was the change in LVEF between the initial assessment by AI or sonographer and the final cardiologist assessment. The primary outcome was evaluated both as the proportion of studies with substantial change and the mean absolute difference between initial and final assessments. Substantial change was defined as greater than 5% change in LVEF between initial and final assessments. The analysis was performed as randomized and there was no crossover between the two groups.
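As a concrete illustration of these two forms of the primary outcome, given paired initial and final LVEF values (the arrays below are illustrative only, not trial data):

```python
import numpy as np

# Illustrative paired assessments (LVEF in percentage points), not trial data.
initial = np.array([55.0, 42.0, 60.0, 30.0, 48.0])
final = np.array([54.0, 49.0, 61.0, 33.0, 40.0])

abs_diff = np.abs(final - initial)
substantial = abs_diff > 5.0  # substantial change: more than 5 LVEF points
print(f"proportion substantial: {substantial.mean():.0%}, MAD: {abs_diff.mean():.2f}")
# proportion substantial: 40%, MAD: 4.00
```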

The duration of time for contouring and adjustment by the sonographer and cardiologist was tracked and compared between the sonographer and AI arms. To assess blinding, cardiologists were asked to indicate, for each study, whether they believed the initial interpretation was performed by AI or by a sonographer, or that they were unable to tell. A key secondary safety end point was the change in final cardiologist-adjudicated LVEF compared with the previous cardiologist-reported LVEF. An additional secondary end point was the proportion of studies with no change in LVEF between initial and final interpretations.

Statistical analysis

The trial was designed to test for non-inferiority, with a secondary objective of testing for superiority with respect to the primary end point. Non-inferiority would be shown if the upper limit of the 95% confidence interval for the between-group difference in the primary end point was less than 5% (less than the natural test–retest variability in blinded human assessment of LVEF)8,24. With α = 0.05, power of 0.9 and an estimated occurrence rate of substantial change of 5% in the AI-guided workflow versus 8% in the sonographer-guided workflow, we estimated that 2,834 studies were needed for statistical power, and we prospectively planned to enrol 3,500 studies as a buffer against dropout and unforeseen challenges. Non-block randomization was performed once at the beginning of the trial, and the analyses included all imaging studies that underwent randomization (intention-to-treat population). All reported P values for non-inferiority are one-sided, and all reported P values for superiority are two-sided. Bang’s blinding index was used to evaluate the quality of blinding during the trial20. Statistical analyses were performed with the use of R 4.1.0 (R Foundation for Statistical Computing).
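The stated sample size can be reproduced with the classic two-proportion formula, assuming pooled variance under the null and unpooled variance under the alternative (an assumption, as the exact method is not specified here):

```python
from math import sqrt
from scipy.stats import norm

# Minimal sketch of the two-proportion sample-size calculation for the primary end point.
def total_sample_size(p1: float, p2: float, alpha: float = 0.05, power: float = 0.9) -> float:
    z_a = norm.ppf(1 - alpha / 2)  # two-sided alpha
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    n = ((z_a * sqrt(2 * p_bar * (1 - p_bar))
          + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p1 - p2) ** 2
    return 2 * n  # two equal-sized groups

print(f"{total_sample_size(0.05, 0.08):.0f}")  # 2834 studies, matching the stated estimate
```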

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.