Introduction

Despite the rapid growth of artificial intelligence (AI) applications in healthcare, few models have progressed beyond retrospective development or validation, creating what is commonly called the “AI chasm”1. Among the subset of models that have moved into randomized controlled trials, even fewer have demonstrated clinically meaningful benefits2. This reality is a sobering reminder that translating AI algorithms from in silico environments to real-world clinical settings remains a formidable challenge. This translational gap may be attributed to a high risk of bias during model development and to dataset shift during prospective validation3,4.

One of the conditions that has been extensively studied within the AI community is sepsis, a life-threatening organ dysfunction due to infection and a leading cause of morbidity and mortality worldwide5. Early identification of sepsis is paramount, as it enables timely administration of antibiotics and other life-saving measures. The challenge and importance of early sepsis detection have therefore catalyzed the development of several predictive algorithms across various clinical settings, including the emergency department (ED), inpatient ward, and intensive care unit (ICU)6. However, evaluation of these models against real-world patient outcomes has remained limited.

In this context, Boussina and colleagues should be congratulated for demonstrating significant improvements in patient outcomes after implementing their AI algorithm7. The authors previously developed COMPOSER (COnformal Multidimensional Prediction Of SEpsis Risk)8, a deep learning model trained on retrospective data that ingests routine clinical information from electronic health records (EHR) to predict sepsis (defined by the current Sepsis-3 criteria). In the present study, they first conducted a “silent mode trial,” evaluating the model on prospective patients in real time while end-users were blinded to its predictions. Next, they performed an implementation experiment that tracked patient outcomes before and after the deployment of COMPOSER. Their approach was well aligned with the three-stage translational pathway for AI, which comprises (1) exploratory model development, (2) a silent trial, and (3) prospective clinical evaluation9,10. The authors found that using COMPOSER within two EDs at UC San Diego (UCSD) Health was associated with a 17% relative reduction in in-hospital mortality and a 10% increase in sepsis bundle compliance. Sepsis bundles vary across institutions but generally comprise actions such as obtaining blood cultures before administering antibiotics, measuring lactate at defined time intervals, and administering fluids within three hours of presentation.
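
To make the bundle-compliance metric concrete, the sketch below shows one way adherence might be audited from EHR timestamps. The field names, the single three-hour deadline applied to every element, and the ordering rule are simplifying assumptions for illustration only; they do not represent UCSD’s actual bundle definition or the CMS SEP-1 specification.

```python
from datetime import datetime, timedelta
from typing import Optional


def bundle_compliant(
    presentation: datetime,
    blood_culture: Optional[datetime],
    antibiotics: Optional[datetime],
    lactate: Optional[datetime],
    fluids: Optional[datetime],
    window_hours: float = 3.0,
) -> bool:
    """Return True if all bundle elements were completed in order and on time.

    Hypothetical field names and a uniform three-hour window are used for
    illustration; real bundle definitions differ across institutions.
    """
    deadline = presentation + timedelta(hours=window_hours)

    # Blood cultures must be drawn before antibiotics are given.
    cultures_before_antibiotics = (
        blood_culture is not None
        and antibiotics is not None
        and blood_culture <= antibiotics
    )
    # Antibiotics, lactate measurement, and fluids within the time window.
    antibiotics_on_time = antibiotics is not None and antibiotics <= deadline
    lactate_on_time = lactate is not None and lactate <= deadline
    fluids_on_time = fluids is not None and fluids <= deadline

    return all(
        [cultures_before_antibiotics, antibiotics_on_time, lactate_on_time, fluids_on_time]
    )


# Example with hypothetical timestamps:
t0 = datetime(2024, 1, 1, 8, 0)
print(bundle_compliant(
    presentation=t0,
    blood_culture=t0 + timedelta(minutes=30),
    antibiotics=t0 + timedelta(minutes=45),
    lactate=t0 + timedelta(minutes=20),
    fluids=t0 + timedelta(hours=2),
))  # True
```

A site-level compliance rate would then simply be the proportion of eligible encounters for which such a check returns true.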

More than just the AI algorithm

Importantly, this study offers valuable insights into the ecosystem required for AI algorithms to perform well in the clinical setting in the United States. COMPOSER was directly embedded into the clinical workflow, following similar principles described by Sendak et al.11. A nurse-facing Best Practice Advisory (BPA), i.e., a reminder or warning presenting the COMPOSER sepsis risk score alongside its top predictive features, was integrated into the EHR. This was an essential step towards addressing the critical need for explainability among clinical end-users12. A standardized set of responses to the BPA was devised with multidisciplinary input. This broad stakeholder engagement was likely vital to achieving a remarkable degree of buy-in among nurses, with only 5.9% of sepsis alerts dismissed over the five-month intervention period. Furthermore, the BPA enhanced communication between nurses and physicians and expedited time-to-antibiotics, a plausible mechanism for the observed reduction in mortality. Finally, the study team implemented robust systems to continuously monitor data quality and model performance, prompting model retraining if performance fell below predefined thresholds. This approach helps sustain the effectiveness and adaptability of COMPOSER over time.
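
To illustrate the kind of monitoring described above, the sketch below flags a model for retraining when its discrimination on a recent window of adjudicated cases falls below a predefined floor. The choice of AUROC, the 0.85 floor, and the 500-case window are placeholder assumptions for illustration, not the metrics or thresholds reported by the COMPOSER team.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def needs_retraining(
    y_true: np.ndarray,
    y_score: np.ndarray,
    auroc_floor: float = 0.85,
    window: int = 500,
) -> bool:
    """Flag retraining when discrimination on the most recent cases drops
    below a predefined floor.

    The 0.85 floor and 500-case window are arbitrary placeholders, not the
    thresholds used in the COMPOSER deployment.
    """
    recent_true = y_true[-window:]
    recent_score = y_score[-window:]
    # AUROC is undefined if the recent window contains only one class.
    if len(set(recent_true.tolist())) < 2:
        return False
    return roc_auc_score(recent_true, recent_score) < auroc_floor
```

In practice, such a check would run only once adjudicated Sepsis-3 labels become available and would be paired with upstream data-quality checks on the input features.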

As this study makes evident, scaling AI algorithms within healthcare systems requires substantial resources, infrastructure, expertise, and adequate endorsement at the clinical end-user, departmental, and institutional levels. Such an ecosystem may be challenging to replicate outside of academic settings or within single-payer healthcare systems. Therefore, the costs and benefits of these AI algorithms should be carefully weighed through health technology assessments, because their incremental advantages may not justify the steep costs required to implement and maintain such technologies. Table 1 outlines key considerations for hospital leadership as they navigate implementing these algorithms within their institutions.

Table 1 Considerations for implementing AI algorithms into healthcare systems

Healthcare is only human

AI algorithms tend to excel in controlled environments, where only specific predictive features influence the clinical outcome. However, the inherently human nature of patients and providers introduces numerous challenges, causing even the most robust AI models to degrade over time. Diversity in patient characteristics, disease presentations, and practice patterns, together with evolving treatment paradigms, contributes to the potential failure of algorithms post-deployment4. Indeed, Boussina et al. highlight some of these challenges in their study. Despite a reported reduction in sepsis-related mortality, this benefit was observed at only one of the two hospitals. The lack of clinical improvement at their quaternary site may be attributable to differences in patient comorbidities, where even timely interventions may not be sufficient. In addition, the evaluation of COMPOSER was limited to the ED setting at UCSD; thus, its generalizability to other clinical environments or institutions remains unknown. Similar concerns have been raised regarding the Epic Sepsis Model, which was found to have substantially lower performance and a high false-positive rate during external validation13. Lastly, clinical end-users may have been influenced by their awareness of being observed (i.e., the Hawthorne effect) during the five-month implementation period, and their compliance with the BPA may diminish over time. These limitations emphasize the need for an AI ecosystem that supports algorithms and enables them to adapt as healthcare continuously evolves.

Conclusion

AI can be successful in healthcare systems only if its predictions are available at the right time and place. Algorithms, while critical, cannot function in isolation; they must be paired with dedicated infrastructure, resources, and personnel trained to act on their predictions. Processes must also be in place to enable algorithms to adapt when their predictions degrade over time with the evolving healthcare landscape. Furthermore, AI researchers should shift their focus from performance metrics such as accuracy alone towards meaningful improvements in individual patient outcomes, while balancing the potentially steep costs of technological innovation. As a healthcare and AI community, we have a responsibility to deliver on these clinically relevant metrics, and researchers and journals alike should be encouraged to prioritize such studies.