Introduction

Sepsis is a leading cause of mortality and morbidity in hospitalized patients1,2. A large body of published evidence links delays in key responses, including lactate measurement, antibiotic initiation, and fluid administration3,4,5, to adverse clinical outcomes6. Accordingly, the Surviving Sepsis Campaign guidelines recommend prompt initiation of sepsis bundles for the treatment of sepsis in a variety of clinical settings7. With the rapid adoption of electronic health records, automated alerting systems that provide early warning of sepsis have attracted tremendous interest in the literature. Several prediction algorithms have been developed, including rule-based and machine learning (ML)-based methods. The former typically apply standard criteria, such as the systemic inflammatory response syndrome (SIRS) or quick Sequential Organ Failure Assessment (qSOFA) criteria, to routine variables such as vital signs and laboratory findings. The latter use a variety of ML methods, including neural networks, random forests, and support vector machines, to alert for sepsis. These methods have been found to predict sepsis with high accuracy8,9,10,11. However, good statistical performance of a prediction model does not necessarily translate into clinical usefulness. It is more important that an automated alerting system improve patient-important outcomes. Thus, comparative effectiveness studies are essential to provide high-quality evidence for clinical decision-making.

Many studies have explored the clinical effectiveness of automated alerting systems for the management of sepsis12,13. Many investigators compared clinical outcomes between the pre- and post-implementation periods of an automated system14,15. Systematic reviews evaluating the usefulness of automated alerting systems in sepsis have also been reported. However, most of these reviews assessed the diagnostic accuracy of the alerting systems in predicting sepsis12,16,17,18, and only a few evaluated effectiveness in terms of clinically relevant outcomes such as mortality and length of stay (LOS). For instance, Hwang and colleagues analyzed studies published between 2009 and 2018 and found that algorithm-based methods had high accuracy in predicting sepsis; to our knowledge, only one such analysis reported an improved mortality outcome19. A systematic review conducted by the Cochrane Collaboration included three randomized controlled trials (RCTs) and concluded that, owing to the low quality of the included studies, the effect of automated sepsis monitoring systems on clinical outcomes remains unclear13. The number of comparative effectiveness studies has been increasing steadily in recent years, with several new RCTs reported20,21; thus, an updated systematic review is needed to refresh the evidence for clinical practice. Furthermore, the results of these studies conflict because of differences in prediction algorithms, clinical settings, and study designs. To address this heterogeneity and appraise the evidence for clinical practice, we performed a systematic review that critically evaluates the quality of this evidence.

Results

Study selection

The initial search identified 2950 articles from the databases, of which 921 were screened after the removal of duplicates. A total of 823 citations were excluded on title and abstract review because they involved pediatric patients, non-relevant interventions, reviews, or other non-original articles. The remaining 98 citations underwent full-text screening, and 36 articles were finally included in the quantitative analyses (Fig. 1). The number of publications increased until 2017 and then declined (Supplementary Fig. 1).

Fig. 1: Flowchart of study selection.

WOS Web of Science, CENTRAL Cochrane Central Register of Controlled Trials.

Study characteristics

A total of 36 studies were included, spanning the years 2010 to 2021 (Table 1). There were 6 RCTs20,21,22,23,24,25 and 30 non-randomized studies (NRS)14,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54. Four studies used ML-based prediction to alert for sepsis/severe sepsis22,26,27,30. Six studies were conducted in the intensive care unit (ICU) setting14,22,23,25,44,51. The sample sizes of the RCTs ranged from 142 to 1123. Burdick’s study included 17,758 subjects because it involved nine medical centers and analyzed all at-risk patients for clinical outcomes27.

Table 1 Characteristics of included studies.

Risk of bias in studies

The risk of bias was assessed with different tools for RCTs (Fig. 2) and NRS (Figs. 3 and 4). Although some studies did not report all the information needed to grade their methodology, the RCTs generally had a lower risk of bias. The NRS carried a higher risk of bias in the selection of participants and in outcome measurement.

Fig. 2: Risk of bias assessment for randomized controlled trials.

a Summary statistics of the risk of bias assessment for RCTs. b Risk of bias assessment for each RCT. RCT randomized controlled trial.

Fig. 3: Risk of bias assessment for each of the non-randomized studies.

The rows represent individual studies and the columns represent the quality items annotated at the bottom.

Fig. 4: Summary of the risk of bias assessment for non-randomized studies.

The bars show the percentage of studies with different levels of quality as indicated by colors.

Results of syntheses

The mortality outcomes reported in individual studies were inconsistent across studies (Fig. 5). While some studies reported beneficial effects22,26,39, others reported harmful effects of the automated alerting system23,38,45. When risk ratios were pooled across RCTs, there was a trend toward improved mortality in the intervention group, but it did not reach statistical significance (RR: 0.85; 95% CI: 0.61–1.17). However, there was a statistically significant beneficial effect in the NRS (RR: 0.69; 95% CI: 0.59–0.80). There was no statistically significant heterogeneity among the RCTs (I² = 33%, p = 0.19), but heterogeneity across the NRS was significant (I² = 81%, p < 0.01). In subgroup analysis, it was notable that the automated alerting system had a smaller beneficial effect in the ICU (RR: 0.90; 95% CI: 0.73–1.11) than in the emergency department (ED) (RR: 0.68; 95% CI: 0.51–0.90) and the ward (RR: 0.71; 95% CI: 0.61–0.82; Supplementary Fig. 2). Furthermore, ML-based prediction methods showed a larger mortality reduction (RR: 0.56; 95% CI: 0.39–0.80) than rule-based methods (RR: 0.73; 95% CI: 0.63–0.85; Supplementary Fig. 3). Alerts recommending bundle compliance (RR: 0.63; 95% CI: 0.43–0.94) performed better in reducing mortality than sepsis alerts alone (RR: 0.78; 95% CI: 0.66–0.92; Supplementary Fig. 4). Bayesian meta-analysis of the RCTs with the NRS evidence as the prior showed that automated alerting reduced mortality risk (RR: 0.71; 95% credible interval: 0.62 to 0.81; Supplementary Fig. 5).
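As an illustration of the pooling approach described in the Methods, the sketch below applies a random-effects model to pool risk ratios with the R meta package. The study labels and event counts are hypothetical placeholders, not the extracted data.

```r
# Illustrative only: hypothetical event counts, not the extracted study data.
library(meta)

dat <- data.frame(
  study   = c("Trial A", "Trial B", "Trial C"),
  event.e = c(30, 45, 12),    # deaths, alerting arm (hypothetical)
  n.e     = c(200, 310, 90),  # patients, alerting arm (hypothetical)
  event.c = c(38, 50, 15),    # deaths, usual-care arm (hypothetical)
  n.c     = c(205, 300, 92)   # patients, usual-care arm (hypothetical)
)

m <- metabin(event.e = event.e, n.e = n.e,
             event.c = event.c, n.c = n.c,
             studlab = study, data = dat,
             sm = "RR",           # pool risk ratios
             method = "MH",       # Mantel-Haenszel method, as described in the Methods
             method.tau = "DL")   # DerSimonian-Laird between-study variance

summary(m)  # pooled RR, 95% CI, and I^2
forest(m)   # forest plot analogous to Fig. 5
```

Subgroup estimates such as those reported above can be obtained by passing a grouping variable (e.g., care setting) to metabin via its subgroup argument (byvar in older versions of the package).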

Fig. 5: Forest plot for pooled effects of the automated alerting system in mortality outcome.

The size of the blue square indicates the weight of each study. The black diamond represents the pooled effect size for each subgroup as well as for the overall effect. The red bars represent the prediction interval. IV inverse variance, RCT randomized controlled trial, CI confidence interval.

ICU length of stay was reported in 11 studies, and there was no evidence that automated alerts significantly reduced ICU length of stay (MD: -1.33; 95% CI: -3.34 to 0.67), with substantial heterogeneity across these studies (I² = 97%, p < 0.01; Supplementary Fig. 6). The other subgroup analyses failed to identify factors explaining this heterogeneity (Supplementary Figs. 7–9). Hospital length of stay was reported in 21 studies. Overall, there was a significant reduction in hospital length of stay (MD: -2.42; 95% CI: -4.43 to -0.41), again with substantial heterogeneity across studies (I² = 94%, p < 0.01). The heterogeneity could not be fully explained by study design (Supplementary Fig. 10), although studies conducted in hospital wards showed more consistent results (I² = 77%, p < 0.01; Supplementary Fig. 11). The method used (ML-based or rule-based), the specific rule used to trigger the alert (SIRS, qSOFA, or MEWS), and the purpose of alerting did not account for the heterogeneity (Supplementary Figs. 12–14).
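The length-of-stay analyses follow the same logic with a continuous outcome. A minimal sketch, again with invented numbers rather than the extracted data, is shown below.

```r
# Illustrative only: hypothetical LOS summaries (means and SDs in days).
library(meta)

los <- data.frame(
  study  = c("Study A", "Study B"),
  n.e    = c(150, 400), mean.e = c(8.2, 10.1), sd.e = c(4.0, 6.5),  # alerting arm
  n.c    = c(145, 410), mean.c = c(9.5, 12.0), sd.c = c(4.3, 7.1)   # usual care
)

m.los <- metacont(n.e = n.e, mean.e = mean.e, sd.e = sd.e,
                  n.c = n.c, mean.c = mean.c, sd.c = sd.c,
                  studlab = study, data = los,
                  sm = "MD",          # mean difference
                  method.tau = "DL")  # random-effects variance estimator

summary(m.los)  # pooled MD, 95% CI, and I^2
```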

Reporting biases

Reporting bias in the included studies was assessed with a p-curve, which showed a right-skewed distribution with 73% of the p-values between 0 and 0.01 (Fig. 6). The null hypothesis that all of the significant p-values are false positives was rejected with high statistical significance; thus, at least some of the p-values are likely to be true positives. The power estimate was very high at 99%, with a confidence interval ranging from 96% to 99%. The contour-enhanced funnel plots were generally symmetric for mortality and hospital LOS (Fig. 6). The apparently missing studies lay in areas of high statistical significance, so any asymmetry is not necessarily attributable to publication bias.

Fig. 6: Assessment of publication bias.

a Visual inspection of the p-curve shows a right-skewed distribution with 73% of the p-values between 0 and 0.01 and only 20% of p-values between 0.03 and 0.05. The statistical tests against the null hypothesis that all of the significant p-values are false positives are highly significant; thus, at least some of the p-values are likely to be true positives. The power estimate is very high (99%), with a tight confidence interval ranging from 96% to 99%. Somewhat redundantly, the p-curve also provides a significance test for the hypothesis that power is less than 33%; this test is not significant, which is unsurprising given the estimated power of 99%. The contour-enhanced funnel plots show significance-level contours at 0.1, 0.05, and 0.01 for b mortality, c ICU length of stay, and d hospital length of stay. Some studies appear to be missing in areas of high statistical significance; thus, it is possible that the asymmetry is not due to publication bias. ICU intensive care unit.

Discussion

This study provides updated, systematically synthesized evidence on the effectiveness of automated alerts for the management of sepsis in various settings. The results show that managing sepsis with an automated alerting system can reduce mortality, a finding further supported by the Bayesian meta-analytic approach. Although there is no evidence that automated alerting systems reduce ICU LOS, hospital LOS was significantly reduced. Subgroup analyses indicate that the beneficial effect of automated alerting systems is smaller in the ICU setting than in the ED and general wards. ML-based alerting systems appear to provide additional benefit compared with rule-based methods.

The main finding of our study is that an automated alerting system can reduce mortality risk, probably because it increases awareness of sepsis onset. A large body of evidence shows that early recognition of sepsis and prompt initiation of the sepsis bundle are associated with improved outcomes. For example, a reduction in time to antibiotic administration is consistently associated with improved survival55,56, and similar effects are observed for other bundle components such as lactate measurement and fluid administration57. The effect of the automated alerting system is more prominent in the general ward and emergency settings than in the ICU. This is probably because the ICU is already equipped with advanced monitoring modalities and its physicians and nurses maintain a high level of vigilance for sepsis as part of usual care, so an additional automated alerting system provides little further benefit.

The findings of this study have several novel aspects and clinical implications. First, a Bayesian meta-analytic approach was employed to integrate evidence from both RCTs and NRS. Although the RCT is the gold-standard design for comparative effectiveness research, such data are sparse, smaller in scale, and potentially unrepresentative of the patient populations or conditions found in real-world settings. Thus, real-world evidence from routine clinical practice provided by NRS is important to complement information from RCTs and potentially bridge the ‘efficacy-effectiveness’ gap58. The results of the Bayesian meta-analysis are consistent with those of the frequentist meta-analytic approach.

Second, more in-depth subgroup analyses were performed to explore potential heterogeneity among the component studies. Our analysis found that automated alerting systems deployed in ICU settings had smaller beneficial effects than those deployed in other settings. This is not surprising, since ICU patients are already monitored closely by both automated systems and attentive staff. In contrast, general wards and the ED have lower staffing levels, and some deteriorating conditions may not be recognized as quickly. In such settings, an automated alerting system can provide additional benefit and improve clinical outcomes. In line with this finding, other early warning scores have been widely deployed outside ICUs to improve the early recognition of deteriorating conditions59. Real-time automated alerting systems based on electronic medical records (EMRs) could help identify unstable patients, and early detection and intervention supported by such systems may improve patient outcomes60.

Third, ML-based methods appear to be superior to rule-based methods in improving mortality. ML-based methods estimate the presence of sepsis/severe sepsis by utilizing a larger number of relevant data points and biomarkers and can better capture the non-linear relationships among these variables61,62; such complex relationships cannot be recognized to the same degree by humans. Rule-based methods mostly rely on established diagnostic criteria for identifying sepsis, so sepsis is usually already present when a warning is triggered. Thus, timely diagnosis may be more readily achieved with ML methods15.

Several limitations of the current study must be acknowledged. First, the quality of the included RCTs was variable. Blinding is difficult to achieve given the nature of the intervention, and the changes in medical decision-making prompted by the alerts were not always well characterized. Additionally, the reported beneficial effects in the intervention group could be biased because clinicians knew the allocation and may have paid more attention to patients in the intervention group. Second, most component studies were NRS, which are prone to both measured and unmeasured confounding; the potentially biased effect sizes from NRS were only partly addressed by the Bayesian meta-analytic approach. Third, there was considerable heterogeneity among the included studies that could not be explained by the prespecified variables; thus, large, more homogeneous RCTs are needed to provide high-quality evidence63. Finally, machine-learning algorithms are sensitive to changes in their environment and subject to performance decay64, so continuous monitoring and updating are required to ensure long-term safety and effectiveness. Sepsis care processes can change as evidence accumulates, requiring ML algorithms to adapt to the new environment.

In conclusion, this study shows a beneficial effect of automated alerting systems in the management of sepsis. Notably, machine learning-based monitoring systems coupled with earlier interventions show promise, especially for patients outside the ICU. However, there is substantial heterogeneity and risk of bias across the component studies, and further randomized trials are still required to improve the quality of the evidence.

Methods

Eligibility criteria

Studies comparing the effectiveness of automated alerting systems for the management of sepsis were potentially eligible. The study population comprised hospitalized patients who were at risk for sepsis or who had sepsis. Patients at risk for sepsis were defined as in the original studies and included those presenting to the emergency department (ED), general hospitalized patients, and ICU patients. Patients who were not initially in the ICU but were subsequently transferred to the ICU because of deteriorating conditions were also included. The intervention was an automated alerting system integrated into the electronic health records; the algorithms included ML-based and rule-based methods. The control group received usual care, in which the medical providers did not receive any alerting messages. The outcomes were hospital mortality and LOS in the intensive care unit (ICU) and hospital. Evidence from non-randomized studies (NRS) was pooled with that from RCTs using a Bayesian meta-analytic approach. Subgroup analyses stratified by study design, setting, and method of the alerting system were performed. The study protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO: CRD42022299219).

Information sources

The electronic databases PubMed, Scopus, Embase®, the Cochrane Central Register of Controlled Trials (CENTRAL), ISI Web of Science, and medRxiv were searched from inception (the earliest records date back to 1917) to December 2021. The reference lists of identified articles were also searched manually to identify additional references.

Search strategy

Key terms related to (1) sepsis (sepsis, septic shock, or septicemia), (2) automated alerting (automated, ML, prediction, warning, and recognition), (3) clinical outcomes (mortality, length of stay), and (4) study design (randomized, controlled, pre-implementation, and post-implementation) were searched in the databases. The literature type was restricted to articles when a search engine provided filtering functionality (Supplementary Methods).

Selection process

Two authors (L.C. and P.X.) independently performed the literature selection. Duplicate references across databases were removed using the RefManageR package (version 1.3.0). Titles and abstracts were first screened to remove irrelevant articles such as reviews, animal studies, non-relevant interventions (e.g., antimicrobial susceptibility testing), irrelevant subjects (e.g., delirium management and prediction of acute kidney injury), pediatric patients (age < 16 years), and case reports. The full texts of the remaining references were then screened. Disagreements were resolved in a meeting attended by all review authors.
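The deduplication step is not described in detail; the sketch below shows one plausible workflow, assuming the database exports were merged into a single BibTeX file (the file name and the title-based matching rule are illustrative assumptions, not the authors' procedure).

```r
# Illustrative sketch: remove duplicate records by normalized title.
library(RefManageR)

bib  <- ReadBib("all_database_exports.bib")        # hypothetical merged export
refs <- as.data.frame(bib)                         # one row per reference

key  <- gsub("[^a-z0-9]", "", tolower(refs$title)) # crude normalized title key
refs_dedup <- refs[!duplicated(key), ]

nrow(refs) - nrow(refs_dedup)                      # number of duplicates removed
```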

Data collection process

A custom-made form was prepared for data collection. The data included the name of the first author, publication year, sample size, study design, prediction algorithm, number of patients in the intervention and control arms, the summary effect for length of stay, and the corresponding standard deviation or interquartile range. Studies were classified as RCTs or NRS by design; NRS included those comparing patients managed with the automated alerting system versus historical controls. Studies reported mortality at different follow-up time points; if a study reported several, we extracted in-hospital mortality. The prediction algorithm was classified as rule-based or ML-based, with rule-based methods defined as those using existing sepsis diagnostic criteria to warn of the presence of sepsis. Two authors independently extracted data, and any disagreements were resolved by a third reviewer (Z.Z.).
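The form records either a standard deviation or an interquartile range for length of stay; pooling mean differences requires a standard deviation, and the paper does not state how interquartile ranges were converted. A common approximation, assumed here purely for illustration, divides the IQR width by 1.35 under approximate normality.

```r
# Illustrative helper, not necessarily the authors' conversion rule:
# approximate the SD from the first and third quartiles of near-normal data.
iqr_to_sd <- function(q1, q3) (q3 - q1) / 1.35

iqr_to_sd(q1 = 5, q3 = 11)  # e.g., LOS with IQR 5-11 days gives SD of about 4.4
```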

Assessment of risk of bias

The risk of bias was assessed separately for RCTs and NRS. RCTs were assessed across the domains of sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessment, incomplete outcome data, selective reporting, and other sources of bias65. The risk of bias in NRS was assessed using the Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool, which covers bias due to confounding, bias in the selection of participants, bias in the classification of interventions, bias due to deviations from intended interventions, bias due to missing data, bias in the measurement of outcomes, and bias in the selection of the reported result66. The risk of bias assessment was performed independently by two reviewers (L.C. and K.C.), and any conflicting results were settled by a third reviewer (Z.Z.).

Effect measures and synthesis methods

The primary outcome was mortality, reported as the risk ratio (RR) with its 95% confidence interval. LOS in the hospital and ICU was reported as the mean difference (MD). Evidence from NRS and RCTs was pooled separately using a conventional frequentist meta-analytic approach with the R meta package (version 5.1-1)67. Because of the heterogeneity of the component studies, a random-effects model was employed to pool the effect measures. The Mantel-Haenszel method was used to calculate the between-study heterogeneity statistic Q, which was then used in the DerSimonian-Laird estimator of the between-study variance68. Evidence from NRS was pooled with that from RCTs using a Bayesian meta-analytic approach in which the pooled effect estimates from the NRS served as the prior distribution for integrating the RCT data69; this approach ‘pulls’ the treatment-effect estimates from the RCTs toward the summary effects from the NRS. Subgroup analyses stratified by setting (ICU, ED, or ward), method of the alerting system (ML-based versus rule-based), and purpose of alerting (bundle compliance versus sepsis/severe sepsis alert) were performed.
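To make the ‘NRS as prior’ idea concrete, the sketch below performs a simplified normal-normal conjugate update on the log risk-ratio scale, treating the pooled NRS estimate as the prior and the pooled RCT estimate as the likelihood. This is an approximation for illustration rather than the exact Bayesian model used in the analysis, but it yields a posterior close to the reported RR of 0.71 (0.62 to 0.81).

```r
# Normal-normal conjugate approximation on the log-RR scale (illustrative only).
ci_to_norm <- function(rr, lo, hi) {
  list(mean = log(rr), sd = (log(hi) - log(lo)) / (2 * qnorm(0.975)))
}

prior <- ci_to_norm(0.69, 0.59, 0.80)  # pooled NRS estimate, used as the prior
lik   <- ci_to_norm(0.85, 0.61, 1.17)  # pooled RCT estimate, used as the likelihood

w_prior <- 1 / prior$sd^2              # precision (inverse-variance) weights
w_lik   <- 1 / lik$sd^2
post_mean <- (w_prior * prior$mean + w_lik * lik$mean) / (w_prior + w_lik)
post_sd   <- sqrt(1 / (w_prior + w_lik))

round(exp(post_mean), 2)                                      # posterior RR, ~0.72
round(exp(post_mean + c(-1, 1) * qnorm(0.975) * post_sd), 2)  # ~0.62 to 0.82
```

Because the NRS prior is much more precise than the RCT likelihood, the posterior sits close to the NRS estimate, which is exactly the ‘pulling’ behavior described above.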

Reporting bias assessment

Reporting bias in the included component studies was assessed and visualized using contour-enhanced funnel plots, in which colored contours indicate the significance level of each study; these contours help distinguish asymmetry due to publication bias from asymmetry due to other factors70. P-curve analysis was also performed to detect p-hacking and publication bias71. If a set of studies consists mostly of true effects tested with moderate to high power, there are more p-values between 0 and 0.01 than between 0.04 and 0.05, a pattern the p-curve authors call a right-skewed distribution. If the distribution is flat or left-skewed (more p-values between 0.04 and 0.05 than between 0 and 0.01), the results are more consistent with the null hypothesis than with the presence of a real effect.
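A minimal sketch of these diagnostics is given below, assuming m is a fitted meta-analysis object such as the one returned by metabin in the earlier sketch; the p-values used for the p-curve-style tabulation are hypothetical.

```r
# Contour-enhanced funnel plot for a fitted meta-analysis object `m` (e.g., from
# metabin); shaded contours mark the 0.10, 0.05, and 0.01 significance levels.
library(meta)

funnel(m,
       contour = c(0.90, 0.95, 0.99),
       col.contour = c("gray75", "gray85", "gray95"))
legend("topright", legend = c("p < 0.10", "p < 0.05", "p < 0.01"),
       fill = c("gray75", "gray85", "gray95"))

# Crude p-curve-style tabulation: bin the significant two-sided p-values from
# the component studies (the values below are hypothetical).
p <- c(0.001, 0.004, 0.008, 0.012, 0.030, 0.049)
table(cut(p[p < 0.05], breaks = seq(0, 0.05, by = 0.01)))
# A right-skewed distribution (most p-values in the 0-0.01 bin) is consistent
# with true effects rather than p-hacking.
```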

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.