Development of an artificial intelligence bacteremia prediction model and evaluation of its impact on physician predictions focusing on uncertainty

Prediction of bacteremia is a clinically important but challenging task. An artificial intelligence (AI) model has the potential to facilitate early bacteremia prediction, aiding emergency department (ED) physicians in making timely decisions and reducing unnecessary medical costs. In this study, we developed and externally validated a Bayesian neural network-based AI bacteremia prediction model (AI-BPM). We also evaluated its impact on physician predictive performance considering both AI and physician uncertainties using historical patient data. A retrospective cohort of 15,362 adult patients with blood cultures performed in the ED was used to develop the AI-BPM. The AI-BPM used structured and unstructured text data acquired during the early stage of the ED visit, and provided both the point estimate and 95% confidence interval (CI) of its predictions. High AI-BPM uncertainty was defined as the predetermined bacteremia risk threshold (5%) being included in the 95% CI of the AI-BPM prediction, and low AI-BPM uncertainty as the threshold not being included. In the temporal validation dataset (N = 8,188), the AI-BPM achieved an area under the receiver operating characteristic curve (AUC) of 0.754 (95% CI 0.737–0.771), sensitivity of 0.917 (95% CI 0.897–0.934), and specificity of 0.340 (95% CI 0.330–0.351). In the external validation dataset (N = 7,029), the AI-BPM’s AUC was 0.738 (95% CI 0.722–0.755), sensitivity was 0.927 (95% CI 0.909–0.942), and specificity was 0.319 (95% CI 0.307–0.330). The AUC of the post-AI physician predictions (0.703, 95% CI 0.654–0.753) was significantly improved compared with that of the pre-AI predictions (0.639, 95% CI 0.585–0.693; p-value < 0.001) in the sampled dataset (N = 1,000). The AI-BPM especially improved the predictive performance of physicians in cases with high physician uncertainty (low subjective confidence) and low AI-BPM uncertainty. Our results suggest that the uncertainty of both the AI model and physicians should be considered for successful AI model implementation.

Phase 1: Development and validation of AI-BPM. Adult (aged ≥ 18 years) ED patients who had at least two sets of blood cultures taken during their ED stay were included for analysis. The development, temporal validation, and external validation datasets included 15,362, 8,188, and 7,029 cases, respectively, with mean ages ranging from 62.3 to 65.6 years and proportions of females ranging from 45.1% to 45.8%. The proportions of patients with bacteremia were 10.9%, 10.3%, and 13.6% in the development, temporal validation, and external validation datasets, respectively (Table 1). In the development dataset, patients with bacteremia were older, more likely to use an ambulance, and less likely to be referred from other hospitals than patients without bacteremia; they also exhibited lower blood pressure, higher heart rate (HR), and higher body temperature (BT). Bacteremia patients were more likely to have a history of chills, vomiting, and abdominal pain (Supplementary Table 1).
In the ablation study, we observed inferior performance when using only structured data or unstructured data to predict bacteremia compared to the AI-BPM, which utilized both types of data. Specifically, when only structured data were used, the AUCs (CIs) were 0.703 (0.684-0.721) and 0.679 (0.660-0.697) in the temporal validation and external validation datasets, respectively. Similarly, when only unstructured data were used, the AUCs (CIs) were 0.679 (0.660-0.698) and 0.681 (0.663-0.699) in the temporal validation and external validation datasets, respectively (Supplementary Table 2).
Phase 2: Physician predictive performance before and after the use of AI-BPM. Five hundred cases from each of the temporal and external validation datasets were randomly sampled to construct the sampled dataset with 1,000 unique cases. The sampled dataset was then divided into ten sets, each with 100 unique cases. Twenty board-certified emergency medicine physicians were recruited to review one of the ten sets and predict the probability of bacteremia before and after observing the AI-BPM predictions for each case. Therefore, a single set was reviewed separately by two physicians, and each physician reviewed 100 cases (Fig. 1). Among the 20 reviewing physicians, 14 were affiliated with a tertiary hospital and 6 with a secondary hospital. The physicians had 4-10 years of experience in the ED.
Agreement between the two physicians reviewing the same set was minimal for the pre-AI predictions according to Cohen's kappa statistic (κ = 0.28) and increased for the post-AI predictions (κ = 0.38). In the post-experiment survey using a 5-point Likert scale (1: strongly disagree, 5: strongly agree), the participating physicians gave an average rating of 4.1 (SD: 0.7) points to the statement "Providing explanations of the AI model's predictions increased the trustworthiness of the model." Additionally, the participating physicians gave an average rating of 4.1 (SD: 0.9) points to the statement "Providing confidence intervals for the AI model's predictions increased the trustworthiness of the model."
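As an illustration of the agreement statistic reported above, the following is a minimal sketch of Cohen's kappa for two raters who label the same cases. The text does not state which prediction categories were compared, so the binary labels below are purely hypothetical, and the function name `cohens_kappa` is ours rather than the authors'.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each label the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels (1 = predicted bacteremia, 0 = not); not study data.
physician_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
physician_2 = [1, 0, 0, 1, 1, 1, 1, 0, 1, 0]
kappa = cohens_kappa(physician_1, physician_2)
```

In this toy example the raters agree on 7 of 10 cases, but 0.54 of that agreement is expected by chance, which is why kappa is well below the raw agreement rate.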

Discussion
In this two-phase study, we first developed and validated an AI-BPM and subsequently examined its impact on physician predictions using historical patient records. The proposed AI-BPM is a BNN-based multi-modal prediction model that utilizes both structured and unstructured text data available at the early stage of an ED visit and was developed and validated using large datasets. Temporal and external validation of the AI-BPM indicated acceptable discrimination and calibration performance, with AUCs for predicting bacteremia in the range of 0.73-0.76. In the validation datasets, the sensitivities and specificities at a threshold of 5% were in the ranges of 0.91-0.93 and 0.31-0.34, respectively. When the AI-BPM was used as a CDSS, the physician performance of predicting bacteremia was significantly improved. The AUC increased from 0.64 to 0.70 and the sensitivity increased from 0.84 to 0.90 after utilizing the AI-BPM. The predictive performance of physicians was especially improved in cases where they had low confidence in their predictions (high physician uncertainty) and the AI-BPM had high confidence (low AI-BPM uncertainty). The strengths of this study include a large sample size, development of a novel AI bacteremia prediction model that considers the uncertainties of its predictions, and validation of the model on an external dataset. Additionally, to the best of our knowledge, this study is one of the first to explore the impact of an AI model on physicians considering the uncertainties of both the physician and the AI model.
A recently published study compared physician gestalt with two well-established prediction models for predicting bacteremia 13. In that study, the AUC and sensitivity (at a 5% risk threshold) of predicting bacteremia using physician gestalt were 0.79 and 0.97, respectively, which are higher than those of the pre-AI physician predictions reported in our study. However, there are significant differences in the setting: the predictions made using physician gestalt in the previous study were performed just before admission and were therefore based on information already obtained, including imaging and laboratory tests. Because such time-consuming information is often not available at the time ED blood cultures are performed, the results of that study do not truly reflect how well physician gestalt can be used to avoid unnecessary blood cultures in the ED. Several validated bacteremia prediction models that use laboratory test results as inputs have demonstrated an AUC of 0.74-0.75 in the ED setting 13,21,22. The AI-BPM, without using laboratory test results, achieved performance comparable to that of existing prediction models by utilizing multi-modal data. Natural language processing was used to mine unstructured clinical notes to enable early prediction of sepsis in a previous study 25. However, to our knowledge, there is currently no bacteremia prediction model that incorporates unstructured text data. We believe that a multi-modal AI model that integrates such data greatly enhances the ability to formulate early and accurate predictions. The important features of the AI-BPM for predicting bacteremia included old age, fever, hypotension, and history of chills, which are similar to those reported in previous studies 21,22. Words including "sputum", "cough", and "dyspnea" decreased the predicted probability of bacteremia, which is consistent with previous findings of a low prevalence of bacteremia in patients with respiratory tract infections 26.
The uncertainty of AI model predictions can be assessed in two ways: by analyzing the point estimate or the dispersion of the estimate 27. When a prediction's point estimate is very low or high, it can be considered highly confident, while an estimate in the middle range may indicate less confidence in a binary classification problem. BNNs are particularly effective at capturing the second type of uncertainty, in which a narrow CI suggests high confidence and a wide CI suggests low confidence. This study's approach to defining AI uncertainty encompasses both types of uncertainties mentioned above. Specifically, whether the risk threshold value (5%) falls within the 95% CI of the AI-BPM prediction is determined by both the point estimate and the dispersion of the prediction. This definition of AI uncertainty also considers clinical knowledge. For instance, an AI model prediction yielding a point estimate of 0.5 with a 95% CI of 0.3-0.7 might suggest a high degree of uncertainty in some situations; however, a model designed to predict the probability of bacteremia would still confidently recommend to the physician that blood cultures be performed.
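The definition of AI-BPM uncertainty described above reduces to a simple interval check, sketched here. The function name `ai_uncertainty` is hypothetical, and treating the interval endpoints as inclusive is our assumption, since the text does not specify the boundary case. The first interval is the 0.3-0.7 example from the preceding paragraph; the second (0.03-0.13) is the example given later in this Discussion.

```python
RISK_THRESHOLD = 0.05  # predetermined bacteremia risk threshold (5%)


def ai_uncertainty(ci_lower, ci_upper, threshold=RISK_THRESHOLD):
    """Return 'high' if the threshold lies inside the 95% CI, else 'low'.

    Endpoints are treated as inclusive (an assumption; not stated in the text).
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    return "high" if ci_lower <= threshold <= ci_upper else "low"


# Wide interval, but entirely above 5%: the recommendation is unambiguous.
wide_but_confident = ai_uncertainty(0.30, 0.70)
# Narrower interval that straddles 5%: the recommendation is uncertain.
narrow_but_uncertain = ai_uncertainty(0.03, 0.13)
```

The two calls show why the definition depends on both the point estimate and the dispersion: interval width alone does not determine the label.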
In our study, physician uncertainty was assessed based on their subjective confidence in their predictions. This measure may be influenced by various factors, such as the physician's clinical experience and personality and the patient information provided. For instance, a physician may be confident in predicting the absence of bacteremia using gestalt for a young healthy patient with specific symptoms suggestive of upper respiratory infection. However, predicting bacteremia in a patient with vague symptoms can be challenging. The AI-BPM has demonstrated significant value in situations with such high physician uncertainty.
The results of this study indicate that the physician-AI interaction process closely resembles the traditional clinical decision-making process. When faced with high levels of uncertainty, a physician may seek advice from a peer or obtain further diagnostic test results. The likelihood of the physician accepting recommendations may be higher if the peer is experienced or if the diagnostic test results are definitive. An AI prediction model with suitable performance can potentially serve as either an experienced peer or a valuable diagnostic test. In this context, providing the level of uncertainty and explanations for the prediction is crucial to ensure that physicians will trust the AI model 2.
One noteworthy finding of this study is that the use of AI-BPM improved the sensitivity of physicians, while the specificity remained unchanged. This is likely because physicians prioritize safety over other factors due to the severe consequences of missing life-threatening conditions 13. Another interesting discovery was that the AI-BPM diagnostic performance was similar in subgroups with low and high physician uncertainty, which suggests that the AI-BPM may be interpreting clinical information differently from physicians, thus enabling it to perform well even in situations where physicians lack confidence.
There were 18 cases of bacteremia in which the physician initially assessed the risk of bacteremia as very low but subsequently revised their evaluation to a higher risk after utilizing AI-BPM. Among these cases, 10 did not display fever upon presentation at the ED and lacked any documentation of fever or chills in the physician's notes. The patients were elderly (with a mean age of 70.8 years) and exhibited symptoms such as abdominal pain, headache, dyspnea, hematemesis, and altered mental state. These findings highlight the significance of this study in medical education, as it identifies scenarios where physicians may exhibit weaknesses in predicting bacteremia.
This study has several limitations. First, we used data from academic tertiary hospitals located in urban areas, which may limit the generalizability of this study. The characteristics of patients and the decision criteria to obtain blood cultures may be different in other settings. Second, Phase 1 of the study used retrospectively collected data, which could potentially include unmeasured biases. Third, we assessed the impact of the AI model on physicians using historical patient records instead of evaluating it in the real-world setting. Therefore, the reviewing physicians were not able to examine the patients themselves, but were only able to read the examination results from the historical patient record. The completeness and accuracy of the physician notes may have also affected the study results. Fourth, reading order bias may have been involved due to the sequential reading design of this study 28. However, a sequential reading design was also adopted in many previous studies, and it was necessary to evaluate the prediction changes of physicians according to their uncertainty 29,30. Finally, although we did not specifically enroll physicians who either favored or opposed the adoption of AI models, the participating physicians' familiarity with and attitude towards AI may have influenced the impact of the AI-BPM.
This study provides several important insights into the factors that should be considered during the process of AI model implementation in the healthcare system. First, the uncertainty of the physicians, which is associated with the effectiveness of a novel AI model implementation, should be considered. An AI model would be of greater utility if it can provide accurate predictions in clinical situations where physicians are highly uncertain. Additionally, the baseline predictive performance of the physicians should be measured and reported to the physicians. If physicians are unaware of their baseline predictive performance, they can become overconfident, which may lead to decreased effectiveness of AI model implementation 31. Second, AI model prediction uncertainty should be considered to allow physicians to make proper decisions in tandem with the model. For example, in the bacteremia prediction setting of our study, a predicted probability of 0.08 (95% CI 0.03-0.13) would indicate an uncertain prediction, whereas a model that considers only the point estimate (0.08) would simply recommend performing blood cultures. Finally, satisfactory explanations and estimates of prediction uncertainty should be provided to acquire the physicians' trust and enable effective AI model implementation 11.
In conclusion, the AI-BPM, a BNN-based model that captures the uncertainty of its predictions, was developed and externally validated. The use of the AI-BPM significantly improved the predictive performance of physicians, especially in cases where physicians were uncertain and the AI-BPM was confident. Although further clinical trials are necessary to assess the effectiveness of the AI-BPM in real-world clinical settings, our study provides insight into the potential benefits of physician-AI model collaboration in enhancing predictive accuracy in uncertain clinical tasks.

Methods
Study design and setting. Cases of ED visits to Seoul National University Hospital (Hospital A) between January 2016 and December 2017 were used for AI-BPM development. ED visits to Hospital A between January 2018 and December 2018 were used for temporal validation. Cases of ED visits to Seoul National University Bundang Hospital (Hospital B) between January 2018 and December 2018 were used for external validation (Fig. 1). Hospitals A and B have annual ED visits of 70,000-90,000 and receive both referred patients and patients from the regional community. Data, including patient demographics, vital signs, symptoms, ED physician notes, and ED outcomes, were extracted from the clinical data warehouses of the study institutions.
A graphical user interface (GUI) was developed to simulate an electronic medical record (EMR) system that presents a patient's baseline characteristics (age and sex), ambulance use, ED triage level, initial vital signs, mental status, and initial ED physician notes (Supplementary Fig. 2). The GUI depicted an EMR of a recently arrived ED patient who had just been examined by an ED physician. Historical records of the patients in the sampled dataset were used. Before the study, the participating physicians were briefly informed of the AI-BPM development process and the predictive performance of the AI-BPM in the development dataset. The physicians reviewed the records in the GUI and selected the estimated probability of bacteremia on an ordinal scale (very low, 0-5%; low, 5-10%; low-moderate, 10-20%; moderate, 20-50%; high, 50-100%) using clinical gestalt (pre-AI prediction). The ordinal scale of bacteremia probability was determined according to a previous review 26. They also chose the confidence level of their predictions on a 5-point Likert scale (1, very low; 2, low; 3, moderate; 4, high; 5, very high) for each of the patients. After a pre-AI prediction was made, the AI-BPM prediction of bacteremia probability, along with its 95% CI, was presented on the GUI sequentially. Additionally, the local feature importance using SHapley Additive exPlanations (SHAP) was shown as a bar plot on the GUI to inform the reviewing physician how each variable influenced the output of the AI-BPM for each case 32. The physicians were asked to rerate the probability of bacteremia and the confidence level of their predictions after observing the results of the AI-BPM (post-AI prediction).
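The ordinal probability scale used for the physician predictions can be written down as a small lookup, sketched below. Two points are assumptions rather than statements from the text: each interval is treated as closed on the left and open on the right (so exactly 5% falls in "low"), and a physician prediction is counted as positive at the 5% risk threshold whenever the selected category is anything other than "very low". All names are hypothetical.

```python
# (label, lower bound inclusive, upper bound exclusive); boundary handling
# is an assumption, since the text lists only the ranges.
PROBABILITY_CATEGORIES = [
    ("very low", 0.00, 0.05),
    ("low", 0.05, 0.10),
    ("low-moderate", 0.10, 0.20),
    ("moderate", 0.20, 0.50),
    ("high", 0.50, 1.00),
]


def categorize(probability):
    """Map a probability in [0, 1] to its ordinal category label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    for label, lower, upper in PROBABILITY_CATEGORIES:
        if lower <= probability < upper:
            return label
    return "high"  # probability == 1.0


def is_positive_at_5_percent(label):
    """Assumed binarization: any category above 'very low' counts as positive."""
    return label != "very low"


labels = [categorize(p) for p in (0.02, 0.05, 0.15, 0.30, 0.80)]
```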

Study population.
All adults (aged ≥ 18 years) who visited the ED of the study institutions during the corresponding study period and had at least two sets of blood cultures taken during their ED stay were included. Different ED visits from the same patient were considered as separate cases. Cases without matching ED physician notes were excluded. The decision to obtain blood cultures was made by the attending ED physician, similar to the process in previous studies 13,21.
Variables with a significant difference between patients with bacteremia and those without bacteremia in the development dataset were used as predictors for the AI-BPM (Supplementary Table 1). Vital signs, mental status, and ED triage level were measured by the triage nurse shortly after a patient's arrival at the ED. The ED triage level was determined by the Korean Triage and Acuity Scale, which was developed based on the Canadian Triage and Acuity Scale 33. The symptoms of patients were recorded by the initial attending ED physician. While there were some missing vital sign data in all three datasets, the proportions of missing data were less than 3%. Missing data were imputed with mean values. Variables other than vital signs had no missing data. Continuous variables were standardized to zero mean and unit variance. Categorical variables were one-hot encoded. A patient's present illness and past medical history recorded by the initial attending ED physician were used as unstructured data for the AI-BPM. The notes were documented immediately after the attending ED physician examined the triaged patient. Physician notes were written in a bilingual (English/Korean) free-text format, which is a common practice in Korea 34. Text preprocessing, including removal of punctuation marks, deletion of English and Korean stop words, conversion of capital letters to lowercase, and lemmatization, was performed. Subsequently, each note was vectorized using the term frequency-inverse document frequency (TF-IDF) vectorizer with the minimum document frequency set to 1%. The TF-IDF method was chosen for this study because it offers several advantages, including the ability to manage bilingual text, ease of interpretation, and performance comparable to that of more complex algorithms 35,36. The full list of predictors used in the AI-BPM is presented in Supplementary Table 3.
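To make the vectorization step concrete, the following is a minimal pure-Python sketch of TF-IDF with a minimum document frequency cut-off. It is not the authors' implementation: the text does not name the TF-IDF variant, so the smoothed inverse document frequency and L2 normalization used here are assumptions, and the tokens are hypothetical, already-preprocessed words rather than study data.

```python
import math
from collections import Counter


def tfidf_vectorize(tokenized_notes, min_df=0.01):
    """Return (vocabulary, vectors) for a list of token lists.

    Terms appearing in fewer than `min_df` (a fraction) of notes are dropped.
    Smoothed IDF and L2 normalization are assumptions of this sketch.
    """
    n_docs = len(tokenized_notes)
    if n_docs == 0:
        raise ValueError("at least one note is required")
    # Document frequency: number of notes containing each term.
    doc_freq = Counter()
    for tokens in tokenized_notes:
        doc_freq.update(set(tokens))
    vocabulary = sorted(t for t, df in doc_freq.items() if df / n_docs >= min_df)
    index = {term: i for i, term in enumerate(vocabulary)}
    idf = [math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary]
    vectors = []
    for tokens in tokenized_notes:
        counts = Counter(t for t in tokens if t in index)
        vec = [0.0] * len(vocabulary)
        for term, count in counts.items():
            vec[index[term]] = count * idf[index[term]]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0:
            vec = [v / norm for v in vec]
        vectors.append(vec)
    return vocabulary, vectors


# Hypothetical preprocessed notes; min_df is raised to 50% only because the
# toy corpus is tiny (the study used 1%).
notes = [
    ["fever", "chill"],
    ["cough", "sputum", "fever"],
    ["abdominal", "pain"],
    ["fever", "chill", "chill"],
]
vocabulary, vectors = tfidf_vectorize(notes, min_df=0.5)
```

With this cut-off only "chill" and "fever" survive, so the third note, which contains neither, maps to an all-zero vector.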

Development of the AI-BPM.
The development dataset was randomly split into two for hyperparameter tuning, in which 80% of the data were used for AI-BPM training and the remaining 20% were used for validation. Subsequently, the AI-BPM was trained on the entire development dataset with the optimal hyperparameters (Supplementary Table 4). The BNN algorithm is a type of neural network with Bayesian inference. The AI-BPM, which is based on the BNN algorithm, receives two inputs: preprocessed structured data and vectorized encoding based on TF-IDF from unstructured text data. The structured data input and TF-IDF vector input were connected to hidden layers of 100 and 15 nodes, respectively. The hidden layers were concatenated and then connected to a single output node. All layers were densely connected and used the Flipout estimator for Bayesian variational inference 37. While a standard neural network is trained to find the point estimates of the weights and outputs, a BNN is trained to find the marginal distributions of the weights and outputs that best fit the data 38. Because the AI-BPM is based on a BNN, the uncertainty of each of the predictions can be estimated 9,38. To calculate the mean and SD of the AI-BPM output distribution for a single patient case, 25 samples are taken from the output distribution. The final prediction of the AI-BPM is then determined as the mean of the output distribution. The 95% CI, derived from the SD, is used to define the uncertainty of the AI-BPM prediction.
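The sampling step described above can be sketched as follows. The text states only that the 95% CI is derived from the SD of 25 samples, so the normal-approximation interval (mean ± 1.96 × SD), the use of the sample SD, and the clipping to [0, 1] are all assumptions of this sketch. The function `fake_stochastic_model` is a hypothetical stand-in for one stochastic forward pass of a BNN, not the AI-BPM itself.

```python
import random
import statistics


def summarize_predictive_samples(samples, z=1.96):
    """Return (mean, sd, (lower, upper)) for sampled predicted probabilities.

    Assumes a normal-approximation interval, mean +/- z * SD, clipped to [0, 1].
    """
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)  # sample SD; the text does not say which SD
    lower = max(0.0, mean - z * sd)
    upper = min(1.0, mean + z * sd)
    return mean, sd, (lower, upper)


def fake_stochastic_model(rng):
    """Hypothetical stand-in for one stochastic forward pass of a BNN."""
    return min(1.0, max(0.0, rng.gauss(0.08, 0.02)))


rng = random.Random(0)  # fixed seed for reproducibility of the sketch
samples = [fake_stochastic_model(rng) for _ in range(25)]
mean, sd, (lower, upper) = summarize_predictive_samples(samples)
```

The resulting interval can then be compared against the 5% risk threshold to label the prediction as having high or low AI-BPM uncertainty.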

Definition of bacteremia.
The definition of bacteremia and the process of obtaining blood cultures are described in our previous study 15. In brief, bacteremia is defined as the growth of pathogenic bacteria (excluding common commensals defined by the National Healthcare Safety Network guideline) in at least one blood culture. For each set of blood cultures, 10 cc of blood was drawn from different venipuncture sites.

Study outcomes.
The primary outcome of this study was the AUC for prediction of bacteremia. The secondary outcomes were sensitivity and specificity for prediction of bacteremia. According to previous literature, blood cultures may not be necessary for patients with a predicted bacteremia probability of less than 5% or 10% 13,26. In our study, we analyzed the results of Phase 1 using risk thresholds of 5% and 10%. In other words, the estimated risk obtained from the output of the AI-BPM was binarized into positive or negative predictions according to the threshold of 5% or 10%. However, we found that the AI-BPM sensitivity for predicting bacteremia was less than 0.80 when the 10% threshold was used. This low sensitivity may not be acceptable, given that undetected bacteremia in the ED can be fatal. Therefore, we conducted the analysis of Phase 2 using a risk threshold of 5% only.
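The binarization and the resulting sensitivity/specificity trade-off can be illustrated with a short sketch. The probabilities and outcomes below are hypothetical, not study data, and counting a prediction equal to the threshold as positive is an assumption, since the text does not specify the boundary case.

```python
def sensitivity_specificity(probabilities, outcomes, threshold):
    """Binarize predicted risks at `threshold` and return (sensitivity, specificity).

    A prediction is positive when probability >= threshold (boundary handling
    is an assumption of this sketch). Outcomes are 1 (bacteremia) or 0.
    """
    tp = fn = tn = fp = 0
    for probability, outcome in zip(probabilities, outcomes):
        predicted_positive = probability >= threshold
        if outcome == 1:
            tp += predicted_positive
            fn += not predicted_positive
        else:
            fp += predicted_positive
            tn += not predicted_positive
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both outcome classes must be present")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted risks and culture results.
risks = [0.02, 0.04, 0.06, 0.08, 0.12, 0.30, 0.03, 0.09]
bacteremia = [0, 0, 0, 1, 1, 1, 0, 0]
sens_5, spec_5 = sensitivity_specificity(risks, bacteremia, 0.05)
sens_10, spec_10 = sensitivity_specificity(risks, bacteremia, 0.10)
```

In this toy data, raising the threshold from 5% to 10% misses the case with a predicted risk of 0.08, lowering sensitivity while raising specificity, which mirrors the reasoning for preferring the 5% threshold.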

Statistical analysis.
Categorical variables were reported as numbers and proportions, and the chi-square test was used for comparisons between groups. Continuous variables were reported as means and SDs, and the Student's t-test was used for comparisons between groups. A two-sided p-value less than 0.05 was considered statistically significant. All statistical analyses were performed using Python version 3.8.12 (Python Software Foundation, Wilmington, DE, USA) and R version 3.6.3 (RStudio, Boston, MA, USA).
In Phase 1, the discrimination performance of the AI-BPM in each dataset was assessed using AUC, sensitivity, specificity, positive predictive value, negative predictive value, and their CIs, which were obtained using DeLong's method 40. The calibration of the AI-BPM was assessed using the calibration plot. The global feature importance of the AI-BPM was obtained using mean absolute SHAP values 32. Additionally, we conducted an ablation study in which we assessed the discrimination performance of two additional models: one using structured data only and another using unstructured data only to predict bacteremia. The purpose of the ablation study was twofold: first, to evaluate the individual contribution of structured and unstructured data to the model's performance, and second, to account for scenarios where both types of data might not be available in some hospitals. The architectures of the models were slightly modified from the AI-BPM so that they would only utilize the layers corresponding to the type of data they were using (Supplementary Table 4).
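For readers less familiar with the primary outcome, the AUC equals the probability that a randomly chosen bacteremia case receives a higher predicted risk than a randomly chosen non-bacteremia case, with ties counted as one half. The sketch below computes this pairwise (Mann-Whitney) form directly on hypothetical data; it does not implement the DeLong confidence intervals used in the study, and the function name is ours.

```python
def auc_mann_whitney(scores, outcomes):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. This is a simple O(n_pos * n_neg) illustration.
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks and culture results; not study data.
risks = [0.02, 0.04, 0.06, 0.08, 0.12, 0.30, 0.03, 0.09]
bacteremia = [0, 0, 0, 1, 1, 1, 0, 0]
auc = auc_mann_whitney(risks, bacteremia)
```

Here 14 of the 15 positive-negative pairs are ordered correctly; the only misordered pair is the positive case at 0.08 against the negative case at 0.09.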
In Phase 2, the reviewing physician pre- and post-AI AUC, sensitivity, and specificity for predicting bacteremia were calculated and compared using the Obuchowski-Rockette method to account for the "multiple readers of multiple cases" design (https://cran.r-project.org/package=MRMCaov) 41,42. The average receiver operating characteristic curve from multiple reviewing physicians was presented 43. The physician pre- and

Figure 2. (a) Receiver operating characteristic curve and (b) calibration plot for the AI-BPM bacteremia prediction. The 95% confidence intervals are drawn as error bars at each point of the calibration plot.

Figure 3. (a) Receiver operating characteristic curve and (b) calibration plot for the AI-BPM, pre-AI, and post-AI bacteremia prediction. The total number of case reviews is 2,000 since each of the 1,000 cases is reviewed by two different physicians. The 95% confidence intervals are drawn as error bars at each point of the calibration plot. AI, artificial intelligence.

Figure 4. Sankey diagrams illustrating the change in physician predictions according to the physician confidence level and AI-BPM prediction result. The widths of the links are proportional to the number of case reviews corresponding to the link. Case reviews with a pre-AI prediction of low-high probability are shown in red, while case reviews with a pre-AI prediction of very low probability are shown in blue.

Table 2. Discrimination performance of the AI-BPM for predicting bacteremia. AUC, area under the receiver operating characteristic curve; CI, confidence interval; PPV, positive predictive value; NPV, negative predictive value.