Predicting bloodstream infection outcome using machine learning

Bloodstream infections (BSI) are a main cause of infectious disease morbidity and mortality worldwide. Early prediction of BSI patients at high risk of poor outcomes is important for earlier decision making and effective patient stratification. We developed electronic medical record-based machine learning models that predict patient outcomes of BSI. The area under the receiver-operating characteristics curve was 0.82 for a full featured inclusive model, and 0.81 for a compact model using only 25 features. Our models were trained using electronic medical records that include demographics, blood tests, and the medical and diagnosis history of 7889 hospitalized patients diagnosed with BSI. Among the implications of this work is implementation of the models as a basis for selective rapid microbiological identification, toward earlier administration of appropriate antibiotic therapy. Additionally, our models may help reduce the development of BSI and its associated adverse health outcomes and complications.

Bloodstream infections (BSI) can lead to prolonged hospital stays, and life-threatening and aggressive complications, in addition to high costs to health care systems [1][2][3][4] . Increasing rates of antimicrobial-resistant pathogens, particularly gram-negative bacteria, limit treatment options; this often prompts empirical use of broad-range antibiotics 5 . Therefore, timely and critical assessment of available microbiology results are necessary to ensure that individuals with BSI receive prompt, effective, and targeted treatment, for optimal clinical outcomes 5 . However, the current standard-of-care, which mostly depends on blood culture-based diagnosis, is often extremely slow 6 .
Sepsis is a life-threatening medical condition, defined as the body's systemic immunological response to an infectious process, which may cause end-stage organ dysfunction and eventually death 7 . Several studies have utilized electronic medical records (EMRs) to construct prediction models for mortality from sepsis 8,9 and sepsis onset [10][11][12][13] . From a clinical perspective, these types of early warning systems may be useful in detecting patients at risk of BSI, and typically provide information at a certain point in time (for example, preoperatively 14 ) or in a certain time-window before deterioration (see review in 15 ). Identifying a patient at risk can trigger early goaldirected therapy regarding confirmation of infection, administration of antimicrobial therapy, and transition to the intensive care unit (ICU).
In contrast to studies that aimed to detect patients at risk of BSI, the current study focused on patients with confirmed infections. Based on EMRs of patients hospitalized with positive blood cultures, we constructed machine learning models that predict poor outcomes of hospitalized patients with BSI, which was confirmed by identifying a bacterial morphology on direct gram stain from a positive blood culture ( Fig. 1; see "Methods"). Our prediction occurs at a certain point of time: just after confirming a BSI, yet well before performing specific pathogen identification. At this point of time, early goal-directed clinical practice can be diverted to patients at risk for poor clinical outcomes. As part of that, the prediction can facilitate the use of costly diagnostic procedures, such as the use of rapid microbiological identification techniques that are costly and not widely available instead of the traditional lengthy methods 16

Results
The inclusive model. The characteristics of the population used for the training and testing of the inclusive model are described in Table 1 (see "Methods"). From 7889 adults with a BSI, 2590 (32.8%) positive composite outcomes were recorded. The contribution of each feature of the inclusive model to predict the composite outcome was measured by SHaply Additive exPlanations (SHAP) scores, for each patient (Fig. 2). The predictive contribution of missingness (gray points) was also assessed. Accordingly, a missing value of a feature (e.g., albumin) serves as a signal regarding the patient's risk. SHAP values of three selected variables-age, monocyte %, albumin-are presented in Fig. 2b-d.
The main features that influenced model prediction of a poor outcome included: low albumin, high red cell distribution width (RDW), and high creatinine. RDW and albumin have been associated with mortality, and have been used as prognostic markers in a number of studies [17][18][19][20][21][22][23][24] . This ranking shown partly in Fig. 2 contributed to the creation of the compact model described in the following subsection. The top 20 features ranked by the SHAP scores were also calculated (see Supplementary Fig. S1).
Performance of the inclusive prediction model on the test set showed area under the receiver-operating characteristics curve (auROC) of 0.82 (95% confidence interval (CI): 0.80-0.845), which indicates good discrimination; and an area under the precision-recall curve (auPRC) of 0.65 (95% CI 0.61-0.70) (Fig. 3). The calibration plot, which runs very close to the diagonal, shows excellent calibration (Fig. 3c).
The calibrated model was also tested on two subsets of the future test set, which comprised only patients hospitalized in the ICU (N = 91), or patients tested in the ER (N = 268), and where the prevalence of the composite adverse outcome was 56.7% and 27.75%, respectively. Results for these subsets showed an auROC curve of 0.72 (95% CI 0.614-0.812) for ICU and 0.82 (95% CI 0.758-0.889) for ER (see full results in Supplementary Figs. S2 and S3).
The Charlson Comorbidity Index (CCI) was found as a predictor of poor outcome (Fig. 2a). CCI is a wellknown mortality predictor in various medical situations 25,26 . Thus, we compared the ROC curve we receive from CCI, and from the inclusive and compact model. auROC 0.62 (95% CI 0.585-0.648), which shows that www.nature.com/scientificreports/ the inclusive model is much better at predicting a poor outcome compared to CCI. ROC curves can be found in Supplementary Fig. S4a.
The compact model. The numerous EMR features, more than 600, incorporated in the inclusive model render its external validation and the reproducibility of results very difficult. This limits its applicability to other hospitals. Hence, we established a simpler and more compact prediction model that incorporates the features with the greatest impact on outcome, and that are most commonly listed in EMR datasets from other hospitals.
To this end, we trained and evaluated the performance of a compact model with only 25 features, including simple demographic information, a single blood test, and a brief medical history (Fig. 4d). We used the same cohort of patients of the training and test sets from the previous analysis (see the inclusive model). The compact model achieves an auROC of 0.81 (95% CI 0.78-0.83) (Fig. 4a), and an auPRC of 0.63 (95% CI 0.58-0.68) (Fig. 4b), which are only slightly lower than the values of the inclusive model, with good calibration (Fig. 4c). SHAP scores are shown in Fig. 5. ROC curves comparing compact model performance and CCI predictions can be found in Supplementary Fig. S4b.
The code to test this model is available in our GitHub repository.

Discussion
In this study, we examined the ability to utilize EMRs to predict a composite poor outcome of patients with BSI, which may promote both rapid interventions and patient stratification. Several scoring systems have been developed in recent years for stratifying the risk of patients with sepsis 8,10-13,15 , but not for outcomes of patients with BSI. In addition, none are commonly used in routine practice nor recommended according to current guidelines. www.nature.com/scientificreports/ Our results show that EMRs can be used to produce accurate predictions of BSI outcomes. Our retrospective analysis demonstrates the feasibility of accurate prediction of BSI outcomes using data available in EMRs. In the inclusive model, the prediction was after a gram-stain, and yielded an auROC of 0.82, and an auPRC of 0.65. In the compact model, the prediction was based on only 25 features available at the time of culture, not including gram-stain results; and yielded an auROC of 0.81, and an auPRC of 0.63. Notably, these models outperformed CCI, which is the standard scoring system for mortality. The high performance of machine learning models demonstrates their potential in contributing to infection management by providing accurate information at very early stages of the infection [27][28][29] . Recent studies have shown machine learning models to outperform standard scoring systems 27,30,31 . In line with these findings, our results show enhanced performance in comparison to the CCI score. It should be noted that CCI predicts mortality, rather than composite poor outcome used in the current study. While our clinical outcome was composed of several variables, it relies heavily on mortality, as 68% of all adverse outcome patients had shown short-term mortality. The substantial improvement in prediction ability (auROC of 0.62 vs. 0.83 for the full model and auROC of 0.62 vs. 0.81 for the compact model) demonstrates the advantage in using machine learning generated scores over the standard conventional scores.
In addition to the well-established risk factors for complications from BSI, such as age and previous infections, our analysis revealed less-known factors as highly predictive of a poor outcome from BSIs. The main factors that were identified as increasing risk included: red cell distribution width, albumin, and creatinine. Red cell distribution width and albumin have been associated with mortality, and have been used as prognostic markers in a number of studies [17][18][19][20][21][22][23][24] . The Charlson Comorbidity Index has also been found to predict mortality in various medical situations 25,26 . However, other factors revealed as central by our analysis, such as serum creatinine values and monocyte counts, are less well recognized and used as predictors of a poor outcome in patients with BSI.
Although maximal model explainability requires using the patient's entire EMR, we demonstrated that a subset of features, available from only simple demographic information and a single blood test, enables accurate prediction with only a slight decrease in auROC, from 0.82 in the inclusive model to 0.81 in the compact model. This may enable accurate BSI outcome estimation by embedded systems in emergency departments of hospitals. www.nature.com/scientificreports/ Our work has several clinical applications. For instance, it can help select individuals at high-risk for BSI, for whom hospitalization in the ICU, treatment with broad-spectrum antibiotics, or effective early-stage rapid microbiological identification should be considered.
The benefit of early microbiological identification for patient outcomes has been well described in the medical literature 5,32 and the time-to-administration of appropriate antibiotics to treat BSI is an important predictor of outcomes 33 . The current study therefore paves the route for future randomized control trials to further study the effectiveness of implementing a model for early prediction of BSI outcomes, possible preventive interventions, and more efficient selection of patients for advanced microbiological diagnostic testing, thus reducing the time-to-administration of appropriate antibiotics.
Our study has several limitations. First, our prediction model is based on retrospective EMR data, which have inherent biases and are influenced by the interaction of the patient with the health system 34 . However, these biases are partially mitigated in this study, since the data contain information originating from a public hospital serving a very large population, and since the outcome of the model is based on information that is accurately and comprehensively documented in the EMRs. Another limitation of the study is that we assessed BSI outcomes only in patients who already have a confirmed BSI (having a positive blood culture). This makes it more difficult to generalize to other patient cohorts. Moreover, the models presented in the current study were designed for the purpose of alerting the clinical staff about a patient's potential risk from BSI. To perform this at an early stage, we relied on data available up to the blood culture collection and did not consider the final microbiological results or treatment effects. As these factors also effect the patients' clinical outcome, they should also be incorporated into future studies to complete the picture. Finally, the predictor was trained and validated on EMRs from Tel www.nature.com/scientificreports/ Aviv Sourasky Medical Center (TASMC), composed of patients from and around Tel Aviv, Israel. Nonetheless, TASMC is a public hospital, and the medical system in Israel is accessible to the entire population.
The applicability of the model to other hospital populations needs to be shown. However, the large size of the data, and the comprehensive validation process and its result, namely, validation of the utility of established risk factors for a poor BSI outcome, all support the model's generalizability to other hospitals. Given the additional complexities introduced by the machine learning algorithms, we sought to ensure that sufficient information would be provided to enable our model to undergo external validation 35,36 . This drove us to develop the compact model, which is based on the features with the greatest influence on the overall prediction, and that are easily accessible in EMR datasets from other hospitals. This compact model achieved only a slightly reduced auROC of 0.81. We made the compact model available in our GitHub repository (see code availability) and we encourage researchers with similar data from other hospitals to test it.
In conclusion, our work demonstrates that accurate and calibrated predictions of BSI outcomes early in a hospital admission can be achieved. Earlier and better characterization of patients with BSI could potentially reduce the development of BSI and its associated adverse health outcomes and complications. Our predictive model could become the basis of selective, rapid microbiological identification, and contribute to various decisions such as ICU hospitalization and administration of broad-spectrum antibiotics. Future prospective studies, as well as those on populations from other hospitals, are needed to evaluate the clinical impact of the model.

Methods
Study design, population, and definition of outcome. The study was designed as a retrospective cohort study for the development and validation of a clinical prediction model. The study was performed at the TASMC, a 1500-bed public tertiary care center, and the only general hospital serving the population of Tel-Aviv, the most populous city in Israel, of all socioeconomic backgrounds. Data processing, model training and analyses were performed at the TASMC Data Science Department and the Faculty of Medicine at Tel Aviv University.
The study included EMRs of adults hospitalized with a positive blood culture (bacterial only) in the period between January 2014 and January 2020. The year 2014 was determined as the starting point since frequent changes in variable identifiers occurred in the preceding years. Patients' EMRs included demographics, laboratory test results, previous diagnoses recorded at TASMC, recorded medical history, and initial gram-stain morphology of positive blood cultures that are reported by phone.
The models were developed according to features extracted from various modalities available in EMRs of patients hospitalized with a positive blood culture at the TASMC, and predicted a composite poor outcome, defined as at least one of the following: • Short term in-hospital mortality within 10 days of a culture.
• Mechanical ventilation in the 10 days after the culture.
The study flow chart is presented in Fig. 1. Exclusion criteria were: the absence of laboratory data of medical history information, age younger than 18 years or older than 100 years, and patients' explicit objection to the use of their medical data for research purposes. Before initiating any analysis, the study population was divided into a training-validation set from years 2014-2018 (included 6434 admissions, of whom 2203 had a composite poor outcome) and a test set of admissions from 2019 and the first month of 2020 (included 1455 admissions, of whom 387 had a composite poor outcome). The training-validation set was further divided to training and validation sets at a ratio of 4:1. The model was also tested on two subsets of the test set, which comprised only patients from ICUs or only patients from the emergency room. Both these subsets posed a high challenge to the model regarding its generalization. The prospective test cross-sections were performed to emulate the model's use in practice and in real world situations.
Variable and feature selection. To evaluate whether EMR-derived information might accurately predict outcomes of patients with BSI, we compiled a set of 606 features. All these features were available at the time a blood sample was sent for culture, except for the gram-stain information (used only in the inclusive model). With all these features, we trained a gradient-boosting model, the inclusive model, to predict the probability that each held-out sample (patients not included in the training set) would have a poor outcome. Distributions of the various modalities of the features used are depicted in Fig. 1b. In addition, we trained and tested a compact model, comprising 25 features, for application on EMRs from other hospitals.
Each of the 606 features was assigned a category. For more comprehensive representation, some features within a category were combined, such that all the features could be represented using 96 variables. These variables were used to create the pie chart in Fig. 1 • Diagnoses history is recorded in TASMC as ICD-9 codes. The ICD-9 code hierarchy was used to group these features into 33 variables represented by ICD-9 codes of higher hierarchy.
Analysis platform. All computational analyses were performed on a secure compute cluster environment located at TASMC. Python 3, with numpy, pandas, and scikit-learn formed the backbone of the data-processing pipeline.
Development of the models. Predictions were generated using a gradient-boosting machine model built with decision-tree base-learners 37 . Such models have demonstrated efficacy in prediction, using tabular data 38 , and have been incorporated in several successful algorithms in the field of machine learning 39 . We implemented the gradient-boosting predictor trained with the LightGBM 40 Python package. LightGBM has shown effectiveness on clinical and patient tabular data in particular, and was adopted by many recently published models [41][42][43][44][45][46] .
Missing values were inherently handled by the LightGBM predictor 40,47,48 . The validation set was used for early stopping 49 , with auROC as the performance measure. Hyperparameters were chosen after a cross-validated grid Figure 6. Calibration plots of the observed outcome against the predicted probabilities. The diagonal gray line represents perfect calibration. A smoothed line is fit to the curve, and points are drawn to represent the mean values in ten discretized bins. Blue, orange, and green lines correspond to the original uncalibrated model, the model after sigmoid calibration, and after isotonic calibration, respectively. www.nature.com/scientificreports/ search while training the inclusive model, and were also used to train the compact model. Hyperparameters are available at our GitHub repository. Two LightGBM classifiers with differing complexity were developed: the inclusive and compact models. The compact model is available at our GitHub repository (see Code Availability).
The mechanistic basis of the models. To identify the principal features driving model prediction, SHAP values 50 were calculated. These values are suited for complex models such as artificial neural networks and gradient-boosting machines 51 . Originating in game theory, SHAP values partition the prediction result of every sample into the contribution of each constituent feature value. This is done by estimating differences between models with subsets of the feature space. By averaging across samples, SHAP values estimate the contribution of each feature to overall model predictions. A higher value indicates that a feature has a larger impact on the model, which indicates that the feature is more important.
Calibration of the models. We analyzed the calibration (observed risk versus raw prediction score) of our proposed inclusive and compact models. The raw prediction scores produced by the machine learning model (LightGBM) were calibrated and evaluated on the test-set. We used isotonic regression, which fits a rankpreserving transformation between the original scores and transformed scores; and minimizes the deviation between the target label and the prediction score. We used the scikit-learn library (version 0.20.0) for fitting the isotonic regression model. Ten prediction score bins were used, with regular spacing between the minimum/ maximum prediction score produced by a model. We noticed that the raw scores were already well-calibrated. Curves for isotonic and sigmoid calibrations, as well as for the raw model are presented in Fig. 6.
Evaluation of the models. The models were scored on the test set using the auROC. In addition, plots of positive predictive value (PPV) against sensitivity (precision-recall curve) were drawn across different thresholds. For all the thresholds from all the ROC curves, metrics were calculated, including sensitivity, specificity, PPV, negative predictive value, false positive rate, false negative rate, false discovery rate, and overall accuracy (Supplementary Dataset 1). Confidence intervals for the various performance measures were derived through resampling, using the bootstrap percentile method 52 with 1000 repetitions.

Ethics declarations.
This study (TLV-0684-18) was approved by the TASMC Institutional Review Board (IRB). All methods were performed in accordance with the IRB policy, guidelines, and regulations. As this is retrospective study, informed consent was waived by the Tel Aviv Sourasky Medical Center IRB, as all identifying details of the participants were removed before any computational analysis.

Data availability
The data that support the findings of this study are from TASMC. Access restrictions apply to these data and they are therefore not publicly available. Due to these restrictions, these data can be accessed only by request to TASMC or the authors.