Artificial intelligence in sepsis early prediction and diagnosis using unstructured data in healthcare

Sepsis is a leading cause of death in hospitals. Early prediction and diagnosis of sepsis, which is critical in reducing mortality, is challenging as many of its signs and symptoms are similar to other less critical conditions. We develop an artificial intelligence algorithm, SERA algorithm, which uses both structured data and unstructured clinical notes to predict and diagnose sepsis. We test this algorithm with independent, clinical notes and achieve high predictive accuracy 12 hours before the onset of sepsis (AUC 0.94, sensitivity 0.87 and specificity 0.87). We compare the SERA algorithm against physician predictions and show the algorithm’s potential to increase the early detection of sepsis by up to 32% and reduce false positives by up to 17%. Mining unstructured clinical notes is shown to improve the algorithm’s accuracy compared to using only clinical measures for early warning 12 to 48 hours before the onset of sepsis.

substantially better than this comparison paper (0.86 vs. 0.92), which is considerable, and would make a good discussion point, and in fact these performance characteristics persist even 12 hours prior.
In the Methods section, under Data Sample, how was random sampling performed? Was the unit of randomization performed at the level of the note, the patient visit, or the patient? Also, was there any overlap in patients between the test set and the training/validation set?
In the methods section, it would be helpful if the ICD-10 codes for cohort selection were explicitly mentioned.
Methods: processing of clinical notes: Please cite the implementation (software package) that was used, as well as for your prediction model.
The paper is missing a demographics table to describe the patient population. For example, what is the incidence of sepsis, severe sepsis, septic shock? How many are admitted to the ICU? What is the age distribution?
What was the class imbalance of your prediction and how did you account for it? It is unclear from the manuscript if a balanced dataset was created by randomly undersampling the non-sepsis cohort, or if the class imbalance was dealt with in some other manner during training.
It would be helpful if a standard CONSORT enrollment diagram was included as a figure, potentially to replace Figure 1.
It would also be helpful if one included a reliability diagram (calibration diagram) as well as a precision-recall diagram for a representative model to better understand calibration as well as the trade-offs between precision and recall for choosing a decision threshold.
Thank you for pointing out these studies that examined sepsis diagnosis and prediction. In this revision, we have incorporated them in our literature review. Specifically, we have cited and acknowledged the strengths of the sepsis detection algorithm of Liu, Greenstein, Sarma, and Winslow (2019). As per the Senior Editor's request to discuss how our work compares with Liu et al.'s (2019) sepsis detection and prediction algorithm, we would like to highlight the following key points:

Strength and robustness of our prediction algorithm
• Our algorithm provides earlier prediction of sepsis, up to 24 hours before onset. Whereas Liu et al.'s (2019) algorithm predicts sepsis 7 hours prior to onset (AUC: 0.92, sensitivity: 0.84, specificity: 0.82), our algorithm's early sepsis prediction is effective up to 24 hours before onset. Furthermore, our algorithm's results at 12 hours before onset are comparable to those of Liu et al.'s (2019) algorithm at 7 hours (i.e., our 12-hour AUC: 0.94, sensitivity: 0.87, and specificity: 0.87). Given that studies have found that a one-hour delay in antimicrobial administration for sepsis patients is associated with a decrease in survival of 7.6% (Kumar et al., 2006), the ability to provide a sepsis alert an additional five hours ahead of onset would substantially increase a patient's chance of survival.
• Our algorithm works at the prevalence level of a natural clinical setting. The dataset we used to develop the model was extracted from a natural clinical environment. We tested the algorithm on a dataset where the prevalence of sepsis is low (only 6.15% of all the patients in the sample have sepsis). This level of prevalence is equivalent to the level typically observed in hospitals, which we confirmed against Rhee et al. (2017). Other machine learning classifiers have been developed in low-prevalence environments, e.g., oral cancer detection (Carnielli et al., 2018) and cell identification/classification (Rennie, Dalby, van Duin, & Andersson, 2018; Xia et al., 2020). These studies are similar to ours in that they also face a low prevalence of positive cases. As seen from Table 1, we were able to achieve similar diagnostic statistics under such conditions.

We compared our prediction algorithm with human physicians
• Comparing the algorithm's performance with human physicians. In our study, we compared the performance of our algorithm against human physicians' early detection of sepsis cases. We found that our algorithm outperforms physicians in early sepsis detection up to 48 hours ahead of sepsis onset. Although the algorithm by Liu et al. (2019) achieves a high AUC at 7 hours ahead of sepsis onset, the authors did not report any attempt to compare their algorithm with human physicians.

Our prediction algorithm uses a more stable NLP technique
• Stability of Natural Language Processing (NLP): words vs. topics. Liu et al. (2019) employed a text modeling technique that extracts commonly used words in the clinical notes as predictors of sepsis. In our paper, we extend this technique by aggregating topics from the words extracted from clinical notes. Each topic is characterized by a collection of words that cluster around a common theme. We then employed these topics as predictors of sepsis. This NLP technique is better in the following ways:
o Stability over time. Lexicographical topics are more stable than individual words (Blei, 2012; Wallach, 2006). Synonyms or phrases that carry similar meanings can be substituted to characterise a particular phenomenon that falls within the same topic. Individual words, on the other hand, have narrower meanings and are unable to capture the use of synonyms. To show that our topics are stable over time, we tested our model on a test sample that included patients who were admitted at a later time.
o More accurate and generalizable. When classifiers built using topic features are compared with those built using word features, the former are found to be more accurate (Blei, 2012; Blei, Ng, & Jordan, 2003; Wallach, 2006). This is because individual writers (e.g., clinicians) have idiosyncratic writing styles that may influence their choice of words. By using topics to extract and process notes, we can mitigate some of the idiosyncratic writing styles and word choices of different writers. As a result, we believe this technique provides an NLP structure (topics) that is more generalizable across contexts and domains (e.g., different hospitals or specialties). We describe how this process can be operationalized in a clinical context in our response below.

Based on the title of the paper "role of unstructured data", I would have expected the authors to describe the actual role of clinical notes in diagnosis/prediction performance, in comparison to using structured data only. However, the results presented pertain only to processing of both types of data together. Thus, it is not known if clinical notes play a role in increasing predictive performance.
Thank you for pointing out this fact. In this revision, we have provided a comparison between the model that used only structured variables (e.g., vitals) and the model that used both structured variables and clinical notes. From Table 2, we can see that the model that used both structured variables and clinical notes is significantly more accurate than the structured-variables-only model for predictions made more than 12 hours prior to the onset of sepsis. For time periods of less than 12 hours, the structured variables provide relatively good prediction, as seen in prior literature, but in this research we show that unstructured clinical notes play a more important role in predicting sepsis more than 12 hours prior to onset.

I found the method of labelling sepsis prediction quite unusual and a potential source of significant confusion: "For each patient encounter, when a physician suspects sepsis, she will at least request a culture test and lactate test. Thus, when the physician orders for both tests, we classify the patient as one predicted to have sepsis by the physician,". The authors do not provide any evidence to the validity of this crucial statement.
Thank you for this clarification question. We agree that our statement was unclear and thus may have created some confusion for the reader. We would like to clarify that:
• First, our classification of sepsis vs. non-sepsis cases is based on the ICD-10 classification. The ICD-10 classification is used for the training/validation of the model as well as for the verification of the test results. The list of ICD-10 codes is now presented in the paper as requested by R2.
• Second, the statement quoted above refers to the way we measured the event where a hospital physician suspects a patient has sepsis (i.e., the physician's prediction of sepsis). We measured such events to compare the performance of our early prediction algorithm with physicians' performance in predicting sepsis. This measure was NOT used to train/validate or test the model in any way. The measure was simply used to determine human physicians' performance in predicting sepsis.
• Third, the two criteria described here to determine the physician's prediction of sepsis (i.e., a request for a lactate test and a culture test) are based on international guidelines for sepsis management and are part of the hospital's operating procedures (see below).
o Based on the International Guidelines for Management of Sepsis and Septic Shock: 2016 (Rhodes et al., 2017), as part of the guidelines for initial resuscitation, physicians are required to normalize lactate in patients with elevated lactate levels as a marker of tissue hypoperfusion. As part of diagnosis, the international guidelines also "recommend that appropriate routine microbiologic cultures (including blood) be obtained before starting antimicrobial therapy in patients with suspected sepsis or septic shock if doing so results in no substantial delay in the start of antimicrobials (BPS)." (Rhodes et al., 2017, p. 312). As such, culture and lactate tests are among the first tests to be conducted when a patient is suspected to have sepsis.
o We verified with the hospital management that the standard operating procedure when a physician suspects a patient has sepsis is to request at least one culture test together with a lactate test.

The dataset is highly imbalanced; thus, ROC should not be the only performance metric reported. Authors should at least provide PPV and NPV. Furthermore, it is unclear how well calibrated the model is; thus, calibration curves should also be provided.
Thank you for the suggestion. In this revision, we have provided PPV, NPV, and the calibration curves. As noted earlier, our dataset is imbalanced and as suggested by the Senior Editor and R2 in this revision, we used SMOTE to oversample the positive cases to develop a more balanced dataset while training the model. A few points to highlight in this revision.
The unit of analysis for our prediction model is each single clinical note entry (this is the same as in our initial submission). We define this as the unit of analysis because every time the physician assesses the patient and enters a clinical note, she is making a clinical judgment. Hence, this unit of analysis is the most realistic in a clinical setting, and any sepsis alert should be presented at this point in time.
In clinical settings, sepsis naturally has a low occurrence, with an incidence rate of about 2% per year and a prevalence of about 6% (Rhee et al., 2017). As PPV is directly related to the prevalence of sepsis (see equation 1 below), we expected low PPV values given the low prevalence in our data sample. To create a balanced sample (i.e., higher prevalence of sepsis), some machine learning studies undersample the non-sepsis cases (Liu et al., 2019). But to build predictive models for classification tasks in a medical context, some researchers have argued that oversampling (instead of undersampling) can result in more accurate models (Batista, Prati, & Monard, 2004; Carnielli et al., 2018; Chawla, Bowyer, Hall, & Kegelmeyer, 2002). This method is used in studies that develop machine learning classifiers in low-prevalence environments, e.g., oral cancer detection (Carnielli et al., 2018) and cell identification/classification (Rennie et al., 2018; Xia et al., 2020). As such, given these studies as well as the fact that undersampling is not a viable option given the naturally low prevalence of sepsis, we chose to oversample the sepsis cases using SMOTE (Synthetic Minority Over-sampling Technique).
The tables below show the diagnostics of the models (AUC, sensitivity, specificity, PPV, and NPV with the corresponding prevalence value). To check against overfitting, which is a criticism of oversampling, we also report our models without SMOTE for comparison. We are glad to report that, other than PPV, the AUC, sensitivity, and specificity are equivalent for both oversampled and non-oversampled models.
Models that were run in natural environments of low prevalence without any oversampling are labelled as "Original data" and models with higher prevalence achieved by oversampling are labelled as "Smote #%" where # represents the extent of SMOTE. For example, SMOTE to 10% represents oversampling the sepsis cases up to the point where the sepsis cases make up 10% of the overall sample. As prior literature suggests a "SMOTE to 50%" approach, we provide, as a robustness check, five different levels of SMOTE for the early prediction models presented below.
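To make the "SMOTE to #%" convention concrete, the short sketch below computes how many synthetic sepsis cases would have to be added to reach a target share of the overall sample. It is an illustration only: the function name and the example counts (100 sepsis notes, 9,900 non-sepsis notes) are hypothetical and are not taken from the manuscript.

```python
from fractions import Fraction
from math import ceil


def synthetic_cases_needed(n_sepsis, n_non_sepsis, target_percent):
    """Number of synthetic sepsis cases needed so that sepsis cases make up
    `target_percent` % of the oversampled sample.

    Solves (n_sepsis + s) / (n_sepsis + n_non_sepsis + s) = target for s.
    Exact fractions are used to avoid floating-point rounding in ceil().
    """
    if not 0 < target_percent < 100:
        raise ValueError("target_percent must be between 0 and 100")
    target = Fraction(target_percent, 100)
    total = n_sepsis + n_non_sepsis
    shortfall = target * total - n_sepsis
    if shortfall <= 0:
        return 0  # the sample already meets or exceeds the target share
    return ceil(shortfall / (1 - target))


# Hypothetical counts: 100 sepsis notes among 10,000 notes (1% prevalence).
for percent in (10, 50):
    extra = synthetic_cases_needed(100, 9900, percent)
    print(f"SMOTE to {percent}%: add {extra} synthetic sepsis cases")
```

With these hypothetical counts, reaching 10% requires 1,000 synthetic cases (1,100 of 11,000), and reaching 50% requires 9,800 (9,900 of 19,800), which shows how quickly the synthetic share grows at higher targets.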
The unit of analysis for the algorithm is each clinical note entry by the physician. As such, the prevalence figures presented in Table 3a to 3f below are computed at the clinical note level and not at the patient-encounter level. The original data prevalence figures (at the clinical note level) vary due to differences in time windows. Although the number of sepsis cases remains the same for the test sample across different time windows, the number of sepsis notes reduces with shorter time windows. Since the number of non-sepsis cases (and notes) remains the same in the test sample, this eventually leads to a reduction in prevalence with shorter time windows.

When developing the model, cross-validation is a much more robust method to avoid overfitting than the random split used by the authors.
Thank you for your suggestion. In this revision, we used 10-fold cross validation modelling to prevent overfitting. The results obtained were similar to our initial submission.

Given the above considerations and the amount of manual work required in annotation of LDA output, I find it difficult to see how this algorithm may enter clinical practice as outlined in the discussion section.
Thank you for highlighting this point. We agree with you that the practical application of this algorithm is critical to its usefulness.
1. Before we discuss the practical use of the algorithm, we would like to first clarify how the LDA output is built and used. Although training the topic library might be time-intensive, this is only performed during the initial development of the topic library. Once the topic library is developed, it will be deployed to score new clinical text. As such, the topic library is constructed only once, at the beginning of the project. As shown in the testing of our model, the topic library can effectively predict sepsis cases using clinical notes that were entered six months later. Subsequently, we only need to update the topic library periodically, which can be done using an automated workflow routine (e.g., using SAS Enterprise Miner).
2. We propose the following steps to run the SERA algorithm in a clinical setting for a patient:
a. Clinical note scoring process: scoring a new clinical text in a clinical setting involves three steps:
i. Parsing: tokenization, lemmatisation, and POS tagging
ii. Filtering: to weight terms
iii. Topic assignment (based on the existing topic library that had been developed earlier)
The computation time to process and score the text is relatively short. To illustrate, we use a test-case patient with a long clinical note of 1,806 words (the median length of a clinical note in our sample is 840 words). The clinical note scoring process takes about 0.17 secs in SAS Enterprise Miner 14.1 on an Intel i7 processor (2.7 GHz, 16.0 GB RAM).
b. SERA algorithm scoring process: after the clinical text is processed and scored, we combine that with the structured variables from the EMR system and predict the likelihood of sepsis using the SERA algorithm. Here, the estimated processing time for all inputs on an Intel i7 processor (2.7 GHz, 16.0 GB RAM) is about 0.01 secs.
Together, the total time to fit a new patient's data to the SERA algorithm is about 0.18 secs from the moment the data is made available in the system.
3. The SERA algorithm can work in two different modes within the clinical environment.
a. Background mode: In this first mode, the algorithm is designed to run in the background. Specifically, it is configured to run at key events using the latest available clinical data for each patient, e.g., during ward shift handovers. If the risk score exceeds the designated cut-off level, the physician will be alerted via the EMR. Alternatively, if more computing resources are available, hospitals can choose to run it at fixed hourly intervals. For a large 500-bed hospital, assuming the algorithm scores the cases one at a time, it will take approximately 90 secs to score all 500 patients. This approach ensures an ongoing, regular, time-based sepsis risk assessment for patients within the hospital. (See Figure R2 for the workflow for this mode.)
b. Ad hoc mode: In this second mode, the algorithm is designed to run immediately after a physician submits her clinical notes in the EMR system. In this case, the SERA algorithm is run in an ad hoc manner, since the score is only computed after a physician has updated the patient's status. The algorithm's score then acts as decision support to flag suspected sepsis cases. As observed in our study, the SERA algorithm outperforms physicians in early prediction of sepsis and thus may be an important early warning indicator for physicians. (See Figure R3 for the workflow.)
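The timing figures quoted above can be checked with simple arithmetic. The sketch below is illustrative only: the per-patient timings are the ones stated in this response, while the function names, the example risk scores, and the cut-off value of 0.5 are hypothetical placeholders rather than part of the SERA implementation.

```python
# Per-patient timings quoted in the response (seconds).
NOTE_SCORING_SECS = 0.17   # clinical note scoring
SERA_SCORING_SECS = 0.01   # SERA algorithm scoring


def batch_scoring_secs(n_patients):
    """Total time to score n_patients one at a time, in seconds."""
    per_patient = NOTE_SCORING_SECS + SERA_SCORING_SECS
    return round(n_patients * per_patient, 2)


def patients_to_alert(risk_scores, cutoff):
    """Return the IDs of patients whose risk score exceeds the cut-off.

    `risk_scores` maps a patient ID to a score in [0, 1]; both the mapping
    and the cut-off are assumptions made for this illustration.
    """
    return [pid for pid, score in risk_scores.items() if score > cutoff]


print(batch_scoring_secs(1))    # 0.18 secs per patient
print(batch_scoring_secs(500))  # 90.0 secs for a 500-bed hospital
print(patients_to_alert({"A": 0.91, "B": 0.12, "C": 0.55}, cutoff=0.5))
```

Sequential scoring is the most conservative assumption; scoring patients in parallel would shorten the background run further.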

Thank you for the opportunity to review this interesting paper. I only have some minor suggestions that I think would help clarify the manuscript for the reader.
The background is well motivated. This reviewer whole-heartedly agrees in the use of ML for real-time surveillance, specifically in the area of workflow augmentation for applications such as decreasing variability in care, as the authors have eloquently stated in their introduction.

The authors might consider citing this article in the background https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5383046/ which is directly applicable to this work, and which also shows that using unstructured data, in addition to structured data, substantially improves this prediction task. More importantly, this manuscript reports performance metrics substantially better than this comparison paper (0.86 vs. 0.92), which is considerable, and would make a good discussion point, and in fact these performance characteristics persist even 12 hours prior.
Thank you for taking the time to review our paper and pointing out the reference (Horng et al., 2017). We have cited it in this revision and incorporated the points, where relevant, in our revision. We hope that in this revision, we have resolved the issues you raised.

In the Methods section, under Data Sample, how was random sampling performed? Was the unit of randomization performed at the level of the note, the patient visit, or the patient? Also, was there any overlap in patients between the test set and the training/validation set?
Sampling was performed at the patient visit level. The sample period is from 1st April 2015 to 31st December 2017 (first patient record 2nd April 2015). All sepsis patients (based on ICD-10 classification) were included in the dataset and we randomly selected non-sepsis patients to make up the rest of the sample. Given that the training/validation dataset is from an earlier set of patients and the test set is from a later set of patients, we had an overlap of 1 sepsis patient (out of 327 encounters) and 241 non-sepsis patients (out of 4990 encounters). It is important to note that while the patients are the same, the notes were for different hospitalization encounters.
We have included a modified STROBE/CONSORT diagram as requested in this review for easier representation as well.

In the methods section, it would be helpful if the ICD-10 codes for cohort selection were explicitly mentioned.
ICD-10 codes used were:
The text mining procedure was conducted using SAS Enterprise Miner 14.1 and the ensemble machine learning was conducted using KNIME Analytics Platform (version 4.1.6).

What was the class imbalance of your prediction and how did you account for it? It is unclear from the manuscript if a balanced dataset was created by randomly under sampling the non-sepsis cohort, or if the class imbalance was dealt with in some other manner during training.
The original dataset is imbalanced as it consists of data extracted from a single hospital over a 2.5-year period. We sampled the data so that the prevalence of the cases would be similar to the natural prevalence in clinical settings. We selected all sepsis patients (cases) and randomly selected non-sepsis patients (controls) and arrived at a patient-visit-level prevalence of 6.15%. This level of prevalence is equivalent to the natural prevalence of sepsis typically observed in hospitals. As seen in Rhee et al. (2017), from 2009 to 2014 the prevalence of sepsis was about 6% of the patient population and it was relatively stable over time (cf. p. 1246, Rhee et al., 2017).
In our initial submission, we trained/validated and tested the model without any oversampling procedure applied to the data. Due to the low prevalence of sepsis in our dataset, and given that the analysis was done using each clinical note as the unit of analysis, the prevalence of sepsis in the clinical notes was around 1% for the early prediction algorithm, leading to a naturally low PPV, even with an AUC of > 0.90, sensitivity of 0.86, and specificity of 0.80.
Based on the review team's suggestion to test our model under higher prevalence of sepsis, as seen in most machine learning studies where some form of over/under sampling is used (Liu et al., 2019), we oversampled the sepsis cases using SMOTE (Synthetic Minority Oversampling Technique) for the training and validation dataset. SMOTE is a commonly applied oversampling procedure where additional positive cases are imputed via a nearest neighbor resampling algorithm (Chawla et al., 2002). This method is used in prior studies published in Nature Communications that develop machine learning classifiers in low prevalence environment, e.g., oral cancer detection (Carnielli et al., 2018) and cell identification/ classification (Rennie et al., 2018;Xia et al., 2020).
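The nearest-neighbour interpolation at the core of SMOTE (Chawla et al., 2002) can be sketched in a few lines. This is a simplified illustration using only the Python standard library, not the implementation used for the models in this response; the function name, the parameter defaults, and the toy two-dimensional points are assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority-class points by interpolation.

    For each synthetic point: pick a minority sample at random, pick one of
    its k nearest minority neighbours, and place the new point at a random
    position on the line segment between the two.
    Assumes `minority` holds at least two numeric feature vectors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class: four points at the corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(corners, n_synthetic=10, k=2, seed=1)
print(len(new_points))
```

Because every synthetic point lies between two existing minority samples, oversampling applies only to the training and validation data, as described above; the test data keep their natural prevalence.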
It is important to note that the AUC, sensitivity, and specificity of models developed using oversampling (SMOTE) and models developed without oversampling are very similar, as seen in Tables 3a to 3f (see response to Reviewer 1). The PPVs for models without oversampling are significantly lower, as PPV is algebraically constrained by the prevalence of sepsis (see equation 2 below); the only exception is where specificity equals 1. With the low prevalence in our sample, we expected low PPV values.
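The dependence of PPV on prevalence follows from Bayes' rule: PPV = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). As a minimal sketch (the function name `ppv` is chosen for this illustration), the note-level figures quoted earlier in this response (sensitivity 0.86, specificity 0.80, prevalence of about 1%) give a PPV of roughly 0.04, whereas the same sensitivity and specificity at 50% prevalence would give about 0.81.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("PPV is undefined when no case is predicted positive")
    return true_pos / (true_pos + false_pos)


# Same classifier, different prevalence levels.
for prevalence in (0.01, 0.10, 0.50):
    value = ppv(0.86, 0.80, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {value:.3f}")

# With perfect specificity there are no false positives, so PPV is 1
# regardless of prevalence (the exception noted above).
print(ppv(0.86, 1.0, 0.01))
```

This is why AUC, sensitivity, and specificity can match between the oversampled and original data while PPV differs sharply.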

It would be helpful if a standard CONSORT enrolment diagram was included as a figure, potentially to replace Figure 1.
Thank you for this useful suggestion and we agree that the current Figure 1 is less informative. However, your suggestion of a CONSORT diagram is applicable only to a randomized controlled trial but ours is a case-control study. As such, we have used a STROBE Enrolment diagram to replace Figure 1 as suggested in Vandenbroucke et al. (2007). We believe a STROBE Enrolment diagram will provide the equivalent information as a CONSORT diagram.

It would also be helpful if one included a reliability diagram (calibration diagram) as well as a precision-recall diagram for a representative model to better understand calibration as well as the trade-offs between precision and recall for choosing a decision threshold.
Thank you for raising this point. We have now included the calibration curves (Figure R1 in response to Reviewer 1 comments) as well as the precision-recall curves below (Figure R4).
* Note that the Precision axis is truncated and starts at 0.5.

REVIEWERS' COMMENTS
Reviewer #1 (Remarks to the Author): In general, the authors have responded to the majority of my concerns. I feel my final comment on the practicability of SERA algorithm in clinical practice could have been addressed better. As shown in [1] traditional machine learning methods dominate clinical practice with respect to deep learning methods. As such, the authors could use this evidence to better motivate their discussion.