Introduction

Sepsis is a dangerous condition triggered by an immune overreaction to an infection. According to World Health Organization (WHO) estimates, sepsis affects more than 30 million people worldwide every year, causing approximately 6 million deaths1 and more than US$24 billion in healthcare-related costs annually in the United States alone2. The scientific community is still investigating sepsis etiology3, whilst its management4,5,6 is troublesome due to the disease’s high complexity and heterogeneity7,8. A further complexity factor lies in the more restrictive definition of sepsis introduced in 20169; the new definition, named Sepsis-310, now requires the presence of additional organ dysfunction for the condition to be labelled as sepsis. Although the usefulness of Sepsis-3 has recently been validated11, it is still debated within the medical community12. Additionally, early detection is critical to managing the attack and obtaining a favorable outcome, as sepsis can kill a patient in as little as an hour.

Prediction of survival of patients with sepsis

Medical literature is rich in general-purpose articles on sepsis13, and the quest for biomarkers in clinical settings has now spanned several decades, with papers dating back to the early seventies still relevant today14. Initially, research focused on clinical trials aimed at identifying therapeutic factors representing potential targets for novel or repurposed drugs. The crucial change of pace occurred in the early 2000s, when broad epidemiological data began to become publicly available, leading to the appearance of large retrospective studies2,15. Indeed, this recent influx of data has resulted in a steady flow of medical and computer science studies in which researchers have used various data science techniques to find associations between clinical factors and sepsis outcomes, with patient survival being among the most important. Contributing to the landscape, the practitioners’ community started introducing different early warning scores, such as physiological monitoring systems for detecting acutely deteriorating patients16. A small group of scores quickly gained popularity in clinical settings, thus becoming de facto standards for benchmarking studies: APACHE17, SAPS18, SOFA19 and the qSOFA score10. In addition to these established, community-shared scores, different formulas involving alternative variables have recently been defined in the literature: for instance, the dynamic pulse pressure and vasopressor (DPV), the delta pulse pressure (\(\Delta\)PP)20 and the sepsis hospital mortality score (SHMS)21. However, although early warning scores have been widely adopted, there is only limited evidence of their effectiveness in improving patient outcomes16. Among all statistical methods, algorithms based on multivariate (Cox) regression on clinical variables have played a key role22,23, from the early years24 to the present day25. Notably, the features involved in these methods are not limited to clinical variables: in the last few years a number of teams have tried alternative elements from modern omics technologies, such as metabolomics26, SNP genomics27, circulating microRNA28, blood metabolites29 or lymphocyte apoptosis30, often coupled with more classical biomarkers and compared with the different scores. Unfortunately, these statistics-based approaches proved rather limited in their performance, with only a tiny fraction of studies achieving acceptable levels of efficacy31. Indeed, Gwadry-Sridhar and colleagues32 claimed superiority of decision trees over regression methods as early as 2010.

Sepsis and machine learning

More recently, machine learning has become the major player in the predictive analysis of sepsis data, leading to a massive wave of studies targeting different aspects of the problem, from the general issue33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51 to more specific objectives or methods. For instance, many studies have defined, combined and validated risk scores52,53, predicted early onset54,55,56, or focused on pediatric aspects57 or on the immediate applicability to clinical practice58,59,60. Longitudinal studies have also appeared61,62,63, together with methods integrating alternative data sources such as omics64 and others. At the end of the 2010s, the computational intelligence revolution entered the playground too, and deep learning approaches flooded the specialized journals65,66,67,68,69,70,71,72,73, also addressing the interpretability issue74,75. Fleuren et al.76 published a comprehensive review of these different aspects. As mentioned earlier, many of these studies have been possible thanks to the public availability of curated clinical datasets related to sepsis. Among these datasets, we point out the Surviving Sepsis Campaign initiative77,78 (albeit not fully publicly released), the Medical Information Mart for Intensive Care database (MIMIC-III)79 and the electronic Intensive Care Unit (eICU) Collaborative Research Database80, which stand out for their completeness and integrity. Additionally, it is worth mentioning some notable studies aimed at identifying a restricted number of features predicting sepsis survival81,82: for instance, six predictors by Mao and colleagues83,84, five main predictive features by Shukeri et al.85, and three blood biomarkers by Dolin and coauthors86.

This study

In the present manuscript, we take a similar approach: our driving goal is the prediction of binary survival in a large cohort of Norwegian patients originally introduced and made public by Knoop and colleagues87. In addition to this prognostic task, as a distinguishing feature we also aim to prove that a minimal set of predictors can adequately predict the survival status. To further confirm the validity of our approach, we show that it can also be applied to an external South Korean dataset having the same clinical features, which we use as a validation cohort. As a major outcome of this quest, improving over the published literature, we discovered that a single clinical factor, namely the progressive hospitalization episode, coupled with the two basic personal elements age and sex, can effectively predict the survival of the patients. Notably, we carried out the analysis both on the whole cohort, originally called the primary cohort, corresponding to the admissions of patients affected by sepsis potential preconditions (pre-Sepsis-3 definition), and on a subset of the data including only the patients’ admissions matching the novel Sepsis-3 definition, called the study cohort. We then repeated the same analysis entirely on the validation cohort, and finally trained our models on the primary and study cohorts to apply them afterwards to the validation cohort. For the first time, we show that it is possible to apply machine learning to sex, age, and septic episode number collected from admission clinical records to predict the survival of patients who had sepsis. Our very small set of detected predictors represents a sensible compromise between accuracy and simplicity of the model, requiring few resources in terms of collected data. This balance is critical when considering the translation to clinical practice, which, especially for sepsis management, is rarely successful58 and not easily integrated with clinicians’ activities88. Although a number of digital handling proposals have appeared in the literature89, the impact on sepsis of Food and Drug Administration (FDA) approved Software as a Medical Device (SaMD)90, for example, is still far from widespread, with perhaps the Sepsis Prediction and Optimization of Therapy system (SPOT)91 as the most famous example. Given this, having a simple albeit accurate predictive test on patient survival as presented here is a promising initial step towards the development of a machine learning-based tool supporting clinicians in everyday practice.

We organized the rest of the manuscript as follows. After this Introduction, we describe the dataset analyzed (Datasets) and the results we obtained (Results). Afterwards, we discuss the impact and consequences of our results, together with the limitations and future developments of this study (Discussion), and describe the methods we employed (Methods).

Datasets

Primary cohort and study cohort

We analyzed a dataset made of 110,204 admissions of 84,811 subjects hospitalized in Norway between 2011 and 2012 who were diagnosed with infections, systemic inflammatory response syndrome (SIRS), sepsis by causative microbes, or septic shock87. The data come from the Norwegian Patient Registry92 and the Statistics Norway agency93.

For each patient admission, the dataset contains sex, age, septic episode number, hospitalization outcome (survival), length of stay (LOS) in the hospital, and one or more codes of the International Classification of Diseases 10th revision (ICD-10) describing the patient’s disease (Table 1). Since the main goal of this study is to predict the survival of the patient, we discarded the length of stay because it strongly relates to the likelihood of survival: the longer the patient has to stay in the hospital, the less likely she/he will survive. The survival variable relates to the hospital length of stay, which ranges in the [0, 499] days interval and has a mean of 9.351 days. Our prediction therefore refers to the likelihood that a patient survives or deceases, in the hospital, in the 9.351 days (on average) after the collection of her/his medical record.

The admissions relate to 57,973 men and 52,231 women, ranging from 0 to 100 years of age (Table 2 and Table 3). Most of the admissions (76.96%) relate to the first septic episode. More information about the dataset can be found in the original study87.

The original dataset curators Knoop et al.87 called the complete dataset with 110,204 admissions the primary cohort. From the primary cohort, they selected the admissions respecting the Sepsis-3 definition of sepsis (at least one infection- or sepsis-related ICD-10 code and at least one code for acute organ dysfunction)9, and they called this subset the study cohort.

Since the data of the primary cohort were recorded before the Sepsis-3 definition emerged in 2016, we cannot know if a patient diagnosed with an ICD-10 code related to sepsis actually had an organ dysfunction afterwards. Therefore, we cannot know if these admissions can be considered related to sepsis under the current Sepsis-3 definition. What we know, instead, is that the conditions of the primary cohort (infections, systemic inflammatory response syndrome (SIRS), sepsis by causative microbes, or septic shock) might have led to sepsis. To reflect this information, we call these conditions sepsis potential preconditions in this study. We decided to consider both the primary cohort and the study cohort for our analysis because the medical community has not yet reached consensus on a unified definition of sepsis94.

We took the original dataset95 and applied the same selection, generating a study cohort with a different size from that of Knoop et al.87: while their study cohort contained 18,460 admissions, ours contains 19,051. Unfortunately, we were unable to obtain the original study cohort subset from Knoop and colleagues.
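
To make the selection concrete, the following R sketch shows how such a study cohort can be derived from the primary cohort; the data frame name, the list-column of ICD-10 codes, and the two code vectors are hypothetical placeholders, since the actual code lists are those specified by Knoop et al.87.

    # Hypothetical sketch of the study-cohort selection (R). The two code
    # vectors are placeholders; the real lists follow Knoop et al. (ref. 87).
    sepsis_codes     <- c("A40", "A41", "R65.1")    # infection/sepsis ICD-10 prefixes (placeholder)
    organ_dysf_codes <- c("J96.0", "N17", "R57.2")  # acute organ dysfunction prefixes (placeholder)

    has_any <- function(codes, prefixes) {
      any(sapply(prefixes, function(p) any(startsWith(codes, p))))
    }

    # admissions$icd10_codes is assumed to be a list-column of character vectors
    in_study_cohort <- sapply(admissions$icd10_codes, function(codes) {
      has_any(codes, sepsis_codes) && has_any(codes, organ_dysf_codes)
    })
    study_cohort <- admissions[in_study_cohort, ]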

Both the primary cohort and our study cohort are positively imbalanced (Table 2). The primary cohort contains 102,099 admissions of patients who survived (92.65% positives) and 8,105 admissions of patients who deceased (7.35% negatives). Our study cohort contains 15,445 admissions of patients who survived (81.07% positives) and 3,606 admissions of patients who deceased (18.93% negatives).

Table 1 Meanings, measurement units, and intervals of each feature of the dataset.
Table 2 Statistical quantitative description of the category features.
Table 3 Statistical quantitative description of the numeric features.
Figure 1

Primary cohort: stacked barplots of the distribution of categories. Distribution of sepsis episode number and sex of the admissions of patients who deceased (left) and survived (right). Admissions of survived patients: positive data instances (class 1). Admissions of deceased patients: negative data instances (class 0).

Figure 2

Primary cohort: histograms of the patients’ ages in relation to the number of admissions. On the left, the admissions of the patients who deceased; on the right, the admissions of patients who survived. Admissions of survived patients: positive data instances (class 1). Admissions of deceased patients: negative data instances (class 0).

Figure 3

Study cohort: stacked barplots of the distribution of categories. Distribution of sepsis episode number and sex of the admissions of patients who deceased (left) and survived (right). Admissions of survived patients: positive data instances (class 1). Admissions of deceased patients: negative data instances (class 0).

Figure 4

Study cohort: histograms of the patients’ ages in relation to the number of admissions. On the left, the admissions of the patients who deceased; on the right, the admissions of patients who survived. Admissions of survived patients: positive data instances (class 1). Admissions of deceased patients: negative data instances (class 0).

We report the stacked barplots of sex and sepsis episode number (Fig. 1) and the histogram of the age distribution (Fig. 2) for the primary cohort; we report the stacked barplots of sex and sepsis episode number (Fig. 3) and the histogram of the age distribution (Fig. 4) for the study cohort.

Validation cohort

To confirm our findings, we also applied our methods to a dataset of South Korean critically ill patients whose medical records were collected between January 2007 and December 2015 and publicly released by Lee and colleagues96. From their original dataset, we selected the data of 137 patients who had already had one or two septic episodes.

Since all these data were recorded before 2016, they are associated with a definition of sepsis earlier than Sepsis-3.

The dataset already contained the sex and age features, while we derived the septic episode feature by selecting all the patients who had already had a septic episode before surgery (“Preop shock = 1”), and dividing them between those who had a second septic episode afterwards (“new sepsis = 1”) and those who did not (“new sepsis = 0”).
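
As a concrete illustration, a minimal R sketch of this derivation follows; the data frame and file names are hypothetical assumptions, while the column names are those quoted above from the original dataset96.

    # Hypothetical sketch: deriving the septic episode number feature (R)
    korea <- read.csv("lee_validation_cohort.csv", check.names = FALSE)  # file name is an assumption

    # keep only patients who already had a septic episode before surgery
    korea <- korea[korea[["Preop shock"]] == 1, ]

    # patients with a new postoperative septic episode count two episodes, the others one
    korea$episode_number <- ifelse(korea[["new sepsis"]] == 1, 2, 1)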

The 137 patients of our validation cohort (47 women and 90 men) are 59.54 years old on average (median: 60). Among them, 115 had one septic episode and 22 had two septic episodes, while 113 survived and 24 deceased. Regarding dataset imbalance, this validation cohort is positively imbalanced, having 82.482% positive data instances and 17.518% negative data instances.

More information about this dataset can be found in the original study96.

Results

In this section we first describe the results we obtained through the traditional univariate biostatistics tests (Statistical correlations) and then the results we achieved through the machine learning classifiers on the primary and study cohorts (Survival predictions), and on the external validation cohort (Validation on external cohort).

Statistical correlations

We applied traditional biostatistics tests (Biostatistics univariate tests) to evaluate the univariate associations between each of the three feature variables and the survival status on the primary cohort. All three associations resulted statistically significant, with \(p<0.001\) (Table 4).

Table 4 Results of the application of univariate biostatistics tests between each feature and the survival target, in the primary cohort.

The results of these tests indicate statistically meaningful relationships between age and survival, between septic episode number and survival, and between sex and survival. These results confirm that we can use these three clinical factors as predictive features to forecast survival.

Survival predictions

We report the results of our machine learning predictions made on the primary cohort and on the study cohort, measured with traditional confusion matrix rates, in Table 5. As mentioned earlier, we consider the admissions of survived patients as positive data instances (class 1), and the admissions of deceased patients as negative data instances (class 0).

Table 5 Results of the survival prediction made with machine learning classifiers, with training phase and testing phase done on the Norwegian primary cohort or study cohort87.

We report two scores considering all the possible confusion matrix thresholds (the area under the precision-recall curve and the area under the receiver operating characteristic curve), and seven scores computed by setting the confusion matrix threshold to 0.5 (TP rate, TN rate, PPV, NPV, MCC, \(\hbox {F}_1\) score, and accuracy). Since the main goal of our study is to predict the survived patients (positive data instances), and the inclusion of all the possible confusion matrix thresholds is more informative than the usage of a heuristic cut-off, we focused on the precision-recall area under the curve (PR AUC) as principal indicator (Table 5).

In the primary cohort, which contained admissions of patients diagnosed with sepsis before Sepsis-3, radial SVM and gradient boosting outperformed the other methods by achieving PR AUC = 0.966 and ROC AUC close to 0.7. Gradient boosting proved very effective at predicting the survived patients, achieving TP rate = 0.905, followed by linear regression, which reached sensitivity = 0.805. Regarding the identification of deceased patients, the two SVM models attained the top TN rates: 0.898 for the linear SVM and 0.807 for the radial SVM.

All five models obtained very high positive predictive values (PPVs), from linear SVM achieving 0.896 to radial SVM reaching the almost perfect value of 0.970. All five methods, however, had low negative predictive values (NPVs), ranging from 0.112 to 0.210, which was reflected in low Matthews correlation coefficients. Regarding \(\hbox {F}_1\) score and accuracy, four methods obtained high or very high results, with top performance reached by gradient boosting (\(\hbox {F}_1\) score = 0.916 and accuracy = 0.851), while linear SVM achieved low scores on both these rates (Table 5).

In the study cohort, which contains admissions of patients diagnosed with sepsis based on the 2016 Sepsis-3 definition, the results were similar to those seen in the primary cohort, albeit slightly lower (Table 5). All five models obtained very high PR AUCs, with linear SVM obtaining the top score of 0.860. Regarding the ROC AUCs, the two support vector machines achieved the best results, both reaching 0.568. Gradient boosting and linear regression were able to correctly predict most of the survived patients, reaching sensitivity scores of 0.837 and 0.764, respectively. Linear SVM was the best at predicting deceased patients, with a specificity score of 0.898.

Regarding precision, all five methods were able to make accurate positive predictions, with linear SVM again obtaining the best PPV (0.896). Similar to the primary cohort, they also all had low NPVs (ranging from 0.210 to 0.239), which was reflected in their Matthews correlation coefficients. As in the primary cohort, gradient boosting achieved high values for \(\hbox {F}_1\) score and accuracy (0.819 and 0.718, respectively), followed by linear regression (Table 5).

Validation on external cohort

To further verify the predictive power and the generalizability of our classifiers, we performed two additional analyses involving an external validation cohort containing medical records of patients from South Korea (Datasets)96.

In the first analysis, we both trained and tested our models on this external validation cohort, and reported the results (Table 6). In the second analysis, we trained our models on the Norwegian primary cohort or study cohort, applied the trained models to the external validation cohort, and reported the results (Table 7).

Table 6 Results of the survival prediction made with machine learning classifiers on the South Korean external validation cohort96.

Train and test on the external validation cohort

The results we report show that all five methods (naïve Bayes, linear SVM, radial SVM, gradient boosting, and linear regression) are capable of efficiently predicting survival not only when trained and tested on the Norwegian cohorts, but also when trained and tested on another, external dataset (Table 6). These results confirm the generalizability of our approach.

All the classifiers, in fact, obtained high PR AUCs, ranging from 0.873 (radial SVM) to 0.899 (linear SVM), and were able to correctly classify most of the positive data instances (minimum TP rate = 0.849) and to make mostly correct positive predictions (minimum PPV = 0.849). Only naïve Bayes and linear regression were able to correctly classify most of the negative data instances and to make mostly correct negative predictions (specificity and NPV greater than 0.5 for both methods).

The other indicators also show good scores (ROC AUC and MCC) or optimal scores (accuracy and \(\hbox {F}_1\) score) for all five classifiers (Table 6).

Table 7 Results of the survival prediction made with machine learning classifiers, including standard deviation, with training phase done on the Norwegian primary cohort or study cohort87 and testing phase done on the South Korean external validation cohort96.

Train on the primary or study cohort, and test on the external validation cohort

The final part of our analysis involved using our trained models to make survival predictions on an external dataset. In a real-world scenario, in fact, physicians and medical doctors would apply our approach to the data of a new cohort of patients arriving at the hospital, and these patients, of course, would not be part of the original cohort used to train the models. To address this scenario, we performed an additional analysis where we trained our models on the Norwegian primary cohort or study cohort of Knoop et al.95 and tested them on the South Korean external validation cohort by Lee et al.96. We report the results in Table 7.

As one can notice, the algorithms we employed were able to correctly predict most of the survived patients and to make mostly correct positive predictions, obtaining PR AUC scores ranging from 0.821 (radial SVM trained on the primary cohort) to 0.863 (gradient boosting trained on the study cohort). Naïve Bayes obtained the top PR AUC when trained on the primary cohort (0.848), while gradient boosting achieved the top PR AUC when trained on the study cohort (0.863). Because of the imbalance of the cohorts, all the methods achieved high scores for positive data instances (sensitivity and precision) but low scores for negative data instances (specificity and NPV). Naïve Bayes achieved the top specificity both when trained on the primary cohort (0.415) and when trained on the study cohort (0.386).

Differently from the other tests we made, some methods failed to correctly predict any negative data instances: the linear SVM classified all the validation set data instances as positive, both when trained on the primary cohort and on the study cohort, while the linear regression did the same when trained on the primary cohort. This behavior suggests additional future studies in theoretical machine learning on the behavior of these algorithms.

These results additionally show the level of generalizability of our approach, which is able to correctly predict survived patients just from sex, age, and septic episode number, even when our models are trained and tested on two different cohorts.

Discussion

Our results show that machine learning applied to minimal clinical records of patients diagnosed with sepsis, containing only age, sex, and septic episode number, is sufficient to predict the survival outcome of the patients themselves. Most of our machine learning methods, in fact, were able to correctly predict most of the survived patients (very high sensitivity rates) with high confidence (very high precision values).

To the best of our knowledge, no other study on sepsis has predicted patient survival outcomes with so little, easily obtainable information: age and sex are immediately available for each patient, while the septic episode number can be easily found in the patient’s history.

Our finding could have consequences for the way sepsis is managed around the world. If validated, it will allow hospitals to quickly and reliably predict a patient’s survival in a few seconds, enabling quicker action from the doctors, which is crucial for a quick-to-kill illness like sepsis. The finding will be especially useful to hospitals that lack personnel and machinery, like those in rural or developing areas.

Our findings were not identified by the study of the original Norwegian dataset curators, which instead provides a general analysis of the correlations between the features of the patients’ cohort87, nor by the study of the validation cohort96.

As a limitation, we have to report that, even if our machine learning methods proved effective in identifying the admissions of survived patients, the same cannot be said for the admissions of deceased patients. Our data mining techniques, in fact, were able to correctly predict most of the admissions of the deceased patients (high TN rates), but with low diagnostic proportions (low NPVs)98. We believe this drawback of our study is due to the huge imbalance of the datasets: during training, the machine learning methods do not see enough negative elements, and therefore they generate many false negatives when making predictions on the test set. We tried to tackle this problem with ROSE oversampling97, which improved the situation but did not solve the issue. This drawback is critical because the patients who are more likely to decease are the ones who need urgent therapies and cures in a hospital setting. We hope to overcome this issue in the future by employing other oversampling techniques.

We also have to report that the absence of a temporal feature expressing the time elapsed between a septic episode and decease has been a limitation of this study. Such a time feature, in fact, would have allowed us to make time-related predictions, which would have higher impact in a hospital setting by helping doctors understand which patients are more in need of immediate help.

In the future, we plan to further investigate the theme of the minimal clinical record for computational survival prediction in other diseases such as cervical cancer99, neuroblastoma100, breast cancer101, and amyotrophic lateral sclerosis102.

Methods

In this section, we briefly describe the traditional biostatistics tests we employed to detect correlations between each clinical feature and the survival target (Biostatistics univariate tests), and the machine learning methods we used to predict survival (Machine learning classifiers).

We implemented our software code with the free open source R programming language and platform103, and made it publicly available online on GitHub (Data and software availability).

Biostatistics univariate tests

To identify preliminary associations between each feature (age, sex, septic episode number) and the target (survival), we performed univariate biostatistics analyses. We used the Anderson–Darling test104 to check the normality of the continuous variables. As the normality assumptions were not met, we employed the Mann–Whitney U test105 to evaluate associations between the continuous features and survival. We used the chi-squared (\(\chi ^2\)) test106 to evaluate the association between sex and survival. We considered p-values less than 0.05 as statistically significant.

For both the Mann–Whitney U test and the chi-squared test, a low p-value (close to 0) indicates that the two analyzed variables are significantly associated, while a high p-value (close to 1) indicates no evidence of association107.
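
The following R sketch illustrates these tests on a hypothetical data frame cohort with columns age, sex, episode_number, and survival (0/1); using the nortest package for the Anderson–Darling test is an assumption.

    # Univariate biostatistics tests (R sketch, hypothetical column names)
    library(nortest)                                        # provides ad.test()

    ad.test(cohort$age)                                     # Anderson-Darling normality test
    wilcox.test(age ~ survival, data = cohort)              # Mann-Whitney U: age vs survival
    wilcox.test(episode_number ~ survival, data = cohort)   # Mann-Whitney U: episode number vs survival
    chisq.test(table(cohort$sex, cohort$survival))          # chi-squared: sex vs survival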

Machine learning classifiers

To predict the survival of patients from only three features, we initially employed function approximation methods108, trying to frame this scientific problem in a linear setting with a mathematical formula such as \(y = f(x, w, z)\), where y is survival, x is age, w is sex, and z is the episode number. After several attempts, however, we realized that this problem could not be solved through a simple linear function of three variables, and we therefore decided to take advantage of machine learning.

We employed five machine learning classifiers from four different method families: linear regression109, support vector machine with linear kernel (linear SVM)110, support vector machine with radial kernel (radial SVM)111, gradient boosting112, and naïve Bayes113.

We first chose linear regression because it is a baseline statistical model and one of the simplest methods in computational intelligence; starting an analysis with a simple method is considered good practice in machine learning114. We then chose two support vector machines with different kernels (linear and Gaussian radial), because they can project the data into a space where a hyperplane suitable for classification can be found. After that, we tried gradient boosting, an ensemble boosting method capable of training several weak classifiers to build a strong one. Finally, we employed a probabilistic classifier, naïve Bayes, which is based on Bayesian conditional probability and can estimate how likely a data instance is to belong to a class.
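
A minimal R sketch of how the five classifiers might be fit follows, assuming a training data frame train with the column names used above; the packages (e1071, gbm) are assumptions, not necessarily the exact implementation used here.

    # Fitting the five classifiers (R sketch, assumed packages and column names)
    library(e1071)   # svm(), naiveBayes()
    library(gbm)     # gbm()

    f_num <- survival ~ age + sex + episode_number              # numeric 0/1 target
    f_fac <- as.factor(survival) ~ age + sex + episode_number   # factor target

    lin_reg  <- lm(f_num, data = train)                         # linear regression on the 0/1 label
    lin_svm  <- svm(f_fac, data = train, kernel = "linear", cost = 1, probability = TRUE)
    rad_svm  <- svm(f_fac, data = train, kernel = "radial", cost = 1, probability = TRUE)
    grad_bst <- gbm(f_num, data = train, distribution = "bernoulli", n.trees = 100)
    nb       <- naiveBayes(f_fac, data = train)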

All these methods have shown their effectiveness in binary classification of biomedical data in the past, and therefore represented suitable candidates for this study as well.

We applied each algorithm 100 times both to the primary cohort and the study cohort, and reported the mean result (Results). For the methods that needed hyper-parameter optimization (linear SVM and radial SVM), we split the dataset into 60% randomly selected admissions for the training set, 20% randomly selected admissions for the validation set, and the 20% remaining admissions for the test set. To choose the best value of the hyper-parameter C, we used a grid search and selected the model that generated the highest Matthews correlation coefficient114,115. For the other methods (linear regression, naïve Bayes, and gradient boosting), instead, we split the dataset into 80% randomly selected data instances for the training set and the 20% remaining data instances for the test set.
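
The sketch below illustrates one such 60/20/20 split and the grid search over the SVM cost C, selecting the configuration with the highest Matthews correlation coefficient on the validation set; the grid values are assumptions, and f_fac and svm() come from the previous sketch.

    # Hold-out split and grid search for the linear SVM (R sketch)
    n     <- nrow(cohort)
    idx   <- sample(n)                                   # random permutation of the admissions
    train <- cohort[idx[1:round(0.6 * n)], ]
    valid <- cohort[idx[(round(0.6 * n) + 1):round(0.8 * n)], ]
    test  <- cohort[idx[(round(0.8 * n) + 1):n], ]

    mcc <- function(actual, predicted) {                 # Matthews correlation coefficient
      tp <- sum(actual == 1 & predicted == 1); tn <- sum(actual == 0 & predicted == 0)
      fp <- sum(actual == 0 & predicted == 1); fn <- sum(actual == 1 & predicted == 0)
      (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    }

    best <- list(mcc = -Inf)
    for (C in c(0.01, 0.1, 1, 10, 100)) {                # hypothetical grid for C
      model <- svm(f_fac, data = train, kernel = "linear", cost = C)
      pred  <- as.numeric(as.character(predict(model, valid)))
      score <- mcc(valid$survival, pred)
      if (score > best$mcc) best <- list(mcc = score, cost = C, model = model)
    }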

For each of the 100 executions, our script randomly chose admissions for the training set and for the test set (and for the validation set, in the case of hyper-parameter optimization) from the complete original primary cohort or study cohort. We trained each model on the training set (and validated it on the validation set, in the case of hyper-parameter optimization), and we then applied the model to the test set. Given the different selections of admissions for the dataset splits, each script execution generated slightly different results even when employing the same method.

Because of the huge imbalance of the datasets (92.65% positives and 7.35% negatives in the primary cohort, and 81.07% positives and 18.93% negatives in the study cohort), we had to employ an oversampling technique at each execution to make the training set more balanced. We applied the Randomly Over Sampling Examples (ROSE) method97, which creates and adds artificial synthetic data instances of the minority class (the deceased patients, in our datasets) to the training sets. Since we split the datasets into training, validation, and test sets for the support vector machines, and just into training and test sets for the other methods, we had to select different optimized probability values for the ROSE minority class for these two groups of algorithms.
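
A minimal sketch of this rebalancing step follows, assuming the ROSE package and the column names used above; the minority-class probability p shown is illustrative, not the optimized value selected in our experiments.

    # Oversampling the training set with ROSE (R sketch, illustrative p)
    library(ROSE)

    balanced <- ROSE(survival ~ age + sex + episode_number,
                     data = train, p = 0.5, seed = 1)$data
    table(balanced$survival)   # classes should now be roughly balanced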

We measured the classifiers’ performance using typical confusion matrix evaluation scores such as the Matthews correlation coefficient (MCC), the receiver operating characteristic area under the curve (ROC AUC), and the precision-recall area under the curve (PR AUC), among others. Since our main goal is to correctly predict the survival of patients, we ranked the results based on the PR AUCs, which highlight the true positive rates and positive predictive values reached by each method116.
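
As an illustration, the following sketch computes the main scores from a hypothetical vector scores of predicted survival probabilities on the test set; using the PRROC package is an assumption, and mcc() refers to the function defined in the earlier sketch.

    # Evaluation scores (R sketch); `scores` holds predicted survival probabilities
    library(PRROC)

    pr  <- pr.curve(scores.class0 = scores[test$survival == 1],   # positives (survived)
                    scores.class1 = scores[test$survival == 0])   # negatives (deceased)
    roc <- roc.curve(scores.class0 = scores[test$survival == 1],
                     scores.class1 = scores[test$survival == 0])
    pred_05 <- as.numeric(scores >= 0.5)   # heuristic 0.5 cut-off
    c(PR_AUC = pr$auc.integral, ROC_AUC = roc$auc,
      MCC = mcc(test$survival, pred_05))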