Identifying and characterizing high-risk clusters in a heterogeneous ICU population with deep embedded clustering

Critically ill patients constitute a highly heterogeneous population, with seemingly distinct patients having similar outcomes, and patients with the same admission diagnosis having opposite clinical trajectories. We aimed to develop a machine learning methodology that identifies and provides better characterization of patient clusters at high risk of mortality and kidney injury. We analysed prospectively collected data including co-morbidities, clinical examination, and laboratory parameters from a minimally-selected population of 743 patients admitted to the ICU of a Dutch hospital between 2015 and 2017. We compared four clustering methodologies and trained a classifier to predict and validate cluster membership. The contribution of different variables to the predicted cluster membership was assessed using SHapley Additive exPlanations values. We found that deep embedded clustering yielded better results compared to the traditional clustering algorithms. The best cluster configuration was achieved for 6 clusters. All clusters were clinically recognizable, and differed in in-ICU, 30-day, and 90-day mortality, as well as incidence of acute kidney injury. We identified two high mortality risk clusters with at least 60%, 40%, and 30% increased. ICU, 30-day and 90-day mortality, and a low risk cluster with 25–56% lower mortality risk. This machine learning methodology combining deep embedded clustering and variable importance analysis, which we made publicly available, is a possible solution to challenges previously encountered by clustering analyses in heterogeneous patient populations and may help improve the characterization of risk groups in critical care.

acute respiratory distress syndrome (ARDS) and acute kidney injury (AKI) 4,5 . LCA is an established, model-based statistical technique, that defines the best fitting models for data assumed to contain several unobserved groups 4,5 . Unlike LCA, where clusters are derived from the distribution of the data, unsupervised clustering algorithms such as k-means and hierarchical clustering find clusters by identifying similarities between cases 1 . These two algorithms have recently been applied to identify clusters in a general ICU population, cardiovascular clusters in septic shock patients, and corticosteroid response in patients with severe asthma, with a variation of k-means called fuzzy c-means also being used to cluster severely injured blunt trauma patients 2,6-8 . Similarly, prediction models with increasingly higher accuracy and explainability for mortality and organ injury have been suggested, capable of processing high-frequency data [9][10][11] . Both approaches have different advantages and shortcomings. Prediction models can deal with virtually any data format and provide individual probabilities for an outcome, but are bound by a priori hypotheses and can only compute the probability of one specific outcome. On the other hand, traditional clustering algorithms, are not designed to process the highfrequency, dynamic data collected in ICU 1 .
In this study, we sought to develop and apply a novel approach to identifying and characterizing clusters of critically ill patients. Using co-morbidity, clinical examination, and laboratory data from a minimally selected ICU cohort, we compared the performance of different clustering methodologies and applied a combined deep embedded clustering and feature importance analysis algorithm to identify clusters of patients at high risk of AKI and mortality during ICU stay, and at 30 and 90 days. Then, we trained a classifier to predict and validate cluster membership, and identified the features driving these predictions. We hypothesized that this approach could identify clinically recognizable patient clusters with clinically significant differences in mortality and severe acute kidney injury.

Methods
Data sources. Data used for this study originated from the prospective, single-centre Simple Intensive Care Studies (SICS) I cohort study. All acutely admitted, critically ill patients included the study underwent clinical examination and critical care ultrasonography (CCUS) within the first 24 h of ICU admission. Informed consent was obtained for all included patients, and all analyses were performed in accordance with relevant guidelines and regulations. Further details on inclusion criteria, informed consent, and study protocol are available elsewhere 12,13 . The study was approved by the local institutional review board (Medisch Ethische Toetsingscommissie (METc) of the UMCG, M15.168207).
Co-morbidity, clinical examination, and laboratory data. The dataset consisted of patient characteristics including co-morbidities, clinical examination variables including CCUS, vital signs, and urine output, and a time-series of 40 laboratory values measured at least once daily (Table 1). CCUS measurements were www.nature.com/scientificreports/ validated by experts, and vital signs were recorded from the bedside monitor 12,13 . Patients with more than 10% missing data (i.e. variables for which no measurements were registered at any moment during ICU stay) were excluded from the analysis. Missing laboratory data were imputed using a rolling mean based on ICU-specific values 14 . For other variables, iterative imputation with 10 iterations (a method similar to multivariate imputation by chained equations) was used 15 . Feature extraction (mean and variance concatenated over the whole time-series) was employed to represent the time-series data. Table 1 shows the average number of co-morbid conditions per patient per cluster, calculated based on the information on co-morbidities.
Outcome. To define and assess clinically relevant differences between clusters, mortality at three time points (in-ICU, as well as at 30 and 90 days) was taken as a primary outcome. Kaplan Meier curves were used to visualize the mortality per cluster during and after ICU stay. The secondary outcome was the development of severe AKI (stages 2 or 3). Additionally, differences in the development of any stage of AKI, need for vasopressors, ICU length of stay, development of shock, and need for renal replacement therapy are also reported.

Development and comparison of different clustering methodologies.
Most clustering algorithms, such as k-means clustering and hierarchical clustering, are not designed to process high-frequency, dynamic data 1 . Different strategies have been developed to facilitate this, including combining K-means and HC with a time-series processing methodology such as dynamic time warping (DTW), as well as using clustering algorithms which represent data in a different way, such as deep embedding clustering (DEC) 16,17 . In both these approaches, described in more detail in the Supplementary information files, features such as mean and variance are extracted from the time-series data and subsequently fed into a clustering algorithm. In this study, we compared a DEC model to two "traditional" clustering algorithms, k-means and hierarchical clustering (HC), as well as a combination of HC and dynamic time warping (HC-DTW).
Deep embedded clustering algorithms utilize autoencoder neural networks to learn a certain representation of the data, and then use this representation to form clusters 16 . Despite its frequent use in for clustering analyses in other fields, there are but a few reports of analysis of medical data using DEC 18 . The DEC model developed in this study combined a multilayer perceptron (MLP) autoencoder, which is a type of neural network, and a custom clustering layer with the k-means clustering algorithm. The clustering layer reconstructs features created by the MLP autoencoder, and converts it to cluster label probabilities represented by Student's t-distribution. The clustering layer weights represent the cluster centroids and are initialized using k-means algorithm. To improve cluster purity, a centroid-based target distribution is constructed by squaring the encoded vectors and normalizing them by frequency per cluster. Finally, the algorithm is trained to minimize Kullback-Leibler divergence loss for a maximum of 8000 iterations with 0.01 tolerance threshold.
Once clusters were computed for all four algorithms, validity assessments were conducted 19 . Internal validity assesses whether the structure of the clustering is intrinsically appropriate for the data. Patients clustered in the same cluster should have similar data, whereas patients from different clusters should be as distinct as possible from those in other clusters. Here, the Silhouette index was used to internally validate k-means, HC, and HC-DTW. For DEC, cluster-wise stability was computed by resampling the dataset 100 times and computing the Jaccard similarities to the originally defined clusters as well as entropy scores [20][21][22] . External validity assesses whether clustering results match some a priori expected data structure. When the true cluster labels are known, this is done by comparing the clustering output to a given "correct" clustering 1,23 . Since no "true" labels are available when attempting to identify new putative patient clusters, the clinical recognizability of these clusters was used as surrogate of external validity. Lastly, to compare the different methodologies, the potential clinical utility of the clustering was assessed by examining the distribution of patients across the different clusters and whether the different clustering configurations identified between-group differences in the input features.
Cluster membership prediction and feature importance analysis for cluster characterization. A gradient boosting algorithm (XGBoost) was trained to predict cluster membership over 10-folds for each of the 100 clustering configurations resulting from DEC 24 . Then, SHapley Additive exPlanations (SHAP) values were computed on the run with the highest accuracy to represent the feature importance of each variable in the model. SHAP values are widely used in game theory to determine the contribution of particular features to the difference between the actual and the mean predictions 25 . Positive and negative SHAP values signal that variables contribute positively or negatively to cluster membership, respectively. Finally, clusters were characterized based on the between-cluster differences in input variables and outcomes, and feature importance values. The admission and discharge diagnoses of all patients were analysed and relevant clinical information to aid in the characterization was extracted. A full schematic overview of the analysis is provided in Fig. 1.

Statistical analysis.
Descriptive characteristics for the study population were reported as means with standard deviations and proportions for continuous and categorical variables, respectively. Differences in input variables and outcomes between clusters were determined using analysis of variance and chi-squared tests. Hazard ratios for mortality per cluster were computed, and the p-value for comparison against the full cohort was calculated using the log-rank test. A p-value below 0.05 was considered significant. All clustering and further statistical analyses were performed using Python with PyCharm as interface (version 2019.3.5). Performance of different clustering methodologies. Across k-means, HC, and HC-DTW, the silhouette score was highest when patients were divided into only 2 clusters (Supplementary Table S1). When 3-6 clusters were considered, silhouette scores were still similar, but lower, across algorithms. However, the high silhouette scores are explained by all three algorithms always grouping most patients into one of the clusters, even when the dimensionality (and hence the noise) of the inputs was reduced using principal component analysis. As shown in Supplementary Table S1, all three methods tended to cluster around 90% of patients in only one cluster, regardless of the number of clusters. In contrast, clusters generated by DEC showed balanced cluster membership distribution irrespective of the putative number of clusters (Supplementary Table S1) and identified significant between-cluster differences for the majority of the input features (Supplementary Tables S3-S5).
Amongst the seven different possible clustering configurations generated by DEC, stability was highest for six clusters (Supplementary Figs. S1-S8). The tenfold cross-validation XGBoost model predicted cluster membership with 83% accuracy, with sensitivity ranging from 64 to 90% and specificity from 85 to 100% (Supplementary  Table S7, Supplementary Fig. S11).

Feature importance analysis and cluster characterization.
Sixty-eight patients with high prevalence of respiratory failure or infection (34%), as well as sepsis (21%), were assigned to cluster 1 (Fig. 3, Supplementary Table S3). These patients had a long ICU stay, and the highest rate of worsened respiratory condition after 24 h (Fig. 2, Supplementary Tables S3-S5 and Supplementary Fig. S9). Feature importance analysis identified increased alkaline phosphatase, gamma-GT, bilirubin and lactate as having the greatest impact on cluster  Fig. S12). Higher values for the former three variables drove predictions towards cluster membership, while a high lactate was associated with non-membership. Patients in this cluster were not at increased mortality risk (ICU, 30-day or 90-day; Table 3 and Fig. 4).  Fig. 2, Supplementary Table S3), including the largest post-transplant group, and cardiac and vascular procedures (Fig. 3, Supplementary Table S6). Almost 40% of patients presented with acute or chronic cardiac condition, with 63% having a low cardiac index (Fig. 3, Supplementary Table S6). Accordingly, troponin T, lactate dehydrogenase (LDH), creatine kinase (CK) and inflammatory variables were increased (Fig. 2, Supplementary Table S4, Supplementary Fig. S9 (Fig. 3, Table 3, Supplementary Table S6). Higher values for arterial oxygen (pO 2 ), LDH, lactate, troponin and calcium were associated with cluster membership, while low pO 2 , LDH, and ASAT values drove predictions towards non-membership ( Supplementary Fig. S13).
For cluster 5 (n = 290), the main admission diagnoses were respiratory failure (19%), cardiac arrest (12%), neurological causes (11%), or trauma (12%). Patients in this cluster had a medium co-morbidity profile, the longest mean ICU stay (7.7 days), and high rates of delayed CRT at admission (35%). Worsening respiratory condition was frequent (14%), and COPD was a common co-morbidity (17%). High thrombocytes and potassium predicted cluster membership, while high values for pO 2 , creatinine, bilirubin, and phosphate drove predictions towards non-membership for some patients (Supplementary Fig. S16). The 95 patients assigned to cluster 6 primarily suffered from sepsis or respiratory infection, having been admitted to the ICU with respiratory (41%) or gastrointestinal diagnoses (20%). They also had high rates of distributive shock (38%), with physical examination showing a high cardiac index in 59% of patients, as well as high respiratory rates and heart rates. High values for fibrinogen, creatinine, urea, and CRP drove predictions towards cluster membership (Supplementary Fig. S17). Patients in these clusters were not at increased or reduced risk of mortality (Table 3).

Discussion
In this study, we set out to identify patient sub-phenotypes of clinical relevance using time-series laboratory, co-morbidity and clinical examination data. To do this, we compared four different clustering approaches. With a deep clustering algorithm, we identified six sub-phenotypes that capture differences in morbidity and in commonly measured clinical variables. In addition, these sub-phenotypes differed significantly in relevant clinical Table 3. Mortality rates, and hazard ratios for mortality and acute kidney injury per cluster. Hazard ratios were compared using the log-rank test, with the full cohort used as reference. AKI acute kidney injury. www.nature.com/scientificreports/ events rates (such as need for RRT and use of vasopressors) as well as mortality, with one group at low mortality risk, and two at higher mortality risk compared to average. As the first study evaluating the combination of clustering such a broad range of ICU data including demographic, co-morbidity, clinical examination, and laboratory data with feature importance analysis to characterize patient sub-phenotypes in a minimally-selected ICU population, we draw several conclusions.
We found that traditional clustering algorithms such as k-means and HC were highly susceptible to the variation in data and outliers generated by the inclusion of a large number of laboratory variables. Previous studies using these algorithms with other, mostly low-dimensional datasets had not reported this issue, and do not show the large imbalance in cluster size we observed for k-means and HC in this study 2,6,7 . Interestingly, we identified the same issue with HC-DTW despite the use of a computationally-expensive technique like DTW to make time-series sequences of different length more uniform, and therefore suitable for clustering. Deep embedded clustering, on the other hand, provided a balanced patient distribution across clusters using the extracted features alone. Despite the large amount of data and the moderate cohort size, we managed to achieve stable clusters, and use these labels to train a classifier to identify the features driving cluster membership predictions.
The findings from variable importance analysis provided interesting, adjuvant data for the interpretation of the clusters identified during this analysis. Previous clustering analyses have based interpretation of the phenotypes found on differences in means of the variables measured, or by listing variables related to each cluster based on relative importance 2,6,7 . Here, we complemented the descriptive statistics of each cluster with SHAP values to establish the directionality of the association between high and low values of a variable and cluster membership. For example, membership of clusters with higher mortality risk was associated with increased ASAT and www.nature.com/scientificreports/ LDH, which are known biomarkers of myocardial ischemia and have also been shown to be positively correlated with ICU mortality 14,[26][27][28][29] . Likewise, patients were more likely to be in the cluster with the lowest mortality risk (cluster 4) when their albumin, hemoglobin, and pO 2 values were higher, as well as when their liver function was better. These associations are supported by literature, including the addition of albumin measurements to APACHE scores [30][31][32][33] .
The results of our study suggest that this clustering methodology is superior to most widely used approaches for clustering of critically ill patients for several reasons. Firstly, it can process time-series data, unlike k-means, HC, or even HC with dynamic time warping. A recent study using clustering analysis to define cardiovascular phenotypes suggested that incorporating serial measures to study transitions from one phenotype to another during ICU stay would provide additional insight to their analysis, which was limited to static variables 6 . Similarly, clustering studies on treatment response in critically ill patients would benefit substantially from processing time-series data, as opposed to one-time treatment administration 7 .
Secondly, departing from a "minimally-selected" patient cohort, we identified six clusters which differed significantly in mortality and AKI risk, and were also clinically recognizable and describable. Caution has been advised when interpreting the results of clustering analyses, especially when identifying "novel" sub-phenotypes, and rightfully so 3 . Clustering algorithms will inevitably partition patients into clusters, and, as with most  It was with this in mind that we set out to cluster patients using a methodology which could identify clusters with significantly different outcomes and simultaneously provided some validation of the results and additional insight into the variables associated with each cluster. This pipeline of clustering, training a classifier on the stable cluster labels, and extraction of SHAP values therefore helped "open the black box" and characterize clusters in more detail. Lastly, the inclusion of time-series data into clustering analyses can bring a wide range of benefits to studies aiming to characterize patient sub-phenotypes. It can enable the use of continuous hemodynamic monitoring and laboratory data to detect variations in sensitivity to myocardial ischemia or acute kidney injury, or to identify groups with differential treatment responses over time. From a clinical perspective, the potential of accurate clustering of ICU patients to improving patient care is two-fold: it can contribute to better clinical trial design, and it has the potential to inform clinical monitoring and prognostication at bedside. The former idea is supported by recent research on ARDS that suggests sub-phenotyping based on biological markers (such as laboratory parameters) has the potential to identify mechanistic markers proximal to the clinical expression of critical disease and syndromes 34 . By doing so, it can provide a more accurate alternative to clinical phenotyping, which is prone to misclassification and remains challenging even for well-known syndromes, as well as better predict treatment response 34 . Given the challenges critical care trials often face to demonstrate meaningful clinical effects, improved randomization/allocation and design in clinical trials is especially important 35 . Secondly, the identification of clinically meaningful clusters of ICU patients, separating patients by mechanism (cardiac, respiratory, infectious, or other) as well as prognosis (low vs high risk), can be leveraged from ICU admission and throughout ICU stay to guide and optimize staffing tasks. This would allow for not only more personalized care, for example, by minimizing staff contact with patients identified as belonging to clusters with "lower care needs". In addition, accurate clustering could also be an effective way of summarizing the status of the patient, and could potentially be translated to a (triage) system possibly integrated in a clinical dashboard providing an overview of patients for rapid review by physicians. Naturally, it is essential to replicate and validate the findings of this study before any of these steps can be taken. To this end, we have made the code used in this study publicly available at https:// github. com/ J1C4F8/ SICS_ DEC and are currently setting up a multicenter study to assess the validity of the clusters and the feasibility of their clinical application.
Our study also included some limitations. First, while the goal of the SICS-I study was to collect data from a minimally-selected clinical population, inclusion criteria did apply which may account for some selection bias 12,13 . For example, patients expected to stay in the ICU for less than 24 h, due to discharge or extremely dire prognosis, were not included. Second, the six sub-phenotypes identified do not represent an exhaustive classification of critically ill patient subtypes. Information user in previous studies like end-of-life desires, need for life-sustaining therapies, and post-discharge care needs would complement our analysis, which did not include any variables of the disease course after the ICU except for mortality 2 . Third, the SHAP values reported are from an intermediate XGBoost which, despite its moderately high accuracy, does not guarantee the variables identified by SHAP are the exact same variables that the DEC model relied on when creating the clusters. Lastly, external validation of the six identified clusters in an independent cohort is necessary. Future studies with larger datasets should look to validate and replicate our findings, and address the possibility of patients belonging to multiple clusters or whether the addition of other features from the time-series data, such as trend or seasonality, would improve the results we obtained by extracting only the mean and variance of each variable.
In conclusion, our analysis of a cohort with 743 ICU patients, based on a combination of clustering and feature importance analysis of co-morbidity, clinical examination, and laboratory data identified six patient sub-phenotypes with varying mortality and risk of severe acute kidney injury. This machine learning methodology, which we made publicly available, is a possible solution to challenges previously encountered by clustering analyses in heterogeneous populations, and may help improve the characterization of risk groups in critical care.

Data availability
The datasets generated during and/or analysed during the current study are not publicly available due to containing sensitive patient information (in particular the detailed admission and discharge diagnoses information) but are available for researchers who meet the criteria for access to confidential data on reasonable request. The code used to create the models is open and available on GitHub, at https:// github. com/ J1C4F8/ SICS_ DEC.