Identifying acute illness phenotypes via deep temporal interpolation and clustering network on physiologic signatures

Clustering analysis of early vital signs may reveal unique patient phenotypes with distinct pathophysiological signatures and clinical outcomes, thereby supporting early clinical decision-making. Phenotyping from early vital signs has proven challenging, however, because vital signs are typically sampled sporadically. We proposed a novel deep temporal interpolation and clustering network to simultaneously extract latent representations from irregularly sampled vital signs and derive phenotypes. Four distinct clusters were identified. Phenotype A (18%) had the greatest prevalence of comorbid disease with increased prevalence of prolonged respiratory insufficiency, acute kidney injury, sepsis, and long-term (3-year) mortality. Phenotypes B (33%) and C (31%) had a diffuse pattern of mild organ dysfunction. Phenotype B's favorable short-term clinical outcomes were tempered by the second highest rate of long-term mortality. Phenotype C had favorable clinical outcomes. Phenotype D (17%) exhibited early and persistent hypotension, a high incidence of early surgery, and substantial biomarker evidence of inflammation. Despite early and severe illness, phenotype D had the second lowest long-term mortality. Comparison with Sequential Organ Failure Assessment scores showed that the clustering results were not simply a recapitulation of previous acuity assessments. This tool may inform triage decisions and has significant implications for clinical decision support under time constraints and uncertainty.


Introduction
Every year in the United States, more than 36 million hospital admissions occur, with approximately seven hundred thousand in-hospital deaths, nearly one-fourth of which are potentially preventable. 1-3 A significant source of preventable harm during the early stages of hospital admission is the misdiagnosis and under-triage of high-risk patients to general hospital wards. 4,5 In this crucial period, clinicians are required to make a series of decisions involving monitoring, testing, and treatment that can significantly influence the patient's clinical trajectory. 1,6,7 This series of decisions entails analysis of a variety of data representing essential physiologic processes. 6-8 For example, values and trends in vital signs may indicate whether a patient requires intensive monitoring in an intensive care unit (ICU) or if they can be safely transferred to a hospital's general ward. The trajectories of early vital signs may be useful for identifying distinct physiological signatures that are linked to specific patient phenotypes and clinical outcomes.
Unsupervised clustering analyses of vital signs and other clinical variables have shown promise for helping clinicians identify novel clinical phenotypes for sepsis and acute respiratory distress syndrome. 8-11 However, these phenotypes have not been evaluated with large, heterogeneous cohorts that include all hospitalized patients. In addition, broader phenotyping based on vital signs has proven more challenging, in part because sampling occurs at irregular intervals, thus complicating the application of conventional time series analyses and machine learning clustering techniques. In the past decade, however, deep learning has achieved significant success in the healthcare domain, facilitating clinical decision-making with its superior capability to detect intricate patterns in raw clinical data and to approximate highly complex functions. 12-14 Although several advanced deep learning algorithms have been developed to manage irregularly sampled time series data, 15-17 there remains a dearth of work specifically focused on clinical phenotype identification, particularly using early-stage vital sign data.
To fill this gap and address the patient stratification challenge, our study presents a novel deep temporal interpolation and clustering (dTIC) network. This innovative tool is designed to extract latent representations from sparse and irregularly sampled time series vital sign data, and concurrently stratify patients into distinct phenotypes. The dTIC network exhibits considerable potential to effectively facilitate clinical decision-making, offering a promising solution to existing limitations in patient phenotype identification during the critical initial hours of hospital admission.

Data Source and Participants
By using electronic health records (EHR) of 75,762 hospital admissions of 43,598 unique patients representing adults (18 years or older) of all demographics, we created a longitudinal dataset of adult patients in the University of Florida Health's 1,000-bed academic hospital who remained admitted for six hours or longer (including emergency department admission) between June 1, 2014, and April 1, 2016. We excluded patients without sufficient vital sign measurements within six hours of hospital admission, that is, when two or more of the six vital sign measurements (systolic and diastolic blood pressure, heart rate, respiratory rate, temperature, and oxygen saturation) were completely missing (eFigure 1). This study was approved by the University of Florida institutional review board as exempt with waiver of informed consent (IRB201901123). All methods were performed in accordance with relevant guidelines and regulations.

Study design
To mitigate any consequences of dataset drift due to changes in clinical practice or patient population, we adhered to the guidelines of the Type 2b analysis category 18 under the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) and split the dataset chronologically into three cohorts: training (patients admitted from June 1, 2014, to May 31, 2015), validation (patients admitted from June 1, 2015, to October 31, 2015), and testing (patients admitted from November 1, 2015, to April 1, 2016, n=16,845), following the setting of a previous paper. 19 Using the training cohort, we identified acute illness phenotypes by applying unsupervised machine learning clustering to chronologically ordered measurements of patient vital signs from the first six hours of hospital admission. We utilized the validation cohort to select the hyperparameters of our dTIC model.
Within the testing cohort, we assessed phenotype reproducibility by predicting phenotypes and analyzing phenotype frequency distributions and clinical outcomes.

Identifying acute illness phenotypes via early physiologic signatures
We removed outliers from raw time series vital signs and explored distributions, missingness, and correlation (eTable 1 and eFigure 2). In instances where a time-series variable was entirely absent from a patient's record, we imputed the training-cohort mean value, timestamped at hospital admission.
The dTIC network, designed to extract representations from sparse and irregularly sampled time-series data such as vital signs and subsequently derive acute illness phenotypes, incorporates four components: an interpolation model, a sequence transformation model, a reinterpolation model, and a clustering model (Figure 1 and eFigure 3). Initially, the interpolation model 20 converts the raw time-series data into a meta-representation of data sampled at predefined, regularly spaced reference time points. Following this, a sequence-to-sequence (seq2seq) model, equipped with gated recurrent unit layers, 21 learns the features of the interpolated data and encodes them into low-dimensional vectors that form a unifying contextual representation. The decoder within the seq2seq model learns from this context vector and outputs regular time-series data of the same length as its input. Subsequently, a radial basis function network 22 re-interpolates the regular time-series output back to the raw irregular time points. This process generates estimates that can be compared to the observations, providing a measure of network performance. To foster a more comprehensive representation of the time-series data, we employ two auxiliary prediction tasks: 1) predicting the minimum values of systolic and diastolic blood pressure as well as oxygen saturation, along with the maximum values of heart rate, respiratory rate, and temperature, within the seventh hour; and 2) predicting whether a learned representation originated from actual time-series data, by feeding both real and synthetic time series into the model (eMethods). Prior to the clustering process, the dTIC network undergoes pre-training, achieved by minimizing both the mean square error of the reconstructed estimates and the prediction error of the auxiliary tasks.
In addition, we stack a clustering network on top of the aforementioned feature extraction model to perform concurrent representation learning and clustering. The goal is to enhance the alignment of feature representations and cluster assignments. 23 This integrated approach demonstrates significant potential in learning clustering-friendly representations by which objects can be effectively grouped. The initial cluster assignments are obtained using the contextual representation derived from the pre-trained dTIC network and k-means clustering. 24 For a detailed description of our proposed network, readers are referred to eMethods.

Clinical outcomes
With every hospital admission, we extracted information on demographics, 19 clinical biomarkers that are assessed upon admission (eTable 2), acuity scores for both the Sequential Organ Failure Assessment (SOFA) and the Modified Early Warning Score (MEWS), and patient outcomes. 25,26 Data processing details are explained in the eMethods section. The primary outcomes were 30-day mortality and 3-year mortality; the median follow-up duration was 4.3 years, calculated using the reverse Kaplan-Meier method. The secondary outcomes were admission to an intensive care unit (ICU) or intermediate care unit (IMC), mechanical ventilation (MV), acute kidney injury (AKI), sepsis, venous thromboembolism, and renal replacement therapy (RRT).

Statistical methods
To ascertain the optimal number of phenotypes with the dTIC approach, we evaluated a combination of phenotype size, Davies-Bouldin index, 27 silhouette score, 28 elbow method, 29 and gap statistic method. 30 Once the optimal phenotype number was ascertained, patterns of vital signs were visualized by using t-distributed stochastic neighbor embedding (t-SNE) plots, ranked plots that show phenotype pairwise mean standardized differences, line plots with 95% confidence intervals, alluvial plots, and chord diagrams (see eMethods for a comprehensive description).
We assessed the reproducibility of derived phenotypes by examining their frequency distributions and associated clinical characteristics in the testing cohort. Phenotype assignments in the testing cohort were predicted from the minimum Euclidean distance between each patient's representation and the training-cohort phenotype centroids (eMethods).
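As a sketch of this assignment rule, the snippet below (a minimal illustration, not the study's code; the function name and toy centroids are hypothetical) labels each patient with the phenotype whose training-cohort centroid is nearest in Euclidean distance:

```python
import numpy as np

def assign_phenotypes(embeddings, centroids):
    """Assign each patient to the phenotype whose training-cohort
    centroid is nearest in Euclidean distance.

    embeddings: (n_patients, d) latent representations
    centroids:  (k, d) phenotype centroids from the training cohort
    """
    # Pairwise distances, shape (n_patients, k)
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Toy example with 2-D embeddings and two centroids
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
patients = np.array([[1.0, 0.5], [9.0, 11.0]])
print(assign_phenotypes(patients, centroids))  # → [0 1]
```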
To compare phenotypes, we used the χ2 test for categorical variables and analysis of variance as well as the Kruskal-Wallis test for continuous variables. We used Kaplan-Meier curves to illustrate overall survival and the log-rank test to compare overall survival. Adjusted hazard ratios (HR) were compared across all phenotypes using Cox proportional-hazards regression while controlling for demographics, comorbidities, and acuity score at admission.
Using the Bonferroni correction, all p values were adjusted for multiple comparisons. To ensure that the phenotypes did not simply recapitulate existing acuity scores, we used alluvial plots and chord diagrams to compare the phenotypes to patients' SOFA scores within 24 hours of hospital admission. Python version 3.7 and R version 3.5.1 were used to perform analyses.
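The Bonferroni correction described above amounts to multiplying each p value by the number of comparisons and capping the result at 1; a minimal sketch (not the study's code):

```python
import numpy as np

def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p value by the number of
    comparisons and cap at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

# Three hypothetical raw p values from pairwise phenotype comparisons
print(bonferroni([0.01, 0.04, 0.30]))
```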

Patients
All three cohorts were comparable in clinical characteristics, biomarker distributions, and outcomes (eTables 3 and 4). Across the cohorts, sex was equally distributed and the patients' average age was 54 years. Nearly two-thirds of the admissions were urgent admissions, 18% of the patients were transfers from other hospitals, 27% were admitted to the hospital's ICU or IMC, and 28% underwent surgery while admitted. Of the 27% of patients who were admitted to the ICU or IMC, 22%-27% had high SOFA (>6) or MEWS (>4) scores at admission. Of the 73% of patients who were admitted to hospital wards, only 2%-3% had high acuity scores. For all cohorts, the 30-day mortality rate was 4% and the 3-year mortality rate was 19%.

Derivation and characteristics of phenotypes
In the training cohort, the dTIC model determined an optimal fit with a four-class model, optimizing a combination of metrics including phenotype size, Davies-Bouldin index, silhouette score, and the elbow and gap statistic methods (eFigure 4 and eTable 5). These four phenotypes were associated with distinct pathophysiological signatures and clinical outcomes (Tables 1 and 2, eTables 6 and 7, Figure 2). The phenotypes were labeled A (18% of the cohort), B (33%), C (31%), and D (17%) in order of descending systolic blood pressure (Figure 2A).

Phenotype A. Phenotype A had the greatest burden of comorbid disease, such as hypertension (54%) and cardiovascular disease (32%), and the highest proportions of African American patients (27% vs. 17%-25% in other phenotypes) and emergent admissions (95%). Phenotype A had the highest rates of prolonged respiratory insufficiency (9% received MV, 58% of whom received ventilator support for more than two days), AKI (21%), sepsis (14%), and three-year mortality (25%). Patients in phenotype A had the second greatest incidence of admission to ICU/IMC (35%), hospital mortality (4%), and 30-day mortality (6%).

Phenotype B. Phenotype B exhibited physiological signatures similar to those of phenotype A but displayed a diffuse pattern of mild organ dysfunction with persistent, uncorrected blood pressure abnormalities during the first six hours. Phenotype B had favorable short-term clinical outcomes, manifested by the second lowest rates of ICU/IMC admission (19%), AKI (17%), sepsis (8%), hospital mortality (2%), and 30-day mortality (3%). Phenotype B nonetheless had the second highest rate of three-year mortality.
Phenotype C. Phenotype C exhibited low early physiological derangement with a diffuse pattern of mild organ dysfunction. Phenotype C exhibited favorable clinical outcomes, which manifested as the lowest rate of ICU/IMC admission (19%), AKI (13%), sepsis (6%), hospital mortality (1%), 30-day mortality (2%), and three-year mortality (15%). They had the second highest rate of surgery within 24 hours of admission (27%) but similar rates of admission to wards as phenotype B.
Phenotype D. Phenotype D was characterized by early and persistent hypotension, a high incidence of vasopressor support (53%), and the highest proportion requiring early surgery (57%). Phenotype D had substantial biomarker evidence of inflammation, including the highest median white blood cell count (11×10⁹/L vs. 9×10⁹/L in other phenotypes), premature neutrophils (15% vs. 7%-12% in other phenotypes), and C-reactive protein (39 mg/L vs. 15-20 mg/L in other phenotypes), and the lowest median lymphocyte percentage (10%). Phenotype D had the highest rates of ICU/IMC admission (46%), MV (19%), hospital mortality (5%), and 30-day mortality (6%), and the second highest incidence of AKI (19%) and sepsis (12%). Despite early and severe illness, phenotype D had favorable long-term outcomes with the second lowest three-year mortality (18%).

Figure 2. Visualization of phenotypes. (B) The t-SNE technique was utilized to reduce the original 128-dimensional vital sign representations to two dimensions. Each dot signifies an individual patient, with separate colors indicating different phenotypes. (C) Visualization of final phenotypes, as assigned by the deep temporal interpolation and clustering network utilizing the t-SNE technique. The network simultaneously learns feature representations and cluster assignments, thus facilitating clustering-friendly representation learning.

Patterns of vital signs
In order to determine which vital signs had the most notable effect on cluster designations, we compared the standardized mean differences between pairs of phenotypes (Figure 3). The smallest contributors to the differences between phenotypes were temperature and oxygen saturation. Respiratory rate and heart rate differed considerably across all phenotype pairs except phenotypes C and D, which instead differed mainly in temperature.

Figure 3. Contributions of vital signs to cluster assignments.
The pairwise phenotype comparisons of vital sign values, which have been standardized to a mean of 0 and standard deviation of 1. The comparison reveals that temperature and oxygen saturation contribute the least to differences between phenotypes. Conversely, respiratory rate and heart rate exhibit considerable variation across all phenotypes, with the exception of phenotypes C and D, where these vital signs appear more consistent. Temp: temperature; SpO2: peripheral capillary oxygen saturation; DBP: diastolic blood pressure; SBP: systolic blood pressure; RR: respiratory rate; HR: heart rate.

Relationship with organ support
The association between phenotypes and the highest SOFA score recorded within 24 hours of admission is depicted in eFigure 5; the SOFA components for each phenotype are depicted with chord diagrams in eFigure 6. The highest percentages of patients with cardiovascular and respiratory dysfunction were found in phenotypes C and D; however, every phenotype included substantial percentages of patients across the entire spectrum of SOFA scores and component subscores, indicating that the clustering into phenotypes did not simply restate prior SOFA acuity assessments.

Relationship with survival probabilities
Three-year survival adjusted for demographics and comorbidities (Figure 4, eFigures 7A

Reproducibility
In the training and testing cohorts, the percentage of patients in each phenotype was stable (phenotype A: 18% and 18%; phenotype B: 33% and 35%; phenotype C: 31% and 31%; phenotype D: 17% and 16%). Phenotypes were reproducible in the testing cohort. Within the testing cohort, the phenotypes were similar to the training cohort in terms of clinical characteristics, biomarkers, and patient outcomes (eFigures 11 and 12, eTables 8 and 9). Across the training and testing cohorts, there were similar distributions of SOFA scores, survival scores, and diagnosis groups (eFigures 7 and 13-17).

Evaluation of representation learning of deep interpolation network
The efficacy of the dTIC network in learning cluster-friendly feature representations from sparse, irregularly sampled time series data was clearly depicted through a visual representation of acute illness phenotypes using t-SNE (Figures 2B and 2C).
The reconstruction error of physiologic signatures, presented in eFigures 18 and 19, demonstrates that the dTIC network can accurately regenerate the input physiologic signatures across all vital signs; the disparity between the observed and reconstructed data was negligible.

Discussion
To the best of our knowledge, this is the first instance of using deep interpolation network promising, remains largely theoretical until it is evaluated in a prospective clinical trial setting.

Conclusions
In this study, we developed and evaluated a novel deep temporal interpolation and clustering network for extracting latent representations from sparse and irregularly sampled time-series data, specifically vital sign measurements obtained during the first six hours of hospital admission, and identified four distinct patient phenotypes. Each phenotype exhibited unique pathophysiological signatures and associated clinical outcomes, and did not simply recapitulate existing acuity assessments such as the SOFA score. Our algorithm has the potential to significantly enhance early clinical decision-making, such as triage decisions, especially in situations where data availability is limited. Future efforts will focus on incorporating this model with historical patient data and additional elements from the EHR, along with external validation of these findings in clinical trials.

Author contributions
AB conceived the original idea for the study, and sought and obtained funding. TOB and AB had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Analyses: YR and YL. Interpretation of data: All authors.
The article was written by YR, YL, TL, TOB, and AB with input from all coauthors. All authors participated in critically revising the manuscript for important intellectual content and gave final approval of the version to be published. AB and TOB served as senior authors. AB is guarantor for this article.

Competing interests
The authors declare no competing interests.

Data Availability
The vitals data used in this analysis include both date and time stamps. In order to prevent patient privacy compromises because of identifiers in the data, our data cannot be publicly shared in a repository. Upon reasonable request, data can be shared by the University

Study design
We non-randomly split the dataset by admission dates into three cohorts: training, validation, and testing. To determine acute illness phenotypes using early physiologic signatures, we derived the clinical phenotypes by applying unsupervised clustering methods to the repeated measurements of six vital signs available within the first six hours of hospital presentation in the training cohort. We selected the hyperparameters of the clustering model using the validation cohort. We assessed phenotype reproducibility by predicting phenotypes in the testing cohort and assessing phenotype frequency distributions and clinical outcomes.

B. Approach to preprocessing electronic health record (EHR) data
EHR vital sign data elements in our cohort studies were irregularly sampled time series. Prior to applying clustering algorithms, we excluded outliers based on expert-defined ranges (eTable 1). For time series that were missing entirely, that is, when a variable had no measurements within the plausible range during the hospitalization, we assigned the starting point (time t=0) of the time series the mean value of the corresponding variable in the training cohort, as listed in eTable 1. We standardized values using a Min-Max scaler (eTable 1). We input these irregularly sampled time series directly to our clustering algorithm without any time interval imputation.
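A minimal sketch of this preprocessing, under an assumed plausible range and training-cohort mean for one vital sign (the real values come from eTable 1; the constants below are hypothetical):

```python
import numpy as np

# Hypothetical training-cohort statistics for heart rate;
# the study's actual values are listed in eTable 1.
TRAIN_MEAN_HR = 80.0          # assumed mean used when a series is absent
PLAUSIBLE_HR = (20.0, 300.0)  # assumed expert-defined plausible range

def preprocess_series(times, values, mean_value, lo, hi):
    """Drop out-of-range outliers; if nothing remains, fall back to the
    training-cohort mean anchored at admission (t = 0); then min-max scale."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    keep = (values >= lo) & (values <= hi)
    times, values = times[keep], values[keep]
    if values.size == 0:                      # series entirely missing
        times, values = np.array([0.0]), np.array([mean_value])
    scaled = (values - lo) / (hi - lo)        # min-max scaling to [0, 1]
    return times, scaled

# All observations implausible -> falls back to the admission-time mean
t, v = preprocess_series([0.5, 2.0], [999.0, 1200.0], TRAIN_MEAN_HR, *PLAUSIBLE_HR)
print(t, v)
```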

C. Deep Interpolation Network
In this section, we describe our proposed Deep Temporal Interpolation and Clustering Network (dTIC) for clustering patients based on their vital sign data during the early stages of hospital admission. Using the raw, sparse, and irregularly sampled time series of vital signs as input, dTIC automatically extracts a unified and abstract representation of the entire time series of an encounter and clusters the patients in an end-to-end unsupervised manner. The overall network architecture consists of four main components: an interpolation model, a Seq2Seq model, a re-interpolation model, and a clustering model. dTIC learns the feature representations and cluster assignments in two passes. In the first pass, dTIC learns the feature representations, determines the optimal number of clusters, and thereby generates the initial cluster assignments. In the second pass, dTIC simultaneously learns feature representations and cluster assignments to refine the results. Figure 1a provides a schematic representation of the dTIC architecture for the feature learning process in the first pass. We first interpolate the raw time-series vital sign data to a regularly sampled meta-representation with pre-defined reference time points via an interpolation model [1]. Then we feed the interpolated time-series data into a Seq2Seq model with Gated Recurrent Unit (GRU) [2] layers, whose encoder embeds the features and extracts a unified context vector lying in a low-dimensional feature space. The context vector contains the global time-series information and is further used by downstream tasks (e.g., clustering, classification). The decoder in the Seq2Seq model learns from the context vector and outputs time-series data of the same length as the Seq2Seq model's input.
Then, we deploy a radial basis function network-based model [3] to re-interpolate the fixed-length output back to the raw irregular time points, reconstructing the raw vital sign data at the corresponding time points. To enhance the feature representation, we also add an auxiliary prediction task via a linear neural network, in which stacked real and synthetic time series are input to the interpolation network and a classifier predicts whether the learned context vector comes from real data. The full feature extraction model is trained end to end by minimizing the reconstruction loss, measured with mean square error (real data only), and the classification loss, measured with binary cross-entropy. The extracted feature representation (context vector) is then used to determine the optimal number of clusters and obtain the initial cluster assignment. Figure 1b provides the overall schematic representation of the dTIC architecture for the joint feature representation learning and cluster assignment process. In addition to the feature extraction model, a clustering network is added in the second learning pass. The clustering network computes a soft assignment between the embedded points and the cluster centroids and matches the soft assignment to a target distribution in order to simultaneously improve the cluster assignments and the feature representation. [4] The full dTIC model is trained end to end by minimizing the reconstruction loss, measured with mean square error (real data only), the classification loss, measured with binary cross-entropy, and the clustering loss, measured with the Kullback-Leibler (KL) divergence between the embedded distribution and a target distribution. We describe the components of dTIC in detail in the following subsections.

Interpolation Model
Time series vital sign data in electronic health records are commonly both sparse and irregularly sampled, meaning that large and irregular intervals exist between observation time points. Such sparsity and irregularity pose a significant challenge for machine and deep learning techniques that analyze these crucial vital sign data to improve health outcomes. To deal with this problem, we first adopt the network proposed by Shukla and Marlin [1] to interpolate the raw time-series data to a regularly sampled meta-representation with pre-defined reference time points.
In our study, we utilize multivariate time series data for six vital signs: systolic and diastolic blood pressure, heart rate, temperature, SpO2, and respiratory rate. Take one of the six variables as an example. For one patient, the raw time series is denoted as s = {(t_i, v_i) | i = 1, …, n}, where n is the total number of observations, t_i is the observation time point, and v_i is the corresponding observed value. The time intervals between adjacent observation time points vary considerably. The interpolation model maps the irregular series s to regular time-series data defined at the reference time points r = [r_1; …; r_T] with evenly spaced intervals.
The interpolation model consists of two layers, where the first layer separately performs the interpolation for each variable and the second layer aggregates information across all the studied variables. The model generates three different channel groups at each reference time point, which respectively represent smooth trends, short-timescale transients, and local observation frequencies. The interpolation model enables a single observed data point to be considered by all the reference time points and allows information to be shared across multiple variables. For the detailed mathematical notation of the interpolation, the reader is referred to [1].
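A one-variable sketch of this idea follows, emitting a smooth-trend channel and a local observation-intensity channel; fixed RBF bandwidths stand in for the learnable parameters of the full model, and all names and values are illustrative:

```python
import numpy as np

def interpolate_channels(t_obs, v_obs, t_ref, alpha_smooth=0.5, alpha_local=5.0):
    """Sketch of the per-variable interpolation layer: at each regularly
    spaced reference time point it emits a smooth trend (wide RBF
    bandwidth) and a local observation intensity (sum of narrow-bandwidth
    weights). In the full model the bandwidths are learned; here they are
    fixed constants."""
    t_obs, v_obs, t_ref = map(np.asarray, (t_obs, v_obs, t_ref))
    d2 = (t_ref[:, None] - t_obs[None, :]) ** 2   # squared time distances
    w = np.exp(-alpha_smooth * d2)
    smooth = (w * v_obs).sum(axis=1) / w.sum(axis=1)
    intensity = np.exp(-alpha_local * d2).sum(axis=1)
    return smooth, intensity

# Heart-rate values observed at irregular times over the first six hours,
# interpolated onto an hourly reference grid:
t_obs = [0.3, 1.7, 4.2, 5.9]
v_obs = [80.0, 95.0, 90.0, 85.0]
smooth, intensity = interpolate_channels(t_obs, v_obs, np.arange(0.0, 7.0))
print(smooth.round(1))
print(intensity.round(2))
```

Every raw observation contributes to every reference point, with nearby observations dominating; the intensity channel tells the downstream network how densely sampled each region of the timeline is.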

Seq2Seq Model
With the interpolated time-series data as the input, we develop a Seq2Seq model to learn a low-dimensional representation that embeds the contextual information over the full timeline. The Seq2Seq model is an encoder-decoder framework that maps an input sequence to an output sequence, and it is broadly used in machine translation, text summarization, conversational modeling, and other tasks. With a single-layer GRU network [2] as the encoder, the input sequence is encoded into a fixed-length contextual vector h, which is the hidden state of the last time step. The GRU updating mechanism, illustrated in the following equation, ensures that every internal hidden state is calculated from the previous state h_{t-1} and the current time-step input (r_t, x_t):

h_t = GRU((r_t, x_t), h_{t-1})

A single-layer GRU network is also used for the decoder. At each time step, the decoder updates its current hidden state s_t with the concatenated features incorporating the previous decoded output y_{t-1} and the global context vector h as the input:

s_t = GRU([y_{t-1}; h], s_{t-1})
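The GRU hidden-state update can be illustrated with a minimal numpy cell. This is a sketch, not the study's code: the weights are random stand-ins for learned parameters, and the input/hidden dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell implementing the standard update gate, reset gate,
    and candidate-state equations; weights are random stand-ins."""
    def __init__(self, d_in, d_hid):
        s = 0.1
        self.Wz = rng.normal(0, s, (d_hid, d_in + d_hid))  # update gate
        self.Wr = rng.normal(0, s, (d_hid, d_in + d_hid))  # reset gate
        self.Wh = rng.normal(0, s, (d_hid, d_in + d_hid))  # candidate

    def step(self, x, h_prev):
        xh = np.concatenate([x, h_prev])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h_prev]))
        # Convex mix of previous state and candidate state
        return (1 - z) * h_prev + z * h_tilde

# Encode an interpolated 6-channel sequence into a context vector:
cell = GRUCell(d_in=6, d_hid=8)
h = np.zeros(8)
for x in rng.normal(size=(7, 6)):   # 7 reference time points, 6 vitals
    h = cell.step(x, h)
print(h.shape)  # context vector fed to the decoder and clustering head
```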

Re-Interpolation Model
To learn a useful representation in an unsupervised fashion, a common strategy is to build an autoencoder by reconstructing the input itself from the extracted bottleneck representation. Therefore, on top of the Seq2Seq model, we develop a re-interpolation network to map the output with evenly spaced intervals back to the raw irregular time points. As with the interpolation model, the transformation is based on a radial basis function network. Our re-interpolation model allows the embedded value at every reference time point r_j to make a continuous contribution to the reconstructed values at all the raw time points t_i, with a contribution weight that decays exponentially with the squared distance between the reference time point and the target time point:

w(t_i, r_j) = exp(-α (t_i - r_j)^2)

After the re-interpolation, we can easily calculate the mean square error at every input time point, and minimizing this reconstruction loss serves as the learning objective of the DIN model. It is worth noting that the interpolation, Seq2Seq, and re-interpolation models in the DIN are jointly optimized. Compared with the work in [2], this effectively improves the model's learning capacity and allows the clustering representation to contain more global information across the full timeline. After model training, we also visualize the reconstruction performance on the test cohort to verify the model's learning capacity.
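A minimal numpy sketch of the re-interpolation step, assuming a fixed bandwidth α in place of a learned one; the decoder outputs and "observed" values below are fabricated for illustration:

```python
import numpy as np

def reinterpolate(t_ref, y_ref, t_raw, alpha=1.0):
    """Map decoder outputs at regular reference points back to raw
    irregular time points; each reference point contributes with a
    weight decaying exponentially in squared time distance (RBF kernel)."""
    t_ref, y_ref, t_raw = map(np.asarray, (t_ref, y_ref, t_raw))
    w = np.exp(-alpha * (t_raw[:, None] - t_ref[None, :]) ** 2)
    return (w * y_ref[None, :]).sum(axis=1) / w.sum(axis=1)

t_ref = np.arange(0.0, 7.0)                          # regular decoder grid
y_ref = np.array([0.5, 0.6, 0.55, 0.7, 0.65, 0.6, 0.58])
t_raw = np.array([0.3, 1.7, 4.2, 5.9])               # original observation times
y_hat = reinterpolate(t_ref, y_ref, t_raw)

# Mean square error against hypothetical raw observations drives training
y_obs = np.array([0.52, 0.61, 0.66, 0.59])
print(y_hat, np.mean((y_hat - y_obs) ** 2))
```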

Clustering Model
Taking the low-dimensional features generated in the first pass, we determine the optimal number of clusters using a combination of phenotype size, Davies-Bouldin index (DBI) [5], silhouette score [6], the elbow method [7], and the gap statistic method [8]. Once the number of clusters is determined, any standard clustering algorithm can be used to derive initial cluster centroids; in our case, we apply the centroid-based classical k-means clustering.
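A sketch of this model-selection loop on toy data (the latent vectors and planted clusters below are synthetic stand-ins for the learned context vectors):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Toy latent space with 4 planted clusters standing in for context vectors
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(silhouette_score(X, labels), 3),       # higher is better
          round(davies_bouldin_score(X, labels), 3))   # lower is better
```

On such data the silhouette score peaks and the Davies-Bouldin index bottoms out at the planted number of clusters, which is how the combination of criteria points to a single k.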
We improve the feature representations and cluster assignments with a clustering network [4]. The clustering network alternates between 1) computing a soft assignment between the embedded points and the cluster centroids by measuring the similarity between them, and 2) minimizing the KL divergence between the soft assignment and an auxiliary target distribution. For a more detailed description of the clustering model, the reader is referred to [4].
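The soft assignment and target distribution can be sketched as follows, using the Student's t kernel and squared-and-renormalized target common to deep embedded clustering; the toy points and centroids are illustrative only:

```python
import numpy as np

def soft_assign(z, mu, nu=1.0):
    """Student's t soft assignment between embedded points z (n, d) and
    cluster centroids mu (k, d), as in deep embedded clustering."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary target: square q and renormalize, emphasizing
    high-confidence assignments."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    # Sum of row-wise KL divergences; this is the clustering loss
    return np.sum(p * np.log(p / q))

z = np.array([[0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [4.9, 5.0]])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
q = soft_assign(z, mu)
p = target_distribution(q)
print(q.round(3))
print(kl_divergence(p, q))
```

In the full network, gradients of this KL loss flow back through the encoder, so the embedding and the centroids are refined together.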

Classifier
To enhance the feature representation learning, we add two auxiliary prediction tasks: 1) predicting the maximum or minimum vital sign value within the next hour; and 2) predicting whether a learned representation comes from real time series data, after feeding both real and synthetic time series into the model. Specifically, we predict the minimum systolic and diastolic blood pressure and peripheral capillary oxygen saturation, and the maximum heart rate, respiratory rate, and temperature within the next hour. We generate the synthetic time series by randomly replacing the values at 50% of the time points. It is worth noting that the synthetic time series are only used for classification and are not counted in the optimization of the reconstruction loss and clustering loss. We use a linear neural network to perform the prediction tasks.
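A sketch of the synthetic-series generation; the source does not specify the replacement distribution, so here (as an assumption) replacement values are resampled from the same series:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_synthetic(values, frac=0.5):
    """Create a corrupted copy of a time series by replacing the values at
    a random half of the time points; replacements are drawn from the
    series itself (a stand-in for the unspecified scheme in the paper)."""
    values = np.asarray(values, dtype=float)
    out = values.copy()
    idx = rng.choice(values.size, size=int(values.size * frac), replace=False)
    out[idx] = rng.choice(values, size=idx.size)   # replaced values
    return out

real = np.array([80.0, 82.0, 85.0, 90.0, 88.0, 86.0])
fake = make_synthetic(real)
print(real)
print(fake)   # the classifier must tell this apart from the real series
```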

D. Data visualization
• Chord plots. A chord diagram is widely used to represent connections and relationships between entities. We generated two sets of chord diagrams to visualize the patients' distribution with respect to the studied variables.
One set of chord diagrams visualizes the distribution of phenotypes across the worst SOFA scores of the six organ systems within the first 24 hours of admission: respiration, coagulation, liver, cardiovascular, central nervous system, and renal. For each organ system, we calculated the percentage of patients with organ dysfunction, that is, with a SOFA score of 2 or more. For each phenotype, the larger the percentage of patients with a higher score for that organ system, the broader the ribbon. Phenotypes are shown in separate colors.
The other set of chord diagrams visualizes the distribution of the nine most common admission diagnosis groups by phenotype. These most common admission diagnosis groups vary across cohorts. For each phenotype, the larger the percentage of patients with that admission diagnosis group, the broader the ribbon. Phenotypes are shown in separate colors. Diagrams were generated with the circlize R package [9].
• Alluvial plots. Alluvial plots were generated to visualize the distribution of phenotypes across the worst Sequential Organ Failure Assessment (SOFA) scores of patients within the first 24 hours of admission. Phenotypes are grouped in the left column, and total SOFA scores, categorized into three levels (0-1, 2-4, and 5+), are listed in the right column. Ribbons connect phenotypes and SOFA categories, indicating the percentage of patients in a phenotype that fall into a particular SOFA category and vice versa. The larger the percentage of patients, the broader the ribbon. Phenotypes are shown in separate colors. Plots were generated with the alluvial R package [10].
• t-SNE plots. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique well suited for embedding high-dimensional data in a low-dimensional space for visualization. In our work, the t-SNE plots depict the two-dimensional feature space of the patients' vital signs after reducing the original dimension from 36 to 2 with the t-SNE algorithm. Each dot represents a patient, and patients in different phenotypes are colored differently. Plots were generated with the scikit-learn t-SNE Python implementation [11].
• Line plots. Line plots were generated to visualize the time-series vital sign data, as they are well suited for analyzing trends of different variables over time. We created a line plot for each vital sign studied in our work. To better observe the trends, for each encounter we resampled the raw time series to a 5-minute frequency by averaging multiple measurements within each 5-minute window. Line plots by phenotype for each vital sign were then created by plotting the mean value and the 95% confidence interval around the mean. The six vital signs are systolic and diastolic blood pressure, heart rate, respiratory rate, peripheral capillary oxygen saturation, and temperature. Phenotypes are shown in separate colors. Plots were created with the Seaborn lineplot Python package [12].
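The per-encounter 5-minute resampling step can be sketched with pandas (toy data; the encounter timestamps and heart-rate values below are illustrative). Seaborn's lineplot would then aggregate the resampled series across encounters within each phenotype to draw the mean and 95% confidence interval.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Irregularly sampled heart-rate measurements for one encounter (toy data)
minutes = np.sort(rng.uniform(0, 60, size=40))
times = pd.to_datetime("2020-01-01") + pd.to_timedelta(minutes, unit="m")
hr = pd.Series(80 + rng.normal(0, 5, size=40), index=times, name="heart_rate")

# Average multiple measurements within each 5-minute window
hr_5min = hr.resample("5min").mean()
```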

E. Predicting cluster members in new datasets
In the testing cohort, we used a prospective approach to assign phenotype membership to each subject based on the clinical characteristics of typical cluster members in the training cohort.
To accomplish this, we first preprocessed the data using the procedure above (B). We then predicted phenotype assignments by calculating the Euclidean distance from each testing cohort admission to the centroid of each phenotype from the training cohort. Consider the ith subject with p features, represented as x_i = [x_{i1}, x_{i2}, ⋯, x_{ip}]. We denote the mean of the kth phenotype as μ_k = [μ_{k1}, μ_{k2}, ⋯, μ_{kp}] and treat it as the center of that phenotype. The Euclidean distance of the ith admission to the center of the kth phenotype, d_{ik}, is then d_{ik} = sqrt( Σ_{j=1}^{p} (x_{ij} − μ_{kj})² ). We calculated the distances of all admissions to all phenotype centroids and assigned each admission to its nearest phenotype.
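The nearest-centroid assignment can be sketched directly in NumPy (the centroids and test points below are toy values):

```python
import numpy as np

def assign_phenotypes(X_test, centroids):
    """Assign each test admission to the nearest training-cohort phenotype
    centroid by Euclidean distance d_ik = ||x_i - mu_k||."""
    d = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0]])
X_test = np.array([[0.2, 0.1], [4.8, 5.2], [0.3, 4.7]])
labels = assign_phenotypes(X_test, centroids)
```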

F. Definition of clinical characteristics
Chronic disease burden was characterized by Charlson-Deyo comorbidity index scores [13]. Chronic kidney disease was determined from medical histories obtained prospectively at the time of enrollment and from a validated combination of International Classification of Diseases codes from electronic health records [14]. Severity of illness was characterized by SOFA and Modified Early Warning Score (MEWS) based on the worst values within the first 24 hours of hospital admission [15]. Missing SOFA and MEWS scores were imputed with 0.
Measurements of clinical biomarkers that fell outside expert-defined ranges were considered outliers and removed from the data. All measurements within 24 hours of hospital admission were used to detect the highest or lowest value. The outlier ranges and the directionality of worst values are listed in eTable 2. Only results among patients with available measurements are reported. We present continuous variables as means (SD) and medians with interquartile ranges, and categorical variables as frequencies and percentages.
For blood pressure, invasive measurements were used; in the absence of an invasive measurement at a specific date and timestamp, noninvasive measurements were used. The duration of blood pressure below a given cutoff was determined in minutes after forward-propagating previous values. We identified the number of vasopressors and the need for inotropes in the first 24 hours of admission from the medications file, where dopamine, droxidopa, midodrine, ephedrine, epinephrine, norepinephrine, phenylephrine, and vasopressin were considered vasopressors, and dobutamine and milrinone inotropes. Troponin measurements include troponin T and troponin I. To determine the FiO2 value at each date and timestamp, formulas were used to impute FiO2 from the oxygen delivery device and the corresponding oxygen flow rate [15]. If no oxygen flow rate was given, a default FiO2 was imputed based on the respiratory device. If the oxygen flow rate was outside the specified range, the minimum or maximum flow rate was used to impute FiO2. If the formula result was greater than the maximum per-device FiO2, the maximum FiO2 was imputed. In the absence of PaO2 to calculate the PaO2/FiO2 ratio, an SpO2/FiO2 to PaO2/FiO2 conversion was used [15,16].
To determine reference creatinine, we used a previously validated modification of the NHS England alert algorithm [17]. For patients with available preadmission measurements, the reference value was defined as either the lowest value in the last 7 days or the median of values from the preceding 8 to 365 days, depending on the availability of previous results. For patients with no available preadmission measurements and no history of chronic kidney disease (CKD), we used the lower of the admission creatinine and a baseline creatinine estimated with the Modification of Diet in Renal Disease (MDRD) Study equation, assuming a baseline estimated glomerular filtration rate (eGFR) of 75 ml/min per 1.73 m². For patients with a known history of CKD and no available preadmission measurements, we used the lowest creatinine value on the admission day. After the first seven days of hospitalization, the minimum serum creatinine measurement in the preceding 7 days was used as the reference creatinine. Reference creatinine was used to estimate the preadmission reference glomerular filtration rate with the Chronic Kidney Disease Epidemiology Collaboration equation [18]. CKD stages were determined from the reference eGFR according to guidelines [19,20].
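The reference-creatinine selection for the admission period can be sketched as a cascade; the function signature is illustrative, and the MDRD back-calculation itself (here a precomputed input) is not reproduced.

```python
import statistics

def reference_creatinine(last_7d, days_8_365, admission_scr, has_ckd, mdrd_scr):
    """Cascade for the reference-creatinine definition described above.
    mdrd_scr is a precomputed baseline creatinine back-calculated from the
    MDRD equation assuming eGFR = 75 ml/min per 1.73 m^2; admission_scr is
    the lowest admission-day value."""
    if last_7d:                            # lowest value in the last 7 days
        return min(last_7d)
    if days_8_365:                         # median of values from days 8-365
        return statistics.median(days_8_365)
    if not has_ckd:                        # no preadmission values, no CKD
        return min(admission_scr, mdrd_scr)
    return admission_scr                   # known CKD: admission-day value
```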

Diagnosis codes
We determined the category of admission diagnosis codes, which are assigned as either International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) or International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes. We used general equivalence mappings to assist with the conversion of ICD-10-CM codes to ICD-9-CM codes [21]. The Clinical Classifications Software (CCS) [19] consists of two related classification systems, single-level and multi-level, designed to meet different needs. We used the multi-level CCS, which expands the single-level CCS into a hierarchical system and enables evaluating larger aggregations of conditions and procedures or exploring them in greater detail. The multi-level system has four levels for diagnoses and three levels for procedures, providing the opportunity to examine general groupings or to assess very specific conditions and procedures. We show the distribution of the most common Level 1 and Level 2 codes for each cluster, as well as the distribution of all admission diagnosis codes present in at least 1% of patients.

G. Definition of clinical outcomes
We determined complications occurring at any time during hospitalization, including infectious and mechanical wound complications (wound complications), acute kidney injury (AKI), mechanical ventilation (MV) and intensive care unit (ICU) admission for greater than 48 hours, cardiovascular (CV) complications, neurological complications and/or delirium, sepsis, and venous thromboembolism (VTE). We used exact dates and times to calculate the durations of MV, ICU stay, and hospital stay. To determine the duration of invasive mechanical ventilation, we developed an algorithm that identifies the start and stop times of ventilation from flowsheet data. A patient was determined to be on mechanical ventilation at a time point if the respiratory device was recorded as a ventilator or endotracheal tube (ETT), or if there was a recorded measurement value for tidal volume, end-tidal carbon dioxide (etCO2), positive end-expiratory pressure (PEEP), mechanical respiratory rate, or ventilator mode. We identified the need for vasopressors or inotropes (dobutamine, dopamine, droxidopa, midodrine, milrinone, ephedrine, epinephrine, norepinephrine, phenylephrine, or vasopressin) during hospitalization as a binary variable based on detailed medication records. AKI was determined using available clinical information according to the Kidney Disease: Improving Global Outcomes criteria (a 0.3 mg/dl increase in serum creatinine within 48 hours, a 50% increase from baseline within seven days, or a decrease in urine output to less than 0.5 ml/kg/hr for six hours) [19]. Community-acquired AKI was defined as the development of AKI within 24 hours of hospital admission. Delirium was defined as at least one positive Confusion Assessment Method (CAM) score or the presence of ICD-9 or ICD-10 codes for delirium. ICD-9-CM and ICD-10-CM codes were used to identify the remaining complications [22][23][24][25][26].
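The mechanical-ventilation detection rule can be sketched as below; the flowsheet field names and device strings are assumed for illustration.

```python
# Hypothetical flowsheet field names; the actual schema is EHR-specific.
MV_DEVICES = {"ventilator", "endotracheal tube", "ETT"}
MV_MEASUREMENTS = ("tidal_volume", "etco2", "peep",
                   "mechanical_respiratory_rate", "ventilator_mode")

def on_mechanical_ventilation(record):
    """True if the flowsheet record at a time point indicates MV: the
    respiratory device is a ventilator/ETT, or any ventilator-specific
    measurement has a recorded value."""
    if record.get("respiratory_device") in MV_DEVICES:
        return True
    return any(record.get(m) is not None for m in MV_MEASUREMENTS)
```

Consecutive time points flagged by this rule would then be merged into start/stop intervals to compute the MV duration.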
The date of death was determined using hospital records, and the Social Security Death Index database was used to confirm death dates and to obtain death dates for subjects not found in hospital records. Thirty-day and three-year mortality were defined as death within thirty days or three years of hospital admission, respectively.

eFigure 1. Cohort selection and exclusion criteria
eFigure 2. Spearman correlation heat map for the training cohort (N=41,502)
The Spearman correlation heat map shows the pairwise Spearman rank-order correlation coefficients among the 6 vital signs studied in our paper. The darker the red color, the higher the positive correlation.

(A) Elbow approach showing the optimal number of clusters. The value of k at the "elbow", the point after which the distortion starts decreasing in a linear fashion, suggests the optimal number of clusters. (B) Gap statistic approach showing the optimal number of clusters. A higher gap statistic value suggests the optimal number of clusters.

eFigure 5. Alluvial plot showing distribution of phenotypes across worst SOFA scores of patients within first 24 hours of admission in the training cohort
For each phenotype, the larger the percentage of patients with that score, the broader the ribbon.