Delineating COVID-19 subgroups using routine clinical data identifies distinct in-hospital outcomes

The COVID-19 pandemic has been a great challenge to healthcare systems worldwide. It highlighted the need for robust predictive models that can be readily deployed to uncover heterogeneities in disease course, aid decision-making and prioritise treatment. We adapted an unsupervised data-driven model, SuStaIn, for use in short-term infectious diseases such as COVID-19, based on 11 commonly recorded clinical measures. We used 1344 patients from the National COVID-19 Chest Imaging Database (NCCID), hospitalised for RT-PCR confirmed COVID-19 disease, splitting them equally into a training and an independent validation cohort. We discovered three COVID-19 subtypes (General Haemodynamic, Renal and Immunological) and introduced disease severity stages, both of which were predictive of distinct risks of in-hospital mortality or escalation of treatment when analysed using Cox Proportional Hazards models. A low-risk Normal-appearing subtype was also discovered. The model and our full pipeline are available online and can be adapted for future outbreaks of COVID-19 or other infectious diseases.


Data preparation
Even though NCCID enrolled many centres in data collection, the significant load imposed by the ongoing COVID-19 pandemic led to many instances of missing data, especially in the clinical readings at admission. As a result, we used a portion of the NCCID dataset, selected primarily on data completeness. A total of 1344 subjects (referred to as the case population) were used in the current study, in addition to 137 COVID-19 negative patients who were utilised as controls for the disease progression model (please see "Subtype and stage inference model"). Manual data quality assurance, curation and standardisation were performed on all clinical data.
We selected eleven clinical tests as biomarkers for disease progression modelling: creatinine, urea, C-reactive protein, lymphocyte count, platelet count, white cell count, respiratory rate, temperature, heart rate, systolic and diastolic blood pressure. Several of these measures have been suggested as being prognostically important in previous survival analyses 3,5,16. The choice of clinical tests to include in our model was driven by previous use in research and by practicality. All clinical test results were recorded on admission of the patients to hospital.
The 1344 COVID-19 positive cases were split randomly into a training and a validation sample of 672 subjects each, after matching the two populations for age. All model training and tuning was performed solely on the training population, and the patients in the validation population were used only at testing. NCCID data was accessed through a UCL-owned XNAT instance. The Microsoft Azure platform and tools from Microsoft Project InnerEye Open Source Software were used for cloud-based modelling and analysis (https://aka.ms/InnerEyeOSS).
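The exact matching procedure is not described here, so the following is only a minimal sketch of one way to obtain two equally sized, age-matched random halves: subjects are sorted by age, taken in adjacent pairs, and each pair is split at random. The function name `age_matched_split` and the pairing strategy are illustrative assumptions, not the method used in the study.

```python
import random


def age_matched_split(ages, seed=0):
    """Split subject indices into two age-matched halves (hypothetical helper).

    Subjects are sorted by age and taken in adjacent pairs; within each pair
    one subject is sent to training and the other to validation at random.
    """
    rng = random.Random(seed)
    order = sorted(range(len(ages)), key=lambda i: ages[i])
    train, valid = [], []
    for k in range(0, len(order) - 1, 2):
        pair = [order[k], order[k + 1]]
        rng.shuffle(pair)  # random assignment within the age-matched pair
        train.append(pair[0])
        valid.append(pair[1])
    if len(order) % 2:  # with an odd count, the last subject goes to training
        train.append(order[-1])
    return train, valid


# Synthetic ages only; these are not NCCID data.
ages = [20 + (i * 7) % 60 for i in range(1344)]
train_idx, valid_idx = age_matched_split(ages, seed=1)
print(len(train_idx), len(valid_idx))
```

Pairing on sorted age keeps the two age distributions close by construction while leaving the assignment within each pair random.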

Subtype and stage inference model
Subtype and Stage Inference (SuStaIn) is an unsupervised learning algorithm that simultaneously identifies clusters (subtypes) and progression sequences (stages) of disease based on worsening biomarker readings. SuStaIn was first developed to model long-term chronic diseases such as Alzheimer's 10 and Chronic Obstructive Pulmonary Disease (COPD) 11. Uniquely, it extracts a temporal (or pseudo-temporal) evolution of disease from single-timepoint, cross-sectional data to account for the inherent progression of diseases. The present study is the first to apply SuStaIn to an infectious disease in its acute phase.
Linear z-score SuStaIn was the chosen SuStaIn model, in which each of the eleven clinical biomarkers was transformed to a z-score with reference to a control population. The control population for this study consisted of 137 patients who were suffering from acute disease (initially suspected to be COVID-19) and were hospitalised but were later determined not to have COVID-19. This population was well suited as a control group for SuStaIn since all patients were unwell enough to be admitted to hospital but were not infected with COVID-19. By z-scoring the 11 biomarkers to this population, the effects of COVID-19 infection on the biomarkers were separated from the effects of other acute disease.
Several data preparation steps were carried out prior to modelling with SuStaIn to isolate the COVID-19 signal from other potential covariates. First, the effects of age and sex on all 11 biomarkers were learned in the control population and regressed out from the entire population. Second, the distributions of biomarkers were checked for normality using the Shapiro-Wilk and D'Agostino's K^2 tests. If a biomarker distribution failed any of the normality tests, a power transform (either Box-Cox or Yeo-Johnson) was used to bring its distribution closer to normal. The transformations were applied to both the control and case populations and were necessary since normal distributions are assumed by the linear z-score SuStaIn model.
Finally, each biomarker was transformed into a z-score with reference to the control population, as described earlier. Since biomarkers could either increase or decrease with disease progression, those found to decrease in the case population relative to the control population (implying negative z-scores) were inverted to ensure that all biomarker progression was represented by monotonically increasing z-scores.
Several hyperparameters (model parameters which are not learned automatically, but are instead chosen and optimised by the researcher) were selected, namely the z-score thresholds which represent a stage of progression and the maximum number of subtypes (clusters) to search for. These were tuned and the best-fitting model selected. Table 1 outlines the z-score thresholds selected for each biomarker. When a biomarker reached a given z-score value (e.g. z = 1 or z = 2), this represented a new disease severity stage.
After the model was trained (on the training population), each subject was assigned a SuStaIn subtype and stage. Subtype was assigned by selecting the most probable cluster. Instead of assigning a simple integer stage to each subject, a weighted stage was designated. For each subject, each stage was weighted by the probability of the subject belonging to that stage and the results were then summed, producing a continuous weighted stage. Subjects in the validation population were subtyped and staged using the model trained on the training population.
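The first preparation step (learning covariate effects in controls and removing them from everyone) can be sketched as an ordinary least-squares fit of each biomarker on age and sex. This is an illustrative sketch under stated assumptions: a purely linear model with an intercept, sex coded 0/1, and helper names (`solve_linear`, `fit_covariate_effects`, `regress_out`) that are hypothetical rather than taken from the study's pipeline.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_covariate_effects(ages, sexes, values):
    """Least-squares fit of value ~ intercept + age + sex, on controls only."""
    rows = [[1.0, float(a), float(s)] for a, s in zip(ages, sexes)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(3)]
    return solve_linear(xtx, xty)  # [intercept, age effect, sex effect]


def regress_out(ages, sexes, values, coefs):
    """Subtract the fitted age and sex effects; the intercept is kept."""
    _, b_age, b_sex = coefs
    return [v - b_age * a - b_sex * s for a, s, v in zip(ages, sexes, values)]


# Synthetic controls in which the biomarker is exactly 10 + 0.1*age + 2*sex.
ctrl_age = [30, 40, 50, 60, 70]
ctrl_sex = [0, 1, 0, 1, 1]
ctrl_val = [10 + 0.1 * a + 2 * s for a, s in zip(ctrl_age, ctrl_sex)]
coefs = fit_covariate_effects(ctrl_age, ctrl_sex, ctrl_val)
adjusted_cases = regress_out([50], [1], [20.0], coefs)
```

The coefficients are fitted on controls only and then applied unchanged to cases, so that differences attributable to COVID-19 are not absorbed into the covariate model.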
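The z-scoring and sign inversion described above can be illustrated with a short sketch. It assumes the sample standard deviation of the controls is used as the scale and that the direction of change is judged by comparing group means; the function names are hypothetical.

```python
from statistics import mean, stdev


def decreases_in_cases(case_values, control_values):
    """True if the biomarker is, on average, lower in cases than in controls."""
    return mean(case_values) < mean(control_values)


def zscore_to_controls(values, control_values, invert=False):
    """Z-score values against the control mean and standard deviation.

    With invert=True the sign is flipped, so that a biomarker which falls
    with disease still yields increasing z-scores as severity progresses.
    """
    mu = mean(control_values)
    sigma = stdev(control_values)  # sample standard deviation (assumption)
    z = [(v - mu) / sigma for v in values]
    return [-s for s in z] if invert else z


# Synthetic example: a biomarker that is lower in cases than in controls.
controls = [1.0, 2.0, 3.0]
cases = [0.0, 1.0]
flip = decreases_in_cases(cases, controls)
print(zscore_to_controls(cases, controls, invert=flip))
```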
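As an illustration of the assignment rules above, the sketch below takes one subject's subtype and stage probabilities and returns the most probable subtype and the probability-weighted stage. It assumes stages are indexed from 0 and that the probabilities come from an already fitted model; the function names are hypothetical.

```python
def assign_subtype(subtype_probs):
    """Return the 0-based index of the most probable subtype (cluster)."""
    return max(range(len(subtype_probs)), key=lambda k: subtype_probs[k])


def weighted_stage(stage_probs):
    """Expected stage: each stage index weighted by its probability and summed.

    The sum is divided by the total probability so that inputs which are not
    exactly normalised still give a sensible value.
    """
    total = sum(stage_probs)
    return sum(k * p for k, p in enumerate(stage_probs)) / total


# Synthetic probabilities for a single subject.
print(assign_subtype([0.2, 0.5, 0.3]))
print(weighted_stage([0.1, 0.6, 0.3]))
```

A continuous weighted stage retains the uncertainty of the staging, which a single most-likely integer stage would discard.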

Frailty Cox proportional hazards models
To model the survival of patients admitted with COVID-19 infection, the Cox Proportional Hazards (PH) model was used. We used 5 predictor variables in the model: age, sex, subtype, weighted stage, and the interaction between subtype and weighted stage. Two outcomes were predicted: time to in-hospital death and time to escalation of patient management. Escalation was defined as in-hospital deterioration which resulted in either ITU admission, intubation or death. The earliest of these 3 events was used as the measure of time to escalation for each patient. Observations were right-censored at 6 months after hospital admission as this was the maximum hospital stay for some patients (before discharge or death). To account for the significant variability between centres, a frailty Cox PH model 17 was adopted with NHS centre as the frailty variable, modelling the random effects in the population.
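The escalation outcome defined above can be derived from event dates as in the following sketch. It assumes that "6 months" is taken as 183 days and that a last-follow-up date is available for patients without an event; both choices, and the function name `time_to_escalation`, are illustrative assumptions.

```python
from datetime import date

CENSOR_DAYS = 183  # assumption: 6 months expressed in days


def time_to_escalation(admission, itu=None, intubation=None, death=None,
                       last_seen=None):
    """Return (days, event) for the escalation outcome.

    The event time is the earliest of ITU admission, intubation or death.
    event is 1 if escalation occurred within the censoring window, else 0.
    """
    event_dates = [d for d in (itu, intubation, death) if d is not None]
    if event_dates:
        days = (min(event_dates) - admission).days
        if days <= CENSOR_DAYS:
            return days, 1
        return CENSOR_DAYS, 0  # event after the window: censored
    follow_up = (last_seen - admission).days if last_seen else CENSOR_DAYS
    return min(follow_up, CENSOR_DAYS), 0


# Synthetic patient: ITU admission on day 4, death on day 9.
print(time_to_escalation(date(2020, 4, 1),
                         itu=date(2020, 4, 5),
                         death=date(2020, 4, 10)))
```

The resulting (days, event) pairs are the usual input to a Cox model; the frailty term for NHS centre is not shown here.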

Covid subtypes and severity progression
SuStaIn discovered 3 clinical subtypes of COVID-19 (based on the training population), characterised by distinct in-hospital disease progression. SuStaIn has previously been used to model long-term diseases such as Alzheimer's or COPD, which span years, but we adapted it for the relatively short time span of an infectious disease (in-hospital monitoring for up to 6 months). Hence, the disease stages can be interpreted as sequences of progression in the severity of disease within each subtype. We named the three subtypes 'General Haemodynamic', 'Renal' and 'Immunological' (Fig. 1).

Subtype 1: general haemodynamic
In this subtype, less severe disease was characterised by high diastolic blood pressure, temperature, respiratory rate and heart rate, which was then followed by further heart rate increases, elevated CRP and a decrease in lymphocyte levels.

Subtype 2: renal
The Renal subtype was characterised by early elevations in creatinine and urea levels, followed by a decrease in systolic blood pressure and an increase in CRP. Unlike the other 2 subtypes, which only exhibit abnormal creatinine and urea in late-stage disease (SuStaIn severity stages 12+), patients with the Renal subtype experienced these abnormalities early in their disease severity progression.

Subtype 3: immunological
In the Immunological subtype, COVID-19 began with abnormally low systolic blood pressure, followed by a cascade of decreases in lymphocyte and platelet counts and then elevated temperature, heart rate and CRP levels in more advanced disease. Abnormalities in systolic and diastolic blood pressure appeared dissociated, being placed at opposite ends of the SuStaIn stage sequence in all three subtypes.

Data exploration
SuStaIn modelling revealed that a large proportion of patients were assigned to SuStaIn stage 0, a disease state which was very similar to the control population. These patients were grouped into a separate, Normal-appearing Subtype 0: 290 patients from the training population and 317 patients from the validation population were found to belong to this subtype. These subjects had a milder COVID-19 presentation and were later found to have a much higher probability of survival.
Furthermore, for the following biomarkers, progression represented a decrease rather than an increase in the real-value biomarker readings: systolic blood pressure, lymphocyte count and platelet count. This meant that for these 3 biomarkers, the average readings were lower in the case population than in the control population. Advancing SuStaIn stages for these 3 biomarkers therefore represented decreases in their absolute values. For clinical context, Table 2 presents an overview of the absolute values of each biomarker for each subtype. General demographic data for the training and validation populations, in aggregate and split by subtype, can be found in Table 3.

Cox proportional hazards (PH) frailty model
SuStaIn subtype and weighted stage were found to be significant predictors of both in-hospital escalation of patient management and in-hospital mortality for patients admitted with COVID-19. Cox PH models were fitted separately on the training and validation populations and then compared to confirm consistency of the results. The Kaplan-Meier curves and model coefficients were examined as a form of validation, as suggested previously 18.

Predicting escalation of patient management using SuStaIn
Table 4 is a summary of the multivariable Cox proportional hazards models fitted to both the training and validation populations, with a frailty term accounting for bias between submitting NHS Hospital trusts. The results were consistent between populations, suggesting that SuStaIn subtype and stage generalise as predictors of escalation between 2 randomly selected populations (albeit in patients whose data was collected as part of the same study).
The interaction of subtype and weighted stage, moreover, produced the greatest overlap in coefficients. Model concordance was good and was nearly equal in the Cox models fitted to the training (C index of 0.69, 95% CI 0.66-0.72) and validation (C index of 0.69, 95% CI 0.65-0.72) populations.
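For readers unfamiliar with the C index reported above, the sketch below computes Harrell's concordance index on right-censored data. It is a plain pairwise illustration under the usual definition (a pair is comparable when the subject with the shorter time had the event; tied risk scores count as half), not the implementation used for the reported values, and the function name is hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    times       -- follow-up time for each subject
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output, where higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the earlier member
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Synthetic data: higher risk scores coincide with earlier events.
print(concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which places the reported 0.69 in context.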

Table 2. Descriptive statistics for the 11 biomarkers in the entire case population, split by subtype. Subtype 0 represents the 'normal' looking subtype, which is most similar to the control population. Std: standard deviation. One-way ANOVA with Tukey post-hoc tests was performed between subtypes for each biomarker; results are indicated with labels (a, b, c, d): subtypes with a significant pairwise difference have different labels, while subtypes which were not significantly different share the same labels.

Early SuStaIn stages and Subtype 0 were found to predict much less frequent in-hospital escalation of treatment as compared to the other 3 subtypes (Fig. 2). Among the three subtypes, patients assigned to the Immunological subtype (subtype 3) were least likely to experience escalation of treatment, while the General Haemodynamic (subtype 1) and Renal (subtype 2) subtypes were more likely to require treatment escalation while hospitalised (Fig. 2). The Kaplan-Meier curves for SuStaIn subtypes were generally consistent in the training

SuStaIn stage on its own had significant discrimination for the need for escalation of treatment (Fig. 3) and was a better predictor of escalation than patient age or sex.

Mortality prediction using SuStaIn
SuStaIn subtype and stage were also good predictors of in-hospital mortality. As shown in Table 5, the hazard ratio confidence intervals show good overlap between the training and validation populations. For determining mortality, subtype and weighted stage on their own were better predictors than the subtype-stage interaction (which did not achieve significance at the 0.05 threshold in the training population). Model concordance for both the training and validation populations was equal: C index of 0.74, 95% CI 0.71-0.77 on the training population

SuStaIn stage was also, independently, associated with higher risk of in-hospital mortality (Fig. 4). As expected, age was a strong predictor of in-hospital mortality, with older patients being at higher risk. Sex had a smaller effect on mortality, but calibration for sex was poor (Fig. 5), probably as a consequence of the random sampling used when creating the training and validation populations, which led to slightly different proportions of men and women (Table 3).

Discussion
We demonstrated that SuStaIn, an unsupervised machine learning model traditionally used for long-term disease progression modelling, is readily adaptable to a pandemic of viral disease. The three SuStaIn subtypes we discovered likely represent disease involvement in distinct organ systems, while SuStaIn stages provide the required gradation of disease severity in patients with COVID-19, which is valuable for risk stratification and outcome prediction. The zeroth subtype also represents a valuable signal, characterising patients who were admitted to hospital but were in fact at low risk of death or escalation of treatment. The robustness of our results further highlights our model's significance as a readily available clinical tool in future epidemics of influenza or further COVID-19 variants.
Several studies have previously investigated factors associated with differing severity of COVID-19 infection on a number of large-scale datasets, such as the NCCID, ISARIC and PHOSP-COVID 19. As a result, various clinical measures and biomarkers have been derived for use as prognostic factors for patients diagnosed with COVID-19. Patients admitted for COVID-19 have been reported to have a hazard ratio of ~5 for death, ~4 for mechanical ventilation and 2.41 for admission to an intensive care unit (ITU) 20 compared to patients admitted for influenza. In addition to the pulmonary manifestations of pneumonia and ARDS 21, COVID-19 infection is further associated with injuries to other organs and systems, including acute kidney injury, deep venous thrombosis, stroke, sepsis and sudden cardiac death 20.

To predict short-to-medium term outcomes (in-hospital death or ITU admission), the National Early Warning Score (NEWS2), an existing risk stratification tool, was initially used. However, studies have shown its low discrimination power when applied to COVID-19 patients 3,4. A combination of NEWS2 with 8 further routinely collected blood and clinical measures (supplemental oxygen flow rate, urea, age, oxygen saturation, C-reactive protein, estimated glomerular filtration rate, neutrophil count, neutrophil/lymphocyte ratio) improved its discrimination power for severe COVID-19 outcomes, but model calibration remained poor 3, necessitating the development of COVID-19 specific patient stratification and prognostication tools. One such tool was the ROX index, evaluated by Prower et al. 4. The ROX index is the ratio of peripheral oxygen saturation (SpO2) to the fraction of inspired oxygen (21% in room air), divided by the patient's respiratory rate, and was developed to indicate the need for intubating patients suffering from hypoxia. The authors found that the ROX index predicted adverse events 5 h earlier than NEWS2 and provided a clinically useful warning signal. The study emphasised the prognostic importance of a deterioration in respiratory parameters for escalation management of COVID-19.

Investigation into other prognostic factors for COVID-19 in hospitalised patients included the development of the ISARIC 4C Mortality Score 5. The score ranges from 0 to 21 points and includes eight routinely collected clinical readings: age, sex, number of comorbidities, respiratory rate, peripheral oxygen saturation, level of consciousness, urea level, and C-reactive protein 5. The ISARIC 4C Mortality Score was developed on a large UK population (~58,000 patients) as part of the ISARIC study 22. The authors reported excellent discrimination of the score for in-hospital mortality and, more importantly, very good model calibration, suggesting applicability of the score in new centres and populations. The score also outperformed the 15 other risk stratification scores against which the authors compared it 5. The ISARIC 4C consortium further developed a Deterioration model (based on multiple logistic regression) to predict not only mortality but also clinical deterioration, defined as admission to ITU or need for mechanical ventilation 16. The model displayed convincing discrimination and calibration using 11 clinical biomarkers: age, sex, respiratory rate, oxygen saturation, room air or oxygen, level of consciousness (Glasgow Coma Scale), nosocomial infection, radiographic infiltrates, urea concentration, lymphocyte count and C-reactive protein 16.
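The ROX index mentioned above is simple enough to state as a short worked example. The sketch assumes the usual convention of SpO2 expressed in percent and the inspired oxygen fraction expressed as a value between 0 and 1 (0.21 for room air); the function name and the example readings are illustrative only.

```python
def rox_index(spo2_percent, fio2_fraction, respiratory_rate):
    """ROX index: (SpO2 / FiO2) divided by the respiratory rate.

    spo2_percent     -- peripheral oxygen saturation in percent (e.g. 95)
    fio2_fraction    -- fraction of inspired oxygen, 0.21 for room air
    respiratory_rate -- breaths per minute
    """
    if not 0 < fio2_fraction <= 1:
        raise ValueError("FiO2 must be a fraction in (0, 1]")
    if respiratory_rate <= 0:
        raise ValueError("respiratory rate must be positive")
    return (spo2_percent / fio2_fraction) / respiratory_rate


# Example readings only: SpO2 95% on room air at 20 breaths per minute.
print(round(rox_index(95, 0.21, 20), 2))  # 22.62
```

Because the respiratory rate is in the denominator, a rising rate or a rising oxygen requirement both lower the index.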
While it is difficult to make a direct model comparison due to the only partial overlap in the clinical measures/biomarkers used, we demonstrated that by using purely cross-sectional clinical and biological data at admission for COVID-19 (11 routinely collected biomarkers) and modelling disease severity progression with SuStaIn, clinically meaningful subtypes and stages of COVID-19 can be derived. This departs from the idea of a one-size-fits-all index and allows us to model involvement in different organ systems through SuStaIn subtypes. In addition to being predictive of in-hospital outcomes, our results can be valuable for organ-specific studies of damage from COVID-19. Previous studies, using tools such as the ISARIC 4C 22 or the ROX index 4, tried to use a single scale to predict patient outcomes and prioritise treatment. However, this view, while it has shown clinical utility, may miss the inherent nuance in the progression patterns of patients infected with SARS-CoV-2. In terms of triaging, our model can be used to assign patients admitted to hospital for COVID-19 to one of the 4 subtypes by simply taking the readings of the 11 biomarkers we used. Subtype 0 patients, while ill enough to be hospitalised, can be classified as 'low-risk' for either experiencing escalation of treatment or dying in hospital. Subtype 3 patients, similarly, are at a lower risk, but patients assigned to Subtypes 1 or 2, and especially those at more advanced SuStaIn stages, should be prioritised for treatment and monitored more closely.
The disease subtypes discovered by SuStaIn modelling broadly affect different systems within the body, and consequences of COVID-19 in these systems have been previously described. The Renal subtype (Subtype 2) is consistent with several studies which identified some COVID-19 patients experiencing significant kidney problems or even acute kidney injury (AKI) 23,24. In the consensus report, patients suffering AKI were at significantly increased risk of all-cause death in hospital 23.

The General Haemodynamic subtype (Subtype 1) can be hypothesised to relate to the common blood-clotting and hyper-inflammatory effects described in a number of studies 25,26. An interesting finding which our model uncovered is that late-stage disease patients who are at the greatest risk of escalation and dying within this subtype (advanced SuStaIn stage) experience a drop in their lymphocytes, platelets, and systolic blood pressure. An early decrease in platelet count was found to predict mortality in a study in Wuhan 27, which might represent a possible depletion of systemic platelets due to significant clotting in the lung. Another study also reported a trend of rather sharply dropping platelets in non-survivors over multiple timepoints during hospitalisation 28. Indeed, late SuStaIn stages in both Subtypes 1 and 2 were characterised by a drop in platelet count; those were the patients at greatest risk of dying in hospital. Although our work reconstructs disease severity progression from just a single timepoint reading, patients assigned to the later SuStaIn stages of Subtypes 1 and 2 might have already had a reduced platelet count by the time of hospital admission (effectively more advanced disease). On examining the absolute values of platelet counts for these patients, the same range of values (between 100 and 150 × 10^9/L) was found in late-stage patients in our study and in Yang et al. 28. The decrease in total lymphocyte count, characteristic of the late SuStaIn stages in Subtype 1 and 2 patients, is also consistent with a meta-analysis of 20 studies, which determined this decrease to be closely associated with advanced severity of disease 29.
The Immunological subtype (Subtype 3), on the other hand, showed lower levels of lymphocytes and platelets in the lowest-risk, early disease stages. These findings highlight the importance of signals contained within the multitude of biomarkers routinely collected during medical care. Our model aggregated several of these biomarkers and benefited from the inferred clustering of disease and stages of disease severity rather than employing a one-size-fits-all approach for triaging and prognostication. While decreased lymphocytes and platelets might imply a high risk of death and escalation of treatment when occurring after a series of haemodynamic (Subtype 1) or renal (Subtype 2) symptoms, they might indicate lower risk if occurring without these symptoms, as seen in Subtype 3. SuStaIn's ability to disentangle sequences of progressing severity and subtype simultaneously provides a far more detailed picture than a single score for all patients.
Our approach also identified an interesting dissociation of systolic and diastolic blood pressure in all subtypes. Namely, the abnormally increased diastolic blood pressure and abnormally decreased systolic blood pressure were always placed at opposite ends of the disease severity stages. This suggests that instead of one of the blood pressure phases indicating severe disease, it might be the effectively decreased pressure range between systole and diastole (pulse pressure) which hallmarked advanced COVID-19 and increased a patient's chance of both escalation of treatment and death. This signal merits further investigation, as two studies indicated that a high variability of blood pressure in COVID-19 patients is associated with poorer outcomes 30 and, interestingly, that patients who have recovered from COVID-19 tend to have impaired aortic distensibility 31.
The main strength of the present work is that it was able to demonstrate clinically significant differences in both escalation of treatment and mortality for patients hospitalised for COVID-19, based on 11 routine and easy-to-collect clinical measurements. We discovered 3 distinct subtypes of COVID-19, which might imply different underlying pathophysiology and disease course in different patients. Although the data we used was collected as part of a single study (the NHSX NCCID), it came from hospitals and NHS trusts throughout the UK and included patients from diverse socio-economic and racial backgrounds. We further employed one of the most challenging techniques for the validation of our Cox Proportional Hazards models: replication on a separate sample of patients. Our model can be readily applied, tested, and tuned on a larger sample of patients (e.g., from different studies) using the 11 biomarkers we studied. More broadly, our model can be further augmented should a more complete set of biomarkers, or other feasible biomarkers, become available.
There were several limitations to this study. Methodologically, SuStaIn was developed for modelling long-term, chronic disease. This was the first time it was adapted to severe infectious disease. One of its assumptions is that biomarkers can only become more abnormal with time. This means that it cannot inherently derive the transient drops and increases in biomarkers which might happen while a patient is hospitalised. Nevertheless, the model is still appropriate for stratification of patients and triaging since it focuses on the severe period of disease when patients are hospitalised and deteriorating. All clinical measurements in the NCCID were performed in this period. Hence, in this sense the learned model represents a progression of severity of disease and does not currently capture recovery. Furthermore, while the learned disease severity progression is currently unidirectional, the model poses no constraints on staging patients to an earlier (less severe) stage should data be available for a follow-up visit. Hence, at the individual patient level, recovery can be modelled. Future work on making SuStaIn even more useful for shorter-term infectious disease outbreaks could also relax the assumption of unidirectionality in disease progression, to capture potential population-wide increases and declines in health.
The data which was available for this study also had several limitations. First, the NCCID dataset did not track the presence of coronavirus variants (Alpha, Beta, Gamma, Delta, Omicron) 15,32 and this information would have been useful for disease modelling since the population likely included different virus variants. However, the common nature of the biomarkers used in our models opens the way for relatively easy validation when new data becomes available. A follow-up timepoint to validate disease progression, as well as the availability of additional variables such as patient blood type, would also have benefitted our study. Furthermore, there was a risk of false negative PCR tests across the population, which might have led to COVID-19 positive patients being present in the control population. Finally, the specific causes of death, for example cardiac arrest or pulmonary embolism due to COVID-19, were not recorded in the study; the availability of these would have brought further insight into the pathophysiology of COVID-19.

Figure 1. COVID-19 subtypes and disease severity progression. The warm colours represent disease stages progressing towards positive z-scores (z = 1, z = 2) and the cold colours towards negative z-scores (z = −1, z = −2). Increased colour transparency signifies greater uncertainty. The f-value next to each subtype represents the fraction of the training population which was classified as belonging to this subtype.

Table 4. Multivariable Cox Proportional Hazards modelling of time to escalation in the training and validation populations. The hazard ratios, HR (and consequently the model coefficients, of which they are the exponentials), show significant overlap between the training and validation populations. The effects of the frailty variable, NHS Hospital trust, are not shown as there are 14 centres in the population. wstage: weighted SuStaIn stage; sex 0: female; sex 1: male; variable interactions are denoted with ':'.

Figure 2. Kaplan-Meier plots for 6-month in-hospital escalation of treatment for the training (left) and validation (right) populations. wstage: weighted SuStaIn stage.

Figure 3. SuStaIn stage provides better discrimination of time to escalation than age or sex. Left: training population; right: validation population. wstage: weighted SuStaIn stage; sex 0: female; sex 1: male.

Figure 4. Kaplan-Meier plots for 6-month in-hospital mortality for the training (left) and validation (right) populations. wstage: weighted SuStaIn stage.

Table 1. The clinical measures (biomarkers) used for SuStaIn modelling. Biomarkers were thresholded at certain z-score values to represent a SuStaIn disease severity stage, either when a biomarker reaches a z-score of 1 or a z-score of 2. Each threshold for each clinical measure is marked with an 'x' in the table below.

Table 3. Demographics per population and subtype. Smoking status: N: never; E: ex-smoker; C: current smoker; U: unknown. No significant differences were found in any variable between the training and validation populations (using t-tests for continuous variables and chi-squared tests for nominal and binary variables).

Table 5. Multivariable Cox proportional hazards analyses modelling time to death in the training and validation groups. HR: hazard ratio; wstage: weighted SuStaIn stage; sex 0: female; sex 1: male.
Our model further provides stages within the Renal subtype (Subtype 2) which can differentiate patients by considering all 11 readings. While patients admitted with just elevated urea and creatinine, for example, might belong to Subtype 2, if they are relatively normal in the other 9 biomarkers, they may be assigned to an early SuStaIn stage. A clinician might then monitor the development of further changes in biomarkers to assess severity progression within the Renal subtype, which can inform risk determination and treatment.