Introduction

Clinical prognostic models are used as tools for clinical decision making. Several well-established models, such as the sequential organ failure assessment (SOFA) score1, CURB-652, the Apgar score3,4, the Nottingham prognostic index4,5, and the Manchester triage system4,6, have been developed and are in common use in hospitals. These prognostic models are traditionally static7. However, model performance can drift and models can lose their predictive abilities over time7,8,9,10,11,12. Drifts in predictive performance can appear as a reduction in overall accuracy13 or as miscalibration8,14,15 and can be due to changes in patient characteristics14,15,16,17, as well as new treatments, changes in preventative care, or changes in treatment algorithms14. Such reductions in predictive performance can render models less useful18 or even misleading18,19, emphasizing the need to detect and correct performance drifts.

Various approaches to compensate for model performance calibration drift have been proposed. Newer or recalibrated versions of existing models have occasionally been developed to account for various patterns of invalidity13, which arise when model parameters are no longer optimal due to changes in the case mix of patients, patient outcomes, or the standard of care provided for hospitalized patients. For example, EuroSCORE was initially developed to predict mortality after heart surgery20 and was updated twelve years later14. Another clinical prognostic model, QRISK, which predicts the risk of a patient developing a heart attack or stroke within ten years, is updated annually15,21,22. Dynamic updating approaches7 update the model parameters at fixed time windows11,23,24 or when deemed necessary10,11,12,13. For the latter, the frequency of updating depends on the dynamics of the performance drift of such models, making the case for monitoring of prospective model performance. While rapid and continuous geographic validation of a newly developed model is challenging18,25,26,27, temporal validation within one site or institution can be feasible, owing to the continuous recording of patient data as part of the standard of care.

For many well-established prognostic models of disease, performance drift is a slow process and frequent monitoring or updating might not be required. The COVID-19 pandemic, in contrast, manifested as multiple geographically and temporally distinct waves with heterogeneous patient populations and different standard treatment methods28. Moreover, virus variants, changing vaccination rates, waning vaccine protection, and the emergence of novel COVID-19 treatments contributed to a highly dynamic environment for patient characteristics and outcomes29,30,31,32. A commonly reported change was the significant decrease in the case mortality rate of hospitalized patients in certain areas33,34,35, as well as decreases in the number of intubated patients35 and in overall hospitalizations between waves. Because many prognostic models were developed during the pandemic19,36,37 on geographically distinct populations alongside rapid changes in patient populations and standards of care38, it is unsurprising that a growing number of studies report drifts in the performance of such models across different geographic regions39,40,41 or different temporal windows39,40. While the need for model updating in COVID-19 is clear, no self-monitoring, automatically updating prognostic models for COVID-19 patients have been proposed.

Our main objective was to develop a survival calculator for COVID-1942 that self-monitors and automatically updates whenever needed. We examined the need to recalibrate by monitoring discrimination and calibration from three modeling approaches over time. We compared model performance without updating to various model updating techniques13,23,24 with each of the modeling approaches. We built a framework that can be applied both to traditional prediction model methods and to machine learning-based methods and enables them to be easily monitored and updated accordingly. These techniques and ideas were tested using data from nearly 35,000 COVID-19 patients across the three major virus variants during the pandemic (alpha, delta, omicron) and three prognostic model architectures: our custom generalized linear model (GLM), logistic regression, and gradient boosted decision trees. The analysis was performed for a 28-day time horizon. Performance of all models with and without the self-monitoring and auto-updating capabilities was measured, and sensitivity and decision-curve analyses were conducted.

Results

Patient characteristics

A total of 38,078 electronic health records (EHRs) were considered in this study. Of these, 3,166 were excluded because they were either transferred to a hospital outside of the health system and their outcomes were unknown, were started on invasive mechanical ventilation prior to admission, or had a do not resuscitate order placed within five days of death. All patients admitted after April 3, 2022 were also excluded to enforce the 28-day follow-up period. The remaining patients (n = 34,912, Table 1) were included in the development (n = 7346), retrospective (n = 1889), and prospective (n = 25,677) validation cohorts. The included patients (combined development and retrospective versus prospective) had a median age of 65 years [IQR 54–76] versus 67 years [IQR 54–79], and 40% versus 47% were female. The overall 28-day survival percentages were 77.0% for the combined development and retrospective cohorts versus 89.1% for the prospective cohort.

Table 1 Demographic, clinical, and laboratory data of COVID-19 patients hospitalized at Northwell Health

Development of survival prognostic models

To determine the predictors of survival, training data were collected from patients hospitalized in 11 of 12 included Northwell Health hospitals (Fig. 1). Three different model types—generalized linear model with the least absolute shrinkage and selection operator (LASSO) penalization (the Northwell COVID-19 Survival Calculator (NOCOS))42, logistic regression (LR) with LASSO penalization43, and extreme gradient boosted decision tree (XGBoost)44—were trained on patients from the development cohort for 28-day survival. The optimal predictors of survival were similar across the models. Age, serum blood urea nitrogen, lactate, and red cell distribution width were chosen as predictors, either in a linear or logarithmic scale, for all three models; serum albumin was chosen as a predictor for two of the three models; and respiratory rate and platelet count were each chosen once. In some cases, both the linear and logarithmic scales of a single measurement were used as predictors within a model.

Fig. 1: Dataset overview and study design.
figure 1

The data sets comprise training, retrospective validation, and prospective monitoring/updating cohorts totaling 34,912 patients. The number of hospitalized patients across the course of the pandemic is plotted as blue bars, with the 7-day rolling average mortality plotted as a red dashed line. The three dominant variants (alpha, delta, and omicron) are represented as background colors. The vertical dashed red lines indicate the left edges of the 2000-patient sliding window that increments by 500 patients at a time. NOCOS Northwell COVID-19 Survival Calculator.

Validation of survival prediction model

The retrospective validation included data collected from patients discharged from Long Island Jewish Hospital (n = 1889 [1470 survived past 28 days]) (Fig. 1). Based on these data, the 28-day NOCOS Calculator had discriminative performance with an area under the receiver-operating-characteristic curve (AUROC) of 0.772 [95% CI 0.762, 0.782] (Fig. 2a) and an area under the precision-recall curve (AUPR) of 0.912 [0.906, 0.918] (Fig. 2a), and calibration performance with an integrated calibration index (ICI) of 0.047 [0.042, 0.054] (Fig. 2b), given the 22% mortality.

Fig. 2: Retrospective and prospective validation of static 28-day survival models.
figure 2

a ROC and PR curves with AUC and 95% CI for the retrospective (n = 1889) and prospective (n = 25,677; no updates) validation cohorts, b calibration plots for the retrospective validation cohort, c calibration plots for the prospective (no updates) validation cohort and d decision curves for the retrospective and prospective (no updates) cohorts based on the original NOCOS, logistic regression, and XGBoost models. The blue dots on the calibration plots show the actual proportion of outcomes averaged over deciles of the predicted probabilities. The red histograms show the counts of patients that survived past 28 days binned by the predicted probabilities. The green histograms show the counts of patients that died before 28 days binned by the predicted probabilities. The diagonal black lines indicate perfect calibration. The ICIs along with their 95% CIs are reported. ROC receiver operating characteristic, PR precision recall, AUC area under the ROC or PR curve, CI confidence interval, ICI integrated calibration index.

The prospective validation was based upon data collected from patients discharged from all 12 Northwell hospitals (n = 25,677 [22,876 with 28-day survival]) (Fig. 1); the 28-day NOCOS Calculator had a discriminative performance of AUROC 0.758 [0.755, 0.762] (Fig. 2a) and AUPR 0.945 [0.944, 0.947] (Fig. 2a). While the overall AUROC dropped slightly between the retrospective and prospective validation cohorts (p = 0.005, one-sided unpaired t-test) and the overall AUPR significantly improved (p < 0.001, one-sided unpaired t-test), the overall calibration performance degraded significantly from ICI 0.047 [0.042, 0.054] to ICI 0.119 [0.117, 0.120] (p < 0.001, one-sided unpaired t-test) (Fig. 2b, c).

Updating the survival prediction model

Due to calibration degradation during the course of the COVID-19 pandemic, a 2000-patient sliding window was incremented at 500-patient intervals, and the models were updated whenever the ICI exceeded 0.03 (Fig. 3). The window size and updating threshold were selected based on hyperparameter optimization (Supplementary Fig. 6). All updating methods for the 28-day NOCOS produced similar AUROCs (Table 2). All updating methods were significantly better than the original model without updating with regard to ICI, with ICI improvements ranging from 0.098 to 0.105 (p < 0.001, one-sided unpaired t-test) (Fig. 3, Table 2). We selected logistic recalibration as the updating method for the 28-day NOCOS Calculator for generalization purposes, since it requires fewer patients for updating and fewer parameters to tune and also retains the same predictors45; it is also worth noting that intercept-only recalibration performs almost as well as full logistic recalibration. The overall results of this updating method are presented in Fig. 4.

Fig. 3: Temporal progression of performance metrics across all 28-day survival models and updating procedures.
figure 3

Discrimination (AUROC) and calibration (ICI) performance metrics in a 2000-patient sliding window with a step size of 500 patients for the original and dynamically updated 28-day a NOCOS, b logistic regression, and c XGBoost models. The updating methods are listed in the legend, and dynamic logistic regression is only applicable to the logistic regression model. Updates are performed when the ICI is greater than the threshold of 0.03. AUROC area under the receiver operating characteristic curve, ICI integrated calibration index, LR logistic regression.

Table 2 28-day performance metrics for the prospective (n = 25,677) cohort
Fig. 4: Prospective validation of all 28-day self-monitoring, auto-updating models.
figure 4

a ROC and PR curves with AUC and 95% CI for the prospective (n = 25,677) validation cohort, b calibration plots for the prospective validation cohort, and c decision curves for the prospective cohort based on NOCOS updated using logistic recalibration, logistic regression updated using logistic recalibration, and XGBoost updated using intercept only recalibration. The blue dots on the calibration plots show the actual proportion of outcomes averaged over deciles of the predicted probabilities. The red histograms show the counts of patients that survived past 28 days binned by the predicted probabilities. The green histograms show the counts of patients that died before 28 days binned by the predicted probabilities. The diagonal black lines indicate perfect calibration. The ICIs along with their 95% CIs are reported. ROC receiver operating characteristic, PR precision recall, AUC area under the ROC or PR curve, CI confidence interval, ICI integrated calibration index.

We also performed similar updates and comparisons for 28-day survival logistic regression and XGBoost models (Figs. 2–4). All updating methods for LR and XGBoost yielded ICIs that were significantly better than not updating (p < 0.001, one-sided unpaired t-test) (Table 2). Neither logistic regression nor XGBoost, when updated, yielded a significantly lower ICI than the self-monitoring, auto-updating NOCOS. However, our analysis shows that updating is needed in all cases regardless of model type, whether a linear model or a nonlinear machine learning model. Since we favor updating methods with fewer parameters, and there were no drastic differences in performance across the different updating methods, we propose logistic recalibration for logistic regression and intercept-only recalibration for XGBoost as the preferred updating methods.

Changes in predictor importance

The dynamic Bayesian logistic regression is batch updated at each window position regardless of the current ICI estimate and provides a smoothly time-varying set of coefficients that can be compared to the selected update methods for the logistic regression model. The 28-day coefficients (Fig. 5) with logistic recalibration (Supplementary Fig. 4) are reasonable approximations to the dynamic Bayesian coefficients.

Fig. 5: 28-day model coefficient importance.
figure 5

a NOCOS, b logistic regression, and c XGBoost model predictor importances are plotted. The importances of the NOCOS and logistic regression model predictors are the coefficients of the linear predictor scaled by the standard deviations of the predictors from the development cohort. The importances of the XGBoost model predictors are the weighted average, over the ensemble of trees, of the difference in node risk between the parent and child nodes due to splitting on each predictor.

Decision curve analysis

To better assess the clinical utility of the models, decision curve analysis was performed. The original model yields a positive net benefit on the retrospective cohort across most decision thresholds (Fig. 2d). On the prospective cohort, the original model without updating yields a negative net benefit over a wide range of preferences, showing that use of this model would worsen decision making compared to the best overall treatment strategy. Note that the probabilities used in our decision curve analysis refer to mortality rather than survival. When the updating methods are applied, the models tend to provide a positive net benefit (Fig. 4c, Supplementary Fig. 3c).

Sensitivity analysis

To examine whether the proposed framework remains accurate and well calibrated across virus variants, sex, and race/ethnicity, we performed a sensitivity analysis focusing on the performance characteristics of all models across these subgroups. Our results, shown in Fig. 6 and Supplementary Table 4 for the 28-day NOCOS model, and in Supplementary Fig. 7 and Supplementary Table 3 for all other models, reveal small differences in performance metrics. The observed differences in ICI across all subgroups of our sensitivity analysis are not expected to affect the net benefit to the patient and show that the proposed approach works effectively across virus variants, sexes, and races and ethnicities.

Fig. 6: Sensitivity analysis of the 28-day updating NOCOS model across variants, sex and race/ethnicity.
figure 6

a ROC and PR curves with AUC and 95% CI for the prospective (n = 25,677) validation cohort and b their corresponding calibration plots based on the 28-day NOCOS updated with logistic recalibration. The model was filtered by variant, sex, and race/ethnicity. The points on the calibration plot show the actual proportion of outcomes averaged over deciles of the predicted probabilities. The diagonal black lines indicate perfect calibration. The ICIs along with their 95% CIs are reported. ROC receiver operating characteristic, PR precision recall, AUC area under the ROC or PR curve, CI confidence interval, ICI integrated calibration index.

Discussion

We developed a framework of self-monitoring, auto-updating prognostic models to predict the probability of 28-day survival for COVID-19 patients upon admission to the hospital. We applied this framework using three different model architectures: a custom GLM, logistic regression, and XGBoost. The same analysis was repeated for 7-day survival in the supplement. The initial onset of the first wave of the pandemic in New York City was a time when factors such as survival, standard treatment methods, and patient case-mix were under constant flux. These factors resulted in model performance drifts, showcased by the degradation of calibration performance from the retrospective to the prospective validation cohorts. In order to maintain calibration over time, a dynamic updating approach consisting of a 2000-patient sliding window with a 500-patient step was used to monitor the calibration and apply several model updating strategies if the miscalibration exceeded a threshold. The resulting models maintained good discrimination capabilities and calibration throughout the different waves of the pandemic, regardless of model architecture, always outperforming their initial versions. This is also the first study to our knowledge that performs temporal updating of COVID-19 prognostic survival models to correct for model performance calibration drift.

Prognostic models are frequently optimized for discriminative performance, but miscalibration can be harmful when clinical decisions are based on biased predictions46. The magnitude of the performance drift, as well as the speed at which it manifests, can affect the perceived severity of a patient's condition. This, in turn, can affect treatment and, equally importantly, the trust that clinicians place in these algorithms. While an error of 1–2% in the probability of survival estimated for a specific patient might not change any specific decision, an average error of 9–10% across thousands of patients means that some patients will have errors far greater than 10%, which can be misleading and result in confusion, delayed decision making, and loss of clinician trust in the algorithms. Our study emphasizes minimizing bias by reporting and maintaining calibration performance through model updates. Notably, despite calibration degradation, discriminative performance remained well-maintained, emphasizing the need for close monitoring of calibration characteristics (e.g., ICI) in addition to discrimination characteristics (e.g., AUC of ROC and PR curves).

We show that model updating is crucial, particularly in the setting of rapidly changing outcomes (e.g., survival), as in the course of the COVID-19 pandemic at our health system (Table 1) and elsewhere47,48. However, updating involves additional complexity, data gathering, and cost7. Ad hoc updated models like EuroSCORE are relatively inexpensive but are not responsive to calibration drift on a short timescale, while periodically updated models like QRISK are more expensive but react to calibration drift sooner. Dynamic models with continuous model surveillance have the benefit of increased responsiveness but also add complexity24,49 to the validation steps since there is no standardized validation methodology7,8,9,10. We validated our models temporally by calculating the performance metrics in a sequence of sliding window positions, similar to the dynamic calibration curves used by Davis et al., albeit with a fixed window size and averaged over all predicted probabilities12. We also provided an overall validation of all models by accumulating the predictions of the updated models and calculating the performance over all data after initial development. Ultimately, standardized approaches to dynamic models that provide support in the context of healthcare systems will move us closer to learning health systems50.

Our dynamic monitoring and updating method could be operationalized in a relatively straightforward manner. The deployment would feature a program that monitors the number of patients since deployment and, as soon as the number of patients reaches the required window size, tests the latest model calibration and, if needed, updates the model. Creating such a program is of minimal cost, and maintaining it is computationally trivial, since these steps do not require any additional computational infrastructure. The minimum requirement is to maintain an actively updated database on a regular basis with a scheduled database query, employing in our use case specific filters for COVID-19 patients and notifications to inform clinicians of current performance and possible updates. As with every tool, it would be preferable to have a clinician, a developer, or ideally both periodically review the proper function of the pipeline as well as the updates. These processes can be run in a centralized location on a single workstation that interfaces with the electronic health records (EHR) of the institution in which it is deployed.

Another important factor to consider when updating models is the number of available samples. There is an inherent tradeoff between stationarity and sample size in the update cohort51. While small temporal windows can follow the dynamics of nonstationary changes in the data, they might not contain enough samples to support certain modeling approaches, whereas larger windows with adequate sample sizes might not be able to follow quickly changing dynamics11. To strike a balance in this tradeoff, we determined the size of the window based on a combination of formulas that estimate the proper sample size for LR models52 and a hyperparameter optimization that estimates the proper sample size for the custom GLM model (Supplementary Fig. 6). We ultimately recalibrated or retrained our existing model on a 2000-patient sliding window of data rather than retraining the model on all of the data first.

We evaluated several updating methods: logistic recalibration with intercept only, full logistic recalibration with gain and intercept, retraining the model with fixed predictors, retraining the model with all candidate predictors13, and dynamic Bayesian logistic regression23,24. In addition, other recent methods that smoothly update their parameters, such as the dynamic logistic state space model53, may also be valuable within a dynamic updating framework. We selected the update method by finding the optimal integrated calibration index with minimal impact on discriminative performance. Most of the updating methods exhibited similar performance and all were superior to not updating, similar to other studies that employed model updating for other use cases11. In addition, our method included a lag equal to the follow-up period before updating the model to ensure that updating was done prospectively, unlike the approach by Schnellinger et al., which was inherently retrospective11 and which we found to be overly optimistic (see hyperparameter optimization and causal model design in Supplementary Fig. 6). Other methods in the literature for correcting model performance calibration drift include a closed test procedure54, applying bootstrap resampling while evaluating the updating methods and scoring rules10, and dynamically adjusting the window lengths12. These methods could also be implemented as a variation of our sliding window approach, but we found a single updating method with fixed window size sufficient without additional complexity. The closed testing procedure and retraining methods also require more data points than recalibration methods11, and reestimation would not have been our optimal choice had fewer patients been initially available, resulting in a reduced window size (see hyperparameter optimization in Supplementary Fig. 6).

Our results were also similar across model types, demonstrating robustness of the updating framework to the choice of model, whether a custom generalized linear model like NOCOS, a standard GLM like logistic regression, or a nonlinear ensemble model like XGBoost. It is interesting to note that XGBoost, a nonlinear machine learning model, did not yield significantly better results than a simple GLM or logistic regression model. This is likely because we constrained the models to only a few predictors or because the relationships in the data were mostly linear55,56. The results were also similar across virus variant, sex, and race and ethnicity, indicating that the updating methods are robust to these variations in the data as well.

Updating a model, or more generally, making any change to a procedure, only makes sense if there is a clinical net benefit that results from the change. Since model performance calibration drift biases the model, we expected that correcting the model calibration would yield a net benefit, allowing clinicians to make better decisions based on the model outcome. Decision curve analysis is a technique that graphs the net benefit of an intervention against a clinical preference57. All static models developed showed a negative net benefit across a wide range of threshold values, indicating possible detriment if these models were used. When updated, the models showed a positive net benefit across a wide range of threshold preferences, indicating that the updates made the models generally useful by correcting for biases that can impact their net benefit.

Prognostic models for COVID-19 and other diseases are particularly useful clinically and operationally for health systems. For COVID-19 in particular, the NOCOS survival model was used extensively by triage teams (to better inform decisions about appropriate level of care and hospital discharge), clinical rounding teams (for risk assessment), primary medical and palliative care teams (to better guide shared medical decision making), and hospital operations teams (for load-balancing, resource, and personnel planning). Clinical decisions should not be based only on the predicted probability of survival, but also on a risk-benefit analysis that may depend upon current hospital resources37. Given these multiple uses with significant clinical and operational implications, ongoing monitoring, maintenance, and recalibration are necessary to maintain optimal performance.

The study population only included patients within the New York City metropolitan area. External, and more specifically geographic, validation of similar models has been limited and has demonstrated significant performance drifts41,58. While our model has been trained and validated using patients from the New York population, one of the most demographically and ethnically diverse areas in the world, it could still demonstrate performance drifts in different geographic locations. We believe, as demonstrated in this study, that our proposed framework can quickly correct for performance drifts, even those appearing immediately upon implementation.

The data were collected entirely from EHRs, which supported robust and rapid analysis of a large cohort of patients. However, we did not include data elements that would require manual chart review, such as symptom information or radiographs. Due to the retrospective study design, not all laboratory tests were completed on all patients, and the performance of these variables could not be adequately assessed.

Finally, deploying dynamic prediction models that self-monitor and auto-update can be technically challenging as they necessitate specific data pipelines and retraining scripts requiring maintenance and monitoring. One specific example may include the need for a data pipeline that is capable of dynamically updating the imputation models as well. Also, monitoring dashboards are essential, since engineering teams and stakeholders need to be alerted of model performance drifts and informed of any automatic updates, including specific details on changes occurring in updates. While these challenges can increase the technical burden of deploying these models, we believe that stable pipelines and data monitoring dashboards are necessary, not only for the case of dynamic models but for any deployed clinical predictive model.

This study demonstrates the importance of updating prognostic models in settings with rapidly changing clinical dynamics and proposes a self-monitoring, auto-updating survival model for COVID-19 patients. Biased models can result in potentially harmful biased clinical decisions. This is the first study to our knowledge that performs dynamic updating of COVID-19 prognostic survival models to correct for model performance calibration drift, a methodology that can be extended to other clinical prognostic models.

Methods

Data acquisition

Data were collected from the enterprise EHR (Sunrise Clinical Manager, Allscripts, Chicago, IL). Transfers from one in-system hospital to another were merged and considered one hospital visit. Data collected for the development and internal validation of the tool included patient demographic information, comorbidities, laboratory values, and outcome (28-day survival and discharge). Our project utilized clinical data, obtained retrospectively, which was determined by the Northwell Health Institutional Review Board (IRB) to meet the requirements for review under exemption category 4 (secondary research uses of identifiable private information), subsection (iii): the research involves only information collection and analysis involving the investigator's use of identifiable health information when that use is regulated under 45 CFR parts 160 and 164, subparts A and E, for the purposes of "health care operations" or "research" as those terms are defined at 45 CFR 164.501, or for "public health activities and purposes" as described under 45 CFR 164.512(b).

Study design and setting

This study includes retrospective development and validation, and prospective validation, of models to predict 28-day survival of patients hospitalized with COVID-19 between March 2, 2020, and April 3, 2022 (Table 1). The development cohort includes patients admitted to 11 acute care facilities in the Northwell Health system between March 2, 2020 (all date/times are at midnight), and April 23, 2020 (n = 7346). Long Island Jewish Hospital, the hospital that contained the most COVID-19 positive patients during the same time period, was left out of the development set and was used for retrospective validation (n = 1889). We repeated this process, sequentially leaving one hospital out as the validation cohort, and report the variation of retrospective performance metrics (Supplementary Table 5). The prospective validation cohort included patients admitted to all 12 acute care facilities in the Northwell Health system between April 23, 2020, and April 3, 2022 (n = 25,677). The final date of follow-up was May 1, 2022.

Patients were included in the development and retrospective and prospective validation datasets if they were adults (≥18 years old) admitted to the hospital with COVID-19 confirmed by a positive result from polymerase chain reaction testing of a nasopharyngeal sample. Clinical outcomes (i.e., discharge, mortality) were monitored until the final date of follow-up. Patients were excluded if they received invasive mechanical ventilation before admission, either before presentation to or during their stay in the emergency department. Patients were also excluded if a do not resuscitate (DNR) order was created within five days prior to death, since such orders can potentially bias the outcome. Patients were also excluded if they were transferred to a hospital outside of the health system and their outcomes were unknown.

Potential predictive variables

Potential predictive variables were included if they were available as discrete data points in the EHR for more than half of study patients at the time of admission. For variables with multiple values at admission, such as vital signs and labs, the last value before the time of admission was used for analysis. This approach ensured that the results would contain data points routinely available upon admission. In the overview of our dataset (Table 1), continuous variables are presented as median and interquartile range, and categorical variables are expressed as the number of patients and percentage. Demographic variables included age, sex, race, ethnicity, and language preference as English or non-English. Vital signs included systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate, oxygen saturation, temperature, body mass index, height, and weight. Comorbidities included coronary artery disease, diabetes mellitus, hypertension, heart failure, lung disease, and kidney disease. Laboratory variables with missingness below 50% included white blood cell count, absolute neutrophil count, automated lymphocyte count, automated eosinophil count, automated monocyte count, hemoglobin, red cell distribution width, automated platelet count, serum sodium, serum potassium, serum chloride, serum carbon dioxide, serum blood urea nitrogen, serum creatinine, estimated glomerular filtration rate, serum glucose, serum albumin, serum bilirubin, serum alkaline phosphatase, alanine aminotransferase, aspartate aminotransferase, and serum lactate. C-reactive protein serum was also included even though it was more than 50% missing because other studies found it to be predictive of mortality59,60. The logarithms of numeric variables were used in addition to the linear scale; both scales of a given measurement may be selected, which can assist in modeling nonlinear effects. Categorical variables were dummy encoded: each variable was expanded into binary indicators for all but one of its categories, so that at most one indicator per variable is true at a time.
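For illustration only, the sketch below shows how predictors of this form can be constructed with pandas; the column names (bun, lactate, rdw, albumin, race_ethnicity) are hypothetical placeholders rather than the study's exact field names, and the actual pipeline was implemented differently.

```python
# Minimal sketch of the predictor construction described above
# (hypothetical column names; the study used the variables in Table 1).
import numpy as np
import pandas as pd

def build_predictors(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Add log-scale versions of numeric variables alongside the linear scale,
    # so either (or both) can later be selected by the LASSO penalty.
    for col in ["bun", "lactate", "rdw", "albumin"]:  # hypothetical names
        out[f"log_{col}"] = np.log(out[col])
    # Dummy encode categorical variables: binary indicators for all but one
    # category, so at most one indicator per variable is true at a time.
    out = pd.get_dummies(out, columns=["race_ethnicity"], drop_first=True)
    return out
```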

Imputation

Missing values were assumed to be missing at random and imputed using multiple imputation by chained equations (MICE)61 with the R programming language, version 1.4.1717 (R Foundation for Statistical Computing). Some predictors, like body mass index (BMI) and the natural logarithms of the laboratory values and vital signs, were imputed using passive imputation to preserve the deterministic relationships between different missing variables in the imputed data. Outcomes were also included in the imputation models, as recommended by Steyerberg in section 7.4.3 of Clinical Prediction Models, Second Edition, since missing values were imputed using random draws13. Due to the nature of fully conditional specification and because the imputation models were trained on the development set, prospective outcomes are not necessary for imputing prospective data. Five imputed data sets were created using five random draws for each missing value. The imputation models were created on the development data, and new patients can be imputed using the same imputation models. Once the missing values have been imputed, the dynamic updating can proceed as if the values were known. We are effectively proposing method 6 from Hoogland et al.62. In a real-time implementation, to produce an individual risk prediction, a range of probabilities can be obtained from multiple imputation. This range, along with the average estimated probability, can be reported back so that an individual risk and the impact of the missing values on this risk can be assessed by a clinician.
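The study performed imputation with mice in R; purely as a language-agnostic illustration of the same idea (fit the imputation models on the development cohort only, draw several completed data sets, and report a range of predicted probabilities per patient), the following Python sketch uses scikit-learn's IterativeImputer as a chained-equations stand-in. It does not reproduce the passive-imputation details described above.

```python
# Illustrative stand-in for the R mice procedure described in the text.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_m_datasets(X_train, X_new, m=5):
    """Fit m imputation models on the development data and apply them to new
    patients, returning m completed copies of X_new."""
    completed = []
    for seed in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=seed, max_iter=10)
        imp.fit(X_train)  # imputation models trained on development data only
        completed.append(imp.transform(X_new))
    return completed

# For an individual prediction, the m imputed feature vectors yield a range of
# predicted probabilities that can be reported alongside their average.
```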

Outcomes

Outcomes collected included death and discharge. The primary outcome was 28-day survival. Patients who were discharged alive within 28 days or were alive and in the hospital longer than 28 days had a positive survival outcome. Patients who died in less than 28 days had a negative survival outcome. The 28-day follow-up ensured that all outcomes were known for patients still in the hospital.

Prediction model development

We previously developed the Northwell COVID-19 Survival Calculator (NOCOS), a generalized linear model that selected predictors from a pool of vitals, labs, and patient demographics using LASSO regression42,43. NOCOS uses six variables to calculate a prognosis for in-hospital survival at the time of admission.

For this study, we monitored and retrained a NOCOS model for 28-day survival. The development cohort did not include one of the hospitals, Long Island Jewish Hospital, and the cutoff date was modified to use patients admitted before April 23, 2020 rather than discharged before April 23, 2020, since the original NOCOS development set was inadvertently biased toward shorter-duration patients. We also imputed the data using MICE with five imputations, since the original NOCOS used mean imputation. Lastly, we made use of a more recent data set with follow-up extending to May 1, 2022 (n = 34,912; 29,984 survived 28 days), so that patients admitted through April 3, 2022 had a complete 28-day follow-up.

The data were standardized by taking the z-score, which puts all measurements on the same scale. All analyses were performed in MATLAB 2020b (Mathworks, Inc., Natick, MA). The five imputed development cohorts were combined via concatenation for model development63 and weighted so as not to artificially increase the sample size, and the minority class (patients who died) was randomly oversampled with replacement64,65 to correct for class imbalance. L1-penalized linear regression followed by Bayes' theorem was used to predict the survival of hospitalized patients with COVID-19 (Supplementary Methods). The class-conditional likelihood functions of the linear predictors for survival past 28 days and death before 28 days were estimated with Pareto tails and a Lévy alpha-stable distribution in the center using maximum likelihood estimation, and the priors were estimated as the fractions of patients that either survived or died. The posterior probability of survival past 28 days was evaluated using Bayes' theorem. Similar to logistic regression, linear regression followed by class-conditional likelihood estimation and Bayes' theorem can also be formulated as a generalized linear model with a custom link function.
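A simplified sketch of this model structure is given below. It substitutes Gaussian kernel density estimates for the Pareto-tailed Lévy alpha-stable class-conditional densities described above, and the LASSO penalty strength is a placeholder, so it should be read as an illustration of the structure (linear predictor, class-conditional likelihoods, Bayes' theorem) rather than a reproduction of NOCOS.

```python
# Simplified sketch: L1-penalized linear regression produces a linear predictor,
# class-conditional densities of that predictor are estimated, and Bayes' theorem
# yields the posterior probability of survival. Kernel density estimates stand in
# for the Pareto-tailed alpha-stable densities used in the actual model.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import Lasso

class NocosLikeModel:
    def fit(self, X, y):                        # y = 1 if survived past 28 days
        self.reg = Lasso(alpha=0.01).fit(X, y)  # L1-penalized linear regression (placeholder alpha)
        s = self.reg.predict(X)                 # linear predictor
        self.f1 = gaussian_kde(s[y == 1])       # class-conditional density, survivors
        self.f0 = gaussian_kde(s[y == 0])       # class-conditional density, deaths
        self.p1 = y.mean()                      # prior = observed fraction surviving
        return self

    def predict_proba(self, X):
        s = self.reg.predict(X)
        num = self.f1(s) * self.p1              # Bayes' theorem
        den = num + self.f0(s) * (1.0 - self.p1)
        return num / den
```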

For comparison, we also developed an L1-penalized logistic regression model and an extreme gradient boosted decision trees model for the 28-day outcome. The LR model did not require resampling, and the XGBoost model corrects for class imbalance via its cost function, which incorporates the fraction of each outcome class. The XGBoost model can also be recalibrated just like the other models because it also predicts a probability. The hyperparameters for the LR models were selected to yield six predictors in the same way as NOCOS. Six predictors were selected for the XGBoost models by including all candidate predictors, ranking the predictors by importance, computed by averaging the changes in node risk due to splits on each predictor66, and then retraining the models using only the six most important predictors.
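The XGBoost predictor selection can be sketched as follows; the hyperparameter values are placeholders rather than the study's settings, and the importance measure shown is the xgboost library's default rather than the node-risk-based importance used in the MATLAB implementation.

```python
# Sketch of the six-predictor selection for XGBoost: fit on all candidates,
# rank by importance, then retrain on the six most important predictors.
import numpy as np
from xgboost import XGBClassifier

def fit_xgb_top6(X, y, feature_names):
    pos_weight = (y == 0).sum() / (y == 1).sum()   # class-imbalance correction
    full = XGBClassifier(n_estimators=200, max_depth=3,
                         scale_pos_weight=pos_weight).fit(X, y)
    top6 = np.argsort(full.feature_importances_)[::-1][:6]
    model = XGBClassifier(n_estimators=200, max_depth=3,
                          scale_pos_weight=pos_weight).fit(X[:, top6], y)
    return model, [feature_names[i] for i in top6]
```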

Prediction model validation

The generalizability of the 28-day NOCOS calculator, as well as the LR and XGBoost models, was validated with the retrospective and prospective cohorts for each imputed data set. Predictive performance was assessed via AUROC, AUPR, and ICI. AUPR is a measure well-suited to imbalanced data, and its values range from the sample prevalence (indicating random performance) to 1 (indicating perfect classification)67. ICI is the expected error between the actual and ideal calibration curves and approaches 0 when there is perfect calibration68.
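As an illustration of the calibration metric, the sketch below computes the ICI as the mean absolute difference between the predicted probabilities and a lowess-smoothed calibration curve, in the spirit of the cited definition; the smoothing span is a placeholder.

```python
# Sketch of the integrated calibration index (ICI).
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def integrated_calibration_index(y_true, p_pred, frac=0.75):
    order = np.argsort(p_pred)
    p_sorted, y_sorted = p_pred[order], y_true[order]
    # Smoothed observed event rate as a function of predicted probability.
    calibration = lowess(y_sorted, p_sorted, frac=frac, return_sorted=False)
    return np.mean(np.abs(calibration - p_sorted))
```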

Ninety-five percent confidence intervals (95% CI) on the AUROCs were calculated using the method described by Hanley and McNeil69, 95% CI for the AUPRs were calculated using the method described by Boyd, Eng, and Page70, and the 95% CI for the ICIs were estimated using a bootstrapping approach with 200 replicates. All confidence intervals were determined over the union of the five imputed data sets. The confidence intervals for both area under the curves (AUCs) and ICIs were compared for significance using one-sided, two-sample t-tests, which are appropriate because AUCs are U-statistics, and the bootstrapped ICIs are observed to be approximately normally distributed.
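For example, the Hanley and McNeil standard error yields an approximate 95% CI for an AUROC as follows (normal approximation; n_pos and n_neg are the numbers of positive and negative outcomes).

```python
# Sketch of the Hanley-McNeil standard error for the AUROC and the resulting
# 95% confidence interval (normal approximation).
import numpy as np

def auroc_ci_hanley_mcneil(auc, n_pos, n_neg, z=1.96):
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1.0) * (q1 - auc**2)
           + (n_neg - 1.0) * (q2 - auc**2)) / (n_pos * n_neg)
    se = np.sqrt(var)
    return auc - z * se, auc + z * se
```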

Prediction model updating

A sliding window was used for model surveillance and updating. The windowed patients were evaluated on the current model, and the discrimination (AUROC and AUPR) and calibration (ICI) performance metrics were computed, sweeping out performance-versus-time curves. In order to evaluate the overall performance, the predicted probabilities of the 500 most recent patients were accumulated over the course of the simulation, and the performance metrics were recalculated over all of the accumulated data at the end of the simulation. If the ICI crossed a threshold, the model was updated based on the current window of patients but not applied to new patients until after the follow-up period, since patients' outcomes are not known at the time of admission. Hyperparameter optimization was performed by sweeping the window size, the ICI threshold, and the number of imputations and partially rerunning the simulation (through November 15, 2021) to select the optimal values. Hyperparameters to which performance is relatively insensitive would be expected to remain robust with prospective data. The sliding window procedure was used for each updating method—no updating, intercept-only logistic recalibration (updating the intercept of the linear predictor), full logistic recalibration (updating the intercept and applying a gain to the linear predictor), reestimation of the model parameters using the same predictors, and reestimation and selection of the model parameters using all candidate predictors13. The LR model also included a dynamic Bayesian logistic regression23 updating method that was batch updated at each window position without a threshold.
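A minimal sketch of this monitoring and updating loop, using the intercept-only and full logistic recalibration updates, is shown below; it reuses the integrated_calibration_index sketch above, and for brevity it omits the follow-up lag and the multiply imputed data sets.

```python
# Sketch of the sliding-window monitoring loop with logistic recalibration.
import numpy as np
import statsmodels.api as sm

def logit(p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def recalibrate(y, p, intercept_only=False):
    """Return a function mapping an original predicted probability to a
    recalibrated probability."""
    lp = logit(p)
    if intercept_only:
        # Slope fixed at 1 via an offset; only the intercept is re-estimated.
        fit = sm.GLM(y, np.ones((len(y), 1)), family=sm.families.Binomial(),
                     offset=lp).fit()
        a, b = fit.params[0], 1.0
    else:
        # Full logistic recalibration: intercept and gain on the linear predictor.
        fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
        a, b = fit.params[0], fit.params[1]
    return lambda q: 1.0 / (1.0 + np.exp(-(a + b * logit(q))))

def monitor_and_update(y, p, window=2000, step=500, ici_threshold=0.03):
    """Slide a window over patients in admission order; recalibrate the original
    model whenever the windowed ICI exceeds the threshold (follow-up lag omitted)."""
    correction = lambda q: q                        # start with the original model
    for start in range(0, len(y) - window + 1, step):
        idx = slice(start, start + window)
        p_win = correction(p[idx])
        if integrated_calibration_index(y[idx], p_win) > ici_threshold:
            correction = recalibrate(y[idx], p[idx])  # update on the current window
    return correction
```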

In addition, a realizable (causal) model was compared with an acausal, unrealizable model. When causality is turned off in the simulation, patient outcomes are assumed to be known immediately, and the updated model is applied to new patients prior to the follow-up period, resulting in overly optimistic performance. We quantified the optimism bias introduced by this assumption for comparison; the causal model was used in all other experiments.

Decision curve analysis

Decision curve analysis was performed by plotting the net benefit versus the preference: \(NB(p_t) = TPR(p_t)\,\varphi - w(p_t)\,FPR(p_t)\,(1-\varphi) - \text{Net Harm}\), where TPR is the true positive rate (sensitivity) at the threshold probability \(p_t\), FPR is the false positive rate at \(p_t\), \(\varphi\) is the estimated population event rate, and the weight \(w(p_t) = \frac{p_t}{1-p_t}\). The decision curve is compared against the treat-all (TPR = 1, FPR = 1) and treat-none (TPR = 0, FPR = 0) reference curves. In our decision curve analysis, the threshold probability refers to the mortality probability (in contrast to the survival probabilities of the models), to adhere to the commonly preferred way that decision curves are computed. There is no net harm from administering this test71.
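The net benefit implied by this formula can be computed directly, as sketched below (net harm set to zero, as stated above); y_event and p_event refer to the mortality outcome and the model's predicted mortality probability.

```python
# Sketch of the net-benefit calculation used for the decision curves.
import numpy as np

def net_benefit(y_event, p_event, thresholds):
    """y_event/p_event refer to mortality (the event), matching the text."""
    phi = y_event.mean()                      # estimated population event rate
    nb = []
    for pt in thresholds:
        pred_pos = p_event >= pt
        tpr = (pred_pos & (y_event == 1)).sum() / max((y_event == 1).sum(), 1)
        fpr = (pred_pos & (y_event == 0)).sum() / max((y_event == 0).sum(), 1)
        w = pt / (1.0 - pt)
        nb.append(tpr * phi - w * fpr * (1.0 - phi))
    return np.array(nb)

# Reference strategies: treat-all corresponds to TPR = FPR = 1; treat-none to
# TPR = FPR = 0 (net benefit of zero).
```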

Sensitivity analysis

Based on data from the New York State Department of Health72, the alpha variant was dominant from the start of our data until about June 15, 2021. The delta variant was dominant between June 15, 2021 and December 15, 2021. Then the omicron variant was the dominant strain from December 15, 2021 through the end of our data. Race and ethnicity were combined so that White and Other Hispanics and Latinos were grouped with Hispanics and Latinos, but Black and Asian Hispanics and Latinos were grouped with Black and Asian respectively. The results of the primary analysis were indexed according to these groupings.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.