Real-time prediction of COVID-19 related mortality using electronic health records

Coronavirus disease 2019 (COVID-19) is a respiratory disease with rapid human-to-human transmission caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Due to the exponential growth of infections, identifying patients with the highest mortality risk early is critical to enable effective intervention and prioritisation of care. Here, we present the COVID-19 early warning system (CovEWS), a risk scoring system for assessing COVID-19 related mortality risk that we developed using data amounting to a total of over 2863 years of observation time from a cohort of 66 430 patients seen at over 69 healthcare institutions. On an external cohort of 5005 patients, CovEWS predicts mortality from 78.8% (95% confidence interval [CI]: 76.0, 84.7%) to 69.4% (95% CI: 57.6, 75.2%) specificity at sensitivities greater than 95% between, respectively, 1 and 192 h prior to mortality events. CovEWS could enable earlier intervention, and may therefore help in preventing or mitigating COVID-19 related mortality.

To address these issues, we developed the COVID-19 Early Warning System (CovEWS), a risk assessment system for real-time prediction of COVID-19related mortality that we trained on a large and representative sample of EHRs collected from more than 69 healthcare institutions using machine learning. In contrast to existing risk scores, CovEWS provides early warnings with clinically meaningful predictive performance up to 192 hours prior to observed mortality events, hence enabling critical time to intervene to potentially prevent such events from occurring. Since CovEWS is automatically derived from patient EHRs, it updates in real time without any necessity for manual action to reflect changes in patient status, and accounts for a much larger number of risk factors correlated with COVID-19 mortality than existing risk scores. CovEWS is based on a time-varying neural Cox model that accounts for risk factors changing over time and potential non-linear interactions between risk factors and COVID-19-related mortality risk 1 , and was derived from the de-identified EHRs of 66 430 diverse COVID-19 positive patients. We demonstrate experimentally that the predictive performance of CovEWS is superior to existing generic risk scores, such as SOFA [14], COVID-19 specific risk scores, such as the machine learning models from Yan et al. [17] and Liang et al. [18], and COVER F [19], and a time-varying Cox model with linear interactions [20]. We additionally show that the gradient information of our differentiable CovEWS model can be used to quantify the influence of the input risk factors on the output score in real time. CovEWS may enable clinicians to identify high-risk patients at an early stage, and may, therefore, help improve patient outcomes through earlier intervention.

COVID-19 Early Warning System (CovEWS).
CovEWS is a clinical mortality risk prediction system for COVID-19 positive patients to be used in a continuous manner in both inpatient and outpatient settings. CovEWS uses clinical risk factors from a patient's EHR to automatically calculate a mortality risk score between 0 and 100 that indicates the current risk percentile that this patient is in relative to the reference cohort 2 . A CovEWS score of 90 indicates, for example, that the patient has a higher COVID-19 related mortality risk than 90% of COVID-19 positive patients in the reference cohort. An important property of CovEWS scores is that they always reflect the momentary risk of patients in their current states, and that they update instantaneously to reflect relevant, EHR-derived changes, which is a key differentiator of CovEWS compared to existing COVID-19 related mortality risk prediction systems that are not designed to take into account new, incoming clinical evidence. Figure 1 demonstrates the application of CovEWS to two contrasting patient timelines (a deteriorating patient that ultimately died and a patient that initially deteriorates but then recovers) by visualising a selected number of clinical risk factors, death death death death death death death death death death death death death death death death death death death death death death death death death death death intubation death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death intubation death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death death intubation such as respiratory rate, oxygen saturation, and creatinine levels, alongside the corresponding momentary risk assessment output by CovEWS. As shown in Figure 1, CovEWS additionally maintains a high degree of interpretability for clinicians by indicating the relative positive and negative influences of each clinical risk factor over time on the predicted risk score (see Section S.11). The information conveyed by CovEWS can be used to quickly and objectively assess individual COVID-19 related mortality risk in order to prevent or mitigate mortality, and optimise prioritisation of scarce healthcare resources.

intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation intubation
To develop CovEWS, we used EHR data from two federated networks of US and international healthcare organisations (HCOs), Optum (US) and TriNetX (US + international), that include de-identified EHRs containing data on demographics, clinical measurements, vital signs, lab tests and diagnoses of 47 384 and 5 005 patients seen between March 21 st and June 5 th 2020 (11 weeks) and March 21 st and June 25 st 2020 (13 weeks), respectively. To demonstrate the generalisability of predictions made by CovEWS, we limited the training of Cov-EWS to a training cohort of 14 215 (30%) patients from the Optum cohort, used 9 477 (20%) Optum patients for model selection, and evaluated CovEWS against both a held-out test cohort of 14 215 (30%) patients from the Optum cohort and a separate external test cohort consisting of the entire TriNetX cohort of 5 005 (100%) patients (Table 1, stratification details in Section S.4). In addition, we collected supplementary EHR data on new patients diagnosed with COVID-19 between June 6 th to July 13 th 2020 (5 weeks) from Optum -the Optum future cohort (14 041 patients) -after CovEWS had been trained to demonstrate the robustness of CovEWS under rapidly changing treatment regimes 3 and other temporal effects. The data formats were normalised across the two federated networks of HCOs (Section S.3), and all data were preprocessed to address the missingness that is characteristic for real-world clinical data (Section S.5).
Predictive Performance for Different Prediction Horizons. We compared the predictive performance of CovEWS, several baselines and existing risk prediction scores (Section S.12), including a version of CovEWS based on a linear time-varying Cox model [20] (CovEWS [linear], Section S.6.3), COVID-19 Estimated Risk for Fatality (COVER F) [19], Sequential Organ Failure Assessment (SOFA) [14], the decision tree developed by Yan et al. [17] and the deep learning model developed by Liang et al. [18], in terms of their respective specificity for identifying COVID-19 related mortality with a conservative fixed sensitivity of at least 95% and a slightly more relaxed level of 90% at a minimum of 1, 2, 4, 8, 16, 24, 48, 96 and 192 hours (8 days) prior to observed mortality events 4 on both the hold-out test data of Optum cohort and the external test cohort from the TriNetX network ( Figure 2). In terms of specificity at a sensitivity greater than *** *** *** *** *** *** *** *** ***  Figure 2: Performance comparison in terms of Specificity at greater than either 95% or 90% Sensitivity (y-axis) for different prediction horizons ahead of observed mortality events (in hours, x-axis) for CovEWS, CovEWS (linear), COVID-19 Estimated Risk for Fatality (COVER F) [19], Sequential Organ Failure Assessment (SOFA) [14], Liang et al. [18], and Yan et al. [17] on the held-out Optum test set, the external TriNetX test set, and selected patient subgroups from the Optum test set. Some methods do not reach 90% and 95% sensitivity for some horizons, and may therefore not be visible in all plots. Bars indicate median and error bars indicate 95% confidence intervals (CIs) obtained via bootstrapping with 200 samples. Detailled results are available in Section S.15. (* = p < 0.10, ** = p < 0.05, *** = p < 0.01, NS = not significant, one-sided Mann-Whitney-Wilcoxon for superiority of CovEWS over CovEWS [linear]). 6 September 1, 2020 95%, we found that CovEWS significantly (p < 0.05, one-sided Mann-Whitney-Wilcoxon with Bonferroni correction, see Table S9 and Table S10) outperformed other baselines and existing risk prediction scores at each prediction horizon and on both the Optum and TriNetX cohorts with few exceptions. By comparing the predictive performances of the mortality prediction scores at different time horizons, we additionally quantified the degree to which risk prediction methods give more accurate predictions when the mortality event is closer to the prediction date. For example, the predictive performance of CovEWS in terms of specificity at a sensitivity greater than 95% dropped from 89.3% ( However, all methods were roughly 10% less specific at greater than 95% sensitivity. This difference persisted even in those risk assessment systems that were not originally trained on the Optum training cohort, such as COVER F. We thus attributed this apparent difference in performance not to overfitting to the Optum training cohort, but to (i) the difference of 5.38% against 6.91% in baseline mortality between the held-out Optum test cohort and the external TriNetX test cohort, respectively, and (ii) the higher degree of missingness in short-term mortality risk factors, such as, e.g., respiratory rate, SpO 2 and blood pressure, in TriNetX (Table 1). In addition to assessing predictive performance, we also evaluated the calibration [23] of the risk scores predicted by CovEWS. We found that CovEWS overestimates mortality risk when interpreted as the probability of a mortality event occurring within the next 24 hours because patients' states may change between the prediction time and the end of the prediction horizon ( Figure S7).
Predictive Performance for Different Subgroups. We also compared the predictive performance of CovEWS against the baselines and existing scores across various ethnic subgroups, on patients that were not hospitalised, and on the Optum future cohort ( Figure 2; cohort statistics in Table 2). Overall, across each of these cohorts, we found that CovEWS significantly (p < 0.05, one-sided Mann-Whitney-Wilcoxon with Bonferroni correction) outperformed all of the baselines at each prediction horizon with the sole exception being the 96 and 192 hours prediction horizons on the Optum future cohort -where the performance difference was not in all cases significant. The performance difference was more pronounced across Caucasian and African American populations which is likely reflective of the fact that several baselines have been developed using data from predominantly Asian populations. On the subgroup of patients that was not hospitalised, we found that, although lower than the overall performance on the entire Optum test set, CovEWS maintained a high level of performance. We attributed the lower performance on the non-hospitalised group compared to the overall Optum test set to (i) the considerably higher missingness in this patient group caused by non-hospitalised patients not being monitored as closely as hospitalised patients (Table S6), and (ii) the overall considerably lower mortality rate in this patient group. Respectable performance on the non-hospitalised patient group is particularly important since the majority of COVID-19 patients are treated in an outpatient setting. In addition, when evaluating the various risk assessment methods on the Optum future cohort, we found that CovEWS was largely robust to changes in treatment policies and other temporal effects. A notable anomaly was the 96 and 192 hours prediction horizons where the variance in our performance estimates was relatively high since fewer patients with recorded mortality outcomes and long-term monitoring data were available due to the shorter data collection time (5 weeks) of the Optum future cohort compared to the Optum test set (11 weeks) and the TriNetX test set (13 weeks).
Stratified Time-varying Survival Analysis. As illustrated in the examples in Figure 1, CovEWS continuously varies over time since it accounts for the status of patients deteriorating or improving. To add to the analysis of the predictive performance of CovEWS in identifying the mortality of individual patients at fixed prediction horizons prior to observed mortality events presented in the previous paragraph, we therefore additionally evaluated whether CovEWS enables stratification of high-risk patients continuously over time. To do so, we stratified the held-out Optum test cohort and the external TriNetX cohort into five strata of the CovEWS score respectively assigned to each patient (Section 3). We found that CovEWS effectively separated patients into risk groups with distinct COVID-19 related mortality risk profiles, as patients that were assigned to higher strata of CovEWS scores were more likely to die across all strata over the course of their disease. When comparing stratification results between the held-out Optum test cohort and the external TriNetX cohort, we observed that the ability to stratify patients into risk groups generalised across the two datasets -indicating that the predictive performance of CovEWS can transfer to other sources of data collected with different protocols, from different locations, and under different treatment policies. We also observed that the highest risk stratum of patients assigned CovEWS scores between 90 and 100 was considerably steeper than other strata in the held-out Optum test cohort and this anomaly did not persist to the same degree in the external TriNetX cohort. Qualitatively, we reasoned that this difference between the two datasets was due to the considerably higher missingness of short-term risk factors associated with mortality, such as, e.g., respiratory rate, SpO 2 and blood pressure, in the TriNetX cohort (Table 1). Rapid changes in these short-term risk factors often result in substantially increased near-term mortality risk and CovEWS scores reflected this increased risk immediately (Figure 1), moving patients with extreme short-term risk indicators into the highest risk stratum. Since these short-term risk factors were not included as frequently in the TriNetX cohort, CovEWS was considerably less able to react to short-term deteriorations in the status of the patients, which was reflected in a relatively flatter time-varying survival curve of the highest risk stratum in the TriNetX cohort.  Figure 3: Stratification of patients in the held-out Optum test cohort (left) and the external TriNetX test cohort according to their assigned CovEWS score over time (in hours since COVID-19 diagnosis) into those patients that were assigned a CovEWS score below 60, from 60 to 69, 70 to 79, 80 to 89, and 90 to 100 on the test fold of the Optum dataset. Note that the five strata and their respective limits were chosen for clarity of visualisation -other strata are possible, and may, depending on context, have better clinical utility. Rows show time-varying survival probabilities (top row), the number of patients (centre row), and the cumulative number of mortality events observed (bottom row) for patients in each stratum of assigned CovEWS scores. Steeper curves indicate that more patients died while assigned a CovEWS score in the respective stratum. In contrast to traditional survival curves, cohorts as defined by strata of CovEWS scores are not static over time, and patients move between the stratified groups as they are assigned lower or higher CovEWS scores in response to their status improving or deteriorating, respectively. The results showed that CovEWS enables effective stratification of patients into risk groups over the course of their disease, as patients that were assigned a higher CovEWS score were more likely to die over time on both test cohorts while maintaining separation between the stratified cohorts.

Discussion
We developed and validated CovEWS, a real-time early warning system for predicting mortality of COVID-19 positive patients, using routinely collected clinical measurements and laboratory results from EHRs. When compared to competitive baselines, our method not only provides accurate mortality predictions for each patient, but also provides a real-time early warning system of up to 192 hours (8 days) prior to an observed mortality event for individuals, while identifying clinically-relevant factors for predictive performance. These results are sustained across various ethnic groups and cohorts. Notably, in comparison to existing mortality risk scoring systems, our method achieves significantly higher performance in terms of specificity at greater than 95% sensitivity across all evaluated prediction time frames, and generalises well to data collected under different treatment and data collection policies and environmental conditions. The implications of providing such an early warning system are significant. The provided risk assessment could potentially broadly aid in clinical decision-making as well as in the prioritisation of care and resource allocation. More specifically, CovEWS could enable clinicians to intensify monitoring and therefore initiate treatments earlier in patients with a higher risk of mortality. Moreover, as an additional information source, CovEWS could also help clinicians to decide when to initiate palliative care to improve the quality of remaining life for patients with this need. Additional studies investigating if and how CovEWS can influence clinical decision-making would be necessary to improve both treatment outcomes, the involvement of palliative care, or resource allocation to reduce COVID-related mortality. Before applying CovEWS in clinical practice, it is important to decide and calibrate appropriate warning thresholds, e.g. at the 85%, 90% or 95% sensitivity level (Section S.10). Especially when hospitals are overwhelmed and need to strictly allocate resources, alarm fatigue due to ill-calibrated thresholds ought to be minimised. In addition, while the data used in this study already comprises multiple hospitals, a further analysis including hospitals from other countries would be useful to investigate the impact of geographic and cultural differences -particularly in those geographic contexts that are not well covered by this study. Due to differences in data collection methodology and expected data formats, another limitation of this study is that the implementation of some existing risk scoring systems is based on certain assumptions that may adversely influence their comparative performance (Section S.12). Moreover, this work only concerns risk scores from routinely collected clinical data and patients who are already seeking care at healthcare providers. For efficient mitigation of COVID-19, additional, potentially preventative efforts like tracking apps, risk scores of infection prior to admission, masks and social distancing are necessary.
It is also important to acknowledge upfront the pitfalls of mortality prediction of hospitalised patients. A significant proportion of patients who die in the hospital, do so after cessation of treatment. One may argue that models that predict mortality thus actually predict the likelihood of treatment discontinuation. Numerous factors go into the decision with regard to continuing or stopping interventions, including whether the outcome, if the patient were to survive, is aligned with the patient's preferences. It will only be accurate in a clinical context where clinicians make predictions in a similar manner, where patients share the same values and preferences around the quality of life, and where the decision-making process resembles that of the training cohort.
From the perspective of medical staff, prognostication as well as the perception of the quality of life if the patient were to survive, determine the framing of patient status to the family and friends; these are vulnerable to bias, both conscious or unconscious and influence the decision to admit the patient to the intensive care unit, as well as the decision to discontinue treatment (which almost certainly lead to death among those who are most severely ill). In a perfect world without bias and health disparities, only patient and disease factors determine hospital mortality, but studies have repeatedly demonstrated that this is far from the case. Recently, mortality from critical illness has been shown to be higher in disproportionately minority-serving hospitals after adjustment for illness severity and other biological factors that pertain to the patient and to the disease [24,25]. It is nearly impossible to incorporate these factors precisely in a model that is trained on mortality as an outcome. As a decision support tool to inform discussion around goals of care, CovEWS is subject to the same limitations that mortality prediction models have -it may permeate or even magnify existing health disparities and provider bias. As an early warning system, however, we speculate that the impact of the exclusion of social determinants on model performance is acceptable.
In summary, we presented, developed, and experimentally validated CovEWS, a real-time early warning system that provides clinically meaningful predictions of COVID-19 related mortality up to 192 hours (8 days) in advance for individual patients using routinely collected EHR data. In contrast to existing risk scoring systems, CovEWS provides real-time continuous risk assessment that accounts for a large set of short-term and long-term risk factors associated with COVID-19 related mortality, is automatically derived from readily available EHR data, and was externally validated using data from multiple hospitals, diverse patient groups, and across time frames. Accessible risk assessment from readily available EHRs is especially important in the ongoing COVID-19 pandemic since access to advanced clinical lab testing and imaging techniques may be limited in many hospitals. CovEWS allows for critical time in clinical decision making, even without access to specialised lab tests or advanced diagnostic equipment. Prospective studies are needed to conclusively establish if the availability of early warnings for COVID-19 related mortality through CovEWS improves patient outcomes compared to the standard of care.

Data availability
Accredited users may license the TriNetX COVID-19 research and Optum deidentified COVID-19 electronic health record databases used in this study at TriNetX and Optum, respectively.

Competing Interests
PS is an employee and shareholder of F. Hoffmann-La Roche Ltd.
[28] Safiya Richardson, Jamie S. Hirsch, Mangala Narasimhan, James M.  Figure S4: During training the model is provided with EHR data, including lab tests, clinical measurements, and information about pre-existing conditions. Before the data is fed to the model, the preprocessing phase handles missing values and standardises scaling of each covariate (Section S.5). CovEWS was trained on the training fold of the Optum cohort, and evaluated on the held-out Optum test cohort, the Optum Future cohort and the external TriNetX cohort (Section S.2). The proposed model accommodates the effect of time-varying and nonlinearly interacting covariates and is trained using partial likelihood as detailed in Section S.7. Section S.5 presents further details on preprocessing.

S.1 Overview
The overall pipeline of the method is shown in Figure S4. We refer to Section S.6.4 for a detailed presentation of the predictive model used by CovEWS and Figure S6 for a detailed diagram of the model architecture.

S.2 Data Collection
We used data collected by two federated networks of healthcare organisations: Optum. The Optum de-identified COVID-19 electronic health records database includes de-identified electronic medical records and clinical administrative data including bedside observations and laboratory data from a geographically diverse set of healthcare institutions in the United States (US). The EHR data was sourced from more than 45 provider groups and integrated delivery networks. We used Optum cohort data collected between 21st March and 5th June 2020, and another cohort separated in time from 6th June to 13th July 2020 for our analysis.
TriNetX. TriNetX is a global health research network providing a de-identified dataset of electronic medical records (diagnoses, procedures, medications, laboratory values, genomic information) including patients diagnosed with COVID-19. The data is de-identified based on standard defined in Section §164.514(a) of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. The process by which Data Sets are de-identified is attested to through a formal determination by a qualified expert as defined in Section §164.514(b)(1) of the HIPAA Privacy Rule. We used TriNetX cohort data collected between 21st March and 25th June 2020 from 24 healthcare organisations in the US, Australia, Malaysia and India for our analysis.
Data Quality. Both Optum and TriNetX as well as the data providing healthcare institutions applied quality control steps to their data, but these procedures are not standardised neither across the federated networks nor across healthcare institutions. Varying levels of data quality across EHRs collected at different healthcare institutions and networks are therefore expected. However, heterogeneous data quality standards are characteristic for real-world data collected at different healthcare institutions. By evaluating CovEWS against an external test cohort from healthcare institutions with data collection policies different from our training cohort, we are able to give a fair assessment as to how robust and transferable CovEWS is in presence of realistic variations in data quality.  Table S4, and their distributions across the datasets in Table 1.  Ischemic heart disease see Table S5 0.04 Other heart disease see Table S5 0.01 Cerebrovascular disease see Table S5 0.20 Hypertension see Table S5 0.03 Diabetes see Table S5 0.15 Hyperlipidemia see Table S5 0.17 Cancer see Table S5 0.65 Dyspnea see Table S5 0.42 COPD see Table S5 0.18 Asthma see Table S5 0.80 Pulmonary embolism see Table S5 0.58 Connective tissue disease see Table S5 0.84 Inflammatory bowel disease see Table S5 0.99 Osteoarthritis see Table S5 0.37 Rheumatoid arthritis see Table S5 0.79 HIV see   Figure S5: Distributions of the number of observations per patient (y-axis, log scale, violin plots) for time-varying covariates (x-axis) in the Optum and TriNetX datasets for patients that have at least one observation of a given covariate. The percentage of patients with no observation for each covariate is shown in Table 1.
Data Characteristics. The cohort statistics of the two datasets are presented in Table 1, the ICD-9 and ICD-10 codes corresponding to the diagnoses shown in Table 1 are given in Table S5, and the number of observations per patient for time-varying covariates for the two datasets is visualised in Figure S5. As is characteristic for clinical data collected in real-world contexts, missing covariates are common in both datasets. Missingness in real-world EHR data is caused primarily by differences in laboratory testing guidelines, data collection practices, available testing resources and measurement devices between hospitals, and may in some cases depend on patient status and preferences of clinical staff. For example, Table S6 compares the missingness between the Optum test set and the non-hospitalised patient subgroup of the Optum test set. In contrast to traditional clinical studies, realistic missingness patterns in both the training and evaluation datasets are a desirable feature in the context of our study as CovEWS is designed to be deployed in clinical contexts with similar missingness, and therefore has to be trained and evaluated in the presence of missingness patterns seen across a representative range of heterogeneous hospitals. Covariates were mostly balanced across the Optum and TriNetX datasets. The primary differences were a higher observed mortality rate, and higher ratios of intubations, connective tissue disease, and rheumatoid arthritis in the TriNetX data compared to the Optum data. In addition, we note that the majority of admissions were recorded as being of the "Unknown" type in the TriNetX database. Since the large fraction of unknown admission entries limited potential admission outcome analyses, we reported hospital and ICU admission outcomes as not available for TriNetX (Table 1). In compliance with the HIIPA Privacy Rule Section §164.514(a), patients' exact dates of death were not available to protect patient privacy. In our analysis, we therefore imputed the last recorded EHR entry date as the reference date of death for deceased patients. The actual dates of death may have happened at a later point, and our performance estimates are therefore potentially underestimating actual predictive performance, since (i) correct predictions that happened later would mean CovEWS predicted sooner than we thought for that patient (which is generally harder, see Figure 2), and (ii) incorrect predictions of CovEWS may actually have been outside of the prediction time horizon. We believe this approximation of the exact date of death is therefore an acceptable trade-off, since underestimation of performance is not as much a concern as overestimation would be.

S.3 Data Normalisation
The EHR data across both data sources used two different, but compatible, underlying data models consisting of recorded diagnoses, demographics, lab tests, procedures, medications and clinical observations. For our risk factors of interest, we converted records from both datasets into a unified data representation. We used ICD-9 and ICD-10 to extract diagnoses (Table S5)

S.4 Stratification
We split the Optum cohort used for model development into training (50%), validation (20%) and held-out test folds (30% of all patients) at random stratified by patient age, gender, presence of mortality events, presence of intubation events, presence of ICU admission and presence of a human immunodeficiency virus (HIV) diagnosis. We added HIV to the set of stratification covariates since its low prevalence could otherwise have led to imbalances in this risk factor across the folds. Stratification produced balanced cohorts across the three folds ( Table 1). The Optum training fold was used to train CovEWS, the validation fold was used to select the optimal hyperparameter configuration for CovEWS, and the held-out test fold was used in addition to the external TriNetX test cohort to evaluate the out-of-sample generalisation performance of CovEWS.

S.5 Preprocessing
Discrete covariates with p different values were transformed into their one-hot encoded representation with one out of p indicator variables set to 1 to indicate the discrete value for this patient. All continuous features were standardised to have zero mean and unit standard deviation using observed covariate distribution on the Optum training fold. Missing values of continuous covariates were imputed in an iterative fashion using multiple imputation by chained equations (MICE) [29]. After the preprocessing stage, continuous input features were standardised and fully imputed, and discrete input covariates were one-hot encoded. All preprocessing operations were derived only from the training fold, and naïvely applied without adjustment to other folds and datasets in order to avoid information leakage.

S.6 Method
We adopt a variation of the widely used Cox proportional hazard model that is adapted to accommodate nonlinear and time-varying effects of covariates on the log-hazard function. In the following, the basics of time-to-event analysis that is the main subject of this paper is briefly presented. Then we touch upon the Cox proportional model for continuous-time covariates that is followed by the modifications we applied to this model to prepare it for the current work.

S.6.1 Survival Analysis
Survival analysis which is also known as Time-To-Event (TTE) analysis included a large body of work consisting of mathematical tools to give a statistical analysis of the time duration until a specified event occurs. In the current work, the event is defined to be the time when a patient dies. An important tool in time-to-event analysis is hazard function. In discretetime setting, (e.g. if times are given in specified periods) the hazard function is a conditional probability defined in discrete-time as that represents the risk of dying at time t if the patient has survived until that time. The relevant covariates of the patients up to time t are encapsulated in the vector x ∈ R d . Age, sex, and lab tests are examples of such covariates that can take either binary or standardised real values after preprocessing. Intuitively, the hazard function captures the underlying dynamics of the transition of the condition of the patient from alive to dead. The larger h is at time t, the more likely it is for the patient to die at time t. Another useful function is called survival function that is denoted by S(t) and in discrete-time defined as Similar functions can be defined in the continuous-time regime. Let T c be the continuous survival time with the probability density function f c and cumulative distribution function F c . Similar to Equation (2), the continuous survival function represents the probability of surviving until time t that is defined as Likewise, the continuous hazard function is defined as Notice that unlike discrete hazard function (1), the continuous hazard function (4) is not a probability distribution and can take values larger than one. The last useful function in continuous survival analysis is the cumulative hazard function 27 September 1, 2020 The connection between these quantities can be simply derived: Before introducing the simple yet flexible Cox model, we discuss a few important issues that must be taken into account in survival analysis.
Censoring. What makes the survival analysis different from a simple regression from the covariates x to T -observed duration up to the occurrence of the event -is the concept of censoring. An observation is called censored -or more precisely right-censored -if its survival time has not been fully observed. There are several causes for a censored observation. For example, if a patient is not under observation when the event occurs or if the information of the patient is lost for some reason, only a lower bound to the time-to-event T is observed that is the last time the condition of the patient is recorded.
Discrete vs. Continuous. Although time is a continuous physical quantity, in practice, it is measured at discrete points. Especially, in medical sciences, the condition of the patient is measured on a regular daily or bi-daily basis. This implies that even though the change of the covariates of a patient occurs at certain points of time, the exact time is not known. The transition point is only known up to the resolution of the measurement. We assume the time at which an event of interest occurs is denoted by T . As the resolution of the measurement is hours in the datasets used in the current work, T = t refers to an event that occurs within the t th hour after the patient is admitted to the hospital and its health condition is recorded.
Ties. In the limited resolution measurement of time, some observations may have the same survival times, e.g., two patients die on the same day even though it is extremely unlikely that both die at the same moment. However, even in continuous time data, ties may occur which is a hint of underlying discrete sampling in time.
A major difference between continuous and discrete-time survival analysis is that the hazard function is a probability distribution in discrete settings while it can take any positive value in continuous settings. However, the traditional continuous-time approach can still be used for discrete event times especially when the measurements are equally spaced.

S.6.2 Cox Hazards Model
The most widely known model in the analysis of continuous survival time is Cox's proportional hazard model [30] that parameterizes the hazard function as where h 0 is called the baseline hazard that is modulated by the effect of covariates via exp(β T x). Notice that in the traditional Cox model (9) the covariates x are assumed constant over time. Consequently, the temporal variation of the hazard function is separated from the influence of the covariates.

S.6.3 Time-varying Covariates
In many experimental settings, the assumption of time-invariant covariates in (9) does not hold. For example, many entries in the electronic health records such as heart rate, temperature, and blood measurements do not remain constant over the course of the hospitalisation of a patient. Therefore, the traditional Cox model (9) is extended to a time-varying setting by replacing x in (9) with x t that is the measured covariates at time t. Assume a dataset consists of N patients indexed by n = 1, 2, . . . , N . As a notational convention, x (n) t denotes the vector of the corresponding covariates to the patient n at time t.
If the Cox model holds and continuous events are observed, the following function called partial likelihood is maximized to estimate β: where t 1 < t 2 < . . . < t k are the ordered times at which the events occur and x t k are the corresponding set of covariates at those times. Notice that the equality of the superscript of the covariate vector x (i) ti (patient's index) and the subscript of time t i emphasizes the continuous event times and the fact that at most one patient experiences the event at each time. For the moment, we assume time is continuous that results in distinct event times. The set R(t i ) is the set of the patient's indices that are at risk at time t i . Being at risk means they are alive and can potentially experience the event.

S.6.4 Nonlinear Time-varying Covariates
One clear limitation of (10) that is caused by the definition of the hazard function (9), is the fact that the exponent of the modulating function exp(β T x) is a linear function of x. Hence higher order interactions among different dimensions of the covariate vector cannot be captured by this method. To improve the expressiveness of the model, we replace the linear function β T x with a nonlinear function realised by a neural network. Let φ(·; θ) : R d → R be the 29 September 1, 2020 function implemented by the neural network and parameterised by θ. Therefore, the hazard model is represented as where h 0 (t) is the baseline population-level hazard that is independent of the associated covariates to each patient. Time-varying covariates are transformed by the function φ(·; θ) to log-hazard. The parameters θ are learned via maximising the partial log-likelihood [30]. Despite traditional Cox proportional hazard model where the gradient and Hessian can be computed analytically, here, we use automatic differentiation to compute gradients with respect to θ. The nonlinear function φ(·; θ) is implemented as a 2−layer multilayer perceptronsee Section S.6 for a detailed description. The hazard function (11) estimates the instantaneous risk of death at each time for each patient. Integrating with respect to time and exponentiating the result gives the survival function defined as Notice that x 0:t denotes the set of covariates until time t, meaning that, the probability of survival up to time t depends on the history of the covariates. The partial likelihood (10) is re-written as To give an intuition of (13), observe that the partial log-likelihood that is computed by taking logarithm of the right-hand side of (13) will consist of k terms corresponding to k observed events. The parameter vector θ is perturbed such that the hazard increases for the covariates of a patient who dies at time t i while it decreases for the covariates of the patients who remain alive at t i .

S.6.5 Resolving Ties
Even though we adopt a continuous-time approach due to the non-normalised parametric form of the hazard function (9) and the resultant partial likelihood (13), the ties can still occur as we work in hourly resolution. Hence, it is possible that two patients die at the same time. When an event occurs for two patients at the same time, the partial likelihood (13) is not valid anymore. Several methods exist in the literature to break the ties and remove the ambiguity such as average partial likelihood [30] and Berslow's method [31] that lives on two ends of a spectrum. The former takes average among all possible orders of the events that can break the tie. Hence, it is the most accurate method but computationally prohibitive. The latter gives a partial likelihood almost exactly like the original Cox likelihood by assuming that every ordering of tied events results in the same partial likelihood. This method gives a crude estimate but is

4 w 0 W y i 3 A 2 q 4 F I o 3 U K D k 7 U R z G g W S t 4 L R z d R v P X J t R K w e c J x w P 6 I D J U L B K F r p / q n n 9 U p l t + L O Q J a J l 5 M y 5 K j 3 S l / d f s z S i C t k k h r T 8 d w E / Y x q F E z y S b G b G p 5 Q N q I D 3 r F U 0 Y g b P 5 u d O i G n V u m T M N a 2 F J K Z + n s i o 5 E x 4 y i w n R H F o V n 0 p u J / X i f F 8 M r P h E p S 5 I r N F 4 W p J B i T 6 d + k L z R n K M e W U K a F v Z W w I d W U o U 2 n a E P w F l 9 e J s q x T u v V O 8 u y r X r P I 4 C H M M J n I E H l C D W 6 h D A x g M 4 B l e 4 c 2 R z o v z 7 n z M W e c f O Y I / s D 5 / A E M / o 2 k < / l a t e x i t >
x 2    easy to implement. A midway approach that we adopted in this work is called Efron's tie-breaker [32]. In this method, a weighted average likelihood of tied cases is subtracted from the denominator of Equation (13). Efron's method gives good accuracy and is moderately easy to work with -see [32] for details.

S.7 Algorithm Details
Survival analysis by the Cox model is done via maximum likelihood estimation, where the aim is to maximise the logarithm of (10) in the original Cox's proportional hazard model and (13) in the nonlinear extension. In the original method with linear exponent both gradient ∂ log L/∂β and Hessian ∂ 2 log L/∂ 2 β can be computed analytically. This is not the case for our proposed extension (13) with nonlinear exponent. Instead of an analytical gradient, we use automatic differentiation to compute the gradient ∂ log L/∂θ. Once the gradient is derived, an appropriate gradient-based method is used to perturb θ in the direction that increases the partial likelihood. As mentioned in section S.6.4, the linear exponent β T x in (10) is replaced with a nonlinear function φ(·; θ). We use a neural network with L hidden layers to realise this function. The employed network linearly transforms the input features to a N dim -dimensional hidden layer. The transformed features are passed through a leaky rectified linear unit (LeakyReLU) [33] nonlinear activation function. The hidden activations are then transformed by a linear transformation to a single node and finally passes through a tangent hyperbolic (tanh) activation function. In summary the network function can be represented as where θ = {W 1 , W 2 } and W i , i = 1, 2 are the trainable weight matrices of the network ( Figure S6)

S.8 Hyperparameter Optimisation
For the methods trained on the Optum training fold (CovEWS and CovEWS [linear]), we used a systematic approach to hyperparameter optimisation where each prediction algorithm was given a maximum of 15 hyperparameter optimisation runs with different hyperparameter configurations chosen at random without duplicates from predefined ranges (see Table S7). Out of the models generated in the hyperparameter optimisation runs, we then selected the model that achieved the highest specificity at greater than 95% sensitivity on the validation set of the Optum cohort.

S.9 Postprocessing and Calibration
After training CovEWS using the Optum training cohort, the predicted hazard for a patient with state x is the hazard function h(t|x) eq. (9) evaluated at t = 128 hours (≈ 5.33 days) given the current patient covariates x and under the assumption that patient covariates stay constant. To produce CovEWS scores, we additionally apply post-processing using a percentile transformation that converts h(t|x) into the percentile of patient states in the Optum validation set that are assigned a lower h(t|x) than the evaluated patient state x. We chose to output CovEWS scores in form of percentiles to aid in the clinical interpretation of CovEWS as a risk score relative to a representative set of reference states, and to discourage interpretation as a mortality probability. Interpretation of CovEWS scores as a mortality probability is difficult since the mortality risk of a patient depends on their uncertain future trajectory and the prediction horizon, and is influenced by clinical interventions that may be initiated in the future. As shown in Figure 1, patients' states may change rapidly and frequently, and clinical interventions can significantly and positively alter the trajectory of patients. We also verified experimentally that, when interpreted as a probability of mortality, CovEWS scores overestimate the actually observed probability of death on the Optum and TriNetX test sets since patients' states may improve, due to intervention or otherwise, between the prediction time and the end of the prediction horizon ( Figure S7; similar results with CovEWS [linear] Figure S8). We, therefore, decided to instead output CovEWS scores as relative risk percentiles between 0 and 100 that discourages interpretation as a probability of mortality. To aid in the use of CovEWS, the following Section S.10 outlines calibrated thresholds that can be used to maximise specificity at the desired target level of sensitivity for different prediction horizons.

S.10 Thresholds
A key question for clinical decision making is which threshold should be used for CovEWS scores to indicate severe risk, and potentially trigger an automated warning. To provide guidance in choosing the appropriate CovEWS score depending on the desired trade-off between sensitivity and specificity, we evaluated the optimal observed thresholds of CovEWS scores for various target sensitivity levels using their respective receiver operator characteristic (ROC) curves for each prediction horizon (Table S8). Optimal score thresholds to maximise specificity at high levels of sensitivity were between 61 and 36, 44 and 27, and 34 and 19 depending on the prediction horizon for target sensitivity levels greater than 85%, 90% and 95%, respectively. We note that lower thresholds are necessary to achieve high sensitivity for prediction horizons farther in the future as patients' deterioration has to be identified earlier in its progression.

S.11 Feature Importance
Highlighting the clinical risk factors that positively or negatively influenced CovEWS to output a certain score is of high utility as it enables clinical users to contextualise CovEWS scores, and, in some cases, these highlights could potentially even point towards opportunities for timely intervention. We utilised the differentiability of our prediction model as outlined in Section S.7 to provide a real-time visualisation of the clinical covariates that are most important for CovEWS at any given time point (see Figure 1 for an example). To compute the importance scores at each time point, we used the Integrated Gradients (IG) [37] method that calculates relative importance scores a i ∈ (−100%, 100%) for each input feature x t,i in the feature vector where d is the number of input features. We used IG with the mean feature vectorx t across the Optum training set as a reference, calculated 50 intermediate steps for each explained x t , and normalised a i to the range of (−100%, 100%) by dividing each a i by the sum Σ d−1 i=0 |a i | of all feature attributions for x t . To obtain a timeline of attributions as shown in Figure 1, we calculate attributions a i whenever a change in patient status was recorded in the patient's EHR.

S.12 Baselines
In our analysis, we compared the performance of CovEWS to the following existing generic and COVID-19 specific clinical risk scores, and baselines: CovEWS (linear). A linear time-varying survival Cox model as described in Section S.6.3 trained using the same Optum training set and using the same pipeline as the non-linear CovEWS. We used the implementation provided in version 0.24.8 of the lifelines [20] Python package.
COVID-19 Estimated Risk for Fatality (COVER F). The COVER F scoring system for COVID-19 as described in [19]. Since COVER F uses a single flag for any heart disease diagnosis, we aggregated all diagnoses in the diagnosis categories ischemic heart disease, pulmonary embolism, other heart diseases in our dataset into one single joint diagnosis code if any diagnosis in those three categories was present. All other input features used by COVER F were direct matches with the input covariates of the same name also used by CovEWS (Table 1).
Sequential Organ Failure Assessment (SOFA). SOFA scoring is commonly used in clinical contexts to indicate the risk of organ failure in critical patients. We, therefore, used SOFA as a generic risk scoring baseline that was not specifically designed for COVID-19 to demonstrate the comparative benefits 34 September 1, 2020 when interpreting the predicted risk percentile of CovEWS as the probability of a mortality event being observed within the next 24 hours. The reference time point for those patients that did not have an observed mortality event is the date of their respective last observed EHR entry. When interpreted as a patient's mortality probability, CovEWS overestimates the risk since it can not account a priori for clinical interventions and potential future patient trajectory changes that may occur rapidly and frequently. Direct interpretation of CovEWS scores as a mortality probability is discouraged, and CovEWS scores should instead be seen as a relative risk score to stratify patients into risk groups (Section 3).
in the predictive performance of a COVID-19 specific risk scoring system. Since we did not have FiO 2 values available in our EHR datasets, we assumed a default FiO 2 value of 21% (equal to the fraction of oxygen in inhaled air) for patients that were not intubated, and an average of 71% for patients that are intubated (FiO 2 is often set to 100% initially and then progressively lowered as the patient stabilises, see e.g. [39] for an example). In addition, we did not have access to Glasgow coma scale (GCS) scores in the EHRs, and potential additional points from a high GCS score (a maximum of +4) were therefore not reflected in our calculated SOFA scores. 5-fold cross validation) [17]. All input features used by Yan et al. [17] were direct matches with the input covariates of the same name also used by CovEWS (Table 1).
Liang et al. 2020. Liang et al. [18] developed a prediction model for critical COVID-19 related illness using data from 1 590 patients seen at 575 medical centers in China using deep learning and 10 input covariates, including observed X-ray abnormalities. On three external cohorts from different Chinese provinces, they reported a predictive performance in terms of concordance index (c-index) of 0.890, 0.852 and 0.967 for predicting critical illness under the missingness of input covariates, respectively. Since we did not have access to radiologic assessments in our EHR datasets, we evaluated their model with the X-ray abnormality covariate missing for all evaluated patients (i.e. set to zero). To the best of our knowledge, Liang et al. [18] did not specify which co-morbidities were included in their collected dataset. However, their study reports a maximum of 6 co-morbidities diagnosed in one patient. In our evaluation, we counted existing patient diagnoses of pulmonary embolism, kidney disease, inflammatory bowel disease, asthma, rheumatoid arthritis, and diabetes towards these 6 co-morbidities.

S.13 Software
The source code used for developing CovEWS and for conducting the presented experiments and analyses was implemented using Python (version 3.7), scikit-learn (version 0. 22

S.14 Hardware
We used the high-performance computing (HPC) infrastructure provided by the Personalised Healthcare Informatics group at F. Hoffmann-La Roche Ltd to run the presented experiments. The compute nodes used 1st and 2nd generation Intel Xeon Platinum 8000 series processors and had access to 72 GB random access memory (RAM) each.

S.15 Performance Evaluation
In addition to the results presented in the main body of this work, we also present Receiver operator characteristic (ROC) curves for CovEWS for various prediction horizons between 1 and 192 hours evaluated on the held-out Optum test set ( Figure S9) and the external TriNetX test set ( Figure S10), the same ROC curves for CovEWS (linear) ( Figure S11 an Figure S12) a comparison of CovEWS, Time Varying Cox [20], COVER F [19], SOFA [14], Yan et al. [17], and Liang et al. [18] at various prediction horizons in terms of AUC, AUPR, F 1 , sensitivity, specificity and specificity at greater than 95% sensitivity (Spec.@95%Sens.) for predicting COVID-19 related mortality on the held-out Optum test set (Table S9), on the external TriNetX test set (Table S10), on the Optum Future cohort (Table S15), and on the Black or African American (Table S11), Hispanic (Table S12), Asian (Table S13), Caucasian (Table S14) and non-hospitalised (Table S16) [19], SOFA [14], Yan et al. [17], and Liang et al. [18] at various prediction horizons in terms of AUC, AUPR, F 1 , sensitivity, specificity and specificity at greater than 95% sensitivity (Spec.@95%Sens.) for predicting COVID-19 related mortality on the Optum future cohort. Values are the median and the 95% confidence intervals (CIs, in parentheses) obtained via bootstrap resampling with 200 samples. † = significant at p < 0.05 to CovEWS (one-sided Mann-Whitney-Wilcoxon, Bonferroni corrected). Best results for each metric are highlighted in bold.   [19], SOFA [14], Yan et al. [17], and Liang et al. [18] at various prediction horizons in terms of AUC, AUPR, F 1 , sensitivity, specificity and specificity at greater than 95% sensitivity (Spec.@95%Sens.) for predicting COVID-19 related mortality on the subgroup of non-hospitalised patients of the Optum test set. Values are the median and the 95% confidence intervals (CIs, in parentheses) obtained via bootstrap resampling with 200 samples. † = significant at p < 0.05 to CovEWS (one-sided Mann-Whitney-Wilcoxon, Bonferroni corrected). Best results for each metric are highlighted in bold.