Deep learning for deterioration prediction of COVID-19 patients based on time-series of three vital signs

Unrecognized deterioration of COVID-19 patients can lead to high morbidity and mortality. Most existing deterioration prediction models require a large amount of clinical information, typically collected in hospital settings, such as medical images or comprehensive laboratory tests. This is infeasible for telehealth solutions and highlights a gap in deterioration prediction models based on minimal data, which can be recorded at a large scale in any clinic, nursing home, or even at the patient’s home. In this study, we develop and compare two prognostic models that predict whether a patient will experience deterioration in the forthcoming 3 to 24 h. The models sequentially process three routine vital signs: (a) oxygen saturation, (b) heart rate, and (c) temperature. These models are also provided with basic patient information, including sex, age, vaccination status, vaccination date, and status of obesity, hypertension, or diabetes. The two models differ in how they process the temporal dynamics of the vital signs. Model #1 utilizes a temporally-dilated version of the Long Short-Term Memory (LSTM) model for temporal processing, and Model #2 utilizes a residual temporal convolutional network (TCN) for this purpose. We train and evaluate the models using data collected from 37,006 COVID-19 patients at NYU Langone Health in New York, USA. The convolution-based model outperforms the LSTM-based model, achieving a high AUROC of 0.8844–0.9336 for 3 to 24 h deterioration prediction on a held-out test set. We also conduct occlusion experiments to evaluate the importance of each input feature, which reveal the significance of continuously monitoring the variation of the vital signs. Our results show the prospect of accurate deterioration forecasting using a minimal feature set that can be obtained relatively easily using wearable devices and self-reported patient information.


I. INTRODUCTION
THE significant shock imposed by the novel coronavirus (COVID-19) pandemic fundamentally challenged the delivery and management of health care services globally [1]. According to the World Health Organization, more than 620 million patients have been diagnosed with COVID-19 as of October 2022, and there are around 6.52 million deaths [2]. Since March 2020, 96.2 million patients have been admitted to emergency departments across the United States [3].
Patients with COVID-19 can experience rapid deterioration entailing the need for invasive measures associated with high morbidity or mortality [4]. During the pandemic, patient prognosis was challenging, especially in the early days when the knowledge about the disease was limited, and any modifications in admission protocols could significantly alter the patient outcomes [5]. This highlighted the importance of routine patient monitoring to ensure that patients with the highest risk of deterioration receive early attention [6].
Due to the saturation of healthcare systems and concerns over unnecessary exposure, many outpatients or those in nursing centers were advised to monitor symptoms remotely and report through telemedicine [7]. Hence, patients would avoid visiting emergency care facilities unless symptoms were considered severe enough to require immediate and specialized attention [8]. Although this practice could reduce exposure and unnecessary loads on emergency services [9], it could also result in poor patient prognosis. In fact, for some patients, especially those with comorbidities, the development of symptoms was followed by sudden, drastic, and unexpected deterioration resulting in morbidity, even after discharge from a clinic [10].
Considering the scale of data gathered from a plethora of patients with COVID-19 admitted to emergency departments worldwide, many deep learning (DL) and machine learning (ML) methods were developed for early diagnosis [11], patient severity assessment [12], or prognosis prediction [13], [14]. In Table I, we summarize relevant work based on the choice of datasets and models for the different prediction tasks. Due to the large volume of research, it is not possible to cite all relevant papers; hence, Table I provides a balanced list. Most of the existing work focuses on the diagnosis of COVID-19 rather than patient prognosis, and many studies rely heavily on large input feature sets, specifically high-dimensional imaging, such as chest CT or X-ray scans, and other non-imaging modalities, such as laboratory test results. In addition, many of the existing models do not exploit variations in data over time. Even though the use of such data for computational models has shown great potential, there is a lack of seamlessly scalable models based on minimal feature sets collected over time. We specifically prioritize data that can be collected not only in hospitals, but also in nursing centers or patient homes, for example using wearable devices such as smartwatches [15].
In this study, we propose a deep neural network that models time-series of only three vital signs to predict deterioration amongst patients with COVID-19. The ultimate goal of this work is to provide a light and scalable prediction model to support clinical decision-making for a wide range of patients in the long term, including at-home patients, outpatients, and inpatients. To minimize the size of the input feature space, we focus on three basic vital signs, i.e., oxygen saturation (SpO2), heart rate (HR), and temperature. This design choice is motivated by the wide availability of wearable devices, such as smart watches, that can monitor these vital signs. We specifically exclude other vital signs, such as blood pressure, because they cannot be measured using readily available wearable systems.
To develop and evaluate this model, we use real-world data collected at NYU Langone Health between January 2020 and September 2022. The model predicts deterioration, defined as in-hospital mortality, admission to the intensive care unit (ICU), or intubation, at time horizons of 3 to 24 hours. It uses the vital-sign time-series data collected in the 24 hours preceding the time of prediction (corresponding to the beginning of the prediction horizon). We refer to the vital-sign data as SEQ data in the remainder of this paper. The model is also provided with a small set of features reflecting clinical and comorbidity characteristics (CCC), including sex, age, vaccination status, vaccination date, and status of obesity, hypertension, and diabetes (referred to as non-SEQ data). The model includes a long short-term memory (LSTM) network [50] to process the SEQ data, and a multilayer perceptron (MLP) that combines the last hidden state of the LSTM and the non-SEQ data. The LSTM network utilizes temporal dilation to enable access to longer memory dynamics without exponentially increasing the size and complexity of the computational framework. For each prediction horizon, a separate model is trained and optimized using the commonly used cross-entropy loss through a three-phase training procedure.
The results show that the proposed model achieves an AUROC of 0.808-0.880 across prediction horizons of 3 to 24 hours, in a three-fold cross-validation setting. While the results are not directly comparable to those in existing work due to differences in data pre-processing, the model achieves a comparable performance. For example, the model in [51], which uses chest X-ray (CXR) images and other clinical variables, achieves 0.765 AUROC in predicting deterioration within 24 hours.
In order to assess the significance of the various CCC features, and the importance of the temporal history of vital signs, we also perform a sensitivity analysis through an occlusion experiment. Overall, our work highlights the feasibility of achieving high model performance for deterioration prediction amongst patients with COVID-19 using minimal feature sets, which are easy to obtain not only in the hospital setting, but potentially also in nursing centers and at patient homes.

A. Dataset
In this study, we use the NYU Langone De-identified COVID-19 dataset [52] collected from patients at the NYU Langone facilities between January 2020 and September 2022. We define the following inclusion and exclusion criteria. First, in the case of multiple patient encounters, we use the patient's most recent encounter. Then, we include patients who either tested positive for COVID-19 at the facility, or were already diagnosed with COVID-19 at the time of their admission. Next, we include in-patients with vital-sign measurements. The vital signs of these patients, including SpO2, temperature, and HR, are periodically measured and recorded roughly every 4-5 hours. For each patient, age, sex, vaccination status and time, and the presence of comorbidities including obesity, diabetes, and hypertension are also recorded.
Similar to previous work (see [51] and references therein), we define deterioration as the occurrence of the composite outcome of mortality, ICU admission, or intubation, i.e., any of the three events. In patient encounters with several adverse events, we only consider the occurrence of the earliest deterioration event. It should be noted that if multiple deteriorations of the same type (e.g., ICU admission) are recorded for a patient more than a week apart, we only consider the latest as the reference time of deterioration for the patient. For patients who deteriorated, we extract vital-sign data in the 48 hours preceding the time of deterioration.
We use this data to define "positive" windows for each prediction horizon, where t = 0 represents the end of the window and t = −24 represents the start of the window, such that, for example, for the prediction horizon of 24 hours, deterioration would have occurred at t = 24. For patients who did not experience deterioration and were discharged, we use the 48-hour window preceding the last vital-sign recording, and similarly use it to formulate the "negative" windows. We exclude all samples containing less than 48 hours of vital-sign monitoring, either preceding the deterioration time or discharge time.
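The window arithmetic described above can be illustrated with a minimal sketch. It assumes that times are expressed in hours from an arbitrary origin; the function name `input_window` is hypothetical and not part of the original pipeline. For patients who did not deteriorate, the time of the last vital-sign recording takes the place of the deterioration time.

```python
INPUT_HOURS = 24  # length of the vital-sign input window stated in the text


def input_window(reference_time_h, horizon_h):
    """Return (start, end) of the 24-hour input window, in hours.

    reference_time_h: deterioration time for positive samples, or the time
        of the last vital-sign recording for negative samples.
    horizon_h: prediction horizon; the window ends this many hours before
        the reference time, i.e., at the prediction time t = 0.
    """
    end = reference_time_h - horizon_h
    start = end - INPUT_HOURS
    return start, end


# With a 24-hour horizon the window spans hours 48 to 24 before the event,
# which is consistent with requiring 48 hours of monitoring per sample.
print(input_window(100.0, 24))  # (52.0, 76.0)
print(input_window(100.0, 3))   # (73.0, 97.0)
```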
To pre-process the time-series data, we first normalize the data using Z-score normalization based on the mean and standard deviation of each vital sign. Since the vital signs are measured at irregular intervals, we resample each time-series to obtain regularly sampled input for the LSTM network. This resampling is done by first interpolating the raw non-uniformly sampled data through cubic spline interpolation, and then sampling the interpolated signal every 15 minutes. In Figure 1, we show a schematic summarizing the pre-processing of the raw time-series data, which we refer to as the SEQ data.
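The pre-processing steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the spline uses natural boundary conditions, the grid includes both window endpoints (97 points for 24 hours), and the function names, readings, and normalization statistics are all hypothetical.

```python
from bisect import bisect_right


def zscore(values, mean, std):
    """Z-score normalization with a precomputed mean and standard deviation."""
    return [(v - mean) / std for v in values]


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing. The natural boundary condition (zero
    second derivative at both ends) is an assumption of this sketch.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # second derivatives at the knots
    if n >= 2:
        # Tridiagonal system for the interior second derivatives m[1..n-1].
        diag = [2.0 * (h[k] + h[k + 1]) for k in range(n - 1)]
        rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1]
                      - (ys[k + 1] - ys[k]) / h[k]) for k in range(n - 1)]
        for k in range(1, n - 1):  # forward elimination (Thomas algorithm)
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        m[n - 1] = rhs[n - 2] / diag[n - 2]
        for k in range(n - 3, -1, -1):  # back substitution
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def evaluate(t):
        # Locate the segment containing t; points outside the knots reuse
        # the first or last segment (extrapolation).
        i = min(max(bisect_right(xs, t) - 1, 0), n - 1)
        a, b = xs[i + 1] - t, t - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


def resample(times_h, values, start_h=-24.0, end_h=0.0, step_h=0.25):
    """Interpolate irregular samples and read them off every 15 minutes."""
    spline = natural_cubic_spline(times_h, values)
    count = int(round((end_h - start_h) / step_h)) + 1
    return [spline(start_h + k * step_h) for k in range(count)]


# Hypothetical SpO2 readings taken roughly every 4-5 hours.
times = [-24.0, -19.5, -15.0, -10.0, -5.5, 0.0]
spo2 = [96.0, 95.0, 93.0, 94.0, 92.0, 90.0]
normalized = zscore(spo2, mean=95.0, std=3.0)  # hypothetical statistics
grid = resample(times, normalized)
print(len(grid))  # 97
```

In practice a library routine for cubic spline interpolation would normally replace the hand-written solver; it is written out here only to keep the sketch self-contained.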
As for the non-SEQ data, we encode patient sex, vaccination status, hypertension status, and obesity status as binary (0 or 1). For diabetes status, we use one-hot encoding to represent whether a patient is non-diabetic ([1, 0, 0]), diabetic without complications ([0, 1, 0]), or diabetic with complications ([0, 0, 1]). We group the ages into 18 different sub-groups, and replace each age with the corresponding age sub-group (a value from 1 to 18). For vaccination time, we count the number of months elapsed between the second COVID-19 vaccination shot and the time of prediction (t = 0).
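The encoding above yields a nine-dimensional non-SEQ vector, which can be sketched as follows. Several details are assumptions made only for illustration: the feature order, the 5-year width of the age sub-groups, the value 0 for unvaccinated patients, and all function and key names.

```python
def age_group(age_years, width=5, n_groups=18):
    """Map an age to a sub-group index from 1 to n_groups.

    The 5-year bin width is an assumption; the text only states that
    there are 18 sub-groups.
    """
    return min(int(age_years // width) + 1, n_groups)


# One-hot codes for diabetes status, as given in the text.
DIABETES_ONE_HOT = {
    "none": [1, 0, 0],
    "without_complications": [0, 1, 0],
    "with_complications": [0, 0, 1],
}


def encode_non_seq(sex, age_years, vaccinated, months_since_second_dose,
                   obesity, hypertension, diabetes):
    """Build the 9-dimensional non-SEQ feature vector (order is assumed)."""
    # Assumption: unvaccinated patients get 0 elapsed months.
    months = months_since_second_dose if vaccinated else 0
    return ([int(sex), age_group(age_years), int(vaccinated), months,
             int(obesity), int(hypertension)] + DIABETES_ONE_HOT[diabetes])


features = encode_non_seq(sex=1, age_years=67, vaccinated=True,
                          months_since_second_dose=4, obesity=False,
                          hypertension=True, diabetes="without_complications")
print(features)       # [1, 14, 1, 4, 0, 1, 0, 1, 0]
print(len(features))  # 9
```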

B. DL-based Deterioration Prediction Model
Our proposed deep neural network architecture consists of LSTM layers and fully-connected (FC) layers. The overall architecture of this network is shown in Figure 2 and we refer to it as the Sequential Vital Sign Network (SVS-Net), consisting of two modules. The SEQ data is processed by a module consisting of an LSTM network and a single FC layer, while the non-SEQ data is processed by a second module consisting of an independent FC layer. The final prediction is based on both modalities.
1) SVS-Net architecture details: LSTM networks are well known for their ability to learn from SEQ data and have been widely used in studies where the time-series data is integral to the learning of the system and predicting future events [53], [54]. The temporal module consists of a temporally-dilated LSTM, which takes three vital signs as input at each time step, and has three layers each containing 32 hidden units.
The final hidden state of the LSTM network, with a dimensionality of 32, is processed by an FC layer with an output dimensionality of 16. The non-SEQ data, with a dimensionality of 9, is processed by a single FC layer, which computes an output of dimensionality 16.
Finally, the latent representations of the two modalities are concatenated into a vector of dimensionality 32 and then processed by an FC fusion layer with an output dimensionality of 8. This is followed by a single FC layer with sigmoid activation and an output dimensionality of one, which represents the prediction of whether or not a sample precedes deterioration within the specified time horizon. All of the FC layers use hyperbolic tangent activation except for the last layer, which uses sigmoid activation.
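To make the layer dimensions concrete, the following sketch traces a forward pass through the FC layers that sit on top of the LSTM. It is a toy illustration in plain Python with randomly initialized weights: the LSTM itself is not implemented and its final 32-dimensional hidden state is taken as an input, and the names `make_fc`, `fc`, and `svs_head` are hypothetical. A real implementation would normally use a deep learning framework.

```python
import math
import random


def make_fc(n_in, n_out, rng):
    """Randomly initialized fully connected layer as (weights, biases)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fc(x, layer, activation):
    """Apply an FC layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
seq_fc = make_fc(32, 16, rng)      # final LSTM hidden state (32) -> 16
non_seq_fc = make_fc(9, 16, rng)   # non-SEQ features (9) -> 16
fusion_fc = make_fc(32, 8, rng)    # concatenation (16 + 16) -> 8
output_fc = make_fc(8, 1, rng)     # 8 -> single sigmoid output


def svs_head(lstm_state, non_seq):
    """Fuse the two modalities and return a value in (0, 1)."""
    seq_repr = fc(lstm_state, seq_fc, math.tanh)
    non_seq_repr = fc(non_seq, non_seq_fc, math.tanh)
    fused = fc(seq_repr + non_seq_repr, fusion_fc, math.tanh)
    return fc(fused, output_fc, sigmoid)[0]


def n_params(layer):
    weights, biases = layer
    return sum(len(row) for row in weights) + len(biases)


# 528 + 160 + 264 + 9 parameters in the four FC layers.
print(sum(n_params(l) for l in (seq_fc, non_seq_fc, fusion_fc, output_fc)))  # 961
```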
2) Three-Phase Training Strategy: In order to optimize the performance of the proposed network, the training strategy consists of three phases as described below.

• Phase 1:
In the first phase, the SEQ module is connected to another FC layer that computes the prediction. After this phase of training, the weights are used to pre-initialize the SEQ module in the next phase, and we remove the second FC layer used to compute predictions during pre-training in the first phase.

• Phase 2: Training of fusion layer
In the second phase, we compute the representations of the SEQ data after initializing the associated module with the weights obtained in the first phase, and then freeze the SEQ module. We then train the FC layer of the non-SEQ module, the FC fusion layer, and the FC output prediction layer using the non-SEQ data.

• Phase 3: End-to-end fine-tuning of SVS-Net
In the last phase, we initialize the parameters of the entire network using the weights obtained in the first two phases. The network is trained end-to-end, with the aim of improving the overall network performance.
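The three phases can be summarized as a schedule of which modules are updated in each phase. The sketch below is only a bookkeeping illustration: the module names are hypothetical labels for the components described in the text, and the learning rates are those reported in the training setup of this paper.

```python
# Modules of the final network; "aux_head" is the temporary FC layer that is
# attached to the SEQ module in phase 1 and removed afterwards.
SVS_NET_MODULES = {"lstm", "seq_fc", "non_seq_fc", "fusion_fc", "output_fc"}

PHASES = [
    {"name": "phase 1: pre-train the SEQ module",
     "trainable": {"lstm", "seq_fc", "aux_head"}, "lr": 1e-4},
    {"name": "phase 2: train the fusion layers with the SEQ module frozen",
     "trainable": {"non_seq_fc", "fusion_fc", "output_fc"}, "lr": 1e-4},
    {"name": "phase 3: end-to-end fine-tuning",
     "trainable": set(SVS_NET_MODULES), "lr": 1e-5},
]


def not_updated(phase):
    """Modules of the final network whose weights are not updated."""
    return SVS_NET_MODULES - phase["trainable"]


for phase in PHASES:
    print(phase["name"], "| lr:", phase["lr"],
          "| not updated:", sorted(not_updated(phase)))
```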

3) Model Training and Evaluation:
To train and evaluate the model, we use a three-fold cross-validation process. We randomly divide the entire dataset into three folds so that each fold has the same distribution of positive and negative samples. The final performance results reported are obtained by averaging the validation performance over three folds.
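A stratified split of this kind can be sketched as follows. The function name `stratified_folds`, the round-robin assignment, and the toy labels are assumptions made for illustration; the study does not specify how the folds were generated beyond preserving the class distribution.

```python
import random


def stratified_folds(labels, n_folds=3, seed=0):
    """Split sample indices into folds that preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


# Toy labels: 6 positive and 30 negative samples.
labels = [1] * 6 + [0] * 30
folds = stratified_folds(labels)
for k, val_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    n_pos = sum(labels[i] for i in val_idx)
    # Each validation fold holds 12 samples, 2 of them positive.
    print(k, len(train_idx), len(val_idx), n_pos)
```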
We train the model within each fold for 200 epochs using the three-phase training strategy. For the first and second phases, we choose a learning rate of 0.0001 based on initial experimentation, and a learning rate of 0.00001 for the final fine-tuning stage. For all phases, we use the Adam optimizer [55] with β1 = 0.9, β2 = 0.999, and ε = 10^−8. To avoid over-fitting, we include a patience period of 100 epochs to stop training if the validation loss does not improve. The best model is chosen based on the minimum validation loss. We choose the binary focal cross-entropy loss [56] in order to manage class imbalance in the dataset. We evaluate the performance of the model using three widely used metrics for binary classification: accuracy (with a threshold of 0.5 to convert the model predictions into binary labels), AUROC, and AUPRC.
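For reference, the binary focal cross-entropy loss for a single sample can be sketched as below. The values gamma = 2 and alpha = 0.25 are commonly used defaults and are assumptions here, since the focusing and weighting parameters used in this study are not stated.

```python
import math


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one sample: -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are assumed defaults, not values from the paper.
    """
    p = min(max(p_pred, eps), 1.0 - eps)  # clip for numerical stability
    p_t = p if y_true == 1 else 1.0 - p
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than a confidently
# wrong one, which shifts the training signal towards hard samples.
easy = binary_focal_loss(1, 0.9)
hard = binary_focal_loss(1, 0.1)
print(easy < hard)  # True
```

With gamma = 0 and alpha = 0.5, the expression reduces to one half of the standard binary cross-entropy.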

A. Patient Cohort
In Figure 3(A), we show the application of the inclusion and exclusion criteria. This resulted in 37,006 patient samples, including 6,104 positive samples and 30,902 negative samples. Table II summarizes the characteristics of the patients. In Figure 3(B), we show the differences in vital signs between the two cohorts at t = 0, while in Figure 3(C), we show the differences in vital signs between the two cohorts at t = −24. Using t-test statistical analysis, it can be seen that the difference between the two cohorts is statistically significant even as early as 24 hours before deterioration. This motivates the use of the proposed DL model to decode the hidden patterns differentiating the two cohorts. In Figure 4, we show the distribution of the admitted patients over time.

B. Model Performance
The final model performance after the three-phase training strategy is summarized in Table III and also shown in Figure 5. The performance at the 24-hour time horizon reaches 86.4% accuracy, 0.808 AUROC, and 0.559 AUPRC. Although the datasets used in these studies are not the same, the results are comparable to those reported previously in [51] for a subset of this dataset. We also observe that the prediction accuracy improves as the prediction horizon shortens.
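As a reminder of how the three reported metrics are computed from labels and predicted scores, a minimal sketch follows. The toy labels and scores are hypothetical, the pairwise AUROC is written for clarity rather than speed, and AUPRC is approximated here by average precision, which is one common estimator and an assumption of this sketch.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at the given threshold."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.6, 0.8, 0.9]
print(round(accuracy(labels, scores), 3))           # 0.667
print(round(auroc(labels, scores), 3))              # 0.778
print(round(average_precision(labels, scores), 3))  # 0.867
```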
As shown in Figure 5, AUROC and AUPRC consistently improve at all prediction horizons after each phase of training, except for the time horizon of three hours, where the AUROC and AUPRC are comparable across phases two and three. This implies that the three-phase training strategy is better suited for longer prediction horizons.
Comparing the performance across the three phases of training, it can be observed that the adopted three-phase training strategy boosts the performance of the model by forcing the network to extract information initially from the SEQ data and finally from the combination of SEQ and non-SEQ data. The improvement can be seen in accuracy, AUROC, and AUPRC.

C. Ablation Studies
In order to understand the impact of our design choices within the model architecture, we compare our model to two other networks. The first model, referred to as the Memory-Less Vital Sign Network (MLVS-Net), processes the non-SEQ data and only the last set of vital signs collected from the patient, ignoring any sequential information. Hence, instead of using a dilated LSTM, we process the vital-sign data (3 features) with an MLP consisting of two FC layers with an output dimensionality of 16 each. The computed representation of the MLP is then concatenated with the representation of the non-SEQ data. We train the model in a similar fashion using the three-phase training strategy, and we freeze the weights of the MLP network in phase two. The second model, referred to as the non-Sequential Health Status Network (nSHS-Net), only considers the non-SEQ data. Hence, the output of the FC layer is processed by a second FC layer to generate the prediction.
We compare the three models in Figure 6. First, we observe that nSHS-Net performs the worst, which implies that the incorporation of vital signs is crucial for the model prediction. When comparing MLVS-Net and SVS-Net, we observe better performance with the latter across all prediction horizons and evaluation metrics. This implies that the incorporation of sequential information can significantly improve the capability of the model in predicting deterioration, relative to using a single measurement of vital signs. The numerical results of Figure 6 are summarized in Appendix I, Table IV.

D. Occlusion Analysis
In order to assess the influence of each input variable, we conduct an occlusion analysis on the final optimized model. In particular, we occlude one feature at a time (by setting the corresponding values to zero) and evaluate the performance on the validation set. The greater the reduction in the performance metrics upon feature occlusion, the more important the feature is for the prediction.
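The occlusion procedure can be sketched generically as follows. The toy model, data, and function names are hypothetical and serve only to show the mechanics; for the SEQ inputs, occluding a vital sign would mean zeroing its entire time-series channel rather than a single value.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at the given threshold."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)


def occlusion_importance(predict, samples, labels, metric, n_features):
    """Drop in the metric when each feature is set to zero, one at a time."""
    baseline = metric(labels, [predict(x) for x in samples])
    drops = {}
    for j in range(n_features):
        occluded = [x[:j] + [0.0] + x[j + 1:] for x in samples]
        drops[j] = baseline - metric(labels, [predict(x) for x in occluded])
    return drops


def toy_model(x):
    """Hypothetical model that relies on feature 0 only."""
    return 1.0 if x[0] > 0 else 0.0


samples = [[1.0, 0.3], [-1.0, 0.3], [1.0, -0.2], [-1.0, 0.5]]
labels = [1, 0, 1, 0]
drops = occlusion_importance(toy_model, samples, labels, accuracy, n_features=2)
print(drops)  # {0: 0.5, 1: 0.0}
```

In this toy case, occluding feature 0 halves the accuracy while occluding feature 1 has no effect, so feature 0 would be ranked as the more important input.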
The results of this analysis are shown in Figure 7. As shown in Figure 7(A), amongst the non-SEQ features, we observe that age plays the most significant role in the model's performance. The other CCC features have less of an impact once occluded and do not have consistent trends across the different performance metrics and prediction horizons. It should be noted that due to the correlation between the various features, some features may be relevant to the prediction task, yet are not considered to be important by this occlusion analysis. For example, if one CCC feature affects the variations in the vital signs over time, then the corresponding effect would be captured by the model through the vital signs. In this case, the occlusion analysis may show that this CCC feature is not important. In other words, when a feature shows low sensitivity in the occlusion study, it means that there is no significant need to provide that feature to the model as an "independent" input. Among the SEQ features, we observe heart rate as the most important feature, followed by SpO2, and then temperature.
It should be highlighted that Fig. 7 shows that the vital signs are more important than the clinical and comorbidity features, and that the SEQ vital signs provide more information than the most recent vital signs alone. This observation matches the aforementioned analysis. Numerical results of Fig. 7 are given in Appendix I, Table V.

IV. CONCLUSION
This study is motivated by the high availability of personal medical devices, such as wearable systems (e.g., smart watches), that can record time-series medical data, and the prospects of using such devices in the context of telehealth for the prediction of deterioration. In summary, we propose, develop, and evaluate a deterioration prediction model using a large dataset (n=37,006) collected at NYU Langone Health during the period of January 2020 to September 2022. The model achieves an AUROC of 0.808-0.880 in 3 to 24 hours prediction horizons.
Our study has several strengths. First, the model uses a minimal input feature set consisting of time-series of three routine vital signs, i.e., SpO2, heart rate, and temperature. The model is also provided with basic patient information, including sex, age, vaccination status, vaccination date, and presence of obesity, hypertension, and diabetes, which can be easily collected. Compared to previous work [51] that achieved 0.765 AUROC, our model achieves a comparable performance. However, that model was trained and evaluated using different data modalities. In order to assess the significance of the various clinical and comorbidity features and the importance of using time-series vital-sign data, we performed a sensitivity analysis through an occlusion experiment as well as an ablation study. The results showed the importance of modeling the temporal variations of vital signs and the possibility of achieving high prognosis accuracy without the need for sophisticated medical imaging. Finally, the proposed framework is scalable, as it can be extended to other prediction horizon ranges and to windows of different lengths.
The proposed work also has limitations. First, our input windows are limited to a size of 24 hours. In future work, we are interested in varying the length of the input feature windows and investigating the impact on performance, with the goal of reducing computational complexity if similar accuracy can be obtained with shorter windows. Second, we do not perform any external or prospective validation of the model due to lack of access to similar datasets. Finally, we believe that the final results can be improved via hyperparameter tuning, including the learning rate, and this is an area of future work.
To conclude, this study highlights the feasibility of an accessible and scalable model to assist the medical workforce in decision-making. The versatility of the proposed model is of importance, as the data types used for training and evaluating the model can be easily acquired from patients using wearable sensors and a few clinical data features that can be self-reported.

V. ACKNOWLEDGEMENT
• Funding: This material is based upon work supported by the National Science Foundation (Award # 2031594).
• Conflict of Interest: S. Farokh Atashzar and Yao Wang are inventors of "Smart Wearable IOT Device for Health Tracking, Contact Tracing and Prediction of Health Deterioration" which is licensed by Tactile Robotics, Ltd., Canada.

Fig. 1: Data pre-processing pipeline. We encode the non-SEQ data and pre-process the SEQ data: (i) normalize via Z-score normalization, (ii) model the time-series using cubic spline interpolation, and (iii) resample every 15 minutes.

Fig. 3: Application of data inclusion and exclusion criteria and distribution of vital signs. (A) In this flowchart, we illustrate the application of the inclusion and exclusion criteria, where n represents the number of patients after each step. (B) The boxplot of the vital signs recorded from the patients at the end of the 24-hour input window (t=0), which corresponds to the prediction time. (C) The boxplot of the vital signs recorded from the patients at the beginning of the 24-hour input window (t=-24). We observe differences between the two groups (evaluated using the T-test), which motivates the design of the proposed temporal model.

Fig. 4: Distribution of samples over time. We show the number of patients who deteriorated vs those who did not deteriorate in our final filtered dataset (n=37,006).

Fig. 5: Results after each training phase. Performance results after each phase in the training strategy across all prediction horizons.

Fig. 6: Ablation study results. Performance results for each of SVS-Net (non-SEQ data and SEQ vital sign data), MLVS-Net (non-SEQ data and single set of vital-signs), and nSHS-Net (non-SEQ only).

Fig. 7: Results of occlusion analysis to understand the importance of the input features. (A): Occlusion analysis on the clinical and comorbidity characteristics data. Occlusion of age decreases the model performance (across all three evaluation metrics) more significantly than occlusion of the other features. (B): Occlusion analysis on SEQ vital sign data. It can be observed that the HR contributes more significantly to the model performance than the SpO2 and temperature.

TABLE I: Summary of related work. Overview of related work on the diagnosis of patients with COVID-19, patient severity assessment, and patient prognosis.

TABLE II: Overview of patient cohort. We summarize in this table the patient characteristics, including demographics and distribution of vital signs, for patients who deteriorated and patients who did not deteriorate.

TABLE III: Model performance. We summarize the performance of the proposed network after the three-phase training stage across all prediction horizons.