An artificial intelligence system for predicting the deterioration of COVID-19 patients in the emergency department

During the coronavirus disease 2019 (COVID-19) pandemic, rapid and accurate triage of patients in the emergency department is critical to inform decision-making. We propose a data-driven approach for automatic prediction of deterioration risk using a deep neural network that learns from chest X-ray images and a gradient boosting model that learns from routine clinical variables. Our AI prognosis system, trained using data from 3,661 patients, achieves an area under the receiver operating characteristic curve (AUC) of 0.786 (95% CI: 0.745–0.830) when predicting deterioration within 96 hours. The deep neural network extracts informative areas of chest X-ray images to assist clinicians in interpreting the predictions, and performs comparably to two radiologists in a reader study. To verify performance in a real clinical setting, we silently deployed a preliminary version of the deep neural network at New York University Langone Health during the first wave of the pandemic, where it produced accurate predictions in real time. In summary, our findings demonstrate the potential of the proposed system for assisting front-line physicians in the triage of COVID-19 patients.


Supplementary Table: Positive predictive values (PPV) and negative predictive values (NPV) of the outcome classification task on the held-out test set (n represents the number of images). We include 95% confidence intervals estimated by 1,000 iterations of the bootstrap method [30]. COVID-GBM achieves the best performance across all time windows in terms of both PPV and NPV.

The average importance of the top ten features computed by the COVID-GBM models is shown in Supplementary Figure 3.a. The importance of a feature is measured by the number of times the feature is used to split the data across all trees in a single COVID-GBM model. Age is among the top ten features across all time windows, which is consistent with existing findings that mortality is more common among elderly COVID-19 patients than among younger patients [64]. The inclusion of the vital-sign variables among the top ten features across all models also aligns with existing research suggesting that they are strong indicators of deterioration [20].

Supplementary Figure: a, … (in pixels) of images prior to data augmentation. b, An example raw image. c, To ensure that the inputs to the model have a consistent size, we perform center cropping and rescaling. In addition, we apply random horizontal flipping, rotation, and translation to augment the training dataset.
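The 95% confidence intervals above are obtained with the percentile bootstrap. A minimal sketch of bootstrapped CIs for PPV and NPV follows; the function name and resampling details are illustrative, not the paper's exact implementation:

```python
import numpy as np

def bootstrap_ppv_npv(y_true, y_pred, n_boot=1000, seed=0):
    """Estimate 95% percentile-bootstrap CIs for PPV and NPV.

    y_true, y_pred: binary arrays (1 = adverse event / positive prediction).
    Illustrative sketch; the paper's exact resampling details may differ.
    """
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    ppvs, npvs = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample patients with replacement
        t, p = y_true[idx], y_pred[idx]
        if p.sum() > 0:
            ppvs.append((t[p == 1] == 1).mean())  # TP / (TP + FP)
        if (p == 0).sum() > 0:
            npvs.append((t[p == 0] == 0).mean())  # TN / (TN + FN)
    return np.percentile(ppvs, [2.5, 97.5]), np.percentile(npvs, [2.5, 97.5])
```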

Supplementary Note 3: Ablation studies
DenseNet-121-based models. DenseNet [48] is a deep neural network architecture consisting of dense blocks, in which each layer is directly connected to every other layer in a feed-forward fashion. It achieves strong performance on benchmark natural image datasets, such as CIFAR-10/100 [65] and the ILSVRC 2012 (ImageNet) dataset [66], while being computationally efficient. Here we compare COVID-GMIC to a specific variant of DenseNet, DenseNet-121, which has been applied to process chest X-ray images in the literature [49,50,51,52].
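The dense connectivity pattern can be illustrated with a toy block in which every "layer" consumes the concatenation of all preceding feature maps along the channel axis; the random 1×1 projections below stand in for the real convolutional layers and are purely illustrative:

```python
import numpy as np

def dense_block(x, num_layers=4, growth_rate=12, seed=0):
    """Toy dense block: each layer receives the concatenation of all
    preceding feature maps (hypothetical 1x1-conv layers simulated with
    random projection matrices followed by ReLU).

    x: array of shape (channels, height, width).
    """
    rng = np.random.default_rng(seed)
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)                  # all previous outputs
        w = rng.standard_normal((growth_rate, inp.shape[0]))
        out = np.maximum(np.tensordot(w, inp, axes=(1, 0)), 0)  # 1x1 conv + ReLU
        features.append(out)                                    # adds `growth_rate` channels
    return np.concatenate(features, axis=0)

# The channel count grows linearly: c0 + num_layers * growth_rate.
y = dense_block(np.ones((3, 8, 8)))
# y.shape == (3 + 4 * 12, 8, 8)
```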
The model assumes an input size of 224×224. We applied DenseNet-121-based models to predict deterioration and to compute deterioration risk curves (DRCs). We initialized the models with weights pretrained on the ChestX-ray14 dataset [59], provided at https://github.com/arnoweng/CheXNet. We used weight decay in the optimizer. To perform the hyperparameter search, we sampled the learning rate uniformly on a logarithmic scale in [10^-6, 10^-1], and the rate of weight decay per step uniformly on a logarithmic scale in [10^-6, 10^-3].

Supplementary Figure: This figure can be compared to Figure 4 in the main manuscript, which shows analogous graphs for COVID-GMIC-DRC. a, DRCs generated by the DenseNet-121 model for patients in the test set with adverse events (faded red lines) and without adverse events (faded blue lines). The mean DRC for patients with adverse events (red dashed line) is higher than the mean DRC for patients without adverse events (blue dashed line) at all times. The graph also includes the ground-truth population DRC (black dashed line) computed from the test data. b, Reliability plot of the DRCs generated by the DenseNet-121 model for patients in the test set. The empirical probabilities are computed by dividing the patients into deciles according to the value of the DRC at each time t. The empirical probability equals the fraction of patients in each decile that suffered adverse events up to t. This is plotted against the predicted probability, which equals the mean DRC of the patients in the decile. The diagram shows that these values are similar across the different values of t, and hence the model is well calibrated (for comparison, perfect calibration would correspond to the diagonal black dashed line).
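The decile-based calibration check used for the reliability plot can be sketched as follows, for a fixed time t (a simplified version; the function and variable names are illustrative):

```python
import numpy as np

def reliability_points(predicted_risk, event_occurred, n_bins=10):
    """Reliability-plot points at a fixed time t: split patients into
    deciles of predicted risk, then compare the mean predicted probability
    in each decile with the empirical event rate. Sketch of the procedure
    described above; perfect calibration gives equal values per decile.
    """
    predicted_risk = np.asarray(predicted_risk, dtype=float)
    event_occurred = np.asarray(event_occurred, dtype=float)
    order = np.argsort(predicted_risk)
    bins = np.array_split(order, n_bins)        # deciles by predicted risk
    pred_mean = np.array([predicted_risk[b].mean() for b in bins])
    emp_rate = np.array([event_occurred[b].mean() for b in bins])
    return pred_mean, emp_rate
```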
Impact of training set size. We evaluated the impact of the sample size used for training our machine learning models. Specifically, we trained our models on subsets of the training data, obtained by randomly sampling 12.5%, 25%, and 50% of the exams. Supplementary Table 3 presents the AUCs, PR AUCs, and concordance indices achieved on the test set. It is evident that the performance of COVID-GMIC and COVID-GBM improves as the number of images and clinical variables used for training increases, which highlights the importance of using a large dataset.
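Drawing the training subsets can be sketched as follows (an illustrative helper, not the study's exact data pipeline):

```python
import numpy as np

def subsample_exams(exam_ids, fraction, seed=0):
    """Randomly keep a fraction of training exams (e.g. 0.125, 0.25, 0.5)
    to study the effect of training-set size. Illustrative sketch:
    sampling without replacement at the exam level.
    """
    rng = np.random.default_rng(seed)
    exam_ids = np.asarray(exam_ids)
    k = max(1, int(round(fraction * len(exam_ids))))
    return rng.choice(exam_ids, size=k, replace=False)
```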
Supplementary Table 3: Model performance with 95% confidence intervals when using 12.5%, 25%, 50%, and 100% of the training data. We report AUCs for each time window in the adverse event prediction task. When evaluating the deterioration risk curves, we report the concordance index with a reference time of 96 hours, as well as the average of the index over all discretized times (3, 12, 24, 48, 72, 96, 144, and 192 hours).

Impact of input image resolution. Prior work on deep learning for medical images [67] reports that using high-resolution input images can improve performance. In this section, we analyze the impact of image resolution on our tasks of interest. We consider the following image sizes: 128×128, 256×256, 512×512, and 1024×1024. We pretrain all models on the ChestX-ray14 dataset [59] and then fine-tune them on our dataset. Results on the test set are reported in Supplementary Table 4.
The DenseNet-121-based model achieves the best AUCs with an image size of 256×256, and the best concordance index with 512×512. Further increasing the resolution does not improve performance. COVID-GMIC achieves the best AUCs at the highest input image resolution of 1024×1024, while achieving the best concordance index at 512×512. While a further increase in performance may be possible, we did not consider larger image resolutions because the computational cost would become prohibitively high.
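The concordance index reported above can be sketched, for a single reference time, as the fraction of correctly ordered patient pairs (a simplified version that ignores censoring; names are illustrative):

```python
import numpy as np

def concordance_at_time(risk, event_time, had_event, ref_time):
    """Concordance index at a reference time: over all comparable pairs
    (one patient deteriorates by `ref_time`, the other does not), the
    fraction where the deteriorating patient received the higher predicted
    risk. Ties in risk count as half. Simplified sketch of the metric.
    """
    risk = np.asarray(risk, dtype=float)
    pos = np.asarray(had_event) & (np.asarray(event_time) <= ref_time)
    neg = ~pos                     # no event by ref_time (censoring ignored here)
    concordant, total = 0.0, 0
    for i in np.flatnonzero(pos):
        for j in np.flatnonzero(neg):
            total += 1
            if risk[i] > risk[j]:
                concordant += 1.0
            elif risk[i] == risk[j]:
                concordant += 0.5
    return concordant / total if total else float("nan")
```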
Supplementary Table 4: Model performance with 95% confidence intervals when using input images of sizes 128×128, 256×256, 512×512, and 1024×1024. For COVID-GMIC, we started with a size of 256×256, since an image with a resolution of 128×128 pixels results in saliency maps that are too small to generate meaningful ROI patches. We report AUCs for predicting the risk of deterioration within 24, 48, 72, and 96 hours. When evaluating the deterioration risk curves, we report the concordance index with a reference time of 96 hours, as well as the average of the index over all possible reference times (3, 12, 24, 48, 72, 96, 144, and 192 hours).

Impact of different transfer learning strategies. In data-scarce applications, it is crucial to pretrain deep neural networks on a related task for which a large dataset is available, prior to fine-tuning on the task of interest [68,69]. Given the relatively small number of COVID-19-positive cases in our dataset, we investigate the impact of different weight initialization strategies on our results. Specifically, we compare three strategies: 1) initialization by He et al. [70], 2) initialization with weights from models trained on natural images (the ImageNet dataset [66]), and 3) initialization with weights from models trained on chest X-ray images (the ChestX-ray14 dataset [59]). We apply the initialization procedure to all layers except the last fully connected layer, which is always initialized randomly. We then fine-tune the entire network on our COVID-19 task.
Based on the results shown in Supplementary Table 5, fine-tuning the network from weights pretrained on the ChestX-ray14 dataset is the most effective strategy for COVID-GMIC. This dataset contains over 100,000 chest X-ray images from more than 30,000 patients, including many with advanced lung disease. The images are paired with labels representing fourteen common thoracic observations: atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, and hernia. By pretraining a model to detect these conditions, we hypothesize that the model learns a representation that is useful for our downstream task of COVID-19 prognosis.
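Strategy (1), He initialization, draws each weight from a zero-mean Gaussian with variance 2/fan_in, which preserves activation variance in ReLU networks; strategies (2) and (3) instead copy parameters from pretrained models. A minimal sketch of strategy (1) for a single fully connected layer (shapes and names are illustrative):

```python
import numpy as np

def he_init(fan_in, fan_out, seed=0):
    """He (Kaiming) normal initialization: weights drawn from
    N(0, 2 / fan_in), suited to layers followed by ReLU.
    Illustrative sketch for one fully connected layer.
    """
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_out, fan_in)) * std

w = he_init(1024, 256)
# empirical std of w should be close to sqrt(2/1024) ≈ 0.0442
```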

Algorithm 2 ROI retrieval
Input: chest X-ray image x ∈ R^{H×W}, saliency maps A ∈ R^{h×w×|T_a|}, number of ROI patches K
Output: a set of retrieved ROI patches O = {x̃^k | x̃^k ∈ R^{h_c×w_c}}
1: O = ∅
2: for each time window t ∈ T_a do
3:     Ã_t = min-max-normalization(A_t)
4: end for
5: A* = Σ_{t∈T_a} Ã_t
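The steps of Algorithm 2 can be sketched in numpy: each per-window saliency map is min-max normalized, the maps are summed into an aggregate map A*, and K patch locations are retrieved greedily by repeatedly selecting the patch with the largest total saliency and masking it out. This is an illustrative GMIC-style sketch; the exact greedy criterion of the truncated pseudocode may differ.

```python
import numpy as np

def retrieve_rois(saliency_maps, K=4, patch=(2, 2)):
    """saliency_maps: array (T, h, w), one map per time window.
    Returns the top-left coordinates of K greedily selected patches.
    """
    A = np.asarray(saliency_maps, dtype=float)
    mins = A.min(axis=(1, 2), keepdims=True)
    maxs = A.max(axis=(1, 2), keepdims=True)
    A_norm = (A - mins) / np.maximum(maxs - mins, 1e-8)  # per-window min-max
    A_star = A_norm.sum(axis=0)                          # aggregate map A*
    ph, pw = patch
    rois, work = [], A_star.copy()
    for _ in range(K):
        best, best_pos = -np.inf, (0, 0)
        # score every patch position by its total saliency
        for i in range(work.shape[0] - ph + 1):
            for j in range(work.shape[1] - pw + 1):
                s = work[i:i + ph, j:j + pw].sum()
                if s > best:
                    best, best_pos = s, (i, j)
        rois.append(best_pos)
        i, j = best_pos
        work[i:i + ph, j:j + pw] = 0                     # mask selected region
    return rois
```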