CT-based Rapid Triage of COVID-19 Patients: Risk Prediction and Progression Estimation of ICU Admission, Mechanical Ventilation, and Death of Hospitalized Patients

The waves of COVID-19 continue to overwhelm medical resources, straining intensive care unit (ICU) capacity and the supply of mechanical ventilation (MV). Here we performed CT-based analysis combined with electronic health records and clinical laboratory results on Cohort 1 (n = 1662, from 17 hospitals) with prognostic estimation for the rapid stratification of PCR-confirmed COVID-19 patients. These models, validated on Cohort 2 (n = 700) and Cohort 3 (n = 662) constructed from 9 external hospitals, achieved satisfactory performance for predicting ICU admission, MV, and death of COVID-19 patients (AUROC 0.916, 0.919, and 0.853), even for events occurring two or more days after admission (AUROC 0.919, 0.943, and 0.856). Clinical and imaging features played complementary roles in event prediction and provided accurate estimates of the time to progression (p < .001). Our findings are valuable for delivering timely treatment and optimizing the use of medical resources during the COVID-19 pandemic.


Comparison of radiomics with radiologists' scoring
The performance of Radiom models was overall superior to that of radiologist score (R-score) models on both validation cohorts for the three tasks (ICU/MV/death: Cohort 2 AUROC 0.776/0.804/0.678, AUPRC 0.332/0.222/0.120; Cohort 3 AUROC 0.772/0.736/0.653, AUPRC 0.137/0.115/0.092) (Table 2). Specifically, Radiom models had significantly improved predictive value for ICU admission (p < .001) and were comparable to R-score models while achieving a higher AUPRC for MV (p = .003) and death (p = .021) on Cohort 2. The predictive value of Radiom for ICU admission and MV occurring two or more days later was higher than that of R-score, while there was no significant difference between the two models for the prediction of death on Cohort 3 (Table S4-5, Figure S4).

Key imaging features and clinical prognostic indicators
Among the top-ranking prognostic indicators, clinical data and radiomics features played complementary roles with no significant correlations between them (Figure 3, Figure S5-6). In the clinical data, older age, dyspnea, higher lactate dehydrogenase (LDH), and elevated inflammatory markers (white blood cell (WBC) count, neutrophil count) signaled severe outcomes. In particular, hypertension and several inflammatory markers (lower lymphocyte count, higher C-reactive protein (CRP) and neutrophil count) were valuable for predicting ICU admission; higher potassium, higher α-hydroxybutyrate dehydrogenase (HBDH), and several inflammatory markers (lower lymphocyte count, higher CRP) were predictive of MV; and higher D-dimer provided great diagnostic value for death. Most clinical variables were independently correlated with disease progression (Supplementary Appendix 5). Furthermore, GLSZM-based, GLCM-based, and first-order radiomics features were important for outcome prediction. In addition, our R-score model suggested that diffuse pulmonary parenchymal ground-glass and consolidative pulmonary opacities in the left upper lobe, as well as pleural effusion, increased the risk of adverse outcomes (ICU admission, MV, death) in COVID-19 patients. Notably, crazy paving on the initial chest CT was a risk factor for death (Table S6, Figure S7).

Individual severe-event-free survival analysis and performance of time-to-event models
Next, we used time-to-event modeling to stratify survival outcomes of patients. We first separated the patients into high-risk and low-risk groups and evaluated the survival curves of the two groups. Kaplan-Meier curves using the score predicted by the optimal RadioClinLab model were generated (Figure 4).
The high-risk group (ICU: 40 observations with 18 events; MV: 23 observations with 8 events; death: 13 observations with 3 events) had a much lower survival probability than the low-risk group (ICU: 642 observations with 32 events; MV: 659 observations with 28 events; death: 669 observations with 19 events) in all three tasks, with a statistically significant difference (p < 0.001, log-rank test).
According to the time-to-event prediction results on Cohort 2 (Table S9), the RadioClinLab model showed the highest concordance index values on all three prediction tasks (0.917, 0.888, and 0.906). Additionally, RadioClinLab outperformed the other models on ICU and MV prediction (Brier score 0.061 and 0.053), while the ClinLab model performed best on death prediction (Brier score 0.028). On Cohort 3, RadioClinLab showed the highest concordance index values on all three tasks (0.921, 0.884, and 0.911) and the lowest integrated Brier scores on ICU and MV prediction (0.039 and 0.036), while the ClinLab model showed the lowest integrated Brier score on death prediction (0.027). The bootstrapping experiments (Table S10) showed that on Cohort 2, RadioClinLab had the highest concordance index on all three tasks (p < 0.001, paired one-sided t-test) and the lowest integrated Brier score on ICU and MV prediction (p < 0.03), while there was no statistically significant difference in integrated Brier score between RadioClinLab and ClinLab on death prediction. Overall, the Radiom, RadioClinLab, and ClinLab models achieved satisfactory performance in time-to-event prediction. In particular, the combination of radiomics features and clinical data contributed most to the prediction and provided the most accurate estimates of the time, in days, until critical care would be required.
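The concordance index reported above measures how often a model ranks pairs of patients correctly by risk. As an illustration only (not the scikit-survival implementation used in the study), Harrell's C-index for right-censored data can be computed directly:

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C-index: the fraction of comparable pairs whose predicted
    risk ordering agrees with the observed event times. A pair (i, j) is
    comparable when patient i has an observed event strictly before j's
    time; tied risk scores count as 0.5."""
    time = np.asarray(time, float)
    event = np.asarray(event, bool)
    risk = np.asarray(risk, float)
    num, den = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # i must experience the event strictly before j's time
            if event[i] and time[i] < time[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den
```

A model whose risk scores perfectly reverse-order the event times scores 1.0; random scores hover around 0.5, matching the interpretation of the values reported in Table S9.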

Discussion
Our study achieved three goals. First, we provided risk stratification based on CT-based radiomics features and clinical data for COVID-19-infected patients in terms of stable or severe disease (requiring ICU) on admission. Second, our models provided specific outcome prediction (MV and death) for critically ill patients. Finally, we offered insights into estimating time to progression of severe events (i.e. ICU, MV, and death). This analysis potentially enables rapid stratification and timely intensive care management of patients during this pandemic.
We carefully defined outcome events (i.e. ICU admission, MV, or death) as prediction labels rather than a general risk severity, so that different medical centers can optimize resource allocation using the predicted outcomes. According to our prognosis estimation results, it is possible to request medical resource transfers, such as personnel, local ICU beds, or MV, from the Emergency Medical Services command, as well as to redistribute stable patients from overloaded local ICUs to neighboring affected regions with lower COVID-19 prevalence to balance ICU loads. Additionally, the prediction of MV on admission allows for closer monitoring and repeated assessment of patients over time to determine priority for initiating MV, because there is typically only a limited time window for life-saving intervention once breathing deteriorates.20 Furthermore, combining predictions of demand for medical resources with the outcome estimation of death anticipates the need to allocate resources to the patients who are most likely to benefit, which may also help develop priority rationing strategies during pandemics.21
Our findings demonstrated the predictive value of CT-based imaging for outcome prediction in COVID-19 patients. The performance of the radiomics-based models (Radiom) was better than that of the radiologists' scores (defined as R-score). Concretely, we found that first-order, texture, and higher-order radiomics features (i.e. GLSZM- and GLCM-based) constituted the most important predictors. Our results also indicated that diffuse pulmonary parenchymal ground-glass and consolidative pulmonary opacities in the left upper lobe, as well as pleural effusion, increased the risk of adverse outcomes (ICU admission, MV, death) in COVID-19 patients, consistent with prior findings.22,23,24 Additionally, crazy paving was a predictor of death.25
Among the identified clinical predictors in our study, older age, dyspnea, and higher lactate dehydrogenase (LDH) were significant in all three prediction tasks.26,27,28,29 Furthermore, changes in various inflammatory factors (higher white blood cell (WBC) count, C-reactive protein (CRP), and neutrophil count, and lower lymphocyte count) were predictive of the three severe events, consistent with current research suggesting that SARS-CoV-2 may accelerate the inflammatory response and cause fluctuations in inflammatory factors, thereby leading to severe immune injury and lymphopenia.26,27,30,31,32,33 Previous studies also indicated that leukocytosis resulting from a mixed infection of bacteria and fungi in the context of viral pneumonia indicates poor outcomes.34,35 In addition, our study suggests that electrolyte and acid-base balance (K+), which relates to respiratory function, and an indicator of myocardial injury (higher α-hydroxybutyrate dehydrogenase (HBDH)) contribute to the prediction of progression to severe illness requiring MV, while D-dimer was associated with an increased risk of in-hospital mortality, in agreement with previous studies.11,12,26,32,36 Other features such as comorbidities (e.g. hypertension) were also related to poor prognosis.27,29
Our work has several limitations. First, we did not consider the effect of different treatments on the prognosis of patients across clinical centers. In our study, several treatments were adopted, including oxygen therapy, MV, ECMO, antiviral treatment, antibiotic treatment, glucocorticoids, and intravenous immunoglobulin therapy. In-depth comparison of different treatment outcomes might improve response prediction. Second, ten experienced thoracic radiologists analyzed the CT images in consensus and evaluated traditional imaging features in our study; however, we did not study inter-reader variability, and such an analysis might need to be addressed in future work.
In addition, although our study had a large sample size with clear prognosis information, the numbers of endpoints were limited and only from Chinese hospitals which could potentially limit the generalizability of models in other areas. Finally, additional validation across populations from European and American hospitals are needed to further validate the reported models.
In conclusion, we developed computational models with clinical prognostic estimation functions incorporating CT-based radiomics features as well as clinical data from electronic medical records for COVID-19 patients. This information may aid in delivering proper treatment and optimizing the use of limited medical resources in the current COVID-19 pandemic.

Patient cohort
Our data in this study were collected from 39 hospitals in China. All patients (n = 3522) met the following inclusion criteria: (a) a confirmed positive SARS-CoV-2 nucleic acid test; (b) chest CT examinations and laboratory tests on the date of admission; (c) clear short-term prognosis information (discharge, or adverse outcomes including admission to the ICU, requiring MV support, or in-hospital death). After screening with the exclusion criteria, 2363 patients from 26 medical centers were analyzed in our study (Figure 1, Figure S1, Table S1). This study protocol was approved by the institutional review board of Jinling Hospital, Nanjing University School of Medicine (2020NZKY-005-02).

Data collection and processing
Our multi-modal data for each patient included: (a) clinical records (abbreviated as Clin): demographics, comorbidities, and clinical symptoms; (b) laboratory results (abbreviated as Lab): blood routine, blood biochemistry, coagulation function, and infection-related biomarkers; (c) CT-based radiomics features (abbreviated as Radiom); (d) radiologists' semantic data (abbreviated as R-score); (e) time-to-event data: the time intervals between the date of admission and the date of development of adverse outcomes (ICU admission, MV, and death) or the date of discharge (Table S2, Supplementary Appendix 1). To address the class imbalance and high feature dimensionality in modeling, we adopted several combinations of methods to downsample the negative cases (n = 2207, the stable group: patients discharged without any adverse outcome) and oversample the positive cases (n = 155, the adverse group: patients who required ICU admission, including 94 who needed MV and 59 who died) to enhance the models' ability to learn from imbalanced data (Supplementary Appendix 2).
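The resampling step can be sketched as below; the exact combination of methods is given in Supplementary Appendix 2, so the function name and the 4:1 negative-retention ratio here are illustrative assumptions, not the authors' procedure:

```python
import numpy as np

def rebalance(X, y, neg_keep=4, rng=None):
    """Randomly undersample negatives to at most neg_keep x the positive
    count, then oversample positives (with replacement) to parity.
    A simplified stand-in for the combined resampling described above."""
    rng = np.random.default_rng(rng)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = min(len(neg), neg_keep * len(pos))
    keep_neg = rng.choice(neg, size=n_neg, replace=False)   # undersample
    boost_pos = rng.choice(pos, size=n_neg, replace=True)   # oversample
    idx = np.concatenate([keep_neg, boost_pos])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Applied to a 155-positive / 2207-negative cohort, this would yield a balanced training set while capping the information loss from discarding negatives.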

Model development and prediction evaluation
There were three binary classification tasks in this study: stable (negative) vs. adverse (ICU) samples, non-MV vs. MV samples, and survival vs. death samples. To test the prediction performance of different data type combinations, multivariable models based on five types of data were developed and compared: 1) radiomics data only (denoted as "Radiom"); 2) radiomics and clinical features (demographics, comorbidities, and clinical symptoms) (denoted as "RadioClin"); 3) radiomics data, clinical features, and laboratory results (denoted as "RadioClinLab"); 4) clinical features and laboratory results (denoted as "ClinLab"); 5) a radiological score based on a linear combination of semantic imaging features evaluated by radiologists (denoted as "R-score"). To confirm that the patients were reasonably grouped based on the adverse outcomes and on whether the event occurred within 48 hours, we first provided an intuitive view of the distribution of all feature types used in this study with heatmaps and t-distributed Stochastic Neighbor Embedding (t-SNE) in terms of ICU admission, MV, and death (Supplementary Appendix 3).
All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version posted November 6, 2020; doi: https://doi.org/10.1101/2020.11.04.20225797 (medRxiv preprint).
To systematically explore the performance of multiple machine-learning classifiers, we used the following approaches to predict outcomes: 1) Logistic Regression (LR); 2) Random Forest (RF); 3) Support Vector Machine (SVM); 4) Multilayer Perceptron (MLP); 5) LightGBM. In Cohort 1 (n = 1662), the data were split into training and testing sets (ratio 7:3) using stratified random sampling based on death cases. We used 5-fold cross-validation on the training set only to tune the model hyperparameters (Supplementary Appendix 4). Both a randomized search with accuracy as the optimization goal and a grid search with F1 score as the optimization goal were run within the 5-fold cross-validation, and predictive performance was evaluated on the test set of Cohort 1. Finally, to select an optimal model for each prediction task, the five models with the top area under the receiver operating characteristic curve (AUROC)14 were first selected, and the model with the highest area under the precision-recall curve (AUPRC)15 was then chosen as the optimal model for each outcome, because AUROC and AUPRC together reflect model accuracy, precision, and recall more comprehensively across varying thresholds.
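The two-stage selection scheme above (shortlist by AUROC, then pick the highest AUPRC) can be sketched as follows. The synthetic data, the two-model candidate set, and the simple tuple ranking are illustrative stand-ins for the authors' five-classifier search, not their actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for one prediction task (~10% positives).
X, y = make_classification(n_samples=600, n_features=20, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

candidates = {"LR": LogisticRegression(max_iter=1000),
              "RF": RandomForestClassifier(random_state=0)}
scores = {}
for name, clf in candidates.items():
    prob = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    scores[name] = (roc_auc_score(y_te, prob),          # ranked first
                    average_precision_score(y_te, prob))  # tie-breaker

# Tuple comparison ranks by AUROC first, then by AUPRC.
best = max(scores, key=lambda k: scores[k])
```

In practice the shortlist would come from the cross-validated search over all five classifier families, with AUPRC deciding among the AUROC leaders.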

Model validation and comparison
We tested the statistical significance of the performance differences among the selected models with 30 bootstrapped resamples on unseen data (Cohort 2, n = 700; Cohort 3, n = 662; Figure 1) and used the AUROC and AUPRC curves to estimate their generalization ability. In particular, Cohort 3 allowed us to verify the models' ability to predict events occurring two or more days later, which may give the healthcare system at least two days to plan ahead and react to the demand for resources. Box plots were also drawn to compare the performance of the optimal models found on Cohort 1 across the three classification tasks. Finally, we selected an optimal model for each prediction task based on the results of the paired one-sided t-test, which compared the AUROC and AUPRC of models built from different data types (Radiom, RadioClin, RadioClinLab, ClinLab). Additionally, we constructed the R-score model using logistic regression on the semantic features to compare against the Radiom model (on both Cohort 2 and Cohort 3) and to identify the traditional imaging features that were helpful for predicting the outcome events (Supplementary Appendix 2).
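The bootstrap comparison described above can be sketched as follows; the synthetic labels and the two hypothetical models (one deliberately noisier) are assumptions for illustration only:

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)
# Scores from two hypothetical models; model A is more informative (less noise).
score_a = y + rng.normal(0, 0.8, n)
score_b = y + rng.normal(0, 1.6, n)

auc_a, auc_b = [], []
for _ in range(30):                       # 30 bootstrapped resamples, as in the text
    idx = rng.choice(n, n, replace=True)
    if y[idx].min() == y[idx].max():      # AUROC needs both classes present
        continue
    auc_a.append(roc_auc_score(y[idx], score_a[idx]))
    auc_b.append(roc_auc_score(y[idx], score_b[idx]))

# Paired one-sided t-test: is model A's AUROC greater than model B's?
t, p = stats.ttest_rel(auc_a, auc_b, alternative="greater")
```

Pairing the resamples (same bootstrap indices for both models) removes resample-to-resample variance from the comparison, which is what makes the one-sided paired test appropriate here.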

Analysis of predictive features
We extracted feature importance from the selected optimal models and normalized by the highest importance score in each bootstrapping experiment on Cohort 2 (n = 700). By averaging the feature importance values over the thirty bootstrapping experiments, we then focused on the ten most important features for each prediction task. We also plotted pairplots to visualize the relationships among the top ten features. Furthermore, we performed the independent two-sided t-test (continuous variables with a normal distribution), proportional z-test (categorical variables), and rank-sum test (continuous variables without a normal distribution) to assess the statistical significance of differences in feature values between positive cases and all cases in Cohort 1 and Cohort 2, after first applying the Shapiro-Wilk normality test.
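A minimal version of this importance-averaging procedure might look like the following; the synthetic data and the use of 10 rather than 30 resamples are simplifications for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           random_state=0)
rng = np.random.default_rng(0)

runs = []
for _ in range(10):                       # bootstrap resamples (paper uses 30)
    idx = rng.choice(len(y), len(y), replace=True)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[idx], y[idx])
    imp = clf.feature_importances_
    runs.append(imp / imp.max())          # normalise by the top score per run

mean_imp = np.mean(runs, axis=0)          # average importance across resamples
top10 = np.argsort(mean_imp)[::-1][:10]   # most important features, descending
```

Normalising within each run before averaging keeps one unusually confident resample from dominating the ranking.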

Time-to-event modeling
Cox regression with the l1 penalty (scikit-survival package 0.12.1) was applied to the time-to-event data in Cohort 1 (n = 1277; 77% of the patients originally in Cohort 1 had event times recorded) and Cohort 2 (n = 682; 97% of the patients originally in Cohort 2 had event times recorded).16,17,18,19 Three data combinations were used for time-to-event modeling: Radiom, RadioClinLab, and ClinLab. We used five-fold cross-validation on Cohort 1 to determine the "alpha_min_ratio" hyperparameter18,19 and calculated performance on Cohort 2. We used the concordance index (C-index) and the integrated Brier score to evaluate the models. On Cohort 1, the optimal model for each data combination was chosen in a similar manner as described for the classification tasks, by first filtering on mean C-index and then optimizing the mean integrated Brier score over the three tasks. Next, we used Kaplan-Meier analysis to visualize the time-to-event models and the log-rank test to estimate significance. High-risk and low-risk groups were created according to the predicted score for each patient on each task with the optimal RadioClinLab model. To assign patients to the high-risk and low-risk groups, we first calculated the ratios of positive cases in Cohort 1, then set thresholds on the predicted probabilities of the test samples to separate patients according to these ratios on Cohort 2.
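The Kaplan-Meier product-limit estimate underlying the survival curves can be computed without a survival library. This standalone sketch (not the scikit-survival code used in the study) illustrates the calculation:

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit estimate S(t) at each distinct observed event time.
    `event` is 1 for an observed event, 0 for right-censoring. At each
    event time t, survival is multiplied by (1 - deaths / at_risk)."""
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    surv, s = [], 1.0
    for t in np.unique(time[event == 1]):
        at_risk = np.sum(time >= t)                    # still under observation
        deaths = np.sum((time == t) & (event == 1))    # events exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append((t, s))
    return surv
```

Censored observations (event = 0) never drop the curve themselves; they only shrink the at-risk set at later event times, which is what separates this estimate from a naive survival fraction.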
Statistical analysis
The mean and standard deviation (SD) were used to describe normally distributed data, while the median and interquartile range (IQR) were used to describe non-normally distributed data. Categorical variables were presented as numbers and percentages. The AUROC, AUPRC, and accuracy values with their 95% CIs were reported to assess model performance. The paired one-sided t-test was used to assess the statistical significance of differences between AUROC and AUPRC values in the bootstrapping experiments. The chi-square test and Fisher's exact test were used to compare categorical data, while the independent t-test and Wilcoxon rank-sum test were used to compare the feature values of continuous variables between positive and negative cases in the entire cohort (n = 2362). The proportional z-test was used to compare the feature values of categorical variables between positive and negative cases among the most important features identified by the classifiers, and to test the statistical significance of categorical variables between Cohort 1 and Cohort 2. Kaplan-Meier survival analysis was performed on the high-risk and low-risk groups based on the predictions, and the log-rank test was used to evaluate statistical significance.

Role of the funding source
The funders of the study had no role in the study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Data availability
The data that support the findings of this study are available on request from the corresponding author (G.M.L.). The data with participant privacy/consent are not publicly available due to hospital regulation restrictions.
References
1. WHO. Weekly Epidemiological and Operational Updates, October 2020. https://www.who.int/docs/default-source/coronaviruse/situation-reports/20201012-weekly-epi-update-9.pdf.

Note. P-values show statistically significant differences in features between Cohort 1 and Cohort 2. There were statistically significant differences in prognostic features (e.g. age, dyspnea) between Cohort 1 and Cohort 2, but no significant difference in these features among positive cases (the adverse group, patients requiring ICU admission) in the two cohorts (Table S3). Thus, this difference may be due to the discrepancy in the proportion of Hubei cases (Cohort 1, 69.8%; Cohort 2, 80.1%), which had a higher proportion of severe outcomes (6.9% and 8.6%, respectively).

Figure captions
Figure 1. Illustration of the workflow in this study. (a) Our primary cohort (Cohort 1, n = 1662) for model development included patients from 17 hospitals, and our validation cohort (Cohort 2, n = 700) consisted of patients from 7 external and independent medical centers. Additionally, we built a specific cohort (Cohort 3, n = 662) of patients from the 7 medical centers whose intervals between admission and progression to critical outcomes (ICU/MV/death) were more than two days, aiming to evaluate the performance of our models in predicting events happening at least two days after admission. (b) Explanation of our data split and the corresponding usages. (1) Step one: feature visualization of Cohort 1 and Cohort 2 to get a preliminary intuitive sense; (2) Step two: 70% of the samples of Cohort 1 were picked as the training set using stratified sampling based on death cases, where 5-fold cross-validation was used to tune the hyperparameters of the models; (3) Step three: model selection was performed on the remaining 30% of the samples of Cohort 1; (4) Step four: Cohort 2 and Cohort 3 were used to evaluate model performance in different aspects.
Figure 2. Heatmap showing the prognostic performance of (a) radiomics data and (b) clinical data and R-score data on Cohort 2, with clustering of features. 150 negative patients were randomly selected, together with all patients having outcomes of ICU admission, mechanical ventilation, or death, to draw the heatmap. Patients with more than one adverse outcome appear as samples in each corresponding category. The patients were grouped based on adverse outcomes (i.e. ICU admission, MV, and death) and on whether the event occurred within 48 hours after admission. The features were clustered within their categories to better visualize the data. The differences between negative-outcome patients (yellow) and positive-outcome patients can be seen in both (a) and (b), with some features showing different patterns for negative patients (discharged without any adverse outcome) and positive patients (who required ICU, MV, or died while hospitalized). Almost all CT image features showed good discrimination between negative and severe-outcome patients, with more obvious distinctions than the clinical data. Among the clinical data, lab results and demographics had good discriminating power. Some of the radiologists' score features had good discriminating power, while the clinical features had comparatively weak discriminating power. Regarding the distinctions between ICU admission, mechanical ventilation, and death, CT image features showed better discriminating power than clinical data. In the CT image features, increasing or decreasing value trends can be observed from ICU to MV to death, while in the clinical data, no such trend is visible.
Figure 3. Model performance in the prediction of three outcomes (Cohort 2) and the ten most important features in the three outcome prediction tasks. The first and second rows present ROC curves and PR curves for predicting the three events with models based on different data types. a) and d), b) and e), c) and f) indicate that RadioClinLab-based models for predicting ICU/MV/death achieved the highest AUROC (0.944/0.942/0.860) and AUPRC (0.665/0.551/0.346), respectively. g-i) The ten most important features and their relative importance over thirty bootstrapping experiments for the three prediction tasks, based on the feature importance of the LightGBM classifiers.
Figure 4. Kaplan-Meier curves for the three tasks in Cohort 2. Risk groups were divided according to model-predicted scores. (a) ICU admission, (b) mechanical ventilation, and (c) death (high-risk: risk = 1; low-risk: risk = 0).