Machine learning based predictors for COVID-19 disease severity

Predictors of the need for intensive care and mechanical ventilation can help healthcare systems in planning for surge capacity for COVID-19. We used socio-demographic data, clinical data, and blood panel profile data at the time of initial presentation to develop machine learning algorithms for predicting the need for intensive care and mechanical ventilation. Among the algorithms considered, the Random Forest classifier performed the best with AUC = 0.80 for predicting ICU need and AUC = 0.82 for predicting the need for mechanical ventilation. We also determined the most influential features in making this prediction, and concluded that all three categories of data are important. We determined the relative importance of blood panel profile data and noted that the AUC dropped by 0.12 units when this data was not included, thus indicating that it provided valuable information in predicting disease severity. Finally, we generated RF predictors with a reduced set of five features that retained the performance of the predictors trained on all features. These predictors, which rely only on quantitative data, are less prone to errors and subjectivity.

www.nature.com/scientificreports/ physical examination including fever, dyspnea, respiratory rate, and blood oxygen saturation (SpO2); (c) blood panel profile including RT-PCR, interleukin-6, d-Dimer, complete blood count, lipase, and C-reactive protein (CRP). They also include the outcome data, namely, the need for ICU admission and mechanical ventilation. A description of all the input features, their type, and their median, minimum, and maximum values is presented in Tables 1, 2, 3, 4 and 5.
The study cohort comprised 212 patients (123 males, 89 females) with an average age of 53 years (range 13-92 years), of whom 74 required intensive care at some point during their stay and 47 required mechanical ventilation. We note that only data obtained at the time of initial presentation (within 24 hours of initial presentation) was included as input to the predictive models, and the need for ICU admission and mechanical ventilation at any time during hospitalization were selected as outcomes.
Features with more than 30% missing data were excluded from the analysis. In the retained features, missing data was imputed using an iterative imputation method. In this method, the feature to be imputed is treated as a function of a subset of other highly correlated features, and missing values are obtained using regression 9. This subset of features is then iterated over to arrive at the final estimate. As part of this strategy, in order to prevent data leakage, only the training samples were used to develop regression models for imputation.
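The exclusion and imputation steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes scikit-learn's IterativeImputer as a stand-in for the regression-based method, and the function names, the number of neighbouring features, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer


def drop_sparse_features(X, max_missing=0.30):
    """Exclude columns in which more than max_missing of the values are missing."""
    missing_fraction = np.isnan(X).mean(axis=0)
    keep = missing_fraction <= max_missing
    return X[:, keep], keep


def impute_without_leakage(X_train, X_test, n_nearest_features=3, seed=0):
    """Fit the regression-based imputer on training rows only, then apply it to both sets."""
    imputer = IterativeImputer(
        n_nearest_features=n_nearest_features,  # regress on a subset of correlated features
        max_iter=10,
        random_state=seed,
    )
    X_train_filled = imputer.fit_transform(X_train)  # regression models see training data only
    X_test_filled = imputer.transform(X_test)        # test rows are filled with the fitted models
    return X_train_filled, X_test_filled


# Synthetic stand-in data (assumption): 100 subjects, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:, 1] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)  # a correlated feature pair
X[rng.random(X.shape) < 0.1] = np.nan                      # roughly 10% missing at random
X[:60, 5] = np.nan                                         # last feature is mostly missing

X_kept, keep_mask = drop_sparse_features(X)
X_train, X_test = X_kept[:80], X_kept[80:]
X_train_filled, X_test_filled = impute_without_leakage(X_train, X_test)
```

Fitting the imputer inside each training split, rather than once on the whole cohort, is what keeps information from the test subjects out of the regression models.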
The retained features were used to compute the correlation of the outcome with input features. Thereafter, data was split into training (60%) and testing (20%) sets. Fivefold cross-validation was performed using the training set to train the supervised learning models (random forest, multilayer perceptron, support vector machines, gradient boosting, extra tree classifier, and AdaBoost) and tune their hyperparameters. Among all these algorithms, the Random Forest 10 (RF) classifier was found to be the most accurate and was considered for further analysis. The tuned RF model was applied to testing data to compute the probability of ICU admission and mechanical ventilation. This was repeated with five different folds, yielding predicted probabilities for all 212 subjects generated by five distinct RF models. These were used to generate an ROC curve and compute the area under the curve (AUC). The relative importance of the input features was evaluated by computing their Gini importance.
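One possible reading of this procedure is sketched below: an outer five-fold loop holds out each subject once, an inner cross-validated grid search tunes the RF on the remaining data, and the held-out probabilities are pooled to compute a single AUC. The hyperparameter grid, function name, and synthetic data are assumptions made for illustration; the study's actual settings are not given in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def pooled_out_of_fold_probabilities(X, y, n_folds=5, seed=0):
    """Return one held-out probability per subject and fold-averaged Gini importances."""
    outer = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    grid = {"n_estimators": [50], "max_depth": [3, None]}  # hypothetical search grid
    probabilities = np.zeros(len(y))
    importances = []
    for train_idx, test_idx in outer.split(X, y):
        # Inner fivefold cross-validation tunes hyperparameters on training data only.
        search = GridSearchCV(
            RandomForestClassifier(random_state=seed), grid, cv=5, scoring="roc_auc"
        )
        search.fit(X[train_idx], y[train_idx])
        model = search.best_estimator_  # refit on the full training split
        probabilities[test_idx] = model.predict_proba(X[test_idx])[:, 1]
        # feature_importances_ is the impurity-based (Gini) importance.
        importances.append(model.feature_importances_)
    return probabilities, np.mean(importances, axis=0)


# Synthetic stand-in for the cohort (assumption): 212 subjects, 10 features.
X, y = make_classification(n_samples=212, n_features=10, n_informative=4, random_state=0)
probabilities, gini_importance = pooled_out_of_fold_probabilities(X, y)
auc = roc_auc_score(y, probabilities)
ranking = np.argsort(gini_importance)[::-1]  # most important feature first
```

Because every subject appears in exactly one held-out fold, the pooled vector contains one prediction per subject, each produced by a model that never saw that subject during training.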
The analysis described above was first performed with input data from all categories, that is, socio-demographic data, presenting clinical data, and blood panel profile data. Thereafter, the blood panel profile data was excluded and the analysis was performed once again. This second analysis was done to assess the relative importance of the blood panel data in predicting the outcomes.
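A feature-group ablation of this kind can be sketched as follows. The column groupings, the synthetic outcome (deliberately driven mostly by the stand-in blood panel columns), and the function name are assumptions made only to illustrate the comparison; they do not reproduce the study's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def cross_validated_auc(X, y, columns, seed=0):
    """AUC of pooled out-of-fold RF probabilities using only the given columns."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    proba = cross_val_predict(model, X[:, columns], y, cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


# Synthetic stand-in data (assumption): three non-lab columns and three lab columns.
rng = np.random.default_rng(0)
n_subjects = 212
non_lab = rng.normal(size=(n_subjects, 3))
blood_panel = rng.normal(size=(n_subjects, 3))
signal = 0.5 * non_lab[:, 0] + 2.0 * blood_panel[:, 0] + 1.5 * blood_panel[:, 1]
y = (signal + rng.normal(size=n_subjects) > 0).astype(int)
X = np.hstack([non_lab, blood_panel])

non_lab_columns = [0, 1, 2]
all_columns = [0, 1, 2, 3, 4, 5]

auc_all = cross_validated_auc(X, y, all_columns)
auc_without_blood_panel = cross_validated_auc(X, y, non_lab_columns)
auc_drop = auc_all - auc_without_blood_panel
```

Using the same folds and the same model settings in both runs means the difference in AUC reflects the removed columns rather than a change in the evaluation protocol.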

Results
In Fig. 1, we have plotted the AUC values for predicting the need for ICU and mechanical ventilation for all the algorithms considered in this study. From this figure we observe that the algorithms based on decision trees, that is, Random Forest, Extra Tree Classifier, and Gradient Boosting tend to perform better. This is likely because the simpler algorithms like Support Vector Machines do not have sufficient capacity to capture the complexity in the prediction, while other algorithms like Multi-Layer Perceptrons (MLP) do not have sufficient data for efficient training. This leads to issues with robustness and over-fitting. Further, among the algorithms based on decision trees, the Random Forest (RF) classifier is the most accurate and was considered for further analysis.
For the RF predictor, we report an AUC of 0.80, 95% CI (0.73-0.86), for predicting the need for ICU and an AUC of 0.83, 95% CI (0.76-0.90), for predicting the need for mechanical ventilation. The sensitivity, specificity, PPV, and NPV at the optimal cut-point in the ROC curve 11 are reported in Table 6. These values demonstrate that we are able to accurately predict the need for intensive care and ventilation from data acquired at the time of admission. In terms of the AUC, the performance of the RF predictor is similar to results reported in studies from China 4, New York 7, and the Netherlands 5 (AUC of 0.88, 0.80, and 0.77, respectively). We note that these studies differ from ours owing to regional differences in the population and the viral strain. Further, some of these studies also included chest x-ray imaging features and tested a single type of ML algorithm (logistic regression or random forest). Deep learning models were also developed based on a cohort from China 6, and these report an AUC of 0.89 for a coarse measure of disease severity that groups together patients receiving ICU care or mechanical ventilation and those ultimately succumbing to the disease.
Table 4. Input features from presenting clinical data and the results of an initial physical examination.
When only socio-demographic and presenting clinical data was used as input (lab markers were excluded), the AUC value for predicting ICU need dropped to 0.68, 95% CI (0.60-0.75), and that for predicting ventilation dropped to 0.70, 95% CI (0.61-0.79). The values of sensitivity, specificity, PPV, and NPV at the optimal point also dropped by about 0.1 (see Table 6). This indicates that the lab marker data provides significant additional information and is important in improving the accuracy of these predictions.
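The operating-point metrics listed in Table 6 can be computed from predicted probabilities as sketched below. The text does not spell out the criterion behind the optimal cut-point, so this sketch assumes Youden's J statistic (sensitivity + specificity - 1), which is one common choice. The function names and the ten toy subjects are hypothetical.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores at or above the threshold are called positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def operating_point_metrics(labels, scores, threshold):
    """Sensitivity, specificity, PPV and NPV at a given threshold."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def youden_cut_point(labels, scores):
    """Threshold maximizing Youden's J (assumed criterion, not confirmed by the text)."""
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        metrics = operating_point_metrics(labels, scores, threshold)
        j = metrics["sensitivity"] + metrics["specificity"] - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy example (assumption): 1 marks a subject who needed ICU care.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.75, 0.9]
threshold, j_statistic = youden_cut_point(labels, scores)
metrics = operating_point_metrics(labels, scores, threshold)
```

In this toy example the selected threshold is 0.4, where all four positive subjects are detected at the cost of two false positives, giving a sensitivity of 1.0 and a specificity of 4/6.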
A recent comprehensive survey of laboratory markers concluded that many of the markers that are included in this study are correlated with COVID-19 severity and should therefore be used in models for predicting disease severity 12. However, our results also indicate that it is possible to make moderately accurate predictions with only socio-demographic and presenting clinical data. This is particularly useful when quick decisions are required and the time or resources needed to acquire lab marker data are not available.
The top ten features with the strongest correlation to ICU admission are shown in Fig. 2A, and the most important features for the RF classifier for ICU need are shown in Fig. 2B. Similarly, the top ten features with the strongest correlation to the need for mechanical ventilation are shown in Fig. 3A, and the most important features for the RF classifier for mechanical ventilation need are shown in Fig. 3B.
Taken together, this set represents features that strongly influence the likelihood of ICU admission and mechanical ventilation. We note that they belong to all three categories (socio-demographic data, presenting clinical data, and blood panel profile data), showing that all these types of data are necessary in making an accurate assessment of disease severity. Several of these features have been implicated in determining the severity of COVID-19 by other researchers 7,13-19; however, there are few studies that have considered them together and determined their relative importance.
Table 6. Performance of Random Forest predictors at the optimal operating point. We report sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Numbers in parentheses are the 95% confidence intervals.
Finally, we considered RF predictors that are trained using only the top five features for predicting ICU need. These are the values for CRP, d-Dimer, Procalcitonin, SpO2, and respiratory rate. Models based on this reduced set of features are easier to implement since they require less data. They are also more robust and less prone to subjective assessment, since all these features are quantitative and can be measured accurately. For the model designed to predict ICU need using these features we report an AUC of 0.79 (0.72, 0.85), and for the model designed to predict the need for mechanical ventilation we report an AUC of 0.83 (0.77, 0.90). Both these values are very close to those of the corresponding predictors that utilize all 72 features, indicating that little accuracy is lost by employing the simpler, more robust models. The sensitivity, specificity, PPV, and NPV values for these reduced models are reported in the third and sixth rows of Table 6, and these are also quite close to those of the corresponding models that utilize all 72 features.
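The construction of a reduced predictor can be sketched as follows: rank features by Gini importance, keep the top five, and compare cross-validated AUC against the model trained on all features. The synthetic 72-feature dataset and the function names are assumptions, and ranking the features on the full dataset, as done here for brevity, can make the reduced-model AUC optimistic; whether the study performed the ranking inside each fold is not stated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def top_k_features(X, y, k=5, seed=0):
    """Indices of the k features with the largest Gini importance."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return order[:k]


def cross_validated_auc(X, y, seed=0):
    """AUC of pooled out-of-fold RF probabilities."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


# Synthetic stand-in data (assumption): 212 subjects, 72 features, 5 of them informative.
X, y = make_classification(
    n_samples=212, n_features=72, n_informative=5, n_redundant=0, random_state=0
)

top_five = top_k_features(X, y, k=5)
auc_full = cross_validated_auc(X, y)
auc_reduced = cross_validated_auc(X[:, top_five], y)
```

A stricter variant would repeat the ranking within each training fold so that the held-out subjects never influence which five features are kept.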

In Fig. 4, we plot the distribution of some of the most important input features, including lab markers, presenting symptoms, and socio-demographic data, for two sets of patients: those who require ICU care and those who do not. We observe that the distributions of Creatinine (indicator of kidney function), C-reactive Protein (measure of inflammatory response), d-Dimer (measure of blood clot formation and breakdown), and Procalcitonin (elevated during infection and sepsis) among patients who require ICU care are spread over a larger range and have a higher average value. A similar trend is observed in the distribution of the respiratory rate. For SpO2 levels, we also observe a distribution spread over a wider range for patients admitted to the ICU; however, in this case this group has a lower average value. We also note that the presence of influenza-like symptoms roughly doubles the likelihood of requiring ICU care (from around 25% to 52%). Further, the percentage of males who are admitted to the ICU is much higher than the percentage of females (46% versus 20%).

Discussion
The results presented in this study demonstrate that data acquired at or around the time of admission of a COVID-19 patient to a care facility can be used to make an accurate assessment of their need for critical care and mechanical ventilation. Further, the important features in this data belong to three different sets, namely, socio-demographic data, presenting clinical data, and blood panel profile data. We report that in cases where the blood panel data is not available, useful predictions can still be made, albeit with some loss of accuracy. This would be relevant to situations where the time or resources to acquire this type of data are limited. Out of all the machine learning models considered in this study, we found the random forest to be the most accurate and the most robust to data perturbation for both critical care and mechanical ventilation prediction. We also demonstrate that the values of just five features, namely, CRP, Procalcitonin, d-Dimer, SpO2, and respiratory rate, can be used to predict the need for critical care and mechanical ventilation with an accuracy that is comparable to using all 72 features. The list of important features identified in our study is also indicative of a disease that affects multiple systems in the body, including the respiratory, circulatory, and immune systems.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.