Risk factors analysis of COVID-19 patients with ARDS and prediction based on machine learning

COVID-19 is a newly emerging infectious disease to which humans are generally susceptible, and it has caused heavy losses to public health. Acute respiratory distress syndrome (ARDS) is a common clinical manifestation of severe COVID-19 and is also responsible for the current worldwide shortage of ventilators. This study aims to analyze the clinical characteristics of COVID-19 patients with ARDS and to establish an artificial intelligence (AI)-based diagnostic system that predicts the probability of ARDS in COVID-19 patients. We collected clinical data on 659 COVID-19 patients from 11 regions in China. The clinical characteristics of the ARDS and non-ARDS groups were compared in detail, and both traditional machine learning algorithms and a deep learning-based method were used to build the prediction models. The median age of ARDS patients was 56.5 years, 7.5 years older than that of non-ARDS patients, a significant difference. Male patients and patients with BMI > 25 were more likely to develop ARDS. The clinical features of ARDS patients included cough (80.3%), polypnea (59.2%), lung consolidation (53.9%), secondary bacterial infection (30.3%), and comorbidities such as hypertension (48.7%). Abnormal biochemical indicators such as lymphocyte count, CK, NLR, AST, LDH, and CRP were all strongly related to the aggravation of ARDS. Furthermore, when various AI methods were used for modeling and evaluated on the above risk factors, the decision tree achieved the best AUC, accuracy, sensitivity and specificity in identifying patients with mild disease who were prone to develop ARDS, which may help to deliver proper care and optimize the use of limited resources.


Scientific Reports | (2021) 11:2933 | https://doi.org/10.1038/s41598-021-82492-x | www.nature.com/scientificreports/
Acute respiratory distress syndrome (ARDS) is a common and devastating critical illness [8]. It has been reported that 67% of COVID-19 patients with severe illness developed ARDS, which is the main cause of death [9]. However, in the early stage of onset, quite a few patients have no obvious clinical symptoms, so the risk is difficult to judge until ARDS occurs. Predicting which patients are more likely to develop ARDS, and thus face a greater risk of complications including death, is particularly important in a novel and accelerating outbreak [10]. Such prediction would also be useful for evaluating or predicting the public health burden or resource demand on a large scale, e.g. in a city or a province.
Artificial intelligence (AI) has begun to tackle these difficult challenges in healthcare and can provide clinical decision support if used carefully [11]. The COVID-19 prediction models reported so far mainly focus on epidemic trends, early screening, CT diagnosis, and prognosis of COVID-19 patients [12][13][14][15]. Few models have been studied for early identification of the patients who are most likely to develop ARDS and for recommending interventions. Xiang Bai et al. established a Long Short-Term Memory (LSTM) model that combined 75 clinical features with quantitative CT sequence data obtained at different times to predict the malignant progression of COVID-19, achieving an AUC of 0.954 [16]. Xiangao Jiang et al. used traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) to predict disease progression to ARDS in COVID-19 patients, with an overall accuracy of 70%-80% [10]. That model was built on a small sample of only 53 patients, so its prediction accuracy was slightly lower. The most-reported predictors of severe progression in patients with COVID-19 included age, sex, features derived from computed tomography scans, C-reactive protein, lactic dehydrogenase, and lymphocyte count. The C index of these models ranged from 0.85 to 0.98 [17]. However, most reports did not include a description of the study population or the intended use of the models, and they were rated at high risk of bias. Early detection of patients who are likely to develop critical illness is of great importance and may help to deliver proper care and optimize the use of limited resources. We aimed to develop a COVID-19 ARDS clinical decision support system using machine learning algorithms and to deploy it into electronic medical records (EMR) to assist doctors in identifying severe patients at the time of hospital admission.

Results
Characteristics of COVID-19 patients. Demographics and epidemiology. In this study, we collected data on a total of 659 patients from Wuhan and non-Wuhan areas who were confirmed with COVID-19, of whom 76 patients (11.5%) developed ARDS. 447 patients (70.9%) had contact with infected persons and 44.3% had a family infection. The median incubation period was 5 days (interquartile range, 3 to 9), and the average times from onset to ARDS and from admission to ARDS were 10 days and 3 days, respectively. The median age of the patients was 50 years (interquartile range, 37 to 62) and 50.4% of the patients were male. Patients with ARDS were significantly older than those without ARDS, by a median of 7.5 years (56.5 years vs. 49 years), and male patients (76.3%) were more likely to develop ARDS. More than 50% of ARDS patients had a BMI greater than 25. However, the exposure histories of the two groups were similar (Table 1).
Overall, the presence of any comorbidity was more common among ARDS patients than among non-ARDS patients (56.6% vs. 39.8%). Patients with ARDS had a much higher incidence of hypertension (48.7% vs. 23%) and diabetes (17.8% vs. 9.5%). Two of the five patients infected with other viruses developed ARDS. ARDS also occurred in one patient who was treated with immunosuppressive agents (Table 2).

Radiologic, laboratory findings and complications.
Table 3 shows the radiologic and laboratory findings on admission and the complications. 74.7% of the patients presented ground-glass shadows on chest CT images and 28.3% presented consolidation. Both imaging features were more frequent in ARDS patients than in non-ARDS patients (80.8% vs. 73.9% and 53.9% vs. 24.7%, respectively). The median number of quadrants with consolidation in ARDS patients was two.
Within 48 h of admission, lymphocytopenia was present in 36.1% of the patients and leukopenia in 24.8%. However, among ARDS patients, 19.7% had an increased white blood cell count, which suggested secondary infection in these patients. The neutrophil-to-lymphocyte ratio (NLR) was greater than 3 in 45.3% of all COVID-19 patients and in 82.7% of ARDS patients, whose median NLR was 6.11. 47.4% and 32.2% of patients had elevated levels of C-reactive protein and lactate dehydrogenase, respectively. In a small number of patients, levels of alanine aminotransferase (ALT), aspartate aminotransferase (AST), creatine kinase (CK) and D-dimer were elevated. Laboratory abnormalities were more severe in ARDS patients than in non-ARDS patients. In addition, the medians of myoglobin and fasting glucose in ARDS patients were 85.9 μg/L and 8.1 mmol/L, respectively, which exceeded the normal reference ranges and were significantly different from those of the non-ARDS group.
During hospitalization, 91.3% of patients were diagnosed with pneumonia, with no statistical difference between the ARDS and non-ARDS groups. However, patients with ARDS had a higher incidence of shock and secondary bacterial infection (5.5% and 30.3%) than those without ARDS (0% and 4.3%), and 45.2% of them were admitted to the ICU (Tables 2, 3).

Prediction of risk factors for COVID-19 ARDS.
After removal of variables with a missing rate > 20%, a total of 98 variables covering demographics, epidemiology, clinical symptoms, underlying diseases, complications, CT image features and laboratory results were extracted from the structured and unstructured data.
Table 5 shows the mean ± standard deviation (std.) of AUC and accuracy under 10-fold cross validation. DT, logistic regression (LR) and RF all exceeded an AUC of 0.85, and the mean accuracy of each algorithm was over 0.8. To further verify the accuracy of the models, the performance of the five algorithms was evaluated on the external test set. Table 6 and Fig. 1 show that DT, LR, RF and DNN all demonstrated good performance in terms of AUC, accuracy and specificity. The sensitivity of DT and LR was much higher than that of the other three models.
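The variable screening step described above (dropping variables whose missing rate exceeds 20%) can be illustrated with a minimal sketch. It is not the authors' code: the record layout, the helper names `missing_rate` and `drop_sparse_columns`, and the toy patient values are all hypothetical and serve only to show the thresholding rule.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def missing_rate(records: List[Record], column: str) -> float:
    """Fraction of records in which `column` is absent or None."""
    missing = sum(1 for r in records if r.get(column) is None)
    return missing / len(records)


def drop_sparse_columns(
    records: List[Record], columns: List[str], max_missing: float = 0.20
) -> List[str]:
    """Keep only the columns whose missing rate does not exceed `max_missing`."""
    return [c for c in columns if missing_rate(records, c) <= max_missing]


# Hypothetical toy data: CRP is missing in 1 of 5 patients (20%),
# D-dimer is missing in 2 of 5 patients (40%).
patients: List[Record] = [
    {"age": 56, "crp": 41.0, "d_dimer": None},
    {"age": 49, "crp": 8.2, "d_dimer": 0.4},
    {"age": 63, "crp": None, "d_dimer": None},
    {"age": 37, "crp": 3.1, "d_dimer": 0.3},
    {"age": 71, "crp": 55.0, "d_dimer": 1.2},
]

kept = drop_sparse_columns(patients, ["age", "crp", "d_dimer"])
print(kept)  # ['age', 'crp']
```

In this toy example D-dimer is dropped because 40% of its values are missing, while CRP sits exactly on the 20% boundary and is retained, since only rates strictly above the threshold are removed.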
Considering the class imbalance of the actual dataset, we also evaluated the balanced accuracy of each model. The balanced accuracy of DT and DNN was 0.98 and 0.93, respectively. The predictive model established by SVM exhibited the worst performance among the five models. An ARDS diagnostic tool needs high sensitivity and accuracy.
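To make the role of balanced accuracy concrete, the following sketch computes accuracy, sensitivity, specificity and balanced accuracy (the mean of sensitivity and specificity) from binary predictions. The class sizes match the external test set reported in this paper (14 ARDS, 57 non-ARDS), but both sets of predictions are hypothetical and are not the outputs of any model in the study.

```python
from typing import Dict, Sequence


def classification_summary(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Binary metrics with 1 = ARDS and 0 = non-ARDS (both classes must be present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # fraction of ARDS cases that were detected
    specificity = tn / (tn + fp)  # fraction of non-ARDS cases correctly cleared
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Class sizes of the external test set: 14 ARDS and 57 non-ARDS cases.
y_true = [1] * 14 + [0] * 57

# Hypothetical classifier A: always predicts non-ARDS.
always_negative = [0] * 71
# Hypothetical classifier B: detects 12 of 14 ARDS cases and raises 3 false alarms.
some_model = [1] * 12 + [0] * 2 + [1] * 3 + [0] * 54

for name, y_pred in [("always negative", always_negative), ("hypothetical model", some_model)]:
    summary = classification_summary(y_true, y_pred)
    print(name, {k: round(v, 3) for k, v in summary.items()})
```

The always-negative classifier reaches an accuracy of about 0.80 simply because non-ARDS cases dominate, yet its balanced accuracy is 0.5, which is why balanced accuracy is the more informative figure on an imbalanced cohort.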
The results show that DT achieved the best value on each evaluation metric, with an AUC of 0.99, an accuracy of 0.97 and a sensitivity of 1.0. Therefore, the model constructed with the decision tree algorithm was the optimal tool for ARDS prediction.

Discussion
In this study, we comprehensively compared the clinical characteristics of all confirmed COVID-19 patients with and without ARDS, and determined 19 features for modeling. All included variables were strongly correlated with disease progression. Age (> 70 years), gender, hypertension, diabetes and severity evaluation are recognized risk factors for developing ARDS in COVID-19 patients [18]. Clinical manifestations such as fever, cough, hemoptysis, shortness of breath and lung consolidation reflect the progression of COVID-19 [19,20]. Viral infections predispose patients to secondary bacterial infections, which often lead to a more severe clinical course. Secondary bacterial infection has been considered a critical risk factor for the severity and mortality of COVID-19 despite antimicrobial therapies [21,22]. Lymphopenia and high concentrations of CRP and LDH may indicate severe acute lung inflammatory reaction and cell damage [23][24][25], and have been reported to be risk factors for severe COVID-19 [26]. ALT and AST are markers of acute liver injury. Studies have found that abnormal liver tests in patients with COVID-19 were associated with progression to severe pneumonia. The detrimental effects on the liver were mainly related to the use of lopinavir/ritonavir during hospitalization. Therefore, liver function should be monitored and evaluated frequently during medication [27,28]. NLR is an indicator of systemic inflammation [29], mainly seen in tumor-related diseases, autoimmune diseases, bacterial infectious pneumonia and tuberculosis [30][31][32][33]. It was reported that COVID-19 infection-triggered inflammation increased NLR, which was significantly associated with poor clinical outcomes of COVID-19 patients [34]. We found that CK was a high-risk factor for ARDS. On the one hand, it might be associated with heart injury in critically ill patients with COVID-19 [35]. On the other hand, this indicator is related to rhabdomyolysis [36,37]. Several cases of rhabdomyolysis with a marked increase of CK were reported in patients with severe COVID-19 [38][39][40].
We tried five algorithms for modeling, and the decision tree performed best. In clinical prediction research, decision trees are frequently used to build binary classifiers, for example for cancer prediction and prognosis [41]. As a machine learning method, the decision tree is nonparametric, so it makes fewer assumptions about the data, and it can accommodate collinear independent variables [42]. It is also less sensitive to outliers and more robust to high-dimensional data, which possess many independent variables relative to outcomes [43]. The main advantage of the decision tree is its simple structure, which makes it easier to extract classification rules and interpret the model. Our model consisted of 19 clinical variables, all of which were relatively inexpensive and easy to obtain directly from clinical symptoms and routine laboratory tests. At the same time, the system showed good sensitivity, specificity and AUC in the external test cohort. Compared with the results of Jiang et al. [10], the overall accuracy of our model is higher (91% vs 70%).
Our study has several strengths. First, we successfully used a machine learning algorithm to analyze clinical datasets and developed a diagnostic aid system, which has been deployed in electronic medical records for early identification of ARDS in COVID-19 patients. By submitting clinical information online, medical staff can triage patients at hospital admission based on the predicted risk and arrange treatment plans accordingly, so that patients receive treatment early and medical resources are allocated efficiently. Second, to ensure the reliability of the conclusions, we used large-sample data from multiple centers for modeling and verification. Third, we found that CK (> 185 U/L) and NLR were strongly correlated with ARDS; they might be new potential biomarkers for early identification of patients with severe COVID-19.
Our study still has some limitations, and much work remains for the future. First, although we collected data on 659 COVID-19 patients from multiple centers, the samples available for ARDS were limited. Second, we did not collect CT image data, and the quantitative information in the CT diagnostic data was not detailed enough. Third, it has been reported that D-dimer is a risk factor for COVID-19 severity; however, because of a large amount of missing data, we could not reach a similar conclusion in our study. Finally, it would be of great clinical value to study the intervention measures and prognosis of COVID-19 patients before and after the development of ARDS and to integrate them into the diagnostic system to achieve personalized treatment recommendations.

Conclusion
We retrospectively analyzed the clinical characteristics of COVID-19 patients with and without ARDS from Zhejiang Province and Wuhan and identified 19 risk factors. Based on these risk factors, we used five methods for modeling, four of which showed good predictive performance. The decision tree performed best, with an accuracy of 97%. We have deployed it in the infectious disease electronic medical record system to assist doctors in the early warning of severe COVID-19.

Method
Data analysis. Continuous variables were expressed as medians and interquartile ranges or simple ranges, as defined by experts. Categorical variables were summarized as counts and percentages. We assessed differences between the ARDS and non-ARDS groups using the two-sample t-test or the Mann-Whitney U test for continuous variables, depending on whether the data were parametric or non-parametric, and the chi-square test for categorical variables. Tests were two-sided, with significance set at α less than 0.05. All statistical analyses were performed using IBM SPSS Ver. 19.0. The Python programming language (Python Software Foundation, version 3.6.6, https://www.python.org/downloads/) was used for our models.
Machine learning model establishment and evaluation. Datasets. All data were divided into three separate parts with no overlapping subjects: training, validation, and external test sets (Table 7). For COVID-19 ARDS prediction:
• Training and validation datasets: 236 subjects (189 non-ARDS and 47 ARDS cases from 11 regions in Wuhan and Zhejiang) were assigned to the training and validation datasets in a 9:1 ratio and cross-validated 10 times. These datasets were used to train model parameters.
• External test dataset: 57 non-ARDS and 14 ARDS cases from 11 regions in Wuhan and Zhejiang. This dataset was used to evaluate and analyze the performance of the different models and to select the best model for the AI system.
Algorithms. Four conventional machine learning algorithms (decision tree, random forest, support vector machine and logistic regression) and one deep learning method (a deep neural network, DNN, with the ReLU activation function) were used to develop the ARDS prediction model for COVID-19 patients. We implemented the support vector machine with the RBF kernel. The ID3 decision tree was constructed with a maximum of 5 leaf nodes, and the random forest was built from 40 decision trees using the entropy criterion. The pipeline of the DNN model is shown in Figure S1. The input was a 19-dimensional vector containing the clinical data of a patient. The DNN model employed in this study was a 4-layer network with 64, 32, 8 and 1 hidden neurons, respectively. A sigmoid layer was added at the top of the network to output the probability of ARDS occurrence, and a total of 100 epochs were executed.
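The DNN layout described above can be sketched as a plain forward pass. This is one reading of the description, not the authors' implementation: it assumes fully connected layers of sizes 19, 64, 32, 8 and 1, ReLU after every layer except the last, and a sigmoid applied to the single output unit. The weights are random and untrained, the input vector is a placeholder, and the helper names are hypothetical, so the printed value only shows the data flow and has no clinical meaning.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)

# Assumed layout: 19 input features, then layers of 64, 32, 8 and 1 units.
LAYER_SIZES = [19, 64, 32, 8, 1]


def init_layers(sizes: List[int], seed: int = 0) -> List[Layer]:
    """Create small random weights and zero biases (illustrative, untrained)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers: List[Layer], x: List[float]) -> float:
    """ReLU on all layers but the last; sigmoid on the final single unit."""
    h = x
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            h = [max(0.0, value) for value in z]
        else:
            h = [1.0 / (1.0 + math.exp(-value)) for value in z]
    return h[0]


def count_parameters(sizes: List[int]) -> int:
    """Weights plus biases of every fully connected layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


layers = init_layers(LAYER_SIZES)
x = [0.5] * 19  # placeholder for 19 scaled clinical features
p = forward(layers, x)
print(f"output in (0, 1): {p:.3f}")
print("trainable parameters:", count_parameters(LAYER_SIZES))  # 3633
```

Under these assumptions the network has 3633 trainable parameters, which is small relative to image models but still large compared with the 236 training and validation subjects, a consideration when comparing it with a 5-leaf decision tree.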
Evaluation. The performance of the models was assessed by 10-fold cross validation (10-fold CV) and external tests. Specifically, we randomly divided the training and validation datasets into 10 parts: 9 parts were used to train the algorithms and 1 part was used to estimate the prediction performance. This process was repeated 10 times, and the mean AUC and accuracy over the 10-fold CV were calculated as indicators of prediction accuracy. Furthermore, we verified the prediction accuracy of the models on the external test dataset by evaluating the receiver operating characteristic (ROC) curves, classification accuracy, F-measure, sensitivity and specificity.
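The 10-fold procedure above can be sketched as follows. The class sizes (189 non-ARDS, 47 ARDS) come from the training and validation datasets; everything else is an assumption made for illustration. In particular, the split here is stratified by class, whereas the text only says the data were divided randomly, and the "model" is a majority-class placeholder standing in for a real classifier.

```python
import random
from statistics import mean, stdev
from typing import List


def stratified_folds(labels: List[int], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# 0 = non-ARDS, 1 = ARDS, sizes taken from the training and validation datasets.
labels = [0] * 189 + [1] * 47
folds = stratified_folds(labels, k=10)

scores = []
for held_out, val_idx in enumerate(folds):
    # The 9 remaining folds form the training part for this round.
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # Placeholder model: always predict the majority class of the training part.
    positives = sum(labels[i] for i in train_idx)
    majority = 1 if 2 * positives > len(train_idx) else 0
    accuracy = sum(1 for i in val_idx if labels[i] == majority) / len(val_idx)
    scores.append(accuracy)

print(f"accuracy over 10 folds: {mean(scores):.3f} ± {stdev(scores):.3f}")
```

Replacing the placeholder with a fitted classifier and collecting AUC alongside accuracy in the same loop would give the mean ± std. figures of the kind reported for the cross validation.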
Application development. The best algorithm for ARDS risk prediction was embedded into EMR and could be accessed via the link https://ai-ards.rubikstack.com/#/login. The Anaconda Distribution (Anaconda Inc, Austin, Texas), Visual Studio Code version 1.45.1 (Microsoft, Redmond, Washington), and Python version 3.6 (Python Software Foundation, Wilmington, Delaware) were used for data analysis, model creation, and web application development.
Ethics approval and consent to participate. This study has been approved by the ethics committee of ShuLan (Hangzhou) Hospital. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Written informed consent was obtained during hospitalization from patients or their parents. The data used in this study were anonymized before use.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.