Development and validation of a hybrid deep learning–machine learning approach for severity assessment of COVID-19 and other pneumonias

Coronavirus disease 2019 (COVID-19) is transitioning into the endemic phase. Nonetheless, it is crucial to remain mindful that pandemics related to infectious respiratory diseases (IRDs) can emerge unpredictably. Therefore, we aimed to develop and validate a severity assessment model for IRDs, including COVID-19, influenza, and novel influenza, using CT images from a multi-centre data set. Of the 805 COVID-19 patients collected from a single centre, 649 were used for training and 156 were used for internal validation (D1). Additionally, three external validation sets were obtained from 7 cohorts: 1138 patients with COVID-19 (D2), and 233 patients with influenza and novel influenza (D3). A hybrid model, referred to as Hybrid-DDM, was constructed by combining two deep learning models and a machine learning model. Across datasets D1, D2, and D3, the Hybrid-DDM exhibited significantly improved performance compared to the baseline model. The areas under the receiver operating characteristic curves (AUCs) were 0.830 versus 0.767 (p = 0.036) in D1, 0.801 versus 0.753 (p < 0.001) in D2, and 0.774 versus 0.668 (p < 0.001) in D3. This study indicates that the Hybrid-DDM model, trained using COVID-19 patient data, is effective and can also be applied to patients with other types of viral pneumonia.

Since the outbreak of coronavirus disease 2019 (COVID-19) in 2019, the pandemic has had a profound and widespread impact, resulting in a significant increase in mortality globally. In particular, elderly individuals and those with severe underlying medical conditions are at a higher risk of experiencing severe complications [1][2][3][4][5]. This alarming context highlighted the pressing need for a robust severity assessment system to guarantee appropriate care for severe patients. Fortunately, COVID-19 is now transitioning into the endemic phase. Nonetheless, it is crucial to remain mindful that pandemics related to infectious respiratory diseases (IRDs) can emerge unpredictably. This reality underscores the importance of not only managing the current situation but also preparing for other IRDs such as viral pneumonia (VP) and bacterial pneumonia (BP) to be better equipped for potential future outbreaks.
The presence of multifocal ground-glass opacity (GGO), consolidation, reticular opacity, and crazy-paving pattern in the lung fields is frequently observed in patients diagnosed with pneumonia, with GGO and consolidation being the most prevalent findings [6][7][8][9]. As evidenced by Park et al., the diagnostic performance for patients can be enhanced by taking various characteristics of lung abnormalities, such as GGO and consolidation, into consideration 10. In effect, COVID-19 severity assessment algorithms primarily focus on pulmonary involvement when utilizing computed tomography (CT) scans, as the severity of the disease in patients with COVID-19 can be determined by analyzing the extent of lung involvement 11. Furthermore, numerous studies have reported a correlation between the quantitative measurement of lung involvement in CT scans and laboratory findings as well as clinical parameters, often using machine learning (ML) techniques or a stratified scoring system for the analysis [12][13][14]. For instance, Lessmann et al. reported that the severity of COVID-19 patients can be determined from CT images by calculating the percentage of affected lung tissue per lobe 15. Wenli et al. demonstrated that texture features for lesion volume and non-lesion lung volume are instrumental in determining the severity of COVID-19 16. In this study, we limit our focus to the most common patterns of lesions, GGO and consolidation, as previously reported in relevant literature.
To date, several studies have utilized deep learning (DL) for the severity assessment of COVID-19 patients. Zhang et al. demonstrated the application of two imaging biomarkers derived from lung field and lung abnormality segmentation models for the severity assessment of COVID-19 patients 17. Similarly, Goncharov et al. leveraged DL to generate a segmentation mask and calculate the affected lung percentage for severity assessment purposes 18. Chieregato et al. presented a method that combined laboratory and clinical data with imaging features 19. They extracted imaging features from CT scans using a DL model and integrated them into a CatBoost ML model, along with tabular data. Gao et al. proposed a dual-branch combination network that leverages lesion segmentation information for DL model training, focusing on the lesion area while simultaneously performing lesion segmentation and COVID-19 prediction 20. As such, most studies have designed a severity assessment model utilizing either lung-masked or lesion-masked CT images 21. To our knowledge, no study has yet developed a severity assessment model using both types of masked CT images and combined the DL model with quantitative features obtained from the lung area.
Consequently, this study introduces a hybrid approach that combines two DL models and one ML model, the latter trained using quantitative features. Notably, the proposed model underwent rigorous external validation that included not only COVID-19 patients but also patients with other types of IRDs, including influenza, novel influenza, and BP.

Methods
Patient population. All research in this study was performed in accordance with relevant guidelines and regulations, and all experimental protocols were approved by the Institutional Review Boards of Samsung Medical Center, Pusan National University Hospital, Chonnam National University Hospital, Keimyung University Dongsan Medical Center, Chungnam National University Hospital, Gachon University Gil Medical Center, Kyungpook National University Hospital, and Chungnam National University Sejong Hospital, which also waived written informed consent for this study. This study included data from 1243 patients diagnosed with COVID-19 and admitted to a referral hospital. Data from 438 patients who did not undergo CT imaging were excluded from the analysis. The remaining 805 patients were randomly divided into two groups, allocating 80% of the participants for training and 20% for internal validation (referred to as D1). Three additional external data sets were collected from 7 different external cohorts. The external validation sets consisted of 1138 patients diagnosed with COVID-19 (referred to as D2), 233 patients with influenza and novel influenza (referred to as D3), and 268 patients with BP (referred to as D4), respectively. Patients were classified as severe or non-severe based on admission to the intensive care unit, mortality, and whether they received at least one of four specific treatments: steroid injection, oxygen supply, mechanical ventilation, or extracorporeal membrane oxygenation. A data flow diagram and the clinical characteristics of the patients are detailed in Fig. 1 and Table 1, respectively. Representative CT images from both non-severe and severe COVID-19 cases are provided in Fig. 2.

CT acquisition parameters.
The CT acquisition parameters for the dataset from one referral hospital were as follows: 34-782 mAs, 100-150 kVp, 0.625-5.00 mm slice thickness, and 0.35-0.97 mm pixel size. For the external validation sets, the CT acquisition parameters were unavailable due to data anonymization. S1. In this study, we defined this feature set as quantitative Lung Involvement Features (LIFe). Using the LIFe, we constructed a random forest (RF) 29 model. We refer to this RF model as ML LIFe because RF is an ML algorithm.
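As a rough illustration of what a lesion-proportion feature looks like, the following minimal sketch computes the fraction of lung voxels labelled as GGO or consolidation within one region. It follows only the Discussion's description of LIFe as the proportion of distinct lesion types; the label encoding, the function name `lesion_proportions`, and the flattened-list input are assumptions made for this example and are not the authors' implementation, which uses 33 features whose exact definitions are not reproduced here.

```python
from collections import Counter

# Hypothetical voxel label encoding (an assumption for this sketch).
BACKGROUND, NORMAL_LUNG, GGO, CONSOLIDATION = 0, 1, 2, 3

def lesion_proportions(voxel_labels):
    """Return the share of lung voxels occupied by each lesion type.

    voxel_labels: flat iterable of integer labels for one lung region.
    Background voxels are ignored; the denominator is all lung voxels.
    """
    counts = Counter(voxel_labels)
    lung = counts[NORMAL_LUNG] + counts[GGO] + counts[CONSOLIDATION]
    if lung == 0:
        # Empty region: report zero involvement instead of dividing by zero.
        return {"ggo": 0.0, "consolidation": 0.0, "total": 0.0}
    return {
        "ggo": counts[GGO] / lung,
        "consolidation": counts[CONSOLIDATION] / lung,
        "total": (counts[GGO] + counts[CONSOLIDATION]) / lung,
    }

# Toy region: 10 background, 6 normal lung, 3 GGO, 1 consolidation voxel.
region = [BACKGROUND] * 10 + [NORMAL_LUNG] * 6 + [GGO] * 3 + [CONSOLIDATION]
features = lesion_proportions(region)
```

Repeating such a computation over several lung regions would yield a fixed-length feature vector that a tabular model such as a random forest can consume.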

Combination of DL and ML models.
In this study, we propose a hybrid model called Hybrid-DDM that combines two DL models (DL lung and DL lesion) and one ML model (ML LIFe). The Hybrid-DDM estimate was obtained by uniformly averaging the severity estimates from the three models. 9.7.0.1190202 (R2019a). Natick, Massachusetts: The MathWorks Inc.), statistical analysis (R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/). The severity assessment performance was evaluated with the area under the receiver operating characteristic (ROC) curve (AUC), and DeLong's method 30 was used to compare two AUC values. A p-value lower than 0.05 was considered statistically significant.
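The uniform averaging step and the AUC evaluation can be sketched as follows. This is a minimal illustration under stated assumptions: the severity estimates are taken to be per-patient probabilities in [0, 1], the numbers are invented toy values, and the helper names `uniform_average` and `auc` are hypothetical. The AUC is computed with the rank-based (Mann-Whitney) formulation; DeLong's comparison of two AUCs is not implemented here.

```python
from statistics import fmean

def uniform_average(*model_estimates):
    """Equal-weight average of per-patient severity estimates from several models."""
    return [fmean(per_patient) for per_patient in zip(*model_estimates)]

def auc(labels, scores):
    """AUC as the probability that a severe case outscores a non-severe one.

    Ties between a severe and a non-severe score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy severity estimates for five patients (1 = severe); values are invented.
y_true = [0, 0, 1, 1, 0]
p_dl_lung = [0.2, 0.6, 0.7, 0.4, 0.1]
p_dl_lesion = [0.3, 0.2, 0.8, 0.7, 0.3]
p_ml_life = [0.1, 0.4, 0.6, 0.7, 0.2]

p_hybrid = uniform_average(p_dl_lung, p_dl_lesion, p_ml_life)
auc_lung = auc(y_true, p_dl_lung)
auc_hybrid = auc(y_true, p_hybrid)
```

In this toy case the averaged estimate ranks every severe patient above every non-severe one even though one individual model does not, which is the kind of error cancellation an equal-weight ensemble relies on.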

Results
The AUC and ROC curves of the Hybrid-DDM and the baseline models in each validation set are shown in Fig. 4a,b. More detailed results are summarized in Table 2.
The ML LIFe showed comparable performance to the DL lung−lesion with AUCs of 0.811 (0.737-0.885), 0.777 (0.750-0.804), and 0.754 (0.683-0.825) in D1, D2, and D3, respectively. Figure 5 shows the feature importance of the 33 features used in the ML LIFe. Feature importance in the RF model measures the contribution of each feature to the model's predictions. It is quantified by the degree to which the model's performance decreases when the values of a particular feature are randomly permuted, thereby disrupting the relationship between the feature and the label. Consequently, a feature is considered highly important if its random permutation leads to a significant decline in the model's performance, indicating that the model relies heavily on this feature for accurate predictions 29. In our results, GGO-related features tended to have higher importance than consolidation-related features. Box plots of the 33 features for severe and non-severe patients are shown in Fig. 6. The features are listed in descending order of feature importance from left to right. In all datasets, severe patients tend to have larger feature values (the proportion of lesions) than non-severe patients. There is also a tendency for feature values to be larger in the lower lobes (L3, L4, R3, and R4) than in the upper lobes (L1, L2, R1, and R2).
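The permutation-based importance described above can be sketched in a few lines. This is a generic, model-agnostic illustration and not the authors' code: the `predict` function is a hand-written stand-in for a trained random forest, accuracy is used as the performance measure for simplicity, and the data are synthetic.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link to the label.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Synthetic data: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def predict(row):
    # Stand-in for a trained model; it only looks at feature 0.
    return int(row[0] > 0.5)

importances = permutation_importance(predict, X, y)
```

Shuffling the noise feature leaves the predictions unchanged, so its importance is exactly zero, while shuffling the informative feature reduces accuracy to roughly chance level.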

Discussion
The objective of this study was to develop an automated severity assessment model for patients with COVID-19 and other IRDs using CT images. To this end, we developed a Hybrid-DDM model that combined two DL models and one ML model. Our investigation yielded three crucial insights. Firstly, training a model with lung-masked or lesion-masked CT images improved the performance of severity assessment of patients with COVID-19, as compared to training solely with lung-cropped CT images. Secondly, the integration of two DL models with an ML model improved the performance of the severity assessment model. Thirdly, while the Hybrid-DDM model demonstrated significant effectiveness for patients with VP, it was not similarly effective for patients with BP.
The expedient severity assessment of patients with COVID-19 constituted a vital component of patient care and mandated immediate attention. Utilizing CT imaging to assess disease severity in patients with IRDs provides clinicians with critical insights into the progression of the disease and potential responses to treatment, thereby enabling timely and tailored therapeutic interventions to enhance patient outcomes. Although the COVID-19 pandemic is currently in decline and transitioning into an endemic phase, it remains essential to prepare for future pandemics given their unpredictable nature. Consequently, it is crucial to establish reliable severity assessment systems for other IRDs. Numerous studies have risen to this challenge, exploring an array of techniques for severity assessment of patients with COVID-19 and other IRDs 31. Specifically, many studies have adopted DL techniques that utilize lung or lesion information as guiding features during model training 20,32,33. Our study aimed to assess the merits of employing differently pre-processed CT images for training DL models. Leveraging lung-masked CT images can be beneficial as it incorporates perilesional areas, which may encompass opacity, ambiguous lesions, and bronchus regions that are not discernible in lesion-masked CT images. Models trained on such comprehensive characteristics could estimate the severity of IRDs based on overall texture-based features instead of solely on lesion volume. In contrast, the use of lesion-masked CT images allows the model to concentrate on lesion volume, texture, and shape, thus facilitating the determination of disease severity based on the presence and extent of lesions within the lung.
The combination of DL lung and DL lesion demonstrated superior AUC compared to individual models for both the internal and external validation sets of patients with COVID-19 (0.812 vs 0.790 and 0.807 in D1, and 0.794 vs 0.756 and 0.788 in D2, respectively). This improvement was also evident in the external validation set of patients with influenza and novel influenza (0.765 vs 0.755 and 0.732 in D3), substantiating the effectiveness of combining DL lung and DL lesion in severity assessment of patients with VP. Furthermore, the baseline model trained without any disease-related guidance was found to be inferior to both DL lung and DL lesion.
In clinical practice, the severity of COVID-19 and other IRDs can often be determined by evaluating the extent of lung involvement 15,16. In order to quantitatively assess disease severity, we formulated a set of quantitative features, termed LIFe, which we subsequently utilized to construct an ML model. LIFe calculates the proportion of distinct lesion types, including GGO and consolidations. An added benefit of LIFe is that it eliminates the need for a lobe-specific segmentation mask, as it can be derived directly from the comprehensive lung segmentation mask. Our investigation showed that an ML model trained with LIFe features performed comparably to the DL models. It is noteworthy that most lesions in patients with COVID-19 are typically located in the lower lobe of the lung 34,35. Consistently, our study also found that LIFe values for patients with IRDs were higher in the lower lobe than in the upper lobe. In particular, the difference in GGO-related LIFe values between severe and non-severe patients was more marked than the difference in consolidation-related values. This insight guided us towards the development of a severity assessment model premised on measuring the proportion of lesions in bronchopulmonary segments, diverging from DL models that rely on nonlinear features extracted from the entire lung or specific lesions. However, a reliance solely on LIFe values might overlook other radiological features present in CT images. By incorporating DL models into the ML model, we were able to improve our severity assessment performance, as measured by the AUC, from 0.811 to 0.830 for D1, from 0.777 to 0.801 for D2, and from 0.754 to 0.774 for D3. However, our approach did not prove effective for patients with BP. Even when incorporating prior information like lung masks, lesion masks, or LIFe values, the model's performance did not differ from DL models using lung-cropped CT images. This can be attributed to our training data being derived from patients with COVID-19, where one severe complication is VP 36. Clinically, VP and BP exhibit distinct characteristics such as symptoms, disease severity, and radiological findings, which have been the subject of numerous studies 37,38. In fact, we observed that the difference in LIFe values between severe and non-severe patients was smaller in BP patients than in VP patients. Thus, applying a severity assessment model trained on VP patients to BP patients would be unsuitable. Although our investigation featured a multi-institutional and multi-disease validation approach, the model fell short in enhancing performance for patients with BP. This underscores the need to develop more sophisticated and more broadly applicable methods that could also benefit patients with BP.
Our study had some limitations. Firstly, while our study includes data from multiple diseases, further validation should be performed on a variety of infectious respiratory diseases to ensure its applicability. Secondly, our research was conducted retrospectively, and prospective validation is required to strengthen its credibility. Thirdly, both internal and external datasets consist solely of patients from South Korea, so additional research is needed to extend its implications to other nations. Lastly, our model's practical utility could be improved if it were able to distinguish between VP and BP using CT images, laboratory findings, or clinicopathological information.
Despite these limitations, our findings have shown that training DL models on CT images masked with either lung or lesion information, and subsequently integrating them with an ML model, significantly enhanced the performance of the severity assessment model for patients with VP. In the future, incorporating additional clinicopathological information such as age, gender, smoking history, or symptoms like cough, fever, and chills could further improve the model's performance for the severity assessment of patients with IRDs.

Figure 5. Feature importance of the random forest model (ML LIFe). A total of 33 features were used for model training.

Figure 6. Visualization of the LIFe for severe and non-severe patients as box plots. The LIFe features are listed in descending order of feature importance from left to right. (a) Internal validation set of patients with COVID-19 (n = 156). (b) External validation set of patients with COVID-19 (n = 1138). (c) External validation set of patients with influenza and novel influenza (n = 233). (d) External validation set of patients with bacterial pneumonia (n = 268).

Table 1. Demographic and pathological characteristics. COVID-19, coronavirus disease 2019; ECMO, extracorporeal membrane oxygenation; ICU, intensive care unit. *p-values were calculated by comparing with the training set using the t-test or the chi-squared test. † Mean ± standard deviation. Column headers: Characteristics; No. (%) for training and internal validation; No. (%) for external validation.

Table 2. Performance of the severity assessment model for patients with COVID-19 and three other types of pneumonia. Cut-off points for sensitivity and specificity were determined using the validation data of the training set (not the same as D1). COVID-19, coronavirus disease 2019; AUC, area under the receiver operating characteristic curve; CI, confidence interval; SEN, sensitivity; SPE, specificity. *p-values were calculated by DeLong's method.