Deep learning model for the automatic classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy: a multi-center retrospective study

This retrospective study aimed to develop and validate a deep learning model for the classification of coronavirus disease 2019 (COVID-19) pneumonia, non-COVID-19 pneumonia, and the healthy using chest X-ray (CXR) images. One private and two public datasets of CXR images were included. The private dataset included CXR images from six hospitals. A total of 14,258 and 11,253 CXR images were included in the two public datasets, and 455 in the private dataset. A deep learning model based on EfficientNet with noisy student was constructed using the three datasets. The test set of 150 CXR images in the private dataset was evaluated by the deep learning model and six radiologists. The three-category classification accuracy and the class-wise area under the curve (AUC) for COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy were calculated. The consensus of the six radiologists was used to calculate the class-wise AUC. The three-category classification accuracy of our model was 0.8667, and those of the six radiologists ranged from 0.5667 to 0.7733. For our model and the consensus of the six radiologists, the class-wise AUCs of the healthy, non-COVID-19 pneumonia, and COVID-19 pneumonia were 0.9912, 0.9492, and 0.9752 and 0.9656, 0.8654, and 0.8740, respectively. The difference in the class-wise AUC between our model and the consensus of the six radiologists was statistically significant for COVID-19 pneumonia (p value = 0.001334). Thus, an accurate deep learning model for the three-category classification could be constructed; the diagnostic performance of our model was significantly better than that of the consensus interpretation by the six radiologists for COVID-19 pneumonia.

Proposed DL model. EfficientNet20 was used as our DL model. Using EfficientNet B5 pretrained with noisy student21, transfer learning was performed for the automatic classification of CXR images of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy. The implementation of our DL model was based on the open-source software (https://github.com/jurader/covid19_xp) of a prior study10. While VGG1622 was used as the pretrained model in the prior study10, EfficientNet with noisy student was used in the current study. The outline of the DL model is shown in Fig. 1, and its details are described in the Supplementary information. Grad-CAM was used for visual explanation of the diagnoses made by our DL model23.

Datasets. CXR images with anterior-posterior or posterior-anterior views from two public datasets and one private dataset were used in the current study. One public dataset was the COVIDx dataset12,24. The other public dataset was constructed from two public datasets: the PadChest dataset25,26 and the BIMCV-COVID19+ dataset27,28. Hereafter, we refer to this second public dataset as COVID-BIMCV. CXR images of the private dataset (COVID-private) were retrospectively collected from six hospitals. The details of the three datasets are described in the Supplementary information. Table 1 shows the total number of CXR images and the numbers of CXR images of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy in the COVIDx, COVID-BIMCV, and COVID-private datasets. The total number of CXR images was 14,258, 11,253, and 455 in the COVIDx, COVID-BIMCV, and COVID-private datasets, respectively. The number of COVID-19 pneumonia cases was 617, 1475, and 177 in the COVIDx, COVID-BIMCV, and COVID-private datasets, respectively.
The patient characteristics of the COVID-private dataset are shown in Table 2. The numbers of CXR images of the healthy, non-COVID-19 pneumonia, and COVID-19 pneumonia in the COVID-private dataset were 139, 139, and 177, respectively. The COVID-private dataset included 198 males and 257 females, aged 61.0 ± 18.6 years. The examination dates of CXR in the COVID-private dataset ranged from January 13th, 2015 to December 22nd, 2020.

Dataset splitting and model training. Since the development set and test set were already defined for the COVIDx dataset, they were used as-is in the current study. For the COVID-BIMCV and COVID-private datasets, 100 and 50 CXR images, respectively, were randomly selected as the test set for each of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy; the remaining CXR images were used as the development sets. Thus, the number of CXR images in the development set was 13,958, 10,953, and 305 in the COVIDx, COVID-BIMCV, and COVID-private datasets, respectively. The test set size was 300 in the COVIDx and COVID-BIMCV datasets and 150 in the COVID-private dataset. The development set was further divided into a training set and a validation set for each dataset. The validation set size was 300 in the COVIDx and COVID-BIMCV datasets and 90 in the COVID-private dataset. A combined training set was constructed from the training sets of the three datasets for training the DL model. For each dataset, five different random divisions of the development set into training and validation sets were performed. Based on the five random divisions, model training with transfer learning and performance validation were performed, yielding five different trained models. To predict the diagnosis for a CXR image in the test set, an ensemble of the five trained models was used.
A schematic illustration of the dataset splitting, model training, and prediction with our DL model is shown in Fig. 2.
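The split-and-ensemble procedure above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the sizes follow those reported for the COVID-private dataset (305 development images, 90 validation images), and `ensemble_predict` averages the class probabilities of the five trained models, which is one common ensembling choice (the exact ensembling rule is described in the paper's Supplementary information).

```python
import numpy as np

def five_random_splits(n_dev, val_size, seeds=(0, 1, 2, 3, 4)):
    """Return five (train_idx, val_idx) divisions of a development set,
    one per random seed, mirroring the five random divisions in the text."""
    splits = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        perm = rng.permutation(n_dev)
        splits.append((perm[val_size:], perm[:val_size]))
    return splits

def ensemble_predict(per_model_probs):
    """Average class probabilities over the trained models and take the
    argmax as the ensemble diagnosis. per_model_probs is a list of
    (n_images, 3) arrays, one per trained model."""
    mean_probs = np.stack(per_model_probs).mean(axis=0)
    return mean_probs.argmax(axis=1), mean_probs
```

Each of the five splits trains one model on its training indices; at test time the five models' probability outputs are passed together to `ensemble_predict`.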
Comparison with other DL models. Three code-available DL models were used for comparison. The first was the COVID-Net model trained with the COVIDx dataset12; its pretrained model is available at https://github.com/lindawangg/COVID-Net (COVIDNet-CXR4-A). The second was the DL model of Sharma et al.11, whose pretrained model is available at https://github.com/arunsharma8osdd/covidpred (Combined model 3 [101 epochs]). The third was DarkCovidNet9, available at https://github.com/muhammedtalo/COVID-19. Since the pretrained model of DarkCovidNet was unavailable, we trained it from scratch.
Observer study by the radiologists. To compare our DL model with radiologists' diagnostic ability, an observer study was performed with six radiologists (their experience ranged from 10 months to 15 years). The radiologists visually evaluated the CXR images of the test set of the COVID-private dataset and assigned each image to one of the three categories: COVID-19 pneumonia, non-COVID-19 pneumonia, or the healthy. Apart from the CXR images themselves, the radiologists were blinded to all clinical information of the test set of the COVID-private dataset. Since the combined training set used for our DL model was too large for the radiologists to review, the development set of the COVID-private dataset was provided for the radiologists' training before the observer study. The training and interpretation time were not limited.
Performance evaluation. For our DL model, performance was evaluated using the classification metrics of the three-category classification (class-wise precision, recall, and F1-score, and three-category classification accuracy) in the three test sets29. For the radiologists and the code-available DL models, the same evaluation was conducted in the test set of the COVID-private dataset with 150 CXR images. In addition, the class-wise area under the curve (AUC) of the receiver operating characteristic (ROC) analysis was calculated for COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy29. For the ROC analysis of the radiologists, a consensus interpretation score of the six radiologists was determined by majority voting of the individual interpretations14; the score ranged from 0 to 6.

Statistical analysis. The 95% confidence intervals (CI) of the classification metrics were calculated using 2000 bootstrap samples14. In addition, the class-wise AUC was compared between our DL model and the consensus interpretation of the radiologists using DeLong's test. To control the family-wise error rate, Bonferroni correction was used; a p value less than 0.01666 was considered statistically significant. Statistical analyses were performed using the scikit-learn package30 of Python and the pROC package31 of R (version 4.0.4, https://www.r-project.org/).

Results
Table 3 shows the diagnostic performance of the four DL models, including our DL model, and the six radiologists in the test set of the COVID-private dataset. The three-category classification accuracy of our DL model was 0.8667 (130/150), and those of the six radiologists ranged from 0.5667 (85/150) to 0.7733 (116/150). The 95% CIs of the three-category classification accuracy were 0.8067-0.9200 for our DL model and 0.7067-0.8400 for the radiologist with the best accuracy (Radiologist 3).
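As a sketch of this evaluation protocol, the snippet below computes class-wise precision, recall, and F1, the three-category classification accuracy, and a 2000-sample bootstrap 95% CI for the accuracy using scikit-learn and NumPy. It is a minimal illustration under an assumed label encoding (0 = healthy, 1 = non-COVID-19 pneumonia, 2 = COVID-19 pneumonia), not the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def classwise_metrics(y_true, y_pred, labels=(0, 1, 2)):
    """Class-wise precision, recall, F1, plus three-category accuracy."""
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(labels), zero_division=0)
    return prec, rec, f1, accuracy_score(y_true, y_pred)

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """95% CI of accuracy from n_boot bootstrap resamples of the test set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    accs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        accs[b] = accuracy_score(y_true[idx], y_pred[idx])
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

For the radiologists' ROC analysis, the consensus score of an image for a given class can be taken as the number of radiologists (0 to 6) who voted for that class and passed to `sklearn.metrics.roc_auc_score` as a continuous score.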
The three-category classification accuracy of our DL model was better than those of the six radiologists. For our DL model, the class-wise F1-scores of the healthy and COVID-19 pneumonia were higher than that of non-COVID-19 pneumonia, indicating that our DL model diagnosed the healthy and COVID-19 pneumonia better than non-COVID-19 pneumonia. In contrast, for the six radiologists, the class-wise F1-score of the healthy was higher than those of COVID-19 pneumonia and non-COVID-19 pneumonia; hence, their diagnostic performance for the healthy was higher than that for COVID-19 and non-COVID-19 pneumonia. The three-category classification accuracies of the three code-available DL models were 0.6467 (97/150), 0.4267 (64/150), and 0.4000 (60/150); COVID-Net12 achieved the highest accuracy among the three. Although the three-category classification accuracy of COVID-Net (0.6467) was comparable to those of the six radiologists, those of the other two code-available DL models (0.4267 and 0.4000) were worse than those of the six radiologists.

Table 4 shows the class-wise AUC and its 95% CI of our DL model in the test sets of the COVIDx, COVID-BIMCV, and COVID-private datasets, as well as those of the consensus of the six radiologists in the test set of the COVID-private dataset. Figure 3 shows the class-wise ROC curves of our DL model and the consensus of the six radiologists in the test set of the COVID-private dataset. The class-wise AUC and its 95% CI of our DL model were as follows: 0.9914 and 0.9837-0.9990 for the healthy, 0.9772 and 0.9601-0.9942 for non-COVID-19 pneumonia, and 0.9934 and 0.9871-0.9996 for COVID-19 pneumonia.
The class-wise AUC and its 95% CI of the consensus of the six radiologists were as follows: 0.9656 and 0.9401-0.9911 for the healthy, 0.8654 and 0.8022-0.9286 for non-COVID-19 pneumonia, and 0.8740 and 0.8164-0.9316 for COVID-19 pneumonia. The difference in the class-wise AUC between our DL model and the consensus of the six radiologists was statistically significant for COVID-19 pneumonia (p value = 0.001334); the differences were not statistically significant for the healthy and non-COVID-19 pneumonia.

Table 3. Class-wise precision, recall, F1-score, and three-category classification accuracy of the four DL models and the six radiologists in the COVID-private dataset. Each cell includes the classification metric and its 95% CI (lower and upper bounds). * indicates three-category classification accuracy. The experience of the six radiologists was 10 months and 4, 7, 10, 10, and 15 years. The underlined values represent the best value in each column. DL, deep learning; CI, confidence interval; COVID-private, private dataset collected from six hospitals.

Figure 4 shows the CXR images and the Grad-CAM results for the healthy, non-COVID-19 pneumonia, and COVID-19 pneumonia. The Grad-CAM result in Fig. 4A illustrates that our DL model focused on non-specific areas for the diagnosis of the healthy. Figure 4B shows that our DL model focused on the infiltration shadow of the right lung field for the diagnosis of non-COVID-19 pneumonia. Figure 4C shows that our DL model focused on the ground-glass shadow in the peripheral areas of both lung fields for the diagnosis of COVID-19 pneumonia.
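A minimal Grad-CAM, in the spirit of these visualizations, can be written with PyTorch hooks as below. This is an illustrative sketch (the study used an existing Grad-CAM implementation), applicable to any convolutional classifier by pointing `target_layer` at the last convolutional block:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Grad-CAM: weight the target layer's feature maps by the spatially
    averaged gradients of the chosen class score, ReLU the weighted sum,
    and upsample the map to the input resolution."""
    store = {}

    def save_activation(module, inputs, output):
        store["act"] = output

    def save_gradient(module, grad_input, grad_output):
        store["grad"] = grad_output[0]

    h1 = target_layer.register_forward_hook(save_activation)
    h2 = target_layer.register_full_backward_hook(save_gradient)
    try:
        logits = model(image)                          # image: (1, C, H, W)
        model.zero_grad()
        logits[0, class_idx].backward()
        act, grad = store["act"], store["grad"]        # (1, K, h, w)
        weights = grad.mean(dim=(2, 3), keepdim=True)  # per-channel weights
        cam = F.relu((weights * act).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:],
                            mode="bilinear", align_corners=False)
        return (cam / (cam.max() + 1e-8)).squeeze().detach()
    finally:
        h1.remove()
        h2.remove()
```

Overlaying the returned map on the input CXR highlights the regions that most increased the score of the predicted class, as in Fig. 4.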

Discussion
The results of this study indicate that an accurate DL model can be constructed using the two public datasets (COVIDx and COVID-BIMCV) and one private dataset (COVID-private). Our deep learning model based on EfficientNet with noisy student achieved an accurate diagnosis of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy. The three-category classification accuracy of our model was 0.8667, and those of the six radiologists ranged from 0.5667 to 0.7733. The difference in the class-wise AUC between our model and the consensus of the six radiologists was statistically significant for COVID-19 pneumonia (p value = 0.001334). Using the two public datasets and one private dataset, our DL model achieved higher diagnostic performance than the three code-available DL models and the six radiologists. In particular, for COVID-19 pneumonia, the class-wise AUC of our DL model was significantly higher than that of the consensus of the six radiologists. In DL, a large dataset is necessary for accurate classification. While COVID-Net used more than 10,000 CXR images for model development and evaluation12, we used more than 20,000 CXR images for our DL model. We believe that the dataset size was a major factor in the diagnostic performance of our DL model. Another reason for the superiority of our DL model could be the use of a pretrained model constructed with noisy student21. Noisy student is a relatively new method for increasing the robustness of a DL model; the pretrained EfficientNet20 with noisy student could have improved our DL model.

Table 4. Class-wise AUC and its 95% CI of our DL model and the consensus of the six radiologists. DL, deep learning; CI, confidence interval; AUC, area under the curve; COVIDx, public dataset used for COVID-Net; COVID-BIMCV, public dataset obtained from the PadChest and BIMCV-COVID19+ datasets; COVID-private, private dataset collected from six hospitals.
The results of the three code-available DL models demonstrate that their classification metrics were not satisfactory. Although the three-category classification accuracy of COVID-Net was the highest among the three DL models, the F1-score of COVID-Net was the worst for COVID-19 pneumonia. For the other two models, the three-category classification accuracy was lower than those of the six radiologists. Many studies have used DL models for the automatic classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy using CXR images7-14,18,19. Table 5 summarizes these previous studies.

The three-category classification accuracy of the six radiologists ranged from 0.5667 to 0.7733. There was large variability in the diagnostic performance of the radiologists in the classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy using CXR images. Conversely, this indicates that the radiologists' diagnostic performance could be improved using our DL model. The effectiveness of our DL model as a computer-aided diagnosis system should be evaluated in future studies.
There are certain limitations to our study. First, although our DL model was developed and validated using two public datasets and one private dataset, it was not evaluated with an external validation set. The clinical usefulness of our DL model should be further evaluated by external validation32. Second, our DL model focused on the three-category classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy. The DL model ignored lung cancer and other diseases, which are considered important targets of detection on CXR images; this three-category classification may therefore be considered unnatural from a clinical viewpoint. However, we speculate that it was justified owing to the higher priority of the three-category classification during the COVID-19 pandemic. Third, our observer study was conducted on CXR images obtained from relatively large hospitals. Since CXR can be performed in various hospitals and clinics, further studies are warranted to determine whether our DL model is effective in small hospitals and clinics; the outputs of our DL model may need to be adjusted to the circumstances in which it is used. Fourth, we focused on the automatic classification itself, and the diagnostic performance of radiologists assisted by our DL model was not evaluated; thus, we did not evaluate the usefulness of our DL model as a computer-aided diagnosis system. If radiologists doubt the results of our DL model, their diagnostic performance may not improve with its use. Therefore, building trust between radiologists and the DL model is crucial for its implementation in clinical practice33. Fifth, although the Grad-CAM results (for example, Fig. 4) could help radiologists comprehend the classification results of our DL model, their effectiveness was not validated in the current study.
In conclusion, it is feasible to construct an accurate DL model for the three-category classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy. The diagnostic performance of our model was significantly better than that of the consensus interpretation by the six radiologists for COVID-19 pneumonia.

Data availability
The private dataset cannot be disclosed because of privacy protection and regulations. The source code of our DL model and the two public datasets are available at https://github.com/jurader/covid19_xp_efficientnet.