Introduction

The novel coronavirus disease (COVID-19) outbreak is caused by a strain of coronavirus known as severe acute respiratory syndrome coronavirus 2, which originated in Wuhan in the Hubei province of China at the end of 20191. After the disease had spread across the world, the World Health Organization declared COVID-19 a pandemic on March 11, 20202. The website of the World Health Organization lists the total number of reported patients with COVID-19 and the associated deaths; at the time of writing this paper, 163,869,893 patients and 3,398,302 deaths had been reported3.

COVID-19 is diagnosed using reverse transcription polymerase chain reaction (RT-PCR) in many clinical situations. However, the sensitivity of RT-PCR for detecting COVID-19 is not very high; for example, one study reported that the sensitivity of RT-PCR (71%) was lower than that of chest computed tomography (98%)4. Owing to this low sensitivity, the effectiveness of chest X-ray (CXR) imaging and computed tomography in the diagnosis of COVID-19 has been investigated5. The combination of CXR and artificial intelligence, such as deep learning (DL)6, has been extensively examined for the automatic diagnosis of COVID-197,8,9,10,11,12,13,14. Because CXR is widely available and relatively inexpensive, the combination of CXR and artificial intelligence could be employed for COVID-19 screening without requiring medical doctors.

Recent advances in DL have shown promising diagnostic performance for the automatic classification of various diseases of the skin, retinal fundus, brain, and other organs6,15,16,17. DL-based automatic diagnosis is reportedly accurate and has performed well in the classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy on CXR images7,8,9,10,11,12,13. Elgendi et al. compared the performance of 17 DL models with and without different geometric augmentations and examined the influence of data augmentation on the automatic classification of COVID-19 pneumonia; their results demonstrated that removing the geometric augmentation steps actually improved the performance of the DL models13. Monshi et al. optimized the data augmentation and the DL hyperparameters for classifying COVID-19 pneumonia, and their proposed CovidXrayNet, based on EfficientNet-B0, achieved state-of-the-art accuracy18. Karakanis et al. proposed a new approach to classifying COVID-19 pneumonia that exploits a conditional generative adversarial network to generate synthetic images for augmenting the limited amount of data; their lightweight (ResNet8-based) DL model achieved competitive performance19. These technical advances make DL models for the classification of COVID-19 pneumonia more accurate and robust. However, the performance of DL models has mainly been investigated using public CXR databases, and comparisons of diagnostic performance between DL models and radiologists have been limited14.

Our study aimed to develop and validate a DL model for the automatic diagnosis of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy using CXR images. To develop and validate our DL model, two public datasets and one private dataset of CXR images were used; the CXR images of the private dataset were collected from six hospitals. To compare diagnostic performance, both our DL model and six radiologists evaluated the CXR images of the private dataset. In addition, code-available DL models for diagnosing COVID-19 were compared with our DL model. The major contributions of this study are as follows: (i) two large public datasets of CXR images were constructed and are available online; (ii) our DL model was validated with CXR images of our private dataset of clinical cases; (iii) the diagnostic performance of our DL model was compared with that of six radiologists.

Methods

This retrospective study was approved by the institutional review boards of six hospitals (Kobe University Graduate School of Medicine, Kobe City Medical Center General Hospital, Kobe City Nishi-Kobe Medical Center, Hyogo Prefectural Kakogawa Medical Center, Kita Harima Medical Center, and Hyogo Prefectural Awaji Medical Center); the requirement for informed consent was waived owing to the retrospective nature of the study. This study complied with the Declaration of Helsinki and the Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan (https://www.mhlw.go.jp/file/06-Seisakujouhou-10600000-Daijinkanboukouseikagakuka/0000080278.pdf).

Proposed DL model

EfficientNet20 was used as our DL model. Using EfficientNet-B5 pretrained with noisy student21, transfer learning was performed for the automatic classification of CXR images into COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy. The implementation of our DL model was based on the open-source software (https://github.com/jurader/covid19_xp) of a prior study10; whereas VGG1622 was used as the pretrained model in the prior study10, EfficientNet with noisy student was used in the current study. The outline of the DL model is shown in Fig. 1, and its details are described in the Supplementary information. Grad-CAM was used for visual explanation of the diagnoses made by our DL model23.

Figure 1
figure 1

Our DL model. Abbreviation: DL, deep learning.
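
As a rough illustration of this transfer-learning setup, the sketch below loads an EfficientNet-B5 backbone pretrained with noisy student and replaces its classification head with a three-class layer. It assumes PyTorch and the timm library (with its tf_efficientnet_b5_ns weights); the optimizer, learning rate, and training loop are illustrative only, and the actual settings are given in the Supplementary information.

```python
# Minimal transfer-learning sketch, assuming PyTorch and timm; hyperparameters are illustrative.
import timm
import torch
from torch import nn, optim

NUM_CLASSES = 3  # the healthy, non-COVID-19 pneumonia, COVID-19 pneumonia

# EfficientNet-B5 weights obtained with the noisy-student procedure;
# num_classes replaces the original 1000-class head with a 3-class layer.
model = timm.create_model("tf_efficientnet_b5_ns", pretrained=True, num_classes=NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

def train_one_epoch(loader, device="cuda"):
    """One epoch of fine-tuning; `loader` yields (batch of CXR tensors, class indices)."""
    model.to(device).train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```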

Datasets

CXR images with anterior-posterior or posterior-anterior views from two public datasets and one private dataset were used in the current study. One public dataset was the COVIDx dataset12,24. The other public dataset was constructed from two public datasets, the PadChest dataset25,26 and the BIMCV-COVID19+ dataset27,28; hereafter, we refer to this second public dataset as COVIDBIMCV. CXR images of the private dataset (COVIDprivate) were retrospectively collected from the six hospitals. The details of the three datasets are described in the Supplementary information.

Table 1 shows the total number of CXR images and the numbers of CXR images of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy in the COVIDx, COVIDBIMCV, and COVIDprivate datasets. The total number of CXR images was 14,258, 11,253, and 455 in the COVIDx, COVIDBIMCV, and COVIDprivate datasets, respectively, and the number of COVID-19 pneumonia cases was 617, 1475, and 177, respectively.

Table 1 Numbers of CXR images in the COVIDx, COVIDBIMCV, and COVIDprivate datasets.

The patient characteristics of the COVIDprivate dataset are shown in Table 2. The numbers of CXR images of the healthy, non-COVID-19 pneumonia, and COVID-19 pneumonia in the COVIDprivate dataset were 139, 139, and 177, respectively. The COVIDprivate dataset included 198 males and 257 females, aged 61.0 ± 18.6 years. The CXR examination dates in the COVIDprivate dataset ranged from January 13th, 2015 to December 22nd, 2020.

Table 2 Patients’ characteristics in the COVIDprivate dataset.

Dataset splitting and model training

Because the development and test sets were predefined for the COVIDx dataset, they were used as provided in the current study. For the COVIDBIMCV and COVIDprivate datasets, 100 and 50 CXR images, respectively, were randomly selected as the test set for each of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy; the remaining CXR images were used as the development sets. Thus, the number of CXR images in the development set was 13,958, 10,953, and 305 in the COVIDx, COVIDBIMCV, and COVIDprivate datasets, respectively, and the test set size was 300 in the COVIDx and COVIDBIMCV datasets and 150 in the COVIDprivate dataset.

The development set of each dataset was further divided into training and validation sets. The validation set size was 300 in the COVIDx and COVIDBIMCV datasets and 90 in the COVIDprivate dataset. A combined training set was constructed from the training sets of the three datasets for training the DL model. For each development set, five different random divisions into training and validation sets were performed, and model training with transfer learning and performance validation were carried out for each division; therefore, five different trained models were obtained. To predict the diagnosis from a CXR image of the test set, an ensemble of the five trained models was used. A schematic illustration of the dataset splitting, model training, and prediction with our DL model is shown in Fig. 2.

Figure 2
figure 2

Schematic illustration of dataset splitting, model training, and prediction with our DL model. Abbreviations: COVIDx, Public dataset used for COVID-Net; COVIDBIMCV, Public dataset obtained from the PadChest dataset and the BIMCV-COVID19+ dataset; COVIDprivate, Private dataset collected from six hospitals.
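
The five random training/validation divisions and the test-time ensemble can be sketched as follows; the split sizes follow the description above, while the function names, the stratified splitting, and the `predict_proba` interface of the trained models are hypothetical.

```python
# Sketch of the five random development-set divisions and the five-model ensemble.
import numpy as np
from sklearn.model_selection import train_test_split

def make_divisions(dev_images, dev_labels, val_size, n_divisions=5, seed=0):
    """Return five random (training, validation) divisions of one development set."""
    divisions = []
    for i in range(n_divisions):
        tr_x, va_x, tr_y, va_y = train_test_split(
            dev_images, dev_labels, test_size=val_size,
            stratify=dev_labels, random_state=seed + i)  # stratification is an assumption
        divisions.append(((tr_x, tr_y), (va_x, va_y)))
    return divisions

def ensemble_predict(models, test_images):
    """Average the class probabilities of the five trained models and take the argmax."""
    probs = np.mean([m.predict_proba(test_images) for m in models], axis=0)
    return probs, probs.argmax(axis=1)
```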

Comparison with other DL models

Three code-available DL models were used for comparison. The first was the COVID-Net model trained with the COVIDx dataset12; its pretrained model is available at https://github.com/lindawangg/COVID-Net (COVIDNet-CXR4-A). The second was the DL model of Sharma et al.11, whose pretrained model is available at https://github.com/arunsharma8osdd/covidpred (Combined model 3 [101 epochs]). The third was DarkCovidNet9, which is available at https://github.com/muhammedtalo/COVID-19. Since no pretrained model of DarkCovidNet was available, it was trained from scratch by the authors.

Observer study by the radiologists

To compare our DL model with the radiologists' diagnostic ability, an observer study was performed with six radiologists whose experience ranged from 10 months to 15 years. The radiologists visually evaluated the CXR images of the test set of the COVIDprivate dataset and determined the diagnosis for the three-category classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy. Apart from the CXR images, the radiologists were blinded to all clinical information of the test set of the COVIDprivate dataset. Since the combined training set used for our DL model was too large for the radiologists, the development set of the COVIDprivate dataset was provided for the radiologists' training before the observer study. Training and interpretation times were not limited.

Performance evaluation

For our DL model, performance was evaluated using the classification metrics of the three-category classification (class-wise precision, recall, F1-score, and three-category classification accuracy) in the three test sets29. For the radiologists and the code-available DL models, the same evaluation was performed in the test set of the COVIDprivate dataset with 150 CXR images. In addition, the class-wise area under the curve (AUC) of the receiver operating characteristic (ROC) analysis was calculated for COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy29. For the ROC analysis of the radiologists, a consensus interpretation score of the six radiologists was determined by majority voting over the individual interpretations14; the score ranged from 0 to 6.
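
Assuming predicted class probabilities from the ensemble, the class-wise metrics and one-vs-rest AUCs described above can be computed with scikit-learn as follows; the class ordering is an assumption.

```python
# Sketch of the performance metrics: class-wise precision/recall/F1, accuracy, and class-wise AUC.
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

CLASSES = ["healthy", "non-COVID-19 pneumonia", "COVID-19 pneumonia"]  # assumed ordering

def evaluate(y_true, probs):
    """y_true: ground-truth class indices; probs: (n_images, 3) ensembled probabilities."""
    y_true = np.asarray(y_true)
    y_pred = probs.argmax(axis=1)
    report = classification_report(y_true, y_pred, target_names=CLASSES, digits=4)
    accuracy = accuracy_score(y_true, y_pred)
    # Class-wise AUC, computed one-vs-rest for the three-category task.
    aucs = {c: roc_auc_score((y_true == i).astype(int), probs[:, i])
            for i, c in enumerate(CLASSES)}
    return report, accuracy, aucs
```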

Statistical analysis

The 95% confidence intervals (CI) of the classification metrics were calculated using 2000 bootstrap samples14. In addition, the class-wise AUC was compared between our DL model and the consensus interpretation of the radiologists using DeLong's test. To control the family-wise error rate, Bonferroni correction was applied, and a p value less than 0.01666 was considered statistically significant. Statistical analyses were performed using the scikit-learn package30 in Python and the pROC package31 in R (version 4.0.4, https://www.r-project.org/).
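
The bootstrap estimate of the 95% CI can be illustrated as follows; resampling individual CXR images with replacement is an assumption, and DeLong's test itself was performed with the pROC package in R rather than in Python.

```python
# Bootstrap 95% CI of the three-category classification accuracy (2000 resamples).
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample images with replacement
        stats.append(accuracy_score(y_true[idx], y_pred[idx]))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper
```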

Results

Table 3 shows the diagnostic performance of the four DL models, including our DL model, and the six radiologists in the test set of the COVIDprivate dataset. The three-category classification accuracy of our DL model was 0.8667 (130/150), whereas those of the six radiologists ranged from 0.5667 (85/150) to 0.7733 (116/150). The 95% CIs of the three-category classification accuracy were 0.8067–0.9200 for our DL model and 0.7067–0.8400 for the radiologist with the best accuracy (Radiologist 3). Thus, the three-category classification accuracy of our DL model was better than that of any of the six radiologists.

For our DL model, the class-wise F1-scores for the healthy and COVID-19 pneumonia were higher than that for non-COVID-19 pneumonia, indicating that the diagnostic performance of our DL model was better for the healthy and COVID-19 pneumonia than for non-COVID-19 pneumonia. For the six radiologists, on the other hand, the class-wise F1-scores for the healthy were higher than those for COVID-19 pneumonia and non-COVID-19 pneumonia; hence, their diagnostic performance was higher for the healthy than for COVID-19 and non-COVID-19 pneumonia.

The three-category classification accuracies of the three code-available DL models were 0.6467 (97/150), 0.4267 (64/150), and 0.4000 (60/150); COVID-Net12 achieved the highest accuracy among the three. Although the three-category classification accuracy of COVID-Net (0.6467) was comparable to those of the six radiologists, those of the other two code-available DL models (0.4267 and 0.4000) were worse than those of the six radiologists. The class-wise F1-scores of the three code-available DL models for COVID-19 pneumonia were 0.3636, 0.5684, and 0.4160; the DL model of Sharma et al.11 achieved the highest class-wise F1-score for COVID-19 pneumonia among them, and its score (0.5684) was higher than those of two radiologists (Radiologist 5 and Radiologist 6). However, the class-wise F1-score of the DL model of Sharma et al. for the healthy was 0.0000. Table S1 of the Supplementary information shows the diagnostic performance of our DL model in the test sets of the COVIDx and COVIDBIMCV datasets.

Table 3 Class-wise precision, recall, F1-score, and three-category classification accuracy of four DL models and six radiologists in the COVIDprivate dataset.

Table 4 shows the class-wise AUC and its 95% CI for our DL model in the test sets of the COVIDx, COVIDBIMCV, and COVIDprivate datasets, as well as for the consensus of the six radiologists in the test set of the COVIDprivate dataset. Figure 3 shows the class-wise ROC curves of our DL model and of the consensus of the six radiologists in the test set of the COVIDprivate dataset. The class-wise AUC and its 95% CI of our DL model were 0.9914 (0.9837–0.9990) for the healthy, 0.9772 (0.9601–0.9942) for non-COVID-19 pneumonia, and 0.9934 (0.9871–0.9996) for COVID-19 pneumonia. Those of the consensus of the six radiologists were 0.9656 (0.9401–0.9911) for the healthy, 0.8654 (0.8022–0.9286) for non-COVID-19 pneumonia, and 0.8740 (0.8164–0.9316) for COVID-19 pneumonia. The difference in class-wise AUC between our DL model and the consensus of the six radiologists was statistically significant for COVID-19 pneumonia (p value = 0.001334), but not for the healthy or non-COVID-19 pneumonia (p values = 0.07252 and 0.02617, respectively). Table S2 of the Supplementary information presents the confusion matrix of the three-category classification for our DL model in the test set of the COVIDprivate dataset. Table S3 of the Supplementary information shows the class-wise AUC and its 95% CI for our DL model when changing the data splitting between the test and development sets. Figures S1 and S2 of the Supplementary information show the class-wise ROC curves of our DL model in the test sets of the COVIDx and COVIDBIMCV datasets, respectively.

Table 4 Class-wise AUC and its 95% CI of our DL model and consensus of six radiologists.
Figure 3
figure 3

Class-wise ROC curves in the COVIDprivate dataset. Note: (A) consensus of the radiologists and (B) our DL model. Abbreviations: DL, deep learning; COVIDprivate, private dataset collected from six hospitals; AUC, area under the curve; ROC, receiver operating characteristic.

Figure 4 shows the CXR images and the Grad-CAM results for the healthy, non-COVID-19 pneumonia, and COVID-19 pneumonia. The Grad-CAM result in Fig. 4A illustrates that our DL model focused on non-specific areas when diagnosing the healthy. Figure 4B shows that our DL model focused on the infiltration shadow in the right lung field when diagnosing non-COVID-19 pneumonia. Figure 4C shows that our DL model focused on the ground-glass shadows in the peripheral areas of both lung fields when diagnosing COVID-19 pneumonia.

Figure 4
figure 4

Results of Grad-CAM for our DL model. Note: (A) the healthy, (B) non-COVID-19 pneumonia, and (C) COVID-19 pneumonia. Each panel consists of a CXR image and the corresponding Grad-CAM result. One trained model of our DL model was used for Grad-CAM. Abbreviations: DL, deep learning; CXR, chest X-ray.
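
For reference, Grad-CAM can be sketched for a PyTorch classifier such as ours as below; the choice of target layer (for example, the last convolutional block of the EfficientNet backbone) and the preprocessing are assumptions.

```python
# Minimal Grad-CAM sketch for a PyTorch image classifier.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, target_layer):
    """Return a heatmap in [0, 1] of size (H, W) highlighting regions supporting target_class."""
    activations, gradients = {}, {}
    h_fwd = target_layer.register_forward_hook(
        lambda m, i, o: activations.update(value=o))
    h_bwd = target_layer.register_full_backward_hook(
        lambda m, gi, go: gradients.update(value=go[0]))

    model.eval()
    logits = model(image.unsqueeze(0))     # image: (3, H, W) preprocessed CXR tensor
    model.zero_grad()
    logits[0, target_class].backward()     # gradients of the target class score
    h_fwd.remove()
    h_bwd.remove()

    acts, grads = activations["value"], gradients["value"]
    weights = grads.mean(dim=(2, 3), keepdim=True)            # global-average-pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))   # weighted activations + ReLU
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam[0, 0].detach()
```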

Discussion

The results of this study indicate that it is possible to construct an accurate DL model using the two public datasets (COVIDx and COVIDBIMCV) and one private dataset (COVIDprivate). Our DL model, based on EfficientNet with noisy student, achieved accurate diagnosis of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy. The three-category classification accuracy of our model was 0.8667, and those of the six radiologists ranged from 0.5667 to 0.7733. The difference in class-wise AUC between our model and the consensus of the six radiologists was statistically significant for COVID-19 pneumonia (p value = 0.001334).

Using the two public datasets and one private dataset, our DL model achieved higher diagnostic performance than the three code-available DL models and the six radiologists. In particular, for COVID-19 pneumonia, the class-wise AUC of our DL model was significantly higher than that of the consensus of the six radiologists. In DL, a large amount of training data is necessary for accurate classification. While COVID-Net used more than 10,000 CXR images to develop and evaluate its model12, we used more than 20,000 CXR images for our DL model. We believe that the dataset size was a major factor in the diagnostic performance of our DL model. Another reason for the superiority of our DL model could be the use of a pretrained model constructed with noisy student21. Noisy student is a relatively new method for increasing the robustness of DL models; the EfficientNet20 model pretrained with noisy student could be useful in improving our DL model.

The results of the three code-available DL models demonstrate that their classification metrics are not satisfactory. Although the three-category classification accuracy of COVID-Net was the highest among the three DL models, its F1-score for COVID-19 pneumonia was the worst. For the other two models, the three-category classification accuracy was lower than those of the six radiologists. Many studies have used DL models for the automatic classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy using CXR images7,8,9,10,11,12,13,14,18,19; Table 5 summarizes these previous studies. While most of these models were developed and validated using CXR images from public datasets, they were not validated on clinical cases. Our results indicate that most of the previously published DL models for COVID-19 pneumonia may not be useful in clinical situations.

Table 5 Summary of COVID-19 DL models on CXR images.

The three-category classification accuracy of the six radiologists ranged from 0.5667 to 0.7733, revealing large variability in the radiologists' diagnostic performance in the classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy using CXR images. Conversely, this suggests that the radiologists' diagnostic performance could be improved by using our DL model. The effectiveness of our DL model as a computer-aided diagnosis system should be evaluated in future studies.

There are certain limitations to our study. First, although our DL model was developed and validated using two public datasets and one private dataset, it was not evaluated by external validation. The clinical usefulness of our DL model should be further evaluated by external validation32. Second, our DL model focused on the three-category classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy; it ignored lung cancer and other diseases that are considered important to detect on CXR images. This three-category classification may be considered unnatural from a clinical viewpoint. However, we speculate that it was justified owing to the higher priority of the three-category classification during the COVID-19 pandemic. Third, our observer study was conducted on CXR images obtained from relatively large hospitals. Since CXR can be performed in various hospitals and clinics, further studies are warranted to determine whether our DL model is effective in small hospitals and clinics; the outputs of our DL model should be adjusted to the circumstances in which it is used. Fourth, we focused on the automatic classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy using CXR images, and the diagnostic performance of radiologists assisted by our DL model was not evaluated. Thus, we did not evaluate the usefulness of our DL model as a computer-aided system. If radiologists doubt the results of our DL model, their diagnostic performance may not be improved by using it. Therefore, in the future, it is crucial to build trust between the radiologists and the DL model for its implementation in clinical practice33. Fifth, although the results of Grad-CAM (for example, Fig. 4) could help radiologists comprehend the classification results of our DL model, the effectiveness of the Grad-CAM results was not validated in the current study.

In conclusion, it is feasible to create an accurate DL model for the three-category classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy. The diagnostic performance of our model was significantly better than that of the consensus interpretation of the six radiologists for COVID-19 pneumonia.