Introduction

Coronavirus disease 2019 (COVID-19) is a new infectious disease that was first identified in China in 2019 and has caused a significant number of infections and deaths worldwide1. At the time of writing, at least 529,410,287 infections and 6,296,771 deaths have been confirmed worldwide2. The development of vaccines and measures to prevent the spread of the disease have temporarily succeeded in reducing the number of infected people. However, the threat of COVID-19 continues worldwide because of highly transmissible variants such as the Omicron variant.

Reverse transcription polymerase chain reaction (RT-PCR) is used as a diagnostic method for COVID-19 in many medical institutions. However, RT-PCR is not always sufficiently sensitive. One report has indicated that computed tomography (CT) is more sensitive than RT-PCR3. CT and chest X-ray (CXR) may therefore serve as more accurate diagnostic methods for COVID-194,5.

The clinical application of deep learning (DL) for the diagnosis of COVID-19 on CXR has attracted attention6,7. Although CXR is less accurate than CT, CT scanners are not always available; for example, as a 24/7 in-hospital service, rural hospitals have very limited local access to CT scanners8. CXR is simple and inexpensive, and its radiation exposure is lower than that of CT. Therefore, if COVID-19 can be diagnosed using a combination of DL and CXR, CXR may become a viable screening tool for COVID-19.

Many studies have already applied DL to CT/CXR for the diagnosis of COVID-19, and most have shown promising results9,10,11. However, when DL is applied clinically as a computer-aided diagnosis system, medical doctors must compare their own diagnosis with that of DL, and if the two disagree, doctors may reject the DL diagnosis. To evaluate the clinical usefulness of DL, an observer study of CXR readings must therefore be conducted for both DL and radiologists. Only a few studies have compared the diagnostic performance of DL and radiologists12,13,14.

This study aimed to evaluate the diagnostic performance of our DL model for COVID-19 on CXR and to investigate whether radiologists changed their diagnoses by referring to the output of our DL model and whether their diagnostic performance improved significantly. To evaluate the clinical usefulness of DL, an observer study of radiologists and an external validation of our DL model were conducted. Based on the reading sessions of the observer study, diagnostic performance was compared among (i) our DL model, (ii) eight radiologists without DL, and (iii) eight radiologists with DL.

Materials and methods

This retrospective study was approved by the institutional review boards of eight hospitals (Kobe University Hospital, St. Luke's International Hospital, Nishinomiya Watanabe Hospital, Kobe City Medical Center General Hospital, Kobe City Nishi-Kobe Medical Center, Hyogo Prefectural Kakogawa Medical Center, Kita Harima Medical Center, and Hyogo Prefectural Awaji Medical Center); the requirement for acquiring informed consent was waived by the institutional review boards of these eight hospitals owing to the retrospective nature of the study. This study complied with the Declaration of Helsinki and Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan (https://www.mhlw.go.jp/file/06-Seisakujouhou-10600000-Daijinkanboukouseikagakuka/0000080278.pdf).

Dataset

The CXR datasets used for developing and evaluating our DL model contain CXRs in the following three categories: normal CXR (NORMAL), non-COVID-19 pneumonia CXR (PNEUMONIA), and COVID-19 pneumonia CXR (COVID). Our DL model was developed using two public datasets (COVIDx and COVIDBIMCV) and one private dataset (COVIDprivate). One public dataset (COVIDx) was built to accelerate the development of highly accurate and practical deep learning models for detecting COVID-19 cases (https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md)15. The other public dataset (COVIDBIMCV) was constructed from two public datasets: the PadChest dataset (https://github.com/auriml/Rx-thorax-automatic-captioning)16 and the BIMCV-COVID19+ dataset (https://github.com/BIMCV-CSUSP/BIMCV-COVID-19)17. COVIDprivate was based on a dataset previously collected from six hospitals, and the two public datasets (COVIDx and COVIDBIMCV) were the same as those used in previous studies18,19. The details of these datasets are described in the Supplementary material. Compared with the previous study, additional CXRs were included in COVIDprivate in the current study: 37 NORMAL, 7 PNEUMONIA, and 31 COVID cases. In total, COVIDprivate contained 530 CXRs (176 NORMAL, 146 PNEUMONIA, and 208 COVID).

In addition to COVIDprivate, CXRs were collected from two other medical institutions. In total, 168 CXRs (80 NORMAL, 37 PNEUMONIA, and 51 COVID) collected from one medical institution (Hospital A) were used for the internal validation of the DL model (as part of the validation set) and for the radiologists' reading practice conducted before the observer study. Moreover, as an unseen test set, 180 CXR cases (60 NORMAL, 60 PNEUMONIA, and 60 COVID) collected from another medical institution (Hospital B) were used for the external validation of the DL model and the observer study of radiologists.

At Hospital B, COVID was limited to patients diagnosed with COVID-19 pneumonia using RT-PCR whose CXR was obtained after symptom onset. The time of COVID-19 diagnosis was between January 24, 2020, and May 5, 2020. PNEUMONIA was defined as patients clinically diagnosed with bacterial pneumonia that improved with appropriate treatment. Patients who showed no pneumonia on CT, had lung metastasis of malignancy, or had acute exacerbation of interstitial pneumonia were excluded from PNEUMONIA. NORMAL was defined as the absence of abnormalities in the lung, mediastinum, thoracic cavity, or chest wall on CXR and CT. NORMAL and PNEUMONIA were limited to cases before the summer of 2019 (before the COVID-19 pandemic). The details of the unseen test set collected from Hospital B are described in the Supplementary material. The inclusion criteria for CXRs in COVIDprivate and Hospital A were the same as those in the previous study19.

Table 1 lists the details of each CXR dataset. The 180 cases (the unseen test set) used for the external validation and reading sessions were adults aged 20 years or older. Among these 180 cases, NORMAL included 39 men and 21 women aged 58.1 ± 27.9 years, PNEUMONIA included 43 men and 17 women aged 76.2 ± 20.8 years, and COVID included 46 men and 14 women aged 53.4 ± 38.6 years.

Table 1 Numbers of CXR images in the datasets: COVIDx, COVIDBIMCV, COVIDprivate, Hospital A, and Hospital B.

Deep learning model

Our EfficientNet-based DL model was constructed in the same manner as described in previous papers18,19. Figure 1 shows a schematic of the construction of the DL model. There are two major differences in the DL model construction between the present study and previous studies: first, the 168 CXRs collected from Hospital A were used for internal validation as part of the validation set; second, the 180 CXRs collected from Hospital B were used for external validation as the unseen test set. The DL model development set consisted of the two public datasets, COVIDprivate, and the 168 CXRs collected from Hospital A. Five different random divisions of the training and validation sets were created from the development set. In each division, 300, 300, and 90 images were randomly selected as the validation set from COVIDx, COVIDBIMCV, and COVIDprivate, respectively, and the remaining images of these three datasets were used as the training set. In addition, all 168 CXRs collected from Hospital A were included in the validation set. Model training and internal validation of diagnostic performance were performed using the training and validation sets, respectively. The training of our DL model is also described in the Supplementary material.

Figure 1

Schematic illustration of dataset splitting and model training for our DL model. Abbreviations: DL, deep learning; COVIDx, public dataset used for COVID-Net; COVIDBIMCV, public dataset obtained from the PadChest and BIMCV-COVID19+ datasets; COVIDprivate, private dataset collected from six hospitals; Hospital A, dataset collected for internal validation and radiologists' practice before the observer study; Hospital B, dataset collected for external validation.
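The five random divisions described above can be sketched roughly as follows (the file lists, identifiers, and seeds are hypothetical; this illustrates the splitting logic only and is not the authors' actual code):

```python
import random

def make_division(covidx, covidbimcv, covidprivate, hospital_a, seed):
    """Create one random training/validation division of the development set.

    Each argument is a list of image identifiers (hypothetical format).
    As described above, 300, 300, and 90 images are drawn as validation data
    from COVIDx, COVIDBIMCV, and COVIDprivate, all 168 Hospital A CXRs are
    added to the validation set, and the remaining images form the training set.
    """
    rng = random.Random(seed)
    validation = (rng.sample(covidx, 300)
                  + rng.sample(covidbimcv, 300)
                  + rng.sample(covidprivate, 90)
                  + list(hospital_a))
    held_out = set(validation)
    training = [x for x in covidx + covidbimcv + covidprivate if x not in held_out]
    return training, validation

# Five different random divisions (seeds shown only for illustration).
# divisions = [make_division(covidx, covidbimcv, covidprivate, hospital_a, seed=s)
#              for s in range(5)]
```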

The inference results of the DL model were obtained using an ensemble of the five trained models. For the 180 CXRs of the external validation, the average of the probabilities output by the five trained models was used as the inference result of the DL model, both to evaluate its diagnostic performance and to provide supporting information for the radiologists during the observer study.
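A minimal sketch of this ensembling step, assuming the per-model class probabilities have already been computed (array shapes and names are illustrative only):

```python
import numpy as np

def ensemble_probabilities(prob_per_model):
    """Average the softmax outputs of the five trained models.

    prob_per_model: array of shape (5, n_images, 3), where the last axis
    holds the probabilities of NORMAL, PNEUMONIA, and COVID.
    Returns an (n_images, 3) array used as the ensemble inference result.
    """
    return np.asarray(prob_per_model).mean(axis=0)

# Dummy example with two models and one image (values for illustration only):
# ensemble_probabilities([[[0.7, 0.2, 0.1]], [[0.5, 0.3, 0.2]]])
# -> array([[0.6, 0.25, 0.15]])
```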

The DL model calculated the probabilities of NORMAL, PNEUMONIA, and COVID for each CXR, with the three probabilities summing to 100%. We also created images using Grad-CAM and Grad-CAM++ as explainable artificial intelligence techniques to visualize the basis of the DL model's diagnosis20,21. The Grad-CAM and Grad-CAM++ images were used in the observer study. Min–max normalization with a linear transformation was applied to the original Grad-CAM and Grad-CAM++ images.
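The min–max normalization is a simple linear rescaling of each heatmap; a sketch (assuming the Grad-CAM or Grad-CAM++ output is available as a 2-D NumPy array) is shown below:

```python
import numpy as np

def min_max_normalize(cam):
    """Linearly rescale a Grad-CAM or Grad-CAM++ heatmap to the [0, 1] range."""
    cam = np.asarray(cam, dtype=np.float32)
    cam_min, cam_max = cam.min(), cam.max()
    if cam_max - cam_min < 1e-8:      # guard against flat maps (division by zero)
        return np.zeros_like(cam)
    return (cam - cam_min) / (cam_max - cam_min)
```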

Observer study

Eight radiologists (with 5–20 years of experience in diagnostic radiology) performed the observer study at two medical facilities. For the 180 CXRs collected from Hospital B, each radiologist performed two reading sessions more than 1 month apart. One reading session was performed with reference to the CXRs only, and the other with reference to both the CXRs and the results of the DL model. The order of the two sessions was randomized to reduce bias. The eight radiologists scored the probabilities of NORMAL, PNEUMONIA, and COVID on a 0–100% scale. In the reading session with the DL model, the radiologists referred to the probabilities of NORMAL, PNEUMONIA, and COVID calculated by the DL model. If there was any uncertainty regarding the probabilities of the DL model, the results of Grad-CAM and Grad-CAM++ were available. The 168 CXRs collected from Hospital A were also processed with Grad-CAM and Grad-CAM++, and the diagnoses of the DL model and the Grad-CAM and Grad-CAM++ images of these 168 CXRs were presented to the radiologists in practice sessions before each reading session. The eight radiologists were taught how to interpret the Grad-CAM and Grad-CAM++ images before the observer study. There was no time limit for the reading and practice sessions. Prior to the reading sessions, only the approximate frequencies of the three categories were presented to the radiologists; no other clinical information was provided. The novelty of this study lies in investigating whether radiologists changed their diagnoses by referring to our DL model and whether their diagnostic performance improved significantly.

Evaluation of Grad-CAM++ images

After the observer study, one senior radiologist visually evaluated the 180 Grad-CAM++ images of the unseen test set. The visual evaluation was performed on the images that were accurately diagnosed by the DL model. The radiologist visually examined the CXR and Grad-CAM++ images and determined whether the Grad-CAM++ images were typical or understandable. Typical Grad-CAM++ images are described in the Supplementary material. If abnormal findings on the CXR images were highlighted on the Grad-CAM++ images, the cases were considered understandable by the radiologist. In addition, for COVID, the radiologist counted the number of Grad-CAM++ images with highlighted regions outside the lung area.

Statistical analyses

We evaluated the diagnostic performance of the DL model alone and compared the results between reading sessions with and without the DL model. The evaluation metrics were accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). Because a three-category classification was performed, these metrics were calculated class-wise (one-vs-rest), except for accuracy. For the AUC, multi-reader multi-case statistical analysis was used to statistically analyze the results of the eight radiologists, using the MRMCaov package22. Although MRMCaov is designed for binary (two-category) classification, this study addressed three categories: NORMAL, PNEUMONIA, and COVID. Therefore, the three-category classification was divided into three binary classifications (one-vs-rest): (1) NORMAL versus PNEUMONIA or COVID, (2) PNEUMONIA versus NORMAL or COVID, and (3) COVID versus NORMAL or PNEUMONIA. We then compared the class-wise AUC of the eight radiologists between reading sessions with and without the DL model; the difference in AUC was statistically tested with MRMCaov. Because it was necessary to integrate the results from the eight radiologists, the class-wise MRMCaov analysis was used in the present study. To control the family-wise error rate across the three comparisons, Bonferroni correction was applied; a p value less than 0.01666 (0.05/3) was considered statistically significant. R (version 4.1.2) was used for the statistical analysis.
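The MRMC analysis itself was carried out with the MRMCaov package in R; purely as an illustration of the class-wise (one-vs-rest) AUC computation and the Bonferroni threshold, a hypothetical Python sketch is shown below (it does not reproduce the multi-reader multi-case analysis):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

CLASSES = ["NORMAL", "PNEUMONIA", "COVID"]
ALPHA = 0.05 / 3  # Bonferroni-corrected threshold (~0.0167) for three comparisons

def class_wise_auc(y_true, y_prob):
    """Compute one-vs-rest AUCs for the three-category classification.

    y_true: shape (n_cases,), integer labels indexing CLASSES.
    y_prob: shape (n_cases, 3), predicted probabilities per category.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    return {name: roc_auc_score((y_true == k).astype(int), y_prob[:, k])
            for k, name in enumerate(CLASSES)}
```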

Results

Figure 2 shows examples of CXR, Grad-CAM, and Grad-CAM++ images from NORMAL, PNEUMONIA, and COVID. As shown in Fig. 2, the Grad-CAM and Grad-CAM++ images from NORMAL often showed a relatively symmetrical region of interest in the lung fields. In PNEUMONIA, the region of interest was observed in a unilateral lung field in most cases, which was consistent with an abnormal shadow caused by pneumonia. COVID tended to show regions of interest in both the lungs and the mediastinum.

Figure 2

Results of Grad-CAM and Grad-CAM++ for our DL model. (A) NORMAL, (B) PNEUMONIA, and (C) COVID. Each row consists of a CXR image collected from Hospital A and the corresponding Grad-CAM and Grad-CAM++ results. One trained DL model was used for Grad-CAM and Grad-CAM++. Left column, original CXR image; middle column, result of Grad-CAM; right column, result of Grad-CAM++. Abbreviations: DL, deep learning; CXR, chest X-ray.

Table 2 shows the sensitivity, specificity, accuracy, and AUC of the DL model and of the eight radiologists with and without the DL model. Here, the three binary classifications (one-vs-rest) are defined as follows: A, "NORMAL versus PNEUMONIA or COVID"; B, "PNEUMONIA versus NORMAL or COVID"; and C, "COVID versus NORMAL or PNEUMONIA." Figure 3 shows the receiver operating characteristic curves of our DL model alone for the three binary classifications. Figure 4 shows the receiver operating characteristic curves of the eight radiologists with and without the DL model.

Table 2 Class-wise sensitivity, specificity, AUC, and 3-category classification accuracy of our DL model alone and eight radiologists with and without our DL model.
Figure 3

Class-wise receiver operating characteristic curves of our DL model in the external validation. (A) NORMAL versus PNEUMONIA or COVID, (B) PNEUMONIA versus NORMAL or COVID, and (C) COVID versus NORMAL or PNEUMONIA. Abbreviation: DL, deep learning.

Figure 4

Class-wise receiver operating characteristic curves of the eight radiologists with and without our DL model in the observer study. (A) NORMAL versus PNEUMONIA or COVID, (B) PNEUMONIA versus NORMAL or COVID, and (C) COVID versus NORMAL or PNEUMONIA. The blue and red lines represent the receiver operating characteristic curves of the radiologists with and without our DL model, respectively. Abbreviation: DL, deep learning.

The three-category classification accuracy of the DL model was 0.733 (132/180). The 95% confidence intervals of the class-wise AUCs of the DL model were as follows: A, 0.872–0.955; B, 0.903–0.972; and C, 0.711–0.862. The mean accuracy of the radiologists without the DL model was 0.696 ± 0.031 (range, 0.667 [120/180]–0.756 [136/180]), and their class-wise AUCs were as follows: A, 0.889 ± 0.027 (0.860–0.941); B, 0.844 ± 0.046 (0.792–0.905); and C, 0.716 ± 0.028 (0.679–0.757). The mean accuracy of the radiologists with the DL model was 0.723 ± 0.021 (range, 0.689 [124/180]–0.756 [136/180]), and their class-wise AUCs were as follows: A, 0.903 ± 0.028 (0.871–0.954); B, 0.883 ± 0.055 (0.792–0.938); and C, 0.762 ± 0.029 (0.730–0.816). The accuracy of our DL model was higher than that of six of the eight radiologists without the DL model.

Table 3 shows the averaged AUCs of the senior and junior radiologists with and without our DL model. There were five senior and three junior radiologists. According to Table 3, for both senior and junior radiologists, the difference in averaged class-wise AUC between sessions with and without the DL model was larger for classification C ("COVID versus NORMAL or PNEUMONIA") than for classifications A and B.

Table 3 Averaged AUC of senior and junior radiologists with and without our DL model.

We integrated the results of the eight radiologists with and without the DL model using the MRMCaov software and compared the class-wise AUC of the radiologists between reading sessions with and without the DL model. The MRMCaov results showed that, for classification C (COVID versus NORMAL or PNEUMONIA), there was a significant difference in AUC between the sessions with and without the DL model (p = 0.0038). For classifications A and B, there were no significant differences in AUC between the sessions with and without the DL model (p = 0.2396 and 0.1190, respectively). Figure 5 shows the class-wise receiver operating characteristic curves of the integrated results of the eight radiologists with and without the DL model.

Figure 5

Class-wise receiver operating characteristic curves obtained by integrating the reading session results of the eight radiologists with and without our DL model. Note: (A) NORMAL versus PNEUMONIA or COVID, (B) PNEUMONIA versus NORMAL or COVID, (C) COVID versus NORMAL or PNEUMONIA. The blue and red lines represent the integrated receiver operating characteristic curves of the radiologists with and without our DL model, respectively. Abbreviation: DL, deep learning.

Table 4 shows the results of the visual evaluation of the Grad-CAM++ images. The proportion of typical or understandable Grad-CAM++ images was 0.932 (123/132). The proportion of Grad-CAM++ images with highlighted regions outside the lung area was 0.200 (8/40) for COVID.

Table 4 Results of the visual evaluation of Grad-CAM++ images in the unseen test set.

Discussion

In this study, eight radiologists performed the reading sessions with and without the DL model, and the results were compared and analyzed using multi-reader multi-case statistical analysis. The diagnostic performance of the DL model alone was also evaluated. Our DL model achieved a higher accuracy and AUC than the majority of the eight radiologists without the DL model. Furthermore, the results of the statistical analysis showed that radiologists’ diagnostic performance was significantly improved by the DL model in diagnosing COVID-19 on CXR.

Based on the receiver operating characteristic analysis with MRMCaov, there was a significant difference in the radiologists' AUC between sessions with and without the DL model for "C: COVID versus NORMAL or PNEUMONIA" (p = 0.0038), whereas there was no significant difference for "A: NORMAL versus PNEUMONIA or COVID" or "B: PNEUMONIA versus NORMAL or COVID." One possible reason is that radiologists have less experience in reading COVID cases than NORMAL or PNEUMONIA cases. Based on these results, the DL model may be even more useful for medical doctors in other fields with less experience in reading COVID cases.

Because the DL model alone had a higher diagnostic performance than the majority of the eight radiologists, it may be possible to apply the DL model to COVID-19 diagnosis on CXR for screening and other purposes. This DL model may be useful especially in areas where medical resources are limited.

In a previous study, our DL model was significantly superior to radiologists in diagnosing COVID-19 pneumonia19. However, the DL model was not evaluated as a computer-aided diagnosis system in that study. In contrast, because the reading sessions of the present study were conducted by radiologists with and without the DL model, the present setting is closer to the practical clinical use of the DL model. In addition, the previous study had the disadvantage of relying on internal validation, whereas the current study used external validation, which generally produces more reliable results. Rangarajan et al.23 also performed an external validation of a DL model for COVID-19 diagnosis and pointed out that their DL model may complement COVID-19 diagnosis on CXR. Although their study is similar to ours, the classification targets and the method of statistical analysis differ.

To the best of our knowledge, there are no previous studies in which a three-category classification (including COVID) was performed using DL models with external validation; this study is the first to evaluate the generalizability of a DL model in such a three-category classification. Several studies have compared the diagnostic performance of DL models with that of radiologists for COVID-19 on CXR12,13,14 and reported that the AUC and accuracy of the DL model tended to exceed those of the radiologists in most cases. For example, Wehbe et al.14 compared the diagnostic performance of their DL model and two radiologists in classifying CXRs as COVID-19 positive or negative. Their DL model had a significantly higher sensitivity (71%) than one radiologist (60%) and a significantly higher specificity (92%) than the two radiologists (75% and 84%, respectively).

RT-PCR is the most commonly used test to detect COVID-19, but its sensitivity is not particularly high; one study reported a sensitivity of approximately 71%3. RT-PCR is also time consuming and often difficult to perform in small medical facilities, particularly in developing countries. In contrast, CXR is a simple imaging examination, although its diagnostic performance depends on the reader's ability. The sensitivity and specificity of our DL model were relatively high for the three binary classifications. Therefore, it may be possible to increase the usefulness of CXR as an alternative or complementary test to RT-PCR.

One of the reasons we evaluated our DL model by external validation is that it is difficult to evaluate a DL model accurately using public datasets. Garcia Santa Cruz et al. pointed out that public datasets contain undetected biases24. When these datasets are used for internal validation, there is a risk of overestimating the diagnostic performance of the DL model. Therefore, we attempted to mitigate these biases using external validation.

Our study has some limitations. First, the CXRs in this study were obtained from large hospitals, and only good-quality CXRs were used; therefore, we did not evaluate the usefulness of our DL model on poor-quality CXRs. Second, the observer study covered only normal, non-COVID-19 pneumonia, and COVID-19 pneumonia CXRs. Because we excluded CXRs with other lung diseases, we could not assess the usefulness of our DL model for such images.

In conclusion, our DL model alone showed better diagnostic performance than most of the eight radiologists in the external validation of the three-category classification of normal, non-COVID-19 pneumonia, and COVID-19 pneumonia. In addition, our DL model significantly improved the diagnostic performance of the eight radiologists in distinguishing COVID-19 pneumonia from normal or non-COVID-19 pneumonia.