Introduction

Coronavirus disease 2019 (COVID-19) is a new infectious disease that was first identified in China in 2019 and has caused a significant number of infections and deaths worldwide1. At the time of writing, at least 529,410,287 infections and 6,296,771 deaths have been confirmed worldwide2. The development of vaccines and measures to prevent the spread of the disease have temporarily succeeded in reducing the number of infected people. However, the threat of COVID-19 continues worldwide because of highly transmissible variants such as the Omicron variant.

Reverse transcription polymerase chain reaction (RT-PCR) is used as a diagnostic method for COVID-19 in many medical institutions. However, RT-PCR is not always sufficiently sensitive. One report has indicated that computed tomography (CT) is more sensitive than RT-PCR3. CT and chest X-ray (CXR) may therefore serve as more accurate diagnostic methods for COVID-194,5.

The clinical application of deep learning (DL) for the diagnosis of COVID-19 on CXR has attracted attention6,7. Although CXR is less accurate than CT, CT scanners are not always available; for example, as a 24/7 in-hospital service, rural hospitals have very limited local access to CT scanners8. CXR is simple and inexpensive, and its radiation exposure is lower than that of CT. Therefore, if COVID-19 can be diagnosed using a combination of DL and CXR, CXR may become a viable screening tool for COVID-19.

Many studies have already applied DL to CT/CXR for the diagnosis of COVID-19, and most have shown promising results9,10,11. However, when DL is applied clinically as a computer-aided diagnosis system, medical doctors must compare their own diagnosis with that of DL, and if the two disagree, doctors may reject the DL diagnosis. To evaluate the clinical usefulness of DL, an observer study of CXR readings must therefore be conducted for both DL and radiologists. Only a few studies have compared the diagnostic performance of DL and radiologists12,13,14.

This study aimed to evaluate the diagnostic performance of our DL model for COVID-19 on CXR and to investigate whether radiologists changed their diagnoses by referring to the output of our DL model and whether their diagnostic performance improved significantly. To evaluate the clinical usefulness of DL, an observer study of radiologists and an external validation of our DL model were conducted. Based on the reading sessions of the observer study, diagnostic performance was compared among (i) our DL model, (ii) eight radiologists without DL, and (iii) eight radiologists with DL.

Materials and methods

This retrospective study was approved by the institutional review boards of eight hospitals (Kobe University Hospital, St. Luke's International Hospital, Nishinomiya Watanabe Hospital, Kobe City Medical Center General Hospital, Kobe City Nishi-Kobe Medical Center, Hyogo Prefectural Kakogawa Medical Center, Kita Harima Medical Center, and Hyogo Prefectural Awaji Medical Center); the requirement for acquiring informed consent was waived by the institutional review boards of these eight hospitals owing to the retrospective nature of the study. This study complied with the Declaration of Helsinki and Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan (https://www.mhlw.go.jp/file/06-Seisakujouhou-10600000-Daijinkanboukouseikagakuka/0000080278.pdf).

Dataset

The CXR datasets used for developing and evaluating our DL model contain CXRs in the following three categories: normal CXR (NORMAL), non-COVID-19 pneumonia CXR (PNEUMONIA), and COVID-19 pneumonia CXR (COVID). Our DL model was developed using two public datasets (COVIDx and COVIDBIMCV) and one private dataset (COVIDprivate). One public dataset (COVIDx) was built to accelerate the development of highly accurate and practical deep learning models for detecting COVID-19 cases (https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md)15. The other public dataset (COVIDBIMCV) was constructed from two public datasets: the PadChest dataset (https://github.com/auriml/Rx-thorax-automatic-captioning)16 and the BIMCV-COVID19+ dataset (https://github.com/BIMCV-CSUSP/BIMCV-COVID-19)17. COVIDprivate was based on a dataset previously collected from six hospitals, and the two public datasets (COVIDx and COVIDBIMCV) were the same as those used in previous studies18,19. The details of these datasets are described in the Supplementary material. Compared with the previous study, additional CXRs were included in COVIDprivate in the current study: 37 NORMAL, 7 PNEUMONIA, and 31 COVID cases. In total, COVIDprivate contained 530 CXRs (176 NORMAL, 146 PNEUMONIA, and 208 COVID).

In addition to COVIDprivate, CXRs were collected from two other medical institutions. In total, 168 CXRs (80 NORMAL, 37 PNEUMONIA, and 51 COVID) collected from one medical institution (Hospital A) were used for the internal validation of the DL model (as part of the validation set) and for the radiologists' reading practice conducted before the observer study. Moreover, as an unseen test set, 180 CXR cases (60 NORMAL, 60 PNEUMONIA, and 60 COVID) collected from another medical institution (Hospital B) were used for the external validation of the DL model and the observer study of radiologists.

At Hospital B, COVID was limited to patients diagnosed with COVID-19 pneumonia using RT-PCR whose CXR was obtained after symptom onset. The time of COVID-19 diagnosis was between January 24, 2020, and May 5, 2020. PNEUMONIA was defined as patients clinically diagnosed with bacterial pneumonia that improved with appropriate treatment. Patients who showed no pneumonia on CT, had lung metastasis of malignancy, or had acute exacerbation of interstitial pneumonia were excluded from PNEUMONIA. NORMAL was defined as the absence of abnormalities in the lung, mediastinum, thoracic cavity, or chest wall on CXR and CT. NORMAL and PNEUMONIA were limited to cases before the summer of 2019 (before the COVID-19 pandemic). The details of the unseen test set collected from Hospital B are described in the Supplementary material. The inclusion criteria for CXRs in COVIDprivate and Hospital A were the same as those in the previous study19.

Table 1 lists the details of each CXR dataset. The 180 cases (the unseen test set) used for the external validation and reading sessions were adults aged 20 years or older. Among these 180 cases, NORMAL included 39 men and 21 women aged 58.1 ± 27.9 years, PNEUMONIA included 43 men and 17 women aged 76.2 ± 20.8 years, and COVID included 46 men and 14 women aged 53.4 ± 38.6 years.

Table 1 Numbers of CXR images in the datasets: COVIDx, COVIDBIMCV, COVIDprivate, Hospital A, and Hospital B.

Deep learning model

Our EfficientNet-based DL model was constructed in the same manner as described in previous papers18,19. Figure 1 shows a schematic of the construction of the DL model. There are two major differences in the DL model construction between the present study and previous studies: first, the 168 CXRs collected from Hospital A were used for internal validation as part of the validation set; second, the 180 CXRs collected from Hospital B were used for external validation as the unseen test set. The DL model development set consisted of the two public datasets, COVIDprivate, and the 168 CXRs collected from Hospital A. Five different random divisions of the training and validation sets were created from the development set. In each division, 300, 300, and 90 images were randomly selected as the validation set from COVIDx, COVIDBIMCV, and COVIDprivate, respectively, and the remaining images of these three datasets were used as the training set. In addition, all 168 CXRs collected from Hospital A were included in the validation set. Model training and internal validation of diagnostic performance were performed using the training and validation sets, respectively. The training of our DL model is also described in the Supplementary material.

Figure 1

Schematic illustration of dataset splitting and model training for our DL model. Abbreviations: DL, deep learning; COVIDx, public dataset used for COVID-Net; COVIDBIMCV, public dataset obtained from the PadChest and BIMCV-COVID19+ datasets; COVIDprivate, private dataset collected from six hospitals; Hospital A, dataset collected for internal validation and radiologists' practice before the observer study; Hospital B, dataset collected for external validation.
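The five random divisions described above can be sketched roughly as follows (the file lists, identifiers, and seeds are hypothetical; this illustrates the splitting logic only and is not the authors' actual code):

```python
import random

def make_division(covidx, covidbimcv, covidprivate, hospital_a, seed):
    """Create one random training/validation division of the development set.

    Each argument is a list of image identifiers (hypothetical format).
    As described above, 300, 300, and 90 images are drawn as validation data
    from COVIDx, COVIDBIMCV, and COVIDprivate, all 168 Hospital A CXRs are
    added to the validation set, and the remaining images form the training set.
    """
    rng = random.Random(seed)
    validation = (rng.sample(covidx, 300)
                  + rng.sample(covidbimcv, 300)
                  + rng.sample(covidprivate, 90)
                  + list(hospital_a))
    held_out = set(validation)
    training = [x for x in covidx + covidbimcv + covidprivate if x not in held_out]
    return training, validation

# Five different random divisions (seeds shown only for illustration).
# divisions = [make_division(covidx, covidbimcv, covidprivate, hospital_a, seed=s)
#              for s in range(5)]
```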

The inference results of the DL model were obtained using an ensemble of the five trained models. For the 180 CXRs of the external validation, the average of the probabilities output by the five trained models was used as the inference result of the DL model, both to evaluate its diagnostic performance and to provide supporting information for the radiologists during the observer study.
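A minimal sketch of this ensembling step, assuming the per-model class probabilities have already been computed (array shapes and names are illustrative only):

```python
import numpy as np

def ensemble_probabilities(prob_per_model):
    """Average the softmax outputs of the five trained models.

    prob_per_model: array of shape (5, n_images, 3), where the last axis
    holds the probabilities of NORMAL, PNEUMONIA, and COVID.
    Returns an (n_images, 3) array used as the ensemble inference result.
    """
    return np.asarray(prob_per_model).mean(axis=0)

# Dummy example with two models and one image (values for illustration only):
# ensemble_probabilities([[[0.7, 0.2, 0.1]], [[0.5, 0.3, 0.2]]])
# -> array([[0.6, 0.25, 0.15]])
```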

The DL model calculated the probabilities of NORMAL, PNEUMONIA, and COVID for each CXR, with the three probabilities summing to 100%. We also created images using Grad-CAM and Grad-CAM++ as explainable artificial intelligence techniques to visualize the basis of the DL model's diagnosis20,21. The Grad-CAM and Grad-CAM++ images were used in the observer study. Min–max normalization with a linear transformation was applied to the original Grad-CAM and Grad-CAM++ images.
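The min–max normalization is a simple linear rescaling of each heatmap; a sketch (assuming the Grad-CAM or Grad-CAM++ output is available as a 2-D NumPy array) is shown below:

```python
import numpy as np

def min_max_normalize(cam):
    """Linearly rescale a Grad-CAM or Grad-CAM++ heatmap to the [0, 1] range."""
    cam = np.asarray(cam, dtype=np.float32)
    cam_min, cam_max = cam.min(), cam.max()
    if cam_max - cam_min < 1e-8:      # guard against flat maps (division by zero)
        return np.zeros_like(cam)
    return (cam - cam_min) / (cam_max - cam_min)
```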

Observer study

Eight radiologists (with 5–20 years of experience in diagnostic radiology) performed the observer study at two medical facilities. For the 180 CXRs collected from Hospital B, each radiologist performed two reading sessions more than 1 month apart. One reading session was performed with reference to the CXRs only, and the other with reference to both the CXRs and the results of the DL model. The order of the two sessions was randomized to reduce bias. The eight radiologists scored the probabilities of NORMAL, PNEUMONIA, and COVID on a 0–100% scale. In the reading session with the DL model, the radiologists referred to the probabilities of NORMAL, PNEUMONIA, and COVID calculated by the DL model. If there was any uncertainty regarding the probabilities of the DL model, the results of Grad-CAM and Grad-CAM++ were available. The 168 CXRs collected from Hospital A were also processed with Grad-CAM and Grad-CAM++, and the diagnoses of the DL model and the Grad-CAM and Grad-CAM++ images of these 168 CXRs were presented to the radiologists in practice sessions before each reading session. The eight radiologists were taught how to interpret the Grad-CAM and Grad-CAM++ images before the observer study. There was no time limit for the reading and practice sessions. Prior to the reading sessions, only the approximate frequencies of the three categories were presented to the radiologists; no other clinical information was provided. The novelty of this study lies in investigating whether radiologists changed their diagnoses by referring to our DL model and whether their diagnostic performance improved significantly.

Evaluation of Grad-CAM++ images

After the observer study, one senior radiologist visually evaluated the 180 Grad-CAM++ images of the unseen test set. The visual evaluation was performed on the images that were accurately diagnosed by the DL model. The radiologist visually examined the CXR and Grad-CAM++ images and determined whether the Grad-CAM++ images were typical or understandable. Typical Grad-CAM++ images are described in the Supplementary material. If abnormal findings on the CXR images were highlighted on the Grad-CAM++ images, the cases were considered understandable by the radiologist. In addition, for COVID, the radiologist counted the number of Grad-CAM++ images with highlighted regions outside the lung area.

Statistical analyses

We evaluated the diagnostic performance of the DL model alone and compared the results between reading sessions with and without the DL model. The evaluation metrics were accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). Because a three-category classification was performed, these metrics were calculated class-wise (one-vs-rest), except for accuracy. For the AUC, multi-reader multi-case statistical analysis was used to statistically analyze the results of the eight radiologists, using the MRMCaov package22. Although MRMCaov is designed for binary (two-category) classification, this study addressed three categories: NORMAL, PNEUMONIA, and COVID. Therefore, the three-category classification was divided into three binary classifications (one-vs-rest): (1) NORMAL versus PNEUMONIA or COVID, (2) PNEUMONIA versus NORMAL or COVID, and (3) COVID versus NORMAL or PNEUMONIA. We then compared the class-wise AUC of the eight radiologists between reading sessions with and without the DL model; the difference in AUC was statistically tested with MRMCaov. Because it was necessary to integrate the results from the eight radiologists, the class-wise MRMCaov analysis was used in the present study. To control the family-wise error rate across the three comparisons, Bonferroni correction was applied; a p value less than 0.01666 (0.05/3) was considered statistically significant. R (version 4.1.2) was used for the statistical analysis.
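The MRMC analysis itself was carried out with the MRMCaov package in R; purely as an illustration of the class-wise (one-vs-rest) AUC computation and the Bonferroni threshold, a hypothetical Python sketch is shown below (it does not reproduce the multi-reader multi-case analysis):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

CLASSES = ["NORMAL", "PNEUMONIA", "COVID"]
ALPHA = 0.05 / 3  # Bonferroni-corrected threshold (~0.0167) for three comparisons

def class_wise_auc(y_true, y_prob):
    """Compute one-vs-rest AUCs for the three-category classification.

    y_true: shape (n_cases,), integer labels indexing CLASSES.
    y_prob: shape (n_cases, 3), predicted probabilities per category.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    return {name: roc_auc_score((y_true == k).astype(int), y_prob[:, k])
            for k, name in enumerate(CLASSES)}
```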

Results

Figure 2 shows examples of CXR, Grad-CAM, and Grad-CAM++ images from NORMAL, PNEUMONIA, and COVID. As shown in Fig. 2, the Grad-CAM and Grad-CAM++ images from NORMAL often showed a relatively symmetrical region of interest in the lung fields. In PNEUMONIA, the region of interest was observed in a unilateral lung field in most cases, which was consistent with an abnormal shadow caused by pneumonia. COVID tended to show regions of interest in both the lungs and the mediastinum.

Figure 2

Results of Grad-CAM and Grad-CAM++ for our DL model. (A) NORMAL, (B) PNEUMONIA, and (C) COVID. Each row consists of a CXR image collected from Hospital A and the corresponding Grad-CAM and Grad-CAM++ results. One trained DL model was used for Grad-CAM and Grad-CAM++. Left column, original CXR image; middle column, result of Grad-CAM; right column, result of Grad-CAM++. Abbreviations: DL, deep learning; CXR, chest X-ray.

Table 2 shows the sensitivity, specificity, accuracy, and AUC of the DL model and of the eight radiologists with and without the DL model. Here, the three binary classifications (one-vs-rest) are defined as follows: A, "NORMAL versus PNEUMONIA or COVID"; B, "PNEUMONIA versus NORMAL or COVID"; and C, "COVID versus NORMAL or PNEUMONIA." Figure 3 shows the receiver operating characteristic curves of our DL model alone for the three binary classifications. Figure 4 shows the receiver operating characteristic curves of the eight radiologists with and without the DL model.

Table 2 Class-wise sensitivity, specificity, AUC, and 3-category classification accuracy of our DL model alone and eight radiologists with and without our DL model.
Figure 3

Class-wise receiver operating characteristic curves of our DL model in the external validation. (A) NORMAL versus PNEUMONIA or COVID, (B) PNEUMONIA versus NORMAL or COVID, and (C) COVID versus NORMAL or PNEUMONIA. Abbreviation: DL, deep learning.

Figure 4

Class-wise receiver operating characteristic curves of the eight radiologists with and without our DL model in the observer study. (A) NORMAL versus PNEUMONIA or COVID, (B) PNEUMONIA versus NORMAL or COVID, and (C) COVID versus NORMAL or PNEUMONIA. The blue and red lines represent the receiver operating characteristic curves of the radiologists with and without our DL model, respectively. Abbreviation: DL, deep learning.

The three-category classification accuracy of the DL model was 0.733 (132/180). The 95% confidence intervals of the class-wise AUCs of the DL model were as follows: A, 0.872–0.955; B, 0.903–0.972; and C, 0.711–0.862. The mean accuracy of the radiologists without the DL model was 0.696 ± 0.031 (range, 0.667 [120/180]–0.756 [136/180]), and their class-wise AUCs were as follows: A, 0.889 ± 0.027 (0.860–0.941); B, 0.844 ± 0.046 (0.792–0.905); and C, 0.716 ± 0.028 (0.679–0.757). The mean accuracy of the radiologists with the DL model was 0.723 ± 0.021 (range, 0.689 [124/180]–0.756 [136/180]), and their class-wise AUCs were as follows: A, 0.903 ± 0.028 (0.871–0.954); B, 0.883 ± 0.055 (0.792–0.938); and C, 0.762 ± 0.029 (0.730–0.816). The accuracy of our DL model was higher than that of six of the eight radiologists without the DL model.

Table 3 shows the averaged AUCs of the senior and junior radiologists with and without our DL model. There were five senior and three junior radiologists. According to Table 3, for both senior and junior radiologists, the difference in averaged class-wise AUC between sessions with and without the DL model was larger for classification C ("COVID versus NORMAL or PNEUMONIA") than for classifications A and B.

Table 3 Averaged AUC of senior and junior radiologists with and without our DL model.

We integrated the results of the eight radiologists with and without the DL model using the MRMCaov software and compared the class-wise AUC of the radiologists between reading sessions with and without the DL model. The MRMCaov results showed that, for classification C (COVID versus NORMAL or PNEUMONIA), there was a significant difference in AUC between the sessions with and without the DL model (p = 0.0038). For classifications A and B, there were no significant differences in AUC between the sessions with and without the DL model (p = 0.2396 and 0.1190, respectively). Figure 5 shows the class-wise receiver operating characteristic curves of the integrated results of the eight radiologists with and without the DL model.

Figure 5

Class-wise receiver operating characteristic curves obtained by integrating the reading session results of the eight radiologists with and without our DL model. Note: (A) NORMAL versus PNEUMONIA or COVID, (B) PNEUMONIA versus NORMAL or COVID, (C) COVID versus NORMAL or PNEUMONIA. The blue and red lines represent the integrated receiver operating characteristic curves of the radiologists with and without our DL model, respectively. Abbreviation: DL, deep learning.

Table 4 shows the results of the visual evaluation of the Grad-CAM++ images. The proportion of typical or understandable Grad-CAM++ images was 0.932 (123/132). The proportion of Grad-CAM++ images with highlighted regions outside the lung area was 0.200 (8/40) for COVID.

Table 4 Results of the visual evaluation of Grad-CAM++ images in the unseen test set.

Discussion

In this study, eight radiologists performed the reading sessions with and without the DL model, and the results were compared and analyzed using multi-reader multi-case statistical analysis. The diagnostic performance of the DL model alone was also evaluated. Our DL model achieved a higher accuracy and AUC than the majority of the eight radiologists without the DL model. Furthermore, the results of the statistical analysis showed that radiologists’ diagnostic performance was significantly improved by the DL model in diagnosing COVID-19 on CXR.

Based on the receiver operating characteristic analysis with MRMCaov, there was a significant difference in the radiologists' AUC between sessions with and without the DL model for "C: COVID versus NORMAL or PNEUMONIA" (p = 0.0038), whereas there was no significant difference for "A: NORMAL versus PNEUMONIA or COVID" or "B: PNEUMONIA versus NORMAL or COVID." One possible reason is that radiologists have less experience in reading COVID cases than NORMAL or PNEUMONIA cases. Based on these results, the DL model may be even more useful for medical doctors in other fields with less experience in reading COVID cases.

Because the DL model alone had a higher diagnostic performance than the majority of the eight radiologists, it may be possible to apply the DL model to COVID-19 diagnosis on CXR for screening and other purposes. This DL model may be useful especially in areas where medical resources are limited.

In a previous study, our DL model was significantly superior to radiologists in diagnosing COVID-19 pneumonia19. However, the DL model was not evaluated as a computer-aided diagnosis system in that study. In contrast, because the reading sessions of the present study were conducted by radiologists with and without the DL model, the present setting is closer to the practical clinical use of the DL model. In addition, the previous study had the disadvantage of relying on internal validation, whereas the current study used external validation, which generally produces more reliable results. Rangarajan et al.23 also performed an external validation of a DL model for COVID-19 diagnosis and pointed out that their DL model may complement COVID-19 diagnosis on CXR. Although their study is similar to ours, the classification targets and the method of statistical analysis differ.

To the best of our knowledge, there are no previous studies in which a three-category classification (including COVID) was performed using DL models with external validation; this study is the first to evaluate the generalizability of a DL model in such a three-category classification. Several studies have compared the diagnostic performance of DL models with that of radiologists for COVID-19 on CXR12,13,14 and reported that the AUC and accuracy of the DL model tended to exceed those of the radiologists in most cases. For example, Wehbe et al.14 compared the diagnostic performance of their DL model and two radiologists in classifying CXRs as COVID-19 positive or negative. Their DL model had a significantly higher sensitivity (71%) than one radiologist (60%) and a significantly higher specificity (92%) than the two radiologists (75% and 84%, respectively).

RT-PCR is the most commonly used test to detect COVID-19, but its sensitivity is not particularly high; one study reported a sensitivity of approximately 71%3. RT-PCR is also time consuming and often difficult to perform in small medical facilities, particularly in developing countries. In contrast, CXR is a simple imaging examination, although its diagnostic performance depends on the reader's ability. The sensitivity and specificity of our DL model were relatively high for the three binary classifications. Therefore, it may be possible to increase the usefulness of CXR as an alternative or complementary test to RT-PCR.

One of the reasons we evaluated our DL model by external validation is that it is difficult to evaluate a DL model accurately using public datasets. Garcia Santa Cruz et al. pointed out that public datasets contain undetected biases24. When these datasets are used for internal validation, there is a risk of overestimating the diagnostic performance of the DL model. Therefore, we attempted to mitigate these biases using external validation.

Our study has some limitations. First, the CXRs in this study were obtained from large hospitals, and only good-quality CXRs were used; therefore, we did not evaluate the usefulness of our DL model on poor-quality CXRs. Second, the observer study covered only normal, non-COVID-19 pneumonia, and COVID-19 pneumonia CXRs. Because we excluded CXRs with other lung diseases, we could not assess the usefulness of our DL model for such images.

In conclusion, our DL model alone showed better diagnostic performance than most of the eight radiologists in the external validation of the three-category classification of normal, non-COVID-19 pneumonia, and COVID-19 pneumonia. In addition, our DL model significantly improved the diagnostic performance of the eight radiologists in distinguishing COVID-19 pneumonia from normal or non-COVID-19 pneumonia.