Computer-aided diagnosis of chest X-ray for COVID-19 diagnosis in external validation study by radiologists with and without deep learning system

This study aimed to evaluate the diagnostic performance of our deep learning (DL) model of COVID-19 and to investigate whether the diagnostic performance of radiologists was improved by referring to our model. Our datasets contained chest X-rays (CXRs) for the following three categories: normal (NORMAL), non-COVID-19 pneumonia (PNEUMONIA), and COVID-19 pneumonia (COVID). We used two public datasets and a private dataset collected from eight hospitals for the development and external validation of our DL model (26,393 CXRs). Eight radiologists performed two reading sessions: one session was performed with reference to CXRs only, and the other was performed with reference to both CXRs and the results of the DL model. The evaluation metrics for the reading sessions were accuracy, sensitivity, specificity, and area under the curve (AUC). The accuracy of our DL model was 0.733, and that of the eight radiologists without DL was 0.696 ± 0.031. There was a significant difference in AUC between the radiologists with and without DL for COVID versus NORMAL or PNEUMONIA (p = 0.0038). Our DL model alone showed better diagnostic performance than most of the radiologists. In addition, our model significantly improved the diagnostic performance of radiologists for COVID versus NORMAL or PNEUMONIA.


Materials and methods
This retrospective study was approved by the institutional review boards of eight hospitals (Kobe University Hospital, St. Luke's International Hospital, Nishinomiya Watanabe Hospital, Kobe City Medical Center General Hospital, Kobe City Nishi-Kobe Medical Center, Hyogo Prefectural Kakogawa Medical Center, Kita Harima Medical Center, and Hyogo Prefectural Awaji Medical Center); the requirement for acquiring informed consent was waived by the institutional review boards of these eight hospitals owing to the retrospective nature of the study. This study complied with the Declaration of Helsinki and the Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan (https://www.mhlw.go.jp/file/06-Seisakujouhou-10600000-Daijinkanboukouseikagakuka/0000080278.pdf).
In addition to COVID private, CXRs were collected from two other medical institutions. In total, 168 CXRs (80 NORMAL, 37 PNEUMONIA, and 51 COVID) collected from one medical institution (Hospital A) were used for the internal validation of the DL model (as a part of the validation set) and for the radiologists' reading practice conducted before the observer study. Moreover, as an unseen test set, 180 CXR cases (60 NORMAL, 60 PNEUMONIA, and 60 COVID) collected from another medical institution (Hospital B) were used for the external validation of the DL model and the observer study of radiologists.
In Hospital B, COVID was limited to patients diagnosed with COVID-19 pneumonia using RT-PCR, and CXR was obtained after symptom onset. The time of COVID-19 diagnosis was between January 24, 2020, and May 5, 2020. PNEUMONIA was defined as patients clinically diagnosed with bacterial pneumonia that improved with appropriate treatment. Patients who showed no pneumonia on CT or had lung metastasis of malignancy and acute exacerbation of interstitial pneumonia were excluded from PNEUMONIA. NORMAL was defined as the absence of abnormalities in the lung, mediastinum, thoracic cavity, or chest wall on CXR and CT. NORMAL and PNEUMONIA were limited to cases before the summer of 2019 (before the COVID-19 pandemic). The details of the unseen test set collected from Hospital B are described in the Supplementary material. The inclusion criteria of CXRs in COVID private and Hospital A were the same as in the previous study 19.
Table 1 lists the details of each CXR dataset. The 180 cases (the unseen test set) used for the external validation and reading sessions were adults aged 20 years or older. In the 180 cases, NORMAL included 39 men and 21 women aged 58.1 ± 27.9 years. PNEUMONIA included 43 men and 17 women aged 76.2 ± 20.8 years. COVID included 46 men and 14 women aged 53.4 ± 38.6 years.

Deep learning model
Our EfficientNet-based DL model was constructed in the same manner as described in previous papers 18,19. Figure 1 shows a schematic of the construction of the DL model. There are two major differences in the DL model construction between the present study and previous studies: one is that the 168 CXRs collected from Hospital A were used for internal validation as a part of the validation set, and the other is that the 180 CXRs collected from Hospital B were used for external validation as the unseen test set. The DL model development set included two public datasets, COVID private, and the 168 CXRs collected from Hospital A. Five different random divisions of the training and validation sets were created from the development set. In each division, 300, 300, and 90 images were randomly selected as the validation set from COVIDx, COVID BIMCV, and COVID private, respectively. The remaining images of COVIDx, COVID BIMCV, and COVID private were used as the training set. In addition, all 168 CXRs collected from Hospital A were used for the validation set. Model training and internal validation of diagnostic performance were performed on the training set and validation set, respectively. The training of our DL model is also described in the Supplementary material.
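The five random divisions described above can be illustrated with a minimal sketch. The dataset sizes, image identifiers, seeds, and the function name `make_split` below are hypothetical placeholders (the real counts are in Table 1), and the sketch assumes that validation images are sampled without regard to class, which the text does not specify.

```python
import random

def make_split(dataset_ids, n_val_per_dataset, hospital_a_ids, seed):
    """Return (train_ids, val_ids) for one random division (hypothetical helper)."""
    rng = random.Random(seed)
    train = []
    val = list(hospital_a_ids)  # all Hospital A images go to the validation set
    for name, ids in dataset_ids.items():
        # Randomly pick the validation images for this dataset.
        chosen = set(rng.sample(ids, n_val_per_dataset[name]))
        val.extend(i for i in ids if i in chosen)
        # Everything not chosen stays in the training set.
        train.extend(i for i in ids if i not in chosen)
    return train, val

# Toy dataset sizes (assumed) so that the sketch runs; not the real counts.
dataset_ids = {
    "COVIDx": [f"covidx_{i}" for i in range(1000)],
    "COVID_BIMCV": [f"bimcv_{i}" for i in range(800)],
    "COVID_private": [f"private_{i}" for i in range(300)],
}
# Validation counts per dataset, as stated in the text.
n_val = {"COVIDx": 300, "COVID_BIMCV": 300, "COVID_private": 90}
hospital_a = [f"hospA_{i}" for i in range(168)]

# Five divisions, one per seed (the seeds themselves are an assumption).
splits = [make_split(dataset_ids, n_val, hospital_a, seed) for seed in range(5)]
```

Each division then yields one trained model, so five divisions give the five models that are later combined.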
The inference results of the DL model were calculated using an ensemble of five trained models. For the 180 CXRs of the external validation, an average of the probabilities obtained from the five trained models was calculated as the inference results of the DL model to evaluate the diagnostic performance of the DL model and to provide supporting information for radiologists during the observer study.
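As an illustration of the ensemble step, the following sketch averages the class probabilities of five models for a single CXR. The probability values and the function name `ensemble_average` are invented for the example, and taking the class with the highest averaged probability as the final diagnosis is an assumption, not something stated in the text.

```python
CLASSES = ("NORMAL", "PNEUMONIA", "COVID")

def ensemble_average(per_model_probs):
    """Average class probabilities over models.

    per_model_probs: one [p_normal, p_pneumonia, p_covid] list per model.
    """
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(model[c] for model in per_model_probs) / n_models
        for c in range(n_classes)
    ]

# Invented outputs of five trained models for one CXR (each row sums to 1).
probs = [
    [0.10, 0.30, 0.60],
    [0.20, 0.20, 0.60],
    [0.05, 0.35, 0.60],
    [0.15, 0.25, 0.60],
    [0.10, 0.40, 0.50],
]

avg = ensemble_average(probs)
# Assumption: the class with the highest averaged probability is the prediction.
predicted = CLASSES[avg.index(max(avg))]
```

Because each model's probabilities sum to 1, the averaged probabilities also sum to 1, so they can be shown to readers on the same 100% scale.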
The DL model calculated the probabilities of NORMAL, PNEUMONIA, and COVID for each CXR, with the three probabilities summing to 100%. We also created images using Grad-CAM and Grad-CAM++ as explainable artificial intelligence, which visualized the reasoning for the diagnosis of the DL model 20,21. Grad-CAM and Grad-CAM++ images were used for the observer study. Min-max normalization with a linear transformation was performed on the original Grad-CAM and Grad-CAM++ images.
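Min-max normalization with a linear transformation can be sketched as follows. The target range of [0, 1], the handling of a constant heatmap, and the function name `min_max_normalize` are assumptions made for illustration; the text does not state the output range.

```python
def min_max_normalize(heatmap):
    """Linearly rescale a 2D heatmap (list of rows) to the range [0, 1]."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Assumption: a constant heatmap carries no contrast, so map it to zeros.
        return [[0.0 for _ in row] for row in heatmap]
    scale = hi - lo
    return [[(v - lo) / scale for v in row] for row in heatmap]

# Invented 2x2 activation map standing in for a raw Grad-CAM output.
raw = [[2.0, 4.0], [6.0, 10.0]]
norm = min_max_normalize(raw)
# norm == [[0.0, 0.25], [0.5, 1.0]]
```

Normalizing each image to a common range makes the highlighted regions comparable across cases when they are overlaid on the CXR.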

Observer study
Eight radiologists (with 5-20 years of experience in diagnostic radiology) performed the observer study at two medical facilities. For the 180 CXRs collected from Hospital B, each radiologist performed two reading sessions over a period of more than 1 month. One reading session was performed with reference to CXRs only, and the other was performed with reference to both CXRs and the results of the DL model. The order of the two sessions was randomly selected to reduce bias. The eight radiologists scored the probabilities of NORMAL, PNEUMONIA, and COVID on a 100% scale. In the reading session with the DL model, the radiologists referred to the probabilities of NORMAL, PNEUMONIA, and COVID calculated using the DL model. If there was any uncertainty regarding the probabilities of the DL model, the results of Grad-CAM and Grad-CAM++ were available. Images of the 168 CXRs collected from Hospital A were also processed with Grad-CAM and Grad-CAM++, and the diagnosis of the DL model and the Grad-CAM and Grad-CAM++ images of the 168 CXRs were presented to the radiologists for practice sessions before each reading session. The eight radiologists were taught how to interpret the Grad-CAM and Grad-CAM++ images before the observer study. There was no time limit for the reading and practice sessions. Prior to the reading sessions, only the approximate frequencies of the three categories were presented to the radiologists, and no other clinical information was provided. The novelty of this study was to investigate whether radiologists changed their diagnosis by referring to our DL model of CXR and whether the diagnostic performance of radiologists was significantly improved.

Evaluation of Grad-CAM++ images
After the observer study, one senior radiologist visually evaluated the 180 Grad-CAM++ images in the test set. The visual evaluation of the Grad-CAM++ images was performed on the images that were accurately diagnosed by the DL model. The radiologist visually examined the CXR and Grad-CAM++ images and determined whether the Grad-CAM++ images were typical or understandable. The typical Grad-CAM++ images are described in the Supplementary material. If abnormal findings on CXR images were highlighted on Grad-CAM++ images, the cases were considered understandable by the radiologist. In addition, for COVID, the radiologist counted the number of Grad-CAM++ images with highlighted regions outside the lung area.

Statistical analyses
We evaluated the diagnostic performance of the DL model alone and compared the results between reading sessions with and without the DL model. The evaluation metrics were accuracy, sensitivity, specificity, and area under the curve (AUC) in the receiver operating characteristics. Because three-category classification was performed, these metrics were calculated class-wise (one-vs-rest), except for accuracy. For the AUC, multi-reader multi-case statistical analysis was used to statistically analyze the results of the eight radiologists. MRMCaov was used for the statistical analyses 22. Although MRMCaov is a statistical method designed for binary classification of two categories, this study was designed to diagnose three categories: NORMAL, PNEUMONIA, and COVID. Therefore, the three-category classification was divided into three binary classifications (one-vs-rest): (1) NORMAL versus PNEUMONIA or COVID, (2) PNEUMONIA versus NORMAL or COVID, and (3) COVID versus NORMAL or PNEUMONIA. We then compared the class-wise AUC of the eight radiologists between reading sessions with and without the DL model. The difference in the AUC was statistically tested using MRMCaov.
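The one-vs-rest decomposition can be made concrete with a small sketch that computes accuracy, class-wise sensitivity and specificity, and class-wise AUC for a single reader. The six toy cases, the scores, and the function names are invented; the sketch assumes that the predicted class is the one with the highest score, and it computes the empirical AUC as the fraction of positive/negative pairs ranked correctly (ties counted as one half). It does not reproduce the multi-reader multi-case analysis of MRMCaov.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """One-vs-rest sensitivity and specificity for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

def auc_one_vs_rest(y_true, scores, positive):
    """Empirical AUC: share of (positive, rest) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy data: true labels, predicted labels, and COVID probabilities.
y_true = ["NORMAL", "NORMAL", "PNEUMONIA", "PNEUMONIA", "COVID", "COVID"]
y_pred = ["NORMAL", "NORMAL", "PNEUMONIA", "COVID", "COVID", "NORMAL"]
covid_scores = [0.10, 0.30, 0.20, 0.60, 0.80, 0.25]

# Three-category accuracy is computed once, not class-wise.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Classification C: COVID versus NORMAL or PNEUMONIA.
sens_c, spec_c = sensitivity_specificity(y_true, y_pred, "COVID")
auc_c = auc_one_vs_rest(y_true, covid_scores, "COVID")
```

The same two functions applied with `positive="NORMAL"` or `positive="PNEUMONIA"` (and the corresponding class scores) give classifications A and B.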
Because it was necessary to integrate the results from the eight radiologists, the class-wise MRMCaov was used in the present study. To control the family-wise error rate, Bonferroni correction was used; a p value less than 0.01666 was considered statistically significant. R (version 4.1.2) was used for the statistical analysis.
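The threshold of 0.01666 is consistent with dividing a family-wise level of 0.05 by the three one-vs-rest comparisons. The short sketch below shows that arithmetic; the value 0.05 is inferred from the threshold rather than stated in the text, and the p values are those reported in the Results for classifications A, B, and C.

```python
ALPHA = 0.05        # assumed family-wise significance level (inferred, not stated)
N_COMPARISONS = 3   # classifications A, B, and C (one-vs-rest)

# Bonferroni correction: divide the family-wise level by the number of tests.
threshold = ALPHA / N_COMPARISONS  # approximately 0.01667

# p values reported in the Results for the with/without DL comparison.
p_values = {"A": 0.2396, "B": 0.1190, "C": 0.0038}
significant = {name: p < threshold for name, p in p_values.items()}
```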

Results
Figure 2 shows examples of CXR, Grad-CAM, and Grad-CAM++ images from NORMAL, PNEUMONIA, and COVID. As shown in Fig. 2, in the images of Grad-CAM and Grad-CAM++ from NORMAL, there was often a relatively symmetrical region of interest in the lung fields. In PNEUMONIA, the region of interest was observed in the unilateral lung field in most cases, which was consistent with an abnormal shadow caused by pneumonia. COVID tended to show regions of interest in both the lungs and mediastinum. Table 2 shows the sensitivity, specificity, accuracy, and AUC of the DL model and the eight radiologists with and without the DL model. Here, the three types of binary classifications (one-vs-rest) were defined as follows: A, "NORMAL versus PNEUMONIA or COVID"; B, "PNEUMONIA versus NORMAL or COVID"; and C, "COVID versus NORMAL or PNEUMONIA." Figure 3 shows the receiver operating characteristics curves of our DL model alone for the three types of binary classifications. Figure 4 shows the receiver operating characteristics curves of the eight radiologists with and without the DL model. The three-category classification accuracy of the DL model was 0.733 (132/180). The 95% confidence intervals of the class-wise AUC of the DL model were as follows: A, 0.872-0.955; B, 0.903-0.972; and C, 0.711-0.862. The mean accuracy of radiologists without the DL model was 0.696 ± 0.031 (range, 0.667 [120/180]-0.756 [136/180]). Their class-wise AUCs without the DL model were as follows: A, 0.889 ± 0.027 (0.860-0.941); B, 0.844 ± 0.046 (0.792-0.905); and C, 0.716 ± 0.028 (0.679-0.757). The mean accuracy of radiologists with the DL model was 0.723 ± 0.021 (range, 0.689 [124/180]-0.756 [136/180]). Their class-wise AUCs with the DL model were as follows: A, 0.903 ± 0.028 (0.871-0.954); B, 0.883 ± 0.055 (0.792-0.938); and C, 0.762 ± 0.029 (0.730-0.816). The accuracy of our DL model was better than that of six radiologists without the DL model.
Table 3 shows the averaged AUC of senior and junior radiologists with and without our DL model. The numbers of senior and junior radiologists were five and three, respectively. According to Table 3, for both senior and junior radiologists, the difference in the averaged class-wise AUC between sessions with and without the DL model was larger for C ("COVID versus NORMAL or PNEUMONIA") than for A and B.
We integrated the results of the eight radiologists with and without the DL model using the software MRMCaov and compared the class-wise AUC of radiologists between reading sessions with and without the DL model. The results of MRMCaov showed that in classification C (COVID versus NORMAL or PNEUMONIA), there was a significant difference in AUC between the radiologists with and without the DL model (p = 0.0038). In classifications A and B, there were no significant differences in the AUC between the radiologists with and without the DL model (p = 0.2396 and 0.1190, respectively). Figure 5 shows the class-wise receiver operating characteristics curves of the integrated results of the eight radiologists with and without the DL model.
Table 4 shows the results of the visual evaluation of the Grad-CAM++ images. The ratio of the typical or understandable Grad-CAM++ images was 0.932 (123/132). The ratio of Grad-CAM++ images highlighted outside the lung area was 0.200 (8/40) for COVID.

Discussion
In this study, eight radiologists performed the reading sessions with and without the DL model, and the results were compared and analyzed using multi-reader multi-case statistical analysis. The diagnostic performance of the DL model alone was also evaluated. Our DL model achieved a higher accuracy and AUC than the majority of the eight radiologists without the DL model. Furthermore, the results of the statistical analysis showed that radiologists' diagnostic performance was significantly improved by the DL model in diagnosing COVID-19 on CXR.
Based on the results of the receiver operating characteristics analysis with MRMCaov, there was a significant difference in the AUC of radiologists between sessions with and without the DL model for "C: COVID versus NORMAL or PNEUMONIA" (p = 0.0038). However, there was no significant difference for "A: NORMAL versus PNEUMONIA or COVID" and "B: PNEUMONIA versus NORMAL or COVID." One possible reason for these results may be
One of the reasons why we evaluated our DL model by external validation is that it is difficult to evaluate the DL model accurately using public datasets. Garcia Santa Cruz et al. pointed out that public datasets contain undetected bias 24. When these datasets are used for internal validation, there is a risk of overestimation of the diagnostic performance of the DL model. Therefore, we attempted to mitigate these biases using external validation.
Our study has some limitations. First, the CXRs in this study were obtained from large-sized hospitals, and good-quality CXRs were used. Therefore, we did not evaluate the usefulness of our DL model on poor-quality CXRs. Second, we conducted an observer study for CXRs with normal, non-COVID-19 pneumonia, and COVID-19 pneumonia. Because we excluded CXRs with other lung diseases, we could not assess the usefulness of our DL model for these images.
In conclusion, our DL model alone showed better diagnostic performance than most of the eight radiologists in the external validation of the three-category classification of normal, non-COVID-19 pneumonia, and COVID-19 pneumonia. In addition, our DL model significantly improved the diagnostic performance of the eight radiologists in COVID-19 pneumonia versus normal or non-COVID-19 pneumonia.

Table 1. Numbers of CXR images in the datasets: COVIDx, COVID BIMCV, COVID private, Hospital A, and Hospital B. All cases of PNEUMONIA were bacterial pneumonia in COVID private, Hospital A, and Hospital B. Abbreviations: CXR, chest X-ray; COVIDx, public dataset used for COVID-Net; COVID BIMCV, public dataset obtained from the PadChest and BIMCV-COVID19+ datasets; COVID private, private dataset collected from six hospitals; Hospital A, dataset collected for internal validation; Hospital B, dataset collected for external validation. Hospitals A and B were not included in the six hospitals where COVID private data were collected.

Figure 1. Schematic illustration of dataset splitting and model training for our DL model. Abbreviations: DL, deep learning; COVIDx, public dataset used for COVID-Net; COVID BIMCV, public dataset obtained from the PadChest and BIMCV-COVID19+ datasets; COVID private, private dataset collected from six hospitals; Hospital A, dataset collected for internal validation and radiologists' practice before the observer study; Hospital B, dataset collected for external validation.

Figure 2. Results of Grad-CAM and Grad-CAM++ for our DL model. (A) NORMAL, (B) PNEUMONIA, and (C) COVID. Each row consists of CXR images collected from Hospital A and the Grad-CAM and Grad-CAM++ results. One trained DL model was used for Grad-CAM and Grad-CAM++. Left column, original CXR image; middle column, result of Grad-CAM; right column, result of Grad-CAM++. Abbreviations: DL, deep learning; CXR, chest X-ray.

Figure 3. Class-wise receiver operating characteristics curves of our DL model in external validation. (A) NORMAL versus PNEUMONIA or COVID, (B) PNEUMONIA versus NORMAL or COVID, and (C) COVID versus NORMAL or PNEUMONIA. Abbreviation: DL, deep learning.

Figure 4. Class-wise receiver operating characteristics curves of the eight radiologists with and without our DL model in the observer study. (A) NORMAL versus PNEUMONIA or COVID, (B) PNEUMONIA versus NORMAL or COVID, and (C) COVID versus NORMAL or PNEUMONIA. The blue and red lines represent the receiver operating characteristics curves of the radiologists with and without our DL model, respectively. Abbreviation: DL, deep learning.