Validating deep learning inference during chest X-ray classification for COVID-19 screening

The new coronavirus unleashed a worldwide pandemic in early 2020, with a fatality rate several times that of the flu. As the number of infections soared and testing capabilities lagged behind, chest X-ray (CXR) imaging became more relevant in the early diagnosis and treatment planning for patients with suspected or confirmed COVID-19 infection. Within a few weeks, new methods for lung screening using deep learning appeared rapidly, while quality assurance discussions lagged behind. This paper proposes a set of protocols to validate deep learning algorithms, including our ROI Hide-and-Seek protocol, which emphasizes or hides key regions of interest from CXR data. Our protocol allows assessing the classification performance for anomaly detection and its correlation to radiological signatures, an important issue overlooked in several deep learning approaches proposed so far. By running a set of systematic tests over CXR representations using public image datasets, we demonstrate the weaknesses of current techniques and offer perspectives on the advantages and limitations of automated radiography analysis when using heterogeneous data sources.

In order to answer these questions, this paper presents a systematic set of protocols necessary to validate deep learning models during decision-making based on CXR, pointing out the dangers of approaches that can be oblivious to relevant information from the lungs. In addition, we investigate the role that lung segmentation might play in the CXR classification process, particularly when including both data sources with previously known respiratory infection cases 16 and COVID-19 specific imaging 17. The main contributions of this investigation are the development of:
• Protocols to measure ML models when using heterogeneous data sources, particularly with an unbalanced number of patient cases;
• Strategies to verify that the visual features detected by the ML models are specifically recording the locations of lung abnormalities as opposed to bright artifacts, such as medical equipment or hard tissue;
• Algorithms to monitor the position of a feature used in the CXR image processing task, and to assess the correlation to critical factors associated with COVID-19.
The outline of this paper is as follows. We first discuss state-of-the-art research to show potential gaps when using neural networks to process radiography images. We then introduce protocols and strategies for assessing deep learning models for segmentation and classification, applied to the open access benchmark dataset COVIDx. The following section describes the experimental results when including or excluding lung segmentation, a method we call "Region of Interest (ROI) Hide-and-Seek", which uses a U-Net to detect the lungs as prior information to create 5 different image representations of the CXR data. Each set of CXR image representations is used to train and test each of 5 different deep learning architectures, based on COVID-Net CXR3-A, COVID-Net CXR4-A, AlexNet, VGG-11 and ResNet-50, which classify these CXR representations separately. Finally, we summarize the results and offer perspectives on the future of automated image classification using deep learning. The full pipeline of our proposed experiments is illustrated in Fig. 2, which emphasizes the creation of different CNN models and their evaluation using Grad-CAM to highlight which regions the neural networks consider important for classification in each of these cases.

Related work
Clinical imaging such as chest X-rays holds the promise to democratize access to early screening, particularly for symptomatic staging and treatment arrangements. That is because it enables rapid triage in resource-constrained and/or overwhelmed areas, improving availability and accessibility while using equipment broadly available at medical facilities throughout the world. As immediate treatment and care continue to challenge healthcare systems worldwide, screening rules became essential to quickly isolate potentially infected patients in order to mitigate the spread of the virus. In response, a community of investigators coalesced around using public datasets of CXR images, some also including metadata with respiratory conditions.
Various studies have been published in the area of deep learning for COVID-19 detection. Oh et al. 18 use patches of CXR images as input to a neural network in order to improve classification results for a small COVID-19 dataset. That paper normalizes the data by casting images to float32, followed by histogram equalization, gamma correction, and image resizing to 256 × 256 pixels. After image preprocessing, the network is trained with random patches from the lungs to detect COVID-19. For inference, a hundred random patches from the test set are used to evaluate the network performance, with the majority rule determining the network's decision, and the reported accuracy in classifying CXR images is 91.9%. Apostolopoulos et al. 12 20 propose a novel architecture for COVID-19 classification named Corodet, reporting accuracy of 99%, 94.2%, 91.2% for the 2-, 3-, and 4-class detection problems respectively. Rajaraman et al. 21 propose a method for weakly labeled data augmentation as a means to expand the training dataset. After considering an assortment of different image preprocessing steps and augmentation strategies, most of the approaches report better results using residual neural nets.
Nishio et al. 10 compare the performance of several pre-trained models (VGG-16, ResNet-50, MobileNet, DenseNet-121, EfficientNet) utilizing data augmentation for training on the task of classifying COVID-19 pneumonia, non-COVID pneumonia, and normal CXR images, reporting the best accuracy, 83.68%, from VGG-16. Bressem et al. 9 evaluate the performance of 16 different networks based on ResNet, DenseNet, VGG, SqueezeNet, Inception-V4 and AlexNet for the task of COVID-19, pneumonia, and normal classification using a public COVID-19 image data collection, as well as the task of classifying cardiomegaly, edema, consolidation, atelectasis and pleural effusion using the CheXpert dataset. They observed AUROCs between 0.83 and 0.89 for the CheXpert classification task, and AUROCs between 0.983 and 0.998 for the COVID-19 image collection classification task. Stubblefield et al. 24 explore the utility of using deep neural networks as feature extractors for classical networks to be applied on smaller datasets; they utilize a deep neural network trained on the CheXpert dataset for image feature extraction and XGBoost as the final algorithm for performing the classification task on their small dataset, aimed at distinguishing cardiac from infectious etiologies of ARDS. They also evaluated the effect of including clinical features in the final classifier of the predictive model. They achieved 67.5% accuracy for the infection label and 74.5% for the cardiac labeling task. Sahlol et al. 25 evaluate the performance of various neural networks, including Inception V3, on the COVID-19 classification task, and additionally show that performance can be improved when using the Marine Predators Algorithm for feature selection from Inception V3 while minimizing network size, reaching accuracy of 0.9877 and 0.9968 for two separate datasets.
Despite increased efforts toward standardizing data sources and metrics for comparison, these articles also considered different data sources to create a suitable CXR image set for all the classes to be well-represented.
After the public release of the COVIDx dataset 13, containing hundreds of COVID-19 CXR images, more than five hundred research works leveraged variants of that workflow and/or datasets, many proposing CNNs to evaluate computer vision tasks such as lung segmentation and classification of COVID-19 cases [26][27][28]. One of the most prominent CNNs was COVID-Net 13, which was proposed as a prediction engine to gain insights into important factors associated with COVID-19 cases imaged using CXR. Such work also attempted to verify that the decision-making is based on relevant information from the images. While representing an invaluable step forward in COVID-19 research, particularly regarding reproducibility, the authors state that it is not meant to be a production-ready solution. Our work investigates to what extent previous work on CXR could be leveraged as part of software that can improve prescreening. In order to address critical issues in using deep learning, we created a set of protocols and strategies to validate inference models based on CNNs using such CXR data sources, and to audit whether the learned models used radiological signatures rather than random artifacts, such as bright spots outside of the body, electrodes, and/or markup symbols.
www.nature.com/scientificreports/

Materials and methods
This section describes the different CXR image datasets we use to accomplish two main tasks: (a) CXR segmentation into lung area and non-lung; (b) CXR classification into normal, pneumonia and COVID-19.
Lung segmentation with U-Net. Segmentation data. This paper investigates the dataset curated and made publicly available by Tang et al. 16, in which they utilized the MUNIT method 29 to generate an augmented lung segmentation dataset using image-to-image translation. For training, we use the "Augmentation" dataset, which contains 2400 images and corresponding masks, and for testing we use their "NIH" dataset, which contains an additional 100 images and masks. We also perform additional testing using 206 images and masks available in the Cohen dataset.
Segmentation method. For the purposes of the segmentation task, we use an implementation of the U-Net 30, a deep neural network architecture that has been shown to have strong performance in various biological image segmentation tasks [31][32][33][34]. The proposed U-Net is trained with a batch size of 1 and a learning rate of 0.0001 for 50 epochs or until convergence. This U-Net model is then used to segment the training images in the COVIDx5 dataset in order to remove or isolate the ROIs from the images.
While undoubtedly a relevant contribution as a publicly available data repository, COVIDx also comes with caveats, for example, the distribution of patient images across the different infection types. As illustrated in Table 1, the number of images is highly skewed toward non-COVID-19 cases. To exacerbate this issue, COVID-19 images arose from one data source while images from other cases come from different origins. Therefore, these images are potentially influenced by features derived from acquiring data from different instruments and their respective artifacts.
ROI Hide-and-Seek protocol for data representation. Using the segmentation results, we modify the COVIDx5 dataset in four different ways. We first use the segmentation to isolate the lungs, and call this version of the dataset "lungs isolated". We then do the inverse operation, removing the lungs using the segmentation, and name this dataset "lungs removed". In a third dataset called "lungs framed", we use the segmentation to form bounding boxes around the lungs and remove everything outside of the boxes. We also perform the corresponding inverse operation, removing everything inside the boxes, and name this dataset "lungs boxed out". These are treated as separate datasets used for training and testing to evaluate whether the networks are able to learn when the image data includes or excludes the presumed ROI. Examples of these datasets are shown in Fig. 1. The lung-removal operator can leave artifacts, for example remainders of lung borders or lung shape, that might bias the model; we include the lung-boxed-out operator to avoid them. The expectation is that this set of protocols will enable clearer interpretations regarding accuracy metrics and the presence of the ROI in the input.
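The four representations described above can be illustrated with a minimal NumPy sketch. It assumes that a binary lung mask is already available from the segmentation network, that hidden pixels are filled with zeros, and that a single bounding box encloses all lung pixels; the function name `roi_hide_and_seek` and these choices are illustrative assumptions rather than the implementation used in this work.

```python
import numpy as np

def roi_hide_and_seek(image, lung_mask):
    """Build the four modified representations of one CXR image.

    image     : 2-D array of pixel intensities
    lung_mask : 2-D array, nonzero/True where the segmentation marks lung
    Hidden pixels are set to 0 (an assumption made for this sketch).
    """
    lung_mask = lung_mask.astype(bool)
    reps = {
        "lungs_isolated": np.where(lung_mask, image, 0),  # keep lung pixels only
        "lungs_removed": np.where(lung_mask, 0, image),   # blank out lung pixels
    }

    # One axis-aligned box around every lung pixel (assumption: a single box).
    box = np.zeros_like(lung_mask)
    rows, cols = np.nonzero(lung_mask)
    if rows.size:
        box[rows.min():rows.max() + 1, cols.min():cols.max() + 1] = True

    reps["lungs_framed"] = np.where(box, image, 0)     # keep the box content
    reps["lungs_boxed_out"] = np.where(box, 0, image)  # blank out the box
    return reps
```

Because each pair of operators is complementary, adding "lungs isolated" to "lungs removed" (or "lungs framed" to "lungs boxed out") recovers the original image, which is a convenient sanity check when preparing the datasets.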
Data preprocessing and augmentation. Wang et al. 13 perform various data preprocessing and augmentation functions for training and testing purposes. They first crop the top 8% of the images in order to remove metadata. Their provided training scripts use the following augmentations: random ratio resizing, ±10° random rotation, ±10% horizontal and vertical translation, ±15% zoom, ±10% intensity shifts, and horizontal flipping. When retraining their provided models from scratch using their own scripts on the raw and modified COVIDx5 datasets, we use all of these augmentations. However, for our own experiments training ResNet-50, VGG-11, and AlexNet, we use each of these augmentations with the exception of horizontal flipping, which is anatomically unrealistic since each image in the dataset is an anterior X-ray image.
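As a rough sketch of the preprocessing and augmentation ranges listed above, the following standard-library Python draws one random parameter set per image. The helper names `crop_top` and `sample_augmentation` are hypothetical, the rotation range is assumed to be in degrees, and random ratio resizing is omitted; applying the parameters to pixels would be left to an image library.

```python
import random

def crop_top(image_rows, fraction=0.08):
    """Drop the top `fraction` of rows, where burned-in metadata tends to sit."""
    offset = int(len(image_rows) * fraction)
    return image_rows[offset:]

def sample_augmentation(rng, allow_flip=False):
    """Draw one random set of augmentation parameters within the stated ranges."""
    return {
        "rotation_deg": rng.uniform(-10.0, 10.0),     # assumed to be degrees
        "shift_x": rng.uniform(-0.10, 0.10),          # fraction of image width
        "shift_y": rng.uniform(-0.10, 0.10),          # fraction of image height
        "zoom": 1.0 + rng.uniform(-0.15, 0.15),       # scale factor
        "intensity_shift": rng.uniform(-0.10, 0.10),  # fraction of intensity range
        # Flipping is disabled by default, matching the ResNet/VGG/AlexNet runs.
        "flip_horizontal": allow_flip and rng.random() < 0.5,
    }

rng = random.Random(0)  # fixed seed so the draw is repeatable
params = sample_augmentation(rng, allow_flip=False)
```

Passing `allow_flip=True` would correspond to the configuration used when retraining the COVID-Net models with their original scripts.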
Training. We created a modified version of the TensorFlow 36 scripts provided by Wang et al. 13 for training their provided models from scratch. For example, the derived scripts load information from the provided .meta files without the weights. Our scripts train the networks using a learning rate of 0.0002 and a batch size of 8. We train the COVID-Net CXR3-A and CXR4-A models for 10 epochs each. The scripts also employ the softmax cross entropy loss function and a class weighting scheme of 1, 1, 4 for the normal, pneumonia and COVID-19 classes respectively, together with data batch re-balancing to ensure that each batch is balanced in terms of the classes. Our modified version seeks to recreate the experiments of their work as closely as possible without having access to GenSynth 37, as their model was built automatically from a pre-trained baseline model, which is not available. For our own training experiments with ResNet-50, VGG-11, and AlexNet, we use the PyTorch deep learning framework 38, considering models pre-trained with ImageNet. We train each network with a batch size of 32 and a learning rate of 0.0001 for 20 epochs, using the cross entropy loss with class weights of 1, 1, 4 for the normal, pneumonia and COVID-19 classes respectively. We also use a data sampler to ensure that the data batches are balanced across the three classes.
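To make the class weighting and batch balancing concrete, here is a minimal pure-Python sketch. It computes a per-sample weighted softmax cross entropy with the 1, 1, 4 weights and draws a batch with the same number of samples per class. The names `weighted_cross_entropy` and `balanced_batch` are hypothetical, sampling with replacement is an assumption, and the way a framework averages the weighted losses over a batch is not modeled here.

```python
import math
import random

CLASS_WEIGHTS = (1.0, 1.0, 4.0)  # normal, pneumonia, COVID-19

def weighted_cross_entropy(logits, label, weights=CLASS_WEIGHTS):
    """Softmax cross entropy for one sample, scaled by the weight of its true class."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return weights[label] * (log_norm - logits[label])

def balanced_batch(indices_by_class, per_class, rng):
    """Draw `per_class` sample indices from every class, with replacement."""
    batch = []
    for indices in indices_by_class:
        batch.extend(rng.choices(indices, k=per_class))
    rng.shuffle(batch)
    return batch
```

With these weights, a misclassified COVID-19 image contributes four times the loss of an equally misclassified normal or pneumonia image, which partly compensates for the scarcity of COVID-19 cases in the dataset.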
In order to speed up our computations, we run these experiments on a high performance supercomputer called Cori, a Cray XC40 with a peak performance of about 30 petaflops, comprising 2388 Intel Xeon "Haswell" processor nodes and 9688 Intel Xeon Phi "Knights Landing" (KNL) nodes, also with access to GPUs. These resources are available at the National Energy Research Scientific Computing facility (NERSC) at Lawrence Berkeley National Laboratory.
Figure 3 illustrates results of running Grad-CAM on CXR images with examples labeled as normal, pneumonia and COVID-19. Table 2 shows the performance metrics of our U-Net segmentation model on both the NIH test set and the Cohen dataset. We reach an F1 score of 0.95 and 0.918 for the "lung" class on the NIH and Cohen datasets respectively. This is satisfactory performance for our purposes of masking out or isolating the ROI in the COVIDx dataset. Figure 1 visualizes sample images from the resulting datasets generated using the segmentation results.
Each modified version of COVIDx5, obtained with the ROI Hide-and-Seek protocol, is then used to train and test each of the neural network models separately. As a result, this paper reports the performance metrics when using COVID-Net CXR3-A in Table 3, and those of a more recent version of that model, COVID-Net CXR4-A, in Table 4. In addition, we check the classification performance of AlexNet in Table 5, VGG-11 in Table 6 and ResNet-50 in Table 7. All of these tables show the test accuracy and class-wise performance metrics (sensitivity, positive predictive value) in different columns, with each row showing the results corresponding to one of the individual modified datasets, namely lungs_isolated, lungs_removed, lungs_framed, lungs_boxed_out.
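For reference, the metrics named above can be computed as in the following sketch, which uses the standard definitions of F1 score, sensitivity and positive predictive value (PPV). The function names are hypothetical, and the code assumes that every class occurs and is predicted at least once, so that no denominator is zero.

```python
def f1_score(pred, truth):
    """F1 of the positive ("lung") class for two flat sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def classwise_metrics(confusion):
    """Accuracy, sensitivity and PPV from a square confusion matrix.

    confusion[i][j] = number of test images of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = [confusion[i][i] for i in range(n)]
    return {
        "accuracy": sum(correct) / total,
        # Sensitivity: correct predictions divided by all images of that true class.
        "sensitivity": [correct[i] / sum(confusion[i]) for i in range(n)],
        # PPV: correct predictions divided by all images predicted as that class.
        "ppv": [correct[i] / sum(row[i] for row in confusion) for i in range(n)],
    }
```

For a pixel-wise segmentation mask, the two inputs to `f1_score` would be the flattened predicted and reference masks.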

Results
Notice that each row corresponds to the test results of training each architecture from scratch on that specific modified version of the COVIDx dataset. It can be observed that the networks each have relatively high accuracy, even when trained and tested on datasets where the ROI lung regions are completely removed via segmentation (lungs_removed) or bounding boxes (lungs_boxed_out), reaching accuracies of up to 0.927 and 0.897 for these modified datasets respectively (see Table 7). Additionally, the test accuracy reached using each of these datasets never falls below 0.873 and 0.817 respectively (see Table 5). Most importantly, when compared to the accuracy using the original dataset, we observe that removing the ROI only results in a loss of accuracy between 0% and 4.4% for the lungs_removed dataset and between 3.6% and 10% for the lungs_boxed_out dataset.
It can also be observed that the class-wise performance metrics (sensitivity, positive predictive value) remain relatively high across all modifications of the dataset, implying that each class is still predicted well even when the ROI is removed.
Architecture comparisons. Although we considered the architectures proposed in Wang et al. 13, we developed several adaptations and improvements to circumvent the lack of reproducibility in their original proposal, and we introduced essential methods for scrutinizing classification schemes, such as our proposed ROI Hide-and-Seek protocol. The original COVID-Net model is described in their paper, but it, along with the architectures of the other models, was developed via the GenSynth framework. Unfortunately, that software is proprietary, so it is very difficult to reproduce the results and the development of the network. This also makes it unclear how the network models that followed the original paper, such as the CXR4-A model, were created. Tables 3, 4, 5, 6 and 7 show that well-known artificial neural network architectures led to similar representational capacity and accuracy when compared to COVID-Net, despite the fact that the latter was created via a human-machine collaborative design strategy.

Findings and interpretation.
The results of this study demonstrate that each of the neural network architectures used for classifying lung CXR images is able to predict whether a lung scan belongs to a COVID-19 patient with high accuracy even when the dataset is modified to remove the lungs from the image. When we remove significant parts of the image that are within the main ROI by segmenting the lungs, we see relatively little change in accuracy. This calls into question what features the neural networks actually extract when optimizing their parameters for the classification problem. ResNet-50 in particular is able to reach an accuracy of up to 0.897 (see Table 7) even for the most aggressive ROI removal case, lungs_boxed_out. This could imply that the features that the neural network extracts and uses for classification in fact exist outside of the lungs as well. In the case of the CXR3 and CXR4 models, it can additionally be observed that the network performs better when the lungs are removed from the image than when they are isolated. This could imply that significant amounts of information that contribute to classification in fact exist outside the lung regions, which is unexpected for a lung-based illness.
Instead of using GSInquire 13 to highlight critical factors, we use Grad-CAM 39 to provide explainable reasons for the classification results and further interpretation. Figure 3 shows the Grad-CAM results on test data. For datasets where lungs are visible in the images, the lungs are typically highlighted as the important feature. However, in datasets where the lungs are removed, the Grad-CAMs shift towards other regions such as the stomach and arms. These results are difficult to interpret, as it seems that the important features extracted by the neural network are entirely dependent on whatever information is presented to it, regardless of whether it makes sense from a domain knowledge standpoint. One of the main assertions of the Wang et al. 13 paper is that the Grad-CAMs can be used to interpret and isolate unhealthy parts of the lungs; however, these observations seem to suggest that what they highlight may be arbitrary.
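The Grad-CAM computation and one possible way of quantifying where its heat falls can be sketched as follows. The `grad_cam` function follows the standard Grad-CAM formulation (channel weights from spatially averaged gradients, then a ReLU over the weighted sum of activation maps). The `heat_inside_mask` helper is a hypothetical measure introduced here for illustration, not a metric reported in this paper, and the sketch assumes the lung mask has already been resized to the resolution of the heat map.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map from the last convolutional layer.

    activations : array (K, H, W), feature maps of the layer
    gradients   : array (K, H, W), gradient of the class score w.r.t. those maps
    """
    weights = gradients.mean(axis=(1, 2))              # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum, shape (H, W)
    cam = np.maximum(cam, 0.0)                         # ReLU keeps positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam             # scale to [0, 1]

def heat_inside_mask(cam, lung_mask):
    """Fraction of total Grad-CAM heat that falls inside the lung mask."""
    total = cam.sum()
    if total <= 0:
        return 0.0
    return float(cam[lung_mask.astype(bool)].sum() / total)
```

A value close to 1 would indicate that the network attends mostly to the lungs, while a low value on the original images would point to features outside the presumed ROI.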

Discussion
The classification of lung CXR for COVID-19 prescreening continues to be controversial, and some additional issues have been highlighted in this paper. Initially, we expected that the lung segmentation would enable data reduction and increase accuracy of the networks, or would reduce convergence time during network training since in theory this would allow the network to focus on the ROI. However, what we found out was that ROI isolation led to decreased accuracy when using any of the deep learning models for classification.
For example, we see a decline in the accuracy of ResNet-50 of about 2% when isolating the lungs with the segmentation, and of about 1.3% when using a bounding box to isolate the lungs. In order to better understand how much of the information outside of the lungs played a role in the classification results, we performed exactly the same training but with the lungs removed from the images. We found that when the lungs are removed, the network's accuracy barely changes, and in some cases the network actually outperforms the one trained with the lungs isolated. This might suggest that there is contextual information outside of the lungs that contributes to the classification results on the original, unmodified dataset. By running Grad-CAM, we highlight the different regions that might be affecting the classification results; these regions potentially carry information related to the data sources rather than to COVID-19.
Overall, we observed that the main issues of current approaches proposed to automate COVID-19 screening and/or diagnosis of CXR are: (a) the lack of clarity about which features are being detected when using deep learning algorithms, which can bring serious consequences to image classification, such as identification of image landmarks that are clinically irrelevant to COVID-19; (b) the lack of more systematic ways of testing classification strategies that prevent skewed conclusions regarding the use of convolutional neural networks; and (c) the need for clearer protocols for normalization of data coming from different sources, which are key to distinguishing correlations associated with a particular data source from the clinical usefulness of CXR image classification for respiratory illnesses. These issues are important in testing and comparing deep learning algorithms for CXR classification.

Data availability
The datasets generated during and/or analyzed during the current investigation are available in the GitHub repository: https://github.com/Electro1111/COVID_19_CXR_CLASSIFICATION.
Table 7. Performance metrics using different data preparation strategies classified with pre-trained ResNet-50 model. Bold values correspond to best performance metric for dataset in the same row.