Searching for pneumothorax in x-ray images using autoencoded deep features

Fast diagnosis and treatment of pneumothorax, a collapsed or dropped lung, is crucial to avoid fatalities. Pneumothorax is typically detected on a chest X-ray image through visual inspection by experienced radiologists. However, the detection rate is quite low due to the complexity of visual inspection for small lung collapses. Therefore, there is an urgent need for automated detection systems to assist radiologists. Although deep learning classifiers generally deliver high accuracy levels in many applications, they may not be useful in clinical practice due to the lack of high-quality and representative labeled image sets. Alternatively, searching in the archive of past cases to find matching images may serve as a “virtual second opinion” through accessing the metadata of matched evidently diagnosed cases. To use image search as a triaging or diagnosis assistant, we must first tag all chest X-ray images with expressive identifiers, i.e., deep features. Then, given a query chest X-ray image, the majority vote among the top k retrieved images can provide a more explainable output. In this study, we searched in a repository with more than 550,000 chest X-ray images. We developed the Autoencoding Thorax Net (AutoThorax-Net for short) for image search in chest radiographs. Experimental results show that image search based on AutoThorax-Net features can achieve high identification performance, providing a path towards real-world deployment. We achieved an AUC of 92% for a semi-automated search in 194,608 images (pneumothorax and normal) and an AUC of 82% for a fully automated search in 551,383 images (normal, pneumothorax and many other chest diseases).

www.nature.com/scientificreports/ DenseNet121 architecture 26 , has been trained on ChestX-ray14 dataset and achieved radiologist-level detection of pneumonia. Since then, many DNN architectures have been proposed for a variety of tasks ranging from localization 21 , lateral and frontal dual chest X-ray reading 15 , integration of non-image data in classification 20 , attention-guided approaches 27 , location-aware schemes 6 , weakly-supervised methods 28,29 as well as generative models 8 . For detecting pneumothorax, a recent study collected 13,292 frontal chest X-rays (3107 with pneumothorax) to train a DNN to verify the presence of a large or moderate-sized pneumothorax. Another recent study 4 collected 1003 images (437 with pneumothorax and 566 with no abnormality) to detect pneumothorax with DNNs. So far, there has been no study to investigate pneumothorax detection in a large dataset, perhaps by combining the three large public datasets.
Deep learning for content-based image retrieval. Retrieving similar images given a query image is known as Content-Based Image Retrieval (CBIR) 30 or Content-Based Medical Image Retrieval (CBMIR) 31 for medical applications. While classification-based methods provide promising results 32 , CBMIR systems can assist clinicians by enabling them to compare the case they are examining with previous (already diagnosed) cases and by exploiting the information in corresponding medical reports 31 . It may also help radiologists prepare reports for a particular diagnosis faster and more reliably 31 .
While deep learning methods have been applied to CBIR tasks in recent studies 33,34 , there has been less attention on exploring deep learning methods for CBMIR tasks 35,36 .
One study investigated the performance of DNNs for MR and CT images with human anatomy labels 35 . Another study investigated the retrieval performance of DNNs among multimodal medical images for different body organs 36 . There is also a study exploring hashing deep features into binary codes, testing among lung, pancreas, neuro and urothelial bladder images 37 . Deep Siamese Convolutional Neural Networks 38 have also been tested for CBMIR to minimize the use of expert labels using multiclass diabetic retinopathy fundus images. Another study explored deep learning for CBMIR among multiple modalities for a large number of classes 39 . So far, there has not been any report validating CBMIR techniques for a challenging case like pneumothorax detection in large datasets. We attempt to close this gap by reporting our results on a large dataset of chest X-ray images obtained by fusing three public datasets.

Methods
Given a query chest X-ray image, the problem is to output whether it contains pneumothorax using image search in archived images through majority vote among retrieved similar images from the archive.
The proposed method of using image search as a classifier comprises three phases (Fig. 2): (1) tagging images with deep features (all images in the database are fed into a pre-trained network to extract features), (2) image search (tagging the query image with features and calculating its distance to all other features in the archive to find the most similar images), and (3) classification (majority vote among the labels of retrieved similar images).
Phase 1: tagging images with deep features. In this phase, all chest X-ray images in the archive are tagged with deep features. To represent a chest X-ray image as a feature vector with a fixed dimension, the last pooling layer may be used as image representation. In other words, the pre-trained deep convolutional neural network is considered as a feature extractor to convert a chest X-ray image into an n-dimensional feature vector with n = 1,024 being a typical value for many networks. In our study, DenseNet121 26 is used for converting a chest X-ray image into a feature vector with 1,024 dimensions. DenseNet topology has a strong gradient flow contributing to diverse features and is, compared to many other architectures such as ResNet and EfficientNet, quite compact with only 7 million weights. We adopted DenseNet121 also for a fair comparison in experiments with CheXNet, which has also used DenseNet121 as its backbone architecture. Three configurations are explored to extract deep features from a chest X-ray image (Fig. 3):
• Configuration 1: a feature vector is extracted from the entire chest X-ray image. Representing the entire image with one feature vector is quite common and assumes that the object or abnormality will be adequately quantified in the single feature vector for the entire image.
• Configuration 2: two feature vectors are extracted, one from the left chest side and one from the flipped version of the right chest side. The final feature vector is a concatenation of these two feature vectors. If DenseNet121 26 is adopted as the feature extractor, the feature vector has 2,048 values. The rationale behind this idea is to allow expressive features for each side of the chest to be quantified separately to make feature matching easier for the unsupervised search. In addition, flipping the right lung is a registration-like operation to facilitate alignment in matching.
• Configuration 3: three feature vectors are extracted as described in the previous two configurations. The final feature vector is a concatenation of these three feature vectors. If DenseNet121 26 is adopted as the feature extractor, the final feature vector has 3,072 real-valued features. The rationale behind this configuration is that matching a combined feature vector that represents the whole image, the left chest side and the flipped right chest side not only provides a global image view but also more focused and aligned attention to each chest side to emphasize their features in the search and matching process.
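The three configurations can be illustrated with a minimal sketch. It assumes the image is a list of pixel rows split at its vertical midline (the actual cropping is defined by Fig. 3), and it uses a hypothetical two-value `toy_extractor` in place of DenseNet121; the names `split_halves`, `flip_horizontal` and `tag_image` are illustrative and not taken from the study.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]                   # rows of pixel intensities
Extractor = Callable[[Image], List[float]]  # image -> fixed-length feature vector


def split_halves(image: Image) -> Tuple[Image, Image]:
    """Split an image at its vertical midline (assumed cropping rule)."""
    mid = len(image[0]) // 2
    return [row[:mid] for row in image], [row[mid:] for row in image]


def flip_horizontal(image: Image) -> Image:
    """Mirror an image left-to-right."""
    return [row[::-1] for row in image]


def tag_image(image: Image, extract: Extractor, configuration: int) -> List[float]:
    """Build the feature vector for configuration 1, 2 or 3."""
    whole = extract(image)
    if configuration == 1:
        return whole
    left_half, right_half = split_halves(image)
    # The right half is mirrored so both sides share the same orientation.
    sides = extract(left_half) + extract(flip_horizontal(right_half))
    if configuration == 2:
        return sides
    if configuration == 3:
        return whole + sides
    raise ValueError("configuration must be 1, 2 or 3")


def toy_extractor(image: Image) -> List[float]:
    """Hypothetical stand-in for a deep network: mean of first and last columns."""
    first = sum(row[0] for row in image) / len(image)
    last = sum(row[-1] for row in image) / len(image)
    return [first, last]
```

With an extractor that returns 1,024 values, the same function would yield vectors of length 1,024, 2,048 and 3,072 for configurations 1, 2 and 3, respectively.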
Phase 2: image search. In this phase, the distances between the deep features of the query chest X-ray image and those of all chest X-ray images in the database are computed. The chest X-ray images with the shortest distances to the query chest X-ray image are subsequently retrieved. The Euclidean distance, as the most commonly used norm for deep feature matching, was used for computing the distance between the deep features of two given chest X-ray images. It is the geometric distance in the multidimensional space, recommended when all variables have the same metric 40 . The calculated distances can be sorted to retrieve as many matched images as desired. The impact of distance norms on retrieval may be investigated in future works.
Phase 3: classification. In this phase, the majority vote among the labels of retrieved chest X-ray images is used as a classification decision. For example, given a query chest X-ray image, the top k most similar chest X-ray images are retrieved. If m chest X-ray images are labelled with pneumothorax (with m ≤ k), the query image is classified as pneumothorax with a class likelihood of m/k. The larger k is, the more reliable the classification vote becomes. This, in turn, requires a large archive of tagged images to increase the probability of finding similar images.
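Phases 2 and 3 can be sketched as a brute-force nearest-neighbour search followed by a vote. This is a minimal illustration, not the study's implementation: the function names are hypothetical, the exhaustive sort is only practical for small archives, and the default decision threshold of 0.5 is an assumption (the experiments select the threshold from the ROC curve).

```python
import math
from typing import List, Sequence, Tuple


def search_top_k(query: Sequence[float],
                 archive_features: Sequence[Sequence[float]],
                 k: int) -> List[int]:
    """Return the indices of the k archived vectors closest to the query."""
    distances = [(math.dist(query, feat), idx)
                 for idx, feat in enumerate(archive_features)]
    distances.sort()  # ascending Euclidean distance
    return [idx for _, idx in distances[:k]]


def vote(query: Sequence[float],
         archive_features: Sequence[Sequence[float]],
         archive_labels: Sequence[int],
         k: int,
         threshold: float = 0.5) -> Tuple[float, bool]:
    """Class likelihood m/k and the resulting decision for one query image.

    archive_labels holds 1 for pneumothorax and 0 otherwise.
    """
    retrieved = search_top_k(query, archive_features, k)
    m = sum(archive_labels[idx] for idx in retrieved)
    likelihood = m / len(retrieved)
    return likelihood, likelihood > threshold


# Tiny made-up archive: two positive cases near the origin, three negatives.
archive = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
labels = [1, 1, 0, 0, 0]
likelihood, is_pneumothorax = vote([0.1, 0.1], archive, labels, k=3)
print(likelihood, is_pneumothorax)
```

Returning the likelihood m/k alongside the decision keeps the retrieved evidence inspectable, which is the explainability argument made for search-based classification.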
Compressing feature dimensionality using autoencoders. The dimensionality of the feature vectors, especially the concatenated ones, may become a computational obstacle, but it can be reduced by employing autoencoders. One may use an autoencoder for all configurations, but our main motivation was a size reduction for the longest feature vector, i.e., that of configuration 3. Two steps are required to construct an encoder to reduce feature vector dimensionality:
• Step 1: Unsupervised end-to-end training with a decoder. An autoencoder with the architecture summarized in Fig. 4a is first constructed. A dropout layer 41 with a probability of 0.2 is introduced between each layer to reduce the probability of overfitting. The model is then trained for 10 epochs by backpropagation with outputs being set equal to inputs. The batch size, loss function and optimizer were set to 128, Mean Squared Error and Adam, respectively. The training details are visualized in Fig. 5.
• Step 2: Supervised fine-tuning with labels. After the training, the decoder in the model is removed as we only need the encoding part for dimensionality reduction. In its place, a one-dimensional fully connected layer with the sigmoid function as activation function is attached for the fine-tuning phase. The network was trained for 10 epochs with a batch size of 128 using the binary cross-entropy loss function and the Adam optimizer. The training details are visualized in Fig. 5. During training, to deal with class imbalance, an individual class weight was set for each class using the following formula:

$$ w_{c_j} = \frac{S}{C \times S_{c_j}} \qquad (1) $$

where $w_{c_j}$ is the class weight of class $c_j$; $C$ is the total number of classes; $S$ is the total number of training samples; $S_{c_j}$ is the total number of training samples belonging to class $c_j$.
The model architecture is summarized in Fig. 4b. Similarly, a dropout layer 41 was introduced between the 256-dimensional layer and the one-dimensional layer to reduce the probability of overfitting. The model is then trained through backpropagation. After the training, the one-dimensional fully connected layer is removed. The model architecture of the final encoder is summarized in Fig. 4c.
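As a worked illustration of the class weighting, the sketch below assumes the common balanced form w_c = S / (C × S_c), which uses exactly the three quantities defined above. The function name `class_weights` is hypothetical, and the counts are the full Dataset 1 class sizes reported in this paper, used purely for illustration (the per-fold training counts would be smaller).

```python
from typing import Dict


def class_weights(samples_per_class: Dict[str, int]) -> Dict[str, float]:
    """Balanced class weights, assuming w_c = S / (C * S_c)."""
    total_samples = sum(samples_per_class.values())   # S
    num_classes = len(samples_per_class)              # C
    return {name: total_samples / (num_classes * count)
            for name, count in samples_per_class.items()}


# Illustrative counts: pneumothorax versus normal images in Dataset 1.
weights = class_weights({"pneumothorax": 34605, "normal": 160003})
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Under this weighting, the rarer pneumothorax class receives a weight of roughly 2.8 and the normal class roughly 0.6, so both classes contribute equally to the total loss.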
Model architecture. The architecture of AutoThorax-Net to obtain features from a chest X-ray image is illustrated in Fig. 6.

Results
In this section, we first describe the datasets collected and the preprocessing procedures. We then describe the experiments that were conducted, followed by the analysis. The main goal of the experiments is to validate the performance of image search via matching deep features. In order to establish performance quantification, we treat search like a classifier by taking a consensus vote guided by the ROC statistics. We also compare the results with CheXNet (without any modification or fine-tuning), which is an end-to-end deep network specially trained for classifying chest X-ray images.
Dataset collection. Three large public datasets of chest X-ray images were collected. The first is MIMIC-CXR 24,42 , a dataset of 371,920 chest X-rays associated with 227,943 imaging studies. A total of 248,236 frontal chest X-ray images in the training set were used in this study. The second dataset is CheXpert 19 , a dataset consisting of 224,316 chest radiographs belonging to 65,240 patients. A total of 191,027 frontal chest X-ray images in the training set were used in this study. The third dataset is ChestX-ray14 13 consisting of 112,120 frontal-view X-ray images of 30,805 patients. All chest X-ray images in this dataset were used in this study. In total, 551,383 frontal chest X-ray images were used in our investigations. 34,605 images (6% of all images) were labelled as pneumothorax. The labels refer to the entire image; the collapsed lungs were not highlighted in any way.
Implementation and parameter setting. We used the library Keras (http://keras.io/) v2.2.4 with Tensorflow backend 43 to implement the approach. As we used a pre-trained network for feature extraction, DenseNet121 26 was selected, and the weight file was obtained through the default setting of Keras. For CheXNet 14 , the weight file was downloaded from GitHub (https://github.com/brucechou1983/CheXNet-Keras). All images were resized to 224 × 224 before feeding into networks. All other parameters were default values unless otherwise specified. All experiments were run on a computer with 64.0 GB DDR4 RAM, an Intel Core i9-7900X @3.30 GHz CPU (10 Cores) and one GTX 1080 graphics card.
Performance evaluation. Following relevant literature 14,19 , the performance of classification was evaluated by the area under the curve (AUC) for the receiver operating characteristic curve (ROC curve) to enable the comparison over a range of prediction thresholds. As a tenfold cross-validation was conducted in the experiments, average ROC was computed with 95% confidence interval.
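For illustration, the AUC can be computed directly from per-query scores (for example the class likelihoods m/k) through its rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The following is a minimal sketch with a hypothetical function name; its quadratic pairwise loop is meant for clarity, not for half a million images.

```python
from typing import Sequence


def auc_from_scores(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up scores for four queries (two positive, two negative).
print(auc_from_scores([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

Because the value depends only on the ordering of scores, it summarizes performance over all prediction thresholds at once, which is why it is suited to comparing methods before a specific threshold is chosen.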

Dataset preparation & preprocessing.
There is a concern regarding the ChestX-ray14 13 dataset that chest X-ray images showing chest tubes were frequently labelled with Pneumothorax 44,45 . As we combined ChestX-ray14 with the CheXpert 19 and MIMIC-CXR 24,42 datasets in our experiments, this bias is mitigated.

Dataset 1 (semi-automated detection).
If there is a suspicion of pneumothorax by the user, then the search will be limited to the archived images that are either normal (i.e., no finding) or pneumothorax. This dataset comprises 34,605 pneumothorax chest X-ray images and 160,003 normal chest X-ray images. Searching in this dataset means there is already a suspicion by the expert that the image may contain pneumothorax, hence the search is guided to only search within archived images that are diagnosed as either pneumothorax or normal (no finding). The pneumothorax images were obtained from the collected frontal chest X-ray images with the label "Pneumothorax". They were considered as the positive (+ ve) class. The normal images were obtained from the collected frontal chest X-ray images with the label "No Finding". These chest X-ray images were considered as the negative (− ve) class. A summary of dataset 1 is provided in Table 1.

Dataset 2 (fully-automated detection).
If there is no concrete suspicion from the user, we match the input image against all other images regardless of their tagged disease label. This dataset comprises 34,605 pneumothorax chest X-ray images and 516,778 non-pneumothorax chest X-ray images. Searching in this dataset means the computer is automatically searching in all images to verify the likelihood of pneumothorax without any guidance from the expert. The pneumothorax images were obtained from the collected frontal chest X-ray images with the label "Pneumothorax". They were considered as the positive (+ ve) class. The non-pneumothorax images were obtained from the collected frontal chest X-ray images without the label "Pneumothorax", meaning that they may contain cases such as normal, pneumonia, edema, cardiomegaly and more. They were considered as the negative (− ve) class. A summary of dataset 2 is provided in Table 2.
First experiment series: semi-automated solution. The first experiment series focuses on a "semi-automated" solution for pneumothorax. We confine the search and classification to cases that are either normal or diagnosed with pneumothorax (Dataset 1). We test all three configurations (Fig. 3), CheXNet, and the proposed AutoThorax-Net. We constructed the receiver operating characteristics (ROC) curve for the dataset to find the trade-off between sensitivity and specificity (Fig. 7). We used Youden's index 46 to find the trade-off position on the ROC curve providing the threshold for match selection. Youden's index can be calculated as "sensitivity + specificity − 1". A standard tenfold cross-validation was adopted for the tests, which showed a very low standard deviation for all experiments, apparently due to the large size of the datasets. All tagged chest X-ray images were divided into 10 groups.
In each fold, one group of chest X-ray images was used as validation set, while the remaining chest X-ray images were used as "archived" images to be searched. The above process was repeated 10 times, such that in each fold a different group of chest X-ray images was used as the validation set. In each fold, an encoder was trained using the archived set of that fold. The encoder was then used for compressing deep features for each chest X-ray image in the validation set. The parameters of the encoder construction process are described as follows: Step 1 Unsupervised end-to-end training with decoder: The training epoch and batch size were set as 10 and 128, respectively. The loss function was chosen as mean-squared-error. Adam optimizer 47 was used. The dropout rate was set to 0.2, i.e., a probability of 20% setting the neuron output as zero to counteract possible overfitting.
Step 2 Supervised fine-tuning with labels: The loss function was chosen as binary cross-entropy. Other parameters remained the same as in Step 1.
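The threshold selection with Youden's index used in this workflow can be sketched as follows. This is a minimal illustration under stated assumptions: the scores are taken to be per-query class likelihoods (such as m/k), a query is called positive when its score is at least the threshold, and the function name and example values are hypothetical.

```python
from typing import Sequence, Tuple


def youden_threshold(scores: Sequence[float],
                     labels: Sequence[int]) -> Tuple[float, float, float, float]:
    """Return (threshold, J, sensitivity, specificity) maximizing Youden's index.

    J = sensitivity + specificity - 1; labels are 1 (positive) or 0 (negative).
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("both classes must be present")
    best = (0.0, -1.0, 0.0, 0.0)
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        sensitivity = true_pos / num_pos
        specificity = true_neg / num_neg
        j = sensitivity + specificity - 1
        if j > best[1]:  # keep the first threshold reaching the maximum
            best = (threshold, j, sensitivity, specificity)
    return best


# Made-up likelihoods for four negative and three positive queries.
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1]
print(youden_threshold(scores, labels))
```

Each candidate threshold corresponds to one point on the ROC curve, and the selected point is the one with the largest vertical distance from the chance diagonal.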
Given a query chest X-ray image from the validation sets, image search was conducted on the archived set to retrieve k similar images for each query image. The consensus vote among the top k retrieved chest X-ray images subsequently determines whether the query image is pneumothorax. Results were generated with k = 11, 51, 101, 251, 501 and 1001. As we were using a large number of archived images, one would expect better results for higher k values.

Results. Experimental results on Dataset 1 are summarized in Table 3 for AutoThorax-Net, CheXNet and all three feature configurations from Fig. 3. We calculated the area under the curve (AUC), sensitivity and specificity for all 10 folds. Standard deviations were quite low (< 1%), hence not reported. Figure 8 shows the confusion matrices for both AutoThorax-Net and CheXNet.
The average sensitivity and specificity obtained by Configuration 3 for k = 1001 are higher than those obtained by Configuration 1, although they have almost the same AUC. Configuration 1 shows higher sensitivity for k = 11 (86% versus 83%) but its specificity is lower than that of Configuration 3 (76% versus 80%). Configuration 2 delivers a similar AUC of around 88% but, in individual comparison, is always worse than the other configurations, with lower sensitivity and specificity.
AutoThorax-Net clearly has the highest AUC (92%). CheXNet delivers an AUC of 88%, similar to the three search configurations. The highest sensitivity is 86%, achieved by all tested methods. However, AutoThorax-Net also provides a specificity of 84%, whereas the specificity of all other methods, including CheXNet, is in the 70% range.
To verify that the improvements of the proposed methods are significant, we have performed the two-sided Wilcoxon Signed-Rank test 48 between the performance of CheXNet and our best performing configuration which is AutoThorax-Net with k = 1001. Our results, shown in Table 3, suggest that AUC and Specificity are improved from 88 to 92 and 76 to 84, respectively. The calculated p-values for these two metrics, both 0.005, are smaller than 0.05 and reject the null hypothesis which means significant differences exist between the performance of AutoThorax-Net with k = 1001 and CheXNet with respect to these two metrics.
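The paired comparison over folds can be illustrated with a minimal sketch of the two-sided Wilcoxon signed-rank test. It assumes the normal approximation without continuity or tie correction, which may differ from the exact procedure used in the study, and the per-fold AUC values below are hypothetical, not the study's data. Under these assumptions, ten folds that all favor one method give a p-value of about 0.005.

```python
import math
from typing import List, Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p), where W is the smaller of the positive and negative rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks: List[float] = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute differences
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


# Hypothetical per-fold AUCs (percent) for two methods; every fold favors method A.
method_a = [92.1, 91.8, 92.4, 92.0, 91.9, 92.3, 92.2, 91.7, 92.5, 92.6]
method_b = [88.0, 88.3, 87.9, 88.1, 88.4, 87.8, 88.2, 88.5, 87.7, 88.6]
print(wilcoxon_signed_rank(method_a, method_b))
```

Because the test uses only the signs and ranks of the paired differences, it makes no normality assumption about the fold-level metrics, which suits a comparison based on only ten folds.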

Second experiment series: automated solution.
In these experiments, we investigated the possibility of constructing a "fully automated" solution by searching the entire archive, i.e., Dataset 2. We summarize the experimental workflow, and report the results with some analysis.
Experimental workflow. We constructed the receiver operating characteristics (ROC) curve for Dataset 2 to find the trade-off between sensitivity and specificity (Fig. 9). We used Youden's index to find the trade-off position on the ROC curve providing the threshold for match selection. A standard tenfold cross-validation was adopted for testing, which showed a very low standard deviation (< 1%) for all experiments, apparently due to the large size of the datasets. All chest X-ray images were divided into 10 folds. In each fold, one group of chest X-ray images was used as validation set, while the remaining chest X-ray images were used as archived set. The above process was repeated 10 times, such that in each fold a different group of chest X-ray images was used as the validation set. In each fold, an encoder was trained using the archived set of that fold. The encoder was then used for compressing deep features for each chest X-ray image in the validation set. The parameters of the encoder construction process were set as described for Dataset 1. For image search, given a chest X-ray image (from the validation set), the compressed deep feature was used for searching in the archived set. The consensus vote among the top k retrieved chest X-ray images was used to classify the query image from the validation set. Experiments were conducted with k = 11, 51, 101, 251, 501 and 1001 to observe the effect of more retrievals on consensus voting. For comparison, CheXNet 14 was adopted as a baseline to be applied to the validation set in each fold.

Table 3. A summary of classification performance using image search as a classifier on Dataset 1. The numbers (in percentage) are the result of averaging 10 folds with very low standard deviation (< 1%). Columns: Method, Sensitivity, Specificity, AUC.

Results. Experimental results on Dataset 2 are summarized in Table 4. Figure 10 shows the confusion matrices for AutoThorax-Net and CheXNet. The highest AUC of 82% is achieved by AutoThorax-Net for k = 251, 501 and 1001. The highest sensitivity of 74% is achieved by Configuration 2 (for k = 251) and Configuration 3 (for k = 101). However, they both deliver low specificity values of 61% and 65%, respectively. The second highest sensitivity of 73% is achieved by Configuration 1, Configuration 2 and AutoThorax-Net. Their specificity is 61%, 63% and 75%, respectively. AutoThorax-Net can clearly provide a higher and more reliable trade-off between sensitivity and specificity in a fully automated setting when applied on a large archive of X-ray images.
To verify that the improvements of the proposed methods are significant, we have performed the two-sided Wilcoxon Signed-Rank test 48 between the performance of CheXNet and our best performing configuration, which is AutoThorax-Net with k = 1001. Our results, shown in Table 4, suggest that AUC and Specificity are improved from 77 to 82 and 67 to 75, respectively. The calculated p-values for these two metrics, both 0.005, are smaller than 0.05 and reject the null hypothesis, which means significant differences exist between the performance of AutoThorax-Net with k = 1001 and CheXNet with respect to these two metrics.
Comparing Autoencoder against PCA. As one of the main contributions of AutoThorax-Net is encoding the concatenated feature vector (i.e., reducing the dimensionality), the question arises whether the same level of performance can be achieved by traditional algorithms such as principal component analysis (PCA). We ran the tenfold cross-validation on both dataset configurations for k = 11 and k = 51. We observed in all settings that the performance of the autoencoder was better than that of PCA. For instance, for the second experiment, PCA achieved 72% and 76% AUC for k = 11 and k = 51, respectively, while the autoencoder achieved 74% and 80% AUC. As the performance of the dimensionality reduction is independent of k, one expects that a more capable compression should already manifest itself for any k. However, as we are using the compressed/encoded features for image search, good performance is expected to be particularly visible for a small number of matched cases.

Discussions
In our investigations, we experimented with image search as a classifier to detect pneumothorax based on autoencoded concatenated features applied on more than half a million chest X-ray images obtained by merging three large public datasets.
In our experiments, we verified that the use of image search as a classifier with AutoThorax-Net as a feature extractor can improve classification performance. This was demonstrated by analysing the ROC curves to find the trade-off for each individual approach. We further confirmed that compressing concatenated deep features via autoencoders improves the results of image search. This indicates that image search as a classifier is a viable and more conveniently explainable solution for the practice of diagnostic radiology when reports and history of similar, evidently diagnosed cases are readily available.
Please note that some of the folds we used may contain images that CheXNet has already seen during its training. This may slightly inflate the performance numbers for CheXNet. We ignored this unfair advantage of CheXNet over our AutoThorax-Net since we had to exploit the mixture of three public datasets and apply k-fold cross-validation to maximize data usage and decrease data bias.