Comparing different deep learning architectures for classification of chest radiographs

Chest radiographs are among the most frequently acquired images in radiology and are often the subject of computer vision research. However, most models used to classify chest radiographs are derived from openly available deep neural networks that were trained on large image datasets. These datasets differ from chest radiographs in that they mostly contain color images and substantially more labels. Very deep convolutional neural networks (CNN) designed for ImageNet, which often represent more complex relationships, might therefore not be required for the comparably simpler task of classifying medical image data. Sixteen different CNN architectures were compared regarding their classification performance on two openly available datasets, CheXpert and the COVID-19 Image Data Collection. Areas under the receiver operating characteristic curve (AUROC) between 0.83 and 0.89 were achieved on the CheXpert dataset. On the COVID-19 Image Data Collection, all models showed an excellent ability to detect COVID-19 and non-COVID pneumonia, with AUROC values between 0.983 and 0.998. Shallower networks may thus achieve results comparable to their deeper and more complex counterparts with shorter training times, enabling classification performance on medical image data close to state-of-the-art methods even on limited hardware.


Introduction
Chest radiographs are among the most frequently used imaging procedures in radiology. They have been widely employed in the field of computer vision, as chest radiographs are a standardized technique and, compared to other radiological examinations such as computed tomography or magnetic resonance imaging, cover a smaller group of relevant pathologies. Although many artificial neural networks for the classification of chest radiographs have been developed, the topic remains the subject of intensive research. Only a few groups design their own networks from scratch; most use already established architectures, such as ResNet-50 or DenseNet-121 (with 50 and 121 representing the number of layers within the respective neural network) [3][5][7][2][14][11]. These neural networks have often been trained on large, openly available datasets, such as ImageNet, and are therefore already able to recognize numerous image features. When training a model for a new task, such as the classification of chest radiographs, the use of pre-trained networks may improve the training speed and accuracy of the new model, since important image features that have already been learned can be transferred to the new task and do not have to be learned again. However, the feature space of freely available datasets such as ImageNet differs from chest radiographs, as they contain color images and more categories. The ImageNet Challenge includes 1000 possible categories per image, while CheXpert, a large freely available dataset of chest radiographs, only distinguishes between 14 categories (or classes) [13]. Although the ImageNet challenge showed a trend towards higher accuracies for deeper networks, this may not be fully transferable to radiology. In radiology, sometimes only small features of an image are decisive for the diagnosis. Therefore, images cannot be scaled down arbitrarily, as the required information would otherwise be lost. But the more complex a neural network architecture is, the more resources are required for training and deployment of such an algorithm. As up-scaling the input-image resolution sharply increases memory usage during training for large neural networks that evaluate many parameters, the size of a mini-batch needs to be reduced earlier and more strongly, potentially affecting optimizers such as stochastic gradient descent. Therefore, it is currently not clear which of the available artificial neural networks designed for and trained on the ImageNet dataset perform best for the classification of chest radiographs. The hypothesis of this work is that shallow networks are already sufficient for the classification of radiographs and might even outperform deeper networks while requiring fewer resources. We therefore systematically examine the performance of fifteen openly available artificial neural network architectures in order to identify the most suitable ones for the basic classification of chest radiographs.

Data preparation
The freely available CheXpert dataset consists of 224,316 chest radiographs from 65,240 patients. Fourteen findings have been annotated for each image: enlarged cardiomediastinum, cardiomegaly, lung opacity, lung lesion, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, pleural other, fracture and support devices. Each finding can be annotated as present (1), absent (NA) or uncertain (-1). Similar to previous work on the classification of the CheXpert dataset [7][16], we trained the networks on a subset of labels: cardiomegaly, edema, consolidation, atelectasis and pleural effusion. As we only aim at network comparison and not at maximal precision of a single neural network, each image with an uncertainty label was excluded from this analysis; other approaches such as zero imputation or self-training were not adopted. Furthermore, only frontal radiographs were used, leaving 135,494 images from 53,388 patients for training. CheXpert offers an additional dataset with 235 images (201 images after excluding uncertainty labels and lateral radiographs), annotated by two independent radiologists, which is intended as an evaluation dataset and was therefore used for this purpose.
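The exclusion steps above can be sketched as follows; the record layout and column names are simplified assumptions for illustration, not the exact CheXpert CSV schema:

```python
# Minimal sketch of the label filtering, assuming a list of per-image records.
LABELS = ["Cardiomegaly", "Edema", "Consolidation", "Atelectasis", "Pleural Effusion"]

rows = [
    {"Path": "p1.jpg", "View": "Frontal", "Cardiomegaly": 1, "Edema": 0,
     "Consolidation": 0, "Atelectasis": 0, "Pleural Effusion": 1},
    {"Path": "p2.jpg", "View": "Lateral", "Cardiomegaly": 0, "Edema": 1,
     "Consolidation": 0, "Atelectasis": 0, "Pleural Effusion": 0},
    {"Path": "p3.jpg", "View": "Frontal", "Cardiomegaly": -1, "Edema": 0,
     "Consolidation": 0, "Atelectasis": 1, "Pleural Effusion": 0},
]

def keep(row):
    # Frontal radiographs only, and no uncertainty label (-1) in any target finding.
    return row["View"] == "Frontal" and all(row[label] != -1 for label in LABELS)

train_rows = [row for row in rows if keep(row)]
print([row["Path"] for row in train_rows])  # → ['p1.jpg']
```

The lateral image and the image carrying an uncertainty label are dropped, leaving only fully labeled frontal radiographs for training.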

Data augmentation
For the first and second training sessions, the images were scaled to 320 x 320 pixels using bilinear interpolation, and pixel values were normalized. During training, multiple image transformations were applied: flipping of the images along the horizontal and vertical axes, rotation of up to 10°, zooming of up to 110%, random lighting changes and symmetric warping.

Model training
Fourteen different convolutional neural networks (CNN) of five different architectures (ResNet, DenseNet, VGG, SqueezeNet and AlexNet) were trained on the CheXpert dataset [3][5][17][6][9]. All training was done using the Python programming language (https://www.python.org, version 3.8) with the PyTorch (https://pytorch.org) and FastAI (https://fast.ai) libraries on a workstation running Ubuntu 18.04 with two Nvidia GeForce RTX 2080 Ti graphics cards (11 GB of RAM each) [4][10]. In the first training session, the batch size was held constant at 16 for all models, while it was increased to 32 for all networks in the second session. In the first two sessions, each model was trained for eight epochs, whereas during the first five epochs only the classification head of each network was trained. Thereafter, the model was unfrozen and trained as a whole for three additional epochs. Before training and after the first five epochs, the optimal learning rate was determined [19]; it lay between 1e-1 and 1e-2 for the first five epochs and between 1e-5 and 1e-6 for the rest of the training. We trained one multilabel classification head for each model. Since the performance of a neural network can be subject to minor random fluctuations, the training was repeated a total of five times. The predictions on the validation dataset were then exported as comma-separated values (CSV) for evaluation.
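The freeze/unfreeze schedule can be condensed into a PyTorch sketch; the tiny two-part model below merely stands in for the actual pre-trained architectures, and the learning rates are the ranges reported above:

```python
import torch
import torch.nn as nn

# Tiny stand-in model: a feature-extracting "body" plus a classification "head".
body = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(8, 5)            # one output per finding (multilabel)
model = nn.Sequential(body, head)
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid per label, suited to multilabel targets

def set_frozen(module, frozen):
    for p in module.parameters():
        p.requires_grad = not frozen

# Phase 1 (epochs 1-5): freeze the body, train only the head at a high rate.
set_frozen(body, True)
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-1)

x = torch.randn(16, 1, 320, 320)          # one mini-batch of grayscale images
y = torch.randint(0, 2, (16, 5)).float()  # multilabel targets for five findings
loss = loss_fn(model(x), y)
loss.backward()
opt.step()

# Phase 2 (epochs 6-8): unfreeze and fine-tune the whole network at a low rate.
set_frozen(body, False)
opt = torch.optim.SGD(model.parameters(), lr=1e-5)
```

In the actual experiments this two-phase schedule is handled by FastAI's learner utilities rather than written out by hand.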

Evaluation
Evaluation was performed using the "R" statistical environment including the "tidyverse" and "ROCR" libraries [12][20][18]. Predictions on the validation dataset of the five models for each network architecture were pooled so that the models could be evaluated as a consortium. For each individual prediction as well as for the pooled predictions, receiver operating characteristic (ROC) curves and precision-recall curves (PRC) were plotted and the areas under each curve were calculated (AUROC and AUPRC). AUROC and AUPRC were chosen as they enable a comparison of different models independent of a chosen classification threshold.
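The AUROC has a rank-based interpretation that can be computed without any plotting library (shown here in Python rather than the R/ROCR stack used above). Averaging the runs' predictions is one plausible reading of "pooled as a consortium", not necessarily the exact scheme used:

```python
def auroc(y_true, scores):
    # AUROC equals the probability that a randomly chosen positive is scored
    # higher than a randomly chosen negative (ties count one half).
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Pool several runs of the same architecture by averaging their predictions.
runs = [[0.9, 0.2, 0.8, 0.3],
        [0.7, 0.4, 0.9, 0.1],
        [0.8, 0.3, 0.7, 0.2]]
pooled = [sum(col) / len(col) for col in zip(*runs)]

y = [1, 0, 1, 0]
print(auroc(y, pooled))  # → 1.0
```

Because the metric only depends on the ranking of the scores, no classification threshold has to be chosen, which is exactly why it suits a model comparison.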

Results
The CheXpert validation dataset consists of 234 studies of 200 patients, which were not used for training and contain no uncertainty labels. After excluding lateral radiographs (n = 32), 202 images of 200 patients remained.
The dataset presents class imbalances (positives for each finding: cardiomegaly 33%, edema 21%, consolidation 16%, atelectasis 37%, pleural effusion 32%), so that the AUPRC as well as the AUROC can be considered equally important measures of network performance. The performance of the tested networks is compared to the AUROC reported by Irvin et al. [7]. However, only values for the AUROC, but not for the AUPRC, are provided there. In most cases, the best results were achieved with a batch size of 32, so all results reported below refer to models trained with this batch size. Results achieved with the smaller batch size of 16 are explicitly mentioned as such.

Training time
Fourteen different network architectures were trained ten times each with a multilabel classification head (five times each for batch sizes of 16 and 32 at an input-image resolution of 320 x 320 pixels) and once with a binary classification head for each finding, resulting in 210 individual training runs. Overall, training took 340 hours. As expected, the training of deeper networks required more time than the training of shallower networks. For an image resolution of 320 x 320 pixels, the training of AlexNet required the least amount of time, with 2:29 to 2:50 minutes per epoch and a total duration of 20 minutes at a batch size of 32. Using the smaller batch size of 16, the time per epoch rose to 2:59 to 3:06 minutes and the total duration to 24 minutes. In contrast, at a batch size of 16, training a DenseNet-201 took the longest, with 5:11 hours and epochs requiring 41 minutes. At a batch size of 32, training a DenseNet-169 required the largest amount of time, with 3:06 hours (epochs between 21 and 27 minutes). Increasing the batch size from 16 to 32 led to an average training speed-up of 29.9% ± 9.34%. Table 3 gives an overview of training times.

Discussion
In the present work, different architectures of artificial neural networks were analyzed with respect to their performance in the classification of chest radiographs. We could show that more complex neural networks do not necessarily perform better than shallow networks.
Instead, an accurate classification of chest radiographs may be achieved with comparably shallow networks, such as AlexNet (8 layers), ResNet-34 or VGG-16, which surpass even complex deep networks such as ResNet-152 or DenseNet-201. The use of smaller neural networks has the advantage that hardware requirements and training times are lower compared to deeper networks. Shorter training times allow more hyperparameters to be tested, simplifying the overall training process. Lower hardware requirements also enable the use of higher image resolutions. This could be of relevance for the evaluation of chest radiographs with a native resolution of 2048 x 2048 px to 4280 x 4280 px, where specific findings, such as a small pneumothorax, require higher input-image resolutions, because otherwise the crucial information regarding their presence could be lost during downscaling. Furthermore, shorter training times might simplify the integration of improvement methods into the training process, such as the implementation of 'human in the loop' annotations. 'Human in the loop' implies that the training of a network is supervised by a human expert, who may intervene and correct the network at critical steps. For example, the human expert can check the misclassifications with the highest loss for incorrect labels, thus effectively reducing label noise. With shorter training times, such feedback loops can be executed faster. In the CheXpert dataset, which served as the basis for the present analysis, labels for the images were generated using a specifically developed natural language processing tool, which did not produce perfect labels. For example, the F1 scores for the mention and subsequent negation of cardiomegaly were 0.973 and 0.909, and the F1 score for an uncertainty label was 0.727. Therefore, it can be assumed that there is a certain amount of noise in the training data, which might affect the accuracy of the models trained on it. Implementing a human-in-the-loop approach for
partially correcting the label noise could further improve the performance of networks trained on the CheXpert dataset [8]. Our findings differ from the techniques applied in previous literature, where deeper network architectures, mainly a DenseNet-121, were used instead of small networks to classify the CheXpert dataset [11][1][15]. The authors of the CheXpert dataset achieved an average overall AUROC of 0.889 [7] using a DenseNet-121, which was not surpassed by any of the models in our analysis, although the differences between the best-performing networks and the CheXpert baseline were smaller than 0.01. It should be noted, however, that in our analysis the hyperparameters of the models were probably not selected as precisely as in the original CheXpert paper by Irvin et al., since the focus of this work was on comparing the architectures rather than on the complete optimization of one specific network. Still, we identified models that achieved higher AUROC values in two of the five findings (cardiomegaly and effusion). Pham et al. also used a DenseNet-121 as the basis for their model and proposed the most accurate model on the CheXpert dataset, with a mean AUROC of 0.940 for the five selected findings [11]. These good results are probably due to the hierarchical structure of the classification framework, which takes into account correlations between different labels, and the application of a label-smoothing technique, which also allows the use of uncertainty labels (which were excluded in our present work). Allaouzi et al.
similarly used a DenseNet-121 and created three different models for the classification of CheXpert and ChestX-ray14, yielding an AUC of 0.72 for atelectasis, 0.87-0.88 for cardiomegaly, 0.74-0.77 for consolidation, 0.86-0.87 for edema and 0.90 for effusion [1]. Except for cardiomegaly, we achieved better values with several models (e.g. ResNet-34, ResNet-50, AlexNet, VGG-16). We would interpret this as evidence that complex deep networks are not necessarily superior to shallower networks for chest X-ray classification. At least for the CheXpert dataset, it seems that methods optimizing the handling of uncertainty labels and the hierarchical structure of the data are important to improve model performance. Sabottke et al. trained a ResNet-32 for the classification of chest radiographs and are therefore one of the few groups using a smaller network [15]. With an AUROC of 0.809 for atelectasis, 0.925 for cardiomegaly, 0.888 for edema and 0.859 for effusion, their network did not perform as well as some of our tested networks. Raghu et al. employed a ResNet-50, an Inception-v3 as well as a custom-designed small network. Similar to our findings, they observed that smaller networks showed a performance comparable to deeper networks [13].
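The high-loss label review discussed above can be sketched with a per-sample cross-entropy; the file names, labels and model probabilities here are invented for illustration:

```python
import math

def bce(y, p, eps=1e-7):
    # Per-sample binary cross-entropy; large values mark confident mistakes.
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# (image, annotated label, model probability) — illustrative values only.
preds = [("img1.jpg", 1, 0.02),   # confident miss: a prime label-review candidate
         ("img2.jpg", 0, 0.10),
         ("img3.jpg", 1, 0.95)]

# Show the human expert the highest-loss cases first.
review_queue = sorted(preds, key=lambda t: bce(t[1], t[2]), reverse=True)
print([name for name, _, _ in review_queue])  # → ['img1.jpg', 'img2.jpg', 'img3.jpg']
```

Cases where the model contradicts its label with high confidence float to the top of the queue, so a radiologist's review time is spent where label noise is most likely.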

Conclusion
In the present work, we could show that smaller artificial neural networks for the classification of chest radiographs can perform similarly to, or even surpass, deeper and very deep neural networks. In contrast to many previous studies, which mostly used a DenseNet-121, we achieved the best results with up to 95% smaller networks. Using smaller networks therefore has the advantage that they have lower hardware requirements, as they require less GPU RAM and can be trained faster without loss of performance. Table 2 shows the area under the precision-recall curve (AUPRC) for all networks and findings. In contrast to the AUROC, where deeper models achieved higher values, shallower networks yielded the best results for the AUPRC (ResNet-34, AlexNet, VGG-16). DenseNet-201 and SqueezeNet showed the lowest AUPRC values. Again, a batch size of 32 appeared to deliver better results compared to a batch size of 16.

Figures 1, 2 and 3
display the ROC curves for all models. The colored lines represent a single training run; black lines represent the pooled performance over five training runs.

Table 1
shows the different areas under the receiver operating characteristic curve (AUROC) for each network architecture and individual finding, as well as the pooled AUROC per model. According to the pooled AUROC, ResNet-152, ResNet-50 and DenseNet-161 were the best models, while SqueezeNet and AlexNet showed the poorest performance. For cardiomegaly, ResNet-34, ResNet-50, ResNet-152 and DenseNet-161 surpassed the CheXpert baseline provided by Irvin et al. ResNet-50, ResNet-101, ResNet-152 and DenseNet-169 also surpassed the CheXpert baseline for pleural effusion. A batch size of 32 often led to better results compared to a batch size of 16.

Table 3
provides an overview of the training time per epoch (duration/epoch) and the overall training time (duration/training) for each neural network. The times given are the average of five training runs, rounded to the nearest minute.