A retrospective study of deep learning generalization across two centers and multiple models of X-ray devices using COVID-19 chest X-rays

Generalization of deep learning (DL) algorithms is critical for the safe implementation of computer-aided diagnosis systems in clinical practice. However, broad generalization remains a challenge in machine learning. This research aims to identify and study potential factors that can affect the internal validation and generalization of DL networks, namely the institution where the images come from, the image processing applied by the X-ray device, and the type of response function of the X-ray device. For these purposes, a pre-trained convolutional neural network (CNN) (VGG16) was trained three times for classifying COVID-19 and control chest radiographs with the same hyperparameters, but using different combinations of data acquired in two institutions by three different X-ray device manufacturers. Regarding internal validation, the addition of images from an external institution to the training set did not modify the algorithm's internal performance; however, the inclusion of images acquired by a device from a different manufacturer decreased the performance by up to 8% (p < 0.05). In contrast, generalization across institutions and X-ray devices with the same type of response function was achieved. Nonetheless, generalization was not observed across devices with different types of response function. This factor was the key impediment to achieving broad generalization in our research, followed by the device's image processing and the inter-institutional differences, which reduced generalization performance by 18.9% (p < 0.05) and 9.8% (p < 0.05), respectively. Finally, clustering analysis with features extracted by the CNN was performed, revealing a substantial dependence of the feature values extracted by the pre-trained CNN on the X-ray device that acquired the images.


Materials and methods
Three experiments were carried out to study the influence of institutional and X-ray device related factors on the internal validation and generalization performance of a DL network for CXR classification (Fig. 1).

Ethical approval
This research involved patients from two medical institutions: Hospital Universitario Marqués de Valdecilla, located in Santander, Cantabria, Spain (referred to as Institution 1 in the text), and Hospital de Sierrallana in Torrelavega, Cantabria, Spain (referred to as Institution 2). The ethics committee for both institutions, Comité de Ética de la Investigación con Medicamentos y Productos Sanitarios de Cantabria, approved this research. Because the study was approved by this committee, involved no direct interaction with patients or use of tissue samples, and used retrospective, anonymized images, informed consent was not required 15. All methods reported in this work were carried out in accordance with the pertinent guidelines and regulations.

Dataset and subsets
Images for this research were all frontal view CXRs manually labeled by three expert radiologists, each with more than 5 years of experience, into two classes (COVID-19 and Control), according to the inclusion criteria summarized in Table 1. In the text, these classes are referred to as target classes.
The main image dataset was created by simple random sampling from four image databases, as described in Supplementary Appendix A1. This main dataset contained images acquired by three different X-ray devices in two institutions: 394 images acquired by a Fujifilm FDR smart FGX device in institution 1; 244 images acquired by the same device model (Fujifilm FDR smart FGX) in institution 2; 192 images acquired by a General Electric (GE) Revolution XRD device in institution 2; and 94 images acquired by a Carestream DRX Evolution Plus device in institution 2. Note that the Fujifilm and Carestream devices have the same type of response function (logarithmic), while the GE device has a different type (linear) 16,17.
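The source does not give the devices' actual response curves. As a toy illustration only, the sketch below assumes the simplest textbook forms: a linear response maps detector exposure to a pixel value proportionally, while a logarithmic response maps the logarithm of the exposure. The `gain` and `offset` parameters and the exposure values are hypothetical and do not describe any of the devices in this study.

```python
import math

def linear_response(exposure, gain=1.0, offset=0.0):
    """Pixel value proportional to exposure (hypothetical gain and offset)."""
    return gain * exposure + offset

def logarithmic_response(exposure, gain=1.0, offset=0.0):
    """Pixel value proportional to log10 of exposure (hypothetical gain and offset)."""
    return gain * math.log10(exposure) + offset

# Hypothetical exposures spanning three orders of magnitude.
exposures = [1.0, 10.0, 100.0, 1000.0]
linear_pixels = [linear_response(e) for e in exposures]
log_pixels = [logarithmic_response(e) for e in exposures]

for e, lin, log in zip(exposures, linear_pixels, log_pixels):
    print(f"exposure={e:7.1f}  linear={lin:7.1f}  logarithmic={log:4.1f}")
```

Under these assumptions the same exposures produce very differently distributed pixel values, which is one way the type of response function could change image contrast and texture.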
Finally, the eight subsets described in Table 2 were created by random sampling of the main dataset without repetition. Random sampling was stratified to ensure an equal distribution of COVID-19 and Control images within each subset (50 percent of each class).
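The sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the subset names, the subset sizes, and the synthetic image identifiers are all hypothetical. It draws disjoint subsets (sampling without repetition) with the same number of images from each class.

```python
import random

def stratified_subsets(records, subset_sizes, seed=0):
    """Split (image_id, label) records into disjoint, class-balanced subsets."""
    rng = random.Random(seed)
    by_class = {}
    for image_id, label in records:
        by_class.setdefault(label, []).append(image_id)
    for ids in by_class.values():
        rng.shuffle(ids)  # random order, so popping gives a random sample

    subsets = {}
    for name, size in subset_sizes.items():
        per_class, remainder = divmod(size, len(by_class))
        if remainder:
            raise ValueError(f"size of {name} must split evenly across classes")
        subset = []
        for label, ids in by_class.items():
            if len(ids) < per_class:
                raise ValueError(f"not enough {label} images left for {name}")
            # pop() removes the image, so it cannot be drawn again
            subset.extend((ids.pop(), label) for _ in range(per_class))
        subsets[name] = subset
    return subsets

# Synthetic records standing in for one device/institution pool.
records = [(f"covid_{i:03d}", "COVID-19") for i in range(20)]
records += [(f"control_{i:03d}", "Control") for i in range(20)]
subsets = stratified_subsets(records, {"TRAIN_A": 12, "TRAIN_B": 12, "TEST": 8})
```

Popping from pre-shuffled per-class lists keeps the subsets disjoint by construction, which is the property that prevents training images from leaking into a test subset.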

Experiments to test the influence of institutional and device related factors
Experiment 1: evaluation of internal validation performance
The first experiment analyzed the influence of institutional and device related factors on the internal validation performance of a DL algorithm (Fig. 1). For this purpose, the same DL network (a VGG16) was trained three times for the classification of CXR images, each time with the same architecture, hyperparameters (details in Supplementary Appendix A2), and number of images (300), but using different training subsets (Table 3).
The first training was performed with subsets Fuji_Inst1_TRAIN_A and Fuji_Inst1_TRAIN_B, so it included 300 images acquired by a Fujifilm FDR smart FGX device in institution 1. The resulting model was named Model-F1A_F1B after the subsets used (F1A = Fuji_Inst1_TRAIN_A, etc.).
The second training included subsets Fuji_Inst1_TRAIN_A and Fuji_Inst2_TRAIN, so all of its images were acquired by a Fujifilm FDR smart FGX device, half in institution 1 and half in institution 2. This model was named Model-F1A_F2.
The third and last training was performed with 101 random images from Fuji_Inst1_TRAIN_A, 101 random images from Fuji_Inst2_TRAIN, and the 98 images from GE_Inst2_TRAIN. The resulting model, named Model-F1A'_F2'_GE2, was therefore trained with images acquired by devices from two different manufacturers, Fujifilm FDR smart FGX and General Electric (GE) Revolution XRD, and in two different institutions (1 and 2).
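The composition of the three training sets can be sketched as below. This is an illustration under assumptions: the sizes of 150 images for Fuji_Inst1_TRAIN_A, Fuji_Inst1_TRAIN_B and Fuji_Inst2_TRAIN are inferred from the stated total of 300 and the "half and half" description rather than given in the text, the image identifiers are synthetic, and the source does not say whether the 101-image draws were stratified by class (here they are not).

```python
import random

def build_training_sets(f1a, f1b, f2, ge2, n_partial=101, seed=0):
    """Assemble the three 300-image training sets from the named subsets."""
    rng = random.Random(seed)
    return {
        "Model-F1A_F1B": f1a + f1b,
        "Model-F1A_F2": f1a + f2,
        # 101 random images from each Fujifilm subset plus all GE images
        "Model-F1A'_F2'_GE2": rng.sample(f1a, n_partial) + rng.sample(f2, n_partial) + ge2,
    }

# Synthetic identifiers; the subset sizes of 150 are an assumption.
f1a = [f"F1A_{i}" for i in range(150)]
f1b = [f"F1B_{i}" for i in range(150)]
f2 = [f"F2_{i}" for i in range(150)]
ge2 = [f"GE2_{i}" for i in range(98)]

training_sets = build_training_sets(f1a, f1b, f2, ge2)
sizes = {name: len(images) for name, images in training_sets.items()}
```

Keeping the total at 300 images in every training set means that differences between the models cannot be attributed to training set size.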
Next, the internal validation performance of these three models was tested on the Fuji_Inst1_TEST subset, whose images were all acquired by a Fujifilm FDR smart FGX device in institution 1. This was the only test subset that could provide internal validation results for all three models, since every model was trained on a set that contained CXRs acquired by a Fujifilm FDR smart FGX device in institution 1. Finally, the internal validation performances of the three models were compared to assess the influence of institutional and X-ray machine related factors.
The influence of institutional factors was studied by comparing the performances of Model-F1A_F1B and Model-F1A_F2, as both were trained with 300 images from the same X-ray device model, but the images for Model-F1A_F1B came from institution 1 only, and those for Model-F1A_F2 came from institutions 1 and 2. Therefore, performance differences between these models could probably be attributed to institutional related factors.
The influence of X-ray machine related factors was assessed by comparing the performance of Model-F1A'_F2'_GE2 with those of Model-F1A_F1B and Model-F1A_F2. While Model-F1A_F1B and Model-F1A_F2 were trained with 300 images acquired by a Fujifilm FDR smart FGX, Model-F1A'_F2'_GE2 was trained with 300 images, some acquired by a Fujifilm FDR smart FGX and the rest by a GE Revolution XRD device. Thus, this comparison revealed the effect of adding images from a different manufacturer to the training sample.
The metric used to evaluate the performances was the area under the receiver operating characteristic curve (AUC) 18. Additionally, gradient-weighted class activation mapping (Grad-CAM) heatmaps were used to identify the significant regions in the image for making the prediction 19. This analysis evaluated how the addition of images acquired in a distinct institution and by a different X-ray device affected the DL network's ability to learn causal relationships.
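For reference, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive (COVID-19) image receives a higher score than a randomly chosen negative (Control) image, with ties counted as one half. The sketch below is a plain Python illustration with made-up scores, not the implementation used in the study.

```python
def roc_auc(labels, scores):
    """AUC as P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = COVID-19, 0 = Control.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
auc = roc_auc(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking, which is the reference point for a model that does not generalize.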

Experiment 2: evaluation of generalization
The second experiment studied the influence of both institutional and device related factors on a DL network's generalization (Fig. 1). Model-F1A_F1B was evaluated on the four test subsets, which included images from two different institutions and three distinct X-ray devices (Table 2). Therefore, four AUCs were obtained: AUC on subset Fuji_Inst1_TEST (internal validation); AUC on Fuji_Inst2_TEST (generalization to an external institution with the same X-ray device); AUC on Care_Inst2_TEST (generalization to an external institution with an X-ray device that has a different image processing but the same type of response function); and AUC on GE_Inst2_TEST (generalization to an external institution with an X-ray device that has a different image processing and a different type of response function).
Next, considering the factor that shifted between test subsets, AUCs were compared to evaluate the influence of institutional factors on generalization (AUC-Fuji_Inst1_TEST vs AUC-Fuji_Inst2_TEST), as well as X-ray device factors, including the device's image processing (AUC-Fuji_Inst2_TEST vs AUC-Care_Inst2_TEST) and the device's type of response function (AUC-Care_Inst2_TEST vs AUC-GE_Inst2_TEST).

Table 3. Trainings and models. Models were all trained with the same architecture (VGG16) and hyperparameters, and they were named with reference to the subsets used (F1A = Fuji_Inst1_TRAIN_A; F1A' = 101 random images from Fuji_Inst1_TRAIN_A, etc.). *101 random images from Fuji_Inst1_TRAIN_A. **101 random images from Fuji_Inst2_TRAIN.

Experiment 3: evaluation of the dependence of CNN-features on textures
The third experiment investigated the influence of institutional and device related factors on the values of the features extracted by the pre-trained CNN (Fig. 1). The hypothesis was that radiological image textures depend on the X-ray device used to acquire the image. Differences in the devices' image processing and response functions result in differences in image textures from one X-ray device to another.
Therefore, feature values extracted by CNNs could also be highly dependent on the X-ray device that acquires the image. This issue could hinder the generalization of DL networks, making them suitable only for images obtained from the same devices as those used in training.
In this context, Model-F1A_F1B was used to extract features from the test subsets. Features from the last convolutional layer were then clustered using a hierarchical clustering algorithm 20, which was implemented using the Python visualization library seaborn (version 0.11.1) 21. This unsupervised approach allowed us to examine which image classes were more evident for the CNN, namely target classes (COVID-19 and control) or other hidden classes (such as the X-ray device that acquired the image, or the institution where the images were obtained). To ensure that results were not biased by metallic tokens present in the CXRs, this experiment was repeated after cropping the images to preserve only their central part.
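The idea behind this clustering step can be sketched in plain Python. The study used seaborn for this; the version below is a stand-in that makes the merging rule explicit. Average linkage and Euclidean distance are assumptions, since the source does not state which linkage or metric was used, and the two-dimensional "features" are synthetic values chosen so that the device offset dominates the class offset. They are not results from the paper.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(features, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance until n_clusters remain."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(euclidean(features[i], features[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return clusters

# Synthetic features: first axis mimics a device effect, second a class effect.
features = [
    (0.0, 0.0), (0.0, 0.5),    # device X: control, COVID-19
    (5.0, 0.0), (5.0, 0.5),    # device Y: control, COVID-19
    (20.0, 0.0), (20.0, 0.5),  # device Z: control, COVID-19
]
three_clusters = average_linkage(features, 3)
two_clusters = average_linkage(features, 2)
```

With these synthetic values, cutting the tree at three clusters separates the three hypothetical devices, and cutting at two merges the two closest devices while the third stays apart. Neither cut separates the two classes, which is the kind of pattern this experiment was designed to detect.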
This experiment aimed to evaluate whether the difference in pixel values between a COVID-19 image from a Fujifilm device and a COVID-19 image from a GE device is greater or smaller than the difference between a COVID-19 image and a control image, both acquired by the same X-ray device.

Statistical analysis
The 95% confidence intervals (CIs) of the cross-validated AUCs were computed with the R package cvAUC 22. The AUC differences and their 95% CIs were calculated using the bootstrap method 23. Any difference whose CI excluded 0 was considered statistically significant, with a p-value < 0.05.
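The bootstrap comparison can be illustrated with a short Python sketch. It is a stand-in for the R-based analysis, not a reproduction of it: it assumes a paired design (two models scored on the same test images), a percentile interval, and resampling within each class so that every replicate contains both classes. The function names, the number of replicates, and the example scores are hypothetical.

```python
import random

def roc_auc(labels, scores):
    """Rank-based AUC; ties between a positive and a negative count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_difference(labels, scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC(model A) - AUC(model B) on one test set."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    diffs = []
    for _ in range(n_boot):
        # Resample cases with replacement, separately within each class.
        idx = rng.choices(pos_idx, k=len(pos_idx)) + rng.choices(neg_idx, k=len(neg_idx))
        y = [labels[i] for i in idx]
        auc_a = roc_auc(y, [scores_a[i] for i in idx])
        auc_b = roc_auc(y, [scores_b[i] for i in idx])
        diffs.append(auc_a - auc_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    observed = roc_auc(labels, scores_a) - roc_auc(labels, scores_b)
    significant = not (lower <= 0.0 <= upper)  # CI excludes 0
    return observed, lower, upper, significant

# Hypothetical scores: model A separates the classes, model B is uninformative.
labels = [1] * 10 + [0] * 10
scores_a = [1.0] * 10 + [0.0] * 10
scores_b = [0.5] * 20
observed, lower, upper, significant = bootstrap_auc_difference(labels, scores_a, scores_b, n_boot=200)
```

Comparing one model across two different test subsets would instead require resampling each subset independently, since the cases are not paired.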

Results
Experiments to test the influence of institutional and device related factors
Experiment 1: evaluation of internal validation performance
Internal validation performances of Model-F1A_F1B and Model-F1A_F2 did not show statistically significant differences (Fig. 2 and Table 5). Thus, the addition to the training sample of images acquired in a different institution by the same X-ray model did not have a significant impact on the algorithm's internal validation.
Grad-CAM heatmaps showed similar activation patterns for the two models. Heatmaps depicted activations over COVID-19 lung opacities, and an absence of activations in any region of the image in Control patients (Fig. 3). This result suggests that both models were able to learn the radiological findings of COVID-19 and, thus, made predictions based on causal relationships.
In contrast, the internal validation performances of Model-F1A_F1B and Model-F1A_F2 were significantly higher than that of Model-F1A'_F2'_GE2, by 8% (p < 0.05) and 5.2% (p < 0.05), respectively (Table 6). Hence, the addition to the training sample of images acquired by an X-ray device from a different manufacturer decreased the algorithm's internal validation performance.

Experiment 2: evaluation of generalization
Model-F1A_F1B generalized to Fujifilm and Carestream images from institution 2 (Fuji_Inst2_TEST and Care_Inst2_TEST) with a performance decrease of 9.8% (p < 0.05) and 18.9% (p < 0.05), respectively. In contrast, this model did not generalize to GE images from institution 2 (GE_Inst2_TEST), as it showed a loss in AUC of 33.5% (p < 0.05), which left the model performing at chance level (Fig. 4 and Table 6). Thus, Model-F1A_F1B generalized across institutions and across X-ray devices from different manufacturers with the same type of response function; however, it did not generalize across X-ray devices with different types of response function. A hierarchy of factors influencing the generalization capability of the DL network is presented in Fig. 5.

Experiment 3: evaluation of the dependence of CNN-features on textures
The hierarchical clustering algorithm grouped images from the test subsets into three evident clusters, which corresponded to the three X-ray devices used to acquire the images (Fujifilm, GE and Carestream). Radiographs acquired by both Fujifilm devices (subsets Fuji_Inst1_TEST and Fuji_Inst2_TEST) were mixed, despite being acquired in different institutions. Moreover, images from different target classes (COVID-19 and control) were not separated (Fig. 6).
In addition, the two clusters corresponding to images acquired by the two X-ray devices with the same type of response function (Fujifilm and Carestream) were next to each other, grouped together at a higher cluster level, and separated from the cluster containing GE images, which had a different type of response function. The same results were observed when the experiment was repeated using a cropped version of the images.
In summary, the hierarchical clustering algorithm found that the feature values extracted by the ImageNet-pretrained CNN were more dissimilar with respect to the hidden classes (X-ray device and type of response function) than with respect to the real target classes (COVID-19 and control).

Table 5. AUC differences on subset Fuji_Inst1_TEST (internal validation). 95% confidence intervals for bootstrapping differences are reported in parentheses. *Statistically significant difference (p-value < 0.05).

Discussion
Experiment 1: evaluation of internal validation performance
The similarity in performance between Model-F1A_F1B and Model-F1A_F2 suggests that institutional related factors may not have a significant impact on the internal validation performance of the algorithm. In contrast, the addition of images acquired by a different model of X-ray device to the training set led to a significant reduction in the internal validation performance of Model-F1A'_F2'_GE2. This result indicates a potentially important influence of device related factors on the algorithm's internal validation performance. Grad-CAM heatmaps were in line with these results. Heatmaps of Model-F1A_F1B and Model-F1A_F2 showed activations over lung opacities in COVID-19 images and an absence of activations in control images. These activation patterns suggest that those two models were able to learn causal relationships. Conversely, Model-F1A'_F2'_GE2 did not show human-recognizable activations. COVID-19 lung consolidations were not properly identified, and several activations without clinical meaning appeared in both COVID-19 and control images. In summary, Grad-CAM heatmaps also provided evidence of a variable level of influence of device related factors on the internal validation performance of the algorithm.

Figure 4. ROC curves of Model-F1A_F1B on the test subsets: Fuji_Inst1_TEST (internal validation); Fuji_Inst2_TEST (generalization to an external institution with the same X-ray device); Care_Inst2_TEST (generalization to an external institution with an X-ray device that has a different image processing but the same type of response function); and GE_Inst2_TEST (generalization to an external institution with an X-ray device that has a different image processing and a different type of response function).

Experiment 2: evaluation of generalization
This study found that a DL network can generalize across institutions and X-ray devices with the same type of response function; however, it may suffer a variable decrease in performance when deployed on external datasets. In contrast, generalization across X-ray devices with a different type of response function was not observed in this research.
Generalization of DL networks for CXR classification to external datasets has been debated by a few authors. Pooch et al. 9 concluded that state-of-the-art DL algorithms do not generalize to external data that differs from the data used for training. Similarly, Zech et al. 12 and Sathitratanacheewin et al. 10 argue that CNNs do not generalize to external sites. Additionally, Zech et al. 12 and Maguolo and Nanni 7 warn that neural networks can often distinguish the dataset or the hospital where the images come from. For Maguolo and Nanni 7, this issue is very important since most papers obtain the images of each class to be predicted from different datasets. Trying to understand how CNNs distinguish the source of the dataset, Cohen et al. 24 proposed discrepancies in image labeling criteria among medical centers as the potential cause. Furthermore, according to Rajpurkar et al. 14 and Pan et al. 8, DL algorithms for CXR classification can generalize to datasets from external institutions with a decrease in their performance. Our results agree with these last two groups of authors.
In an attempt to shed light on the controversy surrounding the generalization of DL networks, we separately assessed the influence of multiple factors on generalization. Our research found that the X-ray device's response function is probably the most important factor for generalization, followed by the device's image processing, which hindered the algorithm's generalization but did not prevent it.
Furthermore, institutional related factors were also found to reduce the algorithm's performance, but to a lesser extent than X-ray device related factors (Fig. 5). Hierarchical clustering showed that feature values extracted by a CNN could be highly dependent on the X-ray device that acquires the image. The reason is that each X-ray device model applies its own image processing and has a distinct response function, which generates different textures in radiographic images. These textural differences may lead to disparities in CNN-feature values among images from different devices and vendors, which could hinder generalization. Therefore, DL networks should be applied with caution to images acquired by devices from manufacturers other than those used to acquire the training images.
In fact, the results of experiment 3 also suggest that the influence of X-ray devices on CNN-feature values could be even higher than the influence of the target classes or the institution.Nevertheless, the impact of this issue may be more significant in challenging classification tasks, while in relatively easy tasks, such as body parts classification in radiography, it may not pose a significant obstacle.
The dependence of CNN-feature values on the X-ray device indicates that at least some pre-trained CNNs extract features based mainly on textures rather than shapes. This is an important issue for generalization since shape-based features are potentially more robust and invariant than texture-based features. Accordingly, outside the medical field, Geirhos et al. 25 previously argued that ImageNet-trained CNNs are biased towards recognizing textures rather than shapes. These authors also suggest that shape-biased networks are inherently more robust than texture-biased networks 25.
Finally, this research introduces hierarchical clustering as a potentially useful tool to detect hidden classes in a dataset which can be more relevant than target classes.Therefore, training a different algorithm for each hidden class to classify the target classes instead of training a single algorithm may be a prudent approach to consider.
Taking all these findings into account, this paper argues that generalization across institutions is possible; however, the influence of the X-ray device on the performance of DL networks is highly significant. In light of these findings, we propose a new strategy for developing algorithms for interpreting radiological images: training a different algorithm for each device model. We believe that this strategy could achieve higher-performing DL models using a smaller training dataset. This strategy could also help the algorithms to learn causal relationships, as the Grad-CAM heatmaps in our research showed. In cases where the acquisition equipment is unknown, hierarchical clustering can help to separate images into homogeneous clusters. This new strategy should be studied in future papers.

Conclusion
The performance of DL algorithms in medical imaging can be influenced mainly by two types of factors: institutional and device related. On the one hand, institutional related factors are those that do not modify pixel values (labeling criteria, radiology workflow, etc.). Although these factors do not impede generalization, they can produce a relevant performance decrease when an algorithm is adopted in an external institution.
On the other hand, device related factors (device's image processing, response function, and acquisition protocol) modify image pixel values, and they can have a significant impact on internal validation and generalization performances. The device's type of response function was found to be the most critical factor, as a change in it prevented the algorithm from generalizing, while other device related factors hindered, but did not impede, generalization.
Each radiography device applies its own image processing and response function, which generate different textures in radiographic images. Hence, feature values extracted by CNNs were found to be highly dependent on the X-ray device with which the image was acquired (hidden class). This is an especially relevant issue, which may compromise generalization to external X-ray models. Clustering algorithms are deemed useful to identify hidden classes in the dataset, and we propose them as a potential strategy to evaluate CNN feature values.

Figure 3. Comparison of four Grad-CAM heatmaps based on the predictions of the three models.

Figure 5. Hierarchy of factors that affect the generalization of a deep learning network in medical image classification.

Table 1. Inclusion criteria for target classes. Meeting all the inclusion criteria was required for an image to be included in a class. PCR polymerase chain reaction, CXR chest radiograph.

Target class | Medical history | Imaging findings | Follow-up
COVID-19 | Positive PCR testing | Three radiologists, at least one of whom was a thoracic radiologist, reported COVID-19 findings on the CXR | Progression of imaging opacities was verified in a follow-up CXR
Control | No COVID-19 symptoms | Three radiologists, at least one of whom was a thoracic radiologist, reported no pathological findings on the CXR | -

Table 2. Images were collected as 16-bit unsigned integer monochrome pixels stored in DICOM format. After data collection, images were resized to 512 × 512 pixels using cubic spline interpolation, and pixel values were rescaled to [0, 1].

Table 4. Descriptive statistics of population age and gender. SD standard deviation, GE General Electric.

Table 6. Generalization of Model-F1A_F1B across institutions and devices from different manufacturers. Note that Fujifilm and Carestream devices had the same type of response function but different image processing, whereas the GE device had a different type of response function and different image processing. *Statistically significant difference (p-value < 0.05).