Introduction

Chest X-ray (CXR) is an essential tool to detect pulmonary abnormalities and has become one of the most prescribed medical tests. An estimated 110 million CXRs are performed annually in the United States1, with only around 39,000 radiologists providing the official reading2. The need for immediate interpretation or a “wet” read by those who ordered them has prompted clinicians to resort to teleconsultation, especially in settings where they may not have access to a radiologist 24/7. With advances in smartphone technology, doctors have increasingly taken photographs of CXRs and sent them to colleagues for instantaneous reading3,4.

In recent years, deep learning algorithms have been proposed as computer-aided diagnosis (CAD) solutions to the radiologist shortage5,6,7,8,9,10,11,12,13,14. Mostly built on convolutional neural networks (CNNs), these algorithms can detect certain pulmonary abnormalities in CXR images within a second. Numerous studies have shown that CNNs can achieve performance close to that of radiology experts7,11,12,15,16,17.

Incorporating an algorithm for automated CXR radiological finding detection into a smartphone offers a number of benefits. First, it provides access to radiologist-level expertise to a healthcare worker seeking assistance with CXR interpretation or a second opinion anytime, anywhere. Second, it can scale and standardize the process of teleconsultation, with less variation in interpretation than readings given by different individuals with varying levels of expertise. Third, there is an opportunity for quality assurance, as the algorithms can be continuously evaluated and recalibrated against radiologists.

In this study, we explore combining the power of deep learning and the ubiquity of smartphones for CXR finding detection. To the best of our knowledge, this is the first study that recalibrates deep learning models specifically targeting CXR photographs. The target user of the software is a healthcare provider in a resource-limited setting who may not be confident about her/his interpretation or is not a specialist in radiology. It is easier to install an algorithm on a smartphone than on a legacy computer system in a public hospital or clinic, where data interoperability is almost always a challenge. The methodology can also be applied to abnormality detection on plain films, but this requires photographs of the plain films for recalibration.

We begin the study by showing that the performance of the original CNN-based models trained on high-resolution digital CXR images decreases on CXR photographs. Using fewer than 200 CXR photographs, we recalibrated the training process of the models and obtained significant performance improvement. To ascertain the generalizability of the recalibrated model, we measured the performance on four photograph datasets derived from two large, publicly accessible digital CXR databases (MIMIC-CXR18 and CheXpert18,19). To simulate real-world teleconsultation, these photographs were taken by twelve users, including nine medical residents, using different computer monitors and smartphones to display and photograph the CXRs. We are also open-sourcing these photograph datasets to the community to promote novel research and the development of similar systems.

Results

Experiment design

We conducted four experiments corresponding to four testing CXR photograph datasets, as shown in Fig. 1b: (1) internal validation using 1,759 photographs taken from the MIMIC-CXR dataset (Photo-MMC); (2) external validation using 1,337 photographs taken from the CheXpert CXR dataset (Photo-CXP); (3) end-user scenario using 1,337 photographs taken from CheXpert by nine medical residents (Photo-MED); (4) device-variance test using 2,020 photographs taken from CheXpert by a single physician with different smartphones and computer monitors (Photo-DEV).

Fig. 1: Overview of the proposed method.
figure 1

a The uncalibrated model (Model-ORIG) was trained on the original CXR images. The transfer learning-based model (Model-TRNS) was transferred from the uncalibrated model and fine-tuned on real photographs. The photograph-based model (Model-PHOT) was trained on real smartphone-captured photographs. The recalibrated model (Model-RECA) was recalibrated from Model-ORIG using augmented CXR images. b Model-ORIG, Model-TRNS, Model-PHOT, and Model-RECA were tested on four CXR photograph datasets (Photo-MMC, Photo-CXP, Photo-MED, and Photo-DEV) in four experiments. The performance metrics across 14 labels were calculated. Gradient-weighted Class Activation Mapping was employed to visualize the diagnostic focus of the models.

Four models based on the MIMIC-CXR dataset were constructed and tested, as shown in Fig. 1a. (1) Model-ORIG is the conventional model trained on digital CXRs. (2) Model-RECA is our recalibrated model trained on augmented CXRs. (3) Model-TRNS is the transfer learning model transferred from Model-ORIG and fine-tuned by CXR photographs. (4) Model-PHOT is the model directly trained on CXR photographs. The details of model construction and dataset collection are described in the “Methods” section.

Performance evaluation

Table 1 and Fig. 2 summarize the results for the first three experiments: internal validation, external validation, and end-user scenario. The areas under the receiver operating characteristic curves (AUROCs) were used to present the performance of different models. Conventional metrics such as sensitivity, specificity, F1-score, and accuracy were also calculated and are presented in Supplementary Tables 2–4. Six major radiological findings (cardiomegaly, edema, consolidation, atelectasis, pneumothorax, and pleural effusion) were selected as target labels due to clinical relevance. We obtained the comparison reference by using high-resolution images for both training and testing to avoid the domain discordance problem. The blue lines present the results of the comparison reference. The green, yellow, pink, and red lines show the results of Model-ORIG, Model-RECA, Model-TRNS, and Model-PHOT, respectively, using CXR photographs as the testing data.

Table 1 AUROCs using MIMIC-based models in internal validation, external validation, and end-user scenario.
Fig. 2: Radiographic detection performance evaluated by AUROCs for six labels including cardiomegaly, edema, consolidation, atelectasis, pneumothorax, and pleural effusion, using different approaches.
figure 2

a Internal validation: comparison of the models tested on MIMIC CXRs and their photographic copies (Photo-MMC). b External validation: comparison of the models tested on CheXpert CXRs and their photographic copies (Photo-CXP). c End-user scenario: comparison of the models tested on photographs taken by medical residents (Photo-MED). In these figures, blue lines show the comparison reference performance of the models tested on original CXRs. Across all three experiments, the proposed model, Model-RECA, outperformed the other models except the comparison reference. (PTX: Pneumothorax; PE: Pleural effusion; Cons.: Consolidation; Model-ORIG: Model trained on MIMIC-CXR; Model-RECA: Recalibrated model trained on MIMIC-CXR; Model-TRNS: Model transferred from Model-ORIG and fine-tuned on MIMIC-CXR photographs; Model-PHOT: Model trained on MIMIC-CXR photographs).

Internal validation

First, we developed and internally tested our models using the MIMIC-CXR database. That is, both the training images and the source of the testing photographs, Photo-MMC, were derived from the same database. As shown in Fig. 2a, across the six major radiological findings, Model-ORIG shows a performance decrease from an average AUROC of 0.86 to 0.77 (p < 0.0001) compared to our comparison reference. After model recalibration, Model-RECA shows significant performance recovery from an average AUROC of 0.77 to 0.84 (p < 0.0001), close to our comparison reference. Figure 3 shows the receiver operating characteristic (ROC) curves. The blue lines show the comparison reference. The yellow lines show the results when Model-ORIG was evaluated using CXR photographs. The green lines show the performance of Model-RECA on the CXR photographs. The AUROC, sensitivity, specificity, F1-score, and accuracy for all 14 labels are presented in Supplementary Table 2. The results reiterate two insights and underscore the importance of this study. First, the model trained on the original CXRs was incapable of maintaining its performance on CXR photographs. Second, the recalibration process improved the model performance and successfully transferred image-based models’ detection accuracy to the CXR photographs.

Fig. 3: ROC curves for detecting cardiomegaly, edema, consolidation, atelectasis, pneumothorax, and pleural effusion observed in CXRs and CXR photographs.
figure 3

Blue lines show the ROC curves using the uncalibrated model on original CXRs. The yellow and green lines show the results of interpreting CXR photographs using the uncalibrated model (Model-ORIG) and recalibrated model (Model-RECA), respectively. The AUC of Model-RECA is significantly greater than that of Model-ORIG for each disease (p < 0.0001).

External validation

To investigate whether the two insights mentioned above generalize to another database, we tested the models developed from the MIMIC-CXR database on photographs made from an external database, CheXpert. That is, the models were tested on Photo-CXP. As shown in Fig. 2b, across the six radiological findings, the performance of Model-ORIG dropped from an average AUROC of 0.75 to 0.67 when tested on the CXR photographs (p < 0.0001). In contrast, Model-RECA improved the average AUROC from 0.67 to 0.75, significantly recovering the performance loss (p < 0.0001). The results of the external validation were consistent with those of the internal validation. Furthermore, although the transfer learning model (Model-TRNS) had significantly better performance than Model-PHOT (0.71 vs. 0.59, p < 0.0001) and Model-ORIG (0.71 vs. 0.67, p < 0.0001), Model-RECA still outperformed Model-TRNS (0.75 vs. 0.71, p < 0.0001). The results imply that, although a transfer learning strategy can help to deal with the domain shift problem, the recalibration process provides a better solution for radiographic finding detection on CXR photographs. Finally, when comparing Fig. 2a with Fig. 2b, the difference between the MIMIC-CXR and CheXpert databases led to AUROC drops for each model and each label, except for pleural effusion. The AUROC, sensitivity, specificity, F1-score, and accuracy for all 14 labels can be found in Supplementary Table 3.

End-user scenario

To simulate model performance when implemented in real clinical practice, the Photo-MED dataset was used to test the models. Nine medical residents were asked to photograph the CXRs using their own smartphones and computer monitors, as if sending them to colleagues for further discussion. Figure 2c shows the comparison results. The results were similar to those of the internal and external validation. The recalibrated model (Model-RECA) had the best performance among the four models tested (0.75 vs. 0.72, p < 0.0001; 0.75 vs. 0.70, p < 0.0001; and 0.75 vs. 0.61, p < 0.0001). This performance matches the comparison reference (0.75 vs. 0.75). The AUROC, sensitivity, specificity, F1-score, and accuracy for all 14 labels can be found in Supplementary Table 4. The results demonstrate that the model improvement is not user-dependent and that the recalibrated model has the potential to be deployed in real clinical work.

Device-variance test

Figure 4 shows the results of the device-variance test, in which a physician photographed the same set of CheXpert CXRs ten times using ten different combinations of devices (smartphones and computer monitors) (Photo-DEV). The box plots show the median and interquartile range of the AUROCs for Model-ORIG and Model-RECA across the ten settings. The overall AUROC for Model-RECA (0.80 ± 0.076) is significantly higher than that for Model-ORIG (0.74 ± 0.094) (p < 0.0001).

Fig. 4: Results of the device-variation experiment, in which the same set of 202 CheXpert CXRs were copied into photographs by ten different device settings.
figure 4

The box plots for the uncalibrated model (Model-ORIG) and recalibrated model (Model-RECA) show the median and interquartile range of the AUROCs. In each box, the central line indicates the median, and the edges of the box indicate the 25th and 75th percentiles. Three labels, “fracture,” “lung disease,” and “pleural other,” were excluded from the plots because each accounts for less than 1% of cases. The intraclass correlation coefficient (ICC) for Model-ORIG is [0.39, 0.77] (95% confidence interval) and the ICC for Model-RECA is [0.85, 0.93]. The p-value comparing these two ICCs is smaller than 0.0001, indicating that Model-RECA provides more reliable radiographic detection results.

In addition, we used the intraclass correlation coefficient (ICC) to evaluate the radiographic detection stability of the uncalibrated and recalibrated models when tested on photographs taken with different device combinations. The ICC for Model-ORIG, [0.39, 0.77] (95% confidence interval), is significantly lower than that for Model-RECA, [0.85, 0.93] (p < 0.0001). These results indicate that, although the noise in photographs generated by different smartphones and computer monitors varied (Supplementary Fig. 1a), Model-RECA provided more consistent detection results than Model-ORIG for the same CXR image captured under different noise distributions.

Diagnostic visualization

In this study, activation maps were created to demonstrate explainability. By identifying the region of the CXR that weighed most heavily in the algorithm’s output, the user gains some insight into what the algorithm “sees”. This can be particularly useful to determine the trustworthiness of an algorithm’s classification. Using Gradient-weighted Class Activation Mapping (Grad-CAM)20, Fig. 5 shows the resilience of Model-RECA to noise disturbance in an example case labeled as consolidation. When applied to the original CXR image, irregular opacification at the right lower lobe was correctly tagged by both models. However, when applied to the CXR photograph, Model-ORIG was distracted by photography noise and mistakenly used the right clavicle as the determining factor to label consolidation. In contrast, Model-RECA focused on the same location as when tested on the original CXR image, visually demonstrating its stability even though the CXR photograph contained conspicuous noise. However, the algorithm might be influenced by noise if the quality of the CXR or the photograph captured by the smartphone is suboptimal (see Supplementary Fig. 4).

Fig. 5: An example of the visualization of the diagnostic focus of two models.
figure 5

a An example CXR is diagnosed as consolidation from the radiology report. The red arrow indicates the abnormal location. b and c show the diagnostic focus of the recalibrated model (Model-RECA) and the uncalibrated model (Model-ORIG) tested on the original CXR, respectively. e and f show the diagnostic focus of Model-RECA and Model-ORIG tested on the corresponding CXR photograph, respectively. The colors from blue to red map the strength of the contribution of each image location, from low to high, for predicting consolidation.

Cross-database validation

To re-examine the stability of the recalibration method, we further constructed CheXpert-based models (Model-ORIG, Model-RECA, Model-TRNS, and Model-PHOT) by swapping the roles of the MIMIC-CXR and CheXpert databases for training and testing. We repeated all procedures to confirm the consistency of our results and obtained similar results, as shown in Supplementary Fig. 3, Supplementary Table 5, and Supplementary Table 6.

Discussion

This study presented a framework that recalibrates the conventional deep learning training process to obtain models capable of detecting radiological findings on CXR photographs. We first demonstrated that conventional detection algorithms trained on digital CXRs did not perform well on CXR photographs because of the discordance between images and photographs. We also showed that, from a transfer learning perspective, the performance of a model fine-tuned on a limited number of CXR photographs was not good enough to recover the performance losses. Instead of retraining a model on a large corpus of CXR photographs, we presented a method to recalibrate the models using a small number of photographs from publicly accessible CXR datasets, which avoids the time-consuming collection of large amounts of data for fine-tuning. Finally, we conducted four experiments and showed that the performance losses caused by shifting targets from the original images to photographs could be recovered by the proposed recalibration method.

The main goal of this study is to solve the problem of domain shift in CXR interpretation. Previous research has shown that machine learning systems are vulnerable to adversarial examples generated by smartphone cameras21: with different noise, photographs generated from the same image source were classified into incorrect categories. Similarly, our study shows that the uncalibrated model failed to overcome the difference between digital CXRs and CXR photographs. A feasible solution to this obstacle is a transfer learning strategy: a model trained on digital CXRs can be fine-tuned using CXR photographs to address the domain shift. However, we demonstrated that the recalibrated model required only 10% of the photographs (n = 175) yet performed better than the transferred model. Moreover, the proposed recalibration process does not rely on any specific deep learning architecture and is therefore applicable to various models. We suggest that the recalibration method can serve as an alternative to transfer learning for model building when dealing with domain shift problems.

A challenge for recent deep learning advances in radiology is generalizability22. Some algorithms with high accuracy, as reported in publications, struggle to translate their success to the real world. A study also demonstrated a reduction in model performance when training and testing were done on different CXR databases, MIMIC-CXR and CheXpert23. To ascertain the generalizability of our methods, we employed a second large CXR dataset to conduct external validation and experiments with different users and devices. Across these experiments, the performance of the recalibrated model was consistently better than that of the uncalibrated model and close to the comparison reference. Moreover, the significantly better ICC of the recalibrated model demonstrated its robustness despite various noise distributions. Finally, we performed a cross-database validation by swapping the training and testing sets of MIMIC-CXR and CheXpert and showed that the results were consistent with the analogous experiments. These experiments suggest that our recalibrated model generalizes to different hospitals, users, and devices and can provide a foundation for a smartphone application to assist clinicians with CXR interpretation anytime, anywhere.

Although a previous work24 showed the possibility of directly shifting models from digital CXRs to CXR photographs, the photographs used to test the model were limited in number and to certain categories. In our experiments, a similarly minor performance loss was observed only when we internally validated the uncalibrated model on five specific categories (i.e., cardiomegaly, edema, consolidation, atelectasis, and pleural effusion). Otherwise, the uncalibrated model lost its accuracy when tested on an external dataset or on more categories of radiological findings.

Phillips and colleagues have also built a CXR photograph dataset25, which contains CXR photographs taken by a single physician using different techniques. Although both studies look at photographs of CXRs, our study focuses on recalibrating algorithms for images taken by different users with different cameras. The noise and the non-trivial image transformations of the photographs are greatly affected by camera hardware (e.g., the sensor’s resolution and the construction of the lens) and software (e.g., auto-adjustment of ISO and white balance). Moreover, photographs taken by a single experienced user could differ greatly from those taken by a less experienced user. Therefore, we built our validation sets by capturing the images with several devices and several users. As our results show, the uncalibrated model lost more of its accuracy, while the recalibrated model performed well when tested on different data sources, users, and devices.

The primary use case envisioned for the smartphone-based algorithm is assistance with interpretation of a CXR (digital image or plain film) in an acute care facility with a legacy clinical information system. Currently, a messaging application such as WhatsApp is typically employed to take a photo of either the digital image or the plain film and send it to a colleague for a wet read. Is the CXR suggestive of pneumonia? Is there pulmonary edema? The smartphone-based algorithm is not intended for the detection of lung nodules for cancer screening nor for quality assurance of radiologists, given that these two tasks require high-resolution images. Despite applications limited to acute care, the software can still help address the radiologist shortage in low-resource countries. For example, there are only three radiologists in Botswana for two million people26. However, in countries like Botswana, CXR films are still printed rather than digitized. When radiology consultation is required in remote areas, clinicians send printed films to the capital and receive the reading days, if not weeks, later. This smartphone-based application could help shorten the turnaround time and provide immediate assistance to local clinicians.

Lastly, the model performance should be carefully assessed in clinical scenarios. We used AUROCs to evaluate the discrimination of the models. However, clinicians may care more about precision (positive predictive value) and recall (sensitivity). The consequence of missing a CXR finding (false negative) must be balanced against the harm of overcalling it (false positive). For example, if clinicians would like to use the algorithm to screen for pneumonia, then the model with the best recall is preferred over the one with the best discrimination. However, if the intent is to help filter referrals from rural health centers and decongest strained tertiary care facilities in the capital, then precision is prioritized over recall.

There are a number of limitations to this work. First, both digital CXR databases used in this study were obtained from patients in the US. Ideally, the model should be recalibrated using photographs obtained from the local population in which the model will be used, particularly as the common radiographic findings in such a population will likely differ from those in a US population. Second, the model performance we report is tied to the accuracy and consistency of the CXR labels in MIMIC-CXR and CheXpert. For example, the discrimination between “consolidation,” “pneumonia,” and “opacity” may not be the same across different datasets and will interfere with the recalibration process. This issue can be addressed by harmonizing labels and annotations across the CXR datasets before recalibration. Finally, we did not compare the recalibration of the algorithm with reader recalibration when interpreting the high-resolution DICOM image and the low-resolution photo of the same CXR on a smartphone. In all the experiments, the interpretation of the high-resolution DICOM image was used as the gold standard. Neither a human nor an algorithm can compensate for a significant loss of information with a reduction in image resolution.

In summary, we presented a method to recalibrate deep learning models built on high-resolution digital images to detect radiological findings on smartphone-captured CXR photographs. The recalibrated model achieves performance similar to that of the original model, and its performance is not significantly affected by variation in devices and operators.

Methods

Overview

Figure 1 illustrates the proposed method. We first collected CXRs from two databases and created the CXR photograph datasets by taking smartphone photographs of the digital CXRs. Instead of taking a large number of photographs, we built a series of augmentation functions that transform the training datasets into photograph-like CXRs. The hyperparameters of the augmentation functions were tuned by comparing the similarity between the augmented results and 175 real photographs. The final augmented CXR photographs were used to train the recalibrated model (Model-RECA), as shown in Fig. 1a. Three other models (Model-ORIG, Model-TRNS, and Model-PHOT) were constructed for comparison. Finally, as shown in Fig. 1b, the models were tested on four derivative CXR photograph datasets (Photo-MMC, Photo-CXP, Photo-MED, and Photo-DEV) corresponding to four experiments (internal validation, external validation, end-user scenario, and device-variance test). The performance metrics and activation maps for 14 labels representing radiological findings were used to evaluate model performance.

Data collection and curation

We used frontal-view CXR images from the MIMIC-CXR and CheXpert databases18,19. MIMIC-CXR contains data from 64,588 patients seen at the Beth Israel Deaconess Medical Center Emergency Department between 2011 and 2016. MIMIC-CXR database v2.0.0 has been de-identified. The institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and Beth Israel Deaconess Medical Center (2001-P-001699/14) both approved the creation of the database for research. The requirement for informed consent was waived because the study did not impact clinical care and all protected health information was removed. A total of 14 labels of radiological findings, as listed in Table 2, were extracted from the radiology reports using the CheXpert and NegBio algorithms18,27. Twenty-two images, simultaneously labeled as ‘no finding’ and positive for one of the 14 labels, were excluded from the following analyses. A total of 250,022 frontal-view CXR images were randomly separated into training (n = 248,263) and testing (n = 1,759) sets.
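For illustration, a minimal sketch of such a curation step is shown below; the CSV file name and column names are hypothetical placeholders rather than the actual MIMIC-CXR metadata schema, and the test-set size is passed explicitly to mirror the 1,759-image split described above.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical label table: one row per frontal-view CXR, one column per label
    # (1 = positive). File and column names are placeholders, not the real schema.
    labels = pd.read_csv("mimic_cxr_labels.csv")
    finding_cols = [c for c in labels.columns if c not in ("dicom_id", "No Finding")]

    # Exclude images labeled both 'No Finding' and positive for any of the 14 findings.
    conflict = (labels["No Finding"] == 1) & (labels[finding_cols].eq(1).any(axis=1))
    labels = labels[~conflict]

    # Random split approximating the 1,759-image test set described in the text.
    train_df, test_df = train_test_split(labels, test_size=1759, random_state=0)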

Table 2 Numbers of cases for 14 labels in MIMIC-CXR and CheXpert datasets.

CheXpert is a publicly available database collected from Stanford Hospital. The database includes 224,316 CXRs from 65,240 patients. Each CXR was labeled for the presence or absence of 14 pulmonary radiological findings. A total of 191,229 frontal-view CXR images were used and randomly separated into training (n = 189,892) and testing (n = 1,337) sets. The ratio of training to testing data is the same (1000:7) for both datasets. Another 202 frontal-view CXR images, annotated by three board-certified radiologists and originally designed as a validation set, were included to examine device variation.

Table 2 summarizes the distribution of the radiological findings in the training and test sets of MIMIC-CXR and CheXpert. Prior to the analysis, all images were normalized by histogram equalization.
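As a rough sketch of this normalization step (assuming 8-bit grayscale inputs), scikit-image's standard histogram equalization can be applied as follows; the exact equalization routine the authors used is not specified, and the 320 × 320 target size follows the model input described later in the Methods.

    from skimage import io, exposure, transform, img_as_ubyte

    def preprocess_cxr(path, size=(320, 320)):
        """Load a CXR or CXR photograph, equalize its histogram, and resize it."""
        img = io.imread(path, as_gray=True)              # grayscale, float in [0, 1]
        img = exposure.equalize_hist(img)                # global histogram equalization
        img = transform.resize(img, size, anti_aliasing=True)
        return img_as_ubyte(img)                         # back to 8-bit for storage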

To create the CXR photograph datasets, we selected current-generation smartphones with different camera specifications (see Supplementary Table 1). All photographs were taken at random angles and under varying ambient conditions and noise disturbances. The alignment of each photograph was automatically adjusted with the Microsoft Office Lens app (Microsoft Corp.) to simulate instructions to users for obtaining the best possible image. The resolution of the photographs was reduced to 320 × 320 pixels after capture. Four CXR photograph datasets were created:

(1) Photo-MMC: Photographs of the CXRs in the MIMIC-CXR testing set were captured by three participants using eight different smartphones. The images were displayed on eight different computer monitors. The CXR photographs were taken at different times, in different locations, and under various lighting sources, yielding a total of 1,759 photographs.

(2) Photo-CXP: Using the same settings as those used to create Photo-MMC, a total of 1,337 photographs were taken of the CheXpert testing set.

(3) Photo-MED: The 1,337 CXRs in the CheXpert testing set were separated into nine subsets. Nine medical residents were recruited, each photographing one subset using their own smartphone and monitor. They were instructed to “take photos as if you want to send them to a radiologist for interpretation.” No other instruction or quality requirement was given.

(4) Photo-DEV: To examine the effect of the make of the computer monitor and the smartphone, the 202 CXRs of the CheXpert validation dataset were photographed by a single physician ten times. For the first nine subsets, nine different device settings were used under the same lighting condition and in the same location. The last subset was taken under a brighter lighting condition. This dataset consists of 2,020 photographs in total. Supplementary Fig. 1a shows examples of CXR photographs taken with different device settings.

Data augmentation

We augmented the training datasets by generating simulated CXR photographs, with the hyperparameters determined from the real photographs reserved for recalibration. Eight common types of noise were embedded in the functions: (1) Gaussian noise, (2) saturation change, (3) overexposure, (4) contrast change, (5) motion blur, (6) moiré pattern, (7) Poisson noise, and (8) noise induced by image compression28 (see Supplementary Fig. 2). We used the imgaug 0.4.0 library for Python 3.7.0 to generate noise types (1)–(5), (7), and (8)29. The moiré pattern was simulated using the Radon and inverse Radon transforms30,31 from the scikit-image library v0.17.dev032. These noise simulation functions were applied in the order in which the corresponding noise arises along the optical path, starting from the computer monitor. An example photograph produced by the augmentation functions is shown in Supplementary Fig. 1b. The augmented photograph shows the effects of noise patterns, overexposure, and contrast enhancement in the CXR photograph.
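A simplified sketch of such an augmentation pipeline is shown below, using imgaug augmenters and the scikit-image Radon transforms. The augmenter choices, parameter values, and the moiré approximation are illustrative stand-ins within the ranges listed in the next subsection, not the authors' exact functions; the color saturation/white-balance change is omitted because the sketch operates on grayscale arrays.

    import numpy as np
    import imgaug.augmenters as iaa
    from skimage.transform import radon, iradon, resize

    def add_moire(image, n_angles=180):
        """Approximate a moire-like artifact with a lossy Radon / inverse-Radon round trip.
        Assumes a 2-D uint8 array (e.g., 320 x 320)."""
        theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
        sino = radon(image.astype(float), theta=theta, circle=False)
        recon = np.clip(iradon(sino, theta=theta, circle=False), 0, 255)
        if recon.shape != image.shape:
            recon = resize(recon, image.shape, preserve_range=True, anti_aliasing=True)
        # Blend the streaky reconstruction with the original image.
        return (0.7 * image + 0.3 * recon).astype(np.uint8)

    # Camera-side noise, applied after the monitor-side moire pattern.
    photo_noise = iaa.Sequential([
        iaa.LinearContrast((1.6, 2.2)),                         # contrast change
        iaa.Multiply((1.0, 1.4)),                               # overexposure
        iaa.MotionBlur(k=5),                                    # motion blur, fixed intensity
        iaa.AdditiveGaussianNoise(loc=(5, 20), scale=(4, 12)),  # Gaussian sensor noise
        iaa.AdditivePoissonNoise(lam=(2, 10)),                  # Poisson (shot) noise
        iaa.JpegCompression(compression=(30, 70)),              # compression artifacts
    ])

    def augment_to_photo(image):
        """Turn a digital CXR array into a simulated smartphone photograph."""
        return photo_noise(image=add_moire(image))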

Hyperparameter optimization

Ten hyperparameters of the augmentation functions were optimized within the following ranges: (1) the mean (range: 5–20) and (2) the variance (range: 4–12) of Gaussian noise, (3) the probability of saturation change (range: 0.5–0.8), (4) the white/yellowish ratio of saturation change (range: 0.6–0.8), (5) the intensity mean (range: 1–1.4) and (6) the intensity variance (range: 0.2–0.4) of overexposure, (7) the intensity of contrast correction (range: 1.6–2.2), (8) the probability of motion blur (range: 0.2–0.5), (9) the probability of moiré pattern (range: 0.3–0.9), and (10) the lambda of Poisson noise (range: 2–10). The motion intensity was fixed to 5, and the compression rate was set to 30–70%.

A similarity comparison between the real CXR photographs and the augmented photographs was performed to determine the value of each hyperparameter in the augmentation functions. The similarity was calculated using the complex wavelet structural similarity method33 and the Bhattacharyya distance of the image histograms. We performed hyperparameter optimization using a grid search over reasonable values. Ten percent of the photographs from Photo-MMC were partitioned for tuning the hyperparameters and were excluded from the performance evaluation.
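A simplified sketch of this tuning loop is shown below. It scores a candidate hyperparameter set by the Bhattacharyya distance between gray-level histograms of real and simulated photographs, with ordinary structural similarity standing in for the complex wavelet variant used in the study; the grid covers only a few of the ten hyperparameters, and `simulate_photo` is a hypothetical handle to the augmentation functions.

    import itertools
    import numpy as np
    from skimage.metrics import structural_similarity

    def bhattacharyya_distance(a, b, bins=256):
        """Bhattacharyya distance between the gray-level histograms of two images."""
        ha, _ = np.histogram(a, bins=bins, range=(0, 255))
        hb, _ = np.histogram(b, bins=bins, range=(0, 255))
        pa, pb = ha / max(ha.sum(), 1), hb / max(hb.sum(), 1)
        return -np.log(np.sum(np.sqrt(pa * pb)) + 1e-12)

    # Coarse grid over a subset of the hyperparameter ranges given in the text.
    grid = {
        "gauss_mean": [5, 10, 20],
        "gauss_var": [4, 8, 12],
        "overexposure_mean": [1.0, 1.2, 1.4],
        "contrast": [1.6, 1.9, 2.2],
    }

    def grid_search(real_photos, source_cxrs, simulate_photo):
        """Return the hyperparameters whose simulated photos look most like real ones.
        Assumes real and simulated images are aligned and resized to the same shape."""
        best_params, best_score = None, np.inf
        for values in itertools.product(*grid.values()):
            params = dict(zip(grid.keys(), values))
            score = 0.0
            for cxr, photo in zip(source_cxrs, real_photos):
                sim = simulate_photo(cxr, **params)  # hypothetical augmentation call
                score += bhattacharyya_distance(sim, photo)
                score += 1.0 - structural_similarity(sim, photo, data_range=255)
            if score < best_score:
                best_params, best_score = params, score
        return best_params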

Three different parameter selection approaches were adopted to determine the value of each hyperparameter, and their effectiveness was evaluated based on the performance of the resulting models: (1) random selection from the chosen range, (2) selection by an author based on his subjective perception of each hyperparameter, and (3) the similarity comparison described above. The comparison results are shown in Supplementary Table 7.

Model construction

Deep learning models were built to detect radiological findings. Training and testing were performed on a Google platform with multiple GPUs. TensorFlow 2.0 and Keras 2.3 were used for model training. A 121-layer Densely Connected Convolutional Network (DenseNet-121)34 with max pooling was used as the comparison reference model architecture, which was also used in previous studies7,8,11,18,35. Consistent results for the comparison reference model can also be found in recent studies using the same model structure (DenseNet-121) and databases (MIMIC-CXR and CheXpert)23,35. The input image size was 320 × 320 because a previous study demonstrated that performance did not increase with higher resolution CXR images, while the use of higher resolution images incurs more computational cost36. The initial weights of the network were randomly initialized. The final fully connected layer contained 14 outputs corresponding to the 14 target labels. Binary cross-entropy was chosen as the loss function, and the Adam optimizer was applied in the training process with parameters learning rate = 0.001, beta1 = 0.9, and beta2 = 0.99937. As shown in Fig. 1a, four models were constructed. The comparison reference model, Model-ORIG, was trained using the original MIMIC-CXR images. The recalibrated model, Model-RECA, was trained using the augmented CXR photographs. Model-TRNS was obtained by using the Photo-MMC dataset (n = 1,759) to fine-tune Model-ORIG. Finally, the photograph-based model, Model-PHOT, was trained directly on the real photographs in Photo-MMC (n = 1,759).
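A sketch of a comparable classifier in TensorFlow/Keras follows, matching the backbone, input size, output layer, loss, and optimizer settings described above; the metric choice and other details are simplifications rather than the authors' exact configuration.

    import tensorflow as tf

    NUM_LABELS = 14

    def build_model(input_shape=(320, 320, 3)):
        # DenseNet-121 backbone, randomly initialized, with global max pooling.
        backbone = tf.keras.applications.DenseNet121(
            include_top=False, weights=None, input_shape=input_shape, pooling="max")
        outputs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid")(backbone.output)
        model = tf.keras.Model(backbone.input, outputs)
        model.compile(
            optimizer=tf.keras.optimizers.Adam(
                learning_rate=0.001, beta_1=0.9, beta_2=0.999),
            loss="binary_crossentropy",
            metrics=[tf.keras.metrics.AUC(name="auroc")])
        return model

    model = build_model()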

We trained Model-ORIG and Model-RECA using mini-batches of size 32 for five epochs, after which the models converged. We trained Model-TRNS and Model-PHOT for 10 and 50 epochs, respectively, after which those models converged. For Model-ORIG, the training dataset was augmented twice by a random transformation (rotating ±7 degrees, scaling ±2%, and shearing ±5 pixels)38. For Model-RECA, we augmented the training dataset using our augmentation functions with two sets of hyperparameters, determined by the complex wavelet structural similarity method33 and the Bhattacharyya distance of the image histograms, respectively. The total number of training images was the same for Model-ORIG and Model-RECA (n = 496,570).
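The corresponding training step might look like the sketch below, reusing `model` from the previous sketch and `augment_to_photo` from the augmentation sketch. `geometric_augment` stands in for the random rotation/scale/shear transform (imgaug expresses shear in degrees, so the ±5-pixel shear above is only approximated), and the in-memory dataset handling is a simplification, not the authors' pipeline.

    import numpy as np
    import imgaug.augmenters as iaa

    # Random geometric transform used to augment Model-ORIG's training data.
    geometric_augment = iaa.Affine(rotate=(-7, 7), scale=(0.98, 1.02), shear=(-2, 2))

    def expand_with_augmented(images, labels, augment_fn, copies=2):
        """Append `copies` augmented versions of every image (and its labels).
        `images` is a list of 2-D uint8 arrays; `labels` is a list of 14-element vectors."""
        aug_images = [augment_fn(image=img) for img in images for _ in range(copies)]
        aug_labels = [lab for lab in labels for _ in range(copies)]
        return images + aug_images, labels + aug_labels

    # Hypothetical usage: build the recalibrated training set and train for 5 epochs.
    # x, y = expand_with_augmented(train_images, train_labels, augment_to_photo)
    # x = np.repeat(np.stack(x)[..., np.newaxis], 3, axis=-1)   # grayscale -> 3 channels
    # model.fit(x, np.array(y), batch_size=32, epochs=5)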

Experiment design

Figure 1b shows that the four models described above (Model-ORIG, Model-TRNS, Model-PHOT, and Model-RECA) were tested separately on four CXR photograph datasets (Photo-MMC, Photo-CXP, Photo-MED, and Photo-DEV), each constructed for the purpose described below:

(1) Internal validation: Model-ORIG and Model-RECA were tested on the original MIMIC-CXR testing set and Photo-MMC. Model-TRNS and Model-PHOT were excluded from this experiment because they were trained using Photo-MMC.

(2) External validation: The CheXpert testing dataset and Photo-CXP were used as external datasets to test the performance of the four models.

(3) End-user scenario: The four models were tested on Photo-MED to investigate model performance when applied to real-world healthcare scenarios.

(4) Device-variance test: To investigate whether model performance is device-dependent, Photo-DEV was used to test the models.

Performance evaluation and statistical analysis

We calculated one-versus-all AUROC, sensitivity, specificity, F1-score, and binary classification accuracy in each experiment to evaluate model performance. In the device-variance test, the intraclass correlation coefficient (ICC) was used to evaluate the intra-rater reliability (i.e., the stability of label production in our test) of both models. We used a “two-way mixed effects,” “single measurement,” “absolute agreement” model in R to estimate the final value32. Bootstrapping was used to estimate the 95% confidence interval and to compute t statistics. Finally, we used a nonparametric approach to estimate the p-values: we bootstrapped the testing data 1,000 times to obtain the AUROCs and performed Welch’s two-sample t-test to calculate the p-value.
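A minimal sketch of the bootstrap comparison for a single label is given below, using scikit-learn's AUROC and SciPy's Welch test; resampling the same indices for every model is one plausible reading of the procedure described above rather than the authors' exact code, and the ICC computation in R is not reproduced here. `y_true`, `p_orig`, and `p_reca` are assumed to be 1-D NumPy arrays of labels and model scores.

    import numpy as np
    from scipy import stats
    from sklearn.metrics import roc_auc_score

    def bootstrap_aucs(y_true, model_scores, n_boot=1000, seed=0):
        """Bootstrap AUROCs for one label; the same resampled indices score every model."""
        rng = np.random.default_rng(seed)
        n = len(y_true)
        aucs = {name: [] for name in model_scores}
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)
            if len(np.unique(y_true[idx])) < 2:          # AUROC needs both classes
                continue
            for name, s in model_scores.items():
                aucs[name].append(roc_auc_score(y_true[idx], s[idx]))
        return {name: np.array(v) for name, v in aucs.items()}

    # Hypothetical usage for one label, e.g. comparing Model-ORIG and Model-RECA:
    # aucs = bootstrap_aucs(y_true, {"orig": p_orig, "reca": p_reca})
    # ci_reca = np.percentile(aucs["reca"], [2.5, 97.5])                  # 95% CI
    # t, p = stats.ttest_ind(aucs["orig"], aucs["reca"], equal_var=False) # Welch's t-test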

Model visualization

Finally, we employed Grad-CAM20 to obtain visual explanations for each label of our CNN-based models. The heatmaps produced by Grad-CAM can be used to visualize the diagnostic focus of the working algorithm and to investigate whether the algorithms use the same visual patterns to detect radiological findings as radiologists do.
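A compact Grad-CAM sketch for the Keras DenseNet-121 classifier sketched above is shown below; the layer name "relu" (the last activation of Keras's DenseNet-121) and the single-image interface are assumptions, and the code follows the standard Grad-CAM recipe rather than the authors' exact implementation.

    import numpy as np
    import tensorflow as tf

    def grad_cam(model, image, label_index, conv_layer_name="relu"):
        """Grad-CAM heatmap for one label of a multi-label CXR classifier.
        `image` is one preprocessed array matching the model input shape."""
        grad_model = tf.keras.Model(
            model.input,
            [model.get_layer(conv_layer_name).output, model.output])
        x = tf.convert_to_tensor(image[np.newaxis], dtype=tf.float32)
        with tf.GradientTape() as tape:
            conv_out, preds = grad_model(x)
            target = preds[:, label_index]            # score for the label of interest
        grads = tape.gradient(target, conv_out)       # d(score) / d(feature map)
        weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average-pooled gradients
        cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1)
        cam = tf.nn.relu(cam)[0]                      # keep positive contributions
        cam = cam / (tf.reduce_max(cam) + 1e-8)       # normalize to [0, 1]
        return cam.numpy()                            # upsample and overlay for display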

Cross-database validation

We swapped the roles of the MIMIC-CXR and CheXpert datasets for training and testing and then went through all procedures again. The parameters used to construct the CheXpert-based models were the same as those used for the MIMIC-based models. As a result, four additional models were constructed. The baseline model, Model-ORIG, was trained on the original CheXpert CXR images. The recalibrated model, Model-RECA, was trained on the augmented CXR photographs. Model-TRNS was obtained by using the Photo-CXP dataset (n = 1,337) to fine-tune Model-ORIG. The photograph-based model, Model-PHOT, was trained on the real photographs in Photo-CXP (n = 1,337). These four models were tested on two CXR photograph datasets (Photo-CXP and Photo-MMC), and one-versus-all AUROC, sensitivity, specificity, F1-score, and binary classification accuracy were computed for each label.

Reporting summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this paper.