Introduction

The development of artificial intelligence (AI) systems for medical applications represents a delicate trade-off: On the one hand, diagnostic models must offer high accuracy and certainty, as well as treat different patient groups equitably and fairly. On the other hand, clinicians and researchers are subject to ethical and legal responsibilities towards the patients whose data is used for model training. In particular, when diagnostic models are published to third parties whose intentions are impossible to verify, care must be taken to ensure that patient privacy is not compromised. Privacy breaches can occur, e.g., through data reconstruction, attribute inference or membership inference attacks against the shared model1. Federated learning2,3,4 has been proposed as a tool to address some of these problems. However, it has become evident that training data can be reverse-engineered from federated systems, rendering them just as vulnerable to the aforementioned attacks as centralized learning5. Thus, it is apparent that formal privacy preservation methods are required to protect the patients whose data is used to train diagnostic AI models. The gold standard in this regard is differential privacy (DP)6.

Most, if not all, currently deployed machine learning models are trained without any formal privacy-preservation technique. It is especially crucial to employ such techniques in federated scenarios, where much more granular information about the training process can be extracted, or even the training process itself can be manipulated by a malicious participant7,8. Moreover, trained models can be attacked to extract training data through so-called model inversion attacks9,10,11. We also note that such attacks work better if the models have been trained on less data, which is especially concerning since even most FDA-approved AI algorithms have been trained on fewer than 1000 cases12. Creating a one-to-one correspondence between a successful attack and the resulting “privacy risk” requires a case-by-case consideration. Legal opinion (e.g., under the GDPR) seems to have converged on the notion of singling out/re-identification. Even from the perspective of newer legal frameworks, such as the EU AI Act, which demand “risk moderation” rather than directly specifying “privacy requirements,” DP can be seen as the optimal tool, as it can quantitatively bound both the risk of membership inference (MI)13,14 and data reconstruction15. Moreover, this was also shown empirically for both aforementioned attack classes16,17,18,19. It is also known that DP, contrary to de-identification procedures such as k-anonymity, provably protects against the notion of singling out20,21.

DP is a formal framework encompassing a collection of techniques that allow analysts to obtain insights from sensitive datasets while guaranteeing the protection of individual data points within them. DP is thus a property of a data processing system which states that the results of a computation over a sensitive dataset must be approximately identical whether any single individual is included in or excluded from the dataset. Formally, a randomized algorithm (mechanism) \(\mathcal{M}:\mathcal{X}\to\mathcal{Y}\) is said to satisfy (ε, δ)-DP if, for all pairs of databases \(D, D^{\prime}\in\mathcal{X}\) which differ in one row and all \(S\subseteq\mathcal{Y}\), the following holds:

$$\Pr\left(\mathcal{M}(D)\in S\right)\le e^{\varepsilon}\,\Pr\left(\mathcal{M}(D^{\prime})\in S\right)+\delta ,$$
(1)

where the guarantee is given over the randomness of \(\mathcal{M}\) and holds equally when D and \(D^{\prime}\) are swapped. In more intuitive terms, DP is a guarantee given by a data processor to a data owner that the risks of adverse events which can occur due to the inclusion of their data in a database are bounded compared to the risks of such events when their data is not included. The parameters ε and δ together form what is typically called a privacy budget. Higher values of ε and δ correspond to a looser privacy guarantee and vice versa. With some terminological laxity, ε can be considered a measure of the privacy loss incurred, whereas δ represents a (small) probability that this privacy loss is exceeded. For deep learning workflows, δ is typically set to approximately the inverse of the database size. We note that, although mechanisms exist where δ denotes a catastrophic privacy degradation probability, the sampled Gaussian mechanism used to train neural networks does not exhibit this behavior. The fact that quantitative privacy guarantees can be computed over many iterations (compositions) of complex algorithms like the ones used to train neural networks is unique to DP. This process is typically referred to as privacy accounting. Applied to neural network training, the randomization required by DP is ensured through the addition of calibrated Gaussian noise to the gradients of the loss function computed for each individual data point after they have been clipped in 2-norm to ensure that their magnitude is bounded22, where the clipping threshold is an additional hyperparameter of the training process.
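To make the mechanics of differentially private training concrete, the following minimal PyTorch sketch performs a single DP-SGD update with per-sample gradient clipping and Gaussian noising. The model, clipping threshold, and noise multiplier are illustrative placeholders rather than the configuration used in this study.

```python
import torch

def dp_sgd_step(model, loss_fn, xs, ys, lr=0.1, max_grad_norm=1.5, noise_multiplier=1.0):
    """Illustrative DP-SGD step: clip each per-sample gradient in 2-norm,
    sum, add calibrated Gaussian noise, average, and update the parameters."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):                              # per-sample gradients
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
                 for p in model.parameters()]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        clip_coef = (max_grad_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * clip_coef)                         # accumulate clipped gradients
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            # Noise standard deviation is proportional to the clipping threshold.
            noise = torch.normal(0.0, noise_multiplier * max_grad_norm, size=s.shape)
            p.add_(-(lr / len(xs)) * (s + noise))         # noisy averaged update
    return model
```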

DP not only offers formal protection; several works have also empirically shown the connection between the privacy budget and the success of membership inference16 and data reconstruction attacks17,19,23. We note that absolute privacy (i.e., zero risk) is only possible if no information is present24. This is, for example, the case for encryption methods, which are perfectly private as long as the data is never decrypted. Training models via homomorphic encryption, however, does not offer such perfect privacy guarantees, as the information learned by the model is eventually revealed at inference time through the model’s predictions. Thus, without the protection of differential privacy, no formal barrier stands between the sensitive data and an attacker (beyond potential imperfections of the attack algorithm, which are usually not controllable a priori). DP offers the ability to upper-bound the risk of successful privacy attacks while still being able to draw conclusions from the data. Determining the exact privacy budget is challenging, as it is a matter of policy. The technical perspective can inform the appropriate budget level, as it is possible to quantify the risk of a successful attack at a given privacy budget against the model utility that can be achieved. The trade-offs between model utility and privacy preservation are also a matter of ethical, societal and political debate. The utilization of DP creates two fundamental trade-offs: The first is a “privacy-utility trade-off,” i.e., a reduction in diagnostic accuracy when stronger privacy guarantees are required25,26. The other trade-off is between privacy and fairness. Intuitively, the fact that AI models learn proportionally less about under-represented patient groups27 in the training data is amplified by DP, leading to demographic disparity in the model’s predictions or diagnoses28. Both of these trade-offs are delicate in sensitive applications such as medical ones, where wrong diagnoses or discrimination against certain patient groups are not acceptable.

The need for differential privacy has been illustrated by Packhäuser et al.29, who showed that it is trivial to match chest x-rays of the same patient, which directly enables re-identification attacks; a similar result was shown for tabular databases by Narayanan et al.30. The training of deep neural networks on medical data with DP has so far not been widely investigated. Li et al.31 investigated privacy-utility trade-offs in the combination of advanced federated learning schemes and DP methods on a brain tumor segmentation dataset. They find that DP introduces a considerable reduction in model accuracy in the given setting. Hatamizadeh et al.23 illustrated that the use of federated learning alone can be unsafe in certain settings. Ziegler et al.32 reported similar findings when evaluating privacy-utility trade-offs for chest x-ray classification on a public dataset. These results also align with our previous work17, where we demonstrated the utilization of a suite of privacy-preserving techniques for pneumonia classification in pediatric chest X-rays. However, the focus of that study was not to elucidate privacy-utility or privacy-fairness trade-offs, but to showcase that federated learning workflows can be used to train diagnostic AI models with good accuracy on decentralized data while minimizing data privacy and governance concerns. Moreover, we demonstrated that empirical data reconstruction attacks are thwarted by the utilization of differential privacy. In addition, that work did not consider differential diagnosis but only coarse-label classification into normal vs. bacterial or viral pneumonia.

In this work, we aim to elucidate the connection between the use of formal privacy techniques and fairness towards underrepresented groups in the sensitive setting of medical use-cases. This is an important prerequisite for the deployment of ethical AI algorithms in such sensitive areas. So far, however, prior work on this question has been limited to benchmark computer vision datasets33,34. We thus contend that the widespread use of privacy-preserving machine learning requires testing under real-life circumstances. In the current study, we perform the first in-depth investigation into this topic. Concretely, we utilize a large clinical database of radiologist-labeled radiographic images, which has previously been used to train an expert-level diagnostic AI model, but has otherwise not been curated or pre-processed for private training in any way. Furthermore, we analyze a dataset of abdominal 3D computed tomography (CT) images, where we classify the presence of a pancreatic ductal adenocarcinoma (PDAC). This mirrors the type of dataset available at clinical institutions. In this setting, we then study the extent of privacy-utility and privacy-fairness trade-offs in training advanced computer vision architectures.

To the best of our knowledge, our study is the first work to investigate the use of differential privacy in the training of complex diagnostic AI models on a real-world dataset of this magnitude (nearly 200,000 samples) and a 3D classification task, and to include an extensive evaluation of privacy-utility and privacy-fairness trade-offs.

Our results are of interest to medical practitioners, deep learning experts in the medical field, and regulatory bodies such as legislative institutions, institutional review boards and data protection officers. We therefore took specific care to formulate our main lines of investigation along the important axes delineated above, namely the provision of objective metrics of diagnostic accuracy, privacy protection and demographic fairness towards diverse patient subgroups.

Our main contributions can be summarized as follows: (1) We study the diagnostic accuracy ramifications of differentially private deep learning on two curated databases of medically relevant use-cases. We reach 97% of the non-private AUROC on the UKA-CXR dataset through the utilization of transfer learning on public datasets and a careful choice of architecture. On the PDAC dataset, our private model at ε = 8.0 is not statistically significantly inferior to the non-private baseline. (2) We investigate the fairness implications of differentially private learning with respect to key demographic characteristics such as sex, age and co-morbidity. We find that, while differentially private learning has a mild effect on fairness, it does not introduce significant discrimination concerns based on subgroup representation compared to non-private training, especially at the intermediate privacy budgets typically used in large-scale applications.

Methods

Patient cohorts

We employed UKA-CXR35,36, a large cohort of chest radiographs. The dataset consists of N = 193,311 frontal CXR images of 45,016 patients, all manually labeled by radiologists. The available labels include pleural effusion, pneumonic infiltrates, and atelectasis (each separately for the right and left lung), as well as congestion and cardiomegaly. The labeling system for cardiomegaly comprised the five classes “normal,” “uncertain,” “borderline,” “enlarged,” and “massively enlarged.” For the remaining labels, the five classes “negative,” “uncertain,” “mild,” “moderate,” and “severe” were used. Data were split into N = 153,502 training and N = 39,809 test images using patient-wise stratification but otherwise completely random allocation35,36. There was no overlap between the training and test sets. Supplementary Table 1 shows the statistics of the dataset, which are further visualized in Supplementary Figs. 1 and 2.

In addition, we used an in-house dataset from Klinikum Rechts der Isar of 1625 abdominal CT scans from unique, consecutive patients, of which 867 suffered from pancreatic ductal adenocarcinoma (PDAC) (positive) and 758 formed a control group without a tumor (negative). We split the dataset into 975 training images and 325 images each for validation and testing. During splitting, we maintained the ratio of positive to negative samples in all subsets.
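A minimal sketch of such a stratified split, using scikit-learn with placeholder identifiers and labels standing in for the actual case list, is shown below; the random seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder IDs and binary tumor labels standing in for the 1625 CT cases.
scan_ids = np.arange(1625)
labels = np.array([1] * 867 + [0] * 758)

# Stratified split into 975 training and 325 validation / 325 test scans,
# preserving the positive-to-negative ratio in each subset.
train_ids, rest_ids, train_y, rest_y = train_test_split(
    scan_ids, labels, train_size=975, stratify=labels, random_state=0)
val_ids, test_ids, val_y, test_y = train_test_split(
    rest_ids, rest_y, train_size=325, stratify=rest_y, random_state=0)
```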

The experiments were performed in accordance with relevant national and international guidelines and regulations. Approval for the use of the UKA-CXR dataset in this retrospective study was granted by the Ethical Committee of the Medical Faculty of RWTH Aachen University (Reference No. EK 028/19). Analogously, for the PDAC dataset, the protocol was approved by the Ethics Committee of Klinikum Rechts der Isar (Protocol Number 180/17S). Neither institutional review board required informed consent from subjects and/or their legal guardian(s), as this was a retrospective study. The study was conducted in accordance with the Declaration of Helsinki.

Data pre-processing

We resized all images of the UKA-CXR dataset to (512 × 512) pixels. Afterward, a normalization scheme as described previously by Johnson et al.37 was applied: subtracting the lowest value in the image, dividing by the highest value in the shifted image, truncating values, and converting the result to unsigned 8-bit integers in the range [0, 255]. Finally, we performed histogram equalization by shifting pixel values towards 0 or towards 255 such that all pixel values from 0 through 255 have approximately equal frequencies37.
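The following minimal NumPy sketch illustrates this normalization and equalization pipeline; the function names and the final scaling to [0, 255] are illustrative assumptions rather than the exact implementation of ref. 37.

```python
import numpy as np

def normalize_to_uint8(img: np.ndarray) -> np.ndarray:
    """Shift by the image minimum, scale by the maximum of the shifted image,
    truncate, and convert to unsigned 8-bit integers in [0, 255]."""
    shifted = img.astype(np.float64) - img.min()
    scaled = shifted / max(shifted.max(), 1e-8) * 255.0
    return np.clip(scaled, 0, 255).astype(np.uint8)

def equalize_histogram(img: np.ndarray) -> np.ndarray:
    """Remap pixel values so that intensities 0..255 occur with roughly equal frequency."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist) / img.size            # cumulative distribution of pixel values
    lut = np.round(cdf * 255.0).astype(np.uint8) # equalizing lookup table
    return lut[img]
```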

We selected a binary classification paradigm for each label. The “negative” and “uncertain” classes (“normal” and “uncertain” for cardiomegaly) were treated as negative, while the “mild,” “moderate,” and “severe” classes (“borderline,” “enlarged,” and “massively enlarged” for cardiomegaly) were treated as positive.
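A minimal sketch of this binarization is given below; the set and function names are illustrative.

```python
# Map the five-level label scheme onto binary targets (names are illustrative).
NEGATIVE = {"negative", "uncertain", "normal"}   # "normal"/"uncertain" apply to cardiomegaly
POSITIVE = {"mild", "moderate", "severe",
            "borderline", "enlarged", "massively enlarged"}

def binarize(label: str) -> int:
    """Return 1 for findings treated as positive, 0 otherwise."""
    return 1 if label.lower() in POSITIVE else 0
```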

For the PDAC dataset, we clipped the voxel density values of all CT scans to an abdominal window from −150 to 250 Hounsfield units and resized the volumes to a shape of 224 × 224 × 128 voxels.
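A minimal pre-processing sketch for the CT volumes is shown below; the use of linear interpolation via scipy.ndimage.zoom for resizing is an assumption, as the original resampling method is not specified.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(volume: np.ndarray, target_shape=(224, 224, 128)) -> np.ndarray:
    """Clip to an abdominal window of [-150, 250] HU and resize to the target shape."""
    clipped = np.clip(volume.astype(np.float32), -150.0, 250.0)
    factors = [t / s for t, s in zip(target_shape, clipped.shape)]
    return zoom(clipped, factors, order=1)  # linear interpolation (assumed)
```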

Deep learning process

Network architecture

For both datasets, we employed the ResNet9 architecture introduced in ref. 38 as our classification architecture. For the UKA-CXR dataset, images were expanded to (512 × 512 × 3) for compatibility with the neural network architecture. The final linear layer reduces the (512 × 1) output feature vectors to the desired number of diseases to be predicted, i.e., 8. The sigmoid function was utilized to convert the output predictions to individual class probabilities. The full network contained a total of 4.9 million trainable parameters. For the PDAC dataset, we used the conversion proposed by Yang et al.39 to make the model applicable to 3D data, which in brief applies 2D-convolutional filters along the axial, coronal, and sagittal axes separately. Our ResNet9 network employs the modifications proposed by Klause et al.38 and by He et al.40. Batch normalization41 is incompatible with DP-SGD, as per-sample gradients are required and batch normalization inherently intermixes information of all images in one batch. Hence, we used group normalization42 layers with 32 groups instead, which is compatible with DP processing. For the CXR dataset, we pretrained the network on the MIMIC Chest X-ray JPG dataset v2.0.0 (MIMIC-CXR)43, consisting of N = 210,652 frontal images. All training hyperparameters were selected empirically based on validation accuracy; no systematic or automated hyperparameter tuning was conducted.
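As an illustration of this substitution, the following PyTorch sketch shows a residual block with group normalization (32 groups) and the Mish activation used in our DP runs; it is a simplified stand-in for, not the exact definition of, the ResNet9 of ref. 38.

```python
import torch
import torch.nn as nn

class GNResidualBlock(nn.Module):
    """Residual block using GroupNorm (32 groups) instead of BatchNorm,
    so that per-sample gradients remain well-defined for DP-SGD.
    `channels` must be divisible by `groups`."""
    def __init__(self, channels: int, groups: int = 32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.GroupNorm(groups, channels),
            nn.Mish(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.GroupNorm(groups, channels),
        )
        self.act = nn.Mish()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.block(x))  # identity shortcut
```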

Non-DP training

For the UKA-CXR dataset, the Rectified Linear Unit (ReLU)44,45 was chosen as the activation function in all layers. We performed data augmentation during training by applying random rotation in the range of [−10, 10] degrees and medio-lateral flipping with a probability of 0.50. The model was optimized using the NAdam46 optimizer with a learning rate of 5 × 10−5. The binary weighted cross-entropy with inverted class frequencies of the training data was selected as the loss function. The training batch size was chosen to be 128. For the PDAC dataset, we used an unweighted binary cross-entropy loss as well as the NAdam optimizer with a learning rate of 2 × 10−4.
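As an illustration, the following PyTorch sketch constructs a binary cross-entropy loss weighted by inverted class frequencies; interpreting the weights as the per-label ratio of negatives to positives is an assumption about the exact weighting scheme.

```python
import torch
import torch.nn as nn

def weighted_bce_loss(train_labels: torch.Tensor) -> nn.BCEWithLogitsLoss:
    """Build a BCE loss whose per-label positive weight is the inverse class
    frequency (negatives / positives), estimated from training labels of shape (N, 8)."""
    pos = train_labels.sum(dim=0)               # positives per label
    neg = train_labels.shape[0] - pos            # negatives per label
    pos_weight = neg / pos.clamp(min=1.0)        # up-weight rare positive findings
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Usage sketch (placeholder names):
# criterion = weighted_bce_loss(y_train)
# optimizer = torch.optim.NAdam(model.parameters(), lr=5e-5)
```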

DP training

For UKA-CXR, we chose Mish47 as the activation function in all layers. No data augmentation was performed during DP training, as we found further augmentation to be harmful to accuracy. All models were optimized using the NAdam46 optimizer with a learning rate of 5 × 10−4. The binary weighted cross-entropy with inverted class frequencies of the training data was selected as the loss function. The maximum allowed gradient norm (see Fig. 1) was chosen to be 1.5, and the network was trained for 150 epochs for each chosen privacy budget. Each data point was sampled into a batch with a probability of 8 × 10−4 (128 divided by N = 153,502). For the PDAC dataset, we chose a clipping norm of 1.0, δ = 0.001 and a sampling rate of 0.31 (512/1625). In both cases, the noise multiplier was calculated such that, for the given number of training steps, sampling rate, and maximum gradient norm, the privacy budget was reached on the last training step. For the UKA-CXR dataset, the indicated privacy guarantees are “per record,” since some patients have more than one image, while for the PDAC dataset, they are “per individual.”
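The calibration of the noise multiplier can be sketched as a binary search against a privacy accountant, as shown below; epsilon_for is a placeholder for the accountant of whichever DP library is in use, not a specific library call.

```python
def calibrate_noise_multiplier(target_eps, delta, sample_rate, steps,
                               epsilon_for, lo=0.1, hi=64.0, tol=1e-3):
    """Binary-search the smallest noise multiplier whose accounted epsilon after
    `steps` sampled-Gaussian iterations does not exceed `target_eps`.
    `epsilon_for(noise_multiplier, sample_rate, steps, delta)` is a placeholder
    for the privacy accountant of the DP library in use."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if epsilon_for(mid, sample_rate, steps, delta) > target_eps:
            lo = mid        # too little noise: accounted epsilon too large
        else:
            hi = mid        # enough noise: try to shrink the multiplier
    return hi

# Usage sketch (placeholder accountant):
# sigma = calibrate_noise_multiplier(8.0, 6e-6, 8e-4, total_steps, epsilon_for=my_accountant)
```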

Fig. 1: Differences between the private and non-private training process of a neural network.
figure 1

a Images from a dataset are fed to a neural network and predictions are made. b From the predictions and the ground truth labels, the gradient is calculated via backpropagation. (c, upper panel) In normal training, all gradients are averaged and an update step is performed. (c, lower panel) In private training, each per-sample gradient is clipped to a predetermined 2-norm, the clipped gradients are averaged, and noise proportional to the clipping norm is added. This ensures that the information contributed by each sample is upper-bounded and perturbed with sufficient noise.

Quantitative evaluation and statistical analysis

The area under the receiver operating characteristic curve (AUROC) was utilized as the primary evaluation metric. We report the average AUROC over all labels for each experiment. The individual AUROC values as well as all other evaluation metrics for individual labels are reported in the supplementary information (Supplementary Tables 2–8). For the UKA-CXR test set, we used bootstrapping with 1000 redraws for each measure to determine the statistical spread48. For calculating sensitivity, specificity, and accuracy, a threshold was chosen according to Youden’s criterion49, i.e., the threshold that maximized (true positive rate − false positive rate).
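A minimal sketch of the threshold selection and bootstrapping procedure, using scikit-learn, is given below; function names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def youden_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Threshold maximizing Youden's index (true positive rate - false positive rate)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return float(thresholds[np.argmax(tpr - fpr)])

def bootstrap_auroc(y_true: np.ndarray, y_score: np.ndarray, n_redraws: int = 1000, seed: int = 0):
    """Estimate the AUROC spread by bootstrapping the test set with replacement."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_redraws):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # skip degenerate resamples with one class
            continue
        scores.append(roc_auc_score(y_true[idx], y_score[idx]))
    return float(np.mean(scores)), float(np.std(scores))
```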

To evaluate the correlation between the results of data subsets and their sample size, Pearson’s r coefficient was used. To analyze fairness between subgroups, the statistical parity difference50 was used, which is defined as

$$P(\hat{Y}=1\mid C=\text{Minority})-P(\hat{Y}=1\mid C=\text{Majority})$$
(2)

where \(\hat{Y}=1\) represents a correct model prediction and C is the group in question. Intuitively, it is the difference in classification accuracy between the minority and majority group and is thus optimally zero. Values larger than zero indicate a benefit for the minority group, while values smaller than zero indicate that the minority group is discriminated against.
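A minimal sketch of this computation is given below; array and function names are illustrative.

```python
import numpy as np

def statistical_parity_difference(correct: np.ndarray, group: np.ndarray,
                                  minority_label, majority_label) -> float:
    """P(correct | minority) - P(correct | majority); 0 is optimal,
    negative values indicate the minority group is disadvantaged.
    `correct` holds 0/1 correctness indicators, `group` the subgroup of each sample."""
    p_minority = correct[group == minority_label].mean()
    p_majority = correct[group == majority_label].mean()
    return float(p_minority - p_majority)
```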

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

High classification accuracy is attainable despite stringent privacy guarantees

Table 1 shows an overview of our results for all subgroups. Supplementary Tables 2–8 show the per-diagnosis evaluation results for non-DP and DP training at different ε values. On the UKA-CXR dataset, our non-private model achieves an AUROC of 89.71% over all diagnoses. It performs best on pneumonic infiltration on the right (AUROC = 94%) while struggling the most to accurately classify cardiomegaly (AUROC = 84%). Training with DP decreases all results slightly yet significantly (Hanley & McNeil test p-value < 0.001, 1000 bootstrapping redraws) and achieves an overall AUROC of 87.36%. The per-diagnosis performance ranges from 92% (pleural effusion right) to 81% AUROC (congestion). We next consider classification performance at a very strong level of privacy protection (i.e., at ε < 1). Here, at an ε-budget of only 0.29, our model achieves an average AUROC of 83.13% over all diagnoses. A visual overview is displayed in Fig. 2, which shows the average AUROC, accuracy, sensitivity, and specificity values over all labels.

Table 1 Summary of dataset statistics and results
Fig. 2: Average results of training with differential privacy (DP) with different ϵ values for δ = 6 × 10−6.
figure 2

The curves show the average (a) area under the receiver operating characteristic curve (AUROC), (b) accuracy, (c) specificity, and (d) sensitivity values over all labels, including cardiomegaly, congestion, pleural effusion right, pleural effusion left, pneumonic infiltration right, pneumonic infiltration left, atelectasis right, and atelectasis left, tested on N = 39,809 test images. The training dataset includes N = 153,502 images. Note that the AUROC is monotonically increasing, while sensitivity, specificity and accuracy exhibit more variation. This is because all training processes were optimized for the AUROC. Dashed lines correspond to the non-private training results. Source data are provided as a Source Data file.

On the PDAC dataset, we found that, while non-private training achieved almost perfect results on the test set, the loss in utility for private training at ε = 8 is statistically non-significant (Hanley & McNeil test p-value: 0.34, 3 independent experiments) compared to non-private training. Again, with lower privacy budgets, model utility decreases, but even at a very low privacy budget of ε = 1.06, we observe an average AUROC score of 95.58%.

Moreover, for UKA-CXR, the use of pre-training helps to boost model performance, reduces the amount of additional information the model needs to learn “from scratch,” and consequently reduces the privacy budget required (refer to Supplementary Fig. 3). This appears to primarily benefit the under-represented groups in the dataset. Conversely, non-private training, whether initialized with pre-trained weights or trained from scratch, tends to yield comparable diagnostic results, as the network trained from scratch can leverage a greater amount of information. These findings are in line with the observations on the PDAC dataset (where no pretrained weights were available), namely that, at low privacy budgets, specific patient groups suffer greater discrimination.

To assess the generalizability of our findings, we replicated the experiments using three other network architectures. All three models displayed a trend consistent with the utility penalties we observed for ResNet9 in both DP and non-DP training (see Supplementary Fig. 4). For further details, we refer to the supplementary information.

Diagnostic accuracy is correlated with patient age and sample size for both private and non-private models

Fig. 3 shows, for each diagnosis on the UKA-CXR dataset, the difference in classification performance between the non-private model and its private counterpart in relation to the sample size (that is, the number of available samples with a given label) within our dataset. At ε = 7.89, the largest AUROC difference between the non-private and privacy-preserving model was observed for congestion (3.82%) and the smallest for pleural effusion right (1.55%, see Fig. 3). Of note, there is a visible trend (Pearson’s r: 0.44) whereby classes in which the model exhibits good diagnostic performance in the non-private setting also suffer the smallest drop in the private setting. On the other hand, classes that are already difficult to predict in the non-private case deteriorate the most in terms of classification performance with DP (see Supplementary Fig. 9). Both non-private (Pearson’s r: 0.57) and private (Pearson’s r: 0.52) diagnostic AUROC exhibit a weak correlation with the number of samples available for each class (see Supplementary Fig. 9). However, the drop in AUROC between private and non-private training is not correlated with the sample size (Pearson’s r: 0.06). On the PDAC dataset, patients with a tumor are overrepresented and, in the non-private case, diagnosed more accurately. Not surprisingly, the classification performance is thus also higher for private training, except at the most restrictive privacy budget (see Supplementary Figs. 5–8).

Fig. 3: Evaluation results of training with differential privacy (DP) and without DP with different ϵ values for δ = 6 × 10−6.
figure 3

The results show the individual area under the receiver operating characteristic curve (AUROC) values for (a) cardiomegaly, (b) congestion, (c) pleural effusion right, (d) pleural effusion left, (e) pneumonic infiltration right, (f) pneumonic infiltration left, (g) atelectasis right, and (h) atelectasis left, tested on N = 39,809 test images. The training dataset includes N = 153,502 images. Dashed lines correspond to the non-private training results. Source data are provided as a Source Data file.

Furthermore, we evaluated our models by age range and patient sex (Table 1 and Figs. 4 and 5). Additionally, we calculated the statistical parity difference for these groups to obtain a measure of fairness (Table 1). On the UKA-CXR dataset, all models performed best on patients younger than 30 years of age. It appears that the older the patients are, the greater the difficulty for the models to predict the labels accurately. Statistical parity difference scores are slightly negative for the age groups between 70 and 80 years and older than 80 years for all models, indicating that the models discriminate slightly against these groups. In addition, while for these age groups the discrimination does not change with the privacy level, younger patients become more privileged as privacy increases. This finding indicates that, for the models which are most protective of data privacy, young patients benefit the most, despite the group of younger patients being smaller overall. Regarding patient sex, models show slightly better performance for female patients and slightly discriminate against male patients (Table 1). Statistical parity does not appear to correlate (Pearson’s r: 0.13) with privacy levels.

Fig. 4: Average results of training with differential privacy (DP) with different ϵ values for δ = 6 × 10−6, separately for samples of different age groups including [0, 30), [30, 60), [60, 70), [70, 80), and [80, 100) years.
figure 4

The curves show the average (a) area under the receiver operating characteristic curve (AUROC), (b) accuracy, (c) specificity, and (d) sensitivity values over all labels, including cardiomegaly, congestion, pleural effusion right, pleural effusion left, pneumonic infiltration right, pneumonic infiltration left, atelectasis right, and atelectasis left, tested on N = 39,809 test images. The training dataset includes N = 153,502 images. Dashed lines in corresponding colors correspond to the non-private training results. Source data are provided as a Source Data file.

Fig. 5: Average results of training with differential privacy (DP) with different ϵ values for δ = 6 × 10−6, separately for female and male samples.
figure 5

The curves show the average (a) area under the receiver operating characteristic curve (AUROC), (b) accuracy, (c) specificity, and (d) sensitivity values over all labels, including cardiomegaly, congestion, pleural effusion right, pleural effusion left, pneumonic infiltration right, pneumonic infiltration left, atelectasis right, and atelectasis left, tested on N = 39,809 test images. The training dataset includes N = 153,502 images. Note that the AUROC is monotonically increasing, while sensitivity, specificity and accuracy exhibit more variation. This is because all training processes were optimized for the AUROC. Dashed lines correspond to the non-private training results, depicted as upper bounds. Source data are provided as a Source Data file.

On the PDAC dataset, we observed that, for all levels of privacy including non-private training, classification performance was worse for female patients compared to male patients, who are over-represented in the dataset. However, there is no observable trend between the privacy level and the parity difference. When analyzing results of subgroups separated by patient age, we observed, similarly to UKA-CXR, that in all settings statistical parity differences are on average better for younger patients compared to older ones. Just as in the UKA-CXR dataset, we found that the more restrictive the privacy budget, the stronger the privilege enjoyed by younger patients. We furthermore observed that the control group (i.e., no tumor) has an over-representation of both male and young patients, who consequently both exhibit better performance compared to the rest of the cohort. Conversely, female patients, as well as older patients, have a higher chance of misclassification and are more abundant in the tumor group.

Discussion

The main contribution of our paper is to analyze the impact of strong objective guarantees of privacy on the fairness enjoyed by specific patient subgroups in the context of AI model training on real-world medical datasets.

Across all levels of privacy protection, training with DP still yielded models exhibiting AUROC scores of 83% at the highest privacy level and 87% at ε = 7.89 on the UKA-CXR dataset. The fact that the model maintained a relatively high AUROC even at ε = 0.29 is remarkable, and we are unaware of any prior work reporting such a strong level of privacy protection at this level of model accuracy on clinical data. Our results thus exemplify that, through careful choice of architecture and best practices for the training of DP models, the use of model pretraining on a related public dataset, and the availability of sufficient data samples, privately trained models require only very small additional amounts of private information from the training dataset to achieve high diagnostic accuracy on the tasks at hand.

For the PDAC dataset, even though private models at ε = 8.0 are not significantly inferior to their non-private counterparts, the effect of the smaller number of training samples is observable at more restrictive privacy budgets. Especially at ε ≤ 1.06, private training noticeably increases the discrimination against patients in certain age groups. This underscores the need for larger training datasets, which the objective privacy guarantees of DP can enable by incentivizing data sharing.

Our analysis of the per-diagnosis performance of models trained with and without privacy guarantees shows that, in both private and non-private training, models perform worse on diagnoses that are underrepresented in the training set. This finding is not unusual, and several examples can be found in ref. 51. However, the drop in performance between private and non-private training is uncorrelated with the sample size. Instead, the difficulty of the diagnosis seems to drive the difference in AUROC between the two settings. Concretely, diagnostic performance under privacy constraints suffers the most for those classes which already have the lowest AUROC in the non-private setting. Conversely, diagnoses that are predicted with the highest AUROC suffer the least when DP is introduced.

Previous works investigating the effect of DP on fairness show that privacy preservation amplifies discrimination33. In our study, this effect is limited to very low privacy budgets. Our models remain fair even at the levels of privacy protection typically used for training state-of-the-art models in the current literature25, likely due to our real-life datasets’ large size and/or high quality.

The effects we observed are not limited to within-domain models. Indeed, in a concurrent work, we investigated the effects of DP training on the domain generalizability of diagnostic medical AI models52. Our findings revealed that even under extreme privacy conditions, DP-trained models show comparable performance to non-DP models in external domains.

Our analysis of fairness related to patient age showed that older patients are discriminated against in both the non-private and private settings. On UKA-CXR, age-related discrimination remains approximately constant with stronger privacy guarantees. On the other hand, young patients enjoy overall lower model discrimination in both the non-private and the private setting. Interestingly, young patients seem to profit more from stronger privacy guarantees, as they enjoy progressively more fairness privilege with increasing levels of privacy protection. This holds despite the fact that patients under 30 represent the smallest fraction of the UKA-CXR dataset. The privilege of young patients is most likely due to a confounding variable, namely the lower complexity of imaging findings in younger patients owing to their improved ability to cooperate during radiograph acquisition, resulting in better discrimination of the pathological finding against a more homogeneous background (i.e., “cleaner” radiographs), which are easier to diagnose overall35,53 (see Fig. 6). This hypothesis should be validated in cohorts with a larger proportion of young patients, and we intend to expand on this finding in future work. On the PDAC dataset, classification accuracy remains approximately on par between age subgroups except at very restrictive privacy budgets, where older patients begin to suffer discrimination, likely due to the aforementioned imbalance between control and tumor cases and the overall smaller dataset coupled with a lack of pre-training. The analysis of model fairness related to patient sex on UKA-CXR shows that female patients (who, similar to young patients, are an underrepresented group) enjoy slightly higher diagnostic accuracy than male patients at almost all privacy levels, and vice versa on the PDAC dataset. However, the effect sizes were found to be small, so this finding could also be explained by variability between models or by randomness in the training process. Further investigation is thus required to elucidate these effects.

Fig. 6: Illustrative radiographs from the UKA-CXR dataset. All examinations share the diagnosis of pneumonic infiltrates on the right patient side (=left image side).
figure 6

Diagnosis in older patients is often more challenging due to the more frequent presence of comorbidities and less cooperation during image acquisition, which results in lower image quality. a 76-year-old male patient; note the presence of a cardiac pacemaker that projects over part of the left lung. b 74-year-old male patient with challenging image acquisition: part of the lower right lung is not properly depicted. c 39-year-old male patient; the lungs are well inflated and pneumonic infiltrates can be discerned even though they are less severe. d 33-year-old male patient with challenging image acquisition, yet both lungs can be assessed (almost) completely.

Furthermore, there is no consensus on which fairness measure is preferable. In our study, we focused on the statistical parity difference; however, other works propose alternative measures. One which has recently received attention is the underdiagnosis rate of subgroups54. We evaluated this measure for the PDAC dataset and found that it shows, in principle, the same trends as the statistical parity difference (see Supplementary Tables 9 and 10).

In conclusion, we analyzed the use of privacy-preserving neural network training and its implications for utility and fairness on a relevant diagnostic task using a large real-world dataset. We showed that the utilization of specialized architectures and targeted model pre-training allows for high model accuracy despite stringent privacy guarantees. This enables us to train expert-level diagnostic AI models even with privacy budgets as low as ε < 1, which, to our knowledge, has not been shown before, and represents an important step towards the widespread utilization of differentially private models in radiological diagnostic AI applications. Moreover, our finding that the introduction of differential privacy mechanisms to model training does not, in most cases, amplify unfair model bias regarding patient age, sex or comorbidity signifies that, at least in our use case, the resulting models abide by important non-discrimination principles of ethical AI. We are hopeful that our findings will encourage practitioners and clinicians to introduce advanced privacy-preserving techniques such as differential privacy when training diagnostic AI models.