Generative models improve fairness of medical classifiers under distribution shifts

Ktena, Ira; Wiles, Olivia; Albuquerque, Isabela; Rebuffi, Sylvestre-Alvise; Tanno, Ryutaro; Roy, Abhijit Guha; Azizi, Shekoofeh; Belgrave, Danielle; Kohli, Pushmeet; Cemgil, Taylan; Karthikesalingam, Alan; Gowal, Sven

doi:10.1038/s41591-024-02838-6

Published: 10 April 2024

Nature Medicine volume 30, pages 1166–1173 (2024)

Abstract

Domain generalization is a ubiquitous challenge for machine learning in healthcare. Model performance in real-world conditions might be lower than expected because of discrepancies between the data encountered during deployment and development. Underrepresentation of some groups or conditions during model development is a common cause of this phenomenon. This challenge is often not readily addressed by targeted data acquisition and ‘labeling’ by expert clinicians, which can be prohibitively expensive or practically impossible because of the rarity of conditions or the available clinical expertise. We hypothesize that advances in generative artificial intelligence can help mitigate this unmet need in a steerable fashion, enriching our training dataset with synthetic examples that address shortfalls of underrepresented conditions or subgroups. We show that diffusion models can automatically learn realistic augmentations from data in a label-efficient manner. We demonstrate that learned augmentations make models more robust and statistically fair in-distribution and out of distribution. To evaluate the generality of our approach, we studied three distinct medical imaging contexts of varying difficulty: (1) histopathology, (2) chest X-ray and (3) dermatology images. Complementing real samples with synthetic ones improved the robustness of models in all three medical tasks and increased fairness by improving the accuracy of clinical diagnosis within underrepresented groups, especially out of distribution.

Main

The advent of machine learning (ML) in healthcare promises advances in care in a wide range of applications^1,2,3. Artificial intelligence (AI) dermatological tools (for example, refs. ^1,4) have the potential to allow patients to assess their conditions better and improve diagnostic accuracy⁵. Similarly, ML technologies have unlocked new capabilities in computational pathology that have the ability to handle the gigantic quantity of data created throughout the patient care lifecycle and improve classification, prediction and prognostication of diseases^5,6. These solutions are often motivated by the global shortage of expert clinicians, for example, in the case of radiologists⁷, and demonstrate that ML models can facilitate the detection of conditions⁸. Despite these rapid methodological developments and the promise of transformative impact⁹, few of these approaches (if any) have yet achieved widespread adoption and scaled impact on clinical outcomes¹⁰. One major barrier to adoption is the brittle degradation in performance of medical ML systems caused by ‘out-of-distribution’ data: discrepancies between the populations, diseases, acquisition technologies or environments used to train medical ML systems and those encountered during deployment. As ref. ¹¹ highlighted, only 24% of published studies evaluate the performance of their proposed algorithms on external cohorts or compare this out-of-distribution performance with that of clinical experts. Many studies do not validate the efficacy of algorithms in multiple settings; the ones that do often perform poorly when introduced to new environments not represented in the training data.

In addition to this challenge of out-of-distribution generalization, underrepresentation of specific groups, conditions or hospitals also causes notable challenges of fairness and equity even when systems are deployed in datasets mirroring their training environment, with lower performance typically seen in rarer groups, conditions, individuals or their intersections. Previous work showed that a developed model may perform unexpectedly poorly on underrepresented populations or population subgroups in radiology^12,13, histopathology¹⁴ and dermatology¹⁵. However, the issues of robustness to distribution shifts and statistical fairness have rarely been tackled together. Building a method that is robust across populations and subgroups, such that model performance does not degrade and benefits can be transferred when applied across groups, is a nontrivial task. This is because of data scarcity¹⁶, challenges in the acquisition strategies of evaluation datasets (for example, different imaging or screening protocols^10,17,18) and the limitations of evaluation metrics¹⁰.

In this work, we leveraged diffusion models^19,20 and potentially available unlabeled data to capture the underlying data distribution and augment real samples when training diagnostic models across three modalities: histopathology, chest radiology and dermatology. We showed that combining synthetic and real data can lead to significant improvements in diagnostic accuracy, while closing the fairness gap with respect to different attributes under distribution shifts. While we do not propose this approach as a replacement for high-quality and representative data collection strategies, we posit that, in the absence of additional resources, it allows practitioners to make the most of their available labeled and unlabeled data to close potentially harmful gaps in diagnostic accuracy between overrepresented and underrepresented populations without penalizing the former. Finally, we showed that diffusion models can generate high-quality images (Fig. 1a) across modalities and performed an in-depth analysis to shed light on the mechanisms that improve the generalization capabilities of the downstream classifiers (Methods, ‘In-depth analysis for dermatology’). This capability was further validated by an evaluation of synthetic images by expert dermatologists, yielding diagnostic accuracy comparable to that obtained when diagnosing real images.

**Fig. 1: Generated samples and method overview.**

Results

Overview of the proposed approach and experimental setting

Our proposed approach, illustrated in Fig. 1b, leverages diffusion models for learning augmentations of the data to improve the robustness and fairness of medical ML models. We viewed learned augmentations as a means of enriching our training dataset with the goal of making it more diverse in a steerable and configurable way. Our approach consisted of three main steps: (1) we trained a generative model given the available labeled and unlabeled data; we assumed that labeled data were available only for a single source domain (for example, a particular hospital with a specific scanner or imaging protocol), while additional unlabeled data could be from any domain (in-distribution or out of distribution (OOD), for example, data from multiple hospitals, a subset of which was not labeled by experts because of limited resources). We either conditioned the generative model only on the diagnostic label or on both the diagnostic label and a property (for example, hospital ID or sensitive attribute label). We borrowed the term ‘sensitive attribute’ from the fairness literature to describe demographic attributes (for example, sex, ethnicity or age) with respect to which we wanted the model to be fair. All of the data used in this research were de-identified before the authors gained access to them. Conditioning the model on either or both of these attributes allowed us to configure the synthetic examples that we wanted to use to enrich our training set. If high-resolution images were required (more than 96 × 96 resolution), we further trained an upsampling diffusion model in a similar manner. It is worth highlighting that both the low-resolution generative model and the upsampler were trained with the same conditioning vector (that is, either with label or label and property conditioning); (2) we sampled from the generative model according to a sampling strategy. In our experiments, we assumed that uniform representation of different values of an attribute constitutes a fair strategy, for example, for each condition it was equally likely to observe an image of a male and a female individual, or an image from a particular hospital. To do this, we sampled uniformly from the attribute distribution and kept the original diagnostic label distribution to preserve the original disease prevalence. Sampling multiple times from the generative model allowed us to obtain different augmentations for a given condition (and property), consequently increasing the diversity of training samples for the downstream classifier; (3) we enriched our original training dataset from the source domain with the synthetic images sampled from the generative model and trained a diagnostic model (potentially for multiple labels, if more than one condition is present at once). We provide the exact details of the experimental setting for each modality in the Methods (‘Experimental setting for each modality’).
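The sampling strategy in step (2) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `build_sampling_plan`, its arguments and the example label and attribute values are assumptions. It draws the diagnostic label from the empirical label distribution of the training set (preserving prevalence) and the attribute uniformly, producing the conditioning pairs that would be passed to a conditional generative model.

```python
import random
from collections import Counter


def build_sampling_plan(train_labels, attribute_values, num_samples, seed=0):
    """Return (label, attribute) pairs to condition a generative model on.

    Hypothetical helper: labels follow the empirical training distribution,
    attributes are drawn uniformly, as described in the text.
    """
    rng = random.Random(seed)
    label_counts = Counter(train_labels)
    labels = sorted(label_counts)
    weights = [label_counts[label] for label in labels]

    plan = []
    for _ in range(num_samples):
        # Preserve disease prevalence: sample the label in proportion to its count.
        label = rng.choices(labels, weights=weights, k=1)[0]
        # Fair strategy assumed in the text: every attribute value is equally likely.
        attribute = rng.choice(attribute_values)
        plan.append((label, attribute))
    return plan
```

Each pair in the returned plan would then be used as the conditioning input for one synthetic image; repeated pairs yield different augmentations because the generative model is stochastic.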

Experimental protocol

We evaluated this approach using denoising diffusion probabilistic models (DDPMs) on different medical contexts and tracked diagnostic performance (for example, top-1 accuracy) and fairness in-distribution and OOD. We considered in-distribution datasets as consisting of images from the same demographic and disease distribution and acquired with the same imaging protocol as the training set. Out-of-distribution datasets may differ from the training set in any or all of those dimensions. Evaluation of the out-of-distribution datasets is equivalent to developing an ML model on a certain population (for example, from a particular hospital or geographical location) and testing its performance on a population from an unseen hospital or acquired under new conditions. Across all settings, the diagnostic and diffusion models were trained with the same labeled data. We provide more details about this and a summary of the setting used for each modality in the Methods (‘Overview of methodology’).

Evaluation metrics

To measure the performance of the different baselines and the proposed method, we used two sets of metrics: one set was more focused on diagnostic accuracy (that is, top-1 accuracy for histopathology, receiver operating characteristic (ROC)-area under the curve (AUC) for radiology and high-risk sensitivity for dermatology), while the second set was more geared toward fairness (see summary in Table 1). The performance metrics varied depending on the classification task performed for each modality (that is, binary versus multiclass versus multilabel) and considered label imbalance. High-risk sensitivity captured the true positive rate for the high-risk conditions and was deemed the most relevant for the diagnostic tool by expert dermatologists. For fairness, we looked at the performance gap (depending on the metric of interest) in the binary attribute setting and the difference between the worst and best subgroup performance for categorical attributes, for example, hospital ID and ethnicity. For continuous sensitive attributes, like age, we discretized them into appropriate buckets (Methods and Extended Data Table 1).
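The fairness metrics described above can be sketched in a few lines. This is an illustrative sketch under assumptions: accuracy stands in for whichever performance metric is of interest, the helper names are hypothetical, and the age bucket edges are placeholders (the actual buckets are given in Extended Data Table 1). For a binary attribute, the worst-to-best difference reduces to the performance gap between the two groups.

```python
from collections import defaultdict


def subgroup_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each subgroup value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def fairness_gap(per_group_metric):
    """Difference between the best and worst subgroup performance."""
    values = list(per_group_metric.values())
    return max(values) - min(values)


def bucket_age(age, edges=(30, 60)):
    """Discretize a continuous attribute; the edges here are hypothetical."""
    for index, edge in enumerate(edges):
        if age < edge:
            return index
    return len(edges)
```

A smaller gap means more uniform performance across subgroups; it says nothing on its own about whether overall performance is high, which is why the paper reports both sets of metrics.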

Clinical tasks and datasets

Histopathology

The first setting we considered is histopathology. Variation in staining procedures in different hospitals leads to distribution shifts that can challenge an ML model that has only encountered images from a particular hospital. The cancer metastases in lymph nodes challenge (CAMELYON17) by Bandi et al.²¹ aims to improve generalization capabilities of automated solutions and reduce the workload on pathologists who have to manually label those cases. The corresponding dataset contains images from five different hospitals and the task was to predict whether the histological lymph node sections captured by the images contain cancerous cells, indicating breast cancer metastases (as posed by the WILDS challenge²²). Two of the hospital datasets provided by the challenge were held out for out-of-distribution evaluation and three were considered in-distribution datasets because of similar staining procedures. We considered this as the simplest setting for our experiments because there was no extreme disease prevalence or demographic shifts. The labeled dataset contained 455,954 patches, while the unlabeled dataset contained 1.8 million patches from the three training hospitals; full statistics are given in Methods and Extended Data Table 1a. The unlabeled dataset contained the hospital identifier but not the diagnostic label.

To understand the impact of the number of labeled examples on fairness and overall performance, we created different variants of the labeled training set, where we varied the number of samples from two of the three training hospitals (3 and 4). The number of labeled examples from one hospital remained constant. We compared top-level classification accuracy and fairness gap, that is, the accuracy gap between the best and worst performing hospital across the in-distribution hospitals, to different baselines (more details about the baselines are provided in Methods (‘Baselines’)).
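Constructing such skewed variants amounts to capping the number of labeled samples from selected hospitals while leaving the remaining hospital untouched. The sketch below is one way to do this, not the authors' code: the function name, the record layout (dictionaries with a `hospital` key) and the ID `0` used for the third training hospital in the usage note are assumptions.

```python
import random


def cap_samples_per_group(records, group_key, capped_groups, max_per_group, seed=0):
    """Subsample records so each capped group keeps at most max_per_group items.

    Groups not listed in capped_groups are kept in full.
    """
    rng = random.Random(seed)
    by_group = {}
    for record in records:
        by_group.setdefault(record[group_key], []).append(record)

    subset = []
    for group, items in by_group.items():
        if group in capped_groups and len(items) > max_per_group:
            # Random subsample without replacement for the capped hospitals.
            items = rng.sample(items, max_per_group)
        subset.extend(items)
    return subset
```

For example, `cap_samples_per_group(records, "hospital", {3, 4}, 100)` would mimic the more skewed setting and `max_per_group=1000` the less skewed one, assuming each record carries its hospital ID.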

We found that using synthetic data outperformed both in-distribution baselines in the less skewed (with 1,000 labeled samples from hospitals 3 and 4) and more skewed setting (with only 100 labeled samples) while closing the fairness gap between hospitals. We obtained the best accuracy OOD when using all in-distribution labeled examples as shown in Fig. 2b (in the OOD setting, there were one validation and one test hospital, so we do not report a performance gap). We found that performing color augmentation on top of the generated samples generalized best overall, leading to a 48.5% relative improvement over the baseline model and 3.2% over the model trained with color augmentations on the test hospital, while reducing the performance gap between in-domain hospitals by 20 absolute percentage points.
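The paragraph above mixes two kinds of numbers: relative improvements (a percentage of the baseline value) and reductions in absolute percentage points (a difference between two percentages). The helpers below are a small sketch of the distinction; their names are hypothetical and no values from the paper are reproduced.

```python
def relative_improvement(baseline, new):
    """Fractional change of `new` relative to `baseline` (0.25 means 25%)."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (new - baseline) / baseline


def absolute_point_reduction(gap_before, gap_after):
    """Reduction of a gap in percentage points; inputs are fractions in [0, 1]."""
    return 100.0 * (gap_before - gap_after)
```

With made-up inputs, an accuracy moving from 0.60 to 0.75 is a 25% relative improvement, while a fairness gap shrinking from 0.30 to 0.10 is a reduction of 20 absolute percentage points.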

**Fig. 2: Results on histopathology dataset.**

This validated that we can indeed use synthetic data to better model the data distribution and outperform variants using real data alone. We also observed that this method was most effective in a low-data regime (that is, the more skewed setting in Fig. 2a), while being able to recover performance that other approaches achieve with 100× more labeled samples, as shown in Extended Data Fig. 1a. This translates to more significant improvements in scenarios where we only have access to a few labeled examples from a particular hospital or population because of limited resources.

Chest radiology

The second setting we considered is radiology. We focused our analysis on two large public radiology datasets, CheXpert²³ and ChestX-ray14 (National Institutes of Health)²⁴. These datasets have been widely studied^8,12,13 for model development and fairness analyses. For these datasets, demographic attributes like sex and age are publicly available; classification was performed at a higher resolution, that is, 224 × 224 as in ref. ²⁵. After training the generative and diagnostic models on 201,055 examples of chest X-rays from the CheXpert dataset, we evaluated on a held-out CheXpert test set (containing 13,332 images), which we considered in-distribution, and the test set of ChestX-ray14 (containing 17,723 images), which we considered OOD because of demographic and acquisition shifts. We focused on five conditions for which labels existed in common between the two datasets, that is, atelectasis, consolidation, cardiomegaly, pleural effusion and pulmonary edema, while each of these datasets contained more conditions (not necessarily overlapping), as well as examples with no findings, corresponding to healthy controls. Note that the labeling procedures for the two datasets were defined and enacted separately, which probably increased the complexity of the task. In this setting, the model backbone was shared across all conditions, while a separate (binary classification) head was trained for each condition, given that multiple conditions can be present at once. We report ROC-AUC, in line with the CheXpert leaderboard.
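Because each condition has its own binary head, ROC-AUC is computed per condition and then averaged. The sketch below illustrates that evaluation in pure Python; it is not the authors' evaluation code, the function names are assumptions, and the pairwise AUC computation is quadratic in the number of examples, so a library routine would be preferable on datasets of the size described here.

```python
CONDITIONS = [
    "atelectasis",
    "consolidation",
    "cardiomegaly",
    "pleural effusion",
    "pulmonary edema",
]


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 0 or 1.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_auc(label_rows, score_rows, conditions=CONDITIONS):
    """Per-condition AUC and their unweighted mean for multilabel predictions.

    Each row holds one value per condition, in the order given by `conditions`.
    """
    per_condition = {}
    for j, name in enumerate(conditions):
        per_condition[name] = roc_auc(
            [row[j] for row in label_rows],
            [row[j] for row in score_rows],
        )
    return per_condition, sum(per_condition.values()) / len(per_condition)
```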

We observed that synthetic images improved the average AUC for the five conditions of interest in-distribution, but even more so OOD (Fig. 3a). Improvements were particularly striking for cardiomegaly, where the model trained purely with synthetic images improved the AUC by 21.1% (Fig. 3a). Overall, we observed a relative improvement of 5.2% on average AUC OOD and a 44.6% improvement in sex fairness gap. We also observed a 31.7% decrease in race fairness gap in-distribution (Fig. 3b). We show some examples of synthetic images for a model conditioned on the diagnostic label in Extended Data Fig. 2c,d.

**Fig. 3: Results on chest radiology datasets.**

Dermatology

For the dermatology setting, we considered a dermatology dataset of images grouped into 27 labeled conditions ranging from low risk (for example, acne, verruca vulgaris) to high risk (for example, melanoma). Out of these conditions, three were considered to be high risk: basal cell carcinoma; melanoma; and squamous cell carcinoma (SCC) and squamous cell carcinoma in situ (SCCIS). For the purposes of our experiments, we considered three datasets: the in-distribution dataset featuring 16,530 cases from a teledermatology dataset acquired from a population in the United States (Hawaii and California); the OOD 1 dataset featuring 6,639 images of clinical type focusing mostly on high-risk conditions from an Australian population; and OOD 2 featuring 3,900 teledermatology images acquired in Colombia. To train the downstream classifier, we used labeled samples from only one of these datasets (in-distribution), while we included unlabeled images from the other two distributions when training the diffusion model. We evaluated on a held-out slice of the in-distribution dataset and two OOD sets to investigate how well models generalized. We present results for the OOD 2 dataset in Supplementary Information, Additional results for dermatology, because it has similar label distribution to the in-distribution dataset and is less challenging.
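High-risk sensitivity, the main dermatology metric, is the true positive rate on the high-risk conditions listed above. The text does not specify how the rate is aggregated across the three conditions, so the sketch below assumes an unweighted average of per-condition recall; the function name and the label string `"SCC/SCCIS"` for the combined squamous cell carcinoma group are also assumptions.

```python
# Label strings are illustrative; "SCC/SCCIS" stands for the combined group.
HIGH_RISK = {"basal cell carcinoma", "melanoma", "SCC/SCCIS"}


def high_risk_sensitivity(y_true, y_pred, high_risk=HIGH_RISK):
    """Mean recall over high-risk conditions (aggregation is an assumption)."""
    recalls = []
    for condition in sorted(high_risk):
        n_cases = sum(1 for truth in y_true if truth == condition)
        if n_cases == 0:
            # Skip conditions that do not occur in this evaluation set.
            continue
        n_hits = sum(
            1
            for truth, pred in zip(y_true, y_pred)
            if truth == condition and pred == condition
        )
        recalls.append(n_hits / n_cases)
    if not recalls:
        raise ValueError("no high-risk cases in the evaluation set")
    return sum(recalls) / len(recalls)
```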

We explored whether the proposed approach can be used to not only improve OOD accuracy but also fairness over the different label predictions and attributes for the in-distribution dataset. While the datasets were already imbalanced with respect to different labels and sensitive attributes, we also investigated how the performance varied as a dataset becomes more or less skewed along a single one of these axes. This allowed us to better understand to what extent conditioning generative models on the axis of interest can help alleviate biases with regard to the corresponding attribute.

In Fig. 4, we illustrate how different methods compare for a single axis of interest with regard to sensitivity for the three high-risk conditions mentioned above and fairness. In the more skewed setting, the training dataset contained a maximum of 100 samples from the underrepresented subgroup regardless of the underlying condition, while in the less skewed setting it contained a maximum of 1,000 samples. We compared all methods in the four different settings: in-distribution and OOD, as well as less and more skewed with respect to the sensitive attribute of interest, that is, sex. We observed that in all settings, combining heuristic augmentations improved the predictive performance across the board, but harmed fairness of the model. Using RandAugment alone was beneficial for high-risk sensitivity in-distribution but not OOD, and it harmed fairness in the OOD setting. Oversampling slightly closed the fairness gap across the board while improving performance, as expected. The approaches that leverage synthetic data, ‘Label conditioning’ and ‘Label and property conditioning’, improved on high-risk sensitivity in-distribution without reducing fairness, while they yielded a significant improvement in the OOD setting on both axes. In the more skewed setting, in particular, ‘Label and property conditioning’ led to 27.3% better high-risk sensitivity compared to the baseline in-distribution and a striking 63.5% OOD, while closing the fairness gap by 7.5× OOD. It is worth noting that the underrepresented group in the training set and the in-distribution evaluation set was overrepresented in the OOD evaluation set. Our approach showed improvements in accuracy and fairness metrics with respect to different sensitive attributes, while being able to generalize these improvements OOD as shown in Methods, ‘Additional results’. The strong overall performance and reduced fairness gap OOD indicate that the diagnostic model learned better generalizable features when leveraging synthetic data.

**Fig. 4: Results on dermatology datasets.**

Discussion

In this work, we propose using conditional diffusion models to improve the robustness and fairness of ML systems applied to medical imaging. More specifically, we show that diffusion models can produce useful synthetic images in three different medical settings of varying difficulty, complexity and resolution: histopathology, radiology and dermatology. Our experimental evaluation provides extensive evidence that synthetic images can indeed improve statistical fairness, balanced accuracy and high-risk sensitivity in a multiclass setting, while improving the robustness of models both in-distribution and OOD. In fact, we observe that generated data can be more beneficial OOD than in-distribution even in the absence of data from the target domain during training of the generative model (in the case of radiology). Generative models were label-efficient in both histopathology and dermatology settings, where we demonstrate that only a few labeled examples are sufficient for the diffusion models to capture the underlying data distribution well. This is particularly impactful in the medical setting, where data for particular conditions or demographic subgroups can be scarce or, even when available, acquiring expert labels can be expensive and time-consuming. For readers who are familiar with regularization techniques, we view diffusion models as another form of regularization, which can be combined with any other architecture or learning method improvements.

Even though we did not make any assumptions when training the diffusion model, we found interesting dynamics when combining real and synthetic data. In certain settings, that is, histopathology and radiology, we observed that we can rely purely on generated data and still outperform baselines trained with real labeled data (Methods, ‘Additional results’). In other settings, like dermatology, we observed that real data were more essential for training of the downstream discriminative model. We took this a step further and analyzed the impact of generated data and the mechanisms underlying the improvements in robustness and fairness that we report. In-depth analysis in one of the modalities indicated that synthetic samples from a diffusion model yield diverse (Fig. 5), realistic and canonical images deemed diagnosable by expert clinicians to a great extent (Methods, ‘In-depth analysis for dermatology’). Synthetic samples seem to better align distributions of different domains, while at the same time allowing models to learn more complex decision boundaries that reduce their reliance on spurious correlations. Finally, we highlight some practical benefits and discuss a number of potential risks and limitations from relying on generated data.

**Fig. 5: Generated images in the dermatology setting.**

First, synthetic data are reusable. Beyond the analysis and utility of synthetic data for the particular tasks that we considered in this work, there are many other potential applications for which they can be useful. The same synthetic data can be used for data augmentation across different models and, potentially, tasks. For example, handcrafted augmentations are often used to introduce invariances and learn better representations in a self-supervised manner for a variety of downstream tasks.

Furthermore, the proposed approach is scalable. As we demonstrate in the Supplementary Information, if we have a perfect generative model, then we can perform perfectly under the fair distribution. Moreover, the better the generative model, the more our results should improve. Thus, as generative modeling improves or as more data are available, results should improve accordingly.
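One way to write down the fair distribution implied by the sampling strategy (labels keep their original prevalence, attribute values are uniform) is sketched below. The notation is ours and is an assumption, not the formulation from the Supplementary Information: x denotes an image, y a diagnostic label, a an attribute value from a finite set, p the real data distribution and p_θ the generative model.

```latex
% Fair target distribution: original label prevalence, uniform attribute.
p_{\mathrm{fair}}(x, y, a) \;=\; p(x \mid y, a)\, p(y)\, \frac{1}{|\mathcal{A}|}

% Approximation obtained by sampling images from the generative model.
\hat{p}_{\mathrm{fair}}(x, y, a) \;=\; p_{\theta}(x \mid y, a)\, p(y)\, \frac{1}{|\mathcal{A}|}

% If the generative model is perfect, the two coincide:
p_{\theta}(x \mid y, a) = p(x \mid y, a)
\;\Longrightarrow\;
\hat{p}_{\mathrm{fair}} = p_{\mathrm{fair}}
```

Under this reading, a classifier trained on samples from the approximation is trained on the fair distribution itself whenever the generative model matches the true conditional, and the mismatch shrinks as the generative model improves.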

Combining this technique with privacy-preserving technologies holds significant promise for the medical field. Principles of data governance, confidentiality, privacy and consent are vital in healthcare, but may be associated with relative limitations of data availability for the training of ML models in underrepresented groups. There is preliminary evidence that federated learning can be used to learn classification models from multiple institutions²⁶; if it were possible to generate private synthetic data, these synthetic data could be used for data augmentation along with a smaller, public dataset to improve performance. This could have practical benefits when sharing data, protecting personally identifiable information while still achieving high-quality performance. Such an approach would of course be associated with its own risks, some of which are discussed in ref. ²⁷.

Even though we showed that diffusion models can be particularly label-efficient, this should not encourage practitioners to abandon their data and label acquisition efforts; nor does it imply that generated data can replace real data under any circumstances. What this research demonstrates is that, when labeled data and resources are limited, there are ways to make more of the available labeled and unlabeled data. There is also the potential that using generative models may lead to overconfidence in an AI system because images look realistic to a nonexpert. Additional data collection will always be important, along with comprehensive analysis of the underlying data and their caveats. Synthetic data from a generative model should only be used as a complement to additional data collection and accompanied by rigorous evaluation on real data, ideally outside the main source domain to understand the generalization capabilities of the models. In other words, synthetic data are one solution to increase diversity, but not a substitute for efforts to increase data representation for underrepresented conditions and populations.

If the generative model is of poor quality or biased, then we may end up exacerbating problems of bias or structural inequities in the downstream model. The generative model may be unable to generate images of a certain label and sensitive attribute. In other settings, the model may always generate a specific part of the distribution for a certain label and sensitive attribute instead of capturing the true image distribution. The generative model may also create incorrect images of a given label and sensitive attribute, leading the classification model to make mistakes confidently in those regions. Medical training datasets can also encode structural inequities in the delivery of healthcare, which could be propagated by generative models in ways that might not be immediately apparent on inspection of the synthetic examples being created. Therefore, it is particularly important that evaluation data remain unbiased and that multiple safeguards are implemented to assess model fairness and mitigate the multifaceted nontechnical health inequities that cannot be addressed solely by data curation and model development.

Another important risk to be mindful of is that the insights that we obtain by analyzing the model are only as good as our evaluation setup. If the evaluation datasets are not diverse enough, do not capture high-risk conditions well or are not representative of the population, then any conclusions we draw from these results will be limited. Therefore, care needs to be taken to report and understand what each of the evaluation setups is capturing. For example, as Varoquaux et al.¹⁰ highlight, clinician-level performance is often overstated without validating models OOD. Moreover, clinical applicability of patch-level evaluation for the histopathology setting can be limited and whole-slide image analysis should be investigated further.

In terms of limitations, sensitive attributes are not always observed or explicitly tracked and reported²⁸, often to protect people’s privacy. At the same time, the way labels are assigned may have its own limitations. For example, using binary gender and sex attributes (or using the two interchangeably) does not represent people who identify as nonbinary. Similarly, researchers have criticized the Fitzpatrick skin type because it is less accurate for darker skin tones, which could cause models to misidentify or misrepresent people with darker skin. In addition, there are other unobserved characteristics that can influence disease and are not accounted for in a visual image of skin, for example, social determinants of health. One instance of this is how dermatitis in a person who lives in a communal setting could have a different differential diagnosis than dermatitis in a high-income individual in a high-income setting. These are important considerations when relying on such attributes to condition learned augmentations or to perform fairness analyses.

Finally, synthetic images should be handled with caution and transparency because they may perpetuate biases in the original training data. It is important to tag and identify when a synthetic image has been added to a database, especially when considering reusing the dataset in a different setting or by different practitioners.
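Tagging can be as simple as carrying a provenance flag on every image record. The sketch below shows one possible record layout; it is purely illustrative, and the class name, field names and the requirement to store a generator identifier are assumptions rather than anything prescribed by the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageRecord:
    """Dataset entry with an explicit provenance flag (hypothetical schema)."""

    image_id: str
    label: str
    is_synthetic: bool
    generator_id: Optional[str] = None  # which generative model produced the image

    def __post_init__(self):
        # Refuse synthetic entries that cannot be traced back to a generator.
        if self.is_synthetic and self.generator_id is None:
            raise ValueError("synthetic images must record their generator")


def real_only(records):
    """Select real images, for example to build an evaluation set."""
    return [record for record in records if not record.is_synthetic]
```

Keeping the flag on the record itself means that anyone reusing the dataset can filter out synthetic images, which also supports the recommendation above to evaluate on real data only.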

We see potential for future work that improves fairness and OOD generalization by leveraging powerful generative models but without explicitly relying on predefined categorical labels. When we consider synthetic images as an option for addressing performance gaps across subgroups, the following challenges still need to be addressed: reducing memorization for rare attributes and conditions; providing privacy guarantees; and accounting for unobserved characteristics.

Methods

Our research complies with all relevant ethical regulations. We only repurposed existing assets and datasets and did not collect new assets for the purposes of our study, beyond annotations by dermatology experts for the generated images. The non-accessible data used in the study can be used for research purposes without further scrutiny or collection of consent from the source individuals.

Datasets

In this section, we describe the datasets we used to train the downstream classifiers and diffusion models across the different modalities and medical contexts. Three different datasets were used, all of which are de-identified; informed consent was obtained from the participants in the original studies that collected these data.