A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis

Although generative adversarial networks (GANs) can produce large datasets, their limited diversity and fidelity have recently been addressed by denoising diffusion probabilistic models (DDPMs), which have demonstrated superiority in natural image synthesis. In this study, we introduce Medfusion, a conditional latent DDPM designed for medical image generation, and evaluate its performance against GANs, which currently represent the state of the art. Medfusion was trained and compared with StyleGAN-3 using fundoscopy images from the AIROGS dataset, radiographs from the CheXpert dataset, and histopathology images from the CRCDX dataset. Based on previous studies, Progressively Growing GAN (ProGAN) and Conditional GAN (cGAN) were used as additional baselines on the CheXpert and CRCDX datasets, respectively. Medfusion exceeded GANs in terms of diversity (recall), achieving better scores of 0.40 versus 0.19 on the AIROGS dataset, 0.41 versus 0.02 (cGAN) and 0.24 (StyleGAN-3) on the CRCDX dataset, and 0.32 versus 0.17 (ProGAN) and 0.08 (StyleGAN-3) on the CheXpert dataset. Furthermore, Medfusion exhibited equal or higher fidelity (precision) across all three datasets. Our study shows that Medfusion constitutes a promising alternative to GAN-based models for generating high-quality medical images, with improved diversity and fewer artifacts in the generated images.


Introduction
The performance of deep learning crucially depends on the size of the available training set [1; 2]. However, in the medical domain, data is often not publicly available, and large data pools cannot be sourced from multiple sites because of privacy issues. In the past, generative adversarial networks (GANs) have been used to address these problems [3]. Generative models have a variety of possible applications, from sharing data and circumventing legal or ethical difficulties [4] to reducing the need for data through modality translation [5] and improving deep learning performance [4; 6]. However, generating meaningful medical data is hard, since medical diagnosis often depends on subtle changes in the appearance of complex organs, and the task is often more challenging than image classification on natural images. In addition, GANs suffer from inherent architectural problems such as the failure to capture true diversity, mode collapse, or unstable training behavior [7]. Thus, particular emphasis needs to be put on the generation of high-quality synthetic medical data. Recently, denoising diffusion probabilistic models (DDPMs) [8] and latent DDPMs [9] have shown state-of-the-art results and were able to outperform GANs on natural images [10]. While DDPMs have already demonstrated their superiority over GANs on natural images, a wide-scale direct comparison of latent DDPMs to GANs on medical images covering multiple domains has so far not been done. We found two studies that directly compared DDPMs and GANs for medical image synthesis in specific use cases. Pinaya et al. [11] used a latent DDPM to generate 3D brain MRI images, which was trained and conditioned on 31,740 T1-weighted images from the UK Biobank with covariables such as age and sex. They found that their latent DDPM outperforms LSGAN and VAE-GAN. In a similar study, a DDPM was used to generate 3D brain MRI images and was trained, but not conditioned, on 1,500 T1-weighted images from the ICTS dataset [12]. In a quantitative comparison, the DDPM could outperform a 3D-α-WGAN but not a CCE-GAN. However, when two radiologists with 15 years of experience were asked to classify the images as real or fake, 60% of the DDPM-generated images were rated as real, but none of the GAN images were. These studies show that (latent) DDPMs are a promising alternative to GANs in the medical domain as well. However, tests were limited to MRI and to the brain and focused on 3D image generation. In this study, we propose Medfusion, a conditional latent DDPM for medical images. We compare our DDPM-based model against GAN-based models using images sourced from ophthalmology, radiology, and histopathology and demonstrate that DDPMs beat GANs in all relevant metrics. To foster future research, we make our model publicly available as open source to the scientific community.

Ethics statement
All experiments were conducted in accordance with the Declaration of Helsinki and the International Ethical Guidelines for Biomedical Research Involving Human Subjects by the Council for International Organizations of Medical Sciences (CIOMS). The study has additionally been approved by the local ethics committee (EK 22-319).

Datasets
In this retrospective study, three publicly available datasets were used. First, the AIROGS [13] challenge train dataset, containing 101,442 256x256 RGB eye fundus images from about 60,357 subjects, of which 98,172 had "no referable glaucoma" and 3,270 had "referable glaucoma". Sex and age of the subjects were unknown. Second, the CRCDX [14] dataset, containing 19,958 color-normalized 512x512 RGB histology colorectal cancer images at a resolution of 0.5 µm/px. Half of the images were microsatellite stable and half microsatellite instable. Sex and age of the subjects were unknown. Third, the CheXpert [15] train dataset, containing 223,414 gray-scale chest radiographs of 64,540 patients. Images taken in lateral position were excluded, leaving 191,027 images from 64,534 patients. All images were scaled to 256x256 and normalized between -1 and 1, following the pre-processing routine in [4]. Of the remaining radiographs, 23,385 showed an enlarged heart (cardiomegaly), 7,869 showed no cardiomegaly, and 159,773 had an unknown status. Labels with unknown status were relabeled as in [4], such that 160,935 images had no cardiomegaly and 30,092 had cardiomegaly. Mean age of the 28,729 female and 35,811 male patients was 60±18 years at examination.
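The resizing and intensity normalization described above can be sketched as follows. This is a minimal illustration only: the exact interpolation method and I/O pipeline of [4] are not reproduced, so nearest-neighbour resizing over an in-memory array stands in for them.

```python
import numpy as np

def preprocess(img, size=256):
    """Resize to size x size and scale intensities to [-1, 1].

    `img` is an H x W (grayscale) or H x W x C uint8 array. Nearest-neighbour
    resizing is an assumption standing in for the unspecified interpolation.
    """
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size          # row indices to sample
    cols = np.arange(size) * w // size          # column indices to sample
    resized = img[rows][:, cols].astype(np.float32) / 255.0  # -> [0, 1]
    return resized * 2.0 - 1.0                               # -> [-1, 1]
```

The [-1, 1] range matches the tanh-style output range commonly assumed by both GAN generators and diffusion autoencoders.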

Model architecture and training details
Two types of generative models were used in this study.
First, classical generative adversarial networks (GANs) as introduced by Goodfellow et al. [16]. We aimed to use GANs that exhibited state-of-the-art quality on the respective datasets to allow a fair comparison with the diffusion model. For the CheXpert dataset, we used a pre-trained progressive growing GAN (ProGAN) from [4]. Chest x-rays generated by this GAN were barely differentiable from real ones by three inexperienced and three experienced radiologists [4]. Furthermore, this GAN architecture has already been used for data augmentation and has led to higher downstream performance compared to traditional augmentation [17]. For the CRCDX dataset, we employed a pre-trained conditional GAN (cGAN) as described in [6]. This GAN has been shown to produce realistic-looking histological cancer images in a blinded test with five readers, and the authors were able to show that a classifier benefits from using generated images during training. No pre-trained GAN was available for the AIROGS dataset at the time of writing. Therefore, StyleGAN-3 [18] was used, as it incorporates most of the latest GAN developments and has shown state-of-the-art results. We used the default settings proposed by the authors for images of 256x256 pixels and trained the model for 3000 iterations.
Second, our proposed Medfusion model (Figure 1) is based on the Stable Diffusion model [9]. It consists of two parts: an autoencoder that encodes the image space into a compressed latent space, and a DDPM [8]. Both parts were trained in two subsequent phases. In the first training phase, the autoencoder was trained to encode the image space into an 8-times compressed latent space of size 32x32 or 64x64 for the 256x256 and 512x512 input spaces, respectively. During this phase, the latent space was directly decoded back into image space and supervised by a multi-resolution loss function, which is described in the supplemental material. In the second training phase, the pre-trained autoencoder encoded the image space into a latent space, which was then diffused into Gaussian noise using t=1000 steps. A UNet [19] model was used to denoise the latent space. The weights of the autoencoder were frozen during the second training phase. Samples were generated with a Denoising Diffusion Implicit Model (DDIM) [20] and t=150 steps. We motivate our choice of steps in the supplemental material.
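The deterministic DDIM sampling used at inference can be sketched as follows. This is a simplified, self-contained illustration in NumPy: `eps_model` is a hypothetical stand-in for Medfusion's trained UNet noise predictor, and the linear beta schedule is an assumption, as the schedule is not stated in the text above.

```python
import numpy as np

def ddim_sample(eps_model, shape, n_steps=150, n_train=1000, rng=None):
    """Deterministic DDIM sampling sketch (eta = 0).

    `eps_model(x, t)` predicts the noise in latent `x` at timestep `t`.
    A linear beta schedule over `n_train` training steps is assumed.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, n_train)
    abar = np.cumprod(1.0 - betas)              # cumulative product alpha-bar_t
    ts = np.linspace(n_train - 1, 0, n_steps).astype(int)  # thinned schedule
    x = rng.standard_normal(shape)              # start from pure Gaussian noise
    for i, t in enumerate(ts):
        eps = eps_model(x, t)
        # predict the clean latent from the current noisy latent
        x0 = (x - np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(abar[t])
        # jump directly to the next (earlier) timestep in the thinned schedule
        a_prev = abar[ts[i + 1]] if i + 1 < n_steps else 1.0
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x
```

With t=150 steps, each sample requires 150 denoiser evaluations instead of the 1000 diffusion steps used during training, which is the main source of DDIM's speed-up over ancestral DDPM sampling.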

Experimental design
The study was divided into two sub-studies. First, we investigated whether the capacity of the autoencoder in the Medfusion model was sufficient to encode the images into a highly compressed latent space and decode the latent space back into the image space without losing relevant medical details. We also investigated whether the autoencoder of the Stable Diffusion model (pre-trained on natural images) could be used directly for medical images, i.e., without further training on medical images and without loss of medically relevant image details. Second, we compared the images generated by Medfusion and the GANs quantitatively and qualitatively. For the quantitative evaluation, we refer to the statistics section, in which we describe the individual metrics in detail. For the qualitative assessment, real, GAN-generated, and Medfusion-generated images were compared side by side.

Statistical analysis
All statistical analyses were performed using Python and implemented in TorchMetrics [21]. To compare sample quality across models, the following metrics were used. First, the Fréchet Inception Distance (FID) [22], which has become a standard metric for quality comparisons of generative models [23] and measures the agreement of the real and synthetic images by comparing the features of the deepest layer of the Inception-v3 [24] model. Second, the Improved Precision and Recall metric [25], which measures fidelity as the overlap of synthetic and real features relative to the entire set of synthetic features (precision) and diversity as the overlap relative to the entire set of real features (recall). Following a previous study [10], Inception-v3 was used instead of the originally proposed VGG-16 [26] to extract features. Third, the Multiscale Structural Similarity Index Measure (MS-SSIM) [27], which is a generalized version of the SSIM [28] obtained by applying SSIM at different resolutions.
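The FID described above reduces to the Fréchet distance between two Gaussians fitted to the feature sets. A minimal sketch follows; in practice the features come from Inception-v3, whereas here any (n_samples, n_features) arrays work, and the eigenvalue-based trace term is one common way to evaluate the required matrix square root.

```python
import numpy as np

def fid(feat_real, feat_fake):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    s1 = np.cov(feat_real, rowvar=False)
    s2 = np.cov(feat_fake, rowvar=False)
    # tr((s1 @ s2)^{1/2}) via eigenvalues of s1 @ s2, which are real and
    # non-negative (up to numerical noise) for covariance matrices
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(((mu1 - mu2) ** 2).sum()
                 + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)
```

Identical feature distributions give an FID of (numerically) zero; larger values indicate a larger mismatch between the real and synthetic feature statistics.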
The SSIM measures image distortion by the change in structural information, expressed by comparing luminance, contrast, and structure. To ensure consistency between model comparisons, a reference batch was used for the AIROGS, CRCDX, and CheXpert datasets with 6,540, 19,958, and 15,738 equally distributed images of both classes, respectively. Of note, these metrics depend on the reference subset and implementation [29] and are not directly comparable with other studies.
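The improved precision and recall metric can be sketched as a k-nearest-neighbour manifold estimate over the extracted features. This is a simplified illustration only: k=3 is an arbitrary choice here, and the TorchMetrics implementation used in the study differs in detail.

```python
import numpy as np

def knn_radii(feats, k=3):
    """Distance from each feature to its k-th nearest neighbour in the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-distances
    return np.sort(d, axis=1)[:, k - 1]

def coverage(queries, support, radii):
    """Fraction of query points falling inside the support set's manifold,
    where the manifold is the union of balls around each support point."""
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real, fake, k=3):
    """Improved precision (fidelity) and recall (diversity) sketch."""
    prec = coverage(fake, real, knn_radii(real, k))  # fake near real manifold
    rec = coverage(real, fake, knn_radii(fake, k))   # real near fake manifold
    return prec, rec
```

Precision drops when synthetic features fall outside the real manifold (low fidelity), while recall drops when parts of the real manifold are not covered by synthetic features, which is the signature of mode collapse discussed later.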

Implementation and data availability
All experiments were implemented in Python v3.8 and executed on a computer with an Nvidia RTX 3090. The datasets can be downloaded directly from the websites of the referenced authors.

Results

High reconstruction capacity of Medfusion's Autoencoder
Since the quality of the diffusion-generated images is limited by the reconstruction quality of the autoencoder, we first investigated possible image quality losses due to the autoencoder architecture.
To evaluate the maximum possible quality, samples in the reference batches were encoded and decoded by the autoencoder. Subsequently, the MS-SSIM and mean squared error (MSE) between the input images and the reconstructed images were calculated and averaged (Table 1). Both metrics indicated a nearly perfect (MS-SSIM = 1, MSE = 0) reconstruction of the images in the AIROGS and CheXpert datasets. Reconstruction quality in the CRCDX dataset was good but lower than in the AIROGS and CheXpert datasets, most likely due to the four times higher resolution. Since these metrics were measured on the reference set, which is part of the training set, these values can be considered an upper bound (for MS-SSIM) and a lower bound (for MSE), respectively. The results on the publicly available test sets of the CheXpert and CRCDX datasets were, however, nearly identical to the results on the reference set and are available in the supplemental materials. This experiment demonstrates that the autoencoder architecture of the DDPM does not restrict the image quality of synthesized images in terms of numeric metrics.

Dataset-specific reconstruction challenges
To investigate whether there were qualitative errors in the autoencoding process, we compared the original and reconstructed images side by side. This confirmed the numerically measured high reconstruction quality but revealed dataset-specific reconstruction errors (Figure 2). The compression in the autoencoding stage resulted in subtle structural changes in the fundus images, color changes in the histology images, and a loss of sharpness in the thorax images. This demonstrates that the DDPM could be further enhanced by a better autoencoding architecture.
To investigate whether this can be remedied with less compression during the autoencoding stage, we performed an additional experiment.

Universal autoencoders: 4 channels are not enough for medical images
A comparison with the autoencoder taken out of the box from the Stable Diffusion model demonstrated that the reconstruction of medical images works well with an autoencoder pre-trained on natural images (Table 2). However, when comparing images side by side, Stable Diffusion showed characteristic reconstruction errors in the CheXpert dataset when its default VAE with 4 channels was used (Figure 3). Although less severe, reconstruction errors were also evident in Medfusion's VAE reconstructions. A further increase in the number of trainable parameters did not seem reasonable, because Stable Diffusion's 4-channel VAE already had about three times as many parameters as Medfusion's 4-channel VAE (24 million). Therefore, we increased the number of channels from 4 to 8 instead, which resulted in a notable quality gain at the cost of a lower compression ratio. These results demonstrate that the DDPM's autoencoding architecture can benefit from a higher number of channels during the autoencoding stage.

Medfusion outperforms GANs
When comparing real and synthetic images based on the FID metric, we found that Medfusion generated more realistic-looking images than the corresponding GAN models in all three datasets (Table 3). This was confirmed by the numeric metrics: precision and recall values showed that Medfusion had higher fidelity while maintaining greater diversity among the images compared with the GAN models. Sample images for qualitative comparison are given in Figure 4. We found that the DDPM consistently generated more realistic and more diverse synthetic images than the GANs. Together, these data show that DDPMs are superior to GANs in terms of both quantitative and qualitative metrics.
We provide a website with sample images to the scientific community so that a straightforward and more comprehensive review of Medfusion's image quality is possible.The website can be accessed at: https://huggingface.co/spaces/mueller-franzes/medfusion-app.

GAN-Synthesized Images Exhibit Characteristic Artifacts
Identifying the GAN-generated images was possible due to characteristic visual artifacts (Figure 5). For the eye fundus images, we found that the synthetic images sometimes exhibited two optical discs, whereas every real fundoscopy exhibits only one. No such occurrences were noted for the Medfusion-generated images. The GAN-generated images exhibited an artificial grid pattern in some generated histological images. Again, we did not observe such artifacts for the Medfusion model. Chest radiographs were identifiable as synthetic by blurred letters, fuzzy borders, and irregular appearances of medical devices (e.g., cardiac pacemakers). We found these artifacts in both the GAN-generated and the DDPM-generated synthetic images. It should be noted that some of the real images showed strong visual artifacts due to image acquisition errors. However, these real artifacts differed from the artifacts in the synthetic data. We provide examples of such real artifacts in the supplementary material.

Discussion
The success of deep learning depends largely on the size and quality of the training data. Therefore, generative models have been proposed as a solution to extend the availability of training data [4]. DDPMs have been demonstrated to achieve superior image quality on natural images. In this study, we investigated whether such models can also generate more diverse and realistic images than GAN-based models in the medical domain. We explored DDPMs in three domains of medical data: ophthalmologic data (fundoscopic images), radiological data (chest x-rays), and histological data (whole-slide images of stained tissue). We optimized our Medfusion architecture for medical image synthesis and found that the image quality of DDPM-generated images was superior to that of GAN-generated images: Medfusion achieved an FID score of 11.63 on the eye, 30.03 on the histology, and 17.28 on the chest dataset, which were all lower (better) than those of the GAN models (20.43, 49.26, 84.31), indicating higher image quality. Also, the precision of the images generated by Medfusion was higher on the eye (0.70 versus 0.43), histology (0.66 versus 0.64), and chest (0.68 versus 0.30) datasets, indicating higher fidelity. A known problem with GANs is mode collapse, where the generator produces very realistic (high-precision) but similar images, so that the true diversity of the real images is not represented. Recall, as a measure of diversity, was strikingly low for histological images generated by the cGAN compared to Medfusion (0.02 versus 0.41), which indicates a mode collapse.
In a study by Pinaya et al. [11], a latent DDPM was trained to generate 3D MRI brain images and compared with two GANs. In agreement with our study, the latent DDPM model showed a lower (better) FID score of 0.008 compared to the two GANs (0.023 and 0.1576). Remarkably, these FIDs were 3 to 4 orders of magnitude lower than in our study. We suspect that this is due to the 3D data used instead of 2D data, because our measured FIDs are in the same order of magnitude as those of previous studies on natural 2D images. Regardless of whether a GAN or our latent DDPM was used, we observed a maximum recall (diversity) of approximately 0.4 on the medical datasets. On natural images, recalls of 0.5 or better have been observed [10]. One possible reason is that while natural images can achieve diversity through changing backgrounds and colors, medical images often have a constant (black or white) background, and colors are narrowly limited, e.g., to grayscale. Therefore, diversity in medical images mainly manifests as changes in details (e.g., variations in heart size or in the opacity of lung tissue). It may be more difficult to achieve high diversity while maintaining high fidelity in medical image generation than in natural image generation. Future studies are needed to investigate this.
Our study has limitations. First, the training and generation of the CheXpert and AIROGS images were performed at a lower resolution than the original, and the images were square (i.e., height equals width). There were two main reasons for this: 1) we wanted to compare the Medfusion model with the GAN results from previous studies, which were trained and evaluated at a specific (lower) resolution; 2) the StyleGAN-3 that we employed for comparison only allows a quadratic resolution that must be a power of 2. Future studies should investigate how the Medfusion model behaves at higher resolutions compared to GAN models. Second, we would like to point out that the metrics used in this study to judge image quality were not developed for medical images, which could reduce their validity, and they should in general be interpreted with care. The development of metrics that are proxies for human judgment is still an ongoing area of research [23]. Furthermore, to the best of our knowledge, no study has investigated these metrics with a focus on medical images. This should be addressed in a future study.

Conclusion
Our study shows that DDPMs provide promising new ways for medical image generation besides GANs. Future work should focus on high-resolution image generation and on quality metrics for 2D and 3D medical images.

DDIM Sampling Steps
We increased the number of sampling steps of Medfusion's DDIM from t=50 to 250 in inference mode and measured FID, precision, and recall on the reference dataset (Figure 7). In terms of all three metrics, image quality increased with the number of steps. In general, quality increased notably over the first 150 steps and then reached a plateau. Therefore, 150 steps appeared to be an appropriate tradeoff between increasing quality and increasing inference time.

Figure 1:
Figure 1: Illustration of the Medfusion model. A: General overview of the architecture. B: Details of the autoencoder, with sampling of the latent space via the reparameterization trick at the end of the encoder and direct connections (dashed lines) into the decoder (only active while training the autoencoder). C: Detailed view of the UNet with a linear layer for time and label embedding. D: Detailed view of the submodules. If not specified otherwise, a convolution kernel size of 3x3, GroupNorm with 8 groups, and Swish activation were used.

Figure 2:
Figure 2: Reconstruction quality of Medfusion's variational autoencoder (VAE). Original images (first row) and images reconstructed by the VAE (second row) in the AIROGS, CRCDX, and CheXpert datasets. In the eye fundus images, fine deviations from the original were apparent in the veins of the optical disc (green arrow). Slight deviations in color tone (green arrow) could be observed in the CRCDX dataset. In the CheXpert dataset, letters (green arrow) became blurry after reconstruction.

Figure 3:
Figure 3: Reconstruction quality comparison. Both the out-of-the-box VAE and the specifically trained VAE exhibit artifacts that may affect diagnostic accuracy. Here, lead cables are not reconstructed properly. Only when the number of channels is increased to eight are such small structures accurately reconstructed.

Figure 4:
Figure 4: Qualitative image generation comparison. Real images (first row), images generated by the GAN (second row) and by Medfusion (third row); columns 1-2 conditioned on no glaucoma and columns 3-4 on glaucoma (A), microsatellite stable and microsatellite instable (B), and no cardiomegaly and cardiomegaly (C).

Figure 5:
Figure 5: GAN-generated images that can be easily identified as synthetic. Synthetic images were easily identifiable because of two optical discs in eye fundus images, artificial grid patterns in histology images, and fuzzy borders and irregular appearances of medical devices in chest x-ray images.

Figure 7:
Figure 7: Fréchet Inception Distance (FID), Precision, and Recall as a function of the sampling steps.

Table 3:
Quantitative image generation comparisons. Models include Generative Adversarial Networks.