Introduction

Medical imaging plays a vital role in the diagnosis and treatment of many diseases, enabling healthcare professionals to understand and visualize the internal structures and functions of the human body. With the advancement of artificial intelligence (AI), the field of medical imaging has seen significant improvements in terms of accuracy, efficiency, and cost-effectiveness. AI techniques such as machine learning and deep learning are commonly applied to medical imaging to, for instance, facilitate early detection and diagnosis of diseases and speed up time-consuming segmentations1,2. For example, radiotherapy treatment planning requires segmentation of the tumor and several organs at risk. These segmentations are still commonly done manually, and segmentation networks can here reduce the required time per patient from hours to a few minutes3.

However, training deep learning models, such as convolutional neural networks (CNNs) and vision transformers, for classification or segmentation normally requires large annotated datasets, as the models may have millions of parameters. In computer vision, tremendous progress has been made during the last 10 years, and a crucial resource is the open ImageNet database4, which contains more than 14 million labeled images. Techniques developed in computer vision are rapidly transferred to the medical imaging field, but a major constraint is that access to medical images is much more complicated due to ethics, anonymization and data protection legislation (e.g. the general data protection regulation (GDPR)). There are several openly available medical imaging datasets, but they are much smaller than ImageNet (for example, the human connectome project (HCP) shares 1,100 subjects5, OpenNeuro shares about 30,0006, UK Biobank will scan and share 100,0007). Furthermore, openly available data are often anonymized through defacing, can represent selective populations around universities, focus on healthy controls rather than diseased populations, and are often curated before distribution to remove poor-quality data. This limits the potential applicability of any model trained on such data in clinical settings. Hospitals have records containing immense quantities of medical images, but these records are often not accessible for research due to regulatory hurdles.

Generative models, such as generative adversarial networks (GANs) and diffusion models, can today produce very realistic synthetic images by learning the high-dimensional distribution of the training images. A potential solution to facilitate sharing of medical images is therefore to generate and share synthetic images, or more precisely synthetic patients, as GDPR should not apply to medical images which do not belong to a specific person (but further legal research is needed). Recent work has demonstrated that generative models (especially diffusion models) can memorize the training images8,9,10, meaning that some synthetic images are effectively copies of training images. As this questions the validity of sharing synthetic medical images, it is thoroughly discussed at the end of this paper.

To share synthetic medical images, and to motivate further research regarding legal aspects and memorization, it must first be demonstrated that they can be used for training deep learning models with acceptable performance. Given the growing number of generative image models, one must also determine which model is best suited for this task.

Related work

Rankin et al.11 used 19 open health datasets to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data, but only used tabular datasets and no image data. Similarly, El Emam et al.12 used synthetic tabular data from COVID-19 patients to predict death, and obtained similar performance using synthetic data. Using synthetic images for training CNNs for classification has become popular during recent years13,14,15,16,17,18, especially in medical imaging19, where obtaining large annotated datasets is much more time-consuming than in computer vision. On the other hand, related work on training segmentation networks with synthetic images and corresponding annotations is more limited. Generating synthetic images and the corresponding annotations can be done in at least two ways: jointly, as a multi-channel image20,21,22, or as a two-step process where one model generates a synthetic label (annotation) image and another model generates the medical image from the label image23,24,25,26. Bowles et al.20 demonstrated that adding synthetic images from a 2D GAN led to improvements in the Dice similarity coefficient of between 1 and 5 percent, but did not perform training with only synthetic images. Shin et al.24 also demonstrated small improvements when adding synthetic images as augmentation. Thambawita et al.27 compared different GANs for generating synthetic colonoscopy images and annotations, but did not use more recent models like StyleGAN or diffusion models. Fernandez et al.28 used the two-step approach to generate label images with a diffusion model, and then used SPADE29 to generate the medical image from the label image. They also applied their models to brain tumor images, but only performed a binary tumor segmentation and did not compare with any other generative models. Furthermore, the generative model was only trained with 1064 slices from 5 subjects.

This work

Here we comprehensively evaluate four 2D GANs (progressive GAN30, StyleGAN 1–331,32,33) and a 2D diffusion model34,35 for generating brain tumor images and tumor annotations, using two openly available datasets (BraTS 2020 and 202136,37,38,39,40,41). We demonstrate that using synthetic images for training segmentation networks (a U-Net42 and a Swin transformer43) leads to performance metrics which are only slightly lower than those obtained when training with real images, and that sharing synthetic images is therefore a viable alternative to sharing real images (as long as one verifies that the synthetic images are not too similar to the training images). To the best of our knowledge, no such comprehensive evaluation, requiring more than 2000 GPU days for training all models, has previously been performed for medical image segmentation. The trained generative models and the generated synthetic images are shared on AIDA data hub44,45.

Results

Figure 1 shows a real 5-channel image, and a randomly selected synthetic slice from each generative model. To investigate how distinct the synthetic images are from the training images, we calculated the highest correlation between 100 synthetic images and all training images (see our related work on memorization10). Briefly, synthetic images from a GAN show a distribution of highest correlations which is similar to that obtained when comparing training images with test images. For the diffusion model, many of the synthetic images are very similar to a training image. Figures 2, 3 show the resulting U-Net segmentations for a random slice in the two test sets, when training the network with different settings.
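A minimal sketch of this correlation-based memorization check is given below, assuming the images are available as NumPy arrays; it is an illustration rather than the exact implementation used in this work.

```python
import numpy as np

def max_correlation_to_training(synthetic, training):
    """For each synthetic image, return the highest Pearson correlation
    against any training image (images are flattened to vectors)."""
    syn = synthetic.reshape(len(synthetic), -1).astype(np.float64)
    tra = training.reshape(len(training), -1).astype(np.float64)
    # Standardize each image to zero mean and unit norm
    syn -= syn.mean(axis=1, keepdims=True)
    syn /= np.linalg.norm(syn, axis=1, keepdims=True) + 1e-12
    tra -= tra.mean(axis=1, keepdims=True)
    tra /= np.linalg.norm(tra, axis=1, keepdims=True) + 1e-12
    corr = syn @ tra.T              # (n_synthetic, n_training) correlation matrix
    return corr.max(axis=1)         # highest correlation per synthetic image

# Example: distribution of highest correlations for 100 synthetic slices
# highest = max_correlation_to_training(synthetic_images[:100], training_images)
```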

Fig. 1
figure 1

Synthetic 5-channel images from the BraTS 2021 data. Each row shows a generative model, except for the top row which shows a real example, and each column shows a different MR sequence.

Fig. 2
figure 2

Example U-Net predictions on an image in the BraTS 2020 test set. Classes are visualized as a colored overlay where red is GD-enhancing tumor, blue is peritumoral edema (ED) and green is necrotic and non-enhancing tumor core (NCR/NET). Each prediction is shown for four trainings using images from each generative model: with and without augmentation, and with and without the original data. The two bottom rows present predictions obtained when training using synthetic images.

Fig. 3
figure 3

Example U-Net predictions on an image in the BraTS 2021 test set. Classes are visualized as a colored overlay where red is GD-enhancing tumor, blue is peritumoral edema (ED) and green is necrotic and non-enhancing tumor core (NCR/NET). Each prediction is shown for four trainings using images from each generative model: with and without augmentation, and with and without the original data. The two bottom rows present predictions obtained when training using synthetic images.

Evaluation metrics

To compare the five generative models we use a variety of metrics. The quality and diversity of synthetic images are often evaluated using metrics such as Fréchet inception distance (FID) and inception score (IS)46, which use pre-trained CNNs to calculate how different the activations in the CNNs are when feeding real and synthetic images through them. In our opinion, the most important evaluation is to train segmentation networks with the synthetic images, and then test how these networks perform on real images. Here we used a U-Net42, as it is one of the most common networks for medical image segmentation, and a Swin transformer43, to see how the results generalize to a more recent network; see the Methods section for details. Segmentation networks are normally evaluated using the Dice score (which measures the overlap between true and predicted annotations) and the Hausdorff distance (which measures the greatest of all the distances from a point in one set to the closest point in the other set). Augmentation is often applied when training segmentation networks, and the segmentation networks were therefore trained with and without augmentation (see Methods section for details).
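As an illustration, a minimal sketch of how these two metrics can be computed for binary 3D masks is given below; it assumes non-empty NumPy masks and uses SciPy's directed_hausdorff, and is not necessarily the exact implementation used in this work (here the metrics are computed per tumor class and per subject in 3D).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred, target):
    """Dice similarity coefficient: overlap between predicted and true binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum() + 1e-8)

def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance between the voxel coordinates of two binary masks."""
    p, t = np.argwhere(pred), np.argwhere(target)
    return max(directed_hausdorff(p, t)[0], directed_hausdorff(t, p)[0])
```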

Ranking of generative models

To summarize all the results, Table 1 shows the ranking of the five generative models, based on the metrics FID, IS, Dice and Hausdorff distance. Here we focus on training segmentation networks with synthetic images only, to keep the table from becoming too complicated. The diffusion model performs best when comparing the models in terms of Dice and Hausdorff distance, but unfortunately this is in several cases explained by memorization. As expected, the older progressive GAN model often performs worse than the more recent StyleGAN models. Overall the rankings are similar for U-Net and the Swin transformer. Clearly, the rankings according to the common FID and IS metrics (shown in Table 2) do not correlate well with the rankings according to Dice and Hausdorff distance. Both FID and IS have been questioned as good metrics16,47, but are still commonly used due to the lack of better alternatives. FID and IS focus on image quality and diversity, but do not consider memorization. Since the CNNs used for calculating FID and IS are trained on ImageNet, which only contains non-medical images, the metrics will also be biased for medical images.

Table 1 Ranking of the five generative models based on the metrics FID, IS, Dice and Hausdorff distance (when using synthetic images only).
Table 2 Comparison of the generative models using the most commonly used metrics, Fréchet inception distance (FID) and inception score (IS).

Dice scores

Tables 3, 4 show the obtained Dice scores when training the segmentation networks with different combinations of real and synthetic images, and testing with real images, for BraTS 2020 and BraTS 2021 respectively. To make it easier to compare the performance to that of using real images only, Table 5 shows the relative Dice scores, i.e. the mean Dice score obtained when using real and synthetic images, or only synthetic images, divided by the mean Dice score obtained when using only real images (with augmentation).

Table 3 Results on the test dataset (56 subjects) when training the generative models with BraTS 2020 (313 training subjects).
Table 4 Results on the test dataset (56 subjects) when training the generative models with BraTS 2021 (1195 training subjects).
Table 5 Results on the test datasets (56 subjects) when training the generative models with BraTS 2020 and 2021.

For the U-Net trained with BraTS 2020, the diffusion model results in the highest Dice scores when using only synthetic images and augmentation, followed by StyleGAN 2, StyleGAN 3 and progressive GAN. A similar ranking is obtained for the Swin transformer. Using synthetic images from StyleGAN 1 results in very low Dice scores, explained by the fact that we were not able to find good hyperparameters. When excluding StyleGAN 1, the mean Dice score when using synthetic images only is very similar for the U-Net and the Swin transformer, demonstrating that the synthetic images can also be used for more recent segmentation networks. Excluding StyleGAN 1, the mean Dice score is improved by 16.8% for the U-Net when adding augmentation to synthetic images only, compared to 4.1% for the Swin transformer.

For the U-Net trained with BraTS 2021, the diffusion model again results in the highest Dice scores when using only synthetic images and augmentation, followed by StyleGAN 2 and StyleGAN 3. The same ranking is obtained for the Swin transformer. Using synthetic images from StyleGAN 1 results in Dice scores that are much higher than for BraTS 2020, possibly explained by the hyperparameters being a better fit for this dataset. The mean Dice score when using synthetic images only is 6.3% higher for the Swin transformer compared to the U-Net, again demonstrating that the synthetic images can also be used for more recent segmentation networks. The mean Dice score is improved by 15.9% for the U-Net when adding augmentation to synthetic images only, compared to only 1.8% for the Swin transformer.

Regarding relative Dice scores, Table 5 shows that the diffusion model for the U-Net trained with synthetic images only from BraTS 2020 results in the same Dice scores as when using real images, while StyleGAN 2 reaches 66%–93% and StyleGAN 3 reaches 81%–87%. For the Swin transformer, synthetic images from the diffusion model result in Dice scores that are 89%–92% of those obtained when training with real images, while StyleGAN 2 reaches 79%–84% and StyleGAN 3 reaches 78%–81%. For the U-Net trained with BraTS 2021, the Dice scores obtained when training with only synthetic images are in general lower than for BraTS 2020, except for StyleGAN 1. The diffusion model reaches 89%–91% relative Dice, while StyleGAN 2 reaches 63%–87% and StyleGAN 3 reaches 79%–82%. For the Swin transformer the relative Dice scores are in general higher than for BraTS 2020, partly explained by the fact that the Swin transformer results in a lower Dice score than the U-Net when training with only real images (this may be explained by vision transformers normally needing larger datasets to perform well). The diffusion model reaches about 96% relative Dice, compared to 85%–86% for StyleGAN 2 and 82%–84% for StyleGAN 3.

To assess the impact of the ratio of real and synthetic images, we systematically increased the proportion of real images in a training set with a constant size of 100,000 images. This approach allowed us to evaluate the benefits of real data and the utility of synthetic images in enhancing model performance. The outcomes of this incremental integration are illustrated in Fig. 4, which shows how varying the ratio of real to synthetic data affects the results. Using only 5000 real images, along with 95,000 synthetic images, still results in good performance (substantially higher than when using 100,000 synthetic images).
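A minimal sketch of how such a fixed-size mixed training set could be assembled is shown below; the list variables are hypothetical placeholders and this is not necessarily the sampling scheme used in this work.

```python
import random

def build_mixed_training_set(real_items, synthetic_items, n_real, total=100_000, seed=0):
    """Sample a training set of fixed size: n_real real items, the rest synthetic.
    Sampling with replacement is used if a pool is smaller than the requested amount."""
    rng = random.Random(seed)
    n_syn = total - n_real
    real_part = rng.choices(real_items, k=n_real) if n_real > len(real_items) else rng.sample(real_items, n_real)
    syn_part = rng.choices(synthetic_items, k=n_syn) if n_syn > len(synthetic_items) else rng.sample(synthetic_items, n_syn)
    mixed = real_part + syn_part
    rng.shuffle(mixed)
    return mixed

# e.g. 5,000 real slices combined with 95,000 synthetic slices
# training_set = build_mixed_training_set(real_slices, stylegan3_slices, n_real=5_000)
```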

Fig. 4
figure 4

Graph depicting the U-Net segmentation performance (Dice score) when using different proportions of real (BraTS 2021) and synthetic images generated from StyleGAN 3 (trained on BraTS 2021), in a constant total set of 100,000 images. As the number of real images increases along the x-axis, fewer synthetic images are used. To avoid random fluctuations, each segmentation model was trained 10 times and the average performance is presented.

Hausdorff distance

Tables 6, 7 show the obtained Hausdorff distances when training the segmentation networks with synthetic images from BraTS 2020 and 2021, respectively. Overall the rankings of the generative models are very similar to the rankings from the Dice scores, the main difference being that the progressive GAN is ranked higher for the U-Net. The mean Hausdorff distance is in general much lower (i.e. better) for the Swin transformer compared to the U-Net.

Table 6 Results on the test dataset (56 subjects) when training the generative models with BraTS 2020 (313 training subjects).
Table 7 Results on the test dataset (56 subjects) when training the generative models with BraTS 2021 (1195 training subjects).

Qualitative evaluation by neuroradiologist

In addition to the quantitative metrics, a qualitative evaluation by an experienced neuroradiologist was performed; see the Methods section for details. Table 8 shows the results of the evaluation, i.e. how the images were classified (real or synthetic).

Table 8 Results from the qualitative evaluation by an experienced neuroradiologist, who classified 600 four-panel images as real or synthetic (100 real images and 100 images per generative model).

Discussion

Our evaluation shows that training segmentation networks with synthetic images works well, with Dice scores that reach 91%–100% of those obtained when training with real images (for BraTS 2021 and 2020, respectively). Shin et al.24 obtained a relative Dice score of 77.6% for brain tumor segmentation, using the deep convolutional GAN architecture, which is older than progressive GAN. Fernandez et al.28 obtained a similar relative Dice of 93.8%, but only performed a binary tumor segmentation, which is an easier task. No comparison with other generative models was conducted. Thambawita et al.27 obtained a relative Dice score of 97.2%, but for segmentation of endoscopy images, making a direct comparison difficult. Furthermore, the authors did not use more recent generative models like StyleGAN or diffusion models.

For BraTS 2020, the Dice scores for the diffusion model are essentially the same as when training with real images, which made us suspicious. An investigation revealed that the diffusion model had memorized many of the training images10. Memorization has previously been shown when using diffusion models for non-medical images8,9, but to the best of our knowledge not for medical images. Diffusion models are more likely to memorize the training images than GANs8,10, due to their fundamentally different architecture.

Even better results can be obtained using an ensemble of 5–10 generative models16,22,48, as each model by random chance will learn a different subset of the high dimensional distribution, at the cost of a training time which is 5–10 times longer. Larsson et al.22 demonstrated that using an ensemble of 10 progressive GANs improved the mean Dice score for brain tumor segmentation by 9.5%, compared to a single GAN. The benefit of using an ensemble is expected to be larger for BraTS 2021, compared to BraTS 2020 used in22, to capture all modes of the distribution (due to a larger number of imaging sites in BraTS 2021).

Training the generative models with 1195 subjects (BraTS 2021), instead of 313 (BraTS 2020), leads to worse performance for the U-Net (lower relative Dice scores when using only synthetic images, mean 83.68% versus 77.36%, excluding StyleGAN 1) which may seem surprising. However, BraTS 2021 contains data from a larger number of sites (23 versus 19), which will result in more modes in the high dimensional distribution, which is harder to learn. Furthermore, using a larger dataset like BraTS 2021 makes it harder for the generative models to memorize the training images9,10. It would be very interesting to compare our results to SinGAN-Seg27, where a single image is used to train the generative model, but we suspect that such a model is prone to memorization.

Our results show that augmentation makes a rather big difference for the U-Net when training with only synthetic images, while the improvement is smaller for the Swin transformer. A possible explanation is that the augmentation helps the segmentation networks overcome systematic differences between real and synthetic images. Applying augmentation when training the generative models needs to be explored in future work: on the one hand it can increase the number of training images, but on the other hand it may introduce more modes in the high-dimensional distribution (which will be harder to learn).

Several other researchers have demonstrated that combining real and synthetic images (i.e. using generative models for advanced augmentation) can improve segmentation accuracy20,21,24, or classification accuracy13,18, compared to training with only real images, but our results show that adding synthetic images only provides minor improvements or even results in worse performance. There are at least two possible explanations for this. First, we use a rather strong baseline segmentation model with several types of traditional augmentation during training. Second, we repeat the training of each segmentation network 10 times to avoid differences due to random chance. It is possible that using a subset of the real data (e.g. 20%), instead of all the real data, would result in larger improvements when adding synthetic images. However, the generative models should then be trained with the same subset, which will reduce the quality and diversity of the synthetic images.

The results are likely to strongly depend on the hyperparameters and the architecture of each generative model, but exploring many parameter combinations and architectures is difficult due to the long training times of both generative models and segmentation networks. For this reason, the results presented in this work may not correspond to the optimal results for each generative model, which one could obtain after an exhaustive optimization of hyperparameters and architectures. The total training time for this work was over 2000 days on an Nvidia A100 graphics card, and an exhaustive search of hyperparameters could have increased this by a factor of 10. This work demonstrates that new efficient metrics for evaluating synthetic medical images are required, as FID and IS are based on ImageNet (which does not contain medical images), do not consider memorization8,9,10, and in general do not correlate with how a network trained on the synthetic images will perform (see Table 1)16,47. In future work we will calculate FID and IS using CNNs pre-trained on RadImageNet49, which is a large collection of medical images, to see if Rad-FID and Rad-IS correlate better with our other metrics.

Regarding the qualitative evaluation by a neuroradiologist, the results show that the generative models produce synthetic images that are on the same level as real images (a similar number of images were classified as synthetic). It should however be noted that the setup of this experiment was not similar to a regular clinical assessment of brain tumor MRI, which is reflected in the fact that a large portion of the real images were incorrectly classified as synthetic. Normally, in a clinical workflow, a neuroradiologist would assess the whole brain, instead of a single slice, with even more MRI sequences than used in this study. Furthermore, a neuroradiologist does not normally look at skull-stripped brain images. A challenge with this evaluation was that the BraTS images originate from many different scanners, with varying image quality, which probably also affected the visual assessment. Two additional limitations are that only one neuroradiologist performed the visual assessment, and that the sample of 600 images is not balanced in terms of real and synthetic images (which may introduce a bias).

The implication of our results is that sharing synthetic medical images is a viable alternative to sharing real images. A researcher can use synthetic images for pre-training, and then fine-tune the model on a small number of locally available images. Sharing synthetic medical images can be substantially easier11,12,27,50, as GDPR should not apply to data which do not belong to a specific person (but further legal research is needed). Regarding consent, Larson et al.51 argue that clinical data should be treated as a form of public good, to be used for the benefit of future patients, and further argue that consent is not required before collected data are used for secondary purposes when obtaining such consent is prohibitively costly or burdensome (e.g. contacting 1,000–10,000 persons). On the other hand, the argument that clinical data should be treated as a form of public good, not requiring further consent for use in research or development, may be a slippery slope, and many examples exist where a retrospective look identifies the continued use of such data as an unauthorized abuse (e.g. Henrietta Lacks).

Before sharing synthetic images it is important to investigate how similar each synthetic image is to all training images, as especially diffusion models have been shown to memorize the training images8,9,10,52. This is particularly important for small datasets, as memorization is then more likely9,10. Common evaluation metrics like FID and IS do not capture memorization, and it is therefore necessary to, for example, calculate the correlation (or some other metric, like mutual information) between each synthetic image and all training images10,48,52. Pre-trained generative models can play an important role for sharing synthetic images from small datasets, as it should be less likely for a pre-trained model to memorize a small number of new images during fine-tuning (compared to training the model from scratch). Determining an acceptable range of overlap with real clinical data is a very difficult task, especially since different legal experts interpret GDPR differently (in Sweden it is in general interpreted more strictly than in other countries) and since this acceptable range is likely to differ between different types of medical images. It therefore remains an open question how high the highest similarity can be before a synthetic image is considered a copy of a training image.

Methods

Data

The MR images used for this project were downloaded from the Multimodal Brain Tumor Segmentation (BraTS) Challenge 2020 and 202136,37,38,39,40,41. The training set contains MR volumes of shape 240 × 240 × 155 from 369 subjects for BraTS 2020 and from 1251 subjects for BraTS 2021. For each subject four types of MR images are available: T1-weighted (T1w), post gadolinium contrast T1-weighted (T1wGd), T2-weighted (T2w), and T2-weighted fluid attenuated inversion recovery (FLAIR). The annotations cover three parts of the brain tumor: peritumoral edema (ED), necrotic and non-enhancing tumor core (NCR/NET), and GD-enhancing tumor (ET). We used 313 (BraTS 2020) and 1195 (BraTS 2021) subjects for training and 56 subjects for testing, after first performing a random shuffling of the subjects. The data in the test sets were not used for training the generative models.

All 3D volumes were split into 2D slices, as a 2D GAN and a 2D diffusion model were used (3D GANs and 3D diffusion models are not yet very common). Only slices where at least 15% of the pixels had an intensity above 50 were included in the training. This resulted in a total of 23,478 5-channel images for BraTS 2020, and 91,271 5-channel images for BraTS 2021. Each slice was zero padded from 240 × 240 to 256 × 256 pixels, as the used GANs only work for resolutions that are a power of 2, and the intensity was rescaled to 0–255. The intensities of the tumor annotations were changed from [1, 2, 4] to [51, 102, 204], such that the intensity range is more similar across the 5 channels.
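A minimal sketch of this slice selection and preprocessing is given below; the per-slice min–max rescaling and the channel used for the slice selection are assumptions, as the text above does not specify them.

```python
import numpy as np

LABEL_MAP = {1: 51, 2: 102, 4: 204}  # remap annotation values to a wider intensity range

def keep_slice(mr_slice, min_fraction=0.15, threshold=50):
    """Keep slices where at least 15% of the pixels have an intensity above 50."""
    return np.mean(mr_slice > threshold) >= min_fraction

def preprocess_slice(mr_channels, label_slice):
    """Zero-pad 240x240 slices to 256x256, rescale MR intensities to 0-255
    and remap the annotation labels. mr_channels: (4, 240, 240), label_slice: (240, 240)."""
    pad = ((0, 0), (8, 8), (8, 8))  # 240 -> 256 in both spatial dimensions
    mr = np.pad(mr_channels.astype(np.float32), pad)
    mr = 255.0 * (mr - mr.min()) / (mr.max() - mr.min() + 1e-8)  # rescaling choice assumed
    lab = np.pad(label_slice, ((8, 8), (8, 8)))
    out_lab = np.zeros_like(lab)
    for original, remapped in LABEL_MAP.items():
        out_lab[lab == original] = remapped
    return mr, out_lab
```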

Image generation

In this work we compare four different GANs (progressive growing GAN30, StyleGAN 131, StyleGAN 232, StyleGAN 333) and a diffusion model34,35, for the task of generating brain tumor images. GANs are trained through adversarial learning (using an adversarial loss function), where a generator and a discriminator compete against each other: the generator learns to produce more realistic images, while the discriminator learns to better distinguish real from synthetic images. At inference time, only the generator is used. A diffusion model, on the other hand, starts with real data samples and progressively adds noise over many steps, according to a predetermined schedule, until the data becomes pure noise. In this work a linear noise schedule was used. The diffusion model is then trained to reverse this process, using more traditional loss functions, reconstructing less noisy data from more noisy data at each step. During inference, a diffusion model starts with an image of pure noise and sequentially applies the learned denoiser to reduce the noise, following the reverse of the training noise schedule. This process is iterated until the noise is completely removed, resulting in the generation of a new image which resembles the training data distribution. In general, diffusion models are easier to train than GANs, due to more traditional loss functions, but are much slower at generating images.
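For illustration, a minimal sketch of the forward (noising) process with a linear schedule is shown below, in the closed form used by DDPM-style models; it is a generic example, not the authors' exact implementation.

```python
import torch

def linear_beta_schedule(n_steps=4000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: the variance of the noise added at each step."""
    return torch.linspace(beta_start, beta_end, n_steps)

def q_sample(x0, t, alphas_cumprod):
    """Forward diffusion: create a noisy version of x0 at timestep t in closed form."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise  # the denoiser is trained to predict `noise` from (x_t, t)

betas = linear_beta_schedule()
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
# x0: a batch of 5-channel training images, t: random integer timesteps per image
# x_t, noise = q_sample(x0, t, alphas_cumprod)
```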

The openly available code of each generative model was modified to generate 5-channel images instead of 3-channel images; no other modifications to the default architectures were made. Each generative model thereby learns to jointly generate the four MR images (T1w, T1wGd, T2w, FLAIR) and the corresponding tumor annotation. There is no guarantee that the synthetic annotations will be restricted to the same values as the real annotations ([51, 102, 204]). The synthetic annotations were therefore thresholded to the closest original annotation value.
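A minimal sketch of this thresholding step is shown below, assuming the synthetic annotation channel is a NumPy array with values in 0–255.

```python
import numpy as np

VALID_LABELS = np.array([0, 51, 102, 204])  # background plus the three remapped tumor labels

def snap_to_closest_label(annotation):
    """Replace each pixel of a synthetic annotation channel with the closest valid label value."""
    distances = np.abs(annotation[..., None].astype(np.int32) - VALID_LABELS)
    return VALID_LABELS[np.argmin(distances, axis=-1)]
```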

The hyperparameters used for each model, and the approximate training times, are provided in Table 9. We used a set of common hyperparameters across all models, along with some model-specific ones. For instance, in the case of the StyleGANs, we experimented with different gamma values, the best of which are detailed in Table 9. The diffusion model was trained with varying numbers of diffusion steps, but the optimal results were obtained with 4000 steps for both training and inference (decided by visual inspection).

Table 9 The hyperparameters used for the different generative models, as well as the hardware used and the approximate training time.

For each generative model a total of 100,000 synthetic 5-channel images were generated. For the GANs this took about 10 minutes, while it took 1.5 days (using 8 GPUs) for the diffusion model. The synthetic images and the trained generative models are shared on AIDA data hub44,45.

Quantitative evaluation and tumor segmentation

The quality and diversity of synthetic images are often evaluated using metrics such as Fréchet inception distance (FID) and inception score (IS)46. Since these metrics are based on CNNs trained on ImageNet, which does not contain medical images, they will be biased for medical images. Furthermore, these metrics do not tell us how well a network trained with synthetic images will perform on real images. The synthetic images were therefore used to train segmentation networks (based on U-Net and Swin transformers), and the evaluation was performed using real images in the test set. To investigate how FID and IS correlate with the performance of training with only synthetic images, FID and IS were also calculated.
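As an illustration, FID and IS can be computed with off-the-shelf implementations such as those in torchmetrics; the sketch below evaluates one MR sequence at a time by repeating the single channel to three channels, which is an assumption and not necessarily how the metrics were computed in this work.

```python
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

def fid_and_is_for_channel(real, synthetic):
    """real, synthetic: uint8 tensors of shape (N, 1, 256, 256) for one MR sequence.
    The single channel is repeated to 3 channels to match the Inception network input."""
    real3, syn3 = real.repeat(1, 3, 1, 1), synthetic.repeat(1, 3, 1, 1)
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real3, real=True)
    fid.update(syn3, real=False)
    inception = InceptionScore()
    inception.update(syn3)
    is_mean, is_std = inception.compute()
    return fid.compute().item(), is_mean.item()
```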

Training deep networks is a stochastic process, meaning that training the same model several times will give different results. Each segmentation network was therefore trained 10 times, to make sure that performance differences between the different generative models are not due to random chance. The segmentation was performed for each slice independently, and the Dice scores and Hausdorff distances were then calculated in 3D after putting the slices for each subject back into a volume.

U-Net

The model structure and training setup were inspired by the 2D segmentation code from nnUNet53. The model was a U-Net54 with an extra depth layer and instance normalization instead of batch normalization. In addition, the ReLU activations were swapped for leaky ReLUs with a negative slope of 0.01. All models were trained with a loss consisting of a cross-entropy term and a soft Dice term weighted equally. In addition, deep supervision was used, meaning that the loss was applied at the five highest depth levels with weighting 0.5^d, where d is the depth.
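A minimal sketch of this loss is given below (equally weighted cross-entropy and soft Dice, with deep supervision weights 0.5^d); normalizing by the weight sum and downsampling the target with nearest-neighbour interpolation are assumptions, not details stated above.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, eps=1e-5):
    """Soft Dice loss; target is an integer class map of shape (B, H, W)."""
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    intersection = (probs * one_hot).sum(dims)
    denom = probs.sum(dims) + one_hot.sum(dims)
    return 1.0 - (2.0 * intersection / (denom + eps)).mean()

def deep_supervision_loss(outputs, target):
    """outputs: list of logits from the five highest depth levels (full resolution first),
    each weighted by 0.5**d, combining cross-entropy and soft Dice equally."""
    total, weight_sum = 0.0, 0.0
    for d, logits in enumerate(outputs):
        t = F.interpolate(target[:, None].float(), size=logits.shape[2:], mode="nearest")
        t = t.squeeze(1).long()
        w = 0.5 ** d
        total = total + w * (F.cross_entropy(logits, t) + soft_dice_loss(logits, t))
        weight_sum += w
    return total / weight_sum
```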

The loss was minimized using stochastic gradient descent with Nesterov momentum of 0.99 and weight decay of 3·10^−5. The initial learning rate was 5·10^−2 and was decreased using polynomial learning rate decay with an exponent of 0.9. The learning rates and optimizers were different for the generative models; see their respective papers. All models were trained for 3·10^7 samples or 3 days, whichever occurred first. 20% of the available images were used for validation, and the model with the best mean Dice over the validation set was used for evaluation. If both real and synthetic data were used during training, the real dataset and the synthetic dataset were sampled equally often.
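A minimal PyTorch sketch of this optimizer and learning rate schedule is shown below, assuming the scheduler is stepped once per iteration; it is an illustration, not the exact training code.

```python
import torch

def make_unet_optimizer(model, total_steps):
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-2, momentum=0.99,
                                nesterov=True, weight_decay=3e-5)
    # Polynomial decay: lr(step) = lr0 * (1 - step / total_steps) ** 0.9
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: (1.0 - min(step, total_steps) / total_steps) ** 0.9)
    return optimizer, scheduler
```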

During training, geometric and intensity augmentation was applied, as our previous work on augmentation for brain tumor segmentation55 demonstrated that augmentation can provide significant improvements even if the dataset is large. The image and target are first randomly rotated and scaled. Both rotation and scaling are applied with a probability of 0.75; the image and target are rotated by an angle uniformly sampled from [−30°, 30°], and the width and height are scaled (independently) by a scale factor uniformly sampled from [0.9, 1.1]. Then the four input channels are augmented by: adding Gaussian noise (applied with probability 0.5; zero mean, standard deviation uniformly sampled from [0.0, 0.05]), blurring the image (applied with probability 0.2; Gaussian blurring with standard deviation sampled from [0.5, 1.0]), faking lower resolution (zooming with a factor between 0.75 and 1.0 and then upsampling), and changing the gamma factor (scaling it with a factor between 0.8 and 1.2). Lastly, the input image channels are normalized using Z-score normalization.
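A minimal sketch of the intensity part of this augmentation (the geometric transforms are omitted for brevity) is given below; the application probabilities for the low-resolution and gamma steps, and the exact form of the gamma transform, are assumptions, as they are not fully specified above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.transform import resize

def augment_intensity(channels, rng=np.random):
    """channels: float array of shape (4, H, W) with the four MR sequences of one slice."""
    out = channels.astype(np.float32).copy()
    if rng.rand() < 0.5:   # additive Gaussian noise
        out += rng.normal(0.0, rng.uniform(0.0, 0.05), size=out.shape)
    if rng.rand() < 0.2:   # Gaussian blur
        sigma = rng.uniform(0.5, 1.0)
        out = np.stack([gaussian_filter(c, sigma) for c in out])
    # simulate lower resolution: downsample by a factor in [0.75, 1.0], then upsample back
    factor = rng.uniform(0.75, 1.0)
    h, w = out.shape[1:]
    low_shape = (max(1, int(h * factor)), max(1, int(w * factor)))
    out = np.stack([resize(resize(c, low_shape, order=0), (h, w), order=3) for c in out])
    # gamma transform on the normalized intensities (exact formulation assumed)
    gamma = rng.uniform(0.8, 1.2)
    mn, mx = out.min(), out.max()
    out = ((out - mn) / (mx - mn + 1e-8)) ** gamma * (mx - mn) + mn
    # Z-score normalization per channel
    out = (out - out.mean(axis=(1, 2), keepdims=True)) / (out.std(axis=(1, 2), keepdims=True) + 1e-8)
    return out
```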

Swin transformers

The Swin transformer segmentation network was implemented and trained using the MMSegmentation library56. The architecture employed is the Swin-Base variant, as implemented in MMSegmentation, with a window size of 7 and a patch size of 4 × 4. The original Swin transformer was designed for 3-channel RGB images; hence, to accommodate MRI scans with four modalities per slice, the number of input channels was changed to 4. Consequently, the input dimension of the Swin transformer is 256 × 256 × 4, where '4' denotes the number of channels and '256' represents both the height and width of each modality slice. The only alterations to the original Swin-Base architecture are the number of input channels and the number of classes, both set to 4.
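For illustration, a partial MMSegmentation-style config reflecting these adaptations is sketched below. The backbone field names follow the library's Swin-Base examples; the choice of decode head (UPerHead here) and its settings are assumptions, as they are not specified above.

```python
# Partial config sketch (not the complete training config used in this work)
model = dict(
    type='EncoderDecoder',
    backbone=dict(
        type='SwinTransformer',
        in_channels=4,              # four MR sequences instead of 3-channel RGB
        patch_size=4,
        window_size=7,
        embed_dims=128,             # Swin-Base width
        depths=(2, 2, 18, 2),
        num_heads=(4, 8, 16, 32),
    ),
    decode_head=dict(
        type='UPerHead',            # decode head assumed; not stated in the text
        in_channels=[128, 256, 512, 1024],
        in_index=[0, 1, 2, 3],
        channels=512,
        num_classes=4,              # background + three tumor classes
    ),
)
```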

The network was trained using the DiceCELoss function. The Dice loss component does not take the background class (label 0) into account. The loss per batch was derived by calculating the loss for each training image and taking the mean.

The AdamW optimizer with β_1 = 0.9, β_2 = 0.999 and weight decay λ = 0.01 was used. Additionally, a learning rate scheduler with warmup and linear decay was employed. The warmup ratio was set to 1e−6. The learning rate was increased during the first 1500 warmup iterations, reaching 0.00006, and was then decreased linearly until it reached 0.0 at the end of training. The models were trained for 25 epochs, and after every epoch a validation loss was calculated. The batch size was set to 8, and the last batch was dropped if it was not the same size as the other batches, to ensure that all models are provided with batches of consistent size.
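A minimal PyTorch sketch of an equivalent optimizer and schedule (stepped once per iteration) is shown below; MMSegmentation configures this internally, so this is only an illustration of the described settings.

```python
import torch

def make_swin_optimizer(model, total_iters, warmup_iters=1500, peak_lr=6e-5, warmup_ratio=1e-6):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.999), weight_decay=0.01)

    def lr_lambda(step):
        if step < warmup_iters:
            # linear warmup from warmup_ratio * peak_lr up to peak_lr
            return warmup_ratio + (1.0 - warmup_ratio) * step / warmup_iters
        # linear decay from peak_lr down to 0 at the end of training
        return max(0.0, 1.0 - (step - warmup_iters) / max(1, total_iters - warmup_iters))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```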

Data augmentation was performed using five techniques inspired by Cirillo et al.55. Images and their corresponding segmentations underwent rotation, with angles from 0° to 30° chosen randomly, and scaling, with axis factors varying within ±20%. Images were also subjected to a 50% chance of either horizontal or vertical flipping. Brightness was adjusted through a power-law γ intensity transformation, with parameters randomly picked between 0.8 and 1.2. Lastly, elastic deformation, following the methodology from the original U-Net paper42, was applied using a deformation grid with normally distributed displacements (σ = 2 voxels), smoothed with a third-order spline filter.
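A minimal sketch of such an elastic deformation is given below: random displacements (σ = 2 voxels) on a coarse grid are upsampled to the image size with a third-order spline and applied to the image and its annotation. The grid size is an assumption, as it is not specified above.

```python
import numpy as np
from scipy.ndimage import map_coordinates
from skimage.transform import resize

def elastic_deform(channels, label, grid_size=3, sigma=2.0, rng=np.random):
    """channels: (C, H, W) MR slices, label: (H, W) annotation. grid_size is assumed."""
    c, h, w = channels.shape
    # coarse random displacement fields, upsampled with a third-order spline
    dy = resize(rng.normal(0.0, sigma, (grid_size, grid_size)), (h, w), order=3, mode="reflect")
    dx = resize(rng.normal(0.0, sigma, (grid_size, grid_size)), (h, w), order=3, mode="reflect")
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([ys + dy, xs + dx])
    warped = np.stack([map_coordinates(ch, coords, order=3, mode="reflect") for ch in channels])
    warped_label = map_coordinates(label, coords, order=0, mode="reflect")  # nearest for labels
    return warped, warped_label
```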

Qualitative evaluation by neuroradiologist

To evaluate how the synthetic images are perceived by a clinician, a total of 600 4-sequence-panel images (T1w, T1wGd, T2w, T2w FLAIR) were presented to a neuroradiologist with 13 years of experience (co-author IB). The task was to determine if each presented 4-panel image was real or synthetic. The 600 images consisted of 100 real images and 500 synthetic images, 100 from each of the five generative models (progressive GAN, StyleGAN 1–3, diffusion model). The total number of real images, and the number of synthetic images per generative model, was known to the neuroradiologist before starting the evaluation. The real and synthetic images were presented in a random order. The evaluation took approximately 12 hours. Figure 5 shows an example of a real and a synthetic image presented during the evaluation.

Fig. 5
figure 5

Left: a real 4-channel image shown during the qualitative evaluation, where the task was to classify each example as real or synthetic. Right: a synthetic 4-channel image shown during the qualitative evaluation.

Ethics

This research study was conducted retrospectively using anonymized human subject data made available by BraTS. The ethical review board of Linköping decided that no further ethical approval was required.