DoseGAN: a generative adversarial network for synthetic dose prediction using attention-gated discrimination and generation

Deep learning algorithms have recently been developed that utilize patient anatomy and raw imaging information to predict radiation dose, as a means to increase treatment planning efficiency and improve radiotherapy plan quality. Current state-of-the-art techniques rely on convolutional neural networks (CNNs) that use pixel-to-pixel loss to update network parameters. However, stereotactic body radiotherapy (SBRT) dose is often heterogeneous, making it difficult to model using pixel-level loss. Generative adversarial networks (GANs) utilize adversarial learning that incorporates image-level loss and is better suited to learn from heterogeneous labels. However, GANs are difficult to train and rely on compromised architectures to facilitate convergence. This study suggests an attention-gated generative adversarial network (DoseGAN) to improve learning, increase model complexity, and reduce network redundancy by focusing on relevant anatomy. DoseGAN was compared to alternative state-of-the-art dose prediction algorithms using heterogeneity index, conformity index, and various dosimetric parameters. All algorithms were trained, validated, and tested using 141 prostate SBRT patients. DoseGAN was able to predict more realistic volumetric dosimetry compared to all other algorithms and achieved statistically significant improvement compared to all alternative algorithms for the V100 and V120 of the PTV, V60 of the rectum, and heterogeneity index.

www.nature.com/scientificreports/ gross tumor volume (GTV) 25,26 . Since conventional CNNs learn to predict the most probable dose, they are not well suited to model SBRT or SRS dose distributions 20,27,28 .
Recently, generative adversarial networks have been used to facilitate realistic predictions by training a secondary CNN to distinguish real from fake predictions [29][30][31][32] . The generator CNN aims to create realistic predictions that fool a discriminator CNN, which attempts to classify realism. The two networks are trained adversarially until a Nash equilibrium is reached, which is the minimax loss of the aggregate training protocol 33 . Since the two networks need to be trained in unison, the discriminator network is usually shallow, with fewer parameters than stand-alone classification CNNs such as VGG-16, ResNet-151, or DenseNet-201 architectures 29 . However, conventional GANs rely on the discriminator's ability to distinguish fake predictions from real predictions, so the overall performance is limited by the discriminator's ability to decipher realism 34 .
Attention gates have recently emerged to help networks highlight relevant anatomy and suppress irrelevant information by encouraging compatibility between the input, intermediate layers, and output function of the network 35,36 . Additive self-attention gates have been proposed to encourage parsimonious feature propagation throughout a network [37][38][39] . Spatial self-attention allows networks to selectively emphasize portions of the intermediate convolutional layers, as opposed to indiscriminately propagating information using conventional raster scanning.
This study suggests a novel attention-gated generative adversarial network (DoseGAN) as a superior alternative to current state-of-the-art dose prediction networks. DoseGAN offers deeper and more efficient discrimination, while simultaneously being efficient enough to train in unison with the generator network.

Methods and materials
Attention gated generation and discrimination. DoseGAN utilizes attention-gated generation and discrimination networks that selectively propagate information through a gating mechanism. The attention gates enable the networks to highlight relevant input features and help suppress redundant information propagation through the network. The gating mechanism also helps encourage compatibility between the output function and the extracted intermediate local feature vectors in each network 35,36 . DoseGAN utilizes additive self-attention gates to modulate multi-scale level feature response propagation throughout each network [37][38][39] .
The attention-gating mechanism applies a 1 × 1 × 1 convolutional kernel to a propagation signal ($z_1$) and to a gating signal ($z_2$). The convolved signals are added together, and the combined activations ($z_{1,2}$) are ReLU activated before being passed through another 1 × 1 × 1 convolutional kernel. The output is batch normalized and sigmoidally activated to form $x_{1,2}$. The final gated output signal ($z_g$) is formed by multiplying $z_1$ by $x_{1,2}$. Figure 1 depicts the attention-gating mechanism used in the discriminator and generator networks.
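Because 1 × 1 × 1 convolutions act on each voxel independently, the gate can be illustrated for a single voxel's feature vector. The following is a minimal sketch, not the authors' implementation. It assumes that the gating signal has already been resampled to the resolution of the propagation signal, that the second convolution collapses the features to one attention coefficient per voxel, and that batch normalization runs in inference mode with fixed statistics. All function and parameter names are hypothetical.

```python
import math


def conv1x1(weights, bias, features):
    """Apply a 1x1x1 convolution at one voxel: a linear map over channels."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def attention_gate_voxel(z1, z2, w1, b1, w2, b2, w_psi, b_psi,
                         bn_mean=0.0, bn_var=1.0, eps=1e-5):
    """Gate the propagation features z1 with the gating features z2 at one voxel.

    Returns the gated features z_g and the attention coefficient x_12.
    """
    a1 = conv1x1(w1, b1, z1)                           # 1x1x1 conv on the propagation signal
    a2 = conv1x1(w2, b2, z2)                           # 1x1x1 conv on the gating signal
    z12 = [max(0.0, u + v) for u, v in zip(a1, a2)]    # add, then ReLU
    logit = conv1x1([w_psi], [b_psi], z12)[0]          # assumed: collapse to one channel
    normed = (logit - bn_mean) / math.sqrt(bn_var + eps)  # batch norm with fixed statistics
    x12 = 1.0 / (1.0 + math.exp(-normed))              # sigmoid, so 0 < x12 < 1
    return [x12 * f for f in z1], x12


# Hypothetical weights: two propagation channels, three gating channels.
gated, coeff = attention_gate_voxel(
    z1=[1.0, -2.0], z2=[0.5, 0.5, 0.5],
    w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    w2=[[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], b2=[0.0, 0.0],
    w_psi=[0.8, -0.3], b_psi=0.1,
)
print(coeff, gated)
```

A coefficient near 0 suppresses the voxel's features and a coefficient near 1 passes them through almost unchanged, which is how the gate reduces redundant propagation.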
DoseGAN utilizes an attention-aware 3D encoder-decoder variation of the pix2pix generator network 29 . The generator network is five multi-scale levels deep and selectively propagates encoder information directly to the decoder stage through attention-gated skip connections. All convolutional layers, except for those residing in the gating mechanism, use 4 × 4 × 4 convolutional kernels with synchronized batch normalization and leaky ReLU activations. The last layer in the generator network uses hyperbolic tangent activation. The CT, planning target volume (PTV), and organs at risk (OARs) are concatenated and used by the generator network to predict synthetic dose volumes. The predicted synthetic dose and real dose volumes are fed into a densely connected attention-gated discriminator network, which utilizes "PatchGAN" classification to predict a realism matrix that selectively captures local style characteristics 40,41 . The discriminator network comprises 8 convolutional layers, with 3 convolutional downsampling layers that incrementally reduce the multi-scale resolution of the network. The first layer of each multi-scale level is concatenated to the last layer of each multi-scale level through attention-gated dense connections. The last convolutional layer of each multi-scale level is used as the gating signal for the attention-gated skip connections. Figure 2 shows a schematic of the attention-gated discriminator and generator networks.
Ground truth. DoseGAN was trained and validated using 126 prostate cancer patients previously treated with SBRT using a CyberKnife (Accuray, Sunnyvale) machine. An additional 15 test patients were used to report final results, following Kaggle-style competition rules. All patients received a monotherapy dose regimen of 38 Gy in 4 fractions, or a 19 Gy boost in 2 fractions, and all treatment plans followed peer-reviewed acceptance criteria 42 .
simultaneously classify predicted dose volumes ($D_{\mathrm{Fake}}$) as 0. DoseGAN uses mean aggregate categorical cross entropy loss from the discriminator and voxel-to-voxel (L1) loss from the generator to update network parameters during training. Introducing L1 loss helps facilitate convergence and enforce spatial congruence in the conditional GAN context. To avoid multiple hypothesis testing, patients were separated into training, validation, and testing groups prior to training. In order to mimic the planning environment of the dosimetrist, the model was agnostic to demographic information and only considered the raw CT image, PTV, OARs, and prescription.
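The combination of adversarial and L1 terms can be sketched in a few lines. This is an illustration under stated assumptions, not the training code used in the study. It assumes real dose volumes are labeled 1 (the text only states that predicted volumes are labeled 0), treats the two-class cross entropy in its binary form, averages the real and fake discriminator terms, and uses an L1 weight of 100 borrowed from the original pix2pix formulation. The function names and the `l1_weight` value are hypothetical with respect to DoseGAN.

```python
import math


def binary_cross_entropy(probs, target, eps=1e-7):
    """Mean cross entropy between realism probabilities and one constant label."""
    total = 0.0
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / len(probs)


def l1_loss(pred, real):
    """Mean absolute voxel-to-voxel difference between two flattened dose volumes."""
    return sum(abs(p - r) for p, r in zip(pred, real)) / len(pred)


def discriminator_loss(patch_probs_real, patch_probs_fake):
    """Real patches should score 1 (assumed label), predicted patches 0."""
    return 0.5 * (binary_cross_entropy(patch_probs_real, 1.0)
                  + binary_cross_entropy(patch_probs_fake, 0.0))


def generator_loss(patch_probs_fake, pred_dose, real_dose, l1_weight=100.0):
    """The generator tries to make predicted patches score 1 while staying close in L1."""
    adversarial = binary_cross_entropy(patch_probs_fake, 1.0)
    return adversarial + l1_weight * l1_loss(pred_dose, real_dose)


# Toy realism matrix (four patches) and a four-voxel dose volume in Gy.
print(discriminator_loss([0.9, 0.8, 0.95, 0.7], [0.2, 0.1, 0.3, 0.25]))
print(generator_loss([0.2, 0.1, 0.3, 0.25], [37.0, 40.0, 45.0, 20.0], [38.0, 41.0, 46.0, 19.0]))
```

Averaging over the entries of the realism matrix is what makes the adversarial term sensitive to local patches rather than to a single whole-volume score.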
DoseGAN was implemented on an NVIDIA V100 graphics processing unit (GPU). Data augmentation was conducted on the fly with the PyTorch data loader using random rigid shifts, rotations, noise, and histogram intensity re-distribution. DoseGAN inference took 0.31 s to predict a 128 × 128 × 64 voxel synthetic dose volume and rescale it to its original resolution. The output and input resolutions of DoseGAN were 3 mm × 3 mm × 3 mm.
The data used for this study are not publicly available due to sensitive medical information, but are available from the corresponding author on reasonable request. Use of all patient data was approved by the Institutional Review Board (IRB), and the data have been fully anonymized. The methods used in this study were performed in accordance with the University of California San Francisco institutional guidelines. IRB number 14-15452 allowed us to retrospectively collect and analyze our patient dataset. Since this study used retrospective data, informed consent was not required.
Dosimetric evaluation. DoseGAN was compared to a fully-connected neural network that uses relative distance map information of neighboring input structures (FC), U-Net (UNet), DoseNet, and a 3D GAN architecture (GAN) 18,29,[43][44][45] .
All algorithms were hyperparameter tuned, and the model with the best validation performance was saved and used for inference on the final test set to report final results. The FC model followed the original model architecture reported in Shiraishi et al., and was trained with 0.45 dropout, a batch size of 4, and a learning rate of 0.01 using Adam optimization 18 . U-Net followed the implementation of the U-Net architecture reported in Kearney et al. and was trained with 0.2 dropout, a batch size of 4, and a learning rate of 0.005 using Adam optimization 21 . DoseNet followed the original implementation reported in Kearney et al. and was trained with a dropout of 0.35, a batch size of 2, and a learning rate of 0.001. For our GAN architecture, we used a 3D implementation of pix2pix by Isola et al. and trained it with a dropout of 0.0, a batch size of 2, and an adaptive learning rate scheduler 26 . We kept the architectures the same as, or as similar as possible to, their original successful forms; however, we conducted a rigorous hyperparameter search to ensure optimal performance on our dataset and a fair comparison. Each algorithm was allowed to max out the memory of the GPU. All models automatically picked the maximum number of parameters before exceeding the memory threshold.
The heterogeneity index (HI), conformity index (CI), and several dose volume objectives were used to evaluate the dosimetric congruence between the synthetic dose predictions and the real ground truth dose. The HI is defined as $\mathrm{HI} = D_{\max} / D_p$, where $D_p$ denotes the prescription and $D_{\max}$ denotes the maximum dose value 46 . The CI is defined as $\mathrm{CI} = (TV_{PIV})^2 / (TV \times PIV)$, where $TV$ is the target volume, $TV_{PIV}$ is the intersection of the target volume and the prescription isodose volume, and $PIV$ is the prescription isodose volume 47 .
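The two indices, together with a generic dose volume objective, can be computed from a flattened dose grid and a binary structure mask. The sketch below is illustrative only. It assumes uniform voxels (so volumes reduce to voxel counts), takes the maximum over whatever dose array is passed in, and interprets a dose volume objective such as V100 as the percentage of a structure receiving at least that percentage of the prescription, which the text does not spell out. Function names and the toy data are hypothetical.

```python
def heterogeneity_index(dose, prescription):
    """HI = D_max / D_p over the supplied dose values."""
    return max(dose) / prescription


def conformity_index(target_mask, dose, prescription):
    """CI = TV_PIV^2 / (TV * PIV), with volumes counted in voxels."""
    tv = sum(1 for t in target_mask if t)
    piv = sum(1 for d in dose if d >= prescription)
    tv_piv = sum(1 for t, d in zip(target_mask, dose) if t and d >= prescription)
    if tv == 0 or piv == 0:
        return 0.0
    return tv_piv ** 2 / (tv * piv)


def volume_at_dose(mask, dose, prescription, percent):
    """Percentage of the masked structure receiving >= percent of the prescription."""
    threshold = prescription * percent / 100.0
    inside = [d for m, d in zip(mask, dose) if m]
    return 100.0 * sum(1 for d in inside if d >= threshold) / len(inside)


# Toy six-voxel example (dose in Gy); the first three voxels form the target.
dose = [40.0, 38.0, 30.0, 20.0, 45.0, 10.0]
target = [1, 1, 1, 0, 0, 0]
print(heterogeneity_index(dose, 38.0))
print(conformity_index(target, dose, 38.0))
print(volume_at_dose(target, dose, 38.0, percent=100))
```

In this toy case two of the three target voxels reach the prescription while one voxel outside the target also does, so the CI is penalized for both the under-covered target voxel and the spill outside it.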
DoseGAN predicts the most realistic dose volume given a set of arbitrary input anatomy, as opposed to the best possible dose distribution. Comparator p-values, from a one-sided two-sample Mann-Whitney U test, were used to test whether DoseGAN was statistically superior to each alternative dose prediction algorithm. P-values less than 0.05 were considered significant. Tables 1 and 2 show the mean values, mean absolute differences between the real dose and each algorithm, and the comparator p-values between DoseGAN and each alternative algorithm. Table 1 shows the PTV V95, V100, V120, and HI for all dose volumes. DoseGAN achieved a statistically significant improvement compared to all alternative algorithms for the V100 and V120 of the PTV and the HI. Table 2 shows the CI, V60 of the bladder, V60 of the rectum, and mean dose of the penile bulb for all dose volumes. DoseGAN achieved a statistically significant improvement compared to all alternative algorithms for the V60 of the rectum. Figure 3 shows the real dose, DoseGAN predicted synthetic dose, and dose difference for two patients. DoseGAN was able to achieve realistic synthetic dose predictions compared to the original real plans, as seen in Fig. 3. Figure 4 shows the dose volume histograms (DVHs) and DVH differences between the real dose distributions and DoseGAN synthetic dose distributions for the PTV, urethra, bladder, rectum, and penile bulb for a 38 Gy plan. DVHs represent the radiation dose to tissue volume, and the DVH differences represent the difference between the planned DVH and the predicted DVH. Figure 5 depicts the loss at each epoch for the DoseGAN algorithm. The L1 loss from the generator and the discriminator losses can be seen progressing in unison during model training.
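A one-sided Mann-Whitney U comparison can be sketched as follows. This is a simplified illustration, not the statistical code used in the study. It assumes the test is applied to per-patient absolute differences from the real dose (the text does not state the exact input), and it uses the normal approximation with a continuity correction and no tie correction in the variance. The function name and the error values are hypothetical.

```python
import math


def mann_whitney_u_less(x, y):
    """One-sided test of whether values in x tend to be smaller than values in y.

    Returns the U statistic for x and an approximate p-value.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs in which x exceeds y; ties contribute one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean + 0.5) / sd                 # continuity correction
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z
    return u, p


# Hypothetical absolute errors (percentage points) for two algorithms.
errors_a = [1.1, 0.8, 1.5, 0.9, 1.2, 0.7, 1.0, 1.3]
errors_b = [2.4, 1.9, 3.1, 2.2, 1.6, 2.8, 2.0, 2.5]
u_stat, p_value = mann_whitney_u_less(errors_a, errors_b)
print(u_stat, p_value, p_value < 0.05)
```

A small U means the first algorithm's errors rarely exceed the second's, which is the direction the one-sided test rewards.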

Discussion
This study demonstrates the superiority of a novel conditional attention-gated generative adversarial network for SBRT synthetic dose prediction. This is the first application of attention-gated generative adversarial networks to this problem.
Table 1. The mean values, mean absolute differences between the real dose and each algorithm, and the comparator p-values between DoseGAN and each alternative algorithm are shown for the V95, V100, and V120 of the PTV as well as the HI. Bold font indicates the least difference between the real and predicted dose and statistically significant p-values.
Table 2. The mean values, mean absolute differences between the real dose and each algorithm, and the comparator p-values between DoseGAN and each alternative algorithm are shown for the CI, V60 of the bladder, V60 of the rectum, and mean dose of the bulb. Bold font indicates the least difference between the real and predicted dose and statistically significant p-values.
On average, DoseGAN was able to achieve more realistic dose predictions compared to all other algorithms by learning a realism matrix that helped mimic the dosimetric nuances of real clinical SBRT plans. DoseGAN achieved statistically significant improvement compared to all alternative algorithms for the V100 and V120 of the PTV, HI, and V60 of the rectum. The conventional GAN algorithm achieved good results for the V95 of the PTV, CI, and V60 of the bladder, but did not perform as well as DoseGAN for the V100 and V120 of the PTV, and V60 of the rectum. Similarly, DoseNet achieved good results for the V95 of the PTV, mean dose of the penile bulb, and V60 of the bladder, but did not perform as well as DoseGAN for the V100 and V120 of the PTV, HI, V60 of the rectum, and mean dose of the penile bulb. Table 1 shows that DoseGAN performs much better than the alternative algorithms for the target V120 and HI. While conventionally fractionated dose regimens tend to have much smoother dose distributions, SBRT plans tend to have intentional hotspots within the main tumor volume.
The alternative algorithms consistently predicted lower target V120 and HI values, meaning that the plans have less dose escalation within the target volume and implying a loss in clinical efficacy. Table 2 shows that DoseGAN performed better at predicting the V60 of the rectum and the V60 of the bladder, which is partially due to the stochastic nature of SBRT plans. Pure spatial loss algorithms failed to model the hot or cold spots within the sensitive organs. All algorithms performed well for the mean bulb dose, since this metric takes the average dose to the structure and is more forgiving than metrics for structures that are more sensitive to hot spots. All algorithms also performed well for the CI, since the CI is a measurement of the target coverage and our dataset of dose volumes was fairly consistent with regard to this metric.
The models with pure spatial loss tended to produce overly smooth synthetic dose distributions and were not able to capture the heterogeneous hotspots and cold spots that are endemic to SBRT dose volumes. Pure spatial loss, such as mean squared error between the dose volumes, will produce the most likely dose at each voxel given a set of inputs. However, in the presence of dose heterogeneity or inconsistent planner preferences, conventional CNNs will learn to predict a best approximation of the dose in order to reconcile the inconsistent dose targets with respect to the input variables. Since conventional CNNs reach a compromise with respect to varied learning objectives, they are inherently disadvantaged compared to architectures that do not rely on pure spatial loss, such as GANs.
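The smoothing effect of a pure spatial loss can be reproduced with a toy example. The sketch below is not taken from the study. It assumes two hypothetical, equally likely plans for the same anatomy whose hotspots fall on different voxels, and relies on the standard result that the prediction minimizing expected mean squared error is the voxel-wise average of the targets. All names and dose values are invented for illustration.

```python
def mse(pred, target):
    """Mean squared error between two flattened dose volumes."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Two hypothetical plans (Gy) for identical inputs, with hotspots on different voxels.
PRESCRIPTION = 38.0
plan_a = [38.0, 46.0, 38.0, 38.0]
plan_b = [38.0, 38.0, 38.0, 46.0]


def expected_mse(pred):
    """Expected loss when either plan is equally likely to be the training target."""
    return 0.5 * (mse(pred, plan_a) + mse(pred, plan_b))


# The voxel-wise mean minimizes the expected squared error ...
voxelwise_mean = [(a + b) / 2.0 for a, b in zip(plan_a, plan_b)]

# ... but it blurs both hotspots, so its heterogeneity index is lower than either plan's.
for name, plan in (("plan A", plan_a), ("plan B", plan_b), ("MSE optimum", voxelwise_mean)):
    print(name, "expected MSE:", expected_mse(plan), "HI:", max(plan) / PRESCRIPTION)
```

Either real plan incurs a higher expected loss than the blurred average, so a network trained only on this loss is pushed toward the compromise, consistent with the lower V120 and HI values predicted by the pure spatial loss models.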
Since GANs are difficult to train, the number of network parameters needs to be kept as low as possible to facilitate adversarial training. Attention gates were used to reduce redundancy within the network, improve efficiency, and facilitate model convergence, which enabled a deeper discriminator architecture. The realism matrix was able to incorporate broader dosimetric information, since it uses a deeper discriminator, which allows for a wider receptive field. The model architecture of each algorithm, including the depth, number of filters at each layer, and other hyperparameters, was determined using the validation set and was designed to stay within the memory limitations of the GPU hardware used in this study. Since GANs are notoriously difficult to train, DoseGAN borrowed many architectural design elements from the original pix2pix network, such as the size of each convolutional kernel and the relative location and type of various network activations.
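The link between discriminator depth and receptive field can be made concrete with the standard recurrence for stacked convolutions: each layer adds (kernel − 1) times the current stride product to the receptive field. The sketch below applies it per axis. The first configuration is the familiar 70 × 70 PatchGAN from pix2pix; the second is a hypothetical eight-layer stack with three stride-2 layers, chosen only to match the layer counts in the text. The ordering of its strides is an assumption, and the sketch ignores the dense connections, which add shorter paths through the network.

```python
def receptive_field(layers):
    """Receptive field along one axis for a chain of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # striding spreads later layers' taps further apart
    return rf


# pix2pix PatchGAN: three stride-2 layers followed by two stride-1 layers, all 4-wide.
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

# Hypothetical deeper stack: 8 layers, 3 of which downsample (ordering assumed).
deeper = [(4, 1), (4, 2), (4, 1), (4, 2), (4, 1), (4, 2), (4, 1), (4, 1)]

print(receptive_field(patchgan))  # 70, matching the 70 x 70 PatchGAN
print(receptive_field(deeper))
```

Under these assumptions the deeper stack sees a wider context per realism-matrix entry than the shallow PatchGAN, which is the mechanism by which a deeper discriminator can incorporate broader dosimetric information.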
This study has some limitations. Since this study was only conducted on SBRT prostate patients, it is not clear whether this approach would work for non-SBRT plans. In addition, DoseGAN was trained to predict dose volumes with a 3 × 3 × 3 mm³ voxel resolution. Although this resolution is clinically acceptable, typical SBRT dose calculations tend to use 1 × 1 × 1 mm³ or 2 × 2 × 2 mm³ voxel resolutions. Increasing the resolution of DoseGAN would increase the number of parameters, change the receptive field of the model, and require more GPU memory. More extensive hyperparameter tuning and greater hardware resources would also be necessary to determine the viability of finer resolution dose prediction. The number of parameters for each model was restricted by the GPU memory, since only one GPU was used in this study. However, the number of parameters is not the only determining factor in memory allocation. Each intermediary output layer is held in GPU memory, so networks that have more layers at higher resolutions will be more memory intensive. Hyperparameter tuning ensured a balance between memory utilization at the upper and lower multi-scale levels. Since the hyperparameter tuning stage automatically picked the upper memory limit for each model, we can assume that each model would have achieved better results with a bigger batch size and more parameters 48 . Furthermore, DoseGAN was only evaluated on abdominal anatomy, so it cannot be assumed that DoseGAN will work on other anatomical regions.
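The memory argument can be illustrated with a back-of-the-envelope estimate of the storage needed for a single layer's output. The sketch below is a rough illustration, not a measurement from the study. It assumes 32-bit floating point activations, takes the 128 × 128 × 64 volume size from the text as the 3 mm grid, and uses hypothetical channel counts of 64 and 128. It ignores gradients, optimizer state, and framework overhead.

```python
def activation_bytes(shape, channels, batch_size=1, bytes_per_value=4):
    """Bytes needed to hold one layer's output (float32 by default)."""
    depth, height, width = shape
    return batch_size * channels * depth * height * width * bytes_per_value


MIB = 1024 ** 2
coarse = (128, 128, 64)              # volume size reported for the 3 mm grid
fine = tuple(3 * n for n in coarse)  # same field of view at 1 mm spacing

for label, shape in (("3 mm grid", coarse), ("1 mm grid", fine)):
    top = activation_bytes(shape, channels=64)                            # full-resolution level
    below = activation_bytes(tuple(n // 2 for n in shape), channels=128)  # one level down
    print(label, top / MIB, "MiB at full resolution,", below / MIB, "MiB one level down")
```

Halving each spatial dimension cuts the voxel count by a factor of eight, so even with twice the channels a lower level costs a quarter of the memory, whereas moving from 3 mm to 1 mm voxels multiplies every level by 27.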
In spite of these limitations, dose prediction using attention-aware generative adversarial networks presents a viable solution to dose prediction for prostate SBRT patients. Clinically incorporating DoseGAN would help conserve hospital resources by determining achievable plan dosimetry at the time of CT simulation as opposed to after the entire treatment planning process. Furthermore, DoseGAN could be used as a clinical decision support tool or be incorporated into the plan optimization process, to help improve plan quality and reduce the strain on clinical resources.

Conclusions
We have developed a novel attention-aware generative adversarial network for synthetic dose prediction that was able to achieve superior dose prediction accuracy compared to current alternative state-of-the-art methods. DoseGAN presents a solution to overcome the challenges of realistic volumetric dose prediction in the presence of diverse patient anatomy.