A novel deep learning conditional generative adversarial network for producing angiography images from retinal fundus photographs

Fluorescein angiography (FA) is a procedure used to image the vascular structure of the retina and requires the injection of an exogenous dye with potential adverse side effects. Currently, there is only one alternative non-invasive system based on optical coherence tomography (OCT) technology, called OCT angiography (OCTA), capable of visualizing retinal vasculature. However, due to its cost and limited view, OCTA technology is not widely used. Retinal fundus photography is a safe imaging technique used for capturing the overall structure of the retina. In order to visualize retinal vasculature without the need for FA and in a cost-effective, non-invasive, and accurate manner, we propose a deep learning conditional generative adversarial network (GAN) capable of producing FA images from fundus photographs. The proposed GAN produces anatomically accurate angiograms, with similar fidelity to FA images, and significantly outperforms two other state-of-the-art generative algorithms ( p < .001 and p < .0001 ). Furthermore, evaluations by experts show that our proposed model produces FA images of such high quality that they are indistinguishable from real angiograms. Our model, as the first application of artificial intelligence and deep learning to medical image translation, employs a theoretical framework capable of establishing a shared feature-space between two domains (i.e. funduscopy and fluorescein angiography) and provides an unrivaled way to translate images from one domain to the other.

For a long time, fluorescein angiography (FA) combined with retinal funduscopy has been used for diagnosing retinal vascular and pigment epithelial-choroidal diseases 1 . The procedure requires the injection of a fluorescent dye which, depending on the age and cardiovascular structure of the eye, appears in the optic vascular system within 8-12 s and can stay up to 10 min 2 . Although generally considered safe, there have been reports of mild to severe complications due to allergic reactions to the dye [3][4][5] . Side effects range from nausea and heart attack to anaphylactic shock and death [6][7][8][9][10] . In addition, leakage of fluorescein at the injection site can occur.
Given the complications and risks associated with this procedure, non-invasive, affordable, and computationally effective alternatives to FA are much needed. The only current alternative to fluorescein angiography (FA) for visualizing retinal vasculature relies on additional hardware and software modifications to Optical Coherence Tomography (OCT) 11,12 , called OCT Angiography (OCTA) 13,14 . Despite their ability to generate visual blood flow maps without the adverse side effects of FA, OCTA systems are not widely used for assessing retinal vascular diseases, due to their cost, the need for multiple acquisitions in the same anatomical location 15 , and a limited field of view (FOV). In addition, the recent COVID-19 pandemic has had a significant negative impact on ophthalmologists' ability to conduct in-clinic exams 16 , demonstrated the limitations of the current state of tele-ophthalmology 17 , and highlighted the need for developing effective, low-cost, and reliable alternatives for both in-home and in-clinic measurements.
The introduction of convolutional neural networks and a gradient-based optimization regime for training these networks by LeCun et al. 18 sparked the subsequent deep learning revolution 19 in the field of Artificial Intelligence (AI), significantly improving the performance of visual object recognition.

Results
We designed a conditional generative adversarial network (GAN) comprising two generator modules and four discriminator modules (Fig. 1A) that takes fundus photographs and produces anatomically accurate FA images inferred from them. The generator block consists of two generator modules, the fine and coarse generators, both designed as U-shaped encoder-decoders. The coarse generator comprises a reflection+padding block, three convolution (Conv)+batch normalization (BN)+leaky rectified linear unit (ReLU) blocks, and four novel residual blocks 44,45 (ResBlk), followed by two transpose convolution (Deconv) layers, one reflection+padding block, one Conv layer, and an output activation layer (Fig. 1B-left); it is responsible for generating the coarse, global structures of the FA image, such as the macula, optic disc, color, contrast, and brightness. The fine generator comprises one reflection+padding block, one Conv+BN+ReLU block, and one Conv layer, followed by three ResBlk, one Deconv layer, one Conv layer, and an output activation layer (Fig. 1B-right); it produces local information, including retinal venules, arterioles, hemorrhages, exudates, and microaneurysms. The output of the last ResBlk of the coarse generator is added to the first Conv layer of the fine generator to integrate the global features from the coarse generator with the local information in the fine generator. The discriminator blocks of the proposed network are encoders tied to a final fully connected binary classification layer; each takes a pair of real and generated FA images and decides which one is real. The fine discriminators take the pair of real and generated images at full resolution, while the coarse discriminators take them at half resolution (Fig. 1A). Each discriminator comprises an initial Conv layer and Conv+BN+ReLU layers, followed by one last Conv layer and the output activation layer (Fig. 1B-bottom).
In our implementation, the fine generator has 170,305 trainable parameters, the coarse generator has 6,695,041, and each of the four discriminators has 234,785, for a total of around 7.8 million parameters.
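The stated totals can be checked with simple arithmetic, using only the per-module counts given above:

```python
# Trainable-parameter counts stated in the text above.
fine_generator = 170_305
coarse_generator = 6_695_041
discriminator = 234_785  # each of the four discriminators

total = fine_generator + coarse_generator + 4 * discriminator
print(f"{total:,}")  # 7,804,486 -> "around 7.8 million"
```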
The models were trained over 100 epochs, each comprising 212 iterations, in a minimax setup in which the generators attempt to produce realistic FA images and the discriminators attempt to correctly identify whether a given FA image is real or produced by the generators. The loss values for the coarse and fine generators, as well as the combined loss values of the discriminator block, are shown in Fig. 1C. The black dotted curve is the discriminator loss, while the solid blue and dashed red curves are the fine and coarse generator losses, respectively. At the beginning of training, the generators produce images sampled at random from a latent representation of FA images, so the discriminators can easily distinguish real from generated FA images. This can be observed as the smaller (better) discriminator loss values compared to generator loss values in the early epochs in Fig. 1C. As training progresses, the generators learn to produce more realistic FA images, which become increasingly difficult for the discriminators to identify as not real, as observed from the downward trend in generator loss and upward trend in discriminator loss in the early epochs in Fig. 1C. The goal of the network is to reach an equilibrium where the loss values for the generators and discriminators stabilize (late epochs in Fig. 1C). The ideal loss curves for a generative adversarial network (GAN), in which the network reaches the Nash equilibrium, are shown in Fig. 1D.
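The loss dynamics described above follow the standard adversarial objective. As an illustration (using the generic non-saturating GAN losses, not the paper's exact multi-scale objective), the discriminator's cross-entropy loss is small early on, when it separates real from fake easily, and both losses plateau once the discriminator is reduced to guessing:

```python
import math

def d_loss(d_real: float, d_fake: float) -> float:
    """Discriminator cross-entropy: score real images high, fake images low."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake: float) -> float:
    """Non-saturating generator loss: push the discriminator to score fakes as real."""
    return -math.log(d_fake)

# Early training: the discriminator easily separates real (0.95) from fake (0.05),
# so its loss is small while the generator loss is large.
early_d, early_g = d_loss(0.95, 0.05), g_loss(0.05)

# Nash equilibrium: the discriminator is reduced to guessing (0.5 everywhere),
# and the two losses plateau at 2*ln(2) and ln(2), respectively.
eq_d, eq_g = d_loss(0.5, 0.5), g_loss(0.5)

print(early_d < early_g, eq_d, eq_g)
```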
For training, we use the fundus and angiography dataset provided by Hajeb et al. 55 . The dataset includes 30 pairs of FA and fundus images from patients with diabetic retinopathy and 29 normal pairs, from 59 patients in total. Fundus photographs are in color, whereas angiograms are in gray-scale. Our proposed network is capable of performing with a high degree of accuracy even on this small dataset. To improve training accuracy, we perform a randomized data augmentation process in which N random crops of size 512 × 512 are extracted from each image. These random crops are then processed through geometric and photometric manipulations and used for training the model. The total number of training samples is therefore O(17 × N × f ) , where f is the number of photometric and geometric manipulations. This process can potentially generate a virtually unlimited number of training samples for our deep learning algorithm, addressing data limitation issues in deep learning. For example, with N = 500 crops from 17 images and 10 geometric and photometric manipulations, the training set consists of 85,000 pairs of fundus photographs and FA images. We used image rotation, horizontal flip, and vertical flip as geometric transformations. Photometric transformations used in the proposed data augmentation include gamma correction, contrast stretching, contrast compression, and color manipulations. The results of the training over the course of 100 epochs are shown in Fig. 2. In this figure, the first and third rows are the original fundus images, their paired FA images, and the generated results after training at epochs 1, 25, 50, 75, and 100, respectively. The second and fourth rows show magnifications of the red rectangular regions for a better visual representation of the vascular structure generated by our method as training progresses.
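The augmentation budget described above can be sketched in a few lines; `augmentation_budget` and `random_crop_origin` are illustrative helper names, not from the paper, and the 576 × 720 image size matches the dataset described in the Methods:

```python
import random

def augmentation_budget(n_images: int, n_crops: int, n_manipulations: int) -> int:
    """Number of training pairs from N crops per image times f manipulations."""
    return n_images * n_crops * n_manipulations

def random_crop_origin(height: int, width: int, crop: int = 512, rng=random):
    """Top-left corner of a random crop x crop window inside a height x width image."""
    return rng.randrange(height - crop + 1), rng.randrange(width - crop + 1)

# The example from the text: N = 500 crops from 17 images, f = 10 manipulations.
print(augmentation_budget(17, 500, 10))  # 85000

# A valid 512 x 512 crop origin for the 576 x 720 images in the dataset.
y, x = random_crop_origin(576, 720)
print(0 <= y <= 64 and 0 <= x <= 208)  # True
```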
Our proposed method first learns global information, such as the optic disc, fovea position, and large vascular structures. It then progressively learns the minuscule vascular structures, e.g., arteries and veins.

Fundus-Angio alignment effects.
Training on unaligned images hampers the synthesis of realistic FA images. Although global information such as the overall intensity, contrast, and the location of geometric features such as the optic disc is retained, local information like vascular structures is distorted or absent from the generated image (Fig. 3). Without proper alignment of the paired fundus and FA images used for training, deep learning GAN architectures fail to generate accurate vascular structures. To address this problem, the FA images need to be aligned with their fundus counterparts prior to training. Algorithm 1 shows the process by which a misaligned FA image is aligned with its fundus counterpart. The process takes as input a pair of fundus and FA images and, using a fast SIFT-like feature extractor called SURF 56 , finds corresponding features between the two images. Singular value decomposition is then applied to the matched features between the fundus and FA images to uncover the transformation between the two images. This transformation is then used to align the FA image with the fundus photograph.

FA image generation. The first experiment was designed to establish the performance of our proposed method in generating anatomically accurate FA images from fundus photographs, and to compare our results with the leading conditional generative adversarial networks proposed by Wang et al. 50 and Isola et al. 27 . Figure 4 shows the results of this comparison. In this experiment we supplied the trained networks with a fundus image (Fig. 4A) while the paired FA image of the same patient (Fig. 4B) was held out as ground truth. For quantitative evaluation we use two established measures, the Fréchet inception distance (FID) 57 and the structural similarity measure (SSIM) 58 . FID is a metric that measures the distance between feature vectors calculated for real and generated images.
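The core of the feature-based alignment described above (Algorithm 1) can be sketched without any imaging libraries: given matched keypoint coordinates (e.g., from SURF), a least-squares affine transform is recovered with `np.linalg.lstsq`, which solves the system via SVD, mirroring the decomposition step in the algorithm. This is a simplified stand-in with hypothetical helper names, not the paper's implementation:

```python
import numpy as np

def estimate_affine(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares 2x3 affine transform mapping src points onto dst points.

    src, dst: (N, 2) arrays of matched keypoint coordinates.
    """
    n = src.shape[0]
    A = np.hstack([src, np.ones((n, 1))])        # (N, 3): rows [x, y, 1]
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)  # (3, 2), solved via SVD
    return M.T                                   # (2, 3) affine matrix

def apply_affine(M: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply a 2x3 affine transform to (N, 2) points."""
    return pts @ M[:, :2].T + M[:, 2]

# Synthetic check: recover a known rotation + translation from four matches.
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
t = np.array([12.0, -7.0])
src = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
dst = src @ R.T + t
M = estimate_affine(src, dst)
print(np.allclose(apply_affine(M, src), dst))  # True
```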
Since FID is a distance metric, lower FID values mean higher accuracy in generating images. FID allows comparing how accurately the generated FA images represent anatomical features relative to the ground truth. Comparisons of FID measures between our proposed method and those presented by Wang et al. 50 and Isola et al. 27 show that our method produces significantly more accurate FA images, p = .0005 and p = 4.5 × 10 −6 , respectively (Fig. 5A). Table 1 shows the average FID values of our proposed method compared to those of Wang et al. 50 and Isola et al. 27 ; the lower the FID, the better the generated image. As can be seen, our method produces consistently lower FID measures when producing FA images from the original fundus image and from its transformed counterparts when motion blur, sharpening, noise, linear shifts, and radial shifts are applied.
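FID compares the Gaussian statistics of feature vectors from real and generated images. The full metric uses Inception features and full covariance matrices; the sketch below assumes diagonal covariances, for which the matrix square root in the trace term reduces to an elementwise square root:

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2) -> float:
    """Fréchet distance between two Gaussians with diagonal covariances.

    FID proper fits full-covariance Gaussians to Inception features of the
    real and generated images; with diagonal covariances the trace term
    Tr(C1 + C2 - 2*sqrt(C1*C2)) collapses to a sum over per-dimension variances.
    """
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    mean_term = float(np.sum((mu1 - mu2) ** 2))
    cov_term = float(np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2))
    return mean_term + cov_term

# Identical feature statistics give a distance of 0; diverging means grow it.
same = fid_diagonal([0, 0], [1, 1], [0, 0], [1, 1])
far = fid_diagonal([0, 0], [1, 1], [3, 4], [1, 1])
print(same, far)  # 0.0 25.0
```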
The SSIM is a well-known quality metric used to measure the similarity between two images. It is considered to be correlated with the quality perception of the human visual system (HVS) and is designed by modeling any image distortion as a combination of three factors: loss of correlation, luminance distortion, and contrast distortion 59 . Comparisons of SSIM measures between our proposed method and those presented by Wang et al. 50 ( p < .0003 ) and Isola et al. 27 show that our method produces significantly more accurate FA images than both. Table 2 shows the average SSIM values of our proposed method compared to those of Wang et al. 50 and Isola et al. 27 ; the higher the SSIM, the better the generated image. As can be seen, our method produces consistently higher SSIM measures when producing FA images from the original fundus image and from its transformed counterparts when motion blur, sharpening, noise, linear shifts, and radial shifts are applied.
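The SSIM statistic itself can be sketched in a few lines. The standard metric averages this quantity over local sliding windows; the single-window ("global") version below is a simplification for illustration only:

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """One-window SSIM combining luminance, contrast, and structure terms.

    The standard metric averages this statistic over local sliding windows;
    evaluating it once over the whole image is a simplification.
    """
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
print(ssim_global(img, img))              # ~1.0 for identical images
print(ssim_global(img, 1.0 - img) < 1.0)  # True: a distorted copy scores lower
```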
Results on changes to fundus image acquisition. An important benefit of the proposed method is the robustness in accuracy of generated FA images from fundus photographs subject to varying imaging issues such as high signal-to-noise ratio (SNR), motion blur, and color sharpness. Figure 6 shows the FA images generated from noisy fundus photographs (Fig. 6A) compared to the FA image of the same subject (Fig. 6B). Using a high SNR fundus photograph as input, small vasculature is preserved and generated by our method (Fig. 6H), but lost in the FA images generated by Wang et al. 50 and Isola et al. 27 (red triangles in Fig. 6I,J). Comparisons of the FA images generated by our proposed method from normal and from high SNR images show no statistically significant difference in FID measures (Fig. 5B). Figure 7 shows the FA images generated from motion-blurred fundus photographs (Fig. 7A) compared to the FA image of the same subject (Fig. 7B). Using a motion-blurred fundus photograph as input, small vasculature is preserved and generated by our method (Fig. 7H), but lost in the FA images generated by Wang et al. 50 and Isola et al. 27 (red triangles in Fig. 7I,J). Comparisons of the FA images generated by our proposed method from normal and from motion-blurred images show no statistically significant difference in FID measures (Fig. 5B).

Table 2. Comparisons between the structural similarity measure (SSIM) achieved by our method and those of Wang et al. 50 and Isola et al. 27 .

Figure 8 shows the FA images generated from fundus photographs subject to color and contrast sharpening (Fig. 8A) compared to the FA image of the same subject (Fig. 8B). Using a sharpened fundus photograph as input, small vasculature is preserved and generated by our method (Fig. 8H), but lost in the FA images generated by Wang et al. 50 and Isola et al. 27 (red triangles in Fig. 8I,J).
Comparisons of the FA images generated by our proposed method from normal and from sharpened images show no statistically significant difference in FID measures (Fig. 5B).
Results on anatomical structure changes. Although robustness to funduscopy imaging variations should not impact the results of FA image generation, certain anatomical changes to the vascular structure should be identified and utilized in the image generation process. Our proposed approach is capable of generating anatomically correct FA images from fundus photographs that contain two kinds of anatomical changes, i.e., slight linear (translational) shifts in the vascular pattern and radial shifts changing the curvature of blood vessels. Figure 9 shows the FA images generated from fundus photographs subject to anatomical changes to the structure of retinal blood vessels (Fig. 9A,K) compared to the FA images of the same subject (Fig. 9B,L); magnified regions are shown in Fig. 9P-T for better viewing. As shown by the yellow circle in Fig. 9R, the distortions in the blood vessel patterns are reconstructed with high fidelity by our method, but lost in the FA images generated by Wang et al. 50 and Isola et al. 27 (red triangles in Fig. 9S,T). Comparisons of the FA images generated by our proposed method from normal images and from images with linear vasculature shifts show a statistically significant difference ( p = .010 ) in FID measures (Fig. 5C).

Qualitative evaluations.
In the next experiment we evaluated the quality of the generated angiograms by asking experts to identify whether a given angiogram was real, from a collection of 40 balanced (50%, 50%) and randomly mixed angiograms. The experts were not told how many of the images were real and how many were not. The non-disclosed ratio of generated to real images was a significant design choice for this experiment, as it allowed us to evaluate three metrics: (1) incorrectly identified generated images, representing how real the generated images look; (2) correctly labeled real images, representing how accurately the experts recognized salient angiogram features; and (3) the confusion metric, representing how effective our proposed method was at confusing the experts over the whole experiment. The results are shown in Fig. 10. Given real FA images, two out of three experts identified significantly fewer real images than the ground truth (Fig. 10A). Given generated angiograms, all three experts missed significantly more generated angiograms (Fig. 10B). This experiment shows that the FA images generated by our proposed method are virtually indistinguishable from real FA images.

Discussion
Our study demonstrates that a deep learning generative adversarial network (GAN) can be trained to map anatomical features from different image modalities, i.e., fundus photographs and FA images, onto a shared feature manifold for the purpose of generating one image modality from the other. Once trained on a dataset of paired fundus and FA images, the network is capable of generating anatomically accurate retinal vasculature in the form of FA images. Our deep learning model generated accurate and reliable FA images from fundus photographs even under significant noise, motion blur, and color and contrast manipulations. The most significant aspect of the proposed deep learning architecture is that it is the first application of deep learning in ophthalmology capable of translating between two different modalities of data. We also utilized a comprehensive data augmentation method to increase the accuracy of our deep learning system without the need for a very large training dataset. With our approach, detailed retinal vascular structures can be created without the need for fluorescein angiography, avoiding its potential side effects. Furthermore, generating vascular images from fundus photography via deep learning generative networks enables a non-invasive, easy-to-use, and low-cost alternative to FA. Bypassing FA protocols by utilizing our proposed deep learning approach has the potential to enable remote monitoring of patients. In addition, generating FA images from fundus photographs does not impose the need for the multiple measurements required by OCTA to reconstruct vascular maps over large areas of the retina.
A potential explanation for how the proposed deep learning approach is capable of inferring the retinal vascular structure from fundus images is that a paired set of FA and fundus images of the same eye share the same statistical distributions governing the anatomical structure of the eye from which the images are acquired. Although not visible in fundus images, light reflects differently from the blood vessels and their neighboring regions on the retina. These minute differences, locally and globally, are utilized by our deep learning algorithm to establish shared local and global feature representations from paired FA and fundus images. The trained model is then capable of using these learned shared features to infer the structural statistics of an FA image directly from the structural statistics of a given fundus photograph and to produce an anatomically accurate FA image counterpart. In fact, this shared feature representation learning has recently been used in computer vision to transfer image modalities and styles, e.g., transforming real photographs into the art styles of Monet or Van Gogh paintings [60][61][62][63] . The proposed deep learning generative network in our study produces FA images from fundus photographs. This finding has significant clinical applications. Fundus imaging is an easy, low-cost, and non-invasive procedure and is one of the most commonly performed eye procedures, resulting in a very large number of fundus imaging databases. Moreover, fundus imaging can be done at home with a number of recently introduced portable funduscopes 64-67 . Our study demonstrates the potential for using fundus images acquired from these portable fundus imaging systems to produce reliable and anatomically accurate retinal vascular structures.
The inferred structural measurements of retinal vasculature may allow clinicians to determine the natural history of retinal vascular changes and clinical outcomes of retinal diseases as previously reported from direct analysis of fundus images 68,69 , but with the accuracy of FA image analysis 70,71 or even OCTA 72 .
In our work we used a multi-scale conditional deep learning network comprised of two components, a generator block and a discriminator block. The generator block is responsible for sampling a probability distribution function to generate an image. The discriminator block is responsible for deciding whether a given image is a real FA image or a generated one. For training, the entire system undergoes a minimax optimization process 73 , in which the generators try to produce FA images so realistic that the discriminators cannot correctly label them as not real, while the discriminators try to predict, from a pair of real and generated images, which one is real, as accurately as possible. To our knowledge, this architecture is the first of its kind to be designed and utilized for ophthalmic applications to generate one modality of images from another. The multi-scale design of the proposed network enables it to perform more accurately with fewer training samples and to overcome the data limitations from which the majority of traditional deep learning architectures suffer 28 . Our study and data show the superior results of our network compared to recent generative networks. Future work would include designing a side network within this generative network capable of mapping anatomical structures and biomarkers representative of specific pathologies to establish a latent manifold of pathology feature representations. This latent manifold would be instrumental in predicting the future progression of retinal vascular disorders much earlier in the disease course.
More broadly, we demonstrated that deep learning architectures are capable of translating between different ophthalmic image modalities. A similar approach to our architecture could be utilized to establish relationships between ocular anatomical structure measurements, e.g., OCT, MRI, and funduscopy, and visual function measures such as field perimetry, acuity, and color and contrast sensitivities. For example, several physiological assessments such as OCT and funduscopy, and functional measurements such as visual field assessments, of the same eye are usually performed at each clinic visit. A network designed similarly to the proposed architecture could be utilized to map OCT and fundus photographs along with visual fields onto a manifold of shared feature representations. These shared representations could then be utilized to convert visual field progressions into OCT images establishing retinal fiber layer changes in glaucoma patients without the need to perform OCT measurements. In addition, converting available OCT measurements into a more objective representation of visual field deficits could help evaluate disease progression in a more objective manner without the need for subjective field perimetry. It is worth noting that the clinical use of generative adversarial networks without sufficient ablation studies on the neural network and its performance in generating images is dangerous.

Figure 10. Expert evaluation of real versus generated FA images, given real FA images (A) and FA images generated by our proposed approach (B). Experts had difficulty distinguishing between real and generated FA images. For real images, two out of three experts had significantly lower correct classification compared to the ground truth ( p < .1 ). Given generated images, all experts had significantly lower correct classification scores compared to the ground truth ( p < .1 , p < .001 , and p < .0001 ).
This is due to the potential of GANs to produce fake features. General GAN approaches simply sample a random distribution to generate images, and are therefore susceptible to producing fake features. This is particularly the case for GAN architectures designed for unpaired datasets, as well as for traditional cycleGAN architectures. To avoid this issue we proposed the use of our conditional GAN framework in a paired setup with a hierarchical architecture (Fig. 1A). This is clinically significant: the proposed method, if applied to unpaired images, will produce unnatural generated FA images, whereas the FA images generated from paired FA and fundus training data are assured to include accurate anatomical features (Fig. 11).
The limitations of the current study are the use of a single dataset of paired fundus and FA images, the size of the dataset, and, due to data limitations, our inability to perform longitudinal studies of the benefits of the proposed method in evaluating disease progression. While this dataset was sufficient for establishing the performance of the proposed deep learning model, further studies are needed on additional paired fundus and FA images to validate the results. In this study we proposed the use of data augmentation and conditional GANs in a multi-scale architecture to overcome the size limitations of the dataset. We anticipate that larger training samples acquired from large datasets will further improve the already established superior results of our method. Another limitation of the study relates to the lack of information about the phase of the FA images; this phase information is missing from the dataset on which the proposed method has been evaluated. In future studies we plan to include this information in our analyses. Finally, future longitudinal studies could prove the benefits of utilizing the proposed deep learning method for generating FA images from fundus photographs to regularly monitor retinal vascular disease progression in ways not possible before, while avoiding the costs and side effects associated with FA.
In conclusion, we demonstrated that a deep learning based generative adversarial network (GAN) is capable of producing FA images, from single fundus photographs alone, that are virtually indistinguishable from real FA images. This approach can be used on any existing fundus photograph dataset or could be integrated into funduscopy systems to produce FA images along with the fundus photographs. Although our proposed framework provides an unrivaled way to translate images from one domain to the other, this study is designed as a proof of concept to demonstrate the technical and computational viability of performing image domain transformation to provide adjunctive information in the absence of FA modalities. Future studies are needed to validate how diagnostic capabilities may be improved by utilizing our framework in the absence of FA test results.

Methods
This study utilizes publicly available and de-identified paired fluorescein angiograms and fundus photographs from the Isfahan University of Medical Sciences Persian Eye Clinic (Feiz Hospital) 55 . The study has been approved by the University of Nevada, Reno Institutional Review Board for the use of retrospective de-identified data, and all methods were performed in accordance with the relevant guidelines and regulations. This study only uses anonymized and de-identified data; informed consent was obtained from all subjects, who were over the age of 18, as part of the original study. Retinal images ( 576 × 720 pixels) were collected and include 30 normal and 40 abnormal cases.
Deep learning conditional generative adversarial network. This study proposes a new conditional generative adversarial network (GAN) comprising novel residual blocks 44,45 for producing realistic FA images from retinal fundus images. We use two generators ( G fine and G coarse ) in the proposed network, as illustrated in Fig. 1A. The generator G fine synthesizes fine angiograms from fundus images by learning local information, including retinal venules, arterioles, hemorrhages, exudates, and microaneurysms. The generator G coarse , on the other hand, tries to extract and preserve global information, such as the structures of the macula, optic disc, color, contrast, and brightness, while producing coarse angiograms. The generator G fine takes input images of size 512 × 512 and produces output images at the same resolution. Similarly, the generator G coarse takes an image of half that size ( 256 × 256 ) and outputs an image of the same size as its input. In addition, G coarse outputs a feature vector of size 256 × 256 × 64 that is added to one of the intermediate layers of G fine . Such hybrid generators are quite powerful for sharing local and global information between multiple architectures, as seen in 50,52,74 . Both generators use convolution layers for downsampling and transposed convolution layers for upsampling. It should be noted that G coarse is downsampled twice ( ×2 ) before being upsampled twice again with transposed convolutions. In both generators, the proposed residual blocks are used after the last downsampling operation and before the first upsampling operation, as illustrated in Fig. 1B. In G fine , downsampling takes place once with the necessary convolution layers, followed by the addition of the feature vector, repetition of residual blocks, and then upsampling to produce the fine angiography image. All convolution and transposed convolution operations are followed by Batch-Normalization 75 and Leaky-ReLU activations.
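The resolution bookkeeping described above can be checked with simple arithmetic; the helper names below are illustrative, not from the paper:

```python
def downsample(size: int, times: int = 1) -> int:
    """Each stride-2 convolution halves the spatial resolution."""
    return size // (2 ** times)

def upsample(size: int, times: int = 1) -> int:
    """Each stride-2 transposed convolution doubles the spatial resolution."""
    return size * (2 ** times)

# Coarse generator: 256 in, downsampled twice before the residual blocks,
# then upsampled twice back to 256.
coarse_bottleneck = downsample(256, 2)      # 64
coarse_out = upsample(coarse_bottleneck, 2)  # 256

# Fine generator: 512 in, downsampled once to 256, where the coarse
# generator's 256 x 256 x 64 feature map is added, then upsampled to 512.
fine_mid = downsample(512, 1)  # 256: shape-compatible with the coarse features
fine_out = upsample(fine_mid, 1)

print(coarse_bottleneck, coarse_out, fine_mid, fine_out)  # 64 256 256 512
```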
To train these generators, we first batch-train G coarse once on a set of random samples and then train G fine once on a new set of random samples. During this time, the discriminators' weights are frozen. Lastly, we jointly fine-tune the discriminators and generators together to train the GAN.
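This alternating schedule can be illustrated with trivial stand-in networks (single convolutions in place of the real generators, and a single discriminator in place of the multi-scale set); the optimizer choice, learning rates, and least-squares targets are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

# Trivial stand-ins for G_coarse, G_fine, and a discriminator.
g_coarse = nn.Conv2d(3, 3, 3, padding=1)
g_fine = nn.Conv2d(3, 3, 3, padding=1)
disc = nn.Conv2d(3, 1, 3, padding=1)

opt_gc = torch.optim.Adam(g_coarse.parameters(), lr=2e-4)
opt_gf = torch.optim.Adam(g_fine.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
mse = nn.MSELoss()

def set_requires_grad(net, flag):
    for p in net.parameters():
        p.requires_grad_(flag)

for step in range(2):  # a couple of illustrative iterations
    # 1) Train G_coarse on one random batch while the discriminator is frozen.
    set_requires_grad(disc, False)
    x = torch.randn(2, 3, 32, 32)
    loss_gc = mse(disc(g_coarse(x)), torch.ones(2, 1, 32, 32))
    opt_gc.zero_grad(); loss_gc.backward(); opt_gc.step()

    # 2) Train G_fine on a fresh random batch, discriminator still frozen.
    x = torch.randn(2, 3, 32, 32)
    loss_gf = mse(disc(g_fine(x)), torch.ones(2, 1, 32, 32))
    opt_gf.zero_grad(); loss_gf.backward(); opt_gf.step()

    # 3) Unfreeze the discriminator and jointly fine-tune: one D update on
    #    real vs. detached fake samples, then one joint generator update.
    set_requires_grad(disc, True)
    x, y = torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)
    d_loss = (mse(disc(y), torch.ones(2, 1, 32, 32))
              + mse(disc(g_fine(x).detach()), torch.zeros(2, 1, 32, 32)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    joint_g = mse(disc(g_fine(g_coarse(x))), torch.ones(2, 1, 32, 32))
    opt_gc.zero_grad(); opt_gf.zero_grad(); joint_g.backward()
    opt_gc.step(); opt_gf.step()
```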
Multi-scale PatchGAN as discriminator. For synthesizing fluorescein angiography images, the GAN discriminators need to handle both coarse and fine generated images when distinguishing real from generated samples. Achieving this with a single discriminator would require either a deeper architecture or kernels with a wider receptive field; both solutions inflate the number of parameters, invite overfitting, and demand a large amount of processing power. To address this issue, we exploit the idea of Markovian discriminators, first introduced in a technique called PatchGAN 76 . This technique takes input at different scales, as previously seen in 50,52 . We use four discriminators that share a similar network structure but operate at different image scales. Specifically, we downsample the real and generated angiograms by a factor of 2 using Lanczos resampling 77 to create an image pyramid of three scales (original, 2× downsampled, and 4× downsampled). We group the four discriminators into two pairs, D fine = [D1 fine , D2 fine ] and D coarse = [D1 coarse , D2 coarse ], as seen in Fig. 1A. The discriminators are then trained to distinguish between real and generated angiography images at their respective resolutions. The PatchGAN outputs for D fine are 64 × 64 and 32 × 32 , and for D coarse are 32 × 32 and 16 × 16 . With the given discriminators, the loss function can be formulated as given in Eq. 1: a min-max problem of maximizing the loss of the discriminators while minimizing the loss of the generators.
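A factor-8 PatchGAN is consistent with the stated output sizes (512 → 64 × 64, 256 → 32 × 32, 128 → 16 × 16). In the sketch below the layer widths are assumptions, the discriminator conditions on the fundus/angiogram pair by channel concatenation (a common choice, not stated in the text), and average pooling stands in for the paper's Lanczos resampling, which plain PyTorch does not provide for tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchGAN(nn.Module):
    """Three stride-2 convolutions shrink the input by 8x, so each spatial
    cell of the output scores one patch of the input as real or fake:
    512 -> 64x64, 256 -> 32x32, 128 -> 16x16."""
    def __init__(self, in_ch=6):  # fundus + angiogram concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 1, 3, padding=1),  # per-patch real/fake score map
        )
    def forward(self, x):
        return self.net(x)

def pyramid(img):
    """Three-scale image pyramid. The paper downsamples with Lanczos
    resampling; average pooling is a simple stand-in here."""
    half = F.avg_pool2d(img, 2)
    quarter = F.avg_pool2d(half, 2)
    return [img, half, quarter]

d = PatchGAN()
pair = torch.cat([torch.randn(1, 3, 512, 512),   # fundus
                  torch.randn(1, 3, 512, 512)],  # (real or generated) angiogram
                 dim=1)
scales = pyramid(pair)  # fed to D1_fine, D2_fine/D1_coarse, D2_coarse respectively
```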
Although the discriminators share a similar network structure, the one that learns features at a lower resolution has a wider receptive field. It extracts and retains more global features, such as the macula, optic disc, color, and brightness, to better distinguish real images. In contrast, the discriminator that learns features at the original resolution drives the generator to produce fine features such as retinal veins, arteries, and exudates. In this way, we combine feature information at global and local scales while training the generators independently with their paired multi-scale discriminators.
Weighted objective function and adversarial loss. We use LSGAN 78 to train our conditional GAN.
The objective function for our conditional GAN is given in Eq. 2.
where the discriminators are first trained on the real fundus x and real angiography image y, and then on the real fundus x and generated angiography image G(x). We start by training the discriminators D fine and D coarse for a couple of iterations on random batches of images. Next, we train G coarse while keeping the weights of the discriminators frozen. We then train G fine on a batch of random samples in a similar fashion. We use the Mean-Squared-Error (MSE) for calculating the individual loss of the generators, as shown in Eq. 3.
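The LSGAN formulation replaces the usual log-likelihood terms of the GAN objective with least-squares (MSE) terms. A minimal sketch of the adversarial losses is shown below, with patch-score targets of 1 for real and 0 for generated images; the full objective in Eq. 2 additionally sums these terms over the multi-scale discriminators and adds the generators' own MSE losses.

```python
import torch

mse = torch.nn.MSELoss()

def d_loss(d_real_out, d_fake_out):
    """LSGAN discriminator loss: push real patch scores toward 1, fake toward 0."""
    return (mse(d_real_out, torch.ones_like(d_real_out))
            + mse(d_fake_out, torch.zeros_like(d_fake_out)))

def g_loss(d_fake_out):
    """LSGAN generator loss: move the discriminator's scores on fakes toward 1."""
    return mse(d_fake_out, torch.ones_like(d_fake_out))

# With perfect patch scores (1 on real, 0 on fake) both losses vanish:
real_scores = torch.ones(1, 1, 32, 32)
fake_scores = torch.zeros(1, 1, 32, 32)
print(d_loss(real_scores, fake_scores).item())  # 0.0
```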