Deep learning segmentation of hyperautofluorescent fleck lesions in Stargardt disease

Stargardt disease is one of the most common forms of inherited retinal disease and leads to permanent vision loss. A diagnostic feature of the disease is retinal flecks, which appear hyperautofluorescent in fundus autofluorescence (FAF) imaging. The size and number of these flecks increase with disease progression. Manual segmentation of flecks allows disease monitoring but is time-consuming. Herein, we developed and validated a deep learning approach for segmenting these Stargardt flecks (1750 training and 100 validation FAF patches from 37 eyes with Stargardt disease). Testing was performed on 10 separate Stargardt FAF images, in which we observed good overall agreement between manual and deep learning segmentation in both fleck count and fleck area. Longitudinal data were available for both eyes of 6 patients (average total follow-up time 4.2 years), with both manual and deep learning segmentation performed on all (n = 82) images. Both methods detected a similar upward trend in fleck number and area over time. In conclusion, we demonstrated the feasibility of using deep learning to segment and quantify FAF lesions, laying the foundation for future studies using fleck parameters as a trial endpoint.

Table 1. Summary of patient demographics, clinical data and genetic diagnoses. a Individuals from the same family are indicated by a lower case letter following the same number; all were siblings except for subjects 3a and 3b, who were mother and son. b BCVA: best-corrected visual acuity, as measured at the imaging session. c Modified Fishman grading scale: Grade 1, flecks limited to within 1 DD of the foveal centre with no atrophy; Grade 2, flecks beyond 1 DD of the fovea with no atrophy; Grade 3, choriocapillaris atrophy of the macula associated with flecks within 1 DD of the foveal centre (3A), flecks beyond 1 DD of the fovea but within the central 55° (3B), or flecks beyond 55° (3C); Grade 4, diffuse flecks and choriocapillaris atrophy throughout the fundus. d No familial DNA for phase determination; four ABCA4 variants detected: c.5836-145C>A, c.4253+43G>A, c.5177C>A, c.5603A>T. e Origin of parental alleles not established, but familial analysis indicates the variants are biallelic. f Variant not detected in a parent, but familial analysis indicates the variants are biallelic. g Asymptomatic mother and son; age of onset is the age at which retinal lesions were first identified. h East Asian origin.

Of the 47 manually segmented FAF images, 37 were selected as the training and validation sets for deep learning and 10 were used as independent testing images. The images were graded according to their clinical appearance, and the senior author (FKC) ensured that the distribution of clinical appearance was even across the training, validation and testing sets. Fifty image patches (512 × 512 pixels) were generated from each of the 37 FAF images (1850 patches in total). Of these, 1750 and 100 image patches were used for training and validation, respectively. The Dice score improved and the validation loss decreased with training; both appeared to plateau from the 100th epoch onwards.
Bland-Altman analyses 16 (Fig. 2) between manual and deep learning segmentation were performed on the 10 FAF images (4 with diffuse speckled and 6 with discrete fleck lesions) that underwent testing. Three concentric circles (10°, 20°, 30° diameter) centred at the fovea were placed on each image, partitioning it into a central disc of 10° diameter and two annular rings spanning 10°-20° and 20°-30°. Manual and deep learning segmentation-derived fleck counts were similar, with a mean difference (deep learning − manual) of −0.60 (95% CI −9.57 to 8.37), 4.10 (95% CI −24.01 to 32.21) and 3.00 (95% CI −27.21 to 33.21) for the 10° disc, 10°-20° and 20°-30° rings, respectively. Total fleck area was also similar between the two methods in the 10° disc, 10°-20° and 20°-30° rings (Fig. 2). The mean ± standard deviation (SD) Dice score for FAF images with diffuse speckled patterns (N = 4) was lower than for those with discrete flecks (N = 6): 0.54 ± 0.14 versus 0.71 ± 0.08, respectively (p < 0.04, t test).

Table 2. Characteristics of FAF images used in deep learning training. n/a not applicable. a Individuals from the same family are indicated by a lower case letter following the same number; all were siblings except for subjects 3a and 3b, who were mother and son. b Dice score between manual and deep learning segmentation, only available in the 10 left eye images used for validation. c Longitudinal serial FAF images were used as part of the training set in subjects 1 and 17.
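For readers reproducing this type of agreement analysis, the short sketch below computes the Bland-Altman bias and interval from paired per-image measurements. The intervals quoted above are numerically consistent with 95% limits of agreement (bias ± 1.96 SD of the differences), which is what the sketch computes; the example arrays are hypothetical values, not the study data.

```python
# Minimal Bland-Altman sketch: bias and 95% limits of agreement between
# paired measurements from two methods. Example values are hypothetical.
import numpy as np

def bland_altman(manual, deep_learning):
    """Return bias (deep learning - manual) and the 95% limits of
    agreement, i.e. bias +/- 1.96 * SD of the paired differences."""
    diff = np.asarray(deep_learning, dtype=float) - np.asarray(manual, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)          # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical per-image fleck counts in the 10-degree disc:
manual_counts = [12, 30, 25, 8, 19, 22, 15, 27, 11, 21]
dl_counts = [11, 28, 26, 7, 20, 21, 14, 29, 10, 20]
bias, lower, upper = bland_altman(manual_counts, dl_counts)
print(f"mean difference {bias:.2f} ({lower:.2f} to {upper:.2f})")
```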

Longitudinal analysis.
Right and left eye images from 6 subjects with discrete pisciform flecks and a minimum of 12 months of follow-up (mean ± SD total follow-up period: 4.2 ± 1.8 years) underwent both manual and deep learning segmentation. Longitudinal analyses are illustrated using two case examples, Patients 4b and 1 (Fig. 3). Patient 4b showed discrete pisciform FAF lesions at the first visit (21 years old, Fig. 3a), which progressed into speckled hypoautofluorescent regions by the last visit (27 years, Fig. 3b). Manual (blue outlines) and deep learning (red outlines) segmentation results at each visit are shown side by side. In all images, three concentric circles (10°, 20°, 30° diameter) were centred on the fovea, from which the number and area of FAF flecks within each ring were analyzed. Fleck number (Fig. 3c) did not significantly increase over time in the 10° ring. In the 20° and 30° rings, fleck number increased over time (both p < 0.05) and the deep learning method underestimated the fleck count in both rings (both p < 0.05). Fleck area was low with either method in the 10° ring over the follow-up period (Fig. 3d, left). In contrast, both methods detected an increase in fleck area over time (p < 0.05) in the 20° and 30° rings, with the deep learning method overestimating fleck area (both p < 0.05). Fleck progression was also shown in an older patient (Patient 1, 56 years at first visit) with discrete pisciform lesions at the first visit (Fig. 3e), which progressed into discrete regions of RPE atrophy centrally and speckled hypoautofluorescent regions in the perimacula by the last visit (62 years, Fig. 3f). There was no increase in fleck number over time in the 10° ring (p = 0.15, Fig. 3g, left) and no difference in fleck number detected with either method (p = 0.47). In the 20° and 30° rings, both manual and deep learning methods showed a significant increase in fleck number over time (p < 0.05) but with no difference in fleck number estimation between the two methods (20° ring, p = 0.15; 30° ring, p = 0.07). Fleck area (Fig. 3h) increased significantly over the 6 years in all three rings (all p < 0.05). The 10° ring showed a significantly larger area with the manual method (p < 0.05), but there was no difference between deep learning and manual segmentation in the outer two rings (20° ring, p = 0.27; 30° ring, p = 0.06).

Discussion
The U-Net deep learning architecture employed in the current study was developed for biomedical image segmentation 17 . Several FAF parameters have been proposed for quantifying STGD1 severity or documenting natural history, including the area of definitely or questionably decreased autofluorescence 9 , quantitative fundus autofluorescence 13 , the fluorescence lifetime profile of FAF-detected flecks over time 19 and expansion of the boundary of the hyperautofluorescent fleck 20 . Currently, the most widely used of these is the growth rate of the central atrophic lesion 7,9,14,21,22 . In this work, we proposed an alternative method of disease severity assessment using quantitative analysis of hyperautofluorescent fleck lesions based on count and area. We purposely included a wide variety of FAF images in the training and validation sets: those with well-defined or discrete pisciform lesions that were easy to segment manually, as well as those with more diffuse and speckled FAF lesions that were much more difficult to identify. Almost one third of the patients did not have any RPE atrophy, supporting the value of this method in assessing disease progression prior to atrophy formation. We observed that segmentation of flecks from images with diffusely speckled FAF regions tended to be less accurate during validation, which was reflected in the Dice scores shown in Table 2. This could be due to uncertainties in manual delineation and/or a lack of lesion feature similarity identified by the deep learning algorithm. Nevertheless, Bland-Altman analysis (Fig. 2) showed good overall agreement between manual and deep learning segmentation in both fleck count and fleck area for the 10 FAF images used for testing, albeit with relatively wide 95% confidence intervals.
In longitudinal analysis with a larger number of consecutive FAF images, we observed that the deep learning algorithm tended to underestimate the number of fleck lesions but overestimate total fleck area compared with manual segmentation of the same images; this seems to contradict the Bland-Altman analysis based on the 10 test images. It is important to note that the test set contained images with both diffusely speckled (N = 4) and discrete (N = 6) fleck lesions, whereas the longitudinal cohort contained only the latter type of lesion, chosen specifically to illustrate the clinical utility of this method. We also observed that deep learning segmentation tended to ignore blood vessels coursing over a large fleck and identify it as a single fleck lesion, whereas manual segmentation delineated two separate lesions divided by the overlying blood vessel (Fig. 3a,b,e,f). We attribute this difference in fleck outline segmentation to the CLAHE transformation performed prior to segmentation in the deep learning method. The CLAHE transformation tended to enhance the overall appearance of fleck lesions, which in turn obscured the shadow created by the overlying blood vessel, resulting in the identification of a single fleck lesion without blood vessel interference. One explanation for the larger fleck area obtained with the deep learning method is that CLAHE enhances the appearance of lesions within the FAF image, effectively enlarging the apparent hyperautofluorescent fleck lesion size. In addition, given that deep learning segmentation tended to incorporate the entire fleck lesion without interference from overlying blood vessels, the overall lesion area would have included the pixels representing blood vessels coursing over the fleck lesion.
To our knowledge, there are no data on the natural history of hyperautofluorescent fleck count and area in STGD1. The similar trends in the change in fleck count and area over time using both manual and deep learning segmentation suggest that deep learning algorithms may be useful in future studies of retinal fleck pathophysiology. The reduction in fleck count and area in the central 10° disc can be explained by the natural life cycle of flecks evolving into RPE atrophy 9,14,23 . In contrast, there was a general increase in fleck number and size over time in the outer rings, as peripheral atrophy generally does not develop until later in the disease course 21,24 . Given that hyperautofluorescent flecks appear to increase gradually in number and size over time and precede the formation of central atrophy by several years in the course of STGD1, further work towards validating fleck count and area as a potential clinical trial endpoint is warranted.
One major limitation of the study is the small number of FAF images used, reflecting the rarity of Stargardt disease. To overcome this drawback, we utilized a patch-based approach commonly used in deep learning to increase the training set sample size. However, our training set still included eyes with very few flecks and/or speckled FAF signals; future training sets should replace these images with ones containing a greater number of well-defined pisciform fleck lesions. The second limitation is the use of the CLAHE transformation in our deep learning workflow, which altered the appearance of the hyperautofluorescent flecks relative to the raw images used in manual segmentation. Future studies should either manually segment the FAF images after CLAHE or explore other image transformation techniques, such as the region erosion method, that faithfully retain the appearance of the hyperautofluorescent flecks.
In conclusion, we have demonstrated that hyperautofluorescent flecks in FAF images may be a useful structural outcome measure in STGD1. More importantly, we trained and put forward a deep learning-based fleck segmentation method that is less time-consuming than manual marking. Further research to refine the deep learning algorithm for fleck segmentation is warranted given its potential as a clinical trial outcome measure in STGD1.

Methods
The study adhered to the tenets of the Declaration of Helsinki and ethics approval was obtained from the Human Ethics Office of Research Enterprise, The University of Western Australia (RA/4/1/8932 and RA/4/1/7916) and the Human Research Ethics Committee, Sir Charles Gairdner Hospital (2001-053), Perth, Western Australia, Australia. Written informed consent was obtained from each individual for their imaging data to be used for research purposes.

Genetic analysis. Sequences were aligned to the ABCA4 reference sequence NM_000350.2/3, with nucleotide 1 corresponding to the A of the start codon ATG, and variants were described in accordance with Human Genome Variation Society recommendations version 15.11 27 . The phase of the variants was examined in the parents, where available, and in other family members. Variant pathogenicity was assessed as previously described 28 and interpreted according to the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) joint guidelines 29 .
Fundus autofluorescence image acquisition. FAF images were acquired using a confocal scanning laser ophthalmoscope (Heidelberg Spectralis HRA without OCT, Heidelberg Engineering, Heidelberg, Germany) as previously described 30,31 . In brief, FAF images (488 nm excitation, > 500 nm emission) were acquired using a 30° × 30° acquisition window. During acquisition, the manufacturer's averaging function (Automatic Real Time, ART), which averages multiple images in real time, was activated to maximize the signal-to-noise ratio.
Manual segmentation and analysis. Following image acquisition, raw images were extracted in bitmap format (1536 × 1536 pixels) and an ophthalmologist (FKC) marked the foveal centre and manually segmented individual autofluorescent flecks in each image using Paint.NET (v 4.2.9). Thirty-seven marked images served as the training and validation sets for deep learning while 10 were used for testing. Prior to testing, two parameters (number of flecks and total fleck area) were extracted from each image. Fleck lesion size was calculated using the x- and y-axis scaling factors provided by the Spectralis software, which take into account the magnification factor of the eye due to differences in refractive power. This was done by creating an image mask of the fleck outlines for each of the 10 FAF images, followed by placing three concentric circles (diameters of 10°, 20°, 30°) on each image, centred at the fovea. This partitioned each FAF image into three zones (10° disc, 10°-20° ring and 20°-30° ring), and the two parameters extracted from each zone served as reference data (i.e. ground truth).

Longitudinal analysis was performed in a subset of 12 eyes from 6 patients with discrete or well-defined pisciform flecks. These images were not used for deep learning training and testing. Image registration was performed after manual segmentation of all available FAF images. A two-step approach was used to register a baseline image to its follow-up images. Given that the acquired images were macula-centred with a large common area between them, the first step applied a simple rigid translation to align the baseline and each follow-up image at the fovea. A 12-parameter quadratic transformation matrix was used to model the curved retinal surface. In the second step, a Dual-Bootstrap Iterative Closest Point (ICP) registration approach 32 was applied to the image pairs from the first step. The Dual-Bootstrap ICP approach starts from an initial 12-parameter transformation estimate that is accurate only in a small region (i.e. the bootstrap region); within each bootstrap region, the algorithm iteratively refines the transformation and expands this locally accurate initial alignment into a globally accurate alignment between the two images. This two-step method achieved pixel-level registration accuracy; however, it was not suitable for eyes with large atrophic areas at baseline or large atrophic changes in their follow-up images. After registration, flecks were quantified by placing three concentric rings (radius 5°, 10° and 15°) centred at the fovea and measuring the fleck number and total fleck area within each zone: 10° disc, 10°-20° ring and 20°-30° ring.
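As an illustration of the zone-based quantification just described, the sketch below partitions a binary fleck mask into the 10° disc and the 10°-20° and 20°-30° rings and counts flecks by connected-component labelling. The pixels-per-degree factor, the fovea coordinates and the rule assigning a fleck to a zone (any overlapping pixel) are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of zone-based fleck quantification from a binary segmentation mask.
# Assumes a fovea location and a pixels-per-degree scale are known (the study
# derived physical scale from the Spectralis x/y scaling factors).
import numpy as np
from scipy import ndimage

def quantify_flecks(mask, fovea_xy, px_per_deg):
    h, w = mask.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Radial distance of each pixel from the fovea, in degrees.
    r_deg = np.hypot(xx - fovea_xy[0], yy - fovea_xy[1]) / px_per_deg
    zones = {"10deg_disc": (0, 5),        # zone radii in degrees
             "10-20deg_ring": (5, 10),
             "20-30deg_ring": (10, 15)}
    labels, _ = ndimage.label(mask > 0)   # connected components = flecks
    results = {}
    for name, (r0, r1) in zones.items():
        in_zone = (r_deg >= r0) & (r_deg < r1)
        # Assumption: a fleck belongs to a zone if any of its pixels lie in it.
        fleck_ids = np.unique(labels[in_zone & (labels > 0)])
        area_px = int(np.count_nonzero((mask > 0) & in_zone))
        results[name] = {"count": int(fleck_ids.size), "area_px": area_px}
    return results
```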

Developing a deep learning algorithm for fleck segmentation. A UNet model with a ResNet encoder (ResNet-UNet) was constructed for deep learning fleck segmentation. The architecture of the ResNet-UNet follows the traditional UNet structure 17 , comprising encoder (down-sampling) and decoder (up-sampling) portions. The key difference between the ResNet-UNet and the UNet is that the down-sampling encoder adopts the ResNet-34 model 33 , which is widely applied in image classification and benefits from deep residual learning. The ResNet-34 used here consists, sequentially, of a 7 × 7 convolutional layer (CL), a max pooling layer and 16 residual blocks. Each residual block contains two 3 × 3 convolutional layers with ReLU and batch normalization, plus an identity shortcut connection (Fig. 5). The ResNet down-sampling portion therefore consists of 34 layers in total. To match the image sizes [32, 64, 128, 256] in the up-sampling portion, the four corresponding feature maps taken from the down-sampling structure are: (1) the output of the 7 × 7 CL (feature map size 256 × 256); (2) the output of the first 3 residual blocks (feature map size 128 × 128); (3) the output of the next 4 residual blocks (feature map size 64 × 64); and (4) the output of the following 6 residual blocks (feature map size 32 × 32). The up-sampling portion starts from the 512-channel 16 × 16 feature map, which is up-sampled by a 2 × 2 transposed convolution with stride 2 (up-sampling factor 2); each up-sampled feature map is combined with the corresponding down-sampling feature map, following the standard UNet design.

Contrast limited adaptive histogram equalization (CLAHE) 34 was applied to the raw FAF images prior to deep learning training. The number of images in the training set was increased by extracting image patches from cropped portions (512 × 512 pixels) of the full-size CLAHE-FAF images. The training set was further augmented using patches that were randomly (1) rotated through 90°, 180° and 270°; (2) flipped horizontally or vertically; and (3) magnified by a scale of 1.0-1.3.

The overall model was trained in two stages to take advantage of the pre-trained weights of the ResNet-34 33 . The first stage froze the down-sampling encoder portion of the model and optimized only the weights of the up-sampling decoder portion. In the second stage, the ResNet-34 down-sampling portion was unfrozen to fine-tune the weights of the entire model. Learning rates were adjusted during training. Adam optimization 35 and the BCEWithLogitsLoss loss function were adopted throughout the training process, and the Dice score between the segmentation results and the ground truth was used as the performance metric. The model was trained on Google Cloud (Google Colab) with a single 12 GB NVIDIA Tesla K80 GPU and a training batch size of 6. The initial learning rate was 0.001 and was adjusted during training for different layers. Model training was completed within 11 h (138 epochs).
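The following is a condensed PyTorch sketch of a ResNet-34 encoder UNet of the kind described above, together with the stage-1 encoder freeze. The skip connections mirror the four feature-map sizes listed in the text, but the decoder channel widths, the output head and other details are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative ResNet-34 encoder UNet (ResNet-UNet) for 512x512 inputs.
# Decoder channel widths and the output head are assumptions for this sketch.
import torch
import torch.nn as nn
from torchvision.models import resnet34

class UpBlock(nn.Module):
    """2x2 transposed convolution (stride 2), concatenation with the encoder
    skip connection, then a 3x3 convolution."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        return self.conv(torch.cat([self.up(x), skip], dim=1))

class ResNetUNet(nn.Module):
    def __init__(self):
        super().__init__()
        r = resnet34(weights="IMAGENET1K_V1")               # pre-trained encoder
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu)   # 7x7 CL: 512 -> 256
        self.pool = r.maxpool                               # 256 -> 128
        self.enc1 = r.layer1   # 3 residual blocks, 128x128, 64 channels
        self.enc2 = r.layer2   # 4 residual blocks, 64x64, 128 channels
        self.enc3 = r.layer3   # 6 residual blocks, 32x32, 256 channels
        self.enc4 = r.layer4   # 3 residual blocks, 16x16, 512 channels
        self.dec4 = UpBlock(512, 256, 256)   # 16 -> 32
        self.dec3 = UpBlock(256, 128, 128)   # 32 -> 64
        self.dec2 = UpBlock(128, 64, 64)     # 64 -> 128
        self.dec1 = UpBlock(64, 64, 64)      # 128 -> 256
        self.head = nn.Sequential(           # 256 -> 512, single-channel logits
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
            nn.Conv2d(32, 1, kernel_size=1))

    def forward(self, x):
        s0 = self.stem(x)
        e1 = self.enc1(self.pool(s0))
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d = self.dec4(e4, e3)
        d = self.dec3(d, e2)
        d = self.dec2(d, e1)
        d = self.dec1(d, s0)
        return self.head(d)   # logits, paired with BCEWithLogitsLoss

# Stage 1 of the two-stage training: freeze the encoder, optimize the decoder.
model = ResNetUNet()
for module in (model.stem, model.enc1, model.enc2, model.enc3, model.enc4):
    for p in module.parameters():
        p.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
# Stage 2 would set requires_grad = True on the encoder and fine-tune the
# whole model with adjusted (typically smaller) learning rates.
```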
Applying the deep learning algorithm to segment and analyze hyperautofluorescent flecks. The FAF images were first processed by CLAHE. The original images (1536 × 1536 pixels) were partitioned into 9 non-overlapping image patches (a 3 × 3 grid of 512 × 512 pixel patches). Each 512 × 512 pixel image patch was rotated by 90°, 180° and 270° to generate 3 new image patches; the original patch was then flipped horizontally and the flipped patch rotated by 90°, 180° and 270° to generate 4 more image patches. All 8 patches were processed by the ResNet-UNet deep learning model for hyperautofluorescent region segmentation, yielding 8 outputs. The final lesion mask for each patch was obtained by reversing the transformation of each of the 8 model outputs and averaging them. The hyperautofluorescent mask of the entire image was then assembled from the 9 partitioned 512 × 512 pixel image patches.

Longitudinal analysis was performed using deep learning in the same subset of 12 eyes with manually segmented flecks. Following delineation of the fleck outlines by deep learning, a mask of the fleck outlines was generated and the same registration approach described in the manual segmentation section was used to measure the number and total size of flecks within the three concentric rings (radius 5°, 10° and 15°) centred at the fovea. Figure 6 illustrates the workflow for both manual and deep learning segmentation employed in the study.
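A sketch of the inference pipeline described above is given below: CLAHE pre-processing, tiling into a 3 × 3 grid of 512 × 512 patches, 8-fold test-time augmentation (four rotations, with and without a horizontal flip) and averaging of the reverse-transformed outputs. OpenCV's CLAHE is used for illustration; the clip limit, tile size and binarization threshold are assumptions, as these settings are not reported here.

```python
# Sketch of CLAHE + tiled inference with 8-fold test-time augmentation.
# CLAHE parameters and the 0.5 threshold are illustrative assumptions.
import cv2
import numpy as np
import torch

def predict_patch_tta(model, patch):
    """Average sigmoid outputs over the 8 transforms of a 512x512 patch
    (identity, 3 rotations, horizontal flip, 3 rotations of the flip),
    undoing each transform before averaging."""
    outs = []
    for flip in (False, True):
        p = np.fliplr(patch) if flip else patch
        for k in range(4):                     # 0, 90, 180, 270 degrees
            t = np.rot90(p, k).copy()
            x = torch.from_numpy(t).float().div(255)[None, None]
            x = x.repeat(1, 3, 1, 1)           # ResNet stem expects 3 channels
            with torch.no_grad():
                y = torch.sigmoid(model(x))[0, 0].numpy()
            y = np.rot90(y, -k)                # reverse the rotation
            outs.append(np.fliplr(y) if flip else y)  # reverse the flip
    return np.mean(outs, axis=0)

def segment_image(model, faf_gray_1536):
    """CLAHE, then segment each tile of the 3x3 grid and re-assemble."""
    model.eval()
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # assumed settings
    img = clahe.apply(faf_gray_1536)           # expects an 8-bit grayscale image
    prob = np.zeros(img.shape, dtype=np.float32)
    for i in range(3):
        for j in range(3):
            rows = slice(i * 512, (i + 1) * 512)
            cols = slice(j * 512, (j + 1) * 512)
            prob[rows, cols] = predict_patch_tta(model, img[rows, cols])
    return prob > 0.5                          # binary hyperautofluorescent mask
```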
Validation outcome measure. To evaluate the model's segmentation performance on the lesions, the Dice score (Eq. 1) was used, where the Dice score (D) is equal to two times the number of elements common to the manual (Y) and deep learning (Ŷ) segmentations divided by the total number of elements in the two sets:

$$D = \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|} \qquad (1)$$
The Dice score was calculated for each image by comparing the manually delineated regions with those marked by deep learning.
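As a minimal illustration, the Dice score of Eq. (1) can be computed from two binary masks as follows; the handling of the empty-mask case is an assumption of this sketch.

```python
# Dice score between a manual mask (y) and a deep learning mask (y_hat),
# following Eq. (1): D = 2|Y ∩ Ŷ| / (|Y| + |Ŷ|).
import numpy as np

def dice_score(y, y_hat):
    y = np.asarray(y, dtype=bool)
    y_hat = np.asarray(y_hat, dtype=bool)
    denom = y.sum() + y_hat.sum()
    if denom == 0:
        return 1.0   # assumption: two empty masks agree perfectly
    return 2.0 * np.logical_and(y, y_hat).sum() / denom
```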
Statistical analysis. Bland-Altman analysis 16 was employed to evaluate the differences in parameters extracted by the deep learning and manual segmentation methods in the testing set. In a subset of 12 eyes from 6 subjects, disease progression over time was assessed using both the deep learning and manual segmentation methods. Fleck count and area were extracted from all images by both methods and compared across time via two-way ANOVA. All data are summarized as mean ± SD.
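For illustration, a two-way ANOVA of this kind could be run with statsmodels' formula interface as sketched below; the data frame layout (columns 'area', 'method' and 'visit') and the values are hypothetical placeholders, not the study data.

```python
# Sketch of a two-way ANOVA (method x visit) on fleck area using statsmodels.
# The data frame and its values are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "area":   [0.8, 0.9, 1.4, 1.6, 0.7, 1.0, 1.5, 1.8],  # hypothetical mm^2
    "method": ["manual", "dl"] * 4,
    "visit":  [1, 1, 2, 2, 1, 1, 2, 2],
})
fit = ols("area ~ C(method) * C(visit)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))   # table with main and interaction effects
```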

Data availability
The data that support the findings of this study are from the WARD study, and restrictions apply to their availability; they are therefore not publicly available. The data are, however, available from the authors upon reasonable request and with the permission of the WARD study.