Predicting risk of late age-related macular degeneration using deep learning

By 2040, age-related macular degeneration (AMD) will affect ~288 million people worldwide. Identifying individuals at high risk of progression to late AMD, the sight-threatening stage, is critical for clinical actions, including medical interventions and timely monitoring. Although deep learning has shown promise in diagnosing/screening AMD using color fundus photographs, it remains difficult to predict individuals’ risks of late AMD accurately. For both tasks, these initial deep learning attempts have remained largely unvalidated in independent cohorts. Here, we demonstrate how deep learning and survival analysis can predict the probability of progression to late AMD using 3298 participants (over 80,000 images) from the Age-Related Eye Disease Studies AREDS and AREDS2, the largest longitudinal clinical trials in AMD. When validated against an independent test data set of 601 participants, our model achieved high prognostic accuracy (5-year C-statistic 86.4 (95% confidence interval 86.2–86.6)) that substantially exceeded that of retinal specialists using two existing clinical standards (81.3 (81.1–81.5) and 82.0 (81.8–82.3), respectively). Interestingly, our approach offers additional strengths over the existing clinical standards in AMD prognosis (e.g., risk ascertainment above 50%) and is likely to be highly generalizable, given the breadth of training data from 82 US retinal specialty clinics. Indeed, during external validation through training on AREDS and testing on AREDS2 as an independent cohort, our model retained substantially higher prognostic accuracy than existing clinical standards. These results highlight the potential of deep learning systems to enhance clinical decision-making in AMD patients.


Introduction
Age-related macular degeneration (AMD) is the leading cause of legal blindness in developed countries. 1 Through global demographic changes, the number of people with AMD worldwide is projected to reach 288 million by 2040. 2 The disease is classified into early, intermediate, and late stages. 3 Late AMD, the stage associated with severe visual loss, occurs in two forms, geographic atrophy (GA) and neovascular AMD (NV). Making accurate time-based predictions of progression to late AMD is clinically critical. This would enable improved decision-making regarding: (i) medical treatments, especially oral supplements known to decrease progression risk, (ii) lifestyle interventions, particularly smoking cessation and dietary changes, 1 arXiv:2007.09550v1 [eess.IV] 19 Jul 2020 and (iii) intensity of patient monitoring, e.g., frequent reimaging in clinic and/or tailored home monitoring programs. [4][5][6][7][8] It would also aid the design of future clinical trials, which could be enriched for participants with a high risk of progression events. 9 Color fundus photography (CFP) is the most widespread and accessible retinal imaging modality used worldwide; it is the most highly validated imaging modality for AMD classification and prediction of progression to late disease. 10,11 Currently, two existing standards are available clinically for using CFP to predict the risk of progression. However, both of these were developed using data from the AREDS only; now, an expanded dataset with more progression events is available following the completion of the AREDS2. 5 Of the two existing standards, the most commonly used is the five-step Simplified Severity Scale (SSS). 10 This is a points-based system whereby an examining physician scores the presence of two AMD features (macular drusen and pigmentary abnormalities) in both eyes of an individual. From the total score of 0-4, a five-year risk of late AMD is then estimated. The other standard is an online risk calculator. 12 Like the SSS, its inputs include the presence of macular drusen and pigmentary abnormalities; however, it can also receive the individuals age, smoking status, and basic genotype information consisting of two SNPs (when available). Unlike the SSS system, the online risk calculator predicts the risk of progression to late AMD, GA, and NV at 1-10 years.
Both existing clinical standards face limitations. First, the ascertainment of the SSS features from CFP or clinical examination requires significant clinical expertise, typical in retinal specialists, but remains timeconsuming and error-prone, 13 even when performed by expert graders in a reading center. 11 Second, the SSS relies on two hand-crafted features and cannot receive other potentially risk-determining features. Recent work applying deep learning (DL) 14 has shown promise in the automated diagnosis and triage of conditions including cardiac, pediatric, dermatological, and retinal diseases, 13,[15][16][17][18][19][20][21][22][23][24][25][26] but not in predicting the risk of AMD progression on a large scale or at the patient level. 27 Specifically, Burlina et al reported on the use of DL for estimating the AREDS 9-step severity grades of individual eyes, based on CFP in the AREDS dataset. ?, 27,29 However, this approach relied on previously published 5-year risk estimates at the severity class level, 30 rather than using the ground truth of actual progression/non-progression at the level of individual eyes, or the timing and subtype of any progression events. In addition, no external validation using an independent dataset was performed in that study. Babenko et al. proposed a model to predict risk of progression to neovascular AMD only (i.e. not late AMD or GA). ? Importantly, the model was designed to predict 1-year risks only, i.e. at one fixed and very short interval only. Schmidt-Erfurth et al. proposed to use OCT images to predict progression to late AMD. 29 Specifically, they used a dataset (495 eyes, containing only 159 eyes that progressed to late AMD) with follow-up of 2 years. Their dataset was annotated by two retinal specialists (rather than unified grading by reading center experts using a published protocol).
Here, we developed a DL architecture to predict progression with improved accuracy and transparency in two steps: image classification followed by survival analysis (Figure 1). The model is developed and clinically validated on two datasets from the Age-Related Eye Disease Studies (AREDS 31 and AREDS2 32 ), the largest longitudinal clinical trials in AMD ( Figure 2). The framework and datasets are described in detail in the Methods section.
Our framework has several important strengths. First, it performs progression predictions directly from CFP over a wide time interval (1-12 years). Second, training and testing were based on the ground truth of reading center-graded progression events at the level of individuals. Both training and testing benefitted from an expanded dataset with many more progression events, achieved by using data from the AREDS2 alongside AREDS, for the first time in DL studies. Third, our framework can predict the risk not only of late AMD, but also of GA and NV separately. This is important since treatment approaches for the two subtypes of late AMD are very different: NV needs to be diagnosed extremely promptly, since delay in access  Figure 2: Creation of the study data sets. To avoid cross-contamination between the training and test datasets, no participant was in more than one group. to intravitreal anti-VEGF injections is usually associated with very poor visual outcomes, 33 while various therapeutic options to slow GA enlargement are under investigation. 34,35 Finally, the two-step approach has important advantages. By separating the DL extraction of retinal features from the survival analysis, the final predictions are more explainable and biologically plausible, and error analysis is possible. By contrast, end-to-end black-box DL approaches are less transparent and may be more susceptible to failure. 36

Results
Deep learning models trained on the combined AREDS/AREDS2 training sets and validated on the combined AREDS/AREDS2 test sets The characteristics of the participants are shown in Table 1. The characteristics of the images are shown in Supplementary Table S1. The overall framework of our method is shown in Figure 1 and described in detail in the Methods section. In short, first, a deep convolutional neural network (CNN) was adapted to (i) extract multiple highly discriminative deep features, or (ii) estimate grades for drusen and pigmentary abnormalities (Figure 1 b to c). Second, a Cox proportional hazards model was used to predict probability of progression to late AMD (and GA/NV, separately), based on the deep features (deep features/survival) or the DL grading (DL grading/survival) ( Figure 1 d and e). In this step, additional participant information could be added, such as age, smoking status, and genetics. Separately, all of the baseline images in the test set were graded by 88 (AREDS) and 192 (AREDS2) retinal specialists. By using these grades as input to either the SSS or the online calculator, we computed the prediction results of the two existing standards: retinal specialists/SSS and retinal specialists/calculator. The prediction accuracy of the approaches was compared using the five-year C-statistic as the primary outcome measure. The five-year C-statistic of the two DL approaches met and substantially exceeded that of both existing standards (  Similarly, for predictions at 1-4 years, the C-statistic was higher in all cases for the two DL approaches than the retinal specialists/calculator approach. Of the two DL approaches, the C-statistics of deep features/survival were higher in most cases than those of DL grading/survival. Predictions at these time intervals were not available for retinal specialists/SSS, since the SSS does not make predictions at any interval other than five years. Regarding the separate predictions of progression to GA and NV, deep features/survival also provided the most accurate predictions at most time intervals. Overall, DL-based image analysis provided more accurate predictions than those from retinal specialist grading using the two existing standards. For deep feature extraction, this may reflect the fact that DL is unconstrained by current medical knowledge and not limited to two hand-crafted features. In addition, the prediction calibrations were compared using the Brier score ( Figure 3). For five-year predictions of late AMD, the Brier score was lowest (i.e., optimal) for deep features/survival. We also split the data into 5 groups based on the AREDS SSS at baseline. We compared calibration plots for deep features/survival, DL grading/survival, and retinal specialist/survival with the actual progression data for the 5 groups (Supplement Figure S1). The actual progression data for the five groups are shown in lines (KaplanMeier curves) and the predictions of our models are shown in lines with markers. The figure shows that the predictions of the deep features/survival model correspond better to the actual progression data than those of the other two models. Prediction error curves of the survival models in predicting risk of progression to late age-related macular degeneration on the combined AREDS/AREDS2 test sets (601 participants), using the Brier score (95% confidence interval).

Deep learning models trained separately on individual cohorts (either AREDS or AREDS2) and validated on the combined AREDS/AREDS2 test sets
Models trained on the combined AREDS/AREDS2 cohort ( Table 2) were substantially more accurate than those trained on either individual cohort (Table 3), with the additional advantage of improved generalizability. Indeed, one challenge of DL has been that generalizability to populations outside the training set can be variable. In this instance, the widely distributed sites and diverse clinical settings of AREDS/AREDS2 participants, together with the variety of CFP cameras used, help provide some assurance of broader generalizability.

Deep learning models trained on AREDS and externally validated on AREDS2 as an independent cohort
In separate experiments, to externally validate the models on an independent dataset, we trained the models on AREDS (2,177 participants) and tested them on AREDS2 (1,121 participants). Table 4 shows that deep features/survival demonstrated the highest accuracy of five-year predictions in all scenarios, and DL grading/survival also had higher accuracy than retinal specialists/calculator.

Survival models with additional input of genotype
For all approaches possible, the predictions were tested with the additional input of genotype (Table 5). Interestingly, adding the genotype data, even the 52 SNP-based Genetic Risk Score (GRS; see Methods) available only in rare research contexts, 37     that adding genotype information may partially compensate for the inferior accuracy obtained from human gradings, but contributes little to the accuracy of DL approaches, particularly deep feature extraction.

Research software prototype for AMD progression prediction
To demonstrate how these algorithms could be used in practice, we developed a software prototype that allows researchers to test our model with their own data. The application (shown in Figure 4) receives bilateral CFP and performs autonomous AMD classification and risk prediction. For transparency, the researcher is given (i) grading of drusen and pigmentary abnormalities, (ii) predicted SSS, and (iii) estimated risks of late AMD, GA, and NV, over 1-12 years. This approach allows improved transparency and flexibility: users may inspect the automated gradings, manually adjust these if necessary, and recalculate the progression risks. Following further validation, this software tool may potentially augment human research and clinical practice.

Discussion
We developed, trained, and validated a framework for predicting individual risk of late AMD by combining DL and survival analysis. This approach delivers autonomous predictions of a higher accuracy than those from retinal specialists using two existing clinical standards. Hence, the predictions are closer to the ground truth of actual time-based progression to late AMD than when retinal specialists are grading the same bilateral CFP and entering these grades into the SSS or the online calculator. In addition, deep feature extraction generally achieved slightly higher accuracy than DL grading of traditional hand-crafted features. Table 4 shows that the C-statistic values were lower for AREDS2, as expected, since the majority of its participants were at higher risk of progression; 38 though more difficult, predicting progression for AREDS2 participants is more representative of a clinically meaningful task. When compared to the results in Table 2, the accuracy of our models decreased less than retinal specialists/calculator and retinal specialists/SSS. This demonstrates the relative generalizability of our models. Furthermore, this suggests the ability of deep features/survival to differentiate individuals with relatively similar SSS more accurately than existing clinical standards (since nearly all AREDS2 participants had SSS ≤2 at baseline). It is also worth noting that it was possible to improve further the performance of our models on the AREDS2 dataset, if the models had been modified to accommodate the unique characteristics of the AREDS2. Since the AREDS is enriched for participants with milder baseline AMD severity, it may be possible to improve the performance of our models on the AREDS2 dataset by either training the models on progressors only or decreasing the prevalence of non-progressors. However, for fair comparisons to other results in this study, we kept our models unchanged.
In addition, unlike the SSS, whose five-year risk prediction becomes saturated at 50%10, the DL models enable ascertainment of risk above 50%. This may be helpful in justifying medical and lifestyle interventions, 4, 5, 8 vigilant home monitoring, 6, 7 and frequent reimaging, 33 and in planning shorter but highly powered clinical trials. 9 For example, the AREDS-style oral supplements decrease the risk of developing late AMD by approximately 25%, but only in individuals at higher risk of disease progression. 4,5 Similarly, if subthreshold nanosecond laser treatment is approved to slow progression to late AMD8, accurate risk predictions may be very helpful for identifying eyes that may benefit most.
Considering that the DL grading model is better at detecting drusen and pigmentary changes, we also took the grades from the DL grading model and used these as input to the SSS and the Casey calculator, to generate predictions (Supplementary Table S3). We found that the accuracy of the predictions was higher when using the grades from the DL grading than from the retinal specialists. Hence, it is the combination of the DL grading and the survival approach that provides accurate and automated predictions.
In addition, we compared C-statistic results on three DL models: nnet-survival, 39 DeepSurv, 40 and CoxPH (Supplementary Table S4). CoxPH had the best calibration performance using either the deep features or DL grading, though the differences were fairly small.
For all approaches, the accuracy was substantially higher for predictions of GA than NV. The DL-based approaches improved accuracy for both subtypes, but, for NV, insufficiently to reach that obtained for GA. Potential explanations include the partially stochastic nature of NV and/or higher suitability of predicting GA from en face imaging.
Error analysis (i.e., examining the reasons behind inaccurate predictions) was facilitated more easily by our two-step architecture, unlike in end-to-end, black-box DL approaches. This revealed that the survival model accounted for most errors, since (i) classification network accuracy was relatively high (Supplementary Figure S2), and (ii) when perfect classification results (i.e., taken from the ground truth) were used as input to the survival model, the five-year C-statistic of DL grading/survival on late AMD improved only slightly, from 85.1 (85.0,85.3) to 85.8 (85.6-86.0). Supplementary Figure S3 demonstrates two example cases of progression. In the first case, both participants were the same age (73 years) and had the same smoking history (current), and AREDS SSS scores (4) at baseline. The retinal specialist graded the SSS correctly. However, participant 1 progressed to late AMD at year 2 (any GA at right eye), but participant 2 progressed to late AMD at year 5 (any GA at left eye). Hence, both the retinal specialist/calculator and retinal specialist/SSS approaches incorrectly assigned the same risk of progression to both participants (0.413). However, the deep feature/survival approach correctly assigned a higher risk of progression to participant 1 (0.751) and a lower risk to participant 2 (0.591). In the second case, both participants were the same age (73 years) and had the same smoking history (former). Their AREDS SSS scores at baseline were 3 and 4, respectively. The retinal specialist graded the SSS correctly. However, participant 3 progressed to late AMD at year 2 (NV at right eye), while participant 4 had still not progressed to late AMD by final follow-up at year 11. Hence, both the retinal specialist/calculator and retinal specialist/SSS approaches incorrectly assigned a lower risk of progression to participant 3 (0.259) and a higher risk to participant 4 (0.413). However, the deep feature/survival approach correctly assigned a higher risk of progression to participant 4 and a lower risk to participant 3 (0.569 vs 0.528).
The strengths of the study include the application of combining DL image analysis and deep feature extraction with survival analysis to retinal disease. Survival analysis has been used widely in AMD progression research; 4,38 in this study, it made best use of the data, specifically the timing and nature of any progression events at the individual level. Deep feature extraction has the advantages that the model is unconstrained by current medical knowledge and not limited to two features. It allows the model to learn de novo what features are most highly predictive of time-based progression events, and to develop multiple (e.g., 5, 10, or more) predictive features. Unlike shallow features that appear early in CNNs, deep features are potentially complex and high-order features, and might even be relatively invisible to the human eye. However, one limitation of deep feature extraction is that predictions based on these features may be less explainable and biologically plausible, and less amenable to error analysis, than those based on DL grading of traditional risk features. 36 Additional strengths include the use of two well-characterized cohorts, with detailed time-based knowledge of progression events at the reading center standard. By pooling AREDS and AREDS2 (which has not been used previously in DL studies), we were able to construct a cohort that had a wide spectrum of AMD severity but was enriched for cases of higher baseline severity. In addition, in other experiments, keeping the datasets separate enabled us to perform external validation using AREDS2 as an independent cohort.
In terms of limitations, as in the two existing clinical standards, AREDS/AREDS2 treatment assignment was not considered in this analysis10,12. Since most AREDS/AREDS2 participants were assigned to oral supplements that decreased risk of late AMD, the risk estimates obtained are closer to those for individuals receiving supplements. However, this seems appropriate, given that AREDS-style supplements are considered the standard of care for patients with intermediate AMD. 33 Another limitation is that this work relates to CFP only. DL approaches to optical coherence tomography (OCT) datasets hold promise for AMD diagnosis, 41,42 but no highly validated OCT-based tools exist for risk prediction. In the future, we plan to apply our framework to multi-modal imaging, including fundus autofluorescence and OCT data.
In conclusion, combining DL feature extraction of CFP with survival analysis achieved high prognostic accuracy in predictions of progression to late AMD, and its subtypes, over a wide time interval (1-12 years). Not only did its accuracy meet and surpass existing clinical standards, but additional strengths in clinical settings include risk ascertainment above 50% and without genotype data.

Datasets
For model development and clinical validation, two datasets were used: the AREDS 31 and the AREDS2 32 ( Figure 2). The AREDS was a 12-year multi-center prospective cohort study of the clinical course, prognosis, and risk factors of AMD, as well as a phase III randomized clinical trial (RCT) to assess the effects of nutritional supplements on AMD progression. 31 In short, 4,757 participants aged 55 to 80 years were recruited between 1992 and 1998 at 11 retinal specialty clinics in the United States. The inclusion criteria were wide, from no AMD in either eye to late AMD in one eye. The participants were randomly assigned to placebo, antioxidants, zinc, or the combination of antioxidants and zinc. The AREDS dataset is publicly accessible to researchers by request at dbGAP (https://www.ncbi.nlm.nih.gov/projects/gap/ cgi-bin/study.cgi?study_id=phs000001.v3.p1).
Similarly, the AREDS2 was a multi-center phase III RCT that analyzed the effects of different nutritional supplements on the course of AMD32. 4,203 participants aged 50 to 85 years were recruited between 2006 and 2008 at 82 retinal specialty clinics in the United States. The inclusion criteria were the presence of either bilateral large drusen or late AMD in one eye and large drusen in the fellow eye. The participants were randomly assigned to placebo, lutein/zeaxanthin, docosahexaenoic acid (DHA) plus eicosapentaenoic acid (EPA), or the combination of lutein/zeaxanthin and DHA plus EPA. AREDS supplements were also administered to all AREDS2 participants, because they were by then considered the standard of care. 33 We will make the AREDS2 dataset publicly accessible upon publication.
In both studies, the primary outcome measure was the development of late AMD, defined as neovascular AMD or central GA. Institutional review board approval was obtained at each clinical site and written informed consent for the research was obtained from all study participants. The research was conducted under the Declaration of Helsinki and, for the AREDS2, complied with the Health Insurance Portability and Accessibility Act. For both studies, at baseline and annual study visits, comprehensive eye examinations were performed by certified study personnel using a standardized protocol, and CFP (field 2, i.e., 30°imaging field centered at the fovea) were captured by certified technicians using a standardized imaging protocol. Progression to late AMD was defined by the study protocol based on the grading of CFP, 31, 32 as described below.
As part of the studies, 2,889 (AREDS) and 1,826 (AREDS2) participants consented to genotype analysis. SNPs were analyzed using a custom Illumina HumanCoreExome array. 37 For the current analysis, two SNPs (CFH rs1061170 and ARMS2 rs10490924, at the two loci with the highest attributable risk of late AMD), were selected, as these are the two SNPs available as input for the existing online calculator system. In addition, the AMD GRS was calculated for each participant according to methods described previously. 37 The GRS is a weighted risk score based on 52 independent variants at 34 loci identified in a large genomewide association study 37 as having significant associations with risk of late AMD. The online calculator cannot receive this detailed information.
The eligibility criteria for participant inclusion in the current analysis were: (i) absence of late AMD (defined as NV or any GA) at study baseline in either eye, since the predictions were made at the participant level, and (ii) presence of genetic information (in order to compare model performance with and without genetic information on exactly the same cohort of participants). Accordingly, the images used for the predictions were those from the study baselines only.
In the AREDS dataset of CFPs, information on image laterality (i.e., left or right eye) and field status (field 1, 2, or 3) were available from the Reading Center. However, these were not available in the AREDS2 dataset of CFPs. We therefore trained two Inception-v3 models, one for classifying laterality and the other for identifying field 2 images. Both models were first trained on the gold standard images from the AREDS and fine-tuned on a newly created gold standard AREDS2 set manually graded by a retinal specialist (TK). The AREDS2 gold standard consisted of 40 participants with 5,164 images (4,097 for training and 1,067 for validation). The models achieved 100% accuracy for laterality classification and 97.9% accuracy (F1-score 0.971, precision 0.968, recall 0.973) for field 2 classification.

Gold standard grading
The ground truth labels used for both training and testing were the grades previously assigned to each CFP by expert human graders at the University of Wisconsin Fundus Photograph Reading Center. The reading center workflow has been described previously. 30 In brief, a senior grader performed initial grading of each photograph for AMD severity using a 4-step scale and a junior grader performed detailed grading for multiple AMD-specific features. All photographs were graded independently and without access to the clinical information. A rigorous process of grading quality control was performed at the reading center, including assessment for inter-grader and intra-grader agreement. 30 The reading center grading features relevant to the current study, aside from late AMD, were: (i) macular drusen status (none/small, medium (diameter ≤63m and <125m), and large (≤125m)), and (ii) macular pigmentary abnormalities related to AMD (present or absent).
In addition to undergoing reading center grading, the images at the study baseline were also assessed (separately and independently) by 88 retinal specialists in AREDS and 196 retinal specialists in AREDS2. The responses of the retinal specialists were used not as the ground truth, but for comparisons between human grading as performed in routine clinical practice and DL-based grading. By applying these retinal specialist grades as input to the two existing clinical standards for predicting progression to late AMD, it was possible to compare the current clinical standard of human predictions with those predictions achievable by DL.

Development of the algorithm
The overall framework of our method is shown in Figure 1. First, a CNN was adapted to (i) extract multiple highly discriminative deep features, or (ii) estimate grades for drusen and pigmentary abnormalities (Figure 1 a to c). Second, a Cox proportional hazards model was used to predict probability of progression to late AMD (and GA/NV, separately), based on the deep features (deep features/survival) or the DL grading (DL grading/survival) ( Figure 1 e and e). In this step, additional participant information could be added, such as age, smoking status, and genetics.
As the first stage in the workflow, the DL-based image analysis was performed using two different adaptations of DeepSeeNet. 13 DeepSeeNet is a CNN framework that was created for AMD severity classification. It has achieved state-of-the-art performance for the automated diagnosis and classification of AMD severity from CFP; this includes the grading of macular drusen, pigmentary abnormalities, the SSS, 13 and the AREDS 9-step severity scale. 43 In particular, using reading center grades as the ground truth, we have recently demonstrated that DeepSeeNet performs grading with accuracy that was superior to that of human retinal specialists (Supplementary Figure S1). The two different adaptations are described here: Deep features. The first adaptation was named deep features. This approach involved using DL to derive and weight predictive image features, including high-dimensional hidden features. 44 Deep features were extracted from the second to last fully-connected layer of DeepSeeNet (the highlighted part in the classification network in Figure 1). In total, 512 deep features could be extracted for each participant in this way, comprising 128 deep features for each of the two models (drusen and pigmentary abnormalities) in each of the two images (left and right eyes). After feature extraction, all 512 deep features were normalized as standardscores. Feature selection was required at this point, to avoid overfitting and to improve the generalizability, because of the multi-dimensional nature of the features. Hence, we performed feature selection to group correlated features and pick one feature for each group. 45 Features with non-zero coefficients were selected and applied as input to the survival models described below.
Deep learning grading. The second adaptation of DeepSeeNet was named DL grading, i.e., referring to the grading of drusen and pigmentary abnormalities, the two macular features considered by humans most able to predict progression to late AMD. In this adaptation, the two predicted risk factors were used directly. In brief, one CNN was previously trained and validated to estimate drusen status in a single CFP, according to three levels (none/small, medium, or large), using reading center grades as the ground truth. 13 A second CNN was previously trained and validated to predict the presence or absence of pigmentary abnormalities in a single CFP.
Survival model. The second stage of our workflow comprised a Cox proportional hazards model 46 to estimate time to late AMD (Fig1. D and E). The Cox model is used to evaluate simultaneously the effect of several factors on the probability of the event, i.e., participant progression to late AMD in either eye. Separate Cox proportional hazards models were created to analyze time to late AMD and time to subtype of late AMD (i.e., GA and NV). In addition to the image-based information, the survival models could receive three additional inputs: (i) participant age; (ii) smoking status (current/former/never), and (iii) participant AMD genotype (CFH rs1061170, ARMS2 rs10490924, and the AMD GRS).

Experimental design
In both of the DeepSeeNet adaptations described, the DL CNNs used Inception-v3 architecture, 47 which is a state-of-the-art CNN for image classification; it contains 317 layers, comprising a total of over 21 million weights that are subject to training. Training was performed using two commonly used libraries: Keras (https://keras.io) and TensorFlow. 48 All images were cropped to generate a square image field encompassing the macula and resized to 512 x 512 pixels. The hyperparameters were learning rate 0.0001 and batch size 32. The training was stopped after 5 epochs once the accuracy on the development set no longer increased. All experiments were conducted on a server with 32 Intel Xeon CPUs, using a NVIDIA GeForce GTX 1080 Ti 11Gb GPU for training and testing, with 512Gb available in RAM memory. We fitted the Cox proportional hazard model using the deep features as covariates. Specifically, we selected the 16 features with the highest weights (i.e. the features found to be most predictive of progression to late AMD, with their inclusion as covariates in the Cox model). We performed feature selection using the glmnet package 45 in R version 3.5.2 statistical software.

Training and testing
For training and testing our framework, we used both the AREDS and AREDS2 datasets. In the primary set of experiments, eligible participants from both studies were pooled to create one broad cohort of 3,298 individuals that combined a wide spectrum of baseline disease severity with a high number of progression events. The combined dataset was split at the participant level in the ratio 70%/10%/20% to create three sets: 2,364 participants (training set), 333 participants (development set), and 601 participants (hold-out test set).
Separately, all of the baseline images in the test set were graded by 88 (AREDS) and 192 (AREDS2) retinal specialists. By using these grades as input to either the SSS or the online calculator, we computed the prediction results of the two existing standards: retinal specialists/SSS and retinal specialists/calculator.
For three of the four approaches (deep features/survival, DL grading/survival, and retinal special-ists/calculator), the input was bilateral CFP, participant age, and smoking status; separate experiments were conducted with and without the additional input of genotype data. For the other approach (retinal specialists/SSS), the input was bilateral CFP only. In addition to the primary set of experiments where eligible participants from the AREDS and AREDS2 were combined to form one dataset, separate experiments were conducted where the DL models were: (i) trained separately on the AREDS training set only, or the AREDS2 training set only, and tested on the combined AREDS/AREDS2 test set, and (ii) trained on the AREDS training set only and externally validated by testing on the AREDS2 test set only.

Statistical analysis
As the primary outcome measure, the performance of the risk prediction models was assessed by the Cstatistic 49 at five years from study baseline. Five years from study baseline was chosen as the interval for the primary outcome measure since this is the only interval where comparison can be made with the SSS, and the longest interval where predictions can be tested using the AREDS2 data.
For binary outcomes such as progression to late AMD, the C-statistic represents the area under the receiver operating characteristic curve (AUC). The C-statistic is computed as follows: all possible pairs of participants are considered where one participant progressed to late AMD and the other participant in the pair progressed later or not at all; out of all these pairs, the C-statistic represents the proportion of pairs where the participant who had been assigned the higher risk score was the one who did progress or progressed earlier.
A C-statistic of 0.5 indicates random predictions, while 1.0 indicates perfectly accurate predictions. We used 200 bootstrap samples to obtain a distribution of the C-statistic and reported 95% confidence intervals. For each bootstrap iteration, we sampled n patients with replacement from the test set of n patients.
As a secondary outcome measure of performance, we calculated the Brier score from prediction error curves, following the work of Klein et al. 12 The Brier score is defined as the squared distances between the models predicted probability and actual late AMD, GA, or NV status, where a score of 0.0 indicates a perfect match. The Wald test was used to assess the statistical significance of each factor in the survival models. 50 It corresponds to the ratio of each regression coefficient to its standard error. The survival package in R version 3.5.2 was used for Cox proportional hazards model evaluation. Finally, saliency maps were generated to represent the image locations that contributed most to decision-making by the DL models (for drusen or pigmentary abnormalities). This was done by back-projecting the last layer of the neural network. The Python package keras-vis was used to generate the saliency map. 51 Figure S1: Survival model predictions for the 5 groups. The participants were split into 5 groups based on the AREDS simplified severity scores at the baseline. Actual survival for the 5 groups is shown in lines (Kaplan-Meier curves). The deep features/survival, DL grading/survival, and retinal specialist/survival model predictions for the five groups are shown in lines with markers. The deep features/survival predictions corresponded better to actual progression data than those of the other two models. 20 Figure S2: Receiver operating characteristic curves for the deep neural networks to grade drusen size and pigmentary abnormalities presence/absence on the combined AREDS/AREDS2 test sets (601 participants). The AUC was used to evaluate the performance of deep neural networks in image classification (as opposed to predictions of progression), and for comparison with the retinal specialists (with reference to the Reading Center grades as the ground truth). Figure S3: Example cases of progression to late age-related macular degeneration (AMD) where deep features/survival made more accurate predictions than retinal specialist/calculator or retinal specialist/Simplified Severity Scale (SSS). a. Both participants were the same age (73 years) and had the same smoking history (current) and AREDS SSS scores (4) at baseline. The retinal specialist graded the SSS correctly. However, participant 1 progressed to late AMD at year 2 (GA in the right eye), but participant 2 progressed to late AMD at year 5 (GA in the left eye). Hence, both the retinal specialist/calculator and retinal specialist/SSS approaches incorrectly assigned the same risk of progression to both participants (0.413). However, the deep feature/survival approach correctly assigned a higher risk of progression to participant 1 (0.751) and a lower risk to participant 2 (0.591). b. Both participants were the same age (73 years) and had the same smoking history (former). Their AREDS SSS scores at baseline were 3 and 4, respectively. The retinal specialist graded the SSS correctly. However, participant 3 progressed to late AMD at year 2 (neovascular AMD in the right eye), while participant 4 had still not progressed to late AMD by final followup at year 11. Hence, both the retinal specialist/calculator and retinal specialist/SSS approaches incorrectly assigned a lower risk of progression to participant 3 (0.259) and a higher risk to participant 4 (0.413). However, the deep feature/survival approach correctly assigned a higher risk of progression to participant 4 and a lower risk to participant 3 (0.569 vs 0.528). 22