Introduction

Age-related macular degeneration (AMD) is the leading cause of legal blindness in developed countries1. Owing to global demographic changes, the number of people with AMD worldwide is projected to reach 288 million by 20402. The disease is classified into early, intermediate, and late stages3. Late AMD, the stage associated with severe visual loss, occurs in two forms, geographic atrophy (GA) and neovascular AMD (NV). Accurate time-based prediction of progression to late AMD is clinically critical: it would enable improved decision-making regarding (i) medical treatments, especially oral supplements known to decrease progression risk, (ii) lifestyle interventions, particularly smoking cessation and dietary changes, and (iii) the intensity of patient monitoring, e.g., frequent reimaging in clinic and/or tailored home monitoring programs4,5,6,7,8. It would also aid the design of future clinical trials, which could be enriched for participants at high risk of progression events9.

Color fundus photography (CFP) is the most widespread and accessible retinal imaging modality used worldwide; it is also the most highly validated imaging modality for AMD classification and prediction of progression to late disease10,11. Two standards are currently available clinically for using CFP to predict the risk of progression. However, both were developed using data from the AREDS only; an expanded data set with more progression events is now available following the completion of the AREDS25. Of the two existing standards, the more commonly used is the five-step Simplified Severity Scale (SSS)10. This is a points-based system whereby an examining physician scores the presence of two AMD features (macular drusen and pigmentary abnormalities) in both eyes of an individual. From the total score of 0–4, a 5-year risk of late AMD is then estimated. The other standard is an online risk calculator12. Like the SSS, its inputs include the presence of macular drusen and pigmentary abnormalities; however, it can also receive the individual’s age, smoking status, and basic genotype information consisting of two SNPs (when available). Unlike the SSS, the online risk calculator predicts the risk of progression to late AMD, GA, and NV at 1–10 years.
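To make the SSS scoring concrete, the following minimal Python sketch implements the point system as described above. The mapped risks are the approximate 5-year figures published with the scale, and the scale's special cases (e.g., points awarded for bilateral medium drusen) are omitted here; the function and variable names are illustrative.

    def sss_score(large_drusen, pigmentary):
        # One point per eye for large drusen and one per eye for pigmentary
        # abnormalities, giving a total score of 0-4.
        return sum(int(large_drusen[eye]) + int(pigmentary[eye])
                   for eye in ("right", "left"))

    # Approximate published 5-year risks of late AMD by total score.
    APPROX_5YR_RISK = {0: 0.005, 1: 0.03, 2: 0.12, 3: 0.25, 4: 0.50}

    score = sss_score(large_drusen={"right": True, "left": False},
                      pigmentary={"right": True, "left": False})
    print(score, APPROX_5YR_RISK[score])  # 2 0.12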

Both existing clinical standards face limitations. First, ascertainment of the SSS features from CFP or clinical examination requires significant clinical expertise, typically that of retinal specialists, and remains time-consuming and error-prone13, even when performed by expert graders in a reading center11. Second, the SSS relies on two hand-crafted features and cannot receive other potentially risk-determining features. Recent work applying deep learning (DL)14 has shown promise in the automated diagnosis and triage of conditions including cardiac, pediatric, dermatological, and retinal diseases13,15,16,17,18,19,20,21,22,23,24,25,26, but not in predicting the risk of AMD progression on a large scale or at the patient level27. Specifically, Burlina et al. reported on the use of DL for estimating the AREDS 9-step severity grades of individual eyes, based on CFP in the AREDS data set27,28,29. However, that approach relied on previously published 5-year risk estimates at the severity class level30, rather than on the ground truth of actual progression/non-progression at the level of individual eyes, or the timing and subtype of any progression events. In addition, no external validation using an independent data set was performed in that study. Babenko et al.28 proposed a model to predict risk of progression to neovascular AMD only (i.e., not late AMD or GA). Importantly, the model was designed to predict risk at a single, very short, fixed interval of 1 year. Schmidt-Erfurth et al.29 proposed using OCT images to predict progression to late AMD. Specifically, they used a data set of 495 eyes (of which only 159 progressed to late AMD) with 2 years of follow-up. Their data set was annotated by two retinal specialists, rather than through unified grading by reading center experts using a published protocol.

Here, we developed a DL architecture to predict progression with improved accuracy and transparency in two steps: image classification followed by survival analysis (Fig. 1). The model was developed and clinically validated on two data sets from the Age-Related Eye Disease Studies (AREDS31 and AREDS232), the largest longitudinal clinical trials in AMD (Fig. 2). The framework and data sets are described in detail in the “Methods” section.

Fig. 1: The two-step architecture of the framework.

a Raw color fundus photographs (CFP; field 2, i.e., 30° imaging field centered at the fovea). b Deep classification network, trained with CFP (all manually graded by reading center human experts). c Resulting deep features or deep learning grading. d Survival model, trained with imaging data and participant demographic information, with/without genotype information (ARMS2 rs10490924, CFH rs1061170, and the 52-SNP Genetic Risk Score). e Late age-related macular degeneration survival probability.

Fig. 2: Creation of the study data sets.

To avoid ‘cross-contamination’ between the training and test data sets, no participant was in more than one group.

Our framework has several important strengths. First, it performs progression predictions directly from CFP over a wide time interval (1–12 years). Second, training and testing were based on the ground truth of reading center-graded progression events at the level of individuals. Both training and testing benefitted from an expanded data set with many more progression events, achieved by using data from the AREDS2 alongside the AREDS for the first time in DL studies. Third, our framework can predict the risk not only of late AMD, but also of GA and NV separately. This is important since treatment approaches for the two subtypes of late AMD are very different: NV needs to be diagnosed extremely promptly, because delay in access to intravitreal anti-VEGF injections is usually associated with very poor visual outcomes33, while various therapeutic options to slow GA enlargement are under investigation34,35. Finally, the two-step approach has important advantages. By separating the DL extraction of retinal features from the survival analysis, the final predictions are more explainable and biologically plausible, and error analysis is possible. By contrast, end-to-end ‘black-box’ DL approaches are less transparent and may be more susceptible to failure36.

Results

Deep learning models trained on the combined AREDS/AREDS2 training sets and validated on the combined AREDS/AREDS2 test sets

The characteristics of the participants are shown in Table 1. The characteristics of the images are shown in Supplementary Table 1. The overall framework of our method is shown in Fig. 1 and described in detail in the “Methods” section. In short, first, a deep convolutional neural network (CNN) was adapted to (i) extract multiple highly discriminative deep features, or (ii) estimate grades for drusen and pigmentary abnormalities (Fig. 1b, c). Second, a Cox proportional hazards model was used to predict probability of progression to late AMD (and GA/NV, separately), based on the deep features (deep features/survival) or the DL grading (DL grading/survival) (Fig. 1d, e). In this step, additional participant information could be added, such as age, smoking status, and genetics. Separately, all of the baseline images in the test set were graded by 88 (AREDS) and 192 (AREDS2) retinal specialists. By using these grades as input to either the SSS or the online calculator, we computed the prediction results of the two existing standards: ‘retinal specialists/SSS’ and ‘retinal specialists/calculator’.

Table 1 Characteristics of AREDS and AREDS2 participants.

The prediction accuracy of the approaches was compared using the 5-year C-statistic as the primary outcome measure. The 5-year C-statistic of the two DL approaches matched and substantially exceeded that of both existing standards (Table 2). For predictions of progression to late AMD, the 5-year C-statistic was 86.4 (95% confidence interval 86.2–86.6) for deep features/survival, 85.1 (85.0–85.3) for DL grading/survival, 82.0 (81.8–82.3) for retinal specialists/calculator, and 81.3 (81.1–81.5) for retinal specialists/SSS. For predictions of progression to GA, the equivalent results were 89.6 (89.4–89.8), 87.8 (87.6–88.0), and 82.6 (82.3–82.9), respectively; these are not available for retinal specialists/SSS, since the SSS does not make separate predictions for GA or NV. For predictions of progression to NV, the equivalent results were 81.1 (80.8–81.4), 80.2 (79.8–80.5), and 80.0 (79.7–80.4), respectively.

Table 2 The C-statistic (95% confidence interval) of the survival models in predicting risk of progression to late age-related macular degeneration on the combined AREDS/AREDS2 test sets (601 participants).

Similarly, for predictions at 1–4 years, the C-statistic was higher in all cases for the two DL approaches than the retinal specialists/calculator approach. Of the two DL approaches, the C-statistics of deep features/survival were higher in most cases than those of DL grading/survival. Predictions at these time intervals were not available for retinal specialists/SSS, since the SSS does not make predictions at any interval other than 5 years. Regarding the separate predictions of progression to GA and NV, deep features/survival also provided the most accurate predictions at most time intervals. Overall, DL-based image analysis provided more accurate predictions than those from retinal specialist grading using the two existing standards. For deep feature extraction, this may reflect the fact that DL is unconstrained by current medical knowledge and not limited to two hand-crafted features.

In addition, the prediction calibrations were compared using the Brier score (Fig. 3). For 5-year predictions of late AMD, the Brier score was lowest (i.e., optimal) for deep features/survival. We also split the data into five groups based on the AREDS SSS at baseline and compared calibration plots for deep features/survival, DL grading/survival, and retinal specialist/survival against the actual progression data for the five groups (Supplementary Fig. 1). The actual progression data for the five groups are shown as lines (Kaplan–Meier curves) and the predictions of our models as lines with markers. The figure shows that the predictions of the deep features/survival model correspond better to the actual progression data than those of the other two models.

Fig. 3: Prediction error curves.

Prediction error curves of the survival models in predicting risk of progression to late age-related macular degeneration on the combined AREDS/AREDS2 test sets (601 participants), using the Brier score (95% confidence interval).

Deep learning models trained separately on individual cohorts (either AREDS or AREDS2) and validated on the combined AREDS/AREDS2 test sets

Models trained on the combined AREDS/AREDS2 cohort (Table 2) were substantially more accurate than those trained on either individual cohort (Table 3), with the additional advantage of improved generalizability. Indeed, one challenge of DL has been that generalizability to populations outside the training set can be variable. In this instance, the widely distributed sites and diverse clinical settings of AREDS/AREDS2 participants, together with the variety of CFP cameras used, help provide some assurance of broader generalizability.

Table 3 The 5-year C-statistic (95% CI) results of models trained on only AREDS or only AREDS2, and validated on the combined AREDS/AREDS2 test sets (601 participants), without using genotype information.

Deep learning models trained on AREDS and externally validated on AREDS2 as an independent cohort

In separate experiments, to externally validate the models on an independent data set, we trained the models on AREDS (2177 participants) and tested them on AREDS2 (1121 participants). Table 4 shows that deep features/survival demonstrated the highest accuracy of 5-year predictions in all scenarios, and DL grading/survival also had higher accuracy than retinal specialists/calculator.

Table 4 The 5-year C-statistic (95% CI) results of models trained on the entire AREDS and tested on the entire AREDS2 (1121 participants), without using genotype information.

Survival models with additional input of genotype

Where possible, the predictions were tested with the additional input of genotype information (Table 5). Interestingly, adding the genotype data, even the 52-SNP Genetic Risk Score (GRS; see “Methods” section) available only in rare research contexts37, did not improve the accuracy of deep features/survival or DL grading/survival; by contrast, adding just two SNPs (the maximum handled by the calculator) did modestly improve the accuracy of the retinal specialists/calculator approach. Multivariate analysis (Supplementary Table 2) demonstrated that deep features/DL grading, age, and AMD GRS contributed significantly to the survival models. The non-reliance of the DL approaches on genotype information favors their accessibility, since genotype data are typically unavailable for patients currently seen in clinical practice. This suggests that adding genotype information may partially compensate for the inferior accuracy obtained from human gradings, but contributes little to the accuracy of DL approaches, particularly deep feature extraction.

Table 5 The accuracy (C-statistic, 95% confidence interval) of the three different approaches in predicting risk of progression to late AMD on the combined AREDS/AREDS2 test sets (601 participants), with the inclusion of accompanying genotype information.

Research software prototype for AMD progression prediction

To demonstrate how these algorithms could be used in practice, we developed a software prototype that allows researchers to test our model with their own data. The application (shown in Fig. 4) receives bilateral CFP and performs autonomous AMD classification and risk prediction. For transparency, the researcher is given (i) the gradings of drusen and pigmentary abnormalities, (ii) the predicted SSS, and (iii) estimated risks of late AMD, GA, and NV over 1–12 years. This design also allows flexibility: users may inspect the automated gradings, manually adjust them if necessary, and recalculate the progression risks. Following further validation, this software tool may augment human research and clinical practice.

Fig. 4: A screenshot of our research prototype system for AMD risk prediction.

a Screenshot of late AMD risk prediction. 1, Upload bilateral color fundus photographs. 2, Based on the uploaded images, the following information is automatically generated separately for each eye: drusen size status, pigmentary abnormality presence/absence, late AMD presence/absence, and the Simplified Severity Scale score. 3, Enter the demographic and (if available) genotype information, and the time point for prediction. 4, The probability of progression to late AMD (in either eye) is automatically calculated, along with separate probabilities of geographic atrophy and neovascular AMD. b Four selected color fundus photographs with highlighted areas used by the deep learning classification network (DeepSeeNet). Saliency maps were used to represent the visually dominant locations (drusen or pigmentary changes) in each image by back-projecting the last layer of the neural network.

Discussion

We developed, trained, and validated a framework for predicting individual risk of late AMD by combining DL and survival analysis. This approach delivers autonomous predictions of higher accuracy than those from retinal specialists using two existing clinical standards. Hence, the predictions are closer to the ground truth of actual time-based progression to late AMD than those obtained when retinal specialists grade the same bilateral CFP and enter these grades into the SSS or the online calculator. In addition, deep feature extraction generally achieved slightly higher accuracy than DL grading of traditional hand-crafted features.

Table 4 shows that the C-statistic values were lower for AREDS2, as expected, since the majority of its participants were at higher risk of progression38; though more difficult, predicting progression for AREDS2 participants is more representative of a clinically meaningful task. When compared to the results in Table 2, the accuracy of our models decreased less than that of retinal specialists/calculator and retinal specialists/SSS. This demonstrates the relative generalizability of our models. Furthermore, it suggests the ability of deep features/survival to differentiate individuals with relatively similar SSS more accurately than existing clinical standards (since nearly all AREDS2 participants had SSS ≥ 2 at baseline). It might also have been possible to improve further the performance of our models on the AREDS2 data set by modifying them to accommodate its unique characteristics: since the AREDS is enriched for participants with milder baseline AMD severity, training the models on progressors only, or decreasing the prevalence of non-progressors, might have helped. However, for fair comparisons with the other results in this study, we kept our models unchanged.

In addition, unlike the SSS, whose 5-year risk prediction becomes saturated at 50%10, the DL models enable ascertainment of risk above 50%. This may be helpful in justifying medical and lifestyle interventions4,5,8, vigilant home monitoring6,7, and frequent reimaging39, and in planning shorter but highly powered clinical trials9. For example, the AREDS-style oral supplements decrease the risk of developing late AMD by ~25%, but only in individuals at higher risk of disease progression4,5. Similarly, if subthreshold nanosecond laser treatment is approved to slow progression to late AMD8, accurate risk predictions may be very helpful for identifying eyes that may benefit most.

Considering that the DL grading model detects drusen and pigmentary changes more accurately, we also took the grades from the DL grading model and used these as input to the SSS and the online calculator to generate predictions (Supplementary Table 3). We found that the accuracy of the predictions was higher when using the grades from the DL grading than from the retinal specialists. Hence, it is the combination of the DL grading and the survival approach that provides accurate and automated predictions.

In addition, we compared the C-statistic results of three survival models: nnet-survival40, DeepSurv41, and CoxPH (Supplementary Table 4). CoxPH had the best calibration performance using either the deep features or the DL grading, though the differences were fairly small.

For all approaches, the accuracy was substantially higher for predictions of GA than of NV. The DL-based approaches improved accuracy for both subtypes, but for NV the accuracy remained below that obtained for GA. Potential explanations include the partially stochastic nature of NV and/or the higher suitability of en face imaging for predicting GA.

Error analysis (i.e., examining the reasons behind inaccurate predictions) was made easier by our two-step architecture than it would be with end-to-end, ‘black-box’ DL approaches. It revealed that the survival model accounted for most errors, since (i) classification network accuracy was relatively high (Supplementary Fig. 2), and (ii) when perfect classification results (i.e., taken from the ground truth) were used as input to the survival model, the 5-year C-statistic of DL grading/survival on late AMD improved only slightly, from 85.1 (85.0–85.3) to 85.8 (85.6–86.0). Supplementary Figure 3 demonstrates two example cases of progression. In the first case, both participants were the same age (73 years) and had the same smoking history (current) and AREDS SSS score (4) at baseline. The retinal specialist graded the SSS correctly. However, participant 1 progressed to late AMD at year 2 (GA in the right eye), whereas participant 2 progressed to late AMD at year 5 (GA in the left eye). Hence, both the retinal specialist/calculator and retinal specialist/SSS approaches incorrectly assigned the same risk of progression to both participants (0.413), whereas the deep feature/survival approach correctly assigned a higher risk of progression to participant 1 (0.751) and a lower risk to participant 2 (0.591). In the second case, both participants were the same age (73 years) and had the same smoking history (former). Their AREDS SSS scores at baseline were 3 and 4, respectively. The retinal specialist graded the SSS correctly. However, participant 3 progressed to late AMD at year 2 (NV in the right eye), while participant 4 had still not progressed to late AMD by final follow-up at year 11. Hence, both the retinal specialist/calculator and retinal specialist/SSS approaches incorrectly assigned a lower risk of progression to participant 3 (0.259) and a higher risk to participant 4 (0.413), whereas the deep feature/survival approach correctly assigned a higher risk of progression to participant 3 and a lower risk to participant 4 (0.569 vs 0.528).

The strengths of the study include the combination of DL image analysis and deep feature extraction with survival analysis, applied to retinal disease. Survival analysis has been used widely in AMD progression research4,38; in this study, it made the best use of the data, specifically the timing and nature of any progression events at the individual level. Deep feature extraction has the advantage that the model is unconstrained by current medical knowledge and not limited to two hand-crafted features. It allows the model to learn de novo which features are most highly predictive of time-based progression events, and to develop multiple (e.g., 5, 10, or more) predictive features. Unlike shallow features that appear early in CNNs, deep features are potentially complex, high-order features that might even be relatively invisible to the human eye. However, one limitation of deep feature extraction is that predictions based on these features may be less explainable and biologically plausible, and less amenable to error analysis, than those based on DL grading of traditional risk features36. Additional strengths include the use of two well-characterized cohorts, with detailed time-based knowledge of progression events at the reading center standard. By pooling the AREDS and AREDS2 (the latter not used previously in DL studies), we were able to construct a cohort that had a wide spectrum of AMD severity but was enriched for cases of higher baseline severity. In addition, in other experiments, keeping the data sets separate enabled us to perform external validation using the AREDS2 as an independent cohort.

In terms of limitations, as in the two existing clinical standards, AREDS/AREDS2 treatment assignment was not considered in this analysis10,12. Since most AREDS/AREDS2 participants were assigned to oral supplements that decreased risk of late AMD, the risk estimates obtained are closer to those for individuals receiving supplements. However, this seems appropriate, given that AREDS-style supplements are considered the standard of care for patients with intermediate AMD39. Another limitation is that this work relates to CFP only. DL approaches to optical coherence tomography (OCT) data sets hold promise for AMD diagnosis42,43, but no highly validated OCT-based tools exist for risk prediction. In the future, we plan to apply our framework to multi-modal imaging, including fundus autofluorescence and OCT data.

In conclusion, combining DL feature extraction of CFP with survival analysis achieved high prognostic accuracy in predicting progression to late AMD, and its subtypes, over a wide time interval (1–12 years). Not only did its accuracy meet and surpass existing clinical standards, but its additional strengths in clinical settings include risk ascertainment above 50% and applicability without genotype data.

Methods

Data sets

For model development and clinical validation, two data sets were used: the AREDS31 and the AREDS232 (Fig. 2). The AREDS was a 12-year multi-center prospective cohort study of the clinical course, prognosis, and risk factors of AMD, as well as a phase III randomized clinical trial (RCT) to assess the effects of nutritional supplements on AMD progression31. In short, 4757 participants aged 55–80 years were recruited between 1992 and 1998 at 11 retinal specialty clinics in the United States. The inclusion criteria were wide, from no AMD in either eye to late AMD in one eye. The participants were randomly assigned to placebo, antioxidants, zinc, or the combination of antioxidants and zinc. The AREDS data set is publicly accessible to researchers by request at dbGaP (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000001.v3.p1).

Similarly, the AREDS2 was a multi-center phase III RCT that analyzed the effects of different nutritional supplements on the course of AMD32. In total, 4203 participants aged 50–85 years were recruited between 2006 and 2008 at 82 retinal specialty clinics in the United States. The inclusion criteria were the presence of either bilateral large drusen or late AMD in one eye and large drusen in the fellow eye. The participants were randomly assigned to placebo, lutein/zeaxanthin, docosahexaenoic acid (DHA) plus eicosapentaenoic acid (EPA), or the combination of lutein/zeaxanthin and DHA plus EPA. AREDS supplements were also administered to all AREDS2 participants, because they were by then considered the standard of care39. We will make the AREDS2 data set publicly accessible upon publication.

In both studies, the primary outcome measure was the development of late AMD, defined as neovascular AMD or central GA. Institutional review board approval was obtained at each clinical site and written informed consent for the research was obtained from all study participants. The research was conducted in accordance with the Declaration of Helsinki and, for the AREDS2, complied with the Health Insurance Portability and Accountability Act. For both studies, at baseline and annual study visits, comprehensive eye examinations were performed by certified study personnel using a standardized protocol, and CFP (field 2, i.e., 30° imaging field centered at the fovea) were captured by certified technicians using a standardized imaging protocol. Progression to late AMD was defined by the study protocol based on the grading of CFP31,32, as described below.

As part of the studies, 2889 (AREDS) and 1826 (AREDS2) participants consented to genotype analysis. SNPs were analyzed using a custom Illumina HumanCoreExome array37. For the current analysis, two SNPs (CFH rs1061170 and ARMS2 rs10490924, at the two loci with the highest attributable risk of late AMD) were selected, as these are the two SNPs available as input for the existing online calculator system. In addition, the AMD GRS was calculated for each participant according to methods described previously37. The GRS is a weighted risk score based on 52 independent variants at 34 loci identified in a large genome-wide association study37 as having significant associations with risk of late AMD. The online calculator cannot receive this detailed information.
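For illustration, a weighted risk score of this kind reduces to a dot product of per-variant risk-allele dosages and published effect sizes. The sketch below uses hypothetical numbers, not the published AMD weights.

    import numpy as np

    def genetic_risk_score(dosages, betas):
        # Each of the 52 variants contributes its risk-allele dosage
        # (0, 1, or 2) multiplied by its per-variant effect size (beta);
        # the GRS is the weighted sum.
        dosages, betas = np.asarray(dosages), np.asarray(betas)
        assert dosages.shape == betas.shape  # one entry per variant
        return float(np.dot(dosages, betas))

    # Toy example with 3 of the 52 variants (hypothetical weights):
    print(genetic_risk_score([2, 1, 0], [0.47, 0.28, 0.10]))  # 1.22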

The eligibility criteria for participant inclusion in the current analysis were: (i) absence of late AMD (defined as NV or any GA) at study baseline in either eye, since the predictions were made at the participant level, and (ii) presence of genetic information (in order to compare model performance with and without genetic information on exactly the same cohort of participants). Accordingly, the images used for the predictions were those from the study baselines only.

In the AREDS data set of CFPs, information on image laterality (i.e., left or right eye) and field status (field 1, 2, or 3) was available from the Reading Center. However, this information was not available in the AREDS2 data set of CFPs. We therefore trained two Inception-v3 models, one to classify laterality and the other to identify field 2 images. Both models were first trained on the gold standard images from the AREDS and then fine-tuned on a newly created gold standard AREDS2 set manually graded by a retinal specialist (TK). The AREDS2 gold standard consisted of 40 participants with 5164 images (4097 for training and 1067 for validation). The models achieved 100% accuracy for laterality classification and 97.9% accuracy (F1-score 0.971, precision 0.968, recall 0.973) for field 2 classification.
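A minimal Keras sketch of this transfer-learning setup is shown below (laterality head shown). The ImageNet initialization, the frozen-then-unfrozen schedule, and the optimizer are assumptions rather than the study's documented configuration; the data-set handles in the commented fit calls are placeholders.

    from tensorflow.keras.applications import InceptionV3
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.models import Model

    # Inception-v3 backbone with a new binary head (left vs. right eye).
    base = InceptionV3(weights="imagenet", include_top=False,
                       input_shape=(512, 512, 3))
    x = GlobalAveragePooling2D()(base.output)
    out = Dense(1, activation="sigmoid")(x)
    model = Model(base.input, out)

    # Stage 1: train the head on the AREDS gold standard (backbone frozen).
    base.trainable = False
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    # model.fit(areds_train, validation_data=areds_val, epochs=...)

    # Stage 2: unfreeze and fine-tune on the AREDS2 gold standard set.
    base.trainable = True
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    # model.fit(areds2_train, validation_data=areds2_val, epochs=...)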

Gold standard grading

The ground truth labels used for both training and testing were the grades previously assigned to each CFP by expert human graders at the University of Wisconsin Fundus Photograph Reading Center. The reading center workflow has been described previously30. In brief, a senior grader performed initial grading of each photograph for AMD severity using a 4-step scale and a junior grader performed detailed grading for multiple AMD-specific features. All photographs were graded independently and without access to the clinical information. A rigorous process of grading quality control was performed at the reading center, including assessment for inter-grader and intra-grader agreement30. The reading center grading features relevant to the current study, aside from late AMD, were: (i) macular drusen status (none/small, medium (diameter ≥ 63 µm and <125 µm), and large (≥125 µm)), and (ii) macular pigmentary abnormalities related to AMD (present or absent).

In addition to undergoing reading center grading, the images at the study baseline were also assessed (separately and independently) by 88 retinal specialists in AREDS and 196 retinal specialists in AREDS2. The responses of the retinal specialists were used not as the ground truth, but for comparisons between human grading as performed in routine clinical practice and DL-based grading. By applying these retinal specialist grades as input to the two existing clinical standards for predicting progression to late AMD, it was possible to compare the current clinical standard of human predictions with those predictions achievable by DL.

Development of the algorithm

The overall framework of our method is shown in Fig. 1. First, a CNN was adapted to (i) extract multiple highly discriminative deep features, or (ii) estimate grades for drusen and pigmentary abnormalities (Fig. 1a–c). Second, a Cox proportional hazards model was used to predict probability of progression to late AMD (and GA/NV, separately), based on the deep features (deep features/survival) or the DL grading (DL grading/survival) (Fig. 1d, e). In this step, additional participant information could be added, such as age, smoking status, and genetics.

As the first stage in the workflow, the DL-based image analysis was performed using two different adaptations of ‘DeepSeeNet’13. DeepSeeNet is a CNN framework created for AMD severity classification. It has achieved state-of-the-art performance for the automated diagnosis and classification of AMD severity from CFP, including the grading of macular drusen, pigmentary abnormalities, the SSS13, and the AREDS 9-step severity scale44. In particular, using reading center grades as the ground truth, we recently demonstrated that DeepSeeNet performs grading with accuracy superior to that of human retinal specialists (Supplementary Fig. 1). The two adaptations are described below.

Deep features

The first adaptation was named ‘deep features’. This approach involved using DL to derive and weight predictive image features, including high-dimensional ‘hidden’ features45. Deep features were extracted from the second-to-last fully-connected layer of DeepSeeNet (the highlighted part of the classification network in Fig. 1). In total, 512 deep features could be extracted for each participant in this way, comprising 128 deep features from each of the two models (drusen and pigmentary abnormalities) for each of the two images (left and right eyes). After extraction, all 512 deep features were normalized as standard scores. Because of the high dimensionality of the features, feature selection was then required to avoid overfitting and to improve generalizability. Hence, we performed feature selection to group correlated features and pick one feature from each group46. Features with non-zero coefficients were selected and applied as input to the survival models described below.
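As a sketch of the extraction and normalization steps (the layer index and the model/image handles are illustrative placeholders; the study performed the subsequent LASSO-style selection in R with ‘glmnet’, as described under “Experimental design”):

    import numpy as np
    from tensorflow.keras.models import Model

    def extract_features(trained_model, images):
        # Read activations from the second-to-last fully-connected layer
        # (128 units per sub-network in this sketch).
        extractor = Model(trained_model.input,
                          trained_model.layers[-2].output)
        return extractor.predict(images)  # shape: (n_images, 128)

    def participant_features(drusen_net, pigment_net, left_imgs, right_imgs):
        # 128 features x 2 sub-networks (drusen, pigment) x 2 eyes
        # = 512 features per participant.
        feats = np.concatenate([
            extract_features(drusen_net, left_imgs),
            extract_features(pigment_net, left_imgs),
            extract_features(drusen_net, right_imgs),
            extract_features(pigment_net, right_imgs),
        ], axis=1)
        # Normalize each feature to standard scores before selection.
        return (feats - feats.mean(axis=0)) / feats.std(axis=0)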

Deep learning grading

The second adaptation of DeepSeeNet was named ‘DL grading’, referring to the grading of drusen and pigmentary abnormalities, the two macular features considered by humans to be most predictive of progression to late AMD. In this adaptation, the two predicted risk factors were used directly. In brief, one CNN was previously trained and validated to estimate drusen status in a single CFP, according to three levels (none/small, medium, or large), using reading center grades as the ground truth13. A second CNN was previously trained and validated to predict the presence or absence of pigmentary abnormalities in a single CFP.

Survival model

The second stage of our workflow comprised a Cox proportional hazards model47 to estimate time to late AMD (Fig. 1d, e). The Cox model is used to evaluate simultaneously the effect of several factors on the probability of the event, i.e., participant progression to late AMD in either eye. Separate Cox proportional hazards models were created to analyze time to late AMD and time to subtype of late AMD (i.e., GA and NV). In addition to the image-based information, the survival models could receive three additional inputs: (i) participant age; (ii) smoking status (current/former/never), and (iii) participant AMD genotype (CFH rs1061170, ARMS2 rs10490924, and the AMD GRS).
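A minimal sketch of this step using the Python ‘lifelines’ package (the study itself fitted the Cox models in R); the toy data frame and column names below are illustrative:

    import pandas as pd
    from lifelines import CoxPHFitter

    # Toy data: follow-up time, event indicator, and covariates standing in
    # for selected deep features (or DL grades) plus age.
    df = pd.DataFrame({
        "years":      [2.0, 5.0, 11.0, 7.0, 3.0, 9.0],
        "progressed": [1, 1, 0, 0, 1, 0],   # 1 = late AMD, 0 = censored
        "feature_1":  [0.8, -0.1, -0.4, 0.3, 0.5, 0.0],
        "age":        [73, 73, 68, 70, 75, 66],
    })

    # A small ridge penalty keeps the fit stable on this tiny toy sample.
    cph = CoxPHFitter(penalizer=0.1)
    cph.fit(df, duration_col="years", event_col="progressed")

    # Predicted non-progression (survival) probabilities at 1-12 years:
    covariates = df.drop(columns=["years", "progressed"])
    surv = cph.predict_survival_function(covariates, times=range(1, 13))
    print(surv)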

Experimental design

In both of the DeepSeeNet adaptations described, the DL CNNs used the Inception-v3 architecture48, a state-of-the-art CNN for image classification; it contains 317 layers, comprising a total of over 21 million trainable weights. Training was performed using two commonly used libraries: Keras (https://keras.io) and TensorFlow49. All images were cropped to generate a square image field encompassing the macula and resized to 512 × 512 pixels. The hyperparameters were learning rate 0.0001 and batch size 32. Training was stopped once the accuracy on the development set had not increased for 5 epochs. All experiments were conducted on a server with 32 Intel Xeon CPUs and 512 GB of RAM, using an NVIDIA GeForce GTX 1080 Ti GPU (11 GB) for training and testing. We fitted the Cox proportional hazards model using the deep features as covariates. Specifically, we selected the 16 features with the highest weights (i.e., the features found to be most predictive of progression to late AMD) and included them as covariates in the Cox model. Feature selection was performed using the ‘glmnet’ package46 in R version 3.5.2.
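In Keras terms, the stated configuration corresponds roughly to the sketch below; the Adam optimizer, the patience reading of the stopping rule, and the argument names are assumptions, and the model and arrays are placeholders passed in by the caller.

    from tensorflow.keras.callbacks import EarlyStopping
    from tensorflow.keras.optimizers import Adam

    def train(model, train_images, train_labels, dev_images, dev_labels):
        # Stated hyperparameters: learning rate 0.0001, batch size 32.
        model.compile(optimizer=Adam(learning_rate=1e-4),
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        # Stop once development-set accuracy has not improved for 5 epochs.
        early_stop = EarlyStopping(monitor="val_accuracy", patience=5,
                                   restore_best_weights=True)
        model.fit(train_images, train_labels,
                  validation_data=(dev_images, dev_labels),
                  batch_size=32, epochs=100, callbacks=[early_stop])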

Training and testing

For training and testing our framework, we used both the AREDS and AREDS2 data sets. In the primary set of experiments, eligible participants from both studies were pooled to create one broad cohort of 3298 individuals that combined a wide spectrum of baseline disease severity with a high number of progression events. The combined data set was split at the participant level in the ratio 70%/10%/20% to create three sets: 2364 participants (training set), 333 participants (development set), and 601 participants (hold-out test set).
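The participant-level split can be sketched as follows (illustrative; the actual split produced 2364/333/601 participants, approximately 70%/10%/20%):

    import numpy as np

    participant_ids = np.arange(3298)   # stand-in for the unique study IDs
    rng = np.random.default_rng(seed=42)
    ids = rng.permutation(participant_ids)

    n = len(ids)
    train_ids = set(ids[:int(0.7 * n)])               # ~70%: training
    dev_ids = set(ids[int(0.7 * n):int(0.8 * n)])     # ~10%: development
    test_ids = set(ids[int(0.8 * n):])                # ~20%: hold-out test

    # Every image inherits its participant's partition, so no participant
    # contributes to more than one of the three sets.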

Separately, all of the baseline images in the test set were graded by 88 (AREDS) and 192 (AREDS2) retinal specialists. By using these grades as input to either the SSS or the online calculator, we computed the prediction results of the two existing standards: ‘retinal specialists/SSS’ and ‘retinal specialists/calculator’.

For three of the four approaches (deep features/survival, DL grading/survival, and retinal specialists/calculator), the input was bilateral CFP, participant age, and smoking status; separate experiments were conducted with and without the additional input of genotype data. For the other approach (retinal specialists/SSS), the input was bilateral CFP only.

In addition to the primary set of experiments where eligible participants from the AREDS and AREDS2 were combined to form one data set, separate experiments were conducted where the DL models were: (i) trained separately on the AREDS training set only, or the AREDS2 training set only, and tested on the combined AREDS/AREDS2 test set, and (ii) trained on the AREDS training set only and externally validated by testing on the AREDS2 test set only.

Statistical analysis

As the primary outcome measure, the performance of the risk prediction models was assessed by the C-statistic50 at 5 years from study baseline. Five years from study baseline was chosen as the interval for the primary outcome measure since this is the only interval where comparison can be made with the SSS, and the longest interval where predictions can be tested using the AREDS2 data.

For binary outcomes such as progression to late AMD, the C-statistic represents the area under the receiver operating characteristic curve (AUC). The C-statistic is computed as follows: all possible pairs of participants are considered where one participant progressed to late AMD and the other participant in the pair progressed later or not at all; out of all these pairs, the C-statistic represents the proportion of pairs where the participant who had been assigned the higher risk score was the one who did progress or progressed earlier. A C-statistic of 0.5 indicates random predictions, while 1.0 indicates perfectly accurate predictions. We used 200 bootstrap samples to obtain a distribution of the C-statistic and reported 95% confidence intervals. For each bootstrap iteration, we sampled n patients with replacement from the test set of n patients.
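The pairwise definition and bootstrap procedure above translate directly into code; the following sketch is a plain O(n²) implementation, adequate for a test set of 601 participants:

    import numpy as np

    def c_statistic(time, event, risk):
        # Proportion of comparable pairs in which the participant who
        # progressed (or progressed earlier) received the higher risk
        # score; ties in risk count as half-concordant.
        num, den = 0.0, 0
        for i in range(len(time)):
            if not event[i]:
                continue                  # participant i must have progressed
            for j in range(len(time)):
                if time[i] < time[j]:     # j progressed later, or never did
                    den += 1
                    num += (risk[i] > risk[j]) + 0.5 * (risk[i] == risk[j])
        return num / den if den else float("nan")

    def bootstrap_ci(time, event, risk, n_boot=200, seed=0):
        # Resample n participants with replacement, n_boot times, and
        # report the 2.5th and 97.5th percentiles of the C-statistic.
        time, event, risk = map(np.asarray, (time, event, risk))
        rng = np.random.default_rng(seed)
        stats = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(time), size=len(time))
            stats.append(c_statistic(time[idx], event[idx], risk[idx]))
        return np.percentile(stats, [2.5, 97.5])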

As a secondary outcome measure of performance, we calculated the Brier score from prediction error curves, following the work of Klein et al.12. The Brier score is defined as the mean squared distance between the model’s predicted probability and the actual late AMD, GA, or NV status, where a score of 0.0 indicates a perfect match. The Wald test, which corresponds to the ratio of each regression coefficient to its standard error, was used to assess the statistical significance of each factor in the survival models51. The ‘survival’ package in R version 3.5.2 was used for Cox proportional hazards model evaluation. Finally, saliency maps were generated to represent the image locations that contributed most to decision-making by the DL models (for drusen or pigmentary abnormalities); this was done by back-projecting the last layer of the neural network. The Python package ‘keras-vis’ was used to generate the saliency maps52.
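At a fixed horizon (e.g., 5 years), an uncorrected Brier score reduces to the sketch below; note that full prediction error curves additionally account for censoring (e.g., via inverse-probability-of-censoring weights), which this illustration omits.

    import numpy as np

    def brier_score(pred_prob, observed):
        # Mean squared distance between predicted progression probabilities
        # and observed 0/1 outcomes at a fixed horizon; 0.0 is perfect.
        pred_prob = np.asarray(pred_prob, dtype=float)
        observed = np.asarray(observed, dtype=float)
        return float(np.mean((pred_prob - observed) ** 2))

    # Toy example: three participants' predicted 5-year risks vs. outcomes.
    print(brier_score([0.75, 0.59, 0.20], [1, 0, 0]))  # ~0.150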

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.