## Introduction

Age-related macular degeneration (AMD) is the most common cause of severe visual loss in industrialized countries1. The introduction of anti-vascular endothelial growth factor (anti-VEGF) therapy has markedly improved visual outcomes in patients with choroidal neovascularization (CNV) secondary to AMD2. However, clinical trials investigating combined therapeutic approaches beyond anti-VEGF monotherapy (e.g. anti-VEGF in combination with anti-pigment epithelium-derived factor [PEDF]) have so far failed to demonstrate superiority (e.g. CAPELLA [clinicaltrials.gov identifier: NCT02418754], OPH1002 [NCT01944839], OPH1003 [NCT01940900])3. While the negative trial results may be explained by a lack of biological effectiveness, they may also result from limitations of the structural surrogate and functional endpoints used. In a worst-case scenario, this may lead to a candidate drug being disregarded even though it was in fact efficacious.

Best-corrected visual acuity (BCVA) is the most commonly used functional endpoint in ophthalmological trials. However, it has limited accuracy with regard to subtle therapeutic effects, as it only measures photopic function at central retinal fixation and exhibits considerable retest variability4,5. In this regard, fundus-controlled perimetry (FCP, ‘microperimetry’) offers information over and beyond BCVA. FCP is an established psychophysical assessment that, owing to eye tracking, allows for spatially resolved probing of retinal sensitivity even in patients with unstable fixation6,7,8,9,10. Recently, the refined probing of rod function by dark-adapted (DA) two-color FCP has become possible with the introduction of a novel device (S-MAIA, Centervue, Padua, Italy)11,12,13,14. However, the test requires dedicated equipment and is rather time-consuming, and the number of test-points, and consequently the spatial resolution, is limited by patient fatigue. Spectral-domain optical coherence tomography imaging (SD-OCT), which allows for axially resolved imaging of the retina, infrared reflection (IR) imaging and fundus autofluorescence (FAF) imaging, which enables mapping of retinal fluorophores, are now widely available15. The en face resolution of these modalities (11.4 or 5.7 µm/pixel for the Spectralis OCT 2 device, Heidelberg Engineering, Germany) is approximately one log unit or more higher than that of FCP testing (128 µm [Goldmann III] stimulus for the S-MAIA device). However, the currently used imaging biomarkers have limited informative value. For example, in neovascular AMD, a decrease in central (full) retinal thickness could represent either a positive (e.g. reduction of macular edema) or a negative (e.g. outer retinal atrophy) treatment effect.
It has recently been demonstrated that artificial intelligence (AI) algorithms, including machine learning techniques such as random forest regression, may be applied in neovascular AMD to predict future BCVA based on previous BCVA and structural SD-OCT data16. Yet, similar to BCVA, “inferred BCVA” would be expected to be rather insensitive to localized (particularly extrafoveal) alterations in the retinal structure.

The aim of this study was to predict retinal sensitivity based on retinal microstructure in neovascular AMD using machine-learning algorithms. The analysis was based on multimodal, volumetric state-of-the-art retinal imaging and differential mesopic, dark-adapted cyan and dark-adapted red FCP testing. To potentially improve the accuracy of the applied models, we also estimated the additional predictive value provided by “patient-reliability indices” that may account for patient-specific behavioral factors. Finally, we designed this study with the aim of exploring the utility of “inferred sensitivity” mapping as a quasi-functional surrogate endpoint for future clinical trials. We introduce the term “inferred sensitivity” to describe the spatially resolved prediction of retinal sensitivity based on clinically feasible multimodal retinal imaging with subsequent application of AI algorithms.

## Results

### Cohort characteristics

Fifty eyes of 50 patients with CNV secondary to AMD (age [mean ± SD] 76.1 ± 7.6 years [range: 54.6–90.2 years]) and 40 eyes of 40 controls (55.8 ± 17.4 years [21.8–82.1 years]) were included in this study (Table 1). The median BCVA was logMAR 0.38 ± 0.34 [Snellen equivalent approximately 20/50] for patients and 0.03 ± 0.07 [Snellen equivalent approximately 20/20] for controls. For all following analyses, the normative data were used exclusively to standardize patient data in consideration of the spatial differences in retinal sensitivity as well as layer thicknesses and reflectivities (cf. Methods and Fig. 1). Accordingly, only patient data were used to derive the estimates of prediction accuracy, so that these estimates are as conservative as possible. A single observation (i.e. a single test-point within one patient) had to be excluded for all three types of FCP testing due to missing SD-OCT data. This left a total of 3049 observations for predictive modeling for each type of testing (i.e. 50 patients with 61 point-wise observations each for mesopic, DA cyan and DA red testing, minus the one excluded test-point).

### Prediction model for retinal sensitivity in an unknown patient (scenario 1)

The prediction accuracies of the machine learning models were determined in two clinically meaningful scenarios. Scenario 1 (patient-wise leave-one-out cross-validation [LOO-CV], Fig. 2) represents the prediction accuracy for a completely unknown patient with only imaging data available. For mesopic sensitivity, the prediction accuracy for scenario 1 based on imaging data only (S1A) reached a mean absolute error (MAE [95% CI]) of 4.22 dB [3.72, 4.72], which is markedly lower than the MAE of the corresponding null model (Table 2). A likelihood ratio test revealed that the prediction accuracies varied significantly depending on the feature set (P < 0.001). With additional inclusion of “patient reliability indices” as predictors (S1B), to control for potential confounding factors such as the patient-specific false-positive response rate, the MAE could be further reduced to 4.06 dB [3.52, 4.6] (P < 0.001). Of note, fixation stability was not considered one of the “patient reliability indices” in this study, since it could be partially informative of function. Additional inclusion of fixation stability (S1C) allowed for a further slight reduction of the MAE to 3.94 dB [3.38, 4.5], as shown in Fig. 2 (P < 0.001). Similar prediction accuracies were reached for S1A for dark-adapted cyan sensitivity (5.15 dB [4.68, 5.62]) and dark-adapted red testing (4.05 dB [3.66, 4.43]). Again, the prediction accuracy varied depending on the feature set for both types of testing (likelihood ratio test, P < 0.001). Inclusion of “patient reliability indices” likewise improved the prediction accuracy markedly for dark-adapted cyan testing, to 4.89 dB [4.55, 5.24] (P < 0.001). For dark-adapted red testing, the prediction accuracy did not improve through inclusion of “patient reliability indices” as predictors. Inclusion of fixation stability did not significantly improve the prediction accuracy further for either type of testing (Fig. 2, Table 2).
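The patient-wise LOO-CV described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names, the tuple layout of an observation and the toy values are assumptions, and a null model (predicting the training mean) stands in for the regressor. Any model, such as a random forest, could be plugged in through the `fit` and `predict` arguments.

```python
from statistics import mean

def patient_wise_loo_mae(observations, fit, predict):
    """Return the MAE per held-out patient from patient-wise leave-one-out CV.

    observations: list of (patient_id, features, sensitivity_db) tuples
    (hypothetical layout). fit(train) -> model; predict(model, features) -> dB.
    """
    patient_ids = sorted({pid for pid, _, _ in observations})
    mae_per_patient = {}
    for held_out in patient_ids:
        # All test-points of the held-out patient are removed from training,
        # so the model never sees any data from the patient it is scored on.
        train = [o for o in observations if o[0] != held_out]
        test = [o for o in observations if o[0] == held_out]
        model = fit(train)
        errors = [abs(predict(model, x) - y) for _, x, y in test]
        mae_per_patient[held_out] = mean(errors)
    return mae_per_patient

def fit_null(train):
    """Null model: ignore the features and remember the training mean."""
    return mean(y for _, _, y in train)

def predict_null(model, features):
    return model

# Toy data (illustrative values only, not study data).
toy = [
    ("p1", {"onl_z": -2.0}, 12.0), ("p1", {"onl_z": 0.0}, 20.0),
    ("p2", {"onl_z": -1.0}, 16.0), ("p2", {"onl_z": 0.5}, 22.0),
    ("p3", {"onl_z": -3.0}, 8.0), ("p3", {"onl_z": 0.2}, 24.0),
]
null_mae = patient_wise_loo_mae(toy, fit_null, predict_null)
overall_null_mae = mean(null_mae.values())
```

Splitting by patient rather than by test-point matters because test-points within one eye are correlated; a point-wise split would let the model see the held-out patient's other points and would overstate the accuracy for an unknown patient.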

### Prediction model for retinal sensitivity in a patient with prior perimetry data (scenario 2)

Since potentially influential factors (e.g. lens opacity) may not be directly deducible from retinal imaging data, we assessed whether data from a brief FCP exam (i.e. 30 test-points, duration of 5 minutes), which would be feasible in the context of a multicenter trial, could further enhance the prediction accuracy. For all three types of testing, the prediction accuracy was markedly improved for scenario 2 as compared to scenario 1 (P < 0.001). For all three types of testing, the prediction accuracy varied depending on the feature set (likelihood ratio test, P < 0.001). For mesopic sensitivity, the prediction accuracy for scenario 2 reached an MAE of 3.14 dB [2.85, 3.43]. Inclusion of “patient reliability indices” significantly reduced the MAE further (2.8 dB [2.51, 3.09]; P < 0.001), while additional inclusion of fixation stability as a predictor did not result in a further improvement of the prediction accuracy. Similarly good prediction accuracies were reached for scenario 2A for dark-adapted cyan sensitivity (4.01 dB [3.76, 4.26]) and dark-adapted red testing (3.15 dB [2.9, 3.41]). Inclusion of “patient reliability indices” markedly improved the prediction accuracy for dark-adapted cyan sensitivity (3.73 dB [3.48, 3.98]; P < 0.001) and dark-adapted red testing (2.93 dB [2.71, 3.16]; P < 0.001). While inclusion of fixation stability as an additional predictor did not further improve the prediction accuracy for dark-adapted cyan testing, the prediction accuracy for dark-adapted red testing was further improved (2.85 dB [2.62, 3.08]; P < 0.001).
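One way to set up the data for scenario 2 is sketched below. It is a hedged illustration only: how the 30 test-points were selected and how they entered the models is not specified in this section, so the random selection, the function name `scenario2_split` and the observation layout are all assumptions.

```python
import random

def scenario2_split(observations, patient_id, n_known=30, seed=0):
    """Build one scenario-2 fold for a given patient.

    observations: list of (patient_id, point_index, sensitivity_db) tuples
    (hypothetical layout). Returns (train, test): train holds all other
    patients plus n_known points of this patient (the brief FCP exam),
    test holds this patient's remaining points.
    """
    rng = random.Random(seed)
    own = [o for o in observations if o[0] == patient_id]
    others = [o for o in observations if o[0] != patient_id]
    # Assumption: the brief exam samples the grid at random.
    known_idx = set(rng.sample(range(len(own)), n_known))
    known = [o for i, o in enumerate(own) if i in known_idx]
    unknown = [o for i, o in enumerate(own) if i not in known_idx]
    return others + known, unknown

# Toy cohort: 3 patients with 61 test-points each (illustrative values).
toy = [(f"p{p}", point, 20.0) for p in range(3) for point in range(61)]
train, test = scenario2_split(toy, "p0")
```

With 61 test-points per patient, this leaves 31 points per patient on which the prediction is scored, while the 30 measured points can inform the model about patient-specific factors that imaging does not capture.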

### Feature importance and cone versus rod dysfunction

Based on the permutation accuracy importance, ONL thickness (59.3% Inc MSE) followed by FAF intensity (51.3% Inc MSE) and inner retinal thickness (37.3% Inc MSE) constituted the most important features in predicting mesopic sensitivity (Fig. 3). For prediction of dark-adapted cyan sensitivity, ONL thickness (87.8% Inc MSE) followed by IR intensity (55.2% Inc MSE) and OS thickness (49.9% Inc MSE) constituted the most important features. The feature importance order for dark-adapted red testing was similar to the feature importance for mesopic testing with ONL thickness (103.91% Inc MSE) followed by FAF intensity (71.5% Inc MSE) and IR intensity (44.9% Inc MSE). Moreover, graphical analysis (Fig. 3) underscores that the ONL thickness truly stands out in terms of feature importance across all three types of testing and similarly FAF intensity is separated in terms of importance from the other predictors for mesopic and dark-adapted red testing. Generally, thickness measurements exhibited a higher feature importance as compared to the layer intensities (Fig. 3).
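The idea behind permutation accuracy importance can be sketched as follows. This is a simplified illustration under stated assumptions: it reports the mean relative increase in MSE after shuffling one feature, whereas the exact normalisation behind the “% Inc MSE” values above is not specified in this section. The fixed linear `model`, the feature names and the toy data are hypothetical.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of model predictions against the targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature, n_repeats=20, seed=0):
    """Mean relative increase in MSE (in %) after shuffling one feature."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    increases = []
    for _ in range(n_repeats):
        # Shuffle the values of one feature across observations, which
        # breaks its association with the outcome but keeps its distribution.
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        increases.append(mse(model, permuted, targets) - baseline)
    return 100.0 * (sum(increases) / n_repeats) / baseline

# Toy data: the outcome depends on "onl_z" only; "unused" is irrelevant.
data_rng = random.Random(1)
rows = [{"onl_z": data_rng.uniform(-3, 1), "unused": data_rng.uniform(-1, 1)}
        for _ in range(200)]
targets = [20 + 3 * r["onl_z"] + data_rng.gauss(0, 1) for r in rows]

def model(row):
    """Stands in for a fitted regressor (hypothetical)."""
    return 20 + 3 * row["onl_z"]

imp_onl = permutation_importance(model, rows, targets, "onl_z")
imp_unused = permutation_importance(model, rows, targets, "unused")
```

A feature that the model relies on yields a large increase in error when shuffled, whereas a feature the model ignores yields none. With correlated features, the importance is shared between them, which is one reason why model importance need not equal biological relevance.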

### Structure-function correlation

The projections of feature contributions revealed distinctly different relationships between the retinal sensitivity and the ONL thickness for mesopic testing versus dark-adapted cyan testing (Fig. 4). For both types of testing, reduced ONL thickness indicating outer retinal atrophy was associated with decreased sensitivity, while normative ONL thickness was associated with normal function (Fig. 4). For mesopic and dark-adapted red testing, ONL thickening exhibited no distinct effect on sensitivity. In contrast, an inverted-U relationship was observed for dark-adapted cyan sensitivity, implying that ONL thickening is associated with decreased dark-adapted cyan sensitivity (Fig. 4).

### Prediction accuracy and clinical utility

Random forest (RF) regression allowed for accurate prediction of sensitivity for a wide variety of retinal structural alterations through the examination of their specific thickness and reflectivity deviations (Fig. 5). This was confirmed for both the parafoveal and the peripheral region, enabling a comprehensive prediction throughout the retina. Moreover, the algorithm could be used to predict the function for the whole imaged retina (Fig. 6). The patient in Fig. 6 exhibits rather intact central retinal sensitivity despite markedly increased central (full) retinal thickness, while exhibiting parafoveal scotomata associated with slight retinal thinning. This patient clearly showcases the limitations of central full retinal thickness as a surrogate endpoint, since full retinal thinning could represent either loss of function (outer retinal atrophy) or gain of function (reduction in retinal edema).

## Discussion

Demonstrating therapeutic benefits of emerging combined treatment approaches that tackle different pathways simultaneously constitutes a challenge, especially given that visual outcomes in patients with neovascular AMD were markedly improved with the introduction of anti-VEGF therapy. Adequate clinical trial design with selection of suitable endpoints is a prerequisite for a clear assessment of the additional therapeutic benefit of novel interventional approaches. Using machine learning algorithms, the present study outlines the possibility of predicting retinal function when (a) only volumetric, multimodal retinal imaging data are obtained or (b) a short FCP exam is performed in addition. For this AI-based analysis strategy, we have introduced the term “inferred sensitivity”, which may serve as a functional surrogate endpoint in future clinical trials.

To date, BCVA constitutes the most common functional outcome parameter in clinical trials in ophthalmology in general and specifically in studies for neovascular AMD. However, BCVA is primarily representative of cone function and of function at the fovea or at the preferred retinal locus in eyes with extrafoveal fixation (i.e. no spatial resolution)4. Moreover, BCVA assessment represents a psychophysical test that is rather time-consuming and requires good patient cooperation. FCP may partially compensate for these shortcomings by allowing for differential testing of cone and rod function (dark-adapted FCP) and for assessment of retinal loci outside the foveal center with moderate spatial resolution11,12. Disadvantages include the duration of the examination, which limits the spatial resolution due to patient fatigue, and the need for specific equipment, i.e. a microperimetry device14. Here, surrogate endpoints represent a viable alternative to obtain quasi-functional results, especially with a high spatial resolution that could not be achieved with psychophysical testing. This study demonstrates that it is possible to infer sensitivity based on routinely obtained structural imaging data. Using “inferred sensitivity” as a surrogate functional endpoint would provide five key advantages. “Inferred sensitivity” would (i) provide a much higher spatial resolution compared to current functional testing, (ii) be ubiquitously available and (iii) be obtainable within a short time frame, even in patients unfit for psychophysical testing. Moreover, (iv) “inferred sensitivity” could adequately represent potentially opposing treatment effects (e.g. edema reduction versus outer retinal atrophy), which would be inadequately represented by currently used SD-OCT surrogate endpoints such as central (full) retinal thickness. Finally, (v) “inferred sensitivity” could be compared across diseases to potentially facilitate objective cost-benefit analysis.
All of these advantages would be relevant in interventional trials in neovascular AMD. Specifically, “inferred sensitivity” as an endpoint would allow for enrolment of patients with early extrafoveal or peripapillary CNV and/or concurrent macular atrophy in clinical trials17,18,19. These large subgroups of patients were previously systematically excluded from trials due to the limitations of BCVA as functional endpoint and of central [full] retinal thickness as a surrogate endpoint17,18,19.

Previous studies provided evidence for the close structure-function correlations between retinal sensitivity and multimodal imaging in AMD, albeit with only a limited number of narrowly selected predictors and/or application of linear models7,8,20,21,22,23,24. Building on this, by using a wide array of potentially predictive variables (26 imaging features) and non-linear models, it is demonstrated herein that the relationship between structure and function is indeed close. By electing a supervised machine learning approach using RF regression, we could evaluate the feature importance and graphically analyze the effect of these features. The fact that the ONL, which includes the cell bodies of the light-sensitive photoreceptor cells, was the most important feature for all three types of sensitivity predictions, underscores the biological plausibility of the models. However, importance of variables in the models may differ from biological relevance. Especially in the setting of correlated features, features exhibiting less measurement variability will be given higher importance. This may for example explain why the ONL thickness, which is significantly thicker than the IS and OS and therefore (relatively) less prone to grading errors, exhibited highest feature importance.

A similar observation has been previously reported in patients with intermediate AMD in the absence of late-stage disease25. Interestingly, FAF intensity exhibited a high feature importance for mesopic and dark-adapted red testing, exceeding all of the SD-OCT features with the exception of ONL thickness. In contrast to SD-OCT features, the FAF intensity may be analyzed without any prior image segmentation, which is especially attractive for clinical evaluation. Various previous studies in AMD have demonstrated that FAF imaging not only allows for precise demarcation of geographic atrophy26,27, but may also provide indirect information with regard to outer retinal thinning in the context of reticular pseudodrusen28,29, or loss of IS and OS in the context of persistently increased autofluorescence caused by prior subretinal fluid30.

The differential effect of retinal structure on mesopic versus dark-adapted cyan sensitivity further underscores the biological plausibility of our models. For example, Fig. 4 shows that ONL thickening results in a more distinct reduction of dark-adapted cyan function as compared to mesopic function. Since all predictors were standardized in consideration of the location-specific normative values and since sensitivity losses rather than absolute sensitivity values were used as the outcome variable, the inverted U-shaped relationship for dark-adapted cyan testing may not be explained by the physiological rod photoreceptor distribution or ONL thickness topography. Accordingly, rod photoreceptor function appears to be more affected by macular edema than cone photoreceptor function. Further, in terms of predictions, subretinal fluid or loss of RPE integrity leads to a more severe loss of dark-adapted cyan sensitivity than of mesopic or dark-adapted red sensitivity (Fig. 5). This would be in accordance with the observation that rod photoreceptors are strictly dependent on the canonical visual cycle via the RPE, while cone photoreceptors may obtain their chromophores via an additional cone-specific visual cycle involving Müller cells31.

Moreover, our study also included “patient-reliability indices” in the modeling process, demonstrating that consideration of these increased the prediction accuracy for scenario 1. In terms of interpretation, inclusion of these features appears to correct for patient-specific tendencies such as false positive responses.

Based on criteria established by the International Conference on Harmonization (ICH) Guidelines on Statistical Principles for Clinical Trials, “evidence for surrogacy depends upon (i) the biological plausibility of the relationship, (ii) the demonstration in epidemiologic studies of the prognostic value of the surrogate for the clinical outcome and (iii) evidence from clinical trials that treatment effects on the surrogate correspond to effects on the clinical outcome”32. While the biological plausibility is established in this study (as aforementioned), the other two aspects warrant further consideration. The second criterion is only partially applicable to “inferred sensitivity” given its quasi-functional character, in contrast to traditional surrogate endpoints that do not directly represent function (e.g. intra-ocular pressure in glaucoma). However, the third criterion is highly relevant for “inferred sensitivity” as a surrogate endpoint, since models are strictly limited by their applicability domain (i.e. the predictor space in which the model makes predictions with a given reliability). Clearly, the models developed here would be expected to perform sub-optimally in eyes with rarer forms of neovascular AMD, including retinal angiomatous proliferation (RAP), due to the lack of corresponding training data in our cohort. In longitudinal clinical trials (third ICH criterion), it would be even more difficult to define the appropriate applicability domain, as exemplified below.

### Limitations

Due to the exclusion of optic nerve diseases in our training data, the inner retinal features exhibit only a low feature importance in our models. The same consideration would apply to significant changes in media opacification (i.e. cataract). Therefore, “inferred sensitivity” based on our models would be unsuitable for reflecting certain potential side-effects such as optic neuropathy and glaucoma. To avoid such fallacies, at least a subset of patients in clinical trials that evaluate change in “inferred sensitivity” as a surrogate endpoint should undergo longitudinal FCP testing. The longitudinal accuracy of the models could then be confirmed based on this subset prior to inferring sensitivity data for the remaining patients. Further, the discrepancy between the MAE of test and retest measurement differences and the MAE of the predictions suggests that a larger training data set would have been beneficial (cf. Tables 1 and 2). Potentially, application of more complex AI approaches (e.g. convolutional neural networks) to the raw imaging data could have further improved the prediction accuracies. However, this would have come at the cost of interpretability as well as an increased need for training data. Last, the study used the MAE as a conceptually simple and easily interpretable measure of model accuracy. Nevertheless, the root mean squared error, which penalizes particularly large errors that would be undesirable for inferred sensitivity, was also provided (Supplementary Table S2).
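For reference, the two accuracy measures can be written as follows. The notation is an assumption introduced here for illustration: $y_i$ denotes the measured and $\hat{y}_i$ the predicted sensitivity (in dB) for observation $i$, and pooling over all $n$ observations is likewise an assumption, since the aggregation level is not specified in this section.

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|{y}_{i}-{\hat{y}}_{i}\right|,\qquad \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}}$$

Because the errors are squared before averaging, the RMSE is always at least as large as the MAE, and the two coincide only when all absolute errors are equal. A large gap between them therefore points to a few large prediction errors.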

In summary, we have introduced the AI-based analysis strategy of “inferred sensitivity” to estimate differential effects of retinal structural abnormalities on cone and rod function in nAMD. This method constitutes a potentially valuable tool to predict macular visual field losses at high spatial resolution in future nAMD cohorts without the need for extensive psychophysical examinations. In a potential future application, individual subjects would undergo standard ophthalmological assessment and non-invasive retinal imaging in a relatively rapid and straightforward examination, while FCP testing would include only a limited number of test stimuli or could even be waived completely. The findings of this study suggest that “inferred sensitivity” opens the possibility of a refined investigation of treatment effects in nAMD superior to standard BCVA testing, particularly in order to differentiate functional outcomes of different treatment strategies. This technique may also be expanded in the future to high-resolution mapping of localized functional impairment in other macular and retinal conditions in order to investigate the functional impact of progressive structural abnormalities or to assess new therapeutic interventions. The notion of “inferred sensitivity” as a quasi-functional outcome measure might be further applicable to other retinal diseases including diabetic retinopathy, retinal vein occlusion as well as inherited retinal diseases.

## Methods

### Subjects

Subjects with neovascular AMD were recruited from injection clinics of the Department of Ophthalmology, University of Bonn. The inclusion and exclusion criteria have been published previously33. Inclusion criteria were age ≥50 years and a CNV lesion proven in OCT angiography (OCTA), fluorescein angiography (FA) and/or indocyanine green angiography (ICGA). Exclusion criteria for the study eye included refractive errors ≥5.00 diopters of spherical equivalent and >1.50 diopters of astigmatism assessed by autorefraction (ARK-560A; Nidek, Gamagori, Japan), and a history of glaucoma or relevant anterior segment diseases with media opacities; in addition, study eyes were required to have no history of any intraocular surgery except cataract extractions <3 months ago. If both eyes met the inclusion criteria, the eye with better BCVA was included. Apart from taking the medical history, all subjects underwent routine ophthalmological examinations, including BCVA, slit-lamp and funduscopic examination. Control eyes were recruited from the hospital wards among patients with a healthy fellow eye and patients’ companions. The study protocol was in accordance with the relevant guidelines and regulations and approved by the Institutional Review Board of the University of Bonn (ethics approval ID: 191/16). Written informed consent conforming to the tenets of the Declaration of Helsinki was acquired from all participants.

### Imaging protocol

Based on previous publications, standardized retinal imaging was performed including combined confocal scanning laser ophthalmoscopy (cSLO) and spectral-domain optical coherence tomography (SD-OCT) imaging (30° × 25°, ART 25, 121 B-scans, Spectralis HRA-OCT 2, Heidelberg Engineering, Heidelberg, Germany)33. Further, 30° fundus autofluorescence (FAF) and multicolor imaging as well as 55° FAF imaging were performed on the same device. OCTA was performed using a swept-source OCT (SS-OCT) device (3 × 3 mm, 6 × 6 mm, 9 × 9 mm OCTA scan, PLEX Elite 9000, Carl Zeiss Meditec AG, Jena, Germany). Color fundus photography (CFP) was performed (Visucam 500, Carl Zeiss Meditec AG). Neither OCTA nor CFP was included for prediction of inferred sensitivity.

### Fundus-controlled perimetry

FCP testing was carried out based on our previous experience with the S-MAIA (CenterVue, Padova, Italy) device in normal subjects and patients with intermediate and atrophic late-stage AMD11,13,14,25,33,34,35. It was performed after dilating pupils using 2.5% phenylephrine and 0.5% tropicamide to facilitate fundus tracking. Patients with no prior perimetry experience underwent a short mesopic practice FCP test to accustom them to the procedure. Patients underwent duplicate (28 of 50 patients) or single (22 of 50 patients) mesopic (achromatic stimuli, 400–800 nm) FCP, with subsequent 30 minutes of dark adaptation (light level <0.1 lux), followed by duplicate or single dark-adapted cyan (505 nm) and dark-adapted red (627 nm) FCP using the S-MAIA device. Testing was performed with the pre-set 4–2 dB staircase strategy. The stimulus size was 0.43° (Goldmann III). The test grid consisted of 61 stimuli covering the central 18° of the retina. The test points were evenly distributed in five rings at 1°, 3°, 5°, 7° and 9° around a central test-point. In terms of “patient-reliability indices”, false-positive responses were measured through presentation of suprathreshold stimuli to the optic nerve head (i.e. Heijl-Krakau method). Further, the rate of wrong pressure events was measured as the number of pressure events outside of the response window of the S-MAIA device36. Last, the bivariate contour ellipse area (BCEA) encompassing 95% of the fixation points was recorded as a measure of fixation stability37.
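The geometry of such a test grid can be sketched as follows. This is an illustration under explicit assumptions, not the device's actual pattern: ring eccentricities of 1°, 3°, 5°, 7° and 9° with 12 equally spaced points per ring are assumed because this yields 1 + 5 × 12 = 61 stimuli within the central 18°, and the angular offsets are arbitrary.

```python
import math

def build_test_grid(eccentricities=(1, 3, 5, 7, 9), points_per_ring=12):
    """Return (x, y) stimulus positions in degrees, with the fovea at (0, 0).

    Ring radii and points per ring are assumptions for illustration.
    """
    grid = [(0.0, 0.0)]  # central test-point
    for ecc in eccentricities:
        for k in range(points_per_ring):
            # Assumption: points are equally spaced, starting at 0 degrees.
            angle = 2 * math.pi * k / points_per_ring
            grid.append((ecc * math.cos(angle), ecc * math.sin(angle)))
    return grid

grid = build_test_grid()
max_eccentricity = max(math.hypot(x, y) for x, y in grid)
```

Each grid position can then be paired with the imaging features extracted at the same retinal location, which is what makes a point-wise structure-function model possible.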

$${z}_{f,e,a}=\frac{{x}_{f,e,a}-{\bar{x}}_{f,e,a}}{{S}_{f,e,a}}$$

| Symbol | Meaning |
| --- | --- |
| $z_{f,e,a}$ | z-score for imaging feature (f) at the eccentricity (e) and angular position (a) |
| $x_{f,e,a}$ | Observation in a patient for imaging feature (f) at the eccentricity (e) and angular position (a) |
| $\bar{x}_{f,e,a}$ | Age-adjusted normative mean value for a given imaging feature (f) at a given eccentricity (e) and angular position (a) |
| $S_{f,e,a}$ | Age-adjusted normative standard deviation for a given imaging feature (f) at a given eccentricity (e) and angular position (a) |
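The standardization above can be sketched in a few lines. The lookup table, its key layout and the numerical values are hypothetical and serve only to illustrate the arithmetic; the age adjustment of the normative values is assumed to have been applied beforehand and is not shown.

```python
def z_score(observation, normative_mean, normative_sd):
    """Standardize one observation against location-specific normative data."""
    if normative_sd <= 0:
        raise ValueError("normative standard deviation must be positive")
    return (observation - normative_mean) / normative_sd

# Hypothetical normative lookup keyed by (feature, eccentricity, angle);
# values are (mean, standard deviation) and are illustrative only.
normative = {("onl_thickness", 3, 90): (70.0, 8.0)}

norm_mean, norm_sd = normative[("onl_thickness", 3, 90)]
z = z_score(54.0, norm_mean, norm_sd)  # (54 - 70) / 8 = -2.0
```

A z-score of −2 means that the observed value lies two normative standard deviations below the value expected at that retinal location, regardless of how the feature varies with eccentricity in healthy eyes.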