Comparability of automated drusen volume measurements in age-related macular degeneration: a MACUSTAR study report

Drusen are hallmarks of early and intermediate age-related macular degeneration (AMD) but their quantification remains a challenge. We compared automated drusen volume measurements between different OCT devices. We included 380 eyes from 200 individuals with bilateral intermediate (iAMD, n = 126), early (eAMD, n = 25) or no AMD (n = 49) from the MACUSTAR study. We assessed OCT scans from Cirrus (200 × 200 macular cube, 6 × 6 mm; Zeiss Meditec, CA) and Spectralis (20° × 20°, 25 B-scans; 30° × 25°, 241 B-scans; Heidelberg Engineering, Germany) devices. We assessed sensitivity and specificity for drusen detection, and quantified differences between modalities with intra-class correlation coefficients (ICCs) and mean difference, in a 5 mm diameter fovea-centered circle. Specificity was > 90% in the three modalities. In eAMD, we observed the highest sensitivity in the denser Spectralis scan (68.1%). The two different Spectralis modalities showed a significantly higher agreement in quantifying drusen volume in iAMD (ICC 0.993 [0.991–0.994]) than the dense Spectralis with Cirrus scan (ICC 0.807 [0.757–0.847]). Formulae for drusen volume conversion in iAMD between the two devices are provided. Automated drusen volume measures are not interchangeable between devices and software and need to be interpreted with the imaging device and software used in mind. Accounting for the systematic difference between methods increases comparability. Less dense scans did not affect drusen volume measurements in iAMD but decreased sensitivity for medium drusen in eAMD. Trial registration: ClinicalTrials.gov NCT03349801. Registered on 22 November 2017.

association between larger drusen volumes and an increased risk of AMD progression 5,6 , while drusen volume regression can precede conversion to late AMD lesions 7 . Drusen volume might also be more precisely measurable and repeatable than drusen area 5,8 , thus making it a promising biomarker and structural endpoint in AMD. OCT also allows for accurate assessment of reticular pseudodrusen (RPD), which are not included in the Beckman classification but have emerged as an important biomarker of AMD severity and progression risk 9 .
Automated algorithms for drusen volume quantification are available, including software for the Cirrus high-definition OCT (Carl Zeiss Meditec, Dublin, CA), which received approval from the Food and Drug Administration (FDA) in 2012 4 .
Several studies in the last decade have compared drusen measurements obtained from this software against manual quantification of drusen or similar readouts on different imaging modalities (mainly CFP-based), often showing that measurements across different imaging methods and devices yield different results and are not directly interchangeable 10,11 . One previous work compared drusen volume measurements from two different devices, with similar findings 12 . However, drusen volumes obtained from the FDA-cleared algorithms on Cirrus have not been compared with those from the Spectralis SD (spectral domain)-OCT device (Heidelberg Engineering, Heidelberg, Germany).
Another study found that drusen volume measurements obtained from 145 and 15 B-scans in iAMD are similar 13 . Nevertheless, different scan patterns on the same device have not been compared as to their sensitivity and specificity for drusen detection in eAMD and iAMD.
In order to better understand how the use of different devices, software and scan patterns might affect drusen volume measurements, we compared all these factors in persons with no AMD, eAMD and iAMD. We included automated drusen volume measures from the FDA-approved software in Cirrus and from two scans in Spectralis (a denser volume scan with 241 B-scans and a less dense volume scan with 25 B-scans), assessed through a newly developed software, in the MACUSTAR study cohort.

Methods
We assessed an initial dataset of 258 subjects from the cross-sectional part of MACUSTAR, a multi-center clinical cohort study focused on early stages of AMD 14,15 .
In brief, the major objective of the MACUSTAR consortium is to develop novel clinically validated endpoints in the area of functional, structural, and patient-reported outcome measures in patients with iAMD 14,15 . AMD staging (no, early, intermediate and late) for all subjects is reading center-confirmed using multimodal imaging 14,15 . Since drusen assessment was the main focus of this analysis, individuals with late AMD were not included in the study population.
Inclusion criteria, design and goals of MACUSTAR have been previously described 14,15 . For 10 patients, no imaging data could be retrieved for data analysis due to data management issues. Seven individuals were excluded because of a time gap between Cirrus and Spectralis examinations (> 6 weeks). Two individuals were excluded because of low-quality scans in both eyes (internal Cirrus quality parameter < 6 or internal Spectralis quality parameter < 20 dB). Incomplete data (at least one scan lacking in both eyes) led to the exclusion of 39 participants. In the analytical population, 20 eyes were excluded due to either a missing scan in one of the three modalities (N = 8), low scan quality (N = 5 Cirrus, N = 3 Spectralis) or artifacts in drusen segmentation (N = 1 for Cirrus, N = 3 for Spectralis), leaving 380 eyes from 200 individuals (no AMD, n = 49 (22.3%), eAMD, n = 25 (13.1%), iAMD, n = 126 (64.6%)) with high-quality, complete data. This study was conducted according to the provisions of the Declaration of Helsinki and was approved by the local ethics committees of participating countries, including the University Hospital Bonn ethics committee (384/17), as listed previously 15 . All participants provided informed consent.
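The exclusion steps above can be checked with simple arithmetic. The short sketch below only re-adds the counts reported in the text; it assumes two eyes per analysed participant, which is implied by the bilateral design but is stated here as an assumption.

```python
# Participant-level exclusions as reported in the text
initial_subjects = 258
excluded_subjects = {
    "no imaging data retrievable": 10,
    "time gap > 6 weeks between devices": 7,
    "low-quality scans in both eyes": 2,
    "incomplete data in both eyes": 39,
}
analysed_subjects = initial_subjects - sum(excluded_subjects.values())

# Eye-level exclusions within the analytical population
excluded_eyes = {
    "missing scan in one modality": 8,
    "low scan quality (Cirrus)": 5,
    "low scan quality (Spectralis)": 3,
    "segmentation artifacts (Cirrus)": 1,
    "segmentation artifacts (Spectralis)": 3,
}
# Assumption: every analysed participant contributes two eyes before exclusions
analysed_eyes = 2 * analysed_subjects - sum(excluded_eyes.values())

print(analysed_subjects, analysed_eyes)
```

The totals agree with the 200 individuals and 380 eyes reported for the analytical population.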
In brief, MACUSTAR participants are recruited at 20 clinical sites from seven European countries. Imaging data are graded at the central reading center (GRADE Reading Center, Bonn, Germany) by one junior reader followed by one senior reader grading review according to standardized and predefined grading procedures. For AMD status grading, the dense SD-OCT raster scan was used as the reference imaging modality. The B-scan with the largest possible drusen was preselected and its measurement was used to assess the maximum drusen size, which allowed for classification into small (≤ 63 µm), medium (> 63 µm and ≤ 125 µm) and large (> 125 µm) drusen. RPD were defined as hyperreflective irregularities and elevations above the RPE/BM complex on OCT that had to display corresponding lesions on either infrared imaging or fundus autofluorescence. Prerequisite for grading was a minimum of five individual lesions, each of a diameter of approximately 100 µm.
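The size categories used for grading can be written as a small lookup. This is a minimal sketch of the thresholds stated above; the function name is hypothetical and it does not reproduce the reading center's grading procedure.

```python
def classify_drusen_size(diameter_um: float) -> str:
    """Map a maximum drusen diameter (in micrometres) to the size class in the text."""
    if diameter_um < 0:
        raise ValueError("diameter must be non-negative")
    if diameter_um <= 63:
        return "small"       # <= 63 um
    if diameter_um <= 125:
        return "medium"      # > 63 um and <= 125 um
    return "large"           # > 125 um


for d in (40, 63, 64, 125, 126):
    print(d, classify_drusen_size(d))
```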
Drusen segmentation and quantification. Drusen volume measurements on Cirrus were derived from an established and FDA-approved software, whose measures are repeatable and reproducible 4,5,17,18 . In brief, in the Cirrus algorithm the observed and expected contours of the RPE layer are obtained by interpolating and fitting the shape of the segmented RPE layer, respectively. The areas located between the interpolated and fitted RPE shapes (which have nonzero area when drusen occur) are marked as drusen 4 .
Drusen quantification on Spectralis is based on OCT layer predictions. BM and RPE layer heights are predicted with a state-of-the-art deep learning model for order-constrained layer regression (predicting layer heights while guaranteeing their correct anatomical order). For the drusen computation, a healthy RPE height is derived from BM and RPE predictions under the assumption that it has a fixed distance to the BM which varies only based on individual physiology and image resolution. The drusen height, required for filtering small false positives, is determined based on connected components in a drusen enface projection. The algorithm on Spectralis was built as an extension of a previously published tool for drusen volume segmentation and is freely available 19–21 . Notably, the algorithm on Cirrus and the one on Spectralis adopt a similar method: drusen are computed as the area between the predicted RPE and the computed healthy RPE. In both algorithms, small false positive RPE elevations less than 5 pixels (19.5 μm) high are filtered out 4,20,21 .
To ensure full automation of measurements, neither drusen nor retinal layer segmentation was manually corrected. However, we performed post-hoc quality assurance of both the scans and the enface projections of drusen segmentation on both devices, ensuring that all scans were fully centered and drusen segmentation maps were plausible. Both algorithms report drusen volume measures inside fovea-centered circles of 3 mm and 5 mm diameter 4 ; we only investigated values from the 5-mm circle as they reflect the grid used for Beckman AMD grading. Pixel-to-micron conversion was based on the respective formulae provided by the Heidelberg Eye Explorer and Zeiss Cirrus software. A previous study showed high comparability between their axial and lateral retinal measurements 22 .
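The shared idea of both algorithms, drusen as the space between the segmented RPE and an estimated healthy RPE with small elevations filtered out, can be illustrated with a simplified sketch. It is not the Cirrus or Spectralis implementation: it filters contiguous runs within each B-scan instead of connected components in the enface projection, it expects RPE heights given in pixels above the BM, and the lateral and B-scan spacings in the example are hypothetical. Only the 5-pixel (19.5 μm) filter comes from the text.

```python
from typing import List, Sequence

MIN_HEIGHT_PX = 5            # elevations lower than this are filtered out (from the text)
AXIAL_UM_PER_PX = 19.5 / 5   # the text equates 5 pixels with 19.5 um


def filter_small_elevations(elev: Sequence[float]) -> List[float]:
    """Zero out contiguous runs of positive elevation whose peak is below MIN_HEIGHT_PX."""
    elev = list(elev)
    out = [0.0] * len(elev)
    start = None
    for i, h in enumerate(elev + [0.0]):  # trailing sentinel closes a final run
        if h > 0 and start is None:
            start = i
        elif h <= 0 and start is not None:
            if max(elev[start:i]) >= MIN_HEIGHT_PX:
                out[start:i] = elev[start:i]
            start = None
    return out


def drusen_volume_mm3(rpe: Sequence[Sequence[float]],
                      healthy_rpe: Sequence[Sequence[float]],
                      lateral_um: float,
                      bscan_spacing_um: float) -> float:
    """Sum the filtered RPE elevation over all A-scans and convert to mm^3."""
    total_um3 = 0.0
    for rpe_row, healthy_row in zip(rpe, healthy_rpe):
        elev = [max(r - h, 0.0) for r, h in zip(rpe_row, healthy_row)]
        for h_px in filter_small_elevations(elev):
            total_um3 += h_px * AXIAL_UM_PER_PX * lateral_um * bscan_spacing_um
    return total_um3 / 1e9  # 1 mm^3 = 1e9 um^3


# Hypothetical single B-scan: a 3-pixel bump (filtered) and a 10-pixel druse (kept)
rpe = [[0, 3, 0, 6, 10, 6, 0]]
healthy = [[0, 0, 0, 0, 0, 0, 0]]
volume = drusen_volume_mm3(rpe, healthy, lateral_um=10.0, bscan_spacing_um=30.0)
print(volume)
```

Restricting the sum to a fovea-centered 5-mm circle would add a mask on the A-scan positions, which is omitted here for brevity.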

Statistical analyses.
We assessed inter-device and inter-scan differences with intra-class correlation coefficients (ICCs), root mean squared error (RMSE), and mean difference.
Differences between devices were assessed with Wilcoxon paired signed rank test; increases across AMD stages were assessed with the Jonckheere-Terpstra test for ordered variables.
Sensitivity, specificity and area under the curve (AUC) of both algorithms in the population sample were tested with a receiver operating characteristic (ROC) analysis. We tested the accuracy of drusen volume measurements in discriminating eyes with any AMD (eAMD or iAMD), as well as the subsets with only eAMD or only iAMD, vs controls. We selected drusen volume measurement thresholds maximizing a previously described optimality criterion 23 . To assess the relative magnitude of the mean difference in each AMD group, we standardized it by dividing it by the mean average value of the two devices. We only calculated ICC in the iAMD group because the low variability of drusen volume in no AMD and eAMD leads to poorly interpretable ICCs 24 .
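Threshold selection in a ROC analysis can be illustrated as follows. The exact optimality criterion is not reproduced in this text, so the sketch assumes Youden's index (sensitivity + specificity − 1) as a stand-in; the study's criterion may differ. The data, the function names and the rule "volume ≥ threshold counts as positive" are hypothetical.

```python
from typing import Sequence, Tuple


def sens_spec(volumes: Sequence[float], is_case: Sequence[bool],
              threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when volume >= threshold is called positive."""
    tp = fn = tn = fp = 0
    for v, case in zip(volumes, is_case):
        positive = v >= threshold
        if case and positive:
            tp += 1
        elif case:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(volumes: Sequence[float],
                   is_case: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) maximizing Youden's index (assumed)."""
    best = None
    for t in sorted(set(volumes)):
        se, sp = sens_spec(volumes, is_case, t)
        j = se + sp - 1
        if best is None or j > best[0]:   # ties keep the lowest threshold
            best = (j, t, se, sp)
    return best[1], best[2], best[3]


# Hypothetical drusen volumes (mm^3): four control eyes followed by four AMD eyes
volumes = [0.000, 0.001, 0.002, 0.010, 0.004, 0.020, 0.050, 0.080]
is_case = [False, False, False, False, True, True, True, True]
print(best_threshold(volumes, is_case))
```

Stratified sensitivities (for example eAMD only) follow by passing only the eyes of that stage together with the controls.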
Current algorithms are trained for segmenting the BM-RPE complex; since RPD are located between the RPE and ellipsoid zone, algorithms may be less consistent in RPD detection and segmentation. For this reason, we stratified ICC by excluding individuals with RPD in iAMD to assess their effect on the comparability of measures. We reported both consistency and agreement using two-way ICC. In brief, ICC type consistency compares two measures without adding a penalty for a systematic error (x = y + e), in contrast to ICC type agreement (x = y); hence their joint assessment is highly informative of the interchangeability of measurements 24 . We visualized differences between measurements from different algorithms and modalities with Bland-Altman plots 25 . Conversion formulae were obtained with Deming regression on the dense Spectralis scan. To assess their accuracy, we randomly split the dataset into approximately 80% of observations for training and 20% for testing and assessed prediction accuracy with mean error and RMSE between converted and observed values. We compared results in the whole iAMD dataset against an optimal iAMD subset based on the Bland-Altman analysis.
While continuous drusen volume measures might provide more detailed phenotyping, differences among quantitative measures might be less relevant when considering quantitative cut-offs (e.g., one indicating high risk of progression). Hence, we assessed comparability across modalities in iAMD against a binary cut-off indicating higher progression risk, previously shown at 0.03 mm 3 in Cirrus 5 . The cut-off was 0.083 mm 3 in Spectralis and was derived by converting the value of 0.03 mm 3 with the conversion formula obtained in this paper. All statistical analyses were performed in R (base version 3.4).
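The difference between ICC type consistency and type agreement can be made concrete with a small sketch. It assumes the single-measure two-way forms commonly written as ICC(C,1) and ICC(A,1); whether the study used single or average measures is not stated here. The analyses were run in R, so this Python version and its example values are purely illustrative.

```python
from typing import Sequence, Tuple


def two_way_icc(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (consistency, agreement) single-measure two-way ICCs for two methods."""
    n, k = len(a), 2
    if n != len(b) or n < 2:
        raise ValueError("need two equally long series with at least two eyes")
    grand = (sum(a) + sum(b)) / (n * k)
    row_means = [(x + y) / 2 for x, y in zip(a, b)]   # one row per eye
    col_means = [sum(a) / n, sum(b) / n]              # one column per method

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for pair in zip(a, b) for v in pair)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k / n * (ms_cols - ms_err)
    )
    return consistency, agreement


# Hypothetical volumes (mm^3): method B reads exactly 0.05 mm^3 higher than method A
method_a = [0.02, 0.05, 0.10, 0.15, 0.20]
method_b = [v + 0.05 for v in method_a]
icc_consistency, icc_agreement = two_way_icc(method_a, method_b)
print(round(icc_consistency, 3), round(icc_agreement, 3))
```

With a purely systematic offset, consistency stays at 1 while agreement drops, which is why reporting both indicates whether a constant correction would make two methods interchangeable.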

Sensitivity and specificity.
In the dense Spectralis scan, we observed at the selected threshold a specificity of 93.7% and a sensitivity of 91.9% (68.1% for eAMD and 94.7% for iAMD). In the 25 B-scans modality, we observed a higher specificity (97.9%) but a lower sensitivity of 87.0% (36.2% in eAMD and 97.1% in iAMD). When assessing measurements obtained from the algorithm on Cirrus, we observed a specificity of 91.6%. Sensitivity was lower than for both Spectralis scan patterns (14.9% in eAMD, 87.0% in iAMD and 75.1% in the whole sample). ROC curves with AUC of each algorithm for drusen volume assessment and respective thresholds are reported in Supplementary Fig. 1.
Agreement and differences between drusen volume measurements. The mean systematic difference between the dense Spectralis scan and Cirrus was 0.0679 mm 3 in iAMD and corresponded to 70% of the mean average value (Table 3). When comparing drusen volumes between the two algorithms in iAMD, ICC type consistency was higher and more stable than type agreement. In individuals with RPD, the mean difference and RMSE were higher than for other subgroups (0.0737 and 0.1116 mm 3 , respectively). The CIs of the ICC in individuals with and without RPD did not overlap, indicating a statistically significant difference between the two groups.
Drusen volume measurements of the two Spectralis scans had very high agreement (ICC agreement > 0.99) and the mean difference was 0.0055 mm 3 , corresponding to only 4% of the mean average value.
Table 2. Summary statistics for different drusen volume measurements. SD standard deviation, IQR interquartile range, AMD age-related macular degeneration, eAMD early AMD, iAMD intermediate AMD, SBP systolic blood pressure, BCVA best-corrected visual acuity.
In the inter-device comparison of the Bland-Altman plot, we observed, both in eAMD and iAMD, larger drusen volumes on the dense Spectralis scan (in iAMD, the difference between the two modalities was positive in 223 eyes (93.6%)) and a difference in favor of Spectralis that increased with the average drusen volume measured on the two devices (Fig. 2a,b). In iAMD, we observed a linear trend between drusen volume measurements in Cirrus and the dense Spectralis scan for most data points. At the lower end of drusen volume, a small number of eyes (n = 13) had larger values on Cirrus, and at the upper end we observed a flattening of the linear trend with an increasingly broad confidence interval, indicating lower comparability (Fig. 2b). In the inter-scan comparison, we observed smaller drusen measurements on the dense Spectralis scan and a random measurement error between the two modalities (random scatter around the x-axis) (Fig. 2c,d).
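The quantities read off a Bland-Altman plot, together with the standardized mean difference used in this report (mean difference divided by the mean average value of the two methods), can be computed as in the sketch below. The 1.96 SD limits of agreement are the conventional choice and are assumed here; the example values are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def bland_altman(a: Sequence[float], b: Sequence[float]) -> Dict[str, float]:
    """Summarize paired measurements a and b (same eyes, two methods)."""
    diffs = [x - y for x, y in zip(a, b)]        # y-axis of the plot
    avgs = [(x + y) / 2 for x, y in zip(a, b)]   # x-axis of the plot
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "mean_diff": bias,
        "loa_low": bias - 1.96 * sd,             # assumed conventional limits
        "loa_high": bias + 1.96 * sd,
        "std_mean_diff": bias / mean(avgs),      # mean difference / mean average
        "n_positive": sum(d > 0 for d in diffs), # eyes where a exceeds b
    }


# Hypothetical volumes (mm^3) for the same three eyes on two methods
dense_spectralis = [0.10, 0.20, 0.30]
cirrus = [0.05, 0.10, 0.15]
summary = bland_altman(dense_spectralis, cirrus)
print(summary)
```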
Formulae for algorithms conversion. We identified an optimal dataset with a linear trend between the two measurements consisting of 194 eyes for inter-device drusen volume measurements conversion based on Fig. 2b, i.e. we excluded eyes with measurement obtained from the Cirrus algorithm higher than Spectralis (N = 13) and mean values larger than 0.2 mm 3 , corresponding to a flatter trend with large confidence interval (N = 31). When predicting drusen volume from Cirrus to the dense Spectralis scan in the test dataset, the mean error (RMSE) decreased in the whole iAMD dataset from − 0.0113 (0.0640) mm 3 to 0.0074 (0.0313) mm 3 in the optimal dataset. When predicting drusen volume from the dense Spectralis scan to Cirrus, the mean error (RMSE) amounted to 0.0173 (0.0458) mm 3 in the whole iAMD dataset and − 0.0003 (0.0168) mm 3 in the optimal dataset.
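The conversion workflow described here (Deming regression on a training split, then mean error and RMSE on held-out eyes) can be sketched as follows. The sketch assumes an error-variance ratio of 1 between the two devices and defines the error as converted minus observed; neither choice is confirmed by the text. The example data follow a made-up line and do not reproduce the published conversion formulae.

```python
import math
import random
from statistics import mean
from typing import List, Sequence, Tuple


def deming_fit(x: Sequence[float], y: Sequence[float],
               delta: float = 1.0) -> Tuple[float, float]:
    """Deming regression of y on x; delta = error variance of y / error variance of x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((v - mx) ** 2 for v in x) / (n - 1)
    syy = sum((v - my) ** 2 for v in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    if sxy == 0:
        raise ValueError("x and y are uncorrelated; slope is undefined")
    slope = (syy - delta * sxx
             + math.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    return slope, my - slope * mx


def mean_error_and_rmse(predicted: Sequence[float],
                        observed: Sequence[float]) -> Tuple[float, float]:
    """Mean signed error (predicted - observed) and root mean squared error."""
    errors = [p - o for p, o in zip(predicted, observed)]
    return mean(errors), math.sqrt(mean([e * e for e in errors]))


def split_80_20(pairs: Sequence[Tuple[float, float]], seed: int = 0):
    """Random split into roughly 80% training and 20% test pairs."""
    shuffled: List[Tuple[float, float]] = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical noiseless data: spectralis = 2.0 * cirrus + 0.01 (not the paper's formula)
cirrus = [0.01 * i for i in range(1, 21)]
spectralis = [2.0 * c + 0.01 for c in cirrus]
train_pairs, test_pairs = split_80_20(list(zip(cirrus, spectralis)))

slope, intercept = deming_fit([c for c, _ in train_pairs], [s for _, s in train_pairs])
converted = [slope * c + intercept for c, _ in test_pairs]
mean_err, rmse = mean_error_and_rmse(converted, [s for _, s in test_pairs])
print(slope, intercept, mean_err, rmse)
```

Unlike ordinary least squares, Deming regression allows for measurement error in both devices, so the fitted line can be inverted to convert in either direction.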

Discussion
We present a study systematically comparing drusen volume measurements obtained from two different algorithms on two different SD-OCT devices (Cirrus, Spectralis) and from two modalities at different B-scan density from the same device (Spectralis), as well as evaluating their classification accuracy in no AMD, eAMD and iAMD individuals.
The algorithm using Spectralis images showed a higher sensitivity than the Cirrus algorithm in both the eAMD and iAMD groups, while specificity was similar. In iAMD, after accounting for a systematic difference, comparability between the two algorithms was good (ICC consistency type > 0.75) and more stable (CI width decreased by 84%). The mean difference in iAMD was 0.0679 mm 3 . The conversion formulae provided could be used to collate and compare data from the two algorithms and devices. The formulae were derived in an optimal iAMD dataset; hence they might be less accurate at average drusen volume measurements larger than 0.2 mm 3 and in cases where the Cirrus algorithm yields the higher value.
Comparability between drusen volume measurements from the two devices was lower in individuals with RPD. This might be due to factors both intrinsic to imaging and performance of algorithms for drusen segmentation 26 . In particular, RPD and soft drusen have a different relationship with respect to the RPE, hence current algorithms often fail to accurately segment RPD.
Furthermore, we observed a good agreement between the two devices against a high-risk cut-off in drusen volume in iAMD. This indicates a substantial agreement in detecting individuals at high-risk, which might prove useful in clinical settings to efficiently triage patients 5 .
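Agreement against the high-risk cut-off can be illustrated by dichotomizing each device's measurement at its own threshold (0.03 mm 3 for Cirrus and the converted 0.083 mm 3 for Spectralis, as given in the Methods) and comparing the labels. The agreement statistic is not named in this text; the sketch assumes raw agreement and Cohen's kappa, treats values at the cut-off as high risk, and uses hypothetical volumes.

```python
from typing import Sequence, Tuple

CIRRUS_CUTOFF_MM3 = 0.03       # high-risk cut-off on Cirrus (from the text)
SPECTRALIS_CUTOFF_MM3 = 0.083  # converted cut-off on Spectralis (from the text)


def high_risk_agreement(cirrus: Sequence[float],
                        spectralis: Sequence[float]) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for the binary high-risk labels."""
    # Assumption: a value equal to the cut-off is classified as high risk
    a = [v >= CIRRUS_CUTOFF_MM3 for v in cirrus]
    b = [v >= SPECTRALIS_CUTOFF_MM3 for v in spectralis]
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)   # agreement expected by chance
    if expected == 1:
        return observed, 1.0                   # both methods give a single identical label
    return observed, (observed - expected) / (1 - expected)


# Hypothetical paired volumes (mm^3) for six eyes
cirrus_vol = [0.01, 0.02, 0.04, 0.05, 0.10, 0.00]
spectralis_vol = [0.05, 0.09, 0.10, 0.12, 0.25, 0.02]
observed, kappa = high_risk_agreement(cirrus_vol, spectralis_vol)
print(round(observed, 3), round(kappa, 3))
```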
To the best of our knowledge, a comparison between drusen volume measurements from two different algorithms on Spectralis and Cirrus SD-OCT has not been performed. However, previous studies have observed a systematic difference when investigating other biomarkers (such as retinal thickness or the BM-RPE complex derived with built-in software) across the two devices 12,22,27,28 .
Any such differences may be due to differences in image acquisition, resolution and scaling 22,27,28 , device-specific software (e.g. computational methods, minimal elevation of the RPE necessary to identify drusen) or chosen scan modality (number of A- and B-scans 22 ). In our study, the number of significant decimal figures reported by the two algorithms is different, which might in part account for the observed sensitivity differences.
Interestingly, a previous study found comparable retinal thickness measurements from Cirrus and Spectralis utilizing a third-party segmentation algorithm 28 .
Similarly, another study found that differences in drusen volume measurements between two SD-OCT devices decreased when measuring drusen volume with the same third-party software, as compared to measuring drusen volume with in-built algorithms on each device 12 . These findings suggest that differences in software might be more relevant than differences in hardware; however, our study design did not allow us to separate intrinsic image differences from software differences in drusen segmentation.
Table 3. Measures of agreement, in the whole population and stratified by AMD stage, across different drusen measurements. ICC intra-class correlation coefficient, Cons. consistency type, Agreem. agreement type, RMSE root mean squared error, Spec. Spectralis, AMD age-related macular degeneration, std. standardized, diff. difference. 1 Inter-device comparison refers to the 241 B-scans Spectralis modality and Cirrus. 2 Inter-scan comparison refers to the 241 and 25 B-scans modalities in Spectralis. 3 The standardized mean value corresponds to the mean difference divided by the mean average value of the two devices. 4 Assessed against a binary cut-off indicating high progression risk.
When assessing drusen segmentation between the dense, 241 B-scans and the less dense, 25 B-scans modalities in Spectralis, we observed almost complete agreement between the two scan patterns in iAMD. Similar findings were observed in a recent study comparing manually delineated drusen volume with Spectralis in an iAMD cohort between 145 B-scans and 15 B-scans modalities 13 . Our results extend these previous findings, with the observation that a less dense grid might suffice for drusen volume quantification in iAMD but has lower sensitivity in eAMD. This difference may be explained by the fact that interpolation between scans can compensate for a smaller number of B-scans in the case of large drusen, whereas medium drusen (between 63 and 125 μm) might lie between B-scans and be more easily missed in a less dense scan. In this context, part of the lower sensitivity in Cirrus for eAMD might also be explained by less densely placed B-scans compared to the Spectralis scan (200 vs 241 B-scans, respectively). More studies investigating drusen detection at intermediate B-scan densities are needed to derive a number of B-scans that balances examination speed against detection of smaller biomarkers (such as medium-sized drusen or hyperreflective foci).
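The argument that medium drusen can lie between B-scans of a sparse pattern can be illustrated with a rough spacing estimate. The sketch rests on several assumptions not taken from the text: an approximate 0.29 mm per degree of visual angle, equally spaced B-scans, and B-scans stacked along the 20° and 25° dimensions of the two Spectralis patterns. The numbers are therefore only indicative.

```python
ASSUMED_MM_PER_DEGREE = 0.29  # rough conversion for an average eye; an assumption


def bscan_spacing_um(scan_height_mm: float, n_bscans: int) -> float:
    """Distance between adjacent, equally spaced B-scans in micrometres."""
    return scan_height_mm * 1000 / (n_bscans - 1)


def can_fall_between(lesion_diameter_um: float, spacing_um: float) -> bool:
    """True if a lesion of this diameter can fit entirely between two adjacent B-scans."""
    return lesion_diameter_um < spacing_um


# Spectralis patterns from the text: 20 degrees with 25 B-scans, 25 degrees with 241 B-scans
sparse_spacing = bscan_spacing_um(20 * ASSUMED_MM_PER_DEGREE, 25)
dense_spacing = bscan_spacing_um(25 * ASSUMED_MM_PER_DEGREE, 241)

for label, spacing in (("25 B-scans", sparse_spacing), ("241 B-scans", dense_spacing)):
    print(label, round(spacing, 1), "um;",
          "a 63 um druse can be missed:", can_fall_between(63, spacing),
          "| a 125 um druse can be missed:", can_fall_between(125, spacing))
```

Under these assumptions the sparse pattern leaves gaps wider than the whole medium-drusen range, while the dense pattern does not, which is in line with the lower eAMD sensitivity of the 25 B-scans modality.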
Strengths of our study include the well-phenotyped sample of participants with no AMD, eAMD and iAMD, the implementation of standardized image acquisition protocols, training of study site personnel, use of a central reading center and implementation of automated image analysis software. Limitations include the relatively small sample size, the lack of external validation of our findings and the lack of data on repeatability and reproducibility of the Spectralis software, whereas such studies exist for Cirrus 8,17 . However, the high agreement we observed between the two Spectralis scans might be indicative of good reproducibility of its findings.
In conclusion, drusen volume measurements obtained from the two devices and algorithms are not directly interchangeable. In iAMD, accounting for a systematic error largely increased their comparability, possibly allowing for data integration from the two modalities. Presence of RPD further complicated drusen detection and quantification. Comparability between the 25 and 241 B-scans modalities was high, but dense scan patterns are required in eAMD. Further research is required to better characterize optimal scan patterns and image analysis software for the best possible drusen detection and quantification.

Data availability
Data are not publicly available. However, the datasets used in the present study can be made available from the MACUSTAR consortium upon reasonable request at dataaccess@macustar.eu.

Disclaimer
The communication reflects the authors' view and neither IMI nor the European Union, EFPIA, or any Associated Partners are responsible for any use that may be made of the information contained therein.

Author contributions
D.G.: study design; data acquisition, analysis and interpretation; manuscript writing. J.H.T.: study design; data acquisition and interpretation; manuscript editing. O.M.: manuscript editing; data acquisition and interpretation; creation of new software used for this work. M.W.M.W.: manuscript editing; data interpretation; creation of new software for this work. M.S.: study design; data acquisition and interpretation; manuscript editing. M.M.B.: study design; manuscript editing. S.S.V.: study design; data acquisition and interpretation; manuscript editing. M.P.: study design; data acquisition and interpretation; manuscript editing. S.H.T.: study design; data acquisition and interpretation; manuscript editing. S.P.: study design; manuscript editing. S.L.: study design; manuscript editing. F.G.H.: study design; data acquisition and interpretation; manuscript editing. R.P.F.: study design; data acquisition and interpretation; manuscript editing. All authors approved the final version of the manuscript to be published.

Funding
Open Access funding enabled and organized by Projekt DEAL. This project received funding from the Innovative Medicines Initiative 2 Joint Undertaking (Grant Agreement Number 116076). This joint undertaking received