Global optical coherence tomography measures for detecting the progression of glaucoma have fundamental flaws

Objective To understand the problems involved in using global OCT measures for detecting progression in early glaucoma. Subjects/Methods Eyes from 76 patients and 28 healthy controls (HC) had at least two OCT scans at least 1 year apart. To determine the 95% confidence intervals (CIs), 151 eyes (49 HC and 102 patients) had at least two scans within 6 months. All eyes had 24-2 mean deviation ≥ -6 dB. The average (global) thicknesses of the circumpapillary retinal nerve fibre layer (cRNFL), GONH, and of the retinal ganglion cell layer plus inner plexiform layer (RGCLP), Gmac, were calculated. Using quantile regression, the 95% CIs were determined. Eyes outside the CIs were classified as “progressors.” For a reference standard (RS), four experts evaluated OCT and VF information. Results According to the RS, 31 of the 76 (40.8%) patient eyes were identified as progressors (RS-P), and 45 patient eyes and all 28 HC eyes as nonprogressors (RS-NP). The metrics missed (false negative, FN) 15 (48%) (GONH) and 9 (29%) (Gmac) of the 31 RS-P eyes. Further, GONH and/or Gmac falsely identified (false positive, FP) 10 (22.2%) of the 45 patient RS-NP eyes and 7 (25%) of the 28 HC eyes as progressing. Post-hoc analysis identified three reasons (segmentation, centring, and local damage) for these errors. Conclusions Global metrics lead to FPs and FNs because of problems inherent in OCT scanning (segmentation and centring), and to FNs because they can miss local damage. These problems are difficult, if not impossible, to correct, and raise concerns about the advisability of using GONH and Gmac for detecting progression.


Introduction
Detecting the progression of glaucoma is a challenge for the clinician. Traditionally, the most commonly used quantitative techniques involved the mean deviation (MD) of the 24-2 visual field (VF), obtained with standard automated perimetry. With the advent of optical coherence tomography (OCT), the average thickness of the circumpapillary retinal nerve fibre layer (cRNFL) became a common measure of progression. This measure, called global cRNFL thickness, has been incorporated into commercial OCT reports. With the recent incorporation of OCT scanning of the macula, an average (global) measure of the retinal ganglion cell plus inner plexiform layer (RGCLP) thickness also has been employed to track progression, and a number of studies have compared these two OCT global measures [1][2][3][4][5][6][7].
However, these two measures, global cRNFL (G ONH) and global RGCLP (G mac), miss early glaucomatous damage clearly visible on probability/deviation maps, which display abnormal regions of RNFL and/or RGCLP thickness [8][9][10]. Thus, it is likely that these two measures will also miss clear progression of glaucoma, while also falsely identifying some eyes as progressors.
Our purpose here was to understand the problems involved in using global OCT measures for detecting progression in early glaucoma. First, we show, as expected, that the conventional thickness measures, G ONH and G mac, combined with a traditional event-based analysis, lead to both excessive false positives (FPs) and false negatives (FNs). Second, and most importantly, we identify the reasons for these errors via a post-hoc analysis.

Participants
There were 104 study eyes from 104 individuals; 76 were from glaucoma or glaucoma suspect patients. The remaining 28 eyes were from healthy controls (HCs) with normal fundus examination, normal VFs, and IOP < 22 mmHg. All eyes had 24-2 MD better than -6 dB and at least two OCT scans: a baseline scan and a scan obtained at least 1 year after the baseline (mean: 24.9 ± 8.7 months, range 12-42 months). All individuals were enrolled in Columbia University's prospective study, Macular Damage in Early Glaucoma and Progression (ClinicalTrials.gov: NCT02547740).
Study procedures followed the tenets of the Declaration of Helsinki and Health Insurance Portability and Accountability Act and were approved by the Institutional Review Board of Columbia University. Written informed consent was obtained from all participants.

OCT data
Widefield (12 × 9 mm) swept-source OCT volume scans (Atlantis; Topcon Inc., Tokyo, Japan) were obtained for each eye. Every scan was rotated to a common fovea-to-disc angle, which accounted for head-eye torsion and, to some extent, anatomical differences, as previously described [11]; this rotation is currently incorporated in a commercial report similar to the one in Fig. 1, which was generated by our custom program. A derived B-scan image (Fig. 1Aa) was generated from the widefield scan for a circle 3.45 mm in diameter centred on the optic disc. The cRNFL thicknesses were measured (black-magenta-blue-black curve in Fig. 1Ab). An RNFL thickness map (Fig. 1Ad) was obtained from the widefield scan. A portion of the widefield scan, 6 × 6 mm centred on the fovea, was used to produce an RGCLP thickness map (Fig. 1Ae). Age-corrected RNFL (Fig. 1Ac) and RGCLP (Fig. 1Af) probability maps were created based on these thickness maps and normative controls [12].

Establishing progression with OCT summary metrics
Global cRNFL (G ONH) and global RGCLP (G mac) average thicknesses were calculated for each eye at each visit. The thresholds [95% confidence interval (CI)] to identify statistically significant event-based progression in the study group were derived from a short-term group after performing quantile regression [13], which is analogous to how event-based progression is defined with commercially available VFs and OCT. Details of this event-based methodology are provided in the Supplementary Information (Supplementary Fig. 1).
These 95% thresholds were then applied to the 104 eyes of the study group. Eyes whose change in the G ONH or G mac metric at the follow-up test was equal to or greater than the 95% CI were classified as "statistical progressors".
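The event-based rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the study's implementation: the study derived its cutoffs by quantile regression on the short-term group, whereas the sketch assumes one fixed cutoff per metric (the average values quoted in the Discussion) and assumes that only thinning counts as progression. The names `Visit`, `CUTOFF_UM`, and `statistical_progressor`, and the example thicknesses, are hypothetical.

```python
from dataclasses import dataclass

# Average 95% CI cutoffs quoted in the Discussion (micrometres).
# Assumption: one fixed cutoff per metric; the study's cutoffs came from
# quantile regression and were not a single constant.
CUTOFF_UM = {"G_ONH": 3.4, "G_mac": 1.6}


@dataclass
class Visit:
    g_onh: float  # global cRNFL thickness, micrometres
    g_mac: float  # global RGCLP thickness, micrometres


def statistical_progressor(baseline: Visit, follow_up: Visit) -> dict:
    """Flag each metric whose thinning equals or exceeds its cutoff."""
    loss_onh = baseline.g_onh - follow_up.g_onh
    loss_mac = baseline.g_mac - follow_up.g_mac
    flags = {
        "G_ONH": loss_onh >= CUTOFF_UM["G_ONH"],
        "G_mac": loss_mac >= CUTOFF_UM["G_mac"],
    }
    # The OR criterion discussed later: abnormal on either metric.
    flags["either"] = flags["G_ONH"] or flags["G_mac"]
    return flags


# Hypothetical eye: 4.0 um of cRNFL thinning and 1.0 um of RGCLP thinning.
example = statistical_progressor(Visit(90.0, 70.0), Visit(86.0, 69.0))
```

In this hypothetical eye only the G ONH change reaches its cutoff, so the eye would be a "statistical progressor" on G ONH and on the OR criterion, but not on G mac.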

Reference standard (RS) for progression
Our objective here was to identify factors affecting changes in G ONH or G mac by analysing B-scans (e.g., Fig. 1Aa) and probability maps (e.g., Fig. 1Ac, f) of possible FPs and possible FNs. To identify the eyes that are possible FPs and possible FNs, a reference standard (RS) was used. In particular, four of the authors independently decided on progression or no progression after evaluating all available OCT and VF tests, and all OCT reports with probability maps (Fig. 1). For the 104 study eyes, the average number of visits was 8.3 ± 2.6. Initially, the experts agreed for 98 eyes, and consensus was reached for the remaining 6 after they reviewed the cases together.

Progressors according to metrics
The two global summary metrics (i.e., G ONH and G mac) identified a similar number of patient eyes as "statistical progressors": 24 for G ONH and 25 for G mac (Fig. 2A). About half, 12 eyes, were "statistical progressors" according to both metrics.
The G ONH and/or G mac metric also identified 7 of the 28 (25%) HC eyes as "statistical progressors" (Fig. 2B). These seven eyes were clearly FPs as they were HCs with no signs of glaucomatous damage. Of these seven FPs, two were FPs on G ONH and six on G mac, with one eye an FP on both.

FNs based upon RS
Only four (12.9%) RS-P eyes were missed by both metrics (Fig. 3A). All four showed clear glaucomatous damage when the entire report was evaluated.
The fact that only four eyes were missed by both G ONH and G mac underestimates the extent of the problem with the clinical use of these metrics. Suppose we were to use "abnormal on G ONH OR G mac" for clinical decision making. Then, although the FN rate for RS-P would be 12.9% (4 eyes), the FP rate for the HC would be 25% (Table 1). Thus, we need to understand the FNs for G ONH and G mac alone. A total of 20 (64.5%) of 31 RS-P eyes were missed by one or both metrics. That is, in addition to the 4 eyes missed by both metrics, 5 were missed only by G mac and 11 only by G ONH.

Missed only by G mac
Five (5) of the 31 eyes categorised as RS-P were identified as "statistical progressors" on the G ONH, but not the G mac, metric. Three of the five eyes showed clear thinning on the RGCLP thickness map, even though the G mac metric failed to identify the eye as a progressor.

Missed only by G ONH
Eleven (11) of the 31 eyes categorised as RS-P were identified as "statistical progressors" on the G mac, but not the G ONH, metric. Seven of these 11 G ONH FN eyes showed clear progressive thinning on the RNFL, which was not detected by the G ONH metric. Figure 1 shows the reports for one of these eyes. The arrows point to corresponding regions with clear progression in the inferior retina and disc (red) and the superior retina and disc (black).
Three additional examples are provided in the Supplementary Information where the metrics failed to detect the RS-P eyes correctly (Results).

FPs based upon RS-NP
First, of the 28 HC eyes, G ONH and G mac falsely classified 2 (G ONH) and 6 (G mac) eyes as "statistical progressors". Further, of the 45 patient eyes judged to be RS-NP, 8 (G ONH) and 3 (G mac) eyes were classified as "statistical progressors", with 1 eye judged as progressing by both.

Post-hoc analysis of FP and FN
A post-hoc analysis was performed to understand the possible reasons for the disagreement between the metrics and the RS. This analysis identified three possible reasons: (1) local damage; (2) disc and fovea centring; and (3) segmentation errors.

Local damage
Of the 20 FN eyes missed by one or both metrics, 6 had local defects (2 FNs on both metrics, 3 on G ONH, and 1 on G mac). The reports (panels A and B in Fig. 1) are for an eye "progressing" according to G mac, but not G ONH. Local defects in both the superior (black arrows) and inferior (red arrows) retina deepen over time. The G ONH metric missed this local damage.
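A toy calculation illustrates how averaging can hide a local defect. The thickness profile, the size and depth of the defect, and the variable names below are all hypothetical; only the 3.4 μm cutoff (the average 95% CI for G ONH quoted in the Discussion) comes from the text.

```python
import numpy as np

CUTOFF_UM = 3.4  # average 95% CI cutoff for G ONH quoted in the Discussion

# Hypothetical circumpapillary profile: one thickness sample per degree,
# flat at 90 um at baseline (real profiles are not flat).
n_points = 360
baseline = np.full(n_points, 90.0)

# Hypothetical local defect: 25 um of thinning confined to a 30-degree arc.
follow_up = baseline.copy()
follow_up[60:90] -= 25.0

global_change_um = baseline.mean() - follow_up.mean()   # change in the average
max_local_change_um = (baseline - follow_up).max()      # deepest local loss
flagged = global_change_um >= CUTOFF_UM
```

Here a 25 μm local loss moves the global average by only about 2.1 μm (25 × 30/360), which stays below the cutoff, so the eye would not be flagged even though the local change is large.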

Differences in centring of derived circle or fovea
In six of the eyes where the G ONH metric disagreed with the RS (four FN, two FP), there was a small difference in centring of the optic disc between visits, as identified on the reports. Figure 3A, B shows an example where the disc was centred differently on the two reports. This resulted in a change in the location of the derived circle scan, as can be seen by the shadows of the blood vessels (red arrows and dashed lines). This resulted in an FN for G ONH. For these six eyes, the change in G ONH was small (average of 3.8 μm), only just outside the 95% CI. (Overall, based upon the quantile regression, the 95% CI for G ONH ranged from 3.2 to 3.6 μm.) Note that in five of these six eyes, G mac, which does not depend upon disc centring, agreed with the RS.
A similar problem can occur via small differences in the centring of the fovea for the G mac analysis. This appeared to be the primary reason for five HC eyes that were FP only on G mac. For example, in Fig. 3C, D, the ring-like artefact in the RGCLP probability plot (known to be due to anatomical differences of the fovea) suggests a small difference in centring [14]. For the five eyes, the G mac change ranged from only 1.4 to 1.5 μm, small in absolute terms but large relative to the 95% CI, which ranged from 1.0 to 1.4 μm. Note that foveal centring should only affect G mac. Consistent with this, G ONH agreed with the RS for all five eyes.

Segmentation errors
Segmentation errors can affect the metrics. Figure 4A shows an example where the segmentation, secondary to a scanning artefact, clearly affected the G mac value. While large errors such as this were rare, more subtle segmentation errors undoubtedly occurred and would be harder to detect. Figure 4B shows an example where a subtle segmentation error (red arrows) resulted in a decrease in the cRNFL thickness in the follow-up scan of this HC eye. The G ONH value changed by 4.6 μm, resulting in an FP, as the 95% CI was 3.1 μm. By superimposing the cRNFL plots for the two scan dates (lower right panel), we estimate that this segmentation error contributed about 4 μm to the change in G ONH. Thus, small segmentation errors can lead to FP or FN errors.
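The arithmetic behind the Fig. 4B example can be checked directly. The sketch uses only the three values quoted above; the variable names are illustrative, and the roughly 4 μm contribution is the approximate estimate given in the text, not a measured constant.

```python
observed_change_um = 4.6     # change in G ONH between the two scans
segmentation_error_um = 4.0  # approximate contribution of the segmentation error
cutoff_um = 3.1              # 95% CI quoted for this eye

# Change that would remain if the segmentation error were absent.
residual_change_um = observed_change_um - segmentation_error_um

flagged_with_error = observed_change_um >= cutoff_um
flagged_without_error = residual_change_um >= cutoff_um
```

With the error, the change exceeds the cutoff; without it, the remaining change of about 0.6 μm would fall well inside the 95% CI, which is why this HC eye appears as an FP.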

Discussion
We evaluated the performance of two common metrics used for detecting progression of glaucoma, global cRNFL thickness (G ONH) and global RGCLP thickness (G mac). Consistent with previous studies, these metrics identified a similar number of eyes with a standard event-based technique [15][16][17]. In particular, the metrics identified 24 (G ONH) and 25 (G mac) eyes as "statistical progressors," with 12 eyes progressing on both. Further, we demonstrated that these conventional thickness measures, combined with a traditional event-based analysis, resulted in both excessive FPs and FNs. A post-hoc analysis uncovered reasons for their poor performance, which was the main purpose of this study.

An evaluation of metrics
Based upon the RS for the patients, the metrics had relatively high FN and FP rates, as shown in Table 1. For example, for the eyes showing progression according to our RS (RS-P), the FN rates for G ONH and G mac were 48.4% (15 eyes) and 29.0% (9 eyes), respectively (columns 1 and 2, row 1). Given that only four eyes were missed by both metrics, if we classify an eye as a "progressor" based upon an abnormal G ONH OR an abnormal G mac, then the FN rate of 12.9% (column 3, row 1) is considerably lower. However, this OR criterion will increase the FP rate (i.e., decrease specificity). In particular, 10 of the 45 RS-NP eyes would be identified as statistical progressors based upon an abnormal G ONH OR G mac, for an FP rate of 22.2% and a specificity of 77.8% (column 3, row 2). Further, 7 of the 28 HC eyes would be identified as statistical progressors, for an FP rate of 25% and a specificity of 75%. Thus, G ONH and G mac metrics are a poor method for detecting progression in this population of eyes with early glaucoma.
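The rates quoted in this section follow from the eye counts reported in the Results. The sketch below recomputes them; the counts are taken from the text, while the helper `pct` and the variable names are illustrative only.

```python
def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


N_RS_P, N_RS_NP, N_HC = 31, 45, 28  # eyes per reference-standard group

# False negatives among the RS-P eyes.
fn_onh, fn_mac, fn_both = 15, 9, 4
missed_by_one_or_both = fn_onh + fn_mac - fn_both  # inclusion-exclusion
fn_rate = {
    "G_ONH": pct(fn_onh, N_RS_P),
    "G_mac": pct(fn_mac, N_RS_P),
    # Under the OR criterion an eye is missed only if both metrics miss it.
    "OR": pct(fn_both, N_RS_P),
}

# False positives under the OR criterion (flagged by either metric).
fp_rs_np = 8 + 3 - 1  # G ONH + G mac - both, patient RS-NP eyes
fp_hc = 2 + 6 - 1     # G ONH + G mac - both, healthy control eyes
fp_rate_or = {"RS-NP": pct(fp_rs_np, N_RS_NP), "HC": pct(fp_hc, N_HC)}
specificity_or = {
    "RS-NP": pct(N_RS_NP - fp_rs_np, N_RS_NP),
    "HC": pct(N_HC - fp_hc, N_HC),
}
```

The OR criterion lowers the FN rate because an eye must be missed by both metrics to count as an FN, and it raises the FP rate because a false alarm on either metric is enough to flag the eye.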

Why are metrics performing poorly?
We identified three reasons why these global metrics perform poorly. First, they can miss local damage. The fact that local damage can be missed is understandable as both metrics are based upon averages of regions larger than these local defects. Second, we found that subtle segmentation errors can produce changes in G ONH and G mac that are large relative to the criterion change used to identify progression. Finally, relatively subtle changes in centring of the fovea or disc can also produce changes in G ONH and G mac. As a test of concept, we simulated changes in the centring of the fovea and the disc. According to these simulations, small changes in the centre of the disc can produce a change in G ONH equal to the average 95% CI cutoff. This is consistent with a 2009 study by Cheung et al. [18]. Based upon older time domain OCT circle scans, they estimated that offsets as small as 0.1 mm in disc centring produced on average a change in G ONH of 2.3 μm. Similarly, we found changes in the centre of the fovea as small as 0.5° (about 0.14 mm) can produce a change in G mac equal to or more than the average 95% CI cutoff.
There are two important points to be made about segmentation and centring problems. First, all algorithms make segmentation errors and correcting them is difficult in general, and typically not feasible in a clinical practice [19][20][21]. Likewise, small changes in centring of disc and/or fovea are difficult to impossible to avoid [22,23]. Segmentation will affect centring and so will head tilt into the plane of the scan. Currently, there is no way to correct the latter. Second, relatively small changes fall outside the 95% CI for these metrics. In this study, average changes of only 3.4 μm (G ONH) and 1.6 μm (G mac) are needed. Thus, although the changes in these metrics caused by segmentation and centring are small, they can still lead to both FPs and FNs [18].
Given these three problems, it is not surprising that global metrics are suboptimal for identifying progression. Further, there is no easy fix for these problems. Conventional clinical standards, such as Zeiss' Glaucoma Progression Analysis (GPA), use longer series (usually at least four tests) in an attempt to overcome some of these issues. Trend- and event-based analysis of a series of tests can potentially reduce the "noise" and exclude outliers, although it is likely that local damage will still be missed, and segmentation and centring errors will still contribute to variability. However, there is a more fundamental problem inherent in the trend-based analysis. We have argued that analyses of long series of tests do not fully answer a crucial clinical question that physicians face in a glaucoma clinic; that is, "has glaucoma progressed since the last visit?" [24].

Our 95% CI values and the literature
Previous studies using different OCT instruments arrived at a 95% CI near 5 μm for the G ONH metric [25][26][27]. This led to the "Rule of 5 μm" used by some clinicians [28]. Some consider changes in G ONH of more than 5 μm as indicating progression. In a longitudinal study, Thompson et al. concluded that a 95% CI of 5 μm resulted in too many FPs due to test-retest variability [28]. Our 95% CI value for G ONH was on average 3.4 μm, smaller than 5 μm. Had we used 5 μm instead, it would have reduced the FP rate, but increased the FN rate, leaving accuracy about the same (Table 1, column 5). The accuracy of these global metrics is poor. Thus, changing cutoffs will only trade off sensitivity vs. specificity; it will not improve accuracy.
What is the alternative?
We have previously argued that OCT global metrics will miss damage that can be seen on reports such as those in Fig. 1 [12,29]. As in the case of early detection, we are suggesting that trained observers will outperform G ONH and G mac metrics if they have these reports. Of course, there may be some purposes, such as clinical trials, where qualitative evaluations are not appropriate. For these purposes, we need to find alternatives to global metrics. For detection of glaucoma, we have shown success with an objective structure-function method, as well as a deep learning approach [11][30][31][32][33]. Similar approaches can be applied to progression. For example, the clinician can topographically compare the changes in the VF to the changes in the OCT probability maps, as well as topographically compare the changes in the different OCT maps and images.

Limitations
There are three limitations to this study worth mentioning. First, the sample is relatively small, although it is hard to see how more eyes will change the fundamental findings here. Second, the design suffers from the general problem facing studies of progression. There is no "gold standard" or "litmus test for progression." In this study we used an RS based on the consensus of four experts after evaluation of all available structural and functional information. Other progression studies have used, for example, Zeiss' GPA to confirm the presence of deterioration [34,35]. Thus, applying different RS will produce different estimates of FP and FN. However, our general conclusions regarding the problems with these metrics should hold. See the Supplementary Figures for proof of concept. Finally, the eyes in this study were all "early glaucoma," as defined by 24-2 MD better than -6 dB at baseline. The results here need to be extended to more advanced glaucoma. While it is generally held that one cannot use OCT for eyes with G ONH values less than about 50 μm, we have recently shown this is not true [36].

Conclusions
Global statistics such as average cRNFL thickness (G ONH) and average RGCLP thickness (G mac) will miss or overcall progression of glaucoma. There are inherent problems with these methods that will be difficult, if not impossible, to correct. In particular, as they are averages, they can miss local defects. Further, they are prone to FP and FN mistakes due to subtle segmentation and alignment errors of the fovea and disc centres. Approaches are needed which do not rely on these metrics and instead focus on the topographical agreement among the cRNFL, RGCLP, and RNFL thickness measures.

Summary
What was known before
• Average (global) measures of the circumpapillary retinal nerve fibre layer (cRNFL) and the retinal ganglion cell plus inner plexiform layer (RGCLP) thickness are common measures of progression. However, these two measures, global cRNFL (GONH) and global RGCLP (Gmac), miss early glaucomatous damage. Thus, it is likely that these two measures will also miss clear progression of glaucoma, while also falsely identifying some eyes as progressors.
What this study adds
• Global metrics GONH and Gmac can lead to both false positives and false negatives because of problems inherent in OCT scanning, such as segmentation and centring. In addition, they can miss local damage (false negatives). These problems are difficult, if not impossible, to correct, and raise concerns about the advisability of using global metrics for detecting progression.