Introduction

One of the phenotypes of late stages of age-related macular degeneration (AMD) is geographic atrophy (GA). Since its first description in 1970 by Gass, various terminologies and classifications have been used to define this entity on colour fundus photographs [1,2,3,4]. Typically, GA is defined as any sharply delineated, roughly round or oval area of hypopigmentation or depigmentation with increased visibility of the underlying choroidal vessels, of at least 175 µm in diameter on 30° or 35° colour fundus photographs (CFP) [5]. However, the edges of atrophy are difficult to discriminate on monoscopic images and cannot be easily delineated using image analysis. Additionally, precursor changes at the cellular level that precede the development of GA are not identifiable on CFP.

The advent of fundus autofluorescence (FAF) has advanced our understanding significantly. Not only did it better delineate lesion boundaries, but several phenotypes of GA also became evident with associated prognostic significance [6, 7]. In addition, hyperautofluorescence at the margin of GA may precede cell death and growth of GA. For the first time, regulators have accepted FAF as a structural surrogate of disease progression [8]. However, the FAF lesion phenotypes are not always reproducible [9]. FAF is also dependent on lipofuscin loss from the retinal pigment epithelium (RPE). Therefore, the focus has shifted to other imaging modalities that can identify changes before RPE cell loss and can complement CFP and FAF.

Spectral domain optical coherence tomography (OCT) has become the mainstay imaging modality for macular diseases including AMD. The high axial resolution of OCT allows layer-by-layer evaluation of retinal and choroidal tissue, enabling cross-sectional phenotyping of GA. More importantly, longitudinal OCT scans can also reveal temporal changes and identify precursor lesions of GA.

In the light of these advantages, the Classification of Atrophy Meetings (CAM) group proposed a new classification system based on OCT, unique to atrophy associated with AMD [10]. The new classification system considered microstructural changes in the outer retina and RPE to define four types of atrophy based on the correlated histopathological changes in the retina. These include iORA (incomplete outer retinal atrophy), cORA (complete outer retinal atrophy), iRORA (incomplete RPE and outer retinal atrophy) and cRORA (complete RPE and outer retinal atrophy) [8].

The entity cRORA, the equivalent of GA, has been defined by the CAM group as having the following features on OCT: (i) Zone of hypertransmission of ≥250 µm, (ii) Zone of attenuation or disruption of RPE band of ≥250 µm, (iii) Evidence of overlying photoreceptor degeneration characterised by features that include outer nuclear layer (ONL) thinning, external limiting membrane (ELM) loss, and ellipsoid zone (EZ) or interdigitation zone (IZ) loss. Although these features are well described, multiple novel parameters had to be evaluated by a team of retinal experts around the world, and several meetings and grading exercises had to be completed to reach a consensus on the components of cRORA. In the absence of reliable quantification methods, the interpretation of these parameters is based on subjective recognition of descriptive characteristics. Due to the diversity of features seen in individual OCT images, image artefacts and variations in image quality, it can be challenging to accurately ascertain the presence of these features. The agreement between clinicians provides a measure of this challenge.

The purpose of this study was to evaluate the inter-rater reliability for identification of cRORA on Spectralis Heidelberg SD-OCT line scans (Heidelberg Engineering, Heidelberg, Germany). We then investigated the technical issues faced by the graders in evaluation of each individual parameter to recommend approaches to mitigate them.

Methods

The study adhered to the tenets of the Declaration of Helsinki. Only anonymised images were analysed so institutional review board approval was not required.

Image acquisition and processing

One of the authors (SC) selected and extracted Spectralis Heidelberg SD-OCT line scans (Heidelberg Engineering, Heidelberg, Germany) that were routinely done on patients with GA. Scans showed either iRORA or cRORA. The line scan had to demonstrate an atrophic lesion size ≥250 µm measured on the Heyex overlay software. To ensure consistent image quality and similar preconditions for each evaluated scan, we used well-resolved SD-OCT scans. These were pre-defined as line scans with a minimum signal-to-noise ratio of 20 dB; a minimum of sixteen frames per B-scan using the average real time mode; clear media; and qualitatively confirmed visibility and distinction of each outer retinal layer.
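The inclusion criteria above can be expressed as a simple eligibility check. The following is a minimal sketch only: the record layout (`ScanRecord`) and the function name `is_eligible` are hypothetical and are not part of the Heidelberg software or of the study's workflow, in which scans were selected manually by one author.

```python
from dataclasses import dataclass

# Thresholds taken from the inclusion criteria stated in the text.
MIN_SNR_DB = 20          # minimum signal-to-noise ratio (dB)
MIN_ART_FRAMES = 16      # minimum frames averaged per B-scan
MIN_LESION_UM = 250      # minimum atrophic lesion size (µm)


@dataclass
class ScanRecord:
    """Hypothetical description of one exported line scan."""
    snr_db: float
    art_frames: int
    lesion_width_um: float
    clear_media: bool             # qualitative judgement by the selector
    outer_layers_distinct: bool   # qualitative judgement by the selector


def is_eligible(scan: ScanRecord) -> bool:
    """Return True when the scan meets every stated inclusion criterion."""
    return (
        scan.snr_db >= MIN_SNR_DB
        and scan.art_frames >= MIN_ART_FRAMES
        and scan.lesion_width_um >= MIN_LESION_UM
        and scan.clear_media
        and scan.outer_layers_distinct
    )
```

The two qualitative criteria are kept as plain booleans because the text describes them as qualitatively confirmed rather than measured.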

Each of the 50 line scans from the 50 patients was anonymised and exported from the Heidelberg software in .tiff format in the standard setting. Each line scan was saved in two different colour settings (i.e. inverted grey-scale, showing images as either positives or negatives, referencing terminology used in black-and-white photography), identified as white-on-black (WB) and black-on-white (BW) in the study. Each folder contained a WB and a BW scan of an anonymised patient.

Grading characteristics

The two sets of the images (WB and BW) of each of the 50 patients were analysed for three parameters mentioned in the CAM grading for cRORA. They include: (i) Zone of hypertransmission of ≥250 µm, (ii) Zone of attenuation or disruption of RPE band of ≥250 µm, (iii) Evidence of overlying photoreceptor degeneration whose features include ONL thinning, ELM loss and EZ or IZ loss.
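As an illustration of how the three parameters combine, the sketch below treats cRORA as present only when all three criteria are met, which is how the definition reads. The class and function names are hypothetical. In the study itself, graders recorded yes or no for each parameter by visual inspection rather than entering measured widths, and whether criterion (iii) requires one or several of the listed photoreceptor features is left to the grader here.

```python
from dataclasses import dataclass

MIN_WIDTH_UM = 250  # width threshold quoted for criteria (i) and (ii)


@dataclass
class BScanGrading:
    """Hypothetical record of one grader's findings on one B-scan."""
    hypertransmission_width_um: float   # criterion (i)
    rpe_disruption_width_um: float      # criterion (ii)
    photoreceptor_degeneration: bool    # criterion (iii), grader's judgement


def meets_crora_criteria(grading: BScanGrading) -> bool:
    """Return True only when all three CAM criteria are satisfied."""
    return (
        grading.hypertransmission_width_um >= MIN_WIDTH_UM
        and grading.rpe_disruption_width_um >= MIN_WIDTH_UM
        and grading.photoreceptor_degeneration
    )
```

Under this sketch, a scan with hypertransmission and photoreceptor changes but an intact RPE band would not be classed as cRORA, because criterion (ii) fails.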

Graders

The 50 pairs of images were interpreted by five clinicians (four medical retina fellows and one consultant) who are skilled OCT readers from the same retina centre (Moorfields Eye Hospital, London, UK). Two of the five graders were more accustomed to the CAM grading. All graders were familiarised with the CAM grading and trained to identify the parameters on test sets before grading the study images independently.

Grading procedure

The five graders analysed the 50 pairs of SD-OCT horizontal cross-sectional scans without application of any image modifications. The images were presented in two separate data sets of 50 images each, and the readers were masked to the grading outcome. The readers were required to go through both the images and data sets in the same order. The graders were asked to document their response as yes or no for the presence or absence of cRORA and of each parameter that defines cRORA. The folders were then randomly re-numbered, and graders re-graded the images to evaluate intra-grader agreement. Additionally, they were asked to note whether the WB or the BW scan was more helpful in identifying each particular feature for a set of images. They could also indicate that the colour setting made no difference to their ability to evaluate the respective OCT features.

Statistical analysis

Statistical analysis was performed using SPSS Statistics version 24 (IBM), Microsoft Excel for Mac version 15.33 (Microsoft), and the web-based Kappa Programme [11]. Inter-grader agreement was evaluated as a measure of reliability. The higher the inter-grader agreement coefficient, the more reliable the identification and detectability of the respective morphologic alteration. As the responses were categorical variables, Fleiss’ kappa (ĸ) was used to measure agreement across all five graders. To measure intra-grader agreement and inter-grader agreement between any two graders, Cohen’s kappa was used (Table 1). The significance was set at p ≤ 0.05.
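For readers unfamiliar with the two statistics, the sketch below shows the standard textbook formulas for Cohen's kappa (two sets of ratings) and Fleiss' kappa (several raters per image). It is not the code used in this study, which relied on SPSS and a web-based programme; the function names are illustrative, and significance testing (p values) is not covered.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters, or one rater on two occasions."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    # Observed proportion of images on which the two ratings agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds every rater's label for image i."""
    if not ratings:
        raise ValueError("need at least one rated image")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every image needs the same number (>= 2) of ratings")
    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this image.
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs / (n_raters * (n_raters - 1))
    observed = agreement_sum / len(ratings)
    total = len(ratings) * n_raters
    # Chance agreement from the overall share of each category.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # every rating fell in a single category
    return (observed - expected) / (1 - expected)
```

With yes/no responses, `cohen_kappa` would take two lists of 50 labels (two graders, or one grader's first and second reading), and `fleiss_kappa` would take 50 rows of five labels each.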

Table 1 Interpretation of Cohen’s and Fleiss kappa.

Results

Fifty pairs of single SD-OCT horizontal scans were graded twice by each grader. Of the 50 images, 36 images demonstrated cRORA and the rest had iRORA as confirmed by two graders with previous experience in CAM grading.

The inter-grader and intra-grader Cohen’s kappa values for cRORA diagnosis are shown for white-on-black (Table 2) and black-on-white images (Table 3). The intra-grader reliability by Cohen’s kappa was in the range of 0.88–0.92 for white-on-black images, which indicates strong to almost perfect agreement. The intra-grader Cohen’s kappa for black-on-white images ranged from 0.45 to 0.95, being >0.90 for four out of five graders. The inter-grader reliability varied from as low as 0.28 to an almost perfect value of 0.92 for white-on-black images. Similarly, it ranged from 0.34 to 0.86 for black-on-white images. The inter-grader agreement was almost perfect for the two graders who were accustomed to the CAM criteria (kappa WB 0.92, p value < 0.0001; kappa BW 0.86, p value < 0.0001).

Table 2 Inter-grader and Intra-grader agreement (Cohen’s kappa) for cRORA in white on black images.
Table 3 Inter-grader and Intra-grader agreement (Cohen’s kappa) for cRORA in black on white images.

The Fleiss kappa values (ĸ) are shown in Table 4. There was moderate agreement in identifying cRORA using white-on-black images (ĸ 0.49, p value < 0.0001) and fair agreement using BW images (ĸ 0.34, p value < 0.0001). RPE attenuation/loss was the parameter detected most reliably in both sets of images, whereas hypertransmission was the most poorly detected parameter. Overall, the agreement was better using WB images for all parameters except RPE attenuation/loss.

Table 4 Fleiss kappa showing Inter-grader agreement across five graders for all parameters.

The graders noted that RPE attenuation/loss was the easiest parameter to identify in the images and was better detected on BW images. Hypertransmission was the least reliable parameter according to the graders and was particularly difficult to distinguish on BW images. Inner layer changes were also more clearly identified on WB than on BW images. However, the graders observed that, to reliably detect the presence of cRORA, it is better to analyse both images together. Examples of challenges in grading cRORA are shown in Fig. 1.

Fig. 1: Examples of optical coherence tomography (OCT) images analysed in the study.

Case 1A: White on black (WB) image showing hypertransmission of 315 µm width and associated inner retinal changes overlying the PED. In the corresponding black on white (BW) image (Case 1B), the RPE loss is better appreciated. Case 2A and 2B: This image shows the presence of a persistent hyper-reflective line in the bed of cRORA, which could be confused with attenuated RPE. This has been termed persistent basal laminar deposit (white asterisk) by the CAM classification. Case 3A: All signs of cRORA are noticeable (EZ/ELM layer changes and hypertransmission of 250 µm); however, the RPE is intact (white vertical arrow). This is again more evident in the BW image (white vertical arrow; Case 3B). Case 4A: There is loss of RPE and EZ and ELM layer changes overlying the PED (black asterisk), but the hypertransmission is absent (white horizontal arrow). Case 4B: The BW image confirms the definite absence of RPE (black asterisk). Case 5: This case shows an example of discontinuous transmission overlying a region of cRORA caused by back shadowing secondary to dispersed pigmented cells (white horizontal arrow).

Discussion

The study assessed the reliability of detection of cRORA on SD-OCT images. There are four key findings. First, inter-grader reliability for any two graders was better for WB images than BW images and was almost perfect for graders more accustomed to the CAM classification. Second, intra-grader agreement was high across all images, suggesting that parameters assessed inaccurately in one set of images were also assessed inaccurately in the second set, reinforcing the importance of repeated training. Third, inter-grader reliability using Fleiss kappa was fair to moderate, indicating the subjectivity of the defined parameters. Finally, all parameters were detected better on WB images except RPE attenuation. RPE attenuation/loss was the parameter detected with the highest agreement, whereas hypertransmission had the lowest agreement. Taken together, these findings suggest that diagnosis of cRORA on OCT images may be quite subjective, which may have an impact on clinical diagnosis when treatments become available for this condition. Adequate training combined with use of both WB and BW images may improve our ability to diagnose this entity.

The first finding was that inter-grader reliability was better for WB images. There are a couple of possible explanations for this finding. First, all graders were accustomed to viewing WB images routinely in clinic, and this may have introduced a bias towards better identification of structures in this set of images. Second, recognising minute aspects of outer retinal changes requires advanced skills, possibly acquired by repeated evaluation of these images. The RPE, EZ and ELM on OCT are seen as multiple hyper-reflective lines with almost similar reflectivity. The presence of drusen, subretinal drusenoid deposits, patchy intraretinal pigment migration or outer retinal tubulations are some of the features that distort the continuity of these layers and increase the difficulty in differentiating them. Repetitive exposure to scans with these characteristics may enhance expertise in accurately identifying pathologies.

Secondly, intra-grader agreement was high across all images. A layer or parameter identified wrongly on repeated occasions indicates the challenge in identifying these parameters due to their heterogeneity and highlights the importance of the experience required to assess these entities.

The third finding was the fair to moderate agreement based on Fleiss kappa for cRORA diagnosis. Kappa was designed to account for the possibility of guessing, but it has its limitations. The assumptions kappa makes about rater independence and other factors are not well supported, leading to an excessively low estimate of agreement [12]. As it cannot be directly interpreted, it has become conventional for researchers to accept low kappa levels in inter-rater reliability studies [10]. Table 1 shows the interpretation of the Fleiss kappa coefficient. However, this interpretation is applicable to social science. Low levels of kappa are unacceptable in medicine or clinical research, where results may change clinical practice and may lead to poorer clinical outcomes. An agreement of over 0.80 is considered acceptable in medicine-related research [12]. Thus, even though our results show moderate agreement, they are not adequate. This points towards the use of more descriptive definitions of each parameter or the employment of reliable quantification methods for more subjective parameters such as hypertransmission. These steps may help improve inter-grader agreement.

Finally, two parameters (hypertransmission and inner retinal layer changes) were detected with higher reliability on WB images, and RPE attenuation was more reliably graded on BW images. Use of a combination of WB and BW images is required to better detect the parameters and reliably diagnose cRORA. Even though it was not evaluated in this particular study, the authors suggest use of other modalities, including near-infrared reflectance (NIR) and FAF, alongside the OCT B-scan to improve cRORA diagnosis. We did not use NIR and FAF in this study to avoid bias in detecting cRORA, as this grading was based completely on OCT features. Moreover, the authors evaluated .tiff files of these images to maintain uniformity across the grading. In the real world, dynamic manipulation of images, with the ability to adjust colour/contrast settings and use multimodal imaging, may lead to more precise identification of morphologic features on OCT. However, most reliability studies are done on static images due to ease of implementation [13].

The study scrutinized the difficulties faced by the graders during the grading process with respect to each parameter. The most challenging parameter was continuous hypertransmission. The CAM group used the term hypertransmission as it best conveyed the cause of the observed phenomenon. It is recognised, though, that hypertransmission may not always penetrate to the underlying choroid, especially in eyes with tall pigment epithelial detachments (PED). However, as it is one of the key features for cRORA diagnosis, absence of definite hypertransmission may lead graders to misdiagnose cRORA, especially in lesions lying on top of a PED [10]. Continuity of hypertransmission was another feature that was difficult to detect. In some eyes with RPE loss, the upward intraretinal migration of pigment tends to cause back shadowing in the OCT scan, thereby interrupting the continuous hypertransmission so that it is graded as discontinuous. So even though the RPE cells are lost as per the cRORA definition, the hypertransmission can be discontinuous and it thus appears as if not all criteria are met. RPE attenuation was the most reliably detected parameter; however, the agreement was still less than 0.80 (ĸ 0.72). The CAM group acknowledged that the presence of persistent basal laminar deposit might interfere with accurate assessment of RPE attenuation [10].

Ascertaining RPE continuity was increasingly complex in eyes with multiple drusen. In the case of BW images, the markedly pigmented RPE nuclei characteristically stand out as a dark black line against the less dark EZ and ELM lines. This probably enhanced the ease with which one can detect its loss. Inner retinal layer changes were better documented on WB images.

There are several limitations to the study. First, only five graders were included. This may limit generalisability of our findings. However, we note that this is a fairly new classification and, to the best of our knowledge, a study on inter-grader reliability of clinicians not involved in the CAM classification has not been done previously. Second, there are technical limitations on image grading. We used pre-saved images where colour and image contrast settings may have affected the assessment of parameters. Third, we did not re-train our graders and perform a post-test grading to test whether repeated training would improve reliability. However, the fact that the two graders who have repeatedly used the CAM criteria had higher agreement reinforces that training will likely improve the reliability of diagnosing cRORA based on these criteria.

In conclusion, the CAM classification provides a well-thought-out classification and criteria for OCT-defined atrophy in the setting of AMD. The ability to identify these OCT changes reproducibly is essential to understand the natural history of the disease, to identify high-risk signs of progression, and to study the effects of early interventions. This study adds insight into the reproducibility of these parameters in the real world and the need for training for clinicians to accurately identify them prior to implementing their use in clinical practice.

Summary

What was known before

  • CAM classification defines OCT-based parameters for accurate identification of atrophy.

  • However, inter-grader reliability for diagnosis of cRORA is poor for individuals not accustomed to the CAM criteria.

What this study adds

  • This study emphasizes the need for training clinicians to identify cRORA accurately, prior to implementing it in clinical practice.