An Objective Structural and Functional Reference Standard for Diagnostic Studies in Glaucoma

Purpose: To propose a reference standard for the definition of glaucomatous optic neuropathy (GON) consisting of objective parameters from spectral-domain optical coherence tomography (SDOCT) and standard automated perimetry (SAP), and to apply it to the development and evaluation of a deep learning algorithm to detect glaucomatous damage on fundus photographs. Design: Retrospective, cross-sectional study. Methods: Data were extracted from the Duke Glaucoma Registry and included 2,927 eyes of 2,025 participants with fundus photos, SDOCT and SAP acquired within six months. Eyes were classified as GON versus normal based on a combination of objective SDOCT and SAP criteria. A DL convolutional neural network was trained to predict the probability of GON from fundus photos. The algorithm was tested on an independent sample with performance assessed by sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and likelihood ratios (LR). Results: The test sample included 585 eyes of 405 participants. The median DL probability of glaucoma in eyes with GON was 99.8% versus 0.03% for normal eyes (P<0.001), with an AUC of 0.92 and sensitivity of 77% at 95% specificity. LRs indicated that the DL algorithm provided large changes in the post-test probability of disease for the majority of eyes. Conclusions: The DL algorithm had high performance to discriminate eyes with GON from normal. The newly proposed objective definition of GON used as reference standard may increase the comparability of diagnostic studies of glaucoma across devices and populations, helping to improve the development and assessment of tests in clinical practice.


Introduction
Glaucoma is a progressive optic neuropathy in which characteristic changes in the optic disc and retinal nerve fiber layer (RNFL) are usually accompanied by associated visual field defects. [1] The disease is the leading cause of irreversible blindness, affecting over 70 million people worldwide. [2] Nonetheless, despite recent advances in functional and structural assessment, diagnosis of glaucoma remains a challenging task. Most individuals with glaucoma remain undiagnosed, while many healthy subjects are misdiagnosed and receive unnecessary treatment.
The lack of consensus on a reference standard for diagnosing glaucoma makes it difficult to evaluate newly proposed diagnostic tools. In recent years, advances in artificial intelligence (AI) have opened up the possibility for automated diagnoses of certain eye diseases using fundus photography. Deep learning (DL) algorithms have proven successful in detecting signs of diabetic retinopathy in fundus photos, achieving high accuracy when compared to a reference standard of human graders. [3] However, while the diagnosis of diabetic retinopathy is generally unequivocal, the same cannot be said for assessing the presence of glaucomatous damage, even for well-experienced glaucoma specialists. [4][5][6] Notably in the early stages of the disease, it can be very difficult, if not impossible, to ascertain the presence of glaucoma on a cross-sectional assessment of fundus photos. This is largely due to the wide variation in the appearance of optic discs in the normal population, as well as to the lack of a precise definition of what the "characteristic" features of glaucoma are. This lack of a clear definition of structural glaucoma questions the validity and usefulness of attempts to develop DL models to replicate human gradings for glaucoma diagnosis from fundus photos.
Spectral-domain optical coherence tomography (SDOCT) provides objective and reproducible quantitative assessment of neural loss in glaucoma. As an alternative to using human grading as the reference standard, DL models can be successfully trained to predict quantitative SDOCT measurements from analysis of fundus photographs. In a previous work, we developed a machine-to-machine DL algorithm that was able to provide continuous estimates of retinal nerve fiber layer (RNFL) and neuroretinal rim thickness from fundus photos. [7,8] The predicted measurements exhibited very high correlation to the observed SDOCT measurements, showing that the DL models could be used to automatically quantify neural loss from fundus photos.
Although SDOCT may be a reliable indicator of the presence of optic nerve damage, there are many other conditions that can lead to such damage, besides glaucoma. In addition, SDOCT by itself may be relatively unspecific for diagnosis of early glaucoma due to the wide variation in the normative measurement levels. The finding of correspondence between SDOCT and visual field abnormalities detected by standard automated perimetry (SAP) greatly enhances the specificity of diagnosis. However, to date, no objective definition of glaucomatous optic neuropathy (GON) has been proposed that incorporates both SDOCT and SAP.
In the present study, we propose an objective definition of GON using clearly defined structural and functional parameters from SDOCT and SAP. We then develop and validate a DL algorithm to detect GON on fundus photos, using the proposed objective definition as the reference standard. As such, to the best of our knowledge, the present work represents the first attempt to develop and validate an AI model to diagnose structurally and functionally defined GON using fundus photos.

Methods
This was a retrospective study that used cross-sectional data from the Duke Glaucoma Registry, a database of electronic medical and research records at the Vision, Imaging, and Performance Laboratory at Duke University. The Duke Health Institutional Review Board approved this study, and a waiver of informed consent was granted due to the retrospective nature of this work. All methods adhered to the tenets of the Declaration of Helsinki for research involving human subjects and the study was conducted in accordance with regulations of the Health Insurance Portability and Accountability Act.
The visual field tests were performed using SAP with the 24-2 Swedish Interactive Threshold Algorithm (Humphrey Field Analyzer II and III, Carl Zeiss Meditec, Inc., Dublin, CA) protocol. Unreliable tests with more than 33% fixation losses or more than 15% false-positive errors were excluded. RNFL thickness measurements were obtained from peripapillary circle scans, acquired using the Spectralis SDOCT (Software version 5.4.7.0, Heidelberg Engineering, GmbH, Dossenheim, Germany). According to manufacturer recommendations, tests with a quality score lower than 15 were excluded. The fundus photos present in the database were acquired from two different cameras: Nidek 3DX (Nidek, Japan) and Visupac (Carl Zeiss Meditec, Inc., Dublin, CA). The image was retained if the optic disc was entirely visible in the photo, and if no artifacts were present.

Objective Definition of Glaucomatous Optic Neuropathy
GON was defined based on the presence of corresponding structural and functional damage on SDOCT peripapillary RNFL scans and SAP, respectively, acquired within six months of each other. The definition considered the possibility of both global as well as localized losses. An eye was considered to have GON if any of the following were present: 1. Global RNFL thickness outside normal limits and abnormal SAP as defined by Glaucoma Hemifield Test (GHT) outside normal limits or Pattern Standard Deviation (PSD) with P < 5%, 2. At least one sector in the superior RNFL thickness (temporal-superior or nasal-superior) outside normal limits with a corresponding abnormality on SAP inferior hemifield, defined as inferior hemifield mean deviation (MD) with P < 5%, 3. At least one sector in the inferior RNFL thickness (temporal-inferior or nasal-inferior) outside normal limits with a corresponding abnormality on SAP superior hemifield, defined as superior hemifield MD with P < 5%.
Superior and inferior SAP hemifield MD were calculated as the average of the total deviation values in each hemifield. The probability cutoffs for superior and inferior MD were derived from healthy subjects. Of note, as eyes with glaucoma may frequently have a dense arcuate defect occupying most of a hemifield, an approach using hemifield PSD to objectively characterize hemifield VF damage would have poor accuracy in diagnosis.
To be considered normal, or without GON, the SDOCT-SAP pair had to meet all of the following criteria: 1. Global RNFL thickness within normal limits, 2. RNFL thickness within normal limits for all sectors, 3. SAP GHT within normal limits, 4. SAP PSD probability "not significant".
SDOCT-SAP pairs that did not meet criteria for GON or normal were considered suspects. These included eyes with only SDOCT abnormality or with only SAP abnormality. Although some suspects may in fact have early glaucoma, the lack of corresponding structural and functional damage leads to low specificity in the definition, which is undesirable in the context of developing an AI application for diagnosis and screening for glaucoma. Table 1 summarizes the criteria for the objective definition of GON.

Deep Learning Model
A DL model was trained to predict the probability of GON on fundus photos, using as reference standard the proposed classification of GON defined based on SDOCT-SAP pairs. For each eye of each individual, all available fundus photos were matched with the closest SDOCT-SAP pair acquired within an interval of six months. Each SDOCT-SAP pair was classified as glaucoma, suspect or normal according to the definition above. Eyes that were suspected of glaucoma were not used for training the DL model.
The dataset was randomly divided at the subject level into 80% for training and validation of the model and 20% for final testing. The fundus photos were preprocessed, downsampled to 256 x 256 pixels and normalized so that all pixel values ranged between zero and one. Data augmentation, including horizontal flips, angle rotations, zoom and lighting changes of the original image, was performed to reduce the chance of overfitting and to improve the generalization of the algorithm.
Training of the DL algorithm was performed with transfer learning, using the Deep Residual Learning (ResNet50) [9] architecture, pre-trained with the ImageNet dataset. To adapt the architecture to our specific Table 1: Summary of criteria for the objective definition of GON. Structural and functional criteria are derived from SDOCT and SAP results, respectively. To be considered glaucoma, it was necessary to meet the criteria for global or localized loss. To be considered normal, it was required that both SDOCT and SAP results were normal. SDOCT-SAP pairs that did not meet the criteria for either groups were considered suspects.

Global loss
Global RNFL thickness outside normal limits GHT outside normal limits or PSD P < 5%

Localized loss
RNFL thickness outside normal limits in at least one superior sector (temporal superior and/or nasal superior Inferior MD P < 5% RNFL thickness outside normal limits in at least one inferior sector (temporal inferior and/or nasal inferior Superior MD P < 5%

Normal
Normal RNFL thickness within normal limits for all sectors and global PSD probability not significant (P > 5%) and GHT within normal limits task, the last layer was replaced with a layer with output of size two: yielding probabilities of GON and normal. Initially, training was performed in the top layer, with the weights of the remaining layers frozen. Subsequently, all layers were unfrozen, and additional training was performed with differential learning rates, where smaller learning rates were used for earlier layers of the algorithm and larger rates for latter layers.
The network was trained with minibatches of size 32, optimized with Adam. [10] The best learning rate was found using the cyclical learning method [11] with stochastic gradient descent with restarts. By overlaying the Gradient-weighted class activation maps [12] with the input image, we built heatmaps that highlight the regions of the image that were most important for classification. This technique helps to visualize and understand the DL network predictions.

Statistical analysis
The performance of the DL algorithm to detect GON on fundus photos was assessed in the test sample. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) were used to summarize the diagnostic accuracy. Furthermore, using a previously described ROC regression model, [13][14][15] ROC analysis was adjusted for age. Sensitivity and specificity were defined as the true positive and true negative rates, respectively. Due to the presence of multiple tests for each eye and each subject, comparison between groups was performed using a random effects mixed models, [16] and confidence intervals were derived with a bootstrap resampling procedure with resampling performed at the eye level. [17] In order to investigate the performance of the DL algorithm in different levels of disease severity, SDOCT-SAP pairs were grouped into early, moderate and severe damage according to the Hodapp-Parrish-Anderson (HPA) classification of the SAP. [18] ROC curves, AUCs, sensitivity, and specificity were calculated for each level of disease severity.
Finally, interval likelihood ratios (LRs) were calculated to assess the impact of the results of the DL model in changing the probability of disease. [19] A LR is defined as the probability of a given test result in those with disease divided by the probability of that same test result in those without the disease. [20,21] The application of LRs in the interpretation of results of imaging instruments for glaucoma diagnosis has been detailed previously. [22][23][24] LRs represent the best way to incorporate diagnostic test results in clinical practice according to the principles of evidence-based medicine. The LR for a given test result indicates how much that result will change the probability of disease, going from a pre-test probability (i.e., probability of disease before the test) to a post-test probability. A value of one means that the test provides no additional information, and ratios above or below one increase or decrease the likelihood of disease, respectively. Based on a prior classification definition, LRs greater than 10 or lower than 0.1 would be associated with large effects on post-test probability, LRs from 5 to 10 or from 0.1 to 0.2 would be associated with moderate effects, LRs from 2 to 5 or from 0.2 to 0.5 would be associated with small effects, and LRs closer to one would be insignificant. [21] Development of the DL algorithm was performed in Python, while all statistical analyses were performed using Stata (version 15, StataCorp LP, College Station, TX). The alpha level (type I error) was set at 0.05.

Results
The dataset comprised of 9,830 fundus photos from 2,927 eyes of 2,025 individuals. The training/validation sample had 7,712 fundus photos from 2,342 eyes of 1,620 subjects, while the test sample had 2,118 fundus photos from 585 eyes of 405 subjects. Of the 585 eyes in the test sample, 305 (52%) had GON and 280 (48%) were normal. The median SAP MD were -7.5 and 0.24 dB for the glaucoma and normal groups, respectively (P = 0.000, random effects mixed model); while mean global RNFL thickness values were 67.0 and 98.3 µm, respectively (P = 0.000, random effects mixed model). Table 2 shows demographic and clinical characteristics for the eyes in the study. The median DL probability of glaucoma (interquartile range [IQR]) assigned to fundus photos with GON and normal were 99.8% (87.5, 100.0%) and 0.03% (0.0, 2.6%), respectively (P = 0.000, random effects mixed model). Figure 1 illustrates the distribution of DL probabilities. The DL algorithm had an age-adjusted AUC of 0.92 (95% CI: 0.88, 0.95) in the test sample. For a 95% specificity, the model had 77.3% sensitivity.
The AUC values increased with worse levels of disease severity, achieving maximal age-adjusted AUC of 0.96 and sensitivity of 85.1% (at 95% specificity) among eyes with severe glaucoma. In Figure 2 and Table 3, ROC curves and AUC values, as well as sensitivities at 95% specificity, are presented across levels of disease severity. Figures 3 and 4 illustrate eyes included in the study, classified as normal and as glaucomatous, respectively.  The LRs are presented in Table 4 and help identify meaningful intervals of the DL probabilities for the risk of the disease. Test results with DL probability > 99% or > 99.9% were associated with large changes in increasing the probability of disease (LRs of 10.4 and 38.2, respectively); whereas test results with DL probability < 0.1% were associated with large change in decreasing the probability of disease (LR of 0.06). Any of these test results would provide strong evidence regarding the presence or absence of GON. As Table 4 shows, other test results would also increase or decrease the probability of disease, but provide less conclusive evidence, which could still be helpful depending on the setting of application of the test.
A total of 4,899 fundus photos from 1,648 eyes of 1,085 individuals were classified as suspect. When applied to this group, the DL algorithm predicted a median probability of glaucoma (IQR) of 6.4% (0.

Discussion
In the present study, we developed novel objective criteria for defining GON, based on the correspondence of structural and functional disease measures. We then used the proposed reference standard to train a DL algorithm to estimate the probability of GON from optic disc photos. The algorithm was able to accurately discriminate photos from eyes with and without GON based on the proposed reference standard, suggesting that DL analysis of fundus photos may have a role in screening for glaucomatous damage. Additionally, our results provide support for the use of an objectively defined structural and functional reference standard for GON, given the strong agreement found between DL predictions and the reference classifications.
Fundus photos are an inexpensive and portable method to image the optic disc. With recent advances in AI, automated algorithms have been developed to analyze fundus photos and yield a probability of GON. However, in most previous works, the development of DL algorithms relied on human gradings to establish the reference standard, an approach that presents severe limitations. First, human gradings tend to over-and underestimate the likelihood of glaucoma, [25] and, even among specialists, there is poor reproducibility and low interobserver agreement. [4][5][6] Second, labelling a large number of fundus photos to be used as reference standard is cumbersome and time consuming. Third, the experience of the evaluators and the criteria used to define GON varied tremendously across those studies. Finally, the accuracy achieved using the reference standard based on human gradings does not necessarily translates well to clinical practice. As an example, in a work by Phene et al., [26] a DL algorithm developed to detect GON, defined by human gradings, in fundus photos, achieved an AUC of 0.945. When evaluated in samples with a different reference standard, defined by a complete glaucoma workup and by International Classification of Diseases codes, the AUCs were 0.855 and 0.881, respectively.
In order to avoid subjective biases and low accuracy from human labelling, we developed an objective definition of GON, based on quantitative metrics of both SAP and SDOCT. SDOCT measurements of RNFL thickness are objective, reproducible, and are considered the standard of care for quantification of structural damage in glaucoma. [27] SAP is used to detect and classify levels of severity of functional loss18 and is correlated to quality of life and visual disability. [28] Since both tests are metrics of the same disease, using a combination of their results increases the specificity of the classification -which is particularly important for a reference standard used in developing algorithms aimed at screening for glaucoma, a disease that has a relatively low-prevalence. To the best of our knowledge, our study is the first to propose such objective definition of GON based on SDOCT and SAP.
The DL algorithm predicted probabilities of GON with high confidence, with a median DL probability of 0.03% (IQR: 0.0, 2.6%) for photos classified as normal and 99.8% (IQR: 87.5, 100.0%) for photos classified as glaucomatous. These results not only provide confidence on the usefulness of the DL classifier, but also provide internal validation for our proposed objective definition of GON. When applied to photos of eyes that had been classified as suspects, the median DL probability was 6.4% (IQR: 0.2, 82.4%), with a wide distribution of predictions. This is likely due to the fact that some eyes with pre-perimetric glaucoma, as well as some normal eyes, were included as suspects given the lack of correspondent structural and functional damage. In fact, when we compared structural and functional measurements in glaucoma suspect eyes divided by quartiles of DL probability, we found that those in the highest quartile had significantly lower global RNFL thickness measurements suggesting that the model was capturing those suspects at higher risk.
The age-adjusted AUC of the DL probabilities was 0.92, with a 77.3% sensitivity at 95% specificity. When evaluated in different levels of disease severity, the DL algorithm performance improved with worse severity (Early: AUC = 0.84, Moderate: AUC = 0.92, Severe: AUC = 0.96), correctly identifying more than 60% of the early cases and over 80% of the moderate and severe cases at a specificity of 95%. This performance makes the DL algorithm particularly useful for screening of moderate and severe glaucoma, 8 All rights reserved. No reuse allowed without permission.
author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint (which was not peer-reviewed) is the . which require prompt referral, but also with a high specificity, so as not to overwhelm the healthcare system with false-positive referrals. One could argue that the algorithm still misses about 20% of moderate and severe cases of glaucoma. However, given that previous population-based studies show that over 90% of patients with glaucoma remain undiagnosed in developing areas, [29,30] such a simple and inexpensive approach could potentially help reduce the burden of disability from glaucoma.
We also presented LRs for different intervals of DL probabilities. In contrast to sensitivity and specificity, which are difficult to apply in practice, LRs can be readily incorporated into clinical decision-making. In our work, DL probability values lower than 0.1% or greater than 99% were associated with large changes in the post-test probability, providing major conclusive changes regarding presence or absence of damage from glaucoma. To illustrate, consider the application of the test in a situation of screening, where the overall prevalence of glaucoma would be estimated at 5%. [29,30] This would then be considered the pre-test probability. If this patient is tested and gets a result with DL probability greater than 99.9% (LR of 38.2), this would bring the post-test probability to 66%, indicative of the need for referral.
The calculation of LRs helps to understand the usefulness of the test under different circumstances as well. For example, when applied to the evaluation of suspects in clinical practice or during opportunistic screening, levels of pre-test probability of disease would often be significantly higher, as those patients would This study has limitations. Our cohort consisted of a clinic-based population, meaning that further validation in real-world settings, for population-based or opportunistic screening, is necessary. Also, the classification did not include risk factors, such as intraocular pressure, race, family history, and age. However, such information can be aggregated to estimate the pre-test probability and then one can apply the DL model to modify such probability. Finally, the DL algorithm would benefit from additional external validation on a population from a different geographical area.
In conclusion, we developed a DL algorithm to evaluate fundus photos for glaucomatous damage, based on a novel reference standard definition of GON that uses both structural and functional measures of disease. We demonstrated that the DL algorithm can produce large changes in the probability of disease in the majority of subjects, supporting its usefulness for decision-making. Furthermore, our proposed objective reference standard to define GON may increase the comparability of diagnostic studies across devices and populations, helping to improve the development and assessment of diagnostic tests in clinical practice.
10 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.04. 10.20057836 doi: medRxiv preprint All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.04.10.20057836 doi: medRxiv preprint Figure S1: Example of a case classified as suspect. (A) illustrates the fundus photo used as input for the deep learning (DL) algorithm, which predicted a glaucoma probability of 48.3%. (B) shows the regions that were most important for the classification, in which the heatmap highlights diffusely the image, but mostly the superior half of the optic disc and the peripapillary region. Although the global and sectoral retinal nerve fiber layer thickness are within normal limits in the spectral domain optical coherence tomography (C), there is an apparent superior nasal step defect and the pattern standard deviation (PSD) is abnormal (P ¡ 5%) in the standard automated perimetry (D).

Supplementary Material
14 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.04. 10.20057836 doi: medRxiv preprint Figure S2: Example of a case classified as suspect. (A) illustrates the fundus photo used as input for the deep learning (DL) algorithm, which predicted a glaucoma probability of 49.3%. (B) shows the regions that were most important for the classification, in which the heatmap highlights diffusely the optic disc and a peripapillary atrophy. Although the spectral domain optical coherence tomography (C) is abnormal (global and temporal superior sectors outside normal limits, and 3 other sectors borderline), there is no defect in the standard automated perimetry (D).
15 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.