Introduction

The field of artificial intelligence (AI) began around 1950, when Turing pointed out that computer programs simulating cognitive functions like game play could be written [1]. In the 1980s, machine learning (ML), a subset of AI, achieved the objective of learning patterns in data without being explicitly programmed, but this subset of AI did not greatly impact medicine, probably because clinicians could readily outperform such algorithms. Around 2010, the artificial neural networks of ML were replaced with networks whose units function like neurons with receptive fields and efficiently integrate high-throughput data, and the subset of ML called deep learning (DL) emerged (Fig. 1). In a short time, DL algorithms have rivalled and even outperformed pre-existing algorithms in medicine and other disciplines. DL applications are diverse, ranging from the prediction of earthquake aftershocks [2] to the advancement of drug discovery [3]. In healthcare, DL has been used to ascertain time of stroke onset [4], assess cancer lesions and metastases [5,6,7], and recognize numerous other conditions. In ophthalmology, DL applications aid in the detection of glaucoma [8,9,10,11], diabetic retinopathy (DR) [12,13,14,15], age-related macular degeneration [16,17,18], and retinopathy of prematurity [19, 20]. Remarkably, a number of products employing AI algorithms for the detection of conditions, ranging from atrial fibrillation via the Apple Watch to autonomous recognition of DR from digital fundus photographs, gained FDA approval in 2017 and 2018 [21]. The 2020 issue of Eye spotlights innovation and the incredible progress being made in the field of glaucoma. This review emphasizes advancements in glaucoma related to AI.

Fig. 1

The relationship between deep learning, machine learning, and artificial intelligence is depicted. Artificial intelligence is the broadest of the three classifications and deep learning is the narrowest. Machine learning is a type of artificial intelligence. Deep learning is likewise a type of artificial intelligence but, more specifically, is a subset of machine learning

After providing an overview of AI, this paper reviews the applications of DL to glaucoma, including (1) detection of the glaucomatous disc from fundus photographs and optical coherence tomography, (2) interpretation of visual fields and recognition of their progression, and (3) clinical forecasting.

AI, machine learning, and DL

In earlier forms of AI that did not use ML, a machine learned only when explicitly programmed. The machine was taught through a series of if-then statements that specified how it should act. For example, suppose a person wants a computer to play checkers. To teach the computer, the person indicates where the computer should move based on specific circumstances in the game. Under these conditions, the computer is unlikely to become better at checkers than the person.

In contrast, ML describes the ability of a machine to learn something without being explicitly programmed [22]. Samuel coined this term while attempting to make a computer play checkers better than he could. ML allowed the computer to adapt to the game as it played out. As a result, the computer improved its own performance and learned to play checkers better than Samuel.

The ‘deep’ in DL, the newest subset of ML, refers to the many hidden layers in its computer neural network. The benefit of more hidden layers is the ability to analyse more complicated inputs, including entire images. DL also uses a general-purpose learning procedure so that features do not need to be engineered individually [23]. Of vital importance, the DL algorithm is inspired by the organization of the visual cortex, giving it a particular advantage in perceiving visual inputs.
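To make the notion of 'depth' concrete, the minimal sketch below (a hypothetical toy, not any published glaucoma model) stacks several hidden layers in PyTorch; the input size of 52 values, standing in for the test points of a visual field, and the layer widths are assumptions chosen only for illustration.

```python
import torch.nn as nn

# A toy "deep" network: several hidden layers sit between input and output.
# The input size (52, e.g., visual field test points) and the layer widths
# are illustrative assumptions, not values from any published study.
model = nn.Sequential(
    nn.Linear(52, 64), nn.ReLU(),   # hidden layer 1
    nn.Linear(64, 32), nn.ReLU(),   # hidden layer 2
    nn.Linear(32, 16), nn.ReLU(),   # hidden layer 3
    nn.Linear(16, 2),               # output scores: glaucoma vs. no glaucoma
)
```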

DL and visual cortex neural networks

DL networks are modelled after visual cortex neural networks. As a result, there are multiple features that artificial and biological networks share, including the use of edge detection and a high degree of spatial invariance, which refers to the ability to recognize images despite alterations in viewing angle, image orientation, image size, scene lighting, etc. [24]. Early layers of the visual cortex are considered edge detectors [25] because they have dedicated orientation- and position-specific cells, as initially described by Hubel and Wiesel [26]. A cell might respond to a bar with a vertical orientation, but if the bar is rotated 30°, the cell may no longer respond. DL utilizes small receptive fields that act like flashlights to learn about edges of objects and where the objects have empty space.

There are multiple architectural similarities between biological and artificial neural networks, including their degree of connectivity and their learning procedure. In the visual cortex, not every neuron in a particular layer is connected to every neuron in the next layer. While this breadth of connectivity would be useful, it is not feasible because of evolutionary constraints on human brain size. Artificial neurons in DL networks have the same sparse connective architecture as biological neurons, a feature that reduces computational burden. DL networks further reduce computational complexity and minimize computer memory usage by employing matrix multiplication with predetermined filters. Another architectural similarity between biological and artificial neural networks is the condensation and summation that occur at the end of the DL algorithm, akin to what happens in area V1 of the cerebral cortex. Finally, DL and cortical computation have both feedforward and feedback arms (the latter is called backpropagation) [27, 28]. In backpropagation, a network adjusts the weights of its different inputs to ensure the actual output of the algorithm matches its expected value [28].
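Two of these ideas, a small receptive field swept across the image and a predetermined filter whose weights are shared at every position, can be sketched in a few lines of Python; the 3 × 3 vertical-edge kernel and the toy image below are invented for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small filter (receptive field) across the image; the same
    weights are reused at every position (weight sharing)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]    # local image patch
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply and sum
    return out

# A predetermined 3x3 filter that responds to vertical edges, loosely
# analogous to an orientation-selective cell described by Hubel and Wiesel.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0                  # dark left half, bright right half
feature_map = conv2d(image, vertical_edge)
print(feature_map)                  # strong responses along the vertical edge
```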

DL algorithm

DL consists of three essential stages: (1) training, (2) validation, and (3) testing. A machine is first given a training dataset, the sample data to which it fits its algorithm. A validation dataset is then used to evaluate how well the trained model generalizes beyond the training set, and the model is adjusted accordingly. Finally, to assess how well the algorithm works on data it has never seen, the machine is given a testing dataset.
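As a minimal sketch of these three stages, the snippet below splits a labelled image set 70/15/15; the random arrays are toy stand-ins for real fundus photographs and ground-truth labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for a labelled dataset: 100 "images" with binary labels
# (1 = glaucoma, 0 = no glaucoma); real studies use thousands of photographs.
images = np.random.rand(100, 64, 64, 3)
labels = np.random.randint(0, 2, size=100)

# 70% training, 15% validation, 15% testing.
train_x, temp_x, train_y, temp_y = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=0)
val_x, test_x, val_y, test_y = train_test_split(
    temp_x, temp_y, test_size=0.50, stratify=temp_y, random_state=0)

# The model is fitted to the training set, tuned against the validation set,
# and finally evaluated once on the held-out testing set.
```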

DL and glaucoma

Glaucoma is a leading cause of irreversible blindness, with a global prevalence of 3.5% and a global burden of 76 million affected people in 2020 [29]. Early detection and treatment can preserve vision in affected individuals. However, glaucoma is asymptomatic in early stages, as visual fields are not affected until 20–50% of corresponding retinal ganglion cells are lost [30, 31]. Considerable work is needed to improve our ability to detect glaucoma and its progression as well as optimize treatment algorithms in order to preserve vision in these patients. While great strides have been made in understanding the various glaucoma subtypes, the avalanche of existing imaging and visual field data will need to be synthesized in new ways to improve our understanding of glaucoma and derive better treatments. Glaucoma, like the field of ophthalmology in general, is heavily image based and AI is poised to address many of these challenges.

DL and detection of the glaucomatous disc

Assessment of optic nerve head (ONH) integrity is the foundation for detecting glaucomatous damage. The ONH is a site where ~1 million retinal ganglion cell axons converge on a space with an average area of 2.1–3.0 mm² before radiating to higher visual pathways [32]. Given the variance in ONH anatomy [33], it can be challenging to identify the glaucomatous disc in both the clinical and the screening setting. In fact, a study showed that agreement among experts on the detection of ONH damage from fundus photographs is only moderate [34]. Difficulties in detecting the glaucomatous disc from fundus photographs can be compounded by variations in image capture platform, exposure, focus, magnification, state of mydriasis, and presence of non-glaucomatous disease. DL has made considerable inroads in detecting glaucomatous disc damage from digital optic nerve photographs. Figure 2 shows the DL procedure applied to detection of the glaucomatous disc. Here the input layer is an optic nerve image, which can be depicted mathematically as a 3-dimensional pixel array with length, width, and colour channels (Red–Green–Blue). The input image is assigned a clinical consensus ground truth label such as ‘glaucoma’ or ‘no glaucoma’. The output of one hidden layer becomes the input of the next hidden layer. The output layer is a classification label that the algorithm gives the image based on the properties it identifies during DL.
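In code, the input layer described above is simply a 3-dimensional array of pixel intensities paired with its consensus label; the file name in this sketch is hypothetical and any digital disc photograph would serve.

```python
import numpy as np
from PIL import Image

# Hypothetical file name; any digital optic nerve photograph would do.
photo = np.asarray(Image.open("disc_photo.jpg").convert("RGB"))
print(photo.shape)   # e.g., (1024, 1536, 3): height x width x RGB colour channels

# The clinical consensus ground truth label is attached to the pixel array.
example = {"pixels": photo, "label": "glaucoma"}   # or "no glaucoma"
```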

Fig. 2

The deep learning procedure applied to glaucomatous disc detection is depicted. The input layer, or the image, is analysed and gives rise to the output layer, or the classification label. There are two stages of deep learning analysis: feature learning and classification. Feature learning is an iterative procedure of convolution, pooling, and activation that is applied at each hidden layer. In order to classify the image based on what is deduced from feature learning, probability conversion is performed. The probability value produced is used to classify the input image. Prior to classification, backpropagation occurs to compare the predicted probability value to the actual probability value and calculate the corresponding error. In the case of glaucoma detection, the final probability value is used to classify the input image as glaucomatous or normal

There are two stages of DL analysis: feature learning and classification. Feature learning is an iterative procedure of convolution, pooling, and activation, followed by backpropagation (Fig. 2). Classification consists of probability conversion and clinical labelling. The feature learning iterative procedure is applied at each hidden layer. Each layer is analysed piecewise, in blocks called image patches or receptive fields. Convolution, pooling, and activation occur at each image patch until the entire layer is analysed. The first step, convolution, combines two matrices: the input matrix, i.e. the image being analysed, and the feature matrix, i.e. the feature being extracted from the image. Convolution produces a feature map, which highlights the important features of the input image and suppresses its irrelevant parts.

The final two steps of the iterative procedure are pooling and activation. Pooling streamlines the matrix to its most important parts, which are passed on to the next hidden layer. The most common type of pooling is called max pooling. In max pooling, only the highest-intensity pixel values within each image patch are retained and the rest are discarded. Pooling thereby isolates the most relevant features of the given hidden layer. Activation further streamlines the matrix by setting negative values to 0. Probability conversion then produces a probability value based on what remains in the matrix. This probability value is later used to clinically classify the input image. Prior to classification, backpropagation, which implements gradient descent, occurs. Backpropagation compares the predicted probability value to the actual probability value and calculates the corresponding error. Backpropagation then updates each feature matrix value recursively in order to compute the most accurate probability value. Based on the final probability value, the input image is clinically labelled. In the case of glaucoma detection, the final probability value is used to classify the input image as glaucomatous or normal.
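The sketch below illustrates these steps with plain NumPy on a made-up 4 × 4 feature map: max pooling keeps the strongest response in each 2 × 2 patch, activation zeroes negative values, softmax converts scores into probabilities, and a single gradient-descent update of one weight stands in for what backpropagation does across the whole network; all numbers are invented.

```python
import numpy as np

def relu(x):
    """Activation: negative matrix values are set to 0."""
    return np.maximum(x, 0)

def max_pool_2x2(fmap):
    """Keep only the highest value in each 2x2 image patch."""
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(z):
    """Probability conversion: map raw scores to values that sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

fmap = np.array([[ 1., -2.,  3.,  0.],
                 [ 0.,  5., -1.,  2.],
                 [-3.,  1.,  4.,  1.],
                 [ 2.,  0.,  0., -2.]])
pooled = max_pool_2x2(relu(fmap))
scores = np.array([pooled.sum(), 1.0])   # toy scores for [glaucoma, normal]
probs = softmax(scores)

# One gradient-descent step of the kind backpropagation performs: nudge a
# weight in the direction that reduces the prediction error.
weight, learning_rate = 0.5, 0.1
error_gradient = probs[0] - 1.0          # true label here: glaucoma (probability 1)
weight -= learning_rate * error_gradient
```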

Common metrics that assess DL algorithms are sensitivity, specificity, accuracy, precision, positive predictive value, negative predictive value, and area under the receiver operating characteristic curve (AUC) [35]. AUC is calculated using sensitivity and specificity [36] and is intended for binary classifiers only [36]. As a result, AUC can be used as a metric when images are classified into two categories, such as ‘glaucoma’ or ‘no glaucoma’.
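These metrics are straightforward to compute from a model's outputs; the sketch below uses scikit-learn with a handful of made-up predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# Made-up results: 1 = glaucoma, 0 = no glaucoma.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.92, 0.80, 0.30, 0.12, 0.45, 0.55, 0.70, 0.20])  # model output
y_pred = (y_prob >= 0.5).astype(int)                                  # threshold at 0.5

auc = roc_auc_score(y_true, y_prob)                  # area under the ROC curve
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                         # true positive rate
specificity = tn / (tn + fp)                         # true negative rate
ppv = tp / (tp + fp)                                 # positive predictive value
npv = tn / (tn + fn)                                 # negative predictive value
accuracy = (tp + tn) / len(y_true)
```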

Investigators have assembled large numbers of images into training, validation, and testing datasets to successfully train DL algorithms to detect a cup-disc ratio (CDR) at or above a certain threshold (either CDR of 0.7 or 0.8) with AUC ≥ 0.942 [10, 12] (Table 1). In an alternative approach, investigators have assigned a glaucoma status based on a consensus of ancillary data associated with the input disc photograph and have also reported remarkably good results (AUC ≥ 0.872) [9, 37,38,39,40]. In this way, a DL algorithm could be tailored to identify the optic disc associated with manifest visual field loss, a highly meaningful endpoint that circumvents the issue that larger discs naturally have larger cups, a potential source of false positive screening results. Furthermore, investigators have applied DL to assessments of OCT. A study detecting early glaucoma with OCT using DL showed a higher AUC (0.937) than other machine learning methods, including random forests (AUC 0.820) and a support vector machine model (AUC 0.674) (Table 1) [8]. Finally, in a highly innovative approach, Medeiros et al. assigned the average retinal nerve fibre layer (RNFL) thickness from an OCT paired with each fundus photograph and trained a DL algorithm to predict average RNFL thickness from a test fundus photograph [37]. The correlation between predicted and observed RNFL thickness values was high (r = 0.83) and the AUC for glaucoma detection from the DL prediction of RNFL thickness was 0.944. Such a machine-to-machine learning approach removes the subjectivity associated with the ground truth labels for disc photographs and gives the photograph the added value of an estimated RNFL thickness. Using a similar approach, this research team also assigned a Bruch’s membrane opening minimum rim width (BMO-MRW) value, defined as the minimal distance from the internal limiting membrane to the inner opening of Bruch’s membrane, to fundus images and obtained similar results in terms of detecting glaucoma [40]. BMO-MRW is an OCT biomarker that may be equally or more sensitive in detecting glaucoma. There is considerable pixel information embedded in digital fundus photographs, and DL algorithms are being used to leverage that information.
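The machine-to-machine idea can be sketched, very loosely, as a convolutional backbone with a single regression output trained against OCT-measured average RNFL thickness. The snippet below is a hypothetical Keras illustration of that concept: the backbone choice, input size, and variable names are assumptions and do not reproduce the architecture used by Medeiros et al.

```python
import tensorflow as tf

# Hypothetical sketch of a machine-to-machine regressor: a pretrained image
# backbone topped with one output unit that predicts average RNFL thickness
# (in microns) from a fundus photograph, trained against paired OCT values.
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1)   # predicted average RNFL thickness
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# Training would pair fundus photographs with OCT-derived RNFL thickness, e.g.:
# model.fit(fundus_photos, oct_rnfl_thickness, validation_split=0.2, epochs=10)
```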

Table 1 Summary of glaucoma detection studies using deep learning

Structural disc features that clinicians use to detect glaucomatous optic neuropathy include increased CDR, RNFL thinning, neuroretinal rim thinning and notching, excavation of the cup, vertical elongation of the optic cup, parapapillary atrophy, disc haemorrhage, nasal shifting of central ONH vessels, and baring of the circumlinear vessels [41]. To confirm glaucoma, a clinician inspects these features on ONH and RNFL examination. In contrast, it is unknown whether DL algorithms evaluate these features; in fact, the exact mechanisms DL models use to make predictions in glaucoma algorithms are unclear. As a result, DL algorithms have been called ‘black boxes’ [35]. Heatmap analysis and occlusion testing have shed light on how DL works. Both heatmap analysis and occlusion testing visually represent the weights of fundus image components as contributors to the algorithm’s prediction [37]. Heatmap analysis has identified the optic disc as the region that algorithms primarily use to classify images [37, 40, 42]. In addition, occlusion testing has identified the neuroretinal rim as the main predictive factor in differentiating normal from glaucomatous eyes [43].
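Occlusion testing is conceptually simple: a patch of the image is masked, the prediction is recomputed, and the drop in the 'glaucoma' probability indicates how much that region matters. The sketch below is a generic illustration; the stand-in image and prediction function are invented so the example runs on its own.

```python
import numpy as np

def occlusion_map(image, predict_fn, patch=32, stride=16, fill=0.5):
    """Slide a grey patch across the image and record how much the model's
    'glaucoma' probability drops; large drops mark influential regions
    (e.g., the neuroretinal rim)."""
    h, w, _ = image.shape
    baseline = predict_fn(image)
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            occluded[y:y + patch, x:x + patch, :] = fill   # mask this region
            heat[i, j] = baseline - predict_fn(occluded)   # importance of the region
    return heat

# Stand-ins so the sketch is self-contained: a random "image" and a toy
# prediction function; in practice predict_fn would call the trained DL model.
demo_image = np.random.rand(128, 128, 3)
demo_predict = lambda img: float(img[32:96, 32:96].mean())
heat = occlusion_map(demo_image, demo_predict)
```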

AI and visual field interpretation

Computerized automated visual field testing represented a real advancement in mapping the island of vision and allowed visual field testing to become a cornerstone in diagnosing and monitoring glaucoma. Various platforms were developed, and computerized algorithms generated useful outputs containing reliability parameters, retinal sensitivity arrays across visual space adjusted for age-matched controls, and global indices that summarize the island of vision. Visual fields, as opposed to digital fundus photographs or OCTs, are comparatively simple 2-dimensional datasets that represent a functional assay of the entire visual pathway. They are also subject to short- and long-term fluctuation. While computerized visual field printouts are extremely informative, they lack certain features that would make them more useful to clinicians. Specifically, current algorithms do not differentiate glaucomatous from non-glaucomatous defects and artefacts. Furthermore, they do not quantify the degree of defects in a regional manner.

In 1994, Goldbaum et al. [44] created a two-layer neural network that analysed visual fields. This network categorized normal and glaucomatous eyes with a sensitivity of 65% and a specificity of 72%, comparable to two glaucoma specialists. The pioneering work of Goldbaum et al. employed an unsupervised ML strategy that can be broadly classified as axis learning, i.e. principal component analysis and independent component analysis. DL has since been used to further leverage the retinal sensitivity data contained in visual fields. For example, these algorithms have been effective in the automated differentiation of glaucoma and preperimetric glaucoma [11]. Asaoka et al. showed that a DL classifier exhibited a higher AUC (0.926) in detecting glaucomatous visual field loss than other machine learning classifiers, including random forests (AUC 0.790) and a support vector machine (AUC 0.712) (Table 1) [11]. DL algorithms are also better able than clinicians to detect glaucoma using visual fields. Li et al. found that DL was more accurate (0.876) than ophthalmology residents (0.593–0.640), attending ophthalmologists (0.533–0.653), and glaucoma experts (0.607–0.663) at differentiating glaucomatous from non-glaucomatous visual fields [45]. Li et al. suggested that there are patterns, possibly between adjacent and distant test points, that DL algorithms are able to detect but that clinicians do not appreciate.

Current computerized packages do not decompose visual field data into patterns of loss. Visual field loss patterns are due to compromise of various structures ranging from the cornea to the occipital cortex. Furthermore, glaucomatous patterns ultimately have topological correspondence to discrete regions of the ONH [46]. Keltner et al. offered a visual field classification system based on manual inspection of visual fields generated in the ocular hypertension study (OHTS), but made no attempt to quantify these patterns [47]. More recently, an unsupervised algorithm employing a corner learning strategy called archetypal analysis was developed to classify quantitatively the regional patterns of loss without the potential bias of clinical experience [48]. In this study, 13,321 Humphrey visual fields were subjected to unsupervised learning to identify archetypes, or prototypical patterns of visual loss. All archetypes obtained through this algorithm corresponded to the manual OHTS classifiers. Archetypal analysis provides a regional stratification of a visual field along with coefficients that weigh each of these regional patterns of loss. In Fig. 3, an inferior visual field defect is decomposed into an inferonasal step (46%), an inferior altitudinal defect (40%), and an inferior paracentral pattern of loss (15%). Subsequent chart review of visual fields from patients with weighting coefficients >0.7 for each archetype yielded expected clinical findings [49]. For example, patients with high weighting coefficients for an archetype consistent with advanced glaucoma were more likely to have high CDRs than patients with high weighting coefficients for other archetypes. Furthermore, archetypal analysis was useful in predicting reversal of a glaucoma hemifield test back to normal after two consecutive ‘outside normal limit’ results largely because it accounted for lens rim artefacts and non-glaucomatous loss patterns [50].
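The decomposition itself can be approximated in a few lines: given a library of archetypes, the coefficients for a new field can be found by constrained least squares and expressed as percentages. The sketch below uses non-negative least squares followed by normalization as a stand-in for the published archetypal-analysis fitting procedure; the archetype matrix and visual field vector are random placeholders.

```python
import numpy as np
from scipy.optimize import nnls

# Placeholders: each row of `archetypes` is a prototypical pattern of loss over
# the 52 non-blind-spot test points of a 24-2 field; `field` is the field to
# decompose. Real analyses use archetypes learned from thousands of fields.
rng = np.random.default_rng(0)
archetypes = rng.random((16, 52))
field = rng.random(52)

# Non-negative least squares gives each archetype's weight; normalizing the
# weights to sum to 1 approximates the convex decomposition reported by
# archetypal analysis (e.g., 46% inferonasal step, 40% altitudinal loss, ...).
weights, _ = nnls(archetypes.T, field)
weights = weights / weights.sum()
for k in np.argsort(weights)[::-1][:3]:
    print(f"archetype {k}: {weights[k]:.0%}")
```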

Fig. 3

The visual field shown depicts an inferior arcuate defect that is decomposed quantitatively into an inferonasal defect (46%), inferior altitudinal loss (40%), and an inferior paracentral defect (15%) as per archetypal analysis

Detecting visual field progression and AI

Saeedi et al. identified a high degree of variation in predictions of visual field progression across existing conventional algorithms that are used in randomized trials and in clinical practice [51]. This lack of agreement underscores an opportunity for AI algorithms to track visual field progression. In fact, Wang et al. calculated the rate of change in the weighting coefficients obtained by applying archetypal analysis to a large visual field database with serial tests, using the consensus of three glaucoma specialists with access to clinical data as the reference standard. They found an accuracy of 0.77 for archetypal analysis in detecting progression, a value exceeding that of the mean deviation slope (0.59), permutation of pointwise linear regression (0.60), Collaborative Initial Glaucoma Treatment Study scoring (0.59), and Advanced Glaucoma Intervention Study scoring (0.52) [52].
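One simple way to express the 'rate of change in the weighting coefficients' is an ordinary least-squares slope over follow-up time, as in the hypothetical sketch below; the series of coefficients is invented, and this is only an illustration of the idea, not the exact statistics used by Wang et al.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical series: the weighting coefficient of one archetype (e.g., a
# superior nasal step) measured on serial visual fields over follow-up.
years = np.array([0.0, 0.9, 2.1, 3.0, 4.2, 5.1, 6.3])
coefficient = np.array([0.05, 0.08, 0.10, 0.14, 0.18, 0.21, 0.26])

slope, intercept, r, p, stderr = linregress(years, coefficient)
print(f"rate of change: {slope:.3f} per year (p = {p:.4f})")
# A significantly positive slope for a glaucomatous archetype is one way to
# flag the field series as progressing.
```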

Figure 4 shows consensus on visual field progression based on conventional algorithms, the clinical reference standard, and archetypal analysis. In Fig. 5, a patient with 6.3 years of follow-up was judged to have progressed clinically on the basis of deepening of a superior nasal step. The change in mean deviation slope was small, and the permutation of pointwise linear regression, the Collaborative Initial Glaucoma Treatment Study scoring, and the Advanced Glaucoma Intervention Study scoring did not designate the visual field tests as worsening; however, archetypal analysis found a significant increase in the coefficient for the superior nasal step archetype (archetype 3).

Fig. 4

The visual field shown depicts a superior altitudinal defect and is a clear example of visual field progression. The figure shows consensus of progression based on conventional algorithms, the clinical reference standard, and archetypal analysis

Fig. 5

The visual field shown depicts a superior nasal step and is a subtle example of visual field progression. The archetype method appears to detect the visual field progression pattern when conventional algorithms do not

Clinical forecasting and AI

In the Collaborative Initial Glaucoma Treatment Study, Janz et al. documented that patients with newly diagnosed glaucoma harbour fears of blindness after they receive this diagnosis [53]. The aggregate blindness rates from glaucoma are not high; for example, the rate of monocular blindness from glaucoma in the Salisbury Eye Evaluation study was 1.4% [54]. Nonetheless, patients and physicians alike would benefit from accurate forecasting to identify disease prognosis.

Kalman filtering is a forecasting method that has been used in numerous fields. During the Apollo missions in the 1960s, the National Aeronautics and Space Administration used Kalman filters to map out the trajectory of astronauts to the Moon [55]. In more recent years, Kalman filtering has been used for clinical forecasting. Clinical forecasting refers to the generation of a personalized prediction for disease trajectory [56]. Forecasting models can be updated using data from subsequent clinic visits, leading to more accurate predictions [57]. By using a Kalman filter model for patients with glaucoma, researchers were able to detect glaucoma progression 57% earlier than they would have using a yearly monitoring system [57]. The same model then accurately predicted disease progression in patients with normal tension glaucoma [58]. A third study used Kalman forecasting to predict personalized, target intraocular pressure levels for patients [59]. Personalized patient recommendations can be produced based on the Kalman forecasting method, which can help clinicians in decision-making.
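At its core, a Kalman filter alternates a predict step (carry the estimate forward and let its uncertainty grow) with an update step (blend the prediction with the newest measurement according to their relative uncertainties). The one-dimensional sketch below tracks a noisy series such as mean deviation; the state model, noise values, and measurements are assumptions for illustration and not the published glaucoma forecasting model.

```python
def kalman_step(x, P, z, q=0.05, r=1.0):
    """One predict/update cycle for a scalar state.
    x: current estimate, P: its variance, z: new measurement,
    q: process noise, r: measurement noise."""
    # Predict: carry the estimate forward and grow its uncertainty.
    x_pred, P_pred = x, P + q           # random-walk state model
    # Update: blend prediction and measurement via the Kalman gain.
    K = P_pred / (P_pred + r)
    x_new = x_pred + K * (z - x_pred)
    P_new = (1 - K) * P_pred
    return x_new, P_new

# Hypothetical serial mean deviation values (dB) from visits over time.
measurements = [-2.1, -2.6, -2.4, -3.0, -3.3]
x, P = measurements[0], 1.0
for z in measurements[1:]:
    x, P = kalman_step(x, P, z)
print(f"filtered estimate after last visit: {x:.2f} dB")
```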

Limitations of DL

DL is considered a ‘black box’ in that its predictive mechanism is unknown [60]. In the field of retinal disease, the most notable example of opening the ‘black box’ was reported by De Fauw et al., who provided a framework allowing inspection of the particular OCT features used to detect referable retinal disease [61]. Ultimately, learning which image features are considered in the classification of disease or the determination of disease progression may be instructive in making clinicians better observers of clinical data and could be used to train current and future generations of glaucoma specialists.

DL algorithms are only as good as the images input into them. Algorithms have low sensitivity and specificity when analysing fundus images of poor quality. In a recent study, fundus images of poor quality that were not removed from the dataset were found to decrease the AUC by 0.1 [42]. Another limitation of DL is that images with less severe disease manifestations, including glaucoma suspect and early glaucoma, can be more difficult for DL algorithms to classify [39, 42]. Algorithms are thus better able to detect more severe cases of glaucoma. DL also has difficulty analysing images with multiple comorbid eye conditions. False negative classifications have occurred when glaucoma is present with age-related macular degeneration, DR, or high myopia [10]. Consequently, myopic eyes are sometimes excluded from analysis [62, 63]. Considering that Asians have a high prevalence of myopia [64, 65] and glaucoma [66, 67], devising a way to include myopic eyes in DL models is vital. Another limitation is the lack of information about the use of DL algorithms in heterogeneous samples. Thus far, algorithms have been applied mostly to homogeneous groups in which images from only a few races and ethnicities were included.

In order for DL algorithms to predict with high sensitivity and specificity, a large number of images must be included [37, 38]. There are time constraints and technological difficulties associated with obtaining and storing many images. Furthermore, it may be necessary for such databases to be continuously updated to remain relevant and prevent system-wide failure of the algorithms. Also, a high AUC for an AI algorithm does not necessarily translate into important clinical impact and this must be kept in mind as AI begins to permeate the field of glaucoma.

An initial test indicates that tampering with an image in minor ways can undermine the DL classification. Specifically, it is possible that changing a few pixels can lead to the misclassification of an image [68]. Lynch et al. changed pixels in fundus photos exhibiting DR. These changes were undetectable to the human eye, so ophthalmologists still judged these photos as exhibiting DR. In contrast, the pixel changes caused the algorithm to mis-categorize half of the DR photos as normal [68]. While this finding has not been confirmed in other studies, it is a potential issue.
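A generic illustration of this kind of tampering is the fast gradient sign method, in which each pixel is nudged slightly in the direction that increases the classifier's loss. The PyTorch sketch below is not the specific perturbation used by Lynch et al.; the toy classifier, image, and label are invented so the example is self-contained.

```python
import torch
import torch.nn as nn

# Toy classifier and stand-in image; in practice these would be the trained
# DL model and a real fundus photograph.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
image = torch.rand(1, 3, 32, 32, requires_grad=True)
label = torch.tensor([1])                              # "disease present"

loss = nn.functional.cross_entropy(model(image), label)
loss.backward()                                        # gradient of loss w.r.t. pixels

epsilon = 1.0 / 255.0                                  # below what the eye can notice
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
# model(adversarial) may now give a different classification than model(image),
# even though the two images look identical to a human observer.
```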

Conclusions

Although there are limitations, the future for DL applications in glaucoma is bright. In a few short years, dozens of applications of DL specific to glaucoma have been published in peer-reviewed journals. In addition to the subjects discussed here, we suspect the use of AI in optical coherence tomography angiography interpretation will be forthcoming. We anticipate applications will emerge that accomplish relevant clinical functions with high sensitivity and specificity across different platforms and different races/ethnicities.

It is interesting to speculate on what is possible in this new DL era. We envision that AI will greatly impact outpatient glaucoma screening and management and will allow for remote disease monitoring. These developments will empower patients to take charge of their disease and enlighten providers about the glaucomatous process. Glaucoma AI algorithms that meet regulatory approval (currently there are none) will likely be embedded in the electronic medical record to facilitate outpatient management. We can imagine an eye care provider logging on to view their schedule and being met with a pre-visit synthesis of a patient’s prior optic nerve and visual field data. The provider would also be notified if the patient were flagged as having glaucoma that is progressing and likely needs a change in target IOP. This AI assessment would be updated based on additional data gathered during the patient visit and might also incorporate ancillary genetic and other medical information.

AI methods could be applied to teleretinal screening programs in non-ophthalmic settings like primary care offices. In this model, manual detection of the glaucoma-like disc would be replaced by algorithms that assess the disc, allowing for effective triage of the patient.

Finally, there is a great need to facilitate remote glaucoma monitoring whereby patients collect their own IOP data with anaesthesia-free and accurate tonometers [69], although more effort will be needed to make home tonometry available at lower cost. Remote monitoring will be facilitated by home visual field testing that could be achieved using virtual reality [70], which controls for ambient lighting and the distance between the eye and stimulus presentation, although fixation monitoring may still be an issue. Ultimately, nonmydriatic self-retinal imaging without the need for expensive smartphone attachments may be realized as the imaging capability of these pervasive handheld devices improves [71]. Of course, the tools for remote disease monitoring will require validation, and the ability of DL algorithms to synthesize remotely acquired data will need to be assessed. However, once reliable, remotely generated glaucoma data are available and analysed by DL algorithms, a new era of glaucoma management will begin. Interestingly, in the United States, a Centers for Medicare and Medicaid Services code for remote disease monitoring is already available, essentially anticipating that such a trend will happen [72]. It will be up to clinicians to lead the way and determine how to implement AI in meaningful ways for our glaucoma patients worldwide.