Introduction

Screening with deep learning

Since 2018, when the first autonomous machine learning model was approved to detect diabetic retinopathy in fundus photographs [1], deep learning algorithms have expanded to assist diagnosis of macular degeneration [2] and glaucoma [3]. Regulatory approval is also pending for myopic retinopathy [4] and cardiovascular disease [5]. Deep learning algorithms can outperform human accuracy for the three targeted diseases: diabetic retinopathy/diabetic macular oedema [6, 7], macular degeneration [8], and glaucoma [9]. The technology is more accessible [10,11,12,13], 200 times faster [14, 15], and capable of sub-clinical detection [16,17,18,19], but the diseases screened are few and human experts are still required for complete retinal assessment.

The only clinically available systemic disease targeted by deep learning in fundus photographs remains diabetes mellitus despite expanding algorithm development to estimate age [20], refractive error [21], smoking status [22], body composition [23], renal function [23], glycated haemoglobin levels [24], anaemia [25], schizophrenia [26], neurodegenerative diseases [27, 28] as well as cardiovascular [22, 29, 30] and cerebrovascular health [31, 32]. Multi-target algorithms [33,34,35] and multimodality using scanning laser imaging and OCT technology [36] are also rapidly advancing, but realising autonomous comprehensive retinal screening on this trajectory is unlikely as the required training datasets for rare and novel diseases are lacking, and target-specific algorithms are not designed to detect clinically important incidental diseases noticed by human experts.

Applications in hypertensive and diabetic retinopathies

Retinal photography, the most effective screening strategy for diabetic retinopathy [37], shows the extent of progressive retinal vessel compromise and estimates the health of the other target organs: the heart, brain, and kidneys [38]. Similarly, the extent of hypertensive retinal vascular remodelling estimates the health of the other target organs [39]. Unlike the management of diabetes mellitus, where glycated haemoglobin provides time-averaged blood glucose estimates, hypertension has no such time-averaged metric to compensate for the volatility of blood pressure. Thus, hypertensive ocular biomarkers are key indicators for non-invasive management and offer preclinical detection, as arteriolar narrowing occurs prior to clinical hypertension [40].

More than twice as prevalent as diabetes mellitus, hypertension, defined as office blood pressure levels above 140 mmHg systolic, 90 mmHg diastolic, or both [41], remains the leading modifiable risk factor for premature death worldwide [42], affecting 20% of the global adult population, with 46% undiagnosed and only 21% receiving effective treatment [43]. Interestingly, clinical standards of care call for ongoing screening and monitoring of diabetic retinas, whereas there are no such recommendations for hypertension beyond retinal screening at diagnosis [44].

Not only is hypertensive retinopathy the most common clinically significant incidental finding in diabetic screening [45], but hypertensive retinal features appear to dominate the pathologies misclassified by deep learning algorithms [45,46,47,48,49,50,51,52,53,54,55,56,57,58,59]. Despite exploration of vessel calibre changes [60] and small vessel segmentation using optical coherence tomography angiography (OCTA) [61, 62], difficulties remain differentiating some early biomarkers of diabetic and hypertensive retinopathies [63].

Hypothetically, shared retinal biomarkers and/or high rates of hypertension in diabetic training images [64] could trigger false-positive results for deep learning algorithms. Additionally, or alternatively, training deep learning pattern recognition/discrimination for a specific disease is not error-free and may result in anomalous data signals triggering misclassification of not only hypertensive retinopathy but also other clinically useful untargeted disease. Deep learning pathways remain unknown, but the hypotheses suggest that the algorithms may be clinically useful beyond their intended targets.

The primary aim of this study was to explore the potential for detecting clinically useful incidental ocular biomarkers using diabetic deep learning algorithms to screen fundus photographs of hypertensive adults.

Subjects and methods

Research design

The study had a retrospective, observational design. Approval was obtained from the East Metro Health Service Ethics and Governance Unit to use images collected for EastMetro HREC RGS1040 - Retinal Imaging in Resistant Hypertension. The study adhered to the tenets of the Declaration of Helsinki for research involving human subjects, and all participants gave informed consent.

Participants

Participants were recruited from Dobney Hypertension Centre, a public hospital outpatient clinic in Western Australia specialising in resistant hypertension, defined as uncontrolled high blood pressure despite at least three antihypertensive medications including a diuretic [41]. All participants were confirmed to have hypertension at recruitment, but not necessarily resistant hypertension.

All patients attending Dobney Hypertension Centre between 18 January 2016 and 31 March 2022 were invited to participate in the Retinal Imaging in Resistant Hypertension study. Recruitment, data collection and processing have been described in detail previously [65, 66]. In brief, patients were referred from primary care for diagnostic workup and clinical management of difficult-to-control hypertension. Patients consented to participate in a systematic prospective analysis to explore the association between blood pressure at presentation and retinal imaging parameters. Baseline clinical data collected from the patients included medical history, medication history, serum pathology, extensive blood pressure testing, and specific assessments of hypertension-mediated organ damage (including retinal imaging).

The participation rate for all new and returning attendees was above 90%. Of the 529 consented participants, 96 were excluded from the study for lack of imaging arising from technical issues or non-attendance.

Data collection

Image acquisition

Clinic staff collected 45° macula-centred, colour fundus photographs without mydriasis using a Canon CR-2 camera (Tokyo, Japan) along with OCT and OCTA imaging by Optovue Avanti XR (Fremont, California, USA). All images and data were de-identified before transfer to the Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia’s national science agency.

Image selection

For each participant, the most recent processable higher-quality image capturing both the optic nerve and macula was selected. Processability was determined by two or more algorithms successfully processing the image. Two of the algorithms output a graded image or deemed the image ungradable, while the third algorithm graded all images along with a quality scale and a reliability threshold. Where no image met the processability criteria, the participant was removed from further assessment. Where more than one image met the criteria, the image with the algorithm-determined higher degree of pathology was selected. The same 45° colour fundus photograph selected for each participant was processed by three deep learning algorithms.
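The selection rule described above can be sketched in a few lines. This is a minimal illustration rather than the study's actual pipeline: the `FundusImage` record, its field names, the numeric `pathology_score`, and the use of capture date as a tie-breaker are all assumptions introduced here for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class FundusImage:
    """Hypothetical record for one 45-degree colour fundus photograph."""
    image_id: str
    captured_on: date
    shows_disc_and_macula: bool   # both optic nerve and macula are captured
    algorithms_processed: int     # how many of the three algorithms processed it
    pathology_score: float        # assumed algorithm-determined degree of pathology


def is_processable(img: FundusImage) -> bool:
    """An image qualifies if two or more algorithms processed it successfully."""
    return img.shows_disc_and_macula and img.algorithms_processed >= 2


def select_image(images: Iterable[FundusImage]) -> Optional[FundusImage]:
    """Return the image to analyse, or None if the participant is removed."""
    candidates = [img for img in images if is_processable(img)]
    if not candidates:
        return None  # no image met the criteria: participant removed
    # Prefer the higher degree of pathology; break ties by the most recent
    # capture date (the tie-break order is an assumption of this sketch).
    return max(candidates, key=lambda img: (img.pathology_score, img.captured_on))
```

A participant with no qualifying image yields `None`, which corresponds to removal from further assessment.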

Algorithms

Algorithm 1, DR Grader (Perth, Australia), was approved in 2018 as a Class I medical device to detect only referable diabetic retinopathy, differentiating moderate and severe. Referable diabetic retinopathy is internationally recognised as more than mild diabetic retinopathy [67]. Verified on a data set of 193 images by two human experts for referable diabetic retinopathy, the algorithm identified 17 as positive including the two true positive cases, resulting in 100% capture of true positives and a positive predictive value of 12% [46].
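The verification figures quoted for Algorithm 1 can be reproduced with a short calculation. The sketch below uses only the counts given in the text (17 images flagged, 2 expert-confirmed cases, none missed); the function names are illustrative and not taken from any cited software.

```python
def positive_predictive_value(true_positives: int, flagged_positive: int) -> float:
    """Fraction of algorithm-positive images that are truly positive."""
    if flagged_positive == 0:
        raise ValueError("no images were flagged positive")
    return true_positives / flagged_positive


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of truly positive images that the algorithm flagged."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no true positive cases in the data set")
    return true_positives / total


# Counts quoted in the text for the 193-image verification set [46].
flagged = 17    # images the algorithm identified as positive
true_pos = 2    # expert-confirmed referable cases, both flagged
false_neg = 0   # no confirmed case was missed

ppv = positive_predictive_value(true_pos, flagged)
sens = sensitivity(true_pos, false_neg)
print(f"PPV {ppv:.0%}, sensitivity {sens:.0%}")  # PPV 12%, sensitivity 100%
```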

Algorithm 2, RetCAD (Nijmegen, The Netherlands), was approved in 2020 as a Class IIa medical device for simultaneous detection of diabetic retinopathy and macular degeneration. The algorithm not only quantified all diabetic retinopathy and macular degeneration results, but also graded fundus photograph quality and estimated vertical cup-to-disc ratio. For all images, Algorithm 2 produced a contrast-enhanced image along with separate precise heatmaps of bright and red lesions. In 2022, in a real-world tertiary hospital screening setting, the algorithm processed 7195 images for referable diabetic retinopathy resulting in 90.5% sensitivity and 97.1% specificity [68]. When mild diabetic retinopathy was included, sensitivity rose to 91.7% and specificity dropped to 90.9% [69].

Algorithm 3, Eyetelligence (Melbourne, Australia), was approved in 2019 as a Class I medical device for three separate target diseases: diabetic retinopathy [47], macular degeneration [2], and glaucoma [70]. The algorithm categorised diabetic retinopathy into the four internationally recognised levels: mild, moderate, severe, and proliferative [67, 71], with a sensitivity of 92.5% and a specificity of 98.5% [47]. Algorithm 3 also output a diffuse heatmap for referable diabetic retinopathy results only.

Grading of known false-positive results

Within the images identified by one or more of the algorithms as positive for diabetic retinopathy, 29 participants were verified as clinically non-diabetic, defined as an absence of diabetic or pre-diabetic clinical diagnosis together with a confirmed glycated haemoglobin result below 5.7 mmol/L. The 29 misclassified non-diabetic images were graded by two highly qualified retinal specialists with extensive experience in both clinical practice and research at the Lion’s Eye Institute, Perth. Both assessors independently identified the ocular anomalies visible in the sample. Where diagnostic ambiguities arose, the final determination was made by reviewing additional images, the fellow eye, medical history, heatmaps, OCT, and OCTA scans. The types and relative frequencies of ocular anomalies found in the 29 misclassified images were then recorded as the primary outcome for each algorithm.
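The criteria for a known false-positive result can be expressed as a simple rule. The following is an illustrative sketch only: the function and parameter names are hypothetical, a missing glycated haemoglobin result is assumed to fail verification, and the cut-off is copied from the text in the units reported there.

```python
from typing import Optional

# Cut-off as quoted in the text, in the units reported there.
HBA1C_CUTOFF = 5.7


def is_verified_non_diabetic(
    diabetes_diagnosis: bool,
    prediabetes_diagnosis: bool,
    hba1c: Optional[float],
) -> bool:
    """True when there is no diabetic or pre-diabetic diagnosis and a
    confirmed glycated haemoglobin result below the cut-off."""
    if hba1c is None:
        return False  # assumption: an unconfirmed result cannot verify status
    return (
        not diabetes_diagnosis
        and not prediabetes_diagnosis
        and hba1c < HBA1C_CUTOFF
    )


def is_known_false_positive(
    algorithms_flagging_dr: int,
    diabetes_diagnosis: bool,
    prediabetes_diagnosis: bool,
    hba1c: Optional[float],
) -> bool:
    """An image flagged for diabetic retinopathy by one or more algorithms,
    from a participant verified as clinically non-diabetic."""
    return algorithms_flagging_dr >= 1 and is_verified_non_diabetic(
        diabetes_diagnosis, prediabetes_diagnosis, hba1c
    )
```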

Results

Sample derivation

Of the 433 participants imaged, 27 images (6%) were of insufficient quality for reliable assessment by two or more algorithms. Of the 406 assessable images, 251 (62%) returned negative results for all target anomalies across all three algorithms.

Of the 155 images returning positive results, 56 were positive for macular degeneration or glaucoma and 99 (63.9%) were flagged as positive for diabetic retinopathy. Of the 99 images, 14 (14%) were flagged by all three algorithms, 28 (28%) were flagged by two algorithms, and 57 (58%) were flagged by one algorithm.

Of the 99 participants with positive diabetic retinopathy results, 48 had a diagnosis of diabetes mellitus or had glycated haemoglobin levels in the diabetic range, and 22 were pre-diabetic based on glycated haemoglobin testing. The remaining 29 participants with positive diabetic retinopathy results (29% of the 99 reported as positive for diabetic retinopathy and 7% of the 406 assessable images) had no diabetic or glucose-control medication history and had glycated haemoglobin levels below 5.7 mmol/L.

Based on clinical evidence, these 29 participants were not at risk for diabetic retinopathy and their images were classified as false-positive results. Their ages ranged from 23 to 90 years (16 males; mean age 63 ± 16 years).
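The sample derivation can be checked arithmetically. The sketch below uses only the counts reported in the text; the variable names and the `pct` helper are introduced here for illustration.

```python
def pct(part: int, whole: int, digits: int = 0) -> float:
    """Percentage of `whole` represented by `part`, rounded to `digits` places."""
    return round(100 * part / whole, digits)


consented = 529
excluded_no_imaging = 96
imaged = consented - excluded_no_imaging          # 433 participants imaged

insufficient_quality = 27
assessable = imaged - insufficient_quality        # 406 assessable images

all_negative = 251
any_positive = assessable - all_negative          # 155 images with a positive result

dr_positive = 99
amd_or_glaucoma_only = any_positive - dr_positive  # 56 images

# Inter-algorithm agreement within the 99 diabetic retinopathy positives.
flagged_by = {3: 14, 2: 28, 1: 57}

diabetic = 48
prediabetic = 22
false_positive = dr_positive - diabetic - prediabetic  # 29 non-diabetic participants

print(imaged, assessable, any_positive, amd_or_glaucoma_only, false_positive)
# 433 406 155 56 29
print(pct(insufficient_quality, imaged), pct(all_negative, assessable),
      pct(dr_positive, any_positive, 1))
# 6.0 62.0 63.9
print(pct(false_positive, dr_positive), pct(false_positive, assessable))
# 29.0 7.0
```

Each derived count matches the figure stated in the text, and the three agreement groups sum to the 99 positive diabetic retinopathy results.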

Features of false-positive results

Table 1 shows the frequency and capture rates by algorithm for the features found by retinal experts in the 29 images misclassified as positive for diabetic retinopathy. With few exceptions, the capture rates for each feature were highest for the single-target Algorithm 1, lower for the dual-target Algorithm 2, and lowest for the triple-target Algorithm 3.

Table 1 List of Features in 29 Images Misclassified as Diabetic Retinopathy with Capture Rates.

Supplementary Table 1 lists the features observed by human experts for each of the 29 images, in descending severity of algorithmic misclassification as diabetic retinopathy. All but one of the 29 images (97%) contained pathology for which clinical and/or lifestyle intervention is indicated. The exception, Image 21, showed an otherwise unremarkable tessellated fundus.

Three of the thirteen images flagged by Algorithm 2 had no observable blood or exudate and had no heatmap highlights. The other ten heatmaps precisely highlighted blood (red) and exudate (white), as shown by the example images in Fig. 1. The three heatmaps generated by Algorithm 3 did not coincide with prominent pathology in the images, an example of which is shown in Fig. 1.

Fig. 1: Example image with heatmaps.

The original image in Fig. 1 is one of the two images misclassified as referable diabetic retinopathy by all three algorithms (A). Of the three heatmaps, two are from Algorithm 2 precisely highlighting blood (B) and exudate (C), while the heatmap from Algorithm 3 (D) did not match the prominent pathology in the image.

Algorithm-determined referral and clinical utility

Unlike Algorithm 1, Algorithms 2 and 3 provided results for mild non-referable diabetic retinopathy. Despite low inter-algorithm agreement demonstrated by the overlap of the 29 false-positive results for diabetic retinopathy in Fig. 2, all but one (97%) of the referable results (misclassified as moderate or severe diabetic retinopathy) and all 12 (100%) of the non-referable results (misclassified as mild diabetic retinopathy) contained clinically significant pathology likely to benefit from intervention.

Fig. 2: All but 1 of the 29 misclassified images were clinically useful.

Figure 2 shows the low inter-algorithm agreement and the distribution of referable and non-referable results within the 29 misclassified images. All but one of the 29 images had clinically useful biomarkers.

Capture counts and number of diseases targeted

The consistent trend of decreasing detection fractions with increasing number of algorithm targets, previously noted for the individual features listed in Table 1, is also evident in the datasets as shown in Fig. 3. Not only did the models with fewer targets output more false positives in the 29 misclassified non-diabetic results and their 28-count referable subset, but they also had higher capture counts in the 99 positive diabetic retinopathy results and their 75-count referable subset. The one apparent exception to the overall trend in Fig. 3, where the single-target algorithm flagged fewer images than the two-target algorithm, is unlikely to be a true exception, as the single-target algorithm was not designed to output the non-referable results that were included in the total counts of the other two algorithms.

Fig. 3: Detection counts for single, double, and triple target algorithms.

Figure 3 shows that the algorithms with fewer targets not only output more false positives in the 29 misclassified non-diabetic results and their 28-count referable subset, but also had higher capture counts in the 99 positive diabetic retinopathy results and their 75-count referable subset. The apparent exception in the 99 positive diabetic retinopathy results, where the single-target model did not return the most positive results, is unlikely to be a true exception, as the model was not designed to output non-referable results, which were included for the other two models.

Hypertensive retinopathy severity and misclassified diabetic retinopathy

Human-assessed hypertensive retinopathy features, artificially classified by the Keith-Wagener-Barker hypertensive retinopathy severity scale [72] increasing from 0 to 4, were plotted on the y-axis as a function of algorithmic diabetic retinopathy misclassification on the x-axis, to create Fig. 4. For each algorithm, Fig. 4 shows how both the frequency and severity of false-positive diabetic retinopathy results in our sample correlate with increasing hypertensive retinopathy features found in the images. Circle size represents the count at each intersection, with least squares regression lines showing the positive correlation for each algorithm.

Fig. 4: Positive correlation of retinopathy severity: hypertensive and misclassified diabetic.

Figure 4 plots the severity of hypertensive retinopathy, according to the Keith-Wagener-Barker scale, increasing from 0 to 4, for the misclassified diabetic retinopathy results of each algorithm: 24 points for Algorithm 1, 13 points for Algorithm 2, and 8 points for Algorithm 3. The count at each intersection is represented by circle sizes and least squares regression lines show the positive correlation for all three algorithms.
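The construction of a plot such as Fig. 4 can be sketched as follows: count the images at each intersection to obtain circle sizes, then fit an ordinary least squares line through the individual points. The values in `points` are invented placeholders, not study data, and the ordinal encoding of the x-axis (1 = mild, 2 = moderate, 3 = severe) is an assumption of this example.

```python
from collections import Counter
from typing import Sequence, Tuple


def least_squares(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder (x, y) pairs for one algorithm: x is the misclassified diabetic
# retinopathy grade, y is the Keith-Wagener-Barker grade (0 to 4).
points = [(1, 0), (1, 1), (1, 1), (2, 1), (2, 2), (2, 2), (3, 3), (3, 4)]

# Circle size at each intersection is the number of images that fall there.
circle_sizes = Counter(points)

slope, intercept = least_squares([x for x, _ in points], [y for _, y in points])
print(f"slope {slope:.2f}, intercept {intercept:.2f}")
```

A positive slope corresponds to the positive correlation described for each algorithm; repeated points carry more weight in the fit simply because they appear more than once in the input.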

Discussion

The study demonstrated that existing deep learning models capture clinically important incidental pathology in fundus photographs misclassified as diabetic retinopathy. As noted, these findings specifically relate to a subset of false-positive results for diabetic retinopathy in an established hypertensive cohort. All three algorithms captured high rates (97%, 100%, and 100%) of clinically useful non-target disease, including all (100%) of the results classified as non-referable by the algorithms.

The trend for algorithms targeting fewer diseases to capture more incidental pathology is consistent with the hypothesis that deep learning algorithms with less differential training may have data signals with broader anomaly detection potential. This boosts confidence in suggesting a pivot from the flawed disease-specific comprehensive screening trajectory to generalised anomaly detection with self-supervised deep learning.

Whether the correlation between the severity of hypertensive retinopathy present in the image and the number and severity of misclassifications as diabetic retinopathy for each algorithm arises from biomarker ambiguity, training image comorbidities, and/or untrained broader detection signals embedded in the deep learning pathway remains unknown as the process is hidden. However, clinically, both retinopathies are significant, and comorbidity is common, so all positive results (referable, non-referable and false) have potential immediate utility.

Similarity of incidental findings in human diabetic screening

Apart from a hypertensive skew, the features and capture rates shown in Table 1 are comparable to the incidental pathologies found in human expert diabetic screening programs [45, 48,49,50,51,52,53,54,55, 73, 74]. In human diabetic retinal screening, hypertensive retinopathy is the most common incidental finding (14 to 34% [45, 48]), followed by drusen (14 to 21% [45]), macular degeneration (0.5 to 18% [45, 48,49,50]), and retinal vein occlusion (0.7% to 2.2% [45]). Other less common incidental findings include myopic choroidopathy, disc pallor, glaucoma, retinal emboli, geographic atrophy, epiretinal membranes, choroidal nevi, cataract, and posterior capsular opacities [45, 48, 50,51,52,53,54,55]. Referable incidental pathology in human diabetic screening varies from 24 to 45% [45, 53, 54], often with higher capture rates than the targeted diabetic retinopathy [51, 52]. The similarity of these incidental findings suggests that deep learning false-positive results from diabetic populations may contain a high proportion of the clinically valuable incidental pathology found by human assessment.

Similarity of false-positive results in general populations

The few published deep learning false-positive results for diabetic retinopathy are based on general population verification datasets and list drusen, exudate, microaneurysm, macular degeneration, venous occlusion, myopic maculopathy, arteriovenous crossing changes, and “normal” [46, 47, 56,57,58,59]. These features are a subset of those found in the hypertensive sample and, for two reasons, do not necessarily represent the full set of anomalies that may occur. First, the “normal” false-positive results may reflect deep learning anomaly detection beyond human observation, which would not be a false-positive error, but rather a subclinical detection and verification failure. Second, the selected anomalies are presented as plausible explanations for misclassification errors rather than a representation of the full set of features present in the false-positive images. Despite limited data, two algorithms reported referable pathology rates for their false-positive results of 80% [59] and 92% [47], suggesting that further investigation of false-positive results in wider populations may prove clinically useful.

Differential diagnosis of hypertensive and diabetic retinopathies

Although algorithmic pathways are hidden, the sensitivity of diabetic algorithms to hypertensive retinopathy may arise from the artificial and incomplete human classification of training images due to ambiguous biomarkers between the retinopathies [75]. This ambiguity appears to increase for earlier shared biomarkers, such as capillary rarefaction and reduced vessel density as seen with OCTA [76, 77], and is demonstrated algorithmically by a rise in specificity of approximately 6 percentage points (from 90.9% to 97.1%) for Algorithm 2 when mild diabetic retinopathy detection was excluded [68, 69]. The correlation found between the degree of hypertensive retinopathy and the severity of diabetic misclassification not only shows that deep learning models can capture useful non-target pathology as false-positive results, but also raises the possibility that human limitations in biomarker knowledge may lead to algorithmic misclassification, inhibiting target-specific algorithm development.

Expanding deep learning utility

Algorithm target-specificity hinders progress towards autonomous comprehensive screening not only from misclassification and missed incidental disease, but also from inter-algorithm inconsistency and ethical bias. Disease-specific deep learning models are trained to make innumerable comparisons to define volume-derived baseline data to which inputs may be matched. Inconsistency and ethical concerns arise from the variable and biased human selection of training data and supervision used to define the hidden baseline data. Although saliency analysis, a technique of progressively sectioning the fundus to isolate anomalies, has been successful in refining heatmap outputs to theorise biomarkers used by an algorithm [78], and generative artificial intelligence modification of hypothesised features can further isolate potential data signals used to determine algorithm output [79], the baseline data is not exposed and outputs still vary between algorithms. This poses regulatory challenges, such as transparency of embedded normative data as applied to OCT [80,81,82] and raises ethical issues of input bias.

The consistency of the distribution of incidental findings found in false positive results in this study and in screening diabetic and general populations with other deep learning models increases confidence in the hypothesis that deep learning models have data signals capable of broader anomaly detection. As deep learning models are based on pattern recognition and discrimination, training on the existing vast and diverse repository of healthy retinal images to define normative data may generate a more comprehensive, consistent, and generalisable model. To differentiate pathology from anomalies arising from lighting, positioning, media opacities, and artefacts [83] without artificial classification and human supervisory bias, an autoencoding technique, known as self-supervision, has demonstrated success in distinguishing not only artefacts and media opacities, but also tessellation [84]. A variety of self-supervised feature learning models have been developed for medical imaging [85] including one using OCT images that is capable of general anomaly detection without differential diagnostic output [86]. Such initial triage could be of immediate benefit for those with access to OCT, but exceptional access would be realised when sufficient datasets and biomarker knowledge are available to use external eye images [87, 88]. Until then, self-supervised learning models to analyse fundus photographs could not only address scope, consistency, and ethical issues, but also provide opportunities for novel associations and biomarker discovery. A foundational model has now become publicly available, offering anomaly detection without diagnostic classification, ready for differential diagnosis development and labelling [89].

Strengths and limitations

To strengthen internal validity, exclusively non-diabetic participants comprised the subset of false-positive results for diabetic retinopathy. This not only eliminated potential human grading errors, but also minimised false-positive misclassification of true positive images arising from algorithmic subclinical detection of diabetic vascular changes [69], such as capillary damage [76, 77], which is known to exist prior to threshold glycated haemoglobin indicators in both prediabetics [90] and diabetics [91].

However, these results may have limited external generalisability as they represent a single site with potential selection biases related to hypertensive status, diabetic status, demographics, and voluntary participation. The extent of verified retinal pathology in the wider clinical population is unknown, and the high capture rate of clinically significant pathology observed in this at-risk subset of false-positive results may not be broadly representative.

Conclusion

The deep learning models in this study captured high rates of clinically significant incidental pathology in the misclassified non-diabetic results studied, raising the possibility of immediate clinical use of false positives in broader (beyond diabetic) screening and in other (beyond hypertensive) populations.

In the quest for full-scope autonomous screening, current development combining disease-specific models is flawed by limitations of human biomarker knowledge and the inability to train for rare and novel diseases. Conceivably, incidental capture may approach full-scope disease detection, but the study found that more incidental disease was captured by less trained models, which better aligns with using self-supervised deep learning to expand biomarker knowledge and as an alternate route to achieve comprehensive autonomous retinal screening.

Summary

What was known before

  • Deep learning models can detect target retinal diseases earlier and more accurately than human experts.

  • Deep learning models are not trained to detect important incidental pathology found by human screening.

What this study adds

  • Targeted deep learning retinal analysis may capture high rates of clinically useful non-target pathology as false-positive results.

  • Development of self-supervised deep learning models is proposed as an alternate pathway to achieve comprehensive autonomous retinal screening.