Ocular biomarkers: useful incidental findings by deep learning algorithms in fundus photographs

Background/Objectives: Artificial intelligence can assist with ocular image analysis for screening and diagnosis, but it is not yet capable of autonomous full-spectrum screening. Hypothetically, false-positive results may have unrealized screening potential arising from signals persisting despite training and/or ambiguous signals such as from biomarker overlap or high comorbidity. The study aimed to explore the potential to detect clinically useful incidental ocular biomarkers by screening fundus photographs of hypertensive adults using diabetic deep learning algorithms.
Subjects/Methods: Patients referred for treatment-resistant hypertension were imaged at a hospital unit in Perth, Australia, between 2016 and 2022. The same 45° colour fundus photograph selected for each of the 433 participants imaged was processed by three deep learning algorithms. Two expert retinal specialists graded all false-positive results for diabetic retinopathy in non-diabetic participants.
Results: Of the 29 non-diabetic participants misclassified as positive for diabetic retinopathy, 28 (97%) had clinically useful retinal biomarkers. The models designed to screen for fewer diseases captured more incidental disease. All three algorithms showed a positive correlation between severity of hypertensive retinopathy and misclassified diabetic retinopathy.
Conclusions: The results suggest that diabetic deep learning models may be responsive to hypertensive and other clinically useful retinal biomarkers within an at-risk, hypertensive cohort. Observing that models trained for fewer diseases captured more incidental pathology increases confidence in signalling hypotheses aligned with using self-supervised learning to develop autonomous comprehensive screening. Meanwhile, non-referable and false-positive outputs of other deep learning screening models could be explored for immediate clinical use in other populations.

extent of hypertensive retinal vascular remodelling estimates the health of the other target organs [39]. Unlike the management of diabetes mellitus, where glycated haemoglobin provides time-averaged blood glucose estimates, hypertension has no such time-averaged metric to compensate for the volatility of blood pressure. Thus, hypertensive ocular biomarkers are key indicators for non-invasive management and offer preclinical detection, as arteriolar narrowing occurs prior to clinical hypertension [40].
More than twice as prevalent as diabetes mellitus, hypertension, defined as office blood pressure levels above 140 mmHg systolic, 90 mmHg diastolic, or both [41], remains the leading modifiable risk factor for premature death worldwide [42], affecting 20% of the global adult population, with 46% undiagnosed and only 21% receiving effective treatment [43]. Interestingly, clinical standards of care call for ongoing screening and monitoring of diabetic retinas, whereas there are no such recommendations for hypertension beyond retinal screening at diagnosis [44].
Hypothetically, shared retinal biomarkers and/or high rates of hypertension in diabetic training images [64] could trigger false-positive results for deep learning algorithms. Additionally, or alternatively, training deep learning pattern recognition/discrimination for a specific disease is not error-free and may result in anomalous data signals triggering misclassification of not only hypertensive retinopathy but also other clinically useful untargeted disease. Deep learning pathways remain unknown, but the hypotheses suggest that the algorithms may be clinically useful beyond their intended targets.
The primary aim of this study was to explore the potential for detecting clinically useful incidental ocular biomarkers using diabetic deep learning algorithms to screen fundus photographs of hypertensive adults.

Research design
The study had a retrospective, observational design. Approval was obtained from the East Metro Health Service Ethics and Governance Unit to use images collected for EastMetro HREC RGS1040 - Retinal Imaging in Resistant Hypertension. The study adhered to the tenets of the Declaration of Helsinki for research involving human subjects, and all participants gave informed consent.

Participants
Participants were recruited from Dobney Hypertension Centre, a public hospital outpatient clinic in Western Australia specializing in resistant hypertension, defined as uncontrolled high blood pressure despite at least three antihypertensive medications including a diuretic [41]. All participants were confirmed to have hypertension at recruitment, but not necessarily resistant hypertension.
All patients attending Dobney Hypertension Centre between 18 January 2016 and 31 March 2022 were invited to participate in the Retinal Imaging in Resistant Hypertension study. Recruitment, data collection and processing have been described in detail previously [65,66]. In brief, patients were referred from primary care for diagnostic workup and clinical management of difficult-to-control hypertension. Patients consented to participate in a systematic prospective analysis to explore the association between blood pressure at presentation and retinal imaging parameters. Baseline clinical data collected from the patients included medical history, medication history, serum pathology, extensive blood pressure testing, and specific assessments of hypertension-mediated organ damage (including retinal imaging).
The participation rate for all new and returning attendees was above 90%. Of the 529 consented participants, 96 were excluded from the study for lack of imaging arising from technical issues or non-attendance.

Data collection
Image acquisition. Clinic staff collected 45° macula-centred, colour fundus photographs without mydriasis using a Canon CR-2 camera (Tokyo, Japan) along with OCT and OCTA imaging by Optovue Avanti XR (Fremont, California, USA). All images and data were de-identified before transfer to the Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia's national science agency.
Image selection. For each participant, the most recent processable higher-quality image capturing both the optic nerve and macula was selected. Processability was determined by two or more algorithms successfully processing the image. Two of the algorithms output a graded image or deemed the image ungradable, while the third algorithm graded all images along with a quality scale and a reliability threshold. Where no image met the processability criteria, the participant was removed from further assessment. Where more than one image met the criteria, the image with the algorithm-determined higher degree of pathology was selected. The same 45° colour fundus photograph selected for each participant was processed by three deep learning algorithms.
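The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FundusImage` record, its field names, and the numeric `pathology_score` are hypothetical, and the use of capture date as the final tie-break is one possible reading of the text rather than the study's documented procedure.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class FundusImage:
    # All field names are hypothetical; the paper does not describe a data schema.
    image_id: str
    captured: date
    processed_by: int     # how many of the three algorithms processed the image
    pathology_score: int  # algorithm-determined degree of pathology (higher = more)


def is_processable(img: FundusImage) -> bool:
    """An image is processable when two or more algorithms handled it."""
    return img.processed_by >= 2


def select_image(images: List[FundusImage]) -> Optional[FundusImage]:
    """Pick one image per participant, or None if none is processable.

    Assumed tie-break: higher pathology first, then the more recent capture.
    """
    candidates = [img for img in images if is_processable(img)]
    if not candidates:
        return None  # participant removed from further assessment
    return max(candidates, key=lambda img: (img.pathology_score, img.captured))


# Made-up example for a single participant.
images = [
    FundusImage("a", date(2019, 5, 1), processed_by=3, pathology_score=1),
    FundusImage("b", date(2021, 3, 9), processed_by=1, pathology_score=4),
    FundusImage("c", date(2020, 8, 2), processed_by=2, pathology_score=2),
]
chosen = select_image(images)
print(chosen.image_id if chosen else "excluded")
```

Image "b" has the highest pathology score but is dropped because only one algorithm processed it, which mirrors how the 27 insufficient-quality images left the sample before any grading took place.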
Algorithms. Algorithm 1, DR Grader (Perth, Australia), was approved in 2018 as a Class I medical device to detect only referable diabetic retinopathy, differentiating moderate and severe. Referable diabetic retinopathy is internationally recognised as more than mild diabetic retinopathy [67]. Verified on a data set of 193 images by two human experts for referable diabetic retinopathy, the algorithm identified 17 as positive including the two true positive cases, resulting in 100% capture of true positives and a positive predictive value of 12% [46].
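The verification figures quoted for Algorithm 1 follow directly from the counts given. The short check below assumes the expert grading is the reference standard and that no expert-confirmed case was missed (which is what "100% capture" implies); the function names are illustrative only.

```python
def positive_predictive_value(true_pos: int, flagged: int) -> float:
    """Share of algorithm-positive images that were truly positive."""
    return true_pos / flagged


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly positive images that the algorithm flagged."""
    return true_pos / (true_pos + false_neg)


flagged = 17    # images the algorithm called positive
true_pos = 2    # expert-confirmed referable cases among them
false_neg = 0   # assumption: both confirmed cases were flagged

ppv = positive_predictive_value(true_pos, flagged)
sens = sensitivity(true_pos, false_neg)
print(f"PPV = {ppv:.1%}, sensitivity = {sens:.0%}")
# prints "PPV = 11.8%, sensitivity = 100%"
```

The remaining 15 flagged images are the kind of false-positive output whose content this study set out to examine.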
Algorithm 2, RetCAD (Nijmegen, The Netherlands), was approved in 2020 as a Class IIa medical device for simultaneous detection of diabetic retinopathy and macular degeneration. The algorithm not only quantified all diabetic retinopathy and macular degeneration results, but also graded fundus photograph quality and estimated vertical cup-to-disc ratio. For all images, Algorithm 2 produced a contrast-enhanced image along with separate precise heatmaps of bright and red lesions. In 2022, in a real-world tertiary hospital screening setting, the algorithm processed 7195 images for referable diabetic retinopathy, resulting in 90.5% sensitivity and 97.1% specificity [68]. When mild diabetic retinopathy was included, sensitivity rose to 91.7% and specificity dropped to 90.9% [69].
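The trade-off reported for Algorithm 2 can be made explicit with simple arithmetic. The snippet below uses only the four percentages quoted above; the variable names are illustrative.

```python
# Reported performance of Algorithm 2 at two detection thresholds (percent).
referable_only = {"sensitivity": 90.5, "specificity": 97.1}
including_mild = {"sensitivity": 91.7, "specificity": 90.9}

# Change, in percentage points, when mild diabetic retinopathy is included.
sensitivity_gain = round(
    including_mild["sensitivity"] - referable_only["sensitivity"], 1
)
specificity_loss = round(
    referable_only["specificity"] - including_mild["specificity"], 1
)

print(f"sensitivity +{sensitivity_gain} points, specificity -{specificity_loss} points")
```

Including mild disease buys about 1.2 points of sensitivity at a cost of about 6.2 points of specificity, so most of the additional positives at the lower threshold are false positives for diabetic retinopathy.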
Algorithm 3, Eyetelligence (Melbourne, Australia), was approved in 2019 as a Class I medical device for three separate target diseases: diabetic retinopathy [47], macular degeneration [2], and glaucoma [70]. The algorithm categorised diabetic retinopathy into the four internationally recognised levels of diabetic retinopathy: mild, moderate, severe, and proliferative [67,71], with a sensitivity of 92.5% and a specificity of 98.5% [47]. Algorithm 3 also output a diffuse heatmap for referable diabetic retinopathy results only.
Grading of known false-positive results. Within the images identified by one or more of the algorithms as positive for diabetic retinopathy, 29 participants were verified as clinically non-diabetic, defined as the absence of a diabetic or pre-diabetic clinical diagnosis and a confirmed glycated haemoglobin result below 5.7 mmol/L. The 29 misclassified non-diabetic images were graded by two highly qualified retinal specialists with extensive experience in both clinical practice and research at the Lions Eye Institute, Perth. Both assessors independently identified the ocular anomalies visible in the sample. Where diagnostic ambiguities arose, the final determination was made by reviewing additional images, the fellow eye, medical history, heatmaps, OCT, and OCTA scans. The types and relative frequencies of ocular anomalies found in the 29 misclassified images were then recorded as the primary outcome for each algorithm.

Sample derivation
Of the 433 participants imaged, 27 images (6%) were of insufficient quality for reliable assessment by two or more algorithms. Of the 406 assessable images, 251 (62%) returned negative results for all target anomalies across all three algorithms.
Of the 155 images returning positive results, 56 were positive for macular degeneration or glaucoma and 99 (63.9%) were flagged as positive for diabetic retinopathy. Of the 99 images, 14 (14%) were flagged by all three algorithms, 28 (28%) were flagged by two algorithms, and 57 (58%) were flagged by one algorithm.
Of the 99 participants with positive diabetic retinopathy results, 48 had a diagnosis of diabetes mellitus or had glycated haemoglobin levels in the diabetic range, and 22 were prediabetic based on glycated haemoglobin testing. The remaining 29 participants with positive diabetic retinopathy results (29% of the 99 reported as positive for diabetic retinopathy and 7% of the 406 assessable images) had no diabetic or glucose-control medication history and had glycated haemoglobin levels below 5.7 mmol/L.
Based on clinical evidence, these 29 participants were not at risk for diabetic retinopathy and their images were classified as false-positive results. Sixteen of the 29 were male, and ages ranged from 23 to 90 years (mean 63 ± 16 years).
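The sample derivation can be traced as a chain of subtractions. The sketch below re-derives each count and percentage from the numbers reported in this section; the variable names are illustrative and the script is a consistency check, not part of the study's analysis.

```python
def pct(part: int, whole: int, digits: int = 0) -> float:
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


consented = 529
imaged = consented - 96              # 433 after exclusions for lack of imaging
assessable = imaged - 27             # 406 after removing insufficient-quality images
any_positive = assessable - 251      # 155 with at least one positive result
dr_positive = any_positive - 56      # 99 flagged for diabetic retinopathy
false_positive = dr_positive - 48 - 22   # 29 after removing diabetic and prediabetic

# Number of images flagged by all three, two, or one algorithm.
flagged_by = {3: 14, 2: 28, 1: 57}
assert sum(flagged_by.values()) == dr_positive

print(imaged, assessable, any_positive, dr_positive, false_positive)
print(pct(27, imaged), pct(251, assessable), pct(dr_positive, any_positive, 1))
print(pct(false_positive, dr_positive), pct(false_positive, assessable))
```

Every reported figure reproduces: 433, 406, 155, 99 and 29 for the counts, and 6%, 62%, 63.9%, 29% and 7% for the percentages.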

Features of false-positive results
Table 1 shows the frequency and capture rates by algorithm for the features found by retinal experts in the 29 images misclassified as positive for diabetic retinopathy. With few exceptions, the capture rates for each feature were highest for the single-target Algorithm 1, lower for the dual-target Algorithm 2, and lowest for the triple-target Algorithm 3.
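A capture rate in this sense is the fraction of images showing a given feature that a given algorithm flagged. The sketch below shows the calculation on made-up image IDs; the sets are hypothetical and do not reproduce any row of Table 1.

```python
from typing import Dict, Set


def capture_rate(images_with_feature: Set[int], flagged: Set[int]) -> float:
    """Fraction of images showing a feature that an algorithm flagged."""
    if not images_with_feature:
        return 0.0
    return len(images_with_feature & flagged) / len(images_with_feature)


# Hypothetical data: image IDs in which experts saw one feature, and the
# image IDs each algorithm flagged as diabetic retinopathy.
feature_images = {1, 2, 3, 4}
flagged_by_algorithm: Dict[str, Set[int]] = {
    "Algorithm 1": {1, 2, 3, 7},
    "Algorithm 2": {2, 3},
    "Algorithm 3": {3},
}

rates = {
    name: capture_rate(feature_images, flagged)
    for name, flagged in flagged_by_algorithm.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

In this toy example the rates fall from 75% to 50% to 25% as the number of targets rises, the same direction as the trend described above.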
Supplementary Table 1 lists the features observed by human experts for each of the 29 images, in descending severity of algorithmic misclassification as diabetic retinopathy. All but one of the 29 images (97%) contained pathology for which clinical and/or lifestyle intervention is indicated. The exception, Image 21, showed an otherwise unremarkable tessellated fundus.
Three of the thirteen images flagged by Algorithm 2 had no observable blood or exudate and had no heatmap highlights. The other ten heatmaps precisely highlighted blood (red) and exudate (white), as shown by the example images in Fig. 1. The three heatmaps generated by Algorithm 3 did not coincide with prominent pathology in the images, an example of which is shown in Fig. 1.

Algorithm-determined referral and clinical utility
Unlike Algorithm 1, Algorithms 2 and 3 provided results for mild non-referable diabetic retinopathy. Despite low inter-algorithm agreement demonstrated by the overlap of the 29 false-positive results for diabetic retinopathy in Fig. 2, all but one (97%) of the referable results (misclassified as moderate or severe diabetic retinopathy) and all 12 (100%) of the non-referable results (misclassified as mild diabetic retinopathy) contained clinically significant pathology likely to benefit from intervention.

Capture counts and number of diseases targeted
The consistent trend of decreasing detection fractions with increasing number of algorithm targets, previously noted for the individual features listed in Table 1, is also evident in the datasets as shown in Fig. 3. Not only did the models with fewer targets output more false positives in the 29 misclassified non-diabetics and its 28-count referable subset, but they also had higher capture counts in the 99 positive diabetic retinopathy results and its 75-count referable subset. The apparent exception in Fig. 3, where the single-target algorithm flagged fewer images than the two-target algorithm, is unlikely to be a true exception to the overall trend, as the single-target algorithm was not designed to output non-referable results, which were included in the total counts of the other two algorithms.
Table 1 note: In column 1, the percentages in italics indicate the total number of images with the feature in column 2 as a percentage of the total 29 images. In columns 3-5, the percentages in italics indicate the number of images with the feature listed in column 2 detected by each model as a percentage of the total number with the feature in column 1.

Hypertensive retinopathy severity and misclassified diabetic retinopathy
Human-assessed hypertensive retinopathy features, artificially classified by the Keith-Wagener-Barker hypertensive retinopathy severity scale [72] increasing from 0 to 4, were plotted on the y-axis as a function of algorithmic diabetic retinopathy misclassification on the x-axis, to create Fig. 4. For each algorithm, Fig. 4 shows how both the frequency and severity of false-positive diabetic retinopathy results in our sample correlate with increasing hypertensive retinopathy features found in the images. Circle size represents the count at each intersection, with least squares regression lines showing the positive correlation for each algorithm.
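The construction of such a plot can be sketched as follows: count the images at each (x, y) intersection to size the circles, then fit an ordinary least squares line. The data points below are invented for illustration, and the integer coding of misclassification severity (1 = mild, 2 = moderate, 3 = severe) is an assumption; the study's actual values are those shown in Fig. 4.

```python
from collections import Counter
from typing import List, Tuple


def least_squares_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the ordinary least squares fit y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data for one algorithm.
dr_grade = [1, 1, 2, 2, 3, 3]   # assumed coding of misclassified DR severity
kwb_grade = [1, 2, 2, 3, 3, 4]  # Keith-Wagener-Barker grade, 0 to 4

# Circle size in the plot corresponds to the count at each intersection.
counts = Counter(zip(dr_grade, kwb_grade))

slope, intercept = least_squares_line(dr_grade, kwb_grade)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
# prints "slope = 1.00, intercept = 0.50"
```

A positive slope is what the text describes as a positive correlation; with so few discrete points per algorithm, the fitted line is descriptive rather than a basis for inference.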

DISCUSSION
The study demonstrated that existing deep learning models capture clinically important incidental pathology in fundus photographs misclassified as diabetic retinopathy. As noted, these findings specifically relate to a subset of false-positive results for diabetic retinopathy in an established hypertensive cohort. All three algorithms captured high rates (97%, 100%, and 100%) of clinically useful non-target disease, including all (100%) of the results classified as non-referable by the algorithms. The trend for algorithms targeting fewer diseases to capture more incidental pathology is consistent with the hypothesis that deep learning algorithms with less differential training may have data signals with broader anomaly detection potential. This boosts confidence in suggesting a pivot from the flawed disease-specific comprehensive screening trajectory to generalised anomaly detection with self-supervised deep learning.
Whether the correlation between the severity of hypertensive retinopathy present in the image and the number and severity of misclassifications as diabetic retinopathy for each algorithm arises from biomarker ambiguity, training image comorbidities, and/or untrained broader detection signals embedded in the deep learning pathway remains unknown, as the process is hidden. However, clinically, both retinopathies are significant, and comorbidity is common, so all positive results (referable, non-referable and false) have potential immediate utility.

Similarity of false-positive results in general populations
The few published deep learning false-positive results for diabetic retinopathy are based on general population verification datasets and list drusen, exudate, microaneurysm, macular degeneration, venous occlusion, myopic maculopathy, arteriovenous crossing changes, and "normal" [46,47,56–59]. These features are a subset of those found in the hypertensive sample and do not necessarily represent the full set of anomalies that may occur, for two reasons. First, the "normal" false-positive results may reflect deep learning anomaly detection beyond human observation, which would not be a false-positive error, but rather a subclinical detection and verification failure. Second, the selected anomalies are presented as plausible explanations for misclassification errors rather than a representation of the full set of features present in the false-positive images. Despite limited data, two algorithms reported referable pathology rates for their false-positive results of 80% [59] and 92% [47], suggesting that further investigation of false-positive results in wider populations may prove clinically useful.
Fig. 1 Example image with heatmaps. The original image in Fig. 1 is one of the two images misclassified as referable diabetic retinopathy by all three algorithms (A). Of the three heatmaps, two are from Algorithm 2 precisely highlighting blood (B) and exudate (C), while the heatmap from Algorithm 3 (D) did not match the prominent pathology in the image.

Differential diagnosis of hypertensive and diabetic retinopathies
Although algorithmic pathways are hidden, the sensitivity of diabetic algorithms to hypertensive retinopathy may arise from the artificial and incomplete human classification of training images due to ambiguous biomarkers between the retinopathies [75]. This ambiguity appears to increase for earlier shared biomarkers, such as capillary rarefaction and reduced vessel density as seen with OCTA [76,77], and is demonstrated algorithmically by a 6% specificity rise for Algorithm 2 when mild diabetic retinopathy detection was excluded [68,69]. The correlation found between the degree of hypertensive retinopathy and the severity of diabetic misclassification not only shows that deep learning models can capture useful non-target pathology as false-positive results, but also raises the possibility that human limitations in biomarker knowledge may lead to algorithmic misclassification, inhibiting target-specific algorithm development.

Expanding deep learning utility
Algorithm target-specificity hinders progress towards autonomous comprehensive screening not only from misclassification and missed incidental disease, but also from inter-algorithm inconsistency and ethical bias. Disease-specific deep learning models are trained to make innumerable comparisons to define volume-derived baseline data to which inputs may be matched. Inconsistency and ethical concerns arise from the variable and biased human selection of training data and supervision to define the hidden baseline data. Although saliency analysis, a technique of progressively sectioning the fundus to isolate anomalies, has been successful in refining heatmap outputs to theorise biomarkers used by an algorithm [78], and generative artificial intelligence modification of hypothesised features can further isolate potential data signals used to determine algorithm output [79], the baseline data is not exposed and outputs still vary between algorithms. This poses regulatory challenges, such as transparency of embedded normative data as applied to OCT [80–82], and raises ethical issues of input bias.
The consistency of the distribution of incidental findings found in false-positive results in this study and in screening diabetic and general populations with other deep learning models increases confidence in the hypothesis that deep learning models have data signals capable of broader anomaly detection. As deep learning models are based on pattern recognition and discrimination, training on the existing vast and diverse repository of healthy retinal images to define normative data may generate a more comprehensive, consistent, and generalisable model. To differentiate pathology from anomalies arising from lighting, positioning, media opacities, and artefacts [83] without artificial classification and human supervisory bias, an autoencoding technique, known as self-supervision, has demonstrated success in distinguishing not only artefacts and media opacities, but also tessellation [84]. A variety of self-supervised feature learning models have been developed for medical imaging [85], including one using OCT images that is capable of general anomaly detection without differential diagnostic output [86]. Such initial triage could be of immediate benefit for those with access to OCT, but exceptional access would be realised when sufficient datasets and biomarker knowledge are available to use external eye images [87,88]. Until then, self-supervised learning models to analyse fundus photographs could not only address scope, consistency, and ethical issues, but also provide opportunities for novel associations and biomarker discovery. A foundational model has now become publicly available, offering anomaly

Strengths and limitations
To strengthen internal validity, exclusively non-diabetic participants comprised the subset of false-positive results for diabetic retinopathy. This not only eliminated potential human grading errors, but also minimised false-positive misclassification of true positive images arising from algorithmic subclinical detection of diabetic vascular changes [69], such as capillary damage [76,77], which are known to exist prior to threshold glycated haemoglobin indicators in both prediabetics [90] and diabetics [91]. However, these results may have limited external generalisability as they represent a single site with potential selection biases related to hypertensive status, diabetic status, demographics, and voluntary participation. The extent of verified retinal pathology in the wider clinical population is unknown, and the high capture rate of clinically significant pathology observed in this at-risk subset of false-positive results may not be broadly representative.

CONCLUSION
The deep learning models in this study captured high rates of clinically significant incidental pathology in the misclassified non-diabetic results studied, raising the possibility of immediate clinical use of false positives in broader (beyond diabetic) screening and in other (beyond hypertensive) populations.
In the quest for full-scope autonomous screening, current development combining disease-specific models is flawed by limitations of human biomarker knowledge and the inability to train for rare and novel diseases. Conceivably, incidental capture may approach full-scope disease detection, but the study found that more incidental disease was captured by less trained models, which better aligns with using self-supervised deep learning to expand biomarker knowledge and as an alternate route to achieve comprehensive autonomous retinal screening.

What was known before
• Deep learning models can detect target retinal diseases earlier and more accurately than human experts.
• Deep learning models are not trained to detect important incidental pathology found by human screening.

What this study adds
• Targeted deep learning retinal analysis may capture high rates of clinically useful non-target pathology as false-positive results.
• Development of self-supervised deep learning models is proposed as an alternate pathway to achieve comprehensive autonomous retinal screening.

Fig. 2 All but 1 of the 29 misclassified images were clinically useful. Figure 2 shows the low inter-algorithm agreement and the distribution of referable and non-referable results within the 29 misclassified images. All but one of the 29 images had clinically useful biomarkers.

Fig. 3 Detection counts for single, double, and triple target algorithms. Figure 3 shows that the algorithms with fewer targets not only output more false positives in the 29 misclassified non-diabetic results and its 28-count referable subset, but also had higher capture counts in the 99 positive diabetic retinopathy results and its 75-count referable subset. The apparent exception in the 99 positive diabetic retinopathy results, where the single-target model did not return the most positive results, is unlikely to be a true exception, as the model was not designed to output non-referable results, which were included for the other two models.

Fig. 4 Positive correlation of retinopathy severity: hypertensive and misclassified diabetic. Figure 4 plots the severity of hypertensive retinopathy, according to the Keith-Wagener-Barker scale, increasing from 0 to 4, for the misclassified diabetic retinopathy results of each algorithm: 24 points for Algorithm 1, 13 points for Algorithm 2, and 8 points for Algorithm 3. The count at each intersection is represented by circle sizes and least squares regression lines show the positive correlation for all three algorithms.

Table 1. List of features in 29 images misclassified as diabetic retinopathy, with capture rates.