Main

Disease diagnosis and management often require specialized equipment and trained medical professionals to interpret findings. For example, diabetic retinopathy (DR) screening programmes use either a fundoscope or a fundus camera to examine the posterior part of the eye (that is, the retinal fundus; Fig. 1a). Such examinations can reveal diabetes-related blood vessel compromise, such as microaneurysms. More recently, machine-learning-based assessments of retinal fundus photographs have been shown to detect signs of cardiovascular risk1,2, anaemia3, chronic kidney disease4 and other systemic parameters5. Despite the expanding diagnostic information that can be obtained from fundus photographs, the burden of costly specialized cameras, skilled imaging technicians and, oftentimes, mydriatic eye drops to dilate the patients’ pupils limits the use of these diagnostic techniques to eye clinics or primary care facilities with specialized equipment.

Fig. 1: Extracting insights from external photographs of the front of the eye.

a, Diabetes-related complications can be diagnosed by using specialized cameras to take fundus photographs, which visualize the posterior segment of the eye. By contrast, anterior imaging using a standard consumer-grade camera can reveal conditions affecting the eyelids, conjunctiva, cornea and lens. In this work, we show that external photographs of the eye can offer insights into diabetic retinal disease and detect poor blood sugar control. These images are shown for illustrative purposes only and do not necessarily belong to the same subject. b, Our external eye DLS was developed on data from California (CA, yellow) and evaluated on data from 18 other US states (light blue). c, AUCs for the four validation datasets. The prediction tasks shown are poor sugar control (HbA1c ≥9%), elevated lipids (total cholesterol (Tch) ≥240 mg dl−1, triglycerides (Tg) ≥200 mg dl−1), moderate+ DR, DME, VTDR and a positive control: cataract. Some targets were not available in validation sets C and D. Error bars are 95% CIs computed using the DeLong method. A graphic visualization of receiver operating characteristic (ROC) curves is shown in Extended Data Fig. 3, and Supplementary Table 1 contains numerical values of AUCs along with P values for improvement. d, Similar to c but for PPV. The thresholds used for the PPV are based on the 5% of patients with the highest predicted likelihood. Error bars are 95% bootstrap CIs. A graphic visualization of the PPV over multiple thresholds is in Extended Data Fig. 1. ᵃBaseline characteristics models for validation sets A and B include self-reported age, sex, race/ethnicity and years with diabetes and were trained on the training dataset. ᵇThe baseline characteristics models for validation sets C and D include self-reported age and sex and were trained directly on the respective sets due to large differences in patient population compared with the development set. Thus, these overestimate baseline performance. ᶜPre-specified primary prediction tasks.

By contrast, the anterior part of the eye can be readily imaged to produce external eye photographs (Fig. 1a) without specialized equipment or mydriatic eye drops. Indeed, external eye photographs have an associated Current Procedural Terminology code (92285) and are sometimes used to document progression of anterior eye conditions, such as eyelid abnormalities, corneal ulcers and cataracts. Interestingly, the anterior part of the eye is also known to manifest signs of disease beyond those affecting the front of the eye. For example, hypertension can cause recurrent subconjunctival haemorrhages (from broken blood vessels in the white part of the eye)6, and diabetes can affect pupil size7,8, conjunctival vessel calibre9 and tortuosity10.

Here we hypothesized that deep learning can more precisely detect disease presence from external eye photographs and potentially extract information that can help in identifying high-risk individuals for further testing or follow-up. To test this hypothesis, we leveraged de-identified data from tele-ophthalmology screening programmes where important parameters of systemic health and diabetic eye care were available, including glycated haemoglobin (HbA1c, a marker of glucose control), total cholesterol, triglycerides and diabetic retinal disease diagnoses (DR, diabetic macular oedema (edema) (DME) and vision-threatening DR (VTDR, a combination of DR and DME)). Our results indicate that external eye photographs can indeed reveal signs of systemic and retinal disease and merit further study.

Results

We developed a deep-learning system (DLS) using external eye images taken with fundus cameras (Fig. 1a) from 145,832 patients with diabetes screened for DR at 301 sites in California, USA (Fig. 1b). To evaluate the DLS, we used four validation datasets (Table 1). The first validation set (‘A’) included 27,415 patients who underwent DR screening at 186 sites across 18 US states without pupillary dilation (that is, without use of mydriatic eye drops); the second validation set (‘B’) included 5,058 patients at 59 sites across 14 US states with pupillary dilation. The third validation set (‘C’) included 10,402 patients who underwent mydriatic DR screening at the Atlanta Veterans Affairs (VA) hospital in Georgia, USA as part of the diabetic teleretinal screening programme. The final validation set (‘D’) included 6,266 patients from the Technology-based Eye Care Services (TECS) programme in Georgia, USA. In this programme, screening was mydriatic and located in primary care clinics surrounding the Atlanta VA hospital, and the screened population included patients both with and without diabetes. The pre-specified primary analyses included four predictions: poor blood glucose control (defined as HbA1c ≥9% by the Healthcare Effectiveness Data and Information Set (HEDIS11)), moderate-or-worse (‘moderate+’) DR, DME and VTDR. Cataract detection was a fifth prediction task used as a positive control. In addition, we present exploratory analyses for predictions of elevated total cholesterol and triglycerides relative to several clinical cut-offs.

Table 1 Characteristics of the datasets

We first evaluated the DLS for its ability to detect HbA1c ≥9%. Across the four validation sets A to D, the DLS achieved an area under the receiver operating characteristic curve (AUC) of 70.0%, 73.4%, 69.3% and 67.6%, respectively (Fig. 1c and Supplementary Table 1; full ROC curves shown in Extended Data Fig. 3). For comparison, logistic regression models using baseline characteristics (age, sex, race/ethnicity, years with diabetes on validation sets A and B, age and sex on validation sets C and D) achieved AUCs of 64.8%, 66.5%, 65.2% and 59.8%, respectively (P < 0.001 for comparing the DLS with baseline characteristics alone in all four validation sets). The AUC trends for predicting other thresholds (HbA1c ≥7% and ≥8%) were similar and are presented in Supplementary Table 1. The positive predictive values (PPVs) for HbA1c ≥9% exceeded 69% for the 5% of patients with the highest risk predicted by the DLS, and the negative predictive values (NPVs) exceeded 88% for the lowest-risk 5% in validation set A (Fig. 1d and Extended Data Figs. 1a and 2a).
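The two summary metrics used above can be illustrated with a minimal Python sketch. It is not the study's evaluation code: the function names and the toy labels and scores are hypothetical, the AUC is computed with the pairwise (Mann-Whitney) definition rather than a library routine, and the DeLong confidence intervals are not reproduced.

```python
from typing import Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ppv_npv_at_tails(
    labels: Sequence[int], scores: Sequence[float], fraction: float = 0.05
) -> Tuple[float, float]:
    """PPV among the highest-scoring `fraction` and NPV among the lowest-scoring `fraction`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending by score
    k = max(1, int(round(fraction * len(scores))))
    lowest, highest = order[:k], order[-k:]
    ppv = sum(labels[i] for i in highest) / k
    npv = sum(1 - labels[i] for i in lowest) / k
    return ppv, npv


# Hypothetical toy cohort of 20 patients (1 = HbA1c >= 9%), scores increasing with index.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
toy_scores = [i / 20 for i in range(20)]
toy_auc = auc(toy_labels, toy_scores)
toy_ppv, toy_npv = ppv_npv_at_tails(toy_labels, toy_scores, fraction=0.5)
```

The pairwise loop is quadratic in cohort size, so it is suitable only for illustration; a rank-based implementation would be preferable at the scale of the validation sets.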

Next, we evaluated the DLS for detecting diabetic retinal diseases (as graded by certified teleretinal experts using fundus photographs; see ‘Labels’ section in Methods). The AUCs for the three conditions (moderate+ DR, DME and VTDR) were 75.0%, 77.9% and 79.2%, respectively, in validation set A. On dilated pupils, the AUCs were consistently 5–10% higher, at 84.0%, 84.7% and 86.7% in validation set B, and more similar to validation set A, at 76.8%, 79.5% and 81.1%, in validation set C (which was from a different patient population). Using only baseline characteristics resulted in AUCs that were 4–10% lower in validation sets A and B and 13–24% lower in validation set C (P < 0.001 for comparing the DLS with baseline characteristics alone for all comparisons). The trends for other DR thresholds (‘mild+’ and ‘severe+’) were similar and are presented in Supplementary Table 1. The PPV and NPV for the various diabetic retinal diseases similarly indicated the ability to identify patients at either very high or very low likelihood of these conditions (Fig. 1d and Extended Data Figs. 1d–f and 2d–f).
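The PPV error bars in Fig. 1d are described as 95% bootstrap CIs. A minimal sketch of a percentile bootstrap for the top-5% PPV is given below. Several choices here are assumptions rather than details taken from the study: resampling is done per example (not per patient or site), the top-5% cut is recomputed within every resample, and the function names and defaults are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def top_fraction_ppv(
    labels: Sequence[int], scores: Sequence[float], fraction: float = 0.05
) -> float:
    """Fraction of true positives among the highest-scoring `fraction` of examples."""
    k = max(1, int(round(fraction * len(scores))))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / k


def bootstrap_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    fraction: float = 0.05,
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the top-fraction PPV (assumed resampling scheme)."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    for _ in range(n_boot):
        # Resample examples with replacement and recompute the statistic.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            top_fraction_ppv([labels[i] for i in idx], [scores[i] for i in idx], fraction)
        )
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

With few positives in the top 5%, the resulting interval is wide, which is consistent with the broad PPV ranges reported for the rarer conditions.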

Finally, we evaluated the DLS for detecting elevations in available elements of the lipid panel: total cholesterol and triglycerides. Across the four validation datasets, the AUCs for total cholesterol ≥240 mg dl−1 were 59.1% (A), 57.9% (B), 62.3% (C) and 58.1% (D), and for triglycerides the AUCs for ≥200 mg dl−1 were 63.6% (A), 62.7% (B), 66.3% (C) and 67.1% (D) (Supplementary Table 1), with similar trends for other cut-offs (total cholesterol ≥200 mg dl−1, triglycerides ≥150 mg dl−1 and ≥500 mg dl−1). For this set of predictions, the DLS did not consistently outperform the baseline characteristics (see Tables 2–5 for complementary adjusted analyses).

Adjusted analyses

To verify that the DLS predictions were not solely mediated by baseline characteristics as confounders, we examined the odds ratios of the DLS predictions adjusted for baseline characteristics (Tables 2–5). For all primary prediction tasks in all four validation datasets, DLS predictions (standardized to zero mean and unit variance for interpretability of coefficients) had statistically significant adjusted odds ratios (P < 0.001). The adjusted odds ratios ranged from 1.5 to 2.0 among tasks for validation set A, from 2.0 to 3.1 for validation set B and from 1.6 to 2.0 for validation set C, and was 2.0 for validation set D. For elevated lipids, despite the fact that the DLS did not achieve higher AUC compared with the baseline model, the adjusted odds ratios were statistically significant (P < 0.05) in all cases, except where there were fewer than 50 positives (triglycerides ≥500 mg dl−1, the only example among 20 analyses).
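One way to read these adjusted odds ratios is through a standard multivariable logistic regression. The form below is a sketch under that assumption; the symbols are introduced here for illustration, and the exact covariate coding used in the study is not specified in this section.

```latex
\[
\operatorname{logit}\Pr(y = 1 \mid z, \mathbf{x})
  = \beta_0 + \beta_{\mathrm{DLS}}\, z + \boldsymbol{\gamma}^{\top}\mathbf{x},
\qquad
z = \frac{s - \bar{s}}{\sigma_s},
\qquad
\mathrm{OR}_{\mathrm{adj}} = e^{\beta_{\mathrm{DLS}}}
\]
```

Here $s$ is the raw DLS output, $\bar{s}$ and $\sigma_s$ are its mean and standard deviation, and $\mathbf{x}$ collects the baseline characteristics. Under this reading, an adjusted odds ratio of 2.0 means that a one-standard-deviation increase in the DLS prediction doubles the odds of the condition when the baseline characteristics are held fixed.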

Table 2 Adjusted analysis for validation set A
Table 3 Adjusted analysis for validation set B
Table 4 Adjusted analysis for validation set C
Table 5 Adjusted analysis for validation set D

Explainability analysis

To better understand how the DLS was able to detect these diseases without looking inside the eye at the retina, we next conducted several explainability experiments. Predicting cataract presence acted as a positive control because it manifests as media opacity (‘cloudiness’ of the lens) and, thus, is generally visible from external eye images. The AUC of the DLS for predicting cataract was substantially higher than that for predicting the other targets, at 86.4% and 93.2% for validation sets A and B, respectively (labels were not available for sets C and D) (Supplementary Table 1).

In ablation analysis, we removed portions of each image, trained another DLS on such images and then evaluated this DLS (Methods)3. The central region of the image was more important than the outer rim for all predictions except for lipids, with the biggest delta being our positive control, cataract (Fig. 2a,b). For lipids predictions, the central region and outer rim appeared equally important.
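The two masking schemes can be sketched as follows. This is an illustrative construction only: the function names are hypothetical, the image is assumed to be square, and the disc radius is chosen so that the visible area matches a requested pixel fraction, which may differ from the procedure described in Methods.

```python
import math
from typing import List, Sequence

Mask = List[List[bool]]


def centre_mask(size: int, visible_fraction: float) -> Mask:
    """True inside a centred disc covering about `visible_fraction` of a size x size image.

    The disc fits inside the square only for fractions up to pi/4 (about 0.785);
    beyond that it is clipped and the visible area falls short of the request.
    """
    radius_sq = visible_fraction * size * size / math.pi
    c = (size - 1) / 2.0  # geometric centre in pixel coordinates
    return [
        [(x - c) ** 2 + (y - c) ** 2 <= radius_sq for x in range(size)]
        for y in range(size)
    ]


def rim_mask(size: int, visible_fraction: float) -> Mask:
    """True on the peripheral rim, i.e. the complement of a centred disc."""
    inner = centre_mask(size, 1.0 - visible_fraction)
    return [[not v for v in row] for row in inner]


def apply_mask(image: Sequence[Sequence[float]], mask: Mask, fill: float = 0.0):
    """Keep pixels where the mask is True and replace the rest with `fill`."""
    return [
        [px if keep else fill for px, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

Because a separate DLS is trained on each masked variant, the comparison measures how much predictive signal each region carries rather than how a single trained model reacts to occlusion.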

Fig. 2: Importance of different regions of the image and impact of image resolution.

a, Sample images with different parts of the image visible. Top: only a circle centred around the centre is visible. Bottom: only the peripheral rim is visible. b, AUCs as a function of the fraction of input pixels visible when the DLS sees only the centre (blue curve) or peripheral regions (green curve). A large delta between the blue and green curves (for example, for cataract) indicates that most of the information is in the region corresponding to the blue line. c, Sample images of different resolutions. Images were down-sampled to a certain size (for example, ‘5’ indicates 5 × 5 pixels across) and then up-sampled to the same original input size of 587 × 587 pixels to ensure that the network has the exact same capacity (Methods). d, AUCs corresponding to different input image resolutions. Error bars are 95% CIs computed using the DeLong method. ᵃPre-specified primary prediction tasks.

Next, we conducted saliency analysis using several gradient-based methods (Methods and Fig. 3). The systemic predictions (HbA1c and lipids) were the most distinct, with almost all of the attention on the nasal and temporal conjunctiva/sclera (that is, where conjunctival vessels tend to be prominent; Fig. 3b–d). By contrast, all three diabetic retinal disease predictions (moderate+ DR, DME and VTDR) focused on both the pupil and the nasal and temporal iris and conjunctiva (Fig. 3e–g). The cataract predictions focused on the pupil in the middle of the image (Fig. 3h). The results of additional saliency methods are presented in Extended Data Fig. 4a–g, with a quantitative summary of these trends in Extended Data Fig. 4h,i.

Fig. 3: Qualitative and quantitative saliency analysis illustrating the influence of various regions of the image towards the prediction.

a, Overview of anterior eye anatomy. b–h, Eye images from patients with HbA1c ≥9% (b), Tch ≥240 mg dl−1 (c), Tg ≥200 mg dl−1 (d), moderate+ DR (e), DME (f), VTDR (g), and cataract (h). Illustrations are the result of averaging images or saliency maps from 100 eyes, with laterality manually adjusted via horizontal flips to have the nasal side on the right (Methods). Left: the original averaged image. Right: a similarly averaged saliency from GradCAM. Saliency maps from other methods are shown in Extended Data Fig. 4a–g. ᵃPre-specified primary prediction tasks.

Because diabetes is known to affect pupil sizes7,12, and multiple attribution methods highlighted the pupil/iris region (Fig. 3 and Extended Data Fig. 4), we reasoned that the DLS could have learnt to perform these predictions via measuring pupil size. To test this hypothesis, we first trained a pupil/iris detection model using a subset of the development set (Methods), and we used this model to estimate the pupil size in the validation sets (Supplementary Information and Supplementary Figs. 1 and 2). To correct for differences in distance from the camera, we normalized the pupil size by representing the pupil radius as a fraction of the iris radius. The iris size is equivalent to the corneal diameter since the visible iris extends to the edge of the corneoscleral junction. Corneal diameter (or iris size here) is known to be relatively constant in the adult population13,14.
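The normalization described above amounts to a simple ratio. The sketch below uses a hypothetical function name and made-up pixel radii; it only illustrates why the ratio is insensitive to camera distance, assuming the detector returns pupil and iris radii in the same pixel units.

```python
def normalized_pupil_size(pupil_radius_px: float, iris_radius_px: float) -> float:
    """Pupil radius expressed as a fraction of the iris radius (dimensionless)."""
    if iris_radius_px <= 0:
        raise ValueError("iris radius must be positive")
    return pupil_radius_px / iris_radius_px


# The same eye photographed at two distances: both radii scale together,
# so the ratio is unchanged (hypothetical values).
near = normalized_pupil_size(pupil_radius_px=60.0, iris_radius_px=200.0)
far = normalized_pupil_size(pupil_radius_px=30.0, iris_radius_px=100.0)
```

The ratio is comparable across patients only to the extent that iris size is stable in adults, which is the assumption stated in the text.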

We first examined to what degree pupil size added useful information to these predictive tasks by inspecting the effect of its addition to the baseline characteristics model (Supplementary Table 2). There was generally little to no improvement in AUC of the baseline characteristics model (approximately 1% with overlapping confidence intervals (CIs)). The only exceptions were in validation set C, with 3–4% AUC improvements for moderate+ DR and VTDR. The improvement of the DLS over these augmented baseline characteristics models remained statistically significant in all cases (P < 0.001). Similarly, in an adjusted analysis including estimated pupil size, the external eye predictions retained statistically significant odds ratios (Supplementary Table 3).

Additionally, to determine whether the model’s ability to detect diabetic retinal disease could be explained by its ability to detect elevated HbA1c, we conducted an analysis adjusting for laboratory-measured HbA1c values (Supplementary Table 4). Even in this adjusted analysis, the odds ratios remained statistically significant, ranging from 1.4 to 3.2 on the primary predictions (P < 0.001 for moderate+ DR, DME and VTDR across all three validation datasets that contain DR labels).

Sensitivity and subgroup analysis

Because these external eye images were taken with fundus cameras, we next sought to examine whether lower-quality images would suffice. To simulate low-quality images, the input images were down-sampled to lower resolutions for both training and evaluation (Methods). The DLS performance decreased slightly as the input size decreased to 75 pixels and then substantially as the input size decreased below 35 pixels (Fig. 2c,d).
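The down-sample-then-up-sample procedure can be sketched in a few lines. This is an illustration on a square greyscale image stored as nested lists; the interpolation choices (box averaging when shrinking, nearest-neighbour when enlarging) and the function names are assumptions, not the settings reported in Methods.

```python
from typing import List

Image = List[List[float]]


def downsample_box(image: Image, out_size: int) -> Image:
    """Shrink a square image by averaging the source pixels that fall in each output cell."""
    n = len(image)
    if not 1 <= out_size <= n:
        raise ValueError("out_size must be between 1 and the input size")
    edges = [(i * n) // out_size for i in range(out_size + 1)]  # cell boundaries
    out: Image = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            block = [
                image[y][x]
                for y in range(edges[r], edges[r + 1])
                for x in range(edges[c], edges[c + 1])
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def upsample_nearest(image: Image, out_size: int) -> Image:
    """Enlarge a square image by repeating the nearest source pixel."""
    n = len(image)
    return [
        [image[(y * n) // out_size][(x * n) // out_size] for x in range(out_size)]
        for y in range(out_size)
    ]


def simulate_low_resolution(image: Image, low_size: int) -> Image:
    """Discard detail at `low_size` pixels across, then restore the original dimensions."""
    return upsample_nearest(downsample_box(image, low_size), len(image))
```

Restoring the original dimensions keeps the network input size fixed, so any change in AUC reflects the information lost in the image rather than a change in model capacity.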

Next, we conducted extensive subgroup analyses across multiple variables: demographic information (age, sex and race/ethnicity), presence of cataract, camera types (for a subset of data where this information was available) and pupil size in validation set A (Supplementary Tables 5–11). In brief, the DLS (and the baseline variables) remained predictive in all subgroups. The DLS consistently outperformed the baseline characteristics model, albeit with the difference not reaching statistical significance in some subgroups. The DLS AUC dropped by more than 5% in some subgroups: the triglycerides ≥200 and moderate+ DR targets in the Black subgroup; the VTDR target in the >10 years with diabetes subgroup; and the cataract target in the age >50 years, small pupil and camera type B subgroups. In all of these cases, the AUC of the baseline characteristics model decreased as well.

Validation set D had additional variables available: body mass index (BMI), diabetes status and presence of intraocular lens (IOL). We conducted subgroup analysis based on these variables for both the HbA1c ≥9% and ≥7% prediction tasks (the latter because of the relatively small number of positive examples for the former; Supplementary Tables 12 and 13). Across the different BMI groups, the DLS AUC dropped by at most 4% compared with the full dataset. In the non-diabetic group, as may be expected, there were few positives for these cut-offs and, hence, CIs are wide. Nevertheless, the DLS outperformed the baseline characteristics model (57.6% versus 53.7% for HbA1c ≥7%). Across the two IOL groups, AUC differed by at most 0.6% compared with the full dataset. Taking into account both HbA1c ≥9% and HbA1c ≥7%, we did not observe any trends towards worse performance with race/ethnicity in this dataset with a higher proportion of Black patients.

Discussion

Our results show that external images of the eye contain signals of diabetes-related retinal conditions, poor blood sugar control and elevated lipids. We accomplished this through the development of a DLS, which generalized to diverse patient populations (including race/ethnicity groups with differences in iris colour and the red reflex), different imaging protocols and devices from independent clinics in multiple US states. The results for our primary prediction targets (HbA1c ≥9% and presence of varying levels of diabetic retinal disease) were statistically significantly better compared with using demographic information and medical history (such as years with diabetes) alone and remained statistically significant after adjusting for multiple baseline characteristics and within numerous subgroups. Our exploratory results also show promise in extending this method to predicting elevated lipids and generalizing to a non-diabetic population.

The discovery that predictions about systemic parameters and diabetic retinal disease could be derived from external eye photography is surprising, since such images are primarily used to identify and monitor anterior eye conditions, such as eyelid and conjunctival malignancies, corneal infections and cataracts. Numerous studies have noted that conjunctival vessel changes (fewer, wider and less tortuous conjunctival vessels)9,10,15,16,17 are associated with duration of diabetes18,19 and severity of DR20,21, and elevated cholesterol levels and atherosclerosis have been linked to xanthelasmas (yellowish deposits under the eyelid skin)22. However, to our knowledge, there have been no large studies linking HbA1c or diabetic macular oedema to conjunctival vessel changes in diabetes. Furthermore, conjunctival vessel assessment for signs of diabetes or elevated lipids is not a common clinical practice due to the subjectivity and time-consuming nature of such an evaluation and the option of a more accurate and easier test for the clinician (HbA1c). We verified that these surprising results were reproducible and not an artefact of a single dataset or site via broad geographical validation across 18 US states.

Our evidence provides several hints as to how these predictions are possible. First, the ablation analysis indicates the centre of the image (pupil/lens, iris/cornea and conjunctiva/sclera) is substantially more important than the image periphery (for example, eyelids) for all predictions (Fig. 2a,b). Second, the saliency analysis similarly indicates that the DLS is most influenced by areas near the centre of the image. These included the pupil and the corneoscleral junction for diabetic retinal disease and both the nasal and temporal conjunctiva (where conjunctival blood vessels are often most prominent) for blood sugar control (HbA1c) (Fig. 3). In both analyses, the positive control (cataract, which is visible at the pupil/lens) provides a useful baseline attribution map for a prediction that is expected to focus exclusively on the pupil/lens. Moreover, the quantitative results where the DLS remained predictive within pupil size subgroups, after adjusting for pupil size, and compared with a baseline model augmented with pupil size, suggest that the performance of the DLS could not be explained by it relying on pupil size alone. Together, these data suggest that the DLS is leveraging both information in the pupil region, such as from the light reflecting from the retina (‘red reflex’), and information outside of the pupil/iris region, such as in the conjunctival vessels. For diabetic retinal disease predictions, we observed that DLS scores remained statistically significant even when adjusting for HbA1c, suggesting that the model is picking up on signals beyond elevated HbA1c to make these predictions. Finally, the performance of the DLS for HbA1c trended upwards for higher HbA1c cut-offs (AUC: 66.9% for ≥7% versus 69.1% for ≥8% versus 70.0% for ≥9% in validation set A), compared with a ‘flat’ trend of 65% for baseline characteristics. 
This ‘dose dependent’ trend cannot be explained by increased training data (because the number of examples of ≥9% is strictly less than ≥8%) and suggests that the features used by the DLS were related to the extent of elevated glucose levels. A similar trend of increasing DLS AUCs for higher cut-offs was observed for total cholesterol and triglycerides. Better scientific understanding of these predictions via systematic falsifying of hypotheses, and potentially enabling experts to perform the same predictions, will need to be addressed in future work.

Our technique has potential implications for the large and rapidly growing population of patients with diabetes because it does not, in principle, require specialized equipment. Specifically, screening for diabetes-related retinal disease requires fundoscopy or the use of a fundus camera to examine the back of the eye through the pupil. This limits disease screening and detection examinations to either eye clinics or store-and-forward teleretinal screening sites where fundus cameras are present—both of which require in-person visits. Indeed, poor access and cost contribute, in part, to low DR screening rates23. Similarly, HbA1c and lipid measurements require an in-person visit for a venous blood draw, which can be unpleasant for patients and have multiple potential side effects, including bleeding, bruising and nerve damage24. Our results suggest that signs of these conditions are extractable from a photograph of the front of the eye without pupil dilation via eye drops, as shown in validation set A. There appeared to be performance differences in the proposed model with versus without dilation, but the prevalence differences and the fact that baseline models (that do not depend on whether the eyes were dilated) shifted in the same direction suggest that a patient population difference was at least partially responsible. While the experiments in the paper did use external eye images captured by a fundus camera, we show that even low-resolution images of 75 × 75 pixels (which is 1% of the resolution of a basic ‘720p’ laptop webcam and 0.1% of the resolution of a standard 8-megapixel smartphone camera) result in adequate performance, suggesting that the resolution requirements for this technique can be easily met. 
While further work is needed to determine whether there are additional requirements for lighting, photography distance or angle, image stabilization, lens quality or sensor fidelity, we hope that disease detection techniques via external eye images can eventually be widely accessible to patients, whether in clinics, pharmacies or even at home. The AUCs in our experiments may not be adequate for a confirmatory diagnosis; for example, they are not expected to approach detection parity with the reference standard obtained via fundus photographs or blood test. However, they are comparable to other low-cost and accessible screening tools, such as the pre-diabetes risk score from the US Centers for Disease Control and Prevention25,26, which achieves an AUC of 0.79–0.83, with a PPV ranging from 5% to 13%. Because lack of access to and underuse of healthcare are associated with poor diabetic management27, we hope that the availability of easily accessible techniques such as ours can improve adherence to diabetic interventions, which we know to be cost saving or cost effective28. Finally, our exploratory analysis on validation set D shows promise in extending this approach to a more general, non-diabetic population.
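The resolution comparison for 75 × 75 pixel images can be checked with a few lines of arithmetic. The sketch assumes that ‘720p’ means a 1280 × 720 frame and that an 8-megapixel sensor has a nominal 8,000,000 pixels; the variable and function names are illustrative.

```python
def pixel_fraction(side_px: int, total_pixels: int) -> float:
    """Share of `total_pixels` occupied by a square image `side_px` pixels across."""
    return side_px * side_px / total_pixels


webcam_720p = 1280 * 720     # 921,600 pixels (assumed 720p frame size)
smartphone_8mp = 8_000_000   # nominal 8-megapixel sensor

vs_webcam = pixel_fraction(75, webcam_720p)    # 5,625 / 921,600
vs_phone = pixel_fraction(75, smartphone_8mp)  # 5,625 / 8,000,000
```

Under these assumptions the ratios come to roughly 0.6% and 0.07%, which agree with the quoted 1% and 0.1% at the order-of-magnitude level.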

The potential future use cases for easy identification and monitoring of high-risk diabetic patients are manifold. First, detecting patients with diabetes who have difficulty controlling their blood glucose (HbA1c ≥9%) may help reveal which patients are in need of further counselling, additional diabetic resources and medication changes. In our analysis, when the top 5% of patients with the highest predicted likelihood were examined, 50–75% may truly have HbA1c ≥9%. Similarly, our exploratory analysis on the non-diabetic subgroup of validation set D demonstrates that it may be possible to identify potentially asymptomatic patients at risk for diabetes and help determine which patients may benefit from a confirmatory blood test and early interventions, such as lifestyle counselling or medications. Second, identification of patients at risk for diabetic retinal disease can determine which patients may benefit from ophthalmology follow-up and targeted treatment to avoid diabetes-associated vision loss. In our analysis, if the top 5% of patients with the highest predicted likelihood of various severities of diabetic retinal disease were examined via fundus photographs, 20–65% could have vision-threatening diabetic retinal disease, and 20–90% could have moderate-or-worse diabetic retinal disease that warrants ophthalmology referral. Importantly, compliance with diabetic eye disease screening is estimated to be only 20–50% (refs. 29,30,31,32) due to factors such as transportation difficulties23. If our approach could be shown to generalize to images captured by non-specialized equipment, remote DR ‘pre-screening’ using this method may enable targeted support for high-risk patients. Conversely, patients at very low risk of developing diabetic retinal disease could potentially be screened less frequently. 
Additionally, detection of high blood lipids, or hyperlipidemia, from external eye photos allows for an easy, non-invasive approach for cardiovascular disease risk factor33 screening. Early identification of individuals with hyperlipidemia, who have a twofold increased risk of cardiovascular disease development34, could allow for earlier lifestyle counselling, medication intervention, additional disease screening and reduction in morbidity and mortality. Lastly, minimizing in-person visits to health clinics has become exceedingly important in preventing disease spread. During the current coronavirus disease 2019 (COVID-19) pandemic, many countries are actively encouraging ‘social’ (physical) distancing and deferment of non-urgent medical visits, including decreased visits for diabetes35. Similarly to how phone cameras36,37 can help identify digital biomarkers of diabetes38, external eye photo-based screening could allow for remote evaluation, potentially from home, and with common equipment, greatly minimizing risk to the individual and others.

This work establishes that signs of systemic disease can be identified from external eye images. These initial findings raise the tantalizing prospect that such external eye photographs may contain additional useful signals, both familiar and novel, related to other systemic conditions. Clinically, numerous external or anterior segment findings are correlated with or are the result of underlying systemic disease. Obstructive sleep apnoea is associated with floppy eyelid syndrome (easily everted eyelids) and resultant papillary conjunctivitis39,40. Thyroid disease can manifest specific ocular signs such as upper eyelid retraction, conjunctival injection (redness) and chemosis (swelling) and caruncular oedema (swelling of the small, pink nodule at the inner corner of the eye)41,42. Atherosclerosis has been linked with xanthelasmas, which are predictive of adverse cardiac outcomes43. Systemic hypertension, another high-morbidity disease, has been associated with conjunctival microangiopathy and correlates with time since hypertension diagnosis44. Beyond eyelid and conjunctival changes, deposition of calcium or uric acid in the cornea may signify derangements related to hyperparathyroidism, chronic renal failure and gout45,46,47. Previous work has also shown that the severity of corneal and conjunctival calcium deposition, as determined by external eye photos, is predictive of all-cause one-year mortality risk in haemodialysis patients48. Recently, hepatobiliary disease signals have been identified in slit-lamp photos of the eye49. These manifestations could be readily captured with external eye photography that documents the eyelids, conjunctiva and cornea, and further work examining these disease populations is needed to assess additional systemic disease prediction models. Similarly to how investigations into manifestations of systemic disease in the retina have been dubbed ‘oculomics’50, such analyses on external ocular images could be termed ‘exoclumics’.

Our study is not without limitations. First, we have limited data with which to understand if smartphone or webcam images are sufficient, because all images were taken by healthcare professionals using a table-top fundus camera. Although the resolution and actual image sensor (for example, Canon EOS-series cameras) were similar to consumer-grade cameras, these clinical-grade cameras provided chin stabilization that may have reduced motion blur and controlled the shooting angle and distance, good lighting via circular flash to reduce occlusion of features from shadows and greater magnification51. Similarly, despite only semi-standardized protocols, clinical professionals may have provided instructions to standardize gaze and improve image quality, and the clinical environment may have been darkened to enlarge pupil size. Additional work is needed to evaluate whether the DLS generalizes to other consumer-grade cameras and settings without modification or if additional data collection for further training is needed. Second, part of the model’s performance may be derived from detecting confounding factors not available in this study. For example, patients with diabetes are more likely to develop cataracts52 and other anterior segment signs of proliferative DR (such as rubeosis iridis53 or neovascular glaucoma54) than non-diabetics, although these conditions are not common and unlikely to explain all of the model’s performance. To control confounding effects from cataract, we showed that the image-based model performed better than the baseline even without the central, pupil-containing region, and we further verified our findings in a subgroup excluding patients labelled with cataract presence. However, cataract was not the primary screening goal in these datasets, and, therefore, their recall was probably imperfect; cataract type and severity were also unavailable. 
Other potentially relevant variables, such as BMI55 or use of related medications, were not consistently available and, thus, could not be used to develop baseline models or perform adjustment. Further study with verified and more granular incidental findings will help quantify the contribution from such potential confounding factors. Additionally, while we did present exploratory analysis showing promise in generalizing to non-diabetic populations, this will need to be evaluated further on a larger dataset, and the model may benefit from further training or refinement on this specific patient population (that is, those without diabetes). Lastly, other imaging modalities, such as slit-lamp photography, can reveal further signs of non-ophthalmic issues (hepatobiliary disease) from the eye49, and future work in understanding how to integrate these modalities or extracting similar information from external eye photographs is needed.

In addition to the above limitations, as this is an initial study examining the feasibility of extracting useful signals of disease from external eye photographs, future work will be needed to tackle additional challenges before clinical validation studies. First, use cases and target patient populations should be identified, and the models should be thoroughly tested to ensure adequate accuracy across subgroups. If needed, the model should be further refined or tuned. Similarly, operating points should be customized for the target patient population and use case. Finally, whether the availability of this technology would improve adherence to diabetic retinal disease screening and ultimately improve patient outcomes will also need to be studied56.

In summary, we demonstrated that external eye images can be used to detect the presence of several conditions, such as poor blood sugar control, elevated lipids and various diabetic retinal diseases. Further study is warranted to evaluate whether such a tool can be used in a home, pharmacy or primary care setting to improve disease screening and help with management of diabetes.

Methods

Datasets

This work used three teleretinal diabetic screening datasets and one more general eye care dataset in the United States (Table 1). The first two datasets were the California and non-California cohorts from EyePACS, a teleretinal screening service in the United States57. In the EyePACS datasets, patients presented to sites, such as primary care clinics, for diabetic retinal disease screening. Visits from unknown sites were excluded, and patients associated only with sites from unknown states were excluded. We leveraged data from California (the state with the most visits in EyePACS) for development and the remaining US states for validation purposes. Within the development dataset, data from Los Angeles County were further excluded as held-out data for another project. Validation set A consisted of all non-California visits without pupil dilation, and validation set B consisted of all non-California visits with pupil dilation. The third validation dataset (C) was from the Atlanta VA Healthcare system’s diabetic teleretinal screening programme58, which serves multiple community-based outpatient clinics in the greater Atlanta area. Most visits in validation set C involved pupil dilation. The final validation dataset (D) was from the TECS programme established in primary care clinics in Georgia associated with the Atlanta VA Healthcare system’s catchment59. Ethics review and institutional review board exemption for this retrospective study on de-identified data were obtained via the Advarra institutional review board.

Imaging protocol

As part of the standard imaging protocol for DR screening at these sites, patients had photographs taken of the external eye for evaluation of the anterior segment (along with the retinal fundus photograph). At sites served by EyePACS, cameras (where this information was available) included Canon (CR1 and CR2), Topcon (NW200 and NW400), Zeiss Visucam, Optovue iCam and Centervue DRS. On the basis of available image Exif metadata, images were digitized using Canon EOS single-lens reflex cameras. The image protocol involved first positioning patients comfortably in front of the camera with their chin resting on the chin rest and their forehead an inch away from the forehead brace. The operator then took external images of the right eye and then the left eye, ensuring image clarity, focus and visibility of the entire eye, including the iris, pupil and sclera60. At sites served by the VA diabetic teleretinal screening programme, the cameras included Topcon NW8 and Topcon NW400. The acquisition protocol involved slightly distancing the patient from the fundus camera to capture a clear view of the external eye; further alignment was entrusted to the operator. The external eye photographs were intended to document findings such as cataracts and eyelid lesions. None of the external eye images was excluded for image quality or other reasons.

Labels

Diabetic retinal disease in EyePACS (development set and validation sets A and B) was graded by EyePACS-certified graders using a modified Early Treatment Diabetic Retinopathy Study grading protocol61,62. Graders evaluated three-field retinal fundus photographs (nasal, primary and temporal), and the absence or presence of lesions was mapped63 to the five International Clinical Disease Severity Scale (ICDR) DR levels (no DR, mild, moderate, severe and proliferative DR) and to DME. VTDR was defined as severe-or-worse DR or DME. Diabetic retinal disease in the VA dataset (validation set C) was graded by ophthalmologists at the VA using the same protocol as EyePACS, except that only macular-centred and disc-centred fundus photographs were available. In all datasets, hard exudates were used as a surrogate for the presence of DME.
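The derived labels described above can be sketched as follows. This is a minimal illustration and not the study's code: the enum, the function names and the integer coding of the five DR levels are assumptions introduced here.

```python
from enum import IntEnum


class DRLevel(IntEnum):
    """The five ICDR DR levels, ordered by severity (integer coding is an assumption)."""
    NO_DR = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3
    PROLIFERATIVE = 4


def dme_from_hard_exudates(hard_exudates_present: bool) -> bool:
    """Hard exudates are used as a surrogate for the presence of DME."""
    return bool(hard_exudates_present)


def is_moderate_plus(level: DRLevel) -> bool:
    """Moderate+ DR: moderate, severe or proliferative DR."""
    return level >= DRLevel.MODERATE


def is_vtdr(level: DRLevel, dme: bool) -> bool:
    """VTDR: severe-or-worse DR, or DME."""
    return level >= DRLevel.SEVERE or dme
```

For example, an eye with mild DR and hard exudates would count as VTDR through the DME criterion, whereas an eye with moderate DR and no hard exudates would not.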

Baseline characteristics, including demographic information, were patient reported and provided by each site. Blood lab measurements (HbA1c, total cholesterol and triglycerides) were extracted from the patients’ medical records and could be from prior visits. For the VA datasets, labs from more than 90 days away from the date of the external eye photo were discarded; for EyePACS, this was not possible due to lack of precise dates. Diabetes status for the VA TECS dataset was determined from both the International Classification of Diseases codes (ICD-10 codes E09–E11 and E13 and ICD-9 codes 250 and 362.0) available in their medical records and notes from their eye care visit (for example, if the patient self-reported as being diagnosed with DR, they were labelled as diabetic because the DR diagnosis requires a diabetes diagnosis). Intraocular lens status in the VA TECS dataset was determined from the notes of the eye care visit. Cataract presence in the EyePACS images was indicated by the EyePACS graders but was not available in the VA datasets.

Pupil size estimation

To better understand whether pupil size was associated with our results, we obtained segmentations of both pupil and iris size. To do so, 12 ophthalmologist graders evaluated a total of 5,000 external eye images. If the pupil and iris were distinct enough to delineate the borders accurately, the graders drew ellipses around both the iris and pupil (Supplementary Fig. 1). This was performed for a subset of 4,000 randomly selected images in the development set, another 500 images in validation sets A and B combined and 500 images in validation set C. We then trained a model to segment the iris and pupil using the labelled development set images and evaluated the accuracy in the validation sets (Supplementary Note 1 and Supplementary Fig. 2). Our pupil size analyses were based on running this model across all images.

DLS development

The DLS takes an external eye photograph as input and was trained to provide predictions about both systemic parameters (from labs: HbA1c, total cholesterol and triglycerides) and diabetic retinal disease status (graded via colour fundus photographs). The DLS does not ‘see’ the colour fundus photographs, whether during training or evaluation. To develop the external eye DLS, we split the development dataset into training and tuning sets in a 7:1 ratio, while ensuring that each site was in only one split. All visits for all patients were used for training or tuning. A multi-task learning approach was used to train a single network to predict all tasks (that is, one ‘head’ per task). Specifically, all heads adopted a classification setup (with cross-entropy loss) to ensure that all losses were in comparable units. For HbA1c and lipids, we opted for a classification instead of regression setup, because experiments in the tune set found a 1–2% performance improvement in AUC by adopting a classification approach relative to a regression approach (and using the predictions as a proxy for the predicted likelihood of each binary outcome). Specifically, HbA1c was trained as three binary classification heads with thresholds at 7%, 8% and 9%; total cholesterol was trained as two binary classification heads with thresholds at 200 mg dl−1 and 240 mg dl−1; and triglycerides were trained as three binary classification heads with thresholds at 150 mg dl−1, 200 mg dl−1 and 500 mg dl−1. DR was discretized into the five ICDR categories; DME was binary (present/absent); VTDR was binary (present/absent) and cataract was binary (present/absent). 
Some additional ‘auxiliary’ heads were included: age was discretized into seven categories: (0, 30], (30, 40], (40, 50], (50, 60], (60, 70], (70, 80] and (80, 90] years; race/ethnicity was categorized into Hispanic, White, Black, Asian and Pacific Islander, Native American and others; sex was represented as male and female; and years with diabetes was discretized into five categories: (0, 1.5], (1.5, 5.0], (5.0, 10.5], (10.5, 15.5] and (15.5, infinity) years. Because not every label was available for all of these heads, the losses were propagated only for the relevant heads for each example.
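To illustrate the label setup, the sketch below turns a lab value into one binary label per threshold and computes a cross-entropy loss that ignores heads whose label is missing. It is a NumPy illustration under stated assumptions, not the training code: the head names, the use of NaN to mark a missing label, the "greater than or equal to" direction of each threshold and the averaging over available heads are all choices made here.

```python
import numpy as np

# Thresholds as stated in the text; the dictionary keys are hypothetical names.
THRESHOLDS = {
    "hba1c": [7.0, 8.0, 9.0],                # %
    "total_cholesterol": [200.0, 240.0],     # mg/dl
    "triglycerides": [150.0, 200.0, 500.0],  # mg/dl
}


def binarize(value, thresholds):
    """One binary label per threshold (1.0 if value >= threshold), NaN if the lab is missing."""
    if value is None:
        return np.full(len(thresholds), np.nan)
    return (value >= np.asarray(thresholds)).astype(float)


def masked_binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over the heads that have a label; missing heads add nothing."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    available = ~np.isnan(labels)
    if not available.any():
        return 0.0
    y, p = labels[available], probs[available]
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))
```

With this setup, an HbA1c of 8.5% yields the labels 1, 1 and 0 for the 7%, 8% and 9% heads, and an example with no lipid measurements contributes no loss to the lipid heads.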

We trained five models based on the Inception-v364 architecture using hyperparameters summarized in Supplementary Table 14, and the predicted likelihood was averaged unless otherwise noted. Within the hyperparameter search space detailed in Supplementary Table 14, 100 random trials were conducted, and the top 5 models were selected. During each training run, the tune set AUC for HbA1c ≥7% was the first to decline after reaching the peak, so early stopping was based on the tune set AUC of this task.
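The selection and averaging steps can be sketched as below. The text does not state which metric ranked the 100 random trials, so ranking by a single tune-set AUC per trial is an assumption, as are the function names; early stopping is represented simply as choosing the checkpoint with the peak tune-set AUC.

```python
import numpy as np


def select_top_k(tune_aucs, k=5):
    """Indices of the k trials with the highest tune-set AUC (ranking metric is an assumption)."""
    order = np.argsort(tune_aucs)[::-1]
    return order[:k].tolist()


def best_checkpoint(auc_history):
    """Index of the checkpoint at which the monitored tune-set AUC peaked."""
    return int(np.argmax(auc_history))


def ensemble_average(predictions):
    """Average predicted likelihoods across models; input shape (n_models, n_examples)."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)
```

Averaging likelihoods in this way keeps the ensemble output on the same 0 to 1 scale as the individual models, so operating points can be chosen on the averaged score directly.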

DLS evaluation

For evaluation, we selected a single visit per patient at random. This resulted in 27,415 visits for validation set A, 5,058 visits for set B and 10,402 visits for set C. We preprocessed the external eye photographs in the same manner as in previous work for colour fundus photographs1,65. Some of the prediction targets (described below) were on a per-visit basis (for example, HbA1c has a single value per visit per patient), and so the predictions of both eyes were averaged for evaluation purposes. For prediction targets that were specifically for each eye (for example, eye diseases that affect each eye individually), we randomly selected one eye per visit during evaluation. The exact numbers of data points used for each prediction task are presented in Supplementary Table 1.
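The sampling and aggregation rules above might look like the following in code. This is a sketch only: the record fields (`patient_id`, `visit_id`, `score`) and the fixed random seed are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def sample_one_visit_per_patient(records, seed=0):
    """Keep the records of one randomly chosen visit for each patient."""
    rng = random.Random(seed)
    visits = defaultdict(set)
    for rec in records:
        visits[rec["patient_id"]].add(rec["visit_id"])
    chosen = {pid: rng.choice(sorted(vs)) for pid, vs in sorted(visits.items())}
    return [rec for rec in records if rec["visit_id"] == chosen[rec["patient_id"]]]


def per_visit_score(eye_records):
    """Average the predictions of the available eyes (per-visit targets such as HbA1c)."""
    scores = [rec["score"] for rec in eye_records]
    return sum(scores) / len(scores)


def per_eye_record(eye_records, rng):
    """Pick one eye at random (per-eye targets such as DR grade)."""
    return rng.choice(eye_records)
```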

For validation sets A and B, visits without dilation status information were excluded (9% of the total visits). A small number of patients had both visits with dilation and visits without dilation and were included in both datasets (n = 497).

Baseline models for comparisons

Models leveraging baseline characteristics were used as a baseline and were trained using logistic regression with class-balanced weighting and the default L2 regularization (with the hyperparameter C = 1.0) in the scikit-learn Python library. For validation sets A and B, the input baseline characteristics were self-reported variables: age, sex, race/ethnicity and years with diabetes. These baseline models were trained using only undilated or dilated visits, respectively, because using the entire training dataset did not improve tune set performance. This also kept the development process of the baseline model consistent with that of the DLS. For validation sets C and D, because race/ethnicity and years with diabetes were not available, only age and sex were used as input variables. These baseline models were trained using the respective validation sets due to the large differences in patient population (mostly male and older and enriched for Black and White patients) and differences in available baseline characteristics. This overestimates the baseline model’s performance and underestimates the improvement (delta) of our DLS.
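A baseline of this kind can be sketched with scikit-learn as follows. The settings match those named in the text (class-balanced weighting, default L2 regularization, C = 1.0); the synthetic data, the feature encoding (standardized age and a binary sex indicator) and the function name are assumptions made purely for illustration. Categorical inputs such as race/ethnicity would additionally need an encoding step, for example one-hot encoding.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_baseline(features, labels):
    """Logistic regression with balanced class weights and the default L2 penalty (C = 1.0)."""
    model = LogisticRegression(C=1.0, class_weight="balanced")
    model.fit(features, labels)
    return model


# Synthetic, illustrative data only: the outcome is made to depend on age.
age = np.linspace(30.0, 90.0, 200)
sex = np.arange(200) % 2
X = np.column_stack([(age - age.mean()) / age.std(), sex])
y = (age > 60.0).astype(int)

baseline = fit_baseline(X, y)
baseline_scores = baseline.predict_proba(X)[:, 1]  # predicted likelihood of the positive class
```

The resulting scores play the same role as the DLS predicted likelihoods, so both can be passed to the same AUC computation.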

Statistical analysis

We evaluated all results using the AUC, expressed as a percentage (that is, 0–100%, where 50% is random performance). To evaluate the superiority of DLS predictions compared with the baseline model predictions in each dataset, we used the DeLong method66. The superiority analysis on four tasks (HbA1c ≥9%, moderate+ DR, DME and VTDR) was pre-specified and documented as primary analyses before applying the DLS to the validation sets. Alpha was adjusted using the Bonferroni method for multiple comparison correction (α = 0.025 for one-sided superiority test, divided by four tasks = 0.00625).
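The arithmetic of the Bonferroni adjustment, and the AUC expressed as a percentage, can be sketched as follows. The AUC here is computed with the rank-based (Mann-Whitney) formulation for illustration; the DeLong test itself is not implemented in this sketch, and the function names are assumptions.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def auc_percent(scores, labels):
    """AUC as a percentage: the share of (positive, negative) pairs ranked correctly, ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 100.0 * wins / (len(pos) * len(neg))


# One-sided alpha of 0.025 shared across the four primary tasks.
alpha_per_task = bonferroni_alpha(0.025, 4)  # 0.00625
```

Under this correction, a primary task would be declared superior only if its one-sided P value fell below 0.00625.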

Pre-specified secondary analyses included additional evaluation metrics (sensitivity, specificity, PPV and NPV), additional related prediction tasks (cataract, HbA1c, total cholesterol, triglycerides, mild+ and severe+ DR), subgroup analysis (presence of cataract, pupil size, HbA1c stratification, demographic groups based on age, sex and race/ethnicity), adjusted analysis, explainability analysis and sensitivity analysis for image resolution and dilation status. Adjusted analyses leveraged logistic regression models fit using the statsmodels library (v0.12.1). For subgroup analysis, when a patient had multiple visits that satisfied a filtering criterion, one visit was chosen randomly. Finally, exploratory analyses were conducted for additional subgroups in validation set D (diabetes status, BMI and IOL).

Ablation analysis

We evaluated the dependence of DLS performance on the visibility of different regions of the image. Because both the images and ocular anatomy are ‘circular’ (that is, roughly radially symmetric), and most pupils are centred, we conducted this ‘image visibility’ analysis based on concentric circles. Two types of masking were examined: keeping only the centre visible and keeping the peripheral rim visible. At various degrees of masking, the two types of masking were compared while controlling for the number of visible pixels. For each condition, the same mask was applied during both training and evaluation.
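One way to construct the two mask types with a matched number of visible pixels is sketched below. It assumes a square image and measures the visible fraction relative to the inscribed circle, which is a choice made here and is not stated in the text; the function names are likewise illustrative. A centre disc of radius sqrt(f) and a rim from sqrt(1 − f) to 1 both cover a fraction f of the unit circle's area.

```python
import numpy as np


def radial_distance(size):
    """Distance of each pixel centre from the image centre; the inscribed circle has radius 1."""
    coords = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    return np.sqrt(xx ** 2 + yy ** 2)


def centre_mask(size, visible_fraction):
    """Keep only a central disc covering `visible_fraction` of the inscribed circle."""
    return radial_distance(size) <= np.sqrt(visible_fraction)


def rim_mask(size, visible_fraction):
    """Keep only a peripheral ring covering the same fraction of the inscribed circle."""
    r = radial_distance(size)
    return (r > np.sqrt(1.0 - visible_fraction)) & (r <= 1.0)


def apply_mask(image, mask):
    """Zero out hidden pixels of an (H, W, C) image using an (H, W) boolean mask."""
    return image * mask[..., np.newaxis]
```

Because the two masks expose approximately the same number of pixels at each fraction, a performance difference between them would reflect where the signal lies rather than how much of the image is visible.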

Saliency analysis

From validation set A, 100 positive images with the highest predicted likelihoods were selected for saliency analysis for each task. Images were manually inspected to ensure correct vertical orientation and to orient the nasal aspect on the right. Three saliency methods were applied: GradCAM67, guided backprop68 and integrated gradients69, as implemented in the People+AI Research saliency library70.

Image resolution analysis

When simulating the low-resolution images, images were first down-sampled to the specified resolution using the area-based method (tf.image.resize with method=AREA). Next, to ensure that the DLS input size (and, thus, the network’s number of parameters and capacity) remained the same for a fair comparison, we up-sampled each image back to the original resolution (by bilinear interpolation with antialias using tf.image.resize with antialias=True). Down-sampling and up-sampling methods were chosen to produce the most visually appropriate images, blinded to the actual DLS results.
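The down-sample-then-up-sample procedure can be illustrated without TensorFlow by the NumPy stand-in below. This is not the tf.image.resize implementation used in the study: it handles only single-channel images and integer down-sampling factors, approximates area-based resizing by block averaging, and uses separable linear interpolation with edge clamping for the up-sampling step. All function names are assumptions.

```python
import numpy as np


def downsample_area(image, factor):
    """Area-style down-sampling by an integer factor: mean of each factor x factor block."""
    h, w = image.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by factor")
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def upsample_bilinear(image, out_h, out_w):
    """Separable linear interpolation with half-pixel alignment; edges are clamped."""
    in_h, in_w = image.shape
    ys = (np.arange(out_h) + 0.5) * in_h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * in_w / out_w - 0.5
    # Interpolate along the width for every row, then along the height for every column.
    rows = np.stack([np.interp(xs, np.arange(in_w), row) for row in image])
    return np.stack([np.interp(ys, np.arange(in_h), col) for col in rows.T], axis=1)


def simulate_low_resolution(image, factor):
    """Reduce the information content but return an image of the original size."""
    small = downsample_area(image, factor)
    return upsample_bilinear(small, *image.shape)
```

Returning the image to its original size means the network input shape is unchanged across resolutions, so only the information content of the image varies between conditions.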

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.