## Introduction

Diabetic retinopathy (DR) is a common microvascular complication of diabetes1,2. Approximately 10% of people with diabetes progress to vision-threatening diabetic retinopathy (VTDR), often without any symptoms; when symptoms do occur, they can include blurred vision, eye pain or redness, and the appearance of floating shapes in the visual field3. Known risk factors for DR include duration of diabetes, poor glycaemic control and hypertension, but models based on these risk factors do not identify people with VTDR accurately. It is therefore recommended that every person with diabetes undergoes regular retinal screening to identify intraretinal signs of DR (microaneurysms, haemorrhages, exudates) and to refer those at risk of VTDR (referable DR) for regular monitoring by an ophthalmologist and for VTDR treatment when indicated4,5. There are approximately 537 million people with diabetes worldwide, and about 75% reside in low- and middle-income countries (LMIC)2. Approximately 40 million people in these countries are at risk of VTDR2, but establishing systematic DR screening programmes to the standards recommended in high-income countries is not feasible in LMIC because of the cost of fixed table-top retinal cameras and the lack of infrastructure and of a trained workforce to obtain and grade retinal images.

To ensure screening of large numbers of people with diabetes and to reach remote and rural areas, most LMIC employ non-technical staff to screen people with diabetes in community settings using low-cost non-mydriatic cameras2,6. These screening strategies have additional approach-specific challenges6,7. Handheld non-mydriatic retinal cameras offer the benefits of portability and low cost, but they increase the rate of ungradable images, in part because of the lack of a stabilising platform8. Image quality is also affected by the increased prevalence of undiagnosed co-pathology in communities with limited healthcare access, particularly cataract, the most common cause of visual impairment in LMIC9.

The recommended workforce for grading retinal images is not cost-effective even in high income countries10. One solution to a more efficient and sustainable programme is to employ automated algorithms. Deep learning, as a state-of-the-art machine learning technique, has achieved remarkable success in the detection of a variety of medical conditions, particularly in ophthalmology11,12,13, and most notably DR14,15,16,17,18. However, to date, automated algorithms for DR screening have been developed using retinal images acquired through dilated pupils on fixed desktop cameras by a trained workforce14,15,17,18. These algorithms cannot be translated to non-mydriatic retinal images captured by field workers in the challenging acquisition conditions of community settings19. A substantial proportion of retinal images captured in such environments exhibit variable quality due to obscuration of fundal areas, variable image brightness and suboptimal focus. Therefore, automated algorithms need to be developed specifically for this setting. As such, there is an unmet need for an automated algorithm that grades retinal images taken in non-clinical, community environments to enable the translation and adoption of DR screening in LMIC.

As part of the SMART India study, a cross-sectional study conducted across 20 regions in India, in this work package we developed and evaluated a deep learning-based system (DLS) for detecting referable DR. We focussed not only on traditional two-field images but also on single-field macula- or optic disc-centred handheld non-mydriatic retinal images, to establish the accuracy of the algorithm in relation to the retinal area captured in such settings. In addition, we compared the accuracy of this algorithm with that of risk models based on systemic risk factors, which are used to identify DR in settings where retinal screening is not available.

## Methods

### Study design and participants

Participants were recruited and screened in two stages between 20th December 2018 and 20th March 2020 (SMART-INDIA 1, SM1) and between 8th October 2020 and 17th April 2021 (SMART INDIA 2, SM2). A stratified sample of adults aged 40 years or above was screened in each household for diabetes, and those with diabetes were screened for DR by minimally trained field workers using low-cost handheld non-mydriatic retinal cameras (see included centres in Supplementary Fig. S1)20. Field workers underwent on-site training at each centre on the use of a handheld Zeiss Visuscout 100 camera (Zeiss, Germany) to capture a set of at least two 40° colour retinal photographs (macula and optic disc centred) from each eye without pupil dilation. To maximize gradeability rates, no limit was set on the number of acquired photographs for each patient. When media opacities, undiagnosed co-pathologies such as cataract, or small pupils hindered the acquisition of fundus images, photographs of the anterior segment were acquired with the same camera; these were not used in the development of the DLS for referable DR screening. In SM1, field workers captured the set of retinal fundus photographs in community screenings from individuals who had confirmed diabetes or who, on the day of the survey, had an elevated random blood glucose of 8.9 mmol/L or higher. In SM2, to enrich the total dataset with VTDR images, the same field workers screened, in ophthalmology clinics, only patients who had confirmed diabetes, resulting in a higher prevalence of referable patients.

This cross-sectional study complied with the Declaration of Helsinki and was approved by The Indian Council of Medical Research (ICMR)/Health Ministry Screening Committee (HMSC/2018-0494, dated 17/12/2018). Institutional Ethics Committees of all the participating institutions approved both parts of the study (SM1 and SM2). Informed consent was obtained from each participant. The study protocol has been published20.

### Automated data curation

The pool of captured images included anterior segment, grayscale and ungradable samples, and 11% of images lacked laterality data. An automated data curation pipeline was implemented to select the best-quality two-field macula- and optic disc-centred images from the initial pool of captured images for each eye (Fig. 1B). The process addressed the identified challenges of this type of community screening through the development and testing of four independent deep learning-based models for fundus, laterality and field detection (macula and optic disc), as well as gradeability scoring (Supplementary Fig. S2)23. A subset of retinal photographs from the initial pool of captured images was manually graded for these parameters by a trained ophthalmologist and used to develop the deep learning curation models (for details of the implementation, training and independent validation of the curation models, see Supplementary Methods and Supplementary Figs. S3, S4 and Table S3). After removal of grayscale/non-fundus images and detection of laterality, macula and optic disc fields were identified, and the image with the highest gradeability score for each field per eye was selected. Eyes with an eligible pair of two-field images were selected for referable DR DLS development.
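The final selection step described above — keeping the highest-gradeability image per field, and retaining only eyes with a complete two-field pair — can be sketched as follows. The dictionary keys and the `'macula'`/`'disc'` field labels are illustrative assumptions, not the study's actual data schema:

```python
from collections import defaultdict

def select_two_field_pairs(images):
    """Select, for each eye, the highest-gradeability image per field,
    keeping only eyes that end up with both a macula- and a disc-centred
    image. `images` is a list of dicts with keys: eye_id, field
    ('macula' or 'disc'), gradeability (float in [0, 1]) -- already
    filtered to colour fundus images with laterality assigned."""
    best = defaultdict(dict)  # eye_id -> {field: best image so far}
    for img in images:
        current = best[img["eye_id"]].get(img["field"])
        if current is None or img["gradeability"] > current["gradeability"]:
            best[img["eye_id"]][img["field"]] = img
    # Keep only eyes with an eligible two-field pair
    return {eye: fields for eye, fields in best.items()
            if {"macula", "disc"} <= fields.keys()}
```

Eyes lacking either field drop out at the final filter, mirroring the eligibility rule above.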

### Model development

A DLS was developed to detect referable DR/DME in a patient eye from a pair of macula and optic disc-centred handheld non-mydriatic retinal photographs (Fig. 1C). Each field was fed into an independent CNN with trainable parameters. Feature maps generated by each architecture were concatenated after a global average pooling layer (1 × 1024) and forwarded to the final fully-connected layer. All models took 766 × 578 pixel colour fundus photographs as inputs and provided an output probability for the presence of referable DR and/or DME. Higher resolution inputs up to 1149 × 867 pixel size were also investigated, but no significant improvements in DLS performance were observed.
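A minimal numerical sketch of the fusion step described above — global average pooling of each branch's feature map, concatenation into a 1 × 1024 descriptor (two 512-channel ResNet34 feature maps), and a final fully-connected layer with sigmoid output. This is an illustration of the tensor shapes involved, not the trained model:

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse an (H, W, C) feature map to a (C,) vector by spatial averaging."""
    return feature_map.mean(axis=(0, 1))

def fuse_and_classify(macula_features, disc_features, weights, bias):
    """Concatenate the pooled descriptors of the two fields (2 * 512 = 1024
    for ResNet34 backbones) and apply the final fully-connected layer with a
    sigmoid, yielding the probability of referable DR/DME for the eye."""
    pooled = np.concatenate([global_average_pool(macula_features),
                             global_average_pool(disc_features)])  # (1024,)
    logit = pooled @ weights + bias
    return 1.0 / (1.0 + np.exp(-logit))
```

In the actual DLS the two feature maps come from independently trained CNN branches; here they are stand-in arrays.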

The model encoding sections use ResNet34 architectures24, pre-trained on the ImageNet database and trained on the SM1 and SM2 datasets with five-fold cross-validation stratified by database and DR score (SM1, SM2 and DR scores equally distributed across all folds); eyes from the same patient were never split between the training set and the test set. Images were pre-processed by subtracting the local average colour and normalizing images at the channel level to the ImageNet mean and standard deviation. The models were trained for 10 epochs, with a batch size of 16 and an initial learning rate of 10⁻⁴ with a decay factor of 0.95. Data augmentation was used in the training phase (random Gaussian blur with 5% probability, random flip with 50% probability, random ±5% scaling, random ±10° rotation, up to 5% random translation, and up to ±5% random shearing).
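The pre-processing step (local average colour subtraction followed by ImageNet channel normalization) could look roughly like the following sketch. The Gaussian blur scale and the re-centring around mid-grey are assumptions — the text does not specify the local-averaging kernel:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(image, sigma=25):
    """Subtract the local average colour (a Gaussian-blurred copy of the
    image), then normalize each channel to the ImageNet mean and standard
    deviation. `image` is an (H, W, 3) float array in [0, 1]; `sigma` is
    an assumed blur scale, not a value stated in the paper."""
    local_avg = np.stack(
        [gaussian_filter(image[..., c], sigma) for c in range(3)], axis=-1)
    centred = image - local_avg + 0.5  # re-centre around mid-grey (assumed)
    return (centred - IMAGENET_MEAN) / IMAGENET_STD
```

Local average subtraction of this kind is commonly used to suppress the uneven illumination typical of handheld captures.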

### Statistical analysis

We evaluated the ability of the DLS to predict referable DR/DME from handheld non-mydriatic retinal photographs using the area under the receiver operating characteristic curve (AUROC) with 1000-times bootstrapped confidence intervals (see Supplementary Methods). Additionally, we examined model sensitivity and specificity at three operating points (OP): Youden’s index (threshold defined by Eq. (1))25, high sensitivity (threshold defined by Eq. (2) with $$\alpha =0.3$$) and high specificity (threshold defined by Eq. (2) with $$\alpha =0.7$$).

$${t}_{Y}={\mathrm{argmax}}_{t}\,J\left(t\right), \;\; \text{where} \;\; J\left(t\right)=\mathrm{sensitivity}\left(t\right)+\mathrm{specificity}\left(t\right)-1$$
(1)
$${t}_{\alpha }={\mathrm{argmax}}_{t}\,f\left(t\right), \;\; \text{where} \;\; f\left(t\right)=\alpha \cdot \mathrm{specificity}\left(t\right)+\left(1-\alpha \right)\cdot \mathrm{sensitivity}\left(t\right)$$
(2)
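Both operating-point definitions amount to a sweep over candidate thresholds on the validation scores. A minimal sketch of how the three OPs could be derived (function and variable names are illustrative):

```python
import numpy as np

def operating_points(scores, labels, alpha_hi_sens=0.3, alpha_hi_spec=0.7):
    """Sweep candidate thresholds and return the Youden threshold (Eq. 1)
    and the two weighted thresholds of Eq. 2 for alpha = 0.3 and 0.7.
    `scores` are model output probabilities, `labels` are 0/1 ground truth."""
    thresholds = np.unique(scores)
    pos, neg = labels == 1, labels == 0
    sens = np.array([(scores[pos] >= t).mean() for t in thresholds])
    spec = np.array([(scores[neg] < t).mean() for t in thresholds])
    t_youden = thresholds[np.argmax(sens + spec - 1)]       # Eq. (1)
    def weighted(alpha):                                     # Eq. (2)
        return thresholds[np.argmax(alpha * spec + (1 - alpha) * sens)]
    return t_youden, weighted(alpha_hi_sens), weighted(alpha_hi_spec)
```

With α = 0.3 sensitivity dominates the objective (high-sensitivity OP); with α = 0.7 specificity dominates (high-specificity OP).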

Inter-grader agreement between primary and secondary graders, and agreement between final grades (after arbitration) and the primary and secondary graders, respectively, were calculated with exact Clopper-Pearson 95% confidence intervals.
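The exact (Clopper-Pearson) interval used above can be computed from beta-distribution quantiles; a small sketch, assuming SciPy is available:

```python
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) two-sided confidence interval for a
    proportion of k successes out of n trials, via beta quantiles."""
    a = (1 - conf) / 2
    lower = beta.ppf(a, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - a, k + 1, n - k) if k < n else 1.0
    return lower, upper
```

For example, an agreement of 86 out of 100 cases gives an interval spanning roughly 0.78 to 0.92.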

DLS performance was compared with prognostication based on individual-level risk factors. Univariate and multivariate logistic regression models were trained on the available risk factors to identify the presence of referable DR/DME in either eye. Univariate models were trained on glycated haemoglobin (HbA1c) levels, duration of diabetes, systolic and diastolic blood pressure, and body mass index (BMI). Two multivariate models were fitted: one including systolic and diastolic blood pressure alone, and one including all of the aforementioned risk factors.
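Since the DLS and the risk-factor models are all compared by AUROC with percentile-bootstrap confidence intervals, a compact sketch of that evaluation machinery (rank-based AUROC plus subject-level resampling) might look like the following; all names are illustrative:

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: the probability that a random positive scores
    above a random negative (Mann-Whitney U normalised by n+ * n-)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, n_boot=1000, conf=0.95, seed=0):
    """Percentile bootstrap CI for AUROC, resampling with replacement."""
    rng = np.random.default_rng(seed)
    stats = []
    n = len(scores)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if labels[idx].min() == labels[idx].max():
            continue  # skip degenerate resamples containing one class only
        stats.append(auroc(scores[idx], labels[idx]))
    a = (1 - conf) / 2 * 100
    lo, hi = np.percentile(stats, [a, 100 - a])
    return lo, hi
```

The same routine can score the DLS outputs and the logistic regression predictions, making the comparison in the Results directly reproducible in spirit.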

## Results

From a pool of 81,320 retinal fundus images, a total of 32,494 images from 16,247 eyes (9778 individuals) were eligible for the study (Supplementary Fig. S2), comprising a pair of macula-centred and optic disc-centred images for each eye. Participant demographics and the distribution of DR grades for the SM1 and SM2 cohorts are shown in Table 1. In SM1, the average (SD) age of the participants was 54.40 (10.72) years, with 49.02% males, 4.70% DR-referable eyes, and 3.20% DME-referable eyes. In SM2, the average (SD) age of the participants was 55.38 (9.28) years, with 66.95% males, 88.75% DR-referable eyes, and 60.55% DME-referable eyes.

The AUROC of the DLS for referable DR/DME was 0.985 (1000-times bootstrapped 95% CI 0.98–0.99) (Fig. 2A). When the prediction was obtained from a single field, the AUROC was 0.977 (0.98–0.98) for the macula field (Fig. 2B) and 0.963 (0.95–0.98) for the optic disc field (Fig. 2C). When evaluated independently on SM1 and SM2, the DLS achieved AUROCs of 95.47 (91.75–97.98) and 95.27 (92.74–97.14), respectively.

To assess region importance in referable DR/DME prediction, we evaluated DLS performance with input ablation (Fig. 3, Supplementary Table S1). Both retinal fields were vertically split into three regions. For the macula field, the AUROC for the macula region was 0.96 (1000-times bootstrapped 95% CI 0.95–0.96), decreasing to 0.92 (0.90–0.94) for the temporal region and 0.85 (0.81–0.87) for the optic disc region. For the optic disc field, the AUROC reached 0.91 (0.90–0.94) for the macula region, 0.91 (0.89–0.94) for the nasal region, and 0.89 (0.87–0.92) for the optic disc region. When only the region corresponding to the optic disc was occluded, the AUROC reached 0.97 (0.97–0.98) for the macula field and 0.95 (0.95–0.97) for the optic disc field.
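The vertical three-way split used in this ablation can be implemented as a simple occlusion mask; a sketch under the assumption that the regions are equal-width thirds along the image width:

```python
import numpy as np

def occlude_vertical_thirds(image, keep):
    """Zero out all but the requested vertical third of an (H, W, C) image.
    `keep` is 0 (left), 1 (centre) or 2 (right); thirds are split along the
    width, mirroring the three-region vertical split of the ablation study."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    lo, hi = keep * w // 3, (keep + 1) * w // 3
    out[:, lo:hi] = image[:, lo:hi]
    return out
```

Feeding each masked variant through the trained model and re-computing the AUROC yields the per-region figures reported above.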

Sensitivity and specificity of three different OPs of the DLS were examined and compared to inter-grader performance (Table 2). Youden’s OP reached 93.86% (1000 times bootstrapped 95% CI 91.34–96.08) sensitivity and 96.00% (94.68–98.09) specificity. The high sensitivity OP corresponded to 95.53% (93.64–97.41) sensitivity and 92.79% (88.96–97.40) specificity, and the high specificity OP reached 90.88% (88.80–95.23) sensitivity and 98.00% (96.29–98.88) specificity. Primary grader agreement with final grades (after arbitration when disagreement between primary and secondary grades) reached 86.67% (85.52–87.75) sensitivity and 98.43% (98.18–98.64) specificity, whilst secondary graders agreement with final grades showed 94.32% (93.52–95.05) sensitivity and 99.63% (99.50–99.73) specificity. Primary graders vs. secondary graders reached 90.01% (88.95–91.01) sensitivity and 96.09% (95.72–96.42) specificity.

Univariate analysis of person-level risk factors (Fig. 3, Supplementary Table S2) showed that duration of diabetes had the strongest predictive association, with an AUROC of 0.84 (1000-times bootstrapped 95% CI 0.81–0.86), followed by glycated haemoglobin (HbA1c) levels, with an AUROC of 0.64 (0.59–0.67). Multivariate analysis of duration of diabetes, glycated haemoglobin levels, systolic and diastolic blood pressure (BP), and BMI reached an AUROC of 0.84 (0.81–0.87).

Integrated gradients were used to gain insight into the retinal features learned by the DLS26. The saliency maps highlight the most influential pixels in the DLS decision (Fig. 4). When signs of referable DR/DME are present in the image (Fig. 4A,B) the parts of the image where the specific lesions are located (e.g. microaneurysms) are prominently highlighted. The DLS consistently highlights lesions even when they are hardly visible to the naked eye. In the absence of referable DR/DME, only the optic disc or general regions of the retina are highlighted.
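Integrated gradients attribute a prediction by accumulating gradients along a straight-line path from a baseline (typically a black image) to the input. A toy finite-difference sketch of the idea follows; for a real CNN, framework autograd would replace the numerical gradients, and the names here are illustrative:

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=50, eps=1e-5):
    """Approximate integrated gradients of a scalar function `f` at `x`
    relative to `baseline`: (x - baseline) times the average gradient
    along the straight-line path, with gradients estimated by central
    finite differences (for illustration only; autograd is exact)."""
    avg_grad = np.zeros_like(x, dtype=float)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)
        for i in range(x.size):
            d = np.zeros_like(x, dtype=float)
            d.flat[i] = eps
            avg_grad.flat[i] += (f(point + d) - f(point - d)) / (2 * eps)
    avg_grad /= steps
    return (x - baseline) * avg_grad
```

For a linear model the attributions recover exactly the per-feature contribution, which is the sanity check usually applied to such implementations.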

## Discussion

The use of handheld non-mydriatic retinal images for screening poses unique challenges for automated DR detection systems due to variable image quality. The majority of prior automated referable DR detection systems have been developed using guidelines and acquisition conditions reflective of high-income countries15,17,27. The application of these systems to handheld, non-mydriatic retinal images results in significantly reduced model performance19. Therefore, there is a need to develop and evaluate automated grading systems using screening and acquisition conditions matching those found in LMIC. Evidence of the efficacy and applicability of automated DR detection systems in resource limited environments could greatly widen the availability of DR screening which, in turn, could help reduce preventable sight loss. In this study, we developed and validated a DLS that achieved a clinically acceptable level of performance in detecting referable DR and DME from handheld, non-mydriatic retinal images acquired in community settings by field workers in India, a LMIC.

Prior to the advent of deep learning-based techniques, feature-based approaches had been explored to assist in the screening of DR from different retinal image modalities28,29. Detection frameworks based on geometric features, vessel analysis and retinal hemodynamics had been widely studied30,31. In recent years, however, the success of deep learning at classification tasks has paved the way for new achievements in the automated diagnosis of referable DR. Several studies have recently explored the detection of referable DR using deep neural networks15,17,27. These studies used mydriatic retinal photographs acquired in in-clinic screening programs by imaging professionals using table-top retinal cameras. Gulshan et al.27 evaluated a deep learning system for referable DR, reporting a sensitivity of 90.3% and a specificity of 98.1% on the EyePACS-1 public dataset. Similarly, the algorithm of Ting et al.17 reported a sensitivity of 90.5% and a specificity of 91.6% on a proprietary validation dataset. More recently, a prospective study by Gulshan et al.14 evaluated automated DR detection, reaching a best performance of 92.1% sensitivity and 95.2% specificity at some sites; the authors highlighted the impact of acquisition settings (in-clinic and community-based) on algorithm performance. Bellemo et al.15 reported a sensitivity of 99.42% for detecting vision-threatening DR and 97.19% for referable maculopathy in a study based in an LMIC (Zambia).

A few studies have also explored more fine-grained classification of DR stages. Wang et al.32 examined the performance of different architectures at DR staging when trained on a 166-image subset of the Kaggle dataset33, with InceptionNet V334 reaching the best results at 63.23% accuracy. Khan et al.35 proposed a custom convolutional model and reported an accuracy of 98.15% on the Messidor dataset36. Harangi et al.37 used the Kaggle dataset33 to train a framework that combined AlexNet38 and hand-crafted features, reporting an accuracy of 90.07% when tested on the IDRiD dataset39. Li et al.40 achieved a joint accuracy of 65.1% on the IDRiD dataset39 by training a ResNet50 architecture24. Qureshi et al.41 proposed a framework trained on the EyePACS-1 dataset based on patch extraction and classification and reported 92.20% sensitivity and 95.10% specificity. Alyoubi et al.42, with a custom convolutional model, achieved accuracies of 88.6% and 84.1% on the DDR43 and APTOS Kaggle 201944 datasets, respectively.

Few studies have explored DLS performance on handheld retinal images or in community-based settings. A notable exception is Rajalakshmi et al.45, who reported a sensitivity of 95.8% and a specificity of 80.2% at detecting any DR using 2408 smartphone-based mydriatic fundus images acquired by hospital-trained staff in a clinic environment. A pilot study by Natarajan et al.46 on 223 patients with diabetes (PwD) reported 100.0% sensitivity and 88.4% specificity for referable DR detection using a smartphone-based automated system. Sosale et al., in a prospective study including 922 individuals, developed a smartphone-based system using a combination of non-mydriatic and mydriatic images acquired in clinical settings by a trained camera technician; their referable DR system, using pairs of macula- and disc-centred images, reported 93.0% sensitivity and 92.5% specificity.

There are differences between these studies and ours. In our study, we developed a fully automated DLS to detect referable DR/DME in a setting that mirrors real-life implementations in LMIC. We evaluated the DLS on handheld non-mydriatic retinal photographs acquired by field workers and demonstrated performance competitive with or better than prior studies despite the unfavourable acquisition conditions. Clinically acceptable performance was achieved by the DLS using either two-field (macula- and optic disc-centred) or single-field inputs independently, most notably with macula-only images. As retinal screening is not available in many countries worldwide, we also evaluated the predictive performance of the available risk factors by training univariate and multivariate logistic regression models and assessing their comparative predictive performance. Among the risk factors we studied, duration of diabetes had the highest predictive significance; adding other risk factors made no additional contribution in the multivariate model. The image-based DLS outperformed all risk factor-based models in detecting referable DR/DME, highlighting the need to establish retinal screening programmes globally. Incorporating this fully automated DLS in low-cost cameras is likely to reduce the healthcare burden of DR screening worldwide.

As image quality is often suboptimal and some areas of the two-field images may be missing, we also carried out a comprehensive set of image-region ablation studies to better understand the contribution of different image areas to the prediction. The findings showed that the optic disc regions, both within the macula field and the optic disc field, had the lowest significance: DLS performance was only slightly reduced when the optic disc area alone was occluded, whilst using the optic disc region alone yielded the lowest performance. In contrast, inclusion of the macula area achieved the highest performance compared with each of the other independently evaluated regions. These findings matter because partial image occlusion is likely to occur in a non-trivial proportion of images captured with non-mydriatic handheld cameras, and the ablations indicate the likely impact of such occlusions on performance. In our cohort, referable DME was more prevalent than severe DR, which is why model performance was significantly affected by occlusion of the macula region. Optic disc neovascularisation, however, is a sight-threatening complication that requires treatment. Therefore, despite the challenges of capturing two-field images through non-mydriatic pupils, it is crucial for field workers to be trained to obtain both the macula and optic disc field images; obtaining the optic disc field alone without the macula field is likely to miss significant numbers of referable DR/DME.