Using deep learning to detect diabetic retinopathy on handheld non-mydriatic retinal images acquired by field workers in community settings

Diabetic retinopathy (DR) at risk of vision loss (referable DR) needs to be identified by retinal screening and referred to an ophthalmologist. Existing automated algorithms have mostly been developed from images acquired with high-cost mydriatic retinal cameras and cannot be applied in the settings used in most low- and middle-income countries. In this prospective multicentre study, we developed a deep learning system (DLS) that detects referable DR from retinal images acquired using a handheld non-mydriatic fundus camera by non-technical field workers in 20 sites across India. Macula-centred and optic-disc-centred images from 16,247 eyes (9778 participants) were used to train and cross-validate the DLS and risk-factor-based logistic regression models. The DLS achieved an AUROC of 0.99 (1000 times bootstrapped 95% CI 0.98–0.99) using two-field retinal images, with 93.86% (91.34–96.08) sensitivity and 96.00% (94.68–98.09) specificity at the Youden's index operating point. With single-field inputs, the DLS reached an AUROC of 0.98 (0.98–0.98) for the macula field and 0.96 (0.95–0.98) for the optic-disc field. Inter-grader performance was 90.01% (88.95–91.01) sensitivity and 96.09% (95.72–96.42) specificity. The image-based DLS outperformed all risk-factor-based models. This DLS demonstrated clinically acceptable performance for the identification of referable DR despite challenging image-capture conditions.

To ensure screening of large numbers of people with diabetes and to reach remote and rural areas, most LMIC employ non-technical staff to screen people with diabetes in community settings using non-mydriatic low-cost cameras 2,6. These screening strategies have additional approach-specific challenges 6,7. Handheld non-mydriatic retinal cameras offer the benefits of portability and low cost, but they increase the rate of ungradable images, in part due to the lack of a stabilising platform 8. Image quality is also impacted by the increased prevalence of undiagnosed co-pathology in communities with limited healthcare access, particularly cataract, the most common cause of visual impairment in LMIC 9.
The recommended workforce for grading retinal images is not cost-effective even in high-income countries 10. One solution for a more efficient and sustainable programme is to employ automated algorithms. Deep learning, as a state-of-the-art machine learning technique, has achieved remarkable success in the detection of a variety of medical conditions, particularly in ophthalmology [11][12][13], and most notably DR [14][15][16][17][18]. However, to date, automated algorithms for DR screening have been developed using retinal images acquired through dilated pupils on fixed desktop cameras by a trained workforce 14,15,17,18. These algorithms cannot be translated to non-mydriatic retinal images captured by field workers in the challenging acquisition conditions of community settings 19. A substantial proportion of retinal images captured in such environments exhibit variable quality due to obscuration of fundal areas, variable image brightness and suboptimal focus. Therefore, automated algorithms need to be developed specifically for this setting. As such, there is an unmet need for an automated algorithm that grades retinal images taken in non-clinical, community environments to enable the translation and adoption of DR screening in LMIC.
As part of the SMART India study, a cross-sectional study conducted across 20 regions in India, in this work-package we developed and evaluated a deep learning-based system (DLS) for detecting referable DR. We focussed not only on traditional two-field images but also on single-field macula- or optic-disc-centred handheld non-mydriatic retinal images to inform the accuracy of the algorithm based on the retinal area captured in such settings. In addition, we compared the accuracy of this algorithm to risk models based on systemic risk factors that are used to identify DR in settings where retinal screening is not available.

Methods
Study design and participants. Participants were recruited and screened in two stages, between 20th December 2018 and 20th March 2020 (SMART-INDIA 1, SM1) and between 8th October 2020 and 17th April 2021 (SMART INDIA 2, SM2). A stratified sample of adults aged 40 years or above was screened in each household for diabetes, and those with diabetes were screened for DR by minimally trained field workers using low-cost handheld non-mydriatic retinal cameras (see included centres in Supplementary Fig. S1) 20. Field workers underwent on-site training at each centre on the use of a handheld Zeiss Visuscout 100 camera (Zeiss, Germany) to capture a set of at least two 40° colour retinal photographs (macula- and optic-disc-centred) from each eye without pupil dilation. To maximise gradeability rates, no limit was set on the number of photographs acquired for each patient. When media opacities or undiagnosed co-pathologies, such as cataract or small pupils, hindered the acquisition of fundus images, photographs of the anterior segment were acquired with the same camera; these were not used in the development of the DLS for referable DR screening. In SM1, field workers captured the set of retinal fundus photographs in community screenings from individuals who had confirmed diabetes or who, on the day of survey, had an elevated random blood sugar of 8.9 mmol/L or higher. In SM2, to enrich the total dataset with VTDR images, the same field workers screened only patients with confirmed diabetes in ophthalmology clinics, resulting in a higher prevalence of referable patients.
This cross-sectional study complied with the Declaration of Helsinki and was approved by The Indian Council of Medical Research (ICMR)/Health Ministry Screening Committee (HMSC/2018-0494, dated 17/12/2018). Institutional Ethics Committees of all the participating institutions approved both parts of the study (SM1 and SM2). Informed consent was obtained from each participant. The study protocol has been published 20 .
Image grading. A teleophthalmology system was set up whereby retinal images captured by each field worker were uploaded to a cloud-based database for subsequent independent grading at the local clinical centre (on-site primary grading), as well as transferred to four central reading centres for secondary grading (Fig. 1A). Trained optometrists or ophthalmologists graded all images from each eye, and discrepancies between primary and secondary grading were arbitrated by a senior retinal consultant at each reading centre. Person eyes were classified as per the International Clinical Disease Severity Scale for DR as no DR, mild, moderate or severe non-proliferative DR, and proliferative DR 21,22, or as ungradable. Gradable eyes had two outcomes: (1) referable DR (moderate non-proliferative DR or worse) or non-referable DR (eyes with no DR or mild DR), and (2) diabetic macular edema (DME), graded as non-present, present or referable. The reference standard used to develop and validate the DLS was the presence of either referable DR or referable DME as per the final manual human grade, which was based on all captured images per patient eye.
Automated data curation. The pool of captured images comprised anterior segment, grayscale and ungradable samples. Some images also had missing laterality data (11%). An automated data curation pipeline was implemented to select the best-quality two-field macula- and optic-disc-centred images from the initial pool of captured images per eye (Fig. 1B). The process addressed the identified challenges of this type of community screening via the development and testing of four independent deep learning-based models for fundal, laterality and field detection (macula and optic disc), as well as gradeability scoring (Supplementary Fig. S2) 23. A subset of retinal photographs from the initial pool of captured images was manually graded for these parameters by a trained ophthalmologist and used to develop the deep learning curation models (Supplementary Table S3). After the removal of grayscale/non-fundus images and detection of laterality, macula and optic disc fields were identified and the image with the highest gradeability score from each field per eye was selected. Eyes with an eligible pair of two-field images were selected for referable DR DLS development.
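The final per-eye selection step of the curation pipeline can be sketched as follows. This is a minimal illustration only, assuming each capture has already been scored by the fundal-detection, laterality, field and gradeability models; the record layout (`eye_id`, `is_fundus`, `field`, `gradeability`) is hypothetical and not taken from the study.

```python
from collections import defaultdict

def select_two_field_pairs(images):
    """Pick the highest-gradeability macula and optic-disc image per eye.

    `images` is a list of dicts with hypothetical keys: eye_id,
    is_fundus (bool), field ('macula' or 'disc'), gradeability (float),
    standing in for the outputs of the curation models.
    Returns {eye_id: (best_macula, best_disc)} for eyes with both fields.
    """
    best = defaultdict(dict)  # eye_id -> field -> best image so far
    for img in images:
        if not img["is_fundus"]:          # drop anterior-segment/grayscale captures
            continue
        field = img["field"]
        cur = best[img["eye_id"]].get(field)
        if cur is None or img["gradeability"] > cur["gradeability"]:
            best[img["eye_id"]][field] = img
    # keep only eyes with an eligible two-field pair
    return {eye: (fields["macula"], fields["disc"])
            for eye, fields in best.items()
            if "macula" in fields and "disc" in fields}
```

Eyes lacking either a gradable macula or optic-disc capture simply drop out of the returned mapping, mirroring the pipeline's exclusion of eyes without an eligible two-field pair.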

Model development.
A DLS was developed to detect referable DR/DME in a patient eye from a pair of macula- and optic-disc-centred handheld non-mydriatic retinal photographs (Fig. 1C). Each field was fed into an independent CNN with trainable parameters. Feature maps generated by each architecture were concatenated after a global average pooling layer (1 × 1024) and forwarded to the final fully-connected layer. All models took 766 × 578 pixel colour fundus photographs as inputs and provided an output probability for the presence of referable DR and/or DME. Higher-resolution inputs up to 1149 × 867 pixels were also investigated, but no significant improvements in DLS performance were observed. The model encoding sections used ResNet34 architectures 24, pre-trained on the ImageNet database and trained on the SM1 and SM2 datasets with five-fold cross-validation with fold stratification by database and DR score (SM1, SM2 and DR scores equally distributed throughout all folds); eyes from the same patient were never part of both the training set and the test set. Images were pre-processed by subtracting the local average colour and normalising each channel to the ImageNet mean and standard deviation. The models were trained for 10 epochs with a batch size of 16 and an initial learning rate of 10⁻⁴ with a decay factor of 0.95. Data augmentation was used in the training phase (random Gaussian blur with 5% probability, random flip with 50% probability, ± 5% random scaling, ± 10° random rotation, up to 5% random translation, and ± 5% random shearing).
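The two-branch fusion design described above can be sketched in PyTorch as follows. Small stand-in convolutional encoders replace the ImageNet-pretrained ResNet34 backbones so the snippet runs self-contained; only the fusion structure (independent per-field encoders, global average pooling to 512-d features, concatenation to 1 × 1024, one fully connected output layer) follows the paper.

```python
import torch
import torch.nn as nn

class TwoFieldDRNet(nn.Module):
    """Two-branch fusion network in the spirit of the paper's DLS.

    Each field (macula / optic disc) is encoded independently; the two
    512-d globally average-pooled feature vectors are concatenated
    (1 x 1024) and passed to the final fully connected layer.  The tiny
    encoders below are placeholders for the ResNet34 backbones.
    """
    def __init__(self):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 512, 3, stride=2, padding=1), nn.ReLU(),
            )
        self.macula_enc = encoder()
        self.disc_enc = encoder()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.fc = nn.Linear(1024, 1)          # referable DR/DME logit

    def forward(self, macula, disc):
        fm = self.pool(self.macula_enc(macula)).flatten(1)  # (B, 512)
        fd = self.pool(self.disc_enc(disc)).flatten(1)      # (B, 512)
        return self.fc(torch.cat([fm, fd], dim=1))          # (B, 1)
```

Because the pooling is adaptive, the network accepts the 766 × 578 inputs used in the study as well as smaller images; the output logit would be passed through a sigmoid to obtain the referable DR/DME probability.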

Statistical analysis.
We evaluated the ability of the DLS to predict referable DR/DME from handheld non-mydriatic retinal photographs using the area under the receiver operating characteristic curve (AUROC) with 1000 times bootstrapped confidence intervals (see Supplementary Methods). Additionally, we examined model sensitivity and specificity at three operating points (OP): Youden's index (threshold defined by Eq. (1)) 25 , high sensitivity (threshold defined by Eq. (2) with α = 0.3) and high specificity (threshold defined by Eq. (2) with α = 0.7).
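Operating-point selection of this kind can be sketched as below. Eqs. (1) and (2) are not reproduced in this section, so the sketch assumes the usual Youden criterion (maximising sensitivity + specificity − 1) and, for the α-weighted points, a convex combination α·specificity + (1 − α)·sensitivity, so that α = 0.3 favours sensitivity and α = 0.7 favours specificity; this weighting is an assumption, not the paper's stated formula.

```python
def roc_points(scores, labels, thresholds):
    """Sensitivity/specificity at each threshold (labels: 1 = referable)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        pts.append((t, tp / pos, tn / neg))  # (threshold, sens, spec)
    return pts

def pick_operating_point(pts, alpha=0.5):
    """Maximise alpha*specificity + (1-alpha)*sensitivity.

    alpha = 0.5 has the same argmax as Youden's J = sens + spec - 1;
    alpha = 0.3 gives a high-sensitivity OP, alpha = 0.7 high-specificity.
    """
    return max(pts, key=lambda p: alpha * p[2] + (1 - alpha) * p[1])
```

In practice the candidate thresholds would be the sorted DLS output probabilities on the validation fold.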
Inter-grader agreement between primary and secondary graders, and between final grades (after arbitration) and primary and secondary graders, respectively, was calculated with exact Clopper–Pearson 95% confidence intervals. DLS performance was compared to the prognostication obtained using individual-level risk factors. Univariate and multivariate logistic regression models were trained using the available risk factors to identify the presence of referable DR/DME in either eye. Univariate models were trained using glycated haemoglobin (HbA1c) levels, duration of diabetes, systolic and diastolic blood pressure, and body mass index (BMI). Multivariate models included systolic and diastolic blood pressure alone and all aforementioned risk factors.
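An exact Clopper–Pearson interval for a proportion (e.g. an agreement rate of k successes out of n gradings) can be computed from binomial tail probabilities alone. The sketch below uses only the standard library and bisection; the study's actual statistical software is not specified here.

```python
from math import comb

def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) CI for a proportion k/n via bisection."""
    alpha = 1 - conf

    def solve(f):
        lo, hi = 0.0, 1.0
        for _ in range(60):  # bisection; f is decreasing in p
            mid = (lo + hi) / 2
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # lower bound: p with P(X >= k | p) = alpha/2
    lower = 0.0 if k == 0 else solve(lambda p: alpha / 2 - (1 - _binom_cdf(k - 1, n, p)))
    # upper bound: p with P(X <= k | p) = alpha/2
    upper = 1.0 if k == n else solve(lambda p: _binom_cdf(k, n, p) - alpha / 2)
    return lower, upper
```

For example, 5 agreements out of 10 yields the well-known exact 95% interval of roughly (0.187, 0.813); the `math.comb`-based tail sum is adequate for the sample sizes involved here.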

Results
From a pool of 81,320 retinal fundus images, a total of 32,494 images from 16,247 eyes (9778 individuals) were eligible for the study (Supplementary Fig. S2), comprising a pair of macula-centred and optic-disc-centred images for each person eye. Participant demographics and the distribution of DR grades for both the SM1 and SM2 cohorts are shown in Table 1. In SM1, the average age of participants was 54.40 (10.72) years, with 49.02% males, 4.70% DR-referable eyes, and 3.20% DME-referable eyes. In SM2, the average age of participants was 55.38 (9.28) years, with 66.95% males, 88.75% DR-referable eyes, and 60.55% DME-referable eyes.
To assess region importance in referable DR/DME prediction, we evaluated DLS performance under input ablation (Fig. 3, Supplementary Table S1). Both retinal fields were vertically split into three regions. Sensitivity and specificity at the three different OPs of the DLS were examined and compared to inter-grader performance (Supplementary Table S1) and to univariate/multivariate analyses of person-level risk factors (Supplementary Table S2). Univariate analyses (Supplementary Table S2) showed that the duration of diabetes had the most significant predictive association, with an AUROC of 0.84 (1000 times bootstrapped 95% CI 0.81–0.86), followed by glycated haemoglobin (HbA1c) levels, with an AUROC of 0.64 (0.59–0.67). Multivariate analyses of duration of diabetes, glycated haemoglobin levels, systolic and diastolic blood pressure (BP), and BMI reached 0.84 (0.81–0.87).
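The vertical-thirds ablation described above can be illustrated schematically as follows; the exact region boundaries and occlusion value used in the study may differ from this sketch.

```python
import numpy as np

def occlude_thirds(img, keep):
    """Zero out vertical thirds of an H x W x C retinal image.

    `keep` is a set drawn from {0, 1, 2} (left, centre, right); thirds
    not listed are replaced by zeros, mimicking the input-ablation
    experiments on the macula and optic-disc fields.
    """
    out = np.zeros_like(img)
    w = img.shape[1]
    bounds = [0, w // 3, 2 * w // 3, w]
    for r in keep:
        out[:, bounds[r]:bounds[r + 1]] = img[:, bounds[r]:bounds[r + 1]]
    return out
```

Each ablated variant (e.g. macula field with only the central third kept) would then be fed through the trained DLS and the resulting sensitivity/specificity compared against the unablated baseline.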
Integrated gradients were used to gain insight into the retinal features learned by the DLS 26 . The saliency maps highlight the most influential pixels in the DLS decision (Fig. 4). When signs of referable DR/DME are present in the image (Fig. 4A,B), the parts of the image where the specific lesions are located (e.g. microaneurysms) are prominently highlighted. The DLS consistently highlights lesions even when they are hardly visible to the naked eye. In the absence of referable DR/DME, only the optic disc or general regions of the retina are highlighted.
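Integrated gradients attribute a prediction to each input dimension by accumulating gradients along a straight path from a baseline to the input. A minimal sketch on a toy differentiable function (with an analytic gradient, rather than the DLS itself) shows the Riemann approximation and the completeness property that attributions sum to f(input) − f(baseline).

```python
def integrated_gradients(f_grad, x, baseline, steps=200):
    """Riemann approximation of IG_i = (x_i - b_i) * integral of df/dx_i
    along the straight path from `baseline` to `x`."""
    n = len(x)
    attr = [0.0] * n
    for k in range(1, steps + 1):
        a = k / steps  # position along the path
        point = [b + a * (xi - b) for xi, b in zip(x, baseline)]
        g = f_grad(point)
        for i in range(n):
            attr[i] += g[i] * (x[i] - baseline[i]) / steps
    return attr

# Toy model f(x) = sum(x_i^2) with gradient 2x; for a zero baseline the
# exact attribution is x_i^2, and attributions sum to f(x) - f(baseline).
grad = lambda p: [2 * v for v in p]
attr = integrated_gradients(grad, x=[1.0, 2.0], baseline=[0.0, 0.0])
```

For the DLS, `f_grad` would be the network's gradient with respect to input pixels (obtained by backpropagation) and the per-pixel attributions are rendered as the saliency maps of Fig. 4.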

Discussion
The use of handheld non-mydriatic retinal images for screening poses unique challenges for automated DR detection systems due to variable image quality. The majority of prior automated referable DR detection systems have been developed using guidelines and acquisition conditions reflective of high-income countries 15,17,27. The application of these systems to handheld, non-mydriatic retinal images results in significantly reduced model performance 19. Therefore, there is a need to develop and evaluate automated grading systems using screening and acquisition conditions matching those found in LMIC. Evidence of the efficacy and applicability of automated DR detection systems in resource-limited environments could greatly widen the availability of DR screening which, in turn, could help reduce preventable sight loss. In this study, we developed and validated a DLS that achieved a clinically acceptable level of performance in detecting referable DR and DME from handheld, non-mydriatic retinal images acquired in community settings by field workers in India, a LMIC.
Prior to the advent of deep learning-based techniques, feature-based approaches had been explored to assist in the screening of DR from different retinal image modalities 28,29. Detection frameworks based on geometric features, vessel analysis and retinal hemodynamics had been widely studied 30,31. However, in recent years, the success of deep learning at classification tasks has paved the way for new achievements in the automated diagnosis of referable DR. Several studies have recently explored the detection of referable DR using deep neural networks 15,17,27. These studies used mydriatic retinal photographs acquired from in-clinic screening programs by imaging professionals using table-top retinal cameras. Gulshan et al. 27 evaluated a deep learning system for referable DR, reporting a sensitivity of 90.3% and a specificity of 98.1% for the EyePACS-1 public dataset. Similarly, the algorithm of Ting et al. 17 reported a sensitivity of 90.5% and a specificity of 91.6% on a proprietary validation dataset. More recently, a prospective study by Gulshan et al. 14 evaluated automated DR detection, reaching a best performance of 92.1% sensitivity and 95.2% specificity at some sites. In their study, the authors highlighted the impact of acquisition settings (in-clinic and community-based) on algorithm performance. Bellemo et al. 15 reported a sensitivity of 99.42% for detecting vision-threatening DR and 97.19% for referable maculopathy in a study based in a LMIC (Zambia).
A few studies have also explored a more accurate classification of the DR stages (e.g., Wang et al. 32). Few studies have explored DLS performance using handheld retinal images or community-based settings. Notable exceptions are Rajalakshmi et al. 45, who reported a sensitivity of 95.8% and a specificity of 80.2% at detecting any DR using 2408 smartphone-based mydriatic fundus images acquired by hospital-trained staff in a clinic environment. A pilot study by Natarajan et al. 46 on 223 patients with diabetes (PwD) reported 100.0% sensitivity and 88.4% specificity for referable DR detection using a smartphone-based automated system. Sosale et al., in a prospective study including 922 individuals, developed a smartphone-based system using a combination of non-mydriatic and mydriatic images acquired in clinical settings by a trained camera technician. Their referable DR system using pairs of macula- and disc-centred images reported 93.0% sensitivity and 92.5% specificity.
There are differences between these studies and ours. In our study, we developed a fully automated DLS to detect referable DR/DME in a setting that mirrors real-life implementations in LMIC. We evaluated the DLS on handheld non-mydriatic retinal photographs acquired by field workers and demonstrated performance competitive with or better than prior studies despite the unfavourable acquisition conditions. Clinically acceptable performance was achieved by the DLS using either two-field (macula- and optic-disc-centred) or single-field inputs independently, most notably with macula-only images. As retinal screening is not available in many countries worldwide, we also evaluated the predictive performance of the available risk factors by training univariate and multivariate logistic regression models and assessing their comparative predictive performance. Among the risk factors we studied, duration of diabetes had the highest predictive significance; adding other risk factors made no additional contribution in the multivariate model. The image-based DLS outperformed all risk-factor-based models in detecting referable DR/DME, highlighting the need to establish retinal screening programmes globally. Incorporating this fully automated DLS into low-cost cameras is likely to reduce the healthcare burden of DR screening worldwide.
As image quality is often suboptimal and some areas of the 2-field images may be missing, we also carried out a comprehensive set of image-region ablation studies to better understand the contribution of different image areas to the prediction. The findings showed that the optic-disc regions, both within the macula field and the optic-disc field, had the lowest significance: DLS performance was only slightly reduced when the optic disc area alone was occluded, whilst using the optic disc region alone yielded the lowest performance. Conversely, the inclusion of the macula area achieved the highest performance compared with each of the other independently evaluated regions. These findings are significant because partial occlusion is likely to occur in a non-trivial proportion of images captured using non-mydriatic handheld cameras, and the ablations indicate the likely impact of such occlusions on performance. In our cohort, referable DME was more prevalent than severe DR, which is why model performance is significantly impacted by occlusion of the macula field. Optic disc neovascularisation is nonetheless a sight-threatening complication that requires treatment. Therefore, despite the challenges of capturing 2-field images through non-mydriatic pupils, it is crucial for field workers to be trained to obtain both the macula and optic disc field images; obtaining the optic disc field alone without the macular field is likely to miss a significant number of referable DR/DME cases.
Manual grading was performed by independent primary and secondary graders. In case of discrepancies, the grading was arbitrated by a senior consultant who had access to the primary and secondary grades. We evaluated inter-grader performance and compared it to the DLS performance at different operating points. A significant difference in sensitivity was found when evaluating primary-grader and secondary-grader agreement with the final grades (after arbitration).
The lower sensitivity of primary graders reflected a more restrictive standard for detecting referable cases. Three different operating points of the DLS were evaluated. The high-specificity point aligned closely with the performance of human graders, whilst the balanced operating point (maximising Youden's index) reached a sensitivity comparable to the best inter-grader values with a preserved level of specificity. Overall, the findings highlight comparable performance between human graders and the DLS.
To the best of our knowledge, this is the first prospective multi-centre study mirroring a real-life implementation of DR screening in a LMIC, and it includes a considerably large dataset of handheld retinal images taken by field workers in a community setting. Our results demonstrate that these photographs can be used to develop deep learning-based systems capable of detecting referable DR. Our findings can contribute to the development of novel screening guidelines supported by deep learning systems and guide policy makers in establishing new scalable, cost-effective approaches to detect vision-threatening retinopathy in countries with low resources, where most PwD reside.
Our study has some limitations. First, the study is based on a mono-ethnic population; different fundal appearance and DR/DME expression in other ethnicities may affect algorithm generalisation and, therefore, influence performance when applied to different populations. Second, the pool of retinal photographs acquired by the field workers required curation to discard incorrectly acquired images and select gradable 2-field images suitable for referable DR DLS development. The deployment of DR screening programs in LMIC and the employment of non-technical field workers make this curation pipeline a necessary step prior to DR/DME screening. Our study demonstrates that this limitation can be addressed with deep learning techniques and successfully automated. Third, manual graders had access to all the retinal images acquired for each patient eye and provided their decision on referable DR/DME, whereas the DLS, limited to a finite number of input photographs, provided predictions on the basis of the pair of images resulting from this curation process, which could possibly include outliers (e.g., misclassified field images). However, the impact of such outliers is considered small given that the deep learning algorithms involved in the curation showed excellent performance for each of the curation tasks (see Supplementary Methods).
In conclusion, our study highlights the efficacy of automated deep learning-based detection of referable DR and DME using handheld non-mydriatic retinal images in community settings. Our findings have particular relevance for policy makers in LMIC aiming to implement cost-effective, scalable and sustainable DR screening programmes.

Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.