Introduction

Diabetic retinopathy (DR) is a major cause of blindness1,2. Cost-effective strategies for DR management include routine screening using retinal photographs, with referable cases (typically moderate or worse DR and/or diabetic macular edema) managed by eye care specialists3,4,5. Recently, deep learning (DL) using convolutional neural networks (CNNs) has sparked tremendous interest in medicine6. In ophthalmology, many DL algorithms and systems have been reported to achieve robust performance in detecting various ocular diseases from retinal photographs7,8,9, especially DR10,11,12,13. Despite the substantial promise of DL technology, it remains unclear which factors influence the performance of a DL algorithm14. Many research groups have developed different DL algorithms using different datasets and techniques, and comprehensive guidelines on best practices are not yet available8,9,15,16,17,18,19,20.

Many studies, primarily in computer vision, have explored factors that may optimize DL algorithms, but these factors were typically analyzed individually, making consistent conclusions difficult21,22,23,24,25. In addition, some factors involved in algorithm design, such as input size and field of view, have been overlooked or underestimated26. Some groups have examined a compilation of technical factors postulated to be critical when developing a DL algorithm for detecting pathologies from radiological imaging in the clinical setting27,28,29. However, these factors are largely domain specific (i.e., radiology)27,30. Thus, factors relevant to ophthalmology, particularly in the area of fundus imaging, have yet to be explored.

The objective of this study is to systematically review the current literature investigating factors that may influence the performance of a DL algorithm in detecting DR from fundus photographs. We then specifically addressed some of these factors. This study provides insights into technical and image-related factors that may shape future development of DL systems for retinal image analysis, especially in the context of tele-ophthalmology.

Results

Systematic review of literature

The results of our systematic literature review are detailed in Fig. 1. Our search yielded 222 results, of which seven studies were identified as evaluating technical or image-related factors in DR detection by a DL algorithm. Table 1 lists the factors and research questions addressed by these studies, which focused on image-related factors ranging from training dataset sizes to retinal camera specifications. Table 2 details the essential components, outcome measurements and implications of the articles included in the systematic review9,15,25,31,32,33,34.

Fig. 1: Study selection.

Flowchart detailing the systematic literature review conducted to identify suitable studies that have evaluated technical and/or image-related factors that may influence the performance of a DL algorithm in the detection of DR.

Table 1 Technical and image-related challenges to development of deep learning algorithms for ocular disease detection.
Table 2 Characteristics of included studies in systematic review.

Patient demographics and disease breakdown of datasets

Of the 38,185 eyes included in the Singapore national integrated Diabetic Retinopathy Screening Program (SiDRP) primary testing dataset, 8.4% had referable DR (n = 3192). The SiDRP source testing dataset had 3.8% referable DR (n = 1373) among 35,948 eyes. Comparable proportions were seen in the external testing datasets: the African American Eye Disease Study (AFEDS) had 6.4% referable DR (n = 90) among 1403 eyes, and the Singapore Epidemiology of Eye Diseases (SEED) dataset had 8.5% referable DR (n = 415) among 4910 eyes. These datasets, including detailed demographic characteristics and the breakdown into training and testing subsets, have been previously published15.

Technical factors: effect on performance

Diagnostic performances of the DL algorithms using different CNNs and computational frameworks are shown in Table 3. Regardless of the CNN or computational framework employed, all the DL algorithms achieved high diagnostic performance: the area under the receiver operating characteristic curve (AUC) ranged from 0.936 to 0.944, and sensitivities and specificities all exceeded 90% at the chosen classification thresholds. Newer CNNs showed no significant improvement in diagnostic performance: despite their increasing complexity in layers, ResNet (AUC 0.936; P = 0.581) and DenseNet (AUC 0.941; P = 0.410) performed similarly to the oldest CNN, VGGNet (AUC 0.938). However, an Ensemble of the three networks showed higher performance at detecting referable DR (AUC 0.944; P = 0.02). Sensitivities ranged from 91.9 to 94.0%, with the Ensemble producing the highest sensitivity, and specificities narrowly ranged from 90.7 to 91.0%. An example illustrating this consistency between the different CNNs is shown in Fig. 2a.

Table 3 Effect of technical factors, specifically convolutional neural networks and computational framework.
Fig. 2: Retinal image examples.

a Our results showed that different CNNs produce concordant classifications of referable or non-referable DR; these two images exhibit this agreement. b The choice of computational framework similarly does not significantly affect performance, as many images such as those depicted above are correctly classified as non-referable or referable DR by either framework. c Increasing the image compression level significantly affects the DL model's performance beyond the threshold of 250 KB, with a drop in sensitivity and specificity. These two photographs illustrate examples where a referable DR image is correctly identified as referable by the DL model when mild compression is introduced (i.e., a true positive case), but is misclassified as non-referable (i.e., a false negative case) with further compression beyond 250 KB, consistent with the drop in sensitivity beyond this threshold. The same is demonstrated for a case of non-referable DR, where higher compression causes a previously correctly classified image to be misclassified (i.e., a previously true negative result, now falsely classified as positive for disease), consistent with the drop in specificity. d Altering another image characteristic, the field of view, reduced sensitivity and specificity when 1-field instead of 2-field images were used. This example of referable DR had significant lesions in the inferonasal quadrant that would likely be missed using only a macula-centered image, consistent with the drop in sensitivity with 1-field images alone. Conversely, this example of a healthy retina captured dust particles in the superonasal and inferonasal quadrants that may have been misinterpreted by the DL algorithm as lesions, prompting the misclassification as referable DR and consistent with the drop in specificity.

Similarly, changing the computational frameworks used did not result in significant differences in diagnostic performance. Caffe and TensorFlow showed comparable performances with similar AUCs (0.936 vs 0.938; P = 0.736), sensitivities (90.5% vs 92.1%) and specificities (91.9% vs 91.0%). An example is displayed in Fig. 2b.

Image-related factors: effect on performance

Diagnostic performances of the DL algorithms using different image sizes, numbers of fields, and prior cataract surgery are shown in Tables 4–6, respectively. Varying these image characteristics had significant effects on the diagnostic performance of the DL algorithms. With progressive reduction in image size from the original 350 kilobytes (KB) to 300, 250, 200, and 150 KB, the AUC dropped progressively from 0.936 to 0.921, 0.900, 0.896, and 0.891, respectively, with the decrease reaching statistical significance below 250 KB in size (P < 0.001) and the AUC falling below 0.9. Although sensitivities remained high, ranging from 83.5 to 90.5%, owing to the previously fixed operating point, specificities dropped, culminating in a specificity of 72.4% for images of 150 KB. Figure 2c illustrates this threshold with examples of retinal images of referable DR that were identified correctly as referable under minimal compression but misclassified as non-referable when the images were compressed below 250 KB, and vice versa.

Table 4 Effect of image-related factors, specifically compression levels.
Table 5 Effect of image-related factors, specifically fundus fields of view.
Table 6 Effect of image-related factors, specifically previous cataract surgery.

Providing the DL algorithm with a greater number of fundus photography fields also improved performance. Comparing 2-field with 1-field images in the SiDRP dataset, AUC (0.936 vs 0.908; P < 0.001), sensitivity (90.5% vs 89.4%) and specificity (91.9% vs 89.4%) were all higher for the former. Examples of the effect of fundus fields of view on outcome are shown in Fig. 2d. The same trend was seen in the AFEDS dataset, where AUC (0.949 vs 0.911 vs 0.895), sensitivity (90.0% vs 82.6% vs 78.4%) and specificity (86.5% vs 84.4% vs 86.1%) improved with 7-field images compared to 2-field and 1-field images, respectively. Overall, the DL model's performance was best for 7-field, followed by 2-field and then 1-field input images.

The DL algorithm detected DR better in pseudophakic eyes than in phakic eyes: AUC (0.918 vs 0.833; P < 0.001), sensitivity (93.4% vs 91.1%) and specificity (84.2% vs 76.1%) were all higher. Of note, the specificity for detecting DR in phakic eyes fell below 80%, representing a large number of false positives, i.e., non-pathological images misclassified as referable DR.

Heatmaps

Subsidiary heatmap analysis was conducted to explore the reasons for the decrease in performance observed with increasing image compression. An example is shown in Fig. 3, which displays a heatmap of a healthy retina that was accurately classified by the DL model as having no DR when the original 350 KB image was provided. When the same image compressed to 150 KB was provided, it was falsely classified as positive for referable DR. The heatmap showed that pixelation of the retina caused by Joint Photographic Experts Group (JPEG) lossy compression was perceived by the DL algorithm as a pathological manifestation of DR, converting a true negative case into a false positive case.

Fig. 3: Heatmaps generated for compressed images.

Heatmaps showing the 'hot' areas on which the DL algorithm focuses its attention when making a diagnostic assessment of the retinal image, created using the Integrated Gradient method66. The colors overlaid on the greyscale retinal image mark the regions of interest, with red showing peak areas and blue showing background areas. The white box isolates an area of the image to illustrate the difference between the 350 KB and 150 KB images. a A fundus photograph of a healthy retina provided to the DL model as a 350 KB image. This was correctly classified by the DL model as a healthy retina with no DR; the heatmap shows focus on the normal optic disc and vasculature. b The same healthy retina compressed to 150 KB. This was misclassified by the DL algorithm as a retina with referable DR; the heatmap shows regions of interest beyond the normal optic disc. The magnification of one of these anomalous regions depicts pixelation, identified by the white arrows and ovals. These pixelations amalgamate into a mistaken pathological manifestation of DR, resulting in the false positive.

Discussion

Our study provides insights useful for the development of DL algorithms for detecting DR from retinal photographs. Overall, technical factors (CNN and computational framework) did not appear to affect the diagnostic performance of the DL algorithm, whereas image-related factors (e.g., image compression, number of fields, prior cataract surgery) had a greater and significant impact. First, our study shows that the choice of CNN and computational framework makes little difference when building the DL algorithm, as all combinations showed comparable diagnostic performance in detecting referable DR (AUC, sensitivity and specificity >0.90). Although combining three CNNs into an Ensemble model yielded a statistically higher performance (AUC 0.944 vs 0.938), this difference may not be clinically significant. Second, reduction in image size below 250 KB resulted in significantly lower performance of the DL algorithm, reducing specificity progressively to 88.8%, 85.3%, and 72.4%. The heatmap analysis showed that compressed retinal images of lower size were more pixelated and showed more areas of activity, although the changes were not obvious on the color photographs. Third, the DL algorithm's performance improved with an increased number of fields (7-field superior to 2-field, which was superior to 1-field), demonstrating the importance of covering more of the retina for DR screening. Fourth, lens status has important implications, with pseudophakic eyes associated with better diagnostic performance than phakic eyes.

Our study further supports existing literature showing that newer CNNs of increasing complexity do not greatly improve the performance of DL algorithms35, with the exception of an Ensemble of multiple networks, which often demonstrates superior results24,36. Previous studies examining different computational frameworks on general image classification tasks have also shown comparable performance37,38. Although no previous studies have specifically addressed the effect of retinal image compression on DL-based DR detection, our findings reinforce studies demonstrating the robustness of DL models to compression of general non-medical images up to a compression threshold23.

Possible explanations for our findings are as follows. Advances in DL methods have made it possible to exceed human performance, with error rates below 5%39. CNNs belonging to this era include ResNet and DenseNet, with VGGNet falling close behind40,41. It could be postulated that changes in DL model architecture do not affect performance significantly because the limiting factor is the quality of information the input images provide. To some extent, heatmaps provided the rationale behind the performance observed with compressed images. Increasing compression lowered the resolution to an extent where the image may not hold enough information to distinguish hemorrhages from the background, causing a decrease in sensitivity from 90.5 to 83.5%. It may also increasingly distort the picture so that normal retinal or vascular architecture is misinterpreted as pathological manifestations of DR such as hemorrhages or venous beading, causing a decrease in specificity42. United Kingdom national screening guidelines recommend that retinal images be compressed to no less than 400 KB, underscoring the importance of image size for adequate assessment of DR43.

It is apparent that a greater view of the retina allows more accurate diagnosis owing to the increase in information44,45. This would explain why the DL model's performance dropped when provided with only 1-field images (AUC 0.911 to 0.895) and improved when provided with 7-field images (AUC 0.911 to 0.949). Additional evidence suggesting that an estimated 30% of lesions are located around the Early Treatment Diabetic Retinopathy Study (ETDRS) 7 fields further supports the improvement with an increased number of fields46. However, it is interesting to note the high performance of the DL algorithm even with only 1-field images. Our study shows that the DL system yielded the best performance on 7-field retinal images, although this may not be practical in routine settings as it requires pupil dilation, experienced photographers, and patient compliance. Despite this, the DL system showed clinically acceptable outcomes (AUC > 0.90) on 1-field and 2-field photographs for DR screening. A possible reason for this high performance is the distribution of the DR manifestations important for diagnosis: some studies report a skewed topological distribution of DR lesions concentrating in the areas lateral to the macula and in the temporal retina, areas visible in the macula-centered image47,48. It should be noted that an increased number of fields also provides an increased number of images per eye, a possible confounding factor contributing to the better result. Phakic lens status and cataract, with their impact on media opacity and the gradability of retinal images, would also decrease specificity through increased false positives, which supports the better performance reported in pseudophakic eyes49.

Research groups around the world increasingly wish to develop their own DL algorithms tailored to specific purposes, aspiring to emulate the successes of previously published DL algorithms. In addition, many clinicians, healthcare professionals, and policymakers making decisions on the adoption of DL algorithms increasingly require a comprehensive guide to the clinical translatability of these algorithms in their specific clinical context. This paper provides broad guidance on the technical and image-related factors that should be considered during the development and deployment of DL models, concentrating on factors that vary with the intended purpose of the models and the resources available. Subsequent investigations may expand on this study to explore the effect of other factors governing the characteristics of the input images, such as comparisons between different fundus cameras: non-mydriatic and mydriatic, tabletop and handheld, color and monochromatic. This is motivated by our finding that the specifications of the input images may be the main limiting factor in improving an algorithm's performance.

Our current study has several limitations which should be acknowledged. The analysis of the five described factors (choice of CNN, computational framework, image compression, field of view, and previous cataract surgery) is not exhaustive. Other computational frameworks (e.g., PyTorch, CNTK), CNNs (e.g., Inception, AlexNet), and variations in the number of fundus fields of view (e.g., wide and ultrawide field retinal imaging) were not included in this analysis; including them all would not be pragmatic given the wide range of technologies currently available and the frequency with which novel ones are introduced. In our investigation of input image features and their effect on performance, we considered factors that would be clinically relevant and beneficial for real-world applications50. For example, the potential of DL to expand coverage of healthcare services to rural areas with limited access may require compromises on image size and the number of fields captured per eye to compensate for limited resources such as manpower, data storage, and connectivity. Although our study shows that the DL algorithm performed better in pseudophakic than phakic eyes, this may not change how we screen patients with diabetes. In addition, patients with early cataract could still have clear media allowing good quality retinal images, although such information was not captured in our study.

Another limitation is that this study focused only on DR detection from fundus imaging, so the findings may not be applicable to other ocular conditions and imaging modalities. Nevertheless, we feel that the technical and image-related factors identified as important in this study may be extrapolated to DL algorithms applied to the diagnosis of other ocular diseases from retinal images. The relative importance of these factors for the detection of other ocular diseases, or with alternative imaging modalities (such as optical coherence tomography), is an interesting area for further study.

In conclusion, our study provides a guide for researchers to understand the factors that may impact the development of DL algorithms for the detection of DR and other conditions from retinal photographs, particularly when using images from real-world populations. Image-related factors play more significant roles than technical factors in determining diagnostic performance, underscoring the importance of robust training and testing datasets for DL training and deployment in real-world settings. To ensure successful translation of a DR screening algorithm, it is important to consider technical factors (e.g., type of CNN, computational framework) and image-related factors (e.g., compression level, number of fields, media clarity, mydriatic status, retinal cameras, pigmentation of different races, disease prevalence, systemic vascular risk factors, concurrent ocular diseases, and reference standards).

Methods

We first conducted a systematic literature review on factors affecting DL algorithms in the detection of DR, searching PubMed, Web of Science, Scopus, ProQuest and IEEE Xplore for peer-reviewed studies published up to 20 September 2019. Keywords used were 'diabetic retinopathy', 'deep learning', 'technical factor' and 'image-related factor'.

Based on the systematic literature review (Fig. 1), we selected the following factors for analysis. First, we examined different CNNs within the DL algorithm architecture: VGGNet, ResNet, DenseNet and Ensemble (a combination of the three CNNs). Second, we evaluated the impact of altering the computational framework used to implement the coded CNNs: Caffe and TensorFlow. Third, we analyzed the effect of different image sizes generated through compression, specifically the five image sizes of 350 (original), 300, 250, 200, and 150 KB. Fourth, we examined varying numbers of input fields of view captured per eye: a 1-field macula-centered image; 2-field macula-centered and optic disc-centered images; and 7-field ETDRS standard fields. Fifth, we examined the impact of prior cataract surgery (pseudophakic versus phakic eyes) on the ability of the DL algorithm to accurately detect DR.

Study population dataset: training dataset

The DL algorithm was trained to detect referable DR with 76,370 retinal fundus photographs obtained from 13,099 patients in the SiDRP database between 2010 and 2013. DR was defined based on the International Classification Diabetic Retinopathy Severity Scale (ICDRSS) (Supplementary Fig. 1). SiDRP is a national screening program in Singapore, established in 2010, that utilizes a tele-ophthalmology platform in which fundus photographs captured in primary care clinics are transmitted to a centralized team of trained graders15,51. The 45-degree retinal fundus photographs were all taken with Topcon TRC-NW8 Non-Mydriatic Retinal Cameras in two fields of view per eye, an optic disc-centered image and a macula-centered image, with both eyes imaged per patient. The two-field images in the training dataset averaged 350 KB in size. The training dataset included phakic and pseudophakic eyes. One round of training was conducted, with no further re-training of the algorithm. To collate a robust training dataset, two senior certified non-medical graders with more than five years' experience graded each eye; discordant grades were arbitrated by a retinal specialist. Poor quality images deemed ungradable by the senior non-medical professional graders, especially those with more than a third of the photograph obscured, were removed from the training and testing datasets and were not used in the analysis of the DL algorithm.

Technical factors: different CNNs

Three CNNs were explored in this study, namely VGGNet, ResNet and DenseNet; the three CNNs were further combined into an Ensemble model. The CNNs were built using TensorFlow with the Keras API specification (Fig. 4). Pre-trained models enabling transfer learning were available in the Keras library. All models had a final max-pooling layer added before the output layer. An illustrative code sketch of this setup follows the list below.

1. VGGNet: This is a 16-layered network, designed by the Visual Geometry Group at Oxford in 2014. VGGNet has been popular due to its excellent performance on the classification of retinal images15,52,53. An overview of the architecture can be seen in Fig. 4a.

2. ResNet: For this study, we used ResNet-50, consisting of 50 layers, which surpassed human performance with an error rate of 3.6%54. It has recently been widely used to detect age-related macular degeneration, diabetic macular edema, and glaucoma24,55,56. ResNet's rise in popularity is attributed to its ability to increase the depth of the network architecture through 'skip' residual connections that perform identity mappings, increasing accuracy while remaining easy to train (Fig. 4b).

3. DenseNet: This CNN consists of 121 layers densely connected by concatenating sequential layers in a feedforward fashion, achieving increased depth of deep CNNs efficiently (Fig. 4c)40.

4. Ensemble: The Ensemble consists of the above three networks (VGGNet, ResNet, and DenseNet), and its output is the average of the component networks' outputs per eye. Its performance is expected to match or exceed that of the single CNNs (Fig. 4d)57.
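The following is a minimal sketch of how such a setup could look with the TensorFlow/Keras applications API. It is an illustration under stated assumptions, not the authors' code: the exact backbone variants, the interpretation of the final max-pooling layer as global max-pooling, and the two-class softmax head are our assumptions.

```python
# Minimal sketch (not the authors' code) of three ImageNet-pretrained CNNs
# with a max-pooling layer before a 2-class softmax head, plus an ensemble
# that averages their per-eye probability outputs.
import numpy as np
from tensorflow.keras import layers, models, applications

def build_classifier(backbone_fn, input_shape=(512, 512, 3)):
    """Wrap a pretrained backbone for referable vs non-referable DR."""
    backbone = backbone_fn(include_top=False, weights="imagenet",
                           input_shape=input_shape)
    x = layers.GlobalMaxPooling2D()(backbone.output)  # assumed pooling layer
    outputs = layers.Dense(2, activation="softmax")(x)
    return models.Model(backbone.input, outputs)

vgg = build_classifier(applications.VGG16)        # assumed VGGNet variant
resnet = build_classifier(applications.ResNet50)
densenet = build_classifier(applications.DenseNet121)

def ensemble_predict(image_batch):
    """Average the probability outputs of the three component networks."""
    probs = [m.predict(image_batch) for m in (vgg, resnet, densenet)]
    return np.mean(probs, axis=0)
```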

Fig. 4: Convolutional neural networks investigated.

The architecture of convolutional neural networks (CNNs) is based on a few general principles. The network is composed of mathematically weighted neurons forming sequential layers, with signal passing from the input through to the output layers. For this study, each input image was pre-processed by scaling to a fixed template of 512 × 512 pixels. These images were then represented as a matrix of Red Green Blue (RGB) values in the input layer. Sequential convolutions were conducted by superimposing a weighted kernel over these input maps, with our study using a 3 × 3 weighted kernel with subsequent max-pooling. The output layer utilizes a softmax classifier to generate probability values for the pre-defined output classes15,32,52. a VGGNet is the oldest CNN used in this comparison, released in 2014. Despite its standard uniform architecture composed of 16 layers, it has had great success at feature extraction53. b ResNet has been highly favored since its introduction in 2015, with its atypical architecture utilizing skip residual connections (visualized as blue arrows) to bypass signals across layers. This allows an increase in layers without compromising the ease of training, resulting in supra-human performance with a 3.6% top-5 error rate54. c DenseNet is a newer CNN, released in 2017, that has been shown to perform better than ResNet. Its architecture builds on a principle similar to that of ResNet, but with a dense connectivity pattern in which each layer receives information from all preceding layers, as shown by the green arrows. This allows concatenation of sequential layers, compacting the network into a 'denser' configuration40. d Ensemble is a combination of the three networks' probability output scores generated per eye, obtained by taking the mean value.

Technical factors: different computational frameworks

The DL algorithms were then constructed using two open-source computational frameworks, Caffe and TensorFlow, to implement VGGNet in Python (a programming language).

1. Caffe: This was reconstructed from the reference paper, initialized with Xavier initialization, and included an extra module of convolutional and pooling layers to cater for the increased input image resolution of 512 × 51258. This VGGNet Caffe DL algorithm served as the control architecture for the analysis of the image-related factors.

2. TensorFlow: In this study, TensorFlow was implemented with the Keras API specification, and a single pooling layer preceding the output layer was added to VGGNet59.

Image-related factors: different image compression

Increasing compression may ease transmission in tele-ophthalmology but may compromise image quality. To investigate the effect of input image compression on performance, 71,896 original images (35,948 eyes of 14,880 patients) taken from SiDRP between 2014 and 2015 were used as the source for five distinct sets of fundus images at different compression levels, yielding 359,480 retinal fundus images in total. These were 45-degree fundus fields taken with Topcon TRC-NW8 Non-Mydriatic Retinal Cameras. The five sets were then used to test the DL algorithm, which had been trained on the original 350 KB images. The images were compressed from an average of 350 KB JPEG images to four additional levels of compression, averaging 300, 250, 200, and 150 KB in size, respectively. This was achieved with a standard JPEG compression algorithm using the Independent JPEG Group's library of quality levels; we used a publicly available implementation from the OpenCV library.
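As an illustration, the sketch below shows one way such compressed sets could be generated with OpenCV's JPEG encoder. The quality-search loop, target tolerance, and file names are our assumptions for illustration, not the study's exact pipeline.

```python
# Illustrative sketch (not the study's exact pipeline) of compressing a
# fundus photograph to approximate target file sizes with OpenCV's JPEG
# encoder, by lowering the JPEG quality level until the target is met.
import cv2

def compress_to_target_kb(image, target_kb):
    """Return JPEG bytes no larger than roughly target_kb kilobytes."""
    buf = None
    for quality in range(95, 4, -5):  # step down through quality levels
        ok, buf = cv2.imencode(".jpg", image,
                               [cv2.IMWRITE_JPEG_QUALITY, quality])
        if ok and buf.nbytes <= target_kb * 1024:
            break
    return buf.tobytes()

image = cv2.imread("fundus.jpg")      # hypothetical input file
for kb in (300, 250, 200, 150):       # compression levels used in this study
    with open(f"fundus_{kb}kb.jpg", "wb") as f:
        f.write(compress_to_target_kb(image, kb))
```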

Image-related factors: different fundus field of view

We evaluated the effect of different fundus fields of view on the performance of the DL algorithm by comparing (1) 1-field macula-centered and (2) 2-field optic disc-centered and macula-centered fundus photographs. Both subsets were taken from data collected from SiDRP between 2014 and 2015, containing 35,948 eyes. Macula-centered or optic disc-centered is defined as the macula or the optic disc, respectively, being located less than one disc diameter from the center of the image. In addition, we examined an increasing number of fields, comparing (1) 1-field, (2) 2-field, and (3) 7-field fundus imaging. For this analysis, we used an external testing dataset from AFEDS, which obtained the ETDRS stereoscopic reference standard of 7-field fundus imaging60,61. 7-field, 2-field, and 1-field retinal images collected from a fixed set of 1403 eyes in this dataset were used, totaling 9821 images.
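The paper does not specify how per-field predictions are combined into a single per-eye result, so the following is purely a hypothetical illustration of one plausible rule: flag an eye as referable if any of its fields looks referable.

```python
# Hypothetical illustration only: the study does not state its per-eye
# aggregation rule. One plausible rule takes the maximum referable-DR
# probability across all fields captured for an eye.
from typing import List

def eye_level_probability(field_probs: List[float]) -> float:
    """Combine per-field referable-DR probabilities into one per-eye score."""
    return max(field_probs)

# e.g., a 2-field eye (macula-centered, optic disc-centered):
print(eye_level_probability([0.12, 0.67]))  # -> 0.67; referable if > threshold
```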

Image-related factors: previous cataract surgery

To assess the impact of previous cataract surgery on the ability of the DL algorithm to detect DR on fundus photographs, we employed an external testing dataset of retinal images from participants of the SEED study (baseline, 2004-11). The SEED study is a population-based epidemiologic study comprising the three major ethnic groups in Singapore: Malay patients were recruited from 2004 to 2006, Indian patients from 2007 to 2009, and Chinese patients from 2009 to 201162,63,64,65. Phakic and pseudophakic eyes in this dataset were separated, and the two groups were compared. Among the 4910 eyes (9820 images) included, 1612 eyes were phakic and 3298 eyes were pseudophakic.

Heatmap

Heatmaps were generated to provide insight into the 'black box' nature of DL by visualizing the areas on which the DL system focuses. The Integrated Gradient method was used to generate these heatmaps66.
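A minimal sketch of the standard Integrated Gradients formulation is shown below for illustration; the black baseline image, step count, and TensorFlow implementation details are our assumptions rather than the study's published code.

```python
# Minimal sketch of Integrated Gradients (standard formulation); baseline,
# step count, and framework details are assumptions, not the study's code.
import tensorflow as tf

def integrated_gradients(model, image, target_class, steps=50):
    """(x - baseline) times the average gradient along the straight-line
    path from a black baseline to the input image (Riemann approximation)."""
    image = tf.convert_to_tensor(image, dtype=tf.float32)  # shape (H, W, 3)
    baseline = tf.zeros_like(image)
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps), (-1, 1, 1, 1))
    interpolated = baseline + alphas * (image - baseline)  # (steps, H, W, 3)
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        probs = model(interpolated)[:, target_class]
    grads = tape.gradient(probs, interpolated)
    avg_grads = tf.reduce_mean(grads, axis=0)
    return ((image - baseline) * avg_grads).numpy()

# The attribution map can be summed over color channels and overlaid on a
# greyscale fundus image to produce a heatmap like Fig. 3.
```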

Reference standard of testing dataset

The reference standard for the severity of DR of each eye in the SiDRP and SEED testing datasets was the grading assessment of an ophthalmologist sub-specializing in retinal diseases, with over five years' experience in assessing DR. For the AFEDS testing dataset, concurring assessments from two retinal specialists were used as the reference standard. Grading was conducted in accordance with the ICDRSS; for the purposes of this study, we reclassified this scale into a binary outcome of referable or non-referable DR. Referable DR is defined as moderate non-proliferative DR or worse, including diabetic macular edema.
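The binary mapping can be made explicit with a short sketch; the integer encoding of the ICDRSS grades below is our assumption for illustration.

```python
# Illustrative sketch of the binary reference standard; the integer
# encoding of ICDRSS grades (0 no DR ... 4 proliferative DR) is assumed.
def is_referable(icdrss_grade: int, has_dme: bool) -> bool:
    """Referable DR: moderate non-proliferative DR (grade 2) or worse,
    or diabetic macular edema of any grade."""
    return icdrss_grade >= 2 or has_dme

assert is_referable(1, True)       # mild DR with DME -> referable
assert not is_referable(1, False)  # mild DR alone -> non-referable
```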

Statistical analysis

We used the following primary outcome measures of the DL algorithm's performance in detecting referable DR: AUC, sensitivity, and specificity with respect to the reference standard. Operating thresholds were pre-set during training for each technical variant of the DL algorithm: sensitivities and specificities were calculated across a range of thresholds on the training dataset, and the optimal threshold for each DL algorithm was chosen to achieve 90% sensitivity. The 95% confidence intervals (CIs) for sensitivity and specificity were calculated with the exact Clopper-Pearson method, and those for AUC with an empirical bootstrap of 5000 replicates. The statistical significance of differences between AUCs was also evaluated using the empirical bootstrap with 5000 replicates. All statistical analyses were performed using R statistical software (version 3.5.1; R Foundation for Statistical Computing, Vienna, Austria). A P value less than 0.05 was considered statistically significant.
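The analyses were performed in R; for illustration, the sketch below shows equivalent computations in Python: an exact Clopper-Pearson interval and an empirical bootstrap of an AUC difference. The two-sided bootstrap P value shown is one common approximation, not necessarily the study's exact procedure.

```python
# Illustrative Python equivalents (the study used R): Clopper-Pearson CI
# and an empirical bootstrap (5000 replicates) for an AUC difference.
import numpy as np
from scipy.stats import beta
from sklearn.metrics import roc_auc_score

def clopper_pearson(successes, trials, alpha=0.05):
    """Exact binomial CI, e.g., for sensitivity = TP / (TP + FN)."""
    lo = beta.ppf(alpha / 2, successes, trials - successes + 1) if successes > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, successes + 1, trials - successes) if successes < trials else 1.0
    return lo, hi

def bootstrap_auc_diff(y, scores_a, scores_b, n_boot=5000, seed=0):
    """Resample eyes with replacement; return the mean AUC difference and a
    two-sided P value (a common sign-crossing approximation).
    y, scores_a, scores_b: 1-D numpy arrays over the same eyes."""
    rng = np.random.default_rng(seed)
    diffs = []
    n = len(y)
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:  # resample must contain both classes
            continue
        diffs.append(roc_auc_score(y[idx], scores_a[idx]) -
                     roc_auc_score(y[idx], scores_b[idx]))
    diffs = np.asarray(diffs)
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), min(p, 1.0)
```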

Ethics approval

Our study was approved by the centralized institutional review board (IRB) of SingHealth, Singapore (IRB reference number 2018/2433) and conducted in accordance with the Declaration of Helsinki. The requirement for patient informed consent was waived by the IRB because the study retrospectively used fully anonymized images.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.