Technical and imaging factors influencing performance of deep learning systems for diabetic retinopathy

Yip, Michelle Y. T.; Lim, Gilbert; Lim, Zhan Wei; Nguyen, Quang D.; Chong, Crystal C. Y.; Yu, Marco; Bellemo, Valentina; Xie, Yuchen; Lee, Xin Qi; Hamzah, Haslina; Ho, Jinyi; Tan, Tien-En; Sabanayagam, Charumathi; Grzybowski, Andrzej; Tan, Gavin S. W.; Hsu, Wynne; Lee, Mong Li; Wong, Tien Yin; Ting, Daniel S. W.

doi:10.1038/s41746-020-0247-1

Article
Open access
Published: 23 March 2020

Michelle Y. T. Yip^1,2,
Gilbert Lim ORCID: orcid.org/0000-0002-5381-9250^1,3,
Zhan Wei Lim³,
Quang D. Nguyen¹,
Crystal C. Y. Chong¹,
Marco Yu ORCID: orcid.org/0000-0002-2825-8914¹,
Valentina Bellemo¹,
Yuchen Xie¹,
Xin Qi Lee¹,
Haslina Hamzah¹,
Jinyi Ho¹,
Tien-En Tan¹,
Charumathi Sabanayagam^1,2,
Andrzej Grzybowski^4,5,
Gavin S. W. Tan^1,2,
Wynne Hsu³,
Mong Li Lee³,
Tien Yin Wong^1,2 &
…
Daniel S. W. Ting ORCID: orcid.org/0000-0003-2264-7174^1,2,6

npj Digital Medicine volume 3, Article number: 40 (2020)

Abstract

Deep learning (DL) has been shown to be effective in developing diabetic retinopathy (DR) algorithms, possibly tackling financial and manpower challenges hindering implementation of DR screening. However, our systematic review of the literature reveals that few studies have examined the impact of different factors on these DL algorithms, even though such factors are important for clinical deployment in real-world settings. Using 455,491 retinal images, we evaluated two technical and three image-related factors in detection of referable DR. For technical factors, the performances of four DL models (VGGNet, ResNet, DenseNet, Ensemble) and two computational frameworks (Caffe, TensorFlow) were evaluated, while for image-related factors, we evaluated image compression levels (reducing image size: 350, 300, 250, 200, 150 KB), number of fields (7-field, 2-field, 1-field) and media clarity (pseudophakic vs phakic). In detection of referable DR, the four DL models showed comparable diagnostic performance (AUC 0.936-0.944). For the VGGNet model, the two computational frameworks had similar AUC (0.936 vs 0.938). DL performance dropped when image size decreased below 250 KB (AUC 0.936 vs 0.900, p < 0.001). DL performance improved with an increased number of fields (dataset 1: 2-field vs 1-field, AUC 0.936 vs 0.908, p < 0.001; dataset 2: 7-field vs 2-field vs 1-field, AUC 0.949 vs 0.911 vs 0.895). DL performed better in pseudophakic than in phakic eyes (AUC 0.918 vs 0.833, p < 0.001). Image-related factors play more significant roles than technical factors in determining diagnostic performance, suggesting the importance of having robust training and testing datasets for DL training and deployment in real-world settings.

Introduction

Diabetic retinopathy (DR) is a major cause of blindness^1,2. Cost-effective strategies for DR management include routine screening using retinal photographs and having referable cases (typically moderate or worse DR and/or diabetic macular edema) managed by eye care specialists^3,4,5. Recently, deep learning (DL) using convolutional neural networks (CNNs) has sparked tremendous interest in medicine⁶. In ophthalmology, many DL algorithms and systems have been reported to achieve robust performances in detecting various ocular diseases from retinal photographs^7,8,9, especially for DR^10,11,12,13. Despite the substantial promise of DL technology, it is unclear what factors may influence the performance of a DL algorithm¹⁴. Currently, many research groups have developed different DL algorithms using different datasets and different techniques, and comprehensive guidelines on best practices are not yet available^{8,9,15,16,17,18,19,20}.

There have been many studies, primarily in computer vision, exploring various factors that seek to optimize DL algorithms, albeit individually analyzed, thus making consistency difficult to achieve^{21,22,23,24,25}. In addition, some factors involved in algorithm design such as input size and field have been overlooked and underestimated²⁶. Some groups have looked at a compilation of technical factors postulated to be critical in the development of a DL algorithm in the clinical setting in detection of pathologies from radiological imaging^27,28,29. However, these factors are largely domain specific (i.e., radiology)^27,30. Thus, factors relevant in ophthalmology and particularly in the area of fundus imaging have yet to be explored.

The objective of this study is to systematically review the current literature investigating possible factors that may influence the performance of a DL algorithm in detecting DR from fundus photographs. We then specifically evaluated some of the factors that may affect the performance of a DL algorithm. This study provides insights into technical and image-related factors that may impact future developments of DL systems for retinal image analysis, especially in the context of tele-ophthalmology settings.

Results

Systematic review of literature

The results of our systematic review of literature are detailed in Fig. 1. Our search yielded 222 results, of which seven studies were identified to demonstrate evaluation of technical or image-related factors in DR detection by a DL algorithm. Table 1 displays the various factors and research questions addressed by the respective studies, demonstrating the focus on image-related factors by previous analyses, ranging from training dataset sizes to retinal camera specifications. Table 2 details the essential components, outcome measurements and implications of the articles included in the systematic review^{9,15,25,31,32,33,34}.

Table 1 Technical and image-related challenges to development of deep learning algorithms for ocular disease detection.

Table 2 Characteristics of included studies in systematic review.

Patient demographics and disease breakdown of datasets

Out of a total of 38,185 eyes included in the Singapore’s national integrated Diabetic Retinopathy Screening Program (SiDRP) primary training dataset, 8.4% had referable DR (n = 3192). The SiDRP source testing dataset used had 3.8% referable DR (n = 1373) in 35,948 eyes. Among the external testing datasets, the African American Eye Disease Study (AFEDS) had 6.4% referable DR (n = 90) within a total of 1403 eyes and the Singapore Epidemiology of Eye Diseases (SEED) dataset had 8.5% referable DR (n = 415) within a total of 4910 eyes. These datasets, including detailed demographic characteristics and the breakdown into training and testing subsets, have been previously published¹⁵.

Technical factors: effect on performance

Diagnostic performances of the DL algorithms using different CNNs and computational frameworks are shown in Table 3. Regardless of the CNN or computational framework employed in this study, all the DL algorithms achieved high diagnostic performance: the area under the receiver operating characteristic curve (AUC) ranged from 0.936 to 0.944, and sensitivities and specificities all exceeded 90% at the chosen classification thresholds. Newer CNNs showed no significant improvement in diagnostic performance. Compared to the oldest CNN, VGGNet (AUC 0.938), ResNet (AUC 0.936; P = 0.581) and DenseNet (AUC 0.941; P = 0.410) had similar performance despite their increasing complexity in layers. However, an Ensemble of the three networks showed higher performance at detecting referable DR (AUC 0.944; P = 0.02). Sensitivities ranged from 91.9 to 94.0%, with the Ensemble producing the highest sensitivity, and specificities ranged narrowly from 90.7 to 91.0%. An example illustrating this consistency in performance between the different CNNs is shown in Fig. 2a.

Table 3 Effect of technical factors specifically convolutional neural networks and computational framework.

Similarly, changing the computational frameworks used did not result in significant differences in diagnostic performance. Caffe and TensorFlow showed comparable performances with similar AUCs (0.936 vs 0.938; P = 0.736), sensitivities (90.5% vs 92.1%) and specificities (91.9% vs 91.0%). An example is displayed in Fig. 2b.

Image-related factors: effect on performance

Diagnostic performances of the DL algorithms using different image sizes, numbers of fields, and prior cataract surgery are shown in Tables 4–6, respectively. Variation of these image characteristics had significant effects on diagnostic performance of the DL algorithms. With progressive reduction in image size from the original 350 kilobytes (KB) to 300, 250, 200, and 150 KB, AUC dropped progressively from 0.936 to 0.921, 0.900, 0.896, and 0.891, respectively. The decreases reached statistical significance below 250 KB in size (P < 0.001), where the AUC also fell below the 0.9 mark. Although sensitivities remained high, ranging from 83.5 to 90.5%, because of the previously fixed operating point, specificities dropped, culminating in a specificity of 72.4% when images of 150 KB in size were used. Figure 2c illustrates this threshold with examples of retinal images of referable DR that were identified correctly as referable under minimal compression but were misclassified as non-referable once compression reduced image size below 250 KB, and vice versa.

Table 4 Effect of image-related factors specifically compression levels.

Table 5 Effect of image-related factors specifically fundus fields of view.

Table 6 Effect of image-related factors specifically previous cataract surgery.

Providing the DL algorithm with an increased number of fields of fundus photography similarly improved performance. Comparing 2-field with 1-field images in the SiDRP dataset, AUC (0.936 vs 0.908; P < 0.001), sensitivity (90.5% vs 89.4%) and specificity (91.9% vs 89.4%) were higher for the former. Examples of the effect of fundus fields of view on outcome are shown in Fig. 2d. This trend is similarly seen in the AFEDS dataset, where the AUC (0.949 vs 0.911 vs 0.895), sensitivity (90.0% vs 82.6% vs 78.4%) and specificity (86.5% vs 84.4% vs 86.1%) improved when using 7-field images compared to 2-field and 1-field images, respectively. Overall, this shows that the DL model’s performance was best for 7-field, followed by 2-field and then 1-field input images.

Previous cataract surgery was associated with an improvement in the DL algorithm’s ability to detect DR: in pseudophakic eyes compared to phakic eyes, AUC (0.918 vs 0.833; P < 0.001), sensitivity (93.4% vs 91.1%) and specificity (84.2% vs 76.1%) were all higher. It is of note that the specificity of detecting DR in phakic eyes falls below 80%, representing a large number of false positives, that is, non-pathological images misclassified as referable DR.

Heatmaps

Subsidiary heatmap analysis was conducted to explore the reason for the decrease in performance observed in images of increasing compression. An example of the heatmaps is shown in Fig. 3. It shows a heatmap of a healthy retina that was accurately classified by the DL model as having no DR when the original 350 KB image was provided. When the model was provided with the compressed image of 150 KB in size, the same retina was falsely classified as positive for referable DR. The heatmap showed that pixelation of the retina caused by the Joint Photographic Experts Group (JPEG) lossy compression was perceived by the DL algorithm as a pathological manifestation of DR, thus converting a true negative case into a false positive case.

Discussion

Our study provides insights that are useful for the development of DL algorithms for detecting DR from retinal photographs. Overall, for DR detection from retinal images, technical factors (CNN and computational framework) do not appear to affect the diagnostic performance of the DL algorithm, whereas image-related factors (e.g., image compression, number of fields, prior cataract surgery) had a greater and significant impact. First, our study shows that the choice of CNN and computational framework used to build the DL algorithm makes little difference, as all showed comparable diagnostic performance in detecting referable DR (AUC, sensitivity and specificity >0.90). Although combining three CNNs into an Ensemble model yielded a statistically higher performance (AUC 0.944 vs 0.938), this may not be clinically significant. Second, reduction in image size below 250 KB results in significantly lower performance of the DL algorithm, in particular reducing specificity to 88.8, 85.3, 72.4%. From the heatmap analysis, compressed retinal images with lower image size were more pixelated and had more activity areas, although the changes were not obvious on the color photographs. Third, the performance of the DL algorithm improved with an increased number of fields (7-field was superior to 2-field, which was superior to 1-field), demonstrating the importance of covering more retinal area for DR screening. Fourth, lens status has important implications, with pseudophakic eyes associated with improved diagnostic performance when compared to phakic eyes.

Our study further supports existing literature demonstrating that the use of newer CNNs with increasing complexity does not greatly improve the performance of DL algorithms³⁵. The exception is an Ensemble of multiple networks, which has often demonstrated superior results^24,36. Previous studies comparing the accuracy of different computational frameworks at general image classification tasks also showed comparable performance^37,38. Although there have not been specific studies addressing the effect of compression of retinal images on DL algorithms’ detection of DR, our study reinforces previous studies that have demonstrated the robustness of DL models to compression of general non-medical images up to a compression threshold²³.

Possible explanations for our findings are as follows. Advances in DL methods have made it possible to exceed human performance with error rates below 5%³⁹. CNNs that belong to this era include ResNet and DenseNet, with VGGNet falling close behind^40,41. It could be postulated that changes in DL model architecture may not affect the performance significantly because the limiting factor is the quality of information the input images provide. To some extent, heatmaps provided the rationale behind the performance observed when utilizing compressed images. Increasing compression resulted in lower resolution to an extent where the image may not hold enough information to distinguish hemorrhages from the background, thus causing a decrease in sensitivity from 90.5 to 83.5%. It may also result in increasing distortion of the picture where normal retina or vascular architecture may be misinterpreted as pathological manifestations of DR such as hemorrhages or venous beading, causing a decrease in specificity⁴². United Kingdom national screening guidelines recommend that retinal images be compressed to no less than 400 KB, implying the importance of image size in adequate assessment of DR⁴³.

It is apparent that a greater view of the retina allows for more accurate diagnosis due to an increase in information^44,45. This would explain why the DL models’ performance dropped when provided with only 1-field (from AUC 0.911 to 0.895) and improved when provided with 7-fields (from AUC 0.911 to 0.949). Additional evidence that suggests an estimated 30% of lesions are located around the Early Treatment Diabetic Retinopathy Study (ETDRS) 7-fields further supports the improvement seen with an increased number of fields⁴⁶. However, it is interesting to note the high performance of the DL algorithm despite providing only 1-field. Our study shows that the DL system yielded best performance on 7-field retinal images, although this may not be practical in the routine setting as it requires pupil dilation, experienced photographers and patients’ compliance. Despite this, the DL system showed clinically acceptable outcomes (AUC > 0.90) on 1-field and 2-field photographs for DR screening. A possible reason for this high performance is the distribution of the manifestations of DR that are important for diagnosis: some studies report a skewed topological distribution of DR lesions, concentrating in the areas lateral to the macula and in the temporal retina, which are areas visible in the macula-centered image^47,48. It should be noted that there is a possible confounding factor, in that an increased number of fields provides an increased number of images per eye and thus a better result. Phakic lens status and cataract, with their resultant impact on media opacity and the gradability of retinal images, would also result in a decrease in specificity from the increased false positives observed, and this supports the better performance reported in pseudophakic eyes⁴⁹.

There is an increasing desire for research groups around the world to develop their own DL algorithms tailored to their specific purpose, with the aspiration to emulate the successes of previously published DL algorithms. In addition, many clinicians, healthcare professionals and policymakers making decisions on the adoption of DL algorithms increasingly require a comprehensive guide on the clinical translatability of these algorithms in the specific clinical context they operate in. This paper provides broad guidance on the technical and image-related factors that should be considered during the development and deployment of DL models, concentrating on factors that would vary based on the intended purpose of the DL models and the resources available. Subsequent investigations may consider expanding on this study to explore the effect of altering other factors governing the characteristics of the input images, such as comparisons between a myriad of fundus cameras: non-mydriatic and mydriatic, table top and handheld, color and mono-chromatic. This is motivated by the finding that the specifications of input images may be the main limiting factor to improving the algorithm’s performance.

Our current study has several limitations which should be acknowledged. The analysis of the following five described factors: choice of CNN, computational framework, image compression, field of view, and previous cataract surgery is not exhaustive. There are other computational frameworks (e.g., PyTorch, CNTK), CNNs (e.g., Inception, AlexNet), and variations in number of fundus fields of view (e.g., wide and ultrawide field retinal imaging) that were not included in this analysis. Inclusion of these would not be pragmatic given the wide range of technologies currently available with novel ones being introduced frequently. In our investigation into the features of the input image and the effect on performance, we considered factors that would be clinically relevant and beneficial for real-world applications⁵⁰. For example, the potential of DL to expand coverage of healthcare services to rural areas with limited access required compromise on the image size and the number of fields captured per eye for screening purposes to compensate for limited economic resources such as manpower, data storage, and connectivity. Although our study shows that the DL performed better in the pseudophakic than phakic eyes, this may not change how we would screen for patients with diabetes. In addition, the patients with early cataract could still have clear media to allow good quality retinal images, although we do not have such information captured in our study.

Another limitation is this study focused only on DR detection from fundus imaging, thus the findings may not be applicable to other ocular conditions and imaging modalities. Nevertheless, we feel that the technical and image-related factors that we have identified as important in this study may be extrapolated to DL algorithms being applied for diagnosis of other ocular diseases from retinal images. The relative importance of these factors when applied to detection of other ocular diseases, or using alternative imaging modalities (such as optical coherence tomography) is an interesting area for further study.

In conclusion, our study provides a guide for researchers to understand the factors that may impact the development of DL algorithms for detection of DR and other conditions from retinal photographs, particularly when using images from real-world populations. Various image-related factors play more significant roles than technical factors in determining the diagnostic performance, suggesting the importance of having robust training and testing datasets for DL training and deployment in the real-world settings. In order to ensure a successful translation of a DR screening algorithm, it is important to consider technical factors (e.g., types of CNN, computational framework) and image-related factors (e.g., compression levels, number of fields, media clarity, mydriatic status, retinal cameras, pigmentation of different races, disease prevalence, systemic vascular risk factors, concurrent ocular diseases and reference standards).

Methods

We first conducted a systematic literature review on factors affecting DL algorithms in detection of DR using search engines PubMed, Web of Science, Scopus, ProQuest and IEEE Xplore searching for peer-reviewed studies up to 20 September 2019. Keywords used were ‘diabetic retinopathy’, ‘deep learning’, ‘technical factor’ and ‘image-related factor’.

Based on the systematic literature review (Fig. 1), we then selected the following factors for analysis. First, we looked at different CNNs within the DL algorithm architecture: VGGNet, ResNet, DenseNet and Ensemble (a combination of the aforementioned three CNNs). Second, we evaluated the impact of altering the computational framework used to implement the coded CNNs: Caffe and TensorFlow. Third, we analyzed the effect of different image sizes generated through the process of compression, specifically looking at the five image sizes 350 (original), 300, 250, 200, and 150 KB. Fourth, we looked at various numbers of input fields of view captured per eye: 1-field macula-centered image, 2-field macula-centered and optic disc-centered images, and 7-field ETDRS standard fields. Fifth, we looked at the impact of prior cataract surgery, comparing pseudophakic eyes with phakic eyes, on the ability of the DL algorithm to accurately detect DR.

Study population dataset: training dataset

The DL algorithm was trained to detect referable DR with 76,370 retinal fundus photographs obtained from 13,099 patients from the database of SiDRP between 2010 and 2013. The definition of DR was based on the International Classification Diabetic Retinopathy Severity Scale (ICDRSS) (Supplementary Fig. 1). SiDRP is a national screening program in Singapore established in 2010 which utilizes a tele-ophthalmology platform where fundus photographs captured in primary care clinics are transmitted to a centralized team of trained graders^15,51. The 45-degree retinal fundus photographs were all taken with Topcon TRC-NW8 Non-Mydriatic Retinal Cameras in two fields of view per eye, an optic disc-centered image and a macula-centered image, with both eyes photographed per patient. These two-field images in the training dataset averaged 350 KB each in size. This training dataset included phakic and pseudophakic eyes. One round of training was conducted with no further re-training of the algorithm. For the purposes of collating a robust training dataset for the DL algorithms, two senior certified non-medical graders with more than five years’ experience were tasked to grade each eye. Discordant grades between the two graders were arbitrated by a retinal specialist. Poor quality images deemed ungradable by the senior non-medical professional graders were removed from the training and testing datasets, especially if more than a third of the photograph was obscured. These images were not used in the analysis of the DL algorithm.

Technical factors: different CNNs

Three CNNs were explored in this study, namely VGGNet, ResNet and DenseNet. Additionally, the three CNNs were further combined into an Ensemble model. The CNNs were built using TensorFlow with Keras API specification (Fig. 4). Pre-trained models with transfer learning were available in the Keras library. All models had a final max-pooling layer added before the output layer.

(1) VGGNet: This is a 16-layered network, designed by the Visual Geometry Group in Oxford in 2014. VGGNet has been popular due to its excellent performance on the classification of retinal images^15,52,53. An overview of the architecture can be seen in Fig. 4a.
(2) ResNet: For this study, we used ResNet-50, consisting of 50 layers, which surpassed human performance with an error rate of 3.6%⁵⁴. Recently, it has been widely used to detect age-related macular degeneration, diabetic macular edema, and glaucoma^24,55,56. ResNet’s rise in popularity is attributed to its ability to increase depth of the network architecture through ‘skip’ residual connections equipped to perform identity mappings, thus increasing accuracy whilst still being easy to train (Fig. 4b).
(3) DenseNet: This CNN consisted of 121 layers densely connected through concatenating sequential layers in a feedforward fashion to achieve increased depth of deep CNNs efficiently (Fig. 4c)⁴⁰.
(4) Ensemble: The Ensemble consists of the above three networks (VGGNet, ResNet, and DenseNet), and its output is the average of the outputs of the component networks per eye. Performance is expected to match or exceed that of single CNNs (Fig. 4d)⁵⁷.
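The per-eye averaging used by the Ensemble can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the example scores and the 0.5 threshold are hypothetical, and each component network is assumed to emit a single referable-DR probability per eye.

```python
from statistics import mean

def ensemble_score(per_model_scores):
    """Average the referable-DR scores produced by each component network."""
    if not per_model_scores:
        raise ValueError("need at least one model score")
    return mean(per_model_scores.values())

def classify_eye(per_model_scores, threshold):
    """Flag an eye as referable when the averaged score reaches the threshold."""
    return ensemble_score(per_model_scores) >= threshold

# Hypothetical per-eye outputs from the three component networks.
eye_scores = {"vggnet": 0.62, "resnet": 0.48, "densenet": 0.55}
print(round(ensemble_score(eye_scores), 3))
print(classify_eye(eye_scores, threshold=0.5))  # threshold is an assumption
```

Averaging keeps the ensemble output on the same scale as the single-network outputs, so the same kind of operating threshold can be applied to it.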

**Fig. 4: Convolutional neural networks investigated.**

Technical factors: different computational frameworks

The DL algorithms were then constructed using two open-source computational frameworks, Caffe and TensorFlow, to implement VGGNet in Python (a programming language).

(1) Caffe: This was reconstructed from the reference paper, initialized with Xavier initialization, and included an extra module of convolutional and pooling layers to cater for the increased input image resolution of 512 × 512⁵⁸. This VGGNet Caffe DL algorithm was the control architecture for analysis of the image-related factors.
(2) TensorFlow: In this study, TensorFlow was implemented with the Keras API specification. In addition, a single pooling layer preceding the output layer in VGGNet was added⁵⁹.

Image-related factors: different image compression

Increasing compression may ease transmission in tele-ophthalmology but may compromise image quality. To investigate the effect of input image compression on performance, 71,896 original images (35,948 eyes of 14,880 patients) taken from SiDRP between 2014 and 2015 were used as the source from which five distinct sets of fundus images with different compression levels were created. These were 45-degree fundus fields taken with Topcon TRC-NW8 Non-Mydriatic Retinal Cameras. In total, 359,480 retinal fundus images were generated. These five sets were then used to test the DL algorithm. It should be noted that the DL algorithm was previously trained on the original 350 KB images. The original JPEG images, averaging 350 KB, were compressed to four additional levels of compression, averaging 300, 250, 200, and 150 KB in image size, respectively. This was achieved with a standard JPEG compression algorithm using the Independent JPEG Group’s library of quality levels. We used a publicly available algorithm in the OpenCV library.
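One way to reach a given file size with JPEG quality levels is sketched below. This is an illustration only: the text reports average sizes per set and does not say how quality levels were chosen, so the per-image binary search, the helper name `quality_for_target_size`, the stand-in encoder and the convention of 1 KB = 1000 bytes are all assumptions made here.

```python
def quality_for_target_size(encode, target_bytes, lo=1, hi=100):
    """Return the highest JPEG quality whose encoded size is <= target_bytes.

    `encode(quality)` must return the encoded image as bytes, and its size is
    assumed to be non-decreasing in `quality`. Returns None if even the
    lowest quality exceeds the target.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if len(encode(mid)) <= target_bytes:
            best = mid          # fits: try a higher quality
            lo = mid + 1
        else:
            hi = mid - 1        # too large: compress harder
    return best

# With OpenCV, an encoder for a loaded image `img` could be written as:
#   import cv2
#   def encode(q):
#       ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, q])
#       return buf.tobytes()

# Stand-in encoder (hypothetical) so the sketch runs without OpenCV:
# encoded size grows linearly with quality.
def fake_encode(quality):
    return bytes(quality * 4_000)

for target_kb in (300, 250, 200, 150):
    q = quality_for_target_size(fake_encode, target_kb * 1000)
    print(target_kb, "KB -> quality", q)
```

Passing the encoder in as a callable keeps the search independent of any particular imaging library.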

Image-related factors: different fundus field of view

We evaluated the effect of different fundus fields of view on the performance of the DL algorithm by comparing (1) 1-field macula-centered and (2) 2-field optic disc and macula-centered fundus photographs. Both subsets were taken from data collected from SiDRP between 2014 and 2015 containing 35,948 eyes. Macula-centered or optic disc-centered is defined as the macula or the optic disc, respectively, located less than one disc diameter circumferentially from the center of the image. In addition, we also looked at increasing the number of fields, comparing (1) 1-field, (2) 2-field, and (3) 7-field fundus imaging. For this analysis, we used an external testing dataset from the AFEDS that obtained the ETDRS stereoscopic reference standard of 7-field fundus imaging^60,61. 7-field, 2-field, and 1-field retinal images collected from a fixed set of 1403 eyes in this dataset were used. A total of 9821 images were used.

Image-related factors: previous cataract surgery

To assess the impact of previous cataract surgery on the ability of the DL algorithm to detect DR on fundus photographs, we employed an external testing dataset using retinal images from participants of the SEED Study (baseline, 2004 to 2011). The SEED study is a population-based epidemiologic study that comprised three major ethnic groups in Singapore: Malay patients were recruited from 2004 to 2006, Indian patients from 2007 to 2009, and Chinese patients from 2009 to 2011^62,63,64,65. Phakic and pseudophakic eyes in this dataset were separated and analysis was conducted to compare one group with the other. Among the 4910 eyes (9820 images) included, 1612 eyes were phakic and 3298 eyes were pseudophakic.

Heatmap

Heatmaps were generated to provide insight into the conundrum of DL, the black box of learning, as they demonstrate focus areas visualized by the DL system. The method of Integrated Gradient was used to generate these heatmaps⁶⁶.
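The Integrated Gradients idea can be illustrated on a toy model. The sketch below is not the system used in this study: a three-input logistic regression stands in for the CNN, the weights, input and all-zero baseline are hypothetical, and the path integral is approximated with a midpoint Riemann sum.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy stand-in for a classifier: logistic regression on a flat input vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to each input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(model_grad(point, w, b)):
            avg_grad[i] += g / steps
    return [(xi - bi) * g for xi, bi, g in zip(x, baseline, avg_grad)]

w, b = [2.0, -1.0, 0.5], -0.3              # hypothetical model parameters
x, baseline = [0.9, 0.2, 0.4], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)
# Completeness check: attributions should sum to f(x) - f(baseline).
print(sum(attributions), model(x, w, b) - model(baseline, w, b))
```

For an image model, the per-pixel attributions computed this way are what would be rendered as a heatmap over the fundus photograph.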

Reference standard of testing dataset

The reference standard for the severity of DR of each eye in the SiDRP and SEED testing datasets was set as the grading assessment by an ophthalmologist sub-specializing in retinal diseases, with over five years’ experience in assessing DR. For the AFEDS testing dataset, concurring assessments from two retinal specialists were used as the reference standard. The grading was conducted in accordance with the ICDRSS. However, for the purposes of this study, we reclassified this scale to a binary outcome measure of referable DR or non-referable DR. Referable DR is defined as moderate non-proliferative DR or worse, including diabetic macular edema.
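The binary reclassification can be sketched as a small mapping. This is a minimal illustration rather than the authors' code: the five severity levels follow the international DR severity scale, while the `DRGrade` enum, its integer encoding and the `has_dme` flag are hypothetical names chosen here.

```python
from enum import IntEnum

class DRGrade(IntEnum):
    """Five-level DR severity, ordered from least to most severe (encoding assumed)."""
    NO_DR = 0
    MILD_NPDR = 1
    MODERATE_NPDR = 2
    SEVERE_NPDR = 3
    PROLIFERATIVE_DR = 4

def is_referable(grade, has_dme=False):
    """Referable DR: moderate non-proliferative DR or worse, or diabetic macular edema."""
    return grade >= DRGrade.MODERATE_NPDR or has_dme

for grade in DRGrade:
    print(grade.name, is_referable(grade))
```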

Statistical analysis

We used the following primary outcome measures as markers of the DL algorithm’s performance in detecting referable DR: AUC, sensitivity and specificity of the algorithm at detecting DR with respect to the reference standard. The operating thresholds were pre-set during training of each modification of the DL algorithms’ technical parameters. During training, AUCs were determined for the training dataset while sensitivities and specificities were calculated across a range of thresholds. The optimal threshold for each DL algorithm was chosen to achieve 90% sensitivity. The 95% confidence intervals (CI) for sensitivity and specificity were calculated with the exact Clopper-Pearson method, and for AUC, the empirical bootstrap with 5000 replicates was used. To evaluate the statistical significance of differences between AUCs, the empirical bootstrap with 5000 replicates was also used. All statistical analyses were performed using the R statistical software (version 3.5.1; R Foundation for Statistical Computing, Vienna, Austria). A P value of less than 0.05 was considered statistically significant.
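The outcome measures can be illustrated with a short Python sketch. The analysis itself was done in R, so this is not the authors' code: the toy labels and scores are invented, the function names are hypothetical, and the bootstrap shown is the simple percentile form, which may differ from the exact variant used in the study. The Clopper-Pearson intervals are not covered here.

```python
import math
import random

def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_for_sensitivity(labels, scores, target=0.90):
    """Highest threshold at which at least `target` of positives score >= threshold."""
    pos = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    needed = math.ceil(target * len(pos) - 1e-9)  # positives that must be detected
    return pos[needed - 1]

def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called referable."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    n_pos = sum(labels)
    return tp / n_pos, tn / (len(labels) - n_pos)

def bootstrap_auc_ci(labels, scores, replicates=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for AUC, resampling eyes with replacement."""
    rng = random.Random(seed)
    n, stats = len(labels), []
    while len(stats) < replicates:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # need both classes in the resample
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * replicates)]
    hi = stats[int((1 + level) / 2 * replicates) - 1]
    return lo, hi

# Invented toy data: 10 referable eyes (label 1) and 10 non-referable eyes (label 0).
labels = [1] * 10 + [0] * 10
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.30,
          0.50, 0.45, 0.40, 0.35, 0.25, 0.20, 0.15, 0.10, 0.05, 0.58]

t = threshold_for_sensitivity(labels, scores)
print("AUC:", auc(labels, scores))
print("threshold:", t, "sens/spec:", sens_spec(labels, scores, t))
print("95% CI:", bootstrap_auc_ci(labels, scores, replicates=1000))
```

Fixing the threshold at the 90% sensitivity point before testing is what lets specificity, rather than sensitivity, absorb most of the degradation when inputs change.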

Ethics approval

Our study was approved by the centralized institutional review board (IRB) of SingHealth, Singapore (IRB reference number 2018/2433). It was conducted in accordance with the Declaration of Helsinki. The IRB waived the requirement for informed consent because the study retrospectively used fully anonymized images.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The datasets used in this study originated from different principal investigators in different countries. Upon request, the corresponding author (D.S.W.T.) can forward the data request to the relevant principal investigator to seek clearance.

Code availability

The artificial intelligence system described in this study is kept at the Singapore Eye Research Institute (SERI) and the National University of Singapore (NUS). The underlying algorithm is copyrighted by SERI and NUS and will not be made publicly available.

References

Leasher, J. L. et al. Global estimates on the number of people blind or visually impaired by diabetic retinopathy: a meta-analysis from 1990 to 2010. Diabetes Care 39, 1643–1649 (2016).
Cheung, N., Mitchell, P. & Wong, T. Y. Diabetic retinopathy. Lancet 376, 124–136 (2010).
Early Treatment Diabetic Retinopathy Study Research Group. Grading diabetic retinopathy from stereoscopic color fundus photographs–an extension of the modified Airlie House classification. ETDRS report number 10. Early Treatment Diabetic Retinopathy Study Research Group. Ophthalmology 98, 786–806 (1991).
Ferris, F. L. III How effective are treatments for diabetic retinopathy? JAMA 269, 1290–1291 (1993).
Lamoureux, E. L., Taylor, H. & Wong, T. Y. Frequency of evidence-based screening for diabetic retinopathy. N. Engl. J. Med. 377, 194–195 (2017).
Ting, D. S. W. et al. AI for medical imaging goes deep. Nat. Med. 24, 539–540 (2018).
Wong, T. Y. & Bressler, N. M. Artificial intelligence with deep learning technology looks into diabetic retinopathy screening. JAMA 316, 2366–2367 (2016).
Burlina, P. M. et al. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 135, 1170–1176 (2017).
Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016).
Cheung, C., Tang, F., Ting, D., Tan, G. & Wong, T. Artificial intelligence in diabetic eye disease screening. Asia Pac. J. Ophthalmol. 8, 158–164 (2019).
Grzybowski, A. & Brona, P. A pilot study of autonomous artificial intelligence-based diabetic retinopathy screening in Poland. Acta Ophthalmologica 97, 1149–1150 (2019).
Grzybowski, A. et al. Artificial intelligence for diabetic retinopathy screening: a review. Eye 34, 457–460 (2020).
Verbraak, F. D., Schmidt-Erfurth, U., Grzybowski, A., Abramoff, M. & Schlingemann, R. Is automated screening for diabetic retinopathy indeed not yet ready as stated by Grauslund et al.? Acta Ophthalmologica 98, 257–258 (2019).
Miotto, R., Wang, F., Wang, S., Jiang, X. & Dudley, J. T. Deep learning for healthcare: review, opportunities and challenges. Brief. Bioinform 19, 1236–1246 (2018).
Ting, D. S. W. et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 318, 2211–2223 (2017).
Abramoff, M. D. et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest. Ophthalmol. Vis. Sci. 57, 5200–5206 (2016).
Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. Npj Digital Med. 1, 39 (2018).
Kanagasingam, Y. et al. Evaluation of artificial intelligence–based grading of diabetic retinopathy in primary care. JAMA Netw. Open 1, e182665–e182665 (2018).
Li, Z. et al. An automated grading system for detection of vision-threatening referable diabetic retinopathy on the basis of color fundus photographs. Diabetes Care 41, 2509–2516 (2018).
Gargeya, R. & Leng, T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology 124, 962–969 (2017).
Cavigelli, L., Hager, P. & Benini, L. CAS-CNN: A deep convolutional neural network for image compression artifact suppression. 2017 International Joint Conference on Neural Networks (IJCNN) 752–759 (2017).
Cho, J., Lee, K., Shin, E., Choy, G. & Do, S. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? Preprint at https://arxiv.org/abs/1511.06348 (2015).
Dodge, S. & Karam, L. Understanding how image quality affects deep neural networks. 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX) 1–6 (2016).
Grassmann, F. et al. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology 125, 1410–1420 (2018).
Sahlsten, J. et al. Deep learning fundus image analysis for diabetic retinopathy and macular edema grading. Sci. Rep. 9, 10750 (2019).
Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
Shin, H.-C. et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35, 1285–1298 (2016).
Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131.e9 (2018).
Lee, H. et al. An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets. Nat. Biomed. Eng. 3, 173–182 (2019).
Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
Bawankar, P. et al. Sensitivity and specificity of automated analysis of single-field non-mydriatic fundus photographs by Bosch DR Algorithm—Comparison with mydriatic fundus photography (ETDRS) for screening in undiagnosed diabetic retinopathy. PLoS ONE 12, e0189854 (2017).
Bellemo, V. et al. Artificial intelligence using deep learning to screen for referable and vision-threatening diabetic retinopathy in Africa. The Lancet Digital Health 1, 35–44 (2019).
Burlina, P. M., Joshi, N., Pacheco, K. D., Liu, T. Y. A. & Bressler, N. M. Assessment of deep generative models for high-resolution synthetic retinal image generation of age-related macular degeneration. JAMA Ophthalmol. 137, 258–264 (2019).
Ting, D. S. W. et al. Deep learning in estimating prevalence and systemic risk factors for diabetic retinopathy: a multi-ethnic study. Npj Digital Med. 2, 24 (2019).
Canziani, A., Paszke, A. & Culurciello, E. An analysis of deep neural network models for practical applications. Preprint at https://arxiv.org/abs/1605.07678 (2016).
May, M., Dang, T., Fenu, S., Verma, A. & Cimino, M. A System for better diagnosis and treatment of skin cancer. https://matt-may.github.io/pub/dermfollow_final_report.pdf. Accessed 1 Oct 2019.
Wu, Y. et al. A comparative measurement study of deep learning as a service framework. Preprint at https://arxiv.org/abs/1810.12210 (2018).
Kovalev, V., Kalinovsky, A. & Kovalev, S. Deep learning with theano, torch, caffe, tensorflow, and deeplearning4j: Which one is the best in speed and accuracy? 13th International Conference on Pattern Recognition and Information Processing, 99–103 (2016).
Rampasek, L. & Goldenberg, A. Tensorflow: Biology’s gateway to deep learning? Cell Syst. 2, 12–14 (2016).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. Proc. IEEE Conference on Computer Vision and Pattern Recognition. 2261–2269 (2017).
Eckersley, P. & Nasser, Y. EFF AI Progress Measurement Project. https://www.eff.org/ai/metrics. Accessed on 1 Dec 2018.
Newsom, R. S. et al. Effect of digital image compression on screening for diabetic retinopathy. Br. J. Ophthalmol. 85, 799–802 (2001).
Li, H. K. et al. Grading diabetic retinopathy severity from compressed digital retinal images compared with uncompressed images and film. Retina 30, 1651–1661 (2010).
Early Treatment Diabetic Retinopathy Study Research Group. Grading diabetic retinopathy from stereoscopic color fundus photographs–an extension of the modified Airlie House classification: ETDRS report number 10. Ophthalmology 98, 786–806 (1991).
Williams, G. A. et al. Single-field fundus photography for diabetic retinopathy screening: a report by the American Academy of Ophthalmology. Ophthalmology 111, 1055–1062 (2004).
Silva, P. S. et al. Peripheral lesions identified by mydriatic ultrawide field imaging: distribution and potential impact on diabetic retinopathy severity. Ophthalmology 120, 2587–2595 (2013).
Kern, T. S. & Engerman, R. L. Vascular lesions in diabetes are distributed non-uniformly within the retina. Exp. Eye Res. 60, 545–549 (1995).
Dobree, J. Simple diabetic retinopathy. Evolution of the lesions and therapeutic considerations. Br. J. Ophthalmol. 54, 1 (1970).
Scanlon, P. H., Foy, C., Malhotra, R. & Aldington, S. J. The influence of age, duration of diabetes, cataract, and pupil size on image quality in digital photographic retinal screening. Diabetes Care 28, 2448–2453 (2005).
Butler, D. AI summit aims to help world’s poorest. Nat. N. 546, 196 (2017).
Nguyen, H. V. et al. Cost-effectiveness of a National Telemedicine Diabetic Retinopathy Screening Program in Singapore. Ophthalmology 123, 2571–2580 (2016).
Lim, G., Lee, M. L., Hsu, W. & Wong, T. Y. Transformed representations for convolutional neural networks in diabetic retinopathy screening. Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 34–38 (2014).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://arxiv.org/abs/1409.1556 (2014).
He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition. 770–778 (2016).
Mo, J., Zhang, L. & Feng, Y. Exudate-based diabetic macular edema recognition in retinal images using cascaded deep residual networks. Neurocomputing 290, 161–171 (2018).
Shibata, N. et al. Development of a deep residual learning algorithm to screen for glaucoma from fundus photography. Sci. Rep. 8, 14665 (2018).
Hansen, L. K. & Salamon, P. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12, 993–1001 (1990).
Jia, Y. et al. Caffe: convolutional architecture for fast feature embedding. Proc. 22nd ACM International Conference on Multimedia, 675–678 (2014).
Abadi, M. et al. TensorFlow: a system for large-scale machine learning. 12th USENIX conference on Operating Systems Design and Implementation, 265–283 (2016).
Varma, R. African American Eye Disease Study. http://grantome.com/grant/NIH/U10-EY023575-03. Accessed on 30 Jun 2019.
McKean-Cowdin, R. et al. The African American Eye Disease Study: Design and Methods. Ophthalmic Epidemiol. 25, 306–314 (2018).
Shi, Y. et al. Is aspirin associated with diabetic retinopathy? The Singapore Epidemiology of Eye Disease (SEED) study. PLoS ONE 12, e0175966 (2017).
Huang, O. S. et al. Prevalence and determinants of undiagnosed diabetic retinopathy and vision-threatening retinopathy in a multiethnic Asian cohort: the Singapore Epidemiology of Eye Diseases (SEED) study. Br. J. Ophthalmol. 99, 1614–1621 (2015).
Wong, T. Y. et al. Prevalence and risk factors for diabetic retinopathy: the Singapore Malay Eye Study. Ophthalmology 115, 1869–1875 (2008).
Chong, Y. H. et al. Type 2 diabetes genetic variants and risk of diabetic retinopathy. Ophthalmology 124, 336–342 (2017).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. Proc. 34th International Conference on Machine Learning 70, 3319–3328 (2017).

Acknowledgements

This study was supported by the National Medical Research Council, Singapore; the Ministry of Health, Singapore; and the Tanoto Foundation. This project received funding from the National Medical Research Council (NMRC), Ministry of Health (MOH), Singapore (National Health Innovation Center, Innovation to Develop Grant (NHIC-I2D-1409022); Health Service Research Grant; SingHealth Foundation Research Grant (SHF/FG648S/2015)), and the Tanoto Foundation. For the Singapore Epidemiology of Eye Diseases (SEED) study, we received funding from NMRC, MOH (grants 0796/2003, IRG07nov013, IRG09nov014, STaR/0003/2008, & STaR/2013; CG/SERI/2010) and the Biomedical Research Council (grants 08/1/35/19/550, 09/1/35/19/616). The Singapore Diabetic Retinopathy Program (SiDRP) received funding from the MOH, Singapore (grants AIC/RPDD/SIDRP/SERI/FY2013/0018 & AIC/HPD/FY2016/0912). The NMRC (Singapore), MOH (Singapore), NHIC (Singapore), and Tanoto Foundation had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Author information

Authors and Affiliations

Singapore Eye Research Institute, Singapore National Eye Center, Singapore, Singapore
Michelle Y. T. Yip, Gilbert Lim, Quang D. Nguyen, Crystal C. Y. Chong, Marco Yu, Valentina Bellemo, Yuchen Xie, Xin Qi Lee, Haslina Hamzah, Jinyi Ho, Tien-En Tan, Charumathi Sabanayagam, Gavin S. W. Tan, Tien Yin Wong & Daniel S. W. Ting
Duke-NUS Medical School, Singapore, Singapore
Michelle Y. T. Yip, Charumathi Sabanayagam, Gavin S. W. Tan, Tien Yin Wong & Daniel S. W. Ting
School of Computing, National University of Singapore, Singapore, Singapore
Gilbert Lim, Zhan Wei Lim, Wynne Hsu & Mong Li Lee
Department of Ophthalmology, University of Warmia and Mazury, Olsztyn, Poland
Andrzej Grzybowski
Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, Poznan, Poland
Andrzej Grzybowski
State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
Daniel S. W. Ting

Contributions

M.Y., G.L., Z.L., V.B., X.L., H.H., T.T., T.W., and D.T. contributed to the initial drafting of the manuscript. H.H., J.H., G.T., C.S., T.W., and D.T. contributed to the data collection. G.L., Z.L., Q.N., C.C., and M.C.Y.Y. contributed to the data analysis. M.Y., G.L., Z.L., Q.N., C.C., M.C.Y.Y., V.B., Y.X., X.L., H.H., J.H., T.T., C.S., A.G., S.T., W.H., M.L., T.W., and D.T. contributed to the interpretation, critical review and final approval of this manuscript, and are accountable for the work described.

Corresponding author

Correspondence to Daniel S. W. Ting.

Ethics declarations

Competing interests

D.S.W.T., G.L., M.L.L., W.H., and T.Y.W. are co-inventors of a patent on the deep learning system in this paper; potential conflicts of interest are managed according to institutional policies of the Singapore Health System (SingHealth) and the National University of Singapore. M.Y., Z.L., Q.N., C.C., M.Y., V.B., Y.X., X.L., H.H., J.H., T.T., C.S., A.G., and G.T. have no conflicts of interest to disclose.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Reporting Summary

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Yip, M.Y.T., Lim, G., Lim, Z.W. et al. Technical and imaging factors influencing performance of deep learning systems for diabetic retinopathy. npj Digit. Med. 3, 40 (2020). https://doi.org/10.1038/s41746-020-0247-1

Received: 01 December 2019
Accepted: 19 February 2020
Published: 23 March 2020
DOI: https://doi.org/10.1038/s41746-020-0247-1
