Detecting visually significant cataract using retinal photograph-based deep learning

Age-related cataract is the leading cause of visual impairment among older adults. Many visually significant cases remain undiagnosed or neglected in the community, owing to limited availability of, or access to, cataract screening. In the present study, we report the development and validation of a retinal photograph-based deep-learning algorithm for automated detection of visually significant cataracts, using more than 25,000 images from population-based studies. In the internal test set, the area under the receiver operating characteristic curve (AUROC) was 96.6%. External testing across three studies yielded AUROCs of 91.6–96.5%. In a separate test set of 186 eyes, we further compared the algorithm's performance against the evaluations of four ophthalmologists. The algorithm performed comparably, if not slightly better (sensitivity of 93.3% versus 51.7–96.6% for the ophthalmologists, and specificity of 99.0% versus 90.7–97.9%). Our findings show the potential of a retinal photograph-based screening tool for visually significant cataract among older adults, enabling more appropriate referrals to tertiary eye centers.

9,737 eyes) was randomly split into a development set (n=4,138) and an independent internal test set (n=900; 1,692 eyes) in an 8:2 ratio at the individual level (i.e., the split was performed at the person level, so both eyes of a participant fell on the same side of the split). As illustrated in Annex Figure 1 below, the model pipeline consisted of two parts: a deep convolutional neural network (CNN) serving as a feature extractor, and a classification model.
Annex Figure 1: The model pipeline.
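For concreteness, the following is a minimal sketch of the person-level 8:2 split described above, assuming a tabular listing with one row per eye and a person identifier; the file name (eyes.csv) and column names (person_id, label) are illustrative assumptions, not taken from the study.

```python
# Minimal sketch of a person-level 8:2 split (illustrative names throughout).
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

eyes = pd.read_csv("eyes.csv")  # hypothetical table: one row per eye, with 'person_id' and 'label'

# Grouping by person_id guarantees that both eyes of a participant fall on the
# same side of the split, preventing leakage between development and test sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
dev_idx, test_idx = next(splitter.split(eyes, groups=eyes["person_id"]))
dev_set, test_set = eyes.iloc[dev_idx], eyes.iloc[test_idx]
```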
The first step was feature extraction. For this purpose, we used a deep convolutional neural network, the Residual Neural Network (ResNet). 1 A ResNet consists of multiple residual blocks, each containing a few convolutional layers and a residual (skip) connection that combines earlier features with later ones. 1 The specific architecture used in our model was ResNet-50. Following common practice for improved performance, we adopted a ResNet-50 model that was pre-trained on the ImageNet dataset. 2 The training retinal images were fed through this network to extract their features; 2,048 features were extracted from each training image.

These features, along with the ground-truth clinical labels, were then used to train an extreme gradient boosting (XGBoost) classification model. 2 XGBoost is based on the gradient boosting approach, in which decision trees are added sequentially, such that each subsequent tree reduces the error of the preceding ones. 2 The method guards against overfitting through regularization, and additionally offers parallelized tree building, tree pruning, and other enhancements. 2 The XGBoost hyperparameters, such as the learning rate, minimum child weight (the minimum sum of instance weights required in a child node), maximum tree depth, and number of estimators, were chosen by grid search to minimize the cross-validated classification error on the training set. In addition, because the dataset was imbalanced, we adjusted the classifier parameters to balance the influence of positive and negative samples. Once trained, the model was used to make predictions on the independent internal and external test sets. The final output of the classification model was the probability of visually significant cataract being present in each study eye.
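The sketch below illustrates this two-stage pipeline: an ImageNet pre-trained ResNet-50 as a frozen feature extractor (2,048 features per image), followed by an XGBoost classifier tuned by grid search with class re-weighting for imbalance. The preprocessing values, hyperparameter grid, and the loaders load_training_data and load_test_paths are illustrative assumptions, not the study's actual settings.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Frozen, ImageNet pre-trained ResNet-50; replacing the final fully connected
# layer with the identity exposes the 2,048-dimensional pooled features.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),  # illustrative input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(paths):
    """Return a (len(paths), 2048) array of ResNet-50 features."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return backbone(batch).numpy()

# Hypothetical loader: lists of image paths and 0/1 cataract labels.
train_paths, y_train = load_training_data()
X_train = extract_features(train_paths)

# Grid over the hyperparameters named in the text; the values are guesses.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "min_child_weight": [1, 5],
    "max_depth": [3, 5],
    "n_estimators": [200, 500],
}
# scale_pos_weight re-weights positive samples; the negative:positive ratio
# is one common choice for an imbalanced dataset.
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
search = GridSearchCV(
    XGBClassifier(scale_pos_weight=pos_weight),
    param_grid,
    scoring="accuracy",  # maximizing accuracy = minimizing classification error
    cv=5,
).fit(X_train, y_train)

# Probability of visually significant cataract for each test eye.
test_paths = load_test_paths()  # hypothetical loader
probs = search.predict_proba(extract_features(test_paths))[:, 1]
```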

Details on generating saliency maps
To understand which regions of the retinal images drove the prediction of visually significant cataract, we implemented the Grad-CAM method for generating saliency maps. 3 For this purpose, we replaced the gradient boosting module with a dense layer as the classifier. We first forward-propagated the images through the deep learning model to obtain predictions. Once the prediction for an image was obtained, the gradient for the target class (i.e., the ground-truth class of the image) was set to 1, while the gradients for all other classes were set to 0. These gradients were then backpropagated through the network.
From a prespecified convolutional layer, the gradients and feature maps were extracted and combined to generate the heatmap (gradient-weighted class activation map). 3 The saliency maps were overlaid on the original images to indicate the regions important to the prediction.
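A minimal Grad-CAM sketch under the setup described above follows. The choice of the last residual block as the target layer, the two-class dense head, and the untrained weights of that head are illustrative assumptions; in the study, the dense classifier replacing the gradient boosting module would be trained before generating the maps.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(2048, 2)  # dense classifier standing in for XGBoost
model.eval()

# Hooks capture the feature maps and their gradients at the chosen layer.
feats, grads = {}, {}
layer = model.layer4[-1]  # last residual block; a typical Grad-CAM target
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

def grad_cam(image, target_class):
    """image: (1, 3, H, W) tensor; returns an (H, W) saliency map in [0, 1]."""
    logits = model(image)
    # One-hot gradient: 1 for the target (ground-truth) class, 0 elsewhere.
    one_hot = torch.zeros_like(logits)
    one_hot[0, target_class] = 1.0
    model.zero_grad()
    logits.backward(gradient=one_hot)
    # Weight each feature map by its spatially averaged gradient, combine,
    # keep positive contributions, and upsample to the input resolution.
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()
```

The resulting map can be colorized and alpha-blended over the original retinal photograph to highlight the regions that contributed most to the prediction.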