Automatic detection of 39 fundus diseases and conditions in retinal photographs using deep neural networks

Retinal fundus diseases can lead to irreversible visual impairment without timely diagnosis and appropriate treatment. Single-disease deep learning algorithms have been developed for the detection of diabetic retinopathy, age-related macular degeneration, and glaucoma. Here, we developed a deep learning platform (DLP) capable of detecting multiple common referable fundus diseases and conditions (39 classes) by using 249,620 fundus images marked with 275,543 labels from heterogeneous sources. Our DLP achieved a frequency-weighted average F1 score of 0.923, sensitivity of 0.978, specificity of 0.996 and area under the receiver operating characteristic curve (AUC) of 0.9984 for multi-label classification in the primary test dataset, reaching the average level of retina specialists. External multihospital testing, testing on public datasets and a tele-reading application also showed high efficiency in detecting multiple retinal diseases and conditions. These results indicate that our DLP can be applied for retinal fundus disease triage, especially in remote areas around the world.

To match the doubled input shape, an additional Conv block was added. The number of filters in the first convolutional layer was reduced from 64 to 32, and the kernel size of the first convolutional layer was changed from 7x7 to 5x5. In the custom designed ResNeXt model, cardinality=16 was used instead of cardinality=32. Every Conv block contained a number of residual units (or grouped residual units). Inside every residual unit, pre-activation 20 and bottleneck structures were used. The structures of the CNNs in group B are shown in Supplementary Fig. 3a.
The subclass classification of bigclass 10 was treated as a fine-grained classification problem and was implemented as a small pipeline. An image was first pre-processed to 384x384 pixels, and then a Mask R-CNN was used to detect and segment the optic disc. Optic disc detection was treated as an instance segmentation problem rather than a localization or object detection problem, because a large number of pixel-level annotated samples were available, mask images of the optic disc were easy to obtain, and the confidence value output by the Mask R-CNN was important. After cropping the optic disc area to 112x112 pixels, a custom designed ResNet and ResNeXt were used for the final classification. Compared with the standard ResNet and ResNeXt, there were some modifications: a Conv block was removed to match the halved input shape, and the kernel size of the first convolutional layer was reduced from 7x7 to 5x5. The structures of the CNNs in group C are shown in Supplementary Fig. 3b.
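As an illustration, a minimal Python sketch of this two-stage pipeline follows. The names disc_detector and disc_classifier are hypothetical stand-ins for the trained Mask R-CNN and the custom ResNet/ResNeXt models; the detector's API and the confidence threshold are assumptions, not the authors' code.

import cv2
import numpy as np

def classify_bigclass10_subclass(image_path, disc_detector, disc_classifier,
                                 confidence_threshold=0.5):
    # Stage 0: pre-process the fundus photograph to 384x384 pixels.
    image = cv2.imread(image_path)
    image = cv2.resize(image, (384, 384))

    # Stage 1: instance segmentation of the optic disc. The Mask R-CNN also
    # returns a confidence value, which the text notes was important.
    detection = disc_detector.detect(image)  # hypothetical API
    if detection.confidence < confidence_threshold:
        return None  # no reliable optic disc found

    # Crop the detected disc region and resize it to 112x112 pixels.
    x, y, w, h = detection.box
    disc = cv2.resize(image[y:y + h, x:x + w], (112, 112))

    # Stage 2: final subclass classification on the small disc-centered crop.
    probs = disc_classifier.predict(disc[np.newaxis].astype(np.float32) / 255.0)
    return int(np.argmax(probs, axis=-1)[0])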

Real-time data augmentation
Real-time data augmentation was used during both training and inference. In general, image augmentation is used much more during training than at test time; however, test-time image augmentation has been used in ImageNet 21 (multi-crop) and in the Kaggle Data Science Bowl 2017 competition 22. It improves not only accuracy but also robustness to small image perturbations 23. Compared with beforehand image augmentation, real-time image augmentation is flexible and simplifies the whole training process. During training, images were randomly rotated 24, horizontally and vertically flipped, and image contrast was modified (multiplicative factor range: [90%, 110%]). During inference, two additional images were generated on the fly for each image using pre-defined transformations: one by translation (dx=6px, dy=6px) and horizontal flipping, and the other by translation (dx=-6px, dy=-6px) and vertical flipping. Training-time image augmentation was implemented using the imgaug 25 library and the Keras 26 Sequence class (better than a Python generator in a multi-process environment), and test-time image augmentation was implemented with custom OpenCV code and the Keras Sequence class.
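A minimal sketch of both augmentation stages is shown below, using the imgaug and OpenCV libraries named in the text. The rotation range is an assumption (the text does not state it), and the translation uses cv2.warpAffine with zero padding as one plausible reading of "moving".

import cv2
import numpy as np
import imgaug.augmenters as iaa

# Training-time augmenter: random rotation, flips and contrast change,
# applied on the fly for every batch.
train_augmenter = iaa.Sequential([
    iaa.Affine(rotate=(-180, 180)),      # rotation range assumed
    iaa.Fliplr(0.5),                     # random horizontal flip
    iaa.Flipud(0.5),                     # random vertical flip
    iaa.LinearContrast((0.9, 1.1)),      # contrast factor in [90%, 110%]
])

def _translate(image, dx, dy):
    h, w = image.shape[:2]
    m = np.float32([[1, 0, dx], [0, 1, dy]])
    return cv2.warpAffine(image, m, (w, h))

def tta_variants(image):
    """Generate the two extra test-time images described in the text."""
    variant_a = cv2.flip(_translate(image, 6, 6), 1)    # move (+6, +6), horizontal flip
    variant_b = cv2.flip(_translate(image, -6, -6), 0)  # move (-6, -6), vertical flip
    return [image, variant_a, variant_b]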

Dynamic data resampling
During the training process, dynamic data resampling was used to resolve the problem of imbalanced classes 5 .
Compared with traditional under-sampling, it makes full use of the training data, because it generates a different training dataset in every epoch. Compared with traditional over-sampling, which simply duplicates minority class samples, it avoids overfitting. Because dynamic resampling and real-time augmentation were used together, different images were generated on the fly by real-time augmentation for a minority class image. Data resampling methods have been widely used in the single-label setting; however, they cannot be directly applied in the multi-label setting. When calculating the sampling probability of an image with multiple labels, the labels were converted to a single label, namely the class with the fewest samples among the image's labels. It should be noted that this conversion was only used in the sampling process; when generating the training dataset, the original labels would be used. This method worked well because the labels of the bigclass dataset were very sparse (label cardinality = 1.098). The dynamic data resampling algorithm is shown in Algorithm 1.
Algorithm 1: Dynamic data resampling
Let S be the original training set.
Let epoch_num be the epoch of training.
num_classes ← 30        # the number of classes in S (in this case 30)
num_samples ← len(S)    # the number of samples in S
S' ← copy.deepcopy(S)   # clone a new set object S' from S
FOR each sample1 in S':
    IF len(sample1[1]) > 1:  # sample1 contains data and labels (x, y); check whether y has multiple labels
        select the label with the fewest class samples and replace the original labels with it
ENDFOR
In this study, the weight_power parameter was set to 0.65 for bigclass classification. For the subclass classifications, the dynamic data resampling ratio parameters were set case by case. These parameters were kept fixed during training.
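A minimal Python sketch of one epoch of this resampling is given below. The exact sampling-weight formula is not stated in the text; weighting each class by (1 / class_count) ** weight_power is an assumption used purely for illustration (weight_power = 0 reproduces the original distribution, weight_power = 1 fully balances the classes).

import random
from collections import Counter

def resample_epoch(samples, weight_power=0.65, seed=None):
    """Draw a fresh training set for one epoch.

    `samples` is a list of (x, labels) pairs, where `labels` is a list of
    class indices.  The weight formula below is an assumption.
    """
    rng = random.Random(seed)

    # Step 1: for sampling purposes only, reduce a multi-label sample to the
    # rarest of its labels (the original labels are kept for training).
    counts = Counter(label for _, labels in samples for label in labels)
    rarest = [min(labels, key=lambda c: counts[c]) for _, labels in samples]

    # Step 2: sample with replacement, weighting minority classes up.
    weights = [(1.0 / counts[c]) ** weight_power for c in rarest]
    drawn = rng.choices(range(len(samples)), weights=weights, k=len(samples))
    return [samples[i] for i in drawn]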

Loss function for multi-label classification
A custom designed function, which can be viewed as a weighted binary cross-entropy loss, was used in bigclass classification. The loss function was formally defined as follows. A single sample in the training set was denoted by $(x, y)$. The $c$-th output of the neural network $f(x)$ was denoted by $p_c$, and the $c$-th element of the ground-truth label was denoted by $y_c$. Because of label smoothing 27, the element $y_c$ was not always 0 or 1. The false negative and false positive weights in the cost matrix of class $c$ were denoted by $C_{FN_c}$ and $C_{FP_c}$ respectively, and the class weight for class $c$, used to tackle the inter-class imbalance in which some labels are more frequent than others, was denoted by $class\_weight_c$. The loss of $(x, y)$ was defined as

$$L(x, y) = -\sum_{c} class\_weight_c \left[ C_{FN_c}\, y_c \log p_c + C_{FP_c}\, (1 - y_c) \log (1 - p_c) \right].$$

For simplicity, $C_{FP_c}$ was set to 1 for all classes, so only $C_{FN_c}$ needed to be set. The 30 values of $C_{FN}$ were set automatically based on two hyper-parameters: positive_weight_ratio and weight_power.
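A minimal Keras-style sketch of this loss is given below, under the assumption that label smoothing has already been applied to y_true and that c_fn and class_weight are supplied as per-class arrays (how C_FN is derived from positive_weight_ratio and weight_power is not reconstructed here).

import tensorflow as tf

def make_weighted_bce(c_fn, class_weight):
    """Weighted binary cross-entropy with per-class FN weights (C_FPc = 1)."""
    c_fn = tf.constant(c_fn, dtype=tf.float32)
    class_weight = tf.constant(class_weight, dtype=tf.float32)

    def loss(y_true, y_pred):
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # Per-class weighted BCE terms; the false positive weight is fixed to 1.
        per_class = -(c_fn * y_true * tf.math.log(y_pred)
                      + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_sum(class_weight * per_class, axis=-1)

    return loss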

Hyper-parameters
positive_weight_ratio was used to tackle the imbalance between negative and positive labels brought by label sparsity and the binary relevance conversion, and it was empirically set to 2.4.

Transfer learning
Transfer learning 28,29 was applied for training the standard models in groups A and D. All custom-designed models in CNN groups B and C were trained from scratch (Supplementary Table 4).
When training the bigclass models in group A and all models in group D, weights were initialized from pre-trained ImageNet models (except the weights of the last fully connected layer), and then all layers were fine-tuned.
Bigclass models were trained prior to the subclass models, and the subclass models in CNN group A were transferred from the bigclass models using the method mentioned previously. The domains of the subclass classifications were subsets of the domain of the bigclass classification, and the subclass and bigclass classification tasks were related. The more similar the data distributions of the source and target domains, and the more related the source and target tasks, the better the transfer effect.
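A minimal sketch of this initialization-and-fine-tune scheme follows, using InceptionV3 purely as a stand-in architecture (the actual models used in groups A and D may differ); the sigmoid head matches the multi-label bigclass setting.

import tensorflow as tf

NUM_CLASSES = 30  # bigclass setting

# Initialize from ImageNet weights and replace the last fully connected layer.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid")(base.output)
model = tf.keras.Model(base.input, outputs)

# All layers are fine-tuned rather than frozen.
for layer in model.layers:
    layer.trainable = True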
Adam 30 with Lookahead 31 (k=5, alpha=0.5) was used as the optimizer, and a custom learning rate scheduler was used to adjust the learning rate dynamically. Label smoothing 27 (ε=0.1) was used to calibrate the predicted probabilities 32.
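One way to realize this configuration, assuming the TensorFlow Addons implementation of Lookahead (the base learning rate shown is an assumption, and the custom scheduler is omitted):

import tensorflow as tf
import tensorflow_addons as tfa

# Adam wrapped in Lookahead with k=5 and alpha=0.5.
optimizer = tfa.optimizers.Lookahead(
    tf.keras.optimizers.Adam(learning_rate=1e-3),  # base learning rate assumed
    sync_period=5,        # k = 5
    slow_step_size=0.5,   # alpha = 0.5
)

# Label smoothing (epsilon = 0.1) applied through the loss; for the bigclass
# models the custom weighted loss above would be used instead.
loss = tf.keras.losses.BinaryCrossentropy(label_smoothing=0.1)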

Model ensemble
For an image, after being preprocessed, test-time image augmentation and model ensembling were used to generate the final predicted probabilities (Supplementary Fig. 4). The number of CNN models involved was denoted by $n$, the number of test-time image augmentations by $m$, and the predicted probability of model $i$ for image augmentation $j$ by $p_{ij}$. The final predicted probabilities array, denoted by probs, was obtained by averaging over all model/augmentation pairs:

$$probs = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij}$$

Both parameters $n$ and $m$ were set to 3. Setting $n$ and $m$ greater than 3 did not yield an apparent performance improvement, but consumed more computing power and lengthened the response time.
The algorithm for generating bigclass labels is shown in Algorithm 2.
Algorithm 2: Generating bigclass labels
Let probs be the predicted probabilities (after model ensembling).
For subclass classification (multi-class), the predicted class, denoted by pred_class, is the class with the highest probability in probs. For bigclass classification (multi-label), threshold-moving was not adopted for simplicity, and 0.5 was used as the threshold for all classes; in addition, the abovementioned T-Criterion rule 4 was implemented in the multi-label classification algorithm.
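A sketch of this ensembling and label-generation procedure is given below. The averaging step and the fallback used for the T-Criterion rule (keep the top-scoring class when no probability clears the threshold) are our reading of the cited rule, not the authors' verbatim code; each model is assumed to expose a Keras-style predict.

import numpy as np

def predict_labels(models, image_variants, threshold=0.5):
    """Ensemble n models over m augmented images (n = m = 3) and derive labels."""
    # Average the predicted probabilities over all model/augmentation pairs.
    probs = np.mean(
        [model.predict(img[np.newaxis])[0]
         for model in models for img in image_variants],
        axis=0,
    )

    # Multi-label: every class whose probability clears the 0.5 threshold.
    labels = np.flatnonzero(probs > threshold).tolist()
    if not labels:                       # T-Criterion-style fallback (assumed)
        labels = [int(np.argmax(probs))]
    return probs, labels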

Visualizing and explaining CNNs
The explainability of neural networks is very important; unfortunately, all current explanation methods are fragile 34.
A modified Class Activation Maps (CAMs) 35 method and DeepSHAP 36 (DeepExplainer), which complement each other, were used together to generate heat-maps (Fig. 1). These heat-maps were used to explain decisions made by the neural networks. CAMs 35 are class-discriminative and faithful to the predicted values, but have low resolution. DeepExplainer is a combination of DeepLIFT 37 and Shapley values; it can generate fine-grained heat-maps and is efficient, and accordingly it can produce more reliable results than other approximation methods such as Layer-wise Relevance Propagation (LRP) and Integrated Gradients 38. The only differences between the original CAMs and our modified CAMs were two ReLU functions.
The activation of unit $k$ in the last convolutional layer at spatial location $(x, y)$ was denoted by $f_k(x, y)$, and the weight corresponding to class $c$ for unit $k$ was denoted by $w_k^c$. The mathematical formula of CAM was changed from

$$M_c(x, y) = \sum_k w_k^c\, f_k(x, y)$$

to

$$M_c(x, y) = \mathrm{ReLU}\Big(\sum_k \mathrm{ReLU}(w_k^c)\, f_k(x, y)\Big).$$

The intuition was the same as in Guided Backpropagation and Grad-CAM++ 39, i.e., only the positive gradients (or positive weights of the last fully connected layer) were taken into consideration. According to our experimental results, the performance of the modified CAMs was clearly better than that of the original CAMs.
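A minimal NumPy sketch of the modified CAM defined above (the min-max normalization at the end is an assumption for display purposes):

import numpy as np

def modified_cam(feature_maps, fc_weights, class_index):
    """Modified CAM with the two extra ReLUs described above.

    feature_maps: (H, W, K) activations of the last convolutional layer.
    fc_weights:   (K, num_classes) weights of the last fully connected layer.
    """
    w = fc_weights[:, class_index]
    w = np.maximum(w, 0.0)                                # first ReLU: positive weights only
    cam = np.tensordot(feature_maps, w, axes=([2], [0]))  # sum_k w_k * f_k(x, y)
    cam = np.maximum(cam, 0.0)                            # second ReLU on the map
    if cam.max() > 0:
        cam /= cam.max()                                  # normalize to [0, 1] for display
    return cam

For the DeepSHAP heat-maps, the shap library's DeepExplainer can be applied in a similar fashion, e.g. shap.DeepExplainer(model, background).shap_values(images), where background is a set of reference images.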

The DLP deployment
After being fully trained and validated, all the CNN models were deployed for production. The simplified architecture of the production platform is shown in Supplementary Fig. 1. A custom designed computer-aided diagnosis service (CADS) was developed instead of using the standard TensorFlow Serving, because generating heat-maps requires low-level control of the models. The trained CNN models were automatically loaded during startup of the CADS and provided services through an XML-RPC server. Both the web application and the web service implemented an XML-RPC client that communicates with the CADS (Supplementary Fig. 7). The web service is a RESTful service, and the doctor station communicates with it using the HTTP protocol. The Python 3 built-in xmlrpc module was used to develop the RPC server, and the Django framework was used to develop the web application and the web service. After training and validation, the DLP was deployed as both a web site and a set of web services, and the JSIEC PACS was integrated with the DLP through web services for internal testing.
After preprocessing, images were classified into 30 bigclasses (ID 0~29) with generated heat-maps (CAM and DeepSHAP). Images classified as bigclass ID 0, 1, 2, 5, 10, 15 or 29 were further processed with subclass prediction using corresponding models trained independently; the flow of the algorithm is depicted in Supplementary Fig. 4. There are four subclasses in bigclass 0; therefore, three parallel binary classifiers were applied to detect the probabilities of the three conditions (tessellated fundus, large optic cup and DR1) against normal. Images classified as bigclass 10 were subsequently cropped and segmented into small (112x112 pixels) optic disc-centered images for final subclass classification. To detect microaneurysms (very small red dots), an input image resolution of 448x448 was used in the custom designed CNNs for the detection of subclass 0.3.
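A minimal sketch of such a service using the Python 3 built-in xmlrpc module is shown below; load_models and run_inference are hypothetical stand-ins for the actual model loading and inference code, and the port is an assumption.

from xmlrpc.server import SimpleXMLRPCServer

def load_models():
    """Hypothetical stand-in: load the trained CNN models at startup."""
    return []

def run_inference(models, image_b64):
    """Hypothetical stand-in: preprocessing, ensembling and heat-map code."""
    return [], [], []

models = load_models()  # models are loaded once, during CADS startup

def predict(image_b64):
    # RPC endpoint used by the web app / web service XML-RPC clients.
    probs, labels, heatmaps = run_inference(models, image_b64)
    return {"probs": probs, "labels": labels, "heatmaps": heatmaps}

server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
server.register_function(predict, "predict")
server.serve_forever()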

Supplementary Tables
Notes for the supplementary tables:
a. All images with unclassifiable agreement were also sent to the retina expert panel for final confirmation.
b. The Jaccard index, also known as the Jaccard similarity coefficient, was applied to compare similarities between finite sets 39. For the same sample, if the label sets (multiple labels) marked by two graders are A and B, the Jaccard index is calculated as J(A, B) = |A ∩ B| / |A ∪ B|.
c. Images without identical labels between unspecialized ophthalmologists and retina specialists were transferred to the retina expert panel.
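For reference, the Jaccard index of two label sets can be computed directly from the formula above (the convention of returning 1.0 for two empty sets is an assumption):

def jaccard_index(a, b):
    """Jaccard similarity between two label sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Example: graders agree on one of two labels -> 0.5
# jaccard_index({"DR1", "large optic cup"}, {"DR1"})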