Introduction

Diabetic retinopathy (DR) is a major ocular manifestation of diabetes. According to a World Health Organization (WHO) report, the number of diabetic patients is estimated to reach 642 million by the year 2040. Nearly 40–45% of patients with diabetes are prone to vision impairment due to DR, putting the global estimate of DR patients at nearly 224 million. DR can be initially asymptomatic at its non-proliferative stage (NPDR), which is characterized by the presence of micro-aneurysms. If not treated in a timely fashion, it can progress to proliferative diabetic retinopathy (PDR), leading to irreversible vision loss and blindness. The American Academy of Ophthalmology (AAO) recommends that patients with the prevalent Type II diabetes be screened every year after the initial diagnosis1. However, studies show that less than 50% of diabetic patients follow through and get their yearly screening, the rate being even lower (15–20%) in rural areas2,3. Therefore, it is imperative to find an efficient way to improve treatment management for DR and enable mass screening, early onset detection, and clinical diagnostics.

In recent years, researchers have successfully demonstrated machine learning (ML) and deep learning (DL) based algorithms for detecting diabetes4 and for DR diagnosis and referrals. In particular, DL methodologies have facilitated feature extraction and DR classification with high accuracy, sensitivity, and specificity5,6,7,8,9,10,11,12,13,14,15,16,17 using different imaging modalities such as fundus images, optical coherence tomography (OCT), and OCT angiography (OCTA) images. In general, such DL based DR classification pipelines require large, clean, diverse data, ground truth associated with the data, and a robust DL model (convolutional neural networks such as VGG16, ResNet, InceptionNet, etc.). In the case of referable vs non-referable DR classification, despite the impressive performance showcased by these DL models, they often underperform on independent clinical test datasets. The reason is that the models often do not learn effective data representations that lead to better generalization and model transferability. Furthermore, these models need to utilize larger, more diverse data from different sub-populations for better model training, which increases the need for ground truth generation and data labeling—creating a large burden on clinicians. In this study, our goal is to reduce the burden of ground truth generation by developing more transferable DL models through a robust data representation learning framework.

To address this goal, we have developed a self-supervised contrastive learning (CL) based pipeline for classification of referable vs non-referable DR. Self-supervised methods like CL help a DL model learn effective representations of the data without the need for large ground truth datasets18,19; the supervision is provided by the data itself. In such pipelines, a DL network is trained on a primary task (representation learning, which requires no ground truth), and the weights from that task are then transferred to a secondary target task (i.e., classification, which requires a smaller set of data and ground truth). For example, if a model is trained to solve an image puzzle, it does not need ground truth (the input image is the reference ground truth). But by learning to solve the puzzle, the model learns an effective representation and the characteristic features of the image. This model's weights can then be used for an image classification task—yielding high classification performance with smaller data and ground truth20. In our case, while prior models for DR classification use 'ImageNet' weights for transfer learning11,12,21,22,23,24, our framework generates enhanced transfer learning weights that improve classification accuracy, even with smaller training data.

In this paper, our key novel contributions are the following: (i) we present 'FundusNet'—a novel CL based framework with neural style transfer (NST) based augmentation that achieves high classification performance; (ii) our novel NST based image augmentation technique effectively improves the representation learning capability of the CL network from fundus images, as demonstrated by comparison to a state-of-the-art lesion based CL model25. The model with FundusNet weights is independently evaluated on external clinical data and achieves high sensitivity and specificity compared to three baseline models (two fully supervised models and one CL model); and (iii) the CL-pretrained model also performs well even when the labelled dataset is reduced to 10% of its original size, suggesting the potential of CL to train models for DR diagnosis using small, labeled datasets.

Results

Study datasets

This study used the EyePACS dataset for the CL based pretraining and for training the referable vs non-referable DR classifier. EyePACS is a public domain fundus dataset which contains 88,692 images from 44,346 individuals (both eyes, OD and OS), and the dataset is designed to be balanced across races and sexes. After removing the fundus images that met the exclusion criteria (uneven illumination, color distortion, low contrast, motion blur, missing fovea or optic discs; Fig. 1), the final training dataset contains 70,969 fundus images, of which 57,722 are non-referable DR and 13,247 are referable DR. An independent testing dataset from the UIC retina clinic was used for the target task of DR classification. This dataset contains 2500 images from 1250 patients (both eyes, OD and OS). Among the 1250 subjects (mean [SD] age, 53.37 [11.03] years), 818 were male (65.44%) and 432 were female (34.56%). The UIC data was curated for this study from clinical DR patients at the UIC retina clinic (not publicly available). The detailed demographic information of the subjects from UIC is given in Table 1. There was no statistically significant difference in the distribution of age, sex, or hypertension between the non-referable and referable DR groups (Analysis of Variance (ANOVA), P = 0.32, 0.18, 0.59 respectively).

Figure 1. Data exclusion criteria for EyePACS training data.

Table 1 Demographics characteristics of non-referable and referable DR subjects from the testing dataset at UIC.

Framework for contrastive learning-based pretraining

Our FundusNet framework consists of two primary steps. First, we perform self-supervised pretraining on unlabeled fundus images from the training dataset using contrastive learning to learn visual representations. Once the model has been trained, the weights are transferred to a secondary classifier model for supervised fine-tuning on labeled fundus images. Figure 2 summarizes the framework.

Figure 2. (a) A framework for contrastive learning based pretraining for referable vs non-referable diabetic retinopathy classification. NST denotes neural style transfer. The training utilizes the EyePACS dataset, whereas the test dataset comes from the UIC retina clinic. The input to the contrastive learning framework is fundus images (x); xi and xj are the augmented image pair generated from each x. The representations hi and hj are used as transfer learning weights (one-to-one for encoder layers) for the classifier network (ResNet50) after the contrastive learning pipeline is optimized, i.e., the contrastive loss has reached its minimum value. zi and zj refer to the final representations after the projection head. AdaIN refers to adaptive instance normalization, which allows real-time style transfer, described in26,27. Regular augmentation refers to the augmentations used in the original SimCLR paper18 (flipping, rotation, color distortion). (b) Architecture of the ResNet50 encoder. Conv convolutional layer, Pool pooling layer, Avg pool average pooling layer, FC fully connected layer.

To teach our model visual representations effectively, we adopt and modify the SimCLR framework18, a recently proposed self-supervised approach that relies on contrastive learning. In this method, the model learns representations by maximizing the agreement between two differently augmented versions of the same data using a contrastive loss (more details on the contrastive loss are provided in the "Materials and methods" section). The contrastive learning framework (Fig. 2a) attempts to teach the model to distinguish between similar and dissimilar images. Given a random sample of fundus images, the FundusNet framework takes each image x and augments it twice, creating two versions of the input image, xi and xj. The two images are encoded via a ResNet50 network (Fig. 2b), generating two encoded representations hi and hj. These two representations are then transformed via a non-linear multi-layer perceptron (MLP) projection head, yielding two final representations, zi and zj, which are used to calculate the contrastive loss. Based on the loss over the augmented pairs generated from a batch of input images, the encoder and projection head representations improve over time, and the representations obtained place similar images closer in the representation space. The CL framework contains a ResNet50 encoder (convolutional neural network and pooling layers with skip connections) with a projection head (dense and ReLU layers) that maps the representation. The batch size of the CL pretraining pipeline has been shown to have a significant effect on model pretraining18,19,28 and therefore on the performance of the target model. To test this, we trained our FundusNet model for batch sizes from 32 through 4096 in the CL pretraining step. The model is trained for 100 epochs or until the loss function saturates.
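For illustration, a minimal PyTorch sketch of this forward pass (ResNet50 encoder with its classification layer removed, a 2-layer MLP projection head, and a contrastive loss on the projected pair) is shown below. The layer sizes, module names, and function signatures are illustrative assumptions rather than our exact implementation; the loss itself is defined in "Materials and methods".

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet50 encoder with the classification layer removed; outputs 2048-d representations h
encoder = models.resnet50(weights=None)   # weights=None: train from scratch (torchvision >= 0.13)
encoder.fc = nn.Identity()

# 2-layer MLP projection head mapping h to the space z where the contrastive loss is computed
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, 128),
)

def contrastive_step(x_i, x_j, nt_xent_loss):
    """One pretraining step on a batch of augmented view pairs (x_i, x_j)."""
    h_i, h_j = encoder(x_i), encoder(x_j)                  # representations later reused for transfer
    z_i, z_j = projection_head(h_i), projection_head(h_j)  # projections used only for the loss
    return nt_xent_loss(z_i, z_j)
```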

Improving representation learning through neural style transfer (NST)

One of the key findings from CL based self-supervised pretraining is that augmentation and transformation are key to better representation learning. As we adopted and modified the SimCLR framework, which was originally developed for classification of natural images, we found that regular image augmentation techniques such as flipping and rotation did not generate good representations (zi and zj in Fig. 2a) from fundus images. A study by Geirhos et al.27 demonstrated that CNNs used in computer vision tasks are often biased towards texture, compared to the global shape features that are primarily used by humans to distinguish classes. Increasing shape bias by randomizing texture environments can be a useful way to improve the accuracy and generalizability of a CNN model. NST manipulates the low-level texture representation of an image (style) but preserves the semantic content. NST has previously been demonstrated to improve robustness to domain shift in CNNs for computer vision tasks29,30. In our study, we integrated an NST-based augmentation technique into the CL pipeline, based on convolutional style transfer from non-medical style sources (i.e., art, paintings, etc.). The NST replaces the style of the fundus images (primarily texture, color, and contrast) with that of randomly selected non-medical images. However, it preserves the semantic contents of the image (global objects and shapes such as microaneurysms and vasculature) required for better disease detection. The NST convolution methodology was adopted from AdaIN style transfer26,27. The style source was artistic paintings from Kaggle's 'Painter by Numbers' dataset (79,433 paintings), downloaded via https://www.kaggle.com/c/painter-by-numbers. In the CL pretraining, the NST based augmentation was combined with regular augmentation techniques such as rotation, flipping, color distortion, crops with resize, and Gaussian blur. A higher probability (70%) of augmentation through NST was defined in the pretraining protocol (a sketch of this sampling scheme is shown below). To compare the performance improvement in detecting referable DR due to the integration of NST into our pipeline, we also trained two baseline CL frameworks: the first with just the original SimCLR augmentations18, and a second, state-of-the-art lesion based CL model that utilized lesion patches instead of whole fundus images (with the original SimCLR augmentations).
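One way this 70% NST probability can be combined with the regular augmentations is sketched below. The transform parameters and the assumption that stylized copies are pre-computed offline (as described in "Materials and methods") are illustrative; the exact augmentation settings used in FundusNet may differ.

```python
import random
from torchvision import transforms

# Regular SimCLR-style augmentations applied to every view (parameter values are illustrative)
regular_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(30),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

NST_PROBABILITY = 0.7  # probability of starting from a pre-stylized (AdaIN) copy of the image

def make_view(fundus_image, stylized_copies):
    """Produce one augmented view: with 70% probability draw a pre-computed
    NST-stylized copy of the image, then apply the regular augmentations."""
    if stylized_copies and random.random() < NST_PROBABILITY:
        source = random.choice(stylized_copies)  # stylized offline, see "Materials and methods"
    else:
        source = fundus_image
    return regular_augment(source)
```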

Referable vs non-referable DR classification training schemes

Using the weights of the pretrained network as initialization, we trained an end-to-end supervised model for the downstream DR classification task (referable vs non-referable DR). We trained a ResNet50 encoder network with standard cross-entropy loss, a batch size of 256, the ADAM optimizer, and random augmentations (Gaussian blurring, resizing, rotations, flipping, and color distortions). The transfer learning weights were transferred encoder to encoder (one-to-one; Fig. 2), i.e., the h representations from the CL network (before the projection head) were transferred to a ResNet50 encoder. To compare our FundusNet results, we also trained two separate fully supervised baseline models (ResNet50 and InceptionV3 encoder networks, both initialized with ImageNet weights). Both baseline models are based on DL models in the literature that have achieved state-of-the-art diagnostic accuracy in detecting referable DR17,24. A standard hyperparameter search (learning rate (logarithmic grid search between 10−6 and 10−2), optimizer (ADAM, SGD), batch size (32, 64, 128, 256)) and training protocol were maintained for FundusNet and both baseline networks.
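The one-to-one encoder transfer can be sketched as follows; the checkpoint filename and the strict=False loading pattern are assumptions for illustration, not our released artifacts.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical sketch: one-to-one transfer of the CL-pretrained encoder weights
# (the h-level representation, before the projection head) into a ResNet50 classifier.
classifier = models.resnet50(weights=None)
classifier.fc = nn.Linear(classifier.fc.in_features, 2)   # referable vs non-referable head

pretrained_state = torch.load("fundusnet_encoder.pth")    # checkpoint name is assumed
# strict=False copies every encoder layer and leaves the freshly initialized fc head untouched
classifier.load_state_dict(pretrained_state, strict=False)
```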

To further investigate whether the CL pretrained model performs well with smaller training data (and ground truth), we reduced the training dataset gradually from 100% to 10% (in 10% steps) and conducted the downstream classification training for both the CL and ImageNet pretrained baseline models (a sketch of the stratified subsampling is shown below). After identifying the best hyperparameters and fine-tuning the models for each experiment, we chose the model that had the best performance on the validation dataset (fivefold cross-validation). The final optimal models were tested on the independent testing dataset from UIC. In terms of encoder networks, we compared three types of encoder networks in our experiment (VGG, ResNet, and Inception architectures).
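A minimal sketch of how such label-fraction subsets might be drawn is given below; the helper name and the use of stratified sampling are illustrative assumptions rather than a description of our exact protocol.

```python
from sklearn.model_selection import StratifiedShuffleSplit

def label_fraction_subset(image_paths, labels, fraction, seed=0):
    """Hypothetical helper: return a stratified subset containing `fraction` of the
    labeled training data (fraction swept from 1.0 down to 0.1 in 0.1 steps)."""
    if fraction >= 1.0:
        return image_paths, labels
    splitter = StratifiedShuffleSplit(n_splits=1, train_size=fraction, random_state=seed)
    keep_idx, _ = next(splitter.split(image_paths, labels))
    return [image_paths[i] for i in keep_idx], [labels[i] for i in keep_idx]
```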

FundusNet performance on real-life clinical test data

The FundusNet model pretrained with style transfer augmentation achieved an average area under the receiver operating characteristic (ROC) curve (AUC) of 0.91 on the independent test dataset from UIC, outperforming the state-of-the-art supervised baseline models (ResNet50 and InceptionV3) trained with ImageNet weights17 (AUCs of 0.80 and 0.83, respectively), and the CL baseline models trained with the SimCLR and lesion based frameworks (AUCs of 0.83 and 0.86, respectively) (Tables 2, 3). The significant performance difference on the test set compared to the baseline models indicates that the FundusNet model generalized better through our pretraining framework, owing to the enhanced representation learning enabled by NST. The NST augmentation, designed specifically for learning global geometric features, allowed learning of more discriminative visual representations of retinal pathologies, improving the overall classification performance compared to CL models pretrained with generic augmentation techniques.

Table 2 Classification performance of FundusNet framework for referable vs non-referable DR compared to fully supervised baseline models.
Table 3 Evaluation of model performance on test set based on augmentation and contrastive learning techniques compared to the state of the art.

To investigate the label-efficiency of the FundusNet model, we trained our model on different fractions of the labeled training data and tested each resulting model on the test dataset. We compared this to the performance of the baseline models. Fine-tuning experiments were conducted on fivefold training data and the results were averaged. Figure 3a shows how the performance varies using the different label fractions for both the FundusNet and baseline supervised models on the testing dataset. We observe that the CL pretrained FundusNet model retains AUC performance even when the labels are reduced to 10%, whereas the performance of the baseline models drops considerably. When reducing the amount of training data from 100 to 10%, the AUC for FundusNet drops from 0.91 to 0.81 when tested on UIC data, whereas the drop is larger for the baseline models (0.80 to 0.58 for ResNet50 and 0.81 to 0.63 for InceptionV3). Importantly, the FundusNet model is able to match the performance of the baseline models using only 10% labeled data when tested on independent test data from UIC (FundusNet AUC 0.81 when trained with 10% labelled data vs 0.80 and 0.81, respectively, for baseline models trained with 100% labelled data).

Figure 3. (a) AUC for referable vs non-referable DR classification tested on independent test data from UIC for FundusNet vs two baseline models with varied percentages of training data, (b) ROC curves for FundusNet and the two baseline models (test set).

In the experiment to evaluate the optimal batch size for CL pretraining, we observed that the CL framework learned better image representations when there was a higher number of negative examples in a batch (i.e., augmented image pairs generated from other images in the batch); therefore, a higher batch size yielded better performance (Table 4; AUC of 0.77 for batch size 32 vs 0.91 for batch size 2048 on the test dataset). However, a higher batch size also requires larger compute resources. We observed that at batch size 4096 the AUC did not improve significantly, so the optimum batch size was chosen as 2048. In terms of encoder networks, ResNet50 provided the best classification performance in both the validation and test datasets compared to the VGG and Inception architectures (Table 5).

Table 4 Effect of batch size on FundusNet performance on detecting referable DR.
Table 5 Classification performance using different encoder networks.

We also generated saliency maps using GradCAM31 to visualize the regions in the retina that were most likely used by our models to predict DR stages. Two sample saliency maps from a PDR subject are shown in Fig. 4, generated from a ResNet50 model pretrained with ImageNet and with FundusNet weights, respectively. Hotter colors such as red and orange indicate regions that potentially contribute more towards the predicted output. We observe that the saliency map near the optic disc is similar in both models; however, ResNet50 with FundusNet weights also attends to additional regions at the top of the fundus image that contain physiological markers.
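A minimal Grad-CAM sketch is shown below for context; it uses standard PyTorch hooks on the last residual block rather than our exact visualization code, and the variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, layer):
    """Minimal Grad-CAM sketch: weight the chosen layer's activations by the gradient
    of the target class score, average over channels, and upsample to image size."""
    activations, gradients = [], []
    fwd = layer.register_forward_hook(lambda m, inp, out: activations.append(out))
    bwd = layer.register_full_backward_hook(lambda m, gin, gout: gradients.append(gout[0]))

    model.eval()
    score = model(image.unsqueeze(0))[0, target_class]     # image: (3, H, W) tensor
    model.zero_grad()
    score.backward()
    fwd.remove(); bwd.remove()

    acts, grads = activations[0], gradients[0]              # both (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)          # channel-wise importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()    # normalized (H, W) heatmap

# Example: saliency for the referable-DR class from the last ResNet50 block (names assumed)
# heatmap = grad_cam(classifier, fundus_tensor, target_class=1, layer=classifier.layer4)
```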

Figure 4. (a) Sample PDR fundus image, (b) saliency map from the ResNet50 model with ImageNet weights, (c) saliency map from the ResNet50 model with FundusNet weights.

Discussion

In this paper, we present a self-supervised CL based pipeline, FundusNet, for improving the performance of referable vs non-referable DR classification over previously published baseline models17,24. CL improves the representation and transferability of DR classification models. Our key contributions are: (i) the novel NST based CL pretraining method significantly outperformed the baseline models pretrained with ImageNet weights; (ii) we proposed a novel NST based augmentation within the CL framework that enhances the capability of the model to learn fundus data representations for robust classification performance. The augmentation outperforms the generic SimCLR and the state-of-the-art lesion-based CL framework in terms of pretraining a model for better transferability; and (iii) the CL pretrained models performed well even when we reduced the amount of training data by 90%. The FundusNet network with CL pretraining shows the potential of integrating self-supervision in deep learning frameworks for developing robust diagnostic models with reduced amounts of data and associated ground truth. Previous models for the diagnosis of DR used ImageNet weights to initialize their models11,12,21,22,23,24. While transfer learning through ImageNet can be useful, recent work has shown that the role of ImageNet weights for medical imaging tasks can be limited32,33, since they were developed and validated on natural images, and pretraining on in-domain medical image data (i.e., FundusNet weights) can be more effective33 than ImageNet. Our CL framework provides enhanced transfer learning weights that can be used in other applications pertaining to retinal images and disease diagnosis.

The FundusNet model achieves high sensitivity and specificity in referable vs non-referable DR classification (Table 2) and performed significantly better than the supervised baseline models (ResNet50 and InceptionV317,24) on the independent test dataset (AUC of 0.91 vs 0.80 and 0.81, respectively) and on the EyePACS fivefold validation data (AUC of 0.95 vs 0.92 and 0.94, respectively), suggesting that the CL model generalized better. Several previous studies have demonstrated great success in detecting referable DR using deep learning approaches. Gulshan et al. developed a model with 128,175 images and validated it on 9963 fundus images, achieving a high level of performance for classifying referable DR (AUC = 0.99)8. Ting et al. trained their deep learning model using 73,370 images and reported excellent results for referable vs non-referable DR (AUC of 0.936)15. Li et al. trained their deep learning system on 71,043 images and validated it on a real-world dataset of 35,201 images (AUC of 0.955 for vision-threatening DR)24. To achieve such excellent diagnostic accuracy, these works used large, diverse training datasets (ranging from 70,000 to 200,000 images) with class labels created by multiple clinicians. Our CL framework could potentially reduce the need for large training datasets. On the independent test data from UIC, the FundusNet model matched the AUC of the two baseline models with only 10% of the labeled training data (Fig. 3).

In our study, we not only adapted contrastive learning into our training pipeline, but also further modified the pipeline with a novel NST based augmentation. NST is an image style transformation algorithm that manipulates the low-level texture representation of an image while preserving the semantic content. NST has previously been demonstrated to improve robustness to domain shift in CNNs for computer vision tasks20, and we have adopted it into our FundusNet framework. This is a unique application of style transfer from medically irrelevant artistic images in ophthalmic diagnostics. The NST augmentation facilitated learning discriminative visual representations of retinal pathologies by encouraging the model to learn global shape based structural characteristics of the vasculature and other pathologies in the retina. NST works well as an augmentation method because the style transfer can introduce a wide variety of textures when training classifiers on medical images (which often have more uniform color and texture distributions), which can improve classifier performance. Therefore, compared to conventional augmentations (such as the ones used in the original SimCLR18), the model generalizes better when augmented with NST using texture-rich artistic images (AUC of 0.91 vs 0.83 when FundusNet used NST vs did not use NST on top of regular augmentations). We compared the NST based framework with another recent work that proposed a lesion based CL framework25 for DR classification, where the model input is segmented lesions rather than whole fundus images. The augmentation technique used in that work is similar to that of the original SimCLR paper. We implemented this framework for comparison and observed that the lesion-based CL pretraining did not provide significant performance improvement compared to the SimCLR framework (Table 3). A possible reason is that this method requires a patch detection model to be employed prior to CL pretraining, which does not generalize well to new datasets (i.e., EyePACS or UIC data) and can lead to a very large number of patches that do not contain any pathologies. Furthermore, the generic augmentation techniques do not cater well to retinal image representation learning. Our proposed framework with NST can identify more global features that are useful for DR classification. When compared, the NST model outperformed the lesion-based CL framework (AUCs of 0.91 vs 0.84 for 100% labelled data and 0.81 vs 0.75 for 10% labelled data). Furthermore, our NST model does not require any additional step for patch identification, which can add processing time and complicate both training and evaluation of models on new external datasets.

Our study has several limitations. First, we did not automatically optimize the NST hyperparameters. The use of NST as an augmentation technique is new and requires further investigation to optimize the style coefficients. Another limitation of our approach is that a large batch size is required for training the CL model. Self-supervised frameworks like SimCLR and MoCo reported the need for a larger batch size18,19,28 because CL training requires a large number of negative samples in a batch to calculate the contrastive loss efficiently. This can impair computational efficiency. Furthermore, our model was tested on only one external clinical site at UIC, which may not be sufficiently representative of a broader population. Further validation of the developed FundusNet framework is required in future studies.

In conclusion, we have introduced a self-supervised CL based framework for referable vs non-referable DR classification that includes NST augmentation for learning effective pathological and visual representations from retinal fundus images. Our experiments demonstrate that our CL based pretraining yields significant improvements in DR classification compared with the baseline models on independent testing data (AUC (CI) values of 0.91 (0.898 to 0.930) for FundusNet vs 0.80 (0.783 to 0.820) for ResNet50 and 0.83 (0.801 to 0.853) for the InceptionV3 baseline models). Importantly, our CL model can be trained using only 10% of the training data, which could reduce the annotation burden on clinicians in producing training datasets. The FundusNet model is able to match the performance of the baseline models using only 10% labeled data when tested on independent test data from UIC (FundusNet AUC 0.81 when trained with 10% labelled data vs 0.80 and 0.81, respectively, for baseline models trained with 100% labelled data). Ultimately, our CL method could be useful for developing classification models for clinical use once validated sufficiently.

Materials and methods

Study design and participants

This study was approved by the institutional review board (IRB) of Stanford University and the University of Illinois at Chicago (UIC) and was in compliance with the ethical standards stated in the Declaration of Helsinki. Informed consent was obtained from all subjects to use the de-identified data in this research. This multi-center cross-sectional study was primarily conducted at Stanford University School of Medicine. The testing data from UIC was shared via an encrypted cloud drive with researchers at Stanford. For training and developing the CL based pretraining and the referable vs non-referable DR classifier, we used the EyePACS dataset from Kaggle (88,692 fundus photographs, EyePACS, California). The final DR classifier model was tested on an independent dataset from UIC. The training data from EyePACS contained retinal fundus photographs from patients with varying degrees of severity of DR. The dataset had 57,722 non-referable and 13,247 referable DR images with resolutions varying from 433 × 289 up to 5184 × 3456 pixels. Images from the dataset are already labeled with stages of DR (0: no DR, 1: mild, 2: moderate, 3: severe non-proliferative DR (NPDR), and 4: proliferative DR (PDR)), following the diagnostic criteria for DR. For our project, we define referable DR as data with labels of moderate NPDR and above (label ≥ 2). We excluded images that satisfied our exclusion criteria (uneven illumination, color distortion, low contrast, motion blur, missing fovea or optic discs; Fig. 1). An automated image quality assessment algorithm was used to identify the images that fit the exclusion criteria34,35. An ophthalmologist then manually verified the exclusion criteria and the excluded images. One image could satisfy multiple exclusion criteria, but the primary criterion was identified from the automated assessment algorithm (the criterion with the higher probability) and then confirmed by the ophthalmologist. The testing dataset from UIC contained 2500 fundus photographs from patients with DR, recruited from the UIC retina clinic (1000 referable and 1500 non-referable DR). This was retrospective data from type II diabetes patients who underwent retinal imaging at the clinic. The patients are thus representative of a university population of diabetic patients who require imaging for the management of diabetic macular edema and DR. Images of both eyes were taken. Subjects with macular edema, a previous history of eye diseases, or vitreous surgery were excluded from the study. The patients were classified by severity of DR according to the Early Treatment Diabetic Retinopathy Study staging system, which was then converted to class labels of no DR, mild, moderate, and severe NPDR, and PDR. The grading was done by a retina specialist on dilated patients who were examined using a slit-lamp fundus lens; technicians did not contribute to the grading of the patients. All patients in this study provided written informed consent and did not receive any compensation or incentives to participate.

Contrastive learning-based pre-training

The CL framework learns representations by maximizing the agreement between two different augmented encodings (zi and zj in Fig. 2) of the same input data via a contrastive loss in the hidden representation space of the neural network19. The contrastive loss between a pair of positive augmented examples is defined as36:

$${loss}_{i,j}=-\log\frac{\exp\left(sim\left({z}_{i},{z}_{j}\right)/\tau \right)}{{\sum }_{k=1}^{2N}{\mathbb{1}}_{[k\ne i]}\exp\left(sim\left({z}_{i},{z}_{k}\right)/\tau \right)}$$
(1)

Here, sim(.) is the cosine similarity function computed on the projected representations; \({\mathbb{1}}_{[k\ne i]}\in \{0,1\}\) is an indicator function evaluating to 1 if and only if \(k\) is not equal to \(i\); \(\tau \) is an adjustable temperature parameter that scales the inputs and widens the range [−1, 1] of the cosine similarity; and N is the batch size. Upon calculating the cosine similarity, a SoftMax function is applied to obtain the probability of the two pairs being similar. This SoftMax calculation is equivalent to obtaining the probability of the augmented image pair being the most similar to each other. Here, all remaining images in the batch are treated as dissimilar images (negative pairs). The loss for a pair is calculated by taking the negative of the log of this SoftMax value. In the pipeline, the final loss is computed across all the positive pairs in a batch. Based on the loss, the encoder and projection head representations improve over time and the representations obtained place similar images closer in the space. Once the CL model is trained on the contrastive learning task, it can be used for transfer learning.
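A minimal PyTorch sketch of the loss in Eq. (1) is shown below, assuming z_i and z_j are the two (N, d) batches of projected views; the temperature default and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """Contrastive (NT-Xent) loss of Eq. (1) for a batch of N positive pairs."""
    n = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)   # (2N, d), unit-norm rows
    sim = torch.mm(z, z.t()) / temperature                  # pairwise cosine similarity / tau

    # exclude the k == i self-similarity terms from the denominator
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))

    # the positive for row i is row i + N, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    # cross-entropy over the remaining 2N-1 candidates reproduces Eq. (1), averaged over pairs
    return F.cross_entropy(sim, targets)
```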

The CL pre-training is conducted for batch sizes of 32 through 4096. Large batch sizes can be unstable when using standard stochastic gradient descent with linear learning rate scaling37. To stabilize the CL pre-training, a LARS optimizer38 is used for all batch sizes. We train our model with Cloud TPUs, using up to 12 v2 cores depending on the batch size. With 12 TPU v2 cores, it takes around 18 h to pre-train a ResNet-50 encoder with a batch size of 2048 for 100 epochs.

In the default setting of the CL pre-training, a ResNet50 base encoder along with a 2-layer multi-layer perceptron (MLP) projection head were used to project the representations to a 128-dimensional embedding space (z). The loss from Eq. (1) was optimized with the LARS optimizer with a learning rate of 0.3 and a weight decay of 1e−5. The batch size with the best performance was 2048 with 100 epochs. The pre-training experiments were conducted with and without initializing ImageNet weights. The augmentations alongside the style transfer comprised random horizontal and vertical flips, Gaussian blur, partial rotation up to 30°, and random color augmentation and jittering (strength = 0.5). All images were resized to 224 × 224.

Referable DR vs non-referable DR classification

For the task of referable vs non-referable DR classification, a ResNet50 network was trained with a batch size of 256 (image size 224 × 224) and standard cross-entropy loss optimized with the ADAM optimizer. On-the-fly random data augmentation was conducted (rotations (up to 30°), horizontal flipping, and color distortions). We experimented with the learning rate and weight decay (logarithmic grid searches between 10−6 and 10−2 and between 10−5 and 10−3, respectively). For the ImageNet supervised baseline models, similar data augmentation and hyper-parameter search methods were used.
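A compact fine-tuning loop under these settings might look as follows; the epoch count, learning rate, and the assumption that `train_dataset` yields augmented (image, label) pairs are placeholders, not our exact configuration.

```python
import torch
from torch.utils.data import DataLoader

def finetune(classifier, train_dataset, epochs=30, lr=1e-4, weight_decay=1e-4):
    """Minimal sketch of supervised fine-tuning: batch size 256, 224x224 inputs,
    cross-entropy loss, ADAM optimizer (hyperparameters found by grid search)."""
    loader = DataLoader(train_dataset, batch_size=256, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr, weight_decay=weight_decay)
    criterion = torch.nn.CrossEntropyLoss()
    classifier.train()
    for _ in range(epochs):
        for images, labels in loader:      # images: (B, 3, 224, 224), labels: (B,)
            optimizer.zero_grad()
            loss = criterion(classifier(images), labels)
            loss.backward()
            optimizer.step()
    return classifier
```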

Suggested configurations

For CL pre-training at the default setting, at least a 4-core TPU is suggested; otherwise, memory will be a limiting factor. After CL pre-training, any desktop or laptop computer with an x86-compatible CPU, 8 GB or more of free disk space, and at least 8 GB of memory is sufficient for training and testing the referable vs non-referable DR classification framework with FundusNet weights. A computer with a GPU significantly lowers the training time (12 h on a CPU vs 3 h on a GPU).

Augmentation with neural style transfer

The stylized version of the fundus dataset is generated by applying AdaIN style transfer26,27. The AdaIN style transfer takes an input image (fundus) and a random artistic style image as inputs, then generates an output image that retains the content of the input image and the style of the artistic image. Within the framework, both the input and style images are encoded into the feature space, and both feature maps are fed into the AdaIN layer, which aligns the mean and variance of the input feature maps to those of the style feature maps, producing the target feature maps. A decoder is then used to generate the stylized output image from the target feature maps. AdaIN style transfer was chosen because it is capable of transferring random styles in real time. Each fundus image was stylized through AdaIN with a stylization coefficient of 1.0. The artistic style images came from the 'Painter by Numbers' dataset (79,433 paintings), available on Kaggle. During the NST augmentation, the input images were resized to 512 × 512, and the style source images were resized to 128 × 128 to maintain the shape and global characteristics of the input image during stylization. The NST augmented images were synthesized in advance and loaded on the fly during CL pre-training, because performing the NST augmentation itself on the fly is computationally very expensive.
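The AdaIN operation itself reduces to a per-channel statistics alignment in feature space, as sketched below (a minimal illustration; the epsilon value and blending with the stylization coefficient are assumptions consistent with the description above).

```python
import torch

def adain(content_feat, style_feat, alpha=1.0, eps=1e-5):
    """Adaptive instance normalization: align the per-channel mean and standard deviation
    of the content feature map to those of the style feature map; alpha is the
    stylization coefficient (1.0 in this study). Inputs are (N, C, H, W) feature maps."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    normalized = s_std * (content_feat - c_mean) / c_std + s_mean
    # blend with the original content features; a decoder then reconstructs the stylized image
    return alpha * normalized + (1.0 - alpha) * content_feat
```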

Statistical analysis

Our primary metric to evaluate model performance was the area under the curve (AUC) with 95% confidence intervals (CI). For each experiment, the sensitivity and specificity of the CNN classifiers were also computed across probability thresholds to plot the receiver operating characteristic (ROC) curves and calculate the AUC. For the classification thresholds used to generate the ROC curves and concurrent analyses, we used Youden's index. The optimal cut-off giving the best sensitivity and specificity results (Table 2) was based on Youden's J = max(sensitivity + specificity − 1), also shown in Fig. 3b.
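The operating point selection can be computed directly from the ROC curve, as in the sketch below (function name and return convention are illustrative).

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def youden_operating_point(y_true, y_score):
    """Return the threshold maximizing Youden's J = sensitivity + specificity - 1,
    along with the sensitivity and specificity at that cut-off."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr                          # sensitivity - (1 - specificity)
    best = int(np.argmax(j))
    return thresholds[best], tpr[best], 1.0 - fpr[best]

# The primary metric is the area under the ROC curve:
# auc = roc_auc_score(y_true, y_score)
```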

Ethics

This study was approved by the institutional review board (IRB) of Stanford University and the University of Illinois at Chicago (UIC) and was in compliance with the ethical standards stated in the Declaration of Helsinki. Informed consent was obtained from all subjects to use the de-identified data in this research.