Contrastive learning-based pretraining improves representation and transferability of diabetic retinopathy classification models

Diabetic retinopathy (DR) is a major cause of vision impairment in diabetic patients worldwide. Due to its prevalence, early clinical diagnosis is essential to improve treatment management of DR patients. Despite recent demonstrations of successful machine learning (ML) models for automated DR detection, there is a significant clinical need for robust models that can be trained with smaller datasets and still perform with high diagnostic accuracy on independent clinical datasets (i.e., high model generalizability). Toward this need, we have developed a self-supervised contrastive learning (CL) based pipeline for classification of referable vs non-referable DR. Self-supervised CL based pretraining enables richer data representations and, therefore, the development of robust and generalizable deep learning (DL) models, even with small labeled datasets. We have integrated a neural style transfer (NST) augmentation into the CL pipeline to produce models with better representations and initializations for the detection of DR in color fundus images. We compare our CL pretrained model performance with two state-of-the-art baseline models pretrained with ImageNet weights. We further investigate the model performance with reduced labeled training data (down to 10 percent) to test the robustness of the model when trained with small labeled datasets. The model is trained and validated on the EyePACS dataset and tested independently on clinical datasets from the University of Illinois, Chicago (UIC). Compared to baseline models, our CL pretrained FundusNet model had a higher area under the receiver operating characteristic (ROC) curve (AUC) (95% CI) value: 0.91 (0.898 to 0.930) vs 0.80 (0.783 to 0.820) and 0.83 (0.801 to 0.853) on UIC data. At 10 percent labeled training data, the FundusNet AUC was 0.81 (0.78 to 0.84) vs 0.58 (0.56 to 0.64) and 0.63 (0.60 to 0.66) for the baseline models, when tested on the UIC dataset.
CL based pretraining with NST significantly improves DL classification performance, helps the model generalize well (transferable from EyePACS to UIC data), and allows training with small annotated datasets, thereby reducing the ground truth annotation burden on clinicians.


Diabetic retinopathy (DR) is a major ocular manifestation of diabetes. According to a World Health Organization (WHO) report, the number of diabetic patients is estimated to reach 642 million by the year 2040. Nearly 40-45% of patients with diabetes are prone to vision impairment due to DR, putting the global estimate of DR patients at nearly 224 million. DR can be initially asymptomatic at its nonproliferative stage (NPDR), which is characterized by the presence of microaneurysms. If not treated in a timely fashion, it can progress to proliferative diabetic retinopathy (PDR), leading to irreversible vision loss and blindness. The American Academy of Ophthalmology (AAO) recommends that patients with the prevalent Type II diabetes be screened every year after the initial diagnosis [1]. However, studies show that fewer than 50% of diabetic patients follow through and get their yearly screening, the rate being even lower (15-20%) in rural areas [2,3]. Therefore, it is imperative to find an efficient way to improve treatment management for DR and enable mass screening, early onset detection, and clinical diagnostics.
In recent years, researchers have successfully demonstrated machine learning (ML) and deep learning (DL) based algorithms for DR diagnosis and referrals. In particular, DL methodologies have facilitated feature extraction and DR classification with high accuracy, sensitivity, and specificity [4][5][6][7][8][9][10][11][12][13][14][15][16] using different imaging modalities such as fundus images, optical coherence tomography (OCT), and OCT angiography (OCTA) images. In general, such DL based DR classification pipelines require large, clean, diverse data, ground truth associated with the data, and a robust DL model (convolutional neural networks such as VGG16, ResNet, InceptionNet, etc.). In the case of referable vs non-referable DR classification, despite the impressive performance showcased by these DL models, we can make two major observations: i) for widespread deployment, the DL models need to be more generalizable, and ii) more data from different sub-populations can lead to better model training. However, this need to utilize larger, more diverse data also increases the need for ground truth generation and data labeling, which creates a large burden on clinicians. In this study, our focus is on reducing the burden of ground truth generation.
To address this goal, we have developed a self-supervised contrastive learning (CL) based pipeline for classification of referable vs non-referable DR. Self-supervised methods like CL help a DL model learn an effective representation of the data without the need for large ground truth data; the supervision is provided by the data itself. In such pipelines, a DL network is trained on a primary task (representation learning, which requires no ground truth), and the weights from that task are then transferred to a secondary target task (i.e., classification, which requires a smaller set of data and ground truth). For example, if a model is trained to solve an image puzzle, it does not need ground truth (the input image is the reference ground truth). But by learning to solve the puzzle, the model learns an effective representation and the characteristic features of the image. This model's weights can then be used for an image classification task, yielding high classification performance with smaller data and ground truth [17]. In this paper, we present 'FundusNet', a CL based framework that achieves high classification performance even with smaller sets of fundus image data. In addition, we introduce a neural style transfer (NST) based image augmentation technique that effectively improves the representation learning capability of the CL network on fundus images. The model with FundusNet weights is independently evaluated on external clinical data and achieves high sensitivity and specificity compared to baseline models. The CL model also performed well even when the labeled dataset was reduced to 10% of its original size, suggesting the potential of CL to train models for DR diagnosis using small, labeled datasets.

Study design and participants
This study was approved by the institutional review boards of Stanford University and the University of Illinois at Chicago (UIC) and was in compliance with the ethical standards stated in the Declaration of Helsinki. This multi-center, cross-sectional study was primarily conducted at Stanford University School of Medicine. The testing data from UIC was shared via an encrypted cloud drive with researchers at Stanford.
For training and developing the CL based pretraining and the referable vs non-referable DR classifier, we used the EyePACS dataset from Kaggle (88,702 fundus photographs, EyePACS, California). The final DR classifier model was tested on an independent dataset from UIC. The training data from EyePACS contained retinal fundus photographs from patients with varying degrees of DR severity. The dataset had 71,548 non-referable (65,343 no DR, 6,205 mild NPDR) and 17,154 referable (13,153 moderate NPDR, 2,087 severe NPDR, and 1,914 PDR) DR images, with resolutions varying from 433 x 289 up to 5184 x 3456 pixels. Images in the dataset are already labeled with stages of DR (0: no DR, 1: mild, 2: moderate, 3: severe non-proliferative DR (NPDR), and 4: proliferative DR (PDR)), following the diagnostic criteria for DR. For our project, we define referable DR as data with labels of moderate NPDR and above (label >= 2). We excluded images that had motion artifacts, were too dark or blurry to confidently stage disease, or were missing the fovea or optic disk. An automated image quality assessment algorithm was used to identify the images that fit the exclusion criteria [18,19]. The testing dataset from UIC contained 2,500 fundus photographs from patients with DR, recruited from the UIC retina clinic (1,000 referable and 1,500 non-referable DR). This was retrospective data of type II diabetes patients who underwent retinal imaging at the clinic. The patients are thus representative of a university population of diabetic patients who require imaging for management of diabetic macular edema and DR.
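The grade-to-label binarization described above (moderate NPDR and worse is referable) is simple enough to state in code. The helper below is a hypothetical illustration of that mapping, not part of the study's released pipeline:

```python
# Map EyePACS severity grades to the binary referable-DR label used in this study.
# Grades: 0 = no DR, 1 = mild NPDR, 2 = moderate NPDR, 3 = severe NPDR, 4 = PDR.
def to_referable(grade: int) -> int:
    """Return 1 (referable) for moderate NPDR or worse (grade >= 2), else 0."""
    if grade not in (0, 1, 2, 3, 4):
        raise ValueError(f"unknown DR grade: {grade}")
    return int(grade >= 2)

# The five severity grades collapse into the two classes used for training.
labels = [to_referable(g) for g in (0, 1, 2, 3, 4)]  # [0, 0, 1, 1, 1]
```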
Images of both eyes were taken. Subjects with macular edema, a previous history of eye diseases, or vitreous surgery were excluded from the study. The patients were classified by severity of DR according to the Early Treatment Diabetic Retinopathy Study staging system, which was then converted to class labels of no DR, mild, moderate, and severe NPDR, and PDR. The grading was done by a retina specialist on dilated patients examined using a slit-lamp fundus lens; technicians did not contribute to the grading. All patients in this study provided written informed consent and did not receive any compensation or incentives to participate.

Framework for contrastive learning-based pretraining
Our FundusNet framework consists of two primary steps. First, we perform self-supervised pretraining on unlabeled fundus images from the training dataset using contrastive learning to learn visual representations. Once the model has been trained, the weights are transferred to a secondary classifier model for supervised fine-tuning on labeled fundus images. Figure 1 summarizes the framework.
Figure 1: (a) A framework for contrastive learning based pretraining for referable vs non-referable diabetic retinopathy classification. NST denotes neural style transfer. The training utilizes the EyePACS dataset, whereas the test dataset comes from the UIC retina clinic. The representations h_i and h_j are used as transfer learning weights for the classifier network after the contrastive learning pipeline is optimized, i.e., the contrastive loss has reached its minimum value. AdaIN refers to adaptive instance normalization, which allows real-time style transfer, described in [20,21]. Regular augmentation refers to the augmentations used in the original SimCLR paper [22] (flipping, rotation, color distortion); (b) Architecture of the ResNet50 encoder.
To teach our model visual representations effectively, we adopt the SimCLR framework [22], a recently proposed self-supervised approach that relies on contrastive learning. In this method, the model learns representations by maximizing the agreement between two differently augmented versions of the same data using a contrastive loss (more details on the contrastive loss are provided in the supplemental material). This contrastive learning framework (Figure 1a) attempts to teach the model to distinguish between similar and dissimilar images. Given a random sample of fundus images, the FundusNet framework takes each image x and augments it twice, creating two versions of the input image, x_i and x_j. The two images are encoded via a ResNet50 network (Figure 1b), generating two encoded representations, h_i and h_j. These two representations are then transformed via a non-linear multi-layer perceptron (MLP) projection head, yielding two final representations, z_i and z_j, which are used to calculate the contrastive loss. Based on the loss over the augmented pairs generated from a batch of input images, the encoder and projection head representations improve over time, and the representations obtained place similar images closer together in the representation space. The CL framework contains a ResNet50 encoder (convolutional and pooling layers with skip connections) with a projection head (dense and ReLU layers) that maps the representation. The batch size of the CL pretraining pipeline has been demonstrated to have a significant effect on model pretraining [22][23][24] and, therefore, on the performance of the target model. To test this, we trained our FundusNet model for batch sizes 32 through 4096 (step size 32). The model is trained for 100 epochs or until the loss function saturates.
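The contrastive loss used by SimCLR is the NT-Xent (normalized temperature-scaled cross-entropy) loss: each projection z_i is pulled toward its positive partner z_j and pushed away from all other projections in the batch. A minimal NumPy sketch of that loss (our illustration, not the study's training code; the temperature value here is an assumption) is:

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss as in SimCLR.

    z: array of shape (2N, d) of projection-head outputs, where rows 2k and
    2k+1 are the two augmented views of image k.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via dot products
    sim = z @ z.T / temperature                        # (2N, 2N) scaled similarity matrix
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity from the softmax
    n2 = z.shape[0]
    pos = np.arange(n2) ^ 1                            # positive partner: (0,1), (2,3), ...
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n2), pos].mean()
```

With well-aligned positive pairs the loss is small; shuffling the pairing raises it, which is exactly the signal that drives the encoder toward useful representations.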

Improving representation learning through neural style transfer (NST)
One of the key findings from CL based self-supervised pretraining is that augmentation and transformation are key to better representation learning. As we adopted and modified the SimCLR framework, which was originally used on natural images, we found that regular image augmentation techniques such as flipping and rotating did not generate good representations (z_i and z_j in Figure 1a) from fundus images. A study by Geirhos et al. [21] demonstrated that CNNs used in computer vision tasks are often biased toward texture, compared to the global shape features that humans primarily use to distinguish classes. Increasing shape bias by randomizing texture environments can be a useful way to improve the accuracy and generalizability of a CNN model. NST manipulates the low-level texture representation of an image (style) but preserves the semantic content. NST has previously been demonstrated to improve robustness to domain shift in CNNs for computer vision tasks [25,26]. In our study, we integrated an NST-based augmentation technique into the CL pipeline, based on convolutional style transfer from non-medical style sources (i.e., art, paintings, etc.). The NST replaces the style of the fundus images (primarily texture, color, and contrast) with that of randomly selected non-medical images.
However, it preserves the semantic contents (global objects and shapes such as microaneurysms, vasculature, etc.) of the image required for better disease detection. The NST convolution methodology was adopted from AdaIN style transfer [20,21]. The style source was artistic paintings from Kaggle's 'Painter by Numbers' dataset (79,433 paintings), downloaded via https://www.kaggle.com/c/painter-by-numbers. In the CL pretraining, the NST based augmentation was combined with regular augmentation techniques such as rotation, flipping, color distortion, crops with resize, and Gaussian blur. A higher probability (70%) of augmentation through NST was defined in the pretraining protocol. To compare the performance improvement in detecting referable DR due to the integration of NST into our pipeline, we also trained a CL framework with just the original SimCLR augmentations [22].
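The core of AdaIN style transfer is the per-channel re-normalization AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y), which aligns the content features' channel statistics with the style features' [20]. A minimal NumPy sketch of that operation follows; note this is an illustration on raw arrays, whereas the actual method applies it to VGG feature maps inside an encoder-decoder style-transfer network:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: match each channel of the content
    features to the per-channel mean/std of the style features.

    content, style: arrays of shape (C, H, W).
    """
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_sd = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_sd = style.std(axis=(1, 2), keepdims=True)
    # Normalize content channels, then re-scale/shift to the style statistics.
    return s_sd * (content - c_mu) / (c_sd + eps) + s_mu
```

After this transformation, each output channel carries the style image's mean and (approximately) its standard deviation, while the spatial arrangement, i.e., the semantic content, still comes from the content image.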

Referable vs non-referable DR classification
Using the weights of the pretrained network as initializations, we trained an end-to-end supervised model for a downstream DR classification task (referable vs non-referable DR). We trained a ResNet50 encoder network with standard cross-entropy loss, a batch size of 256, the ADAM optimizer, and random augmentations (Gaussian blurring, resizing, rotations, flipping, and color distortions). The fundus images were resized to 224 x 224 pixels during this stage. This pixel dimension was optimized based on the tradeoff between image resolution and memory limitations during model training. To compare our FundusNet results, we also trained two separate fully supervised baseline models (ResNet50 and InceptionV3 encoder networks, both initialized with ImageNet weights). Both baseline models are based on DL models in the literature that have achieved state-of-the-art diagnostic accuracy in detecting referable DR [16,27]. A standard hyperparameter search (learning rate (logarithmic grid search between 10^-6 and 10^-2), optimizer (ADAM, SGD), batch size (32, 64, 128, 256)) and training protocols were maintained for FundusNet and both baseline networks.
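The hyperparameter grid above can be enumerated directly; the sketch below is our reconstruction, and the choice of five logarithmically spaced learning rates is an assumption consistent with the stated 10^-6 to 10^-2 range (the paper does not report the number of grid points):

```python
import numpy as np
from itertools import product

# Hypothetical reconstruction of the hyperparameter grid described above.
learning_rates = np.logspace(-6, -2, num=5)     # 1e-6, 1e-5, ..., 1e-2 (assumed granularity)
optimizers = ["adam", "sgd"]
batch_sizes = [32, 64, 128, 256]

# Each configuration would be fine-tuned and scored on the validation folds.
grid = list(product(learning_rates, optimizers, batch_sizes))  # 5 * 2 * 4 = 40 configs
```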
To further investigate whether the CL pretrained model performs well with smaller training data (and ground truth), we reduced the training dataset gradually from 100% to 10% (in steps of 10%) and conducted the downstream classification training for both the CL and ImageNet pretrained baseline models. After identifying the best hyperparameters and fine-tuning the models for each experiment, we chose the model that had the best performance on the validation dataset (5-fold cross validation). The final optimal models were tested on an independent testing dataset from UIC. In terms of encoder networks, we compared three architectures in our experiments (VGG, ResNet, and Inception).

Statistical Analysis
Our primary metric to evaluate model performance was the area under the curve (AUC) with 95% confidence intervals (CI). For each experiment, the sensitivity and specificity of the CNN classifiers were also computed across probability thresholds to plot the receiver operating characteristic (ROC) curves and calculate the AUC. For individual AUCs, statistical comparisons were performed using DeLong's test [28]. We also compared the age, sex, and hypertension distributions among the different DR cohorts using a one-way, multi-label analysis of variance (ANOVA) test.
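The AUC point estimate is equivalent to the Mann-Whitney rank statistic. The sketch below computes it that way and adds a percentile-bootstrap CI; this is illustrative code of ours, and the bootstrap is a common alternative to the DeLong approach the study actually used for its CIs and pairwise comparisons:

```python
import numpy as np

def auc(y_true, y_score):
    """AUC via the rank-sum (Mann-Whitney U) formulation: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as 1/2."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    diff = y_score[y_true == 1][:, None] - y_score[y_true == 0][None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap 95% CI for the AUC (illustrative, not DeLong)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n, stats = len(y_true), []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)
        if y_true[idx].min() == y_true[idx].max():
            continue                      # resample must contain both classes
        stats.append(auc(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auc(y_true, y_score), (lo, hi)
```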

Results
This study used the EyePACS dataset for the CL based pretraining and for training the referable vs non-referable DR classifier. EyePACS is a public domain fundus dataset which contains 88,692 images from 44,346 individuals (both eyes, OD and OS), and the dataset is designed to be balanced across races and sexes. After removing the fundus images that met the exclusion criteria, the final training dataset contains 70,969 fundus images, of which 57,722 are non-referable DR and 13,247 are referable DR. An independent testing dataset from the UIC retina clinic is used for the target task of DR classification. This dataset contains 2,500 images from 1,250 patients (both eyes, OD and OS). Among the 1,250 subjects (mean [SD] age, 53.37 [11.03]), 818 were male (65.44%) and 432 were female (34.56%). The detailed demographic information of the subjects from UIC is in Table 1. There was no statistically significant difference in the distribution of age, sex, or hypertension between the non-referable and referable DR groups (ANOVA, P = 0.32, 0.18, and 0.59, respectively). The FundusNet model pretrained with CL and style transfer augmentation achieved an average AUC of 0.91 on the independent test dataset from UIC, outperforming the state-of-the-art baseline models (ResNet50 and InceptionV3) trained with ImageNet weights [16] (AUCs of 0.80 and 0.83, respectively) (Table 2). The significant performance difference on the test set compared to the baseline models indicates that the FundusNet model generalized better through our pretraining framework. The NST augmentation further allowed learning of more discriminative visual representations of retinal pathologies, improving the overall classification performance (0.91 (95% CI: 0.898-0.93) with NST augmentation vs 0.83 (95% CI: 0.80-0.85) with the original SimCLR augmentations [22]). To investigate the label-efficiency of the FundusNet model, we trained our model on different fractions of the labeled training data and tested each resulting model on the test dataset. We compared this to the performance of the baseline models. Fine-tuning experiments were conducted on five folds of training data and the results were averaged. Figure 2 shows how performance on the testing dataset varies with the different label fractions for both the FundusNet and baseline models. We observe that the CL pretrained FundusNet model retains AUC performance even when the labels are reduced to 10%, whereas the performance drop is significantly larger for the baseline models. When reducing the amount of training data from 100% to 10%, the AUC for FundusNet drops from 0.91 to 0.81 when tested on UIC data, whereas the drop is larger for the baseline models (0.80 to 0.58 for the ResNet50 and 0.81 to 0.63 for the InceptionV3 model). Importantly, the FundusNet model is able to match the performance of the baseline models using only 10% labeled data when tested on independent test data from UIC (FundusNet AUC 0.81 when trained with 10% labeled data vs 0.80 and 0.81, respectively, for baseline models trained with 100% labeled data).
Figure 2: AUC for referable vs non-referable DR classification tested on independent test data from UIC for FundusNet vs two baseline models with varied percentages of training data.

In the experiment to evaluate the optimal batch size for CL pretraining, we observed that the CL framework learned better image representations when there was a higher number of negative examples in a batch (i.e., augmented image pairs generated from the other images in the batch); therefore, a higher batch size yielded better performance (Table 3; AUC of 0.77 for batch size 32 vs 0.91 for batch size 2048 on the test dataset). However, a higher batch size also requires larger compute resources. We observed that at batch size 4096 the AUC did not improve significantly, so the optimal batch size was chosen as 2048.
In terms of encoder networks, ResNet50 provided the best classification performance on both the validation and test datasets, compared to the VGG and Inception architectures (Supplemental Table 1).

Table 1 .
Demographic characteristics of non-referable and referable DR subjects from the testing dataset at UIC.

Table 2 :
Classification performance of the FundusNet framework for referable vs non-referable DR. DR: diabetic retinopathy; CI: confidence interval; AUC: area under the ROC curve; Ref: reference. P values from DeLong's test for comparing pairwise AUCs.

Table 3 :
Effect of batch size on FundusNet performance in detecting referable DR. AUC: area under the ROC curve; SD: standard deviation; Ref: reference; '*' indicates a significant difference. P values from DeLong's test for comparing pairwise AUC values among the batch sizes (all vs batch size 32 in the third column, and all vs batch size 2048 in the fourth column). This pairwise comparison was done to show whether the AUCs from experiments with each batch size are significantly different from the AUCs obtained with batch size 32 (initial batch size) and 2048 (optimal batch size).