An Interpretable Multiple-Instance Approach for the Detection of referable Diabetic Retinopathy from Fundus Images

Diabetic Retinopathy (DR) is a leading cause of vision loss globally. Yet despite its prevalence, the majority of affected people lack access to the specialized ophthalmologists and equipment required for assessing their condition. This can lead to delays in the start of treatment, thereby lowering their chances for a successful outcome. Machine learning systems that automatically detect the disease in eye fundus images have been proposed as a means of facilitating access to DR severity estimates for patients in remote regions or even for complementing the human expert's diagnosis. In this paper, we propose a machine learning system for the detection of referable DR in fundus images that is based on the paradigm of multiple-instance learning. By extracting local information from image patches and combining it efficiently through an attention mechanism, our system is able to achieve high classification accuracy. Moreover, it can highlight potential image regions where DR manifests through its characteristic lesions. We evaluate our approach on publicly available retinal image datasets, in which it exhibits near state-of-the-art performance, while also producing interpretable visualizations of its predictions.


I. INTRODUCTION
Diabetic Retinopathy (DR) is a complication of diabetes mellitus that can lead to blindness if left untreated. More than 1 out of 3 diabetic patients are expected to develop DR during their lifetime [1], with their chances increasing over time [2]. Yet, despite its prevalence among diabetic populations, the risk of blindness can be significantly reduced via timely treatment, i.e. before the retina is severely damaged [3].
Diabetic Retinopathy is characterized as either nonproliferative (NPDR), meaning it manifests mainly through retinal lesions, or as proliferative (PDR), meaning neovascularization of weak blood vessels also occurs. The International Clinical Diabetic Retinopathy Disease (ICDR) severity scale [4] suggests a finer classification of the disease into the following 5 stages, based on the observable findings during eye examination: 1) No apparent retinopathy, 2) Mild NPDR, 3) Moderate NPDR, 4) Severe NPDR, 5) PDR. Common guidelines recommend annual screenings for diabetic patients without or with mild DR, a 6-month follow-up examination for moderate DR, and referral to an ophthalmologist for treatment evaluation for severe cases [5]. A DR of moderate or worse stage is further characterized as referable Diabetic Retinopathy (rDR). Subjects who were diagnosed with rDR by a trained DR grader (not necessarily an ophthalmologist) must also be referred to an ophthalmologist for further evaluation of their condition.
Screening for DR is usually carried out by either an in-person eye examination or by means of retinal photography. In either case, an eye care professional examines the retina (either directly through a slit-lamp or indirectly through a high-resolution retinal photograph captured with a specialized camera) for signs of the disease, such as microaneurysms, haemorrhages and hard or soft exudates. Other factors such as macular edema, narrowing of the blood vessels or damage in the nerve tissue are also considered [3]. In general, accurate DR grading is a daunting task even for experienced graders, and, as a result, inter-grader variability is quite common [6].
In recent years, retinal photography has been widely accepted as an adequate screening method that can even lead to improved diagnostic performance compared to the standard slit-lamp examination [7]. This acceptance has naturally led to the conduction of much research on the development of automated methods for grading retinal images, as such techniques can provide substantial benefits to the standard DR screening procedure. For example, they can support retinal specialists by lightening their workload or by identifying cases they might have missed. Automated grading can also be used to increase the coverage of nation-wide screening programs by facilitating access for people in remote or rural regions via the use of teleophthalmology.
Image classification is a classic use case for machine learning (ML) algorithms, especially those based on the paradigm of deep learning. In recent years, deep learning has achieved impressive results in a series of tasks, including image classification. In this work, we propose a method based on deep learning to detect referable Diabetic Retinopathy from retinal images and simultaneously produce heatmaps of the most dominant DR lesions to aid the model's interpretability. Our method operates by independently extracting information from multiple small image patches and combining them based on their content. We focus on the detection of rDR, as this simplified binary classification task is a common target for automated DR grading methods that provides ample clinical utility [8].
Our paper is organized in the following way: in Section II we review the most prominent works in the related literature. Then, in Section III we introduce the proposed methodology for performing rDR detection. In Section IV we provide details about the datasets used in our experiments and in Section V we present the experimental results for our method, as well as comparisons with other works in the literature. Finally, we discuss our method and results in Section VI and conclude the paper with Section VII.

II. RELATED WORK
The automated analysis of retinal images for diabetic retinopathy detection has received impressive research attention over the last decade, particularly following the rise of deep learning that eliminated the need of manually constructing problem-specific features. Many different learning tasks have been pursued in the related literature, such as optic disk and blood vessel segmentation, lesion detection and DR grading. As the relevant literature is abundant, in this section we will focus on the tasks of DR grading and lesion detection that are mostly related to our work. For a thorough review of the field, we refer the reader to the excellent survey of [9].
In grading approaches, the task is usually cast as a binary classification problem of no rDR vs rDR, but it can also be viewed as a multi-class problem, using the stages of the ICDR severity scale as 5 distinct classes. Early methods relied on traditional computer vision methods, mainly using feature extraction techniques to identify specific properties of interest in the fundus image, coupled with shallow machine learning models for classification. For instance, [10] used morphological and texture analysis to extract blood vessel and hard exudate features which were then used as input features in a neural network, [11] made use of higher-order spectra coupled with an SVM classifier and, more recently, [12] proposed a set of high-performing, handcrafted shape features that were fed to a Random Forest classifier.
In recent years, the rise of Deep Convolutional Neural Networks (DCNN) and their dominance in computer vision tasks has transformed the research landscape, so that most works are nowadays based on deep neural networks. The trend reached its peak with the seminal work [13] that reported excellent rDR classification performance (ROC AUC of 0.99), by using an Inception v3 model pre-trained on ImageNet and finetuned on a private dataset of 120k fundus images. A replication study using the publicly-available Kaggle-EyePACS dataset [14], was conducted in [15], but was unable to reproduce the results of the original study, suggesting that the ground-truth quality is a decisive factor for achieving high performance, as the dataset used in [13] was annotated by a panel of 7 retina specialists, while Kaggle-EyePACS was annotated by a single specialist. This is further corroborated by the findings of [6], who report performance improvements when using even a small amount of adjudicated DR ground truths.
The work of [16] investigates the use of a multi-resolution training scheme, using different networks with shared weights for each different resolution of the input image. They also extract and combine features simultaneously from multiple label-preserving random perturbations of the same input image, as suggested by the team that placed second in the Kaggle DR competition, along with other tricks, such as aggressive data augmentation, to improve performance.
An interesting addition to the standard DR prediction procedure is suggested by the authors of [17]. In particular, they incorporate a Bayesian estimation of the model's uncertainty during test time, implemented via a dropout approximation [18]. They go on to show that, by refraining from classifying images for which the model is uncertain, one can increase classification performance and obtain more reliable predictions. This modification is applicable to any of the usual deep CNN models that have been suggested in the context of DR classification, such as ResNet [19], AlexNet [8] and VGG [16].
A different strategy that is more on par with the screening procedure of eye specialists is to first detect specific DR lesions and then use these detections to infer DR. Following this direction, [8] introduced a series of lesion detector models, based on AlexNet [20] and VGG [21], where each detector is applied to the input image to detect a particular type of DR lesion (haemorrhages, exudates, etc). The detector outputs are then fused together to form a feature vector that is used to first assess the image quality and then, if the quality is found acceptable, to predict the level of DR. The resulting hybrid system was shown to achieve very high performance and has been deployed in real-world conditions. Such methods, however, depend on the manual annotation of DR lesions, a procedure that significantly increases the clinician workload and, as such, can typically be carried out for only a few images. In addition, disease-relevant information tends to be located in just a few regions in the input image [22], so that using a single feature vector to represent the entire image can be problematic. To address these problems, Multiple-Instance Learning (MIL) methods [23] that treat the input image as a bag of patches accompanied by a single DR label have been suggested. An early MIL method introduced in [24] used rectangular patches originating from DR-positive images to train a model using a patch relevance criterion. The resulting model was then shown to produce high relevance scores for patches that contained DR lesions. The subsequent work of [22] employed generic MIL algorithms like the mi-SVM method [25] to produce both local (per-patch) and global (per-image) decisions, by processing each patch independently and combining the per-patch results according to some aggregation criterion. A similar method that made use of deep learning was later proposed in [26].
Here we introduce a method to perform rDR classification by analyzing the fundus image on the patch level. Our approach follows the MIL paradigm that treats each input image as a bag of image patches. At first, each patch is independently encoded via a deep CNN model into a feature vector. Contrary to other MIL approaches in which the individual patch vectors are pooled via pre-defined operators, here we examine the use of the attention mechanism [27]. In the context of MIL, attention serves as a trainable pooling operator that learns to put emphasis on the most informative instances (i.e. patches that contain DR lesions) and ignore the uninformative ones, thus leading to an image-level representation that retains the most relevant information and can thus be used for efficient rDR detection.
Fig. 1: High-level overview of the proposed rDR prediction pipeline. The preprocessed image $I$ is represented by $K$ rectangular patches with fixed overlap that cover most of the retinal disk (patches that have less than 50% retinal content, such as the patches extracted near the boundary of the retinal disk, are discarded). Features are then extracted from each patch $x_k$ via a ResNet-18 model (function $\phi$). Each resulting feature vector $h_k$ is assigned a weight $\alpha_k$ by an attention mechanism, and the weighted average of all the vectors $\{h_k\}$ is used to acquire a single embedding that describes the entire image. Finally, the embedding is fed to a fully connected layer (function $\rho$) that outputs the probability of rDR for the given image. At the same time, the attention weights of different patches are mapped to their corresponding image regions and an attention heatmap that highlights the most salient image regions according to the model is constructed.
In medical applications, model interpretability is key for the successful adoption of a proposed ML solution, as human experts are more likely to trust a decision if they understand the motives behind it. In the case of DR in particular, a model is usually considered black-box during training and endowed with interpretability properties using some post-training optimization procedure to output a heatmap of the fundus regions [19], [28], [29] that the model is most sensitive to. In this work, we follow a different approach: we leverage the attention mechanism to construct heatmaps that highlight the image regions that directly contributed to a given prediction. In doing so, heatmap generation arises quite naturally since attention is a basic building block of our model.
We evaluate the classification performance of our model on the publicly available Kaggle-EyePACS [30] and Messidor-2 [8] datasets. In the rDR detection task, an ensemble of 5 models trained on the standard Kaggle-EyePACS training set achieves a ROC AUC of 0.957 on the Kaggle-EyePACS test set and 0.976 on Messidor-2, outperforming other patch-based methods and performing on par with the state-of-the-art. We also perform experiments to evaluate the quality of the produced attention heatmaps using the images from the IDRiD [31] dataset that contain detailed pixel-level lesion annotations. Our results in these experiments show that the attention mechanism can efficiently recognize the DR-induced lesions in the fundus of the eye.

III. PROPOSED METHODOLOGY
A. Image pre-processing
A practical DR grading system must be able to handle fundus images of different resolutions, lighting conditions and viewpoints of the retina disk, as these parameters may vary between different retinal cameras and capturing conditions. However, typical image classification models operate on images of pre-defined dimensions. In addition, artifacts in the image, such as blobs caused by dust in the lens or by improper lighting, can seriously hinder model performance. Therefore, to mitigate these issues we apply a common pre-processing step to all images prior to any machine learning analysis. The pre-processing pipeline consists of the following steps:
1) Estimate the center and radius of the circular eye disk by means of the Hough transform (this step can be skipped if ROIs are provided by the retinal camera).
2) Crop a rectangular region with edge equal to the estimated eye disk radius and centered at the estimated disk center.
3) Resize the rectangular crop to a common resolution of 512 × 512.
4) Subtract the local color average as suggested in [14], to account for the variability in lighting conditions across images.
5) Zero-out the outer 5% of the retina disk to remove artificial boundary effects introduced by the filtering procedure.
Images for which the disk center cannot be found via the Hough transform are discarded altogether. Additional care is taken for images that do not contain the whole circular disk (for example, Fig. 3a and 3c). In such cases, an additional zero padding step of appropriate size is applied along the height dimension, before cropping the rectangular retina region. The overall process results in two components that will be used in the subsequent learning pipeline: the preprocessed image, which is ready to be fed to a machine learning model, and the binary retina mask (obtained during the Hough transform step), which can be used to separate the retina from the image background. Samples of preprocessed images and their corresponding retina masks are given in Fig. 2.
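Steps 4 and 5 of the pre-processing pipeline can be illustrated with a minimal NumPy sketch. It assumes the disk center and radius have already been estimated (step 1) and that the image is already cropped and resized. The box filter used for the local average, the `blur_radius` value and all function names are assumptions made for this illustration; the exact filter suggested in [14] is not reproduced here.

```python
import numpy as np

def box_blur(channel, radius):
    """Local mean over a (2*radius+1)^2 window, computed with an integral image."""
    k = 2 * radius + 1
    padded = np.pad(channel, radius, mode="edge")
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = channel.shape
    window_sum = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
                  - integral[k:k + h, :w] + integral[:h, :w])
    return window_sum / (k * k)

def disk_mask(shape, center, radius):
    """Boolean mask that is True inside the circle (center given as row, col)."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    return (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2

def subtract_local_average_and_trim(image, center, radius, blur_radius=8, margin=0.05):
    """Step 4: subtract the per-channel local average (box filter stand-in).
    Step 5: zero out the outer `margin` fraction of the retina disk."""
    image = image.astype(np.float64)
    filtered = np.stack(
        [image[..., c] - box_blur(image[..., c], blur_radius)
         for c in range(image.shape[-1])],
        axis=-1,
    )
    inner = disk_mask(image.shape[:2], center, radius * (1.0 - margin))
    retina_mask = disk_mask(image.shape[:2], center, radius)
    return filtered * inner[..., None], retina_mask
```

The function returns both the filtered image and a binary retina mask, mirroring the two components that the text says are passed on to the learning pipeline.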

B. Bag of patches encoding
Having established a common frame of reference for all images, we now compute the bag of patches representation for an image. To this end, rectangular patches of size $d \times d$ pixels are extracted from each image. We use different policies for patch extraction during training and testing. During training we randomly select $K_o$ patches from the pool of possible image patches. This significantly speeds up training, as we can represent an image using only a small subset of all the possible patches, and also adds a regularization effect, as the representation of a specific image will be different each time it is fed to the network (since it will be represented by a different random set of $K_o$ patches). At test time, we perform exhaustive patch extraction over a regular grid with a patch overlap ratio of $t \in [0, 1)$. In both cases, some patches will contain a very small part of the retina or originate completely from the image background. To avoid such cases, we use the previously computed retina mask and discard patches whose eye disk content is less than 50%. The surviving $K$ patches form a set $X = \{x_1, x_2, \ldots, x_K\}$ with $x_k \in \mathbb{R}^{d \times d \times 3}$ that constitutes the bag of patches for the given image. In the following, we will discuss how we can efficiently utilize the bag of patches representation to perform rDR prediction.
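The two extraction policies described above can be sketched as follows. This is a sketch under assumptions, not the authors' implementation: the stride is taken to be `round(d * (1 - t))`, the training-time "pool of possible patches" is approximated by the grid patches that pass the retina filter, and all function names are hypothetical.

```python
import numpy as np

def grid_origins(image_size, d, t):
    """Top-left corners of d x d patches on a regular grid with overlap ratio t in [0, 1)."""
    stride = max(1, int(round(d * (1.0 - t))))
    coords = range(0, image_size - d + 1, stride)
    return [(r, c) for r in coords for c in coords]

def grid_bag(image, mask, d, t, min_retina=0.5):
    """Test-time policy: exhaustive grid, discarding patches with < 50% retina content."""
    origins, patches = [], []
    for r, c in grid_origins(image.shape[0], d, t):
        if mask[r:r + d, c:c + d].mean() >= min_retina:
            origins.append((r, c))
            patches.append(image[r:r + d, c:c + d])
    if not patches:
        return np.empty((0, d, d, image.shape[-1]), dtype=image.dtype), origins
    return np.stack(patches), origins

def random_bag(image, mask, d, k, rng, t=0.75):
    """Training-time policy: sample k surviving patches without replacement.
    Using the filtered grid as the sampling pool is an assumption of this sketch."""
    patches, origins = grid_bag(image, mask, d, t)
    idx = rng.choice(len(patches), size=min(k, len(patches)), replace=False)
    return patches[idx], [origins[i] for i in idx]
```

Keeping the patch origins alongside the patches is a design choice of the sketch: the same coordinates are needed later to map attention weights back onto the image.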

C. Multiple-Instance modelling
In the MIL view of the rDR detection problem, we are essentially interested in learning a function $f$ that will map any set $X$ (corresponding to the bag of patches of the input image) to a real number (corresponding to the probability of rDR for that image). Being a set function, $f$ must be invariant to the different permutations of $X$. Previous theoretical work [32] suggested that any permutation-invariant function over a set $X$ can be modelled as a sum decomposition of the form $\rho\left(\sum_{x \in X} \phi(x)\right)$, where the transformation $\phi : \mathbb{R}^N \to \mathbb{R}^M$ is applied elementwise to each set instance, thus leading to a transformed set $H$, while the transformation $\rho : \mathbb{R}^M \to \mathbb{R}$ is applied to the result of summing the elements of $H$ to obtain the desired output. Building on this result, [27] proposed to incorporate the attention mechanism [33] in the sum decomposition, as an elegant way of tackling multiple-instance classification problems. More specifically, they proposed to model the label probability of a bag via the following mechanism:

$$P(y = 1 \mid X) = \rho\left(\sum_{k=1}^{K} \alpha_k h_k\right), \quad h_k = \phi(x_k) \tag{1}$$

The coefficients $\alpha_k$ in Eq. 1 can be computed via the additive attention [34] mechanism of Eq. 2, which introduces the learnable parameters $V \in \mathbb{R}^{L \times M}$ and $w \in \mathbb{R}^{L}$:

$$\alpha_k = \frac{\exp\left(w^\top \tanh(V h_k)\right)}{\sum_{j=1}^{K} \exp\left(w^\top \tanh(V h_j)\right)} \tag{2}$$

In this work, we adopt the attention-based multiple-instance scheme for the purposes of detecting rDR in a fundus image. In a nutshell, we propose the following processing pipeline:
1) Pre-process the input image $I$ according to Section III-A.
2) Compute its bag of patches representation as described in Section III-B.
3) Transform the bag of patches $X = \{x_1, x_2, \ldots, x_K\}$ with $x_i \in \mathbb{R}^{d \times d \times 3}$ to a bag of features $H = \{h_1, h_2, \ldots, h_K\}$ with $h_i \in \mathbb{R}^M$, by applying the function $\phi$ to each patch.
4) Apply attention pooling to arrive at a global image representation vector $z = \sum_{k=1}^{K} \alpha_k h_k$.
5) Estimate the rDR probability for $I$ by computing $\rho(z)$ (where $\rho$ is modelled by a fully-connected network).
One attractive property of the MIL attention approach is that all transformations can be modelled using deep neural networks and the resulting model can still be trained in an end-to-end manner. This would not be possible if some other commonly used pooling operator, such as Bag of Features or Fisher Vectors, was used. For our specific problem, we elect to use a high capacity CNN (ResNet-18) [35] to model the feature extraction function φ and a single fully-connected layer to model the final classifier ρ. The attention mechanism is implemented via a fully-connected network with 2 layers that correspond to the parameters V and w of Eq. 2. A schematic overview of the proposed system is given in Fig. 1.
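The attention pooling step and the final classifier can be sketched in NumPy as below. The sketch assumes the standard additive attention form, $\alpha_k \propto \exp(w^\top \tanh(V h_k))$, and a sigmoid on top of the single fully-connected layer $\rho$; the sigmoid, the random stand-in features (in place of ResNet-18 outputs) and all names are assumptions made for illustration.

```python
import numpy as np

def attention_weights(H, V, w):
    """Additive attention over a bag of features.
    H: (K, M) patch features, V: (L, M), w: (L,). Returns (K,) weights summing to 1."""
    scores = np.tanh(H @ V.T) @ w        # one scalar score per patch
    scores = scores - scores.max()       # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def mil_forward(H, V, w, rho_weight, rho_bias):
    """Attention pooling followed by a single linear layer and a sigmoid (assumed)."""
    alpha = attention_weights(H, V, w)
    z = alpha @ H                        # (M,) image-level embedding
    logit = z @ rho_weight + rho_bias
    return 1.0 / (1.0 + np.exp(-logit)), alpha

# Random stand-ins for the learned quantities (K patches, M features, L attention units).
rng = np.random.default_rng(0)
K, M, L = 50, 128, 32
H = rng.normal(size=(K, M))
V = rng.normal(size=(L, M)) * 0.1
w = rng.normal(size=L)
rho_weight, rho_bias = rng.normal(size=M) * 0.1, 0.0
prob, alpha = mil_forward(H, V, w, rho_weight, rho_bias)
```

Because the weights are a softmax over patches and the pooled vector is a weighted sum, shuffling the rows of `H` leaves `prob` unchanged, which is the permutation invariance required of a set function.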

D. Interpretable attention heatmaps
After training, we can use the attention mechanism to visualize the fundus regions that affect the model's decision. As we saw in Section III-C, in order to classify an image, we assign a weight to each patch using the attention mechanism. These weights offer an implicit way of identifying which image regions contribute the most to the model's prediction. Ideally, in true positive cases, the model should assign high weights $\alpha_k$ to patches that contain DR lesions, such as haemorrhages and exudates, while in true negative cases, all patches should receive similar weights. Such a heatmap could also assist in understanding the reasons behind misclassifications, either false positives or false negatives, and enable targeted interventions in the learning pipeline.
To produce the attention heatmap, we start with an initially zeroed single-channel image of the same height and width as the pre-processed images. This auxiliary image will be used to accumulate the attention values: we iterate over the patches and add their assigned weights to the pixels of the accumulator image that correspond to the location of each patch in the original image. The granularity of the pixel-level assignment can be controlled via the patch overlap parameter $t$: larger overlap between the extracted patches leads to more granular and aesthetically pleasing visualizations, while smaller overlap speeds up the computations but results in coarser visualizations. The values of the auxiliary image are then linearly mapped to the [0, 1] range to produce a proper heatmap. Examples of such heatmaps are given in Fig. 5.
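The accumulation procedure described above can be sketched as follows. Min-max scaling is used here as one possible reading of "linearly mapped to the [0, 1] range"; that choice and the function name are assumptions of this sketch.

```python
import numpy as np

def attention_heatmap(origins, alphas, d, image_size):
    """Accumulate per-patch attention weights on a zeroed image, then rescale to [0, 1].
    origins: list of (row, col) top-left patch corners; alphas: matching weights."""
    acc = np.zeros((image_size, image_size))
    for (r, c), a in zip(origins, alphas):
        acc[r:r + d, c:c + d] += a       # overlapping patches add up
    lo, hi = acc.min(), acc.max()
    if hi > lo:                          # avoid division by zero on flat maps
        acc = (acc - lo) / (hi - lo)
    return acc

# Two overlapping 4 x 4 patches on an 8 x 8 image.
heat = attention_heatmap([(0, 0), (2, 2)], [0.25, 0.75], d=4, image_size=8)
```

Pixels covered by both patches receive the sum of the two weights, which is why a larger overlap between patches yields a finer-grained map.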

IV. DATASETS
A. Kaggle-EyePACS
The Kaggle Diabetic Retinopathy Detection [30] challenge dataset consists of high-resolution retina images captured under a variety of imaging conditions. It contains 88,702 RGB images of differing resolutions (from 433 × 289 up to 5184 × 3456) that are partitioned into a training set of 35,126 and a test set of 53,576 images. A clinician has graded all images according to the ICDR scale. The DR grade distribution can be seen in Table I. Some indicative images from this dataset can be seen in Fig. 3.
It is estimated [36] that 25% of the Kaggle-EyePACS images are ungradable because they contain artifacts (loss of focus, under- or overexposure, etc.) that prevent a medical expert from assessing their DR grade. In fact, it is quite common for DR image graders to first evaluate the image quality and proceed with the actual DR grading only if they find it acceptable. The authors of [15] suggested that image quality estimation does not require medical expertise and can, therefore, be carried out by non-experts. They went on to manually evaluate the quality of all images in the Kaggle dataset, based on instructions provided to the professional DR graders for performing the same task in [13]. In doing so, they rejected about 19.9% of the images, resulting in a filtered dataset of 71,056 images. In our experiments, we adopt these gradability estimates and consider only the images that were deemed of adequate quality.

B. Messidor-2
The Messidor-2 dataset [8] contains 1748 fundus images captured with a Topcon TRC NW6 camera in three different resolutions: 1440 × 960, 2240 × 1488 and 2304 × 1536. Grades for DR and image quality are provided by a panel of 3 fellowship-trained retina specialists [6]. Images that caused disagreements among specialists were re-examined and adjudication sessions were carried out until consensus was reached. As a result, in terms of label quality, Messidor-2 can be considered less noisy than the Kaggle-EyePACS dataset, in which DR grades are provided by a single DR grader. The DR grade distribution for this dataset is given in Table I, while sample images are provided in Fig. 3.

C. IDRiD
The Indian Diabetic Retinopathy Dataset (IDRiD) [31] contains fundus images captured during real clinical examinations in an eye clinic in India using a Kowa VX fundus camera. The captured images have a 50° field of view with a resolution of 4288 × 2848. The images are separated into 3 parts, corresponding to 3 different learning tasks and accompanied by the respective types of ground-truth. The first part is designed for the development of segmentation algorithms and contains 81 images (54 in the training set and 27 in the test set) with pixel-level annotations of DR lesions (microaneurysms, haemorrhages, hard and soft exudates) and the optic disk. The second part corresponds to a DR grading task and contains 516 images divided into a training set (413 images) and a test set (103 images) with DR and Diabetic Macular Edema (DME) severity grades. Finally, the third part corresponds to a localization task and contains 516 images with the pixel coordinates of the optic disk center and fovea center (again split into a training set of 413 and a test set of 103 images).

V. EXPERIMENTAL EVALUATION
A. Implementation details
We perform extensive experiments to assess the proposed model's ability to detect rDR in fundus images. For training, we use the gradable (according to [15]) images in the Kaggle-EyePACS training split, while for evaluation we use the Kaggle test split and Messidor-2 datasets. Before any training, we randomly sample 3000 images from the Kaggle training set and set them aside for use as a validation set. For training, we use the standard binary cross-entropy loss. Assuming that $\hat{p}_{\text{data}}$ denotes the empirical data-label distribution defined by the training set, the cross-entropy loss is defined as:

$$\mathcal{L} = -\mathbb{E}_{(X, y) \sim \hat{p}_{\text{data}}}\left[\, y \log f(X) + (1 - y) \log\left(1 - f(X)\right) \right]$$

The work of [37] showed that transferring weights from the natural to the medical image domain is not necessary to achieve high classification performance. However, it does speed up convergence and thus, we initialized the function $\phi$ (ResNet-18) with pre-trained ImageNet weights. During training we perform image augmentation by randomly shifting, flipping, scaling and rotating the image and by randomly applying small perturbations to its brightness, contrast, hue and saturation. We convert the image to its bag of patches representation by randomly selecting $K = 50$ patches from the pool of all possible image patches. This value was selected based on its superior classification performance (Fig. 4) on the validation set. At test time, instead, we extract patches on a regular grid with a patch overlap rate of 0.75.
The function $\phi$ transforms an input patch to a feature vector of length $M = 128$, while an attention module of dimension $L = 32$ is used to produce the pooled image representation $z$. We use the Adam optimizer with a base learning rate of $3 \cdot 10^{-4}$ for 60 epochs and the suggested $b_1$, $b_2$ parameters. After training, we keep the model instance that achieved the highest performance on the validation set.

B. Classification experiments
For measuring classification performance we use the Area Under the Receiver Operating Characteristic Curve (ROC AUC), a metric commonly used in the related literature. We also compute the model's sensitivity and specificity metrics at the operating points of high specificity (> 0.9) and high sensitivity (> 0.9). We report the performance metrics of a 5-model ensemble on both the Kaggle-EyePACS test set and Messidor-2 in Table II, along with a comparison to related works. Ideally, such a comparison would include methods that were trained and evaluated on the same datasets and under the same conditions. However, there is a widely inconsistent use of datasets and evaluation metrics throughout the rDR classification literature [9], with different methods using different datasets for training/testing or even custom training/test splits. For instance, [15] and [26] use Kaggle splits that favor significantly larger training sets, as opposed to the official splits in which the test set is ∼1.5 times larger than the training set. After taking this into consideration, we elect to compare our method to the most prominent works of the rDR literature that use the same publicly-available datasets. More specifically, in the first part of Table II we include methods that were trained and evaluated on Kaggle-EyePACS, using either official or custom splits. Then, in the second part we include methods that were trained with any dataset and evaluated on Messidor-2.
As we can see, in terms of rDR classification performance, our method performs on par with the state-of-the-art literature. In fact, when evaluated on the Kaggle test set, our approach outperforms the alternatives in terms of AUC score, even those that use larger training sets. In Messidor-2, our method is only outperformed by [13]. However, their model was trained on a much larger and better annotated dataset (>100k images annotated by a panel of specialists, in contrast to our ∼28k images annotated by a single expert). [39] also reports a slightly better AUC on Messidor-2, but their score corresponds to a slightly different task, namely predicting rDR given both eye images for a subject, in contrast to our model that operates on single images.
TABLE II (fragment): [17] 0.927 NA NA; Rakhlin et al. [38] 0.920 0.920 0.720 0.800 0.920; Pires et al. [16] 0
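The metrics used in this subsection (ROC AUC, sensitivity at a high-specificity operating point and specificity at a high-sensitivity operating point) can be computed with a short NumPy sketch. The function names are hypothetical, and the rule used to pick the operating point (the best value among thresholds that strictly exceed the 0.9 constraint) is an assumption about how such points are usually chosen, not a description of the evaluation code behind Table II.

```python
import numpy as np

def roc_points(labels, scores):
    """False and true positive rates at every distinct score threshold."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="mergesort")
    labels, scores = labels[order], scores[order]
    # Keep only the last index of each run of tied scores.
    last = np.r_[np.where(np.diff(scores))[0], labels.size - 1]
    tps = np.cumsum(labels)[last]
    fps = (last + 1) - tps
    tpr = np.r_[0.0, tps / labels.sum()]
    fpr = np.r_[0.0, fps / (~labels).sum()]
    return fpr, tpr

def roc_auc(labels, scores):
    """Area under the ROC curve via the trapezoidal rule."""
    fpr, tpr = roc_points(labels, scores)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def sensitivity_at_specificity(labels, scores, min_spec=0.9):
    """Best sensitivity among thresholds whose specificity exceeds min_spec."""
    fpr, tpr = roc_points(labels, scores)
    return float(tpr[(1.0 - fpr) > min_spec].max())

def specificity_at_sensitivity(labels, scores, min_sens=0.9):
    """Best specificity among thresholds whose sensitivity exceeds min_sens."""
    fpr, tpr = roc_points(labels, scores)
    return float((1.0 - fpr[tpr > min_sens]).max())

# Tiny illustrative example: 1 = rDR, 0 = no rDR.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_prob)
```

Both operating-point helpers assume that each class has at least one example, so that the rates are well defined.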

C. Attention heatmap evaluation
Following the procedure of Sec. III-D we can construct attention heatmaps that pinpoint the fundus areas that the model focused on for arriving at a prediction. Examples of such heatmaps for both correct and incorrect predictions in the Messidor-2 dataset are presented in Fig. 5. As we can see by inspecting the images in Fig. 5, the model seems to attend more to patches that contain artifacts resembling DR lesions, such as microaneurysms, haemorrhages and soft or hard exudates. Rather than relying on visual inspection of the produced heatmap, we would like to evaluate its validity more quantitatively. To that end, we employ the first part of the IDRiD dataset that contains 80 images with detailed, pixel-level lesion annotations. Based on these images, we can verify the correlation between per-patch attention weight and lesion existence that Fig. 5 suggests. To do so, we conduct two experiments. In the first experiment, we use the attention weight assigned to each patch as a predictor of whether the patch contains any DR lesion. We compute the AUC and the Area Under Precision Recall Curve (AUPRC) of the per-patch attention weight against a binary label that denotes lesion existence. For the purposes of this experiment, we extract patches on a regular grid with high patch overlap (87.5%), in order to enrich the pool of patches. Any patch that contains even a tiny amount of lesion (at least 1 pixel according to the available ground-truth) is assigned to the positive class and otherwise to the negative class. We report the aforementioned classification metrics against different labels, corresponding to microaneurysms, haemorrhages, exudates (we have merged soft and hard exudates in a single class, since only a fraction of IDRiD images contain soft exudate ground-truth), as well as all lesions combined, in Table III. As we can see, the attention weight achieves good performance, especially when the target label is produced by considering all lesions together. 
This is to be expected, as the training procedure allows the model to focus on what it considers relevant to the task at hand and, as a result, it does not explicitly learn to prioritize a specific lesion type over some other.
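The first experiment can be sketched as follows: each patch receives a positive label if it overlaps at least one annotated lesion pixel, and the attention weights are then scored against these labels. The pairwise AUC and the step-wise average precision (used here as the AUPRC estimate, ignoring score ties) are standard estimators chosen for this sketch; the function names and the toy data are hypothetical.

```python
import numpy as np

def patch_lesion_labels(lesion_mask, origins, d):
    """True for every d x d patch containing at least one annotated lesion pixel."""
    return np.array([lesion_mask[r:r + d, c:c + d].any() for r, c in origins])

def auc_pairwise(labels, scores):
    """Probability that a positive patch outscores a negative one (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive patch."""
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="mergesort")
    ranked = labels[order]
    precision = np.cumsum(ranked) / np.arange(1, ranked.size + 1)
    return float(precision[ranked].mean())

# Toy example: one lesion pixel on an 8 x 8 annotation mask, three 4 x 4 patches.
lesions = np.zeros((8, 8), dtype=bool)
lesions[5, 5] = True
origins = [(0, 0), (4, 4), (2, 2)]
attention = [0.1, 0.6, 0.3]
labels = patch_lesion_labels(lesions, origins, d=4)
```

In this toy case the two patches that cover the lesion pixel also carry the two largest attention weights, so both metrics reach their maximum value.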
In the second experiment, we are interested in verifying whether the attention weight of a patch depends on the amount of lesions it contains. To that end, we compute the scatterplots of the attention weights versus the percentage of the area of a patch that corresponds to lesions (i.e. how many pixels in a given patch have been annotated as belonging to any lesion category). We present such plots for 9 images of the IDRiD dataset in Fig. 6. Based on these figures, we find that there seems to be a positive linear correlation between the magnitude

VI. DISCUSSION
The development of methods for automated DR grading has been one of the most popular applications of machine learning in recent years. According to a recent survey paper [9], there has been an exponential rise of interest in the field after 2015, with over 50 papers in 2018 and 70 papers in 2019 using deep learning. Yet, despite this vast growth, there are no standardized benchmarks for DR algorithms and, as a result, there is no easy way to compare fairly with other works. This is made worse by the inherent inter- (and intra-) grader variability that human experts themselves exhibit [6] when manually grading DR. As different works often produce their own DR grades for a dataset in cooperation with their eye specialists, one must keep in mind that even when comparing on more standardized datasets such as Messidor-2, there might be discrepancies between the DR grades used. In this work we made a conscious effort to facilitate easy comparisons with future work by employing standard training/test splits and the official DR grades associated with each dataset.
Fig. 6: Example scatterplots of the attention weight assigned by the model to each patch versus the percentage of the patch area that contains lesions. The plots are computed using images from the IDRiD dataset, for which detailed per-pixel lesion annotations for microaneurysms, haemorrhages, soft and hard exudates are available. A linear regression fit is estimated for each image and overlaid on the plot to highlight the linear trend.
With respect to classification performance, we have demonstrated that the attention-based MIL approach is a viable alternative to the usual pipelines for rDR detection. More specifically, it proved resilient against the noisy setting of the Kaggle-EyePACS dataset, where, to the best of our knowledge, it achieved the highest AUC on the single image prediction task. In the cross-dataset testing scenario, it achieved an AUC of 0.976 on Messidor-2, which is comparable to the relevant state-of-the-art (AUC of 0.99) and often superior to that of other models trained on larger datasets.
Instead of extracting image patches deterministically (i.e. using a pre-defined grid), our model was trained on pools of randomly sampled patches. Naturally, this modification greatly reduced the training time, as an image can be represented by as few as 50 patches, while extracting patches over a grid with 50% overlap yields 225 patches of size 64 × 64 for a 512 × 512 image. Most interestingly, it also slightly increased the model's performance on the validation set (Fig. 4). This can be explained by thinking of the random patch policy as a form of implicit data augmentation: each time a specific image is used for training, the model will, with high probability, see a new bag-of-patches encoding, owing to the image being represented by a relatively small random subset of all possible patches.
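The random patch policy can be sketched as follows. This is only an illustration under stated assumptions: patch origins are drawn uniformly at random, patches lie fully inside the image, and the function name `sample_patch_origins` is hypothetical. The exact sampling scheme used for training may differ.

```python
import random


def sample_patch_origins(image_size=512, patch_size=64, n_patches=50, rng=None):
    """Draw top-left corners of n_patches random patches inside a square image.

    Assumption: uniform sampling with patches fully contained in the image.
    """
    rng = rng or random.Random()
    max_origin = image_size - patch_size
    return [
        (rng.randint(0, max_origin), rng.randint(0, max_origin))
        for _ in range(n_patches)
    ]


# Two training passes over the same image see different bags of patches,
# which is what makes the policy act like implicit data augmentation.
bag_epoch_1 = sample_patch_origins(rng=random.Random(0))
bag_epoch_2 = sample_patch_origins(rng=random.Random(1))
print(len(bag_epoch_1), bag_epoch_1 != bag_epoch_2)
```

Each `(row, column)` origin would then be used to crop a 64 × 64 patch from the image before passing the bag to the model.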
Apart from detection accuracy, model interpretability is a common requirement in medical applications of machine learning, as it allows human experts to make sense of why the model makes some prediction and not the opposite. Thus, interpretability is key to the acceptance of any proposed machine learning solution in the field of eye care. To aid in this, our method outputs a heatmap that contains the attention weights assigned to each image region during inference. This attention heatmap emerges as a natural by-product of the model and consequently does not require additional computation, unlike alternative interpretability approaches for deep networks such as Grad-CAM [40]. Such a heatmap can be used alongside the rDR probability prediction to inform medical experts of the particular fundus artifacts that led to that prediction. Furthermore, it can help machine learning practitioners understand their model's failures and find ways to overcome them. For instance, we can see that in false positive predictions (images without rDR that are predicted as rDR) our model usually focuses on regions that contain either tiny spots that resemble microaneurysms (Fig. 5f), yellow lesions that resemble hard exudates (Fig. 5e) or bright cotton-wool-like artifacts (Fig. 5d). Bringing these findings to the attention of eye specialists could facilitate a better understanding of why the model mistakes these artifacts for lesions, as well as a potential strategy to counter such factors of confusion.
One limitation of the proposed approach is the need for large patch overlap during inference. While this is not an issue during training (due to the random patch sampling), it can pose a memory constraint on the system's deployment. For classification, an overlap value of 0.75 (which results in 841 patches per image) achieves good performance and can be handled relatively easily. Nevertheless, constructing the auxiliary attention heatmaps requires larger values in order to obtain more fine-grained and pleasing visualizations. As a measure of scale, the images in Fig. 5 were produced using a patch overlap of 0.875, the largest value that a GPU system with 8 GB of memory (NVIDIA 1070 Ti) could handle.
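The patch counts quoted above follow from simple grid arithmetic, sketched below. The function name `grid_patch_count` is hypothetical, and a 512 × 512 input with 64 × 64 patches is assumed, as in the 50% overlap example; the count for an overlap of 0.875 is derived from the same arithmetic and is not a figure reported in the text.

```python
def grid_patch_count(image_size=512, patch_size=64, overlap=0.5):
    """Number of patches on a regular square grid with the given fractional overlap."""
    stride = round(patch_size * (1 - overlap))  # pixels between patch origins
    per_side = (image_size - patch_size) // stride + 1
    return per_side ** 2


for overlap in (0.5, 0.75, 0.875):
    print(overlap, grid_patch_count(overlap=overlap))
# 0.5 225
# 0.75 841
# 0.875 3249
```

Halving the stride roughly quadruples the number of patches, which is why the overlap used for heatmap construction is bounded by the available GPU memory.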
Finally, an interesting line of investigation for future work would be to question the utility of using a CNN designed for high-resolution images (such as ResNet-18) to process patches of much smaller dimensions (64 × 64). In fact, while for this work we opted for a clearly over-parameterized approach, it will be interesting to test whether a model designed to work with smaller resolutions (e.g. a ResNet-20, which is designed for 32 × 32 images) performs equally well or even better. Another topic of future research will be to examine how to utilize pixel-level DR lesion annotations, typically available in very small amounts due to the tediousness of producing them, in order to improve the predictive performance and lesion heatmap quality. This remains an open problem, as DR grading and DR lesion localization methods have remained more or less orthogonal, with very few published works on the topic [41].

VII. CONCLUSIONS
We introduced a system that detects referable Diabetic Retinopathy in fundus images by extracting local information from each image patch separately and combining it with an attention mechanism. In our experiments, the proposed system achieved high classification performance, making it competitive with state-of-the-art works that were trained on larger and better-annotated data. Aside from its high predictive value, our system can inherently produce a heatmap of the regions on which its decision was based, thus aiding in the interpretation of its predictions.