Introduction

Optical Coherence Tomography (OCT) scans play an important role in diagnosing and managing sight-threatening macular diseases such as age-related macular degeneration (AMD) and diabetic macular edema (DME). By imaging the retina at micrometer resolution, OCT has given ophthalmologists the ability to visualize retinal structures in three dimensions (see Fig. 1). Yet, detailed analysis of OCT scans in clinical routine is time-consuming even for experienced physicians1. With over 30 million OCT scans acquired annually worldwide and an increasing prevalence of chronic eye conditions, the demand for the human resources and expertise needed to assess OCT images, today and in the years to come, is simply overwhelming2,3.

Figure 1

Evaluation of an OCT volume scan (left) using our proposed automated biomarker identification method. Individual cross-sections, or Bscans (middle), are processed and predictions for eleven biomarkers are given per cross-section (right). The color coding indicates the likelihood of a specific biomarker being present in a given cross-section.

Machine learning provides a pathway to automate the inspection of medical images such as OCT scans. Trained on datasets of annotated examples, machine learning algorithms are not only faster at assessing scans, but also more cost-effective than their human counterparts. These advantages have led to a surge of machine learning-based methods for retinal image analysis4,5,6. These include techniques that perform automated diagnosis7, morphological shape estimation8,9,10,11,12, treatment outcome estimation13,14,15 and clinical referral support16. Broadly, these developments have hinged on clinical insights, novel machine learning techniques and large numbers of OCT scans.

Biological markers, or biomarkers, of the retina have traditionally played a central role in both clinical routine and research17. For example, monitoring fluctuations of fluid biomarkers using OCT is an essential part of the standard of care for managing chronic retinal conditions, while other biomarkers have been linked to how well patients respond to treatments18. However, given that there are dozens of established biomarkers, their identification is both time-consuming and challenging due to their number, size, shape and extent. Furthermore, other morphological markers yet to be identified may have a significant impact on patient outcomes, treatment needs and prognosis.

At the core of this work, we hypothesize that an automated method can identify biomarkers reliably and can also help answer routine clinical questions. To show this, we present a machine learning method that automatically identifies a wide range of biomarkers in OCT scans. Our approach learns to identify biomarkers without needing to be shown where they are located in training scans, obviating the need for burdensome segmentation annotations. Trained this way, our method is not only capable of identifying biomarkers more consistently than experienced experts, it also learns a robust representation of retinal characteristics that can be used to identify pathologies in OCT scans acquired with different OCT devices.

Methods

Data and biomarkers

To train our method, we acquired 6 × 6 mm volume OCT scans using the same OCT device type (Spectralis, Heidelberg Engineering, Germany). The scan protocol included 49 Bscans per volume scan and an ART setting of 9 (each Bscan was averaged 9 times). Pathologies included diabetic retinopathy with and without DME, and early, intermediate and late AMD. In total, 327 patients from the outpatient department of the university eye clinic were included in the study. From these, a total of 470 volume scans were collected, resulting in 23’030 OCT Bscans (see Supplementary materials for the distribution of patients per disease). As a validation set, we collected an additional cohort of 21 AMD patients using the same device type and scanning protocol. This study was approved by the ethics committee of the Kanton of Bern, Switzerland (KEK-Nr. 093/13), and was conducted in compliance with the tenets of the Declaration of Helsinki. Given the retrospective design of this study, the ethics committee waived the requirement for individual informed consent.

The following morphological biomarkers were considered in this study: subretinal fluid (SRF), intraretinal fluid (IRF), intraretinal cysts (IRC), hyperreflective foci (HF), drusen, reticular pseudodrusen (RPD), epiretinal membrane (ERM), geographic atrophy (GA), outer retinal atrophy (ORA) and fibrovascular pigment epithelial detachment (FPED). A healthy biomarker was also included to denote the absence of any of the previously mentioned biomarkers. Note that these parameters are established predictive OCT biomarkers18,19,20, used to stage diseases and assess disease progression (see Supplementary materials for a detailed definition of each biomarker used in this work).

Annotation protocol

To annotate which biomarkers were present in individual OCT Bscans, we developed a custom web-based annotation tool to allow fast and convenient annotation (grading) of the image data (see Supplementary video for an illustration of the tool in use). This tool allows OCT volumetric scans to be uploaded and viewed directly in a web browser, with multiple annotators able to grade simultaneously under the same protocol.

A pool of eight experienced annotators from the Bern University Hospital and the Bern Photographic Reading Center contributed to this study. Prior to the annotation collection, all annotators were trained on the respective biomarkers using additional scans containing the defined biomarkers. For each Bscan, annotators were instructed to mark which biomarkers were present in the given scan using dedicated buttons (biomarker present or not) and could also view adjacent Bscans at the same time. A “healthy” grade was assigned when a Bscan contained no other pathological biomarker, while a “can’t grade” label was given to low-quality scans that did not allow confident assessment. For additional consistency, annotators were provided with an annotation manual containing definitions and example images of each biomarker.

Each of the 23’030 Bscans in the training set was annotated by a single annotator randomly selected from the annotator pool. For the validation set, we randomly sampled 49 Bscans for each of the 21 imaged patients. To obtain a broader spectrum of disease stages in this validation set, Bscans were selected from OCT volumes acquired over several years and from both eyes. This resulted in a total of 1029 Bscans, each of which was then graded by at least five graders.

Bscan-based biomarker classification

We propose to use a convolutional neural network (CNN) classifier to identify which biomarkers are present in any given Bscan. Our method is trained using a set of OCT Bscans, each labeled only with the biomarkers present in that scan. This is in contrast to recent trends8,11,12,16 that use pixel-wise segmentations and volumetric information to provide segmentations. Our approach therefore has two advantages: firstly, it requires annotations that are far simpler to gather, especially with the proposed tool. Secondly, it assesses individual Bscans, so it can be applied to volumes by repeatedly evaluating each scan in the volume, or to individual scans as is often the case in clinics, where each scan of the cube/volume scan is assessed for the presence of biomarkers such as IRF or SRF.

Our chosen architecture is based on residual convolutional neural networks21 with dilated convolutions22 and is fully described in the Supplementary material. This choice is motivated by the need to identify a wide range of biomarkers that vary in size, position and extent. For instance, HF biomarkers can occupy only a few pixels, while the entire Bscan needs to be assessed to assign a healthy label. The use of dilated convolutions is motivated by the fact that many networks, such as ResNet21, decrease the image size within the network to increase the receptive field by means of pooling. While maximum pooling may retain the activation of a small biomarker such as HF, we argue that retaining a large spatial size is advantageous for small to medium-sized biomarkers. A drawback of retaining a large spatial size is the reduced receptive field compared to pooling networks. To overcome this, Yu et al.23 proposed dilated convolutions, which increase the receptive field more strongly than conventional convolutions with the same number of model parameters. To further mitigate the problem of identifying small biomarkers, we apply both global maximum pooling and global average pooling to the final feature maps and concatenate the resulting vectors, rather than using global average pooling alone.
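The following PyTorch sketch illustrates these two design choices: a dilated convolution block that preserves spatial resolution while enlarging the receptive field, and the dual pooling head that concatenates globally average- and maximum-pooled features. It is a minimal illustration under assumed channel sizes, not the exact architecture (which is specified in the Supplementary material).

```python
import torch
import torch.nn as nn

class DualPoolingHead(nn.Module):
    """Combine global average and global maximum pooling over the final
    feature maps before the per-biomarker classification layer."""

    def __init__(self, in_channels: int, num_biomarkers: int):
        super().__init__()
        # Concatenating both pooled vectors doubles the feature dimension.
        self.fc = nn.Linear(2 * in_channels, num_biomarkers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, height, width) from the last conv layer.
        avg = feats.mean(dim=(2, 3))   # global average pooling
        mx = feats.amax(dim=(2, 3))    # global maximum pooling
        return self.fc(torch.cat([avg, mx], dim=1))  # per-biomarker logits

# A dilated convolution keeps the spatial resolution while enlarging the
# receptive field: with dilation=2, a 3x3 kernel covers a 5x5 context
# using the same number of parameters (channel width is illustrative).
block = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
)
```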

As in previous methods16,24, we construct an ensemble of CNNs using cross-validated models and combine the predictions of all models using the unweighted mean of the final outputs before sigmoid activation (see Supplementary material for more details). We perform 10-fold cross-validation on the training set, where every model is trained on 90% of the training data and validated on the remaining 10%. To train our network, we use a weighted binary cross-entropy loss function and Stochastic Gradient Descent. The inputs are 512 × 512 pixel grayscale Bscans replicated across three channels to match the RGB input of the network. We pre-train the network on ImageNet25 and then perform transfer learning on the Bscan task to maximize the classifier’s performance.
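As a concrete illustration, the sketch below shows how the fold models’ raw outputs could be averaged before a single sigmoid, and how a weighted binary cross-entropy can be instantiated in PyTorch; the model list and the per-biomarker weights are hypothetical placeholders, not the values used in this work.

```python
import torch

def ensemble_predict(models, bscans: torch.Tensor) -> torch.Tensor:
    """Average the raw logits of all cross-validation models, then apply
    the sigmoid once. `models` is a list of the fold-specific networks."""
    with torch.no_grad():
        # (folds, batch, num_biomarkers) -> mean over folds.
        logits = torch.stack([m(bscans) for m in models])
        return torch.sigmoid(logits.mean(dim=0))  # per-biomarker probabilities

# Weighted binary cross-entropy: one positive-class weight per biomarker
# counteracts class imbalance (values below are hypothetical).
pos_weight = torch.tensor([4.0, 2.5, 3.0, 6.0, 1.5, 8.0, 2.0, 7.0, 9.0, 3.5, 1.0])
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```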

Expert and method evaluation

Using the validation set to evaluate how well expert graders perform the task of Bscan grading, we first apply a majority voting scheme over all graders to determine a gold-standard annotation for all biomarkers on all Bscans. At least half of the graders must agree for a biomarker to be considered present.
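The voting rule can be written compactly; the sketch below assumes the grades for one Bscan are arranged as a binary matrix of graders by biomarkers (a minimal illustration, not the tooling used in the study).

```python
import numpy as np

def majority_vote(grades: np.ndarray) -> np.ndarray:
    """Gold-standard label per biomarker from multiple graders.

    grades: (num_graders, num_biomarkers) binary array for one Bscan.
    A biomarker counts as present when at least half of the graders
    marked it present."""
    return (grades.sum(axis=0) >= grades.shape[0] / 2).astype(int)

# Example: 5 graders, 3 biomarkers; the first biomarker gets 3/5 votes.
votes = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 1, 0]])
print(majority_vote(votes))  # -> [1 0 0]
```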

To assess the performance of graders, we apply two validation schemes: (1) inter-grader performance and (2) performance compared to the majority vote. Inter-grader performance is assessed by computing Cohen’s Kappa between two graders on commonly annotated slices for each biomarker. Performance against the majority vote is evaluated by including all graders to determine the majority vote and then computing each grader’s Cohen’s Kappa against it. Given that the latter may be biased (each grader contributes to the majority vote), we also compute performance with each grader excluded from the majority vote computation.
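A minimal sketch of the leave-grader-out evaluation, assuming scikit-learn for the Kappa computation and a binary label matrix per biomarker (graders by Bscans); the array shapes and names are illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def leave_grader_out_kappa(grades: np.ndarray, grader: int) -> float:
    """Kappa of one grader against the majority vote of the other graders.

    grades: (num_graders, num_bscans) binary labels for a single biomarker."""
    others = np.delete(grades, grader, axis=0)
    # Majority vote computed without the grader under evaluation.
    majority = (others.sum(axis=0) >= others.shape[0] / 2).astype(int)
    return cohen_kappa_score(grades[grader], majority)
```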

We then evaluate our method’s capacity to identify biomarkers in Bscans on the same validation set using a variety of performance metrics: Cohen’s Kappa, sensitivity, specificity, average precision and F1 scores. Additional performance information regarding the architecture design choice is also given in the Supplementary materials.

To estimate the generalization performance of our method for the purpose of pathology classification, we make use of the publicly available Duke AMD dataset26. This dataset includes OCT volumes from 269 AMD and 115 healthy eyes, acquired using a Bioptigen Inc. SD-OCT imaging system, with AMD or healthy annotations given on a per-volume basis. The imaging protocol of this dataset differs from that of our training set, as every volume contains 100 Bscans. Given the different spatial resolution of 1000 AScans in ±6.7 mm compared to our 512 AScans in ±6 mm, we spatially normalize Bscans from this dataset to 512 × 570 pixels. While our biomarker classifier does not directly predict whether a volume has AMD, it can yield a distribution of biomarkers for a volume by aggregating the predicted biomarkers over each Bscan of the volume. To create a classifier out of this distribution, the sum of the sigmoid-activated logits is computed over all AMD biomarkers (SRF, IRF, drusen, RPD, IRC, FPED) and over all Bscans. Note that we do not train on any data from this dataset, even though it was acquired with a different device than the data our method was trained on.
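The aggregation step amounts to a sum of per-Bscan probabilities; a sketch under assumed tensor shapes follows, with the biomarker column indices as hypothetical placeholders.

```python
import torch

# Hypothetical output indices of the AMD-related biomarkers
# (SRF, IRF, drusen, RPD, IRC, FPED) in the classifier's output vector.
AMD_BIOMARKERS = [0, 1, 4, 5, 2, 9]

def volume_amd_score(model, bscans: torch.Tensor) -> float:
    """Aggregate per-Bscan biomarker probabilities into one volume score.

    bscans: (num_bscans, 3, 512, 570) tensor holding the spatially
    normalized cross-sections of a single volume."""
    with torch.no_grad():
        probs = torch.sigmoid(model(bscans))  # (num_bscans, num_biomarkers)
    # Sum over the AMD-related biomarkers and over all Bscans of the volume.
    return probs[:, AMD_BIOMARKERS].sum().item()
```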

Similarly, we also examine the performance of our biomarker predictor on the public test dataset provided by Kermany et al.27 for the task of pathology classification. This dataset contains 1000 Bscans from eyes with healthy, CNV, drusen and DME conditions (250 from every category). Data was collected using the Heidelberg Spectralis OCT device. We evaluate the performance of our method on each Bscan and map the set of predicted biomarkers to a pathology category. We do so using a simple comparison tree: CNV if fluid (SRF, IRF or IRC) is present together with drusen, FPED or RPD; DME if fluid (SRF, IRF or IRC) is present and neither drusen nor FPED is present; Drusen if fluid is not present but drusen, FPED or RPD are; healthy otherwise.
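This comparison tree translates directly into code; the sketch below assumes the per-Bscan predictions have already been thresholded into booleans keyed by biomarker name (the dictionary keys are illustrative), and the if-ordering encodes the precedence of the tree.

```python
def classify_bscan(pred: dict) -> str:
    """Map thresholded biomarker predictions (name -> present) to one of
    the four pathology categories using the comparison tree above."""
    fluid = pred["SRF"] or pred["IRF"] or pred["IRC"]
    drusenoid = pred["Drusen"] or pred["FPED"] or pred["RPD"]
    if fluid and drusenoid:
        return "CNV"
    if fluid:
        return "DME"
    if drusenoid:
        return "Drusen"
    return "healthy"

# Example: fluid together with drusen maps to CNV.
example = {"SRF": True, "IRF": False, "IRC": False,
           "Drusen": True, "FPED": False, "RPD": False}
print(classify_bscan(example))  # -> CNV
```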

Results

Expert-level grading of biomarkers

The validation set contained 1029 Bscans, of which 1002 had at least one biomarker annotated in agreement. Bscans with no biomarker agreement were removed from further analysis. On average, grading all biomarkers on a single Bscan took 24.2 seconds, and every Bscan had on average 1.8 ± 0.983 majority-voted biomarkers (see Supplementary materials for the complete majority-voted statistics).

The mean Kappa value for all biomarkers across graders was 0.742 ± 0.036, with maximum and minimum scores of 0.781 and 0.674, respectively. When leaving out respective graders to compute majority votes, the mean Kappa value across biomarkers decreased to 0.677 ± 0.035, with maximum and minimum scores of 0.710 and 0.607, respectively. Table 1 summarizes the performance of each grader on the validation set, while Fig. 2 illustrates the distribution of Kappa scores across graders for each biomarker. The highest average inter-grader Kappa values were found for SRF (\(\kappa =0.8367\)) and ERM (\(\kappa =0.7386\)), while the lowest were for outer retinal atrophy (\(\kappa =0.365\)) and reticular pseudodrusen (\(\kappa =0.361\)).

Table 1 Grader assessment on the validation set.
Figure 2

Biomarker grading performance of annotators and our method. For each biomarker, the distribution of human grader Kappa values is illustrated with a dedicated box plot (median, first and third quartiles). The red stars depict the performance of our method. (Left) depicts values when the majority vote includes all graders and (right) when individual graders are discarded from the majority vote.

Figure 3 illustrates the Kappa values between each pair of graders for each biomarker, where lighter colors represent higher Kappa values and darker colors correspond to lower values.

Figure 3

Inter-grader Kappa values for each biomarker. The 8 grader IDs are along the vertical and horizontal axes, while light and dark colors show high and low Kappa values between any pair of graders, respectively.

Network-based biomarker grading

Our method yields an average Kappa value of 0.761 ± 0.111 over all biomarkers. The exact match ratio (proportion of Bscans where the method predicted the majority vote) was 0.502 and the at-most-one-wrong ratio (proportion of Bscans where the method predicted the majority vote with at most one biomarker mistake) was 0.824. Our method achieves the highest Kappa value for ERM (\(\kappa =0.877\)) and the lowest for reticular pseudodrusen (\(\kappa =0.545\)). Figure 2 (left) shows the performance of our method in comparison to human graders for each biomarker, and a complete summary of all Kappa scores, as well as sensitivity, specificity, F1 score and average precision, is given in Table 2. For eight biomarkers, our method has a higher Kappa score than the median human grader. If we also consider our method to be an additional grader and include its predictions when computing the majority vote for each biomarker, the classifier’s mean Kappa increases to 0.817. Figure 2 illustrates how these compare to human graders for each of the biomarkers.

Table 2 Performance of the proposed method on the validation set.

Figure 4 shows the receiver operating characteristic (ROC) curves for all biomarkers. Here, we compute eight different leave-grader-out majority votes and compare them with our method’s prediction, while plotting the graders’ specificity and sensitivity. Using class activation maps28, we illustrate in Fig. 5 which parts of the image contribute to the classifier’s decisions on selected examples. For each example, we highlight in warm colors (red) regions of high relevance to the classifier’s decision.
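For reference, a standard class activation map28 can be computed by projecting the classifier weights of one biomarker onto the final feature maps; the sketch below assumes a plain average-pooling head (with our concatenated pooling head, the weight vector would be split accordingly) and illustrative tensor shapes.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feats: torch.Tensor, fc_weight: torch.Tensor,
                         biomarker: int, size=(512, 512)) -> torch.Tensor:
    """Class activation map for one biomarker.

    feats: (1, channels, h, w) feature maps from the last conv layer.
    fc_weight: (num_biomarkers, channels) weights of the final layer."""
    # Weighted sum of feature maps using the biomarker's classifier weights.
    cam = torch.einsum("c,bchw->bhw", fc_weight[biomarker], feats)
    # Upsample the coarse map to the input image resolution.
    cam = F.interpolate(cam.unsqueeze(1), size=size, mode="bilinear",
                        align_corners=False).squeeze()
    # Normalize to [0, 1] for overlay as a heat map.
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```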

Figure 4

Receiver operating characteristic curves for all biomarkers. Stars denote the leave-grader-out performance. The black ROC curve compares our method with the majority-voted ground truth of all graders. Colored curves denote the comparison of the automatic classifier with the majority vote of the corresponding leave-grader-out evaluation.

Figure 5

Classifier activation maps for different biomarkers on selected examples. Red illustrates regions of the image that our method automatically identified in order to correctly predict the corresponding biomarker.

Automated pathology classification

Using automatically identified biomarkers to classify Bscans by pathology, Fig. 6 outlines the distributions of predicted biomarkers for both the Duke (left) and Kermany (right) datasets. In both cases, the distributions are split according to the provided pathological groupings.

Figure 6

Distribution of predicted biomarkers for the Duke dataset (left) and Kermany dataset (right). Using a threshold on the biomarker probability (p ≥ 0.5) to indicate presence, occurrences are aggregated over all scans and within each volume.

In the Duke dataset, significant differences in biomarker quantities were found for all biomarkers except “Healthy”. Using our approach to classify these images as AMD or healthy, we report an AUC of 0.9981 using all AMD biomarkers and 0.9992 using only the drusen biomarker, compared to reported state-of-the-art performances of 0.99929 and 0.99730 on the same dataset. At false positive rates of 0% and 1.7%, our approach achieves true positive rates of 96.2% and 99.3%, respectively. We also achieve a maximum accuracy of 96.7% on the Kermany dataset, while Kermany et al.27 reported an accuracy of 96.6%. When classifying healthy versus pathological Bscans, we report an accuracy of 99.8% compared to the 99.1% previously reported27.

Discussion

We have presented a machine learning method capable of identifying a wide range of biomarkers from OCT Bscans of the retina. Our method is based on convolutional neural networks and is trained using a heterogeneous cohort of treated patients with chronic conditions. We show that our method is not only capable of identifying the referenced biomarkers with high reliability, but that these can then be used to classify scans into pathological categories at state-of-the-art levels.

One of the unique qualities of our approach is that it is trained using annotations that can be collected in a matter of seconds per Bscan. That is, the annotations used are simple, carrying only information as to biomarker presence. In contrast to recent methods that fully locate and delineate relevant anatomical and pathological biomarkers16, our approach is significantly less time-consuming in terms of data preparation. In addition, segmentation annotations may be error-prone31, as they are usually performed by a single grader, and may not serve as an appropriate gold standard for accurate training of a classifier32. Consequently, the annotations needed by our method can easily be gathered with an appropriate interface, such as the one we propose, allowing a much larger number of different scans to be annotated in the same amount of time.

We have also demonstrated that the task of biomarker identification from Bscans is in itself a challenging one, even for experts. The use of experienced graders from a Reading Center showed that, depending on the biomarker, a significant variance in grading outcomes can be expected33. While this variance could be linked to a number of factors, such as adherence to the biomarker definition, the size and extent of the biomarker in the image, or image quality, the overall inter-grader reliability of most biomarkers remains quite high. At the same time, our approach provides grading performance that rivals, if not surpasses, experienced graders on virtually all biomarkers evaluated here. This opens the possibility of automated systems for clinical decision support in routine clinical care or for validation of large clinical studies.

Our results also show that characterizing OCT Bscans with clinically known morphological descriptors can provide valuable information for OCT scan classification, supporting the results of De Fauw et al.16. Whether on single Bscans or volumetric scans, our approach attains state-of-the-art performance on classification tasks by leveraging the biomarkers it identifies. The fact that our approach operates at the Bscan level opens the way for analysis of even larger cohorts of data, as many clinical sites currently acquire only Bscans.

Surprisingly, our state-of-the-art pathology classification performance on the two public datasets evaluated involved OCT scans from multiple devices. The ability to apply machine learning methods across different machines has previously been shown in the context of medical scans34 and OCT images16. What sets our result apart, however, is that our method is trained only with images from a single device, yet provides a rich enough biomarker characterization (see Fig. 6) to classify scans acquired with other devices and imaging protocols. This suggests that our histogram representation is, to a certain extent, robust to image variability across imaging devices.

Naturally, a limitation of our method and its use of presence information for training is that it does not provide an intuitive way to quantify biomarkers (in millilitres or in occurrences). This capability could be essential to assess improvement or worsening of the underlying pathology using specific biomarkers (IRF and SRF). However, in daily clinical practice, physicians often aim for disease stability, and treatment decisions are usually based on the presence or absence of disease activity and disease activity markers, irrespective of whether there is more or less fluid compared to the patient’s last visit35,36,37.

Interestingly, however, biomarker activation maps such as those illustrated in Fig. 5 provide both a visual representation of the classifier’s reasoning and a coarse delineation of biomarkers. The latter could potentially serve as a proxy for biomarker quantification. In addition, because our approach evaluates each Bscan individually, a coarse quantification can still be obtained by counting the Bscans of a volume on which a specific biomarker is detected.

A further limitation of this work is the restricted validation set used. While we considered 1029 Bscans from 21 patients to be sufficiently large for this work, a similar analysis with a larger cohort of patients would further validate the approach. In particular, evaluations on patients with a variety of diseases and treatments would provide evidence of greater generalization to a wider range of clinically relevant cases. Last, our approach does not leverage the fact that adjacent Bscans in a volume are highly correlated. An improved classifier would most likely result from considering inter-Bscan relations. We plan to explore solutions to these limitations in the future.

Data Availability

The datasets generated during and/or analyzed during the current study are not publicly available due to privacy constraints. The data may however be available from the University Hospital Bern subject to local and national ethical approvals.