Fuchs endothelial dystrophy (FED) is a corneal disease characterized by a thickening of Descemet’s membrane, a large deposition of extracellular matrix in the corneal endothelium (referred to as guttae), and a progressive loss of endothelial cells1. The guttae, which are condensations of collagen growing from Descemet’s membrane (gutta is the Latin word for ‘droplet’; guttae is the plural and guttata the adjectival form), appear dispersed in the early stages of the disease, but as FED progresses, guttae become abundant and visible across the entire endothelium2. This is accompanied by a substantial loss of endothelial cells that is irreversible due to the lack of cell regeneration. Eventually, the endothelium can no longer maintain corneal deturgescence, leading to corneal edema, poor visual acuity, and ultimately the need for corneal transplantation to restore vision. FED is the most common indication for transplantation of the corneal endothelium worldwide3. The disease usually appears in people 40–60 years of age and has a slow clinical progression (10–20 years)4. Since many patients with cornea guttata do not progress to FED, its incidence and prevalence are difficult to assess. Nevertheless, it is estimated that around 4% of the population over 40 years of age have corneal guttae5.

The most common staging system used to describe FED has four levels1,4: in stage 1, guttae appear in the central cornea and are non-confluent; in stage 2, guttae start to coalesce and spread towards the peripheral cornea, while endothelial cell loss accelerates; in stage 3, stromal edema is present, which might cause the formation of epithelial and subepithelial bullae; and in stage 4, the cornea has become opaque due to chronic edema and visual acuity is critically compromised. In contrast, central guttae can be observed in the elderly without the presence of edema, and the existence of only peripheral guttae is related to a condition called ‘Hassall-Henle bodies’ and does not lead to edema1. An endothelial cell density (ECD) between 400 and 700 cells/mm\(^{2}\) tends to result in corneal decompensation6. Thus, the correct estimation of ECD, along with the other two main endothelial parameters (polymegethism, or coefficient of variation in cell size [CV], and pleomorphism, or hexagonality [HEX]), is a valuable tool for the diagnosis and monitoring of this dystrophy.

The endothelium can be easily imaged with specular microscopy, a noninvasive, noncontact method that relies on the reflection of light from the interface between the endothelium and the aqueous humor7. However, a smooth, regular endothelial surface is crucial to obtain good-quality images. In this respect, guttae push the endothelial cells out of the specular plane and thus appear as ‘black droplets’ in the specular images (Fig. 1). If the guttae are non-confluent and relatively small (similar to or smaller than the average cell size), a human observer could probably infer the cell tessellation in the area. However, manual annotations are very tedious and time-consuming, and there are currently no automatic methods that can perform such inference. In fact, many recent studies have shown that today’s commercial methods provided by the microscope manufacturers to automatically segment endothelial images exhibit low reliability, even for healthy corneas8,9,10,11,12.

In recent years, several new approaches to estimate the corneal endothelial parameters have been proposed. Up to 2018, the methods were based on image processing techniques and classic machine learning13,14,15,16,17, but from 2018 onward, a large number of new approaches based on deep learning have been presented18,19,20,21,22,23,24,25,26,27,28,29,30,31,32. Overall, these methods have shown good performance. However, a few details are worth noting: (i) some methods (mainly from the pre-deep-learning era) evaluated the estimation of the three corneal parameters, but the images were from healthy eyes and there was either a manual selection of the region of interest (ROI) or the cells were visible in the whole image14,15,16,17,18,21,23; (ii) some of the first deep learning methods simply evaluated the capacity of neural networks to perform an accurate segmentation, either in healthy19,22 or unhealthy corneas26, without estimating any corneal parameter, which avoids the non-trivial problem of refining the raw segmentation to obtain an accurate biomarker estimation; (iii) other publications focused only on estimating ECD (or the number of cells) in healthy cases29,30 and images with guttae20,27,31, some including a method to select the ROI (although this part is often unclear); and (iv) our previous work is, to the best of our knowledge, the only fully automatic method to estimate all three parameters in all types of images (heavily blurred24,25 and also with some guttae28). Among the publications dealing with guttae, a quick visual inspection is enough to perceive the inaccurate segmentation around the guttae, where partial cells occluded by the guttae are included as full cells27,31; in contrast, our previous work28 has shown better results but still failed in cases of very advanced disease. Therefore, the accurate segmentation of images in the presence of guttae remains an unsolved problem.

In this study, we present a new approach that automatically segments endothelial images with guttae and estimates the endothelial parameters. The proposed method contains three subprocesses (Fig. 1): (i) a convolutional neural network (CNN) that infers the cell edges in the image (named CNN-Edge, yielding an ‘edge image’); (ii) another CNN model that infers the cells that can be fully identified (named CNN-Body, yielding a [cell] ‘body image’); and (iii) an image processing module based on watershed that refines the edge images, uses the body images to select the well-detected cells, and extracts the corneal parameters (named postprocessing). A comparative analysis is performed to determine which type of deep learning model (UNet33, ResUNeXt34, DenseUNet35, and UNet+/++36) is most appropriate for this task. We also present a new attention mechanism, named feedback non-local attention (fNLA), which can be plugged into any of the aforementioned networks. Different versions of the attention mechanism are tested, and we compare them to the well-known ‘attention UNet’37. This framework is evaluated against our previous approach28, named CNN-ROI (Fig. 2), which is retrained with the dataset of this study. A variation of CNN-Body, named CNN-Blob (Fig. 2), which infers the overall area of interest instead of independent cells, is also evaluated. Finally, these frameworks are compared against the manual annotations and the estimates provided by the microscope’s software.
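The three subprocesses can be sketched at a high level as follows; all function and variable names are hypothetical stand-ins for illustration, not the authors’ actual code.

```python
# Sketch of the three-stage pipeline (hypothetical names throughout).
def run_pipeline(specular_image, cnn_edge, cnn_body, postprocess):
    """Return the final segmentation and the endothelial parameters."""
    edge_image = cnn_edge(specular_image)   # probabilistic cell-edge map
    body_image = cnn_body(specular_image)   # probabilistic cell-body map
    # Postprocessing: watershed-based refinement of the edge map, ROI
    # selection from the body map, and extraction of ECD, CV, and HEX.
    segmentation, params = postprocess(edge_image, body_image)
    return segmentation, params
```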

Figure 1
figure 1

Flowchart: CNN-Edge detects the cell edges from the specular image, CNN-Body infers the cell bodies that are well detected, and the postprocessing refines the edge images and applies the ROI from the body images to provide the final segmentation. The final segmentation (edges in red, vertices in yellow, non-ROI area in blue) is dilated and superimposed onto the specular image for illustrative purposes.

Figure 2
figure 2

(a) Illustrative specular image with guttae (grade 4) and (f) its manual annotation (the edges, in yellow, are dilated for illustrative purposes; non-ROI area in blue). For this image, the targets in the different networks are: (b) CNN-Edge, (c) CNN-Body, (d) CNN-Blob, and (e) CNN-ROI. The outcome of the networks during inference (DenseUNet fNLA-Mul) are: (g) CNN-Edge, (h) CNN-Body, (i) CNN-Blob, and (j) CNN-ROI. Red and green arrows indicate two peculiarities of the targets (to be discussed later).


Evaluation metrics

The metrics used to evaluate the CNN output images were accuracy, the Sørensen-Dice coefficient (DICE), and the modified Hausdorff distance (MHD)38. MHD is a good metric to evaluate the edge images because it measures how close the detected edges are to the manual annotations, whereas DICE better evaluates the performance for the body images since it measures the overlap of the region of interest (ROI). The evaluation metrics for the corneal parameters were the mean absolute error (MAE) and the mean absolute percentage error (MAPE). To assess the statistical significance between methods, we used the paired Wilcoxon test to compare the MAPE after assigning a 100% error if no parameter estimate was produced. To assess the clinical significance of our method against the gold standard, we used the Kruskal-Wallis H test. All tests used a statistical significance level of \(\alpha =0.05\).
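For reference, DICE and MHD can be computed as in this minimal sketch; here `mhd` operates on point sets (e.g. edge-pixel coordinates extracted from the edge maps), and the brute-force pairwise distance is for illustration only.

```python
import numpy as np

def dice(a, b):
    """Sørensen-Dice coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def mhd(a_pts, b_pts):
    """Modified Hausdorff distance between two point sets (N x 2 arrays):
    the maximum of the two directed mean nearest-neighbour distances."""
    d = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=-1)
    return max(d.min(axis=1).mean(), d.min(axis=0).mean())
```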

Comparison between types of CNNs

Quantitatively (Table 1), the basic UNet, ResUNeXt, and DenseUNet seemed to provide similar performance, with slightly better results for DenseUNet (more noticeable in the MAE of the corneal parameters). However, the differences between those networks were discernible in the qualitative results (Fig. 3): in the edge images, UNet and ResUNeXt provided rather binary results and were very conservative in the areas with large guttae (Fig. 3b,c, central area), whereas DenseUNet provided more probabilistic outputs and was able to extend the edge inference into those guttae areas, reaching beyond the manual annotations (Fig. 3d). In the case of the body images, DenseUNet showed a similar probabilistic behavior (Fig. 3), whereas ResUNeXt had significant problems and the lowest performance (Table 1). Thus, DenseUNet was selected for further investigation.

One key factor in the success of the edge inference within the guttae areas was the selective training procedure, where some of the images in each batch were sampled from specific guttae subsets. In fact, if the DenseUNets were trained without any stratified subsampling (random selection), such edge inferences did not appear (Fig. 3e). Another interesting observation was that the use of the multiscale networks (+ and ++) had no impact on the edge images but provided a clear improvement in the body images for all types of networks (Table 1), although this did not necessarily result in more accurate parameter estimations.
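The selective training procedure can be sketched as a stratified batch sampler; the pool names and the 50% guttae fraction below are assumptions for illustration, not the paper’s exact values.

```python
import random

def sample_batch(pools, batch_size, guttae_fraction=0.5, rng=random):
    """Draw a training batch where a fixed fraction comes from the
    guttae-graded subsets. 'pools' maps subset name -> list of images;
    the key 'no_guttae' holds the images without guttae (illustrative)."""
    n_guttae = int(round(batch_size * guttae_fraction))
    guttae_pool = [img for name, imgs in pools.items()
                   if name != "no_guttae" for img in imgs]
    batch = rng.sample(guttae_pool, n_guttae)
    batch += rng.sample(pools["no_guttae"], batch_size - n_guttae)
    rng.shuffle(batch)
    return batch
```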

Table 1 The average accuracy, DICE, and MHD in the 10-fold cross-validation scheme for the different networks (CNN-Edge and CNN-Body), the number of parameters of each network, and their MAE of the endothelial parameters.
Figure 3
figure 3

(a) Illustrative specular image with guttae (grade 5) and its manual annotation (edges in yellow; non-ROI area in blue). For the different networks (indicated on top), the inference of this image is shown for the CNN-Edge (top image) and CNN-Body (bottom image). Topcon’s software did not detect a single cell in this image.

Selection of attention mechanism

In 2018, Oktay et al.37 proposed an attention mechanism for UNets, which modifies the output of a transition block by using the tensors from the lower resolution stage. In our experiments, we tested several variations of this mechanism, but none provided a clear improvement. The best design (Table 1; the one closest to Oktay’s proposal) gave sharper edge images in complicated areas but occasionally provided poor results for the body images (Fig. 3f).

In this work, we developed a different attention mechanism that exploits the non-local dependencies between features of different resolution stages while simultaneously feeding the attention information from the lowest stages back to the highest ones. Quantitatively, the improvement was subtle but perceptible in the basic DenseUNet: the MHD in the edge images was substantially smaller in all cases, and all metrics in the body images improved (Table 1). Qualitatively, all versions provided sharper edge images (there were fewer double-edge artifacts within guttae areas, Fig. 3), and the inference in the body images was more probabilistic (cells within guttae areas appeared with lower intensity instead of just pitch black). Thus, our attention mechanism seemed to correctly use the surrounding information to determine, in a probabilistic manner, whether a cell should be discarded. With regard to the type of aggregation at the end of the attention block, the concatenative and multiplicative types provided similar results; in contrast, the additive type had the lowest performance in all the experiments. Therefore, we chose the multiplicative type because it uses fewer parameters and provides the overall smallest MAE in the corneal biomarkers (Table 1). On the other hand, the use of fNLA in the multiscale versions +/++ did not yield a clear benefit.

Postprocessing tuning

The postprocessing had two hyperparameters to tune: a threshold to discard weak edges (‘edge threshold’) and a threshold to discard superpixels from the final segmentation (‘body threshold’). We performed a combined grid search and found that, for the edge threshold: (i) a value around 0.1 minimized the error for CV regardless of the body threshold; (ii) for HEX, the optimal value was 0.1 for a body threshold of 0.5 or lower, but it shifted towards 0.2 as we increased the body threshold towards 1; and (iii) for ECD, a higher threshold (0.2–0.3) was better, but the differences were very small. Overall, the purpose of the edge threshold is simply to discard false edges created during the postprocessing (watershed). In that respect, a value of 0.1 seemed a reasonable choice.

As for the body threshold, we observed that a value of 0.5 (the most intuitive choice) yielded the lowest errors in the three corneal parameters for the networks without attention, but it shifted towards 0.75 if we employed networks with fNLA, although the error differences were very small. However, visual inspection showed that no major mistakes were produced with a threshold of 0.5 (Fig. 4, cells in brighter green). We believe that a lower threshold sometimes includes cells that were not annotated in the gold standard; the differences therefore do not indicate segmentation mistakes but mainly reflect the so-called ‘dissimilarity due to cell variability’ (i.e. the estimated parameters can vary if a different set of cells is used to estimate them).
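The role of the two thresholds can be illustrated with a simplified filter over a watershed result; the label and probability inputs are assumed given, and the actual refinement (e.g. merging superpixels across weak ridges) involves more steps than shown here.

```python
import numpy as np

def filter_segmentation(labels, edge_prob, body_prob,
                        edge_thr=0.1, body_thr=0.5):
    """Apply the two postprocessing thresholds to a watershed result.
    'labels' is the superpixel label image (0 = watershed ridge), and
    'edge_prob'/'body_prob' are the CNN output maps. Returns the ids of
    superpixels kept as well-detected cells, plus a mask of weak ridge
    pixels that would trigger a merge of adjacent superpixels."""
    kept = set()
    for lab in np.unique(labels):
        if lab == 0:
            continue
        mask = labels == lab
        # Body threshold: keep superpixels with high mean body intensity.
        if body_prob[mask].mean() >= body_thr:
            kept.add(int(lab))
    # Edge threshold: ridge pixels whose edge probability is too weak.
    weak_ridges = (labels == 0) & (edge_prob < edge_thr)
    return kept, weak_ridges
```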

Finally, the postprocessing computed the hexagonality by using the cell vertices (vertex method) instead of counting the neighboring cells (neighbor method, which inevitably discards the cells in the periphery). For those peripheral cells, it was possible to detect their vertices if a portion of the peripheral edges was visible. This was true for most cases, although sometimes the vertex detection was faulty (Fig. 4IV-e,f). Nevertheless, the MAE in our HEX estimations was 3.14 [%] with the vertex method and 4.13 [%] with the neighbor method. Thus, the vertex method was better even when there were some mistakes in the detection of peripheral vertices.
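The vertex method reduces to counting cells with exactly six detected vertices, as in this minimal sketch (vertex detection itself is not shown; the input mapping is assumed already computed).

```python
def hex_vertex(cell_vertices):
    """Hexagonality via the vertex method: percentage of cells with
    exactly six detected vertices. 'cell_vertices' maps cell id -> number
    of vertices. Unlike the neighbour method, peripheral cells are kept."""
    if not cell_vertices:
        return 0.0
    six = sum(1 for n in cell_vertices.values() if n == 6)
    return 100.0 * six / len(cell_vertices)
```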

Figure 4
figure 4

Eight images with guttae (two cases per grade): (a,b) grade 3, (c,d) grade 4, (e,f) grade 5, (g,h) grade 6. (I) The specular image; the output of the (II) CNN-Edge and (III) CNN-Body for the DenseUNet fNLA-Mul; (IV) the final segmentation (edges in red, vertices in yellow, non-ROI in blue, and the detected cells in two shades of green: brighter green for the cells whose body-intensity was between 0.50–0.75, and darker green for the cells whose body-intensity was between 0.75–1.00); and (V) the gold standard (the annotated edges in yellow; non-ROI in blue). On the right, the estimated parameters (gold-standard values in parentheses). Topcon’s software detected 45, 34, 0, 38, 6, 0, 0, and 0 cells (a–h).

Comparison between frameworks

The performance of Topcon’s algorithm was considerably inferior (Table 2), particularly for the images with guttae, where it failed to detect any cell in 30% of the images (as in Fig. 4c,f,g,h) and it only detected one third of the cells (on average) in the remaining images.

Table 2 The MAE of the endothelial parameters in the images with/without guttae, the percentage of images with estimates (success), and the average number of cells per image, for our methods and Topcon (the latter has two success percentages: one for ECD/CV and another for HEX). The manual annotations had an average of 148 and 166 cells for the images with/without guttae, respectively.

Regarding the previous approach CNN-ROI28, the network had no problem accurately inferring the targets (Fig. 2j). However, the question was whether such targets are optimal for subsequently identifying the well-detected cells. The error analysis indicated that the CNN-ROI method detected approximately 10–12 more cells per image, but there were a few segmentation mistakes among those cells, and the estimation errors were significantly worse than those of the CNN-Body approach (Table 2). The paired Wilcoxon test indicated a statistically significant difference between the estimates of CNN-Body and CNN-ROI (\(P<0.0001\), all biomarkers).

As for CNN-Blob, it detected the same number of cells as CNN-Body (Table 2) and the estimation errors were virtually the same (slightly worse for CNN-Blob in the images with guttae; Table 2). The paired Wilcoxon test indicated no statistically significant difference between the two estimates (\(P=0.69\), \(P=0.13\), and \(P=0.69\) for ECD, CV, and HEX, respectively). Nevertheless, it is easier for a human observer to interpret which cells are well detected in the CNN-Body output (Fig. 2h) than in the CNN-Blob output (Fig. 2i).

Statistical analysis

The distributions of the estimated parameters from the CNN-Body method and the gold standard passed Levene’s test for homogeneity of variance, but they did not pass the Shapiro-Wilk normality test. Thus, the Kruskal-Wallis H test was performed, and it showed no statistically significant difference between the manual and our automatic assessments for ECD (\(P=0.81\)) and CV (\(P=0.74\)), but it did for HEX (\(P<0.001\)). To further assess the estimates, we performed a Bland-Altman analysis, which showed that more than 95% of the estimates were within the 95% limits of agreement for all parameters: 96.5% for ECD, 96.6% for CV, and 96.7% for HEX.
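The Bland-Altman computation can be sketched as follows (bias ± 1.96 SD limits of agreement; the standard hypothesis tests themselves are omitted).

```python
import numpy as np

def bland_altman(estimates, gold):
    """Bland-Altman analysis: mean difference (bias), 95% limits of
    agreement (bias +/- 1.96 SD), and the fraction of cases within them."""
    diffs = np.asarray(estimates, float) - np.asarray(gold, float)
    bias, sd = diffs.mean(), diffs.std(ddof=1)
    lo, hi = bias - 1.96 * sd, bias + 1.96 * sd
    within = float(np.mean((diffs >= lo) & (diffs <= hi)))
    return bias, (lo, hi), within
```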

Error analysis

We plotted the error of the corneal parameters as a function of the number of cells (Fig. 5) and fitted two exponentials to the mean and SD of the error using the least-squares method. The error estimates showed a normal distribution along the y-axis for the three parameters, which allowed us to assume that the area within two SDs covered approximately 95% of the error. Images with and without guttae were plotted in different colors (Fig. 5). This evaluation showed that, for images with a high number of cells, the error is very low regardless of the presence of guttae; in fact, guttae are a critical factor only for images with a very low number of cells. Overall, we can conclude that (i) the number of cells was the main variable for predicting the reliability of the estimation, with a clear decrease in the error spread as more cells were detected; (ii) the unreliable cases were images with guttae and fewer than 25 cells; (iii) HEX required more cells to reduce the error spread; and (iv) there was a notable underestimation in HEX for images with fewer than 100 cells.
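Fitting an exponential to the error spread can be done by least squares in log space, as in this simplified sketch (the paper’s exact parameterisation of the fitted exponentials may differ).

```python
import numpy as np

def fit_exponential(n_cells, err_sd):
    """Least-squares fit of err_sd ~ a * exp(b * n_cells), obtained via
    linear regression on log(err_sd). Requires strictly positive err_sd."""
    b, log_a = np.polyfit(np.asarray(n_cells, float),
                          np.log(np.asarray(err_sd, float)), 1)
    return np.exp(log_a), b
```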

Figure 5
figure 5

Error of our estimates of (a) ECD, (b) CV, and (c) HEX for the subsets of images without guttae (black dots, 702 images) and with guttae (colored circles, 483 images), displayed as a function of the number of cells. The y-axis indicates the error computed as the difference between the estimates and the gold standard. The mean (solid line) and two SDs (dashed lines) of the error function were modeled with exponentials.

Figure 6
figure 6

Average value of the biomarkers from the manual annotations—(a) number of cells, (b) ECD, (c) CV, and (d) HEX—and the standard deviation of the estimated error for the different subsets of graded guttae images. The error SD makes it possible to visualize how close the estimations are to the gold-standard values.

Clinical analysis

It is widely accepted that 75 cells are necessary to estimate ECD with high reliability39. At that point, our estimated ECD error was approximately 0 ± 35 [cells/mm\(^2\)] (mean ± SD); at 25 cells, it was 0 ± 45 [cells/mm\(^2\)]. This is expected to be much lower than the uncertainty generated by the “cell variability”. Doughty et al.39 evaluated that uncertainty (in manual annotations) and concluded that estimating ECD using 75 cells entails assuming an uncertainty (of 1 SD) of ±2% (or ±70 cells/mm\(^2\), twice our method’s error), whereas the uncertainty at 25 cells increases to ±10% (or ±350 cells/mm\(^2\), more than seven times our method’s error). Nevertheless, both elements (uncertainty and error) should be taken into consideration when evaluating the reliability of the estimations. Regarding CV and HEX, there are no studies in the literature on the effect of cell variability that would allow us to make a comparison.
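A quick check of the arithmetic above, assuming a nominal ECD of 3500 cells/mm\(^2\) (an assumed round value consistent with ±2% ≈ ±70 cells/mm\(^2\)):

```python
# Nominal ECD is an assumption for this back-of-the-envelope check.
nominal_ecd = 3500
unc_75_cells = 0.02 * nominal_ecd    # +/-2% uncertainty at 75 cells
unc_25_cells = 0.10 * nominal_ecd    # +/-10% uncertainty at 25 cells
ratio_75 = unc_75_cells / 35         # vs. our method's error SD at 75 cells
ratio_25 = unc_25_cells / 45         # vs. our method's error SD at 25 cells
```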

We also observed that the number of visible cells and ECD decrease acutely as the amount of guttae increases (Fig. 6), whereas neither CV nor HEX showed a significant change. While it is generally accepted that a CV of less than 30% and a HEX of greater than 60% are usually signs of a healthy endothelium28, this assumption does not hold for the current dataset. Thus, our results suggest that guttae have an impact on neither the cell-size variation nor the hexagonality of the visible cells (those not directly affected by extensive guttae).

Overall, our proposed method has a very low error spread for the three biomarkers in images graded with mild or moderate guttae (Fig. 6, up to grade 4). For the cases with severe guttae (grade 5–6), the main problem is the low number of detected cells (in grade 6, the average number of cells per image was 13 ± 10), which translates into a large estimation uncertainty, particularly for CV and HEX (Fig. 6). As shown in Fig. 4g,h, we did not observe major segmentation mistakes in those cases, but the low number of visible cells makes the biomarker estimation unreliable.


We have presented a fully automatic method for the estimation of the corneal endothelium parameters from specular microscopy images that contain guttae. To the best of our knowledge, this is the first time that such images have been properly segmented in the literature.

The main factor in achieving such accuracy was the way the annotations were performed. We observed that, if the edges hidden by the guttae could be manually delineated, the networks could learn to replicate such patterns. Since endothelial cell edges usually appear as straight lines, the extension of partially occluded edges is sufficient to infer the hidden tessellation if the guttae are small enough. In fact, we observed that our model went beyond what our annotator could do. For example, the gutta indicated by the green arrow in Fig. 2 was too large to be certain about the hidden edges, yet the network provided a segmentation that seemed highly probable. However, further experiments would be needed to assess the trustworthiness of such inference; as a suggestion, the artificial generation of black spots in the images (mimicking the appearance of guttae) could be a way to evaluate this hypothesis, although that would raise questions regarding the experiment’s reliability.

Our proposed attention mechanism (fNLA) moderately improved the performance in both types of CNNs (CNN-Edge and CNN-Body). Since the basic DenseUNet already achieved an excellent performance, the improvement provided by our attention block was modest. Nevertheless, we believe that the feedback-attention path depicted in our proposed network could yield a good performance in other types of segmentation problems, particularly the ones that infer areas instead of edges.

One interesting behavior of this framework is that, while CNN-Edge provides good inference beyond the manual annotations, CNN-Body is rather conservative. This is because the targets of CNN-Body were based on the original annotations; thus, any inference by CNN-Edge that surpasses the manual annotations is not considered in the target of CNN-Body. While this might seem a negative quality, the results show that our framework detects practically the same number of cells as the manual annotations and barely any segmentation mistakes are observed, even in the most difficult cases (Fig. 4). Therefore, we believe this approach is preferable to more daring ones, such as CNN-ROI (Table 2).

Overall, our estimates agreed very well with the gold standard and they were significantly better than the ones provided by the instrument’s software, demonstrating the ability of this artificial intelligence framework to accurately estimate the endothelial parameters from images with guttae.

Materials and methods


Two datasets were employed in this work. The first dataset came from a clinical study concerning the implantation of a Baerveldt glaucoma implant in the Rotterdam Eye Hospital (Rotterdam, the Netherlands). The clinical study contained 7975 images from 204 patients (average age 66±10 years), who were imaged with a specular microscope (Topcon SP-1P, Topcon Co., Tokyo, Japan) before the device implantation and at 3, 6, 12, and 24 months after, in both the central (CE) and the peripheral supratemporal (ST) cornea. The protocol required five specular images to be taken in each area at each visit, although it was sometimes difficult to reach that number of gradable images (specifically, an average of 4.7 CE images and 3.6 ST images per visit were acquired). Retrospectively, we observed that 15 patients had clear signs of FED stage 2: they all had guttae in both the CE and ST cornea, with a larger amount in the CE and a clear increase in the ST during the two-year follow-up. From this subset of patients, 193 images were collected. Furthermore, we observed that another 81 patients had small amounts of non-confluent guttae in the CE cornea (either FED stage 1 or normal aging), and 227 images were collected from these cases. In total, 420 images with guttae were selected from this study. In addition, 400 images from other patients in the study without signs of guttae were selected to build a balanced dataset.

The second dataset came from another clinical study in the Rotterdam Eye Hospital regarding the transplantation of the cornea (ultrathin Descemet Stripping Automated Endothelial Keratoplasty, UT-DSAEK). This dataset contained 383 images of the central cornea from 41 eyes (41 patients, average age 73±7 years), acquired at 1, 3, 6, and 12 months after surgery with the same specular microscope (Topcon SP-1P). The study population comprised patients over 18 years of age with FED scheduled for keratoplasty. Among these patients, FED reappeared in one. Another 13 patients showed a small amount of non-confluent, stable guttae during the one-year follow-up. In total, 80 of the 383 images showed some guttae (all images were included in the present work).

Altogether, the combined dataset contained 1203 images, of which 500 depicted guttae of various magnitudes. The images covered an area of approximately 0.25 mm \(\times\) 0.55 mm and were saved as 8-bit grayscale images of 240 \(\times\) 528 pixels. All images were manually segmented to create the gold standard, using the open-source image manipulation program GIMP (version 2.10). Furthermore, we collected the endothelial parameters provided by the microscope (the Topcon SP-1P computes these with the software IMAGEnet i-base, version 1.32).

Ethics approval and consent to participate

Data was collected in accordance with the principles of the Declaration of Helsinki (October 2013). Signed informed consent was obtained from all subjects. Participants gave consent to publish the data. Approval was obtained from the Medical Ethical Committee of the Erasmus Medical Center, Rotterdam, The Netherlands (MEC-2014-573). Trial registration for the Baerveldt study: NTR4946, registered 06/01/2015. Trial registration for the UT-DSAEK study: NTR4945, registered 15-12-2014.

Grading the dataset

The 500 images with guttae were graded according to the complexity of segmenting them, taking two metrics into account: the amount of guttae and the amount of blur present in the image. For each metric, images were given a value of 1 (mild), 2 (moderate), or 3 (severe), the final grade being the sum of both values. As a result, there were 134 images with low complexity (grades 1–2), 235 images with medium complexity (grades 3–4), and 131 images with high complexity (grades 5–6).
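The grading scheme reduces to a simple sum and binning, sketched below (the bin boundaries follow the grade ranges stated above).

```python
def complexity_grade(guttae_level, blur_level):
    """Final grade = guttae level + blur level, each scored 1 (mild),
    2 (moderate), or 3 (severe); grades are binned into low (<= 2),
    medium (3-4), and high (5-6) complexity."""
    assert guttae_level in (1, 2, 3) and blur_level in (1, 2, 3)
    grade = guttae_level + blur_level
    complexity = "low" if grade <= 2 else "medium" if grade <= 4 else "high"
    return grade, complexity
```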

Targets and frameworks

CNN-Edge is the core of the method. If the specular image had good quality (high contrast) and cells were visible in the whole image, the resulting edge image would probably be so well inferred that a simple thresholding and skeletonization would suffice to obtain the binary segmentation. However, the current images present low contrast, blurred areas, and guttae. Therefore, CNN-Body has the goal of providing a ROI image to discard areas masked by extensive guttae or blurriness.

These CNNs are trained independently and share the same design; they differ only in their inputs and targets. To create the targets, we make use of the manual annotations (i.e. the gold standard), which are images where a value of 1 indicates a cell edge (edges are 8-connected pixel lines of 1-pixel width), a value of 0 represents a full cell body, and any area to discard (including partial cells) is given a value of 0.5. If a blurred or guttae area is so small that the cell edges can be inferred by observing the surroundings, the edges are annotated instead (Fig. 2f). For all annotated cells, we identify their vertices; this allows computing hexagonality from all cells and not only the inner cells (in the latter case, HEX is computed by counting the neighboring cells, so the cells in the periphery of the segmentation are not considered; this way of computing HEX was used in previous publications23,24,25,28,40 and is how Topcon’s built-in software computes it). Therefore, HEX is now defined as the percentage of cells that have six vertices.

The target of the CNN-Edge only contains the cell edges from the gold standard images, which have been convolved with a 7 \(\times\) 7 isotropic unnormalized Gaussian filter of standard deviation (SD) 1 (Fig. 2b). This provides a continuous target with thicker edges, which proved to deliver better results than binary targets23.
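The edge-target construction can be sketched as follows; the clipping of the output to 1 where neighbouring edge pixels overlap is an assumption of this sketch, not a stated detail of the paper.

```python
import numpy as np

def edge_target(edge_map, size=7, sd=1.0):
    """Convolve a binary edge map with an unnormalized isotropic Gaussian
    (peak value 1) to obtain a continuous CNN-Edge target with thicker
    edges. Values are clipped to 1 in this sketch (assumption)."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    kernel = np.exp(-(x**2 + y**2) / (2 * sd**2))   # unnormalized: max = 1
    padded = np.pad(edge_map.astype(float), r)
    out = np.zeros(edge_map.shape, dtype=float)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = min((padded[i:i + size, j:j + size] * kernel).sum(),
                            1.0)
    return out
```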

The target of the CNN-Body only contains the full cell bodies from the gold standard images, and partial cells are discarded either because they are partially occluded by large guttae or they are at the border of the image (Fig. 2c). The same probabilistic transformation is applied here. Alternatively, we also evaluated whether a target that also includes the edges between the cell bodies (Fig. 2d) was a better approach (named CNN-Blob).

This framework is similar to our previous approach28, where a model named CNN-ROI, whose input is the edge image, provides a binary map indicating the ROI (Fig. 2j). To create its target, the annotator simply drew on the edge images (Fig. 2g) the rough area that they considered trustworthy, creating a binary target (Fig. 2e).

Backbone of the network

The proposed CNN has five resolution stages (Fig. 7a). We tested three designs depending on the connections of the convolutional layers within each node: consecutive (as in UNet33), with residual connections (ResUNeXt34), and with dense connections (DenseUNet35). In addition, we also explored two multiscale designs (commonly referred to as + [plus] and ++ [plusplus]36) for the aforementioned networks: UNet+ (Fig. 7d), UNet++, ResUNeXt+, ResUNeXt++, DenseUNet+, and DenseUNet++. Our ++ designs differ from the + ones in that the former use feature addition from all previous transition blocks of the same resolution stage. In total, nine basic networks were tested (the code for all cases can be found in our GitHub).

Figure 7

A schematic representation of the (a) UNet backbone and (d) its equivalent UNet+; a simplified representation of adding fNLA/sNLA blocks to the (b) UNet and (e) UNet+, where a red arrow indicates self-attention (sNLA) if it returns to the same node or feedback attention (fNLA) otherwise; (c) a schematic overview of how the fNLA/sNLA blocks are added to the UNet backbone for the three deepest resolution scales; and (f) a detailed description of an fNLA block with multiplicative aggregation: the feature maps are shown as the shape of their tensors, \(X^{a,b}\) is the tensor to be transformed, \(Y^{a+1,b}\) is the tensor from the lower resolution scale used for attention, the blue boxes (\(\phi\) and g) indicate a 2 \(\times\) 2 transpose convolution with strides 2, the red boxes (\(\theta\) and \(\Omega\)) indicate a 1 \(\times\) 1 convolution, \(\oplus\) denotes matrix multiplication, and \(\odot\) denotes element-wise multiplication. In the case of sNLA, \(\phi\) and g are also a 1 \(\times\) 1 convolution with \(X^{a,b}\) as input. In attention terminology44, Q is for query, K for key, V for value.

Here, we briefly describe the network that provided the best performance: the DenseUNet. In this network, the dense nodes have 4, 8, 12, 16, or 20 convolutional blocks, depending on the resolution stage, with a growth rate (GR) of 5 feature maps per convolutional layer. Each convolutional block has a compression layer, Conv(1 \(\times\) 1, 4\(\cdot\)GR maps) + BRN41 + ELU42, and a growth layer, Conv(3 \(\times\) 3, GR maps) + BRN + ELU, followed by Dropout(20%) + Concatenation (with the block’s input); the exception is the first dense block, which lacks the compression layer and dropout43 (all nodes in the first resolution stage lack dropout). The transition layer (short connections) has Conv(1 \(\times\) 1, \(\alpha\) maps) + BRN + ELU, the downsampling layer has Conv(2 \(\times\) 2, strides 2, \(\alpha\) maps) + BRN + ELU, and the upsampling layer has ConvTranspose(2 \(\times\) 2, strides 2, \(\alpha\) maps) + BRN + ELU, where \(\alpha\) = (number of blocks in the previous dense block) \(\times\) GR/2. The output of the last dense block, \(X^{04}\), uses a transition layer with 2 maps to provide the output of the network.
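A small sketch (our own, derived from the description above) makes the channel bookkeeping concrete, assuming the encoder path and the 4, 8, 12, 16, 20 block counts listed in the text:

```python
GR = 5  # growth rate: feature maps added per convolutional block

def dense_block_channels(n_blocks, in_channels, gr=GR):
    """Each convolutional block concatenates `gr` new maps to its input,
    so a dense block with n_blocks blocks grows the input by n_blocks*gr."""
    return in_channels + n_blocks * gr

def transition_channels(n_blocks, gr=GR):
    """alpha = (number of blocks in the previous dense block) * GR / 2."""
    return n_blocks * gr // 2

# Hypothetical walk through the encoder, starting from 1 grayscale channel.
ch = 1
for n in [4, 8, 12, 16, 20]:
    ch = dense_block_channels(n, ch)
    alpha = transition_channels(n)
    print(f"{n:2d} blocks -> {ch:3d} maps, transition alpha = {alpha}")
    ch = alpha  # the transition/downsampling layer compresses to alpha maps
```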

The attention mechanism

The core of the attention block (Fig. 7f) is inspired by Wang et al.’s non-local attention method44, which also resembles a scaled dot-product attention block45,46. Wang et al.’s design, which is employed in the deepest stages of a classification network, proposes a self-attention mechanism where (among other differences) Q, K, and V in Fig. 7f are the same tensor. In our case, the attention block is added at the end of each dense block and is named feedback non-local attention (fNLA, Fig. 7f) or self non-local attention (sNLA), depending on where it is placed within the network (Fig. 7c): if a tensor from a lower resolution stage exists, the attention mechanism makes use of it (fNLA); in the absence of such a tensor, a self-attention operation is performed (sNLA, where \(\phi\) and g in Fig. 7f become a 1 \(\times\) 1 convolution with the same input \(X^{a,b}\)). In the DenseUNet, the nodes of the encoder use fNLA (except the deepest node) and the nodes of the decoder use sNLA (Fig. 7b); in the DenseUNet+/++, only the largest decoder uses sNLA (Fig. 7e).

We also explored three types of attention mechanism depending on the aggregation at the end of the block: (a) the default case (Fig. 7f) uses multiplicative aggregation, where a single attention map (from \(\Omega\)) is sigmoid-activated and then element-wise multiplied with all input feature maps; (b) in the case of concatenative aggregation, \(\Omega\) is simply an ELU activation, whose output maps (C/8 maps) are concatenated to the input features; (c) in the case of additive aggregation, \(\Omega\) becomes a 1 \(\times\) 1 convolution with ReLU activation and C output feature maps, which are then added to the input tensor. For comparison, Wang et al.’s model44 is an sNLA with additive aggregation.
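The default fNLA block with multiplicative aggregation can be sketched in NumPy as follows; this is a simplified illustration with random weights, where the 1 \(\times\) 1 convolutions become per-pixel linear maps and the 2 \(\times\) 2 transpose convolutions are approximated by nearest-neighbor upsampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_ch):
    """A 1x1 convolution is a per-pixel linear map over channels (random
    weights here; a trained network would learn them)."""
    w = rng.standard_normal((x.shape[-1], out_ch)) * 0.1
    return x @ w

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fnla_multiplicative(x, y_low):
    """Sketch of an fNLA block (cf. Fig. 7f) with multiplicative aggregation.

    x:     (H, W, C)       tensor to be transformed
    y_low: (H/2, W/2, C')  tensor from the lower resolution stage
    """
    H, W, C = x.shape
    d = max(C // 8, 1)                                # reduced channel dim
    y_up = y_low.repeat(2, axis=0).repeat(2, axis=1)  # stand-in for phi/g
    q = conv1x1(x, d).reshape(H * W, d)               # theta(X): query
    k = conv1x1(y_up, d).reshape(H * W, d)            # phi(Y): key
    v = conv1x1(y_up, d).reshape(H * W, d)            # g(Y): value
    attn = softmax(q @ k.T / np.sqrt(d)) @ v          # non-local response
    attn_map = conv1x1(attn.reshape(H, W, d), 1)      # Omega -> single map
    gate = 1.0 / (1.0 + np.exp(-attn_map))            # sigmoid activation
    return x * gate                                   # gate every input map
```

For sNLA, `y_low` would simply be replaced by `x` itself (no upsampling needed), matching the self-attention case described above.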

Intuitively, an sNLA mechanism computes the response at a specific point in the feature maps as the weighted sum of the features at all positions. In fNLA, the attention block maps the input tensor against the output features from the lower dense block, thus allowing the attention mechanism to use information created further ahead in the network. Moreover, the feedback path created in the encoder allows the attention features from the deepest dense block to be propagated back to the first block. While endothelial specular images do not possess such long-range dependencies (features separated by a distance of 3–4 cells do not seem to be correlated), this attention operation might be useful in the presence of large blurred areas (such as guttae). In this work, we explored the use of this attention mechanism for the DenseUNet and the multiscale versions DenseUNet+/++.

Description of the postprocessing

The postprocessing aims to fix any edge discontinuity. Here, we have improved a process that was first described in a previous publication23. The steps are:

(I) We estimate the average cell size (l) using Fourier analysis of the edge image23. As shown in previous work25, this estimation is simple, extremely robust, and accurate.

(II) As a new step, we add a perimeter of intensity 0.5 around the edge image. This closes any partial cell in contact with the image border.

(III) We smooth the edge image with a Gaussian filter whose standard deviation is \(\sigma =k_{\sigma }l\), where \(k_{\sigma }=0.2\) (this parameter was derived in a previous publication40). This fixes edge discontinuities.

(IV) We apply classic watershed47 to the smoothed edge image, which detects weak edges. This provides a binary segmentation where edges are 8-connected-pixel lines of 1 pixel width.

(V) In another new step, we identify every edge and vertex in the segmentation. The vertices are the branch points of the segmentation, and the edges are the sets of 8-connected positive pixels whose endpoints are constrained to vertices17. We set 2 pixels as the minimum length for an edge (edges of only 1 pixel are fused with the vertices at their endpoints to become a single vertex of 3 pixels). For every edge, we check its mean intensity (from the edge image) to discard the weak ones: if it is lower than 0.1 (a threshold evaluated in the “Results” section), the edge is discarded. However, we make a distinction when removing edges: if the edge is internal, all its pixels are removed and, thus, the vertex pixels at the ends of that edge become edge pixels; in contrast, if the edge is in contact with the non-ROI area, we only remove a few pixels in the middle of the edge in order to preserve the vertices. This is relevant because we use the vertices to determine HEX.

(VI) In the updated binary segmentation, every superpixel in contact with the border of the image is discarded. For the remaining superpixels, we check the Body/ROI image to determine whether to keep or discard them. If using the body/blob image, a superpixel is included if the average intensity of its pixels is above 0.5. If using the ROI image, a superpixel is included if at least 85% of its area is within the ROI.

(VII) In the final segmentation, the parameters are estimated.
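Step (I) above can be sketched as follows; this is our own simplified reading of the Fourier analysis in the cited references, where the average cell size is taken as the period of the dominant ring in the power spectrum of the edge image:

```python
import numpy as np

def estimate_cell_size(edge_img):
    """Sketch of step (I): estimate the average cell size (in pixels) from
    the dominant ring in the power spectrum of the edge image.

    A quasi-regular cell mosaic produces a ring of energy at the spatial
    frequency of one cell per cell-width; the radial average of the power
    spectrum peaks at that radius, and the period is side / radius.
    """
    f = np.abs(np.fft.fftshift(np.fft.fft2(edge_img))) ** 2
    h, w = f.shape
    cy, cx = h // 2, w // 2
    y, x = np.mgrid[:h, :w]
    r = np.hypot(y - cy, x - cx).astype(int)        # integer radius bins
    radial = np.bincount(r.ravel(), weights=f.ravel())
    counts = np.bincount(r.ravel())
    profile = radial / np.maximum(counts, 1)        # radially averaged power
    peak = 1 + np.argmax(profile[1:min(h, w) // 2]) # skip the DC term
    return min(h, w) / peak                         # period ~ avg cell size

# Step (III) would then smooth the edge image with sigma = 0.2 * l.
```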


To evaluate the networks, a ten-fold cross-validation was performed (with all images from one eye in the same fold). All networks were implemented in TensorFlow 2.4.1 on a single NVIDIA V100 GPU with 32 GB of memory. A training batch size of 15 images was employed, where six images were sampled from a specific guttae subgroup (two images from each complexity level). The dimensions of the DenseUNet were selected such that it could fit within the GPU memory; to build the multiscale DenseUNet+/++, we reduced the GR to 4 so that they could still fit in memory while having a similar number of parameters. Likewise, ResUNeXt+/++ and UNet+/++ had one fewer convolutional block in each resolution stage than the ResUNeXt and UNet (except in the lowest stage).

The networks were trained with categorical cross-entropy as the loss function, the Nadam optimizer48, an initial learning rate (lr) of 0.001, no early stopping, and 200 epochs for CNN-Edge (learning decay of 0.99, such that \(lr_{epoch-i}=lr \cdot lr_{decay}^i\)) or 100 epochs for CNN-Body, CNN-Blob, and CNN-ROI (learning decay of 0.97). For data augmentation, images were randomly flipped left-right and up-down, and elastic deformations were employed. The specular images were only normalized to the range 0–1 instead of 0–255. The networks were programmed in Python 3.7, and the parameter estimation and statistical analyses were done in Matlab 2020a (MathWorks, Natick, MA).
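The learning-rate schedule reduces to a one-liner, shown here only to make the decay rule explicit:

```python
def lr_at_epoch(lr0, decay, epoch):
    """Exponential learning-rate decay: lr_epoch-i = lr * decay**i."""
    return lr0 * decay ** epoch

# CNN-Edge: lr0 = 0.001, decay = 0.99, trained for 200 epochs
print(lr_at_epoch(0.001, 0.99, 0))    # first epoch: 0.001
print(lr_at_epoch(0.001, 0.99, 199))  # last epoch: roughly 0.001 * e^-2
```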