Introduction

The performance of deep learning (DL) has surpassed that of both human experts and other analytic methods on many prediction tasks in computer vision as well as other applications1,2. It has also shown great potential in pathology applications, such as breast cancer metastases detection3,4,5 and grading of prostate cancer6,7,8,9. The workflow of a digital pathology laboratory consists of preparing glass slides, containing tissue or cell samples from a patient, which are then digitised using specialised scanners to produce whole slide images (WSIs). This allows pathologists to review the patient case using computers instead of microscopes and enables deployment of deep learning systems to assist the review10.

One remaining challenge for wide clinical adoption of DL in pathology, perhaps the most important one, is that the performance of neural networks (NNs) can deteriorate substantially due to domain shift11, i.e., differences between the distribution of the training data and the distribution of the data for which predictions are made. In digital pathology, differences in WSIs are observed between centres due to, for instance, medical protocols, tissue preparation processes or the scanner types used.

The typical strategy to achieve high generalisation capacity of a DL model is to ensure high diversity in the training data. In pathology, collecting data from several sources (care providers) is therefore important. Another approach to reducing the domain shift error is to apply extensive augmentations during the training stage, which is an active research area in the pathology domain12,13. Unfortunately, these methods are still far from completely alleviating the performance drop due to domain shift. Thus, exploring more strategies to ensure a model’s robustness when deployed is highly motivated.

The uncertainty of a prediction is a source of information that is typically not used in DL applications. There is, however, rationale indicating that it could be beneficial. By generating multiple, slightly varied, predictions as the basis for an uncertainty estimate, additional information about the model’s sensitivity to changes is made available. This potential added value could be relevant both for the generalisation challenge and for boosting performance overall. In particular, we argue that the added dimension of uncertainty could be utilised as a building block for clinical workflows where pathologists and DL models interact, for instance for triaging WSIs or sorting WSI regions of interest.

In this paper, we explore a number of research questions that are central for understanding how uncertainty can contribute to robustness of DL for digital pathology:

  • Can uncertainty estimates add to the predictive capacity?

  • Are uncertainty estimates effective for flagging potential prediction errors?

  • Is the added value of uncertainty affected by domain shift?

  • Are computationally more demanding uncertainty methods, i.e. deep ensembles and test time augmentations (TTA), more effective than the softmax score of the NN?

  • Do custom architectures, designed to provide uncertainty estimation, outperform model-independent methods?

Our experiments explore two types of domain shift in histology of lymph nodes in breast cancer cases. One shift concerns data coming from two different care providers, in different countries. The other shift concerns the challenge of dealing with cancer subtypes, i.e. whether uncertainty estimation can be beneficial for rare conditions that may lack a sufficient amount of training data. Around 70% of breast cancer cases are ductal carcinomas. The second most common subtype is lobular carcinoma, which accounts for roughly 10% of the positive cases. Furthermore, this subtype is often more challenging for a pathologist to detect due to its less obvious infiltrative patterns14,15. Therefore, it is reasonable to assume that DL models not specifically trained for lobular carcinomas will have lower performance on them. We use the ductal vs lobular carcinoma scenario as a proxy for the general case of subtypes that are under-represented in the training data.

With respect to uncertainty estimation methods, there is an important distinction between methods that influence the choice of NN architecture or training procedure, and methods that are independent of how the DL model was designed. The first category includes MC dropout16, ensembles17 and other techniques18,19,20. Model-independent methods such as test time augmentations (TTA)21, however, have several advantages. One is that they impose no constraints on model design, constraints which may lead to suboptimal performance. Moreover, model independence opens the possibility to benefit from uncertainty estimates for any model, including the locked-down commercial solutions that are typically deployed in the clinic. Thus, studying the effectiveness of model-independent uncertainty estimation is of particular interest. To the best of our knowledge, we are the first to compare the effectiveness of TTA, MC dropout, and ensembles on classification tasks in digital pathology.

In this work, we train an NN classifier as the basis for our evaluation of uncertainty methods. We contribute to the understanding of uncertainty and deep learning for digital pathology in four ways. First, we propose a way of combining the uncertainty measure with the softmax score in order to boost generalisability of the model. Secondly, we measure how well misclassified patches can be detected by uncertainty methods. Thirdly, we compare the effectiveness of three uncertainty methods (Deep ensembles, MC dropout, and TTA) together with four different metrics utilising the multiple predictions: three established measures (sample variance, entropy and mutual information) and our proposed metric (sample mean uncertainty). Finally, we investigate whether uncertainty estimations generalise over a clinically realistic domain shift, and whether they can mitigate the problem of a rare cancer subtype that is under-represented in the training data.

Related work

Uncertainty estimation is an important topic in deep learning research that holds potential for providing more calibrated predictions and increasing the robustness of NNs. The methods can be categorised based on the statistical theory they are grounded in: frequentist approaches, Bayesian neural networks (BNNs) and Bayesian approximations for standard NNs22. The methods based on frequentist statistics commonly use ensembles17,23, bootstrapping24 and quantile regression25. BNNs are based on Bayesian variational inference and estimate the posterior distribution for a given task, and thus provide uncertainty distributions over parameters by design. However, their adoption in the medical imaging domain is currently slow due to the higher computational costs of training and poor uncertainty estimation22,26. There is also a more recent line of research showing that certain transformations of the softmax confidence score27, or some modification to the network architecture28, may produce a reasonable estimation of uncertainty without any additional computations. To reflect this recent trend we include a comparison with direct uncertainty estimation from the softmax score in all of our experiments, as well as an uncertainty estimator based on the sample mean over different network evaluations.

In deep learning applications within the medical domain, most research effort has been devoted to radiology, with MC dropout and Deep ensembles being two common methods compared in the literature. Nair et al.20 showed that the MC dropout16 method can improve multiple sclerosis detection and segmentation. They evaluated the uncertainty measures by omitting a certain portion of the most uncertain predictions and comparing the effect on false positive and true positive rates. Kyono et al.29 evaluated if AI-assisted mammography triage could be safely implemented in the clinical workflow of breast cancer diagnosis. They estimated uncertainty by combining the MC dropout and TTA methods, and concluded that this approach could provide valuable assistance.

Within computational pathology, the most similar previous work is by Thagaard et al.30, which evaluated the deep ensembles17, MC dropout16, and mixup31 methods for breast cancer metastases detection in lymph nodes. They trained an NN model for breast cancer metastasis detection and evaluated its performance in combination with the three uncertainty estimation methods on several levels of domain shifts: in-domain test data (same hospital, same organ), breast cancer metastases in lymph nodes from a different hospital, colorectal cancer (different hospital and organ), and head and neck squamous cell carcinoma (different hospital, organ and sub-type of cancer) metastases to the lymph nodes. They found that Deep ensembles17 performed considerably better on most evaluation criteria except for detecting squamous cell carcinoma where mixup31 showed better results. Similarly, Linmans et al.19 showed that uncertainties computed by Deep ensembles as well as a multi-head CNN32 allowed for detection of out-of-distribution lymphoma in sentinel lymph nodes of breast cancer cases.

TTA in medical imaging has successfully been applied for segmentation tasks. Graham et al.33 improved the performance of gland instance segmentation in colorectal cancer by incorporating TTA uncertainties into the NN system. Wang et al.34 compared the potential gains from using MC dropout, TTA or a combination of both on segmentation performance of fetal brains and brain tumours from 2D and 3D magnetic resonance images. They found that the combination of the two methods achieved the best results.

In comparison with previous research efforts, our work brings novel contributions in several ways. This includes evaluating the model-agnostic TTA method for classification in pathology, and making the comparison to model-integrated methods. We introduce an approach to combine a model’s softmax score with an uncertainty measure in order to improve the predictive performance. Moreover, we use a broader evaluation scheme for misprediction detection where all classification thresholds are considered instead of a single one. The broad scheme also includes evaluation of three uncertainty estimation methods using four different metrics, whereas previous work mostly has focused on the entropy metric. Finally, in all experiments we include a baseline based on the softmax score from one single model, in order to clearly measure the improvement that can be achieved by the added complexity of uncertainty estimation methods.

Material and methods

In this section we describe the three uncertainty estimation methods and the four uncertainty metrics that are evaluated in our experiments. Then we provide the details about the NN algorithms that we trained for the classification task, the training procedure, and the datasets used for training and evaluation of the uncertainty methods and metrics.

Uncertainty estimation methods

All of the methods have the same basic principle: to produce multiple predictions for each input. The variation within these predictions shows how uncertain the model is.

MC dropout

We are interested in computing the posterior probability distribution p(W|X, Y) over the NN weights W given the input patches X and corresponding ground truth labels Y. This posterior is intractable, but it can be approximated using variational inference with some parameterised distribution \(q^{*}(W)\) that minimises the Kullback-Leibler (KL) divergence:

$$\begin{aligned} q^{*}(W) = {{\,\mathrm{arg\,min}\,}}_{q(W)} KL(q(W) \Vert p(W| X, Y)). \end{aligned}$$

Gal et al.16 showed that minimising the cross-entropy loss of an NN with dropout layers is equivalent to minimising the KL divergence above. Furthermore, the authors show that the samples obtained by multiple stochastic passes through an NN with dropout enabled can be treated as an approximation of the model’s uncertainty. Following Thagaard et al.30, we added a dropout layer with probability 0.5 in the NN before the logits. During test time, we activated the dropout layer with the same probability and ran 50 stochastic passes for each input.
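To make the procedure concrete, below is a minimal PyTorch sketch of MC dropout inference, assuming a trained classifier `model` that contains an `nn.Dropout` layer before the logits (the function name and structure are ours, not taken from the original implementation).

```python
import torch

def mc_dropout_predict(model, x, T=50):
    """Run T stochastic forward passes with dropout kept active at test time."""
    model.eval()                          # freeze batch-norm statistics
    for m in model.modules():             # ...but re-enable dropout sampling
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=1) for _ in range(T)], dim=0
        )
    return probs                          # shape: (T, batch, num_classes)
```

The resulting (T, batch, classes) tensor is the input to the uncertainty metrics described below.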

Deep ensembles

This method is based on training T identically configured NNs with different random seeds. During inference, the T predictions per input are aggregated for uncertainty estimation17. Following previous work19,30, we set \(T=5\).

Test time augmentations

Each input is randomly augmented T times before being passed through the trained model. The uncertainty scores are computed from the T predictions. Usually, the test time augmentations are identical to the ones applied during the training of the model21,34. In our experiments we set \(T = 50\) to match the number of forward passes in the MC dropout method. For a detailed description of the augmentations, see the "Network training" section.
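As an illustration, a minimal sketch of the TTA inference loop is given below, assuming `augment` is a stochastic transform that mirrors the training-time augmentations (the function and argument names are ours).

```python
import torch

def tta_predict(model, x, augment, T=50):
    """Collect T softmax outputs, each computed on a randomly augmented copy of x."""
    model.eval()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(augment(x)), dim=1) for _ in range(T)], dim=0
        )
    return probs  # shape: (T, batch, num_classes)
```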

Uncertainty metrics

Once we obtain the multiple predictions per input, we can compute an uncertainty metric. In this work we compared three well-established metrics: sample variance, entropy and mutual information. In addition, we introduce the sample mean uncertainty metric, which is based on a probabilistic interpretation of the softmax score in a binary classification problem.

Sample mean uncertainty

This metric is based on the mean of the samples generated by an uncertainty estimation method. We define sample mean uncertainty, \(u_s\), as:

$$\begin{aligned} u_s = 1 - 2(\overline{s} - 0.5)^2, \end{aligned}$$

where \(\overline{s}\) is the average of softmax scores \(s_i\) over T predictions:

$$\begin{aligned} \overline{s} = \frac{1}{T} \sum _{i=1}^{T} s_i. \end{aligned}$$

The measure takes values between 0 and 1, assigning high values to patches whose mean tumour softmax score is around 0.5, indicating that they are potentially more uncertain. Low values are observed when the softmax scores are close to 0 or 1, implying high confidence in the corresponding binary classifications. The measure reflects the general dependence between softmax confidence and uncertainty.

Also, it shares characteristics with the estimator based on max predicted softmax probability for any class, which was evaluated in27.

Sample variance

This metric is derived by taking the variance across the T predictions per input produced by each of the uncertainty methods20.

Entropy

For a discrete random variable X, Shannon entropy quantifies the amount of uncertainty inherent in the random variable’s outcomes. It is defined as35:

$$\begin{aligned} H(X) = - \sum _{i} P(x_i) \log P(x_i), \end{aligned}$$

which we approximate for each input i as36:

$$\begin{aligned} H(\hat{y_i}| \mathbf {W}, \mathbf {D})&\approx - \sum _{c=1}^{C} \frac{1}{T} \sum _{t=1}^{T} \Bigg [ P(\hat{y_i} = c| W_t, D_t) \cdot \log \left( \frac{1}{T} \sum _{t=1}^{T} P(\hat{y_i} = c| W_t, D_t) \right) \Bigg ], \end{aligned}$$

where T is the number of predictions per input generated by an uncertainty estimation method, C is the number of classes in our data, \(\mathbf {D}\) is the dataset, \(\hat{y_i}\), \(i = 1, \ldots , |\mathbf {D}|\), is the classifier's prediction for input i, and \(\mathbf {W}\) are the parameters of the classifier. We refer to this metric as 'entropy'.

Mutual information (MI)

The MI metric was first defined by Shannon35. It measures how much information we gain for each input by observing the samples produced by an uncertainty estimation method. It is approximated by36:

$$\begin{aligned} MI(\hat{y_i}, \mathbf {W}| \mathbf {D}) \approx H(\hat{y_i}| \mathbf {W}, \mathbf {D}) - E \left[ H(\hat{y_i}| W_t, D_t) \right] , \end{aligned}$$

where \(H(\hat{y_i}| \mathbf {W}, \mathbf {D})\) is the entropy of expected predictions. \(E \left[ H(\hat{y_i}| W_t, D_t) \right]\) is the expected entropy of model predictions across the samples generated by an uncertainty estimation method which can be approximated as36:

$$\begin{aligned} E \left[ H(\hat{y_i}| W_t, D_t) \right]&\approx - \sum _{c=1}^{C} \frac{1}{T} \sum _{t=1}^{T} \Bigg [ P(\hat{y_i} = c| W_t, D_t) \cdot \log \left( P(\hat{y_i} = c| W_t, D_t) \right) \Bigg ]. \end{aligned}$$
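The following is a minimal NumPy sketch of how the four metrics can be computed from the T softmax outputs for a single input, assuming binary classification with the tumour score in column 1 (the function name and the small epsilon added for numerical stability are our own choices; for the binary case, the sample variance is taken over the tumour score).

```python
import numpy as np

def uncertainty_metrics(probs, eps=1e-12):
    """probs: (T, C) array of per-pass softmax outputs for one input (rows sum to 1)."""
    mean_probs = probs.mean(axis=0)                  # averaged class distribution
    s_bar = mean_probs[1]                            # mean tumour softmax score

    sample_mean_unc = 1 - 2 * (s_bar - 0.5) ** 2     # sample mean uncertainty, as defined above
    sample_variance = probs[:, 1].var()              # variance of the tumour score over the T passes

    # predictive entropy of the averaged distribution
    entropy = -np.sum(mean_probs * np.log(mean_probs + eps))
    # expected entropy of the individual predictions
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    mutual_information = entropy - expected_entropy

    return sample_mean_unc, sample_variance, entropy, mutual_information
```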

Network training

We trained five Resnet18 models37 with He initialisation38 and a dropout layer (with probability 0.5)39 before the logits, using five different random seeds. The data augmentations during training as well as at test time were based on the work of Tellez et al.12. That is, to each input we applied a horizontal flip with probability 0.5, 90-degree rotations, a scaling factor between 0.8 and 1.2, HSV colour augmentation with hue and saturation intensity ratios in [-0.1, 0.1], a brightness intensity ratio in [0.65, 1.35], and a contrast intensity ratio in [0.5, 1.5]. We also applied additive Gaussian noise and Gaussian blur, both with \(\sigma \in [0.0, 0.1]\).

Each training epoch consisted of 131 072 patches sampled from the training WSIs with an equal number of tumour and healthy patches. We used the Adam optimiser with \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), an initial learning rate of 0.01, and a learning rate decay of 0.1 applied when the validation accuracy had not improved for 4 epochs. The models were trained until convergence with a maximum limit of 100 epochs. From each training setup, the best performing model in terms of validation accuracy was chosen.
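A hedged sketch of the corresponding model and optimiser setup in PyTorch is shown below. It assumes a recent torchvision; data loading, the augmentation pipeline and the epoch loop are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet18 with random initialisation (torchvision applies He/Kaiming init to conv layers),
# and a dropout layer inserted before the logits
model = models.resnet18(weights=None)
model.fc = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(model.fc.in_features, 2),   # tumour vs. healthy
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999))

# decay the learning rate by 0.1 when validation accuracy has not improved for 4 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=4
)
# inside the training loop, after each validation epoch:
#     scheduler.step(val_accuracy)
```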

Datasets

Table 1 Information about the datasets used in the evaluation of the model and the uncertainty methods.

The in-domain data in this project is the Camelyon16 dataset40, which contains 399 whole slide images (WSIs) of hematoxylin and eosin (H&E) stained lymph node sections collected at two medical centres in the Netherlands. The slides were scanned with 3DHistech Pannoramic Flash II 250 and Hamamatsu NanoZoomer-XR C12000-01 scanners. 270 WSIs from the Camelyon16 dataset were used for training and validation of the Resnet18 models. A balanced test dataset was created by extracting patches from the official Camelyon16 test set of 129 WSIs; this was used for the in-domain performance evaluation. In our experiments, 'Camelyon16 data' refers to the set of patches extracted from the Camelyon16 test set unless otherwise noted.

Our out-of-domain, class-balanced patch data is extracted from 114 H&E stained WSIs of lymph node sections from a medical centre in Sweden, annotated by a resident pathologist with 4 years of experience aided by immunostained slides. This is a subset of the larger AIDA BRLN dataset41, scanned with Aperio ScanScope AT and Hamamatsu NanoZoomer scanners (XR, S360, and S60). We refer to it as BRLN.

Table 1 lists the four datasets of patches extracted from Camelyon16 and BRLN that were used in our experiments. These datasets were only used for the evaluation. In BRLN data, we have two cancer subtypes: lobular and ductal carcinomas.

In order to study uncertainty effects on generalisation to the lobular cancer subtype, we created two subsets of the BRLN data which we call Lobular and Ductal data. Each consists of 3480 tumour patches of the respective cancer subtype and the same 3480 healthy patches. As all the datasets are publicly available, we did not need to obtain ethical approval for our study.

Evaluation metrics

We evaluate our results based on the area under the curve (AUC) of the receiver operating characteristic (ROC) and precision-recall (PR) curves. ROC-AUC is the most common metric used to evaluate the performance of a binary classifier42 and is also common in uncertainty evaluation in digital pathology19,30,33. ROC-AUC captures the trade-off between the true positive rate (TPR), also known as recall, and the false positive rate (FPR), also known as 1 - specificity:

$$\begin{aligned} TPR&= \text {Recall} = \frac{TP}{TP + FN} \\ FPR&= 1 - \text {Specificity} = \frac{FP}{TN + FP} \end{aligned}$$

The PR curve plots precision against recall where precision is the fraction of positive predictions that are truly positive43:

$$\begin{aligned} \text {Precision} = \frac{TP}{TP + FP} \end{aligned}$$

In addition to the AUC measures that aggregate performance across all classification thresholds, we are interested in examining in detail how performance of methods and metrics varies for different choices of classification thresholds. For this comparison, we look at accuracy:

$$\begin{aligned} \text {Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \end{aligned}$$
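These measures can be computed with scikit-learn as in the minimal sketch below, where `y_true` and `scores` are small placeholder arrays and `average_precision_score` is used as a common estimator of PR-AUC (it is not identical to a trapezoidal PR-AUC).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, accuracy_score

y_true = np.array([0, 0, 1, 1, 1])             # toy ground-truth labels
scores = np.array([0.1, 0.6, 0.4, 0.8, 0.9])   # toy tumour softmax scores

roc_auc = roc_auc_score(y_true, scores)                          # ROC-AUC
pr_auc = average_precision_score(y_true, scores)                 # PR-AUC (average precision)
accuracy = accuracy_score(y_true, (scores >= 0.5).astype(int))   # accuracy at threshold 0.5
```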

Results

Basis for the experiments

Performance of classifier

In order to draw meaningful conclusions from the evaluation of uncertainty methods and metrics, we first need to ensure that the base classifier has reasonable performance.

Table 2 shows ROC-AUC and PR-AUC on the four datasets. Overall, the achieved performance of 0.975 ROC-AUC (0.981 PR-AUC) on in-domain Camelyon16 data indicates that the Resnet18 was a sufficiently good classifier for this detection task. The drop of less than 1% in ROC-AUC (and PR-AUC) between the Camelyon16 and BRLN datasets is consistent with other work observing that a well-trained model should suffer only a relatively small decrease in performance under domain shift arising from different medical centres4.

Table 2 PR and ROC AUC values based on softmax scores (single model) for each of the 4 datasets.

Increased error for lobular carcinoma

Investigating the model’s performance on cancer subtypes within the BRLN data, we found that it performs substantially worse on lobular carcinoma: 0.889 ROC-AUC (0.928 PR-AUC) compared to 0.982 ROC-AUC (0.987 PR-AUC) on the ductal subtype (see Table 2). This result confirms that there is indeed a domain shift effect due to tumour type, in line with our assumptions.

Boosting metastases detection

Figure 1

Tumour metastases detection on Camelyon16 data: ROC curves and accuracy of using softmax tumour score from a single NN vs averages of softmax tumour scores (per input) produced by the uncertainty estimation methods.

Figure 2

Tumour metastases detection on BRLN data: ROC curves and accuracy of using softmax tumour score from a single NN vs averages of softmax tumour scores (per input) produced by the uncertainty estimation methods.

Figure 3

Relation between softmax confidence and estimated uncertainty, for two different uncertainty methods. The points show the test set from Camelyon16, with colours encoding ground truth class labels. The dashed lines illustrate the 2D threshold used for classification based on both softmax and uncertainty, for a set of different threshold values.

Table 3 ROC-AUCs on Camelyon16 and BRLN data: combining softmax tumour score from a single NN with uncertainty estimates. Softmax score refers to using the softmax output alone.
Figure 4

Prediction accuracy on Camelyon16 data when combining softmax tumour score from a single NN with uncertainty estimates.

Figure 5

Prediction accuracy on BRLN data when combining softmax tumour score from a single NN with uncertainty estimates.

Given the multiple predictions provided by the three different methods (MC dropout, ensembles, and TTA), the most straightforward approach to boosting predictive performance is to utilise traditional ensemble techniques. The most common one is to average the softmax output over the different inference runs/models/augmentations. The results are shown in Figs. 1 and 2 for Camelyon16 and BRLN, respectively, and compared to using a single prediction as baseline. The results show a consistent but small improvement in terms of ROC-AUC for the deep ensembles and TTA. This is also reflected by the accuracy curves, demonstrating how the averaging improves the results by a small margin (\(\sim\)0.3 percentage points) at the optimal classification threshold, for both Camelyon16 and BRLN.

Another strategy for boosting predictive performance is to consider uncertainty estimation and softmax output from a single network as separate entities. Although the uncertainty methods also make use of different combinations of the softmax score, it is interesting to investigate if this approach holds benefits over traditional techniques. We do this by turning the classification task into a two-dimensional thresholding problem, with the softmax score and the uncertainty measure as two separate dimensions.

We observed that for tumour patches it was more common to have a combination of high entropy uncertainty and low softmax score compared to the healthy patches (see Fig. 3a). This inspired us to propose an alternative classification score defined by:

$$\begin{aligned} f(u,s) = \left( \left( \frac{u}{P_u} \right) ^y + s^y \right) ^{\frac{1}{y}}, \end{aligned}$$

where u is the uncertainty measure and s is the softmax score for the tumour (positive) class from a single NN. The factor \(P_u\) is used to normalise the range of uncertainties, and we define it as the 99th percentile of the uncertainty value range in the data. Based on a specified threshold t, the prediction is positive for \(f(u,s)>t\), otherwise negative. The curve \(f(u,s)=t\) intersects the axes at t, and the exponent y controls its shape, from circular for \(y=2\) towards square for large y. For all experiments we use \(y=10\). Figure 3 illustrates the 2D space spanned by softmax score and uncertainty estimation, for two different methods, with corresponding 2D decision boundaries, \(f(u,s)=t\), for a selection of different t.
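A minimal sketch of the combined score is given below, assuming `u` and `s` are NumPy arrays holding the uncertainties and the single-network tumour softmax scores for the same patches (the function name is ours).

```python
import numpy as np

def combined_score(u, s, y=10):
    """Combine uncertainty u and single-network tumour softmax score s."""
    P_u = np.percentile(u, 99)   # normalise the uncertainty range
    return ((u / P_u) ** y + s ** y) ** (1.0 / y)

# the prediction is positive where the combined score exceeds the threshold t:
# preds = combined_score(u, s) > t
```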

Table 3 summarises the ROC-AUC results on Camelyon16 and BRLN data, for different combinations of uncertainty metrics and methods. We can see that MC dropout is the only method that, independently of metric and dataset, achieves worse ROC-AUC scores than the softmax score from a single NN. TTA and deep ensembles exhibit nearly identical performance for each computed metric, and they consistently perform on par with or better than the baseline of using the softmax score from a single NN, but the improvement is small, below 1 percentage point in terms of ROC-AUC.

Although the ROC-AUC results are similar to those of the traditional ensemble technique (Figs. 1 and 2), another aspect of robustness is how the performance of methods and metrics varies across the range of classification thresholds. In Fig. 4 we can see that when using the sample mean or entropy uncertainty, the shape of the accuracy versus classification threshold curve is considerably different for Camelyon16 data. Instead of a narrow range of peak accuracy, we get high performance over a broader range of thresholds. This indicates that embedding uncertainty information can lessen the sensitivity to how the operating point of the prediction is set, which is one part of the generalisation challenge. Importantly, this finding holds true also under domain shift (Fig. 5).

Misprediction detection

Figure 6

ROC-AUCs of misprediction detection on Camelyon16 (in-domain) and BRLN (domain shift) data sets for different thresholds used to differentiate between tumour and non-tumour predictions. The softmax-based baseline uncertainty is the same in all plots.

Table 4 Camelyon16 data: ROC-AUCs of misprediction detection for varying classification thresholds. The highest achieved values per classification threshold are in bold.
Table 5 BRLN data: ROC AUCs of misprediction detection for varying classification thresholds. The highest achieved values per classification threshold are in bold.

In addition to embedding uncertainty information in the prediction, a straightforward application of the uncertainty estimates is misprediction detection. Performance on this task also provides a general idea about the capacity of the methods to boost robustness in a deployed diagnostic tool. In this work, we only evaluated how well the methods can detect mispredictions, without determining the best approach for incorporating this information in a clinical setting. For example, in order to improve performance, one could omit the detected mispredictions or adjust their predicted labels, but we leave this direction of research for future work.

We compare the three uncertainty estimation methods incorporating multiple predictions with a baseline uncertainty \(u_\text {base}\) derived from a single softmax value:

$$\begin{aligned} u_\text {base} = 1 - 2(s - 0.5)^2, \end{aligned}$$

where s refers to the softmax output for the tumour class of a single NN. The baseline captures the general correlation between softmax score and uncertainty, where uncertainty is maximal at 0.5 and decreases towards 0 and 1, as seen in Fig. 3a.
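A minimal sketch of this evaluation is shown below: the uncertainty value serves as a ranking score for separating correctly and incorrectly classified patches, summarised with ROC-AUC (the function names are ours; scikit-learn and NumPy arrays are assumed).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def baseline_uncertainty(s):
    """Baseline uncertainty from a single softmax tumour score, as defined above."""
    return 1 - 2 * (s - 0.5) ** 2

def misprediction_auc(y_true, tumour_scores, uncertainty, threshold=0.5):
    """ROC-AUC of the uncertainty as a detector of misclassified patches."""
    preds = (tumour_scores >= threshold).astype(int)
    mispredicted = (preds != y_true).astype(int)   # 1 = misclassified patch
    return roc_auc_score(mispredicted, uncertainty)
```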

Evaluating uncertainty methods

From the plots in Fig. 6, showing the ROC-AUC performance for the misprediction detection task, we observe the same tendency as in the experiment on boosting the general performance: ensembles and TTA are substantially better than MC dropout. In fact, MC dropout performs worse than the baseline independently of the chosen metric or classification threshold.

In Table 4, we see that the highest result for all classification thresholds was achieved by the ensembles method. TTA performance is midway between the baseline and the ensembles. Comparing with Table 5, we see that domain shift affects misprediction detection performance in a negative way. Under domain shift, only ensembles and TTA with sample mean uncertainty consistently achieve improvements over the baseline, whereas the other combinations are on par with or below the baseline (see also Fig. 6b).

An interesting observation is that there is a trade-off between how good the uncertainty methods are at misprediction detection and how well the NN performs on its primary task of cancer metastases detection. For higher threshold values, the predictive accuracy of the NN decreases, but the misclassification detection effectiveness increases (Fig. 6). This may suggest that uncertainty estimation is more beneficial for models with weaker predictive performance.

Evaluating uncertainty metrics

In the experiments we also compared the four uncertainty metrics. In Fig. 6a, we observe that on the in-domain data all metrics achieve similarly good performance relative to the baseline when computed from TTA or deep ensembles predictions. From Table 4, MI emerges as the best performing metric, closely followed by the other three.

The sample variance, entropy and MI metrics do not generalise well under domain shift. From Figs. 5 and 6b, we see that sample mean uncertainty is the only metric that performs better than the baseline independently of the classification threshold on the BRLN data (for ensembles and TTA).

Uncertainty and lobular carcinoma

Table 6 Tumour metastases detection on Lobular and Ductal data: ROC-AUCs of the combination of sample mean uncertainty and the softmax score.
Figure 7

Prediction accuracy on Lobular and Ductal data when combining the softmax tumour score from a single NN with sample mean uncertainty estimated by the ensembles and TTA methods.

Figure 8

ROC-AUCs of misprediction detection by sample mean uncertainty from ensembles and TTA (for varying classification thresholds). The baseline is computed from the softmax score of a single NN.

Table 7 ROC-AUCs of misprediction detection by sample mean uncertainty computed from the ensembles and TTA methods. The highest achieved values per classification threshold are in bold.

Now we turn to evaluating if uncertainty measures may contribute to boosting the performance on a rare type of data, in our case: lobular carcinoma.

For this experiment, we focus on the consistently good performers from the previous experiments: the sample mean uncertainty metric combined with the ensembles and TTA uncertainty estimation methods.

Uncertainty for boosting the tumour metastases detection

In Table 6, we see similar results as for the entire BRLN dataset: the ROC-AUCs improve marginally by combining the uncertainty with the softmax score, with slightly more improvement for the lobular data. Fig. 7 shows the previously noted effect of a flattened accuracy curve, where the accuracy increase for suboptimal thresholds is more pronounced for the lobular dataset.

Uncertainty for misprediction detection

From Fig. 8 we conclude that all methods are substantially better at detecting mispredictions on the ductal cancer subtype than on the lobular, meaning that this type of domain shift also has a negative effect on misprediction detection performance. For the optimal classification threshold, the misprediction detection performance on lobular data is not much better than a random guess.

From Table 7 we see that the improvement over the baseline for the best performing uncertainty estimation method is similar on ductal and lobular data.

Discussion

The main research question was whether uncertainty estimates can add to the predictive capacity of DL in digital pathology. The results show that uncertainty indeed adds value if suitable methods and metrics are chosen. The predictive performance can be slightly increased, but a perhaps more important benefit is a lessened sensitivity to the choice of classification threshold, mitigating the infamous AI 'brittleness'. Uncertainty used for misprediction detection is valuable in the sense that performance is far above a random guess. The results also show, however, that the added value of introducing uncertainty over the softmax probability is quite limited, and it is an open question whether these benefits would make a substantial difference when employed in a full DL solution in a clinical setting.

Drilling down into the detailed results, it is clear from the experiments that MC dropout is the least suitable method, as the variability in its output has minimal value for boosting the NN’s performance directly or via misprediction detection. This is also apparent from inspecting the relation between softmax confidence and MC dropout uncertainty in Fig. 3b, which shows little correlation. In contrast, the TTA and deep ensemble methods outperformed the baseline on both evaluation tasks for most metrics. While deep ensembles exhibited the best performance, the difference to TTA was often negligible. Thus, if the flexibility offered by a model-agnostic method is important in the scenario considered, TTA could be preferred.

Interestingly, the gains from using ensembles or TTA were larger for the classification thresholds corresponding to high accuracy, at least for the best-performing metrics. Furthermore, our results demonstrate that misprediction detection is easier when classification is poor. This underlines that misprediction detection should not be considered in isolation; instead, the interplay with classification accuracy should always be taken into account.

The choice of uncertainty metric is not trivial. In our experiments, entropy and sample mean uncertainty achieved the best results overall, but the differences between all metrics are small. It is somewhat surprising that a mean aggregation performs on par with metrics that take variance into account.

In the out-of-domain experiments we saw some reduction in the performance gains from all combinations of uncertainty estimation methods and metrics. While this is consistent with previous work30, it is discouraging, as the foremost objective of these approaches is to mitigate the generalisation problem. It seems that the variation of model output is not that different between in-domain and out-of-domain pathology data. In fact, only the sample mean uncertainty sustains a better performance than the simple softmax-based baseline in the out-of-domain case, and the baseline showed the least drop in performance due to domain shift. This is somewhat surprising, as we would have expected the softmax baseline to be more sensitive to domain shift. The reason is likely both that we deal with a smaller, clinically realistic, domain shift and that softmax can behave better than expected in out-of-domain situations27. The upside of this result is that even a simple uncertainty measure can exhibit a reasonable performance on misprediction detection.

In the study of detecting mispredictions within a data subtype that is underrepresented in the training set (lobular carcinoma), we observed that the uncertainty methods and the baseline are much less effective than for the abundant data subtype (ductal carcinoma). The performance gains from using ensemble and TTA uncertainty estimation had a larger margin for the classification thresholds corresponding to the highest accuracy, but less so than on the in-domain data.

One limitation of this work is that we worked with patches extracted from WSIs. This was essential for investigating the basic properties of uncertainty in digital pathology, but a study on how the results translate to WSI-level decisions is necessary. Furthermore, we focused on breast cancer metastases detection in the lymph nodes. More studies should be carried out to confirm that the results hold in other digital pathology applications. Of particular interest is to study prediction tasks with lower accuracy, where our results indicate that the added value of uncertainty may be greater than in this work. Regarding TTA, there may be other types of augmentations that are better suited to the specific objective of estimating predictive uncertainty. There are also other method parameter options that could be relevant to evaluate. The dropout probability chosen for MC dropout may, for instance, not be optimal for our ResNet18 architecture, but we argue (also in light of previous work) that it is unlikely that MC dropout would then surpass the other methods.

A potential direction for future work is more extensive tuning of the uncertainty estimation methods, for example exploring the effects of bagging, boosting or stacking techniques44 on the diversity of the models in an ensemble, which could lead to better uncertainty estimates from the deep ensembles method. Alternatively, the focus could be placed on determining whether a combination of several uncertainty estimation methods would result in improved performance.

Conclusion

We conclude that the evaluated uncertainty methods and metrics perform well on in-domain data but are negatively affected by domain shift, both due to a new medical centre and due to data subtypes that are underrepresented in the training set. The softmax score of the target NN can be transformed to provide an uncertainty measure which is less affected by domain shift than the more established methods. Considering the computational costs and NN design constraints associated with those methods, the softmax score transformation is an appealing alternative for uncertainty estimation.