Uncertainty-informed deep learning models enable high-confidence predictions for digital histopathology

A model’s ability to express its own predictive uncertainty is an essential attribute for maintaining clinical user confidence as computational biomarkers are deployed into real-world medical settings. In the domain of cancer digital histopathology, we describe a clinically-oriented approach to uncertainty quantification for whole-slide images, estimating uncertainty using dropout and calculating thresholds on training data to establish cutoffs for low- and high-confidence predictions. We train models to identify lung adenocarcinoma vs. squamous cell carcinoma and show that high-confidence predictions outperform predictions without uncertainty, in both cross-validation and testing on two large external datasets spanning multiple institutions. Our testing strategy closely approximates real-world application, with predictions generated on unsupervised, unannotated slides using predetermined thresholds. Furthermore, we show that uncertainty thresholding remains reliable in the setting of domain shift, with accurate high-confidence predictions of adenocarcinoma vs. squamous cell carcinoma for out-of-distribution, non-lung cancer cohorts.

In this manuscript, the authors developed an uncertainty quantification (UQ) approach for whole-slide images and applied it to discriminate lung adenocarcinoma from squamous cell carcinoma, obtaining the following promising results: 1) uncertainty thresholding improved accuracy for high-confidence predictions; 2) uncertainty thresholding could generalize to out-of-distribution data; 3) areas of high uncertainty correlated with histologic ambiguity; and 4) UQ thresholding identified decision-boundary uncertainty. The paper is well-structured and clearly written. The methods are technically sound, and the results are promising. The following are a few comments that I hope can help further improve the manuscript.
• "A total of 276 standard (non-UQ) and 504 UQ-enabled DCNN models based on the Xception architecture were trained to discriminate between lung squamous cell carcinoma and lung adenocarcinoma using varying amounts of data from TCGA" It is unclear why different numbers of non-UQ (276) and UQ-enabled (504) DCNN models were trained here. It would be helpful if the authors could provide some relevant discussion on how to choose these two specific numbers (i.e., 276 vs 504).
• In Figure 2b, it would also be helpful to show the percentage of UQ high-confidence predictions among all predictions.
• " Figure 2(c) Across all cross-validation experiments with UQ, a median of 84.6% (43.8% -100%) of validation data is classified as high-confidence. The shaded interval represents the 95% confidence interval at each dataset size." The authors may also want to report the mean percentage in addition to the median percentage.
• " Figure 3(a) Models trained on TCGA at varying dataset sizes were validated on lung adenocarcinomas and squamous cell carcinomas from CPTAC. Patient-level metrics are shown with the dotted lines, and slide-level metrics are shown with Xs. AUROC, accuracy, and Youden's J are all improved in the highconfidence UQ cohorts. The proportion of patients and slides reported as high-confidence is shown in the last panel." The slide-level metrics are shown with Xs, which is a bit hard to read.
• It would be helpful to include more discussion of low-confidence predictions, for which no decision can be made. Readers might be interested in learning how many of these low-confidence cases there are; if there are too many, the proposed method may not be that effective in practical applications.

Reviewer #1
In this work, the authors developed a method to determine the cutoffs for low- and high-confidence predictions for uncertainty quantification in digital histopathology. Using the TCGA data, they compared many different UQ-enabled models with non-UQ models for classifying lung squamous cell carcinoma and lung adenocarcinoma. They concluded that the high-confidence predictions from UQ models outperform non-UQ models in terms of classification accuracy. I think the authors addressed an important question, but the proposed method was not well described. Also, there was a lack of systematic comparison with existing UQ methods. I have the following specific comments.
1. On Page 3, lines 92–93, the authors stated "we describe a clinically-oriented method for determining slide-level confidence using the Bayesian approach to estimating uncertainty, with uncertainty thresholds determined from training data". I don't understand why the proposed method is a "Bayesian approach". The Section "Estimation of Uncertainty" from lines 425–467 is quite confusing to me. A lot of the notation is not well defined. For a Bayesian approach, what is the likelihood and what is the prior? How is the posterior computed? All those details are missing.
Thank you for raising this question. The use of the term "Bayesian approach" comes from prior work (particularly Gal et al.), which established that the distribution of predictions obtained from a neural network via dropout approximates sampling of the Bayesian posterior of a deep Gaussian process, and that the standard deviation of such a distribution expresses the model's predictive uncertainty. This work forms the foundation of our uncertainty estimation paradigm, which provides a novel method of aggregating individual image tile uncertainty into whole-slide uncertainty and confidence thresholding. We will highlight to readers the paper by Gal et al., Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, for proofs and mathematical justification of this approach.
We appreciate that there are many alternative methods for estimating uncertainty for individual image tiles, including test-time augmentation, deep ensembles, and hyper-deep ensembles, methods which have previously been compared by other groups (Thagaard, Poceviciute). We chose one well-studied method for estimating tile-level uncertainty, MC Dropout, in order to describe a novel method of aggregating tile-level uncertainty into patient-level uncertainty and confidence.
We have significantly rewritten the section "Estimation of Uncertainty" in the Methods to better explain the foundational methodology used for our uncertainty estimation and to improve the clarity of the notation. We have also included a new paragraph in this section.

Changes:
Updates to Figure 1, with new panels d and e.

Figure 1. Estimation of uncertainty and confidence thresholding. (a) With standard deep learning neural network designs, a single image yields a single output prediction. When dropout is enabled during inference, predictions for a single image will vary based on which nodes are randomly dropped out. To estimate tile-level uncertainty, images first undergo 30 forward passes through the network, resulting in a distribution of predictions. The mean of this distribution, μ, represents the final tile-level prediction, and the standard deviation, σ, represents the tile-level uncertainty. (b) When UQ methods are used, incorrect predictions are associated with higher uncertainties than correct predictions 32–38. From a given distribution of tile- or slide-level uncertainties, we determine the uncertainty threshold which optimally separates correct and incorrect predictions by maximizing Youden's index (J). Predictions with uncertainty below this threshold are high-confidence, and all others are low-confidence. (c) To prevent data leakage and overfitting, optimal tile- and slide-level uncertainty thresholds are determined through nested cross-validation within training folds. (d) Schematic for calculating tile-level uncertainty and confidence. The optimal tile-level uncertainty threshold, θ_tile, is calculated from a given validation dataset. Tiles from the dataset are separated into high- and low-confidence by whether the tile-level uncertainty falls below or above θ_tile, respectively. (e) Schematic for slide-level uncertainty and confidence. Slide-level uncertainty is defined as the average uncertainty among high-confidence tiles for a given slide. The optimal slide-level uncertainty threshold, θ_slide, is found and used to classify slides as high- and low-confidence.
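For illustration, the workflow of panels (a), (b), and (e) can be sketched as follows. This is a minimal sketch assuming a dropout-enabled tf.keras model; names such as N_PASSES, tile_uncertainty, and youden_threshold are illustrative rather than the actual identifiers from our codebase.

```python
# Minimal sketch of Figure 1a, b, and e: Monte Carlo dropout inference,
# Youden-based uncertainty thresholding, and slide-level aggregation.
import numpy as np
import tensorflow as tf

N_PASSES = 30  # forward passes per image tile (Figure 1a)

def tile_uncertainty(model, tiles):
    """Return per-tile prediction mean (mu) and standard deviation (sigma)."""
    # training=True keeps dropout active at inference (Monte Carlo dropout).
    preds = np.stack([model(tiles, training=True).numpy() for _ in range(N_PASSES)])
    # For binary outputs, mu/sigma of the positive-class probability are used.
    return preds.mean(axis=0), preds.std(axis=0)

def youden_threshold(sigma, correct):
    """Find the uncertainty cutoff that best separates correct from incorrect
    predictions by maximizing Youden's J = sensitivity + specificity - 1."""
    best_j, best_t = -np.inf, None
    for t in np.unique(sigma):
        high_conf = sigma < t
        sens = (high_conf & correct).mean() / max(correct.mean(), 1e-9)
        spec = (~high_conf & ~correct).mean() / max((~correct).mean(), 1e-9)
        if sens + spec - 1 > best_j:
            best_j, best_t = sens + spec - 1, t
    return best_t

def slide_uncertainty(sigma, theta_tile):
    """Slide-level uncertainty: mean sigma among high-confidence tiles (Figure 1e)."""
    high_conf = sigma[sigma < theta_tile]
    return high_conf.mean() if high_conf.size else np.inf
```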
Changes in paragraph text and equations in the Methods section "Estimation of Uncertainty", including the addition of a new paragraph:

"Uncertainty is estimated with the Bayesian Neural Network (BNN) approach. This is an ensemble method in which uncertainty is quantified as the "disagreement" of the predictions made by different models sampled from an ensemble of neural networks. All of the networks in the ensemble explain the same training data but can disagree on some images. The disagreement is computed simply as the standard deviation of the predictions made by the sampled neural networks. BNN is a specific version of the ensemble method which differs from alternatives such as Deep Ensembles in how members of the ensemble are sampled: sampling is performed from a posterior distribution of models conditioned on the training data. Specifically, Gal and Ghahramani show that sampling from predictions generated via neural networks with Monte Carlo dropout is equivalent to sampling from a variational family (Gaussian mixture) approximating the true deep Gaussian process posterior 27. Thus, the distribution of predictions resulting from multiple forward passes in a dropout-enabled network approximates sampling of the Bayesian posterior of a deep Gaussian process, and the standard deviation of such a distribution is an estimate of predictive uncertainty 28,32–36,50,51."

We appreciate the reviewer's concern. We did not mean to imply that the proposed method is specifically high performing in the imaging domain. Rather, we hoped to emphasize that methodologies such as these are not routinely implemented in imaging-based deep learning research, despite the importance of uncertainty estimation for clinical implementation, where model predictions may be used to make high-risk decisions, for example when selecting treatment for a patient. Deep learning model interpretability and uncertainty are particularly challenging for imaging data, where there is no tangible relationship between input data points (pixels) and model output (predictions). MC dropout provides a robust and convenient method of estimating deep learning model uncertainty that works well for imaging data and has also been used for many other types of high-dimensional input modalities.

3. There are a lot of tuning parameters in the UQ-enabled and non-UQ deep learning models, e.g., the dropout rate and the network architecture. It would be helpful if the authors provided some justification for the choices of those parameters.
Thank you for the opportunity to expand upon our hyperparameter selection process.
Our objective in this work was to develop and rigorously assess a novel uncertainty quantification approach for whole-slide imaging, rather than attempting to develop the best possible performing model for predicting lung adenocarcinoma vs. squamous cell carcinoma. To avoid the risk of overfitting on this dataset, we used the same hyperparameters as previously published (Dolezal, 2021), with the exception of hidden layers with dropout (used for uncertainty estimation) and early stopping (used to reduce potential overfitting on this very large training dataset).
There is no standard consensus on the optimal dropout rate for uncertainty estimation in medical imaging applications; prior work has used dropout rates of 0.2, 0.5, and 0.6 (Syrykh, Raczkowski, Thagaard, Poceviciute, Ponzio, Leibig, Song). Dropout is a regularization technique which can help with generalizability and overfitting, but as with other regularization methods, overly strong regularization can worsen accuracy (Kamalov, 2020). We chose a low dropout rate of 0.1 to decrease the likelihood of performance degradation when the method is applied to new datasets and problems. As with other hyperparameters, the optimal dropout rate will be dataset-dependent and should be experimentally tuned if maximum performance is desired; a sketch of such a dropout-enabled model head is shown below.
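For illustration, a dropout-enabled classifier head of this kind could be constructed along these lines. This is a minimal sketch; the hidden layer width and input shape are assumptions for illustration rather than our exact configuration.

```python
# Minimal sketch: Xception backbone with a dropout-enabled hidden layer at
# rate 0.1. Hidden layer width and input shape are illustrative assumptions.
import tensorflow as tf

base = tf.keras.applications.Xception(
    include_top=False, pooling='avg', input_shape=(299, 299, 3))
x = tf.keras.layers.Dense(512, activation='relu')(base.output)  # hidden layer
x = tf.keras.layers.Dropout(0.1)(x)  # low rate to limit regularization strength
outputs = tf.keras.layers.Dense(2, activation='softmax')(x)  # LUAD vs. LUSC
model = tf.keras.Model(base.input, outputs)

# Calling model(images, training=True) at inference keeps this dropout layer
# active, enabling the Monte Carlo sampling described earlier.
```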
We recognize that other readers may also have questions about our hyperparameter selection, and have thus included additional information in the Methods regarding how these hyperparameters were chosen.

References:
• F. Kamalov and H. H. Leung, "Deep learning regularization in imbalanced data," 2020 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI), 2020, pp. 1-5, doi: 10.1109/CCCI49893.2020.9256674.

Changes (in Methods, under "Model architecture and hyperparameters"):
"… Hyperparameters were chosen based on prior work 8 without further tuning in order to reduce the risk of overfitting on this dataset, with the exception of added dropout-enabled hidden layers and the use early stopping (enabled due to the large dataset size)." 4. The imaging processing procedure may have large impact on the classification accuracy and uncertainty quantifications. In particular, the steps for removing background image tiles including grayspace filtering, Otsu's thresholding and gaussian blur detection. The authors should provide more details on those procedures and show the sensitivity of the UQ to the mild changes of tuning parameters in those procedures.
We appreciate the opportunity to expand upon the impact of slide-level image processing on classification accuracy and uncertainty estimation. We have added a reference for Otsu's thresholding algorithm and additional details regarding Gaussian blur filtering in the Methods section.
Our background and artifact filtering process includes two slide-level steps and one tile-level step. Gaussian blur filtering and Otsu's thresholding are both slide-level steps which identify areas of background or artifact on whole-slide images; tiles are not extracted from these areas of the slide. The final tile-level step is grayspace filtering: image tiles are converted to the HSV colorspace, pixels are thresholded into areas of color (foreground) and grayspace (background) based on saturation, and tiles are discarded if the total grayspace fraction for the image exceeds a set threshold. A sketch of this tile-level step is shown below.
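For concreteness, the grayspace filter just described could be implemented along these lines. This is a minimal sketch assuming OpenCV; the saturation cutoff and function names are illustrative.

```python
# Minimal sketch of tile-level grayspace filtering, assuming OpenCV.
# The saturation cutoff (sat_thresh) is an illustrative value.
import cv2
import numpy as np

def grayspace_fraction(tile_rgb: np.ndarray, sat_thresh: float = 0.05) -> float:
    """Fraction of pixels whose HSV saturation falls below sat_thresh."""
    hsv = cv2.cvtColor(tile_rgb, cv2.COLOR_RGB2HSV)
    saturation = hsv[..., 1] / 255.0  # OpenCV stores saturation as 0-255 for uint8
    return float((saturation < sat_thresh).mean())

def keep_tile(tile_rgb: np.ndarray, gs_threshold: float = 0.7) -> bool:
    """Keep a tile only if its grayspace fraction is at or below the threshold."""
    return grayspace_fraction(tile_rgb) <= gs_threshold
```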
To test the impact of the slide-level image processing steps on classification accuracy and uncertainty quantification, we extracted tiles from the training dataset with blur filtering and Otsu's thresholding, blur filtering alone, Otsu's thresholding alone, and no slide-level background filtering. We then trained models in cross-validation at the maximum dataset size four times per method (12 total models per method), and performed nested uncertainty estimation for each model to determine UQ thresholds (Supplementary Figure 6). High-confidence predictions outperformed predictions without UQ estimation for all slide processing methods (P < 0.001).
To investigate the potential impact of the grayspace filtering threshold on uncertainty quantification, we extracted all image tiles, without background filtering, for 50 adenocarcinomas and 50 squamous cell carcinomas in the CPTAC database. For each image tile, we calculated the grayspace fraction, the UQ-enabled model prediction, and the estimated uncertainty. We plotted density estimates for image tiles at each grayspace value, separated by whether the prediction was correct or incorrect (Supplementary Figure 7A). These results show a bimodal distribution for grayspace fraction, with a large peak around 0 (representing image tiles with little background) and another peak around 1 (indicating tiles that are mostly background). Image tiles with low grayspace fraction are much more likely to be correctly predicted than incorrectly predicted, whereas image tiles with grayspace fraction above 0.8 are just as likely to be correct as incorrect. We then plotted density estimates for grayspace fraction vs. uncertainty for each image tile, separated by whether the model prediction was correct or incorrect (Supplementary Figure 7B, C). These results show that when grayspace fraction is low (less than 0.2), most correct predictions are below the uncertainty threshold, while most incorrect predictions are above the uncertainty threshold (and would thus be filtered out). When grayspace fraction exceeds around 0.8, there is an increase in the number of incorrectly predicted image tiles that fall below the uncertainty threshold and would fail to be removed by UQ thresholding. These results support a grayspace fraction threshold of around 0.7–0.8 to maximize the utility of uncertainty estimation to enrich for correct predictions.
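This density-estimation analysis can be sketched as follows. The three placeholder arrays stand in for the per-tile grayspace fractions, uncertainties, and correctness labels computed above; seaborn's kernel density estimation is one of several reasonable plotting choices.

```python
# Sketch of the Supplementary Figure 7B/C-style analysis. The arrays below
# are random placeholders for the per-tile values computed in practice.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
gs_frac = rng.beta(0.4, 0.4, 5000)   # bimodal placeholder (peaks near 0 and 1)
sigma = rng.gamma(2.0, 0.02, 5000)   # placeholder tile-level uncertainties
correct = rng.random(5000) < 0.8     # placeholder correctness labels

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, mask, title in [(axes[0], correct, 'Correct predictions'),
                        (axes[1], ~correct, 'Incorrect predictions')]:
    sns.kdeplot(x=gs_frac[mask], y=sigma[mask], fill=True, ax=ax)
    ax.set(title=title, xlabel='Grayspace fraction', ylabel='Uncertainty')
plt.tight_layout()
plt.show()
```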
We have included these results in the Supplementary Information, as shown below.

Changes to Methods (under "Image processing"):
"Background image tiles were removed via grayspace filtering, Otsu's thresholding 48 , and gaussian blur filtering. Gaussian blur filtering was performed with a sigma of 3 and threshold of 0.02. Experiments were performed on datasets with and without Otsu's thresholding and/or blur filtering and with varying grayspace fraction thresholds to confirm generalizability of the UQ methods regardless of background filtering method (Supplementary Figs. 6 and 7)."