Introduction

Recent advances in deep learning (DL) have led to the rapid development of diagnostic and treatment support applications in various aspects of healthcare, including oncology1,2,3,4. The proposed applications of DL utilise a range of data modalities, including MRI scans5, CT scans6, histopathology slides7, genomics8, transcriptomics9,10, and, most recently, integrated approaches with various data types11,12. In general, studies using DL show excellent predictive performance, providing hope for successful translation into clinical practice13,14. However, prediction accuracy in DL comes with potential pitfalls that need to be overcome before wider adoption can occur15.

The lack of transparency over prediction reliability is one challenge for implementing DL16. One approach to overcome this is to provide uncertainty estimates about a model’s predictions17,18, enabling better-informed decision making. Another obstacle relates to the assumptions made about data when transitioning from training to real-world applications. In standard DL practice, during the ‘development’ stage, models are trained and validated on data prepared to satisfy the assumption of independent and identically distributed (IID) data, meaning the model is expected to make predictions on data that are independently drawn from the same distribution as the training data. However, this assumption cannot be guaranteed and is, in fact, frequently violated when models are deployed in ‘production’ (i.e. real-world application). This is because confounding variables, which we cannot control for, cause distributional shifts that push data out-of-distribution (OOD)19. For oncology applications, confounding variables can include technical differences in how the data are collected (e.g., batch effects, differences in sequencing depth or library choice for genomic and transcriptomic data; differences in instrumentation and imaging settings for medical imaging data), as well as biological differences (e.g., differences in patient demographics or a data class unseen during model development). The consequences of OOD data include inaccurate predictions coupled with underestimated uncertainties, which together produce model overconfidence under distributional shift, or what we call ‘shift-induced’ overconfidence20,21,22. Consequently, implementation of DL in clinical practice (i.e., production) requires that models are robust (i.e., generalise) to distributional shifts and provide correct predictions with calibrated uncertainties.

Methods to address DL overconfidence in production exist, albeit with different limitations. Repeated retraining of deployed models on new production data is beneficial for accuracy, but introduces new risks such as over-computation or catastrophic forgetting, whereby DL models lose performance on the original training/development data23,24. Tracking metrics such as accuracy can help inform ML engineers about DL reliability, although such metrics are only available retrospectively. A key pitfall of these methods is that they are reactive rather than proactive.

One proactive approach for managing risks in production is ‘uncertainty thresholding’, whereby only predictions with uncertainties below a threshold are accepted (to increase accuracy). Unfortunately, a DL model’s uncertainty threshold is established with development (IID) data. Thus, when the model is deployed on production (OOD) data, it runs a high risk of becoming overconfident. Therefore, the uncertainty threshold established in development corresponds to a higher error rate in production, which is a problem if expectations (between healthcare professionals and engineers) are set during the development phase of a project. To address this problem, post-hoc methods exist that calibrate uncertainty (e.g., with ‘Temperature scaling’25). However, while post-hoc calibration effectively controls overconfidence in IID data25, it fails to do so proactively in OOD data21,22. Despite notable theoretical and empirical research towards generalising DL uncertainties to OOD data26,27, shift-induced overconfidence is yet to be sufficiently addressed in practice.
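To make this regime concrete, the sketch below (illustrative Python with hypothetical variable names, not code from the study) accepts only predictions whose uncertainty falls below a threshold chosen on development data; under distributional shift, the accuracy of the accepted subset can fall well below what the development data suggested.

```python
# Minimal sketch of uncertainty thresholding (hypothetical names, not from the study).
import numpy as np

def thresholded_accuracy(y_true, y_pred, uncertainty, threshold):
    """Accuracy and retained fraction for predictions with uncertainty below `threshold`."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    keep = np.asarray(uncertainty) < threshold          # accept only confident predictions
    if not keep.any():
        return float("nan"), 0.0
    accuracy = float(np.mean(y_true[keep] == y_pred[keep]))
    return accuracy, float(keep.mean())

# The threshold is calibrated on development (IID) validation data and then reused,
# unchanged, on production (OOD) data, where overconfidence inflates the error rate.
```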

In this study, we aim to address the generally under-appreciated shift-induced DL overconfidence in the context of oncology, a field particularly vulnerable to this pitfall due to frequent data distribution shifts. We conduct our experiments with a case study that predicts cancer of origin from transcriptomic data.

Cancer of origin prediction has been an active application area for DL24,28,29,30, since accurate diagnosis is critical for the treatment of cancers of unknown primary (CUP), i.e. metastatic cancers in which the primary cancer site cannot be reliably determined. We investigate multiple cancer datasets, including one newly introduced dataset, with simple, effective, and scalable approximate Bayesian DL techniques that improve generalisation. We examine whether these techniques improve model robustness to shift-induced overconfidence and, therefore, DL reliability. We introduce the prototypical ADP metric to measure model robustness to shift-induced overconfidence and to directly explain the “expected loss of accuracy during deployment in an uncertainty-thresholding regime”. Finally, we provide a brief discussion of how ADP supports model selection and how this can be helpful in a clinical setting.

Results

Bayesian model benchmarking approach to predict cancer of unknown primary

The primary DL task was to predict the tissue of origin (primary cancer type) of cancer samples using transcriptomic data. We used transcriptomic data from TCGA of primary cancer samples corresponding to 32 primary cancer types as model ‘development’ data: training (n = 8202)31 and validation IID data (n = 1434; Supplementary Table S1). The test data were OOD (representing ‘production’), providing a platform for benchmarking resilience to overconfidence, and included TCGA metastatic samples (n = 392)32, Met500 metastatic samples (n = 479)33, and a combination of primary and metastatic samples from our own independent internal custom dataset, i.e. ICD (n = 461)34,35,36,37,38,39,40,41,42 (Fig. 1a, Supplementary Fig. S1). The distributional shifts in the test data were likely to be caused by several factors, including dataset batches, sample metastasis status (metastatic or primary) and whether the cancer type was absent during training (‘unseen’).

Figure 1

Overview of the study design. (a) Simplified study workflow. TCGA primary cancer types comprised the training and IID validation data. OOD test data comprised the TCGA (metastatic cancer types), Met500 and ICD datasets, which included primary, metastatic and ‘unseen’ cancer types. (b) Schematic overview of the four tested models: pointwise Resnet (Resnet), Resnet extended with Monte Carlo Dropout (MCD), MCD extended with a bi-Lipschitz constraint (Bilipschitz), and an ensemble of Bilipschitz models (Ensemble). Note that Resnet represents a single point in function space (blue dot), while the two Bayesian models (MCD and Bilipschitz) represent a distribution within a single region in function space (green dots). The Ensemble represents a collection of distributions centred around different modes (red dots).

We aimed to evaluate if three simple ‘distribution-wise’ Bayesian DL models improve performance and reduce shift-induced overconfidence compared to a pointwise baseline model (with identical Resnet architecture). To achieve this, we performed controlled benchmarking of the models over IID and OOD data (Fig. 1b). The experiment was controlled by enforcing consistency for factors affecting uncertainty within the validation/IID dataset. Specifically, all models had identical architecture, hyperparameter, and optimisation settings. Importantly, all models had identical (negative log likelihood) loss within the validation/IID dataset. We intentionally did not perform hyperparameter optimisation for each model, as it was important for our study design to control for accuracy.

The Bayesian models were Monte Carlo Dropout approximation (‘MCD’)43, MCD with smoothness and sensitivity constraints (‘Bilipschitz’)44,45, and an ensemble of Bilipschitz models (‘Ensemble’)45. The ways in which models differed were canonical: MCD modified Resnet by keeping Dropout during prediction, Bilipschitz modified MCD with spectral normalisation, Ensemble modified Bilipschitz by combining multiple models.

Approximate Bayesian inference reduces shift-induced overconfidence for ‘seen’ classes in a primary cancer site context

The predictive performance of each model in predicting the primary tissue was assessed using micro-F1 (equivalent to Accuracy; abbreviated F1). For the IID validation data, the difference between the highest and lowest ranking models was 0.28% (97.07% for Resnet and 96.79% for Ensemble, respectively; Fig. 2a, Supplementary Figs. S2–S5). This was anticipated, since the loss was controlled for within the validation data. As expected, F1 scores dropped for the OOD test set across all four models, with a 1.74% difference between the highest and lowest ranking models (82.04% for Ensemble and 80.30% for Resnet, respectively; Fig. 2a, Supplementary Figs. S6–S9). All models had higher predictive uncertainties (Shannon’s entropy, \(\mathscr{H}\)) for OOD data relative to IID data (Fig. 2b). Uncertainties were significantly higher for all approximate Bayesian models (MCD, Bilipschitz, and Ensemble) relative to the (pointwise) Resnet (p < 0.0001). Moreover, overconfidence in OOD data was evident for the Resnet and MCD models, since their binned accuracies (i.e., the correct classification rates within bins delineated by the confidence scores) were consistently lower than the corresponding confidence scores (Fig. 2c). The expected calibration errors (ECEs) for OOD data ranged between 5% for Ensemble and Bilipschitz and 16% for Resnet (Fig. 2c). Overconfidence, estimated as an absolute calibration error, was negligible across all models for IID data but substantial for OOD data, highlighting the shift-induced overconfidence when transitioning from IID to OOD data (Fig. 2d). Furthermore, Resnet had significantly higher overconfidence than MCD (p value < 0.01), Bilipschitz (p value < 0.001), and Ensemble (p value < 0.001) for OOD data but not IID data. This shows that the shift-induced overconfidence in pointwise DL models can be reduced with simple (approximate) Bayesian inference.

Figure 2

Out-of-distribution overconfidence of a pointwise baseline Resnet model and three simple Bayesian models on ‘seen’ data. (a) Micro-F1 score (i.e., Accuracy) of all models on the IID validation data (left) and on ‘seen’ OOD data (right). Accuracy for (IID) validation data was controlled with early stopping. (b) Box plot of each model’s predictive uncertainty (Shannon’s entropy, \(\mathscr{H}\)) for individual samples on IID data (left) and on ‘seen’ OOD data (right). The sample median is depicted by the horizontal line, and the sample mean by the grey star. Statistical significance (single-sided Wilcoxon rank-sum) between the baseline and each Bayesian model is marked with *, **, and *** for p value < 0.05, p value < 0.01, and p value < 0.001, respectively. (c) Each model’s confidence vs accuracy for each ECE bin on ‘seen’ OOD data. The black diagonal lines illustrate perfect calibration, i.e., no overconfidence. The ECE value for each model is shown in parentheses. The residuals are colour-coded by the (left) colour scale and represent the difference between confidence and accuracy for each bin. (d) Box plot of each model’s absolute calibration error for individual samples on IID data (left) and ‘seen’ OOD data (right). Statistical significance (single-sided Wilcoxon rank-sum) between the baseline and each Bayesian model is marked with *, **, and *** for p value < 0.05, p value < 0.01, and p value < 0.001, respectively.

Prediction overconfidence for ‘unseen’ classes explained by related primary cancer types

Classes absent from training (‘unseen’) cannot have correct predictions, and their prediction uncertainties should be higher than for ‘seen’ classes. As expected, mean total uncertainties were higher for ‘unseen’ classes for all models (Fig. 3a). Moreover, approximate Bayesian models were significantly more uncertain on ‘unseen’ classes compared to Resnet (p value < 0.01; Fig. 3a). However, exceptions with low total uncertainty values occurred across all models, at both the class level, where predictions for a whole ‘unseen’ class consistently had low uncertainty, and the sample level, where predictions for only some samples from a class had low uncertainty (Fig. 3b). We wanted to investigate whether any of the exceptions could be examples of ‘silent catastrophic failure’ (Supplementary Information—S4.2), a phenomenon where data are far from the training data’s support, resulting in incorrect yet extremely confident predictions44,45,46.

Figure 3

Total uncertainties for out-of-distribution data with cancer types ‘seen’ and ‘unseen’ in training. (a) Box plot of each model’s predictive uncertainty (Shannon’s entropy, \(\mathscr H\)) on OOD data with cancer types ‘seen’ (left) and ‘unseen’ (right) during training. Statistical significance (two-sided Wilcoxon rank-sum) between the baseline and each Bayesian model is marked with *, **, and *** for p value < 0.05, p value < 0.01, and p value < 0.001, respectively. Stars denote the mean, horizontal centre lines the median, and notches the 95% confidence interval of the median total uncertainty. (b) Total uncertainty values for the ‘unseen’ classes. The horizontal red lines denote the median total uncertainty values.

‘Unseen’ classes (i.e., cancer types) with low levels of uncertainty (averaged within the class) corresponded to ‘seen’ classes that were either biologically related to the predicted primary cancer type or derived from a similar tissue or cell of origin. For example, all acral melanoma (ACRM) samples (n = 40), a subtype of melanoma that occurs on soles, palms and nail beds, were predicted to be cutaneous melanoma (MEL) by all four models (Supplementary Figs. S6–S9), with the smallest median total uncertainty for all four models (Fig. 3b). All three fibrolamellar carcinoma (FLC) samples, a rare type of liver cancer, were predicted to be hepatocellular carcinomas (HCC), although the median uncertainty was much higher for the Bilipschitz and Ensemble models compared to Resnet and MCD (Shannon’s entropy of 1.8, 1.5, 0.1 and 0.29, respectively). Two bladder squamous cell carcinoma (BLSC) samples showed further examples of such exceptions, being predicted as either a bladder adenocarcinoma (BLCA), which shares the primary tissue site with BLSC, or a lung squamous carcinoma (LUSC), which has a similar cell of origin. For the ‘unseen’ class pancreatic neuroendocrine tumours (PANET), we saw a wide spread of uncertainty values (Fig. 3b). Interestingly, only the PANET samples that were predicted as another subtype of pancreatic cancer, pancreatic adenocarcinomas (PAAD), had low prediction uncertainty across all models compared to other incorrectly predicted PANET samples (Supplementary Fig. S10). Overall, since most of the incorrect predictions with low uncertainties had a reasonable biological explanation, we concluded that we did not find strong evidence of silent catastrophic failure in this case study.

Robustness to shift-induced overconfidence is integral for production inference

To evaluate the robustness of the models’ accuracy, as well as the uncertainty’s correlation with the error rate (abbreviated “uncertainty’s error-rate correlation”), we used the F1-Retention Area Under the Curve (F1-AUC)47. Evaluation was carried out on ‘seen’ and ‘unseen’ OOD data (i.e., ‘production data’). All models yielded similar results, with only a 0.45% decrease between the highest and lowest ranking models (F1-AUC of 93.67% for Bilipschitz and 93.25% for MCD, respectively; Fig. 4a). The performance difference between the models was marginal because F1-AUC does not capture the loss of calibration caused by the distributional shift when transitioning from IID to (‘seen’ and ‘unseen’) OOD data. In other words, the F1-AUC metric did not detect effects caused by the shift-induced overconfidence. This was evident from the following observations: (1) inter-model accuracies were similar within IID, as well as OOD, data (Fig. 2a); (2) calibration errors (i.e. overconfidence) did not differ for IID data (p value > 0.05) but did differ for OOD data (p value < 0.01; Fig. 2d); and (3) F1-AUC scores were similar for all models, which implies the ‘uncertainty’s error-rate correlation’ must have been similar (since F1-AUC encapsulates accuracy and ‘uncertainty’s error-rate correlation’47). Thus, while we showed that F1-AUC encapsulates accuracy and the ‘uncertainty’s error-rate correlation’, both of which are important components of robustness when deploying DL in production, we caution that F1-AUC does not encapsulate robustness to shift-induced overconfidence. Hence, it is not sufficient for safe deployment in clinical practice.

Figure 4

Evaluation of model generalisability from development to production. (a) F1-Retention curves and corresponding F1-AUC scores for the (baseline) Resnet model and three approximate Bayesian models (MCD, Bilipschitz, Ensemble). As the retention fraction decreases, more of the most uncertain predictions are replaced with the ground truth; thus, steeper curves require a stronger correlation between uncertainty and the error rate. The F1-Retention Area Under the Curve (F1-AUC) for each model is detailed in the legend. The F1-AUC is a function of both predictive performance (micro-F1) and the uncertainty error-rate correlation. (b) Development and Production F1-Uncertainty curves for each model. The figure illustrates the development F1(IID)-Uncertainty curves (continuous lines), as well as the production F1(OOD)-Uncertainty curves (dashed lines). Black lines illustrate the F1 decrease from a single development F1 score, with F1dev = 98.5% for all models. The Area between the Development and Production Curve (ADP) is shown as the coloured region. (c) Area between the Development and Production Curves (ADP) bar plot with bootstrapped confidence intervals. ADP is the averaged F1 decrease calculated between F1dev = 97.5% and F1dev = 99.0% at intervals of 0.001%. Steps for calculating the ADP are detailed in the Methods.

To overcome the limitation of the F1-AUC metric’s insensitivity to shift-induced overconfidence, we developed a new (prototypical) metric called the Area between the Development and Production curve (ADP), which depends on both IID (i.e., ‘development’) data, as well as the (‘seen’ and ‘unseen’) OOD (i.e., ‘production’) data. The ADP may be interpreted as “the expected decrease in accuracy when transitioning from development to production if uncertainty thresholding is utilised to boost reliability”. The ADP differs from ECE and Accuracy in two primary ways. First, ECE and accuracy relate to a single data set, whereas the ADP relates to two data sets, hence ADP explains the expected change in, for example, accuracy from one data set relative to the other. Second, the ADP complements and subsumes F1-AUC in the context of deploying models from training/development data (IID) to production test data (OOD). The ADP was calculated by averaging the set of decreases in F1, from development (IID) to production (OOD) datasets, at multiple different uncertainty thresholds (a single F1-decrease is demonstrated in Fig. 4b; refer to the “Methods” section for details).

The ADP metric detected effects from shift-induced overconfidence, with an inter-model percent decrease that was two orders of magnitude larger than that of F1-AUC (Fig. 4c). The percent decrease between the top and bottom ranking models was 53.68%. The top-ranking model was Bilipschitz with an ADP of 4.28%, and the bottom-ranking model was Resnet with an ADP of 9.24% (Fig. 4c). This highlights that ADP may be relevant when evaluating the performance of models deployed in production, because it encapsulates shift-induced overconfidence, which is inevitable in an oncological setting.

To further illustrate the utility of ADP, we performed an additional experiment (Supplementary Fig. S11). We used an independent classification task, the well-known CIFAR-10 (IID) dataset and its OOD variant, CIFAR-10-C, and compared a non-Bayesian CNN Resnet model with a Deep Kernel Learning model (i.e., a neural Gaussian process). The results were in line with our hypothesis that Bayesian deep learning improves robustness to distribution shift, demonstrated by a lower ADP for the Gaussian process model compared to the Resnet model.

Discussion

A major barrier to using DL in clinical practice is the shift-induced overconfidence encountered when deploying a DL model from development to production. Reducing and accounting for shift-induced overconfidence with appropriate models and relevant metrics should make the models more transparent and trustworthy for translation into practice. Our work herein shows that marked progress can be made with simple Bayesian DL models deployed in conjunction with uncertainty thresholding. However, the performance of models deployed in production can be difficult to evaluate without a suitable metric, therefore we developed ADP to directly measure shift-induced overconfidence.

Three Bayesian models with canonical extensions, namely MCD, Bilipschitz and Ensemble, were chosen to test whether simple modifications applicable to any DL architecture can improve performance in production. The Bayesian models were selected according to criteria that we believe would facilitate adoption: (1) simplicity, for wider accessibility; (2) ubiquity, to ensure the models were accepted and tested methods; (3) demonstrated robustness to shift-induced overconfidence22,48,49; and (4) computational scalability. Our prior expectation was that each canonical extension would further improve generalisation of both accuracy and uncertainty quality, albeit at the cost of increased complexity. These expectations were mostly in line with our benchmarking results, since the most complex model (Ensemble) went from the worst-performing model in IID data to the best-performing model in OOD data in terms of accuracy. Furthermore, while inspection of overconfidence revealed no significant inter-model differences within IID data, the OOD overconfidence differences were significant, whereby added complexity corresponded to less shift-induced overconfidence. Using the ADP statistic, improvements in robustness to shift-induced overconfidence were shown to have a large impact on accuracy in production when rejecting unreliable predictions above an acceptable uncertainty threshold. Hence, any DL architecture’s accuracy in production can be substantially improved with simple and scalable approximate Bayesian modifications. This phenomenon is sometimes referred to as “turning the Bayesian crank”50.

We restricted our uncertainty statistics to predictive (i.e., total) uncertainties, since it was not possible to estimate the sub-divisions of uncertainty with the baseline Resnet model, which only captures uncertainty about the data. The Bayesian models captured an additional component of uncertainty, the ‘epistemic’ uncertainty; hence, they all had larger total uncertainty estimates than the non-Bayesian baseline. Consequently, the Bayesian models filled the uncertainty gap caused by distribution shift (i.e., shift-induced overconfidence). In future work, a richer picture may be obtained by focusing only on distribution-wise models to inspect the two sub-divisions of the predictive uncertainty: epistemic (model) uncertainty and aleatoric (inherent) uncertainty. Epistemic uncertainty depends on the model specification and may be reduced with more data or informative priors. Aleatoric uncertainty depends on the data’s inherent noise and can be reduced with more data features that explain variance caused by confounding variables (e.g., patient age, cancer stage, batch effect). Epistemic and aleatoric uncertainties present the potential for further insights, including whether a data point’s predictive uncertainty will reduce with either more examples or an altered model design (epistemic uncertainty), or more features (aleatoric uncertainty)51,52,53,54.

This study addressed distributional shift effects on uncertainties with parametric models, which assume parameters are sufficient to represent all training data. Non-parametric models relax that assumption, which is arguably crucial to detect when data are outside the domain of training data (‘out-of-domain’) and for avoiding extreme overconfidence, i.e., ‘silent catastrophic failure’. In future work, non-parametric models, for example Gaussian Processes, capable of measuring uncertainties about ‘out-of-domain’ data, should also be explored44,45,46,55.

Our work suggests that considerations of robustness to distributional shifts must encapsulate both uncertainty and prediction to improve performance in production. While this study focused on the quality of uncertainty, other DL components are worth considering too. These include the model architecture (i.e. inductive bias), which can be tailored to ignore redundant data-specific aspects of a problem via invariant or equivariant model representations56, data-augmentation strategies57, and/or structural causal models58,59,60. Such tailored models can further improve data efficiency56 and robustness to distributional shifts27, and are central to appropriate model specification, which remains a challenge for DL deployment61. The importance of tailored inductive biases is supported by prolific advances in fields beyond clinical diagnostics, such as computer vision (e.g. the translational equivariance of CNNs56) and biology (e.g. how Alpha Fold 262 solved the Critical Assessment of protein Structure Prediction (CASP)63). These studies show that a wide array of DL components can improve generalisation and, thus, DL performance in production. Our study argues that uncertainty calibration is an important element in that array; hence, improving the quality of uncertainty can lead to improved DL reliability in production.

In practice, we hope the community considers utilising uncertainty thresholding as a proactive method to improve the accuracy and safety of DL applications deployed in the clinic. This may involve (iterative) consultation between ML engineers and medical professionals to agree on a ‘minimally acceptable accuracy’ for production (deem this \(\mathit{min}\left(F{1}_{dev}\right)\)). The ML engineer may then use development data to train an approximate Bayesian DL model and produce Development F1-Uncertainty curves (with validation data). With another independent dataset, the engineer can then develop an ADP estimate (as described in the “Methods” section) to help communicate (in the context of available dataset differences) what the expected accuracy decrease may be when the model is deployed to production, which helps manage expectations and facilitate trust. Importantly, with the (prototypical) ADP, the team may better judge which uncertainty quantification techniques are most effective for boosting accuracy under the ‘uncertainty thresholding’ risk-management regime. This procedure, as well as the ADP statistic, is of course prototypical and only suggestive. We leave improvement and clarification of this for future work.

In conclusion, our study highlighted approaches for quantifying and improving robustness to shift-induced overconfidence with simple and accessible DL methods in the context of oncology. We justified our approach with mathematical and empirical evidence, biological interpretation, and a new metric, the ADP, designed to encapsulate shift-induced overconfidence, a crucial aspect that needs to be considered when deploying DL in real-world production. Moreover, the ADP is directly interpretable as a proxy for the expected accuracy loss when deploying DL models from development to production. Although we have addressed shift-induced overconfidence by utilising first-line solutions, work remains to bridge DL from theory to practice. We must account for data distributions, evaluation metrics, and modelling assumptions, as all are equally important and necessary considerations for the safe translation of DL into clinical practice.

Methods

Prediction task and datasets

The task was to predict a patient's primary cancer type, which we cast under the supervised learning framework by learning the map \(\left\{\mathbf{x}\to y\right\}\), with \(y\) denoting the primary cancer category, and \(\mathbf{x}\in {\mathbb{R}}^{D}\) denoting a patient’s sampled bulk gene expression signature.

Three independent datasets were used: our own independent Internal Custom Dataset, ICD34,35,36,37,38,39,40,41,42, TCGA31, and Met50033. All datasets were pre-processed and partitioned into groups (i.e., strata) that uniquely proxied different distribution shifts. Each approximately unique shift was assumed to be governed by its respective intervention (i.e. unique shift), as determined by the values of three presumed hidden variables influencing the modelled map \(\left\{\mathbf{x}\to y\right\}\). These variables were ‘Batch’ (indicating the source dataset label, e.g., ‘TCGA’), ‘State-of-Metastases’ (valued ‘Primary’ or ‘Metastatic’), and ‘Seen’ (indicating whether a target value y was seen during training) (Supplementary Table S1). The training and validation data comprised the stratum with Strata ID

$$\underbrace {{\left( {'Batch',\;'State\;of\;Metastases',\;'Seen'} \right)}}_{Strata \;ID\;key} = \underbrace {{\left( {'TCGA',\;'Primary',\; True} \right)}}_{key \;value},$$

since we believed it to be approximately independent and identically distributed (IID) data. All other strata were assumed out-of-distribution (OOD) due to distribution shifts caused by confounding variables. As a result, the training and validation data were IID, while the test data were OOD.
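As an illustration only (the column names and values below are assumptions, not the study’s pre-processing code), this stratification could be expressed as follows:

```python
# Minimal sketch of the stratification scheme (assumed column names and values).
import pandas as pd

meta = pd.DataFrame({
    "Batch": ["TCGA", "TCGA", "Met500", "ICD"],
    "State_of_Metastases": ["Primary", "Metastatic", "Metastatic", "Primary"],
    "Seen": [True, True, True, False],
})
# Strata ID = ('Batch', 'State of Metastases', 'Seen'), encoded here as a string key.
meta["Strata_ID"] = (
    meta["Batch"] + "|" + meta["State_of_Metastases"] + "|" + meta["Seen"].astype(str)
)

# The ('TCGA', 'Primary', True) stratum provides the development (IID) data;
# every other stratum is treated as out-of-distribution (production) test data.
is_dev = meta["Strata_ID"] == "TCGA|Primary|True"
development, production = meta[is_dev], meta[~is_dev]
```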

Benchmarked models

Four models were benchmarked in this study: the baseline pointwise Resnet, MCD, Bilipschitz, and Ensemble. All models shared identical model architecture and hyperparameter settings (including early stopping), respectively controlling for the inductive bias and for accuracy as confounders of overconfidence. Although we did not perform explicit hyperparameter optimisation, some manual intervention was used to adjust hyperparameters using the validation set. For example, the singular value bound hyperparameter (for spectral normalisation) was manually tuned to be as low as practically possible while remaining flexible enough to learn the training task of predicting the primary site.

Baseline Resnet model

The Resnet architecture had four hidden layers, each with 1024 neurons, Mish activations64, batch normalisation65, and standard residual connections from the first hidden layer up to the final hidden ‘logit-space’ layer, which was then normalised using the SoftMax function to yield a probability vector \(\mathbf{p}\left(\mathbf{x}\right)\in {[\mathrm{0,1}]}^{K}\), where the prediction’s class index,

$$c=\underset{k}{\mathrm{arg\,max}}\left\{{\left[{\mathrm{p}}_{1},{\mathrm{p}}_{2},\dots ,{\mathrm{p}}_{K}\right]}^{T}\right\}$$

indicates the primary cancer site’s label \(y \leftarrow c\). Specifically, a batch \(\mathbf{\rm X}\in {\mathbb{R}}^{B\times D}\) with \(B\) individual samples is first transformed by the input layer \({\mathbf{U}}^{\left(0\right)}=g(\langle \mathbf{\rm X}, {\mathbf{W}}^{\left(0\right)}\rangle +{\mathbf{b}}^{\left(0\right)})\), with affine transform parameters \(\left\{{\mathbf{W}}^{\left(0\right)}, {\mathbf{b}}^{\left(0\right)}\right\}\), non-linear activations \(g\), and output representation \({\mathbf{U}}^{\left(0\right)}\). Hidden layers have residual connections \({\mathbf{U}}^{\left(l\right)}=g\left(\langle {\mathbf{U}}^{\left(l-1\right)},{\mathbf{W}}^{\left(l\right)}\rangle +{\mathbf{b}}^{\left(l\right)}\right)+{\mathbf{U}}^{(l-1)}\), where \(l \in \{1,2,\dots ,L\}\) denotes the hidden layer index (\(L=3\) in this case). The final output layer is a pointwise (mean estimate) function in logit-space, \(\mathbf{f}\left(\mathbf{X}\right)= g\left(\langle {\mathbf{U}}^{\left(L\right)},{\mathbf{W}}^{\left(\mu \right)}\rangle +{\mathbf{b}}^{\left(\mu \right)}\right)\), where \(\left\{{\mathbf{W}}^{\left(\mu \right)}, {\mathbf{b}}^{\left(\mu \right)}\right\}\) are the final output (affine) transformation parameters. Finally, SoftMax normalisation yields a K-vector \(\mathbf{p}\left(\mathbf{X}\right)= {\text{SoftMax}}\left(\mathbf{f}\left(\mathbf{X}\right)\right)\). All other hyperparameter settings are defined in Supplementary Table S2. This baseline Resnet model architecture was inherited by all other models in this study to control inductive biases.
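A minimal PyTorch sketch of this architecture is shown below; the exact batch-normalisation placement, layer ordering and other implementation details are assumptions for illustration rather than the study’s code.

```python
# Minimal PyTorch sketch of the baseline Resnet (assumed details, not the exact implementation).
import torch
import torch.nn as nn

class BaselineResnet(nn.Module):
    def __init__(self, n_genes: int, n_classes: int, width: int = 1024, n_residual: int = 3):
        super().__init__()
        # Input layer: U(0) = g(<X, W(0)> + b(0))
        self.input_layer = nn.Sequential(nn.Linear(n_genes, width), nn.BatchNorm1d(width), nn.Mish())
        # Hidden layers with residual connections: U(l) = g(<U(l-1), W(l)> + b(l)) + U(l-1)
        self.hidden = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.BatchNorm1d(width), nn.Mish())
             for _ in range(n_residual)]
        )
        # Final pointwise (mean estimate) function in logit-space: f(X)
        self.logit_layer = nn.Linear(width, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.input_layer(x)
        for block in self.hidden:
            u = block(u) + u                    # residual connection
        logits = self.logit_layer(u)
        return torch.softmax(logits, dim=-1)    # probability vector p(X) in [0, 1]^K
```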

Approximate Bayesian inference

Bayesian inference may yield a predictive distribution about a sample \({\mathbf{x}}^{*}\), \(p(\mathbf{p}|{\mathbf{x}}^{*},\mathscr{D})\), from the likelihood of an assumed parametric model \(p(\mathbf{p}|{\mathbf{x}}^{\boldsymbol{*}},\Theta)\), an (approximate) parametric posterior \(q\left(\Theta |\mathscr{D}\right)\), and, potentially, a Monte Carlo integration (MCI) technique, also referred to as Bayesian model averaging:

$$p\left(\mathbf{p}|{\mathbf{x}}^{*},\mathscr{D}\right)\approx {\int_{\Theta}}p\left(\mathbf{p}|{\mathbf{x}}^{*},\Theta \right)q\left(\Theta |\mathscr{D}\right)d\Theta \approx \frac{1}{T}\sum_{t=1}^{T}p(\mathbf{p}|{\mathbf{x}}^{*},{\Theta }_{t})$$

Most neural networks are parametric models, which assume \(\Theta\) can perfectly represent \(\mathscr{D}\). As a result, the model likelihood \(p(\mathbf{p}|{\mathbf{x}}^{*},\mathscr{D},\Theta)\) is often replaced with \(p(\mathbf{p}|{\mathbf{x}}^{*},\Theta )\). The main differentiating factor among Bayesian deep learning inference methods lies in how the parametric posterior \(q\left(\Theta |\mathscr{D}\right)\) is approximated.
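As a sketch of the Monte Carlo integration step (the sampler below is a hypothetical callable, not a specific library API), the predictive distribution is approximated by averaging T stochastic likelihood evaluations:

```python
# Minimal sketch of Monte Carlo integration (Bayesian model averaging).
import torch

@torch.no_grad()
def predictive_distribution(sample_likelihood, x_star: torch.Tensor, T: int = 250) -> torch.Tensor:
    """`sample_likelihood(x)` is assumed to draw Theta_t ~ q(Theta | D) internally and
    return one stochastic probability vector p(p | x, Theta_t)."""
    draws = torch.stack([sample_likelihood(x_star) for _ in range(T)])
    return draws.mean(dim=0)   # (1/T) * sum_t p(p | x*, Theta_t)
```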

Resnet extended with Monte Carlo Dropout

The MCD model approximates the parametric posterior \(q(\Theta |\mathscr{D})\) by keeping dropout activated during inference43. Dropout randomly ‘switches off’ a subset of neurons to zero-vectors at each iteration. Hence, a collection of dropout configurations \({\left\{{\Theta }_{t}\right\}}_{t=1}^{T}\) are samples from the (approximate) posterior \(q(\Theta |\mathscr{D})\). For more information, refer to the Appendix of43 where an approximate dual connection between Monte Carlo Dropout neural networks and Deep Gaussian processes is established.

The MCD also extends the Resnet model architecture by including an additional output layer to estimate a data-dependent variance function \({\mathbf{s}}_{t}^{2}\left(\mathbf{X}\right)= g(\langle {\mathbf{U}}^{(L)},{\mathbf{W}}_{t}^{(\Sigma )}\rangle +{\mathbf{b}}_{t}^{(\Sigma )})\) in addition to the (now stochastic) mean function \({\mathbf{f}}_{t}\left(\mathbf{X}\right)= g\left(\langle {\mathbf{U}}^{\left(L\right)},{\mathbf{W}}_{t}^{(\mu )}\rangle +{\mathbf{b}}_{t}^{(\mu )}\right)\). Both final output layers had a shared input \({\mathbf{U}}^{(L)}\), but unique parameters \(\left\{{\mathbf{W}}_{t}^{(\mu )},{\mathbf{b}}_{t}^{(\mu )}\right\}\) and \(\left\{{\mathbf{W}}_{t}^{(\Sigma )},{\mathbf{b}}_{t}^{(\Sigma )}\right\}\). Together, the stochastic mean \({\mathbf{f}}_{t}\left(\mathbf{X}\right)\) and variance \({\mathbf{s}}_{t}^{2}\left(\mathbf{X}\right)\) specify a Gaussian distribution in the logit-space, which was then sampled once \({\mathbf{u}}_{t}\left(\mathbf{X}\right)\sim \mathscr{N}\left(\mu ={\mathbf{f}}_{t}\left(\mathbf{X}\right),\Sigma ={\mathbf{s}}_{t}^{2}{\left(\mathbf{X}\right)}^{T}\mathbf{I}\right)\) and normalised with the SoftMax function \({\mathbf{p}}_{t}\left(\mathbf{X}\right)= {\text{SoftMax}}\left({\mathbf{u}}_{t}\left(\mathbf{X}\right)\right)\). \({\mathbf{p}}_{t}\left(\mathbf{X}\right)\) represents a single sample from the model likelihood \(p(\mathbf{p}|\mathbf{x},\Theta )\), from which \(T\) samples are averaged for Monte Carlo integration:

$$\mathbf{p}\left(\mathbf{X}\right)= \frac{1}{T}\sum_{t=1}^{T}{\mathbf{p}}_{t}(\mathbf{X}).$$

Finally, \(\mathbf{p}\left(\mathbf{X}\right)\) estimates the cancer primary site label \(y\) and the predictive uncertainties \(\text{Conf}(.)\) and \(\mathscr{H}\left(\text{.}\right)\) for each individual sample in the data batch \(\mathbf{X}\).
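A minimal sketch of this extension is shown below; dropout is applied only to the final shared representation and a log-variance head is used for numerical stability, both of which are simplifying assumptions rather than the study’s exact implementation.

```python
# Minimal sketch of the MCD heads over the shared representation U(L).
import torch
import torch.nn as nn

class MCDHeads(nn.Module):
    def __init__(self, width: int, n_classes: int, p_drop: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)                 # kept active at prediction time
        self.mean_head = nn.Linear(width, n_classes)      # stochastic mean f_t(X)
        self.log_var_head = nn.Linear(width, n_classes)   # data-dependent variance s_t^2(X)

    def forward(self, u_L: torch.Tensor) -> torch.Tensor:
        h = self.dropout(u_L)                             # one dropout configuration Theta_t
        mean = self.mean_head(h)
        std = torch.exp(0.5 * self.log_var_head(h))
        logits = mean + std * torch.randn_like(mean)      # one Gaussian sample u_t(X) in logit-space
        return torch.softmax(logits, dim=-1)              # p_t(X)

@torch.no_grad()
def mcd_predict(heads: MCDHeads, u_L: torch.Tensor, T: int = 250) -> torch.Tensor:
    heads.train()                                         # Monte Carlo Dropout: dropout stays on
    return torch.stack([heads(u_L) for _ in range(T)]).mean(dim=0)
```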

MCD extended with a bi-Lipschitz constraint

The Bilipschitz model shared all the properties of the MCD model with an additional bi-Lipschitz constraint:

$${L}_{1}{\Vert {\mathbf{x}}_{1}-{\mathbf{x}}_{2}\Vert }_{\mathscr{X} }\le {\Vert \mathbf{f}\left({\mathbf{x}}_{1}\right)- \mathbf{f}\left({\mathbf{x}}_{2}\right)\Vert }_{\mathscr{F}}\le {L}_{2}{\Vert {\mathbf{x}}_{1}-{\mathbf{x}}_{2}\Vert }_{\mathscr{X} }$$

where scalars \({L}_{1}\) and \({L}_{2}\) respectively control the tightness of the lower- and upper-bound. Norm operators \(\left\{{\Vert \text{.}\Vert }_{\mathscr{X} },{\Vert \text{.}\Vert }_{\mathscr{F}}\right\}\) are over the data space \(\mathscr{X}\) and function space \(\mathscr{F}\). The effect of the bi-Lipschitz constraint is such that the changes in input data \({\Vert {\mathbf{x}}_{1}-{\mathbf{x}}_{2}\Vert }_{\mathscr{X} }\) (e.g. distribution shifts) are proportional to the changes in the output, \({\Vert \mathbf{f}\left({\mathbf{x}}_{1}\right)- \mathbf{f}\left({\mathbf{x}}_{2}\right)\Vert }_{\mathscr{F}}\). These changes are within a bound determined by \({L}_{1}\) (controlling sensitivity) and \({L}_{2}\) (controlling smoothness). Interestingly, recent studies have established that bi-Lipschitz constraints are beneficial to the robustness of the neural network under distributional shifts44,45. Sensitivity (i.e. \({L}_{1}\)) is controlled with residual connections66,67, which allows \(\mathbf{f}\left(\mathbf{x}\right)\) to avoid arbitrarily small changes, especially in the presence of distributional shifts in those regions of \(\mathscr{X}\) with no (training data) support44. Smoothness (i.e. \({L}_{2}\)) is controlled with spectral normalisation on parameters \(\Theta\) 44,68 and batch-normalisation functions45, which allow \(\mathbf{f}\left(\mathbf{x}\right)\) to avoid arbitrarily large changes (under shifts) that induce feature collapse and extreme overconfidence44,45,46.
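The sketch below illustrates how these two constraints could be imposed in PyTorch: residual connections preserve sensitivity (the lower bound), while spectral normalisation of the weight matrices enforces smoothness (the upper bound). The tunable singular-value bound mentioned under ‘Benchmarked models’ is omitted for brevity, and the layer arrangement is an assumption.

```python
# Minimal sketch of a bi-Lipschitz residual block (assumed layer arrangement).
import torch.nn as nn
from torch.nn.utils import spectral_norm

def constrained_block(width: int) -> nn.Module:
    # Spectral normalisation bounds the largest singular value of the weight matrix,
    # which upper-bounds the layer's Lipschitz constant (smoothness, L2).
    return nn.Sequential(spectral_norm(nn.Linear(width, width)), nn.BatchNorm1d(width), nn.Mish())

class BilipschitzTrunk(nn.Module):
    def __init__(self, width: int = 1024, n_residual: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList([constrained_block(width) for _ in range(n_residual)])

    def forward(self, u):
        for block in self.blocks:
            u = block(u) + u   # residual connection preserves sensitivity (lower bound, L1)
        return u
```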

Deep ensemble of Bilipschitz models

The Ensemble model was a collection of eight independently trained Bilipschitz models with unique initial parameter configurations. Each Bilipschitz model in the Ensemble was sampled \(T/10\;(=25)\) times and the samples were then pooled, to control the total number of Monte Carlo samples between the Ensemble and all other models.

Models in deep ensembles yield similarly performant (low-loss) solutions, but are diverse and distant in parameter- and function-space69. This allows the ensemble to have an (approximate) posterior \(q\left(\Theta |\mathscr{D}\right)\) with multiple modes, which was not the case for the Resnet, MCD, and Bilipschitz models. We believe the ensemble modelled \(q\left(\Theta |\mathscr{D}\right)\) with the highest fidelity to the true parametric posterior \(p\left(\Theta |\mathscr{D}\right)\) due to empirical evidence from other studies' results27,48,70,71.
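A minimal sketch of the pooling step (an assumed implementation) is given below; each member contributes an equal share of Monte Carlo samples so that the total sample budget is comparable with the other models.

```python
# Minimal sketch of deep-ensemble pooling of Monte Carlo samples.
import torch

@torch.no_grad()
def ensemble_predict(members, x: torch.Tensor, samples_per_member: int = 25) -> torch.Tensor:
    draws = []
    for member in members:                        # e.g., eight independently trained Bilipschitz models
        member.train()                            # keep dropout active (Monte Carlo Dropout)
        draws.extend(member(x) for _ in range(samples_per_member))
    return torch.stack(draws).mean(dim=0)         # pooled Bayesian model average
```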

Model efficacy assessment

Model efficacy was assessed using several metrics with practical relevance in mind (justification provided in the Supplementary Information—S1.2). Predictive performance, the predictive uncertainties and the total overconfidence were measured with the micro-F1 score, Shannon’s entropy \(\mathscr{H}\) and the Expected Calibration Error (ECE), respectively. F1-AUC was used to evaluate the robustness of the predictive performance and the uncertainty’s error-rate correlation. The Area between Development and Production (ADP) metric was designed to complement F1-AUC by evaluating robustness to shift-induced overconfidence. It may be interpreted as the expected predictive loss during a model’s transition from development inference (IID) to production inference (OOD) while controlling for the uncertainty threshold.

Quantifying predictive uncertainty

A predictive uncertainty (or total uncertainty) indicates the likelihood of an erroneous inference \(\mathbf{p}\left(\mathbf{x}\right)= {\text{SoftMax}}\left(\mathbf{f}\left(\mathbf{x}\right)\right)\), with a probability vector \(\mathbf{p}\left(\mathbf{x}\right)\in {[\mathrm{0,1}]}^{K}\), normalising operator \({\text{SoftMax}}\left(.\right)\), pointwise function in logit-space \(\mathbf{f}\left(.\right)\), and a gene expression vector \(\mathbf{x}\in {\mathbb{R}}^{D}\). The ideal predictive uncertainties depend on the combination of many factors, including the training data \({\mathscr{D}}_{train}={\left\{\left({\mathbf{x}}_{i},{y}_{i}\right)\right\}}_{i=1}^{n}\), the model specification (e.g. model architecture, hyperparameters, etc.), the inherent noise in the data, the model parameters \(\Theta\), the test data inputs \(\mathbf{x}\in {\mathscr{D}}_{test}\) (if modelling heteroscedastic noise), and hidden confounding variables causing distribution shifts. Consequently, there are many statistics, each explaining different phenomena, which make up the predictive uncertainty. Given that some sub-divisions of uncertainty are exclusive to distribution-wise predictive models72, we restricted ourselves to uncertainties that are accessible to both pointwise and distribution-wise models, namely, the confidence score, \(\mathrm{Conf}(\mathbf{x})\), and Shannon’s entropy, \(\mathscr{H}(\mathbf{p}\left(\mathbf{x}\right))\).

A model’s confidence score with reference to a sample \(\mathbf{x}\) is defined as the largest element of the SoftMax vector,

$$\mathrm{Conf}\left(\mathbf{x}\right)={\Vert \mathbf{p}\left(\mathbf{x}\right)\Vert }_{\infty },$$

where \({\Vert \mathbf{p}\left(\mathbf{x}\right)\Vert }_{\infty }\) denotes the infinity norm (i.e., the maximum element) of the vector \(\mathbf{p}\left(\mathbf{x}\right)\). Confidence scores approximately quantify the probability of being correct and are thus often used for rejecting ‘untrustworthy’ predictions (recall ‘uncertainty thresholding’ from the Introduction). Moreover, the average \(\mathrm{Conf}(\mathbf{x})\) is comparable to the accuracy metric, which allows overconfidence to be evaluated via the ECE, which we detail shortly.

Another notion of predictive uncertainty is that of Shannon’s Entropy, i.e.,

$$\mathscr{H}\left(\mathbf{p}\right)=-\sum_{k=1}^{K}{\mathrm{p}}_{k}\mathrm{log}\left({\mathrm{p}}_{k}\right)= -\langle \mathbf{p},\mathrm{log}\left(\mathbf{p}\right)\rangle,$$

where \(\langle .,.\rangle\) is the dot product operator. Recall that \(\mathscr{H}\left(\mathbf{p}\right)\) is maximised when \(\mathbf{p}\) encodes a uniform distribution.
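Both statistics are straightforward to compute from the averaged SoftMax output; a minimal sketch (assuming `probs` holds one probability vector per sample) follows.

```python
# Minimal sketch of the two uncertainty statistics used in this study.
import numpy as np

def confidence(probs: np.ndarray) -> np.ndarray:
    """Conf(x): the largest SoftMax element, i.e., the infinity norm of p(x)."""
    return probs.max(axis=1)

def shannon_entropy(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """H(p) = -<p, log p>; maximal when p is uniform over the K classes."""
    return -np.sum(probs * np.log(probs + eps), axis=1)
```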

Defining out-of-distribution data and the DL effects

The IID assumption on data implies true causal mechanisms (i.e. structural causal model) where the underlying data generating process is immutable across observations, and hence the samples are independently generated from the same distribution58. The OOD assumption, however, underpins a different setting where the underlying causal mechanisms are affected (e.g. via interventions), thus the distribution of data changes73. There are many different types of distributional shifts, all of which negatively affect model performance. Deep learning models can degrade under distribution shifts as the IID assumption is necessary for most optimisation strategies (Supplementary Information—S4.1). Furthermore, it is worth noting that the resulting overconfidence can be extreme, whereby arbitrary model predictions correspond with maximal confidence scores \({s}_{i}\to 1\) 45 (Supplementary Information—S4.2).

Evaluation in OOD using ECE

The Expected Calibration Error was determined by binning each model’s confidence scores into M bins. The absolute differences between each bin’s accuracy and its average confidence score are then averaged, with each bin weighted proportionally to its sample count. The ECE is defined as follows:

$$\mathrm{ECE}={\sum }_{m=1}^{M}\frac{\left|{B}_{m}\right|}{n}\left|\mathrm{acc}({B}_{m})-\mathrm{conf}({B}_{m})\right|,$$

where \({B}_{m}\) is the set of predictions in bin \(m\), \(\left|{B}_{m}\right|\) is its size, \(n\) is the total number of samples, and \(\mathrm{acc}({B}_{m})\) and \(\mathrm{conf}({B}_{m})\) are the accuracy and average confidence score of bin \(m\), respectively.
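A minimal sketch of this computation with equal-width confidence bins is given below; the equal-width binning and the default value of M are assumptions for illustration.

```python
# Minimal sketch of the Expected Calibration Error with equal-width bins.
import numpy as np

def expected_calibration_error(conf, correct, M: int = 10) -> float:
    """conf: per-sample confidence scores in [0, 1]; correct: per-sample correctness (bool)."""
    conf, correct = np.asarray(conf, dtype=float), np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, M + 1)
    n, ece = len(conf), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())   # |acc(B_m) - conf(B_m)|
            ece += (in_bin.sum() / n) * gap                           # weighted by |B_m| / n
    return float(ece)
```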

Evaluation in OOD using the area under the F1-retention curve (F1-AUC)

Area under the F1-Retention Curve (F1-AUC) was used to evaluate model performance in OOD, as it accounts for both predictive accuracy and an uncertainty’s error-rate correlation47. High F1-AUC values result from high accuracy (reflected by vertical shifts in F1-Retention curves) and/or high uncertainty error-rate correlation (reflected by the gradient of the F1-Retention curves). An uncertainty’s error-rate correlation is important in the production (OOD) context as higher correlations imply more discarded erroneous predictions.

F1-AUC was quantified according to the following method.

1. Predictions were sorted in descending order of uncertainty.

2. All predictions were iterated over once; at each iteration, F1 and retention (initially 100%) were calculated before the current prediction was replaced with the ground truth, thereby decreasing retention.

3. The increasing F1 scores and the corresponding decreasing retention rates determined the F1-Retention curve.

4. Approximate integration of the F1-Retention curve determined the F1-AUC.

F1-Retention curves and F1-AUC metrics were quantified for all models on OOD data, including samples with classes that were not seen during training.
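The sketch below is an illustrative re-implementation of the four steps above (not the authors’ code), using the fact that micro-F1 equals accuracy in this single-label setting.

```python
# Minimal sketch of the F1-Retention curve and its area (F1-AUC).
import numpy as np

def f1_retention_auc(y_true, y_pred, uncertainty):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred).copy()
    order = np.argsort(-np.asarray(uncertainty))      # step 1: most uncertain first
    n = len(y_true)
    retention, f1 = [], []
    for i, idx in enumerate(order):                   # step 2: score, then replace with ground truth
        retention.append(1.0 - i / n)
        f1.append(float(np.mean(y_pred == y_true)))   # micro-F1 == accuracy here
        y_pred[idx] = y_true[idx]
    # Steps 3-4: the (retention, F1) pairs define the curve; integrate with the trapezoidal rule.
    r, f = np.array(retention[::-1]), np.array(f1[::-1])
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(r)))
```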

Using ADP for evaluating models in OOD data relative to IID data

The Area between the Development and Production Curve (ADP) aimed to complement F1-AUC, especially in the context of deploying models from development inference (IID) to production inference (OOD). Thus, ADP was designed to capture (in OOD data, relative to IID) three aspects of a model’s robustness: accuracy, uncertainty error-rate correlation, and shift-induced overconfidence. This is because benchmarked models can degrade similarly in terms of accuracy and uncertainty error-rate correlation (as measured by F1-AUC), yet differ significantly in their uncertainty calibration (as measured by ADP).

ADP was calculated according to the following method:

1. Development and Production F1-Uncertainty curves were produced by iteratively calculating F1 and discarding (not replacing) samples in descending order of uncertainty.

2. A nominal F1 target range of \(\left[\mathrm{min}\left(\mathrm{F}{1}_{dev}\right),\mathrm{max}\left(\mathrm{F}{1}_{dev}\right)\right]=\left[0.975, 0.990\right]\) was selected based on the Development F1-Uncertainty curve, with \(\left(\mathrm{F}{1}_{dev}, {\mathscr{U}}_{accept}\right)\) denoting a point on the Development F1-Uncertainty curve at uncertainty threshold \({\mathscr{U}}_{accept}\).

3. Nominal F1 target points, \(\mathrm{F}{1}_{nom}\), were incremented at 1e-5 intervals from \(\mathrm{F}{1}_{nom}=\mathrm{min}(\mathrm{F}{1}_{dev})\) to \(\mathrm{F}{1}_{nom}=\mathrm{max}(\mathrm{F}{1}_{dev})\), with the per cent decrease in F1, from development \(\mathrm{F}{1}_{nom}\) to production \(\mathrm{F}{1}_{prod}\), recalculated at each step:

$${\mathrm{Decrease}}^{\left(dev\to prod\right)} (\mathrm{F}{1}_{nom})=(\mathrm{F}{1}_{nom}-\mathrm{F}{1}_{prod})\times 100\%.$$

4. The set of recalculated \({\mathrm{Decrease}}^{\left(dev\to prod\right)} (\mathrm{F}{1}_{nom})\) values was averaged to approximate the Area between the Development and Production curves (ADP).

The ADP may be interpreted as “the expected decrease in accuracy when transitioning from development to production if uncertainty thresholding is utilised to boost reliability”.
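The sketch below is an illustrative re-implementation of this procedure under simplifying assumptions (nearest-point look-up on the curves instead of interpolation); it is not the authors’ code.

```python
# Minimal sketch of the ADP computation.
import numpy as np

def f1_uncertainty_curve(y_true, y_pred, uncertainty):
    """F1 (accuracy) of the retained subset for increasing uncertainty thresholds."""
    order = np.argsort(uncertainty)                           # most certain samples first
    correct = (np.asarray(y_true) == np.asarray(y_pred))[order]
    f1 = np.cumsum(correct) / np.arange(1, len(correct) + 1)  # accuracy of the first k retained samples
    return np.asarray(uncertainty)[order], f1                 # (thresholds, F1 at each threshold)

def adp(dev, prod, f1_min=0.975, f1_max=0.990, step=1e-5):
    """dev and prod are (y_true, y_pred, uncertainty) triples for IID and OOD data."""
    thr_dev, f1_dev = f1_uncertainty_curve(*dev)
    thr_prod, f1_prod = f1_uncertainty_curve(*prod)
    decreases = []
    for f1_nom in np.arange(f1_min, f1_max, step):
        reaching = np.where(f1_dev >= f1_nom)[0]              # development points meeting the target
        if len(reaching) == 0:
            continue
        u_accept = thr_dev[reaching[-1]]                      # most permissive threshold reaching F1_nom
        kept = np.where(thr_prod <= u_accept)[0]              # production predictions accepted at u_accept
        f1_p = f1_prod[kept[-1]] if len(kept) else 0.0
        decreases.append((f1_nom - f1_p) * 100.0)             # Decrease(dev -> prod) in per cent
    return float(np.mean(decreases))                          # ADP: averaged decrease
```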

It is important to note that our method for selecting the range \(\left[\mathrm{min}\left(\mathrm{F}{1}_{\mathrm{dev}}\right),\mathrm{max}\left(\mathrm{F}{1}_{\mathrm{dev}}\right)\right]\) was not arbitrary and required two checks for each model’s Development F1-Uncertainty curve. The first check was to ensure the sample size corresponding to \(\mathrm{max}\left(\mathrm{F}{1}_{\mathrm{dev}}\right)\) was sufficiently large (see Supplementary Table S3). The second check was to ensure that \(\mathrm{min}\left(\mathrm{F}{1}_{\mathrm{dev}}\right)\) was large enough to satisfy production needs. Failing to undertake these checks may cause the ADP statistic to mislead explanations of the expected loss when deploying models to production.

ADP is practically relevant because it relates to the uncertainty thresholding technique for improving reliability in production (recall the Introduction). This is because \({\mathrm{Decrease}}^{\left(dev\to prod\right)} (\mathrm{F}{1}_{nom})\) first depends on a nominated target performance \(\mathrm{F}{1}_{nom}\), which selects a corresponding \({\mathscr{U}}_{accept}\) from the Development F1-Uncertainty curve. Predictions with uncertainties below \({\mathscr{U}}_{accept}\) are accepted in production, with performance denoted by \(\mathrm{F}{1}_{prod}\). As far as the authors are aware, no other metric monitors the three robustness components of accuracy, uncertainty’s error-rate correlation, and shift-induced overconfidence.

Ethics approval and consent to participate

This project used RNA-seq data which was previously published or is in the process of publication. The QIMR Berghofer Human Research Ethics Committee approved use of public data (P2095).