Abstract
Uncertainty estimation is crucial for understanding the reliability of deep learning (DL) predictions, and critical for deploying DL in the clinic. Differences between training and production datasets can lead to incorrect predictions with underestimated uncertainty. To investigate this pitfall, we benchmarked one pointwise and three approximate Bayesian DL models for predicting cancer of unknown primary, using three RNAseq datasets with 10,968 samples across 57 cancer types. Our results highlight that simple and scalable Bayesian DL significantly improves the generalisation of uncertainty estimation. Moreover, we designed a prototypical metric—the area between development and production curve (ADP), which evaluates the accuracy loss when deploying models from development to production. Using ADP, we demonstrate that Bayesian DL improves accuracy under data distributional shifts when utilising ‘uncertainty thresholding’. In summary, Bayesian DL is a promising approach for generalising uncertainty, improving performance, transparency, and safety of DL models for deployment in the real world.
Introduction
Recent advances in deep learning (DL) have led to the rapid development of diagnostic and treatment support applications in various aspects of healthcare, including oncology^{1,2,3,4}. The proposed applications of DL utilise a range of data modalities, including MRI scans^{5}, CT scans^{6}, histopathology slides^{7}, genomics^{8}, transcriptomics^{9,10}, and most recently, integrated approaches with various data types^{11,12}. In general, studies using DL show excellent predictive performance, providing hope for successful translation into clinical practice^{13,14}. However, prediction accuracy in DL comes with potential pitfalls which need to be overcome before wider adoption can eventuate^{15}.
The lack of transparency over prediction reliability is one challenge for implementing DL^{16}. One approach to overcome this is by providing uncertainty estimates about a model’s prediction^{17,18}, enabling better-informed decision making. Another obstacle relates to the assumptions made about data when transitioning from training to real-world applications. In standard DL practice, during the ‘development’ stage, models are trained and validated on data prepared to satisfy the assumption of independent and identically distributed (IID) data, meaning that the model would be applied to data that are independently drawn and come from the same distribution as the training data. However, this assumption cannot be guaranteed and is, in fact, frequently violated when models are deployed in ‘production’ (i.e. real-world application). This is because confounding variables, which we cannot control for, cause distributional shifts that push data out-of-distribution (OOD)^{19}. For oncology applications, confounding variables can include technical differences in how the data are collected (e.g., batch effects, differences in sequencing depth or library choice for genomic and transcriptomic data; differences in instrumentation and imaging settings for medical imaging data), as well as biological differences (e.g., differences in patient demographics or a data class unseen during model development). The consequences of OOD data include inaccurate predictions coupled with underestimated uncertainties, which together result in the model’s overconfidence under distributional shifts, or what we call ‘shift-induced’ overconfidence^{20,21,22}. Consequently, implementation of DL into clinical practice (i.e., production) requires that models are robust (i.e., generalise) to distributional shifts and provide correct predictions with calibrated uncertainties.
Methods to address DL overconfidence in production exist, albeit with different limitations. Repeated retraining of deployed models on new production data is beneficial for accuracy, but introduces new risks, such as excessive computational cost or catastrophic forgetting, whereby DL models lose performance on the original training/development data^{23,24}. Tracking metrics such as accuracy can help inform ML engineers about DL reliability, although such metrics are only available retrospectively. A key pitfall of these methods is that they are reactive rather than proactive.
One proactive approach for managing risks in production is ‘uncertainty thresholding’, whereby only predictions with uncertainties below a threshold are accepted (to increase accuracy). Unfortunately, a DL model’s uncertainty threshold is established with development (IID) data. Thus, when the model is deployed on production (OOD) data, it runs a high risk of becoming overconfident. Therefore, the uncertainty threshold established in development corresponds to a higher error rate in production, which is a problem if expectations (between healthcare professionals and engineers) are set during the development phase of a project. To address this problem, post-hoc methods exist that calibrate uncertainty (e.g., ‘temperature scaling’^{25}). However, while post-hoc calibration effectively controls overconfidence in IID data^{25}, it fails to do so proactively in OOD data^{21,22}. Despite notable theoretical and empirical research towards generalising DL uncertainties for OOD data^{26,27}, shift-induced overconfidence is yet to be sufficiently addressed in practice.
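As an illustration of the ‘uncertainty thresholding’ regime described above, the following sketch (our own, not the study’s implementation; all names are hypothetical) accepts only predictions whose Shannon entropy falls below a threshold and reports accuracy on the accepted subset:

```python
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy (base 2) of each row of class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=1)

def threshold_predictions(probs, labels, max_entropy):
    """Accept only predictions whose entropy is below `max_entropy`;
    return accuracy on the accepted subset and the acceptance rate."""
    entropy = shannon_entropy(probs)
    accepted = entropy < max_entropy
    if accepted.sum() == 0:
        return float("nan"), 0.0
    preds = probs.argmax(axis=1)
    acc = (preds[accepted] == labels[accepted]).mean()
    return float(acc), float(accepted.mean())
```

Lowering `max_entropy` raises accuracy on IID data; however, the same threshold can admit overconfident errors on OOD data, which is exactly the shift-induced failure mode addressed in this paper.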
In this study, we aim to address the generally underappreciated shift-induced DL overconfidence in the context of oncology—a field that is particularly vulnerable to this pitfall due to frequent data distribution shifts. We conduct our experiments with a case study that predicts cancer of origin from transcriptomic data.
Cancer of origin prediction has been an active application area for DL^{24,28,29,30}, since accurate diagnosis is critical for the treatment of cancers of unknown primary (CUP), i.e. metastatic cancers in which the primary cancer site cannot be reliably determined. We investigate multiple cancer datasets, including one newly introduced dataset, with simple, effective, and scalable approximate Bayesian DL techniques that improve generalisation. We examine whether these techniques improve model robustness to shift-induced overconfidence and, therefore, DL reliability. We introduce the prototypical ADP metric to measure model robustness to shift-induced overconfidence and to directly explain the “expected loss of accuracy during deployment in an uncertainty-thresholding regime”. Finally, we provide a brief discussion of how the ADP supports model selection and how that can be helpful within a clinical setting.
Results
Bayesian model benchmarking approach to predict cancer of unknown primary
The primary DL task was to predict the tissue of origin (primary cancer type) of cancer samples using transcriptomic data. We used transcriptomic data from TCGA of primary cancer samples corresponding to 32 primary cancer types as model ‘development’ data: training (n = 8202^{31}) and validation IID data (n = 1434; Supplementary Table S1). The test data were OOD (representing ‘production’), providing a platform for benchmarking resilience to overconfidence, and included TCGA metastatic samples (n = 392^{32}), Met500 metastatic samples (n = 479^{33}), and a combination of primary and metastatic samples from our own independent internal custom dataset, i.e. ICD (n = 461^{34,35,36,37,38,39,40,41,42}; Fig. 1a, Supplementary Fig. S1). The distributional shifts in the test data were likely to be caused by several factors, including dataset batches, sample metastasis status (metastatic or primary) and whether the cancer type was absent during training (‘unseen’).
We aimed to evaluate whether three simple ‘distribution-wise’ Bayesian DL models improve performance and reduce shift-induced overconfidence compared to a pointwise baseline model (with an identical Resnet architecture). To achieve this, we performed controlled benchmarking of the models over IID and OOD data (Fig. 1b). The experiment was controlled by enforcing consistency for factors affecting uncertainty within the validation/IID dataset. Specifically, all models had identical architecture, hyperparameter, and optimisation settings. Importantly, all models had identical (negative log likelihood) loss within the validation/IID dataset. We intentionally did not perform hyperparameter optimisation for each model, as it was important for our study design to control for accuracy.
The Bayesian models were Monte Carlo Dropout approximation (‘MCD’)^{43}, MCD with smoothness and sensitivity constraints (‘Bilipschitz’)^{44,45}, and an ensemble of Bilipschitz models (‘Ensemble’)^{45}. The ways in which models differed were canonical: MCD modified Resnet by keeping Dropout during prediction, Bilipschitz modified MCD with spectral normalisation, Ensemble modified Bilipschitz by combining multiple models.
Approximate Bayesian inference reduces shift-induced overconfidence for ‘seen’ classes in a primary cancer site context
The predictive performance of each model in predicting the primary tissue was assessed using micro-F1 (equivalent to accuracy; abbreviated F1). For the IID validation data, the difference between the highest- and lowest-ranking models was 0.28% (97.07% for Resnet and 96.79% for Ensemble, respectively; Fig. 2a, Supplementary Figs. S2–S5). This was anticipated, since the loss was controlled for within the validation data. As expected, F1 scores dropped for the OOD test set across all four models, with a 1.74% difference between the highest- and lowest-ranking models (82.04% for Ensemble and 80.30% for Resnet, respectively; Fig. 2b, Supplementary Figs. S6–S9). All models had higher predictive uncertainties (Shannon’s entropy II) for OOD relative to IID data (Fig. 2b). Uncertainties were significantly higher for all approximate Bayesian models (MCD, Bilipschitz, and Ensemble) relative to the (pointwise) Resnet (p < 0.0001). Moreover, overconfidence in OOD data was evident for the Resnet and MCD models, since their binned accuracies (i.e., the correct classification rates within bins delineated by the confidence scores) were consistently lower than the corresponding confidence scores (Fig. 2c). The expected calibration errors (ECEs) for OOD data ranged between 5% for Ensemble and Bilipschitz and 16% for Resnet (Fig. 2c). Overconfidence, estimated as an absolute error, was negligible across all models for IID data, with high amounts of overconfidence for OOD data, highlighting the shift-induced overconfidence when transitioning from IID to OOD data (Fig. 2d). Furthermore, Resnet had significantly higher overconfidence than MCD (p value < 0.01), Bilipschitz (p value < 0.001), and Ensemble (p value < 0.001) for OOD data but not IID data. This shows that the shift-induced overconfidence in pointwise DL models can be reduced with simple (approximate) Bayesian inference.
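The ECE values reported above can be reproduced in principle with a standard binned estimator; the sketch below is a minimal version (our own naming and equal-width binning choices, not the study’s exact code):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: population-weighted mean |accuracy - confidence| over
    equal-width confidence bins (a standard binned estimator)."""
    conf = probs.max(axis=1)                    # confidence = top softmax score
    preds = probs.argmax(axis=1)
    correct = (preds == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # weight each bin by the fraction of samples it holds
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)
```

When binned accuracy sits consistently below binned confidence, as reported for Resnet and MCD on OOD data, the gap contributing to ECE is overconfidence.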
Prediction overconfidence for ‘unseen’ classes explained by related primary cancer types
Classes absent from training (‘unseen’) cannot have correct predictions, and their prediction uncertainties should be higher compared to ‘seen’ classes. As expected, mean total uncertainties were higher for ‘unseen’ classes for all models (Fig. 3a). Moreover, approximate Bayesian models were significantly more uncertain with ‘unseen’ classes compared to Resnet (p value < 0.01; Fig. 3a). However, exceptions occurred across all models, where total uncertainty values were low at both the class level, where predictions for a whole ‘unseen’ class consistently had low uncertainty, and the sample level, where predictions for only some samples from a class had low uncertainty (Fig. 3b). We wanted to investigate whether any of the exceptions could be examples of ‘silent catastrophic failure’ (Supplementary Information—S4.2), a phenomenon where data are far from the training data’s support, resulting in incorrect yet extremely confident predictions^{44,45,46}.
‘Unseen’ classes (i.e., cancer types) with low levels of uncertainty (averaged within the class) corresponded to ‘seen’ classes that were either (biologically) related to the predicted primary cancer type or from a similar tissue or cell of origin. For example, all acral melanoma (ACRM) samples (n = 40), a subtype of melanoma that occurs on soles, palms and nail beds, were predicted to be cutaneous melanoma (MEL) by all four models (Supplementary Figs. S6–S9), with the smallest median total uncertainty for all four models (Fig. 3b). All three fibrolamellar carcinoma (FLC) samples, a rare type of liver cancer, were predicted to be hepatocellular carcinomas (HCC), although the median uncertainty was much higher for the Bilipschitz and Ensemble models compared to Resnet and MCD (1.8, 1.5, 0.1 and 0.29 Shannon’s Entropy II, respectively). Two bladder squamous cell carcinomas (BLSC) showed different examples of class-level exceptions, with one sample predicted as a bladder adenocarcinoma (BLCA), with the same primary tissue site as BLSC, and the other as a lung squamous carcinoma (LUSC), with a similar cell of origin. For the ‘unseen’ class pancreatic neuroendocrine tumours (PANET), we saw a wide spread of uncertainty values (Fig. 3b). Interestingly, only the PANET samples that were predicted as another subtype of pancreatic cancer, pancreatic adenocarcinomas (PAAD), had low prediction uncertainty across all models compared to other incorrectly predicted PANET samples (Supplementary Fig. S10). Overall, since most of the incorrect predictions with low uncertainties had a reasonable biological explanation, we concluded that we did not find strong evidence of silent catastrophic failure in this case study.
Robustness to shift-induced overconfidence is integral for production inference
To evaluate the robustness of the models’ accuracy, as well as the uncertainty’s correlation with the error rate (abbreviated “uncertainty’s error-rate correlation”), we used the F1-Retention Area Under the Curve (F1-AUC)^{47}. Evaluation was carried out on ‘seen’ and ‘unseen’ OOD data (i.e., ‘production data’). All models yielded similar results, with only a 0.45% decrease between the highest- and lowest-ranking models (F1-AUC of 93.67% for Bilipschitz and 93.25% for MCD, respectively; Fig. 4a). The performance difference between all models was marginal because F1-AUC does not capture the loss of calibration caused by the distributional shift when transitioning from IID to (‘seen’ and ‘unseen’) OOD data. In other words, the F1-AUC metric did not detect effects caused by shift-induced overconfidence. This was evident from the following observations: (1) inter-model accuracies were similar within IID, as well as OOD, data (Fig. 2a); (2) calibration errors (i.e. overconfidence) were not different for IID (p value > 0.05), but differed for OOD (p value < 0.01; Fig. 2d); and (3) F1-AUC scores were similar for all models, which implies the ‘uncertainty’s error-rate correlation’ must have been similar (since F1-AUC encapsulates accuracy and the ‘uncertainty’s error-rate correlation’^{47}). Thus, while we showed that F1-AUC encapsulates accuracy and the ‘uncertainty’s error-rate correlation’, both of which are important components of robustness when deploying DL in production, we caution that F1-AUC does not encapsulate robustness to shift-induced overconfidence. Hence it is not sufficient for safe deployment in clinical practice.
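For intuition, a simplified F1 (accuracy)–retention curve and its AUC can be sketched as below; note that the published F1-AUC^{47} has its own precise definition, and this minimal variant (our own) simply measures accuracy on the most-certain retained fraction:

```python
import numpy as np

def f1_retention_auc(correct, uncertainty, n_points=101):
    """Simplified retention curve: at each retention fraction r, keep the
    r most-certain predictions and measure accuracy (micro-F1) on that
    subset; return the (normalised) trapezoidal area under the curve."""
    order = np.argsort(uncertainty)                  # most certain first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    fractions = np.linspace(1.0 / n, 1.0, n_points)
    f1s = [correct[: max(1, int(round(r * n)))].mean() for r in fractions]
    auc = 0.0
    for i in range(1, n_points):                     # manual trapezoid rule
        auc += 0.5 * (f1s[i] + f1s[i - 1]) * (fractions[i] - fractions[i - 1])
    return float(auc / (fractions[-1] - fractions[0]))
```

A model whose uncertainty correlates well with its error rate retains correct predictions first, so its curve (and AUC) stays high as more samples are retained.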
To overcome the F1-AUC metric’s insensitivity to shift-induced overconfidence, we developed a new (prototypical) metric called the Area between the Development and Production curve (ADP), which depends on both the IID (i.e., ‘development’) data and the (‘seen’ and ‘unseen’) OOD (i.e., ‘production’) data. The ADP may be interpreted as “the expected decrease in accuracy when transitioning from development to production if uncertainty thresholding is utilised to boost reliability”. The ADP differs from ECE and accuracy in two primary ways. First, ECE and accuracy relate to a single dataset, whereas the ADP relates to two datasets; hence, the ADP explains the expected change in, for example, accuracy on one dataset relative to the other. Second, the ADP complements and subsumes F1-AUC in the context of deploying models from training/development (IID) data to production test (OOD) data. The ADP was calculated by averaging the set of decreases in F1, from development (IID) to production (OOD) datasets, at multiple different uncertainty thresholds (a single F1 decrease is demonstrated in Fig. 4b; refer to the “Methods” section for details).
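A minimal sketch of the ADP computation following the description above (our own implementation; the threshold grid and the use of accuracy as micro-F1 are assumptions, and the paper’s “Methods” section is authoritative):

```python
import numpy as np

def f1_at_threshold(correct, uncertainty, tau):
    """Micro-F1 (accuracy) on predictions whose uncertainty <= tau."""
    keep = uncertainty <= tau
    return correct[keep].mean() if keep.any() else np.nan

def adp(correct_dev, unc_dev, correct_prod, unc_prod, thresholds):
    """Area between the Development and Production curves: the mean drop
    in F1 from development (IID) to production (OOD) data across a grid
    of uncertainty thresholds."""
    drops = []
    for tau in thresholds:
        f1_dev = f1_at_threshold(correct_dev, unc_dev, tau)
        f1_prod = f1_at_threshold(correct_prod, unc_prod, tau)
        if not (np.isnan(f1_dev) or np.isnan(f1_prod)):
            drops.append(f1_dev - f1_prod)
    return float(np.mean(drops))
```

An ADP of, say, 0.05 then reads directly as “expect roughly a five-point accuracy drop in production under the same uncertainty-thresholding policy”.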
The ADP metric detected effects from shift-induced overconfidence, with an inter-model percent decrease that was two orders of magnitude larger than that of F1-AUC (Fig. 4c). The percent decrease between the top- and bottom-ranking models was 53.68%. The top-ranking model was Bilipschitz with an ADP of 4.28%, and the bottom-ranking model was Resnet with an ADP of 9.24% (Fig. 4c). This highlights that the ADP may be relevant when evaluating the performance of models deployed in production, since it encapsulates shift-induced overconfidence, which is inevitable in an oncological setting.
To further illustrate the utility of the ADP, we performed an additional experiment (Supplementary Fig. S11). We used an independent classification task based on the well-known CIFAR-10 (IID) dataset and its OOD variant, CIFAR-10-C, and compared a non-Bayesian CNN Resnet model with a Deep Kernel Learning model (i.e., a neural Gaussian process). The results were in line with our hypothesis that Bayesian deep learning improves robustness to distribution shift, demonstrated by a lower ADP for the Gaussian process model compared to the Resnet model.
Discussion
A major barrier to using DL in clinical practice is the shift-induced overconfidence encountered when deploying a DL model from development to production. Reducing and accounting for shift-induced overconfidence with appropriate models and relevant metrics should make models more transparent and trustworthy for translation into practice. Our work herein shows that marked progress can be made with simple Bayesian DL models deployed in conjunction with uncertainty thresholding. However, the performance of models deployed in production can be difficult to evaluate without a suitable metric; therefore, we developed the ADP to directly measure shift-induced overconfidence.
Three Bayesian models with canonical extensions, namely MCD, Bilipschitz, and Ensemble, were chosen to test whether simple modifications applicable to any DL architecture can improve performance in production. The Bayesian models were selected according to criteria that we believed would facilitate adoption: (1) simplicity, for wider accessibility; (2) ubiquity, to ensure the models were accepted and tested methods; (3) demonstrated robustness to shift-induced overconfidence^{22,48,49}; and (4) computational scalability. Our prior expectation was that each canonical extension would further improve the generalisation of both accuracy and uncertainty quality, albeit at the cost of increased complexity. These expectations were mostly in line with our benchmarking results, since the most complex model (Ensemble) went from the worst-performing model in IID to the best-performing model in OOD in terms of accuracy. Furthermore, while inspection of overconfidence presented no significant inter-model differences within IID data, the OOD overconfidence differences were significant, whereby added complexity corresponded to less shift-induced overconfidence. Using the ADP statistic, improvements in robustness to shift-induced overconfidence were shown to have a large impact on accuracy in production when rejecting unreliable predictions above an acceptable uncertainty threshold. Hence, any DL architecture’s accuracy in production can be substantially improved with simple and scalable approximate Bayesian modifications. This phenomenon is sometimes referred to as “turning the Bayesian crank”^{50}.
We restricted our uncertainty statistics to predictive (i.e., total) uncertainties, since it was not possible to estimate the subdivisions of uncertainty with the baseline Resnet model, which only captures uncertainty about the data. The Bayesian models captured an additional component of uncertainty, the ‘epistemic’ uncertainty, hence they all had larger total uncertainty estimates when compared to the non-Bayesian baseline. Consequently, the Bayesian models filled the uncertainty gap caused by distribution shift (i.e., shift-induced overconfidence). In future work, a richer picture may be understood by focusing only on distribution-wise models to inspect the two subdivisions of the predictive uncertainty: epistemic (model) uncertainty and aleatoric (inherent) uncertainty. Epistemic uncertainty is dependent on the model specification and may be reduced with more data or informative priors. Aleatoric uncertainty is dependent on the data’s inherent noise and can be reduced with more data features that explain variance caused by confounding variables (e.g., patient age, cancer stage, batch effect). Epistemic and aleatoric uncertainties present the potential for further insights, including whether a data point’s predictive uncertainty will reduce with either more examples or by an altered model design (epistemic uncertainty), or more features (aleatoric uncertainty)^{51,52,53,54}.
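For distribution-wise models, the decomposition described above is commonly computed from Monte Carlo samples as total entropy = aleatoric (expected entropy) + epistemic (mutual information); a sketch with our own naming:

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy (base 2) along the class axis."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=axis)

def decompose_uncertainty(prob_samples):
    """Given T posterior samples of class probabilities, shape (T, B, K):
    total (predictive) entropy H[E_t p_t] splits into aleatoric
    (expected per-sample entropy, E_t H[p_t]) and epistemic
    (mutual information, the difference) components."""
    mean_p = prob_samples.mean(axis=0)              # (B, K) model average
    total = entropy(mean_p)                         # H[E_t p_t]
    aleatoric = entropy(prob_samples).mean(axis=0)  # E_t H[p_t]
    epistemic = total - aleatoric                   # mutual information >= 0
    return total, epistemic, aleatoric
```

When all posterior samples agree, epistemic uncertainty vanishes; disagreement between samples (e.g., on shifted data) shows up as epistemic uncertainty.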
This study addressed distributional shift effects on uncertainties with parametric models, which assume parameters are sufficient to represent all training data. Non-parametric models relax that assumption, which is arguably crucial to detect when data are outside the domain of training data (‘out-of-domain’) and for avoiding extreme overconfidence, i.e., ‘silent catastrophic failure’. In future work, non-parametric models, for example Gaussian Processes, capable of measuring uncertainties about ‘out-of-domain’ data, should also be explored^{44,45,46,55}.
Our work suggests that considerations of robustness to distributional shifts must encapsulate both uncertainty and prediction to improve performance in production. While this study focused on the quality of uncertainty, it is important to note that other DL components are worth consideration too. These include the model architecture (i.e. inductive bias), which can be tailored to ignore redundant data-specific aspects of a problem via invariant or equivariant model representations^{56}, data-augmentation strategies^{57}, and/or structural causal models^{58,59,60}. Such tailored models can further improve data efficiency^{56} and robustness to distributional shifts^{27}, and are central to the appropriate model specification that DL deployment demands^{61}. The importance of tailored inductive biases is supported by prolific advances in fields beyond clinical diagnostics, in computer vision (e.g. CNNs’ translational equivariance^{56}) and biology (e.g. how AlphaFold 2^{62} solved the Critical Assessment of protein Structure Prediction (CASP)^{63} challenge). These studies show that a wide array of DL components can improve generalisation and, thus, DL performance in production. Our study argues that uncertainty calibration is an important element in that array; hence, improving the quality of uncertainty can lead to improved DL reliability in production.
In practice, we hope the community considers utilising uncertainty thresholding as a proactive method to improve the accuracy and safety of DL applications deployed in the clinic. This may involve (iterative) consultation between ML engineers and medical professionals to agree on a ‘minimally acceptable accuracy’ for production (deem this \(\mathit{min}\left(F{1}_{dev}\right)\)). The ML engineer may then use development data to train an approximate Bayesian DL model and produce development F1-uncertainty curves (with validation data). The engineer can then, with another independent dataset, proceed to develop an ADP estimate (as described in the “Methods” section) to help communicate (in the context of available dataset differences) what the expected accuracy decrease may be when the model is deployed to production, which helps manage expectations and facilitate trust. Importantly, with the (prototypical) ADP, the team may better judge which uncertainty quantification techniques are most effective for boosting accuracy under the ‘uncertainty thresholding’ risk-management regime. This procedure, as well as the ADP statistic, is of course prototypical and only suggestive; we leave its improvement and clarification for future work.
In conclusion, our study highlighted approaches for quantifying and improving robustness to shift-induced overconfidence with simple and accessible DL methods in the context of oncology. We justified our approach with mathematical and empirical evidence, biological interpretation, and a new metric, the ADP, designed to encapsulate shift-induced overconfidence—a crucial aspect that needs to be considered when deploying DL in real-world production. Moreover, the ADP is directly interpretable as a proxy for the expected accuracy loss when deploying DL models from development to production. Although we have addressed shift-induced overconfidence by utilising first-line solutions, work remains to bridge DL from theory to practice. We must account for data distributions, evaluation metrics, and modelling assumptions, as all are equally important and necessary considerations for the safe translation of DL into clinical practice.
Methods
Prediction task and datasets
The task was to predict a patient's primary cancer type, which we cast under the supervised learning framework by learning the map \(\left\{\mathbf{x}\to y\right\}\), with \(y\) denoting the primary cancer category, and \(\mathbf{x}\in {\mathbb{R}}^{D}\) denoting a patient’s sampled bulk gene expression signature.
Three independent datasets were used: our own independent Internal Custom Dataset (ICD)^{34,35,36,37,38,39,40,41,42}, TCGA^{31}, and Met500^{33}. All datasets were preprocessed and partitioned into groups (i.e., strata) that uniquely proxied different distribution shifts. Each stratum was assumed to be governed by its respective shift (i.e., a unique intervention), as determined by the values of presumed hidden variables influencing the modelled map \(\left\{\mathbf{x}\to y\right\}\). Those variables were ‘Batch’ (indicating the source dataset label, e.g., ‘TCGA’), ‘State-of-Metastases’ (valued ‘Primary’ or ‘Metastatic’), and ‘Seen’ (indicating whether a target value y was seen during training) (Supplementary Table S1). Training and validation data comprised the stratum that we believed to be approximately independent and identically distributed (IID). All other strata were assumed out-of-distribution (OOD) due to distribution shifts caused by confounding variables. As a result, the training and validation data were IID, while the test data were OOD.
Benchmarked models
Four models were benchmarked in this study—the baseline pointwise Resnet, MCD, Bilipschitz, and Ensemble. All models shared identical model architecture and hyperparameter settings (including early stopping), which respectively controlled the inductive bias and prevented accuracy differences from confounding the overconfidence comparisons. Although we did not perform explicit hyperparameter optimisation, some manual intervention was used to adjust hyperparameters using the validation set. For example, the singular value bound hyperparameter (for spectral normalisation) was manually tuned to be as low as practically possible while remaining flexible enough to learn the training task of predicting the primary site.
Baseline resnet model
Resnet architecture had four hidden layers, each with 1024 neurons, Mish activations^{64}, batch normalisation^{65}, and standard residual connections from the first hidden layer up to the final hidden ‘logit-space’ layer, which was then normalised using the SoftMax function to yield a probability vector \(\mathbf{p}\left(\mathbf{x}\right)\in {[\mathrm{0,1}]}^{K}\), where the prediction’s class index \(c={\mathrm{argmax}}_{k}\,{p}_{k}\left(\mathbf{x}\right)\) indicates the primary cancer site’s label \(y \leftarrow c\). Specifically, a batch \(\mathbf{X}\in {\mathbb{R}}^{B\times D}\) with \(B\) individual samples is first transformed by the input layer \({\mathbf{U}}^{\left(0\right)}=g(\langle \mathbf{X}, {\mathbf{W}}^{\left(0\right)}\rangle +{\mathbf{b}}^{\left(0\right)})\), with affine transform parameters \(\left\{{\mathbf{W}}^{\left(0\right)}, {\mathbf{b}}^{\left(0\right)}\right\}\), non-linear activations \(g\), and output representation \({\mathbf{U}}^{\left(0\right)}\). Hidden layers have residual connections \({\mathbf{U}}^{\left(l\right)}=g\left(\langle {\mathbf{U}}^{\left(l-1\right)},{\mathbf{W}}^{\left(l\right)}\rangle +{\mathbf{b}}^{\left(l\right)}\right)+{\mathbf{U}}^{(l-1)}\), where \(l \in \{\mathrm{1,2},\dots ,L\}\) denotes the hidden layer index (\(L=3\) in this case). The final output layer is a pointwise (mean estimate) function in logit-space \(\mathbf{f}\left(\mathbf{X}\right)= g\left(\langle {\mathbf{U}}^{\left(L\right)},{\mathbf{W}}^{\left(\mu \right)}\rangle +{\mathbf{b}}^{\left(\mu \right)}\right)\), where \(\left\{{\mathbf{W}}^{\left(\mu \right)}, {\mathbf{b}}^{\left(\mu \right)}\right\}\) are the final output (affine) transformation parameters. Finally, SoftMax normalisation yields a K-vector \(\mathbf{p}\left(\mathbf{X}\right)= {\text{SoftMax}}\left(\mathbf{f}\left(\mathbf{X}\right)\right)\). All other hyperparameter settings are defined in Supplementary Table S2. This baseline Resnet model architecture was inherited by all other models in this study to control inductive biases.
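The forward pass described above can be sketched as a NumPy toy (our own code with illustrative, non-paper dimensions; batch normalisation is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * np.tanh(np.log1p(np.exp(x)))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

D, H, K, L = 20, 64, 5, 3  # toy sizes; the paper uses 1024-neuron layers
params = {
    "W0": rng.normal(0, 0.1, (D, H)), "b0": np.zeros(H),
    "Wl": [rng.normal(0, 0.1, (H, H)) for _ in range(L)],
    "bl": [np.zeros(H) for _ in range(L)],
    "Wmu": rng.normal(0, 0.1, (H, K)), "bmu": np.zeros(K),
}

def resnet_forward(X, p):
    U = mish(X @ p["W0"] + p["b0"])          # input layer U^(0)
    for W, b in zip(p["Wl"], p["bl"]):       # residual hidden layers
        U = mish(U @ W + b) + U              # U^(l) = g(<U^(l-1),W>+b) + U^(l-1)
    f = mish(U @ p["Wmu"] + p["bmu"])        # logit-space mean f(X)
    return softmax(f)                        # probability vectors p(X)
```

Each row of the output lies on the probability simplex, and `argmax` over it gives the predicted primary cancer site index \(c\).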
Approximate Bayesian inference
Bayesian inference may yield a predictive distribution about sample \({\mathbf{x}}^{*}\), \(p(\mathbf{p}\mid {\mathbf{x}}^{*},\mathscr{D})\), from the likelihood of an assumed parametric model \(p(\mathbf{p}\mid {\mathbf{x}}^{\boldsymbol{*}},\Theta)\), an (approximate) parametric posterior \(q\left(\Theta \mid \mathscr{D}\right)\), and potentially a Monte Carlo Integration (MCI) technique, also referred to as Bayesian model averaging:

$$p(\mathbf{p}\mid {\mathbf{x}}^{*},\mathscr{D})=\int p(\mathbf{p}\mid {\mathbf{x}}^{*},\Theta )\,q(\Theta \mid \mathscr{D})\,d\Theta \approx \frac{1}{T}\sum_{t=1}^{T}p(\mathbf{p}\mid {\mathbf{x}}^{*},{\Theta }_{t}),\quad {\Theta }_{t}\sim q(\Theta \mid \mathscr{D}).$$
Most neural networks are parametric models, which assume \(\Theta\) can perfectly represent \(\mathscr{D}\). As a result, the model likelihood \(p(\mathbf{p}\mid {\mathbf{x}}^{*},\mathscr{D},\Theta)\) is often replaced with \(p(\mathbf{p}\mid {\mathbf{x}}^{*},\Theta )\). The main differentiating factor among Bayesian deep learning inference methods lies in how the parametric posterior \(q\left(\Theta \mid \mathscr{D}\right)\) is approximated.
Resnet extended with Monte Carlo Dropout
The MCD model approximates the parametric posterior \(q(\Theta \mid \mathscr{D})\) by keeping dropout activated during inference^{43}. Dropout randomly ‘switches off’ a subset of neurons (setting them to zero) at each iteration. Hence, a collection of dropout configurations \({\left\{{\Theta }_{t}\right\}}_{t=1}^{T}\) are samples from the (approximate) posterior \(q(\Theta \mid \mathscr{D})\). For more information, refer to the Appendix of ref.^{43}, where an approximate dual connection between Monte Carlo Dropout neural networks and deep Gaussian processes is established.
The MCD also extends the Resnet model architecture by including an additional output layer to estimate a data-dependent variance function \({\mathbf{s}}_{t}^{2}\left(\mathbf{X}\right)= g(\langle {\mathbf{U}}^{(L)},{\mathbf{W}}_{t}^{(\Sigma )}\rangle +{\mathbf{b}}_{t}^{(\Sigma )})\) in addition to the (now stochastic) mean function \({\mathbf{f}}_{t}\left(\mathbf{X}\right)= g\left(\langle {\mathbf{U}}^{\left(L\right)},{\mathbf{W}}_{t}^{(\mu )}\rangle +{\mathbf{b}}_{t}^{(\mu )}\right)\). Both final output layers had a shared input \({\mathbf{U}}^{(L)}\), but unique parameters \(\left\{{\mathbf{W}}_{t}^{(\mu )},{\mathbf{b}}_{t}^{(\mu )}\right\}\) and \(\left\{{\mathbf{W}}_{t}^{(\Sigma )},{\mathbf{b}}_{t}^{(\Sigma )}\right\}\). Together, the stochastic mean \({\mathbf{f}}_{t}\left(\mathbf{X}\right)\) and variance \({\mathbf{s}}_{t}^{2}\left(\mathbf{X}\right)\) specify a Gaussian distribution in logit-space, which was sampled once, \({\mathbf{u}}_{t}\left(\mathbf{X}\right)\sim \mathscr{N}\left(\mu ={\mathbf{f}}_{t}\left(\mathbf{X}\right),\Sigma ={\mathbf{s}}_{t}^{2}{\left(\mathbf{X}\right)}^{T}\mathbf{I}\right)\), and normalised with the SoftMax function \({\mathbf{p}}_{t}\left(\mathbf{X}\right)= {\text{SoftMax}}\left({\mathbf{u}}_{t}\left(\mathbf{X}\right)\right)\). \({\mathbf{p}}_{t}\left(\mathbf{X}\right)\) represents a single sample from the model likelihood \(p(\mathbf{p}\mid \mathbf{x},\Theta )\), from which \(T\) samples are averaged for Monte Carlo integration:

$$\mathbf{p}\left(\mathbf{X}\right)=\frac{1}{T}\sum_{t=1}^{T}{\mathbf{p}}_{t}\left(\mathbf{X}\right).$$
Finally, \(\mathbf{p}\left(\mathbf{X}\right)\) estimates the cancer primary site label \(y\), the predictive uncertainties \(\text{Conf(.)}\), and \(\mathscr{H}\left(\text{.}\right)\) for each individual sample in data batch \(\mathbf{x}\).
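The sampling-and-averaging procedure above can be sketched in a few lines. The following is a minimal, self-contained illustration of Monte Carlo integration, in which `toy_forward` is a hypothetical stand-in for one stochastic pass of the dropout network (not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(u):
    e = np.exp(u - u.max(axis=-1, keepdims=True))  # numerically stable SoftMax
    return e / e.sum(axis=-1, keepdims=True)

def toy_forward(x, K=5):
    """Hypothetical stand-in for one stochastic forward pass: a dropout mask on
    the mean logits f_t(x), plus Gaussian logit noise from s_t^2(x)."""
    mean = np.linspace(-1.0, 1.0, K)                  # f_t(x)
    mask = rng.random(K) > 0.1                        # dropout configuration Theta_t
    return mean * mask + rng.normal(0.0, 0.3, K)      # u_t ~ N(f_t, s_t^2 I)

def mc_predict(x, T=250):
    """Monte Carlo integration: average T likelihood samples p_t(x)."""
    return np.mean([softmax(toy_forward(x)) for _ in range(T)], axis=0)

p = mc_predict(x=None)                 # p(X): the averaged predictive distribution
conf = p.max()                         # Conf(x)
entropy = -(p * np.log(p)).sum()       # Shannon's Entropy H(p(x))
```

The averaged vector `p` is then used exactly as in the text: its largest element gives the confidence score and its entropy the total predictive uncertainty.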
MCD extended with a bi-Lipschitz constraint
The Bi-Lipschitz model shared all the properties of the MCD model with an additional bi-Lipschitz constraint:

$${L}_{1}{\Vert {\mathbf{x}}_{1}-{\mathbf{x}}_{2}\Vert }_{\mathscr{X}}\le {\Vert \mathbf{f}\left({\mathbf{x}}_{1}\right)-\mathbf{f}\left({\mathbf{x}}_{2}\right)\Vert }_{\mathscr{F}}\le {L}_{2}{\Vert {\mathbf{x}}_{1}-{\mathbf{x}}_{2}\Vert }_{\mathscr{X}},$$
where scalars \({L}_{1}\) and \({L}_{2}\) respectively control the tightness of the lower and upper bound. The norm operators \(\left\{{\Vert \text{.}\Vert }_{\mathscr{X}},{\Vert \text{.}\Vert }_{\mathscr{F}}\right\}\) are over the data space \(\mathscr{X}\) and function space \(\mathscr{F}\). The effect of the bi-Lipschitz constraint is that changes in the input data \({\Vert {\mathbf{x}}_{1}-{\mathbf{x}}_{2}\Vert }_{\mathscr{X}}\) (e.g. distribution shifts) are proportional to changes in the output, \({\Vert \mathbf{f}\left({\mathbf{x}}_{1}\right)-\mathbf{f}\left({\mathbf{x}}_{2}\right)\Vert }_{\mathscr{F}}\), within a bound determined by \({L}_{1}\) (controlling sensitivity) and \({L}_{2}\) (controlling smoothness). Interestingly, recent studies have established that bi-Lipschitz constraints are beneficial to the robustness of neural networks under distributional shifts^{44,45}. Sensitivity (i.e. \({L}_{1}\)) is controlled with residual connections^{66,67}, which allow \(\mathbf{f}\left(\mathbf{x}\right)\) to avoid arbitrarily small changes, especially in the presence of distributional shifts in those regions of \(\mathscr{X}\) with no (training data) support^{44}. Smoothness (i.e. \({L}_{2}\)) is controlled with spectral normalisation on parameters \(\Theta\)^{44,68} and batch-normalisation functions^{45}, which allow \(\mathbf{f}\left(\mathbf{x}\right)\) to avoid arbitrarily large changes (under shifts) that induce feature collapse and extreme overconfidence^{44,45,46}.
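The two halves of the constraint can be sketched in isolation: spectral normalisation caps a layer's largest singular value (bounding \(L_2\)), while a residual connection preserves an identity path (bounding \(L_1\) away from zero). A minimal NumPy sketch with toy weights, illustrative only and not the authors' implementation:

```python
import numpy as np

def spectral_normalise(W, n_iter=50):
    """Estimate the largest singular value of W by power iteration, then
    rescale W so its spectral norm is at most 1 (upper Lipschitz bound)."""
    v = np.ones(W.shape[1]) / np.sqrt(W.shape[1])
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    sigma = u @ W @ v                      # dominant singular value
    return W / max(sigma, 1.0)

def residual_layer(x, W):
    """h(x) = x + g(Wx): the identity path prevents arbitrarily small output
    changes (sensitivity); the normalised W prevents arbitrarily large ones
    (smoothness)."""
    return x + np.tanh(spectral_normalise(W) @ x)

rng = np.random.default_rng(0)
W = 3.0 * rng.normal(size=(4, 4))          # toy weights with large spectral norm
x1, x2 = rng.normal(size=4), rng.normal(size=4)
# Output distance tracks input distance within the bi-Lipschitz bounds;
# here the layer's Lipschitz constant is at most 2 (identity + 1-Lipschitz branch).
ratio = (np.linalg.norm(residual_layer(x1, W) - residual_layer(x2, W))
         / np.linalg.norm(x1 - x2))
```

In practice the normalisation is applied to every weight matrix during training (e.g. via a framework's spectral-norm utility); the sketch only shows why the pair of mechanisms bounds the layer from both sides.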
Deep ensemble of Bi-Lipschitz models
The Ensemble model was a collection of eight independently trained Bi-Lipschitz models with unique initial parameter configurations. Each Bayesian model in the Ensemble was sampled \(T/10\,(=25)\) times, and the samples were then pooled, to control the amount of Monte Carlo integration between the ‘Ensemble’ and all other models.
Models in deep ensembles yield similarly performant (low-loss) solutions, but are diverse and distant in parameter- and function-space^{69}. This allows the ensemble to have an (approximate) posterior \(q\left(\Theta \mid \mathscr{D}\right)\) with multiple modes, which was not the case for the Resnet, MCD, and Bi-Lipschitz models. We believe the ensemble modelled \(q\left(\Theta \mid \mathscr{D}\right)\) with the highest fidelity to the true parametric posterior \(p\left(\Theta \mid \mathscr{D}\right)\), based on empirical evidence from other studies^{27,48,70,71}.
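Pooling the members' Monte Carlo samples amounts to one extra averaging axis. A sketch with toy probabilities (the real members would be trained Bi-Lipschitz networks):

```python
import numpy as np

def ensemble_predict(member_probs):
    """Pool MC samples across ensemble members into one predictive distribution.
    member_probs has shape (M, S, K): M members, S MC samples each, K classes."""
    member_probs = np.asarray(member_probs)
    return member_probs.mean(axis=(0, 1))

rng = np.random.default_rng(1)
M, S, K = 8, 25, 5         # eight members, T/10 = 25 samples each, as in the text
logits = rng.normal(size=(M, S, K))
probs = np.exp(logits)
probs /= probs.sum(axis=-1, keepdims=True)   # per-sample SoftMax
p = ensemble_predict(probs)                  # multi-modal posterior, flat average
```

Because each member sits in a different posterior mode, the flat average over members approximates integration over a multi-modal \(q(\Theta \mid \mathscr{D})\).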
Model efficacy assessment
Model efficacy was assessed using several metrics chosen with practical relevance in mind (justification provided in the Supplementary Information—S1.2). Predictive performance, the predictive uncertainties, and the total overconfidence were measured with the micro-F1 score, Shannon's Entropy, and the Expected Calibration Error (ECE), respectively. F1-AUC was used to evaluate the robustness of the predictive performance and the uncertainty's error-rate correlation. The Area between Development and Production curve (ADP) metric was designed to complement F1-AUC by evaluating robustness to shift-induced overconfidence. It may be interpreted as the expected predictive loss during a model's transition from development inference (IID) to production inference (OOD), while controlling for the uncertainty threshold.
Quantifying predictive uncertainty
A predictive uncertainty (or total uncertainty) indicates the likelihood of an erroneous inference \(\mathbf{p}\left(\mathbf{x}\right)={\text{SoftMax}}\left(\mathbf{f}\left(\mathbf{x}\right)\right)\), with a probability vector \(\mathbf{p}\left(\mathbf{x}\right)\in {[\mathrm{0,1}]}^{K}\), normalising operator \({\text{SoftMax}}\left(.\right)\), a pointwise function in logit-space \(\mathbf{f}\left(.\right)\), and a gene expression vector \(\mathbf{x}\in {\mathbb{R}}^{D}\). The ideal predictive uncertainty depends on the combination of many factors, including the training data \({\mathscr{D}}_{train}={\left\{\left({\mathbf{x}}_{i},{y}_{i}\right)\right\}}_{i=1}^{n}\), the model specification (e.g. model architecture, hyperparameters, etc.), inherent noise in the data, the model parameters \(\Theta\), the test data inputs \(\mathbf{x}\in {\mathscr{D}}_{test}\) (if modelling heteroscedastic noise), and hidden confounding variables causing distribution shifts. Consequently, there are many statistics, each explaining a different phenomenon, which make up the predictive uncertainty. Given that some subdivisions of uncertainty are exclusive to distribution-wise predictive models^{72}, we restricted ourselves to uncertainties that are accessible to both pointwise and distribution-wise models, namely the confidence score, \(\mathrm{Conf}(\mathbf{x})\), and Shannon's Entropy, \(\mathscr{H}(\mathbf{p}\left(\mathbf{x}\right))\).
A model's confidence score with reference to sample \(\mathbf{x}\) is defined as the largest element of the SoftMax vector,

$$\mathrm{Conf}\left(\mathbf{x}\right)={\Vert \mathbf{p}\left(\mathbf{x}\right)\Vert }_{\infty },$$

where \({\Vert \mathbf{p}\left(\mathbf{x}\right)\Vert }_{\infty }\) denotes the infinity norm (the largest element) of the vector \(\mathbf{p}\left(\mathbf{x}\right)\). Confidence scores approximately quantify the probability of being correct, and thus they are often used for rejecting ‘untrustworthy’ predictions (recall ‘uncertainty thresholding’ from the Introduction). Moreover, the average \(\mathrm{Conf}(\mathbf{x})\) is comparable to the accuracy metric, which allows overconfidence to be evaluated via the ECE, which we detail shortly.
Another notion of predictive uncertainty is that of Shannon's Entropy, i.e.,

$$\mathscr{H}\left(\mathbf{p}\left(\mathbf{x}\right)\right)=-\langle \mathbf{p}\left(\mathbf{x}\right),\mathrm{log}\,\mathbf{p}\left(\mathbf{x}\right)\rangle ,$$

where \(\langle .,.\rangle\) is the dot product operator. Recall that \(\mathscr{H}\left(\mathbf{p}\right)\) is maximised when \(\mathbf{p}\) encodes a uniform distribution.
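Both uncertainty measures follow directly from the SoftMax vector; a short sketch (toy probability vectors, ours):

```python
import numpy as np

def confidence(p):
    """Conf(x) = ||p(x)||_inf, the largest SoftMax probability."""
    return float(np.max(p))

def shannon_entropy(p, eps=1e-12):
    """H(p) = -<p, log p>; zero for a one-hot p, maximal (log K) for a uniform p."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)   # guard log(0)
    return float(-(p * np.log(p)).sum())

p_peaked = np.array([0.97, 0.01, 0.01, 0.01])   # confident prediction
p_uniform = np.full(4, 0.25)                    # maximally uncertain prediction
```

A peaked distribution yields high confidence and low entropy; a uniform one yields the minimal confidence \(1/K\) and the maximal entropy \(\log K\).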
Defining out-of-distribution data and its effects on DL
The IID assumption on data implies true causal mechanisms (i.e. a structural causal model) in which the underlying data-generating process is immutable across observations, and hence the samples are independently generated from the same distribution^{58}. The OOD assumption, however, underpins a different setting in which the underlying causal mechanisms are affected (e.g. via interventions), and thus the distribution of the data changes^{73}. There are many different types of distributional shifts, all of which negatively affect model performance. Deep learning models can degrade under distribution shifts because the IID assumption is necessary for most optimisation strategies (Supplementary Information—S4.1). Furthermore, it is worth noting that the resulting overconfidence can be extreme, whereby arbitrary model predictions correspond with maximal confidence scores \({s}_{i}\to 1\)^{45} (Supplementary Information—S4.2).
Evaluation in OOD using ECE
The Expected Calibration Error was determined by binning each model's confidence scores into \(M\) equal-width bins. The absolute difference between each bin's accuracy and its average maximum SoftMax score is then averaged, weighting each bin proportionally to its sample count. The ECE is defined as follows:

$$\mathrm{ECE}=\sum_{m=1}^{M}\frac{\left|{B}_{m}\right|}{n}\left|\mathrm{acc}\left({B}_{m}\right)-\mathrm{conf}\left({B}_{m}\right)\right|,$$

where \(\left|{B}_{m}\right|\) is the number of predictions in bin \(m\), \(n\) is the total number of samples, and \(\mathrm{acc}({B}_{m})\) and \(\mathrm{conf}({B}_{m})\) are the accuracy and average confidence score of bin \(m\), respectively.
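The binning procedure can be written in a few lines; a sketch assuming equal-width bins and a 0/1 correctness vector (toy scores, ours):

```python
import numpy as np

def expected_calibration_error(conf, correct, M=10):
    """ECE over M equal-width confidence bins: the sample-weighted average of
    |acc(B_m) - conf(B_m)|."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n, ece = conf.size, 0.0
    edges = np.linspace(0.0, 1.0, M + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)       # bin membership B_m
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += (in_bin.sum() / n) * gap       # weight by |B_m| / n
    return ece

# Perfectly calibrated toy scores: 75% confidence matches 75% empirical accuracy.
ece_calibrated = expected_calibration_error(conf=[0.75] * 4, correct=[1, 1, 1, 0])
# Overconfident toy scores: 90% confidence, 0% accuracy.
ece_overconfident = expected_calibration_error(conf=[0.9] * 4, correct=[0, 0, 0, 0])
```

A calibrated model drives the ECE toward zero; shift-induced overconfidence pushes it toward the confidence–accuracy gap.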
Evaluation in OOD using the area under the F1-Retention curve (F1-AUC)
The Area under the F1-Retention Curve (F1-AUC) was used to evaluate model performance in OOD, as it accounts for both predictive accuracy and the uncertainty's error-rate correlation^{47}. High F1-AUC values result from high accuracy (reflected by vertical shifts in F1-Retention curves) and/or high uncertainty error-rate correlation (reflected by the gradient of the F1-Retention curves). An uncertainty's error-rate correlation is important in the production (OOD) context, as higher correlations imply more erroneous predictions are discarded.
F1-AUC was quantified according to the following method:

1. Predictions were sorted in descending order of uncertainty.
2. All predictions were iterated over in order once; at each iteration, F1 and retention (initially 100%) were calculated before replacing the current prediction with its ground truth, thereby decreasing the retention.
3. The increasing F1 scores and the corresponding decreasing retention rates determined the F1-Retention curve.
4. Approximate integration of the F1-Retention curve determined F1-AUC.
F1-Retention curves and F1-AUC metrics were quantified for all models on OOD data, including samples from classes that were not seen during training.
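The four steps above can be sketched as follows. For brevity, micro-F1 is computed as plain accuracy, which is what it reduces to for single-label multi-class predictions; the data are toy values of ours, not the study's:

```python
import numpy as np

def f1_retention_auc(y_true, y_pred, uncertainty):
    """Sort by descending uncertainty, sweep the retention rate while replacing
    predictions with ground truth, then integrate the F1-Retention curve."""
    y_true = np.asarray(y_true)
    pred = np.asarray(y_pred).copy()
    order = np.argsort(-np.asarray(uncertainty))     # 1. most uncertain first
    n = len(y_true)
    f1s, retention = [], []
    for k, i in enumerate(order):                    # 2. iterate once
        f1s.append((pred == y_true).mean())          #    micro-F1 (= accuracy here)
        retention.append(1.0 - k / n)
        pred[i] = y_true[i]                          #    replace with ground truth
    f1s.append(1.0)                                  # 3. curve endpoint, 0% retention
    retention.append(0.0)
    x = np.array(retention[::-1])
    y = np.array(f1s[::-1])
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))  # 4. trapezoidal rule

y_true = np.array([0, 1, 2, 1])
y_pred = np.array([0, 1, 2, 0])          # one error...
unc = np.array([0.1, 0.2, 0.3, 0.9])     # ...flagged with the highest uncertainty
auc = f1_retention_auc(y_true, y_pred, unc)   # high: the error is discarded first
```

Because the single error carries the highest uncertainty, it is replaced at the first iteration and the curve rises to 1.0 immediately, yielding an AUC close to the maximum.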
Using ADP for evaluating models in OOD data relative to IID data
The Area between the Development and Production Curve (ADP) aimed to complement F1-AUC, especially in the context of deploying models from development inference (IID) to production inference (OOD). Thus, ADP was designed to capture (in OOD data, relative to IID) three aspects of a model's robustness: accuracy, uncertainty error-rate correlation, and shift-induced overconfidence. This matters because benchmarked models can degrade similarly in the robustness of accuracy and uncertainty error-rate correlation (as measured by F1-AUC), yet differ significantly in their uncertainty calibration (as measured by ADP).
ADP was calculated according to the following method:

1. Development and Production F1-Uncertainty curves were produced by iteratively calculating F1 and discarding (not replacing) samples in descending order of uncertainty.
2. A nominal F1 target range of \(\left[\mathrm{min}\left(\mathrm{F}{1}_{dev}\right),\mathrm{max}\left(\mathrm{F}{1}_{dev}\right)\right]=\left[0.975, 0.990\right]\) was selected based on the Development F1-Uncertainty curve, with \(\left(\mathrm{F}{1}_{dev}, {\mathscr{U}}_{accept}\right)\) denoting a point on the Development F1-Uncertainty curve at uncertainty threshold \({\mathscr{U}}_{accept}\).
3. Nominal F1 target points, \(\mathrm{F}{1}_{nom}\), were incremented at 1e-5 intervals from \(\mathrm{F}{1}_{nom}=\mathrm{min}(\mathrm{F}{1}_{dev})\) to \(\mathrm{F}{1}_{nom}=\mathrm{max}(\mathrm{F}{1}_{dev})\), with the per cent decrease in F1, from development \(\mathrm{F}{1}_{nom}\) to production \(\mathrm{F}{1}_{prod}\), recalculated at each step:
$${\mathrm{Decrease}}^{\left(dev\to prod\right)}\left(\mathrm{F}{1}_{nom}\right)=\left(\mathrm{F}{1}_{nom}-\mathrm{F}{1}_{prod}\right)\times 100\%.$$
4. The set of recalculated \({\mathrm{Decrease}}^{\left(dev\to prod\right)}(\mathrm{F}{1}_{nom})\) values was averaged to approximate the Area between the Development and Production curves (ADP).
The ADP may be interpreted as “the expected decrease in accuracy when transitioning from development to production if uncertainty thresholding is utilised to boost reliability”.
It is important to note that our method for selecting the range \(\left[\mathrm{min}\left(\mathrm{F}{1}_{\mathrm{dev}}\right),\mathrm{max}\left(\mathrm{F}{1}_{\mathrm{dev}}\right)\right]\) was not arbitrary and required two checks for each model's Development F1-Uncertainty curve. The first check was to ensure the sample size corresponding to \(\mathrm{max}\left(\mathrm{F}{1}_{\mathrm{dev}}\right)\) was sufficiently large (see Supplementary Table S3). The second check was to ensure that \(\mathrm{min}\left(\mathrm{F}{1}_{\mathrm{dev}}\right)\) was large enough to satisfy production needs. Failing to undertake these checks may result in the ADP statistic misleading explanations of the expected loss when deploying models to production.
ADP is practically relevant because it relates to the uncertainty-thresholding technique for improving reliability in production (recall the Introduction). \({\mathrm{Decrease}}^{\left(dev\to prod\right)}(\mathrm{F}{1}_{nom})\) first depends on a nominated target performance \(\mathrm{F}{1}_{nom}\), which selects the corresponding \({\mathscr{U}}_{accept}\) from the Development F1-Uncertainty curve. Predictions with uncertainties below \({\mathscr{U}}_{accept}\) are accepted in production, with performance denoted by \(\mathrm{F}{1}_{prod}\). To the best of the authors' knowledge, no other metric monitors the three robustness components of accuracy, uncertainty error-rate correlation, and shift-induced overconfidence.
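The ADP procedure can be sketched end-to-end. This is a rough illustration only: micro-F1 is computed as plain accuracy for single-label predictions, the target grid is far coarser than the paper's 1e-5 intervals, the data are toy values, and the function names are ours:

```python
import numpy as np

def f1_after_thresholding(y_true, y_pred, unc, u_accept):
    """Micro-F1 (= accuracy here) of the predictions accepted at u_accept."""
    keep = unc <= u_accept
    return (y_pred[keep] == y_true[keep]).mean() if keep.any() else 1.0

def adp(dev, prod, f1_min, f1_max, n_targets=3):
    """Mean percent decrease (F1_nom - F1_prod) * 100 over nominal targets,
    carrying the development threshold U_accept into production."""
    y_d, p_d, u_d = dev
    y_p, p_p, u_p = prod
    thresholds = np.sort(np.unique(u_d))
    dev_f1 = np.array([f1_after_thresholding(y_d, p_d, u_d, u) for u in thresholds])
    decreases = []
    for f1_nom in np.linspace(f1_min, f1_max, n_targets):
        reachable = thresholds[dev_f1 >= f1_nom]
        if reachable.size == 0:
            continue                        # target not achievable in development
        u_accept = reachable.max()          # loosest threshold meeting the target
        f1_prod = f1_after_thresholding(y_p, p_p, u_p, u_accept)
        decreases.append((f1_nom - f1_prod) * 100.0)
    return float(np.mean(decreases))

# Development: the one error carries the highest uncertainty (well calibrated).
dev = (np.array([0, 1, 0, 1]), np.array([0, 1, 0, 0]), np.array([0.1, 0.2, 0.3, 0.9]))
# Production: shifted data, confidently wrong (shift-induced overconfidence).
prod = (np.array([0, 1]), np.array([1, 0]), np.array([0.2, 0.5]))
score = adp(dev, prod, f1_min=0.9, f1_max=1.0)
```

In this toy case the development curve promises near-perfect F1 at the chosen thresholds, but the accepted production predictions are wrong, so the expected decrease is large, which is exactly the shift-induced overconfidence ADP is designed to expose.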
Ethics approval and consent to participate
This project used RNAseq data which was previously published or is in the process of publication. The QIMR Berghofer Human Research Ethics Committee approved use of public data (P2095).
Data availability
This project used RNAseq data, which was previously published or available at European GenomePhenome Archive (EGA)—EGAS00001002864. TCGA data was accessed from the National Cancer Institute Genomic Data Commons data portal (downloaded on 23rd Mar 2020), Met500 data was accessed from the University of California Santa Cruz Xena (downloaded 10th Oct 2020), and ICD data is available at EGA under study accession numbers EGAS00001000397, EGAS00001001552, EGAS00001003438, EGAS00001000154, EGAS00001001732, EGAS00001004619 and EGAS00001002864.
Code availability
Code available upon request.
References
Cao, C. et al. Deep learning and its applications in biomedicine. Genom. Proteom. Bioinform. 16(1), 17–32. https://doi.org/10.1016/j.gpb.2017.07.003 (2018).
Tran, K. A. et al. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med. 13(1), 152. https://doi.org/10.1186/s1307302100968x (2021).
Wang, M., Zhang, Q., Lam, S., Cai, J. & Yang, R. A review on application of deep learning algorithms in external beam radiotherapy automated treatment planning. Front. Oncol. https://doi.org/10.3389/fonc.2020.580919 (2020).
Zhu, W., Xie, L., Han, J. & Guo, X. The application of deep learning in cancer prognosis prediction. Cancers 12(3), 603. https://doi.org/10.3390/cancers12030603 (2020).
Schelb, P. et al. Classification of cancer at prostate MRI: Deep learning versus clinical PIRADS assessment. Radiology 293(3), 607–617. https://doi.org/10.1148/radiol.2019190938 (2019).
Ozdemir, O., Russell, R. & Berlin, A. A 3D probabilistic deep learning system for detection and diagnosis of lung cancer using lowdose CT scans. IEEE Trans. Med. Imaging 39, 1419–1429. https://doi.org/10.1109/TMI.2019.2947595 (2019).
Su, A. et al. A deep learning model for molecular label transfer that enables cancer cell identification from histopathology images. NPJ Precis. Oncol. https://doi.org/10.1038/s41698022002520 (2022).
Jiao, W. et al. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nat. Commun. https://doi.org/10.1038/s41467019138258 (2020).
Tuong, Z. K. et al. Resolving the immune landscape of human prostate at a singlecell level in health and cancer. Cell Rep. 37(12), 110132. https://doi.org/10.1016/j.celrep.2021.110132 (2021).
Yap, M. et al. Verifying explainability of a deep learning tissue classifier trained on RNAseq data. Sci. Rep. https://doi.org/10.1038/s41598021817739 (2021).
Gayoso, A. et al. Joint probabilistic modeling of singlecell multiomic data with totalVI. Nat. Methods 18(3), 272–282. https://doi.org/10.1038/s4159202001050x (2021).
Luecken, M. D. et al., A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. Presented at the Thirtyfifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). Accessed: Jun. 06, 2022. [Online]. Available: https://openreview.net/forum?id=gN35BGa1Rt (2021).
Park, C. M. & Lee, J. H. Deep learning for lung cancer nodal staging and realworld clinical practice. Radiology 302(1), 212–213. https://doi.org/10.1148/radiol.2021211981 (2022).
Weberpals, J. et al. Deep learningbased propensity scores for confounding control in comparative effectiveness research: A largescale, realworld data study. Epidemiol. Camb. Mass 32(3), 378–388. https://doi.org/10.1097/EDE.0000000000001338 (2021).
MacDonald, S., Kaiah, S. & Trzaskowski, M. Interpretable AI in healthcare: Enhancing fairness, safety, and trust. In Artificial Intelligence in Medicine: Applications, Limitations and Future Directions (eds Raz, M. et al.) 241–258 (Springer, 2022).
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215. https://doi.org/10.1038/s422560190048x (2019).
Gal, Y. Uncertainty in Deep Learning (University of Cambridge, 2016).
Gawlikowski, J. et al. A Survey of Uncertainty in Deep Neural Networks. arXiv, https://arxiv.org/2107.03342. https://doi.org/10.48550/arXiv.2107.03342 (2022).
Barragán-Montero, A. et al. Towards a safe and efficient clinical implementation of machine learning in radiation oncology by exploring model interpretability, explainability and data-model dependency. Phys. Med. Biol. 67(11), 11TR01. https://doi.org/10.1088/13616560/ac678a (2022).
Kristiadi, A., Hein, M. & Hennig, P. Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks. arXiv, https://arxiv.org/2002.10118. https://doi.org/10.48550/arXiv.2002.10118 (2020).
Minderer, M. et al. Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems, vol. 34, 15682–15694. Accessed: Jun. 06, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/8420d359404024567b5aefda1231af24Abstract.html (2021).
Ovadia, Y. et al. Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. arXiv, https://arxiv.org/1906.02530. https://doi.org/10.48550/arXiv.1906.02530 (2019).
French, R. M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3(4), 128–135. https://doi.org/10.1016/S13646613(99)012942 (1999).
Gupta, S. et al. Addressing Catastrophic Forgetting for Medical Domain Expansion. arXiv, https://arxiv.org/2103.13511. https://doi.org/10.48550/arXiv.2103.13511 (2021).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On Calibration of Modern Neural Networks. arXiv, https://arxiv.org/1706.04599. https://doi.org/10.48550/arXiv.1706.04599 (2017).
Khan, M. E. & Rue, H. The Bayesian Learning Rule. arXiv, https://arxiv.org/2107.04562. https://doi.org/10.48550/arXiv.2107.04562 (2022).
Wilson, A. G. & Izmailov, P. Bayesian Deep Learning and a Probabilistic Perspective of Generalization. https://doi.org/10.48550/arXiv.2002.08791 (2020).
Divate, M. et al. Deep learningbased pancancer classification model reveals tissueoforigin specific gene expression signatures. Cancers 14(5), 1185. https://doi.org/10.3390/cancers14051185 (2022).
Grewal, J. K. et al. Application of a neural network whole transcriptomebased pancancer method for diagnosis of primary and metastatic cancers. JAMA Netw. Open 2(4), e192597. https://doi.org/10.1001/jamanetworkopen.2019.2597 (2019).
Zhao, Y. et al. CUPAIDx: A tool for inferring cancer tissue of origin and molecular subtype using RNA geneexpression data and artificial intelligence. EBioMedicine 61, 103030. https://doi.org/10.1016/j.ebiom.2020.103030 (2020).
Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. Poznan Pol. 19(1A), A6877. https://doi.org/10.5114/wo.2014.47136 (2015).
Hoadley, K. A. et al. Celloforigin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell 173(2), 291304.e6. https://doi.org/10.1016/j.cell.2018.03.022 (2018).
Robinson, D. R. et al. Integrative clinical genomics of metastatic cancer. Nature 548, 7667. https://doi.org/10.1038/nature23306 (2017).
Akgül, S. et al. Intratumoural heterogeneity underlies distinct therapy responses and treatment resistance in glioblastoma. Cancers https://doi.org/10.3390/cancers11020190 (2019).
Aoude, L. G. et al. Radiomics biomarkers correlate with CD8 expression and predict immune signatures in melanoma patients. Mol. Cancer Res. 19(6), 950–956. https://doi.org/10.1158/15417786.MCR201038 (2021).
Bailey, P. et al. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature 531, 7592. https://doi.org/10.1038/nature16965 (2016).
Hayward, N. K. et al. Wholegenome landscapes of major melanoma subtypes. Nature 545, 7653. https://doi.org/10.1038/nature22071 (2017).
Lee, J. H. et al. Transcriptional downregulation of MHC class I and melanoma dedifferentiation in resistance to PD1 inhibition. Nat. Commun. https://doi.org/10.1038/s41467020157267 (2020).
Newell, F. et al. Multiomic profiling of checkpoint inhibitortreated melanoma: Identifying predictors of response and resistance, and markers of biological discordance. Cancer Cell 40(1), 88102.e7. https://doi.org/10.1016/j.ccell.2021.11.012 (2022).
Newell, F. et al. Wholegenome sequencing of acral melanoma reveals genomic complexity and diversity. Nat. Commun. 11(1), 5259. https://doi.org/10.1038/s41467020189883 (2020).
Patch, A.M. et al. Whole–genome characterization of chemoresistant ovarian cancer. Nature 521, 7553. https://doi.org/10.1038/nature14410 (2015).
Scarpa, A. et al. Wholegenome landscape of pancreatic neuroendocrine tumours. Nature 543, 7643. https://doi.org/10.1038/nature21063 (2017).
Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv, https://arxiv.org/1506.02142. https://doi.org/10.48550/arXiv.1506.02142 (2016).
Liu, J. Z., Lin, Z., Padhy, S., Tran, D., BedraxWeiss, T. & Lakshminarayanan, B. Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness. arXiv, https://arxiv.org/2006.10108. https://doi.org/10.48550/arXiv.2006.10108 (2020).
van Amersfoort, J., Smith, L., Jesson, A., Key, O. & Gal, Y. On Feature Collapse and Deep Kernel Learning for Single Forward Pass Uncertainty. arXiv, https://arxiv.org/2102.11409. https://doi.org/10.48550/arXiv.2102.11409 (2022).
van Amersfoort, J., Smith, L., Teh, Y. W. & Gal, Y. Uncertainty Estimation Using a Single Deep Deterministic Neural Network. arXiv, https://arxiv.org/2003.02037. https://doi.org/10.48550/arXiv.2003.02037 (2020).
Malinin, A. et al. Shifts: A Dataset of Real Distributional Shift Across Multiple LargeScale Tasks. arXiv, https://arxiv.org/2107.07455. https://doi.org/10.48550/arXiv.2107.07455 (2022).
Izmailov, P., Nicholson, P., Lotfi, S. & Wilson, A. G. Dangers of Bayesian model averaging under covariate shift. In Advances in Neural Information Processing Systems, vol. 34, 3309–3322. Accessed: Jun. 06, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/1ab60b5e8bd4eac8a7537abb5936aadcAbstract.html (2021).
Mukhoti, J., Stenetorp, P. & Gal, Y. On the Importance of Strong Baselines in Bayesian Deep Learning. arXiv, https://arxiv.org/1811.09385. https://doi.org/10.48550/arXiv.1811.09385 (2018).
Murphy, K. P. Inference algorithms: an overview. In Probabilistic Machine Learning: Advanced Topics (draft), 319. [Online]. Available: probml.ai (MIT Press, 2022).
Abdar, M. et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion 76, 243–297. https://doi.org/10.1016/j.inffus.2021.05.008 (2021).
Hüllermeier, E. & Waegeman, W. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Mach. Learn. 110(3), 457–506. https://doi.org/10.1007/s10994021059463 (2021).
Jesson, A., Mindermann, S., Gal, Y. & Shalit, U. Quantifying ignorance in individuallevel causaleffect estimates under hidden confounding. In Proceedings of the 38th International Conference on Machine Learning, 4829–4838. Accessed: Jun. 29, 2022. [Online]. Available: https://proceedings.mlr.press/v139/jesson21a.html (2021).
Sambyal, A. S., Krishnan, N. C. & Bathula, D. R. Towards Reducing Aleatoric Uncertainty for Medical Imaging Tasks arXiv https://doi.org/10.48550/arXiv.2110.11012 (2022).
Ober, S. W., Rasmussen, C. E. & van der Wilk, M. The promises and pitfalls of deep kernel learning. arXiv, https://arxiv.org/2102.12108. https://doi.org/10.48550/arXiv.2102.12108 (2021).
Bronstein, M. M., Bruna, J., Cohen, T. & Veličković, P. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv. https://doi.org/10.48550/arXiv.2104.13478 (2021).
Shorten, C. & Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J. Big Data 6(1), 60. https://doi.org/10.1186/s4053701901970 (2019).
Peters, J., Janzing, D. & Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms (MIT Press, 2017).
Schölkopf, B. et al. Toward causal representation learning. Proc. IEEE 109(5), 612–634. https://doi.org/10.1109/JPROC.2021.3058954 (2021).
Xia, K., Lee, K.Z., Bengio, Y. & Bareinboim, E. The causalneural connection: expressiveness, learnability, and inference. In Advances in Neural Information Processing Systems, vol. 34, 10823–10836. Accessed: Jun. 29, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/5989add1703e4b0480f75e2390739f34Abstract.html (2021).
D’Amour, A. et al. Underspecification Presents Challenges for Credibility in Modern Machine Learning. arXiv. https://doi.org/10.48550/arXiv.2011.03395 (2020).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. https://doi.org/10.1038/s41586021038192 (2021).
Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP)—round XIV. Proteins Struct. Funct. Bioinform. 89(12), 1607–1617. https://doi.org/10.1002/prot.26237 (2021).
Misra, D. Mish: A Self Regularized NonMonotonic Activation Function. arXiv, https://arxiv.org/1908.08681. https://doi.org/10.48550/arXiv.1908.08681 (2020).
Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv, https://arxiv.org/1502.03167. https://doi.org/10.48550/arXiv.1502.03167 (2015).
Behrmann, J., Grathwohl, W.,Chen, R. T. Q., Duvenaud, D. & Jacobsen, J.H. Invertible Residual Networks. arXiv, https://arxiv.org/1811.00995. https://doi.org/10.48550/arXiv.1811.00995 (2019).
He, K., Zhang, X., Ren, S., & Sun, J. Deep Residual Learning for Image Recognition. arXiv, https://arxiv.org/1512.03385. https://doi.org/10.48550/arXiv.1512.03385 (2015).
Farnia, F., Zhang, J. M. & Tse, D. Generalizable Adversarial Training via Spectral Normalization. arXiv, https://arxiv.org/1811.07457. https://doi.org/10.48550/arXiv.1811.07457 (2018).
Fort, S., Hu, H. & Lakshminarayanan, B. Deep Ensembles: A Loss Landscape Perspective. arXiv, https://arxiv.org/1912.02757. https://doi.org/10.48550/arXiv.1912.02757 (2020).
Izmailov, P., Vikram, S., Hoffman, M. D. & Wilson, A. G. What Are Bayesian Neural Network Posteriors Really Like?. arXiv, https://arxiv.org/2104.14421. https://doi.org/10.48550/arXiv.2104.14421 (2021).
D’Angelo, F. & Fortuin, V. Repulsive deep ensembles are Bayesian. In Advances in Neural Information Processing Systems, vol. 34, 3451–3465. Accessed: Jun. 30, 2022. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/1c63926ebcabda26b5cdb31b5cc91efbAbstract.html (2021).
Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H. S. & Gal, Y. Deep Deterministic Uncertainty: A Simple Baseline. arXiv, https://arxiv.org/2102.11582. https://doi.org/10.48550/arXiv.2102.11582 (2022).
Zhang, K., Schölkopf, B., Muandet, K. & Wang, Z. Domain Adaptation Under Target and Conditional Shift. In Proceedings of the 30th International Conference on Machine Learning, 819–827. Accessed: Jun. 30, 2022. [Online]. Available: https://proceedings.mlr.press/v28/zhang13d.html (2013).
Acknowledgements
Special thanks to Yousef Rabi for their helpful discussions. We would also like to acknowledge members of the Medical Genomics and Genome Informatics teams at the QIMR Berghofer Medical Research Institute for their technical support. This research was partially supported by the Australian Research Council through an Industrial Transformation Training Centre for Information Resilience (IC200100022). This research was also partially supported by the Cooperative Research Centres (CRC-P) Grant (CRCPFIVE000176). Nicola Waddell is supported by a National Health and Medical Research Council of Australia (NHMRC) Senior Research Fellowship (APP1139071), and Olga Kondrashova is supported by an NHMRC Emerging Leader 1 Investigator Grant (APP2008631). The results published here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
Author information
Authors and Affiliations
Contributions
M.T., O.K., & N.W. supervised the project. M.Y., H.F., R.L.J., L.T.K, O.K., & S.M preprocessed the data. S.S., O.K., H.F., M.Y., & S.M. harmonised cancer type classes. S.M., H.F., & K.S. programmed and reviewed the models. N.W., M.T., J.V.P., & F.R. provided funding for the study. J.V.P. & S.W. provided study resources. V.A. assisted with study design and results interpretation. S.M., O.K., M.T., N.W. & F.R. conceived the study with input from all other authors. S.M., O.K., M.Y., H.F., M.T., & N.W. wrote the manuscript with contributions from all other authors.
Corresponding authors
Ethics declarations
Competing interests
M.Y., H.F., S.M., K.S. and M.T. are employed by Max Kelsen, which is a commercial company with an embedded research team. J.V.P. and N.W. are founders and shareholders of genomiQa Pty Ltd, and members of its Board. S.S., A.B., O.K., V.A., S.W, L.T.K. and R.L.J have no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
MacDonald, S., Foley, H., Yap, M. et al. Generalising uncertainty improves accuracy and safety of deep learning analytics applied to oncology. Sci Rep 13, 7395 (2023). https://doi.org/10.1038/s41598-023-31126-5
DOI: https://doi.org/10.1038/s41598-023-31126-5