Medicine, even from the earliest days of artificial intelligence (AI) research, has been one of the most inspiring and promising domains for the application of AI-based approaches. Equally, it has been one of the more challenging areas in which to achieve effective adoption. There are many reasons for this, primarily the reluctance to delegate decision making to machine intelligence in cases where patient safety is at stake. To address some of these challenges, medical AI, especially in its modern data-rich deep learning guise, needs to develop a principled and formal uncertainty quantification (UQ) discipline, just as we have seen in fields such as nuclear stockpile stewardship and risk management. The data-rich world of AI-based learning and the frequent absence of a well-understood underlying theory pose their own unique challenges to the straightforward adoption of UQ. These challenges, while not trivial, also present significant new research opportunities for the development of new theoretical approaches, and for the practical applications of UQ in the area of machine-assisted medical decision making. Understanding prediction system structure and defensibly quantifying uncertainty are possible, and, if done, can significantly benefit both research and practical applications of AI in this critical domain.
Decisions deeply informed by computer modelling, throughout its 70-year history, have shaped both the paradigm of model-based prediction and supercomputer architectures—from transistors to full systems. Starting from well-defined questions, analytic models are sewn together over many length- and timescales to yield numerical answers. But numerical results, without a measure of their veracity, do not provide the trust needed to inform decisions. Hence, fields of activity on prediction, validation against available data, and how to test algorithms, models and sensitivities feed into overall measures of confidence captured under uncertainty quantification (UQ). To achieve this confidence, UQ extends the traditional discipline of statistical error analysis to also capture uncertainties due to possibly incomplete, inaccurate and contradictory input data, missing and undetected mechanisms and dependencies, expert judgment, and variations between reasonable model forms and modelling strategies. Advances in UQ now provide measures of confidence necessary to inform national and international security decisions. A notable example is the US support of a nuclear test moratorium since 1992, whereby we annually provide detailed measures of confidence in the safety, security and performance of the nuclear stockpile—guaranteed through virtual testing1,2.
UQ in model-based critical decision making
In model-based prediction, we first understand and define the questions we are posing and then define models to answer them. Not so with data-rich problems, where often neither the questions nor the underlying models are known. In this case, artificial intelligence (AI)-based methods, from novel hardware to machine learning (ML) techniques, seek to define the effective models that characterize emergent features in data. And those data are often complex, multimodal, discordant, noisy and incomplete.
UQ today underpins many decision processes in nuclear security, our risk management and associated investments, which can be at the scale of billions of dollars. Predictions without UQ are neither predictions nor actionable. The data-rich world of ML, especially the powerful deep learning (DL) models, poses parallel challenges. To develop consequential decision support from ‘learned’ models built on complex datasets, there is an important need to co-develop UQ for this domain. Ultimately it is in the merging of these two distinct worlds—model- and data-based—that a future path for prediction lies. To get there, an immediate need is UQ for AI-based approaches.
In this Perspective, we first discuss some of the current roles of DL in clinical decision making. We then describe the possible place and function of UQ, the challenges this brings forth, how these relate to previous uses of UQ, and fruitful areas of research that we can foresee. We end by summarizing that with a principled approach one can reap the benefits of data-driven approaches without sacrificing our ability to make and defend our clinical decisions.
Machine-assisted clinical decision making and research
Although AI-based research has long played an important role in medicine, it has, nevertheless, been one of the more challenging areas for AI to see an effective adoption. The reasons for this are not only social or cultural, but also due to unfamiliar interfaces and a hesitancy to give machine intelligence the responsibility of making life-critical decisions. On the other hand, the opportunities for the applications of AI in medicine are broad and, in some areas, potentially transformational. They range from uncontroversial and fundamental applications, such as image classification and information extraction, to much more complex, challenging and high-impact applications such as medical and therapeutic discoveries, outcome predictions, treatment personalization and optimization, targeted therapies, and far-reaching basic science discoveries.
These are the areas where AI could potentially have radical impact, but also where errors can have catastrophic consequences. Automated systems adoption, especially systems not analysable in terms of known causal connections, will require principled and formal UQ to play a transformative role, just as we have seen in the nuclear security domain. UQ captures our pragmatic approaches to ascribing confidence in predictions from some of the most complex simulations done today.
To analyse the situation, we can think broadly of two streams in the empirical sciences: those that (i) use data to derive partial theories or ‘generalizable/transferable knowledge’ that provide understanding, and use such knowledge to intervene, and those that (ii) use data to build models that are specific to the problem. The latter do not necessarily provide ‘understanding’, but may use complex correlations in the data to directly make actionable projections. Historically, medicine was squarely in the second category (that is, it was mainly an empirical science through much of its history), with the rise of statistical interpretations only in the 1950s through the introduction of randomized clinical trials. Even with statistically oriented clinical trials, most partial theories in medicine still explain an extremely small fraction of the observed phenomena and variations3,4. Even though fully mechanistic models are unlikely to be the first avenues of progress, the use of scientific insights and attempts at a cohesive framework incorporating the major clinical predictors are likely to be increasingly useful as predictive models become able to efficiently summarize more complex correlations in the data.
The current roles and applications of AI
The role of AI in medicine ranges from the well-established tasks of recognition of medical conditions and symptoms with human-like or superhuman accuracy from visual sources, to more novel applications such as outcome prediction and augmented cognition, and ultimately guiding medical discoveries and therapy development.
Recently, approaches based on DL have had the most significant impact in areas requiring the interpretation of medical images, as DL-structured neural networks are particularly suitable for the recognition of visually manifested conditions such as changes in tissue, lesions and growths. Applications of DL techniques based on transfer learning have reported performance comparable to that of human experts5,6 or better7,8. Additionally, DL methods have been used in predictive scenarios related to quality of care and clinical outcomes, where large neural networks were used as function estimators in place of classical predictive models, with reported performance better than that of state-of-the-art classical modelling approaches9,10,11.
Finally, there is a growing application of AI techniques in discovery-oriented biomedical subdisciplines. Some are in more applied areas such as drug discovery, while some are in more fundamental science areas such as the study of chemical reactions12, and assistance in the exploration and discovery of the molecular characteristics of medical phenomena from the available data using deep learning and other AI methods13,14. In most of the presented cases, the applications of AI are based on DL neural networks, trained on very large volumes of labelled data, with their learning tuned through a large number of hyperparameters. The most commonly applied neural network architectures are convolutional neural networks (CNNs) for the analysis of images, recurrent neural networks (RNNs) for analysing time series and prediction, and sequence recognizers (for example, long short-term memory units (LSTMs)15) for the analysis of text, though the architecture of the network is itself often the subject of exploration16. Unlike statistical approaches, where mathematical models are used to explain variations observed in data and to propose margins of error on inferences, in these recent applications different learning architectures are combined with a large number of DL network parameters to form universal approximators. These are then ‘trained’ to reconstruct the outcome of some generative function, without an explicit attempt to specify the exact mathematical model behind the process.
The role for UQ in DL
To understand the role of UQ in DL, we need to understand the lifecycle of a typical DL process, and how UQ might fit into it. Unlike classical scenarios that start with the formulation of models reflective of physical reality, almost all DL scenarios start with the collection of the potentially relevant and most comprehensive datasets available for a decision-making scenario (for example, early detection of the onset of sepsis). Unless the data are already labelled, the collection and organization of data is often followed by ‘labelling’ the data to mark the phenomena of interest (for example, patterns of vital signs characteristic of pre-septic patients). These data are then used for training the DL model to meet some performance goals (for example, accuracy and precision in the prediction of patients with pre-septic clinical features). To enable this, the authors of the DL process first select the most suitable DL architecture for this kind of predictive application, and then train the DL network with the labelled data. This training process is iterative, and involves the optimization of a variety of learning parameters, which are ‘tweaked’ until the network is trained to a sufficient level of performance. Next, the trained model is validated against the validation dataset, that is, a dataset that has not previously been ‘seen’ by the network. If the performance of the model meets the desired criteria, the model is deemed potentially usable in the intended scenarios (for example, surveillance for the early onset of sepsis). Obviously, there are many steps in such scenarios where there are uncertainties that would need to be quantified.
The obvious ones are uncertainties related to (i) the collection and selection of the training data, and how well they represent and cover the actual medical phenomena; (ii) the accuracy and completeness of the labelling of the training data; (iii) the selection and understanding of the actual DL model, and its performance bounds and limitations; and (iv) the model’s performance against the operational data (clinical inference). While this list is not exhaustive, we propose that the uncertainty in all of these steps would need to be quantified in order to arrive at even crude, overall measures of the uncertainty of the DL-based decision model.
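As a minimal illustration of attaching an uncertainty statement to the validation step of this lifecycle, the sketch below computes a percentile bootstrap interval for accuracy on a held-out set. It is only a sketch under stated assumptions: the labels and predictions are synthetic, the function name `bootstrap_accuracy_interval` is hypothetical, and the interval reflects only the sampling variability of the validation cases, not the labelling, data-coverage or model-selection uncertainties listed above.

```python
import numpy as np

def bootstrap_accuracy_interval(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy on a held-out set."""
    rng = np.random.default_rng(seed)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    n = correct.size
    # Resample validation cases with replacement and recompute accuracy each time.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_acc = correct[idx].mean(axis=1)
    lo, hi = np.quantile(boot_acc, [alpha / 2, 1 - alpha / 2])
    return float(correct.mean()), float(lo), float(hi)

# Synthetic held-out labels and predictions (assumption, for illustration only).
rng = np.random.default_rng(1)
y_val = rng.integers(0, 2, size=200)
flip = rng.random(200) < 0.15              # simulate a model that errs about 15% of the time
y_hat = np.where(flip, 1 - y_val, y_val)

acc, lo, hi = bootstrap_accuracy_interval(y_val, y_hat)
print(f"accuracy={acc:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
```

A wide interval here would indicate that the validation set is too small to support the performance claim, independently of how the model was trained.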
We see at least four overlapping groups of challenges associated with the uncertainty quantification of the data-driven approaches such as DL.
Absence of theory: unlike the physical world, which is governed by the well-understood laws of physics, the domains where DL is usually applied, such as medicine, do not have ‘hard laws’. Although we use compensating mathematical techniques that make certain assumptions in order to account for random noise, or for some other well-known problem in working with the data, we are ultimately operating without the fundamental, underlying mathematical model that we could otherwise use to ground our uncertainties and to bound any assumptions we make.
Absence of causal models: in addition to the absence of underlying mechanistic theory, one also has to contend with the fact that DL is essentially exploiting correlations in the data, without paying attention to any causal link. This may not seem like a limitation, since prediction does not require a causal relation. In fact, after arriving at a low-dimensional representation that describes certain correlations (for example, the difference between cancerous and matched normal cells), hypotheses can be raised and tested. The absence of a causal connection, however, limits the conclusions that can be drawn from DL models; furthermore, it is imperative to understand how similar the training data must be to the prediction data.
Sensitivity to imperfect data: as we discussed before, DL learns from data, and often uses subtle multivariate correlations to improve its predictions. Real-world data are usually imperfect—typically containing missing elements and errors—and these imperfections have patterns that can confound prediction. Specific UQ methods, therefore, need to be developed to quantify the sensitivity of models to imperfect data.
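One simple way to probe this sensitivity empirically is to degrade the inputs deliberately and measure how far the predictions move. The sketch below is an assumption-laden illustration, not a proposed UQ method: a small logistic regression stands in for a learned model, the data are synthetic, entries are removed completely at random and mean-imputed, and all function names are hypothetical. Patterned missingness, which the text identifies as the harder case, would require a mask that depends on the data.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (stand-in for a learned model)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def missingness_sensitivity(X, w, b, col_means, rate, n_rep=50, seed=0):
    """Mean absolute change in predicted probability when a fraction `rate`
    of entries is removed at random and replaced by the column mean."""
    rng = np.random.default_rng(seed)
    base = predict_proba(X, w, b)
    shifts = []
    for _ in range(n_rep):
        mask = rng.random(X.shape) < rate          # True marks a 'missing' entry
        X_imp = np.where(mask, col_means, X)       # mean imputation
        shifts.append(np.mean(np.abs(predict_proba(X_imp, w, b) - base)))
    return float(np.mean(shifts))

# Synthetic data (assumption): five features, two of them irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
true_w = np.array([1.5, -1.0, 0.5, 0.0, 0.0])
y = (X @ true_w + rng.normal(scale=0.5, size=300) > 0).astype(float)

w, b = fit_logistic(X, y)
col_means = X.mean(axis=0)
sensitivity = {rate: missingness_sensitivity(X, w, b, col_means, rate)
               for rate in (0.0, 0.1, 0.3)}
for rate, shift in sensitivity.items():
    print(f"missing rate={rate:.1f}: mean |change in probability|={shift:.3f}")
```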
Computational expense: the training of the DL models is computationally expensive, and any further re-computation and re-evaluation of the models, aimed, for example, at the calculation of uncertainty bounds, might currently be prohibitively expensive. Fortunately, computing capacity in support of DL is growing exponentially, and techniques are being developed17 to approximate some of the UQ-relevant calculations.
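One of the approximation techniques cited here17 treats dropout as an approximate Bayesian procedure: dropout is kept active at prediction time, and the spread of repeated stochastic forward passes serves as a measure of model uncertainty without retraining. The following is a rough sketch of that idea only. The two-layer network, its random placeholder weights, the input data and the function name `mc_dropout_predict` are all assumptions made for illustration; a real application would use the weights of a network trained with dropout.

```python
import numpy as np

def mc_dropout_predict(x, W1, b1, W2, b2, p_drop=0.5, n_samples=200, seed=0):
    """Repeat stochastic forward passes with dropout left on at prediction
    time and summarize the spread of the outputs."""
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_samples):
        h = np.maximum(0.0, x @ W1 + b1)               # hidden layer with ReLU
        keep = rng.random(h.shape) >= p_drop           # Bernoulli dropout mask
        h = h * keep / (1.0 - p_drop)                  # inverted-dropout scaling
        logit = h @ W2 + b2
        outputs.append(1.0 / (1.0 + np.exp(-logit)))   # sigmoid output
    outputs = np.stack(outputs)                        # shape (n_samples, n_inputs)
    return outputs.mean(axis=0), outputs.std(axis=0)

# Placeholder weights: random values standing in for a trained network (assumption).
rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(4, 32)), np.zeros(32)
W2, b2 = rng.normal(size=32) / np.sqrt(32), 0.0
x = rng.normal(size=(3, 4))    # three hypothetical cases with four features each

mean, std = mc_dropout_predict(x, W1, b1, W2, b2)
for m, s in zip(mean, std):
    print(f"predicted probability={m:.3f} +/- {s:.3f}")
```

The cost is `n_samples` forward passes per prediction, which is typically far cheaper than retraining the network many times.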
Notably, ad hoc solutions, such as sensitivity analysis and the study of model variability, have sometimes been employed to mitigate some of the problems we outline. The need to systematize a similar situation is what actually led us to develop the formal approach to UQ in US national security sciences. The wider application of DL in the biomedical field now requires an extension of these methods to this emerging field.
Needs for new research
Just as the challenges in applying UQ for DL are significant, the opportunities for new and important research are equally exciting. Even though a review of the ongoing research in this area is beyond the scope of this Perspective, in this section we describe a few major research directions that, we expect, could improve the situation. In the end, it is possible that an entirely new field of UQ for DL might need to be developed.
Quantifying and limiting overfitting
Overfitting, or the problem of a model performing well on the training set but generalizing poorly to unseen datasets, is one of the fundamental problems of all data-centric methods, and therefore of DL.
In classical models, we evaluated models’ performance by information criteria that strongly penalized the number of parameters estimated from the data, and strong guarantees against overfitting relied on proving that the assumptions did not allow one to fit random noise. With DL, given the large number of model parameters involved and the capacity of DL networks to memorize random noise18, these classical approaches do not work.
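The memorization effect can be reproduced in miniature without a deep network. The sketch below deliberately swaps in a much simpler model, a linear classifier with more parameters than training cases, fitted by minimum-norm least squares to labels that are pure noise. The data, sizes and variable names are assumptions for illustration; the behaviour of actual DL networks is what ref. 18 reports, not what this toy shows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_params = 50, 2000, 200    # more parameters than training cases

X_train = rng.normal(size=(n_train, n_params))
y_train = rng.choice([-1.0, 1.0], size=n_train)    # labels carry no signal at all
X_test = rng.normal(size=(n_test, n_params))
y_test = rng.choice([-1.0, 1.0], size=n_test)

# Minimum-norm least-squares fit; with n_params > n_train it interpolates the labels.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_acc = float(np.mean(np.sign(X_train @ w) == y_train))
test_acc = float(np.mean(np.sign(X_test @ w) == y_test))
print(f"training accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

Perfect training accuracy on noise alongside chance-level test accuracy shows why training performance alone, or a simple parameter count, says little about generalization in the overparameterized regime.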
The research question in the context of DL is: what is a scheme that informs us about the bounds of overfitting? Some approaches, such as attempts to empirically learn generalizable patterns with insertion of random noise19,20, or the use of cross-validation to determine the progression of generalizable learning, move us forward in this problem space, while still carrying the problem of overfitting21. Despite these advances, further research is needed in the criteria that can be used to provide provable limits on overfitting, assuming fair sampling in the training data.
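For reference, the cross-validation approach mentioned above can be sketched as follows. This is an illustrative example under assumptions: polynomial regression stands in for a learned model, the data are synthetic, and `kfold_mse` is a hypothetical helper. The gap between training and validation error is an empirical estimate of overfitting, not the provable bound that the text calls for.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean training and validation MSE of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    train_err, val_err = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        train_err.append(np.mean((np.polyval(coeffs, x[train_idx]) - y[train_idx]) ** 2))
        val_err.append(np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(train_err)), float(np.mean(val_err))

# Synthetic data (assumption): a smooth signal plus noise.
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=40)

# The same seed gives the same folds for every degree, so the rows are comparable.
results = {d: kfold_mse(x, y, d) for d in (1, 3, 9)}
for d, (tr, va) in results.items():
    print(f"degree={d}: train MSE={tr:.3f}, validation MSE={va:.3f}, gap={va - tr:.3f}")
```

Training error can only fall as model capacity grows, so it is the validation error and the gap between the two that carry the information about generalization.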
Advances in understanding how DL works internally will allow for more effective UQ. Interpreting DL is an active area of research, with a common approach focusing on the relationship between the input and output of a DL algorithm, and on providing an explanation of the results, not only for individual instances but also for the general method. In addition, there are ongoing studies that attempt to understand what DL does22 and how it learns23.
Training DL to provide its own uncertainty estimates
Ultimately, an effective way of addressing some of the mentioned UQ problems might be to reshape the DL engine itself to provide an uncertainty estimate on its predictions. In other words, instead of trying to analyse a trained DL network, or the training procedure, one can use the characteristics, architecture and computational capabilities of the DL process to learn to analyse its own uncertainty. We propose this based on the realization that, ultimately, uncertainty in generalization depends on the density of training points in an appropriately defined neighbourhood of the prediction target. However, in high-dimensional problems, typical of the medical setting (images, large numbers of phenotypes and so on), every point can be isolated in some other dimension, and a density of points makes sense only after irrelevant dimensions are projected out; this is difficult to do just by analysing the network from outside. On the other hand, the network itself can be used to study this uncertainty empirically and provide the uncertainty bounds. A particularly fruitful approach seems to be the use of generative adversarial networks (GANs) for detecting out-of-sample cases24,25,26.
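The dependence of uncertainty on local training density can be illustrated with a far simpler device than a GAN: the mean distance from a prediction target to its k nearest training points. The sketch below is not one of the cited methods and rests on a strong assumption, namely that the features have already been reduced to the relevant dimensions, which is exactly the step the text identifies as difficult. The data and the function name `knn_novelty_score` are hypothetical.

```python
import numpy as np

def knn_novelty_score(train, queries, k=5):
    """Mean Euclidean distance from each query to its k nearest training points.
    Larger values mean sparser training coverage around the query."""
    diffs = queries[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))      # shape (n_queries, n_train)
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(7)
train = rng.normal(size=(500, 3))                  # synthetic training features
in_sample = rng.normal(size=(5, 3))                # drawn like the training data
out_of_sample = rng.normal(loc=8.0, size=(5, 3))   # shifted far away from it

score_in = knn_novelty_score(train, in_sample)
score_out = knn_novelty_score(train, out_of_sample)
print("in-sample scores:    ", np.round(score_in, 2))
print("out-of-sample scores:", np.round(score_out, 2))
```

In a learned setting, the distances would more plausibly be computed in a representation produced by the network itself, in line with the suggestion that the network be used to study its own uncertainty.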
Data-driven methods are emerging as the foundations of evidence-based decision making and the future of data-driven scientific discovery. To fully realize their potential, we need to overcome significant hurdles in understanding the precision and uncertainty in purely data-driven predictions. Fortunately, there is a unifying structure to this problem in its various guises of complex predictive correlations in large datasets and engineering black boxes. These have been studied in other scientific disciplines and decision-making arenas, and we can learn from those. The details of the medical applications and DL networks are, however, significantly different since the theoretical foundations are far less developed and there are deep psychological and sociological implications in delegating to machines decisions that affect the lives of human beings. Nevertheless, progress in understanding the structure of these predictive systems, merging model- and data-driven approaches with strongly defensible UQ, and the formalization of the UQ for DL discipline will be needed to make DL and other data-centric tools and methods practically useful. UQ for DL is not likely to be simply a set of tools or procedures to apply, but a more complex wrap-up of disparate methods that in total help bound the overall confidence in predictions.
Oberkampf, W. L. & Roy, C. J. Verification and Validation in Scientific Computing (Cambridge Univ. Press, Cambridge, 2010).
National Research Council Evaluation of Quantification of Margins and Uncertainties: Methodology for Assessing and Certifying the Reliability of the Nuclear Stockpile (National Academies Press, Washington DC, 2009).
Zuk, O., Hechter, E., Sunyaev, S. R. & Lander, E. S. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc. Natl Acad. Sci. USA 109, 1193–1198 (2012).
Choi, J. D. & Lee, J.-S. Interplay between epigenetics and genetics in cancer. Genomics Inform. 11, 164–173 (2013).
Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131 (2018).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
Mar, V. & Soyer, H. Artificial intelligence for melanoma diagnosis: How can we deliver on the promise? Ann. Oncol. 29, 1625–1628 (2018).
Weng, S. F., Reps, J., Kai, J., Garibaldi, J. M. & Qureshi, N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS One 12, e0174944 (2017).
Bychkov, D. et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci. Rep. 8, 3395 (2018).
Xiao, C., Ma, T., Dieng, A. B., Blei, D. M. & Wang, F. Readmission prediction via deep contextual embedding of clinical concepts. PLoS One 13, e0195024 (2018).
Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. npj Digit. Med. 1, 18 (2018).
Coley, C. W., Barzilay, R., Jaakkola, T. S., Green, W. H. & Jensen, K. F. Prediction of organic reaction outcomes using machine learning. ACS Cent. Sci. 3, 434–443 (2017).
Hsu, E., Klemm, J., Kerlavage, A., Kusnezov, D. & Kibbe, W. Cancer moonshot data and technology team: Enabling a national learning healthcare system for cancer to unleash the power of data. Clin. Pharmacol. Ther. 101, 613–615 (2017).
Fillon, M. Making sense of the mountains of new cancer data. J. Natl Cancer Inst. 109, djx020 (2017).
Geraci, J. et al. Applying deep neural networks to unstructured text notes in electronic medical records for phenotyping youth depression. Evid. Based Ment. Health 20, 83–87 (2017).
Zhou, Y. et al. Resource-efficient neural architect. Preprint at https://arxiv.org/abs/1806.07912 (2018).
Gal, Y. & Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. Preprint at https://arxiv.org/abs/1506.02142 (2015).
Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning requires rethinking generalization. Preprint at http://arxiv.org/abs/1611.03530 (2016).
Arpit, D. et al. A closer look at memorization in deep networks. Preprint at https://arxiv.org/abs/1706.05394 (2017).
Zhang, C., Vinyals, O., Munos, R. & Bengio, S. A study on overfitting in deep reinforcement learning. Preprint at http://arxiv.org/abs/1804.06893 (2018).
Cawley, G. C. & Talbot, N. L. C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010).
Brahma, P. P., Wu, D. & She, Y. Why deep learning works: A manifold disentanglement perspective. IEEE Trans. Neural Netw. Learn. Sys. 27, 1997–2008 (2016).
Raghu, M., Gilmer, J., Yosinski, J. & Sohl-Dickstein, J. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Preprint at https://arxiv.org/abs/1706.05806 (2017).
Brahma, P. P., Huang, Q. & Wu, D. O. Structured memory based deep model to detect as well as characterize novel inputs. Preprint at http://arxiv.org/abs/1801.09859 (2018).
Yu, Y., Qu, W., Li, N. & Guo, Z. Open-category classification by adversarial sample generation. Preprint at http://arxiv.org/abs/1705.08722 (2017).
Ge, Z., Demyanov, S., Chen, Z. & Garnavi, R. Generative openmax for multi-class open set classification. Preprint at http://arxiv.org/abs/1707.07418 (2017).
This manuscript has been in part co-authored by UT-Battelle, LLC, under contract no. DE-AC05-00OR22725.
Nature Machine Intelligence (2019)