The lackluster performance of many machine learning (ML) systems in healthcare has been well documented1,2. In healthcare, as in other areas, artificial intelligence (AI) algorithms can even perpetuate human prejudices such as sexism and racism when trained on biased datasets3.

Given the rapid embrace of AI and ML in clinical research and their accelerating impact, the formulation of guidelines4,5 such as SPIRIT-AI, CONSORT-AI and, more recently, DECIDE-AI to regulate the use of ML in clinical research has helped to fill a regulatory void.

However, these clinical research guidelines generally concern the use of ML ex post facto, after the decision has been made to use an ML technique for a research study. The guidelines do not pose questions about the necessity or appropriateness of the AI or ML technique in the healthcare setting.

Failure to replicate

At the beginning of the COVID-19 pandemic, before the widespread adoption of reliable point-of-care assays to detect SARS-CoV-2, one highly active area of research involved the development of ML algorithms to estimate the probability of infection. These algorithms based their predictions on various data elements captured in electronic health records, such as chest radiographs.

Despite promising initial validation results, the success of numerous artificial neural networks trained on chest X-rays was largely not replicated when the models were applied to different hospital settings, in part because the models failed to learn the true underlying pathology of COVID-19. Instead, they exploited shortcuts or spurious associations that reflected biologically meaningless variations in image acquisition, such as laterality markers, patient positioning or differences in radiographic projection6. These ML algorithms were not explainable and, while appearing to be at the cutting edge, were inferior to traditional diagnostic techniques such as RT-PCR, negating their usefulness. More than 200 prediction models were developed for COVID-19, some using ML, and virtually all suffer from poor reporting and a high risk of bias7.

Avoiding overuse

The term ‘overuse’ refers to the unnecessary adoption of AI or advanced ML techniques where alternative, reliable or superior methodologies already exist. In such cases, the use of AI and ML techniques is not necessarily inappropriate or unsound, but the justification for such research is unclear or artificial: for example, a novel technique may be proposed that delivers no meaningful new answers.

Many clinical studies have employed ML techniques to achieve respectable or impressive performance, as shown by area under the curve (AUC) values between 0.80 and 0.90, or even >0.90 (Box 1). A high AUC is not necessarily a mark of quality, as the ML model might be overfit (Fig. 1). When a traditional regression technique is applied and compared against ML algorithms, the more sophisticated ML models often offer only marginal accuracy gains, presenting a questionable trade-off between model complexity and accuracy1,2,8,9,10,11,12. Even a very high AUC is no guarantee of robustness: with an overall event rate of <1%, a model can achieve an AUC of 0.99 while nearly all negative cases are predicted correctly and the few positive events are missed.
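
This pitfall can be illustrated with a minimal sketch on synthetic data (scikit-learn; the event rate, feature and variable names are hypothetical, and the exact AUC will differ from the 0.99 quoted above). The qualitative point is the same: under severe class imbalance, aggregate metrics can look excellent while essentially every positive event is missed at the default decision threshold.

```python
# Minimal sketch: impressive-looking aggregate metrics despite missing the rare
# positive events. Synthetic data with a <1% event rate; not a clinical dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 100_000
y = (rng.random(n) < 0.005).astype(int)                  # ~0.5% event rate
x = rng.normal(loc=y * 1.5, scale=1.0).reshape(-1, 1)    # weakly separable feature

x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=0)

model = LogisticRegression().fit(x_train, y_train)
prob = model.predict_proba(x_test)[:, 1]
pred = (prob >= 0.5).astype(int)                          # default threshold

print("AUC:        ", roc_auc_score(y_test, prob))        # respectable AUC
print("Accuracy:   ", accuracy_score(y_test, pred))       # ~99.5%, driven by negatives
print("Sensitivity:", recall_score(y_test, pred))         # near 0: positives are missed
```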

Fig. 1: Model fitting.

Given a dataset with data points (green points) and a true effect (black line), a statistical model aims to estimate the true effect. The red line exemplifies a close estimation, whereas the blue line exemplifies an overfit ML model with over-reliance on outliers. Such a model might seem to provide excellent results for this particular dataset, but fails to perform well in a different (external) dataset.
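
The behaviour sketched in Fig. 1 can be reproduced in a few lines. The following hypothetical example (synthetic data, scikit-learn) fits a simple linear model and a deliberately over-flexible polynomial model to the same noisy observations and evaluates both on a held-out sample standing in for external data.

```python
# Minimal sketch of Fig. 1: a flexible model can chase noise in the training data
# (low training error) yet generalize poorly to new data. Synthetic example only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 10, 30)).reshape(-1, 1)
y_train = 2.0 * x_train.ravel() + rng.normal(0, 3, 30)      # true effect is linear
x_test = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)    # "external" data
y_test = 2.0 * x_test.ravel() + rng.normal(0, 3, 200)

simple = LinearRegression().fit(x_train, y_train)            # close estimate (red line)
flexible = make_pipeline(PolynomialFeatures(degree=9),
                         LinearRegression()).fit(x_train, y_train)  # overfit (blue line)

for name, model in [("linear", simple), ("degree-9 polynomial", flexible)]:
    print(name,
          "train MSE:", round(mean_squared_error(y_train, model.predict(x_train)), 1),
          "test MSE:", round(mean_squared_error(y_test, model.predict(x_test)), 1))
```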

There is an important distinction between a statistically significant improvement and a clinically significant improvement in model performance. ML techniques undoubtedly offer powerful ways to deal with prediction problems involving data with nonlinear or complex, high-dimensional relationships (Table 1). By contrast, many simple medical prediction problems are inherently linear, with features that are chosen because they are known to be strong predictors, usually on the basis of prior research or mechanistic considerations. In these cases, it is unlikely that ML methods will provide a substantial improvement in discrimination2. Unlike in the engineering setting, where any improvement in performance may improve the system as a whole, modest improvements in medical prediction accuracy are unlikely to yield a difference in clinical action.

Table 1 Definitions of several key terms in machine learning

ML techniques should be evaluated against traditional statistical methodologies before they are deployed. If the objective of a study is to develop a predictive model, ML algorithms should be compared against a predefined set of traditional regression techniques in terms of the Brier score (an evaluation metric analogous to the mean squared error, applied to predicted probabilities), discrimination (or AUC) and calibration. The model should then be externally validated. The analytical methods, and the performance metrics on which they are compared, should be specified in a prospective study protocol and should go beyond overall performance, discrimination and calibration to also include metrics related to overfitting.
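
A minimal sketch of such a head-to-head comparison is shown below, using scikit-learn on synthetic data. The models, metrics and split are illustrative assumptions only; in a real study the comparators, metrics and validation cohort would be prespecified in the protocol and the final model validated externally.

```python
# Minimal sketch: compare a traditional logistic regression against a more flexible
# ML model on discrimination (AUC), Brier score and calibration. Synthetic data
# stands in for a prespecified development/validation split.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=5000, n_features=10, n_informative=4,
                           weights=[0.8], random_state=0)
x_dev, x_val, y_dev, y_val = train_test_split(x, y, test_size=0.3,
                                              stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    prob = model.fit(x_dev, y_dev).predict_proba(x_val)[:, 1]
    obs, pred = calibration_curve(y_val, prob, n_bins=10)   # observed vs predicted risk
    print(f"{name}: AUC={roc_auc_score(y_val, prob):.3f}, "
          f"Brier={brier_score_loss(y_val, prob):.3f}, "
          f"mean calibration gap={abs(obs - pred).mean():.3f}")
```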

Conversely, some algorithms are able to say “I don’t know” when faced with unfamiliar data13, an output that is important but often underappreciated, as knowledge that a prediction is highly uncertain may, itself, be clinically actionable.
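
One simple way to approximate this “I don’t know” behaviour is to let a probabilistic classifier abstain whenever its predicted probability falls in an indeterminate band, as in the sketch below. The thresholds are arbitrary placeholders, and more principled approaches (for example, conformal prediction) exist; this only illustrates the idea of deferring uncertain cases to the clinician.

```python
# Minimal sketch: a classifier wrapper that outputs "I don't know" when the
# predicted probability is too close to chance. Thresholds are placeholders.
import numpy as np

def predict_with_abstention(probabilities, lower=0.3, upper=0.7):
    """Return 1, 0 or None (abstain) for each predicted probability."""
    labels = []
    for p in probabilities:
        if p >= upper:
            labels.append(1)        # confident positive
        elif p <= lower:
            labels.append(0)        # confident negative
        else:
            labels.append(None)     # "I don't know": defer to the clinician
    return labels

probs = np.array([0.95, 0.52, 0.44, 0.08])
print(predict_with_abstention(probs))   # [1, None, None, 0]
```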

Rationalize usage

Researchers should start any ML project with clear project goals and an analysis of the advantages that AI, ML or conventional statistical techniques deliver in the specific clinical use case. Unsupervised clustering analyses tend to be well suited to discovering hidden subgroups in data, for example to propose a novel molecular taxonomy of cancers14 or to define subtypes of a psychiatric disorder15.
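
As a schematic example of such an unsupervised analysis, the sketch below clusters a synthetic patient-level feature matrix with k-means. The features, the number of clusters and their clinical interpretation are all assumptions made for illustration; in real work each would require careful justification and validation.

```python
# Minimal sketch: unsupervised clustering to look for hidden subgroups.
# Synthetic features; in practice these would be molecular or clinical variables.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two latent subtypes with different feature means, unknown to the algorithm.
features = np.vstack([rng.normal(0, 1, size=(100, 5)),
                      rng.normal(2, 1, size=(100, 5))])

scaled = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(np.bincount(labels))   # sizes of the proposed subtypes
```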

If the objective of a study is to develop a new prognostic nomogram or predictive model, there is little evidence that ML will fare better than traditional statistical models, even when dealing with large, high-dimensional datasets1,2,8,9,10,11,16,17,18. If the purpose of a study is to infer the causal treatment effect of a given exposure, many well-established traditional statistical techniques, such as structural equation modelling, propensity-score methodology, instrumental variables analysis and regression discontinuity analysis, yield readily interpretable and rigorous estimates of the treatment effect.
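
As one example of the latter, the sketch below applies inverse-probability weighting with a propensity score to simulated observational data containing a single confounder. Everything here (the confounder, the treatment-assignment mechanism and the true effect of +1.0) is simulated for illustration; a real analysis would also require careful confounder selection, overlap diagnostics and sensitivity analyses.

```python
# Minimal sketch: propensity-score (inverse probability) weighting to estimate a
# treatment effect from simulated observational data with one confounder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 20_000
confounder = rng.normal(size=n)
# Sicker patients (higher confounder) are more likely to be treated.
treated = (rng.random(n) < 1 / (1 + np.exp(-confounder))).astype(int)
# True treatment effect on the outcome is +1.0.
outcome = 1.0 * treated + 2.0 * confounder + rng.normal(size=n)

print("naive difference:",
      outcome[treated == 1].mean() - outcome[treated == 0].mean())  # confounded

# Propensity score: probability of treatment given the confounder.
ps = LogisticRegression().fit(confounder.reshape(-1, 1), treated)
ps = ps.predict_proba(confounder.reshape(-1, 1))[:, 1]
weights = treated / ps + (1 - treated) / (1 - ps)

# Weighted difference in means approximates the true effect (~1.0).
ate = (np.average(outcome[treated == 1], weights=weights[treated == 1])
       - np.average(outcome[treated == 0], weights=weights[treated == 0]))
print("IPW estimate:", ate)
```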

Avoiding misuse

In contrast to overuse, the term ‘misuse’ denotes more egregious uses of ML, ranging from problematic methodology that engenders spurious inferences or predictions, to applications of ML that endeavor to replace the role of physicians in situations that should still involve human input.

Indiscriminately accepting an AI algorithm purely based on its performance, without scrutinizing its internal workings, represents a misuse of ML19, although it is questionable to what extent every clinician decision is robustly explainable.

Many groups have called for explainable ML or the incorporation of counterfactual reasoning in order to disentangle correlation from causation20. Medicine should be based on science, and medical decisions should be substantiated by transparent and logical reasoning that can be subjected to interrogation. The notion of a ‘black box’ underpinning clinical decision-making is antithetical to the modern practice of medicine and is increasingly inaccurate, given the growing armamentarium of techniques, such as saliency maps and generative adversarial networks, that can be used to probe the reasoning of neural networks.
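
As a schematic illustration of one such technique, the sketch below computes a simple gradient-based saliency map for a toy, untrained network in PyTorch: the magnitude of the gradient of the prediction with respect to each input indicates which inputs most influence the output. The network, inputs and feature count are hypothetical, and real applications would use a trained clinical model and more robust attribution methods; this shows only the mechanics.

```python
# Minimal sketch: gradient-based saliency for a toy network. The gradient of the
# model output with respect to the input ranks features by local influence.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
model.eval()

x = torch.randn(1, 8, requires_grad=True)   # stand-in for one patient's features
prediction = model(x).squeeze()
prediction.backward()                        # d(prediction) / d(input)

saliency = x.grad.abs().squeeze()
print("predicted probability:", float(prediction))
print("most influential feature index:", int(saliency.argmax()))
```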

Researchers should commit to developing ML models that are interpretable, with their reasoning standing up to scrutiny by human experts, and to sharing de-identified data and scripts that would allow external replication and validation. Some researchers might conclude that machines can identify patterns in the data that the human brain cannot discern. Yet, just as an expert should be able to explain their thought patterns on complex topics, so, too, should machines be able to justify the path they took to uncover certain patterns.

Data constraints

Using ML in spite of data constraints, such as biased data and small datasets, is another misuse of AI. Training data can be biased and can amplify sexist and racist assumptions3,21. Deep learning techniques are known to require large amounts of data, yet many publications in the medical literature rely on much smaller sample and feature-set sizes than are typically available in other technological industries. Even well-trained ML algorithms may therefore lack a complete description of the clinical problem of interest.
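
The impact of limited sample size can be made visible with a learning curve, as in the sketch below on synthetic data. The dataset, model and absolute numbers are illustrative assumptions, but the pattern (optimistic training performance and weaker, more variable validated performance at small sample sizes) is typical.

```python
# Minimal sketch: a learning curve showing how validated performance depends on
# the amount of training data. Synthetic data; numbers are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

x, y = make_classification(n_samples=4000, n_features=20, n_informative=6,
                           random_state=0)
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), x, y,
    train_sizes=[0.05, 0.1, 0.25, 0.5, 1.0], cv=5, scoring="roc_auc")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train AUC={tr:.2f}, cross-validated AUC={va:.2f}")
```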

Meta’s Facebook trained its facial recognition software using photos from more than 1 billion users; autonomous automobile developers use billions of miles of road traffic video recordings from hundreds of thousands of individual drivers to develop software that recognizes road objects; and Deep Blue and AlphaGo learned from millions or billions of played games of chess and Go. In contrast, clinical research studies involving AI generally use only hundreds or thousands of radiological and pathological images22, and surgeon–scientists developing software for surgical phase recognition often work with no more than several dozen surgical videos23. These observations underscore the relative poverty of big data in healthcare, the importance of working toward sample sizes comparable to those attained in other industries, and the need for a concerted, international effort to share health data.

Human–machine collaboration

Humans and algorithms serve different functions in delivering healthcare. Algorithms allow clinicians to make the best use of the available data to inform practice, especially when the data have a complex structure or are both large and highly granular.

ML algorithms can complement, but not replace, physicians in most aspects of clinical medicine, from history-taking and physical examination to diagnosis, therapeutic decisions and performing procedures. Clinician–investigators must therefore forge a cohesive framework whereby big data propels a new generation of human–machine collaboration. Even the most sophisticated ML applications are likely to exist as discrete decision-support modules for specific aspects of patient care, rather than as competitors to their human counterparts.

Human patients are likely to want human doctors to continue making medical decisions, no matter how well an algorithm can predict outcomes. ML should, therefore, be studied and implemented as an integral part of a complete system of care.

The clinical integration of ML and big data is poised to improve medicine. ML researchers should recognize the limits of their algorithms and models in order to prevent their overuse and misuse, which could otherwise sow distrust and cause patient harm.