With only a limited number of clinical trials of artificial intelligence in medicine thus far, the first guidelines for protocols and reporting arrive at an opportune time. Better protocol design, along with consistent and complete data presentation, will greatly facilitate interpretation and validation of these trials, and will help the field to move forward.
The past decade ushered in excitement for the potential to apply deep-learning algorithms to healthcare. This subtype of artificial intelligence (AI) has the ability to improve the accuracy and speed of interpreting large datasets, such as images, speech and text. However, for deep learning to be accepted and implemented in the care of patients, proof from randomized clinical trials is urgently needed.
Randomized clinical trials became commonplace in the early 1980s to provide the basis of evidence for medical practice, but it was not until nearly two decades later that the Consolidated Standards of Reporting Trials (CONSORT) were developed in 1996 (ref. 1). In contrast, the use of AI in medicine, and specifically the use of deep neural networks, is still in the early stages — clinical trials using AI have been initiated only in the past two years. Two new companion pieces in this issue of Nature Medicine are devoted specifically to clinical-trial guidelines for protocols (SPIRIT-AI extension) and publication (CONSORT-AI extension)2,3.
Retrospective analyses are a first step only
There are currently hundreds of published retrospective reports that fall under the rubric of ‘clinical trials’ of AI, but they are not really trials at all. While essential for laying the groundwork, these are in silico assessments of datasets to determine how well a deep neural network performs a clinical task, compared with a small number of physicians. Such AI reports do not simulate clinical practice but instead work from a cleaned, relatively pristine, annotated dataset. The real world of medicine, in contrast, is messy, with plenty of missing and unstructured data, and its focus is caring for patients rather than generating an analytical substrate. The stark contrast between a clinical environment and an in silico one cannot be overemphasized, and hence clinical trials of AI are needed.
So far, there have been two systematic reviews and meta-analyses of clinical AI studies, and they have exposed serious shortcomings4,5. One review of 82 studies found that reporting was poor on critical aspects: data were often missing, as were key terms and the definitions of those terms. The authors also found that model performance and validation descriptors were highly variable, and external validation (out-of-sample, beyond the test and internal validation assessment) was lacking. No study had a sample-size calculation to provide assurance that it was adequately powered. Most troubling, the deep-learning models were rarely compared with the combined approach of assessment of the same datasets by both the algorithm and healthcare professionals. Another review, of 81 studies, confirmed those findings and identified further deficiencies5: a serious problem with transparency, limited availability of datasets and code with which to assess reproducibility, very small numbers of clinicians for comparison with algorithmic performance, and hyperbolic conclusions. This pitting of clinicians against a machine is the antithesis of clinical practice, which invariably keeps humans in the loop, at the very least for any important, serious diagnosis. We cannot ever rely solely on a neural network for critical, potentially life-or-death decisions about a patient.
The case for clinical trials of AI
Prospective trials that are representative of patient care are essential. Take, for example, one of the first major studies of AI in medicine: a comparison of a deep-neural-network diagnosis of skin cancer that analyzed photographs of lesions versus 21 board-certified dermatologists who made diagnoses on the basis of the same photographs6. When a dermatologist evaluates a skin lesion, they are not analyzing a photograph in isolation but are instead analyzing the lesion in the context of the patient’s history and physical exam, which is very much in contrast with the application of the deep neural network. Furthermore, multiple retrospective studies have spotlighted remarkable, near-perfect accuracy for the diagnosis of diabetic retinopathy through the use of algorithms on retinal imaging. But when the first prospective trial of these algorithms was performed, the accuracy, while acceptable and a step forward for automated diagnosis, was not nearly as high7. Retrospective studies of AI in healthcare must therefore be considered hypothesis generating, often a best-case scenario, and unacceptable as definitive proof points. Yet, unfortunately, most regulatory approvals of algorithms by the US Food and Drug Administration currently rely on such preliminary evidence8. Furthermore, the retrospective datasets used by private companies to develop their algorithms are rarely published and are hence not transparent to the clinical community that intends to use those algorithms for patient care.
While harm is certainly never the intent, clinical algorithms also carry the inadvertent potential to cause it. When an algorithm has embedded bias, or when the population it was developed on is unrepresentative of the one to which it is being applied, serious diagnostic or predictive inaccuracy can result. Once implemented in clinical care, such software is eminently scalable, so that patients can be unwittingly hurt quickly and at scale. Robust evidence from clinical trials is essential for identifying and understanding any potential of an algorithm to cause this type of harm.
A new era needs new guidelines
The ultimate proof of the clinical utility of AI will come from randomized trials, and ideally these will compare the accuracy of a diagnosis made by a deep-learning algorithm with that of a diagnosis made by clinicians alone and by clinicians in conjunction with the algorithm. Currently there is public information for only about a dozen such trials (Table 1) and for only seven randomized trials (Table 2). Of the latter, six are for endoscopic polyp diagnosis, and so far, all but one were performed in China (Table 2). The limited number of prospective and randomized trials indicates how nascent the AI-for-medicine era is.
So that this potential turning point in clinical practice is not wasted, clinical trials of medical AI must be carried out transparently and without causing harm, and this is where the new guidelines are essential. It is noteworthy that these guidelines are the painstaking work of a large international, transdisciplinary team, undertaken in multiple phases. Drafting was started by a steering group of academic faculty experienced in the conduct and methodology of clinical trials, who reviewed over 300 registered trials (only 7 of which were published; 62 were completed). This was followed by a two-stage Delphi study in which 169 highly regarded transdisciplinary experts, including patient groups, reviewed and voted on candidate components (‘items’), and the process culminated in a two-day consensus meeting at the organizers’ institution (the University of Birmingham) in January 2020. The outgrowth of these meetings is 15 essential items, in the form of two separate checklists for clinical-trial protocols and reporting. The items aim to cover the critical shortfalls of AI medical research thus far and to make trials easier to replicate and to evaluate independently.
At their most reductive, deep-learning models consist of inputs (data, such as an image) and outputs (an interpretation or prediction, such as whether a chest X-ray indicates the presence of pneumonia). In a clinical trial of an AI, for the inputs we must know the inclusion and exclusion criteria for the patients, how representative they are of the clinical question at hand, and the quality and provenance of their data. For the outputs, how they were specified and how they contribute to decision-making are just two of the important features. The guidelines stipulate that much information about the algorithm itself needs to be provided, such as which version was used, what changes occurred during testing and internal validation, and the fit of the model. Overfitting of healthcare datasets—a narrow analysis that is extrapolated to the broader, unrestricted world of the clinical environment—needs to be avoided. The guidelines require details on how detectable, predictable and explainable any errors generated are, which will help to elucidate the relative safety of applying the AI. Furthermore, the human–AI interface must be fully understood by readers of these trials; in this context, the authors give the example of a colonoscopy clinical trial, clarifying the need to know how video clips were readied for gastroenterologists to review2,3. Similarly, supervised learning relies on ‘ground truths’, a powerful term that conveys something absolutely accurate, better than truths; but the ground truths on which algorithms are built may not be actual ground truths, and the recommendations require that the details of these be elaborated. These are just some of the items that the two guideline groups settled on as important for building into protocols and publications.
Establishing these standards and this transparency will undoubtedly help to propel the field forward. But it is important to acknowledge that there is still more to learn about the best practice of carrying out clinical trials, and it is likely that the new standards will require revision in the years ahead. The current standards have largely been centered on imaging and have yet to detail recommendations for speech and text datasets in a meaningful way. Nearly all clinical applications thus far have used supervised learning, which leaves unknowns for how to address unsupervised and self-supervised forms. Furthermore, the clinical-trial efforts, with rare exceptions, have included only AI that relates to medical professionals and do not acknowledge the power of AI to provide autonomy for patient self-diagnosis. There are already deep-learning algorithms used at scale by consumers, such as a smartwatch resting-heart-rate app for the diagnosis of atrial fibrillation9. There have yet to be any prospective post-implementation trials to provide another form of validation in the real-world setting. In addition to assessing utility, such studies would also log other challenges, from software glitches to malicious, adversarial attacks.
A particular strength of deep neural networks is their autodidactic learning capacity: the more data they learn from, the better their performance. Yet the current guidelines do not tackle this issue, just as regulatory agencies have struggled with it. We clearly want to exploit this capacity for medical good, but there is some uncertainty as to whether, once an algorithm has ‘learned’ further, its performance will deviate from the clinical-trial evidence with which it was released. Conversely, at the moment an algorithm is frozen when released, which suppresses one of its potentially most powerful features.
We will look forward to future updates provided by the CONSORT-AI and SPIRIT-AI teams for how to deal with the issues that the use of medical AI brings once applied in the real-world setting. For now, we can express deep gratitude for all their efforts in raising the bar for AI medical research.
Moher, D. et al. Br. Med. J. 340, c869 (2010).
Liu, X. et al. Nat. Med. https://doi.org/10.1038/s41591-020-1034-x (2020).
Rivera, S. C. et al. Nat. Med. https://doi.org/10.1038/s41591-020-1037-7 (2020).
Liu, X. et al. Lancet Digit. Health 1, e271–e297 (2019).
Nagendran, M. et al. Br. Med. J. 368, m689 (2020).
Esteva, A. et al. Nature 542, 115–118 (2017).
Keane, P. & Topol, E. NPJ Digit. Med. 1, 40 (2018).
Topol, E. J. Nat. Med. 25, 44–56 (2019).
Perez, M. V. et al. N. Engl. J. Med. 381, 1909–1917 (2019).
Abràmoff, M. D. et al. NPJ Digit. Med. 1, 39 (2018).
Gulshan, V. et al. JAMA Ophthalmol. 137, 987–993 (2019).
Kanagasingam, Y. et al. JAMA Netw. Open. 1, e182665 (2018).
Long, E. et al. Nat. Biomed. Eng. 1, 0024 (2017).
Steiner, D. F. et al. Am. J. Surg. Pathol. 42, 1636–1646 (2018).
Lee, H. et al. Nat. Biomed. Eng. 3, 173–182 (2019).
Mori, Y. et al. Ann. Intern. Med. 169, 357–366 (2018).
Li, C. et al. Cancer Commun. 38, 59 (2018).
Dascalu, A. & David, E. O. EBioMedicine 43, 107–113 (2019).
Phillips, M. et al. JAMA Netw. Open 2, e1913436 (2019).
Hollon, T. C. et al. Nat. Med. 26, 52–58 (2020).
Wang, P. et al. Lancet Gastroenterol. Hepatol. 5, 343–351 (2020).
Gong, D. et al. Lancet Gastroenterol. Hepatol. 5, 352–361 (2020).
Su, J.-R. et al. Gastrointest. Endosc. 91, 415–424.e4 (2020).
Wu, L. et al. Gut 68, 2161–2169 (2019).
Wang, P. et al. Gut 68, 1813–1819 (2019).
Lin, H. et al. EClinicalMedicine 9, 52–59 (2019).
Wijnberge, M. et al. JAMA 323, 1052–1060 (2020).
The author declares no competing interests.
Topol, E.J. Welcoming new guidelines for AI clinical research. Nat Med 26, 1318–1320 (2020). https://doi.org/10.1038/s41591-020-1042-x