AI diagnostics need attention

Computer algorithms to detect disease show great promise, but they must be developed and applied with care.
Fluorescence angiogram of retina with diabetic retinopathy

AI diagnostic tools can find problems, including retinal disease, but they need to be developed with care. Credit: Lester V. Bergman/Getty

One of the biggest — and most lucrative — applications of artificial intelligence (AI) is in health care. And the capacity of AI to diagnose or predict disease risk is developing rapidly. In recent weeks, researchers have unveiled AI models that scan retinal images to predict eye- and cardiovascular-disease risk, and that analyse mammograms to detect breast cancer. Some AI tools have already found their way into clinical practice.

AI diagnostics have the potential to improve the delivery and effectiveness of health care. Many are a triumph for science, representing years of improvements in computing power and the neural networks that underlie deep learning. In this form of AI, computers process hundreds of thousands of labelled disease images, until they can classify the images unaided. In reports, researchers conclude that an algorithm is successful if it can identify a particular condition from such images as effectively as can pathologists and radiologists.
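To make the training process above concrete, here is a minimal, hypothetical sketch of supervised learning on labelled examples. It uses a single perceptron — the simplest ancestor of the deep networks described here — and stand-in "images" reduced to two invented features; the feature names, thresholds and data are illustrative assumptions, not any real diagnostic model.

```python
import random

random.seed(0)

# Toy stand-in for labelled medical images: each "image" is reduced to
# two hypothetical features (say, lesion brightness and lesion extent);
# label 1 marks disease, 0 marks healthy. Purely illustrative.
def make_example():
    brightness = random.uniform(0, 1)
    extent = random.uniform(0, 1)
    label = 1 if brightness + extent > 1.0 else 0
    return (brightness, extent), label

train = [make_example() for _ in range(500)]
test = [make_example() for _ in range(100)]

# A single perceptron: weights are nudged after each misclassified
# training example until the model can classify unaided.
w = [0.0, 0.0]
b = 0.0
for _ in range(20):  # passes over the labelled training data
    for (x1, x2), label in train:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        err = label - pred
        w[0] += 0.1 * err * x1
        w[1] += 0.1 * err * x2
        b += 0.1 * err

# Measure how well the trained model classifies unseen examples.
correct = sum(
    (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == label
    for (x1, x2), label in test
)
print(f"held-out accuracy: {correct}/{len(test)}")
```

Real diagnostic models replace the two hand-picked features with hundreds of thousands of raw pixel images and many stacked layers, but the training loop — predict, compare against the label, adjust — is the same in spirit.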

But that alone does not mean the AI diagnostic is ready for the clinic. Many reports are best viewed as analogous to studies showing that a drug kills a pathogen in a Petri dish. Such studies are exciting, but the scientific process demands that the methods and materials be described in detail, that the study be replicated, and that the drug be tested in a progression of studies culminating in large clinical trials. This does not seem to be happening enough in AI diagnostics. Many in the field complain that developers are not taking their studies far enough — they are not applying the evidence-based approaches established in mature fields such as drug development.

Many reports of new AI diagnostic tools, for example, go no further than preprints or claims on websites. They haven’t undergone peer review, and might never do so. Peer review would verify key details: the underlying algorithm code, and analyses of, for example, the images on which the model was trained, the physicians with whom it was compared, the features the neural network used to make its decisions, and the caveats.

These details matter. For instance, one investigation published last year found that an AI model detected breast cancer in mammograms better than did 11 pathologists who were allowed assessment times of about one minute per image. However, a pathologist given unlimited time performed as well as AI, and found difficult-to-detect cases more often than the computers (B. E. Bejnordi et al. J. Am. Med. Assoc. 318, 2199–2210; 2017).

Some issues might not appear until the tool is applied. For example, a diagnostic algorithm might incorrectly associate images produced using a particular device with a disease — but only because, during the training process, the clinic using that device saw more people with the disease than did another clinic using a different device.
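This kind of confounding can be illustrated with a small, entirely hypothetical sketch: during training, device A happens to be used mostly at a clinic with more diseased patients, so a "shortcut" model that reads only the device tag looks accurate — until it is deployed where that correlation no longer holds. All numbers and names below are invented for illustration.

```python
import random

random.seed(1)

# Hypothetical data: each example is (used_device_a, pathology_signal)
# with label 1 for disease, 0 for healthy. In training, device A is
# strongly correlated with disease; at deployment it is not.
def make_example(p_device_a_given_disease):
    disease = random.random() < 0.5
    if disease:
        device_a = random.random() < p_device_a_given_disease
    else:
        device_a = random.random() < 1 - p_device_a_given_disease
    signal = (1 if disease else 0) + random.gauss(0, 1.5)  # noisy pathology
    return (1 if device_a else 0, signal), (1 if disease else 0)

train = [make_example(0.9) for _ in range(2000)]   # confounded clinic data
deploy = [make_example(0.5) for _ in range(1000)]  # correlation absent

# A model that learned the shortcut: predict disease iff device A was used.
def shortcut(features):
    device_a, _signal = features
    return device_a

def accuracy(model, data):
    return sum(model(f) == y for f, y in data) / len(data)

print(f"training accuracy:   {accuracy(shortcut, train):.2f}")
print(f"deployment accuracy: {accuracy(shortcut, deploy):.2f}")
```

The shortcut model scores around 90% on the confounded training data yet falls to chance once the device–disease correlation disappears — exactly the failure mode that only surfaces after the tool is applied.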

These problems can be overcome. One way is for doctors who deploy AI diagnostic tools in the clinic to track results and report them, so that retrospective studies expose any deficiencies. Better yet, such tools should be developed rigorously — trained on extensive data and validated in controlled studies that undergo peer review. This is slow and difficult, in part because privacy concerns can make it hard for researchers to access the massive amounts of medical data needed. A News story in Nature discusses one possible answer: researchers are building blockchain-based systems to encourage patients to securely share information. At present, human oversight will probably prevent weaknesses in AI diagnosis from being a matter of life or death. That is why regulatory bodies, such as the US Food and Drug Administration, allow doctors to pilot technologies classified as low risk.

But lack of rigour does carry immediate risks: the hype–fail cycle could discourage others from investing in similar techniques that might be better. Sometimes, in a competitive field such as AI, a well-publicized set of results can be enough to stop rivals from entering the same field.

Slow and careful research is a better approach. Backed by reliable data and robust methods, it may take longer, and will not churn out as many crowd-pleasing announcements. But it could prevent deaths and change lives.

Nature 555, 285 (2018)

doi: 10.1038/d41586-018-03067-x