The use of artificial intelligence, and the deep-learning subtype in particular, has been enabled by the use of labeled big data, along with markedly enhanced computing power and cloud storage, across all sectors. In medicine, this is beginning to have an impact at three levels: for clinicians, predominantly via rapid, accurate image interpretation; for health systems, by improving workflow and the potential for reducing medical errors; and for patients, by enabling them to process their own data to promote health. The current limitations, including bias, privacy and security, and lack of transparency, along with the future directions of these applications will be discussed in this article. Over time, marked improvements in accuracy, productivity, and workflow will likely be actualized, but whether that will be used to improve the patient–doctor relationship or facilitate its erosion remains to be seen.
Medicine is at the crossroads of two major trends. The first is a failed business model, with increasing expenditures and jobs allocated to healthcare, but with deteriorating key outcomes, including reduced life expectancy and high infant, childhood, and maternal mortality in the United States1,2. This exemplifies a paradox that is not at all confined to American medicine: investment of more human capital with worse human health outcomes. The second is the generation of data in massive quantities, from sources such as high-resolution medical imaging, biosensors with continuous output of physiologic metrics, genome sequencing, and electronic medical records. The limits on analysis of such data by humans alone have clearly been exceeded, necessitating an increased reliance on machines. Accordingly, at the same time that there is more dependence than ever on humans to provide healthcare, algorithms are desperately needed to help. Yet the integration of human and artificial intelligence (AI) for medicine has barely begun.
Looking deeper, there are notable, longstanding deficiencies in healthcare that are responsible for its path of diminishing returns. These include a large number of serious diagnostic errors, mistakes in treatment, an enormous waste of resources, inefficiencies in workflow, inequities, and inadequate time between patients and clinicians3,4. Eager for improvement, leaders in healthcare and computer scientists have asserted that AI might have a role in addressing all of these problems. That might eventually be the case, but researchers are at the starting gate in the use of neural networks to ameliorate the ills of the practice of medicine. In this Review, I have gathered much of the existing base of evidence for the use of AI in medicine, laying out the opportunities and pitfalls.
Artificial intelligence for clinicians
Almost every type of clinician, ranging from specialty doctor to paramedic, will be using AI technology, and in particular deep learning, in the future. This largely involves pattern recognition using deep neural networks (DNNs) (Box 1) that can help interpret medical scans, pathology slides, skin lesions, retinal images, electrocardiograms, endoscopy, faces, and vital signs. The neural net interpretation is typically compared with physicians’ assessments using a plot of true-positive versus false-positive rates, known as a receiver operating characteristic (ROC) curve, for which the area under the curve (AUC) is used to express the level of accuracy (Box 1).
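To make the ROC and AUC terminology concrete, the brief sketch below scores a hypothetical classifier's predicted probabilities against expert labels. The data are invented and scikit-learn is assumed to be available; this is purely illustrative and does not reproduce any of the studies discussed here.

```python
# Minimal sketch: scoring a classifier's predicted probabilities against
# expert labels with an ROC curve and its AUC (invented data, scikit-learn).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels (1 = disease present) and model probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.10, 0.35, 0.80, 0.65, 0.20, 0.90, 0.60, 0.70, 0.55, 0.30])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # false- vs. true-positive rates
auc = roc_auc_score(y_true, y_prob)               # area under that curve
print(f"AUC = {auc:.2f}")  # 1.0 = perfect discrimination; 0.5 = chance
```

An AUC of 1.0 corresponds to perfect discrimination and 0.5 to chance; as discussed below, even a high AUC does not by itself establish clinical utility.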
One field that has attracted particular attention for application of AI is radiology5. Chest X-rays are the most common type of medical scan, with more than 2 billion performed worldwide per year. In one study, the accuracy of one algorithm, based on a 121-layer convolutional neural network, in detecting pneumonia in over 112,000 labeled frontal chest X-ray images was compared with that of four radiologists, and the conclusion was that the algorithm outperformed the radiologists. However, the algorithm’s AUC of 0.76, although somewhat better than that for two previously tested DNN algorithms for chest X-ray interpretation5, is far from optimal. In addition, the test used in this study is not necessarily comparable with the daily tasks of a radiologist, who will diagnose much more than pneumonia in any given scan. To further validate the conclusions of this study, a comparison with results from more than four radiologists should be made. A team at Google used an algorithm that analyzed the same image set as in the previously discussed study to make 14 different diagnoses, resulting in AUC scores that ranged from 0.63 for pneumonia to 0.87 for heart enlargement or a collapsed lung6. More recently, in another related study, it was shown that a DNN that is currently in use in hospitals in India for interpretation of four different chest X-ray key findings was at least as accurate as four radiologists7. For the narrower task of detecting cancerous pulmonary nodules on a chest X-ray, a DNN that retrospectively assessed scans from over 34,000 patients achieved a level of accuracy exceeding 17 of 18 radiologists8. It can be difficult for emergency room doctors to accurately diagnose wrist fractures, but a DNN led to marked improvement, increasing sensitivity from 81% to 92% and reducing misinterpretation by 47% (ref. 9).
Similarly, DNNs have been applied across a wide variety of medical scans, including bone films for fractures and estimation of aging10,11,12, classification of tuberculosis13, and vertebral compression fractures14; computed tomography (CT) scans for lung nodules15, liver masses16, pancreatic cancer17, and coronary calcium score18; brain scans for evidence of hemorrhage19, head trauma20, and acute referrals21; magnetic resonance imaging22; echocardiograms23,24; and mammography25,26. A unique imaging-recognition study focusing on the breadth of acute neurologic events, such as stroke or head trauma, was carried out on over 37,000 head CT 3-D scans, which the algorithm analyzed for 13 different anatomical findings versus gold-standard labels (annotated by expert radiologists) and achieved an AUC of 0.73 (ref. 27). A simulated prospective, double-blind, randomized controlled trial was conducted with real cases from the dataset and showed that the deep-learning algorithm could interpret scans 150 times faster than radiologists (1.2 versus 177 seconds). But the sobering conclusion was that the algorithm’s diagnostic accuracy in screening acute neurologic scans was poorer than human performance, indicating that there is much more work to do.
For each of these studies, a relatively large number of labeled scans were used for training and subsequent evaluation, with AUCs ranging from 0.99 for hip fracture to 0.84 for intracranial bleeding and liver masses to 0.56 for acute neurologic case screening. It is not possible to compare DNN accuracy from one study to the next because of marked differences in methodology. Furthermore, ROC and AUC metrics are not necessarily indicative of clinical utility or even the best way to express accuracy of the model’s performance28,29. Moreover, many of these reports still only exist in preprint form and have not appeared in peer-reviewed publications. Validation of the performance of an algorithm in terms of its accuracy is not equivalent to demonstrating clinical efficacy. This is what Pearse Keane and I have referred to as the ‘AI chasm’—that is, an algorithm with an AUC of 0.99 is not worth very much if it is not proven to improve clinical outcomes30. Among the studies that have gone through peer review (many of which are summarized in Table 1), the only prospective validation studies in a real-world setting have been for diabetic retinopathy31,32, detection of wrist fractures in the emergency room setting33, histologic breast cancer metastases34,35, very small colonic polyps36,37, and congenital cataracts in a small group of children38. The field clearly is far from demonstrating very high and reproducible machine accuracy, let alone clinical utility, for most medical scans and images in the real-world clinical environment (Table 1).
Pathologists have been much slower than radiologists to adopt digitization39: glass slides are still not routinely converted to digital images via whole-slide imaging (WSI), which enables viewing of an entire tissue sample on a slide. Marked heterogeneity and inconsistency among pathologists’ interpretations of slides has been amply documented, exemplified by a lack of agreement in diagnosis of common types of lung cancer (κ = 0.41–0.46)40. Deep learning of digitized pathology slides offers the potential to improve accuracy and speed of interpretation, as assessed in a few retrospective studies. In a study of WSI of breast cancer, with or without lymph node metastases, that compared the performance of 11 pathologists with that of multiple algorithmic interpretations, the results varied and were affected in part by the length of time that the pathologists had to review the slides41. Some of the five algorithms performed better than the group of pathologists, who had varying expertise. The pathologists were given 129 test slides and had less than 1 minute for review per slide, which likely does not reflect normal workflow. On the other hand, when one expert pathologist had no time limits and took 30 hours to review the same slide set, the expert’s results were comparable with those of the algorithm for detecting noninvasive ductal carcinoma42.
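For readers unfamiliar with the κ statistic cited above, the short sketch below computes Cohen's kappa for two hypothetical pathologists labeling the same slides. The labels are invented and scikit-learn is assumed, so the resulting number is illustrative only.

```python
# Illustrative sketch: quantifying interobserver agreement between two
# pathologists with Cohen's kappa (hypothetical slide-level labels).
from sklearn.metrics import cohen_kappa_score

pathologist_a = ["adeno", "squamous", "adeno", "adeno", "squamous", "adeno"]
pathologist_b = ["adeno", "adeno", "adeno", "squamous", "squamous", "adeno"]

kappa = cohen_kappa_score(pathologist_a, pathologist_b)
print(f"kappa = {kappa:.2f}")  # 1 = perfect agreement; 0 = chance-level agreement
```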
Other studies have assessed deep-learning algorithms for classifying breast cancer43 and lung cancer40 without direct comparison with pathologists. Brain tumors can be challenging to subtype, and machine learning using tumor DNA methylation patterns via sequencing led to markedly improved classification compared with pathologists using traditional histological data44,45. DNA methylation generates extensive data and at present is rarely performed in the clinic for classification of tumors, but this study suggests another potential for AI to provide improved diagnostic accuracy in the future. A deep-learning algorithm for lung cancer digital pathology slides not only was able to accurately classify tumors, but also was trained to detect the pattern of several specific genomic driver mutations that would not otherwise be discernible by pathologists33.
The first prospective study to test the accuracy of an algorithm classifying digital pathology slides in a real clinical setting was an assessment of the identification of presence of breast cancer micrometastases in slides by six pathologists compared with a DNN (that had been retrospectively validated34). The combination of pathologists and the algorithm led to the best accuracy, and the algorithm markedly sped up the review of slides35. This study is particularly notable, as the synergy of the combined pathologist and algorithm interpretation was emphasized instead of the pervasive clinician-versus-algorithm comparison. Apart from classifying tumors more accurately by data processing, the use of a deep-learning algorithm to sharpen out-of-focus images may also prove useful46. A number of proprietary algorithms for image interpretation have been approved by the Food and Drug Administration (FDA), and the list is expanding rapidly (Table 2), yet there have been few peer-reviewed publications from most of these companies. In 2018, the FDA published a fast-track approval plan for AI medical algorithms.
For algorithms classifying skin cancer by image analysis, the accuracy of diagnosis of deep-learning networks has been compared with that of dermatologists. In a study using a large training dataset of nearly 130,000 photographic and dermoscopic digitized images, 21 US board-certified dermatologists were at least matched in performance by an algorithm, which had an AUC of 0.96 for carcinoma47 and of 0.94 for melanoma specifically. Subsequently, the accuracy of melanoma skin cancer diagnosis by a group of 58 international dermatologists was compared with a convolutional neural network; the mean AUCs were 0.79 and 0.86, respectively, reflecting an improved performance of the algorithm compared with most of the physicians48. A third study carried out algorithmic assessment of 12 skin diseases, including basal cell carcinoma, squamous cell carcinoma, and melanoma, and compared this with 16 dermatologists, with the algorithm achieving an AUC of 0.96 for melanoma49. None of these studies were conducted in the clinical setting, in which a doctor would perform physical inspection and shoulder responsibility for making an accurate diagnosis. Notwithstanding these concerns, most skin lesions are diagnosed by primary care doctors, and problems with inaccuracy have been underscored; if AI can be reliably shown to match the accuracy of experienced dermatologists, that would represent a significant advance.
There have been a number of studies comparing performance between algorithms and ophthalmologists in diagnosing different eye conditions. After training with over 128,000 retinal fundus photographs labeled by 54 ophthalmologists, a neural network was used to assess over 10,000 retinal fundus photographs from more than 5,000 patients for diabetic retinopathy, and the neural network’s grading was compared with seven or eight ophthalmologists for all-cause referable diagnoses (moderate or worse retinopathy or macular edema; scale: none, mild, moderate, severe, or proliferative). In two separate validation sets, the AUC was 0.99 (refs. 50,51). In a study in which retinal fundus photographs were used for the diagnosis of age-related macular degeneration (AMD), the accuracy for DNN algorithms ranged between 88% and 92%, nearly as high as for expert ophthalmologists52. Performance of a deep-learning algorithm for interpreting retinal optical coherence tomography (OCT) was compared with ophthalmologists for diagnosis of either of the two most common causes of vision loss: diabetic retinopathy or AMD. After the algorithm was trained on a dataset of over 100,000 OCT images, validation was performed in 1,000 of these images, and performance was compared with six ophthalmologists. The algorithm’s AUC for OCT-based urgent referral was 0.999 (refs. 53,54,55).
Another deep-learning OCT retinal study went beyond the diagnosis of diabetic retinopathy or macular degeneration. A group of 997 patients with a wide range of 50 retinal pathologies was assessed for urgent referral by an algorithm (using two different types of OCT devices that produce 3-D images) and results were compared with those from experts: four retinal specialists and four optometrists, with an AUC of 0.992 for accuracy of urgent-referral triage. The algorithm did not miss a single urgent referral case. Notably, the eight clinicians agreed on only 65% of the referral decisions. Errors on the correct referral decision were reduced for both types of clinicians by integrating the fundus photograph and notes on the patient, but the algorithm’s error rate (without notes or fundus photographs) of 3.5% was as good as or better than that of all eight experts56. One unique aspect of this study was the transparency of the two neural networks used, one for mapping the eye OCT scans into a tissue schematic and the other for classifying eye disease. The user (patient) can watch a video that shows what portions of his or her scan were used to reach the algorithm’s conclusions along with the level of confidence it has for the diagnosis. This sets a new bar for future efforts to unravel the ‘black box’ of neural networks.
In a prospective trial conducted in primary care clinics, 900 patients with diabetes but no known retinopathy were assessed by a proprietary system (an imaging device combined with an algorithm) made by IDx (Iowa City, IA) that obtained retinal fundus photographs and OCT, and by established reading centers with expertise in interpreting these images30,31. The algorithm had been autodidactic up until the clinical trial, at which point it was locked for testing; it achieved a sensitivity of 87% and specificity of 91% for the 819 patients (91% of the enrolled cohort) with analyzable images. This trial led to FDA approval of the IDx device and algorithm for autonomous detection, that is, without the need for a clinician, of ‘more than mild’ diabetic retinopathy. The regulatory oversight in dealing with deep-learning algorithms is tricky because it does not currently allow continued autodidactic functionality but instead necessitates fixing the software to behave like a non-AI diagnostic system30. Notwithstanding this point, along with the unknown extent of uptake of the device, the study represents a milestone as the first prospective assessment of AI in the clinic. The accuracy results are not as good as those of the aforementioned in silico studies, as should be anticipated. A small prospective real-world assessment of a DNN for diabetic retinopathy in primary care clinics, with eye exams performed by nurses, led to a high false-positive diagnosis rate32.
While the studies of retinal OCT and fundus images have thus far focused on eye conditions, recent work suggests that these images can provide a window to the brain for early diagnosis of dementia, including Alzheimer’s disease57.
The potential use of retinal photographs also appears to transcend eye diseases per se. Images from over 280,000 patients were assessed by DNN for cardiovascular risk factors, including age, gender, systolic blood pressure, smoking status, hemoglobin A1c, and likelihood of having a major adverse cardiac event, with validation in two independent datasets. The AUC for gender at 0.97 was notable, indicating that the algorithm could identify gender accurately from the retinal photo, but the others were in the range of 0.70, suggesting that there may be a signal that, through further pursuit, could be useful for monitoring patients for control of their risk factors58,59.
Other less common eye conditions that have been assessed by neural networks include congenital cataracts38 and retinopathy of prematurity in newborns60, both with accuracy comparable with that of eye specialists.
The major images that cardiologists use in practice are electrocardiograms (ECGs) and echocardiograms, both of which have been assessed with DNNs. There is a nearly 40-year history of machine-read ECGs using rules-based algorithms with notable inaccuracy61. When deep learning was used to diagnose heart attack in a small retrospective dataset of 549 ECGs, a sensitivity of 93% and specificity of 90% were reported, comparable with the performance of cardiologists62. Over 64,000 one-lead ECGs (from over 29,000 patients) were assessed for arrhythmia by a DNN and six cardiologists, with comparable accuracy across 14 different electrical conduction disturbances63. For echocardiography, a small set of 267 patient studies (consisting of over 830,000 still images) was classified into 15 standard views (such as apical 4-chamber or subcostal) by a DNN and by cardiologists. The overall accuracy for single still images was 92% for the algorithm and 79% for four board-certified echocardiographers, but this does not reflect the real-world reading of studies, which are in-motion video loops23. An even larger retrospective study of over 8,000 echocardiograms showed high accuracy for classification of hypertrophic cardiomyopathy (AUC, 0.93), cardiac amyloid (AUC, 0.87), and pulmonary artery hypertension (AUC, 0.85)24.
Finding diminutive (<5 mm) adenomatous or sessile polyps at colonoscopy can be exceedingly difficult for gastroenterologists. The first prospective clinical validation of AI for this task was performed in 325 patients who collectively had 466 tiny polyps, with an accuracy of 94% and negative predictive value of 96% during real-time, routine colonoscopy36,64. The speed of AI optical diagnosis was 35 seconds, and the algorithm worked equally well for both novice and expert gastroenterologists, without the need for injecting dyes. The findings of enhanced speed and accuracy were replicated in another independent study37. Such results are thematic: machine vision, at high magnification, can accurately and quickly interpret specific medical images as well as or better than humans.
The enormous burden of mental health, such as the 350 million people around the world battling depression74, is especially noteworthy, as there is potential here for AI to lend support to the affected patients and the vastly insufficient number of clinicians. Various tools that are in development include digital tracking of depression and mood via keyboard interaction, speech, voice, facial recognition, sensors, and use of interactive chatbots75,76,77,78,79,80. Facebook posts have been shown to predict the diagnosis of depression later documented in electronic medical records81.
Machine learning has been explored for predicting successful antidepressant medication82, characterizing depression83,84,85, predicting suicide83,86,87,88, and predicting bouts of psychosis in schizophrenics89.
The use of AI algorithms has been described in many other clinical settings, such as facilitating the diagnosis of stroke, autism, or electroencephalographic abnormalities for neurologists65,66, helping anesthesiologists avoid low oxygenation during surgery67, diagnosing stroke or heart attack for paramedics68, finding suitable clinical trials for oncologists69, selecting viable embryos for in vitro fertilization70, helping to diagnose congenital conditions via facial recognition71, and pre-empting surgery for patients with breast cancer72. Examples of the breadth of AI applications across the human lifespan are shown in Fig. 2.

There is considerable effort across many startups and established tech companies to develop natural language processing to replace the need for keyboards and human scribes for clinic visits73. The list of companies active in this space includes Microsoft, Google, Suki, Robin Healthcare, DeepScribe, Tenor.ai, Saykara, Sopris Health, Carevoice, Orbita, Notable, Sensely, and Augmedix.
Artificial intelligence and health systems
Being able to predict key outcomes could, theoretically, make the use of hospital resources more efficient and precise. For example, if an algorithm could be used to estimate the risk of a patient’s hospital readmission that would otherwise be undetectable given the usual clinical criteria for discharge, steps could be taken to avert discharge and attune resources to the underlying issues. For a critically ill patient, an accurate prediction of the likelihood of short-term survival might help the patient, their family, and their doctor make decisions regarding resuscitation, insertion of an endotracheal tube for mechanical ventilation, and other invasive measures. Similarly, it is possible that deciding which patients might benefit from palliative care and determining who is at risk of developing sepsis or septic shock could be aided by AI predictive tools. Using electronic health record data, machine- and deep-learning algorithms have been able to predict many important clinical parameters, ranging from Alzheimer’s disease to death (Table 3)86,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107. For example, in a recent study, reinforcement learning was retrospectively carried out on two large datasets to recommend the use of vasopressors, intravenous fluids, and/or medications and the dose of the selected treatment for patients with sepsis; the treatment selected by the ‘AI Clinician’ was, on average, reliably better than that chosen by human clinicians108. Both the size of the cohorts studied and the range of AUC accuracy reported have been quite heterogeneous, and all of these reports are retrospective and have yet to be validated in the real-world clinical setting. Nevertheless, many companies are already marketing such algorithms; Careskore, for example, provides health systems with estimates of the risk of readmission and mortality based on EHR data109. Beyond this issue, there is a difference between a prediction metric for a cohort and one for an individual. If a model’s AUC is 0.95, which most would qualify as very accurate, this reflects how good the model is at predicting an outcome, such as death, for the overall cohort. But most models are essentially classifiers and are not capable of precise prediction at the individual level, so there is still an important dimension of uncertainty.
In addition to data from electronic health records, imaging has been integrated to enhance predictive accuracy98. Multiple studies have attempted to predict biological age110,111, and this has been shown to be best accomplished using DNA methylation–based biomarkers112. With respect to the accuracy of these predictive algorithms, the incompleteness of data input is noteworthy, since a large proportion of unstructured data—the free text in clinician notes that cannot be ingested from the medical record—has not been incorporated, and neither have many other modalities such as socioeconomic, behavioral, biologic ‘-omics’, or physiologic sensor data. Further, concerns have been raised about the potential to overfit data owing to small sample sizes in some instances. It has also been pointed out how essential it is to have k-fold cross-validation of a model through successive, mutually exclusive validation datasets, which is missing from most of these publications. There is also considerable debate about using AUC as the key performance metric, since it ignores actual probability values and may be particularly misleading in regard to the sensitivity and specificity values that are of clinical interest113.
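As a minimal illustration of the k-fold cross-validation called for above, the sketch below evaluates a simple classifier on synthetic data across five mutually exclusive folds. The model, data, and scoring choice are demonstration assumptions, not a reconstruction of any cited study.

```python
# Minimal sketch of k-fold cross-validation: each of five mutually exclusive
# folds is held out once for validation (synthetic data, scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"per-fold AUC: {np.round(aucs, 2)}, mean = {aucs.mean():.2f}")
```

Reporting per-fold as well as mean performance gives a sense of how stable the model is across mutually exclusive validation subsets.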
In summary, it is not yet known how well AI can predict key outcomes in the healthcare setting, and this will not be determined until there is robust validation in prospective, real-world clinical environments, with rigorous statistical methodology and analysis.
Machine vision (also known as computer vision), which uses data from ambient sensors, is attracting considerable attention in health systems for promoting safety, for example by monitoring proper clinician handwashing114, observing critically ill patients in the intensive care unit115, and detecting patients at risk of falling116. Weaning patients in the intensive care unit from mechanical ventilation is often haphazard and inefficient; a reinforcement-learning algorithm using machine vision has shown considerable promise in this regard117. There are also ongoing efforts to digitize surgery that include machine vision observation of the team and equipment in the operating room and performance of the surgeon; real-time, high-resolution, AI-processed imaging of the relevant anatomy of a patient; and integration of all of a patient’s preoperative data, including full medical history, labs, and scans118,119. Extremely delicate microsurgery, such as that inside the eye, has now been performed with AI assistance120. There is considerable promise in markedly reducing the radiation and time requirements for image acquisition and segmentation in preparation for radiotherapy via the use of deep-learning algorithms for image reconstruction121 and of generative adversarial networks to improve the quality of medical scans. These improvements will, when widely implemented, promote safety, convenience, and lower cost122,123,124.
Of the more than $3.5 trillion per year (and rising) expenditures for healthcare in the United States, almost a third is related to hospitals. With FDA-approved wearable sensors that can continuously monitor all vital signs—including blood pressure, heart rate and rhythm, blood oxygen saturation, respiratory rate, and temperature—there is the potential to preempt a large number of patients being hospitalized in the future. There has not yet been algorithmic development and prospective testing for remote monitoring, but this deserves aggressive pursuit as it could reduce the costs of care without sacrificing convenience and comfort for a patient and family. The reduction of nosocomial infections alone would be an alluring path for promoting safety.
It has been estimated that, per day, AI would process over 250 million images for the cost of about $1,000 (ref. 125), representing a staggering hypothetical savings of billions of dollars. Besides the productivity and workflow gains that can be derived from AI-assisted image interpretation and clinician support, there is potential to reduce the workforce for many types of back-office, administrative jobs such as coding and billing, scheduling of operating rooms and clinic appointments, and staffing. At Geisinger Health in Pennsylvania, over 100,000 patients have undergone exome sequencing; the results are provided via an AI chatbot (Clear Genetics), which is well-received by most patients and reduces the need for genetic counselors. This demonstrates how a health system can leverage AI tools to provide complex information without having to rely on expansion of highly trained personnel.
Perhaps the greatest long-term potential of AI in health systems is the development of a massive data infrastructure to support nearest-neighbor analysis, another application of AI, used to identify ‘digital twins.’ If each person’s comprehensive biologic, anatomic, physiologic, environmental, socioeconomic, and behavioral data, including treatment and outcomes, were entered, an extraordinary learning system would be created. There have been great benefits derived from jet-engine digital twins126 that use an ultrahigh-fidelity model engine to simulate the flight conditions of a particular jet, but such a model has yet to be completed at any scale for patients, who theoretically could benefit from being informed of the best prevention methods, treatments, and outcomes for various conditions by their relevant twin’s data127.
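The following sketch illustrates, in highly simplified form, the nearest-neighbor matching idea behind 'digital twins': given standardized multimodal feature vectors (here random stand-ins), it retrieves the prior patients most similar to a new one. The features, distance metric, and cohort size are all hypothetical assumptions for illustration.

```python
# Conceptual sketch of 'digital twin' matching via nearest-neighbor search:
# retrieve the prior patients whose (randomly generated, standardized)
# multimodal feature vectors lie closest to a new patient's.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
cohort_features = rng.normal(size=(10_000, 50))  # 10,000 prior patients, 50 features
new_patient = rng.normal(size=(1, 50))           # the patient seeking guidance

nn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(cohort_features)
distances, indices = nn.kneighbors(new_patient)
print("closest prior patients (row indices):", indices[0])
# In principle, the treatments and outcomes of these nearest neighbors could
# then inform the new patient's options.
```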
Artificial intelligence and patients
The work for developing deep-learning algorithms to enable the public to take their healthcare into their own hands has lagged behind that for clinicians and health systems, but there are a few such algorithms that have been FDA-cleared or are in late-stage clinical development. In late 2017, a smartwatch algorithm was FDA-cleared to detect atrial fibrillation128, and subsequently in 2018 Apple received FDA approval for its algorithm used with the Apple Watch Series 4 (refs. 129,130). The photoplethysmography and accelerometer sensors on the watch learn the user’s heart rate at rest and with physical activity, and when there is a significant deviation from expected, the user is given a haptic warning to record an ECG via the watch, which is then interpreted by an algorithm. There are legitimate concerns that the widescale use of such an algorithm, particularly in the low-risk, young population who wear Apple watches, will lead to a substantial number of false-positive atrial fibrillation diagnoses and prompt unnecessary medical evaluations131. In contrast, deep learning applied to the smartwatch ECG, which can accurately detect high blood potassium levels, may prove particularly useful for patients with kidney disease. This concept of a ‘bloodless’ blood potassium level (Fig. 2) reading via a smartwatch algorithm embodies the prospect of an algorithm able to provide information that was not previously obtainable or discernible without the technology.
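To illustrate the deviation-based logic described above, the toy sketch below learns a baseline resting heart rate and flags readings that depart markedly from it. The z-score rule and threshold are hypothetical and are not the actual smartwatch algorithm.

```python
# Toy illustration of the idea described above: learn a user's expected
# resting heart rate and flag marked deviations that might prompt an ECG.
# The z-score rule and threshold are hypothetical, not the watch's algorithm.
import numpy as np

resting_history = np.array([62, 64, 61, 63, 65, 60, 62])  # beats per minute
baseline = resting_history.mean()
spread = resting_history.std()

def should_prompt_ecg(current_bpm, z_threshold=4.0):
    """Return True if the current resting heart rate deviates markedly
    from the learned baseline (a crude z-score rule)."""
    z = abs(current_bpm - baseline) / max(spread, 1e-6)
    return z > z_threshold

print(should_prompt_ecg(63))   # False: within the expected range
print(should_prompt_ecg(115))  # True: large deviation; prompt an ECG recording
```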
Smartphone exams with AI are being pursued for a variety of medical diagnostic purposes, including skin lesions and rashes, ear infections, migraine headaches, and retinal diseases such as diabetic retinopathy and age-related macular degeneration. Some smartphone apps are using AI to monitor medical adherence, such as AiCure (NCT02243670), which has the patient take a selfie video as they swallow their prescribed pill. Other apps use image recognition of food for calorie and nutritional content132. In what may be seen as an outgrowth of dating apps that use AI nearest-neighbor analysis to find matches, there are now efforts to use the same methodology for matchmaking patients with primary care doctors to engender higher levels of trust133.
One study has recently achieved the continuous sensing of blood glucose (for 2 weeks) along with assessment of the gut microbiome, physical activity, sleep, medications, all food and beverage intake, and a variety of lab tests134,135,136. This multimodal data collection and analysis has led to the ability to predict the glycemic response to specific foods for an individual, a physiologic pattern that is remarkably heterogeneous among people and significantly driven by the gut microbiome. The use of continuous glucose sensors, which now are factory-calibrated, preempting the need for finger-stick glucose calibrations, has shown that post-prandial glucose spikes commonly occur, even in healthy people without diabetes137,138. It remains uncertain whether the glucose spikes indicate a higher risk of developing diabetes, but there are data suggesting this possibility139 along with mechanistic links to gastrointestinal barrier dysfunction140,141 in experimental models. Nevertheless, the use of AI with multimodal data to guide an individualized diet is a precedent for virtual medical coaching in the future. At present, simple rules-based algorithms, based upon whether glucose values are rising or falling, are used for glucose management in people with diabetes. While these have helped avert hypoglycemic episodes142, smart algorithms that incorporate an individual’s comprehensive data are likely to be far more informative and helpful. In this manner, most common chronic conditions, such as hypertension, depression, and asthma, could theoretically be better managed with virtual coaching. With the remarkable progress in the accuracy of AI speech recognition and the accompanying soaring popularity of smart speakers, it is easy to envision that this would be performed via a voice platform, with or without an avatar. Eventually, when all of an individual’s data and the corpus of medical literature can be incorporated, a holistic, preventive approach would be possible (Fig. 3).
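The contrast with today's rules-based glucose management can be made concrete with the sketch below, which flags glucose values that are out of range or changing rapidly. The thresholds and rates are illustrative assumptions only, not clinical guidance and not any particular commercial system.

```python
# Sketch of simple rules-based logic of the kind described above: alert when
# glucose is out of range or changing rapidly. Thresholds are illustrative
# assumptions only, not clinical guidance.
def glucose_alert(readings_mg_dl, minutes_apart=5):
    """readings_mg_dl: recent continuous-glucose-monitor values, oldest first."""
    current = readings_mg_dl[-1]
    rate = (readings_mg_dl[-1] - readings_mg_dl[-2]) / minutes_apart  # mg/dL per minute

    if current < 70 or (current < 90 and rate < -2):
        return "low or falling fast: consider carbohydrates"
    if current > 250 or (current > 180 and rate > 3):
        return "high or rising fast: check insulin and activity"
    return "in range"

print(glucose_alert([118, 104, 92, 78]))    # falling toward hypoglycemia
print(glucose_alert([140, 160, 185, 210]))  # rising post-prandial spike
```

Rules of this kind consider only the glucose trace itself, which is precisely why algorithms that also ingest diet, activity, sleep, and microbiome data are expected to be more informative.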
Artificial intelligence and data analysis
While upstream from clinical practice, AI progress in life science has been notably faster, with extensive peer-reviewed publication, an easier path to validation without regulatory oversight, and far more willingness among the scientific community for implementation. As the stethoscope is the icon of doctors, the microscope is the icon of scientists. Using AI, Christiansen et al.143 developed in silico labeling. Instead of the routine fluorescent staining of microscopic images, which can harm and kill cells and involves a complex preparation, this machine-learning algorithm predicts the fluorescent labels, ushering in ‘image-free’ microscopy143,144,145. Soon thereafter, Ota et al.146 reported another image-free flow AI analytic method that they called ‘ghost cytometry’ to accurately identify rare cells, a capability that was replicated and extended by Nitta et al.147 with image-activated AI cell sorting. This use of machine learning addresses the formidable problem of identifying and isolating rare cells by rapid, high-throughput, and accurate sorting on the basis of cell morphology that does not require the use of biomarkers. Besides promoting image-free microscopy and cytometry, deep-learning AI has been used to restore or fix out-of-focus images148. And computer vision has made possible high-throughput assessment of 40-plex proteins and organelles within a single cell149,150.
Another challenge confronted by machine and deep learning has been in the analytics of genomic and other -omics biology datasets. Open-source algorithms have been developed for classifying or analyzing whole-genome sequence pathogenic variants151,152,153,154,155,156,157,158, somatic cancer mutations159, gene–gene interactions160, RNA sequencing data161, methylation162, prediction of protein structure and protein–protein interactions163, the microbiome164, and single cells165. While these reports have generally represented a single -omics approach, there are now multi-omic algorithms being developed166,167 that integrate the datasets. The use of genome editing has also been facilitated by algorithmic prediction of CRISPR guide RNA activity168 and off-target activities169.
Noteworthy is the use of AI tools to enhance understanding of how cancer evolves via application of a transfer-learning algorithm to multiregional tumor-sequencing data170 and of machine vision for analysis of live cancer cells at single-cell resolution via microfluidic isolation171. Both of these novel approaches may ultimately be helpful in both risk stratification of patients and guiding therapy.
Given that neural networks take their name and inspiration from biology, it is not surprising that the influence is bidirectional: biological neuroscience impacting AI and vice versa172. A couple of examples in Drosophila are noteworthy. Robie et al.173 took videos of 400,000 flies and used machine learning and machine vision to map phenotype with gene expression and neuroanatomy. Whole-brain maps were generated for movement, female aggression, and many other traits. In another study, nearest-neighbor analysis was used to understand how odors are sensed by the flies, that is, their smell algorithm174.
AI has been used to reconstruct neural circuits, allowing an understanding of connectomics, from electron microscopy175. One of the most impressive advances facilitated by AI has been in understanding the human brain’s grid cells—which enable perception of the speed and direction of movement of the body, i.e., its place in space176,177. Reciprocally, neuromorphic computing, or reverse-engineering of the brain to make computer chips, is not only leading to more efficient computing, but also helping researchers understand brain circuitry and build brain–machine interfaces172,178,179. Machine vision tracking of human and animal behavior with a transfer-learning algorithm is yet another example of the progress being made180.
Drug discovery is being revamped with the use of AI at many levels, including sophisticated natural language processing searches of the biomedical literature, data mining of millions of molecular structures, designing and making new molecules, predicting off-target effects and toxicity, predicting the right dose for experimental drugs, and developing cellular assays at a massive scale181,182,183,184. There is new hope that preclinical animal testing can be reduced via machine-learning prediction of toxicity185. AI cryptography has been used to combine large proprietary pharmaceutical company datasets and discover previously unidentified drug interactions186. The story of the robot ‘Eve’, developed at the Universities of Cambridge and Manchester, and how it autonomously discovered an antimalarial drug that is a constituent of toothpaste has galvanized interest in using AI to accelerate the drug-discovery process, with a long list of start-ups and partnerships with major pharmaceutical firms181,187,188.
Limitations and challenges
Despite all the promises of AI technology, there are formidable obstacles and pitfalls. The state of AI hype has far exceeded the state of AI science, especially when it pertains to validation and readiness for implementation in patient care. A recent example is IBM Watson Health’s cancer AI algorithm (known as Watson for Oncology). Used by hundreds of hospitals around the world for recommending treatments for patients with cancer, the algorithm was based on a small number of synthetic cases with very limited input (real data) from oncologists189. Many of the actual output recommendations for treatment were shown to be erroneous, such as suggesting the use of bevacizumab in a patient with severe bleeding, which represents an explicit contraindication and ‘black box’ warning for the drug189. This example also highlights the potential for major harm to patients, and thus for medical malpractice, by a flawed algorithm. Instead of a single doctor’s mistake hurting a patient, the potential for a machine algorithm to induce iatrogenic harm at scale is vast. This is all the more reason that systematic debugging, audit, extensive simulation, and validation, along with prospective scrutiny, are required when an AI algorithm is unleashed in clinical practice. It also underscores the need to require more evidence and robust validation than the recently downgraded FDA regulatory requirements for approval of medical algorithms demand190.
There has been much written about the black box of algorithms, and much controversy surrounding this topic191,192,193; especially in the case of DNNs, it may not be possible to understand how the output was determined. This opaqueness has led to demands for explainability, such as the European Union’s General Data Protection Regulation requirement for transparency—deconvolution of an algorithm’s black box—before an algorithm can be used for patient care194. While this debate over whether it is acceptable to use nontransparent algorithms for patient care is unsettled, it is notable that many aspects of the practice of medicine are unexplained, such as prescription of a drug without a known mechanism of action.
Inequities are one of the most important problems in healthcare today, especially in the United States, which does not provide care for all of its citizens. With the knowledge that low socioeconomic status is a major risk factor for premature mortality195, the disproportionate use of AI in the ‘haves,’ as opposed to the ‘have-nots,’ could widen the present gap in health outcomes. Intertwined with this concern of exacerbating pre-existing inequities is embedded bias present in many algorithms due to lack of inclusion of minorities in datasets. Examples are the algorithms in dermatology that diagnose melanoma but lack inclusion of skin color47 and the use of the corpus of genomic data, which so far has seriously underrepresented minorities196. While there are arguments that algorithm bias is exceeded by human bias197, much work is needed to eradicate embedded prejudice and strive for medical research that provides a true representative cross-section of the population.
An overriding issue for the future of AI in medicine rests with how well privacy and security of data can be assured. Given the pervasive problems of hacking and data breaches, there will be little interest in use of algorithms that risk revealing the details of patient medical history198. Moreover, there is the risk of deliberate hacking of an algorithm to harm people at a large scale, such as overdosing insulin in diabetics or stimulating defibrillators to fire inside the chests of patients with heart disease. It is increasingly possible for an individual’s identity to be determined by facial recognition or genomic sequence from massive databases, which further impedes protection of privacy. At the same time, the blurring of truth made possible by generative adversarial networks, with seemingly unlimited capacity to manipulate content, could be highly detrimental for health198,199. New models of health data ownership with rights to the individual, use of highly secure data platforms, and governmental legislation, as has been achieved in Estonia, are needed to counter the looming security issues that will otherwise hold up or ruin the chances for progress in AI for medicine200,201,202.
A key point that I have emphasized throughout this Review is that the narrative of bringing AI to medicine is just beginning. There has been remarkably little prospective validation for tasks that machines could perform to help clinicians or predict clinical outcomes that would be useful for health systems, and even less for patient-centered algorithms. The field is certainly high on promise and relatively low on data and proof. The risk of faulty algorithms is exponentially higher than that of a single doctor–patient interaction, yet the reward for reducing errors, inefficiencies, and cost is substantial. Accordingly, there cannot be exceptionalism for AI in medicine—it requires rigorous studies, publication of the results in peer-reviewed journals, and clinical validation in a real-world environment, before roll-out and implementation in patient care (Fig. 4). With these caveats, it is also important to have reasonable expectations for how AI will ultimately be incorporated. A useful reality check on today’s widespread hype that doctors will be replaced by machines is the analogy of the self-driving car. Most would agree that autonomous cars represent the pinnacle technical achievement of AI to date, but the term autonomous is misleading. The Society of Automotive Engineers (SAE) has defined five levels of autonomy, with Level 5 indicating full control by the car under all conditions, without any possibility for human backup or taking control of the vehicle (Fig. 5). It is now accepted that this definition of full autonomy is likely never to be attained, as certain ambient or road conditions will prohibit the safe use of such vehicles203. By the same token, medicine is unlikely ever to surpass Level 3, conditional automation, for which humans will indeed be required for oversight of algorithmic interpretation of images and data. It is hard to imagine only very limited human backup across the full range of patient care (Level 4). Human health is too precious—relegating it to machines, except for routine matters with minimal risk, seems especially far-fetched.
The excitement that lies ahead, albeit much further along than many have forecasted, is for software that will ingest and meaningfully process massive sets of data quickly, accurately, and inexpensively and for machines that will see and do things that are not humanly possible. This capability will ultimately lay the foundation for high-performance medicine, which is truly data-driven, decompressing our reliance on human resources, and will eventually take us well beyond the sum of the parts of human and machine intelligence. This symbiosis will be preceded by the upstream advances that are already being made in biomedical science and discovery, which have a far less tortuous path to be accepted and widely implemented.
Funding was provided by the Clinical and Translational Science Award (CTSA) from the National Institutes of Health (NIH), grant number UL1TR002550.