Advances in the automated diagnosis of eye conditions through colour photography of the retina[1–4] and optical coherence tomography imaging[5–7] have put artificial intelligence (AI) in a position to transform eye care. Soon, AI-based systems could augment physicians’ decision-making in the clinic — or even replace physicians altogether.
Screening for the eye disease diabetic retinopathy is a case in point. International clinical guidelines recommend annual or biennial retinal examinations for all people with diabetes, as diabetic retinopathy can be treated if caught early, but might otherwise lead to an irreversible loss of vision. However, estimates from 2017 show that 451 million people worldwide have diabetes[8] — making screening an untenable burden for ophthalmologists. Multiple AI systems have been shown to have high accuracy in detecting diabetic retinopathy, almost matching that of a human specialist. In April 2018, the US Food and Drug Administration gave the green light for such an AI system to be used in a clinical setting. AI systems for other eye conditions are not far behind. Providing the patient with a diagnosis at the point of primary care presents a major disruption to the conventional model of physician–patient interaction.
AI systems could prove particularly useful in resource-limited settings where medical care is unavailable or costly. Such a system has the potential to detect diabetic retinopathy in people who might otherwise never visit an ophthalmologist, and whom such a visit could save from a life of debilitating vision loss or even blindness. For instance, a US study of Latinos with type 2 diabetes showed that 65%, or 535 of 821 participants, had failed to obtain an eye examination in the past 12 months[9]. However, if an AI algorithm were used to screen every person in the world with diabetes, even an accuracy rate of 99.9% would result in hundreds of thousands of misdiagnoses and missed diagnoses each year. Society must therefore decide whether the benefits of a correct diagnosis outweigh the risks of a missed diagnosis.
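The scale of that trade-off follows from simple arithmetic. As a rough sketch, using the 451-million figure cited above and the hypothetical 99.9% accuracy from the text:

```python
# Illustrative back-of-the-envelope calculation of error counts at
# population scale. Figures: ~451 million people with diabetes (the
# 2017 estimate cited above); 99.9% is the hypothetical accuracy.
people_screened = 451_000_000
accuracy = 0.999

errors = people_screened * (1 - accuracy)  # misdiagnoses + missed diagnoses
print(f"Expected erroneous results per screening round: {errors:,.0f}")
```

Even a one-in-a-thousand error rate, applied to every person with diabetes, yields roughly 451,000 erroneous results per screening round — the "hundreds of thousands" of the text.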
When applying diagnostic algorithms on a massive scale, one question to consider is that of bias. The populations from which the data sets used to train AI systems are derived could contain inherent biases along the lines of gender, race or a condition’s severity. This limits the accuracy of the resulting algorithms in populations on which the technology was not trained. For example, because the natural pigmentation of the layers below the retina varies with race[10], an algorithm that has been trained to detect diabetic retinopathy using data from northern European or North American populations might not perform as well in populations from Africa or East Asia.
Researchers often apply a manual or semi-automated cleaning process to data sets to eliminate images of low quality, as well as those containing more than one eye condition, before they start to train, validate and test a diagnostic algorithm. However, that is not how the real world works. Image quality cannot be determined in advance, and few, if any, studies published so far report an automated method for determining the suitability of images. Common eye diseases, including age-related macular degeneration, retinal vein occlusion and glaucoma, can occur at the same time as diabetic retinopathy. Because even state-of-the-art deep-learning algorithms have been trained to identify only a single disease, they can be fooled by co-existing conditions. And even if such machines were able to see past this, one capable of reporting only the presence of diabetic retinopathy would still fail to catch other common eye conditions that human specialists, including those who examine patients remotely through teleretinal screening programmes, would notice.
Furthermore, when presented with a rare condition such as swelling of the optic disc (papilloedema) or a heavily pigmented tumour (choroidal melanoma), it is unclear how an AI algorithm would proceed, because the input image might be radically different from those in its training set. People can learn to evaluate cases beyond their expertise, but recreating this behaviour in an AI algorithm has proved difficult. Although such cases will come up only infrequently, the resulting missed diagnoses could lead to considerable morbidity and mortality — choroidal melanoma tumours can metastasize, and papilloedema can be an early sign of a brain tumour.
These limitations — plus the risk that the commercial, closed-source companies that develop and evaluate diagnostic algorithms have financial conflicts of interest — highlight the importance of independent, third-party validation studies performed in geographically and racially disparate settings, and for which participants are recruited consecutively. Researchers in the United States and United Kingdom are embarking on such efforts[11], with the goal of mimicking the real-world conditions under which diagnostic algorithms will be deployed.
Even with such safeguards in place, and even when AI algorithms start to outperform humans, a cultural issue is likely to block the widespread adoption of AI systems in health care. In conventional medical settings, patients and their family members already tend to have a zero-tolerance policy for medical mistakes made by care providers. They are unlikely to be more tolerant of such mistakes when blame falls on a machine. Typically, when errors arise, an independent audit tries to identify and fix the source. However, the sort of post-event analysis that can be performed in a conventional medical setting cannot be completed for AI systems. The complexity that gives these algorithms their human-like performance also turns them into black boxes — debugging such programs to identify the root cause of an error is all but impossible. The public might well consider the act of knowingly adopting an error-prone system to be a form of negligence, even if its error rates were on par with those of physicians. And it remains an open question whether an algorithm’s creators should be held liable for the medical errors that it makes.
Ideally, a specialist physician would supervise AI systems, providing a layer of human oversight. But if a physician were to thoroughly review every AI diagnosis, any cost or labour savings would disappear, and the process might become slower than one that relies on a physician alone. The alternative — in which a physician randomly checks samples of the AI’s diagnoses — should work for most cases but will still miss rare conditions. One solution is to use adaptive AI systems that apply online-learning techniques to learn from their own mistakes, as well as from those of other systems, and so improve performance over time. Such systems would acquire information in much the same way as physicians do, gaining clinical experience as they consult with more patients. This would, however, add a further layer of opaqueness to the black-box problem, as well as create a moving target for assessing performance.
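A back-of-the-envelope sketch makes the spot-checking problem concrete. All numbers here are hypothetical, chosen only for illustration; the point is that a fixed random-review fraction scales down rare-case coverage proportionally:

```python
# Hypothetical illustration: random spot-checks by a physician catch
# only a fixed fraction of cases, so most rare-condition cases escape
# human review. None of these numbers come from the article.
screened = 1_000_000          # patients screened by the AI system
rare_prevalence = 1 / 50_000  # assumed rate of a rare condition
review_fraction = 0.05        # physician randomly reviews 5% of diagnoses

rare_cases = screened * rare_prevalence
escaping_review = rare_cases * (1 - review_fraction)
print(f"Rare-condition cases in cohort: {rare_cases:.0f}")
print(f"Cases never seen by a physician: {escaping_review:.0f}")
```

Under these assumptions, 20 rare cases arise in the cohort and 19 of them are never seen by a human — random sampling leaves rare-condition coverage essentially unchanged from no oversight at all.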
A medical error rate of zero, although laudable as a goal, is unrealistic as a reference standard. Certainly, more research must be done before fully automated AI systems can be safely deployed, and safeguards must be put firmly in place before widespread adoption. But ultimately, if AI systems can reach people who might otherwise never visit an ophthalmologist, a huge number of people could be prevented from going blind. Society will need to accept an inescapable truth for the greater good: to err is both human and AI.
This article is part of Nature Outlook: The eye, an editorially independent supplement.