Screening is used to detect breast cancer early in women who have no obvious signs of the disease. This image-analysis task is challenging because cancer is often hidden or masked in mammograms by overlapping ‘dense’ breast tissue. The problem has stimulated efforts to develop computer-based artificial-intelligence (AI) systems to improve diagnostic performance. Writing in Nature, McKinney et al.1 report the development of an AI system that outperforms expert radiologists in accurately interpreting mammograms from screening programmes. The work is part of a wave of studies investigating the use of AI in a range of medical-imaging contexts2.
Despite some limitations, McKinney and colleagues’ study is impressive. Its strengths include the large scale of the data sets used for training and subsequently validating the AI algorithm. Mammograms for 25,856 women in the United Kingdom and 3,097 women in the United States were used to train the AI system. The system was then used to identify the presence of breast cancer in mammograms of women who were known to have had either biopsy-proven breast cancer or normal follow-up imaging results at least 365 days later. These outcomes are the widely accepted gold standard for confirming breast cancer status in people undergoing screening for the disease. The authors report that the AI system outperformed both the historical decisions made by the radiologists who initially assessed the mammograms, and the decisions of six expert radiologists who interpreted 500 randomly selected cases in a controlled study.
McKinney and colleagues’ results suggest that AI might some day have a role in aiding the early detection of breast cancer, but the authors rightly note that clinical trials will be needed to further assess the utility of this tool in medical practice. The real world is more complicated and potentially more diverse than the type of controlled research environment reported in this study. For example, the study did not include all the different mammography technologies currently in use, and most images were obtained using a mammography system from a single manufacturer. The study included examples of two types of mammogram: tomosynthesis (also known as 3D mammography) and conventional digital (2D) mammography. It would be useful to know how the system performed individually for each technology.
The demographics of the population studied by the authors are not well defined, apart from age. The performance of AI algorithms can be highly dependent on the population used in the training sets. It is therefore important that a representative sample of the general population be used in the development of this technology, to ensure that the results are broadly applicable.
Another reason to temper excitement about this and similar AI studies comes from the lessons learnt from computer-aided detection (CAD) of breast cancer. CAD, an earlier computer system aimed at improving mammography interpretation in the clinic, showed great promise in experimental testing, but fell short in real-world settings3. CAD marks mammograms to draw the interpreter’s attention to areas that might be abnormal. However, analysis of a large sample of clinical mammography interpretations from the US Breast Cancer Surveillance Consortium registry demonstrated that there was no improvement in diagnostic accuracy with CAD3. Moreover, that study revealed that the addition of CAD worsened sensitivity (the performance of radiologists in determining that cancer was present), thus increasing the likelihood of a false negative test. CAD did not result in a significant change in specificity (the performance of radiologists in determining that cancer was not present) or in the likelihood of a false positive test3.
It has been speculated that CAD was not as useful in the clinic as experimental data suggested it might be because radiologists ignored or misused its input owing to the high frequency of marks on the images that were not findings suggestive of cancer. This outcome was attributed by some to the limited processing power available for CAD, which meant that comparisons with previous imaging studies of the same person were not possible4. Thus, CAD might mark regions that were not changing over time and that could be easily dismissed by expert readers. Another factor that limited CAD is that it was developed using the performance of human-based diagnosis. It was trained using mammograms in which humans had found signs of cancer and others that were false negatives — cases in which humans could not see signs of cancer although the disease was indeed present4. Similar pitfalls could be encountered with AI-based decision aids, too.
A system by which AI finds abnormalities that humans miss will require radiologists to adapt to the use of these types of tool. Imagine a system in which an algorithm marks a dense breast area on a screening mammogram and the human radiologist cannot see anything that looks potentially malignant. With CAD, radiologists scrutinize the areas marked, and if they decide the mark is probably not cancer, they assign the mammogram as being negative for malignancy. However, if AI algorithms are to make a bigger difference than CAD in detecting cancers that are currently missed, an abnormality detected by the AI system, but not perceived as such by the radiologist, would probably require extra investigation. This might result in a rise in the number of people who receive callbacks for further evaluation. A clinical trial would show the effect of the AI system on the detection of cancer and the rate of false positive diagnoses, while also allowing the development of effective clinical practice in response to mammograms flagged as abnormal by AI but not by the radiologist.
In addition, it would be essential to develop a mechanism for monitoring the performance of the AI system as it learns from cases it encounters, as occurs in machine-learning algorithms. Such performance metrics would need to be available to those using these tools, in case performance deteriorates over time.
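A monitoring mechanism of the kind described above could take the form of a rolling performance metric computed over recent confirmed cases, with an alert when it falls below an agreed floor. The following sketch is purely illustrative and not from the study; the class name, window size and alert threshold are all hypothetical choices made for the example.

```python
# Illustrative sketch (not from the study): track an AI reader's sensitivity
# over a rolling window of biopsy-confirmed cancer cases, and flag when it
# drops below a chosen threshold. All names and thresholds are hypothetical.
from collections import deque

class SensitivityMonitor:
    def __init__(self, window=1000, alert_below=0.85):
        # True = cancer was flagged by the system, False = cancer was missed
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, cancer_detected: bool):
        """Record the outcome of one confirmed cancer case."""
        self.outcomes.append(cancer_detected)

    def sensitivity(self):
        """Fraction of recent confirmed cancers that the system detected."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def deteriorated(self):
        """True if rolling sensitivity has fallen below the alert floor."""
        s = self.sensitivity()
        return s is not None and s < self.alert_below

# Hypothetical recent history: 80 detected cancers, 20 missed
monitor = SensitivityMonitor(window=100, alert_below=0.85)
for detected in [True] * 80 + [False] * 20:
    monitor.record(detected)
print(monitor.sensitivity(), monitor.deteriorated())  # 0.8 True
```

In practice such a metric would lag behind real performance, because confirming a false negative requires follow-up or biopsy; the point of the sketch is only that the monitoring logic itself is simple once outcome data are available to those using the tool.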
It is sobering to consider the sheer volume of data needed to develop and test AI algorithms for clinical tasks. Breast cancer screening is perhaps an ideal application for AI in medical imaging because large curated data sets suitable for algorithm training and testing are already available, and information for validating straightforward clinical end points is readily obtainable. Breast cancer screening programmes routinely measure their diagnostic performance — whether cancer is correctly detected (a true positive) or missed (a false negative). Some areas found on mammograms might be identified as abnormal but turn out on further testing not to be cancerous (false positives). For most women, screening identifies no abnormalities, and when there is still no evidence of cancer one year later, this is classified as a true negative.
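The four screening outcomes above (true positive, false negative, false positive, true negative) are exactly the counts from which the sensitivity and specificity discussed earlier are computed. As a minimal sketch, with illustrative counts that are not taken from the study:

```python
# Minimal sketch: computing screening performance metrics from
# hypothetical confusion-matrix counts (numbers are illustrative only).

def screening_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from screening outcome counts.

    tp: cancers correctly detected (true positives)
    fn: cancers missed (false negatives)
    fp: healthy cases flagged as abnormal (false positives)
    tn: healthy cases correctly cleared (true negatives)
    """
    sensitivity = tp / (tp + fn)  # fraction of cancers that were detected
    specificity = tn / (tn + fp)  # fraction of healthy cases correctly cleared
    return sensitivity, specificity

# Illustrative counts for 10,000 screens at roughly 0.5% cancer prevalence
sens, spec = screening_metrics(tp=40, fn=10, fp=500, tn=9450)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
# sensitivity = 0.80, specificity = 0.95
```

The asymmetry in the counts reflects why screening metrics are reported this way: cancers are rare, so overall accuracy alone would look high even for a system that missed most of them.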
Most other medical tasks have more-complicated clinical outcomes, however, in which the clinician’s decision is not a binary one (between the presence or absence of cancer), and thus further signs and symptoms must also be considered. In addition, most diseases lack readily accessible, validated data sets in which the ‘truth’ is defined relatively easily. Obtaining validated data sets for more-complex clinical problems will require greater effort by readers and the development of tools that can interrogate electronic health records to identify and annotate cases representing specific diagnoses.
To achieve the promise of AI in health care that is implied by McKinney and colleagues’ study, anonymized data in health records might thus have to be treated as precious resources of potential benefit to human health, in much the same way as public utilities such as drinking water are currently treated. Clearly, however, if such AI systems are to be developed and used widely, attention must be paid to patient privacy, and to how data are stored and used, by whom, and with what type of oversight.
Nature 577, 35-36 (2020)