Main

Breast cancer screening detects cancer at earlier stages1, leading to a meaningful reduction in breast cancer mortality2. Moreover, early detection can lead to less aggressive treatments, reducing treatment toxicity. Although breast screening reduces overall mortality, it has limitations that result in failure to detect cancer in a considerable number of screened individuals. In these cases, cancer may be found later between screening rounds (interval cancer)3 or at the next screening round4. Reported estimates for the rate of interval cancer detection vary widely between countries and screening programs with varying screening intervals, ranging from 0.7 to 4.9 per 1,000 screened women3. Of these cancers found after a negative screen, an estimated 22% could have been detected retrospectively at previous rounds4. In the past, computer-aided detection (CAD) systems were developed to improve cancer detection. However, the benefits of CAD found in experimental studies did not translate into real-world clinical benefits. The use of CAD resulted in increased recalls, more time needed to assess screens and more biopsies without improving cancer detection, ultimately conferring no screening benefit5.

Modern artificial intelligence (AI) based on deep learning is a different technology from past CAD systems and has demonstrated greater potential to support the quality of screening services and reduce workload, depending on how it is integrated into the workflow6,7,8,9,10. AI performance is most at risk for cases with less common characteristics; thus, AI requires assessment in large-scale studies. As retrospective studies make large-scale evaluations possible, they are crucial for validating the safety and effectiveness of AI before prospective use. However, retrospective results can be expected to translate to real clinical practice only when appropriate study methods ensure that the analyzed data are representative of what AI would process in real-world deployments. Otherwise, the usefulness of AI in clinical practice is not guaranteed4,11,12. Prospective evaluations are needed to assess the real-world performance of AI integrated into live clinical workflows; however, these have been limited to date13.

This service evaluation presents results from using a commercially available AI system, Mia (Kheiron Medical Technologies), configured with regulatory-cleared predetermined sensitivity and specificity operating points in pilot implementations and live use in daily practice. The performance and generalizability of the AI system used were previously confirmed in a large-scale retrospective AI generalizability study8,9,14. The current analysis used prospectively collected postmarket real-world data to assess the effectiveness of the AI system as an additional component to standard screening procedures and a quality-control safety net in the AI-assisted additional-reader workflow to support early cancer detection.

Results

A three-phase approach was used to implement the AI system in an AI-assisted additional-reader workflow at four sites of MaMMa Egészségügyi Zrt. (MaMMa Klinika), a breast cancer screening institution that serves urban and rural populations in Hungary. The institution implements a 2-year screening interval and invites women aged 45–65 years to undergo screening. All institution sites also offer opportunistic screening, in which women who are not invited to screening but choose to participate are screened. These women undergo the same procedure as those participating in the population screening program. At the institution sites, full-field digital mammography images were obtained using the IMS Giotto Image 3DL and IMS Giotto Class systems, following the standard operating procedures at the four sites. All sites follow the standard double-reading workflow (with strictly no AI involvement) in which two radiologists review every case. When discordance arises, an arbitrator makes the decision to either recall or not recall a woman for further assessment. In the implemented AI-assisted additional-reader workflow, the AI system flagged cases for additional review among those classified by double reading as ‘no recall’. These positive discordant cases (that is, cases that AI flagged as ‘positive’ and human readers marked as ‘negative’) were additionally reviewed by a human arbitrator (additional arbitrator) to possibly recall additional cases and detect more cancerous cases at an early stage (Fig. 1). The additional arbitrator was provided with images containing AI-generated regions of interest highlighting areas suggestive of malignancy for their review.

Fig. 1: AI as an additional reader.

The AI-assisted additional-reader workflow uses a standard double-reading process complemented by image assessment by AI. If double reading results in a ‘no recall’ decision but the AI system flags the case, the screen is assessed by an additional human arbitrator.
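
For clarity, the routing logic of this workflow can be summarized in a few lines of Python. This is a minimal sketch: the data structure and function names are illustrative, not part of the AI system's interface, and the additional arbitrator is modeled as a simple callable.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScreeningCase:
    # Illustrative fields, not the AI system's actual interface.
    double_read_recall: bool  # final decision of standard double reading (after any arbitration)
    ai_flagged: bool          # AI recommends 'recall' at the configured operating point

def route_case(case: ScreeningCase,
               additional_arbitrator: Callable[[ScreeningCase], bool]) -> str:
    """Route one screen through the AI-assisted additional-reader workflow."""
    if case.double_read_recall:
        # Standard pathway: double reading already recalls; the AI output does not change this.
        return "recall"
    if case.ai_flagged:
        # Positive discordant case: the AI flags a screen that double reading did not recall.
        # An additional human arbitrator reviews it with the AI-generated regions of interest.
        return "recall" if additional_arbitrator(case) else "no recall"
    # Concordant negative: no additional review is triggered.
    return "no recall"

# Example: a flagged, not-recalled case that the additional arbitrator decides to recall.
print(route_case(ScreeningCase(double_read_recall=False, ai_flagged=True), lambda c: True))
```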

The implementation of the AI system consisted of three phases to ensure the safe deployment of the AI-assisted additional-reader process into live use. The first phase aimed to demonstrate the clinical benefit of the AI-assisted additional-reader process in a limited pilot rollout in which only one senior radiologist reviewed the AI-flagged cases from a single site, with the original screening date between April 6 and September 28, 2021 inclusive. The second phase was launched as an extended multicenter pilot involving a wider rollout of the AI-assisted additional-reader process across four sites (including the initial pilot site) and three additional arbitrators (including the additional arbitrator from the first phase). In the second phase, the readers independently reviewed every case flagged by AI from April 6 through December 21, 2021, at the initial pilot site and from April 6 through June 30, 2021, at each of the other three sites. One of the additional arbitrators made the final decision on which cases to recall additionally based on the opinions of all three readers. The extended pilot also aimed to provide a training period for the three additional arbitrators before live use began.

Finally, the third phase involved a full live rollout of the AI system as an official addition to the standard of care across the four sites from July 4, 2022. In this phase, the three additional arbitrators independently made recall decisions. The live rollout is ongoing, and the results presented here cover cases through January 31, 2023. Results were also simulated with a predetermined higher-specificity operating point to inform the sites on how the AI-assisted additional-reader process may be further optimized to suit their needs. The summary details of the dataset periods are provided in Table 1. In live use, each AI-flagged case was independently reviewed by one of the three additional arbitrators who made the final recall decision on each case they reviewed. During the two pilot phases, additional recalls based on additional arbitration reviews were done after the screening participants had been informed of the double-reading decision. In the third phase involving implementation into daily practice, the screening participants were informed after the decision was finalized based on the additional arbitration reviews. All readers had specialist training and ≥14 years of screening mammography experience, with non-additional arbitrators reading approximately 12,000 screens per year and additional arbitrators reading 25,000 screens per year on average.

Table 1 Overview of screens per phase per site

Patient characteristics

Table 2 shows the characteristics of participants in each phase. The initial pilot included 3,746 women with an average age of 58.2 (s.d. 11.0) years. Among them, 126 (3.4%) reported a family history of cancer and 479 (12.7%) had a Tabár parenchymal pattern classification of 4 or 5, correlating with high density. In the extended pilot (n = 9,112), the mean age was also 58.2 (s.d. 10.7) years. Tabár classification 4 or 5 was identified in 1,094 women (12.0%), and 274 women (3.0%) reported a family history of cancer. Finally, in the live-use phase, 15,953 women were included. The mean age was 58.6 (s.d. 10.5) years, with 615 women (3.9%) having reported a family history of cancer and 1,733 women (10.8%) having a Tabár classification of 4 or 5.

Table 2 Participant characteristics per phase

Screening performance of the AI-assisted additional-reader workflow

Across the three phases, the implementation of the AI-assisted additional-reader workflow resulted in 24 more cancer cases detected (7% relative increase in cancer detection rate (CDR)) and 70 more women recalled (0.28% increase in absolute recall rate), at a positive predictive value (PPV) for screening of 20.0% (3% relative increase) (Table 3). The initial pilot, extended pilot and live-use assessments included 3,746 of 3,817 (98.1%), 9,112 of 9,266 (98.3%) and 15,953 of 16,256 (98.1%) double-read cases that the AI could process, respectively (Table 1). Table 3 shows the outcome metrics for each phase and reports the results of the McNemar test for sensitivity and CDR.

Standard double reading resulted in recall rates of 6.7% (initial pilot), 7.0% (extended pilot) and 7.7% (live use) and CDRs of 12.8 per 1,000 cases (initial pilot), 13.8 per 1,000 cases (extended pilot) and 14.9 per 1,000 cases (live use). For the initial and extended pilots, AI flagged for review 10.6% (396/3,746) and 11.2% (1,024/9,112) of cases, respectively. Before launching the AI system into live use, its decision threshold was adjusted to a more specific predetermined operating point to accommodate the sites' workload capacity, resulting in a smaller proportion of cases (7.4%, 1,186/15,953) flagged for additional review in live use.

The additional arbitration reviews resulted in six (initial pilot), 22 (extended pilot) and 48 (live use) additional recalled cases, increasing the recall rate by 0.16% (initial pilot), 0.23% (extended pilot) and 0.25% (live use), respectively. From the additional recalls, six (initial pilot), 13 (extended pilot) and 11 (live use) additional cancer cases were found, increasing the CDR by 1.6 per 1,000 cases (a 13% relative increase), 1.4 per 1,000 cases (a 10% relative increase) and 0.7 per 1,000 cases (a 5% relative increase) for the initial pilot, extended pilot and live-use phases, respectively (all statistically significant with P < 0.05) (Table 3). Of the additional cancer cases, four (66.7%) in the initial pilot, ten (76.9%) in the extended pilot and five (45.5%) in the live-use phase were confirmed to be invasive. In addition, one case (16.7%) in the initial pilot, one case (7.7%) in the extended pilot and two cases (18.2%) in live use were in situ cancer, whereas one case (16.7%) in the initial pilot, two cases (15.4%) in the extended pilot and four cases (36.4%) in live use had missing invasiveness information. Of the additional cancer cases with available data on either pathological or radiological tumor size, 50.0% (two of four) in the initial pilot, 40.0% (four of ten) in the extended pilot and 57.1% (four of seven) in live use were ≤10 mm.

Overall, the screening performance of double reading plus the AI-assisted additional-reader workflow resulted in recall rates of 6.8% (initial pilot), 7.3% (extended pilot) and 8.0% (live use); arbitration rates of 13.6% (initial pilot), 14.2% (extended pilot) and 10.8% (live use); and CDRs of 14.4 per 1,000 cases (initial pilot), 15.3 per 1,000 cases (extended pilot) and 15.6 per 1,000 cases (live use).

Table 3 Outcome metrics for standard double reading versus double reading plus the AI-assisted additional-reader workflow

Performance at a simulated higher-specificity operating point

When the performance of the AI system was evaluated at a predetermined higher-specificity operating point through simulations, the AI-assisted additional-reader workflow substantially reduced the proportion of cases requiring additional review to 2.4% (89/3,746), 3.0% (274/9,112) and 2.9% (457/15,953) for the initial pilot, extended pilot and live-use phases, respectively, while still detecting 5 of the 6 (1.3/1,000, a 10% relative increase) additional cancer cases found in the initial pilot, 11 of the 13 (1.2/1,000, a 9% relative increase) additional cancer cases found in the extended pilot and 10 of the 11 (0.6/1,000, a 4% relative increase) additional cancer cases found in live use (Table 4). Of these additional cancer cases, four (80.0%) in the initial pilot, nine (81.8%) in the extended pilot and five (50.0%) in live use were confirmed to be invasive; zero (0.0%) in the initial pilot, one (9.1%) in the extended pilot and two (20.0%) in live use were confirmed to be in situ cancer; and one (20.0%) in the initial pilot, one (9.1%) in the extended pilot and three (30.0%) in live use had missing invasiveness information.

Table 4 Outcome metrics for standard double reading versus double reading plus the AI-assisted additional-reader workflow at a higher-specificity operating point

Discussion

This analysis of prospective real-world usage data provides evidence that using AI in clinical practice results in a measurable increase in breast cancer detection. We analyzed the effects of the AI-assisted additional-reader workflow in two pilot phases and found that the results were maintained when AI was used in daily screening practice. Moreover, the observed clinical benefit (a significant 5–13% increase in the rate of early detection of mostly invasive and small cancerous tumors) was achieved with minimal impact on recall rates; in the initial pilot, every additional recall was a confirmed cancer, demonstrating the possibility of increasing cancer detection with no false-positive additional recalls. Although the double-reading recall rate (6.7–7.7%) in this evaluation is in line with previous results published in the UK and Europe9,15, the double-reading CDR is higher (14/1,000) than previously reported9, possibly reflecting the resumption of breast cancer screening programs after the coronavirus disease pandemic. Nevertheless, the AI-assisted additional-reader workflow supported the screening service by further increasing the rate of early cancer detection. Using a higher-specificity operating point, it can also potentially reduce the proportion of cases requiring additional arbitration review to <3% while still increasing cancer detection by 0.6–1.3 per 1,000 cases, corresponding to a 4–10% relative increase. Future work investigating the implementation of a variety of operating points would be needed to confirm the extent of achievable improvement in early cancer detection in the context of sites with different needs, capacities and screening population characteristics.

Implementing AI into the diagnostic workflow requires careful monitoring of continued performance over time16. For the AI-assisted additional-reader workflow, the effectiveness of downstream clinical assessments of recalled positive discordant cases should be examined to ensure that potential cancer cases are found. Moreover, the AI-assisted additional-reader workflow could be combined with workflows focused on workload savings, such as using AI as an independent second reader. Large-scale retrospective studies of the same AI system used in this assessment have demonstrated that AI as an independent second reader can offer up to 45% workload savings8,9, offsetting the 3–11% additional arbitration reads (1–6% additional overall reading workload) for the AI-assisted additional-reader workflow while providing the benefit of increased cancer detection.

The AI-assisted additional-reader workflow was designed to flag high-priority cases not recalled by standard double reading, likely making the flagged set of cases a more difficult or complex set to read. We believe that this would be helpful in the training of mammogram readers. The spectrum of disease detected with the AI-assisted additional-reader workflow will be assessed in future work covering features such as invasiveness, tumor size, grade and lymph node status.

Several limitations need to be considered when interpreting the presented results. First, data were collected from only one breast cancer screening institution (with four sites) in one country. As screening programs vary between clinical sites and countries, future studies must confirm the benefit of the AI-assisted additional-reader workflow in other settings and screening populations. Furthermore, as only one commercial AI system was evaluated, the results may not be representative of other commercially available systems. Additionally, given that the follow-up period in this prospective assessment ranged only from 2 to 9 months, no information is yet available about possible interval cancer cases in the studied population. A longer follow-up analysis is required for a more accurate assessment of AI’s potential for improving cancer detection in the context of interval cancer occurrence. Moreover, the impact of inter-reader variation on the AI-assisted additional-reader workflow’s screening outcomes remains unclear and needs to be assessed in follow-up work.

Despite the many challenges in developing, validating, deploying and monitoring AI to ensure patient safety, this evaluation shows that a commercially available AI system can be effectively deployed, with its previously predicted benefits realized in a prospective real-world assessment of a live clinical workflow. We believe that the findings highlight opportunities for using AI in breast screening while demonstrating concrete steps for its safe deployment. The phased prospective approach underlines the potential for various AI adoption pathways.

Methods

Datasets for analysis

This study is an analysis of postmarket data collected at MaMMa Klinika, a large breast cancer screening institution in Hungary. Structured query language (SQL) was used to collect the data. Custom code written in Python version 3.8.8, using open-source packages including pandas version 1.2.4, NumPy version 1.20.1, sklearn version 0.24.1 and statsmodels version 0.12.2, was used for data analysis. The analysis complied with all relevant ethical regulations. External ethical review was not required, as the AI system was used as part of the standard of care in the screening service at each implementation phase of this service evaluation. Ethical considerations were reviewed internally by the screening service provider, MaMMa Klinika. The evaluation used deidentified data and presented results in aggregate, without listing data of individual screening participants, to protect their anonymity. Consequently, the evaluation also did not require patient consent.

Metrics

Standard breast screening metrics, CDR and recall rate, were primarily used to assess the effects of the AI-assisted additional-reader workflow compared to standard double reading without AI. CDR was calculated as the number of screen-detected cancer cases divided by the number of all screened cases. Recall rate was calculated as the number of cases recalled divided by the number of all cases; this should not be confused with the term ‘recall’ often used as a metric for sensitivity in machine learning. Arbitration rate was calculated as the number of arbitrations conducted divided by the number of all cases, with the double-reading arbitration rate including only double-reading arbitrations and the total arbitration rate including both double-reading and additional-reader arbitrations. PPV was calculated as the number of screen-detected cancer cases divided by the number of recalled screens. Sensitivity was calculated as the number of screen-detected cancer cases divided by the number of all known positive screens. Specificity was calculated as the number of non-recalled screens divided by the number of all non-positive screens. Positive discordance rate was calculated as the number of AI-flagged positive discordant cases divided by the number of all cases. As the AI-assisted additional-reader workflow occurs after the double-reading workflow on the same cases, paired comparisons between the two workflows were possible, with an exact measurement of the impact of AI in terms of additional recalls and cancer cases found. All detected cancer cases were confirmed by biopsy or histopathological examination within 12 months of the original screen or judged to be cancer by the patient tumor board (multidisciplinary team).
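
To make these definitions concrete, the following sketch computes the metrics from aggregate counts. The function is illustrative rather than the actual analysis code; the example call uses counts approximating the initial pilot after the additional-reader workflow (6.8% recall rate, 14.4 cancers per 1,000), with the arbitration count back-calculated from the reported 13.6% total arbitration rate.

```python
def screening_metrics(n_screens: int, n_recalls: int, n_cancers: int,
                      n_known_positives: int, n_arbitrations: int) -> dict:
    """Standard screening metrics as defined in the text (case-level counts)."""
    return {
        "cdr_per_1000": 1000 * n_cancers / n_screens,  # screen-detected cancers per 1,000 screens
        "recall_rate": n_recalls / n_screens,
        "arbitration_rate": n_arbitrations / n_screens,
        "ppv": n_cancers / n_recalls,                  # cancers among recalled screens
        "sensitivity": n_cancers / n_known_positives,
        # Specificity per the formula above: non-recalled screens over non-positive screens.
        "specificity": (n_screens - n_recalls) / (n_screens - n_known_positives),
    }

# Illustrative counts approximating the initial pilot: 3,746 screens, ~255 recalls,
# 54 screen-detected cancers, ~509 arbitrations of either kind.
print(screening_metrics(3746, 255, 54, n_known_positives=54, n_arbitrations=509))
```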

Statistical analysis

No statistical method was used to predetermine sample sizes. No data were excluded from the analyses. Blinding was not required as randomization was not applied. The standard double-reading process did not involve the AI system, and readers were blinded to the AI system’s output during the double-reading process. The Wilson score method was used to calculate 95% CIs. The statistical significance of CDR differences was assessed using the McNemar test. A P value of <0.05 was defined as statistically significant.
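
Both procedures are available in the statsmodels package listed under ‘Datasets for analysis’. The sketch below illustrates them with counts consistent with the initial pilot (48 cancers detected by standard double reading plus six detected only through the AI-assisted workflow); the counts are illustrative, and the exact tables used in the analysis are not reproduced here.

```python
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.contingency_tables import mcnemar

# 95% Wilson score interval for a proportion, e.g. a recall rate of 255/3,746.
low, high = proportion_confint(count=255, nobs=3746, alpha=0.05, method="wilson")
print(f"recall rate 95% CI: [{low:.4f}, {high:.4f}]")

# Exact McNemar test on paired detection outcomes. Rows: detected by double reading
# (yes/no); columns: detected by double reading plus the AI-assisted workflow (yes/no).
# Because the additional-reader workflow can only add detections, one discordant cell is 0.
table = [[48, 0],
         [6, 3692]]  # six cancers found only by the AI-assisted workflow
print(mcnemar(table, exact=True).pvalue)  # ~0.031, i.e. P < 0.05
```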

AI system

This evaluation used a commercially available AI system (Mia version 2.0, Kheiron Medical Technologies). The AI system is intended to process only cases from female participants and works with standard DICOM (Digital Imaging and Communications in Medicine) cases as inputs. The AI system analyzes four images with two standard full-field digital mammography views (craniocaudal and mediolateral oblique) per breast. Its primary output per case is a single binary recommendation of ‘recall’ (for further assessment based on findings suggestive of malignancy) or ‘no recall’ (no further assessment until the next screening interval). The AI system can provide binary recall recommendations for six predetermined operating points, ranging from a balanced trade-off between sensitivity and specificity to trade-offs that emphasize either sensitivity or specificity. The balanced and higher-specificity operating points are most relevant when the AI system is used in the AI-assisted additional-reader workflow. The set of cases flagged at the higher-specificity operating point is always a subset of the cases flagged at the balanced sensitivity/specificity operating point. Therefore, results at the higher-specificity operating point can be precisely simulated from the balanced operating point results. The ability to choose among these operating point trade-offs makes a significant difference to practical applicability at sites with differing workforce capacities. Additionally, the AI system provides regions of interest indicating image locations showing characteristics most suggestive of malignancy. Depending on the clinical workflow and exact integration of the AI system, the AI’s recommendation may be used independently or combined with human reader assessment.
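
The exactness of this simulation follows from thresholding: because each operating point is a threshold on the same per-case malignancy score (see the ensemble description below), raising the threshold can only shrink the flagged set. The following minimal sketch illustrates the subset property with placeholder scores and thresholds; the AI system's actual score distribution and threshold values are not public.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(10_000)               # placeholder per-case malignancy scores
balanced_thr, higher_spec_thr = 0.5, 0.8  # placeholder operating-point thresholds

flagged_balanced = scores >= balanced_thr
flagged_higher_spec = scores >= higher_spec_thr

# Every case flagged at the higher-specificity point is also flagged at the balanced
# point, so higher-specificity outcomes can be replayed exactly from balanced-point data.
assert not np.any(flagged_higher_spec & ~flagged_balanced)
print(int(flagged_balanced.sum()), int(flagged_higher_spec.sum()))
```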

The underlying technology of the AI system is based on deep convolutional neural networks (CNNs), which are state-of-the-art machine learning tools for image classification. The AI system is a combination (also known as an ensemble) of multiple models with a diverse set of different CNN architectures. Each model was trained for malignancy detection. The final prediction of the ensemble is obtained by aggregating individual model outputs, with a subsequent threshold applied to the malignancy detection score to generate a binary recommendation of ‘recall’ or ‘no recall’. The thresholds relate to one of the AI system’s six predetermined, clinically meaningful operating points according to desired sensitivity/specificity trade-offs.
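
In outline, this pipeline reduces to aggregating per-model scores and thresholding the result. The sketch below assumes mean aggregation for illustration; the text specifies only that individual model outputs are aggregated before a threshold is applied.

```python
import numpy as np

def recall_recommendation(model_scores: np.ndarray, threshold: float) -> str:
    """Binary recommendation from an ensemble of malignancy-detection models.

    model_scores: one malignancy score per CNN in the ensemble for a single case.
    threshold: one of the predetermined operating points. Mean aggregation is an
    assumption for illustration; the actual aggregation method is not described.
    """
    case_score = float(np.mean(model_scores))  # aggregate the ensemble's outputs
    return "recall" if case_score >= threshold else "no recall"

print(recall_recommendation(np.array([0.72, 0.64, 0.81]), threshold=0.5))  # -> "recall"
```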

The AI system was trained on a heterogeneous, large-scale collection of more than 1 million images from real-world screening programs across different countries, multiple sites and equipment from different vendors over a period of >10 years. Positive cases were defined as pathology-proven malignancies confirmed by fine-needle aspiration cytology, core needle biopsy, vacuum-assisted core biopsy and/or histological analysis of surgical specimens. Negative cases were confirmed through multiple years of follow-up.

The AI software version and operating points used in the present evaluation were fixed before each phase. None of the evaluation data were used in any aspect of algorithm development.

The AI system’s performance, generalizability and clinical utility were previously confirmed in a large-scale retrospective AI generalizability study8,9,14. The study demonstrated that double reading with the AI system, compared to human double reading, resulted in at least noninferior recall rate, CDR, sensitivity, specificity and PPV for each mammography vendor and site, with superior recall rate, specificity and PPV observed for some mammography vendors and sites9. The double-reading simulation with the AI system indicated that using AI as an independent reader (in all cases it could process) can result in a 3.3–12.3% increase in the arbitration rate9 but can reduce human workload by 30.0–44.8%. AI as a supporting reader (used as a second reader only when it agrees with the first human reader) was found to be superior or noninferior on all screening metrics compared to human double reading while nearly halving the number of arbitrations (from 3.4% to 1.8%) and reducing the number of cases requiring second human reading (by up to 87%)8. Additionally, no differences in prognostic features (invasiveness, grade, tumor size and lymph node status) were found between the cancer cases detected by the AI system and those detected by human readers14. These findings imply that cancer cases detected by the AI system and human readers are likely to have similar clinical courses and outcomes, with limited or no downstream effects on screening programs, supporting the potential role of AI as a reader in the double-reading workflow.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.