Differential detection by breast density for digital breast tomosynthesis versus digital mammography population screening: a systematic review and meta-analysis

Background We examined whether digital breast tomosynthesis (DBT) detects differentially in high- or low-density screens. Methods We searched six databases (2009–2020) for studies comparing DBT and digital mammography (DM), and reporting cancer detection rate (CDR) and/or recall rate by breast density. Meta-analysis was performed to pool incremental CDR and recall rate for DBT (versus DM) for high- and low-density (dichotomised based on BI-RADS) and within-study differences in incremental estimates between high- and low-density. Screening settings (European/US) were compared. Results Pooled within-study difference in incremental CDR for high- versus low-density was 1.0/1000 screens (95% CI: 0.3, 1.6; p = 0.003). Estimates were not significantly different in US (0.6/1000; 95% CI: 0.0, 1.3; p = 0.05) and European (1.9/1000; 95% CI: 0.3, 3.5; p = 0.02) settings (p for subgroup difference = 0.15). For incremental recall rate, within-study differences between density subgroups differed by setting (p < 0.001). Pooled incremental recall was less in high- versus low-density screens (−0.9%; 95% CI: −1.4%, −0.4%; p < 0.001) in US screening, and greater (0.8%; 95% CI: 0.3%, 1.3%; p = 0.001) in European screening. Conclusions DBT has differential incremental cancer detection and recall by breast density. Although incremental CDR is greater in high-density, a substantial proportion of additional cancers is likely to be detected in low-density screens. Our findings may assist screening programmes considering DBT for density-tailored screening.


BACKGROUND
Digital breast tomosynthesis (DBT) provides reconstructed, quasi-three-dimensional mammographic images of the breast, and has been proposed to improve cancer detection in screening through better visualisation of lesions that may be obscured by dense and/or overlapping breast tissue on conventional (two-dimensional) digital mammography (DM) [1]. In addition, by minimising cancer-mimicking artefacts associated with overlapping breast tissue, DBT may reduce high baseline rates of recall to further assessment [2]. Multiple studies have compared DBT and DM in breast cancer screening, including six published systematic reviews [3][4][5][6][7][8]. All of these reviews reported that detection measures favoured DBT (compared to DM) for breast cancer screening; however, none reported screening detection measures by high and low breast density. High mammographic density (having heterogeneously or extremely dense breasts [9]) is associated with an increased risk of breast cancer [10], including interval breast cancer [11]. Evidence on whether screening performance measures for DBT compared to DM differ by breast density is of interest to population breast cancer screening programmes, and could inform potential adoption of DBT screening in the whole population or in subgroups of it.

Eligibility criteria
Studies were eligible when they included asymptomatic women who attended population-based breast cancer screening programmes; compared DBT with DM; reported cancer detection and/or recall by breast density using American College of Radiology Breast Imaging Reporting and Database System (BI-RADS) [9] (any edition); and were reported in English. Detailed inclusion and exclusion criteria are available in Supplementary Method 2. Studies using either a paired design (i.e. all participants underwent DBT and DM, allowing within-participant comparison) or an unpaired design (i.e. comparison of separate groups that underwent DBT, with or without DM, versus DM alone) were eligible for inclusion. To ensure that density classification was consistent for the purpose of pooling estimates by density strata, we did not include studies using automated density assessment.

Study selection
Titles and abstracts were screened by one author (TL) to determine whether studies met the eligibility criteria for full-text assessment and a sample of 25% was screened independently by another author (MLM) as a quality assurance process. The full-text assessment was conducted by one author (TL) with consultation from a second author (MLM) if required.

Data extraction
Data extraction was performed by one author (TL), with another independent extraction by one of two other authors (NN and AZ). Any disagreements were resolved by discussion and consensus, or with arbitration by a third author (MLM) when needed.
The following data were extracted into an Excel spreadsheet using predefined cells: first author, publication date, country, study design, screening interval, years of participant enrolment, DBT views, DBT modality, DBT screening reading strategy, participants' age (median or mean), and the number of participants and outcomes (cancers detected, recalls) per modality in each density category. Breast density information was extracted according to the BI-RADS density classification of a-d [9] (or 1-4 [13]) when available, and the combined categories of low density (BI-RADS a + b/1 + 2) and high density (BI-RADS c + d/3 + 4) when studies did not report the full BI-RADS classification. Because more studies reported by combined categories of density (and fewer by all four categories), we used the binary low- and high-density classification to standardise these data and allow statistical pooling across studies. This approach avoided excluding a substantial number of otherwise eligible studies.
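The dichotomisation described above can be illustrated with a minimal sketch. The function names and the example counts are hypothetical and are not taken from the extraction spreadsheet; the only thing assumed from the text is the mapping of BI-RADS a + b (or 1 + 2) to low density and c + d (or 3 + 4) to high density.

```python
# Hypothetical helper: collapse BI-RADS density categories into binary strata.
LOW_DENSITY = {"a", "b", "1", "2"}
HIGH_DENSITY = {"c", "d", "3", "4"}


def density_stratum(category):
    """Return 'low' or 'high' for a BI-RADS density category (a-d or 1-4)."""
    key = str(category).strip().lower()
    if key in LOW_DENSITY:
        return "low"
    if key in HIGH_DENSITY:
        return "high"
    raise ValueError(f"Unrecognised BI-RADS density category: {category!r}")


def collapse_counts(counts_by_category):
    """Sum per-category counts (e.g. screens or cancers) into low/high totals."""
    totals = {"low": 0, "high": 0}
    for category, count in counts_by_category.items():
        totals[density_stratum(category)] += count
    return totals


# Illustrative counts only (not study data).
example = collapse_counts({"a": 1200, "b": 4300, "c": 3600, "d": 900})
```

Studies that already report only the combined categories would bypass this step and be entered directly as low and high totals.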

Quality assessment criteria
Quality assessment of all eligible studies was performed by one author (TL) in consultation with two other authors (MLM and NH) when required, using appraisal criteria adapted from QUADAS-2 [14]. Each study was assessed for risk of bias under four domains covering patient selection, index test, reference standard, and flow and timing. The first three domains were also assessed in terms of concerns regarding applicability.

Statistical analysis
Study characteristics were summarised descriptively using median values and ranges. For both DBT and DM, estimates of cancer detection rate (CDR; per 1000 screens) and recall rate (percent) were calculated for low- and high-density strata within each study, and exact 95% confidence intervals (CIs) were computed. Summary estimates of CDR and recall rates for DM (baseline) and DBT were derived and compared between screening settings using PROC GLIMMIX with random effects for study in SAS 9.4 (SAS Institute, Cary, NC, US). Incremental estimates (risk differences), calculated as the study-level differences between modalities (DBT minus DM) in CDR and recall rate, were pooled separately for low- and high-density strata using the inverse variance method with random effects (DerSimonian and Laird method as implemented in RevMan 5.4.1, The Cochrane Collaboration, 2020 [15]). Standard errors of the risk differences were calculated based on differences in two independent proportions for unpaired study designs. For paired study designs, PROC GENMOD in SAS was used to take account of the pairing of results within an individual when computing the standard error of the difference in proportions. These estimates were then input into RevMan for meta-analysis. Chi-squared tests of differences between the separate pooled estimates for density strata were not performed due to inappropriate standard errors (arising from the same studies contributing to both density strata) and the potential for bias [16,17].
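As a rough illustration of the pooling steps for unpaired designs, the following Python sketch computes a study-level risk difference with its standard error for two independent proportions, then pools study estimates with the DerSimonian and Laird random-effects estimator. It is not the SAS or RevMan implementation used in this review; the function names and the example counts are hypothetical, and paired designs (which need the within-participant correlation) are not covered.

```python
import math


def risk_difference(events_dbt, n_dbt, events_dm, n_dm):
    """Return (DBT minus DM risk difference, standard error), unpaired groups."""
    p_dbt = events_dbt / n_dbt
    p_dm = events_dm / n_dm
    variance = p_dbt * (1 - p_dbt) / n_dbt + p_dm * (1 - p_dm) / n_dm
    return p_dbt - p_dm, math.sqrt(variance)


def pool_dersimonian_laird(estimates, std_errors):
    """Inverse-variance random-effects pooling (DerSimonian and Laird tau^2).

    Returns (pooled estimate, pooled standard error, tau^2, Cochran's Q).
    """
    weights = [1 / se ** 2 for se in std_errors]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    # Method-of-moments between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    return pooled, math.sqrt(1 / sum(re_weights)), tau2, q


# Hypothetical counts (cancers, screens) for three unpaired studies.
studies = [(60, 10000, 45, 10000), (130, 20000, 95, 21000), (35, 5000, 30, 6000)]
per_study = [risk_difference(*s) for s in studies]
pooled, pooled_se, tau2, q = pool_dersimonian_laird(
    [rd for rd, _ in per_study], [se for _, se in per_study]
)
ci_95 = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
```

Multiplying the pooled risk difference by 1000 expresses incremental CDR per 1000 screens; multiplying by 100 expresses incremental recall as a percentage.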
For the main analyses comparing density strata, we used PROC GENMOD (with the REPEATED statement for paired studies) to model the interaction between modality (DBT versus DM) and breast density (high versus low) for each study. Interaction terms (corresponding to the within-study difference between density strata in incremental CDR and recall rate) and their standard errors were input into RevMan and pooled using the inverse variance method with random effects [16,17].
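For an unpaired study, the interaction term described above reduces to a difference of two risk differences. The sketch below makes that contrast explicit under the assumption that the four groups (DBT and DM, within high and low density) are independent, so their variances simply add. It does not reproduce the PROC GENMOD models used in this review, and in particular does not handle the within-participant correlation that the REPEATED statement accounts for in paired studies. All names and counts are hypothetical.

```python
import math


def incremental(events_dbt, n_dbt, events_dm, n_dm):
    """Return (DBT minus DM difference in proportions, variance), unpaired."""
    p_dbt = events_dbt / n_dbt
    p_dm = events_dm / n_dm
    variance = p_dbt * (1 - p_dbt) / n_dbt + p_dm * (1 - p_dm) / n_dm
    return p_dbt - p_dm, variance


def density_interaction(high, low):
    """Within-study difference (high minus low density) in the incremental rate.

    `high` and `low` are tuples (events_dbt, n_dbt, events_dm, n_dm).
    Assumes all four groups are independent (unpaired design).
    """
    rd_high, var_high = incremental(*high)
    rd_low, var_low = incremental(*low)
    return rd_high - rd_low, math.sqrt(var_high + var_low)


# Hypothetical cancer counts for one unpaired study.
difference, se = density_interaction(
    high=(60, 10000, 40, 10000),
    low=(50, 10000, 45, 10000),
)
per_1000 = difference * 1000  # difference in incremental CDR per 1000 screens
```

Because each study contributes a single contrast, these estimates and their standard errors can be pooled across studies without the same study entering two separate pooled estimates.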
Analyses were stratified by screening setting (European versus US studies) based on a priori evidence of a difference in CDR and recall rate [4]. Differences between screening setting subgroups were assessed using the Chi-squared test. Sensitivity analyses were undertaken to include only studies that reported both CDR and recall rate to investigate the effect on pooled estimates. Heterogeneity was assessed using the I² statistic with values >50% representing substantial or considerable heterogeneity [15].
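The heterogeneity statistic and the test for subgroup differences can be sketched as follows. This is a simplified illustration restricted to two subgroups (one degree of freedom), not the RevMan computation. The standard errors in the example are back-calculated from the rounded confidence intervals reported in the abstract, which is an assumption of this sketch, so the resulting p-value can only approximate the reported one.

```python
import math


def i_squared(q, df):
    """I^2 (percent): share of variability attributed to heterogeneity."""
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)


def se_from_ci(lower, upper):
    """Approximate a standard error from a 95% CI (normal approximation)."""
    return (upper - lower) / (2 * 1.96)


def subgroup_difference_test(est_a, se_a, est_b, se_b):
    """Chi-squared test (1 df) for a difference between two pooled subgroups."""
    w_a, w_b = 1 / se_a ** 2, 1 / se_b ** 2
    mean = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    q_between = w_a * (est_a - mean) ** 2 + w_b * (est_b - mean) ** 2
    # Upper tail of a chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(q_between / 2))
    return q_between, p_value


# Within-study difference in incremental CDR (per 1000 screens) from the abstract.
q_between, p_value = subgroup_difference_test(
    0.6, se_from_ci(0.0, 1.3),   # US
    1.9, se_from_ci(0.3, 3.5),   # European
)
```

With more than two subgroups the same Q statistic would be referred to a chi-squared distribution with (number of subgroups minus 1) degrees of freedom, which needs a general chi-squared tail function rather than `math.erfc`.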
Pooled estimates of incremental CDR and recall rate, and the within-study differences between density strata, were incorporated in an epidemiological model simulating plausible scenarios in population screening practice. Simplified decision trees (Supplementary Method 3) were used to apply conditional probabilities to a hypothetical screening population of 10,000 women where the screening setting and proportion of the population with low density were varied. Estimates of the proportion of the population with low density were derived from the median and range of study-specific values reported by European and US studies. For each screening setting and density subgroup, predictions of the number of additional cancers detected and additional women recalled by DBT per 10,000 screens were calculated by multiplying the total number in the population, the proportion of low (or high) density, and the relevant pooled incremental estimate derived from the meta-analysis. A detailed description of this modelled prediction can be found in Supplementary Method 3.
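The multiplication described above can be written out as a short sketch. The incremental rates and density proportions below are hypothetical placeholders chosen for illustration, not the pooled estimates or study-derived proportions used in the model, and the function name is likewise an assumption.

```python
def additional_per_population(n_screens, prop_low, inc_low, inc_high):
    """Predict additional events (cancers or recalls) attributable to DBT.

    inc_low and inc_high are incremental rates per screen for each density
    stratum (e.g. 1.5 per 1000 screens is entered as 0.0015).
    """
    low = n_screens * prop_low * inc_low
    high = n_screens * (1 - prop_low) * inc_high
    return {"low": low, "high": high, "total": low + high}


# Hypothetical incremental CDR: 1.0/1000 (low density) and 2.0/1000 (high density).
mostly_low = additional_per_population(10_000, 0.80, 0.001, 0.002)
mostly_high = additional_per_population(10_000, 0.40, 0.001, 0.002)
```

In this illustration the low-density stratum yields more additional cancers than the high-density stratum when 80% of screens are low density, even though its incremental rate is half as large; the pattern reverses when only 40% of screens are low density. A negative incremental rate (as for recall in some settings) produces a negative count, that is, fewer women recalled.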
All tests of statistical significance were two-sided. The level chosen for statistical significance was 0.05.

Risk of bias and applicability
In sensitivity analyses where only studies reporting both CDR and recall rate were included (n = 9), pooled estimates of incremental recall rate (Supplementary Fig. 5) and the within-study difference between density strata (Supplementary Fig. 6) did not change substantially.

Modelled predictions of additional cancers detected and women recalled by DBT in population screening
Pooled estimates of incremental CDR and recall rate (Figs. 1 and 3), and the differences between density strata (Figs. 2 and 4), were applied to different scenarios defined by screening setting (US or European) and the percentage of low breast density in the screening population ('median', 'maximum' and 'minimum' percentage based on density distributions in the included studies; see Supplementary Method 3). Across all scenarios, the predicted total number of additional cancers detected by DBT ranged from 9 to 25 per 10,000 screens (Table 2). The ratio of the numbers of additional cancers detected in high versus low density depended on the percentage of screens with low density. Despite evidence of greater incremental CDR in high density (Fig. 2), the number of additional cancers detected by DBT in women with low density exceeded the number in high density for the 'maximum' percentage of low-density screens. The reverse was apparent for the 'minimum' percentage estimates. These patterns were observed regardless of the screening setting.
The estimated number of additional women recalled by DBT ranged from −237 to 56 per 10,000 screens (Table 2). For European screening, DBT was associated with a relatively small increase in the number of recalls. At the 'maximum' percentage of low-density screens, the number of additional recalls was equal in the high- and low-density groups, but at lower percentages, the number of recalls was greater in high compared with low density. For US screening, the ratio of additional recalls in high versus low density reflected the pattern observed for CDR. At the 'maximum' percentage of low-density screens, the reduction in the number of women recalled was greater in the low-density than in the high-density group (and vice versa at the 'minimum' percentage).

DISCUSSION
The adoption of DBT in place of DM for population breast cancer screening has progressed rapidly, particularly in the US, whereas elsewhere there is conditional approval or restricted use of DBT in screening programmes [31]. Some population-based screening programmes do not currently endorse using DBT instead of DM but encourage its evaluation in prospective trials [32,33]. Mammographic breast density, a long-established independent risk factor for breast cancer, has gained increased attention since the introduction of breast density legislation in the US [34,35], and there is a suggestion that DBT may be more effective for screening women with dense breasts [4,36]. In this systematic review, we focused on estimating changes in cancer detection and recall associated with screening by DBT versus DM according to breast density. Our meta-analysis provides evidence that DBT increases cancer detection in both low- and high-density screening examinations regardless of the screening setting. Importantly, we also show that DBT has differential incremental detection (versus DM) by breast density, meaning that the increase in CDR is greater in high (versus low) density screens. Conversely, both the incremental recall rate for DBT and the differential incremental recall by density varied by screening setting.
Our estimates provide new synthesised evidence on the performance of DBT, noting that other systematic reviews [3][4][5][6][7][8] have not investigated the differential performance of DBT by density. One other review reported detection for DBT versus DM solely in screens classified as dense [37]. Our work showed that DBT detected more cancers than DM in both low- and high-density screens, and that DBT substantially improved CDR in high-density compared to low-density screens (pooled difference in incremental CDR 1.0 per 1000 screens). This improvement was more evident in studies undertaken in Europe (1.9 per 1000 screens) than in US studies (0.6 per 1000 screens). Although the difference between screening settings was not statistically significant, pooling within-study interactions is likely to have low power to detect such subgroup differences [16]. A greater contribution by DBT to cancer detection in European screening practice is likely to reflect a longer time interval between screens; however, other differences between European and US screening practices (e.g. double versus single-reading) may also contribute to this difference. The pooled difference in the incremental recall rate between low- and high-density screens differed between the screening settings. For European screening studies, a greater increase in DBT's incremental recall was observed in high compared with low-density screens (pooled absolute difference in recall rate 0.8%) with little heterogeneity (I² = 9%). In contrast, for US screening studies, there was a greater decrease in recall for DBT in high- than in low-density screens (pooled absolute difference in recall rate −0.9%). Although there was substantial heterogeneity in the magnitude of this estimate (I² = 61%), all US screening studies were consistent in the direction of the difference (Fig. 4). The opposing directions of the estimates from European and US studies are likely because the 'baseline' recall rates for DM in US screening studies were larger than those reported in European screening studies (Supplementary Table 2). Our results suggest that DBT has a beneficial effect in reducing recalls in women with dense breasts in US screening practice but may lead to increased recall in high-density screens in European screening programmes.
Fig. 3 Difference in recall rate (incremental recall rate) between DBT and DM stratified by breast density. Breast density was classified as low (BI-RADS a + b) and high (BI-RADS c + d) (see Data extraction). Squares with horizontal lines represent individual study estimates and 95% CIs. Diamonds represent pooled estimates of incremental recall rate for DBT over DM and 95% CIs. Additional data were supplied by study authors for Alsheik et al. [23] and Zackrisson et al. [22]. CI confidence interval, df degrees of freedom.
Our estimates of DBT's differential incremental detection and recall (versus DM) by breast density are relevant to screening programmes worldwide contemplating whether DBT should be used for population screening, if such decisions were to be based on conventional screening measures. The data provided in Table 2, for example, showing the additional detection (or effect on recall) if DBT replaced DM screening, according to the observed percentage of breast density and screening setting, can inform plans for trials or implementation studies. A screening programme targeting women aged 50 years and above with a large proportion of participants with low-density breasts (as would be expected in many European programmes, and Australia's programme [38]) would improve CDR overall through more detection in both low- and high-density screens. In that setting, limiting DBT to those with high density would not achieve optimal outcomes from DBT screening. In contrast, if a European screening programme comprised a large proportion of participants with high breast density, much of the incremental CDR would be achieved by offering DBT to women with dense breasts. Our results may also be relevant to planning new research in risk-tailored screening [39].
There are limitations to this work that should be considered when using our findings. The included studies reported initial detection measures and lacked data on long-term health outcomes from DBT screening. It is therefore unknown whether DBT's incremental detection will lead to incremental screening benefits by reducing breast cancer mortality. Also, most of the data reported on prevalent (initial) DBT screening, even though repeat breast screening represents the majority of screens in screening programmes. Therefore, it is possible our results may be less generalisable to repeat (incident) DBT screening. Another limitation is that we included studies that assessed breast density using BI-RADS density classification, but excluded studies using automated assessment for consistency in meta-analysis. Given that automated density measures have only been recently introduced into practice [40][41][42][43], automated density should be assessed in future meta-analyses as the evidence develops. These issues reflect the still-evolving evidence base for DBT, a limitation inherent in evaluations of new health technologies that aim to inform implementation before practice becomes established and therefore more challenging to modify [44].
In addition, we have used 'US screening' and 'European screening' to classify studies, but this classification is only broadly indicative of screening practice; we acknowledge that practices vary in inter-screen time interval and screen-reading strategy. For example, US studies may not have performed all screening annually, and other factors that differ between US and European studies, such as single versus double-reading and the generally high recall rates in US studies, may account for some of the observed differences in incremental CDR and recall rates.
Internationally, the majority of population breast cancer screening programmes use DM, but many are contemplating the potential role of DBT screening. This is occurring in an evolving population screening landscape that includes deliberation regarding density notification and risk-tailored breast screening. Our meta-analysis provides timely comparative estimates for DBT and DM screening showing that DBT has differential incremental cancer detection and recall by breast density. Therefore, our synthesised evidence may assist screening policy, planning of research and individual screening recommendations.

DATA AVAILABILITY
All data generated or analysed during this study are included in this published article and its Supplementary Information files.