Supplemental breast cancer-screening ultrasonography in women with dense breasts: a systematic review and meta-analysis

Background Mammography is less effective in detecting breast cancer in dense breasts. Methods A search in Medline, Cochrane, EMBASE and Google Scholar databases was conducted from January 1, 1980 to April 10, 2019 to identify studies of women with dense breasts screened by mammography (M) and/or ultrasound (US). Meta-analysis was performed using the random-effect model. Results A total of 21 studies were included. The pooled sensitivity values of M alone and M + US were 74% and 96%, while the pooled specificity values of the two methods were 93% and 87%, respectively. Screening sensitivity was significantly higher in M + US than in M alone (risk ratio: M alone vs. M + US = 0.699, P < 0.001), and the slight difference in specificity was also statistically significant (risk ratio = 1.060, P = 0.001). Follow-up US after initial negative mammography demonstrated a high pooled sensitivity (96%) and specificity (88%). The findings were supported by subgroup analysis stratified by study country, US method and timing of US. Conclusions Breast cancer screening by supplemental US among women with dense breasts shows added detection sensitivity compared with M alone. However, US slightly decreased the diagnostic specificity for breast cancer. The cost-effectiveness of supplemental US in detecting malignancy in dense breasts should be considered additionally.


BACKGROUND
Mammography has been established as the primary method of screening for breast cancer, and since its introduction, the diagnosis of early-stage disease has been significantly enhanced. 1,2 The overall sensitivity of mammography for the detection of nonpalpable cancers is approximately 85%, 1,2 but the density of breast tissue can markedly reduce the mammography detection rate of early-stage disease. [3][4][5] Breast density can be classified according to the Breast Imaging-Reporting and Data System (BI-RADS) American College of Radiology (ACR) categories and/or quantification: BI-RADS A: almost entirely fat (low density of mammary gland parenchyma), BI-RADS B: scattered fibroglandular densities (average density of gland parenchyma), BI-RADS C: heterogeneously dense (high density of gland parenchyma) and BI-RADS D: extremely dense (very high density of gland parenchyma). 6,7 In women with >75% breast parenchymal dense tissue, the sensitivity of mammography for detecting early-stage cancer can be as low as 48%. 4,5 Dense breast tissue is an independent marker associated with increased breast cancer risk, especially in women who are at higher risk due to other factors such as family history. 8 Women with dense breast tissue who develop carcinoma in one breast are also at higher risk of developing cancer in the contralateral breast. 9 It is estimated that approximately two-thirds of premenopausal women and one-third of elderly women aged 75-79 years have a breast density of 50% or higher. 4 Furthermore, ethnic differences also exist, as dense breasts are more prevalent in Asian than in Caucasian women. 10,11 Although breast cancer incidence rates in the Asian population were found to be lower than those in the Western population according to a large-scale epidemiology study, the incidence in Asia is increasing quickly, at a pace surpassing that in Western countries. 12 This highlights an urgent need for more efficient breast cancer prevention and management strategies in the Asian population.
In light of the limitations of mammography in women with dense breasts, a study has suggested that ultrasound (US) is more sensitive than mammography, and can identify mammography-occult breast cancers in dense breasts, especially in younger women aged 30-39. 13 Other studies have indicated that adjunctive US and mammography in women with dense breasts resulted in a significant increase in the cancer detection rate as compared with mammography alone. 3,14 Some authors have therefore suggested that mammography with supplemental US screening can be beneficial for women with dense breasts, for specific female groups prone to have dense breast tissue as previously described and for women in resource-poor healthcare systems. 15 In the prospective J-START study, sensitivity in breast cancer detection was higher with mammography plus ultrasound than with mammography alone (91.1% vs. 77.0%) in asymptomatic Japanese women aged 40-49 years irrespective of breast density, albeit with a concurrent lower specificity. 16 In particular, breast cancers detected by US are likely to differ in character from those detected by mammography: they are more likely to be smaller, invasive and of the luminal A subtype. 17 Nevertheless, the use of adjunctive US may increase the number of false-positive findings and unnecessary biopsy recommendations, 18,19 and the added diagnostic benefit of the screening strategy should be reconsidered as a whole.
The added clinical benefit of US to mammography has been the subject of many systematic reviews, which provide only qualitative evaluation of published clinical evidence. In contrast, quantitative analysis that pools and compares the diagnostic yield of conjunctive or sequential mammography and US screening strategies is limited. Moreover, it was suggested by a systematic review that the overall quality of the available evidence regarding the detection rate of breast cancer by screening with mammography and adjunct US may be low based on the Grades of Recommendation Assessment Development and Evaluation (GRADE) system. 18 Thus, the objective of this study was to perform a systematic review and meta-analysis examining the diagnostic performance of mammography alone versus mammography plus US for breast cancer in women with dense breasts, as well as that of follow-up US in women with dense breasts and negative mammography results.

METHODS
Literature search and study selection
This meta-analysis was performed in accordance with the PRISMA guidelines. 20 Medline, Cochrane, EMBASE and Google Scholar databases were searched for studies published between January 1, 1980 and April 10, 2019 using the keywords as follows: breast, dense, density, breast cancer, breast density, mammography, ultrasonography, ultrasound, specificity, sensitivity, screening and comparison. The search strategies included "mammography and ultrasound and breast and (dense OR density) with search filters of abstract availability, publications English, and clinical trials", and "(ultrasound) AND (mammography) AND (breast cancer) AND (screening) AND (sensitivity) AND (specificity) AND (comparison)". Literature inclusion criteria for meta-analysis were (1) randomised controlled trials (RCTs), 2-arm prospective studies, retrospective studies and cohort studies; (2) participants were women with dense breasts with BI-RADS categories ≥2; (3) study design included either mammography with adjunctive ultrasonography or additional ultrasonography following a negative mammography; (4) quantitative outcome data for outcomes of interest (i.e. PPV, NPV, sensitivity and specificity); (5) full-text studies published in English. Letters, comments, editorials, case reports, proceedings and personal communication were excluded. Studies of patients without dense breasts, and those that did not provide direct comparisons of the outcomes of interest, were further excluded. Studies designed for the detection of microcalcifications were excluded because ultrasound, by its technical nature, has limited ability to detect breast microcalcifications. 21,22 The reference lists of articles included for qualitative review were searched for studies that fit the above criteria. Literature searches were performed by two independent reviewers who were breast cancer specialists, and a third reviewer, also a breast cancer specialist, was consulted for resolution of any disagreements.
Quality assessment
The quality of the included studies was assessed using QUADAS-2, a revised tool for the quality assessment of diagnostic accuracy studies. 23 Briefly, QUADAS-2 comprises four domains: patient selection, index test, reference standard and flow and timing. Each domain was assessed for risk of bias, and the first three domains were subsequently assessed regarding topic relevance.
Data extraction and statistical analysis
Studies' characteristics, including the number of total enrolled patients and number of patients with confirmed cancer, mean ± standard deviation (SD), mean or median with range (minimum-maximum) for age, detection rate per 1000 patients screened in cancer detection or added cancer detection benefit, were extracted. The distribution of density categories, definition of dense breast, recall rate, biopsy rate per 1000 patients screened, reference standard and PPV were also extracted and summarised in preformed data forms accordingly. PPV1 was defined as the malignancy rate among cases with positive results; PPV2 was defined as the malignancy rate among positive cases with biopsy recommendations; PPV3 was defined as the malignancy rate of positive cases with a performed biopsy. The diagnostic outcomes, including sensitivity and specificity for the detection of early-stage breast cancer, were extracted from full-text review, and summarised as % (TP/(TP + FN)) and % (TN/(FP + TN)), respectively, where TP, FP, TN and FN indicated the number of patients with true-positive, false-positive, true-negative and false-negative results. Specificity, sensitivity or the difference between these outcomes where available were further evaluated by meta-analysis.
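The definitions above can be illustrated with a minimal Python sketch. The `Counts` container, the function names and the example counts are hypothetical and serve only to show the arithmetic; they are not data from any included study, and the PPV shown corresponds to the PPV1-style definition (malignancy rate among all positive results).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """2x2 diagnostic counts for one screening arm (hypothetical container)."""
    tp: int  # cancers correctly flagged by screening
    fp: int  # women without cancer flagged as positive
    tn: int  # women without cancer correctly cleared
    fn: int  # cancers missed by screening


def sensitivity(c: Counts) -> float:
    """TP / (TP + FN): share of confirmed cancers that screening detected."""
    return c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    """TN / (FP + TN): share of cancer-free women correctly cleared."""
    return c.tn / (c.fp + c.tn)


def ppv(c: Counts) -> float:
    """TP / (TP + FP): malignancy rate among positive results (PPV1-style)."""
    return c.tp / (c.tp + c.fp)


# Illustrative counts only; not taken from any included study.
example = Counts(tp=18, fp=70, tn=930, fn=2)
print(f"sensitivity = {sensitivity(example):.3f}")
print(f"specificity = {specificity(example):.3f}")
print(f"PPV         = {ppv(example):.3f}")
```

With a low cancer prevalence, as in this made-up example, the PPV stays low even when sensitivity and specificity are both high, which is why PPV is reported separately from the two accuracy measures.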
Through Meta-DiSc analysis, sensitivity and specificity of cancer detection from either test arm were then calculated and summarised as a forest plot presenting values of each study with the corresponding 95% confidence interval (CI, lower and upper limit), and then a pooled effect among those studies with completed measurements was calculated. Furthermore, a summary receiver-operating characteristic (SROC) curve was graphed along with the area under SROC curve (AUC) with standard error (SE).
For comparing the differences in diagnostic performance between mammography alone (M alone) and mammography with conjunctive ultrasound (M + US) in dense breast patients, an effect size defined as risk ratio (RR) was adopted and presented with 95% CI for each study, and a combined effect was subsequently calculated using the Comprehensive Meta-Analysis software, version 2.0 (Biostat, Englewood, NJ). An RR > 1 indicated that M alone might provide a higher diagnostic value than M + US, while an RR < 1 indicated that M + US provided a higher diagnostic value than M alone. An RR = 1 indicated that the results were similar between M alone and M + US.
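As a rough illustration of how a study-level RR and its 95% CI can be obtained, the sketch below uses the standard log-scale standard error for two independent proportions. This is an assumption made for illustration: the source does not state which variance formula the software applied, and the formula ignores any pairing between arms when the same women receive both tests. The function name and the counts are hypothetical.

```python
import math


def risk_ratio(events_a: int, total_a: int, events_b: int, total_b: int, z: float = 1.96):
    """Return (RR, lower, upper) for proportion A relative to proportion B.

    Uses the standard log-scale standard error for two independent
    proportions; this ignores any pairing between the two arms.
    """
    rr = (events_a / total_a) / (events_b / total_b)
    se_log = math.sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Illustrative counts only: cancers detected / cancers present in each arm.
# Arm A is M alone and arm B is M + US, matching the RR direction in the text.
rr, lo, hi = risk_ratio(events_a=30, total_a=40, events_b=38, total_b=40)
print(f"RR (M alone vs. M + US) = {rr:.3f}, 95% CI {lo:.3f}-{hi:.3f}")
```

Because M alone is placed in the numerator, an RR below 1 for sensitivity means that M + US detected a larger share of cancers, in line with the interpretation given above.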
Heterogeneity was evaluated using the χ²-based Q statistic and the I² statistic with a P value. For the Q statistic (or otherwise indicated as chi-square), P values <0.10 were considered statistically significant for heterogeneity. For the I² statistic, heterogeneity was assessed as follows: no heterogeneity (I² = 0-25%), moderate heterogeneity (I² = 25-50%), large heterogeneity (I² = 50-75%) and extreme heterogeneity (I² = 75-100%). 24 A random-effect model was used in the current meta-analysis, assuming substantial heterogeneity present among the studies. 25 Subgroup analyses were performed with regard to study country, US method and available data obtained during first-round US. Sensitivity analysis was conducted using a leave-one-out approach. Publication bias analysis by funnel plot was not performed in the current meta-analysis due to the limited number of studies included (<10 studies). 26 In all analyses, a two-sided P value <0.05 was considered statistically significant. The statistical analyses were performed using Meta-DiSc analysis software, version 1.4 and Comprehensive Meta-Analysis software, version 2.0 (Biostat, Englewood, NJ).
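The heterogeneity statistics and random-effect pooling described above can be sketched as follows. The DerSimonian-Laird estimator is used here as an assumption: it is a common random-effects method, but the source does not name the estimator applied by the software. The function name and the input values are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool study effects (e.g. log risk ratios) with a random-effects model.

    Returns the pooled effect, its standard error, Cochran's Q and I^2 (%).
    """
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]   # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, q, i2


# Illustrative log risk ratios and variances only; not data from the review.
log_rr = [math.log(0.60), math.log(0.70), math.log(0.90)]
var_log_rr = [0.010, 0.020, 0.005]
pooled, se, q, i2 = dersimonian_laird(log_rr, var_log_rr)
print(f"pooled RR = {math.exp(pooled):.3f}, Q = {q:.2f}, I^2 = {i2:.1f}%")
```

When Q does not exceed its degrees of freedom, the between-study variance is truncated at zero and the random-effects result coincides with the fixed-effect one.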

RESULTS
Literature search
A flow diagram of study selection is shown in Fig. 1. After initially identifying 828 articles, 749 articles were excluded based on the exclusion criteria. The full text of 79 articles was then reviewed, and 58 articles were excluded; the reasons for exclusion are shown in Fig. 1. Three sets of studies (Corsetti et al., 27,28 Berg et al. 29,30 and Weigert et al. [31][32][33]) were series reports of three individual patient cohorts. The earlier papers were excluded due to data duplication or lack of data on sensitivity and specificity. Thus, 21 studies were finally included in the systematic review. 3,28,30,[33][34][35][36][37][38][39][40][41][42][43][44][45][46][47][48][49][50]
Study characteristics
Study characteristics are summarised in Table 1; recall rate, biopsy rate, PPV1 and PPV3 are summarised in Table 2. The risk factors considered in each of the studies are summarised in Supplementary Table S1. Eight studies included women with dense breasts who received M alone or M + US screening for breast cancer, 3,30,34,36,40,42,47,49 and five of these performed US using the automated breast US (ABUS). 3,34,40,42,49 US was done in a whole-breast screening fashion in all included studies. In these studies, 443 out of 69,096 participants were diagnosed with malignancies confirmed by biopsy. Most of the studies that compared M alone with M + US were performed in countries highly populated by Caucasians (United States and Sweden). 3,30,34,40,42,49 Five of the eight studies comparing M alone and M + US in patients with dense breasts provided data for the presence of common breast cancer risk factors, which included BRCA1/2 mutations, family history, personal breast cancer history, use of hormone therapy, etc. 3,30,34,42,49 (Supplementary Table S1).
On the other hand, thirteen other studies included women with dense breasts and negative results on initial mammogram, who subsequently received additional US examination by handheld US (HHUS). 28,33,35,[37][38][39]41,[43][44][45][46]48,50 In these studies, 196 out of 50,350 participants were diagnosed with malignancies confirmed by biopsy. Four of the studies that evaluated follow-up US were [...] Table S2). The specificity and sensitivity of the different methods reported in the studies are summarised in Table 3. To achieve homogeneity in screening strategy among studies, the included studies were stratified for those comparing M alone versus M + US, and those with follow-up US during meta-analysis evaluations.
Fig. 1 Flow diagram of study selection for systematic review and meta-analysis. Twenty-one studies with quantitative synthesis were included for systematic review and 13 studies with complete diagnostic results were included for conducting meta-analysis.
Meta-analysis
M alone versus M + US in patients with dense breasts. Seven of the eight studies provided complete sensitivity and specificity data. 30,34,36,40,42,47,49 The sensitivity of M alone for cancer detection ranged from 40% to 91.3%, and the specificity ranged from 78.1% to 99.0% (Table 3). High heterogeneity was found among studies reporting sensitivity or specificity of either method (I² ranged from 83.8% to 99.9%, all P < 0.001, Figs. 2 and 3). For this reason, a random-effect model was used for meta-analysis. For M + US, the sensitivity for cancer detection ranged from 74.1% to 100.0%, and the specificity ranged from 72% to 99.7%. For all studies combined, the pooled sensitivity and specificity of M alone for cancer detection were 74% (95% CI: 0.69-0.79) and 93% (95% CI: 0.93-0.94), respectively (Fig. 2). On the other hand, the pooled sensitivity and specificity for M + US were 96% (95% CI: 0.93-0.97) and 87% (95% CI: 0.87-0.87), respectively (Fig. 3). When comparing the diagnostic accuracy of cancer detection between M alone and M + US, the AUC value of the SROC curve from the combined effects among those studies showed that M + US had better overall diagnostic accuracy than M alone (M + US vs. M alone, asymmetric SROC AUC value = 0.989 vs. 0.741) (Figs. 2c and 3c). Consistent with this finding, the meta-analysis of differences in the diagnostic yield of the two methods also showed that M + US might have higher sensitivity in cancer detection compared with mammography alone (M alone vs. M + US, RR = 0.699, 95% CI = 0.569-0.821, P < 0.001) (Fig. 4). The difference in specificity between M + US and M alone was statistically significant; however, the RR was close to 1 (RR = 1.060, 95% CI = 1.023-1.098, P = 0.001) (Fig. 4). 
The meta-analysis was again performed using a random-effect model, as high heterogeneity was found in the differences in diagnostic yield (difference in sensitivity: I² = 95.65%, P < 0.001; difference in specificity: I² = 99.4%, P < 0.001).
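To put the specificity difference in absolute terms, the pooled specificities of M alone (93%) and M + US (87%) can be converted to false-positive rates. This is a rough illustration only: it takes the pooled values at face value and ignores their confidence intervals and the high heterogeneity between studies.

```latex
\[
\mathrm{FPR} = 1 - \mathrm{specificity}, \qquad
\mathrm{FPR}_{\mathrm{M\ alone}} = 1 - 0.93 = 0.07, \qquad
\mathrm{FPR}_{\mathrm{M+US}} = 1 - 0.87 = 0.13
\]
```

Under these assumptions, roughly 70 versus 130 false-positive results would be expected per 1000 screened women without cancer, a difference of about 60 per 1000.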
Follow-up ultrasound in patients with dense breasts and negative mammography. Six out of 13 studies with complete sensitivity and specificity data for the detection of malignancy by follow-up US in patients with negative mammography and dense breasts were included in the analysis. 33,35,37,38,45,48 The sensitivity for cancer detection by follow-up US ranged from 88.4% to 100%, and specificity ranged from 74% to 94.5% (Table 3). A fixed-effect model was used for sensitivity, and a random-effect model was used for specificity because high heterogeneity was found (sensitivity: I² = 0%, P = 0.665; specificity: I² = 99.2%, P < 0.001) (Fig. 5). Upon meta-analysis, the pooled sensitivity of cancer detection was found to be 96% (95% CI: 0.91-0.99) (Fig. 5a), and the pooled specificity was 88% (95% CI: 0.87-0.88) (Fig. 5b). The diagnostic accuracy (AUC) was derived as 0.962 (SE = 0.02) by asymmetric SROC (Fig. 5c).

Subgroup analyses
To address potential confounding imposed by disease prevalence, US method and timing of follow-up US, subgroup analyses were conducted and summarised in Table 4. The RR showed that the M + US method had a significantly higher sensitivity than M alone whether the ABUS or the HHUS method was adopted (ABUS method: RR = 0.72, 95% CI = 0.67-0.77, P < 0.001; HHUS method: RR = 0.67, 95% CI = 0.45-0.99, P = 0.045), and a significantly lower specificity than M alone only when the HHUS method was performed (RR = 1.07, 95% CI = 1.03-1.11, P < 0.001) (Table 4). In studies that had data available specifically during the first-round US screening 34,49 (Supplementary Table S2 [...] For the six studies adopting follow-up US in patients with negative mammography and dense breasts, the sensitivity and specificity of the screening strategy were 0.95 and 0.90, respectively, for studies conducted in Western countries, 33,35,38 and 1.00 and 0.79, respectively, for Far Eastern countries. 37,45,48 The asymmetric AUC of SROC was 0.97 (SE = 0.041) for studies conducted in Western countries and 0.950 (SE = 0.035) for studies conducted in Far Eastern countries. Regarding the US method, all six eligible studies were conducted using the HHUS method, 33,35,37,38,45,48 and the data were in line with the main results. Three studies evaluating follow-up US presented specific results for first-round screening 37,45,48 (Supplementary Table S2), and the pooled sensitivity and specificity were 1.00 and 0.73, respectively, with an asymmetric AUC of SROC of 0.94 (Table 4).
Overall, the results of the subgroup analyses and the main meta-analysis exhibited similar trends.
Fig. 4 Meta-analysis of differences in cancer diagnostic yield between mammography alone and mammography plus ultrasound in patients with dense breasts. a Sensitivity and b specificity. M alone, mammography alone; M + US, mammography plus ultrasound; lower and upper limit, lower and upper bound of 95% confidence intervals (CI).
Sensitivity analysis among studies
Sensitivity analyses were performed using the leave-one-out approach, in which the meta-analyses of cancer detection outcomes were performed with each study removed in turn. The results are summarised in Supplementary Tables S3 and S4. The direction and magnitude of combined estimates did not vary markedly with the removal of most of the studies, indicating that each of the meta-analyses had good reliability, and the data were not overly influenced by any single study.
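The leave-one-out procedure can be sketched as below. For brevity, the pooling step is a simple fixed-effect inverse-variance average used as a stand-in; the source analyses used the models described in the statistical methods, and any pooling function with the same signature could be passed in instead. Study labels and values are hypothetical.

```python
def inverse_variance_pool(effects, variances):
    """Fixed-effect inverse-variance pooled estimate (stand-in pooling rule)."""
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def leave_one_out(labels, effects, variances, pool=inverse_variance_pool):
    """Re-pool the estimate with each study removed in turn."""
    results = {}
    for i, label in enumerate(labels):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_variances = variances[:i] + variances[i + 1:]
        results[label] = pool(kept_effects, kept_variances)
    return results


# Illustrative inputs only (hypothetical study labels and log risk ratios).
labels = ["Study A", "Study B", "Study C"]
effects = [-0.40, -0.30, -0.10]
variances = [0.02, 0.02, 0.02]
for label, estimate in leave_one_out(labels, effects, variances).items():
    print(f"without {label}: pooled effect = {estimate:.3f}")
```

If the re-pooled estimates keep the same sign and a similar magnitude whichever study is omitted, no single study is driving the combined result.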
Quality assessment
The quality assessments of included diagnostic accuracy studies are shown in Fig. 6. The quality assessment indicated that the quality of the included studies was acceptable, except for the retrospective design and the reference standard used by studies; the risk of bias mainly resulted from a lack of enrolled-patient randomisation, of index test masking and of an available reference standard in a number of studies.

DISCUSSION
This systematic review and meta-analysis examined and compared the diagnostic yield and accuracy of US as an adjunct to mammography with mammography alone for the screening of breast cancer in women with dense breasts. For general participants with dense breasts, the combined sensitivity of M + US for breast cancer was significantly higher than that of M alone (96% and 74%, respectively; RR = 0.699, P < 0.001). The combined specificity of M + US for breast cancer in the general female population with dense breasts was slightly lower than that of M alone (87% vs. 93%, respectively; RR = 1.06, P = 0.001).
In contrast, in women with dense breasts and initially negative mammography results, follow-up ultrasonography had high sensitivity (96%) and specificity (88%). Subgroup analyses with data stratified by study country, US method and first-round US further supported the main findings, suggesting that adjunctive US is beneficial for detecting breast cancer in women with dense breasts, albeit with an expected but tolerable sacrifice in detection specificity.
One meta-analysis published by Rebolj et al. 51 examined the rate of breast cancer detected only by US versus that detected by multimodal screening methods (mammography with or without US). The authors found that the proportion of cancers detected only by US was 0.29 (95% CI: 0.27-0.31) of all detected cancers, and this translated to approximately 40% increased breast cancer detection compared with other screening methods. Furthermore, follow-up US additionally contributed to 3.8 (95% CI: 3.4-4.2) screen-detected cases per 1000 mammography-negative women. Despite these findings, US was not recommended by the authors as a stand-alone screening method, but rather as a supplemental tool. It was difficult to correlate the findings reported by Rebolj et al. 51 with our study, as neither the comparisons nor the outcomes of interest (M alone vs. M + US, and diagnostic yield in our case) were comparable between the two studies. Moreover, a fixed-effect model was adopted by Rebolj et al. 51 disregarding the varied screening strategy and target population among their included studies, while a random-effect model was preferred in the current meta-analysis accompanied by study stratification.
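The step from a proportion of 0.29 to an increase of approximately 40% can be reconstructed as follows, under the assumption (a plausible reading, not spelled out in the text above) that the increase is expressed relative to the cancers detected without US. If a fraction p of all detected cancers is found only by US, the remaining 1 - p are found by the other methods, so the relative increase is:

```latex
\[
\frac{p}{1 - p} = \frac{0.29}{1 - 0.29} = \frac{0.29}{0.71} \approx 0.41
\]
```

That is, about 41 additional cancers for every 100 detected by the other screening methods, consistent with the "approximately 40%" figure.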
Previously published systematic reviews have examined the usefulness of adding US to mammography screening for women with dense breasts. A 2009 review by Nothacker et al. 52 identified only 6 cohort studies of intermediate-level evidence (3b) (no RCTs or other systematic reviews were identified). A more recent systematic review by Scheel et al. 53 identified 12 studies, and concluded that there was consistent evidence that adjunctive US screening detects more invasive cancers compared with mammography alone in women with dense breasts, but there was no evidence to support that adjunctive US screening was associated with reduced long-term breast cancer mortality. 53 In contrast to our study, Scheel et al. 53 did not review the diagnostic outcomes of M + US separately from those of follow-up US. Furthermore, the study by Scheel et al. 53 did not evaluate diagnostic yield by meta-analysis, which was likely due to the disparate screening methodology adopted by the studies included in the systematic review. 53 A 2016 systematic review of supplemental screening for breast cancer in women with dense breasts done for the United States Preventive Services Task Force concluded that supplemental US screening increases the cancer detection rate, but was associated with an increase in the false-positive rate, and the impact on long-term breast cancer outcomes was unclear. 54 The detection and differentiation of malignant microcalcifications in dense breast tissue are a particular issue of concern, but traditional radiologist-based interpretation of US imaging remains limited in providing an immediate solution. 21,22 Computer-aided automatic reporting systems have been enthusiastically evaluated, 55,56 and their implementation in future mammography screening may achieve greater diagnostic accuracy of microlesions in dense breasts. 
Furthermore, the adjunctive use of tomosynthesis in mammography-negative patients has been tested prospectively, and shown to produce fewer false-positive results than supplemental US. 57 In addition, supplemental US screening for women with dense breasts was found to produce relatively small survival benefits, despite a substantial increase in costs, in a review using data from large medical databases and an extensive literature search. 58 Although outside the context of the current meta-analysis, the cost-effectiveness of US performed in the present fashion as a supplemental or follow-up screening for breast cancer should be carefully considered.
In the current meta-analysis, the subjective disparity and observer variability of US in each study could not be clearly distinguished, and thus may confound the findings. The acquisition and interpretation of US are highly operator-dependent, and for this reason, computer-aided diagnosis systems have been rigorously developed in order to facilitate efficient interpretation, and improve the diagnostic accuracy in identifying malignant breast lesions. 59 In addition, the observed differences among studies may depend on differences in learning curves, individual radiologic experience and the way protocols and reports are filled out. The low PPV reported by Brem et al. 34 could be a result of the ABUS readout protocol, where the radiologist interpretation time was 2.9 min and evidently shorter than that in Wilczek et al. 49 In particular, low breast cancer rates in the Asian population may explain why Chae et al. reported low PPV values. 36 Apart from the bias presented in the risk evaluations, the findings in the current meta-analyses may also be subject to influence from heterogeneity in study design, patient characteristics, follow-up period and other details in the respective studies. Giger et al. 40 performed an enriched-reader study involving 17 radiologists from different types of health facilities; thus, the readout performance or enrolled population may not be comparable to the real-world scenario. 10,60 Leong et al. 48 involved only one medical centre in their study, reporting a sensitivity of 100% since no false-negative cases were found after 1 year of follow-up of participants with BI-RADS assessment category 1 or 2 under mammogram and categories U1-U4 under US assessment. 48 In Corsetti's study in 2011, 28 all subjects with negative screening mammograms and with dense breasts had bilateral breast US, and the reported screening sensitivity (86.7%) was calculated by dividing the cancers detected at screening by the sum of cancers detected at screening and interval cancers occurring over 365 days. Therefore, the variation in sensitivity of additional US, ranging from 86.7% to 100% in the subgroup of patients with dense breasts and negative mammography, might result from heterogeneity in sample size and in the definition of true-positive cases.
There are limitations of this analysis that need to be considered in the interpretation of the results. The study design of most included studies in the analyses was retrospective rather than randomised head-to-head comparisons. Although the quality of the studies was found to be adequate, and the sensitivity analysis indicated that the results were robust, heterogeneity was detected among the studies. A number of studies evaluating the diagnostic effectiveness of follow-up US included patients who had initial suspicious rather than negative mammography results (Supplementary Table S2), and thus the effect imposed by prevalent cases could not be completely ruled out. In addition, we relied on breast-density results reported by the individual studies, and did not examine or stratify patients based on the actual breast density in the participants of the individual studies. Moreover, we did not take into account the mammography and US technical or instrumental differences among individual studies.

CONCLUSIONS
The results of this systematic review and meta-analysis suggest that the addition of US to mammography screening of women with dense breasts improves the sensitivity for the detection of breast cancer, despite a slightly decreased specificity. Follow-up US also had good diagnostic sensitivity and specificity for screening women with dense breasts and negative mammogram findings. Future prospective studies designed to evaluate US as an adjunct or follow-up screening method to mammography in women with dense breasts are needed to confirm the results from our meta-analysis. Enrolment of specific high-risk populations should be further considered to identify those that may benefit from adjunctive US screening for breast cancers most cost-effectively, and reduce the number of recalls or false-negative biopsies performed.