Validation of a new fully automated software for 2D digital mammographic breast density evaluation in predicting breast cancer risk

We compared accuracy for breast cancer (BC) risk stratification of a new fully automated system (DenSeeMammo, DSM) for breast density (BD) assessment to a non-inferiority threshold based on radiologists’ visual assessment. Pooled analysis was performed on 14,267 2D mammograms collected from women aged 48–55 years who underwent BC screening within three studies: RETomo, Florence study and PROCAS. BD was expressed through clinical Breast Imaging Reporting and Data System (BI-RADS) density classification. Women in BI-RADS D category had a 2.6 (95% CI 1.5–4.4) and a 3.6 (95% CI 1.4–9.3) times higher risk of incident and interval cancer, respectively, than women in the two lowest BD categories. The ability of DSM to predict risk of incident cancer was non-inferior to radiologists’ visual assessment as both point estimate and lower bound of 95% CI (AUC 0.589; 95% CI 0.580–0.597) were above the predefined visual assessment threshold (AUC 0.571). The AUC for interval cancers (AUC 0.631; 95% CI 0.623–0.639) was even higher. BD assessed with the new fully automated method is positively associated with BC risk and is not inferior to radiologists’ visual assessment. It is an even stronger marker of interval cancer, confirming an appreciable masking effect of BD that reduces mammography sensitivity.

www.nature.com/scientificreports/

Mammographic breast density (BD) is the absolute amount or percentage of fibroglandular tissue in the breast. It is an established risk factor for breast cancer (BC) and an important determinant of screening sensitivity as it may hamper cancer detection in mammograms, i.e., the masking effect. Radiologist visual assessment, an area-based method commonly used for judgment of breast density, is subject to several important limitations, such as high subjectivity and intra- and inter-observer variability 1, resulting in low reliability and reproducibility 2,3. In addition, breast density assessment performed by radiologists (using the Breast Imaging Reporting and Data System, BI-RADS) 4 is time consuming and prone to poor reproducibility, making its application in organized screening challenging. Consequently, there are important clinical and research repercussions. Breast density is being used to stratify women based on their breast cancer risk to decide, in clinical trials, whether they need further imaging assessment such as ultrasound or MRI 5,6, or whether they would benefit from a shorter mammography screening interval. Low reproducibility of mammographic density measures may result in conflicting recommendations and inaccurate risk stratification, which could increase the probability of an interval BC.
The potential of personalized BC screening based on individual risk over universal "one-size-fits-all" screening has been recognized and several randomized trials have been launched to address this issue, such as Tailored Breast Screening Trial (TBST) 7, WISDOM 8 and My Personalized Breast Screening (MyPeBS) 9. MyPeBS is an international randomized, multicentre study that aims to assess the effectiveness of a risk-based breast cancer screening strategy (using clinical risk scores and polymorphisms) compared to standard screening (according to current national breast cancer screening policies). BD evaluation plays a central role in the MyPeBS trial as mammographic density contributes to the risk assessment in the intervention arm and determines whether women should receive supplemental ultrasound.
Several semi-automated and fully automated methods for reproducible assessment of breast density have been developed and evaluated [10][11][12][13]; these include approaches based on area-based and volumetric measures, and in some cases include mammographic pattern. However, variations in assessment methodology make direct comparisons of risk prediction challenging [14][15][16][17][18][19]. DenSeeMammo (DSM) has been validated for density assessment and shows a higher degree of agreement than that reported in studies on other vendors' automated volumetric assessment tools 20. However, the software has not yet been validated for prediction of breast cancer risk and for assessing the risk of masking (i.e., the risk of having interval or subsequent round cancer instead of screen-detected cancer).
The primary aim of this study was to evaluate a recently developed automated system for breast density assessment (DenSeeMammo) and its ability to stratify breast cancer risk, by comparing its discriminative accuracy for BC risk prediction to a non-inferiority threshold defined on the basis of visual assessment by radiologists. The secondary aim was to estimate how well the fully automated BD assessment is able to measure the masking effect.

Methods
Study population. This study represents a pooled analysis of images collected within the RETomo trial 21 , Florence study 17 and PROCAS case-control study 18 , with a total of 14,267 women 48-55 years old, of whom 322 developed breast cancer. A summary of studies included is provided in Supplementary Table 1 while the flowchart of study participants is presented in Fig. 1.
Florence study. A cohort study including 15,952 women who had their first screening digital mammography aged 49-54 years old during the period 2006-2013. Women were routinely reinvited after 2 years, and the study follow-up lasted 2.5 years. Only 5359 2D full-field digital mammography images were available for breast density analysis since some images were performed with mammography equipment (GE 2000D) which is not supported by DenSeeMammo software.
RETomo (Reggio Emilia tomosynthesis) trial. A cohort study including more than 27,000 women aged 45-70 years who attended screening between March 2014 and July 2017 and had already participated in at least one round of the Reggio Emilia screening program. Women aged 45-49 years were routinely reinvited after 1 year and women aged 50-74 years after 2 years; the study follow-up lasted 2.5 years. Among women aged 48-55 years, density assessment was available for 8332. No exclusion criteria other than those that were study-specific were applied. For women who had several mammograms, the first mammogram was analysed. Pooled analysis was conducted on individual anonymized data.
For all three studies, local Ethics Committee approval had been obtained and informed consent was given by all women participating in the relevant study. Ethics approval for PROCAS was granted by the North Manchester Research Ethics Committee (09/H1008/81). RETomo was approved by the provincial Ethical Committee (November 11, 2013; ASMN 2013/0029304). The Ethics Committee of the Florence district gave its approval for the Florence study on 12 September 2017 (n.11630_oss). All methods involving human participants were performed in accordance with the Declaration of Helsinki.
Acquisition of images. Cranio-caudal (CC) and mediolateral oblique (MLO) pairs (left and right breast) of 2D full-field digital mammography (FFDM) of all women recruited were obtained from Picture Archiving and Communication Systems (PACS) and processed by DSM. Ten women were excluded as it was not possible to obtain suitable images. Images were from different vendors and systems (GE Essential and Siemens Mammomat Inspiration). Detailed description of image acquisition is provided in Supplementary Table S1.

Breast density assessment. DenSeeMammo 1.2 (Predilife, Villejuif, France) is a Food and Drug Administration (FDA) approved, fully automated software for assessing breast density providing a BI-RADS category density grade. DenSeeMammo handles processed 'for-presentation' images extracted from DICOM files as input (CC + MLO, or CC or MLO). It provides results on a per patient basis, using the maximum density category of the two breasts. The method is based on a comparison with databases containing images previously scored by the Mammography Quality Standards Act (MQSA) radiologists using BI-RADS 5th Edition 4. With DenSeeMammo, all assessments are based on BI-RADS 5th Edition, which takes into account percent density and density distribution in the breast to reflect risk of masking 4. DenSeeMammo received 510(k) FDA clearance in 2017 and in 2018 for automatic breast density evaluation on GE and Hologic equipment.
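The per-patient rule described above (taking the maximum density category of the two breasts) can be illustrated with a minimal Python sketch. The function name and the representation of categories as single letters are assumptions made for illustration only; this is not the DenSeeMammo implementation, whose internals are not described here.

```python
# Ordered BI-RADS density categories, lowest (A) to highest (D).
BIRADS_ORDER = ("A", "B", "C", "D")


def per_patient_density(left: str, right: str) -> str:
    """Return the higher of the two per-breast BI-RADS density categories.

    Hypothetical helper: it only illustrates the 'maximum of the two
    breasts' rule stated in the text.
    """
    for category in (left, right):
        if category not in BIRADS_ORDER:
            raise ValueError(f"unknown BI-RADS density category: {category!r}")
    # Rank each category by its position in the ordered tuple.
    return max(left, right, key=BIRADS_ORDER.index)


if __name__ == "__main__":
    print(per_patient_density("B", "C"))  # prints C
```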
Endpoints. Cancers included in the analysis were: (1) prevalent cancers, i.e., cancers detected at inclusion to the study; (2) incident cancers, i.e., cancers not detected at the study inclusion mammogram that occurred by the subsequent screening round, including (3) interval cancers, i.e., cancers diagnosed after a negative screening mammogram and before the next screening mammogram. The main endpoint is incident cancer (interval and screen-detected at the second round). We also report prevalent cases and all cases (incident and prevalent), since including only incident cancers in women undergoing screening overestimates the ability of density to predict cancer risk. A substantial proportion of undetected prevalent cancers in women with dense breasts at baseline become symptomatic or detectable later during follow-up, while in women with fatty breasts, they are more efficiently detected by the baseline mammography.
The main analysis was the Area Under the Receiver Operating Characteristic curve (AUC) for DSM software's prediction of incident breast cancer risk, calculated from all the women for whom we had available 2D mammography suitable for DSM assessment. This analysis was performed only on incident and interval cancers, because in its practical application in screening, risk stratification is meaningful only for women who have not been diagnosed with a cancer.
We also report the odds ratios for incident, interval, prevalent and all cancers. The association with all cancers may better reflect the strength of the aetiological link between density and breast cancer risk as it is not affected by the bias of excluding more efficiently prevalent cancers in fatty breasts than in dense breasts. The difference in the strength of association between prevalent and interval cancers can be interpreted as an indirect measure of the software's ability to identify the masking effect of breast density: if the relative risk of having interval cancers in dense breasts vs. non-dense breasts is higher than the relative risk of having prevalent screen detected cancers, this means that the breast density assessment was able to identify part of the masking effect due to breast density.
Statistical analysis. Descriptive statistics were used to present the distribution of women and cancer cases within centres and breast density categories. The association between breast density assessed by DSM and BC risk was estimated using logistic regression and odds ratios (OR) with 95% confidence intervals (95% CI). Due to the small number of cancers in BI-RADS A category, the sum of the two lowest BI-RADS categories (A and B) was considered as a reference category. To take into account the different study designs of the PROCAS case-control study and the other two cohort studies, for the analysis of risk of all cancers on pooled data, we applied a sampling probability weight to the controls of the PROCAS study in a ratio 1:15 compared to cancers, as estimated from the risk observed in the underlying cohort of the PROCAS study. Given that models adjusted for age and study centre yielded similar results as crude models, the results of the crude model are presented.
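As an illustration of the crude odds ratio described above, the following Python sketch computes an OR with a Wald-type 95% CI from a 2x2 table (exposed category vs. reference category, cases vs. non-cases). For an unweighted logistic model with a single binary covariate, the fitted OR and its Wald interval coincide with this cross-product calculation. The counts below are purely hypothetical and are not taken from the study; the sampling weights applied to PROCAS controls are not reproduced here.

```python
import math


def odds_ratio_ci(exposed_cases, exposed_noncases,
                  reference_cases, reference_noncases, z=1.96):
    """Crude odds ratio with a Wald (Woolf) confidence interval.

    All four cell counts must be positive. z=1.96 gives a 95% CI.
    """
    cells = (exposed_cases, exposed_noncases, reference_cases, reference_noncases)
    if min(cells) <= 0:
        raise ValueError("all cell counts must be positive")
    # Cross-product ratio of the 2x2 table.
    odds_ratio = (exposed_cases * reference_noncases) / (
        exposed_noncases * reference_cases)
    # Standard error of log(OR): square root of the sum of reciprocal counts.
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 20 cases / 80 non-cases in the exposed category
    # and 10 cases / 90 non-cases in the reference category.
    or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
    print(f"OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```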
As a single measure of accuracy, i.e., the ability of the automated system to discriminate between cases and controls, the area under the receiver operating characteristic (ROC) curve (AUC) with 95% CI, according to binomial exact distribution, was computed based on the four BI-RADS categories.
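When the predictor is an ordinal score with four categories, the empirical AUC equals the probability that a randomly chosen case falls in a higher category than a randomly chosen control, with ties counted as one half. A minimal Python sketch of that calculation follows; the function name and the example counts are hypothetical, and the binomial exact confidence interval used in the study is not reproduced.

```python
def auc_from_categories(case_counts, control_counts):
    """Empirical AUC for an ordinal score given per-category counts.

    Both sequences list counts from the lowest to the highest category
    (e.g. BI-RADS A, B, C, D). Ties between a case and a control in the
    same category contribute one half.
    """
    if len(case_counts) != len(control_counts):
        raise ValueError("count sequences must have the same length")
    n_cases, n_controls = sum(case_counts), sum(control_counts)
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    concordant = 0.0
    controls_below = 0  # controls in strictly lower categories so far
    for cases, controls in zip(case_counts, control_counts):
        concordant += cases * (controls_below + 0.5 * controls)
        controls_below += controls
    return concordant / (n_cases * n_controls)


if __name__ == "__main__":
    # Hypothetical counts per BI-RADS category A, B, C, D.
    cases = [3, 30, 45, 20]
    controls = [700, 4500, 3800, 1000]
    print(f"AUC = {auc_from_categories(cases, controls):.3f}")
```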
In order to establish a clinically and statistically acceptable threshold for accuracy of the breast density software system, the non-inferiority margin for the AUC was defined on the basis of the value observed for the routine evaluation made by radiologists in the Florence study. Since our outcome was a single ROC curve and the direct comparison with radiologist visual assessment was not possible, a non-inferiority margin was calculated based on the modification of confidence interval method suggested by Ahn et al. 22. In the Florence study, consisting of 15,952 mammograms classified by radiologists' visual assessment in the four BI-RADS categories, the AUC for predicting incident cases was 0.579 (Supplementary Fig. S1). The non-inferiority margin was set at a 10% reduction in the area exceeding 0.5, which gives a threshold of 0.5 + 0.9 × (0.579 − 0.5) = 0.571; DSM was considered non-inferior if the lower bound of the 95% CI of its AUC was higher than 0.571. Statistical analysis was performed using Stata version 10.0 (Stata Corporation, College Station, TX, USA).
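The arithmetic behind the threshold can be checked with a short Python sketch. It assumes, as the text indicates, that the 10% reduction applies to the part of the reference AUC exceeding 0.5; the function names are illustrative only.

```python
def non_inferiority_threshold(reference_auc, reduction=0.10):
    """Shrink the area above 0.5 (chance level) by the given fraction."""
    return 0.5 + (1.0 - reduction) * (reference_auc - 0.5)


def is_non_inferior(ci_lower_bound, threshold):
    """Non-inferiority holds if the lower CI bound exceeds the threshold."""
    return ci_lower_bound > threshold


if __name__ == "__main__":
    # Reference AUC of radiologists' visual assessment in the Florence study.
    threshold = non_inferiority_threshold(0.579)
    print(round(threshold, 3))  # prints 0.571
    # Lower bound of the 95% CI reported for DSM (incident cancers).
    print(is_non_inferior(0.580, threshold))  # prints True
```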
Power of the study. For the power calculation we used AUC of VolparaDensity's automated system (version 3.1, Matakina Technology, Ely-Cambridgeshire, UK) from the Florence study as an expected value of software AUC. VolparaDensity is a commercial volumetric breast density method that operates on 'for processing' mammograms 23 . Expected value of the software AUC in the Florence study was 0.637 (re-analysis of data from Florence study) 17 . With an expected sample size of about 300 cancers and an underlying cohort of about 25,000 women aged 48-55, applying the same BI-RADS specific risk of cancer observed in the Florence study 17 , the power to exclude a non-inferiority threshold of 0.571 was expected to be over 99%.

Results
In total, 14,267 mammograms from the same number of women, aged 48-55 years old, were available for density assessment, out of whom 322 had cancer (115 in RETomo, 63 in Florence study and 144 in PROCAS) (Table 1). The mean ± SD age of women in the pooled analysis was 51.0 ± 1.9 years, while in RETomo it was 51.2 ± 2.1, in the Florence study 50.8 ± 1.4 and in PROCAS 51.3 ± 2.0. Out of 322 diagnosed cancers, 98 were incident round cases (74 in RETomo and 24 in Florence study), out of which 35 were interval cancers (26 in RETomo and 9 in Florence study) (Table 1). The distribution of BD was similar in all studies, with 7.4% in BI-RADS A and 11.6% in BI-RADS D (Table 1). In RETomo, women below 50 years old were referred to a 1-year interval, while those over 50 were referred to 2 years; BI-RADS D was more frequent in women below 50 years old than in those over 50 years (15.9% vs. 11.1%) (Supplementary Table S2).
Risk of breast cancer. The risk of incident cancer was almost three times higher for women in the highest category of breast density (BI-RADS D) than for those in the two lowest BI-RADS categories (OR 2.6; 95% CI 1.5-4.4), while the risk of interval cancer was almost four times higher for the same comparison (OR 3.6; 95% CI 1.4-9.3) (Table 2). The risk for all cancers (OR 2.2; 95% CI 1.6-3.1) was similar to the risk for incident cancers, while risk for prevalent cancer was slightly lower (OR 1.8; 95% CI 1.2-2.6) (Table 3).
When comparing the risk of incident cancer between the highest and lowest BI-RADS categories within each centre, the risk of cancer was similar in the Florence study (OR 2.6; 95% CI 0. 8

Discussion
In the present study we have evaluated the ability of a recently developed automated system for breast density assessment (DenSeeMammo) to predict risk of breast cancer in the pooled analysis of three European studies. A positive association between automatically assessed breast density and the risk of incident, interval (masking effect) and prevalent breast cancers and all cancers was observed in the pooled data, and no important heterogeneity was observed among the studies. Accuracy of DSM in breast cancer risk prediction was acceptable in our study, with an AUC of 0.589, which was not inferior to, and actually slightly higher than, that of the radiologist's visual assessment observed in Florence 17. It was also similar to those reported previously and estimated by using Volpara software 14,16,19. Although these studies found positive associations of both automatically and visually assessed BD with breast cancer risk, their automated systems were inferior to BI-RADS visual assessment in the discrimination of cancer risk. Astley et al. also showed that the average of two radiologists' estimates of breast density recorded on Visual Analogue Scales (VAS) was a stronger predictor for breast cancer, compared to four different automated systems 18. In our study overall DSM accuracy outperformed a single radiologist's visual breast density assessment, suggesting that DSM or other software with similar performance could be used in screening, thereby overcoming some of the organizational barriers that impede the radiologist's visual assessment without the risk of decreasing accuracy.
This study did not aim to provide direct comparison with other automated software and Volpara was used as the only automated software available for the power calculation. Although a recent study demonstrated that DSM has a comparable or even better degree of agreement with expert radiologists' visual assessment than Volpara 24, the most studied and one of the best performing quantitative volumetric assessment methods 14,18, other studies performed on the same set of mammographic images, with direct comparison with other automated systems, are necessary to answer whether DSM is a valid alternative to other automated systems. As expected, excess risk was higher for incident cancer (OR 2.6) than for overall cancers (OR 1.8) in our study. Data for incident cancers were less consistent between studies, ranging from an OR of 2.8 in the PROCAS study 18 to an HR of 8.3 in the study by Wanders et al. 25. Considering only incident cases overestimates the true prediction performance of the software; of note is that both studies are case-control studies and for the latter, density was evaluated at the time of cancer detection, augmenting additionally the ability of the software to predict breast cancer. When evaluating risk of overall cancers, i.e., considering also prevalent cases, our results (OR 1.8) are in agreement with results of studies utilising other automatically assessed breast density software. In particular, prediction of cancer was very similar to the estimates of Puliti et al., who reported 2.0 times higher risk of invasive BC in women within the highest breast density category compared to those in the three lower categories 17, as estimated by VolparaDensity software in the entire Florence cohort.
It must be noted that only a small part of available cancers and mammograms were included in both of the density evaluation studies, because only one third of the Florence cohort used for validating Volpara software was also suitable for processing by DSM. With the limitation of the differences in age of the participants, screening frequency and breast density distribution, other studies reported similar results for the association between fully automatically assessed BD and risk of cancer, with OR ranging from 1.7 to 2.3 when using Volpara 14,15,18,19, 1.9 and 3.9 when using Quantra 14,16, 2.2 when using Densitas 18 and 2.5 for a method using ImageJ 14.
The association between breast density and the risk of masking has been widely investigated due to its importance in patient risk stratification and identification of patients who will benefit from supplemental screening tests. However, only a few studies measured the association between automatically assessed BD and risk of interval cancer, and estimates differ greatly, partly due to different choices of denominator (screen detected cancers, healthy women, or screen examinations). Odds ratio for interval cancer in our study (OR 3.6) was somewhat higher than that of Moshina 17 . It is hard to distinguish to what extent the interval cancer rate occurred truly due to masking and what part is attributed to rapidly evolving cancers, especially without information on size, grade and molecular subtypes of cancers. Nevertheless, our data and the previous literature clearly show that the proportion of interval cancers is higher in dense breasts than in fatty breasts 17,[26][27][28]. One bias affecting our study is that part of the RETomo cohort, i.e., women aged 48-50, were invited for rescreening after 12 months. The chance of having an interval cancer in the first year after a negative screen is low and it is worth noting that dense breasts were slightly more prevalent in women who were re-invited after 12 months (Supplementary Table S2). This bias could only underestimate the true contribution of the interval cancers to the overall masking effect in our study. Similarly, Holland et al. confirmed that automatically assessed percentage of dense breasts is associated with increased risk of interval breast cancer not only due to the breast density as a risk factor but also due to its ability to discriminate screen detected versus interval cancer, i.e., to capture masking effect 29.
Our study has certain limitations which should be addressed. We defined the non-inferiority threshold using the radiologist's visual assessment of breast density as standard. This was decided because routine evaluation by one radiologist is available in clinical practice in most screening programs. On the other hand, this standard is less accurate than that used in many studies, i.e., the judgement by a panel of radiologists; this latter option is a better gold standard with which to compare and has higher accuracy than what is usually available in routine practice. Furthermore, the threshold has been defined based on a single centre cohort, which may not be representative of performance in other centres. While in the PROCAS study women with breast cancer and controls were matched for major confounding variables, such as parity, body mass index and menopausal status, in the RETomo and Florence studies this was not possible, which hinders us from drawing conclusions about an association between BD and cancer risk and from generalising our results to other centres. Although the software works on "for presentation" images and is multivendor, not all the stored images were suitable for evaluation.
Both the metrics which we used for evaluation (AUC and OR computed in logistic regression) are very sensitive to the breast density threshold used. The DSM adopted threshold is calibrated to reproduce the distribution of BI-RADS categories in the general screening population 20 , but not to optimize risk prediction. This choice guarantees comparability with previous studies and makes meaningful the choice of the non-inferiority threshold but could be reconsidered when applying BD for risk stratification, because continuous values contain valuable additional information.
Finally, our study included only women of peri-menopausal age, when BD changes rapidly, decreasing in most women as they age; how much the predictivity observed in this age group can be generalised to older women should be assessed in a further study. Another potential selection bias could have been introduced by selecting only available images (for Florence, only the images from one vendor were available for processing, and in RETomo, a few images collected at the very beginning of the study were not stored correctly and could not be processed). Although this selection did not occur at random, these reasons are unlikely to be associated with breast density and outcome in a population-based screening programme.

Conclusions
DenSeeMammo, a new fully automated software for breast density assessment, is non-inferior to radiologists' visual assessment in the prediction of breast cancer risk. Automatically assessed breast density was strongly associated with incident cancers and even more strongly with interval cancers, indicating the ability to capture the masking effect of BD.