Introduction

Breast cancer (BC) is the most common cancer type affecting women1,2. In Europe, full-field digital mammography is the frontline imaging tool for early detection, and nation-wide mammography screening programmes have been introduced in many countries3. In Sweden, women between the ages of 40 and 74 are invited to attend every 18–24 months4. At screening, each breast, in turn, is pressed between a detector and a paddle, and x-rayed from two different angles. The breast is compressed in order to fix the breast in place, spread the dense breast tissue, reduce scatter radiation, and lower the required radiation dose. The resulting four images are then assessed by radiologists for BC findings.

It is important to determine the extent to which screening is able to correctly identify the presence and absence of BC, so that its findings encourage appropriate decision making. The adequacy and usefulness of a screening test are usually assessed by the sensitivity (and specificity) of the test. In this paper we focus on sensitivity, specifically mammographic sensitivity. There are multiple reasons why a lesion can be missed on a mammogram, which can be broadly categorized into technical errors, related to the quality of the image; perceptual errors, including masking and poor conspicuity of the lesion; and cognitive errors, related to misinterpretation and misclassification of the findings5.

For monitoring mammography screening performance, sensitivity is commonly estimated as the proportion of BCs that are detected in close proximity to a screening, and this “sensitivity” is usually calculated using 1–2 year follow-ups for cancer detection6,7. These screen-detected cancers are contrasted with interval cancers (BC detected symptomatically some time after a screening or between screening rounds), which are assumed to represent the cancers that were present at the previous screening but missed8,9. While this assumption is true for many of the interval cancers, some particularly fast-growing tumors will not have been present, or will not have been of a detectable size, at the previous screening (so-called true interval cancers). Also, regular screening could mean that slow-growing tumors were present at multiple screenings, but only contribute towards the most recent screening interval10. These two scenarios mean that this definition of sensitivity can be considered biased, and that it diverges from the classical definition of sensitivity in the statistical literature on diagnostic testing. It takes neither the tumor size nor the tumor growth rate into account.
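As a point of reference, the ratio-based estimate described above amounts to a single division. The sketch below is illustrative only: the function name and the counts are hypothetical and are not taken from any study.

```python
def ratio_based_sensitivity(screen_detected: int, interval: int) -> float:
    """Share of cancers found at screening among screen-detected plus interval cancers."""
    total = screen_detected + interval
    if total == 0:
        raise ValueError("no cancers observed in the follow-up window")
    return screen_detected / total


# Hypothetical counts over a fixed follow-up window (not study data).
estimate = ratio_based_sensitivity(screen_detected=70, interval=30)
```

Note that neither tumor size nor growth rate enters this calculation, which is the limitation discussed above.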

Whereas most studies report one constant estimate for screening sensitivity, some researchers have developed a modelling approach for estimating the screening sensitivity of mammography as a function of tumor size11,12, where sensitivity is part of a larger natural history model applied to population-based breast cancer screening data. In these natural history models, the latent tumor sizes at prior (negative) screens are (probabilistically) back-calculated. Logistic functions represent the probability of detecting a BC tumor, given its (latent) size at the time of the screening. This way of estimating screening sensitivity is therefore closer to the statistical definition of a test sensitivity13. To differentiate this definition of screening sensitivity, we will refer to this as the Screening Test Sensitivity (STS).

Many factors intricately affect mammographic sensitivity. Percent breast density (the proportion of fibroglandular radio-dense tissue in the breast) is known to be an important masking factor, since both tumors and dense tissue appear white on mammograms14,15,16. Mammographic density, in general, is considered to be the most important masking factor. Other factors that influence the appearance and quality of a mammogram include the distribution of the dense tissue, imaging techniques such as position settings and image processing conditions9,17. Here we study the associations of such factors with STS, in particular, total x-ray exposure, compression pressure, and compressed breast thickness, the latter being the distance between the two mammography plates after the breast is compressed. The role of such factors on image quality and radiation dose has been well studied18. The relationship between compression pressure and ‘sensitivity’ (as defined in terms of the ratio of screening to interval cancers) has also been studied using population-based data19,20,21, but as explained above, our approach to this is different. Studies19,20, like ours, use measurements of compression pressure based on standard automatic exposure control (AEC) settings. Moreover, we are not aware of population-based studies of the role of x-ray exposure and compressed breast thickness in mammography sensitivity. Studies have tried to identify good levels of compression and compressed breast thickness19,22,23 with varying focus.

Since mammographic density plays such an important role in masking tumors, studies exploring the role of other screening parameters need to account for density, as we do here. It is important to highlight that the composition of the breast changes with age. Several studies have demonstrated that breast density declines substantially with increasing age, mainly over the menopausal transition24, and reaches a plateau around the age of 6525. This age-related difference in breast composition has been shown to be concomitant with the increase of sensitivity of mammography with age9,26. We take this into account in our study by using multiple (longitudinal) measurements of the density and related acquisition parameters.

In addition to biological changes to breast density, the estimated density based on the mammogram is prone to measurement variability based on the placement of the breast, and other imaging settings. Studies with short-term repeated imaging have shown that this variability exists both for computer-aided PD estimation27,28,29, and for radiologist assessed density categorization27,28,30.

Materials and methods

Data

We utilise data from the Karolinska Mammography Project for Risk Prediction of Breast Cancer (KARMA)31. KARMA is a Swedish prospective mammography screening cohort. During the period January 2011 to March 2013, women attending mammography screening at any one of four selected hospitals in Sweden were invited to participate. A total of 70,877 women accepted, filled in an extensive web-based questionnaire, and gave blood samples. Both raw and digitally processed mammograms are stored. Cases of BC are identified through a Swedish national quality register.

In this study we focus on the following screening parameters:

  • Percent mammographic density (PD), the proportion of radio-dense tissue on the mammogram (0–1),

  • Compressed breast thickness (CBT), the distance between the paddles when the image is taken (cm),

  • Exposure (EXP), the total x-ray dose, i.e. x-ray current times the exposure time (mAs),

  • Compression pressure (CP), calculated as the compression force divided by the measured compressed breast area (N/\(\hbox {cm}^2\)),

  • Total breast volume (TBV), the measured compressed area on the image times the CBT.
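The two derived quantities in the list above (CP and TBV) follow directly from the per-image measurements. A minimal sketch, assuming force in N, area in \(\hbox {cm}^2\) and CBT in cm; the class name, field names and example values are hypothetical and do not reflect the DICOM attribute names used in the study.

```python
from dataclasses import dataclass


@dataclass
class ScreeningImage:
    compression_force_n: float   # compression force (N)
    compressed_area_cm2: float   # measured compressed breast area (cm^2)
    cbt_cm: float                # compressed breast thickness (cm)

    def compression_pressure(self) -> float:
        """CP: compression force divided by compressed area (N/cm^2)."""
        return self.compression_force_n / self.compressed_area_cm2

    def total_breast_volume(self) -> float:
        """TBV: compressed area times compressed breast thickness (cm^3)."""
        return self.compressed_area_cm2 * self.cbt_cm


# Hypothetical measurements for one MLO image (not study data).
image = ScreeningImage(compression_force_n=100.0, compressed_area_cm2=125.0, cbt_cm=6.0)
```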

The PD was measured using STRATUS32, and the other screening parameters were extracted from the DICOM information tag of each image. These variables are taken from the mediolateral oblique (MLO) view, and were retrieved from each individual screening, and for each woman. Women with missing information on any of these parameters on any of the screenings were excluded from the study.

It is conventional when studying the effect of PD on breast cancer risk to use the contralateral (non-cancer) breast for the breast cancer cases. This is done so that the presence of the tumor does not add to the measurement of PD. However, when studying the detectability of breast cancer and screening parameters, the tumor side is the relevant side. Furthermore, our approach is based on there being latent undiagnosed tumors among some of the censored women (i.e. women that were not diagnosed with BC during follow-up). We do not know which women have latent tumors, nor do we know which sides such tumors are in. As our solution to this, we took the average between the left and right breasts, for each variable. We however also conducted a sensitivity analysis where we used the contralateral breast for the cases and a randomly selected side for the censored women.
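The two side-handling strategies described above can be sketched as follows. This is an illustration under assumptions: the function names, the dictionary layout and the example values are hypothetical, not the authors' code.

```python
import random


def average_sides(left: dict, right: dict) -> dict:
    """Main analysis: average each variable over the left and right breast."""
    return {name: (left[name] + right[name]) / 2 for name in left}


def side_for_sensitivity_analysis(left: dict, right: dict, tumor_side=None, rng=random):
    """Sensitivity analysis: contralateral breast for cases, random side otherwise."""
    if tumor_side == "left":
        return right
    if tumor_side == "right":
        return left
    # Censored woman: no known tumor side, so pick a side at random.
    return rng.choice([left, right])


# Hypothetical per-side measurements for one screening (not study data).
left = {"PD": 0.30, "CBT": 6.0}
right = {"PD": 0.20, "CBT": 5.0}
averaged = average_sides(left, right)
```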

In the present study, we excluded women with a BC diagnosis prior to joining KARMA. Of the BC cases, we included only the women with invasive BC that had a recorded date of diagnosis, mode of detection (whether the BC was detected through mammography screening or symptoms), and primary tumor size.

Out of the 70,877 total women recruited, our study includes 52,803 women, of which 981 women were diagnosed with invasive breast cancer within the study period, which ended on 2018-02-28. In total, data from 163,053 screening occasions was included. See Fig. 1 for a summary of the data selection process.

Statistical methods

We use a continuous growth natural history model to jointly model the time from a woman’s birth to the detection of an invasive breast cancer tumor, and tumour size at detection. This is done in a prospective population-based cohort (where the majority of women are free from BC diagnosis at end of follow-up, but nonetheless contribute to the model estimation). The natural history model has been described in detail in a previous study33, but for self-containment a summary is included here and in the Supplementary Material.

Figure 1

Flowchart describing the selection of the data.

The natural history model can be separated into four different sub-models:

  1. The carcinogenesis of the tumor, which we refer to as the age at (tumor) onset. This is a latent process and never observed. It uses the Moolgavkar-Venzon-Knudson two-stage model34 and is determined by the three parameters \(A, B, \delta \).

  2. The growth of the tumor, which is assumed to be exponential with an inverse growth rate randomly drawn from a gamma distribution. This distribution is determined by the two parameters \(\mu , \phi \).

  3. The time until breast cancer symptoms emerge and the tumor is symptomatically detected. The continuous hazard of symptomatic detection is assumed to be proportional to the latent tumor volume such that, as it grows, it is more likely to be detected. The parameter \(\eta _0\) determines the proportionality. In this study, the hazard rate is also adjusted for the total breast volume (TBV).

  4. The probability that a tumor is detected early should the woman attend a mammography screening. This sub-model is the primary focus of this study, and is defined as follows:

If a woman attends mammography screening before a tumor is symptomatically detected, there is opportunity to detect it early. We assume that the screening test sensitivity (STS) for the mammography screening undertaken at age \(\omega \) follows a logistic function of the latent tumor diameter \(D(\omega )\) at the time of the screening, i.e.

$$\begin{aligned} STS(\omega ) = \text {logit}^{-1}\left( \beta _0 + \beta _s D(\omega )\right) = \frac{\text {exp}\left( \beta _0 + \beta _s \cdot D(\omega ) \right) }{1 + \text {exp}\left( \beta _0 + \beta _s \cdot D(\omega ) \right) }, \end{aligned}$$

for parameters \(\beta _0, \beta _s\). The STS can be extended to depend on factors other than tumor size. In this study, we include the possible dependence on PD, CBT, EXP, and CP. The STS function is then

$$\begin{aligned} STS(\omega ) = \text {logit}^{-1}\left( \beta _0 + \beta _s D(\omega ) + \beta _1[PD] + \beta _2[CBT] + \beta _3[EXP]+\beta _4[CP] \right) . \end{aligned}$$
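To make the form of the STS function concrete, the sketch below evaluates the logistic expression \(\text {exp}(x)/(1+\text {exp}(x))\) for a linear predictor built from tumor diameter and screening covariates. The function names and all coefficient values are hypothetical placeholders chosen for illustration; they are not the fitted estimates.

```python
import math


def inverse_logit(x: float) -> float:
    """Logistic function exp(x) / (1 + exp(x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sts(diameter: float, covariates: dict, beta_0: float, beta_s: float, betas: dict) -> float:
    """Screening test sensitivity for a latent tumor diameter and covariate values."""
    linear = beta_0 + beta_s * diameter
    linear += sum(betas[name] * covariates[name] for name in betas)
    return inverse_logit(linear)


# Hypothetical coefficients and covariate values (not the fitted model).
betas = {"PD": -2.5, "CBT": -0.02}
low_density = sts(13.0, {"PD": 0.10, "CBT": 60.0}, beta_0=-4.0, beta_s=0.5, betas=betas)
high_density = sts(13.0, {"PD": 0.60, "CBT": 60.0}, beta_0=-4.0, beta_s=0.5, betas=betas)
```

With negative coefficients for PD and CBT, as in this placeholder set, the computed sensitivity falls as either covariate increases and rises with tumor diameter.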

None of the four processes in the model are directly observable with the available data. But the individual data on each woman’s age at diagnosis, tumor size at diagnosis, and mode by which the tumor was detected (symptomatically or through screening) can be used to infer their functions and distributions on a population level. This is done by conditioning backwards in time to hypothetical times of onset. For each possible age at onset, the probability of experiencing the observed outcome is calculated. For more details, see the Supplementary Material or Strandberg & Humphreys33.
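The idea of conditioning backwards in time can be illustrated for the growth sub-model alone. The sketch below assumes that tumor volume grows as \(\text {exp}(t/r)\) for an inverse growth rate \(r\) and that the tumor is spherical, so that the diameter scales as \(\text {exp}(t/(3r))\). Treating the gamma distribution as parameterised by a mean and a shape is an assumption of this sketch, as are the function names and numerical values; the exact parameterisation is given in the Supplementary Material.

```python
import math
import random


def sample_inverse_growth_rate(mean: float, shape: float, rng: random.Random) -> float:
    """Draw an inverse growth rate from a gamma distribution.

    Assumed parameterisation: mean and shape. random.gammavariate takes
    (shape, scale), so the scale is mean / shape.
    """
    return rng.gammavariate(shape, mean / shape)


def diameter_before(diameter_mm: float, years_before: float, inverse_growth_rate: float) -> float:
    """Back-calculate the tumor diameter a given number of years earlier.

    Exponential volume growth exp(t / r) with a spherical tumor implies that
    the diameter changes by a factor exp(-t / (3 r)) going back t years.
    """
    return diameter_mm * math.exp(-years_before / (3.0 * inverse_growth_rate))


rng = random.Random(2024)
r = sample_inverse_growth_rate(mean=1.0, shape=2.0, rng=rng)  # hypothetical values
# Latent diameter at a screening two years before a 15 mm tumor was detected.
latent_diameter = diameter_before(diameter_mm=15.0, years_before=2.0, inverse_growth_rate=r)
```

In the full model this back-calculation is carried out probabilistically over all possible onset ages and growth rates rather than for a single draw.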

Ethics approval and consent to participate

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Regional Ethical Review Board in Stockholm (Dnr 2010/958-31/1). Informed consent was obtained from all subjects involved in the study.

Results

We present the key characteristics of the study population in Table 1. Of the 52,803 women included (Materials and Methods; Data), 640 had BC detected at one of their screenings, and 320 had BC detected symptomatically outside of, or between, screenings. The median tumor size was noticeably larger for the symptomatic cases (17 mm) than for the screen-detected cases (13 mm), and symptomatic cases were detected at younger ages on average (median 59 years vs. 63 years). Compared with symptomatic women, screen-detected women had, on average, larger breast volume and compressed breast thickness, but lower percent density and compression pressure. These differences were statistically confirmed by two-sample t-tests. The average total exposure was not different (a two-sample t-test gave a p-value of 0.23).

Table 1 Descriptive comparison of the variables under study, based on the three types of outcome (screen-detected breast cancer, symptomatically detected breast cancer, or censored at the end of follow-up).

In Fig. 2 we present box plots of measured PD by age. Overall, PD reduces with age. The median PD of women aged 40–45 is 0.33 with an inter-quartile range of 0.31, while women older than 65 have median PD 0.10 with inter-quartile range of 0.17. In addition, the 95th percentile reduces from 0.69 to 0.43 between these age groups. While not visible in the figure, the 5th percentile also reduces from 0.035 to 0.007.

Figure 2

Measured percent density by age at mammography screening.

We started our model-based analysis by including all factors described in Statistical Methods; the model included TBV in the symptomatic detection rate, and PD, CBT, EXP, and CP in the logistic STS function. The maximum likelihood estimates (including 95% confidence intervals (CI)) of the model parameters are displayed in Table 2. The first six parameters are the ones related to the STS, and the rest are part of the other three submodels. To more easily compare the effects of the factors, the analysis was repeated where the model covariates were first scaled (by subtracting the sample mean and dividing by the sample standard deviation). The estimated coefficients from the scaled analysis, which represent the effect of increasing each covariate by one standard deviation, are also included in the table. PD has the largest effect, followed by CBT. CP and EXP have high p-values and relatively small effects.
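The scaling step described above is ordinary standardization. A minimal sketch with hypothetical values; the function names are illustrative. For a covariate that enters the linear predictor linearly, the coefficient on the scaled covariate equals the unscaled coefficient multiplied by the sample standard deviation.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]


def scaled_coefficient(raw_coefficient: float, values) -> float:
    """Effect of a one-standard-deviation increase in the covariate."""
    return raw_coefficient * stdev(values)


# Hypothetical PD values (not study data) and a hypothetical raw coefficient.
pd_values = [0.05, 0.10, 0.20, 0.35, 0.50]
pd_scaled = standardize(pd_values)
effect_per_sd = scaled_coefficient(-2.5, pd_values)
```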

To make the rest of the analysis more concise, we removed CP and EXP from the STS model (based on their small estimated effects and non-significant p-values). We also note that a likelihood ratio test comparing models with and without these two factors yielded a p-value of 0.45. The parameter estimates of the re-fitted/selected model are presented in Table 3. For comparison, and to highlight the significance of CBT for screening sensitivity, we also fit a version of the model with only PD in the STS (and TBV in the symptomatic detection rate). For this PD-only model, the estimates of the STS parameters (95% CI) are \({\hat{\beta }}_0 = -4.88 (-5.28, -4.49), {\hat{\beta }}_s=0.51 (0.45, 0.56), {\hat{\beta }}_{PD}=-2.50 (-3.41, -1.60)\).
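A likelihood ratio test of this kind compares twice the difference in maximized log-likelihood with a chi-square distribution whose degrees of freedom equal the number of removed parameters (here two, CP and EXP). For two degrees of freedom the chi-square survival function has the closed form \(\text {exp}(-x/2)\), which the sketch below uses. The log-likelihood values are hypothetical, chosen only so that the resulting p-value is of similar size to the one reported; they are not the fitted values.

```python
import math


def lrt_p_value_two_df(loglik_full: float, loglik_reduced: float) -> float:
    """P-value of a likelihood ratio test with two degrees of freedom."""
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # Survival function of the chi-square distribution with 2 degrees of freedom.
    return math.exp(-statistic / 2.0)


# Hypothetical maximized log-likelihoods (not from the fitted models).
p_value = lrt_p_value_two_df(loglik_full=-1000.0, loglik_reduced=-1000.8)
```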

Table 2 Parameter estimates for the full model with symptomatic detection dependent on TBV, and screen-detection dependent on PD, CBT, exposure (EXP), and compression pressure (CP).
Table 3 Parameter estimates for the final selected model with symptomatic detection dependent on TBV, and screen-detection dependent on PD and CBT.
Figure 3

The estimated screening test sensitivity function (sub-model 4) as a function of tumor size, for different versions of the model. Quantiles of sensitivity are from the estimated sensitivity functions of each screen in the data.

Since a main finding is that CBT is associated with STS, after accounting for PD, we decided to graphically illustrate its contribution. We do this in Fig. 3. We took the pairs of PD and CBT values for each of the 163,053 screenings and calculated the STS for a range of tumor sizes. For each tumor size, we then plotted the 5th, 50th, and 95th percentiles of STS. We did this for two versions of the model: the (selected) model with PD and CBT (estimates in Table 3), and the model with only PD. Adding CBT helps to separate the women with high and low STS, particularly those with high STS. The maximum difference between the 5th and 95th percentiles for STS was 0.44 (or 44 percentage points) for the model with PD and CBT, and 0.34 (or 34 percentage points) for the model with only PD.
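The percentile bands can be sketched for the PD-only model using the point estimates reported in the text (\({\hat{\beta }}_0 = -4.88\), \({\hat{\beta }}_s = 0.51\), \({\hat{\beta }}_{PD} = -2.50\)). Two things are assumptions of this sketch: that tumor diameter enters in mm, and the list of PD values, which is hypothetical and stands in for the per-screening measurements. The function names are illustrative.

```python
import math
from statistics import quantiles

# Point estimates reported in the text for the model with only PD in the STS.
BETA_0, BETA_S, BETA_PD = -4.88, 0.51, -2.50


def sts_pd_only(diameter_mm: float, pd: float) -> float:
    """STS for a latent tumor diameter (assumed in mm) and percent density (0-1)."""
    linear = BETA_0 + BETA_S * diameter_mm + BETA_PD * pd
    return 1.0 / (1.0 + math.exp(-linear))


def sts_band(diameter_mm: float, pd_values):
    """5th, 50th and 95th percentiles of STS over a set of screenings."""
    values = [sts_pd_only(diameter_mm, pd) for pd in pd_values]
    cuts = quantiles(values, n=20, method="inclusive")  # cut points at 5%, 10%, ..., 95%
    return cuts[0], cuts[9], cuts[18]


# Hypothetical PD measurements (not study data).
pd_values = [0.01, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.25, 0.33, 0.45, 0.60, 0.70]
bands = {size: sts_band(size, pd_values) for size in (5.0, 10.0, 15.0, 20.0)}
```

Because STS decreases with PD, the 5th percentile of STS at each tumor size corresponds to the densest breasts in the sample.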

To illustrate how CBT contributes to STS (how it complements PD), and how PD and CBT differ across screenings (for the same women), as well as across women, we present four examples (sets of images from four selected women). For each of the women, four mammograms were taken during the study period. Images from two women with high PD are shown in Fig. 4, and images from two women with low PD are displayed in Fig. 5. We present the woman’s age at the time each mammogram was taken, along with the measured PD and CBT values. We also display the estimated STS when including CBT (denoted STS* in the figures) and the estimated STS when not including CBT (denoted STS**) in the model-based STS measures. Both STS estimates are calculated for a hypothetical tumor diameter of 13 mm (the median in the study) and include PD in the estimate.

In discussions, below, of the images in Figs. 4 and 5, we use the notation i–j to refer to the image corresponding to Woman i and Round j. Woman 1 has high PD and CBT, and PD measurements reduce greatly over time from 57 to 24%. There is a single large increase in CBT in 1–4 from 79 to 90 mm. For all four images, the estimated STS is severely reduced (by around 0.20) when accounting for the high CBT. The average estimated STS difference when including CBT is 0.20 compared to not including CBT.

Figure 4

Example of two women’s sequential mammograms over 5 years, including age and the measured values of percent density (PD) and compressed breast thickness (CBT). *Estimated screening test sensitivity (STS) for a hypothetical 13 mm tumor when accounting for PD and CBT. **Estimated STS for a hypothetical 13 mm tumor when accounting only for PD.

Figure 5

Example of two women’s sequential mammograms over 5 years, including age and the measured values of percent density (PD) and compressed breast thickness (CBT). *Estimated screening test sensitivity (STS) for a hypothetical 13 mm tumor when accounting for PD and CBT. **Estimated STS for a hypothetical 13 mm tumor when accounting only for PD.

Woman 2 also has high PD, but has average CBT. Instead of decreasing over time, the PD measurements fluctuate between 39 and 50%. CBT values are between 54 and 61 mm. Also, for this woman, all estimated STS values are lower when CBT is accounted for; the estimated STS difference is on average 0.06.

Comparing images across women helps to highlight the impact of including CBT in the STS model/estimates. If we compare 1–4 to 2–2, which are the images with the highest estimated STS for the respective woman, we see that 1–4 has much lower PD (24% vs. 39%), and so, based solely on PD, has higher STS. However, when accounting for CBT, the estimated STS of 1–4 is not as high as that of 2–2. It is the high CBT of Woman 1 that drives the differences in STS estimates. Similarly, comparing 1–3 and 2–4, we see that despite having almost the same PD measurements, the estimated STS is significantly higher in 2–4 than in 1–3.

Woman 3 in Fig. 5 has relatively low PD and average CBT. The better separation of high and low STS means that the estimated STS is slightly higher when accounting for CBT. The mean estimated STS difference is −0.03.

Similarly, Woman 4 has low PD. However, the CBT values are significantly higher. Here we see that the estimated STS is instead noticeably lower when including CBT in the estimate. The estimated STS difference is 0.06 lower on average. The difference compared to Woman 3 is due to the high CBT. For example, images 3–3 and 4–3 have the same PD and the same estimated STS without CBT, but a 0.11 STS difference when accounting for CBT.

Lastly, we performed a sensitivity analysis of our choice to use the average measurements between the left and right side. We repeated the analysis where we used the measurements from the contralateral breast for the cases, and a randomly selected side for the censored women. The estimated coefficients for CP and EXP were smaller (scaled estimates of −0.08 and −0.16) but still not statistically significant (p-values 0.40 and 0.07). We also found that the coefficients for PD and CBT were only minimally impacted, with scaled estimates of −0.64 (95%CI −0.84, −0.44) and −0.27 (95%CI −0.45, −0.09).

Discussion

We have investigated the role that PD and a number of screening parameters have on mammography screening sensitivity. We found that the STS was significantly reduced by having increased PD or increased CBT. While PD is well known to have a masking effect14,16,35, few studies have been done on the role of CBT in screening sensitivity. In the first part of a two-part study, Salvagnini et al.36 simulated masses and microcalcifications onto real mammograms—they used 130 subsets of images with each subset containing four images with different CBT, but matched by BI-RADS score37. Half of these images were kept lesion free. Radiologists, who were blinded to tumor/control status, attempted to identify the lesions. Examinations were performed with standard AEC settings. The detectability (as measured by area under the operating characteristic curve) was found to be lower for mammograms with higher CBT. Salvagnini et al.36 reported a free-response receiver operator characteristic of 0.802 for CBT under 30 mm and 0.553 for CBT over 60 mm. Our results, obtained from a population-based study, with real lesions, are consistent with these results.

As mentioned in the introduction, Moshina et al.20, and Holland et al.19 studied the association between compression force and/or pressure and detectability at mammography screening. They used the ratio-based definition of screening sensitivity. Both studies found that high CP reduced sensitivity and was associated with interval cancer. Hill et al.21 instead found the opposite association: higher CP increased sensitivity. We can note that Fig. 2 in Holland et al. may suggest that too little compression also reduced the sensitivity (making the association follow an inverted U-shape with the highest sensitivity for CP between 0.93 and 1.08 N/\(\hbox {cm}^2\)), though this non-linear relationship is not tested for in the study. In our study, we did not find any association between CP and STS, but instead found an association between CBT and STS. The two other studies adjusted for TBV and mammographic density, but it would be interesting to see if the association they found with CP persists if also adjusting for CBT. It could be that CBT better represents the achieved compression.

We defined the STS as a screening-specific probability of detecting a BC given the tumor size at the time of screening. The approach involved modelling the latent onset and growth of BC tumors. While there is no substitute for multiple actual size measures, this approach should give valid inference results on a population level. Other studies have had a general lack of consideration for tumor size and growth rates when estimating screening sensitivity. The variant of mammography screening sensitivity that we study is closely related to the statistical definition of sensitivity13, although with our definition, a positive “test” result is not based only on the radiologist’s assessment of the mammogram, but incorporates the outcome of any subsequent investigation, e.g. tomosynthesis, ultrasound, or biopsy, that ultimately leads to diagnosis. Our STS is therefore more aptly defined as the probability that a screening leads to diagnosis, given the tumor size and, in this study, PD and CBT.

Comparing this definition to the standard definition of sensitivity in the context of mammography screening, we have previously mentioned that the standard definition assumes that all interval cancers were present at the last screening but missed, and that no cancer was present for more than one screening round. This puts unmotivated constraints on the natural history of the cancer. By using our approach to estimating the STS, all possibilities for the natural history are taken into account.

The standard definition is also dependent on the screening interval and the follow-up time8. Using this definition of screening sensitivity also leads to studies estimating different sensitivities for the prevalence screening (the first screening round in a study) and incidence screenings (subsequent screening rounds), with a significantly greater sensitivity for the prevalence screen38. This is due to the tumors being larger on average in the prevalence screen compared to the incidence screens10. With time and regular screening the estimated screening sensitivity will change. A major advantage of the type of model used in this study is that it automatically includes the difference between a prevalence screen and incidence screen by modelling latent tumors and incorporating individual screening histories. The estimated STS also does not depend on the screening interval or the follow-up time, given the concurrent latent tumor size distribution.

Screening using contrast-enhanced magnetic resonance imaging (MRI) is known to find more breast cancers than mammography screening, and to detect them sooner39,40,41,42. The additional tumors detected at MRI are on average smaller than those detected at mammography. This highlights the importance of tumor size, something often neglected when reporting screening sensitivities. As can be seen in Fig. 3, the estimated median sensitivity for a 10 mm tumor was 45%, while the 5th percentile of sensitivity (i.e. for the most dense breasts) was a mere 20%. It therefore follows naturally from our approach that the additional MRI-detected tumors have an especially low mammography sensitivity, and that MRI can detect such tumors sooner.

In a previous study43 we estimated the association between PD and STS to have a coefficient of −2.09 (95% CI −2.93, −1.26), which is noticeably smaller in magnitude than what we estimate in this study. One reason is the inclusion of the other screening variables. Another reason is that in the previous study we only had one measurement of PD per woman, taken from the screening where they were enrolled. In Fig. 2 we see how PD differs with age in our study. Some of the difference between estimates can probably be attributed to the fact that PD at the start of follow-up (when PD was measured in the previous study) was, on average, higher than at subsequent screenings. Having multiple measurements of PD (and by extension the other screening variables) has improved the estimates.

For Woman 1 in Fig. 4, the 4th image had an estimated PD of 24%, which was perhaps lower than was motivated by its appearance. That image also had a larger CBT than the other three by 10 mm. This could have caused insufficient spreading out of the breast tissue, which in turn led to a relatively low contrast between dense and non-dense tissue. As a result, the PD would be underestimated. By accounting for the raised CBT, the estimated sensitivity is lowered to be closer to that of the previous images, thus partially offsetting the underestimated PD.

It is also important to note that the association between compressed breast thickness and mammographic screening sensitivity/STS that we observed, after adjustment for PD, is based on an area-based density measure. The relationships of both area and volumetric PD with sensitivity have been extensively studied44,45,46. If volumetric PD measurement is more consistent across various levels of compression than area PD, it is possible that the influence of thickness might be lower.

There are of course limitations to our modelling approach. Here we chose to model the STS as a logistic function of tumor diameter. While this is a function with desirable properties, it is a choice that restricts the shape of the STS. Wang et al.47 tried to estimate the STS without parametric assumptions. By assuming exponential tumor growth with the same constant tumor volume doubling time for all tumors, their results suggested that the logistic function underestimates the STS for small tumors, and that the STS seemingly approaches 1 too quickly. Alternative functions might improve model fit, but for this study we do not foresee that the associations between the screening variables and the STS are significantly affected.

In recent years, there has been a great deal of interest in improving ways to screen women for whom breast cancer risk is high and mammographic sensitivity is low48,49. It has been suggested that women’s screening intervals should be tailored based on density50, or that high-density women should be offered additional screening modalities46,51. The motivation is based on a dual effect of an increased BC risk, as well as an increased risk of masking at mammography. While density remains the dominant factor of risk and sensitivity, the results of this study suggest that additional imaging parameters might be considered in the decision-making.

With regard to this, improving understanding of mammographic sensitivity is important, as are approaches to improving sensitivity. In part 2 of Salvagnini et al.’s study of the effect of compressed breast thickness on lesion detectability36, a new AEC setup was proposed and studied, to see if lesion detectability can be improved for larger breast thicknesses. The results from part 1 of their study, based on simulated lesions, along with our results, based on real images/lesions in a large population-based study, provide motivation for such endeavours.

Conclusions

The true sensitivity of mammography, defined as the probability that an examination leads to a positive result if a tumour is present in the breast, is associated with compressed breast thickness after accounting for mammographic density and tumour size. This can be used to guide studies of setups aimed at improving lesion detection. These results can help motivate further research into mammography screening settings. Our results suggest that other screening parameters than mammographic density—specifically the compressed breast thickness—should be considered when assigning alternative screening modalities, and when considering personalized screening in the future.