Introduction

Breast screening is widely implemented in many healthcare systems to reduce breast cancer mortality through the expedited diagnosis of smaller, asymptomatic breast cancers. For the majority of women, this uses mammography starting in middle age, although screening recommendations and practices differ between regions (Table 1). In rarer, very high-risk situations such as a known, high-penetrance genetic predisposition, earlier screening with magnetic resonance or ultrasound imaging is advocated [1, 2].

Table 1 Summary of national screening programme strategies or national body recommendations for screening women who are not at elevated risk of breast cancer (e.g. those without a known familial risk/genetic predisposition, or history of chest wall radiotherapy).

Whilst meta-analysis of randomised clinical trial data clearly demonstrates a reduction in the relative risk of breast cancer mortality due to screening, reduced breast cancer deaths may come at the expense of overdiagnosis (the identification and unnecessary treatment of clinically insignificant tumours), as well as the consequences of false-positive or false-negative results [3,4,5,6,7,8]. There is wide variability in the results seen in observational studies in relation to overdiagnosis estimations, depending on analytical or modelling approaches [9,10,11,12]. The summation of evidence assessed by the UK Independent Panel [3] was that breast screening does reduce mortality. It concluded that for every 10,000 women in the United Kingdom aged 50 years invited to screening for the next 20 years, 43 breast cancer deaths would be avoided and 681 tumours (invasive or ductal carcinoma in situ [DCIS]) would be diagnosed, but 129 women would be overdiagnosed, i.e. three overdiagnosed cases per breast cancer death averted, although this calculated benefit–harm balance has been contested by some [13, 14].
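The benefit–harm arithmetic behind the Panel's headline figures can be reproduced in a few lines. The sketch below uses only the numbers quoted above; the derived quantities other than the ratio of three (the number invited per death avoided and the overdiagnosed share of diagnosed tumours) are simple illustrations and are not figures reported by the Panel.

```python
# Figures quoted in the text, per 10,000 UK women aged 50 years
# invited to screening for the next 20 years.
INVITED = 10_000
DEATHS_AVOIDED = 43
TUMOURS_DIAGNOSED = 681  # invasive cancers or DCIS
OVERDIAGNOSED = 129

# Overdiagnosed cases per breast cancer death averted (the "three" in the text).
ratio = OVERDIAGNOSED / DEATHS_AVOIDED

# Illustrative derived quantities (not quoted by the Panel in the text above).
invited_per_death_avoided = INVITED / DEATHS_AVOIDED
overdiagnosed_share = OVERDIAGNOSED / TUMOURS_DIAGNOSED

print(f"Overdiagnosed per death averted: {ratio:.1f}")
print(f"Women invited per death avoided: {invited_per_death_avoided:.0f}")
print(f"Share of diagnosed tumours overdiagnosed: {overdiagnosed_share:.1%}")
```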

Most countries use an age-based population-level breast screening strategy that reduces breast cancer mortality but does not account for the wide variation in individual women’s cancer risks [15,16,17,18]. Identification of women at the highest risk of breast cancer may enable targeted intensification of early detection or preventive measures, and reduce the public health burden of this common malignancy. Motivated by such factors, ‘risk-stratified’ breast screening has emerged as a concept in which decisions to offer screening or the determination of screening frequency and modality (e.g. from mammography/tomosynthesis to magnetic resonance imaging [MRI]) are guided by accurate estimation of an individual woman’s risk of breast cancer [15, 19,20,21]. Logically, risk-stratified screening would target intervention at those that stand to gain the most and reduce or stop screening in those that stand to gain little benefit, potentially also informed by the cost-effectiveness evidence.

However, the efficacy and feasibility of personalising screening strategies is uncertain and would require the meeting of several critical dependencies for implementation [21]. These include the need for rigorously developed and validated risk prediction models capable of stratifying women accurately, robust health economic evaluation, and clarity on the balance sheet of benefits and harms that would become the ‘new norm’ in clinical practice after prospective studies. Although some recent reviews have sought to summarise this field [21,22,23], they differ in the extent to which they cover each of these dependencies. In this narrative scoping review, the authors seek to provide a concise overview of the key topics within the risk-stratified screening literature and anticipate the potential effects of emerging evidence on clinical practice. First, the evidence for risk prediction modelling that may guide personalised screening is reviewed. Thereafter, the evidence from observational analyses of epidemiological or registry data is appraised prior to discussion of how ongoing prospective cohorts and trials have been designed in this area. Furthermore, health economic evidence and output from qualitative studies is synthesised. Throughout, there is an emphasis on evidence quality, its limitations and discussion of how unmet needs may be satisfied.

Methods

A scoping literature review was undertaken using Medline (PubMed) with the following search strategy: (“breast screening” OR “mammography”) AND (“risk#adapted” OR “risk#stratified” OR “personalised” OR “personalized” OR “tailored” OR “risk#based”). Papers published in any language prior to 1 November 2020 were considered for inclusion in this review. We reviewed the reference lists of systematic reviews to identify key publications if not identified by the search strategy. We also searched clinicaltrials.gov on 1 November 2020 to identify ongoing interventional studies in this area (for search terms: “breast cancer”, “screening” and “risk”).

Reports retrieved were screened for inclusion based on title and abstract and, if relevant, were classified into five groups: papers reporting risk prediction models, epidemiological analyses of risk-stratified screening or retrospective evaluations, prospective studies and trials of risk-stratified screening, health economic evaluations and qualitative research on feasibility or acceptability. Findings were synthesised narratively, informed by the narrative synthesis guidelines developed by the Cochrane Collaboration [24].

Risk prediction models to guide personalised screening

Several risk prediction models for breast cancer incidence have been reported, which tend to incorporate ‘classical’ risk factors identified from epidemiological evidence, e.g. clinical, demographic or pharmacological exposures, but may also assimilate factors such as family history, genetic risk markers or polygenic risk scores and imaging-related parameters. Well-recognised ‘risk factors’ include breast density, first-degree family history of breast cancer, increased body mass index (BMI), nulliparity or older age at first birth, and such factors may account for over 52% of risk [25]. The role of obesity relative to menopause, such as the apparent ‘protective effect’ of obesity on oestrogen receptor-positive cancers prior to menopause [26, 27], may be relevant but is not always incorporated into risk models [28].

Some models have also sought to incorporate markers of genetic risk, such as the inclusion of BRCA genotype in the Tyrer–Cuzick (also known as the ‘IBIS’) model [29], and the addition of a polygenic risk estimate in a later update thereof [30]. Some comprise predominantly genetic information, such as BOADICEA [31,32,33] and BRCAPRO [32, 34]. Genetic predisposition, through either highly penetrant mutations such as BRCA1/2 [35, 36] or subtler single-nucleotide polymorphisms (SNPs), may indeed affect breast cancer risk; however, only 25–50% of the familial risk can be explained by known genetic variants [35, 37,38,39,40], and only 16% of the risk of nonfamilial breast cancer is accounted for by SNPs [36]. Furthermore, breast density and textural features [41] may be relevant to breast cancer risk and have been explored as covariates in risk prediction models rooted in mammographic image analysis or as additions to covariate panels in updated versions of statistical models. Breast density can be assessed using visual assessment scales [42] or automated/algorithmic approaches [43,44,45], but the most common approach appears to be the four-category ‘Breast Imaging Reporting and Data System’ classification [46] (BI-RADS) [30, 47, 48]. Modelling approaches utilised include mathematical modelling [29], statistical regression [49, 50] or ‘machine learning’ techniques [51].

Whilst the implementation of risk-stratified screening to entire populations depends on the prospective outcome and economic evaluations, the requirement for accurate and robust multivariable risk prediction models is the sine qua non of any such approach [52]. All risk prediction models intended to be used to inform clinical decision-making should be transparently reported, and robustly evaluated in terms of various performance metrics [53]. Strong internal validation is recommended, and appropriate external validation using data sources that are independent of those used to generate the model may also be useful [54, 55]. Important considerations include the discrimination of models, i.e. how well they distinguish between women who do develop breast cancer versus those who do not [56]; calibration, i.e. the degree of agreement between the predicted risks and the observed risks [57]; and assessment of ‘net benefit’ using decision-curve analysis [58, 59]. In terms of discrimination, the ‘area under the curve’ (AUC) or identical ‘c-statistic’ may be used for binary outcomes, or the ‘c-index’ for survival data [60]; both range between 0 and 1, with values of 1 corresponding to perfect prediction and 0.5 reflecting discrimination no better than a coin toss. Other metrics to consider include the proportion of variance explained by the model, such as the R² [2, 61]. It is increasingly clear that average performance metrics or ‘overall’ assessments of model performance in populations are insufficient to truly assess clinical utility on deployment, as there may be differences in model performance between regions, ethnic groups or even age groups [62, 63].
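To make the discrimination measure concrete, the c-statistic for a binary outcome can be computed as the proportion of all case/non-case pairs in which the woman who developed cancer was assigned the higher predicted risk, with ties counted as one half. The sketch below is a minimal illustration of that definition; the function name and the six predicted risks and outcomes are hypothetical and are not drawn from any of the models reviewed here.

```python
from itertools import product


def c_statistic(predicted_risks, outcomes):
    """Proportion of case/non-case pairs ranked correctly (ties count 0.5)."""
    cases = [p for p, y in zip(predicted_risks, outcomes) if y == 1]
    non_cases = [p for p, y in zip(predicted_risks, outcomes) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")

    concordant = 0.0
    for case_risk, non_case_risk in product(cases, non_cases):
        if case_risk > non_case_risk:
            concordant += 1.0
        elif case_risk == non_case_risk:
            concordant += 0.5
    return concordant / (len(cases) * len(non_cases))


# Hypothetical predicted risks and observed outcomes (1 = cancer).
risks = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10]
outcomes = [0, 0, 1, 0, 1, 1]
example_c = c_statistic(risks, outcomes)  # 8 of 9 pairs ranked correctly
print(f"c-statistic: {example_c:.3f}")
```

A model that assigns every woman the same risk scores 0.5 under this definition, which is the 'coin toss' benchmark referred to above.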

Table 2 describes the development and validation results regarding key published risk prediction models. Our search strategy sought to identify original reports and secondary validation studies without restricting the latter to those performed by model developers. Interestingly, a recent systematic review by Louro et al. [64], which appraised the evidence for the Breast Cancer Risk Assessment Tool (BCRAT), Breast Cancer Surveillance Consortium (BCSC), Rosner and Colditz, IBIS and other models using the ISPOR-AMCP-NPC [65] questionnaire rather than PROBAST [66] found that it was challenging to recommend any model for the purposes of risk-stratified screening. Importantly, some risk prediction models were missed by their search strategy [50, 67], and models focussing on genetic determinants of risk were intentionally excluded [64]. The emergent pattern is that incorporation of multiple data forms, such as adding breast density or other mammographic features, or genetic information yields incremental gains in model performance [68], although these tend to be relatively small [30, 69]. Specifically, regarding incorporating breast density on discrimination, the increase in the AUC of published models ranges from 0.03 to 0.14 [69]. The effects on calibration are less clear and the effects on net benefit are not reported.

Table 2 Details regarding study data, modelling strategy and performance metrics of notable published risk prediction models for breast cancer or their ‘updates’ identified during the scoping review.

It is important to note that simply comparing the AUC/c-indices of different model development studies does not constitute a meaningful comparison, as breast cancer risk is strongly influenced by age, and the AUC/c-index may be influenced by the heterogeneity of the study population, the prediction horizon used and the source of the population, all of which may differ across studies. For example, the AUC/c-index of a model developed in a cohort with a very broad age range is not directly transferable to a separate cohort of women with a narrower age range, such as women eligible for screening currently.

The O/E ratio is widely used to assess calibration; it simply compares the overall number of observed cases with the number predicted by a model for a given population. As a standalone metric it is insufficient, as over-prediction in some sub-groups can be compensated by under-prediction in others [57]. A more comprehensive analysis of alignment between predicted versus observed risks for individual study participants could include the use of a calibration plot displaying (mis)alignment across levels of risk [70]. Some papers report as one aspect of their analyses the hazard ratio of pre-selected highest risk groups (such as top tenth) to middle-risk groups (such as middle 80%) [47]. This alone may provide an incomplete assessment of model performance as it is comparing small groups at the extremes of a risk score distribution to the bulk of the study population, which would naturally be expected to diverge in their observed risk. These must therefore be interpreted in the context of the other sources of information regarding model performance when provided. A key external validation study of four key models in a cohort of 15,732 women from Australia, Canada and the US (519 cases of breast cancer) demonstrated c-statistics of 0.70 for BOADICEA (95% confidence interval [CI]: 0.68–0.72), 0.71 for IBIS (95% CI: 0.69–0.73), 0.68 for BRCAPRO (95% CI: 0.65–0.70) and 0.60 for BCRAT (95% CI: 0.58–0.62) [71]. Assessment of calibration was limited to the O/E ratio: BOADICEA 1.05 (95% CI: 0.97–1.14), IBIS 1.03 (95% CI: 0.96–1.12), BRCAPRO 0.68 (95% CI: 0.65–0.70), and BCRAT 0.79 (95% CI: 0.73–0.85).
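The limitation of the overall O/E ratio can be shown with a small constructed example. In the sketch below, all data are hypothetical: a model under-predicts risk two-fold in one sub-group and over-predicts in another, yet the pooled O/E ratio is almost exactly 1. The function name is illustrative only.

```python
def oe_ratio(observed_events, predicted_risks):
    """Observed number of cases divided by the sum of predicted risks."""
    return sum(observed_events) / sum(predicted_risks)


# Hypothetical sub-group A: 100 women, predicted risk 2% each, 4 cases observed.
observed_a = [1] * 4 + [0] * 96
predicted_a = [0.02] * 100

# Hypothetical sub-group B: 100 women, predicted risk 6% each, 4 cases observed.
observed_b = [1] * 4 + [0] * 96
predicted_b = [0.06] * 100

oe_a = oe_ratio(observed_a, predicted_a)  # under-prediction (O/E about 2)
oe_b = oe_ratio(observed_b, predicted_b)  # over-prediction (O/E about 0.67)
oe_overall = oe_ratio(observed_a + observed_b, predicted_a + predicted_b)

print(f"Sub-group A O/E: {oe_a:.2f}")
print(f"Sub-group B O/E: {oe_b:.2f}")
print(f"Overall O/E:     {oe_overall:.2f}")
```

Reporting O/E by sub-group or risk stratum, or plotting observed against predicted risk, exposes the miscalibration that the pooled ratio conceals.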

Recently, there has been increasing interest in ‘machine learning’ prediction modelling for healthcare. Whilst arguably perceived to be more flexible (e.g. better at capturing non-linear, complex interactions), less reliant on assumptions than traditional regression and capable of handling some forms of data that regression models cannot, machine learning has not been shown to be inherently better than traditional statistical modelling approaches [72]. Datasets used for ML modelling should capture clinical reality, i.e. reflect the target population, and the architecture of any algorithm should be reported, given their structural flexibility. Clarity of reporting model development can be problematic [73] and validation/performance assessment approaches may not always be appropriate or transparent, especially when comparing different approaches. One recent study compared the performance of the BOADICEA model with a Markov chain Monte Carlo generalised linear mixed model, an adaptive boosting model and a random forest model developed using data from a single oncogenetic institution in Switzerland that focusses on counselling and testing for hereditary cancer syndromes [51]. Whilst the machine learning models were declared to outperform BOADICEA, no robust evaluation of model calibration was performed, and the effective comparison was an external validation of BOADICEA versus an internal validation of the new model using the data they were derived from. This used cross-validation with a low number of repeats (n = 20), which presumably was used for hyperparameter tuning as well as performance evaluation (not elaborated in paper), an option that is optimistically biased [74, 75]. Further work in this area of comparing different model-building strategies for predicting risk should focus on more robust, meaningful comparisons.

Overall, a range of clinical prediction models has been developed that could be used to guide risk-stratified screening, some of which are undergoing evaluation in trials of personalised screening. The ability of models to guide risk-stratified screening by predicting incident breast cancer risk in asymptomatic women is uncertain [64], even if integrating clinical, genetic and imaging-derived variables [30, 47, 49, 68]. There is no single accepted benchmark for a given performance metric to render a model suitable for guiding personalised screening, and decisions regarding optimal models should not be made on a single metric. Instead, models need to be robustly assessed in terms of discrimination, calibration and potential clinical utility in the target populations. Whilst a model with an AUC of 0.5 in its target population cannot be informative, a high AUC in a model development/evaluation study does not guarantee utility in guiding risk-based screening. Poorly calibrated models may cause harm, and those with unstable performance across sub-groups may raise concerns regarding ‘algorithmic fairness/bias’. Some of these models are not fully developed using individual-level data; rather, they are pre-determined systems of weights that are then applied to a test dataset to assess performance [30, 47]. Crucially, however, weak or non-existent calibration assessment, non-examination of performance heterogeneity, or the lack of consideration of geographical and temporal transportability [55, 76] are notable limitations. The QCancer (Breast) [50], IBIS [47] and iCARE [77] models are some examples wherein exploration of performance heterogeneity is performed according to age groups or other clinically relevant sub-populations (see Table 2).
It has also been suggested that in order to minimise harms from overdiagnosis of indolent tumours, modelling the risk of developing lethal breast cancers could be more appropriate than modelling the diagnosis of any breast cancer in order to risk-stratify screening [11]. This is another avenue for further exploration and analysis.
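The 'net benefit' measure used in decision-curve analysis, referred to above as a means of assessing potential clinical utility, can be sketched as follows. At a chosen risk threshold, net benefit is the proportion of true positives minus the proportion of false positives weighted by the odds of the threshold; it is conventionally compared with a strategy of intervening in everyone. The data, the 3% threshold and the function names are hypothetical choices for illustration, not values taken from the studies reviewed.

```python
def net_benefit(predicted_risks, outcomes, threshold):
    """Net benefit of intervening in those with predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    pairs = list(zip(predicted_risks, outcomes))
    true_pos = sum(1 for p, y in pairs if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in pairs if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of intervening in every woman regardless of predicted risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical predicted risks and observed outcomes (1 = cancer).
risks = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10]
outcomes = [0, 0, 1, 0, 1, 1]
threshold = 0.03  # hypothetical risk threshold for offering intervention

nb_model = net_benefit(risks, outcomes, threshold)
nb_all = net_benefit_treat_all(outcomes, threshold)
print(f"Model: {nb_model:.4f}  Treat all: {nb_all:.4f}")
```

A decision curve repeats this calculation over a range of clinically plausible thresholds; a model is potentially useful where its net benefit exceeds both the 'treat all' and 'treat none' (net benefit of zero) strategies.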

Epidemiological analyses and retrospective evaluations of risk-stratified screening

The real-world effects of implementing risk-stratified screening strategies [78] warrant evaluation in prospective studies and trials. Whilst trials have recently been initiated (see later section), there have also been explorations of the possible benefits and harms using epidemiological approaches in a jurisdiction where multiple forms of mammography screening are available, namely Taiwan. This complements studies modelling breast cancer risk in large cohort studies relative to age, or retrospectively simulating the effect of implementing different screening practices in screening cohorts. Table 3 summarises the evidence from such epidemiological papers or retrospective clinical evaluations simulating the possible effects of risk decision rules using patient data.

Table 3 Comparison of studies evaluating risk-stratified screening using simulations on retrospective data, or epidemiological studies.

A large Taiwanese study (n > 1.4 million) exploited the natural experiment of the concurrent availability of three screening approaches in the country’s population, namely annual clinical breast examination as the baseline (women aged 35 years and over), risk-stratified biennial mammography screening or universal mammography (both for women aged 50–69 years) [79]. The existence of three available approaches was predicated on a low breast cancer incidence rate in 2002–2004 and concerns about healthcare system capacity for whole population screening at the time, although the rates have increased since [80]. Risk stratification used a ‘risk score’ derived from reproductive/menstrual history and family history data obtained during attendances for clinical breast examination between 1999 and 2001, with the median of the risk scores used as the cut-off for eligibility for biennial mammography. Using propensity score methodology to try to adjust for disparities in baseline risk factors across the three groups (age at menarche, parity, breastfeeding and BMI), Cox models were used, where screening modality was modelled as a time-dependent covariate. Compared to clinical examination, universal biennial mammography had a higher breast cancer detection rate, was associated with a downwards stage migration of detected cancers, a 13% overdiagnosis rate (95% CI: 8–18%), a 30% reduction in stage II+ breast cancers (hazard ratio [HR] 0.70, 95% CI: 0.66–0.74), and a 41% reduction in breast cancer mortality (95% CI: 27–52%) when adjusting for propensity score and year of birth [79]. Compared to clinical examination, the overdiagnosis with risk-based screening was negligible (HR for diagnosis 0.97, 95% CI: 0.92–1.03), there was an 8% reduction in stage II+ breast cancers (HR 0.92, 95% CI: 0.86–0.99) and a ‘non-significant’ reduction in breast cancer mortality of 14% (HR 0.86, 95% CI: 0.73–1.03) [79].
However, the risk-stratification mechanism in this study was unclear—data were not provided on the modelling methodology used, the risk score covariates, the risk score distribution in the population seeking to opt into risk-based screening (or across the three arms) or performance evaluation to assess if this approach was suitable for clinical use. Further detail is needed to make meaningful inferences on the performance of risk-based screening versus ‘standard’ screening in this analysis. In addition, given the relatively low, albeit increasing, breast cancer incidence rate in this population, risk stratification beyond age and sex may have different proportional benefits in Taiwan in comparison to other nations.
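The percentage reductions quoted for the Taiwanese study follow directly from the hazard ratios, as (1 − HR) × 100, and the ‘non-significant’ label corresponds to a 95% CI that includes 1. The sketch below reproduces that reading for three of the estimates given above; the function names and dictionary labels are illustrative.

```python
def percent_reduction(hazard_ratio: float) -> float:
    """Relative reduction in hazard implied by a hazard ratio below 1."""
    return (1.0 - hazard_ratio) * 100.0


def ci_excludes_null(lower: float, upper: float) -> bool:
    """True if the confidence interval does not contain HR = 1."""
    return upper < 1.0 or lower > 1.0


# (HR, lower 95% CI, upper 95% CI) versus clinical breast examination [79].
ESTIMATES = {
    "universal mammography, stage II+": (0.70, 0.66, 0.74),
    "risk-based screening, stage II+": (0.92, 0.86, 0.99),
    "risk-based screening, mortality": (0.86, 0.73, 1.03),
}

for label, (hr, lower, upper) in ESTIMATES.items():
    verdict = "excludes 1" if ci_excludes_null(lower, upper) else "includes 1"
    print(f"{label}: {percent_reduction(hr):.0f}% reduction, 95% CI {verdict}")
```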

Rather than analysing the effects of altering screening intensity or avoiding screening in low-risk women, a large Swedish cohort with linkage to several national databases (n > 5,000,000 women) was used to assess whether earlier screening starting ages could be appropriate for some women. By using 10-year cumulative risk estimations, the risk level of the ‘average’ 50-year-old woman that would be offered screening was calculated as a benchmark. The ages at which other women would attain the same 10-year risk were compared, based on patterns of family history [81] or personal reproductive history (parity and age at first birth) [82]. Both studies found that either approach could identify women who, despite not being eligible to start age-based screening, had the same 10-year risk estimate as 50-year-old women who would be invited to screen, as well as women who may only attain that same threshold of risk after age 50 years. For example, women who had their first baby aged under 25 years met the benchmark aged 51 years, whereas women who had four births by age 25 years met this at 59 years of age [82]. Furthermore, women with one first-degree relative diagnosed with breast cancer before the age of 40 years met the average risk of women starting age-based screening at age 36 years [81]. Therefore, despite debate around the benefits of universally expanding screening to younger age groups, such as the lack of long-term effect seen in the UK Age trial [83], selected women with selected risk factors may be suitable for earlier or delayed commencement of early detection strategies. The optimal way to assess risk would require elucidation, as reliance on two, albeit important, risk factors may inadequately capture risk.

In the radiological literature, some commentators have voiced criticism of the potential harms posed by risk-stratified screening [84,85,86], typically fuelled by retrospective studies applying risk factor-based decision rules to cohorts of women that partook in service screening [86,87,88,89]. For example, Lee et al. examined recall rates, cancer detection rates and positive predictive values for biopsy recommendation and biopsy performed across age groups, when accounting for breast cancer family history, personal breast cancer history and having dense breasts in a cohort of >2.6 million women [88]. The recall and cancer detection rates in 30–39-year-old women with these risk factors undergoing incidence screening were the same as those in 40–49-year-old ‘average-risk’ women undergoing screening; thus, they concluded that such higher-risk women may benefit from an earlier screening starting age. Other institutional studies have expressed concern that risk-stratified approaches have the potential to miss 75.6–88% of cancers if screening was purely based on family history, 56–86% of cancers if density was the sole determinant or 43.5–76% if access to screening was dictated by positive family history and breast density [86, 89]. Such approaches are not powered to assess the effects of varying screening approaches on stage at detection or breast cancer mortality, but most crucially, they rely on relatively simplistic determinations of relative risk. Prospective clinical trials are not evaluating such approaches but, rather, more nuanced methods for estimating risk.

Overall, currently available epidemiological evaluations or retrospective clinical estimations of breast cancer screening guided by individualised risk are insufficient to inform the utility of risk-stratified screening.

Prospective cohort studies

Three notable cohort studies are currently exploring multiple aspects pertaining to the feasibility and acceptability of personalised risk assessment in the general population: specifically, the Personalised RISk-based MAmma screening study (PRISMA), the Karolinska Mammography Project for Risk Prediction of Breast Cancer (KARMA) and the Predicting the Risk of Cancer at Screening study (PROCAS).

PRISMA is a Dutch collaboration between institutions including Radboud UMC Nijmegen and the North, East, West and South Screening Programmes. In 2014, PRISMA started recruiting asymptomatic women aged 50–75 years in the general population eligible for the national screening programme for data collection via questionnaires, blood and saliva samples and mammograms for assessing breast density. It has a target of 90,000 women with regard to risk factor questionnaire data collection and imaging, and 27,000 blood samples. It aims not only to develop risk prediction models as a fulcrum for investigating risk-based screening strategies but also to undertake a robust assessment of the acceptability of risk-based screening from ethical, psychological, legal and logistical perspectives. Outputs from PRISMA thus far include multi-cohort qualitative research incorporating individuals from KARMA and PROCAS, which identified a preference for risk-tailored assessment results communication (e.g. letters for below average and average risk, face-to-face appointments for higher risk), the need for standardised risk assessments within national policies and detailed information needs for women in different European countries [90, 91].

The KARMA prospective screening cohort is developing an extensive resource of banked biological, mammogram and lifestyle/clinical factor information from over 70,000 women [92]. Its aims include identification of novel circulating risk markers, genetic risk factors and imaging protocols, assessment of high-throughput breast density measurement, trials of pharmacological prevention therapies such as lower-dose anti-oestrogen therapy and risk communication as well as the development of new risk prediction models with an evaluation of how these could be implemented within screening routines. Notable outputs include the CAD2Y model, which integrates mammographic features such as computer-detected microcalcifications with ‘clinical’ factors for short-term risk prediction [93], risk estimation integrating mammographic density and polygenic risk [48, 94], and contributions to the identification and understanding of genomic breast cancer risk by multi-centre consortia [95,96,97,98].

The PROCAS 2 study began in 2015 following PROCAS 1, which recruited over 50,000 women eligible for screening mammography at the Greater Manchester NHS Breast Screening Programme. Lifestyle, reproductive history and other clinical information was collected via questionnaires, with imaging assessment of mammographic density and DNA obtained for polygenic risk analysis [99]. Numerous studies have been undertaken within the remit of PROCAS, such as the evaluation of Tyrer–Cuzick and Gail risk prediction models in a screening population, the predictive impact of the inclusion of mammographic density [68] and/or polygenic risk score components into these [30, 100], extensive assessment of risk feedback and perception [101,102,103] and probing associations between ethnicity and mammographic density [99]. PROCAS outputs have been central to supporting the feasibility of population-based breast cancer risk assessment [101, 104], including identifying no major psychological harms of providing 10-year risk estimates from different forms of risk algorithms [103].

Prospective evaluations and trials of risk-stratified screening

Prospective clinical studies and ideally randomised trials should evaluate risk-based screening practices in terms of outcomes, balance of harms and benefits, cost-effectiveness and acceptability. Several key studies are underway [105,106,107,108], including NCT04359420 [108]. This is a non-randomised, counterbalanced study across seven screening sites in the United Kingdom, in which women invited to the NHS Breast Screening Programme will either be offered the standard programme or additionally be invited to use BC-Predict, an automated system for offering a breast cancer risk assessment (to include questionnaires, breast density measurement and polygenic risk) at the time of invitation to screening [108]. Its aims include assessing risk assessment uptake after offer, uptake of risk consultation, chemoprevention or additional mammography, as well as risks of potential cancer worry, anxiety or health service costs [108]. Many others do not strictly seek to evaluate the outcomes of screening intensity/eligibility decisions based on individualised risk estimation. Indeed, some are also exploring how best to communicate personal risk (e.g. PROSPR/PCIPS 3, NCT01879189), promote breast screening uptake based on risk factor-specific educational materials (e.g. NCT00416975) or identify optimal imaging modalities for women at specific levels of predicted risk (e.g. NCT00003736).

The Women Informed to Screen Depending on Measures of risk (WISDOM) trial is a preference-tolerant randomised trial of a risk-based screening algorithm versus standard screening practice in the United States that started in 2016 [105]. Absolute breast cancer risk estimates are generated using the BCSC model [49], modified by a polygenic risk score incorporating 96 SNPs using Bayesian principles [105] and testing for nine high- or moderate-risk genes (BRCA1, BRCA2, TP53, STK11, PTEN, CDH1, ATM, PALB2 and CHEK2). Predicted 5-year risk and age dictate the screening strategy. For those aged 40–49 years: women with a 5-year risk of <1.3% are not being offered screening, those with a 5-year risk of 1.3% or greater are undergoing biennial mammography, whilst women with extremely dense breasts or who are carriers of ATM/PALB2/CHEK2 mutations without a positive family history are undergoing annual mammography. For women aged 50–74 years, all are undergoing biennial mammography unless they are carriers of ATM/PALB2/CHEK2 mutations without a positive family history, in which case they receive annual mammography. Regardless of age group, annual mammography with adjunct MRI is being deployed in carriers of BRCA1/BRCA2/TP53/PTEN/STK11/CDH1 mutations regardless of family history, carriers of ATM/PALB2/CHEK2 mutations with a positive family history, those who had chest irradiation between the ages of 10–30 years or have a 5-year breast cancer risk of at least 6%. With target recruitment of 100,000 women, it has been projected that 75% of women aged 40–49 years will be allocated to ‘no screening’, whereas 91% of women aged 50–74 years will undergo biennial mammography [105]. The primary endpoints are non-inferiority to standard screening regarding the number of late-stage breast cancers diagnosed (>stage IIB), and rates of recall and breast biopsy.
Secondary endpoints include the rate of stage IIB and interval cancers, recall rates, rates of DCIS diagnosis, rates of chemoprevention use, cancer incidence rate, PROMIS anxiety score and rates of systemic therapy use between arms. Importantly, the design is inherently adaptive so that risk assessment methodology and screening strategies are adjustable in line with future evidence under a ‘continuous improvement’ framework [18].
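The WISDOM allocation rules described above can be read as a small decision function, sketched below. This is one reading of the prose summary given here and not the trial's actual algorithm: the class, field and function names are hypothetical, and the order of precedence (highest-intensity criteria checked first, so that, for example, extremely dense breasts override a 5-year risk below 1.3% in women aged 40–49 years) is an assumption, as the text does not specify how overlapping criteria are resolved.

```python
from dataclasses import dataclass
from typing import Optional

# Gene groupings as described in the text.
HIGH_RISK_GENES = frozenset({"BRCA1", "BRCA2", "TP53", "PTEN", "STK11", "CDH1"})
MODERATE_RISK_GENES = frozenset({"ATM", "PALB2", "CHEK2"})


@dataclass(frozen=True)
class Participant:
    """Hypothetical container for the inputs named in the text."""
    age: int
    five_year_risk: float  # proportion, e.g. 0.013 means 1.3%
    mutated_gene: Optional[str] = None
    positive_family_history: bool = False
    extremely_dense_breasts: bool = False
    chest_irradiation_age_10_to_30: bool = False


def screening_strategy(p: Participant) -> str:
    """Map a participant to a screening strategy (assumed precedence)."""
    if not 40 <= p.age <= 74:
        raise ValueError("rules are described only for ages 40-74 years")

    moderate_carrier = p.mutated_gene in MODERATE_RISK_GENES

    # Criteria that apply regardless of age group.
    if (
        p.mutated_gene in HIGH_RISK_GENES
        or (moderate_carrier and p.positive_family_history)
        or p.chest_irradiation_age_10_to_30
        or p.five_year_risk >= 0.06
    ):
        return "annual mammography with adjunct MRI"

    # Moderate-risk carriers reaching this point have no positive family history.
    if moderate_carrier:
        return "annual mammography"

    if p.age <= 49:
        if p.extremely_dense_breasts:  # assumed to override the 1.3% rule
            return "annual mammography"
        if p.five_year_risk < 0.013:
            return "no screening"
        return "biennial mammography"

    return "biennial mammography"  # ages 50-74 years


print(screening_strategy(Participant(age=45, five_year_risk=0.010)))
print(screening_strategy(Participant(age=60, five_year_risk=0.020, mutated_gene="CHEK2")))
```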

The population-based Tailored Breast Screening Trial (NCT02619123) was initiated in Italy in 2013, and is randomising pre-menopausal women aged 44 years and older to invitation to ‘tailored screening’ or an active comparator [107]. The target recruitment is 33,200 women and the estimated study completion date will be early 2022. In the tailored arm, those with BI-RADS grade C–D breast density receive annual mammogram invitations until age 50 years and then standard population screening; those with lower density breasts are invited 2-yearly until age 50 years and then standard population screening. In the active comparator arm, women are invited to annual mammography until age 50 years, followed by usual population screening. The primary outcome measures are the difference in cumulative interval cancers between arms (also by density group) and the cumulative incidence of >T2 or node-positive breast cancers by arms (also by density group). Secondary endpoints include a comparison of false-positive rates between arms, the cumulative incidence of all breast cancer cases and attendance to screening. However, the ramifications of this trial on clinical practice may be limited, due to the basis for stratification (dense versus non-dense breasts), the small divergences in screening strategy (annual or biennial mammography) and the short period in which the screening intensity will be altered.

The other key trial is My Personalized Breast Screening (NCT03672331), an international study seeking to recruit 85,000 women aged 40–70 years, in which screening strategy in the experimental arm will be dictated by risk assessment incorporating age, family history, previous benign breast disease, hormone/reproductive history and a polygenic risk score. Specifically, the MammoRisk® model will be used for women with one or no first-degree relatives with breast or ovarian cancer; otherwise, the Tyrer–Cuzick model will be used (see above). No data regarding the proprietary MammoRisk® algorithm structure itself or results of performance evaluation are accessible on the owning company’s website (https://www.predilife.com/en/home-2/, accessed 1 November 2020) or identifiable on Medline, although studies of acceptability/ease of software use are available [109, 110]. In the comparator arm, women will be screened with mammography, tomosynthesis or MRI in accordance with extant national guidelines, whereas in the active arm, an estimated 5-year risk will inform mammography and/or tomosynthesis screening every 1 to 4 years (with or without ultrasound depending on breast density). The trial has a non-inferiority design and the primary outcome is the incidence of stage >II breast cancers for the risk-stratified arm. Secondary outcome measures include a superiority analysis regarding the incidence rate of stage >II cancers, rates of false positives and benign biopsies, subject anxiety, health-related quality of life according to the EQ-5D and cumulative breast cancer diagnosis rates.

Overall, three key randomised trials are underway to assess the outcomes associated with different screening intervals, starting age, or imaging modalities based on individualised risk assessment, although there is diversity in the robustness of the risk assessment mechanism. The initial results from these trials are likely to emerge within the next 3 years, and it is notable that the largest and most comprehensive is adaptable [105], in that newer methods of risk assessment or amended screening strategies can be incorporated should novel evidence emerge.

Health economic evaluations of stratified screening

Economic simulations of risk-stratified breast screening have modelled the clinical outcomes and cost-effectiveness of a range of scenarios in health systems such as the United Kingdom’s NHS Breast Screening Programme [16, 111], the United States [112, 113], Germany [114, 115], the Netherlands [116] and China [117]. Models have evaluated the stratification of screening intensity based on classical risk factors such as family history, class of breast density, age or relative risk based on polygenic risk scores. Across models, a range of outcome measures have been compared, e.g. relative numbers of breast cancer deaths avoided, rates of overdiagnosis, incremental costs or incremental quality-adjusted life-years and incremental cost-effectiveness ratios.
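
For readers less familiar with the last of these measures, the incremental cost-effectiveness ratio (ICER) is conventionally the difference in cost between two strategies divided by the difference in their health effect, here quality-adjusted life-years (QALYs). The minimal Python sketch below illustrates the arithmetic only; the function name `icer` and the figures in the usage note are hypothetical and are not drawn from any of the cited models.

```python
def icer(cost_new: float, cost_ref: float, qaly_new: float, qaly_ref: float) -> float:
    """Incremental cost-effectiveness ratio: incremental cost per QALY gained.

    cost_* are mean costs per woman and qaly_* are mean QALYs per woman
    under the new and reference strategies (hypothetical inputs).
    """
    delta_cost = cost_new - cost_ref
    delta_qaly = qaly_new - qaly_ref
    if delta_qaly == 0:
        # The ratio is undefined when both strategies yield the same QALYs.
        raise ValueError("ICER is undefined when incremental QALYs are zero")
    return delta_cost / delta_qaly
```

With purely illustrative inputs, a strategy costing 1500 per woman and yielding 10.25 QALYs, compared with a reference costing 1000 and yielding 10.0 QALYs, gives `icer(1500, 1000, 10.25, 10.0)` of 2000 per QALY gained. A negative ratio is ambiguous on its own (the new strategy may be cheaper and more effective, or costlier and less effective), so incremental costs and effects are usually reported alongside it.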

Whilst it appears that such modelling does lend support to the general concept of altering screening practices on the basis of risk to strike a more favourable balance of benefits (reduction in breast cancer deaths) and harms (overdiagnosis and unnecessary treatment) or to formulate a more cost-effective approach, a cohesive narrative regarding a particular algorithm-informed strategy is difficult to synthesise. Table 4 summarises the approaches and key results from such studies. These existing studies may diverge from real-world practice in terms of the ascertainment of individual risk (in terms of nuance and approach) or may be limited by incomplete information on risk distributions in the target population.

Table 4 Summary of health economic and outcomes models evaluating risk-stratified breast screening identified during the scoping review.

More simplistic estimations of relative risk based on a small number of covariates, such as ‘positive family history’ or breast density category, tend to be used as the stratification mechanism, or blunter approaches are used to ascertain ‘high-risk’ women, such as a relative risk of 2.0 or greater. Multivariable risk assessment seeks to offer more nuanced estimations of risk, which are not recapitulated in many studies. Modelling the implementation of a polygenic risk assessment to inform relative risk may be limited by the relatively small absolute contribution that genetic factors may play in the majority of women who do not have high- or medium-penetrance mutations. As some studies have highlighted, the lack of data on the true distribution of risk groups in the target population as a whole limits their assessment of the impact of risk-based screening practices in a health system.

The approaches and thresholds used to risk-stratify screening appear to differ between studies. Some seek to find the economically optimal screening interval or starting age for screening in cohorts simulated as having set risk levels, or even have intensified screening with additional imaging modalities as one pathway in their models. Few compare existing age-based methods with risk-stratified approaches in the same target population, and fewer evaluate a more robust view of risk-adapted screening, namely not offering screening to those at the lowest risk. Furthermore, the direct comparator in some studies is ‘no screening’ rather than an evaluation of transitioning from age-based screening to truly risk-adapted screening.

Qualitative research

In order to accept screening strategies tailored to individual risk, women need to be able to access and comprehend accurate risk estimations [118,119,120], and indeed, many women are interested in understanding and discussing their risk [90, 101, 102, 121, 122]. Risk communication and risk perception are multifaceted and complex [119], yet it is striking that as few as 10% of women have accurate perceptions of personal risk with an otherwise roughly even split between under- and over-estimators [123, 124]. The variable use of absolute and relative risks can have major influences on screening intentions or in some cases be misleading [125].

A growing body of qualitative research has synthesised evidence from focus groups, semi-structured interviews and other methodologies regarding stakeholder views on the implementation and acceptability of risk-stratified screening [90, 126,127,128,129,130,131,132,133,134,135,136]. Particularly key considerations are those pertaining to communication of risk and risk-based pathways across languages and cultural groups [127], socioeconomic groups and those that have lower engagement with preventive healthcare. Generally, perceptions of risk-based screening appear to be favourable, whether based on genetic risk [133, 137] or other factors. Whilst it is felt to be acceptable in principle by many women, the evidence base used to support these approaches needs clear articulation to secure buy-in [126] and reasons for heterogeneous policies for different groups need to be transparent [126]. Importantly, perceptions of risk stratification as a euphemism for service funding reductions may arise [128], and there should be cognisance of anxiety around self-directed risk assessment if used, such as via websites or apps [138] (albeit, at low levels) [134].

Discussion

Genetic, pharmacological and environmental or lifestyle factors affect breast cancer risk, with heterogeneity encountered in terms of individual women’s risks and the risks posed by individual tumours. As one-size-fits-all approaches are increasingly disparaged in breast cancer treatment, detection and prevention under the auspices of ‘personalised healthcare’ [21], the use of age as the single precision factor for guiding early detection strategies in women may be over-reductionist [19,20,21]. Approximately one-quarter of all breast cancers are diagnosed in women ineligible for screening due to age, only around 30% of breast cancers in the United Kingdom are detected by the triennial UK screening programme [139] (43% of all breast cancers in the United States [140]), and current methods may give rise to overdiagnosis [3, 5, 8, 11]. In the United States, 73% of breast cancers in biennial screeners are detected by mammography, and 78% of breast cancers are detected by screening in annual screeners [141]. Approaches that tailor screening intensity and personalise modalities to those that stand most to gain and minimise unnecessary interventions in those with little to gain require robust evaluation.

It is relevant to distinguish between the use of genetic, lifestyle and other factors in tools for long-term risk prediction to guide imaging strategies over a woman’s ‘screening lifetime’ and estimating the risk of an underlying cancer being present at the time of screening. Compared to settings such as lung cancer screening, where smoking history has the predominant effect on decision-making and may approach immediate diagnostic relevance, the long-term prediction of breast cancer relies on appropriate multi-factorial assessment using data points that have ‘weaker’ effects on risk.

Approaches to risk-stratified screening that have been explored include altering screening intensity, screening starting age or the screening technology used; consequently, no single consensus definition has emerged, nor is there agreement on the optimal form that risk-stratified screening could take. One concern is that a paradigm withholding access to screening with a sensitive imaging modality in those deemed ‘low risk’ by a less sensitive predictive model may be inappropriate. As such, there must be judicious consideration of whether a ‘screen-only high-risk’ model is ethically, economically and clinically appropriate in comparison to other models such as the current effective approach, or those that seek to deliver screening that is ‘tailored according to risk’, where risk assessment influences the factors mentioned above without removing women from the screening pool. It is not yet clear which, if any, of the currently available risk models or risk-based strategies deliver on this concept. Ongoing trials and other continued progress in optimising individualised risk estimation and screening strategy should begin to deliver clinically informative answers in the coming years [105,106,107]. Further, individualised risk estimation may be useful not only for stratifying screening strategy to reduce harms but also for reducing the substantial public health burden of breast cancer, potentially through identifying women at previously unknown high risk suitable for prevention therapy (e.g. anastrozole or tamoxifen).

Alongside the elucidation of the optimal degree of nuance for breast screening pathways, it will be essential to explore risk communication strategies, and continually monitor such programmes’ effectiveness both clinically and economically. Another avenue for future work will be further assessment of stated preferences, such as through discrete choice experiments [142, 143] or contingent valuation studies, and the trade-offs between benefits, risks and costs in making such preferences in a stratified programme.

Concerns have been raised that risk-stratified screening poses the danger that many breast cancers may be ‘missed’ by not screening women believed to be at low risk [84, 144]. However, many such studies ignore the inherently multivariable nature of the best-performing currently available risk models [30, 47, 50], compare current age-based screening against a straw man of blunt risk estimation that is not advocated, and are not powered to identify the effects on important outcomes such as the proportion of late-stage cancers detected, or cancer survival. Stratification may lead to changes in imaging strategy, such as increased use of supplemental MRI, which randomised trial evidence suggests reduces rates of interval cancers in women with extremely dense breast tissue [145]. Recent evidence suggests that supplemental abbreviated MRI in women at average risk with dense breasts and negative digital tomosynthesis results increases the prevalent cancer detection rate (up to 27.4 per 1000 women), but the survival benefits are yet to be quantified [146].

Overdiagnosis and overtreatment have been widely acknowledged in prostate cancer screening for years [147,148,149,150], yet a progression towards risk-stratification of screening itself or risk-guided management of detected neoplasms is far more mature than in the breast cancer field [149, 151]. The debate around differing estimates of screening mammography’s benefits and harms has become increasingly polarised over recent decades [4, 18, 152,153,154,155,156,157,158], with consistent disagreement over the interpretation of decades-old trials, the reliability of specific randomised studies, statistical approaches used to interpret them or epidemiological studies [11, 158,159,160]. However, whilst the clearly prevailing consensus is that screening mammography reduces breast cancer mortality, whether this can be further improved is a worthy avenue to explore. Recently, trials have been designed to provide evidence to inform this. Should their results be positive, they must be followed by careful consideration of whether the ‘new normal’ would be acceptable to healthcare systems, policymakers, clinicians and members of the public.