A systematic review and quality assessment of individualised breast cancer risk prediction models

Background Individualised breast cancer risk prediction models may be key for planning risk-based screening approaches. Our aim was to conduct a systematic review and quality assessment of these models addressed to women in the general population. Methods We followed the Cochrane Collaboration methods searching in Medline, EMBASE and The Cochrane Library databases up to February 2018. We included studies reporting a model to estimate the individualised risk of breast cancer in women in the general population. Study quality was assessed by two independent reviewers. Results are narratively summarised. Results We included 24 studies out of the 2976 citations initially retrieved. Twenty studies were based on four models, the Breast Cancer Risk Assessment Tool (BCRAT), the Breast Cancer Surveillance Consortium (BCSC), the Rosner & Colditz model, and the International Breast Cancer Intervention Study (IBIS), whereas four studies addressed other original models. Four of the studies included genetic information. The quality of the studies was moderate with some limitations in the discriminative power and data inputs. A maximum AUROC value of 0.71 was reported in the study conducted in a screening context. Conclusion Individualised risk prediction models are promising tools for implementing risk-based screening policies. However, it is a challenge to recommend any of them since they need further improvement in their quality and discriminatory capacity.

BACKGROUND
Mammography screening has been associated with a reduction in breast cancer mortality and therefore organised breast cancer screening programmes using mammography have been well established worldwide. 1-4 Although there is no single consensus, current screening programmes generally recommend biennial or triennial screening in Europe and annual or biennial screening in the US, with variations in the recommended target age. 2-5 These recommendations usually consider age as the sole risk factor, so women are invited for screening from age 40-50 until age 70-74, depending on the programme.
The likelihood that a woman will benefit from screening mammography depends on her risk of developing clinically significant breast cancer in her lifetime. Taking individual risk factors beyond age into account should enable the classification of women into groups at varying risk of breast cancer. Personalised risk-based screening going beyond the current 'one-size-fits-all' recommendation may increase the effectiveness and benefit-harm balance of breast cancer screening. Individualised risk prediction models for breast cancer are a key element in developing risk-based screening approaches since they are designed to quantify the probability that an individual woman will develop breast cancer in a defined period. 6 A number of risk prediction models that include classical risk factors are commonly used in clinical contexts. 7 However, organised screening programmes do not use these models routinely. One reason for not including these models in a screening context is the high uncertainty with regard to their applicability in screening settings. Also, the emergence of new risk prediction factors such as the expression of single nucleotide polymorphisms (SNPs) needs to be appropriately summarised before recommending any of the models for screening practice.
Like any other source of information, risk prediction models have limitations that should be evaluated before using them. A rigorous risk of bias assessment of the existing individualised risk models is needed to clarify the overall quality and applicability of each model. Therefore, the aim of this systematic review is to update the existing evidence, conduct a critical appraisal and risk of bias assessment and summarise the results of the individualised risk models which are used to estimate the risk of breast cancer in women in the general population.

Data sources and searches
We performed a systematic review of the literature following the standard Cochrane Collaboration methods 8 and adhering to the PRISMA statement reporting recommendations. 9 A predetermined review protocol was registered (CRD42018089842) in the PROSPERO database (date of registration 1 March 2018). The Patient, Intervention, Comparison, Outcomes (PICO) question of this systematic review is the following: Should individualised breast cancer risk prediction models vs. no risk prediction models be used to develop risk-based screening approaches for women in the general population? We retrieved relevant literature by using a combination of controlled vocabulary and keyword search terms in the following databases: (i) Medline (accessed through PubMed); (ii) The Cochrane Library; and (iii) EMBASE (accessed through Ovid). Terms related to breast cancer recurrence were excluded in order to avoid retrieving citations outside the scope of this systematic review. We adapted the search algorithms to the requirements of each database and used validated filters to retrieve systematic reviews and primary studies as needed. We also reviewed the reference lists of included studies for publications that could potentially fulfil our eligibility criteria. The detailed search strategy is reported in Supplementary table 1.
We searched for primary studies of individualised breast cancer risk models in each database from its inception up to February 2018.

Study selection
Eligible studies were those published in English that reported a model to estimate the individualised risk of breast cancer in women in the general population. We included models that assessed more than one risk factor and reported the quantitative characteristics of the risk prediction model. If multiple publications were based on the same individualised risk model, the most extensive report of the model in terms of risk factors reported was chosen. We excluded external validation studies that replicated previous models without adding any new information, such as a new design for collecting the input data or modifications to the risk factors or the risk model method.
Articles identified from the search were loaded into EndNote X7.7.1 for Windows (2008, Version 12.0.4) and duplicates were removed.
Data extraction and quality assessment
One reviewer screened the search results based on title and abstract, and a second reviewer performed a quality check of the study screening by reviewing 20% of the references. Two reviewers independently confirmed eligibility based on the full text of the relevant articles. In case of disagreement between researchers, the inclusion of studies was determined by consensus. We reported the result of this process with a PRISMA flowchart (Fig. 1).
We used a predefined form to extract the following information from included studies: author, publication date, country, study design, the name of the model if available, sample characteristics, sample size, type of breast cancer, the method of analysis, and validation of the model. Data abstraction was conducted by one reviewer and checked by another.
Two reviewers carried out the assessment of the risk of bias independently and the final quality assessment was based on consensus. We used the ISPOR-AMCP-NPC Questionnaire 10 to assess the relevance and credibility of each risk prediction study and the following sources of limitations: (i) internal and external validation; (ii) bias due to the study design for risk estimates; (iii) limitations in data inputs; (iv) appropriateness of the model analysis; (v) reporting bias; (vi) interpretation bias; and (vii) conflict of interest. The risk of bias for each domain was rated as low, high or unclear. For systematic reviews we used the AMSTAR 2 critical appraisal tool. 11

Data synthesis and analysis
We evaluated the model validation by assessing both the discriminative power and the calibration accuracy estimated for women in the general population. When available in the included publication, we extracted the area under the receiver operating characteristic curve (AUROC), the net reclassification index (NRI) and the expected/observed (E/O) ratio. The NRI was not included in the tables because it was only reported in 2 out of 24 articles. The characteristics of the included models and the risk prediction outcomes reported precluded pooling data across studies. Therefore, a narrative synthesis was conducted. Key study characteristics, validation and accuracy of individual risk models, and methodological quality are described in tables and summarised in a narrative manner. Results are presented according to the original model that they reported.
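To make the three validation measures concrete, the following is a minimal Python sketch of how they can be computed from predicted risks and observed outcomes. It is an illustration only and not code from any of the included studies: the function names, the toy data and the single risk threshold used for the NRI are all assumptions, and the NRI shown is one common categorical definition (the included articles may have used other variants).

```python
from typing import Sequence


def auroc(risks: Sequence[float], outcomes: Sequence[int]) -> float:
    """Probability that a randomly chosen woman with disease has a higher
    predicted risk than a randomly chosen woman without (ties count 0.5)."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    non_cases = [r for r, y in zip(risks, outcomes) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))


def expected_observed_ratio(risks: Sequence[float],
                            outcomes: Sequence[int]) -> float:
    """Calibration: expected cases (sum of predicted risks) / observed cases.
    A value of 1 means the model predicts the right number of cases overall."""
    observed = sum(outcomes)
    if observed == 0:
        raise ValueError("no observed cases")
    return sum(risks) / observed


def categorical_nri(old: Sequence[float], new: Sequence[float],
                    outcomes: Sequence[int], threshold: float) -> float:
    """Two-category NRI: net proportion of cases moved up plus net
    proportion of non-cases moved down when switching from old to new."""
    n_cases = sum(outcomes)
    n_non_cases = len(outcomes) - n_cases
    if n_cases == 0 or n_non_cases == 0:
        raise ValueError("need at least one case and one non-case")
    net_cases = 0      # (moved up) - (moved down) among cases
    net_non_cases = 0  # (moved down) - (moved up) among non-cases
    for o, n, y in zip(old, new, outcomes):
        move = int(n >= threshold) - int(o >= threshold)  # -1, 0 or +1
        if y == 1:
            net_cases += move
        else:
            net_non_cases -= move
    return net_cases / n_cases + net_non_cases / n_non_cases


# Hypothetical toy data: predicted risks from two models, 1 = breast cancer.
old_risks = [0.01, 0.02, 0.03, 0.04, 0.05, 0.10]
new_risks = [0.01, 0.02, 0.06, 0.04, 0.03, 0.10]
observed = [0, 0, 1, 0, 0, 1]

print("AUROC:", auroc(old_risks, observed))
print("E/O ratio:", expected_observed_ratio(old_risks, observed))
print("NRI:", categorical_nri(old_risks, new_risks, observed, threshold=0.05))
```

The pairwise AUROC is quadratic in the sample size, which is acceptable for a sketch but not for large cohorts, where a rank-based computation would be preferable.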

Study inclusion
The database searches for primary studies retrieved 2974 citations, of which 79 were considered potentially relevant. These 79 studies were screened in full text. We found a systematic review by Anothaisintawee et al., 7 which we used as a source of primary studies. In addition, two studies were included after a manual inspection of the papers' references. 12,13 After the full text was checked, 24 studies 12-35 met the inclusion criteria and were considered in the evidence synthesis. Details about study inclusion with reasons for exclusion are described in the flowchart (Fig. 1), and a list of references to excluded studies is provided in Supplementary table 2.
14,20,23 Similarly, the AUROC values reported by Boyle et al. 16 and Matsuno et al. 25 were 0.60 and 0.61, although these authors added BMI, HRT, alcohol, physical activity and diet, and ethnicity into the model. Zhang et al. 13 28 Finally, the addition of a polygenic risk score, mammographic density and endogenous hormones by Zhang et al. 13 reached an AUROC value of 0.68 (Table 1) and improved the discriminative accuracy, as also reflected in an NRI of 9.5%.
d. IBIS model. The IBIS model original paper 33 (2) 15 was the only one that reported calibration accuracy and presented the E/O ratio closest to one of all the studies included in this review, taking values of 1.00 and 1.01 for pre- and postmenopausal status respectively (Table 1).
Quality assessment
The quality of the included studies was moderate due to some limitations in the discriminative power, study design, and data inputs. The studies did not show important limitations with regards to the validation, appropriateness of the model analysis, reporting or interpretation of the results (Fig. 3). A summary of the risk of bias assessment per each source of limitation is presented here and the detailed appraisal and judgements in Supplementary
37 Nine studies used the expected/observed event ratio to measure the calibration accuracy of the model. 14-16,20,23-25,29,31

Bias due to the study design
Thirteen studies used a case-control design to obtain breast cancer risk estimates, 12-14,16,17,20-23,25,26,29,34 five studies used prospective cohorts, 15,18,19,27,28 and four models used retrospective cohorts. 24,30-32 The study of Wang et al. 35 and the study of Tyrer et al. 33 used risk estimates obtained from a systematic review of the literature.
Agreements and disagreements with other reviews
In this systematic review, we found that the number of individualised breast cancer risk prediction models has increased steadily over the past three decades. This finding is in agreement with the narrative overview published by Cintolo-Gonzalez et al. in 2017, 38 and it updates the results of a previous systematic review published by Anothaisintawee et al. in 2012. 7 In contrast to these reviews, however, our aim was to provide innovative information regarding the quality of the identified prediction models. Thus, we have identified and rigorously analysed the strengths and limitations of 24 individualised models in order to adjust our conclusions to the quality of the evidence. We have identified two new trends in the use and development of the models: the increased use of the BCSC model and the inclusion of common genetic variation in the prediction models. First, whereas the BCRAT and Rosner & Colditz models were the most frequently cited models up to 2010, 7 the BCSC model has attracted the attention of several authors during the last five years, although its discriminatory accuracy has not dramatically improved. Second, none of the models in the review of Anothaisintawee et al. 7 included genetic information as a risk factor. By contrast, we have identified four models including genetic information: the IBIS model 33 that includes genetic phenotype in its updated version, the BCSC model that includes a polygenic score in both the 2015 12 and 2016 29 publications, as well as the article by Zhang et al. that added a polygenic risk score to both the BCRAT and the Rosner & Colditz models. 13
Most of the included studies reported the AUROC to determine the probability that a randomly chosen woman with disease would be correctly categorised as higher risk compared to a randomly chosen woman without disease. The discriminatory accuracy estimate does not express whether the model is more or less accurate in predicting the risk of specific individuals but measures the capacity of the model to determine which women are at higher or lower risk of developing breast cancer. Thus, both calibration accuracy and discriminatory accuracy should be assessed. Contrary to what would be expected, we found that authors reported the E/O ratio in fewer than half of the included studies. In addition to the AUROC value, the studies of Zhang et al.
Overall, the information provided by the AUROC and the E/O ratio was consistent, suggesting that the included models have moderate discriminatory accuracy and calibration accuracy when applied to women in the general population. Nevertheless, it must be taken into account that, despite the great importance of validation in terms of AUROC and E/O ratio, low AUROC values or E/O ratios clearly different from 1 do not mean that these models are useless. On the contrary, models can be clinically useful even with a moderate AUROC since they can reclassify individuals at the extremes of risk. 39 Thus, the verdict on risk models should not be based solely on these estimators. Instead, they need to be prospectively evaluated in clinical trials. In fact, there are currently two very large randomised trials assessing risk-based screening strategies, both of which use individualised models. Both the IBIS and the BCSC models are being tested in the European trial MyPeBS (My Personalised Breast Screening). 40 Also, the BCSC model is being tested in the US WISDOM trial (Women Informed to Screen Depending On Measures of risk). 41

Applicability and completeness of evidence
The distribution of risk factors in such different populations may affect the applicability of the models to different contexts. The fact that different subtypes of breast cancer may have different genetic markers is widely accepted. 42 These differences, the nature of breast cancer itself and its low incidence may condition a low discriminatory accuracy of a model. In other words, in the general population, there is a low probability of having breast cancer (even in the highest risk group). This low probability may mean that the discriminatory power of a breast cancer risk model will not be as high as that of a risk model targeted at other, more common diseases such as cardiovascular events. Another potential limitation in the applicability in the screening context is the completeness and the number of included risk factors, which ranged from five to 18. Nevertheless, some potentially relevant risk factors such as genetic markers have been included in only a few models. Recent studies 43,44 have shown that adding genetic information as a risk factor can increase the discriminative accuracy of the different models, which opens the way for further evaluation that should first assess the calibration of these models in prospective cohort studies.
Overall, women are usually screened using mammography. Particularly in Europe, most programmes invite women for screening every 2 years. 2 The presence of some mammographic features in these screening mammograms may be related to the risk of developing breast cancer, as has been recently pointed out by some authors. 21,45 Only one of the 24 models identified in this systematic review included microcalcifications and masses found at mammography as risk factors in the model. 21 Time-changing variables such as radiological variables may not be as stable as personal history. However, in a screening context, this information is especially relevant because it is easily available from previous screening examinations.
Quality of the evidence
We found variability in the design of the studies that were used to obtain the cancer risk estimates. Notably, the study design used in the BCSC model was a cohort, which is a robust epidemiological design that allows developing and validating prediction models. Another frequently used design was the case-control study, nested or not. In contrast to the cohort study, time-changing variables may not be well captured in case-control studies.
Regarding external validation, the included studies showed some limitations, given that few of them further evaluated the models in different contexts. Numerous scientific publications, however, report external model validation in different settings and countries. These studies may help to understand the performance of a model in a specific context, but this issue was outside the scope of our review and, therefore, we have not included external validation studies. As an example of the relevance of these studies, more than 50 articles report external validation of the BCRAT model in different countries. 46 The Rosner-Colditz model has also been validated in several studies, one of the most complete validations being the one performed in 2013 by the authors themselves. 37 On the other hand, we found that although the Eriksson et al. 19 model reports the highest AUC (0.71), this model has not been externally validated, which increases the uncertainty about its applicability.
Also, there were limitations in data inputs, mostly due to the fact that in several models the information was provided by self-reported questionnaires, which may affect the accuracy of the results. Finally, there is a limitation when comparing the AUROC or E/O ratio across the models given that there is great heterogeneity amongst them. The models were targeted at different populations, included different sets of risk factors, and often used different methodologies. We have taken into account all these variations and presented the results by model categories.

Potential biases in the review process
This systematic review was limited to studies published in English and did not involve an active search for grey literature, which is literature that is not formally published in sources such as books or journal articles. Therefore, some models may not have been identified. However, since we have conducted a comprehensive literature search in Medline, EMBASE and The Cochrane Library, we estimate that the loss of information due to the study selection criteria is low. Some key genetically oriented models, such as BOADICEA 47 and BRCAPRO, 48 were not included in this review because they are aimed at high-risk women and are not useful for women in the general population in the screening context. Full-text screening and data abstraction were performed by two researchers, which increases the quality of the review process. Moreover, as far as we know, this is the first review assessing the risk of bias of the identified risk prediction models.

CONCLUSIONS
The development of individualised breast cancer risk prediction models has increased over the last three decades, but the improvements in both the discriminatory power and calibration accuracy are still limited. Despite the time that has passed since the first model was published and the large number of available publications, only one model addressed to women attending a population-based screening programme 21 was identified. Currently, it is still a challenge to recommend any of the models as the standard for predicting individual risk in a screening context. However, the models have been updated by adding new variables, such as common genetic variation or radiological variables, and have shown improvements in their quality as well as in their discriminative accuracy. These new variables need further evaluation to confirm their promising impact on the prediction capacity and to propose personalised strategies for breast cancer screening.