The added value of genetic information in colorectal cancer risk prediction models: development and evaluation in the UK Biobank prospective cohort study

Colorectal cancer (CRC) risk prediction models could be used to risk-stratify the population and so provide individually tailored screening. Using participants from the UK Biobank prospective cohort study, we evaluated whether adding a genetic risk score (GRS) could improve the performance of two previously validated models. Inclusion of the GRS did not appreciably improve the discrimination of either model, and led to substantial miscalibration. Following recalibration, discrimination did not change, but good calibration was recovered for the models incorporating the GRS. Comparing predictions between models with and without the GRS, the absolute risk of 5% of participants or fewer changed by ±0.3% or more in either model. In summary, addition of a GRS did not meaningfully improve the performance of validated CRC risk prediction models. At present, provision of genetic information is not useful for risk stratification for CRC.

Todd Smith, Marc J Gunter, Ioanna Tzoulaki, David C Muller

Contents: Table 2; Supplementary Tables 3 to 8; Supplementary Methods

Supplementary Methods

UK Biobank Study Sample
In brief, over 500,000 participants were recruited from National Health Service registers and assessed at 22 centres throughout the UK between 2006 and 2010. They are followed up for cancer incidence and death via linkage to population registries. Recruitment and data collection have been described in detail previously [1][2][3]. Participants were followed up from the date of baseline attendance until the date of diagnosis of an invasive primary cancer (excluding non-melanoma skin cancer), the date of death, or January 1, 2015, whichever occurred first. CRC was defined by International Statistical Classification of Diseases and Related Health Problems 10th Revision codes C18 (except C18.1, appendix), C19 and C20. For the purposes of this study, we included only participants who identified as being of white ethnicity, since published GWAS have been overwhelmingly conducted in participants of European descent.
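The outcome definition and end-of-follow-up rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the handling of dotted versus undotted ICD-10 strings are assumptions made for the example.

```python
from datetime import date
from typing import Optional

# Administrative end of follow-up, as stated in the text.
ADMIN_CENSOR = date(2015, 1, 1)


def is_crc(icd10: str) -> bool:
    """Return True for C18 (except C18.1, appendix), C19 and C20.

    Accepts codes written with or without a dot (e.g. "C18.7" or "C187");
    this normalisation is an assumption of the sketch.
    """
    code = icd10.strip().upper().replace(".", "")
    if code.startswith("C181"):
        return False
    return code[:3] in {"C18", "C19", "C20"}


def follow_up_end(diagnosis: Optional[date], death: Optional[date]) -> date:
    """Earliest of first invasive cancer diagnosis, death, or 1 January 2015."""
    candidates = [d for d in (diagnosis, death, ADMIN_CENSOR) if d is not None]
    return min(candidates)
```

A participant with no diagnosis and no recorded death is censored on the administrative date; any earlier event ends follow-up sooner.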

Genotyping and Imputation
Details of genotyping and imputation within the UK Biobank have been published previously [4]. In summary, participants were genotyped using the UK BiLEVE Axiom™ and UK Biobank Axiom™ arrays. Imputation to ~92 million markers was subsequently carried out using the Haplotype Reference Consortium reference panel.

Estimation of absolute risk
In both the Taylor

Years of Education
Variables provided 1) the qualifications held by participants and 2) the age at which participants completed continuous full-time education (omitted for those with a College or University degree). To calculate the number of years of education for all participants, those with a degree were assumed to have completed full-time education at age 21. A school starting age of 5 was then deducted to obtain the total number of years of education. Those who reported not going to school, or for whom the calculated number of years was negative, were set to 0. A maximum of 20 years of education was used, in line with the online version of the model (footnote a); any values above this were set to 20.

Variables detailed alcohol intake frequency and, based on this, a weekly or monthly measure of beer/cider, champagne/white wine, red wine, fortified wine and spirits intake. A maximum of 12 drinks per day was used, in line with the online model (footnote a); any values above this were set to 12. Variables used: Alcohol intake frequency (1558). Average weekly: beer plus cider intake (1588), champagne plus white wine intake (1578), fortified wine intake (1608), red wine intake (1568) and spirits intake (1598). Average monthly: beer plus cider intake (4429), champagne plus white wine intake (4418), fortified wine intake (4451), red wine intake (4407) and spirits intake (4440).
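The years-of-education rule described above can be written as a short function. This is a sketch under stated assumptions: the argument names are hypothetical, and returning `None` for a missing leaving age is a choice made for the example rather than something the text specifies.

```python
from typing import Optional

SCHOOL_START_AGE = 5    # deducted from the age at leaving education
DEGREE_FINISH_AGE = 21  # assumed leaving age for degree holders
MAX_YEARS = 20          # cap used by the online model


def years_of_education(
    has_degree: bool,
    age_left_education: Optional[int],
    never_attended_school: bool = False,
) -> Optional[int]:
    """Years of education, floored at 0 and capped at 20.

    Returns None when the leaving age is missing for a participant
    without a degree (an assumption of this sketch).
    """
    if never_attended_school:
        return 0
    if has_degree:
        leaving_age = DEGREE_FINISH_AGE
    elif age_left_education is None:
        return None
    else:
        leaving_age = age_left_education
    years = leaving_age - SCHOOL_START_AGE
    return max(0, min(years, MAX_YEARS))
```

For example, a degree holder is assigned 21 - 5 = 16 years, and a participant who left school at 16 is assigned 11 years.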

Variable construction overview (continued)

Counts shown after each variable are the number of participants ineligible due to incomplete or missing data, or because they were outside the modelled age range (initial sample 433,899).

Hours of moderate and strenuous activity per day (male only)
The model's description was "hours of moderate physical activity per day", whereas the description of the data in the derivation cohort was "hours of moderate or strenuous activity per day", so the latter was used. Variables detailed the number of days per typical week on which moderate and vigorous activity was undertaken (for ten minutes or more), as well as the number of minutes per day for which it was undertaken on a typical day (if the answer to the preceding question was 1 or greater). Any duration of less than 10 minutes was set to 0 and the number of hours per week calculated, following the International Physical Activity Questionnaire guidelines (footnote b). A maximum value of 4 hours per day was used, in line with the online model (footnote a); any values above this were set to 4. Variables used: Number of days/week of moderate physical activity 10+ minutes (884), Duration of moderate activity (894), Number of days/week of vigorous physical activity 10+ minutes (904), Duration of vigorous activity (914).
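The physical activity variable above can be sketched as follows. The text gives weekly hours and a per-day cap, so the division by 7 to reach hours per day is an assumption of this example, as are the function and argument names.

```python
MAX_HOURS_PER_DAY = 4.0  # cap used by the online model
MIN_BOUT_MINUTES = 10    # shorter durations are set to 0


def _weekly_minutes(days_per_week: int, minutes_per_day: float) -> float:
    """Weekly minutes for one intensity; bouts under 10 minutes count as 0."""
    if days_per_week < 1 or minutes_per_day < MIN_BOUT_MINUTES:
        return 0.0
    return days_per_week * minutes_per_day


def activity_hours_per_day(
    moderate_days: int,
    moderate_minutes: float,
    vigorous_days: int,
    vigorous_minutes: float,
) -> float:
    """Moderate plus vigorous activity in hours per day, capped at 4."""
    weekly_hours = (
        _weekly_minutes(moderate_days, moderate_minutes)
        + _weekly_minutes(vigorous_days, vigorous_minutes)
    ) / 60.0
    # Assumption: average over a 7-day week to obtain hours per day.
    return min(weekly_hours / 7.0, MAX_HOURS_PER_DAY)
```

Sixty minutes of moderate activity on every day of the week gives one hour per day; implausibly large totals are truncated at four hours.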
Number of ineligible participants (physical activity): 12,320

Regular use of aspirin/NSAIDs
For men the variable was "aspirin", while for women it was "NSAIDs". Variables detailed the regular use of aspirin and/or ibuprofen. The aspirin response was used for men, and both were combined to represent NSAIDs for women. Variable used: Medication for pain relief, constipation, heartburn (6154).

Number of ineligible participants (aspirin/NSAIDs): 4,776

Ounces of red meat intake per day (male only)
Variables provided details of beef, lamb/mutton, pork and processed meat intake per week (never, <1/week, 1/week, 2-4/week, 5-6/week, ≥1 daily). These were converted to absolute values of 0, 0.5, 1, 3, 5.5 and 7 intakes per week respectively, then summed and converted to ounces per day (based on 4 ounces per intake). A maximum value of 5 ounces per day was used, in line with the online model (footnote a); any values above this were set to 5. Variables used: Beef intake (1369), Lamb/mutton intake (1379), Pork intake (1389), Processed meat intake (1349).
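The red meat conversion above is simple arithmetic and can be checked with a short sketch. The category labels used as dictionary keys are hypothetical strings chosen for readability (the questionnaire stores coded values), and the function name is likewise an assumption.

```python
# Intakes per week assigned to each frequency category, as given in the text.
INTAKES_PER_WEEK = {
    "never": 0.0,
    "<1/week": 0.5,
    "1/week": 1.0,
    "2-4/week": 3.0,
    "5-6/week": 5.5,
    ">=1 daily": 7.0,
}
OUNCES_PER_INTAKE = 4.0
MAX_OUNCES_PER_DAY = 5.0  # cap used by the online model


def red_meat_ounces_per_day(beef: str, lamb: str, pork: str, processed: str) -> float:
    """Sum weekly intakes across the four meats and convert to ounces per day."""
    weekly_intakes = sum(INTAKES_PER_WEEK[c] for c in (beef, lamb, pork, processed))
    ounces_per_day = weekly_intakes * OUNCES_PER_INTAKE / 7.0
    return min(ounces_per_day, MAX_OUNCES_PER_DAY)
```

For instance, beef 2-4 times per week and nothing else gives 3 × 4 / 7 ≈ 1.71 ounces per day, while daily intake of all four meats (16 ounces per day before capping) is set to 5.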

Number of ineligible participants (red meat): 2,270

Pack years of smoking

The mean-centred log GRS ranged from -2.022 to 2.411, with a standard deviation of 0.495. Family history of bowel cancer ranged from 0 to 3 first-degree relatives. The regression coefficient was determined in those eligible for inclusion in the Taylor et al. [5] analysis (n = 361,543), within which Wells et al. [6] was nested.

Abbreviations: GRS, genetic risk score.
a The results of the individual Wells et al. [6] male and female models were merged and the discrimination estimated.
b The discrimination of the hazard ratio used by the model is assessed independently of the model's construction. To remove the effect of age, the age coefficients were omitted from the Wells model.
c The GRS was used directly as the scoring rule.
d The log GRS was combined with the predicted log hazard ratio from the original models.
e The predicted log hazard ratio from the original model was fitted as a covariate in a flexible parametric survival model in order to better recalibrate the predicted probabilities.
f The predicted log hazard ratio from the original model and the log GRS were fitted as covariates in a flexible parametric survival model in order to better recalibrate the predicted probabilities.
The UK Biobank participants were asked about a family history of "bowel cancer" in parents and siblings; this was taken to be synonymous with colorectal cancer. The number of cases is within a 5-year time horizon from recruitment.
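Note d describes combining the log GRS with the predicted log hazard ratio from the original models. One plausible reading, shown here only as a sketch, is an additive combination on the log hazard scale using the mean-centred log GRS and a regression coefficient. The additive form, the function name and all argument names are assumptions; the text does not give the coefficient value, so it is left as a parameter.

```python
import math


def combined_log_hazard_ratio(
    original_log_hr: float,
    grs: float,
    mean_log_grs: float,
    beta_grs: float,
) -> float:
    """Add a weighted, mean-centred log GRS to the original log hazard ratio.

    beta_grs is a hypothetical placeholder for the regression coefficient
    estimated in the eligible cohort; it is not a value from the text.
    """
    centred_log_grs = math.log(grs) - mean_log_grs
    return original_log_hr + beta_grs * centred_log_grs
```

Under this reading, a participant whose log GRS equals the cohort mean keeps the original predicted log hazard ratio unchanged.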
