Risk Model for Colorectal Cancer in Spanish Population Using Environmental and Genetic Factors: Results from the MCC-Spain study

Colorectal cancer (CRC) screening of the average-risk population is indicated solely on the basis of age. We aimed to develop a model to stratify the risk of CRC by incorporating environmental data and single nucleotide polymorphisms (SNP). The MCC-Spain case-control study included 1336 CRC cases and 2744 controls. Subjects were interviewed on lifestyle factors, family and medical history. Twenty-one CRC susceptibility SNPs were genotyped. The environmental risk model, which included alcohol consumption, obesity, physical activity, red meat and vegetable consumption, and nonsteroidal anti-inflammatory drug use, contributed to CRC with an average per factor OR of 1.36 (95% CI 1.27 to 1.45). Family history of CRC contributed an OR of 2.25 (95% CI 1.87 to 2.72), and each additional risk allele contributed an OR of 1.07 (95% CI 1.04 to 1.10). The risk of subjects with more than 25 risk alleles (5th quintile) was 82% higher (OR 1.82, 95% CI 1.11 to 2.98) than that of subjects with fewer than 19 alleles (1st quintile). This risk model, with an area under the ROC curve (AUROC) of 0.63 (95% CI 0.60 to 0.66), could be useful to stratify individuals. Environmental factors had more weight than the genetic score, which should be considered to encourage patients to achieve a healthier lifestyle.


Data collection.
A structured computerized epidemiological questionnaire was administered by trained personnel in a face-to-face interview. Also, subjects filled in a semi-quantitative Food Frequency Questionnaire (FFQ), and blood samples and anthropometric data were obtained following the study protocol.
Only variables clearly related to CRC were considered for the development of risk models. The variables considered were: family history of CRC (none versus first-, second-, or third-degree relatives); cigarette smoking, grouped into non-smokers and smokers (former and current); average alcohol consumption between 30 and 40 years of age (in standard units of alcohol, SUA), categorized into low-risk and high-risk consumption (> 4 SUA/day in men and > 2 SUA/day in women) 20; BMI (calculated with the weight reported at 45 years of age), categorized according to World Health Organization criteria as underweight, normal weight, or overweight (< 30 kg/m²) versus obese (≥ 30 kg/m²); average physical exercise, measured from self-reported leisure-time activity performed in the past 10 years and used to estimate Metabolic Equivalent of Task (MET) hours per week with Ainsworth's compendium of physical activities 21, categorized as no leisure-time physical activity (0 MET) versus any leisure-time physical activity (> 0 MET); red meat consumption, including meat from mammals (cattle, oxen, veal, beef, pork, etc.), game birds (duck, pheasant, etc.), organ meats (liver, brains, etc.), cured meat (ham, bacon, etc.), and processed meat (hot dogs, sausages, meatballs, etc.), with high intake defined as ≥ 65 g/day; and vegetable consumption, classified as low or high intake using 200 g/day as the cut-off.
All drugs taken by the patients were recorded, but only nonsteroidal anti-inflammatory drugs (NSAIDs; cyclooxygenase 1 and 2 inhibitors) and acetylsalicylic acid (ASA) were taken into account for this study. Patients were considered users of NSAIDs/ASA if they took them at least once a day for at least 1 year.
The location of the CRC was defined according to its anatomic distribution: proximal colon (colon above the level of the splenic flexure, or including it), distal (descending colon and sigmoid colon), and rectum.
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee, and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The protocol of MCC-Spain was approved by each of the ethics committees of the participating institutions. The specific study reported here was approved by the Bellvitge Hospital Ethics Committee with reference PR 149/08. Informed consent was obtained from all individual participants included in the study.
Genotyping. The Infinium Human Exome BeadChip (Illumina, San Diego, USA) was used to genotype > 200,000 coding markers plus 5,000 additional custom SNPs selected from previous genome-wide association studies (GWAS) or genes of interest. The genotyping array included 25 SNPs previously identified as susceptibility variants for CRC in GWAS 22. Ten SNPs were in the commercial array; we added 15 more to the custom content that had been identified at the time the array was designed (July 2012). For regions where multiple SNPs had been reported, we included only the most statistically significant SNP for each locus when linkage disequilibrium was > 0.5. As a result, we included a total of 21 SNPs in the final analysis, detailed in Table 1.
Scientific Reports | 7:43263 | DOI: 10.1038/srep43263
Statistical analysis. Multivariate logistic regression models were used to build risk models. All models were adjusted for a propensity score 23 to reduce bias related to differences in case and control selection frequencies and to account for the frequency-matched design of the study. The propensity score was constructed as the individual prediction (on the logit scale) of a logistic regression in which case/control status was modelled with age, sex, level of education, recruiting centre, and the first 3 principal components of genetic ancestry obtained from genotyping data. The interactions between age and sex and between centre and sex were also included in the model. The propensity score was added as a continuous variable to adjust the risk models. Since age and sex were used as stratification factors for frequency matching in the selection of controls, these variables could not contribute to the risk model.
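The two-stage adjustment described above can be written compactly. The notation below is ours and is only a sketch of the structure described in the text: z collects the matching and selection covariates (age, sex, education, centre, the first 3 ancestry components, and the age-by-sex and centre-by-sex interactions), x collects the risk factors of interest, and PS is the fitted linear predictor of the first model.

```latex
\begin{align}
\operatorname{logit} P(\text{case} \mid \mathbf{z}) &= \alpha_0 + \boldsymbol{\alpha}^{\top}\mathbf{z},
  & \mathrm{PS} &= \hat{\alpha}_0 + \hat{\boldsymbol{\alpha}}^{\top}\mathbf{z} \\
\operatorname{logit} P(\text{case} \mid \mathrm{PS}, \mathbf{x}) &= \beta_0 + \gamma\,\mathrm{PS} + \boldsymbol{\beta}^{\top}\mathbf{x},
  & \mathrm{OR}_j &= \exp(\beta_j)
\end{align}
```

Under this reading, each reported adjusted OR is the exponentiated coefficient of a risk factor in the second model, with the propensity score entering as a single continuous covariate.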
An environmental risk score (ERS) was built including all the significant covariates that can be modified (alcohol use, BMI, physical exercise, red meat and vegetable intake, and NSAIDs/ASA use). Family history was not considered in this environmental score since it is not modifiable, and its effect was assessed as a separate factor. Missing values in variables were imputed using the expected value derived from a model built with complete cases. For categorical variables, the most frequent value was imputed.
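As an illustration, the ERS can be sketched as a count of risk categories using the cut-offs given under Data collection. The `Subject` class and its field names are hypothetical, and treating an intake of exactly 200 g/day of vegetables as high intake is an assumption, since the text gives only the cut-off value.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    """Hypothetical record holding the six modifiable factors of the ERS."""
    sex: str                 # "M" or "F"
    alcohol_sua_day: float   # average standard units of alcohol per day, ages 30-40
    bmi: float               # kg/m^2, from weight reported at age 45
    met_h_week: float        # leisure-time physical activity
    red_meat_g_day: float
    vegetables_g_day: float
    nsaid_asa_user: bool     # at least once a day for at least 1 year


def environmental_risk_score(s: Subject) -> int:
    """Count how many of the six modifiable factors are in the risk category."""
    alcohol_limit = 4 if s.sex == "M" else 2      # > 4 SUA/day men, > 2 SUA/day women
    risk_flags = [
        s.alcohol_sua_day > alcohol_limit,        # high-risk alcohol consumption
        s.bmi >= 30,                              # obesity
        s.met_h_week == 0,                        # no leisure-time physical activity
        s.red_meat_g_day >= 65,                   # high red meat intake
        s.vegetables_g_day < 200,                 # low vegetable intake (assumed strict)
        not s.nsaid_asa_user,                     # non-use of NSAIDs/ASA
    ]
    return sum(risk_flags)


high = Subject("M", 5, 31, 0, 80, 150, False)
low = Subject("F", 1, 24, 10, 40, 300, True)
print(environmental_risk_score(high), environmental_risk_score(low))
```

The score ranges from 0 to 6; family history is deliberately left out, as in the text.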
To assess genetic susceptibility, an additive genetic risk score (GRS) was constructed. Each SNP was coded as 0, 1, or 2 copies of the risk allele, except for the SNP rs5934683 on chromosome X, which was coded 0, 0.5, and 1. We defined the GRS as the count of risk alleles across all 21 SNPs, ranging from 12 to 33. Since the published effects of each SNP were similar, an unweighted GRS was preferred. We also explored the models using weights derived from the GWAS publications and from models fitted to our data, but the predictive accuracy was very similar.
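A minimal sketch of the unweighted GRS follows. The function name and the dictionary input are hypothetical; the input is assumed to hold already-coded values, that is 0, 1, or 2 for autosomal SNPs and 0, 0.5, or 1 for rs5934683.

```python
X_SNP = "rs5934683"  # chromosome X SNP, coded 0, 0.5, 1 in the text


def genetic_risk_score(genotypes: dict) -> float:
    """Sum the coded risk-allele values across SNPs (unweighted GRS)."""
    total = 0.0
    for snp, value in genotypes.items():
        allowed = (0, 0.5, 1) if snp == X_SNP else (0, 1, 2)
        if value not in allowed:
            raise ValueError(f"unexpected coding {value!r} for {snp}")
        total += value
    return total


# Three of the 21 SNPs, for illustration only
example = {"rs6983267": 2, "rs4939827": 1, X_SNP: 0.5}
print(genetic_risk_score(example))  # 3.5
```

A weighted variant would multiply each value by a per-SNP weight (for example a log OR) before summing; the text reports that this made little difference to predictive accuracy.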
The predictive accuracy of models was assessed with the area under the ROC curve (AUROC), adjusted for the propensity score. Data were split into quintiles of propensity scores, and the weighted mean of the AUROC for each quintile model was calculated. Weights were proportional to the number of cases in each quintile. To account for potential overfitting that could overestimate the effect of GRS, especially for more complex models using weights, 5-fold cross validation was used to estimate the AUROC. In addition, the 95% CIs were calculated using bootstrapping techniques on top of the cross-validated estimates.
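The stratified AUROC described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the AUROC is computed with the pairwise (Mann-Whitney) definition, the strata are taken as given, and the cross-validation and bootstrap steps are omitted.

```python
def auroc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control.

    Ties count as one half (Mann-Whitney definition of the AUROC).
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def stratified_auroc(strata):
    """Weighted mean of per-stratum AUROCs, weights proportional to cases.

    strata: list of (case_scores, control_scores), one pair per
    propensity-score quintile.
    """
    total_cases = sum(len(cases) for cases, _ in strata)
    return sum(
        len(cases) / total_cases * auroc(cases, controls)
        for cases, controls in strata
    )


# Two toy strata: AUROC 1.0 (2 cases) and 0.25 (1 case)
toy = [([3, 4], [1, 2]), ([2], [2, 3])]
print(stratified_auroc(toy))  # 2/3 * 1.0 + 1/3 * 0.25 = 0.75
```

Computing the AUROC within strata means that cases are only compared with controls of similar propensity score, so the matching variables do not inflate the estimate.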
To estimate the potential public health impact of the ERS and GRS, we applied the estimated odds ratios (OR) to population average CRC incidence estimations published by the International Agency for Research on Cancer (IARC). Data were extracted from the publication Cancer Incidence in Five Continents (CI5) Volume X, for the Spanish cancer registries 24. Average age- and gender-specific cumulative risks for the Spanish population were projected according to combinations of ERS and GRS to define risk strata. For these estimates, the published cumulative risks were multiplied by the ORs estimated from our risk models. We used the average number of risk factors and risk alleles in the population as reference categories for these calculations. Also, the sensitivity and specificity values for a selection of risk scores were combined with the cumulative risk of developing CRC for age decades from 40 to 80 years in order to estimate the positive and negative predictive values. Bayes' theorem was used for these calculations. Statistical analysis was carried out using R statistical software (R Foundation for Statistical Computing, Vienna, Austria).

Results
Case and control characteristics are detailed in Table 2. Variables were coded with the lower CRC risk category as reference to simplify the comparison of effects and the calculation of risk scores. All the environmental variables considered for the risk model were significantly associated with CRC after adjusting for the propensity score. The crude ORs were very similar for the categorizations selected, ranging from 1.29 (BMI ≥ 30 kg/m²) to 1.57 (NSAID/ASA). The multivariate model with all environmental factors showed that all contributed independently to CRC risk (Table 3). Tobacco was not included in the model since smoking was no longer significant when other factors were considered (adjusted OR 1.06, 95% CI 0.91 to 1.23). The ERS, calculated as the count of risk factors, indicated that on average the adjusted OR per risk factor was 1.36 (95% CI 1.27 to 1.45). Figure 1 shows the distribution of the ERS for cases and controls, and the estimated risk of CRC according to the number of risk factors, compared to an average individual (ERS = 3).
Family history of CRC was strongly associated with CRC (adjusted OR 2.27, 95% CI 1.88 to 2.74). We combined first, second, or third-degree relatives with CRC in the risk group, since the ORs were very similar. This variable was independent of the environmental risk factors.
Out of 21 GWAS SNPs analysed, only 5 were statistically significant in our data (rs10752881, rs6983267, rs9929218, rs4939827, rs961253; Table 1). The contribution to risk of each SNP in the MCC-Spain study was small. As shown in Fig. 2, the increase in risk per allele was linear, indicating the independent additive contribution of each allele to the GRS. The risk of CRC doubled for a difference of 10 risk alleles (OR 1.96, 95% CI 1.54 to 2.50). The GRS was independent of environmental variables. Also, no significant interactions were observed between the GRS and age, sex, or any of the environmental variables included in the multivariate model.
Regarding tumour location, 30.7% (n = 405) of tumours were located in the rectum, 40.2% (n = 531) in the distal colon, and 29.1% (n = 385) in the proximal colon (15 subjects had missing data). The analysis stratified by cancer location did not show relevant differences (Supplementary Table 1). In general, both environmental and genetic factors had greater effects in rectal than in colon cancer. High intake of red meat was the factor that differed most between colon and rectal cancers.
Predictive accuracy of the risk model. The contribution to CRC risk prediction was estimated for modifiable environmental risk factors, family history, and the GRS. Figure 3 shows the individual (red discontinuous line) and cumulative (black continuous line) contribution of each environmental factor to the risk. The cumulative contribution of the seven environmental factors resulted in a cross-validated AUROC of 0.60 (95% CI 0.57 to 0.61). Family history, which is not modifiable but can be obtained by interview, increased the AUROC to 0.61 (95% CI 0.59 to 0.64). The GRS, on its own, had an AUROC of 0.56 (95% CI 0.54 to 0.58). Adding the GRS to the model with the ERS and family history (FH) increased the AUROC by 0.02, to an overall AUROC of 0.63 (95% CI 0.60 to 0.66).
This 5-fold cross-validated AUROC was smaller than the direct estimate of the model, which was 0.65, indicating some optimism in the estimate even when an unweighted GRS was used. When we explored weighted models for the GRS, the 5-fold cross-validated AUROC was 0.62 (95% CI 0.60 to 0.65) for weights derived from published GWAS and 0.63 (95% CI 0.61 to 0.66) for weights derived from the fitted logistic regression model.
As lifetime cumulative risk is a better individual measure of the impact of cancer burden, we calculated individual risk by applying the estimated risk score (RS) to the age- and sex-specific cumulative risk of CRC in the Spanish population. For this calculation, Spanish cancer incidence data obtained from cancer registries were used. Figure 4 shows how cumulative incidence curves are shifted according to the risk score. Supplementary Figures 2 and 3 show the same analyses restricted to the ERS and GRS, respectively. As is already known, men have a higher incidence of CRC than women, and incidence grows exponentially from 50 years of age for both sexes.
From Fig. 4 (numbers are shown in Supplementary Table 2) we can estimate that a Spanish man with an average risk score (RS = 1, 22 risk alleles) has a lifetime cumulative risk of CRC of approximately 10% (5% in the case of a woman). In contrast, the lifetime cumulative risk would increase to 20% and 10% for men and women, respectively, among subjects with a risk score of 2 (29 risk alleles). A hypothetical individual with an RS of 2 reaches a given level of risk at a younger age than an individual with the average number of risk alleles (GRS = 22): at age 45, a man with RS = 2 would be at the same risk as a man with RS = 1 at age 50. At older ages, since the effect is multiplicative, the relative risk anticipation is greater. The cumulative risk of CRC during the screening age period (50-69), in this scenario, would double: from 3% to 6% among men and from 2% to 4% among women.
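The projection used here is a simple rescaling: the registry-based cumulative risk is multiplied by the relative risk score. The sketch below reproduces the figures quoted above; the function name is hypothetical, and capping the result at 1 is our addition to keep the value a valid probability.

```python
def projected_cumulative_risk(baseline_risk: float, risk_score: float) -> float:
    """Scale a population-average cumulative risk by a relative risk score.

    baseline_risk: cumulative risk for an average individual (RS = 1).
    risk_score: relative risk with respect to the average individual.
    The cap at 1.0 is an assumption added for safety.
    """
    return min(1.0, baseline_risk * risk_score)


# Lifetime risk for RS = 2, men and women (baseline 10% and 5%)
print(projected_cumulative_risk(0.10, 2), projected_cumulative_risk(0.05, 2))
# Risk during the screening period (50-69) for RS = 2, men and women
print(projected_cumulative_risk(0.03, 2), projected_cumulative_risk(0.02, 2))
```

Multiplying a cumulative risk by an OR is a reasonable approximation while the risk is small; it becomes less accurate as the scaled risk grows.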
The sensitivity, specificity, and likelihood ratios of the risk model to detect CRC for selected risk score cut-offs are shown in Table 4. The use of a high cut-off (RS = 5) offers high specificity (98.94%) but low sensitivity (8.38%). These figures can be useful to assess the relative interest of extending the age of CRC screening for selected strata of the population with such high risk scores, either before age 50 or after age 69. As Fig. 5 and Supplementary Table 3 show, the positive predictive value of the model increases appreciably only at older ages, when the prior probability of CRC is higher, especially for RS over 2. The cumulative risk of developing CRC during the age range 70-79 is almost 40% for subjects with a risk score of 5.
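The dependence of the predictive values on the prior probability follows directly from Bayes' theorem. The sketch below uses the sensitivity and specificity reported for the RS = 5 cut-off; the 3% prior is the screening-age cumulative risk quoted earlier for an average man and is used here only as an example, and the function names are hypothetical.

```python
def predictive_values(sensitivity: float, specificity: float, prior: float):
    """Return (PPV, NPV) for a given prior probability of disease."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    true_neg = specificity * (1 - prior)
    false_neg = (1 - sensitivity) * prior
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (LR+, LR-) for a given sensitivity and specificity."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity


SENS, SPEC = 0.0838, 0.9894   # RS = 5 cut-off, from the text
ppv, npv = predictive_values(SENS, SPEC, prior=0.03)
lr_pos, lr_neg = likelihood_ratios(SENS, SPEC)
print(f"PPV={ppv:.3f} NPV={npv:.3f} LR+={lr_pos:.2f} LR-={lr_neg:.2f}")
```

Re-running `predictive_values` with a larger prior shows why the positive predictive value rises at older ages while sensitivity and specificity stay fixed.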

Discussion
We assessed the potential utility of a risk prediction model for CRC that combines modifiable risk factors with family history of CRC and a genetic risk score based on 21 susceptibility SNPs. We have observed that modifiable risk factors have a stronger value for risk prediction than does genetic susceptibility. Though the added value of each SNP is small, the combination of 21 SNPs adds significantly to the predictive power of the risk model.
Our study is large enough to confirm that established risk factors are associated with risk: family history of CRC, high consumption of alcohol, obesity, lack of physical activity in leisure time, high intake of red meat, low intake of vegetables, and non-use of NSAIDs/ASA. These risk factors were selected based on previous evidence reported in systematic reviews and meta-analyses [25][26][27][28][29][30][31] . All were independent predictors of CRC in an average risk population, with the exception of smoking, which was only significant in the univariate analysis. A recent meta-analysis on smoking has shown that the effect is small for CRC, with a summary OR smaller than 1.25, and larger for rectal than colon cancer 32 . We also analysed other covariates that have been associated with CRC (diabetes mellitus, inflammatory bowel disease, and diverticulitis) but they were not associated with CRC in our study, perhaps because of the small number of affected individuals. Nor was intake of vitamin D, calcium, or folic acid associated with CRC. We opted not to include statins in the model since there is controversy regarding these drugs and CRC risk 33 .
Our study confirms that family history of CRC is the strongest single risk factor for CRC. We found a significant association between the GRS and family history, which highlights the importance of genetic susceptibility in CRC, though family history could also contribute to risk through shared lifestyle or environmental factors. Also, gene-environment interactions may play a role in this type of cancer [34][35][36] .
Our analysis has shown that the ERS, built as an additive model of modifiable factors, has a stronger association with CRC than the GRS. On average, each environmental risk factor increases CRC risk by 36% (OR 1.36), while each risk allele only increases it by 7% (OR 1.07). This implies that changing one modifiable risk factor towards a healthier lifestyle might offset the effect of about 4 risk alleles. Given that environmental factors explain a significant part of the CRC risk, we believe it is important to consider incorporating clinical data to improve current screening and to encourage patients to achieve a healthier lifestyle.
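The equivalence between one environmental factor and a number of risk alleles can be checked on the log-odds scale, where both scores are additive. The calculation below uses only the rounded point estimates reported in the text, so it is approximate and ignores the confidence intervals.

```python
import math

OR_PER_FACTOR = 1.36   # adjusted OR per environmental risk factor
OR_PER_ALLELE = 1.07   # adjusted OR per risk allele

# Number of risk alleles with the same log-odds effect as one factor
alleles_per_factor = math.log(OR_PER_FACTOR) / math.log(OR_PER_ALLELE)

# Implied OR for a difference of 10 risk alleles
or_ten_alleles = OR_PER_ALLELE ** 10

print(f"{alleles_per_factor:.1f} alleles per factor")
print(f"OR for 10 alleles: {or_ten_alleles:.2f}")
```

With these inputs one environmental factor corresponds to between four and five risk alleles, and the implied OR for 10 alleles is close to the reported 1.96.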
We also believe it is important to consider that our genotyping array only had 21 susceptibility SNPs, and today more than 60 have been identified in diverse GWAS studies 22 . Though SNPs identified more recently have smaller effects (in the range of 5% increased risk per allele) and smaller allele frequencies, their addition may still increase the predictive accuracy of the model in a relevant way. In our Spanish population only five SNPs out of the 21 analysed were significantly associated with CRC risk. This might be related to lack of statistical power, since with 1300 cases and 2700 controls we only have 30% power to detect an OR of 1.10, but some of the SNPs may also have effects limited to specific populations. It is reassuring, however, that all SNPs analysed had an effect in the same direction as reported in the discovery study.
Several risk prediction models for advanced neoplasia or CRC have previously been published, with AUROC between 0.65 and 0.75 8. Our estimate of predictive accuracy, corrected for overfitting through cross-validation, is slightly smaller (AUROC: 0.63, 95% CI 0.60 to 0.66), but our model could not include age and gender because these factors were used to match the controls. Also the estimated risk per allele (OR 1.07, 95% CI 1.04 to 1.10) Our study, which used more SNPs than most previous studies, as well as questionnaire data including diet, confirms that the AUROC increases with more SNPs. Furthermore, the aim of our study was also to build a risk model useful for tailoring CRC screening programs to individuals' characteristics and for calculating the potential impact of determining an individual risk score in a CRC screening population. The risk model, applied to the Spanish cancer registry cumulative risk of CRC, has shown that 3 modifiable risk factors or 10 risk alleles are expected to advance the incidence curve of men by 5 years at age 50 (2.5 years for women). The absolute effect on incidence is larger at older ages, since the effect is multiplicative. This implies that screening in average-risk populations probably should start earlier, at 45 years, for individuals with more risk factors, and could be delayed to 55 years old (or 60) for individuals with fewer risk factors or risk alleles.
Our calculations also show that it would be most useful to extend the age of CRC screening for the high-risk population after age 69. The positive predictive value of the model increases significantly at older ages, when the prior probability of CRC is higher. Since the conditional life expectancy of a person at age 70 is still long, extending the screening until age 79 might yield a greater reduction in CRC burden.
Moreover, we could also use the risk model to select high-risk subjects in whom colonoscopy might be the optimal initial screening technique rather than the less sensitive faecal occult blood test that is currently implemented in Spain and many other countries in Europe 37 . Another important point to highlight is that the use of prediction models, together with good communication tools, could increase the individual perceived risk, and consequently the participation rate and adherence to screening, especially in high-risk subjects. Moreover, the awareness of a personal risk of CRC might improve people's lifestyle and thereby reduce CRC incidence.
This study has some limitations. Our model was developed within a retrospective case-control setting and relied on self-reported data. Measurement error and recall bias may therefore have led to an underestimation of the predictive accuracy. Cases and controls were not well matched regarding age, sex, and education. However, we performed all the analyses adjusted for a propensity score to reduce the possible bias related to this problem. The model is only applicable to asymptomatic individuals from the general population (average risk); subjects with symptoms or several affected relatives should be referred for colonoscopy regardless of the risk score.
As already mentioned, this study only included 21 risk SNPs, while more than sixty have already been identified. More studies are needed to determine the generalizability, usefulness of information, and the cost-effectiveness of applying individual genotyping in a CRC screening program. However, it should be noted that the cost of whole-genome genotyping is decreasing, its determination only needs to be performed once in a lifetime, and the data probably will be useful for predicting risk of other diseases in addition to cancer.
In conclusion, we assessed the predictive accuracy of a model for CRC that could be useful to stratify the population into risk categories and tailor CRC screening by adapting the onset age, the intensity, and the screening test. In our model, although the genetic factors are significant contributors, the modifiable risk factors contribute more to risk prediction. Risk assessment may increase screening participation and adoption of healthier lifestyles.