Introduction

Familial adenomatous polyposis (FAP) and MUTYH-associated polyposis (MAP) are two inherited syndromes that show a high incidence of adenomatous polyps and an elevated risk of developing colorectal cancer (CRC). They account for a small fraction of CRC, <4%.1 However, despite the fact that these two syndromes are caused by deleterious highly penetrant mutations in APC (GeneID 324)2 and MUTYH (GeneID 4595),3 around 15–20% of patients with polyposis exhibit no known genetic risk factors. This is especially so for multiple polyposis patients who carry between 3 and 100 adenomatous polyps. In addition, 20–30% of CRC is thought to be due to inherited multifactorial causes.4 In the absence of identification of a new deleterious mutation, CRC may in part be due to the summation of the deleterious effects of a series of low-frequency dominant and independently acting variants of a variety of different genes, each conferring a moderate but readily detectable increase in relative risk.4 This ‘rare variant’ hypothesis was based upon the observation by Frayling et al.5 of the APC I1307K and E1317Q variants in patients with multiple adenomas. The I1307K variant is found in the Ashkenazi Jewish population at a frequency of 6–7%, whereas it is absent from non-Jewish populations, and confers an increased risk of multiple adenomas and CRC.5 This variant implies an amino-acid substitution (isoleucine to lysine) in a region involved in protein binding, leading to a mild dominant-negative effect. The E1317Q substitution may also affect the function of the APC protein, an effect presumed to translate into a slight but definite growth advantage for a tumor.6

Following these observations, other rare variants have been tested. The candidate variants were selected because of their known involvement in sporadic or hereditary CRC or adenomas. Fearnhead et al.7 observed a cumulative effect of 13 rare variants on five different genes in a cohort of 124 patients with adenomatous polyposis with an overall odds ratio (OR) of 2.2 (P=0.0001) when compared with a control set. Since this publication, several variants in different CRC-susceptibility genes, such as hMLH1 and hMSH6, have been reported to increase the risk of CRC without causing Lynch syndrome,8, 9 whereas CHEK2 confers a higher CRC risk in hereditary nonpolyposis CRC (HNPCC)/HNPCC-related families.10

Rare variants are defined by a minor allele frequency (MAF) <1% in the general population and are unlikely to be identified by genome-wide association studies due to their low frequency and small contribution to the overall susceptibility of a disease.4 Only variants with a frequency >5% are detected in these large case–control association studies. Rare variants are best identified in studies with selected cases and candidate genes already thought likely to be functionally relevant.1, 5 Patients with early-onset CRC (before the age of 50) and multiple polyposis (3–100 polyps) with no known mutations in APC or MUTYH are ideal candidates to demonstrate an elevated predisposition to disease due to the accumulation of rare variants, as they are likely to involve inherited susceptibility.

The aim of this study was therefore to type a selection of rare (MAF <1%) and low-frequency variants (MAF 1–5%) in a relatively large set of patients with undetermined multiple polyposis (3–100 polyps) or early-onset CRC (diagnosed before 50 years of age) in order to elucidate the wider role of such variants in CRC susceptibility.

Materials and methods

A total of 315 cases and 866 controls, 1181 subjects in all, were included in this study. Collection of blood samples from cases and controls and clinicopathological information from patients were undertaken with appropriate individual informed consent and local ethical committee approvals.

Controls

The controls comprised 866 individuals collected in 10 different regions across the United Kingdom as part of the People of the British Isles study11 (see below) and were unselected with respect to disease status.

Cases

The UK patient group consisted of 112 individuals with 3–100 histologically proven synchronous or metachronous adenomatous polyps and 72 individuals with CRC diagnosed before 50 years of age. Sixty-three individuals with early-onset disease were obtained through the VICTOR clinical trial, a phase III double-blind placebo-controlled study of rofecoxib in Dukes stage B or C CRC patients following potentially curative therapy, whereas the remaining nine cases were recruited through the John Radcliffe and Churchill hospitals’ gastrointestinal clinics. With the exception of one Black Caribbean and one Indian individual, ethnic origin was White British for all UK patients for whom information was available. Non-white individuals were excluded from further analysis. No patient fulfilled the criteria for FAP, autosomal recessive MAP or HNPCC on clinical grounds. Some of these patients had already been screened for germline mutations in the APC and MUTYH genes in previous studies.5, 12

We also collected samples from 131 French patients, including 75 with multiple adenomas and 56 with early-onset CRC, who were recruited in the Department of Digestive Surgery at the Hospital Saint-Antoine in Paris using the criteria described above. Cases were selected from those who underwent a colectomy or total coloproctectomy for CRC or polyposis. Patients diagnosed with CRC before the age of 50 or with more than three polyps detected after 2005 were referred for a consultation with the geneticist. Immunohistochemical staining to determine loss of expression of the genes MLH1 and MSH2 and microsatellite status was performed for all patients with a CRC diagnosed before the age of 50. Microsatellite instability was confirmed using PCR as already described.13 Sequencing of the entire MUTYH and APC genes was carried out in patients with more than three adenomatous polyps.14 Only patients with no indication of HNPCC, MAP or FAP were included in this study. No ethnic identification was available for the French patients.

All the UK and French cases had histological confirmation of adenomatous polyps, but not all of them had the precise number of polyps determined. For 24 UK and 14 French adenoma patients, only ‘multiple’ was recorded. Within both the UK and French patient groups, individuals with attenuated FAP may be included, as they were not purposely eliminated from the study.

Variant selection

Rare and low-frequency variants were chosen based on prior literature reports that suggested a putative association with CRC, and with features related to colorectal disease, gastric cancer or other cancers (mostly of the breast and prostate). Variants in the BRCA genes were specifically selected from those that were classified either as non-pathogenic or as of unknown significance (see Breast Cancer Information Core database). Common variants, such as CDH1 rs16260, MTHFR rs1801133 and TP53 rs1042522, were genotyped because it has been suggested that they are associated with several types of cancer, including CRC.15, 16, 17, 18, 19

DNA extraction and processing

Genomic DNA was extracted from patients’ peripheral venous blood using standard techniques. The People of the British Isles control blood samples were transported at room temperature to the laboratory, where the peripheral blood lymphocytes were separated under sterile conditions within 2 days of collection. DNA was prepared from the 10-ml blood residue remaining after sterile separation using either magnetic beads (GeneCatcher™; Invitrogen, Carlsbad, CA, USA) or spin columns (Qiagen, Valencia, CA, USA). DNA concentration was determined using Pico Green20 and normalized for genotyping to 25 ng μl−1. Samples from the UK cases underwent whole-genome amplification because of limited volumes and amounts of genomic DNA. We used the Repli-g Mini kit (Qiagen), which implements a multiple-displacement amplification reaction to generate up to 10 μg of DNA per 50 μl reaction from a starting amount of at least 10 ng of genomic template. Genomic DNA from French cases and UK controls was genotyped without whole-genome amplification.

Genotyping

We examined 70 variants, in cases and controls, in the following cancer candidate genes: APC, AXIN1, AXIN2, BRCA1, BRCA2, CDH1, CHEK2, CTNNB1, EPHB2, EXO1, MLH1, MLH3, MSH2, MTHFR, PMS2, SMAD4 and TP53, which were selected based on their involvement in familial cancers and the presence of somatic mutations in cancer. Of these, 16 variants were genotyped in a subset of only 227 controls. The complete list of variants analyzed is given in Supplementary Table 1.

Four variants were genotyped using restriction fragment length polymorphism analysis (CDH1-2, CHEK2-1, CTNNB1-1 and MSH2-8). Variants APC-10 and APC-11 (that is, I1307K and E1317Q) were typed using allele-specific PCR.7 For details of primers, enzymes and fragment sizes for these see Supplementary Table 2. Primers and conditions for all other variants are available upon request. Genotyping of the remaining variants was done using the Sequenom MassArray technology, namely matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry and the iPLEX Gold assay (Sequenom Inc., San Diego, CA, USA).21

Statistical analysis

Hardy–Weinberg equilibrium was assessed using an exact test implemented in the program PLINK v.1.07.22 Case–control association analyses were also conducted with PLINK. Two-sided P-values were calculated using Fisher’s exact test and those <0.05 (with no multiple comparison correction) were considered statistically significant for an initial analysis. Combined ORs were estimated using the Mantel–Haenszel test.23, 24
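The two-sided Fisher's exact test used for the single-variant comparisons can be illustrated with a minimal pure-Python sketch. This is not the PLINK implementation; the function name and the allele counts in the usage lines are hypothetical and serve only to show how a 2×2 table of minor and major allele counts in cases and controls leads to a P-value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    Rows are cases and controls; columns are minor and major allele counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x minor alleles in cases, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are as probable as, or less probable than, the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical example: 5 minor alleles among 368 case chromosomes (184 cases)
# versus 2 among 1732 control chromosomes (866 controls).
p_value = fisher_exact_two_sided(5, 363, 2, 1730)
print(f"P = {p_value:.4f}")
```

Summing every table that is no more probable than the observed one is one common definition of the two-sided test; other software may use a slightly different convention.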

Functional in silico analysis

We used the web-based programs PolyPhen-2 and SNPs&GO to predict the effect of nonsynonymous variants on protein function.25, 26 FastSNP and F-SNP were similarly used for noncoding variants.27, 28

Results

Populations

Clinical characteristics of the patients and controls are shown in Table 1.

Table 1 Case and control sample description

Variant selection

Among the 70 variants examined, 24 (34%) were monomorphic in both the UK cases and controls and therefore not useful for analysis (Supplementary Table 1). Of the remaining 46, 10 were monomorphic in cases only and 5 were monomorphic only in controls. Thirty-one variants were considered rare having a control-population MAF <1%, seven were low-frequency variants (that is, MAF between 1 and 5%) and eight were common polymorphisms (that is, MAF >5%). If we define variant class based on the combined MAF in cases and controls as recently suggested,29, 30 only one variant (APC-17) changes categories, going from being a low-frequency variant to a rare one.
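The frequency classes used above can be written as a short sketch. The function names and the allele counts in the usage lines are hypothetical; the thresholds are those stated in the text (rare, MAF <1%; low frequency, 1–5%; common, >5%), with MAF expressed as a proportion.

```python
def minor_allele_frequency(minor_allele_count, n_individuals):
    """MAF in a diploid sample: minor allele count over total chromosomes."""
    return minor_allele_count / (2 * n_individuals)


def classify_variant(maf):
    """Assign a frequency class from a MAF given as a proportion (0 to 0.5)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf == 0.0:
        return "monomorphic"
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:
        return "low-frequency"
    return "common"


# Hypothetical minor allele counts in the 866 controls
for count in (0, 3, 40, 500):
    maf = minor_allele_frequency(count, 866)
    print(count, round(maf, 4), classify_variant(maf))
```

Whether the class is computed from control frequencies or from the combined case and control sample only changes the input to `minor_allele_frequency`, not the thresholds.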

No variant was out of Hardy–Weinberg equilibrium in the control population at a Bonferroni-corrected P-value of 0.001 (0.05/46). Two variants were in Hardy–Weinberg disequilibrium in controls (TP53-1 and BRCA1-6, P<0.05) and three in patients (EPHB2-3, EXO1-12 and BRCA1-22, P<0.05) if a correction for multiple testing was not applied.
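The Bonferroni threshold quoted above and an exact Hardy–Weinberg test of the general kind applied here can be sketched as follows. This is an illustrative implementation under stated assumptions, not the PLINK code; the function names and genotype counts are hypothetical. The test conditions on the observed allele counts and sums the probabilities of all heterozygote counts that are no more likely than the observed one.

```python
from math import exp, lgamma, log


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact Hardy-Weinberg P-value from genotype counts AA, AB, BB."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(het):
        # Probability of `het` heterozygotes given n individuals and n_a, n_b
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2) + lgamma(n + 1)
                - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # Possible heterozygote counts share the parity of n_a
    probs = [exp(log_prob(h)) for h in range(n_a % 2, min(n_a, n_b) + 1, 2)]
    p_obs = exp(log_prob(n_ab))
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))


threshold = bonferroni_threshold(0.05, 46)  # 46 polymorphic variants
print(f"Bonferroni threshold: {threshold:.4f}")
# Hypothetical control genotype counts for one variant
print(hwe_exact_p(850, 15, 1) < threshold)
```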

Association analysis

UK cases vs controls

When comparing UK cases with controls, four rare variants were found to have a significantly higher MAF among the patients (EXO1-12, MLH1-1, CTNNB1-1 and BRCA2-37, P<0.05; Table 2). Variant EXO1-12 was more frequent in individuals with cancer than in those with adenomas, as opposed to MLH1-1, CTNNB1-1 and BRCA2-37, which were only present in the multiple adenoma cases (Table 2). Results close to significance were also seen for rare variant EPHB2-3 and common variant CDH1-2 (P=0.07 for both), although the CDH1-2 A allele appeared to protect against disease. When analyzing carrier frequencies instead of allele frequencies, only BRCA2-37 was significant, with an OR of 4.1 (1.2–14.3; P=0.05; Table 3). Pooling together all the rare variants showed that the proportion of patients carrying rare variants was higher than the proportion of control carriers, regardless of whether the full set or a subset of controls was used (Table 3). The combined OR, obtained by merging OR 1 (effect of variants typed in the full set of controls) and OR 2 (effect of variants typed in the smaller set of controls) with the Mantel–Haenszel test, was 1.2 (95% confidence interval (CI) 0.8–1.8, P=0.42). This effect became much stronger when only variants with a MAF <0.5% were tested (combined OR 1.8; 95% CI 1.0–3.1; P=0.05; Table 3). On the other hand, the analysis of pooled low-frequency variants showed a protective but nonsignificant effect (combined OR 0.8; 95% CI 0.5–1.1; P=0.18; Supplementary Table 3). When APC-17 switched categories, from low frequency to rare variant, results changed slightly. For rare variants with MAF <1%, combined OR=1.1 (95% CI 0.8–1.7; P=0.54); for low-frequency variants, combined OR=0.8 (95% CI 0.5–1.2; P=0.24). Also, two variants (MLH3-1 and CHEK2-1) do not make the 0.5% cutoff when assessed from the frequency in the combined set of cases and controls. Taking them out of the analysis of variants with MAF <0.5% yields a combined OR of 1.6 (95% CI 0.8–2.9; P=0.17).
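The pooling of OR 1 and OR 2 into a combined OR can be illustrated with a minimal sketch of the Mantel–Haenszel estimator, treating the variants typed in the full control set and those typed in the smaller control set as two strata. The function name and all carrier counts below are hypothetical and are not taken from Table 3.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio.

    Each stratum is a tuple (a, b, c, d):
      a = case carriers, b = case non-carriers,
      c = control carriers, d = control non-carriers.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these counts")
    return numerator / denominator


# Hypothetical carrier counts
full_control_set = (30, 154, 110, 756)   # variants typed in all 866 controls
small_control_set = (12, 172, 11, 216)   # variants typed in 227 controls
combined = mantel_haenszel_or([full_control_set, small_control_set])
print(f"Combined OR = {combined:.2f}")
```

The estimator weights each stratum by its size, so the stratum with the larger control set contributes more to the combined OR.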

Table 2 Variants analyzed in UK cases and controls
Table 3 Rare variant counts in UK cases and controls

UK multiple adenoma patients vs early-onset CRC patients

Analysis by disease group (that is, multiple adenoma vs early-onset patients) of all variants with frequencies <0.5% revealed an increase in susceptibility to disease for carriers of rare variants, especially among multiple adenoma patients (combined OR 1.9; 95% CI 1.0–3.5; P=0.05; Table 4). Individually, the carrier frequency for BRCA2-37 in multiple polyp cases was significantly higher than that of controls, whereas this variant was absent among early-onset patients. MLH3-1, on the other hand, showed a significantly higher carrier frequency in the early-onset group as compared with controls, whereas BRCA2-27 was only detected in individuals with early-onset disease (Table 4). Overall, out of the 31 variants with MAF <1%, 14 were present in multiple adenoma patients only, whereas 3 (4, if counting APC-17) were present only in early-onset cases. The difference was significant whether or not APC-17 was included (P<0.05), although the smaller number of early-onset CRC patients might introduce bias.

Table 4 Rare variant counts in UK multiple adenoma, early-onset CRC and control subjects for variants with MAF <0.5% in controls

There were significant allele-frequency differences between individuals with multiple adenomas and those with early-onset CRC in two variants in MLH3 (one rare (MLH3-1) and one low frequency (MLH3-5)) and in a common variant in CDH1 (CDH1-5) (Table 2). In these three instances, the allele frequency among early-onset patients was higher than among individuals with multiple adenomatous polyps.

Comparison between UK and French samples

Twenty-two of the variants genotyped in UK patients (16 rare, 4 low frequency and 2 common variants) were also examined in French subjects affected by either multiple polyps or early-onset CRC, recruited using the same set of criteria employed in the United Kingdom (Table 5). Three rare variants (MSH2-8, APC-10 and BRCA2-48) were absent from both the UK and the French populations. Eight variants that were detected in UK cases were not found in French patients (EPHB2-1, EPHB2-4, EPHB2-7, EXO1-4, CTNNB1-1, BRCA2-35, BRCA2-37 and CHEK2-1). On the other hand, no variant identified in French patients was missing from the UK sample (cases and/or controls). There was no significant difference between the French and UK cases with respect to the overall number of rare variants with MAF <1%, but UK patients showed an excess of rare variants with MAF <0.5% compared with this set of French patients (P=0.02; Table 6).

Table 5 Variants genotyped in French patients
Table 6 Rare variant counts in UK and French patients

In silico analysis of functional effects

We investigated the putative effects of nonsynonymous variants with the programs PolyPhen-2 and SNPs&GO. Based on PolyPhen-2, 11 variants were classified as probably damaging, 6 as possibly damaging and 22 as benign, whereas according to SNPs&GO, there were 18 disease variants and 21 neutral variants. Although these numbers seem fairly similar, there were several disagreements between programs with respect to the prediction of particular variants (Table 2). Among the rare variants with MAF <1%, there were 9 probably damaging, 5 possibly damaging and 14 benign, or 13 disease and 15 neutral variants. When only those variants with MAF <0.5% were considered, the ratio of damaging (probably+possibly)/disease to benign/neutral variants increased from 50 to 60%. All of the low-frequency variants, in contrast, were predicted to be benign/neutral. In addition to the missense variants, there were one synonymous (SMAD4-1) and one deleterious coding (CHEK2-1) rare variant and five non-coding variants (one rare and one common variant in the promoter and two common intronic variants in CDH1, and one low-frequency variant in the 3′ untranslated region of APC). Promoter variant CDH1-1 is predicted to eliminate an S8 transcription factor-binding site, whereas the CDH1-2 promoter variant A allele has been found to decrease transcriptional efficiency by 68% with respect to the C allele, also probably by altering transcription factor-binding sites.31 Using the programs FastSNP and F-SNP, a variant in CDH1 intron 1 was determined to potentially affect a splicing site, whereas a variant in intron 4 of the same gene showed a low risk of being an intronic enhancer (Table 2).

Discussion

We have examined 55 rare variants, 7 low-frequency variants and 8 polymorphisms in a sample of UK CRC and multiple adenoma cases and controls. Two of the four rare variants that were individually significantly associated with disease (that is, MLH1-1 and CTNNB1-1) had already been identified in the same set of individuals with multiple polyps, although not then found to be individually significant.7 In this study, we showed that these MLH1 and CTNNB1 variants were not present in a different and much larger UK control population, which explains the present case–control significant difference, and were also absent from a sample of early-onset CRC UK patients. Given that these two variants have not been found in our set of French patients, and that having 30% fewer French cases may not in itself fully explain the UK–French apparent difference, they may represent UK founder effects, as previously suggested.4 However, replication of our findings in another UK sample of multiple adenoma patients, as well as functional studies, is necessary to establish their importance as CRC risk factors. The remaining two individually significant variants (EXO1-12 and BRCA2-37) have not been associated with CRC before. BRCA2-37 was classified as not clinically significant by the Breast Cancer Information Core database (in early 2010) and predicted to be benign by PolyPhen-2, yet it was recently found to be overrepresented among subjects with familial prostate cancer,32 and SNPs&GO considered it to be disease-associated. As mentioned above, the fact that it was not found among French patients could indicate a restricted distribution of this variant. Variant EPHB2-3, which was detected in a Finnish individual with rectal and prostate cancer in an earlier study,33 showed a nearly significant result. All associated variants code for nonsynonymous amino-acid changes. However, CTNNB1-1, BRCA2-37 and EPHB2-3 were predicted to be benign by PolyPhen-2, whereas MLH1-1 and EXO1-12 were considered probably damaging. SNPs&GO, on the other hand, predicted CTNNB1-1 and BRCA2-37 to be disease-associated and the remaining three variants to be neutral. Recently, SNPs&GO has been found to be more accurate than PolyPhen and other similar programs.34 However, even though it identifies CTNNB1-1 and BRCA2-37 as potentially pathogenic, it misses MLH1-1 and EXO1-12. Also, APC-11, demonstrably pathogenic,5 was not identified as such by any of these computational methods. These discrepancies indicate that in silico methods for evaluating the effects of nonsynonymous rare variants are not yet reliable enough for their predictions to be trusted. This is especially important when using them to decide which variants to focus on.

The grouping of all rare variants in the association analysis (23 or 8, depending on the control set used) yielded a combined OR of 1.2, which suggested that there was no strong evidence of an effect on CRC. However, pooling all variants with a MAF <0.5% considerably bolstered the association, taking the OR to 1.8. Notably, even though several of the variants included in the analysis are, on the basis of the in silico analysis and the examination of other parameters of pathogenicity,35 considered to be benign, neutral, not clinically significant or of unknown significance, there is nevertheless an elevated risk from their combined action. The conclusion is that these variants may well be pathologically relevant, but that the in silico approaches are not yet adequate to detect this. The low-frequency variants (MAF between 1 and 5%) do not appear to influence susceptibility to CRC, as we described earlier for CCND1.36 It is clear that further research is needed to evaluate more fully the role of low-frequency variants in cancer.37 Defining rare variants using a threshold based on the combined set of cases and controls, as compared with just the controls, only altered the classification of three variants in our study and so did not appreciably affect the results. This was to be expected because we had a substantially larger number of controls than cases.

Extensively studied common variants MTHFR A222V (rs1801133) and TP53 R72P (rs1042522) did not show significant frequency differences between cases and controls. Conversely, CDH1-284C/A (rs16260) exhibited a lower frequency of the A allele in patients than in controls, revealing weak statistical evidence of a protective effect of this polymorphism on colorectal disease (0.25 vs 0.31, P=0.07). This is in agreement with previous findings on CRC where the C allele increases risk,15, 16 whereas in gastric, prostate and breast cancer, the A allele tends to be the risk allele.38, 39, 40 However, our study is underpowered for the detection of effects from common variants.

The analysis by disease group showed that, even though the collection of rare variants in each set of patients carries a higher risk of disease, our findings are mostly driven by the effects on individuals with multiple adenomas. BRCA2-37, the rare variant with the strongest effect in this study, was, for example, found only in patients with multiple adenomas. Although the sample size for the early-onset group was limited, our results suggest that the genetic influence on CRC may mostly be seen in individuals with multiple adenomas, as compared with early-onset cases. This parallels what is found in the clear-cut familial cases of inherited CRC. The extra layer of activity needed to go from polyp to cancer leads to an additional amount of variation that may be ‘less genetically determined’ and so obscure the underlying genetic susceptibility due to the multiple adenomas. There were, however, no significant differences in carrier frequencies between the two groups of patients, despite the fact that over half of the variants with MAF <1% were found only in the multiple adenoma group. This, again, is probably due to the relatively smaller size of the early-onset group of patients. However, the allelic frequencies of two missense variants (one rare and one low frequency) in MLH3 and one intronic common variant in CDH1 differed significantly between multiple adenoma and early-onset CRC cases, with the latter exhibiting higher frequencies of these variants. This suggests that different sets of rare variants are quite likely to be involved in different pathologies, but detecting their effects would require larger numbers of patients than we were able to study.

The association P-values reported in this study have not been corrected for multiple hypothesis testing. Taking into account the number of variants analyzed, a Bonferroni correction would take the significance threshold to 0.001. Nonetheless, we believe that because there is an a priori case for each candidate variant to be potentially functional, such a correction would be unsuitably stringent. The lack of French controls precluded a similar association study from being carried out with French samples as was done for the UK samples. Using UK controls would be inappropriate because of population stratification within Europe, especially for analysis of rare variants as they are likely to be population specific. The presence of such founder effects is suggested by the fact that the variants MLH1-1 and CTNNB1-1, which are very clearly associated with multiple adenomas in UK cases, were not found in the French multiple adenoma cases. Further analysis of such differences requires larger numbers of French cases and appropriately selected French controls. Moreover, larger cohorts, such as the EPICOLON consortium experience,41 are needed to confirm these preliminary results, and meta-analyses should be performed to confirm the pathogenic effects of the variants described in the present work.

In summary, because rare variants appear to be associated with higher ORs than common variants, a relatively small study like ours can uncover the effects of candidate variants with low population frequencies on complex diseases such as, in this case, CRC. We have also shown that variants with frequencies <0.5% appear to have the biggest effects regardless of the in silico prediction of their function. The role of the individual variant BRCA2-37 (V2728I) on the development of multiple adenomatous polyps deserves further examination. In general, the multiple adenoma phenotype seems to be more susceptible to genetic influence than early-onset CRC, but a larger early-onset patient sample would be necessary to confirm this finding. We have found some differences between UK and French patients in terms of the distribution of rare variants that justify closer inspection as population stratification within Europe can lead to spurious association results.

To conclude, we have confirmed that rare variants are important risk factors in CRC and as such, should be systematically assayed alongside common variation in the search for the genetic basis of complex diseases, taking great care to match cases with appropriate controls.