Edinburgh Research Explorer Association analyses identify 31 new risk loci for colorectal cancer susceptibility

Colorectal cancer (CRC) is a leading cause of cancer-related death worldwide, and has a strong heritable basis. We report a genome-wide association analysis of 34,627 CRC cases and 71,379 controls of European ancestry that identi ﬁ es SNPs at 31 new CRC risk loci. We also identify eight independent risk SNPs at the new and previously reported European CRC loci, and a further nine CRC SNPs at loci previously only identi ﬁ ed in Asian populations. We use in situ promoter capture Hi-C (CHi-C), gene expression, and in silico annotation methods to identify likely target genes of CRC SNPs. Whilst these new SNP associations implicate target genes that are enriched for known CRC pathways such as Wnt and BMP, they also highlight novel pathways with no prior links to colorectal tumourigenesis. These ﬁ ndings provide further insight into CRC susceptibility and enhance the prospects of applying genetic risk scores to personalised screening and prevention. 89 (BFDP) based OR and ability of 10 Analysis (GCTA) , on the and known SNPs, SNPs with P 5× − and

M any colorectal cancers (CRC) develop in genetically susceptible individuals 1 and genome-wide association studies (GWAS) of CRC have thus far reported 43 SNPs mapping to 40 risk loci in European populations 2,3 . In Asians, 18 SNPs mapping to 16 risk loci have been identified 4,5 , a number of which overlap with those reported in Europeans. Collectively across ethnicities GWAS has provided evidence for 53 unique CRC susceptibility loci. While much of the heritable risk of CRC remains unexplained, statistical modelling indicates that further common risk variants remain to be discovered 6 .
To gain a more comprehensive insight into CRC aetiology, we conducted a GWAS meta-analysis that includes additional, unreported datasets. We examine the possible gene regulatory mechanisms underlying all GWAS risk loci by analysing in situ promoter Capture Hi-C (CHi-C) to characterise chromatin interactions between predisposition loci and target genes, examine gene expression data and integrate these data with chromatin immunoprecipitation-sequencing (ChIP-seq) data. Finally, we quantify the contribution of the loci identified in this study, together with previously identified loci to the heritable risk of CRC and estimate the sample sizes required to explain the remaining heritability.

Results
Association analysis. Thus far, studies have identified 61 SNPs that are associated with CRC risk in European and Asian populations (Supplementary Data 1). To identify additional CRC risk loci, we conducted five new CRC GWAS, followed by a metaanalysis with 10 published GWAS totalling 34,627 cases and 71,379 controls of European ancestry under the auspices of the COGENT (COlorectal cancer GENeTics) consortium 7 (Fig. 1, Supplementary Data 2). Following established quality control measures for each dataset 8 (Supplementary Data 3), the genotypes of over 10 million SNPs in each study were imputed, primarily using 1000 Genomes and UK10K data as reference (see Methods). After filtering out SNPs with a minor allele frequency <0.5% and imputation quality score <0.8, we assessed associations between CRC status and SNP genotype in each study using logistic regression. Risk estimates were combined through an inverse-variance weighted fixed-effects meta-analysis. We found little evidence of genomic inflation in any of the GWAS datasets (individual λ GC values 1.01-1.11; meta-analysis λ 1000 = 1.01, Supplementary Figure 1).
Excluding flanking regions of 500 kb around each previously identified CRC risk SNP, we identified 623 SNPs associated with CRC at genome-wide significance (logistic regression, P < 5 × 10 −8 ).
After implementing a stepwise model selection, these SNPs were resolved into 31 novel risk loci, with 27 exhibiting Bayesian False Discovery Probabilities (BFDPs) 9 <0.1 ( Table 1, Fig. 2, Supplementary Figure 2). The association at 20q13.13 (rs6066825) had only been previously identified as significant in a multi-ethnic study 10 . Two new associations (rs3131043 and rs9271770) were identified within the 6p21.33 major histocompatibility (MHC) region, with rs3131043 located 470 kb 5′ of HLA-C, and rs9271770 located between HLA-DRB1 and HLA-DQA1. Imputation of the MHC region using SNP2HLA 11 provided no evidence for additional MHC risk loci.
Next, we performed an analysis conditioned on the sentinel SNP (r 2 < 0.1 and P conditional < 5 × 10 −8 ; Table 2) to search for further independent signals at these new and previously reported risk loci. We confirmed the presence of previously reported dual signals at 14q22.2, 15q13.3 and 20p12.3 18 . For the new risk loci, an additional independent signal was identified at 5p15. 3. In addition, a further seven signals were found at five previously reported risk loci: 11q13.4, 12p13.32, 15q13.3, 16q24.1, 20q13.13. Two of these signals were at the 15q13.3 locus, of which one was 5′ of GREM1 and the other intronic to FMN1. A further two signals were proximal and distal of 20q13.13. At 12p13.32 and 16q24.1, genome-wide associations marked by rs12818766 and rs899244, respectively, were shown. These were independent of the previously reported associations 2,14 at rs3217810 and rs2696839 (pairwise r 2 = 0.0).
In total, we identified 39 new independent SNPs associated with CRC susceptibility at genome-wide significance in Europeans. Together with the nine associations previously identified in Asian populations, and the 31 previously identified SNPs that were confirmed here, this brought the number of identified CRC association signals in Europeans to 79. Several of these risk loci map to regions previously identified in other cancers. In particular, three regions harbour susceptibility loci for multiple cancers 19 , specifically 5p15.33 (TERT-CLPTM1L), 9p21.3 (CDKN2A) and 20q13.33 (RTEL1) (Supplementary Data 5).
Functional annotation and biological inference of risk loci. To the extent that they have been deciphered, most GWAS risk loci map to non-coding regions of the genome influencing gene regulation 19 . Consistent with this, we found evidence that the CRC risk SNPs mapped to regions enriched for active enhancer marks (H3K4me1 and H3K27ac) in colonic crypts (permutation test, P = 0.034 and 0.033, respectively) and colorectal tumours (P = 4.2 × 10 −3 and 4.0 × 10 −5 ) (Supplementary Figure 3). To determine whether the CRC SNPs overlapped with active regulatory regions in a cell-type specific manner 20 , we analysed the H3K4me3, H3K27ac, H3K4me1, H3K27me3, H3K9ac, H3K9me3 and H3K36me3 chromatin marks across multiple cell types from the NIH Roadmap Epigenomics project 21 . Colonic and rectal mucosa cells showed the strongest enrichment of risk SNPs at active enhancer and promoter regions (H3K4me3, H3K4me1 and H3K27ac marks, P < 5 × 10 −4 ) (Supplementary Figure 3).
Given our observation that the risk loci map to putative regulatory regions, we examined both histone modifications and transcription factor (TF) binding sites in LoVo and HT29 CRC cells across the risk SNPs. Using variant set enrichment 22 , we identified regions of strong LD (defined as r 2 > 0.8 and D′ > 0.8) with each risk SNP and determined the overlap with ChIP-seq data from the Systems Biology of Colorectal cancer (SYSCOL) study and inhouse-generated histone data. We identified an over-representation of binding for MYC, ETS2, cohesin loading factor NIPBL and cohesin-related proteins RAD21, SMC1A and SMC3 (Supplementary Figure 4). About 87% (69/79) of the risk SNPs were predicted to disrupt binding motifs of specific TFs, notably CTCF, SOX and FOX, with 35% located within TF binding peaks from LoVo, HT29 or ENCODE ChIP-seq data (Supplementary Data 6).
The upstream mechanisms by which predisposition SNPs influence disease risk is often through effects on cis-regulatory transcriptional networks, specifically through chromatin-looping interactions that are fundamental for regulation of gene expression. Therefore, to link regulatory regions containing risk SNPs to promoters of candidate target genes, we applied in situ promoter capture Hi-C (CHi-C) data in LoVo and HT29 cells (Supplementary Data 9). About 38% of the risk SNPs mapped to regions that showed statistically significant chromatin-looping interactions with the promoters of respective target genes. Notably, as well as confirming the interaction between rs6983267 and MYC at 8q24.21 ( Supplementary Figure 2), the looping interaction from an active enhancer region at 10q25.2 implicates TCF7L2 as the target gene of rs12255141 variation (Fig. 3). TCF7L2 (previously known as TCF4) is a key transcription factor in the Wnt pathway and plays an important role in the development and progression of CRC 23 . Intriguingly, TCF7L2 has been shown to bind to a MYC enhancer containing rs6983267 24 and to a GREM1 enhancer near rs16969681 25 . Based on ChromHMM, this region is annotated as a promoter in HCT116 cells, but not in normal colonic and rectal mucosa. Additionally this locus has been implicated in lung cancer 26 and low-grade glioma 27 . Similarly, the 9p21.3 chromatin interaction provides evidence to support CDKN2B as the target gene for rs1412834 variation, a region of somatic loss.
We sought to gain further insight into the target genes at each locus, and hence the biological mechanisms for the associations, . However, while multiple nominally significant cis-eQTLs were present, nearly half of all loci had no evidence of cis-eQTLs in the sample sets used. In addition to eQTL analysis, we performed Summary-databased Mendelian Randomization (SMR) analysis 28 as a more stringent test for causal differences in gene transcription (Supplementary Data 8). There was support for the 11q23.1 locus SNP influencing CRC risk through differential expression of one or more of COLCA1, COLCA2 and C11orf53 transcripts (P SMR < 10 −10 ). There was also evidence that the 3p21.1 and 19q13.33 SNPs acted through SFMBT1 and FUT2, respectively, (P SMR < 10 −5 ), as well as the 6p21.31 SNP acted through class II HLA expression (P SMR < 5 × 10 −4 ).
Based on genetic fine-mapping and functional annotation, our data indicated several candidate target genes with functions previously unconnected to colorectal tumourigenesis (Supplementary Data 9). The SFMBT1 protein (3p21.1) acts as a histone reader and a component of a transcriptional repressor complex 29 . TNS3 at 7p12.3 encodes the focal adhesion protein TENSIN3, to which the intestinal stem cell marker protein Musashi1 has been reported to bind. Tns3-null mice exhibit impaired intestinal epithelial development, probably because of defects in Rho GTPase signalling and cell adhesion 30 . LRP1 (12q13.3, LDL receptor-related protein 1) (Fig. 3) may be involved in Wnt-signalling 31 , although its role in the intestines has not previously been conclusively demonstrated. FUT2 at 19q13.33 encodes fucosyltransferase II. Variation at this locus is associated with differential interactions with intestinal bacteria and viruses. Our data thus provide evidence for a role of the microbiome in CRC risk 32 . PTPN1 (20q13.13), also known as PTP1B, encodes a non-receptor tyrosine phosphatase involved in regulating JAK-signalling, IR, c-Src, CTNNB1, and EGFR. We annotated all risk loci with five types of functional data: (i) presence of a CHi-C contact linking to a gene promoter, (ii) presence of an association from eQTL, (iii) presence of a regulatory state, (iv) evidence of TF binding, and (v) presence of a nonsynonymous coding change (Supplementary Data 9). Collectively this analysis suggested three primary candidate disease mechanisms across a number of risk loci: firstly, genes linked to BMP/TGF-β signalling (e.g. GREM1, BMP2, BMP4, SMAD7, SMAD9); secondly, genes with roles either directly or indirectly linked to MYC (e.g. MYC, TCF7L2); and thirdly genes with roles in maintenance of chromosome integrity (e.g. TERT, RTEL1) and DNA repair (e.g. POLD3) (Supplementary Figure 5). Pathway gene set enrichment analyses 33 revealed several growth or development related pathways were enriched, notably TGF-β signalling and immune response pathways (Supplementary Figure 6, Supplementary Data 10). Other cancer-related themes included apoptosis and leukocyte differentiation pathways. We used Data-driven Expression-Prioritized Integration for Complex Traits (DEPICT) 34 to predict gene targets based on gene functions that are shared across genome-wide significant risk loci, as well as those associated at P < 10 −5 as advocated to mitigate against type II error. Tissue-specificity with respect to colonic tissue was evident (permutation test, P < 5 × 10 −3 ) and among the protein-coding genes predicted, there was enrichment for TGF-β and PI3K-signalling pathways, and abnormal intestinal crypt gene sub-networks (P < 10 −5 ; Supplementary Data 11).
Contribution of risk SNPs to heritability. Using Linkage Disequilibrium Adjusted Kinships (LDAK) 35 in conjunction with the GWAS data generated on unselected CRC cases (i.e. COIN, CORSA, Croatia, DACHS, FIN, SCOT, Scotland1, SOCCS/LBC, SOCCS/GS, UKBB, VQ58 studies) we estimated that the heritability of CRC attributable to all common variation is 0.29 (95% confidence interval: 0.24-0.35). To estimate the sample size required to explain a greater proportion of the GWAS heritability, we implemented a likelihood-based approach using association statistics in combination with LD information to model the effect-size distribution 36 , which was best represented by a threecomponent model (mixture of two normal distributions). Under this model, to identify SNPs explaining 80% of the GWAS heritability, it is likely to require effective sample sizes in excess of 300,000 if solely based on GWAS associations (Supplementary Figure 7).
After adjusting for winner's curse 37 , the 79 SNPs thus far shown to be associated with CRC susceptibility in Europeans explain 11% of the 2.2-fold familial relative risk (FRR) 38 , whilst all common genetic variants identifiable through GWAS could explain 73% of the FRR. Thus, the identified susceptibility SNPs collectively account for approximately 15% of the FRR of CRC that can be explained by common genetic variation. We incorporated the newly identified SNPs into risk prediction models for CRC and derived a polygenic risk score (PRS) based on a total of 79 GWAS significant risk variants. Individuals in the top 1% have a 2.6-fold increased risk of CRC compared with the population average (Supplementary Figure 8). Risk reclassification using this PRS offers the prospect of optimising prevention programmes for CRC in the population, for example through targeting screening 6 , and also preventative interventions. The identification of further risk loci through the analysis of even larger GWAS is likely to improve the performance of any PRS model.
Co-heritability with non-cancer traits. We implemented crosstrait LD score regression 39 to investigate co-heritability globally between CRC and 41 traits with publicly available GWAS summary statistics data. None of the genetic correlations remained significant after Bonferroni correction (two-sided Z-test, Pthreshold: 0.05/41 = 1.2 × 10 −3 ). However, nominally significant positive associations with CRC risk (Supplementary Data 12) included insulin resistance, comprising raised fasting insulin, glucose and HbA1c (positive), hyperlipidaemia, comprising raised total cholesterol and low-density lipoprotein cholesterol, and ulcerative colitis, all of which are traits or diseases previously reported in observational epidemiological studies to be associated with CRC risk 40,41 .

Discussion
Here we report a comprehensive analysis that sheds new light on the molecular basis of genetic risk for a common cancer, and greatly increases the number of known CRC risk SNPs. To identify the most credible target genes at each site, we have performed detailed annotation using public databases, and have also acquired our own disease-specific data from ChIP-seq, promoter capture Hi-C and gene expression analyses.
Given that there remains significant missing common heritability for CRC, additional GWAS meta-analyses are likely to lead to discovery of more risk loci. Such an assertion is directly supported a contemporaneous study 42 , which has reported the identification of 40 independent signals; 30 novel loci and 10 conditionally independent association signals at previously and newly identified CRC risk loci. Of these, 18 were replicated in our analysis, with an additional five exhibiting an independent signal present at the same locus (Supplementary Data 13).
Overall, our findings provide new insights into the biological basis of CRC, not only confirming the importance of established gene networks, but also providing evidence that point to a role for the gut microbiome in CRC causation, and identifying several functional mechanisms previously unsuspected of any involvement in colorectal tumourigenesis. Several of the gene pathways identified through GWAS may provide potential novel targets for chemoprevention and chemotherapeutic intervention. (1) The NSCCG-OncoArray GWAS comprised 6240 cases ascertained through the National Study of Colorectal Cancer Genetics (NSCCG) 43 -log 10 (P) UK Biobank is a large cohort study with more than 500,000 individuals recruited. Biological samples of these participants were genotyped using the custom-designed Affymetrix UK BiLEVE Axiom array on an initial 50,000 participants and Affymetrix UK Biobank Axiom array on the remaining 450,000 participants. The two arrays had over 95% common content. Genotyping was done at the Affymetrix Research Services Laboratory in Santa Clara, California, USA. Details on genotyping and quality control were previously reported 49 . Self-reported cases of cancers of bowel, colon or rectum, if not confirmed by the ICD9 or ICD10 codes were excluded from the analysis. Healthy control individuals without history of cancer and/or colorectal adenoma were included in the analysis after matching one case to four controls by age, gender, date of blood draw, ethnicity and region of residence (two first letters of postal code).
Published GWAS. We made use of 10  Quality control. Standard quality control (QC) measures were applied to each GWAS 8 . Specifically, individuals with low SNP call rate (<95%) as well as individuals evaluated to be of non-European ancestry (using the HapMap version 2 CEU, JPT/CHB and YRI populations as a reference) were excluded (Supplementary Figure 9). For apparent first-degree relative pairs, we excluded the control from a case-control pair; otherwise, we excluded the individual with the lower call rate. SNPs with a call rate <95% were excluded as were those with a MAF <0.5% or displaying significant deviation from Hardy-Weinberg equilibrium (P < 10 −5 ). QC details are provided in Supplementary Data 3. All genotype analyses were performed using PLINK v1.9 57 .
Imputation and statistical analysis. Prediction of the untyped SNPs was carried out using SHAPEIT v2.837 58    Project with an additional population matched reference panel: 3882 Sequencing Initiative Suomi (SISu) haplotypes for the FIN study, and 3000 sequenced CRC cases for the DACHS study. We imposed predefined thresholds for imputation quality to retain potential risk variants with MAF >0.5% for validation. Poorly imputed SNPs defined by an information measure <0.80 were excluded. Tests of association between imputed SNPs and CRC were performed under an additive genetic model in SNPTEST v2.5.2 60 . Principal components were added to adjust for population stratification where required (i.e. DACHS, FIN, NSCCG-OncoArray, SCOT and UKBB).
To determine whether specific coding variants within HLA genes contributed to the diverse association signals, we imputed the classical HLA alleles (A, B, C, DQA1, DQB1 and DRB1) and coding variants across the HLA region using SNP2HLA 11 . The imputation was based on a reference panel from the Type 1 Diabetes Genetics Consortium (T1DGC) consisting of genotype data from 5225 individuals of European descent with genotyping data of 8961 common SNPs and indel polymorphisms across the HLA region, and four digit genotyping data of the HLA class I and II molecules. For the X chromosome, genotypes were phased and imputed as for the autosomal chromosome, with the inclusion of the "chrX" flag. X chromosome association analysis was performed in SNPTEST using a maximum likelihood model, assuming complete inactivation of one allele in females and equal effect-size between males and females.
The adequacy of the case-control matching and possibility of differential genotyping of cases and controls was evaluated using a Q-Q plot of test statistics in individual studies (Supplementary Figure 1). Meta-analyses were performed using the fixed-effects inverse-variance method using META v1.7 61 . Cochran's Q-statistic to test for heterogeneity and the I 2 statistic to quantify the proportion of the total variation due to heterogeneity were calculated. A Q-Q plot of the meta-analysis test statistics was also performed (Supplementary Figure 1) Definition of known and new risk loci. We sought to identify all associations for CRC previously reported at a significance level P < 5 × 10 −8 by referencing the NHGRI-EBI Catalog of published genome-wide association studies, and a literature search for the years 1998-2018 using PubMed (performed January 2018). Additional articles were ascertained through references cited in primary publications. Where multiple studies reported associations in the same region, we only considered the first reported genome-wide significant association. New loci were identified based on SNPs at P < 5 × 10 −8 using the meta-analysis summary statistics, with LD correlations from a reference panel of the European 1000 Genomes Project samples combined with UK10K. We only included one SNP per 500 kb interval. To measure the probability of the hits being false positives, the Bayesian False-Discovery Probability (BFDP) 9 was calculated based on a plausible OR of 1.2 (based on the 95 th percentile of the meta-analysis OR values) and a prior probability of association of 10 −5 . A conditional analysis was performed using Genomewide Complex Trait Analysis (GCTA) 62 , conditioning on the new and known SNPs, and SNPs with P conditioned < 5 × 10 −8 and r 2 > 0.1 were clumped using PLINK. The NSCCG-Oncoarray data were used to provide the LD reference data.
Fidelity of imputation. The reliability of imputation of the novel risk SNPs identified (all with an IMPUTE2 r 2 > 0.8) was assessed for 51 SNPs (comprising all new signals not directly genotyped) by examining the concordance between imputed and whole-genome sequenced genotypes in a subset of 201 samples from the CORGI and NSCCG studies. More than 98% concordance was found between the directly sequenced and imputed SNPs (Supplementary Data 14). eQTL analysis. In the INTERMPHEN study, biopsies of normal colorectal mucosa (trios of rectum, proximal colon and distal colon) were obtained from 131 UK individuals with self-reported European ancestry without CRC. Genotyping was performed using the Illumina Infinium Human Core Exome array, with quality control and imputation as above. RNA-seq was performed and data analysed as per the GTEx Project pipeline v7 using the 1000 Genomes and UK10K data as reference. Gene-level expression quantification was based on the GENCODE 19 annotation, collapsed to a single transcript model for each gene using a custom isoform procedure. Gene-level quantification (read counts and TPM values) was performed with RNA-SeQC v1.1.8. Gene expression was normalised using the TMM algorithm, implemented in edgeR, with inverse normal transformation, based on gene expression thresholds of >0.1 Transcripts Per Million (TPM) in ≥20% of samples and ≥6 reads in ≥20% of samples. cis-eQTL mapping was performed separately for proximal colon, distal colon and rectum samples using FastQTL. Principal components for the SNP data and additional covariate factors were identified using Probabilistic Estimation of Expression Residuals (PEER). P-values were generated for each variant-gene pair testing alternative hypothesis that the slope of a linear regression model between genotype and expression deviates from 0. The mapping window was defined as 1 Mb either side of the transcription start site. Beta distribution-adjusted empirical P-values from FastQTL were used to calculate Q-values, and FDR threshold of ≤0.05 was applied to identify genes with a significant eQTL. The normalised effect size of the eQTLs was defined as the slope of the linear regression, and computed as the effect of the alternative allele relative to the reference allele in the human genome reference GRCh37/ hg19). MetaTissue was used to generate a "pan-colonic" eQTL measure from the three individual RNA-seq datasets per patient.
To supplement this analysis, we performed SMR analysis 28 including all eQTLs with nominally significant associations (P < 0.05). We additionally examined for heterogeneity using the heterogeneity in dependent instruments (HEIDI) test, where P HEIDI < 0.05 were considered as reflective of heterogeneity and were excluded.
Promoter capture Hi-C. In situ promoter capture Hi-C (CHi-C) on LoVo and HT29 cell lines was performed as previously described 63 . Hi-C and CHi-C libraries were sequenced using HiSeq 2000 (Illumina). Reads were aligned to the GRCh37 build using bowtie2 v2.2.6 and identification of valid di-tags was performed using HiCUP v0.5.9. To declare significant contacts, HiCUP output was processed using CHiCAGO v1.1.8. For each cell line, data from three independent biological replicates were combined to obtain a definitive set of contacts. As advocated, interactions with a score ≥5.0 were considered to be statistically significant 64 .
Histone mark and transcription factor enrichment analysis. ChIP-seq data from colon crypt and tumour samples was obtained for H3K27ac and H3K4me1 65 . Multiple samples of the same tissue type or tumour stage were merged together. Additional ChIP-seq data from the Roadmap Epigenomics project 21 was obtained for H3K4me3, H3K27ac, H3K4me1, H3K27me3, H3K9ac, H3K9me3 and H3K36me3 marks in up to 114 tissues. Overlap enrichment analysis of CRC risk SNPs with these peaks was performed using EPIGWAS, as described by Trynka et al. 20 . Briefly, we evaluated if CRC risk SNPs and SNPs in LD (r 2 > 0.8) with the sentinel SNP, were enriched at ChIP-seq peaks in tissues by a permutation procedure with 10 5 iterations.
To examine enrichment in specific TF binding across risk loci, we adapted the variant set enrichment method of Cowper-Sal lari et al. 22 . Briefly, for each risk locus, a region of strong LD (defined as r 2 > 0.8 and D′ > 0.8) was determined, and these SNPs were termed the associated variant set (AVS). ChIP-seq uniform peak data were obtained for LoVo and HT29 cell lines (198 and 29 experiments, respectively) 66 and the above described histone marks. For each of these marks, the overlap of the SNPs in the AVS and the binding sites was determined to produce a mapping tally. A null distribution was produced by randomly selecting SNPs with the same characteristics as the risk-associated SNPs, and the null mapping tally calculated. This process was repeated 10 5 times, and P-values calculated as the proportion of permutations where the null mapping tally was greater or equal to the AVS mapping tally. An enrichment score was calculated by normalising the tallies to the median of the null distribution. Thus, the enrichment score is the number of standard deviations of the AVS mapping tally from the median of the null distribution tallies.
Functional annotation. For the integrated functional annotation of risk loci, LD blocks were defined as all SNPs in r 2 > 0.8 with the sentinel SNP. Risk loci were then annotated with five types of functional data: (i) presence of a CHi-C contact linking to a gene promoter, (ii) presence of an association from eQTL, (iii) presence of a regulatory state, (iv) evidence of TF binding, and (v) presence of a nonsynonymous coding change. Candidate causal genes were then assigned to CRC risk loci using the target genes implicated in annotation tracks (i), (ii), (iiii) and (iv). If the data supported multiple gene candidates, the gene with the highest number of individual functional data points was considered as the candidate. Where multiple genes had the same number of data points, all genes were listed. Direct nonsynonymous coding variants were allocated additional weighting. Competing mechanisms for the same gene (e.g. both coding and promoter variants) were allowed for. Finally, if no evidence was provided by these criteria, if the lead SNP was intronic we assigned candidacy on this basis, or if intergenic the nearest gene neighbour. Chromatin data were obtained from HaploReg v4 and regulatory regions from Ensembl.
Regional plots were created using visPIG 67 , using the data described above. We used ChromHMM to integrate DNAse, H3K4me3, H3K4me1, H3K27ac, Pol2 and CTCF states from the CRC cell line HCT116 using a multivariate Hidden Markov Model 68 . Chromatin annotation tracks for colonic mucosa (E075), rectal mucosa (E101) and sigmoid colon (E106) were obtained from the Roadmap Epigenomics Transcription factor binding disruption analysis. To determine if the risk variants or their proxies were disrupting motif binding sites, we used the motifbreakR package 69 . This tool predicts the effects of variants on TF binding motifs, using position probability matrices to determine the likelihood of observing a particular nucleotide at a specific position within a TF binding site. We tested the SNPs by estimating their effects on over 2,800 binding motifs as characterised by ENCODE, FactorBook, HOCOMOCO and HOMER. Scores were calculated using the relative entropy algorithm.
Heritability analysis. We used LDAK 35 to estimate the polygenic variance (i.e. heritability) ascribable to SNPs from summary statistic data for the GWAS datasets which were based on unselected cases (i.e. CORSA, COIN, Croatia, DACHS, FIN, SCOT, Scotland1, SOCCS/GS, SOCCS/LBC, UKBB and VQ58). SNP-specific expected heritability, adjusted for LD, MAF and genotype certainty, was calculated from the UK10K and 1000 Genomes data. Individuals were excluded if they were closely related, had divergent ancestry from CEU, or had a call rate <0.99. SNPs were excluded if they showed deviation from HWE with P < 1 × 10 −5 , genotype yield <95%, MAF <1%, SNP imputation score <0.99, and the absence of the SNP in the GWAS summary statistic data. This resulted in a total 6,024,731 SNPs used to estimate the heritability of CRC.
To estimate the sample size required to detect a given proportion of the GWAS heritability we implemented a likelihood-based approach to model the effect-size distribution 36 , using association statistics from the meta-analysis, and LD information from individuals of European ancestry in the 1000 Genomes Project Phase 3. LD values were based on an r 2 threshold of 0.1 and a window size of 1MB. The goodness of fit of the observed distribution of P-values against the expected from a two-component model (single normal distribution) and a three-component model (mixture of two normal distributions) were assessed, and a better fit was observed for the latter model. The percentage of GWAS heritability explained for a projected sample size was determined using this model, based on power calculations for the discovery of genome-wide significant SNPs. The genetic variance explained was calculated as the proportion of total GWAS heritability explained by SNPs reaching genome-wide significance at a given sample size. The 95% confidence intervals were determined using 10 5 simulations.
Cross-trait genetic correlation. LD score regression 39 was used to determine if any traits were correlated with CRC risk. GWAS summary data were obtained for allergy, asthma, coronary artery disease, fatty acids, lipids (total cholesterol, high density lipoprotein, low-density lipoprotein, triglycerides), auto-immune diseases (Crohn's disease, rheumatoid arthritis, atopic dermatitis, celiac disease, multiple sclerosis, primary biliary cirrhosis, inflammatory bowel disease, ulcerative colitis, systemic lupus erythematosus), anthropometric measures (BMI, height, body fat), glucose sensitivity (fasting glucose, fasting insulin, HbA1c), childhood measures (birth weight, birth length, childhood obesity, childhood BMI), eGFR and type 2 diabetes. All data were obtained for European populations. Summary statistics were reformatted to be consistent, and constrained to HapMap3 SNPs as these have been found to generally impute well. LD Scores were determined using 1000 Genomes European data.
Familial risk explained by risk SNPs. Under a multiplicative model, the contribution of risk SNPs to the familial risk of CRC was calculated from P k logλ k log λ 0 , where λ 0 is the familial risk to first-degree relatives of CRC cases, assumed to be 2.2 38 , and λ k is the familial relative risk associated with SNP k, calculated as λ k ¼ p k r 2 k þq k p k r k þq k ð Þ 2 , where p k is the risk allele frequency for SNP k, q k = 1−p k , and r k is the estimated per-allele OR from the meta-analysis 70 . The OR estimates were adjusted for the winner's curse using the FDR Inverse Quantile Transformation (FIQT) method 37 . We constructed a PRS including all 79 CRC risk SNPs discovered or validated by this GWAS in the risk-score modelling. The distribution of risk on an RR scale in the population is assumed to be log-normal with arbitrary population mean μ set to -σ 2 /2 and variance σ 2 ¼ 2 P k p k ð1 À p k Þβ 2 where β and p correspond to the log odds ratio and the risk allele frequency, respectively, for SNP k. The distribution of PRS among cases is right-shifted by σ 2 so that the overall mean PRS is 1.0 71 . The risk distribution was also performed assuming all common variation, using σ 2 ¼ logðλ 2 sib Þ, where λ sib = 1.79, as determined using the heritability estimate from GCTA.
Pathway analysis. SNPs were assigned to genes as described in the functional annotation section. The genes that mapped to genome-wide significant CRC risk SNPs were analysed using InBio Map, a manually curated database of proteinprotein interactions.
Gene set enrichment was calculated using GenGen. Enrichment scores were calculated using the meta-analysis results and were based on 10 3 permutations on the χ 2 values between SNPs. Pathway definitions were obtained from the Bader Lab 33 , University of Toronto, July 2018 release. This data contained pathway information from Gene Ontology (GO), Reactome, HumanCyc, MSigdb C2 (curated dataset), NCI Pathway, NetPath and PANTHER for a total of 7269 pathways. GO annotations that were inferred computationally were excluded. To avoid biasing the results, the meta-analysis SNPs were pruned to only those with an r 2 < 0.1 and a distance greater than 500 kb. Pathways were visualised using Cytoscape v3. 6 We thank the High-Throughput Genomics Group at the Wellcome Trust Centre for Human Genetics (funded by Wellcome Trust grant reference 090532/Z/09/Z) and the Edinburgh Clinical Research Facility (ECRF) Genetics Core, Western General Hospital, Edinburgh, for the generation of genotyping data.
We thank the Lothian Birth Cohorts' members, investigators, research associates, and other team members. We thank the Edinburgh Clinical Research Facility (ECRF) Genetics Core, Western General Hospital, Edinburgh, for genotyping. Lothian Birth Cohorts' data collection is supported by the Disconnected The work of the Colon Cancer Family Registry (CCFR) was supported by the National Cancer Institute (NCI) of the National Institutes of Health (NIH) under Award number U01 CA167551. The CCFR Illumina GWAS was supported by the NCI/NIH under Award Numbers U01 CA122839 and R01 CA143237 to G.C. The content of this manuscript does not necessarily reflect the views or policies of the NCI or any of the collaborating centres in the CCFR, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government or the CCFR.
The CORSA study was funded by FFG BRIDGE (grant 829675, to A.G.), the "Herzfelder'sche Familienstiftung" (grant to A.G.) and was supported by COST Action BM1206. We kindly thank all individuals who agreed to participate in the CORSA study. Furthermore, we thank all cooperating physicians and students and the Biobank Graz of the Medical University of Graz.
The Croatian study was supported through the 10,001 Dalmatians Project, and institutional support of University Hospital for Tumours, Sestre milosrdnice University Hospital Center.
James East and Simon Leedham were funded by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed not necessarily those of the NHS, the NIHR or the Department of Health.