Main

CRC is a leading cause of cancer morbidity and mortality worldwide1. It is well established that genetic factors have an important role in the etiology of CRC2,3. Deleterious germline mutations in known susceptibility genes, notably APC (adenomatous polyposis coli), MLH1, MSH2, MSH6 and PMS2, confer high risk of CRC in hereditary cancer syndromes3,4,5,6. Most sporadic CRC cases, however, do not carry these high-penetrance mutations3,4. Since 2007, genome-wide association studies (GWAS) and subsequent fine-mapping analyses conducted in individuals of European descent have identified 21 low-penetrance susceptibility loci associated with CRC risk7,8,9,10,11,12,13,14,15,16,17. Together, these common loci explain less than 10% of the familial relative risk of CRC in European populations13,14. In a GWAS of 7,456 CRC cases and 11,671 controls conducted as part of the Asia Colorectal Cancer Consortium, we identified 3 new loci at 5q31.1 (near PITX1), 12p13.32 (near CCND2) and 20p12.3 (near HAO1) associated with CRC risk18. In addition, we discovered a new risk variant in the SMAD7 gene associated with CRC in East Asians19. Over the past 2 years, we have doubled the sample size in the Asia Colorectal Cancer Consortium and conducted a 4-stage GWAS, including 14,963 CRC cases and 31,945 controls, to identify additional susceptibility loci for CRC.

Results

Study overview

We performed a fixed-effects meta-analysis to evaluate approximately 2.4 million genotyped or imputed SNPs on 22 autosomes from 5 GWAS (stage 1) conducted in China, Japan and South Korea, including in total 2,098 CRC cases and 6,172 cancer-free controls (Supplementary Tables 1 and 2). There was little evidence of population stratification in these studies (Supplementary Figs. 1 and 2), with genomic inflation factor λ < 1.04 in each of the five studies and the meta-analysis (λ1,000 = 1.01). We selected 8,539 SNPs showing evidence of association with CRC risk (P < 0.05) according to prespecified criteria (Online Methods). We also included the 31 risk-associated variants identified by previous GWAS7,8,9,10,11,12,13,14,15,16,17,18,19,20, resulting in a total of 8,570 SNPs. Of these, 7,113 SNPs were successfully designed using Illumina Infinium assays as part of a large genotyping effort for multiple projects. Using this customized array, we genotyped an independent set of 3,632 CRC cases and 6,404 controls recruited in 3 studies (stage 2) conducted in China. After quality control exclusions, 6,899 SNPs remained for analysis in 3,519 cases and 6,275 controls. We evaluated associations between CRC risk and these SNPs in each study separately and then performed a fixed-effects meta-analysis to obtain summary estimates. Again, we observed little evidence of population stratification, either in the three studies individually (λ < 1.05) or combined (λ = 1.05, λ1,000 = 1.01) (Supplementary Fig. 3). In a meta-analysis of data from stages 1 and 2, we identified 559 SNPs showing evidence of association at P < 0.005. We then evaluated these SNPs using data from a large Japanese CRC GWAS (stage 3) with 2,814 CRC cases and 11,358 controls20. Thirty SNPs in 25 new loci were associated with CRC risk at P < 0.0001 in the meta-analysis of data from stages 1–3 and at P < 0.01 in the meta-analysis of stages 2 and 3. 
Of these SNPs, 29 were successfully genotyped in an independent sample of 6,532 CRC cases and 8,140 controls from 5 additional studies (stage 4) conducted in China, South Korea and Japan.
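
The λ1,000 values quoted above rescale the genomic inflation factor to an equivalent study of 1,000 cases and 1,000 controls. The following minimal sketch uses a commonly applied rescaling formula; this is an assumption, as the text does not state which formula was used, and the function name `lambda_1000` is illustrative.

```python
def lambda_1000(lam: float, n_cases: int, n_controls: int) -> float:
    """Rescale a genomic inflation factor to 1,000 cases and 1,000 controls.

    Assumed formula (not stated in the text):
    lambda_1000 = 1 + (lambda - 1) * (1/n_cases + 1/n_controls) / (1/1000 + 1/1000)
    """
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("sample sizes must be positive")
    scale = (1.0 / n_cases + 1.0 / n_controls) / (2.0 / 1000.0)
    return 1.0 + (lam - 1.0) * scale


# Stage 2 combined analysis: lambda = 1.05 with 3,519 cases and 6,275 controls.
stage2 = lambda_1000(1.05, 3519, 6275)
print(f"{stage2:.2f}")  # prints 1.01
```

Under this assumed formula, the stage 2 inputs reproduce the reported λ1,000 of 1.01.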

Newly identified risk-associated loci for CRC

In the meta-analysis of all data for the 29 SNPs from stages 1–4 with 14,963 CRC cases and 31,945 controls, signals from 10 SNPs, representing 6 new loci, showed convincing evidence of an association with CRC risk at the genome-wide significance level (P < 5 × 10−8), including rs704017 at 10q22.3; rs11196172 at 10q25.2; rs174537, rs4246215, rs174550 and rs1535 at 11q12.2; rs10849432 at 12p13.31; rs12603526 at 17p13.3; and rs1800469 and rs2241714 at 19q13.2 (Table 1, Supplementary Fig. 4 and Supplementary Tables 3 and 4). Associations of CRC risk with the top SNPs in each of the six loci were consistent across almost all studies, with no evidence of heterogeneity (Fig. 1). With the exception of the intergenic SNP rs10849432 at 12p13.31, the remaining nine newly identified risk-associated variants were located in exonic, promoter, 3′ UTR or intronic regions of known genes (Table 1). The linkage disequilibrium (LD) blocks (r2 > 0.5) tagged by rs704017 (10q22.3), rs174537 (11q12.2) and rs1800469 (19q13.2) each span multiple genes (Supplementary Table 5). The LD blocks tagged by rs11196172 (10q25.2) and rs12603526 (17p13.3) each lie within a single gene. The LD block tagged by rs10849432 (12p13.31) does not contain any known gene. Stratification analyses of the newly identified risk variants by tumor anatomical site (colon or rectum), population (Chinese, Korean or Japanese) and sex (male or female) did not identify any significant heterogeneity (Supplementary Tables 6, 7, 8). In addition to the six newly identified loci, three additional regions showed association with CRC risk near genome-wide significance at 8q24.11 (rs6469656; P = 5.38 × 10−8), 10q21.1 (rs4948317; P = 7.14 × 10−8) and 10q24.2 (rs12412391; P = 7.41 × 10−7). Results for all 29 SNPs across stages 1–4 are presented in Supplementary Table 3.

Table 1 Summary results for risk variants in the six newly identified loci associated with CRC in East Asians
Figure 1: Forest plots for risk-associated variants in the six newly identified loci.

(a) rs704017. (b) rs11196172. (c) rs174537. (d) rs10849432. (e) rs12603526. (f) rs1800469. Per-allele OR estimates are presented, with the area of each box proportional to the inverse-variance weight of the estimate. Horizontal lines represent 95% CIs. Diamonds represent summary OR estimates generated under a fixed-effects meta-analysis; width of the diamonds corresponds to 95% CIs. Continuous vertical lines represent the null value; dashed vertical lines represent the summary OR estimates for all studies for each SNP.
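
The inverse-variance weights and fixed-effects summary estimates described in this caption can be illustrated with a short sketch. It assumes the standard inverse-variance fixed-effects estimator on the log OR scale and a normal approximation for the CI and P value; the names `MetaResult` and `fixed_effects_meta` and the example inputs are hypothetical and not taken from the study.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MetaResult:
    summary_or: float
    ci_low: float
    ci_high: float
    p_value: float
    relative_weights: List[float]  # each study's share of the total weight


def fixed_effects_meta(log_ors: Sequence[float], ses: Sequence[float]) -> MetaResult:
    """Pool per-study log odds ratios using inverse-variance weights."""
    if len(log_ors) != len(ses) or len(log_ors) == 0:
        raise ValueError("need one standard error per log OR")
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    return MetaResult(
        summary_or=math.exp(pooled),
        ci_low=math.exp(pooled - 1.96 * pooled_se),
        ci_high=math.exp(pooled + 1.96 * pooled_se),
        p_value=math.erfc(abs(z) / math.sqrt(2.0)),  # two-sided Wald P value
        relative_weights=[w / total for w in weights],
    )


# Hypothetical per-study estimates (log OR, standard error); not study data.
example = fixed_effects_meta(
    [math.log(1.10), math.log(1.15), math.log(1.08)],
    [0.05, 0.08, 0.04],
)
print(example)
```

In a forest plot of this kind, the `relative_weights` values correspond to the relative box areas and the summary OR with its CI corresponds to the diamond.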

We performed conditional analyses for SNPs located within a 1-Mb region centered on the index SNP in each of the six newly identified loci. No second association signal was identified at P < 0.01 after adjusting for the respective index SNP (data not shown). Four SNPs at 11q12.2 and 2 SNPs at 19q13.2 showed association with CRC risk at P < 5 × 10−8, and we thus performed haplotype analysis for these 2 loci using genotype data available for 10,051 CRC cases and 14,415 controls (stages 2 and 4). Two common haplotypes were found in the 11q12.2 locus, accounting for more than 99% of the haplotypes constructed using the four highly correlated SNPs. The haplotype with all four risk-associated alleles (frequency = 0.574 in controls) was strongly associated with CRC risk (odds ratio (OR) = 1.40, 95% confidence interval (CI) = 1.29–1.51; P = 3.69 × 10−16) (Supplementary Table 9). Similarly, we identified two common haplotypes at the 19q13.2 locus, accounting for more than 99% of the haplotypes constructed using the two highly correlated SNPs. The haplotype with the risk-associated allele at both SNPs (frequency = 0.485 in controls) was also associated with increased risk of CRC (OR = 1.16, 95% CI = 1.08–1.26; P = 1.18 × 10−4) (Supplementary Table 10). Overall, these analyses did not identify an independent signal in any of the six newly identified loci.
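
As a worked illustration of how a reported OR, 95% CI and P value relate to one another, the sketch below back-calculates a standard error and Wald P value from the rounded 19q13.2 haplotype estimates. It assumes a normal approximation on the log OR scale, which the text does not confirm; because the inputs are rounded, the result should only approximate the reported P value. The function name `wald_from_ci` is illustrative.

```python
import math

Z_975 = 1.959963984540054  # 97.5th percentile of the standard normal distribution


def wald_from_ci(or_est: float, ci_low: float, ci_high: float):
    """Recover the standard error, z statistic and two-sided P value
    from an odds ratio and its 95% confidence interval."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_975)
    z = math.log(or_est) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p_value


# 19q13.2 risk haplotype as reported: OR = 1.16, 95% CI = 1.08-1.26.
se, z, p_value = wald_from_ci(1.16, 1.08, 1.26)
print(f"SE = {se:.4f}, z = {z:.2f}, P = {p_value:.2e}")
```

The recovered P value is of the same order of magnitude as the reported 1.18 × 10−4; exact agreement is not expected given rounding of the OR and CI.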

We examined potential SNP-SNP interactions between the 6 new risk-associated variants identified in this study (rs704017, rs11196172, rs174537, rs10849432, rs12603526 and rs1800469) and also between these 6 SNPs and the risk-associated variants in 25 previously reported loci (Supplementary Table 11). Seven pairs of SNPs showed suggestive evidence of multiplicative interaction (P < 0.05). None of these interactions, however, remained statistically significant after correcting for multiple comparisons in 180 tests (adjusted significance threshold, P = 0.00028).
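
The adjusted threshold of 0.00028 is consistent with a Bonferroni correction at a family-wise significance level of 0.05. That level is an assumption, as the text does not state it explicitly. A minimal sketch, with an illustrative function name:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# 180 pairwise interaction tests, assuming a family-wise alpha of 0.05.
interaction_threshold = bonferroni_threshold(0.05, 180)
print(f"{interaction_threshold:.5f}")  # prints 0.00028
```

The same calculation gives approximately 0.008 for comparisons across six loci and 0.002 for comparisons across 25 loci.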

We evaluated associations of the 10 newly identified SNPs with CRC risk in individuals of European descent using data from 3 consortia, the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO)17, the Colorectal Transdisciplinary (CORECT) Study and the Colon Cancer Family Registry (CCFR)21, with a total sample size of 16,984 CRC cases and 18,262 controls (Supplementary Table 12). In a meta-analysis of data from these consortia, all ten SNPs showed association with CRC risk in the same direction as observed in East Asians (Table 2). Five SNPs in two loci (10q22.3 and 11q12.2) were associated with CRC risk at P < 0.008 (corrected for multiple comparisons of six loci). These associations in individuals of European descent, however, were weaker than in East Asians. Tests showed statistically significant evidence of heterogeneity for risk variants at 11q12.2 and 19q13.2 (P < 0.008). The frequency of the risk-associated allele was also considerably different in East Asians and individuals of European ancestry for SNPs in five loci (Supplementary Table 13). For example, the minor allele (C) of rs12603526 is common in East Asians, whereas the minor allele frequency (MAF) is <0.02 in individuals of European descent. These differences might in part reflect distinct patterns of LD between the index SNPs and causal SNPs in these two populations. As expected, LD patterns for most of the newly identified loci were considerably different in East Asians and individuals of European descent (Supplementary Fig. 5). Large-scale fine-mapping of these loci will be helpful in identifying causal variants.

Table 2 Associations of risk variants in the six newly identified loci with CRC in individuals of European descent

Putative functional variants and candidate genes

We evaluated and annotated putative functional variants and candidate genes in each of the six newly identified loci using data from the 1000 Genomes Project22, HapMap 2 (ref. 23), the Encyclopedia of DNA Elements (ENCODE)24, expression quantitative trait locus (eQTL) databases25,26,27,28, the Catalogue of Somatic Mutations in Cancer (COSMIC)29, The Cancer Genome Atlas (TCGA) CRC project30, the Expression Atlas31, PubMed and Online Mendelian Inheritance in Man (OMIM) (Online Methods). We summarize the results below for each locus.

At the 10q25.2 locus, rs11196172 is located in intron 4 of the TCF7L2 gene. This SNP and other correlated SNPs (r2 > 0.5) fall within a region with strong enhancer activity and a DNase I hypersensitivity site annotated by ENCODE (Supplementary Table 14), suggesting a potential functional role for these SNPs. We found that the risk-associated allele of rs11196172 was significantly associated with higher expression of the TCF7L2 gene (P = 0.003) in colon tumor tissue using TCGA data (Fig. 2). The TCF7L2 gene encodes TCF7L2 (previously known as TCF4), which is a key transcription factor in the Wnt signaling pathway. Aberrant activation of Wnt signaling is found in more than 90% of CRCs30, and TCF7L2 is a known tumor suppressor in CRC. Loss of TCF7L2 function enhances CRC cell growth, whereas gain of function suppresses CRC cell growth32,33. The TCF7L2 gene is one of the most frequently mutated genes in CRC, with estimated point mutation rates of approximately 8–12.5% (refs. 29,30). Although TCF7L2 is the only gene in this locus (Supplementary Fig. 4), we also found that the risk-associated allele of rs11196172 was significantly associated with higher expression of the VTI1A gene (P = 5.1 × 10−4) in colon tumor tissue (Fig. 2). The VTI1A gene is located approximately 131 kb upstream of the TCF7L2 gene, and mRNA levels for these two genes are highly correlated in colon tumor tissue (r = 0.71; P < 0.0001). Recently, a recurrent gene fusion connecting the first three exons of VTI1A to the fourth exon of TCF7L2 was identified in approximately 3% of colorectal tumors34. It is possible that the VTI1A gene might also be involved in the association between rs11196172 and CRC risk.

Figure 2: Association of selected risk variants identified in this study with gene expression in colon tumor tissue.

(a) rs11196172 and TCF7L2. (b) rs11196172 and VTI1A. (c) rs1535 and FADS2. Gene expression levels are represented by reads per kilobase of exon per million mapped reads (RPKM) values for the three genotypes of each SNP, shown in red, blue and green. The median RPKM values and interquartile ranges (IQRs) for each SNP are presented in the overlaid box plots, and whiskers extend from 1.5 times the IQR below the lower quartile to 1.5 times the IQR above the upper quartile. In a and b, RPKM values are shown on a linear scale, whereas RPKM values in c are shown on a logarithmic scale owing to departure from a normal distribution. P values for associations between SNP genotypes and gene RPKM values were derived from a linear regression model.

At the 19q13.2 locus, we identified two perfectly correlated SNPs (rs1800469 and rs2241714; r2 = 1) associated with CRC risk. Of these, rs1800469 has previously been investigated with respect to CRC risk in many small candidate gene association studies, with conflicting results5. We herein provide for the first time, to our knowledge, convincing evidence of association for rs1800469 through our GWAS analysis. SNP rs1800469 maps to the promoter of the TGFB1 gene, and rs2241714 is a nonsynonymous SNP that results in an amino acid substitution at residue 11 of the B9D2 protein. The A allele of rs1800469 has been related to higher transcriptional activity for the TGFB1 gene and higher circulating levels of the transforming growth factor (TGF)-β1 protein than the G allele35. Both rs1800469 and rs2241714 are in perfect LD with another nonsynonymous SNP, rs1800470, which causes a proline-to-leucine substitution at residue 10 of the TGF-β1 protein. Although the two nonsynonymous SNPs are predicted to be tolerated36 or benign37, the Pro10 variant encoded by rs1800470 has also been associated with an increase in TGFB1 gene expression, TGF-β1 protein secretion and circulating levels of TGF-β1 protein38,39,40. Whereas rs2241714 is an eQTL for TGFB1, both rs1800469 and rs2241714 are also eQTLs for other genes in this locus (Supplementary Table 15). In addition to these three SNPs, we suggest that many highly correlated SNPs located in the TGFB1 gene might potentially have regulatory functions (Supplementary Table 14). The TGF-β1 protein is a member of the TGF-β signaling pathway. Somatic alterations of certain components in this pathway (TGFBR2, SMAD2, SMAD3 and SMAD4) are estimated to be present in almost half of CRCs41. High-penetrance germline mutations in the SMAD4 gene are known to cause juvenile polyposis, an autosomal dominant polyposis syndrome linked to a high risk of CRC42. 
Germline, allele-specific expression of the TGFBR1 gene has also been shown to contribute to increased risk of CRC43. Thus far, GWAS have identified at least six other independent SNPs that are located in or proximal to genes in the TGF-β signaling pathway (SMAD7, GREM1, BMP2, BMP4 and RHPN2)9,10,13,19. Our finding of an association between a genetic variant in the TGFB1 gene and CRC risk adds further evidence for the critical role of this pathway in colorectal tumorigenesis.

At the 11q12.2 locus, the four perfectly correlated SNPs rs174537, rs4246215, rs174550 and rs1535 lie in intron 24 of MYRF, the 3′ UTR of FEN1, intron 7 of FADS1 and intron 1 of FADS2, respectively. Of these SNPs, rs4246215 is an eQTL for the FEN1 gene in normal colorectal tissue44 and is predicted to affect microRNA (miRNA) binding site activity45. SNP rs174537 is an eQTL for the FADS1 and FADS2 genes in whole blood and other types of tissue (Supplementary Table 15). Using data from TCGA, we identified a strong correlation of rs1535 genotypes with FADS2 gene expression (P = 1.4 × 10−5) in colon tumor tissue (Fig. 2). These findings suggest that the potential functions of these SNPs might be mediated through their effects on their host genes. We also found that the FEN1, FADS1 and FADS2 genes are all highly expressed in colon tumor tissue compared with normal colon tissue (Supplementary Table 16). The FEN1 gene encodes flap structure–specific endonuclease 1, a protein that is essential for DNA repair, replication and degradation and that has a critical role in maintaining genome stability and protecting against carcinogenesis46. FEN1 mutations have been found in several human cancers47. Mouse models with haploinsufficiency for Fen1 showed rapid progression of CRC and reduced survival48. Two other genes in this locus, FADS1 and FADS2, respectively encode delta-5 and delta-6 desaturases, which are key enzymes in the metabolism of polyunsaturated fatty acids. Of these proteins, delta-6 desaturase is responsible for the synthesis of arachidonic acid49, the precursor of prostaglandin E2 (PGE2), which is a key molecule mediating the effect of cyclooxygenase-2 in colorectal carcinogenesis50. Notably, SNPs in perfect LD with the risk-associated variants for CRC identified in this study are strongly associated with circulating arachidonic acid levels49. 
We have shown previously that high levels of the PGE2 metabolite in urine, a marker of endogenous PGE2 production, are strongly related to higher risk of CRC51. Because the LD block of approximately 190 kb tagged by the four risk-associated variants covers many putatively functional SNPs that are located in the FEN1, FADS1 and FADS2 genes (Supplementary Fig. 6 and Supplementary Table 14), it is difficult to pinpoint a single SNP or gene that might be responsible for the association with CRC risk in this locus. Nevertheless, our study provides evidence of a potentially important role for the FEN1, FADS1 and FADS2 genes in the etiology of CRC.

At the 10q22.3 locus, rs704017 is located in intron 3 of the ZMIZ1-AS1 gene and resides in a strong enhancer region predicted using ENCODE data (Supplementary Fig. 6 and Supplementary Table 14). It also maps to a DNase I hypersensitivity site identified in the Caco-2 CRC cell line. In addition to the ZMIZ1-AS1 gene, the LD block tagged by rs704017 also includes the ZMIZ1 gene, whose expression is downregulated in the Caco-2 and HT-29 CRC cell lines31. In line with these observations, we found in TCGA data that ZMIZ1 gene expression is lower in colon tumor tissue compared with normal colon tissue (P = 3.28 × 10−6). In addition, somatic mutations in the ZMIZ1 gene have been reported in more than 2% of colon tumors29. Whereas ZMIZ1-AS1 is a miscellaneous RNA (miscRNA) gene with unknown function, the ZMIZ1 gene encodes the protein ZMIZ1, which regulates the activity of several transcription factors, including AR, SMAD3, SMAD4 and p53. It has been shown that ZMIZ1 might have a broader role in epithelial cancers, including CRC52. SNP rs704010, located in intron 1 of the ZMIZ1 gene, has been associated with breast cancer53. However, this SNP, which is in weak LD (r2 = 0.09) with the risk-associated variant we identified for CRC, was not associated with CRC in this study (data not shown). Given the biological function of the ZMIZ1 gene, it is possible that this gene is involved in the association observed in this locus.

In the 12p13.31 locus, rs10849432 maps to an LD block of approximately 52 kb with no known genes. ENCODE data suggest that rs4764551 and rs4764552, perfectly correlated with rs10849432, might be located in a strong enhancer region (Supplementary Table 14). Notably, rs4764551 also maps to a DNase I hypersensitivity site in the HCT-116 CRC cell line and a binding site for the CTCF protein in the Caco-2 CRC cell line. Using data from TCGA, we showed that the closest genes to rs10849432 (CD9, PLEKHG6 and TNFRSF1A) all have downregulated expression in colon tumor tissue (Supplementary Table 16). The CD9 gene encodes the CD9 antigen, which participates in many cellular processes, including differentiation, adhesion and signal transduction. Notably, CD9 has a critical role in the suppression of cancer cell motility and metastasis54, and overexpression of the CD9 gene is associated with favorable prognosis for patients with CRC55. CD9 is also involved in suppressing Wnt signaling56. Although the function of the PLEKHG6 gene is less clear, somatic mutations in this gene were found in approximately 2% of colon tumors29. The protein encoded by TNFRSF1A is a major receptor for tumor necrosis factor (TNF)-α and is known to be involved in cytokine-induced senescence in cancer57. In addition to evidence for the three nearby genes, we also found that rs4764552 is an eQTL for the LTBR gene (Supplementary Table 15). The LTβR protein has an essential role in lymphoid organ formation and has also been linked to cancer58, including CRC59. On the basis of these data, we propose that the CD9 gene is the most likely candidate to explain the association identified in this locus. However, potential roles for other genes cannot be excluded.

At the 17p13.3 locus, rs12603526 lies in intron 1 of the NXN gene, in a region covering several regulatory elements, including a DNase I hypersensitivity site, a strong enhancer region and a site with an effect on regulatory motifs as annotated by ENCODE (Supplementary Table 14). NXN gene expression was lower in the colon tumor tissue samples included in TCGA (P = 2.83 × 10−5). Nucleoredoxin, encoded by the NXN gene, has functions related to cell growth and differentiation60. Overexpression of the NXN gene has been found to suppress the Wnt signaling pathway, and nucleoredoxin dysfunction might cause activation of the transcription factor TCF (T cell factor), accelerated cell proliferation and enhancement of oncogenicity61. Further research is needed to determine the causal variant and biological mechanism for the association at this locus.

Previously reported CRC-associated loci in East Asians

We evaluated association evidence for 31 SNPs in 25 established CRC susceptibility loci7,8,9,10,11,12,13,14,15,16,17,18,19,20 by analyzing data from stages 1–3 and our previous GWAS18,19 with a total sample size of up to 11,934 CRC cases and 28,282 controls (Table 3 and Supplementary Table 17). We found further evidence to support the associations of the four loci identified previously in our GWAS conducted among East Asians (P = 1.40 × 10−10 to 3.05 × 10−15). Of the 23 SNPs in the 18 susceptibility loci previously identified by GWAS of individuals of European descent, 20 showed association with CRC risk at P < 0.05 in East Asians in the same direction as reported in the original studies7,8,9,10,11,12,13,14,15,16,17. These signals included 6 SNPs in 4 loci (1q41, 8q24.21, 10p14 and 18q21.1) with association at P < 5 × 10−8, 6 SNPs in 6 loci with association at P < 0.002 (significance level adjusted for multiple comparisons of 25 independent loci) and 8 SNPs in 8 additional loci with association at P < 0.05. Three SNPs in three loci were not associated with CRC risk (P > 0.05). Given that our study had a statistical power of >80% to identify an association with an OR of 1.05 at P = 0.05 for SNPs with a MAF of 0.20, it is unlikely that these three SNPs confer substantial risk of CRC in East Asian populations. In general, loci initially identified in individuals of European descent had smaller ORs in East Asians, with evidence of heterogeneity noted for three SNPs (P < 0.002). SNPs rs6691170 and rs16892766, identified by previous GWAS of individuals of European descent, are not polymorphic in East Asians, and SNP rs5934683 is located on the X chromosome. We did not have data to evaluate the associations of these three SNPs with CRC risk in this study.

Table 3 Association evidence in East Asians for risk variants in previously reported CRC susceptibility loci

Familial relative risk explained by CRC-associated loci

The six newly identified loci in this study explain approximately 2.1% of the familial relative risk of CRC in East Asians (Supplementary Table 18). Together with the four SNPs identified in our previous GWAS, these variants explain approximately 4.3% of the familial relative risk of CRC in East Asians. An additional 3.4% of the familial relative risk in East Asians can be explained by 18 independent SNPs initially identified in studies conducted among individuals of European descent and confirmed in this study, giving a total of 7.7%. On the basis of per-allele OR values derived from previously published GWAS7,8,9,10,11,12,13,14,15,16,17,18 and this study, we estimate that the SNPs in the 31 loci identified thus far explain approximately 9% of the familial relative risk of CRC in individuals of European descent (Supplementary Table 19), a level slightly higher than the 7.7% explained in East Asians.
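
The calculation behind such estimates is not described in this section. The sketch below shows one widely used approach and rests on several assumptions: a multiplicative (log-additive) per-allele model, ORs approximating relative risks, independent loci and an overall familial relative risk of 2.2 for first-degree relatives. The function names and example SNP values are hypothetical.

```python
import math
from typing import Iterable, Tuple


def locus_lambda(risk_allele_freq: float, per_allele_or: float) -> float:
    """Familial relative risk attributable to one SNP under a
    multiplicative model: (p*r^2 + q) / (p*r + q)^2."""
    p = risk_allele_freq
    q = 1.0 - p
    r = per_allele_or
    return (p * r * r + q) / (p * r + q) ** 2


def fraction_frr_explained(
    snps: Iterable[Tuple[float, float]],
    overall_frr: float = 2.2,  # assumed value; not stated in this section
) -> float:
    """Proportion of the familial relative risk explained by independent
    SNPs, each given as (risk allele frequency, per-allele OR)."""
    explained = sum(math.log(locus_lambda(p, r)) for p, r in snps)
    return explained / math.log(overall_frr)


# Hypothetical SNPs for illustration only; not values from Table 1.
hypothetical_snps = [(0.30, 1.10), (0.55, 1.08)]
print(f"{100 * fraction_frr_explained(hypothetical_snps):.2f}%")
```

Because contributions are summed on the log scale, the percentages for separate sets of independent loci add up, which is consistent with 4.3% and 3.4% combining to 7.7%.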

Discussion

In the largest GWAS conducted thus far among East Asians, we identified six new genetic loci associated with CRC risk and provided suggestive evidence for three additional previously unreported loci. In addition, we replicated 22 previously reported CRC susceptibility loci. Of the six newly identified loci, two map to genes (TCF7L2 and TGFB1) that have established roles in colorectal tumorigenesis. The other four loci are located in or proximal to genes that are functionally important in transcription regulation (ZMIZ1), genome maintenance (FEN1), fatty acid metabolism (FADS1 and FADS2), cancer cell motility and metastasis (CD9), and cell growth and differentiation (NXN). Risk-associated variants at some loci fall within potential functional regions, and two are associated with the expression levels of the TCF7L2 and FADS2 genes. This study expands current understanding of the genetic basis of CRC risk and provides evidence for new genes and biological pathways that might be involved in colorectal tumorigenesis.

On the basis of a large twin study conducted in Sweden, Denmark and Finland2, the heritabilities estimated for CRC, breast cancer and prostate cancer were 35%, 27% and 42%, respectively. Thus far, more than 70 low-penetrance susceptibility loci have been identified in GWAS for breast cancer62 or prostate cancer63, and these loci together explain approximately 14% and 30%, respectively, of the familial relative risk of these cancers in individuals of European descent. For CRC, however, only 31 low-penetrance susceptibility loci have been identified, explaining approximately 9% of the familial relative risk of CRC in individuals of European descent. Compared with GWAS of breast cancer and prostate cancer, studies conducted for CRC have been relatively small. Our study, in which we evaluated approximately 7,000 promising variants identified by GWAS in the replication stages, represents one of the largest efforts thus far to follow up genetic variants identified by GWAS. We identified six new loci, representing the largest number of new loci identified for CRC risk in a single study. Although multiple GWAS with sample sizes larger than the one in this study have been conducted among individuals of European descent13,14,16, we were still able to identify risk-associated variants with relatively large effect sizes. Our study further highlights the value of conducting GWAS in non-European populations to discover new susceptibility loci for CRC.

In summary, we have identified six new loci associated with CRC risk in this large GWAS conducted among East Asians. These new loci contain genes with established connections to colorectal tumorigenesis through major biological pathways such as Wnt and TGF-β signaling, as well as genes with important biological functions that have not yet been well linked to CRC. Our study considerably expands knowledge of the genetic landscape of CRC and provides direction for future studies to characterize the causal variants and functional mechanisms of these GWAS-identified loci.

Methods

Study participants.

This GWAS was conducted as part of the Asia Colorectal Cancer Consortium, comprising a total of 14,963 CRC cases and 31,945 controls of East Asian ancestry from 14 studies conducted in China, South Korea and Japan (Supplementary Table 1). Specifically, stage 1 (GWAS discovery) consisted of 5 studies: Shanghai CRC Study 1 (Shanghai-1; n = 3,102), Shanghai CRC Study 2 (Shanghai-2; n = 908), Guangzhou CRC Study 1 (Guangzhou-1; n = 1,603), Aichi CRC Study 1 (Aichi-1; n = 1,346) and Korean Cancer Prevention Study-II CRC (KCPS-II; n = 1,301). Samples for four of these studies were the same as those reported in our previous study18; for Shanghai-2, we added 423 controls from other studies64,65. Stage 2 consisted of 3 studies: Shanghai CRC Study 3 (Shanghai-3; n = 6,577), Guangzhou CRC Study 2 (Guangzhou-2; n = 809) and Guangzhou CRC Study 3 (Guangzhou-3; n = 2,408). Stage 3 included 1 study: the BioBank Japan CRC Study (BBJ; n = 14,172). Stage 4 consisted of 5 studies: Guangzhou CRC Study 4 (Guangzhou-4; n = 1,791), Aichi CRC Study 2 (Aichi-2; n = 708), Korean–National Cancer Center CRC Study (Korea-NCC; n = 2,721), Seoul CRC Study (Korea-Seoul; n = 1,522) and Hwasun Cancer Epidemiology Study–Colon and Rectum Cancer (HCES-CRC; n = 7,930). We estimated that our study had a statistical power of >80% to identify an association with an OR of 1.10 or greater at P < 5 × 10−8 for SNPs with a MAF as low as 0.30. We evaluated the generalizability of the newly identified associations with CRC risk in individuals of European descent in data from 3 consortia including 23 studies (Supplementary Table 12) with a total sample size of 16,984 cases and 18,262 controls recruited in the United States, Europe, Canada and Australia: the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO)17, the Colorectal Transdisciplinary (CORECT) Study and the Colon Cancer Family Registry (CCFR)21. 
Summary descriptions of participating studies are presented in the Supplementary Note. Study protocols were approved by the relevant review boards in the respective institutions, and informed consent was obtained from all study participants.

Laboratory procedures.

Genotyping of samples in stage 1 was conducted as described previously18,64,65,66,67,68,69 using the following platforms: the Affymetrix Genome-Wide Human SNP Array 6.0, the Illumina HumanOmniExpress BeadChip, the Illumina Infinium HumanHap550 BeadChip, the Illumina 660W-Quad BeadChip, the Illumina Human610-Quad BeadChip, the Illumina Infinium HumanHap610 BeadChip and the Affymetrix Genome-Wide Human SNP Array 5.0. We used a uniform quality control protocol as recently described18 to filter samples and SNPs. Genotyping and quality control methods are also presented in the Supplementary Note. After quality control exclusions, we obtained 502,145 autosomal SNPs for samples in Shanghai-1, 245,961 SNPs in Shanghai-2, 250,612 SNPs in Guangzhou-1, 232,426 SNPs in Aichi-1 and 312,869 SNPs in KCPS-II (Supplementary Table 2).

Genotyping for 3,632 cases and 6,404 controls in stage 2 was completed using Illumina Infinium assays as part of the customer add-on content for multiple projects to the Illumina HumanExome BeadChip (see URLs). Details of array design, genotyping, genotype calling and quality control are provided in the Supplementary Note. Samples were excluded according to the following criteria: (i) genotype call rate of <98%, (ii) genetically identical or duplicated samples, (iii) sex determined using genetic data inconsistent with epidemiological or clinical data, (iv) first- or second-degree relatives, (v) ancestry outliers or (vi) heterozygosity outliers. Genetic markers were excluded using the following criteria: (i) MAF = 0, (ii) genotype call rate of <98%, (iii) consistency rate of <98% in positive quality control samples, (iv) P for Hardy-Weinberg equilibrium < 1 × 10−5 in controls or (v) caution SNPs revealed by the Exome Chip Design group (see URLs). We obtained a final data set including 6,899 SNPs genotyped in 3,519 cases and 6,275 controls for this project.
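
The marker exclusion criteria listed above can be expressed as a simple filter. The following sketch assumes that the per-marker metrics have already been computed; the `MarkerQC` record, its field names and the example markers are hypothetical, and the caution flag stands in for criterion (v).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MarkerQC:
    snp_id: str
    maf: float                # minor allele frequency
    call_rate: float          # genotype call rate, as a proportion
    consistency_rate: float   # concordance in positive quality control samples
    hwe_p_controls: float     # Hardy-Weinberg equilibrium P value in controls
    caution_flag: bool = False  # flagged as a caution SNP


def passes_stage2_marker_qc(m: MarkerQC) -> bool:
    """Return True if the marker meets none of the stage 2 exclusion criteria."""
    return (
        m.maf > 0.0                     # (i) exclude monomorphic markers
        and m.call_rate >= 0.98         # (ii) exclude call rate < 98%
        and m.consistency_rate >= 0.98  # (iii) exclude consistency < 98%
        and m.hwe_p_controls >= 1e-5    # (iv) exclude HWE P < 1e-5
        and not m.caution_flag          # (v) exclude caution SNPs
    )


# Hypothetical markers for illustration only.
markers: List[MarkerQC] = [
    MarkerQC("snp_a", 0.21, 0.995, 1.0, 0.43),
    MarkerQC("snp_b", 0.00, 0.999, 1.0, 1.00),   # monomorphic
    MarkerQC("snp_c", 0.12, 0.970, 1.0, 0.20),   # low call rate
    MarkerQC("snp_d", 0.35, 0.990, 1.0, 2e-7),   # fails HWE
]
kept = [m.snp_id for m in markers if passes_stage2_marker_qc(m)]
print(kept)  # prints ['snp_a']
```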

Cases and controls in stage 3 were genotyped using the Illumina HumanHap610-Quad BeadChip. Quality control filters were based on criteria described previously20. Methods of genotyping and quality control procedures are also presented in the Supplementary Note. After sample and SNP exclusions, we generated a data set comprising 2,814 cases and 11,358 controls with 460,463 SNPs.

Stage 4 genotyping for 29 SNPs was conducted using the iPLEX Sequenom MassARRAY platform according to the manufacturer's protocols at the Vanderbilt Molecular Epidemiology Laboratory (Nashville, Tennessee, USA). Details of genotyping and quality control are provided in the Supplementary Note. We filtered out SNPs with (i) genotype call rate of <95%, (ii) genotyping consistency rate of <95% in positive control samples, (iii) an unclear genotype call or (iv) P for Hardy-Weinberg equilibrium of <1 × 10⁻⁵ in controls. Among the SNPs passing quality control filters, the average consistency rate was 99.9%, with a median value of 100%, in each of the five participating studies included in this stage.

Samples in GECCO, CORECT and CCFR were genotyped with Illumina and Affymetrix arrays17,21. Genotyping, quality control and imputation have been reported previously17,21 and are described in the Supplementary Note.

SNP selection.

Selection of SNPs for stage 2 replication was primarily based on the following criteria: (i) P < 0.05 in meta-analysis, (ii) P for heterogeneity > 0.0001, (iii) imputation R2 > 0.5 in each of the included studies, (iv) MAF > 0.05 in each of the included studies, (v) SNPs uncorrelated with established CRC SNPs (defined as r2 < 0.2 in the HapMap Asian population), (vi) SNPs uncorrelated with other SNPs identified in this project (r2 < 0.2) and (vii) data available in at least two studies (Supplementary Note). We included multiple SNPs in some regions with a prior association P value of <0.002 or with genes of interest. Risk variants identified from previously published GWAS were also included in the assay7,8,9,10,11,12,13,14,15,16,17,18,19,20. In total, 8,570 unique SNPs were selected. Of these, 7,113 SNPs were successfully designed. For stage 3 replication, we selected 559 SNPs according to the following criteria: (i) P < 0.005 in the meta-analysis of data from stages 1 and 2, (ii) association in the same direction in both stages and (iii) P for heterogeneity > 0.0001. For stage 4, we selected 30 SNPs on the basis of the following criteria: (i) P < 0.0001 in the meta-analysis of stages 1–3, (ii) P < 0.01 in the meta-analysis of stages 2 and 3, (iii) association in the same direction in the three stages and (iv) P for heterogeneity > 0.0001.
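The stage 4 selection rule described above can be written as a simple predicate. This is an illustrative Python sketch only: the function and argument names are hypothetical, and the example values are not taken from the study.

```python
def same_direction(betas):
    """True if every per-stage effect estimate (log OR) has the same sign."""
    return all(b > 0 for b in betas) or all(b < 0 for b in betas)

def select_for_stage4(p_meta_123, p_meta_23, betas_by_stage, p_het):
    """Criteria (i)-(iv) for carrying a SNP forward to stage 4."""
    return (p_meta_123 < 1e-4                    # (i) meta-analysis of stages 1-3
            and p_meta_23 < 0.01                 # (ii) meta-analysis of stages 2 and 3
            and same_direction(betas_by_stage)   # (iii) consistent direction
            and p_het > 1e-4)                    # (iv) no strong heterogeneity

# Hypothetical SNPs
print(select_for_stage4(5e-6, 2e-3, [0.09, 0.11, 0.08], 0.40))   # meets all four criteria
print(select_for_stage4(5e-6, 2e-3, [0.09, -0.02, 0.08], 0.40))  # direction differs in one stage
```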

Statistical and bioinformatics analysis.

Details of imputation and population substructure evaluation are provided in the Supplementary Note. Briefly, stage 1 imputation was performed with the CHB (Han Chinese in Beijing, China) and JPT (Japanese in Tokyo, Japan) HapMap 2 panel as the reference using the MACH v1.0 program70 (see URLs). Stage 3 imputation was conducted with phased data for JPT, CHS (Southern Han Chinese, China) and CHD (Chinese in Metropolitan Denver, Colorado) participants from 1000 Genomes Project phase 1 release v3 as the reference using MACH v1.0 and Minimac71 (see URLs). Regional imputation of genotype data from TCGA30 (see URLs) was performed with the GIANT ALL reference panel from 1000 Genomes Project phase 1 release v3 using MACH v1.0 and Minimac (see URLs). To evaluate imputation quality in our study, we directly genotyped the 10 newly identified risk variants in the approximately 2,800 samples included in stage 1. The concordance between imputed and genotyped data was very high, with mean values ranging from 96.00% to 99.96% for the ten SNPs (Supplementary Table 20). For rs10849432, the imputation quality for the Aichi-1 study was relatively low (R2 = 0.57), and data from this study were therefore not included in our final analysis. We evaluated population structure in studies included in stages 1 and 2 using principal-components analysis with EIGENSTRAT software72 (see URLs). On the basis of adjusted regression models including the first ten principal components, the genomic inflation factor λ was <1.04 in each of the five studies included in stage 1 and 1.0368 in the meta-analysis of all five studies (Supplementary Fig. 2). The λ value was <1.05 in each of the three studies included in stage 2 and 1.0525 in the meta-analysis of all three studies (Supplementary Fig. 3). 
A rescaled inflation statistic, λ1,000, representing the equivalent value for a study with 1,000 cases and 1,000 controls using the formula73 λ1,000 = 1 + 500 × (λ − 1) × (1/Ncases + 1/Ncontrols) was 1.01 in both stages 1 and 2. These findings show little evidence of population stratification in our studies.
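As a quick check of the rescaling, the formula can be evaluated with the λ values and post-quality control sample sizes reported for stages 1 and 2 (2,098 cases and 6,172 controls; 3,519 cases and 6,275 controls). This is a minimal Python sketch; the function name is illustrative and not part of any cited software.

```python
def lambda_1000(lam: float, n_cases: int, n_controls: int) -> float:
    """Rescale a genomic inflation factor to 1,000 cases and 1,000 controls."""
    return 1.0 + 500.0 * (lam - 1.0) * (1.0 / n_cases + 1.0 / n_controls)

stage1 = lambda_1000(1.0368, 2098, 6172)
stage2 = lambda_1000(1.0525, 3519, 6275)
print(f"stage 1: {stage1:.2f}, stage 2: {stage2:.2f}")  # both round to 1.01
```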

Associations between SNPs and CRC risk were evaluated on the basis of the log-additive model using Mach2dat70, PLINK (version 1.07)74, R version 3.0.0 and SAS version 9.3 (for all, see URLs). Per-allele OR estimates and 95% CIs were derived from logistic regression models, adjusting for age, sex and the first ten principal components when appropriate. Association analysis was conducted for each participating study separately, and a fixed-effects meta-analysis with the inverse-variance method was conducted using the Metal75 program to obtain summary results for each of the four stages and for all stages combined. SNPs showing an association at P < 5 × 10⁻⁸ in the combined analysis of all studies were considered genome-wide significant. We also performed stratified analyses for the top SNPs by tumor anatomical site (colon or rectum), population (Chinese, Korean or Japanese) and sex (male or female). We estimated heterogeneity across studies and subgroups with Cochran's Q test76, with P for heterogeneity < 0.008 considered statistically significant to account for multiple comparisons across six independent loci. Independent signals in a locus were identified using R software (see URLs) with stepwise logistic regression models conditioning on the top risk-associated variant in each of the new loci. We estimated haplotype frequencies using Haploview (version 4.2)77 (see URLs) and conducted haplotype association analysis with logistic regression models in SAS Genetics v9.3 for two loci (11q12.2 and 19q13.2) where two or more SNPs were identified. Pairwise SNP-SNP interactions among the 6 top risk-associated variants in the newly identified loci with association P < 5 × 10⁻⁸, and between these 6 SNPs and the risk-associated variants in 25 previously reported loci, were evaluated using the maximum-likelihood ratio test with inclusion of interaction terms in logistic regression models. Interactions with P < 0.00028 were considered statistically significant with adjustment for multiple comparisons of 180 tests.
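The inverse-variance fixed-effects pooling and Cochran's Q statistic mentioned above can be sketched in a few lines of Python. This is a generic textbook formulation, not the Metal implementation; the function name and the per-study estimates are hypothetical, and the P value uses a normal approximation.

```python
import math

def fixed_effects_meta(betas, ses):
    """Pool per-study log odds ratios with inverse-variance weights.

    Returns the pooled log OR, its standard error, a two-sided P value,
    Cochran's Q and the degrees of freedom for Q.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return beta, se, p, q, len(betas) - 1

# Hypothetical per-study estimates (log OR and standard error)
betas = [0.10, 0.14, 0.08]
ses = [0.04, 0.05, 0.03]
beta, se, p, q, df = fixed_effects_meta(betas, ses)
low, high = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)
print(f"OR = {math.exp(beta):.3f} (95% CI {low:.3f}-{high:.3f}), "
      f"P = {p:.2e}, Q = {q:.2f} on {df} df")
```

Q is compared against a chi-square distribution with the stated degrees of freedom to obtain the P value for heterogeneity.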

The familial relative risk (λ) for the offspring of an affected individual due to a single locus was estimated using a log-additive model: λ = (pr² + q)/(pr + q)², where p is the frequency of the risk allele, q = 1 − p is the frequency of the reference allele and r is the per-allele relative risk78. The proportion of the familial relative risk explained by this locus, assuming a multiplicative interaction between markers in the locus and other loci, was calculated as log(λ)/log(λ₀), where λ₀ is the overall familial relative risk. λ₀ was set to 2.2 for CRC risk, as estimated from a meta-analysis79. Assuming that the risks associated with individual loci combine multiplicatively, the familial relative risks also multiply. Thus, the combined contribution of the familial relative risks from multiple loci is equal to

$$\frac{\sum_{i} \log(\lambda_i)}{\log(\lambda_0)},$$

where λᵢ is the familial relative risk due to locus i.
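The per-locus calculation can be illustrated with a short Python sketch. It assumes that the combined contribution is the sum of the per-locus log(λ) values divided by log(λ₀), which follows from multiplying the per-locus familial relative risks. The function names and the allele frequencies and relative risks in the example are hypothetical, not estimates from this study.

```python
import math

def locus_frr(p: float, r: float) -> float:
    """Familial relative risk to offspring due to one locus (log-additive model)."""
    q = 1.0 - p
    return (p * r ** 2 + q) / (p * r + q) ** 2

def proportion_explained(loci, lambda0: float = 2.2) -> float:
    """Share of log(lambda0) explained by loci whose risks combine multiplicatively."""
    return sum(math.log(locus_frr(p, r)) for p, r in loci) / math.log(lambda0)

# Hypothetical (risk allele frequency, per-allele relative risk) pairs
loci = [(0.30, 1.10), (0.55, 1.08), (0.20, 1.12)]
print(f"{100 * proportion_explained(loci):.2f}% of the familial relative risk")
```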

We generated forest plots and quantile-quantile plots using R software (see URLs). Regional association plots for SNPs in newly identified loci were generated using the website-based tool LocusZoom (version 1.1)80 (see URLs). LD structure between SNPs was determined on the basis of data from 1000 Genomes Project Pilot 1 or HapMap 2 as provided by the website-based tool SNAP81 (see URLs) and plotted using Haploview, SNAP and the UCSC Genome Browser (see URLs). LD blocks were defined using HapMap recombination rates and hotspots23. All genomic coordinates are based on NCBI Build 36.

To find putative functional variants for newly identified loci, we identified all SNPs in LD (r2 > 0.5) with the risk-associated variants using data from the 1000 Genomes Project22 and HapMap 2 (ref. 23). We mapped the genomic locations of these SNPs to nonsynonymous sites, splice sites, promoters, nearGene-3 regions, nearGene-5 regions, 3′ UTRs, 5′ UTRs, introns and intergenic regions. We evaluated the potential functional effect of nonsynonymous SNPs using the prediction algorithms SIFT36 and PolyPhen-2 (ref. 37) (see URLs). We predicted the putative function of SNPs in promoters, nearGene-3 regions, nearGene-5 regions, 3′ UTRs and 5′ UTRs with the SNPinfo Web Server45 (see URLs). We conducted analyses to evaluate the potential regulatory effect of SNPs in noncoding regions on transcription using the ENCODE tool HaploReg (v2)82 and the UCSC Genome Browser (see URLs) on the basis of their location within regions of promoter or enhancer activity, DNase I hypersensitivity, local histone modification, proteins bound to these regulatory sites, cis-eQTL and transcription factor binding motifs. We obtained additional functional evidence for these SNPs from the published literature.

We identified all genes that localize to 1-Mb windows centered on the top risk-associated variants in our newly identified loci, including SNPs correlated (r2 > 0.5) with the top risk variants. To determine whether these genes might explain the observed associations in these loci, we first examined genome-wide cis-eQTL data in multiple tissues from four major eQTL databases: the Blood eQTL Browser25, the eQTL Browser26, the Genotype-Tissue Expression (GTEx) Project27 and the Multiple Tissue Human Expression Resource (MuTHER) Project28. The significance threshold for these analyses was set to P < 0.008 to account for six tests. Somatic mutations of these genes were evaluated using data from COSMIC29 (see URLs). Expression levels of these genes in CRC cell lines were assessed using data from the Expression Atlas31 (see URLs). To correct for multiple comparisons of the 11 key genes, associations with P < 0.0045 were considered to be statistically significant. We searched the published literature for these genes with respect to CRC in PubMed and OMIM (see URLs).

Expression analysis.

We downloaded RNA sequencing (level 1) and SNP array (level 2) data for 364 colon adenocarcinoma and 18 normal colon tissue samples from TCGA30 (see URLs). To quantify expression levels of candidate genes in the newly identified loci, we normalized gene expression levels using RPKM (reads per kilobase of exon per million mapped reads) values as previously described83. Expression differences between tumor and normal samples for each gene were evaluated on the basis of RPKM values with the Wilcoxon rank-sum test. Associations between gene RPKM values and SNP genotypes were analyzed using a linear regression model including age and sex as covariates. We converted the RPKM value of a gene to log scale for analysis if it was not normally distributed. We considered P < 0.0045 to be statistically significant with adjustment for testing of the 11 key genes.
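For reference, the RPKM normalization and the multiple-testing threshold used above can be reproduced with a minimal Python sketch. The read counts and gene length are hypothetical, and the function name is illustrative; the actual pipeline follows the cited method83.

```python
def rpkm(gene_reads: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon per million mapped reads."""
    return gene_reads * 1e9 / (exon_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads over 2,000 bp of exon in a library of 20 million mapped reads
value = rpkm(gene_reads=500, exon_length_bp=2000, total_mapped_reads=20_000_000)
print(value)  # 12.5

# Bonferroni-style threshold for the 11 key genes
threshold = 0.05 / 11
print(f"{threshold:.4f}")  # 0.0045
```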

URLs.

1000 Genomes Browser, http://browser.1000genomes.org/index.html; BioBank Japan (in Japanese), http://biobankjp.org/; Blood eQTL browser, http://genenetwork.nl/bloodeqtlbrowser/; Catalogue of Somatic Mutations in Cancer (COSMIC), http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/; database of Genotypes and Phenotypes (dbGaP), http://www.ncbi.nlm.nih.gov/gap; EIGENSTRAT, http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm; eQTL Browser from the University of Chicago, http://eqtl.uchicago.edu/Home.html; GTEx eQTL Browser, http://www.ncbi.nlm.nih.gov/projects/gap/eqtl/index.cgi/; Expression Atlas, http://www.ebi.ac.uk/gxa/; Haploview, http://www.broad.mit.edu/mpg/haploview/; HaploReg v2, http://www.broadinstitute.org/mammals/haploreg/haploreg.php; HapMap Project, http://hapmap.ncbi.nlm.nih.gov/; Illumina HumanExome-12v1_A BeadChip, http://genome.sph.umich.edu/wiki/Exome_Chip_Design; International Mouse Phenotyping Consortium (IMPC), https://www.mousephenotype.org/; LocusZoom, http://csg.sph.umich.edu/locuszoom/; MACH 1.0, http://www.sph.umich.edu/csg/abecasis/MACH/; Mach2dat, http://genome.sph.umich.edu/wiki/Mach2dat:_Association_with_MACH_output; Minimac, http://genome.sph.umich.edu/wiki/Minimac; Metal, http://www.sph.umich.edu/csg/abecasis/Metal/; Multiple Tissue Human Expression Resource (MuTHER) Project, http://www.muther.ac.uk/; Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/omim/; PLINK version 1.07, http://pngu.mgh.harvard.edu/~purcell/plink/; PolyPhen-2, http://genetics.bwh.harvard.edu/pph2/; R version 3.0.0, http://www.r-project.org/; SAS version 9.2, http://www.sas.com/; SIFT, http://sift.jcvi.org/; SNAP, http://www.broadinstitute.org/mpg/snap/; The Cancer Genome Atlas (TCGA), http://cancergenome.nih.gov/; TRANSFAC, http://www.gene-regulation.com/pub/databases.html; UCSC Genome Browser, http://genome.ucsc.edu/.