Introduction

Colorectal cancer (CRC) was the third most commonly diagnosed cancer in males and the second in females worldwide in 20121, while it was the fifth in males and the third in females, with an estimated 310,244 new cases and 149,722 deaths in China in 20112. Several environmental factors, including diet, physical inactivity, obesity, cigarette smoking and alcohol consumption, were indicated as being involved in the occurrence and development of CRC3,4,5. On the other hand, it has been well established that genetic factors play an important role in CRC etiology6,7,8. Revolutionary genome-wide association studies (GWAS) and subsequent fine mapping researches have positioned over 30 susceptibility loci of CRC in Europeans9,10,11,12,13,14,15,16,17,18,19 and Asians20,21,22,23, however most variants have been found to be only tag single nucleotide polymorphism (SNPs) residing in intergenic and intronic regions without a clear function.

A major challenge in the post-GWAS era is to identify the specific genetic variants that accounts for phenotype based on their functional biology24. Recent reports showed that regulatory genome elements can greatly help to identify these causal SNPs, which could exert an effect on gene expression by modulating the activity of promoters, enhancers, insulators and silencers25,26,27. Today, regulatory genomic regions are usually characterized by various histone modifications in the flanking nucleosomes28,29. For example, enhancers are typically marked by H3K4me1 (histone H3 monomethylated at lysine 4) and promoters by H3K4me3 (histone H3 trimethylated at lysine 4) and both are regarded as active when additionally marked by H3K27ac (histone H3 acetylated at lysine 27)30,31,32,33. And chromatin immunoprecipitation-sequencing (ChIP-seq) of histone modifications has been widely used to map genome-wide enhancers and promoters34,35,36,37.

Fernandez-Rozadilla et al. mapped 3p21.31 as a CRC-relevant genomic locus in a Spanish population with 2362 cases and 2517 controls38, with a pooled P = 2.163E-06 (odds ratio (OR) = 0.784, 95% confidence interval (95% CI) = 0.709–0.867) for the tag SNP rs8180040. Although it didn’t reach the common GWAS significance threshold of 10−8 and wasn’t included in the larger-scale genetic study by Zhang et al. in East Asians23, we considered this locus containing abundant genes as an attractive region to be researched in the Chinese population. Moreover, the strongest risk polymorphism rs8180040 was not in any known transcribed or regulatory sequences and not likely the causal SNP, which meant the real functional SNPs remained mined in 3p21.31.

Using epigenomic data obtained from relevant cell types represents a powerful approach to identifying functional SNPs in post-GWAS genetic researches39,40,41,42. In this study, we analyzed ChIP-seq data of histone modifications from CRC cell lines, searched common variants within the regulatory elements of the risk-associated locus 3p21.31, investigated their associations with CRC risk via a two-stage case-control study in the Chinese population and tried to explain the underlying function by analyzing the data from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO). To the best of our knowledge, this is the first replication and exploration study on 3p21.31 in East Asians.

Results

Selection of Candidate SNPs

The LD block of GWAS susceptibility loci 3p21.31 was chromosome 3: 47035735-47452118. After a bioinformatics analysis (details in Methods), four common polymorphisms, rs2276854, rs807936, rs1076394 and rs807937, situated within the peaks of histone modification ChIP-Seq data generated from HCT116 or Caco2, were found in the above loci. These four SNPs were in high LD with each other (r2 > 0.9) according to the 1000 Genomes Project Phase 3 data of the CHB population. Among them, we saw rs1076394 as the most credibly functional variant due to its location in the overlapping region of H3k4me1 and H3k27ac ChIP-seq peaks (Table 1). The coexistence of these two histone modifications is broadly considered as a mark of active enhancer, while the appearance of either one of them is not. For the replication and exploration of 3p21.31, we genotyped the tag SNP rs8180040 and the potential regulatory SNP rs1076394 in Stage 1. We further validated the positive SNPs in another independent sample of Stage 2.

Table 1 Candidate regulatory SNPs in chromosome locus 3p21.31.

Population Characteristics

Descriptive characteristics of the subjects in this study are detailed in Table 2. In both stages, no significant differences were found between patients and controls in the distribution of sex and age. As expected, significantly more smokers were presented in the cases than in the controls, given that cigarette smoking was a well-established risk factor for CRC3. And we did not see the same distribution in drinking status.

Table 2 The characteristics of the study population.

Association Analysis

Both investigated polymorphisms, the presumably regulatory SNP rs1076394 and tag SNP rs8180040, were significantly associated with CRC risk in both stages and the combined analysis (Table 3).

Table 3 Association between individual SNP and colorectal cancer risk.

In Stage 1, under a multivariable logistic regression model adjusted for gender, age, smoking and drinking status, individuals with AA genotype of rs1076394 had a significantly reduced risk of CRC (OR = 0.723, 95% CI = 0.559–0.936, P = 0.014, P-FDR (false discovery rate) = 0.035) compared to those with GG homozygotes. A dominant model was used to improve statistical power by combining the GA with AA into an A-carrier group (GA plus AA) and it showed that the allele A carriers had an obviously protective effect on CRC susceptibility (OR = 0.802, 95% CI = 0.669–0.963, P = 0.018, P-FDR = 0.03). Likewise, a positive outcome was found in the additive models, with a per-A-allele OR of 0.847 (95% CI = 0.748–0.960, P = 0.009, P-FDR = 0.045). As for rs8180040, we successfully replicated this GWAS tagSNP under the dominant model (OR = 0.811, 95% CI = 0.677–0.970, P = 0.022, P-FDR = 0.825) and additive model (OR = 0.861, 95% CI = 0.758–0.979, P = 0.022, P-FDR = 0.825) with nominal significance, but failed after FDR corrections.

Two promising variants were both further genotyped in the validation Stage 2. In agreement with the Stage 1, nominal significant associations were still exhibited between CRC risk and rs1076394 (dominant model: OR = 0.745, 95% CI = 0.584–0.951, P = 0.018, P-FDR = 0.09; additive model: OR = 0.833, 95% CI = 0.708–0.980, P = 0.027, P-FDR = 0.068), or rs8180040 (dominant model: OR = 0.747, 95% CI = 0.586–0.952, P = 0.018, P-FDR = 0.09; additive model: OR = 0.837, 95% CI = 0.713–0.983, P = 0.030, P-FDR = 0.075). When we combined two stages, positive results were still observed after FDR corrections (Table 3). And two polymorphisms were in high LD (r2 = 0.895) with each other in our total samples, similar to the data of 1000 Genomes Phase 3 of CHB (r2 = 1.000).

The results of interaction analysis between the promising SNP rs1076394 and smoking were detailed in Table S2, where no significant interactions were observed under either the multiplicative or the additive model in the two stages and the combined study.

TCGA and GEO Data Analyses

We downloaded the data of gene expression, germline genotypes, CpG methylation and somatic copy number for COAD (colon adenocarcinoma) and READ (rectum adenocarcinoma) from the TCGA portal (http://cancergenome.nih.gov/) up to October 2014. Then, we performed a modified eQTL (expression Quantitative Trait Loci) analysis of the correlation between rs8180040 (rs1076394 wasn’t included in the Affymetrix GenomeWide SNP 6.0 Array of genotype profiles) and expression of genes within 1 Mb flanking regions, with the effects of somatic copy number and methylation being adjusted43. As shown in Table 4 and Fig. 1, rs8180040 was identified as an eQTL for genes NME6 (nucleoside diphosphate kinase 6, R2 = 0.019, P = 0.029, P-FDR = 0.404) and CCDC12 (coiled-coil domain-containing protein 12, R2 = 0.031, P = 0.005, P-FDR = 0.139), respectively. As the presumed regulatory variant rs1076394 was in complete LD with rs8180040 in the 1000 Genomes database, rs1076394 was also indicated as being closely related to the expression of CCDC12 and NME6. In addition, we compared the two genes’ expression between cancer and normal tissue and found a significant difference for NME6 (P = 0.029; CRC tissues: 229.5 ± 3.2 RPKM (reads per kilobases per million reads), peritumoral tissue: 210.5 ± 5.4 RPKM), but not for CCDC12 (P = 0.258; CRC tissues: 707.7 ± 15.3 RPKM, peritumoral tissue: 661.6 ± 16.49 RPKM).

Table 4 Expression correlation between rs8180040 and flanking 1 Mb genes.
Figure 1
figure 1

Association between rs8180040 and NME6/CCDC12 expression.

(A) Association between rs8180040 and NME6 expression. (B) Association between rs8180040 and CCDC12 expression.

In GEO database, the datasets of expression profiles in Asian samples (Dataset Records: GDS2609 and GDS4718) were provided, while TCGA data was mostly from Caucasian samples. We compared the NME6 expression between cancerous and normal colon tissues and found the same higher expression in cancer tissues from the Chinese population (P = 0.017; CRC tissues: 585.8 ± 29.3 RPKM, normal colon tissue: 482.2 ± 25.6 RPKM) and the Japanese population (P = 0.013; CRC tissues: 973.6 ± 79.9 RPKM, normal colon tissue: 771.4 ± 31.0 RPKM). As for CCDC12, we could not observe significant differences between cancer and normal tissues in two datasets (Chinese samples: P = 0.583, CRC tissues: 2219.9 ± 147.9 RPKM, normal colon tissue: 2118.0 ± 91.1 RPKM; Japanese samples: P = 0.062, CRC tissues: 22137.9 ± 971.7 RPKM, normal colon tissue: 24648.9 ± 870.9 RPKM).

Discussion

The identification of tag SNPs through GWAS is the important first step in understanding the relationship between genomic variation and CRC risk. But now, the foremost goal in the post-GWAS era is to shed light on the causal SNPs and their functional consequences, progressing from indirectly statistical to directly biological associations between genetic variation and disease. Accumulating evidence showed that the most likely mechanistic basis that links those noncoding genetic variants to phenotype and disease is being regulatory34.

In our study, by overlapping the LD boundaries of the locus 3p21.31 and regulatory regions predicted by CRC-specific histone modifications, we screened out the most promisingly functional rs1076394 among the four original polymorphisms in high LD. By conducting association studies in two independent Chinese populations containing 1327 cases and 2075 controls in total, we replicated the significances of tag SNP rs8180040 and found a significant protective effect for the potentially regulatory variant rs1076394 that might serve as an eQTL for the genes CCDC12 and NME6, while NME6 presented significantly higher expression in cancer tissues.

The findings led us to assume that rs1076394 might influence CRC risk by altering the activity of an enhancer that controlled NME6 expression. The potentially functional variant rs1076394 lay within a region of the genome exhibiting chromatin modifications H3k4me1 and H3k27ac in a CRC cell line HCT116, consistent with characteristics of an active enhancer across diverse tissues44,45. It was situated in the first intron of gene KIF9 (kinesin family member 9), which was involved in mitotic progression by maintaining correct spindle length46 and the degradation of the matrix by regulating macrophage podosomes47. However, according to our calculation of TCGA data, rs1076394 was related to the expression of CCDC12 and NME6, but not related to KIF9. We saw the SNP as a potential eQTL for CCDC12 and NME6, while we could not rule out the possibility that it was actually in LD with these two genes. The SNP rs1076394 was approximately 300kb upstream of CCDC12, which participated in promoting early erythroid differentiation48 and over 1000 kb upstream of NME6, which might play a role in the regulation of cell growth and cell cycle progression49,50,51. Either CCDC12 or NME6 was located in 3p21.3. But, NME6 was more suggested to be the actual contributing gene in this locus due to its significantly higher expression in cancer samples of both Caucasian and Asian populations. At the beginning, NME6 was discovered as a gene encoding a nucleoside diphosphate kinase that suppressed p53-induced apoptosis50. In a shRNA functional screen, Nme6 was reported to be crucial for the renewal of embryonic stem cells (ESCs). And ESCs are characterized by immortalization ability, pluripotency and oncogenicity52. More recently, enriched somatic mutations in NME6 was suggested to be associated with the deregulation of pyrimidine metabolism and the promotion of malignant progression in human melanoma53. Accordingly, NME6 may act as an oncogene that still needs more investigations.

The application of biofeature information such as ChIP-seq and RNA-seq data has represented an effective approach to identifying functionally regulatory SNPs and different databases including UCSC, Encode, GEO and TCGA have provided easy access to massive amounts of relevant data. Incorporating epigenetic and expression analyses into traditional molecular epidemiology could assist in the interpretation of GWAS results and the discovery of functional variants for diseases in post-GWAS studies. Using a similar strategy to other CRC-associated loci should deepen our understanding of CRC risk.

Nevertheless, several limitations should be noted here. First of all, due to the lack of relevant functional experiments, biological reality beneath the statistically significant association is uncertain. In the analysis of TCGA data, we have not restricted it to Chinese samples, when its composition is mostly Caucasian samples. It may not reflect the exact outcome of the Chinese population that we researched on. Second, the strategy of retrieving candidate polymorphisms depended on the prediction from ChIP-seq data of relevant cell lines, which was not rigorous enough to define exact regulatory elements and all the functional variants inside. Focusing on common SNPs, we could not rule out the possibility that sets of rare variants or haplotypes in LD with the tag SNP are actually causal in this locus. Third, insufficient epidemiological and clinical information prevented us from further investigating the interactions between gene and environment.

In summary, we discovered a probably regulatory SNP that was associated with CRC risk in the Chinese population. Researches on 3p21.31 and other CRC susceptibility loci with greater sample sizes and follow-up functional analyses are warranted to elaborate the biological mechanism of genetic etiology.

Methods

Study Participants

A two-stage case-control study was applied to evaluate the association between candidate variations and the risk of CRC. The discovery stage (Stage 1) consisted of 767 cases and 1397 controls, which were recruited from Tongji Hospital of Huazhong University of Science and Technology (HUST) between 2008 and 2012. The validation stage (Stage 2) involved 528 cases and 678 controls enrolled from 2013 to 2015 at the same hospital. All subjects were unrelated ethnic Han Chinese in both stages. Patients with histopathologically confirmed CRC and without previous chemotherapy or radiotherapy, were included without restriction to gender and age,. In the same time period, cancer-free controls were recruited form participants in physical examination programs of the same hospital and were adequately matched to cases by gender and age (±5 years). Definitions of smoking and drinking status were the same as in a previous study by our group54,55,56. At recruitment of each subject, a written informed consent was obtained and 2 millimeters peripheral venous blood was collected. This study was approved by the ethnics committee of Tongji Medical College of Huazhong University of Science and Technology and the methods were carried out in accordance with the approved guidelines.

SNP Selection and Genotyping

Candidate SNPs were common genetic variants (minor allele frequency, MAF > 0.05) that located in the putative regulatory elements of the 3p21.31 locus. First, we applied the software HaploView to calculate the linkage disequilibrium (LD) block of 3p21.31 with the criterion of r2 > 0.8, by inputting the Chinese Han Beijing (CHB) genotype information of 500 kb flanking the tagSNP rs8180040. This LD block was defined as the CRC susceptibility locus. Second, we downloaded ChIP-seq data regarding histone modification from two CRC cell lines, HCT116 and Caco2, form the UCSC database integrated with Encode data (S1 Table). And the extents of their signal peaks were considered as putative regulatory elements. Third, based on the CHB MAF data in dbSNP database, we only selected the common polymorphisms (MAF > 0.05) that situated within the overlapping parts between the LD block and peaks. Finally, four polymorphisms in high LD with each other (r2 > 0.9) survived after this step-wise analysis. Among them, rs1076394 was chosen as the most potentially functional variant for the genotyping assays, because of its more indicative location in an active enhancer than others’. At the same time, we also tried to replicate the tag SNP rs8180040 in our sample. In Stage 2, the nominal significant SNPs of Stage 1 were further validated. SNPs of both stages were genotyped by a TaqMan real-time polymerase chain reaction (PCR) assay (Applied Biosystems, Foster city, CA). Quality control was preformed by including 5% duplicate samples in blinded fashion, with a concordance rate of 100%.

Statistical Analysis

The differences in the distributions of gender, age, smoking, drinking status and genotypes between cases and controls were estimated by a χ2 test or t-test, where appropriate. The Hardy-Weinberg equilibrium (HWE) in controls was evaluated with a goodness-of-fit χ2 test. The odds ratio (ORs) and corresponding 95% confidence intervals (95% CIs) were used to measure the associations between SNPs and CRC susceptibility And they were calculated after adjusting for gender, age, smoking and drinking status under a multivariate logistic regression model. For multiple comparison corrections, a simple procedure (Benjamini and Hochberg) was performed in two stages and combined study to control the false discovery rate57. The LD of the candidate SNPs was analyzed using HaploView v4.258. With regard to TCGA and GEO data, the expression differences among three genotypes (TT, TA and AA) of rs8180040 were measured under a linear regression model adjusting the effects of somatic copy number and CpG methylation43 and the differences between cancer and normal samples were measured by t-test. The gene-environment interactions were evaluated by a pair-wise analysis under multiplicative59 and additive interaction models60. All the above statistical analyses were conducted using SPSS Software v20.0 (SPSS, Chicago, Illinois, USA), with the exception that the P values of additive interaction were assessed using Stata v11.0 (Stata Corporation, College Station, TX). P values in this study were two-sided with a significance criterion of P < 0.05.

Additional Information

How to cite this article: Ke, J. et al. Identification of a Potential Regulatory Variant for Colorectal Cancer Risk Mapping to 3p21.31 in Chinese Population. Sci. Rep. 6, 25194; doi: 10.1038/srep25194 (2016).