Identification of a Potential Regulatory Variant for Colorectal Cancer Risk Mapping to 3p21.31 in Chinese Population

Genome-wide association studies (GWAS) have established chromosome 3p21.31 as a susceptibility locus for colorectal cancer (CRC) that lacks replication and exploration in the Chinese population. We searched potentially functional single nucleotide polymorphisms (SNPs) in the linkage disequilibrium (LD) block of 3p21.31 with chromatin immunoprecipitation-sequencing (ChIP-seq) data of histone modification, and tested their association with CRC via a case-control study involving 767 cases and 1397 controls in stage 1 and 528 cases and 678 controls in stage 2. In addition to the tag SNP rs8180040 (odds ratio (OR) = 0.875, 95% confidence interval (95% CI) = 0.793−0.966, P = 0.008, P-FDR (false discovery rate) = 0.040), rs1076394 presented consistently significant associations with CRC risk at both stages with OR = 0.850 (95% CI = 0.771−0.938, P = 0.001, P-FDR = 0.005) under the additive model in combined analyses. Supported by the analyses of data from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), it was suggested that rs1076394 served as an expression Quantitative Trait Loci (eQTL) for gene CCDC12 and NME6, while NME6’s expression was obviously higher in CRC tissues. Using biofeature information such as ChIP-seq and RNA sequencing (RNA-seq) data might help researchers to interpret GWAS results and locate functional variants for diseases in the post-GWAS era.

Colorectal cancer (CRC) was the third most commonly diagnosed cancer in males and the second in females worldwide in 2012 1 , while it was the fifth in males and the third in females, with an estimated 310,244 new cases and 149,722 deaths in China in 2011 2 . Several environmental factors, including diet, physical inactivity, obesity, cigarette smoking and alcohol consumption, were indicated as being involved in the occurrence and development of CRC [3][4][5] . On the other hand, it has been well established that genetic factors play an important role in CRC etiology [6][7][8] . Revolutionary genome-wide association studies (GWAS) and subsequent fine mapping researches have positioned over 30 susceptibility loci of CRC in Europeans [9][10][11][12][13][14][15][16][17][18][19] and Asians [20][21][22][23] , however most variants have been found to be only tag single nucleotide polymorphism (SNPs) residing in intergenic and intronic regions without a clear function.
A major challenge in the post-GWAS era is to identify the specific genetic variants that accounts for phenotype based on their functional biology 24 . Recent reports showed that regulatory genome elements can greatly help to identify these causal SNPs, which could exert an effect on gene expression by modulating the activity of promoters, enhancers, insulators and silencers [25][26][27] . Today, regulatory genomic regions are usually characterized by various histone modifications in the flanking nucleosomes 28,29 . For example, enhancers are typically marked by H3K4me1 (histone H3 monomethylated at lysine 4) and promoters by H3K4me3 (histone H3 trimethylated at lysine 4), and both are regarded as active when additionally marked by H3K27ac (histone H3 acetylated at lysine 27) [30][31][32][33] . And chromatin immunoprecipitation-sequencing (ChIP-seq) of histone modifications has been widely used to map genome-wide enhancers and promoters [34][35][36][37] .
Fernandez-Rozadilla et al. mapped 3p21.31 as a CRC-relevant genomic locus in a Spanish population with 2362 cases and 2517 controls 38 , with a pooled P = 2.163E-06 (odds ratio (OR) = 0.784, 95% confidence interval Scientific RepoRts | 6:25194 | DOI: 10.1038/srep25194 (95% CI) = 0.709-0.867) for the tag SNP rs8180040. Although it didn't reach the common GWAS significance threshold of 10 −8 and wasn't included in the larger-scale genetic study by Zhang et al. in East Asians 23 , we considered this locus containing abundant genes as an attractive region to be researched in the Chinese population. Moreover, the strongest risk polymorphism rs8180040 was not in any known transcribed or regulatory sequences and not likely the causal SNP, which meant the real functional SNPs remained mined in 3p21.31.
Using epigenomic data obtained from relevant cell types represents a powerful approach to identifying functional SNPs in post-GWAS genetic researches [39][40][41][42] . In this study, we analyzed ChIP-seq data of histone modifications from CRC cell lines, searched common variants within the regulatory elements of the risk-associated locus 3p21.31, investigated their associations with CRC risk via a two-stage case-control study in the Chinese population, and tried to explain the underlying function by analyzing the data from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO). To the best of our knowledge, this is the first replication and exploration study on 3p21.31 in East Asians.

Results
Selection of Candidate SNPs. The LD block of GWAS susceptibility loci 3p21.31 was chromosome 3: 47035735-47452118. After a bioinformatics analysis (details in Methods), four common polymorphisms, rs2276854, rs807936, rs1076394 and rs807937, situated within the peaks of histone modification ChIP-Seq data generated from HCT116 or Caco2, were found in the above loci. These four SNPs were in high LD with each other (r 2 > 0.9) according to the 1000 Genomes Project Phase 3 data of the CHB population. Among them, we saw rs1076394 as the most credibly functional variant due to its location in the overlapping region of H3k4me1 and H3k27ac ChIP-seq peaks ( Table 1). The coexistence of these two histone modifications is broadly considered as a mark of active enhancer, while the appearance of either one of them is not. For the replication and exploration of 3p21.31, we genotyped the tag SNP rs8180040 and the potential regulatory SNP rs1076394 in Stage 1. We further validated the positive SNPs in another independent sample of Stage 2.
Population Characteristics. Descriptive characteristics of the subjects in this study are detailed in Table 2.
In both stages, no significant differences were found between patients and controls in the distribution of sex and age. As expected, significantly more smokers were presented in the cases than in the controls, given that cigarette smoking was a well-established risk factor for CRC 3 . And we did not see the same distribution in drinking status. Association Analysis. Both investigated polymorphisms, the presumably regulatory SNP rs1076394 and tag SNP rs8180040, were significantly associated with CRC risk in both stages and the combined analysis (Table 3).
In Stage 1, under a multivariable logistic regression model adjusted for gender, age, smoking and drinking status, individuals with AA genotype of rs1076394 had a significantly reduced risk of CRC (OR = 0.723, 95% CI = 0.559-0.936, P = 0.014, P-FDR (false discovery rate) = 0.035) compared to those with GG homozygotes. A dominant model was used to improve statistical power by combining the GA with AA into an A-carrier group (GA plus AA), and it showed that the allele A carriers had an obviously protective effect on CRC susceptibility (OR = 0.802, 95% CI = 0.669-0.963, P = 0.018, P-FDR = 0.03). Likewise, a positive outcome was found in the additive models, with a per-A-allele OR of 0.847 (95% CI = 0.748-0.960, P = 0.009, P-FDR = 0.045). As  (Table 3). And two polymorphisms were in high LD (r 2 = 0.895) with each other in our total samples, similar to the data of 1000 Genomes Phase 3 of CHB (r 2 = 1.000).
The results of interaction analysis between the promising SNP rs1076394 and smoking were detailed in Table S2, where no significant interactions were observed under either the multiplicative or the additive model in the two stages and the combined study. TCGA and GEO Data Analyses. We downloaded the data of gene expression, germline genotypes, CpG methylation and somatic copy number for COAD (colon adenocarcinoma) and READ (rectum adenocarcinoma) from the TCGA portal (http://cancergenome.nih.gov/) up to October 2014. Then, we performed a modified eQTL (expression Quantitative Trait Loci) analysis of the correlation between rs8180040 (rs1076394 wasn't included in the Affymetrix GenomeWide SNP 6.0 Array of genotype profiles) and expression of genes within 1 Mb flanking regions, with the effects of somatic copy number and methylation being adjusted 43 . As shown in Table 4 and Fig. 1      expression between cancerous and normal colon tissues, and found the same higher expression in cancer tissues from the Chinese population (P = 0.017; CRC tissues: 585.8 ± 29.3 RPKM, normal colon tissue: 482.2 ± 25.6 RPKM), and the Japanese population (P = 0.013; CRC tissues: 973.6 ± 79.9 RPKM, normal colon tissue: 771.4 ± 31.0 RPKM). As for CCDC12, we could not observe significant differences between cancer and normal tissues in two datasets (Chinese samples: P = 0.583, CRC tissues: 2219.9 ± 147.9 RPKM, normal colon tissue: 2118.0 ± 91.1 RPKM; Japanese samples: P = 0.062, CRC tissues: 22137.9 ± 971.7 RPKM, normal colon tissue: 24648.9 ± 870.9 RPKM).

Discussion
The identification of tag SNPs through GWAS is the important first step in understanding the relationship between genomic variation and CRC risk. But now, the foremost goal in the post-GWAS era is to shed light on the causal SNPs and their functional consequences, progressing from indirectly statistical to directly biological associations between genetic variation and disease. Accumulating evidence showed that the most likely mechanistic basis that links those noncoding genetic variants to phenotype and disease is being regulatory 34 . In our study, by overlapping the LD boundaries of the locus 3p21.31 and regulatory regions predicted by CRC-specific histone modifications, we screened out the most promisingly functional rs1076394 among the four original polymorphisms in high LD. By conducting association studies in two independent Chinese populations containing 1327 cases and 2075 controls in total, we replicated the significances of tag SNP rs8180040, and found a significant protective effect for the potentially regulatory variant rs1076394 that might serve as an eQTL for the genes CCDC12 and NME6, while NME6 presented significantly higher expression in cancer tissues.
The findings led us to assume that rs1076394 might influence CRC risk by altering the activity of an enhancer that controlled NME6 expression. The potentially functional variant rs1076394 lay within a region of the genome exhibiting chromatin modifications H3k4me1 and H3k27ac in a CRC cell line HCT116, consistent with characteristics of an active enhancer across diverse tissues 44,45 . It was situated in the first intron of gene KIF9 (kinesin family member 9), which was involved in mitotic progression by maintaining correct spindle length 46 , and the degradation of the matrix by regulating macrophage podosomes 47 . However, according to our calculation of TCGA data, rs1076394 was related to the expression of CCDC12 and NME6, but not related to KIF9. We saw the SNP as a potential eQTL for CCDC12 and NME6, while we could not rule out the possibility that it was actually in LD with these two genes. The SNP rs1076394 was approximately 300kb upstream of CCDC12, which participated in promoting early erythroid differentiation 48 , and over 1000 kb upstream of NME6, which might play a role in the regulation of cell growth and cell cycle progression [49][50][51] . Either CCDC12 or NME6 was located in 3p21.3. But, NME6 was more suggested to be the actual contributing gene in this locus due to its significantly higher expression in cancer samples of both Caucasian and Asian populations. At the beginning, NME6 was discovered as a gene encoding a nucleoside diphosphate kinase that suppressed p53-induced apoptosis 50 . In a shRNA functional screen, Nme6 was reported to be crucial for the renewal of embryonic stem cells (ESCs). And ESCs are characterized by immortalization ability, pluripotency, and oncogenicity 52 . More recently, enriched somatic mutations in NME6 was suggested to be associated with the deregulation of pyrimidine metabolism and the promotion of malignant progression in human melanoma 53 . Accordingly, NME6 may act as an oncogene that still needs more investigations.
The application of biofeature information such as ChIP-seq and RNA-seq data has represented an effective approach to identifying functionally regulatory SNPs, and different databases including UCSC, Encode, GEO and TCGA have provided easy access to massive amounts of relevant data. Incorporating epigenetic and expression analyses into traditional molecular epidemiology could assist in the interpretation of GWAS results and the discovery of functional variants for diseases in post-GWAS studies. Using a similar strategy to other CRC-associated loci should deepen our understanding of CRC risk.
Nevertheless, several limitations should be noted here. First of all, due to the lack of relevant functional experiments, biological reality beneath the statistically significant association is uncertain. In the analysis of TCGA data, we have not restricted it to Chinese samples, when its composition is mostly Caucasian samples. It may not reflect the exact outcome of the Chinese population that we researched on. Second, the strategy of retrieving candidate polymorphisms depended on the prediction from ChIP-seq data of relevant cell lines, which was not rigorous enough to define exact regulatory elements and all the functional variants inside. Focusing on common SNPs, we could not rule out the possibility that sets of rare variants or haplotypes in LD with the tag SNP are actually causal in this locus. Third, insufficient epidemiological and clinical information prevented us from further investigating the interactions between gene and environment.
In summary, we discovered a probably regulatory SNP that was associated with CRC risk in the Chinese population. Researches on 3p21.31 and other CRC susceptibility loci with greater sample sizes and follow-up functional analyses are warranted to elaborate the biological mechanism of genetic etiology.

Study Participants.
A two-stage case-control study was applied to evaluate the association between candidate variations and the risk of CRC. The discovery stage (Stage 1) consisted of 767 cases and 1397 controls, which were recruited from Tongji Hospital of Huazhong University of Science and Technology (HUST) between 2008 and 2012. The validation stage (Stage 2) involved 528 cases and 678 controls enrolled from 2013 to 2015 at the same hospital. All subjects were unrelated ethnic Han Chinese in both stages. Patients with histopathologically confirmed CRC and without previous chemotherapy or radiotherapy, were included without restriction to gender and age,. In the same time period, cancer-free controls were recruited form participants in physical examination programs of the same hospital, and were adequately matched to cases by gender and age (± 5 years). Definitions of smoking and drinking status were the same as in a previous study by our group [54][55][56] . At recruitment of each subject, a written informed consent was obtained, and 2 millimeters peripheral venous blood was collected. This study was approved by the ethnics committee of Tongji Medical College of Huazhong University of Science and Technology, and the methods were carried out in accordance with the approved guidelines. SNP Selection and Genotyping. Candidate SNPs were common genetic variants (minor allele frequency, MAF > 0.05) that located in the putative regulatory elements of the 3p21.31 locus. First, we applied the software HaploView to calculate the linkage disequilibrium (LD) block of 3p21.31 with the criterion of r 2 > 0.8, by inputting the Chinese Han Beijing (CHB) genotype information of 500 kb flanking the tagSNP rs8180040. This LD block was defined as the CRC susceptibility locus. Second, we downloaded ChIP-seq data regarding histone modification from two CRC cell lines, HCT116 and Caco2, form the UCSC database integrated with Encode data (S1 Table). And the extents of their signal peaks were considered as putative regulatory elements. Third, based on the CHB MAF data in dbSNP database, we only selected the common polymorphisms (MAF > 0.05) that situated within the overlapping parts between the LD block and peaks. Finally, four polymorphisms in high LD with each other (r 2 > 0.9) survived after this step-wise analysis. Among them, rs1076394 was chosen as the most potentially functional variant for the genotyping assays, because of its more indicative location in an active enhancer than others' . At the same time, we also tried to replicate the tag SNP rs8180040 in our sample. In Stage 2, the nominal significant SNPs of Stage 1 were further validated. SNPs of both stages were genotyped by a TaqMan real-time polymerase chain reaction (PCR) assay (Applied Biosystems, Foster city, CA). Quality control was preformed by including 5% duplicate samples in blinded fashion, with a concordance rate of 100%.
Statistical Analysis. The differences in the distributions of gender, age, smoking, drinking status and genotypes between cases and controls were estimated by a χ 2 test or t-test, where appropriate. The Hardy-Weinberg equilibrium (HWE) in controls was evaluated with a goodness-of-fit χ 2 test. The odds ratio (ORs) and corresponding 95% confidence intervals (95% CIs) were used to measure the associations between SNPs and CRC susceptibility And they were calculated after adjusting for gender, age, smoking and drinking status under a multivariate logistic regression model. For multiple comparison corrections, a simple procedure (Benjamini and Hochberg) was performed in two stages and combined study to control the false discovery rate 57 . The LD of the candidate SNPs was analyzed using HaploView v4.2 58 . With regard to TCGA and GEO data, the expression differences among three genotypes (TT, TA and AA) of rs8180040 were measured under a linear regression model adjusting the effects of somatic copy number and CpG methylation 43 , and the differences between cancer and normal samples were measured by t-test. The gene-environment interactions were evaluated by a pair-wise analysis under multiplicative 59 and additive interaction models 60 . All the above statistical analyses were conducted using SPSS Software v20.0 (SPSS, Chicago, Illinois, USA), with the exception that the P values of additive interaction were assessed using Stata v11.0 (Stata Corporation, College Station, TX). P values in this study were two-sided with a significance criterion of P < 0.05.