Introduction

Lung cancer has been the most common cancer and the leading cause of cancer death worldwide for several decades. In 2012, there were an estimated 1.8 million new cases (12.9% of all new cancer cases) and 1.59 million deaths (19.4% of all cancer deaths). In China, lung cancer accounted for 21.3% of cancer incidence and 27.1% of cancer mortality in the same year1,2. Although tobacco smoking is the established leading risk factor for lung cancer, only about one-tenth of smokers develop the disease in their lifetime3. This indicates that other factors, including genetic risk factors, may also play an important role in lung carcinogenesis4.

Over the past few years, genome-wide association studies (GWAS) have identified more than 40 single nucleotide polymorphisms (SNPs) at 20 genomic loci associated with lung cancer risk (Supplementary Table 1)5. The loci at 3q28, 5p15, 15q25 and 6p21 have been validated in multiple studies as contributing to lung cancer susceptibility6,7,8,9,10,11. These findings have provided new clues for understanding lung carcinogenesis.

Marc A, et al12 systematically investigated the overlap between multiple types of ENCODE data and all GWAS-identified SNPs and showed that about 20% of the SNPs lay within chromatin immunoprecipitation sequencing (ChIP-seq) peaks. When SNPs in strong linkage disequilibrium (r² ≥ 0.8) with the reported SNPs in the CEU population were also counted, the proportion reached 61%. Moreover, Rory J, et al13 showed that polymorphic binding sites have different biochemical affinities for transcription factors (TFs), resulting in altered TF recruitment and differential reporter gene repression in vivo. Through a comprehensive analysis of GWAS-identified SNPs and p53 ChIP-seq data, Jorge Z, et al14 found a SNP, rs4590952 (G/A), that influences cancer risk by changing p53 binding and has undergone natural selection. Together, these results support the hypothesis that non-coding SNPs within regulatory DNA sequences can alter gene expression by changing the affinity of those sequences for TFs.

CTCF is a versatile transcription factor with 11 zinc finger (ZF) domains that is involved in many cellular processes, including transcriptional regulation, insulator activity and regulation of chromatin architecture. CTCF has been reported to be one of 127 significantly mutated genes across 12 tumor types15. Furthermore, ChIP-seq experiments for CTCF have been carried out and replicated in different laboratories, providing highly credible binding peaks in the ENCODE project. However, little attention has been paid to genetic variants in CTCF binding sites. In this study, using ENCODE data and our previous lung cancer GWAS data, we systematically evaluated the associations between genetic variants located in CTCF binding regions and lung cancer risk. We first imputed and screened our GWAS data (2,331 cases vs. 3,077 controls) in the binding regions of CTCF and then replicated the associations in an independent population (1,115 cases vs. 1,346 controls).

Results

A total of 2,331 cases and 3,077 controls were included in the discovery stage, and 1,115 cases and 1,346 controls in the validation stage. The demographic and clinical information is summarized in Supplementary Table 2. Before imputation, only 3,569 genotyped SNPs within the CTCF binding peaks existed in our GWAS data set. After imputation, coverage increased more than eightfold: a total of 32,453 qualified SNPs in the binding sites were analyzed in the discovery stage, and three SNPs (rs37010 at 5p15.33, rs2002059 at 10q24.2 and rs60507107 at 11q12.2) met all the selection criteria described in the Methods (Table 1, Figure 1 and Supplementary Figures 2–3). However, rs37010 was in strong LD (r² = 0.94) with rs465498, which had been confirmed to be associated with lung cancer risk in our previous study16, so it was not carried forward to validation. As a result, two SNPs (rs2002059 and rs60507107) were further evaluated in the validation stage.

Table 1 Associations of the 15 SNPs in discovery stage
Figure 1. rs60507107 in CTCF TFBS ChIP-seq peaks.

The replication results are shown in Table 2. rs60507107, which is located in an intron of DAGLA, remained significantly associated with lung cancer risk in the validation stage (OR = 1.13, 95%CI = 1.01–1.27, P = 0.037), consistent with the discovery stage (OR = 1.22, 95%CI = 1.12–1.33, P = 4.93 × 10⁻⁶). After the two stages were combined, rs60507107 was significantly associated with lung cancer (OR = 1.19, 95%CI = 1.11–1.27, P = 6.98 × 10⁻⁷), passing the Bonferroni-corrected significance threshold (0.05/32,453 = 1.54 × 10⁻⁶). Compared with the major homozygote (CC), the combined ORs for the heterozygote (CT) and the minor homozygote (TT) were 1.14 (95%CI = 1.03–1.26) and 1.45 (95%CI = 1.25–1.67), respectively. The regional plot of rs60507107 is shown in Figure 2. In contrast, the association of the other SNP, rs2002059, observed in the discovery stage was not replicated in the validation stage.
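The multiple-testing threshold and the combined estimate reported above can be checked with simple arithmetic. The sketch below is illustrative only: the text does not state how the two stages were combined, so a fixed-effect inverse-variance meta-analysis is assumed here, and the standard errors are back-calculated from the rounded 95% CIs, so the resulting P value will not exactly match the reported one. All function names are hypothetical.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Back-calculate log(OR) and its standard error from a 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z95)
    return math.log(odds_ratio), se


def fixed_effect_meta(estimates):
    """Inverse-variance weighted pooling of (log_or, se) pairs (assumed method)."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Bonferroni threshold for the 32,453 SNPs tested in the discovery stage
bonferroni = 0.05 / 32453

# Discovery and validation estimates for rs60507107, as reported in the text
stages = [log_or_and_se(1.22, 1.12, 1.33), log_or_and_se(1.13, 1.01, 1.27)]
beta, se = fixed_effect_meta(stages)

combined_or = math.exp(beta)
ci_low = math.exp(beta - Z95 * se)
ci_high = math.exp(beta + Z95 * se)
p_value = two_sided_p(beta / se)

print(f"Bonferroni threshold: {bonferroni:.2e}")
print(f"Combined OR = {combined_or:.2f} ({ci_low:.2f}-{ci_high:.2f}), P = {p_value:.2e}")
```

Under these assumptions the pooled OR and CI round to the reported 1.19 (1.11–1.27), which suggests the combined estimate is at least consistent with an inverse-variance approach.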

Table 2 Associations of the 2 SNPs in the discovery (GWAS) and replication stages
Figure 2. Regional plot of rs60507107 and rs37010.

Furthermore, subgroup analyses by age, gender, smoking status and histology were conducted for the association of rs60507107 with lung cancer risk (Table 3). No significant heterogeneity between subgroups was observed in either the discovery stage or the validation stage. In both stages, the association remained significant in females (OR = 1.29, 95%CI = 1.09–1.53, P = 0.003 in the discovery stage; OR = 1.27, 95%CI = 1.04–1.56, P = 0.017 in the validation stage) and in subjects with a smoking level of less than 25 pack-years (OR = 1.19, 95%CI = 1.07–1.33, P = 0.002 in the discovery stage; OR = 1.17, 95%CI = 1.02–1.35, P = 0.025 in the validation stage).

Table 3 Stratification analysis on rs60507107

Discussion

In our previous GWAS, several loci, such as 5p15.33, 22q12.2 and 12q23.1, were found to be associated with the risk of lung cancer or lung squamous cell carcinoma in Chinese populations9,16,17. However, these findings explain only a small fraction of the heritability of lung cancer, because GWAS mainly focus on the peak associations and usually make little use of prior information. In the current study, we used the ENCODE database and an existing GWAS data set to systematically evaluate the associations of genetic variants within the binding sites of CTCF, which recent studies have implicated in cancer formation, and then replicated the promising associations in an independent case-control study in a Chinese population. We found that a novel SNP, rs60507107 at 11q12.2, was significantly associated with lung cancer risk. We also observed a signal at rs37010, which lies in the same LD region at 5p15.33 as the previously reported SNP rs465498 (r² = 0.94)18. rs37010 is located upstream of CLPTM1L, which is associated with cisplatin-induced apoptosis, and maps to a region of LD upstream of TERT, which has been implicated in carcinogenesis (Figure 2)19,20,21,22.

The newly identified SNP rs60507107 is located in a CTCF binding site in the first intron of DAGLA. CTCF is a transcription factor with 11 zinc finger (ZF) domains and is involved in many cellular processes, including transcriptional regulation, insulator activity and regulation of chromatin architecture. CTCF has been found to mediate important biological processes in lung cancer cells, such as Rb2/p130 transcription and enhancer-promoter interactions of TERT23,24. The CTCF protein can bind a wide variety of DNA target sequences using different ZF domains and plays an important role in epigenetic regulation. It can function as a transcriptional activator by binding a histone acetyltransferase (HAT)-containing complex, or as a transcriptional repressor by binding a histone deacetylase (HDAC)-containing complex25. The CTCF binding region identified in our study carries the histone modifications H3K4me1 and H3K27ac, which usually mark active enhancers. To assess the influence of rs60507107 on CTCF affinity, we searched JASPAR (http://jaspar.genereg.net/) and found that the motif carrying the wild-type allele of rs60507107, “TCCATGGGAATCGCT”, could bind CTCF with a relative score of 0.70, whereas the score decreased to 0.61 for the motif carrying the other allele, “TCCATGGGAATCACT”. Furthermore, rs60507107 is in moderate linkage disequilibrium (r² = 0.586) with rs198464, a known DAGLA eQTL SNP, in the CHB + JPT populations of the 1000 Genomes Project26. rs198464 also showed a significant association with lung cancer in our GWAS data set (OR = 1.18, 95%CI = 1.09–1.28, P = 7.44 × 10⁻⁵). These observations suggest that rs60507107 may be involved in lung carcinogenesis by altering CTCF binding and thereby influencing the expression of DAGLA.
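To illustrate how a single-base change can lower a motif's relative score, the sketch below scores two alleles against a position weight matrix (PWM). It assumes the usual definition of a relative score, (score − min) / (max − min), where min and max are the lowest and highest scores the matrix can produce. The 4-position matrix and the two 4-mers are hypothetical toy values chosen for readability; they are not the JASPAR CTCF matrix or the 15-mers quoted above, and the output does not reproduce the reported scores of 0.70 and 0.61.

```python
def relative_score(pwm, sequence):
    """Scale the summed PWM score of `sequence` to the range [0, 1]."""
    if len(sequence) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    score = sum(column[base] for column, base in zip(pwm, sequence))
    min_score = sum(min(column.values()) for column in pwm)
    max_score = sum(max(column.values()) for column in pwm)
    return (score - min_score) / (max_score - min_score)


# Hypothetical toy matrix: one dict of per-base weights per motif position
TOY_PWM = [
    {"A": 2, "C": -1, "G": -1, "T": -2},
    {"A": -2, "C": 1, "G": 2, "T": -1},
    {"A": -1, "C": 2, "G": -2, "T": 0},
    {"A": 0, "C": -1, "G": 1, "T": 2},
]

ref_score = relative_score(TOY_PWM, "AGCT")  # toy "reference" allele G at position 2
alt_score = relative_score(TOY_PWM, "AACT")  # toy "alternative" allele A at position 2

print(f"reference: {ref_score:.2f}, alternative: {alt_score:.2f}")
```

In this toy case the alternative allele replaces the best-scoring base at one position with the worst-scoring one, so the relative score drops even though the other three positions are unchanged.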

DAGLA is located at 11q12.2, a region recently identified as associated with colorectal cancer (CRC) in a large-scale genetic study27. A previous study also implicated the region in hereditary prostate cancer families with primary kidney cancer28. These findings suggest that the region may participate in a common mechanism of carcinogenesis. DAGLA is a diacylglycerol lipase that catalyzes the hydrolysis of diacylglycerol (DAG) to 2-arachidonoylglycerol (2-AG), the most abundant endocannabinoid in tissues. 2-AG is a physiological ligand for the cannabinoid receptors CB1 and CB2, two G protein-coupled receptors located in the central and peripheral nervous systems29. The cannabinoid ligand-receptor system plays an important role in a variety of physiological processes, including appetite, pain sensation, mood and memory, and has antitumorigenic properties30. Munson AE, et al31 found that the growth of Lewis lung adenocarcinoma in a mouse model was inhibited after 20 days of treatment with cannabinol and Δ8-THC. As a cannabinoid, 2-AG also shows anticancer properties. It has antiproliferative effects, with an IC50 of 1.8 μM, in C6 glioma cells32. In the breast cancer cell line MCF-7, 2-AG decreases cellular proliferation. According to MI et al33, 2-AG can also decrease migration, or markers of migration, in a wide range of cell lines.

In conclusion, the present study systematically investigated the associations of genetic variants in the binding sites of CTCF and identified a new locus at 11q12.2 associated with increased lung cancer risk. Given the moderate sample size of the validation stage, larger, well-designed population-based studies are warranted to clarify the impact of rs60507107 on lung cancer risk.

Methods

Study populations

A two-stage case-control study was designed to evaluate the associations between genetic variants in the binding sites of CTCF and the risk of lung cancer. The study subjects for the discovery stage were exactly the same as in our previous lung cancer GWAS9,16. Briefly, the discovery stage of 2,331 lung cancer cases and 3,077 controls included two studies (Nanjing GWAS: 1,473 cases and 1,962 controls from Nanjing and Shanghai; Beijing GWAS: 858 cases and 1,115 controls from Beijing and Wuhan). The histology of each case was confirmed histopathologically or cytologically by at least two local pathologists. Cancer-free control subjects were recruited from individuals receiving routine physical examinations in local hospitals or from those participating in community screening for chronic diseases. Demographic information was collected through interviews using a standard questionnaire. Smokers were defined as individuals who had smoked an average of one cigarette or more per day for at least one year in their lifetime; all other subjects were considered nonsmokers. Former smokers were defined as smokers who had quit at least one year before recruitment. Both the number of years smoked and the number of cigarettes per day were collected to calculate pack-years. The controls were frequency-matched to the lung cancer cases by age, gender and geographic region. An additional 1,115 newly recruited cases and 1,346 controls formed the validation stage, in which the associations were replicated. All study subjects provided informed consent. The institutional review boards of Nanjing Medical University and of the Chinese Academy of Medical Sciences and Peking Union Medical College approved all procedures, and all experiments were conducted in accordance with the approved guidelines.
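The smoking definitions used for the study subjects can be made concrete with a short sketch. It assumes the conventional definition of a pack-year (20 cigarettes per day for one year), which the text does not spell out; the function names are hypothetical.

```python
CIGARETTES_PER_PACK = 20  # conventional pack size; an assumption, not stated in the text


def is_smoker(cigarettes_per_day, years_smoked):
    """Smoker: at least one cigarette per day on average, for at least one year."""
    return cigarettes_per_day >= 1 and years_smoked >= 1


def pack_years(cigarettes_per_day, years_smoked):
    """Cumulative exposure: packs smoked per day multiplied by years smoked."""
    return cigarettes_per_day / CIGARETTES_PER_PACK * years_smoked


# One pack a day for 30 years, and half a pack a day for 20 years
heavy = pack_years(20, 30)
light = pack_years(10, 20)

print(heavy, light)
print(is_smoker(0.5, 10), is_smoker(1, 1))
```

With this definition, the second subject (10 pack-years) would fall into the subgroup of less than 25 pack-years used in the stratified analysis, and the first (30 pack-years) would not.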

Database preparation and bioinformatics analysis

ENCODE dataset preparation

A total of 690 datasets of TF ChIP-seq peaks were released based on data from the five ENCODE transcription factor binding site (TFBS) ChIP-seq production groups. To account for tissue specificity, only peaks observed in the lung cancer cell line A549 were analyzed in this study. The uniform CTCF peaks were downloaded from the UCSC Genome Browser in BED format (Uniform Peaks of Transcription Factor ChIP-seq from ENCODE, released 13 May 2013). Because the ChIP-seq experiments were performed under various treatment conditions, we defined a region as a CTCF binding site if a peak was called under any condition. As a result, 101,083 peaks across the genome were included in the final data set.
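Defining a binding site as any region where a peak was called under at least one condition amounts to taking the union of the peak intervals and then asking whether a SNP position falls inside it. The following is a minimal sketch of that idea, not the pipeline actually used; the coordinates are hypothetical, and it relies on the BED convention of 0-based, half-open intervals, so a 1-based SNP position is shifted by one before lookup.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_peaks(peaks):
    """Union overlapping (chrom, start, end) BED intervals, per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:  # overlaps or touches the previous interval
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(iv) for iv in out]
    return merged


def snp_in_peak(merged, chrom, pos_1based):
    """True if a 1-based SNP position lies inside any merged peak."""
    intervals = merged.get(chrom, [])
    pos0 = pos_1based - 1  # convert to the 0-based BED coordinate
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos0) - 1
    return i >= 0 and pos0 < intervals[i][1]


# Hypothetical peaks from two treatment conditions
peaks = [("chr11", 100, 200), ("chr11", 150, 260), ("chr11", 500, 600), ("chr5", 10, 20)]
merged = merge_peaks(peaks)

print(merged["chr11"])
print(snp_in_peak(merged, "chr11", 101), snp_in_peak(merged, "chr11", 261))
```

Sorting the intervals once and using binary search keeps each lookup fast, which matters when tens of thousands of SNPs are checked against roughly a hundred thousand peaks.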

SNP screening based on GWAS data

A total of 3,569 SNPs located in CTCF-related regions that passed quality control in our GWAS16 were included in this study. To further increase the genome coverage of our data, we performed a two-step imputation analysis (pre-phasing and imputation) in the ChIP-seq peak regions. After imputation, a total of 32,453 unique SNPs in the TFBS regions with high imputation quality (info > 0.8) were included for further analysis. SNPs with P ≤ 1.0 × 10⁻⁵ in all GWAS samples and with consistent associations in the Nanjing and Beijing studies at P ≤ 0.05 were selected for replication, with the following exclusions: (i) SNPs with MAF < 0.05; (ii) SNPs located more than 5 kb from the nearest gene; (iii) SNPs located in the HLA region; (iv) SNPs in strong linkage disequilibrium (LD) (r² ≥ 0.5)17. The detailed workflow is shown in Supplementary Figure 1. As a result, 3 SNPs that satisfied all the above criteria were included for further validation.
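The screening criteria above can be expressed as a single filter over per-SNP summary statistics. The sketch below is a hedged illustration rather than the authors' script: the record layout, field names and example values are hypothetical, "consistent between the Nanjing and Beijing studies" is interpreted as P ≤ 0.05 in both studies with the same direction of effect, and the LD-based exclusion is omitted because it requires genotype data.

```python
from dataclasses import dataclass


@dataclass
class SnpResult:
    snp_id: str
    p_all: float           # P value in all GWAS samples
    or_nanjing: float      # odds ratio in the Nanjing study
    p_nanjing: float
    or_beijing: float      # odds ratio in the Beijing study
    p_beijing: float
    maf: float             # minor allele frequency
    info: float            # imputation quality score
    dist_to_gene_bp: int   # distance to nearest gene; 0 if inside a gene
    in_hla: bool


def passes_screen(s):
    """Apply the selection criteria (LD pruning not included)."""
    same_direction = (s.or_nanjing - 1) * (s.or_beijing - 1) > 0
    return (
        s.info > 0.8
        and s.p_all <= 1e-5
        and s.p_nanjing <= 0.05
        and s.p_beijing <= 0.05
        and same_direction
        and s.maf >= 0.05
        and s.dist_to_gene_bp <= 5000
        and not s.in_hla
    )


# Hypothetical records, one per failure mode plus one that passes
results = [
    SnpResult("snp_a", 4e-6, 1.20, 0.001, 1.25, 0.004, 0.30, 0.95, 0, False),
    SnpResult("snp_b", 2e-4, 1.15, 0.010, 1.10, 0.030, 0.25, 0.99, 0, False),
    SnpResult("snp_c", 5e-6, 1.30, 0.0001, 1.05, 0.200, 0.20, 0.90, 1200, False),
    SnpResult("snp_d", 1e-6, 1.40, 0.001, 1.35, 0.002, 0.02, 0.92, 0, False),
    SnpResult("snp_e", 3e-6, 1.22, 0.002, 1.18, 0.010, 0.15, 0.97, 0, True),
]
selected = [s.snp_id for s in results if passes_screen(s)]
print(selected)
```

Only the first record passes; the others fail on the overall P value, the per-study P value, MAF and the HLA exclusion, respectively.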

Statistical analysis

The quality control procedures in our lung cancer GWAS were described previously16. Ungenotyped SNPs were imputed in the GWAS discovery samples using SHAPEIT 1.0 (haplotype estimation step) and IMPUTE2 (genotype imputation step), with haplotype information from the 1000 Genomes Project (Phase I integrated variant set across all 1,092 individuals, V3, released May 2012) as the reference34,35,36. In the discovery stage, single-SNP associations were analyzed with a score test based on dosage files in SNPTEST (V2) under an additive model adjusted for age, gender and pack-years of smoking37. The associations between SNPs and lung cancer susceptibility were expressed as odds ratios (ORs) with 95% confidence intervals (CIs). The chi-square-based Cochran's Q statistic was calculated to test for heterogeneity between groups in the stratified analysis38. Differences in characteristics between lung cancer patients and control subjects were examined with Student's t-test (for continuous variables) or the χ² test (for categorical variables) in R (version 2.15.3). All tests were two-sided and the significance level was set at P ≤ 0.05.
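As a worked illustration of the Cochran's Q heterogeneity test, the sketch below compares two group-specific estimates. Because only some subgroup ORs are quoted in the text, the overall discovery and validation estimates for rs60507107 stand in for the two groups; this pairing is purely illustrative, the standard errors are back-calculated from rounded CIs, and the closed-form P value applies only to the two-group case (1 degree of freedom). Function names are hypothetical.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Back-calculate log(OR) and its standard error from a 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z95)
    return math.log(odds_ratio), se


def cochran_q(estimates):
    """Cochran's Q for (log_or, se) pairs; returns (Q, degrees of freedom)."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, (b, _) in zip(weights, estimates))
    return q, len(estimates) - 1


def chi2_sf_df1(q):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(q / 2))


groups = [log_or_and_se(1.22, 1.12, 1.33), log_or_and_se(1.13, 1.01, 1.27)]
q_stat, df = cochran_q(groups)
p_het = chi2_sf_df1(q_stat)

print(f"Q = {q_stat:.2f} on {df} df, P for heterogeneity = {p_het:.2f}")
```

A P value well above 0.05, as here, gives no evidence that the two estimates differ beyond what sampling variation would produce.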

Genotyping

The SNPscan™ kit (Genesky Biotechnologies Inc., Shanghai, China) was used to determine the genotypes of the SNPs selected for the validation stage. The detailed technical description of the kit has been presented elsewhere; thirty duplicated samples were genotyped to ensure reliability (sensitivity 97.03%, specificity 93.33%)39. Several measures were taken to control genotyping quality: (i) case and control samples were mixed and genotyped blind to case-control status; (ii) forty-two samples with known genotypes were genotyped as positive controls, and the concordance rates were above 99%; (iii) five percent of the samples were randomly selected for repeat genotyping as blind duplicates, and the reproducibility was 100%.