Introduction

Lung cancer causes over one million deaths each year worldwide.1, 2 Although principally caused by tobacco smoking, inherited genetic factors are increasingly being recognised to be important in the aetiology of lung cancer; notably, genome-wide association studies (GWAS) in Europeans have consistently identified polymorphic variation at 15q25.1 (CHRNA5-CHRNA3-CHRNB4), 5p15.33 (TERT-CLPTM1L) and 6p21.33 (BAT3-MSH5) as determinants of lung cancer risk.3, 4, 5, 6, 7 Additionally, susceptibility loci for lung cancer at 3q28, 6q22.2, 13q12.12, 10q25.2 and 22q12.2 in Asians have been identified using GWAS.8, 9, 10 Recent studies have validated the 3q28 association in Europeans.11, 12

Non-small-cell lung cancer (NSCLC) is the most frequent histological subtype of lung cancer, comprised primarily of adenocarcinoma (AD) and squamous cell carcinoma (SQ). The various lung cancer histologies have different clinical characteristics reflective of differences in their carcinogenesis and molecular profile.13 Perhaps not surprisingly, there is variability in the genetic effects on lung cancer risk by histology, with subtype-specific associations at 5p15.33 (TERT-CLPTM1L) for AD14, 15 and at 9p21 (CDKN2A/CDKN2B)16 and 12p13.33 (RAD52)17 for SQ.

The associations identified by GWAS provide novel insights into the development of lung cancer. However, the tag single-nucleotide polymorphisms (tagSNPs) genotyped are generally not strong candidates for causality, and thus elucidating the functional basis of association signals is challenging. One reason for this is that the correlation matrix between tagSNP(s) and functional variant(s) at any locus can be complex. For example, at least two independent risk loci map to 5p15.33,14 and the 6p22.1–p21.31 locus is part of the human leucocyte antigen (HLA) region, which is highly polymorphic.7 Furthermore, it has recently been proposed that many GWAS signals can be a consequence of ‘synthetic associations’, resulting from the combined effect of one or more rare causal variants rather than simply linkage disequilibrium (LD) with a common risk variant.18

Although it can be hard to deconvolute association signals, functional variant discovery is aided by a deeper examination of genetic variation in the LD blocks in which the tagSNPs reside and this benefits greatly from efforts such as the 1000 Genomes Project, which afford discovery of novel variants.12

To decipher the allelic structure underscoring the associations for lung cancer at nine of the previously reported lung cancer risk-associated regions, 3q28, 5p15.33, 6p21.33, 6q22.2, 9p21, 10q25.2, 12p13.33, 13q12.12 and 22q12.2, we performed a meta-analysis of data from five independent GWAS. To maximise recovery of all variants contributing to lung cancer risk at these loci, we imputed untyped variants using 1000 Genomes Project data as the reference panel.

Materials and methods

Subjects and data sets

We used GWAS data from five non-overlapping case–control series of Northern European ancestry, which have been previously reported (Supplementary Table 1): the MD Anderson Cancer Center GWAS of NSCLC comprising 1150 cases and 1134 controls;4 the Institute of Cancer Research (ICR) GWAS comprising 1952 cases and 5200 controls;7 the National Cancer Institute GWAS comprising 5713 cases and 5736 controls;15 the International Agency for Research on Cancer (IARC) GWAS comprising 2533 cases and 3791 controls6 and the Harvard GWAS based on 984 cases and 970 controls,19 which were genotyped using either Illumina HumanHap 317, 317+240S, 370Duo, 550, 610 or 1 M arrays (Illumina Inc., San Diego, CA, USA; Supplementary Table 1).

Ethics

All participants provided informed written consent. All studies were reviewed and approved by institutional ethics review committees at the involved institutions.

Quality control

Standard quality control was performed on all scans excluding individuals with low call rate (<90%) and extremely high or low heterozygosity (ie, P<1.0 × 10−4), as well as all individuals evaluated to be of non-European ancestry (using the HapMap version 2 CEU, JPT/CHB and YRI populations as a reference; Supplementary Table 1). For apparent first-degree relative pairs, we removed the control from a case–control pair; otherwise, we excluded the individual with the lower call rate. Quantile–quantile (Q–Q) plots of genome-wide association test statistics showed that there was minimal inflation rendering substantial cryptic population substructure or differential genotype calling between cases and controls unlikely in each GWAS (genomic control inflation factors, λ=1.00–1.05; Supplementary Figure 1). To bring genotype data obtained from different arrays into a common platform from the five GWAS and to recover untyped genotypes, we imputed SNPs using 1000 Genomes Project data as the reference (Supplementary Table 1). Q–Q plots for all SNPs and those restricted to rare SNPs (minor allele frequency (MAF) <1%) after imputation did not show evidence of substantive overdispersion introduced by imputation (λ=0.99–1.06 and 0.82–1.05).
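The genomic control inflation factor quoted above can be illustrated with a minimal sketch. It assumes the common median-based estimator (the observed median of 1-d.f. chi-square statistics divided by the expected median); the function name and the example statistics are hypothetical, and this is not necessarily the exact estimator applied in each GWAS.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Return lambda: observed median 1-d.f. chi-square statistic / expected median.

    Assumption: median-based estimator; other estimators exist.
    """
    if not chi2_stats:
        raise ValueError("need at least one test statistic")
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical association statistics, for illustration only.
example_stats = [0.10, 0.30, 0.48, 0.90, 2.50]
lam = genomic_inflation(example_stats)
print(f"lambda = {lam:.3f}")
```

A value of λ close to 1.0 indicates that the bulk of the test statistics follow the null distribution, which is the basis for the statement that substantial population substructure is unlikely.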

Statistical analysis

Data were imputed for each scan using 1000 Genomes Project data (Phase 1 integrated release 3, March 2012) as the reference, using IMPUTE2 (v.2.1.1),20 MaCH (v.1.0)21 or minimac (v.2012.10.3)22 software (Supplementary Table 1). Genotypes were aligned to the positive strand in both imputation and genotyping. Imputation was conducted separately for each scan; before imputation, each GWAS data set was pruned to a common set of SNPs between cases and controls. As previously advocated, we set thresholds for imputation quality to retain both potential common and rare variants for validation.16, 23 Specifically, poorly imputed SNPs defined by an RSQR <0.30 with MaCH or an information measure Is <0.40 with IMPUTE2 were excluded from the analyses. Tests of association between imputed SNPs and lung cancer were performed using SNPTEST (v.2.5),24 ProbABEL,25 MaCH2dat (v.124)21 or the glm function in R. Principal components generated using common SNPs were included in the analysis to limit the effects of cryptic population stratification that might cause inflation of test statistics. Classical HLA alleles were imputed from SNP data using HLA*IMP2.26, 27

The association between each SNP and lung cancer risk was assessed by the Cochran–Armitage trend test. The possibility of significant hidden population substructure or differential genotype calling between cases and controls in each GWAS was evaluated using Q–Q plots of test statistics. The inflation factor λ was based on the 90% least significant directly typed SNPs.28 Meta-analysis was undertaken using inverse-variance approaches. Odds ratios (ORs) and associated 95% confidence intervals (CIs) were calculated by unconditional logistic regression using R (v.2.6) and PLINK29 (v.1.06) software. Cochran’s Q-statistic to test for heterogeneity and the I2 statistic to quantify the proportion of the total variation due to heterogeneity were calculated.30 To explore the variability in genetic associations according to tumour histology, we derived ORs for all lung cancer, and AD and SQ cancers. Subset-based meta-analyses of AD and SQ lung cancer, stratified by AD and SQ histology, were conducted using the program ASSET.31 All statistical tests were two-sided.
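The inverse-variance meta-analysis and the heterogeneity statistics described above can be sketched as follows. This is a minimal fixed-effect illustration, not the code used in the study: the function name is hypothetical, and the per-study log odds ratios and standard errors passed to it are assumed to come from each GWAS.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance meta-analysis of per-study log odds ratios.

    Returns the pooled OR with its 95% CI, a two-sided P-value,
    Cochran's Q and the I2 statistic (as a percentage).
    """
    if len(betas) != len(ses) or len(betas) < 2:
        raise ValueError("need matching lists from at least two studies")
    weights = [1.0 / se ** 2 for se in ses]          # w_i = 1 / SE_i^2
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2: share of total variation attributable to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-sided normal P-value
    return {
        "or": math.exp(beta),
        "ci_low": math.exp(beta - 1.96 * se),
        "ci_high": math.exp(beta + 1.96 * se),
        "p": p,
        "q": q,
        "i2": i2,
    }


# Hypothetical per-study estimates, for illustration only.
result = inverse_variance_meta([0.0, 0.4], [0.1, 0.1])
print(result)
```

Under the null hypothesis of no heterogeneity, Q follows a chi-square distribution with (number of studies − 1) degrees of freedom; obtaining its P-value requires a chi-square distribution function, which is omitted here to keep the sketch dependency-free.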

Bioinformatics

LD metrics were calculated based on 5200 controls from Wellcome Trust Case Control Consortium release 2 (WTCCCII) genotyped using Illumina 1.2 M arrays and plotted using SNAP.32 LD blocks were defined on the basis of HapMap recombination rate (cM/Mb) as defined using the Oxford recombination hotspots and on the basis of distribution of CIs defined by Gabriel et al.33 To explore the epigenetic profile of association signals, we used chromatin state segmentation data generated by the ENCODE project. We used HaploReg34 and RegulomeDB35 to examine whether any of the SNPs or their proxies (ie, r2>0.8 in the 1000 Genomes EUR reference panel) annotate putative transcription factor (TF) binding or enhancer elements. We assessed sequence conservation using GERP; GERP scores (−12 to 6, with 6 being indicative of complete conservation) reflect the substitutions at that site that are rejected by selection, that is, the shortfall of observed substitutions relative to those expected under a neutral evolutionary model, based on sequence alignment of 34 mammalian species.36
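The two pairwise LD metrics used throughout, D′ and r2, can be illustrated with a minimal sketch that assumes phased haplotype counts are available for two biallelic SNPs. The function name and the counts are hypothetical; in practice, haplotype frequencies are estimated from unphased genotypes by tools such as SNAP.

```python
def ld_metrics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D', r2) from the four phased haplotype counts of two biallelic SNPs."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_A = (n_AB + n_Ab) / total        # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / total        # frequency of allele B at SNP 2
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both SNPs must be polymorphic")
    d = n_AB / total - p_A * p_B       # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r2 = d ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d_prime, r2


# Hypothetical counts: one haplotype class is absent, allele frequencies differ.
d_prime, r2 = ld_metrics(n_AB=30, n_Ab=0, n_aB=40, n_ab=30)
print(f"D' = {d_prime:.2f}, r2 = {r2:.2f}")

# Proxy definition used in the text: r2 > 0.8.
is_proxy = r2 > 0.8
```

The example shows why both metrics are reported: D′ reaches 1.0 whenever one of the four haplotypes is absent, whereas r2 approaches 1.0 only when the two SNPs also have matching allele frequencies, so a pair can have D′=1.0 and still fall well short of the r2>0.8 proxy threshold.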

eQTL, meQTL and mutation analysis

To gain insight into the biological basis of each of the risk loci for lung cancer, we firstly performed an expression quantitative trait loci (eQTL) analysis for directly typed SNPs, making use of expression data on over 1100 normal lung tissue samples.37 To explore the relationship between SNP genotype and gene body methylation, we made use of previously published methylation quantitative trait loci (meQTL) data from The Cancer Genome Atlas (TCGA) and the EAGLE study37 using sample size-weighted meta-analysis implemented in METAL.38 To examine the somatic mutation frequency of specific genes, we used data from the analysis of SQ and AD lung cancers generated by TCGA and MutSigCV v.1.439 to determine whether the gene harbours more non-synonymous mutations than expected by chance given its size, sequence context and mutation rate. As advocated, we imposed a false discovery rate (FDR) threshold of 0.1 for statistical significance.39
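Sample size-weighted meta-analysis, as used for the meQTL data, combines per-study Z-scores with weights proportional to the square root of each sample size. The following is a minimal sketch of that weighting scheme; the function name and the input values are hypothetical, and it is not a substitute for METAL itself.

```python
import math


def sample_size_weighted_z(z_scores, sample_sizes):
    """Combine signed per-study Z-scores using weights w_i = sqrt(N_i).

    Returns the combined Z-score and its two-sided P-value.
    """
    if len(z_scores) != len(sample_sizes) or not z_scores:
        raise ValueError("need one sample size per Z-score")
    weights = [math.sqrt(n) for n in sample_sizes]
    z = sum(w * zi for w, zi in zip(weights, z_scores)) / math.sqrt(
        sum(w * w for w in weights)
    )
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal P-value
    return z, p


# Hypothetical Z-scores and sample sizes for two studies.
z_combined, p_combined = sample_size_weighted_z([2.0, 2.0], [100, 100])
print(f"Z = {z_combined:.3f}, P = {p_combined:.4f}")
```

Because the Z-scores are signed relative to a common reference allele, effects in opposite directions cancel rather than reinforce one another, which is why alleles must be aligned across studies before combining.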

Results

Subjects and definition of genomic regions

We studied five lung cancer GWAS in populations of European ancestry (Supplementary Table 1). After filtering on the basis of prespecified quality-control measures, the five GWAS provided genotypes on 12 316 lung cancer cases and 16 831 controls (Supplementary Table 1).

To explore the possibility of fine mapping the risk loci for lung cancer that have been identified in Asians, we recovered the SNPs that had been reported to define the respective loci: specifically rs9387478 (6q22.2, hg19 chr6: g.117786180A>C), rs753955 (13q12.12, hg19 chr13: g.24293859T>C), rs7086803 (10q25.2, hg19 chr10: g.114498476A>G) and rs17728461 (22q12.2, hg19 chr22: g.30598552C>G). None of these SNPs displayed a significant association with risk of all lung cancer (ie, P>0.05). For rs9387478, the previously reported risk allele showed an association with both AD and SQ, albeit only at P-values of 0.01 and 0.03, respectively. In view of these findings, we did not proceed to formal interrogation of these regions, confining our analysis to loci implicated in Europeans.

We defined the haplotype blocks and recombination hotspots containing the tagSNPs previously reported to be associated with lung cancer risk at 3q28 (TP63, rs4488809, hg19 chr3: g.189356261T>C), 5p15.33 (TERT, rs2736100, hg19 chr5: g.1286516T>G and CLPTM1L, rs402710, hg19 chr5: g.1320722C>T), 6p21.33 (rs3117582, hg19 chr6: g.31620520A>C), 9p21 (CDKN2A/CDKN2B, rs1333040, hg19 chr9: g.22083404C>T) and 12p13.33 (RAD52, rs10849605, hg19 chr12: g.1064438T>C). To allow for the possibility of long-range synthetic associations, we imputed a region of at least 1 Mb surrounding the tagSNP associated with lung cancer risk at each of the nine loci. Collectively, the six European and three Asian lung cancer risk loci were captured by a 12.5 Mb region of the genome.

For analysis of the 6p21.33 association, we considered the major histocompatibility complex (MHC) to be defined by a 4.5 Mb region bordered by the RFP and MLN genes (rs209130, hg19 chr6: g.28867800A>G and rs1547668, hg19 chr6: g.33775446A>G, respectively) at the telomeric and centromeric ends of 6p21.33. For the HLA imputation, we made use of GWAS SNP data for an extended region in 6p21.33 bounded by rs1165196 (hg19 chr6: g.25813150T>C) and rs2772372 (hg19 chr6: g.33427350T>C).

Supplementary Table 1 shows the number of SNPs directly typed and successfully imputed (INFO score ≥0.4) at each of the five loci. Regional plots of association results and recombination rates for all five regions imputed can be found in Figure 1.

Figure 1

Regional plots of association results and recombination rates for 5p15.33 in all lung cancer (a), 3q28 and 5p15.33 in AD (b and c), 6p21.33, 9p21 and 12p13.33 in SQ (d–f). All lung cancer-related panel (a) was based on 12 332 lung cancer cases and 16 831 controls; AD-related panels (b and c) were based on 3930 AD and 15 864 controls; and SQ-related panels (d–f) were based on 3490 SQ and 16 008 controls. Association results of both genotyped (circles) and imputed (diamonds) SNPs in the GWAS samples and recombination rates for each locus: for each plot, −log10 P-values (y axis) of the SNPs are shown according to their chromosomal positions (x axis). The top genotyped SNP in each combined analysis is a large diamond and is labelled by its rsID. The colour intensity of each symbol reflects the extent of LD with the top genotyped SNP: white (r2=0) through to dark red (r2=1.0). Genetic recombination rates (cM/Mb), estimated using HapMap CEU samples, are shown with a light blue line. Physical positions are based on NCBI build 37 of the human genome. Also shown are the relative positions of genes and transcripts mapping to each region of association. Genes have been redrawn to show the relative positions; therefore, maps are not to physical scale.

Analysis of individual lung cancer risk loci

For four of the five regions, imputation refined the association signal, identifying a region of interest narrower than the original LD block that is likely to harbour the functional variant. However, for 6p21.33, the LD structure is large and complex. In all regions, in silico functional annotation of the most associated variants, using publicly available data from ENCODE, revealed that many reside within potential regulatory regions of DNA.

At 3q28 the association was driven by the relationship with risk for AD (Table 1); the strongest association for this lung cancer histology was provided by rs13314271 (hg19 chr3: g.189357602T>C; P=3.35 × 10−7), which localises to intron 2 of TP63. The strongest eQTL within the region of association was shown by rs4488809 (hg19 chr3: g.189356261T>C) for TP63 (PeQTL=6.77 × 10−14), which is in perfect LD with rs13314271 (D′=1.0, r2=1.0) and resides within a genomic region having epigenetic markers of enhancer function and TF binding (Supplementary Table 3).

Table 1 SNP associations by histology for lung cancer risk loci at 3q28, 5p15.33, 6p21.33, 9p21 and 12p13.33

The 6p21.33 association is almost exclusively the consequence of an increased risk for SQ lung cancer (Table 1 and Supplementary Table 2). The best association for SQ was provided by rs115549526 (hg19 chr6: g.189356261T>C, P=3.80 × 10−9), which maps 2.3 kb 5′ of apolipoprotein M transcript variant 2. In addition to rs115549526, multiple other SNPs within the extended region of LD also show strong associations (ie, within one order of magnitude), which reflect the HLA haplotype structure and the extensive correlations over the whole MHC region. Making use of HLA imputation data, the strongest individual HLA allelic association for SQ lung cancer was shown by HLA-B*0801 (P=1.09 × 10−9; Supplementary Table 4). Taken together with HLA-A*0101 and DRB1*0301, this allele forms the ancestral 8.1 haplotype seen in Europeans. The strongest eQTLs seen in normal lung tissue for the 6p21.33 risk SNPs are shown by rs3131383 (hg19 chr6: g.31704294C>A) and rs497309 (hg19 chr6: g.31892484T>G) for HLA-A (PeQTL=9.48 × 10−18 and 2.13 × 10−18, respectively) and by rs3117577 (hg19 chr6: g.31727474A>G) for HLA-DQB1 (PeQTL=3.76 × 10−17; Supplementary Table 5). Rs115549526, rs3131383, rs497309 and rs3117577 are all highly correlated SNPs (pairwise LD metrics D′≥0.9, r2≥0.8). The strongest meQTL within the 6p21 risk locus has previously been documented37 to be rs3131379 (hg19 chr6: g.31721033C>T) for MSH5 (PmeQTL=1.14 × 10−17; Supplementary Table 5). Perhaps not unexpectedly, rs3131379 is strongly correlated with rs115549526 (D′=1.0, r2=0.9). Many of the risk-associated SNPs map to enhancers and have documented TF binding motifs (Supplementary Table 3).

As documented previously,7 two independent associations were shown at 5p15.33, annotating the TERT and CLPTM1L genes. The 5p15.33 TERT locus has been associated with risk of many tumour types, with several independent risk signals.40 These risk signals are represented by SNPs in localised regions of LD, within the promoter and within introns 2–4.41, 42 Consistent with previously published observations,15 the strongest 5p15.33–TERT association was provided by rs2736100, which maps to intron 2 of TERT (Figure 1), and is driven by the risk for AD (P=2.51 × 10−18). In contrast, the association defined by rs37004 (hg19 chr5: g.1356684G>A, P=6.91 × 10−16), which maps 12 kb telomeric to CLPTM1L (Figure 1), influenced the risk of both AD and SQ lung cancer histologies. Rs2736100 provided the strongest meQTL with TERT (PmeQTL=5.28 × 10−19; Supplementary Table 5). For the other 5p15.33 region of association, the intron 13 SNP rs401681 provided the strongest meQTL with CLPTM1L (PmeQTL=7.74 × 10−17; Supplementary Table 5) but was only partially correlated with rs37004 (D′=1.0, r2=0.28).

The signal at 9p21 was primarily driven by risk for SQ, the strongest association being provided by rs1333040 (hg19 chr9: g.22083404C>T), which maps within intron 13 of CDKN2B (P=2.54 × 10−6; Figure 1). Rs1333040 and closely correlated SNPs did not show any significant eQTL or meQTL associations, and did not annotate enhancer or DNase hypersensitivity elements (Supplementary Table 3).

As expected from previously published data,17 the 12p13.33 association for lung cancer was principally driven by the association with risk of SQ lung cancer; rs3748522 (hg19 chr12: g.1058688A>C) providing the strongest association signal (P=2.35 × 10−8). Rs3748522 maps to intron 1 of the DNA double-strand repair gene RAD52 (Figure 1) within a predicted promoter that features multiple TF binding sites (Supplementary Table 3).

Discussion

Characterizing all the genetic variation within each region of association, as we have performed, is critical in deciphering the allelic architecture responsible for GWAS risk loci and for nominating specific variants for functional analyses. For four of the risk loci, we have been able to refine the association signal, identifying a smaller region of interest that is the most likely location of a functional variant and/or identifying a good candidate.

It is increasingly apparent that multiple independent risk loci for different cancers characterise the 5p15.33 locus. Telomere maintenance is a universal requirement for oncogenic progression. Telomere length (TL) shows substantial interindividual variability, and genetically defined TERT-mediated differences in TL represent a promising epidemiological risk factor for cancer. In terms of the TERT risk locus, our findings provide strong evidence that rs2736100 is likely to be responsible for the 5p15.33 association for lung AD. Such an assertion is supported by a recent study demonstrating that rs2736100 genotype is associated with TL variation.43 In contrast, the genetic basis of the CLPTM1L association and risk of all lung cancer is less well defined.

The strongest associations at 3q28 for AD were shown by SNPs mapping to intron 2 of TP63. TP63 is a member of the tumour suppressor TP53 gene family, pivotal to cellular differentiation and responsiveness to cellular stress. High expression levels of TP63 are shown in lung cancers with and without amplification of TP63. Exposure of cells to DNA damage, through carcinogenic agents such as tobacco smoke, leads to induction of TP63 and transactivation of TP53 target genes. As a strong relationship between rs4488809 genotype and TP63 expression was demonstrable, it is entirely plausible that genetically determined differential expression impacts on cellular responsiveness to DNA damage. As rs4488809 is one of the top-ranked SNPs and overlaps a predicted enhancer element, it is likely that this SNP represents a tractable basis for the 3q28 association.

As previously stated, at 12p13.33 a number of SNPs that are correlated with rs3748522 and map to functional elements could account for this association a priori. In contrast, the 9p21 association is more ill-defined, reflecting the more moderate impact of the risk locus in addition to considerations of LD structure. It is also worth noting that the Illumina Infinium HumanMethylation450 BeadChip (Illumina Inc) used in the meQTL analysis has very low coverage at this locus, which adds to the difficulty of identifying meQTLs in the region.

The 6p21.33 risk locus is large, gene rich and complex. Because it encompasses the major histocompatibility complex, it features many highly correlated variants across a large region, making the association signal hard to refine. The meQTL data provide evidence for the role of MSH5 as the basis of the 6p21.33 association. However, a compelling case can be advanced to support HLA variation in defining SQ lung cancer risk. Somatic loss-of-function alterations of HLA-A have been reported previously in genomic studies of lung cancer. Moreover, HLA-A is significantly mutated in SQ (FDR=0.07) but not in AD lung cancers. Intriguingly, the observation that HLA-class II variation influences squamous oesophageal cancer risk suggests a wider role for the MHC region in development of solid tumours.

In our study, we found no evidence to support the existence of ‘synthetic associations’ underscoring the currently identified autosomal GWAS signals for lung cancer. At all of the five loci, the variants identified as most associated with lung cancer had an MAF >10%. Although GWAS tagSNPs are unlikely themselves to be functional, they appear much more likely to tag a functional variant of a similar frequency than single or multiple rare causal variants. As some rare variants can be poorly imputed in GWAS, there remains the possibility that some low-frequency variants conferring moderate risks might have been missed. Although inflammatory bowel disease provides support for the existence of ‘synthetic associations’,18 most of the ‘evidence’ for such a model of disease association comes from simulation studies. Indeed, if such a genetic model was present, such associations would be highly tractable by linkage analysis. No putative linkage signals have, however, been identified in these regions.44

Given the caveats in correctly imputing very rare variants, our analysis did not provide evidence that any of the associations at the loci were a consequence of synthetic associations rather than LD with a common risk variant at these risk loci. This is not to say that rare disease-causing variants with large effect do not contribute to the heritable lung cancer risk in general. Evidence for this class of susceptibility is provided by the recent observation that the subpolymorphic variants BRCA2-K3326X and CHEK2-I157T are associated with substantive risk of squamous lung cancer in smokers.12

In summary, we have extensively characterized all genetic variation across five regions that have been reported to be associated with lung cancer in individuals of European descent. In addition to providing insight into the allelic architecture of these association signals, our study findings provide a resource informing functional analyses aimed at defining the biological basis of risk loci. Our analysis also reinforces the observation that many of the associations are histology specific. Identifying such histology-specific SNPs ultimately should refine our understanding of the origins of morphologic differences, and may contribute to the ongoing search for personalised treatment for subtype-specific lung cancer cases.