Introduction

Lung cancer is the leading cause of cancer death worldwide, causing as many deaths as the next four most deadly cancers combined (breast, prostate, colon, and pancreas), and the incidence of lung cancer is projected to double by 2050 [1].

Recent advances in molecular biology and the emergence of cost-efficient omics data have contributed new insight into the mechanisms involved in lung carcinogenesis. For instance, several epigenome-wide association studies have identified methylation changes that are associated with lung cancer risk2,3,4,5,6,7,8, using mainly tumor tissue collected at the time of diagnosis, but also blood from prevalent cases9. Other studies focusing on established lung carcinogens such as smoking used peripheral blood and identified several differentially methylated CpG sites10,11,12,13. Of these, a recent meta-analysis including 15,907 participants identified 2,623 CpG sites that were differentially methylated in relation to smoking status14.

Although smoking is an established causal risk factor responsible for a large proportion of lung cancer incidence, the smoking-related CpG sites identified so far have been shown to mediate only some, or little, of the effect of smoking on lung cancer6,15. Further, the most common histological subtype of lung cancer in never smokers is adenocarcinoma16, which might originate from specific, and as yet unknown, molecular mechanisms different from those involved in smoking-induced lung cancer17. It is therefore of interest to identify lung cancer biomarkers that are unrelated to smoking exposure in order to (i) gain a better understanding of the etiology of lung cancer and (ii) investigate whether the biological pathways affected by smoking-induced changes in DNA methylation are similar to those affected by differential methylation at CpG sites that are not associated with smoking exposure. To date, very few studies have investigated methylation markers of lung cancer risk that are not associated with such exposure8,18.

In the current study, we used full-resolution DNA methylation profiles from prospectively collected blood samples to identify methylation alterations in relation to future lung cancer diagnosis and assess the relationship between these markers and smoking exposure. Specifically, we adopted a stringent adjustment strategy to identify disease-related methylation changes at CpG sites that are not associated with smoking exposure and compare them to changes in methylation level at disease-related CpG sites that are also associated with smoking exposure. Finally, we exploited gene expression data measured in the same individuals to aid functional interpretation by exploring biological pathways of gene expression profiles affected by methylation changes at all lung cancer-related CpG sites.

Materials and Methods

Participants

Our study population included women from a lung cancer case–control study nested in the post genome cohort (N ~ 50 000) within the Norwegian Women and Cancer Study (NOWAC)19,20,21. All participating women were cancer-free at recruitment (1991–2006) and at time of blood sampling (2003–2006). Linkage to the national cancer registry identified 134 incident lung cancer cases. Cases were diagnosed between 2004 and 2011, and for each case, one control was matched on time since blood sampling and birth year. All participants gave written informed consent and the study was approved by the Regional Committee for Medical and Health Research Ethics and the Norwegian Data Inspectorate. We confirm that all methods employed in the study were performed in accordance with the relevant guidelines and regulations.

We used methylation data from the Netherlands Twin Register (NTR) for replication. Subjects in the NTR biobank study were recruited between 2004 and 2011 [22,23]. The study included 769 monozygotic (MZ) and 424 dizygotic (DZ) twin pairs. A blood sample was collected at inclusion, and the present study included 125 MZ and 146 DZ adult twin pairs who were discordant for smoking status at the time of blood sampling. These comprised pairs in which one twin had never smoked and the other was a current smoker (N = 53 MZ and 77 DZ), and pairs with one never smoker and one former smoker (N = 72 MZ and 69 DZ).

DNA methylation and gene expression microarray data

Genome-wide DNA methylation profiles from bisulphite-converted, hybridized genomic DNA from buffy coat samples were generated using Illumina Infinium HumanMethylation450 Bead-Chips following a protocol described previously for both NOWAC24 and NTR22,25 samples. DNA methylation levels at each locus were expressed as the ratio of intensities arising from methylated cytosines over total intensities. For NOWAC, sample preparation and data pre-processing were performed as described elsewhere24. In brief, probes (i) on sex chromosomes, (ii) reported to be cross-reactive26 and (iii) for which methylation levels were measured in <20% of the samples were excluded. Five samples did not pass quality controls and three subjects were excluded due to >95% missing in DNA methylation results. The final analysis included 428,629 probes targeting autosomal CpG loci in 260 women (131 cases and 129 controls). In NTR data, sample- and probe-level quality checks and data pre-processing were performed as described in detail previously25 and only CpG sites identified in the NOWAC discovery data were interrogated.

For 248 of the 260 women from the NOWAC study with DNA methylation data, gene expression profiles were also available; these were generated at the Norwegian University of Science and Technology. Total RNA was isolated using established protocols27 and microarray analyses were performed using Illumina HumanHT-12 expression Bead-Chips. Microarray data were quality-checked and pre-processed as previously described28. Original probe values were background-corrected, and probes reported by Illumina to be of poor quality or detected in <95% of samples were filtered out. Only transcripts on autosomal chromosomes were included in the analysis. The final gene expression data set included 18,955 transcripts assayed in 248 individuals.

Statistical models

We investigated the relationship between future lung cancer status and methylation levels using unconditional logistic regression models. As already described29, we corrected for technically induced variation in the methylation and gene expression data by fitting a preliminary linear mixed model that included technical covariates as random intercepts: chip ID and position on the chip for methylation data, and date of mRNA isolation and date of complementary RNA generation for gene expression data. To account for the case-control matching, we adjusted our models (fixed effects) for the two matching criteria: age at blood collection and sample storage time.

Methylation and gene expression levels used in the downstream analyses were represented by the residuals from these mixed models. Multiple testing was accounted for by using a Bonferroni correction ensuring a family-wise error rate below 5% (corresponding per test significance level was set to 1.16e-07). We report as effect size estimates the odds ratios (OR) for one standard deviation change in the methylation levels.
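The per-test significance level quoted above can be checked with a short calculation. The following is a minimal sketch, assuming the number of tests equals the 428,629 probes retained after quality control; the function name is hypothetical and not taken from the study code.

```python
# Number of autosomal CpG probes retained after quality control (from the text).
N_PROBES = 428_629
# Target family-wise error rate (5%, from the text).
FWER = 0.05


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level keeping the family-wise error rate <= alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


threshold = bonferroni_threshold(FWER, N_PROBES)
# The value is close to 1.17e-07; the text reports it as 1.16e-07.
print(f"Per-test significance level: {threshold:.4e}")
```

A CpG site is then declared significant only if its p-value falls below this threshold.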

We further adjusted the logistic regression models for blood cell composition, estimated according to the methods proposed by Houseman30,31. We specifically adjusted for estimated proportions of leukocytes (excluding natural killer cells and eosinophil granulocytes).

Lung cancer-related CpG sites that are not associated with smoking exposure (LC-non-AwS) were defined as those (i) found significantly associated to lung cancer in the main logistic model, and (ii) remaining associated to lung cancer upon adjustment for smoking. Conversely, lung cancer-related CpG sites that are associated with smoking exposure (LC-AwS) are defined as those losing statistical significance upon adjustment for smoking exposure. Confirmation of their lack of association to smoking exposure was sought in never smokers and an independent study including smoking-discordant twins. We investigated three measures of smoking exposure: smoking status, pack-years, and the comprehensive smoking index (CSI)32. CSI scores (Table S1) were obtained using duration of smoking (dur; years), intensity (int; average number of cigarettes per day during years of smoking), and time since smoking cessation (tsc; years) and fitting the following model to our data:

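$$\mathrm{CSI} = \left(1 - 0.5^{\,dur^{*}/\tau}\right)\left(0.5^{\,tsc^{*}/\tau}\right)\ln(int + 1),$$

where τ is the estimated half-life parameter and δ is an estimated lag-time parameter, which define the lagged time since cessation and lagged duration as $tsc^{*} = \max(tsc - \delta, 0)$ and $dur^{*} = \max(dur + tsc - \delta, 0) - tsc^{*}$.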
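To make the CSI definition concrete, the sketch below computes the score for a single individual. It assumes the commonly published form of the index, in which the lagged duration and lagged time since cessation are divided by the half-life τ in the exponents. The values of `tau` and `delta` used in the example are placeholders chosen for illustration only, not the parameters estimated in this study.

```python
import math


def comprehensive_smoking_index(dur, tsc, intensity, tau, delta):
    """CSI for one individual.

    dur       -- duration of smoking (years)
    tsc       -- time since smoking cessation (years; 0 for current smokers)
    intensity -- average number of cigarettes per day while smoking
    tau       -- half-life parameter (years)
    delta     -- lag-time parameter (years)
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Lagged time since cessation and lagged duration.
    tsc_star = max(tsc - delta, 0.0)
    dur_star = max(dur + tsc - delta, 0.0) - tsc_star
    return (
        (1.0 - 0.5 ** (dur_star / tau))
        * (0.5 ** (tsc_star / tau))
        * math.log(intensity + 1.0)
    )


# Placeholder parameters (assumptions for illustration, not study estimates).
TAU, DELTA = 10.0, 1.0

csi_never = comprehensive_smoking_index(0, 0, 0, TAU, DELTA)
csi_current = comprehensive_smoking_index(30, 0, 20, TAU, DELTA)
csi_former = comprehensive_smoking_index(30, 10, 20, TAU, DELTA)
print(csi_never, csi_current, csi_former)
```

With these placeholder parameters a never smoker scores zero, and a former smoker scores lower than a current smoker with the same duration and intensity, which is the behaviour the index is designed to capture.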

To further assess possible relations between methylation levels at disease-related sites and smoking exposure, we used the methylation data of the NTR study and ran paired Student’s t-tests comparing the mean methylation differences within pairs of MZ and DZ smoking-discordant twins. The paired t-tests were performed on residual methylation levels, which were obtained by adjusting the methylation levels (beta-values) for sex, age at blood sampling, measured cell counts (percentage of monocytes, eosinophils, and neutrophils), and technical covariates (array row and sample plate).
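The within-pair comparison described above can be sketched as follows. This is an illustrative implementation of the paired t statistic only; the residual methylation values are invented for the example, and obtaining a p-value would additionally require the Student t distribution with the returned degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(never_smoker, smoker):
    """Paired t statistic and degrees of freedom for within-pair differences.

    Each position in the two sequences refers to the same twin pair.
    """
    if len(never_smoker) != len(smoker):
        raise ValueError("both sequences must contain one value per twin pair")
    diffs = [s - n for n, s in zip(never_smoker, smoker)]
    n_pairs = len(diffs)
    if n_pairs < 2:
        raise ValueError("at least two pairs are required")
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all within-pair differences are identical")
    t_stat = mean_diff / (sd_diff / math.sqrt(n_pairs))
    return t_stat, n_pairs - 1


# Hypothetical residual methylation levels for four smoking-discordant pairs.
never = [0.80, 0.82, 0.79, 0.81]
current = [0.70, 0.75, 0.72, 0.71]
t_stat, dof = paired_t_statistic(never, current)
print(f"t = {t_stat:.3f}, df = {dof}")
```

A negative statistic here indicates lower methylation in the smoking twin than in the never-smoking co-twin.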

We ran a series of sensitivity analyses that included conditional logistic regressions for the (N = 128) case-control complete pairs. Further sensitivity analyses were restricted to (i) cases from each of the main histological subtypes separately (adenocarcinomas (N = 64), small cell and squamous (N = 43), others (N = 24))33, (ii) cases diagnosed before or after the median time elapsed from blood collection to diagnosis (4.2 years), (iii) cases that were current (N = 81), former (N = 36), or never (N = 14) smokers, separately. In these stratified analyses subsets of cases were compared to all healthy controls (N = 37, 35, 57 in current, former and never smokers) included in the study and, because case-control pairs were broken, we used unconditional logistic regression models as defined for the main analysis.

In order to ensure a wide explorative search of (N = n1) LC-non-AwS markers, which are likely weaker and less numerous than the (N = n2) LC-AwS markers, we complemented our list of n1 LC-non-AwS markers with a ‘second order’ set of (N = n1′) LC-non-AwS CpG sites, defined as those associated with at least one of the n1 LC-non-AwS markers but not directly with disease status. These were identified by regressing the methylation levels of the n1 LC-non-AwS CpG sites against the (428,629 − n1) remaining CpG sites. As before, we used a Bonferroni-corrected per-test significance level, here defined as 0.05/(n1 × (428,629 − n1)).

In order to help functional interpretation of the resulting epigenetic alterations, gene expression data measured in the same individuals were linked to the DNA methylation levels of the identified markers. Specifically, we ran linear regression models assessing the association between the 18,955 assayed transcripts and (i) each of the first and second order (n1 + n1′) LC-non-AwS CpG sites and (ii) each of the n2 LC-AwS CpG sites. Statistical significance of each CpG-transcript pair was evaluated using a Bonferroni-corrected per-test significance level (0.05/((n1 + n1′) × 18,955) and 0.05/(n2 × 18,955), respectively). The transcripts involved in any significant CpG-transcript pair were subsequently included in overrepresentation analyses based on hypergeometric tests, with a nominal p-value threshold of 0.05, using the ‘enrichGO’ function of the Bioconductor ‘clusterProfiler’ package34. All statistical analyses were performed using R (ver. 3.1.2, R Foundation for Statistical Computing, Vienna, Austria).
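The hypergeometric test underlying an overrepresentation analysis can be illustrated with a few lines of code. The sketch below is not the ‘enrichGO’ implementation; it only shows the upper-tail probability that such a test evaluates for one ontology category. The category size and overlap in the example are hypothetical, and using all 18,955 assayed transcripts as the background is an assumption of this illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(overlap, list_size, category_size, background):
    """P(X >= overlap) for X ~ Hypergeometric(background, category_size, list_size).

    overlap       -- transcripts in the list that belong to the category
    list_size     -- transcripts in the list of interest
    category_size -- background transcripts annotated to the category
    background    -- total number of background transcripts
    """
    if not 0 <= overlap <= min(list_size, category_size):
        raise ValueError("overlap is outside the feasible range")
    upper = min(list_size, category_size)
    total = comb(background, list_size)
    tail = sum(
        comb(category_size, i) * comb(background - category_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 4 of 19 transcripts fall in a category of 200
# transcripts, against a background of 18,955 assayed transcripts.
p_value = hypergeom_enrichment_pvalue(4, 19, 200, 18_955)
print(f"Enrichment p-value: {p_value:.3e}")
```

Exact integer arithmetic is used for the binomial coefficients, so the result does not suffer from overflow even for large backgrounds.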

Results

Sample description and overall lung cancer risk

Baseline characteristics of the NOWAC women and NTR study populations are summarized in Table 1. As expected, NOWAC cases were more commonly current smokers at blood sampling (62%) than controls (29%) (Table S1). Logistic regression models demonstrated elevated lung cancer risk in former smokers (OR = 4.07 (95% CI: 1.97–8.79)) and in current smokers (OR = 8.46 (95% CI: 4.31–17.53)) as compared to never smokers. The model including CSI score alone indicated an OR of 3.66 (95% CI: 2.53–5.42) for a one-unit increase in CSI values, and this model provided a better fit than those based on the other smoking metrics (smoking status and pack-years; AIC results not shown). Estimated cell type proportions were similar in cases and controls, except for natural killer cells, which were underrepresented in cases and among current smokers (Table S2).
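As a rough cross-check of the smoking-status estimates, crude odds ratios can be computed from the group sizes given in the Methods (81 current, 36 former and 14 never smokers among cases; 37, 35 and 57 among controls). The sketch below is an illustration only: it uses a simple 2×2 calculation with a Woolf-type (log-scale) confidence interval, which is an assumption of this example and not the fitted model, so the values are expected to be close to, but not identical with, the reported ORs. The function name is hypothetical.

```python
import math


def crude_odds_ratio(exp_cases, exp_controls, ref_cases, ref_controls, z=1.96):
    """Crude odds ratio with an approximate 95% CI on the log scale."""
    counts = (exp_cases, exp_controls, ref_cases, ref_controls)
    if min(counts) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (exp_cases * ref_controls) / (exp_controls * ref_cases)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(sum(1.0 / c for c in counts))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Group sizes taken from the stratified analyses described in the Methods.
cases = {"current": 81, "former": 36, "never": 14}
controls = {"current": 37, "former": 35, "never": 57}

results = {}
for status in ("former", "current"):
    results[status] = crude_odds_ratio(
        cases[status], controls[status], cases["never"], controls["never"]
    )
    or_, lo, hi = results[status]
    print(f"{status} vs never: crude OR = {or_:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```

Any difference between these crude values and the ORs reported above would be consistent with the published models including further adjustment terms.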

Table 1 Characteristics of the NOWAC (women only) and the NTR populations.

Differentially methylated CpG sites associated to lung cancer risk

We identified 25 CpG sites at which lower methylation levels were associated to higher lung cancer risk (Tables 2 and S3; boxplots of the methylation according to case/control status in Figure S1 and volcano plot in Figure S2). After adjustment for smoking, n2 = 23 of these sites were classified as LC-AwS markers, as their associations lost statistical significance (Table 2). Among the different smoking metrics considered, CSI appeared to provide the most stringent adjustment as depicted by flattened p-value distribution (Figure S3, estimates for the covariates adjusted for are presented in Table S4). Only n1 = 2 CpGs remained associated with lung cancer risk after controlling for CSI and were classified as LC-non-AwS markers (Table 2): cg10151248; PC (OR = 0.34) and cg13482620; B3GNTL1 (OR = 0.33). These two LC-non-AwS CpGs were also significantly associated with lung cancer after further adjustment for blood cell composition (Table S5). The correlations between the n1 = 2 LC-non-AwS CpG sites and the n2 = 23 LC-AwS sites were moderate (Fig. 1). Conversely, we observed stronger block correlations within the LC-AwS sites, and in particular a subset of eight CpG sites (Figure S4). Results in figures and tables are presented separately for LC-AwS and LC-non-AwS sites.

Table 2 25 Bonferroni significant CpG sites differentially methylated in cases as compared to controls (N = 131 cases, 129 controls) that were un-associated with smoking (LC-non-AwS), or associated with smoking (LC-AwS).
Figure 1

Heatmap of the correlation between the two CpGs un-associated with smoking (LC-non-AwS) and the 23 CpGs associated with smoking (LC-AwS). Figure note: The correlation strength is represented by color as indicated in the bar to the right.

The two LC-non-AwS CpG sites were also associated with lung cancer risk in never smokers (OR = 0.36 (95% CI: 0.17–0.77) and OR = 0.31 (95% CI: 0.14–0.67) for cg10151248-PC and cg13482620-B3GNTL1, respectively). Estimates were consistent in current smokers for cg10151248-PC and cg13482620-B3GNTL1 (OR = 0.32 (95% CI: 0.19–0.57) and OR = 0.33 (95% CI: 0.18–0.61), respectively) but slightly weaker in former smokers (OR = 0.43 (95% CI: 0.23–0.82) and 0.50 (95% CI: 0.30–0.85)). The stratified analysis showed that 10 of the 23 LC-AwS CpG sites were significantly associated with lung cancer status in never smokers (Table S6). The methylation levels of the two most strongly associated LC-AwS CpG sites, cg05575921-AHRR and cg03636183-F2RL3, were not associated with lung cancer risk in never smokers (OR = 0.27 (95% CI: 0.03–2.14) and 1.22 (95% CI: 0.32–4.68), respectively).

Additional stratification on histological subtypes provided consistent OR estimates for cg10151248-PC across histological subtypes (Table S7; range: 0.36–0.39), and stronger effects of methylation levels were estimated in cases with shorter time to diagnosis (0.33 vs 0.41 for short and long time to diagnosis, respectively). For cg13482620-B3GNTL1, effect size estimates were consistent in both time to diagnosis classes, but the OR was lower in adenocarcinoma cases (OR = 0.35) than in ‘all other subtypes’ and ‘squamous and small cell’ cases (OR >0.49). Corresponding stratified analyses for LC-AwS CpGs are also presented in Table S7.

Using conditional logistic regressions unadjusted for smoking exposure, as a sensitivity analysis, only two of the n1 + n2 = 25 candidate CpG sites reached Bonferroni significance level (cg05575921 and cg06126421, p-values 3.99e−08 and 6.68e−08, respectively).

When comparing mean methylation levels for the 25 candidate CpG sites within pairs of smoking-discordant twins (MZ or all), we found no differences between never smokers and ever/current smokers for the two LC-non-AwS markers (Table 3). Comparison of the mean methylation levels at the LC-AwS CpG sites showed significant differences at eight CpG sites when comparing smokers (current or ever) to never smokers. When these comparisons were restricted to MZ twin pairs, six and eight CpG sites were significantly different in the never versus current and never versus ever comparisons, respectively (Table 3).

Table 3 Difference in methylation in twins discordant according to smoking status in the NTR study for the CpG sites associated with lung cancer identified as un-associated with smoking (LC-non-AwS), or associated with smoking (LC-AwS) in the NOWAC study.

Functional investigation of the 25 candidate CpG sites

No significant association was found between DNA methylation levels at either LC-non-AwS site (cg10151248-PC and cg13482620-B3GNTL1) and the expression levels of the 18,955 transcripts assayed (which included one transcript each for the PC and B3GNTL1 genes). We identified a total of n1′ = 1,987 ‘second order’ LC-non-AwS CpG sites whose methylation levels were associated with that of at least one of the n1 = 2 LC-non-AwS CpG sites, and not directly with disease risk. Of these, 160 and 1,876 were associated with methylation levels of cg10151248-PC and cg13482620-B3GNTL1, respectively, and their pairwise correlation is presented in Figure S5A. When regressing the n1′ ‘second order’ set of CpG sites against the gene expression levels, we identified (i) 19 significant CpG-transcript pairs for cg10151248-PC (Table S8), corresponding to 19 unique transcripts and one unique CpG site (Table 4), and (ii) 137 CpG-transcript pairs for cg13482620-B3GNTL1 (Table S9), including 127 unique transcripts and nine CpG sites (Table 4). The correlations between transcripts associated with the methylation levels of at least one ‘second order’ CpG site are presented in Figure S5B. Overrepresentation analyses of transcripts involved in these significant LC-non-AwS CpG-transcript pairs demonstrated distinct enriched ontology categories relating to immune response, and involving beta cells (Fig. 2, Tables S10 and S11, respectively).

Table 4 The number of transcripts associated to the ‘second-order’ CpGs un-associated with smoking (LC-non-AwS) and the CpGs associated with smoking (LC-AwS).
Figure 2

Network visualizations of gene ontology categories in which genes were significantly overrepresented, for the genes associated to the ‘second order’ CpGs un-associated with smoking (CpGs associated with cg10151248-PC and cg13482620-B3GNTL1) as well as those associated to the 23 CpGs associated with smoking (LC-AwS). Figure note: Biological processes categories are colored according to the significance of the overrepresentation and the gene ratio signifies the number of genes in each list relative to the number of genes in the ontology categories.

For the n2 = 23 LC-AwS CpG sites we identified 168 significant CpG-transcript pairs (Tables 4 and S12), corresponding to 100 unique transcripts and eight unique CpG sites. Overrepresentation analyses identified ontology categories distinctly different from those identified above and mostly related to responses to external stressors (Fig. 2 and Table S13).

Discussion

We combined genome-wide methylation and gene expression profiles from prospective blood samples in the NOWAC study to identify markers of lung cancer risk in Norwegian women and investigated to what extent these associations were driven by exposure to smoking. We identified 25 CpG sites associated with lung cancer risk, of which 23 were classified as LC-AwS, as they lost statistical significance after stringent adjustment for smoking exposure metrics. The two remaining CpG sites (cg10151248-PC and cg13482620-B3GNTL1) were classified as LC-non-AwS CpG sites, as they remained statistically significant after adjustment for CSI and demonstrated low correlation with the other 23 CpGs. For the majority of markers, the case-control difference was larger with shorter time to diagnosis.

Of the 23 LC-AwS CpG sites, eight have been acknowledged as epigenetic signatures of cigarette smoking in a recent large meta-analysis of DNA methylation and smoking14. Pairwise correlations among the same eight LC-AwS CpG sites were also markedly higher than correlations with the other 15 LC-AwS CpG sites, supporting the evidence of these being linked to smoking. Furthermore, the same eight LC-AwS CpG sites were differentially methylated in smoking discordant twins. For the majority of the 23 LC-AwS markers, the association with risk was also stronger in the smoking related histological subtypes as compared to adenocarcinoma. The evidence to classify the 23 CpG sites as LC-AwS was not equally strong, but was considered sufficient for them to be treated separately as LC-AwS markers in the downstream analyses.

The two LC-non-AwS CpG sites were consistently not associated to smoking exposure in all the analyses performed, which indicates that they are minimally associated with smoking. The association between LC-non-AwS CpG sites and risk was stronger in adenocarcinoma cases compared to the other more smoking-induced histological subtypes35, which was not the trend observed in LC-AwS markers. Although hampered by statistical power in stratified analyses of never smokers, the same two LC-non-AwS markers were found to be statistically significant, along with 10 of the LC-AwS CpGs. Comparing pairs of smoking discordant twins revealed no difference in methylation levels at LC-non-AwS CpG sites. Finally, the two LC-non-AwS CpGs were not identified in the large meta-analyses for epigenetic smoking signatures14 and we did not identify any single-nucleotide polymorphisms reported in the vicinity of these two sites36, hence arguing against possible genetic confounding. Taken together, this supports that the two LC-non-AwS CpG sites, and in particular cg10151248-PC, are not associated with smoking and are distinct from the LC-AwS markers.

To enable deeper investigation of the functional role of the methylation changes at the LC-non-AwS CpG sites, we defined ‘second order’ CpG sites as those associated with the methylation levels at either of the two LC-non-AwS CpG sites but not directly with lung cancer risk. None of the 160 CpG sites associated with cg10151248-PC were associated with smoking status in a recent large meta-analysis of DNA methylation and smoking status14, and 22 of the 1,876 for cg13482620-B3GNTL1 (1.2%) were reported as LC-AwS markers. On this basis, we explored whether the two LC-non-AwS markers and the complemented list of markers less associated with smoking could provide novel pathway information relevant for lung cancer development.

The candidate methylation markers were further investigated by exploring the association between methylation levels at lung cancer-related markers and gene expression data available in the same individuals. In order to ensure a comprehensive search for distinguishable pathways, the full sets of markers were explored separately for the LC-AwS and LC-non-AwS CpG sites. Because regulation of gene expression through differential methylation obeys complex and multivariate mechanisms and can operate remotely (‘trans’ effects)37, all assayed transcripts were investigated. No transcript was directly associated with methylation levels at cg10151248-PC or cg13482620-B3GNTL1 (including the PC and B3GNTL1 transcripts themselves), which may not be surprising as both sites are highly methylated and show small, although significant, differences between cases and controls. However, we identified associations between methylation levels at the ‘second order’ CpG sites and transcripts. The significant CpG-transcript pairs for the 160 cg10151248-PC-related CpG sites involved 19 transcripts, none of which were AwS markers either in our data or in the large meta-analysis of gene expression data38, while 33 of the 127 transcripts involved in cg13482620-B3GNTL1-related CpG-transcript pairs were identified in the large meta-analysis38.

In the exploration of LC-non-AwS markers of lung cancer, which are likely to be more subtle signals than LC-AwS markers, an enriched CpG list was assessed when comparing potential functional roles of the different sets of markers identified. The gene ontology categories identified for the transcripts of the ‘second order’ CpG sites of cg10151248-PC and cg13482620-B3GNTL1 showed a large degree of overlap for categories linked to immune responses. The genes, and consequently the categories, indicated for cg10151248-PC clearly differed from those derived from LC-AwS CpG sites (categories linked to response to external stressors). Results from cg13482620-B3GNTL1 showed similarity with those from cg10151248-PC but also exhibited some common categories with LC-AwS sites. This indicates that a wide search provided novel information on potential pathways of relevance for lung cancer.

There is very limited evidence in the literature linking the methylation or expression levels at the two LC-non-AwS CpG sites to health outcomes. Notably, the CpG methylation levels at PC and B3GNTL1 (located in an unknown gene region and a shelf region, respectively) were not associated with transcript expression for the same genes. Nevertheless, hypermethylation at another CpG site in the gene B3GNTL1 has been observed in colorectal tumors compared to adjacent tissue39, and the upregulated expression of this gene has been indicated as a potential marker for colorectal cancer40. Conversely, to the best of our knowledge, there is no reported characterization of the downstream consequences of altered methylation levels at cg10151248-PC.

Residual confounding by smoking in our adjusted analyses cannot be disregarded. However, CSI appeared to be a stringent adjustment for exposure to smoking, and the argumentation above supports the manner in which we classified markers as being LC-AwS or LC-non-AwS (or not directly for cg13482620-B3GNTL1). Further, adjustment for estimates of white blood cell composition was not emphasized here due to the potential over-adjustment by smoking.

In conclusion, using blood-derived DNA methylation and gene expression profiles from a prospective lung cancer study in Norwegian women, our study identified 25 CpG sites that were differentially methylated prior to lung cancer diagnosis, of which two appeared to be LC-non-AwS, in particular cg10151248-PC. These LC-non-AwS CpG sites seemed to be involved in biological pathways distinct from those related to LC-AwS CpG sites, and linked to immunological changes in blood prior to cancer diagnosis. Although the study size is limited, the use of a stringent significance level when assessing DNA methylation and gene expression data has revealed markers that represent prospective population-specific markers of smoking exposure as well as markers potentially relevant to lung cancer development, which warrant further study.