A multi-omics study links TNS3 and SEPT7 to long-term former smoking NSCLC survival

The genetic architecture of non-small cell lung cancer (NSCLC) is relevant to smoking status. However, the genetic contribution of long-term smoking cessation to the prognosis of NSCLC patients remains largely unknown. We conducted a genome-wide association study of prognosis in 1299 NSCLC patients who were long-term former smokers, drawn from independent discovery (n = 566) and validation (n = 733) sets, and used in silico function prediction and multi-omics analysis to identify single nucleotide polymorphisms (SNPs) associated with NSCLC prognosis. We further detected SNPs with at least moderate association strength on survival within each group of never, short-term former, long-term former, and current smokers, and compared their genetic similarity at the SNP, gene, expression quantitative trait loci (eQTL), enhancer, and pathway levels. We identified two SNPs, rs34211819 in TNS3 at 7p12.3 (P = 3.90 × 10⁻⁹) and rs1143149 in SEPT7 at 7p14.2 (P = 9.75 × 10⁻⁹), that were significantly associated with survival of NSCLC patients who were long-term former smokers. Both SNPs had significant interaction effects with years of smoking cessation (rs34211819 in TNS3: P interaction = 8.0 × 10⁻⁴; rs1143149 in SEPT7: P interaction = 0.003). In addition, in silico function prediction and multi-omics analysis provided evidence that these QTLs were associated with survival. Moreover, comparison analysis found higher genetic similarity between long-term former smokers and never-smokers than between long-term former smokers and either short-term former smokers or current smokers. Pathway enrichment analysis indicated a unique pattern among long-term former smokers that was related to immune pathways. This study provides important insights into the genetic architecture associated with NSCLC in long-term former smokers.

INTRODUCTION
Non-small cell lung cancer (NSCLC) accounts for more than 85% of lung cancer cases and is the most commonly diagnosed malignant disease 1. Tobacco smoking is a well-known environmental exposure leading to lung cancer 2,3 and has been linked to the majority of NSCLC deaths 4. Emerging evidence shows that NSCLC in smokers and NSCLC in never smokers are distinct entities 5-7.
It is, however, less well known that former smokers who quit smoking long before diagnosis (e.g., >10 years), termed long-term former smokers, may also develop lung cancer 8, and NSCLC patients who are long-term former smokers still carry a high mortality risk 9. So far, few studies have focused on the impact of genetic architecture on the prognosis of this special group of patients, for whom the molecular mechanisms of cancer death remain unclear.
Leveraging the well-established International Lung Cancer Consortium (ILCCO) and Harvard Lung Cancer Study (HLCS), we conducted a genome-wide association study (GWAS) focusing on NSCLC patients who were long-term former smokers and identified germline genetic variants associated with overall survival time. We detected a set of SNPs with at least moderate association signals with survival in this subgroup as well as in the other smoking subgroups, namely, never smokers, short-term former smokers, and current smokers. Comparing these SNPs, we investigated the genetic similarity across these smoking subgroups at the SNP, gene, expression quantitative trait loci (eQTL), enhancer, and pathway levels.

Genetic variants are associated with survival
The genomic control inflation factor (λ) for the overall population under the additive model was estimated to be 1.079 (Supplementary Fig. 1), which was comparable to that of a previous genome-wide survival study 10, indicating that confounding by population stratification was well controlled. Two SNPs (rs34211819 at chromosome 7p12.3 and rs1143149 at 7p14.2) reached genome-wide significance among NSCLC patients who were long-term former smokers (Fig. 1a).
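As a rough illustration of how an inflation factor such as λ is commonly obtained, the sketch below converts two-sided P values to 1-df chi-square statistics and divides their median by the expected median under the null. The function name and the evenly spaced null P values are hypothetical, and the text does not state that the authors computed λ exactly this way.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(p_values):
    """Return lambda = median(observed chi-square) / median(expected chi-square)."""
    norm = NormalDist()
    # A two-sided P value p maps to z = Phi^{-1}(p / 2); chi-square (1 df) = z ** 2.
    # Using the lower tail avoids precision loss for very small p. Requires 0 < p <= 1.
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

# Hypothetical null P values, evenly spread over (0, 1): lambda should be close to 1.
null_p = [(i + 0.5) / 10000 for i in range(10000)]
lam = genomic_inflation(null_p)
```

Values of λ near 1, such as the 1.079 reported here, suggest little systematic inflation of the test statistics.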

Interaction of genetic variants with years of smoking cessation
We performed a stratified analysis by years of smoking cessation for both long-term and short-term former smokers to evaluate the modifying effect of years of smoking cessation. As years of smoking cessation increased, the protective effect of rs34211819 and the detrimental effect of rs1143149 on survival both became stronger (Fig. 2b, c). The effects of the two SNPs were significant only in long-term former smokers, not in short-term former smokers, indicating significant heterogeneity between these two groups (rs34211819: P heterogeneity = 0.042; rs1143149: P heterogeneity = 0.034). The trend test detected significant trends for both SNPs across the subgroups (rs34211819: P trend = 0.003; rs1143149: P trend = 0.045). Further, we detected significant interaction effects between the SNPs and years of smoking cessation, a type of gene-environment interaction (rs34211819 × years: P interaction = 8.0 × 10⁻⁴; rs1143149 × years: P interaction = 0.003) (Fig. 2d, e).
We predicted the functional relevance by SNPinfo, RegulomeDB, and HaploReg v4.1 (Supplementary Table 4). rs34211819 in TNS3 had a high score of protein binding and could bind two proteins: B cell lymphoma 3 (BCL3) and octamer transcription factor 2 (OCT2).

Genetic similarity across patients with different smoking statuses
We first performed genetic similarity comparisons at the SNP level by extracting moderate-to-high signals from the results of survival analysis within different smoking subgroups. A total of 7789 independent SNPs were observed in long-term former smokers, comparable to the number in never-smokers (n = 7358) but 45.8% and 85.5% more than in short-term former smokers (n = 5343) and current smokers (n = 4198), respectively. No SNPs were shared across the four smoking subgroups. Long-term former smokers had only 23, 20, and 11 overlapping SNPs with never smokers, short-term former smokers, and current smokers, respectively (Fig. 4a), indicating that the significant prognostic SNPs for NSCLC patients who were long-term former smokers seemed to differ from those for the other smoking groups.
Figure: Multi-omics analyses for TNS3 and SEPT7. a Multi-omics analyses flowchart. We observed significant eQTL/meQTL relationship, survival-related methylation, and expression patterns and upregulated proteins in NSCLC. b Significant eQTLs were observed for the two SNPs and cis genes. c Kaplan-Meier survival curves for expression of both genes using TCGA long-term former smoking patients. d meQTL boxplots of the two CpG probes and SNPs. e DNA methylation survival analysis for the two CpG probes. f Proteins BCL3 and OCT2 bound by SNPs also differed between tumor and normal tissues.
For gene-level comparisons, the identified SNPs were assigned to genes within each subgroup (Fig. 4b). We identified 3837 genes in long-term former smokers, 1006 of which were shared with never-smokers, 22.1% more than those shared with short-term former smokers (824 genes) and 45.8% more than those shared with current smokers (690 genes). A total of 187 genes were shared by all the subgroups (Supplementary Table 5). Protein-coding genes showed the same patterns of similarity (Fig. 4c).
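The relative differences quoted above follow directly from the reported counts, and the overlap itself is a set intersection. The sketch below reproduces the two percentages from the text. The gene labels in the toy sets are hypothetical placeholders, not study data.

```python
def pct_more(a, b):
    """Percentage by which count a exceeds count b."""
    return 100.0 * (a - b) / b

# Numbers of long-term former smoker genes shared with each subgroup (from the text).
shared = {"never": 1006, "short_term_former": 824, "current": 690}

vs_short = pct_more(shared["never"], shared["short_term_former"])
vs_current = pct_more(shared["never"], shared["current"])

# Overlap between two subgroups is a set intersection (hypothetical toy gene labels).
long_term_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
never_genes = {"GENE_B", "GENE_C", "GENE_D", "GENE_E"}
overlap = long_term_genes & never_genes

print(f"{vs_short:.1f}% and {vs_current:.1f}% more; toy overlap = {sorted(overlap)}")
```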
We further investigated the eQTL-related genes based on the GTEx lung tissue database (Fig. 4d) and enhancers from the FANTOM5 database (Fig. 4e). For long-term former smokers, the same trend as for germline-regulated genes was observed, which may indicate higher similarity with the never-smoking subgroup than with the others. Only two eQTL-related genes were shared among all subgroups: ARHGAP15 and TSPAN9. We also performed sensitivity analyses under different thresholds (P < 10⁻⁴ and P < 10⁻⁵) and obtained similar results (Supplementary Table 6).

Unique and shared pathways in long-term former smokers
We explored Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways in the gene set enrichment analysis of germline-regulated genes. A total of 53 pathways were significant in long-term former smokers (Fig. 4f). A total of 18 pathways were shared across all the subgroups, including well-known signaling pathways such as the mitogen-activated protein kinase, oxytocin, Ras, Rap1, and calcium pathways (Fig. 4g). However, 11 pathways were significant only in long-term former smokers, most of which were linked to immune function, such as the B cell receptor signaling pathway, human T cell leukemia virus 1 infection, human immunodeficiency virus 1 infection, and T helper type 1 and T helper type 2 cell differentiation (Fig. 4h).

DISCUSSION
Understanding genetic risk factors in cancer is important for uncovering its underlying biological mechanisms 11,12. Our genome-wide investigation among long-term former smokers with NSCLC detected rs34211819 in TNS3 and rs1143149 in SEPT7, which were associated with survival. Significant interactions revealed that the effects of SNPs could be modified by years of smoking cessation, perhaps indicating opportunities for clinical adjuvant therapy with immunotherapies in patients with risk alleles, while an immune function relationship was found for cis-regulated genes or SNP-binding proteins. Multi-omics analyses provided evidence of their eQTL/meQTL relationships, survival-related methylation, and expression patterns and upregulated proteins in NSCLC. Furthermore, we observed higher similarities between long-term former smokers and never-smokers, compared to short-term former smokers or current smokers. The distinct SNP patterns in long-term former smokers were linked to immune signals.
Fig. 4 Results of genetic similarity comparative analysis. Similarity comparison of SNPs with P < 10⁻³ at SNP level (a), germline-regulated gene level (b), germline-regulated protein-coding gene level (c), eQTL-related gene level (d), enhancer level (e), and KEGG pathway level (f), respectively. Protein-coding genes were defined by the GENCODE database. g-h Shared and unique KEGG pathways in long-term former smokers compared with never, short-term former, and current smokers.
Tobacco smoking is associated with worse outcomes in lung cancer, as it leads to downregulation of proinflammatory cytokines, immunosuppression, and anti-inflammatory effects mediated by oxidants, carbon monoxide, nicotine, and transcriptional modifying compounds, especially in lung tissues 13,14. Chronic inhalation of cigarette smoke affects a wide range of immunological functions, including innate and adaptive immune responses 15. As a result, the immune system may be suppressed in anyone who has been exposed to smoking, regardless of the years of smoking or of cessation, as evidenced by the same signaling pathways shared by all the smoking subgroups in our study. Tobacco smoking can affect the immune system by chemically modifying signaling pathways as well as the extracellular matrix through acetylation, nitrosylation, carbonylation, and oxidation, thereby affecting cell survival, activation, and differentiation 16.
The identified genes were associated with NSCLC survival at the genomic, epigenomic, and transcriptomic levels and were related to immunological functions. SEPT7 is a member of the septin family of GTP-binding proteins, which form higher-order filamentous structures and function primarily in spatial organization and compartmentalization of many cellular processes. SEPT7 is structurally related to RAS oncogenes, which promote tumorigenesis 17. SEPT7 is also implicated in several types of cancer 18-20. While the septin family plays a critical role in cytokinesis 21, SEPT7 is also involved in the cytoskeleton and participates in regulation of cytokinesis 22,23. Septin-deficient T cells fail to complete cytokinesis when prompted by pharmacological activation or cytokines. Meanwhile, SEPT7-deficient fibroblasts display incomplete cytokinesis and constitutive multinucleation by affecting the mitotic spindle and midbody rather than the contractile ring 24. As a result, SEPT7 plays a crucial role in immune functions including cytokinesis and mitosis, which are closely related to molecular changes from smoking exposure.
Another locus at 7p12 is marked by an intronic SNP in TNS3, which encodes tensin-3, a member of a family of focal adhesion-associated proteins that regulate cell adhesion and migration 25. This gene may be an activator of cell migration and a promoter of invasion in tumor metastasis 26,27. Here, we found that TNS3 acts as an oncogene, affecting regulation of methylation in lung cancer. A TNS3 methylation pattern in the promoter region can silence expression in renal cell carcinoma 28. Further, BCL3 (19q13.32) and OCT2 (19q13.2), the proteins bound at rs34211819, are strongly linked to immune function. BCL3 translocates to the immunoglobulin alpha-locus in B cell chronic lymphocytic leukemia 29. As an oncogene, it is an atypical member of the inhibitor of nuclear factor kappa B (NF-κB) family of proteins that can activate the NF-κB signaling cascade by directly binding to the transcription factors NFKB1 and NFKB2 30. It is upregulated by cytokines such as tumor necrosis factor alpha, interleukin 4 (IL-4), IL-1, and IL-6 31. OCT2 acts as a DNA-binding transcriptional activator of immunoglobulin in B-lineage cells 32. It enables B cells to respond normally to antigen receptor signals and mediate the physical interaction with T cells or to produce and respond to cytokines that are critical drivers of B cell and T cell differentiation during the immune response 31,33.
Genetic similarity analysis of subgroups showed low overlap at the SNP level but relatively higher overlap of germline-regulated genes. It is possible that each SNP is unique to its subgroup and acts as an eQTL to regulate cancer-specific genes. Additionally, the effects of somatic mutations driving lung cancer should be investigated. We also found higher similarities between long-term former smokers and never-smokers, which mainly involved inflammation and immune mechanisms, as reported in previous studies 10,34.
This study had some limitations. First, although we used the largest lung cancer consortium to date, further external cohort validation with follow-up information and smoking cessation details is warranted. Second, we selected SNPs with moderate association strengths (from P < 10⁻⁵ to P < 10⁻³); although this was a reasonable approach 35,36, some false-positive SNPs may have been included. More well-designed functional experiments are necessary to validate the biological functions.
This study also had several strengths. To the best of our knowledge, this is the first GWAS to investigate the effects of genetic variants on NSCLC patients among long-term former smokers. We included a large and relatively homogeneous study population with relatively complete and accurate follow-up, demographic, and clinical covariate information from ILCCO and HLCS. In addition, we investigated the association of candidate genes and lung cancer at the multi-omics levels, including genomics, transcriptomics, epigenomics, and proteomics. Similarity comparisons among different smoking subgroups revealed the shared genetics status at multiple levels.
In summary, our study demonstrated that TNS3 at 7p12.3 and SEPT7 at 7p14.2 are genetic regions associated with survival among long-term former smokers, and the findings specific to this subgroup were related to immune function. Our results may shed light on the important roles of genetic architecture in cancer outcomes among long-term former smokers with NSCLC, a subpopulation that has been less studied.

Study population
In accordance with previous studies 37-39, long-term former smokers were defined as patients who quit smoking at least 10 years before diagnosis, whereas short-term former smokers quit <10 years before diagnosis, and never smokers were those who had smoked <100 cigarettes during their lifetime. To identify prognostic SNPs, we focused on long-term former smokers, whose characteristics are presented in Table 1. Patients in the discovery set were recruited from ILCCO, including the Cancer de Pulmon en Asturias study, Carotene and Retinol Efficacy Trial (CARET), Liverpool Lung Cancer Project, MD Anderson Cancer Center Study, and Mount-Sinai Hospital-Princess Margaret Study. Patients in the independent validation set were recruited from HLCS (Supplementary Information). Approval for ILCCO studies was obtained from each of the participating institutional research ethics review boards. For HLCS, the Institutional Review Board of MGH and the Human Subjects Committee of the Massachusetts General Hospital and Harvard School of Public Health approved the study. All participants provided written informed consent to take part in the study.
Of the 6129 NSCLC patients with follow-up information, 4351 eligible cases had available smoking information, including 504 never smokers, 1299 long-term former smokers, 687 short-term former smokers, and 1861 current smokers (Supplementary Table 7).

OncoArray genotype quality control and imputation
The ILCCO study and HLCS were originally designed and genotyped as case-control studies of lung cancer risk. In this study, we extracted all NSCLC patients with survival information from these two studies. Patient genotypes were generated using the Infinium OncoArray-500k BeadChip (Illumina, San Diego, CA, USA), with standard quality control procedures performed on all eligible individuals. Briefly, we excluded samples with <95% completion and SNP assays with call rates <95% or deviating from Hardy-Weinberg equilibrium (P < 10⁻⁶). Only SNPs with minor allele frequencies (MAFs) ≥ 0.05 mapping to autosomal chromosomes were included in the analysis. A total of 416,861 SNPs passed quality control 40.
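The SNP-level filters described above can be sketched from genotype counts as follows. The thresholds come from the text, while the function names and example counts are hypothetical. The Hardy-Weinberg check uses a 1-df chi-square goodness-of-fit test as an assumption, since the text does not say whether a chi-square or an exact test was used.

```python
import math

def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    freq_b = (n_ab + 2 * n_bb) / (2 * n)
    return min(freq_b, 1 - freq_b)

def hwe_chi2_p(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    Assumes the SNP is polymorphic (both alleles observed).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_call_rate=0.95, min_maf=0.05, min_hwe_p=1e-6):
    """Apply call-rate, MAF, and HWE filters with the thresholds from the text."""
    called = n_aa + n_ab + n_bb
    call_rate = called / (called + n_missing)
    # MAF is checked before HWE so monomorphic SNPs never reach the HWE test.
    return (call_rate >= min_call_rate
            and maf(n_aa, n_ab, n_bb) >= min_maf
            and hwe_chi2_p(n_aa, n_ab, n_bb) >= min_hwe_p)

# Hypothetical genotype counts for three SNPs.
in_equilibrium = passes_qc(360, 480, 160, n_missing=0)
no_heterozygotes = passes_qc(500, 0, 500, n_missing=0)
monomorphic = passes_qc(1000, 0, 0, n_missing=0)
```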
Genome-wide imputation following the Michigan Imputation Server pipeline 41 was performed to estimate missing genotype information. We phased haplotypes with Eagle v2.3 using 1000 Genomes Project data (phase 3) as a reference panel 42 and then performed imputation using the Minimac (version 3) software. SNPs with an imputation quality score R² < 0.8, MAF < 0.05, or P < 10⁻⁶ for the Hardy-Weinberg equilibrium test were excluded from analyses.

Two-stage GWAS survival analysis
We used a two-stage strategy to identify significant prognostic SNPs for NSCLC patients who were long-term former smokers (Fig. 5). SNPs were encoded with an additive model (0: wild type; 1: heterozygosity; 2: homozygosity), unless otherwise stated. To quantify the association of each SNP with survival, we used the Cox proportional hazards regression model, adjusting for age, gender, clinical stage (I-IV), histology, pack-years of smoking, years of smoking cessation, and study center. Hazard ratios (HRs) and 95% confidence intervals (CIs) describe the mortality risk per minor allele carried. To control for the confounding effects of population stratification, we also performed principal component analysis (PCA) and included the first three principal components in the model, although no population stratification was observed in the PCA plot (Supplementary Fig. 2). Following the selection criteria commonly used in previous GWAS survival studies of cancers 43,44, we defined significant SNPs as those that met the following criteria: (i) P ≤ 10⁻⁵ in the discovery set; (ii) P ≤ 0.05 in the validation set; and (iii) P ≤ 5 × 10⁻⁸ in the combined set, reaching genome-wide significance. Kaplan-Meier curves were generated to illustrate survival differences between groups with different SNP genotypes, DNA methylation levels, or gene expression levels.
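The three selection criteria above amount to a simple filter over per-SNP results. The sketch below illustrates that filter only. The record layout, SNP labels, and P values are hypothetical and are not results from the study.

```python
from dataclasses import dataclass

@dataclass
class SnpResult:
    rsid: str
    p_discovery: float
    p_validation: float
    p_combined: float

def is_significant(r, p_disc=1e-5, p_val=0.05, p_comb=5e-8):
    """Criteria (i)-(iii): discovery, validation, and combined-set thresholds."""
    return (r.p_discovery <= p_disc
            and r.p_validation <= p_val
            and r.p_combined <= p_comb)

# Hypothetical results, for illustration only.
results = [
    SnpResult("snp_1", 2e-6, 0.01, 4e-9),   # meets all three criteria
    SnpResult("snp_2", 3e-6, 0.20, 1e-6),   # fails validation and combined
    SnpResult("snp_3", 5e-4, 0.001, 2e-8),  # fails the discovery threshold
]
significant = [r.rsid for r in results if is_significant(r)]
```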

Gene-environment interaction analysis
Since these SNPs were significantly associated with the prognosis of long-term former smokers with NSCLC, we further conducted gene-environment interaction analysis by estimating how their effects varied with increasing years of smoking cessation. Specifically, we included interaction terms between these SNPs and years of smoking cessation in the Cox proportional hazards regression model, adjusted for the same covariates as described above.
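One standard way to write such an interaction model is given below. The notation is assumed for illustration and is not taken from the original: G is the additive SNP coding (0, 1, 2), E is years of smoking cessation, X is the covariate vector, and h_0(t) is the baseline hazard.

```latex
h(t \mid G, E, X) = h_0(t)\,
  \exp\!\left(\beta_G G + \beta_E E + \beta_{GE}\,(G \times E) + \gamma^{\top} X\right)

% Hazard ratio per additional minor allele at a given E:
\mathrm{HR}_{G}(E) = \exp\!\left(\beta_G + \beta_{GE}\,E\right)
```

Under this formulation, the interaction P value would test the null hypothesis that β_GE = 0, and a nonzero β_GE means the per-allele hazard ratio changes with years of smoking cessation.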

Multi-omics analysis
We evaluated gene expression and DNA methylation in relation to these significant SNPs and further performed in silico function prediction and protein analysis.
Fig. 5 Study workflow. Our study mainly included three parts: (1) GWAS survival study for long-term former smokers; (2) multi-omics study for candidate SNPs and genes; and (3) genetic similarity comparative analysis among different smoking status subgroups.
To examine the eQTL relationships between SNPs and the expression levels of the corresponding genes, we evaluated their associations using both summary-level data from 383 lung tissue samples through the GTEx portal (V7 release) 45 and individual-level data from 208 Caucasian NSCLC patients who were long-term former smokers in The Cancer Genome Atlas (TCGA) database (Supplementary Table 8). The cis-eQTL information in GTEx, including normalized effect size, standard error, and nominal P value for each SNP-expression pair, was collected. Genotype data from TCGA were collected from the GDC Legacy Data Portal (Level 2, birdseed data) and had been generated with the Affymetrix Genome-Wide Human SNP 6.0 Array. We performed the same quality control and imputation procedures as described above. Gene expression values were normalized using the RNA-seq by expectation-maximization method 46 and dichotomized, when needed, into low- and high-expression subgroups by the median values. To summarize the eQTL effects from TCGA and GTEx, we used meta-analysis with the fixed-effects model. The Cox proportional hazards regression model adjusted for the same covariates as described above was used to evaluate the prognostic effects of gene expression in tumor tissues in TCGA. The association between SNP and DNA methylation was tested among 155 Caucasian patients who were long-term former smokers in TCGA. DNA methylation data were profiled using Illumina HumanMethylation450 BeadChips. Details of quality control have been described previously 47.
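The fixed-effects meta-analysis used to combine the GTEx and TCGA eQTL estimates is typically an inverse-variance weighted average. The sketch below assumes that formulation, which the text does not spell out. The effect sizes and standard errors are hypothetical.

```python
import math
from statistics import NormalDist

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effects meta-analysis.

    Returns the pooled effect, its standard error, and a two-sided P value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled / pooled_se
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return pooled, pooled_se, p_value

# Hypothetical eQTL effect sizes and standard errors from two sources.
beta, se, p = fixed_effect_meta(betas=[0.2, 0.4], ses=[0.1, 0.1])
```

With equal standard errors the pooled estimate is the simple average of the two effects. In general, the more precise estimate receives the larger weight.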
We used the linear regression model to assess meQTL effects and the Cox proportional hazards regression model to evaluate the association between methylation CpG probes (dichotomized by the median values) and survival. These models were adjusted for the same covariates. False discovery rate adjusted P value (q value) was used to correct for multiple comparisons.
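The text does not name the procedure behind the FDR-adjusted q values. A common choice is the Benjamini-Hochberg step-up adjustment, sketched below as an assumption, with hypothetical P values.

```python
def bh_qvalues(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical P values for four CpG probes.
q_values = bh_qvalues([0.01, 0.04, 0.03, 0.5])
```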
We used an in silico approach through SNPinfo 48, RegulomeDB 49, and HaploReg v4.1 50 to predict potential functions of the identified SNPs. We also compared protein levels among 101 lung cancer tumors paired with normal adjacent tissue samples using Student's paired t test in the Clinical Proteomic Tumor Analysis Consortium (CPTAC) project 51. The data were normalized following the CPTAC Common Data Analysis Pipeline.

Genetic similarity comparative analysis
To assess the genetic similarity of long-term former smoking patients with the other smoking subgroups (i.e., never, short-term former, and current smokers), we performed GWAS survival analysis within each smoking subgroup using the Cox proportional hazards regression model adjusted for the same covariates as described above. We selected SNPs with moderate association strengths (P < 1 × 10⁻³) 36. Genetic similarity comparisons were made at the SNP, gene, eQTL, enhancer, and pathway levels.
For gene-level analyses, germline-regulated genes were defined as the nearest genes using the Ensembl definitions on genome build GRCh37 (hg19), which annotated the genomic locations for each subgroup using gene start and stop coordinates 52. All gene features were defined by GENCODE V25 53. We kept SNPs with no strong linkage disequilibrium (LD) using the PLINK indep-pairwise function; the r² used for all LD trimming was 0.5. eQTLs were collected from GTEx lung tissues as described above.
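To illustrate what an r² threshold of 0.5 means for LD trimming, the sketch below computes r² as the squared Pearson correlation between genotype dosage vectors and keeps a SNP only if its r² with every SNP already kept does not exceed the threshold. This greedy pass is a deliberate simplification and not PLINK's sliding-window algorithm. The SNP labels and genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)

def greedy_prune(genotypes, threshold=0.5):
    """Keep SNPs, in input order, whose r^2 with all kept SNPs is <= threshold."""
    kept = []
    for name, dosages in genotypes.items():
        if all(r_squared(dosages, genotypes[k]) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical dosages (0, 1, 2) for six individuals.
genotypes = {
    "snp_1": [0, 1, 2, 1, 0, 2],
    "snp_2": [0, 1, 2, 1, 0, 2],  # identical to snp_1, so r^2 = 1
    "snp_3": [0, 2, 1, 1, 2, 0],  # weakly correlated with snp_1
}
kept = greedy_prune(genotypes)
```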
For enhancer-level analysis, the FANTOM5 human enhancer database was used to identify enhancer activities across most cell types and tissues 54 . SNPs located within a permissive enhancer region ±1 kb were defined as the enhancer-related variants.
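The ±1 kb rule can be expressed as an interval check with a flanking margin. The sketch below is a minimal illustration with hypothetical coordinates and labels. It does not read the FANTOM5 file format.

```python
def enhancer_related(snps, enhancers, flank=1000):
    """Return IDs of SNPs within `flank` bp of any enhancer on the same chromosome.

    snps: iterable of (snp_id, chrom, position)
    enhancers: iterable of (chrom, start, end)
    """
    hits = []
    for snp_id, chrom, pos in snps:
        for e_chrom, start, end in enhancers:
            if chrom == e_chrom and start - flank <= pos <= end + flank:
                hits.append(snp_id)
                break  # one matching enhancer is enough
    return hits

# Hypothetical enhancer and SNP coordinates.
enhancers = [("chr7", 10_000, 10_400)]
snps = [
    ("snp_a", "chr7", 9_200),    # inside the upstream flank
    ("snp_b", "chr7", 11_400),   # exactly at the downstream boundary
    ("snp_c", "chr7", 11_401),   # one base outside the window
    ("snp_d", "chr1", 10_100),   # different chromosome
]
hits = enhancer_related(snps, enhancers)
```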
For pathway-level analysis, we performed gene enrichment pathway analysis based on the KEGG database. All enrichment analyses were performed using the R package clusterProfiler 55 .
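Over-representation analyses of this kind rest on a hypergeometric test: given a background of N genes, of which K belong to a pathway, how surprising is it to see at least k pathway genes among n selected genes? The sketch below shows that calculation in plain Python with hypothetical counts. It illustrates the underlying test and is not a reimplementation of clusterProfiler.

```python
from math import comb

def enrichment_p(n_background, n_in_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric P value: P(overlap >= n_overlap)."""
    total = comb(n_background, n_selected)
    upper = min(n_in_pathway, n_selected)
    tail = sum(
        comb(n_in_pathway, i) * comb(n_background - n_in_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 10 background genes, 5 in the pathway, 5 selected, all 5 overlap.
p_all_overlap = enrichment_p(10, 5, 5, 5)
```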

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The data generated and analyzed during this study are described in the following data record