Nuclear factor, erythroid 2-like 2-associated molecular signature predicts lung cancer survival

Nuclear factor, erythroid 2-like 2 (NFE2L2), a transcription factor also known as NF-E2-related factor 2 (Nrf2), is a key cytoprotective gene that regulates critical antioxidant and stress-responsive genes. Nrf2 has been demonstrated to be a promising therapeutic target and useful biomarker in malignant disease. We hypothesized that NFE2L2-mediated gene expression would reflect cancer severity and progression. We conducted a meta-analysis of microarray data for 240 NFE2L2-mediated genes that were enriched in tumor tissues. We then developed a risk scoring system based on NFE2L2 gene expression profiling and designated 50 tumor-associated genes as the NFE2L2-associated molecular signature (NAMS). We tested the relationship between this gene expression signature and both recurrence-free survival and overall survival in lung cancer patients. We find that NAMS predicts clinical outcome in the training cohort and in 12 out of 20 validation cohorts. Cox proportional hazard regressions indicate that NAMS is a robust prognostic gene signature, independent of other clinical and pathological factors including patient age, gender, smoking, gene alteration, MYC level, and cancer stage. NAMS is an excellent predictor of recurrence-free survival and overall survival in human lung cancer. This gene signature represents a promising prognostic biomarker in human lung cancer.

As a central regulator of antioxidant genes, NFE2L2 has received great attention for its pivotal role in a number of pathologic conditions, including cancer [7][8][9][10] . Increasingly, data have demonstrated that NFE2L2 is over expressed in a great number of solid and hematologic tumors [11][12][13][14][15][16] . Many cancer cells have been found to exhibit increased expression and activity of Nrf2 17 . In addition, NFE2L2 is highly up-regulated in various types of tumors and the prognosis of patients with tumors expressing high levels of NFE2L2 is poor 18 .
As NFE2L2 becomes dysregulated, cells may acquire several traits, including proliferation, apoptosis resistance, and a profound resistance to drugs and radiotherapy, which may promote tumor growth and cancer pathogenesis 19 . Nrf2 activators have been used in clinical trials for cancer therapy and the treatment of diseases associated with oxidative stress; on the other hand, constitutive activation of Nrf2 in many types of tumors contributes to the survival and growth of cancer cells, as well as resistance to anticancer therapy 20 . Therefore, manipulation of Nrf2 has become of great interest with regard to its use in therapy and the diagnosis/prognosis of malignant diseases.
In this present study, we conducted secondary analysis of genome-wide expression data to identify NFE2L2-associated genes implicated in cancer pathobiology. Because of its strong association with cancer, we hypothesized that NFE2L2-mediated genes could be used to indicate cancer progression. Specifically, we aimed to assess whether genes associated with NFE2L2 could be used as a prognostic tool for lung cancer patients. We found that an NFE2L2-mediated gene signature could effectively predict lung cancer survival.

Results
NFE2L2 influences gene expression in human lung cancer. We compared the gene expression patterns of control cells to those of NFE2L2 knockdown (KD) human lung cancer cells to identify genes potentially regulated by NFE2L2. One microarray dataset (GSE38332) containing gene expression information from both control and NFE2L2 KD A549 lung cancer cells 21 was downloaded from the Gene Expression Omnibus (GEO) database 22 . At the specified significance level of false discovery rate (FDR) < 0.01 and fold change > 2 (see Methods for details), 1631 probe sets encoding 1172 genes were found to be up-regulated in NFE2L2 KD cells, while 792 probe sets for 593 genes were down-regulated in NFE2L2 KD cells (Supplementary Table S1). We next searched the enriched Kyoto Encyclopedia of Genes and Genomes (KEGG) 23 physiological pathways among the dysregulated genes revealing genes enriched in cancer-related KEGG terms (9 out of 20 significantly dysregulated pathways), such as "Colorectal cancer", "Pathways in cancer", "Endometrial cancer", "Pancreatic cancer", "Small cell Lung cancer", "Renal cell carcinoma", and "Prostate cancer" (Fig. 1). To address whether the NFE2L2-mediated genes (1765 differentially expressed genes) are statistically significantly associated with KEGG cancer pathways, we conducted a resampling test. We obtained 1,000 random gene sets by randomly selecting 1765 genes from human genome. For each random gene set, we counted the number of KEGG cancer pathways that are significantly associated with the random gene set. We found that the number of KEGG cancer pathways of the NFE2L2-mediated genes is significantly larger than that of the random gene sets (P = 0.002) ( Supplementary Fig. S1). The results suggest that the NFE2L2-mediated genes are involved in human lung cancer pathology. To explore the role of NFE2L2-mediated genes in human lung cancers, we compared the gene expression between the paired normal and tumor tissues in two lung cancer cohorts from Spain (GSE18842, ESP) 24 and Taiwan (GSE19804, TWN) 25 . In total, 1695 probe sets encoding 1284 genes were found to be commonly differentially expressed (see Methods for details) between normal and tumor tissues in the two cohorts (Supplementary Table S2). Among these probe sets/genes, 299 probe sets encoding 240 genes were found to overlap with the NFE2L2-mediated genes (Supplementary Table S3), which is statistically significant (P = 2.2 × 10 −14 by hypergeometric cumulative distribution function) and, thus, suggests that NFE2L2-mediated genes are significantly enriched among the lung cancer-related genes. The top KEGG pathways associated with the 240 overlapped genes includes "ECM-receptor interaction", "Pathways in cancer", "Complement and coagulation cascades", "Jak-STAT signaling pathway", "Focal adhesion", "Small cell lung cancer", and "TGF-beta signaling pathway" (Supplementary Fig. S2), which suggests the depth of involvement of NFE2L2-mediated genes in human cancer. Univariate Cox proportional hazards regression was applied to evaluate the relationship between lung cancer outcome and gene expression for the 240 genes. The association between recurrence-free survival and gene expression was computed in a lung cancer training cohort from Korea (GSE8894, KOR) 26 . We also calculated the association between overall survival and gene expression in another training cohort from the United States (GSE3141, USA) 27 . The Wald statistic (ratio of Cox regression coefficient to its standard error) of the 240 genes in the KOR cohort (Z R ) is positively correlated with that in the USA cohort (Z D ) ( Supplementary  Fig. S3). Among the 240 genes, we identified 50 genes with both |Z R | > 0.5 and |Z D | > 0.5. We hypothesized that the expression of the 50 genes would predict lung cancer outcome. We designated these genes as the NFE2L2-Associated Molecular Signature (NAMS) ( Table 1 and Supplementary Fig. S4 and S5). The NAMS gene set (50 genes) is significantly associated with two KEGG pathways: ECM-receptor interaction and focal adhesion ( Supplementary Fig. S6).

NAMS predicts recurrence-free survival in lung cancer.
To confirm that the NAMS would be predictive of tumor outcome in lung cancer, we constructed a scoring system to assign each patient a recurrence-risk score, representing a linear combination of the NAMS gene expression values weighted by the coefficients obtained from the training cohort (KOR) (see Methods for details). NAMS R + patients were defined as those having recurrence-risk scores greater than zero, and the other patients were assigned as NAMS R − . As expected, we found a significantly reduced recurrence-free survival for the NAMS R + patients in the training cohort ( Fig. 2 and Table 2). We next tested the prognostic power of the NAMS based recurrence-risk score in three validation cohorts from Japan (GSE31210, JPN) 28 , Sweden (GSE37745, SWE) 29 , and Canada (GSE50081, CAN) 30 .
Kaplan-Meier survival curves demonstrated a significantly reduced recurrence-free survival for NAMS R + patients in the three validation cohorts (log-rank test: P = 1.5 × 10 −6 for JPN, P = 2.5 × 10 −3 for SWE, and P = 6.0 × 10 −3 for CAN) (Fig. 2). Univariate Cox proportional hazards regression indicated that NAMS R + patients had an increased risk for recurrence of 3.59-, 2.65-, and 2.36-fold in the JPN, SWE, and CAN cohorts, respectively ( Table 2). These findings suggest that the NAMS is predictive of recurrence-free survival in lung cancer.
In addition, we checked prognostic power of NAMS on different types of lung cancer (including adenocarcinoma, squamous cell carcinoma, and large cell carcinoma). JPN cohort are all Significantly reduced recurrence-free survival for NAMS R + adenocarcinoma patients (P = 4.7 × 10 −4 ), but not squamous cell carcinoma patients, was found in the CAN cohort ( Supplementary Fig. S7). In the SWE cohort, significantly reduced recurrence-free survival for NAMS R + squamous cell carcinoma patients (P = 2.2 × 10 −2 ), but not the other two types, was observed ( Supplementary Fig. S8). Regardless of statistical significance (mainly due to the limitation of the same size), all recurrence-free survival data in all types of cancers remain at the same trend that NAMS R + patients have reduced survival.

NAMS predicts overall survival in lung cancer.
We also tested the power of the NAMS in predicting overall survival in lung cancer. The USA cohort was used for training. Similarly, a scoring system was developed to assign each patient a death-risk score, calculated as a linear combination of the NAMS gene expression values weighted by the coefficients obtained from the USA cohort (see Methods for details). NAMS D + patients were defined as those having death-risk scores greater than zero, while the other patients were NAMS D − . Kaplan-Meier survival curves demonstrated a significantly reduced overall survival for the NAMS D + patients in both the training and validation cohorts (Fig. 3). Univariate Cox proportional hazards regression indicated that the NAMS D + patients had an increased risk for death of 2.71-, 5.92-, 1.46-, and 1.76-fold in the USA, JPN, SWE, and CAN cohorts, respectively ( Table 2). These results suggest that NAMS can be used to predict overall survival in lung cancer.
Same as the recurrence-free survival study, we examined the prognostic power of NAMS on subtypes of lung cancer on overall survival. Significantly reduced recurrence-free survival for NAMS R + adenocarcinoma patients (P = 7.7 × 10 −3 ), but not squamous cell carcinoma patients, was found in CAN cohort ( Supplementary Fig. S9). In SWE cohort, NAMS fails to differentiate overall survival significantly in three subtypes of lung cancer ( Supplementary Fig. S10). Same as recurrence-free survival study, all overall survival data in all lung cancer subtypes remain at the same trend that NAMS R + patients have reduced survival, regardless of statistical significance. The prognostic power of NAMS is non-random. A computational study indicated that the prognostic power of most published gene signatures in breast cancer are not significantly better than that of random signatures with identical size 31 . Here, we performed a resampling test to check whether the NAMS performed better than random gene signatures. We constructed 1,000 random gene signatures with identical size as the NAMS (50 genes). Cox proportional hazards regression of survival was conducted for each resampled gene signature. The association between each random gene signature and recurrence-free/overall survival was measured by the sum of Wald statistic in the three validation cohorts. Our alternative hypothesis was that the sum of the Wald statistic of the NAMS would be more positive than expected by chance if the prognostic power of the NAMS was significantly better than that of random gene signatures. We found that the sum of the Wald statistic of the NAMS was significantly larger than that of randomized gene signatures (P < 0.001 for recurrence-free survival and P = 0.001 for overall survival) (Fig. 4).
We also compared the prognostic power of the NAMS against the other cancer related genes. Here, we defined the cancer related genes as those being differentially expressed between normal and tumor lung tissues (listed in Supplementary Table S2) and with |Z R | > 0.5 or |Z D | > 0.5. We performed a resampling test to check whether the prognostic power of NAMS was statistically better than the other cancer related genes. For each round of randomization, 50 genes were picked up from the pool of the cancer related genes. The performance of the random gene set was quantified by the sum of the Wald statistic of the validation cohorts. The prognostic power of the NAMS was significantly better than that of 1000 random gene signatures consisting of cancer related genes (P = 0.002 for recurrence-free survival and P = 0.046 for overall survival) (Fig. 4).

Independence of NAMS from the traditional clinical and pathological factors.
To confirm the strength of the NAMS as an independent predictor, we investigated the performance of the NAMS in comparison with the traditional clinical and pathological variables associated with prognosis in lung cancers. For the JPN cohort, we considered factors including age, gender, smoking history, stage, EGFR/KRAS/ALK mutation status (Gene alteration), and MYC level. For the SWE cohort, we included age, gender, and stage as covariate. For the CAN cohort, age, gender, smoking history, and stage were taken into account. Multivariate Cox proportional hazards regression of recurrence-free survival indicated that NAMS remained a significant covariate in relation to the clinical and pathological factors in all the validation cohorts (Table 3). Interestingly, NAMS was the most significant (lowest P-value) covariate in each cohort for recurrence-free survival (Table 3). We also conducted multivariate Cox proportional hazards regression against overall survival. We found that NAMS was still a significant covariate in the JPN and SWE cohorts (Table 3). However, the P-value of NAMS was slightly above the α -level of 0.05 (P = 0.075) in the CAN cohort (Table 3). Because the covariates differed among each cohort, we also repeated these analyses using the same three covariates (age, gender, and stage) for the three cohorts. These results did not dramatically change our interpretation (reported in Supplementary Table S4). Taken together, the NAMS is a survival predictor in cancer patients, independent of the traditional clinical and pathological factors.

Discussion
Nrf2 is a transcription factor that acts as a master regulator for the expression of a wide array of anti-oxidant genes 32 . Up-regulation of NFE2L2 by chemopreventive compounds confer protection against cancer initiation 15,16 . Several studies have indicated that dysregulation of NFE2L2 is strongly associated with human cancer 7,8 . Though the full extent to which NFE2L2 is involved in tumorigenesis is not known, genetic analyses have shown that NFE2L2 has increased mutations or has been deregulated in human cancers 12,33,34 . The role of NFE2L2 in pulmonary neoplasia, a diverse disease for which few biomarkers exist, is complicated and appears to depend on several factors, including the existence of activating mutations in NFE2L2 and/or loss of function mutations in KEAP1 35 . Through a computational genomics approach, our current study confirmed a central role for NFE2L2 in lung cancer. The complex functions of Nrf2 in carcinogenesis and chemotherapy resistance require more detailed characterization and mechanistic analyses, however, our data reveals some interesting patterns, in light of current knowledge about lung malignancy. Lung cancer is the most frequently diagnosed cancers and leading cause of cancer death in males, comprising 17% of the total new cancer cases and 23% of the total cancer deaths 36 . With the rapid advances taking place in molecular testing and associated technologies, the landscape of targetable genomic alterations in lung cancer is beginning to uncover the true complexity of the disease 37 . In this study, we explored the prognostic value of those gene sets regulated by NFE2L2 in lung cancer. First, we compared genes that were differentially expressed between lung cancer and NFE2L2-knockdown A549 cells. Over 2425 genes were found to be commonly differentially expressed between WT and NFE2L2-knockdown cells. We confirmed the critical role of NFE2L2 in carcinogenesis by the gene ontology analysis of all NFE2L2-mediated genes: 12 of 20 significantly deregulated pathways are direct cancer pathways (Fig. 1). Second, we generated the NFE2L2-associated Molecular Signature (NAMS) by filtering gene express data sets through different cohorts. Third, we validated the NAMS as a powerful tool which provides important prognostic predictions in lung cancer, as we demonstrated that the NAMS was a significant and independent predictor of recurrence-free cancer survival. Although limited by the availability of the existing lung cancer microarray datasets deposited in GEO, the consistent findings from the population with diverse genetic background in discovery and validation cohorts (e.g., USA verses KOR) strongly suggest the NAMS in lung cancer is powerful and highly conserved molecular signature among populations. Finally, these NAMS predictions were independent of other known clinical pathological covariates in this study. This last result is especially intriguing given recent developments in the way that clinicians and cancer biologists have begun to think about the cancer disease process.
Although the origin of cancer is still hotly debated 38 , cancer is unequivocally a quantitative trait 39 . Our findings are consistent with a modern understanding of carcinogenesis as involving not only a somatic evolutionary process 40,41 , but also contributions from the tumor microenvironment 42,43 , as well as general dysregulation at the tissue level 44 . In fact, the 50 gene set that comprised our NAMS itself highlights that cancer is a complex trait with many gene pathways involved (a fact that seems to be of broad significance, see also 45,46 ). These pathway-of-cancer enriched genes provide a set of NFE2L2-associated genes that might play critical roles in cancer pathogenesis. For example, RRM2 plays an important role in regulating expression of the anti-apoptotic protein Bcl-2 and reveal a critical link between RRM2 and Bcl-2 in apoptosis signaling and tumor developing 47 . Recent studies showed that concomitant low expression levels of RRM2 was predictive of a better outcome, and low expression of RRM2 could be used to predict the treatment response to platinum-based chemotherapy and survival in lung cancer 48,49 .
This study reinforces the value of re-examining available genomic/genetic data in the "big data" era with a novel translational approach. NFE2L2 is confirmed to be a novel "oncogene" (broadly construed)  validation (JPN, SWE, and CAN) cohorts. The red curves are for the NAMS positive patients while the blue curves are for the NAMS negative patients. The NAMS positive patients were defined as those having a recurrence risk score greater than zero. P-values were calculated by log-rank tests for the differences in survival between the NAMS positive and negative groups.

Recurrence-free survival
Overall survival  with a central role in carcinogenesis. More work on NFE2L2 will be necessary to fully elucidate the mechanisms underlying this pattern. In addition to cancer prognosis, now well validated in the current study, NFE2L2 holds promise for the management of multiple cancers. We confirm that the NAMS represents a promising prognostic biomarker in human lung cancer predicting recurrence-free survival and validation survival.

Methods
Gene expression profiling. We obtained the gene expression data in control and NFE2L2 KD A549 lung cancer cells from the Gene Expression Omnibus (GEO) database (GSE38332) 21 . Seven independent microarray lung cancer datasets from Spain (GSE18842, ESP) 24 , Taiwan (GSE19804, TWN) 25 , Korea (GSE8894, KOR) 26 , the United States (GSE3141, USA) 27 , Japan (GSE31210, JPN) 28 , Sweden (GSE37745, SWE) 29 , and Canada (GSE50081, CAN) 30 , were also downloaded from the GEO database (Supplementary  Table S5). Expression data of paired normal and tumor tissues from lung cancer patients were available in the ESP and TWN cohorts. The information on recurrence-free survival was available in the KOR, JPN, SWE, and CAN cohorts. The information on overall survival was available for the USA, JPN, SWE, and CAN cohorts. The ESP and TWN cohorts were used to identify the differentially expressed genes (B) Resampling pattern for overall survival. P 1 : the right-tailed P-values for the resampling test when the resampled gene signatures were picked up from whole human genome; P 2 : the right-tailed P-values for the resampling test when the resampled gene signatures were picked up from the cancer related genes. Figure  S10. Kaplan-Meier curves of overall survival on lung cancer subtypes (SWE). Microarray data analysis. We used the GCRMA algorithm in Bioconductor to normalize the expression level of each probe set for the microarray data of the control and NFE2L2 KD lung cancer cells and of the paired normal and tumor tissues. Only the probe sets with unique annotations were included in this study. The genes on chromosomes X and Y were excluded to avoid the potentially confounding factor. The significance analysis of microarrays (SAM) algorithm 50 , implemented in the samr library of the R Statistical Package, was used to compare log2-transformed gene expression levels between the control and NFE2L2 KD cells. FDR was controlled using the q-value method 51 . The probe sets with a fold-change > 2 and FDR < 0.01 were deemed differentially expressed between the control and NFE2L2 KD cells. For the ESP and TWN cohorts, paired t-test was used to identify the genes differentially expressed between normal and tumor tissues. The P-values were adjusted by Benjamini-Hochberg correction. Only the probe sets with a fold-change > 2 and adjusted P < 0.01 were deemed differentially expressed between normal and tumor tissues.
Risk score. For the two training cohorts (KOR and USA), univariate Cox proportional hazards regression was used to evaluate the association between recurrence-free survival/overall survival and gene expression. A recurrence-risk score and a death-risk score were then calculated for each patient, respectively, using a linear combination of gene expression weighted by the Wald statistic (ratio of regression coefficient to its standard error) as shown below: Here, S R (Formula 1) and S D (Formula 2) are the recurrence-risk and death-risk scores, respectively; n is the number of differentially expressed genes; Z Ri denotes the Wald statistic of recurrence-free survival for the i th gene, which was derived from the KOR cohort; Z Di denotes the Wald statistic of overall survival for the i th gene, which was derived from the USA cohort; e i denotes the expression level of gene i; and μ i and τ i are the mean and standard deviation of the gene expression values for gene i across all samples, respectively. A higher risk score implies a poor outcome.