Introduction

Lung cancer, especially non-small cell lung cancer (NSCLC), is the most common cause of cancer-associated mortality worldwide1. Adenocarcinoma, a major subtype of NSCLC, is often diagnosed at an advanced stage and generally has a poor clinical outcome2 and a relatively poor overall 5-year survival3.

The current clinico-pathological staging system is not adequate4. Even if lung adenocarcinoma patients have similar clinical characteristics and have tumours at a similar stage, they may experience different clinical outcomes2. Due to tumour molecular heterogeneity, some patients will develop metastasis early and some will not2. Since high-throughput technology, including microarray and next-generation sequencing, can simultaneously measure the expression of tens of thousands of genes, it can be used to study heterogeneity of the gene expression profile in lung adenocarcinoma5,6. However, the results of gene expression studies in lung adenocarcinoma have usually been inconsistent due to differences in study design, sample size and analysis strategy7. To date, only molecular tests for genomic mutation of EGFR and KRAS and gene fusion detection of ALK have been widely used in clinical practice8.

Although previous gene expression microarray studies have shown a statistically significant association of the expression of many individual genes with disease, the findings usually lack biological meaning9, which makes it difficult for investigators to interpret their findings10. In order to increase the power to detect differentially expressed genes and reduce the difficulty in biological interpretation, gene class-based tests, such as gene set analysis, which combine biological knowledge and gene expression levels, have become widely used11. In addition, these methods focus on sets of related genes, rather than on individual genes, as individual functionally associated genes that often show only moderate differential expression can act co-ordinately in the cell, thus magnifying the effect12,13. From a statistical point of view, this gene class-based method reduces the number of dimensions and increases statistical power, while, from a biological point of view, it should help scientists better understand biological mechanisms within the cell12.

Classification of patient risk using a single biomarker that is strongly associated with disease outcome might not be a good strategy, as a dysregulated gene that may not show any obvious association with disease on its own may interact with others in the same pathway, resulting in carcinogenesis or drug resistance14,15,16. It is therefore worth identifying particular sets of genes showing unusual expression that act in the same cancer-associated pathway. In this study, using data from the public gene expression and clinical data on the caArray database of the National Cancer Institute7, cancer-associated pathway-based approaches were used to identify pathway-based gene signatures, which may have potential for prognosis prediction and therapeutic target identification in lung adenocarcinoma.

Results

Identification of genes with a significant hazard ratio (HR) and their associated pathways

The study population consisted of 443 patients with lung adenocarcinoma from the University of Michigan Cancer Center (UM) (n = 178), the Moffitt Cancer Center (HLM) (n = 79), the Memorial Sloan-Kettering Cancer Center (MSK) (n = 104) and the Dana-Farber Cancer Institute (CAN/DF) (n = 82)7. Multivariate Cox proportional hazards regression analysis showed that 2155, 1437, 1164 or 2003 genes in the CAN/DF, HLM, UM or MSK dataset, respectively, were significantly associated with overall survival (data not shown). There were only two genes (CSNK1A1 and MYST4) had significant HRs in all four datasets. After combining the results for the four datasets, a total of 2418 genes showed a significant HR (data not shown). When pathway mapping was used to identify survival-associated biological pathways based on these 2418 genes, 15 pathways in which these genes were enriched were identified (Table 1).

Table 1 Possible clinical outcome-related pathways in lung adenocarcinoma.

Pathway-based risk score analysis identifies three pathways that are associated with overall survival in lung adenocarcinoma in all four datasets

To evaluate the impact of signatures in a given pathway on survival, a pathway-based risk score was calculated for this pathway and used, together with survival analysis, to evaluate prognostic ability. Survival analysis showed that three pathways were significantly associated with survival in all four datasets (Table 1). Twenty-one genes significantly associated with survival (11 risk genes and 10 protective genes) were involved in the high-mobility group box 1 (HMGB1)/ receptor for advanced glycation end products (RAGE) signalling pathway (Supplementary Table S2 and Fig. S2 online), 22 (10 risk genes and 12 protective genes) were involved in the beta-adrenergic receptor regulation of the extracellular signal-regulated kinase (ERK) pathway (Supplementary Table S3 and Fig. S3 online) and 31 (11 risk genes and 20 protective genes) were involved in the clathrin-coated vesicle cycle pathway (Supplementary Table S4 and Fig. S4 online).

Pathway-based signatures of the HMGB1 / RAGE signalling pathway, the beta-adrenergic receptor regulation of the ERK pathway and the clathrin-coated vesicle cycle pathway are significantly associated with clinical outcome of lung adenocarcinoma

As shown in Fig. 1, using the HMGB1/RAGE signalling pathway-based signature containing 21 differentially expressed genes and defining high-risk patients as those with a pathway-based risk score higher than the median, high-risk patients had a significantly shorter median survival than low-risk patients in the CAN/DF dataset (high-risk group 71 months, low-risk group not reached median survival; p = 0.0197), the HLM dataset (high-risk group less than 30 months, low-risk group more than 70 months; p = 0.0003), the UM dataset (high-risk group 42 months, low-risk group 130 months; p < 0.0001) and the MSK dataset (high-risk group 51 months, low-risk group 114 months; p = 0.0003).

Figure 1
figure 1

Using HMGB1/RAGE signalling pathway-based signatures as prognosis predictor, Kaplan-Meier survival analysis of patients with lung adenocarcinoma were performed in all four training datasets (CAN/DF, HLM, UK, MSK) and in the two validation cohorts (GSE14814 and GSE1321).

Similar results were obtained using the beta-adrenergic receptor regulation of the ERK pathway-based signature (Fig. 2) or the clathrin-coated vesicle cycle pathway-based signature (Fig. 3). Using the ERK pathway-based signature, the median survival of the high-risk group in the CAN/DF, HLM, UM and MSK datasets was 38, 26, 45 and 52 months, respectively, significantly shorter than that in the low-risk group (not reached median survival, 48 months, 130 months and not reached median survival) (Fig. 2; p < 0.0001, p = 0.0149, p < 0.0001 and p = 0.0014, respectively), while, using the clathrin-coated vesicle cycle pathway-based signature, the median survival of the high-risk groups in the CAN/DF, HLM, UM and MSK datasets were 37, 21, 48 and 57 months, significantly shorter than that of the low-risk group (not reached median survival, 57 months, 96 months and 114 months) (Fig. 3; p = 0.0012, p = 0.0002, p = 0.0011 and p = 0.0089, respectively).

Figure 2
figure 2

Using pathway-based signatures of beta-adrenergic receptor regulation of the ERK pathway as prognosis predictor, Kaplan-Meier survival analysis of patients with lung adenocarcinoma were performed in all four training datasets (CAN/DF, HLM, UK and MSK) and in the two validation cohorts (GSE14814 and GSE1321).

Figure 3
figure 3

Using pathway-based signatures of clathrin-coated vesicle cycle pathway as prognosis predictor, Kaplan-Meier survival analysis of patients with lung adenocarcinoma were performed in all four training datasets (CAN/DF, HLM, UK and MSK) and in the two validation cohorts (GSE14814 and GSE1321).

The survival pattern for patients with each of the three pathway-based signatures in the different datasets was interesting. In the CAN/DF dataset, the high-risk group, based, respectively, on the HMGB1 / RAGE signaling pathway, beta-adrenergic receptor regulation of ERK pathway and clathrin-coated vesicle cycle pathway, had a median survival of 71, 38 and 37 months, while the low-risk group did not reach median survival. In the HLM dataset, the high-risk group had a median survival of about 20 months (26, 26 and 21 months), while that of the low-risk group was greater than 45 months (73, 48 and 57 months). In the UM dataset, the median survival for the high-risk group was 42, 45 and 48 months, almost 3 times lower than that in the low-risk group (130, 130 and 96 months). In the MSK dataset, the median survival for the high-risk group was around 50 months (51, 52 and 57 months), while the low-risk group survived longer (114 months, not reach median survival and 114 months).

A combined pathway-based risk (CPBR) score based on the combined risk scores for the three individual pathways gives a better prediction of clinical outcome in lung adenocarcinoma

To evaluate the prognostic effect of the combined risk scores for these three pathway-based signatures, a CPBR score was computed by linear summation of each of the three pathway-based signature scores multiplied by the weighting coefficient for that pathway obtained by Cox proportional hazards regression. Patients in the high-risk group (risk score higher than the median) had a significantly shorter median survival than those in the low-risk group in all four datasets (CAN/DF p = 0.0005, HLM p = 0.0002, UM p < 0.0001 and MSK p < 0.0001) (Fig. 4a–d).

Figure 4
figure 4

Using combined pathway-based risk (CPBR) score as prognosis predictor, based on the combined risk scores for the three individual pathways, results of survival analysis were shown that overall survival was significantly different between high-risk group and low-risk group in (a) the CAN/DF dataset, (b) the HLM dataset, (c) the UM dataset, (d) the MSK dataset and (e) the validation cohort of GSE14814 (f) the validation cohort of GSE1321, respectively.

Prognostic factors of lung adenocarcinoma

Results of multivariate Cox proportional hazards regression showed that stage effect had significant adjusted HR in each dataset and can be mentioned as an independent prognostic factor (Table 2). After effects of age, sex and stage were controlled in the multivariate Cox proportional hazards regression model, each pathway-based signature or combined pathway-based risk (CPBR) score was still significant and it was also an independent prognostic factor (Table 2). Particularly, the HRs based on the CPBR score were higher than those obtained using the risk score for any single pathway signature.

Table 2 Adjusted hazard ratios of the different pathway-based signatures in four training datasets.

Comparisons of the single biomarker and pathway-based signatures

Multivariate Cox proportional hazards regression was used to examine the predictive power of 2418 genes selected from the inverse-variance weighting method. Numbers of significant genes in four datasets are 820, 574, 429 and 622, respectively. Range of significant HRs was from 0.05 to 11.3 and had high variation between four datasets (Supplementary Table S5). Considering p values and HRs, three pathway-based signatures and CPBR score had more statistical significant than 2418 genes across four datasets (Supplementary Fig. S5). In addition, these pathway-based signatures showed the consistent results in four datasets and portend the better prediction power for prognosis.

Validation of three pathway-based signatures and the CPBR score in two independent cohorts

Three pathway-based signatures and the CPBR score were validated in two independent cohorts (GSE14814 and GSE13213). Patients with high risk identified from the HMGB1 / RAGE signalling pathway, the beta-adrenergic receptor regulation of the ERK pathway, the clathrin-coated vesicle cycle pathway, or CPBR score had significant shorter overall survival (Fig. 1e, p = 0.0120; Fig. 2e, p < 0.0001; Fig. 3e, p = 0.0006; Fig. 4e, p < 0.0001 ) in the GSE14814 cohort and in the GSE13213 cohort (Fig. 1f, p = 0.0077; Figure2f, p = 0.0002; Fig. 3f, p < 0.0001; Fig. 4f, p < 0.0001), respectively.

In the first validation cohort (GSE14814), age, three pathway-based signatures and CPBR score had significant HRs. Adjusted HR of CPBR score was 2.72 (95%CI = 1.80 to 4.11 and p < 0.0001). In addition, stage, three pathway-based signatures and CPBR score had significant HRs in the second validation cohort (GSE13213). Adjusted HR of CPBR score was 1.71(95%CI = 1.45 to 2.03 and p < 0.0001) (Table 3).

Table 3 Adjusted hazard ratios of the different pathway-based signatures in two validation cohorts.

Discussion

In this study, multivariate Cox proportional regression analysis was used to identify potential survival-associated genes in each of the 4 National Cancer Institute datasets, then inverse-variance weighting method was used to combine the results from the four datasets. The use of this method increases the statistical power and provides more robust results17. Biological function or pathway analysis was then used to reveal potential biological mechanisms involved in lung cancer, allowing more precise biological interpretation.

In the Cox regression analysis, only two genes, CSNK1A1 and MYST4, were found to be significantly associated with survival in all four datasets. Since the results for the four datasets were not very consistent, we used inverse-variance weighting method to combine the four sets of results to increase statistical power. After pathway analysis, pathway-based risk scores were computed using the level of expression of genes in the same pathway weighted by regression coefficients. Each of the pathway-based scores was found to be a good predictor of clinical outcome. Subjects with a higher pathway-based risk score for a given pathway were classified into the high-risk group based on that pathway. The results showed that 21 of the 65 genes in the HMGB1 / RAGE signalling pathway, 22 of the 70 in the beta-adrenergic receptor regulation of ERK pathway and 31 of the 107 in the clathrin-coated vesicle cycle pathway were significantly associated with clinical outcome of lung cancer. The high-risk group defined using the risk score based on any one of the three individual pathways had a shorter overall survival than the low-risk group.

We then developed a simplified CPBR score to combine the effects of these three signatures and the results showed that the high-risk group had even shorter overall survival than the low-risk group and that the CPBR score gave a better outcome prediction of outcome of lung adenocarcinoma patients. These results showed that differentially expressed genes in the same pathway might interact with each other and contribute to a worse prognosis.

Some studies have reported that cellular pathway signatures can be useful for treatment development, prognosis prediction and subtype classification in lung cancer14,18,19 and that the identification of important biological pathways containing potential survival-associated genes would help in disease prevention or treatment strategy16,20. The HMGB1 signalling pathway plays the principal role in the tumorigenesis and progression of many malignant cancers21. HMGB1 is a nuclear protein that influences transcription and other nuclear functions and is associated with hallmarks of cancer, including unlimited replication, angiogenesis, apoptosis, self-sufficiency in growth signals, growth inhibitor insensitivity, inflammation, invasion and metastasis22,23. HMGB1 and its receptor, RAGE, are highly expressed in various malignant tumours, including colorectal and breast cribriform carcinoma24. The HMGB1 signalling pathway is also reported to be associated with growth and metastasis of liver cancer and to be a potential therapeutic target for this cancer21. In this study, we found that a signature in the HMBG1/RAGE pathway was associated with overall survival and this might provide insight into the pathogenesis of this cancer. As regards the beta-adrenergic receptor regulation of ERK pathway, psychological distress is a predictor of cancer mortality, especially in lung cancer25 and the stress hormone norepinephrine is a potent inducer of migratory activity in lung carcinoma cells and cell migration is mediated by the adrenergic receptor pathway26. It has also been reported that beta-blocker therapy can reduce cancer distant metastases, recurrence and mortality rate in breast cancer patients26. Clathrin-mediated endocytosis is a regulator of cellular function and abnormal endocytosis plays a key role in many diseases27. The clathrin regulation pathway has been reported to be relevant to Alzheimer’s disease27,28. These results and our own show that these 3 pathways are important for disease development and that deregulated genes in these pathways might contribute to a worse prognosis. These findings could provide a research direction for further exploration of the mechanism involved in progression of lung adenocarcinoma.

In studies using the same strategy and similar clinical and pathological features and treatment protocol, not all patients with lung cancer show the same clinical outcome and sensitivity to treatment because of the extreme heterogeneity of tumours2. Due to the tumor heterogeneity, the multiple genes or pathway based approaches showed the better performance than single marker approach29. The development of gene expression profiling should help in demonstrating the heterogeneity of the same tumour type and improve the accuracy of lung cancer risk assessment, clinical prognosis and outcome prediction. Furthermore, it should allow the design of individual targeted therapies for patients30.

The limitation of the gene expression profiling or pathway based approach is applications in the clinical practice. The IHC method or FISH method with convenient and low cost of assays is well used in the clinical practice. However, it can be only applied in testing few genes. Several technologies including the multiplexed quantitative reverse-transcriptase polymerase chain reaction (RT-PCR) or digital PCR may provide solutions for the clinical practice of the pathway based signatures. However, it still needs to be evaluated in the future study.

Gene-expression profiling has been used to explore biomarkers associated with subtypes of lung cancer, overall survival and recurrence of cancer30,31, particular in lung adenocarcinoma6,32,33,34. Although signatures associated with lung adenocarcinoma have been reported by several groups, these have tended to be different in different studies; however, the genes in the individual pathways may interact and contribute to cancer pathogenesis or progression16,18.

Through risk assessment by pathway-based signatures, patients had high risk estimated from pathway based signatures may need to receive different treatments. For example, patients with completely resected stage I NSCLC was recommend for no adjuvant chemotherapy35. However, up to 10–20% of above patients will recur or die within 5 years36 and may need to receive different clinical treatments. Patients with early recurrence or death are high risk population in the stage I lung cancer and it is still lack of the efficient methods for high risk identification of stage I lung cancer. Hence, the pathway based signature may be benefit to identify high risk group of lung cancer. In addition, pathway based signatures provide systematic point of view for prognosis and the inhibitions of connections between genes within the same pathway may have potential to be therapeutic targets in the future work.

Several limitations of the microarray data are also needed to be considered. First, the high variations of gene expression data were obtained from different microarray platforms due to different probe designs or signal detection methods. Hence, the comparisons between different microarray platforms have to be concerned. Second, the stromal cell contamination varies among dataset and samples, which would give variations in micro array-based analyses.

A potential limitation of this study is that the public datasets used in this study did not provide the information of treatment response and major genomic abnormalities such as EGFR, KRAS and ALK fusion. It is difficult to estimate the effect of the CPBR in patients with major genomic abnormalities. Additionally, the MetaCore software used enrichment analysis to find the overrepresented pathways and pathways with the smaller size of the gene set might have more chance to reach significance. It is the weakness of the enrichment analysis. Because the false positive may be introduced from the limitation of the MetaCore software, several steps were used to reduce the bias. In the first step, the Bonferroni correction method was applied to reduce potential false positive. There were more than 1000 pathways in the database of the MetaCore. The significant level was corrected from 0.05 to 10−5. In the second step, gene-signatures of significant pathways were evaluated in four training datasets. Pathways had significant associations with survival were kept. Finally, candidate pathway-based signatures were validated in two independent cohorts.

Our analysis method of gene expression profiling identified pathway-based signatures closely correlated with clinical outcome of lung adenocarcinoma. We have derived a simple CPBR score which may improve the accuracy of outcome prediction for lung adenocarcinoma, as the CPBR score showed a higher correlation with clinical outcome than gene- or individual pathway-based scores. These results provide a more complete picture of the pathogenesis of lung carcinoma and provide direction for future studies. The CPBR signature may be useful in stratifying subpopulations of lung adenocarcinoma for clinical outcome prediction and individualized therapies for lung adenocarcinoma patients.

Materials and Methods

Study population and gene expression data

The study population consisted of 443 patients with lung adenocarcinoma from the University of Michigan Cancer Center (UM) (n = 178), the Moffitt Cancer Center (HLM) (n = 79), the Memorial Sloan-Kettering Cancer Center (MSK) (n = 104) and the Dana-Farber Cancer Institute (CAN/DF) (n = 82)7. The clinical characteristics of all subjects are briefly summarized in Supplementary Table S1 online. In all 4 studies, gene expression profiles were measured using Affymetrix HG-U133A microarrays and the same experimental protocols were used. The gene expression and clinical data for these 4 groups were obtained from https://array.nci.nih.gov/caarray/project/details.action?project.id=182.

An additional published dataset (n = 71) from the University of Toronto (accession number GSE14814 in the GEO database, http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14814) and dataset (n = 117) from the Nagoya University in Japan (accession number GSE13213 in the GEO database, http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE%2013213) were used to validate our findings.

Data preprocessing

The intensity values for gene expression in the four datasets were preprocessed independently to eliminate experimental noise before data analysis. To reduce variation among microarrays in a given data set, the intensity values for each sample were normalized using the quantile-normalized method37. Finally, each intensity value underwent base 2 logarithm transformation. The flowchart for the analysis is shown in Supplementary Fig. S1 online.

Identification of genes showing statistically significant differential expression using multivariate Cox proportional hazards regression

In order to determine whether genes were significantly associated with overall survival, multivariate Cox proportional hazards regression using the clinical covariates of age, sex and stage was performed on each of the 4 datasets. Each gene in a given dataset was assigned a hazard ratio (HR) and the corresponding p value was estimated by Cox regression analysis, giving 4 HR and 4 p values for each gene, then the inverse-variance weighting method was used to combine the HRs from the four datasets38. The advantages of combining the results from the different datasets were an increase in statistical power and the identification of more robust cancer signatures17. If the confidence interval for the combined HR for a given gene did not overlap with that for the baseline risk (HR = 1), the gene was considered a potential marker that was significantly associated with survival. The HRs were then used to evaluate correlations between overall survival and the level of expression of genes; if the HR for a given gene was >1, it was defined as a potential risk gene, if not, it was defined as a potential protective gene.

Pathway analysis and risk score calculation

Genes that were differentially expressed and associated with survival were further analysed for biological function or involvement in different pathways using pathway maps of the MetaCoreTM version 6.13 (Thomson Reuters, New York, NY). Because more than 1000 pathways are included in the MetaCore database, the Bonferroni correction was used to avoid multiple testing issues. For each of the 15 pathways found to contain differentially expressed genes, a pathway-based risk score was calculated for each subject in each data set to determine the impact of the pathway on the prognosis. The risk score was a linear combination of the values for the expression of each gene in the pathway multiplied by a weighting value for each gene estimated using Cox proportional hazards regression. Using this approach, we found signatures in three pathways that were significantly associated with survival in lung adenocarcinoma. A combined pathway-based risk (CPBR) score was then derived by linear combination of the pathway-based risk score for each of the three pathways multiplied by a weighting value for that pathway estimated using Cox proportional hazards regression. We then verified this CPBR score using an independent cohort.

Survival analysis

In the pathway-based study, the pathway-based median risk score was taken as the cut-off point for high or low-risk group classification, while, in the CPBR score study, the median of the CPBR score was used as the cut-off. The Kaplan-Meier method was used to generate survival curves and the difference between survival curves was evaluated using the log-rank test. All tests were two-tailed and p values less than 0.05 were considered to be significant.

CPBR score validated in another independent cohort

Both survival analysis and calculation method of pathway-based scores for three selected pathways and the CPBR score were the same as previous section described. HR of the CPBR score was estimated from multivariate Cox proportional hazards regression with the clinical covariates of age, sex, stage and treatment method after operation.

Additional Information

How to cite this article: Chang, Y.-H. et al. Pathway-based gene signatures predicting clinical outcome of lung adenocarcinoma. Sci. Rep. 5, 10979; doi: 10.1038/srep10979 (2015).