Introduction

Lung cancer, predominantly non-small cell lung cancer (NSCLC), is the leading cause of cancer deaths worldwide, resulting in about one million deaths each year. Its incidence is rising continuously, especially among nonsmoking women. Despite advances in prevention, screening, resection methodology and chemotherapy strategies, the five-year survival of operated patients remains low, ranging from 59% (stage IA) to 14% (stage IIIA). According to clinicians and pathologists, current clinical staging based on the ‘Tumor Node Metastasis’ system is insufficient to predict patient outcome, especially for early-stage disease. It would thus be of great clinical interest to identify biomarkers that help distinguish, among operated patients, those who can be considered cured by surgery from those who are at high risk of recurrence and need adjuvant treatment. For the latter, identification of predictive factors would either help select the appropriate drugs or prevent inappropriate use of chemotherapy. In the absence of such predictive markers, therapeutic strategies vary from one country to another.

Studies profiling gene expression in lung cancer have been completed or are in progress to identify predictors of patient prognosis and response to therapy.1, 2, 3, 4 These unbiased microarray-based profilings of thousands of genes generally yield multi-gene signatures that include endpoint-selected cell proliferation genes, that is, genes that drive cell cycle progression or tumor differentiation at the latest stages of tumorigenesis. So far, these signatures have not provided consistent prognostic molecular markers complementary to standard clinical-pathological markers such as Ki67, PCNA, tumor size, positive-involved nodes or distant metastasis, which also reflect cell proliferation status. Although promising, the added value of gene profiling over routinely used approaches thus remains limited, and no reported gene expression signature is yet ready for clinical application.1

DNA replication represents an under-explored source of prognostic markers that could be used in combination with histo-pathological markers and/or multi-genetic signatures to predict prognosis. Faithful execution of the DNA replication program is critical to promote cell division and to limit cancer risk through the preservation of genome integrity.5 Recent evidence indicates that oncogene-induced alterations of the DNA replication program trigger replicative stress and replication-associated DNA damage, favoring the accumulation of genetic alterations in cancer cells.6 It is therefore likely that key replication proteins and associated repair/recombination factors limit cancer development by preventing genomic instability during DNA replication.

The cardinal role of DNA replication defects in tumorigenesis is illustrated by hereditary forms of colon, breast, ovary and skin cancers caused by mutations in DNA replication genes such as translesion DNA synthesis polymerases (for example, POLH), intra-S-phase replication checkpoints (for example, ATR, BRCA1) or post-replicative DNA repair and recombination genes (for example, MLH1, BRCA2, BLM).7, 8, 9, 10, 11, 12 In sporadic cancers, early mutations in replication-associated genes probably become diluted in the genetic chaos resulting from the progression of the disease, making the relationship between cancer and replication defects less obvious. Nevertheless, a large body of evidence indicates that alterations in genes involved in genome replication or surveillance favor sporadic cancers.13, 14 In addition, perturbation of the catalytic activity of the replicative DNA polymerase delta increases genomic instability and accelerates tumorigenesis in mice.15 Mouse fibroblasts expressing altered levels of the origin licensing factors Mcm4, Mcm7 and Cdt1 show numerous chromosomal abnormalities and form tumors in nude mice.16, 17, 18 Finally, non-invasive assays have already been developed to identify colorectal cancer by detecting Mcm proteins in colonocytes retrieved from the fecal surface,19 an approach that is currently being tested in clinical trials.

In this study, we specifically assessed the expression of 77 replication-associated genes in primary tumors and adjacent normal tissues from a series of 93 operated NSCLC patients. We found that many of these genes were significantly deregulated in tumors. More importantly, we show that misregulation of a subset of these genes is a determinant of overall survival (OS) after surgical treatment, independently of the therapeutic strategies or tumor stages.

Results

Most of the replication-associated genes are deregulated in lung tumors

Gene expression profiles of 93 primary lung adenocarcinomas coupled with their normal counterparts at early or intermediate stages of the disease (Supplementary Table S1) were generated from a selection of 77 genes involved in different aspects of chromosome replication. These include genes involved in licensing and activation of replication origins, replication fork progression and restart, damage bypass by translesion DNA polymerases (TLS), post-replicative repair of DNA lesions and S-phase checkpoints. We then identified genes that were either up- or downregulated in tumors (T) compared with adjacent control tissues (N). Deregulated genes were stratified into three groups, on the basis of the number of tumors in which the T/N expression ratio was either above or below 2. As indicated in Table 1, we identified genes that were downregulated (T/N<1), slightly upregulated (1<T/N<2) or highly overexpressed (T/N>2). We did not find genes downregulated more than twofold, that is, T/N<1/2 (Table 1). Table 2 shows genes that were significantly upregulated above a certain threshold (Th). Individual levels of expression in all 93 coupled tumors are also shown for selected replication-associated genes (Supplementary Figure S1).
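The threshold-based testing summarized in Table 2 can be illustrated with a minimal sketch of a one-sided exact binomial test, written here in Python rather than in the R and Stata software used for the study. The function names and the T/N values are hypothetical, and the null proportion of 0.5 is an assumption taken from the ‘>50% of patients’ criterion given in the Statistical analysis section.

```python
from math import comb

def count_above(tn_ratios, threshold):
    """Number of tumors whose T/N ratio exceeds the threshold."""
    return sum(1 for r in tn_ratios if r > threshold)

def binom_upper_tail(k, n, p=0.5):
    """One-sided exact binomial P-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical T/N ratios for one gene in ten paired samples (not study data)
tn_ratios = [2.5, 3.1, 0.8, 4.2, 2.2, 5.0, 1.5, 2.9, 3.3, 2.1]

k = count_above(tn_ratios, threshold=2)       # tumors with T/N > 2
p_value = binom_upper_tail(k, len(tn_ratios)) # evidence that more than half exceed 2
print(k, p_value)
```

In practice the resulting P-values would still need the multiple-testing correction described in the Materials and methods.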

Table 1 Differential expression of DNA replication genes in coupled NSCLC tumors
Table 2 Exact binomial tests after setting four different T/N thresholds

As indicated in Table 1, we confirmed that the DNA repair gene ERCC1 and the tumor suppressor genes APC and TP53 were downregulated in NSCLC tumors, which is consistent with the published data.20, 21, 22 Conversely, we found that the KI67 gene, whose product is used as a proliferative marker by pathologists, was largely overexpressed in NSCLC tumors (T/N>4).

Genes involved in the licensing or the activation of replication origins were significantly overexpressed in tumors compared with controls (Table 1). Indeed, with the exception of ORC4, all the genes involved in initiation were overexpressed, including SLD5, CDT1, CYCLIN A, CYCLIN E, CDC45, CDC6, PLK1, and to a lesser extent DBF4 and MCM7. Genes encoding the nonhistone proteins HMGA1 and HMGA2 were also overexpressed. These proteins interact with the origin recognition complex,23 and have been implicated in replication origin function (JM Lemaitre, personal communication). Interestingly, we observed a concomitant repression of CUL4, an E3 ligase involved in the degradation of the replication licensing factor CDT1. Together, these data indicate that genes promoting origin firing are globally overexpressed in lung cancer cells.

Next, we investigated the expression of genes encoding different types of DNA polymerases, including replicative DNA polymerases (POLA, POLD and POLE), mitochondrial DNA polymerase (POLG) and specialized DNA polymerases involved in DNA lesion bypass or in DNA repair synthesis. Expression of the replicative DNA polymerases POLD and POLE and their processivity factor PCNA was slightly increased (Table 1), which is reminiscent of the overexpression (Table 1) of MCM7, MCM8, SLD5 and CDC45, encoding components of the replicative helicase.24 In contrast, the mitochondrial DNA polymerase POLG and all the so-called TLS polymerases (POLZ, POLK, POLI, POLH, POLL and REV1) were globally downregulated in NSCLC tumors, as also reported for colorectal and breast cancers.25, 26 Similarly, the expression of factors involved in the ubiquitylation of PCNA and the recruitment of the Y-family TLS polymerases (SHPRH, RAD6 and RAD18) was either downregulated or unchanged (Table 1). Exceptions to this rule include POLB and POLM, which are involved in DNA repair synthesis after base excision and nonhomologous end joining, respectively,27, 28 and POLQ, which is the only DNA polymerase significantly overexpressed in NSCLC tumors. Together, these data indicate that unlike replicative DNA polymerases, specialized DNA polymerases are globally downregulated in NSCLC tumors, with the notable exception of POLQ.

We then investigated the expression of genes involved in intra S-phase checkpoints, fork restart and post-replicative DNA repair. Overall, the expression of sensors of the DNA damage response (DDR) was either downregulated (ATM, RAD17) or slightly increased (53BP1, ATR, RAD9). In contrast, mediators (TIMELESS, CLASPIN) and effectors (CHK1) of the DDR pathway were overexpressed. As these factors act primarily at stalled replication forks, we checked whether other factors involved in replication fork repair and restart were also overexpressed. Remarkably, we found that expression of fork-associated factors such as SMARCAL1, FANCM, SLX4, RUVBL1, BRCA2, BLM, RECQ4 and RAD51 was globally upregulated to a variable extent. In contrast, the expression of factors involved in general DNA repair such as BACH1/FANCJ, RECQ1, SMC5, TIP60, XLF, LIG4, XRCC1, LIG3, XRCC4, DNAPKcs, LAMIN B, SIRT6, p300 and SIRT1 was either reduced or unchanged (Table 1). These data suggest that unlike general DNA repair factors, genes involved in the surveillance or the repair of stalled replication forks are upregulated in NSCLC cells.

Exact binomial tests indicated that the expression of the 31 downregulated genes was reduced by less than twofold in tumors compared with the normal tissues (Tables 1 and 2), whereas expression of 17 genes was increased by more than twofold (Table 2). The latter include CYCLIN A, RECQ4, POLQ, CLASPIN and CHK1 (T/N>3), CDC45, CDC6, KI67, HMGA2 and CDT1 (T/N>4) and PLK1 (T/N>5). To determine whether the overexpression of these genes reflects the proliferation status of cancer tissues, we used a Pearson’s test to compare their expression levels with that of the proliferation marker KI67. We found that it is indeed the case for the majority of them, but not for HMGA2 (rho=0.1), SLD5, CYCLIN E, RECQ4 and HMGA1 (rho=0.6; data not shown).
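The comparison with the proliferation marker can be sketched as a Pearson correlation between the expression levels of a candidate gene and those of KI67 across patients. The following Python example is only an illustration: the expression values and the variable names are invented, and it is not the code used for the analysis.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression levels in five tumors (not study data)
ki67 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_tracking_proliferation = [2.0, 4.0, 6.0, 8.0, 10.0]
gene_weakly_related = [3.0, 1.0, 4.0, 1.0, 5.0]

r_high = pearson(ki67, gene_tracking_proliferation)
r_low = pearson(ki67, gene_weakly_related)
print(r_high, r_low)
```

A coefficient close to 1 suggests that overexpression mostly mirrors proliferation, whereas a low coefficient points to a gene whose deregulation carries information beyond KI67.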

A replication gene signature is associated with poor prognosis

The ultimate aim of this study was to identify DNA replication genes whose expression in tumors can be used to predict the OS of NSCLC patients. A log-rank test for equality of survival functions identified a group of 10 genes associated with a higher overall morbidity (Figure 1 and Supplementary Table S2). This group includes POLQ (P=0.0008), PLK1 (P=0.0062), RAD51 (P=0.007), CYCLIN A (P=0.0128), CDC25A (P=0.0196), CLASPIN (P=0.0233), CDC6 (P=0.0404), POLL (P=0.0464) and RPA (P=0.0458). It also includes the DNA repair gene ERCC1 (P=0.0256), which is consistent with published data.22 Disease-free survival was associated with the expression of seven of these genes, that is, POLQ, PLK1, RAD51, CYCLIN A, CDC25A, CLASPIN and CDC6 (Supplementary Table S2). Of these seven genes, five also correlated with relapse-free survival when misregulated, the exceptions being CYCLIN A and CDC25A (Supplementary Table S2). Altogether, these data indicate that the expression of five genes, that is, POLQ, PLK1, RAD51, CLASPIN and CDC6, was associated with overall, disease-free and relapse-free survival (Table 3).

Figure 1

Relationship between OS and expression levels of POLQ, PLK1, RAD51, CLASPIN or CDC6. Probabilities for OS were estimated using Kaplan–Meier method. OS corresponds to the interval between the date of lung surgery and the date of death, related or not to the NSCLC. Expression levels were classified into three categories according to the T/N distribution, and the log-rank test was used to compare survival curves. Differences were considered significant at 5% level.

Table 3 Survival of patients according to the level of DNA replication gene expression

Next, we asked whether the expression of these five genes is co-regulated in NSCLC cells. Using a Pearson’s test, we show that it is indeed the case (Pearson’s coefficient>0.7; Supplementary Figure S2A). This was confirmed with an independent hierarchical ascending classification, following Ward’s criteria of the most deregulated genes (T/N>2). These data indicate that POLQ, CDC6 and PLK1, and to a lesser extent CLASPIN and RAD51, behave as a single ‘metamarker’ (Supplementary Figure S2B).

We used χ2 and Fisher’s exact tests to examine the dependence between gene expression and treatment (chemotherapy, chemotherapy plus radiotherapy or no adjuvant treatment) or node stages (N). This bivariate analysis revealed that the expression of POLQ, PLK1, RAD51, CLASPIN or CDC6 did not correlate (P>0.05, Supplementary Table S3) with the therapeutic strategy or with the number of metastasis-containing nodes. In contrast, the expression levels of CDC25B (P=0.0307), GEMININ (P=0.0263) and APC (P=0.0404) correlated with anti-cancer treatment and the expression levels of CDC25B (P=0.0344), SLX4 (P=0.0466), GEMININ (P=0.0192) and MCM7 (P=0.0364) correlated with the clinical classification (Supplementary Table S4).

Finally, we used a Cox multivariate regression model to examine the effect of age, sex, treatment, tumor stage and expression of KI67 and PCNA on the association between OS and expression of these five DNA replication genes. This analysis revealed that a strong association was still observed after adjustment for covariates for four of these genes (POLQ, PLK1, CLASPIN and CDC6), but not for RAD51 (Table 4), presumably because RAD51 expression correlates with that of PCNA and KI67 (Pearson’s coefficients rho=0.54 and rho=0.81, respectively). Remarkably, patients with tumors strongly overexpressing either POLQ, PLK1, CLASPIN or CDC6 show a much poorer prognosis than patients with normal expression levels, with a hazard ratio of 36.3 for POLQ (95% CI 2.6–517.4, P=0.008), 23.5 for PLK1 (95% CI 1.9–288.4, P=0.01), 20.7 for CLASPIN (95% CI 1.5–275.9, P=0.02) and 18.5 for CDC6 (95% CI 1.3–267.4, P=0.03). A five-gene signature of the OS was also determined using a single multivariate Cox regression model, including the expression levels of the five genes. For each patient, the predictive value of the Cox model was used as a risk score. Patients with a high-risk score exhibit a poorer survival prognosis, with a hazard ratio of 14.3 (95% CI 5.08–40.3, P<0.001) compared with the low-risk group (Figure 2).

Table 4 Multivariate Cox regression analysis of the relationship between survival and DNA replication gene expression by taking into account age, sex, treatment (in three categories: none/chemotherapy/chemotherapy and radiotherapy), T classification (two categories: T0 or T1/T2 or T3), KI67 and PCNA clinical markers
Figure 2

OS of low- and high-risk groups on the basis of five-gene signature. Probabilities for OS were estimated using the Kaplan–Meier method. The hazard ratio (HR) has been obtained using a Cox regression model with the group as covariate.

Validation of the prognostic significance

The prognostic significance of our markers was tested in a subset of the reference Director’s Challenge Consortium training-testing, multisite, blinded validation study, which provides the largest available set of microarray data with extensive pathological and clinical annotation.29 To validate our data, we only considered Director’s Challenge Consortium patients with completely resected stage IA to IIIA NSCLC who did not receive neo-adjuvant therapy (total n=400). We found that POLQ (P=0.002), PLK1 (P=0.003) and CDC6 (P=0.0001) expression successfully classified patients into low- and high-risk groups (Supplementary Table S5, Supplementary Figure S3).

Discussion

DNA replication is at the heart of the cancer process. It is perturbed by the initial events driving aberrant cancer cell proliferation, namely the deregulated expression of oncogenes.6, 30, 31, 32, 33 Oncogene-induced replication stress promotes the constitutive activation of the DNA damage response in pretumoral cells, leading to the bypass of anticancer barriers such as apoptosis or senescence.34 Replication problems also increase genetic instability in the course of tumor progression through reduction of polymerase fidelity, replication fork collapse or alteration of chromatin assembly on newly replicated DNA. Replication stress also promotes cancer development by inducing breaks at common fragile sites, which are specific regions of the genome that show increased fragility when DNA replication is perturbed. Common fragile site instability also increases when proteins involved in replication fork maintenance are mutated,35 or when the balance between replicative and TLS polymerases is altered.36

Although chronic replication stress promotes cancer development at its earliest stages, it is also a burden for tumor cells as it impedes cell proliferation. Recent evidence indicates that cancer cells often rely on an efficient replication stress response for viability, especially when they are deficient for p53,37 which can be exploited for selective treatment.38 Moreover, it has been recently reported that increased levels of the Chk1 kinase protect mouse cells against replication stress and favor transformation.39 Together, these data suggest that genes involved in DNA replication and in the replication stress response could be used as prognostic factors for aggressive cancers such as NSCLC. We have, therefore, monitored the expression of a subset of genes involved in different aspects of DNA replication, including initiation, elongation, replication fork maintenance and recovery, lesion bypass and post-replicative repair. We show that most of these genes are significantly deregulated in NSCLC cells relative to adjacent normal tissue. This observation supports the view that cancer cells adapt to replication stress by modifying key aspects of their replication program and by increasing their ability to restart arrested replication forks.

We also show that overexpression of POLQ, PLK1, RAD51, CLASPIN and CDC6 genes, either considered individually or as a signature, is associated with a bad prognosis in NSCLC. We propose that the upregulation of these genes helps cancer cells tolerate spontaneous replication stress. The mechanisms by which overexpression of these five genes protects cells from replication stress are currently unclear, but recent studies suggest that they could modulate key aspects of the replication stress response, including fork recovery and licensing of dormant replication origins (Figure 3). Dormant origins are backup origins that are licensed in G1, but are only used when cells are exposed to replication stress. Recent studies indicate that the viability of cancer cell lines depends on an excess of MCM proteins in the presence of the replication inhibitor hydroxyurea.40, 41, 42 As the recruitment of MCMs to replication origins depends on Cdc6,6 an attractive possibility is that overexpression of the CDC6 gene increases the licensing of dormant origins in aggressive NSCLC cells. Claspin and Rad51 are two other key players of the replication stress response, which promote the maintenance and recovery of arrested forks, respectively.43, 44, 45 Plk1 is a key regulator of the G2/M checkpoint, which prevents cells from entering mitosis with an incompletely replicated genome.46 Besides its checkpoint function, recent evidence indicates that Plk1 contributes to the licensing of dormant origins through the phosphorylation of the Orc2 protein,47 and promotes the recruitment of the Rad51 recombinase on DNA breaks.48 Finally, overexpression of the POLQ gene could allow the error-prone bypass of DNA lesions by facilitating extension from mismatches.49, 50 Interestingly, PolQ and its helicase domain could also promote the activation of dormant origins through its interaction with components of the pre-replication complex (A Vidal-Fernandez and C Cazaux, unpublished observations).

Figure 3

Model of the consequences of the overexpression of POLQ, PLK1, CLASPIN, RAD51 and CDC6 genes on the resistance to replication stress. (a) Deregulated oncogene expression induces replication stress in cancer cells. Replication fork stalling leads to the formation of DNA double-strand breaks and to the constitutive activation of the DNA damage response (DDR), driving cells into senescence, apoptosis or irreversible checkpoint arrest. (b) Cells overexpressing the five-gene signature are protected from replication stress in different ways. Upregulation of POLQ would favor replication past DNA lesions. Overexpression of CLASPIN and RAD51 would favor the maintenance and the recovery of stalled replication forks. Finally, upregulation of CDC6 and PLK1 could increase the number of licensed backup origins. PLK1 overexpression could also promote cell division by helping cells bypass the G2/M checkpoint.

Altogether, these data support the view that the overexpression of CDC6, PLK1, RAD51, CLASPIN and POLQ genes promotes cancer progression by increasing resistance to endogenous replication stress in a concerted manner. Besides this five-gene prognostic signature, this view is also consistent with the overexpression of many genes involved in the initiation of DNA replication and fork repair, and the downregulation of TLS polymerase genes, with the notable exception of POLQ.

Because of the high risk of recurrence, adjuvant chemotherapy is always recommended for stage II and III patients and for a subset of stage I patients, depending on the legislation of the country. Many of these patients will receive needless chemotherapy, either because of primary chemoresistance or because of their good intrinsic prognosis. Although recent multigene signatures look promising,2, 3, 4 there is still no reliable predictive genetic marker that could help clinicians choose the adequate strategy.1 In patients treated with platinum-based adjuvant chemotherapy, significant prognostic value has been associated with the expression of the nucleotide excision repair gene ERCC1.22 Combined with the mismatch repair gene MSH2, ERCC1 expression also helps identify patients who would benefit from chemotherapy.51 Here, we show that genes involved in the response to DNA replication stress may help refine lung cancer prognosis after surgery by efficiently discriminating between low-risk and high-risk patients, independently of the therapeutic strategies or tumor stages.

Materials and methods

Patients, tumor samples

Patients were treated and followed up at the Thoracic Oncology Department of the Toulouse University Hospital (France). Coupled tumoral and normal lung tissue samples were collected from chemo-naive patients surgically treated between 2006 and 2010 for stage I to III primary lung adenocarcinoma, then analyzed in the Pathological Department by applying the latest WHO classification. The characteristics of the patients and tumors are described in Supplementary Table S1. Immediately after surgery, tumoral and normal lung tissue samples were snap-frozen in liquid nitrogen, and stored at −80 °C. Normal lung tissues were taken from the surgical specimens, at a distance of >3 cm from the tumor. Diagnosis was assessed by a lung cancer pathologist by applying the latest WHO classification, and clinicopathological stage was assigned according to the Tumor-Node-Metastasis classification. Frozen samples were analyzed on hematoxylin- and eosin-stained slides, and tumor-rich areas were marked by the pathologist to ensure that the maximum amount of tumor material was collected for the genetic testing. All specimens contained at least 80% tumor cells, and necrosis was removed. Exclusion criteria included non-adenocarcinoma histology and stage IIIb and IV tumors. Written informed consent was obtained from all patients before testing. All informed consents were collected and stored in the Pathology Department. This study was approved by the Ethics of Human Research Committee at the Pathology Toulouse Hospital. The experiments conformed to the principles set out in the WMA Declaration of Helsinki.

RNA extraction and quantification

Thick frozen sections of tissues were obtained by using a Leica CM3050S cryostat (Nanterre, France). Frozen tissue sections of thickness 10–300 μm were disrupted for 90 s at room temperature using 5-mm diameter stainless steel beads and a tissue lyser (Qiagen, Courtaboeuf, France); total RNA was then extracted with the RNeasy extraction kit according to the manufacturer’s instructions (Qiagen). The quality of total RNA (OD260/OD280 >1.7) was assessed with the Agilent 2100 bio-analyzer using the RNA Nano Lab chip, 6000 Nano Assay kit (Agilent Technologies, Santa Clara, CA, USA). Its quantity was estimated with Nanodrop (Thermo Scientific, Brebières, France). The four most stable control housekeeping genes (GUSB, IPO8, HMBS, UBS) were selected by the GeNorm and BestKeeper software among 16 tested on the TaqMan Low Density Human Endogenous Control Array (Applied Biosystems, Carlsbad, CA, USA) after amplifying in triplicate 800 ng cDNA from four tumor biopsies and their four normal counterparts using the TaqMan Universal PCR Master Mix, the TaqMan Low Density Array technology (Applied Biosystems) and the 7900HT fast real-time PCR system.

To quantify RNAs from tumor and normal tissues, cDNAs were first preamplified in the presence of the 3R probes (TaqMan gene Expression Assays, Applied Biosystems) in TaqMan Preamp Master Mix (Early Access, Applied Biosystems). These products were then amplified using the Dynamic Array technology (Fluidigm, BioMark, Pueblo, CO, USA). Preamplified products were incubated in DNA-Binding Sample Loading reagent (BioMark), Master Mix (Applied Biosystems) and probes, then injected in nanotubes-containing Integrated Fluidic Circuit loader and amplified with the BioMark amplifier. Fluidigm data were analyzed with the GenEx software after normalizing the transcript levels in tumor (T) and normal (N) tissues to the mean level of the four selected stable genes. Relative levels of expression in the tumor sample compared with the adjacent normal tissue were expressed either as Δ(ΔCt) or as T/N ratios. For a given transcript, Δ(ΔCt)=(CtN−CtHKN)−(CtT−CtHKT), where CtT and CtN are the cycle thresholds (Ct) in tumor and adjacent normal tissue, respectively, and CtHKT and CtHKN are the mean Ct values of the four housekeeping (HK)/stable genes, that is, GUSB, IPO8, HMBS and UBS, in tumor and normal counterpart, respectively. T/N was calculated from the transformation Δ(ΔCt)=log2(T/N). T/N>1 (<1) indicates a higher (lower) expression in the tumor sample compared with the adjacent normal tissue.
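As a worked illustration of this normalization, the short Python sketch below computes Δ(ΔCt) and the corresponding T/N ratio for a single transcript, reading the formula as the normal-tissue ΔCt minus the tumor ΔCt. The Ct values and function names are hypothetical and chosen only to make the arithmetic easy to follow; the study itself used the GenEx software.

```python
from statistics import mean

def delta_delta_ct(ct_tumor, ct_normal, hk_tumor, hk_normal):
    """Δ(ΔCt) = (CtN − CtHKN) − (CtT − CtHKT).

    hk_tumor and hk_normal hold the Ct values of the housekeeping genes
    measured in the tumor and in the adjacent normal tissue.
    """
    delta_normal = ct_normal - mean(hk_normal)
    delta_tumor = ct_tumor - mean(hk_tumor)
    return delta_normal - delta_tumor

def tn_ratio(ddct):
    """Invert Δ(ΔCt) = log2(T/N) to recover the T/N expression ratio."""
    return 2.0 ** ddct

# Hypothetical Ct values for one gene and four housekeeping genes (not study data)
ddct = delta_delta_ct(
    ct_tumor=24.0,
    ct_normal=26.0,
    hk_tumor=[20.0, 21.0, 22.0, 21.0],
    hk_normal=[20.0, 21.0, 22.0, 21.0],
)
ratio = tn_ratio(ddct)
print(ddct, ratio)  # a transcript detected two cycles earlier in the tumor gives T/N = 4
```

Because a lower Ct means more template, a positive Δ(ΔCt) corresponds to T/N>1, that is, to overexpression in the tumor.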

Statistical analysis

Statistical analyses were performed using the free statistical software R (version 2.9.2), including the ‘survival’ package (R development Team, http://cran.r-project.org/) and Stata SE 11.2 software (Stata Corporation, College Station, TX, USA). When comparing the expressions in cancer tissues, the major parameters were the individual T/N ratios. The probability to observe >50% of patients who displayed a misregulation ratio higher (T/N>2) or lower (T/N<2) than 2 was assessed by a bilateral binomial test. Other thresholds were tested by unilateral binomial tests (T/N>5 or T/N<1/5, T/N>4 or T/N<1/4, T/N>3 or T/N<1/3, and T/N>2 or T/N<1/2). Correlations between genes were assessed with Pearson’s correlation coefficient. A clustering algorithm, hierarchical ascending classification, was also applied. This clustering was carried out for genes using Ward’s method of linkage and correlation distance as a metric. Expression levels were classified in three categories, according to the terciles of the T/N distribution. Expression levels were compared by χ2- or Fisher’s exact test in relation to treatment (surgery only, surgery–chemotherapy–radiotherapy, or surgery–chemotherapy) or tumor stage. Survival probabilities were estimated using the Kaplan–Meier method by investigating the OS (time between surgery and death due to any cause, that is, both related and unrelated to the NSCLC), disease-free survival (time between surgery and either death due to any cause, loco-regional recurrence or detection of distant metastases) and relapse/recurrence-free survival (time between surgery and loco-regional/distant recurrence or distant metastases). The log-rank test was used to compare survival curves. We used a multivariate Cox’s proportional hazards regression model adjusted for sex, age, treatment, tumor grade and expression level of KI67 and PCNA genes.
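The Kaplan–Meier estimate of OS can be illustrated with a minimal product-limit sketch in Python. It is not the R ‘survival’ code used for the study; the follow-up times, event indicators and function name are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  : follow-up time for each patient (e.g. months since surgery)
    events : 1 if the patient died at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct death time.
    """
    survival = 1.0
    curve = []
    death_times = sorted({t for t, e in zip(times, events) if e})
    for t in death_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up of five patients (not study data)
times = [6, 12, 12, 20, 30]
events = [1, 1, 0, 1, 0]  # death from any cause counts as an event for OS

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients contribute to the number at risk until their last follow-up but never trigger a drop in the curve, which is why the estimate falls only at the three death times in this toy example.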
To build the five-gene signature, we used a multivariate Cox regression model, including the five prognostic genes split in terciles. We used the linear predictive value of the Cox model as a risk score, which is the expression level for each gene weighted by the associated coefficients of the Cox model. The cutoff of the risk score was determined to obtain the most significant difference between the two groups using the log-rank test (maxstat package in R). For all statistical tests, differences were considered significant at the 5% level. The Benjamini and Yekutieli correction for multiple testing was applied with binomial tests.
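The construction of the risk score can be sketched as follows in Python, assuming that each gene is coded 0, 1 or 2 according to its tercile and that the score is the sum of these codes weighted by the Cox coefficients. The coefficients, the cutoff and the expression values below are placeholders and not the fitted values of the study; in the study the cutoff was selected with the maxstat package in R, a step this sketch does not reproduce.

```python
from statistics import quantiles

GENES = ["POLQ", "PLK1", "RAD51", "CLASPIN", "CDC6"]

def tertile_codes(values):
    """Code each expression value as 0 (low), 1 (middle) or 2 (high) third."""
    low_cut, high_cut = quantiles(values, n=3)
    return [int(v > low_cut) + int(v > high_cut) for v in values]

def risk_score(codes_by_gene, coefficients):
    """Linear predictor: sum over genes of coefficient x tercile code."""
    return sum(coefficients[g] * codes_by_gene[g] for g in GENES)

# Placeholder Cox coefficients (hypothetical, not taken from the study)
coefficients = {"POLQ": 0.9, "PLK1": 0.6, "RAD51": 0.1, "CLASPIN": 0.5, "CDC6": 0.4}

# Hypothetical T/N ratios of one gene across six patients, coded by tercile
codes = tertile_codes([0.5, 0.8, 1.2, 1.9, 2.5, 4.0])

# Hypothetical tercile codes for one patient across the five genes
patient = {"POLQ": 2, "PLK1": 2, "RAD51": 1, "CLASPIN": 0, "CDC6": 1}
score = risk_score(patient, coefficients)

CUTOFF = 2.0  # placeholder cutoff
group = "high-risk" if score > CUTOFF else "low-risk"
print(codes, score, group)
```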

Validation of prognostic signature

Our signature was tested in a reference and independent microarray data set, that is, the US National Cancer Institute Director’s Challenge Consortium for the Molecular Classification of Lung Adenocarcinoma, which includes University of Michigan Cancer Center, Moffitt Cancer Center, Memorial Sloan–Kettering Cancer Center and the Dana–Farber Cancer Institute. To reflect our patient population as much as possible, validation was restricted to NSCLC stage IA to IIIA patients who did not receive any neo-adjuvant treatment. Following sorting, the Director’s Challenge Consortium validation data set included 400 patients. CEL files data extraction and normalization were performed with R Bioconductor (www.bioconductor.org), AMP (http://compbio.dfci.harvard.edu/ampand) and Cluster 3.0 (http://bonsai.hgc.jp/mdehoon/software/cluster/software.htm). Each data set was independently normalized using the RMA procedure. Cluster 3.0 was used for data filtering (removing all genes that have missing values in >80% of the patients), log2 transformation and normalization. Links between probe sets and gene names were identified with Clone Gene ID Converter (http://idconverter.bioinfo.cnio.es) and GeneAnnot (http://genecards.weizmann.ac.il/geneannot/index.shtml). The probability to observe >50% of patients who displayed a gene under/over-regulated was assessed by a bilateral binomial test. The threshold was the median expression value of the given control genes (generated from the Universal Human Reference RNA from Stratagene, La Jolla, CA, USA). Kaplan–Meier OS analysis and log-rank test were performed with R (survival package), in order to determine the prognostic significance of our five genes in the validation set. For a given gene, the ‘low risk group’ represents the half of the population whose expression value was under the median expression value. The ‘high risk group’ represents the other half of the population. For statistical tests, differences were considered significant at the 5% level. The Benjamini and Yekutieli correction for multiple testing was applied to the binomial tests.
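The median-based grouping and the comparison of the two survival curves can be illustrated with the following Python sketch of a two-group log-rank test. The data are invented, the function names are hypothetical, and the P-value relies on the chi-square distribution with one degree of freedom; the actual validation used the survival package in R.

```python
from math import erfc, sqrt
from statistics import median

def median_split(expression):
    """Label patients 'low' (expression under the median) or 'high' (the rest)."""
    cut = median(expression)
    return ["low" if x < cut else "high" for x in expression]

def logrank(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, P-value)."""
    observed = expected = variance = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [(g, ti, e) for ti, e, g in zip(times, events, groups) if ti >= t]
        n = len(at_risk)
        n_high = sum(1 for g, _, _ in at_risk if g == "high")
        d = sum(1 for _, ti, e in at_risk if ti == t and e)
        d_high = sum(1 for g, ti, e in at_risk if ti == t and e and g == "high")
        observed += d_high
        expected += d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value

# Hypothetical expression values and follow-up for six patients (not study data)
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
times = [5, 8, 9, 2, 3, 4]
events = [1, 1, 1, 1, 1, 1]

groups = median_split(expression)
chi2, p_value = logrank(times, events, groups)
print(groups, round(chi2, 3), round(p_value, 4))
```

In this toy data set all deaths in the high-expression group precede those in the low-expression group, so the observed number of deaths in the high group (3) clearly exceeds the number expected under equal survival (1.15).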