A gene-expression signature to predict survival in breast cancer across independent data sets


Prognostic signatures in breast cancer derived from microarray expression profiling have been reported by two independent groups. These signatures, however, have not been validated in external studies, making clinical application problematic. We performed microarray expression profiling of 135 early-stage tumors, from a cohort representative of the demographics of breast cancer. Using a recently proposed semisupervised method, we identified a prognostic signature of 70 genes that significantly correlated with survival (hazard ratio (HR): 5.97, 95% confidence interval: 3.0–11.9, P=2.7e−07). In multivariate analysis, the signature performed independently of other standard prognostic classifiers such as the Nottingham Prognostic Index and the ‘Adjuvant!’ software. Using two different prognostic classification schemes and measures, nearest centroid (HR) and risk ordering (D-index), the 70-gene classifier was also found to be prognostic in two independent external data sets. Overall, the 70-gene set was prognostic in our study and the two external studies which collectively include 715 patients. In contrast, we found that the two previously described prognostic gene sets performed less optimally in external validation. Finally, a common prognostic module of 29 genes that associated with survival in both our cohort and the two external data sets was identified. In spite of these results, further studies that profile larger cohorts using a single microarray platform, will be needed before prospective clinical use of molecular classifiers can be contemplated.


Microarray expression profiling has shown promise for prognostication of breast cancer. van‘t Veer et al. (2002) identified a 70-gene-expression signature, which predicted the outcome of pre-menopausal patients with more accuracy than conventional prognostic indicators and validated their signature in a follow-up study (van de Vijver et al., 2002). However, this validation was imperfect as the training and validation cohorts had overlapping patients and external validation using independent data sets was not performed. Subsequently, Wang et al. (2005) identified a different 76-gene prognostic signature, which was a better predictor in pre-menopausal breast cancer as compared to post-menopausal cases. In a separate analysis, a wound-response signature, in combination with the 70-gene signature, was shown to identify a group of early-onset breast cancers with significantly worse outcome (Chang et al., 2005). It is noteworthy that van‘t Veer's cohort was made up of predominantly younger patients: mean age of 44 (±8) years. Wang's cohort had a mean age of 54 (±12) years, but only 51% of the cases were post-menopausal. As a result, both of these studies underrepresented cases from post-menopausal women, which constitute the majority of breast cancers seen in routine clinical practice. In addition, these studies were not independently validated in external data sets and therefore, clinical application is uncertain. The variation of gene signatures between different studies has been highlighted in follow-up analyses showing that the outcome-predictive 70-gene set was not unique and was strongly influenced by the subset of patients used for gene selection (Ein-Dor et al., 2005; Michiels et al., 2005).

Using a meta-analysis approach, Shen et al. (2004) identified a meta-signature by integrating data from four cohorts. This meta-signature predicted outcome in the studies from which it was derived, but was not validated across independent data sets. Although a meta-analysis approach benefits from an increase in the number of cases, the known variations between microarray platforms (Tan et al., 2003) may introduce new variability, which would need further validation. Interestingly, others have suggested that conventional prognostic classifiers, such as the Nottingham Prognostic Index (NPI) (Galea et al., 1992), predict outcome as well as the 70-gene signature (Eden et al., 2004). For all these reasons, it has been suggested that expression-derived classifiers are not ready for immediate clinical application (Brenton et al., 2005).

We designed a gene-expression study in a consecutive cohort of breast cancer cases with long-term follow-up representative of the population of breast cancer cases commonly seen in clinical practice: the majority of patients (67%) were post-menopausal. We then tested the performance of the expression signature derived from this cohort in independent external publicly available data sets.


A gene-expression signature derived using a semisupervised approach predicts overall survival

We first ranked the genes according to their log-rank test P-values derived from univariate Cox-regression analysis of survival. We next estimated the false discovery rate using two different methods: q-values and a non-parametric randomized approach. Both methods showed that over a range of relevant significance thresholds the expected number of false positives was significantly smaller than the number of significant tests (Supplementary Figure 1). In view of this positive result, we next applied the Cox-clustering methodology (Bair and Tibshirani, 2004) (Supplementary Methods) to derive a prognostic signature. Before deriving a prognostic signature from the entire data set, we performed multiple internal cross-validations, re-ranking the genes and deriving an optimal classifier for each choice of training set, and subsequently testing each optimal classifier on the corresponding test set, as explained in detail in Supplementary Methods. This showed that the optimal classifiers derived using Cox-clustering were not overfitted to the training set yielding highly significant Cox P-values (P<10e−3) for the distribution of test samples into poor and good prognostic groups. To arrive at an optimal classifier, independent of any specific choice of training set, we reapplied the Cox-clustering methodology to the entire data set without internal cross-validation. By incrementally adding one gene at a time from the Cox-ranked list to a classifier set of genes and then evaluating this set's association with survival via the hazard ratio (HR) of the two groups obtained by a robust 2-means algorithm, we found that there were many gene signatures, ranging in size from 30 to 110 genes, significantly predicting overall survival. The 95% confidence intervals (CI) for the HRs showed that no optimal classifier could be rigorously defined (Supplementary Figure 2). 
However, for purposes of external validation we declared the 70-gene signature (the ‘Cox-ranked’ signature), which maximized the HR, as optimal (Table 1, Figures 1 and 2a). The top 200 Cox-ranked genes from which this signature was derived are listed in Supplementary Methods and Supplementary Excel File 1 and the distribution of clinical parameters over the good/bad prognostic samples is shown in Supplementary Table 1.

Table 1 Cox-ranked signature
Figure 1

Heat map of the Cox-ranked prognostic gene-expression signature. Gene-expression pattern of the prognostic signature in 135 breast tumors. Genes are depicted on the vertical axis and samples on the horizontal axis. The bar shows ‘poor vs good outcomes’ using robust k-means (k=2) clustering.

Figure 2

Kaplan–Meier survival curves for the Cox-ranked gene-signature in internal and external data sets. (a) Cox-ranked prognostic signature in the cohort of 135 patients. (b) External validation of Cox-ranked classifier in van de Vijver's cohort using the nearest centroid method. (c) External validation of Cox-ranked classifier in Wang's cohort using the nearest centroid method. (d) Performance of Cox-ranked gene set in van de Vijver's cohort. (e) Performance of Cox-ranked gene set in Wang's cohort. P-value and HR (CI range) are demonstrated for each panel. ‘n’ shows the number of events/number of samples in each group.

We next investigated the association of these prognostic genes with cellular function using both the EASE software package and the Gene Ontology (GO) Tree Machine. The stability of results was tested by analysis of the top 70, 110, 150 and 190 Cox-ranked genes and further confirmed by comparing results between the two GO analysis tools. We detected a significant association (P<0.01 after correction for multiple testing) between prognostic genes and cell cycle, mitosis, extracellular matrix, transcription factors and calcium ion binding (Supplementary Table 2). The overrepresentation of extracellular matrix genes (e.g. OMD, SPARCL1, LUM) and developmental genes (NANOG, TIMELESS, SOX11, SHOX2) are novel findings in breast cancer biology.

Prognostic signature is an independent predictor of overall survival

To test whether the Cox-ranked signature predicted survival independently of other prognostic factors in our cohort, we performed multivariate analysis (Table 2). The Cox-ranked signature was the most significant predictor in univariate analysis with HR of 5.97 (95% CI: 3.00–11.89, P=2.7e-07). It is notable that the prognostic gene-signature predicted outcome independently of other prognostic tools such as the NPI and the Adjuvant! software (Table 2). The predictive power of the Cox-ranked signature in different clinical subgroups was further assessed using stratified survival analysis, which showed that it was strongly associated with outcome within each major clinical subgroup including both estrogen receptor ER+ and ER− cases (Supplementary Table 3).

Table 2 Univariate and multivariate analysis for overall survival

Validation of the prognostic gene-expression classifier in external cohorts

Next, we tested the Cox-ranked signature in both van de Vijver's and Wang's cohorts (van de Vijver et al., 2002; Wang et al., 2005). We first identified the genes from the signature that were present in these data sets, yielding Cox-ranked classifiers of 39 genes (van de Vijver) and 55 genes (Wang). For each sample in the external cohorts, a continuous prognostic index (PI) was computed using the Cox-ranked classifier as explained in Materials and methods. Samples were then assigned to poor or good prognostic classes depending on whether PI>0 or PI<0, respectively (nearest centroid classification rule). Assessment of the prognostic classification thus obtained was performed using the HR (from a Cox-regression). The classifier significantly predicted overall survival in van de Vijver's cohort (Figure 2b). In Wang's cohort (Wang et al., 2005), which used a different microarray platform (Affymetrix) and clinical end point (time to distant metastasis (TTDM)), the signature was not as predictive (Figure 2c). The sensitivity, specificity, negative predictive value and positive predictive value for the classifier in both cohorts further confirmed our findings (Table 3). To assess the prognostic ability of our classifier more rigorously, we computed the D-index, which evaluates the significance of the risk ordering of the samples as determined by the continuous PI (Table 3). The log-rank test P-value for the D-index was highly significant in van de Vijver (P<10e−6), and to a lesser degree in Wang's cohort (P<10e−4). We verified these P-values using a randomized permutation approach in which time labels were randomly permuted a large number of times (1000).
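The sensitivity, specificity and predictive values mentioned above can be illustrated with a minimal sketch. It assumes that each patient has a binary prognostic call (poor if PI>0) and a binary outcome at a fixed follow-up horizon, which ignores censoring; the function name and the toy values are hypothetical and are not taken from the study.

```python
def classification_metrics(predicted_poor, observed_event):
    """Sensitivity, specificity, PPV and NPV for a binary prognostic call.

    predicted_poor: iterable of bools, True if the sample was called poor prognosis.
    observed_event: iterable of bools, True if the event occurred (assumed to be
    known for every sample, i.e. censoring is ignored in this sketch).
    """
    pairs = list(zip(predicted_poor, observed_event))
    tp = sum(1 for p, e in pairs if p and e)
    fp = sum(1 for p, e in pairs if p and not e)
    fn = sum(1 for p, e in pairs if not p and e)
    tn = sum(1 for p, e in pairs if not p and not e)

    def ratio(num, den):
        # Return NaN rather than fail when a margin of the 2x2 table is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical prognostic index values and outcomes for six samples.
pi_values = [0.4, 0.1, -0.2, -0.3, 0.25, -0.05]
events = [True, False, False, False, True, True]
calls = [pi > 0 for pi in pi_values]  # nearest centroid rule: poor if PI > 0
metrics = classification_metrics(calls, events)
```

With time-to-event data, these quantities depend on the follow-up horizon chosen to define the outcome, which is one reason the HR and the D-index are reported alongside them.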

Table 3 Performance of Cox-ranked classifier in external data sets

Identification of a robust prognostic gene set across independent data sets

Variability due to differences in experimental methodology or in clinical end points might ‘alter’ the precise profile of a truly prognostic signature, but this effect might not be large enough to alter the prognostic significance of the individual genes. We thus decided to perform unsupervised robust k-means clustering (over the genes present in the Cox-ranked classifier) to separate each of the two external cohorts into two groups and to test them for significant association with overall survival (or the surrogate TTDM), using Kaplan–Meier curves and the HR. Importantly, this approach also provided us with an evaluation framework in which to formally compare different prognostic gene sets. Using our Cox-ranked gene set, we found HR=4.43 (P<10e−8) (Figure 2d) in van de Vijver's and HR=1.61 (P=0.01) (Figure 2e) in Wang's cohort. To show that these results were unlikely to have arisen by chance, we randomly selected 55 and 39 genes from the corresponding arrays, reclustered each of the two cohorts into two groups and finally computed the log-rank test P-values. This operation was performed a total of 1000 times. In van de Vijver's, we found a 0.002 chance of observing a P-value as extreme as ours, whereas in Wang's study it did not quite reach significance (0.12). These results confirmed our previous analysis and suggested to us the presence of a set of genes that are prognostic across three independent data sets. Further evidence of the robustness of our gene set compared with other published prognostic sets was provided by a direct comparison of the HRs and P-values (Supplementary Table 4). The van‘t Veer 70-gene set did not yield a significant HR in our data, whereas Chang's core serum response (CSR) set, derived from van‘t Veer's cohort, was of borderline significance (P=0.0502). Shen's 90-gene meta-set was also only prognostic in the two studies from which it was derived and failed to associate with outcome in both ours and Wang's data sets (data not shown). 
Wang's 76-gene set performed reasonably well across all three data sets, but had a lower HR than our 70-gene set.
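The random gene set comparison described above can be sketched in two parts: a two-group log-rank test, and the fraction of resampled P-values that are at least as extreme as the observed one. This is an illustrative sketch only. The gene sampling and robust k-means reclustering steps are not reproduced, and the function names are hypothetical rather than those of the software used in the study.

```python
import math


def log_rank_p(times, events, groups):
    """Two-group log-rank test. Returns (chi-square, P-value) with 1 d.f.

    times: follow-up times; events: True if the event was observed (False if
    censored); groups: group labels, where 1 marks the first group.
    """
    observed_1 = expected_1 = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = [(g, tt, e) for tt, e, g in zip(times, events, groups) if tt >= t]
        n = len(at_risk)
        n1 = sum(1 for g, _, _ in at_risk if g == 1)
        d = sum(1 for _, tt, e in at_risk if e and tt == t)
        d1 = sum(1 for g, tt, e in at_risk if e and tt == t and g == 1)
        observed_1 += d1
        expected_1 += d * n1 / n
        if n > 1:
            # Hypergeometric variance of the event count in group 1.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_1 - expected_1) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 d.f.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def empirical_fraction(observed_p, null_p_values):
    """Fraction of resampled P-values at least as extreme as the observed one."""
    hits = sum(1 for p in null_p_values if p <= observed_p)
    return hits / len(null_p_values)


# Hypothetical toy cohort: three early events in group 1, three later in group 2.
chi2, p_value = log_rank_p(
    times=[1, 2, 3, 4, 5, 6],
    events=[True] * 6,
    groups=[1, 1, 1, 2, 2, 2],
)
```

In a full analysis, `null_p_values` would hold one log-rank P-value per random gene set (1000 in the text), each obtained after reclustering the cohort on that gene set.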

Identification of a common prognostic module

As our 70-gene set was able to significantly associate with survival/TTDM across three cohorts, we decided to identify a core set of common survival-related genes that we called a prognostic module. Specifically, we asked how many of the top 200 Cox-ranked genes (from which the Cox-ranked signature was derived), which were also present in van de Vijver and Wang's cohorts, were also associated with survival/TTDM (Cox P-value <0.05) in these cohorts. This resulted in 57 out of a possible 106 genes in van de Vijver's data set and 57 out of a possible 157 in Wang's data set with Cox P-values <0.05. These two sets of 57 genes had 29 in common (Supplementary Table 5), which we found to be highly significant (10e−11<P<0.016).


There are several known problems that hinder the identification of gene-expression signatures that are prognostic across multiple data sets. One difficulty is that a large number of measurements and therefore potential correlations are derived. This leads to data overfitting, which in turn hinders the external validity. The approach most commonly followed to circumvent this problem is to divide the study cohort into training and validation subsets. This, however, decreases the sample size and can also produce other sources of bias (Ein-Dor et al., 2005). Another concern relates to the way cohorts are categorized into good and bad outcome, which is usually performed by some rather arbitrary threshold, thus introducing significant study-specific bias. A natural way to circumvent this problem is to treat outcome as a continuous variable when ranking genes, and to use unsupervised clustering over these genes to identify, without bias, subgroups of patients with different outcome (Bair and Tibshirani, 2004). Finally, another problem relates to the interplatform and intercohort variability, which significantly hinders the external validity of derived signatures. A way round this is to apply the methodologies to homogeneous subgroups of patients and to devise new more robust methods, which are less sensitive to study-specific factors.

In view of all this, we applied a semisupervised method, ‘Cox-clustering’ (Bair and Tibshirani, 2004), to analyse a cohort of 135 primary breast cancers profiled with expression microarrays. We derived an optimal classifier (Cox-ranked) and tested its performance across independent data sets. Rigorous internal cross-validations were performed to test for overfitting, but the prognostic signature itself was derived using the entire cohort to avoid bias associated with a specific choice of training set. Whereas in this context the use of an ensemble of classifiers (as given by the different choices of training set) can be contemplated and can improve prediction over a single classifier (Sollich and Krogh, 1996), there are also clear advantages gained from using a unique classifier gene set, particularly in that it provides a smaller set of genes to be tested in future prospective studies. Even though application of the method to ER+ and ER− subgroups separately is desirable, for our cohort ER status and survival were not significantly correlated (Pearson correlation coefficient 0.03), thus allowing an agglomerative analysis. The derived classifier clustered the patients into two groups with survival times that were significantly different by the log-rank statistic and the plotted survival curves with 95% CI confirmed the validity of the signature (Figure 2). Multivariate analysis also demonstrated that the Cox-ranked signature was an independent prognostic classifier in a model that included node status, tumor size, grade and ER status. Importantly, the signature predicted outcome independently of the NPI and the ‘Adjuvant!’ software, the latter recently shown to be highly accurate in predicting 10-year survival (Olivotto et al., 2005).

The external validation of the Cox-ranked classifier was carried out using two different prognostic classification schemes and two different measures of prognostic separation (HR and D-index). Importantly, using all methods, the classifier was a significant predictor of survival in van de Vijver's cohort. In particular, we found that the D-index was highly significant and was especially valuable as a prognostic separation measure (Table 3). Using the nearest centroid classification rule and the HR, performance was also good (Figure 2b). There are two likely explanations as to why the classifier only reached borderline significance in Wang's cohort. First, there are large differences between microarray platforms, Agilent in our study (and van de Vijver's) and Affymetrix in Wang's data set. Differences in microarray technology as well as transcript and annotation information are known causes of the observed variability across platforms (Tan et al., 2003; Irizarry et al., 2005). Second, the clinical end point used in both van de Vijver's and our study was overall survival, whereas Wang et al. (2005) used TTDM. The dependence of prognostic signatures on the outcome measure used could be clearly demonstrated both in our data set and in van de Vijver's (data not shown). Similar results were obtained by using robust k-means clustering over the 70 genes of our classifier to separate van de Vijver's and Wang's cohorts into two clusters each and showing that the clusters were significantly associated with survival and TTDM, respectively (Figure 2d and e). It is noteworthy that using this same method, external prognostic gene sets did not perform as well in our data set (Supplementary Table 3). Further analysis along these lines showed that, in spite of all interstudy variability, there were 29 genes with significant Cox scores in all three cohorts, an overlap that is itself highly significant. Not surprisingly, when clustering over these 29 ‘core’ prognostic genes, each cohort was significantly separated into poor and good prognostic classes.

We did not attempt to validate the signature in Sotiriou's data set (Sotiriou et al., 2003), owing to the small number of overlapping genes that had significant expression changes in that data set (only three: MYBL2, BUB1 and PRAME). Nevertheless, these genes strongly correlated with prognosis in Sotiriou's cohort (P<0.001).

Interestingly, the 29 overlapping genes were consistently overexpressed in poor prognosis tumors (with the exception of SMARCA4 and TIF1 in Wang's and MBP in van de Vijver's and Wang's cohorts). The biological plausibility of these 29 genes is supported, as the majority could be directly or indirectly connected with breast cancer development. For example, RAD54L, RAB22A, CDC2, DTYMK, MAD2L1, EXO1, TIMELESS, CTPS, PTTG1/Securin, TIF1, MYBL2, BUB1, DNMT3B, FANCA, ZWINT, PKMYT1 are either oncogenes or involved in cell cycle progression, chromosome dynamics, mitotic checkpoint regulation or DNA-damage response; three (PSMD2, PSMD7, APPBP1) are involved in ubiquitinylation or are proteasome components and may therefore also be involved in cell cycle regulation; two (EBP, SQLE) are involved in sterol biosynthesis; two (TIF1, SMARCA4) interact with the ER; and two (FLJ10292/MAGO-NASHI homolog, EIF4EBP1) are part of a conserved protein complex that includes MLN51, the human homolog of Drosophila barentsz, a gene involved in 17q amplifications in human breast cancers (Degot et al., 2004). The evidence that some of these 29 genes are part of a ‘core’ prognostic signature is strengthened by the fact that six (BM039, CTPS, PSMD2, BUB1, MAD2L1, PSMD7) were also identified as part of the 231 genes that correlated with the prognostic categories in the original van‘t Veer paper and five (BUB1, MAD2L1, PKMYT1, BM039, PSMD7) overlap with a cell proliferation signature recently derived from the same data and associated with extremely poor outcome (Dai et al., 2005). Incidentally, a multi-gene reverse transcriptase–polymerase chain reaction (RT–PCR) profile to predict recurrence in a large cohort of tamoxifen-treated patients (Paik et al., 2004) also contained two of our prognostic genes, CTSL and MYBL2. This indicates that some of the prognostic genes are reproducible using completely different technologies, cohorts and analysis algorithms.

We also noted that a recently described 64-gene prognostic signature, derived from a cohort of patients most of whom received systemic adjuvant treatment (Pawitan et al., 2005), showed no overlap with our 70-gene signature and only a three-gene overlap with van‘t Veer's signature.

In conclusion, we have identified a prognostic classifier and gene set, which predicts the overall survival in three independent studies. We do not propose this signature for clinical application but it provides a building block for future work. We believe the way forward is to reduce the sources of variability such as microarray platform, methodology and tumor heterogeneity in future profiling studies, as well as developing new robust algorithms that build on the methods used here. Some of the variability, however, cannot be overcome unless we have better molecular classifications of breast cancer, such that fairly uniform cohorts are used to identify prognostic signatures. We also propose to test the Cox-ranked signature in a phase 2 prognostic study, which assesses its potential predictive power in addition to standard prognostic factors. An alternative approach is to agree on a ‘consensus’ set of core prognostic genes (e.g. including the 29 common genes identified here), to be validated in larger cohorts using quantitative RT–PCR or custom-made microarrays. In any case much work needs to be performed before application of expression microarrays for treatment randomization can be contemplated.

Materials and methods

Tumor samples and RNA isolation

One hundred and sixty frozen tumors were retrieved (with sample availability being the only criterion) from those collected at Nottingham City Hospital NHS Trust between 1986 and 1992. The long follow-up and conservative use of adjuvant therapy (see below) made this cohort ideal for a prognostic study. All patients had standard breast surgery, 40% (38 of 93) of ER+ cases received adjuvant tamoxifen and only six patients in total received adjuvant Cyclophosphamide, Methotrexate, 5-FU (CMF) chemotherapy. Clinical follow-up data were retrieved from the Nottingham Tenovus Primary Breast Cancer Series database (Supplementary Table 6, Supplementary Excel Files 2 and 3). This study was approved by the Nottingham City Hospital NHS Trust local research ethics committee.

From each frozen tumor, 30 sections of 30 μm thickness were obtained and homogenized in 2 ml of TRI-reagent (Sigma, Dorset, UK). Samples used had a median tumor cellularity of 60%. Total RNA was extracted following manufacturer's recommendation and purified by RNeasy mini-kit (Qiagen Ltd, West Sussex, UK). Samples were on-column treated with RNase-free DNase set (Qiagen Ltd). RNA integrity and genomic DNA contamination were tested using an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). A total of 135 samples had total RNA of sufficient quality for subsequent processing. All experiments were performed in a randomized design to avoid confounding variables related to batch processing.

RNA amplification and labelling

RNA amplification was carried out from 5 μg of starting total RNA using the in vitro transcription method and purification steps as described (Naderi et al., 2004). Indirect labelling using Cy3 and Cy5 Mono-Reactive dyes (Amersham Biosciences, Bucks, UK) was performed as described (Naderi et al., 2005). Reference Cy3- and Cy5-cRNA pools were generated by mixing the labelled targets from 50 tumor samples chosen at random (common reference pool).

Hybridization and scanning

Oligonucleotide microarrays containing 22 575 features (19 061 genes and 3514 control spots) were used (Agilent Human 1A 60-mer Oligo Microarray, Agilent Technologies). For hybridizations, 1.5 μg of Cy3- or Cy5-labelled sample cRNA and 1.5 μg of reciprocal labelled reference cRNA were co-hybridized. We performed paired dye-reversal hybridizations in 127 samples, and one-way hybridization in eight samples owing to limited amount of RNA. In addition, we carried out a second set of 45 biological replicate hybridizations in 24 samples (Supplementary Excel File 4). Hybridization and washing were performed following the manufacturer's instructions. Scanning was carried out using an Agilent G2565BA scanner at 100% Photomultiplier Tubes (PMT) for all slides. The expression data is deposited at MIAMExpress at the EBI with accession number E-UCon-1.

Data analysis

Supplementary Information has a detailed description of the data analysis methodology used.

Normalization of microarray data

Feature extraction, normalization of the raw data and data filtering were performed using the Agilent G2567AA Feature Extraction software (Agilent Technologies) and Spotfire DecisionSite 8.0 (Somerville, MA, USA). This resulted in a normalized matrix of 5257 genes.

Survival analysis

Survival analysis was carried out on the matrix of 5257 genes. Before analysis, genes were normalized to have zero mean and unit s.d. across samples. As an external validation of a prognostic classifier provides a far more stringent test of a classifier's general validity than any internal validation test, we decided to derive a prognostic classifier using our entire data set. We also performed multiple internal cross-validations but these were only performed to check that an optimal prognostic classifier derived from a training set would also be prognostic in a test set. Other reasons for separating the process of deriving an optimal classifier from the internal cross-validation are given in Supplementary Information and relate to the properties of the implemented methodology. Prognostic classifiers were derived using a semisupervised method, called ‘Cox-clustering’, which has been shown to outperform supervised methods like the one used by van‘t Veer (Bair and Tibshirani, 2004). This method uses unsupervised clustering over genes selected in a supervised manner (Cox proportional hazards model) and allows unbiased identification of subgroups of patients with different outcome.
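The per-gene normalization to zero mean and unit s.d. across samples can be sketched as follows. The text does not say whether the sample (n−1) or population (n) s.d. was used; this sketch assumes the sample s.d., and the function names are hypothetical.

```python
import statistics


def standardize_gene(values):
    """Scale one gene's expression values to zero mean and unit s.d. across samples."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample s.d. (n-1 denominator), an assumption
    if sd == 0:
        raise ValueError("gene has constant expression and cannot be standardized")
    return [(v - mean) / sd for v in values]


def standardize_matrix(matrix):
    """Standardize each row (gene) of a genes-by-samples matrix."""
    return [standardize_gene(row) for row in matrix]


# Hypothetical expression values for one gene across three samples.
z = standardize_gene([1.0, 2.0, 3.0])
```

Standardizing each gene puts all genes on a common scale, so that no single high-variance gene dominates the subsequent clustering.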

External validation: The derived classifier (‘Cox-ranked’ signature) was then tested on two external independent data sets, of 295 and 285 samples, respectively (van de Vijver et al., 2002; Wang et al., 2005), using two different prognostic classification rules. First, for those genes present on the external arrays we normalized the expression profile to zero mean and unit variance. Genes across platforms were matched by either UniGene Symbol or Genbank accession number. Multiple matches were avoided by averaging the profiles for the replicate probes on the arrays. For an external sample with normalized gene-expression vector x, we then computed a continuous prognostic index, PI(x), defined by PI(x)=cor(x,c2)−cor(x,c1), where c2 and c1 denote the centroids of the poor and good outcome clusters as given by the Cox-ranked classifier. One rule then classified external samples into good or bad prognostic groups using the nearest centroid criterion (poor prognosis if PI(x)>0 and good prognosis if PI(x)<0) (Tibshirani et al., 2002), whereas the other rule rank-ordered external samples relative to each other according to the continuous PI-values. The predicted binary classification derived with the nearest centroid classification rule was evaluated using the hazard ratio obtained from a Cox-regression (Cox and Oakes, 1984). The predicted risk ordering of the external samples was evaluated using the D-index (Royston and Sauerbrei, 2004), and significance was tested using both the P-value from the log-rank test and the one derived from a randomized permutation approach (we used over 1000 randomizations).
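The prognostic index and the nearest centroid rule can be written out as a short sketch. It assumes that cor denotes the Pearson correlation, which the text does not state explicitly, and it assigns the unspecified case PI(x)=0 to the good group. The function names and the toy centroids are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prognostic_index(x, poor_centroid, good_centroid):
    """PI(x) = cor(x, c2) - cor(x, c1), with c2 = poor and c1 = good centroid."""
    return pearson(x, poor_centroid) - pearson(x, good_centroid)


def classify(x, poor_centroid, good_centroid):
    """Nearest centroid rule: poor prognosis if PI(x) > 0, otherwise good."""
    # PI(x) == 0 is not covered by the rule in the text; this sketch calls it "good".
    return "poor" if prognostic_index(x, poor_centroid, good_centroid) > 0 else "good"


# Hypothetical centroids over four genes and one hypothetical external sample.
c2 = [1.0, 0.5, -1.0, -0.5]   # poor outcome centroid
c1 = [-1.0, -0.5, 1.0, 0.5]   # good outcome centroid
sample = [0.8, 0.7, -1.2, -0.1]
label = classify(sample, c2, c1)
```

The same PI values serve both rules in the text: their sign gives the binary class, and their rank order across samples gives the risk ordering evaluated by the D-index.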

Common prognostic gene module: This was derived by selecting genes from the external data sets overlapping with the top 200 Cox-ranked genes in our cohort and identifying those that were significantly associated (log-rank test, P<0.05) with survival. Significance of overlap was established by first estimating the positive rates in the external sets (psig1 and psig2 for Vijver and Wang's studies, respectively), that is, we computed the proportion of genes on the external arrays that had P-values less than 0.05. Next, we modeled the number of overlapping genes under the null hypothesis as the sum of 200 binomial random variables, where each of the 200 genes has a probability psig1*psig2 of being called significant in both external studies. Thus, this model gave us an analytical distribution for the number of overlapping genes under the null hypothesis, and a corresponding P-value for the observed number of overlapping genes. As the model assumes independence of genes, we also obtained an upper bound estimate for the P-value by assuming that all genes were perfectly correlated, that is, this is essentially tantamount to modeling the number of overlapping genes as a single binomial random variable.
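The null model for the overlap can be sketched numerically. Under independence, the number of overlapping genes follows a binomial distribution with 200 trials and success probability psig1*psig2, and the P-value is the upper tail at the observed overlap of 29. The positive rates used below are hypothetical placeholders, since the estimated values are not given in the text. The upper bound is computed under one reading of the perfectly correlated case, namely a single trial with probability psig1*psig2.

```python
import math


def binomial_tail(k_observed, n_genes, p_both):
    """P(X >= k_observed) for X ~ Binomial(n_genes, p_both)."""
    return sum(
        math.comb(n_genes, k) * p_both ** k * (1 - p_both) ** (n_genes - k)
        for k in range(k_observed, n_genes + 1)
    )


# Hypothetical positive rates (proportion of genes with P < 0.05 on each array).
psig1 = 0.15  # van de Vijver's data set, placeholder value
psig2 = 0.12  # Wang's data set, placeholder value
p_both = psig1 * psig2

# Independent genes: tail probability of seeing at least 29 overlapping genes.
p_independent = binomial_tail(29, 200, p_both)

# Perfectly correlated genes (assumed reading): all genes behave as a single
# trial, so the chance of an overlap is simply p_both.
p_upper_bound = p_both
```

The two values bracket the P-value between the fully independent and fully dependent extremes, which is how a range rather than a single number arises.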

Software packages used: Univariate and multivariate Cox-regressions, Kaplan–Meier analysis and HR computations were carried out with the survival package (R package version 2.16), using the R language and environment for statistical computing. Clustering used the R package cluster (V.1.10.2). GO analysis was performed using the EASE software package and the GO Tree Machine. Survival estimation with the Adjuvant! software was performed using the online version 7.0.


  1. Bair E, Tibshirani R. (2004). Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology 2: 503–511.

  2. Brenton JD, Carey LA, Ahmed AA, Caldas C. (2005). Molecular classification and molecular forecasting of breast cancer: ready for clinical application? J Clin Oncol 23: 7350–7360.

  3. Chang HY, Nuyten DSA, Sneddon JB, Hastie T, Tibshirani R, Sorlie T et al. (2005). Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci USA 102: 3738–3743.

  4. Cox DR, Oakes D . (1984). Analysis of Survival Data. Chapman and Hall: London.

  5. Dai H, van’t Veer L, Lamb J, He YD, Mao M, Fine BM et al. (2005). A cell proliferation signature is a marker of extremely poor outcome in a subpopulation of breast cancer patients. Cancer Res 65: 4059–4066.

  6. Degot S, Le Hir H, Alpy F, Kedinger V, Stoll I, Wendling C et al. (2004). Association of the breast cancer protein MLN51 with the exon junction complex via its speckle localizer and RNA binding module. J Biol Chem 279: 33702–33715.

  7. Eden P, Ritz C, Rose C, Ferno M, Peterson C . (2004). ‘Good Old’ clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers. Eur J Cancer 40: 1837–1841.

  8. Ein-Dor L, Kela I, Getz G, Givol D, Domany E . (2005). Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 21: 171–178.

  9. Galea MH, Blamey RW, Elston CE, Ellis IO . (1992). The Nottingham Prognostic Index in primary breast cancer. Breast Cancer Res Treat 22: 207–219.

  10. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC et al. (2005). Multiple-laboratory comparison of microarray platforms. Nat Methods 2: 345–350.

  11. Michiels S, Koscielny S, Hill C . (2005). Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 365: 488–492.

  12. Naderi A, Ahmed AA, Barbosa-Morais NL, Aparicio S, Brenton JD, Caldas C . (2004). Expression microarray reproducibility is improved by optimising purification steps in RNA amplification and labelling. BMC Genomics 5: 9.

  13. Naderi A, Ahmed AA, Wang Y, Brenton JD, Caldas C . (2005). Optimal amounts of fluorescent dye improve expression results in tumor specimens. Mol Biotechnol 30: 151–154.

  14. Olivotto IA, Bajdik CD, Ravdin PM, Speers CH, Coldman AJ, Norris BD et al. (2005). Population-based validation of the prognostic model ADJUVANT! for early breast cancer. J Clin Oncol 23: 2716–2725.

  15. Paik S, Shak S, Tang G, Kim F, Baker J, Cronin M et al. (2004). A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 351: 2817–2826.

  16. Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P et al. (2005). Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res 7: R953–R964.

  17. Royston P, Sauerbrei W . (2004). A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Stat Med 23: 723–748.

  18. Shen R, Ghosh D, Chinnaiyan AM . (2004). Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics 5: 94.

  19. Sollich P, Krogh A . (1996). Learning with ensembles: how over-fitting can be useful. In: Touretzky DS, Mozer MC, Hasselmo ME (eds). Advances in Neural Information Processing Systems. MIT Press: Cambridge, MA, vol. 8. pp 190–196.

  20. Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A et al. (2003). Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci USA 100: 10393–10398.

  21. Tan PK, Downey TJ, Spitznagel EL Jr, Xu P, Fu D, Dimitrov DS et al. (2003). Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res 31: 5676–5684.

  22. Tibshirani R, Hastie T, Narasimhan B, Chu G . (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 99: 6567–6572.

  23. van de Vijver MJ, He YD, van’t Veer L, Dai H, Hart AAM, Voskuil DW et al. (2002). A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347: 1999–2009.

  24. van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530–536.

  25. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F et al. (2005). Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365: 671–679.



Research in the Cancer Genomics Program is funded by grants from Cancer Research UK, Cambridge MIT Institute and Isaac Newton Trust. JDB is a CR-UK Senior Clinical Research Fellow. We thank Dr Patrick Royston for advice on survival analysis and Ms Claire Paish, Nottingham City Hospital for collecting the tissue samples. NLB-M is the recipient of a Praxis XXI doctoral fellowship from FCT, Ministry of Science, Portugal.

Author information



Corresponding authors

Correspondence to J D Brenton or C Caldas.

Additional information

Supplementary Information accompanies the paper on the Oncogene website.

About this article

Cite this article

Naderi, A., Teschendorff, A., Barbosa-Morais, N. et al. A gene-expression signature to predict survival in breast cancer across independent data sets. Oncogene 26, 1507–1516 (2007).


  • breast cancer
  • microarray
  • prognosis
  • gene-signature
  • survival
  • Cox-clustering
