Introduction

Multi-parametric molecular tests are central to the treatment management of early breast cancer and their use is incorporated into most major guidelines1 as a pre-requisite for the staging of breast cancer patients, to direct prognostication and to select patients for chemotherapy treatment2,3. Two major challenges related to their use need to be addressed. Firstly, reports highlighting disagreements between tests are disquieting for physicians, health care providers, and patients alike4 since they raise the question “have I recommended/received the right test?” Secondly, the lack of consistency at an individual patient level between different tests suggests additional prognostic information may result from novel tests. Recent results from the MINDACT and TAILORx studies validate the utility of tests to direct chemotherapy use in node-negative patients2,5,6, which may be extended as new evidence emerges from retrospective3 or prospective studies7,8. In this context an error in assigning appropriate risk classifications would have significant impact on patient treatment and outcomes. Additionally, given recent evidence documenting the long-term risk of relapse for ER+ve breast cancer and the increasing use of extended endocrine therapy9 the selection of the appropriate test to detect recurrence risk over extended time periods is also critical.

Reports of disagreements between tests, based on in silico analyses of existing expression array data, were frequently attributed to methodological challenges and incomplete gene coverage10,11,12,13,14. However, recent direct comparisons, in which tests were performed exactly to vendor protocols, demonstrate marked disagreement in risk categorization and subtyping of individual tumors between widely used multiparameter assays4. Furthermore, comparisons between tests in clinical trial-derived cohorts provide consistent evidence that combining test results generally improves prognostic value15,16. These results may reflect the relatively modest performance of individual multiparametric tests17.

To date, no direct comparison between different multiparameter assays in a large patient cohort with associated follow-up provides robust information on the impact of discrepant test results for patients. We developed a method to compare signatures using a combined quantitative mRNA array covering key molecular signatures17, trained against the results of the same signatures measured by original methodology18. We analyzed >3000 samples from the TEAM pathology cohort19 using “trained” signatures to demonstrate the impact of disagreements between tests on patient outcome in the context of a recent clinical trial cohort.

Results

Comparing signature-trained risk scores—Likelihood ratios

We compared the ability of trained signatures to predict distant metastasis free survival at 10 years (DMFS10) using the likelihood ratio χ2 (LRχ2) based on the Cox models as a measure of the overall prognostic information provided by each model. We illustrated the performance of each “trained” test using Kaplan–Meier survival curves and estimated hazard ratios as described in Methods (see Fig. 1). We calculated the change in LRχ2 values (ΔLRχ2) between the reclassified and single signature models to assess the prognostic improvement gained by reclassification with a second signature versus the single signature, using existing trinary and binary cut points (Table 1) as outlined in Methods.

Fig. 1: Test performance in ER+ve, HER2-ve breast cancer from the TEAM cohort.

Kaplan–Meier survival curves with log-rank hazard ratios for cases of ER+ve, HER2−ve breast cancer from the entire TEAM cohort for Oncotype-trained (Panel a), Prosigna-trained (Panel b), and MammaPrint-trained results (Panel c) and for ER+ve, HER2−ve node-negative breast cancers treated without chemotherapy from the TEAM cohort for Oncotype-trained (Panel d), Prosigna-trained (Panel e), and MammaPrint-trained results (Panel f). Log-rank P values for each test are in brackets. Within each panel low (green), moderate (blue) and high (red) risk survival curves are plotted with log-rank hazard ratios for high risk and intermediate risk (Oncotype-trained and Prosigna-trained only) calculated against low risk cases in each sub-group. 95% confidence intervals for log-rank hazard ratios are in brackets. For each group the numbers at risk (low, moderate, high) are presented under the X axis.

Table 1 Likelihood χ2 ratios by test and cohort.

In ER+/HER2− cases (n = 3284), the Prosigna-trained signature provided greater prognostic information than the Oncotype-trained and MammaPrint-trained signatures (LRχ2 = 146.9 vs. 118.0 and 119.5, respectively; Table 1). In bivariate models (combining two tests) the greatest LRχ2 was observed with Oncotype-trained and Prosigna-trained results (Table 1). Comparing bivariate with univariate results, combining Oncotype-trained and Prosigna-trained results increased the LRχ2 to a far greater extent versus Oncotype-trained results (ΔLRχ2 = 60.0) than versus Prosigna-trained results (ΔLRχ2 = 31.0). Similarly, when combining tests with MammaPrint-trained results, adding Prosigna-trained results showed a greater increase in LRχ2 (ΔLRχ2 = 49.3) than did combining MammaPrint-trained results with Oncotype-trained results (ΔLRχ2 = 26.3). Adding MammaPrint-trained results to either Oncotype-trained or Prosigna-trained results produced the smallest improvements in the LRχ2 versus either test alone (Table 1). Nonetheless, all test combinations outperformed single tests to a highly statistically significant degree (p < 0.0001; Table 1).

When Oncotype-trained and Prosigna-trained results were dichotomized, there were less marked differences in univariate models between these tests and MammaPrint-trained results (Table 1). Again, the largest increase in LRχ2 was observed when comparing combined Oncotype-trained and Prosigna-trained classification versus Oncotype-trained alone. All other bivariate models outperformed univariate models to a lesser, but still statistically significant, degree (p < 0.0001; Table 1).
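The significance of a ΔLRχ2 value can be checked by referring it to a χ2 distribution whose degrees of freedom equal the number of parameters that the second test adds to the Cox model. The following is a minimal sketch in Python, not the analysis code used in this study. The function name chi2_sf is hypothetical, and the choice of 2 degrees of freedom (a three-level second test coded as two indicator variables) is an assumption made for illustration only.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite sum of half-integer terms
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


# ΔLRχ2 values quoted in the text for the combined
# Oncotype-trained + Prosigna-trained model.
delta_vs_oncotype = 60.0  # combined model versus Oncotype-trained alone
delta_vs_prosigna = 31.0  # combined model versus Prosigna-trained alone

# Assumption (not stated in the source): the added trinary test
# contributes two parameters to the Cox model.
ASSUMED_DF = 2

p_vs_oncotype = chi2_sf(delta_vs_oncotype, ASSUMED_DF)
p_vs_prosigna = chi2_sf(delta_vs_prosigna, ASSUMED_DF)
print(p_vs_oncotype < 0.0001, p_vs_prosigna < 0.0001)
```

Under this assumed number of degrees of freedom, both improvements fall below the p < 0.0001 threshold reported for Table 1. A different coding of the second test would change the degrees of freedom and therefore the exact p value.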

Analysis of test performance by outcome in reclassified patients

We analyzed agreement between tests by investigating the extent to which re-classifying results for individual patients by performing tests in sequence affected predicted outcome. For example, we estimated the effects of performing a Prosigna-trained test on tumors previously classified as intermediate risk by the Oncotype-trained test.
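Sequential re-classification of this kind amounts to cross-tabulating the risk category assigned by a first test against the category assigned by a second test, and then summarizing outcome within each cell. A minimal sketch in Python follows. The helper names and the six toy cases are hypothetical; they are not data from the TEAM cohort and this is not the code used for the analysis.

```python
from collections import Counter


def reclassification_table(first, second):
    """Count cases in each (first-test category, second-test category) cell."""
    if len(first) != len(second):
        raise ValueError("both tests must be reported for the same cases")
    return Counter(zip(first, second))


def proportion_reclassified(first, second):
    """Fraction of cases whose second-test category differs from the first."""
    if not first:
        raise ValueError("at least one case is required")
    table = reclassification_table(first, second)
    changed = sum(n for (a, b), n in table.items() if a != b)
    return changed / len(first)


# Hypothetical toy data: one entry per tumor, same order in both lists.
first_test = ["low", "low", "intermediate", "intermediate", "high", "low"]
second_test = ["low", "high", "low", "high", "high", "low"]

table = reclassification_table(first_test, second_test)
print(table[("intermediate", "high")])  # intermediate by test 1, high by test 2
print(proportion_reclassified(first_test, second_test))
```

Each cell of the table corresponds to one row of a forest plot of the type shown in Figs. 2 to 7, where the outcome of the discordant cell is compared with that of the concordant cell.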

Entire ER+ve/HER2−ve population

Oncotype-trained

Of 3284 ER+ve/HER2−ve breast cancers with results for the Oncotype-trained risk classification, 48.9% were classified low risk (DMFS10 = 87.9%), 35.8% intermediate risk (DMFS10 = 78.6%) and 15.3% high risk (DMFS10 = 67.5%) (Table 2; Figs. 1a, 2).

Table 2 Oncotype-trained results stratified by other test results, trinary classification.
Fig. 2: Forest plot of Oncotype-trained test results re-stratified by other tests, all ER+ve/HER2−ve cases.

DMFS10 = distant metastasis free survival at 10 years post diagnosis. (95% CI) = 95% confidence interval, P value = p value, N = number of cases in each subgroup, % = percentage of cases within each risk strata. X axis = percent distant metastasis free survival. Open boxes represent primary test DMFS10 by risk group. Solid boxes represent sub-stratification by secondary tests with 95% confidence intervals (bars). Top panel (a) oncotype-trained low risk cases stratified by prosigna-trained and Mammaprint-trained results. Middle panel (b) oncotype-trained moderate risk group. Bottom panel (c) oncotype-trained high risk group.

Oncotype-trained stratified by Prosigna-trained

When Oncotype-trained results were further stratified by Prosigna-trained results a significant proportion (56.5%) of cases changed risk category (Supplementary Table 2). In Oncotype-trained low-risk cases, 279 (17.4%) were re-classified as high risk by Prosigna-trained results and 9 Oncotype-trained high-risk cases (1.8%) were re-classified as low risk by Prosigna-trained results. Oncotype-trained low risk/Prosigna-trained high-risk cases exhibited a significantly reduced DMFS10 (75.4%) relative to cases low risk by both signatures (HR = 3.19; 95%CI 2.12–4.82; p < 0.001; Table 2; Fig. 2). For Oncotype-trained intermediate-risk cases, 174 (14.8%) were classified as Prosigna-trained low risk with a DMFS10 = 91.5% (p < 0.001; Table 2; Fig. 2), and 618 (52.6%) were classified as Prosigna-trained high risk (DMFS10 = 73.3%; Table 2; Fig. 2). Few Oncotype-trained high-risk tumors were low risk by Prosigna-trained scores and no events were observed in these cases.

Oncotype-trained stratified by MammaPrint-trained

Of Oncotype-trained low-risk cases, 124 (8%) were high risk by MammaPrint-trained results (DMFS10 = 72.1%; p < 0.001; Table 2; Fig. 2). Of Oncotype-trained high-risk cases, 52 (10%) were low risk by MammaPrint-trained results (DMFS10 = 70.4%; p = 0.465; Table 2; Fig. 2). Finally, 528 (45%) Oncotype-trained intermediate-risk cases were MammaPrint-trained high risk (DMFS10 = 73.2%; p < 0.001; Table 2; Fig. 2).

Prosigna-trained results

Of 3284 ER+ve/HER2−ve cases with results for Prosigna-trained risk available 25.2% were low risk (DMFS10 = 92.1%, 95%CI 89.8–94.0%), 35.2% intermediate risk (DMFS10 = 84.9%, 95%CI 82.3–87.1%) and 39.7% high risk (DMFS10 = 71.4%, 95%CI 68.6–74.1%; Table 3; Figs. 1b, 3).

Table 3 Prosigna-trained results stratified by other test results, trinary classification.
Fig. 3: Forest plot of Prosigna-trained test results re-stratified by other tests, all ER+ve/HER2-ve cases.

DMFS10 = distant metastasis free survival at 10 years post diagnosis, (95% CI) = 95% confidence interval, P = p value, N = number of cases in each subgroup, % = percentage of cases within each risk strata, X axis = percent distant metastasis free survival. Open boxes represent primary test DMFS10 by risk group. Solid boxes represent sub-stratification by secondary tests with 95% confidence intervals (bars). Top panel (a) prosigna-trained low risk cases stratified by Oncotype-trained and Mammaprint-trained results. Middle panel (b) prosigna-trained moderate-risk group. Bottom panel (c) prosigna-trained high risk group.

Prosigna-trained results stratified by Oncotype-trained results

In Prosigna-trained low-risk cases there were no significant differences in outcome across Oncotype-trained risk groups; all Prosigna-trained low-risk cases experienced DMFS10 > 90% (Table 3; Fig. 3a). Similarly, all Prosigna-trained high-risk cases experienced a DMFS10 ≤ 80%; those that were also Oncotype-trained high risk experienced significantly poorer outcome (DMFS10 = 65.7%, 95%CI 60.4–70.5%, p < 0.001) than those that were low or intermediate risk by Oncotype-trained results (Table 3; Fig. 3c). Of 1155 Prosigna-trained intermediate-risk cases, 685 (59%) were classified low risk by the Oncotype-trained test (DMFS10 = 88.5%; p < 0.001) and 89 (8%) were Oncotype-trained high risk (DMFS10 = 72.6%; p < 0.001; Table 3; Fig. 3b).

Prosigna-trained stratified by MammaPrint-trained

Excluding Prosigna-trained intermediate-risk cases, the majority of results (79.7%) remained in the same risk category (Supplementary Table 2). No stratification of Prosigna-trained low-risk cases occurred using MammaPrint-trained results (Table 3; Fig. 3a). All Prosigna-trained high-risk cases had DMFS10 < 80%; 32% were MammaPrint-trained low risk (Table 3; Fig. 3c). For Prosigna-trained intermediate-risk cases, 18% were MammaPrint-trained high risk (DMFS10 = 79.4%; p = 0.005; Table 3; Fig. 3b).

MammaPrint-trained

Of 3284 ER+ve/HER2−ve breast cancers with MammaPrint-trained risk classification, 66.3% were low risk (DMFS10 = 86.9%) and 33.7% high risk (DMFS10 = 70.7%; Table 4; Figs. 1c, 4).

Table 4 Mammaprint-trained results stratified by other test results, trinary classification.
Fig. 4: Forest plot of Mammaprint-trained test results re-stratified by other tests, all ER+ve/HER2-ve cases.

DMFS10 = distant metastasis free survival at 10 years post diagnosis. (95% CI) = 95% confidence interval, P = p value, N = number of cases in each subgroup, % = percentage of cases within each risk strata, X axis = percent distant metastasis free survival. Open boxes represent primary test DMFS10 by risk group. Solid boxes represent sub-stratification by secondary tests with 95% confidence intervals (bars). Top panel (a) Mammaprint-trained low-risk cases stratified by Oncotype-trained and Prosigna-trained results. Bottom panel (b) Mammaprint-trained high-risk group.

MammaPrint-trained stratified by Oncotype-trained

Of 2180 MammaPrint-trained low-risk cases, 68% were low risk by Oncotype-trained results (DMFS10 = 89.1%; Table 4; Fig. 4a). MammaPrint-trained low-risk, Oncotype-trained intermediate-risk cases (30%) exhibited DMFS10 = 83.2% (p < 0.001; Table 4) and Oncotype-trained high-risk cases exhibited DMFS10 = 70.4% (p < 0.001; Table 4; Fig. 4a). In MammaPrint-trained high-risk cases DMFS10 ranged from 73.2% to 67.3% across Oncotype-trained subgroups and there were marked differences in outcome across Oncotype-trained categories (Table 4; Fig. 4b).

MammaPrint-trained results stratified by Prosigna-trained results

In MammaPrint-trained low-risk cases 20% were Prosigna-trained high risk (DMFS10 = 78.1%; p < 0.001; Table 4) and 43% intermediate risk (DMFS10 = 86.1%; p < 0.001; Table 4; Fig. 4a). Amongst MammaPrint-trained high-risk cases, only a small subgroup (n = 12) of MammaPrint-trained high-risk, Prosigna-trained low-risk results exhibited DMFS10 = 90% (p = 0.006; Fig. 4b).

Sub-group analysis ER+ve/HER2-ve, Node-ve patients not treated with chemotherapy

Oncotype-trained

Of 970 cases in this subgroup, 47.2% were Oncotype-trained low (DMFS10 = 92.5%), 36.0% intermediate (DMFS10 = 86.3%) and 16.8% high risk (DMFS10 = 76.7%, Table 2; Figs. 1d; 5) respectively.

Fig. 5: Forest plot of Oncotype-trained test results re-stratified by other tests, Node-ve ER+ve/HER2-ve cases treated without chemotherapy.

DMFS10 = distant metastasis free survival at 10 years post diagnosis. (95% CI) = 95% confidence interval, P = p value, N = number of cases in each subgroup, % = percentage of cases within each risk strata. X axis = percent distant metastasis free survival. Open boxes represent primary test DMFS10 by risk group. Solid boxes represent sub-stratification by secondary tests with 95% confidence intervals (bars). Top panel (a) Oncotype-trained low-risk cases stratified by Prosigna-trained and Mammaprint-trained results. Middle panel (b) Oncotype-trained moderate risk group. Bottom panel (c) Oncotype-trained high-risk group.

Oncotype-trained results stratified by Prosigna-trained results

When Oncotype-trained results were stratified by Prosigna-trained results, 57.3% changed risk category (Supplementary Table 3). In Oncotype-trained low-risk cases, 95 (21%) were Prosigna-trained high risk with DMFS10 = 83.8% (p = 0.006; Table 2; Fig. 5). In Oncotype-trained intermediate-risk cases, 12% were Prosigna-trained low risk (DMFS10 = 94.1%; p = 0.090; Table 2; Fig. 5). The 57% of Oncotype-trained intermediate-risk cases classified as Prosigna-trained high risk exhibited DMFS10 = 83.7% (p = 0.076; Table 2; Fig. 5). Only three Oncotype-trained high-risk cases were Prosigna-trained low risk, and no events were observed in these cases.

Oncotype-trained stratified by MammaPrint-trained

Of Oncotype-trained low-risk cases, 11% were MammaPrint-trained high risk (DMFS10 = 80.8%, p = 0.004; Table 2, Fig. 5a). In Oncotype-trained intermediate-risk patients 50% were MammaPrint-trained low risk (DMFS10 = 92.2%, p = 0.002; Table 2, Fig. 5b). In Oncotype-trained high-risk cases 11% were MammaPrint-trained low risk; no events were observed in these 18 cases (Table 2, Fig. 5c). MammaPrint-trained scores identified 37.5% of Oncotype-trained intermediate-risk or high-risk cases as low risk (DMFS10 > 90%).

Prosigna-trained stratified by Oncotype-trained

Neither Prosigna-trained low nor moderate risk cases showed statistically significant sub-stratification for outcome by Oncotype-trained risk scores (Table 3, Fig. 6a, b). Within Prosigna-trained high-risk cases 22% were Oncotype-trained low risk, however, DMFS10 for this group was 83.8% (Table 3, Fig. 6c).

Fig. 6: Forest plot of Prosigna-trained test results re-stratified by other tests, Node-ve ER+ve/HER2-ve cases treated without chemotherapy.

DMFS10 = distant metastasis free survival at 10 years post diagnosis. (95% CI) = 95% confidence interval, P = p value, N = number of cases in each subgroup, % = percentage of cases within each risk strata, X axis = percent distant metastasis free survival. Open boxes represent primary test DMFS10 by risk group. Solid boxes represent sub-stratification by secondary tests with 95% confidence intervals (bars). Top panel (a) Prosigna-trained low-risk cases stratified by Oncotype-trained and Mammaprint-trained results. Middle panel (b) Prosigna-trained moderate risk group. Bottom panel (c) Prosigna-trained high risk group.

Prosigna-trained stratified by MammaPrint-trained

No impact of MammaPrint-trained scores was observed in the Prosigna-trained low-risk group (Table 3, Fig. 6a), with only three discordant results. For both moderate and high risk Prosigna-trained results a group of MammaPrint-trained low-risk cases was identified (DMFS10 = 93.1% and 89.6%, respectively; Table 3; Fig. 6b, c).

MammaPrint-trained results

No impact of Oncotype-trained results on MammaPrint-trained scores was observed (Fig. 7; Table 4). In MammaPrint-trained low-risk cases 22% were categorized as Prosigna-trained high risk, with a modest reduction in DMFS10 (89.6%; p = 0.027; Table 4).

Fig. 7: Forest plot of Mammaprint-trained test results re-stratified by other tests, Node-ve ER+ve/HER2−ve cases treated without chemotherapy.

DMFS10 = Distant metastasis free survival at 10 years post diagnosis. (95% CI) = 95% confidence interval, P = p value, N = number of cases in each subgroup, % = percentage of cases within each risk strata, X axis = percent distant metastasis free survival. Open boxes represent primary test DMFS10 by risk group. Solid boxes represent sub-stratification by secondary tests with 95% confidence intervals (bars). Top panel (a) Mammaprint-trained low-risk cases stratified by Oncotype-trained and Prosigna-trained results. Bottom panel (b) Mammaprint-trained high-risk group.

Discussion

Our analysis of 3284 ER+ve/HER2−ve cases using trained signatures demonstrates that the Prosigna-trained signature provides potentially more prognostic information than either the Oncotype-trained or MammaPrint-trained signatures (Table 1). This result is consistent with results in the smaller TransATAC cohort20 using original vendor methodology.

Critical to our study is the close correlation between the computationally derived “signature trained” scores and true results, as shown by us previously18. For ROR-PT results, the correlation coefficient between “trained” and true assay results was 0.93, and comparing true to “trained” results showed 90% of cases within the same risk category (low, intermediate, high; see ref. 18). Similarly, for “Oncotype-Dx trained” results the correlation coefficient between true and “trained” results was 0.87, with 75% of results giving the same risk category (see ref. 18) and only 1% of cases disagreeing by more than one risk category. For MammaPrint-trained results, which were calculated only as categorical high versus low risk groups, over 90% of cases were classified in the same risk group by “trained” and true results18. Full details of these results are reported elsewhere18.

We also show when two trained tests are combined the overall amount of information is always greater than a single test alone. In this study, adding stratification by Prosigna-trained results to Oncotype-trained results provided the greatest LRχ2, and the improvement was greater for this combined model versus Oncotype-trained results alone than for Prosigna-trained results alone. Collectively these results suggest that, in this study, Prosigna-trained results, either alone or combined with other test results, provide potentially greater prognostic information. However, most critically, all test combinations (where two tests were used for patient stratification) outperformed models with only one test to a highly statistically significant degree. This both confirms earlier reports20 and suggests that differences between tests reflect quantitative and qualitative differences in the degree of prognostic information collected. This conclusion is supported by recent comparisons by the ATAC group, showing the impact of different signaling modules in ER+ve/HER2−ve cases21 across different signatures. The conclusion from this work is that different tests capture different aspects of prognostic drivers and therefore that future improvements in prognostic testing remain achievable.

Critically, we dissected the effect of applying a second test to risk-stratified subgroups defined by the initial result; for example, we examined the effect of applying the Prosigna-trained signature to the “intermediate risk” group identified by the Oncotype-trained signature. When combining tests, Prosigna-trained results added value to both Oncotype-trained and MammaPrint-trained results (Table 1). The improved prognostic impact of Prosigna-trained results applied across all ER+ve/HER2−ve cases after Oncotype-trained results was reflected by Prosigna-trained results sub-stratifying patients across both low and intermediate risk Oncotype-trained groups (Fig. 2a, b). Even within the node-negative ER+ve/HER2−ve population not treated with chemotherapy (Table 2; Fig. 5a, b), Oncotype-trained low and intermediate-risk groups were further stratified by Prosigna-trained results, and 20.7% of Oncotype-trained low-risk cases were identified as high risk by Prosigna-trained results, with DMFS10 of 83.8%. This is important because results from prospective trials suggest these cases may benefit from chemotherapy2,6. This difference was more striking when Oncotype-trained results were dichotomized using cut-points applied in the TAILORx trial. In ER+ve/HER2−ve, node-negative patients treated without chemotherapy, 17–24% of cases with Oncotype-trained results ≥25 were low risk (DMFS10 > 90%) when stratified by MammaPrint-trained or Prosigna-trained results, respectively (Supplementary Table 4; Supplementary Fig. 2). Conversely, 18–30% of Oncotype-trained low risk cases (<25) were high risk when stratified by MammaPrint-trained or Prosigna-trained results and exhibited DMFS10 < 90% (Supplementary Table 4; Supplementary Fig. 2).

Conversely, only in Prosigna-trained intermediate risk cases did Oncotype-trained results provide additional stratification by risk (Fig. 3; Table 3). However this stratification was not observed in the sub-group of node negative cases treated without chemotherapy (Fig. 6). No stratification of Prosigna-trained low or high risk cases was observed using either Oncotype-trained or Mammaprint trained results (Fig. 3; Table 3). When using dichotomized risk scores for Prosigna-trained ER+ve/HER2−ve node-negative cases treated without chemotherapy no further stratification using dichotomized Oncotype-trained results was seen (Supplementary Table 5; Supplementary Fig. 5) and all Prosigna-high risk cases exhibited DMFS10 < 85% regardless of dichotomized Oncotype-trained results (Supplementary Table 5; Supplementary Fig. 5). These results are illustrative of and highlight the potential clinical impact of disagreements between tests at an individual patient level previously demonstrated in the OPTIMA-prelim cohort4.

A number of conclusions can be drawn from our analyses. Firstly, as with previous analyses20, there is additional prognostic value to be gained from combining multiple molecular tests in the research setting. The corollary is that no single existing assay captures the sum of prognostic information available at the transcriptomic level. This confirms earlier findings22 that improvements in prognostic assays remain possible. Such improvements may, however, require integration of additional molecular features beyond transcriptomics23,24. Secondly, there was evidence, albeit from sub-group analyses, that the known interaction between clinical risk, treatment, and molecular risk profiling may differ depending on the test chosen. If taken at face value, this might provide support for the use of different testing strategies in different patient risk strata.

Our analysis has some potentially important limitations. In particular, we have used a computational approach to generate test scores for the different tests described herein. At an individual tumor level, the trained score may not be identical to the equivalent generated using original methodology. We trained our signatures in an independent cohort using the same signatures measured using original methodology18, achieving extremely high correlations with commercial test results. Additionally, the broad agreement between our analysis and the (more limited) analysis of Sestak et al.20, using original methodology and a slightly different statistical approach, is highly reassuring.

Additionally, although our cohort is exclusively postmenopausal and ER-positive, 30% of cases were treated with adjuvant chemotherapy. All patients in the TEAM trial were postmenopausal, with a median age of 64 years; results presented here may therefore not be representative of the premenopausal population. We included chemotherapy-treated patients to maximize the power of our main analysis. However, the conclusions of our analysis performed on the node-negative subgroup who were not chemotherapy-treated are broadly similar to those in the analysis of the entire cohort, suggesting that these findings are robust both in this clinically critical node-negative sub-group and indeed across all patients in the TEAM cohort.

The goal of our study was to provide robust information on the impact of discordant risk classification by different molecular prognostic signatures in postmenopausal, ER+ve early breast cancer. Existing evidence highlights discordance between tests4,25, which is reiterated here. There is clear evidence that adding clinical information to test results provides additional prognostic information15,26,27,28,29, which is supported by sub-group analyses performed here, and that information provided by any individual assay is relatively modest17. To date comparisons between tests have been limited either by relatively small sample sizes or by a lack of evidence that signatures extracted from global expression data reflect actual test performance and can therefore inform patients and clinicians on the impact of discordant test results on outcome in the real-world setting. This study provides data on a large clinical trial cohort (the TEAM trial) using test signatures trained in a second cohort (OPTIMA-prelim4) to match actual commercial test performance.

In summary, our study provides novel evidence for the potential clinical impact of discordant molecular test results in a large population. Further improvements in test performance are potentially within reach and would be of benefit to patients. Evidence presented here suggests the differences in test performance are more nuanced than previously reported and that careful consideration to test selection, in the context of treatment and clinical risk may be appropriate.

Methods

Study design

Our primary analyses explored the impact of signature-trained prognostic scores, categorized in accordance with published cut-points for each assay, for patients with centrally confirmed estrogen receptor positive (ER+ve) HER2 negative (HER2−ve) disease30,31,32. HER2 positive (HER2+ve) cases were excluded since during recruitment of the TEAM trial HER2 targeted therapies were not used in this setting. We performed a secondary analysis using dichotomized scores for Oncotype Dx and Prosigna to reflect the results of the TailorX study. We also report a complete cohort analysis, including HER2+ve cases (see Supplementary Information), since no assay used was trained on samples treated with HER2-targeted therapies. Supplementary analyses further sub-divide patient groups into node negative cases treated with endocrine therapy (but not chemotherapy), node positive cases treated with endocrine therapy (but not chemotherapy) and cases treated with chemotherapy and endocrine therapy (both node negative and node positive, supplementary methods, data and figures).

Patient samples

Patient samples were derived from the Tamoxifen Exemestane Adjuvant Multicenter (TEAM) Trial pathology study (Supplementary Table 1; NCT00279448/NCT0032126/NCT0036270, NTR267, UMIN C000000057)19,33 and included only hormone receptor positive, post-menopausal cancers. Patients provided informed consent and this study was approved by the University of Toronto REB (protocol number 29021).

RNA profiling using NanoString

Profiling of all samples was performed using mRNA previously extracted and analyzed using a custom NanoString codeset as described previously22. Five 4 μm formalin-fixed paraffin-embedded (FFPE) sections per case were deparaffinised, tumor areas were macro-dissected and RNA was extracted using the Ambion® Recoverall™ Total Nucleic Acid Isolation Kit-RNA extraction protocol (Life Technologies™, ON, Canada). RNA aliquots were quantified using a Nanodrop-8000 spectrophotometer (Delaware, USA). All 3825 RNAs extracted from the TEAM pathology cohort were successfully assayed. Probes for each gene were designed and synthesized at NanoString® Technologies (Seattle, WA, USA), and 250 ng of RNA for each sample was hybridized, processed and analyzed using the NanoString® nCounter® Analysis System, according to NanoString® Technologies protocols.

Signature-trained Risk Stratification Scores from candidate assays

We compared two different approaches to the generation of simulated risk scores18, and selected a training and validation approach using results obtained from the OPTIMA prelim study4 to fit risk stratification scores generated for this study to those derived from the relevant commercial assay. For all tests, we used the suffix “-trained” to distinguish the computationally derived assay scores from the commercially derived scores, e.g. Oncotype-trained vs. Oncotype-DX™.

Methods for cross comparisons between Tests

Results were available for 3811 subjects. Cases were grouped into the pre-defined risk categories for each test as follows: Oncotype DX (low risk < 18, intermediate risk 18–31 [supplementary methods], high risk ≥ 31); Prosigna ROR-PT (low risk < 41, intermediate risk 41–60, high risk ≥ 61; refs. 3, 20, 34); MammaPrint (low risk and high risk)18. We also performed a dichotomized risk analysis for Oncotype Dx using low/intermediate risk 0–25 and high risk > 25, in line with the TailorX study2, and for Prosigna ROR-PT using low/intermediate risk < 61 and high risk ≥ 61. Grouped analyses were performed as follows: (1) ER+/HER2−ve (n = 3284); and (2) hormone-receptor positive (HR+) regardless of HER2 status (n = 3811). Subjects were considered HR+ve if ER and/or progesterone receptor (PR) was reported as positive33. Differences in distant metastasis free survival (DMFS; i.e. time to first distant recurrence or death, excluding ipsilateral breast cancer recurrences but including distant metastasis, contralateral breast cancer and death from breast cancer) were evaluated using the Kaplan–Meier method, with equality of survivor functions assessed by the log-rank test, and graphs with risk tables were generated. The 10-year survival function, with 95% confidence intervals (95%CI), was calculated as DMFS10. Hazard ratios (HRs) were calculated using Cox proportional hazards regression models, with appropriate adjustments to obtain HRs for each risk level, with low risk set as reference. To assess the prognostic information of each signature, we evaluated the likelihood ratio χ2 (LRχ2) statistics based on the Cox models, and the difference in LRχ2 (ΔLRχ2) was calculated to assess prognostic improvement. All analyses were performed using Stata 14.2 (StataCorp, College Station, TX) and R 4.0.2. Reported p-values were two-sided with p < 0.05 considered statistically significant.
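The cut-points listed above can be written as a small lookup. The sketch below is illustrative Python rather than the Stata or R code used for the analysis, and the function names are hypothetical. It assumes that a score of exactly 31 on the Oncotype scale is treated as high risk, because the stated intermediate (18–31) and high (≥ 31) ranges share that boundary, and that any Prosigna score below 61 that is not low risk is intermediate.

```python
def oncotype_category(score: float) -> str:
    """Trinary Oncotype-trained category: low < 18, high >= 31."""
    if score < 18:
        return "low"
    if score < 31:  # assumption: the shared boundary value 31 is high risk
        return "intermediate"
    return "high"


def prosigna_category(score: float) -> str:
    """Trinary Prosigna-trained (ROR-PT) category: low < 41, high >= 61."""
    if score < 41:
        return "low"
    if score < 61:  # assumption: scores between 60 and 61 are intermediate
        return "intermediate"
    return "high"


def oncotype_dichotomized(score: float) -> str:
    """Binary split stated in the text: 0-25 versus > 25."""
    return "high" if score > 25 else "low/intermediate"


def prosigna_dichotomized(score: float) -> str:
    """Binary split stated in the text: < 61 versus >= 61."""
    return "high" if score >= 61 else "low/intermediate"


# Hypothetical scores for a single tumor, for illustration only.
example = {"oncotype": 22, "prosigna": 64}
print(oncotype_category(example["oncotype"]))
print(prosigna_category(example["prosigna"]))
print(oncotype_dichotomized(example["oncotype"]))
```

For the hypothetical tumor above, the two trinary classifications disagree (intermediate versus high), which is the kind of discordance examined in the re-stratification analyses.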

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.