Comparative survival analysis of multiparametric tests—when molecular tests disagree—A TEAM Pathology study

Multiparametric assays for risk stratification are widely used in the management of both node negative and node positive hormone receptor positive invasive breast cancer. Recent data from multiple sources suggest that different tests may provide different risk estimates at the individual patient level. The TEAM pathology study consists of 3284 postmenopausal ER+ve breast cancers treated with endocrine therapy. Using genes comprising the following multiparametric tests (OncotypeDx®, Prosigna™ and MammaPrint®), signatures were trained to recapitulate true assay results. Patients were then classified into risk groups and survival assessed. Whilst likelihood ratio χ2 values suggested limited value for combining tests, Kaplan–Meier and log-rank tests within risk groups suggested combinations of tests provided statistically significant stratification of potential clinical value. Paradoxically, whilst Prosigna-trained results stratified Oncotype-trained subgroups across low and intermediate risk categories, only intermediate risk Prosigna-trained cases were further stratified by Oncotype-trained results. Both Oncotype-trained and Prosigna-trained results further stratified MammaPrint-trained low risk cases, and MammaPrint-trained results also stratified Oncotype-trained low and intermediate risk groups but not Prosigna-trained results. Comparisons between existing multiparametric tests are challenging, and evidence on discordance between tests in risk stratification presents further dilemmas. Detailed analysis of the TEAM pathology study suggests a complex inter-relationship between test results in the same patient cohorts which requires careful evaluation regarding test utility. Further prognostic improvement appears both desirable and achievable.


INTRODUCTION
Multi-parametric molecular tests are central to the treatment management of early breast cancer and their use is incorporated into most major guidelines 1 as a pre-requisite for the staging of breast cancer patients, to direct prognostication and to select patients for chemotherapy treatment 2,3 . Two major challenges related to their use need to be addressed. Firstly, reports highlighting disagreements between tests are disquieting for physicians, health care providers, and patients alike 4 since they raise the question "have I recommended/received the right test?" Secondly, the lack of consistency at an individual patient level between different tests suggests additional prognostic information may result from novel tests. Recent results from the MINDACT and TAILORx studies validate the utility of tests to direct chemotherapy use in node-negative patients 2,5,6 , which may be extended as new evidence emerges from retrospective 3 or prospective studies 7,8 . In this context an error in assigning appropriate risk classifications would have significant impact on patient treatment and outcomes. Additionally, given recent evidence documenting the long-term risk of relapse for ER+ve breast cancer and the increasing use of extended endocrine therapy 9 the selection of the appropriate test to detect recurrence risk over extended time periods is also critical.
Reports of disagreements between tests, based on in silico analyses of existing expression array data, were frequently attributed to methodological challenges and incomplete gene coverage [10][11][12][13][14] . However, recent direct comparisons, where tests were performed exactly to vendor protocols, demonstrate marked disagreement in risk categorization and subtyping of individual tumors between widely used multiparameter assays 4 . Furthermore, comparisons between tests in clinical trial-derived cohorts provide consistent evidence that combining test results generally improves prognostic value 15,16 . These results may reflect the relatively modest performance of individual multiparametric tests 17 .
To date, no direct comparison between different multiparameter assays in a large patient cohort with associated follow-up provides robust information on the impact of discrepant test results for patients. We developed a method to compare signatures using a combined quantitative mRNA array covering key molecular signatures 17 , trained against the results of the same signatures measured by original methodology 18 . We analyzed >3000 samples from the TEAM pathology cohort 19 using "trained" signatures to demonstrate the impact of disagreements between tests on patient outcome in the context of a recent clinical trial cohort.

RESULTS
Comparing signature-trained risk scores: likelihood ratios
We compared the ability of trained signatures to predict DMFS10 using the likelihood ratio χ2 (LRχ2) based on the Cox models as a measure of the overall prognostic information provided by each model. We illustrated the performance of each "trained" test using Kaplan-Meier survival curves and estimated hazard ratios as described above (see Fig. 1). We calculated the change in LRχ2 values (ΔLRχ2) between the reclassified and single signature models to assess prognostic improvement of reclassification with a second signature versus the single signature using existing trinary and binary (Table 1) cut points as outlined above.
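For readers less familiar with these statistics, the quantities named above can be written out as follows. This is a sketch using the standard textbook definitions for Cox models, where ℓ denotes the partial log-likelihood; that the analysis used exactly this form, and the subscripts "single" and "combined", are assumptions made here for illustration.

```latex
\mathrm{LR}\chi^2 = 2\left[\ell(\hat{\beta}) - \ell(0)\right],
\qquad
\Delta\mathrm{LR}\chi^2
  = \mathrm{LR}\chi^2_{\text{combined}} - \mathrm{LR}\chi^2_{\text{single}}
  = 2\left[\ell(\hat{\beta}_{\text{combined}}) - \ell(\hat{\beta}_{\text{single}})\right]
```

When the single-signature model is nested within the combined model, ΔLRχ2 is conventionally referred to a χ2 distribution with degrees of freedom equal to the number of parameters added by the second signature.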
When test results for Oncotype-trained and Prosigna-trained results were dichotomized, there were less marked differences in univariate models between these tests and MammaPrint-trained results (Table 1). Again the largest increase in LRχ2 was observed when comparing combined Oncotype-trained and Prosigna-trained classification versus Oncotype-trained alone. All other bivariate models outperformed univariate models to a lesser, but still statistically significant, degree (p < 0.0001; Table 1).

DISCUSSION
Our analysis of 3284 ER+ve/HER2−ve cases using trained signatures demonstrates that the Prosigna-trained signature provides potentially more prognostic information than either the Oncotype-trained or MammaPrint-trained signatures (Table 1). This result is consistent with results in the smaller TransATAC cohort 20 using original vendor methodology. Critical to our study is the close correlation between the computationally derived "signature trained" scores and true results as shown by us previously 18 . For ROR-PT results the correlation coefficient between "trained" and true assay results was 0.93, and comparing true to "trained" results showed 90% of cases within the same risk category (low, intermediate, high; see ref. 18 ). Similarly, for "Oncotype-Dx trained" results the correlation coefficient between true and "trained" results was 0.87, with 75% of results giving the same risk category (see ref. 18 ) and only 1% of cases disagreeing by more than 1 risk category. For MammaPrint-trained results, which were calculated only as categorical high versus low risk groups, over 90% of cases were classified in the same risk group by "trained" and true results 18 . Full details of these results are reported elsewhere 18 .
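The category-level agreement figures quoted above (the share of cases in the same risk category, and the share disagreeing by more than one category) can be computed along the following lines. This is a minimal sketch, not the code used in the study: the function name `category_agreement` and the `LEVELS` mapping are hypothetical, and any input lists would be toy data rather than study results.

```python
# Ordinal position of each trinary risk category (hypothetical helper mapping).
LEVELS = {"low": 0, "intermediate": 1, "high": 2}


def category_agreement(true_cats, trained_cats):
    """Return (fraction in the same category,
    fraction differing by more than one category)."""
    if len(true_cats) != len(trained_cats) or not true_cats:
        raise ValueError("inputs must be non-empty and of equal length")
    # Absolute distance between the two categories for each case.
    gaps = [abs(LEVELS[a] - LEVELS[b]) for a, b in zip(true_cats, trained_cats)]
    n = len(gaps)
    same = sum(g == 0 for g in gaps) / n
    far = sum(g > 1 for g in gaps) / n
    return same, far
```

A gap of two can only arise between "low" and "high", so the second value counts cases where the two versions of a test give opposite extremes.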
We also show that when two trained tests are combined, the overall amount of information is always greater than with a single test alone. In this study, adding stratification by Prosigna-trained results to Oncotype-trained results provided the greatest LRχ2, and the improvement was greater for this combined model versus Oncotype-trained results alone than for Prosigna-trained results alone. Collectively these results suggest that, in this study, Prosigna-trained results, either alone or combined with other test results, provide potentially greater prognostic information. However, most critically, all test combinations (where two tests were used for patient stratification) outperformed models with only one test to a highly statistically significant degree. This both confirms earlier reports 20 and suggests that differences between tests reflect quantitative and qualitative differences in the degree of prognostic information collected. This conclusion is supported by recent comparisons by the ATAC group, showing the impact of different signaling modules in ER+ve/HER2−ve cases 21 across different signatures. The conclusion from this work is that different tests capture different aspects of prognostic drivers and therefore that future improvements in prognostic testing remain achievable.
Critically, we dissected the effect of applying a second test to risk-stratified subgroups defined by the initial result; e.g. we examined the effect of applying the Prosigna-trained signature to the "intermediate risk" group identified by the Oncotype-trained signature. When combining tests, Prosigna-trained results added value to both Oncotype-trained and MammaPrint-trained results (Table 1). The improved prognostic impact of Prosigna-trained results applied across all ER+ve/HER2−ve cases after Oncotype-trained results was reflected by Prosigna-trained results sub-stratifying patients across both low and intermediate risk Oncotype-trained groups (Fig. 2a, b). Even within the node negative ER+ve/HER2−ve population not treated with chemotherapy (Table 2; Fig. 5a, b), Oncotype-trained low and intermediate-risk groups were also further stratified by Prosigna-trained results, and 20.7% of Oncotype-trained low-risk cases were identified as high risk by Prosigna-trained results, with DMFS10 of 83.8%, which is important as results from prospective trials suggest these cases may benefit from chemotherapy 2,6 . This difference was more striking when Oncotype-trained results were dichotomized using cut-points applied in the TAILORx trial. In ER+ve/HER2−ve, node negative patients treated without chemotherapy, 17-24% of cases with Oncotype-trained results ≥25 were low risk (DMFS10 > 90%) when stratified by MammaPrint-trained or Prosigna-trained results respectively (Supplementary Table 4; Supplementary Fig. 2). Conversely, 18-30% of Oncotype-trained low risk cases (<25) were high risk when stratified by MammaPrint-trained or Prosigna-trained results and exhibited DMFS < 90% (Supplementary Table 4; Supplementary Fig. 2). In contrast, only in Prosigna-trained intermediate risk cases did Oncotype-trained results provide additional stratification by risk (Fig. 3; Table 3).
However, this stratification was not observed in the sub-group of node negative cases treated without chemotherapy (Fig. 6). No stratification of Prosigna-trained low or high risk cases was observed using either Oncotype-trained or MammaPrint-trained results (Fig. 3; Table 3). When using dichotomized risk scores for Prosigna-trained ER+ve/HER2−ve node-negative cases treated without chemotherapy, no further stratification using dichotomized Oncotype-trained results was seen (Supplementary Table 5; Supplementary Fig. 5) and all Prosigna-high risk cases exhibited DMFS10 < 85% regardless of dichotomized Oncotype-trained results (Supplementary Table 5; Supplementary Fig. 5). These results illustrate the potential clinical impact of disagreements between tests at an individual patient level previously demonstrated in the OPTIMA-prelim cohort 4 .
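The reclassification percentages discussed above (for example, the share of cases in one risk group of a first test that a second test places in a different group) amount to row-wise proportions of a cross-tabulation. The sketch below illustrates that arithmetic only; `cross_tabulate` and `reclassified_fraction` are hypothetical names, and it is not the study's code.

```python
from collections import Counter


def cross_tabulate(first, second):
    """Count cases for every (first-test category, second-test category) pair."""
    return Counter(zip(first, second))


def reclassified_fraction(first, second, first_cat, second_cat):
    """Fraction of cases in `first_cat` by the first test that the
    second test assigns to `second_cat`."""
    table = cross_tabulate(first, second)
    denom = sum(n for (a, _), n in table.items() if a == first_cat)
    if denom == 0:
        raise ValueError(f"no cases in category {first_cat!r}")
    return table[(first_cat, second_cat)] / denom
```

Survival within each cell of the cross-tabulation would then be estimated separately, which is what allows a second test to be judged on whether it separates outcomes inside a single risk group of the first.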
A number of conclusions can be drawn from our analyses. Firstly, as in previous analyses 20 , there is additional prognostic value to be gained from combining multiple molecular tests in the research setting. The corollary is that no single existing assay captures the sum of prognostic information available at the transcriptomic level. This confirms earlier findings 22 that improvements in prognostic assays remain possible. Such improvements may, however, require integration of additional molecular features beyond transcriptomics 23,24 . Secondly, there was evidence, albeit from sub-group analyses, that the known interaction between clinical risk, treatment, and molecular risk profiling may differ depending on the test chosen. If taken at face value, this might provide support for the use of different testing strategies in different patient risk strata.
Our analysis has some potentially important limitations. In particular, we have used a computational approach to generate test scores for the different tests described herein. At an individual tumor level, the trained score may not be identical to the equivalent generated using original methodology. We trained our signatures in an independent cohort using the same signatures measured using original methodology 18 , achieving extremely high correlations with commercial test results. Additionally, the broad agreement between our analysis and the (more limited) analysis of Sestak et al. 20 , which used original methodology and a slightly different statistical approach, is highly reassuring.
Additionally, although our cohort is exclusively postmenopausal ER-positive, 30% of cases were treated with adjuvant chemotherapy. All patients in the TEAM trial were postmenopausal, with a median age of 64 years, so results presented here may not be representative of the premenopausal population. We included chemotherapy-treated patients to maximize the power of our main analysis. However, the conclusions of our analysis performed on the node-negative subgroup who were not chemotherapy-treated are broadly similar to those in the analysis of the entire cohort, suggesting that these findings are robust both in this clinically critical node negative sub-group and indeed across all patients in the TEAM cohort.
The goal of our study was to provide robust information on the impact of discordant risk classification by different molecular prognostic signatures in postmenopausal, ER+ve early breast cancer. Existing evidence highlights discordance between tests 4,25 , which is reiterated here. There is clear evidence that adding clinical information to test results provides additional prognostic information 15,[26][27][28][29] , which is supported by sub-group analyses performed here, and that information provided by any individual assay is relatively modest 17 . To date comparisons between tests have been limited either by relatively small sample sizes or by a lack of evidence that signatures extracted from global expression data reflect actual test performance and can therefore inform patients and clinicians on the impact of discordant test results on outcome in the real-world setting. This study provides data on a large clinical trial cohort (the TEAM trial) using test signatures trained in a second cohort (OPTIMA-prelim 4 ) to match actual commercial test performance.
In summary, our study provides novel evidence for the potential clinical impact of discordant molecular test results in a large population. Further improvements in test performance are potentially within reach and would be of benefit to patients. Evidence presented here suggests the differences in test performance are more nuanced than previously reported and that careful consideration of test selection, in the context of treatment and clinical risk, may be appropriate.

Study design
Our primary analyses explored the impact of signature-trained prognostic scores, categorized in accordance with published cut-points for each assay, for patients with centrally confirmed estrogen receptor positive (ER+ve) HER2 negative (HER2−ve) disease [30][31][32] . HER2 positive (HER2+ve) cases were excluded since during recruitment of the TEAM trial HER2 targeted therapies were not used in this setting. We performed a secondary analysis using dichotomized scores for Oncotype Dx and Prosigna to reflect the results of the TAILORx study. We also report a complete cohort analysis, including HER2+ve cases (see Supplementary Information), since no assay used was trained on samples treated with HER2-targeted therapies. Supplementary analyses further sub-divide patient groups into node negative cases treated with endocrine therapy (but not chemotherapy), node positive cases treated with endocrine therapy (but not chemotherapy) and cases treated with chemotherapy and endocrine therapy (both node negative and node positive; see supplementary methods, data and figures).

Patient samples
Patient samples were derived from the Tamoxifen Exemestane Adjuvant Multicenter (TEAM) Trial pathology study (Supplementary Table 1; NCT00279448/NCT0032126/NCT0036270, NTR267, UMIN C000000057) 19,33 and included only hormone receptor positive, post-menopausal cancers. Patients provided informed consent and this study was approved by the University of Toronto REB (protocol number 29021).
RNA profiling using NanoString
Profiling of all samples was performed using mRNA previously extracted and analyzed using a custom NanoString codeset as described previously 22 . Five 4 μm formalin-fixed paraffin-embedded (FFPE) sections per case were deparaffinised, tumor areas were macro-dissected and RNA extracted using the Ambion® Recoverall™ Total Nucleic Acid Isolation Kit-RNA extraction protocol (Life Technologies™, ON, Canada). RNA aliquots were quantified using a Nanodrop-8000 spectrophotometer (Delaware, USA). All 3825 RNAs extracted from the TEAM pathology cohort were successfully assayed. Probes for each gene were designed and synthesized at NanoString® Technologies (Seattle, WA, USA); and 250 ng of RNA for each sample were hybridized, processed and analyzed using the NanoString® nCounter® Analysis System, according to NanoString® Technologies protocols.

Signature-trained Risk Stratification Scores from candidate assays
We compared two different approaches to the generation of simulated risk scores 18 , and selected a training and validation approach using results obtained from the OPTIMA prelim study 4 to fit risk stratification scores generated for this study to those derived from the relevant commercial assay. For all tests, we used the suffix "-trained" to discriminate the computationally derived assay scores from the commercially derived scores, e.g. Oncotype-trained vs. Oncotype-DX™.

Methods for cross comparisons between Tests
Results were available for 3811 subjects. Cases were grouped into the predefined risk categories for each test as follows: Oncotype DX: low risk < 18, intermediate risk 18-31 (supplementary methods), high risk ≥ 31; Prosigna ROR-PT: low risk < 41, intermediate risk 41-60, high risk ≥ 61 3,20,34 ; MammaPrint: low risk and high risk 18 . We also performed a dichotomized risk analysis for Oncotype Dx using low/intermediate risk 0-25 and high risk > 25, in line with the TAILORx study 2 , and for Prosigna RT using low/intermediate risk < 61 and high risk ≥ 61. Grouped analyses were performed as follows: (1) ER+/HER2−ve (n = 3284); and (2) hormone-receptor positive (HR+) regardless of HER2 status (n = 3811). Subjects were considered HR+ve if ER and/or progesterone receptor (PR) was reported as positive 33 . Differences in distant metastasis free survival (DMFS; i.e. time to first distant recurrence or death, excluding ipsilateral breast cancer recurrences but including distant metastasis, contralateral breast cancer and death from breast cancer) were evaluated using the Kaplan-Meier method, with equality of survivor functions assessed by log-rank test and graphs with risk tables generated. The 10-year survival function with 95% confidence intervals (95% CI) was calculated as DMFS10. Hazard ratios (HRs) were calculated using Cox proportional hazards regression models, with appropriate adjustments to obtain HRs for each risk level, with low risk set as reference. To assess the prognostic information of each signature, we evaluated the likelihood ratio χ2 (LRχ2) statistics based on the Cox models, and the difference in LRχ2 (ΔLRχ2) was calculated to assess prognostic improvement. All analyses were performed using Stata 14.2 (StataCorp, College Station, TX) and R 4.0.2. Reported p-values were two-sided with p < 0.05 considered statistically significant.
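The cut-points and the Kaplan-Meier estimate at a fixed horizon described above can be sketched as follows. This is an illustration in plain Python, not the Stata or R code used for the analyses, and all function names are hypothetical. Because the text lists Oncotype DX intermediate risk as 18-31 and high risk as ≥ 31, the sketch assumes a score of exactly 31 is high risk; it likewise assumes non-integer ROR-PT values between 60 and 61 fall in the intermediate group. Confidence intervals are not computed.

```python
def oncotype_trinary(score):
    # Assumption: a score of exactly 31 is treated as high risk.
    if score < 18:
        return "low"
    if score < 31:
        return "intermediate"
    return "high"


def prosigna_trinary(ror_pt):
    # Low < 41, intermediate 41-60, high >= 61.
    if ror_pt < 41:
        return "low"
    if ror_pt < 61:
        return "intermediate"
    return "high"


def oncotype_binary(score):
    # TAILORx-style dichotomy: 0-25 versus > 25.
    return "high" if score > 25 else "low/intermediate"


def prosigna_binary(ror_pt):
    return "high" if ror_pt >= 61 else "low/intermediate"


def kaplan_meier_at(times, events, horizon):
    """Kaplan-Meier survival estimate at `horizon`.

    times:  follow-up time per case (same unit as `horizon`)
    events: 1 if a DMFS event occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    i = 0
    while i < len(data) and data[i][0] <= horizon:
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all cases sharing the same follow-up time; cases censored
        # at time t still count as at risk at t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
        at_risk -= n_leaving
    return survival
```

With follow-up in years, `kaplan_meier_at(times, events, 10)` applied to the cases in one risk group (or one pair of risk groups from two tests) gives a DMFS10-style point estimate for that group.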

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The data generated and analyzed during this study are described in the following data record: https://doi.org/10.6084/m9.figshare.14617113 35 . The data generated and analyzed as part of this study take the form of 3811 individual NanoString data files (one per sample). These data represent part of a clinical trial and were used under license for the current study; therefore, restrictions apply to their availability. The data are housed in institutional storage at The Ontario Institute for Cancer Research (OICR) and are not publicly available, but can be made available upon request subject to approval from the TEAM steering committee and after appropriate data sharing agreements have been completed. Requests for data access should be directed to the senior author (J.M.S.B.).

CODE AVAILABILITY
The codes that support these findings are subject to patent applications and restrictions related to licenses. Codes are available from the author J.M.S.B. upon reasonable request and with the permission of the Ontario Institute for Cancer Research (OICR).