A systematic review of meta-analyses assessing the validity of tumour response endpoints as surrogates for progression-free or overall survival in cancer

Background: Tumour response endpoints, such as overall response rate (ORR) and complete response (CR), are increasingly used in cancer trials. However, the validity of response-based surrogates is unclear. This systematic review summarises meta-analyses assessing the association between response-based outcomes and overall survival (OS), progression-free survival (PFS) or time-to-progression (TTP).
Methods: Five databases were searched to March 2019. Meta-analyses reporting correlation or regression between response-based outcomes and OS, PFS or TTP were summarised.
Results: The systematic review included 63 studies across 20 cancer types, most commonly non-small cell lung cancer (NSCLC), colorectal cancer (CRC) and breast cancer. The strength of association between ORR or CR and either PFS or OS varied widely between and within studies, with no clear pattern by cancer type. The association between ORR and OS appeared weaker and more variable than that between ORR and PFS, both for associations between absolute endpoints and associations between treatment effects.
Conclusions: This systematic review suggests that response-based endpoints, such as ORR and CR, may not be reliable surrogates for PFS or OS. Where it is necessary to use tumour response to predict treatment effects on survival outcomes, it is important to fully reflect all statistical uncertainty in the surrogate relationship.


BACKGROUND
Decisions about the use of new and existing health technologies should ideally be informed by estimates of treatment effects derived from high-quality randomised controlled trials (RCTs), which measure patient-relevant endpoints over a clinically appropriate timeframe. Such "final" endpoints typically involve the measurement of health benefits, which reflect aspects of the disease, and its treatment, which are important to patients (and potentially also their carers) and which relate to "how the patient feels, functions or survives". 1 In the context of advanced/metastatic cancer, the key matter of concern is often whether the use of a given health technology leads to improvements in overall survival (OS; a final endpoint) compared to existing standard treatments. However, the estimation of treatment effects on OS may be subject to numerous problems, including potential confounding resulting from the use of post-progression treatments, insufficient study follow-up resulting in data immaturity or simply that data on OS have not been collected. In such instances, determining the impact of health technologies becomes more challenging and may rely on the use of surrogate endpoints to substitute for, and predict, a final patient-relevant clinical outcome. 2 Potentially relevant surrogate endpoints vary according to tumour type and site, but commonly include progression-free survival (PFS), time to progression (TTP), and response-based outcomes, which may include overall response rate (ORR), different levels of response (e.g. complete response [CR], partial response [PR] or very good partial response [VGPR]) and duration of response (DoR). These surrogate endpoints are often considered attractive as they typically require smaller sample sizes, occur faster and are less expensive to collect in clinical trials compared with final outcomes, thereby reducing costs associated with data collection and expediting the time required for bringing new technologies to market.
It has been recognised in the literature that the reliance on surrogates may lead to invalid conclusions regarding the net health effects of technologies, which in turn have the potential to lead to patient harm. 3 Much of the published literature around the use of surrogate endpoints has focussed on the development and application of frameworks for their validation. 4,5 In his seminal paper, Prentice 4 put forward stringent criteria for the validation of surrogate endpoints in phase 3 trials. In general terms, these criteria require that the surrogate endpoint must be a correlate of the net effect of treatment on the final clinical outcome; in other words, there must be a single pathway from the treatment to the true endpoint, which is mediated exclusively by the surrogate endpoint. 6 Applied surrogate validation studies commonly adopt a meta-analytic (meta-regression) approach based on multiple studies in order to assess whether the apparent relationship between the surrogate and the final endpoint remains constant in the presence of various sources of heterogeneity, such as differences in patient population, study design and treatments received. 5
Based on the NIH Biomarkers Definition Working Group's preferred terms and definitions 7 and the 2001 Journal of the American Medical Association (JAMA) User's Guide, 8 Taylor and Elston 9 proposed a hierarchy of levels of surrogate validation. Level 3 of the hierarchy relates to biological plausibility; this is the weakest form of validation and is typically based on pathophysiological studies and/or an understanding of the disease process. Level 2 requires the presence of a consistent association between the surrogate outcome and the final endpoint; this may be assessed using observational studies or arm-based analyses of trials, which have measured both the surrogate and the final outcome. This level of validation requires an assessment of the individual-level (absolute) association between endpoints and is usually undertaken using correlation analysis. Level 1 of the hierarchy represents the strongest level of surrogate validation: in order to achieve this level of validation, the treatment effect on the surrogate must correspond to the treatment effect on the final outcome. Demonstrating this level of validity requires an analysis of correlation in terms of treatment effects between arms based on data from RCTs (trial-level association).
Other validation frameworks have been proposed to assess the strength of association between surrogate and final endpoints. These include the criteria proposed by the German Institute of Quality and Efficiency in Health Care 10 (IQWiG; based on the treatment effect association only) and the Biomarker-Surrogate Evaluation Schema criteria 11 (BSES2; based on both absolute and treatment effect associations). These frameworks differ in terms of the types of analyses and the strength of the relationship required to determine the reliability of the surrogate.
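As a rough illustration of the distinction between the arm-level (absolute) association required for Level 2 and the trial-level (treatment effect) association required for Level 1, the sketch below computes both from the same set of two-arm trials. Everything in it is an assumption made for illustration: the `TRIALS` data are invented, the log ratio of median OS is used as a simple stand-in for a treatment effect on OS, and the regression is unweighted. It is not the method of any study included in this review.

```python
from math import log, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical two-arm trials (invented numbers):
# (ORR control, ORR experimental, median OS control, median OS experimental)
TRIALS = [
    (0.20, 0.35, 10.0, 12.0),
    (0.25, 0.30, 11.0, 11.5),
    (0.15, 0.40, 9.0, 12.5),
    (0.30, 0.45, 12.0, 13.0),
    (0.22, 0.28, 10.5, 10.0),
]


def arm_level_r(trials):
    """Absolute association: every arm is one data point (ORR vs median OS)."""
    orr = [value for t in trials for value in (t[0], t[1])]
    median_os = [value for t in trials for value in (t[2], t[3])]
    return pearson_r(orr, median_os)


def trial_level_r2(trials):
    """Treatment effect association: every trial is one data point.

    x = difference in ORR between arms; y = log ratio of median OS.
    For a simple unweighted linear regression, R^2 equals Pearson's r squared.
    """
    diff_orr = [t[1] - t[0] for t in trials]
    log_ratio_os = [log(t[3] / t[2]) for t in trials]
    return pearson_r(diff_orr, log_ratio_os) ** 2
```

The two quantities answer different questions: a high arm-level correlation shows only that arms with more responders tend to have longer survival, whereas the trial-level R² asks whether a larger gain in response between arms goes with a larger gain in survival.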
This systematic review summarises published meta-regression studies reporting correlation and regression analyses for the strength of the association between response-based outcomes and PFS, TTP or OS in (primarily) advanced or metastatic cancer, across any tumour site, in order to assess whether response-based outcomes may be considered as valid surrogates for PFS, TTP or OS.

METHODS

Inclusion and exclusion criteria
Inclusion was restricted to articles reporting meta-analyses or meta-regressions across multiple studies and reporting the strength of association between response outcomes (ORR, CR, PR, VGPR or DoR) and either PFS, TTP or OS. The included meta-regressions could themselves include RCTs and/or single-arm studies. However, individual reports analysing single trials or single cohorts were excluded from this review. Included meta-analyses could report absolute associations and/or treatment effect associations. These associations had to be reported as a correlation coefficient (e.g. Pearson's r or Spearman's r_s) and/or a coefficient of determination (R²) between relevant outcomes.
Studies of any cancer and any treatment were included. The review focussed mainly on studies of advanced or metastatic cancers (and/or treatment with palliative intent), as these studies were more likely to report PFS and OS. However, studies reporting relevant outcomes were included even where the stage was not specifically restricted to advanced/metastatic disease for all patients or where this was unclear (this applied particularly to haematological cancers). Studies were excluded if they explicitly referred to adjuvant or neo-adjuvant treatment, or treatments that are given with curative intent. Studies were only included if they were written in English or contained sufficient detail in English.
The review protocol is registered on PROSPERO with registration number CRD42019127606.

Search strategy
Five databases (MEDLINE, EMBASE, Web of Science, the Cochrane Database of Systematic Reviews and CINAHL) were searched from inception to March 2019. Search terms included cancer terms AND response terms AND terms for PFS, TTP and/or OS AND terms for regression, correlation, prediction, association or relationship AND terms for endpoint and/or surrogate. Search results were limited to the English language and to studies undertaken in humans. The MEDLINE search strategy is provided in Supplementary Information 1. In addition, a citation search was undertaken based on two existing meta-reviews of surrogate relationships; this identified studies that have cited any of the 48 articles included in the review by Fischer et al. 12 and/or any of the 19 articles included in the review by Davis et al. 13 In addition, relevant existing meta-reviews, including Fischer et al., 12 Davis et al., 13 Savina et al., 14 Haslam et al. 15 and any reviews identified during searching, were checked for relevant studies.

Scoring the strength of association: IQWiG and BSES2 scoring
In this review, two sets of published criteria were used to assess the strength of association between surrogate and final endpoints: the IQWiG criteria 10 and the BSES2 criteria. 11 The IQWiG criteria 10 are based on the correlation coefficient (r) for the treatment effect association. Where r was not reported, it was calculated as the square root of R², if available. As the medium score bracket was not clearly defined, slight modifications were made to the IQWiG criteria based on the approach used in the previous review by Savina et al. 14 (Supplementary Table 1). The IQWiG score was generated based on the magnitude of r, irrespective of its sign (i.e. a negative correlation could generate a high score). The IQWiG criteria were scored as follows: high (lower confidence interval of r is ≥0.85); medium+ (r ≥ 0.85 with no reported confidence interval or r ≥ 0.85 with wide confidence intervals [lower limit <0.85]); medium (0.85 > r ≥ 0.7 and upper confidence interval of r is ≥0.7 and lower confidence interval of r is <0.85, or 0.85 > r ≥ 0.7 with no reported confidence interval); or low (upper confidence interval of r is <0.7 or r < 0.7 with no reported confidence interval).
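The modified IQWiG brackets described above can be written as a small decision function. The following is a minimal sketch, not the scoring code used in the review: the function name `iqwig_score` is hypothetical, the way a confidence interval around a negative r is reflected to give limits on the magnitude is an assumption, and the function returns `None` for the one combination the stated brackets do not cover (r < 0.7 with an upper confidence limit of 0.7 or more).

```python
from typing import Optional, Tuple


def iqwig_score(r: float, ci: Optional[Tuple[float, float]] = None) -> Optional[str]:
    """Classify a treatment effect correlation using the brackets in the text.

    r  : reported correlation coefficient (its sign is ignored, as in the text).
    ci : optional (lower, upper) confidence limits for r.
    """
    magnitude = abs(r)

    if ci is None:
        # No reported confidence interval: the bracket depends on r alone.
        if magnitude >= 0.85:
            return "medium+"
        if magnitude >= 0.7:
            return "medium"
        return "low"

    lower, upper = ci
    if r < 0:
        # Assumption: reflect the interval so the limits refer to the magnitude.
        lower, upper = -upper, -lower

    if lower >= 0.85:
        return "high"
    if upper < 0.7:
        return "low"
    if magnitude >= 0.85:
        return "medium+"  # r >= 0.85 but lower limit < 0.85
    if magnitude >= 0.7:
        return "medium"   # 0.85 > r >= 0.7, upper >= 0.7, lower < 0.85
    return None           # r < 0.7 with upper limit >= 0.7: not covered by the text
```

For example, `iqwig_score(0.9)` falls in the medium+ bracket because no confidence interval is available, whereas `iqwig_score(0.9, (0.86, 0.95))` is scored high.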
The BSES2 criteria 11 require R² values for both the absolute and treatment effect associations. Where R² was not reported, it was calculated as the square of r, if available. BSES2 criteria were used as an adaptation from the original BSES criteria, as described in Savina et al. 14 The original BSES criteria require R² for both individual and treatment effect associations and a value for the surrogate threshold effect (STE). Since so few articles report STE, this review used BSES2, which does not require the STE. The BSES2 criteria were scored as follows: excellent (R² [...]

Study selection and data extraction
Titles and abstracts of articles retrieved by the search were examined by one reviewer and a subset was checked by a second reviewer early in the process, followed by a discussion to ensure consistency in the selection decisions. Full texts were examined by one reviewer and a subset was checked by a second reviewer, with any discrepancies resolved through discussion.
Data were extracted by one reviewer and all data were checked by a second reviewer. Data were extracted relating to study design, participant characteristics, surrogate and final endpoints analysed, methods for correlation and regression, and results including absolute associations, associations between treatment effects, STE and regression equations.

Data synthesis
Data were presented in a narrative synthesis. Plots were constructed to illustrate the reported associations within each study. Some of the included meta-regression studies reported multiple subgroup analyses with differing results. Therefore, each horizontal row in the plots illustrates the range of reported associations across all subgroup analyses within a single meta-regression study. Where an included meta-regression study reported on more than one cancer type, these are shown on separate rows on the plots.
For associations between absolute values of endpoints, the plots show the range of correlation coefficients per study, across all subgroup analyses. All types of correlation coefficient were included, for example, Pearson's r and Spearman's r_s. If no correlation coefficient was reported, then Pearson's r was calculated as the square root of R², if available.
For associations between treatment effects, the plots show the range of regression coefficients of determination (R²) per study, across all subgroup analyses. The plots include both adjusted and unadjusted R² values, as well as values from weighted and unweighted regressions. For studies in which R² was not reported, this was calculated as the square of the Pearson's r correlation coefficient, if available. R² was not calculated from other correlation coefficients such as Spearman, or where the method of correlation was unclear.
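The conversion and summary rules in this section can be sketched as follows. This is a hedged illustration only: the function names and the input layout (pairs of study label and reported value) are hypothetical, and returning `None` for a negative (adjusted) R² is an assumption, since a square root is not defined in that case. Note also that taking the square root of R² recovers only the magnitude of r, not its sign.

```python
from math import sqrt


def r2_from_r(r, method):
    """Square a correlation coefficient, but only when it is Pearson's r.

    Other methods (e.g. Spearman) or an unclear method give None,
    mirroring the rule stated in the text.
    """
    return r * r if method == "pearson" else None


def r_from_r2(r2):
    """Magnitude of Pearson's r implied by R^2 (the sign cannot be recovered).

    Assumption: a negative adjusted R^2 cannot be converted, so None is returned.
    """
    return sqrt(r2) if r2 >= 0 else None


def range_per_study(rows):
    """Collapse subgroup analyses into one (min, max) range per study.

    rows: iterable of (study_label, value) pairs, one per subgroup analysis.
    """
    ranges = {}
    for study, value in rows:
        low, high = ranges.get(study, (value, value))
        ranges[study] = (min(low, value), max(high, value))
    return ranges
```

Each entry of the dictionary returned by `range_per_study` corresponds to one horizontal row of the plots described above.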

Quality assessment
Included meta-regression studies were assessed for methodological quality based on key criteria from the AMSTAR-2 16 and ReSEEM 17 checklists most relevant to our review.

RESULTS

Number of included meta-regression studies
The literature search generated 2829 citations (Fig. 1), of which 2630 were excluded during the review of titles and abstracts and a further 135 excluded during the review of full texts. In total, 63 studies (within 64 references) were included in the review.

Characteristics of included meta-regression studies
Summaries of study characteristics and reported data types are provided in Supplementary Tables 3 and 4, respectively, while full details of study characteristics for each of the 63 included studies are provided in Supplementary Table 5.
The most commonly reported surrogate relationships were ORR to OS (57 studies), ORR to PFS (22 studies), CR to OS (8 studies) and CR to PFS (7 studies). Other response outcomes (DoR, PR and VGPR/CR) were only reported in one to two studies each. Twenty different cancer types were analysed, the most common being NSCLC (16 studies), CRC (10 studies), various solid tumours (8 studies) and breast cancer (5 studies).
[Fig. 1: study selection flow diagram. References identified from other sources, e.g. other reviews (n = 7): from previous reviews (n = 5), chance find (n = 2). Full-text references screened (n = 199).]

Absolute (individual-level) correlation and regression
The range of absolute (individual-level) correlation coefficients is summarised in Table 1 and illustrated in Fig. 2 (for the association between ORR and PFS) and Fig. 3 (for the association between ORR and OS). Some of the included meta-regression studies reported multiple subgroup analyses with differing results. Therefore, each horizontal row in the plots illustrates the range of correlation coefficients across all subgroup analyses within a single meta-regression study. Where an included meta-regression reported on more than one cancer type, these are shown on separate rows on the plots. It is worth noting that the included meta-regression studies differed in terms of various factors, such as the number of included primary studies (shown as N on the plots), treatment type, line of treatment and precise clinical population (full details in Supplementary Table 7).
[...] Table 1). Across those studies that report only a single analysis, the correlation coefficient was generally >0.60; however, some estimates were lower. Confidence intervals around the correlation coefficients were rarely reported. Few separate meta-regressions reported on the same tumour site, hence it is difficult to assess whether ORR may be a more reliable surrogate in certain cancer types than others. One study reported on ORR and TTP (gastric cancer; correlation r_s = 0.41-0.56 across subgroup analyses, not shown on the plot). 42

ORR and OS.
The reported correlation coefficients between absolute ORR and OS ranged from −0.40 [...] within one study of SCLC, 59 while the highest correlation coefficient between absolute PR and OS ranged from 0.29 to 0.66 in the same study 59 (Table 1).

DoR and PFS or OS.
No studies reported on the absolute association between DoR and PFS or OS.

Treatment effect (trial-level) correlation and regression
The range of treatment effect (trial-level) R² values is summarised in Table 1 and illustrated in Fig. 4 (for the association between ORR and PFS) and Fig. 5 (for the association between ORR and OS). Each horizontal row in the plots illustrates the range of R² values across all subgroup analyses within a single meta-regression study. Where an included meta-regression reported on more than one cancer type, these are shown separately on the plots. It is worth noting that the meta-regressions differed in terms of the number of included primary studies (shown as N on the plots), treatment type, line of treatment and precise clinical population (full details in Supplementary [...] 29 and pancreatic cancer 28 reported Spearman's correlation coefficients between DoR and OS ranging from 0.40 to 0.76 (Table 1).

Influence of clinical and study factors on association
The impact of the following patient and study factors on the association between ORR and OS was explored: treatment line; treatment type; response criteria; adjustment of OS for crossover and post-progression treatments; and aggregate versus IPD data (Supplementary Table 9). No clear effect on the association between ORR and OS was identified for any individual factor. However, this analysis was limited by the small number of publications assessing each factor within each cancer, and the wide ranges of associations observed for each. Five of the 63 included meta-analyses analysed IPD rather than aggregate data: two in breast cancer, 23,24 one in colorectal cancer, 25 one in NHL 69 and one in ovarian cancer. 66 The associations reported in these studies were not noticeably different to those in other studies (see Figs. 2-5).

Regression equations
Regression equations were reported in 14 studies for the relationship between ORR and OS; of these, four reported absolute associations 42,52,72,76 and ten reported treatment effect associations. 31-33,36,41,46,56,58,67,77 Regression equations were also reported in eight studies for the relationship between ORR and PFS; of these, four reported absolute associations 52,54,72,76 and four reported treatment effect associations. 24,33,67,77 These analyses spanned 10 cancer types. Full details are provided in Supplementary Tables 10 and 11. There was substantial variation in the effect measures used for both the surrogate and final outcomes; for example, difference in medians, hazard ratio (HR), odds ratio (OR), log-transformed or not. None of the included studies attempted to externally validate their regression equations for the relationship between response and other outcomes.

Surrogate threshold effect
The STE (the smallest treatment effect on the surrogate that predicts a non-zero treatment effect on the true endpoint 82 ) was reported in only four studies (Supplementary Table 12). 26,39,69,77 For the relationship between ORR and PFS, one study 77 in various solid tumours reported that a difference in ORR of 15% would be required to predict a non-zero treatment effect on the HR for PFS. For the relationship between ORR and OS, two studies in various solid tumours 77 and NSCLC 39 reported that a difference in ORR of 21% and 55%, respectively, would be required to predict a non-zero treatment effect on the HR for OS, while one study 39 also reported that a difference in ORR of 41% would be required to predict a non-zero treatment effect on the difference in median OS. A further study in colorectal cancer 26 reported that an OR for ORR of 0.28 would be required to predict a non-zero treatment effect on the OR for OS. Finally, for the relationship between CR and PFS, one study in NHL 69 reported that an OR for CR (at 30 months) of 1.56 would be required to predict a non-zero treatment effect on the HR for PFS.
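To make the definition of the STE concrete, the sketch below derives one from a simple trial-level regression: it scans for the smallest difference in ORR at which the upper 95% prediction limit for the log HR for OS falls below zero. All inputs are assumptions for illustration: the data in `D_ORR` and `LOG_HR_OS` are invented, the regression is unweighted ordinary least squares, and the critical value 1.96 is a normal approximation (a t quantile would give a larger STE when few trials are available). The included studies may have used different models and effect measures.

```python
from math import sqrt


def fit_ols(xs, ys):
    """Unweighted least-squares fit of y on x, keeping what prediction limits need."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    resid_sd = sqrt(sse / (n - 2))
    return {"n": n, "mean_x": mean_x, "sxx": sxx,
            "slope": slope, "intercept": intercept, "resid_sd": resid_sd}


def upper_prediction_limit(fit, x, crit=1.96):
    """Upper limit of the prediction interval for the effect in a new trial at x."""
    predicted = fit["intercept"] + fit["slope"] * x
    se = fit["resid_sd"] * sqrt(1 + 1 / fit["n"] + (x - fit["mean_x"]) ** 2 / fit["sxx"])
    return predicted + crit * se


def surrogate_threshold_effect(fit, crit=1.96, step=0.001, x_max=1.0):
    """Smallest x on a grid from 0 to x_max whose upper prediction limit is below 0.

    Returns None if no such x exists on the grid, i.e. no STE can be identified.
    """
    n_steps = int(round(x_max / step))
    for i in range(n_steps + 1):
        x = i * step
        if upper_prediction_limit(fit, x, crit) < 0:
            return x
    return None


# Invented trial-level data: difference in ORR and log HR for OS (negative = benefit).
D_ORR = [0.02, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
LOG_HR_OS = [0.03, -0.02, -0.05, -0.12, -0.13, -0.20, -0.22, -0.30]

FIT = fit_ols(D_ORR, LOG_HR_OS)
STE = surrogate_threshold_effect(FIT)

# A weak, noisy relationship: the prediction interval never excludes zero.
WEAK_FIT = fit_ols([0.1, 0.2, 0.3, 0.4], [0.1, -0.1, 0.1, -0.1])
WEAK_STE = surrogate_threshold_effect(WEAK_FIT)
```

With a weak association the prediction interval is wide, so the STE becomes large or, as for `WEAK_FIT`, cannot be identified at all; this is consistent with the large ORR differences reported by some of the studies above.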

IQWiG and BSES2 scores for the strength of association
IQWiG and BSES2 scores for the strength of association between surrogate and final endpoints were calculated for all reported subgroup analyses with sufficient data; therefore, meta-regression studies that reported more subgroups are more strongly represented in this analysis. These data are presented graphically in Supplementary Figs. 1 and 2.

DISCUSSION
This systematic review summarises published meta-regression studies reporting correlation and regression analyses for the strength of the association between response outcomes and PFS, TTP or OS across different types of cancer. In total, the review included 63 studies across 20 cancer types. The most commonly analysed relationships were between ORR and either PFS or OS. For the association between ORR and PFS, the majority of reported correlation coefficients between absolute values were >0.60 (range −0.72 to 0.96). For the association between treatment effects on ORR and PFS, the majority of regression R² values were >0.40 (range 0.18-0.94). The association between ORR and OS appeared weaker than that between ORR and PFS; while the majority of reported correlation coefficients between absolute values were >0.40, several estimates were lower (range −0.40 to 1.00). For the association between treatment effects on ORR and OS, all regression R² values except one were below 0.60 (range −0.08 to 0.84).
There was no clear pattern by cancer type for either the absolute or treatment effect associations, based on both multiple analyses within the same study and results across separate studies. Confidence intervals around the reported correlation coefficients and R² values were generally wide and often not reported.
Strength of association across all subgroup analyses within all included meta-regression studies was assessed using the IQWiG and BSES2 scoring systems. The majority of analyses were not evaluable due to the lack of required data. Of those analyses that could be scored, scores were relatively low, suggesting that response-based endpoints may be poor surrogates for OS.
Previous systematic reviews of surrogate endpoints in advanced cancer have been published. Savina et al. 14 and Haslam et al. 15 have reported systematic reviews of meta-analyses assessing any endpoint as a surrogate for OS. Both these reviews also assessed the strength of association using surrogate validation frameworks; both studies used adaptations of the IQWiG framework, and Savina et al. 14 also used the BSES2 framework. These previous reviews generally focussed on the main analyses presented within individual meta-analyses (usually that with the largest number of patients). Similar to our review, these previous reviews suggested that response-based outcomes are likely to be poor surrogates for OS. Our systematic review focusses exclusively on response-based surrogates; it includes a comprehensive search to identify relevant studies, considers PFS as a potential final endpoint as well as OS, is more up to date, includes a greater number of studies and reports results for the full breadth of analyses reported in the included meta-regression studies compared with these previous reviews. This provides a more complete picture of the extent of heterogeneity in reported relationships across the full range of meta-analyses across each cancer area. This additional breadth provides a better basis to inform judgements about the validity of response-based endpoints as a surrogate for PFS or OS.
The review is subject to a number of limitations. The reported data were highly heterogeneous in terms of effect measure and method of analysis. Therefore, some simplifying assumptions had to be made to allow the data to be summarised. For example, correlation coefficients were summarised regardless of method (Pearson's, Spearman's or other); R² values were summarised irrespective of whether or not the regression was weighted and whether or not the R² was adjusted; and for treatment effect associations, R² values were summarised regardless of the effect measure (e.g. HR, OR, difference in medians). In addition, only five studies used IPD rather than aggregate data in their analysis; this is a limitation of the analyses conducted in the majority of meta-reviews. A recent review by Xie et al. 17 highlighted wide variability in reporting standards across surrogate evaluation meta-regression studies; future analyses should attempt to adhere to current best practice, for example, the reporting of surrogate endpoint evaluation using meta-analyses (ReSEEM) guidelines, in order to improve the quality of these analyses. 17 It should further be noted that while meta-regression has been widely used for the purpose of evaluating the validity of surrogate endpoints in oncology, this method has been criticised as it ignores uncertainty around the treatment effect on the surrogate outcome (which is treated as a fixed covariate in the analysis). Newer methods, such as the bivariate random effects meta-analysis (BRMA) model reported by Bujkiewicz et al., 83 provide an approach for both the validation and prediction of surrogate endpoints within a Bayesian framework. This approach allows for borrowing of information across studies and fully accounts for all uncertainty surrounding the surrogate relationship.
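To show where the uncertainty around the surrogate enters, a commonly used hierarchical (product normal) form of a bivariate random effects meta-analysis is sketched below. This is an illustrative formulation written under stated assumptions, and the exact parameterisation in the cited BRMA work may differ. The symbols are introduced here for illustration: $Y_{1i}$ and $Y_{2i}$ are the estimated treatment effects on the surrogate and final endpoints in study $i$, $\mu_{1i}$ and $\mu_{2i}$ are the corresponding true effects, $\sigma_{1i}$ and $\sigma_{2i}$ are the standard errors, and $\rho_{wi}$ is the within-study correlation.

```latex
% Within-study level: both estimated effects are measured with error
\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix}
\sim N\!\left(
  \begin{pmatrix} \mu_{1i} \\ \mu_{2i} \end{pmatrix},
  \begin{pmatrix}
    \sigma_{1i}^{2} & \rho_{wi}\,\sigma_{1i}\sigma_{2i} \\
    \rho_{wi}\,\sigma_{1i}\sigma_{2i} & \sigma_{2i}^{2}
  \end{pmatrix}
\right)

% Between-study level: the surrogate relationship on the true effects
\mu_{2i} \mid \mu_{1i} \sim N\!\left( \lambda_{0} + \lambda_{1}\,\mu_{1i},\; \psi^{2} \right)
```

In contrast to a conventional meta-regression, which would regress $Y_{2i}$ directly on $Y_{1i}$ as a fixed covariate, the true surrogate effect $\mu_{1i}$ is here a latent quantity, so the sampling error $\sigma_{1i}$ propagates into predictions of $\mu_{2i}$. In this kind of formulation a strong surrogate relationship corresponds to a slope $\lambda_{1}$ different from zero together with a small conditional variance $\psi^{2}$.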
In spite of the generally poor association between response-based outcomes and final outcomes, there may still be instances in which generating predictions on the basis of response is necessary; for example, within health economic models, or more broadly, for decision-making within health technology assessment. In instances where the surrogate association is weak, this uncertainty would manifest as a wider prediction interval. If such predictions are necessary, it is therefore important that all uncertainty is reflected in the model. Future surrogate evaluation studies should consider the use of the BRMA model, rather than conventional meta-regression, as a means of fully reflecting this uncertainty.

CONCLUSIONS
This systematic review suggests that response-based endpoints such as ORR and CR may not be reliable surrogates for PFS or OS in cancer treatment. Strength of association varied widely between and within studies, with no clear pattern by cancer type. The strength of association between ORR and OS appeared weaker and more variable than that between ORR and PFS, both for associations between absolute endpoints and associations between treatment effects. While there may still be value in using response outcomes as a means of predicting final outcomes such as OS, it is important that those predictions are made on the basis of models which fully reflect the uncertainty around the treatment effect on the surrogate outcome.