Main

Most patients with non-small cell lung cancer (NSCLC) present with advanced disease at diagnosis. In this setting, therapeutic options are limited to systemic treatments, including targeted agents and increasingly immunotherapy, and in select cases radiotherapy. However, response to treatment is heterogeneous (Gridelli et al, 2003; National Comprehensive Cancer Network, 2016). As therapy is associated with significant adverse events, it is imperative to ensure that patients are actually benefitting from treatment. Early identification of progressive disease (PD) during treatment is vital to save time and costs in switching to a new treatment strategy and to avoid unnecessary side effects from exposure to an ineffective regimen (Holdenrieder et al, 2008; Holdenrieder and Stieber, 2010).

Imaging techniques are routinely used in NSCLC to monitor response to chemotherapy, but these are associated with relatively high costs and are inconvenient for patients (Mahadevia et al, 2003; Ganti and Mulshine, 2006; National Comprehensive Cancer Network, 2016). A number of established biomarkers have been utilised in NSCLC for diagnosis, prognosis and therapeutic monitoring (Barak et al, 2010; Holdenrieder et al, 2010; Molina et al, 2010). Carcinoembryonic antigen (CEA) and cytokeratin-19 fragments (CYFRA 21-1) are widely used in certain regions of the world, such as Eastern Asia (Zhi et al, 2015), for differential diagnosis in NSCLC (Schalhorn et al, 2001; Molina et al, 2003, 2010, 2016) and have demonstrated great potential for predicting early response to chemotherapy (Vollmer et al, 2003; Holdenrieder et al, 2004, 2006).

We conducted a systematic review and meta-analysis to evaluate CEA and CYFRA 21-1 in the assessment of therapy response in NSCLC. Specifically, we aimed to address two clinical questions. First, are pretherapy serum levels of CEA and CYFRA 21-1 predictive of response to therapy in patients with previously untreated advanced NSCLC? Second, are changes in serum marker levels of CEA and CYFRA 21-1 during therapy, as compared with pretherapy levels, indicative of response in patients with previously untreated advanced NSCLC?

Materials and methods

Literature search

We searched PubMed for studies published between 1 January 2000 and 23 June 2015 using the terms: CEA (cea, carcinoembryonic, carcino-embryonic, carcino embryonic), CYFRA 21-1 (cyfra, cytokeratin 19, cytokeratin-19), and NSCLC (nsclc, non-small cell lung cancer, non-small cell lung carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, adenocarcinoma of the lung, squamous cell carcinoma of the lung, lung cancer, lung carcinoma). Search terms relating to the assessment of therapy response were not included owing to the inconsistent use of terms describing tumour markers in the literature. The meta-analysis was registered with PROSPERO (registration no. CRD42015029974) (Uhlig et al, 2015).

Eligibility criteria

All original peer-reviewed research publications were considered, including prospective and retrospective studies. Eligible studies were required to: enrol adults with advanced NSCLC receiving first-line therapy; classify patients as ‘high’ or ‘low/normal’ with respect to CEA and CYFRA 21-1 serum levels prior to therapy and/or classify patients as having ‘reduction’ or ‘no reduction’ according to changes in marker levels during treatment; classify patients as ‘responder’ or ‘non-responder’ to therapy (based on World Health Organization (WHO)/Response Evaluation Criteria in Solid Tumors (RECIST) criteria); and determine serum marker levels using commercially available assays.

Review articles, systematic reviews, meta-analyses, conference abstracts and case studies were excluded, as were preclinical studies. Non-English language studies were excluded during screening.

Data extraction

Two reviewers independently extracted the following data from each study: bibliographic information; study methodology; number of patients, demographics and baseline characteristics; patient numbers for subgroups; disease and treatment characteristics; assay characteristics; marker-level characteristics; assessment of therapy with respect to objective tumour response; and statistical measures for therapy response with confidence intervals (CIs) if available. No numerical information was extracted from the figures in the study publications.

Statistical methods

Statistical analyses were performed using R programming software (The R Foundation, 2015).

Risk of bias

The risk of bias in individual studies and across studies (publication bias) was analysed independently by two reviewers and evaluated by means of Begg’s funnel plot and Egger’s test for asymmetry, with P<0.05 considered significant (Egger et al, 1997). The risk of bias in individual studies (i.e., from patient selection, biomarker tests, reference standards, flow and timing or results of statistical analysis) was assessed based on the QUADAS-2 checklist, modified and extended as appropriate to the question of therapy response assessment (Whiting et al, 2011). As not only publication bias but also study bias could produce outliers or asymmetry in funnel plots, results of the formal statistical tests on asymmetry should be interpreted with caution.
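As a rough illustration of the asymmetry test, the sketch below implements the unweighted regression form of Egger's test: each study's effect divided by its standard error is regressed on its precision (1/SE), and an intercept far from zero suggests funnel-plot asymmetry. This is not the authors' R code; the function name and input values are hypothetical, and the P value (from a t distribution with n−2 degrees of freedom) is left out to keep the example dependency-free.

```python
import math

def egger_intercept(effects, std_errors):
    """Egger regression: regress effect/SE on 1/SE.

    Returns (intercept, t statistic of the intercept).
    Requires at least three studies and non-zero residual variance.
    """
    x = [1.0 / s for s in std_errors]                    # precision
    y = [e / s for e, s in zip(effects, std_errors)]     # standardised effect
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)            # residual variance
    se_intercept = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
    return intercept, intercept / se_intercept

# Hypothetical per-study effects (e.g. ln DOR) and standard errors
intercept, t_stat = egger_intercept([3.0, 1.5, 1.5], [1.0, 0.5, 0.25])
print(f"intercept={intercept:.3f} t={t_stat:.3f}")
```

With so few studies the test has little power, which is in line with the caution expressed above about interpreting formal asymmetry tests.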

Meta-analysis and study heterogeneity

Three statistical measures were considered: area under the curve (AUC) (Hanley and McNeil, 1982; Walter, 2002); sensitivity and specificity (Rutter and Gatsonis, 2001; Reitsma et al, 2005); and diagnostic odds ratio (DOR) (Glas et al, 2003). The AUC and the DOR are univariate approaches to quantifying the quality of a diagnostic test. In the present meta-analysis, the diagnostic test was the correct classification of patients as responders or non-responders to therapy by means of the tumour marker CEA or CYFRA 21-1. The AUC is the integral of the Receiver Operating Characteristic (ROC) curve. An AUC of 1 represents a perfect test, while an AUC of 0.5 represents a test that does not discriminate between two groups. The sensitivity of a diagnostic test is the proportion of patients in the target group (e.g., responders) whom the test correctly identifies; the specificity is the proportion of patients outside that group whom the test correctly identifies as such. Sensitivity and specificity values range from 0 to 1 and should be as high as possible. The DOR can take on any value >0. For useful diagnostic tests, the DOR is >1; higher DOR values are indicative of higher discriminatory power.
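To make these measures concrete, the following minimal sketch computes sensitivity, specificity and DOR from a single 2×2 contingency table. It is an illustration only, not the code used in the analysis; the function name and the patient counts are hypothetical.

```python
def diagnostic_measures(tp, fn, fp, tn):
    """Return (sensitivity, specificity, DOR) for one 2x2 contingency table.

    tp: responders correctly classified by the marker
    fn: responders misclassified as non-responders
    fp: non-responders misclassified as responders
    tn: non-responders correctly classified
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    dor = (tp * tn) / (fp * fn)   # undefined when fp or fn is zero
    return sensitivity, specificity, dor

# Hypothetical counts, for illustration only
sens, spec, dor = diagnostic_measures(tp=30, fn=10, fp=15, tn=45)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} DOR={dor:.1f}")
```

In this made-up table the marker classifies 75% of responders and 75% of non-responders correctly, which corresponds to a DOR of 9.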

Area under the curve

AUC analysis was carried out using extracted AUC values and corresponding standard errors. A summary AUC was computed on the basis of a linear model with a random effect for the study bias. The linear model was set up on the basis of the logit-transformed AUC values: ln(P/(1−P)), with P = 2·AUC − 1.
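The transformation can be sketched as follows, reading the formula as ln(P/(1−P)) with P = 2·AUC − 1, so that AUC values between 0.5 and 1 are mapped onto the whole real line before pooling and mapped back afterwards. The function names are hypothetical and the snippet is not taken from the original analysis.

```python
import math

def auc_to_logit(auc):
    """Map an AUC in (0.5, 1) to the logit scale via P = 2*AUC - 1."""
    p = 2.0 * auc - 1.0
    if not 0.0 < p < 1.0:
        raise ValueError("transform requires 0.5 < AUC < 1")
    return math.log(p / (1.0 - p))

def logit_to_auc(z):
    """Inverse transform: recover the AUC from a logit-scale value."""
    p = 1.0 / (1.0 + math.exp(-z))
    return (p + 1.0) / 2.0

z = auc_to_logit(0.75)          # P = 0.5, so the logit is 0
print(z, logit_to_auc(z))
```

A summary estimate and its confidence limits computed on the logit scale can be passed through the inverse function to obtain values on the AUC scale, which keeps the back-transformed interval inside (0.5, 1).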

For each AUC result, Tau squared (T²), Q and I squared (I²) were calculated. T² denotes the variance of the random variable representing study bias. Q was calculated as the weighted sum of squared differences between individual study effects and the pooled effect across studies; it is distributed as a chi-square statistic with k−1 degrees of freedom (k is the number of studies). I² represents the percentage of the variability due to heterogeneity between studies rather than to chance; a value >50% is considered heterogeneous.
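The heterogeneity statistics can be illustrated with the sketch below, which computes Q, I squared and a Tau squared estimate from per-study effects and variances. The text does not state which estimator of Tau squared was used; the DerSimonian-Laird moment estimator shown here is an assumption, and the function name and input values are hypothetical.

```python
def heterogeneity(effects, variances):
    """Cochran's Q, I^2 (in %) and a DerSimonian-Laird estimate of tau^2.

    effects:   per-study estimates (e.g. logit-transformed AUC or ln DOR)
    variances: within-study variances (squared standard errors)
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                  # inverse-variance weights
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                     # truncated at zero
    return q, i2, tau2

# Hypothetical effects with equal within-study variances
q, i2, tau2 = heterogeneity([0.2, 0.5, 0.9], [0.04, 0.04, 0.04])
print(f"Q={q:.3f} I2={i2:.1f}% tau2={tau2:.4f}")
```

For these made-up inputs I squared is about 68%, which would count as heterogeneous under the >50% threshold used here.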

Sensitivity and specificity

Sensitivity and specificity analyses were performed using contingency tables provided in the studies or, if missing, computed from extracted data. By means of a bivariate model, estimates for the logit of both sensitivity and specificity values were calculated. It was assumed that these two parameter estimates followed a bivariate normal distribution with a covariance matrix whose entries were also estimated. The study bias was modelled as a random effect. T² values were calculated for each sensitivity and specificity result.

Diagnostic odds ratio

DOR analysis was carried out on the basis of contingency tables. For the studies by Wang et al (2010) and Wang et al (2011), sensitivity and specificity values for several cutoffs were reported; DORs were calculated for the different cutoff values and the result with the highest DOR was chosen for further analysis. A summary logarithmic (ln) DOR was computed on the basis of a linear model with a random effect for the study bias. T² values were calculated for each ln DOR result.
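As an illustration of the per-study step, the sketch below computes ln DOR, its standard error and an approximate 95% CI from a contingency table, and then picks the cutoff with the highest DOR when a study reports several. The standard error uses the usual log odds ratio formula; how zero cells were handled in the analysis is not stated, so no continuity correction is applied. The function name, cutoff labels and counts are hypothetical.

```python
import math

def ln_dor_with_ci(tp, fn, fp, tn, z=1.96):
    """Return ln DOR, its standard error and an approximate 95% CI for the DOR.

    All four cells must be non-zero; no continuity correction is applied.
    """
    ln_dor = math.log((tp * tn) / (fp * fn))
    se = math.sqrt(1.0 / tp + 1.0 / fn + 1.0 / fp + 1.0 / tn)
    ci = (math.exp(ln_dor - z * se), math.exp(ln_dor + z * se))
    return ln_dor, se, ci

# Hypothetical tables (tp, fn, fp, tn) for one study reporting two cutoffs
tables_by_cutoff = {
    "cutoff_a": (30, 10, 15, 45),
    "cutoff_b": (24, 16, 9, 51),
}

# Keep the cutoff whose table gives the highest (ln) DOR
best = max(tables_by_cutoff, key=lambda c: ln_dor_with_ci(*tables_by_cutoff[c])[0])
ln_dor, se, ci = ln_dor_with_ci(*tables_by_cutoff[best])
print(best, round(math.exp(ln_dor), 2), [round(x, 2) for x in ci])
```

Selecting the best-performing cutoff per study is simple to apply but tends to favour optimistic estimates, which is worth keeping in mind when reading the pooled DOR.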

Meta-regression

A meta-regression was performed to assess the effect of ethnic group (Asian vs non-Asian), assay (manual vs automated) and NSCLC stage (III–IV vs I–IV) on the ln DOR for response.

Results

Study selection

Of the 1022 records identified, 25 studies were deemed eligible based on abstract screening (Figure 1). Eleven of these were subsequently excluded as they did not fulfil the inclusion criteria (Supplementary Table S1). Only three of the remaining studies reported results for advanced (stages IIIB or IV) NSCLC. The decision was therefore made to include all NSCLC stages in the meta-analysis, meaning that all of the 14 remaining studies were eligible (Supplementary Table S2).

Figure 1

PRISMA flow diagram of eligible studies. *Initially, only three studies reported results for patients with advanced (stages IIIB or IV) NSCLC. The decision was therefore made to include all NSCLC stages in the meta-analysis, meaning that 14 studies were eligible for inclusion. Of these, 11 had objective response (complete or partial response) as an end point and the other three evaluated CEA and CYFRA 21-1 for their ability to show clinical benefit (i.e., response and stable disease during treatment).

Just three studies focussed on the identification of patients with PD vs those with clinical benefit (i.e., response or stable disease (SD)) (Holdenrieder et al, 2006; Nisman et al, 2008; Arrieta et al, 2013). As this distinction is of clinical relevance in the setting of treatment monitoring, data on the identification of patients with PD were extracted from a further three studies (Trapé et al, 2003; Jin et al, 2010; Wang et al, 2010). Most of the studies focussed on the clinical response end point (complete response (CR) or partial response (PR)) rather than the clinical benefit (which includes SD) and monitoring end point.

Study classification

The 14 studies were classified in terms of response definition, tumour markers assessed and whether these were evaluated as predictive or treatment monitoring markers. Eleven studies defined response as CR plus PR and compared the diagnostic performance of CEA and CYFRA 21-1 to distinguish CR+PR from SD+PD, defined as ‘non-responders’. The other classification evaluated CEA and CYFRA 21-1 as markers for clinical benefit, that is, CR+PR+SD vs patients with PD.

To reflect the change in eligibility criteria, the two clinical questions being addressed by the meta-analysis were modified to refer to all stages of NSCLC rather than just patients with advanced disease and are referred to as ‘Prediction’ and ‘Treatment monitoring’ throughout the manuscript.

An overview of the 14 eligible studies is provided in Table 1 for the comparison (CR+PR) vs (SD+PD) and in Table 2 for the comparison (PD) vs (CR+PR+SD). For the comparison (CR+PR) vs (SD+PD), four studies analysed both markers for each clinical question and two studies analysed both clinical questions for each marker (Supplementary Table S3).

Table 1 Summary of available studies for the comparison (CR+PR) vs (SD+PD)
Table 2 Summary of available studies for the comparison (PD) vs (CR+PR+SD)

Publication bias

For the ‘Treatment monitoring’ question and the comparison (CR+PR) vs (SD+PD), the study by Arrieta et al (2013) fell outside the 99% CI for CEA (AUC 0.83, Egger’s test P=0.038; Supplementary Figure S1). More detailed analysis of this study indicated that there was a study bias, rather than a publication bias, possibly because only patients with baseline serum CEA>10 ng ml−1 were included. For CYFRA 21-1, all studies were within the 99% CI and there was no evidence of publication bias (AUC 0.72, Egger’s test P=0.847).

Supplementary Figure S2 shows bias analysis using funnel plots for DOR for the comparison (CR+PR) vs (SD+PD). With one exception, all of the studies were within the 99% CI. For the ‘Treatment monitoring’ question and the comparison (CR+PR) vs (SD+PD), the study by Arrieta et al (2013) was again outside the 99% CI for CEA; however, no publication bias was present (DOR 5.0, Egger’s test P=0.153), hence it was not necessary to exclude this study from the meta-analysis. No bias was detected for DOR for the comparison (PD) vs (CR+PR+SD) for either of the clinical questions or tumour markers (Egger’s test P>0.05).

AUC results

Only for the comparison (CR+PR) vs (SD+PD) and the ‘Treatment monitoring’ question were enough studies available for the meta-analysis. Two meta-analyses were performed for AUC, with and without the study by Arrieta et al (2013), but no significant difference was found between them (Grubbs outlier test at the 1% significance level): the summary AUC was 0.728 (95% CI 0.599–0.871) with the study and 0.667 (95% CI 0.606–0.742) without it. Removing the study reduced the level of heterogeneity from 88.4% to 38.7%.

Across-study AUC values for all combinations of response comparison, marker and clinical question were similar for the two markers, with a summary AUC of 0.728 (95% CI 0.599–0.871) for CEA and 0.724 (95% CI 0.667–0.785) for CYFRA 21-1 (Table 3; Figure 2), indicating good predictive power and clinical significance. As the CIs for the two markers overlapped, the predictive power of CEA and CYFRA 21-1 was comparable.

Table 3 Results of meta-analysis for AUC and sensitivity/specificity
Figure 2

Forest plots for AUC for the comparison (CR+PR) vs (SD+PD).

Sensitivity/specificity results

Contingency tables for the two clinical questions are shown in Supplementary Table S4.

In the assessment of the predictive performance for subsequent (CR+PR) vs (SD+PD), across-study sensitivity for response with CEA was 56.8% (i.e., 56.8% of patients with response had a pretreatment level below the cutoff value) and specificity was 53.6% (i.e., 53.6% of patients with SD or PD had a pretreatment level above the cutoff value) (Table 3; Figure 3). Corresponding values for CYFRA 21-1 were 50.5% and 67.2%, respectively. The CIs of sensitivity and specificity for both markers overlapped, indicating comparable predictive power.

Figure 3

Forest plots for sensitivity and specificity for the comparison (CR+PR) vs (SD+PD).

To assess the performance of the markers to indicate (CR+PR) vs (SD+PD) during treatment, meta-analyses for sensitivity and specificity for response were performed with and without the study by Arrieta et al (2013). Sensitivity for response was 74.7% with CEA (i.e., 74.7% of patients with CR/PR had a ‘strong’ reduction in marker level) and specificity was 69.8% (i.e., 69.8% of patients with SD/PD had a ‘weak’ reduction in marker level). For CYFRA 21-1, sensitivity and specificity for response were 79.1% and 60.6%, respectively (Table 3; Figure 3). No significant differences between CEA and CYFRA 21-1 were observed. Study-specific cutoffs and sensitivity and specificity values for the comparison (CR+PR) vs (SD+PD) are shown in Supplementary Figure S3.

For the comparison (PD) vs (CR+PR+SD), the CIs of sensitivity and specificity for progression for both markers overlapped for both clinical questions, indicating comparable predictive power (Table 3).

DOR results

Across-study DOR and ln DOR values are summarised in Supplementary Table S5, with corresponding forest plots shown in Supplementary Figure S4.

The DOR for response was defined as the ratio of two odds:

DOR_response = (odds that response will occur, given a low marker level) / (odds that response will occur, given a high marker level).

For the comparison (CR+PR) vs (SD+PD) and the ‘Prediction’ question, the DOR for response was 1.49 (95% CI 1.03–2.16) with CEA and 2.16 (95% CI 1.49–3.13) with CYFRA 21-1 (Supplementary Table S5).

To assess the performance of the markers to indicate (CR+PR) vs (SD+PD) during treatment, meta-analyses for DOR for response were also performed with and without the study by Arrieta et al (2013). The DOR for response with CEA was 6.89 (95% CI 3.40–13.95) compared with 6.42 (95% CI 3.50–11.79) with CYFRA 21-1. The CIs excluded 1 for both markers for both clinical questions, showing evidence of their clinical relevance as predictors of treatment response. Between-study heterogeneity for all combinations of marker and clinical question was very low (Supplementary Table S5).

The DOR for progression was defined as the ratio of two odds:

DOR_progression = (odds that progression will occur, given a high marker level) / (odds that progression will occur, given a low marker level).

Assessment of the predictive performance for subsequent (PD) vs (CR+PR+SD) showed the DOR for progression was 1.82 (95% CI 0.97–3.41) with CEA and 3.16 (95% CI 2.01–4.96) with CYFRA 21-1 (Supplementary Table S5). In the assessment of marker performance to indicate PD during treatment, the DOR for progression was 1.97 (95% CI 0.48–8.09) with CEA vs 14.73 (95% CI 5.01–43.29) with CYFRA 21-1, demonstrating the clinical significance of CYFRA 21-1 for the evaluation of treatment response.

Meta-regression analysis

The feasibility of subgroup analysis was assessed for all combinations of response comparison, marker and clinical question and the statistical measures AUC, sensitivity/specificity and DOR (Supplementary Table S6; Supplementary Table S7). No significant differences between ethnic groups, assay type or tumour stage were detectable.

Discussion

This systematic review and meta-analysis aimed to determine whether pretreatment serum levels of CEA and CYFRA 21-1 are predictive of response to therapy in previously untreated NSCLC (‘Prediction’), and whether changes in serum levels during therapy are indicative of response in this patient population (‘Treatment monitoring’).

For the comparison (CR+PR) vs (SD+PD), AUC data indicated good predictive power for CYFRA 21-1. However, for CEA, a high level of heterogeneity was observed as a result of inclusion of the study by Arrieta et al (2013). There were too few studies for a meaningful computation of summary ROC curves.

Across-study sensitivity and specificity results were better for ‘Treatment monitoring’ than for ‘Prediction’, indicating that the change in CEA and CYFRA 21-1 level had a higher predictive power than the pretreatment level alone. For both clinical questions, results for CYFRA 21-1 were superior to those for CEA. However, owing to the limited number of studies, and the resulting large CIs, no clear conclusion as to the clinical significance of the two markers for the comparison (PD) vs (CR+PR+SD) could be drawn.

Across-study DOR results for the comparison (CR+PR) vs (SD+PD) for both markers were superior for ‘Treatment monitoring’ compared with ‘Prediction’, again indicating the higher predictive power of the change in CEA and CYFRA 21-1 levels compared with the pretreatment level alone. For all four combinations of marker and clinical question, the DOR for response was significantly >1, supporting the clinical utility of the two markers in this setting. The DOR for response values for the ‘Treatment monitoring’ question for both markers also provided evidence of high discriminatory power. There were no significant differences in DOR for response between ethnic groups, assay type or tumour stage, lending further validity to the overall results.

For DOR and the comparison (PD) vs (CR+PR+SD), the CI for the DOR for progression for CEA included 1 for both prediction and monitoring use, indicating that neither pretreatment level nor change in level correlated with response. However, the results for CYFRA 21-1 showed the usefulness of the marker for prediction and therapy monitoring. As very few studies were available for the comparison (PD) vs (CR+PR+SD), a reliable conclusion as to the clinical significance of the markers with respect to this comparison was not possible. However, by calculating systematic bias of single studies in this meta-analysis, we obtained a much higher level of evidence for the performance of CEA and CYFRA 21-1 as biomarkers.

A number of potential sources of between-study heterogeneity and uncertainty in the meta-analysis should be considered. The use of different tumour response classifications may have resulted in varying numbers of responders and non-responders. The cutoff values chosen to represent the discriminatory power of the markers in the different studies were consistent for the ‘Prediction’ question but varied considerably for the ‘Treatment monitoring’ question. Patients with different stages of NSCLC were included, which may have confounded the predictive power of the markers. In addition, the studies used different patient selection criteria. This was particularly notable for the study by Arrieta et al (2013), which only enrolled patients with high serum CEA levels at baseline, although outlier tests proved negative and CIs for the results with and without the study by Arrieta et al (2013) overlapped. Finally, while great care was taken to ensure homogeneity of the time points for the evaluation of tumour response and marker measurements across the studies, some differences would have been inevitable (but in most studies, venipunctures were performed at the time of imaging investigations for staging).

In conclusion, results of this meta-analysis demonstrate the clinical utility of both CEA and CYFRA 21-1 for the assessment of response to therapy in NSCLC. The performance for both markers was stronger for treatment monitoring than for predictive value at baseline. With respect to the question of detecting progression during treatment (the reason, for example, why interim imaging is carried out in between chemotherapy cycles), results for CYFRA 21-1 suggested high discriminatory power, though a larger number of studies would have been preferable.

The results of this comprehensive analysis are highly relevant for the clinical management of lung cancer patients, as a majority do not yet benefit from new targeted therapy approaches. The development of well-defined criteria for the use of established cancer biomarkers will be essential as a complementary strategy for the sensitive guidance of these patients.