Main

The ARTemis trial is an open-label, randomized, phase 3 trial assessing the efficacy of neoadjuvant bevacizumab added to docetaxel followed by fluorouracil, epirubicin, and cyclophosphamide, for women with HER2-negative early breast cancer. Its primary end point was pathological complete response, defined as the absence of invasive disease in the breast and axillary lymph nodes. Initially, the two randomized arms of the trial were compared in terms of rates of pathological complete response as determined by a two-reader blinded review of local pathology reports by the Chief Investigators.1 In addition, a central pathology review and a large-scale two-stage pathology quality assurance exercise were undertaken. This allowed the accuracy of this commonly used primary end point in neoadjuvant chemotherapy breast cancer trials to be assessed and compared with the two-reader report review, which until now has been the standard used by this group.2 The reliability of central specimen review was also investigated by independent double-reading of residual cancer burden categories by the two central pathologists in a subset of cases. This allows us to report on the comparison between assessment of local pathology reporting and central pathology review of original diagnostic material and also the reporting behavior of the two reviewing pathologists. Although central pathological review has been carried out in studies reporting major center results,3 as far as the authors are aware this is the first report of central pathology review of pathological complete response with residual cancer burden scoring and class definition, carried out as part of a large multicenter randomized phase 3 trial.

Materials and methods

Between May 2009 and January 2013, the ARTemis trial recruited 800 women ≥18 years old with newly diagnosed HER2-negative early invasive breast cancer (radiological tumor size >20 mm, with or without axillary involvement). Patients with inflammatory cancer, T4 tumors with direct extension to the chest wall or skin, or ipsilateral supraclavicular lymph node involvement were also eligible with any size of primary tumor. Full eligibility criteria have been described elsewhere.1 Patients were randomized from 66 UK sites and assigned, via a central computerized minimization procedure, to three cycles of docetaxel (100 mg/m² once every 21 days) followed by three cycles of fluorouracil (500 mg/m²), epirubicin (100 mg/m²), and cyclophosphamide (500 mg/m²) once every 21 days (docetaxel-fluorouracil/epirubicin/cyclophosphamide), with or without four cycles of bevacizumab (15 mg/kg) (bevacizumab+docetaxel-fluorouracil/epirubicin/cyclophosphamide). A total of 781 patients (98% of the randomized 800) underwent surgery following their neoadjuvant treatment and could be assessed, via local pathology reports, for the primary end point of absence of invasive breast cancer in the breast and axillary lymph nodes.

Methods

Diagnostic and surgical excision histopathology slides were requested from the relevant participating sites for all 781 evaluable patients. Between June 2011 and March 2016, all retrieved cases underwent central independent review, blinded to the local histopathology report (including any block descriptions), by one of two experienced breast histopathologists with a special interest in neoadjuvant clinical trials (JSJT or EP). For each case, the reviewing pathologist was neither the pathologist who had assessed the slides locally nor one who would have had access to the histopathology results at their own hospital. Any missing slides or additional relevant operations (eg, sentinel lymph node biopsy) were re-requested as necessary. The variables recorded were maximum invasive tumor size in two dimensions, whole tumor size (including ductal carcinoma in situ) in two dimensions, posttreatment tumor grade, presence of lymphovascular invasion, presence and nature of in situ disease, percentage tumor cellularity, percentage cellularity that is in situ disease, total number of lymph nodes, number of positive lymph nodes, and size of largest nodal metastasis.

In addition to assessing the validity of the findings from the blinded review of local pathology reports, an inter-pathologist reproducibility exercise was also undertaken. For this, a randomly chosen 10% of patients had samples reviewed by both pathologists for determination of levels of agreement between central review findings. To simplify this exercise, variables were restricted to those required to calculate the residual cancer burden score: invasive size (length and width), percent tumor cellularity, percent of tumor that is DCIS, size of largest nodal metastasis, and number of positive nodes.4 The 10% sample was randomly chosen, while ensuring a representative residual cancer burden class split, as recorded by the first pathologist’s reviews. This approach was written into the pre-planned agreement decided by the ARTemis Trial Management Group. Ten percent of ARTemis cases for co-review was the opportunistic sample that was deemed a manageable workload for the two ARTemis pathologists.
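How the six recorded variables combine into a single score can be sketched as follows. This is a minimal illustration, assuming the published residual cancer burden index formula as commonly stated (primary tumor bed term plus nodal term, each raised to the power 0.17) and the class cut-points of 1.36 and 3.28 quoted later in this report. The function names are illustrative, the treatment of scores falling exactly on a cut-point is an assumption, and the coefficients should be checked against the MD Anderson calculator before any use.

```python
import math


def rcb_score(length_mm, width_mm, pct_cellularity, pct_in_situ,
              n_positive_nodes, largest_met_mm):
    """Residual cancer burden index from its six components.

    Assumed formula (to be verified against the published definition):
        1.4 * (f_inv * d_prim) ** 0.17
        + (4 * (1 - 0.75 ** LN) * d_met) ** 0.17
    """
    # Geometric mean of the two tumor bed dimensions (mm).
    d_prim = math.sqrt(length_mm * width_mm)
    # Fraction of the tumor bed occupied by invasive carcinoma.
    f_inv = (1 - pct_in_situ / 100) * (pct_cellularity / 100)
    primary_term = 1.4 * (f_inv * d_prim) ** 0.17
    nodal_term = (4 * (1 - 0.75 ** n_positive_nodes) * largest_met_mm) ** 0.17
    return primary_term + nodal_term


def rcb_class(score):
    """Map a score to class 0-3 (boundary handling is an assumption)."""
    if score == 0:
        return 0
    if score <= 1.36:
        return 1
    if score <= 3.28:
        return 2
    return 3


# Hypothetical case, not trial data: 20 x 10 mm bed, 10% cellularity,
# no in situ component, node negative.
example = rcb_score(20, 10, 10, 0, 0, 0)
print(round(example, 2), rcb_class(example))
```

Because both terms are raised to a small power, a moderate disagreement in any one component moves the score only slightly, which is why class disagreements tend to arise for scores that sit close to a cut-point.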

The results of the central pathology review were also compared with the outcome results as determined by the central review of the original histopathology reports from the source laboratory. In particular, we compared rates of pathological complete response and minimal residual disease as determined by formal assessment of the residual cancer burden and by interpretation of the histopathology. Original pathology reports were reviewed by the two Chief Investigators (LH and HME) according to residual cancer burden class definitions laid out in the Trial Protocol. The primary end point of the trial was pathological complete response and that was the main focus of report review by the Chief Investigators. Detailed information in the pathology report was not always available to enable estimates of degrees of partial response by the Chief Investigators who were given guidance by the Trial Pathologists as follows: pathological complete response—no residual invasive carcinoma in the breast or lymph nodes; residual cancer burden 1/minimal residual disease—residual tumor <5 mm; residual cancer burden 2—<50% tumor cellularity; residual cancer burden 3—no appreciable response. The local pathologists were not given any formal reporting guidelines specifically for this trial.

Statistical Methods

Agreement between the two pathologists’ residual cancer burden classes, and also between central review and local reports in determination of pathological complete response, was assessed using the kappa statistic. Agreement between the two pathologists in terms of residual cancer burden score and its six components was scrutinized using Bland–Altman plots and assessed using the overall concordance correlation coefficient.5 Patient characteristics were compared between groups using chi-squared tests, with continuity correction where appropriate. Logistic regression was used to assess the effect of randomized treatment arm on pathological complete response rates, after adjustment for stratification factors.
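The two continuous agreement measures named above can be sketched as follows. This is a minimal illustration, assuming that with only two raters the overall concordance correlation coefficient reduces to Lin's coefficient, and that Bland–Altman limits of agreement are taken as the mean difference ± 1.96 standard deviations. The function names and the paired scores are hypothetical and are not trial data.

```python
from statistics import mean, stdev


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    # Lin's definition uses 1/n (population) variances and covariance.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


def bland_altman_limits(x, y):
    """Return (lower limit, bias, upper limit) of agreement."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias - spread, bias, bias + spread


# Hypothetical paired scores from two readers.
reader_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
reader_2 = [2.0, 3.0, 4.0, 5.0, 6.0]  # perfectly correlated, shifted by 1

print(concordance_correlation(reader_1, reader_2))
print(bland_altman_limits(reader_1, reader_2))
```

Unlike the Pearson correlation, which would equal 1 for the shifted readings above, the concordance coefficient is penalized by the systematic offset between readers, which is the property that makes it suitable for assessing agreement rather than association.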

Results

A total of 22 916 slides from 727 patients were reviewed. Full sample retrieval was obtained for 681 (87%) of the 781 ARTemis patients who underwent surgery within the trial and were evaluable for the primary end point of pathological complete response. A total of 483/681 patients (71%) were assessed by JSJT and 198/681 patients (29%) by EP. The maximum number of slides per patient was 164 (median 29). A total of 94/681 patients (14%) had a positive pre-chemotherapy sentinel lymph node biopsy (SLNB), which invalidated the calculation of a residual cancer burden score at surgery. Residual cancer burden scores and classes were therefore calculated for the remaining 587 patients (75% of the 781). Patient characteristics of the 587 patients with assessable residual cancer burden appeared representative of the trial sample as a whole (Table 1).
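The patient flow described above can be checked with simple arithmetic. The short sketch below uses only the counts quoted in this paragraph; the variable and function names are illustrative.

```python
evaluable = 781                 # underwent surgery, evaluable for primary end point
fully_retrieved = 681           # full sample retrieval for central review
pre_chemo_slnb_positive = 94    # residual cancer burden not calculable
reviewed_by_jsjt = 483
reviewed_by_ep = 198

# Patients with an assessable residual cancer burden score.
rcb_assessable = fully_retrieved - pre_chemo_slnb_positive


def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print(rcb_assessable)                        # 587
print(pct(fully_retrieved, evaluable))       # 87
print(pct(rcb_assessable, evaluable))        # 75
```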

Table 1 Patient characteristics

Inter-Rater Reproducibility of Pathologists

Sixty-five patients were double reviewed by JSJT and EP. The 65 patients were representative of the 587 sample as a whole in terms of patient characteristics (Table 1) and the random sampling technique determined that they were also representative in terms of residual cancer burden class as recorded by the first pathologist’s central review.

Residual cancer burden class

The two pathologists showed very similar reporting profiles for residual cancer burden class (observed frequencies of residual cancer burden 0:1:2:3 being 14:9:32:10 for pathologist 1 and 13:9:34:9 for pathologist 2; Table 2). In 52/65 (80%) patients, there was agreement on residual cancer burden class; in the 13/65 (20%) where there was disagreement, none differed by more than one residual cancer burden class. A good level of agreement was observed over all residual cancer burden classes (kappa 0.70 (95% CI: 0.55–0.84); Figure 1). No differences were found between patient groups where JSJT and EP agreed on residual cancer burden class (n=52) or disagreed (n=13) in terms of randomized treatment arm or stratification variables (age, ER status, tumor size, clinical involvement of axillary nodes, locally advanced/inflammatory disease; data not shown).
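The reported kappa can be cross-checked from the figures in this paragraph alone, because an unweighted Cohen's kappa depends only on the observed agreement and the two raters' marginal class frequencies. The sketch below assumes an unweighted kappa and Cohen's simple large-sample standard error (the text does not state which variant was used); the function name is illustrative.

```python
import math


def kappa_from_margins(n_agree, rater1_counts, rater2_counts):
    """Unweighted Cohen's kappa and an approximate 95% CI.

    Uses the number of exact agreements and each rater's class totals.
    The CI uses the simple standard error
        sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2)),
    which is an assumption about the method, not a statement of it.
    """
    n = sum(rater1_counts)
    if n != sum(rater2_counts):
        raise ValueError("both raters must classify the same patients")
    p_observed = n_agree / n
    # Chance agreement: product of marginal proportions, summed over classes.
    p_expected = sum(a * b for a, b in zip(rater1_counts, rater2_counts)) / n ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    se = math.sqrt(p_observed * (1 - p_observed) / (n * (1 - p_expected) ** 2))
    return kappa, kappa - 1.96 * se, kappa + 1.96 * se


# Class 0:1:2:3 frequencies and agreement count quoted in the text.
pathologist_1 = [14, 9, 32, 10]
pathologist_2 = [13, 9, 34, 9]
kappa, ci_low, ci_high = kappa_from_margins(52, pathologist_1, pathologist_2)
print(round(kappa, 2), round(ci_low, 2), round(ci_high, 2))
```

Under these assumptions the calculation gives 0.70 (0.55 to 0.84), in line with the reported estimate and interval.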

Table 2 Residual cancer burden classes for the 65 patients, by the two pathologists
Figure 1

Level of agreement across the two pathologists’ ratings of residual cancer burden class. One rectangle is depicted for each level of residual cancer burden class, with height and width based on the row and column cumulative totals. Thus for the residual cancer burden 0 rectangle of the pathologist 1 vs pathologist 2 comparison, all patients categorized as residual cancer burden 0 by either pathologist are included. The boundaries of the rectangles along both axes represent the number of patients that were categorized as residual cancer burden 0 for each pathologist. Dark squares within the rectangles represent exact agreement between residual cancer burden classes from the two pathologists (ie, both rating as residual cancer burden 0), and are of size based on the cell frequencies and located according to the cumulative totals of the previous levels. Light rectangles represent partial agreement, where the residual cancer burden class from one pathologist differs by one class from that of the other pathologist (ie, residual cancer burden 0 by one pathologist but residual cancer burden 1 by the other pathologist). White areas within the rectangle reflect disagreement by more than one level (ie, residual cancer burden 0 by one pathologist and residual cancer burden 2 or 3 by the other pathologist).

Residual cancer burden score

For the 13 patients where there was disagreement in residual cancer burden class, the majority of disagreements were due to the two pathologists’ residual cancer burden scores falling just either side of the published residual cancer burden score cut-points of 1.36 and 3.28 (Figure 2). There was good overall concordance in residual cancer burden score (concordance correlation coefficient 0.75 (95% CI: 0.40–0.91)), with the average discrepancy in residual cancer burden score being of the magnitude 0.245 (IQR: 0.135–0.501, range: 0.085–1.840).

Figure 2

(a and b) Inter-rater reliability of pathologists’ residual cancer burden scores, where there is disagreement in residual cancer burden class (n=13 patients). (a) Pathologist 1 vs Pathologist 2 residual cancer burden scores. (b) Average of the two pathologists’ residual cancer burden scores.

Components of the residual cancer burden score

Focusing on the 13 patients where the two pathologists differed in residual cancer burden class assignment, the greatest inter-rater variability was in the assessment of percentage of ductal carcinoma in situ within the tumor (concordance correlation coefficient −0.04 (95% CI: −0.30 to 0.21)) and, to a lesser extent, in the assessment of invasive size (concordance correlation coefficient 0.20 (95% CI: −0.12 to 0.47) for width and concordance correlation coefficient 0.35 (95% CI: −0.11 to 0.68) for length) and percent of tumor cellularity (concordance correlation coefficient 0.30 (95% CI: −0.05 to 0.59)). The strongest agreement was observed in identification of number of positive nodes (concordance correlation coefficient 0.95 (95% CI: 0.85–0.98)) followed by size of the largest nodal metastasis (concordance correlation coefficient 0.74 (95% CI: 0.37–0.91)).

Sources of discrepancy

Seven cases where there was a disagreement in residual cancer burden class due to substantial differences in size measurement, cellularity or nodal status were reviewed again with joint discussion by the two pathologists. Sources of discrepancy included interpretation of multiple tumor foci as one lesion or multiple lesions, measurement of lesion size from single slides or estimating total number of slides, inclusion of pre-treatment sentinel lymph node metastases in the residual cancer burden calculation, errors in measurement, and interpretation of degenerate cells in posttreatment lymph nodes as metastasis or not. The different weightings of the elements of the residual cancer burden equation reduce the effect of the variance among those component scores.

Central Review of Pathology Specimens vs Review of Local Pathology Reports: Inter-Method Reliability

Both methods determined similar levels of pathological complete response in the 587 patients for whom both assessment results were available: 121 (21%) with residual cancer burden class 0 from central pathology review and 119 (20%) reported as pathological complete response from local pathology report (Table 3). A good level of agreement was observed between the two methods’ findings when grouped as the three levels of residual cancer burden 0 (pathological complete response) vs residual cancer burden 1 (minimal residual disease) vs residual cancer burden 2/3 (moderate/extensive disease; kappa 0.63 (95% CI: 0.57–0.69); Figure 3). However, for six patients, the disagreement was by more than one class (one patient with pathological complete response from the report review but residual cancer burden class 2 from specimen review, and five patients with moderate/extensive disease from the report review but with residual cancer burden class 0 from the specimen review).

Table 3 Levels of residual cancer at surgery, from the two assessment methods for the 781 patients
Figure 3

Level of agreement across the two methods of review: One rectangle is depicted for each level of pathologic response, their height and width based on the row and column cumulative totals. Thus for the residual cancer burden 0/pathCR rectangle of method comparison, all patients categorized as residual cancer burden 0 by pathologist or pathological complete response by report review are included. The boundaries of the rectangles along both the axes represent the number of patients that were categorized as residual cancer burden 0/pathological complete response. Dark squares within the rectangles represent exact agreement between methods (ie, pathologist rating as residual cancer burden 0 and report review as pathological complete response), and are of size based on the cell frequencies and located according to the cumulative totals of the previous levels. Light rectangles represent partial agreement, where the conclusion from the pathologist is one group different to that from the report review (ie, residual cancer burden 0 by pathologist but minimal residual disease by report review). White areas within the rectangle reflect disagreement by more than one level (ie, residual cancer burden 0 by pathologist and moderate/extensive residual disease by report review).

Slides for five of the six cases were available for second review by one of the pathologists (EP). Sources of discrepancy included not receiving all the tumor slides for review (two cases) and interpretation of residual tumor as ductal carcinoma in situ or invasive disease (one case). In one case, the second review agreed with the histopathology report (residual tumor) rather than the central review (pathological complete response). In another case called pathological complete response on central review, the discrepancy appears to be due to inconsistency in the original report in calling tumor cells in the node viable and nonviable; both central reviewers thought this represented an area of necrosis.

ARTemis Primary End Point Results

The ARTemis trial’s primary end point was previously reported using the local pathology report reviews on 781 patients and showed significantly more bevacizumab+docetaxel-fluorouracil/epirubicin/cyclophosphamide patients achieving a pathological complete response compared with docetaxel-fluorouracil/epirubicin/cyclophosphamide patients: 22% (95% CI: 18–27) of 388 bevacizumab+docetaxel-fluorouracil/epirubicin/cyclophosphamide patients compared with 17% (95% CI: 13–21) of 393 docetaxel-fluorouracil/epirubicin/cyclophosphamide patients (adjusted P=0.03; Table 4A).1 Using the residual cancer burden classes from the central pathology specimen review, the results remained the same: 25% (95% CI: 20–30) of 290 bevacizumab+docetaxel-fluorouracil/epirubicin/cyclophosphamide patients achieved a residual cancer burden 0 compared with 16% (95% CI: 12–21) of 297 docetaxel-fluorouracil/epirubicin/cyclophosphamide patients (adjusted P=0.02; Table 4B).

Table 4A Treatment arm comparison using local pathology report review data (n=781 patients)
Table 4B Treatment arm comparison using central pathology specimen review data (n=587 patients)

Likewise, using local pathology report reviews, pathological complete response rates had previously been found to differ significantly across both ER status (ER negative 38% [95% CI: 32–45], weakly positive 41% [29–53], strongly positive 7% [5–9]; P<0.0001) and tumor grade (grade 1/2 7% [4–11], grade 3 29% [25–34]; P<0.0001).1 Using the central pathology specimen review, similar results were found for rates of residual cancer burden 0: ER negative 39% [95% CI: 32–46], weakly positive 35% [23–48], strongly positive 7% [5–11] (P<0.0001); and grade 1/2 7% [4–12], grade 3 31% [26–37] (P<0.0001).

Discussion

This review focused on the presence or absence of pathological complete response in the excision specimen including the presence of residual ductal carcinoma in situ. Local pathologists were not given reporting proformas or guidelines for the assessment of response, which have been shown to aid concordance between pathologists in clinical trials.6 Because the reviewing pathologists were assessing the original sections in the overwhelming majority of cases, analytical issues do not impinge on this central review although differences in practice among different local laboratories would necessitate caution in drawing any comparison between centers.

In this review, the pathologists were blinded to the macroscopic description and therefore had to reconstruct the tumor bed dimensions from the slides as best as they could. With hindsight given the differences in reporting practice in this area, access to reports would have been of benefit in some cases but not in others. Normally, a pathologist records a block map to aid reconstruction of the tumor area when viewing the slides, and this was highlighted as being of particular importance for accurate assessment of response in the recent BIG-NABCG working group recommendations (https://www.mdanderson.org/education-and-research/resources-for-professionals/clinical-tools-andresources/clinical-calculators/calculators-rcb-pathology-protocol2.pdf (Accessed 19 August 2016)). In some cases, the tumor bed was present on megaslides and this made assessment much easier. The assessment of tumor bed size is often not straightforward following neoadjuvant chemotherapy because the tumor is poorly defined macroscopically, and it can be difficult to determine the tumor boundaries histologically. Tumor cellularity can be very heterogeneous, and is also difficult to assess in spite of the availability of online guidance tools (https://www.mdanderson.org/education-and-research/resources-for-professionals/clinical-tools-andresources/clinical-calculators/calculators-rcb-pathology-protocol2.pdf (Accessed 19 August 2016))7 and there is inconsistency among pathologists in these assessments.8 Agreement about pathological complete response should, however, be good and will only usually cause difficulty if small residues of tumor cells are overlooked, or if there is difficulty in interpreting in situ from invasive disease. Although in this study the best level of agreement in the reviewing pathologists’ cross-over study was of numbers of lymph nodes, this also is not always easy to determine without the macroscopic description. 
In some cases, the local pathologist had written on the slide to state the number of nodes present. Our data are similar to those recorded in a recent audit where a residual cancer burden score concordance of 0.25 points was reported for most cases with a kappa value of 0.714 between residual cancer burden classes. In that audit, the reviewing pathologist had access to a block map/description.9 The concordance between the two pathologists is better than that recorded in a recent review of consistency of reporting of residual cancer burden and replicates the finding that the reporting of the lymph node component of the score is more reproducible.8 Despite the limitations of this study detailed above, namely the lack of access to source reports and block descriptions, the residual cancer burden is shown to be a very robust system for quantifying residual disease in the clinical trial context.

The central pathology review was immensely labor-intensive. The maximum number of slides submitted for a single case was 164 (median 29 per case). At best, it was only possible to review three or four cases per hour. Not only was it time consuming for the pathologists, but it also placed a burden on local pathology departments retrieving slides and a considerable logistical burden on the Trials Office. One must question whether the exercise was worth the effort given that there was no clinically significant numerical change in the end results. However, one cannot generalize about central pathology review. In some trials, central review is used at the outset to confirm eligibility, whether this be HER2 or ER status, for example, or a particular tumor type (eg, triple-negative breast cancer). In the ALTTO Trial, both HER2 and ER status were changed following central re-testing in 5–15% of cases.10 An important distinction must be drawn here between re-testing, potentially using different reagents and conditions, and the review of original diagnostic material. In ARTemis, we accepted a patient’s eligibility as reported but critically reviewed the end point, which was very specifically pathological.

Review of pathology reports by the two Chief Investigators was made more difficult by a lack of standardization of how local reports were written—not all units use easy-to-read synoptic reports. Moreover, the majority of standardized reports are designed for the adjuvant setting without specific fields for the additional variables that need to be recorded post neoadjuvant therapy, such as tumor cellularity and fibrosis in lymph nodes. We accept that review of lesser degrees of tumor response to neoadjuvant chemotherapy by report review was imperfect particularly for residual cancer burden class 2 (an intermediate level of response) and this is reflected by our concordance data. However, the primary end point of this clinical trial was pathological complete response and we have shown that this is reported reliably by local pathologists. Standardization of routine reporting in clinical practice for neoadjuvant cases has been addressed recently by an international working group which should make this easier when designing future clinical trials.11 It is possible that should such standardization be adopted, a measure of response such as residual cancer burden could be calculated locally. Also, although there is some evidence that pathologists are better at assessing chemotherapy response by reading pathology reports than are practicing clinicians,12 this was not borne out by our data. It was evident on review of some of the cases where there were discrepancies between report review and central review that this was due to missing slides. In this trial, the two pathologists were involved in the resolution of disputed minimal residual disease on report review and were probably helpful in that area but minimal residual disease was not an end point of the trial. 
Furthermore, one must urge caution in trying to make direct comparisons between residual cancer burden and more descriptive approaches to assessment of response to neoadjuvant chemotherapy particularly in equating residual cancer burden Class 1 with minimal residual disease. Residual cancer burden 1 is strongly dependent on low tumor cellularity while minimal residual disease as determined by the report review, where there was often no information on comparison of pre- and posttreatment tumor cellularity, was heavily influenced by residual tumor size and these are not always the same.

The literature on central pathology review in clinical trials is limited. The NSABP requires central pathology review for its randomized clinical trials, and central reviewers are trained to operate with 90% concordance on pathological features, compared with, for example, 65% concordance between local and central reporting in the NSABP B-18 trial.13 A recently reported central review of bone marrow fibrosis showed a concordance of 58% between central and local reporting, whereas a central panel of three reviewing pathologists achieved consistency of 88% for all three pathologists and 98% for two.14 However, we were unable to find any reports of central pathology review in the context of neoadjuvant breast cancer trials.

We chose residual cancer burden as our method of measuring chemotherapy response primarily because it gave a numerical score that proved particularly convenient when it came to the cross-over study between the two pathologists. Its principal shortcoming is the lack of comparison with the baseline core biopsy, but from a clinical point of view, the tumor burden following chemotherapy is a sensible feature to measure and has been shown to correlate well with outcomes at 10 years follow-up.15 One of the important aspects of ARTemis is the future program of translational research and that has required sections from core biopsies, excised tumors and nodes to be marked up for future tissue sampling. A pathologist would certainly be required to support that aspect of a future trial. The central review process described here also provides great confidence in the recorded ARTemis end points, thus supporting subsequent translational work aimed at understanding the determinants of individual tumor response and the correlations of that with long-term outcomes. Recently, it has been shown that combining residual cancer burden with Ki67 measurements further increases the predictive power of this tool.16 Also there is a growing interest in either post-neoadjuvant studies, or allowing patients to enter other studies post neoadjuvant chemotherapy, which means that where low volumes of residual disease are permitted in such studies, perhaps caution is needed about relying on local reporting—whereas for pathological complete response or bulk residual disease one can rely on local reporting.

In conclusion, central pathology review of the ARTemis trial has allowed a direct comparison with report review and has shown that when the primary end point of the trial, pathological complete response, is compared, the two methods are equally effective. Central pathology review has a place in the assessment of minimal residual disease but if that is not an agreed pre-specified trial end point there is little extra value in doing this. Learning from the experience of ARTemis, future neoadjuvant clinical trials could be improved by training in the routine calculation of residual cancer burden. Also, standardized routine reporting using report templates would greatly assist in report review.17 Such training might provide more robust reporting of residual cancer burden classes, facilitating future clinical management, when current and planned trials of adjuvant treatment in patients not achieving a pathological complete response to neoadjuvant therapy come to fruition.