Main

Magnetic resonance imaging (MRI) has been proposed to have a role in guiding the extent of breast cancer surgery by measuring the size of the residual tumour after neoadjuvant chemotherapy (NAC), and has been shown to have good sensitivity for detecting residual disease in that setting (Marinovich et al, 2013). Given that current guidelines for response evaluation recommend assessment of the largest tumour diameter (Eisenhauer et al, 2009), estimation of the largest diameter by MRI may guide decisions about whether subsequent mastectomy or breast conserving surgery (BCS) should be attempted, as well as assist in planning the resection volume needed to achieve clear surgical margins in BCS. Underestimation of tumour size may therefore lead to involved surgical margins and repeat surgery; overestimation may lead to overly radical surgery (including mastectomy when BCS may have been possible) and poorer cosmetic and psychosocial outcomes (Irwig and Bennetts, 1997).

The assessment of tumour size before surgery is subject to a number of potential errors (Padhani and Husband, 2000). Reactive inflammation, fibrosis or necrosis in response to NAC may present as areas of enhancement on MRI images, which may be difficult to distinguish from residual tumour (Yeh et al, 2005; Belli et al, 2006). Regression of the tumour as multiple, scattered tumour deposits may also make assessment of the longest diameter problematic, with different approaches to measurement that either include (Rosen et al, 2003; Wright et al, 2010) or exclude (Cheung et al, 2003; Bollet et al, 2007) intervening normal tissue. Ductal carcinoma in situ (DCIS) may not be well visualised (Berg et al, 2010) or, alternatively, may be indistinguishable from invasive cancer (Partridge et al, 2002). Imaging artefacts may also introduce errors in tumour size estimation. For example, the placement of markers in or around the tumour may produce areas of increased signal intensity, which are difficult to distinguish from residual foci, or areas of low signal, which may contribute to size underestimation. Underestimation may also occur owing to partial volume effects (Lobbes et al, 2012). Furthermore, the inherently pliable nature of breast tissue means that tumour dimensions may vary, depending on patient positioning (Tucker, 2012).

In this systematic review and study-level meta-analysis, we investigate agreement in the measurement of residual tumour size by MRI and pathology (the reference standard) after NAC for breast cancer, as assessed by mean differences (MDs) and 95% limits of agreement (LOA) (Bland and Altman, 1986). We also compare the agreement between pathology and alternative tests which have been used to measure residual tumour before surgery (ultrasound (US), clinical examination and mammography). The consistency of results from different methods to assess agreement is investigated, and recommendations are made about methods for future studies.

Materials and methods

Identification of studies

A systematic search of the biomedical literature up to February 2011 was undertaken to identify studies assessing the accuracy of MRI after NAC in measuring the size of residual tumour. MEDLINE and EMBASE were searched via EMBASE.com; PREMEDLINE, the Database of Abstracts of Reviews of Effects, the Health Technology Assessment database (CLHTA) and the Cochrane databases were searched via Ovid. Search terms were selected to link MRI with breast cancer and response to NAC. Keywords and medical subject headings included ‘breast cancer’, ‘nuclear magnetic resonance imaging’, ‘MRI’, ‘neoadjuvant’ and ‘response’. The full search strategy has been reported previously (Marinovich et al, 2012, 2013). Reference lists were also searched and content experts consulted to identify additional studies.

Review of studies and eligibility criteria

All abstracts were screened for eligibility by one author (LM), and a sample of 10% was assessed independently by a second author (NH) to ensure consistent application of the eligibility criteria. Eligible studies were required to have enrolled a minimum of 15 patients with newly diagnosed breast cancer undergoing NAC, with MRI and at least one other test (US, mammography or clinical examination) undertaken after NAC to assess the size of residual tumour before surgery. Pathologically measured tumour size based on surgical excision was the reference standard, but studies were not excluded if alternative reference standards were used in a minority of patients.

Potentially eligible citations were reviewed in full (LM or NH). The screening and inclusion process is summarised in Supplementary Information Resource 1 (PRISMA flowchart).

Data extraction

Data relating to tumour size assessment, study design, patient characteristics, tumours, treatment, technical details of MRI, comparator tests and the reference standard were extracted independently by two authors (LM, and either SC, MB or FS). Quality appraisal was undertaken using the Quality Assessment of Diagnostic Accuracy Studies checklist (version 1, modified for this clinical setting; Whiting et al, 2003, 2006). Disagreements were resolved by discussion and consensus, with arbitration by a third author (NH) when required.

Measures of agreement

Bland and Altman (1986) describe appropriate methods to assess agreement between two continuous measures and highlight the inadequacy of Pearson’s correlation coefficient when used for this purpose. Unlike methods such as the intraclass correlation coefficient (ICC), Pearson’s correlation coefficient measures the degree to which there is a linear, but not necessarily 1 : 1, relationship. Hence, it is possible for a high Pearson’s correlation to be observed when there is poor agreement between two measures (e.g., when tests systematically under- or overestimate pathologic size). Spearman’s rank correlation is similarly problematic. A commonly reported alternative approach involves calculating the percentage of cases for which there is ‘agreement’ between measures within a chosen ‘margin of error’. This approach also has limitations: the chosen margin of error may be somewhat arbitrary, and tendencies for one measure to under- or overestimate the other within that margin may be obscured.
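This limitation is easy to demonstrate numerically. The sketch below (Python, hypothetical sizes in cm) constructs a test that overestimates every pathologic measurement by a fixed 1.5 cm: Pearson’s r is perfect, yet agreement is clearly poor.

```python
import statistics

# Hypothetical pathologic sizes (cm) and a test that overestimates each by 1.5 cm
pathology = [1.0, 1.8, 2.5, 3.2, 4.0, 4.6]
measured = [p + 1.5 for p in pathology]

# Pearson's r from the sample covariance and sample s.d.s
mx, my = statistics.mean(pathology), statistics.mean(measured)
cov = sum((x - mx) * (y - my)
          for x, y in zip(pathology, measured)) / (len(pathology) - 1)
r = cov / (statistics.stdev(pathology) * statistics.stdev(measured))

bias = statistics.mean(m - p for m, p in zip(measured, pathology))
# r is 1.0 (a perfect linear relationship) even though every measurement
# overestimates pathology by 1.5 cm -- correlation is blind to this bias
```
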

The approach recommended by Bland and Altman (1986) comprises a scatterplot of the differences between the measures (the vertical axis) against their mean (horizontal axis). If the differences are normally distributed and are independent from the underlying size of the measurements, agreement may be quantified by the MD and associated 95% LOA. Hence, MDs and LOA were extracted from studies reporting these outcomes. When LOA were not presented, data were extracted from which the LOA could be derived (e.g., s.d. of the difference or root mean square error). Despite their limitations, percentage agreement within a margin of error (and associated percentages of under/overestimation) and correlation coefficients were also extracted to provide a descriptive summary of these measures.
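For a single study, the Bland–Altman quantities can be computed as in the following sketch, which assumes hypothetical paired measurements in cm.

```python
import statistics

# Hypothetical paired tumour sizes (cm) for one study
mri = [2.1, 0.0, 3.5, 1.8, 4.2, 2.9]
pathology = [1.8, 0.4, 3.0, 2.2, 4.5, 2.4]

diffs = [m - p for m, p in zip(mri, pathology)]        # vertical axis of the plot
means = [(m + p) / 2 for m, p in zip(mri, pathology)]  # horizontal axis

md = statistics.mean(diffs)              # mean difference (systematic bias)
sd = statistics.stdev(diffs)             # s.d. of the differences
loa = (md - 1.96 * sd, md + 1.96 * sd)   # 95% limits of agreement
```

Plotting `diffs` against `means` before quoting `md` and `loa` checks the assumptions of normality and independence from underlying size.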

Statistical analysis

MDs between tumour size measurements by MRI or comparator tests and pathology were pooled by the inverse variance method, assuming a fixed effect, using RevMan 5.2 (The Nordic Cochrane Centre (Copenhagen), The Cochrane Collaboration, 2012) (http://ims.cochrane.org/sites/ims.cochrane.org/files/uploads/documents/revman/RevMan_5.2_User_Guide.pdf). The Cochrane Q statistic was used to assess whether statistically significant heterogeneity was present (significant at P<0.10), and the extent of heterogeneity was quantified by the I2 statistic (Higgins et al, 2003). To estimate the 95% LOA for a pooled MD, a pooled variance was computed under the assumption that the variance of the differences was equal across studies. The pooled variance was calculated as the weighted average of the within-study variances, weighted by the corresponding degrees of freedom for each study (i.e., an extension of the approach used for a two-sample Student’s t-test (Woodward, 1999)).
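The pooling described above can be sketched as follows; the per-study values (MD, s.d. of differences, n) are hypothetical, and the code is an illustrative re-implementation rather than RevMan’s.

```python
import math

# Hypothetical per-study summaries: (mean difference in cm, s.d. of differences, n)
studies = [
    (0.2, 2.1, 30),
    (-0.1, 2.4, 45),
    (0.3, 1.9, 25),
]

# Fixed-effect (inverse variance) pooling: each MD is weighted by the
# reciprocal of its variance, var(MD) = sd^2 / n
weights = [n / sd ** 2 for _, sd, n in studies]
pooled_md = sum(w * md for (md, _, _), w in zip(studies, weights)) / sum(weights)

# Cochrane Q and I^2 quantify between-study heterogeneity
q = sum(w * (md - pooled_md) ** 2 for (md, _, _), w in zip(studies, weights))
df = len(studies) - 1
i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Pooled variance of the differences, weighted by degrees of freedom
# (as in a two-sample t-test), from which the pooled 95% LOA follow
pooled_var = (sum((n - 1) * sd ** 2 for _, sd, n in studies)
              / sum(n - 1 for _, _, n in studies))
pooled_sd = math.sqrt(pooled_var)
loa = (pooled_md - 1.96 * pooled_sd, pooled_md + 1.96 * pooled_sd)
```
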

Results

Study characteristics

A total of 2108 citations were identified. Nineteen studies were eligible for inclusion in the systematic review (Weatherall et al, 2001; Balu-Maestro et al, 2002; Partridge et al, 2002; Rosen et al, 2003; Bodini et al, 2004; Chen et al, 2004; Londero et al, 2004; Julius et al, 2005; Montemurro et al, 2005; Yeh et al, 2005; Akazawa et al, 2006; Bollet et al, 2007; Segara et al, 2007; Bhattacharyya et al, 2008; Moon et al, 2009; Prati et al, 2009; Nakahara et al, 2010; Wright et al, 2010; Guarneri et al, 2011), reporting data on 958 patients undergoing MRI and/or comparator tests; MRI data were reported for 953 patients. Studies enrolled patients between 1998 and 2007 (median mid-point of recruitment 2002), and included a median of 38 patients with MRI data (range 12–195). Characteristics of included studies are summarised in Table 1. Study quality appraisal is summarised in Supplementary Information Resource 2.

Table 1 Summary of cohort, tumour, treatment and reference standard characteristics of included studies

MRI details

Technical characteristics of MRI are summarised in Supplementary Information Resource 3. The majority of studies used DCE-MRI (84.2%) with a 1.5-T magnet (73.7%). Dedicated bilateral breast coils were used in all studies in which the coil type was reported. All studies providing detail on contrast employed gadolinium-based agents, most commonly gadopentetate dimeglumine (68.4%), typically at the standard dose of 0.1 mmol per kg body weight (68.4%).

Reference standard

Pathology from surgical excision was the reference standard for all patients in all but one study (Bhattacharyya et al, 2008), in which the absence of residual tumour (pathologic complete response, pCR) in two patients was verified by localisation biopsy, representing 0.2% of patients included in all studies. Study-specific rates of pCR ranged from 0.0% to 28.6%, with a median of 14.3% (Table 1).

Mean differences between MRI and pathology

Six studies (Partridge et al, 2002; Akazawa et al, 2006; Segara et al, 2007; Prati et al, 2009; Wright et al, 2010; Guarneri et al, 2011) reported MDs and LOA between MRI and pathology (Supplementary Information Resource 4). All studies measured the longest tumour diameter, except for a study by Akazawa et al (2006) that measured the diameter along the plane connecting the nipple and the tumour centre. This study is therefore presented descriptively, but has been excluded from pooled analyses.

Meta-analysis of MDs between MRI and pathologic tumour measurement (Figure 1) showed a tendency for MRI to slightly overestimate pathologic tumour size, with a pooled MD of 0.1 cm (95% CI: −0.1 to 0.3 cm). There was no evidence of heterogeneity (I2=0%). Pooled LOA indicated that 95% of pathologic measurements fall within −4.2 to +4.4 cm of the MRI measurement.

Figure 1

Forest plot of mean difference (cm) between MRI and pathologic size (all studies).

Within-study comparisons of MRI versus US, clinical examination and mammography are presented in Supplementary Information Resource 4. In all but a single study, which showed similar, small tendencies for overestimation by MRI (0.16 cm) and US (0.06 cm) (Guarneri et al, 2011), the absolute values of MDs within studies were lower for MRI than for the alternative tests. Pooled MDs and 95% LOA are summarised in Table 2 and Figures 2, 3, 4. There was no evidence of heterogeneity for MRI in any of the analyses, or for US (all I2=0%). Pooled results from two studies (Segara et al, 2007; Guarneri et al, 2011) showed similar small overestimation of pathologic tumour size by MRI and US (MDs of 0.1 cm for both tests), with comparable LOA. Pooled MDs and LOA from two studies (Prati et al, 2009; Wright et al, 2010) were larger for mammography (0.4 cm, 95% LOA: −7.1 to 8.0 cm) than for MRI (0.1 cm, 95% LOA: −6.0 to 6.3 cm), with moderate heterogeneity in MDs for mammography (I2=39%). Pooling estimates for MRI and clinical examination across four studies (Partridge et al, 2002; Segara et al, 2007; Prati et al, 2009; Wright et al, 2010) resulted in substantial heterogeneity for the latter test (Q=20.59, df=3, P=0.0001; I2=85%); three studies reported that clinical examination underestimated pathologic tumour size, and one reported the reverse. Pooled MDs showed larger underestimation with wider LOA for clinical examination (−0.3 cm, 95% LOA: −5.3 to 4.7 cm) relative to MRI overestimation (0.1 cm, 95% LOA: −4.5 to 4.6 cm).

Table 2 Pooled MD and LOA (cm) restricted to studies comparing the respective tests (fixed effects)
Figure 2

Forest plots of mean difference (cm) between MRI or US and pathologic size (comparative studies).

Figure 3

Forest plots of mean difference (cm) between MRI or clinical examination and pathologic size (comparative studies).

Figure 4

Forest plots of mean difference (cm) between MRI or mammography and pathologic size (comparative studies).

Percentage agreement

Eight studies (Balu-Maestro et al, 2002; Rosen et al, 2003; Julius et al, 2005; Yeh et al, 2005; Akazawa et al, 2006; Segara et al, 2007; Nakahara et al, 2010; Guarneri et al, 2011) reported percentage agreement between tumour size measured by MRI and pathology within a variety of margins of error based on absolute size (±0, 0.5, 1, 2 and 3 cm) or a percentage of the pathologic measurement (±30 and 50%; Supplementary Information Resource 4). One study did not report the margin of error used to calculate agreement (Balu-Maestro et al, 2002), and two studies reported percentage agreement between MRI and pathology but not the associated percentages of MRI under/overestimation (Julius et al, 2005; Akazawa et al, 2006).

Studies reporting percentage agreement (plus under/overestimation) for MRI, US and clinical examination by an absolute margin of error are summarised in Figure 5 (no studies reported these data for mammography). As would be expected, percentage agreement between all tests and pathology was higher for wider margins of error (e.g., 20% for exact agreement between MRI and pathologic measurements (Segara et al, 2007; Guarneri et al, 2011) vs 92% for ±3 cm (Nakahara et al, 2010)). With the exception of one study showing a tendency for overestimation (Rosen et al, 2003), MRI appeared equally likely to overestimate and underestimate pathologic tumour size across all absolute margins of error. For US and clinical examination, a tendency towards underestimation can be observed in Figure 5, but the majority of estimates showing that bias were contributed by a single study (Segara et al, 2007).

Figure 5

Percentage agreement, underestimation and overestimation for (A) MRI, (B) US and (C) clinical examination by margin of error (cm).

Percentage agreement estimates for MRI based on any margin of error were compared with those of alternative tests in six studies (Supplementary Information Resource 4). All six studies compared MRI and US (Balu-Maestro et al, 2002; Julius et al, 2005; Yeh et al, 2005; Akazawa et al, 2006; Segara et al, 2007; Guarneri et al, 2011); MRI was compared with clinical examination in four studies (Balu-Maestro et al, 2002; Yeh et al, 2005; Akazawa et al, 2006; Segara et al, 2007) and with mammography in three studies (Balu-Maestro et al, 2002; Julius et al, 2005; Yeh et al, 2005). For all but one study and across the range of reported margins of error, percentage agreement estimates for MRI were higher than those for the comparator tests. In the one exception to this pattern, a study reporting multiple margins of error (Segara et al, 2007) found higher percentage agreement for MRI than for US at margins of ±0 and ±1 cm, but percentage agreement at ±2 cm was slightly higher for US (92%) than for MRI (88%). In one other study (Guarneri et al, 2011), the difference in percentage agreement favouring MRI over US was relatively small (20% vs 15% at ±0 cm; 54% vs 51% at ±0.5 cm; and 71% vs 68% at ±1 cm).

Correlation coefficients

Sixteen studies (Weatherall et al, 2001; Partridge et al, 2002; Rosen et al, 2003; Bodini et al, 2004; Chen et al, 2004; Londero et al, 2004; Montemurro et al, 2005; Akazawa et al, 2006; Bollet et al, 2007; Segara et al, 2007; Bhattacharyya et al, 2008; Moon et al, 2009; Prati et al, 2009; Nakahara et al, 2010; Wright et al, 2010; Guarneri et al, 2011) reported correlations between MRI and pathologic tumour size, and similar correlations for at least one alternative test, either by Pearson’s (N=9) or Spearman’s (N=5) method (in two studies (Weatherall et al, 2001; Partridge et al, 2002), the method was not specified). The range of correlation coefficients was wide (0.21–0.92), with a median value of 0.70 (Supplementary Information Resource 4). Coefficients between 0.20 and 0.39 were reported in two studies, 0.40–0.59 in four studies, 0.60–0.79 in six studies, and 0.80 and above in four studies. One study reported the ICC between MRI and pathology (0.48), in addition to Spearman’s rank coefficients (Bollet et al, 2007).

Six studies reported correlations with pathology of MRI and mammography (Weatherall et al, 2001; Bodini et al, 2004; Londero et al, 2004; Bollet et al, 2007; Prati et al, 2009; Wright et al, 2010), all of which reported consistently higher correlation coefficients for MRI. However, of the 10 studies that reported correlations with pathology of MRI and clinical examination (Weatherall et al, 2001; Partridge et al, 2002; Rosen et al, 2003; Bodini et al, 2004; Chen et al, 2004; Akazawa et al, 2006; Bollet et al, 2007; Segara et al, 2007; Prati et al, 2009; Wright et al, 2010), two found correlations favouring the latter test (Prati et al, 2009; Wright et al, 2010). Similarly, two (Nakahara et al, 2010; Guarneri et al, 2011) of 11 studies that presented correlations for MRI and US with pathology (Weatherall et al, 2001; Bodini et al, 2004; Londero et al, 2004; Montemurro et al, 2005; Akazawa et al, 2006; Bollet et al, 2007; Segara et al, 2007; Bhattacharyya et al, 2008; Moon et al, 2009; Nakahara et al, 2010; Guarneri et al, 2011) reported higher correlations for US.

Within-study comparisons of different methods

Seven studies (Partridge et al, 2002; Akazawa et al, 2006; Segara et al, 2007; Prati et al, 2009; Nakahara et al, 2010; Wright et al, 2010; Guarneri et al, 2011) compared the performance of MRI and other tests by more than one method. In four of those, different methods produced results that could potentially lead to inconsistent conclusions regarding agreement, depending on which measure is considered. In two (Prati et al, 2009; Wright et al, 2010) of the six studies that presented both MDs and correlations, the absolute value of the MD was lower for MRI (0.3 cm) than for clinical examination (1.2 cm), but a higher correlation was observed between clinical examination and pathologic size. The 95% LOA for MRI were wider than those for clinical examination, reflecting the lower correlation for MRI. Similarly, in two of three studies presenting MDs and percentage agreement, the methods suggest opposing conclusions. Guarneri et al (2011) found a larger MD and wider LOA for MRI compared with US, but slightly higher percentage agreement, whereas Segara et al (2007) reported the reverse (for agreement within 2 cm only). In addition, the slightly higher percentage agreement for MRI than US reported by Guarneri et al (2011) contrasts with a lower correlation coefficient, and vice versa for Segara et al (2007) (for agreement within 2 cm only).

Discussion

In the neoadjuvant setting, accurate information on the extent of residual malignancy assists in guiding surgical management of breast cancer. We pooled estimates of the MD between residual tumour size measured by MRI and pathology from six studies, and found that, on average, MRI had a tendency to slightly overestimate pathologic size after NAC (MD of 0.1 cm; Figure 1). However, the pooled 95% LOA around this estimate suggest that pathologic tumour measurements may lie within −4.2 to +4.4 cm of the MRI measurement, indicating that substantial disagreement may exist. Measurement errors within this range may be of clinical importance in terms of their implications for the choice of treatment approach.

Our analysis of the relative performance of MRI and alternative tests focused on studies directly comparing the tests against pathology (Bossuyt and Leeflang, 2008). Although only two studies reported MDs with pathologic measurements for both MRI and US, pooled estimates suggested that the tests had a similar tendency to overestimate pathologic size, with comparable LOA. The tendency to overestimate pathologic size was greater for mammography than for MRI (two studies). Although significant heterogeneity was present in clinical examination findings, three of four studies reported the same direction of effect (underestimation) for this test. Pooled MDs showed clinical examination’s bias towards underestimation to be greater than MRI’s bias towards overestimation, and within all four studies the absolute values of MDs were larger for clinical examination. Compared with MRI, wider LOA were observed for both clinical examination and mammography, suggesting greater variability in those tests’ agreement with pathologic measurements. The LOA for all of the alternative tests were large enough to be of potential clinical significance.

Previous summaries of the literature about MRI’s accuracy in measuring residual tumour size have quoted correlations between MRI and pathology, and the percentage of cases in which MRI agrees with, underestimates, or overestimates pathologic measurements. Overall, correlations were considered to be ‘good’ (Lobbes et al, 2013), and the statistical significance of those correlations was emphasised (McLaughlin and Hylton, 2011). The methodological limitations of that approach are well documented (Bland and Altman, 1986, 1990). The variable overestimation and underestimation described in those overviews has led others to attach caveats about inaccurate measurement to conclusions about the value of MRI in measuring residual tumour size (Sardanelli et al, 2010; McLaughlin and Hylton, 2011). This inconsistency reflects an evidence base that is extensive but disparate in terms of the methods used to assess agreement, and highlights uncertainty about drawing meaningful conclusions from the literature.

Pearson’s and Spearman’s rank correlation coefficients were the most commonly reported statistics in our review (in contrast to MDs and LOA, the more appropriate statistics, yet the least reported). These correlation coefficients, which do not measure agreement (Bland and Altman, 1986), varied widely and were commonly inconsistent with more appropriate measures reported in the same study. Intraclass correlation, which does assess the degree to which a 1 : 1 relationship between measurements exists, was presented for MRI and pathology in just one study and was not reported for comparator tests (Bollet et al, 2007). The ICC may be an adjunct to the analyses recommended by Bland and Altman (1986), but this statistic alone is also limited in the extent to which it assesses agreement, as it is dependent on the range of observed values and does not separate systematic from random error (Bland and Altman, 1990).
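To illustrate how the ICC penalises systematic bias that Pearson’s r ignores, the sketch below computes a one-way random-effects ICC for hypothetical data in which a test overestimates every pathologic size by 1.5 cm (so Pearson’s r is a perfect 1); this is a generic estimator, not necessarily the one used by Bollet et al (2007).

```python
import statistics

# Hypothetical data: a test overestimating every pathologic size (cm) by 1.5 cm
pathology = [1.0, 1.8, 2.5, 3.2, 4.0, 4.6]
measured = [p + 1.5 for p in pathology]

# One-way random-effects ICC(1,1) for paired measurements (k = 2 per subject):
# (MSB - MSW) / (MSB + (k-1)*MSW), from between- and within-subject mean squares
n, k = len(pathology), 2
subject_means = [(p + m) / 2 for p, m in zip(pathology, measured)]
grand = statistics.mean(pathology + measured)

msb = k * sum((s - grand) ** 2 for s in subject_means) / (n - 1)
msw = sum((p - s) ** 2 + (m - s) ** 2
          for p, m, s in zip(pathology, measured, subject_means)) / (n * (k - 1))
icc = (msb - msw) / (msb + (k - 1) * msw)  # ~0.53 here, despite Pearson r = 1
```

The systematic 1.5 cm offset inflates within-subject variability, pulling the ICC well below 1; note the ICC still depends on the spread of tumour sizes in the sample, which is the range-dependence limitation noted above.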

The percentage of MRI measurements which ‘agree’ with pathology within a ‘margin of error’ may provide useful information to supplement MDs and LOA. However, the studies in our review varied considerably in the tolerated discrepancy between measures used to define ‘agreement’, reflecting the somewhat arbitrary nature of an ‘acceptable’ error. Furthermore, studies differed in the methods of calculating that discrepancy (i.e., absolute or relative differences), and accompanying percentages of under- or overestimation by MRI were not universally reported. This lack of consistency between studies renders the body of evidence difficult to interpret; future studies can facilitate comparability by reporting agreement, underestimation and overestimation for multiple margins of error, starting with exact agreement and increasing in 1 cm increments. In contrast to our pooled analysis of MDs showing that MRI has a tendency to slightly overestimate pathologic size, studies describing an absolute margin of error suggested that MRI was equally likely to under- and overestimate the pathologic measurement, highlighting that this method may obscure small measurement biases.
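The reporting format suggested above (agreement, under- and overestimation at exact agreement and increasing 1 cm margins) can be sketched as a small helper; the data are hypothetical.

```python
def agreement_summary(test_sizes, path_sizes, margin):
    """Percentage agreement, under- and overestimation within an
    absolute margin of error (cm)."""
    agree = under = over = 0
    for t, p in zip(test_sizes, path_sizes):
        d = t - p
        if abs(d) <= margin:
            agree += 1
        elif d < 0:
            under += 1
        else:
            over += 1
    n = len(test_sizes)
    return {"agree": 100.0 * agree / n,
            "under": 100.0 * under / n,
            "over": 100.0 * over / n}

# Hypothetical sizes (cm), tabulated from exact agreement upwards in 1 cm steps
mri = [2.1, 0.0, 3.5, 1.8, 4.2, 2.9]
pathology = [1.8, 0.4, 3.0, 2.2, 4.5, 2.4]
rows = {m: agreement_summary(mri, pathology, m) for m in (0.0, 1.0, 2.0)}
```

Reporting all three percentages at each margin avoids the problem noted above, in which under- and overestimation within the tolerated margin are hidden by a single ‘agreement’ figure.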

Studies of the agreement between imaging and pathologic size have inherent limitations. Although pathology is considered to be the ‘gold standard’, a variety of potential errors in pathologic measurement have been identified (Lagios, 2005; Provencher et al, 2012; Tucker, 2012), meaning that discrepancies with pathology may occur even when residual tumour size is accurately assessed before surgery. For example, pathologic diameters are likely to be overestimated when measured from a combination of tumour fragments, or excised and re-excised specimens (Lagios, 2005). There may also be errors in orientating intact specimens so that tumour diameters on imaging and pathology are measured in the same plane (Provencher et al, 2012), particularly if three-dimensional imaging data are unavailable to the pathologist (Weatherall et al, 2001; Tucker, 2012); this could result in pathologic measurements underestimating the longest diameter for irregularly shaped tumours (Lagios, 2005). There also exists the possibility that the process of removal, preparation or measurement of the pathologic specimen may shrink, expand or otherwise distort tumour dimensions (Pritt and Weaver, 2005; Pritt et al, 2005; Behjatnia et al, 2010; Provencher et al, 2012). Furthermore, the inclusion or exclusion of residual DCIS in pathologic measurements has the potential to affect estimates of agreement. Pooled MDs between pathology and MRI or alternative tests (and the associated LOA) must therefore be interpreted with awareness of these issues. However, if errors in the pathologic measurement are random and do not favour MRI over the comparators (or vice versa), these estimates allow for valid comparisons (Glasziou et al, 2008). 
Although this assumption may be reasonable when MRI, comparator tests and pathology are undertaken in the same patients, four (Partridge et al, 2002; Segara et al, 2007; Prati et al, 2009; Wright et al, 2010) of the six studies reporting MDs excluded patients from one or more testing groups, with exclusions ranging from a single patient (2%) to 26% of patients with MRI data being excluded from analyses of comparator tests (Supplementary Information Resource 4).

Furthermore, differences in test performance may be observed if tumour size is estimated better (or more poorly) in patients selected for (or excluded from) a particular testing group. Authors should be encouraged to present data that allow agreement to be assessed separately for patients unique to particular analyses and for those common to all testing groups. These issues also highlight the importance of clearly describing the characteristics of patients excluded from particular analyses. The presentation of important study design characteristics in included studies was generally suboptimal; in particular, reporting of study withdrawals or exclusions (when they did occur) was poor (Supplementary Information Resource 2).

An important consideration in the interpretation of pooled MD and LOA estimates is that they may be misleading if the difference between tests is systematically related to underlying tumour size, or if the differences are not normally distributed (Bland and Altman, 1986). Plots of the differences against their mean allow any underlying relationships to be assessed, but were presented in only half of the studies reporting MDs (Partridge et al, 2002; Segara et al, 2007; Wright et al, 2010). Examination of the plots presented in these studies suggests that the difference between pathology and MRI (or alternative tests) may be greater for larger tumour sizes. Careful attention should be given to graphical presentation of the data before calculating MDs, and data transformation should be considered when systematic relationships exist (Bland and Altman, 1986).

A possible limitation of our analysis is that many studies were not recent, and consequently newer neoadjuvant treatments, including taxanes and trastuzumab, were used in only a minority of patients (Table 1). Agreement between MRI and pathology may vary because of different patterns of tumour regression between taxane-based and non-taxane-based NAC; contrary to previous findings suggesting underestimation when taxanes are used (Denis et al, 2004), MDs in studies that used predominantly taxane-based NAC (Wright et al, 2010; Guarneri et al, 2011) suggest overestimation by MRI relative to studies using non-taxane-based regimens (Segara et al, 2007; Prati et al, 2009; Supplementary Information Resource 4). Increased rates of pCR owing to modern regimens may also potentially affect MD and LOA estimates, but examination of this issue was not possible owing to the small number of studies reporting those outcomes.

In summary, our meta-analysis is the first to explore and summarise the evidence on agreement between MRI and pathologic tumour measurements after NAC, and to highlight methodological issues which, to date, have precluded conclusions being drawn from the literature. Our work suggests a tendency for MRI to slightly overestimate pathologic tumour size, but the LOA are large enough to be of potential clinical importance. Few studies compared MDs between tests and pathology, but the performance of US appeared to be comparable to that of MRI; poorer agreement was observed for mammography and clinical examination. Although a large number of studies have addressed these questions, most have reported Pearson’s or Spearman’s correlation coefficients. Those measures are inappropriate for assessing agreement, and have contributed to uncertainty about MRI’s potential role. Further studies are warranted that adopt the Bland–Altman approach to assessing MRI’s agreement with pathology, and that also assess the agreement of alternative tests with pathology; in addition, we have recommended methods of data presentation to assess the validity of comparisons between tests. Percentages of agreement and associated under/overestimation have limitations, but may provide useful data to supplement Bland–Altman analyses. Similarly, ICCs may also supplement these analyses, but Pearson’s and Spearman’s correlations should be avoided.