Predicting breast cancer response to neoadjuvant treatment using multi-feature MRI: results from the I-SPY 2 TRIAL

Dynamic contrast-enhanced (DCE) MRI provides both morphological and functional information regarding breast tumor response to neoadjuvant chemotherapy (NAC). The purpose of this retrospective study is to test if prediction models combining multiple MRI features outperform models with single features. Four features were quantitatively calculated in each MRI exam: functional tumor volume, longest diameter, sphericity, and contralateral background parenchymal enhancement. Logistic regression analysis was used to study the relationship between MRI variables and pathologic complete response (pCR). Predictive performance was estimated using the area under the receiver operating characteristic curve (AUC). The full cohort was stratified by hormone receptor (HR) and human epidermal growth factor receptor 2 (HER2) status (positive or negative). A total of 384 patients (median age: 49 y/o) were included. Results showed analysis with combined features achieved higher AUCs than analysis with any feature alone. AUCs estimated for the combined versus highest AUCs among single features were 0.81 (95% confidence interval [CI]: 0.76, 0.86) versus 0.79 (95% CI: 0.73, 0.85) in the full cohort, 0.83 (95% CI: 0.77, 0.92) versus 0.73 (95% CI: 0.61, 0.84) in HR-positive/HER2-negative, 0.88 (95% CI: 0.79, 0.97) versus 0.78 (95% CI: 0.63, 0.89) in HR-positive/HER2-positive, 0.83 (95% CI not available) versus 0.75 (95% CI: 0.46, 0.81) in HR-negative/HER2-positive, and 0.82 (95% CI: 0.74, 0.91) versus 0.75 (95% CI: 0.64, 0.83) in triple negatives. Multi-feature MRI analysis improved pCR prediction over analysis of any individual feature that we examined. Additionally, the improvements in prediction were more notable when analysis was conducted according to cancer subtype.


INTRODUCTION
An important advantage of neoadjuvant chemotherapy (NAC) over adjuvant therapy for locally advanced breast cancer is the ability to monitor treatment response, which allows informed adjustment of the treatment plan. Among imaging methods, magnetic resonance imaging (MRI) is the most accurate for assessing tumor response to NAC [1][2][3][4][5] . Results from the I-SPY 1 TRIAL (CALGB 150007/ACRIN 6657) found that functional tumor volume (FTV) predicted pathologic complete response (pCR) and recurrence-free survival 6,7 . Subsequently, serial measures of FTV during treatment are used in the adaptive randomization engine of the I-SPY 2 trial, designed to accelerate the evaluation of novel agents for breast cancer 8 . Pathologic complete response is the primary endpoint in I-SPY 2.
FTV represents the active portion of tumor volume, as defined by pharmacokinetic thresholds applied to dynamic contrast-enhanced MRI (DCE-MRI) 9 . While FTV has shown effectiveness for the prediction of pCR, there is still potential for improvement, especially in the setting of hormone-positive tumors 10 . Additional features can be derived from the same DCE-MRI data, including longest diameter, sphericity, and contralateral background parenchymal enhancement (BPE). These additional measures have also shown value for prediction of pCR [11][12][13][14] . Longest diameter is a standard clinical measurement used to assess tumor response, consistent with the Response Evaluation Criteria in Solid Tumors (RECIST) 15 . Sphericity is a three-dimensional shape feature previously found to be associated with pCR in the I-SPY 2 trial 11 . Several studies have shown the association of BPE with breast cancer risk in the screening setting, and decreased BPE has been found to be associated with pCR following neoadjuvant chemotherapy [12][13][14]16,17 .
This study investigated whether the predictive performance of MRI can be improved over FTV or any single feature alone by using a combination of features measured on DCE-MRI. By providing better prediction of response, MRI can advance personalized treatment and play an important role in assessing whether to change targeted therapies or proceed directly to surgical resection.

Patient characteristics
A total of 384 patients who had complete MRI data and pCR outcome were included in the analysis (see Fig. 1 for patient exclusion details and Table 1). Fig. 2 shows the bar charts for visual comparison and Fig. 3 shows the corresponding ROC curves for each AUC value. Combining multiple MRI features resulted in higher AUC compared to single features alone, in the full cohort and in each breast cancer subtype. In the full cohort, the AUC for the combined model was 0.81 (95% CI: 0.76-0.86), which exceeded the highest AUC achieved using a single-feature model (LD) at 0.79 (95% CI: 0.73-0.85). The p-value of the difference between the two AUCs was <0.001.
Although AUCs of the combined features were higher than those of individual measures in the full cohort and in subtype cohorts (p < 0.001), Fig. 3 shows their relationship on the full scale of sensitivity and specificity. The ROC curves of the combined predictors had greater separation from the ROCs of a single type of predictor for the subtype cohorts than the full cohort.

DISCUSSION
Given its robust correlation with long-term outcomes, pCR has increasingly become the clinical goal of NAC in locally advanced breast cancer. The ability to use non-invasive methods to accurately predict pCR early in the course of treatment has enormous clinical implications as it would permit personalized, evidence-based escalation or de-escalation of therapy. Our results showed that MRI functional tumor volume-based prediction of pathologic outcome following NAC can be improved using a combination of multiple features, as compared to a single feature alone. Importantly, each of these features can be measured from the same DCE-MRI dataset, requiring no additional image acquisitions.
In support of our findings, previous studies using combined MRI parameters have typically shown higher predictive performance for pCR compared to those using a single parameter. For example, Lee et al compared the ability of pre-treatment DCE-MRI perfusion imaging parameters to predict pCR in 74 breast cancer patients who were treated with NAC followed by surgery 18 . Their retrospective study concluded that the model combining perfusion parameters of contralateral breast background parenchyma and those of the tumor had higher predictive value than each single-parameter model. This also agrees with results published by Hylton et al, who performed a multivariable analysis of the DCE-MRI examinations of 162 women with breast tumors 3 cm or larger 6 , showing that a model combining MRI parameters (longest diameter, functional tumor volume, signal enhancement ratio) and clinical tumor size achieved the highest predictive accuracy for pCR.
Based on our study of HR/HER2 subtype, the improvement in predicting pCR by multi-feature MRI was more notable in individual subtypes than in the full cohort. More interestingly, imaging predictors included in the optimized model were different among subtypes, which indicates that some features may capture the treatment response better than others, depending upon the cancer subtype. For example, studies have shown that tumor sizes measured using MRI were less accurate in HER2+ compared to HER2− subtypes 19,20 . However, the decrease in BPE before and after NAC showed its association with pCR in HER2+ breast cancer 21,22 . Our study showed consistent results as FTV or LD yielded lower AUCs than SPH or BPE in the HR−/HER2+ subtype, where combining them into the prediction model can help improve the predictive performance. Four MRI features were included in this analysis. They were chosen by having demonstrated clinical relevance. However, there could be many other imaging features in MRI that could also potentially be predictive of pCR. With the advancement of computational technology, radiomics can extract a large number of features and machine-learning algorithms can be used to select biologically or physiologically meaningful features to predict cancer treatment outcomes. In our future studies, other radiomics features will be explored.
Among the four MRI features that we studied, FTV is an IDE-approved algorithm and a well-established imaging biomarker in the I-SPY 1 and 2 trials. Other features all have pitfalls and challenges. LD is a standardized and internationally recognized measurement reported in the ACR Breast Imaging Reporting and Data System (BI-RADS) 23 . However, LD can be subjective and may not capture the functional or physiological changes from treatment. In this study, BPE was calculated fully automatically and therefore avoided reader subjectivity. However, achieving a reliable and automated quantitative BPE measurement is still a challenge. Approximately 30% of the MRI examinations were excluded because of inadequate fibroglandular tissue segmentations. A more reliable quantitative BPE measurement in combination with higher overall image quality standards is needed. SPH is a morphologic measurement of tumor shape. According to its definition, a solid round-shaped tumor has a larger SPH than a diffuse tumor. However, SPH does not accurately differentiate tumor necrosis and multi-centric tumors. In addition, SPH is not measurable when tumor volume has reduced to a minimal residual. We observed better predictive performance by combining these features together than using any single feature alone, which indicates that deficiencies in the individual features may compensate for each other in the prediction of treatment response.
Our study has several limitations. First, all DCE-MRI data in I-SPY 2 were under well-managed assessment and control, but we still observed various quality issues such as different signal-to-noise ratios and insufficient fat suppression. These variations could affect the variability of our MRI feature measurements. Second, SPH was not calculable when FTV was close to zero. This limitation can cause the exclusion of good responders in our analysis. Third, even though we had the advantage of a large sample size for our study (n = 384), the patient population was not evenly distributed among cancer subtypes. In particular, the HR−/HER2+ subset had only 30 patients with 10 non-pCRs, which prevented us from obtaining a reliable 95% CI for the AUC in this sub-cohort. Fourth, because multiple agents were tested simultaneously in I-SPY 2, patients with the same HR/HER2 status could have received different agents and responded differently. In future analyses, drug efficacy should also be estimated as an independent variable in the prediction model when a larger sample size is available.
In conclusion, our study showed that MRI can provide quantitative information about tumor characteristics, and multi-feature analysis yielded better prediction of pathologic complete response than analysis of any of the single features we examined. The improvement in predictive performance was more notable when analysis was conducted by cancer subtype. Continued work to improve the reliability and predictive performance of individual features is currently underway, and further testing of the multi-feature model will be done in expanded I-SPY 2 cohorts.

Table note: HR, hormone receptor; HER2, human epidermal growth factor receptor 2. Unless otherwise specified, data in columns 2 and 3 are numbers of patients, with percentages in parentheses.

Patient population
Women 18 years of age and older diagnosed with locally advanced breast cancer (stage II or III, tumor ≥ 2.5 cm) are eligible to enroll in the I-SPY 2 trial (clinical trial number: NCT01042379; registration date: January 5, 2010) 24,25 . A total of 990 patients who enrolled in I-SPY 2 from May 2010 to November 2016 and were randomized to one of nine completed experimental drug arms or standard of care were considered in this retrospective study. Participants received 12 weekly cycles of paclitaxel alone (standard of care) or in combination with one of nine experimental agents, followed by four cycles of anthracycline-cyclophosphamide (AC) every 2-3 weeks, prior to definitive surgery (Fig. 4) 10 . Patients with HER2-positive cancer also received trastuzumab for the first 12 cycles. In some experimental drug arms, the experimental agent may substitute for one of the standard therapies (paclitaxel or trastuzumab). All participating sites received approval from their institutional review board. All patients provided written informed consent to participate in the study. Subsets of the patient cohort were included in previous studies 10,26,27 .

MRI acquisition and feature analysis
For each participant, MRI examinations occurred at four sequential time points: pre-treatment (T0, pre-NAC), after 3 cycles (T1, early NAC), after 12 cycles and between drug regimens (T2, mid-NAC), and before surgery (T3, post-NAC). All MRI examinations used DCE-MRI, performed according to the predefined I-SPY 2 MRI protocol (described in Supplementary Table 2). For each DCE-MRI examination, four features were assessed: functional tumor volume (FTV), sphericity (SPH), contralateral background parenchymal enhancement (BPE), and longest diameter (LD). FTV, SPH, and BPE were calculated using in-house software tools developed in the IDL software environment (Exelis Visual Information Solutions, Boulder, Colorado). The FTV method was subsequently replicated on a commercial platform that gained FDA IDE approval in 2010 for use in I-SPY 2 9,28 . LD was measured by the site radiologist and abstracted from clinical MRI reports by study coordinators at each site. Study coordinators, radiologists, and imaging scientists who worked on generating these features were blinded to pathologic outcomes.
FTV and SPH were calculated within a 3D volume-of-interest (VOI) defined by the site radiologist or trained imaging coordinator. Early percent enhancement (PE) and signal enhancement ratio (SER) maps were derived by $\mathrm{PE} = \frac{S_1 - S_0}{S_0} \times 100\%$ and $\mathrm{SER} = \frac{S_1 - S_0}{S_2 - S_0}$, where $S_0$, $S_1$, and $S_2$ are signal intensities at pre-contrast, early (approximately 2.5 minutes), and late (approximately 7.5 minutes) post-contrast, respectively. FTV was calculated by summing voxel volumes with PE ≥ 70% and SER ≥ 0. As previously described, a threshold different from 70% was applied for a small number of patients when necessary to account for variability in MRI systems and tumor enhancement pattern 9 . In these cases, adjusted thresholds defined at baseline were kept constant for all subsequent MRI examinations. SPH was defined as $\mathrm{SPH} = \frac{SA_0}{SA_{\mathrm{tumor}}}$, where $SA_{\mathrm{tumor}}$ is the surface area of the 3D FTV tumor mask and $SA_0$ is the surface area of a perfect sphere of the same volume. Tumor surface area was calculated using a surface meshing analysis. SPH values range from 0 to 1.0, with 1.0 representing a perfect sphere.
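The PE, SER, FTV, and SPH definitions above can be illustrated with a minimal Python sketch. This is not the authors' IDL software: the function names, the representation of voxels as (S0, S1, S2) tuples, and the handling of zero denominators are assumptions made for illustration, and the tumor surface area is taken as an input rather than computed by surface meshing.

```python
import math


def percent_enhancement(s0, s1):
    """Early percent enhancement: PE = (S1 - S0) / S0 * 100."""
    return (s1 - s0) / s0 * 100.0


def signal_enhancement_ratio(s0, s1, s2):
    """SER = (S1 - S0) / (S2 - S0); returns None if the denominator is zero."""
    denom = s2 - s0
    if denom == 0:
        return None  # assumption: undefined SER is treated as "not enhancing"
    return (s1 - s0) / denom


def functional_tumor_volume(voxels, voxel_volume, pe_threshold=70.0, ser_threshold=0.0):
    """Sum the volume of voxels with PE >= pe_threshold and SER >= ser_threshold.

    voxels: iterable of (S0, S1, S2) signal intensities inside the VOI.
    voxel_volume: volume of one voxel (any consistent unit, e.g. cc).
    """
    count = 0
    for s0, s1, s2 in voxels:
        if s0 <= 0:
            continue  # assumption: skip voxels where PE is undefined
        ser = signal_enhancement_ratio(s0, s1, s2)
        if ser is None:
            continue
        if percent_enhancement(s0, s1) >= pe_threshold and ser >= ser_threshold:
            count += 1
    return count * voxel_volume


def sphericity(volume, tumor_surface_area):
    """SPH = SA_0 / SA_tumor, with SA_0 the surface area of a sphere of equal volume."""
    radius = (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
    sphere_surface_area = 4.0 * math.pi * radius ** 2
    return sphere_surface_area / tumor_surface_area
```

Under these assumptions, a perfect sphere yields an SPH of 1.0 and a cube yields about 0.806, consistent with the stated 0 to 1.0 range.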
BPE was defined as the mean PE of fibroglandular tissue in the contralateral breast. An automated segmentation algorithm was used to identify breast tissue boundaries and a fuzzy c-means clustering algorithm was applied to classify fibroglandular tissue from the segmented breast 29 . BPE was calculated by automatically averaging over the tissue in five continuous axial slices geometrically centered in the superior-inferior direction to characterize tissue in the center of the breast. Illustrations of measuring FTV, LD, SPH, and BPE are shown in Supplementary Fig. 1.
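As a rough illustration of the BPE averaging step, the sketch below takes a PE map and a fibroglandular-tissue mask for the contralateral breast and averages PE over five contiguous central axial slices. The nested-list layout (`volume[slice][row][column]`), the function name, and the choice of the middle slice index as the geometric center are assumptions; the segmentation and fuzzy c-means steps are not reproduced here.

```python
def contralateral_bpe(pe_volume, fgt_mask, n_slices=5):
    """Mean PE over fibroglandular voxels in the central axial slices.

    pe_volume: nested list [slice][row][col] of percent enhancement values.
    fgt_mask:  nested list of the same shape, truthy where the voxel was
               classified as fibroglandular tissue.
    Returns None when no fibroglandular voxel falls in the selected slices.
    """
    n_total = len(pe_volume)
    center = n_total // 2          # assumption: volume spans the breast extent
    half = n_slices // 2
    start = max(0, center - half)
    stop = min(n_total, center + half + 1)

    values = []
    for z in range(start, stop):
        for pe_row, mask_row in zip(pe_volume[z], fgt_mask[z]):
            for pe, is_fgt in zip(pe_row, mask_row):
                if is_fgt:
                    values.append(pe)
    if not values:
        return None
    return sum(values) / len(values)
```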

Pathologic outcome
pCR was defined as the absence of residual invasive disease in the breast and axillary lymph nodes after NAC, measured at surgery. Histopathologic analysis was performed by site pathologists.

Statistical analysis
Baseline values and percentage changes from baseline were computed for each feature and treated as independent variables in the logistic regression model using binary pCR outcome (1: pCR; 0: non-pCR) as the dependent variable. The area under the curve (AUC) for the receiver operating characteristic (ROC) was used to assess the predictive performance, with 100 repeats of 5-fold cross-validation applied to avoid biased estimates of classification accuracy. The 95% confidence interval (CI) of the cross-validated AUC was estimated using 1,000 bootstrap resamples. P-values of variables in the logistic regression model were estimated by the likelihood-ratio chi-squared test of nested models, with and without the variable being tested. This retrospective analysis was restricted to patients with all four MRI features available at all treatment time points.
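Two of the quantities named above, the percentage change from baseline and the AUC, can be sketched in a few lines of Python. This is only an illustration and not the authors' R pipeline: the function names are hypothetical, the AUC is computed with the standard pairwise (Mann-Whitney) formulation, and the logistic regression, cross-validation, and bootstrap CI steps are omitted.

```python
def percent_change(baseline, follow_up):
    """Percentage change of a feature relative to its baseline (T0) value."""
    return (follow_up - baseline) / baseline * 100.0


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted probabilities (or any ranking score) for pCR.
    labels: 1 for pCR, 0 for non-pCR.
    Each (pCR, non-pCR) pair counts 1 if the pCR case scores higher,
    0.5 if tied, and 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one pCR and one non-pCR case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, a feature that falls from 20.0 at baseline to 5.0 at a later time point has a percentage change of -75%.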
Logistic regression models were built using single versus combined MRI features. For single-feature (i.e., FTV, SPH, BPE, or LD) analysis, optimized models were built by selecting variables (from the baseline measure and the percentage changes at T1, T2, and T3 relative to baseline) as independent variables in the logistic regression analysis, and by choosing the model that achieved the highest cross-validated AUC as described above. For the combined method, all variables from four MRI features available at all treatment time points up to T3 were subject to the variable selection. For single and combined analyses, optimized models were created separately in the full patient cohort and in each of the four breast cancer subtypes defined by HR/HER2 status. Subtype was added as an additional independent categorical variable in the regression model for the full cohort. The Wilcoxon rank-sum test and Fisher's exact test were used to assess differences by age, HR/HER2 subtype, race, menopausal status at the start of NAC, and treatment (experimental versus standard chemotherapy). AUCs of ROC curves were compared by bootstrapping with 2,000 replicates using a two-sided test.
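The bootstrap comparison of two AUCs can be sketched as follows. This is a simplified, assumption-laden illustration rather than the 'pROC' implementation the authors used: it resamples pCR and non-pCR cases separately (stratified), takes the standard deviation of the bootstrap AUC differences, and converts the resulting z statistic into a two-sided p-value from the normal distribution. All function and parameter names are hypothetical.

```python
import math
import random


def auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(scores_a, scores_b, labels, n_boot=2000, seed=0):
    """Two-sided bootstrap test for the difference between two paired AUCs.

    Returns (observed_difference, p_value); p_value is None when the
    bootstrap differences have zero spread (e.g. identical predictors).
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    observed = auc(scores_a, labels) - auc(scores_b, labels)

    diffs = []
    for _ in range(n_boot):
        # Stratified resampling keeps the pCR / non-pCR counts fixed.
        idx = [rng.choice(pos_idx) for _ in pos_idx]
        idx += [rng.choice(neg_idx) for _ in neg_idx]
        y = [labels[i] for i in idx]
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(a, y) - auc(b, y))

    mean = sum(diffs) / n_boot
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n_boot - 1))
    if sd == 0:
        return observed, None
    z = observed / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return observed, p_value
```

Fixing the seed makes the sketch reproducible; in practice the number of replicates (2,000 in this study) controls the stability of the estimated standard deviation.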
Statistical analyses were performed using R version 3.4.1 (R Foundation for Statistical Computing, Vienna, Austria), where the 'caret' package was used for logistic regression analyses 30 , the 'pROC' package for ROC analyses 31 , and the 'boot' package for calculating 95% CIs for cross-validated AUCs 32,33 . All tests were considered nominally statistically significant when p < 0.05.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The data generated and analyzed during this study are described in the following data record: https://doi.org/10.6084/m9.figshare.12912191 34 . The datasets are as follows: the original acquired and derived MRI DICOM data, under the title "I-SPY 2 MRI Collection", and an Excel file called "Multi-feature MRI NACT Data.xlsx". These will be deposited and be publicly available in NCI The Cancer Imaging Archive (TCIA): https://www.cancerimagingarchive.net/. However, due to technical limitations with the deposition and curation of the data, their release date is anticipated to be late 2020. When they become available, this metadata record associated with this article 34 will be updated to version 2 to link the TCIA data DOI. In the meantime, please contact the corresponding author with data queries.
Received: 12 March 2020; Accepted: 21 October 2020

Fig. 3 (caption fragment) [...] Table 2. MRI features include functional tumor volume (FTV), sphericity (SPH), background parenchymal enhancement (BPE), and longest diameter (LD). ROCs were plotted in the full cohort and in sub-cohorts defined by hormone receptor (HR) and human epidermal growth factor receptor 2 (HER2) status.

Fig. 4 I-SPY 2 study schema and adaptive randomization. Patients were randomized to the standard (paclitaxel for human epidermal growth factor receptor 2 [HER2]-negative or paclitaxel plus trastuzumab for HER2-positive) or one of the experimental drug arms. Participants received a weekly dose of paclitaxel alone (standard) or in combination with an experimental agent for 12 weekly cycles followed by four (every 2-3 weeks) cycles of anthracycline-cyclophosphamide (AC) prior to surgery. MRI examinations were performed at pre-neoadjuvant chemotherapy (NAC) (T0), early NAC (T1), mid-NAC (T2), and post-NAC (T3).