Introduction

Neoadjuvant chemotherapy (NAC) is increasingly used in the management of early and/or locally advanced breast cancer1. NAC has the following potential benefits for breast cancer therapy: preventing micrometastasis preoperatively to increase overall survival, downsizing tumors to make surgery feasible, and providing an in vivo assessment of sensitivity to chemotherapy2,3,4. After NAC, achievement of a pathologic complete response (pCR) has been considered a metric for the success of NAC and a powerful surrogate indicator of the risk of recurrence and long-term survival5,6.

Recently, many studies have been conducted to identify clinical and pathologic features that predict pCR prior to NAC7,8,9. As a monitoring tool, breast magnetic resonance (MR) imaging has been shown to be the most effective modality for predicting response to NAC10. Several recent studies using radiomics methodology, in which features are extracted computationally from breast MR images acquired after the first cycle of NAC or even before treatment, have shown promise in predicting pCR11,12,13. However, radiomic studies are of limited value because of the poor reproducibility of radiomics features, which can be affected by differences in radiomics software or MR acquisition parameters.

Deep learning could be applied to overcome the limitations of previous studies. Deep learning extracts all possible features from a dataset by learning its representation with minimal human intervention14. In medical image analysis, convolutional neural networks (CNNs) have shown high performance in classification and segmentation tasks15,16,17,18. Several attempts have been made to apply deep learning to breast MR images to predict pCR to NAC in patients with breast cancer19,20,21. Recent studies on breast cancer diagnosis have reported that the performance of deep learning models improves when a fusion approach is applied that integrates features from MR images with different views, scales, or acquisition protocols, together with clinical information15,22,23,24. However, previous studies were conducted with relatively small numbers of patients, limited clinical information, or only a few selected images that did not cover the entire tumor. This motivates in-depth research on the application of deep learning to fuse MR images with clinical information for predicting pCR to NAC in patients with breast cancer.

This study aimed to develop a deep learning model that fuses high-dimensional MR image features with clinical information for the pretreatment prediction of pCR to NAC in breast cancer. We designed a deep neural network architecture that combined 3D CNNs based on ResNet-50 for MR images with fully connected (FC) layers for clinical information.

Results

Patients’ outcomes and characteristics

Table 1 shows the general characteristics of the patients enrolled in this study. Overall, 133 (24.8%) patients achieved pCR and 403 (75.2%) did not. More than half of the patients were ER negative (n = 299, 55.8%), PR negative (n = 342, 63.8%), and HER2 negative (n = 359, 67.0%). Most patients were treated with the AC-T regimen (n = 367, 68.5%).

Table 1 General characteristics of the study population.

Model performance for predicting pCR

The entire patient group was randomly divided into a training set (n = 429) and a validation set (n = 107). There was no significant difference between the training and validation sets except for T stage (P = 0.047) (Supplementary Table S1). Model performances are shown in Table 2 and Fig. 1. The deep learning model trained on paired T1-weighted (T1W) subtraction images, T2W images, and clinical information demonstrated the highest accuracy for predicting pCR in the validation set (area under the curve [AUC] = 0.888). The deep learning model trained on clinical information alone achieved a validation AUC of 0.827. Comparison of the AUCs showed a significant difference between the two models (P < 0.05). We investigated the relative importance of each clinical variable as input to the deep learning model (Supplementary Figure S1). The type of NAC regimen and the Ki-67 level showed relatively high importance compared with the other clinical variables. The model trained on T1W subtraction images and clinical information (AUC = 0.848) outperformed the model trained on clinical information alone. As shown by the ROC curves for these three deep learning models in Fig. 1A, performance improved as more types of data were used for training. Confusion matrices showing the actual and predicted classifications for each model are provided in Supplementary Figure S2.

Table 2 Performance of seven deep learning models in a validation dataset in predicting pCR.
Figure 1

Receiver operating characteristic (ROC) curves showing the AUC values of different deep learning models in the validation set. (A) ROC curves for the pretreatment prediction of pCR by deep learning models trained on clinical information with MR images or on clinical information alone. T1 + T2 + C: T1W subtraction images, T2W images, and clinical information. T1 + C: T1W subtraction images and clinical information. C: clinical information. (B) ROC curves for the pretreatment prediction of pCR by deep learning models trained on different combinations of MR images. T1 + T2: T1W subtraction images and T2W images. T1: T1W subtraction images. T2: T2W images. T1 (lesion): cropped images of the lesion in T1W subtraction images.

Figure 1B shows the ROC curves for the deep learning models trained on different combinations of MR images in the validation set. The models trained on T1W subtraction images and on T2W images achieved AUCs of 0.725 and 0.663, respectively. When cropped T1W subtraction images of the lesion were used as the dataset, performance decreased to an AUC of 0.624, a significant difference from the model trained on uncropped T1W subtraction images (P < 0.05). Compared with all deep learning models trained on a single MR imaging sequence, the model that integrated T1W subtraction and T2W MR images achieved higher prediction performance (AUC = 0.745).

Discussion

In this study, we present a deep learning model capable of predicting pCR to NAC in patients with breast cancer based on pretreatment MR images. We hypothesized that using both clinical information and deep features extracted from breast MR images would improve pCR prediction compared with using clinical information or MR images alone. Our experimental results show that the multimodal fusion approach combining clinical information, T1W subtraction images, and T2W images was the best method for pretreatment pCR prediction (AUC = 0.888). The improvement in model performance as more data types were added is consistent with previous findings that multiparametric MR images reflect diverse information about the tumor25. Although the clinical information-only model performed worse than the MRI-clinical information fused model, its performance could be improved if more dedicated variables were available, given the scalable nature of clinical variables in contrast to radiomic features extracted from images.

Our study has three notable strengths compared with earlier work on predicting pCR from breast MR images using deep learning. First, we included 3D bilateral whole MR images covering the axilla and chest wall to capture information outside the lesion. To the best of our knowledge, this is the first application of a 3D-CNN model to extract features from 3D bilateral whole MR images. Furthermore, we confirmed that the model trained on entire images outperformed the model trained on cropped MR images, as adopted in previous studies. This may be because the entire image includes abnormal findings in the axilla or other organs that are invisible to the human eye or easily missed. Second, we did not need manual or automatic segmentation to extract the tumor area. Additionally, there was no constraint on including patients with multiple masses or non-mass enhancement type tumors in the dataset. Our method is therefore less labor intensive and more widely applicable. Lastly, we included 536 patients who satisfied the inclusion criteria for our cohort, which to our knowledge is the largest cohort used to develop a pCR prediction model from breast MR images. Regarding the impact of sample size on accuracy and reliability, Cho et al. studied the scale of the training dataset needed to achieve high performance in a medical image deep learning system26. They trained a CNN to classify axial computed tomography (CT) images into six anatomical classes (brain, neck, shoulder, chest, abdomen, and pelvis) and showed that increasing the training dataset size increases the performance of the deep learning network, although accuracy did not increase significantly between training sizes of 100 and 200. With 200 patients in the training set, the average classification accuracy across the six anatomical classes was 95.67%, and for neck and chest CT images it exceeded 99%. We are therefore confident that our results from 536 patients are sufficiently robust.

However, our study has limitations. First, although we confirmed that information extracted from the MR images improves pCR prediction performance, it is difficult to understand how the features extracted by the deep learning model affect the pCR classification. This interpretability problem is frequently raised when deep learning methods are applied in clinical settings27. Recently, several attention techniques that use the backward pass or the response of a feedforward propagation, such as Grad-CAM and the attention branch network28,29, have been proposed to provide explanations for deep learning predictions. In our study, however, it was difficult to apply visual explanation to the MR images to interpret the decision-making of the deep learning models because of the complicated model architectures required for multimodal fusion and feature extraction from 3D MR images. Future studies should apply methods that reveal not only the model’s prediction of pCR but also its basis. Second, this is a retrospective single-center study. Consequently, our MR images came from only two types of MR scanners purchased from a single vendor (Philips), which may limit the generalizability of our proposed deep learning models. Further studies validating our model on MR images from multiple institutions are needed. Third, more work is needed to improve the generalizability of our model for clinical use. Although the MRI-clinical information fused model performed best in our experiments, it is unrealistic to expect that both refined clinical information and two different MR sequences will be available for model inference in routine clinical settings, as they were under our experimental conditions. Considering that clinical information is readily available regardless of the medical institution, the clinical information-only model may perform worse than the fused model but generalize better.

Interestingly, when comparing the models trained with MR images against the clinical information-only model, sensitivity was higher in the clinical information-only model and specificity was higher in all models trained with MR images (Table 2). The clinical information-only model presumably predicts pCR relatively consistently by learning standardized clinical information such as subtype or receptor status. In the models trained with MR images, however, sensitivity appears to decrease owing to an increase in false negatives (interpreted as non-pCR but actually pCR), because chemotherapy-related changes such as fibrosis or inflammation may lead to incorrect predictions. In addition, the definition of pCR in our study was pT0/is N0, meaning that residual ductal carcinoma in situ (DCIS) was counted as pCR. Differentiating DCIS from invasive carcinoma on MR images is almost impossible, so residual DCIS could partly contribute to the decreased sensitivity. Results might differ under a different definition of pCR. This issue might be partly resolved if functional images such as diffusion MR images were added to the training dataset30. Further study is needed.

We selected T2W and T1W subtraction images as the MR sequences for this study because they are the sequences most commonly used by radiologists to diagnose breast cancer and they reflect tumor biology. If contrast-enhanced images from multiple post-contrast time points, which were not used in this study, were additionally applied, kinetic information could be leveraged for predicting pCR31. Moreover, adding diffusion-weighted imaging (DWI) could further improve the performance of the model. DWI is used to measure apparent diffusion coefficients (ADCs) in breast tissue. The ADC, a measure of the diffusivity of water within tissue, is sensitive to intratumoral changes induced by chemotherapy. According to the few clinical studies conducted in patients with breast cancer, DWI is valuable in predicting pCR to NAC10,13,32. As noted above, the sensitivity of the MR image-based models in our study was lower than that of the clinical variable-only model. We anticipate that this could be improved by adding functional images such as DWI to the training dataset in further studies, although strict standardization would be necessary to obtain consistent ADC values.

After further studies with larger numbers of patients and various types of MR images, the proposed system might be used as an assistive tool for pCR prediction based on volumetric MR images and clinical information. Our model could be embedded in software for MRI scanners to predict pCR, similar to MRI CAD (computer-aided detection) software33. From the viewpoint of studies of drug therapies in the NAC setting, assessment of pCR is the standard primary end point because pCR is the most important clinical characteristic indicating the risk of recurrence and long-term survival5,6. Based on the pCR prediction of the deep learning model, clinicians may change treatment plans for breast cancer during NAC to improve patient survival.

In summary, we improved the performance of a model predicting pCR to NAC by using both clinical information and pretreatment breast MR images in a multimodal fusion deep learning approach. Based on the results presented here, we believe that a multimodal fusion approach to pCR prediction can be adopted in the development of computer-aided diagnosis systems that use breast MR images. With further technological development, a pCR prediction model based on pretreatment MR images may be developed, and such a system could aid clinical decision-making for NAC in patients with breast cancer.

Methods

Study cohort

This retrospective study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board of Samsung Medical Center (SMC; Seoul, Korea), which waived the requirement for informed consent (IRB No. 2019-04-021). We reviewed patients with breast cancer who were treated with NAC at SMC between January 2010 and August 2018. The inclusion criteria were as follows: (1) NAC with no prior therapy; (2) unilateral biopsy-proven primary breast cancer; (3) surgery after completion of NAC at our institution; and (4) breast MR imaging within 1 month prior to initiation of therapy at our institution. Patients with synchronous or metachronous double primary cancer or bilateral breast cancer were excluded. Finally, we enrolled a total of 536 patients with breast cancer treated with NAC. The mean age of the patients was 45.2 years (range, 22–75 years; standard deviation, ± 9.9).

The NAC regimen was classified into four categories: adriamycin with cyclophosphamide (AC), AC followed by docetaxel (AC-T), AC-T plus platinum, and AC-T with trastuzumab. We treated patients with standard NAC regimens as follows: (1) adriamycin (60 mg/m2) with cyclophosphamide (600 mg/m2) iv every 3 weeks for six cycles; (2) adriamycin (60 mg/m2) with cyclophosphamide (600 mg/m2) iv every 3 weeks for four cycles, followed by a taxane iv every 3 weeks for another four cycles; or (3) docetaxel (75 mg/m2) with carboplatin (area under the curve, 5.5). Patients with HER2-positive breast cancer received adriamycin (60 mg/m2) with cyclophosphamide (600 mg/m2) iv every 3 weeks for four cycles, followed by a taxane with trastuzumab (8 mg/kg at cycle 1, 6 mg/kg at cycles 2–4) iv every 3 weeks for another four cycles. After surgery, these patients received adjuvant trastuzumab every 3 weeks for a year.

The mean interval between treatment initiation and surgery was 197.0 days (range, 96–614 days); this interval varied depending on the NAC schedule.

MR image acquisition and preprocessing

The breast MR examination consisted of turbo spin-echo T1W and T2W sequences and a 3D dynamic contrast-enhanced (DCE) sequence for each patient. T2W and contrast-enhanced T1W subtraction MR images were retrieved from the picture archiving and communication system and loaded onto a workstation for further analysis. The subtraction images used in this study were generated by subtracting the pre-contrast images from the contrast-enhanced images acquired 90 s after contrast injection. All magnetic resonance imaging (MRI) scans were performed on either a 1.5-T or a 3.0-T scanner from Philips (1.5-T Achieva and 3.0-T Achieva; Philips Healthcare, Best, Netherlands) with a dedicated bilateral phased-array breast coil. The DCE-MRI scans on the 1.5-T scanner (n = 254) were acquired with the following parameters: TR/TE, 6.5/2.5; slice thickness, 1.5 mm; flip angle, 10°; matrix size, 376 × 374; and field of view, 32 × 32 cm. The DCE-MRI scans on the 3.0-T scanner (n = 282) were acquired with the following parameters: TR/TE, 5.5/2.8; slice thickness, 3 mm; flip angle, 12°; matrix size, 500 × 237; and field of view, 30 × 30 cm. On both scanners, DCE-MRI was performed axially with one pre-contrast and six post-contrast dynamic series, with contrast-enhanced images acquired at 0.5, 1.5, 2.5, 3.5, 4.5, and 5.5 min after contrast injection. Image subtraction was performed after the dynamic series: to obtain the DCE T1W subtraction image used as model input, the pre-injection images were subtracted from the images acquired 1.5 min after contrast injection. For dynamic contrast enhancement, a 0.1 mmol/kg bolus of gadobutrol (Gadovist; Bayer HealthCare Pharmaceuticals, Berlin, Germany) was injected, followed by a 20 mL saline flush.

We applied MR image preprocessing as follows: (1) interpolating the different voxel dimensions of the MR images to isotropic spacing (1.2 × 1.2 × 1.2 mm3) using nearest neighbor interpolation34; (2) normalizing the varying intensities of the DCE T1W and T2W MR images using a histogram-matching algorithm35; and (3) converting the normalized MR images from 12-bit to 8-bit grayscale (pixel values ranging from 0 to 255). All of the above preprocessing was performed with the SimpleITK Python package (version 1.2.0)36; a sketch of the pipeline is shown below.
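The following is a minimal sketch of this three-step pipeline using SimpleITK. The file names and the choice of a reference volume for histogram matching are hypothetical, as the original text does not specify them.

```python
import SimpleITK as sitk

def resample_isotropic(image, spacing=1.2):
    """(1) Resample a volume to isotropic voxels via nearest neighbor interpolation."""
    old_spacing, old_size = image.GetSpacing(), image.GetSize()
    new_size = [int(round(sz * sp / spacing)) for sz, sp in zip(old_size, old_spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(),
                         sitk.sitkNearestNeighbor, image.GetOrigin(),
                         (spacing, spacing, spacing), image.GetDirection(),
                         0, image.GetPixelID())

# Hypothetical inputs: a subject volume and a reference volume for histogram matching.
image = resample_isotropic(sitk.ReadImage("t1w_subtraction.nii.gz"))
reference = resample_isotropic(sitk.ReadImage("reference_case.nii.gz"))

# (2) Normalize intensities by matching the histogram to the reference volume.
matched = sitk.HistogramMatching(image, reference)

# (3) Rescale intensities to 8-bit grayscale (0-255).
image_8bit = sitk.Cast(sitk.RescaleIntensity(matched, 0, 255), sitk.sitkUInt8)
```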

To compare the performance of models trained with uncropped versus cropped images, cropped images of 56 × 56 × 12 voxels containing the lesions were obtained from the original DCE T1W subtraction images. The lesion locations were manually drawn slice by slice on the DCE T1W subtraction images by a breast radiologist with 14 years of experience reading breast MR images. The 3D cropped images were obtained around the center point computed from the lesion boundary, as sketched below.
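As an illustration, a minimal sketch of center-based cropping on a NumPy volume. The (z, y, x) axis order and the assumption that the patch lies fully inside the volume are ours, not stated in the original.

```python
import numpy as np

def crop_around_center(volume, center, size=(12, 56, 56)):
    """Extract a fixed-size 3D patch centered on the lesion center point.

    volume: 3D array in (z, y, x) order; center: lesion center voxel indices;
    size: patch size in (z, y, x). Assumes the patch fits inside the volume.
    """
    z0, y0, x0 = (c - s // 2 for c, s in zip(center, size))
    return volume[z0:z0 + size[0], y0:y0 + size[1], x0:x0 + size[2]]
```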

Clinicopathologic characteristics of patients

Pretreatment core biopsies were used to determine the receptor status of the tumor. pCR was defined as no residual invasive tumor in the breast and ipsilateral axilla (pT0/is N0). Patients were classified as pCR or non-pCR based on the pathology report after final surgery. We collected the following clinicopathologic characteristics: age, body mass index (BMI), menopausal status, histologic subtype, T stage, N stage, estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), Ki-67, cancer antigen (CA) 15-3, and NAC regimen. ER and PR scores were defined according to the Allred score. HER2 status was evaluated by immunohistochemical (IHC) staining, with grade 3+ considered positive, 0 and 1+ negative, and 2+ requiring further evaluation by silver in situ hybridization. CA 15-3 levels before NAC were obtained from the medical records. Clinical T and N stages were determined according to the TNM staging system of the American Joint Committee on Cancer, 7th edition35. For convenience, histologic subtypes of breast cancer were divided into two groups: invasive ductal carcinoma (IDC) and others.

For clinical information preprocessing, we normalized numeric features, such as age, BMI, ER score, PR score, T stage, N stage, and CA 15-3 level, to the range between 0 and 1 via min–max normalization. Additionally, we converted categorical features, such as HER2, Ki-67, histologic subtype, and NAC regimen, into one-hot vectors.
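A minimal sketch of these two transformations; the example values and category lists are hypothetical and shown only for illustration.

```python
import numpy as np

def min_max_normalize(values):
    """Scale a numeric feature to [0, 1] via min-max normalization."""
    x = np.asarray(values, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def one_hot(values, categories):
    """Convert a categorical feature into one-hot vectors."""
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1.0
    return encoded

# Hypothetical usage: ages in years and the four NAC regimen categories.
ages = min_max_normalize([45, 52, 38, 61])
regimens = one_hot(["AC-T", "AC", "AC-T plus platinum"],
                   ["AC", "AC-T", "AC-T plus platinum", "AC-T with trastuzumab"])
```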

Model architecture and training

We applied a multimodal fusion method in the deep neural network to combine the different types of data. We chose the 3D-ResNet architecture to extract features from the 3D bilateral whole MR images. The 3D-ResNet has skip connections between layers and tends to benefit from a reduced number of parameters37; this enables a trainable deeper structure and adds nonlinearity to the model, helping it achieve adequate classification performance. The proposed model consists of three parts: (1) 3D volumetric CNNs for feature extraction from MR images; (2) FC layers for clinical information and feature concatenation; and (3) a final FC layer for the prediction of pCR (Fig. 2). For feature extraction from volumetric MR images, we modified the bottleneck 3D-ResNet-50 architecture designed by Hara et al.37 (Supplementary Table S2). Contrast-enhanced T1W subtraction and T2W images with shapes of 224 × 224 × 64 voxels were separately entered into the 3D-ResNet-50. The outputs passed through an FC layer of six neurons, yielding two 6D vectors for the T1W subtraction and T2W images. The twelve clinical variables passed through one FC layer of six nodes with a dropout rate of 0.1 (Fig. 2B). To fuse the multimodal features, the two 6D image vectors and the 6D clinical vector were concatenated into an 18D vector, which was then passed through a final sigmoid layer to produce a binary classification output. A feature-vector size of six dimensions was selected experimentally as giving the best model performance.

Figure 2

Deep learning architectures for the multimodal pCR prediction model. (A) The feature extractors for contrast-enhanced T1W subtraction MR images and T2W MR images were used in two 3D ResNet-50. The MR images for the input were subjected to isotropic transformation and cropped to a 3D form of 224 × 224 × 64. (B) FC layer was used for clinical information inputs. The outputs of each 3D ResNet-50 and FC layer for clinical information were concatenated. The final FC layer with sigmoid activation function was used in the prediction of pCR.
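A minimal PyTorch sketch of this fusion architecture under stated assumptions: the two backbones stand in for the 3D-ResNet-50 of Hara et al. (not reproduced here) ending in a 6-neuron FC layer, and the class and layer names are ours.

```python
import torch
import torch.nn as nn

class MultimodalPCRNet(nn.Module):
    """Fuses two 3D-ResNet-50 image branches with a clinical FC branch."""

    def __init__(self, backbone_t1, backbone_t2, n_clinical=12, feat_dim=6):
        super().__init__()
        self.backbone_t1 = backbone_t1   # 3D-ResNet-50 ending in a 6-neuron FC layer
        self.backbone_t2 = backbone_t2   # 3D-ResNet-50 ending in a 6-neuron FC layer
        self.clinical_fc = nn.Sequential(
            nn.Linear(n_clinical, feat_dim),  # 12 clinical inputs -> 6D vector
            nn.ReLU(),
            nn.Dropout(p=0.1),
        )
        self.classifier = nn.Linear(3 * feat_dim, 1)  # 18D fused vector -> 1 logit

    def forward(self, t1, t2, clinical):
        # t1, t2: (batch, 1, 64, 224, 224) volumes; clinical: (batch, 12)
        f_t1 = self.backbone_t1(t1)                  # (batch, 6)
        f_t2 = self.backbone_t2(t2)                  # (batch, 6)
        f_c = self.clinical_fc(clinical)             # (batch, 6)
        fused = torch.cat([f_t1, f_t2, f_c], dim=1)  # (batch, 18)
        return torch.sigmoid(self.classifier(fused))
```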

We elaborate here on the training procedure of our model. Dropout with a probability of 0.1 was applied in the FC layers after the feature extraction layers for MR images and clinical information to prevent overfitting during training. We divided the 536 patients in an 8:2 ratio (429 for training, 107 for validation) using stratified random sampling, which keeps the class proportions of the subsets the same as those of the original set. To improve model generalization, we applied several data augmentation methods in the training phase: random cropping, rotation, and flipping. Random cropping was parameterized to minimize the background in the axial views of the crops (224 × 224 × 64 voxels). Axial MR images were randomly rotated by an angle between −10° and 10° and flipped left-right with a probability of 50%. To address the class imbalance issue, focal loss38 was used (a sketch is given below). We used the PyTorch39 (version 1.1.0) open-source deep learning framework as the main tool for implementing the presented architecture. Training was conducted for 12 h on a DGX Station with a Tesla V100 GPU using the Rectified Adam optimizer40, a learning rate of 0.0354, a weight decay of 0.00001, a batch size of 8, and early stopping to avoid overtraining41.
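As an illustration of how the loss handles the pCR/non-pCR imbalance, here is a minimal binary focal loss sketch following Lin et al.38; the α and γ defaults are the common values from that paper, not hyperparameters reported in this study.

```python
import torch

def binary_focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so training focuses
    on the hard, minority-class (pCR) cases.

    probs: sigmoid outputs in (0, 1); targets: 0/1 labels of the same shape.
    """
    probs = probs.clamp(1e-6, 1.0 - 1e-6)                  # numerical stability
    p_t = probs * targets + (1 - probs) * (1 - targets)    # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = -alpha_t * (1 - p_t) ** gamma * torch.log(p_t)
    return loss.mean()
```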

Statistical analysis

We compared the training and validation datasets using the t-test for the means of numeric variables and the chi-square test or Fisher’s exact test for categorical variables (Python SciPy package). A P value < 0.05 indicated a statistically significant difference. To analyze the effect of each data type, we evaluated seven deep learning models trained on different combinations of clinical information, T1W subtraction images, and T2W images. The seven training datasets were as follows: (1) clinical information; (2) T1W subtraction images; (3) T2W images; (4) cropped images of the lesions in T1W subtraction images; (5) T1W subtraction and T2W images; (6) T1W subtraction images and clinical information; and (7) T1W subtraction images, T2W images, and clinical information. The performance of each model was evaluated using the AUC and the confusion matrix. The significance of AUC differences was tested using the method described by Hanley and McNeil42, sketched below.
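A minimal sketch of the Hanley–McNeil test42 for the difference between two AUCs. The correlation term r, needed when both models are evaluated on the same cases, must be estimated separately (e.g., from the Hanley–McNeil tables) and is left as an input here.

```python
import math
from scipy.stats import norm

def auc_se(auc, n_pos, n_neg):
    """Standard error of an AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    return math.sqrt((auc * (1 - auc)
                      + (n_pos - 1) * (q1 - auc ** 2)
                      + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg))

def hanley_mcneil_p(auc1, auc2, n_pos, n_neg, r):
    """Two-sided P value for the difference between two correlated AUCs
    measured on the same cases (Hanley & McNeil, 1983)."""
    se1, se2 = auc_se(auc1, n_pos, n_neg), auc_se(auc2, n_pos, n_neg)
    z = (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)
    return 2 * norm.sf(abs(z))
```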