Introduction

Breast cancer is the most commonly diagnosed cancer among women worldwide and becomes the second leading cause of cancer-related death1. Accurate identification of axillary lymph-node (ALN) involvement in patients with breast cancer is important for prognosis and therapy decisions2. Sentinel lymph-node (SLN) is the first node draining the primary cancer. SLN dissection (SLND) is recommended to predict ALN status, especially for those with clinically negative nodes3. The American College of Surgeons Oncology Group Z0011 (ACOSOG Z0011) trial showed that among patients with clinical T1/T2 breast cancer, if there were two or fewer SLN metastases, the use of SLND alone would not lead to inferior survival compared with ALND4,5. Compared with ALND, SLND has fewer complications, but it is not risk-free surgery and still has some significant limitations, including increasing considerable anesthesia time and expense, and causing complications such as arm numbness or upper limb edema in 3.5–10.9% of patients6,7. There were studies showing that 43–65% of patients who had positive SLNs underwent unnecessary axillary surgery because of no additional non-SLN metastasis, resulting in high morbidity8,9. In fact, SLN biopsy could be avoided if there was reliable preoperative evaluation of ALN status because most patients who had early-stage breast cancer have disease-free axilla10.

Ultrasound (US) has been widely used to preoperatively characterize breast lesions and determine ALN status11. A study showed that clinical T stage and preoperative axillary ultrasound (US) results were associated with the ALN status in patients with early-stage breast cancer10, but the diagnostic performance of axillary US was poor to determine the ALN status with an area under the receiver operating characteristic curve (AUC) of 0.585–0.719 (ref. 12). Several studies intended to predict the ALN status by clinicopathological data, such as tumor grade, histological tumor size, lymphovascular invasion, Ki-67 proliferation index, and hormone receptor status13,14. However, using clinicopathological data only is not accurate enough with an AUC of 0.66–0.74 in previous studies15. In addition, some data like lymphovascular invasion and histological tumor size could not be available preoperatively, but preoperative knowledge of ALN status is important for determining appropriate axillary treatment options6.

Two-dimensional (2D) shear wave elastography (SWE), a new US technology to measure tissue stiffness, integrates B-mode image with a color-coded map which shows the distribution of shear wave velocity (SWV)16. This technique showed promise in distinguishing malignant and benign breast lesions17. Some studies suggested that stiffness of breast cancer was a predictor of ALN status. Higher SWV of breast cancer showed higher possibility of ALN metastasis18,19. However, the performance of 2D-SWE depends on the placement of regions of interest (ROI) and the AUC was only 0.759 for the prediction of ALN status17,20. Therefore, only SWE images of breast cancer might be insufficient to evaluate ALN status accurately.

Radiomics can automatically provide a large number of quantitative image features from medical images, which tends to be hard for naked eyes to recognize21,22. This method was first demonstrated to be useful in analyzing CT or MRI images on clinical oncology23,24. Recently, radiomics based on analysis of US images showed better performances than other routine methods25. However, analyzing US images by radiomics has some limitations including object segmentation and extraction of hard-coded features22. Deep learning radiomics, a newly developed method, can provide quantitative and high-throughput features from medical images by supervised learning21,22. A recent study demonstrated that this DLR was useful in analyzing SWE images and showed excellent performance in predicting the stages of liver fibrosis22. When applied to analyze medical images, DLR usually confronts small-sample learning problems. Clinical parameter combined DLR, which integrates clinical information with network characteristics, can help provide complementary information for image features and collaboratively use clinical information and US images features to build model, thus improving model performance26. Our hypothesis is that clinical parameter combined DLR might be able to extract more valuable information from images of breast conventional US and SWE and thus provide better prediction and stratification of ALN status according to the cuts off for axillary surgery of ACOSOG Z0011 trial.

Hence, the purpose of this study is to evaluate the diagnostic performance of clinical parameter combined DLR on conventional US images and SWE images of breast cancer in predicting the extent of ALN involvement in patients with early-stage breast cancer. Our results reveal that clinical parameter combined DLR yields the best diagnostic performance in predicting ALN status between disease-free axilla (N0) and any axillary metastasis (N+(≥1)) with an AUC of 0.902 in the test cohort, which was significantly higher than that of axillary US (P < 0.001, Hanley & McNeil), classification by clinicopathological data (P = 0.002, Delong et al.) and DLR on images only (P = 0.004, Delong et al.). Clinical parameter combined DLR on breast conventional US and SWE images provides a noninvasive imaging biomarker for predicting the extent of ALN involvement preoperatively and have the potential to determine appropriate axillary treatment options for patients with early-stage breast cancer.

Results

Baseline characters

Between January 2016 and April 2019, a total of 1342 women with 1342 breast lesions was studied and finally 584 women (mean age, 50 years; range, 26–83 years) with 584 malignant breast lesions were enrolled for analysis. Figure 1 shows the patient recruitment workflow. According to the results of SLND or ALN dissection, 337 had disease-free axilla (N0), 150 had low metastatic burden of axillary disease (N+(1–2)) and 97 had heavy metastatic burden of axillary disease (N+(≥3)).

Fig. 1: Patient recruitment workflow.
figure 1

In total, 584 out of 1342 patients were included according to the selection criteria. The included patients were examined by conventional US and SWE, and had complete clinical information needed for the study.

Base model selection and clinical information integration

The base model acted as a feature encoder, which had a significant impact on classification. In order to find the most suitable base model for ALN prediction tasks, the performances of ResNet50, ResNet101, Inception V3, and VGG19 in predicting ALN status between N0 and N+(≥1) were compared. When ResNet50 was selected as the basic model with best performance, clinical information was further added to the diagnostic model. The model incorporating clinical information was called ResNet50 + C, where C stands for clinical information. The method of adding clinical information was to directly input clinical information into the penultimate layer of the fully connected (FC) layer of ResNet50 by increasing the number of neurons. The detailed results were summarized in Table 1. The ResNet50, which integrated the deep features and clinical information offline, was proved to be the best in terms of performance and memory usage.

Table 1 The performance comparison of different models.

Prediction of ALN status between N0 and N+(≥1)

Adopting N0 as negative reference standard, 466 lesions were randomly assigned as training cohort and the other 118 lesions as independent test cohort. The detailed characteristics including patient age, US size, Breast Imaging-Reporting and Data System (BI-RADS) category, tumor type, estrogen receptor (ER) status, progesterone receptor (PR) status, human epidermal growth factor receptor 2 (HER-2), Ki-67 proliferation index were demonstrated in Table 2. There was no significant difference between the detailed characteristics of the two cohorts (all P > 0.05, t-test or Mann-Whitney U test). Based on axillary US findings evaluated by an experienced radiologist, axillary US findings had an AUC of 0.735, accuracy of 0.635, sensitivity of 0.721 and specificity of 0.573. The Kappa values for axillary US were 0.933 for inter-observer agreement and 1 for intra-observer agreement (both P < 0.001, Kappa test).

Table 2 Patient and tumor characteristics.

In the training cohort, clinical parameter combined DLR achieved the highest AUC of 0.936 while DLR based on images only and classification by clinicopathologic data only achieved AUCs of 0.850 and 0.771, respectively. In the independent test cohort, AUCs dropped slightly for predicting ALN metastasis and was consistent with the performance of training cohort. The clinical parameter combined DLR still achieved the highest AUC of 0.902, which was significantly higher than the AUC of other methods including axillary US findings (AUC: 0.735, P < 0.001, Hanley & McNeil test), DLR based on images only (AUC:0.796, P = 0.004, Delong et al.) and classification by clinicopathologic data (AUC:0.727, P = 0.002, Delong et al.). The accuracy, sensitivity, specificity, PPV, and NPV of clinical parameter combined DLR were also universally better than other methods. The detailed statistical results were summarized in Table 3 and its corresponding ROCs were shown in Fig. 2. The prediction results of DLR based on using only US or SWE images combined with clinical parameters were poorer than DLR based on using both US and SWE images combined with clinical parameters (P = 0.006 and P = 0.002, respectively, Delong et al.) (Supplementary Note 1 and Supplementary Table 2).

Table 3 The prediction of ALN status results (N0 v.s. N+(≥1)).
Fig. 2: Comparison of receiver operating characteristic (ROC) curves between different models for predicting disease-free axilla (N0) and any axillary metastasis (N+(≥1)).
figure 2

DLR deep learning radiomics. Numbers in parentheses are areas under the receiver operating characteristic curves. Source data are provided as a Source Data file.

Prediction of ALN status between N+(1–2) and N+(≥3)

Adopting N+(1–2) as negative reference standard, this experiment assigned 197 lesions as training cohort and 50 as independent test cohort. In the training cohort, DLR based on images only and classification by clinicopathologic data achieved AUCs of 0.874 and 0.756, respectively, while clinical parameter combined DLR achieved the AUC of 0.956. In the independent test cohort, the AUC of clinical parameter combined DLR dropped slightly but still reached 0.905, which was significantly higher than the AUC of DLR based on images only (AUC: 0.777, P = 0.04, Delong et al.) and the AUC of classification by clinicopathologic data (AUC: 0.686, P = 0.03, Delong et al.). The detailed statistical results were summarized in Table 4. The corresponding ROCs depicted the comparisons (Fig. 3).

Table 4 The prediction of ALN status results (N+(1–2) v.s. N+(≥3)).
Fig. 3: Receiver operating characteristic (ROC) curves comparison between different models for predicting low metastatic burden of axillary disease (N+(1–2)) and heavy metastatic burden of axillary disease (N+(≥3)).
figure 3

DLR deep learning radiomics. Numbers in parentheses are areas under the receiver operating characteristic curves. Source data are provided as a Source Data file.

Prediction of ALN status among N0, N+(1–2) and N+(≥3)

This model was extended to be compatible with three groups of tasks to predict ALN status. As described above, the clinical endpoints were categorized into three parts: N0, N+(1–2), and N+(≥3). The number of lesions of the three categories is 337 (N0), 150 (N+(1–2)), and 97 (N+(≥3)), respectively. The DLR model was built on breast conventional US and SWE images and was classified by axillary US findings, clinicopathologic data. The overall accuracy of differentiating the three groups was 0.805 and the confusion matrix was shown in Fig. 4. The model performed well in differentiating the N0 group while showed poorer results in the other two groups.

Fig. 4: The confusion matrix of predicting metastasis among disease-free axilla (N0), low metastatic burden of axillary disease (N+(1–2)) and heavy metastatic burden of axillary disease (N+(≥3)).
figure 4

Source data are provided as a Source Data file.

Interpretability of DLR model

For investigating the interpretability of the DLR, the network was visualized by applying the Gradient-weighted Class Activation Mapping (Grad-CAM), which could produce a coarse localization map highlighting the import regions for classification target27. The last convolutional layer of the last res-block was made transparent to the prediction of ALN status as shown in Fig. 5. We found that there were usually two locations valuable for predicting ALN status based on DLR model. One is the boundary of the tumor and the other is the low echo area inside the tumor. To some extent, this proved the effectiveness of the model.

Fig. 5: Visualization of two patient examples.
figure 5

Each example shows the gray-scale US image and corresponding heart map, and the red region represents a larger weight, which can be decoded by the color bar on the right. Image a shows that the low echo area inside the tumor is valuable for predicting ALN status, while it is the tumor boundary for image b.

Discussion

According to ACOSOG Z0011 trial, patients who had early-stage breast cancer with less than 2 SLN metastasis had no inferior survival if they underwent SLND only rather than ALN dissection4,5. Based on the results of ACOSOG Z0011 trial, all patients should undergo SLND to predict ALN status whether patients have clinically positive node or not28. However, SLND has some limitations, including causing some complications6, having false-negative rates ranging from 7.8–27.3%29,30,31 and resulting in unnecessary axillary surgery9. Hence, there is an increasing need for predicting metastatic extent of ALN accurately in a noninvasive way.

In this study, we developed and validated a clinical parameter combined DLR method based on breast conventional US and SWE images for preoperative prediction of ALN status in patients with clinical T1 or T2 breast cancer. This method showed significantly better diagnostic performances in distinguishing patients with a negative axilla (N0) and patients with any axillary metastasis (N+(≥1)) than any single method. Encouragingly, our model showed favorable discriminating ability between patients with low metastatic burden of axillary disease (N+(1–2)) and patients with heavy metastatic burden of axillary disease (N+(≥3)). With false-negative rate similar to SLND, this clinical parameter combined DLR might have the potential to serve as a noninvasively imaging biomarker to replace SLND for patients with early-stage breast cancer. The clinical parameter combined DLR showed the possibility to assist breast clinicians to make decisions for appropriate axillary treatment: no need for SLN biopsy or ALN dissection in patients with N0, SLND only for patients with N+(1–2) and ALN dissection for patients with N+(≥3)4.

Some studies argued that compared with SLND, axillary US combined with fine needle aspiration or core needle biopsy could be cost saving for patients with positive nodal status11. However, axillary US was not accurate enough to predict ALN status with an AUC of 0.585–0.719 (refs. 12,32). Some studies demonstrated that the number and morphology of abnormal lymph nodes detected by axillary US were predictors of nodal burden28, but the diagnostic performances were poor with AUC of 0.725–0.747 (refs. 33,34). In our study, the overall diagnostic performance of preoperative axillary US results was low with an AUC of 0.735, which was concordant with previous studies35,36. Compared with axillary US results alone, the clinical parameter combined DLR method makes use of all available data including findings of axillary US, clinicopathologic data, breast conventional US and SWE images and therefore, showed significantly better diagnostic performances in predicting ALN metastasis than the routine axillary US evaluated by an experienced radiologist.

Some studies reported that some histopathological data such as tumor grade, lymphovascular invasion, histological tumor size, and hormone receptor status could be a predictor of ALN metastasis10,37. However, some histopathological data, such as histological tumor size and lymphovascular invasion, could only be evaluated after surgical resection, and could not guide decisions of axillary surgery preoperatively. Although tumor grade could be estimated from core biopsy samples preoperatively, low concordant rates ranging from 67% to 75% were found between core needle biopsy and surgical excision in previous studies38,39. Different from previous studies, this current study adopts all histopathological data available after biopsy of primary breast tumor40, which is a standard procedure preoperatively. Therefore, some clinicopathologic data available preoperatively were kept as candidate factors in developing the predictive model, which could serve as a noninvasive predictive method to assess ALN status.

Clinical parameter combined DLR was completely established on analyzing images of breast conventional US and SWE with the DLR concept22 and was combined with axillary US findings and clinicopathologic data. This DLR method has shown great promise in analyzing SWE images on staging liver fibrosis22. Radiomics method was also applied in other imaging modalities like CT or MRI images of some primary cancer like bladder, colon cancer to predict regional lymph-node metastasis, demonstrating this method was a useful way to make a prediction of lymph-node metastasis41,42. Compared with the previous study, our study yielded a better diagnostic performance by concentrating on the clinical parameter combined DLR method, which can complement image features with more information and make the model more robust by restraining the features extracted from images26. In addition, for patients with suspected breast lesions, breast and axillary US is a routine practice to characterize breast lesions and axillary lymph-node status, and have the advantages of cost effective and no ionizing radiation comparing with other imaging modalities10.

Compared with those studies using SWE values as a single parameter to predict ALN status12,18, our study showed a better diagnostic performance by applying DLR on breast conventional US and SWE images. Instead of measuring the stiffness of breast cancer inside the shear wave ROI based on several parameters of SWE, the whole shear wave ROI was analyzed and a large number of features were quantified automatically by DLR22.

Some limitations have to be addressed in this study. First, this is a single-center study. Acquiring more evidence from multicenter is needed to validate this model before clinical application in the future. Second, patients with multifocal breast lesions and bilateral disease are excluded because it is difficult to determine which lesion would lead to ALN metastases and should be input in the model. Therefore, current clinical parameter combined DLR model could only be used to predict extent of ALN involvement for patients with single breast cancer. Further study is needed to build other model to predict ALN status for patients with multifocal breast lesions and bilateral disease. Third, gene markers of breast cancer like BRCA1 and BRCA2, are used to stratify patients based on the risk for disease43. However, radiogenomics, focusing on the relationship between genomics and imaging phenotypes, is not available currently although it is an interesting attempt.

Conclusions

Clinical parameter combined DLR on breast conventional US and SWE images provides a noninvasive and practical way for predicting the extent of ALN involvement preoperatively and have the potential to determine appropriate axillary treatment options for patients with early-stage breast cancer. Prospective multicenter validation is expected to acquire high-level evidence for clinical use in subsequent studies.

Methods

Patients

This prospective study was approved by Institutional Review Board of Sun Yat-sen University Cancer Center. The inclusion criteria included the followings: (a) women with US-suspected breast masses; (b) availability of clinical data; (c) patients who underwent breast surgery and sentinel lymph-node biopsy or ALN dissection with curative intent. The exclusion criteria included the followings: (a) preoperative therapy (resection biopsy, neoadjuvant radiotherapy or chemotherapy); (b) patients with multifocal lesions or bilateral disease; (c) masses deeper than 3 cm in depth due to the attenuation of SWE or larger than 3.5 cm in diameter due to the limited width of the US probe; (d) unqualified 2D-SWE measurements, which means little or no shear wave signal was acquired in the ROI of SWE; (e) benign breast lesions or carcinoma in situ; (f) missing important histopathological results (immunohistochemical results or lymph-node results); (g) incomplete information or images. Verbal informed consent was obtained from all patients.

Conventional US examinations

One of five radiologists who had 18, 6, 3, 2, and 2 years of experience in breast ultrasound respectively performed preoperative breast and axillary US with Siemens S2000 ultrasound scanner (Siemens Healthineers, Mountain View, CA, USA) equipped with a 4–9 MHz linear array transducer. The target breast mass was measured at maximal-diameter plane to determine US size and classified by using US BI-RADS17. After performing whole-breast US, the same radiologists performed axillary US routinely and recorded suspicious US features of ALN. Suspicious US features of ALN include the ratio of long axis diameter to short axis diameter<2, diffuse cortical thickening>3 mm, focal cortical bulge >3 mm, eccentric cortical thickening >3 mm, rounded hypoechoic node complete or partial effacement of the fatty hilum, nonhilar cortical blood flow on color Doppler images, complete or partial replacement of the node with an ill-defined or irregular mass and microcalcifications in the node32. The result of axillary US evaluated by the experienced radiologist was regarded as positive as long as at least one suspicious US finding was found. The result of axillary US was regarded as negative when no suspicious findings of ALN were found10. To evaluate the intra-observer agreement for axillary US, one radiologist repeated evaluating the same 30 ALNs at a time interval of 1 week. Inter-observer agreement was tested by two radiologists, evaluating the same ALNs independently in another 30 ALNs.

SWE

After performing conventional US, SWE was performed thrice at the maximal-diameter plane of the breast lesion. The ROI of SWE was adjusted to include subcutaneous fat layer and superficial pectoral muscle layer, with at least 5 mm of distance from the boundary of the lesion to the lateral borders44. With sufficient coupling material filling between probe and skin, the radiologist applied extremely slight pressure to minimize pre-compressions. When acoustic radiation force impulse was generated, patients were asked to suspend respiration for several seconds. The quality map, displayed in red-yellow-green representing low-intermediate-high quality respectively, was obtained to evaluate the quality of the SWE first. Then the velocity map of SWE was obtained. Guided by the quality map, the velocity map of SWE with fewest artifacts and the best quality was chosen and stored as the SWE image for analysis.

Data analysis

Clinical and histopathologic data were obtained from the medical records. Histopathologic results of the breast cancer included tumor type, ER status, PR status, HER-2, and Ki-67 proliferation index. Clinical data included patients age, US size, tumor location, and BI-RADS category. Histopathologic results of SLND and ALN dissection including the total number of resected lymph nodes and total number of positive nodes were recorded.

Deep learning radiomics model

The enrolled patients were randomly divided into the training cohort and independent test cohort with the ratio of 4:1 and the training cohort were then used to optimize the model parameters. We also randomly chose 25% of training images to form a validation cohort to guide the choice of hyper parameters. The whole pipeline of our model was shown in Fig. 6. Resnet was adopted as the base model which pre-trained on Imagenet45,46. In particular, the last 1000 nodes FC layer was replaced with our specifically designed three FC layers with Xavier initialized weights47. The detailed architecture of the network is shown in Supplementary Table 1.

Fig. 6: The overall pipeline of the model.
figure 6

The parallel pre-trained ResNet model encodes the input images to features which be combined with clinical parameters. Then the combined features be classified by an SVM model.

There were two steps included in the entire process, the forward computation and the backward propagation48. Before that, the rectangular ROIs were cropped from raw US images according the tumor segmentation mask, resized to 224 × 224 pixels and normalized. The pathology type was encoded to one-hot, which was the label. In the training stage, rectangular ROIs were fed into network to update model parameters by backward propagation. The outputs of the network were used as the classification results, and the cross-entropy of the outputs and the labels were calculated as the loss function. Note that to alleviate the influence of over fitting and sample imbalance, a strategy called online data augmentation was used, which meant randomly horizontal and vertical flipping the input image, randomly cropping every ROI image from four directions in the steps of 2 pixels and feeding each category image into the network with same probability. We set learning rate to 1e-4 and applied the Adam optimizer to update the model parameters with batch size 32. The maximum iteration step was set to 5000, and the learning rate decayed by 1/2 at 2000 and 4000 steps. After training, we replaced the last FC layer with an SVM as classifier and fused the clinical information and network features to collaboratively make a decision49.

Network feature extraction

Contrasting with hand-crafted and engineered features designed according to the previous medical experiences, DLR learnt the high-throughput image features in a supervised manner, which could make full use of all embedded information in US images50,51. The convolutional layers encoded the input rectangular ROIs and adaptively learnt the semantic features and the FC layers then selected the relevant features and reduced the features dimensions. Supervised by the label of input images, the model updated parameters and finally led to the most relevant features in the FC layer. The penultimate FC layer output was used as the network features, which further proved efficient and effective. For comparison, clinicopathological information and network features were used to train SVMs directly to compare predictive performance of ALN status, respectively.

Multi-modal multi-sources features fusion on DLR

As described earlier, a single modal image could be encoded into network features. An additional modal image might provide more effective information. Hence the model was extended to be compatible with bimodal image inputs. A parallel model structure was designed, which contained two ResNet50 base networks and FC layers. The two ResNet50 base model shared parameters, which accepted US and SWE images as bimodal image inputs, respectively. The final convolutional layer outputs of the parallel network were concatenated and fused by the following FC layers. Network features were extracted from layers the same as one modal model. We argued that when deep models were applied to medical images analysis usually confronting small-sample learning problems, they should be combined with the clinical information26. Finally, the deep features combined with clinical features were used to collaboratively train an SVM classifier for predicting ALN status. In the control group model ResNet50+C, the neurons were added to the penultimate FC layer of the ResNet50 with the same number of clinical features. During the model training, the network received dual-modal images and as the input, and the clinical features were directly input into the penultimate FC layer instead of extracting deep features and combining them offline. To use the proposed model, a rectangle ROI which cover the tumor should be manually selected as the network input.

Statistical analysis

By using SLN biopsy or ALN dissection as reference standard, the extent of ALN metastasis was divided into three groups, including disease-free axilla (N0), low metastatic burden of axillary disease (N+(1–2)) and heavy metastatic burden of axillary disease (N+(≥3)). The detailed clinicopathological difference of N0 and N+(≥1) was compared by t-test or Mann-Whitney U test. AUC was used to estimate the performance of axillary US, DLR based on images only, classification by clinicopathological data and clinical parameter combined DLR, and was compared by using Delong et al52. or Hanley & McNeil53,54. The other measurements like accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were also used to estimate the model performance. The calculation method is shown in Supplementary methods. Kappa test was used to compared the intra-observer agreement and inter-observer agreement. All the statistics were two side and a P-value less than 0.05 was considered statistically significant. All statistical analyses were performed using MedCalc software (V.11.2; 2011 MedCalc Software bvba, Mariakerke, Belgium), Python 3.5, matlab R2015b.

Statistics and reproducibility

Models were verified and replicated using regular machine learning metrics on independent test cohorts. We released the software of the model for replication on new data.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.