Introduction

Breast cancer is one of the most prevalent malignancies affecting women's health globally. By 2020, it had become the most common cancer worldwide1 and ranked fourth in cancer-related mortality, with the highest rise in new fatal cases attributed to breast cancer1. In China alone, there are approximately 416,371 new cases and 117,174 related deaths annually, accounting for approximately 18.41% and 17.11% of global cases and deaths, respectively2. The majority of deaths stem from metastasis, which an estimated 20–30% of breast cancer patients experience3.

The sites of distant metastasis in breast cancer closely correlate with post-metastatic survival, with bone, lungs, and liver being the most common locations3,4. Presently, clinical diagnosis of breast cancer distant metastasis heavily relies on imaging techniques. For instance, MRI, when using multi-sequence comprehensive imaging, offers morphological and functional information for bone metastasis without ionizing radiation exposure. Chest CT is recommended for detecting lung metastasis, and combining CT and MRI aids in diagnosing liver metastasis5,6.

However, conventional imaging methods have limitations in distinguishing breast cancer distant metastasis. Challenges include differentiating benign lung nodules from lung metastases and identifying atypical vascular tumors when diagnosing liver metastasis5. Moreover, these diagnostic procedures can be expensive and often must be repeated, posing a significant financial burden, particularly for breast cancer patients in developing countries.

In response to these challenges, research has started exploring the use of artificial intelligence (AI) to assist in predicting breast cancer distant metastasis7,8,9,10,11,12. AI-driven predictive approaches hold promise for delivering faster, more accurate diagnoses while potentially reducing the need for expensive imaging tests, thereby alleviating patients’ economic burdens. Current AI research on breast cancer distant metastasis primarily focuses on assessing the risk of future (1, 3, or 5 years) metastasis7,8,9,10,11,12. If breast cancer patients could undergo an evaluation for distant metastasis before relatively expensive whole-body imaging or relatively invasive pathological examinations, it might help avoid unnecessary whole-body imaging tests.

This study explores AI-based diagnosis of breast cancer with distant metastasis. By integrating clinical blood markers and ultrasound data, a novel AI model for distinguishing breast cancer distant metastasis was established. The novelty lies in its independence from costly and occasionally inaccessible imaging examinations: it instead uses relatively accessible clinical blood markers and cost-effective ultrasound data to predict breast cancer distant metastasis. This approach not only improves diagnostic affordability and accessibility but also introduces a new avenue for early detection of breast cancer distant metastasis. With technological advances and further research, the application of AI to predicting breast cancer distant metastasis could become an important future development in this field.

Materials and methods

Patient population

This retrospective study used data from two centers and was approved by the institutional review boards of both. All methods were performed in accordance with the relevant guidelines and regulations. Inclusion criteria were as follows: (1) confirmed diagnosis of de novo primary breast cancer with or without distant metastasis; (2) completion of ultrasound examinations and clinical blood marker tests before treatment (radiotherapy or chemotherapy), surgical resection, or biopsy; (3) no history of hypertension; (4) no history of diabetes; (5) no history of hyperlipidemia; (6) no history of abnormal blood markers related to liver, kidney, or cardiovascular functions; and (7) absence of other medical conditions. Exclusion criteria were: (1) occurrence of distant metastasis post-treatment (surgical resection or chemotherapy); (2) failure to undergo ultrasound due to unavoidable reasons (e.g., breast surface dressing); (3) absence of maximum lesion diameter in the ultrasound examination; and (4) lack of tumor markers (AFP, CEA, CA125, CA153, and CA199), liver function, kidney function, lipid profile, or cardiovascular function in the clinical blood markers. The 342 cases from one center were randomly divided in an 8:2 ratio into a training set (274 cases) and an internal test set (test; 68 cases), and the 74 cases from the other center formed an external test set (test1). Because breast cancer distant metastasis predominantly occurs in the bones, lungs, and liver, the distant metastasis cases included in this study were those with bone, lung, or liver metastases, as detailed in Table 1. The flowchart for selecting the study patients is shown in Fig. 1. The workflow of the models in this study is depicted in Fig. 2.
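The random 8:2 division described above can be sketched as follows. This is only an illustration: the seed, the rounding rule, and the integer case identifiers are assumptions, since the text does not state how the split was implemented.

```python
import random


def split_train_test(case_ids, train_fraction=0.8, seed=42):
    """Shuffle case identifiers and split them into training and test lists.

    The seed and the use of round() are assumptions for illustration only.
    """
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 342 cases from the first center.
case_ids = list(range(342))
train_ids, test_ids = split_train_test(case_ids)
print(len(train_ids), len(test_ids))
```

With 342 cases, rounding 342 × 0.8 = 273.6 gives the 274/68 split reported in the text.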

Table 1 Clinical blood markers, pathological, and ultrasound characteristics in the training, test, and test1 cohorts.
Figure 1 The flowchart for selecting the study patients.

Figure 2 The workflow of the clinical model and combined model in this study.

Feature extraction and selection

Features extracted from clinical blood markers included tumor markers (carcinoembryonic antigen, alpha-fetoprotein, CA125, CA153, and CA199), liver function indicators (total bilirubin, direct bilirubin, indirect bilirubin, total protein, albumin, globulin, albumin-globulin ratio, gamma-glutamyl transferase, prealbumin, aspartate transaminase (AST), alanine transaminase (ALT), AST/ALT ratio, alkaline phosphatase, cholinesterase, and total bile acid), kidney function indicators (urea, creatinine, uric acid, blood bicarbonate concentration, cystatin C, potassium ion, sodium ion, chloride ion, calcium ion, and inorganic phosphorus), lipid profile (total cholesterol, triglycerides, high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, apolipoprotein A1, apolipoprotein B, A1/B ratio, and lipoprotein (a)), and cardiovascular function indicators (creatine kinase, creatine kinase isoenzyme (CK-MB), lactate dehydrogenase, and alpha-hydroxybutyrate dehydrogenase). Features from ultrasound data included the maximum diameter of breast cancer lesions.

All extracted features were processed as follows. First, features were standardized using z-score normalization so that each had a mean of 0 and a standard deviation of 1. Next, Spearman's rank correlation coefficient was used to measure the correlation between pairs of features; when the coefficient between two features was > 0.9, only one of them was retained. Then, we used the Least Absolute Shrinkage and Selection Operator (LASSO) regression model for feature selection. LASSO is a regression method with an L1 regularization term, which shrinks some regression coefficients to exactly zero and thereby performs feature selection. We used the mean squared error (MSE) to determine the optimal regularization parameter (λ): the MSE was calculated for different λ values through cross-validation, and the λ value that minimized the MSE was chosen. The aim was to find a feature subset that reduces model complexity while maintaining good predictive performance. In summary, L1-regularized LASSO regression reduces feature dimensionality by producing a sparse model in which only a few features contribute to the prediction, which aids the model's interpretability and generalizability.
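The first two steps, z-score standardization and the Spearman-based redundancy filter, can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the toy values are invented, the population standard deviation is assumed for the z-score, and the filter is assumed to keep the first feature of each correlated pair and to compare the absolute value of the coefficient against 0.9.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |rho| <= threshold against every kept feature."""
    kept = []
    for name in features:
        if all(abs(spearman(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Invented toy values; ALT rises monotonically with AST, so rho = 1.
raw = {
    "AST": [20, 35, 28, 50, 41, 33],
    "ALT": [18, 30, 25, 47, 39, 29],
    "urea": [5.1, 4.2, 6.3, 4.8, 5.5, 3.9],
}
standardized = {name: z_score(values) for name, values in raw.items()}
kept = drop_correlated(standardized)
print(kept)
```

Because Spearman's coefficient depends only on ranks, standardizing before or after the filter gives the same set of retained features.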

Development and validation of models

This study employed the LightGBM machine learning algorithm to construct models for breast cancer with distant metastasis, utilizing the dimensionality-reduced features. The learning rate in the LightGBM model significantly impacts the convergence speed and performance of the model. Through cross-validation, we selected an appropriate learning rate to balance the learning speed and accuracy of the model during training. The number of trees and the depth of the trees in the LightGBM model directly affect the complexity and fitting ability of the model. We determined the optimal number of trees and tree depth through grid search methods to avoid overfitting or underfitting. The selected features from clinical blood markers were used to build the clinical model, while a combination of clinical blood markers and ultrasound features, post-dimensionality reduction, constituted the combined model. Model construction was based on fivefold cross-validation of the training set. Following model construction, validation was performed on internal (test) and external test (test1) sets, evaluating performance using metrics such as area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Subsequently, clinical decision curve analysis (DCA) was conducted, depicting the net benefit at different probability thresholds in training and internal–external validation sets to assess the clinical efficiency of the model.
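The threshold-based metrics and the AUC named above can be computed directly from labels and predicted probabilities. The sketch below is illustrative only: the labels and probabilities are invented, the 0.5 cut-off is an assumption (the text does not state the operating threshold), and the AUC is computed by pairwise comparison rather than by whatever routine the authors used.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN when predicting positive if prob >= threshold."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Return the threshold-dependent metrics as a dictionary.

    Assumes every denominator is nonzero (both classes present and
    both classes predicted).
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc(y_true, y_prob):
    """AUC as the chance a positive case outscores a negative one (ties = 0.5)."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = distant metastasis, 0 = no distant metastasis.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
metrics = classification_metrics(y_true, y_prob)
model_auc = auc(y_true, y_prob)
print(metrics, model_auc)
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for cohorts of a few hundred patients but would be replaced by a rank-based formula for larger data.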

Statistical analysis

Clinical baseline characteristics were compared using t-tests, chi-square tests, or Fisher's exact tests in SPSS software (version 25.0, IBM). The t-test was used for continuous variables with homogeneity of variance, presented as mean ± standard deviation (x ± s), while chi-square tests or Fisher's exact tests were used for categorical variables, presented as ratios. A two-tailed p value < 0.05 indicated statistical significance. Spearman rank correlation tests, heatmap plotting, z-score normalization, univariate regression analysis, multivariate regression analysis, output of feature importance in LightGBM models, and LASSO regression analysis were performed using Python software (version 3.7.17; http://www.python.org). Receiver operating characteristic (ROC) curves and clinical decision curves were also plotted. The DeLong test was implemented using R (version 4.3.3).

Ethical approval and consent to participate

This study has obtained ethical approval from the Medical Ethics Committee of the First Affiliated Hospital of Guangxi Medical University (Reference Number: 2023-E749-01) and the Medical Ethics Committee of Guangxi Medical University Tumor Hospital (Reference Number: KY2023868). Due to the retrospective nature of the study, the requirement for informed consent has been waived by the Medical Ethics Committee of the First Affiliated Hospital of Guangxi Medical University and the Medical Ethics Committee of Guangxi Medical University Tumor Hospital.

Results

Patient characteristics

This study encompassed data from two research centers involving a total of 416 female breast cancer cases. One center contributed 274 cases to the training cohort and 68 cases to the test cohort, and the other center contributed 74 cases to the test1 cohort. Statistically significant differences existed across the three cohorts in blood markers including CA153, albumin, albumin-globulin ratio, gamma-glutamyl transferase, alkaline phosphatase, and alpha-hydroxybutyrate dehydrogenase, as well as in the maximum diameter of breast cancer lesions obtained from ultrasound examinations. A summary of patient ultrasound and clinical blood marker features is provided in Table 1.

Feature selection

The feature data were normalized, and then only one feature was retained from each pair with a Spearman correlation coefficient > 0.9. A heatmap illustrating the correlation analysis of features is presented in Supplementary Fig. 1. Clinical blood marker features were used to construct the clinical model predicting breast cancer distant metastasis, while the combination of clinical blood markers and ultrasound features was used for the combined model. Dimensionality reduction was achieved by eliminating features with zero coefficients through LASSO regression. The optimal λ value for fitting the LASSO regression model (Fig. 3a,d) was determined based on the minimum mean squared error (MSE) (Fig. 3b,e). After feature dimensionality reduction, 17 features were selected for each model (Fig. 3c,f).
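To illustrate how the L1 penalty drives some coefficients to exactly zero, the sketch below fits a LASSO by coordinate descent with soft-thresholding on an invented four-sample design. It is an assumption-laden illustration, not the authors' implementation: the objective is taken as (1/2n)·||y − Xβ||² + λ·||β||₁ with no intercept, the data are made up, and λ is fixed by hand instead of being chosen by cross-validated MSE.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 one coefficient at a time.

    X is a list of rows; columns are assumed centered (no intercept term).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared length of feature j
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k]
                                    for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Invented orthogonal design: y = 2.0 * x1 + 0.1 * x2.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.1, 1.9, -1.9, -2.1]
beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

In this toy case the weak second feature falls below the penalty and is dropped, while the strong first feature is kept but shrunk from 2.0 to about 1.5; features with nonzero coefficients are the ones that would be retained.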

Figure 3 Feature selection using the least absolute shrinkage and selection operator (LASSO) regression model. (a–c) Feature selection for the clinical model; (d–f) feature selection for the combined model. (a, d) LASSO coefficients for different λ values; the vertical dashed lines mark the optimal λ value and the corresponding number of features (clinical, 17; combined, 17). (b, e) The optimal λ values, chosen by tenfold cross-validation at the minimum mean squared error (MSE), are marked by vertical dashed lines. (c, f) Features with nonzero coefficients after LASSO feature selection: (c) clinical features; (f) combined features.

Construction and validation of clinical and combined models

The LightGBM machine learning algorithm was used to construct the clinical and combined models from the selected features. The ROC curves for the clinical and combined models are displayed in Fig. 4a,b. The AUC values of the clinical model for the training, test, and test1 cohorts were 0.950 (95% CI 0.928–0.973), 0.795 (95% CI 0.689–0.901), and 0.883 (95% CI 0.808–0.958), respectively. For the combined model, the AUC values were 0.955 (95% CI 0.934–0.976), 0.835 (95% CI 0.739–0.931), and 0.918 (95% CI 0.856–0.981) for the training, test, and test1 cohorts, respectively. Additional performance parameters are presented in Table 2. Across the training, test, and test1 cohorts, the AUC values of the combined model were higher than those of the clinical model (Fig. 5a,c,e). DeLong tests were performed to assess the differences in AUC values between the combined and clinical models in the test and test1 cohorts. In both cohorts the difference was not statistically significant (test cohort: p value 0.103; test1 cohort: p value 0.245). DCA curves of the two models across the training, test, and test1 cohorts are depicted in Fig. 5b,d,f. The results indicate that the combined model provided the greater net benefit in identifying breast cancer with distant metastasis across all three cohorts.
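The net benefit plotted in a decision curve follows the standard definition: true positives per patient minus false positives per patient weighted by the odds of the probability threshold. The sketch below uses invented labels and probabilities and is not the authors' plotting code; it also computes the usual "treat all" reference strategy for comparison.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit = TP/n - FP/n * threshold / (1 - threshold)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy that refers every patient for further imaging."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Invented example: 1 = distant metastasis, 0 = no distant metastasis.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

for threshold in (0.2, 0.5):
    model_nb = net_benefit(y_true, y_prob, threshold)
    all_nb = net_benefit_treat_all(y_true, threshold)
    print(threshold, round(model_nb, 5), round(all_nb, 5))
```

A model is useful at a given threshold when its net benefit exceeds both the "treat all" value and zero (the "treat none" strategy); a decision curve repeats this comparison over a range of thresholds.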

Figure 4 Receiver operating characteristic curves for the clinical (a) and combined (b) models in the training, test, and test1 cohorts.

Table 2 Performance of the models in discriminating between breast cancer with distant metastasis and breast cancer without distant metastasis in the training, test, and test1 cohorts.
Figure 5 Receiver operating characteristic (ROC) curves and clinical decision curve analysis (DCA) for the clinical and combined models in the training (a, ROC; b, DCA), test (c, ROC; d, DCA), and test1 (e, ROC; f, DCA) cohorts.

Analysis of model feature importance

To identify the crucial clinical blood markers and ultrasound data features contributing significantly to the clinical and combined models' predictions of distant metastasis, feature importance analysis was conducted, as shown in Fig. 6a,b. The top 5 features from both the clinical and combined models were integrated, comprising CK-MB, CEA, CA153, albumin, creatine kinase, and the maximum diameter of lesions detected by ultrasound. Subsequently, univariate and multivariate regression analyses were performed on the involved features, displaying OR and p values in Table 3. Blood markers including CA153, indirect bilirubin, magnesium ion, CK-MB, lipoprotein (a), and the maximum diameter of lesions from ultrasound showed p values < 0.05 in both univariate and multivariate regression analyses, suggesting their potential association with breast cancer and metastasis. Among these, CA153, CK-MB, lipoprotein (a), and the maximum diameter of lesions from ultrasound exhibited positive correlations, while indirect bilirubin and magnesium ion showed negative correlations.

Figure 6 Feature importance analysis of the clinical model (a) and combined model (b) built on the LightGBM algorithm. The color differences are used only for ease of identification and have no other meaning. The features shown are arranged in order of importance, and their degrees of importance are indicated numerically.

Table 3 Univariate and multivariate logistic regression analysis of variables (features) involved in models’ construction associated with breast cancer with distant metastasis.

Discussion

This study used the LightGBM algorithm to construct clinical and combined models from relatively accessible clinical blood markers and cost-effective routine ultrasonography, aiming to identify de novo breast cancer with distant metastasis. Both models discriminated well on the internal and external test sets (AUC 0.795–0.918). The combined model also showed greater net benefit in distinguishing breast cancer with distant metastasis, suggesting higher predictive efficiency and robustness. The models distinguished breast cancer patients with distant metastasis from those without, giving clinicians additional evidence to support clinical suspicion and potentially enabling more effective triage in breast cancer diagnosis and treatment.

Breast cancer is the most common malignancy among women globally. Among these patients, distant metastasis is a common form of recurrence and a lifelong risk13. Notably, distant metastasis is a significant factor contributing to diminished quality of life and, in some cases, mortality among breast cancer patients13,14. For the diagnosis of metastatic breast cancer, the European Society for Medical Oncology clinical practice guidelines require confirmation through imaging studies such as MRI, CT, or functional imaging (positron emission tomography-computed tomography, dynamic contrast-enhanced magnetic resonance imaging, or magnetic resonance diffusion-weighted imaging) when clinical suspicion exists15. Whether a breast cancer patient undergoes this series of imaging studies depends solely on the clinician's suspicion, often requiring expensive functional imaging studies even when the results turn out to be inconclusive. The strength of our model lies in its ability to flag patients who may have distant metastasis among those without it, giving clinicians additional evidence to support their suspicion.

The significance of these models lies in providing a more effective method to identify patients who may have breast cancer with distant metastasis. They integrate clinical blood markers and ultrasound data, which are relatively easy to obtain and cost-effective. By analyzing these features, the models generate predictions that can guide physicians to pay earlier attention to patients who might require further examination and monitoring. In clinical practice, this could allow more precise screening and diagnosis of breast cancer patients: physicians can use the models to direct further targeted examinations and assessments, especially for patients at risk of distant metastasis. Such early identification and intervention can help physicians devise more personalized treatment plans, with the aim of improving patients' survival and quality of life.

Regarding the performance difference observed between the external validation set (test1 cohort) and the internal test set (test cohort), we carefully evaluated several factors that may have contributed to this disparity. Despite both centers adhering to similar inclusion and exclusion criteria, subtle differences in patient demographics, clinical practices, or data collection methods between centers could potentially influence model performance. Notably, the test1 cohort demonstrated superior results on multiple performance metrics compared to the test cohort, particularly in terms of higher AUC values for both the clinical and combined models (Fig. 4). This outcome underscores the robustness and generalizability of our developed models when applied to a completely independent dataset, validating their predictive capability across different patient populations and clinical settings.

We primarily assessed model performance using AUC values and clinical decision curves. Although the combined model achieved higher AUC values than the clinical model, the differences were not statistically significant by the DeLong test (test cohort, p = 0.103; test1 cohort, p = 0.245). However, clinical decision curves showed that the combined model yielded greater net benefit than the clinical model in both the test and test1 cohorts. On this basis, we consider the combined model to perform better than the clinical model. Its advantage lies not only in predictive performance and clinical utility but, more importantly, in how it integrates multiple data sources to support decisions. Firstly, the combined model brings together clinical blood biomarkers and ultrasound features, representing hematological and imaging information respectively. This allows the model to consider patients' physiological, biochemical, and morphological characteristics together, supporting more comprehensive and accurate prediction and diagnosis of distant metastasis in breast cancer. Secondly, relying solely on clinical blood biomarkers or a single imaging examination may lead to insufficient information or misjudgments. In contrast, the combined model draws on multiple information sources to provide a more comprehensive assessment, reducing the risk of misdiagnosis and thereby increasing the precision and confidence of clinical decisions. Additionally, the development of the combined model opens new prospects for future clinical practice.

In addition to constructing artificial intelligence LightGBM models capable of identifying breast cancer with distant metastasis, this study conducted an analysis of the models' feature importance. Among the features involved in the models' predictions, CK-MB exhibited the most significant importance. CK-MB, a creatine kinase isoenzyme composed of M and B subunits, primarily exists in cardiac and skeletal muscles16. Chang et al. found significantly higher CK-MB-to-total-CK ratios in late-stage malignant tumors compared to early-stage ones17, suggesting an association between CK-MB and late-stage cancer. Li et al.’s16 research indicated elevated serum CK-MB activity in various cancers, including breast cancer, with significantly higher serum CK-MB activity in metastatic tumor patients. Regarding the origin of elevated CK-MB in malignant tumors, Lee et al. detected a higher proportion of CK-MB in tumor tissues of lung cancer patients, hypothesizing that increased plasma CK-MB originates from tumor tissues rather than cardiac or skeletal muscles18. In this study, CK-MB, as one of the important features of the model, played a significant role in the model’s predictions. In univariate and multivariate regression analyses, CK-MB emerged as an independent risk factor for breast cancer with distant metastasis, positively correlated with distant metastasis. Further research and exploration are required to understand why CK-MB elevation manifests in breast cancer with distant metastasis and its source, whether from tumors or other factors.

Regarding other features of the model, CA153, a common tumor marker, demonstrated the ability to predict breast cancer with distant metastasis19. In this study's model construction, CA153 was also one of the important features. Concerning magnesium ion concentration and breast cancer distant metastasis, Karki et al.20,21 detected lower magnesium ion concentrations in breast cancer distant metastasis. This study's findings align substantially with that, showing a negative correlation between magnesium ion concentration and breast cancer distant metastasis. Presently, studies regarding indirect bilirubin and its relation to prognosis exist for colorectal, ovarian, and lung cancers22,23,24. Liu et al.25 reported a significant association between hyperlipidemia and breast cancer distant metastasis. However, there are no reported connections between indirect bilirubin, lipoprotein (a), and breast cancer distant metastasis. This study, for the first time, attempts to incorporate these as features in predicting breast cancer distant metastasis using artificial intelligence models. The relationship between indirect bilirubin, lipoprotein (a), and breast cancer distant metastasis warrants further research and exploration.

This study also has some limitations. Firstly, the metastatic breast cancer cases included in this study were de novo breast cancer with distant metastasis, encompassing common bone, liver, and lung metastases. Compared to de novo metastatic breast cancer, the prognosis of post-treatment metastatic breast cancer is poorer, possibly due to the tumor molecular reselection after treatment, leading to more aggressive biology26. The included metastatic breast cancer in this study does not cover all types of distant metastases, such as brain metastasis and post-treatment breast cancer distant metastasis. Secondly, although our study data came from two centers, both belong to the same region. The source of the dataset lacks diversity, necessitating further verification across multiple centers, even internationally. Additionally, during the diagnostic process, the reporting habits and personal experience of ultrasound physicians varied significantly, leading to considerable heterogeneity in the ultrasound results. In our collected data, the maximum diameter of the lesion was mentioned in the vast majority of reports. To ensure data accuracy and reliability, we only included the maximum diameter in the ultrasound features. In future prospective studies, we will address these issues. For instance, we will ensure that all relevant ultrasound features (such as calcification and borders) are consistently mentioned in the reports, regardless of their presence or absence. To facilitate future data collection, we may use standardized forms for ultrasound physicians to complete. Finally, in different medical institutions or with different equipment, the model's performance might vary. This study's model might require more validation datasets to ensure its generalizability and robustness in various clinical settings.

Conclusions

In summary, this study successfully developed and validated artificial intelligence clinical models and combined models using LightGBM machine learning algorithms based on clinical blood markers and ultrasound data to predict distant metastasis in breast cancer patients. Particularly, the combined model integrating clinical blood markers and ultrasound features exhibited high accuracy in predicting and identifying breast cancer distant metastasis, demonstrating potential clinical application value. These significant findings highlight the potential of developing economically efficient and easily obtainable predictive tools in clinical oncology. They are poised to elevate the level of clinical decision-making and prognosis assessment, potentially reducing the need for expensive or invasive imaging techniques. The research underscores the prospects of utilizing readily available clinical blood markers and cost-effective ultrasound data to develop predictive tools, holding critical significance for the advancement of clinical oncology, potentially offering patients more convenient and efficient healthcare.