Introduction

Osteoporosis, a skeletal disorder characterized by compromised bone strength and an increased risk of fractures, poses a significant public health challenge worldwide1,2. Adverse outcomes of osteoporosis such as hip and vertebral fractures are associated with substantial morbidity, mortality, and economic costs3. Unfortunately, osteoporosis often remains undiagnosed and untreated until a debilitating fracture occurs2; proactive identification of high-risk individuals is therefore a critical step in mitigating the disease burden.

Many studies have identified common risk factors for osteoporosis, including aging, female gender, family history, low body mass index, and certain lifestyle factors (smoking, excessive alcohol consumption, and lack of physical activity)1,2. Moreover, several chronic diseases such as rheumatoid arthritis, diabetes, and kidney disease have been associated with an increased risk of osteoporosis4. Nevertheless, the complexity of the interactions between these risk factors makes it challenging to accurately predict osteoporosis risk for an individual patient using traditional statistical methods5.

Recent advances in artificial intelligence (AI) and machine learning (ML) technologies offer promising opportunities for enhancing risk prediction6,7,8. ML algorithms can handle high-dimensional data and capture complex, nonlinear relationships between predictors, making them particularly suited for developing prediction models based on multifactorial disease data9,10. Indeed, ML has shown potential in various healthcare applications, including disease diagnosis and prognosis, treatment response prediction, and patient stratification11,12.

However, the development and validation of an ML predictive model for osteoporosis risk, particularly one based on chronic disease data, remain largely unexplored13. In the present study, we aim to develop an ML-based predictive model for estimating osteoporosis risk using a comprehensive set of chronic disease data. Our model is expected to assist community healthcare workers in screening individuals at high risk of osteoporosis during health follow-ups using simple indicators, thereby enabling early intervention and preventive measures in high-risk individuals. Ultimately, the results of this study have the potential to contribute to a reduction in fracture incidence, improvement in patient outcomes, and alleviation of the healthcare burden associated with osteoporosis.

Through this research, we aim to construct a predictive model in osteoporosis risk prediction, where machine learning and big data are leveraged to deliver personalized risk assessment and preventive care6. This is expected to provide a reference for the adoption and integration of ML technologies in bone health management, and potentially, in the broader context of chronic disease prevention and management.

Materials and methods

Study design and data source

This study was designed to develop and validate a machine learning predictive model for the risk of osteoporosis based on nationwide chronic disease data from Germany. The data used in this study were obtained from 10,000 complete records of open-source primary healthcare data in the German Disease Analyzer database (IMS HEALTH)14. This open-source dataset covered ten different chronic diseases (CDs) based on primary healthcare diagnoses (ICD-10 codes): Hypertension (I10), Lipid metabolism disorders (E78), Diabetes (E10-E14), Coronary heart disease (I20-I25), Cancer (C00-C97), Chronic obstructive pulmonary disease (J44), Heart failure (I50), Stroke (I63, I64, G45), Osteoporosis (M80, M81), and Chronic kidney disease (N18, N19).

Data preparation

Patient data were randomly split into a training set and a test set in a ratio of 7:3 using a stratified random sampling method implemented in Python (version 3.9). This approach ensured that the distribution of osteoporosis cases was similar in both datasets. Label encoding methods were applied to process categorical variables such as smoking and diabetes status.
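As a minimal sketch of this step (with synthetic arrays standing in for the actual patient records), the stratified 7:3 split can be reproduced with scikit-learn's `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in: 10,000 records, 11 binary predictors, imbalanced labels
X = rng.integers(0, 2, size=(10_000, 11))
y = (rng.random(10_000) < 0.13).astype(int)

# stratify=y keeps the outcome prevalence (near-)identical across the 7:3 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(X_train), len(X_test))  # 7000 3000
```

Label encoding of categorical variables can likewise be done with scikit-learn's `LabelEncoder` or plain integer mapping; the exact encoder used is not reported.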

To address the imbalance of data distribution, the random oversampling method was applied. This method involved duplicating the minority class instances to balance the dataset, improving the model's ability to learn from the underrepresented class15.
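The library used for oversampling is not named; a minimal sketch of random oversampling, duplicating minority-class rows by sampling with replacement until the classes balance, might look as follows (the helper `random_oversample` is illustrative):

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate minority-class rows (sampling with replacement) until classes balance."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_needed = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_needed, replace=True)
    X_bal = np.vstack([X, X[extra]])
    y_bal = np.concatenate([y, y[extra]])
    return X_bal, y_bal

# Toy example: 4 majority vs 1 minority sample
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 0, 0, 1])
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # [4 4]
```

Only the training set should be oversampled; applying it before the split would leak duplicated rows into the test set.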

Feature selection

In the study, a comprehensive feature screening process was employed to identify the predictors for osteoporosis prediction. We harnessed the power of nine distinct machine learning algorithms to ensure a robust and exhaustive feature elimination process. The selected algorithms were: Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree Classifier (DT), Extra Trees Classifier (ET), Random Forest Classifier (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Gradient Boosting Classifier (GBC), and Ada Boost Classifier (ADA). To systematically identify and retain the most informative variables, each of these algorithms was subjected to a recursive feature elimination (RFE) procedure. This method facilitates the optimization of model performance by iteratively removing the least important features based on their predictive power16.
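A sketch of the RFE step for one of the nine algorithms (logistic regression, on synthetic data) could look like the following; in the study this procedure was repeated for each algorithm:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 11 candidate predictors, only some informative
X, y = make_classification(n_samples=500, n_features=11, n_informative=5,
                           n_redundant=2, random_state=0)

# RFE refits the model repeatedly, dropping the weakest feature each round
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
selector.fit(X, y)

print(selector.support_)  # boolean mask of retained features
print(selector.ranking_)  # 1 = retained; higher = eliminated earlier
```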

Model development

We employed ten different machine learning algorithms to build predictive models. These algorithms, implemented using scikit-learn, xgboost, and lightgbm modules in Python, included: Logistic Regression (LR), K Neighbors Classifier (KNN), Decision Tree Classifier (DT), Extra Trees Classifier (ET), Random Forest Classifier (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), naïve Bayes (NB), Gradient Boosting Classifier (GBC), and Ada Boost Classifier (ADA)17.

These algorithms' performance was initially evaluated without hyper-parameter optimization by calculating the area under the receiver operating characteristic curve (AUC-ROC). The top three algorithms were then selected for further refinement. Hyper-parameters of these algorithms were optimized using the randomized search method.
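A hedged sketch of the randomized search step is shown below; the actual search spaces are not reported, so the parameter distributions are hypothetical (GBC is used as the example estimator):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, n_features=11, random_state=0)

# Hypothetical search space; the study does not report its exact grids
param_dist = {
    "n_estimators": randint(50, 300),
    "learning_rate": uniform(0.01, 0.3),
    "max_depth": randint(2, 6),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Randomized search samples a fixed number of candidate settings rather than exhausting a grid, trading completeness for speed.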

The optimal predictive model turned out to be a stacked ensemble model, utilizing the strengths of LR, ADA, and GBC algorithms. The stacker model was developed through a two-step process. In the first layer, individual models (LR, ADA, and GBC) were trained separately on the training dataset. The predictions from these models were then used as input for the second layer to generate a final prediction.
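Assuming scikit-learn's `StackingClassifier` and a logistic-regression meta-learner (the text does not specify the second-layer model), the two-step process can be sketched as:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# First layer: LR, ADA, GBC; their cross-validated predictions feed the
# second-layer (meta) learner, which issues the final prediction
stacker = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("ada", AdaBoostClassifier(random_state=0)),
        ("gbc", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # assumed meta-learner
    cv=5,
)
stacker.fit(X_tr, y_tr)
proba = stacker.predict_proba(X_te)[:, 1]
print(proba.shape)  # (180,)
```

Using out-of-fold first-layer predictions (the `cv` argument) prevents the meta-learner from overfitting to the base models' training-set outputs.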

Model validation

The best-performing model was validated both internally and externally. Appropriate cut-off values were determined using the Youden index, which maximizes the sum of sensitivity and specificity. External validation was performed using cumulative lift, which measures the ratio of the model's predictive capability to that of random selection. A confusion matrix was used to intuitively represent prediction performance and the discrepancy between the model's predictions and the actual outcomes.
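The Youden index cut-off can be derived from the ROC curve as J = sensitivity + specificity - 1; a sketch on synthetic scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Synthetic scores: positives shifted upward so the ROC curve is informative
y_true = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
scores = np.concatenate([rng.normal(0.4, 0.15, 500),
                         rng.normal(0.6, 0.15, 500)])

fpr, tpr, thresholds = roc_curve(y_true, scores)
youden_j = tpr - fpr               # J = sensitivity + specificity - 1
best = np.argmax(youden_j)
cutoff = thresholds[best]
print(round(cutoff, 3), round(tpr[best], 3), round(1 - fpr[best], 3))
```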

The performance metrics used for model evaluation included accuracy, sensitivity, specificity, and AUC-ROC. The calibration of the models was assessed by comparing the predicted probabilities with the actual outcomes.

Statistical analysis

Statistical analyses were conducted using Python (version 3.9, Python Software Foundation). Categorical variables were represented as frequencies or proportions and compared using the chi-square test or Fisher’s exact test. The Kolmogorov–Smirnov-Lilliefors (K-S-L) test was utilized to check the normality of continuous data. Non-normally distributed variables were evaluated using the Wilcoxon rank-sum test and displayed as median, first quartile (Q1), and third quartile (Q3). Differences were deemed significant if P < 0.05.
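A sketch of these tests with SciPy on toy data (the Lilliefors variant of the K-S test is available separately in statsmodels and is omitted here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Chi-square test on a 2x2 contingency table (e.g. chronic disease vs outcome)
table = np.array([[120, 380], [60, 440]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# Fisher's exact test is preferred when expected cell counts are small
odds_ratio, p_fisher = stats.fisher_exact([[3, 7], [12, 8]])

# Wilcoxon rank-sum test for a non-normally distributed continuous variable
group_a = rng.exponential(2.0, 200)
group_b = rng.exponential(2.5, 200)
stat, p_ranksum = stats.ranksums(group_a, group_b)

print(round(p_chi, 4), round(p_fisher, 4), round(p_ranksum, 4))
```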

Feature importance assessment

Shapley Additive Explanations (SHAP) values were used to assess the impact and importance of each input variable on the model's output18. SHAP is a game-theoretic method to interpret the output of any machine learning model. It uses classical Shapley values from cooperative game theory, extending them to optimally attribute credit to local explanations. This approach provided insight into which factors were most influential in the prediction of osteoporosis risk.
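To illustrate the game-theoretic idea behind SHAP (not the `shap` library itself), exact Shapley values for a toy two-feature "model" can be computed by averaging each feature's marginal contribution over all coalition orderings; the payoff numbers below are hypothetical:

```python
from itertools import permutations

def shapley_values(value, features):
    """Exact Shapley values: average the marginal contribution of each
    feature over every ordering in which the coalition can be built."""
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = set()
        for f in order:
            before = value(frozenset(coalition))
            coalition.add(f)
            after = value(frozenset(coalition))
            contrib[f] += after - before
    return {f: c / len(orderings) for f, c in contrib.items()}

# Toy "model output" (predicted risk) for each feature coalition
payoff = {
    frozenset(): 0.10,                       # baseline risk
    frozenset({"age"}): 0.40,
    frozenset({"gender"}): 0.25,
    frozenset({"age", "gender"}): 0.60,
}
phi = shapley_values(payoff.get, ["age", "gender"])
print(phi)  # attributions sum to 0.60 - 0.10 = 0.50
```

The `shap` library approximates this attribution efficiently for real models, where enumerating all coalitions would be intractable.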

Ethics approval

This study was approved by the Ethics Committee of Guangzhou First People's Hospital.

Results

Baseline data assessment

Utilizing the open-source primary healthcare dataset provided by IMS HEALTH, this study encompassed a cohort of 10,000 patients. Of these, 8707 (87.07%) presented with osteoporosis, whereas 1293 (12.93%) did not. Comprehensive patient characteristics can be found in Table 1. The cohort was randomly stratified into a training set (n = 7000) and a test set (n = 3000), following a 7:3 distribution. The variable distributions between these two subsets exhibited no statistically significant disparities, as outlined in Table 2.

Table 1 Baseline characteristics of study population.
Table 2 Baseline characteristics of training set and test set.

Candidate features and algorithms screening

A subset of 7000 samples was randomly delineated for the training of models. Within this subset, 6095 (87.1%) were diagnosed with osteoporosis, while 905 (12.9%) were not (Table 2). According to the recursive feature elimination (RFE) outcomes across nine distinct algorithms, as visualized in Fig. 1, eight algorithms achieved optimal predictive performance upon inclusion of all eleven predictor variables. Consequently, all features were employed in the construction of predictive models for the study. The initial predictive performances of the ten models are delineated in Fig. 2. At the preliminary stage of algorithmic assessment, the area under the curve (AUC) was designated as the pivotal metric for performance evaluation. The three superior-performing algorithms earmarked for further exploration were LR (AUC: 0.753), ADA (AUC: 0.746), and GBC (AUC: 0.742), as depicted in Fig. 2.

Figure 1
figure 1

Results of nine different algorithms using recursive feature elimination (RFE) procedure for feature selection. Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree Classifier (DT), Extra Trees Classifier (ET), Random Forest Classifier (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Gradient Boosting Classifier (GBC), and Ada Boost Classifier (ADA).

Figure 2
figure 2

Performance of ten different models in internal validation without initial parameters. Area under receiver operating characteristic curve (AUC); Logistic Regression (LR), K Neighbors Classifier (KNN), Decision Tree Classifier (DT), Extra Trees Classifier (ET), Random Forest Classifier (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), naïve Bayes (NB), Gradient Boosting Classifier (GBC), and Ada Boost Classifier (ADA).

Model development and selection

Subsequent to the preliminary assessment, the three superior-performing algorithms, namely LR, ADA, and GBC, were earmarked for in-depth refinement. Hyperparameters for these algorithms were optimized via the randomized search methodology in tandem with tenfold cross-validation. The optimal predictive model proved to be a stacked ensemble (stacker) model, which synergistically leveraged the robustness of the LR, ADA, and GBC algorithms. This ensemble was constructed via a biphasic approach: the initial phase involved independent training of the LR, ADA, and GBC models on the designated training dataset, with their resultant predictions feeding into the second phase to derive a consolidated forecast.

To ascertain the reliability of the devised machine learning constructs, their performance was benchmarked using tenfold cross-validation on the training set, with outcomes delineated in Fig. 3. The stacker model (AUC: 0.773, std: 0.027) showcased enhanced predictive performance in comparison to the standalone LR (AUC: 0.753, std: 0.026), ADA (AUC: 0.751, std: 0.027), and GBC (AUC: 0.753, std: 0.028) models during internal validation. The calibration curves, presented in Fig. 4, furnish insights into model calibration, epitomizing the congruence between predicted osteoporosis risks and the empirically observed outcomes. The calibration curve associated with the stacker model evinced good agreement between predicted and observed data.

Figure 3
figure 3

Ten-fold cross-validation results of different machine learning models.

Figure 4
figure 4

(A) The calibration curves of the three models. (B) The calibration curve of the stacker model. The diagonal dotted line represents an ideal model and the solid line represents the performance of the model; a curve that lies closer to the diagonal indicates better calibration.

Model performance and feature importance

As delineated in Fig. 5A, as the probability threshold for osteoporosis increases, sensitivity declines and specificity improves. Utilizing the Youden index, an optimal threshold probability of 0.52 was ascertained for the stacker model, yielding sensitivity and specificity of 0.722 and 0.664, respectively. The model's predictive capacity is further illustrated by the cumulative lift in Fig. 5C. This metric reflects the stacker model's ability to identify osteoporosis cases relative to a given sample size when compared to random selection. In essence, it offers a comparative ratio of patients diagnosed with osteoporosis against those undiagnosed, and was instrumental in contrasting the stacker model's performance against an idealized model (one that predicts osteoporosis flawlessly) and a model based on sheer randomness. With the threshold set at 0.52, the stacker model attained a lift value of 1.9. The ROC curve for the stacker model on the test dataset is illustrated in Fig. 5B, signaling robust predictive efficacy with an AUC of 0.76. The model's predictive performance is further underscored by the confusion matrix depicted in Fig. 5D. To examine feature importance within the stacker model, SHAP values were computed, as visualized in Fig. 6. As shown, age, gender, lipid metabolism disorders, cancer, and COPD emerged as the top five features for distinguishing osteoporosis.
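One common definition of lift at a threshold is the precision among flagged patients divided by the overall prevalence; the exact computation used in the study is not specified, so the sketch below (with made-up scores) is illustrative:

```python
import numpy as np

def lift_at_threshold(y_true, y_score, threshold):
    """Lift = precision among predicted positives / overall prevalence."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_score) >= threshold
    if flagged.sum() == 0:
        return float("nan")
    precision = y_true[flagged].mean()
    prevalence = y_true.mean()
    return precision / prevalence

# Hypothetical scores: a lift of ~2 means the model finds cases at roughly
# twice the rate of random selection at that threshold
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1, 0.2, 0.3])
print(round(lift_at_threshold(y_true, y_score, 0.52), 2))  # 1.67
```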

Figure 5
figure 5

(A) Sensitivity and specificity versus cut-off probability plot of the stacker model. Decreasing sensitivity and increasing specificity are shown for increasing probability thresholds for osteoporosis. (B) The ROC curves of the stacker model in test set. (C) The cumulative lift curve of the stacker model in test set. (D) The confusion matrix of the stacker model in the test set.

Figure 6
figure 6

(A) Ranking of feature importance of the stacker model based on SHAP values. (B) Distribution of the impact of each feature on the output of the stacker model estimated using SHAP values. The plot sorts features by the sum of SHAP value magnitudes over all samples and shows the order of feature importance. This figure describes data from the test cohort, with each point representing one patient. The color represents the feature value (red high, blue low). The x axis measures the impact on the model output (right positive, left negative). A positive value indicates increased osteoporosis risk, and a negative value indicates a favorable outcome.

Discussion

Osteoporosis, often termed the "silent disease", is a prevalent condition that reduces bone density, predisposing individuals to increased fracture risk. Notably, the absence of overt symptoms until a fracture occurs underscores the urgency for early detection and preventive strategies19. Fractures, particularly hip fractures, associated with osteoporosis, often result in substantial morbidity, increased mortality, and significant health-care costs20. The societal and economic implications of osteoporosis-related fractures make predicting the disease an imperative not just from a clinical perspective but also from public health and economic viewpoints21.

Early prediction and identification of osteoporosis can pave the way for timely interventions, potentially decelerating or even reversing bone loss. This not only diminishes fracture risk but also bolsters the quality of life for the elderly population, ensuring greater independence and reduced healthcare expenditure22. Interventions, which range from lifestyle modifications to pharmaceutical therapies, have shown to be considerably more effective when osteoporosis is identified at nascent stages23.

The current research serves as a testament to the potential of machine learning in advancing osteoporosis prediction, highlighting a novel approach that melds the power of various predictive algorithms24. Existing methodologies primarily depend on bone mineral density (BMD) tests using DXA scans25. Although effective, these tests are not ubiquitously accessible, can be cost-prohibitive, and often are conducted when clinical symptoms manifest, potentially delaying timely intervention.

In many practical scenarios, especially in resource-limited settings such as communities, it might be challenging or cost-prohibitive to obtain comprehensive lifestyle data, laboratory tests, or advanced imaging results. Thus, building predictive models using data that can be extracted from primary healthcare records or community surveys offers a promising approach for early screening and detection of osteoporosis in these settings. Cheng Li26 and colleagues successfully predicted the risk of rotator cuff tears in hospital outpatients using simple questionnaire data and physical examination findings with machine learning techniques. Similarly, Limin Wang et al.27 used health questionnaire indicators and regression algorithms to make accurate predictions for symptomatic knee osteoarthritis. By identifying high-risk patients through simple indicators and recommending further precise medical examinations for them, this approach can help reduce unnecessary medical tests and save on healthcare costs.

This study, by capitalizing on nationwide primary healthcare data from Germany, offers a non-invasive and efficient means to predict osteoporosis risk based on health indicators and chronic conditions. The broad inclusion of patients spanning diverse health backgrounds supports the model's generalizability and applicability in real-world settings. Our aim is to develop a preclinical model that could support early warning, detection, and diagnosis in high-risk populations. In this study, we did not include medical laboratory test indicators or omics data as predictive factors. While this may reduce the model's performance, it also reduces the complexity of the model and enhances its practicality. Innovation in the field of machine learning does not always mean using the most advanced algorithms or complex feature engineering28. Sometimes, simplifying the development of models to improve their universality and usability represents a significant form of innovation. Simple models are easier for other researchers to replicate and validate and are more feasible to implement in real-world settings. Our results show that the model we developed has an AUC of 0.76, indicating good predictive performance.

The choice of algorithms in the present study was pivotal in ensuring robust prediction performance. The preliminary selection included a range of algorithms, out of which LR, ADA, and GBC emerged as the front-runners in terms of the AUC metric. Previous research in medical diagnostics has emphasized the importance of the AUC as an indicative measure of the model's capability to discriminate between positive and negative instances29. Although the research by Meng, Y., et al.30 suggests that sequential models, such as GRU or LSTM, may outperform non-sequential models like LR or XGB, the advantages of these models may not be fully leveraged in the context of cross-sectional data alone. In this study, considering the characteristics of the dataset used for modeling, non-sequential models were adopted as the final predictive models, which also achieved good predictive performance. Our findings are congruent with recent literature suggesting the promise of these algorithms in health-related prediction tasks31,32,33.

Ensemble methods have consistently demonstrated their mettle in improving prediction accuracy by combining the strengths of multiple models and ameliorating individual model limitations34. The use of a stacked ensemble in our study—a model synergizing the robustness of LR, ADA, and GBC—substantially augmented the AUC during internal validation. This approach capitalizes on the distinct decision boundaries offered by each algorithm, thus providing a holistic, comprehensive prediction. This approach offers higher predictive accuracy over individual models, a finding in alignment with contemporary studies on ensemble methods35.

The optimal threshold probability of 0.52 derived from the Youden index underscores the balanced consideration of both sensitivity (true positive rate) and specificity (true negative rate) in the study. This ensures not only the correct identification of actual osteoporosis cases but also minimizes false alarms, which can be critical in clinical applications to avoid overdiagnosis and unnecessary interventions. The achieved lift value of 1.9 for the stacker model accentuates its ability to effectively identify osteoporosis cases compared to random selection, validating its clinical utility.

Furthermore, the comprehensive feature selection process and rigorous validation reaffirm the model's robustness and reliability. The application of SHAP values for feature importance not only fosters transparency in machine learning predictions but also offers clinical insights, helping healthcare practitioners understand and prioritize risk factors36.

The SHAP values, an advanced tool for model interpretability, were instrumental in determining the salience of each feature within our predictive framework. Age and gender emerged as the most paramount factors, a finding that resonates with the broader osteoporosis literature. The long-established relationship between advancing age and decreased bone density makes age a pivotal predictor for osteoporosis risk37. Gender-specific differences, especially post-menopausal changes in women, exacerbate the risk of osteoporosis, emphasizing its importance in our model38.

The significance of lipid metabolism disorders in predicting osteoporosis in our model presented intriguing insights. Recent studies have begun to identify a potential association between dyslipidemia and bone mineral density (BMD) alterations39,40. Lipids play a role in bone metabolism, and aberrations in lipid profiles may adversely affect bone health. Our model's emphasis on cancer as a risk factor underscores the multifaceted relationship between cancer and osteoporosis. Some treatments for cancer, especially those involving hormone therapies, can accelerate bone loss, making patients more susceptible to osteoporosis41,42. COPD has also been linked to low bone mineral density and a higher risk of fractures. Pulmonary dysfunction and decreased BMD share underlying inflammatory pathways. The chronic inflammatory state in COPD can disrupt bone metabolism, leading to increased osteoporosis risk43. Hypertension has been associated with an increased risk of osteoporosis, potentially due to alterations in calcium homeostasis, as well as the effects of antihypertensive medications44. Stroke patients also face an increased risk of osteoporosis and fractures, likely due to immobilization and neuronal damage affecting bone metabolism. Similarly, heart failure, CHD, and chronic kidney disease have all been associated with an increased risk of osteoporosis and fractures45,46,47.

The imperative of early osteoporosis prediction has never been clearer. As populations age globally, the public health burden of osteoporotic fractures is poised to rise. Against this backdrop, our study stands as a meaningful stride towards enhancing osteoporosis predictive modalities. Utilizing the open-source primary healthcare dataset from IMS HEALTH, which included records from a large number of patients, we endeavored to develop a machine learning-based predictive model. With further research and validation, we hope the model will assist community healthcare workers in screening patients at high risk of osteoporosis during health follow-ups using simple indicators. Personalized health advice can then be given to these high-risk patients, and further medical tests such as laboratory tests or radiology can be recommended to clarify the diagnosis. This may help reduce unnecessary medical tests and save healthcare costs while ensuring benefits to osteoporosis patients.

Several algorithms were assessed in our endeavor, with the stacked ensemble approach of combining Logistic Regression (LR), Ada Boost Classifier (ADA), and Gradient Boosted Classifier (GBC) emerging as particularly promising. The superiority of this ensemble model underscores the inherent complexities of osteoporosis prediction. It emphasizes that the disease's multifaceted nature may be best captured by drawing from the strengths of multiple algorithms.

Limitations

However, it is important to acknowledge the limitations of our study. Our reliance on the IMS HEALTH dataset confines our findings to its demographic and geographic specifications. Consequently, the external validity and generalizability to other populations or regions might be limited. The ensemble model, for all its predictive strength, also adds an element of complexity, and its seamless integration into clinical settings, especially ones with limited resources, could be a challenge. The cross-sectional nature of our dataset provides just a snapshot, whereas the progression of osteoporosis warrants a more longitudinal analysis; for this reason, we did not apply sequential models in our study. In addition, owing to the limitations of the information contained in the database, our model did not incorporate factors such as diet, lifestyle, physical activity, and genetic predisposition, which reduces model complexity but may also limit its performance. In the future, we hope to collect data across more dimensions to conduct more in-depth and robust studies. Moreover, while our model identified several chronic conditions as key predictors of osteoporosis risk, it is important to note that these conditions do not operate in isolation. They often interact with each other and with other factors such as lifestyle and genetics in ways that can either exacerbate or mitigate the risk of osteoporosis. Therefore, a thorough understanding of these interactions and their implications for osteoporosis risk is necessary for the accurate interpretation and application of our model's predictions.

Conclusion

In conclusion, this study highlighted the potential of using ML techniques for predicting osteoporosis risk based on chronic disease data. The stacker model, incorporating a diverse set of variables related to age, gender, and chronic diseases, demonstrated good predictive performance and offers a tool for individualized osteoporosis risk management and early warning and detection, which could facilitate early interventions and improve patient outcomes.