Comparison of Machine Learning Approaches in Prediction of Bone Mineral Density in Elderly Men

Bone mineral density (BMD) is a highly heritable trait with heritability ranging from 50% to 80%. Numerous BMD-associated Single Nucleotide Polymorphisms (SNPs) were discovered by GWAS and GWAS meta-analysis. However, several studies found that combining these highly significant SNPs together only explained a small percentage of BMD variance. This inconsistency may be caused by limitations of the linear regression approaches employed, because these traditional approaches lack the flexibility and the adequacy to model complex gene interactions and regulations. Hence, we developed various machine learning models of genomic data and ran experiments to identify the best machine learning model for BMD prediction at three different sites. We used genomic data of Osteoporotic Fractures in Men (MrOS) cohort Study (N=5,133) for analysis. Genotype imputation was conducted at the Sanger Imputation Server. A total of 1,103 BMD-associated SNPs were identified and corresponding weighted genetic risk scores were calculated. Genetic variants, as well as age and other traditional BMD predictors, were included for modeling. Data were normalized and were split into a training set (80%) and a test set (20%). BMD prediction models were built separately by random forest, gradient boosting, and neural network algorithms. Linear regression was used as a reference model. We applied the non-parametric Wilcoxon signed-rank tests for the measurement of MSE in each model for the pair-wise model comparison. We found that gradient boosting shows the lowest MSE for each BMD site and a prediction model built using the machine learning models achieves improved performance when a large number of SNPs are included in the models. With the predictors of phenotype covariate + 1,103 SNPs, all of the models were statistically significant except neural network vs. random forest at femoral neck BMD and gradient boosting vs. random forest at total hip BMD.


Introduction
Osteoporosis is a major bone disease characterized by reduced bone mineral density (BMD) and deteriorated bone architecture, leading to increased fracture risk.Osteoporosis and its major complication, osteoporotic fracture, affecting both men and women, cause substantial morbidity and mortality worldwide 1 .Although women have a higher risk of osteoporosis, men suffer much higher morbidity and mortality rates following osteoporotic fractures, especially in their advanced age.With populations ageing worldwide, osteoporosis has become a critical public health problem globally.The worldwide fracture incidence in hip alone is projected to increase by 310% in men and 240% in women by 2050, compared to rates in 1990 2 .The potentially high cumulative rate of fracture, which often results in excess disability and mortality 3 has caused an inevitable increase in the social and economic burden associated with bone health.
BMD remains the operational definition of osteoporosis since 1994.Osteoporosis is defined by the World Health Organization as a BMD that lies 2.5 standard deviations or more below the average value for young healthy women.BMD is the single strongest predictor of primary osteoporotic fracture 4 .
Each standard deviation decrease in BMD is associated with a 1.5-3.0fold increase in the risk of fracture, depending on the skeletal region measured, type of fracture, and ethnicity of study population 5 .
BMD is a highly heritable trait.Genetic differences in BMD are well documented 6 .Family and twin studies show BMD variances of 50-85% are attributable to genetic factors 7 .Other studies report BMD heritability estimates of 72-92 8 .In the past decade, major genome-wide association studies (GWASs) and genome-wide meta-analyses have successfully identified numerous BMD-associated Single Nucleotide Polymorphisms (SNPs) associated with decreased BMD 9 .However, combining these large number of highly significant SNPs together, surprisingly, only explained a very small percentage of BMD variance 10 .Such inconsistency may be caused by limitations of the conventional regression approaches employed because these conventional approaches lack the flexibility and the adequacy to model complex interactions and regulations.The Osteoporotic Fractures in Men Study (MrOS) is a unique prospective study in which BMD data and genotyping data are available.
Linear regression has been widely used as the traditional approach to predict BMD outcome 11 .
ML focuses on implementing computer algorithms capable of maximizing predictive accuracy from complex data.ML has a much better capacity to model real-world complex relationships, including variable interactions.Several ML techniques have been applied in clinical research for disease prediction, and ML has shown much higher accuracy for diagnosis than conventional methods 12 .Gradient boosting, random forest, and neural network are widely used ML approaches for complex medical data 12 .However,

Wu et al -6 -
the performance of these ML models for BMD prediction remains unknown, especially with genomic data.
Hence, the aims of the current study are 1) to develop models using ML algorithms to predict BMD from the data with genomic variants; 2) to compare these models to determine which ML model performs the best for BMD prediction.We hypothesize that when we utilize the ML models to predict BMD, ML models will perform better than the linear regression (LR).

Data Source
The Osteoporotic Fractures in Men Study (MrOS) was used as the data source for this study.
MrOS is a federal funded prospective, cohort study which was designed to investigate anthropometric, lifestyle, and medical factors associated with bone health in older, community dwelling men.Details of the MrOS study design, recruitment, and baseline cohort characteristics have been reported 12 elsewhere.
With the approval of the institutional review board at the University of Nevada, Las Vegas and National Institute of Health (NIH), the genotype and phenotype data of MrOS were acquired from dbGaP (Accession: phs000373.v1.p1).MrOS have 5,130 subjects who have both genotype and phenotype data available for authorized access.

Study participants
Participants in the MrOS were at least 65 years old, community-dwelling, ambulatory, and had not bilateral hip replacement 13 at the study entry.At enrollment, participants had to provide self-reported data, understand and sign the written informed consent, complete the self-administered questionnaire (SAQ), attend a clinic visit, and complete at least the anthropometric, DEXA, and vertebral X-ray procedures.The participants could not have a medical condition that would result in imminent death based on the judgment of the investigators.A total 5,994 men were enrolled between March 2000 and April 2002 from six communities in the United States (Birmingham, AL; Minneapolis, MN; Palo Alto, CA; Pittsburgh, PA; Portland, OR; and San Diego, CA.) 14 .

Outcome BMD measurements
Total body, total femur BMD, and lumbar spine (L1 to L4) were measured using a fan-beam dual-energy X-ray absorptiometry (QDR 4500 W, Hologic, Inc., Bedford, MA, USA) at the second visit of MrOS.Participants were scanned for BMD measurements by licensed densitometrists using standardized procedures.All DXA operators were centrally certified based on the evaluation results of scanning and analysis techniques.Cross-calibrations, which were conducted prior to participants' visits for BMD measurement, found no linear differences across canners, and the maximum percentage .difference between scanners was 1.4% in mean BMD of total spine 15 .No shifts or drifts in scanner performance was found based on longitudinal quality control using daily scan data for standardized phantoms in each clinical center.
We predict three outcomes: Femoral Neck BMD (݊ ൌ 5,129), Total Spine BMD (݊ ൌ 5,120), and Total Hip BMD (݊ ൌ 5,129).We predict each BMD with two sets of predictors: (i) phenotype covariates + GRS; and (ii) phenotype covariates + 1,103 SNPs.We are also interested in the use of individual SNPs that are found by Morris et al. 9 .We encoded each 1,103 risk SNPs as three different genotypes (homozygous major allele, heterozygotes, homozygous minor allele) with 0, 1, and 2, respectively and then we treated it as a continuous variable to predict each BMD.

Assessment of covariates
Bone health related information, including demographics, clinical history, medications, and lifestyle factors were obtained by self-administered questionnaires.This information comprised variables used in this study, including age, race, smoking, and alcohol consumption.Height (cm) was measured using a Harpenden stadiometer and weight (kg) was measured by a standard balance beam or an electric scale.BMI was calculated as kilograms per square meter.
Smoking was categorized as "never", "past" and "current".Alcohol intake was quantified in terms of usual drinks per day.Walking speed was determined by timing completion of a 6-m course performed at the participant's usual walking speed.Mobility limitations were measured by participants' ability to rise from a chair without using their arms, as well as their ability to complete five chair stands.
Health status was assessed as impairment in instrumental activities of daily living (IADLs).

Genotyping Data
Baseline whole blood samples were used for DNA extraction.Consent for use of DNA was obtained through written consent.Quality-control Plink genotype data files were acquired through dbGaP.
Genotype imputation was conducted at the Sanger Imputation Server.Haplotype Reference Consortium (HRC) reference panel, the most comprehensive imputation reference panel, as well as Positional Burrows-Wheeler Transform (PBWT) imputing algorithm, were used for the genotype imputation to ensure high quality of genotype imputation.The most up-to-date GWAS study in the field, published by Morris et al in 2019, identified 1,103 SNPs associated with estimated BMD (eBMD) 9 .Each of the 1,103 SNPs were successfully imputed and were included in the analysis.The imputation quality was excellent with a mean R 2 of 0.99.Genotyping for MrOS samples was performed with the Illumina HumanOmni1_Quad_v1-0 H array.A total of 934,940 SNP markers with known chromosome locations, and SNP markers with minor allele frequencies greater than or equal to 0.05 were analyzed.

Genetic risk score
A genetic risk score (GRS) is a standardized metric derived from the number of risk alleles and their effect size for each study subject.This metric allows the composite assessment of genetic risk in complex traits.A linkage disequilibrium (LD) pruning was performed in advance to eliminate possible LD between SNPs.None of the 1,103 SNPs was removed after the pruning.The weighted GRS was then calculated with algorithms described previously 16 .Briefly, for each participant in MrOS, overall weighted GRS was calculated by summing the number of risk alleles at each locus multiplied by their effect size, which are regression coefficients related to BMD, from the referenced literature 6 .

Model evaluation and validation
To evaluate the accuracy of the model, we randomly split a total of 5,130 patients into two parts: (i) the training set (80%), which was used to build the model, (ii) the test set (20%), which was used to evaluate the performance (i.e., the predictive accuracy) of the developed model.Commonly used metrics to evaluate the model performance are mean squared error and mean absolute error 17 .We adopted these two metrics: where ݊ is the sample size, ‫ݕ‬ is the actual value for each observation, and ‫ݕ‬ ప ෝ is the estimated value for each observation in the test dataset.

Hyper-parameter Optimization
10-fold cross-validation was used for hyper-parameter optimization.In 10-fold cross-validation, we chose one fold as a validation set and remaining folds as the training set.We used Scikit-learn's randomized search cross-validation method 18 to find the best hyperparameters for different algorithms.
The training set was used to train and construct the models of logistic regression, random forest, gradient boosting, and neural network.Once the model was developed on the training set, then each model evaluated in a test set.We used a small learning rate and relatively small depth for RF and GB algorithms to reduce the risk of overfitting as it has been suggested that RF and GB are robust to overfitting 19 .With phenotype covariate + GRS, we used the depth 3 at THBMD, depth 2 at FNBMD and depth 2 at TSBMD.
With phenotype covariate + 1,103 SNPs, we used the depth 4 at THBMD, depth 3 at FNBMD and depth 3 at TSBMD.Regularization is also a method to avoid overfitting 20,21 in the regression models.We used the Lasso 22 to avoid overfitting with phenotype covariate + 1,103 SNPs predictors in NN.In our study, statistical significance was set at ‫‬ ൏ 0.05 ሺ5%ሻ.

Statistical Testing
Several studies suggest the use of statistical tests in machine learning for comparisons of many algorithms [23][24][25] .The most frequently used statistical tests to determine significant differences between two machine learning algorithms are the t-test and Wilcoxon signed-rank test 26 .The t-test is a parametric test that requires certain assumptions of independence, normality, and heteroscedasticity, which are usually not satisfied in the machine learning models.Thus, the nonparametric test of Wilcoxon signed-rank test is widely applicable to compare ML models.

Software used
All of the analyses were performed in the Python Software Foundation and Python Language Reference, version 3.7.3 with the package Scikit-learn: Machine Learning in Python 18 .Available at http://www.python.org

Data analysis
Figure 1 shows the overview of our data process flow for this study.With two different MrOS data sets (Phenotype set (݊ ൌ 5,143) and Genotype set (݊ ൌ 5,130)), we first imputed the missing value for each data set as we discussed in the methods section.Next, we combined these two data sets to one data set for further analysis.When combining these two data sets, data from 13 patients in the MrOS Genotypes were removed due to their phenotype variables not being available in MrOS Phenotype data set.This resulted in one data set of (݊ ൌ 5,130).Then we normalized the continuous predictors to have the same scale without distorting differences in the range of values.Normalization is a technique that is applied as a part of data preparation for machine learning models.We randomly split the dataset into a training set (80%, ݊ ൌ 4,104) and a test set (20%, ݊ ൌ 1,026).Outliers were not removed due to their relevance in the models.The most common method of missing data replacement is to use the median 27 .
The median imputation was applied to fill the missing value in the variable. .

Model Performance
Figure 2 shows the performance of each model in the test dataset (݊ ൌ 1,026).The first row (PC + GRS) of A, B, and C in Figure 2 shows the performance of each model with the predictor of phenotype covariates + GRS in the test dataset with MSE.Although the LR model had the relatively higher MSE in the first few iterations, LR and ML models are performing nearly identical with 100 iterations for each BMD site.Among these three BMD sites, FNBMD (A, in Figure 2) shows the lowest MSE, followed by THBMD (B, in Figure 2) and TSBMD (C, in Figure 2) in all of the models.The second row (PC + 1,103 SNPs) of D, E, and F show the performance of each model with the predictor of phenotype covariates + 1,103 SNPs in the test dataset of MSE.LR model results showed a higher MSE than ML models at each BMD site.
Tables 2 and 3 show the results of MSE and MAE for each model in the test dataset (݊ ൌ 1,026) with two different sets of predictors phenotype covariate + GRS and phenotype covariates + 1,103 SNPs.
The nonparametric Wilcoxon signed-rank tests were implemented for the measurement of MSE.
The hypothesis for our testing is two models are equal, otherwise are not.The results are shown in Tables 4 and 5 for two different sets of predictors, phenotype covariates + GRS and phenotype covariates + 1,103 SNPs.In each cell results for a given pair of model was placed (i.e., ‫-‬value is placed to compare two models like Neural Network vs. Linear Regression).With Bonferroni corrections for multiple comparisons (α = 0.05/6=0.0083),our results suggest that none of the models were statistically significant with phenotype covariates + GRS except gradient boosting vs. random forest at FNBMD and TSBMD.However, the difference of accuracy between two models in all pairwise comparisons was statistically significant with ‫‬ ൏ .0001,except Neural Network vs. Random Forest at FNBMD and Gradient Boosting vs. Random Forest at THBMD with ‫‬ .05.

Discussion
In this study, we investigated whether ML models perform better to predict the major BMD in MrOS data.It has been known that the performance of NN models are superior to predict BMD values Wu et al -11using several parameters among postmenopausal women than the conventional regression methods 11,28 .
To the best of our knowledge, this is the first attempt to predict BMD at a different site, using various ML models by also including the genetic factors in Men's data.Machine learning employs a wide-ranging class of algorithms widely used to solve complex prediction problems.It has been very popular when prediction is dependent on the integration of a large number of predictors, including higher-order interactions, and when sizeable training datasets are available for model fitting.In particular, gradient boosting is powerful for continuous outcome prediction 29 .ML models have been used widely in classification problems (e.g., disease or not, fracture or not).However, there are only a few studies that applied ML models to predict the quantitative variable in the clinical study.Our models' performance results support that gradient boosting shows the lowest MSE for each BMD site, and a prediction model built using the ML models achieves improved performance when a large number of SNPs are included in the models 30 .
We predicted each BMD site with two sets of predictors: 1) phenotype covariate + GRS and 2) phenotype covariate + 1,103 SNPs to predict BMD using various ML models.Our results showed that when we used the predictors of phenotype covariate + GRS, ML models are performing relatively similarly to LR, however, when we used the predictors of phenotype covariate + 1,103 SNPs, the result of model performance was consistent with our hypothesis that ML models are slightly better than LR.We assume that the low performance of ML with phenotype covariate + GRS, is due to the lack of predictors in the model.Since we only have 10 predictors including GRS, ML models are performing close to LR model.Since the use of ML approaches are suggested with the interaction of predictors 30 , our selected ML models are performing better with predictors of phenotype covariate + 1,103 SNPs.
We applied the nonparametric Wilcoxon signed-rank tests, which were implemented for the measurement of MSE in each model for the pair-wise model comparison.With Bonferroni corrections for multiple comparisons, with the predictors of phenotype covariate + GRS, none of the models were statistically significant except gradient boosting vs. random forest at FNBMD and TSBMD.However, with the predictors of phenotype covariate + 1,103 SNPs, all of the models were statistically significant except Neural Network vs. Random Forest at FNBMD and Gradient Boosting vs. Random Forest at THBMD.
We anticipate several potential applications of our findings to novel types of clinical sites.The first application would be using the ML models to predict the BMD at different sites.Many studies found that the lower BMD is associated with higher fracture risk or osteoporosis disease 31 .Thus, ML models will help to detect the low BMD early.The second application would be using the risk SNPs to predict the BMD instead of the GRS as one variable.As our study results show, when we use the GRS as one of

Wu et al -12 -
the predictors variable, the prediction performance of ML models are relatively close with the linear regression.However, when we use the risk 1,103 SNPs with other covariates, the prediction results with ML models are better than the linear regression.We assume that since the ML models are performing better with more complex data 29 , our result shows that the risk 1,103 SNPs with other covariates are performing better than the GRS with other covariates.Since our data set is relatively simple to utilize the ML models, our approaches of ML were feasible with depth equal to 2 or 3 in RF and GB models.
We have several limitations in our study.First, the total sample size (݊ ൌ 5,130) is relatively small for ML approaches.ML approaches often require much larger training data size.For this reason, we employed 10-fold cross-validation for tuning of the hyper-parameters within the training set, instead of allocating part of the study sample for validation purpose, which would cause a smaller sample size for ML model training.Second, some covariates are not available in the MrOS through dbGaP, including medications, comorbidities, and physical activities.It may be plausible to conclude that the lack of these predictor variables in phenotype covariate + GRS, caused the ML models to perform similarly to the LR model.Third, the MrOS data we used only includes male participants who were 65 years old and mostly white (90%), so our findings may not apply to women or to individuals who are of a younger age and other ethnicity.Finally, we calculated GRS based on the common variants that were found by the Morris et al. study 9 and used these risk SNPs as our predictors along with other covariates.Recently this study 31 identified 1,362 independent SNPs, which are associated with BMD and fracture.Rare variants, which provide evidence that low-frequency non-coding variants have larger effects on BMD and fracture 32 , were not included in our risk SNPs as predictors.

Conclusion
When compared with linear models, the use of a machine-learning approach, which models data non-linearity, could lead to an improvement in the accuracy of predicting each BMD with the complex data.Therefore, our findings can also be applied to other disease or women's data.
As with other prediction problems involving machine learning techniques, incremental improvements are to be expected with increased sample size, the inclusion of more predictors, and the availability of more precise summary association statistics to calculate GRS.With the ML models performing better than LR with the predictors of phenotype covariate + 1,103 SNPs, THBMD and TSBMD give slightly lower MSE and MAE.Early prediction of BMD is essential to diagnose the patient properly.Our ML models with the predictors of phenotype covariate + 1,103 SNPs, will help the Wu et al -13clinician's decision to treat the patient properly.Since the low BMD is the major risk factor for fracture in osteoporosis 33 , accurate prediction of BMD is an essential step for the osteoporosis disease.We hope that ML approaches, with accurate prediction of low BMD along with other risk factors covariates as well as genetic factors, will help the clinician's care system with a risk patient.Wu et al -24 Abbreviations

Figure 2 .
Figure 2. Comparisons of various ML models: Mean Square Error vs. Number of iterations in the test dataset (݊ ൌ 1,026ሻ.First row (A, B, C) shows the performance of each model with the predictors, phenotype covariates + GRS, in the test dataset at different BMD site.Second row (D, E, F) shows the performance of each model with the predictors, phenotype covariates + 1,103 SNPs, in the test dataset at different BMD site.

Table 1
shows the characteristics of participants within the training (݊ ൌ 4,104) and the test (݊ ൌ 1,026) datasets.The variables are not significantly different in training and test datasets.Each outcome variable of FNBMD, TSBMD, and THBMD follow the normal distribution with means 0.78, 1.07, 0.96 and standard deviations 0.13, 0.14, 0.19, respectively.

Table 1 .
Demographic and Clinical Characteristics of Training and Test Dataset * P-values were obtained by ‫ݐ‬ -test for continuous variables and chi-square test for categorical variable. *

Table 2 .
Comparisons of various ML models: Mean Square Error (MSE) and Mean Absolute Error (MAE) at each BMD with phenotype covariates + GRS in the test dataset (n = 1,026)

Table 3 .
Comparisons of various ML models: Mean Square Error (MSE) and Mean Absolute Error (MAE) at each BMD with phenotype covariates + 1,103 SNPs in the test dataset (n = 1,026)

Table 4 .
Results of the Wilcoxon Signed-Rank Test for squared errors comprised by MSE between various ML models at each BMD with phenotype covariates + GRS in the test dataset (n = 1,026)

Table 5 .
Results of the Wilcoxon Signed-Rank Test for squared errors comprised by MSE between various ML models at each BMD with phenotype covariates + 1,103 SNPs in the test dataset (n = 1,026) Overview of Data Process Flow to predict BMD.