Prediction of changes in serum anti-HSP27 antibody titers using a light gradient boosting machine (LightGBM) technique

Previous studies have proposed that heat shock protein 27 (HSP27) and its anti-HSP27 antibody titers may play a crucial role in several diseases, including cardiovascular disease. However, available studies have used simple analytical methods. This study aimed to determine the factors associated with serum anti-HSP27 antibody titers using ensemble machine learning methods and to demonstrate the magnitude and direction of the predictors using PFI and SHAP methods. The study employed Python 3 to apply various machine learning models, including LightGBM, CatBoost, XGBoost, AdaBoost, SVR, MLP, and MLR. The best models were selected using model evaluation metrics within a K-fold cross-validation strategy. The LightGBM model (RMSE: 0.1900 ± 0.0124; MAE: 0.1471 ± 0.0044; MAPE: 0.8027 ± 0.064, reported as mean ± sd) and the SHAP method revealed that several factors, including pro-oxidant-antioxidant balance (PAB), physical activity level (PAL), platelet distribution width, mid-upper arm circumference, systolic blood pressure, age, red cell distribution width, waist-to-hip ratio, neutrophil-to-lymphocyte ratio, platelet count, serum glucose, serum cholesterol, and red blood cells, were associated with anti-HSP27, in that order of importance. The study found that PAB and PAL were strongly associated with serum anti-HSP27 antibody titers, indicating a direct and an indirect (inverse) relationship, respectively. These findings can help improve our understanding of the factors that determine anti-HSP27 antibody titers and their potential role in disease development.

X_normalized = (X_i − X_min) / (X_max − X_min)

where X_i refers to the actual value, X_min and X_max are the minimum and maximum of the X variable, and X_normalized contains the normalized values of X. Some of the algorithms used in this study, such as SVM, MLP, and MLR, require normalized inputs, whereas LightGBM does not. To develop the desired set of machine learning models, we used K-fold cross-validation (CV) on the data. The required hyperparameters of the models were determined during the fivefold CV procedure, as reported in Table 3.
All analyses related to the pre-processing and modelling were implemented using Python 3 programming language.

Predictive techniques. Multiple linear regression (MLR). Linear regression is one of the most common predictive models; it forms the basis of regression-based machine learning models and is popular, simple, and widely used. This model predicts outcome values from a set of predictors; in other words, it studies the relationship between the predictor variables and the outcome 22 .
Multilayer perceptron (MLP). One of the most widely used and popular methods in machine learning is the multilayer perceptron (MLP), which is relatively simple and has a clear architecture. MLPs are neural networks that include at least three layers. The model consists of inputs, weights, biases, and an activation function that yields the output. The neurons of a given layer feed the neurons of the next layer with their outputs. The connection strength between neurons is determined by adaptive coefficient weights, which are multiplied by each input to the neurons. A non-linear activation function (usually a sigmoid or hyperbolic tangent) is then applied. Training an MLP consists of adjusting its coefficients (weights): the error function (the mean difference between the actual target (T) and the forecasted output) is calculated, the weights are updated based on the learning rate and the error in each epoch, and these steps are repeated until the specified number of epochs is reached, at which point the final weights are determined 23 .
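A minimal MLP regressor of this kind can be sketched with scikit-learn (the architecture, activation, and data below are illustrative assumptions, not the study's settings):

```python
# One hidden layer of 16 neurons with a hyperbolic tangent activation;
# fit() runs the iterative weight-update (backpropagation) loop described above.
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
mlp = MLPRegressor(hidden_layer_sizes=(16,),  # single hidden layer
                   activation="tanh",          # hyperbolic tangent
                   max_iter=2000,
                   random_state=0)
mlp.fit(X, y)
preds = mlp.predict(X[:3])  # predictions for the first three rows
```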
Support vector machine (SVM). SVMs are supervised learning methods used for classification and regression problems 24 and have recently been applied successfully to such problems 25 . In classification problems, SVMs create optimal decision boundaries between observations of two or more classes; in regression (function approximation) problems, SVMs fit an optimal function to the data. In both approaches, SVMs find the optimal solution by solving a quadratic optimization problem. SVMs for classification are called support vector classification (SVC), and SVMs for regression are called support vector regression (SVR) 11 . SVMs use various kernel functions to obtain optimal non-linear decision boundaries in classification and optimal non-linear functions in regression 11,26,27 . Unlike common statistical methods, SVMs do not require knowledge of the probability distribution of the observations, and unlike neural networks, SVMs have an optimal, global solution. Moreover, in SVMs the computational complexity does not depend on the number of input variables 11,28 . Based on the structural risk minimization principle, SVMs try to minimize the upper bound of the generalization error, which is the final goal of the SVM 11,12,26 .
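An SVR with a non-linear kernel can be sketched as follows (assuming scikit-learn; the RBF kernel and parameter values are illustrative, not the study's tuned settings):

```python
# SVR with a radial basis function (RBF) kernel: the kernel maps the inputs
# into a feature space where an optimal regression function is found by
# solving a quadratic optimization problem.
from sklearn.svm import SVR
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=4, noise=0.2, random_state=0)
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)  # illustrative hyperparameters
svr.fit(X, y)
preds = svr.predict(X[:5])
```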
Ensemble methods. Ensemble methods are a branch of machine learning that combine several weak learners to build a reliable model. The main goal of ensemble learning is to improve the predictive performance of models: converting weak learners into strong learners increases the accuracy of the results significantly. Ensemble learning handles both classification and regression problems well. Its popularity stems from reducing both bias and variance to boost model accuracy 29 .
Weak base learners in ensemble learning can be homogeneous (base learners of the same type) or heterogeneous (base learners of different types). Ensemble learning methods are mainly divided into boosting and bagging. Bagging stands for bootstrap aggregating, as in the random forest (RF). Boosting has various forms, such as the gradient boosting, adaptive boosting (AdaBoost), categorical boosting (CatBoost), light gradient boosting machine (LightGBM), and extreme gradient boosting (XGBoost) algorithms 29 .
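The bagging/boosting split can be illustrated side by side (a sketch assuming scikit-learn; the models and data are examples, not the study's configuration):

```python
# Bagging: a random forest averages many trees fitted on bootstrap samples.
# Boosting: AdaBoost fits trees sequentially, reweighting hard examples.
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
bagger = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
booster = AdaBoostRegressor(n_estimators=50, random_state=0).fit(X, y)

bag_preds = bagger.predict(X[:3])
boost_preds = booster.predict(X[:3])
```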
XGBoost, an improved version of the gradient boosting algorithm, is one of the most popular machine learning algorithms. It works on the decision tree approach and is based directly on the gradient boosting decision tree algorithm. It has strong predictive power and is simple to implement 30 .
Ke et al. proposed an ensemble model in 2017 31 . This model, called LightGBM, uses a decision tree algorithm as its weak learner. It employs a novel technique called histogram-based binning and learns more efficiently than other algorithms. Tree-based models such as XGBoost grow their trees level-wise, whereas LightGBM applies a leaf-wise growth strategy to generate the tree (Fig. 2).
Applying the leaf-wise growth strategy rather than the level-wise growth method can reduce errors and thus lead to higher accuracy. In addition, LightGBM can handle high-dimensional problems. The decision tree generated by XGBoost is built using the level-wise growth method 30,32 . CatBoost is a supervised machine learning method based on gradient boosting on decision trees; it is a powerful method well suited to classification and regression problems on datasets with many categorical variables. AdaBoost is another ensemble method, most commonly using a decision tree as its weak learner; it was the first successful boosting algorithm for binary classification 32 .

Model evaluation.
For evaluating the performance of the models, we employed five performance metrics, including:

RMSE = sqrt((1/n) Σ (y_i − ŷ_i)²) (Root Mean Square Error)
MAE = (1/n) Σ |y_i − ŷ_i| (Mean Absolute Error)
MAPE = (1/n) Σ |y_i − ŷ_i| / |y_i| (Mean Absolute Percentage Error)
R² = 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)² (Coefficient of Determination)

Explanation techniques. Explanations using SHAP. Since most machine learning methods are black boxes, their predictions are difficult to interpret. Hence, we need explainable machine learning methods. SHAP stands for "SHapley Additive exPlanations". Shapley values are a widely used, game-theory-based approach to explaining the output of machine learning models. The technique provides global interpretability: SHAP values not only show feature importance but also show whether a feature has a positive or negative impact on predictions. In other words, this method approximates the individual contribution of each feature for each row of data. It approximates the contribution of a feature by estimating the model output without that feature versus all the models that do include it 30,33 .
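The evaluation metrics above can be computed directly with scikit-learn; a small worked example on toy predictions (the numbers are made up for illustration):

```python
# RMSE, MAE, and R² on a tiny set of actual vs. predicted titer values.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([0.20, 0.25, 0.30, 0.15])
y_pred = np.array([0.22, 0.24, 0.28, 0.18])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt of mean squared error
mae = mean_absolute_error(y_true, y_pred)           # mean of |y_i - ŷ_i|
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot
```

Here the absolute errors are 0.02, 0.01, 0.02, and 0.03, so MAE = 0.02; R² works out to 0.856 on these values.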
Feature importance using PFI. Permutation feature importance was originally proposed for the random forest and was later extended by Fisher et al. so that it can be applied to any machine learning method. The values of a variable are permuted, and the resulting increase or decrease in prediction error is assessed. Permuting breaks the relationship between the variable and the outcome, and the resulting drop in the evaluation metrics shows how much the model depends on that feature. In effect, this method shows how important a feature is to the machine learning model under study 30,34 .
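The permutation procedure can be sketched with scikit-learn's model-agnostic implementation (the model and data below are illustrative placeholders):

```python
# Permutation feature importance: each column is shuffled n_repeats times
# and the drop in the model's score measures how much the model relies on it.
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
model = LinearRegression().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# result.importances_mean holds one mean importance score per feature
```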

Ethics approval and consent of participant.
The study protocol was approved by the Ethics Committee of Mashhad University of Medical Sciences, and written informed consent was obtained from the participants. All methods were conducted in accordance with relevant guidelines and regulations. Ethics approval code: IR.MUMS.MEDICAL.REC.1399.558.

Results
We included 55 attributes from the database. Descriptive statistics and the bivariate analysis evaluating the initial association between the target variable and all of the independent variables are reported in Tables 1 and 2. We used the Kolmogorov-Smirnov test (with Lilliefors correction) to check the normality of the distributions. Due to the non-normal distribution of the variables, we used the non-parametric Spearman correlation coefficient test, the Mann-Whitney U test, and the Kruskal-Wallis H test. The mean and standard deviation of the anti-HSP27 variable are 0.246 and 0.177, respectively. Hyperparameter tuning to achieve the optimal models was performed using fivefold CV, and the optimal hyperparameter values are summarized in Table 3. The evaluation of the trained models is reported in Table 4; these values are also visualized in the bar chart in Fig. 3.
During model evaluation, LightGBM was identified as the most accurate model. We then explained the predictions of this model using two model-agnostic explanation techniques: permutation feature importance (PFI) and SHapley Additive exPlanations (SHAP).
The plotted bar chart in Fig. 3 shows the importance of each feature in the estimation of antibody titers using the PFI technique. Among all features, PAB, HS-CRP, PAL, and TG were identified as the four most important. The importance order of the other features is shown in Fig. 4.
The base value in Fig. 5 indicates the mean prediction of the LightGBM model. Features shown in red move the estimate above the base value, while features shown in blue move it below. The LightGBM model's output value is 0.26.
The global explanation was provided via the PFI score. We used the SHAP summary plot (Fig. 6) to provide both local and global explanations and to show whether each feature has a positive or negative impact on predictions. Using this plot, we can assess both the magnitude and the direction of each feature's effect. PAB and PAL were identified as the most influential variables in the prediction of serum anti-HSP27 antibody titers by both PFI and SHAP.
Low values of the PAB variable make a strong negative contribution to the prediction, while high values make a strong positive contribution. Conversely, high values of the PAL variable make a strong negative contribution, while low values make a strong positive contribution.
The PDW, MUAC, SBP, age, RDW, WHR, NL, PLT, glucose, and RBC variables contribute negatively when their values are low and positively when their values are high. In contrast, high values of Chol contribute negatively to the prediction, and low values contribute positively. The RBC variable makes only a modest contribution. HC and the 27 other features make almost no contribution to the prediction.

Discussion
Data mining and the use of machine learning methods have made significant methodological progress across various scientific fields. Data mining research provides powerful processes and tools that enable effective analysis and knowledge discovery. Data mining aims to discover patterns and unknown correlations and to predict data trends and behaviors 35 . Ensemble methods are powerful data mining tools 24 .
Elevated serum levels of several heat shock proteins (HSPs), including HSP27, have been observed in individuals with cardiovascular disease (CVD).However, excessive expression of HSP27 can have detrimental effects, leading to increased inflammation in the body.Identifying the associated variables with serum anti-HSP27 antibody levels can serve as a potential biomarker of inflammation in individuals with CVD.Understanding the underlying mechanisms of CVD and identifying such biomarkers can aid in the development of new therapeutic  strategies for treating and managing CVD.Overall, this study has important implications for improving our understanding of CVD pathogenesis and advancing the development of effective treatments.
In the present study, the LightGBM model, which combines decision trees as weak learners, was applied. Using a data mining approach, this study represents the first attempt to identify the demographic, clinical, and biochemical characteristics associated with anti-HSP27. The study's strengths include the use of advanced and novel methods, as well as a large sample size.
Our results showed that the variables pro-oxidant-antioxidant balance (PAB), physical activity level (PAL), platelet distribution width (PDW), mid-upper arm circumference (MUAC), systolic blood pressure (SBP), age, red cell distribution width (RDW), waist-to-hip ratio (WHR), neutrophil-to-lymphocyte ratio (NL), platelet count (PLT), glucose, cholesterol, and red blood cells (RBC) were associated with anti-HSP27. The relative importance of the variables showed that PAB was the variable most strongly related to serum anti-HSP27 antibody titers, with a direct effect on their prediction. Ghazizadeh et al. 6 investigated the relationship between serum anti-HSP27 antibody titers and RDW and PAB in a cross-sectional study of 852 participants from the cohort study based on the Mashhad stroke and heart atherosclerotic disorders (MASHAD study). That study showed a significant correlation between serum anti-HSP27 antibody titers and both PAB and RDW using Spearman correlation analysis. In addition, univariate and multivariate logistic regression analyses, after adjustment for confounding factors including sex, age, physical activity, and smoking status, showed that the level of anti-HSP27 increased 1.83-fold per unit increase of PAB in subjects with PAB levels of 36.31-82.63 (H2O2%) compared with the reference group (PAB level below 36.31). Our model's results likewise showed that PAB was strongly related to serum anti-HSP27 antibody titers.
Our results showed that age was an important variable related to serum anti-HSP27 antibody titers. Our data showed that serum anti-HSP27 antibody titers were not related to gender, which is consistent with the results of the Zilaee et al. and Rea et al. studies 36,37 . The Zilaee et al. study was a case-control study conducted on a total of 106 subjects with metabolic syndrome aged 18-65 years, with and without diabetes. The Rea et al. study was conducted on four age groups (under 40, 40-69, 70-89, and 90 or older). No significant differences in anti-HSP antibodies were observed by gender or age group, but regression analysis revealed a significant relation between anti-HSP antibody levels and age 36,37 . The Kargari et al. 1 study likewise reported results similar to those of Zilaee et al. and Rea et al. regarding the relationship of age and gender with serum anti-HSP27 antibody titers. In addition, we observed a strong association, with an indirect effect, between serum anti-HSP27 antibody titers and PAL, whereas Sadabadi et al. found that PAL was not significant. One reason for the discrepancy could be that their study was performed on participants with MetS.
Kargari et al., in their study of 933 subjects, showed a significant relationship between diabetes mellitus and serum anti-HSP27 antibody titers 1 , which was not consistent with the results of our research. Azarpazhooh et al. 8 and Tavana et al. 20 reported in their studies that diabetes mellitus is not associated with serum anti-HSP27 antibody titers. Azarpazhooh et al.'s case-control study was carried out on 168 patients in the first 24 h after the onset of stroke, with patients matched for age and gender. Hence, our findings were consistent with theirs. The study of Tavana et al. was a case-control study of 106 subjects with metabolic syndrome and 6447 subjects with diabetes mellitus. These differences may be due to the target populations, other patient conditions, and the time periods of each study.
A study 38 showed a significant relationship between BMI and antibody titers to HSP60, 65, and 70. Kargari et al. found a significant relationship of BMI and HTN with serum anti-HSP27 antibody titers. Also, Azarpazhooh et al. concluded that serum anti-HSP27 antibody titers were significantly higher in hypertensive patients than in non-hypertensive patients (p < 0.001). However, our results showed that BMI and HTN were not associated with serum anti-HSP27 antibody titers. In both the Kargari et al. and Azarpazhooh et al. studies, no significant difference was observed between smoking status and serum anti-HSP27 antibody titers, which was consistent with our conclusion. We also explored the educational level of individuals, which was not related to serum anti-HSP27 antibody titers; this result differed from the study of Victora et al. 39 .
In addition to the items mentioned so far, obesity, height, LDL, TG, Chol, WHR, and Hs-CRP were positively associated with serum anti-HSP27 antibody titers in the Kargari et al. study. In another study, conducted by Sadabadi et al., PAL and HC were not significant, and serum anti-HSP27 antibody titers differed (p-value = 0.05) between subjects with high WC, HDL, TG, BPS, and BPD and participants with low WC, HDL, BPS, BPD, and glucose 5 . In our study, by contrast, PAL was related to serum anti-HSP27 antibody titers.
In addition to all of the above, our findings revealed relationships between other variables, such as MUAC and PLT, and serum anti-HSP27 antibody titers. These variables had not been examined in previous studies.
In summary, there are differences between the results of this study and the other studies mentioned. Previous studies were conducted using common statistical methods that require specific assumptions, or were case-control studies; in the present study, a non-parametric method was used that does not require such assumptions and that can predict and model both linear and nonlinear relationships between input and output patterns. These differences could be due to the cross-sectional study design and the sample sizes of the case and control groups 20 . Another possible source of these discrepancies is sample size, which can influence the bivariate analysis stage used for feature selection in our study and in others. The presence of other influential factors, the conditions of the subjects, and the particular patient groups considered in previous studies may also play a role. A limitation of this study was the exclusion of important variables such as drug use and vitamin D, due to missing values exceeding 70 percent.

Conclusion
The LightGBM method was effective in elucidating the relationships of PAB and PAL with serum anti-HSP27 antibody titers, which had a direct and an indirect effect on the prediction of the titers, respectively. The PDW, MUAC, SBP, age, RDW, WHR, NL, PLT, glucose, cholesterol, and RBC were also associated with anti-HSP27 antibody titers. In future work, we aim to investigate this topic in a longitudinal study.

Figure 2 .
Figure 2. Presentation of level-wise versus leaf-wise growth strategy.

Figure 3 .
Figure 3. Barplot for model evaluation (RMSE and MAE are multiplied by 10 for better visualization).

Figure 4 .
Figure 4. PFI scores of the studied features.

Figure 5 .
Figure 5. Explanation of the LightGBM model's output value of 0.26 using SHAP.

Figure 6 .
Figure 6. SHAP summary plot of the LightGBM model.

Table 2 .
Descriptive Statistics of qualitative clinical and biochemical characteristics of the study population.

Table 3 .
Hyperparameter tuning of the models.

Table 4 .
Evaluation of the trained models. The mean ± standard deviation is reported.