Machine learning analysis for the association between breast feeding and metabolic syndrome in women

This cross-sectional study aimed to develop and validate population-based machine learning models for examining the association between breastfeeding and metabolic syndrome in women. Six models (an artificial neural network, a decision tree, logistic regression, naïve Bayes, a random forest and a support vector machine) were developed and validated to predict metabolic syndrome in women. Data came from 30,204 women who were aged 20 years or older and participated in the Korean National Health and Nutrition Examination Surveys 2010–2019. The dependent variable was metabolic syndrome. The 86 independent variables included demographic/socioeconomic determinants, cardiovascular disease, breastfeeding duration and other medical/obstetric information. The random forest had the best performance in terms of the area under the receiver-operating-characteristic curve, i.e., 90.7%. According to random forest variable importance, the top predictors of metabolic syndrome included body mass index (0.1032), medication for hypertension (0.0552), hypertension (0.0499), cardiovascular disease (0.0453), age (0.0437) and breastfeeding duration (0.0191). Breastfeeding duration is a major predictor of metabolic syndrome in women, together with body mass index, diagnosis of and medication for hypertension, cardiovascular disease and age.


Prediction model for metabolic syndrome
The performance measures for the six prediction models for metabolic syndrome are summarized in Table 2. Among the six models, the random forest performed best in terms of the area under the receiver operating characteristic curve (AUC): 90.7% (all participants), 87.7% (diagnosed with CVD), and 82.6% (no CVD diagnosis). The values and ranks of the random forest variable importance are summarized in Table 3. A predictor ranked 26th or higher can be considered a major predictor in this study, given that it falls within the top 30% of the 86 predictors. According to the random forest variable importance in Table 3, the major predictors of metabolic syndrome were body mass index (BMI) (0.1032), use of antihypertensive drugs (0.0552), hypertension (0.0499), CVD (0.0453), age at enrollment (0.0437), white blood cell count (0.0297), low-density lipoprotein (LDL) cholesterol levels (0.0263), menstrual status (0.0247), use of lipid-lowering agents (0.0237), red blood cell count (0.0231), total cholesterol levels (0.0229), subjective body image (0.0221), education level (0.0214), daily fat intake (0.0198), hematocrit levels (0.0197), and breastfeeding duration (0.0191). Breastfeeding duration was thus a major predictor of metabolic syndrome. As an example, the random forest variable importance values of BMI, CVD, and breastfeeding duration are 0.1032, 0.0453, and 0.0191, respectively. This means that the accuracy of the model will decrease by 10.32%, 4.53%, or 1.91% if the values of BMI, CVD, or breastfeeding duration, respectively, are randomly permuted (shuffled).
The importance rankings of some major predictors changed dramatically in the subgroup analysis, i.e., between the participants with and without CVD. For example, medication for and diagnosis of hypertension ranked second and third, respectively, for all participants, but these predictors fell out of the top 30 for both subgroups in Table 3. Likewise, menstrual status and education ranked eighth and 13th, respectively, for all participants, but dropped to 23rd or lower for both subgroups in the same table. Breastfeeding duration ranked 16th as a predictor for all participants; it ranked slightly higher, at 14th, for those without CVD and much lower, at 26th, for those with the condition.
The logistic regression results for each important variable, including obstetric characteristics, are presented in Supplementary Material 2. Breastfeeding duration was associated with a decreased risk of metabolic syndrome (adjusted odds ratio [aOR] 0.998; confidence interval [CI] 0.996-1.000). That is, the odds of metabolic syndrome decrease by 0.2% if breastfeeding duration increases by 1 month. In other words, the odds of metabolic syndrome decrease by approximately 2.4% (or 4.8%) if breastfeeding duration increases by 1 year, i.e., 12 months (or 2 years, i.e., 24 months). The effect of breastfeeding duration on metabolic syndrome looks small over 1 month but is more substantial over 1 or 2 years. The odds ratio is not statistically significant at the 5% level, but it is still useful information in machine learning, given that variable importance is primary and statistical significance is supplementary in this setting. Logistic regression requires the unrealistic assumption of ceteris paribus, i.e., "all the other variables remain constant". In this context, the results of the logistic regression serve as supplementary information to the random forest variable importance.
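The conversion of the per-month odds ratio to 1-year and 2-year figures can be checked with a short calculation. The following Python sketch is illustrative only and is not part of the original analysis. It assumes that the log-odds of metabolic syndrome is linear in breastfeeding duration, so that the odds ratio for k months equals the per-month odds ratio raised to the power k. The function names are hypothetical.

```python
import math


def scaled_odds_ratio(or_per_unit: float, units: float) -> float:
    """Odds ratio for an increase of `units`, assuming a constant log-odds slope."""
    return math.exp(units * math.log(or_per_unit))


def percent_change_in_odds(odds_ratio: float) -> float:
    """Percentage change in odds implied by an odds ratio (negative = decrease)."""
    return (odds_ratio - 1.0) * 100.0


AOR_PER_MONTH = 0.998  # adjusted odds ratio per month of breastfeeding, from the text

for months in (1, 12, 24):
    or_k = scaled_odds_ratio(AOR_PER_MONTH, months)
    change = percent_change_in_odds(or_k)
    print(f"{months:>2} months: OR = {or_k:.4f}, change in odds = {change:+.2f}%")
```

Under this assumption, compounding gives decreases of about 2.4% over 12 months and about 4.7% over 24 months, close to the linear approximations (12 × 0.2% and 24 × 0.2%) quoted in the text.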

Discussion
In summary, among the obstetric characteristics, one of the most significant factors associated with metabolic syndrome was the duration of breastfeeding. Among the six prediction models for metabolic syndrome, the random forest had the best performance in terms of the AUC, i.e., 90.7% (all participants). In the subgroup analysis, among the women without CVD, the importance of breastfeeding duration as a predictor of metabolic syndrome was ranked 14th (0.0235), which is comparable to that of the daily intake of sodium (12th, 0.0239).
This study presents the most comprehensive analysis of the determinants of metabolic syndrome in women using a large-scale Asian population-based cross-sectional study of 30,204 participants. While one previous paper has addressed the association between breastfeeding and metabolic syndrome in postmenopausal women using KNHANES data, our study differs in that it targeted all adult women, included more recent data (2010 to 2019), and constructed a predictive model for metabolic syndrome using machine learning [9]. This study also investigated whether metabolic syndrome-related factors differed between women with and without CVD. In a recent meta-analysis, the authors suggested that breastfeeding may have a preventive effect on metabolic syndrome and that this effect was related to breastfeeding duration [8]. However, the pooled effect of breastfeeding on metabolic syndrome was not conclusive because of heterogeneity in the study populations, the criteria for breastfeeding, and the confounding factors for metabolic syndrome [8]. In this large-scale population-based study, we evaluated the impact of breastfeeding on metabolic syndrome and compared its clinical importance with that of other risk factors known to predispose women to metabolic syndrome.
During pregnancy, the mother undergoes metabolic changes that increase insulin resistance and serum lipid levels (particularly triglyceride [TG]) [21,22]. Breastfeeding reportedly helps reverse these maternal metabolic changes after delivery, returning them to prenatal baselines more quickly [23]. It also has a long-term positive effect on maternal glucose levels, lipid metabolism, and adiposity [23-25]. The relationship between gravidity, parity, and metabolic syndrome is still debated, necessitating further research.
In this study, we investigated the importance of specific variables in the development of metabolic syndrome in women with and without CVD. Differences in the relative importance of variables between participants with and without CVD can have important clinical implications. First, in women without CVD, age (second vs. tenth), breastfeeding duration (14th vs. 26th), and gravidity (26th vs. 31st) ranked higher than in women with CVD. These variables appeared to be more strongly associated with metabolic syndrome in women without CVD and were less important in women with CVD. Second, in women with CVD, the importance of lipid-lowering agents and diabetes drugs was relatively higher. A previous meta-analysis reported that, among the five components of metabolic syndrome, the prognosis of CVD was especially poor in patients with dyslipidemia or impaired glucose tolerance [26]. From the present results, it can also be hypothesized that dyslipidemia or impaired glucose tolerance has a stronger mediating effect on metabolic syndrome in women with CVD. Third, in all three models of this study (Table 3), nutrient intake (especially fat intake) was strongly associated with metabolic syndrome, and the importance of nutrient intake was higher in women with CVD than in women without CVD. Previous studies have reported the significance of healthy diets for metabolic syndrome, which this study further emphasizes [27]. Moreover, the importance of diet in metabolic syndrome was reported to be greater in women with CVD than in women without CVD. Additionally, white blood cell count ranked sixth or higher as a predictor of metabolic syndrome in women. Plasma C-reactive protein levels and low-grade inflammation have been reported to be positively associated with metabolic syndrome [28,29]. It is reasonable to speculate that the white blood cell count also has a positive relationship with metabolic syndrome.
This study has limitations. First, a cross-sectional design was used; data with a longitudinal design would be expected to improve the validity of the findings. Second, breastfeeding duration was self-reported several years after the actual breastfeeding took place, which may limit the accuracy of the data. Furthermore, although medical history was based on a physician's diagnosis, its accuracy may be limited because it was collected through self-report surveys. Similarly, dietary intake was assessed by a nutritionist through direct interviews during household visits, but the objectivity of the respondents' answers may be limited. Third, extending this study to other diseases and predictors, such as health utility usage, might contribute significantly to this line of research. Fourth, we excluded the diagnostic criteria for metabolic syndrome from the independent variables. However, to examine the influence of CVD and the use of cardiovascular medications on metabolic syndrome, we included physician-diagnosed hypertension and the use of cardiovascular medications as independent variables. Fifth, this study used random forest variable importance as the primary result and logistic regression odds ratios as supplementary findings. That is, the former was taken to indicate the strength of the association between metabolic syndrome and each major predictor, while the latter was taken to indicate the direction of the association. There may be other ways to examine the direction of the association, and exploring them would be a valuable contribution to future research. Finally, this study did not consider possible mediating effects among the variables.
In the random forest prediction model, which had an AUC of 90.7%, the top predictors of metabolic syndrome included body mass index (0.1032), medication for hypertension (0.0552), hypertension (0.0499), cardiovascular disease (0.0453), age (0.0437) and breastfeeding duration (0.0191). Breastfeeding duration was one of the most important predictors of metabolic syndrome among the various obstetric characteristics.

Study population
This study was based on the fifth (2010-2012), sixth (2013-2015), seventh (2016-2018), and eighth (2019) waves of the Korean National Health and Nutrition Examination Survey (KNHANES). The KNHANES is a nationally representative survey that draws samples annually using a stratified multistage cluster sampling design. It is conducted by a dedicated research team that visits four regions each week (a total of 192 regions annually). The survey is conducted over a period of 3 days in each region: health surveys and medical examinations are conducted in mobile examination vehicles that visit the area, while nutritional assessments are performed by a specialized team of nutritionists who visit households directly. These data are used to assess the health status, prevalence of chronic diseases, and nutritional intake of the population of South Korea. From the KNHANES 2010-2019, men and participants under the age of 20 years were excluded from the current analyses. Cases with missing data on the occurrence or diagnosis of hypertension, myocardial infarction, or angina, or on any of the factors associated with the diagnosis of metabolic syndrome, were excluded, as was one outlier (a woman over 80 years old recorded as being before menarche).
The data were publicly available and de-identified. The requirement for ethical approval was waived by the institutional review board of Korea University Anam Hospital. All methods were conducted in accordance with relevant institutional/ethical committee guidelines and regulations. The requirement for informed consent was waived because all participant information was de-identified and encrypted to protect privacy.
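The exclusion steps described in this section can be sketched as a simple record filter. This Python example is a hypothetical illustration, not the authors' code: the field names (`sex`, `age`, `hypertension_dx`, and so on) are assumptions and do not correspond to actual KNHANES variable names, and the exclusion of the single outlier is omitted.

```python
# Hypothetical field names; real KNHANES variable names differ.
REQUIRED_FIELDS = (
    "hypertension_dx", "myocardial_infarction_dx", "angina_dx",
    "waist_cm", "tg_mg_dl", "hdl_mg_dl",
    "sbp_mmhg", "dbp_mmhg", "fasting_glucose_mg_dl",
)


def is_eligible(record: dict) -> bool:
    """Keep women aged 20 years or older with no missing required field."""
    if record.get("sex") != "female":
        return False
    age = record.get("age")
    if age is None or age < 20:
        return False
    # A value of None stands for "missing" in this sketch.
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def filter_cohort(records):
    """Return only the records that pass every exclusion step."""
    return [record for record in records if is_eligible(record)]
```

Treating missing values as `None` is a simplification; survey files often encode missingness with special codes that would need to be mapped first.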

Variables
The variables included in this study are summarized in Supplementary Material 1. The sociodemographic characteristics, including the age at enrollment, sex, body mass index (BMI), household income (represented as quartiles), marital status, the level of education (elementary school and below, middle school, high school, and college and above), areas of residence, economic activities, and occupations, were assessed using questionnaires.
The blood pressure, waist circumference, and body mass index (BMI) of the participants were measured. Levels of total cholesterol, TG, LDL, high-density lipoprotein (HDL), hemoglobin, hematocrit, blood urea nitrogen, and blood creatinine, as well as white and red blood cell counts, were also measured at the time of the survey.
The participants answered questions about their perceptions and habits related to their health. They were asked about their subjective body image, their goals for controlling their body weight, history of medical checkups in the past 2 years, history of smoking, frequency of alcohol consumption (per year), and weekly weight training routines. Data on mental health, including stress awareness and feelings of depression within the past year, were also collected. The quality of life, based on health indicators, was assessed using the European Quality of Life-5 Dimensions (EQ-5D) scale [30]. The daily intake of energy (kcal), carbohydrates (g), protein (g), fat (g), sodium (mg), water (g), calcium (mg), phosphorus (mg), iron (mg), potassium (mg), and vitamin C (mg) was ascertained from the nutrition survey.
A diagnosis of CVD required the presence of at least one of the following: (1) hypertension, (2) myocardial infarction, or (3) angina. Based on the modified National Cholesterol Education Program Adult Treatment Panel III criteria and the appropriate cutoff for central obesity in Korean adult women (suggested by the Korean Endocrine Society), metabolic syndrome was defined as having three or more of the following [1,31]: (1) central obesity (waist circumference ≥ 85 cm); (2) elevated TGs (serum TG concentration ≥ 150 mg/dL); (3) low HDL cholesterol (serum HDL cholesterol concentration < 50 mg/dL); (4) elevated blood pressure (systolic blood pressure ≥ 130 mmHg or diastolic blood pressure ≥ 85 mmHg) or the prescription of antihypertensive drugs; (5) elevated fasting glucose (fasting serum glucose ≥ 100 mg/dL) or the prescription of diabetes drugs. We excluded the variables corresponding to the diagnostic criteria of metabolic syndrome from the independent variables, namely waist circumference, TG, HDL cholesterol, blood pressure measurements, and fasting glucose.
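The five-criterion definition above can be expressed as a small classification rule. The following Python sketch uses the cutoffs stated in the text. The `Exam` record and its field names are hypothetical and are introduced here only for illustration; they are not taken from the study's data files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exam:
    """Hypothetical record holding the measurements used by the definition."""
    waist_cm: float
    tg_mg_dl: float
    hdl_mg_dl: float
    sbp_mmhg: float
    dbp_mmhg: float
    fasting_glucose_mg_dl: float
    on_antihypertensives: bool = False
    on_diabetes_drugs: bool = False


def criteria_met(e: Exam) -> List[str]:
    """Return the names of the criteria satisfied, using the cutoffs for women."""
    checks = {
        "central obesity": e.waist_cm >= 85,
        "elevated triglycerides": e.tg_mg_dl >= 150,
        "low HDL cholesterol": e.hdl_mg_dl < 50,
        "elevated blood pressure": (
            e.sbp_mmhg >= 130 or e.dbp_mmhg >= 85 or e.on_antihypertensives
        ),
        "elevated fasting glucose": (
            e.fasting_glucose_mg_dl >= 100 or e.on_diabetes_drugs
        ),
    }
    return [name for name, hit in checks.items() if hit]


def has_metabolic_syndrome(e: Exam) -> bool:
    """Metabolic syndrome is defined as three or more of the five criteria."""
    return len(criteria_met(e)) >= 3
```

For example, a participant with a waist circumference of 90 cm, TG of 160 mg/dL, and HDL cholesterol of 45 mg/dL meets three criteria and would be classified as having metabolic syndrome even with normal blood pressure and fasting glucose.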

Statistical analysis
An artificial neural network, decision tree, logistic regression, naïve Bayes, random forest, and support vector machine were used to predict metabolic syndrome. Data on 30,204 observations with full information were divided into training and validation sets in a 70:30 ratio (21,143:9061). The AUC and accuracy (the proportion of correct predictions among the 9061 observations in the validation set) were employed as the criteria for model validation. The random forest variable importance, i.e., the contribution of a given variable to the random forest performance (accuracy), was used to identify the major predictors of metabolic syndrome. For example, suppose that the random forest variable importance of CVD is 0.0453. Then the accuracy of the model drops by 4.53% if the values of the CVD predictor are randomly permuted (shuffled). The random split and analysis were repeated 50 times and averaged for external validation [32-34]. R-Studio 1.
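The permutation-based reading of variable importance, combined with repeated 70:30 splits, can be illustrated with a minimal sketch. This Python example is not the authors' implementation. It substitutes a nearest-centroid classifier for the random forest to stay dependency-free, uses synthetic data, and defines importance simply as the drop in validation accuracy after one predictor is shuffled. Random forest software may compute importance differently (for example, on out-of-bag samples). All function names are hypothetical.

```python
import random
from statistics import mean


def split_70_30(rows, rng):
    """Shuffle a copy of `rows` and split it 70:30 into training and validation."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def fit_centroids(X, y):
    """Stand-in model (NOT a random forest): one mean vector per class."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, x):
    """Predict the class whose centroid is nearest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def accuracy(centroids, X, y):
    """Share of correct predictions on (X, y)."""
    return mean(1.0 if predict(centroids, x) == t else 0.0 for x, t in zip(X, y))


def permutation_importance(centroids, X, y, feature, rng):
    """Drop in validation accuracy after shuffling one feature column."""
    baseline = accuracy(centroids, X, y)
    column = [x[feature] for x in X]
    rng.shuffle(column)
    X_perm = [x[:feature] + [v] + x[feature + 1:] for x, v in zip(X, column)]
    return baseline - accuracy(centroids, X_perm, y)


def averaged_importance(X, y, n_runs=50, seed=0):
    """Average each feature's permutation importance over repeated random splits."""
    rng = random.Random(seed)
    rows = list(zip(X, y))
    totals = [0.0] * len(X[0])
    for _ in range(n_runs):
        train, valid = split_70_30(rows, rng)
        model = fit_centroids([r[0] for r in train], [r[1] for r in train])
        X_val, y_val = [r[0] for r in valid], [r[1] for r in valid]
        for j in range(len(totals)):
            totals[j] += permutation_importance(model, X_val, y_val, j, rng)
    return [t / n_runs for t in totals]


def make_toy_data(n, rng):
    """Synthetic data: feature 0 is informative, feature 1 is pure noise."""
    X, y = [], []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data(400, random.Random(1))
    for j, imp in enumerate(averaged_importance(X, y)):
        print(f"feature {j}: mean accuracy drop = {imp:.4f}")
```

In this toy setting the informative feature should show a clearly larger accuracy drop than the noise feature. An importance of 0.0453 in the text has the same interpretation: shuffling that predictor lowers accuracy by 4.53 percentage points on average.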

Figure 1. A flow chart summarizing the experimental approach of the study. KNHANES, Korean National Health and Nutrition Examination Survey; HDL, high-density lipoprotein.

Table 1. The baseline characteristics evaluated for the prediction of metabolic syndrome. Values are mean ± standard deviation (median) or n (%). LDL, low-density lipoprotein; HDL, high-density lipoprotein; EQ-5D, European Quality of Life-5 Dimensions.

Table 2. Model performance: the average was measured for 50 runs. CVD: cardiovascular disease; AUC: area under the receiver operating characteristic curve; LR: logistic regression; DT: decision tree; NB: naïve Bayes; RF: random forest; SVM: support vector machine; ANN: artificial neural network.