Ranking of a wide multidomain set of predictor variables of children obesity by machine learning variable importance techniques

Marcos-Pasero, Helena; Colmenarejo, Gonzalo; Aguilar-Aguilar, Elena; Ramírez de Molina, Ana; Reglero, Guillermo; Loria-Kohen, Viviana

doi:10.1038/s41598-021-81205-8

Published: 21 January 2021

Scientific Reports volume 11, Article number: 1910 (2021)

Abstract

The increased prevalence of childhood obesity is expected to translate in the near future into a concomitant surge of multiple cardio-metabolic diseases. Obesity has a complex, multifactorial etiology that includes multiple, multidomain potential risk factors: genetics, dietary and physical activity habits, socio-economic environment, lifestyle, etc. In addition, all these factors are expected to exert their influence in a specific and especially convoluted way during childhood, given the fast growth during this period. Machine Learning methods are appropriate tools to model this complexity, given their ability to cope with high-dimensional, non-linear data. Here, we have analyzed by Machine Learning a sample of 221 children (6–9 years) from Madrid, Spain. Both Random Forest and Gradient Boosting Machine models have been derived to predict the body mass index from a wide set of 190 multidomain variables (including age, sex, genetic polymorphisms, lifestyle, socio-economic, diet, exercise, and gestation variables). A consensus relative importance of the predictors has been estimated through variable importance measures, implemented robustly through an iterative process that included permutation and multiple imputation. We expect this analysis will help to shed light on the most important variables associated with childhood obesity, in order to choose better treatments for its prevention.

Introduction

Excess body weight in children has become a major public health problem worldwide. According to the WHO European Childhood Obesity Surveillance Initiative, 1 out of 3 European children between 6 and 9 years of age were overweight or obese in 2015¹.

Despite the unexpected plateauing of childhood obesity rates observed in developed countries², Spain maintains one of the highest European rates³. According to the ALADINO study, the prevalence of overweight and obesity in Spanish children is 23.2% (22.4% boys, 23.9% girls) and 18.1% (20.4% boys, 15.8% girls), respectively⁴.

Childhood obesity often leads to obesity in adults, and it is considered one of the main risk factors associated with the development of noncommunicable diseases⁵, such as type 2 diabetes mellitus, dyslipidemia, hypertension, non-alcoholic fatty liver disease, cardiovascular disease and premature mortality in adulthood. The greater the severity of obesity, the higher the risk of cardio-metabolic diseases, mainly in children⁶.

The multifactorial etiology of obesity is well known and includes genetic susceptibility, dietary and physical activity habits, social and health factors and, especially in the case of children, a permissive and obesogenic lifestyle that begins in the mother’s womb and continues throughout childhood and adolescence^6,7,8. In this respect, Machine Learning (ML) techniques are useful tools to analyze this convoluted phenomenology, as they are especially well suited to modeling complex, nonlinear relationships in high-dimensional data⁹. This is the case for methods like Random Forest (RF)¹⁰, which is based on an ensemble of decision trees built on random samples drawn with replacement from the training set (the so-called “bagging”, or bootstrap aggregating, of models), with a random subset of the predictor variables considered at each split in the decision trees. The prediction for new data results from averaging the predictions of all the trees in the RF. This approach allows an extensive search in the space of predictive models (even with many predictor variables), thereby increasing the accuracy of the prediction as well as the stability against noisy variables. Overfitting is also reduced by using bootstrap samples with random subsets of predictor variables, which decorrelate the trees. In addition, RF includes an estimation of the external prediction error from the training data, the so-called “out-of-bag” (OOB) prediction, obtained by averaging for each instance the predictions of the trees that were grown without that instance. More importantly, RF permits the assessment of the relative importance of the predictor variables by calculating, for each variable, the increase in OOB error after repeatedly permuting that variable: the higher the increase in the OOB error after permutation, the more important the variable¹⁰. This is especially interesting for explaining the predicted endpoint.
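The permutation idea can be illustrated with a minimal Python sketch. It is not the implementation used in this work: it assumes a generic, already fitted `predict` function and evaluates it on a supplied data set, whereas RF does this per tree on the OOB instances. The names `permutation_importance` and `toy_predict` and the toy data are hypothetical.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE after shuffling each column of X in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between variable j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy model (assumption): the response depends only on the first variable.
def toy_predict(row):
    return 2.0 * row[0]

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(50)]
y = [2.0 * row[0] for row in X]
imp = permutation_importance(toy_predict, X, y)
```

In this toy case the second variable is never used by the model, so permuting it leaves the error unchanged, while permuting the first variable increases it.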

Another robust ensemble-based ML method is the Gradient Boosting Machine (GBM)¹¹. In this case, the decision trees are added sequentially, with each tree fitted to reduce the prediction error of the previous ones. Normally a stochastic version of this approach is used, in which each newly added tree is fitted on a random subsample (e.g. 50%) of the whole dataset, in order to decorrelate the trees and thus obtain predictions with less variance. GBMs are also amenable to variable importance calculations.
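A minimal sketch of stochastic gradient boosting under simplifying assumptions (squared-error loss, a single predictor, and one-split trees or "stumps") may help to fix ideas. It is only an illustration and not the GBM implementation used in this work; the function names, the learning rate of 0.1 and the toy data are hypothetical.

```python
import random

def fit_stump(xs, residuals):
    """Least-squares one-split tree: returns (threshold, left_mean, right_mean)."""
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], overall, overall)  # fallback when no split exists
    for t in sorted(set(xs))[:-1]:  # excluding the maximum keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(xs, ys, n_trees=100, rate=0.1, subsample=0.5, seed=0):
    """Add stumps sequentially, each fitted to current residuals on a random subsample."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    n_sub = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), n_sub)  # sampling without replacement
        t, lm, rm = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append((t, lm, rm))
        preds = [p + rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy data (assumption): a step function of a single predictor.
xs = [i / 20 for i in range(40)]
ys = [1.0 if x > 1.0 else 0.0 for x in xs]
base, stumps, preds = boost(xs, ys)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, preds)
```

Comparing `final_mse` with `initial_mse` shows how the sequentially added trees reduce the training error relative to predicting the mean.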

There have been some recent efforts to use ML techniques to model obesity and other body mass index (BMI)-related endpoints (for a recent review, see¹²). However, these mostly involve adult samples, while for children the work has been limited, in some cases preliminary and with restricted variable sets, and in no case with genetic ones^13,14,15,16,17,18,19,20,21,22. For a recent comprehensive review of the childhood obesity field, see²³. An interesting approach in two recent works^20,21 is the use of electronic health record (EHR) databases to develop ML models for childhood obesity, but their objective is mainly predictive, not explanatory, no genetic variables were used, and no variable importance techniques were used to rank the predictors. Childhood obesity has peculiarities that require specific modeling efforts, due to the huge hormonal and metabolic changes that occur in this period. Therefore, there is a lack of ML models for pediatric samples with high-dimensional, multidomain variable sets, especially models focused on estimating the relative importance of these variables.

The use of ML to rank predictor variables by their importance has been described for e.g. non-calcified coronary burden²⁴, attention-deficit and hyperactivity disorder²⁵, and Crohn's disease²⁶. In the case of obesity, there is one study where RF has been used to rank variables in the prediction of BMI for adolescent girls²², although in that case the set of variables is more restricted both in number and domains, mostly of psychological nature and with no genetic data.

Thus, for this work we set out to analyze a pediatric sample by ML and predict its BMI based on a large set of 190 variables from different domains: single nucleotide polymorphisms (SNPs), lifestyle, social, health, diet, exercise, and gestation variables. The sample was a group of schoolchildren of Madrid (Spain) enrolled in the GENYAL study for the prevention of childhood obesity, and here we perform a cross-sectional analysis of the baseline data. Using variable importance estimations, we attempted to rank the variables and identify those most strongly associated with the target, in order to better characterize the important features for childhood obesity. We tried both RF and GBM models, in order to assess the robustness of the estimated ranks, and derived a consensus variable importance score for all the predictors by combining the importance estimates of the two models. This consensus ranking will assist in developing better prevention strategies that will result in better expectations for quality of life and longevity in the future.

We have to stress at this point that we use the term “predictor variable” here in a statistical sense, where the values of one or more independent or predictor variables are used to obtain (predict) the value of a dependent variable (in this case BMI) through a fitted model. Given the cross-sectional nature of the data, we are actually modelling associations of BMI with other variables at a given point in time, and not forecasting future values of BMI given some current values of the independent variables, as would be the case in a longitudinal setting.

Results

Exploratory analysis

Table S1 (Supplementary Material) collects the 190 predictor variables used in the analysis. They are grouped in different domains: characteristics of schoolchildren (3); genetics (1, from 11 SNPs); physical and leisure activities (24); diet, food and nutrients (80); risk factors of pregnancy and birth (39); social, health and demographic factors (43).

The average age of the 221 participants was 6.75 ± 0.73 years; 52.50% were girls (n = 116) and 47.50% boys (n = 105). According to the WHO criteria, 32.2% of the schoolchildren evaluated had excess weight (EW) (18.1% overweight and 14.1% obesity). These figures were 25.4% and 19.0% when the International Obesity Task Force (IOTF) standard or the national criteria of the Orbegozo Foundation were used, respectively.

Table 1 shows the main descriptive characteristics of the schoolchildren's families. Regarding the nutritional status of the parents, 57.5% of the fathers and 30.4% of the mothers had EW.

Table 1 Main social and economic characteristics of the families.

The main diet, physical activity and birth characteristics of schoolchildren by sex are summarized in Table 2.

Table 2 Main diet, physical activity and birth and perinatal characteristics of the schoolchildren by sex.

The distribution of variants of the SNPs selected for the genetic risk score (GRS, see Methods) is presented in Table 3. These gene variants were consistent with Hardy–Weinberg equilibrium in all cases (p-values ≥ 0.05).

Table 3 Single Nucleotide Polymorphisms selection for the GRS design.

Random forest model and variable importances

As described previously, we derived an RF model to predict the BMI in this sample. Multiple imputation was included in the calculation of the standardized importance scores T_j for each predictor variable x_j in the dataset. A total of 100 imputations were performed (see “Methods” section). On average, the RF models explain 55.07% of the variance, as estimated by the OOB pseudo-R². Figure 1 shows a plot of the average BMI predicted by the RF models vs the actual BMI. We can see some degree of miscalibration in the plot, as the best-fit line (dashed line; the continuous line is the x = y line) shows an intercept different from 0 (− 4.06) and a slope slightly different from 1 (1.23), so the model makes worse predictions for very high values of BMI.
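For illustration, both diagnostics can be computed as in the following Python sketch. It rests on two assumptions not stated in the text: the pseudo-R² is taken as 1 minus the ratio of residual to total sum of squares, and the best-fit line is obtained by regressing the observed values on the predicted ones. The toy values are hypothetical (predictions compressed toward the mean), not study data.

```python
def mean(values):
    return sum(values) / len(values)

def pseudo_r2(observed, predicted):
    """1 - SS_res / SS_tot, i.e. the fraction of variance explained."""
    m = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - m) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def fit_line(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical BMI values; predictions are shrunk toward the mean by a factor 0.8.
observed = [14.0, 15.0, 16.0, 18.0, 20.0, 24.0]
center = mean(observed)
predicted = [center + 0.8 * (o - center) for o in observed]

r2 = pseudo_r2(observed, predicted)
intercept, slope = fit_line(predicted, observed)  # observed regressed on predicted
```

With predictions compressed toward the mean, the fitted slope exceeds 1 and the intercept is negative, which is the same qualitative pattern as the miscalibration described above.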

Through permutation of the OOB data, and within the imputation loop, we could obtain the scaled average variable importance of the different predictor variables. Figure 2 shows the resulting variable importance plot for the top-30 predictor variables. In addition, the use of multiple imputation allowed a robust analysis of the variability of the ranks of these variable importances, by estimating their mean rank and corresponding confidence intervals. Figure 3 shows the mean rank and corresponding 95% confidence intervals of the 30 most important predictor variables.
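One simple way to summarize rank variability across imputations is sketched below in Python. This is an assumption-laden illustration: it uses an empirical percentile interval of the ranks, which may differ from the procedure actually applied in this work, and the variable names and importance values are hypothetical.

```python
import math

def ranks(scores):
    """Map each variable to its rank (1 = most important) within one run."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

def rank_summary(importance_runs, level=0.95):
    """Mean rank and empirical percentile interval of the rank per variable."""
    per_variable = {}
    for run in importance_runs:
        for name, rank in ranks(run).items():
            per_variable.setdefault(name, []).append(rank)
    summary = {}
    for name, values in per_variable.items():
        values.sort()
        last = len(values) - 1
        lower = values[math.floor((1 - level) / 2 * last)]
        upper = values[math.ceil((1 + level) / 2 * last)]
        summary[name] = (sum(values) / len(values), lower, upper)
    return summary

# Hypothetical importance scores from three imputed datasets.
runs = [
    {"perception": 0.90, "energy_balance": 0.70, "grs": 0.20, "vit_d": 0.10},
    {"perception": 0.80, "energy_balance": 0.60, "grs": 0.10, "vit_d": 0.30},
    {"perception": 0.95, "energy_balance": 0.50, "grs": 0.25, "vit_d": 0.20},
]
summary = rank_summary(runs)
```

A variable that is ranked first in every run gets a zero-width interval, whereas variables that swap positions between runs get wider ones.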

The five most important variables are (in this order): Familiar nutri-status perception (Perception of the person completing the questionnaire about child's nutritional status) > Relation TEI-TEE (%) (Percentage of difference between Total Energy Intake (TEI) and Total Energy Expenditure (TEE)) > BMI of the father > BMI of the mother > Mother’s Meals (number of daily food servings of the mother). These variables are very well ranked, with both Familiar nutri-status perception and Relation TEI-TEE (%) having a null confidence interval in their average rank, as in all the imputations they were the first- and second-most important variables, respectively. The BMI of both parents share the same narrow confidence interval (3–4), while Mother’s Meals had a slightly larger confidence interval (5–7).

The next most important variables (in decreasing importance) are IPAC (Individual Physical Activity Coefficient) > GRS (genetic risk score) > Vit D (Vitamin D (mcg): quantity of daily vitamin D intake) > Mother's disease: HTG (mother has hypertriglyceridemia by medical diagnosis), with increasingly larger confidence intervals: (5–7), (5–26), (6–30) and (8–30), respectively.

The following variables show much larger confidence intervals, so that although on average they show an increasing rank, their ranking for new samples is expected to be less well defined.

Gradient boosting machine model and relative importances

For comparison purposes, and to check the robustness of the obtained variable importances, an alternative method to rank the variables was used, namely scaled relative importances in a Gradient Boosting Machine, again implemented within an imputation loop. Figure 4 displays the corresponding scaled relative importance bar plot. We can see a picture rather similar to that of RF, with 20 out of 30 top predictor variables shared between the two plots, and the four top variables (Familiar nutri-status perception, Relation TEI-TEE (%), Mother’s BMI, and Father’s BMI) being the same and in the same order. However, the exact ordering of the rest of the variables is not fully preserved, which is not unexpected given that the two methods use different functional forms, the metrics used to measure the importance of variables are also different, and the rankings themselves have increasing variability upon moving to less important predictor variables (e.g. Fig. 2), making it unfeasible to assign an exact ranking.

Consensus variable importances

Given that the two methods yielded reasonably similar rankings of variables, a combined variable importance was calculated for each variable by averaging the normalized variable importance matrices of the two methods. The corresponding variable importance plot is displayed in Fig. 5. Here, after the four conserved top variables (Familiar nutri-status perception > Relation TEI-TEE (%) > Mother’s BMI > Father’s BMI) the next five most important variables are, in decreasing importance, Mother’s meals > Prot(%TEI) > GRS > Mother’s disease: HTG > IPAC. We will focus our Discussion on this consensus score (CS) of importances.
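The averaging step can be sketched in Python as follows. The normalization shown here (scaling each method so that its top variable scores 100) is an assumption, since the exact scaling is not detailed at this point, and the variable names and importance values are hypothetical.

```python
def normalize(importances):
    """Scale importances so that the most important variable scores 100."""
    top = max(importances.values())
    return {name: 100 * value / top for name, value in importances.items()}

def consensus(rf_importances, gbm_importances):
    """Average the normalized importances of the two methods per variable."""
    rf_n, gbm_n = normalize(rf_importances), normalize(gbm_importances)
    names = set(rf_n) | set(gbm_n)
    # A variable absent from one method contributes 0 for that method (assumption).
    return {name: (rf_n.get(name, 0.0) + gbm_n.get(name, 0.0)) / 2 for name in names}

# Hypothetical raw importances on different scales for the two methods.
rf = {"a": 4.0, "b": 2.0, "c": 1.0}
gbm = {"a": 50.0, "b": 10.0, "c": 25.0}
cs = consensus(rf, gbm)
ranking = sorted(cs, key=cs.get, reverse=True)
```

Normalizing first keeps either method from dominating the average merely because its importance metric lives on a larger scale.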

Discussion

The results of the anthropometric measurements in the current study showed that one out of four studied schoolchildren had excess weight. These figures, similar to those reported in the latest ALADINO national study, reflect the magnitude of the childhood obesity problem in our society⁴.

ML is a suitable approach in predictive analytics, and it has started to be used both for early preventive recommendations related to lifestyle, and to build decision-support tools for disease risk prediction^12,27. Additionally, in view of the crucial role that prevention plays to control the high obesity prevalence, the identification of its most important risk factors could help to develop effective nutritional and educational intervention strategies. In this sense, in this study, we attempted to rank a wide set of 190 predictor variables from different domains in order to predict the BMI of children by means of ML models of the RF and GBM types.

Therefore, the novelty of the current study stems from the use of a very large number of variables from widely different domains (genetic, nutritional, exercise, social and health, lifestyle, birth and pregnancy) and their ranking by variable importance estimations. To our knowledge²³, there is no parallel in the literature for the use of such a large multidomain set of variables for childhood obesity.

We can see that the most important variable in our CS (Fig. 5) is the Familiar nutri-status perception, which has no explanatory character but reflects the parents' awareness of the nutritional status of their children. This perception nevertheless shows a variable degree of underestimation, especially for overweight/obese children, as we (data not shown) and others have observed²⁸. The next most important variable (Relation TEI-TEE (%)) is the questionnaire-based percentage difference between the Total Energy Intake (TEI) and Total Energy Expenditure (TEE), which is a measure of the energy balance of the child. In this context, it is well established that obesity entails a dietary energy intake that exceeds energy expenditure²⁹. Nevertheless, these results should be viewed with caution since, as the literature reviewed suggests, self-reported dietary measures from questionnaires are not fully adequate to describe the energy balance³⁰, and there are more accurate ways to calculate the TEE than physical activity questionnaires^31,32. However, although non-optimal, our questionnaire-based TEI and TEE do contain valuable information about the energy intake and expenditure, and thus the Relation TEI-TEE (%) variable turns out to be one of the best predictors of BMI.

The following three variables of the model are Mother’s BMI, Father’s BMI, and Mother’s Meals. These variables would comprise genetic, diet and lifestyle aspects, indicating that children inherit to a large extent their parents’ nutritional status^33,34. These predictors may be interesting in order to use them in predictive models for obesity even before birth, and as a matter of fact they are frequent predictor variables of simple logistic regression models for childhood obesity²³.

The 6th variable in importance (Prot (%TEI)) is a measure of the percentage of protein consumption within the diet, stressing the importance of a balanced nutritional strategy to prevent obesity. Prot (%TEI) is followed by the genetic risk score (GRS), which supports the genetic component of the BMI in children. This variable aggregates several single nucleotide polymorphisms well described to affect childhood obesity, and has been used previously in studies of pediatric populations^35,36. GRSs have been a great success in the study of polygenic diseases, and they could be seen as a personalized risk management strategy for obesity and overweight. Similar polymorphism-based genetic scores have been described for other pathologies such as breast cancer, prostate cancer, coronary artery disease, type 1 diabetes, type 2 diabetes and Alzheimer’s disease^37,38.

The following two variables in order of importance are mother’s hypertriglyceridemia (Mother’s disease: HTG) and the IPAC score. Regarding the mother’s hypertriglyceridemia as a predicting factor for children's BMI, previous studies have linked biochemical and body composition variables of adolescents and their parents, finding significant associations in BMI and total cholesterol between father and son, as well as in hypertriglyceridemia, with inadequate LDL or HDL levels shared by adolescents and parents³⁹. In addition, the link between obesity and increased risk for hypertriglyceridemia in children has been studied⁴⁰, and can explain the association found in this work. In turn, IPAC is a measure of the total physical exercise performed by the child as obtained from the IPAC calculation, which stresses the influence of calorie consumption through physical activity on the final nutritional status⁴¹; nowadays, it is considered an essential focus in health promotion and obesity prevention research at early ages⁴².

As stated in the Introduction, there is a single case of ML variable importance analysis (through RF) used in the prediction task of childhood obesity²². The work of Rehkopf et al.²² had a longitudinal setting and the predicted endpoints were different, namely BMI percentile change after 10 years in adolescent girls, as well as transition from normal weight to overweight or obesity. The predictor variable set was more limited (41 variables) and with a more restricted set of domains: diet, physical activity, psychological, social and parent health, lacking genetic and gestational variables. In their case, psychological variables, a domain that is absent in our dataset, appeared among the most important variables; this is probably not unexpected, given that the sample was composed of adolescent girls, where this domain would be of more importance. We think that this domain would be of less importance in our 6–8-year-old children.

We would like to point out some putative limitations of our study. One is the indirect nature^43,44 of the BMI for obesity diagnosis. However, BMI is considered a great adiposity marker and is the most practical and low-cost method, making it the preferred one⁶. On the other hand, the use of age- and sex-specific BMI z-scores instead of raw BMI is frequent in pediatric samples. However, our sample has a very narrow distribution of ages, with 84% of the children being 6–7 years old and 16% 8 years old, and we did not observe significant differences between the two sexes. Therefore, we decided to use raw BMI instead, as the z-scores are quite dependent on the population they are based on.

Likewise, the use of dietary and physical activity questionnaires may lead to reporting bias and has been criticized. To avoid or minimize such biases there is an increased need for objective measures of food intake (e.g. by use of biomarkers) and physical activity (e.g. by use of movement sensors). However, because of the high costs of such methods, questionnaires are still the most widely used instruments for determining frequency and duration of physical activity and frequency and quantity of food intake, as questionnaires are relatively cheap and efficient instruments for collecting data on a large scale in a relatively short time span⁴⁵. Nevertheless, this information should be interpreted with caution. Another limitation was the sample size, but it is important to consider that this study is framed in an intervention study of five years and corresponds to a baseline cross-sectional analysis. Therefore, at this point this model was derived for explanatory purposes, in order to identify the predictors most associated with BMI. The cross-sectional nature of the present baseline dataset prevents its use for demonstrating causality or for predictive purposes. This model rather suggests variables that would be important for childhood obesity, to be further tested in longitudinal settings. The new data accumulated along the study will be incorporated in order to derive models for predictive purposes, so as to target appropriate preventive interventions that effectively ameliorate childhood obesity.

From the statistical modelling point of view, variable importance techniques can be subject to biases^46,47. However, our use of a permutation approach avoids overestimation of categorical variables with many classes, and in preparing our dataset we removed highly correlated variables that could also be overestimated. In addition, the picture obtained from the GBM analysis is rather similar to the RF one, with 20 of the 30 top variables shared between the two methods, and exactly the same four top variables. This gives confidence in the general conclusions described above about the influence of the different predictor variables. We must also take into account that many of these variables are correlated, so that the way one method achieves its best fit will differ from that of the other given their different algorithms, while modeling basically the same physical mechanism. For instance, the important variable IPAC in the RF plot is missing from the GBM plot, while in the latter Active transport to school appears instead. However, a large component of the physical activity of the child (measured by IPAC) would be walking or biking to school, and this is measured by the Active transport to school variable. In the GBM plot sleep time is the fifth most important predictor, and the GRS has lower importance. In spite of that, there is a large similarity between the two descriptions of childhood obesity, taking into consideration that the dataset contains up to 190 predictor variables.

Finally, it is worth highlighting the homogeneity of the sample in terms of distribution by sex and the absence of genetic relatedness and stratification (since the Hardy–Weinberg equilibrium is met by all the SNPs). In addition, the sample is highly representative, with six schools from three different areas of the Community of Madrid involved, which allows a better knowledge of the situation throughout the Community rather than in a specific school or area.

Methods

Study design

The GENYAL sample included 221 schoolchildren (116 girls and 105 boys) in 1st and 2nd grades (6–8 years of age) from 6 different public primary schools in the Community of Madrid (Spain). The Ministry of Education of this Community was responsible for the sampling of the schools, covering a variety of socioeconomic statuses in different districts, so that the selection was representative of the household income distribution in Madrid as defined by the Spanish National Statistics Institute⁴⁸. Briefly, GENYAL is a long-term clinical trial (ClinicalTrials.gov NCT03419520) for childhood obesity prevention. The project is planned to last 5 years (2017–2021) with annual data collection, including anthropometric and nutrigenetic assessment and questionnaires about physical activity, dietary, and social and health aspects. On this basis, the main objective of the GENYAL study was to design and validate, through ML algorithms, a predictive model that identifies those children who would benefit most from actions aimed at reducing the risk of obesity and its complications. The results shown in this paper correspond to a cross section of the data collected in the first year of the study (2017).

Ethical issues

The research was approved by the Research Ethics Committee of the IMDEA Food Foundation (PI:IM024). The study protocol follows the guidelines laid down in the Declaration of Helsinki and was performed in accordance with relevant regulations. All families signed their written informed consent to participate.

Anthropometric measurements

Height was determined using a Leicester height rod with millimetric accuracy (Biological Medical Technology SL, Barcelona, Spain). Body weight, fat mass percentage and muscle mass percentage were assessed using a Body Composition Monitor (BF511- OMRON HEALTHCARE Co., Ltd, Kyoto, Japan). Waist circumference was measured using a non-elastic tape (KaWe Kirchner & Wilhelm GmbH, Asperg, Germany; range 0–150 cm, 1 mm of precision). For blood pressure monitoring, an automatic digital monitor was used (OMRON M3-Intellisense) with a cuff suitable for children.

Children were measured at their schools early in the morning by trained dietitians following standard techniques and the international WHO guidelines specific for this population⁴⁹. Measurements were taken twice in a row, and the average was taken as the result. BMI was calculated as weight in kg divided by the square of height in meters; children were classified as normoweight, overweight or obese according to percentiles of the Faustino Orbegozo Foundation⁵⁰, of the International Obesity Task Force (IOTF)⁵¹, and WHO growth standards⁵². The overweight and obesity rates were unified as a single category called excess weight (EW). Parents’ BMI was calculated from the weight and height data reported by themselves.
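As a small worked illustration of this calculation, the following Python sketch averages duplicate readings and computes the BMI. The function names and the readings are hypothetical, not study data.

```python
def average_reading(first, second):
    """Average of two consecutive measurements."""
    return (first + second) / 2

def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by the square of height (m)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2

# Hypothetical duplicate readings for one child.
weight = average_reading(24.9, 25.1)   # kg
height = average_reading(1.24, 1.26)   # m
child_bmi = bmi(weight, height)        # about 16.0 kg/m^2
```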

SNP selection, genetic risk score and genotyping

DNA was obtained from saliva samples collected on the same day as the anthropometric evaluation. Genomic DNA was extracted according to the protocol of the Stratec INVISORB Spin Tissue Mini Kit. For genotyping, the DNA samples were loaded in TaqMan OpenArray Real-Time PCR plates (Life Technologies Inc., Carlsbad, CA, USA) already configured with the selected SNPs, with specific probes for each allele marked with a different fluorophore to determine the genotype. Loading was performed using the OpenArray AccuFill System (Life Technologies Inc., Carlsbad, CA, USA). Once the plates were loaded, PCR was performed and the chips were read in the QuantStudio 12 K Flex Real-Time PCR Instrument (Life Technologies Inc., Carlsbad, CA). The results were analyzed using the TaqMan Genotyper Software v1.3 (Life Technologies Inc., Carlsbad, CA, USA; autocaller confidence level > 90%)⁵³, which automatically assigns the genotype to each sample according to the amount of signal detected for each fluorophore. Call rates for all SNPs were > 96%, and genotype frequencies were in Hardy–Weinberg equilibrium (p > 0.05).
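A Hardy–Weinberg check for a biallelic SNP can be sketched in Python as follows. It assumes a chi-square goodness-of-fit test with one degree of freedom, which may differ from the test actually used here (an exact test is a common alternative); the genotype counts are hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if not 0 < p < 1:
        raise ValueError("monomorphic SNP: the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: the first set matches HWE exactly, the second lacks heterozygotes.
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
chi2_bad, p_bad = hwe_chi_square(40, 20, 40)
```

A p-value above 0.05, as in the first example, would be read as no evidence of departure from equilibrium.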

For the purpose of this study, 11 SNPs (BDNF-AS rs925946, ETV5 rs7647305, FTO rs7190492, GNPDA2 rs10938397, KCTD15 rs368794, LEPR rs1137101 (Q223R), MC4R rs17782313, NEGR1 rs2568958, SEC16B rs10913469, TCF7L2 rs7903146 and TMEM18 rs6548238) were selected. These SNPs were included by considering their specific relationship with childhood BMI according to previous research, their identification by genome-wide association studies (GWAS), and the absence of linkage disequilibrium between them. From these SNPs, a GRS was developed as the total sum of risk alleles in the 11 SNPs⁵³.

Questionnaires, data collected and predictor variables used

Different self-reported questionnaires were sent to families by email or in paper format according to the parents' preference, filled in by at least one of the parents and collected by researchers. These questionnaires were based on the surveys used in previous national studies (ALADINO and ELOIN)^4,54, KIDMED⁵⁵, etc.

The data obtained were processed and cleaned. Finally, a total of 190 variables were obtained and classified into categories according to their specific nature (Table S1, Supplementary Material). These variables are described in what follows.

Characteristics of schoolchildren

Three variables were taken into account in this category: age, sex and school year.

SNP selection and GRS

The GRS, obtained from 11 SNPs well described as significant in childhood obesity, was used in this domain. The GRS for each child was obtained as the sum of the number of risk alleles over all 11 SNPs, considering that each SNP can contain 0, 1 or 2 risk alleles: e.g. if the risk allele is A, and the SNP appears as GG, GA and AA genotypes, the corresponding number of risk alleles would be 0, 1, and 2, respectively. Therefore, the GRS is defined as:

$$GRS= \sum_{i=1}^{11}{NRA}_{i}$$

where NRA_i is the number of risk alleles of SNP i.
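The definition above translates directly into code. The following Python sketch uses placeholder SNP labels and risk alleles (`snp_1`, `snp_2`, `snp_3` with risk allele A), which are hypothetical and do not correspond to the risk alleles of the 11 SNPs in this study.

```python
def risk_allele_count(genotype, risk_allele):
    """Number of risk alleles (0, 1 or 2) in a two-letter genotype such as 'GA'."""
    return genotype.count(risk_allele)

def genetic_risk_score(genotypes, risk_alleles):
    """Unweighted GRS: sum of risk-allele counts over all SNPs."""
    return sum(
        risk_allele_count(genotypes[snp], allele)
        for snp, allele in risk_alleles.items()
    )

# Hypothetical example with three SNPs, all with risk allele A.
risk_alleles = {"snp_1": "A", "snp_2": "A", "snp_3": "A"}
child_genotypes = {"snp_1": "GG", "snp_2": "GA", "snp_3": "AA"}
grs = genetic_risk_score(child_genotypes, risk_alleles)  # 0 + 1 + 2
```

With 11 SNPs the score would range from 0 to 22.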

Physical and leisure activities

24 variables regarding physical activity and free time were obtained by an ad hoc questionnaire, based on the surveys used in previous national studies (ALADINO and ELOIN), after receiving content validation by a group of dietitians and exercise science experts. A 48-h physical activity record was collected, corresponding to 24 h of a weekday and a complete weekend day⁵⁶, to obtain the Individual Physical Activity Coefficient (IPAC) and the Physical Activity Coefficient (PAC) through the coefficients defined by the WHO⁴⁹ and by the Institute of Medicine⁵⁷, respectively.

Diet, food and nutrients

80 variables were also gathered from dietary information through parent self-reported ad hoc questionnaires. These questionnaires were delivered to the parents with the corresponding filling instructions. Before processing, the questionnaire responses were checked by the researchers, and parents were contacted by phone in case of unclear or omitted data. The questionnaires included the validated KIDMED questionnaire⁵⁵ and a 48-h food record of two non-consecutive days (a weekday and a weekend day), as recommended by the European Food Safety Authority guidelines⁵⁸, which was analyzed using the DIAL software (Alce Ingeniería, Madrid, Spain) in order to obtain information about macro- and micronutrients. Finally, a questionnaire based on the surveys used in previous national studies (ALADINO and ELOIN) was used after receiving content validation by a group of nutritionists.

Risk factors of pregnancy and birth

39 variables regarding maternal and neonatal health and habits were obtained from a self-reported ad hoc questionnaire completed by parents. This questionnaire was used after receiving content validation by a group of dietitians.

Social, health and demographic factors

43 variables were obtained from a self-reported ad hoc questionnaire about the family’s status, place of birth, place of residence, etc. This questionnaire was used after receiving content validation by a group of dietitians.

Statistical modeling

R 3.4.2 (https://www.r-project.org/) was used for all the modeling and data analysis. The sample was initially characterized by a descriptive exploratory analysis. Qualitative data were presented as percentages and absolute frequencies while quantitative data were expressed as mean ± standard deviation.

The randomForest package was used to develop the RF models, with 500 decision trees and 5 permutations per variable for the variable importance calculations. The missForest package was used for multiple data imputation with the default settings; a total of 100 imputations were used. An iterative procedure, similar to the one described in Nonyane et al. and Little et al.^59,60, was applied in order to include multiple imputation in the variable importance estimation by taking into account both the between- and within-imputation variance in the importance scores. The process was as follows:

For each imputation m, m = 1,…,M, we estimated the importance score of variable x_j (${\widehat{\theta }}_{j}^{m}$, where j = 1,…,p) as the average increase in the out-of-bag (OOB) Mean Squared Error (MSE) after OOB-permuting x_j, a total of K times, for each of the B trees of the RF:
$${\widehat{\theta }}_{j}^{m}=\frac{1}{KB}\sum_{k=1}^{K}\sum_{b=1}^{B}\left({MSE}_{kbj}^{m}-{MSE}_{b}^{m}\right)$$

as well as the corresponding standard errors ${s}_{j}^{m}$.
From here, the average importance score across the M imputations for each variable x_j was obtained from:
$${\bar{\theta }}_{j}= \frac{1}{M}\sum_{m=1}^{M}{\widehat{\theta }}_{j}^{m}$$
Finally, the standardized importance score for each variable x_j was calculated using:
$${T}_{j}=\frac{{\bar{\theta }}_{j}}{\sqrt{{V}_{j}}}$$
where V_j is the weighted sum of the within (${\bar{W}}_{j}$) and between (${\bar{B}}_{j}$) imputation variances for variable x_j:
$${V}_{j}= {\bar{W}}_{j}+\frac{M+1}{M}{\bar{B}}_{j}$$
which are defined as:
$${\bar{W}}_{j}= \frac{1}{M}\sum_{m=1}^{M}{\left({s}_{j}^{m}\right)}^{2}$$
$${\bar{B}}_{j}= \frac{1}{M-1}\sum_{m=1}^{M}{\left({\widehat{\theta }}_{j}^{m}-{\bar{\theta }}_{j}\right)}^{2}$$
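The pooling step for a single variable can be sketched in a few lines of Python. This is an illustration rather than the authors' R implementation. It assumes that the numerator of T_j is the across-imputation mean of the importance scores, and that the per-imputation scores and their standard errors are already available; the function name `standardized_importance` is hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence


def standardized_importance(theta_hat: Sequence[float], std_err: Sequence[float]) -> float:
    """Pool the importance scores of one variable over M imputations.

    theta_hat[m] is the importance score in imputation m and std_err[m]
    its standard error. Returns T_j = mean score / sqrt(V_j).
    """
    m = len(theta_hat)
    if m < 2 or len(std_err) != m:
        raise ValueError("need at least two imputations and one standard error per score")
    theta_bar = mean(theta_hat)                  # average score across imputations
    within = mean([s ** 2 for s in std_err])     # within-imputation variance
    between = variance(theta_hat)                # between-imputation variance (M - 1 denominator)
    total = within + (m + 1) / m * between       # total variance V_j
    return theta_bar / math.sqrt(total)


# Toy input: three imputations for one variable (made-up numbers).
t_j = standardized_importance([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(round(t_j, 4))
```

Applying the function to every predictor in turn yields one standardized score per variable, which can then be sorted to obtain a ranking.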

The multiple imputation was also used to derive the mean and 95% confidence intervals (rounded to the nearest integer) for the ranks of the importance scores of the different predictor variables in the RF models.

In order to compare the results with those obtained from other methods, a Gradient Boosting Machine (GBM) relative importance plot was also obtained. The gbm package was used to derive the GBM models. Multiple models were derived within an imputation loop, and estimates of relative importance were pooled as described for the RF models. Again, 100 iterations of imputation and model derivation were performed. We used GBM models with 5000 trees, a learning rate of 0.01, a bag fraction of 0.5 and an interaction depth of 3. The full dataset was used for training, and the best number of trees in each model was obtained through fivefold cross-validation. The relative importance of a variable j for a single tree T with J terminal nodes, when using regression trees in the GBM as in this case, is defined as¹¹

$${\widehat{I}}_{j}^{2}\left(T\right)=\sum_{t=1}^{J-1}{\widehat{i}}_{t}^{2}\,1\left({v}_{t}=j\right)$$

where the summation is over the nonterminal nodes t of the J-terminal node tree T, ${v}_{t}$ is the variable selected for splitting at node t, 1() is an indicator function that equals 1 if ${v}_{t}=j$ and 0 otherwise, and ${\widehat{i}}_{t}^{2}$ is the decrease in squared error associated with that split. GBM is an ensemble method, where successive base learners (regression trees in our case) are fitted to minimize the residuals of the previous one; therefore, the final relative importances for the GBM are obtained by averaging, for each variable, the relative importances over all the trees in the model.
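As an illustration of this definition, the following Python sketch accumulates the squared-error decreases per splitting variable for one tree and then averages over the trees of an ensemble. It is a sketch under simplifying assumptions, not the gbm package's implementation: each tree is represented only by a hypothetical list of (splitting variable, squared-error decrease) pairs, one per nonterminal node, and the variable names and numbers are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One entry per nonterminal node t: (splitting variable v_t, squared-error decrease i_t^2).
Split = Tuple[str, float]


def tree_importance(splits: List[Split]) -> Dict[str, float]:
    """Sum the squared-error decreases of a single tree by splitting variable."""
    scores: Dict[str, float] = defaultdict(float)
    for variable, decrease in splits:
        scores[variable] += decrease
    return dict(scores)


def ensemble_importance(trees: List[List[Split]]) -> Dict[str, float]:
    """Average the per-tree importances over all trees in the ensemble."""
    totals: Dict[str, float] = defaultdict(float)
    for splits in trees:
        for variable, score in tree_importance(splits).items():
            totals[variable] += score
    # A variable that a tree never splits on contributes zero for that tree.
    return {variable: total / len(trees) for variable, total in totals.items()}


# Toy ensemble of two trees (hypothetical variables and values).
trees = [
    [("age", 4.0), ("sex", 1.0), ("age", 2.0)],
    [("sex", 3.0)],
]
print(ensemble_importance(trees))  # {'age': 3.0, 'sex': 2.0}
```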

In order to derive consensus variable importances, the two 100 imputations × 190 variables matrices of RF variable importances and GBM relative importances were first min–max normalized (within each model) in order to make them comparable. As minimum and maximum, the minimum and maximum average variable importance (relative importance for GBM) were used, respectively. After this normalization, the two matrices were merged and averaged for each predictor variable, resulting in a normalized score for each. The top-30 scoring variables were then plotted.
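One possible reading of this consensus procedure is sketched below in Python. It is an interpretation, not the authors' code: it assumes that each matrix has imputations as rows and predictor variables as columns, that the normalization bounds are the smallest and largest column means within each model, and that merging means stacking the rows of both normalized matrices. The function names and the toy numbers are hypothetical.

```python
from statistics import mean
from typing import List

Matrix = List[List[float]]  # rows = imputations, columns = predictor variables


def min_max_normalize(scores: Matrix) -> Matrix:
    """Rescale a matrix using the minimum and maximum of its column means."""
    column_means = [mean(column) for column in zip(*scores)]
    low, high = min(column_means), max(column_means)
    if high == low:
        raise ValueError("all variables have the same average importance")
    return [[(value - low) / (high - low) for value in row] for row in scores]


def consensus_scores(rf: Matrix, gbm: Matrix) -> List[float]:
    """Stack the two normalized matrices and average each variable's column."""
    merged = min_max_normalize(rf) + min_max_normalize(gbm)
    return [mean(column) for column in zip(*merged)]


def top_variables(names: List[str], scores: List[float], k: int) -> List[str]:
    """Return the names of the k highest-scoring variables."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy example: 2 imputations x 3 variables per model (made-up numbers).
rf_scores = [[0.0, 5.0, 10.0], [0.0, 5.0, 10.0]]
gbm_scores = [[1.0, 1.0, 3.0], [1.0, 1.0, 3.0]]
scores = consensus_scores(rf_scores, gbm_scores)
print(scores)  # [0.0, 0.25, 1.0]
print(top_variables(["x1", "x2", "x3"], scores, k=2))  # ['x3', 'x2']
```

Because the bounds come from the column means rather than from the individual entries, single normalized values can fall slightly outside the [0, 1] interval, while the averaged scores within each model stay inside it.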

References

Nutrition—EU Science Hub—European Commission. EU Science Hub https://ec.europa.eu/jrc/en/research-topic/nutrition (2014).
Townsend, N., Rutter, H. & Foster, C. Evaluating the evidence that the prevalence of childhood overweight is plateauing. Pediatr. Obes. 7, 343–346 (2012).
Childhood Obesity Surveillance Initiative (COSI) Factsheet. Highlights 2015–17 (2018). http://www.euro.who.int/en/health-topics/disease-prevention/nutrition/activities/who-european-childhood-obesity-surveillance-initiative-cosi/cosi-publications/childhood-obesity-surveillance-initiative-cosi-factsheet.-highlights-2015-17-2018 (2018).
Agencia Española de Consumo, Seguridad Alimentaria y Nutrición. Ministerio de Sanidad, Servicios Sociales e Igualdad. Estudio ALADINO 2015: Estudio de Vigilancia del Crecimiento, Alimentación, Actividad Física, Desarrollo Infantil y Obesidad en España 2015. (2016).
Kunwar, R., Minhas, S. & Mangla, V. Is obesity a problem among school children?. Indian J. Public Health 62, 153 (2018).
Styne, D. M. et al. Pediatric obesity—assessment, treatment, and prevention: An endocrine society clinical practice guideline. J. Clin. Endocrinol. Metab. 102, 709–757 (2017).
Hill, J. Physical activity and obesity. Lancet 363, 182 (2004).
Hruby, A. & Hu, F. B. The epidemiology of obesity: A big picture. PharmacoEconomics 33, 673–689 (2015).
Hastie, T., Tibshirani, R., Friedman, J. & Franklin, J. The elements of statistical learning: Data mining, inference, and prediction. Math Intell 27, 83–85 (2004).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Friedman, J. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
DeGregory, K. W. et al. A review of machine learning in obesity. Obes. Rev. 19, 668–685 (2018).
Dugan, T. M., Mukhopadhyay, S., Carroll, A. & Downs, S. Machine learning techniques for prediction of early childhood obesity. Appl. Clin. Inform. 6, 506–520 (2015).
Muhamad Adnan, M.H.B., Wahidah, H., Faten, D. A survey on utilization of data mining for childhood obesity prediction. in 8th Asia-Pacific Symposium on Information and Telecommunication Technologies 1–6 (2010).
Novak, B. & Bigec, M. Application of artificial neural networks for childhood obesity prediction. in Proceedings 1995 Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems 377–380 (1995). https://doi.org/10.1109/ANNES.1995.499512.
Novak, B. & Bigec, M. Childhood obesity prediction with artificial neural networks. in Proceedings Ninth IEEE Symposium on Computer-Based Medical Systems 77–82 (1996). https://doi.org/10.1109/CBMS.1996.507129.
Hariz, M., Muhamad, B., Husain, W. & Rashid, N. A. Parameter identification and selection for childhood obesity prediction using data mining. in (2012).
Muhamad Adnan, M. H. B., Husain, W. & Abdul Rashid, N. A hybrid approach using Naïve Bayes and Genetic Algorithm for childhood obesity prediction. in 2012 International Conference on Computer Information Science (ICCIS) vol. 1 281–285 (2012).
Zhang, S. et al. Comparing data mining methods with logistic regression in childhood obesity prediction. Inf. Syst. Front. 11, 449–460 (2009).
Hammond, R. et al. Predicting childhood obesity using electronic health records and publicly available data. PLoS ONE 14, e0215571 (2019).
Lingren, T. et al. Developing an algorithm to detect early childhood obesity in two tertiary pediatric medical centers. Appl. Clin. Inform. 7, 693–706 (2016).
Rehkopf, D. H., Laraia, B. A., Segal, M., Braithwaite, D. & Epel, L. The relative importance of predictors of body mass index change, overweight and obesity in adolescent girls. Int. J. Pediatr. Obes. 6, e233-242 (2011).
Colmenarejo, G. Machine learning models to predict childhood and adolescent obesity: A review. Nutrients 12, 2 (2020).
Munger, E. et al. Application of machine learning to determine top predictors of non-calcified coronary burden in psoriasis: An observational cohort study. J. Am. Acad. Dermatol. https://doi.org/10.1016/j.jaad.2019.10.060 (2019).
van der Meer, D. et al. Predicting attention-deficit/hyperactivity disorder severity from psychosocial stress and stress-response genes: A random forest regression approach. Transl. Psychiatry 7, e1145 (2017).
Dong, Y. et al. A novel surgical predictive model for Chinese Crohn’s disease patients. Medicine 98, e17510 (2019).
Gubbi, S., Hamet, P., Tremblay, J., Koch, C. A. & Hannah-Shmouni, F. Artificial intelligence and machine learning in endocrinology and metabolism: The dawn of a new era. Front. Endocrinol. 10, 2 (2019).
Blanchet, R., Kengneson, C.-C., Bodnaruc, A. M., Gunter, A. & Giroux, I. Factors influencing parents’ and children’s misperception of children’s weight status: A systematic review of current research. Curr. Obes. Rep. https://doi.org/10.1007/s13679-019-00361-1 (2019).
Gregory, J. W. Prevention of obesity and metabolic syndrome in children. Front. Endocrinol. 10, 2 (2019).
Shook, R. P. et al. Energy intake derived from an energy balance equation, validated activity monitors, and dual X-ray absorptiometry can provide acceptable caloric intake data among young adults. J. Nutr. 148, 490–496 (2018).
Madden, A. M., Mulrooney, H. M. & Shah, S. Estimation of energy expenditure using prediction equations in overweight and obese adults: a systematic review. J. Hum. Nutr. Diet. 29, 458–476 (2016).
Silsbury, Z., Goldsmith, R. & Rushton, A. Systematic review of the measurement properties of self-report physical activity questionnaires in healthy adult populations. BMJ Open 5, e008430 (2015).
Qasim, A. et al. On the origin of obesity: Identifying the biological, environmental and cultural drivers of genetic risk among human populations. Obes. Rev. 19, 121–149 (2018).
Wang, Y., Min, J., Khuri, J. & Li, M. A systematic examination of the association between parental and child obesity across countries. Adv. Nutr. Bethesda Md 8, 436–448 (2017).
Viljakainen, H. et al. Genetic risk score predicts risk for overweight and obesity in Finnish preadolescents. Clin. Obes. 2, e12342. https://doi.org/10.1111/cob.12342 (2019).
Mäkelä, J. et al. Genetic risk clustering increases children’s body weight at 2 years of age—the STEPS Study. Pediatr. Obes. 11, 459–467 (2016).
Che, R. & Motsinger-Reif, A. A. A new explained-variance based genetic risk score for predictive modeling of disease risk. Stat. Appl. Genet. Mol. Biol. 11, 15 (2012).
Lambert, S. A., Abraham, G. & Inouye, M. Towards clinical utility of polygenic risk scores. Hum. Mol. Genet. https://doi.org/10.1093/hmg/ddz187 (2020).
Cardoso Chaves, O. et al. Comparison of the biochemical, anthropometric and body composition variables between adolescents from 10 to 13 years old and their parents. Nutr. Hosp. 27, 1127–1133 (2012).
Hanh, N. T. H., Tuyet, L. T., Dao, D. T. A., Tao, Y. & Chu, D.-T. Childhood obesity is a high-risk factor for hypertriglyceridemia: A case-control study in Vietnam. Osong Public Health Res. Perspect. 8, 138–146 (2017).
An, R. Diet quality and physical activity in relation to childhood obesity. Int. J. Adolesc. Med. Health 29, 2 (2017).
Latomme, J. et al. Do physical activity and screen time mediate the association between European fathers’ and their children’s weight status? Cross-sectional data from the Feel4Diabetes-study. Int. J. Behav. Nutr. Phys. Act. 16, 100 (2019).
Lobstein, T. Commentary: Which child obesity definitions predict health risk?. Ital. J. Pediatr. 43, 20 (2017).
Romero-Corral, A. et al. Accuracy of body mass index in diagnosing obesity in the adult general population. Int. J. Obes. 32, 959–966 (2008).
Koning, M. et al. Agreement between parent and child report of physical activity, sedentary and dietary behaviours in 9–12-year-old children and associations with children’s weight status. BMC Psychol. 6, 14–14 (2018).
Beware Default Random Forest Importances. http://explained.ai/decision-tree-viz/index.html.
Strobl, C., Boulesteix, A.-L., Zeileis, A. & Hothorn, T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinform. 8, 25 (2007).
Renta neta media de los hogares (Urban Audit) - Ayuntamiento de Madrid. http://www.madrid.es/portales/munimadrid/es/Inicio/El-Ayuntamiento/Estadistica/Areas-de-informacion-estadistica/Economia/Renta/Renta-neta-media-de-los-hogares-Urban-Audit-?vgnextfmt=default&vgnextoid=65e0c19a1666a510VgnVCM1000001d4a900aRCRD&vgnextchannel=ef863636b44b4210VgnVCM2000000c205a0aRCRD.
WHO. Physical status: the use and interpretation of anthropometry. http://www.who.int/childgrowth/publications/physical_status/en/.
Fernández, C. et al. Estudio de Crecimiento de Bilbao (Curvas y tablas de crecimiento, Estudio Transversal, 2011).
Cole, T. J., Bellizzi, M. C., Flegal, K. M. & Dietz, W. H. Establishing a standard definition for child overweight and obesity worldwide: International survey. BMJ 320, 1240–1243 (2000).
WHO. Growth reference data for 5–19 years. http://www.who.int/growthref/en/.
Marcos-Pasero, H. et al. The Q223R polymorphism of the leptin receptor gene as a predictor of weight gain in childhood obesity and the identification of possible factors involved. Genes 11, 2 (2020).
Ortíz, H. et al. Diseño del estudio ELOIN y prevalencia de sobrepeso y obesidad en la población infantil de 4 años de la Comunidad de Madrid. (2014).
Serra-Majem, L. et al. Food, youth and the Mediterranean diet in Spain. Development of KIDMED, Mediterranean Diet Quality Index in children and adolescents. Public Health Nutr. 7, 931–935 (2004).
Ortega, R., Requejo, A. & López-Sobaler, A. Modelos de cuestionario de actividad. in Nutriguía. Manual de nutrición clínica en atención primaria. 468 (Complutense, 2006).
Institute of Medicine. Dietary Reference Intakes for Energy, Carbohydrate, Fiber, Fat, Fatty Acids, Cholesterol, Protein, and Amino Acids. (2002). https://doi.org/10.17226/10490.
European Food Safety Authority. General principles for the collection of national food consumption data in the view of a pan-European dietary survey. EFSA J. 7, 2 (2009).
Nonyane, B. A. S. & Foulkes, A. S. Multiple imputation and random forests (MIRF) for unobservable, high-dimensional data. Int. J. Biostat. 3, 12 (2007).
Little, R. J. A. & Rubin, D. B. Statistical Analysis with Missing Data (John Wiley & Sons, Newark, 2019).

Acknowledgements

This study received logistical support by Conserjería de Educación e Investigación de la Comunidad de Madrid, Dirección General de Educación Infantil, Primaria y Secundaria. The authors would like to acknowledge the help of all the teachers and head teachers of Juan Zaragüeta, Fernando el Católico, Fernández Moratín, La Rioja, Concepción Arenal y Rosa Luxemburgo schools; and, especially, the families and children volunteers who participated in the investigation. We also acknowledge Dr María Ikonomopoulos for reviewing the English style.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

These authors contributed equally: Helena Marcos-Pasero and Gonzalo Colmenarejo.

Authors and Affiliations

Nutrition and Clinical Trials Unit, GENYAL Platform IMDEA-Food Institute, CEI UAM+CSIC, 28049, Madrid, Spain
Helena Marcos-Pasero, Elena Aguilar-Aguilar & Viviana Loria-Kohen
Biostatistics and Bioinformatics Unit, IMDEA-Food Institute, CEI UAM+CSIC, 28049, Madrid, Spain
Gonzalo Colmenarejo
Molecular Oncology and Nutritional Genomics of Cancer, IMDEA-Food Institute, CEI UAM+CSIC, 28049, Madrid, Spain
Ana Ramírez de Molina
Production and Development of Foods for Health, IMDEA-Food Institute, CEI UAM+CSIC, 28049, Madrid, Spain
Guillermo Reglero
Department of Production and Characterization of Novel Foods. Institute of Food Science Research (CIAL), CEI UAM+CSIC, 28049, Madrid, Spain
Guillermo Reglero

Contributions

V.L.K. was the principal investigator and was responsible for the study design. H.M.P. and G.C. wrote the manuscript; H.M.P. and E.A.A. were responsible for data collection; G.C. conducted the analysis of the data, proposed the use of Machine Learning variable importance techniques, designed the RF and GBM models, and developed the consensus score of the variables; G.R.R. and A.R.M. supervised the final compilation of the manuscript and provided scientific advice and consultation. All authors reviewed the manuscript.

Corresponding author

Correspondence to Viviana Loria-Kohen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Marcos-Pasero, H., Colmenarejo, G., Aguilar-Aguilar, E. et al. Ranking of a wide multidomain set of predictor variables of children obesity by machine learning variable importance techniques. Sci Rep 11, 1910 (2021). https://doi.org/10.1038/s41598-021-81205-8


Received: 11 March 2020
Accepted: 04 January 2021
Published: 21 January 2021
DOI: https://doi.org/10.1038/s41598-021-81205-8
