Ranking of a wide multidomain set of predictor variables of children obesity by machine learning variable importance techniques

The increased prevalence of childhood obesity is expected to translate in the near future into a concomitant soaring of multiple cardio-metabolic diseases. Obesity has a complex, multifactorial etiology, that includes multiple and multidomain potential risk factors: genetics, dietary and physical activity habits, socio-economic environment, lifestyle, etc. In addition, all these factors are expected to exert their influence through a specific and especially convoluted way during childhood, given the fast growth along this period. Machine Learning methods are the appropriate tools to model this complexity, given their ability to cope with high-dimensional, non-linear data. Here, we have analyzed by Machine Learning a sample of 221 children (6–9 years) from Madrid, Spain. Both Random Forest and Gradient Boosting Machine models have been derived to predict the body mass index from a wide set of 190 multidomain variables (including age, sex, genetic polymorphisms, lifestyle, socio-economic, diet, exercise, and gestation ones). A consensus relative importance of the predictors has been estimated through variable importance measures, implemented robustly through an iterative process that included permutation and multiple imputation. We expect this analysis will help to shed light on the most important variables associated to childhood obesity, in order to choose better treatments for its prevention.


Results
Exploratory analysis. Table S1 (Supplementary Material) collects the 190 predictor variables used in the analysis. They are grouped in different domains: characteristics of schoolchildren (3); genetics (1, from 11 SNPs); physical and leisure activities (24); diet, food and nutrients (80); risk factors of pregnancy and birth (39); social, health and demographic factors (43).
The average age of the 221 participants was 6.75 ± 0.73 years (52.50% were girls (n = 116) and 47.50% boys (n = 105)). According to the WHO criteria, 32.2% of the schoolchildren evaluated had excess weight (EW) (18.1% overweight and 14.1% obesity). These figures were 25.4% and 19.0% when the International Obesity Task Force (IOFT) standard or the national criteria of the Orbegozo Foundation were used, respectively. Table 1 shows the main descriptive characteristics regarding the schoolchildren families. Regarding the nutritional status of the parents, 57.5% of the fathers and 30.4% of the mothers had EW.
The main diet, physical activity and birth characteristics of schoolchildren by sex are summarized in Table 2.
The variants distribution of the set of SNPs selected for the genetic risk score (GRS, see Methods) are presented in Table 3. These gene variants were consistent with the Hardy-Weinberg equilibrium in all the cases (p-values ≥ 0.05).
Random forest model and variable importance's. As described previously, we derived a RF model to predict the BMI in this sample. Multiple imputation was included in the calculation of the standardized importance scores T j for each predictor variable x j in the dataset. A total of 100 imputations were performed (see "Methods" section). On average, the RF models explain 55.07% of the variance, as estimated by the OOB pseudo-R 2 . Figure 1 shows a plot of the average predicted BMI by the RF models vs the actual BMI. We can see some degree of miscalibration in the plot, as the best-fit line (dashed line; continuous line is the x = y line) shows www.nature.com/scientificreports/ an intercept different from 0 (− 4.06) and a slope slightly different from 1 (1. 23), so the model makes worse predictions for very high values of BMI. Through permutation of the OOB data, and within the imputation loop, we could obtain the scaled average variable importance of the different predictor variables. Figure 2 shows the resulting variable importance plot for the top-30 predictor variables. The use of multiple imputation allowed in addition to analyze in a robust way the variability of the rank of these variable importance's, by estimating their mean rank and corresponding confidence intervals. Figure 3 shows the mean average rank and corresponding 95% confidence intervals of the 30 most important predictor variables.
The five most important variables are (in this order): Familiar nutri-status perception (Perception of the person completing the questionnaire about child's nutritional status) > Relation TEI-TEE (%) (Percentage of difference between Total Energy Intake (TEI) and Total Energy Expenditure (TEE)) > BMI of the father > BMI of the mother > Mother's Meals (number of daily food servings of the mother). These variables are very well ranked, with both Familiar nutri-status perception and Relation TEI-TEE (%) having a null confidence interval in their average rank, as in all the imputations they were the first-and second-most important variables, respectively. The BMI of both parents share the same narrow confidence interval (3)(4), while Mother's Meals had a slightly larger confidence interval (5-7).
The following variables show much larger confidence intervals, so that although on average they show an increasing rank, their ranking for new samples is expected to be less well defined. Gradient boosting machine model and relative importance's. For comparison purposes, and to check the robustness of the obtained variable importance's, an alternative method to rank the variables was used, namely scaled relative importance's in a Gradient Boosting Machine, again implemented within an imputation loop. Figure 4 displays the corresponding scaled relative importance bar plot. We can see a rather similar picture as with RF, with 20 out of 30 top predictor variables shared between the two plots, and the four top variables (Familiar nutri-status perception, Relation TEI-TEE (%), Mother's BMI, and Father's BMI) being the same and in the same order. However, the exact ordering for the rest of the variables is not fully preserved, which is not unexpected given that the two methods use different functional forms, the metrics used to measure the importance of variables are also different, and the rankings themselves have increasing variability upon moving to less important predictor variables (e.g. Fig. 2), making unfeasible to assign an exact ranking.

Discussion
The results of the anthropometric measurements in the current study showed that one out of four studied schoolchildren had an excess of weight. These figures, similar to those reported in the latest ALADINO national study, reflect the magnitude of the childhood obesity problem in our society 4 . ML is a suitable approach in predictive analytics, and it has started to be used both for early preventive recommendations related to lifestyle, and to build decision-support tools for disease risk prediction 12,27 . Additionally, in view of the crucial role that prevention plays to control the high obesity prevalence, the identification of its most important risk factors could help to develop effective nutritional and educational intervention strategies. In this sense, in this study, we attempted to rank a wide set of 190 predictor variables from different domains in order to predict the BMI of children by means of ML models of the RF and GBM types.
Therefore, the novelty of the current study stems from the use of a very large number of variables from widely different domains (genetic, nutritional, exercise, social and health, lifestyle, birth and pregnancy) and their ranking by variable importance estimations. To our knowledge 23 , there is no parallel in the literature in this regard by this use of such a large multidomain set of variables for childhood obesity. www.nature.com/scientificreports/ We can see that the most important variable in our CS (Fig. 5) is the Familiar nutri-status perception, which has not explanatory character but shows the parents awareness of the nutritional status of their children, which has anyhow a variable degree of underestimation, especially for overweight/obese children, as we (data not shown) and others have observed 28 . The next-important variable (Relation TEI-TEE(%)) is the questionnairebased percentage of difference between the Total Energy Intake (TEI) and Total Energy Expenditure (TEE), which is a measure of the energy balance of the child. In this context, it is well established that obesity entails that dietary energy intake exceeds energy expenditure 29 . Nevertheless, these results should be viewed with caution, since as the literature reviewed suggests, self-reported dietary measures by questionnaires are not fully adequate to describe the energy balance 30 , and there are more accurate ways to calculate the TEE than physical activity questionnaires 31,32 . However, although non-optimal, our questionnaire-based TEI and TEE do contain valuable information about the energy input and expenditure, and thus the Relation TEI-TEE (%) variable results in one of the best predictors for BMI.
The following three variables of the model are Mother's BMI, Father's BMI, and Mother's Meals. These variables would comprise genetic, diet and lifestyle aspects, indicating that children inherit to a large extent their parents' nutritional status 33,34 . These predictors may be interesting in order to use them in predictive models for obesity even before birth, and as a matter of fact they are frequent predictor variables of simple logistic regression models for childhood obesity 23 .
The 6th variable in importance (Prot (%TEI)) is a measure of the percentage of protein consumption within the diet, stressing the importance of a balanced nutritional strategy to prevent obesity. Prot (%TEI) is followed by the genetic risk score (GRS), that supports the genetic component of the BMI in children. This variable aggregates several genetic single nucleotide polymorphisms well described to affect childhood obesity, and has been used previously in studies of pediatric based-populations 35,36 . GRSs have been a great success in the study on polygenic diseases, and it could be seen as a personalized risk management strategy for obesity and overweight. Similar polymorphism-based genetic scores have been described for other pathological cases like breast cancer, prostate cancer, coronary artery disease, type 1 diabetes, type 2 diabetes and Alzheimer's disease 37,38 .
The following two variables in order of importance are mother's hypertriglyceridemia (Mother's disease: HTG) and IPAC score. Regarding the mother's hypertriglyceridemia as a predicting factor for children BMI, previous studies have linked the biochemical and body composition variables between adolescents and their parents, which find significant results in BMI and total cholesterol between father and son, and hypertriglyceridemia, with inadequacies of LDL or HDL shared both by adolescents and parents 39 . In addition, the link between obesity www.nature.com/scientificreports/ and increased risk for hypertriglyceridemia in children has been studied 40 , and can explain the association found in this work. In turn, IPAC is a measure of the total physical exercise performed by the child as obtained from of the IPAC calculation, which stresses the influence of calories consumption by physical activity in the final nutritional status 41 , and nowadays, it is considered as essential focus in health promotion and obesity prevention research at early ages 42 .
As was said in the Introduction, there is a single case of ML variable importance analysis (through RF) used in the prediction task of childhood obesity 22 . The work of Rehkopf et al. 22 had a longitudinal setting and the predicted endpoints were different, namely BMI percentile change after 10 years in adolescent girls, as well as transition from normal weight to overweight or obesity. The predictor variable set was more limited (41 variables) and with a more restricted set of domains: diet, physical activity, psychological, social and parent health, lacking genetic and gestational variables. In their case, psychological variables, a domain that is absent in our dataset, appeared within the most important variables; this is probably not unexpected, given that the sample was composed of adolescent girls, were this domain would be of more importance. We think that this domain would be of less importance in our 6-8 years old children.
We would like to point out some putative limitations of our study. One is the indirect nature 43,44 of the BMI for obesity diagnosis. However, BMI is considered as a great adiposity marker and is the most practical and lowcost method, making it the most preferred one 6 . On the other hand, in pediatric samples it is frequent the use of age-and sex-specific BMI z-scores instead of raw BMI. However, our sample has a very narrow distribution of ages, with 84% of the children being 6-7 years old, and 16% 8 years old, and we did not observe significant differences between the two sexes. Therefore, we decided to use raw BMI instead, as the z-scores are quite dependent on the population they are based on.
Likewise, the use of dietary and physical activity questionnaires may lead to reporting bias and it has been criticized. To avoid or minimize such biases there is an increased need for objective measures of food intake (e.g. by use of biomarkers) and physical activity (e.g. by use of movement sensors). However, because of the high www.nature.com/scientificreports/ costs of such methods, questionnaires are still the most widely used instruments for determining frequency and duration of physical activity and frequency and quantity of food intake, as questionnaires are relatively cheap and efficient instruments for collecting data on a large scale in a relatively short time span 45 . Nevertheless, this information should be interpreted with caution. Another limitation was the sample size, but it is important to consider that this study is framed in an intervention study of five years and corresponds to a baseline crosssectional analysis. Therefore, at this point this model was derived for explanatory purposes, in order to identify the predictors most associated to BMI. The cross-sectional nature of the present baseline dataset prevents its use from demonstration of causality, or for predictive purposes. This model rather suggests variables that would be important for childhood obesity, in order to be further tested in longitudinal settings. The new accumulated data along the study will be incorporated in order to derive models for predictive purposes to target appropriate preventive interventions to ameliorate effectively children obesity. From the statistical modelling point of view, variable importance techniques can be subject to biases 46,47 . However, our use of a permutation approach avoids overestimation of categorical variables with many classes, and in preparing our dataset, we removed highly-correlated variables that could also be overestimated. In addition, the picture obtained from the GBM analysis is rather similar to the RF one, with up to 20 the 30 top variables shared variables between the two methods, and exactly the same four top variables. This gives confidence in the general conclusions above described about the influence of the different predictor variables. We must also take into account that many of these variables are correlated, so that the way that one method achieves a best fit will be different that the other given their different algorithms, while modeling basically the same physical mechanism. For instance, the important variable IPAC in the RF plot, is missing from the GBM plot, while in the latter Active transport to school instead appears. However, a large component of the physical activity of the child (measured by IPAC) would be going to school walking or biking, and this is measured by the Active transport to school variable. In the GBM plot sleep time is the fifth most important predictor, and the GRS has lower importance. In spite of that, there is a large similarity between the two descriptions of childhood obesity, taking into consideration that the dataset contains up to 190 predictor variables.
Finally, it is worth highlighting the homogeneity of the sample in terms of distribution by sex and the absence of genetic relatedness and stratification (since the Hardy-Weinberg equilibrium is met by all the SNPs). In www.nature.com/scientificreports/ addition, the sample shows a large representativeness with six schools from three different areas of the Community of Madrid involved, which allows to have a better knowledge of the situation throughout the Community and not from a specific school or area.

Methods
Study design. The GENYAL sample included 221 schoolchildren (116 girls and 105 boys) in 1st and 2nd grades (6-8 years of age) from 6 different public primary schools among the Community of Madrid (Spain). The Ministry of Education of this Community was responsible for the sampling of the schools, covering a variety of socioeconomic status of different districts, so that the selection was representative of the household income distribution in Madrid as defined by the Spanish National Statistics Institute 48 . Briefly, GENYAL is a long-term clinical trial (ClinicalTrials.gov NCT03419520) for childhood obesity prevention. The duration of the project is planned to last 5 years (2017-2021) with annual data collection, including anthropometric and nutrigenetic assessment and questionnaires about physical activity, dietary and social and health aspects. On this basis, the main objective of GENYAL study was to design and validate a predictive model that identifies those children who would benefit most from actions aimed at reducing the risk of obesity and its complications through ML algorithms. The results shown in this paper corresponding to a cross section from data collected in the first year of the study (2017).
Ethical issues. The research was approved by the Research Ethics Committee of the IMDEA Food Foundation (PI:IM024). The study protocol follows the guidelines laid down in the Declaration of Helsinki and was performed in accordance with relevant regulations. All families signed their written informed consent to participate.
Anthropometric measurements. Height was determined using a Leicester height rod with a millimetric accuracy (Biological Medical Technology SL, Barcelona, Spain). Body weight, fat mass percentage and muscle www.nature.com/scientificreports/ mass percentage were assessed using a Body Composition Monitor (BF511-OMRON HEALTHCARE Co., Ltd, Kyoto, Japan). Waist circumference were taken using a non-elastic tape (KaWe Kirchner & Wilhelm GmbH, Asperg, Germany; range 0-150 cm, 1 mm of precision). For blood pressure monitorization, an automatic digital monitor was used (OMRON M3-Intellisense) using a cuff suitable for children. Children were measured at their schools early in the morning by trained dietitians following standard techniques and the international WHO guidelines specific for this population 49 . Measurements were taken twice in a row, considering the average as the result. BMI was calculated as weight in kg per height in squared meters; children were classified as normoweight, overweight or obese according to percentiles of the Faustino Orbegozo Foundation 50 , of the International Obesity Task Force (IOFT) 51 , and WHO growth standards 52 . The results of overweight and obesity rates were unified as a single category called excess weight (EW). Parents' BMI was calculated from the weight and height data reported by themselves. SNP selection, genetic risk score and genotyping. DNA was obtained from saliva samples collected the same day of the anthropometric evaluation. Genomic DNA was extracted according to the protocol described by Stratec INVISORB Spin Tissue Mini Kit. For genotyping, the DNA samples were loaded in TaqMan OpenArray Real-Time PCR plates (Life Technologies Inc., Carlsbad, CA, USA) already configured with the specific selected SNPs with specific waves for each allele marked with a different fluorophore to determine the genotype. This process was made using the OpenArray AccuFill System (Life Technologies Inc., Carlsbad, CA, USA). Once it was charged, a PCR was made and the chips were read in the QuantStudio 12 K Flex Real-Time PCR Instrument (Life Technologies Inc., Carlsbad, CA). The results were analyzed using the TaqMan Genotyper software (Life Technologies Inc., Carlsbad, CA, USA), which assigns automatically the genotype to each sample according to the amount of detected signal for each fluorophore. Data analysis was made by TaqMan Genotyper Software v1.3 (autocaller confidence level > 90%) 53 . Call rates for all SNPs were > 96%, and genotype frequencies were in Hardy-Weingberg equilibrium (p > 0.05).
For the purpose of this study, 11 SNPs (BDNF-AS rs925946, ETV5 rs7647305, FTO rs7190492, GNPDA2 rs10938397, KCTD15 rs368794, LEPR rs1137101 (Q223R), MC4R rs17782313, NEGR1 rs2568958, SEC16B rs10913469, TCF7L2 rs7903146 and TMEM18 rs6548238) were selected. These SNPs were included by considering their specific relationship with childhood BMI according to previous researches, having been identified by genome-wide association studies (GWAS) and the absence of linkage disequilibrium between them. From these SNPs, a GRS was developed as the total sum of risk alleles in the 11 SNPs 53 .
Questionnaires, data collected and predictor variables used. Different self-reported questionnaires were sent to families by email or in paper format according to the parents' preference, filled by at least one of the parents and collected by researchers. This questionnaires were based on the surveys used in previous national studies (ALADINO and ELOIN) 4,54 , KIDMED 55 , etc.
The data obtained were processed and cleaned. Finally, a total of 190 variables obtained were classified into categories according to their specific nature. (Table S1, supplementary material). These variables are described in what follows.
Characteristics of schoolchildren. Three variables were taken into account in this category: age, sex and school year.
SNP selection and GRS. The GRS, obtained from 11 SNPs variables well described as significant in childhood obesity, was used in this domain. The GRS for each child was obtained as the sum of the number of risk alleles of each of the 11 SNPs over all the SNPs, by considering that each SNP can contain 0, 1 or 2 risk alleles: e.g. if the risk allele is A, and the SNP appears as GG, GA and AA genotypes, the corresponding number of risk alleles would be 0, 1, and 2, respectively. Therefore, the GRS is defined as: Were NRA i is the number of risk alleles of SNP i.
Physical and leisure activities. 24 variables regarding physical activity and free time data were obtained by an ad hoc questionnaire, based on the surveys used in previous national studies (ALADINO and ELOIN), after receiving content validation by a group of dietitians and exercise science experts. A 48-h physical activity record was collected, corresponding to 24 h of a week day and a complete weekend day 56 to obtain the Individual Physical Activity Coefficient (IPAC) and the Physical Activity Coefficient (PAC) through the coefficient defined by the WHO 49 and by the Institute of Medicine 57 , respectively.
Diet, food and nutrients. 80 variables were also gathered from dietary information through parent selfreported ad hoc questionnaires. These questionnaires were delivered to the parents with the corresponding filling instructions. Before processing, the responses of the questionnaires were checked by the researchers, and parents were phone called in case of unclear or omitted data. The questionnaires included were, the KIDMED validated questionnaire 55 , a 48-h food record of two non-consecutive days, a weekday and a weekend day, as recommended by the European Food Safety Authority guidelines 58 , and analyzed using the DIAL software (Alce Ingeniería, Madrid, Spain) in order to obtain information about macro and micronutrients. Finally, a question- Statistical modeling. R 3.4.2 (https ://www.r-proje ct.org/) was used for all the modeling and data analysis.
The sample was initially characterized by a descriptive exploratory analysis. Qualitative data were presented as percentages and absolute frequencies while quantitative data were expressed as mean ± standard deviation.
The randomForest package was used to develop the RF models, using as settings 500 decision trees and 5 permutations per variable for variable importance calculations. missForest package was used for multiple data imputation with the default settings; a total of 100 imputations were used. An iterative procedure, similar as the one described in Nonyane, et al. and Little et al. 59,60 , was applied in order to include multiple imputation in the variable importance estimation by taking into account both the between-and within-imputation variance in the importance scores. The process was as follows: • For each imputation m, m = 1,…,M we estimated the average importance score of variable x j , ( θ m j , where j = 1,…,p) as the average increase in the OOB MSE (Mean Squared Error) after OOB-permuting x j for each of the B trees of the RF a total of K times: as well as the corresponding standard errors s m j . • From here the average importance score across the M imputations for each variable x j was obtained from: • Finally, the standardized importance score for each variable x j was calculated using: where V j is the weighted sum of the within ( − Wj ) and between ( − Bj ) imputation variances for variable x j : which are defined as: The multiple imputation was also used to derive (rounded to the nearest integer) mean and 95% confidence intervals for the ranks of the importance scores of the different predictor variables in the RF models.
In order to compare the results with those obtained from other methods, a Gradient Boosting Machine (GBM) relative importance plot was also obtained. The gbm package was used to derive the GBM models. Multiple models were derived within an imputation loop, and estimates of relative importance were pooled as described with the RF models. 100 iterations of imputation and model derivation were performed again. We used GBM models with 5000 trees, learning rate of 0.01, bag fraction of 0.5 and interaction depth of 3. The full dataset was used for training, and the best number of trees in each model was obtained through fivefold cross-validation. The relative importance of a variable j for a single tree T with J terminal nodes, when using regression trees in the GBM like in this case is defined as 11 www.nature.com/scientificreports/ where the summation is over the nonterminal nodes t of the J-terminal node tree T, v j is the variable selected for splitting in that node, 1() is an indicator function that equals 1 if v t = j and 0 otherwise, and i 2 t is the decrease of squared error associated to that variable. GBM is an ensemble method, were successive base learners (regression trees in our case) are fitted to minimize the residuals of the previous one; therefore, the final relative importance's for the GBM are obtained by averaging for each variable the relative importance's over all the trees in the model.
In order to derive a consensus variable importance's, the two 100 imputations × 190 variable matrices of RF variable importance's and GBM relative importance's, were first min-max normalized (within each model) in order to make them comparable. As minimum and maximum, the minimum and maximum average variable importance (relative importance for GBM) were used, respectively. After this normalization, the two matrices were merged and averaged for each predictor variable, resulting in a normalized score for each. The top-30 scoring variables were then plotted.