Identification of genetic loci affecting body mass index through interaction with multiple environmental factors using structured linear mixed model

Multiple environmental factors could interact with a single genetic factor to affect disease phenotypes. We used Struct-LMM to identify genetic variants that interacted with environmental factors related to body mass index (BMI) using data from the Korea Association Resource. The following factors were investigated: alcohol consumption, education, physical activity metabolic equivalent of task (PAMET), income, total calorie intake, protein intake, carbohydrate intake, and smoking status. Initial analysis identified 7 potential single nucleotide polymorphisms (SNPs) that interacted with the environmental factors (P value < 5.00 × 10−6). Of the 8 environmental factors, PAMET score was excluded for further analysis since it had an average Bayes Factor (BF) value < 1 (BF = 0.88). Interaction analysis using 7 environmental factors identified 11 SNPs (P value < 5.00 × 10−6). Of these, rs2391331 had the most significant interaction (P value = 7.27 × 10−9) and was located within the intron of EFNB2 (Chr 13). In addition, the gene-based genome-wide association study verified EFNB2 gene significantly interacting with 7 environmental factors (P value = 5.03 × 10−10). BF analysis indicated that most environmental factors, except carbohydrate intake, contributed to the interaction of rs2391331 on BMI. Although the replication of the results in other cohorts is warranted, these findings proved the usefulness of Struct-LMM to identify the gene–environment interaction affecting disease.

www.nature.com/scientificreports/ association studies (GWAS) 9 . Among the GWAS single nucleotide polymorphisms (SNPs), the FTO locus is the strongest genetic variant and is recently reported to affect BMI through interaction with environmental factors, including physical activity, diet, and smoking 7,10-13 .
Most of the interaction studies for obesity calculate the interaction between a single genetic factor or multiple genetic factors and a single environment factor 9,14-17 . However, there is a possibility that the interaction of multiple environments and a single genetic factor can also affect the phenotype 7,18 . Recently, the structured linear mixed model (Struct-LMM) analysis was reported to identify the genetic loci, characterized by the interaction of multiple environmental factors 18 . Using this method, high-dimensional environmental data can be used in population cohorts to help understand the effects of genetic factors in a group on complex traits and diseases 18 .
In this study, we applied the Struct-LMM method to the Korea Association Resource (KARE), a well-known Korean population GWAS database, comprised of 8,840 participants 19 . A total of 8 obesity-related environmental factors, which could interact with multiple environmental factors for BMI, were examined on genome-wide genetic variants. These environments included estimated daily alcohol consumption, education, physical activity metabolic equivalent of task (PAMET), income, total calorie intake, protein intake, carbohydrate intake, and smoking status.

Materials and methods
Korean association resource (KARE) cohort. We utilized data from KARE for all the analyses. Participants of KARE cohort were recruited from two regions in South Korea (Ansan and Ansung) from 2009 to 2012 for the Korean Genome and Epidemiology Study. All study participants aged ≥ 40 years provided written informed consent, and approval was obtained from the institutional review board. The exclusion criteria were as follows: history of cancer, gender inconsistencies, cryptic relatedness, low genotype call rate (< 95%), and sample contamination ( Fig. 1) 19,20 .
From the 8,840 samples, 8,155 samples (with all the 8 environmental factors related to obesity) were selected. The distribution of the variables in the samples has been summarized in Table 1.
Genotype data. The KARE study utilized the Affymetrix Genome-Wide Human SNP Array GeneChip 5.0 19 . SNP imputation was performed using IMPUTE2 with the 1000 Genomes Project (haplotype phase 1) 19 . At  Dataset for phenotype and environmental factors. The methods for the measurement of height and weight have been described in a previous study 20 . BMI was calculated as weight (kg)/height (m 2 ). For analyses, BMI was transformed to normal distribution using Gaussian function in Struct-LMM. The environmental factors used in Struct-LMM were selected based on previous studies on BMI 9,14,16,17 . The 8 environmental factors were alcohol consumption 9 , education 14 , physical activity 9 , income 9 , total calorie intake 16 , protein intake 16 , carbohydrate intake 16 , and smoking status 17 (Supplementary Table 1).
The estimated daily consumption of alcohol (g/day) was calculated, as described previously 20 . PAMET was obtained from each participant using a structured questionnaire that included four types of physical activities including sleeping, five different sedentary activities, non-sedentary activities, and only leisure-time physical activity 21 . The total calorie intake, protein intake, and carbohydrate intake were calculated from a food frequency questionnaire (FFQ), as described 22,23 .
For the Struct-LMM interaction analysis of alcohol consumption, the amount of alcohol consumption additionally increased in each group by 20 g/day from the 1st group (less than 20 g/day) up to the 6th group (> 100 g/ day) 24 . For education analysis, participants were divided into two groups, one with education no more than secondary school and another with education more than secondary school. The physical activity was analyzed from PAMET score as a continuous variable. For income analysis, participants were divided into groups based Statistical analysis. SNP quality control was performed in PLINK v.1.9.0, as described previously 25 . The gene-environment interaction on BMI for the eight environmental factors was analyzed by Struct-LMM v.0.3.1 18 , adjusted for age, sex, and recruitment area 26 . We performed Manhattan plot drawing, box plot drawing, association analysis, residual value calculation, bar graph drawing, and correlation in R stats package version 3.5.1 (www.r-proje ct.org). For Manhattan plot drawing, qqman package was used for residual value calculation, while for association analysis, lme4 package in R stats package was used. We calculated the Bayes Factor (BF) for each environmental factor to explore the most relevant environments for GEI in Struct-LMM v.0.3.1. The Bayes Factor method used in Struct-LMM is a statistical method comparing two models, one with the environmental factors and another without the environmental factors, in order to assess which model is better by quantifying the power of each model.
We performed a gene-based genome-wide association analysis using the MAGMA 27 tool provided by FUMA 28 through GWIS results calculated by Struct-LMM in this study.
Ethics approval and consent to participate. This study was performed in accordance with the World Medical Association Declaration of Helsinki. All participants provided written informed consent to participate in the study 19 . Approval for the study was obtained from the Institutional Review Board (IRB) of Kyung Hee University (KHSIRB-19-387(EA), KHSIRB-20-077(EA)).

Association of environmental factors on BMI.
The environmental characteristics of the 8155 participants, included in this Struct-LMM analysis, have been summarized in Table 1. The participants were categorized into four obesity groups, based on the BMI values stated by the World Health Organization (Asia-Pacific region) and the Korean Obesity Society 29 . The average BMI of the participants was 24.61 (standard deviation = 3.12).
The relationship between the environmental factors was assessed using a correlogram ( Supplementary Fig. 1). It was observed that PAMET and carbohydrate intake had relatively low correlation with other environmental factors, compared to the other 6 factors.

Analysis of gene-environment interaction on BMI using Struct-LMM.
We performed the GEI test between the 5,908,111 SNPs and 8 environmental factors on BMI, using Struct-LMM adjusted for age, sex, and recruitment area. The study design has been depicted in Fig. 1. Based on the results, we did not find an association of genome-wide significance with P-value < 5 × 10 −8 , as shown in the Manhattan plot (Fig. 2). However, we found 7 potential associations with P-value < 5 × 10 −6 ( Table 2).
For each potential genetic variant, we calculated the BF value of each environmental factor to examine the interaction reliability of the respective environmental factor on BMI, using the tool included in the Struct-LMM program 18 . For rs2391333, showing the most significant interaction (P-value = 9.17 × 10 −8 ), the BF value of total calorie intake was the highest (BF = 8.43) among the 8 environmental factors, while the BF values of both carbohydrate intake and PAMET were the lowest (BF < 0.01 each, Fig. 3). When the BF values of the environmental factors were added from all the 7 potential SNPs, the BF value of PAMET was the lowest among the 8 environmental factors (Fig. 3, Supplementary Table 2). To confirm whether this was true for other SNPs as well, we recalculated the BF value of the 8 environmental factors from 66 independent lead SNPs that showed an interaction association with P-value < 5 × 10 −5 (Supplementary Table 3 Table 3). Therefore, we analyzed the GEI again by Struct-LMM, after excluding PAMET.
Second analysis of gene-environment interaction using 7 environmental factors. We performed the GEI test between the 5,908,111 SNPs and 7 environmental factors (without PAMET) on BMI, using Struct-LMM adjusted for age, sex, and recruitment area (Fig. 1). We discovered 11 independent potential genetic www.nature.com/scientificreports/ variants with P-value < 5 × 10 −6 ( Table 3, Fig. 4). A variant, rs2391331, located within the intron of EFNB2 gene (Chr 13), showed a genome-wide significance (P-value = 7.27 × 10 −9 ). We analyzed the interaction association of the other 6 potential variants identified from the analysis with the 8 environmental factors, including PAMET. While three SNPs (rs12402440, rs59756727, and rs9512706) had lower P-values in the interaction test involving 7 environmental factors, the other 3 SNPs (rs7760212, rs668056, and rs5755279) had higher P-values (Supplementary Table 4).

BF of gene-environment associated variant. The BF of environmental factors for the 11 potential
SNPs were calculated (Fig. 5, Supplementary Table 10). For the most significant variant, rs2391331, all the environmental factors (except carbohydrate intake), showed BF value > 1, thus indicating the evidence of interaction. The interaction BF values of rs2391331 were as follows: protein intake (BF = 5.28), income (BF = 4.59), total calorie intake (BF = 4.12), alcohol consumption (BF = 2.85), smoking status (BF = 2.63), and education (BF = 1.41) (Supplementary Table 10). Additionally, we investigated the interaction of individual environmental factors with the 10 potential SNPs using a fixed effect model of linear regression (Supplementary Table 11 -20). As a result, the most significant P-value (P-value = 7.65 × 10 −7 ) was observed with total calorie intake interacted with rs11730118, which also showed the highest BF value (BF = 10.12) among the 11 potential SNPs of Struct-LMM analysis (Fig. 5, Supplementary Table 10, and Supplementary Table 16). Box plots showed distribution of the in sample estimated allelic effect size on BMI considering environmental factors for 11 potential variants with GEI (n = 8155, unrelated individuals of Korean population) ( Supplementary  Fig. 2).

Discussion
In this study, we performed an interaction test between multiple environmental factors and genetic variant, using Struct-LMM at a genome-wide level 18 . The identified genetic variant, rs2391331 was significantly associated with the interaction of 7 environmental factors in a Korean population cohort, KARE (P-value = 7.27 × 10 −9 ). Of the 7 environmental factors, protein intake was the most influential environmental factor for rs2391331, and the least influential environmental factor was carbohydrate intake (Fig. 5).
A certain genetic variant may affect a trait through multiple environmental factors, as shown in FTO and MC4R [31][32][33] . The genetic variant of FTO showed interactions with diverse environmental exposures such as physical activity, diet, and alcohol consumption. The effect size of FTO variants on BMI was reduced by the increased physical activity but increased by the decreased physical activity 10 . Struct-LMM was modelled to identify such interactions with multiple environmental factors, which may not be identified otherwise. As shown in Supplementary Table 6, rs2391331 could not be identified as a genome-wide significant variant through the conventional fixed effect model approach, which used a single environmental factor. While protein intake, income, total calorie intake, alcohol consumption, and smoking status showed statistical significance after multiple correction, none of these environmental factors showed a genome-wide significance for the interaction with rs2391331 (P-value < 5.00 × 10 −8 ). Therefore, unless the Struct-LMM model was applied, rs2391331 could not be identified through the genome-wide interaction study.
Another advantage of the Struct-LMM analysis is that it provides a rank about which environmental factor is more reliable for the interaction with the genetic variant, using BF values. As shown in Fig. 5 and Supplementary Table 10, the BF values for rs2391331 indicated that the ranking was in the order of protein intake (BF value = 5.28), income (BF = 4.59), total calorie intake (BF = 4.12), alcohol consumption (BF = 2.85), smoking status (BF = 2.63), education (BF = 1.41), and carbohydrate intake (BF < 0.01). When we investigated the interaction of individual environmental factors with rs2391331 using a fixed effect model of linear regression, the P-values of environmental factors were in the order of P-value, as follows: protein intake (P-value = 1.64 × 10 −4 ), income (P-value = 3.62 × 10 −4 ), total calorie intake (P-value = 6.27 × 10 −4 ), alcohol consumption (P-value = 2.90 × 10 −3 ), smoking status (P-value = 3.77 × 10 −3 ), education (P-value = 1.88 × 10 −2 ), and carbohydrate intake (P-value = 5.28 × 10 −1 ) as shown in Supplementary Table 6. The order of P-value of each environmental factor was the same as the one of BF value obtained by Struct-LMM. In case of rs2391331, all 6 environmental factors (except carbohydrate intake) had a rather consistent effect on BMI through the interaction.
The genetic variant, rs2391331 was located in the first intron of EFNB2 gene (Chr 13). EFNB2 is the ligand of erythropoietin-producing hepatocellular kinases (EPH), the largest family of receptor tyrosine kinases 34,35 . We performed the gene-based genome-wide association analysis and verified that at the gene level EFNB2 was also statistically significant for the interaction with the 7 environmental factors (Gene-set P-value = 5.03 × 10 −10 ) (Supplementary Table 7). Further, we found that LD-linked proxy SNPs for rs2391331 were associated with EFNB2 gene in the pituitary, or testis (Supplementary Table 9). EFNB2 has been reported to be associated with schizophrenia and ankle injury through the genome-wide association study. Also, several studies have shown the association of this gene hypertension and type 2 diabetes; however, there have been no reports regarding the connection of EFNB2 with obesity-related traits [36][37][38] . Recently, it was reported that smooth muscle-specific deletion of EFNB2, using SM22Δ-Cre, resulted in lower body weight, reduced vascular smooth muscle cell proliferation, wall, and enlarged arterial diameter 39 . More recently, it has been reported that the expression of EFNB2 is enriched in proopiomelanocortin (POMC) neurons, a major regulator of energy balance and glucose homeostasis and the loss of EFNB2 in POMC-expressing progenitors mildly impairs gluconeogenesis and food intake in mice 40 . We do not know yet whether how this report would be related to the findings in this study. Further studies are required to investigate the functional analysis of EFNB2 related to obesity.
Additionally, we found 4 publications related to the GEI on BMI in which studies 96 SNPs were reported 18,[41][42][43] . We tested the 96 SNPs for replication using the Struct-LMM in this study population, however we could not find any significant interaction under the multiple correction criteria (P-value < 0.05 / 96) (Supplementary Table 21). We may provide the reasons for the failure of replication as follows. First, our sample size (N = 8,155) may not be enough for replication. It is generally accepted that for GEI studies, the bigger sample size is recommended than the size for GWAS 2 . Second, the ethnicity of our sample (East Asian population) is different from the ethnicity of previously mentioned 4 studies (3 studies in European population 18 www.nature.com/scientificreports/ population 42 ). The effects of environmental factors are supposed to be discordant between different ethnical backgrounds, making it hard to produce a replication. Third, there may be a limitation in obtaining the precise environmental data. The environmental data acquired from self-reported questionnaire (i.e., dietary intakes, alcohol, smoking, sociodemographic factors, and physical activities) may be prone to responder bias 9 . There are several limitations to our study. First, as mentioned above, we analyzed the GEI in obesity with a small sample size, which can affect the statistical power and lead to imprecise or incorrect estimates of the magnitude of observed effects 44 . In addition, the number of environmental factors investigated was small. While Struct-LMM analysis has good detection power, even if more than 10 environmental factors are analyzed, a smaller number of environmental factors reduces the detection power of this method 18 . Lastly, the results of rs2391331 were not validated in other cohorts or ethnicities. Although it is not easy to find cohorts with all the diverse environmental factors similar to this study, replication of the results in other cohorts is warranted.
In conclusion, we performed multiple environments-gene interaction analysis to identify potential SNPs of BMI in a Korean cohort. A genome-wide significant interaction of rs2391331, located in the EFNB2 locus, was identified and the interaction on BMI was influenced by 6 environmental factors, namely protein intake, income, total calorie intake, alcohol consumption, smoking status, and education.