Introduction

Marked differences in health-related traits and diseases exist between countries and between regions within countries1,2,3,4. Regions in Scotland differ for several health-related traits, and display differences in lifespan of up to 7.5 years in men and 4.9 years in women5. Complex traits related to obesity, such as body size, have a genetic basis with heritabilities between 30 and 60%6, 7. On the other hand, changes in body size and the so-called global obesity pandemic are usually associated with environmental changes related to diet, exercise levels and other socioeconomic changes8, 9. From this perspective both genetics and lifestyle differences are potential causal factors for the observed regional differences in health-related traits10,11,12,13,14,15. Disentangling the underlying causes of variation in health-related traits has direct implications for the welfare of future generations; however, genetic and environmental variation may be closely associated and therefore difficult to separate.

In this study, we explored causes of regional variation in data on obesity-related traits as indicators of the health status of ~11,000 Scottish individuals with genotypic records and a variety of measurements of possible causal lifestyle and socioeconomic factors. Our aim was to exploit these high-quality genomic data and high fidelity and deep phenotypic, lifestyle and socioeconomic data to identify the factors contributing to the differences between regions in health-related traits. As expected, obesity-related traits such as body mass index and weight differ significantly between regions. By accounting for both the genomic data and the environmental information in our analyses, we showed that trait variation is substantially influenced by both genetic variation and family environment. However, the regional variation for most obesity traits was associated with lifestyle and socioeconomic variables, such as deprivation, physical activity, etc. rather than the regional genetic structure of the sample. These results imply that although genes and family environment are important determinants of health-related traits, regional differences are attributable mostly to potentially modifiable environmental factors.

Results

Overview of analyses

The objective of this work was to disentangle genetic and environmental components of health-related traits, linked to geographic variation. We explored a Scottish population of ~11,000 individuals with different degrees of kinship, for whom we had genotypes at ~500K markers, phenotypes for 11 traits (8 anthropometric and 3 metabolic), geographic covariates (principal components) reflecting the regional genetic structure of the data (gPCs) and a large set of socioeconomic and lifestyle (SELS) environmental covariates. We jointly fitted genetic and environmental information in a range of statistical models, in an innovative approach to disentangle the causes of regional variation. For more information, see Supplementary Tables 1, 2 and 3 and Supplementary Methods. An overview of analyses and models is shown in Fig. 1.

Fig. 1

Overview of the models and analyses performed. G, Genomic Relationship matrix; K, Kinship matrix; C, Couples matrix; S, Siblings matrix; gPCs, Geographic Principal Components; SELS, Socioeconomic and lifestyle covariates

Regional differences in traits within Scotland

In order to illustrate the geographic differences existing in Scotland, in the Basal Model (Fig. 1) we adjusted each trait for sex, age and clinic and tested the traits for differences between the 32 regions (council areas, defined from the individual’s postcode of residence; Supplementary Table 3). For 9 of the 11 traits studied, differences between regions (i.e., council areas) were significant at a 0.05 level (see Table 1, first column).

Table 1 Significance of region on phenotypes in the benchmark framework

To test if the regional differences detected were due to the genetic relatedness of the sample, we adjusted for kinship by fitting a genomic relationship matrix (G) together with sex, age and clinic in a mixed model analysis (Family Model). We tested the residuals from this model for remaining regional differences. When including the genomic relationship matrix in the model, the differences between regions disappeared for two traits, height and body fat measured by bioelectric impedance analysis (BIA fat) (Table 1, second column), suggesting that the regional variation detected in the Basal Model for these two traits was due to the genetic relatedness of the sample. Nonetheless, for waist circumference, hips circumference, waist-to-hips ratio (WHR), body mass index (BMI), a body shape index (ABSI), creatinine and high-density lipoprotein (HDL) levels, regional differences remained significant (α = 0.05) after adjusting for the genetic relatedness and family structure in the sample.

We then explored if the regional differences could be explained by the population genetic structure of the sample, i.e., the genetic differences between the regions. To do that we adjusted for ten geographic principal components (gPCs) that represent geographical population genetic structure in the cohort. The gPCs were calculated using a subset of unrelated individuals and unlinked markers and then extrapolated to the rest of the population. They reflect the genetic differences between regions as shown in Amador et al.14 (for more information see Methods). We adjusted for the gPCs (together with a genomic relationship matrix, sex, age and clinic in the Structure Model) and we used the residuals of the model to test whether the regional differences remained significant (Table 1, third column). For all six traits with significant regional differences after the previous analyses, these differences remained significant (α = 0.05) after adjusting for the gPCs, i.e., the genetic differences between regions do not explain the regional differences in the studied traits.

Next, we examined if the regional differences could be explained by the environmental differences measured in the cohort by adjusting for the SELS covariates. We fitted a model adjusting for a genomic relationship matrix and SELS covariates, representing this environmental information, together with sex, age and clinic (Environment Model). When we tested the significance of the region in the residuals of this model (Table 1, fourth column), we observed that only ABSI and creatinine showed significant differences (α = 0.05) between regions and these differences had become non-significant for waist circumference, BMI, WHR and HDL, indicating that the regional differences are explained by the measured SELS variables. We fitted a final model including both the gPCs and the SELS covariates (Table 1, Structure and Environment Model) to corroborate the results. The results obtained for this model were very similar to those from the Environment Model, reinforcing the conclusion that the SELS covariates are responsible for the regional differences observed.

A visualisation of the changes in the standardised residual means for each trait per region before and after adjusting for the SELS variables was created using latitude and longitude of Scottish postcodes in R16. This is shown for BMI in Fig. 2 and for all traits in Supplementary Fig. 1. The only remaining regional differences were for creatinine and ABSI. Since our results suggest that those were not due to the geographical population genetic structure (Table 1, Structure Model), these remaining differences are likely to be caused by other environmental variables not measured in our data and not associated with family genetic structure or family environment.

Fig. 2

Regional values of BMI before and after adjusting for the environmental variables. Changes in the standardised means of BMI per region before (left panel) and after (right panel) adjusting for all the lifestyle and socioeconomic covariates. Yellow: regions with less than 20 individuals, not considered

We repeated the whole set of analyses including a larger set of genetic and environmental matrices (G: genomic relationship matrix, K: kinship matrix, C: couples matrix, S: siblings matrix; see Fig. 1: Full models F, S, E and S+E) combined with the different sets of covariates. The results were similar to those of the Benchmark models described above: most regional differences were removed when fitting the SELS variables (Supplementary Table 4).

Heritability estimates and covariate effects

We evaluated the proportion of the variance explained by all the components fitted in several mixed models to further explore genetic and environmental variation in the 11 traits studied following Xia et al.17 Using mixed-model analysis18, 19 we partitioned the phenotypic variance into components representing genetic or environmental effects. We used two genetic relationship matrices (G and K) to account simultaneously for the genetic sharing among distant and closely related individuals7; and two environmental relationship matrices that represented shared environments between members of a couple (C) and siblings (S)17 (Fig. 1, Full models).

The proportion of the phenotypic variance explained by the components in a Full Model is shown in Table 2. The table includes the results for two types of analyses: the Family Model including only the matrices, sex, age and clinic, or the Structure and Environment Model (S+E) including the matrices and gPCs and SELS covariates together with sex, age and clinic.

Table 2 Proportion of the phenotypic variance explained by genomic (G: genomic relationship matrix, K: kinship matrix) and environmental matrices (C: couple matrix, S: sibling matrix)

The estimates of the genotyped single-nucleotide polymorphism (SNP) heritability ($h_g^2$, proportion of the phenotypic variance captured by matrix G) and of the pedigree heritability ($h_{\mathrm{kin}}^2$, captured by matrix K) did not change significantly when including the extended set of covariates in the model, even for those traits where the environmental covariates contributed to regional differences. Furthermore, for most of the traits the estimates of variance due to the shared environments of couples and siblings (C and S) were robust to the inclusion of the extended set of SELS variables (Table 2). This is illustrated for two traits in Fig. 3.

Fig. 3

Heritability estimates from models with different covariates. Proportion of the variance in two different traits captured by each of the genetic or environmental matrices fitted: Model F: including four matrices and sex, age, clinic as covariates (blue bars); Model S+E: including four matrices, gPCs, SELS and sex, age, clinic as covariates (green bars). Error bars show the standard errors of the estimates

The proportion of the variance captured by the couple environment (matrix C) was significant for eight traits, although for HDL the significance disappeared after including the full set of environmental covariates. This would suggest that, for HDL, some of the phenotypic similarities observed in couples can be accounted for by the recorded lifestyle or socioeconomic variables. In addition, the variance captured by the sibling environment (matrix S) was detectable only for two traits (BIA fat and total cholesterol (TC)). For creatinine and HDL, the variance captured by sibling environment was not different from zero in the Family Model, but became significant after including the whole set of covariates. In all cases, the differences in the proportion of variance captured between the Structure and Environment Model (including the whole set of covariates) and the Family Model were subtle.

Table 3 shows the variance explained by the SELS covariates together with the gPCs in the models including a G matrix (details of each individual covariate are shown in Supplementary Table 5). The amount of variance explained by SELS covariates ranged between 0.64 and 35.57% while the gPCs explained always < 0.5% of the variance for all traits. Scottish index of multiple deprivation (SIMD) was the covariate affecting most traits (all except for creatinine) and years of education also explained substantial variance for several traits, with effects on most of the body measurements. Activity level explained a large amount of variance (up to 18.9%) for traits like HDL, BMI, weight and BIA fat. The dietary variables showed effects on many traits but overall explained little variance. For all traits the SELS covariates explained more variance than the geographical population genetic structure, which is consistent with the results showing that the regional differences in the obesity-related traits are associated with environmental rather than genetic variation between the regions.

Table 3 Proportion of the phenotypic variance explained by the covariates in the Benchmark Structure + Environment (S+E) Model

Discussion

Geographic differences in health-related phenotypes and diseases have been detected between countries and between regions within countries, and both genetics and environment could potentially account for these differences11,12,13,14. In this study, we disentangled the underlying causes of phenotypic differences between regions for 11 health-related traits in Scotland. To do so, we accounted for genetic structure together with environmental differences captured by environmental covariates and similarity matrices. We included all of them together in different mixed linear models in an innovative approach to the study of regional differences in health-related phenotypes. We showed the impact of familial genetic structure, geographical population genetic structure and lifestyle and socioeconomic variables on all the traits. We found that for most of the obesity-related traits, existing regional differences within Scotland cannot be explained by geographical population genetic structure and they are predominantly driven by lifestyle and socioeconomic causes.

We showed that for height and BIA fat, the regional differences were explained by the genetic relatedness of the sample, disappearing when we corrected using a genomic relationship matrix. In the case of height, the geographical population genetic structure (gPCs) still explains 0.5% of the variance in the trait (Table 3 and Supplementary Table 5). In a previous study at the pan-European level, Robinson et al.10 detected regional differences in the genetics of height. Our results suggest that the differences observed in our Scottish cohort are due to genetic similarity, and the relatively high variance explained by the gPCs concurs with the results reported in Robinson et al.10, particularly considering the relatively small geographic range in our sample.

For most of the obesity traits, the regional differences disappeared when adjusting for a large set of measured environmental covariates, indicating that after accounting for other factors, the residual regional discrepancies were caused by environmental differences between the regions. An important implication is that studies that explore the causes of regional inequalities should account for both genetic and environmental factors if they are to avoid reaching biased conclusions, particularly in the presence of relatives. In the case of BIA fat, the results show a different pattern from the rest of the obesity traits. The gPCs explain 0.18% of the variance, similar to BMI or waist circumference, but fitting the genetic relatedness of the sample explained the regional differences. The phenotypic correlation between BIA fat and BMI is 0.62, suggesting that only a proportion of 0.38 of the variance (the squared correlation, 0.62² ≈ 0.38) was shared between the two traits.

The addition of the extended set of covariates did not change substantially the heritability or the variances captured by environmental matrices, suggesting that for this type of analysis, fitting only the basal covariates should suffice to obtain accurate heritability estimates for these health-related traits. Hence, previous studies that have not included these effects are unlikely to have produced significantly biased heritability estimates.

The deprivation index SIMD was the environmental variable significantly affecting the largest number of traits, although the variance accounted for by SIMD was relatively small for all traits (Supplementary Table 5). SIMD combines several indicators of deprivation (such as level of income, education, employment, crime, etc. ref. 20) in one index and the associations with SIMD corroborate the described associations between socioeconomic inequalities and health11, 12. The results showed that people living in less deprived areas are taller, with higher levels of HDL, and lower BMI and BIA fat levels. Other environmental variables also explained part of the differences between regions such as years of education, level of activity or alcohol intake. These more specific variables could be also picking up effects of more complex environmental variables such as socioeconomic status (e.g., if there is stratification in levels of alcohol intake or diet composition between different socioeconomic groups).

The increased prevalence of obesity is a worldwide health concern. Reducing the incidence of obesity by effective intervention policies in affected areas would provide substantial benefits in health and quality of life of individuals concerned and significantly reduce associated healthcare costs to the community1, 4, 9. In this study, we have shown that for most of the examined obesity traits, regional differences exist even after accounting for genetics and they can be explained by environmental differences between those regions. These environmental factors are potentially modifiable and therefore could be actionable from a health policy point of view, with the potential for appropriate interventions reducing inequalities in health between areas. In particular, our results show a large effect of physical activity and dietary choices for the studied traits. These two are recognised as critical behaviours affecting obesity and are usually targeted by policy makers21. Designing the interventions to improve these habits is important, but focusing on the relevant geographical (deprived) areas and making an impact on the relevant strata of the population will be crucial.

Methods

Data set

We used the data from the Generation Scotland: Scottish Family Health Study (GS:SFHS)22. Ethical approval for the study was given by the NHS Tayside committee on research ethics (ref: 05/s1401/89). Governance of the study, including public engagement, protocol development and access arrangements, was overseen by an independent advisory board, established by the Scottish government. Research participants gave consent to allow both academic and commercial research.

Individuals were genotyped with the Illumina HumanOmniExpressExome-8 v1.0 or v1.2. We used PLINK version 1.9b2c23 to exclude single-nucleotide polymorphisms (SNPs) that had a missingness > 2% and a Hardy–Weinberg Equilibrium test $P < 10^{-6}$. Markers with a minor allele frequency smaller than 0.05 were discarded. Duplicate samples, individuals with gender discrepancies and those with more than 5% missing genotypes were also removed. The resulting data set was merged with the 1092 individuals of the 1000 Genomes population24 and a principal component analysis (PCA) was performed using GCTA18. Individuals more than six standard deviations away from the mean of principal component 1 and principal component 2 were removed as potentially having African/Asian ancestry as shown in Amador et al.14 After quality control, individuals had genotypes for 519,819 common SNPs spread over the 22 autosomes. Of the ~24,000 individuals in GS:SFHS, the number of individuals without missing values for any of the covariates used in our study was N = 11,118 (4646 males and 6472 females), so we used this set of samples for all the analyses in order to allow comparisons between the models.
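The marker-level exclusions described above can be illustrated with a minimal Python sketch. This is not the PLINK implementation: the function names and the genotype encoding (0, 1 or 2 copies of one allele, `None` for a missing call) are assumptions made for illustration, and the Hardy–Weinberg test is omitted.

```python
def snp_missingness(genotypes):
    """Fraction of missing calls (None) for one SNP."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from allele counts (0, 1 or 2), ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def passes_marker_qc(genotypes, max_missing=0.02, min_maf=0.05):
    """Keep a SNP only if missingness <= 2% and MAF >= 0.05 (thresholds from the text)."""
    return (snp_missingness(genotypes) <= max_missing
            and minor_allele_frequency(genotypes) >= min_maf)
```

A SNP is represented here as a list with one entry per individual; in practice these filters are applied by the genotype-processing software rather than by hand.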

Phenotypes

We used measured phenotypes for 11 complex traits classified as anthropometric (height, weight, body fat measured by bioelectrical impedance analysis (BIA fat), waist circumference, hips circumference, WHR ($\mathrm{waist}/\mathrm{hips}$), BMI ($\mathrm{weight}/\mathrm{height}^{2}$) and ABSI25 ($\mathrm{waist}/(\mathrm{BMI}^{2/3} \times \mathrm{height}^{1/2})$)) and metabolic (levels of creatinine, total cholesterol and HDL in serum). Natural logarithm transformations were performed for all traits except for height and BIA fat, to obtain approximate normal distributions. Phenotypes with values greater or smaller than the mean ± 4 standard deviations (after transformation and adjusting for sex, age and age$^2$) were set to missing (Supplementary Table 1). Boxplots for each trait of individuals living in each region (corresponding to different council areas) are plotted in Supplementary Fig. 2.
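As an illustration of the derived traits and the outlier rule, the following Python sketch computes BMI, WHR and ABSI from the formulas given above and sets values beyond the mean ± 4 standard deviations to missing. The function names and the units (kilograms and metres) are assumptions; the sketch does not reproduce the adjustment for sex, age and age$^2$ that the text applies before outlier removal.

```python
import statistics


def bmi(weight_kg, height_m):
    """Body mass index: weight / height^2."""
    return weight_kg / height_m ** 2


def whr(waist, hips):
    """Waist-to-hips ratio (both in the same unit)."""
    return waist / hips


def absi(waist_m, weight_kg, height_m):
    """A body shape index: waist / (BMI^(2/3) * height^(1/2))."""
    return waist_m / (bmi(weight_kg, height_m) ** (2 / 3) * height_m ** 0.5)


def mask_outliers(values, n_sd=4.0):
    """Replace values further than n_sd standard deviations from the mean with None."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v if abs(v - mean) <= n_sd * sd else None for v in values]
```

For the log-transformed traits, `mask_outliers` would be applied to `math.log` of the raw values.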

Covariates

We explored a large set of covariates representing potential environmental factors influencing differences between individuals in the study. We fitted these factors as putative predictors of trait variation in statistical models as discrete or continuous covariates depending on their nature. The covariates lay in three categories: basic, socioeconomic and lifestyle. Basic covariates were sex, age, and clinic where the phenotypes were measured; socioeconomic covariates were SIMD (a deprivation ranking based on living area20), years of education, household size, vehicle ratio and job status. Lifestyle covariates are alcohol units consumption, smoking status, activity level, fruit units eaten per day and consumption of different foods (fruit, vegetables, fish, meat, eggs and dairy). A detailed description of these variables is shown in Supplementary Table 2 and additional information on how the quality control was performed is given in Supplementary Note 1.

Information on the postcode at which individuals were living at the time when their data were recorded was also available. The individuals were allocated to their corresponding council area based on these postcodes. A more detailed description on the correspondence between postcodes and regions is shown in Supplementary Table 3. The distribution or incidence of the covariates in the different council areas is plotted in Supplementary Fig. 3.

We also calculated a set of variables that represent genomic geographic origin through a PCA. To do so, we created a pruned subset of SNPs in approximate linkage equilibrium with each other and we removed markers from chromosome 6 in the major histocompatibility complex region and markers in the 8p23.1 region14. We kept only unrelated individuals (i.e., by removing one individual in each pair with a genomic relationship coefficient larger than 0.025). We performed a PCA in this subset ($N_{\mathrm{ind}}$ = 7370, $N_{\mathrm{SNP}}$ = 91,390), calculated the loadings of the SNPs contributing to each of the first 10 principal components, and computed the values of these principal components for all 11,118 individuals used in subsequent analyses. The resulting set of variables represents well the regional genetic structure of the sample as shown in Amador et al.14 We refer to these 10 PCs as gPCs or geographical population genetic structure.
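Two steps of this procedure lend themselves to a short sketch: selecting a set of unrelated individuals, and scoring every individual on a principal component using SNP loadings estimated in that unrelated subset. The Python below is a hedged illustration only. The function names are hypothetical, the greedy order in which related individuals are dropped is an assumption (the text does not say which member of a pair was removed), and the PCA itself is not reproduced.

```python
def prune_related(grm, threshold=0.025):
    """Return indices of individuals kept so that no kept pair has a
    genomic relationship coefficient above the threshold.
    grm is a square matrix given as a list of lists."""
    kept = []
    for i in range(len(grm)):
        if all(grm[i][j] <= threshold for j in kept):
            kept.append(i)
    return kept


def project_onto_pc(genotypes, snp_means, loadings):
    """Score one individual on one principal component: the sum over SNPs
    of (genotype - SNP mean) * loading, with means and loadings taken
    from the unrelated subset."""
    return sum((g - m) * w for g, m, w in zip(genotypes, snp_means, loadings))
```

Applying `project_onto_pc` to related and unrelated individuals alike yields component values for the whole sample while keeping the component definitions free of family structure.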

Matrices

We used design matrices representing genomic or environmental relationships as in Xia et al.17: G is a genomic relationship matrix (GRM) containing relatedness between pairs of individuals based on identity-by-state at the genotyped SNPs19, 26. K is a matrix representing pedigree relationships as in Zaitlen et al.7 This is a modification of G obtained by setting all entries in G lower than 0.025 to 0. C is a matrix representing common environmental effects shared between couples. The matrix contains a value of 1 between pairs of individuals identified as members of a couple as in ref. 17. S is a matrix representing common environmental effects shared between siblings. The matrix contains a value of 1 between pairs of individuals identified as siblings as in ref. 17. G and K were calculated using GCTA18; the environmental matrices (C and S) were created using R version 3.1.116.
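The construction of K, C and S from the definitions above can be sketched in a few lines of Python. The function names are hypothetical, and setting the diagonal of C and S to 1 is an assumption, since the text only specifies the value for pairs.

```python
def kinship_from_grm(grm, threshold=0.025):
    """K: a copy of G in which every entry lower than the threshold is set to 0."""
    return [[x if x >= threshold else 0.0 for x in row] for row in grm]


def pair_matrix(n, pairs):
    """C or S: n x n matrix with 1 for each listed pair of individuals
    (couples or siblings). The diagonal is also set to 1 (assumption)."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1.0
    for i, j in pairs:
        m[i][j] = 1.0
        m[j][i] = 1.0
    return m
```

For example, `pair_matrix(3, [(0, 1)])` marks individuals 0 and 1 as sharing an environment and leaves individual 2 unconnected.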

Analyses

A summary with names of models and analyses undertaken is shown in Fig. 1.

Firstly, to illustrate which traits show variation between the different regions in Scotland, we explored in a simple linear regression whether there were differences in the traits between council areas (Basal Model, B). Using the statistical package R16, we tested the significance of the variable “region where individuals live” (region) in a linear model.
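Testing the significance of region in a linear model amounts, in the simplest case, to a one-way analysis of variance on trait values that have already been adjusted for the other covariates. The Python sketch below computes only the F statistic for such a comparison; it is an illustration under that simplifying assumption, not the R analysis used in the study, and the function name is hypothetical. A P value would additionally require the F distribution with (k − 1, n − k) degrees of freedom, which is omitted here.

```python
import statistics


def one_way_f(groups):
    """F statistic for differences between group means.
    groups maps a region label to a list of adjusted trait values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = statistics.fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        m = statistics.fmean(values)
        ss_between += len(values) * (m - grand_mean) ** 2
        ss_within += sum((x - m) ** 2 for x in values)
    # Ratio of between-region to within-region mean squares
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

The same calculation applies to the residuals of the mixed models described next, where a large F indicates regional differences that the fitted model has not explained.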

To explore if trait variation was due to genetic structure or to the environmental differences, we used variance component analyses. This way the models take appropriately into account the kinship in the sample. All the analyses were implemented in GCTA18. The basic general mixed linear model explored is shown in (1).

$$y = X\beta + g_{g} + \varepsilon ,$$
(1)

where y is an n × 1 vector of observed phenotypes with n being the number of individuals, β is a vector of fixed effects and X is its design matrix, $g_g$ is an n × 1 vector of the total additive genetic effects of the individuals captured by genotyped SNPs with $g_g \sim N(0, G\sigma_g^2)$, and ε is an n × 1 vector of residuals.

We fitted different sets of covariates in this general framework to explore four different models:

The Family Model (F) included only sex, age and clinic as fixed effects.

The Structure Model (S) included the geographical principal components (gPCs), together with sex, age and clinic as fixed effects.

The Environment Model (E) included the SELS covariates, together with sex, age and clinic in the vector of fixed effects.

The Structure and Environment Model (S+E) included all gPCs and SELS, together with sex, age and clinic in the vector of fixed effects.

For each model, we predicted by the BLUP (best linear unbiased prediction) method the total genetic (and environmental if appropriate) effect of each individual together with their residuals. We tested if the residuals were significantly different between the regions to show if the differences between regions were explained by the models or remained unexplained. We also estimated the variance explained by matrix G ($\sigma_g^2$) in all these models, and the variance explained by each covariate included was calculated as

$$\frac{\mathrm{Var}(\mathrm{Covariate}) \times b^{2}}{\mathrm{Var}(\mathrm{Trait})},$$
(2)

where b is the effect of the covariate estimated from the GREML analyses.
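Equation (2) translates directly into code. The following Python sketch assumes a single continuous covariate and uses sample variances; the function name is hypothetical, and how multi-level categorical covariates were handled is not covered.

```python
import statistics


def variance_explained(covariate, trait, b):
    """Proportion of trait variance explained by a covariate with
    estimated effect b, following Eq. (2)."""
    return statistics.variance(covariate) * b ** 2 / statistics.variance(trait)
```

For instance, a covariate with variance 2.5 and effect b = 1 on a trait with variance 10 explains 2.5 × 1 / 10 = 0.25 of the trait variance.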

Model (S) including a genetic matrix (G) and the gPCs allowed us to test if adding the geographic structure to Model (F) (including only a genetic matrix) would account for the differences between regions. Model (E) including G and the SELS covariates compared with Model (F), allowed us to test if the differences were due to the SELS variables. Model (S+E) included both gPCs and SELS together for comparison with the two previous models.

We fitted again the models described above, including in addition another genetic and two environmental matrices fitted in linear mixed models as in Xia et al.17 to test if the observed regional differences were affected by the inclusion or exclusion of the environmental matrices (C, S) and K as shown in (3)

$$y = X\beta + g_{g} + g_{\mathrm{kin}} + e_{c} + e_{s} + \varepsilon ,$$
(3)

where $g_{\mathrm{kin}}$ is an n × 1 vector of the extra genetic effects associated with the pedigree for relatives with $g_{\mathrm{kin}} \sim N(0, K\sigma_k^2)$, and $e_c$ and $e_s$ are n × 1 vectors representing the common environmental effects shared by couples or siblings, with $e_c \sim N(0, C\sigma_c^2)$ and $e_s \sim N(0, S\sigma_s^2)$.
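As a worked illustration of how Eq. (3) relates to the proportions of variance reported for each matrix, assume (this is not stated explicitly in the text) that the random effects are mutually independent and that the residuals are independent with variance $\sigma_e^2$, where $I$ is the identity matrix. The phenotypic covariance implied by Eq. (3) would then be

$$\mathrm{Var}(y) = G\sigma_g^2 + K\sigma_k^2 + C\sigma_c^2 + S\sigma_s^2 + I\sigma_e^2 ,$$

and, writing $\sigma_p^2 = \sigma_g^2 + \sigma_k^2 + \sigma_c^2 + \sigma_s^2 + \sigma_e^2$ for the total phenotypic variance, the proportion captured by each matrix would be the corresponding ratio, for example

$$h_g^2 = \frac{\sigma_g^2}{\sigma_p^2}, \qquad h_{\mathrm{kin}}^2 = \frac{\sigma_k^2}{\sigma_p^2} .$$

The symbols $\sigma_e^2$ and $\sigma_p^2$ are introduced here for illustration only.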

Results for the four models (Model (F), Model (S), Model (E) and Model (S+E)) were also explored in the context of Eq. (3).

Data availability

Data are available from the MRC IGMM Institutional Data Access/Ethics Committee for researchers who meet the criteria for access to confidential data. GS:SFHS data are available to researchers on application to the Generation Scotland Access Committee (access@generationscotland.org). The managed access process ensures that approval is granted only to research which comes under the terms of participant consent which does not allow making participant information publicly available.