Comprehensive genomic analysis of dietary habits in UK Biobank identifies hundreds of genetic loci and establishes causal relationships between educational attainment and healthy eating

Unhealthy dietary habits are leading risk factors for life-altering diseases and mortality. Large-scale biobanks now enable genetic analysis of traits with modest heritability, such as diet. We performed genomewide association on 85 single food intake and 85 principal component-derived dietary patterns from food frequency questionnaires in UK Biobank. We identified 814 associated loci, including olfactory receptor associations with fruit and tea intake; 136 associations were only identified using dietary patterns. Mendelian randomization suggests a Western vs. prudent dietary pattern is causally influenced by factors correlated with education but is not strongly causal for coronary artery disease or type 2 diabetes.

Unhealthy diet is thought to be the leading risk factor for mortality both globally 1 and in the US. 2 Overall, incidence rates of these dietary risk factors and their related diseases, like obesity and type 2 diabetes (T2D), are rising in parallel worldwide, 3,4 causing global epidemics that require urgent attention. Describing a biological basis for unhealthy dietary preferences could guide more effective dietary recommendations.
There is a clear, albeit modest, genetic component to diet, such as traditional measures of macronutrient intake (i.e. proportion of carbohydrate, fat, and protein to total energy intake), as demonstrated by significant heritability and individual genetic associations. [5][6][7][8][9] Five genomewide association studies (GWAS) of macronutrient intake have been conducted to date; the most recent multi-trait analysis identified 96 independent genetic loci by combining summary statistics of individual macronutrient GWAS from 24-hour diet recall questionnaires in 283K individuals in UK Biobank (UKB). [6][7][8][9][10] In addition, the Neale Lab conducted GWAS (http://www.nealelab.is/uk-biobank/) across thousands of mostly binary traits analyzed primarily as dichotomous outcomes (e.g. wholemeal bread vs. all others) in 361K unrelated individuals in UKB. The recent GeneAtlas 11 improved power by using linear mixed models in 450K individuals in UKB, but analyzed a smaller set of dietary variables.
Additional measures of dietary intake, including both curated measures of single food intake and multivariate dietary patterns such as those described by principal component (PC) analysis, have also shown significant associations with health outcomes in both epidemiological studies 12,13 and clinical trials. 14 Thus, with the recent advent of large biobank-sized population cohorts with dietary data, we can now perform GWAS with multiple complementary phenotyping approaches to examine a wide array of dietary habits, including previously unstudied single food comparisons (e.g. wholemeal vs. white bread) and dietary patterns. Here, we report heritability and GWAS analysis (using linear mixed models) of both single food intake, analyzed as curated single food intake quantitative traits (FI-QTs), and of PC-derived dietary patterns (PC-DPs) using food frequency questionnaire (FFQ) data in up to 449,210 Europeans from UKB; we highlight associations at biologically interesting loci (olfactory receptor loci associated with tea and fruit consumption) and use Mendelian randomization (MR) analyses to elucidate causal relationships pertaining to specific dietary habits.

Dietary habits have strong phenotypic correlation and most are heritable
We derived 85 curated single FI-QTs from the FFQ administered in UKB, using 35 nested and complementary questions (Supplementary Table 1 [...] heritable FI-QTs fall into a handful of dietary food groups related to milk consumption, alcohol intake, and butter/spread consumption. The first PC-DP (hereafter referred to as PC1) is among the most heritable dietary patterns (PC1 h² = 13.6%, Table 1) and is more heritable than all of its individual contributing FI-QTs.
PC1, which explains 8.63% (Supplementary Figure 2) of the total phenotypic variance in FI-QTs, captures previously described "Western" and "prudent" dietary factors 15 and is primarily defined by the type of bread consumed (wholegrain/wholemeal vs. white bread). Overall, the FI-QTs that have high positive loadings for PC1 include wholemeal/wholegrain bread consumption, increased fruit and vegetable intake, increased oily fish intake, and increased water intake. The FI-QTs that have high negative loadings include white bread consumption, butter and oil spread consumption, increased processed meat intake, and consumption of milk with higher fat content (Figure 1). [...] (Figure 2), with outliers often explained by smaller sample size (Supplementary Figure 5).
Notably, there is a mix of FI-QTs and PC-DPs among the most successful GWAS traits, with the Western vs. prudent PC1 GWAS identifying the largest number of significant loci (M=140), which together explain the largest amount of variance of any dietary habit analyzed (6.65%).

GWAS of dietary pattern PC1: relationship to individual food intake traits and non-dietary phenotypes
Of the PC-DPs, the "Western vs. prudent" PC1 dietary pattern has the highest heritability and the most genome-wide significant loci, and similar dietary patterns have been associated with disease. 21,22 Although PC1 is both phenotypically and genetically correlated with its 19 contributing FI-QTs (absolute r_p = 0.25-0.81; absolute r_g = 0.29-0.93), the GWAS results provide distinct sets of significant associations. Together, PC1 and its 19 contributing traits are associated with a total of 387 independent genome-wide significant loci, each falling into one of four categories (Figure 3): 55 loci significant for PC1 only (dark blue), 282 loci significant for one or more of the 19 contributing FI-QTs only (dark red), 37 loci more significant for PC1 (light blue), and 13 loci more significant for one or more of the 19 FI-QTs (light red). The 55 loci significantly associated with PC1 but not FI-QTs still trend toward association with one or more of the 19 contributing FI-QTs, whereas the reciprocal is not always the case: some FI-QT-associated SNPs display no association with PC1 at all. This observation indicates that the use of PCs can increase power to detect some associations, while others are only detectable through association with specific foods, supporting the use of both of these complementary phenotyping approaches to more effectively define the genetic architecture of dietary intake.
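The four-way grouping of loci described above can be illustrated with a minimal sketch. The threshold of 5×10⁻⁸ and the rule that the two "more significant" groups contain loci passing the threshold for both PC1 and at least one FI-QT are assumptions made for illustration; the function name `classify_locus` is hypothetical and does not come from the study.

```python
# Illustrative only: threshold and category rules are assumptions,
# not the study's documented procedure.
GWS_THRESHOLD = 5e-8  # conventional genome-wide significance (assumed)


def classify_locus(p_pc1, p_fiqts, threshold=GWS_THRESHOLD):
    """Assign a locus to one of the four groups named in the text.

    p_pc1   -- association P-value of the locus with PC1
    p_fiqts -- P-values of the same locus with each contributing FI-QT
    """
    best_fiqt = min(p_fiqts)
    pc1_sig = p_pc1 < threshold
    fiqt_sig = best_fiqt < threshold
    if pc1_sig and not fiqt_sig:
        return "PC1 only"
    if fiqt_sig and not pc1_sig:
        return "FI-QT only"
    if pc1_sig and fiqt_sig:
        # Both pass the threshold: report which signal is stronger.
        if p_pc1 < best_fiqt:
            return "more significant for PC1"
        return "more significant for FI-QT"
    return "not significant"


# Counts reported in the text for the four groups.
reported_counts = {
    "PC1 only": 55,
    "FI-QT only": 282,
    "more significant for PC1": 37,
    "more significant for FI-QT": 13,
}
total_loci = sum(reported_counts.values())
```

The reported group sizes sum to the 387 loci stated in the text, which is a quick consistency check on the four categories.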
In addition to being correlated with its contributing FI-QTs, PC1 displays significant genetic correlation with 248 non-diet related traits, including traits relating to physical activity, educational attainment, socioeconomic status, smoking status, medication codes, and urine biomarkers (Supplementary Table 5). The lead PC1 SNP, rs66495454, is a common indel (MAF=38%) located at chr1:72748567 in the promoter of neuronal growth regulator 1 (NEGR1; Supplementary Figure 7). The deletion allele of rs66495454 (-/TCCT) is associated with a decrease in prudent eating (beta=-0.017, P=2.80×10⁻⁴⁸) and has been previously reported as associated with a decrease in intelligence, educational attainment and, perhaps surprisingly, [...] Table 7).
The smaller effects suggest that while there could be pleiotropy undetected by the Egger and WM approaches, or true bidirectional effects, the causal influences are more likely to be in the direction of educational attainment to prudent dietary patterns. Importantly, because the instrumental variables used for educational attainment are not mechanistically linked directly to educational attainment/intelligence, it remains possible that causal influences on PC1 could be due to unmeasured heritable factor(s) that are themselves causal for educational attainment/intelligence. In contrast to the educational attainment analyses, we were unable to provide robust evidence of a causal relationship in either direction between BMI and PC1 due to significant heterogeneous pleiotropic effects leading to inconsistent causal effect estimates (

Do dietary habits have a causal relationship with disease and related risk factors?
To test whether the "prudent" PC1 dietary pattern is likely to causally influence disease risk, we repeated bidirectional MR analysis between PC1 and coronary artery disease (CAD) from the mostly European CARDIoGRAMplusC4D GWAS and T2D from the DIAGRAM consortium 2017 GWAS. 26,27 The only association that provides robust evidence of a causal relationship was an increased risk of CAD leading to an increase in PC1 prudent eating, with a significant, albeit small, effect (WM beta=0.0458, 95% CI: 0.016-0.076, P=0.003, Supplementary Table 8), suggesting reverse causation of CAD on diet. Though we found that higher educational attainment increases healthier eating, using our genetic instruments we did not identify causal evidence that eating healthier (PC1) causes a decreased risk for CAD or T2D.
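The inverse-variance weighted (IVW) and weighted median (WM) estimators referred to in these analyses can be sketched from summary statistics as follows. This is a minimal illustration that assumes first-order weights (outcome standard errors only) and the standard interpolated weighted median; it is not the study's pipeline, the function names are hypothetical, and the input values in the usage lines are made up.

```python
def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect IVW estimate: weighted regression of SNP-outcome
    effects on SNP-exposure effects through the origin."""
    num = sum(bx * by / s ** 2 for bx, by, s in zip(beta_x, beta_y, se_y))
    den = sum(bx ** 2 / s ** 2 for bx, s in zip(beta_x, se_y))
    return num / den


def weighted_median_estimate(beta_x, beta_y, se_y):
    """Weighted median of per-SNP ratio estimates, using inverse-variance
    weights and linear interpolation at the 50% weight point."""
    n = len(beta_x)
    if n < 2:
        raise ValueError("need at least two SNPs")
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]
    weights = [(bx / s) ** 2 for bx, s in zip(beta_x, se_y)]
    order = sorted(range(n), key=lambda i: ratios[i])
    r = [ratios[i] for i in order]
    w = [weights[i] for i in order]
    total = sum(w)
    # Position of each sorted estimate on the cumulative weight scale.
    positions = []
    running = 0.0
    for wi in w:
        running += wi
        positions.append((running - 0.5 * wi) / total)
    below = max(i for i, p in enumerate(positions) if p < 0.5)
    frac = (0.5 - positions[below]) / (positions[below + 1] - positions[below])
    return r[below] + (r[below + 1] - r[below]) * frac


# Hypothetical instruments: three SNPs agree, one is a pleiotropic outlier.
bx = [1.0, 1.0, 1.0, 1.0]
by = [1.0, 1.0, 1.0, 10.0]
se = [1.0, 1.0, 1.0, 1.0]
ivw = ivw_estimate(bx, by, se)
wm = weighted_median_estimate(bx, by, se)
```

In the toy input the single outlying SNP pulls the IVW estimate away from the value supported by the other three instruments, while the weighted median stays with the majority, which is why the two are commonly reported side by side.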
In contrast to the associations with PC1, a single SNP, rs1453548, strongly influences "cups of tea per day" and has a plausible biological mechanism (it is located in an olfactory receptor-dense region and explains >96% of the observed phenotypic variance of β-ionone sensitivity 28 ). We therefore performed an MR analysis using rs1453548 on the complete set of traits in UKB using the Neale Lab GWAS (http://www.nealelab.is/uk-biobank/). Using a strict Bonferroni-corrected significance threshold (P<0.05/4358 traits = 1.15×10⁻⁵), we identified a significant causal effect of "cups of tea per day" on smoking status, for which the minor allele T (MAF=34%), which increases "cups of tea per day", decreases smoking status (Neale 20160:Ever smoked Wald ratio estimate= -0.51, 95% CI: -0.69 to -0.32, Table 9). However, rs1453548 is directly associated with "ever smoked" smoking status in the Neale Lab GWAS at genome-wide significance (P=4.329×10⁻⁸), the 51-SNP "cups of tea per day" genetic instrument excluding rs1453548 has no significant causal effect on smoking status (IVW P=0.20), and β-ionone is also found in tobacco, 29 suggesting that the effects of rs1453548 on odor perception of β-ionone may have pleiotropic effects on both smoking status and tea drinking. Together with a lack of additional significant causal relationships between rs1453548 and health outcomes in UKB, our results indicate that drinking more tea does not have clear effects on health outcomes in UKB, and it is possible that some previous reports on the health benefits of drinking more tea are a result of confounding with smoking status.
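The Bonferroni threshold and the single-SNP Wald ratio used above reduce to simple arithmetic, sketched here. The standard error uses the common first-order approximation (outcome standard error divided by the absolute exposure effect), which is an assumption about the method; the effect sizes in the usage lines are hypothetical and are not the study's summary statistics.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def wald_ratio(beta_exposure, beta_outcome, se_outcome, z=1.96):
    """Single-instrument MR estimate with an approximate 95% CI.

    The standard error ignores uncertainty in the SNP-exposure effect
    (first-order approximation; assumed here for simplicity).
    """
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se, (estimate - z * se, estimate + z * se)


# Threshold quoted in the text: 0.05 / 4358 traits.
threshold = bonferroni_threshold(0.05, 4358)

# Hypothetical per-allele effects on exposure and outcome.
estimate, se, ci = wald_ratio(beta_exposure=0.10, beta_outcome=-0.05,
                              se_outcome=0.01)
```

Because the Wald ratio rests on one variant, any direct effect of that variant on the outcome is indistinguishable from a causal effect of the exposure, which is the concern raised in the text for rs1453548 and smoking.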

Discussion
Understanding the genetic architecture of dietary habits has immense implications for human health, but has been a difficult task, in part due to the low heritability of many dietary traits. The recent advent of large-scale datasets such as UKB, with deep phenotyping on hundreds of thousands of individuals, has now made genetic discovery of traits with relatively low heritability possible. Expansion of phenotyping to include both curated FI-QTs and PC-DPs together with GWAS in nearly 450K individuals allowed our study to make hundreds of new genetic discoveries relating to diet. Our work advances the elucidation of the genetic architecture of multiple correlated dietary habits and helps lay the groundwork for future research on nutrigenomic and other complex multifactorial multivariable datasets.
Our work emphasizes the importance of interrogating the genetics of complementary phenotypes to glean a more complete picture of the genetic architecture of diet. One of the strongest associations we observed was between SNP rs1229984 and the FI-QT "total drinks of alcohol per month" (P=3.8×10⁻²⁴⁸, Supplementary Figure 6). This SNP has been previously associated with alcohol consumption; 11,30-33 consistent with a recent meta-analysis, 33 our use of a curated and quantitative FI-QT improved power compared with the more categorical "overall alcohol intake" phenotype and individual alcohol subtypes (e.g. "red wine glasses per month"; Supplementary Figures 6 and 9). This increase in power is consistent with the high genetic correlation but low phenotypic correlation between individual questions related to alcohol, indicating that a composite alcohol question is more suitable for genetic discovery (Supplementary Figure 10). Furthermore, using PC1 as an example, we demonstrate that the genetic architectures of FI-QTs and PC-DPs are distinct, with hundreds of genetic associations more strongly associated with either PC1 or with its contributing FI-QTs. Overall, by using complementary phenotyping approaches, we identified 814 independent genetic associations, of which 205 were completely novel, 311 were uniquely associated with curated FI-QTs, and 136 were uniquely associated with PC-DPs.
As an initial exploration of the implications of genetically-influenced composite dietary patterns, we focused on the strong genetic overlap between the PC1 dietary pattern and phenotypes related to educational attainment. 34 While bidirectional MR demonstrates some pleiotropic effects between educational attainment and PC1, the relative strengths of these causal estimates suggest that higher educational attainment and/or correlated phenotypes (such as socioeconomic status or factors related to school performance) shift eating habits towards a healthier, more prudent diet. While previous observational studies have shown that Western and prudent dietary patterns are associated with CAD and T2D, 12,22 our MR analysis of PC1 on CAD or T2D did not demonstrate a causal effect from diet to disease, but rather a small suggestion of a reverse causal relationship between CAD and diet (CAD diagnosis leads to a more "prudent" dietary pattern).
The conclusion that a "Western" dietary pattern does not appear to be a causal risk factor for disease must be viewed in the context of several potential limitations of our study.
Genetic instruments derived from genome-wide significant variants tend to explain a small fraction of phenotypic variance, which can lead to lack of power to detect potentially true causal effects of diet on outcomes, although this is mitigated by the large sample size of the UKB cohort. Additionally, while the use of overlapping samples in MR could in theory lead to inflated causal estimates, 35 UKB's large sample size provides robust genetic instruments, and the strength of our causal associations, combined with our validation analysis with an independently ascertained set of instruments for educational attainment, suggest that our results are likely not influenced by weak instrument bias. Furthermore, although we did not detect evidence of pleiotropy, it remains possible that pleiotropic effects of some of the variants associated with PC1 masked a causal effect on cardiometabolic disease risk. Finally, the aspects of a prudent dietary pattern reflected by PC1 (predominantly driven by wholemeal/wholegrain vs. white bread consumption) may not capture the causal protective features of a prudent dietary pattern.
However, it remains possible that there is a stronger correlative than causal relationship between the Western dietary pattern and increased risk of cardiometabolic disease.
We also find several interesting associations between specific FI-QTs (fruit, tea, coffee, vegetables, cheese, and butter) and olfactory receptors. The chr11p15 locus controlling odor perception of β-ionone, 28 described as smelling of cedar wood but, upon dilution (e.g. in tea), as having a more floral aroma, 36 has pleiotropic effects that both reduce the chances of ever smoking and increase tea intake. While SNPs at chr11p15 have already been shown to be associated with food choice with and without added β-ionone, 28 we highlight here for the first time a link between β-ionone odor perception and smoking status, with potentially significant implications for smoking-related health problems. This result also highlights the importance of understanding the pleiotropic consequences of variants used as genetic instruments in Mendelian randomization.
Of note, our dietary habits derived from a shortened FFQ were not adjusted for total energy intake, a measure highly correlated with physical activity and body weight; 37 as such, our dietary habits represent potentially non-isocaloric variations in dietary intake. However, we found minimal phenotypic correlation between any of our FFQ-derived phenotypes and 24-hour recall questionnaire-derived total energy intake (maximum correlation r = 0.037). Furthermore, none of our 814 lead SNPs were significant in the Neale Lab total energy intake GWAS after Bonferroni correction (P>0.05/814). Nine of our phenotypes, including "slices of bread per week", "overall cheese intake", and "glasses of water per day", did show significant genetic correlations with total energy intake, suggesting that the genetic architecture of these traits could be shared with traits that reflect more global lifestyle and dietary patterns (Supplementary Table 5

UK Biobank Phenotype Derivation
All phenotype derivation and genomic analysis was conducted on a homogeneous population of individuals of European (EUR) ancestry (N=455,146), as determined by: 1) projection onto 1KGP phase 3 PCA space; 2) outlier detection to identify the largest cluster of individuals using the Aberrant R package 42 , selecting the lambda at which all clustered individuals fell within 1KGP EUR PC1 and PC2 limits (lambda=4.5); and 3) removal of individuals who did not self-report as "British", "Irish", "Any other white background", "White", "Do not know", or "Prefer not to answer", as self-identified non-EUR ancestry could confound dietary habits.

Heritability, GWAS, and Genetic Correlation Analyses
Measures of heritability were obtained from BOLT-LMM software ( [...] , we set all non-significant r_g to 0. Supplementary Table 5 represents a complete pair-wise r_g matrix. Enrichment for olfactory receptor genes among dietary habits was evaluated using 1,000 sets of null SNPs matched on minor allele frequency, number of SNPs in LD at various LD thresholds, distance to nearest gene, and gene density using the SNPsnap webtool. 55 We used Fisher's test for enrichment of olfactory receptor genes using SNPsnap's nearest gene annotation for 647 dietary habit GWAS index SNPs and for 1,000 sets of null matched SNPs (167 of our index SNPs were excluded because they were in the HLA region or had insufficient matches).
We based the enrichment analysis contingency table on 842 annotated olfactory receptor genes among 48,903 genes in SNPsnap. Of the 1,000 null enrichment analyses, 419 had a Fisher's test estimate equal to or greater than our real data's enrichment estimate.
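The comparison of the observed enrichment with the 1,000 matched null sets amounts to an empirical P-value, sketched below. The 2×2 table layout (index SNPs versus the SNPsnap gene background) and the use of a plain sample odds ratio in place of the conditional estimate reported by a Fisher's test are assumptions made for illustration; the count of 20 olfactory receptor hits is hypothetical, as are the function names.

```python
def odds_ratio(hits, n_snps, background_hits, background_total):
    """Sample odds ratio for olfactory receptor (OR) genes among index SNPs
    versus the gene background (table layout is an assumption)."""
    a = hits                                  # index SNPs nearest an OR gene
    b = n_snps - hits                         # index SNPs nearest other genes
    c = background_hits                       # OR genes in the background
    d = background_total - background_hits    # non-OR genes in the background
    return (a * d) / (b * c)


def empirical_p(observed, null_estimates):
    """Fraction of null sets with an estimate at least as large as observed."""
    as_extreme = sum(1 for v in null_estimates if v >= observed)
    return as_extreme / len(null_estimates)


# Hypothetical observed table: 20 of 647 index SNPs nearest an OR gene,
# against 842 OR genes among 48,903 genes (background counts from the text).
observed_or = odds_ratio(20, 647, 842, 48903)

# Counts from the text: 419 of 1,000 null sets matched or exceeded the
# observed estimate. The null values themselves are placeholders.
null_estimates = [observed_or] * 419 + [0.0] * 581
p_empirical = empirical_p(observed_or, null_estimates)
```

With the counts given in the text, 419 of 1,000 null sets reaching the observed estimate corresponds to an empirical P-value of 0.419.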

Mendelian Randomization
Bidirectional Mendelian Randomization was conducted using genome-wide significant index SNPs clumped by 500kb windows from GWAS in UKB on PC1 (M=140), fluid intelligence scores (M=184), educational attainment (M=309), and BMI (M=1165). Fluid intelligence scores for GWAS in EUR (N=232,601) were derived from both in-person and online cognitive tests.
Assessment center fluid intelligence scores (field 20016) were averaged for up to 3 visits and adjusted for average age in months, sex, and assessment center. Online fluid intelligence scores (field 20191) were adjusted for age in months, sex, and Townsend deprivation index. The final fluid intelligence score was first set to the average assessment center fluid intelligence score and, when missing, was filled in with the online fluid intelligence score; the combined scores were then adjusted for collection method, followed by inverse normal transformation. Educational attainment for GWAS in EUR (N=450,884) was derived using a previously published method based on mapping UKB qualifications field 6138 to US years of schooling, 56 followed by adjustment for age in months, sex, and assessment center, and inverse normal transformation. BMI was calculated from weight (field 21002) and standing height (field 50) and averaged from up to 3 assessment center visits. Average BMI was adjusted for average age in months, average age in months squared, assessment center, and average measurement year, followed by inverse normal transformation conducted in males and females separately.
The combined male and female BMI Z-scores were then used together for genetic association testing. All GWAS were run in BOLT-LMM adjusted for 10 genetic PCs (calculation described above) and genotyping array.
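The rank-based inverse normal transformation applied to these phenotypes can be sketched as follows. The text does not state the rank offset or tie handling, so the default offset of 0.5 and the use of average ranks for ties are assumptions; the function names and input values are hypothetical.

```python
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def inverse_normal_transform(values, offset=0.5):
    """Map ranks to standard normal quantiles.

    offset=0.5 gives (rank - 0.5) / n; offset=3/8 gives the Blom variant.
    The offset used in the study is not stated (assumption).
    """
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1))
            for r in average_ranks(values)]


# Hypothetical covariate-adjusted residuals, transformed within each sex
# separately and then combined, as described for BMI.
male_resid = [3.0, 1.0, 2.0]
female_resid = [0.4, 0.4, 1.7]
male_z = inverse_normal_transform(male_resid)
female_z = inverse_normal_transform(female_resid)
combined_z = male_z + female_z
```

Transforming within each sex before pooling means both groups contribute Z-scores on the same scale, which matches the order of operations described for BMI.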
Genetic instruments for each of the three traits consisted of the complete set of index [...]
The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203-209, doi:10.1038/s41586-018-0579-z (2018).
41 The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68, doi:10.1038/nature15393 (201