Genomic analysis of diet composition finds novel loci and associations with health and lifestyle

We conducted genome-wide association studies (GWAS) of relative intake from the macronutrients fat, protein, carbohydrates, and sugar in over 235,000 individuals of European ancestries. We identified 21 unique, approximately independent lead SNPs. Fourteen lead SNPs are uniquely associated with one macronutrient at genome-wide significance (P < 5 × 10⁻⁸), while five of the 21 lead SNPs reach suggestive significance (P < 1 × 10⁻⁵) for at least one other macronutrient. While the phenotypes are genetically correlated, each phenotype carries a partially unique genetic architecture. Relative protein intake exhibits the strongest relationships with poor health, including positive genetic associations with obesity, type 2 diabetes, and heart disease (rg ≈ 0.15–0.5). In contrast, relative carbohydrate and sugar intake have negative genetic correlations with waist circumference, waist-hip ratio, and neighborhood deprivation (|rg| ≈ 0.1–0.3) and positive genetic correlations with physical activity (rg ≈ 0.1 and 0.2). Relative fat intake has no consistent pattern of genetic correlations with poor health but has a negative genetic correlation with educational attainment (rg ≈ −0.1). Although our analyses do not allow us to draw causal conclusions, we find no evidence of negative health consequences associated with relative carbohydrate, sugar, or fat intake. However, our results are consistent with the hypothesis that relative protein intake plays a role in the etiology of metabolic dysfunction.


Introduction
Understanding the relationships between nutrition, lifestyle, and health is among the highest priorities for public health [1]. Many aspects of dietary intake have been studied, but the health impacts of macronutrient composition (i.e. relative intake from fat, protein, and carbohydrate) have been especially controversial in the last few decades [2][3][4]. Despite a lack of robust empirical evidence from randomized trials on the long-term effects of macronutrient restriction on body weight and health [5][6][7], dietary recommendations have shifted from low-fat to low-sugar and, more recently, lower animal-protein diets [8][9][10][11][12][13]. Connections between diet and mental health are also increasingly recognized [14]. Genetic correlation analysis allows links between mental health and dietary intake to be estimated without the need to observe psychiatric measures and macronutrient intake in the same samples. These links can then corroborate existing ideas or fuel new hypotheses about the relationships between diet and mental health.
Previous work has found that diet composition is heritable (range h² = 27–70%) [15][16][17] and may share genetic components with health and lifestyle [18]. The largest GWAS on relative intake from protein, fat, and carbohydrates (up to N = 91,114) to date has identified three robustly associated SNPs in or near RARB, FTO, and FGF21, each of which captures only a minuscule part of trait heritability (R² < 0.06%) [19][20][21]. These results suggest that diet composition is a genetically complex phenotype that requires large GWAS sample sizes for robust genetic discovery. However, proper measurement of nutrient intake requires a long and detailed questionnaire [22]. Therefore, relatively few large genotyped cohorts have collected this information, which restricts available GWAS sample sizes.
Here, we perform the largest dietary intake GWAS to date, using the vast majority of currently available European-ancestry genotyped diet data. Power calculations determined a minimum required sample size of N = 141,000 (Supplementary Information 1.3). We nearly triple the GWAS sample size compared to earlier work [21] to N = 268,922 for relative intake of protein, carbohydrate, and fat, increasing the number of robustly associated independent loci from three to 18. Furthermore, we report the first GWAS results for relative sugar intake (N = 235,391), which is a subcomponent of our carbohydrate phenotype and captures relative intake of both naturally occurring and added sugars. The sugar GWAS identifies three additional, unique lead loci. In our largest dataset, the UK Biobank (N = 175,253) [23], we also report an auxiliary GWAS for saturated fat intake, a subcomponent of our fat phenotype, which we only use for genetic correlation analyses. We also report phenotypic associations between BMI and macronutrient subtypes (plant vs. animal protein; saturated vs. unsaturated fat; natural vs. added sugars).
Biological annotation of our GWAS results indicates that the brain is the main driver of diet composition's genetic signal. Furthermore, we find robust genetic and phenotypic associations between relative protein intake and poor health, but no clear pattern of associations of the other macronutrients with health. Finally, we probe the robustness of these results to possible confounds due to socioeconomic status and physical activity.

Methods
This article is accompanied by a Supplementary Information, which describes further methodological details.
Phenotype definitions, GWAS, quality control, and meta-analysis
We performed GWAS in European-ancestry individuals for four dietary composition phenotypes: relative intake of fat, protein, carbohydrate, and sugar. As an auxiliary analysis, we performed GWAS for relative intake of saturated fat in the UKB. Discovery analyses were performed in UKB, while replication analyses were conducted in cohorts from the Netherlands (Lifelines, RSI/II/III), UK (ALSPAC, Fenland), USA (FHS, HRS, GARNET, HIPFX, WHIMS+), and the international consortia EPIC-InterAct and DietGen (Supplementary Information 1). With the exception of DietGen, all cohorts corrected macronutrient intakes for total energy intake allowing for nonlinear effects, and GWAS was performed according to a prespecified analysis plan (Supplementary Information 2.6).
Cohort-level quality control (QC) was performed in accordance with protocols developed by the GIANT consortium [24] and the Social Science Genetic Association Consortium (SSGAC, Supplementary Information 3.3). Filters for participants and SNPs varied by cohort and cohort sample size. SNP effects were summarized across cohorts using fixed-effects sample-size-weighted meta-analyses based on Z-statistics. For the family cohorts UKB and FHS, we used the median effective sample size as a weight (Supplementary Information 3.4).
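As an illustration of the sample-size-weighted combination of Z-statistics described above, the following minimal sketch uses the common square-root-of-N weighting scheme. The function names and the example values are hypothetical, and the sketch does not reproduce the effective-sample-size weights used for the family cohorts.

```python
import math


def meta_analyze_z(z_scores, sample_sizes):
    """Combine per-cohort Z-statistics using sqrt(N) weights.

    Assumption: weights are sqrt(N_i), the usual choice for
    sample-size-weighted fixed-effects meta-analysis.
    """
    if len(z_scores) != len(sample_sizes) or not z_scores:
        raise ValueError("need exactly one sample size per Z-statistic")
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


def two_sided_p(z):
    """Two-sided P-value of a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical SNP with Z-statistics from three cohorts of different sizes.
z_meta = meta_analyze_z([2.1, 1.4, 0.9], [175_000, 40_000, 20_000])
p_meta = two_sided_p(z_meta)
```

Because the weights grow only with the square root of N, a large cohort dominates the combined statistic less than it would under weights proportional to N.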

Replication
We assessed the credibility of individual SNPs from our discovery GWAS by replicating the associations of its lead SNPs in our replication GWAS (Supplementary Information 4, Supplementary Information 4.1). Our replication analyses closely followed the procedure outlined in Supplementary Information section 1.8 of Okbay et al. [25]. We conducted one-sided binomial tests for both the sign concordance of the lead SNPs and the number of lead SNPs from our discovery GWAS that are significant at the P < 0.05 threshold (both with and without Bonferroni correction) in the replication GWAS. In addition to conducting binomial tests, we simulated the expected rate of replication given the discovery GWAS results, the discovery sample size, and the replication sample size, and we assessed whether these expected rates matched the observed replication rates. We used bivariate LD Score regression to examine the comparability between the summary statistics from our discovery cohort, the 14 replication cohorts, and DietGen (Supplementary Table 4.2). We also report the replication record of a rare variant in DRAM1 discovered by Merino et al. [21] (Supplementary Table 4.3).
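The one-sided binomial tests can be sketched as follows, using the counts reported in the main results (21 of 21 lead SNPs with concordant signs, 15 of 21 significant at P < 0.05). The null probabilities of 0.5 and 0.05 are assumptions made for this illustration, and the helper name is hypothetical; the exact procedure is described in the Supplementary Information.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Sign concordance: under the null, each replication sign matches the
# discovery sign with probability 0.5 (assumed).
p_sign = binomial_upper_tail(21, 21, 0.5)

# Nominal significance: under the null, each lead SNP reaches P < 0.05
# in the replication sample with probability 0.05 (assumed).
p_significant = binomial_upper_tail(15, 21, 0.05)
```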

Population stratification
LD Score regression was used to estimate inflation of the GWAS results due to population stratification (Supplementary Table 5.1). We adjusted the reported standard errors and P-values of meta-analyzed SNPs for bias due to population stratification by inflating the standard errors by the square root of the LD Score regression intercept and recomputing the P-values accordingly. To identify approximately independent lead SNPs, we applied the clumping algorithm in PLINK (parameters r² > 0.1, P-value < 5 × 10⁻⁸). Supplementary Tables 5.2 and 5.4 report the lead SNPs and the overlapping loci between phenotypes.
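The idea behind clumping can be sketched with a simplified greedy procedure. This is not PLINK's implementation: it ignores the physical-distance window and PLINK's secondary significance threshold, and the function name, the toy P-values, and the toy r² matrix are hypothetical.

```python
def clump(p_values, r2, p_threshold=5e-8, r2_threshold=0.1):
    """Greedy clumping sketch: return indices of approximately independent lead SNPs.

    p_values: list of association P-values, one per SNP.
    r2: square matrix (list of lists) of pairwise LD r² values.
    """
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    assigned = set()
    leads = []
    for i in order:
        if p_values[i] >= p_threshold:
            break  # remaining SNPs are not genome-wide significant
        if i in assigned:
            continue  # already absorbed into the clump of a stronger SNP
        leads.append(i)
        assigned.add(i)
        for j in range(len(p_values)):
            if j not in assigned and r2[i][j] > r2_threshold:
                assigned.add(j)
    return leads


# Toy example: SNPs 0 and 1 are in strong LD; SNP 2 is independent; SNP 3 is not significant.
toy_p = [1e-12, 3e-9, 2e-10, 0.2]
toy_r2 = [
    [1.0, 0.8, 0.0, 0.05],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.05],
    [0.05, 0.0, 0.05, 1.0],
]
lead_snps = clump(toy_p, toy_r2)
```

In the toy example, SNP 1 is absorbed into the clump of SNP 0, so only SNPs 0 and 2 are retained as lead SNPs.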

Sensitivity analyses
We performed sensitivity analyses for the two SNPs that reached genome-wide significance in APOE (rs429358) and ADH1B (rs1229984, Supplementary Table 5.5). For the APOE SNP, we assessed its effect size and confidence interval in a subsample of the UKB aged below 60 years. For the ADH1B SNP, we assessed its effect size and confidence interval in a subsample of the UKB who report to be non-drinkers. We tested whether the confidence intervals of the effect sizes overlap with the confidence intervals in the meta-analyzed sample.

Biological annotation
All bioinformatics analyses used the results of the combined meta-analysis (Supplementary Information 6). To annotate the top GWAS findings, we performed MAGMA [26] gene-based analysis to test 18,224 genes for association with diet composition (Bonferroni-corrected P-value threshold = 0.05/18,224). To gain preliminary insights into the likely functions of the significant MAGMA genes, we queried them in Gene Network. To gain insights into probable functional genomic categories and tissues, we estimated stratified LD Score regressions for the 52 functional genomic regions of the "baseline model", the 10 broad tissue-level annotations from Finucane et al. [27], and the 53 fine tissue-level annotations from GTEx, with Bonferroni-corrected P-value thresholds = 0.05/52, 0.05/10, and 0.05/53, respectively (Supplementary Tables 6.1-6.3). To annotate the lead GWAS SNPs, we queried whether they (or SNPs in LD with them) are associated with gene expression in relevant GTEx tissues, or in LD with protein-altering SNPs (Supplementary Tables 6.3-6.6).

Estimation of genome-wide SNP heritability
We used GCTA-GREML [28] and LD Score regression [29] to estimate the SNP-based heritability of diet composition (Supplementary Information 8, Supplementary Table 8.1). We restricted the GCTA analysis to genotyped SNPs with MAF > 0.01 and a random subset of 30,000 UKB individuals, and thereafter dropped one individual from each pair with a cryptic relatedness exceeding 0.025, resulting in N = 28,635. For the LD Score regression analysis, we used the full meta-analysis results, HapMap3 SNPs with MAF > 0.01, and LD estimates from the 1000 Genomes project provided by Finucane et al. [27].

Genetic correlations between macronutrients
We used bivariate LD Score regression to estimate the genetic correlations between macronutrients (Supplementary Information 7).

Polygenic prediction
We assessed the accuracy of polygenic scores of diet composition in the HRS and RSI validation cohorts and used LDpred [30] to construct polygenic scores assuming an underlying infinitesimal model (Supplementary Information 9, Supplementary Table 9.1). Since these cohorts are included in the full meta-analysis, we conducted a new meta-analysis that excludes the holdout cohort to obtain the SNP weights. Analyses are restricted to HapMap3 SNPs with MAF > 0.05, and LD scores were calculated on the basis of the holdout cohort. Our measure of a score's predictive power is the incremental adjusted R 2 from adding the score to a regression of the phenotype on the covariates sex, birth-year, birth-year squared, and cubed, as well as the interactions between sex and the three birth-year variables, and the first ten principal components of the genetic relatedness matrix. We bootstrapped 95% percentile confidence intervals for the incremental R 2 estimates with 1000 iterations.
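The predictive-power measure described above (incremental adjusted R² with a bootstrapped 95% percentile interval) can be sketched as follows. This is a minimal illustration on simulated data, assuming NumPy is available; the function names, the two placeholder covariates, and the simulated effect sizes are hypothetical and do not correspond to the covariate set or results of the study.

```python
import numpy as np


def adjusted_r2(y, X):
    """Adjusted R² of an OLS regression of y on X (an intercept is added here)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    k = design.shape[1] - 1  # number of regressors excluding the intercept
    return 1.0 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))


def incremental_r2(y, covariates, score):
    """Gain in adjusted R² from adding the polygenic score to the covariates."""
    full = np.column_stack([covariates, score])
    return adjusted_r2(y, full) - adjusted_r2(y, covariates)


def bootstrap_ci(y, covariates, score, n_boot=1000, seed=0):
    """95% percentile bootstrap interval for the incremental adjusted R²."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample individuals with replacement
        stats[b] = incremental_r2(y[idx], covariates[idx], score[idx])
    return np.percentile(stats, [2.5, 97.5])


# Simulated data (hypothetical): two placeholder covariates and one score.
rng = np.random.default_rng(42)
n = 1000
covariates = rng.normal(size=(n, 2))
score = rng.normal(size=n)
y = 0.3 * score + 0.2 * covariates[:, 0] + rng.normal(size=n)

inc = incremental_r2(y, covariates, score)
ci_low, ci_high = bootstrap_ci(y, covariates, score)
```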

Genetic correlations
We used bivariate LD Score regression to estimate genetic correlations between diet composition and various health and behavioral phenotypes (Supplementary Information 10, Supplementary Tables 10.2-10.4). We used the 1000 Genomes LD scores computed by Finucane et al. [27] and restricted analyses to HapMap3 SNPs with MAF > 0.01.

Phenotypic associations
We examined the phenotypic associations between relative macronutrient intake and BMI in four large, independent cohorts from the UK and US (UKB, HRS, FHS, and WHI, with combined N = 173,165; Supplementary Information 11, Supplementary Table 11.1). In the HRS, FHS, and WHI, we were also able to distinguish animal vs. plant protein, natural vs. added sugars, and saturated vs. unsaturated fat. In the UKB, only the distinction between saturated vs. unsaturated fat was available (Supplementary Table 11.2). We estimated the standardized regression coefficients obtained from a multiple regression of BMI on the focal macronutrient, sex, age, educational attainment, household income (available for all cohorts except FHS), and the number of dietary measurements. In the UKB, we performed an additional regression that included a measure of physical activity. We restricted the samples to individuals also included in the GWAS. We used Fisher's Z-transformation to perform fixed-effects, inverse-variance weighted meta-analysis of the standardized regression coefficients. Fisher's Z-transformation was also used to obtain 95% confidence intervals.
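The pooling step described above can be illustrated with a short sketch of a fixed-effects, inverse-variance-weighted meta-analysis on the Fisher Z scale. It assumes that a standardized coefficient can be treated like a correlation with sampling variance of approximately 1/(n − 3) after transformation; the function name and the coefficient values in the example are hypothetical.

```python
import math


def fisher_meta(estimates, sample_sizes, z_crit=1.959964):
    """Pool standardized coefficients on the Fisher Z scale.

    Returns (pooled estimate, lower 95% bound, upper 95% bound),
    all back-transformed to the original scale.
    """
    zs = [math.atanh(r) for r in estimates]
    weights = [n - 3 for n in sample_sizes]  # inverse of Var(z) ≈ 1 / (n − 3)
    total = sum(weights)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / total
    se = 1.0 / math.sqrt(total)
    lower, upper = z_bar - z_crit * se, z_bar + z_crit * se
    return math.tanh(z_bar), math.tanh(lower), math.tanh(upper)


# Hypothetical coefficients for four cohorts of unequal size.
pooled, ci_lower, ci_upper = fisher_meta(
    [0.08, 0.05, 0.10, 0.07], [4_413, 2_394, 158_046, 8_628]
)
```

Because the weights are proportional to n − 3, the pooled estimate is dominated by the largest cohort.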

Phenotype definition
All cohorts used self-report questionnaires containing ≥70 food items. Average intakes were highly similar across cohorts (Supplementary Table 1.2). Using these self-reports, we calculated the relative contributions of fat, protein, carbohydrate, and sugar to total energy intake. When possible, we excluded individuals on calorie- or macronutrient-restricted diets (see Supplementary Table 1.3 for all exclusion criteria).
We do not study total energy intake because it is mainly determined by body size and physical activity [31], and because systematic underreporting of total energy intake is correlated with BMI [32]. We caution that selective underreporting of macronutrients could be problematic for the common approach we adopt of studying relative intake, but there is mixed evidence for this, and its consequences are poorly understood (Supplementary Information 2.4).
Since macronutrient intake may not scale linearly with total energy intake, we developed and applied a method that adjusts for observed non-linear relationships (Supplementary Information 2.6-2.8, Extended Data Fig. 1). Consistent with the satiating properties of protein [33], we find that relative protein intake declines at higher levels of total energy intake, while relative fat intake increases, and relative sugar and carbohydrate intake remain roughly constant (Supplementary Table 2.3).
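To make the notion of relative intake concrete, the following sketch converts gram intakes into shares of total energy using the general Atwater factors (9 kcal/g for fat, 4 kcal/g for protein and carbohydrate, 7 kcal/g for alcohol). The use of these factors, the function name, and the example intakes are assumptions for illustration; the sketch does not implement the non-linear energy adjustment described in the Supplementary Information.

```python
# General Atwater factors in kcal per gram (assumed for this illustration).
KCAL_PER_GRAM = {"fat": 9.0, "protein": 4.0, "carbohydrate": 4.0, "alcohol": 7.0}


def energy_shares(grams):
    """Return each nutrient's share of total energy intake.

    grams: dict with gram intakes for fat, protein, carbohydrate, alcohol,
    and optionally sugar. Sugar is a subcomponent of carbohydrate, so it is
    reported as a share but not added to the total a second time.
    """
    energy = {k: grams.get(k, 0.0) * f for k, f in KCAL_PER_GRAM.items()}
    total = sum(energy.values())
    if total <= 0:
        raise ValueError("total energy intake must be positive")
    shares = {k: v / total for k, v in energy.items()}
    if "sugar" in grams:
        shares["sugar"] = grams["sugar"] * KCAL_PER_GRAM["carbohydrate"] / total
    return shares


# Hypothetical daily intake in grams.
shares = energy_shares(
    {"fat": 80, "protein": 90, "carbohydrate": 250, "sugar": 100, "alcohol": 10}
)
```

Including alcohol in the denominator means that higher alcohol intake lowers every macronutrient share, all else equal.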

Main results
GWAS were performed in individuals of European ancestries from over 14 population cohorts. Informed consent was obtained by the cohorts for all participants included in the analyses. Association statistics underwent rigorous quality control according to SSGAC guidelines [25,34,35], which included sample-size-dependent quality-control filters, exclusion of SNPs with too small standard errors or too large explained phenotypic variance, and visualizations of summary statistics and allele frequencies (Supplementary Information 3.3, Supplementary Tables 3.1-3.5). Our discovery sample is the subset of the UKB with survey data on dietary intake (N = 175,253). The replication phase consists of a meta-analysis of GWAS summary statistics from 14 additional cohorts that followed our analysis plan (N = 60,138) together with summary statistics from DietGen [20] (for fat, protein, and carbohydrate, N = 33,531, flowchart in Extended Data Fig. 2). DietGen [20] assumed a linear scaling of macronutrients with total energy intake. Nonetheless, we included DietGen in our meta-analysis because the genetic correlations between DietGen and our other replication cohorts are not significantly different from one at P < 0.05 (Supplementary Table 4.1). The discovery stage identified 21 approximately independent genome-wide-significant lead SNPs (see Supplementary Information 3.3.5 for a description of the clumping algorithm): 4 for fat, 5 for protein, 5 for sugar, and 7 for carbohydrate (Supplementary Table 4.2). These lead SNPs partially overlap across phenotypes and reside in 14 unique loci. In the replication stage, all 21 lead SNPs had the anticipated signs and comparable effect sizes (Extended Data Fig. 3), and 15 reach statistical significance at P < 0.05 (Supplementary Table 4.2). This empirical replication record matches or exceeds theoretical predictions that take into account the statistical winner's curse, sampling variation, and statistical power [25] (Supplementary Table 4.1). 
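The sample sizes reported above can be tallied in a few lines. The variable names are ours; the only assumption is that the sugar phenotype excludes DietGen, which contributed summary statistics for fat, protein, and carbohydrate only.

```python
# Sample sizes as reported in the text.
discovery_ukb = 175_253
replication_cohorts = 60_138  # 14 additional cohorts following the analysis plan
dietgen = 33_531  # fat, protein, and carbohydrate only

# Sugar: discovery plus the 14 replication cohorts (no DietGen data).
n_sugar = discovery_ukb + replication_cohorts

# Fat, protein, and carbohydrate: additionally include DietGen.
n_fat_protein_carbohydrate = n_sugar + dietgen
```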
In our data, the association between DRAM1 and dietary intake reported by Merino et al. [21] does not replicate, with a discordant effect size compared to Merino et al. In order to maximize statistical power, all follow-up analyses below are based on the combined discovery and replication samples (N = 235,391-268,922, Supplementary Information 5). The quantile-quantile plots exhibit substantial inflation (λGC = 1.12-1.19, Extended Data Fig. 4). The estimated intercepts from LD Score (LDSC) regressions [29] suggest that the vast majority of this inflation is due to polygenic signal, and only a small share is attributable to population stratification (the maximum estimate, 6%, is for fat and is not statistically distinguishable from 0% at P < 0.05; Supplementary Fig. 1). These 36 reside in 21 unique loci (Supplementary Table 5.4). Fourteen lead SNPs are uniquely associated with one macronutrient at genome-wide significance (P < 5 × 10⁻⁸), while five of these reach suggestive significance (P < 1 × 10⁻⁵) for at least one other macronutrient. The SNP effect sizes range from 0.015 to 0.098 phenotypic standard deviations per allele. The phenotypic variance explained per SNP, expressed in terms of the coefficient of determination (R²), ranges from 0.011% to 0.054%, the same order of magnitude as the R² values of the most strongly associated lead SNPs for other genetically complex traits such as BMI and educational attainment (Extended Data Fig. 5). MAGMA [26] gene-based analyses of our GWAS summary statistics identify 81 unique genes (Extended Data Fig. 6 and Supplementary Table 5.5). While the majority of these genes are near our lead SNPs, MAGMA also identifies 33 genomic regions harboring 44 unique genes that are physically distant (>1 Mb) from our lead SNPs.
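The relationship between a per-allele effect size and the variance explained by a single SNP can be sketched with the standard approximation R² ≈ 2 · MAF · (1 − MAF) · β², which assumes an additive model, Hardy-Weinberg equilibrium, and a phenotype standardized to unit variance. The function name and the allele frequency in the example are hypothetical and not taken from the study.

```python
def snp_r2(beta_sd_per_allele, maf):
    """Approximate share of phenotypic variance explained by one SNP.

    Assumes an additive effect on a unit-variance phenotype and
    Hardy-Weinberg equilibrium, so that Var(genotype) = 2 * maf * (1 - maf).
    """
    if not 0.0 < maf < 1.0:
        raise ValueError("maf must lie strictly between 0 and 1")
    return 2.0 * maf * (1.0 - maf) * beta_sd_per_allele**2


# Hypothetical SNP: effect of 0.015 SD per allele at an allele frequency of 0.4.
r2_example = snp_r2(0.015, 0.4)
r2_percent = 100.0 * r2_example
```

The approximation shows why a rare variant needs a much larger per-allele effect than a common variant to explain the same share of variance.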

Discussion of lead SNPs from combined meta-analysis
Seven of the 21 lead SNPs had not been (directly or via LD partners, r² ≥ 0.6 and distance <250 kb) associated with any other traits in the NHGRI-EBI GWAS Catalog at the time of query (September 19, 2017) [36] (Supplementary Table 5.6). Each of these seven SNPs is located in or near genes that have not been studied in depth to date.
Five lead SNPs are located in or near genes that have well characterized biological functions in nutrient metabolism or homeostasis but have not previously been associated with dietary intake. First, we find that a missense variant in APOE (rs429358) is associated with fat, sugar, and carbohydrate, where the allele that decreases Alzheimer's risk is associated with greater relative fat and lower relative sugar and carbohydrate intakes. In addition to its strong association with Alzheimer's disease [37], APOE is known to be involved in fatty acid metabolism. We explored whether the associations in our data may be driven by sample selection. Specifically, older people with dementia may be systematically missing from the UKB, and unaffected elderly people may have different eating habits than younger people. To test for this possibility, we examined the subsample of UKB participants aged below 60, where such sample selection should be largely absent. We find that the association is indeed smaller in this subsample, but the 95% confidence interval of the effect size overlaps with that of the effect size in the subsample of UKB participants aged 60 and older (Supplementary Table 5.7).
Second, a well-known missense variant (rs1229984 in ADH1B) that limits alcohol metabolism is positively associated with fat intake. The association is weaker in a sample of UKB alcohol abstainers (N = 39,679; Supplementary Table 5.7), suggesting that it may be partially driven by substitution of fat for alcohol.
Third, one of the protein lead SNPs (rs13146907) is in KLB, which codes an essential cofactor to FGF21 [38,39], which influences sweet and alcohol taste preference via the liver-brain-endocrine axis [40][41][42]. KLB is only associated with protein in our GWAS and MAGMA analyses, while FGF21 is strongly associated with all four macronutrients in both the GWAS and MAGMA analyses. With MAGMA, we also identified MLXIPL (only for fat), a gene that codes a transcription factor to FGF21 [43]. This combination of findings suggests that different genes involved in the same pathway are important for directing intake of different macronutrients.
Fig. 1 Manhattan plots. The x-axis is SNP chromosomal position; the y-axis is the SNP P-value on a −log10 scale; the horizontal dashed lines mark the thresholds for genome-wide (P = 5 × 10⁻⁸) and suggestive (P = 1 × 10⁻⁵) significance; and each approximately independent (pairwise r² < 0.1) genome-wide significant association ("lead SNP") is marked by a red cross.
Fourth, an intergenic variant (rs2472297) that has been linked to higher caffeine consumption [44,45] is associated with lower carbohydrate intake. There are various possible explanations, such as interrelated lifestyle choices pertaining to food and caffeinated drinks.
Fifth, an intronic variant in GCKR (rs780094), a carbohydrate-metabolism gene, is associated with protein. The lead SNP is in almost perfect LD (r² = 0.94) with a missense variant that has been associated with lipid levels [46] and type 2 diabetes [47].

Bioinformatic analyses
Animal studies indicate that the brain and peripheral organs interact in directing macronutrient intake [48,49]. A question that arises is whether the "periphery", which digests and metabolizes macronutrients, plays a larger role than the brain, for instance by determining how the brain assigns reward values to macronutrients. (For example, this is partially the case with alcohol, where mutations that affect metabolic capacity render alcohol consumption unpleasant [50,51].) While individual loci associated with dietary intake have been studied previously (e.g., [21,52]), it is unknown in which tissues the polygenic signal is enriched. To address this question, we used stratified LDSC [27,53] to identify in which tissues diet-composition-associated SNPs are likely to be expressed (Supplementary Information 6.1). We performed two stratified LDSC analyses, which partitioned SNP heritability according to (i) 10 broadly defined tissues, which were ascertained with LDSC reference data from chromatin data [54], and (ii) 53 tissues (including 14 brain regions), as ascertained with LDSC reference data from sets of Specifically Expressed Genes in GTEx (known as LDSC-SEG) [53]. To correct for multiple testing across tissues, we used Bonferroni-adjusted significance thresholds for the number of tested tissues (α = 0.05/10 = 0.005 and α = 0.05/53 = 9.4 × 10⁻⁴, respectively).
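The two Bonferroni thresholds quoted above follow directly from dividing the nominal level by the number of tested tissues, as the short sketch below shows. The function names and the example P-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_tissues(p_by_tissue, alpha=0.05):
    """Return the tissues whose P-value falls below the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_by_tissue))
    return [tissue for tissue, p in p_by_tissue.items() if p < threshold]


alpha_broad = bonferroni_threshold(0.05, 10)  # 10 broadly defined tissues
alpha_gtex = bonferroni_threshold(0.05, 53)   # 53 GTEx tissues

# Hypothetical P-values for three tissues; the threshold here is 0.05 / 3.
hits = significant_tissues({"tissue_a": 0.001, "tissue_b": 0.02, "tissue_c": 0.4})
```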
We find that the central nervous system accounts for a large share of the genetic signal for all macronutrients (see Fig. 2 for the regression coefficients), with the proportions of explained heritability ranging from 44% (fat and sugar) to 55% (protein). Within the central nervous system, we find broad involvement of the brain, including (frontal) cortex (fat and sugar), the basal ganglia (fat), limbic system (fat and sugar), cerebellum (protein), and hypothalamus and substantia nigra for fat and protein (and sugar only suggestively after Bonferroni correction). The confidence intervals for the coefficients overlap across brain regions, so we cannot draw conclusions about the specificity of brain regions for intake of particular macronutrients.
For fat, genetic variation related to adrenals and/or pancreas tissue is estimated to explain 37% of the heritability. Because the adrenals play a role in lipid metabolism, and the pancreas is crucial for digestion, either tissue may plausibly affect fat intake. We caution, however, that in the LDSC-SEG analyses of 53 tissues, all non-brain regions had P-values above 0.05 even before Bonferroni adjustment.
To gain insight into the putative functions of the top associated loci, we queried the 81 genes identified by the MAGMA analyses in Gene Network [55], which predicts Reactome [56] functions for genes (Supplementary Information 6.2). In addition to neural functioning (e.g., axon guidance), we find that the MAGMA genes are predicted to be involved in growth factor signaling and the immune system (Supplementary Information 6.6). These results may imply a more pronounced role for peripheral gene functions than our stratified LDSC results, which mainly implicated the brain.

Genetic correlations, heritability estimation, and polygenic prediction
We estimated pairwise genetic correlations between the macronutrients with bivariate LDSC [57]. All are statistically distinguishable from zero at P < 0.05 (except fat and protein), but also from one and negative one (Supplementary Information 7.1, Supplementary Information 7). (As we explain in Supplementary Information 2.8, negative phenotypic and genetic correlations are not mechanically induced by our phenotype definition.) Thus the macronutrients have overlapping but distinct genetic architectures, consistent with previous work from animal studies showing distinct biological mechanisms involved in macronutrient-specific appetites [48]. We calculated GREML [28] estimates of SNP-based heritability using a random N = 30,000 subsample of conventionally unrelated UKB participants. The estimates range from 2.1% for protein to 7.9% for carbohydrate (Extended Data Fig. 7 [20,21]). These heritability estimates might be biased downward due to phenotypic measurement error (Supplementary Information 8.2) and are similar in magnitude to those from other complex (and also noisily measured) behavioral phenotypes, such as subjective wellbeing [34] and risk preferences [35].
Fig. 2 The panel shows the partial regression coefficient (τC) from the stratified regression, divided by the LD Score heritability of the diet-composition phenotype (h²). Error bars depict 95% confidence intervals. The phenotypes are ordered from left to right (fat, protein, sugar, and carbohydrate), from darker to lighter shades. Asterisks (*) denote significant deviation from zero after Bonferroni correction for 10 tissues: *P < 0.05/10, **P < 0.01/10, ***P < 0.001/10.

Relationships with health, lifestyle, and socioeconomic status
Using bivariate LDSC [57,58], we estimated genetic correlations between our diet-composition phenotypes and 19 preselected relevant medical and lifestyle phenotypes for which well-powered GWAS results were available. We also included four additional phenotypes for which GWAS results became available after our study was underway, as well as Alzheimer's disease, motivated by the association we found between APOE and macronutrient intakes, and nine phenotypes from the psychiatric domain. To control for multiple testing, we again used Bonferroni-adjusted P-value thresholds (α = 0.05/33).
Despite their relatively weak genetic correlations with BMI, sugar and carbohydrate have negative genetic correlations with waist circumference (rg = −0.13, SE = 0.03, and rg = −0.14, SE = 0.03) and waist-hip ratio (rg = −0.15, SE = 0.04, and rg = −0.18, SE = 0.04) that are larger in magnitude and statistically distinguishable from zero at the 5% level. All the macronutrients have negative genetic correlations with alcohol consumption (rg between −0.61 and −0.11), as expected since alcohol is included in energy intake and our phenotype measures are shares of energy intake (Supplementary Information 2.8).
Phenotypic associations between diet composition and body mass index (BMI) in four independent cohorts, in terms of standardized regression coefficients (with 95% confidence intervals). These coefficients were obtained from a regression of BMI on the focal macronutrient and several covariates (sex, age, educational attainment, and household income). FHS Framingham Heart Study (N = 4,413), HRS Health and Retirement Study (N = 2,394), UKB UK Biobank (N = 158,046), WHI Women's Health Initiative (N = 8,628). The summary estimate was based on fixed-effects, inverse-variance-weighted meta-analysis.
Next, we computed genetic correlations with indicators of socioeconomic status [25,59,60], which are heritable [59,60] and known to be phenotypically associated with food access, dietary choices, and health [61][62][63][64][65]. We found that fat is negatively genetically correlated with educational attainment (rg = −0.13, SE = 0.04). Sugar and carbohydrate are negatively genetically correlated with the Townsend deprivation index (rg = −0.23, SE = 0.06 and −0.30, SE = 0.06), which is constructed from the rates of unemployment, non-ownership of cars and houses, and neighborhood overcrowding [60,66], with higher scores indicating more severe socioeconomic deprivation. These genetic correlations might hint at environmental factors involved in macronutrient intake, although these relationships might also be caused by unmeasured, confounding factors.
Finally, we estimate the genetic correlations between diet composition and physical activity, which has widespread physical and mental health benefits [67,68]. In these genetic correlation analyses, we used unpublished physical activity GWAS summary statistics from a sample of research participants from 23andMe (N = 269,189). The physical activity phenotype is a composite measure based on self-reported activities from leisure, occupation, and commuting. We find a negative genetic correlation of physical activity with fat (rg = −0.32, SE = 0.04) and a positive genetic correlation with sugar (rg = 0.23, SE = 0.04) and carbohydrate (rg = 0.13, SE = 0.04). The genetic correlation with protein is positive but not statistically distinguishable from zero at P < 0.05 (rg = 0.011). In the psychiatric domain, we find negative genetic correlations between saturated fat and schizophrenia (rg = −0.13, SE = 0.04) and between carbohydrate and ADHD (rg = −0.19, SE = 0.04). The negative genetic correlation with schizophrenia contrasts with its known positive phenotypic correlation, as patients with schizophrenia tend to consume higher amounts of saturated fat [69]. The negative genetic correlation with ADHD might be related to ADHD's responsiveness to dietary intervention [70], or might be explained by socioeconomic status.
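A reported genetic correlation and its standard error can be converted into an approximate two-sided P-value and compared with the Bonferroni threshold of 0.05/33 used in this section. The sketch below assumes a normal approximation for rg/SE; the function name and dictionary labels are ours, and the input values are the estimates quoted in the text.

```python
import math

ALPHA_BONFERRONI = 0.05 / 33  # 19 + 4 + 1 + 9 = 33 tested phenotypes


def rg_p_value(rg, se):
    """Two-sided P-value for H0: rg = 0, assuming rg / SE is standard normal."""
    z = rg / se
    return math.erfc(abs(z) / math.sqrt(2.0))


# Estimates quoted in the text, as (rg, SE) pairs.
reported = {
    "fat / educational attainment": (-0.13, 0.04),
    "sugar / Townsend deprivation": (-0.23, 0.06),
    "carbohydrate / Townsend deprivation": (-0.30, 0.06),
    "sugar / physical activity": (0.23, 0.04),
    "carbohydrate / physical activity": (0.13, 0.04),
}

p_values = {label: rg_p_value(rg, se) for label, (rg, se) in reported.items()}
passes = {label: p < ALPHA_BONFERRONI for label, p in p_values.items()}
```

Because the inputs are rounded to two decimals, P-values close to the threshold should be read as approximate.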

Discussion
The genetic correlations we find between protein and obesity, waist-hip ratio, fasting insulin, type 2 diabetes, HDL cholesterol, and heart disease, together with the association we find between the BMI-increasing FTO allele and increased protein intake, point to an intriguing hypothesis: relative protein intake may play a role in the etiology of metabolic dysfunction. This hypothesis coincides with a growing (but often overlooked [71]) body of evidence that links protein intake to obesity and insulin resistance [72][73][74][75][76][77][78][79][80]. There is some related evidence from randomized trials with infants, which found a causal relationship between high-protein baby formula and infant body fat [81]. While the underlying biological mechanisms are unclear, high consumption of protein or certain types of amino acids (i.e., building blocks of protein) is known to induce insulin resistance [82][83][84], rapamycin signaling [77], and growth factor signaling [85], which might increase metabolic dysfunction and early mortality risk. Indeed, a recent phenotypic meta-analysis of prospective observational studies (pooled N = 154,344) found that low carbohydrate diets, which restrict carbohydrate in favor of increased animal protein or fat intake, were robustly associated with increased mortality [86].
We caution, however, that the strong and consistent links between protein and poor health outcomes might also be consistent with alternative explanations. Causation could run in the reverse direction: overweight individuals may have higher protein needs or use high-protein diets as a weight-loss strategy. The associations might also be caused by other, unmeasured variables such as unhealthy lifestyle factors or co-consumed ingredients. However, we find that the phenotypic association between protein and BMI is robust to controls for educational attainment and household income. Furthermore, the genetic correlation between protein and physical activity is statistically indistinguishable from zero. These findings weigh against socioeconomic status or physical activity being confounders of the positive genetic correlation between protein and BMI. In any case, the consistent associations that we find between protein intake and poor health warrant further attention.
For sugar, the phenotypic and genetic correlations we found with BMI and other health outcomes are consistent with observations from systematic reviews and meta-analyses of phenotypic relationships. These correlations may suggest that dietary sugar, beyond its energy content, does not have negative health effects [87][88][89][90], contrary to some popular beliefs (e.g., [91]). Another possibility is that exercise offsets negative metabolic effects of high sugar intake [92,93]. Those with a higher predisposition to be physically active may tend to consume more sugar, as sugar is a metabolically convenient source of energy during exercise [94] and may enhance endurance [95]. If so, the positive genetic correlation between sugar and physical activity might partially explain the lack of genetic correlations between sugar and poor health.
For fat and carbohydrate, we also find no consistent pattern of genetic and phenotypic associations with poor metabolic health. Taken together, our results complement the findings of phenotypic analyses by the multinational EPIC-PANACEA consortium (pooled N = 373,803), which found that only calories from protein are associated with prospective weight gain [96]. While this finding was consistent across 10 countries, we caution that EPIC-PANACEA's evidence, like ours, is limited by its reliance on self-reported eating habits.
Overall, our results show that the relative intake of each macronutrient has a distinct genetic architecture, and the pattern of genetic correlations might be suggestive of health implications beyond total calorie intake. Moreover, our genetic correlation and bioinformatics analyses suggest a number of novel hypotheses regarding the causes and consequences of dietary intake that can be explored in future work.

Acknowledgements
and R56-AG042568-04 to the University of Southern California. CCC was supported by the Intramural Research Program of the NIH/NIDDK and thanks Kevin Hall for informative discussions. PME was funded by Nestlé Nutrition. We thank the DietGen and CHARGE consortia for sharing diet-composition GWAS summary statistics, and we thank 23andMe, Inc., for sharing physical activity GWAS summary statistics. A full list of acknowledgements is provided in Supplementary Information 13.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.