Rare and common genetic determinants of metabolic individuality and their effects on human health

Garrod’s concept of ‘chemical individuality’ has contributed to comprehension of the molecular origins of human diseases. Untargeted high-throughput metabolomic technologies provide an in-depth snapshot of human metabolism at scale. We studied the genetic architecture of the human plasma metabolome using 913 metabolites assayed in 19,994 individuals and identified 2,599 variant–metabolite associations (P < 1.25 × 10−11) within 330 genomic regions, with rare variants (minor allele frequency ≤ 1%) explaining 9.4% of associations. Jointly modeling metabolites in each region, we identified 423 regional, co-regulated, variant–metabolite clusters called genetically influenced metabotypes. We assigned causal genes for 62.4% of these genetically influenced metabotypes, providing new insights into fundamental metabolite physiology and clinical relevance, including metabolite-guided discovery of potential adverse drug effects (DPYD and SRD5A2). We show strong enrichment of inborn errors of metabolism-causing genes, with examples of metabolite associations and clinical phenotypes of non-pathogenic variant carriers matching characteristics of the inborn errors of metabolism. Systematic, phenotypic follow-up of metabolite-specific genetic scores revealed multiple potential etiological relationships.

Article https://doi.org/10.1038/s41591-022-02046-0 (Metabolon HD4), as previously described 8 (Supplementary Table 1). Metabolites with annotated identities were classified into eight broad classes relating to the metabolism of lipids (33.0%), amino acids (16.8%), xenobiotics (10.1%), nucleotides (2.5%), peptides (2.2%), carbohydrates (2.1%), cofactors and vitamins (1.9%) and 'energy' (0.8%); additional compounds had an unannotated but unique chemical identity (30.8%) (Supplementary Table 2). In a two-stage genome-wide association meta-analysis (including validation in an additional 5,698 participants from the EPIC-Norfolk study; Extended Data Fig. 1), we identified 1,847 associations of 330 genomic regions with 646 metabolites (Supplementary Table 3). Conditional analysis of these regional associations identified 2,599 conditionally independent variant associations (P < 1.25 × 10 −11 ; Methods and Supplementary Table 4). We mapped annotated metabolites to 48 established metabolic pathways (Fig. 1). Additionally, we inferred a data-driven metabolic network (Methods) to include metabolites and genetic associations identified, but not covered, by current pathway representations. Both networks (along with details for all association results) can be explored on our webserver at https://omicscience.org/apps/mgwas. metabolites) using nuclear magnetic resonance (NMR) 2,3 , but only a few, smaller-scale studies (N max ≈ 8,000) have been conducted using the much broader metabolite coverage of untargeted methods (up to 644 metabolites), which have each reported fewer than 150 loci 4,5 . In this Article we present a systematic investigation of the genetic architecture of over 900 metabolites in almost 20,000 men and women. We perform exact conditional analyses, examine allelic heterogeneity and identify genetic co-regulation of multiple metabolites by investigating shared genetic influences on sets of regionally associated metabolites from across a broad array of pathways. Based on the identified genetic associations and manual literature-based curation, we define high-confidence causal genes regulating these metabolites and systematically examine their clinical relevance across over 1,400 phenotypes.

Discovery and fine-mapping for individual metabolites
We quantified plasma levels of 913 metabolites for 14,296 individuals of European ancestry from two cohort studies (INTERVAL 6 and EPIC-Norfolk 7 ) using an untargeted mass spectrometry-based platform Map of metabolic pathways highlighting 204 of the 632 annotated metabolites analyzed in this study (dark gray and red circles), including 154 with genetic associations (red circles). We also mapped 51 metabolites to class nodes (indicated by star symbols). Of the 46 class nodes, 22 are red, indicating that they contain at least one metabolite with a genetic association. Genes (grey and lime green squares) and causal genes regulating associations discovered in the study (lime green squares; as explained in the section 'Identification of genetically influenced metabotypes') are illustrated. Downward-pointing arrowheads indicate a process and upwardpointing triangles indicate a source. The inset focuses on the tryptophan metabolism pathway. An interactive version is available on the accompanying webserver at https://omicscience.org/apps/mgwas. Article https://doi.org/10.1038/s41591-022-02046-0 The majority (n = 206; 62.4%) of genomic regions associated with multiple metabolites (Fig. 2; https://omicscience.org/apps/mgwas), including half (n = 165) with multiple annotated metabolites, specifically 83 (25.2%) associated only with metabolites from within the same class and 82 (24.8%) associated with metabolites from across classes. The FADS1/FADS2 locus associated with the most annotated metabolites (94 lipids), but extensive pleiotropy was also evident for many other regions, including within-class pleiotropy (PCSK9 and MFSD2A) and across-class pleiotropy (AGPAT1, ABCG2/PPM1K, GCKR, SLC22A1 and ABCC1/PLA2G10).

EL O V L7 D M G D H /B H M T C D O 1 A L D H 7 A 1 S L C 2 2 A 5 /S L C 2 2 A 5 , S L C 2 2 A 4 S L C 2 3 A 1 S L C 3 6 A 1 / S L C 3 6 A 2
Of the 330 associated genomic regions, 225 were not reported by the previous largest genetic studies using the Metabolon assay 4,5 . For overlapping metabolites, we replicated 302 (83.2%) reported region-metabolite associations (involving 106 genomic regions and 226 metabolites; associations with either the reported variant or the variant in linkage disequilibrium (LD), r 2 > 0.1) at P < 5 × 10 −8 (Methods). For those metabolites, our conditional analyses identified a further 212 conditionally independent variant-metabolite associations independent of the previously reported associations   Table 6). In addition, within previously reported regions we identified associations with an additional 424 metabolites (1,046 conditionally independent variant-metabolite associations), demonstrating the value of both larger sample size and broader quantification of metabolites for identifying genetic determinants of metabolite variation. , metabolite associations fall into four sets (GIMs) acting through three genes (PYCR3, OPLAH or GPT) with known roles in metabolism. a, Four GIMs defined by overlap in the genetic regulation of metabolite sets. Matrices display the −log 10 (P) (capped at 50) and direction of effect (higher, red; lower, blue) for associations from stepwise conditional models, fitting the variants in the following order: rs3935209, rs2242090, rs11777194, rs10094377, rs35975875, rs10108836, rs11986259, rs34121654. GIM 1: two variants associating with 6-oxopiperidine-2-carboxylic acid and 5-oxoproline; the causal gene is OPLAH, encoding 5-oxoprolinase, which catalyzes the ATP-dependent hydrolysis of 5-oxoproline to glutamic acid (5-oxoproline and the structurally closely related 6-oxopiperidine-2-carboxylic acid associated in this cluster). GIM 2: four variants associating with S-1-pyrroline-5-carboxylate and the unannotated metabolites X-11315 and X-11334; the causal gene is PYCR3, a pyrroline-5-carboxylate reductase that generates proline from S-1-pyrroline-5-carboxylate (the strongest associated metabolite in this cluster). GIM 3: a single variant associating with aspartate; the causal gene is GPT, encoding alanine aminotransferase, which takes alanine as a substrate and produces glutamate, which is one step removed from the associated metabolite aspartate. GIM 4: a single variant associating with the unannotated metabolite X-23639. b, Regional association indicating genomic positions of the associated variants (black lines) and causal genes (in red). c, Manhattan plot of chromosome eight, with the y axis capped at 120 for clarity. All P values presented were derived from linear mixed models. Article https://doi.org/10.1038/s41591-022-02046-0

Identification of genetically influenced metabotypes
Within genomic regions, we grouped metabolites influenced by at least one shared genetic signal into genetically influenced metabotypes (GIMs) 10 . We defined these co-regulated metabolite sets by identifying the minimal set of variants from all metabolite-specific conditionally independent lead-and secondary metabolite-associated variants that explained all regional metabolite associations (Extended Data Fig. 1). To illustrate, one 2.55-Mb region on chromosome 8 showed associations between eight variants and seven metabolites, which were partitioned into four distinct GIMs ( Fig. 4; https://omicscience.org/apps/ mgwas). We identified 423 GIMs, which included up to 15 lead genetic variants (median = 1) and up to 89 metabolites (median = 2). For 264 (62.4%) GIMS, we assigned one of 253 likely causal genes (or gene sets) by extensively mining the biochemical literature (Methods and Supplementary Table 7).

Biological insights from GIMs
GIMs can provide insights into the diverse ways in which genetic variation influences metabolism and chemical individuality. We identify examples of GIMs with important clinical implications (for example, SRD5A2 and DPYD), providing insights into fundamental metabolite physiology, indicating different roles of a multi-functional protein (TTR, SLC7A2 and SLC7A5), and with tissue-specific effects through the same protein (for example, CPS1).
We identified variation near SRD5A2, the gene product being a target of antiandrogen drugs for the treatment of male-pattern baldness and benign prostatic hyperplasia 11 , as associated with eight steroid metabolites of steroid hormone biosynthesis, including six androgen metabolites ( Fig. 5 and Supplementary Table 7). SRD5A2 encodes steroid 5α-reductase 2 (SRD5A2), which activates testosterone to dihydrotestosterone, the most potent ligand for the nuclear androgen receptor; SRD5A2 is also involved in the inactivation of gluco-and mineralocorticoids 12,13 . We observed genetic associations consistent with lower SRD5A2 activity, with lower levels of conjugates of androsterone, epiandrosterone, 3α-androstanediol and 3β-androstanediol (that is, metabolites downstream of 5α-reduction of androgenic steroids), but higher levels of the major 5β-reduced androgen metabolite etiocholanolone (Fig. 5). Specifically, lower levels of androsterone sulfate and epiandrosterone sulfate have been reported to indicate reduced SRD5A2 activity 14 , and inhibitors of SRD5A2, such as finasteride, are widely used to treat enlarged prostate and male-pattern hair loss 11 . Although direct evidence of causality is currently lacking, depression and suicidality have been reported by antiandrogen users 15,16 , and manufacturers are required to list these as potential adverse effects in some countries. Variants in the SRD5A2 region have been previously associated with the risk of male-pattern baldness 17 . We performed colocalization using HyPrColoc 18 and identified a shared genetic signal between multiple androgen metabolites and male-pattern baldness (posterior probability for a shared causal variant across all phenotypes (PP) = 0.97), with rs112881196 being a potential driver of this signal. This variant is 176 kb upstream of SRD5A2, in strong LD (r 2 > 0.9) with the strongest genome-wide association analysis (GWAS) lead variant at this locus, and showed directionally consistent associations indicating greater SRD5A2 activity and risk of male-pattern hair loss. We identified a separate, shared genetic association between androsterone sulfate, epiandrosterone sulfate and depression 19 (PP = 0.98) (Methods), with rs62142080 the most likely causal variant. In line with the increased risk of depression reported in antiandrogen drug users, the major allele (T) of rs62142080 was associated with lower metabolite levels and a higher risk of depression 19 (P = 9.36 × 10 −6 ; Fig. 5), supporting concerns about widespread use of SRD5A inhibitors 15,16 .
Another clinically relevant example is related to DPYD, encoding dihydropyrimidine dehydrogenase, an enzyme involved in the breakdown of pyrimidines such as uracil and thymine. Variants that reduce DPYD activity can limit the breakdown of commonly used fluoropyrimidine cancer chemotherapies, such as 5-fluorouracil and capecitabine, causing severe or life-threatening toxicity in 10-40% of patients treated 20 . Several variants near DPYD are routinely used to identify patients who should be started on a reduced chemotherapy dose, but an estimated 70-80% of early-onset life-threatening 5-fluorouracil toxicities are not adequately identified by current screening panels 20,21 . We identified four variants in the DPYD region at which minor alleles were specifically associated with higher plasma uracil levels (Supplementary Table 7), including two rare variants currently recommended for pre-treatment screening 20 (rs3918290: MAF 0.5%, β marginal = 1.243, P marginal = 1.49 × 10 −38 ; rs67376798: MAF 0.8%, β marginal = 0.768, P marginal = 2.23 × 10 −25 ) (Supplementary Table 8). We also identified two common variants with more modest effect sizes (rs60392383: MAF 20.6%, β marginal = 0.111, P marginal = 8.16 × 10 −12 ; rs72977723: MAF 12.1%, β marginal = 0.308, P marginal = 2.03 × 10 −54 ) (Supplementary Tables 7 and 8). The variant rs72977723 tags the toxicity-associated 'HapB3' haplotype, which is included in screening recommendations by measuring rs56038477 (ref. 20 ). Although we found rs56038477 to be associated with uracil in single-variant analyses (P marginal = 1.06 × 10 −11 ), this association was almost completely attenuated in a joint statistical model with rs72977723 (rs56038477: P joint = 0.342; rs72977723: P joint = 3.11 × 10 −45 ; Supplementary Table 9), suggesting that rs72977723 better captures the effects of the HapB3 haplotype on uracil breakdown. We found that a substantial fraction (17.8%) of our participants carry the minor allele of rs72977723 but do not carry other alleles used for screening, suggesting that the addition of rs72977723 to screening panels could identify a substantial number of additional individuals who are at risk of treatment-induced toxicity.
Distinct GIMs that share the same causal gene can highlight different functions of the same gene product, such as for multi-functional transporters. TTR encodes transthyretin (TTR), which is involved in the transport of both the thyroid hormone thyroxine and retinol (by forming a complex with the retinol binding protein, RBP) 22 . We found two GIMs that included variants probably affecting TTR function differently (Supplementary Table 7). The first GIM was represented by a rare variant (rs184097503) in perfect LD with rs28933981 (p.T119M), which is known to enhance the stability of TTR and leads to higher plasma TTR levels and greater thyroxine transport capacity 23 . In line with this, we found a strong association of the minor allele (C) with higher thyroxine levels (P = 1.14 × 10 −12 ) but no association with retinol (P = 0.573) levels (Supplementary Table 8). The second GIM was represented by a common variant (rs1667237) at which the minor allele (C) was strongly associated with higher retinol (P = 1.72 × 10 −14 ) levels. Although this variant was only modestly associated with higher thyroxine levels in our study (P = 0.003), a strong proxy (rs1080094, r 2 = 0.98) has been robustly associated with circulating free thyroxine, that is, the non-protein-bound fraction, in a study of ~50,000 participants 24 . Although thyroxine has several transporters, retinol is exclusively transported by the TTR-RBP complex, suggesting that this lack of redundancy for retinol transport could explain the stronger association with plasma retinol levels seen for this GIM.
Other examples of GIMs capturing multiple functions of a gene include those of the membrane solute transporters, SLC7A2, distinct variants being associated with either lysine or arginine levels, and SLC7A5, distinct variants being associated with either kynurenine or imidazole lactate levels (Supplementary Table 7).
We observed tissue segregation of GIMs mapping to the same causal gene. For example, two GIMs at CPS1 harbor associations with either glycine-related metabolites (rs1047891) or citrulline (rs13411696 and rs114764732) (Supplementary Table 7). CPS1 encodes carbamoyl phosphate synthetase, a key liver and small-intestine enzyme regulating entry into the urea cycle. Disease-causing mutations have been implicated in the allosteric N-acetyl-l-glutamate-binding domain 25   a, Stacked regional association plots for eight steroid metabolites, the risk of male-pattern baldness and depression in a 2-Mb window around the most likely causal gene, SRD5A2. Association statistics (P values from linear mixed models) for levels of plasma metabolites were derived from linear regression models as described in the text, and summary statistics for male-pattern baldness and depression were extracted from the literature 17,19 . The two-color gradients indicate the LD (r 2 ) with the candidate causal variants identified using multi-trait colocalization: rs112881196 (blue, lead signal for male-pattern baldness) and rs62142080 (orange, lead signal for depression). b, Forest plot showing effect estimates (box) and 95% confidence intervals for rs112881196 (top panel) and rs62142080 (lower panel) across all traits considered. Effects for depression are given as odds ratios, because logistic regression models were used for association testing, whereas effects for all other traits were estimated using linear regression models. Effect estimates and corresponding standard errors for male-pattern baldness and depression were obtained from the same studies as described in the text. Sample sizes for metabolites are described in Supplementary Table 8. Open symbols indicate non-significant effects (P > 0.05). c, Scheme describing the putative mechanism by which the two genetic variants nearby SRD5A2 alter steroid metabolism. Lower plasma levels of metabolites downstream of 5α-reduction of androgenic steroids but higher levels of the main 5β-reduced androgen metabolite etiocholanolone indicate lower activity of steroid 5α-reductase 2 (SRD5A2) conferred by variants associated with a lower risk for male-pattern baldness (via rs112881196) but increased risk for depression (via rs62142080). Parts of this figure were created with BioRender.com.
Article https://doi.org/10.1038/s41591-022-02046-0 influencing enzyme activation and thereby restricting flux into the urea cycle (primarily in the liver), with consequential effects on glycine metabolism. This is a known association with an established much stronger effect in women and a causal role in coronary disease 26 . In the small intestine, which lacks the full complement of urea-cycle enzymes, CPS1 contributes to the generation of citrulline, a metabolite used as a clinical biomarker of intestinal function and enterocyte mass 27 . Thus, the citrulline-associated GIM may reflect a tissue-specific effect of altered CPS1 expression. We observed a shared signal between the citrulline GIM and CPS1 expression (using HyPrColoc; PP > 0.8) for 10 of the 49 GTEx (V8) tissues 28 , although not in tissues known for high CPS1 expression.

Genes known to cause IEMs
IEMs are metabolic diseases caused by rare genetic variants that lead to metabolite deficiency and/or accumulation, with severe phenotypic consequences if left undetected or untreated 29,30 . Many of the identified associations with metabolite levels in this population-based study are in or near genes known to cause IEMs, as has previously been reported for PCSK9, LPL and CPS1 (refs. 1,31 Table 12). We investigated whether carriers of non-pathogenic variation at IEM genes had phenotypic features characteristic of those seen in IEM patients and found evidence for common representation of IEM-related features for several genes. For example, orthostatic hypotension-1 (OMIM #223360) is an autosomal recessive disorder characterized by mutations in dopamine beta hydroxylase (DBH), which converts dopamine to norepinephrine. Mutation carriers have higher plasma levels of dopamine and low levels of norepinephrine (noradrenaline) and epinephrine (adrenaline), leading to dysregulation of autonomic functions such as control of temperature, blood pressure and vascular tone 33,34 (Extended Data Fig. 3). We identified associations of a missense variant in DBH (rs6271, p.R549C, MAF = 7.45%) with lower levels of the norepinephrine catabolite vanillylmandelate (β per minor (T) allele = −0.164, P = 8.00 × 10 −13 ), as well as lower systolic and diastolic blood pressure and lower risk of hypertension in independent, non-overlapping studies [35][36][37][38] . We found strong evidence of a shared genetic signal for these IEM characteristic features (PP = 0.97), with rs6271 as the likely underlying causal variant in multi-trait colocalization analysis (Extended Data Figs. 4 and 5). We observed similar phenotypic convergence for a common intergenic variant, rs10840516 (MAF = 24%). The likely causal gene, tyrosine hydroxylase (TH), catalyzes the conversion of tyrosine to levodopa upstream of the biochemical reaction catalyzed by DBH. Mutations at the TH gene can lead to dysregulated dopamine metabolism, which in turn may also affect pulse rate and blood pressure regulation (Extended Data Fig. 3) 39 . Variant rs10840516 associated with higher plasma levels of 3-methoxytyrosine, dopamine sulfate and higher pulse rate in the UK Biobank (β per minor (A) allele 0.012, P = 1.10 × 10 −6 ). Multi-trait colocalization provided strong evidence of a shared genetic signal between these traits (PP = 0.79; likely causal variant rs11564705, in high LD (r 2 ≥ 0.97) with rs10840516; Extended Data Figs. 4 and 5).

GIMs enable variant to function annotation at GWAS loci
The high-confidence causal gene assignment for GIMs can guide identification of disease-causing mechanisms at known GWAS loci. We systematically investigated associations of GIM defining variants (or proxies; at r 2 > 0.8) with clinical outcomes using the NHGRI-EBI GWAS Catalog and PhenoScanner 40 (Supplementary Table 13). Variants within 54 GIMs were associated (P < 5 × 10 −8 ) with the lead variant for at least one of 41 categories of complex diseases, including coronary artery disease (CAD; 13 GIMs) and chronic kidney disease (CKD, eight GIMs) (Supplementary Table 13). Causal genes for these GIMs included established genes for CAD (for example, PCSK9, SORT1 and LDLR), age-related macular degeneration (LIPC and APOE/APOC [1,2,4]), Crohn's disease (GCKR and FADS2) and CKD (GATM). Causal genes for 15 GIMs are targets of approved drugs or clinical-phase drug candidates 41 (Supplementary  Table 13). We followed up rs17014016 in PPM1K, recently reported to be associated with an increased risk of breast cancer 42 . PPM1K encodes a phosphatase essential to catabolism of branched-chain amino acids (BCAAs) 43 . We demonstrate that the genetic associations at PPM1K with the BCAA catabolites 2-aminobutyrate, isobutyrylcarnitine and gamma-glutamyl-2-aminobutyrate colocalize (PP = 0.98) with the association with breast cancer (Methods and Extended Data Fig. 6), supporting a role for BCAA catabolism in breast cancer etiology 44 .
For these prioritized associations, we tested for a dose-response relationship using a Mendelian randomization (MR) framework (Methods) and identified 30 pairs with apparent dose-response relationships, with ten providing strong evidence; that is, there were at least three variants in the score and no evidence for between-variant heterogeneity (P > 0.05, Methods). These included a positive association between plasma levels of homoarginine and risk of CKD (OR, 1.16; 95% CI, 1.09-1.23; P = 6.5 × 10 −7 ; Extended Data Fig. 7), which contrasts with observational studies linking higher homoarginine levels with lower renal and cardiometabolic disease risk 46,47 . Our analysis provides an important advance from previous MR analysis using fewer instruments, which yielded null results 48 , highlighting the need to closely monitor kidney function when adopting supplementation strategies with homoarginine 49 due to the potential adverse effects. Much attention for a possible involvement of arginine-related metabolites in cardiometabolic disease has been paid to either arginine itself 50 or its possible adverse catabolites, (a)symmetric dimethylarginine (ADMA and SDMA), based on their suggested vasodilatory role 51,52 , with some evidence from single-locus MRs for a putative adverse effect of higher arginine on CAD 53 . Although our metabolomics platform cannot distinguish between ADMA/SDMA, we observed only weak Article https://doi.org/10.1038/s41591-022-02046-0 evidence for a possible role of arginine in cardiometabolic and renal disease (for example, diabetic retinopathy, P = 6.1 × 10 −4 , or cystic kidney disease, P = 1.1 × 10 −3 ). The observations that genetic variants associated with homoarginine are probably linked to transporters with specific affinity to homoarginine (SLC15A19 and SLC7A7) and that the known CKD intergenic variant rs1145091 (near GATM) was the strongest variant for plasma homoarginine levels argue for a possible distinct role of plasma homoarginine compared to arginine-related metabolites in plasma in CKD pathology. Furthermore, genetically predicted plasma levels of 3-methylglutarylcarnitine were inversely associated with benign neoplasms of the colon (OR, 0.89; 95% CI, 0.85-0.94; P = 6.2 × 10 −6 ). 3-Methylglutarylcarnitine is a downstream catabolite    The circos plot displays adjusted P values (q value) from logistic regression models testing for pairwise associations between 155 genetically predicted metabolite levels (scores) and 1,457 phecodes in the UK Biobank. Each dot represents one metabolite-phecode association, and colors reflect metabolite classes. Associations passing the multiple testing correction cutoff (q < 0.05) are indicated by larger triangles, the orientation of which indicates the association direction, and are annotated at the outer margins of the plot. Metabolite scorephecode associations with robust evidence for a dose-response relationship are indicated in bold (see text). Effect estimates, standard errors and P values are provided in Supplementary Table 14.
Article https://doi.org/10.1038/s41591-022-02046-0 of leucine metabolism, and elevated plasma levels are used to diagnose 3-hydroxy-3-methylglutaryl-coenzyme A lyase deficiency 54,55 , an IEM characterized by frequent metabolic acidosis with a severe liver phenotype but no reported impact on neoplasms of the colon. The multi-locus nature of our observation points towards a protective role of high 3-methylglutarylcarnitine plasma levels outside of the IEM, an observation that warrants further experimental follow-up to establish possible underlying mechanisms.

Discussion
Human metabolism and metabolic responses are highly individual and are dysregulated in many common and rare diseases. By conducting the largest genetic study of untargeted metabolomics, we have identified hundreds of genetic variants acting in complex metabolic hotspots in the genome and with large effects on many circulating metabolites. We used this information to define GIMs, which represent the genetic basis of chemical individuality and explain a substantial amount of inter-individual differences in plasma levels of over 600 metabolites. To investigate the consequences of genetic differences in chemical individuality for human health, we pursued a variety of approaches with phenotypic follow-up for a large range of rare and common human conditions. We show convergence of metabolic and phenotypic presentations of genes known to cause rare IEMs with variants at these genes identified in this study of the general population.
Previous studies generally treated metabolites as distinct entities in association analysis 1,4,5,56 , and very few considered the extensive local co-regulation of either biochemically related or seemingly unrelated metabolites (that is, those across different biochemical classes 10 ). Our approach to systematically identify such metabolic hotspots in the genome provides a framework that will probably provide additional insights for other domains of molecular traits such as gene or protein expression, but also to disentangle genetic co-regulation in the medical phenome more generally, as exemplified by the multiple signals at SRD5A2.
The lack of identification of causal genes remains one of the most important limitations for the successful translation of GWAS findings so far. The intrinsic biochemical link between the function of proteins encoded by genes close to metabolite-associated variants provides direct metabolically informed evidence for causal gene assignment based on decades of biochemical experiments. We have exemplified how this information can identify causal genes for, and provide mechanistic insight into, known loci for diverse diseases, as well as providing examples of genetically predicted metabolite levels robustly associated with complex diseases.
We have previously shown hundreds of associations between plasma metabolite levels and future onset of multiple diseases 8 , and have hypothesized that few of those are likely to be causal, but instead reflect 'common antecedents' underlying both metabolite levels and disease risk. The genetic approaches for causal inference used in this study appear to support this notion, as we found few examples with strong genetic support for a causal association between a metabolite and a disease. However, the associations identified here will enable causal assessment for future metabolome-wide association studies across many diseases. This provides a cost-effective and rapid way to (de)prioritize exposures for assessment in randomized trials, to avoid failures, such as has been seen for vitamin C and diabetes 57 or selenium and prostate cancer 58 . Furthermore, more diverse and large-scale efforts will identify genetic determinants for those metabolites not yet captured here or those for which we have identified only single or non-specific variants.
A third of compounds investigated were unannotated, so future work will include further triangulation of associated variants and causal gene assignment to assist their identification. Our results rely on individuals of European ancestry, and investigation in other ancestries will probably provide additional insights 59,60 . We note that, for certain metabolites, measurement error might have contributed to low estimates of variance explained. Our phenome-wide approach using metabolite scores could be extended at multiple levels: (1) genome-wide scores will probably provide more statistical power, although at the cost of biological specificity, (2) we could use a more refined approach to select metabolite-or pathway-specific genetic instruments to generate metabolite scores and (3) we could extend application to studies with greater numbers of disease cases.
Our results reveal a genomic landscape that accounts for chemical individuality, with important and potentially actionable insights for human health. Future integration with molecular layers providing complementary information, such as protein or gene expression, and obtained in diverse populations will further help translate how our genome shapes our health to derive treatment options for diseases.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41591-022-02046-0.
Article https://doi.org/10.1038/s41591-022-02046-0 Metabolites were matched between studies based on the Metabolon 'chemical id' where present or names in the case of unannotated metabolites. A complete list of metabolites, chemical IDs (CHEMI-CAL_ID) and compound IDs (COMP_ID) for each constituent measurement batch are given in Supplementary Table 2.

Genotyping and imputation
The genotyping protocol and QC for the INTERVAL samples (up to 50,000) have been described previously in detail 6 . In short, DNA extracted from buffy coat was used to assay ~830,000 variants on the Affymetrix Axiom UK Biobank genotyping array at Affymetrix. Genotyping was performed in multiple batches of ~4,800 samples each, and sample QC was performed, including exclusions for sex mismatches, low call rates, duplicate samples, extreme heterozygosity and non-European descent. Multidimensional scaling was performed using PLINK v1.9 to create components to account for ancestry in genetic analyses. Before imputation, additional variant filtering steps were performed to establish a high-quality imputation scaffold. In summary, 654,966 high-quality variants (autosomal, non-monomorphic, bi-allelic variants with Hardy-Weinberg equilibrium (HWE) P > 5 × 10 −6 , with a call rate of >99% across the INTERVAL genotyping batches in which a variant passed QC, and a global call rate of >75% across all INTERVAL genotyping batches) were used for imputation. Variants were phased using SHAPEIT3 and imputed using a combined 1000 Genomes Phase 3-UK10K reference panel. Imputation was performed via the Sanger Imputation Server (https://imputation.sanger.ac.uk) and resulted in 87,696,888 imputed variants. Variants with MAF < 0.01% or INFO (imputation INFO score) of <0.3 were excluded before further analysis. For EPIC-Norfolk, samples (N = 21,448) were genotyped on the Affymetrix UK Biobank Axiom array chip by Cambridge Genomic Services. Sample and variant QC followed the Affymetrix Best Practices guidelines. Samples were excluded based on DishQC < 0.82 (fluorescence signal contrast), call rate of <97%, heterozygosity outliers and sex discordance checks. Variants were excluded if the call rate was <95% or HWE P ≤ 1 × 10 −6 . Monomorphic variants and those with cluster problems detected using Affymetrix SNPolisher were excluded. Genotype imputation was performed using two different reference panels, the Haplotype Reference Consortium (HRC) (release 1) reference panel and the combined UK10K + 1000 Genomes Phase 3 reference panel. After pre-imputation QC, 21,044 samples remained for imputation. All variants imputed using the HRC reference panel were included, and additional variants imputed using only the UK10K + 1000 Genomes reference panel were added to create a combined imputed set. Variants with imputation quality INFO < 0.4 or minor allele count (MAC) of ≤2 were excluded. All positions were on genome assembly GRCh37.

Discovery GWAS and meta-analysis
GWASs were performed separately in INTERVAL (N = 8,455) and EPIC-Norfolk (N = 5,841), for each metabolite using BOLT-LMM 63 (version 2.2). Where BOLT-LMM failed, for example, due to an invalid heritability estimate close to 0 or 1, the analysis was run using SNPTEST (version 2.5.1 or 2.5.2) 64,65 . In SNPTEST analyses, related individuals were excluded and the first genetic principal components (five for INTERVAL and four for EPIC-Norfolk) were included. Variants with MAF < 0.01%, imputation quality INFO < 0.3, HWE P < 1 × 10 −6 or exact alleles unknown, and associations with absolute (effect) >10 or standard error <0 or >10 were excluded. In INTERVAL, for variants with both imputed and genotyped data, imputed data were used if the INFO score was greater than 0.6; otherwise, genotyped data were used. Study-specific results were pooled using inverse-variance weighted fixed-effect meta-analyses and METAL 66 , applying a MAC threshold of >10 in each study.

Definition of genomic regions
To define regions, all associations with P < 5 × 10 −8 in the meta-analysis and P < 0.01, MAC > 10 and consistent direction of effect in both studies were taken forward. Pairwise LD was calculated within the INTERVAL study (N ≈ 50,000). For each individual metabolite, sentinel variants (with the largest −1*log10(P)) were identified and the range of positions of variants in LD (r 2 ≥ 0.1) was used to define the region. For sentinel variants with no other variants in LD, a region around the sentinel variant (±500 kb) was created. In the next step, sentinel variants for all metabolites were considered. Pairwise LD was calculated for each sentinel variant, and regions with sentinel variants in LD (r 2 > 0.6) were merged and a further 250 kb was added to either side of each region to avoid having variants in the margin of the locus. Overlapping regions were merged until all defined regions were non-overlapping.

Validation of metabolite-region associations
We validated regional sentinel variant-metabolite associations by meta-analyzing the discovery and validation data. The GWAS for the validation data was performed using the same protocol as for the discovery data. Associations were considered validated if the association was significant after correction for multiple testing (P < 5.48 × 10 −11 ) in the validation meta-analysis, with consistent direction of effect in all three constituent GWASs.

Conditional analysis
Exact conditional analysis was performed using combined individual-level data from INTERVAL and EPIC-Norfolk discovery sets. We performed forward stepwise regression analyses, adjusting for fixed effects of the study and the top genetic principal components and considering variants with consistent direction of effect and P < 0.01 in both discovery datasets. Region-wide association analyses were performed using SNPTEST (version 2.5.2). Initially, we conditioned on the most strongly associated regional variant from marginal analyses and estimated the association of each other regional variant independently in the conditional model. We then identified the variant with the lowest P value from the tested regional variants, added it to the conditional model, and re-estimated associations of all other regional variants using the updated conditional model. This process was repeated iteratively until no further regional variants were significant at P < 1.25 × 10 −8 . The P < 1.25 × 10 −8 threshold was calculated using the Bonferroni correction, adjusting for the maximum number of variants (n = 39,297) and metabolites (n = 102) tested at any region. We fitted a final linear regression model (R version 3.2.2), and excluded any selected variants not significant at P < 1.25 × 10 −8 in the full conditional model. In a small number of instances (n = 49; 2.65%), no regional variant, including the lead variant, associated at P < 1.25 × 10 −8 (which was more stringent than the discovery analysis threshold of P < 5 × 10 −8 ). In this situation we ran conditional analyses conditioning on the lead variant and, if no other variants were found to be associated at P < 1.25 × 10 −8 , we retained only the original lead variant.

Technical validation of rare variant associations
To ensure that rare variant associations were not due to technical artefacts of the imputation, we performed a technical validation using WES data in a subset of 3,924 samples from the INTERVAL study 9 . We looked up associations from analysis of the INTERVAL WES data for 122 (49.8%) of the total 245 rare variant associations for which variants and metabolites overlapped. All associations were directionally consistent with our analysis with an almost perfect correlation of effects (r 2 = 98.33), and 118 were at least nominally significant (P < 0.05) (Extended Data Fig. 2).

Definition of GIMs
A matrix ('matrix.ref') was created with the variants from the conditional analysis for all regionally associated metabolites as rows, metabolites as columns and −log10(P) for conditional association with each metabolite as individual elements of the matrix (Extended Data Fig. 1). The variant with the largest −log10(P) for association with any metabolite in 'matrix. ref' was selected, and −log10(P) for association of the selected variant Nature Medicine Article https://doi.org/10.1038/s41591-022-02046-0 with all the metabolites within the matrix was calculated and added to a new matrix called 'matrix.out'. This variant was removed from 'matrix. ref' and the −log10(P) for the association of each of the remaining variants in the 'matrix.ref' was calculated, conditioning on the variant(s) in 'matrix.out'. The steps were repeated by selecting the next variant with the largest −log10(P) within 'matrix.ref', adding it to 'matrix-out' and estimating the associations of each variant and each metabolite in 'matrix-ref', conditioning on all variants in 'matrix-out'. This was repeated until no variant-metabolite association was identified in 'matrix.ref' with a −log10(P) > −log10 (5 × 10 −8 ). Every time we selected the variant with the largest −log10(P) for association with any given metabolite within 'matrix.ref', we ensured that this variant-metabolite association had the same direction of effect with a P value for association with the metabolite of less than 0.01 in both INTERVAL and EPIC-Norfolk. A marker order number was assigned to indicate the order in which variants were included in 'matrix.out'. In the last step, we created metabotypes within the locus using the genetic associations within 'matrix.out' with −log10(P) > −log10 (5 × 10 −8 ). Starting with the first variant, all metabolites with a significant association were selected, then we selected all variants associated with any one of these metabolites. This step was repeated until no variant or metabolite was added to the metabotype.
We reviewed all GIMs manually to check whether adjacent regions with the same metabolites were inadvertently split, and identified ten such regions. These adjacent regions were manually merged, and the GIMs were recalculated within the merged (extended) region. This method of defining GIMs was 'hypothesis-free' and inclusive, based on genetic associations. It did not take account of existing biological relationships or phenotypic correlations between metabolites. We examined phenotypic correlations among metabolites within GIMs and include these in Supplementary Table 15.
The −log10(P)s in the matrices, along with the +/− that represent the direction of effect (Supplementary Table 7), are for the associations from stepwise selection mentioned above; that is, they are conditional on only the variants with lower marker order numbers (column 'Marker Order (Conditional Analysis)'), as opposed to all the variants in the GIMs in that region.

Causal gene annotation
The biochemical investigation of living systems preceded GWAS by many decades. Often, the names of genes and proteins reflect their biochemical activity. We used both these facts to deduce the likely causal genes at many of the metabolite-associated variants. Specifically, we used automated approaches to identify potential supporting information for the 20 closest protein-coding genes to each lead variant, using distance from lead variant to the 'gene body' (transcription start site to transcription end site) of each gene. This information was manually reviewed to identify the most likely causal gene for each locus.
To leverage the fact that many gene names directly reflect their known substrates (for example, phenylalanine hydroxylase), we used the following approaches: 1. Fuzzy text match (Ruby Gem fuzzy_match, score > 0.5) of any synonym of the metabolite name (from HMDB) to the name of the gene (entrez) or the name of the protein or a synonym of the name of the protein (UniProt To leverage known biochemical pathway knowledge we used the following approaches: 1. Lookup of candidate genes in HMDB's interacting proteins annotation 2. Match of KEGG maps between each metabolite and each gene (no direct connection required, just co-occurrence on a KEGG map). 3. Fuzzy text match of any synonym of the metabolite to the set of GO biological processes with fewer than 500 human genes to which each gene was assigned after removing the following non-specific substrings from the name of the biological process: metabolic process, metabolism, catabolic process, response to, positive regulation of, negative regulation of, regulation of.
Any positive hits from the above automated analyses were manually reviewed, as well as any supporting primary literature. If the existing experimental evidence convincingly supported one of the 20 genes at the locus, that gene was selected as the biologically most likely causal gene. If there was no clear experimental evidence for any of the 20 closest genes, no causal gene was manually selected. In some cases, two or more genes at a locus had equally strong experimental evidence. This is especially the case with nearby paralogs arising from gene duplication. In these cases, multiple causal genes have been flagged, indicating that one or more of the selected genes may be contributing to the metabolite-variant association.

Assessing novelty of associations
We assessed the novelty of variant associations using associations reported by the previous two largest genetic studies that used the Metabolon assay 4,5 . Based on a 250-kb-distance-based window, we identified 631 region-metabolite associations reported by these studies. Of these, 83 region-metabolite associations were with ratios and were excluded from the comparison. Of the remaining 548 region-metabolite associations, we identified 302 region-metabolite associations as significant in our study (at P < 5 × 10 −8 ; Supplementary Table 6). Within the remaining 246 region-metabolite associations, for 185 region-metabolite associations we were not able to map the metabolite to our study. Therefore, of the 363 region-metabolite associations (involving 118 genomic regions and 243 metabolites) where metabolites were directly mapped to our study, we replicated (at P < 5 × 10 −8 ; associations with either the reported variant or a variant in LD, r 2 > 0.1) 302 (83.2%) of the region-metabolite associations (involving 106 genomic regions and 226 metabolites).

Colocalization
Where we report results from colocalization analyses, the analyses were performed using HyPrColoc 18 . In the first step we performed pairwise colocalization to derive the list of metabolites that colocalize with the outcome. The selected set of metabolites were then colocalized using the multi-trait colocalization framework described in the HyPrColoc manuscript 18 . We report only the colocalizations with a PP greater than 0.8. The following datasets were used in the colocalization analysis: gene expression (GTEx Analysis Release V8) 28 , breast cancer 42 and depression 19 .

Enrichment of genes known to cause IEMs
Genes were mapped to metabolic regions using two methods: (1) manual annotation of likely causal genes based on the biochemical literature as previously described and (2) a gene set consisting of the closest gene to any conditionally independent variant. For method (2), the closest gene was identified using Variant Effect Predictor (VEP), and genes within 5 kb of conditionally independent variants and their proxies (r 2 > 0.6) were identified using SNiPA. These genes were assessed for known causal links to IEMs using a list of 785 known IEM genes downloaded from the Orphanet database 67 .
We tested whether there was enrichment of IEM genes among all genes annotated to metabolite-associated regions compared to the percentage of known IEM-linked genes 67 (n = 785; 4%) among genome-wide protein-coding genes (N = 19,817) 68 , using a two-tailed binomial test. As a sensitivity analysis, enrichment was assessed using less specific methods of assigning genes to metabolic regions, where genes were identified within 500 kb of conditionally independent variants or within the genomic region using GENCODE. Supplementary Table 16 provides a summary and comparison of these methods.

Phenotypic assessment of metabolite-associated variants at the DBH and TH loci
Complex phenotype associations reported in GWAS Catalog, Phe-noScanner 40 and UK Biobank at a significance threshold of P = 1 × 10 −5 were identified for rs6271 at the DBH locus, rs10840516 at the TH locus, and any variants that were in strong LD (r 2 ≥ 0.8). Associations for which the phenotypes were related to one or more symptoms of the corresponding IEMs, orthostatic hypotension (OMIM #223360) and Segawa syndrome (OMIM #605407), as reported in IEMBase and other relevant literature, were tested for colocalization with metabolite levels. A list of associations at DBH and TH loci that were prioritized is provided in Supplementary Table 17. To test for colocalization between variant-metabolite associations and variant-phenotype associations, multi-trait colocalization was implemented using the R package 'HyPrColoc' (v1.0) 18 . To maximize statistical power, summary statistics from UK Biobank were used for phenotypes relating to blood pressure, hypertension, body fat composition and medication, and summary statistics from the SSGAC consortium were used for 'Years of schooling' (Supplementary Table 17). The prior probability that a variant is associated with a single trait (prior1) was set to 1 × 10 −4 , and the prior probability that a variant is associated with one trait, given it is already associated with another (prior2), was set to 0.98. Regional and alignment probability thresholds were set to 0.5. Cluster stability was assessed by using more stringent prior2 values (0.99, 0.999) and regional and alignment threshold values (0.6, 0.7, 0.8, 0.9). Only variants present in all included traits were considered for a given locus and any variants with a standard error of zero were removed.

GIMs enable variant-to-function annotation at GWAS loci
To identify complex diseases associated with the sentinel variants (or proxies at r 2 > 0.8) for our 423 GIMs, we queried the NHGRI-EBI GWAS catalog and other GWAS cataloged within PhenoScanner 40,69 . A total of 97 phenotypes from the GWAS catalog were then manually classified into 52 disease categories with EFO terms before investigating the LD between the GIM variants and top disease variant and further colocalization for specific selected associations using HyPrColoc 18 . To assess whether the causal genes are druggable, we looked at whether the genes are either targets of approved small molecules and biotherapeutic drugs (Tier 1) or clinical-phase drug candidates or encode targets with known bioactive drug-like small-molecule binding partners as well as those with ≥50% identity (over ≥75% of the sequence) with approved drug targets (Tier 2), as reported in ref. 41 .

Phenome-wide associations of metabolite levels
To facilitate phenome-association studies for metabolites, we imputed plasma metabolite levels in UK Biobank participants using conditionally independent metabolite-associated variants (Supplementary Table 4) with exact variant mappings. We created weighted (by the marginal effect) summed scores of the genetic load for metabolite levels for each of 155 metabolites with at least two variants and a clear metabolite annotation (using Stata 14.0 and R 3.6.0). We included only variants associated (P < 5 × 10 −8 in the marginal statistics) with fewer than five metabolites to minimize the impact of horizontal pleiotropy. We used these genetic scores as exposure variables, testing for associations with 1,457 phecodes, adjusting for age, sex (reported, but participants with sex chromosomes discordant from reported sex were excluded), genotype batch, test center and the first ten genetic principal components. We performed logistic regression models (using R 3.6.0) within up to 351,967 unrelated participants of white European ancestry 70 . To generate phecode-based outcome variables, we mapped ICD-10, ICD-9, Read version 2 and Clinical Terms Version 3 (CTV3) terms from self-report or medical health records, including cancer registry, death registry, hospitalization (Hospital Episode Statistics) and primary care (subset, N = 214,667), to the phecodes 45,71 . We used any code that was recorded, irrespective of whether it contributed to the primary cause of death or hospital admission, to define phecodes. We adjusted all analyses for test center to account for regional differences in coding systems and case ascertainment. For each participant and phecode, we kept only the first entry, irrespective of the original dataset, generating a first occurrence dataset. We dropped codes that were before or in the participants' birth year to minimize coding errors from electronic health records. To account for multiple testing, we applied the Benjamin-Hochberg procedure to the full list of genetic score to phecode associations tested, controlling the false discovery rate at 5%. To test whether single variants rather than the genetic score for a metabolite accounted for the observed associations, we repeated the same analysis for single variants only, and flagged all examples for which the strongest single variant was more strongly associated with the phecode compared to the composite score. To further test for a dose-response relationship, we adapted a two-sample MR framework 72 . We used heterogeneity estimates from an inverse-variance weighted MR along with MR-Egger to test for horizontal pleiotropy, and Cochran's Q statistic to test for heterogeneity among effect estimate ratios for each variant included.

Reconstruction of the metabolic network using Gaussian graphical models
We imputed missing values for 749 metabolites with fewer than 30% missing observations in each of the INTERVAL and EPIC-Norfolk discovery datasets, individually within the study. Missing observations were imputed using multivariate imputation by chained equations (MICE), implemented using R (3.3.3) and the R package 'mice' (version 2.46.0) 73 with the method 'norm', as previously proposed for metabolomics data 74 . Imputation was performed on the residuals after taking metabolite measures that were median-normalized for assay run day, natural-log-transformed, winsorized to 5 s.d. and regressing out the effects of age, sex and study-specific covariates. The imputation model for each metabolite considered other metabolites (with fewer than 30% missing values). Thirty multiple imputations were performed, each with 50 iterations of the chain. This procedure reasonably assumes that the missing metabolite values can be explained by the values and relationships between the observed metabolite values, but are independent of the unobserved metabolite values.
To construct a data-derived metabolic network 75 , we estimated partial correlations between metabolites in the following manner. Within each study, for each of the 30 imputed datasets, imputed measures were standardized (mean = 0, s.d. = 1) and used to estimate partial correlations between metabolites with the R package 'GeneNet' 76 (version 1.2.13). Partial correlation estimates were transformed using Fisher's z transformation with the R package 'psych' 77 (version 1.7.8), and estimates for the 30 sets were pooled within the study using Rubin's rules 78 . The pooled estimates for the two studies were meta-analyzed using a fixed-effect, inverse-variance weighted method implemented using the R package 'meta' 79 (version 4.3-0) and back-transformed to correlation estimates.
The Gaussian graphical models (GGMs) resulting from inclusion of absolute partial correlations greater than 0.10, 0.12 or 0.15 can be viewed at http://omicscience.org/apps/mgwas. In the networks, nodes (circles) represent metabolites and black edges the partial correlations between metabolites. Solid lines indicate positive partial correlations and dashed lines negative partial correlations. To the GGMs we added the GWAS results by connecting candidate genes Nature Medicine Article https://doi.org/10.1038/s41591-022-02046-0 (gray squares) to metabolites (green edges). Candidate genes were from two sources: (1) 'From literature', which are those annotated as the causal gene for a GIM in which the metabolite lies (as described in the section 'Causal gene annotation') and (2) 'SNiPA / VeP', which are genes annotated to the variants defining a GIM in which the metabolite lies by SNiPA/VeP. We used a systems biology approach to annotate compounds based on their metabolic neighborhood and genetic associations in the generated network to enable prediction of pathway membership and chemical identity for unannotated metabolites present in the imputed dataset (n = 224) 80 (Supplementary Tables 18  and 19).

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this Article.

Data availability
We provide open access to all summary statistics for academic use through an interactive webserver: https://omicscience.org/apps/ mgwas. Metabolite raw relative abundances are available at https:// www.ebi.ac.uk/metabolights/ (project codes: MTBLS833 and MTBLS834). The EPIC-Norfolk data can be requested by bona fide researchers for specified scientific purposes via the study website (https://www.mrc-epid.cam.ac.uk/research/studies/epic-norfolk/). Data will either be shared through an institutional data sharing agreement or arrangements will be made for analyses to be conducted remotely without the need for data transfer. INTERVAL study data from this paper are available to bona fide researchers from helpdesk@ intervalstudy.org.uk and information, including the data access policy, are available at http://www.donorhealth-btru.nihr.ac.uk/project/ bioresource. Article https://doi.org/10.1038/s41591-022-02046-0 Extended Data Fig. 1 | Study design and method of defining genetically influenced metabotypes. Following the discovery meta-analyses (INTERVAL + EPIC-Norfolk), validation of sentinel variants and metabolite specific conditional analysis, we identified 2,599 independent variant-metabolite associations. In the next step, we performed a joint variant-metabolite refinement within each region that contained more than one metabolite to group metabolites influenced by at least one shared genetic signal. We defined these co-regulated metabolite sets by identifying the minimal set of variants from all metabolite-specific conditionally independent lead and secondary metabolite associated variants that explained all regional metabolite associations. The 422 metabotypes identified through this method were manually curated to identify the causal genes associated with these GIMs.