Genome-wide association studies of metabolites in Finnish men identify disease-relevant loci

Few studies have explored the impact of rare variants (minor allele frequency < 1%) on highly heritable plasma metabolites identified in metabolomic screens. The Finnish population provides an ideal opportunity for such explorations, given the multiple bottlenecks and expansions that have shaped its history, and the enrichment for many otherwise rare alleles that has resulted. Here, we report genetic associations for 1391 plasma metabolites in 6136 men from the late-settlement region of Finland. We identify 303 novel association signals, more than one third at variants rare or enriched in Finns. Many of these signals identify genes not previously implicated in metabolite genome-wide association studies and suggest mechanisms for diseases and disease-related traits.

T he Finns are a geographically and linguistically isolated population who have experienced multiple population bottlenecks and expansions. This population history has resulted in large allele-frequency differences between Finns and non-Finnish Europeans (NFE), which are most pronounced in northern and eastern Finland, regions first settled in the 15th-16th centuries ("late settlement Finland") 1 . In a previous study of 64 cardiometabolic traits in~20,000 individuals from these regions, we took advantage of the enrichment of otherwise rare alleles to identify 26 novel trait-associated rare deleterious alleles, 19 of which were > 20-fold more frequent in late settlement Finns than in NFE 2 . These results suggested this allelefrequency enrichment could be leveraged to identify novel rarevariant associations for additional quantitative traits. Here, we do so, reporting genome-wide association study (GWAS) results for 1391 plasma metabolites (Metabolon platform) in 6136 participants in METSIM, a study of middle-aged and older men recruited from a single site in late-settlement northeast Finland 3 , who were part of our previous study 2 .
Metabolites are small molecules that play a pivotal role in cellular and physiological processes and their observed levels in biofluids can reflect those processes 4 . Most metabolomics studies are performed in blood (plasma or serum) which reflects the aggregate production and consumption of metabolites by tissues 4 . Abnormal metabolite levels are commonly associated with human diseases and disease-related traits, making them useful aids to understand disease mechanisms and to identify biomarkers for disease diagnosis, prognosis, and treatment monitoring 4 . Many metabolites are highly heritable, and previous metabolite GWAS have identified common variants; [5][6][7][8][9][10][11][12][13][14][15] the impact of rare variants on metabolites is less well studied 16,17 .
We identify 2030 independent association signals (metaboliteindex variant pairs) for 803 metabolites and demonstrate 946 genetic colocalizations of 248 metabolites with 105 diseases and disease-related traits. Many of these associations identify genes not previously implicated in metabolite GWAS and suggest mechanisms for these diseases and traits. Of the 2030 association signals, 303 are novel; of these 303 signals, 111 are at 70 variants rare or > 10-fold more frequent ("enriched") in Finns compared to NFE, 78 are for 44 metabolites identified since 2015 on the Metabolon platform, and 17 are at variants on the X chromosome, which has often been ignored in previous metabolite GWAS. This study highlights the advantages of the Finnish population for rare-variant genetic association studies and the utility of integrating metabolite and disease genetic associations in disentangling disease mechanisms.

Results
Study design. We assayed 1544 plasma metabolites using the Metabolon DiscoveryHD4 mass spectrometry platform (Supplementary Table 1 and Supplementary Data 1) in 6136 randomlyselected METSIM participants who were non-diabetic at baseline and passed quality control (QC) (Supplementary Table 2; Fig. 1). 1391 metabolites were successfully quantified in ≥500 of these 6136 participants. We created a METSIM imputation reference panel of > 26M genetic variants by integrating genome and exome sequence and array genotypes in 2922 METSIM participants ("Methods"; Supplementary Table 3). We used this reference panel to impute genotypes in all METSIM participants. To discover genetic mechanisms for plasma metabolite levels, we performed GWAS and statistical fine-mapping analysis and nominated putative causal genes for metabolites. We integrated metabolomics with FinnGen disease GWAS to understand disease mechanisms through genetic colocalization and Mendelian randomization analysis (Fig. 1).
Since body mass index (BMI) influences levels of many metabolites 18 , we repeated all 1391 GWAS with BMI as an additional covariate. Results with and without BMI adjustment were generally very similar, with Pearson correlation coefficient r = 0.999 for effect size estimates and −log 10 p-values for variantmetabolite pairs with P < 7.2 × 10 −11 in either of the two analyses ( Supplementary Fig. 2). Supplementary Data 2 lists the 83 associations with substantially different effect sizes (ratio ≥ 1. 20) with and without BMI adjustment. In what follows, we present results for analyses without BMI adjustment.
Detecting independent association signals. To identify (nearly) independent association signals, we carried out chromosomewide stepwise conditional analysis for each chromosomemetabolite pair with ≥1 association at P < 5.0 × 10 −8 . Conditional analysis identified 2030 association signals at 1143 index variants for 803 metabolites at P < 7.2 × 10 −11 (Table 1 Fine mapping. To fine map the causal variants for the 2030 association signals, we created 2 Mb regions centered on each index variant and merged overlapping regions associated with the same metabolite, resulting in 1501 regions. We used Bayesian fine-mapping 20 with a uniform prior to calculate the variant posterior inclusion probability (VPIP) that each variant is causal and the signal posterior inclusion probability (SPIP), the sum of the VPIPs for the variants in a region ("Methods"). This method can identify multiple independent signals in a region. In the 1501 regions, we identified 2435 signals with SPIP ≥ 0.95, 1952 of which are among the 2030 association signals identified in conditional analysis. For these 1952 signals, we built 95% credible sets 21 of potential causal variants, the minimal subset of variants with summed VPIP ≥ 0.95 ( Fig. 4a; Supplementary Data 5). These credible sets included 1-544 variants (median = 6); notably, 334 credible sets included only one variant (168 distinct variants).
In each of the 1952 credible sets, we identified the variant with the largest VPIP. This list comprised 1119 distinct variants, 100 with MAF > 10-fold greater in METSIM than in NFE. VPIPs for these 100 variants were greater than those for the remaining 1019 (VPIP mean = 0.73 vs. 0.47; t-test P = 1.7 × 10 −21 ; Fig. 4b Multiple novel associations arose at the same index variants. For example, we identified novel association signals with 19 metabolites at the putatively-deleteriousSLC23A3 missense variant p.Asn336Lys (rs192756070; Supplementary Fig. 7). p.Asn336Lys has 107-fold greater frequency in METSIM than in NFE (MAF = 2.3% vs. 0.022%) and is likely the causal variant for most or all 19 metabolite associations (VPIP median = 0.98, range = 0.57-1.00). SLC23A3 encodes an SLC23 ascorbic acid transporter without demonstrated nucleobase transport 22 . These novel associations suggest a wide range of transport functions for SLC23A3.
Among the novel association signals at index variants enriched in Finns, we identified an association with 3-amino-2-piperidone at the putatively-deleteriousOAT missense variant p.Leu402Pro (rs121965043, β = 1.91, P = 3.7 × 10 −35 ). p.Leu402Pro has 100fold greater frequency in METSIM than in NFE (MAF = 0.35% vs. 0.0031%) and is the likely causal variant for this association (VPIP = 0.997). OAT encodes the key mitochondrial enzyme ornithine aminotransferase which converts arginine and ornithine into glutamate and gamma aminobutyric acid 23 . OAT has not previously been implicated in metabolite GWAS, but inactivation of OAT is responsible for the Finnish heritage disease gyrate atrophy characterized by hyperornithinemia 24 . Previous studies have found increased 3-amino-2-piperidone levels in the urine of individuals with gyrate atrophy 25 .
Among the novel association signals on the X chromosome, we identified an association for tiglylcarnitine at the putatively-deleteriousHSD17B10 missense variant p.Ala95Thr (rs201378370, HSD17B10 encodes 17-β-hydroxysteroid dehydrogenase X, a mitochondrial enzyme which catalyzes oxidation of neuroactive steroids and degradation of isoleucine 26 . Mutations in HSD17B10 that abolish enzyme activity lead to HSD10 deficiency, an infantile neurodegenerative disorder in which tiglylcarnitine level is elevated 27 . In contrast, HSD17B10 is overexpressed in brains of individuals with Alzheimer's disease 28 , in which tiglylcarnitine level is decreased 29 . A previous study reported an association between the GNPTAB intronic variant rs7964859 and aspartate 15 . We identified an independent aspartate association signal with the GNPTAB frameshift variant p.Cys528ValfsTer19 (rs1209353188) (LD r 2 = 0.01; β = 0.91, P condition = 5.2 × 10 −15 conditioning on rs7964859). p.Cys528ValfsTer19 is rare in METSIM (MAF = 0.58%), absent in gnomAD NFE, and is the likely causal variant for the METSIM aspartate association (VPIP = 0.996). GNPTAB encodes the alphaand beta-subunits of N-acetylglucosamine-1-phosphotransferase which catalyzes the N-linked glycosylation of asparagine residues with mannose-6-phosphate 30 .
Nominating putative causal genes. To nominate putative causal genes for metabolite association signals, we applied two approaches. First, we nominated 66 putative causal genes for the 208 association signals for which fine-mapping analysis identified PTV or missense variants at VPIP ≥ 0.8 (see "Fine mapping"). Second, we implemented a knowledge-based approach to integrate biological information about the metabolite and the 20 protein-coding genes nearest the corresponding index variant for the 1666 of the 2030 association signals with named metabolites ("Methods"). The knowledge-based approach nominated 215 single genes for 1033 association signals with 480 metabolites (Supplementary Fig. 8) and 19 sets of 2-7 genes with similar biochemical activity (62 additional genes) for 324 association signals with 208 metabolites (Supplementary Data 3 and 7).
We compared gene nominations for the 138 metabolite association signals for which both approaches nominated causal genes. Compared to the fine-mapping analysis, the knowledgebased approach nominated the same gene for 119 signals, multiple paralogs including the same gene for 18, and a different gene for 1, for an overall consistency > 99% (Supplementary Data 8).
The 277 = 215 + 62 genes identified by the knowledge-based approach and the 66 identified by fine mapping together comprised 290 genes. 204 (70%) of the 290 genes are the closest genes to the index variants, including 188 (68%) of the 277 genes identified by the knowledge-based approach. 58 of the 290 genes have not previously been implicated in metabolite GWAS (Fig. 3c). Of the 58 novel genes, 51 were identified by the knowledge-based approach (Supplementary Data 7), 21 by finemapping analysis (Supplementary Data 5), and 14 by both. Of the 58 novel genes, 40 represent novel loci and 18 are within loci in which metabolite associations have previously been identified but the genes we nominated have not previously been implicated.
Novel genes nominated based on association signals with amino acid levels provide insight into how the encoded enzymes or transporters contribute to modifications of amino acid derivatives. As a first example, we identified a novel association at the HDAC6 missense variant p.Arg832His (rs61735967) with N6-acetyllysine (MAF = 2.9%, β = 0.71, P = 3.6 × 10 −80 ) and suggested p.Arg832His is the likely causal variant (VPIP = 0.998). Both the fine-mapping and knowledge-based approaches nominated HDAC6 as the putative causal gene. HDAC6 encodes a lysine deacetylase that removes the acetyl group from acetyllysine in histones. Increased HDAC6 expression has been found in brains of individuals with Alzheimer's disease 31 . Elevated levels of N6-acetyllysine were recently found in an Alzheimer's disease mouse model 32 .
GWAS of metabolites recently identified on the Metabolon platform helped nominate novel putative causal genes with high biochemical relevance in known metabolite-associated regions. For example, a previous study in a Japanese sample identified associations for blood creatinine and uracil levels at the LRIG1 missense variant p.Thr792Met (rs202007714) 36 , which is monomorphic in METSIM and gnomAD Finns. We identified an association at the nearby (13 kb) SLC25A26 missense variant p.Thr208Met (rs13874) with 2,3-dihydroxy-5-methylthio-4-pentenoate (DMTPA) (MAF = 48.3%, β = 0.17, P = 2.3 × 10 −21 ). We suggest SLC25A26 as the causal gene for the DMTPA association. DMTPA, an S-adenosylmethionine, was recently identified on the Metabolon DiscoveryHD4 platform. SLC25A26 is the only known mitochondrial S-adenosylmethionine transporter.
Seven of the 58 novel genes were identified only by fine mapping. Among them, we identified a novel association for glycocholenate sulfate at the rare ADCK5 missense variant p.Ala508Thr (rs552968665; β = 1.31, P = 3.6 × 10 −12 ), which is > 79-fold more frequent in METSIM than in NFE (MAF = 0.25% vs. 0.0031%). Fine-mapping analysis suggested p.Ala508Thr is the likely causal variant (VPIP = 0.89), implicating a causal role for ADCK5. ADCK5 encodes the aarF domain containing kinase 5. These results suggest ADCK5 plays a role in human bile acid metabolism.
Colocalization of metabolites with human diseases. Integrating metabolite and disease genetic associations can improve finemapping resolution 37 and clarify the potentially causal variants and disease genes. We performed Bayesian colocalization    Integrating metabolite associations substantially increased the finemapping confidence of FinnGen disease trait association signals (SPIP median = 0.91 vs. 0.73; paired t-test P = 2.5 × 10 −70 ) and the probability assigned to the most likely disease variant (maximum VPIP median = 0.54 vs. 0.11; paired t-testP = 9.1 × 10 −128 ; Supplementary Fig. 9). For example, the putatively-deleteriousSERPINA1 missense variant p.Glu366Lys (rs28929474) was associated with N-acetylglucosaminylasparagine level in METSIM (MAF = 2.3%, β = 0.56, P = 4.9 × 10 −16 ) and risk of cholestasis of pregnancy in FinnGen (odds ratio (OR) = 6.23, P = 8.1 × 10 −17 ). Our finemapping analysis suggested a causal role of SERPINA1 p.Glu366Lys for N-acetylglucosaminylasparagine (VPIP = 0.81). We detected colocalization in this region between signals for N-acetylglucosaminylasparagine and cholestasis of pregnancy (RCP = 0.99). Colocalizing these association signals increased the SPIP for cholestasis of pregnancy from 0.69 to 0.99, and the VPIP of SERPINA1 p.Glu366Lys from 0.37 to 0.80. SERPINA1 encodes a serine protease inhibitor produced mainly in the liver. SERPINA1 mutations have been associated with familial intrahepatic cholestasis 39 , and SERPINA1 p.Glu366Lys with liver diseases 40 and circulating liver enzymes 41 . We may have missed colocalizations where the true causal variants were discarded in METSIM or FinnGen; in such cases, we may have detected colocalizations at other, likely more common, variants.
Colocalization analysis suggested campesterol and gallstones share the same causal variant in this region (RCP = 0.65; Fig. 5a) and nominated rs6544713 (SCP = 0.45) as the most likely causal variant. rs6544713 resides in active regulatory units in intestinal tissue 45 and its campesterol-decreasing allele is associated with higher ABCG8 expression in colon tissue 46 (β = 0.37, P = 7.2 × 10 −16 ), but not in the liver 46 . ABCG8 and its nearby paralog ABCG5 have previously been suggested as candidate genes for gallstones 47 . Mutations in ABCG5 and ABCG8 cause sitosterolemia, characterized by elevated campesterol 48 .
Mendelian randomization analysis using 15 independent variants for campesterol suggested a causal effect of lower plasma campesterol level on higher gallstone risk ("Methods"; β = −0.70, P = 7.2 × 10 −8 ; Fig. 5b, c). ABCG5 and ABCG8 together encode a heterodimeric ATP-binding cassette transporter that facilitates secretion of cholesterol and non-cholesterol sterols in the intestine and bile. High plasma campesterol levels might compete with cholesterol for ABCG5/ABCG8 transporters during biliary cholesterol secretion, resulting in decreased biliary cholesterol levels and reduced risk of gallstones 49,50 (Supplementary Fig. 10).
Colocalization analysis suggested that hypertension colocalized with vanillylmandelate (RCP = 0.996). Using Mendelian randomization, we did not find significant evidence of causal effects of vanilymandelate on hypertension (P = 0.15; 10 independent variants; Supplementary Fig. 11) or of hypertension on vanillylmandelate (P = 0.17; 157 independent variants), suggesting this signal conferred effects on hypertension risk and vanillylmandelate through different pathways, consistent with the two possible DBH effects identified by our knowledge-based approach ( Supplementary  Fig. 10). The analysis from vanillylmandelate to hypertension could only make use of ten instruments and so may be underpowered.

Discussion
We performed GWAS of 1391 plasma metabolites in 6136 men from the late-settlement region of Finland. We sought to identify putative causal variants and genes for the resulting genetic associations, and interrogated disease molecular mechanisms by integrating metabolite and disease genetic associations. We identified 2030 association signals for 803 metabolites, including 157 signals for 125 metabolites at 121 rare variants. We identified 303 association signals for 201 metabolites as novel, including 64 signals for 58 metabolites at 51 rare variants.
Over half of these 303 novel association signals stem from the population history of Finland, the analysis of previouslyunstudied metabolites, or the analysis of the X chromosome. The Finnish population history of alternating founding events and population expansions has resulted in a set of genetic variants rare elsewhere but more common in Finns, providing increased statistical power for genetic discovery for these variants 2 , as exemplified by the Finnish heritage diseases 55 . 79 of the 303 novel association signals we identified are at 47 variants with MAF > 10-fold greater in METSIM than in NFE, with 37 novel signals at 14 variants with MAF > 100-fold greater. These include the novel association of 3-amino-2-piperidone with the rare OAT missense variant p.Leu402Pro; mutations in OAT cause the Finnish heritage disease gyrate atrophy (see "Results").
Metabolon continues to expand the set of metabolites identified on their platform. 78 of the 303 novel association signals were for 44 metabolites identified after 2015 on the Metabolon DiscoveryHD4 platform, and so studied only in the most recent Metabolon-based metabolomics GWAS 13 . For example, we identified a novel association at SLC23A3 missense variant p.Asn336Lys for 2-O-methylascorbic acid, identified on the Metabolon platform in 2019.
Our study is one of the first Metabolon metabolomics GWAS to analyze the X chromosome, where 17 of the 303 novel association signals arose. For example, we identified a novel association for tiglylcarnitine at the HSD17B10 missense variant p.Ala95Thr. HSD17B10 mutations cause a rare inborn error of metabolism characterized by cognitive impairment and variable neurological abnormalities.
Biochemical analysis existed for decades prior to the advent of GWAS. Experiments linking a gene to a metabolite often already existed in the published literature. We identified 277 putative causal genes through existing links in the literature between tested metabolites and biochemical activities of genes near our association signals. Our results suggested most of these putative causal genes acted on the associated metabolites or closely-related metabolites. These putative causal genes characterized the genetic regulatory mechanisms for plasma metabolite levels. The associations of multiple metabolites with the same gene help improve the understanding of the gene function. For example, we nominated SLC23A3 as a causal gene for 19 metabolites of various biochemical classes, suggesting a wide range of transport functions in addition to its known role as an ascorbic acid transporter.
Integrating metabolite and disease genetic associations helps disentangle disease biology. We identified 946 metabolite-disease trait pairs likely sharing the same causal variants, which helped pinpoint the likely causal variants and disease genes (Supplementary Data 10). For example, colocalization analysis of acetylglucosaminylasparagine and cholestasis suggested a shared causal role of SERPINA1 p.Glu366Lys. Mendelian randomization analysis suggested for the first time a protective effect of high plasma campesterol on gallstones. Plasma campesterol is commonly used as a biomarker for gallstones 56 and campesterol is used as a supplement to reduce low-density lipoprotein cholesterol 57 . Our finding provides supporting evidence for these applications of campesterol in the treatment of gallstones.
Data sharing increases the impact of genetic studies. To support data exploration of our metabolite GWAS results 58 , we have constructed a METSIM metabolite PheWeb site 59 (Fig. 2). This site supports querying, visualizing, and downloading our MET-SIM Metabolon metabolite genetic association results, including Manhattan and quantile-quantile plots, and summary statistics for all 1391 metabolites. In addition, we provide direct links to the Human Metabolome Database (HMDB) 60 , which presents the metabolites' biochemical characteristics and enables interpretation of metabolite genetic association results.
In summary, we performed parallel GWAS for 1391 plasma metabolites in 6136 adult Finnish males from the METSIM study, colocalized metabolite and disease genetic associations, and made these GWAS results available using PheWeb. Our findings reveal genetic determinants for a wide range of plasma metabolites and demonstrate the utility of metabolite genetic associations for the investigation of disease biology.
Methods metabolic syndrome in men (METSIM) study. METSIM is a study of 10,197 Finnish men from Kuopio in the late-settlement region of northeast Finland designed to investigate factors associated with type 2 diabetes and cardiovascular diseases 3 (Supplementary Table 2). Participants were aged 45-74 (median = 58) years during baseline visits from 2005 to 2010. Participants provided demographic, diet, exercise, disease, and medication history information, and underwent laboratory measurements, including oral glucose tolerance test, after ≥10-hour overnight fast. Morbidity, mortality, and drug treatment information was periodically updated for participants who consented to use of their hospital admission, drug reimbursement, and prescription records in Finnish national registries. Due to funding constraints, we randomly selected 6490 of the 8777 METSIM participants who at baseline were neither diagnosed with diabetes nor taking diabetes medications that might broadly impact metabolomics levels for the Metabolon metabolomics assay. After exclusion of participants who subsequently developed diabetes (n = 264), lacked array genotypes (n = 65) or body mass index (BMI) measurement (n = 1), had sex mismatch (n = 3), and/or were non-Finnish (n = 21), our analysis set comprised 6136 participants (Supplementary Table 2; Fig. 1). This study was approved by the Ethics Committee at the University of Eastern Finland and the Institutional Review Board at the University of Michigan. All participants provided written informed consent.
Metabolomics profiling and data processing. Non-targeted metabolomics profiling was performed at Metabolon, Inc. (Durham, North Carolina, USA) 61 on EDTA-plasma samples obtained after ≥10-h overnight fast during METSIM baseline visits. Briefly, methanol extraction of biochemicals followed by a nontargeted relative quantitative liquid chromatography-tandem mass spectrometry (LC-MS/MS) Metabolon DiscoveryHD4 platform was applied to assay named (n = 1240) and unnamed (n = 304) metabolites (Supplementary Table 1 and Supplementary Data 1). Samples were randomized across batches. Batches con-tained~144 METSIM samples and 20 well-characterized human-EDTA plasma samples for quality control (QC). All 6490 samples were processed together for peak quantification and data scaling. We quantified raw mass spectrometry peaks for each metabolite using the area under the curve. We evaluated overall process variability by the median relative standard deviation for endogenous metabolites present in all 20 technical replicates in each batch. We adjusted for variation caused by day-to-day instrument tuning differences and columns used for biochemical extraction by scaling the raw peak quantification to the median for each metabolite by Metabolon batch.
We captured exomes for all METSIM participants by SeqCap EZ HGSC VCRome kit (Roche) and sequenced them by HiSeq2000 (Illumina) (average depth 45×) 2 . For exome sequences, we excluded samples with estimated contamination > 3% or sample swaps compared to the array genotype data 62 and required single-nucleotide variant (SNV) array genotype concordance > 90% if array data were available. We filtered variants with genotype call rate < 98%, HWE P < 10 −6 , or overall low allele balance (alternate allele count/sum of total allele count < 30%) 2 . The resulting array-genotype dataset consisted of n = 10,066 METSIM participants with 679,866 SNVs. The exome-sequence dataset consisted of n = 9957 participants with 583,947 SNVs and 40,270 small insertions/deletions (indels).
In wave 2, we sequenced 2875 additional METSIM participants using the same methods used for wave 1. We generated a combined wave 1 + 2 call set of n = 5949 using the same methods, resulting in calls for 55,648,111 SNVs and 12,850,837 indels. Wave 2 data became available only after the main analysis for this paper was complete; we used wave 1 + 2 combined data to determine linkage disequilibrium (LD) proxies for previously-identified metabolite associated variants that were missing in wave 1 but present in wave 1 + 2 combined data (see "Identification of novel associations").
METSIM integrative panel and genotype imputation. Using the 3074 METSIM participants with wave 1 genome sequence data, we generated an integrated list of genetic variant sites by merging site lists from the genome and exome sequence data, and the OmniExpress and exome array data. Of the 3074 participants, 3055 had OmniExpress and exome array data, and 3037 had exome sequence data. We calculated genotype likelihoods for each individual at each site as the product of genotype likelihoods assuming independent data across platforms 66 . For OmniExpress and exome array genotypes, we converted genotype calls to genotype likelihoods assuming a genotype error rate of 10 −6 . We then phased genotypes using integrated genotype likelihoods in Beagle v4.1 (https://faculty.washington.edu/ browning/beagle/b4_1.html) with 50,000 markers per chunk and 3000 overlapping genetic markers between consecutive chunks 67 . We subsequently excluded 1 individual who self-identified as non-Finnish, 2 individuals identified as population outliers in genetic PCA, and 149 close relatives (estimated kinship ≥ 0.125 in KING v2.2.1 (https://www.kingrelatedness.com) 68 ). The resulting integrative panel comprised 2922 individuals genotyped for 23,294,337 SNVs and 2,851,848 indels (Supplementary Table 3). 2670 (91.4%) of the 2922 individuals had Metabolon metabolomics data.
We imputed genotypes for the 6490 study participants on the framework of their OmniExpress genotypes using the METSIM integrative panel with Minimac v4 69 . We excluded imputed variants with imputation r 2 < 0.3, leaving 19,182,997 SNVs and 2,404,717 indels for downstream analysis (Supplementary Table 4).
Variant functional annotation. We annotated all variants using the Ensembl Variant Effect Predictor (VEP, https://useast.ensembl.org/info/docs/tools/vep/ index.html) version 99 70 . We used the "-pick_order" option to annotate each variant using a single transcript, with transcripts prioritized in the following order: transcript support level (i.e., well-supported and poorly-supported transcript models based on the type and quality of the alignments used to annotate the transcript), transcript biotype (protein coding preferred), APPRIS isoform annotation (i.e. annotation based on a range of computational methods to identify the most functionally important transcripts from cross-species conservation), deleteriousness of annotation as estimated by Ensembl, transcript CCDS status (i.e., amount and type of evidence that supports the existence of a variant), canonical status of transcript (https://m.ensembl.org/Help/Glossary), and transcript length 71 . We used the dbNSFP (version 4.0) 72 plugin to generate additional predictions of variant deleteriousness from five in silico algorithms: Polyphen2 HDIV 73 , Poly-phen2 HVAR 73 , SIFT4G 74 , MutationTaster 75 , and the Likelihood Ratio Test (LRT) 76 .
Trait transformation. For each metabolite, we inverse normalized the Metabolonreported metabolite level, regressed on covariates (age at sampling, Metabolon batch, and lipid-lowering medication use status for lipid traits only), and inverse normalized the residuals. In the single-variant association analyses with BMI adjustment, we also included BMI among the covariates.
Single-variant association analysis. To account for sample relatedness and potential population stratification among the 6136 participants in our analysis set, we carried out single-variant association tests using a linear mixed model in EPACTS (v3.2.6) (https://github.com/statgen/EPACTS) on the normalized residual metabolite values. We limited analysis to the 1391 metabolites successfully measured on ≥ 500 METSIM participants and to the genetic variants with minor allele count (MAC) ≥ 5; we did not impute missing metabolite data. This resulted in 10,914,098 to 16,172,108 variants (median = 16,042,879) tested across the 1391 metabolites, since the number of variants with MAC ≥ 5 varied with the set of individuals successfully measured for each metabolite.
To choose a study-wise significance threshold for the 1391 parallel metabolite GWAS, we carried out PCA across the metabolites to determine the number of principal components required to explain metabolite variation. To account for missing data ( Supplementary Fig. 12), we first imputed missing metabolite values using the K-nearest neighbors approach 77 with K = 5. PCA of the imputed data showed that 692 principal components explained 95% of phenotypic variation for the 1391 metabolites. We therefore used a study-wise significance threshold of P < 5.0 × 10 −8 /692 = 7.2 × 10 −11 for our single-variant analyses. A metabolite quantified in n = 500 participants provided > 80% power at P < 7.2 × 10 −11 to detect variants that explained phenotypic variance ≥ 11% and had MAC ≥ 5.
PheWeb browser. We built a PheWeb browser 59 of the 1391 metabolite GWAS to support interactive visualization, exploration, and download of these results. This PheWeb (https://pheweb.org/metsim-metab) includes Manhattan and quantilequantile plots, summary statistics, and links to biochemical characteristics and functions in the Human Metabolome Database (HMDB) 60 for all 1391 metabolites.
Stepwise conditional analysis. We carried out stepwise conditional analysis in EPACTS (v3.2.6) (https://github.com/statgen/EPACTS) to identify nearindependent association signals. For each metabolite-chromosome pair with at least one single-trait genome-wide significant association (P < 5.0 × 10 −8 ), we first conditioned on the most significant associated variant and continued conditioning on the most significant remaining variant until no variant attained P < 5.0 × 10 −8 .
Fine mapping and credible sets. For each of the 2030 nearly-independent association signals, we built genomic regions of 1 Mb on either side of the index variant, less near chromosome ends. We merged overlapping regions for the same metabolite, resulting in 1501 genomic regions of 1.2 to 3.1 Mb. To identify potential causal variants within each region, we performed fine-mapping analysis using the Deterministic Approximation of Posteriors (DAP-g) method 20 (https://github.com/xqwen/ dap), assigning equal priors to all candidate variants. DAP-g uses individual-level metabolite, genotype, and covariate data to produce fine-mapping results. Since DAPg does not allow for related participants, we corrected for relatedness approximately by including the first ten genetic principal components as covariates; repeating the DAP-g analysis with 0, 20, or 100 principal components yielded similar results.
DAP-g allows for multiple independent association signals within each region. For each identified signal, DAP-g computes (1) a signal posterior inclusion probability (SPIP) that there is at least one causal variant in the signal; and (2) a posterior inclusion probability for each variant (VPIP) that the variant is causal for the signal. For each of the 1952 signals identified in stepwise conditional tests that had SPIP ≥ 0.95 in DAP-g, we constructed a 95% credible set of potential causal variants by ranking the variants in descending VPIP and including variants until their summed VPIP was ≥ 0.95.
Identification of novel associations. To assess which of our metabolite associations were novel, we compiled a list of 381 published metabolite GWAS papers (Supplementary Data 6): 354 from the NHGRI-EBI GWAS catalog 78 (https:// www.ebi.ac.uk/gwas/; release date December 1, 2020); and 27 others from the list curated by Kastenmüller et al. 11 (accessed April 1, 2021). From these papers, we identified 8502 variants with metabolite associations at P < 5.0 × 10 −8 or at the significance threshold used in the paper, whichever was more stringent (Supplementary Data 6). Among these 8502 published variants, 7807 were present in the METSIM imputed genotype data. For 194 of the published variants not present in the METSIM imputed genotype data, we identified proxies (LD r 2 ≥ 0.8 and ≤ 500 kb) using the wave 1 + 2 genome sequence dataset of 5949 METSIM participants. The 7807 variants present in the METSIM imputed genotype data and the 194 LD proxies for missing variants together comprised 8000 unique variants. To avoid problems with multicollinearity, we pruned these 8000 variants at METSIM LD r 2 > 0.99 and ≤ 1 Mb, yielding 6501 LD-pruned variants. Then, for each of the 2030 association signals, we repeated the conditional association analysis including the subset of these LD-pruned variants within ≤ 1 Mb of the corresponding index variant as covariates. We considered as novel signals those index variants with conditional P < 7.2 × 10 −11 and location > 500 kb from any of the 8502 − 7807 − 194 = 501 published variants, which were neither present nor with proxies in the METSIM imputed genotype data. Among the 501 variants, 380 were monomorphic in gnomAD v3.1 Finns (n = 5316).
Knowledge-based approach to gene nomination. To nominate putative causal genes for the 1666 of 2030 signals associated with named metabolites, we employed a two-stage knowledge-based approach 15 . In stage 1, for each variant, we identified the 20 closest protein-coding genes using the minimum distance from the index variant to the refSeq genes' transcription start or end sites. We employed an algorithm to look for lexical overlaps between the associated metabolite and each of the 20 genes. Specifically, we searched automatically for matching strings using customized scripts between: (1) the HMDB 60 metabolite name and synonyms and Entrez gene names; 79 (2) the metabolite and Entrez gene names listed in HMDB as interacting with the metabolite; (3) the metabolite name and Uniprot protein names 75 and their synonyms; (4) the metabolite and its parent classes as defined in HMDB and the Uniprot protein names and their synonyms; (5) the metabolite name and rare disease names linked to each gene in OMIM (Online Mendelian Inheritance in Man, https:// omim.org/, accessed January 1, 2021) after removing the non-specific substrings uria, emia, deficiency, disease, transient, neonatal, hyper, hypo, defect, syndrome, familial, autosomal, dominant, recessive, benign, infantile, hereditary, congenital, early-onset, idiopathic; (6) the metabolite and its parent classes and Gene Ontology (GO) biological process names 80 associated with each gene after removing the non-specific substrings metabolic process, metabolism, catabolic process, response to, positive regulation of, negative regulation of, regulation of (we only considered gene sets of < 500 genes); and (7) Kyoto Encyclopedia of Genes and Genomes (KEGG) 81 maps (https://www.kegg.jp/) containing the metabolite (as defined in HMDB) and KEGG maps containing each gene (as defined in KEGG) omitting the large "metabolic process map". For each of these pairs of terms, we calculated a Pair Distance score ranging from 0 to 1 using the Ruby gem "fuzzy_match" (https://github.com/ seamusabshere/fuzzy_match), and considered a score > 0.5 as a match.
In stage 2, we manually reviewed the evidence collected at stage 1. We selected the biologically most plausible causal gene if we identified experimental evidence linking the gene to the metabolite. >1 putative causal genes could be nominated if > 1 gene was suggested in stage 1 and/or 2; this happened most often when a locus contains multiple paralogs with similar biochemical activity. If no clear experimental evidence existed for any of the 20 genes, no causal gene was selected.
Colocalization of FinnGen disease traits and METSIM metabolites. To identify shared causal variants between METSIM metabolites and FinnGen disease traits, we carried out Bayesian pairwise colocalization analysis using fastENLOC 37,38 (https://github.com/xqwen/fastenloc). We downloaded FinnGen release 4 (https:// www.finngen.fi/en/access_results) FINEMAP 82 -based fine-mapping results for 980 disease traits with at least one association at P < 5.0 × 10 −8 . fastENLOC used these FinnGen fine-mapping results and our DAP-g-based fine-mapping results for METSIM metabolites to carry out colocalization analysis assuming a single causal variant. For each FinnGen disease trait, we estimated its degree of enrichment for genome-wide associations in metabolite GWAS using TORUS 83 (https:// github.com/xqwen/torus) and used this enrichment estimate as the prior for Bayesian analysis in fastENLOC. fastENLOC computes two probabilities. The regional colocalization posterior probability (RCP) is the probability of the same causal variant within a region for both the metabolite and the FinnGen disease trait. The variant colocalization posterior probability (SCP) is the probability a specific variant is causal for both traits. We limited colocalization analysis to the 1952 metabolite stepwise association signals with SPIP ≥ 0.95 for 792 metabolites and present colocalizations for metabolite-FinnGen disease trait pairs with RCP ≥ 0.5 (Supplementary Data 10).
Associations of campesterol with gallstones and vanillylmandelate with hypertension in METSIM. Among the 4698 METSIM participants with measured plasma campesterol level at baseline, we identified 199 with gallstones in METSIM (December 2020). To test for association between plasma campesterol level and presence of gallstones, we used logistic regression with covariates baseline study age, Metabolon batch, and lipid medication use. Among the 5173 METSIM participants with measured plasma vanillylmandelate level at baseline, we identified 1073 individuals with hypertension, and used logistic regression with covariates baseline study age, Metabolon batch, and hypertension medication use to test for association between plasma vanillylmandelate level and hypertension status.
Causal effects between metabolites and FinnGen disease traits. To infer the potential causal effects of plasma campesterol on FinnGen gallstones (phenocode: K11_CHOLELITH) and plasma vanillylmandelate on hypertension (phenocode: I9_HYPTENS), we applied four two-sample Mendelian randomization methods: inverse variance weighted 84 , weighted median 85 , MR-PRESSO 86 , and MR-Egger 87 . These methods make different assumptions and use different strategies to account for horizontal pleiotropy, which can result in false positive inference of causality. For each metabolite, we identified nearly-independent genetic instrumental variables (LD r 2 < 0.1, distance ≥ 500 kb) with unconditional single-variant association P < 10 −6 . We also ran Mendelian randomization analyses to infer the causal effect of FinnGen gallstones (phenocode: K11_CHOLELITH) on campesterol and FinnGen hypertension (phenocode: I9_HYPTENS) on vanillylmandelate in a similar way. We considered findings significant if they had the same effect direction and P < 0.05 for all four Mendelian randomization methods. We present MR-PRESSO effect estimate and p-values in the main text.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
NHGRI-EBI GWAS catalog: https://www.ebi.ac.uk/gwas/. Human Metabolic Individuality: http://www.metabolomix.com/list-of-all-published-gwas-withmetabolomics/. OMIM: https://omim.org/. KEGG: https://www.kegg.jp/. HMDB: https:// hmdb.ca. GTEx portal: https://gtexportal.org/home/. NCBI refSeq Gene: https:// www.ncbi.nlm.nih.gov/refseq/rsg/. Entrez Gene: https://www.ncbi.nlm.nih.gov/gene. UniProt: https://www.uniprot.org. Gene Ontology: http://geneontology.org. dbNSFP: https://sites.google.com/site/jpopgen/dbNSFP. FinnGen genome-wide summary statistics and Bayesian statistical fine-mapping results are available at https://r4.finngen.fi. Full summary statistics from the genome-wide association studies of the 1391 plasma metabolites are available at https://pheweb.org/metsim-metab/. METSIM individual-level data are not publicly available due to privacy restrictions on personal data. The METSIM exome sequencing and genotyping array data will be accessible through dbGaP (https:// www.ncbi.nlm.nih.gov/gap/) with accession numbers phs000752 and phs000919, respectively. The METSIM WGS dataset used in this manuscript (n = 5949) is a subset of the full METSIM WGS data, which will be deposited into dbGaP upon completion, expected in early 2022. The METSIM metabolomics dataset (n = 6490) is a subset of the full METSIM metabolomics data which will be deposited into dbGaP upon completion, expected in March-May 2022. As part of data deposit in dbGAP, we will include ID lists corresponding to the individuals included in this paper. Until these data are available from dbGaP, we will provide access to the data for this paper under a Data Use Agreement to researchers who submit a short description of the proposed biomedical research project to Dr. Michael Boehnke (boehnke@umich.edu). Source data are provided with this paper.