Heritability estimates for 361 blood metabolites across 40 genome-wide association studies

Metabolomics examines the small molecules involved in cellular metabolism. Approximately 50% of the total phenotypic variation in metabolite levels is due to genetic variance, but heritability estimates differ across metabolite classes. We perform a review of all genome-wide association and (exome-)sequencing studies published between November 2008 and October 2018, and identify >800 class-specific metabolite loci associated with metabolite levels. In a twin-family cohort (N = 5,117), these metabolite loci are leveraged to simultaneously estimate the total heritability (h2total) and the proportion of heritability captured by known metabolite loci (h2Metabolite-hits) for 309 lipids and 52 organic acids. Our study reveals significant differences in h2Metabolite-hits among different classes of lipids and organic acids. Furthermore, phosphatidylcholines with a high degree of unsaturation have higher h2Metabolite-hits estimates than phosphatidylcholines with low degrees of unsaturation. This study highlights the importance of common genetic variants for metabolite levels and elucidates the genetic architecture of metabolite classes.

convergence of analyses after successful 'bending' to allow for inversion of the variance-covariance matrix. Additionally, we appreciate your suggestion of including factors which might contribute to non-convergence as quality control filters to be applied to all metabolites and will certainly keep this in mind for potential future work. However, we felt adjusting the quality control filters based on observed results in the same study would be 'bad practice'.

b - [...] not see it as an acceptable fudge to up-truncate SDs because they appear unrealistically small (instead, as with 1a, I think you have to get to the heart of the problem).

We appreciate your suggestion that the very small observed SEs could be due to the use of constrained variances. In this revised manuscript we have based all estimates on the unconstrained variance components and have additionally adjusted how we calculate the SE of h2SNP, which no longer results in unrealistically small SEs. We now use the following procedure to calculate the SE for h2SNP: we randomly sample 10,000 instances from the parameter variance-covariance matrix for each metabolite; the SE of the ratio of interest is then the standard deviation of that ratio across the 10,000 samples. The same approach has been used to calculate the SEs of our redefined h2GW-loci.

c - SNP heritabilities are ideally estimated from an unrelated dataset. The Zaitlen method you use is designed for when there are complex patterns of relatedness, such that pruning will drastically reduce sample size. But here it seems you have fairly simple relationships - many of the closely related are twins?
So it would seem better to first divide your data into two subgroups, such that individuals in each subgroup are unrelated (were all your individuals pairs of twins, you could simply put the first of each pair in group a, the second in group b - as that is not the case, you could instead prune individuals with say a cutoff of 0.05 to make group a, then prune the individuals who were discarded to make group b). It seems you should be able to construct two reasonably sized groups, so that you can then analyze each separately (without requiring Zaitlen's truncated kinship method), and meta-analyze the two sets of estimates.

Thank you for suggesting a possible alternative to the Zaitlen method for including large numbers of twins in the analyses. However, we fear that by splitting the data into two (or more) groups of unrelated individuals and meta-analyzing them together, bias will be introduced at the meta-analytical level, as these groups will not be independent from one another. And while many of the closely related individuals in our dataset are indeed twins (63.4% of the complete sample), our data also include parents, siblings, and offspring of twins, so families are sometimes more complex. Moreover, twin families within these data might well share (distant) kinship. Overall, we feel that meta-analyzing two or more groups would not be the best course of action and have continued to apply the Zaitlen method in the revised manuscript.

d - For a particular phenotype, h2gwas is generally defined as the variance explained by variants detected through gwas (i.e., SNPs achieving genome-wide significance) for THAT phenotype. So to estimate h2gwas, you must first perform a GWAS (or identify a previously performed GWAS for that metabolite), identify the significant SNPs, then compute how much variance they explain.
Hagenbeek instead define h2GWAS as the variance explained by a constant set of 500 SNPs - these SNPs have been picked because they have been determined to be genome-wide significant for A metabolite (i.e., of all the metabolite studies considered, each SNP is associated with at least one), but for a given metabolite the 500 SNPs will almost certainly contain only some of the QTLs, and contain many SNPs which are not QTLs. Therefore, the h2GWAS estimates of Hagenbeek will likely be substantial underestimates for some metabolites (those for which there is a strongly associated SNP which is not one of the 500 considered) or overestimates for others (where the estimate of h2gwas is capturing the variance of lots of non-significant SNPs).

We highly appreciate reviewer comments 1d and 1e, in which the reviewer indicates his suspicion that the published estimates for h2GW-loci will likely be under- or overestimated when including the 500 SNPs reported to be associated with any metabolite. Instead, reviewer 1 suggests in these comments to perform GWAS analyses on each of the metabolites and use the resulting significantly associated SNPs to calculate the explained variance for each metabolite separately. Unfortunately, the sample size available to us is quite small (N = 5,117) and varies greatly across platforms, ranging from 1,448 observations on the Biocrates platform to 4,227 observations on the Nightingale Health platform. In addition, as the reviewer also points out in comment 1g, many metabolites that belong to the same biochemical class or pathway are highly correlated. In separately analyzing metabolites with different sample sizes, without accounting for the correlation among the metabolites, we fear important genetic loci might be 'missed' for those metabolites with a small sample size, which might be picked up by GWA on related metabolites with larger sample sizes.
Taking both these issues into consideration, we decided against performing GWA analyses on each of our metabolites.

While we have not performed GWA studies for each of our metabolites, we do appreciate reviewer 1's caution regarding biased h2GW-loci estimates caused by the inclusion of SNPs not directly associated with the target metabolite. We have attempted to address this issue in several steps. First, we have expanded the list of 500 blood metabolite SNPs to include the blood metabolite-SNP associations in European populations published between November 2008 and October 2018 (40 studies in total), identifying 43,827 blood metabolite SNPs. Next, for all 942 unique metabolites (excluding ratios and unidentified compounds) from the 40 studies, we retrieved the Human Metabolome Database (HMDB) identifiers. Based on the HMDB identifiers (or expert opinion if no HMDB identifier was retrieved), all metabolites were categorized (by the HMDB system) into 44 'classes' (12 'super classes'). For each of the 12 classes represented in our metabolite data we created GRMs capturing solely the metabolite loci for metabolites of this specific class (henceforth we will refer to the heritability estimates derived from these GRMs as h2GW-Class). Metabolite loci for each class were identified by including all published metabolite-SNP associations of the relevant class in a clumping procedure (r2 = 0.1, radius = 500 kb). The lead SNPs identified by the clumping procedure were then used as input to create LD-corrected GRMs in LDAK.
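The clumping step can be illustrated with a simplified greedy sketch (a toy stand-in for the actual PLINK-style clumping used in the analyses; the SNP tuples and the `r2_between` LD lookup are hypothetical):

```python
from typing import Callable, List, Tuple

# A SNP here is (id, chromosome, position, p-value); r2_between is any
# function returning the LD r^2 between two SNP ids (hypothetical lookup).
Snp = Tuple[str, int, int, float]

def greedy_clump(snps: List[Snp],
                 r2_between: Callable[[str, str], float],
                 radius: int = 500_000,
                 r2_max: float = 0.1) -> List[str]:
    """Greedy clumping: walk SNPs from most to least significant, keeping a
    SNP as a lead only if it is not within `radius` bp and in LD (r^2 >
    `r2_max`) with an already-selected lead on the same chromosome."""
    leads: List[Snp] = []
    for snp in sorted(snps, key=lambda s: s[3]):  # most significant first
        clumped = any(snp[1] == lead[1]
                      and abs(snp[2] - lead[2]) <= radius
                      and r2_between(snp[0], lead[0]) > r2_max
                      for lead in leads)
        if not clumped:
            leads.append(snp)
    return [lead[0] for lead in leads]
```

The selected lead SNPs would then feed into GRM construction; real clumping (e.g. in PLINK) additionally applies significance thresholds when choosing index SNPs.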
To address whether the 96 inclusion of associations for metabolites of a different biochemical class lead to biasing of the h 2 GW-loci 97 estimates as included in the submitted manuscript, we used the same technique to create 'non-class' 98 GRMs, which used included the lead SNPs of all metabolite-SNP associations, excluding the relevant class 99 loci and LD proxies (henceforth we will refer to the heritability estimates derived from these GRMs as 100 h 2 GW-Notlass ). This resulted in 4-variance component models, as resolved in the GCTA software. 101 Unfortunately, the proposed 4-variance component models for the 12 different classes of 102 metabolites in our data had high degrees of non-convergence (37.9% total). Non-convergence of the 4-103 variance component models appears to be associated with the number of SNPs in the 'class-specific' 104 GRMs, with small GRMs displaying more convergence issues ( Table A). Suggesting that metabolite-specific 105 'hit GRMs' would have resulted in similarly high non-convergence rates. 106  HydroxyAcidsAndDerivatives  49  5  0  5  OrganooxygenCompounds  51  9  1  8  Protein  80  3  0  3  Glycerophospholipids  85  145  104  41  SteroidsAndSteroidDerivatives  108  7  3  4  Lipoprotein  239  63  62  1  FattyAcyls  260  11  11  0  CarboxylicAcidsAndDerivatives  400  44  43  1  108 To address the convergence issues we've decided to recode the HMDB metabolite classes into 109 HMDB metabolite super classes, reducing the 12 classes to 5 classes. Unfortunately, 3 of the 5 super 110 classes only comprise of a single class and the number of SNPs in these GRMs will not increase. Since these 111 analyses had previously shown poor convergence they were not been repeated. Combining classes into 2 112 super classes with a larger number of SNPs in the GRM results in a convergence rate of 97.8% (Table B). 113 Hagenbeek instead using 2 or 3 component mixed models. 
But in this, the genome-wide GRM will contain many SNPs in high LD with the 500 SNPs, so the genome-wide GRM will "suck up" some of the variance explained by the 500 - at a guess, the resulting estimates of h2gwas would be at least 50% lower than their true value.

We concur with reviewer 1 that we had not addressed the LD among the multiple GRMs included in the 3-variance component models and appreciate reviewer 1's insight into how this might impact the variance explained by the GRM including the metabolite SNPs. As explained in the response to comment 1d, we have now expanded the 3-variance component model to 4-variance component models. We have accounted for the LD between the class-specific and non-class GRMs by removing the class-specific lead SNPs and LD proxies from the non-class GRM. To further address the LD of the class and non-class GRMs with the GRM used to estimate h2SNP, we removed the lead SNPs, their LD proxies, and 50 kb surrounding each SNP in both GRMs from the GRM used to estimate h2SNP.

We greatly appreciate the reviewer's insight with regard to the effect of the correlation structure among metabolites on our analyses. As can be seen in Supplementary Table 12 of the Supplementary Tables, a large number of the metabolites included in our study are indeed highly correlated. As the reviewer mentions, the calculated group means as reported in the main text and in Tables 2-3 could be inflated due to the underlying correlation structure of the metabolites averaged over (though of course their underlying correlation structure is also why we average over these specific groups of metabolites).
To account for this we also applied multivariate mixed-model meta-regression models, corrected for the sample overlap and correlation structure among the metabolites, to assess differences in metabolite heritability estimates among classes.

We would like to thank the reviewer for his honest assessment of our use of the BayesS method. We concur that our choice to only include SNPs previously associated with metabolites, as opposed to genome-wide SNPs, likely leads to biases in our analyses. In addition, we have noticed that the correlation between the submitted BayesS results and the results obtained with a larger dataset (based on all published metabolite-SNP associations since 2008, as described in the response to comment 1d) is lower than anticipated (r = 0.59; p = 1.315e-09). This seems to suggest instability of the BayesS estimates across relatively comparable runs, though at this time we cannot properly address or explain this. Therefore, in accord with the suggestion of reviewer 2, we have removed the BayesS analyses from the manuscript.

Minor comments:

3 - The abstract does not contain SDs (or conf ints) or total sample size; the main text does not contain SDs, and when you mention the sample sizes, could you mention a total please (and not just the range of the sample sizes across the four platforms). Also, please put number of SNPs in main text.

We would like to thank the reviewer for pointing out these obvious caveats to the abstract and main text. The revised main text now includes the SE where relevant; the SE has not been added to the abstract, as the abstract has been rewritten in such a manner that this would not be interpretable.
The total sample size has been included in the abstract, in the final paragraph of the introduction, in the first paragraph of the 'characterization of the heritable influences on lipid and organic acid levels' subsection of the results, in the first paragraph of the discussion, in the 'participants' subsection of the methods, and in Table 1.

significance (really sorry if I missed it).
We highly appreciate that the reviewer indicates it was unclear what the 'significant differences across classes' was founded on. As touched upon in the response to comment 1g, we have used multivariate mixed-model meta-regression models to test for significant mean differences among metabolite classes and lipid species. In the submitted version this was included in the final paragraph of the results section. In an effort to clarify the use of these models, we have now included the results for each comparison after the relevant descriptions of the mean heritabilities. We hope this improves the findability and readability of the results section.

- Please can you provide a few representative histograms of what metabolite levels look like. For your analyses, ideally they would be approximately normal (or maybe bimodal, as then the relative values would still be accurate) - so could you make a few histograms for a few metabolites picked at random to show whether they are. If some are not, you should consider excluding the least-normal looking ones (or outlier individuals).

We thank the reviewer for his concerns regarding possible violations of the normality assumption. We would like to point out that outlier detection of individuals has been performed for all metabolites, as described in the methods. Additionally, all metabolites were transformed prior to analyses. In line with previous studies using the same metabolomics platforms, both 1H-NMR platforms were transformed using inverse rank normalization and are therefore by definition normally distributed; both MS platforms, however, were transformed using their natural logarithm. While ln-transformation generally improves the skewness of the distribution, some highly skewed metabolites will still not have a perfectly normal distribution. Please find the histograms of 8 random metabolites (4 from the Biocrates 'Bioc' platform and 4 from the Lipidomics 'Lip' platform) included here:

We would like to thank the reviewer for pointing out the confusing statement on L383 with regard to the heterozygosity values. We use the PLINK F-statistic (or inbreeding coefficient) to determine heterozygosity. This statistic can actually produce negative values, which commonly reflects sampling error, though large negative values could indicate sample contamination (see http://zzz.bwh.harvard.edu/plink/ibdibs.shtml#inbreeding). We have adjusted the 'genotyping' section in the revised manuscript to reflect this more clearly.
Additional changes have been made to this section to reflect the changes made to the cleaning procedure when updating the genotype data freeze used for our samples (to increase the sample size in the current analyses).
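The PLINK inbreeding coefficient discussed above contrasts observed and expected homozygote counts, which is why sampling error can push it below zero; a minimal method-of-moments sketch (function and variable names are ours):

```python
from typing import Sequence

def inbreeding_f(genotypes: Sequence[int], freqs: Sequence[float]) -> float:
    """Method-of-moments inbreeding coefficient for one individual:
    F = (O - E) / (L - E), where O is the observed number of homozygous
    SNPs, E the number expected under Hardy-Weinberg, and L the number of
    genotyped SNPs. Genotypes are allele counts (0/1/2); freqs are the
    population allele frequencies at the same SNPs."""
    L = len(genotypes)
    O = sum(1 for g in genotypes if g in (0, 2))  # observed homozygotes
    E = sum(1 - 2 * p * (1 - p) for p in freqs)   # expected homozygotes
    return (O - E) / (L - E)
```

An individual heterozygous at every SNP yields a negative F (an excess of heterozygosity), while an excess of homozygotes yields a positive F.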

- L404 - what threshold was used to set unrelated individuals to zero? Did you use 5% (?) like Zaitlen?
We thank the reviewer for noticing we had neglected to include the threshold; the reviewer is correct in assuming we applied the same threshold as in the Zaitlen paper. We have adapted the relevant section accordingly, which now includes the following statement: "for all individual pairs sharing less than 0.05 of their genome their sharing was set to zero".
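The quoted thresholding rule amounts to zeroing small off-diagonal entries of the GRM; a minimal numpy sketch (assuming the GRM is held as a dense square matrix):

```python
import numpy as np

def threshold_grm(grm: np.ndarray, cutoff: float = 0.05) -> np.ndarray:
    """Zaitlen-style thresholding: for all individual pairs sharing less
    than `cutoff` of their genome, set the off-diagonal relatedness to
    zero; diagonal entries are left untouched."""
    out = grm.copy()
    off_diag = ~np.eye(out.shape[0], dtype=bool)
    out[off_diag & (out < cutoff)] = 0.0
    return out
```

In the Zaitlen approach this thresholded matrix is fitted alongside the full GRM, so that close relatives contribute to a second variance component.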

- Supp note 1 - it's very good to test BayesS through simulation - but it seems the heritabilities you used in your simulations (10, 20, 40, 60, 80%) are much higher than those observed for real data.

We thank the reviewer for his concern with regard to the BayesS simulations. However

- In methods, you explain you use LDAK weighted kinships (which obviously I support). But in the main text, you only mention GCTA (which you use for the REML estimation). When you say in L104 that you use GCTA, I think most people will assume you use the GCTA method (i.e., GCTA kinships), rather than just the GCTA REML solver. Therefore, consider changing this (perhaps say you used "GCTA and LDAK", or "GCTA with LDAK kinships", or something similar). Or, considering you only use GCTA for its REML solver, you could instead use the REML solver in LDAK.

We thank the reviewer for pointing out that the distinction in the submitted manuscript with regard to using GCTA only for its REML solver, while creating GRMs in LDAK, was unclear. As per the reviewer's suggestion, the results section has now been adapted to include the following sentence: "analyses were performed in the genome-wide complex trait analysis (GCTA) software and the analyses for 'lipids' and 'organic acids' used unique class-specific and non-class genetic relationship matrices (

We thank reviewer 2 for the assessment with regard to the readability and understandability of the abstract. The entire abstract has been rewritten and expanded by ~50 words.

3. Symbols for indicating different heritabilities are usually in italic, but often not. Please be consistent.

We thank the reviewer for pointing out this oversight. We have made a thorough inspection of the manuscript to ensure all mentions of the heritability estimates have been properly formatted and placed in italics.

4. I find the last part of the paper that deals with selection problematic. In most cases the causal variant or haplotype is not known. Therefore the conclusions drawn from the allele frequency and effect estimates are based on unknown linkage disequilibrium between the associated variant and the true causal allele, and because that relationship is unknown, any conclusions drawn from the MAF and beta can be, and most likely are, erroneous.
Unless the authors can convincingly justify these analyses, they must be removed from the paper.

We would like to thank the reviewer for their honest opinion and assessment of the BayesS method. In their paper introducing the BayesS method, Zeng et al. (2018) acknowledge that not including the causal variants in the GRM will cause the absolute value of the selection parameter, S, to be slightly underestimated due to imperfect tagging. However, as reviewer 1 points out in comment 2, our choice to only include SNPs previously associated with metabolites, as opposed to genome-wide SNPs, leads to further bias in the analyses. In addition, we have noticed that the correlation between the submitted BayesS results and the results obtained with a larger dataset (based on all published metabolite-SNP associations since 2008, as described in the response to comment 1d of reviewer 1) is lower than anticipated (r = 0.59; p = 1.315e-09). We cannot properly address or explain these issues at this time; as such, we agree with reviewer 2's opinion and have removed the BayesS analyses from the manuscript.

Here are my responses to the comments in turn, followed by a quick summary:

==========

1a - thank you for switching to unconstrained variances, and I'm pleased this has increased the convergence rate. I disagree with your reason for not reexamining quality control: "However, we felt adjusting the quality control filters based on observed results in the same study would be 'bad practice'." Were someone to perform a GWAS and find 1000 (almost certainly) false positives, it would not be considered bad practice to go back and reexamine the quality control.
1b - your method for obtaining SDs is acceptable, thank you

1c - you have explained why you would prefer not to follow my suggestion - I see some justification, but in my view it remains that SNP heritabilities, essentially by definition, should be obtained from unrelated samples, and so at the least a verification that results are consistent if you used only unrelated individuals is (in my opinion) required

1d and 1e - I disagree with your reason for not preferring to estimate the variance explained by associations for any metabolite. I remain of the opinion that h2GWAS is, by definition, the variance explained by genome-wide significant hits - so these must be specific to each metabolite. Therefore, either you perform a GWAS for each metabolite, curate a list of metabolite-specific associations from previous literature, or change the aim of your work (i.e., do not attempt to report h2gwas, but some other less standard statistic, h2anymetabolite (but I'm not sure of the benefit of this statistic)). I appreciate the effort of the alternative analysis you performed, but I can't work out what problem this is addressing (but I'm sorry if this is my misunderstanding)

1f - thank you for the extra analysis, I believe that satisfies my original concern.
1g - thank you for the analysis, I believe that suffices

1h - thank you for SDs

2 - authors have removed BayesS from the paper - that solves my concern :)

My minor and very minor comments were addressed.

######
In my view, the estimates of (total) heritability are interesting, but I believe already reported (albeit in less detail). Estimates of h2snp, and how they vary by class, would be interesting, but for the reasons explained above, I'm not convinced by the accuracy. Likewise, estimates of h2gwas would be interesting, but I don't feel these are what are reported. Furthermore, due to the nature of the data (i.e., not by fault of the authors), I note the SDs of estimates appear high to the point that results might not be useful (e.g., the average h2snp for lipids reported is 0.05, with a variance of 0.24 - that means a confidence interval of -.45 to .55, while the c.i. for the reported h2gwas is 0-0.12).

Signed
Doug Speed

Reviewer #2:
Remarks to the Author:
I am happy with the modifications done, I have no further comments.
Reviewer #3:
Remarks to the Author:

General Comments:

General point: While you're estimating the "total" heritability based on SNP data, the 'h2total' is not actually just heritable genetic variation. It includes any environmental variation shared by close relatives, which is likely substantial. Traditional twin models would have h2 and c2 terms, and your model is combining some of the c2 into the h2total, because close relatives share environments to some degree (which likely covaries to some degree with pedigree relatedness), and you haven't accounted for those shared environments elsewhere in the model. Total variance explained would be a better way to treat the term, and to be very clear throughout the manuscript (with appropriate limitations listed in the discussion) of what that estimate is and what it is not. See Zaitlen 2013 PLoS Genetics for more details on that point.

Related to that, showing the distributions of the relatedness estimates would help. I see that a previous reviewer commented on the data structure, asking about twins vs. other relatives, and I think a supplementary figure would be helpful for that. I don't see the application of the Zaitlen thresholded GRM method as problematic, but I do think that there should be a full and accurate description of the sources of variation that can contribute to the variance estimated by the thresholded GRM, which are not just genetic, and which clearly have bearing on your conclusions.
Another limitation with respect to the GW loci would clearly be the power of the underlying 40 studies where the curated list of SNPs was taken from. These had very different sample sizes, ranging 200-11K+. Different metabolites were presumably studied in different papers looking at Supp. Table 1. How might differential power influence the resulting cross-class comparisons in Figures 2 & 3? This is not to say that the current study is flawed, because I think it's a great question to ask how much of the variance the previously-identified GW significant loci explain, but it's clearly a structural component that would factor into the current analyses. In general, more detail on what other factors, from sample size to shared environments to other things, might also be influencing the estimates and what limitations those might have in terms of the general conclusions of the current manuscript is needed.
Tables A & B of your rebuttal should be included in the supplements, with a description of why you did not use metabolite-specific GRMs, which was a question I originally had in reading the manuscript until I read through the previous rebuttals. I think most readers would have the same question.
Regarding the disagreement between using previously-identified GW significant loci or running the GWAS within the current sample, I think using previously-identified loci is perfectly fine. The question could be made a bit clearer in the text, that you are estimating the variance explained by previously identified GWAS hits, and in an independent sample trying to gauge how important they are. The fact that they are identified in separate sample actually is better, as it takes care of possible 'winner's curse' issues with in-sample significant hits (which given the small sample size would be unlikely anyhow). The first reviewer's concern about definitions of genome-wide significant hits being metabolite-specific is certainly valid, but I think it's a bit tangential to what I see as the point of that aspect of the analysis, which is to evaluate how much genetic variance previously-identified metabolite associated loci are explaining. Perhaps simply calling the h2GWAS something else, such as h2metabolite-hits, and being very clear of the goal and purpose of that aspect of the analysis in the manuscript, would address the issue. (I'd like to see a clearer interpretation of that piece anyhow.) LDAK weighted GRMs -see comments below. From the rebuttal, it appears the authors have already substantially changed the analyses at least once, and I realize that there would be substantial work involved in changing or demonstrating the utility of using the LDAK GRMs, which I am sympathetic to. However, the logic of using weighted GRMs using previously clumped SNPs is confusing to me, which seems to be doing the same thing twice and likely removes real signal anyhow. Additional simulations could potentially address these concerns, however.
Detailed Comments: L119: Focusing on the 369 lipids or 'organic acids and derivatives' seems fine, except that of the 402 metabolites classified in L116-117, there are 336 'lipids' and 53 'organic acids' and no 'organic acids and derivatives'. Which set of metabolites did you exactly use? (336+53=389, so it can't be just the lipids + organic acids.) I'm probably missing something, but even so, it should be clear which classes you used in this text as well as Supp. Table 1. L270-273: If you have measures for the same individuals on different platforms, please also provide a measure of reliability or concordance across those measurements.
L349-357: If you had already imputed all individuals to GONL, why also then impute those (GONL-imputed data) to 1KG on the Michigan server? What was gained from that? Please explain why you imputed twice, and what the benefit was of imputing to GONL and then using those imputed data to impute again, as opposed to simply using the array data to impute to 1KG to begin with. L389-390: For GCTA-style analyses of all the G-W significant SNPs simultaneously in one GRM, clumping is unnecessary, and in fact will remove any additional signal that could be gained from additional SNPs with small effects that are in partial LD with the lead SNP. The 'GCTA' model for GRMs can actually account for the LD among these SNPs (see the Yang 2015 Nat. Gen. supplemental derivations). Please clarify or explain in more detail.
L397: LDAK weights the markers by their LD, which you've essentially already done by clumping. What is to be gained by then weighting them somehow? Further, were the LD weights based on the clumped (pruned) set of SNPs or were the weights calculated before the clumping was performed? Again, this will change the weights, because after clumping you've removed much of the LD of the markers in the set, which will increase their weights in the LDAK matrix, and probably change the GRM drastically. With only 500 GW SNPs in the GRM, which are presumably spread across the genome, that were already clumped, I imagine the weighted (LDAK) and unweighted (GCTA) GRMs would be quite similar anyhow (this is certainly not the case for the main weighted GRM, which includes all the imputed variants outside of GW loci, but perhaps for the matrix with only 500 clumped variants, they're largely identical). In general, the logic and reasoning for using the clumping and weighted GRMs isn't clear, and as noted below, the simulations don't really address the issue.
L399-: Did you remove the GW associated SNPs before calculating the G1 and G2 matrices? It's not clear whether there was one pair of G1 and G2 matrices per metabolite or a single pair used for every metabolite, but you wouldn't want to include the GW associated SNPs for a metabolite in the G1 and G2 matrices as well. (I see in L148 of the rebuttal that you removed lead SNPs and those within 50Kb of the lead SNPs from the matrix, which takes care of my concern. That wasn't clear, however, from the methods. Also, the rebuttal paragraph beginning on L362 discusses three variance component models, which I don't see anywhere in the main text, methods, or supplementals. What is this analysis, how was it performed, and where are the results?)

L406: LDAK estimates of relatedness will be altered by the weights. Using those within a thresholded matrix (the V(G2) estimate) is likely not the same thing as thresholding based on an unadjusted GRM (GCTA-based), or if it is close, you should demonstrate that the same pairs are set to 0 for each approach. It would be simple to compare the two matrices and check whether the thresholded (set to 0) pairs are the same. If they are, then that's fine, but if the thresholded pairs are different, then you will have to explore how that changes the estimates, particularly since the Zaitlen et al. paper, and all others I'm aware of using close relatives, have used an unweighted GRM for that purpose.

L420: How did you compare the models? Supplementary Note 3 doesn't describe any formal comparisons. Log likelihood? AIC? BIC? Something else? As the standard errors are likely to be quite large for the h2SNP estimate, just finding similar point estimates doesn't seem appropriate or sufficient.

L432: What exactly did you resample? How did you perform your random sampling? Why not just use the standard approach for this, the delta method? (Yang 2015 applied that approach, and it is also in the back of the classic Lynch and Walsh "Quantitative Genetics" textbook.)

Supplementary Note 2: What were the underlying data used for the simulations? Array SNPs, imputed variants, or WGS? That makes a huge amount of difference with these kinds of simulations, because the array and imputed variants are not a random subset of all WGS variants, but rather are better-tagged and are more common than average (and likely more so than unknown causal variants in real data). Using imputed or array data as the underlying simulation leads to a situation where the LD is baked into the simulations already, matching the LDAK model, so of course it will look 'better', but simply because of the LD relationships already in the data. The scenario of randomly chosen SNPs is also quite different from your strategy of first clumping variants with plink before making the matrix, because in the real data you've forced all the markers in the matrix to be in low LD with one another. With imputed or array variants used as causal variants in your simulations, the simulations are going to show LD-dependent models working better, though that may not have much relevance to real data where the causal variants are unknown, and likely not array SNPs or well-tagged imputed variants. Additional simulations establishing that thresholding the weighted GRM, rather than an unweighted GRM, yields unbiased estimates would be good to see, to address the concern of L406 above.
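For concreteness, the two approaches to the SE of a variance-component ratio discussed here (the resampling procedure described in the rebuttal, and the delta method the reviewer suggests) can be compared in a short sketch. All parameter values below are illustrative assumptions, not estimates from the manuscript:

```python
"""Sketch: SE of a variance-component ratio (e.g. h2_SNP = V_SNP / V_P)
via parametric resampling from the REML parameter covariance matrix,
compared with the delta-method approximation.
All numbers below are illustrative, not values from the manuscript."""
import numpy as np

rng = np.random.default_rng(0)

# Illustrative REML estimates: variance components (V_SNP, V_other, V_E)
theta = np.array([0.20, 0.30, 0.50])
# Illustrative sampling covariance of the estimates (e.g. from the AI matrix)
V_theta = np.array([[4e-3, -1e-3, -5e-4],
                    [-1e-3, 5e-3, -1e-3],
                    [-5e-4, -1e-3, 3e-3]])

def ratio(t):
    # h2_SNP = V_SNP / (V_SNP + V_other + V_E)
    return t[0] / t.sum()

# 1) Resampling: draw 10,000 parameter vectors, take the SD of the ratio
draws = rng.multivariate_normal(theta, V_theta, size=10_000)
se_resample = np.std([ratio(d) for d in draws])

# 2) Delta method: sqrt(grad' V grad), gradient evaluated numerically
eps = 1e-6
grad = np.array([(ratio(theta + eps * np.eye(3)[i]) - ratio(theta)) / eps
                 for i in range(3)])
se_delta = np.sqrt(grad @ V_theta @ grad)

print(se_resample, se_delta)  # the two SEs should agree closely
```

When the ratio is a smooth function of the parameters and the sampling distribution is approximately normal, the two SEs converge; the resampling route simply avoids deriving the gradient analytically.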

Reviewer #1
We very warmly thank the reviewer for taking the time to critically appraise our manuscript once more. We are pleased to have addressed major concerns 1b, 1f-h and 2, as well as the minor and very minor comments, to the satisfaction of reviewer #1. Below we address remaining concerns 1a and 1c-e.

1a - thank you for switching to unconstrained variances, and I'm pleased this has increased the convergence rate. I disagree with your reason for not reexamining quality control: "However, we felt adjusting the quality control filters based on observed results in the same study would be 'bad practice'." Were someone to perform a GWAS and find 1000 (almost certainly) false positives, it would not be considered bad practice to go back and reexamine the quality control.

Upon request of reviewer #1, we reexamined the QC of our metabolomics data, which consisted of the following steps: 1) exclude metabolites with a mean coefficient of variation larger than 25%; 2) exclude metabolites with more than 5% missing data; 3) impute metabolite values below the limit of detection/quantification with half of the limit, or, if this was unavailable, with half of the lowest observed value; 4) impute values for outliers (more than 5 SD removed from the mean) with 'mice'; and 5) apply transformation: inverse normal rank transformation for metabolites measured on NMR platforms and natural logarithm transformation for metabolites measured on mass spectrometry (MS) platforms.

With these criteria we obtained a convergence rate of nearly 100% (97.8%). Of the 8 (2.2%) metabolites that failed to converge, 2 were measured on an MS platform (i.e., natural-log transformed) and 6 on an NMR platform (i.e., inverse rank normal transformed). Therefore, non-normality of the phenotype is unlikely to be the driver behind non-convergence.
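The five QC steps above can be sketched as follows. The column names, the `cv` and `lod` inputs, and the simple median imputation (standing in for the 'mice' multiple imputation used in the manuscript) are all illustrative assumptions:

```python
"""Sketch of the five metabolomics QC steps described above, applied to a
toy samples-by-metabolites pandas DataFrame. Inputs and the median stand-in
for 'mice' imputation are illustrative, not the actual pipeline."""
import numpy as np
import pandas as pd
from scipy.stats import norm

def qc_metabolites(df, cv, lod, platform):
    """df: samples x metabolites; cv/lod/platform: dicts keyed by metabolite."""
    out = {}
    for m in df.columns:
        # 1) drop metabolites with mean coefficient of variation > 25%
        if cv[m] > 0.25:
            continue
        x = df[m].copy()
        # 2) drop metabolites with > 5% missing values
        if x.isna().mean() > 0.05:
            continue
        # 3) values below the limit of detection -> half the limit
        #    (falling back to the lowest observed value if no LOD is known)
        limit = lod.get(m, x.min())
        x = x.mask(x < limit, limit / 2)      # mask() leaves NaNs untouched
        # 4) outliers (> 5 SD from the mean) -> imputed; the manuscript uses
        #    'mice', the median stands in here for brevity
        z = (x - x.mean()) / x.std()
        x = x.mask(z.abs() > 5, x.median())
        # 5) platform-specific transformation
        if platform[m] == "NMR":              # inverse normal rank transform
            n = x.notna().sum()
            x = pd.Series(norm.ppf((x.rank() - 0.5) / n), index=x.index)
        else:                                 # MS: natural logarithm
            x = np.log(x)
        out[m] = x
    return pd.DataFrame(out)
```

After step 5, NMR metabolites are forced toward normality by construction, whereas log-transformed MS metabolites can retain skew, which is consistent with checking residual normality separately, as done below.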
Next, we plotted the residuals (after regressing out the covariates included in the GCTA analyses) of the 6 metabolites analyzed on an NMR platform. As can be seen in Figure A, there are deviations from normality for these 6 metabolites. Plotting the residuals of 6 random metabolites from the same platform also shows deviations from normality, but less severe (Figure B). Together, this suggests non-normality of the residuals as an explanation for non-convergence.

The Zaitlen method has become widely used when including related and unrelated samples in GREML analyses, particularly for twin-family cohorts. We greatly value the comments of the reviewer, and recognize that more methods comparisons are possible. However, we would prefer our paper to focus on a decade of metabolomics GWASs.

Thank you for your detailed comments and suggestions. We have followed the suggestion to redefine the aim of our study. We hypothesize that the genetic variance of metabolite loci of the non-target super classes reflects genetic signals that current GWA and (exome-) sequencing studies for the target metabolite have so far been underpowered to detect. To distinguish our aim from the aim of evaluating how much of the genetic variance in a metabolite is due to the metabolite-specific loci, we no longer refer to the sum of the super class-specific and other super classes variance as h2GW-loci, but as h2metabolite-hits (in line with the suggestion of reviewer #3). We've made the necessary adjustments to the manuscript to better reflect the aim of our analyses: Introduction (L84-89) and Results (L134-155).

Finally, we must address reviewer #1's concern that we are ignoring the request to investigate the genetic variance of only the metabolite-specific loci. We actively argued against using metabolite-specific GRMs to estimate the genetic variance of metabolite-specific loci in our previous rebuttal.
There, we attempted to meet reviewer #1 halfway: instead of using a single GRM with all metabolite loci, we first attempted to create several class-specific GRMs. However, as shown in

We would like to thank the reviewer for appraising our manuscript and rebuttal, and are happy we've been able to modify the previous version of the manuscript to the reviewer's specifications. We trust that the changes to the current version of the manuscript are still in line with reviewer #2's expectations.

Thank you for noting that "using previously-identified loci is perfectly fine". We also appreciate the suggestion to clarify the aim of our study and to use a different term for h2GW-loci, and have done so. To distinguish our aim, estimating the genetic variance of previously identified metabolite loci in an independent sample, from the aim of evaluating how much of the genetic variance in a metabolite is due to the metabolite-specific loci, we no longer refer to the sum of the super class-specific and other super classes variance as h2GW-loci, but as h2metabolite-hits (in line with the suggestion of reviewer #3). We've made the necessary adjustments to the manuscript to better reflect the aim of our analyses: Introduction (L84-89) and Results (L134-155).

LDAK weighted GRMs - see comments below. From the rebuttal, it appears the authors have already substantially changed the analyses at least once, and I realize that there would be substantial work involved in changing or demonstrating the utility of using the LDAK GRMs, which I am sympathetic to. However, the logic of building weighted GRMs from previously clumped SNPs is confusing to me, which seems to be doing the same thing twice and likely removes real signal anyhow. Additional simulations could potentially address these concerns, however.

Thank you for recognizing the huge amount of work we did. We appreciate the concern with regard to clumped SNPs in a weighted LDAK GRM. In the first draft of our manuscript we used the weighted LDAK GRM without prior clumping of the SNPs. However, in the previous round of rebuttals, reviewer #2 had the following concern: "How is the GW-loci GRM affected by LD between the SNPs that were used to build it? Many of the SNPs are from the same genetic loci and can describe the same association signal but just have another metabolic measure associated with it [..]". To accommodate reviewer #2, we included clumped GRMs in the second draft of the manuscript.

To address the concern of reviewer #3 about overcorrecting a weighted GRM when using clumped SNPs, and possibly removing real signal, we compared the clumped and not-clumped LDAK GRMs. We used the GCTA power calculator (https://cnsgenomics.shinyapps.io/gctaPower/) to compare the estimated SE's and power for the different GRMs included in our manuscript using clumped and not-clumped SNPs. The differences between the GRMs using clumped or not-clumped SNPs are very small (Supplementary Table 16). Across all 5 GRMs, the average mean differences between the clumped and not-clumped GRMs are negative, indicating that clumped GRMs result in smaller SE's, though with a minimal difference (Supplementary Table 16).
Similarly, the average power differences are zero or positive (again with minimal differences), indicating that clumped GRMs are as powerful as, or more powerful than, not-clumped GRMs (Supplementary Table 16). While the gain of using a clumped weighted LDAK GRM versus a not-clumped LDAK GRM is small, we feel that given the large SE's we observed in our study, the choice for clumped weighted LDAK GRMs is valid. This description has been included in Supplementary Note 3.

Supplementary Table 16. Results of the power analysis, using the GCTA power calculator, comparing the clumped LDAK GRMs with the not-clumped LDAK genetic relatedness matrices (GRM). The Type I error rate, α, has been set to 0.05 for all power calculations. We've calculated the power assuming the true SNP

Supplementary Table 17. Results of the power analysis, using the GCTA power calculator, comparing the clumped LDAK GRMs with the clumped GCTA genetic relatedness matrices (GRM). The Type I error rate, α, has been set to 0.05 for all power calculations. We've calculated the power assuming the true SNP
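The GCTA power calculator comparisons above rest on a standard approximation (after Visscher et al. 2014) in which the SE of h2SNP depends only on sample size and the variance of the off-diagonal GRM elements, with power obtained from a non-central chi-square. A minimal sketch, assuming illustrative inputs (the default var_grm = 2e-5 applies to unrelateds; in a twin-family cohort with close relatives retained, the off-diagonal variance, and hence the power, would be considerably larger):

```python
"""Sketch of the quantitative-trait GREML power approximation used by the
GCTA power calculator (after Visscher et al. 2014). Inputs are
illustrative, not the values from Supplementary Tables 16-17."""
import numpy as np
from scipy.stats import chi2, ncx2

def greml_power(n, h2, var_grm=2e-5, alpha=0.05):
    """n: sample size; h2: assumed true SNP heritability;
    var_grm: variance of the off-diagonal GRM elements."""
    se = np.sqrt(2.0 / (n**2 * var_grm))   # approximate SE of the h2 estimate
    ncp = h2**2 / se**2                    # non-centrality parameter
    thresh = chi2.ppf(1 - alpha, df=1)     # central chi-square threshold
    power = 1 - ncx2.cdf(thresh, df=1, nc=ncp)
    return se, power

se, power = greml_power(n=5117, h2=0.3)
print(f"SE ~ {se:.3f}, power ~ {power:.2f}")
```

Because the SE scales as 1/n, small differences in the effective off-diagonal variance between clumped and not-clumped GRMs translate into only minor SE and power differences, consistent with the comparison reported in Supplementary Table 16.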