Genomic analysis of male puberty timing highlights shared genetic basis with hair colour and lifespan

The timing of puberty is highly variable and is associated with long-term health outcomes. To date, understanding of the genetic control of puberty timing is based largely on studies in women. Here, we report a multi-trait genome-wide association study for male puberty timing with an effective sample size of 205,354 men. We find moderately strong genomic correlation in puberty timing between sexes (rg = 0.68) and identify 76 independent signals for male puberty timing. Implicated mechanisms include an unexpected link between puberty timing and natural hair colour, possibly reflecting common effects of pituitary hormones on puberty and pigmentation. Earlier male puberty timing is genetically correlated with several adverse health outcomes and Mendelian randomization analyses show a genetic association between male puberty timing and shorter lifespan. These findings highlight the relationships between puberty timing and health outcomes, and demonstrate the value of genetic studies of puberty timing in both sexes.

-The questions asked of the UK Biobank participants seem very broad. Were they given any indication of what 'average' was? Did you use any of the data from the follow-ups to provide some sort of validation of their answer? If so, how did you account for the individuals that changed answers?
-The SNP effect estimates from the analysis of early voice breaking and early facial hair in the UK Biobank will be in the opposite direction to the other traits. What impact does this have on the MTAG analysis? Would you get the same result if all of the effect estimates were aligned to later pubertal timing?
-I was under the impression that MTAG reported the adjusted estimates for all traits included in the analysis, or performed an inverse-variance weighted meta-analysis that accounts for sample overlap in the GWAS results assuming that the GWAS are all measuring the same phenotype. What was used for this analysis?
-The MTAG method will be somewhat new to many readers and makes some fairly strong assumptions. Could some discussion of these assumptions of MTAG be included, and whether the authors think they are (or aren't) justified in their analysis?
-Were the FDR calculations conducted that are recommended by the MTAG authors?
-What is the adj_r2 presented in supplementary table 5 for ALSPAC results? If this is the adjusted r2 from the full model, then I think it is slightly misleading as it is the variance explained by all covariates in the model, adjusting for the number of covariates, not just by the allele score.
-Would the authors please provide some interpretation of the ALSPAC results, particularly regarding the direction of effect for each phenotype? For example, the beta coefficients for the voice breaking phenotype are negative in the early phases but positive in the latter two and positive for age at voice breaking.
-A number of large GWAS studies with smaller replication samples are using the slope of a regression analysis between the beta coefficients as evidence for replication (see for example the latest GIANT GWAS, PMID: 30124842). Would a similar analysis be useful in ALSPAC, where the age of voice breaking and facial hair appearance measures are more precisely measured, to complement the existing analysis and provide further evidence of whether the SNPs replicate?
-Why were the 23andMe voice breaking data used for the genetic heterogeneity analysis and not the MTAG results?
-The traits used in the MR analyses seem very cherry-picked. Can the authors provide justification why they used hair colour and longevity in their MR analyses? And why they did not investigate traits like cardiovascular disease and type 2 diabetes that they mention in the first paragraph of the introduction and which show strong genetic correlations?
-When investigating the relationship between hair colour and puberty timing, the authors state that "this sex difference appears following the progressive darkening of hair and skin colour during adolescence". Therefore, could the timing of puberty cause darkening of the hair and/or skin? Would it be worthwhile conducting a bidirectional MR analysis to investigate whether this is possible?
-Why were the 5 SNPs identified in a 2010 meta-analysis of 23andMe data used in the sensitivity analysis of hair colour and puberty timing? Could the larger number of SNPs that reached genome-wide significance in 23andMe in Hysi et al. (with corresponding effect sizes) be used instead as this would provide a more powerful instrument?
-In supplementary table 14, it looks like significant heterogeneity is identified in the MR analyses with prostate cancer. Some discussion of this should be presented.
-In the discussion, the authors mention that SRD5A2 (MEMO1) is a newly highlighted gene implicated in pubertal timing, the disruption of which leads to male-specific reproductive disorders. However, it is actually more strongly associated with AAM than voice breaking in 23andMe. Could the authors please provide more explanation?
Minor
-A number of the values in column Y of Supplementary Table 2 (Concordant direction (with Menarche) for the UKBB FH (Early) effect) are incorrect. For example, rs6931884 in row 3 is not concordant (increasing age at AM and increasing early facial hair).
-In the results section the authors say all 78 SNPs were used in the ALSPAC replication analysis, but in the methods section they state that only 73 SNPs were used. Which is correct?
-Including the gene names in the supplementary tables will allow for easier comparison with the figures and text.
-Can the authors provide a reference for the statement, "Two variants located near to genes that are disrupted in rare disorders of puberty…"?
-The supplementary tables and figures could be described in more detail. For example, what are the axes on each figure? What is the colour coding for each figure (for example, SF2)? What are each of the columns in the tables?
Reviewer #2: Remarks to the Author: In this paper, the authors perform an unprecedentedly high-powered GWAS of puberty timing in males by combining summary statistics for several related phenotypes using MTAG (Turley et al. 2018). This analysis produces results very similar to those found in a previous study of puberty timing in females (Day et al. 2017), with a single exception where an association has a significant effect in the opposite direction. They find a substantial amount of genetic correlation between puberty timing and several other complex traits. They follow up on some of these in Mendelian Randomization (MR) analyses. I overall thought the project was well-executed and well-written. The comparison of their results to the previous work in females I found especially interesting. In contrast, I thought the MR analyses were quite weak and should either be greatly revised or dropped from the paper. Beyond this, I only have a few other concerns. I go into further detail on these concerns below.
Major Points: 1) I found the MR work in the paper to be very unconvincing. To be ready for publication, the text would require substantial softening of the language around the credibility of the results, a much more thorough explanation of the assumptions of the approach and why readers should expect them to be satisfied, or the analyses should be removed entirely.
1a) MR requires strong assumptions about the absence of alternate pathways from the instrumental variable (in this case, the SNPs) to the outcome variable (e.g., timing of puberty) except through the exposure variable (e.g., hair color). MR-Egger and heterogeneity tests rule out certain forms of pleiotropy that would violate this assumption, but not all of them. For example, if a third latent factor causes both the exposure and outcome variables, we would pass both tests, but a causal interpretation would not be appropriate. In the conclusion, the authors even identify potential pathways that may violate these assumptions (androgens vs estrogens). Why should we think these don't exist? These concerns also exist for the MR analyses of the effect of puberty timing on lifespan and prostate cancer. If the SNPs identified in the puberty timing analysis operate by changing the concentration of hormones in the body, how can we be sure that these hormones are not affecting the outcome traits directly as opposed to through puberty timing?
1b) With respect to the hair color phenotype specifically (which gets much more real estate in the paper), I'm further concerned by the timing. In many individuals, I believe hair color is not determined until the early 20s, which is why the standard survey question asks for natural hair color at age 20 (Han et al. 2008), whereas puberty has long passed for most by then. While hair color later in life may be highly correlated with early life hair color, its variability still makes it more difficult to believe that it could be the cause of puberty timing and not potentially the opposite.
1c) It seems that in several cases, the authors use two-sample IV even when there is sample overlap between the two samples. Why do they lead with these results when the standard errors are unreliable? I'm not totally sure, but it's possible that the standard errors are actually too large when there is sample overlap. This would make their standard errors conservative, but if they would like to lead with those results, they should state and justify this claim explicitly.
1d) For the two exemplar traits, how were these selected? From the point of view of the reader, it makes a difference if these were the only two phenotypes considered, or if the authors tried several and reported the ones that seemed most interesting. This is especially the case since the two examples they present are both only marginally significant and wouldn't survive a multiple testing correction.
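The concerns in point 1 can be made concrete with a minimal sketch of a fixed-effect inverse-variance weighted (IVW) MR estimate together with Cochran's Q heterogeneity statistic. This is not the authors' pipeline; the function name `ivw_mr` and all summary statistics below are hypothetical and serve only to illustrate the calculation.

```python
from math import sqrt

def ivw_mr(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate with Cochran's Q.

    Each SNP contributes a Wald ratio (outcome effect / exposure effect),
    weighted by the first-order inverse variance of that ratio.
    """
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    weights = [(bx / se) ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    std_error = sqrt(1.0 / total)
    # Cochran's Q: weighted spread of the per-SNP ratios around the pooled
    # estimate; compare with a chi-square on (number of SNPs - 1) df.
    q_stat = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    return estimate, std_error, q_stat

# Hypothetical summary statistics for three instruments (illustration only).
bx = [0.10, 0.20, 0.15]      # SNP effects on the exposure
by = [0.05, 0.10, 0.075]     # SNP effects on the outcome
se_by = [0.02, 0.03, 0.025]  # standard errors of the outcome effects
estimate, std_error, q_stat = ivw_mr(bx, by, se_by)
```

A small Q only indicates that the per-SNP ratios agree with one another. As noted in 1a, a latent factor that drives both exposure and outcome through every instrument would leave Q small while still invalidating a causal reading.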
2) The sign concordance between the results for males and the previous results for females seems very high to me. Is it perhaps due to relatedness between the male and female UKB samples? If you assume the effect size estimates are exactly the same in men and women, given the effect size estimates (with perhaps a winner's curse correction), what is the expected sign concordance?
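The expected sign concordance asked about in point 2 can be sketched under a simple assumption: if the true effect in the second sample equals the reported estimate, the second estimate is normally distributed around it and shares its sign with probability Φ(|β|/SE). The sketch below applies no winner's curse correction, and the function name and input values are hypothetical.

```python
from statistics import NormalDist

def expected_sign_concordance(betas, replication_ses):
    """Expected fraction of replication estimates sharing the discovery sign.

    Assumes each true effect equals the discovery estimate and that the
    replication estimate is normal with the given standard error.
    """
    standard_normal = NormalDist()
    probabilities = [
        standard_normal.cdf(abs(beta) / se)
        for beta, se in zip(betas, replication_ses)
    ]
    return sum(probabilities) / len(probabilities)

# Hypothetical effect sizes (years per allele) and replication standard errors.
example_betas = [0.05, -0.03, 0.10]
example_ses = [0.02, 0.03, 0.025]
concordance = expected_sign_concordance(example_betas, example_ses)
```

Shrinking the discovery estimates towards zero before calling the function would approximate a winner's curse adjustment and lower the expected concordance.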
3) The authors use a clumping algorithm that is just based on distance. This is a bit unusual. I looked up the most recent GWAS papers in Nature Genetics, and they both had some sort of r2 threshold in addition to a distance threshold (Maguire et al. 2018, Evangelou et al. 2018). I can understand why they would want to make these results comparable to their previous study on puberty timing in females, but it would also be nice to have some sense of if the SNPs would survive an r2-based locus definition as well. For example, what's the maximum r2 between any pair of "independent" SNPs in your analysis?
Minor points:
1) MTAG produces output for each phenotype. It was not clear to me which phenotype was used when the authors referred to the MTAG output. I'm guessing from some of the figure labels that they used the 23andMe phenotype? If so, this should be clarified. It would also be good to justify using this measure over the other phenotypes.
3) On page 13, the authors state, "The effect estimate for the association of each SNP on the trait of interest is then derived using a marginal likelihood function,..." MTAG is a moments-based method and not a likelihood-based method.
4) On page 14, the authors state, "We used a conservative genome-significant P-value threshold of P<5 x 10^-8 to determine significant SNP associations." This is the standard threshold, not a conservative one. I'd drop the word, "conservative".
Han, J., Kraft, P., Nan, H., Guo, Q., Chen, C., Qureshi, A., ... & Martin, N. G. (2008). A genomewide association study identifies novel alleles associated with hair color and skin pigmentation. PLoS genetics, 4(5), e1000074.
Reviewer #3: Remarks to the Author: Hollis, Day and coworkers report the results of a multi-trait genome-wide association study for male puberty timing including more than 200,000 subjects. The study identifies 78 independent loci, 29 of which have not previously been associated with pubertal timing. The genetic correlation with health outcomes is further characterized, with particular emphasis on a potential causal relationship between early onset puberty, prostate cancer risk, and a shorter lifespan.
The paper addresses an important and interesting topic. Previously female pubertal timing has been extensively analyzed, but less is known about the genetic underpinnings of male pubertal timing. Moreover, the current analyses are based on an impressive number of samples.
Nonetheless the phenotype measurements used in the study impose a bit of a challenge. While pubertal timing is a continuous trait, the phenotype assessment is dichotomized and based on recall, i.e. the timing of voice breaking and the appearance of facial hair were assessed based on a scale "younger than average", "about average", and "older than average". The authors have then used a somewhat unusual approach to construct the GWAS models, i.e. two sets of binary models per phenotype have been analyzed ("younger" vs. "average" and "average" vs "older"), and all four models have then been combined together with a continuous model for voice breaking using Multi-Trait Analysis of GWAS (MTAG). With this setup, the identical control group has been used twice for the voice breaking and the appearance of facial hair analyses.
-What might the impact be of utilizing the identical control groups many times in the MTAG-setting?
-Has MTAG been validated for a similar analysis strategy before?
-Does running a GWAS analysis of the extremes, i.e. "younger" vs. "older" for both phenotypes, and combining the summary stats with the continuous trait analysis yield similar results? Obviously, such a strategy would result in a considerable reduction of the number of samples, but would there be similar signals still?
-Even if the authors have attempted to seek collective confirmation of the 78 signals identified by MTAG by computing a polygenic risk score and testing for association with timing of voice breaking in an independent cohort, one might still attempt to seek confirmation of single loci, i.e. loci that show very modest association in the individual GWAS-analyses, e.g. rs1979835, rs17193410, rs7136086, rs10110581, rs10164550, rs10765711, by ordinal regression analysis.
-One of the main results is the reported association between pubertal timing and three loci previously linked with pigmentation. To control for population stratification the phenotypic relationship has been assessed in white males of European ancestry only, also including 40 genetic principal components. Because the identification of an association between a phenotype and pigmentation loci raises the question of underlying population structure, the authors could provide additional information on potential genomic inflation, i.e. by providing information on the genomic inflation factor and the qq-plots for the individual GWAS-analyses.
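For reference, the genomic inflation factor requested here is conventionally the median observed association chi-square statistic divided by the median of a 1-df chi-square distribution (about 0.4549). A minimal sketch, assuming two-sided P-values from a single GWAS are available as a list; the function name and the synthetic inputs are hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(p_values):
    """Lambda GC: median observed 1-df chi-square over its null median.

    Two-sided P-values are converted to chi-square statistics through the
    squared standard-normal quantile of p / 2.
    """
    standard_normal = NormalDist()
    chi2_stats = [standard_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN

# Synthetic, perfectly uniform P-values (illustration only): lambda near 1.
uniform_p = [(i + 0.5) / 1001 for i in range(1001)]
lambda_null = genomic_inflation(uniform_p)

# Systematically smaller P-values give lambda above 1.
lambda_inflated = genomic_inflation([p ** 2 for p in uniform_p])
```

For a highly polygenic trait in a large sample, lambda above 1 can reflect true signal as well as stratification, so it is usually read alongside the QQ plot rather than on its own.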

Reviewer 1
• The questions asked of the UK Biobank participants seem very broad. Were they given any indication of what 'average' was? Did you use any of the data from the follow-ups to provide some sort of validation of their answer? If so, how did you account for the individuals that changed answers? Response: No indication was given on the 'average'. Reassuringly, the proportion of individuals changing responses between follow-ups was very low (0.5% for facial hair and 0.3% for voice breaking) and concordance between voice breaking and facial hair reports was high (see Table S1). We used the initial questionnaire assessment for all analyses.
• The SNP effect estimates from the analysis of early voice breaking and early facial hair in the UK Biobank will be in the opposite direction to the other traits.
• Would the authors please provide some interpretation of the ALSPAC results, particularly regarding the direction of effect for each phenotype? For example, the beta coefficients for the voice breaking phenotype are negative in the early phases but positive in the latter two and positive for age at voice breaking.

Response:
We have clarified in the text (page 7) that the direction of the ALSPAC results is consistent with the polygenic risk score for later puberty timing. We have also clarified the units of responses in Table S5.
• A number of large GWAS studies with smaller replication samples are using the slope of a regression analysis between the beta coefficients as evidence for replication (see for example the latest GIANT GWAS, PMID: 30124842). Would a similar analysis be useful in ALSPAC, where the age of voice breaking and facial hair appearance measures are more precisely measured, to complement the existing analysis and provide further evidence of whether the SNPs replicate?
Response: The small replication sample in ALSPAC (N=2,394), compared to our discovery sample (205,354 men for continuous pubertal timing), limits the ability to confirm effects of individual SNPs. We therefore tested a polygenic risk score for later puberty timing, as our downstream analyses, i.e. Mendelian randomisation and genetic prediction, are score-based.
• Why were the 23andMe voice breaking data used for the genetic heterogeneity analysis and not the MTAG results? Response: As described above, for the sex-heterogeneity analysis (age at voice breaking versus AAM in females) we used the effect estimates from the 23andMe sample only as these were directly reported on a continuous scale (in years) which is consistent with the AAM variable.
• The traits used in the MR analyses seem very cherry-picked. Can the authors provide justification why they used hair colour and longevity in their MR analyses? And why they did not investigate traits like cardiovascular disease and type 2 diabetes that they mention in the first paragraph of the introduction and which show strong genetic correlations? Response: We explored hair colour following the annotation of novel loci in our discovery MTAG: at least three loci overlapped genes previously associated with pigmentation. The MR analyses were chosen to confirm previous reports based on AAM variants and lifespan (PMID: 28873088) and prostate cancer (PMID: 28436984).
• When investigating the relationship between hair colour and puberty timing, the authors state that "this sex difference appears following the progressive darkening of hair and skin colour during adolescence". Therefore, could the timing of puberty cause darkening of the hair and/or skin? Would it be worthwhile conducting a bidirectional MR analysis to investigate whether this is possible? Response: As stated below in response to a similar comment from Reviewer 2, the hypothesis being tested is not whether there is a causal relationship between puberty timing and hair colour, but whether the biology underlying these traits is shared. The manuscript has been amended to state this explicitly. With regard to the suggestion of a bidirectional MR, while this may be informative, the summary data for the hair colour GWAS are unfortunately not publicly available.
• Why were the 5 SNPs identified in a 2010 meta-analysis of 23andMe data used in the sensitivity analysis of hair colour and puberty timing? Could the larger number of SNPs that reached genome-wide significance in 23andMe in Hysi et al. (with corresponding effect sizes) be used instead as this would provide a more powerful instrument?

Response: We used the older 2010 GWAS for hair colour (in 23andMe) as a sensitivity test that avoids any overlap between discovery sample (for hair colour) and voice breaking sample (in UK Biobank). This rationale is described in the text.
• In supplementary table 14, it looks like significant heterogeneity is identified in the MR analyses with prostate cancer. Some discussion of this should be presented.

Response: Due to the removal of the two SNPs as discussed above, when we re-ran the prostate cancer analysis, the p-value crept to the other side of significance (p=0.06). As a result we have reduced the prominence of the prostate cancer analysis both in the paper and in the abstract. We now feel that commenting on the heterogeneity of a non-significant result may not be of as much interest to a reader.
• In the discussion, the authors mention that SRD5A2 (MEMO1) is a newly highlighted gene implicated in pubertal timing, the disruption of which leads to male-specific reproductive disorders. However, it is actually more strongly associated with AAM than voice breaking in 23andMe. Could the authors please provide more explanation? Response: We did not mean to indicate SRD5A2 as a male-specific signal (the results text on Page 8 describes this also as an AAM signal) and there is no evidence for sex heterogeneity (P=0.99; Table S6). For balance, we have added in the Discussion that higher 5-alpha-reductase activity has been reported in women with polycystic ovary syndrome.
• A number of the values in column Y of Supplementary
• Can the authors provide a reference for the statement, "Two variants located near to genes that are disrupted in rare disorders of puberty…"? Response: The references have been added (page 8).
• The supplementary tables and figures could be described in more detail. For example, what are the axes on each figure? What is the colour coding for each figure (for example, SF2)? What are each of the columns in the tables?

Response: Details have been added to the supplementary tables and figures.
• I found the MR work in the paper to be very unconvincing. To be ready for publication, the text would require substantial softening of the language around the credibility of the results, a much more thorough explanation of the assumptions of the approach and why readers should expect them to be satisfied, or the analyses should be removed entirely.

Response:
We have updated the text describing the MR analyses in response to reviewers' comments to address the specific comments below. Specifically we have revised any text suggesting a direct causal effect of hair colour on puberty timing, while explicitly stating our hypothesis that hair colour and puberty timing share common underlying pathways which MR tests more robustly compared to non-genetic associations (page 9).
• MR requires strong assumptions about the absence of alternate pathways from the instrumental variable (in this case, the SNPs) to the outcome variable (e.g., timing of puberty) except through the exposure variable (e.g., hair color). MR-Egger and heterogeneity tests rule out certain forms of pleiotropy that would violate this assumption, but not all of them. For example, if a third latent factor causes both the exposure and outcome variables, we would pass both tests, but a causal interpretation would not be appropriate. In the conclusion, the authors even identify potential pathways that may violate these assumptions (androgens vs estrogens). Why should we think these don't exist? These concerns also exist for the MR analyses of the effect of puberty timing on lifespan and prostate cancer. If the SNPs identified in the puberty timing analysis operate by changing the concentration of hormones in the body, how can we be sure that these hormones are not affecting the outcome traits directly as opposed to through puberty timing? Response: We used MR to evaluate the association between hair colour-related variants and puberty timing to provide evidence that these traits have shared biological mechanisms. As the reviewer correctly points out, had we been assessing the explicit hypothesis that hair colour causes a change in puberty timing, the results would have to be interpreted differently given the possibility of alternate pathways which are unrelated to the instrumental variable. Given the temporal relationship between puberty and hair colour, it seems logical that pathways underlying hair colour influence puberty timing, rather than vice versa.
• With respect to the hair color phenotype specifically (which gets much more real estate in the paper), I'm further concerned by the timing. In many individuals, I believe hair color is not determined until the early 20s, which is why the standard survey question asks for natural hair color at age 20 (Han et al. 2008), whereas puberty has long passed for most by then. While hair color later in life may be highly correlated with early life hair color, its variability still makes it more difficult to believe that it could be the cause of puberty timing and not potentially the opposite. Response: As described above, the hypothesis being tested here is not whether hair colour influences puberty, but rather whether there is shared biology. As mentioned in response to a similar comment from Reviewer 1, while a bi-directional MR may still be informative, the summary data from the hair colour GWAS are not publicly available.
• It seems that in several cases, the authors use two-sample IV even when there is sample overlap between the two samples. Why do they lead with these results when the standard errors are unreliable? I'm not totally sure, but it's possible that the standard errors are actually too large when there is sample overlap. This would make their standard errors conservative, but if they would like to lead with those results, they should state and justify this claim explicitly. Response: For the hair colour MR to which the reviewer refers, the first stage of the investigation was to check whether there was an association using the more robust and statistically powered hair colour instrument from Hysi et al. However, as mentioned in the manuscript, there is sample overlap, with UK Biobank being used in the hair colour discovery GWAS (~46% of the total sample). While this does not preclude the use of two-sample MR, we then sought to confirm these findings using the 5-SNP score, which was less powered but contained summary results from non-overlapping samples. Incidentally, this is likely a broader problem in the field when data for two-sample MR are based on the results of large-scale consortium meta-analyses; the lengths that we have gone to in order to address this therefore exceed usual practice.
• For the two exemplar traits, how were these selected? From the point of view of the reader, it makes a difference if these were the only two phenotypes considered, or if the authors tried several and reported the ones that seemed most interesting. This is especially the case since the two examples they present are both only marginally significant and wouldn't survive a multiple testing correction.
Response: See response to Reviewer 1, above, on the choice of MR traits (prostate cancer and lifespan, to confirm previous analyses that used AAM loci).
• The sign concordance between the results for males and the previous results for females seems very high to me. Is it perhaps due to relatedness between the male and female UKB samples? If you assume the effect size estimates are exactly the same in men and women, given the effect size estimates (with perhaps a winner's curse correction), what is the expected sign concordance? Response: We do not find the high concordance in directions of effects between sexes to be surprising as studies of rare disorders of puberty and animal models find substantial shared biology.
• The authors use a clumping algorithm that is just based on distance. This is a bit unusual. I looked up the most recent GWAS papers in Nature Genetics, and they both had some sort of r2 threshold in addition to a distance threshold (Maguire et al. 2018, Evangelou et al. 2018). I can understand why they would want to make these results comparable to their previous study on puberty timing in females, but it would also be nice to have some sense of if the SNPs would survive an r2-based locus definition as well. For example, what's the maximum r2 between any pair of "independent" SNPs in your analysis? Response: In response to this comment, we have identified correlations between two pairs of voice breaking loci: LD r2 = 0.595 between rs138625771 and rs75602844; LD r2 = 0.293 between rs112881196 and rs72789842. All other SNP pairs have r2 < 0.006. We have revised the paper throughout to clarify that we identify 76 independent signals for male puberty timing (there are still 29 signals not previously associated with puberty in either sex).
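The pairwise check described in this exchange can be sketched as follows, assuming per-individual allele dosages (0, 1 or 2) are available for each lead SNP. The squared genotype correlation computed here approximates LD r2; it is not necessarily how the authors derived their values. The SNP labels and dosages are hypothetical.

```python
from itertools import combinations

def squared_correlation(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return (sxy * sxy) / (sxx * syy)

def max_pairwise_r2(dosages):
    """Return (r2, snp_a, snp_b) for the most correlated pair of SNPs."""
    return max(
        (squared_correlation(dosages[a], dosages[b]), a, b)
        for a, b in combinations(sorted(dosages), 2)
    )

# Hypothetical dosages for three lead SNPs in six individuals.
toy_dosages = {
    "snp_a": [0, 1, 2, 1, 0, 2],
    "snp_b": [0, 1, 2, 1, 0, 2],  # identical to snp_a, so r2 is 1
    "snp_c": [2, 0, 1, 1, 2, 0],
}
top_r2, first_snp, second_snp = max_pairwise_r2(toy_dosages)
```

Restricting the pair enumeration to SNPs on the same chromosome would keep the computation small for a list of several dozen signals.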
• MTAG produces output for each phenotype. It was not clear to me which phenotype was used when the authors referred to the MTAG output. I'm guessing from some of the figure labels that they used the 23andMe phenotype? If so, this should be clarified. It would also be good to justify using this measure over the other phenotypes.
Response: The effect estimates reported here are from MTAG inverse-variance weighted meta-analysis models for continuous age at voice breaking (from the 23andMe sample) as the output trait, as this corresponds directly with the AAM trait studied in females.
• On page 13, the authors state, "The effect estimate for the association of each SNP on the trait of interest is then derived using a marginal likelihood function,..." MTAG is a moments-based method and not a likelihood-based method.

Response:
The bioRxiv version of the MTAG paper (https://www.biorxiv.org/content/biorxiv/early/2017/03/20/118810.full.pdf) stated (page 4): "To derive the MTAG estimator of the effect of SNP on trait , we derive the marginal likelihood function". However, as this line does not appear in their peer-reviewed version, we have now removed this phrase from our paper (page 14).
• On page 14, the authors state, "We used a conservative genome-significant P-value threshold of P<5 x 10^-8 to determine significant SNP associations." This is the standard threshold, not a conservative one. I'd drop the word, "conservative". Response: We have removed this description (page 15).
• How do you treat red hair in the hair color GWAS? I believe that red hair is produced by a different biological process than other colors (Han et al 2008).

Response:
We considered red hair colour as a separate category in our non-genetic analyses in UK Biobank (
• One of the main results is the reported association between pubertal timing and three loci previously linked with pigmentation. To control for population stratification the phenotypic relationship has been assessed in white males of European ancestry only, also including 40 genetic principal components. Because the identification of an association between a phenotype and pigmentation loci raises the question of underlying population structure, the authors could provide additional information on potential genomic inflation, i.e. by providing information on the genomic inflation factor and the qq-plots for the individual GWAS-analyses.
3. I may have misinterpreted the use of MTAG, but I thought the results you were presenting were the voice breaking data from 23andMe adjusted for the other traits - therefore the units from both the MTAG analysis and the 23andMe analysis should be the same? If this is the case, and given the list of novel loci is from the MTAG results, I would be interested to know whether there was heterogeneity between the effect sizes from the MTAG analysis of male puberty timing and the effect sizes for age at menarche. Also, I may have missed it, but I could not find anywhere in the paper that said that the MTAG results were the continuous age at voice breaking results after adjusting for the UK Biobank puberty timing traits; I think this should be made clearer for the reader.
4. I agree with reviewer 2's comment regarding pleiotropy in the MR analyses. The authors have appropriately changed the wording around the hair colour MR analysis, but not for the longevity analysis. For example, given the large genetic correlations between puberty timing and BMI/growth-related traits, could it be possible that the longevity MR result is driven by an association with BMI?
I think this should be addressed in the paper and the authors should be cautious using phrases like 'the findings support a causal effect of later puberty timing in males on longer lifespan…'.
Reviewer #2: Remarks to the Author: I thank the authors for their many changes in response to my comments and the comments of the other referees. I think the draft is much improved. I still have some concerns related to the Mendelian Randomization (MR) analyses found in the paper. Those concerns are described below.
Major points: 1) I appreciate the new text that states that the authors are using MR to test for shared biology between hair color and puberty timing. Given that reviewer 1 and I were both confused by the purpose of this analysis, I would recommend that the authors explicitly state that they are not attempting to establish causality between the two phenotypes in addition to stating that they are using it to establish shared biology. This is important since MR was specifically designed to test for and measure causal relationships, and I'm unaware of other papers that use MR just to establish shared biology as is done in this paper.
2) I'm still concerned by the strength of the causal language in the longevity analysis. The authors use strong causal language throughout the paper, including the abstract, results section, and discussion. How can the authors be sure that the SNPs considered don't operate on longevity through alternate pathways that violate the exclusion restrictions? They apparently find SNPs that appear to be heterogeneous using MR-PRESSO, which suggests that there may be horizontal pleiotropy in this case. I understand that they attempt to correct for this by removing SNPs, but the abstract of the MR-PRESSO paper highlights that removing such SNPs does not always correct for potential bias. While the median-based method appears significant when considered alone, it would not survive a Bonferroni correction when grouped with the other two prostate traits. Overall, I find these results unconvincing. If the authors would like to include them, they should minimally outline the assumptions under which they are causal and discuss why or why not the assumptions may be plausible. If the authors choose to omit this MR analysis entirely, I still believe that the other analyses are sufficiently important and interesting and that they represent a substantial contribution to the literature.

Minor points:
3) In the response, the authors state that they removed the passage saying "The effect estimate for the association of each SNP on the trait of interest is then derived using a marginal likelihood function." This passage is still found on page 14.
4) On page 15, the authors have changed "We used a conservative threshold..." to "We used a genome-significant threshold..." The more correct language would probably be "We used the genome-wide significant threshold..."
5) On page 7, the authors state, "A polygenic risk score for later puberty timing based on 73 of the male puberty timing signals was associated with lower likelihood of attaining voice breaking in ALSPAC boys". Why not all 76 SNPs? Were some missing from ALSPAC? I couldn't find an explanation for why 3 SNPs were omitted from the polygenic risk score.
Reviewer #3: Remarks to the Author: The authors have addressed all the points raised. I have no further comments.