GWAS of mosaic loss of chromosome Y highlights genetic effects on blood cell differentiation

Mosaic loss of chromosome Y (mLOY) is frequently observed in the leukocytes of ageing men. However, the genetic architecture and biological mechanisms underlying mLOY are not fully understood. In a cohort of 95,380 Japanese men, we identify 50 independent genetic markers in 46 loci associated with mLOY at a genome-wide significant level, 35 of which are unreported. Lead markers overlap enhancer marks in hematopoietic stem cells (HSCs, P ≤ 1.0 × 10⁻⁶). mLOY genome-wide association study signals exhibit polygenic architecture and demonstrate strong heritability enrichment in regions surrounding genes specifically expressed in multipotent progenitor (MPP) cells and HSCs (P ≤ 3.5 × 10⁻⁶). ChIP-seq data demonstrate that binding sites of FLI1, a fate-determining factor promoting HSC differentiation into platelets rather than red blood cells (RBCs), show a strong heritability enrichment (P = 1.5 × 10⁻⁶). Consistent with these findings, platelet and RBC counts are positively and negatively associated with mLOY, respectively. Collectively, our observations improve our understanding of the mechanisms underlying mLOY.

4) The authors need to frame motivation for analyses better. They don't frame their overall analysis plan. They just jump from one integrated genetic epi analysis to another.
5) The authors did a nice job with the analysis of survival.

Some minor comments
Right at beginning of Results… Age and mLOY. Figure 1 does not show biological interpretation. What proportion of individuals exhibit loss? To what degree is loss exhibited in those individuals with loss? The combined result is highly significant but P < 1e-100 is uninformative beyond a demonstration of at least non-zero effect and sample size.
Smoking. Again, what is the effect? It's the first paragraph but buried in the Supplement in a Figure for the reader to figure out.
Reviewer #2 (Remarks to the Author): Terao et al. performed the first, and also the largest, GWAS of mosaic loss of chromosome Y (mLOY) in an East Asian population, using genetic information from 95,380 Japanese males and the mean log R ratio as a proxy for the mLOY phenotype. The authors identified 50 independent genetic markers in 46 loci, of which 35 were novel. They also performed a comprehensive functional annotation analysis, using Haploreg, PASAL pathway analysis, LD score regression, LDSC-SEG and more algorithms.
Despite a large amount of results, and different layers of data (e.g., functional annotations, survival data, environmental exposures such as smoking, hematological traits, etc), the manuscript is to some extent, not so very well-organized.
1. As an original article (at least from my end, the manuscript was submitted as an original article not a letter), the manuscript lacks discussion, which should have been there to provide interpretations and implications to the current results.
2. A proper summarization or conclusion is also lacking.
3. Although mendelian randomization (MR between mLOY and AST) was mentioned in the methods, it was never cited in the results (supplementary figure 14 & 15) -indeed those relevant parts can be found in supplementary note, authors could at least be able to mention "see supplementary notes" instead of dropping them entirely from the main text.
4. Repeated presentations to the same results, which did not add extra information or values -for example supplementary table 14 and supplementary figure 13 were actually the same thing -I would suggest the authors keep only one of these two items given that these results do not add extra values to each other. I would also suggest the authors drop short tables which has only 1 row such as supplementary table 10 -it is sufficient to mention this piece of result using a short sentence. (please correct through the whole manuscript to avoid such situations -write clearly and succinctly, not necessarily piling up all the results).
5. Some sentences read ambiguous and confusing, for example, "The highest heritability enrichment in expressing gene of megakaryocyte/erythrocyte progenitor (MEP) among MEP, common lymphoid progenitor (CLP) and granulocyte/macrophage progenitor (GMP) (the similar differentiation stage after HSC, Figure 3C) might also be compatible with enrichment in FLI1 binding sites which is critical for MEP)", "Subjects with high fraction of mLOY consistently showed associations (of what?)", "This trend was consistent when restricting to the 31 novel loci, with 23 out of 25 variants having the same effect direction in the two studies (where comes from the 25 variants when the sentence began with 31 loci?)", "Within contrast, we observed a non-significant negative association with RBC count", etc. I would suggest a universal language editing to the manuscript.
In addition to those major points, there are several minor comments:
1. For the GWAS analysis, age, array, smoking and disease status were included as covariates. Wonder if principal components (say, first 2 PCs) were also included to control for population stratification?
2. 53 basic annotations: Hilary Finucane's annotation vs. Steven Gazal's annotation. The author may try Gazal's [PMID: 28892061] baseline model which takes into consideration the LD pattern.
3. Genetic correlation between mLOY and hematologic traits, since all these traits were based on the same bunch of individuals (BBJ), have the authors accounted for sample overlapping?
4. Results from BBJ were replicated by using UKBB data -wonder how similar is the phenotype (mLOY) between UKBB vs. BBJ, given that the two cohorts used different algorithms to calculate the proxy? Maybe quantify the genetic correlation between the UKBB mLOY (estimated using hidden Markov model) and BBJ (log-RR) mLOY.
5. A small typo: "We found that mLOY is a highly polygenic trait and departure of mean chisquared statistics could be largely explained by polygenic effects (lambda genomic control 1.086 > intercept 1.019 in LD score regression, Figure 1B)." -should be 1.066.
We thank the reviewers for taking the time to review our manuscript. We have considered each of the comments carefully, and have strived to improve the manuscript based on their comments.
Aside from response to the comments from the reviewers, we modified our manuscript to strictly follow the journal format. We modified the title of the manuscript.

Reviewers' comments:
Reviewer #1 (Remarks to the Author): Terao et al present an analysis of mosaic loss of Y using a data set of 95 thousand Japanese males. They present a GWAS for genetic predictors of acquired Y loss, using integrated molecular epidemiology techniques.
While aspects of their study were modern, and the importance of documenting loss of Y (or other chromosomes) is novel in this East Asian population, the presentation of details was poor, diminishing the impact of the paper.
> Thank you for your comment. We are very sorry that the manuscript was not well organized. We extensively revised the manuscript to improve the presentation.

Major Points
Pages/lines were not numbered making it difficult to reference the comments below.
> We are sorry for this. We put line and page numbers in the revision.
1) This manuscript read as if it were a first draft. The authors need to edit both for grammar and for clarity. Please focus on using precise language. The manuscript is missing a discussion section (if format doesn't permit at least a discussion paragraph needs to be added)?
Overall there is a lack of clarity in exposition. I cannot judge the science in some sections because information is missing.
>Thank you. We have made efforts to improve clarity. The lack of discussion section was a consequence of transferring this manuscript directly to Nature Communications from another journal. In revision, we substantially restructured the manuscript to make our arguments more clear and added a discussion section. We also asked native speakers to carefully check the manuscript.
Abbreviations must be explained in each section or at least the paper (they cannot be introduced only in the Abstract)
>Thank you for pointing this out. We carefully reviewed the revision and made modifications. We ensured that abbreviations were explained when they appeared for the first time in the main text.
That one variant is not polymorphic in the Japanese population…. "Given the observed strong overlap… suggests that a causal variant in Europeans might not be polymorphic…." -> this comment is inane given what was established. What is the additional point here? Seemed highly repetitive.
>Thank you for pointing this out. We agree that the sentences are repetitive. We deleted the latter part and modified the former part as follows.
… Table 3). Therefore relative invariance at these 4 loci in the Japanese population precludes identification of a statistically significant signal in this population.
We also carefully reviewed the revision to avoid similar situations.
"The new loci found in the current study…. showed overlap of HSC." -> ??? "overlap of HSC" is not informative whatsoever and it is not grammatically correct. The last sentence of this paragraph was extremely difficult to parse -what point are the authors trying to make? There are not enough details and it is lacking a good explanation.
>Thank you for the comment. Please excuse us for the poor presentation of concepts in this paragraph. We carefully reviewed the revision to improve grammar and scientific clarity. The parts pointed out by the reviewer were modified as follows to make arguments clear.
> Thank you for the comment. We modified this part accordingly as shown below. In addition, we carefully reviewed the revision to avoid similar situations.
2) The authors used median and average LRR and get slightly different results (two loci differ), they need to investigate this discrepancy and justify why they used average LRR. -> This is highly descriptive and what does it mean? If one says "we looked at both" and then picked one, it comes across as fishing. (In my opinion they should use median LRR, instead of average LRR for the subsequent analyses.)
>Thank you for the comment. In the revised manuscript, we follow the precedent of Wright et al in using mean LRR. We show that using median LRR gives a qualitatively similar result, addressing the possibility that some variants with extreme LRR distorted the mean LRR. We confirmed that mean LRR was highly correlated with median LRR and that genetic association results were quite similar between the analyses using mean and median LRR.
We modified accordingly as follows.
(page 11 line 352~) In accordance with the largest study of mLOY in Europeans (Wright et al, 2017), we used the mean of LRR as a proxy of chromosome Y copy number. Using the mean, rather than the median, of LRR allows some strongly deviating markers to unduly influence the estimate of chromosome Y dosage. However, we found that the median and mean are strongly correlated (Spearman's rho = 0.993) and lead to quite similar genetic association results (median LRR identifies 44 of the 46 loci identified using the mean, Supplementary Note 7). The minimal difference between the two indices supports the rationale for the method applied in this study.
(page 14 line 455~) We took this approach to avoid a strong influence of a limited number of variants with extreme values of LRR. To confirm that our data did not suffer from noise due to variants with extreme LRR, we also took the median of the LRR signals across the variants in each individual and conducted a GWAS using median LRR instead of mLRR-Y.
We also added Supplementary Note 7 to discuss this point.
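The mean-versus-median comparison described in this response can be illustrated with a small simulation. The sketch below is not the authors' pipeline: the sample sizes, noise levels, the 2% outlier rate and the helper names (`ranks`, `pearson`, `spearman`, `simulate`) are all assumptions chosen for illustration, and the data are synthetic. It only shows how per-individual mean and median LRR can stay highly rank-correlated even when a few markers deviate strongly.

```python
import random
import statistics


def ranks(values):
    """Return 1-based ranks (no tie handling; inputs here are continuous)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def simulate(n_individuals=500, n_markers=200, seed=0):
    """Synthetic per-individual mean and median LRR (all parameters assumed)."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(n_individuals):
        level = rng.gauss(0.0, 0.1)  # individual's true chromosome Y signal
        lrr = []
        for _ in range(n_markers):
            value = level + rng.gauss(0.0, 0.2)  # ordinary marker noise
            if rng.random() < 0.02:  # occasional strongly deviating marker
                value += rng.gauss(0.0, 2.0)
            lrr.append(value)
        means.append(statistics.fmean(lrr))
        medians.append(statistics.median(lrr))
    return means, medians


means, medians = simulate()
rho = spearman(means, medians)
print(f"Spearman's rho between mean and median LRR: {rho:.3f}")
```

The simulated rho depends entirely on the assumed noise model and should not be compared with the value of 0.993 reported for the real data.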
3) The authors need to provide more detail regarding their methods. This is a mandatory point for serious consideration of the manuscript. It is impossible to give a full account of examples but for one: …. mLOY with the use of the 53 "basic annotations".... What basic annotations? A few were referenced. But this is assuming the audience for this paper can read the mind of the authors.
There are numerous instances such as this.
> Thank you very much for the comment. We fully agree that detailed explanation is important to make ourselves clear in the text.
4) The authors need to frame motivation for analyses better. They don't frame their overall analysis plan. They just jump from one integrated genetic epi analysis to another.
>Thank you very much for pointing this out. We added sentences or phrases explaining the motivation for each analysis at the beginning of each paragraph in the Results section.

5) The authors did a nice job with the analysis of survival.
>Thank you very much for the positive comment.

Some minor comments
Right at beginning of Results… Age and mLOY. Figure 1 does not show biological interpretation. What proportion of individuals exhibit loss? To what degree is loss exhibited in those individuals with loss? The combined result is highly significant but P < 1e-100 is uninformative beyond a demonstration of at least non-zero effect and sample size.
>Thank you for the comment. We agree that a description of the quantitative effect of age is meaningful. We added a description of the effect size of age on the mLRR-Y signal to the Results section, as written below. Since we analyzed mLOY in a quantitative manner, it is not possible to infer the proportion of subjects carrying the loss. Instead, we showed a quantitative effect of age on mLRR-Y signals.
(Page 5 line 123~) We observed a strong association between age at DNA collection and mLOY (1 year increase in age associated with 2.2% standard deviation (SD) decrease in mLRR-Y signal, p < 1.0 × 10⁻¹⁰⁰ in multiple linear regression, Figure 1A …

> We are very sorry for this. It reflects the direct transfer of our manuscript from Nature Genetics in a format lacking discussion. In the revised manuscript, we added a discussion section. We also added a detailed explanation of our methods and extensively revised the manuscript.

A proper summarization or conclusion is also lacking.
> This relates to the point raised above. We have addressed this by adding conclusions in the revised manuscript.

Although mendelian randomization (MR between mLOY and AST) was mentioned in the methods, it was never cited in the results (supplementary figure 14 & 15) -indeed those relevant parts can be found in supplementary note, authors could at least be able to mention "see supplementary notes" instead of dropping them entirely from the main text.
> Thank you very much for pointing this out. We totally agree that we should mention this analysis. We modified the main text accordingly to mention this point as follows.
(Page 10 line 313~) We also conducted genetic correlation analyses between mLOY and quantitative traits or malignancies with which we previously reported genetic associations (ref. 35). We …

In contrast, we observed a non-significant negative association with RBC count (i.e., higher RBC count, higher mLRR-Y and lower mLOY, Supplementary Figure 11).
In addition to those major points, there are several minor comments: 1. For the GWAS analysis, age, array, smoking and disease status were included as covariates. Wonder if principal components (say, first 2 PCs) were also included to control for population stratification?
> Thank you very much for this comment. Since we used BOLT-LMM in our analyses, we did not use PCs as covariates. BOLT-LMM fits a mixed-effects model that takes the genetic relationship matrix (GRM) into consideration when computing statistics. We agree that PCs also capture genetic correlations in the data; our approach uses BOLT-LMM to take these effects into consideration. When we included PCs as covariates, we obtained quite similar results, perhaps reflecting the genetic homogeneity of our population relative to those in which PCs capture more genetic correlation. We added an explanatory sentence to mention this point.
(Page 15 line 493~) Since we used a mixed model that takes into account the genetic correlation matrix among subjects, we did not include principal components in the covariates.

While the different LD structures of the Japanese and UK populations made it difficult to quantify the genetic similarity of mLOY between the two populations, we showed that significant variants found in the BBJ very often shared direction of effect on mLOY in the UKB. We also tried to replicate the 156 significant variants found in the UKB; 92 of the 100 variants that were polymorphic and passed QC showed the same direction of effect (ref. 38), suggesting a strong overlap of susceptibility loci for mLOY between the two populations.
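The direction-of-effect concordance quoted above (92 of 100 variants) can be put on a probability scale with a one-sided sign test. The sketch below is illustrative only and is not an analysis reported in this response: it assumes the variants are independent and that, with no shared effect, each variant matches direction with probability 0.5. The function name `sign_test_p` is hypothetical.

```python
from math import comb


def sign_test_p(n_concordant, n_total):
    """One-sided binomial P(X >= n_concordant) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(comb(n_total, k) for k in range(n_concordant, n_total + 1))
    return tail / 2 ** n_total


# 92 of 100 variants sharing direction, as stated in the response above.
p_concordance = sign_test_p(92, 100)
print(f"P(>= 92 of 100 concordant under the null) = {p_concordance:.2e}")
```

Because nearby variants in LD are not independent, a p-value computed this way would be anti-conservative if several tested variants tag the same signal.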
5. A small typo: "We found that mLOY is a highly polygenic trait and departure of mean chi-squared statistics could be largely explained by polygenic effects (lambda genomic control 1.086 > intercept 1.019 in LD score regression, Figure 1B)." -should be 1.066.
> Thank you very much for the comment. Sorry for the confusion. As the reviewer indicated, the lambda of the entire GWAS is 1.066. However, the LD score regression analysis was restricted to variants in HapMap3 to ensure accuracy of genotypes. This resulted in a slight difference in lambda between the original GWAS and the LD score regression. We added Supplementary Note 2 to clarify this point.
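Why lambda can differ between the full GWAS and the LD score regression input can be seen from how the genomic control factor is computed: it is the median of the association chi-squared statistics divided by the median of a chi-squared distribution with 1 degree of freedom (about 0.4549), so it depends on which variants are included. The sketch below uses simulated z-scores and a regularly spaced subset as a stand-in for a restricted variant list. The inflation value, sample sizes and the name `lambda_gc` are assumptions for illustration, and a real HapMap3 subset would differ from the full set systematically rather than only by sampling.

```python
import random
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-squared distribution with 1 df


def lambda_gc(z_scores):
    """Genomic control lambda: median observed chi-squared / expected median."""
    chi2 = [z * z for z in z_scores]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


rng = random.Random(1)
inflation = 1.066  # simulated variance inflation (assumed for illustration)
z_all = [rng.gauss(0.0, inflation ** 0.5) for _ in range(50_000)]
z_subset = z_all[::5]  # stand-in for a restricted variant list

lambda_all = lambda_gc(z_all)
lambda_subset = lambda_gc(z_subset)
print(f"lambda (all variants):    {lambda_all:.3f}")
print(f"lambda (variant subset):  {lambda_subset:.3f}")
```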