Genome-wide analyses of 200,453 individuals yield new insights into the causes and consequences of clonal hematopoiesis

Clonal hematopoiesis (CH), the clonal expansion of a blood stem cell and its progeny driven by somatic driver mutations, affects over a third of people, yet remains poorly understood. Here we analyze genetic data from 200,453 UK Biobank participants to map the landscape of inherited predisposition to CH, increasing the number of germline associations with CH in European-ancestry populations from 4 to 14. Genes at new loci implicate DNA damage repair (PARP1, ATM, CHEK2), hematopoietic stem cell migration/homing (CD164) and myeloid oncogenesis (SETBP1). Several associations were CH-subtype-specific including variants at TCL1A and CD164 that had opposite associations with DNMT3A- versus TET2-mutant CH, the two most common CH subtypes, proposing key roles for these two loci in CH development. Mendelian randomization analyses showed that smoking and longer leukocyte telomere length are causal risk factors for CH and that genetic predisposition to CH increases risks of myeloproliferative neoplasia, nonhematological malignancies, atrial fibrillation and blood epigenetic ageing.

analyses with CH as the outcome in the cohort of 200,453 individuals. We found that age increased the risk of CH by 6.7% per year and that prevalent hypertension, but not obesity or type 2 diabetes (T2D), was associated with CH status (Fig. 2a and Supplementary Table 5). We also found that individuals with CH were more likely to be current or former smokers, an association that held true for different forms of CH and was strongest for ASXL1-mutant CH (Fig. 2a and Supplementary Table 5). Analyses of complete blood count and biochemical parameters identified both known and previously unreported associations with overall CH and CH subtypes (Fig. 2a and Supplementary Tables 6 and 7). We also found that CH status was associated with lower prevalent levels of total and low-density lipoprotein cholesterol, most marked for JAK2 and splicing factor-mutant CH (Fig. 2a and Supplementary Table 7).
Associations between CH and incident disease. We next performed a phenome-wide association study (PheWAS) of incident disease in the UKB considering CH at baseline as the exposure. This identified strong associations with myeloid malignancies and associated sequelae (Extended Data Fig. 3a and Supplementary Table 8). Analyses for selected phenotypes (Supplementary Table 9) also identified a high incidence of myeloid malignancies with all forms of CH (Fig. 2b and Supplementary Table 10) and increased risks of other hematological and nonhematological neoplasia, including lymphoma, lung and kidney cancers (Fig. 2b and Supplementary Table 10). Notably, associations with lung and other cancers were also observed in self-reported never smokers (Extended Data Fig. 3b and Supplementary Table 11). Unlike previous reports linking CH with ischemic cardiovascular disease (CVD) 5,10,22, we did not find a significant association between CH and ischemic CVD, including coronary artery disease (CAD) and stroke; but we did find an association with heart failure and atrial fibrillation, and a composite of all CVD conditions in CH with large clones in multivariable regression models (Fig. 2b and Supplementary Table 10). While CH was associated with CAD and ischemic stroke in unadjusted analyses, adjusting for age led to these associations attenuating to the null, demonstrating the impact of age as a confounder (Extended Data Fig. 3c and Supplementary Table 12). Finally, we also found that CH increased the risk of death from diverse causes (Fig. 2b and Supplementary Table 13).
Heritability of CH and cell-type-specific enrichment. To identify heritable determinants of CH risk, we performed a genome-wide association study (GWAS) on the 184,121 individuals with genetically inferred European ancestry to identify common (minor allele frequency (MAF) > 1%) germline genetic variants predisposing to CH. In the GWAS, we compared 10,203 individuals with CH with 173,918 individuals without CH, after quality control (QC) of the germline genotype data. Linkage disequilibrium score regression (LDSC) 23 showed little evidence of inflation in test statistics due to population structure (intercept = 1.009; lambda genomic control factor = 0.999). The narrow-sense (additive) heritability of CH was estimated at 3.57% (s.e. = 0.85%). We partitioned the heritability across four major histone marks observed in 10 cell-type groups aggregated from 220 cell-type-specific annotations 24 and identified strong enrichment of the polygenic CH signal in histone marks enriched in hematopoietic cells (P = 5.9 × 10 −5; Fig. 3a and Supplementary Table 14). Next, we partitioned the heritability of CH across open chromatin state regions in various hematopoietic progenitor cells and lineages 24,25. Previous work on other traits 25,26 has established that trait heritability tends to be enriched in transcriptionally active open chromatin regions in trait-relevant cell types, helping implicate specific cell types as key mediators of the GWAS signal. Consistent with this, we found CH heritability enrichment in accessible chromatin regions in HSCs, common lymphoid and myeloid progenitors, multipotent and erythroid progenitors, and B cells (Fig. 3b and Supplementary Table 15). Overall, these findings endorse the intuitive assumption that CH associations exert their greatest biological effect on HSC/progenitor populations.
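As an illustration of the lambda genomic control factor quoted above, the following minimal Python sketch uses the conventional definition (median observed chi-square statistic divided by the median of a 1-degree-of-freedom chi-square distribution). The z-scores are simulated under the null and are hypothetical; this is not the LDSC software and not the study data.

```python
import random
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(z_scores):
    """Return lambda GC: median observed chi-square over its null median."""
    chi2 = [z * z for z in z_scores]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


# Hypothetical null GWAS: z-scores drawn from a standard normal distribution,
# so no inflation is expected and lambda GC should be close to 1.
rng = random.Random(0)
null_z = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
lambda_gc = genomic_inflation(null_z)
print(f"lambda GC = {lambda_gc:.3f}")
```

A value close to 1, as reported for the CH GWAS, indicates that the bulk of the test statistics behave as expected under the null.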
A trend for opposite effects at 14q32.13-TCL1A was also observed in a previous study 17, but did not achieve genome-wide significance for TET2-CH. When comparing 4,049 individuals with large or 6,154 individuals with small clones against the 173,918 controls without CH, we found that the overall CH loci at 5p15.33-TERT and 3q25.33-SMC4 were associated at genome-wide significance with large clone CH (Fig. 4d and Supplementary Table 20), while 5p15.33-TERT and 6q21-CD164 were associated with small clone CH. For small clone CH risk, we also identified a previously unreported locus marked by rs72755524 at 5p13.3 in a region with several long non-coding RNAs (lncRNAs) (Fig. 4e [18][19][20][21] revealed that in addition to 14q32.13-TCL1A, the lead alleles at 6q21-CD164 also had opposite effects on DNMT3A- versus TET2-CH. The lead variants at 6q21-CD164 and 5p13.3-LINC02064 were associated with small, but not large, clones while the association at 7q32.2-TMEM209 was highly specific to TET2-CH. The lead variants at 1q42.12-PARP1 and 3q25.33-SMC4 had greater effects on large than small clone CH. At the whole-genome level, we estimated the genetic correlation (r g) between DNMT3A-CH and TET2-CH as −0.48 (s.e. = 0.33, P = 0.15) and large and small clone CH as 0.37 (s.e. = 0.18, P = 0.018) using high-definition likelihood inference 30. Finally, we also performed a focused scan to explore rare variant (MAF: 0.2-1%) associations with the three CH traits with largest case numbers (overall, DNMT3A and small clone CH; each compared with 173,918 controls). This identified one new locus at 22q12.1-CHEK2 where the T allele (frequency = 0.3%) of lead variant rs62237617 was perfectly correlated (r 2 = 1) with the 1100delC CHEK2 protein-truncating allele (rs555607708) and conferred a large increase in risk of DNMT3A mutation-associated CH (OR = 4.1, 95% confidence interval (95% CI): 2.7-6.1, P = 6.3 × 10 −12).
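The relationship between an odds ratio, its 95% CI and the corresponding test statistic can be sketched as follows. This is a hedged illustration that assumes a Wald test on the log-odds scale with a normal approximation; the function name is hypothetical, and because the inputs are the rounded values quoted above, the recovered P value only approximates the reported one.

```python
import math


def wald_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover log-OR, its standard error, z and a two-sided P from an OR and 95% CI."""
    beta = math.log(odds_ratio)
    # The CI on the log scale spans beta +/- z_crit * se.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return beta, se, z, p


# Rounded CHEK2 estimate for DNMT3A-CH quoted in the text.
beta, se, z, p = wald_from_or_ci(4.1, 2.7, 6.1)
print(f"log-OR = {beta:.3f}, s.e. = {se:.3f}, z = {z:.2f}, P = {p:.1e}")
```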
Replication of genome-wide significant associations. Replication was undertaken using independent somatic mutation calling and germline association analysis pipelines on data from 221,285 European-ancestry individuals in the UKB, for whom WES was performed after our UKB discovery set. We focused on DNMT3A and/or TET2 mutation carriers (n = 9,386) in the replication sample, stratified by these two genes and clone size, and evaluated the 20 unique lead variants identified in the discovery GWAS (representing 26 distinct overall/subtype-specific CH associations). Eighteen of 20 variants were replicated at P < 0.05, with 16 replicating at P < 0.0025 (accounting for testing 20 variants), and 19 showing consistent directionality (Supplementary Table 22). Variants rs13130545 (overall CH; 4q35.1-ENPP6) and rs72755524 (small clone CH; 5p13.3-LINC02064) were not associated at P < 0.05 in replication analysis. Notably, we confirmed our observation that lead alleles at TCL1A and CD164 had opposite effects on DNMT3A- and TET2-CH, and replicated the CHEK2 association.
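The replication threshold of P < 0.0025 corresponds to a Bonferroni correction of 0.05 for 20 tests. A minimal sketch of that arithmetic follows; the variant names and P values in the example are hypothetical placeholders, not results from the study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 20)

# Hypothetical replication P values, for illustration only.
example_p = {"variant_A": 1e-8, "variant_B": 0.01, "variant_C": 0.2}

calls = {}
for name, p_value in example_p.items():
    if p_value < threshold:
        calls[name] = "replicated after correction"
    elif p_value < 0.05:
        calls[name] = "nominally replicated"
    else:
        calls[name] = "not replicated"

print(threshold, calls)
```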

Blood chromosomal mosaicism and CH due to gene mutation.
It is not known whether the germline genetic architecture underlying predisposition to CH due to individual gene mutations is similar to that underlying CH due to mosaic chromosomal alterations (mCAs). We used data from a recent blood mCA GWAS 31 to answer this and found that 13 of 19 unique lead variants identified for the five gene-mutant CH traits were associated with hematological mCA risk (P < 10 −4 ; Supplementary Table 23). Notably, for our lead variants rs2296312 (14q32.13-TCL1A) and rs8088824 (18q12.3-SETBP1), the alleles conferring increased DNMT3A-CH risk reduced hematological mCA risk (Supplementary Table 23). We found a correlation between overall CH and mCAs (r g = 0.44, s.e. = 0.21, P = 0.037) using LDSC 23 . This germline genetic correlation together with enrichment of the CH GWAS signal in common lymphoid and myeloid progenitors ( Fig. 3b) supports the recent finding that gene-mutant CH and mCAs have overlapping biology that leads them to confer risk of both lymphoid and myeloid malignancies 32 . Further, a phenome-wide scan 33,34 showed that several newly identified lead variants in our analyses were associated with multiple blood cell counts/traits (Supplementary Table 24).

Gene-level associations and network analyses.
We used two complementary methods to perform gene-level association tests for each of our five CH traits: multi-marker analysis of genomic annotation (MAGMA) and a transcriptome-wide association study using blood-based cis gene expression quantitative trait locus (eQTL) data on 31,684 individuals 35 and summary-based Mendelian randomization (SMR) coupled with the heterogeneity in dependent instruments (HEIDI) colocalization test 36. Both approaches converged on a new locus at 6p21.1, associated at gene-level genome-wide significance (P MAGMA < 2.6 × 10 −6, P SMR < 3.2 × 10 −6) with DNMT3A-CH and marked by CRIP3 (P MAGMA = 3.4 × 10 −7, P SMR = 6.6 × 10 −7; Fig. 5a and Supplementary Tables 25 and 26). While CRIP3 was the only 6p21.1 gene to reach gene-level genome-wide significance in both MAGMA and SMR, we did find subthreshold evidence for association between SRF or ZNF318 in the same region and DNMT3A-CH (Fig. 5a). Notably, SRF encodes the serum response factor known to regulate HSC adhesion 37 while ZNF318 is an occasional CH somatic driver 38. More globally, protein-protein interaction (PPI) network analysis 39, using proteins encoded by the 57 genes with P MAGMA < 0.001 in the overall CH analysis (Supplementary Table 25) as 'seeds', identified the largest subnetwork (Fig. 5b) as encompassing 13 of 57 proteins with major hub nodes highlighted as TERT, PARP1, ATM and SMC4. This was consistent with the emerging theme that potential trait-associated genes at subthreshold GWAS loci are often part of interconnected biological networks 40,41. The subthreshold genes identified by MAGMA that encoded protein hubs in this network included FANCF (DNA repair pathway) and PTCH1 (hedgehog signaling; Fig. 5b), both implicated in acute myeloid leukemia pathogenesis 42,43, and GNAS, a CH somatic driver 44.
The CH subnetwork was significantly enriched for several pathways including DNA repair, cell cycle regulation, telomere maintenance and platelet homeostasis (Supplementary Table 27).

Functional target gene prioritization at CH risk loci.
To prioritize putative functional target genes at P lead-variant < 5 × 10 −8 loci identified by our GWAS of five CH traits, we combined gene-level genome-wide significant results from MAGMA and SMR (Supplementary Tables  25 and 26) with five other lines of evidence: PPI network hub status (Supplementary Table 28); variant-to-gene searches of Open Targets 45 for lead variants; and overlap between fine-mapped variants 46,47 (Supplementary Table 29 Table 30). The genes nominated by the largest number of approaches, representing the most likely targets, were SMC4, ENPP6, TERT, CD164, ATM, PARP1, TCL1A, SETBP1 and TMEM209 (Supplementary Table 31).
Among the newly identified loci, lead variant rs138994074 at 1q42.12 was strongly correlated (r 2 = 0.93) with rs1136410, a missense germline mutation in PARP1 (Supplementary Table 30) wherein the G allele, which is protective for DNMT3A-CH, leads to a missense variant (p.Val762Ala) in the catalytic domain of its protein product associated with reduced Poly(ADP-ribose) polymerase 1 activity 53. While SETBP1 was the only gene nominated at 18q12.3 (by only one approach, Open Targets 45), its nomination is strengthened by the fact that somatic SETBP1 mutations are recognized drivers of myeloid malignancies 54,55. We also evaluated the 'druggability' of the prioritized genes in the context of known therapeutics (yielding support for TERT and PARP1) and ongoing drug development (yielding limited support for SMC4, ATM,

Known (previously published) and new loci are indicated by cytoband and target gene (based on the prioritization exercise described in the text). Since there were multiple independent loci at 5p15.33 (LD r 2 < 0.05), we also label the 5p15.33 signals using the lead variant rs number for each signal. Our prioritization exercise was focused on protein coding genes near each lead variant and since there were no protein coding genes within 1 Mb of the lead variant at 5p13.3, we labeled this association using the nearest noncoding RNA. The CH traits corresponding to each Manhattan plot are: a, Overall CH. b, CH with mutant DNMT3A. c, CH with mutant TET2. d, CH with large clones. e, CH with small clones.

We used independent (r 2 < 0.001) variants associated with overall, DNMT3A, TET2, and large and small clone CH at P < 10 −5 as genetic instruments for each of these traits and assessed their associations with outcomes (Supplementary Tables 35, 39 and 40).
Since more variants were available at P < 5 × 10 −8 for overall and DNMT3A-CH, we also examined the consistency of associations when using genome-wide (GWS; P < 5 × 10 −8) and sub-genome-wide significant (sub-GWS; P < 10 −5) instruments for these two traits. Using the sub-GWS instrument, genetic liability to overall CH had the largest associations (Fig. 7a) with myeloproliferative neoplasms (MPN) risk 48 (OR = 1.99, 95% CI: 1.23-3.23, P = 5.4 × 10 −3), intrinsic epigenetic age acceleration 64 (which represents a core characteristic of HSCs 67; beta = 0.39, 95% CI: 0.08-0.69, P = 0.01) and the blood-based Hannum epigenetic clock 64 (beta = 0.27, 95% CI: 0.04-0.49, P = 0.02) and even larger associations were observed when using the GWS instrument. Genetic liability to CH conferred increased risks of lung 68, prostate 69, ovarian 70, oral cavity/pharyngeal 71 and endometrial cancers 72 (Fig. 7a,b and Supplementary Table 39). MR analyses did not support causal risk-conferring associations between genetic liability to CH and CAD 73, ischemic stroke 74 and heart failure 75, with similar lack of evidence across gene-specific and clone size-specific CH, and GWS instrument analyses (Fig. 7a,b and Supplementary Table 39). However, we did uncover an association between genetic liability to overall CH or DNMT3A-CH and atrial fibrillation 76 risk (OR = 1.09, 95% CI: 1.04-1.15, P = 4.9 × 10 −4 for overall CH with the GWS instrument; Supplementary Table 39). Among cytokines/growth factors 65, genetic liability to overall CH was associated with elevated circulating stem cell growth factor beta (beta = 0.19; 95% CI: 0.07-0.30, P = 1.1 × 10 −3). MR analyses also revealed bidirectional associations between CH phenotypes and several blood cell counts/traits 29, suggesting a shared heritability (Figs. 6b and 7a and Supplementary Tables 37 and 39). We found little evidence to support an association between genetic liability to CH and LTL (Supplementary Table 39). Finally, we also performed an MR-PheWAS evaluating associations between genetic liability to overall or DNMT3A-CH and 1,434 disease/trait outcomes in the UKB. Reassuringly, the strongest associations involved blood cell counts/traits and hematopoietic cancers, but we also uncovered new associations such as with malignant skin cancers (Supplementary Tables 41 and 42). Results of MR sensitivity analyses using the weighted median 77 and MR-Egger 78 methods are provided in Supplementary Tables 36-42.

Genes on chromosome 6 within 250 kb of CRIP3. The HEIDI test is a test of heterogeneity of Wald ratio estimates. b, Largest subnetwork of genes/proteins associated with overall CH risk identified by the NetworkAnalyst tool. NetworkAnalyst uses a 'Walktrap' random walks search algorithm to identify the largest first-order interaction network. All genes (n = 57) with P MAGMA < 0.001 in the overall CH MAGMA analysis were mapped to proteins and used as 'seeds' for network construction which was done by integrating high-confidence PPIs from the STRING database. The largest subnetwork constructed contained 13 of the 57 seed proteins and included 210 nodes and 231 edges. The colored nodes indicate seed proteins that interact with at least two other proteins in this subnetwork with the intensity of redness increasing with number of interacting proteins. Seed proteins that interact with six or more other proteins in the subnetwork are named above their corresponding node.

(1) standard deviation unit for continuous exposures (alcohol use in drinks per week, BMI, waist-to-hip ratio adjusted for BMI (WHRadjBMI) (a); LTL, two epigenetic aging traits, and red cell, white cell and platelet counts (b); and five circulating lipid traits (c)) and (2) log-odds unit for binary exposures (smoking initiation (ever having smoked regularly) and genetic liability to T2D (a)). IVW regression was used for all MR analyses, and results were not adjusted for multiple comparisons. Details of units are provided in Supplementary
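The inverse-variance weighted (IVW) estimate used for the MR analyses can be illustrated with a short Python sketch. It assumes the standard fixed-effect form, in which per-variant Wald ratios (outcome effect divided by exposure effect) are averaged with weights equal to the squared exposure effect over the squared outcome standard error; the text does not specify which IVW variant was used, so this choice and all input values below are assumptions made for illustration.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate from per-variant summary statistics."""
    weights = [bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]  # Wald ratios
    total_weight = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return beta, se


# Hypothetical instruments: three variants whose outcome effects are exactly
# half of their exposure effects, so the pooled estimate should be 0.5.
bx = [0.1, 0.2, 0.3]
by = [0.05, 0.10, 0.15]
se_y = [0.01, 0.02, 0.01]

beta_ivw, se_ivw = ivw_estimate(bx, by, se_y)
odds_ratio = math.exp(beta_ivw)  # for a binary outcome on the log-odds scale
print(f"beta = {beta_ivw:.3f}, s.e. = {se_ivw:.4f}, OR = {odds_ratio:.2f}")
```

For binary outcomes the exponentiated estimate is read as an odds ratio per unit increase in genetic liability to the exposure, which is how the ORs quoted above are expressed.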

Discussion
We present an observational and genetic epidemiological analysis of CH in 200,453 individuals in the UKB and report a series of insights into the causes and consequences of this common aging-associated phenomenon. We increase the number of germline associations with CH in European-ancestry populations from 4 (ref. 17) to 14, reveal heterogeneity of associations by CH driver gene and clone size, and implicate putative new CH susceptibility genes, including CD164, ATM and SETBP1, through functional annotation. We also demonstrate that the CH GWAS signal is enriched at epigenetic marks specific to the hematopoietic system, particularly in open chromatin regions of hematopoietic stem/progenitor cells. The robustness of our GWAS is supported by replication of the vast majority of associations in an additional set of 221,285 individuals from the UKB and further affirmed by our replication of previous European-ancestry-specific CH associations 17, the consistency of our estimates of CH heritability

Fig. 7 | IVW MR forest plots with CH traits as exposures.
Forest plots with OR markers (for cancers and cardiovascular/metabolic traits) or exponentiated beta coefficient (exp(beta)) markers (for blood cell traits, lipids, adiposity measures and epigenetic aging indices). ORs/exp(betas) are represented as per log-odds unit increase in genetic liability to overall CH (a) or DNMT3A-CH (b). OR/exp(beta) markers with corresponding P < 0.05 are represented by filled circles. IVW regression was used for all MR analyses, and results were not adjusted for multiple comparisons. Symbols represent OR markers and error bars represent 95% CIs. Red symbols and error bars represent results using genetic instruments comprised exclusively of genome-wide significant (P < 5 × 10 −8 ) variants. Black symbols and error bars represent results when using genome-wide significant and sub-GWS (P < 10 −5 ) variants in the genetic instrument. Large effect size estimates (ORs/exp(betas)) are shown in the lower panels. Sample sizes for all genome-wide association datasets used are provided in Supplementary Table 35. Full results, including from sensitivity analyses, are presented in Supplementary Tables 39 and 40. IS, ischemic stroke.
with previous reports 17,79 and the fact that many of our lead variants are associated with related traits 29,31,60,80. At 14q32.13-TCL1A, we replicate the reported association with DNMT3A-CH (ref. 17) and identify a new genome-wide significant association with TET2-CH. Strikingly, however, we found that the association operates in the opposite direction for TET2-CH versus DNMT3A-CH. This inverse relationship, also supported by our finding of a suggestive negative genetic correlation between TET2- and DNMT3A-CH, is tantalizing in light of recent observations that ageing has different effects on the dynamics of these two forms of CH, resulting in TET2-CH becoming more prevalent than DNMT3A-CH in those over 80 yr (refs. 20,81). Also notable in this light is the finding of an association at 6q21-CD164 with DNMT3A-CH, and a trend in the opposite direction for TET2-CH that was confirmed in the replication analysis. As CD164 is expressed in the earliest HSCs 82 and encodes a key regulator of HSC adhesion 83,84, this suggests that HSC migration and homing may play important roles in CH pathogenesis. The reciprocal relationship of both TCL1A and CD164 with the two main CH subtypes suggests that their expression must be tightly regulated to prevent the development of one or other CH subtype, making these loci important targets for hijack by the effects of somatic mutations. In fact, a recent study 85 suggests that this may be how TET2 and ASXL1 mutations interact with a TCL1A promoter variant associated with clonal expansion rate. TCL1A is not expressed in normal or DNMT3A-mutated HSCs and the authors show that the locus becomes susceptible to activation in the presence of TET2 or ASXL1 mutations only when harboring the reference allele at the promoter variant, leading to faster clonal expansion. This type of interaction may operate for CD164 and other CH risk loci, or alternative models of interaction between the germline and somatic genome may exist.
New CH risk loci included the PARP1 coding variant rs1136410, where the G allele is protective for DNMT3A-CH and associated with reduced catalytic activity 53 suggesting that this most common form of CH may be vulnerable to PARP inhibition, in keeping with the observed synergy between PARP and DNMT inhibitors 86 . We also identified three lead variants at the TERT locus for which CH risk alleles were associated with longer LTL, a finding corroborated by our MR results linking increased LTL to CH. Interestingly, a recent study found deleterious rare germline TERT variants associated with shorter telomeres in patients with myelodysplastic syndromes 87 . However, compared with conventional myelodysplastic syndromes, these cases displayed a paucity of somatic mutations in DNMT3A (2 of 41) and TET2 (3 of 41 cases), suggesting that evolutionary paths may differ between cases with long versus short telomeres.
The rich phenotypic data captured by the UKB, coupled with our genetic analysis of CH and external GWAS datasets, enabled us to explore associations of CH using multivariable regression and interrogate, at scale, potential causal relationships between CH and its putative risk factors and consequences using MR. This highlighted that smoking and longer telomere length are causal risk factors for CH. These associations were valid across multiple CH subtypes and, in the case of smoking, corroborated by observational estimates. We also reveal that not only is genetic predisposition to CH causally associated with MPN risk, but it also increases the risk of lung, prostate, ovarian, oral/pharyngeal and endometrial cancers. In these analyses, the use of two-sample MR protected against potential reverse causality arising from cancer therapy-induced selection pressure on hematopoietic clones 88 . These MR results suggest that genetic liability to CH may be a biomarker for development of cancer elsewhere in the body, analogous to the link between genetic predisposition to Y chromosome loss in blood and solid tumor risk 89 .
We investigated the recently identified association of CH with blood-based epigenetic clocks 90 , using bidirectional MR, and show that this association is likely to be causal in the direction from CH to epigenetic age acceleration. We also showed that genetic predisposition to CH was associated with elevated circulating levels of stem cell growth factor beta, a secreted sulfated glycoprotein that regulates primitive hematopoietic progenitor cells 91 . Finally, we unraveled a previously unreported association between genetic liability to CH and atrial fibrillation risk, which was also supported by our observational analysis. However, unlike previous reports based on smaller sample sizes 5,10,22 , we did not find evidence in observational and MR analyses to support an association between CH and CAD or ischemic stroke. However, our MR analyses indicated that higher BMI and circulating apolipoprotein B levels were associated with TET2 and large clone CH risks, respectively, with apolipoprotein B being the key causal lipid risk factor for CAD 63,92 . We also demonstrated the impact of age, in particular, as a strong confounder of the CH-CAD/ischemic stroke associations. These results raise the possibility that reported associations of CH with CAD/stroke risks may suffer from residual confounding. Moreover, many of the cohorts that reported these associations are enriched in participants at high cardiovascular risk 10 , in contrast to the UKB, where participants may be healthier, and potentially have lower epigenetic aging. Recent findings suggest that CH is associated with CAD/stroke only on a background of epigenetic aging 90 , offering a plausible mechanistic explanation for the absence of an association in our study.
Collectively, our findings substantially illuminate the landscape of inherited susceptibility to CH and provide insights into the causes and consequences of CH with implications for human health and ageing.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41588-022-01121-z.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. © The Author(s) 2022

Methods
Study population and WES data. The UKB resource was approved by the North West Multi-centre Research Ethics Committee under reference number 21/NW/0157 and all participants provided written, informed consent to participate. Participants in the UKB are volunteers and not compensated for participation. Data from the UKB resource were accessed under approved application numbers 56844, 29202 and 26041 for this study. The UKB is a prospective longitudinal study containing in-depth genetic and health information from half a million UK participants. For this study, we have selected 200,453 individuals (200k) who had WES data available (age range: 38-72, median age: 58; 55% females). WES was generated in two batches, the first of approximately 50,000 samples (50k) 93 and the second comprising an additional 150,000 samples (150k) 19 . Exomes were captured using the IDT xGen Exome Research Panel v.1.0 including supplemental probes; a different IDT v.1.0 oligo lot was used for each batch. Multiplexed samples were sequenced with dual-indexed 75 × 75-base-pair paired-end reads on the Illumina NovaSeq 6000 platform using S2 (50k samples) and S4 (150k samples) flow cells. The 50k samples were first computed using FE protocol and reprocessed later to match the second batch of 150k sequences which were processed using a new improved unified OQFE pipeline. As the initial 50k samples were sequenced on S2 flow cells and with a different IDT v.1.0 oligo lot from the remaining 150k samples, which were sequenced on S4 flow cells, we included the WES batch as a covariate in downstream analyses.
Sequence data processing, CH mutation calling and filtering. CRAM files generated by the OQFE pipeline were obtained from UKB (Fields 23143-23144). Variant calling on WES data from 200,453 individuals was performed using Mutect2, Genome Analysis Toolkit (GATK) v.4.1.8.1 (ref. 94). Briefly, Mutect2 was run in 'tumor-only' mode with default parameters, over the exons of 43 genes previously associated with CH (Supplementary Table 1). To filter out potential germline variants we used a population reference of germline variants generated from the 1000 Genomes Project (1000GP) 95 and the Genome Aggregation Database (gnomAD) 96. All resources were obtained from the GATK Best practices repository (gs://gatk-best-practices/somatic-hg38). Raw variants called by Mutect2 were filtered with FilterMutectCalls using the estimated prior probability of a read orientation artifact generated by LearnReadOrientationModel (GATK v.4.1.8.1). Putative variants flagged as 'PASS' using FilterMutectCalls or flagged as 'germline' if present at least two times with the 'PASS' flag in other samples were selected for filtering. Gene annotation was performed using Ensembl Variant Effect Predictor (VEP) (v.102) 97. We required variants with a minimum number of alternate reads of 2, evidence of the variant on both forward and reverse strands, a minimum depth of 7 reads for single nucleotide variants (SNVs) and 10 reads for short indels and substitutions, and a MAF lower than 0.001 (according to 1000GP phase 3 and gnomAD r2.1). For new variants, not previously described in the Catalogue of Somatic Mutations in Cancer (COSMIC; v.91) 98 nor in the Database of Single Nucleotide Polymorphisms (dbSNP; build 153) 99, we used a minimum allele count per variant of 4, and a MAF lower than 5 × 10 −5.
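The read-level requirements described above can be expressed as a simple predicate. The sketch below is illustrative only: the `Call` record and its field names are hypothetical and do not correspond to the Mutect2 output format or to the authors' pipeline, and the additional criteria for previously undescribed variants are omitted.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical per-variant summary; field names are assumptions."""
    alt_reads: int          # reads supporting the alternate allele
    alt_forward: int        # alternate reads on the forward strand
    alt_reverse: int        # alternate reads on the reverse strand
    depth: int              # total read depth at the site
    is_snv: bool            # True for SNVs, False for short indels/substitutions
    population_maf: float   # MAF in the population reference panels


def passes_read_filters(call: Call) -> bool:
    """Apply the thresholds stated in the text to a single call."""
    min_depth = 7 if call.is_snv else 10
    return (
        call.alt_reads >= 2
        and call.alt_forward >= 1
        and call.alt_reverse >= 1
        and call.depth >= min_depth
        and call.population_maf < 0.001
    )


good_snv = Call(alt_reads=4, alt_forward=2, alt_reverse=2, depth=30, is_snv=True, population_maf=0.0)
shallow_indel = Call(alt_reads=3, alt_forward=2, alt_reverse=1, depth=8, is_snv=False, population_maf=0.0)
one_strand = Call(alt_reads=5, alt_forward=5, alt_reverse=0, depth=40, is_snv=True, population_maf=0.0)

print(passes_read_filters(good_snv), passes_read_filters(shallow_indel), passes_read_filters(one_strand))
```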
From resulting variants, we selected those that: (1) are included in a list of recurring hotspot mutations associated with CH and myeloid cancer (Supplementary Table 2); (2) have been reported as somatic mutations in hematological cancers at least seven times in COSMIC; or (3) met the inclusion criteria of a predefined list of putative CH variants 17,79 (Supplementary Table 3). We included previous variants flagged as germline by FilterMutectCalls if: (1) the number of cases in the cohort flagged as germline was lower than the ones flagged as PASS; and (2) at least one of the cases had a P < 0.001 for a one-sided exact binomial test, where the null hypothesis was that the number of alternative reads supporting the mutation was 50% of the total number of reads (95% for copy number equal to one), except for hotspot mutations which were all included. For the final list, we excluded all variants not present in COSMIC or in the list of hotspots that had a MAF equal to or higher than 5 × 10 −5 and either the mean VAF of all cases was higher than 0.2 or the maximum VAF was lower than 0.1. Frameshift, nonsense and splice-site mutations not present in COSMIC or in the hotspot list were further excluded if for each variant none of the cases had a P < 0.001 for a one-sided exact binomial test. A complete list of filtered variants is provided in Supplementary Table 4.
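The one-sided exact binomial test used to separate somatic from heterozygous germline variants can be sketched in a few lines. The direction of the test is not stated explicitly in the text; this sketch assumes the lower tail, that is, the probability of observing this few alternate reads or fewer if the true allele fraction were 50%. The read counts are hypothetical.

```python
from math import comb


def binom_lower_tail(alt_reads, depth, p_null=0.5):
    """Exact P(X <= alt_reads) for X ~ Binomial(depth, p_null)."""
    return sum(
        comb(depth, k) * p_null**k * (1.0 - p_null) ** (depth - k)
        for k in range(alt_reads + 1)
    )


# Hypothetical calls at a depth of 40 reads.
p_low_vaf = binom_lower_tail(5, 40)    # VAF 0.125: unlikely under a 50% null
p_near_half = binom_lower_tail(18, 40)  # VAF 0.45: compatible with a germline heterozygote

print(f"{p_low_vaf:.2e} {p_near_half:.3f}")
print("somatic-like" if p_low_vaf < 0.001 else "germline-like")
```

For a site with copy number one, the text indicates a null proportion of 95% rather than 50%, which corresponds to calling the function with `p_null=0.95`.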
Trait selection and modeling for observational analyses. Phenotypes were downloaded in December 2020 and individual traits were extracted from the complete phenotype file. Cancer, metabolic and CVD traits were generated by combining individual traits and diagnosis dates based on disease definitions (Supplementary Table 9). For each disease definition, the first diagnosis event that occurred in each trait was selected. Baseline was defined as the date of sample collection when the individuals attended the assessment centers. Prevalent cases were defined as those identified before baseline, and incident cases as events that occurred after baseline. Unless otherwise specified, all regression models included age, sex, smoking status, WES batch and the first ten ancestry principal components as covariates, and all analyses were adjusted for multiple comparisons using the false discovery rate (FDR) computed by the Benjamini-Hochberg procedure implemented in the p.adjust function (R stats package v.4.0.2). Blood cell counts and biochemical traits were log10-transformed and analyzed using a logistic regression model with overall and gene-specific CH as outcomes, including the assessment center as a covariate and, in the case of cholesterol and cholesterol species, the use of cholesterol-lowering medication as an additional covariate. Individuals with myeloid malignancies or hematological neoplasms at baseline (that is, with a cancer diagnosis date before the date they attended the assessment centers) were excluded from the analysis. For cancer, CVD and death risk, we performed a time-to-event regression analysis. In the case of cancer and CVD, we performed a competing risk analysis, using the date of death from other causes as the competing event, while for the risk of death we used the Cox proportional hazards model. The cancer/CVD/death event was used as the outcome and CH was considered as the exposure in these analyses. 
Individuals without the event who died before the end of the follow-up were censored at the time of death, while the rest were censored at the end of the follow-up. For CVD and death risk analyses, we also included BMI, high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, triglycerides, T2D status and hypertension status as covariates. Individuals with myeloid or other malignant neoplasms at baseline were excluded from all aforementioned analyses. The proportional hazards assumption for the Cox and competing risk models was assessed by examining the Schoenfeld residuals. For the phenome-wide association analysis between International Statistical Classification of Diseases and Related Health Problems 10th Revision (ICD-10) codes as outcomes and CH status, logistic regression models were used including age, sex, WES batch and the first ten genetic ancestry principal components as covariates. Analyses were performed over 11,787 selected ICD-10 codes corresponding to disease conditions (A to N), symptoms, signs, and abnormal clinical and laboratory findings (R), and factors influencing health status (Z). All analyses were performed using glm (R stats package v.4.0.2), coxph (R survival package v.3.2-11) and crr (R cmprsk package v.2.2-10) functions.
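The Benjamini-Hochberg adjustment used for FDR control can be illustrated with a short Python sketch. It is intended to mirror the step-up adjustment that R's p.adjust performs with method = "BH", but it is an illustration rather than the code used in these analyses; the function name and the example P values are hypothetical.

```python
def bh_adjust(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Visit P values from largest to smallest so that a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P values for illustration only.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Each adjusted value is the smallest FDR level at which the corresponding test would be declared significant.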
Genome-wide association analyses. Germline genotype data used were from the UKB release that contained the full set of variants imputed into the Haplotype Reference Consortium 100 and UK10K + 1000GP (ref. 95) reference panels and genotyped on the UK BiLEVE Axiom Array or UKB Axiom Array 101. Derivation of the analytic sample for UKB individuals of European ancestries followed the QC protocol of Astle et al. 29 and included the following steps: after filtering genetic variants (call rate ≥ 99%, imputation quality info score > 0.9, Hardy-Weinberg equilibrium P ≥ 10⁻⁵) and participants (removal of genetic sex mismatches), we excluded participants having non-European ancestries (self-report or inferred by genetics) or excess heterozygosity (>3 s.d. from the mean), and included only one of each set of related participants (third-degree relatives or closer). After QC, we were left with 10,203 individuals with CH and 173,918 individuals without CH. The subset with CH included 5,185 and 2,041 individuals with DNMT3A- and TET2-mutant CH, respectively, and 4,049 and 6,154 individuals with large (VAF ≥ 0.1) and small (VAF < 0.1) clone size CH, respectively. Association analyses were performed for autosomal and X chromosomal variants using noninfinitesimal linear mixed models implemented in BOLT-LMM 102 (v.2.3.6) with age at baseline, sex and the first ten genetic principal components included as covariates.
Statistically independent lead variants for each CH phenotype were defined using LD-based clumping with an r² threshold of 0.05 applied across all genotyped and imputed variants with P < 5 × 10⁻⁸, imputation quality score > 0.6 and MAF > 1%. This was implemented using the FUMA pipeline (v.1.3.6b) (ref. 103). For the rare variant association scan, we used more stringent cut-offs of P < 10⁻⁹ and imputation quality score > 0.8 to define lead variants but did not require LD-clumping since only one such association was identified. Approximate conditional analysis conditioning on the common (MAF > 1%) lead variants was performed using the --cojo-cond flag in the Genome-wide Complex Trait Analysis (GCTA) v.1.93 tool (refs. 27,104).
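The idea behind LD-based clumping can be sketched as a greedy procedure: variants are visited in order of increasing P value, and a variant becomes a lead variant only if its r² with every lead already selected is below the threshold. The Python sketch below is a simplified illustration under stated assumptions and is not the FUMA implementation: it ignores physical distance windows, treats unlisted variant pairs as having r² = 0, and uses hypothetical variant identifiers and values.

```python
def clump(pvals: dict[str, float],
          r2: dict[frozenset, float],
          p_threshold: float = 5e-8,
          r2_threshold: float = 0.05) -> list[str]:
    """Return lead variants chosen by greedy LD-based clumping."""
    # Keep genome-wide significant variants, most significant first.
    candidates = sorted(
        (v for v, p in pvals.items() if p < p_threshold),
        key=lambda v: pvals[v],
    )
    leads: list[str] = []
    for variant in candidates:
        # A variant is independent if it is in low LD with all current leads.
        if all(r2.get(frozenset((variant, lead)), 0.0) < r2_threshold
               for lead in leads):
            leads.append(variant)
    return leads


# Hypothetical variants, P values and pairwise r² for illustration only.
pvals = {"rsA": 1e-12, "rsB": 1e-9, "rsC": 1e-10, "rsD": 1e-3}
r2 = {frozenset(("rsA", "rsB")): 0.8, frozenset(("rsA", "rsC")): 0.01}
print(clump(pvals, r2))
```

In this toy example rsB is absorbed into the clump led by rsA, rsC is retained as an independent lead and rsD is never considered because it does not reach genome-wide significance.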
We also evaluated associations of the lead variants for overall CH risk in the 505 individuals with CH and 11,893 controls (retained after the QC steps described above), comprising the ancestrally diverse (non-European) subcohort of the 200k UKB cohort, using logistic regression and adjusting for age, sex, WES batch and 40 genetic ancestry principal components.
Replication of genome-wide significant associations. Replication analysis was performed using 221,285 unrelated UKB individuals of European ancestry (age range: 39-73, mean age: 57; 53% females), for whom WES was performed subsequent to the initial 200k, using the same protocol. Alignment to the GRCh38 genome reference with Illumina DRAGEN Bio-IT Platform Germline Pipeline v.3.0.7 and QC were performed as detailed by Wang et al. 105. Somatic variant calling was performed with GATK's Mutect2 (v.4.2.2.0) using a panel of normals to remove recurrent artifacts, and subsequent filtering was performed with FilterMutectCalls, including the filtering of read orientation artifacts using priors generated with LearnReadOrientationModel. Putative somatic variants were identified from Mutect2 'PASS' calls in DNMT3A and TET2 based on (1) matching the list of putative somatic mutations identified in the discovery cohort, or (2) any DNMT3A or TET2 protein-truncating variants as predefined by Wang et al. 105. Sample sizes for DNMT3A-, TET2-, and large and small clone DNMT3A- or TET2-mutant CH are provided in Supplementary Table 22. Replication association statistics were calculated on the 221,285 replication exomes using the imputed genotype data with logistic regression, including age, sex and the first four genetic ancestry principal components as covariates.
Heritability, cell-type enrichment and genetic correlation. We used LDSC (v.1.0.1) 23 to estimate the narrow-sense heritability of CH on the liability scale assuming the population prevalence of CH to be 10% (based on the prevalence of CH in the UKB '200k' cohort as shown in Fig. 1b) and constraining the LDSC intercept to 1. The intercept, which in its unconstrained form protects from bias due to population stratification, was constrained to 1 to provide more precise estimates given that there was little evidence of inflation in test statistics due to population structure in the unconstrained analysis (unconstrained intercept estimated as 1.009 (s.e. = 0.0067) and lambda genomic control factor of 0.999). We used the pre-computed 1000 Genomes phase 3 European ancestry reference panel LD score dataset for heritability estimation. We used the same LD scores and the --rg flag in LDSC to estimate the genetic correlation between the CH and mCA GWAS summary statistics 31. Cell-type group partitioned heritability analysis was performed using LD scores partitioned across 220 cell-type-specific annotations that were divided into 10 groups 24: central nervous system, cardiovascular, kidney, adrenal/pancreas, gastrointestinal, connective/bone, immune/hematopoietic, skeletal muscle, liver and other. Each of the ten groups contained cell-type-specific annotations for four histone marks: H3K9ac, H3K27ac, H3K4me1 and H3K4me3 (ref. 24). We also used LD scores annotated based on open chromatin state (assay for transposase-accessible chromatin using sequencing (ATAC-seq)) profiling by Corces et al. 25,26 in various hematopoietic progenitor cells and lineages at different stages of differentiation. To estimate the genetic correlation between DNMT3A- and TET2-CH and between large and small clone CH we used the high-definition likelihood (HDL; v.1.4.0) 30 inference approach to improve power given the low sample size in each subtype-specific CH GWAS.
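Reporting heritability on the liability scale requires rescaling the observed-scale estimate using the population prevalence K and the sample prevalence P of the binary trait. The Python sketch below shows the standard conversion factor, K²(1 − K)² / [P(1 − P)z²], where z is the height of the standard normal density at the liability threshold. It is an illustration under stated assumptions, not the LDSC code itself; the function name is hypothetical, and the sample prevalence in the example is derived from the post-QC case and control counts given above.

```python
from statistics import NormalDist


def liability_conversion_factor(pop_prev: float, sample_prev: float) -> float:
    """Factor that rescales observed-scale h2 to the liability scale."""
    std_normal = NormalDist()
    # Liability threshold above which individuals are affected.
    threshold = std_normal.inv_cdf(1.0 - pop_prev)
    z = std_normal.pdf(threshold)
    return (pop_prev**2 * (1.0 - pop_prev) ** 2) / (
        sample_prev * (1.0 - sample_prev) * z**2
    )


# Population prevalence of 10% as assumed in the text; sample prevalence
# computed from 10,203 individuals with CH and 173,918 without CH.
sample_prev = 10203 / (10203 + 173918)
factor = liability_conversion_factor(pop_prev=0.10, sample_prev=sample_prev)
print(f"liability-scale h2 = observed-scale h2 x {factor:.3f}")
```

Multiplying an observed-scale heritability estimate by this factor gives the liability-scale estimate.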

Gene-level association and network analyses.
We undertook genome-wide gene-level association analyses using two complementary approaches. First, we used MAGMA (v.1.08 implemented in FUMA v.1.3.6b), which involves mapping germline variants to the genes they overlap, accounting for LD between variants and performing a statistical multi-marker association test 106. Second, we performed a transcriptome-wide association study using blood-based cis gene eQTL data on 31,684 individuals from the eQTLGen consortium 35 and SMR coupled with the HEIDI colocalization test to identify germline genetic associations with CH risk mediated via the transcriptome 36. The gene-level genome-wide significance threshold in the MAGMA analyses was set at P = 2.6 × 10⁻⁶ to account for testing 19,064 genes and for SMR was set at P = 3.2 × 10⁻⁶ after adjustment for testing 15,672 genes. Further, only genes with SMR P < 3.2 × 10⁻⁶ and HEIDI P > 0.05 were declared genome-wide significant in the SMR analyses since a HEIDI P > 0.05 strongly suggests colocalization of the GWAS and eQTL signals for a given gene 36. NetworkAnalyst 3.0 (ref. 39) was used for network analysis. All genes with P < 10⁻³ in each MAGMA analysis for overall, DNMT3A- and TET2-mutant, and large and small clone CH were used as input. The protein-protein interactome selected was STRING v.10 (ref. 107) with the recommended parameters (confidence score cut-off of 900 and requirement for experimental evidence to support the PPI). The largest possible network was constructed from the seed genes/proteins and the interactome proteins 39. Hub nodes were defined as nodes with degree centrality ≥ 10 (that is, a node with at least 10 edges or connections to other proteins in the network as a measure of its importance in the network and consequently its biology). Pathway analysis of this largest network was conducted using the enrichment tool built into NetworkAnalyst and with the Reactome pathway repository therein 108.
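The gene-level significance thresholds quoted above are consistent with a Bonferroni correction at a family-wise error rate of 0.05, which is an assumption here since the text does not state the alpha level explicitly. The Python sketch below reproduces the arithmetic and encodes the joint SMR and HEIDI criterion; the function names and the example P values are hypothetical.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def smr_significant(p_smr: float, p_heidi: float,
                    smr_threshold: float = 3.2e-6,
                    heidi_threshold: float = 0.05) -> bool:
    """A gene passes if the SMR test is significant and HEIDI does not reject colocalization."""
    return p_smr < smr_threshold and p_heidi > heidi_threshold


print(f"MAGMA, 19,064 genes: {bonferroni_threshold(19064):.1e}")
print(f"SMR, 15,672 genes: {bonferroni_threshold(15672):.1e}")

# Hypothetical gene-level results for illustration only.
print(smr_significant(p_smr=1e-7, p_heidi=0.30))
print(smr_significant(p_smr=1e-7, p_heidi=0.01))
```

Requiring HEIDI P > 0.05 means that a gene with a strong SMR signal is still discarded when there is evidence that the GWAS and eQTL associations arise from distinct variants.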
Fine-mapping and target gene prioritization. We fine-mapped the lead variant signals identified by the FUMA LD-clumping pipeline using the Probabilistic Identification of Causal Single Nucleotide Polymorphisms (PICS2; v.2.1.1) algorithm 46,47 to identify candidate causal variants most likely to underpin each association. The PICS2 algorithm computes the likelihood that each variant in LD with the lead variant is the true causal variant in the region by leveraging the fact that for variants associated merely due to LD, the strength of association scales asymptotically with correlation to the true causal variant 46. We only retained variants with a PICS2 probability of 1% or more in our final list of fine-mapped candidate causal variants. We overlapped these fine-mapped variants with gene body annotations 48 using GENCODE release 33 (ref. 109) (build 37) annotations after removing ribosomal protein genes. Fine-mapped variants were also overlapped with ATAC-seq peaks across 16 hematopoietic progenitor cell populations. ATAC-RNA count correlations were calculated using Pearson coefficients for hematopoietic progenitor cell RNA counts of genes within 1 Mb of the ATAC peaks, and these were used to identify putative target genes of fine-mapped variants that overlapped ATAC-seq peaks 25,48-50. We also looked up the SIFT 51 and PolyPhen 52 scores for these fine-mapped variants using the SNPnexus v.4 annotation tool 110 to identify coding variants with predicted functional consequences. Finally, we used the Open Targets Genetics resource 45 to identify the most likely target gene of the lead variant at each locus as per Open Targets and used this in our omnibus target gene prioritization scheme described below.
To prioritize putative target genes at the loci with lead-variant P < 5 × 10⁻⁸ identified by our GWAS of overall CH, DNMT3A-CH, TET2-CH and large/small clone size CH, we combined gene-level genome-wide significant results from (1) MAGMA and (2) SMR with (3) PPI network hub status of the gene, (4) variant-to-gene searches of the Open Targets database for lead variants, and overlap between fine-mapped variants and (5) gene bodies, (6) regions with accessible chromatin (ATAC-seq peaks) across 16 hematopoietic progenitor cell populations that were also correlated with nearby gene expression (RNA sequencing) in the same cell populations and (7) missense variant annotations from SIFT and PolyPhen. Genes nominated by at least two of the seven approaches were listed (except where only one of the seven methods nominated a single gene in a region, in which case that gene was listed) and the genes nominated by the largest number of approaches represented the most likely targets at each locus. We also evaluated the 'druggability' of the prioritized functional target genes in the context of known therapeutics and ongoing drug development using the Open Targets Platform 56 and canSAR 57 v.1.5.0 databases. The canSAR database provides chemistry-based (assesses the likely 'ligandability' of a protein based on the chemical properties of compounds tested against the protein itself and/or its homologs) and antibody-based (assesses if a target is potentially suitable for antibody therapy) predictions.
Phenome-wide association scan for lead variants. We used PhenoScanner V2 (refs. 33,34) with catalog set to 'diseases & traits', P value set to '5E-8', proxies set to 'EUR' and r² set to '0.8' to search for published phenome-wide associations between our lead variants or variants in strong LD (r² > 0.8) with the lead variants and other diseases and traits.
MR analyses. MR 111,112 uses germline variants as instrumental variables to proxy an exposure or potential risk factor and evaluate evidence for a causal effect of the exposure or potential risk factor on an outcome. Due to the random segregation and independent assortment of alleles at meiosis, MR estimates are less susceptible to bias from confounding factors as compared with conventional observational epidemiological studies. As the germline genome cannot be influenced by the environment after conception or by preclinical disease, MR estimates are also less susceptible to bias due to reverse causation. MR estimates represent the association between genetically predicted levels of exposures or risk factors and outcomes, as compared with conventional observational epidemiological estimates, which represent direct associations of the exposure or risk factor levels with outcomes. Effect allele harmonization across GWAS summary statistics datasets and the subsequent MR analyses were performed using the TwoSampleMR v.0.5.6 R package 58. The CH phenotypes were considered as both exposures (to identify consequences of genetic liability to CH) and outcomes (to identify risk factors for CH). When considering CH phenotypes as outcomes, germline variants associated with putative risk factors or exposures at P < 5 × 10⁻⁸ were used as genetic instruments for the risk factors/exposures, except for the appraisal of circulating cytokines and growth factors 65, wherein variants associated with cytokines/growth factors at P < 10⁻⁵ were used as instruments. IVW analysis 113 was the primary analytic approach, with pleiotropy-robust sensitivity analyses carried out using the MR-Egger 78 and weighted median 77 methods. A full list of external GWAS data sources used for MR analyses is provided in Supplementary Tables 30 and 31. 
We also conducted an MR-PheWAS evaluating overall CH and DNMT3A-CH as exposures (using variants associated with these at P < 10⁻⁵) and 1,434 disease and trait outcomes in the UKB data using summary genetic association statistics for the outcomes that were generated by the Neale lab (http://www.nealelab.is/uk-biobank) and accessed via the TwoSampleMR v.0.5.6 R package and the Integrative Epidemiology Unit (IEU) OpenGWAS project portal 114. FDR control was applied to the MR-PheWAS IVW analysis P values.
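The inverse-variance weighted (IVW) estimator that serves as the primary MR method can be sketched in Python as a weighted average of per-variant Wald ratios. This is a minimal fixed-effect illustration using first-order weights and assuming uncorrelated instruments; it is not the TwoSampleMR implementation, whose reported standard errors may differ, and the function name and summary statistics below are hypothetical.

```python
from math import sqrt


def ivw_estimate(beta_exposure: list[float],
                 beta_outcome: list[float],
                 se_outcome: list[float]) -> tuple[float, float]:
    """Return the fixed-effect IVW causal estimate and its standard error."""
    # Wald ratio for each variant: outcome effect per unit of exposure effect.
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    # First-order inverse-variance weight of each Wald ratio.
    weights = [bx**2 / se**2 for bx, se in zip(beta_exposure, se_outcome)]
    total_weight = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    return estimate, sqrt(1.0 / total_weight)


# Hypothetical harmonized summary statistics for two instruments.
estimate, se = ivw_estimate(
    beta_exposure=[0.1, 0.2],
    beta_outcome=[0.05, 0.1],
    se_outcome=[0.01, 0.02],
)
print(estimate, se)
```

Variants with larger exposure effects and more precisely estimated outcome effects receive more weight, which is why weak instruments contribute little to the pooled estimate.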

Statistics and reproducibility.
No statistical method was used to predetermine sample size. The experiments were not randomized and investigators were not blinded during the experiments and outcome assessment. Participants were excluded from the GWAS due to genetic sex mismatch, excess heterozygosity (>3 s.d. from the mean) and relatedness (only one of each set of participants who were third-degree relatives or closer was retained). To summarize, our study design included observational genomic analyses of CH in 200,453 individuals across ancestries, genome-wide association and post-GWAS analyses for five CH traits (overall, DNMT3A, TET2, large clone and small clone CH) in 184,121 individuals of European ancestry, followed by trans-ancestry genetic association analyses in 12,398 individuals, and replication genetic association analyses in an additional 221,285 individuals of European ancestry, all from the UKB.

Fig. 1 | Characterization of CH in the UK Biobank. a, Histogram stratified by sex showing the age distribution of individuals in the UKB cohort (n = 200,453). b, Overall percentage of females and males in the UKB cohort. c, Percentage of the most common self-reported ancestry groups in the UKB cohort. Ancestry groups with a frequency lower than 1% were grouped under the 'Other ancestry group' category. d, Number of individuals with 1, 2, 3, and 4 somatic mutations. More than 90% of individuals with CH had only one driver mutation identified. e, Percentages of different CH mutation types identified. f, Relative prevalence of each of the six base substitution types amongst the identified CH mutations. g, Density plot showing the variant allele fraction (VAF) distribution of all CH somatic mutations. h, Density plot showing similar VAF distribution for different mutation types. Mean and median are indicated for g and h.

Fig. 2 | Age distribution of CH by mutant gene, clone size, and sex. a, Prevalence of CH in the cohort with advancing age. The blue line represents the smoothed model fitted to a generalized additive model with 95% confidence interval (CI; gray shadow). b, Prevalence of CH by age stratified by the top eight most frequently mutated genes. Colored lines represent the smoothed model fitted to a generalized additive model with 95% CI (colored shadows). Y-axis is log-scaled. c, Clone size, estimated by the variant allele fraction (VAF), increases with age. The blue line represents the smoothed model fitted to a generalized additive model and the shadow represents the 95% CI. d, Empirical cumulative distribution (ECD) of the age of individuals with CH stratified by sex. CH was observed one year earlier in females than in males (median 61 versus 62 years; P = 1.6 × 10⁻⁴, two-sided pairwise Wilcoxon rank sum test).