Genetics and epidemiology of mutational barcode-defined clonal hematopoiesis

Clonal hematopoiesis (CH) arises when a substantial proportion of mature blood cells is derived from a single hematopoietic stem cell lineage. Using whole-genome sequencing of 45,510 Icelandic and 130,709 UK Biobank participants combined with a mutational barcode method, we identified 16,306 people with CH. Prevalence approaches 50% in elderly participants. Smoking demonstrates a dosage-dependent impact on risk of CH. CH associates with several smoking-related diseases. Contrary to published claims, we find no evidence that CH is associated with cardiovascular disease. We provide evidence that CH is driven by genes that are commonly mutated in myeloid neoplasia and implicate several new driver genes. The presence and nature of a driver mutation alters the risk profile for hematological disorders. Nevertheless, most CH cases have no known driver mutations. A CH genome-wide association study identified 25 loci, including 19 not implicated previously in CH. Splicing, protein and expression quantitative trait loci were identified for CD164 and TCL1A.


Associations of CH with disease
In case-control analysis, CH had strong associations with both myeloid and lymphoid neoplasia (Table 1 and Supplementary Table 3). CH was also associated with existing or subsequent diagnoses of chronic obstructive pulmonary disease (COPD), lung cancer, peripheral artery disease (PAD), emphysema and alcohol abuse. These nonhematological conditions are known to be smoking-related, and their significance was substantially attenuated once smoking was taken into account. This suggests that the associations may be due to residual confounding from various aspects of smoking behavior. Hematological disorder associations were not similarly attenuated by smoking adjustments. Analysis restricted to never smokers produced similar conclusions (Supplementary Table 4).
Case-control analysis revealed no indication of association between CH and key CVD phenotypes in either UKB or ISL (Supplementary Table 5). Unadjusted for smoking, no CVD phenotype passed Bonferroni significance and, once adjusted, none was even nominally significant. To examine this further, we conducted a time-to-CVD-event analysis in UKB. We also considered whether CH defined by mutational barcodes differed in this respect from CH containing a CPLD mutation. Additionally, we examined CHIP as defined using the filtering strategy recommended in refs. 19,20. In all three instances, we were unable to measure any increased risk of CVD in people with CH. We did, though, observe strong effects from potential confounders in the multivariable model (Table 2). CH has also been implicated in pro-inflammatory phenomena, a suggested basis for its reported CVD association 21,22. Accordingly, we looked for CH associations with a panel of inflammatory conditions, but saw none (Supplementary Table 5). In UKB, CH was associated with alcoholic liver disease (Table 1) but not fatty liver conditions, at variance with a recent report 23.
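The distinction drawn above between Bonferroni significance and nominal significance can be illustrated with a minimal sketch. It assumes a conventional family-wise α of 0.05, which the text does not state. The phenotype names and P values are hypothetical and are not taken from Supplementary Table 5.

```python
def classify_p_values(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, str]:
    """Label each test as Bonferroni-significant, nominally significant or not significant.

    alpha = 0.05 is an assumed family-wise error rate, not a value quoted in the study.
    """
    n_tests = len(p_values)
    bonferroni_threshold = alpha / n_tests  # Bonferroni: divide alpha by the number of tests
    labels = {}
    for name, p in p_values.items():
        if p < bonferroni_threshold:
            labels[name] = "bonferroni-significant"
        elif p < alpha:
            labels[name] = "nominally significant"
        else:
            labels[name] = "not significant"
    return labels


# Hypothetical P values for illustration only.
example = {
    "phenotype_A": 0.0004,
    "phenotype_B": 0.03,
    "phenotype_C": 0.4,
    "phenotype_D": 0.2,
    "phenotype_E": 0.9,
}
labels = classify_p_values(example)
for name, label in labels.items():
    print(name, label)
```

With five tests the assumed threshold becomes 0.05 / 5 = 0.01, so a P value of 0.03 is nominally significant but does not pass correction.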
To better understand the increased mortality rate attributable to CH, we examined the primary cause of death records in a meta-analysis of ISL and UKB. Participants with CH were at increased risk of death from both myeloid and lymphoid hematological disorders, as well as lung cancer, COPD and alcohol abuse (Supplementary Table 6). As before, the nonhematological risks were attenuated (but not eliminated) by adjustment for smoking. Chronic ischemic heart disease and heart failure had nominally significant hazard ratios (HRs), but did not meet the Bonferroni threshold. Even though a substantial number of deaths from acute myocardial infarction occurred in the cohort, their frequency was not elevated in participants with CH.

Association of mosaic somatic mutations with CH
Most prior DNA sequence-based studies identified CH using a predefined list of CPLD mutations that are already known to occur in myeloid neoplasia 4,13-15. Some studies have tested mutated genes for statistical association with CH or evidence of positive selection in CH 1,3,24,25. Our method can identify CH irrespective of whether a CPLD mutation is present. Thus we can search in a comparatively unbiased manner for genes with mutations that drive CH. We conducted a gene-based burden test for somatic mutations associated with CH (Fig. 1a and Supplementary Table 7). As anticipated from previous studies 1,3,4, mutations in DNMT3A, TET2 and ASXL1 were the most significantly associated with CH. Most of the other genes are known to be commonly mutated in myeloid disease. Some are implicated, additionally or uniquely, in lymphoid neoplasia 26.
We also examined the intragenic distribution of the somatic mutations and used Fisher's exact tests to identify individual mutations that drive the signal from each gene (Fig. 1b-e and Supplementary Fig. 1). ASXL1 exhibited predominantly frameshift or nonsense mutations in the 13th (last) exon. ASXL1 activation in myeloid neoplasia typically results from gain-of-function mutations that produce C-terminally

mostly granulocytes. These cells have high production rates and short time lags from committed progenitor cells, which in turn require continual replenishment from HSC or multipotent progenitors 10. Naturally, the lymphocytic lineages have a much greater time lag from the underlying HSC population. Clonal expansions in CH can show multilineage involvement extending to lymphocytes, but do not always do so 11,12.
Perhaps as a result of the proximity of myeloid lineages to the underlying HSC population, somatic mutations that initiate myeloid malignancies are thought to arise in the HSC compartment. Similar mutations can be found in apparently normal but clonally expanded hematopoietic cells from individuals who appear to be well. In both cases, the mutations can be traced back to underlying HSC 12. We refer to them as 'candidate preleukemic driver' (CPLD) mutations, because of their propensity to drive CH expansions and consequently to increase risks of hematological disease. Indeed, the presence of a CPLD mutation in a blood sample from an evidently healthy individual has been used by many investigators to define the presence of CH 4,13-15. Clearly, and as pointed out by others 16, this biases the detection of CH in favor of genes and mutations that may subsequently lead to the development of myeloid neoplasia.
As cell populations grow they accumulate mutations, most of which are presumed to be phenotypically inconsequential. As a result, every clone is uniquely 'barcoded' by the somatic mutations that were present in the founder cell at the inception of the clone. If a particular clone expands sufficiently, its mutational barcode becomes evident in DNA sequence reads. We have shown through whole-genome sequencing (WGS) of peripheral blood that clonal expansions indicative of CH can be detected by examining counts of mosaic somatic mutations (if sufficient care is taken to differentiate them from germline variants and sequencing errors) 1. Thus CH expansions can be identified solely on the basis of barcode mutations, irrespective of whether they carry a CPLD mutation. This method enabled us and others to show that CH is very common, if not inevitable, in the elderly 1-3. Moreover, most CH cases do not carry an obvious CPLD mutation. Here we use mutational barcodes to study the epidemiology and genetics of CH in participants from Iceland (ISL) and the UK Biobank (UKB) for whom we have generated extensive WGS data.

Identification of CH cases from WGS in ISL and UKB
We used WGS from 45,510 Icelanders and 130,709 British ancestry participants from the UKB 17,18. Average sequencing depth was 33× for UKB and 38× for ISL. Participants with prior diagnoses of hematological disorders or grossly abnormal hematology measurements on entry were excluded. We identified people with CH based on an evolution of our mutational barcode strategy 1. Mosaic somatic mutation barcodes were generated by modeling low variant allele fraction (VAF) sequence reads (Extended Data Fig. 1). To reduce contamination from low-VAF germline variants and recurrent sequencing errors, we used only indicator mutations that were observed once in each cohort and restricted in VAF range to 0.10-0.25. Participants with barcodes containing a number of indicator mutations above a threshold were considered to have CH. We identified 16,306 people with CH, a prevalence over the two cohorts of 9.3%.
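As a quick arithmetic check of the combined prevalence quoted above, the following sketch divides the number of CH cases by the sum of the two cohort sizes. It assumes that the denominators are the participant counts given in this paragraph; the function name is illustrative.

```python
# Cohort sizes and case count as quoted in the text above.
ISL_PARTICIPANTS = 45_510
UKB_PARTICIPANTS = 130_709
CH_CASES = 16_306


def prevalence(cases: int, cohort_sizes: list[int]) -> float:
    """Return cases divided by the total number of participants."""
    return cases / sum(cohort_sizes)


combined = prevalence(CH_CASES, [ISL_PARTICIPANTS, UKB_PARTICIPANTS])
print(f"Combined CH prevalence: {100 * combined:.1f}%")  # prints 9.3%
```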
As anticipated from previous studies, CH was uncommon in people under 45 years old, but increased dramatically in frequency thereafter, approaching 50% by age 80. Both current and previous smoking substantially increased risk of CH (Extended Data Fig. 1b,c). Pack years further increased CH risk (P = 8.57 × 10⁻⁷), whereas years since stopped smoking were protective (P = 3.54 × 10⁻¹⁰; Supplementary Table 1), indicating a dose-dependent relationship between smoking and CH. While the mechanisms by which age and smoking promote CH are yet to be elucidated, both factors clearly are potential confounders in epidemiological analyses. Participants with CH were at substantially greater risk of all-cause mortality and of being diagnosed subsequently with a

Article https://doi.org/10.1038/s41588-023-01555-z

truncated proteins 27. However, we also saw protein truncation mutations in exon 12, namely Arg404Ter and Arg417Ter, that associated strongly with CH (P = 9.7 × 10⁻⁶ and 2.6 × 10⁻⁶, respectively, UKB, Fisher's exact test). These mutations are puzzling because they would be expected to induce nonsense-mediated decay of the ASXL1 transcript 28, which would obviate a gain-of-function effect. Further investigation is warranted. The CH association with GNB1 was completely attributable to Lys57Glu mutations (P = 1.4 × 10⁻⁴⁶, UKB, Fisher's exact test). GNB1 mutations affecting Lys57 predominate in myeloid neoplasia, whereas mutations at other positions are more frequent in lymphoid malignancies 29. In CALR, high-impact mutations clustered in the ninth (last) exon, suggesting a gain-of-function analogous to that seen in PPM1D and ASXL1 (Fig. 1d,e). Such mutations are present in essential thrombocythemia (ET) and primary myelofibrosis 30; however, they have not been consistently implicated as CH-defining mutations (Supplementary Table 7). We obtained robust evidence linking high-impact PRR14L mutations to CH (P = 3 × 10⁻¹¹, UKB, SKAT-O). PRR14L is not generally recognized as a CH gene (Supplementary Table 7); however, mutations have been seen in chronic myelomonocytic leukemia and infrequently in CH participants 31.
We previously reported a tentative association between CH and MYD88 mutations in ISL 1. We confirm that finding robustly here (P = 1.9 × 10⁻¹⁰, UKB, SKAT-O), the strongest signal coming from Leu252Pro. MYD88 Leu252Pro (formerly Leu265Pro) mutations are particularly related to lymphoplasmacytic lymphoma/Waldenström macroglobulinemia (LPL/WM), which would not be expected to have a substantial bloodborne component 26,32,33. However, MYD88 mutations also occur in an atypical minority of chronic lymphocytic leukemia (CLL) and Leu252Pro has been observed in normal B cells from patients with LPL/WM 34,35. We also reported a CH association with mutations in MTA2 (ref. 1) and confirm that finding here (P = 7.9 × 10⁻⁷, UKB, SKAT-O). Individually significant missense mutations were clustered within the SANT domain (Fig. 1b,c), which recruits histone deacetylase-1 to the nucleosome remodeling and deacetylase (NuRD) complex 36. Even though we were able to demonstrate strong associations between the common CPLD genes and CH, most cases could not be accounted for by an obvious driver mutation (Extended Data Fig. 2). Several factors may contribute to this: a lower sensitivity for CPLD mutation detection in WGS versus whole exome or panel sequencing, driver

Differential risks of hematological disorders
We investigated the types of hematological disorders arising in participants with CH. Moreover, we considered how the risk profile of CH defined by mutational barcodes (referred to herein as simply 'CH' or 'barcode-CH' when disambiguation is required) differed from CH defined by the presence of a CPLD mutation (CPLD-CH) or by the absence of a CPLD mutation in a barcode positive case (CPLDneg-CH) (Supplementary Table 8). As shown in Fig. 2a, HRs for both myeloid and lymphoid disorders were increased for all three CH classes. There were, however, differences in nuance. Within lymphoid subtypes, barcode-CH and CPLDneg-CH carried significant risks of CLL, whereas CPLD-CH did not. This suggests that some barcode-CH cases may have incipient, undiagnosed CLL or high-count monoclonal B cell lymphocytosis (MBL). However, because B cells normally comprise a small proportion of the leukocyte population, even in MBL, B cell clonal expansions are unlikely to pass our CH detection threshold in the absence of an overt hematological abnormality. Accordingly, they are unlikely to account for a substantial number of barcode-CH cases. Moreover, associations with MPN and CLL could be driven by undetected mCA accompanying the barcode-CH 37,38. We investigated whether, among CPLD-CH participants, risks of hematological disorders differed by the particular CPLD gene involved (Fig. 2b). Significant HRs were seen for ASXL1-CH, DNMT3A-CH, JAK2-CH, SF3B1-CH, SRSF2-CH, TET2-CH and TP53-CH but not for PPM1D-CH. The risk from JAK2-CH was greater than from any other of the CPLD genes. While participants with DNMT3A-CH were at somewhat increased risk, HR estimates for other CPLD-CH types including ASXL1-CH and TET2-CH were substantially higher.
One CH GWAS variant, at TERT, was reported by us previously in association with barcode-CH in ISL 1. We reproduced this association; however, the sentinel TERT variant this time was rs7705526_A (OR = 1.28, P = 1.79 × 10⁻⁷⁸), which is the same variant as subsequently reported for CPLD-CH 13. Several other CH GWAS loci have been associated with related phenotypes, such as CPLD-CH 13-15, mCA 38,39, loss of Y chromosome (LoY) 40-42 or MPN 43,44. The LD between our CH GWAS variants and those signals is detailed in Supplementary Table 10. We found no previous reports for 19 of the CH GWAS loci.
To gain further insight into CH without known drivers, we repeated the GWAS using only CPLDneg-CH participants as cases (Extended Data Fig. 4 and Supplementary Table 11). Effects were broadly similar to the barcode-CH GWAS (m = 1.02, P = 1.47 × 10⁻¹⁸). Two new loci were detected: TERC and KDM6B. The protective effect of chr14:TCL1A rs2887399_T was stronger in CPLDneg-CH, perhaps due to the differing effects of this allele in various CPLD mutation backgrounds (see CPLD gene-specific CH GWAS associations, below). CHEK2 and SMC4 variants had somewhat larger effects in barcode-CH.

CPLD gene-specific CH GWAS associations
We repeated the GWAS meta-analysis on CPLD-defined CH for driver genes where there was sufficient power to do so. Considering all variants that were significantly associated with barcode-CH or any one of the CPLD-CH types, we compared their effects on barcode-CH and various types of CPLD-CH. There were substantial differences in effects between CPLD-CH types (Extended Data Fig. 5 and Supplementary Table 12).
Viewing the patterns overall, most variants demonstrated no effect on ASXL1-CH. While TET2-CH, for example, showed a highly significant slope when regressed on barcode-CH (m = 0.94, P = 5.64 × 10⁻¹⁰), the slope for ASXL1-CH versus barcode-CH was much shallower and of lower significance (m = 0.41, P = 8.76 × 10⁻⁴). Moreover, PPM1D-CH produced no significant regression against barcode-CH. One possible explanation is that environmental factors have a greater influence on ASXL1-CH and PPM1D-CH than on other CPLD-CH types: risk of PPM1D-CH was substantially increased in patients who have undergone chemotherapy (OR = 7.9, P = 4.5 × 10⁻⁴; Supplementary Table 13), while ASXL1-CH was more strongly associated with smoking than other CPLD-CH types (Supplementary Table 14), in agreement with previous reports 9,45,46.

CH GWAS variants affect blood traits, telomeres and MPN
To gain insight into the functionality and pleiotropic effects of the CH GWAS variants, we examined published GWAS associations for them and variants in LD (Supplementary Table 15). Even though participants with grossly abnormal hematology had been excluded from the study, many clinical hematology parameters 47 showed associations with the CH phenotype. Moreover, many CH GWAS loci had associated clinical hematology traits in the GWAS Catalog or UKB data (Supplementary Tables 15 and 16 and Extended Data Fig. 6). Several CH GWAS variants were reportedly associated with leukocyte telomere length (LTL) in the GWAS Catalog. To investigate this in detail, we examined the relationship between CH and LTL, using UKB samples that were contemporaneously assessed for both CH (in this study) and LTL (in ref. 48). CH, along with age and prior or current smoking, was strongly associated with shorter LTL (β = −0.129, P < 2 × 10⁻¹⁶; Supplementary Table 17) as seen previously in ISL 1. Moreover, most CH GWAS variants associated with shorter telomeres, in line with the CH:LTL phenotype association. However, the two chr5:TERT variants and a variant on chr6p22 (near the MHC) were significantly associated with longer telomeres (Fig. 4a and Supplementary Table 18). As a result of this discordance, no significant regression parameters could be obtained and, consequently, a Mendelian randomization (MR) analysis was not considered prudent. For a complementary examination of the effects of LTL GWAS variants on the CH phenotype, we conducted a new GWAS for LTL in the UKB, using our current WGS-based imputation. We found 191 LTL variants (Supplementary Table 19). Their effects on LTL and CH are plotted in Fig. 4b. We found evidence of a massive discordance of effects, with some longer LTL alleles associated with increased CH risk and others associated with reduced risk (indicated as 'cloud 1' and 'cloud 2', respectively, in Fig. 4b). Here again, MR analysis was not considered advisable.
Observed LTL is measured in blood that may contain CH expansions. So, any variant that promotes CH but does not directly affect telomeres would appear to cause shorter telomeres, because of the association between CH and contemporaneously observed short telomeres. By the same token, such CH-promoting variants might be identified as LTL-associated variants in an LTL GWAS. To examine this, we repeated the GWAS for LTL, using only participants without proven CH. There was no evident difference in the effects of LTL GWAS variants between the two subgroups (Extended Data Fig. 7).
As was shown in Fig. 2a, CH associated strongly with subsequent diagnoses of MPN, in line with its proposed status as a clinical precursor to MPN 49. The majority of CH GWAS variants also conferred risk of MPN (Fig. 4c and Supplementary Table 18). MR analysis was consistent with CH having a causative effect on MPN (inverse-variance weighted (IVW), P = 7.86 × 10⁻⁶; Supplementary Table 20).
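For readers unfamiliar with the IVW estimator named above, a minimal sketch follows. It assumes the standard fixed-effect IVW formula computed from per-variant summary statistics, which may differ in detail from the implementation used for Supplementary Table 20. The function name and the input values are illustrative and are not study data.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate from summary statistics.

    beta_exposure: per-variant effects on the exposure (here, CH)
    beta_outcome:  per-variant effects on the outcome (here, MPN)
    se_outcome:    standard errors of the outcome effects
    Returns (estimate, standard error, two-sided P value from a normal approximation).
    """
    weights = [bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)]
    numerator = sum(
        bx * by / (se * se)
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    estimate = numerator / sum(weights)
    std_error = math.sqrt(1.0 / sum(weights))
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return estimate, std_error, p_value


# Illustrative inputs: outcome effects are exactly twice the exposure effects.
bx = [0.10, 0.20, 0.30]
by = [0.20, 0.40, 0.60]
se = [0.05, 0.05, 0.10]
est, est_se, p = ivw_estimate(bx, by, se)
print(est, est_se, p)
```

The estimate is a weighted average of the per-variant ratios of outcome effect to exposure effect, so the illustrative inputs return a slope of 2.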

CH GWAS variants are involved in expression quantitative trait loci (eQTL), splicing quantitative trait loci (sQTL) and protein quantitative trait loci (pQTL)
We considered whether the CH GWAS variants affect RNA abundance or splicing of nearby genes. For each sentinel variant, we identified all variants in LD (r² ≥ 0.8) and then queried public RNA-seq eQTL and sQTL databases, focusing on blood or blood-related cell types. Variants with substantial cis effects were investigated further in ISL RNA-seq data from 17,848 peripheral blood samples (Supplementary Table 21). eQTL at ABCC5 and TRIM59/SMC4 are described in Extended Data Fig. 8, while other salient examples are discussed below.

CD164 is, biologically, a good candidate for a role in CH pathogenesis. It is expressed on early HSC and can affect their proliferation, differentiation, adhesion to bone marrow stromal elements, migration and retention in HSC niches 50-52. Public sources revealed a CD164 sQTL in blood, lymphoblastoid B-cell lines (LCL) and several nonhematological tissues. The top reported sQTL in whole blood has r² = 0.81 with our sentinel CH GWAS hit (rs3056655), while the top sQTL in LCL has r² = 0.86. Using ISL blood RNA-seq, we ascertained that the sQTL affects the two major isoforms of CD164, which differ by the presence (CD164-202) or absence (CD164-203) of exon 5. The latter isoform lacks the full-length CD164 protein's glycosaminoglycan attachment site. Increased exon 5 skipping was strongly associated with the rs3056655_A CH risk allele (P = 3.04 × 10⁻³⁰², β = 0.44).

Several high-effect, rare variants were deemed to require further confirmation and were not considered further (indicated in Supplementary Table 9).
We carried out a proteomic analysis of plasma samples from 12,636 UKB participants for whom we had CH status information, using the Olink platform to interrogate levels of 1,472 proteins and test them for association with CH. Several proteins of relevant biological interest ranked highly (by significance), including the hematopoietic progenitor cell growth factors FLT3LG and CLEC11A, thrombopoietin THPO, pro-inflammatory cytokines CCL5 and TNFSF12 and smoking marker ALPP (Supplementary Table 22). Second in the ranking was TCL1A, an oncoprotein in T cell leukemias, lymphomas, CLL and several nonhematological cancers 53. Higher TCL1A levels were associated with CH (P = 2.05 × 10⁻¹³, β = 0.21), and this replicated in ISL SomaScan proteomic data (P = 2.86 × 10⁻³, β = 0.06) (ref. 54). TCL1A is of particular interest because a CH GWAS variant is located 162 bp upstream of the gene's transcription start site (Fig. 6a). The minor allele, rs2887399_T (minor allele frequency (MAF) ∼20%), is protective against CH in our data. It has been implicated (with varying direction of effect) in CPLD-CH, mCA and LoY (see above and refs. 13,41,55). The rs2887399_T allele is reported to suppress ectopic expression of TCL1A in CPLD mutant HSC 56. A search for cis-pQTL using UKB Olink and ISL SomaScan identified two conditionally independent LD classes of variant, both with minor alleles acting to reduce TCL1A expression. One LD class of pQTL was correlated with rs2887399_T (r² ∼0.67), whereas a second LD class of pQTL, typified by rs78986913_A, was not (r² ∼0.092, MAF ∼4%; Fig. 6b,c). Curiously, rs78986913_A did not show an independent signal in GWAS for CH predisposition in conditional analysis (P_adj = 0.78).
To investigate this further, we searched for RNA-seq cis-eQTL for TCL1A. In whole blood, both the 4% MAF rs78986913_A and the 20% MAF rs2887399_T variant classes reduced expression of TCL1A. Conditioning the eQTL signal on rs78986913, COLOC 57 revealed an 85% probability of peak identity between the rs2887399 eQTL and the CH GWAS peak. Both the 4% MAF and 20% MAF variant classes affected expression in B cells. However, in monocytes only the 20% MAF rs2887399_T variant was associated with TCL1A RNA expression and a 4% MAF rs78986913_A peak was not in evidence (Fig. 6d-g). It appears that, in this case, the eQTL and pQTL of relevance to CH may be restricted to the myeloid lineage.

Discussion
This study expands greatly on our previous investigation of CH detected using mutational barcodes 1, extending the number of cases from 1,403 to 16,306. We reaffirm the strong associations between CH, age and smoking and provide evidence that smoking has a dose-dependent impact on CH. Aside from confirming the risk for hematological diseases, we find that CH associates with COPD, lung cancer, PAD, emphysema and alcohol abuse. These conditions are all smoking-related. The effects of CH on their risks were strongly attenuated when adjusted for smoking. It is likely that the remaining associations are due to residual confounding from various aspects of smoking behavior that could not be fully taken into account in the analysis. It is notoriously difficult to remove all residual confounding from smoking behavior, especially when using self-reported information 58,59. An attractive hypothesis is that smoking creates an inflammatory state, exerting pressure on the hematopoietic system, depleting the HSC and progenitor cell pool and driving compensatory HSC self-renewal, thereby increasing the probability of a clonal outgrowth 60-62.
Studies that reported an association between CH and CVD received a great deal of attention, having been reviewed extensively 15,21,22. Somewhat less attention was given to contemporaneous studies reporting a lack of association, albeit sometimes in smaller samples 7-9,12,14,15,63. The present study finds no evidence of an association between CVD and barcode-CH or CPLD-CH. The strong potential for confounding by age and smoking has been emphasized, here and elsewhere 14. Moreover, our stringent exclusion of people with a pre-existing hematological abnormality may be a factor. Some hematological disorders (particularly MPN) have known associations with blood clotting and CVD risk 64. We observed an increased incidence of CVD among the participants whom we excluded compared to participants without CH (HR = 5.08, P < 2 × 10⁻¹⁶). We also note that published CVD risks are seen particularly for ASXL1-CH (which has a demonstrable smoking bias) and JAK2-CH (which associates strongly with MPN) 9,15. Not taking these considerations sufficiently into account may create or inflate an apparent CVD risk.
There may be a large number of undiscovered mutations that confer a sufficient fitness advantage to drive HSC clonal expansions to overt CH over a long period of time 2,24,25. We find several genes that are not well recognized as CH drivers, some with previously noted involvement in myeloid (or in some cases lymphoid) disease. Nevertheless, most CH still cannot be accounted for by an obvious driver mutation. No satisfactory explanation has yet emerged and the question merits further investigation.
Here we provide new evidence for 25 loci with germline variants that predispose to barcode-CH. We additionally identify three secondary signals and two suggestive missense variants. Several variants overlap with loci that have been associated with CPLD-CH, mCA, LoY and MPN, underlining the close relationships between these phenotypes 1,9,14,15,42-44,65,66. CH GWAS variants commonly show pleiotropic associations with blood cell traits, LTL and MPN but not CVD: no CH GWAS variants had listings for CVD in the GWAS Catalog, and MR analysis gave no indication that CH risk variants increased CVD outcomes (Supplementary Table 20).
Based on MR using the few instrumental variables that were available to them at the time, a study described in ref. 67 concluded that long-LTL alleles predispose to CH, whereas CH alleles predispose toward shorter telomeres. This is not fully consistent with our observations, in which we see many discordant effects (Fig. 4). MR studies typically show that long-LTL alleles are associated with cancer predisposition, whereas observed telomere lengths in blood of predisposed people or in tumors can be either longer or shorter. Indeed, we find that CH is linked to shorter observed LTL, perhaps as a result of extra divisions that an HSC clone had to undertake to gain its dominance (see Fig. 4a above and ref. 1). In leukemias, paradoxically, risk is increased by both long and short observed LTL, measured prospectively 68. A rationalization for this, as evidenced in congenital telomeropathies, could be that telomeres that are too short impair HSC function and precipitate a bone marrow insufficiency. This places a selective pressure on the HSC population and the marrow is repopulated by HSCs that have acquired alterations allowing them to bypass the replicative exhaustion induced by the telomere erosion 69,70. MR studies in MPN, CLL and

[Fig. 6 caption, partial] However, when the eQTL signal is adjusted for the 4% MAF rs78986913 variant (P_adj values shown in blue), then the peaks overlap with a PP.H4 = 85% probability that they correspond to the same signal. The position of the CH GWAS sentinel variant rs2887399 is indicated by the gray vertical line. f, TCL1A eQTL from 758 B cell RNA samples. g, TCL1A eQTL from 884 monocyte samples. In all panels except e, the r² focus is on rs2887399.

Participants were also excluded if they had substantial evidence of abnormality from hematology parameters measured at recruitment (if available), comprising white blood cells (WBC) < 1.5 × 10⁹ or > 35 × 10⁹ cells per l, hemoglobin concentration (HGB) < 8 g dl⁻¹ or platelet count (PLT) < 50 × 10⁹ cells per l.
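The hematology exclusion thresholds above translate directly into a simple check. The sketch below is illustrative only: the function and argument names are hypothetical, and treating a missing measurement as non-excluding is an assumption based on the phrase 'if available'.

```python
from typing import Optional


def has_grossly_abnormal_hematology(
    wbc: Optional[float] = None,  # white blood cells, in 10^9 cells per l
    hgb: Optional[float] = None,  # hemoglobin concentration, in g per dl
    plt: Optional[float] = None,  # platelet count, in 10^9 cells per l
) -> bool:
    """Return True if any available measurement breaches the quoted thresholds.

    Missing values (None) are skipped; this handling is an assumption.
    """
    if wbc is not None and (wbc < 1.5 or wbc > 35.0):
        return True
    if hgb is not None and hgb < 8.0:
        return True
    if plt is not None and plt < 50.0:
        return True
    return False


print(has_grossly_abnormal_hematology(wbc=6.2, hgb=13.5, plt=250))  # within range
print(has_grossly_abnormal_hematology(wbc=40.0))                    # WBC too high
```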
We extracted all singleton SNPs (SNPs occurring only once in the UKB cohort) for 149,960 participants, then filtered on genotype quality (GQ) ≥ 90 to obtain some 287 million singleton variants (ignoring hard filtering).
The following filter steps were applied:
• use FILTER in (PASS, Low_QD)
• 15 ≤ depth ≤ 60
• minor allele reads ≥ 3 to remove spurious low-VAF bump

We estimate the number of somatic singleton mutations with 0.1 ≤ VAF ≤ 0.25 as the number of observed variants in this VAF range minus the number of expected germline variants. To model the expected number of germline variants in this VAF range, we make the following assumptions:
• The expected numbers of germline variants in the VAF ranges 0.1-0.25 and 0.75-0.9 are approximately equal (that is, there is symmetry in the germline variant VAF distribution).
• The vast majority of variants in VAF ranges 0.35-0.65 and 0.75-0.90 are germline variants.
• The ratio of germline variants in VAF ranges 0.75-0.90 and 0.35-0.65 is approximately constant for each participant, given sequencing depth and sequencing center.
For each depth, we compute the ratio of total observed (germline) variants in VAF range 0.75-0.9 compared to VAF range 0.35-0.65. This computation is done separately for each sequencing center. For each participant, the number of expected germline variants in VAF range 0.1-0.25 for a given sequencing depth is then computed as the expected fraction of germline variants in VAF range 0.75-0.9, given the observed number of variants in VAF range 0.35-0.65 at the given depth. Only sequencing depths ≥ 21 were considered. Based on an expected fraction of CH of around 1% at age 40, we set a threshold of ≥ 20 observed somatic singleton indicator mutations with 0.1 ≤ VAF ≤ 0.25 to define CH. This threshold was adjusted for sequencing center (+1 for Vanguard and −2.2 for Sanger) to achieve agreement of age dependency between the sequencing centers. Note that the VAF of the indicator mutations is not a precise measurement of the VAF of the CH clone: because only ∼20 indicator mutations are required to define CH, VAF distributions of somewhat smaller and larger clones are likely to pass through the detection window. Moreover, larger clones will generate subclones with indicator mutations of lower VAF.
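The estimation procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the study: variants are represented as hypothetical (depth, VAF) pairs for singletons that have already passed filtering, all participants in `cohort` are assumed to come from one sequencing center, and the center adjustment is treated as an additive offset to the threshold. All function names are illustrative.

```python
from collections import defaultdict

LOW, MID, HIGH = (0.10, 0.25), (0.35, 0.65), (0.75, 0.90)  # VAF windows from the text
MIN_DEPTH = 21      # only sequencing depths >= 21 are considered
CH_THRESHOLD = 20   # estimated somatic singletons needed to call CH


def vaf_bin(vaf):
    """Return 'low', 'mid' or 'high' for a VAF inside one of the windows, else None."""
    for name, (lo, hi) in (("low", LOW), ("mid", MID), ("high", HIGH)):
        if lo <= vaf <= hi:
            return name
    return None


def count_by_depth(variants):
    """variants: iterable of (depth, vaf). Returns {depth: {'low': n, 'mid': n, 'high': n}}."""
    counts = defaultdict(lambda: {"low": 0, "mid": 0, "high": 0})
    for depth, vaf in variants:
        name = vaf_bin(vaf)
        if name is not None and depth >= MIN_DEPTH:
            counts[depth][name] += 1
    return counts


def high_to_mid_ratio(cohort):
    """Per-depth ratio of variants in the high window to the mid window, pooled over one center."""
    high, mid = defaultdict(int), defaultdict(int)
    for variants in cohort.values():
        for depth, c in count_by_depth(variants).items():
            high[depth] += c["high"]
            mid[depth] += c["mid"]
    return {d: high[d] / mid[d] for d in mid if mid[d] > 0}


def estimated_somatic(variants, ratio):
    """Observed low-window variants minus the expected germline count.

    By the symmetry assumption, the expected germline count in the low window
    equals ratio(depth) * observed mid-window count, summed over depths.
    """
    counts = count_by_depth(variants)
    observed_low = sum(c["low"] for c in counts.values())
    expected_germline = sum(ratio.get(d, 0.0) * c["mid"] for d, c in counts.items())
    return observed_low - expected_germline


def has_ch(variants, ratio, center_offset=0.0):
    """Call CH when the estimate reaches the threshold plus an assumed additive center offset."""
    return estimated_somatic(variants, ratio) >= CH_THRESHOLD + center_offset


# Toy cohort (not study data): two participants, all variants at depth 30.
cohort = {
    "a": [(30, 0.5)] * 100 + [(30, 0.8)] * 10 + [(30, 0.2)] * 10,
    "b": [(30, 0.5)] * 100 + [(30, 0.8)] * 10 + [(30, 0.2)] * 35,
}
ratio = high_to_mid_ratio(cohort)
print({pid: has_ch(v, ratio) for pid, v in cohort.items()})
```

In the toy cohort the pooled ratio at depth 30 is 20/200 = 0.1, so each participant is expected to carry about 10 germline variants in the low window. Participant "a" therefore has an estimate near 0 and participant "b" an estimate near 25.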

ISL.
For ISL, we needed to accommodate different sequencing platforms. A total of 33,189 samples sequenced on Illumina HiSeqX were processed to determine CH status as described previously 1. For 12,510 samples sequenced on Illumina NovaSeq, reads were aligned to the hg38 reference using bwa mem (v0.7.10), indels were realigned using GATK IndelRealigner (GATK 2.3-9) and duplicates were removed using Picard MarkDuplicates (v1.117). Genotypes were called using GATK HaplotypeCaller and GATK GenotypeGVCFs (v.2014.4-3.3.0-0-ga3711aa). Variants were (hard) filtered as above. CH status was determined as described above for UKB; however, singletons were determined based on a cohort of ∼100,000 sequenced participants. As no base quality recalibration was applied to ISL, the estimated number of somatic singletons for 0.1 ≤ VAF ≤ 0.25 was higher than for UKB (46 for WGS NoPCR Nova and 32 for NEB WGS). Average sequence depth was 38×.
Definition of CPLD-CH. We ran the Strelka2 (2.9.10) somatic workflow on CPLD gene regions on CRAM files from genome alignment (see above). To suppress artifacts due to mapping problems, we used one of the CRAM files as a normal sample for all other samples. Variants were filtered on depth > 10, FILTER = 'PASS' and 0.01 ≤ VAF ≤ 0.99. To identify germline variants, we performed a binomial test on VAF against 0.5, and classified calls with P > 0.05 as potential germline calls. Variants with > 5 observations and > 75% potential germline calls were removed. We annotated the remaining variants using VEP and kept only those moderate/high-impact variants that were either high impact (but not in 'GNAS', 'JAK2', 'SRSF2', 'SF3B1') or present in ref. 13.
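The germline screen described above can be illustrated with a short sketch. It assumes an exact two-sided binomial test of the alternate-allele read count against an expected fraction of 0.5; the text does not say whether an exact or approximate test was used. The function names and read counts are hypothetical.

```python
from math import comb


def binomial_p_vs_half(alt_reads: int, depth: int) -> float:
    """Exact two-sided binomial P value for alt_reads out of depth reads against p = 0.5.

    The distribution is symmetric at p = 0.5, so the two-sided P value is twice
    the smaller tail, capped at 1.
    """
    k = min(alt_reads, depth - alt_reads)
    tail = sum(comb(depth, i) for i in range(k + 1)) / 2**depth
    return min(1.0, 2.0 * tail)


def is_potential_germline(alt_reads: int, depth: int, alpha: float = 0.05) -> bool:
    """Calls whose VAF is not significantly different from 0.5 are potential germline calls."""
    return binomial_p_vs_half(alt_reads, depth) > alpha


def remove_as_germline(calls, min_observations: int = 5, max_fraction: float = 0.75) -> bool:
    """calls: (alt_reads, depth) for every sample carrying one variant.

    The variant is removed when it has more than min_observations calls and more
    than max_fraction of them are potential germline calls.
    """
    n = len(calls)
    if n <= min_observations:
        return False
    n_germline = sum(is_potential_germline(a, d) for a, d in calls)
    return n_germline / n > max_fraction


# Hypothetical read counts for illustration.
print(remove_as_germline([(20, 40)] * 6))  # balanced reads in 6 samples
print(remove_as_germline([(4, 40)] * 6))   # VAF of 0.1 in 6 samples
```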
Note that the definition of CPLD-CH is not subject to the same VAF restrictions as the mutational barcode method described above. Moreover, particularly in younger individuals, CPLD-CH can be detected in the absence of a mutational barcode, as discussed in ref. 1 (see also Supplementary Table 8).
To define CHIP in Table 2, we used the strategy recommended in refs. 19,20, adapted to our dataset. Variants in the 73 candidate genes (except U2AF1) were called using Strelka2. Variants were annotated with VEP v.100. Variants given in Vlasschaert Supplementary Table 1 (ref. 20) were selected and kept if they had depth ≥20 and minAD ≥3. Variants occurring ≥15 times were tested for association with age and rs7705526; variants with P > 0.1 or estimate <0 for both covariates were removed. A binomial test was used to remove putative germline variants by testing if the read depth was statistically different from half of the sum of all sequencing reads at that site. Variants with P > 0.01 were removed, except for variant sites TET2 H1904R, I1873T and T1884A.
CH GWAS, association testing and meta-analysis. Methods for GWAS association testing are described in detail elsewhere 17,76. Briefly, association between imputed variants and barcode-CH as a binary phenotype was tested by logistic regression under a multiplicative genetic model. For ISL, the model included the following covariates: sex, county of birth, current age or age at death (first- and second-order terms included) and an indicator function for the overlap of the lifetime of the individual with the time span of phenotype collection. In UKB, 20 principal components were used to adjust for population stratification, with age and sex included as covariates. LD regression was used to account for cryptic relatedness and stratification 77. Analysis of quantitative hematological parameters and LTL used the linear mixed model implemented in BOLT-LMM 78. For meta-analyses, GWAS results from ISL and UKB were combined using a fixed-effects inverse-variance method based on effect estimates and s.e., in which each dataset was assumed to have a common OR but allowed to have different population frequencies for alleles and genotypes. Sequence variants were mapped to NCBI Build 38 and matched on position and allele to harmonize the datasets. We tested ∼75.2 million variants for association, with MAF > 0.001% and imputation information >0.8 in at least one of the cohorts. For conditional analysis, the sentinel signal at each locus was defined as the variant with the lowest Bonferroni-adjusted P value using adjusted significance thresholds 79. Conditional analysis used individual-level genotype data to test possible secondary signals ±500 kb from the sentinel signal.
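As an illustration of the fixed-effects inverse-variance method named above, the following Python sketch combines per-cohort log odds ratios and their standard errors. The function name and input layout are hypothetical, and the two-sided P value is taken from a normal approximation, which is an assumption rather than a detail stated in the text.

```python
from math import erfc, sqrt
from typing import Sequence, Tuple


def fixed_effects_meta(betas: Sequence[float],
                       ses: Sequence[float]) -> Tuple[float, float, float]:
    """Inverse-variance weighted fixed-effects meta-analysis.

    betas : per-cohort effect estimates (log odds ratios)
    ses   : per-cohort standard errors
    Returns the pooled effect, its standard error and a two-sided P value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = sqrt(1.0 / total)
    z = beta / se
    p = erfc(abs(z) / sqrt(2.0))  # two-sided P from the standard normal
    return beta, se, p
```

For example, cohort estimates of 0.1 (s.e. 0.1) and 0.3 (s.e. 0.2) receive weights of 100 and 25, giving a pooled log OR of 0.14 that sits closer to the more precise estimate.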
CPLD-CH GWAS. The GWAS was repeated using individuals who were identified as carrying a somatic mutation in CPLD genes as affected. For the CPLD-CH × barcode-CH effect × effect plots, variants were included if they were associated at P < 5 × 10⁻⁸ (or 5 × 10⁻⁷ for moderate- or high-impact variants) in barcode-CH or in any one of the CPLD-CH classes and had not been excluded as high-impact, rare variants as indicated in Supplementary Table 9. Variants were not plotted if they had abs(logₑ OR) > 3, but they were included in the data table (Supplementary Table 12).
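The inclusion and plotting rules for the effect × effect plots can be written as two small predicates. This is only a sketch: the function names, the list of per-phenotype P values and the `excluded` flag (standing in for the exclusions listed in Supplementary Table 9) are hypothetical.

```python
from math import log
from typing import Sequence

GENOME_WIDE_P = 5e-8   # default significance threshold
CODING_P = 5e-7        # threshold for moderate- or high-impact variants
MAX_ABS_LOG_OR = 3.0   # variants beyond this limit are tabulated but not plotted


def is_included(p_values: Sequence[float], impact: str, excluded: bool = False) -> bool:
    """Include a variant that is significant in barcode-CH or any CPLD-CH class."""
    threshold = CODING_P if impact in ("moderate", "high") else GENOME_WIDE_P
    return (not excluded) and min(p_values) < threshold


def is_plotted(odds_ratio: float) -> bool:
    """Plot only variants with abs(log_e OR) <= 3."""
    return abs(log(odds_ratio)) <= MAX_ABS_LOG_OR
```

On the odds ratio scale, the plotting limit corresponds to an OR between roughly 0.05 and 20.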
Investigation of pleiotropic traits in the GWAS Catalog. For each sentinel variant, we identified all variants in LD (r² ≥ 0.8) within ±500 kb. For those variants, we then searched the GWAS Catalog 80 for reported associations with P < 1 × 10⁻⁷.
LTL and MPN effect × effect plots and MR. Variants selected for effect × effect plots and MR of LTL and MPN were genome-wide significant according to stringent weighted Bonferroni criteria after stepwise conditional analysis at each locus 79. LTL variants and effects were determined by GWAS using UKB LTL data 48. MPN outcomes were freshly recalculated using current UKB data (Supplementary Table 18). MR analyses were performed using linear regression without an intercept term, weighted by the inverse variance of the outcome associations (IVW), and using weighted linear regression with an intercept term, coupled with an intercept test (MR-Egger 81).
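The two regressions named above differ only in whether an intercept is fitted. The following Python sketch shows the point estimates for both, assuming per-variant exposure effects `bx`, outcome effects `by` and outcome standard errors `se_y`; these names are hypothetical. Standard errors, the intercept test itself and the usual MR-Egger step of orienting variants to a positive exposure effect are omitted for brevity.

```python
from typing import Sequence, Tuple


def ivw_slope(bx: Sequence[float], by: Sequence[float], se_y: Sequence[float]) -> float:
    """IVW estimate: weighted regression of by on bx with no intercept."""
    w = [1.0 / s ** 2 for s in se_y]
    num = sum(wi * x * y for wi, x, y in zip(w, bx, by))
    den = sum(wi * x * x for wi, x in zip(w, bx))
    return num / den


def egger_fit(bx: Sequence[float], by: Sequence[float],
              se_y: Sequence[float]) -> Tuple[float, float]:
    """MR-Egger point estimates: weighted regression of by on bx with an intercept.

    Returns (intercept, slope). A non-zero intercept suggests directional pleiotropy.
    """
    w = [1.0 / s ** 2 for s in se_y]
    sw = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, bx)) / sw
    mean_y = sum(wi * y for wi, y in zip(w, by)) / sw
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, bx))
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, bx, by))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

When the outcome effects are exactly proportional to the exposure effects, both functions return the same slope and the Egger intercept is zero; a constant offset in the outcome effects is absorbed by the Egger intercept but biases the IVW slope.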
RNA eQTL and sQTL analysis. Public domain databases that were screened for RNA-seq eQTL and sQTL data are detailed in the Data Availability section. In-house RNA-seq analysis was performed as an extension of our previous studies 76,82. We isolated RNA from whole blood samples from ISL participants (n = 17,848), in addition to 822 T cell, 758 B cell and 899 monocyte samples, using the Chemagic Total RNA Kit special (PerkinElmer) and sequenced it using Illumina HiSeq 2500 and NovaSeq systems. STAR software (v.2.5.3) was used to align RNA-seq reads to personalized genomes 83. Kallisto 84 was used to estimate transcript abundances. BOLT-LMM was used to test additive model association between transcript abundance and genetic variants. Adjustment factors were as follows: sequence artifact estimations, demographic characteristics, blood cell counts and 100 leave-one-chromosome-out (LOCO) principal components of the gene expression matrix. The top cis-eQTL was defined as the variant with the most significant association within 1 Mb of the gene.
LeafCutter (v.0.2.6) (ref. 85) was used to quantify RNA alternative splicing. Linear regression under the additive model was used to test the association between alternative splicing events and linked genetic variants using quantile-normalized percentage-spliced-in (PSI) values for each junction. Adjustment factors were as follows: sequence artifact estimations, demographic characteristics, blood cell counts and 15 LOCO principal components of the quantile-normalized PSI matrix. Colocalization analysis between CH GWAS variants and eQTL was carried out using COLOC 57 implemented in R.
Proteomics. Proteomic analysis of ISL plasma samples (including n = 18,527 participants assessed for CH) using the SomaScan version 4 panel was described previously 54. Proteomic analysis of UKB plasma samples (n = 12,636 participants with CH assessment) was conducted using the Olink Explore 1536 platform as part of the UKB-Pharma Proteomics Project (UKB application 65851). The vast majority of the samples were randomly selected from among UKB participants. Olink measurements used the normalized protein expression (NPX) values recommended by the manufacturer, which include normalization.

Statistical testing.
All statistical tests used in the study were two-sided. None of the P values quoted were adjusted for multiple testing.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
In addition to data presented in Supplementary Tables 1-22, the following new datasets are made available at https://www.decode.com/summarydata/
1. Variant-level GWAS meta-analysis data for ISL and UKB for barcode-CH and each CPLD-CH type illustrated in Fig. 3.
2. Mutation-level counts and Fisher's exact test results for each somatic mutation tested in ISL and UKB.
WGS, genotype and phenotypic data for UKB participants can be accessed by approved researchers via the UKB research analysis platform: https://ukbiobank.dnanexus.com/landing. Guidance on access can be found here: apply for access (ukbiobank.ac.uk). Individual-level ISL WGS, RNA-seq and phenotype data cannot be made publicly available because that is prohibited by the Icelandic Act on Data Protection and Processing of Personal Data and conditions set forth to us by the Icelandic Data Protection Authority. On-site access to the data at deCODE genetics facilities may be granted. Interested parties should write to the lead contact author S.N.S. with a brief description of the requirements and intended use. Requests will be discussed by the deCODE data access committee and a response given within 4 weeks. We used data from the following public domain sources:
The chr14:TCL1A rs2887399_T allele was protective against barcode-CH, TET2-CH and ASXL1-CH whilst the same allele increased risk of DNMT3A-CH, in line with previous reports. The chr14:TCL1A variant is indicated in the DNMT3A-CH and ASXL1-CH panels to illustrate the reversal of effect. Similarly, the chr6:CD164 rs3056655_A allele increased risk of barcode-CH and DNMT3A-CH but decreased risk of TET2-CH 13,14. The latter result was seen only in UKB, whereas ISL data could not confirm it. The chr3:SMC4 rs201009932 variant had no discernible effect on ASXL1-CH while it had a pronounced effect on JAK2-CH. chr3:THRB had no apparent effect on DNMT3A-CH and chr5:TERT rs7705526 had no effect on PPM1D-CH. Other variants showed prominent effects only in specific CPLD-CH types: chr12:SOX5 and chr14:DLK1 had no evident effects outside of barcode-CH, while chr13:KLF12 had no apparent effect outside of PPM1D-CH. The chr9:JAK2 rs16922785_G allele (indicated in the JAK2-CH panel) only conferred CH risk in the context of the JAK2 Val617Phe somatic mutation and was preferentially linked to it in cis, as has been noted previously for the 46/1 JAK2 haplotype and MPN risk 104. rs16922785 is in moderate LD with the 46/1 haplotype (r² = 0.68) and had a somewhat stronger association with JAK2-CH than the 46/1 haplotype tagger rs12343867_C (P = 1.60 × 10⁻⁹ vs 1.04 × 10⁻⁷).
Hematological traits are ordered by hierarchical clustering within the CH at-risk and CH protective strata. Platelet parameters were affected by the greatest number of variants: PCT, PLT, PDW and MPV; followed by erythrocytic parameters MCH, RBC and MCV. The best alignments in direction of effects (that is, where the effects of the variant on CH and the hematological trait were consistent with the phenotype:phenotype association) were seen again for platelet parameters PDW, PCT and PLT as well as for MO#, LY# and BA%. From the perspective of the CH GWAS variants, the variants affecting the most hematological traits were chr6:CD164 and chr6:HLA-C. However, chr6:CD164 had rather poor alignment in the direction of effects. The best alignments were seen for chr21:14966851 NRIP1, chr3:THRB and chr3:16068930:SMC4. Clinical hematology parameters are as defined in Sheard 47.
There were two independent CH GWAS signals at 3q25: a 1-2% EAF CH risk variant chr3_160368930_T_TA and a ∼55% EAF CH risk variant rs2305407_A, which carries the eQTL association. Accordingly, the CH GWAS plot (blue) shows the P adj values for rs2305407_A conditioned on chr3_160368930_T_TA. The TRIM59 RNAseq eQTL signal (red) is scaled as indicated in the legend. COLOC revealed a PP.H4 = 96% probability of peak identity. COLOC did not show substantial evidence of peak identity with the SMC4 eQTL, whether the CH GWAS signal was conditioned on chr3_160368930_T_TA or not, with PP.H4 = 4.5% and 2.2%, respectively. eQTL and CH GWAS signals were derived from linear and logistic regression association analysis, respectively.

Fig. 2 | Differential risks of subsequent hematological disorders for barcode-CH, CPLD-CH and CPLDneg-CH. a, HR and 95% CI from Cox regressions for subtypes of hematological disorder, stratified by CPLD-CH, barcode-CH and CPLDneg-CH. Diagnoses were included if they arose 6 months or more after blood sampling for CH determination. Data are meta-analysis of UKB and ISL

Fig. 3 | GWAS meta-analysis of barcode-CH in ISL and UKB. Manhattan plot showing logistic regression GWAS results (−log₁₀(P) versus chromosomal position) from 16,306 cases and 159,913 controls. The horizontal red line corresponds to a P value of 5 × 10⁻⁸. Named loci have unconditional P values of <5 × 10⁻⁸. Loci are named by the nearest gene or plausible candidate. The TERT and TCL1A loci are offscale, and their P values are indicated on the plot. Detailed data for named loci are in Supplementary Table 9. Several high-effect, rare variants were deemed to require further confirmation and were not considered further (indicated in Supplementary Table 9).

Fig. 4 | Effects of CH GWAS variants and LTL GWAS variants on CH, LTL and MPN outcomes. a, Effects of CH GWAS variants on CH (x axis) and LTL (y axis) outcomes. LTL data are from UKB (n = 418,251). The two discordant TERT variants mentioned in the text are indicated. b, Effects of LTL GWAS variants on LTL (x axis) and CH (y axis) outcomes. Variants are grouped into 'cloud 1' (shaded brown) and 'cloud 2' (shaded blue) according to their direction of effect on CH (see text). c, Effects of CH GWAS variants on CH (x axis) and MPN (y axis) outcomes. MPN outcomes were obtained from meta-analysis of ISL and UKB data (n case = 1,124 and n control = 747,154). In all panels, only variants with MAF > 1% are plotted. The plotted points are association effect estimates from logistic/linear regression and the bars indicate 95% CI. The red dotted lines indicate the IVW regressions. The chromosomal location of each plotted variant is indicated by color as shown in the color key, lower right.

Fig. 5 | CH GWAS variants are associated with splicing and expression of CD164. a, Splice diagram of the two major CD164 mRNA isoforms from whole blood RNA-seq data. Blue bars depict exons and are wider in coding regions. Introns are depicted as black arrowed lines. The sQTL affects skipping or inclusion of exon 5. Effects (β in s.d. units) from linear regression of the CH risk rs3056655_A allele are as follows: E4 to E6 (β = 0.44, P = 3.04 × 10⁻³⁰²); E4 to E5 (β = −0.22, P = 3.29 × 10⁻⁷²); E5 to E6 (β = −0.14, P = 4.16 × 10⁻³²). Thickness of the arcs indicates the overall usage of the different splice junctions. Black arcs indicate a reduction in usage in association with rs3056655_A, while the brown

. 1 |
Age and smoking dependency of CH. a, Frequency distribution in UKB of singleton mutations: Mutations that were observed only once in the cohort were plotted by variant allele fraction (VAF

. 3 |
Locus zoom plots for loci where a secondary signal was detected by conditional analysis. Plots show conditional logistic regression GWAS results (−log₁₀ P vs chromosomal position) from 16,306 cases and 159,913 controls. The adjusted signals are shown, with the primary signal in the upper part of each panel and the secondary signal in the lower part. r² values relative to the peak signal are shown by color as indicated in the color bar, bottom right. a, SMC4 locus. b, TERT locus. c, NRIP1 locus.

. 4 |
GWAS of CPLDneg-CH and comparison of effects with barcode-CH GWAS. Data are a meta-analysis of ISL and UKB. GWAS variants were included if they were significantly associated with barcode-CH or CPLDneg-CH. The plotted points are association effect estimates (logₑ odds ratio) and 95% CI from logistic regression association testing for variants in barcode-CH (16,306 cases, 159,913 controls) and CPLDneg-CH (11,692 cases, 151,277 controls), respectively. The fitted inverse-variance-weighted linear regression, fixed through the origin, is shown as a red dotted line. Variants that were newly discovered in the CPLDneg-CH GWAS are colored green. Labeled loci are discussed in the text.
Extended Data Fig. 5 | Effects of GWAS meta-analysis variants on various types of CPLD-CH vs barcode-CH. GWAS variants were included if they were significantly associated with barcode-CH or any of the CPLD-CH types. The x axes show the effects (logₑ odds ratio) and 95% CI (horizontal lines) for each variant in barcode-CH, determined by logistic regression. The y axes show the corresponding effects and 95% CI (vertical lines) for each variant in the different types of CPLD-CH, as indicated above each panel. The dotted line shows the position of the diagonal. Gray lines indicate the position of no effect. Detailed data including case and control numbers are in Supplementary Table 12.

. 6 |
Effects of CH GWAS variants on clinical hematology parameters. a, GWAS Catalog reports: for each sentinel CH GWAS variant, we identified all variants in LD with r² ≥ 0.8 within ±500 kb

. 8 |
Co-localization of eQTL with CH GWAS loci chr3q27:ABCC5 and chr3q25:TRIM59/SMC4. a, Public databases report that ABCC5 expression is downregulated in association with the CH risk allele chr3:183954156_GT in whole blood, monocytes and T cells. This eQTL was confirmed in ISL whole blood RNAseq (β = −0.926 s.d., P = 1 × 10⁻¹⁶⁵⁷). We noted a closely correlated, moderate-impact splice region variant (rs7636910, r² = 0.96) in ABCC5. The panel shows a plot of RNAseq eQTL signals from whole blood (red) and CH GWAS results (blue) by genomic location. eQTL P values are scaled as indicated in the legend. Co-localization analysis (COLOC 57) indicated a PP.H4 = 74% probability that the eQTL and CH GWAS signals arise from the same, single causative variant. ABCC5 is, however, not a compelling biological candidate for CH causation. b, Public databases report that TRIM59 and SMC4 expression in blood is increased in association with CH risk allele rs2305407_A, which is annotated as an SMC4 splice region variant. These signals replicated in ISL blood RNAseq (TRIM59: β = 0.458 s.d., P = 1 × 10⁻⁴²⁰; SMC4: β = 0.073 s.d., P = 1.75 × 10⁻¹¹

Table 1 | Associations between clonal hematopoiesis and disease in UKB
Phenotype list is edited to remove redundancies and subphenotypes. a Multivariable regression, adjusted for sex and age at blood draw (linear and quadratic). b Additionally adjusted for smoking status (current, previous), pack years and years since stopped smoking. c Heart failure was included in the UKB table because prior literature reports implicated an association with CH.

1 | Association of mosaic somatic mutations with CH. a, Results
As in b and c but for the CALR gene. FE, Fisher's exact.
and 95% CI from Cox regressions for subtypes of hematological disorder, stratified by CPLD-CH, barcode-CH and CPLDneg-CH. Diagnoses were included if they arose 6 months or more after blood sampling for CH determination. Data are meta-analysis of UKB and ISL (n = 162,963 participants overall, 14,837 with barcode-CH, 5,288 with CPLD-CH and 11,692 with CPLDneg-CH). b, HR and 95% CI for subsequent hematological disorder stratified by CPLD genes. MM, multiple myeloma; MGUS, monoclonal gammopathy of undetermined significance; OMF, osteomyelofibrosis.
The study included WGS of whole blood samples from 45,699 Icelanders participating in various projects at deCODE genetics. The study was authorized by the Icelandic National Bioethics Committee and the Data Protection Authority (License VSN-16-104). All individuals gave written informed consent.
The counts were further stratified by the age of the subject at blood draw. Note that there is a 'bump' in the distribution starting below a VAF of approximately 0.3 and that the size of this 'bump' is age dependent. This distribution was modeled to identify people with more than the expected number of low-VAF mutations, as explained further in the Methods. b, Proportion of subjects with CH increases with age. The line connects the observed CH proportions; error bars are 95% CI. Data are from the ISL sample (n = 45,510), which has a larger age range than UKB. c, Effects of current and previous smoking on CH by age: CH was modeled by age and stratified by current or previous smoking status using sex, Pack-Years and Years Since Stopped Smoking as covariates. Points correspond to observed CH proportions and error bars are 95% CI. Lines correspond to a logistic regression fit. Data are from the UKB sample (n = 130,709).
The proportion of subjects with barcode-CH by age is shown in blue. Proportions of subjects where a CPLD mutation had been identified (CPLD-CH) are in green and the proportion with a mutation in DNMT3A or TET2 are in magenta. CPLD mutations were defined as in ref. 13. The lines indicate a data fit using a generalized additive model with cubic splines. Shading indicates 95% CI. a, Data from UKB. b, Data from ISL.
For those variants, we searched the GWAS Catalog for reported associations with P values < 1 × 10⁻⁷ from linear regression association. CH GWAS loci (y axis) are colored red if the Alt allele increased CH risk, otherwise blue. Circles are colored red if the Alt allele was associated with an increase in the hematological trait value (x axis), blue if there was a decrease and gray if the direction of effect could not be ascertained. b, Associations from linear regression between sentinel CH GWAS variants and clinical hematology traits measured on contemporaneous samples in the UKB: CH GWAS loci (y axis) are colored red if the Alt allele increased CH risk, otherwise blue. Hematological trait symbols (x axis) are colored red if their values increased in association with the CH phenotype, blue if they decreased in CH and gray if they were not associated with CH. Blocks are colored in if the effect of the CH GWAS variant on the trait was at least nominally significant: red indicates that the Alt allele was associated with an increase in the hematological trait value, blue indicates a decrease. Intensity of color indicates the effect size.