Introduction

Clonal hematopoiesis of indeterminate potential (CHIP) is a common age-related phenomenon in which hematopoietic stem cells (HSCs) acquire leukemogenic mutations resulting in the selection and expansion of a genetically distinct subpopulation of blood cells (variant allele fraction, VAF > 2%)1. The prevalence of CHIP detectable through next-generation sequencing of blood DNA is up to 10% of adults >70 years and nearly 20% of adults >90 years2,3,4. CHIP is associated with increased risk for hematological cancers2, coronary artery disease (CAD)4,5, congestive heart failure6, stroke7, chronic obstructive pulmonary disease8,9,10, osteoporosis11, and all-cause mortality2,3. The genes most commonly mutated in clonal hematopoiesis are the epigenetic regulators DNMT3A and TET2, and other commonly mutated genes include regulators of HSC proliferation and tumor suppression2,4.

DNA methylation (DNAm), the chemical addition of a methyl group to DNA at a cytosine followed by a guanosine (CpG), is a commonly studied epigenetic mechanism with important roles in cell and tissue differentiation. Similar to CHIP, DNAm patterns change distinctly with age, and have been associated with multiple diseases including cancers12,13 and coronary artery disease14,15. Notably, the products of the two most commonly mutated genes in CHIP regulate DNAm, with DNMT3A catalyzing de novo methylation, and TET2 initiating demethylation via conversion of methylated cytosines to 5-hydroxymethylcytosine16. During hematopoiesis, HSCs normally acquire DNAm patterns consistent with terminal cell lineage, but knockout of Dnmt3a in mice prevents HSCs from establishing new DNAm patterns, leading to a self-renewal pattern17. Despite its opposing regulatory role in demethylation, knockdown of Tet2 led to a similar pattern of increased HSC self-renewal, and global loss of hydroxymethylation in HSCs18. These results suggest that both the addition and removal of methyl groups are necessary to promote differentiation of HSCs, and that insight may be gained from examining the relationship between CHIP and DNAm at specific sites across the genome.

We hypothesized that CHIP overall and gene-specific CHIP mutations would be associated with distinct DNAm signatures, given the roles of DNMT3A and TET2 in regulating DNAm. In this study, we conducted a multi-ancestry epigenome-wide association meta-analysis of CHIP, followed by enrichment analysis and functional annotation of associated CpG loci, and mediation analysis and Mendelian randomization to examine the potential interplay between CHIP and DNAm in aging and disease.

Results

Baseline characteristics of the study population

The characteristics of participants in the discovery cohort (CHS; N = 582) and the replication cohort (ARIC; N = 2655) are presented in Table 1. In CHS, 61% of participants were female, 48% were African American, and the mean (standard deviation) age was 73.6 (5.2) years at the time of blood draw for whole-genome sequencing (WGS). In ARIC, 61% of participants were female, 71% were African American, and the mean (standard deviation) age was 57.4 (5.9) years at the time of blood draw for whole-exome sequencing (WES). Overall, CHIP prevalence was 14.8% (86/582) in CHS and 5.3% (142/2655) in ARIC. The top three CHIP genes in both cohorts were DNMT3A, TET2, and ASXL1 (Supplementary Fig. 1a), with median clone sizes in the 0.11–0.27 VAF range (Supplementary Fig. 1b). Among individuals with CHIP, 86% of CHS and 92% of ARIC participants had a single CHIP mutation with VAF > 2% (Supplementary Fig. 1c). CHIP prevalence was 11.48% (21/183) in CHS and 8.14% (72/884) in ARIC at 61–70 years of age (Supplementary Fig. 1d).

Table 1 Baseline Characteristics of the Study Participants

Epigenome-wide association analyses

The EWAS workflow is presented in Supplementary Fig. 2. We performed a multi-ancestry meta-analysis to carry out discovery EWAS in CHS-AA and CHS-EA. We identified 7422, 4528, and 11,805 CpGs that were differentially methylated (FDR < 0.05) in individuals with any CHIP, DNMT3A CHIP, and TET2 CHIP, respectively; 539, 499, and 1595 CpGs were significant according to a Bonferroni criterion (P < 1.04 × 10−7) (Fig. 1a and Supplementary Fig. 3a, b). Among the 478,661 CpGs tested, at FDR < 0.05, the presence of any CHIP was associated with decreasing DNAm at 1.17% (5618) of sites and increasing DNAm at 0.38% (1804) of sites (Fig. 1a, b). Notably, the DNMT3A and TET2 EWAS profiles showed opposing patterns. The presence of DNMT3A CHIP was associated with decreasing DNAm at 0.93% (4435) of sites and increasing DNAm at 0.02% (93) of sites (Fig. 1c and Supplementary Fig. 3a). In contrast, the presence of TET2 CHIP was associated with decreasing DNAm at 0.23% (1092) of sites and increasing DNAm at 2.24% (10,713) of sites (Fig. 1d and Supplementary Fig. 3b). Quantile-quantile plots of expected and observed −log10(P) are presented in Supplementary Fig. 4a–f. Consistent with the widespread epigenetic regulatory role of the most frequently mutated CHIP genes, the genomic inflation factor was 1.11, 0.92, and 1.45 in any CHIP, DNMT3A, and TET2 CHIP meta EWAS, respectively. In a sensitivity analysis considering a more restricted definition of CHIP requiring larger clone sizes (VAF > 0.10; “expanded CHIP”), results were similar to the EWAS for any CHIP (7881 CpGs associated with an inflation factor of 1.40; Supplementary Fig. 5a–c).
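The two significance criteria used above can be illustrated with a minimal sketch. It assumes the FDR criterion follows the Benjamini-Hochberg step-up procedure, which this section does not state explicitly; the function names and the toy P values are hypothetical and serve only to show the mechanics.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold that controls the family-wise error rate."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans marking which tests pass the BH step-up rule (assumed FDR method)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


n_cpgs = 478_661  # number of CpGs tested, taken from the text
threshold = bonferroni_threshold(0.05, n_cpgs)
print(f"Bonferroni threshold for {n_cpgs} tests: {threshold:.3g}")

toy_p = [1e-9, 0.004, 0.04, 0.2, 0.8]  # hypothetical P values, not study data
toy_flags = benjamini_hochberg(toy_p)
print(toy_flags)
```

The Bonferroni threshold depends only on the number of tests, whereas the BH cutoff adapts to the observed P-value distribution, which is why far more CpGs pass the FDR criterion than the Bonferroni criterion.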

Fig. 1: Results from epigenome-wide association studies of four CHIP phenotypes.

a Directional Manhattan plot of discovery multi-ancestry meta-EWAS for any CHIP in CHS cohort, where direction indicates positive vs. negative correlations between CHIP and DNAm. Each dot represents a CpG site, with genomic location on the x-axis and −log10(P) × sign(test statistic) on the y-axis, where P values are based on a two-sided inverse-variance-weighted meta-analysis. Solid horizontal line indicates Bonferroni significance, and dashed line indicates 5% FDR. b–d Volcano plots depicting the effect size and −log10(P) from CHS meta-EWAS of b any CHIP, c DNMT3A CHIP, and d TET2 CHIP. Dashed line indicates FDR < 5%, and colored points highlight CpGs replicated in ARIC cohort. e Overlap of replicated CpGs among the four CHIP EWAS. f Distribution of DNAm at the eight most significant replicated CpGs associated with both DNMT3A and TET2. Colored points show DNAm proportions at each CpG for individuals with DNMT3A (blue) or TET2 (green) CHIP, overlaid by density functions for each group and lines representing medians of each distribution. For comparison, medians for individuals without CHIP are shown as white circles.

We next performed a replication analysis of the FDR-significant CpGs (FDR < 0.05) in the ARIC study cohorts (1897 AA and 758 EA participants). Approximately 66% (4912/7422), 84% (3803/4528), and 13% (1479/11,801) of CpGs associated with any CHIP, DNMT3A CHIP, and TET2 CHIP were replicated with FDR < 0.05 and concordant effect direction in the multi-ancestry meta-analysis of ARIC-AA and ARIC-EA EWAS. The lower replication rate for TET2 is likely attributable to the low prevalence of TET2 CHIP among ARIC-EA, which only included one individual with TET2 CHIP (Table 1). When we performed the replication analysis solely in the ARIC-AA cohort, the replication rate was similar, with 1423 of 11,801 CpGs successfully replicating, including 88% (1308) of the sites replicated in the full meta-analysis (Supplementary Data 3). Comparison of our TET2 discovery results to a previous EWAS of TET2 CHIP19 revealed that 63% (6943 out of 11,010 matched CpGs) of CpGs associated with TET2 CHIP had concordant effect direction and FDR < 0.05 in Tulstrup et al.19 (Supplementary Data 4), suggesting that our discovery analysis was robust. Of the 1479 CpGs replicated in ARIC, 1393 were analyzed by Tulstrup et al.19; of these, >90% (1258) were corroborated in this comparison, suggesting that our replication results are valid, though conservative due to the low prevalence of TET2 CHIP in the younger ARIC cohort and our stringent FDR-based replication criterion.
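As a rough illustration of how ancestry-specific EWAS estimates for one CpG can be combined and then checked against the replication rule described above, the following sketch applies a fixed-effect inverse-variance-weighted meta-analysis and a concordance test. The function names, effect sizes, standard errors, and the FDR-adjusted value are hypothetical; the authors' actual software and covariate models are not described in this section.

```python
import math


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis of per-stratum estimates."""
    weights = [1.0 / se**2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided P value from the normal distribution
    return beta, se, z, p


def replicates(discovery_beta, replication_beta, replication_fdr, fdr=0.05):
    """Replication rule from the text: FDR < 0.05 and concordant effect direction."""
    same_direction = discovery_beta * replication_beta > 0
    return same_direction and replication_fdr < fdr


# Hypothetical per-ancestry estimates for a single CpG (not study data).
meta_beta, meta_se, meta_z, meta_p = ivw_meta(betas=[-0.02, -0.03], ses=[0.005, 0.01])
print(f"meta beta = {meta_beta:.4f}, SE = {meta_se:.4f}, Z = {meta_z:.2f}, P = {meta_p:.2g}")

# replication_fdr stands for an FDR-adjusted value computed across all tested CpGs.
print(replicates(discovery_beta=-0.025, replication_beta=meta_beta, replication_fdr=0.001))
```

Because each stratum is weighted by the inverse of its variance, a stratum with very few carriers (such as a cohort with a single TET2 CHIP case) contributes little to the pooled estimate.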

Summary statistics for all replicated CpGs from the discovery and replication EWAS, as well as a combined meta-analysis across CHS and ARIC, are presented for the three CHIP categories in Supplementary Data 1–3, and the 20 most significant CpGs from the combined meta-EWAS in DNMT3A and TET2 are shown in Tables 2 and 3. Among replicated sites, 99.8% (3795/3803) of sites associated with DNMT3A CHIP showed decreased DNAm, while 94.9% (1404/1479) of sites associated with TET2 CHIP showed increased DNAm with CHIP. In the combined meta-analysis, the two CpGs most significantly associated with any CHIP and DNMT3A CHIP lie within the first intron of HOXB3. Of the replicated CpGs associated with expanded CHIP, 86% were also associated with any CHIP (Fig. 1e; Supplementary Fig. 6). However, fewer of the TET2 and DNMT3A CHIP-associated CpGs overlapped with any CHIP and expanded CHIP EWAS (12–13% and 34–43% respectively for TET2 and DNMT3A). There was limited overlap between TET2- and DNMT3A-associated CpGs; only 23 CpGs were common between the two, though this was greater than expected by chance (OR = 1.98; P = 0.003). Eleven of these CpGs were common among the four CHIP categories (Fig. 1e; Supplementary Fig. 6), where the presence of CHIP was associated with reduced DNAm in all categories for all eleven CpGs (Supplementary Table 1). For the other 12 CpGs, DNMT3A CHIP was associated with decreased DNAm while TET2 CHIP was associated with increased DNAm (Fig. 1f).
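The overlap between the DNMT3A- and TET2-associated sets can be examined with a simple 2 × 2 calculation. The sketch below assumes that the background is the full set of 478,661 tested CpGs, which is not stated explicitly for this comparison; under that assumption the sample odds ratio evaluates to approximately 1.98, in line with the reported value. The one-sided hypergeometric tail is shown only as an illustration and need not equal the reported P value, whose exact test is not specified here. All function names are hypothetical.

```python
import math


def overlap_odds_ratio(n_both, n_a, n_b, n_total):
    """Sample odds ratio of the 2x2 table for membership in set A vs. set B."""
    a_only = n_a - n_both
    b_only = n_b - n_both
    neither = n_total - n_a - n_b + n_both
    return (n_both * neither) / (a_only * b_only)


def log_choose(n, k):
    """Natural log of the binomial coefficient, computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_upper_tail(n_both, n_a, n_b, n_total):
    """P(overlap >= n_both) when n_b sites are drawn at random from n_total."""
    log_denominator = log_choose(n_total, n_b)
    tail = 0.0
    for k in range(n_both, min(n_a, n_b) + 1):
        log_term = log_choose(n_a, k) + log_choose(n_total - n_a, n_b - k) - log_denominator
        tail += math.exp(log_term)
    return tail


n_total = 478_661  # assumed background: all CpGs tested
n_dnmt3a = 3803    # replicated DNMT3A CHIP-associated CpGs
n_tet2 = 1479      # replicated TET2 CHIP-associated CpGs
n_shared = 23      # CpGs common to both sets

expected_overlap = n_dnmt3a * n_tet2 / n_total
odds_ratio = overlap_odds_ratio(n_shared, n_dnmt3a, n_tet2, n_total)
tail_p = hypergeom_upper_tail(n_shared, n_dnmt3a, n_tet2, n_total)
print(f"expected overlap = {expected_overlap:.1f}, OR = {odds_ratio:.2f}, one-sided P = {tail_p:.2g}")
```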

Table 2 Top 20 DNMT3A-CHIP-associated CpGs
Table 3 Top 20 TET2-CHIP-associated CpGs

Enrichment analysis

To investigate the regulatory and functional potential of CpG sites associated with CHIP and specifically with DNMT3A or TET2 mutations, we performed a series of analyses to assess whether these sets of CpG sites were enriched relative to other CpGs on the array for regions likely to regulate genes, regions and/or genes associated with specific biological processes, and regions identified as functionally relevant (via methylation, chromatin accessibility, or gene expression profiles) in HSCs vs. the components of whole blood. We also examined enrichment for genes whose methylation has been found to associate with these mutations in two more extreme contexts: Dnmt3a knockout mice17 and AML patients with driver mutations in DNMT3A or TET2.

CHIP-associated CpG sites are enriched in promoter-adjacent regulatory regions

Previous studies have highlighted that the distribution of genome-wide DNAm changes associated with gene regulation and diseases is not random20. For example, tissue-specific differentially methylated regions (T-DMR) and cancer-specific DMR (C-DMR) have been found to be depleted in CpG islands (CGI; CpG-rich regions that characterize promoter regions), but 13-fold more frequent in CGI shores ≤2 kb from CGI21,22. It has also been reported that methylation shows greater variation and stronger association with nearby gene expression at CGI shores and CGI shelves (adjacent regions 2–4 kb from CGI)21,23. To examine the regulatory potential of replicated CpGs, we assessed enrichment for CGI, CGI shores, CGI shelves, and other regions (“open sea”). Replicated CpGs were highly depleted in CGI in all three CHIP categories (0.16 ≤ OR ≤ 0.36; 1.2 × 10−270 ≤ P ≤ 3.0 × 10−118; Supplementary Table 2). CpGs associated with any CHIP or DNMT3A CHIP were highly enriched in CGI shores (1.9 ≤ OR ≤ 2.8; 2.5 × 10−267 ≤ P ≤ 4.0 × 10−78), while CpGs associated with TET2 CHIP were depleted in shores (OR = 0.79; P = 2.4 × 10−4) but enriched in CGI shelves (OR = 1.87; P = 1.2 × 10−16). Sets of CpGs associated with DNMT3A or TET2 CHIP were enriched in open sea regions (1.4 ≤ OR ≤ 2.4; 3.7 × 10−63 ≤ P ≤ 1.6 × 10−26), while CpGs associated with any CHIP were depleted in these regions (OR = 0.82; P = 8 × 10−11).

CpG sites associated with DNMT3A CHIP are enriched in regions associated with stem cell reprogramming and cancer

Because of the role of CHIP mutations in blood cancers and in HSC self-renewal and stemness vs. differentiation, we also examined whether CHIP-associated CpGs were enriched for the C-DMR and T-DMR reported in ref. 21 and for induced pluripotent stem cell reprogramming-specific DMR (R-DMR) determined experimentally by Doi et al.24. We observed pronounced enrichment in the any CHIP and DNMT3A CHIP categories for R-DMR (OR > 3.0; P < 4.9 × 10−53) and C-DMR (OR > 1.6; P < 2.4 × 10−5) (Supplementary Table 3). In contrast, TET2 CHIP-associated CpGs showed mild but non-significant depletion for both R-DMR and C-DMR (OR ≤ 0.83; P > 0.05). All three CHIP categories were significantly depleted for T-DMR, which may reflect that the T-DMR were identified via comparisons of liver, spleen, and brain so do not necessarily vary across blood cell subtypes.

Taken together, the CGI and DMR enrichment analyses suggest distinct regulatory profiles for sets of CpGs associated with any CHIP, DNMT3A mutations, and TET2 mutations. CpGs associated with DNMT3A mutations, which tend to be hypomethylated, are more likely to reside in regions associated with gene expression (CGI shores), cancer (C-DMR), and cellular reprogramming (R-DMR). In contrast, CpGs associated with TET2 mutations tend to be hypermethylated, are enriched in a different set of regions likely to associate with gene expression (CGI shelves), and are not enriched in C-DMR or R-DMR.

Genes near sites associated with DNMT3A and TET2 CHIP are enriched for distinct biological processes

Gene ontology (GO) enrichment analysis was performed for genes annotated to replicated CpGs. For the 3803 replicated CpGs associated with DNMT3A CHIP, we identified 75 ontologies enriched at FDR < 0.05 and 10 after Bonferroni adjustment for 22,710 ontologies (P < 2.2 × 10−6). A majority of the enriched GO terms were related to developmental and cellular processes, including several terms related to vascular development (Supplementary Data 5). In contrast, among the 1479 replicated CpGs associated with TET2 CHIP, we identified 27 enriched GO terms at FDR < 0.05 and 9 at Bonferroni significance. Ontologies enriched among TET2-associated sites generally related to immune processes, including activation of immune cells of both the myeloid and lymphoid lineages (Supplementary Data 6). No GO terms were enriched among genes near the 4912 CpGs associated with any CHIP. These results further support a pattern of distinct regulatory consequences associated with DNMT3A vs. TET2 CHIP mutations.

Sites associated with DNMT3A and TET2 CHIP are enriched for transcription factor binding motifs

Because DNAm changes may influence gene regulation through modulation of transcription factor binding affinity25, we used HOMER26 to investigate enrichment for 364 previously reported transcription factor binding motifs. The 200-bp regions surrounding replicated CpGs associated with DNMT3A CHIP were enriched for 40 motifs (FDR < 0.001; Supplementary Fig. 7), including RUNX1 and RUNX2 with roles in HSC and osteoblastic differentiation, five members of the GATA subfamily of transcription factors with roles in development and self-renewal, and five members of the Homeobox family including HOXA9 with roles in AML. Regions surrounding TET2-associated sites were enriched for 51 binding site motifs (FDR < 0.001), of which the top 15 belonged to the Erythroblast Transformation Specific (ETS) family of transcription factors with roles in cellular differentiation and proliferation (Supplementary Fig. 8). Both DNMT3A- and TET2-associated sites were highly enriched for motifs for ERG, an essential regulator of hematopoiesis that is aberrantly expressed in leukemia27,28. The enrichment of both sets of sites for motifs of transcription factors involved in hematopoiesis and related proliferative processes further supports a functional role for these DNAm changes and their possible involvement in downstream consequences of CHIP such as HSC self-renewal and leukemia.

Sites associated with DNMT3A and TET2 CHIP have distinct DNAm profiles in HSCs

Because DNMT3A and TET2 mutations can cause HSCs to propagate through self-renewal rather than differentiate into blood cells17,18, we examined the DNAm profiles of CpGs associated with DNMT3A and TET2 CHIP in HSCs vs. downstream blood lineages. Specifically, we compared distributions of average DNAm levels in whole-genome bisulfite sequencing (WGBS) data across myeloid cells, lymphocytes, and HSCs from the BLUEPRINT project29—first for the full set of CpGs on the array, and then for sets of CpGs associated with DNMT3A and TET2 mutations. Consistent with previous reports, the distribution of average DNAm proportions across CpGs on the array was bimodal for all three cell types, with the majority of sites either fully methylated or fully unmethylated (gray points in Fig. 2a–c). In contrast, many of the CpGs associated with DNMT3A or TET2 CHIP showed intermediate methylation levels in WGBS data from myeloid cells, with median DNAm proportions of 0.45 and 0.40 among CpGs associated with DNMT3A and TET2 CHIP (Fig. 2a). Both of these sets of CpGs showed higher levels of methylation in lymphoid cells, with median values of 0.64 and 0.85 respectively (Fig. 2b). Notably, in data from HSCs, the two groups of CpGs showed diverging patterns. The majority of sites associated with TET2 CHIP, which generally showed increased DNAm with CHIP, were fully methylated in HSCs (median = 1.0). In contrast, CpG sites associated with DNMT3A CHIP, which generally showed decreased DNAm with CHIP, tended to have lower levels of DNAm in HSCs (median = 0.33; Fig. 2c). These data suggest that DNMT3A and TET2 CHIP, through opposing mechanisms, each lead to blood DNAm profiles that are more consistent with HSC identity. Because of the large number of CpGs in each group, all pairwise comparisons between cell types were significant (Wilcoxon P < 2 × 10−16).

Fig. 2: Enrichment patterns among DNMT3A- and TET2-associated CpGs.

a–c Distribution of average methylation levels estimated from external WGBS data for myeloid cells (a), lymphoid cells (b), and HSCs (c) for three sets of CpG sites: all CpGs on Illumina 450K array (gray), and CpGs showing replicated association with DNMT3A CHIP (blue) or TET2 CHIP (green). Each point represents a CpG, while filled curves show the density function corresponding to all CpGs in each set. Horizontal lines indicate median of distribution. Because of the large number of CpGs considered (N = 478,661), all pairwise comparisons between cell types were significant (two-sided Wilcoxon P < 2 × 10−16). d Enrichment in cell-specific DHS among the top 1000 DNMT3A- or TET2-associated CpGs, compared to 1000 random genomic-context-matched CpGs (one-sided binomial test; N = 2000). Estimated OR (x-axis, indicated by filled squares) shows extent to which DNMT3A- or TET2-associated CpGs are enriched (or depleted) for DHS regions in six distinct cell types (y-axis), compared to other sites on the array. Horizontal lines indicate 1−α confidence intervals for estimated OR, using a Bonferroni-adjusted α of 0.05/12. Th1/2: Type 1/2 T helper cells. e, f Comparison of DNAm profiles associated with gene-specific mutations in CHIP vs. AML. Test statistics from EWAS of mutations in DNMT3A (e) or TET2 (f) in the context of blood samples from healthy individuals with or without CHIP (x-axis; Z-statistics from discovery sample meta-analysis, N = 582) vs. tumor samples from patients with AML (y-axis; T-statistics from EWAS of mutation type in TCGA data, N = 127 (e) or 108 (f)). Black points: FDR < 0.05 in CHS discovery sample but did not replicate; Blue or green points: FDR < 0.05 and replicated in ARIC.

Sites associated with DNMT3A and TET2 CHIP show differential enrichment for accessible regions in HSCs and progeny cells

We next examined whether CpGs associated with DNMT3A and TET2 CHIP were preferentially located in regulatory regions active in HSCs or downstream blood lineages. Because open chromatin is associated with active regulatory elements and bound transcription factors30, and demethylation has been shown to induce an open chromatin state31, we investigated whether sites associated with DNMT3A or TET2 mutations were enriched for accessible regions of chromatin in HSCs and five peripheral blood cell types. Using the eFORGE tool32, we tested the top 1000 replicated CpG sites associated with DNMT3A or TET2 mutations for enrichment for DNase I hypersensitive (DHS) hotspots identified by ENCODE33. Both sets of replicated CpG sites were enriched for DHS hotspots in HSCs, with the enrichment most pronounced for DNMT3A-associated CpGs (OR = 4.1, P = 1.3 × 10−98; Fig. 2d). TET2-associated CpGs showed enrichment for DHS hotspots among all five blood cell types (OR > 1.5, 1.1 × 10−63 < P < 2.9 × 10−9), with a strong enrichment for regions accessible in monocytes (OR = 3.3; P = 1.1 × 10−63). DNMT3A-associated CpGs were enriched for DHS among B cells, naïve T cells, and type 1 T helper cells (1.39 < OR < 1.59, 2.2 × 10−11 < P < 2.4 × 10−6) but not monocytes or type 2 T helper cells (OR < 1.22; P > 0.0042).
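To make the Bonferroni-adjusted interval in Fig. 2d concrete, the sketch below computes an odds ratio with a Wald confidence interval on the log scale, using α = 0.05/12 (presumably two CpG sets across six cell types). This is a simplified stand-in and not the eFORGE procedure, which the text describes as based on a one-sided binomial test; the counts of CpGs falling in DHS hotspots are hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(hits_a, n_a, hits_b, n_b, alpha):
    """Sample odds ratio with a Wald (1 - alpha) confidence interval on the log scale."""
    a, b = hits_a, n_a - hits_a  # test set: inside / outside DHS hotspots
    c, d = hits_b, n_b - hits_b  # matched background set: inside / outside
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


alpha = 0.05 / 12  # Bonferroni-adjusted alpha from the Fig. 2 legend
# Hypothetical counts: 400 of 1000 top CpGs vs. 140 of 1000 matched CpGs in hotspots.
dhs_or, ci_low, ci_high = odds_ratio_ci(hits_a=400, n_a=1000, hits_b=140, n_b=1000, alpha=alpha)
print(f"OR = {dhs_or:.2f}, {100 * (1 - alpha):.2f}% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An enrichment is declared at the adjusted level only when the whole interval lies above 1, which is a stricter requirement than the usual 95% interval.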

Genes proximal to DNMT3A-CHIP-associated sites show enrichment for HSC marker genes

To further compare the differential DNAm profiles to regulatory profiles in HSCs vs. progeny cells, we tested whether the set of genes annotated to replicated CpG sites associated with DNMT3A or TET2 CHIP mutations showed enrichment or depletion for genesets previously identified as marker genes for HSCs vs. other hematopoietic cells using scRNA-seq data from the Human Cell Atlas bone marrow tissue project34,35. Comparing 24 marker genesets, genes near DNMT3A-associated sites showed the strongest enrichment for HSC marker genes (OR = 2.6; P = 2 × 10−25), with more modest enrichment for marker genesets for naïve T cells, monocytes, common myeloid progenitor cells, and platelets (7 × 10−9 < P < 0.0008; Supplementary Fig. 9). In contrast, genes near TET2-associated sites showed nominally significant depletion for HSC marker genes (OR = 0.46; P = 0.004), though they were enriched for marker genesets for naïve T cells, monocytes, and neutrophils (1 × 10−14 < P < 0.0008; Supplementary Fig. 9). Comparison to human orthologs of marker genes identified in murine hematopoietic cells36 showed a similar enrichment for HSC marker genes among genes proximal to DNMT3A-associated CpGs (OR = 2.5; P = 1.9 × 10−16), and a similar (but non-significant) depletion among genes near TET2-associated CpGs (OR = 0.71; P = 0.3; Supplementary Fig. 10). No other significant enrichments or depletions were observed for genes near DNMT3A CHIP-associated sites, though each of the three lymphoid marker genesets showed nominally significant enrichment (0.009 < P < 0.034). Genes near TET2-associated sites were enriched for marker genes associated with natural killer cells (OR = 3.7; P = 2.9 × 10−4) and granulocytes (OR = 2.9; P = 0.0016), and showed nominally significant enrichment for monocytes (OR = 2.3; P = 0.0079).
Finally, we evaluated enrichment in a set of 36 orthologous genes (33 encoding transcription factors, and three encoding translational regulators) that were hypothesized as potential HSC reprogramming factors based on >2.5-fold greater expression in murine HSCs compared to 39 other hematopoietic cell types37. Genes near DNMT3A-associated sites were highly enriched for these 36 factors (OR = 6.6; P = 5 × 10−39), while genes near TET2-associated sites were not (OR = 0.57; P = 0.51).

Taken together, the results from the cell-type-specific enrichment analyses are consistent with a pattern where the hypomethylation associated with DNMT3A mutations occurs in regions associated with an HSC-like epigenetic and transcriptional profile, while the hypermethylation associated with TET2 mutations occurs primarily in regions associated with accessibility and transcription in differentiated blood cells.

Genes proximal to both DNMT3A- and TET2-CHIP-associated sites show enrichment for genes hypo-methylated in Dnmt3a knockout mice

Challen et al.17 previously reported that knockout of Dnmt3a in mice is associated with region-specific hypo- and hyper-methylation, and provided lists of genes corresponding to both hyper- and hypo-methylated regions. We assessed whether genes near sites associated with DNMT3A or TET2 CHIP were enriched for human orthologs of these genes. For genes near sites associated with DNMT3A CHIP, we observed strong enrichment for orthologs of genes associated with hypo-methylated regions in knockout mice (OR = 2.4; P = 5 × 10−28), but no enrichment for genes associated with hyper-methylated regions (OR = 0.82; P = 0.10). Genes near sites associated with TET2 CHIP were moderately enriched for orthologs of genes associated with hypo-methylated regions in Dnmt3a knockout mice (OR = 1.4; P = 0.014), but not for genes associated with hyper-methylated regions (OR = 0.80; P = 0.26).

Overlap between replicated sites and sites associated with aging

To examine the extent to which CHIP may contribute to the well-established DNAm signature of aging, we compared the results from our CHS EWAS of CHIP to an EWAS of age performed using the same dataset (see Methods). Of the 4341 sites significantly associated with age (P < 1.045 × 10−7), 243 overlapped with the 7422 sites associated with CHIP in the CHS (OR = 3.86; P = 7 × 10−64), and 176 overlapped with the 4912 sites that replicated in ARIC (OR = 3.95; P = 3 × 10−46), representing greater-than-expected overlap in both cases. Comparing the EWAS profiles of CHIP vs. age, there was no correlation between the full set of Z-statistics from the two EWAS (r = −0.015), but the CHIP and age EWAS Z-statistics showed substantial correlation when restricting to sites that were significant in the CHIP EWAS (r = 0.44) or sites significant in the age EWAS (r = 0.52; Supplementary Fig. 11). Among the 4341 sites significant in the age EWAS, we performed Sobel tests38 of a model where CHIP mediates the relationship between age and DNAm. Nominally significant evidence of mediation (P < 0.05) was observed for only 174 of 4341 sites, fewer than the 5% expected by chance. No sites showed significant mediation with FDR < 0.05, suggesting a lack of support for our hypothesis that CHIP could help explain the DNAm signature of aging.
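For readers unfamiliar with the Sobel test, the sketch below shows the standard first-order form of the statistic for an indirect effect a × b, where a would be the effect of age on CHIP and b the effect of CHIP on DNAm adjusted for age. How the two regressions were specified in this study is not described in this section, so the coefficients and standard errors are hypothetical. The last lines reproduce the chance-expectation arithmetic from the text.

```python
import math


def sobel_test(a, se_a, b, se_b):
    """Sobel z statistic and two-sided P value for the indirect effect a * b."""
    indirect = a * b
    se_indirect = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
    z = indirect / se_indirect
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    return indirect, z, p


# Hypothetical path coefficients for one CpG (not study data).
indirect_effect, sobel_z, sobel_p = sobel_test(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(f"indirect effect = {indirect_effect:.2f}, Z = {sobel_z:.2f}, P = {sobel_p:.2g}")

# Number of nominally significant sites expected by chance vs. the number observed.
n_age_sites = 4341
expected_by_chance = 0.05 * n_age_sites
observed_nominal = 174
print(f"expected by chance = {expected_by_chance:.0f}, observed = {observed_nominal}")
```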

Overlap between replicated sites and sites associated with leukemogenic mutations in cancer

We next used data generated by TCGA12 to investigate DNAm profiles in tumor samples from AML patients harboring either DNMT3A or TET2 driver mutations. Of the 396,065 CpGs available for analysis in the TCGA data, 13,031 were associated with the presence of a DNMT3A mutation in our EWAS of AML patients (see Methods; FDR < 0.05), and 12 were associated with the presence of a TET2 mutation (FDR < 0.05). CpGs that were significant in our replication analysis of DNMT3A CHIP were more likely than other CpGs to be significantly associated with a DNMT3A mutation in the AML patients (OR = 19.6; P < 2 × 10−16). Sites associated with TET2 mutations in our replication analysis did not overlap with the 12 CpGs associated with TET2 mutations in the AML patients but did show greater-than-expected overlap with the 26,016 sites nominally associated (P < 0.05) with TET2 mutations in AML (OR = 6.7; P < 2 × 10−16). Figure 2e, f shows that for both genes, differential DNAm was directionally consistent for the CHIP and AML EWAS, with the majority of sites associated with DNMT3A CHIP showing decreased DNAm in both contexts and the majority of sites associated with TET2 CHIP showing increased DNAm in both contexts. This directional consistency led to correlation between DNMT3A CHIP and DNMT3A AML test statistics (r = 0.29; Fig. 2e) and TET2 CHIP and TET2 AML test statistics (r = 0.33; Fig. 2f); comparable correlations were not observed between DNMT3A CHIP test statistics and TET2 AML test statistics (r = 0.10) or vice versa (r = 0.04).

Mendelian randomization analysis of CHIP-associated DNAm and coronary artery disease

To investigate whether DNAm changes may mediate the relationship between CHIP and CAD4,5, we tested whether DNAm at CHIP-associated CpGs causally influences the risk for CAD using two-sample Mendelian randomization (MR). For this analysis, 2580 CpGs that replicated in at least one CHIP EWAS met the inclusion criteria (5 or more independent associated cis-mQTL; see Methods) to be tested for causal association with CAD. The full MR summary statistics are presented in Supplementary Data 7. Genetic instruments were selected as cis-mQTL for these CpGs from the GoDMC database39 (http://www.godmc.org.uk/) based on significant SNP-CpG association (P < 5 × 10−8) and partial independence from other SNPs (r² < 0.05), followed by HEIDI-outlier analysis to remove pleiotropic instruments (see Methods). CAD outcome summary statistics were obtained from the independent meta-analysis of CARDIoGRAMplusC4D40 and UK Biobank CAD GWAS by van der Harst and Verweij41. CHIP has been shown to be associated with increased risk for CAD3,5,42. Consistent with the epidemiological observations, 1298 CHIP-associated CpGs were associated with increased risk for CAD in the MR analysis, of which 51 showed significant association with CAD at FDR < 0.05 and 12 at the Bonferroni threshold (P < 0.05/2580). However, there were 1282 CHIP-associated CpGs where the change in DNAm was associated with reduced risk for CAD in the MR analysis, of which 53 showed significant association with CAD at FDR < 0.05 and 7 at the Bonferroni threshold. A forest plot representing FDR-significant association between CpGs and increased risk for CAD is presented in Fig. 3, and scatter plots of corresponding SNP effects on exposures and outcome for the 12 Bonferroni-significant CpG sites are presented in Supplementary Fig. 12. Among the 51 exposure CpGs, 16 were associated with any CHIP, 31 with DNMT3A CHIP, and 4 with TET2 CHIP (Fig. 3).
Of the sites associated with DNMT3A CHIP, all showed decreased DNAm in individuals with CHIP, and decreased DNAm was associated with increased CAD risk in the MR analysis. For the 4 TET2-associated sites, 3 were consistent with CHIP → increased DNAm → increased CAD risk.
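The logic of a two-sample MR estimate for a single CpG can be sketched as follows. This is a plain inverse-variance-weighted combination of per-SNP Wald ratios, shown as a simplified stand-in for the GSMR method used in the study; it omits the HEIDI-outlier filtering and any adjustment for residual linkage disequilibrium between instruments. The SNP effect sizes and standard errors are hypothetical, and five instruments are used to mirror the inclusion criterion.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Per-SNP causal estimate (outcome effect / exposure effect) and first-order SE."""
    estimate = beta_outcome / beta_exposure
    se = abs(se_outcome / beta_exposure)
    return estimate, se


def ivw_mr(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance-weighted combination of per-SNP Wald ratios."""
    ratios = [wald_ratio(bx, by, sy)
              for bx, by, sy in zip(beta_exposure, beta_outcome, se_outcome)]
    weights = [1.0 / se**2 for _, se in ratios]
    estimate = sum(w * r for w, (r, _) in zip(weights, ratios)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    p = math.erfc(abs(estimate / se) / math.sqrt(2.0))  # two-sided normal P value
    return estimate, se, p


# Hypothetical summary statistics for five cis-mQTL of one CpG (not study data).
snp_on_dnam = [0.30, 0.25, -0.40, 0.35, 0.20]       # SD change in DNAm per allele
snp_on_cad = [0.012, 0.010, -0.016, 0.014, 0.008]   # log odds of CAD per allele
snp_on_cad_se = [0.004] * 5

mr_beta, mr_se, mr_p = ivw_mr(snp_on_dnam, snp_on_cad, snp_on_cad_se)
mr_or = math.exp(mr_beta)  # odds ratio for CAD per SD increase in DNAm
bonferroni_mr = 0.05 / 2580  # threshold for the 2580 CpGs tested
print(f"OR = {mr_or:.3f}, P = {mr_p:.2g}, passes Bonferroni: {mr_p < bonferroni_mr}")
```

In this toy example every instrument implies the same ratio, so the pooled estimate equals that ratio; with real data the spread of per-SNP ratios is what outlier procedures such as HEIDI are designed to inspect.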

Fig. 3: Mendelian randomization analysis of CHIP-associated CpGs and CAD risk.

For sets of replicated CpGs associated with CHIP, DNMT3A, and TET2 (y-axis), the odds ratio (x-axis, indicated by filled squares) reflects the change in CAD risk associated with each SD increase in DNAm, with lines representing 95% confidence intervals estimated by GSMR. GSMR analysis was based on published summary statistics (effect estimates) for cis-mQTL39 (N = 32,851) and CAD GWAS41 (N = 547,261). Only exposure CpGs showing causal evidence in the MR analysis (FDR < 0.05 based on P-values from two-sided χ2 test) are presented here; full summary statistics are available in Supplementary Data 7. *“Association with CHIP”: “+” or “−” signs indicate effect directions for associations with CHIP, DNMT3A or TET2 in the meta-EWAS of CHS AA, CHS EA, ARIC AA, and ARIC EA EWAS. #SNP: number of SNPs included in the MR analysis for each CpG.

Among the cis-mQTL used in the MR analyses, several were also cis-eQTL in blood43 (Supplementary Data 8). For example, in CHIP-associated CpG cg14594111, increased DNAm was causally associated with reduced CAD risk (Fig. 3), and the mQTL associated with cg14594111 were also cis-eQTL associated with reduced expression of nearby genes including C5, CNTRL, GSN, PHF19, and RAB14 (Supplementary Data 8). Likewise, in DNMT3A-associated CpG cg17969560, increased DNAm was causally associated with reduced CAD risk, and corresponding cis-mQTL were also cis-eQTL associated with reduced expression of nearby genes, such as LTBP3 and NEAT1. At FDR significance (FDR < 0.05), four TET2-associated CpGs showed causal association with CAD where increased DNAm in cg01919885, cg18642369, and cg10233454 were causally associated with increased CAD risk (1.03 ≤ OR ≤ 1.05; 1.8 × 10−4 < P < 9.0 × 10−4), whereas increased DNAm in cg08530064 was causally associated with reduced CAD risk (OR = 0.97; P = 2.0 × 10−3) (Fig. 3). Here, mQTL alleles associated with increased DNAm were associated with reduced gene expression in RGS12 (cg01919885), DOCK9 (cg18642369), STAT6 (cg10233454) and TMEM176B/TMEM176A (cg08530064) (Supplementary Data 8).

Discussion

Our study identified thousands of CpG sites across the genome whose DNAm was associated with CHIP, including distinct DNAm profiles associated with mutations in each of the two genes most commonly mutated in CHIP. Although this is the first study to identify these opposing methylomic profiles in the context of CHIP, the observed methylomic signatures of DNMT3A and TET2 are consistent with previous work studying mutations in these genes in other contexts. Our observed pattern of decreased DNAm associated with DNMT3A mutations and increased DNAm associated with TET2 mutations is consistent with a recent study of conditional knockout mice that observed a preponderance of hypomethylated regions when comparing regions of open chromatin in Dnmt3a-null to control mice, and hypermethylated regions when comparing Tet2-null mice to controls44. We observed the same patterns of increased or decreased DNAm when we compared AML patients with DNMT3A or TET2 mutations to AML patients with other mutations, and comparison between the AML and CHIP results revealed significant overlap between sets of CpG sites associated with DNMT3A or TET2 CHIP and those associated with mutations in the corresponding gene in AML (Fig. 2e, f). Notably, the DNAm samples used in our study of CHIP were all from healthy participants with no apparent malignancy, and the average VAF was low (~19%, Fig. S1b), compared to the VAF of somatic mutations found in cancer (often 50%). This highlights that aberrant DNAm patterns similar to those found in AML may predate clinical malignancy and can also occur in individuals with CHIP who never progress to cancer. It is also noteworthy that despite the low VAF in most individuals, we were able to observe striking DNAm profiles associated with CHIP, resembling profiles associated with leukemogenic mutations in AML patients or with complete knockout of the genes in mice.

Taken together, these results suggest that there are distinct DNAm profiles associated with impaired activity of DNMT3A or TET2 that can be observed across multiple contexts. Rather than a global gain or loss of DNAm across the genome, each of these DNAm signatures reflects gain or loss of DNAm at specific sites. DNMT3A-associated sites showed enrichment for reprogramming-specific DMRs identified by comparing DNAm of fibroblasts to induced pluripotent stem cells derived from those fibroblasts24, and gene ontology analysis of these signatures identified enrichment for ontologies relevant to developmental and cellular processes among genes located near DNMT3A-associated sites, and enrichment for immune processes and immune cell activation among genes located near TET2-associated sites. If we consider the canonical regulatory role of DNAm as a silencer of gene expression, this would suggest that the loss of DNAm associated with DNMT3A mutations leaves genes active in stem cells free to be expressed, while the gain of DNAm associated with TET2 mutations silences genes active in the downstream progeny of HSCs.

Along these lines, examination of DNAm levels in HSCs and their downstream progeny revealed that CpG sites associated with DNMT3A mutations had decreased DNAm in HSCs compared to myeloid and lymphoid cells, while sites associated with TET2 mutations showed increased DNAm in HSCs (close to 100% methylation for many sites). DNMT3A-associated sites also showed strong enrichment for regions of open chromatin in HSCs, and genes near these sites were enriched for HSC marker genes identified in both humans and mice. The two sites showing the most significant association with DNMT3A CHIP mapped to HOXB3, a gene found to be overexpressed in acute myeloid leukemia patients with DNMT3A mutations45 and highly expressed in uncommitted hematopoietic cells46. In contrast, TET2-associated sites were most enriched for regions of open chromatin in monocytes, and genes near these sites showed depletion for HSC marker genes. Overall, these patterns are consistent with a scenario where mutations in either DNMT3A or TET2 lead to DNAm patterns consistent with HSC-like activity, but through different avenues: DNMT3A mutations lead to DNAm loss that upregulates genes related to HSC activity, while TET2 mutations lead to DNAm gains that downregulate genes related to immune cell activity, thus maintaining an HSC-like state. This scenario aligns well with experimental data showing that knockout of either Dnmt3a17 or Tet218 results in increased self-renewal of HSCs, but that this occurs through immortalization of HSCs in Dnmt3a knockout models47, while Tet2 knockout models show normal exhaustion of HSCs but myeloid skewing during differentiation48. Our results support models previously suggested for Dnmt3a knockout49 and hypothesized for CHIP in general50, where DNMT3A loss prevents the silencing of the HSC self-renewal program that normally occurs through methylation of key regions, while TET2 loss prolongs self-renewal by disrupting the differentiation program normally activated via demethylation of key genes and regions.

Notably, EWAS of DNMT3A and TET2 CHIP were recently performed in a smaller set of individuals (N = 244), but this study did not identify significant associations between DNAm and DNMT3A CHIP19. The estimated effect sizes from our TET2 EWAS showed modest correlation with theirs (r = 0.269), and they also noted enrichment for transcription factor motifs from the ETS family among their results for TET2, but there was little correlation between effect sizes estimated from their DNMT3A EWAS vs. ours (r = 0.046). Sample size differences are one possible explanation for the difference between the two studies, but a more likely explanation is that DNAm differences associated with DNMT3A CHIP were masked in the previous study due to the inclusion of the top four principal components of DNAm as covariates. Given the striking DNAm profile of DNMT3A CHIP we observed in the CHS and ARIC cohorts, and the relatively high prevalence of DNMT3A CHIP (55 of 244 individuals in ref. 19), it is likely that both DNMT3A CHIP and the cell type proportions (which were included as covariates) were correlated with these principal components, inducing collinearity and masking any association in the previous study. Our high replication rate in ARIC (84% for CpGs significantly associated with DNMT3A CHIP in the discovery analysis), along with the alignment of our findings to previously reported experimental results, supports the presence of robust and distinct epigenetic profiles associated with both DNMT3A and TET2 CHIP.

A recent study by Nachun et al.51 reported associations between CHIP and increased biological age measured by seven different DNAm-based biomarkers of aging. Specifically, the presence of CHIP was associated with an average increase in age acceleration (residual of DNAm-predicted age after adjusting for chronological age) of 1.3–3.1 years across the seven biomarkers. This result supported our initial hypothesis that increased CHIP in older individuals may help explain the genome-wide pattern of age-related DNAm changes. We did observe a moderate correlation in the DNAm profiles associated with age vs. CHIP, but mediation analysis did not provide evidence for CHIP as a potential mediator of the relationship between age and DNAm; however, it may be useful to explore this further in larger studies. Interestingly, Nachun et al. found that stratifying individuals with CHIP based on positive vs. negative age acceleration identified a group at elevated risk for coronary heart disease51, suggesting that CHIP and DNAm-based age acceleration each contribute independent information about disease risk.

CHIP has been shown to contribute to the increased risk for CAD in older individuals4,5, but the mechanisms underlying this increased risk are not fully elucidated. Therapeutic hypotheses have focused on inflammasome activation5,52,53, but the involvement of orthogonal pathways is not well understood. Our MR analysis identified 51 CpG sites where CHIP-associated DNAm changes may contribute to CAD risk. For many of these CpGs, the instrumental variables associated with change in DNAm had an inverse effect on the expression of nearby genes, consistent with the canonical inverse relationship between DNAm and gene expression. Several of these genes have documented functions in lipid metabolism, inflammation, and atherosclerosis. For example, CHIP is associated with reduced DNAm in cg14594111, which is correlated with increased expression of complement C5. Increased C5 level in plasma is correlated with atherosclerotic plaque volume and coronary calcification54, whereas C5a, a protein fragment of the C5 protein, promotes atherosclerotic plaque disruptions55,56. DNMT3A CHIP is associated with reduced DNAm in cg17969560, whose mQTL instruments are correlated with increased expression of LTBP3 and NEAT1. LTBP3 is implicated in development of aortic aneurysms and dissections57,58,59, whereas NEAT1 is implicated in inflammation and atherosclerosis60,61,62. TET2 CHIP is associated with increased DNAm in cg10233454, which is correlated with reduced expression of STAT6. Lower STAT6 expression reduces polarization of anti-inflammatory M2 macrophages, increases plaque instability, and thus increases CAD risk63,64. TET2 CHIP is also associated with reduced DNAm in cg08530064, which is correlated with increased expression of TMEM176A/TMEM176B. TMEM176A/TMEM176B has been causally linked with HDL-C metabolism65,66, and higher expression of TMEM176B inhibits the NLRP3 inflammasome by controlling cytosolic Ca2+ (ref. 67). The NLRP3 inflammasome is involved in atherosclerosis68; thus, higher expression of TMEM176B/TMEM176A could have a protective effect against CAD in individuals with TET2 CHIP.

Interestingly, the MR analysis also identified 53 CpG sites where CHIP-associated DNAm changes showed a protective effect against CAD. Similar to the 51 “risk” CpG sites, the majority of these sites showed decreased DNAm with CHIP, but for these sites the MR analysis suggested that decreased DNAm at these sites was protective against CAD. Several of these sites were annotated to the first intron or promoter region of DNMT3B, which, if upregulated, could potentially help compensate for reduced DNMT3A activity. Four were annotated to the first intron of PRDM16, which is protective against cardiac hypertrophy and heart failure69, and whose expression in adipose tissue protects against diet-induced weight gain, likely through greater energy expenditure and activation of brown fat cell (as opposed to white fat cell) activity70. While it may seem counterintuitive for CHIP-associated DNAm changes to be identified as protective against CAD, the results of our functional annotation analyses suggest that the primary role of the DNAm changes associated with CHIP is to determine self-renewal vs. differentiation of HSCs. If DNAm does mediate the relationship between CHIP and CAD, it may be that the overall increase in CAD risk is incidental—i.e., that the CHIP-associated DNAm changes include a mix of risk and protective effects that when averaged lead to an increase in risk.

A potential limitation of our study was that DNAm and CHIP were not always measured on concurrent samples. While concurrent measurement in all samples would minimize potential sources of noise, it is important to note that once CHIP is acquired (VAF > 2%), the CHIP clone grows or remains stable in the majority of individuals71,72. CHIP was measured either prior to or concurrently with the first DNAm measurement for 84% of CHS participants in this study, and prior to or concurrently with the second for >99% of individuals. A second limitation was that CHIP prevalence was lower in our replication sample compared to our discovery sample. This was likely due to the younger age range of the replication sample, as Supplementary Fig. 1d shows comparable prevalence in CHS and ARIC within age groups. Another possible contributing factor is that our discovery vs. replication analyses relied on CHIP called from WGS vs. WES data. However, previous work73 has reported similar prevalence for CHIP called via these two approaches, and the prevalence of DNMT3A and TET2 CHIP in the ARIC AA cohort was similar to the population prevalences reported using WGS data for this cohort in ref. 73. Based on high rates of replication, the differences in prevalence did not appear to hinder our replication of CHIP or DNMT3A CHIP. In contrast, only one individual with a TET2 mutation was present in the ARIC EA cohort studied here. This led to a lower replication rate for CpGs associated with TET2 CHIP in the multi-ancestry meta-analysis, with only 13% replication of CpGs significant in the TET2 discovery analysis as compared to 84% replication for DNMT3A CHIP and 66% replication for any CHIP. However, comparison to the results from the EWAS of TET2 CHIP reported in ref. 19 suggested an effective replication rate of 63%, supporting that our discovery results are robust and that the lower replication rate stems from low prevalence of TET2 mutations in the younger ARIC cohort. Future studies in larger and older cohorts will help address this limitation, and will enable the examination of other genes with a lower population prevalence of mutations (e.g. ASXL1).

In conclusion, our results are consistent with a pattern where the two most common CHIP mutated genes both promote self-renewal of HSCs through opposing mechanisms, with DNMT3A mutations associated with loss of DNAm in regulatory regions near genes associated with HSC activity, and TET2 mutations associated with gain of DNAm in regulatory regions near genes associated with activity of progeny cells. Mendelian randomization analysis suggests that some of the DNAm alterations associated with CHIP may promote the risk for age-related clinical outcomes such as CAD, while others may be protective against risk.

Methods

Study cohorts

The Cardiovascular Health Study (CHS) is a population-based cohort for studying the risk factors for coronary heart disease and stroke in people ≥65 years of age74. Our discovery sample consisted of 582 CHS participants who had both CHIP and DNAm data available. DNAm was measured from blood samples taken from these participants in years 5 and 9 (N = 405), year 5 only (N = 171), or year 9 only (N = 6). CHIP calls were based on whole-genome sequences (WGS) of blood samples, the majority of which were taken 3 years prior (year 2, N = 192) or concurrently (year 5, N = 294) with the first DNAm measurement. A further 86 participants had CHIP calls based on blood samples taken during years 6–9 (so prior to or concurrent with the second DNAm measurement), and the remaining four individuals had CHIP calls based on year 10 samples.

Replication samples consisted of 2655 participants from the Atherosclerosis Risk in Communities (ARIC) Study. DNAm was measured from blood DNA samples taken at visit 2 (year 1990–1992; N = 2228) and visit 3 (year 1993–1995; N = 427). CHIP calls were based on whole exome sequences (WES) of blood DNA samples taken at visit 2 (N = 2234) and visit 3 (N = 421).

Informed consent was obtained from all study participants, and the study design and methods were approved by the respective institutional review boards at each of the collaborating institutions: University of Washington Institutional Review Board (CHS); University of Mississippi Medical Center Institutional Review Board (ARIC: Jackson Field Center); Wake Forest University Health Sciences Institutional Review Board (ARIC: Forsyth County Field Center); University of Minnesota Institutional Review Board (ARIC: Minnesota Field Center); and Johns Hopkins University School of Public Health Institutional Review Board (ARIC: Washington County Field Center). Each study received institutional certification before depositing sequencing data into dbGaP, ensuring approval by all relevant institutional ethics committees and compliance with relevant ethical regulations.

DNA methylation measurement

DNA methylation data for CHS and ARIC peripheral blood leukocyte samples were measured via the Illumina Infinium HumanMethylation450 BeadChip (Illumina Inc., San Diego, CA) (see Supplementary Note 1 for details).

CHIP calls

CHIP was detected previously in CHS from WGS blood DNA in the NHLBI Trans-Omics for Precision Medicine consortium73. The same procedure was applied for WES data in ARIC. Mutect2 software75 was used for somatic mutation calling from WGS data in CHS and WES data in ARIC. CHIP was called from the Annovar76 annotated VCF files using a custom R script and predefined list of CHIP genes, variants, and rules. The detailed CHIP calling pipeline was previously reported in Bick et al.73 (https://app.terra.bio/#workspaces/terra-outreach/CHIP-Detection-Mutect2). Individuals with a CHIP mutation at variant allele fraction (VAF) > 2% were defined as CHIP, and those without a CHIP mutation as control. CHIP mutations with VAF > 10% were considered expanded CHIP clones.
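
The VAF thresholds above can be illustrated with a minimal Python sketch. This is not the custom R script used for CHIP calling; the function name, the labels, and the assumption that the input contains only variants that already passed the CHIP gene and variant rules are all hypothetical.

```python
from typing import Iterable

CHIP_VAF_MIN = 0.02      # VAF > 2% defines CHIP (threshold from the text)
EXPANDED_VAF_MIN = 0.10  # VAF > 10% defines an expanded clone (from the text)


def classify_chip(vafs: Iterable[float]) -> str:
    """Classify one individual from the VAFs of their candidate CHIP variants.

    Assumption: `vafs` holds only variants that already passed the CHIP
    gene/variant rules; an empty input means no qualifying mutation.
    "expanded_chip" is a subset of CHIP, not a separate category.
    """
    largest = max(vafs, default=0.0)
    if largest > EXPANDED_VAF_MIN:
        return "expanded_chip"
    if largest > CHIP_VAF_MIN:
        return "chip"
    return "control"
```

Under these assumptions, an individual whose largest clone has VAF 0.19 would be labeled as an expanded clone, while a variant at exactly 0.02 would not meet the strict inequality.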

Discovery and replication EWAS

Ancestry-stratified epigenome-wide association analysis was performed using the CpGassoc (v2.60) R package77. Separate EWAS were performed in African-American (AA) and European American (EA) individuals within both the discovery and replication cohorts. Each EWAS fit a linear model for each CpG that modeled DNAm proportion as the outcome with CHIP status as the independent variable, adjusted for age, age², sex, batch, and estimated cell type proportions. In the CHS, individual random effects were included to account for repeated measures from the two longitudinal timepoints. We modeled CHIP status in three different ways: as an indicator variable for the presence of CHIP (yes/no), for the presence of a CHIP mutation in DNMT3A, or for the presence of a CHIP mutation in TET2. METAL software78 was used to perform inverse-variance weighted fixed-effect meta-analysis and Cochran’s Q-test for heterogeneity79. In the discovery analysis, we performed multi-ancestry meta-analyses to combine the results from EWAS within CHS-AA and CHS-EA within each of three EWAS (any CHIP, DNMT3A CHIP, and TET2 CHIP). P-values were computed for each site based on two-sided Z-tests, and genome-wide significance was assessed via false discovery rate (Benjamini-Hochberg FDR < 0.05) and Bonferroni threshold P < 1.04 × 10−7 (0.05/478661). CpGs significant in the discovery analysis (FDR < 0.05) were followed up with a replication analysis in ARIC. In the replication analyses, we fit the linear model described above to sets of discovery CpGs from the three aforementioned CHIP categories; models were fit separately in the ARIC-AA and ARIC-EA cohorts, followed by a multi-ancestry meta-analysis. CpGs with FDR < 0.05 and effect direction concordant with the discovery analysis were considered to be successful replications. As a sensitivity analysis, discovery and replication EWAS were also performed for a more restrictive definition of CHIP (VAF > 0.10; Supplementary Note 1).
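
For readers who want to see the arithmetic behind the meta-analysis and multiple-testing steps, the following Python sketch illustrates the standard formulas for one CpG. It does not reproduce METAL or CpGassoc; the function names and the input layout (per-cohort effect estimates and standard errors) are assumptions made for illustration.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided P-value for a standard normal Z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def ivw_fixed_effect(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis of one CpG.

    Returns (pooled beta, pooled SE, Z, two-sided P, Cochran's Q).
    Q is referred to a chi-square distribution with len(betas) - 1 df.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return beta, se, z, two_sided_p(z), q


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Bonferroni threshold quoted in the text: 0.05 / 478661, about 1.04e-7
BONFERRONI = 0.05 / 478661
```

A CpG would count as a discovery under the FDR criterion when its adjusted P-value from `benjamini_hochberg` is below 0.05, and under the stricter criterion when its raw P-value is below `BONFERRONI`.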

Enrichment tests

We tested each set of replicated CpGs from the three EWAS for enrichment of location relative to CpG islands (CGI), previously established differentially methylated regions (DMR), gene ontologies, and transcription factor binding motifs. Within each set of enrichment tests, we used a Bonferroni-adjusted significance criterion to adjust for the three EWAS and the multiple enrichment categories, unless otherwise specified. We used two-sided Fisher’s exact tests to test for enrichment of replicated CpGs in relation to CGI, CGI shores (≤2 kb from CGI), shelves (2–4 kb from CGI), and open sea regions (>4 kb from CGI)20,21. We used Illumina annotation data on experimentally determined tissue-specific differentially methylated regions (DMR), cancer-specific DMR (CDMR), or reprogramming-specific DMR (RDMR)24 and performed Fisher’s exact tests to elucidate whether replicated CpGs were enriched in these categories. We performed gene ontology enrichment analysis on sets of genes near replicated CpGs using the missMethyl Bioconductor R package80 v1.26.1. Finally, we used the HOMER software suite26 v4.11 to test the 200-bp regions surrounding replicated CpGs for enrichment for previously reported transcription factor binding motifs while accounting for regional differences in GC content. For the HOMER analysis we used the default settings to perform one-sided binomial tests to test for enrichment of known motifs 8, 10, or 12 bp in length, with the 200-bp regions surrounding CpGs not associated with DNMT3A or TET2 mutations (FDR > 0.05) provided as background sequences for comparison.
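
As an illustration of the enrichment tests, the sketch below computes a two-sided Fisher's exact P-value for a 2×2 table (for example, replicated vs. other CpGs by inside vs. outside CGI shores). It is a plain-Python illustration that assumes the common "sum of tables no more probable than the observed one" definition of the two-sided test; it is not the code used in the study and is intended for small illustrative counts.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)
```

The resulting P-value would then be compared to a Bonferroni-adjusted criterion reflecting the three EWAS and the number of enrichment categories tested.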

Functional annotation of replicated sites

To investigate the functional potential of CpG sites associated with DNMT3A or TET2 mutations, we assessed whether these sets of CpG sites were enriched (relative to other CpGs on the Illumina 450 K array) for regions identified as functionally relevant in the components of whole blood and in HSCs based on cell-type specific DNAm, chromatin accessibility, or gene expression profiles obtained from external reference data. Among genes near CpGs associated with DNMT3A or TET2 CHIP, we also examined enrichment for genes associated with differential methylation in Dnmt3a knockout mice17. For each set of tests, we used a Bonferroni-adjusted significance criterion to adjust for the two sets of CpGs and the multiple enrichment categories, unless otherwise specified.

Cell-type specific DNAm

To characterize the DNAm profiles of these CpG sets in HSCs vs. downstream blood lineages, we computed average methylation levels at each CpG according to WGBS data generated as part of the BLUEPRINT project29. Preprocessed DNAm data (counts of methylated and total reads by site) were downloaded from GEO series GSE87196 for HSCs and six peripheral blood cell types (CD4+ and CD8+ T cells, B cells, natural killer cells, monocytes, and neutrophils) obtained from purification of blood samples from three healthy donors. To establish average DNAm levels for HSCs vs. progeny cells while maximizing genomic coverage, data were combined across donors and within myeloid (monocytes and neutrophils) and lymphoid (T cells, B cells, and natural killer cells) lineages to form three datasets representing average DNAm levels in myeloid cells, lymphoid cells, and HSCs. The R functions liftOver() and findOverlaps() from the rtracklayer (v1.54.0) and GenomicRanges (v1.46.1) Bioconductor packages81 were used to identify CpGs in the WGBS data that overlapped with CpGs analyzed in the EWAS. Two-sided Wilcoxon tests were then used to compare the cell-type-specific DNAm distributions for our sets of replicated CpGs vs. other CpGs on the array.
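
One way to combine the WGBS counts across donors and lineages is sketched below in Python. The text does not state whether data were combined by pooling read counts or by averaging per-sample proportions; this sketch assumes pooling of counts. The record layout and cell-type labels are hypothetical.

```python
from collections import defaultdict

# Lineage grouping described in the text; the label strings are hypothetical.
LINEAGE = {
    "HSC": "HSC",
    "monocyte": "myeloid", "neutrophil": "myeloid",
    "CD4_T": "lymphoid", "CD8_T": "lymphoid",
    "B": "lymphoid", "NK": "lymphoid",
}


def pooled_methylation(records):
    """Pool read counts across donors and cell types within each lineage.

    `records` is an iterable of (cpg, cell_type, methylated, total) tuples,
    an assumed layout for the per-site counts. Returns a dict mapping
    (cpg, lineage) to methylated / total for sites with at least one read.
    """
    meth = defaultdict(int)
    tot = defaultdict(int)
    for cpg, cell_type, m, t in records:
        key = (cpg, LINEAGE[cell_type])
        meth[key] += m
        tot[key] += t
    return {k: meth[k] / tot[k] for k in tot if tot[k] > 0}
```

Pooling counts weights each sample by its read depth at the site, which favors coverage; averaging proportions would instead weight samples equally.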

Cell-type specific chromatin accessibility

To examine whether these CpG sets are enriched for regions of accessible DNA in HSCs and downstream progeny, we used the eFORGE tool32 v2.0 to test the top 1000 CpG sites in each set for enrichment in regions identified as DNase I hypersensitive (DHS) hotspots generated by the ENCODE project33 for HSCs and five peripheral blood cell types. eFORGE uses a binomial test to assess whether overlap with DHS hotspots is greater in our sets of replicated CpG sites compared to 1000 genomic-context-matched random probe sets from the same array. To assess significance, we compared the p-value from each binomial test to a Bonferroni-adjusted significance criterion (α = 0.05/12 = 0.0042, to account for two sets of CpG sites tested for enrichment in six cell types). For descriptive purposes, we generated odds ratios as the ratio of (1) the odds of sites overlapping DHS in our data to (2) the odds of sites overlapping DHS in the 1000 matched random sets.
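
The descriptive odds ratio and the Bonferroni criterion can be written out as a short Python sketch. This is not eFORGE code; it assumes that the background odds are computed from the overlap fraction pooled over all matched random sets, each the same size as the tested set, and the function names are hypothetical.

```python
def odds(p: float) -> float:
    """Convert a proportion into odds."""
    return p / (1.0 - p)


def dhs_odds_ratio(observed_overlap, n_sites, background_overlaps):
    """Odds ratio of DHS overlap in the tested set vs. matched random sets.

    observed_overlap: number of tested CpGs falling in DHS hotspots.
    n_sites: number of CpGs tested (1000 in the text).
    background_overlaps: overlap counts for each matched random set,
    assumed here to contain n_sites probes each.
    """
    p_obs = observed_overlap / n_sites
    p_bg = sum(background_overlaps) / (n_sites * len(background_overlaps))
    return odds(p_obs) / odds(p_bg)


# Two CpG sets tested in six cell types, as described in the text.
ALPHA = 0.05 / (2 * 6)
```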

Cell-type specific gene expression

To examine whether genes proximal to these CpG sites are enriched for cell-type-specific gene expression patterns, we used the Illumina 450 K annotation to associate each CpG site with a gene, and used two-sided Fisher’s exact tests to test these gene sets for enrichment or depletion of sets of genes previously identified as marker genes for HSCs vs. other hematopoietic cells derived from human bone marrow34 or mouse bone marrow, spleen, and peripheral blood36,37. For murine genesets, we identified human orthologs from Ensembl Release 105 (ref. 82) using the biomaRt Bioconductor package83. Significance was assessed with Bonferroni adjustment for the number of cell types considered.

Dnmt3a knockout mice

We obtained lists of genes previously identified as hypo- or hyper-methylated in Dnmt3a knockout mice from the supplemental materials of ref. 17. As above, we identified human orthologs from Ensembl Release 105 (ref. 82), and tested for enrichment via two-sided Fisher’s exact tests.

Comparison of replicated sites to sites associated with aging

To investigate the overlap between CHIP-associated sites and sites showing differential DNAm with age, we used CpGassoc to perform an EWAS for age in the CHS sample. Similar to our discovery EWAS, this analysis considered DNAm proportion as the outcome, with age as the independent variable and covariates for CHIP, sex, batch, and estimated cell type proportions, and random effects to account for repeated measures. We then compared results from the CHIP vs. age EWAS by considering (1) the proportion of CpGs that were significantly associated with both traits, and (2) Pearson correlation between the meta-analysis Z-statistics from the two EWAS. For CpG sites associated with both traits, we performed two-sided Sobel tests38 to assess whether CHIP is a potential mediator of the relationship between age and DNAm, defining significance as FDR < 0.05.
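
The Sobel test used for the mediation analysis has a simple closed form, sketched below in Python. The sketch assumes the first-order Sobel standard error and uses hypothetical argument names: `a` is the effect of age on CHIP and `b` is the effect of CHIP on DNAm after adjusting for age; how these path estimates were obtained in the study is not specified here.

```python
import math


def sobel_test(a: float, se_a: float, b: float, se_b: float):
    """Sobel Z-statistic and two-sided P-value for the indirect effect a*b.

    a, se_a: effect (and SE) of the exposure on the mediator.
    b, se_b: effect (and SE) of the mediator on the outcome,
             adjusted for the exposure.
    """
    se_ab = math.sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)
    z = (a * b) / se_ab
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

The per-CpG P-values from such a test would then be adjusted across sites to apply the FDR < 0.05 criterion.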

Comparison of replicated sites to sites associated with leukemogenic mutations in cancer

To assess the overlap between sites associated with DNMT3A and TET2 mutations and sites that associate with these mutations in the context of acute myeloid leukemia (AML), we downloaded data generated by The Cancer Genome Atlas (TCGA) that included Illumina 450 K DNAm data for tumor samples from 140 adult AML patients for whom potential driver mutations had been identified via whole-genome or whole-exome sequencing12. We then used CpGassoc to perform an EWAS for DNMT3A mutations by comparing patients with DNMT3A but not TET2 mutations (N = 28) to patients with other mutations (N = 99), adjusting for age and sex as covariates. We performed a similar EWAS to identify sites with differential DNAm in patients with TET2 but not DNMT3A mutations (N = 9) compared to patients with other mutations (N = 99). We compared the results of these EWAS to the results from our discovery EWAS and meta-analysis, assessing the correlation between test statistics as above.

Mendelian randomization analysis

To evaluate DNAm as a potential mediator of the relationship between CHIP and coronary artery disease (CAD), we performed two-sample Mendelian randomization (MR) between exposures (replicated CHIP-associated CpGs) and outcome (CAD). Here, cis-methylation quantitative trait loci (cis-mQTL) from the GoDMC database39 were used as instrumental variables (IVs) for the replicated CpGs (excluding the MHC region, chr6:27486711-33448264) associated with any CHIP, DNMT3A CHIP, or TET2 CHIP. The summary statistics of the CAD GWAS meta-analysis of CARDIoGRAMplusC4D40 and UK Biobank from van der Harst and Verweij41 were used. We used the generalized summary-data-based Mendelian randomization (GSMR) method of GCTA v1.93.2 (refs. 84, 85) for the analysis.

We prepared a European ancestry LD reference panel using 20,000 random samples from the UK Biobank imputed GWAS dataset. SNPs with allele frequency difference >0.2 between the GWAS summary dataset (mQTL or CAD QTL) and the LD reference were excluded. In the forward GSMR analysis we considered replicated CpGs with at least five partially independent (linkage disequilibrium, LD r² < 0.05) cis-mQTL with association P < 5 × 10−8. HEIDI-outlier analysis (heterogeneity in dependent instrument, described in Zhu et al.85) was then performed to detect variants with pleiotropic effects, and IVs with HEIDI-outlier P < 0.01 were excluded. For the FDR-significant (FDR < 0.05) GSMR results, we extracted cis-expression quantitative trait loci (cis-eQTL; Bonferroni-adjusted P < 0.05) from eQTLGen (www.eqtlgen.org)43 to see whether the cis-mQTL used in MR were also cis-eQTL, and compared the change in DNAm with corresponding change in gene expression.
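
The logic of two-sample MR from summary statistics can be illustrated with a simplified inverse-variance weighted (IVW) estimator in Python. This is a stand-in for intuition only and is not the GSMR method: GSMR additionally accounts for residual LD among instruments and for sampling variance in the SNP-exposure effects, and applies HEIDI-outlier filtering. The sketch assumes independent instruments, and all function names are hypothetical.

```python
import math


def wald_ratio(beta_exp: float, beta_out: float, se_out: float):
    """Per-SNP causal estimate and its first-order standard error."""
    return beta_out / beta_exp, se_out / abs(beta_exp)


def ivw_estimate(beta_exp, beta_out, se_out):
    """IVW combination of Wald ratios, assuming independent instruments.

    beta_exp: SNP effects on DNAm (mQTL), beta_out: SNP effects on CAD
    (log odds), se_out: standard errors of the SNP-CAD effects.
    """
    weights = [bx ** 2 / s ** 2 for bx, s in zip(beta_exp, se_out)]
    total = sum(weights)
    est = sum(w * by / bx
              for w, bx, by in zip(weights, beta_exp, beta_out)) / total
    return est, math.sqrt(1.0 / total)


def to_odds_ratio(est: float, se: float):
    """Odds ratio and 95% confidence interval from a log-odds estimate."""
    return (math.exp(est),
            math.exp(est - 1.96 * se),
            math.exp(est + 1.96 * se))
```

With DNAm effects expressed per SD, the exponentiated estimate corresponds to the change in CAD odds per SD increase in DNAm, which is the scale shown in Fig. 3.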

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.