Main

Genome-wide association studies (GWAS) have been successful in discovering more than 100,000 genomic loci that contain common single-nucleotide polymorphisms (SNPs) associated with complex traits and disease-related phenotypes, providing a very important starting point for the systematic investigation of the molecular mechanism of human disease1,4. However, the vast majority of these genetic associations remain devoid of any mechanistic hypothesis underlying their molecular and cellular functions, as more than 90% lie outside protein-coding exons and probably have non-coding roles in gene-regulatory regions with circuitry that remains unresolved2,3.

Large-scale experimental mapping5,6,7,8 and integration of histone modification marks and DNA accessibility have helped to annotate diverse classes of gene-regulatory annotations, including distal-acting and tissue-specific enhancers and proximal-acting and mostly constitutive promoters9,10. These maps help to elucidate the molecular basis of complex traits by revealing preferential localization (enrichment) of trait-associated genetic variants in tissue-specific gene-regulatory elements3,6,11,12,13,14,15,16 and by fine-mapping possible causal genetic variants in enriched annotations14,17,18,19. However, these maps also have limitations: they miss many disease-relevant tissues, have variable quality, and are prone to experimental noise and methodological variation between protocols, laboratories, antibody lots, reagents, batches, computational processing pipelines, software versions and integration pipelines. Moreover, consortia that require common marks across samples often exclude samples that miss some marks or marks that are missing in some samples, thus reducing biological space coverage, and often only profile few marks in many samples, or many marks in few samples owing to cost limitations.

Here we overcome many of these limitations and present a new human epigenome reference, EpiMap (for epigenome integration across multiple annotation projects) (Fig. 1a). We inferred chromatin-state annotations that combine multiple marks 9, and a high-resolution enhancer annotation that combines DNA accessibility and multiple chromatin enhancer states. We grouped enhancers into modules that show common activity patterns, and inferred candidate upstream regulators and enriched functions of downstream genes for each module on the basis of regulatory motif and gene ontology enrichments. We also inferred enhancer target genes using a machine learning approach. We integrated this high-resolution gene-regulatory circuitry with genetic association results, revealing traits with epigenomic enrichments, and predicting causal variants and tissue-specific target genes. We distinguished unifactorial, multifactorial and polyfactorial traits on the basis of the diversity of their enriched tissues, and partitioned the loci of polyfactorial traits according to their overlap in distinct enriched tissues, thus revealing their distinct biological processes and disease comorbidity patterns. We also distinguished monotropic versus pleiotropic loci, and found that top-scoring loci frequently have multiple predicted driver variants, converging through diverse pleiotropy patterns involving multiple enhancers with a common target gene, multiple genes in a common tissue, or multiple genes in multiple tissues. Our results demonstrate the utility of dense, rich, multidimensional, high-resolution epigenomic and regulatory circuitry annotations for gene regulatory studies, complex trait investigation and studies of disease locus mechanism, resulting in unprecedented scale, scope and coverage of biological space and disease complexity.

Fig. 1: EpiMap resource overview.
figure 1

a, We created a compendium of over 17,000 epigenomic tracks across 18 marks by uniform processing and imputation and used these to call chromatin states for 833 biosamples and active-enhancer states over 2.1 million DNase I hypersensitive sites (DHSs). We used unsupervised clusters of the enhancer activities to call enhancer downstream target genes, upstream regulators, and to prioritize, investigate and compare hundreds of GWAS traits and thousands of loci. GO, gene ontology; QC, quality control; TF, transcription factor. b, Data matrix across 859 samples (columns) and 40 assays (rows), ordered by the number of experiments (parentheses) and coloured by metadata. EEM, extra-embryonic membranes; ES, embryonic stem; expts, experiments; HSC, haematopoietic stem cell; iPSC, induced pluripotent stem cell; H3T11ph, histone H3 phosphorylated at T11; PNS, peripheral nervous system. ENCODE new, ENCODE post-2012 data freeze + publication; Roadmap new, Roadmap post-2015 data freeze + publication.

EpiMap generation and validation

We uniformly processed 3,030 observed5,6,7,8 genomic tracks across 859 biosamples (406 ENCODE5, 425 Roadmap Epigenomics6 and 28 Genomics of Gene Regulation (GGR)8 samples) that span 18 epigenomic assays, and computationally imputed20 14,952 tracks (Fig. 1b, Supplementary Fig. 1, Supplementary Table 1), which are available for download and interactive visualization21 at http://compbio.mit.edu/epimap.

Our imputed tracks matched held-out observed tracks, both visually across randomly selected regions (Extended Data Fig. 1a, b) and quantitatively with more than 85% peak recovery and more than 75% average genome-wide correlation for punctate marks (59% of tracks) genome-wide (Supplementary Fig. 2). Imputation was robust even with few supporting datasets, and performed best when target datasets showed more than 50% average correlation to their ten nearest datasets, which held for 98% of single-assay samples (Supplementary Fig. 3).

Imputed data also matched independent post-data freeze experiments, outperforming ‘average signal’ and ‘nearest track’ benchmarks (in practice knowable only after generating the target track) for 96% of punctate marks and 77% of broad marks both genome-wide and specifically focusing on rare events (Extended Data Fig. 1c–h).

Disagreement between imputed and observed tracks helped to flag 138 potentially problematic datasets, which independently also showed markedly lower quality control scores (Supplementary Fig. 2a–c) and revealed potential sample or antibody swaps (Supplementary Figs. 4, 5), some of which were independently flagged by the data producers. Subtraction of the imputed track signal from the observed track signal revealed 13 experiments with potential antibody cross-reactivity or secondary specificity (Supplementary Figs. 68). From subsequent analyses, we removed the 138 flagged datasets and 442 tracks based solely on assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) or low-quality DNase-seq data, resulting in 2,850 observed and 14,510 imputed marks across 833 biosamples used in the remainder of this work.

Epigenomic landscape

The resulting compendium of 833 high-quality reference epigenomes, grouped into 33 tissue categories, represents a major increase in biological space coverage, with 75% (624 of 833) of biosamples corresponding to new biological specimens. Observed and imputed data co-clustered, but imputed datasets better captured the continuity between and within different sample categories, and more clearly revealed sample-type groups, probably driven by both their cleaner signal tracks and their sheer number. Moreover, the distances between imputed datasets were less affected by technical covariates (Supplementary Fig. 9) and were more consistent with sample groupings (Supplementary Fig. 10).

Hierarchical and two-dimensional embedding clusterings of multiple marks using both genome-wide and relevant-region-specific correlation patterns (Extended Data Fig. 2a–c) grouped biosamples first by life stage (adult versus embryonic) and sample type (complex tissues versus primary cells versus cell lines), and second by distinct groups of brain, blood, immune, stem cell, epithelial, stromal and endothelial biosamples within them. Active marks (histone H3 lysine 27 acetylation (H3K27ac), histone H3 lysine 4 monomethylation (H3K4me1) and H3K9ac) primarily grouped biosamples by differentiation lineage (blood, immune, spleen, thymus, epithelial, stromal and endothelial) and tissue type (lung, kidney, heart, muscle and brain), while repressive marks (H3K27me3 and H3K9me3) captured life stage (pluripotent, induced pluripotent stem cell derived, embryonic and adult) (Extended Data Fig. 2b, Supplementary Fig. 11), consistent with previous studies22,23. Donor sex was not a primary factor in sample grouping.

We annotated genome-wide locations of 18 chromatin states5,6,9 in all 833 biosamples using combinations of histone modifications, including multiple types of enhancer, promoter, transcribed, bivalent and repressed regions (Extended Data Fig. 3a), using a mixture of observed and scaled imputed data and excluding the 138 flagged observed datasets (Methods). Genomic coverage and mark frequencies remained stable across biosamples for most states (Extended Data Fig. 3b–d), but biosamples with fewer observed datasets showed more heterochromatin and Polycomb-repressed states, consistent with our previously noted lower imputation accuracy for broad marks.

We annotated 2.1 million high-resolution active-enhancer regions by intersecting five active-enhancer states with 3.6 million accessible DNA regions from 733 DNase-seq experiments24. These covered 13% of the genome cumulatively and 0.8% on average for biosamples individually (Fig. 2a, Supplementary Fig. 13a–g), and represent a more than twofold increase relative to the ENCODE 2020 release25 (Extended Data Fig. 4). Clustering biosamples by sharing of active enhancers captured biologically meaningful groupings (Extended Data Fig. 5, Supplementary Fig. 12).

Fig. 2: Enhancer module circuitry.
figure 2

a, Overview of gene-regulatory module clustering. The full module breakdown is shown in Extended Data Figs. 6, 7 and online at http://compbio.mit.edu/epimap. Activity modules are shown in Fig. 2b, Extended Data Fig. 6a. FC, fold change. b, Clustering of 2.1 million enhancer elements (top) into 300 modules (columns) using the activity levels of enhancers (heat map) across 833 samples (rows), quantified by the levels of H3K27ac within accessible enhancer chromatin states. Bottom, the enrichment of each module for each metadata annotation, highlighting 34 groups of modules (separated by dotted lines): 33 specific to sample type (coloured boxes) and 1 multiply enriched (left-most). LCL, lymphoblastoid cell line. c, Subsets of enhancer module centres (top panels) and motifs (bottom panels) for heart, brain and haematopoietic cell samples (top, rows), selected GO terms (middle, rows) and selected motifs (bottom, rows) in modules (columns) with maximal enrichment in each of the three sample categories. The GO heat map is coloured by enrichment −log10P (0–2: white; 2–3: yellow; 3–4: orange; and 4+: red). The full subsets are shown in Extended Data Fig. 7. CMP, common myeloid progenitor; MPP, multipotent progenitor; NK, natural killer.

Enhancer modules, targets and regulators

For each high-resolution active-enhancer region, we defined H3K27ac-based local activity levels across 833 biosamples and used them to group enhancers into 300 enhancer modules (Fig. 2a, b, Extended Data Fig. 6a–c, Supplementary Figs. 14, 15), including 290 tissue-specific modules (1.8 million enhancers, 88% of enhancers cumulatively, active in 2% of biosamples on average) and 10 broadly active modules (251,079 enhancers, 12% of enhancers, active across 77% of sample categories on average).

Enhancer modules showed substantial high-resolution, tissue-specific gene ontology enrichments for neighbouring genes (Extended Data Figs. 6, 7, Supplementary Fig. 16), including ion channels (for brain modules); camera-type eye development (eye); neural precursor cell proliferation (neurosphere); endothelial proliferation, hemidesmosomes and digit morphogenesis (endothelial, stromal and epithelial); and organ development and morphogenesis (embryonic).

We predicted 3.3 million tissue-specific enhancer–gene links by combining epigenomic–transcriptional correlation and genomic proximity, each gene linked to 13 enhancers and each enhancer to 1.5 genes on average, at a median distance of 42,359 bp. Links were approximately sixfold more specific than enhancers, and sample-specific links spanned larger distances than constitutive links (Extended Data Fig. 8a–f). Our links outperformed previous linking approaches, using both gene-set enrichment metrics and curated gold-standard datasets (Methods, Extended Data Fig. 8g, h), and greatly expanded the biosamples with predicted links (from 127 to 833).

We predicted upstream regulators for 273 modules (91%), implicating 1,175 motifs grouped into 160 motif archetypes26 (Extended Data Figs. 6, 7, Supplementary Fig. 17), including 152 tissue-specific motif archetypes (enriched in 6 modules on average) and 8 broadly enriched (enriched in 53 modules on average). Specific motifs include: GATA and SPI1 in the blood and immune samples27; NEUROD2 and RFX4 in the brain and peripheral nervous system28,29; KLF4 for digestive tissues30; and TEAD3 for the placenta, myosatellite and epithelial cells31.

Broadly enriched motifs revealed highly connected, combinatorially acting master regulators, including HNF1A in the liver, kidney and pancreas (with NR5A2)32; AP-1 (also known as JUN) or JDP2 in immune, bone and cancer samples33; and TEAD3, paired alternately with MYF6 (myosatellite), TFAP2A (placenta) and AP-1 (stromal) (Extended Data Fig. 7d, e).

Motif enrichments often partitioned tissue categories into subgroups specific for developmental stage and tissue type (Fig. 2c, Extended Data Fig. 7a–c), including heart into embryonic heart (NFIX and E2F1), aorta and arteries (SRF and PAX5), and heart chambers (MEF2D and ESRRG); brain into embryonic (NFIX and NEUROD2), adult brain (RFX2 and SOX10), and astrocytes (NFE2L2 and JDP2); and haematopoietic cells into natural killer cells (ETV2), B cells (NFKB2 and SPIB), and multipotent progenitors (GATA1 and NFE2L2).

Interpreting GWAS loci

We next used our 2.1 million enhancer annotations and their tissue specificity to interpret genetic variants associated with complex traits3,6,11. We compiled a compendium of 803 well-powered GWAS34 with 10 or more significant loci and over 10,000 cases (15% of 5,454 GWAS publications) that capture over 70,000 GWAS loci (63% of the NHGRI-EBI catalogue4).

We found 17,658 significant trait–tissue enrichments, enabling fine-mapping of candidate driver SNPs in over 27,000 loci (39%) from 245 traits in tissue-enriched enhancers (false discovery rate of <1%) (Extended Data Fig. 9a, Supplementary Figs. 18, 19). New biosamples captured the strongest GWAS enrichment in 79% of cases (193 of 245) and the only significant enrichment in 24% of cases (n = 58), and our annotations captured 2.5-fold more GWAS studies than DNase alone (245 versus 97) (Extended Data Figs. 9b, 10a–d).

To capture common enrichments of similar biosamples, we also calculated trait enrichments for enhancer modules, resulting in approximately threefold more enriched traits (717 versus 245) but 38% fewer SNPs in enriched annotations (Supplementary Fig. 20). Instead, using the hierarchical enhancer-sharing tree (Extended Data Fig. 2c) to reveal the appropriate tissue resolution of GWAS enrichment in comparisons of subtree-versus-parent enhancers showed 2.2-fold more enriched traits (540 versus 245) and 20% more SNPs in enriched annotations (32,532 SNPs) (Extended Data Fig. 10e, Supplementary Fig. 18), representing an approximately tenfold increase from the 54 traits enriched in H3K27ac and the 58 traits enriched in H3K4me1 reported by the Roadmap Epigenomics project6.

Our epigenomic enrichments and enhancer–gene links yielded new biological insights on disease loci, with many compelling examples. For breast cancer GWAS35, enriched in epithelial and cancer biosamples (Extended Data Fig. 9c), the highly localized rs17356907 genetic signal (P = 10−39, rank no. 16) localized precisely in a narrow epithelial and cancer enhancer nearest to USP44 but linked instead to NTN4, which is implicated in tumorigenesis and angiogenesis (Extended Data Fig. 9d). For schizophrenia GWAS36, maximally enriched in the mid-frontal cortex (Extended Data Fig. 9e), the diffuse rs2007044 genetic signal (P = 10−17, rank no. 3) overlapped a broad set of enhancers nearest to the DCP1B promoter, all of which linked to CACNA1C, which encodes a calcium channel implicated in neuropsychiatric disorders, suggesting that multiple causal variants may contribute jointly to its dysregulation37 (Extended Data Fig. 9f). We have provided an interactive website for exploring more than 30,000 additional loci across more than 500 traits at http://compbio.mit.edu/epimap.

GWAS and tissue co-enrichments

We then studied trait–tissue, trait–trait and tissue–tissue epigenome GWAS co-enrichment patterns to gain insights into their complex interactions. First, we used the number of distinct tissue categories enriched in each trait (Extended Data Fig. 9a (left), Supplementary Data 1) to distinguish: 56 ‘unifactorial’ traits (22%) with most enriched nodes in only one tissue group (for example, QT interval in the heart, educational attainment in the brain and hypothyroidism in immune cells) versus 192 ‘multifactorial’ traits (79%) enriched in five tissue categories on average (for example, Alzheimer disease in immune cells and the brain38; waist-to-hip ratio39 in adipose, muscle, kidney and digestive tissues), of which 26 ‘polyfactorial’ traits (11%) enriched in 14 tissue categories on average (including coronary artery disease (CAD)40 in 19 tissue groups, including liver, heart, adipose, muscle and endocrine samples).

Second, we used trait co-enrichment patterns in the same tissues to cluster GWAS traits with similar properties. The resulting network (Fig. 3) showed a small number of densely connected communities of primarily unifactorial traits (for example, cognitive and psychiatric traits in the brain and neurons, heartbeat intervals in the heart, cholesterol in the liver, filtration in the kidney, immune traits in T cells and blood cell counts in haematopoietic cells) with multifactorial connectors between them (for example, CAD between heart, endocrine and liver; HDL and triglycerides between liver and adipose; lung function between lung, heart and digestive tissue; blood pressure between heart and endocrine, endothelial and liver; and cell count between liver and digestive tissue) (Supplementary Figs. 21, 22). Many biologically meaningful similarities in this epigenomic co-enrichment-based network are missed by a network based on genetic overlap (934 edges, traits sharing 5% or more loci at a 10-kb resolution), which only captures 5% of epigenomic co-enrichment edges (283 of 5,547) (Supplementary Figs. 2325).

Fig. 3: Trait–trait network.
figure 3

The network across 538 traits (by per-node false discovery rate correction) by similarity of epigenetic enrichments (cosine similarity ≥ 0.75), laid out using the Fruchterman–Reingold algorithm. Traits (nodes) are coloured by the contributing groups (pie chart by the fraction of −log10P, and size by maximal −log10P) and interactions (edges) by the group with the maximal dot product of enrichments between two traits. The redundant node names indicate different GWAS (the full names for non-singleton nodes are available in Supplementary Fig. 22). AD, Alzheimer disease; ADHD, attention-deficit/hyperactivity disorder; BMI, body mass index; CVD, cardiovascular disease; FEV1, forced expiratory volume in 1 s; T2D, type 2 diabetes; vWF, von Willebrand factor; WHR, waist-to-hip ratio.

Third, we used co-enrichment properties of pairs of tissues in the same traits to distinguish ‘principal’ tissues (for example, immune cells, liver, heart, brain and adipose tissues) that showed consistently higher enrichments versus ‘partner’ tissues (for example, digestive, lung, muscle and epithelial tissues) for the same GWAS traits, suggesting that they have driver rather than auxiliary roles (Extended Data Fig. 11a). Specific principal–partner tissue pairs co-occurred more frequently than expected (Extended Data Fig. 11b), and revealed biologically meaningful traits where they probably co-act (Extended Data Fig. 11c), including: liver with adipose tissue (for cholesterol traits), with digestive tissue (for gallstone) and with blood cells (for serum protein levels); and adipose tissue with endothelial cells (for waist-to-hip ratio), with heart tissue (for atrial fibrillation) and with muscle tissue (for blood pressure).

Partitioning multifactorial traits

We next used our epigenomic annotations to partition multifactorial trait SNPs into tissue-specific components, by studying functional and disease enrichments for distinct subsets of enhancer-overlapping SNPs in each enriched tissue (Fig. 4a–d, Extended Data Fig. 12, Supplementary Fig. 26).

Fig. 4: Partitioning of polyfactorial traits.
figure 4

a, Workflow for the investigation of GWAS epigenomic enrichment using the biosample tree (Extended Data Fig. 2c). Additional trait enrichments, SNP assignments, links and their corresponding loci are available at http://compbio.mit.edu/epimap. b, Epigenomic enrichments for CAD40 on the enhancer-sharing tree. Nodes that passed false discovery rate < 0.1% are labelled by rank, category and components, and subtrees are shown (the large circles are the top 20 nodes by −log10P). The leaves are annotated by metadata and the number of enriched parent nodes (outer, red = 1, black = 2). c, The top 10 enriched nodes for CAD with nominal P values (heat map) and shared enhancer set sizes (bar plot) with the number at the subtree (full bar) and the number of differential enhancers between the node and its parent (tested set, dark bar). d, GO enrichments of node enhancers with lead SNPs (nearest expressed genes), coloured by the tissue group of each node and diagonalized (over-representation test). e, Enrichment for significant loci in overlap of CAD loci with loci from five related traits, within enriched enhancers in each node (heat map, −log10P of one-tailed Mann–Whitney test against the loci of each trait in the enhancer annotations). f, Enhancer overlaps with the top 30 lead SNPs from CAD GWAS for the top 10 enrichments on the enhancer tree. gi, Loci centred on CAD lead SNPs with links (top), the H3K27ac signal (middle) and GWAS summary statistics, for lead SNPs rs11591147 (chr1: 55,505,647; P = 2 × 10−25) (g), rs6841581 (chr4: 148,401,190; P = 5 × 10−24) (h) and rs17114046 (chr1: 56,966,350; P = 8 × 10−28) (i). Loci show enhancer–gene links for SNP proximal enhancers for the top enrichments (Enr.) and across the locus for labelled categories (Cat.; linked enhancers in grey) (top); the H3K27ac signal in enhancers for the top three enriched subtrees, the six selected tissue categories and the average (middle); and genes (transcription start site (red lines)) and CAD GWAS summary statistics, with SNPs below P = 5 × 10−8 in grey (bottom).

For example, the 339 CAD-associated SNPs lying in enriched tissue enhancers partitioned into: 195 heart-enhancer SNPs enriched in artery, cardiac and vessel morphogenesis; 171 endocrine-enhancer SNPs in lipid homeostasis; 169 liver-enhancer SNPs in cholesterol and lipid metabolism and transport; 122 adipose-enhancer SNPs in axon guidance and focal adhesion, consistent with adipose tissue innervation processes; and 112 embryonic stem cell-derived–muscle enhancer SNPs, enriched in septum morphogenesis, chamber and aorta development.

These partitions also showed distinct co-associations (Fig. 4e). For example: heart, muscle and endothelial enhancer CAD SNPs co-associated with high blood pressure and atrial fibrillation; liver and endocrine enhancer CAD SNPs with systolic blood pressure; adipose enhancer CAD SNPs with waist-to-hip ratio; and liver, adipose and endocrine CAD SNPs with HDL cholesterol.

Individual multifactorial trait loci included both single-tissue and multiple-tissue loci (Fig. 4f). Some CAD loci overlapped only heart enhancers (for example, EDNRA, TCF21 and ADAMTS7), some only liver enhancers (for example, PCSK9), some lacked any enhancer overlaps (possibly acting at non-enhancer levels of regulation, or in uncaptured tissues or conditions), and many overlapped enhancers that were active in multiple tissues (for example, LDLR, APOE, SH2B3 and COL4A1), suggesting multiple mechanisms of action even at the single-locus level.

For example, the liver-only CAD-associated locus near the LDL cholesterol regulator PCSK9 (ref. 41) (rs11591147, P = 2 × 10−25, rank no. 21) showed a strong liver-specific signal and liver-specific enhancer–gene links to PCSK9 (Fig. 4g). The heart-only CAD-associated 250-kb locus near EDNRA contains two separate associations, the transcription start site-centred rs6841581 (P = 5 × 10−24, rank 27) and the enhancer-centred rs4583018 (P = 8 × 10−15, rank 66), both in strong coronary artery enhancers and both linked to EDNRA through strong artery links, putatively reflecting multiple functional variants37 that converge on the same target gene in the same tissue (Fig. 4h).

Even seemingly single-tissue loci sometimes showed second-tissue signals: the 1-Mb rs17114046 locus (P = 8 × 10−28, rank no. 14; Fig. 4i) showed primarily liver activity with multiple SNP-overlapping enhancers linked to liver-expressed PLPP3, the liver-specific deletion of which increases atherosclerosis42; however, our liver-specific links also implicated liver-produced complement factor C8A43, and our heart-specific and muscle-specific links implicated PRKAA2, which encodes an AMP kinase subunit that is involved in cardiac metabolism44. These are both biologically relevant, highlighting that even individual loci may be pleiotropic, a property repeatedly found for many top-scoring loci.

Discussion

In this work, we presented a comprehensive map of the human epigenome, EpiMap, encompassing approximately 15,000 epigenomic tracks across 833 distinct biological samples that greatly expand the coverage of both embryonic and adult tissues and cells. We combined observed and imputed datasets across 18 epigenomic marks to jointly annotate and distinguish diverse classes of chromatin states, including enhancer, promoter, transcribed, repressed and quiescent regions. We extensively validated the high quality of our annotations and found that they outperformed stringent benchmarks, using both held-out and external experimental datasets for validation.

We used this resource to assemble a comprehensive view of human genome circuitry across primary tissues, cells and cell lines, annotating 2.1 million high-resolution gene-regulatory regions; their activity patterns across 833 biosamples; their enriched regulatory motifs, motif combinations and putative upstream regulators that are responsible for their co-regulation; their enriched gene functions and biological pathways that they probably control; and their tissue-specific target genes. Our high-resolution enhancer annotations provide a highly concentrated view of the non-coding landscape, yielding many gene-regulatory insights but covering only 0.8% of the genome in each sample and only 13% total across all samples. Our linking revealed the high number of enhancers that control each gene and the high tissue specificity of long-range enhancer–gene links. Our upstream regulator analysis revealed a highly combinatorial and hierarchical view of gene regulation, with a small number of master regulators (for example, RFX2–RFX4, GRHL1, HNF1A and AP-1) interacting with diverse partners in different tissues to define tissue-specific gene-regulatory programs.

Our work has also provided high-resolution molecular investigations of complex traits and human disease circuitry. We found statistically significant epigenomic enrichments for 540 GWAS traits implicating 30,000 SNPs in tissue-enriched enhancers, used trait and tissue co-enrichment patterns to annotate tissue partnerships and trait pleiotropy, and to partition disease SNPs into tissue-specific functional components. For individual GWAS loci, our work provides mechanistic insights at unprecedented scale. We have highlighted specific examples of GWAS investigations at varying levels of complexity, from the typically sought single-enhancer to single-gene, to multiple enhancers converging on a single target37, to multiple genes and multiple tissues acting in pleiotropy in a single locus.

Beyond the specific examples highlighted in our figures, we have also provided a rich interactive supplementary website (Supplementary Fig. 27) for our study (at http://compbio.mit.edu/epimap), enabling detailed interactive exploration of functional and motif enrichments of 300 enhancer modules; motif–tissue networks and enrichments; GWAS enrichments for 540 traits against our biosample tree; GWAS-enriched tissue enhancer SNP overlaps and target gene predictions; and 30,000 disease locus visualizations with putative driver SNPs, enhancers, tissues and tissue-specific target genes. These can enable the generation of detailed hypotheses for future experimental follow-up in countless studies of gene regulation and disease.

Our collection also has several limitations: tissue samples are not at single-cell resolution; we do not consider donor genotype or phenotype; imputation may result in increased homogeneity and miss rare sample-specific events; and we still miss many tissues, environmental and stimulation conditions, and developmental stages.

Our work enables many future studies: hierarchical and multi-resolution tree-based analyses of gene regulation and GWAS; machine learning-based gene circuitry and combinatorial regulatory motif analyses45,46; more sophisticated network analyses of our tissue–trait, trait–trait and tissue–tissue relationships; and guiding the experimental prioritization, methodological development and validation experiments, which can continue to further our understanding of gene regulation and human disease circuitry.

Methods

Epigenomic datasets and processing

Primary data sources and metadata information

We analysed 3,030 datasets, including 2,329 epigenomic chromatin immunoprecipitation followed by sequencing (ChIP–seq) datasets, 635 DNase-seq datasets and 66 ATAC-seq datasets from ENCODE at https://www.encodeproject.org/, released as of 24 September 2018. These marks include tier 1 assays: DNase-seq, H3K4me1, H3K4me3, H3K27ac, H3K36me3, H3K9me3 and H3K27me3; tier 2 assays: ATAC-seq, H3K9ac, H3K4me2, H2AFZ, H3K79me2 and H4K20me1; tier 3 assays: POLR2A, p300, CTCF, SMC3 and RAD21; and tier 4 histone marks: 16 non-imputed histone acetylation marks, 4 methylation marks (H3K9me2, H3K79me1, H3K9me1 and H3K23me2), H3.3 and H3T11ph. We assigned unique sample IDs to each unique combination of: extended biosample summary, donor, sex, age and life stage, wherever each attribute was available. We removed samples with genetic perturbations and kept only samples with appropriately matched ChIP–seq controls. We provide a metadata matrix including the mapping between ENCODE accessions and our unique sample IDs (Supplementary Table 1; also at http://compbio.mit.edu/epimap). We mapped the 111 Roadmap biosamples and the 16 ENCODE 2012 biosamples to any of our biosamples with overlapping dataset accessions if the accessions were used in the flagship Roadmap epigenomics analysis. This mapping assigned 25 samples to ENCODE 2012 and 184 samples to Roadmap 2015, some of which were merged multi-donor samples in Roadmap, out of the final 833 samples that passed quality control. These were merged into 16 and 111 tissue types, respectively, in the Roadmap 2015 publication6.

Uniform data processing

We downloaded one alignment file per replicate, prioritizing filtered alignments aligned with BWA in hg19 whenever possible. We uniformly processed the ChIP–seq and DNase-seq datasets according to the processing pipelines established by the Roadmap Epigenomics Consortium6. In brief, we filtered out improperly paired and non-uniquely mapped reads, truncated reads to 36 bp, filtered out a blacklist of low complexity and artefact regions (ENCODE accession ENCSR636HFF), and filtered reads against a mappability track of uniquely mappable regions for 36-bp reads47. Truncating read lengths inevitably missed some repetitive regions that the longer reads could have helped resolve, but helped to avoid potential biases from alignment differences, as over two-thirds of the datasets had read lengths of 36 bp or lower (Supplementary Fig. 9). We converted .bam files to tagAlign, used liftOver48 to map GRCh38 alignments to hg19, and pooled all experiments within each ID and assay combination. We subsampled the pooled ChIP–seq datasets to a maximum of 30 million reads and the DNase-seq and ATAC-seq datasets to a maximum of 50 million reads. We used the SPP peak caller49 to estimate fragment length. In cases with extremely low fragment length in the ATAC-seq and DNase-seq datasets we used the average fragment length (73 bp) from the average of the rest of the tracks. We generated −log10 P value signal tracks against matched whole cell extracts for both the ChIP–seq and the accessibility datasets using the MACS250 and the SPP49 peak caller and cross-correlation analysis to identify the proper fragment length as in the Roadmap analysis.

Epigenomic imputation

Imputation

We carried out epigenomic imputation on 859 unique biosamples using ChromImpute20 for a total of 10,778 imputed datasets over 13 tier 1 and tier 2 assays using predictors trained on all 35 epigenomic assays across 859 samples. We also imputed 4,345 datasets for the five DNA-associated factors, using only the 35 epigenomic assays as features to train predictors with ChromImpute. We provide all imputed and processed observed tracks along with track sets for the 833 quality controlled samples at https://epigenome.wustl.edu/epimap21.

Quality control

For imputation quality control and validation, we compared observed tracks to imputed tracks when both were available (that is, when at least two original observed datasets were available for that biosample). We calculated all imputation quality control metrics from the original ChromImpute publication20, including genome-wide correlation, imputed and observed peak recovery (%), and the area under the receiver-operator characteristic curve (AUC) for all pairs of imputed and observed tracks. In addition to the quantitative metrics, we visually inspected the epigenomic predictions as part of our quality control. We showed (Extended Data Fig. 1b) three dense and varied regions of different resolutions (25 kb, 200 kb and 1.5 Mb) for each of two randomly chosen samples containing both observed and imputed tracks for each assay. We calculated the epigenomic profile quality metrics normalized strand cross-correlation coefficient (NSC), relative strand cross-correlation coefficient (RSC) and read depth for all datasets and compared these to the imputation quality control metrics (see the tables in Supplementary Table 1). We flagged low-quality tracks by detecting the elbow in the ranked correlation metrics, which we calculated as the point where the change in correlation exceeded 5% of the correlation. Validation on external datasets was carried out on 51 experimental tracks across eight marks and assays from ENCODE after our data freeze, similarly subsampled to 30 million (marks) and 50 million reads (accessibility), which we remapped from GRCh38 to hg19 and evaluated on fully remapped 200-bp bins (90.1%) in chromosome 1 (Extended Data Fig. 1c, d). For the data homogeneity analysis, we restricted the data to only biosamples in each mark with both observed and imputed data (Supplementary Fig. 10).

Sample and antibody swap detection

To systematically identify both potential sample or antibody swaps and poor-quality experiments, we computed the correlation of each observed experiment against all 10,734 imputed tracks for histone marks and assays (all imputed tracks before removing samples by quality control). We then calculated the average correlation among the top 10 most similar tracks to each observed track. We flagged potential antibody swaps by comparing the average correlation against samples of the putative mark against those computed for other marks. We fitted a multivariate linear model to each mark comparison, flagged datasets with residuals greater than 3 standard deviations of the average correlation and visually confirmed seven antibody swaps (six low-quality tracks). Similarly, we flagged potential sample swaps by comparing the correlation between imputed and observed tracks against the average correlation in the top 10 tracks in the same mark. We fitted a multivariate linear model and flagged datasets with residuals greater than 3 standard deviations of the residuals distribution. We report 19 potentially swapped samples, of which 5 were also flagged as low-quality tracks (Supplementary Fig. 8).

Secondary reactivities

In addition to genome-wide quality control of imputed tracks, we also focused on the specific differences between observed and imputed tracks. For each observed mark, we generated a genome-wide ‘delta’ track, computed as the difference in signal intensity between the observed and the imputed data, rescaling imputed tracks to match the signal intensity properties of the observed tracks, as the observed tracks showed a general bias for higher intensity. Some of these ‘delta’ tracks showed surprisingly high correlations with ‘primary’ tracks of non-putative marks, indicating potential secondary antibody reactivities. To flag these reactivities, we compared the average correlation of each of the delta tracks to the top 10 closest imputed tracks for each mark. As with antibody swaps, we fitted a multivariate linear model in each mark combination to flag outliers. We flagged 19 tracks and reported 13 after visual inspection as potential secondary reactivities or single replicate swaps (for example, in the case of DNase-seq) (Supplementary Figs. 7, 8). We noted that some cases showed clear difference tracks that do not match available antibodies, suggesting that the secondary reactivity is not a common mark in our compendium.

Biological space coverage

To evaluate the similarity of imputed and observed tracks across samples, we calculated the pairwise genomic correlations between all pairs of imputed and observed signal tracks. We hierarchically clustered the imputed or observed correlation matrix of each individual mark using Ward’s method. We averaged all imputed matrices for the six main marks (H3K27ac, H3K4me1, H3K4me3, H3K36me3, H3K27me3 and H3K9me3) to create a fused correlation matrix, which we similarly clustered. We plotted the hierarchically clustered tree for the fused matrix alongside the metadata information for each biosample using the circlize R package51.

In addition, we calculated mark-specific Spearman correlations that were restricted to relevant features within all observed and imputed tracks per mark. We mapped each of the 13 marks to its top state by emission probability in the ChromHMM 25-state model and any other states with emission probability over 80%. For ATAC-seq, we used the same region list as DNase-seq. For each mark, we averaged and reduced each 25-bp signal track to any 200-bp regions that were labelled as one of the states associated with the mark in any of the 127 imputed Roadmap biosamples under the 25-state model6,20. We calculated the Spearman correlation between sets of these region-restricted mark signal tracks and generated similarity matrices across all datasets for a mark. Using these Spearman correlation matrices on all observed and imputed signal tracks, we computed UMAP dimensionality reductions for each mark and assay using with the uwot R package52 with the default parameters, except for n_neighbours = 250, min_dist = 0.25 and repulsion_strength = 0.25.

Epigenomic annotations

Chromatin-state annotations

We computed epigenomic annotations on 3,533 imputed and 1,465 observed datasets for 6 marks on 833 samples using ChromHMM with the fixed 18-state model from Roadmap6 with the same mnemonics and colours. We used observed data wherever possible, except in cases with no observed data or where observed data were removed in quality control. The table of the signal tracks used to calculate the annotations is available as Supplementary Table 2. The observed data were binarized from signal tracks with a −log10 P value signal cut-off of 2. To binarize the imputed data and facilitate comparison with the observed data, we established mark-specific binarization cut-offs. We first separately calculated the overall probability distributions of all imputed and observed tracks for each mark. Then, for each mark, we set the imputed binarization cut-off value to the value of the quantile that matched the quantile in the observed data for the −log10 P value > 2 cut-off. We used liftOver48 to map all 833 (after quality control) ChromHMM annotations to GRCh38, using a stringent reciprocal mapping strategy, ensuring that all resulting GRCh38 regions were also 200 bp and non-overlapping, and we have provided these alongside hg19 annotations and as track sets at https://epigenome.wustl.edu/epimap/.

Defining active enhancers

We define active enhancers as the intersection of DHS regions with enhancer annotations and high H3K27ac signal (average signal of >2 in the region containing the DHS ± 100 bp). We defined DHS regions from an index list of 3,591,898 DHS element consensus locations in GRCh38, determined from 733 DNase-seq experiments, that we mapped using liftOver48 to 3,568,912 hg19 locations24. We intersected the hg19 regions with the 833 imputed enhancer annotations (states 7, 8, 9, 10, 11 and 15 in the 18-state model). This resulted in 2,842,995 regions with at least one enhancer annotation in any biosample. Finally, we intersected this matrix with the H3K27ac signal in the ±100-bp region that encompassed each DHS from the same tissue-specific imputed and observed datasets used to calculate the ChromHMM annotations. This procedure resulted in 2,356,914 active-enhancer regions. We created an equivalent promoter element region using the promoter annotations (states 1, 2, 3, 4 and 14 in the 18-state model). We noticed that several regions shared both enhancer and promoter annotations. As a conservative cut-off, we assigned all regions to either enhancers or promoters if over 75% of its active occurrences were labelled as that type of element (Supplementary Fig. 13). This final thresholding procedure yielded 2,069,090 enhancers, 204,104 promoters and 122,358 dyadic elements (neither specifically promoter nor enhancer). The matrices and enhancer locations are available at http://compbio.mit.edu/epimap.

For all images using tissue group order, including ChromHMM tracks and module heat maps, groups were ordered alphabetically within six major groups: tissue or organs (adipose, bone, digestive, endocrine, heart, kidney, liver, lung, mesenchymal, muscle, myosatellite, pancreas, placenta and EEM, reproductive, smooth muscle and urinary), other primary cells (endothelial, epithelial and stromal), blood and immune (blood and T cell, HSC and B cell, lymphoblastoid, spleen and thymus), nervous system (brain, eye, neurosphere and PNS), stem (embryonic stem cell-derived, embryonic stem cell and induced pluripotent stem cell), and other (cancer and other).

Defining enhancer modules

To define enhancer modules, we clustered the binary enhancer matrix defined by intersecting enhancer annotations with DHS regions and with the average centred and flanking (±100 bp) H3K27ac signal above a −log10 P value of 2 using the k-centroids algorithm with the Jaccard distance and the number of clusters set to k = 300. The average module contained 6,897 enhancers, and the largest module (enumerating constitutive elements) contained 93,554 enhancer regions. In all heat map plots of module centres (and associated enrichment figures), we diagonalized the matrix by ordering each column in the heat map (module centres) by the biosample that contributed the maximal signal. All columns that had a signal over 25% in more than 50% of rows were shown first. We used this diagonalization procedure for all diagonalized heat maps. We coloured each module by the tissue group that contained its maximal signal. Modules highlight sample groupings and organize according to cell type and tissue. Major groups were ordered alphabetically within six major groups and samples were ordered within groups according to Ward method’s clustering of the Jaccard distance of the module centres matrix. We performed enrichment on the module centres against the metadata of included samples (signal over 25%) by the hypergeometric test, and show enrichments with −log10P > 2 (Fig. 2b).

Gene ontology enrichment

We performed gene ontology enrichments on each enhancer module using GREAT v3.0.0 for the biological process, cellular component and molecular function ontologies53. We analysed and visualized the results in the same manner as in the Roadmap core paper6. We only considered enrichments of 2 or greater with a multiple testing-corrected P < 0.01. For Fig. 4c, we reduced the gene ontology enrichment by modules matrix to terms with a maximal −log10P > 4 that were enriched in less than 10% of modules. The full enrichment matrix is shown in Supplementary Fig. 16. As in the case of the diagonalized module centres, we labelled each term according to the module containing its maximal signal. We used a bag of words approach (as described in Roadmap6) to pick 36 representative terms out of 865 total terms for Extended Data Fig. 6b, such that each tissue group has at least one term and the rest are representatively allocated across groups.

Motif enrichment

We performed motif enrichment analysis across enhancer modules as described in the Roadmap paper6,54. In brief, we measured the enrichment of 1,690 motifs consisting of the JASPAR (2018)55 core non-redundant vertebrate motifs, the HOCOMOCO v1156 human motif set and the SELEX motifs by Jolma et al.57. We computed the enrichments for each of the 1,690 motifs relative to a joint DHS and intergenic background, additionally controlled by 100 shuffled motifs for each motif. We reported the motif with the highest enrichment in any module for each of the 286 previously identified motif archetypes26. We only reported motifs with a maximum log2-transformed fold change of at least 1, resulting in 160 motif archetypes (corresponding to 1,175 total motifs), which we show with their position weight matrix (PWM) logos against all 300 modules in Extended Data Fig. 6c.

Enhancer–gene linking

We predicted enhancer–gene links for each biosample using the Pearson correlation between gene expression and the histone mark activity of nearby enhancers (within 1 Mb) for six marks (H3K27ac, H3K4me1, H3K4me2, H3K4me3 and H3K9ac). We precomputed correlations between all genes and nearby enhancers across the 304 biosamples with paired expression data. A negative set of correlations for each enhancer was computed using random genes in a different chromosome. We predicted links for each biosample and ChromHMM enhancer state separately (states E7, E8, E9, E10, E11 and E15). Predictions were made by training an XGBoost classifier on the positive set of all valid links against their paired negative links, using precomputed correlations and distance to the transcription start site as features, and keeping all links with a probability above 5/7 (ref. 58).

We validated enhancer–gene links using curated gold-standard data59 in CD34, GM12878, HeLa and K562 cells (Extended Data Fig. 8). We compared four sets of correlation-based predictions (alone or with H3K27ac and H3K4me1 activity, and with and without distance-based rescaling) against distance alone, enhancer–gene links from Roadmap, and H3K27ac correlation and/or activity times distance (calculated using EpiMap tracks and enhancers in compared epigenomes)60. For methods without a threshold value, such as distance alone, only the nearest or highest score gene was used for each as a cut-off value for F1. In addition, we created a gene ontology-based gold-standard set of links from gene ontology terms that were enriched within enhancer clusters by GREAT53. For each gene ontology term per cluster, we added enhancer–gene links for enhancers within 1 Mb of at least two genes in the gene ontology term. Negative link sets were constructed by taking physical and expression quantitative trait locus (eQTL) negative link sets that were also not enriched by gene ontology.

GWAS enrichment analysis

We pruned the NHGRI-EBI GWAS catalogue34 (downloaded from https://www.ebi.ac.uk/gwas/docs/file-downloads on 3 May 2019) using a greedy approach: within each trait + PMID combination, we ranked associations by their significance (P value) and added SNPs iteratively if they were not within 5 kb of previously added SNPs. We also removed all associations in the HLA locus (for hg19: chr6: 29,691,116–33,054,976). This reduced the catalogue from 121,000 to 113,000 associations. Finally, we reduced the catalogue to 926 unique GWAS (from 5,454 GWAS) with an initial sample size of at least 20,000 cases or individuals (wherever cases and controls were not annotated). This resulted in 66,801 lead SNPs, which landed in 33,417 unique genome intervals when we split the genome into 10,000-bp intervals.

Flat GWAS–epigenome enrichments and module-based GWAS–epigenome enrichments

We performed the hypergeometric test to evaluate GWAS enrichments on flat epigenomes and on modules. For these flat enrichments, we compared each number of SNP–enhancer intersections for each enhancer set (flat epigenome or module) to the full set of intersections in all M enhancers. As above, we corrected for multiple testing for each GWAS and enhancer set combination by computing and correcting with null association P values for flat epigenomes and modules using the null catalogues generated for the tree enrichment. Rarefaction curves were calculated on the flat epigenome enrichments by iteratively adding the sample that was either significantly enriched or the maximal enrichment for the most remaining GWAS until all GWAS were accounted for (Extended Data Fig. 10c, d).

Tree-based GWAS–epigenome enrichments

We constructed a tree by hierarchically clustering the Jaccard similarity of the binary enhancer-by-epigenomes matrix using complete-linkage clustering. Then, for each node in the tree, we calculated its consensus epigenomic set, defined as the set of all enhancers present in all leaves of the subtree, such that each node’s set was a superset of that of its parent. For each GWAS, we asked whether the novel consensus enhancers at a node were significantly enriched for lead SNPs by comparing the enrichment between each node and its parent as measured by the likelihood ratio test between two logistic regressions.

In brief, for each GWAS catalogue unique trait and PubMed ID, we found all intersections of its pruned SNPs with M = 2,069,090 enhancers. Then Y is an indicator vector of size M, which shows the intersected enhancers. We found all consensus enhancers (the intersection of epigenomes in the subtree) in the node of interest (vector XN) and in its parent (XP). All vectors are 1×M. We calculated XD = XNXP (specific enhancers), which was also in {0,1}(1×M) as each node contained a superset of its parent’s enhancers. We then calculated the following two logistic regressions: M1: Y~XP + 1; M2: Y~XP + XD + 1. We calculated the log-likelihood difference and applied the likelihood ratio test to test whether adding the specific enhancers (M2) was significantly different from the parent model (M1). To correct for multiple testing on a per GWAS and node basis, we generated 1,000 null GWAS for each lead SNP set size by shuffling the trait associations across GWAS locations, giving 243,000 null GWAS in total. We used these catalogues to compute the null association P values for each permuted GWAS and used the 0.1% and 1% top quantiles as false discovery rate cut-offs.

For the CAD example, gene ontology terms61 were calculated using the nearest gene of each enhancer hit by a lead SNP. We pruned genes to expressed genes by calculating the average RNA-seq profiles for each tissue group and excluded genes that had log2 FPKM < 2 in the average RNA-seq of each sample’s group. Of 833 samples, 341 samples have matched RNA-seq, which we list in addition to releasing the processed data at http://compbio.mit.edu/epimap. We kept only the gene ontology terms that were significant in 25% or less of nodes, and report the top two gene ontology terms per node in Fig. 4d and all gene ontology terms in Supplementary Fig. 26.

For locus investigations (in NTN4, CACNA1C, EDNRA and PLPP3), we found the nearest active enhancer to each lead SNP in each node (within 2.5 kb), plotted the H3K27ac signal in the 2.1 million enhancers only, and (1) directly mapped links that originated at one of the enhancers near a lead SNP in the top three enriched epigenomes or (2) any links in the locus present in at least half of the samples in one of the selected tissue groups.

Tissue similarity

We assigned each internal node in the tree to a unique tissue if over 50% of the leaves of the subtree came from that tissue and as ‘multiple’ if the subtree was not the majority of one tissue. We assigned tissue labels to 641 of 832 (77%) internal nodes where the majority of leaves corresponded to a single group. Using these assignments, we created a tissue by GWAS matrix by adding the −log10 P values for each tissue node set from all of the GWAS enrichments on the tree. We binarized this matrix and computed the Jaccard similarity across tissues to calculate a tissue similarity matrix. To assess the significance of tissue overlap, we compared each overlap value against the overlaps from 10,000 permuted enrichments. We collapsed each permuted matrix into a tissue by the GWAS matrix to compute the overlaps under the null. We performed the permutations for each tissue against other tissues by shuffling the enrichment P values on the node by the GWAS matrix. Specifically, we (1) binarized the enrichment matrix, (2) fixed the column of the group of interest, (3) permuted the remainder of the matrix, keeping its row and column marginals the same, and then (4) calculated the cosine distance between the permuted and the original matrix of enrichments.

Cross-GWAS network

To evaluate the cross-GWAS similarity, we normalized the tissue by the GWAS matrix for each GWAS to obtain the proportion of significance attributed to each tissue for each GWAS (Supplementary Fig. 21). We reduced the matrix to 538 significant GWAS with at least 20,000 cases (or individuals when no cases were specified) passing a false discovery rate correction at 0.1% at the per-node and per-GWAS size level. We created a GWAS–GWAS network using the cosine distance matrix as an adjacency matrix, keeping 5,547 links with a cosine distance of 0.25 or less. We used the Fruchterman–Reingold algorithm to lay out the graph62. We used the tissue by the GWAS matrix to colour links according to the maximum tissue in the product between each pair of nodes and to colour nodes according to the maximal tissue for each node (Supplementary Fig. 22).

To compare the epigenetic network to trait genetic similarity, we binned SNPs in the GWAS catalogue into 10-kb windows starting from the beginning of each chromosome. We counted the number of intersecting bins between two traits and kept any trait pairs with Jaccard similarity of at least 1%. To compare this to the epigenetic network, we plotted only links in the epigenetic network that coincided with any SNP-sharing GWAS pairs. In addition, we plotted the heat maps of the tree enrichments distance matrix and the genetic similarity matrix side by side, first organized by hierarchically clustering the enrichments matrix and then by clustering the genetic similarity matrix (Supplementary Figs. 2325).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.