Reference epigenome mapping across tissues and cell types

Integrative analysis of 111 reference human epigenomes.

Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248

We jointly processed and analyzed our 111 reference epigenomes with 16 additional epigenomes from ENCODE 9,23. We generated genome-wide normalized coverage tracks, peaks and broad enriched domains for ChIP-seq and DNase-seq 7,32, normalized gene expression values for RNA-seq 33, and fractional methylation levels for each CpG site 31,34,35. We computed several quality control measures (Fig. 2, Table S1) including: number of distinct uniquely mapped reads; the fraction of mapped reads overlapping areas of enrichment 18,36; genome-wide strand cross-correlation 37 (Fig. 2e-g); inter-replicate correlation; multidimensional scaling of datasets from different production centers (Fig. S1); correlation across pairs of datasets (Extended Data 1e); consistency between assays carried out in multiple mapping centers (Table S2); and read mapping quality for bisulfite-treated reads 38,39. Outlier datasets were flagged, removed or replaced, and lower-coverage datasets were combined when possible (see Methods).
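One of the quality control measures listed above, the fraction of mapped reads overlapping areas of enrichment, can be illustrated with a minimal sketch. The function name, the input layout (one position per read, half-open peak intervals) and the assumption that peaks on a chromosome do not overlap are illustrative choices, not the consortium's pipeline.

```python
from bisect import bisect_right


def fraction_of_reads_in_peaks(read_positions, peaks):
    """Return the fraction of reads that fall inside any peak.

    read_positions: list of (chrom, pos) tuples, one per mapped read.
    peaks: list of (chrom, start, end) tuples, half-open [start, end),
           assumed not to overlap one another on the same chromosome.
    """
    if not read_positions:
        return 0.0

    # Group peaks by chromosome and sort them by start coordinate.
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in iv] for c, iv in by_chrom.items()}

    in_peaks = 0
    for chrom, pos in read_positions:
        if chrom not in by_chrom:
            continue
        # Find the last peak whose start is <= pos, then test its end.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < by_chrom[chrom][i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions)
```

A higher fraction indicates that more of the sequenced reads fall in enriched regions rather than background.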

Fine-mapped genetic architecture of disease

Genetic and epigenetic fine mapping of causal autoimmune disease variants.

Farh, K. K.-H. et al. Nature 10.1038/nature13835

Prior studies that have integrated GWAS with epigenomic features focused on lead SNPs or multiple associated SNPs within a locus, of which only a small minority reflects causal variants10,16–19,21. Although these studies demonstrated enrichments within enhancer-like regulatory elements, they could not with any degree of certainty pinpoint the specific elements or processes affected by the causal variants. To overcome this limitation, we leveraged dense genotyping data to refine a statistical model for predicting causal SNPs from genetic data alone. Rare recombination events within haplotypes can provide information on the identity of the causal SNP, provided sufficient genotyping density and sample size. We therefore examined a cohort of 14,277 cases with multiple sclerosis and 23,605 healthy controls genotyped using the Immunochip, which comprehensively covers 1000 Genomes Project SNPs22 within 186 loci associated with autoimmunity20. We developed an algorithm, Probabilistic Identification of Causal SNPs (PICS), that estimates the probability that an individual SNP is a causal variant given the haplotype structure and observed pattern of association at the locus (Methods, Extended Data Figs 1–4).

[…]

We next generalized PICS to analyze 21 autoimmune diseases, using Immunochip data when they were available or imputation to the 1000 Genomes Project22 when they were not (Methods; Supplementary Table 1). We mapped 636 autoimmune GWAS signals to 4,950 candidate causal SNPs (mean probability of representing the causal variant responsible for the GWAS signal: ~10%). PICS indicates that index SNPs reported in the GWAS catalogue have on average only a 5% chance of representing a causal SNP. Rather, GWAS catalogue index SNPs are typically some distance from the PICS lead SNP (median 14 kb), and many are not in tight LD (Fig. 1d, Extended Data Fig. 5). PICS identified a single most likely causal SNP (>75% probability) at 12% of loci linked to autoimmunity. However, most GWAS signals could not be fully resolved due to LD and thus contain several candidate causal SNPs (Fig. 1e).
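To make the notion of "several candidate causal SNPs" per signal concrete, the following sketch counts how many SNPs are needed to account for 75% of the total probability at a locus. It does not implement PICS itself: it assumes per-SNP probabilities have already been computed, and the function name and SNP identifiers are hypothetical.

```python
def snps_to_reach_probability(pics_probabilities, threshold=0.75):
    """Return the smallest top-ranked set of SNPs whose summed
    probability reaches the threshold.

    pics_probabilities: dict mapping SNP id -> probability of being
    the causal variant (assumed to be precomputed for one locus).
    """
    ranked = sorted(pics_probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    chosen, total = [], 0.0
    for snp, probability in ranked:
        chosen.append(snp)
        total += probability
        if total >= threshold:
            break
    return chosen


# Hypothetical locus: one SNP alone does not reach 75%, so two are needed.
locus = {"rsA": 0.5, "rsB": 0.25, "rsC": 0.125, "rsD": 0.125}
candidates = snps_to_reach_probability(locus)
```

A locus resolved to a single most likely causal SNP corresponds to a returned list of length one.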

To confirm the functional significance of fine-mapped SNPs, we compared PICS SNPs against a strict background of random SNPs drawn from the same loci. Candidate causal SNPs derived by PICS were strongly enriched for protein-coding (missense, nonsense, frameshift) changes, which account for 14% of the predicted causal variants compared to just 4% of the random SNPs. Modest enrichments over the locus background were also observed for synonymous substitutions (5%), 3′ UTRs (3%), and splice junctions (0.2%) (Fig. 1f). Although these results support the efficacy of PICS for identifying causal variants, ~90% of GWAS hits for autoimmune diseases remain unexplained by protein-coding variants. Candidate causal SNPs and the PICS algorithm are available through an accompanying online portal (http://www.broadinstitute.org/pubs/finemapping).

Epigenomic annotation of genetic variants using the Roadmap Epigenome Browser.

Zhou, X. et al. Nature Biotechnology 10.1038/nbt.3158

Advances in next-generation sequencing platforms have reshaped the landscape of functional genomic and epigenomic research as well as human genetics studies. Annotation of noncoding regions in the genome with genomic and epigenomic data has facilitated the generation of new, testable hypotheses regarding the functional consequences of genetic variants associated with human complex traits1,2. Large consortia, such as the US National Institutes of Health (NIH) Roadmap Epigenomics Consortium3 and ENCODE4, have generated tens of thousands of sequencing-based genome-wide data sets, creating a useful resource for the scientific community5. The WashU Epigenome Browser6-8 continues to provide a platform for investigators to effectively engage with this resource in the context of analyzing their own data. Here, we describe the Roadmap Epigenome Browser (http://epigenomegateway.wustl.edu/browser/roadmap), which is based on the WashU Epigenome Browser and integrates data from both the NIH Roadmap Epigenomics Consortium and ENCODE in a visualization and bioinformatics tool that enables researchers to explore the tissue-specific regulatory roles of genetic variants in the context of diseases. The Browser takes advantage of over 10,000 epigenomic data sets it currently hosts, including 346 ‘complete epigenomes’, defined as tissues and cell types for which we have collected a complete set of DNA methylation, histone modification, open chromatin and other genomic datasets9. Data from both the NIH Roadmap Epigenomics and ENCODE resources are seamlessly integrated in the browser using a new Data Hub Cluster framework. Investigators can specify any number of SNP-associated regions and any type of epigenomic data, for which the browser automatically creates “virtual data hubs” through a shared hierarchical metadata annotation, retrieves the data, and performs real-time clustering analysis. 
Investigators interact with the Browser to determine the tissue specificity of the epigenetic state encompassing genetic variants in physiologically or pathogenically relevant cell types from normal or diseased samples.

We illustrate the epigenomic annotation of two noncoding SNPs, identified from genome-wide association studies of people with multiple sclerosis10, by clustering the histone H3K4me1 profile of SNP-harboring regions and RNA-seq signal of their closest genes across multiple primary tissues and cells (Fig. 1). Both SNPs lie within putative enhancer regions. Whereas rs307896 marks an enhancer common across cell types, rs756699 is located in an enhancer specific to immune cells and is potentially targeting TCF7, a T cell specific gene 3.8kb downstream (Fig. 1). Thus, reference epigenomes provide important clues into the functional relevance of these genetic variants in the context of the pathophysiology of multiple sclerosis, including inflammation11.

Investigators can also use the browser to identify co-variation of epigenomic, transcriptomic, and transcription factor binding profiles across cell types to predict relationships between regulatory sites and target genes. Additionally, investigators can explore multiple complete reference epigenomes in different browser panels in parallel using synchronized genomic coordinates or independent genomic coordinates. A variety of Epigenome Browser functions, including gene set view, genome juxtaposition, chromatin interaction display and statistical testing, can be applied to better engage with this epigenomic resource.

We also provide the means for investigators to build their own Data Hub Clusters of different scales and clone the browser on Amazon Cloud to visualize and analyze private data in the context of public data. These tools, along with the rapidly growing epigenomic datasets of human cells of different states, will facilitate the translation of genetic signals into molecular mechanisms, leading to prognostic, diagnostic and therapeutic advances.

Mapping orthologous regulatory regions between mouse and human

Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease.

Gjoneska, E. et al. Nature 10.1038/nature14252

We mapped orthologous genes between mouse and human using ENSEMBL one-to-one orthologs (see Methods). We also mapped orthologous non-coding regions using multiple mammalian sequence alignments, mapping each mouse peak to its best human match (see Methods). We found matches for 90% of promoter regions, 84% of enhancers, 74% of Polycomb-repressed regions, and 33% of heterochromatin regions (Supplementary Table S3). Comparing our mouse chromatin states to human hippocampus chromatin states 10, we found significant epigenomic conservation at orthologous non-coding regions (Extended Data Fig. 3b), consistent with recent results 11.

Identification of orthologous human regions

Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease.

Gjoneska, E. et al. Nature 10.1038/nature14252

The promoter (H3K4me3 peaks annotated as transcription start site by chromatin state), enhancer (H3K27ac peaks annotated as enhancer by chromatin state), and Polycomb-repressed regions (H3K27me3 peaks annotated as Polycomb-repressed by chromatin state) were mapped to the human genome. BED files representing the coordinates of these peaks in mm9 were mapped to mm10 using liftOver 44. Those peaks were then mapped to the human genome using the UCSC multiple alignment chain files (http://hgdownload.soe.ucsc.edu/goldenPath/mm10/multiz60way/)45. More specifically, the alignments that overlap the mouse peak and include hg19 were extracted. We calculated the human-mouse pairwise alignment for each multiple alignment using the “globalms” function of Biopython (http://biopython.org/, version 1.59; python version 2.71). The highest-scoring pairwise alignment formed the basis of the orthologous region in human. This region was extended on either side using lower-scoring multiple alignments. The orthologous region in hg19 was required to be greater than 30 bp and no more than twice the length of the region in mouse. The mean conservation was examined using the PHASTCons score across placental mammals46 based on the same 60-way multiple sequence alignment. The mapped enhancer regions were annotated with their chromatin state in human hippocampus, and across all 127 cell types and tissues, using BEDTools47.
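The selection and length criteria described above can be sketched as follows. This is a simplified illustration under stated assumptions: alignment scores are taken as already computed, the extension step using lower-scoring alignments is omitted, and the function name and tuple layout are hypothetical rather than taken from the authors' code.

```python
def select_orthologous_region(mouse_length, candidates,
                              min_length=30, max_ratio=2.0):
    """Pick the highest-scoring candidate and apply the length criteria.

    mouse_length: length of the mouse peak in bp.
    candidates: list of (score, human_start, human_end) tuples, one per
                pairwise alignment, with half-open hg19 coordinates.
    Returns (human_start, human_end), or None if no candidate passes.
    """
    if not candidates:
        return None

    # The highest-scoring pairwise alignment seeds the orthologous region.
    _, start, end = max(candidates, key=lambda c: c[0])
    human_length = end - start

    # Require > 30 bp and no more than twice the mouse region length.
    if human_length > min_length and human_length <= max_ratio * mouse_length:
        return (start, end)
    return None
```

For example, a 200 bp mouse peak whose best-scoring human match spans 250 bp passes both criteria, whereas a 100 bp mouse peak matched to a 250 bp human region is rejected.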

Epigram — a pipeline to predict histone modification and DNA methylation patterns from DNA motifs

Predicting the human epigenome from DNA motifs.

Whitaker, J. W., Chen, Z. & Wang, W. Nature Methods 10.1038/nmeth.3065

Herein, we used our analysis pipeline, Epigram, to capture the cis elements that interact with the dynamic regulatory program to shape the epigenome (Fig. 1b). By surveying various cell types, we revealed mark-specific motifs, which may be universally recognized by chromatin-modifying enzymes, and motifs with cell type–specific interplay, which may be recognized by cell type–specific cofactors. We applied this approach to predicting the placement of six histone modifications and DNA methylation valleys (DMVs) in five cell types17: human embryonic stem cells (H1), neural progenitor cells (NPC), trophoblast-like cells (TBL), mesendoderm cells (ME) and mesenchymal stem cells (MSC) (Fig. 1c). To tease out the cis elements that are recognized by epigenomic regulatory factors, we removed simple sequence biases such as G+C content during analysis. We observed that motifs have location preferences within modified regions, such as the center of H3K27ac or the edge of H3K4me3 or H3K9me3. Furthermore, we demonstrated the importance of Epigram motifs in the regulation of histone modification through the significant correlation between their disruption and inter-individual H3K27ac variation. Our study provides a catalog of cis elements that play important roles in shaping the epigenomic modifications, which is useful for designing new epigenome-editing tools. We first examined whether DNA motifs could distinguish genomic regions that possess modified histones from regions that do not possess any modified histones. For the sake of discussion, we refer to this as the ‘single-mark analysis’. We started by correcting a potential bias in the chromatin immunoprecipitation–sequencing (ChIP-seq) data that can be caused by the preferential sequencing of (G+C)-rich genomic fragments 18,19 (Fig. 1d).

Genboree Workbench brings together epigenomic data and tools

Integrative analysis of 111 reference human epigenomes.

Roadmap Epigenomics Consortium et al.Nature 10.1038/nature14248

On-line tools integrated within the Genboree Workbench enable the types of analyses reported in [EC14 Amin V et al. Nature Comm. 2015] and in Figure 4i, [Roadmap Epigenomics Consortium et al. Nature 2015]. The tools enable users to carry out similar types of analyses by using either consortium-generated or their own epigenomic profiling data. To learn how to use the tools, check the on-line tutorial at http://genboree.org/theCommons/projects/aminv-natcomm-2015/wiki.

Recommendations for methylation analysis by whole-genome bisulfite sequencing

Coverage recommendations for methylation analysis by whole-genome bisulfite sequencing.

Ziller, M. J., Hansen, K. D., Meissner, A. & Aryee, M. J. Nature Methods 10.1038/nmeth.3152

Whole-genome bisulfite sequencing (WGBS) allows unbiased genome-wide DNA methylation profiling, but the associated high sequencing costs continue to limit its widespread application. To experimentally determine the minimal sequencing requirements, we utilized several high-coverage reference data sets for our analysis. Here, we present the data-derived recommendations for minimum sequencing depth for WGBS libraries, highlight what is gained with increasing coverage and discuss the trade-off between sequencing depth and number of assayed replicates.

Method for MethylC-seq library preparation

MethylC-seq library preparation for base-resolution whole-genome bisulfite sequencing.

Urich, M. A. et al.Nature Protocols 10.1038/nprot.2014.114

Overview of MethylC-seq

To survey the methylation states of cytosines at single-base resolution on a genome-wide scale, we developed a whole-genome bisulfite sequencing approach called MethylC-seq (Fig. 1). This method uses high-throughput DNA sequencing of genomic DNA subjected to sodium bisulfite conversion15–18. After deep sequencing of a library generated from fragments of sodium bisulfite–treated DNA, the basecall at each cytosine reference position indicates the original methylation status of the cytosine in each genomic DNA (gDNA) fragment, where a thymine indicates that it was unmethylated and a cytosine indicates that it was methylated. The frequency of DNA methylation at any cytosine with sufficient sequence coverage can be estimated for the population of genomes that comprised the genomic DNA sample.
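The estimate described above can be written as a short sketch: at a reference cytosine, reads showing C count as methylated and reads showing T as unmethylated. The function name and the `min_coverage` parameter are illustrative assumptions, not part of the published protocol.

```python
def methylation_level(c_count, t_count, min_coverage=1):
    """Estimate the methylation frequency at one reference cytosine.

    c_count: reads with a C basecall (cytosine was methylated).
    t_count: reads with a T basecall (cytosine was unmethylated).
    min_coverage: hypothetical threshold below which no estimate is made.
    Returns a value between 0 and 1, or None if coverage is insufficient.
    """
    coverage = c_count + t_count
    if coverage < min_coverage:
        return None
    return c_count / coverage
```

With 3 C basecalls and 1 T basecall at a site, the estimated methylation frequency is 0.75.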

Typical methylomes generated by MethylC-seq for mouse, human and Arabidopsis genomes achieve coverage of >90–95% of the cytosines in the genome6,19. This protocol is largely framed around standard protocols designed to construct DNA sequencing libraries, but substantial modifications have been made such as eliminating all electrophoresis and gel extraction steps, adding the sodium bisulfite conversion reaction and making modifications to the number of PCR cycles. Briefly, purified genomic DNA (50 ng–2 µg) is fragmented, end repaired, 3’-adenylated and ligated to sequencing adapters in which all cytosines are methylated. Adapter-ligated DNA is then subjected to bisulfite conversion, after which limited amplification of the library is performed by PCR using primers specific for the sequencing adapters. The resulting library is then ready for sequencing after library quantification.

ChromImpute for large-scale epigenome imputation

Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

Ernst, J. & Kellis, M. Nature Biotechnology 10.1038/nbt.3157

Here, we take an ensemble regression-based approach to epigenomic imputation. We impute each target mark in each target sample separately, by combining information from large numbers of datasets that were experimentally determined, but without using any data for the target mark in the target cell type (Fig. 1a, S1). We leverage two classes of features (see Methods, Fig. 1d):

  • Same-sample (different-mark) information (Fig. 1b): The first class of features uses information from the signal of other marks mapped in the target sample, both at the target position and at neighboring sites.

  • Same-mark (different-sample) information (Fig. 1c): The second class of features uses information from the signal of the specific mark of interest at the target position in the most similar samples. Similar samples are defined based on similarity with the signal of marks that have been mapped in the target sample both locally and globally (see Methods). The features in this class are effectively predictions that could be made by a K-nearest neighbor method for various values of K and distance functions.

As no training data is available for the target mark in the target sample, we learn the relationships between the features and the target mark using other samples that contain the target mark. We use regression trees27, as they can handle nonlinearities (including the constraint that signal values are non-negative), they support combinatorial interactions among features, and they are relatively fast to train. The prediction for each target mark in each target sample is based on an ensemble predictor that averages the values resulting from regression trees trained on each sample in which the target mark is available, thus reducing the impact of biases from any one individual predictor.
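Two of the ideas above, a K-nearest-neighbor style same-mark feature and the averaging of per-sample predictors, can be sketched in a few lines. This is a toy illustration, not ChromImpute: Euclidean distance is an assumed similarity measure, the predictors are arbitrary callables standing in for trained regression trees, and all names are hypothetical.

```python
def knn_feature(target_profile, other_profiles, mark_signal, k):
    """Average the target mark's signal over the k most similar samples.

    target_profile: list of signal values for marks mapped in the target sample.
    other_profiles: dict sample -> list of signal values for the same marks.
    mark_signal: dict sample -> target-mark signal at the target position.
    """
    def distance(a, b):
        # Euclidean distance (an assumption; other distances could be used).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    ranked = sorted(other_profiles,
                    key=lambda s: distance(target_profile, other_profiles[s]))
    nearest = ranked[:k]
    return sum(mark_signal[s] for s in nearest) / len(nearest)


def ensemble_prediction(predictors, features):
    """Average the outputs of predictors trained on different samples."""
    predictions = [predict(features) for predict in predictors]
    return sum(predictions) / len(predictions)
```

Computing `knn_feature` for several values of `k` yields a family of features of the kind described in the second bullet, and `ensemble_prediction` mirrors the final averaging step across samples in which the target mark is available.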

Figure 1: Datasets available for each reference epigenome.

List of 127 epigenomes including 111 by the Roadmap Epigenomics program (E001-E113) and 16 by ENCODE (E114-E129). Full list of names and quality scores in Table S1. a-d: Tissue and cell types grouped by type of biological material (a), anatomical location (b), showing reference epigenome identifier (EID, c), and abbreviated name (d). PB=Peripheral Blood. ENCODE 2012 reference epigenomes shown separately. e-g. Normalized strand cross-correlation quality scores (NSC)37 for the core set of five histone marks (e), additional acetylation marks (f) and DNase-seq (g). h. Methylation data by WGBS (red), RRBS (blue), and mCRF (green). 104 methylation datasets available in 95 distinct reference epigenomes. i. Gene expression data using RNA-seq (Brown) and microarray expression (Yellow). j. 26 epigenomes contain a total of 184 additional histone modification marks. k. 60 highest-quality epigenomes (purple) were used for training the core chromatin state model, which was then applied to the full set of epigenomes (purple and orange).

Figure 2: Genetic fine-mapping of human disease.

a, GWAS catalog loci were clustered to reveal shared genetic features of common human diseases and phenotypes. Color scale indicates correlation between phenotypes (high=red, low=blue). b, Association signal to MS for SNPs at the IFI30 locus. c, Scatter plot of SNPs at the IFI30 locus demonstrates the linear relationship between LD distance (r²) to rs1154159 (red) and association signal. d, Candidate causal SNPs were predicted for 21 autoimmune diseases using PICS. Histogram indicates genomic distance (bp) between PICS Immunochip lead SNPs and GWAS catalog index SNPs. e, Histogram indicates number of candidate causal SNPs per GWAS signal needed to account for 75% of the total PICS probability for that locus. f, Plot shows correspondence of PICS SNPs to indicated functional elements, compared to random SNPs from the same loci (error bars indicate standard deviation from 1000 iterations using locus-matched control SNPs).

Figure 3: Chromatin state conservation.

b, Promoter, enhancer, and repressed chromatin states in mouse hippocampus (rows), as profiled in this study, align to matching chromatin states in human (columns), as profiled by the Roadmap Epigenomics Consortium10. Shading indicates enrichment relative to human chromatin state abundance (columns). The number of regions overlapping is shown in each cell of the heatmap.

Figure 4: Identifying motifs that are predictive of epigenomic modifications

(a) Site-specific DNA-binding factors regulate the epigenome. The blue section shows three regulatory levels of the cell type–specific state: (i) gene regulatory network, (ii) site-specific DNA-binding factors and (iii) epigenomic regulation of gene expression. The green square represents non-cell-type–specific DNA sequence regulatory influences over the epigenome. The purple box shows stimuli that influence the cell type–specific state. (b) Overview of the cis-element cataloging process. (c) Schematic showing H1 human embryonic stem cells and the other four cell types that were derived through in vitro differentiation. The table lists the analyzed epigenomic modifications. (d) Flow chart of the key stages in our analysis pipeline. (e) Effect of sequence-set balancing (SSB) on sequence sets. The bar plot shows the number of regions in a set before and after SSB. Violin plots show the distribution of region G+C content and length before and after SSB.

Figure 5: Coverage requirements for WGBS experiments

(a) Heat map showing the pairwise Pearson correlation coefficients (PCC) for genome-wide methylation profiles of the samples used in this study (n = 14; rep, replicate). Average methylation levels were estimated in 1-kbp tiling windows. (b) Distribution of DMR sizes and average methylation difference for DMRs found at 30× comparing hESCs to human cortex, CD184 to liver and CD4 to CD8 using 2 replicates each. Black dots indicate medians, and ellipsoids span from the 25th to the 75th percentile in each dimension. (c) True positive rate (TPR) as a function of coverage for the indicated samples using 2 replicates for each group. TPR is defined as the fraction of high-coverage (30×) reference DMRs recovered at the coverage level indicated. (d) Distribution of DMR sizes and average methylation difference for DMRs discovered at 1× and additional DMRs discovered when increasing the coverage from 1× to 5×, 5× to 10× and 10× to 30× in the hESC–human cortex comparison using 2 replicates each. Dots and ellipsoids as in b. (e) False discovery rate (FDR) as function of coverage for DMRs exhibiting a methylation difference of 20% or greater when comparing hESCs to human cortex, CD184 to liver, or CD4 to CD8 using 2 replicates for each group.

Figure 6: Chromatin states and DNA methylation dynamics.

i, Chromatin mark changes during cardiac muscle differentiation. Heat map = average normalized mark signal in Enh. C2 cluster enrichment55, with all clusters shown in http://compbio.mit.edu/roadmap.