To empower experimentalists with a means for fast and comprehensive chromatin immunoprecipitation sequencing (ChIP-seq) data analyses, we introduce an integrated computational environment, EaSeq. The software combines the exploratory power of genome browsers with an extensive set of interactive and user-friendly tools for genome-wide abstraction and visualization. It enables experimentalists to easily extract information and generate hypotheses from their own data and public genome-wide datasets. For demonstration purposes, we performed meta-analyses of public Polycomb ChIP-seq data and established a new screening approach to analyze more than 900 datasets from mouse embryonic stem cells for factors potentially associated with Polycomb recruitment. EaSeq, which is freely available and works on a standard personal computer, can substantially increase the throughput of many analysis workflows, facilitate transparency and reproducibility by automatically documenting and organizing analyses, and enable a broader group of scientists to gain insights from ChIP-seq data.
We would like to thank J. Christensen, P. Cloos, N. Dietrich, N. Jungersen, J. Wegeberg and K. Helin for inspiring discussions, suggestions and proofreading; K. Isaksen for help with easeq.net; and T. Rasborg for usability consultancy. This work was supported by a grant from the Danish National Research Foundation (DNRF82) (K.H.).
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Regionsets in EaSeq can be ordered and generated from other types of data, and can contain parameters such as quantified levels of ChIP-seq signal.
(a) Sorting of regions by a graphical interface (middle). Left: H3K4me3 heatmap at 19,903 TSS that was sorted for gene expression in CD4+ T-cells with the highest expression at the base. Right: data after sorting for quantified H3K4me3 levels (‘Sort’ tool). (b) Overview of how new regionsets are generated from datasets using the ‘Peaks’ tool (top), from genesets using the ‘Extract’ tool (middle), or from other regionsets using the ‘Controls’ tool (bottom). (c) New parameters are generated and added to an existing regionset using the ‘Quantify’ tool for quantitation of ChIP-seq signal at the regions. Left: examples of coordinates from the 19,903 genes in the regionset. Middle: example tracks generated by the ‘Quantify’ tool, showing levels of H3K4me3, H3K4me2, and RNA-PolII signal surrounding the start of the regions of interest. Signal within the regions was automatically colored orange, and signal outside the regions black. The blue square overlaid by the tool illustrates the quantified area on the X-axis and the quantified value on the Y-axis. Right: results from the quantitation of the four example regions. Each column corresponds to a new parameter added to the existing regionset.
Supplementary Figure 2 Side-by-side comparison of tools in EaSeq and frequently used command-line counterparts.
(a, b) Tracks from EaSeq (top) and the UCSC genome browser (bottom) showing high resemblance in the positions of individual reads from an H3K27me3 ChIP (SRR946804) mapped to mm9. Tracks show the Ecel1 locus (a) and a 1 kbp window of the Sgk3 locus (b). Each plot is unmodified output from EaSeq or the UCSC genome browser. (c, d) 2D-histograms showing the distribution of read counts within +/-5 kbp of mm9 CGIs, counted using EaSeq’s quantitation tool and Bedtools intersect -c (c) or HTSeq (d), together with Pearson’s correlation coefficient, r. (e) Pseudocolored 2D-histogram showing the distribution of read counts as in (d) in relation to the distance between the CGIs. Coloring reflects the average log10 distance to the nearest CGI (measured from center to center); red, green, and blue reflect average distances of approximately 10 kbp, 3 kbp, and 1 kbp, respectively. (f) 2D-histogram showing the distribution of read counts within +/-5 kbp of mm9 CGIs that are located at least 10 kbp plus one read length (150 bp) from the nearest neighboring CGI. Reads were counted as in (d), and Pearson’s correlation coefficient, r, is shown. (g–k) 2D-histograms showing the quantified values of read counts within +/-5 kbp of mm9 CGIs, counted from two H3K27me3 datasets from mESCs (SRR946804 and SRR946806) using EaSeq’s quantitation tool and left unnormalized (g) or normalized using quantile normalization in R (h), quantile normalization in EaSeq (i), or LOESS normalization in EaSeq (k). (j) Relationship between the quantile normalization of SRR946804 done using R (X-axis) and EaSeq (Y-axis), together with Pearson’s correlation coefficient.
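For readers unfamiliar with the two operations compared in panels g–j, the following is a minimal sketch of quantile normalization of two count vectors and of Pearson's r. It is not the EaSeq or R implementation: the function names, the handling of ties (broken by input order), and the example counts are assumptions made for illustration.

```python
from math import sqrt


def quantile_normalize(a, b):
    """Give two equally long lists of counts the same value distribution."""
    if len(a) != len(b):
        raise ValueError("both datasets must cover the same regions")
    # Reference distribution: mean of the two datasets at each rank.
    reference = [(x + y) / 2 for x, y in zip(sorted(a), sorted(b))]

    def apply(values):
        # Indices of the regions from lowest to highest count (ties: input order).
        order = sorted(range(len(values)), key=values.__getitem__)
        out = [0.0] * len(values)
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        return out

    return apply(a), apply(b)


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sd_x = sqrt(sum((p - mean_x) ** 2 for p in x))
    sd_y = sqrt(sum((q - mean_y) ** 2 for q in y))
    return cov / (sd_x * sd_y)


# Made-up read counts at four regions in two datasets.
counts_a = [5, 2, 3, 4]
counts_b = [4, 1, 4, 2]
norm_a, norm_b = quantile_normalize(counts_a, counts_b)
r = pearson_r(norm_a, norm_b)
```

After normalization both vectors contain the same set of values, so any remaining difference between them reflects only the ranking of the regions.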
Supplementary Figure 3 Side-by-side comparison of tools in EaSeq and frequently used command-line counterparts.
(a) Histograms showing the size distributions of 100 bp regions generated from larger CGIs using EaSeq’s homogenization tool (left) and Bedtools makewindows (right). (b) Illustration of how a single CGI (green) at the Zdhhc8 locus is subdivided into smaller 100 bp regions using EaSeq’s homogenization tool (red) and Bedtools makewindows (blue). Coordinates refer to mm9. (c) 2D-histograms showing the distribution of distances between CGIs and the nearest TSS identified using EaSeq’s colocalization tool (Y-axis) and Bedtools closest (X-axis), as well as the number of CGIs (n) visualized in each plot. Left plot shows all CGIs, middle plot only those that do not overlap with a TSS, and right plot those that overlap. The orange circle and arrow illustrate the location in the histogram of the three CGIs shown in (d). (d) Illustrations of individual CGIs (green) at three different loci where EaSeq’s colocalization tool (red) and Bedtools closest (blue) reach divergent results when assigning a TSS for each CGI. Coordinates refer to mm9. (e) Heatmaps of CGIs sorted for overlap with gene bodies and sizes and colored according to overlap with particular sets of regions identified using EaSeq tools or their Bedtools intersect counterpart. Upper panel: The parts of the CGIs overlapping with the regions computed by EaSeq or Bedtools were colored according to the color illustrated for each heatmap, whereas non-overlapping CGIs or parts thereof were colored black. Lower panel: The subsets of regions in each heatmap were derived from the CGI population using the EaSeq or Bedtools algorithm described at the origin of each arrow, and colored according to the colors used for that subset of regions in the upper panel.
The parts of those subsets that were derived using EaSeq and overlapped with regions derived using the Bedtools counterpart, or a Bedtools algorithm that generates a mutually exclusive set of regions, were highlighted in the color of the Bedtools set of regions – and vice versa – illustrating the extent of overlap between EaSeq and Bedtools counterparts.
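The two region operations compared in panels a–d can be sketched as follows. This is an illustration under stated assumptions, not a description of either tool: intervals are treated as half-open, the final window of a region is simply truncated, and the distance to a TSS is measured from the nearest edge of the region (zero when the TSS lies inside it). The tools compared above may use other conventions, which is one way divergent results such as those in (d) can arise. Function names and coordinates are hypothetical.

```python
def tile_region(start, end, size=100):
    """Split the half-open interval [start, end) into consecutive windows."""
    if size <= 0 or end <= start:
        raise ValueError("need a positive window size and a non-empty region")
    # The last window is truncated at the region end (an assumed convention).
    return [(s, min(s + size, end)) for s in range(start, end, size)]


def nearest_tss(region, tss_positions):
    """Return (distance, tss) for the TSS closest to a half-open region."""
    start, end = region

    def distance(tss):
        if start <= tss < end:
            return 0  # the TSS overlaps the region
        return start - tss if tss < start else tss - (end - 1)

    best = min(tss_positions, key=distance)
    return distance(best), best


# Made-up coordinates for one 250 bp region and three TSSs.
windows = tile_region(1000, 1250)
closest = nearest_tss((1000, 1250), [400, 1300, 5000])
```

Measuring from region centers instead of edges would change the assigned TSS for some long regions, so the distance definition should be stated whenever results from different tools are compared.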
Supplementary Figure 4 EaSeq’s ‘adaptive local thresholding’ (ALT) peak-finding procedure generates peak sets of the same quality as those generated by widely used peak-finding procedures.
(a) Number of peaks found for the transcription factors NRSF and GABP. (b, c) Graphs depicting the fraction of a set of PCR-validated true positives (b) or identified motifs (c) that were identified as a function of the ranking of peaks found by ChIP-seq for the transcription factor GABP. (d) Box plot illustrating the positional accuracy of GABP peaks found by EaSeq compared to those identified in the same dataset by CisGenome and the two versions of MACS. Y-axis values represent the genome-wide distances from the apex of each peak to the nearest GABP motif (if any). (e) Graph depicting the fraction of GABP motif-containing peaks in relation to peak rank.
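The kind of curve shown in panel b can be sketched as below: for each rank k, the fraction of validated sites recovered by the top k peaks. The legend does not state how a validated site is matched to a peak, so the overlap criterion used here (site position inside the half-open peak interval) is an assumption, as are the function name and the example coordinates.

```python
def recovery_curve(ranked_peaks, validated_sites):
    """Fraction of validated sites covered by the top-k peaks, for each k."""
    if not validated_sites:
        raise ValueError("at least one validated site is required")
    found = set()
    curve = []
    for start, end in ranked_peaks:  # peaks ordered from best to worst rank
        found.update(
            i for i, site in enumerate(validated_sites) if start <= site < end
        )
        curve.append(len(found) / len(validated_sites))
    return curve


# Made-up peaks (best first) and validated site positions on one chromosome.
peaks = [(100, 200), (500, 600), (900, 1000)]
sites = [150, 950, 3000, 550]
curve = recovery_curve(peaks, sites)
```

A peak finder whose curve rises faster places more of the validated sites among its highest-ranked peaks.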
Supplementary Figure 5 Heat maps show elevated levels of PcGs at CGIs near active genes, but the signals of some negative controls are also elevated at CGIs in general.
(a) Heat maps of PcG levels within +/-2.5 kbp of the center of all mouse CGIs (arrowheads = center) show an elevated vPRC1 signal at H3K4me3-enriched CGIs. The order of the CGIs was derived from nearest neighbor chain hierarchical clustering of PcGs, H3K27me3, H2AUb, input, and IgG control signals within +/-2.5 kbp of the CGI centers using the ‘ClusterP’ tool in EaSeq, which is based on the algorithm described by Juan (Les Cahiers de l'Analyse des Données 7, 6, 1982) and Benzécri (Les Cahiers de l'Analyse des Données 7, 9, 1982). H3K4me3 and H3K36me3 marks were included to visualize the transcriptional status of the loci surrounding the CGIs. * vPRC1 signal at active CGIs. (b) Heat maps of several negative controls within 5 kbp of CGIs (arrowheads) frequently show marked enrichment at CGIs. CGIs were sorted according to the K4me3-K27me3 balance used in Fig. 6a-c. Fr/M/kbps: DNA fragments per million reads per kilobase pair. Grey bars mark whether the control samples are of the input or IgG type.
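The Fr/M/kbps unit defined above can be written out as a short calculation. This is a sketch of the unit as the legend words it, assuming that "reads" means the total number of reads in the dataset; the function name and the example numbers are hypothetical.

```python
def fragments_per_million_per_kbp(fragment_count, total_reads, window_bp):
    """Scale a raw fragment count by dataset size and window length."""
    if total_reads <= 0 or window_bp <= 0:
        raise ValueError("total_reads and window_bp must be positive")
    per_million = fragment_count / (total_reads / 1_000_000)
    return per_million / (window_bp / 1_000)


# Made-up example: 50 fragments in a 5 kbp window of a 20 million read dataset.
signal = fragments_per_million_per_kbp(50, 20_000_000, 5_000)
```

Scaling by both quantities makes values comparable between datasets of different depth and between windows of different length.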
Supplementary Figure 6 Overview of the methodology used to calculate correlation of factors affecting PcGs.
(a) Heatmaps showing the Pearson’s correlation coefficients (r-value) for the positive control test of six Ezh2 and ten Suz12 datasets from mESCs. Data were quantified and analyzed in 200 bp windows derived from peaks from the very same Ezh2 dataset. The average correlation from all 60 combinations is shown below and in Fig. 7a. SRR-containing names refer to SRA accession numbers (http://www.ncbi.nlm.nih.gov/sra) and can be found in Supplementary Table 1. (b) Heatmaps showing the r-values from each combination in the positive control test of six Ezh2 and ten Suz12 datasets in relation to the size of the window used for quantifying Suz12 signal. The regions used for the analyses were derived from the very same Ezh2 dataset as analyzed, by extending each peak to the closest size divisible by 200 and subdividing it into 200 bp fragments. Then, the Ezh2 signal was quantified within these 200 bp regions and the windows used for quantifying Suz12 ranged from 0.2 to 2 kbp in steps of 0.2 kbp, 2 to 20 kbp in steps of 2 kbp, and 20 to 200 kbp in steps of 20 kbp. The r-values were calculated for each combination of window sizes. (c) Example of how the domain sizes were derived from the range of r-values from two combinations of Ezh2 and Suz12 datasets. For each range of r-values the domain size is set to the window size resulting in the highest r-value. (d) Visualization of the range of domain sizes at a single locus within the Ptprd gene together with tracks of selected ChIP-seq signals. The triangle illustrates the extent of the domains in relation to the color coding in the heatmaps in c, e, and f, and the gray area in the middle illustrates the extent of the 200 bp window used for quantitating PcGs (Ezh2 in the case of a-f). (e) Heatmap showing the domain sizes as derived in d for each combination of Ezh2 and Suz12 datasets. (f) Heatmap showing the domain sizes in e ranked and presented in one dimension together with the median domain size (bottom).
(g, h) Test of the robustness in the calculation of domain sizes for the combination of datasets exemplified in e. The sizes of the datasets were reduced randomly to the target sizes on the axes in steps of 1M reads (Ezh2) or 2M reads (Suz12), correlated and analyzed as in e. Calculated domain sizes were largely independent of dataset size, and therefore of the signal strength of Ezh2. For Suz12, the calculated domain size was only increased when dataset sizes were ≤ 4M reads. PcGs are tightly correlated to H2AUb levels, whereas H3K27me3 association is weaker and on long range. (i) Heatmaps showing the average of Pearson’s correlation coefficients from combinations of PcG member and chromatin features that potentially affect PcG recruitment. The setup is similar to that of Fig. 6a but the window size used for analyzing the features varies from 200 bp to 200 kbp in steps of 200 bp for sizes up to 2 kbp, in steps of 2 kbp for sizes from 2 kbp to 20 kbp, and in steps of 20 kbp for sizes above that. Analysis was only done in areas scored as peaks for each PcG dataset.
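The window series and the domain-size rule described in this legend can be sketched as follows. The sketch assumes the three ranges join without repeating the boundary values (2 kbp and 20 kbp each appear once), and the r-values in the example are made up; the function names are hypothetical and this is not EaSeq code.

```python
def window_sizes():
    """Window sizes in bp: 0.2-2 kbp, 2-20 kbp, and 20-200 kbp series."""
    fine = range(200, 2_000, 200)            # 0.2 kbp steps below 2 kbp
    medium = range(2_000, 20_000, 2_000)     # 2 kbp steps below 20 kbp
    coarse = range(20_000, 200_001, 20_000)  # 20 kbp steps up to 200 kbp
    return [*fine, *medium, *coarse]


def domain_size(r_by_window):
    """Return the window size (bp) that gives the highest r-value."""
    # If several window sizes tie, the first one in the mapping is returned.
    return max(r_by_window, key=r_by_window.get)


sizes = window_sizes()
# Made-up r-values for three of the window sizes.
example_r = {200: 0.41, 2_000: 0.63, 20_000: 0.52}
best = domain_size(example_r)
```

Because the step size grows ten-fold between ranges, the resolution of the estimated domain size is coarser for large domains than for small ones.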
Supplementary Figures 1–6 and Supplementary Note (PDF 1256 kb)
List of accession numbers, reference to primary publication, and metadata for all datasets used for main and supplementary figures except Figure 6e–g. (XLSX 23 kb)
List containing accession numbers, metadata, and Z-scores for all datasets used for main Figure 6e–g (XLSX 6266 kb)
EaSeq (ZIP 39300 kb)
About this article
Cite this article
Lerdrup, M., Johansen, J., Agrawal-Singh, S. et al. An interactive environment for agile analysis and visualization of ChIP-sequencing data. Nat Struct Mol Biol 23, 349–357 (2016). https://doi.org/10.1038/nsmb.3180