Occupancy maps of 208 chromatin-associated proteins in one human cell type

Transcription factors are DNA-binding proteins that have key roles in gene regulation1,2. Genome-wide occupancy maps of transcriptional regulators are important for understanding gene regulation and its effects on diverse biological processes3–6. However, only a minority of the more than 1,600 transcription factors encoded in the human genome has been assayed. Here we present, as part of the ENCODE (Encyclopedia of DNA Elements) project, data and analyses from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP–seq) experiments using the human HepG2 cell line for 208 chromatin-associated proteins (CAPs). These comprise 171 transcription factors and 37 transcriptional cofactors and chromatin regulator proteins, and represent nearly one-quarter of CAPs expressed in HepG2 cells. The binding profiles of these CAPs form major groups associated predominantly with promoters or enhancers, or with both. We confirm and expand the current catalogue of DNA sequence motifs for transcription factors, and describe motifs that correspond to other transcription factors that are co-enriched with the primary ChIP target. For example, FOX family motifs are enriched in ChIP–seq peaks of 37 other CAPs. We show that motif content and occupancy patterns can distinguish between promoters and enhancers. This catalogue reveals high-occupancy target regions at which many CAPs associate, although each contains motifs for only a minority of the numerous associated transcription factors. These analyses provide a more complete overview of the gene regulatory networks that define this cell type, and demonstrate the usefulness of the large-scale production efforts of the ENCODE Consortium.


Additional introductory material
According to the most recent census and review of putative transcription factors (TFs), which included manual curation of DNA-binding domains in protein sequences and experimental observations of DNA binding, there are 1,639 known or likely TFs in the human genome2. However, other tallies1,7 and broader definitions of proteins that associate with DNA, including transcriptional cofactors such as RNA polymerase-associated proteins, histone-binding regulators, and chromatin-modifying enzymes, suggest that there are likely >1,800 and possibly as many as 2,500 such proteins encoded in the human reference genome assembly; we refer to these collectively as chromatin-associated proteins (CAPs) to distinguish this broad group from the stricter definition of direct DNA-binding TFs. A typical TF binds preferentially to a short DNA sequence motif and, in vivo, some TFs also exhibit additional chromosomal occupancy mediated by their interactions with other CAPs8–10, although the extent and biological significance of most secondary associations are not well understood79. CAPs play vital roles in orchestrating cell type- and cell state-specific gene regulation, including the temporal coordination of gene expression in developmental processes, environmental responses, and disease states3–6,11–13. Identifying the genomic regions with which a TF is physically associated, commonly referred to as TF binding sites (TFBSs), is an important step toward understanding its biological roles. The most common genome-wide assay for identifying TFBSs is chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq)14–16. In addition to highlighting potentially active regulatory DNA elements by direct measurement, ChIP-seq data can define specific DNA sequence motifs that can be used, often in conjunction with expression data and chromatin accessibility maps, to infer likely binding events in other cellular contexts without performing direct assays.
Elegant methods have been developed for identifying motifs62,80–82, including ones that consider the plasticity of individual bases within and adjacent to a motif83–86, account for structural details in relation to TF co-occurrence87–89, or incorporate directly measured and inferred motifs8. Subsets of motifs can be specific to different cell types or environmental contexts and can depend on chromatin state and the presence of cofactors for accessibility90,91; the presence of a motif sequence alone is often not predictive of a binding event92–94. While motifs identified by enrichment in ChIP-seq are often representative of direct binding, this is not always the case, as co-occurrence of other TFs could lead to the enrichment of their motifs.
Further, the ChIP-seq method identifies both protein:DNA and, indirectly, protein:protein interactions, such that indirect and even long-distance interactions (e.g. looping of distal elements) can be captured as ChIP-seq enrichments.
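As a minimal illustration of how a sequence motif can be used to nominate candidate binding sites, the sketch below scores each window of a sequence against a position weight matrix and reports the windows that reach a threshold. The toy motif, the uniform background, the pseudocount, and the threshold are assumptions made for illustration and are not taken from this study; as noted above, a motif match alone is often not predictive of binding.

```python
from math import log2

BACKGROUND = 0.25  # assumption: uniform base frequencies


def log_odds_matrix(ppm, pseudocount=0.01):
    """Convert per-position base probabilities into log2 odds vs. background.

    ppm is a list of {base: probability} dicts, one per motif position.
    """
    return [
        {
            b: log2((col.get(b, 0.0) + pseudocount) / (1 + 4 * pseudocount) / BACKGROUND)
            for b in "ACGT"
        }
        for col in ppm
    ]


def scan(sequence, pwm, threshold):
    """Return (offset, score) for forward-strand windows scoring >= threshold."""
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[j][b] for j, b in enumerate(window))
        if score >= threshold:
            hits.append((i, score))
    return hits


def toy_column(consensus_base, p=0.97):
    """Hypothetical motif column: probability p on one base, rest shared equally."""
    rest = (1.0 - p) / 3
    return {b: (p if b == consensus_base else rest) for b in "ACGT"}


# Hypothetical 4-bp motif with consensus TGCA (not a motif from this study).
TOY_PWM = log_odds_matrix([toy_column(b) for b in "TGCA"])

if __name__ == "__main__":
    for offset, score in scan("AATGCATT", TOY_PWM, threshold=5.0):
        print(offset, round(score, 2))
```

A practical scan would also score the reverse complement and would typically be restricted to accessible chromatin, in line with the caveats above.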
A long-term goal is comprehensive mapping of all CAPs in all cell types, but a compelling and more immediate aspiration is to create a deep map of all CAPs expressed in a single cell type. The resulting consolidation of hundreds of genome-wide maps for a single cellular context promises insights into CAP networks that are otherwise not possible. Such comprehensive data will also provide the necessary backdrop for understanding large-scale functional element assays, and should improve the ability to infer TFBSs in other cell types that are less amenable to direct measurements.
Large sets of CAPs have been analyzed previously95–99. However, the larger studies to date have assayed occupancy by transfected CAPs, often expressed ectopically and at non-physiological levels, whereas in this study we assayed endogenous proteins expressed at physiological levels. This work in the HepG2 hepatocellular carcinoma cell line is part of the Encyclopedia of DNA Elements (ENCODE) Consortium effort toward achieving "factor completeness" (that is, the mapping of all expressed CAPs' binding locations) in a subset of commonly used human cell lines. We present here an analysis of 208 CAP occupancy maps in HepG2, composed of 92 traditional ChIP-seq experiments with factor-specific antibodies and 116 CETCh-seq (CRISPR epitope tagging ChIP-seq) experiments. We developed the CETCh-seq method to address the dearth of ChIP-competent antibodies for many factors, and it has been shown to be a robust, powerful assay17,18. Its strength is that the endogenous CAPs are tagged with a universal epitope that is recognized by a single well-characterized ChIP antibody, and that the tagged factors are expressed at physiological levels, which avoids the ectopic ChIP peaks that conventional transgene overexpression can cause100,101. As more CETCh-seq experiments are performed, the growing database is used to identify any antibody-specific artifacts attributable to cross-reactivity. This is part of the ENCODE Consortium quality control process for ChIP-seq, CETCh-seq,  Table 1). This large and unbiased sampling in one cell type allowed us to approach the analysis from two complementary directions: beginning with patterns of CAP occupancy and co-occupancy to find preferential associations of CAPs with each other and with promoters, enhancers, or insulator functions; and, in the other direction, working from genomic loci, sequence motifs, and epigenomic state to explain occupancy.
All ChIP-seq/CETCh-seq data are available through the ENCODE web portal (www.encodeproject.org) or at the Gene Expression Omnibus. We identified each CAP's genome-wide binding sites by using the SPP algorithm57, with replicate consistency and peak ranking determined by Irreproducible Discovery Rate (IDR)104. These publicly available ENCODE occupancy data, together with the analyses and insights presented here, constitute a key resource for the scientific community.

Liver-specific TFs and genes reveal the cis- and trans-networks of HepG2
Identifying transcription networks is important for understanding how genes specify a cell type. Our current understanding is that TFs, including key cell-type specifying factors, interact with other factors via combinatorial cross-regulation to drive gene expression in a cell-specific manner. To identify HepG2-specific cis-regulatory elements, we used IDEAS segmentation to identify all regions that are promoter-like or enhancer-like in at least one of five other cell lines (GM12878, H1-hESC, HUVEC, HeLa-S3, and K562), and removed these regions from the HepG2 segmentation. In the resulting set of 59,115 putative HepG2-specific cis-regulatory regions, we found significant enrichment (Fisher's exact test, Benjamini-Hochberg FDR-adjusted p-value < 0.001) of distinctive CAPs at HepG2-specific enhancer loci, including known important liver TFs such as HNF4A, HNF4G, CEBPA, and FOXA1, along with additional CAPs not previously associated with liver cell identity, such as TEAD1, RXRB, and NFIL3 (Supplementary Fig. 1a).
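The enrichment test named above can be sketched as follows: a one-sided Fisher's exact test on a 2x2 table per CAP, followed by Benjamini-Hochberg adjustment across CAPs. This is a minimal standard-library sketch, not the authors' pipeline; the table layout, the one-sided alternative, the CAP labels, and all counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment on the table [[a, b], [c, d]].

    Assumed layout: rows = (HepG2-specific regions, other regions),
    columns = (bound by the CAP, not bound). Returns P(X >= a) under the
    hypergeometric null with fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    total = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return total / comb(n, col1)


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical counts per CAP: (a, b, c, d) as defined above.
    tables = {
        "CAP_A": (30, 70, 10, 190),
        "CAP_B": (12, 88, 25, 175),
    }
    names = list(tables)
    pvals = [fisher_exact_greater(*tables[name]) for name in names]
    for name, p, q in zip(names, pvals, benjamini_hochberg(pvals)):
        print(name, f"p={p:.3g}", f"q={q:.3g}", "enriched" if q < 0.001 else "not enriched")
```

With hundreds of CAPs tested, the adjustment step matters: the threshold of 0.001 is applied to the adjusted values rather than to the raw p-values.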
Because HepG2 is a cancer cell line derived from liver tissue, we focused on liver-specific genes, filtering for genes that are highly and specifically expressed in liver and also expressed in HepG2 at levels of at least 10 TPM. This identified a total of 57 key liver/HepG2-specific genes. We then examined the peak calls of all 208 CAPs close to the promoter regions of these 57 liver-specific genes (±2 kb from TSSs), finding between 13 and 148 CAPs associated with the promoter of each gene. Pioneer TFs (capable of binding closed chromatin and usually involved in recruiting other factors105,106) such as FOXA1, FOXA2, and CEBPA, as well as key chromatin regulators such as EP300, associate with most of the liver-specific genes (Supplementary Fig. 1b). Of note, the promoters of the very highly expressed liver genes ALB, APOA2, AHSG, FGA, and F2 (also known as thrombin) have very high apparent factor occupancy/association: 65, 148, 124, 114, and 130 CAPs, respectively (Supplementary Fig. 1c). We examined CAP occupancy at the promoters of all genes, as well as of those genes expressed at 10 TPM or higher in HepG2, and compared these to CAP occupancy at the 57 liver-specific genes (Supplementary Fig. 1d-f, Supplementary Table 6). In each analysis, the number of associated factors correlates positively with RNA level. We note that some prior studies suggested that TF occupancy at highly expressed loci is a technical artifact of ChIP-seq51, but, as described below in the section on HOT sites, several lines of evidence argue that these signals represent true biology. The 57 liver-specific genes have significantly higher expression (rank percentile t-test; p-value < 0.0001) than other genes matched by number of CAPs, indicating that higher expression is associated not only with a higher number of associated CAPs but also with specific factor identities.
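The promoter tally described above can be sketched as a simple interval-overlap count. This is a minimal sketch under stated assumptions: a CAP is counted at a promoter if any of its peaks overlaps the window from TSS - 2 kb to TSS + 2 kb, coordinates are treated as closed intervals, and the gene names, CAP names, coordinates, and TPM values are all hypothetical.

```python
WINDOW = 2000    # bp on either side of the TSS, as in the text
MIN_TPM = 10.0   # expression cutoff, as in the text


def caps_per_promoter(tss, peaks, tpm, window=WINDOW, min_tpm=MIN_TPM):
    """Return {gene: set of CAPs with >= 1 peak overlapping the TSS window}.

    tss:   {gene: (chrom, position)}
    peaks: {cap: [(chrom, start, end), ...]}
    tpm:   {gene: expression in TPM}
    Genes below min_tpm are skipped.
    """
    result = {}
    for gene, (chrom, pos) in tss.items():
        if tpm.get(gene, 0.0) < min_tpm:
            continue
        lo, hi = pos - window, pos + window
        result[gene] = {
            cap
            for cap, intervals in peaks.items()
            if any(c == chrom and s <= hi and e >= lo for c, s, e in intervals)
        }
    return result


if __name__ == "__main__":
    # Toy inputs for illustration only.
    tss = {"geneA": ("chr1", 10000), "geneB": ("chr1", 50000), "geneC": ("chr2", 10000)}
    tpm = {"geneA": 250.0, "geneB": 3.0, "geneC": 12.0}
    peaks = {
        "CAP1": [("chr1", 9000, 9300), ("chr2", 30000, 30200)],
        "CAP2": [("chr1", 11900, 12100), ("chr2", 8100, 8300)],
        "CAP3": [("chr1", 12500, 12700)],
    }
    for gene, caps in caps_per_promoter(tss, peaks, tpm).items():
        print(gene, len(caps), sorted(caps))
```

For genome-scale inputs, the nested loop would normally be replaced by a sorted-interval or interval-tree lookup; the linear scan is kept here for readability.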
The CAPs associated with higher-than-expected expression, given the number of CAPs at the promoters they occupy, include unsurprising examples such as PAF1 and RNA polymerase II subunit A (Ser2 phosphorylated), which are marks of active transcription, as well as ATF4 and HSF1 (Supplementary Fig. 1g). However, we note that there are still many CAPs that have not yet been assayed by ChIP-seq, and this could explain some of the deviation from expected expression. An additional caveat is that each experiment is normalized separately, which limits comparison of the relative activity levels of individual CAPs.

SOM analysis
For an independent assessment of co-occupancy and as an additional quantitative analysis, we trained a chromatin self-organizing map (SOM)32 using all 208 CAPs with the SOMatic package33. This analysis generated 196 distinct clusters of SOM units, with each such "metacluster" sharing similar profiles, and corresponding decision trees that trace the supervised learning path used to determine the unique features of each metacluster profile (Fig. 4c, Extended Data Fig. 7). Focusing on the key HepG2 TFs FOXA1/2 and HNF4A, we found that 18 distinct metaclusters accounted for nearly half of the peaks for these 3 TFs (43% for FOXA1, 43% for FOXA2, and 49% for HNF4A). CAPs important for liver development, nucleosome remodeling, and the cohesin complex show high co-binding signal in these key 18 metaclusters.
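The per-factor percentages quoted above are fractions of a factor's peaks that fall in a chosen set of metaclusters. A minimal sketch of that bookkeeping follows; it assumes each peak has already been assigned to one metacluster (the SOM training itself is not shown and this is not the SOMatic API), and the peak identifiers and metacluster numbers are hypothetical.

```python
from collections import Counter


def fraction_in_metaclusters(peak_to_metacluster, key_metaclusters):
    """Fraction of one factor's peaks assigned to any of the key metaclusters.

    peak_to_metacluster: {peak_id: metacluster_id} for a single factor.
    key_metaclusters:    iterable of metacluster ids of interest.
    """
    if not peak_to_metacluster:
        return 0.0
    counts = Counter(peak_to_metacluster.values())
    hits = sum(counts[m] for m in set(key_metaclusters))
    return hits / len(peak_to_metacluster)


if __name__ == "__main__":
    # Toy assignment of five peaks of one factor to metaclusters.
    assignments = {"peak1": 3, "peak2": 3, "peak3": 7, "peak4": 12, "peak5": 40}
    print(fraction_in_metaclusters(assignments, key_metaclusters=[3, 7]))
```

Applying the same function to each factor's peak assignments with a shared set of key metaclusters yields one fraction per factor, comparable across factors.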
Looking closer at the CAPs that distinguish these 18 key clusters, we found that