Pluripotent stem cells provide a powerful system to dissect the underlying molecular dynamics that regulate cell fate changes during mammalian development. Here we report the integrative analysis of genome-wide binding data for 38 transcription factors with extensive epigenome and transcriptional data across the differentiation of human embryonic stem cells to the three germ layers. We describe core regulatory dynamics and show the lineage-specific behaviour of selected factors. In addition to the orchestrated remodelling of the chromatin landscape, we find that the binding of several transcription factors is strongly associated with specific loss of DNA methylation in one germ layer, and in many cases a reciprocal gain in the other layers. Taken together, our work shows context-dependent rewiring of transcription factor binding, downstream signalling effectors, and the epigenome during human embryonic stem cell differentiation.
Human embryonic stem (ES) cells hold great promise for tissue engineering and disease modelling; yet a key challenge to deriving mature, functional cell types is understanding the molecular mechanisms that underlie cellular differentiation. There has been much progress in understanding how core regulators such as OCT4 (also known as POU5F1), SOX2, and NANOG as well as transcriptional effector proteins of signalling pathways, such as SMAD1, TCF3, and SMAD2/3, control the molecular circuitry that maintains human ES cells in a pluripotent state1,2. While the genomic binding sites of many of these factors have also been mapped in mouse ES cells, cross-species comparison of OCT4 and NANOG targets showed that only 5% of regions are conserved and occupied across species3. Together with more general assessment of divergent transcription factor (TF) binding4, those results highlight the importance of obtaining binding data in the respective species.
It is well understood that epigenetic modifications, such as DNA methylation and posttranslational modifications of the various histone tails, are essential for normal development5,6. TF binding sites overlap with regions of dynamic changes in DNA methylation and are linked to its targeted regulation7,8. More generally, TFs orchestrate the overall remodelling of the epigenome, including the priming of loci that will change expression only at later stages6,9,10. It has also been shown that lineage-specific TFs and signalling pathways collaborate with the core regulators of pluripotency to exit the ES cell state and activate the transcriptional networks governing cellular specification11,12. However, how the handoff between the central regulators occurs and what role individual TFs and signalling cues play in rewiring the epigenome to control proper lineage specification and stabilize commitment remain poorly understood.
TF binding maps across human ES cell differentiation
To dissect the dynamic rewiring of TF circuits, we used human ES cells to derive early stages of endoderm (dEN), mesoderm (dME) and ectoderm (dEC)13,14,15 along with a mesendoderm (dMS) intermediate (Fig. 1a, Supplementary Information). We defined and collected the dMS population at 12 h owing to maximal expression of BRACHYURY (T) (Fig. 1b), and carried out chromatin immunoprecipitation followed by sequencing (ChIP-seq) for four of the Roadmap Epigenomics Project16 core histone modifications (H3K4me1, H3K4me3, H3K27Ac and H3K27me3) as well as RNA sequencing (RNA-seq) of polyadenylated transcripts (Supplementary Table 1). As expected, we observe upregulation of key TFs including FOXA2 and HNF4A in dEN, HAND1 and SNAI2 in dME, and OTX2 and PAX6 in dEC (Fig. 1b,c)9,17. We identified high-quality antibodies for 38 factors (Fig. 1c) and provide detailed information, including their validation and use in other studies, in Supplementary Table 2.
Using a micrococcal nuclease (MNase)-based ChIP-seq (MNChIP-seq) protocol18 we obtained binding patterns as well as reproducibility comparable to sonication ChIP-seq with only 1–2 million cells (Extended Data Fig. 1a–e). We quantified the enrichment over background for each experiment (Supplementary Table 3), and show that the level of binding is comparable to TF ChIP-seq data from ENCODE19 (Extended Data Fig. 1f). To evaluate computationally the specificity of the chosen antibodies we searched our binding maps for previously reported motifs of the respective factors20 (Extended Data Fig. 2). Our final data set consists of 6.7 billion aligned sequencing reads that yield 4.2 million total binding events (Supplementary Table 3). The binding spectrum of all TFs averages 21,468 peaks and ranges from 578 to 100,778 binding events. Of these 23% are found in promoters, 44% in distal regions, 30% in introns, and 3% in exons.
Classes of TF dynamics
We first grouped the TF binding dynamics into four main classes (static, dynamic, enhanced and suppressed) similar to prior studies in yeast21 and then further subdivided each of these as either temporal (between successive time-points) or cross-lineage (between germ layers) (Fig. 2a and Extended Data Figs 3 and 4).
A number of factors, including NANOG, show largely static binding in ES cells and endoderm (Fig. 2a). This could be the result of NANOG’s proposed functions in endoderm, including protection against neuroectoderm specification and buffering TGF-β signalling to avoid premature induction of definitive endoderm11. CTCF is both temporally and cross-lineage static in its binding pattern, showing a similar overlap between cell types as between replicates (Extended Data Figs 1a and 4a). The high similarity in binding is consistent with a previous study that investigated CTCF binding in 19 diverse human cell types22. Although each of the germ layer derivatives exhibits unique expression signatures, they show overall only limited transcriptional dynamics9, which is in agreement with the largely static enrichment for POLII and cMYC (Extended Data Fig. 3a).
In contrast, a number of the selected factors show dynamic binding between two (for example, GATA4) or more (for example, SMAD4) cell types (Fig. 2a, b). EOMES changes its binding profile notably during the dMS to dEN transition, suggesting its function may evolve at different stages of differentiation (Fig. 2c). Also, OTX2 occupies a largely different binding spectrum in the undifferentiated cells compared to dEN and dEC (Fig. 2d). Many factors also exhibit different temporal and cross-lineage dynamics. For example, while NANOG binding is temporally static in dMS and dEN, it is suppressed temporally and cross-lineage in dME (Extended Data Figs 3a, 4b). Meanwhile, OCT4 and SOX2 binding is temporally static in dEN, but cross-lineage dynamic between dEN and dME (Extended Data Figs 3a and 4c). Likewise, TCF4 (a transcriptional effector of WNT signalling) is temporally static in dEN but suppressed in dME and dEC, consistent with the lack of WNT signalling in those germ layers13,14,15 (Extended Data Figs 3a and 4d). Finally, OTX2 is temporally suppressed in dME (Fig. 2a), but temporally dynamic in the other germ layers (Fig. 2d).
To investigate the interplay between TFs across the cell types and how they might collaborate to mediate cellular transitions, we analysed all pairwise TF co-binding relationships. We identify several germ-layer-specific co-binding interactions; for example, GATA4 targets associate significantly (hypergeometric P < 10−300) with SMAD1 binding in dME, but less so in dEN (Fig. 3a, left and Extended Data Fig. 5). To extend this, we clustered all co-binding relationships and identified groups of interactions between factors and developmental time points (Fig. 3a, right). We found both clusters of many regulators in one cell type as well as clusters for individual TFs across cell types. For instance, cluster C1 shows that the CTCF binding spectrum is highly similar in all three germ layers. In cluster C2, we find high overlap in binding between key endoderm regulators, while C4 captures primarily pluripotent and dMS binding profiles. Many known mesoderm factors aggregate in cluster C8. Flanking the pluripotent cluster C4 are the EOMES, T, and NR5A2 clusters (C3, C5, C6); all three factors are known regulators in mesendoderm that are likely to be involved in the transition towards mesoderm and endoderm11.
Interestingly, we noticed that GATA4 and OTX2 binding in the different cell types is not only divergent, but enriched at distinct genomic features (Fig. 3b). In dME 36% of all GATA4 binding sites occur in promoters, compared to only 13.6% in dEN. OTX2’s fraction of binding sites at promoters is larger in dEN (34%) and dEC (28%) than in ES cells (13%). Accompanying GATA4’s shift in binding preference, we also observe higher levels of H3K4me1 at dEN targets and higher H3K27Ac and H3K4me3 enrichment in dME (Fig. 3c). Similarly, OTX2 associates with higher H3K27Ac and H3K4me1 levels in ES cells, and higher H3K4me3 occupancy in dEN and dEC, in line with increased promoter binding in these two germ layers (Fig. 3c). It is worth noting that similar to the distinct GATA4/SMAD1 co-binding, OTX2 co-occupies a higher fraction of loci with SMAD1 in dEN than in dEC (Fig. 3a, left and Extended Data Fig. 5). Although TGF-β signalling is primarily associated with effector proteins SMAD2/3, it also acts through the SMAD1/5/8 complex and may encourage interaction with OTX2 in dEN but not in dEC, where TGF-β signalling is specifically inhibited23.
H3K27Ac domains identify lineage regulators
Extended H3K27Ac domains have recently been termed super-enhancers and were used to describe regulatory regions that enrich for binding sites of master TFs in the respective cell types24,25. Binding of GATA4 in dME indeed coincides with long stretches of H3K27Ac near several mesodermal genes (Fig. 4a). We therefore used the previously described approach24,25 to rank extended H3K27Ac domains in our populations and identify such super-enhancers (Supplementary Table 4), which were indeed predominantly unique to each cell type (Fig. 4b and Extended Data Fig. 6). As expected, in human ES cells, core regulators OCT4, SOX2, NANOG (abbreviated OSN), and OTX2 binding is highly enriched at super-enhancers1,26 (Fig. 4c).
We used enrichment of binding at super-enhancers for identifying possible master regulators in the germ layers (Fig. 4c); the results were highly robust to different cut-offs for defining the super-enhancers (Supplementary Table 5). Surprisingly, we found that many of the core regulators bound at ES cell super-enhancers also occupy dEN super-enhancers, including OSN, OTX2, SMAD1, TCF4, and SMAD2/3 (Fig. 4c and Extended Data Fig. 6e). In mesoderm, GATA4 and SMAD1 were the most highly enriched factors at dME super-enhancers (Extended Data Figs 6f and 7), consistent with GATA4’s known role in directing cardiomyocyte development downstream of BMP signalling27. OTX2 is known to regulate neuronal subtype specification in the midbrain28 and we found strong enrichment for OTX2 binding at ectoderm super-enhancers (Fig. 4c and Extended Data Fig. 6g, h). Meanwhile, dMS super-enhancers were enriched for known regulators such as EOMES and T, along with OSN and OTX2 (Fig. 4c). At a lower significance level we also find enrichment for a number of endoderm factors, including FOXA1/2, GATA4/6 and SOX17 (Supplementary Table 5). Interestingly, binding of EOMES, T and FOXA1/2 in the undifferentiated ES cells was also enriched (hypergeometric P < 10−6) at dMS super-enhancers (Fig. 4c and Extended Data Fig. 6), suggesting that a number of loci might be already marked before differentiation.
Regulation of poised enhancers across germ layers
As dEN H3K27Ac domains were mostly devoid of known endoderm TFs, we asked if such regulators are instead present at regions that enrich for H3K4me1, as seen at the HNF1B locus (Fig. 5a). H3K4me1 can be found at both active and poised enhancers29 and is known to also form extended enhancer domains that may not overlap with the H3K27Ac domains24,25. Using the same approach as above we identified extended H3K4me1 domains in dEN and then measured enrichment for TF binding in these regions. In contrast to H3K27Ac, the top H3K4me1 domains were enriched for binding of FOXA1/2, GATA4, GATA6, and SOX17 (Extended Data Fig. 8a, b), known regulators of the early endodermal fate30. We then measured the significance in overlap between TF binding and all poised enhancers for each cell type and found strong enrichment for these regulators and PRDM1 in dEN (Extended Data Fig. 8c, d).
In concordance with this analysis and global chromatin remodelling trends (Extended Data Fig. 8e), GATA4 is associated with dynamics of H3K4me1 in dEN and H3K27Ac in dME. Given that the SMAD proteins are known to interact with histone acetyltransferases EP300 and CBP31, it is plausible that, through BMP signalling in dME, GATA4 interacts with SMAD1 and recruits EP300 to induce acetylation of H3K27 at target sites. This recruitment relationship is further supported by the higher enrichment of GATA4 motif instances at SMAD1 binding sites in dME versus dEN (Fig. 5b and Extended Data Fig. 8f) and the stronger enrichment of H3K27Ac at GATA4 targets in dME versus dEN (Fig. 3c).
To further explore this, we used several shRNAs to knock down (KD) GATA4 and then measured gene expression following differentiation into dME and dEN (Extended Data Fig. 9a). The mean expression for more than 20 lineage markers is very similar between control and KD cell lines, arguing that the KD cells still differentiate into comparable populations (Fig. 5c, right bar). While the GATA4 KD in dEN does not greatly affect any of the measured endoderm TFs (total P = 0.49, paired t-test), in dME the KD leads to a 1.7–4-fold reduction in the expression of seven key factors (total P = 5.39 × 10−5, paired t-test). GATA4 binding in dME and dEN occupies similar loci in control and KD cell lines (Extended Data Fig. 9b, c), and H3K27Ac super-enhancers in dME are largely unaffected by our knockdown (Extended Data Fig. 9d, e). Nonetheless, we observe a significant decrease in SMAD1 and H3K27Ac enrichment in dME at GATA4 target sites in the KD lines (Fig. 5d, P < 10−300, paired t-test). To a lesser degree, we also observe a decrease in mean SMAD1 occupancy at binding sites away from GATA4 (Extended Data Fig. 9f). This could be the result of the general reduction of SMAD1 expression in the dME KDs or linked to other TFs that aid SMAD1 binding, such as factors from the TEAD and GATA family (Fig. 5b).
Loss of DNA methylation at targets of lineage TFs
DNA methylation can silence genomic regions, directly or indirectly, and plays an important role during mammalian development5. Some TFs can modulate DNA methylation levels8, but it is not generally known which factors can alter it in a developmental context and which ones might be sensitive to its presence. In endoderm at a region upstream of SOX17, we observe specific loss of DNA methylation accompanied by epigenetic remodelling to a poised state. We also observe that the loss of DNA methylation associates with lineage-specific binding of several TFs (Fig. 6a and Extended Data Fig. 10a). Interestingly, OTX2 and NANOG show some enrichment already in ES cells that seems to be linked to a very focal depletion of DNA methylation that may serve as a means of initial marking or protecting the region for downstream binding (Extended Data Fig. 10b).
We next performed global enrichment analysis for all TF binding at regions that either gained or lost DNA methylation. Many target sites of OSN as well as SMAD1 and TCF4 show gain of DNA methylation in all three lineages, consistent with silencing of their pluripotency-related target genes (Fig. 6b, left). The dMS target sites of T and EOMES also become methylated in the three germ layer populations. Interestingly, we frequently find a reciprocal gain in DNA methylation in the alternative lineages of key dEN and dEC factors (Fig. 6b, middle).
As shown near SOX17, we also find that lineage regulators associate with targeted loss of DNA methylation. For instance, in dEN binding sites of EOMES, FOXA1/2 (Extended Data Fig. 10c, d), GATA4/6, SOX17, and OTX2 display focal and germ layer specific loss of DNA methylation (Fig. 6b, c). We also find strong enrichment for loss of DNA methylation at OTX2 binding sites in dEC (Fig. 6b, d). In dME we find seven partially overlapping TFs that show loss of DNA methylation at their binding sites, especially in regions that also gain H3K27Ac (Fig. 6b, e and Extended Data Fig. 7c). Using reduced-representation bisulfite sequencing32 we measured the DNA methylation level for a representative subset of targets in GATA4 KD and control lines. Both dME and dEN GATA4 KD cells displayed significantly higher methylation level (P < 10−10, paired t-test) (Fig. 6f and Extended Data Fig. 10e), suggesting a possible role for GATA4 in the focal depletion of DNA methylation.
Directed differentiation of human ES cells into the three embryonic germ layers coupled with comprehensive TF binding analysis and integration with epigenomic data has allowed us to characterize differentiation-associated regulatory dynamics. We find that targets of many lineage-specific factors associate with loss of DNA methylation in those germ layers, while factors that are expressed in more than one lineage (GATA4, GATA6, OTX2, SOX17) show a corresponding loss of DNA methylation at their targets in multiple cell types. This is in line with the model that some TFs have an intrinsic ability to alter DNA methylation, although more work is needed to determine if all of these can indeed be considered “pioneer factors”33. We also find a specific gain of DNA methylation for the targets of many TFs at later time points or in parallel time-points but along alternate lineages. This might present a possible mechanism for occluding binding sites of certain methylation sensitive factors at past or alternate differentiation paths.
To investigate the interplay between TF binding and the chromatin landscape, we focused on TF dynamics at H3K27Ac super-enhancers, where OTX2 and OSN seem to guide the transition to dEN while GATA4 and OTX2 act as key regulators for dME and dEC, respectively. GATA4 exemplifies a factor with distinct germ layer functions, where in dEN it resides at poised enhancers and in dME it appears to associate with SMAD1/EP300 to establish and maintain H3K27Ac domains. The dual use of GATA4 and OTX2 highlights the modularity in transcriptional networks in development and the complex interaction of downstream signalling effectors, TFs and chromatin in the three germ layers.
No statistical methods were used to predetermine sample size.
Human ES cell culture
Cell culture was done as reported previously9. Briefly, we chose the NIH approved, male human embryonic stem (ES) cell line HUES64 because it has maintained a stable karyotype over many passages and is able to differentiate well into the three germ layers. HUES64 was routinely tested for Mycoplasma and was negative in all instances. ES cells were maintained on ∼15,000 cells per cm2 irradiated murine embryonic fibroblasts (MEFs, Global Stem) and cultured in 20% knockout serum replacement (KSR, Life Technologies), 200 mM GlutaMAX (Life Technologies), 1× minimal essential media (MEM) non-essential amino acids solution (Life Technologies), 10 μg ml−1 bFGF (Millipore), 55 µM β-mercaptoethanol in knockout Dulbecco’s modified Eagle medium (KO DMEM, Life Technologies). ES cells were passaged every 4–5 days using 1 mg ml−1 collagenase IV (Life Technologies).
Directed differentiation of human ES cells
When human ES cells reached 60–70% confluency on MEFs, the cells were plated as clumps on 6-well plates coated with Matrigel (Life Technologies) in mTeSR1 basal medium (Stem Cell Technologies). We maintained the cells for three days in feeder-free culture and then induced directed differentiation towards mesendoderm, endoderm, mesoderm, and ectoderm using different media conditions. For mesendoderm and endoderm differentiation, cells were cultured for 12 and 120 h, respectively, in Roswell Park Memorial Institute (RPMI) medium (Life Technologies) supplemented with 100 ng ml−1 activin A (R&D Systems), 50 ng ml−1 WNT3A (R&D Systems), 0.5% FBS (Hyclone), 200 mM GlutaMAX (Life Technologies), 0.2× MEM non-essential amino acids solution (Life Technologies), and 55 µM β-mercaptoethanol. For the first 24 h of mesoderm differentiation, cells were cultured in DMEM/F12 medium supplemented with 100 ng ml−1 activin A (R&D Systems), 10 ng ml−1 bFGF (Millipore), 100 ng ml−1 BMP4 (R&D Systems), 100 ng ml−1 VEGF (R&D Systems), 0.5% FBS (Hyclone), 200 mM GlutaMAX (Life Technologies), 0.2× MEM non-essential amino acids solution (Life Technologies), and 55 µM β-mercaptoethanol. From 24 to 120 h of mesoderm differentiation, activin A was removed from the culture. For ectoderm differentiation, cells were cultured in DMEM/F12 medium supplemented with 2 µM TGF-β inhibitor (Tocris, A83-01), 2 µM WNT3A inhibitor (Tocris, PNU-74654), 2 µM dorsomorphin BMP inhibitor (Tocris), 15% KOSR (Life Technologies), 0.2× MEM non-essential amino acids solution (Life Technologies), and 55 µM β-mercaptoethanol. Media was changed daily. Before inducing differentiation, we manually removed the differentiated cell clumps. We routinely obtain greater than 80% differentiated cells based on the presence of the surface marker CD56 (81.7% of mesoderm and 94.4% of ectoderm cells) and greater than 70% differentiated cells based on the surface marker CD184 for endoderm.
RNA extraction and RNA-seq
For measuring expression levels, RNA was isolated from the human ES cells and differentiated cells using TRIzol (Invitrogen, 15596-026), further purified with RNeasy columns (QIAGEN, 74104) and DNase treated. RNA-seq library construction and data analysis was carried out as described previously9.
Supplementary Table 2 lists detailed information for all antibodies used in this study, along with references that validate the specificity and use of each antibody.
MNChIP-seq and library construction
ChIP-seq for all chromatin marks was done as in ref. 9. MNChIP-seq for TFs was carried out as in ref. 9 with several modifications, including the micrococcal nuclease (MNase) digestion. Briefly, cells were grown to a final count of 10 million, resuspended in PBS, and crosslinked in 1% formaldehyde solution for 10 min at room temperature. Following quenching with 0.125 M glycine and two PBS washes, we isolated nuclei using cell lysis buffer (20 mM Tris-HCl pH 8, 85 mM KCl, 0.5% NP40). Nuclei were then digested using MNase (Worthington, LS004797) as done in ref. 18. Digestion was stopped with 0.05 M EGTA and chromatin was aliquoted into 1–2 million cells per ChIP. Antibodies were added and immunoprecipitation was carried out overnight at 4 °C as done in ref. 9. The next day, protein G beads (Life Technologies, 10009D) were added for 2 h at 4 °C to isolate the protein-bound DNA and washed twice using low salt wash buffer (0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris-HCl pH 8.1, 150 mM NaCl), high salt wash buffer (0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris-HCl pH 8.1, 500 mM NaCl), LiCl wash buffer (0.25 M LiCl, 0.5% NP40, 0.5% sodium deoxycholate, 1 mM EDTA, 10 mM Tris-HCl pH 8.1), and TE buffer pH 8 (10 mM Tris-HCl pH 8, 1 mM EDTA pH 8). DNA was eluted twice using 100 μl of ChIP elution buffer (1% SDS, 0.1 M NaHCO3) at 65 °C for 15 min. Crosslinking was reversed by addition of 32 µl reverse crosslinking salt mixture (250 mM Tris-HCl pH 6.5, 62.5 mM EDTA pH 8, 1.25 M NaCl, 5 mg ml−1 proteinase K) for 5–18 h at 65 °C. DNA was isolated using phenol/chloroform extraction and treated with DNase-free RNase for 30 min at 37 °C. The whole-cell extract (WCE) control was generated using MNase-treated material that was then reverse-crosslinked and phenol/chloroform-extracted, skipping the immunoprecipitation and washing steps.
DNA libraries were constructed using standard Illumina protocols for blunt-ending, polyA extension, and ligation, except each clean-up step was replaced with phenol/chloroform extractions to preserve small fragments as done in ref. 18. Ligated DNA was then PCR-amplified and gel-size-selected for fragments between 30 and 600 bp. Samples were sequenced using Illumina HiSeq at a target sequencing depth of 20 million uniquely aligned reads.
shRNA infection and knockdown experiments
ES cells were maintained on MEFs in KSR culture media as described above and passaged onto Geltrex-coated dishes in mTeSR1 culture media before infection. When cells were ∼75% confluent, they were collected with Accutase as single cells or small clumps. 100,000 ES cells were plated per well of a 12-well plate coated with Geltrex in mTeSR1 culture media. After 24 h, ES cells were infected twice on separate days for 3 h with approximately 30 viral particles per cell. 48 h after the last infection, cells were selected with 1 μg ml−1 puromycin until the non-infected ES cells died off (usually within 3 days). Knockdown (KD) and control shRNA-infected ES cell lines were then maintained as described above. We then performed directed differentiation of three control and KD cell lines into 5-day dEN and dME. We collected cells and carried out RNA and DNA extraction as in ref. 9. The cDNA reaction was set up from 1 μg of total RNA per sample using the High-Capacity cDNA RT Kit (Life Technologies). qPCR was performed on 384-well TaqMan hPSC Scorecard plates using Viia7 RUO software and an Applied Biosystems ViiA7 instrument. CT values were normalized using two probes of the ACTN housekeeping gene and averaged for the three GATA4 KD and three control cell lines to obtain fold change in expression. DNA was used for reduced-representation bisulfite sequencing as in ref. 32. We also collected crosslinked cells from the same samples and carried out MNChIP-seq for GATA4, SMAD1, and H3K27Ac as described above. Composite plots display the average normalized occupancy for three GATA4 KD and two control cell lines. We used the pLKO.1 cloning vector with the following target sequences for GATA4 KD: CCAGAGATTCTGCAACACGAA, CGAGGAGATGCGTCCCATCAA, CCCGGCTTACATGGCCGACGT. The shRNA control cell lines targeted gene products not present in the human genome using the same cloning vector with the following target sequences: TGACCCTGAAGTTCATCTGCA (GFP) and CACTCGGATATTTGATATGTG (Luciferase).
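The CT normalization and averaging described above can be illustrated with a minimal Python sketch. It assumes the common 2^−ΔΔCT convention for converting CT differences into fold change, which the text does not state explicitly; the function names and the sample layout (one target CT plus the CTs of the two housekeeping probes) are hypothetical.

```python
from statistics import mean


def delta_ct(target_ct, housekeeping_cts):
    """Normalize a target CT by the mean CT of the housekeeping probes."""
    return target_ct - mean(housekeeping_cts)


def fold_change(kd_samples, control_samples):
    """Fold change (KD relative to control) for one gene.

    Each sample is a tuple (target_ct, [housekeeping_ct_probe1, probe2]).
    Assumption: fold change = 2^-(mean dCT_KD - mean dCT_control).
    """
    kd = mean(delta_ct(t, hk) for t, hk in kd_samples)
    ctrl = mean(delta_ct(t, hk) for t, hk in control_samples)
    return 2.0 ** -(kd - ctrl)
```

Under this convention a gene whose normalized CT is one cycle higher in the KD lines than in the controls has a fold change of 0.5, that is, a twofold reduction.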
Selection of transcription factors
Approximately half of the transcription factors (TFs) were chosen because they are known to play an important role in regulation of pluripotent cells or in the transition to mesendoderm (for example, BRACHYURY), endoderm (for example, SOX17), mesoderm (for example, GATA4), and ectoderm (for example, PAX6). Others were chosen computationally based on Nanostring expression analysis and RNA-seq data. Previous work12 identified that OCT4 and SOX2 play distinct roles in the transition from ES cells to mesendoderm and ectoderm based on differential expression of these TFs in the two lineages. We used a similar approach to computationally identify factors that are differentially expressed in mesoderm and endoderm. Another study showed that temporal upregulation of TFs can be indicative of their importance at specific stages of blood differentiation34. We used this approach to identify factors that were upregulated upon transition to mesendoderm, mesoderm and endoderm and included those as well in the study (see Supplementary Table 2 for additional details on the factors).
ChIP-seq and MNChIP-seq data processing
Reads were aligned to the hg19 reference assembly using bwa version 0.5.7 (ref. 35) with default parameter settings. Subsequently, reads were filtered for duplicates and extended by 200 bp. For visualization, extended reads were summed at each base and normalized for sequencing depth by scaling the y axis to represent cumulative reads per 1 million reads sequenced. This normalization was used for browser and heat map visualizations of the data in all figures. We used the MACS36 peak-calling algorithm with default settings to identify significant binding events for each TF, excluding duplicate reads. Peaks were additionally discarded if they overlapped with regions that MACS detected as peaks in four different WCE samples. Such regions have been shown to cause false-positive peaks in ChIP-seq data due to unannotated high copy number regions37. Peaks were then annotated according to their proximity to transcription start sites (TSSs) using Homer38. Peaks within exons and introns were annotated first. Then, peaks overlapping a region from −2,000 bp to +500 bp of their nearest TSS were annotated as promoter peaks. Peaks outside of promoters and not in exons or introns were annotated as distal.
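The annotation order described above can be sketched in Python as follows. This is an illustration only, not the Homer implementation: the function name and arguments are hypothetical, intervals are treated as closed, the promoter window is assumed to be strand-aware, and the precedence of exon over intron is an assumption.

```python
def annotate_peak(peak_start, peak_end, tss, strand,
                  in_exon=False, in_intron=False):
    """Return 'promoter', 'exon', 'intron' or 'distal' for one peak.

    tss and strand describe the nearest TSS; in_exon and in_intron are
    flags from a prior gene-body overlap step (hypothetical inputs).
    """
    # Promoter window: 2,000 bp upstream to 500 bp downstream of the TSS.
    if strand == "+":
        win_start, win_end = tss - 2000, tss + 500
    else:
        win_start, win_end = tss - 500, tss + 2000

    # The promoter label overrides an earlier exon or intron label.
    if peak_start <= win_end and peak_end >= win_start:
        return "promoter"
    if in_exon:  # assumption: exon takes precedence over intron
        return "exon"
    if in_intron:
        return "intron"
    return "distal"
```

For example, a peak at 9,000–9,200 bp next to a plus-strand TSS at 10,000 bp falls inside the window and is labelled a promoter peak, whereas the same TSS leaves a peak at 11,000–11,200 bp labelled distal unless it lies in a gene body.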
Data quality assessment and motif analysis
To quantify enrichment over background in ChIP-seq experiments, we measured the percentage of reads in peaks by counting all unique tags within 1,000 bp regions centred on all binding events, using the bedtools multicov function with default parameters. To compare to ENCODE, we downloaded all (n = 1,410) TF ChIP-seq profiles with matching peak and raw data (.bam) files from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/, and computed the percentage of reads in peaks in the same manner. Since ENCODE data was collected in cell types where the factors are known to be active, for Extended Data Fig. 1f we excluded all our TF binding profiles for time points where the factors are not highly expressed and expected to be inactive (middle box plot).
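A minimal Python sketch of the reads-in-peaks measure is given below. It does not call bedtools; the function name and the dictionary inputs are hypothetical, windows are closed intervals of ±500 bp around each peak centre, and overlapping windows are merged so that a tag is counted at most once, which is an assumption rather than documented behaviour of the original analysis.

```python
from bisect import bisect_left, bisect_right


def fraction_reads_in_peaks(tags, peak_centres, half_width=500):
    """Fraction of tags within +/- half_width bp of any peak centre.

    tags and peak_centres map a chromosome name to a list of positions.
    """
    total = sum(len(v) for v in tags.values())
    in_peaks = 0
    for chrom, centres in peak_centres.items():
        positions = sorted(tags.get(chrom, []))
        # Merge overlapping windows so that no tag is counted twice.
        merged = []
        for c in sorted(centres):
            start, end = c - half_width, c + half_width
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        # Count tags inside each merged window by binary search.
        for start, end in merged:
            in_peaks += bisect_right(positions, end) - bisect_left(positions, start)
    return in_peaks / total if total else 0.0
```

Multiplying the returned fraction by 100 gives the percentage of reads in peaks for one experiment.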
To quantify the specificity of our antibodies computationally, we carried out motif analysis that measured the enrichment of 1,887 known DNA binding sequences at 500 bp regions centred on the peaks of each TF using Centrimo39 and Homer38 (Extended Data Fig. 2). For six factors (POLII, SALL4, T, NR5A2, THAP11, TRIM28), we did not find a reliable DNA-binding motif in the database of 1,887 motifs combining TRANSFAC and Jolma et al. data sets20. For the remaining 32 TFs, we found that 88% (28/32) of factors significantly (P < 10−75) associate with the known DNA binding motif. Moreover, we carried out de novo motif discovery for these factors (using MEME40 and Homer38) and show that these motifs are highly similar to the known motifs, further supporting the specificity of these antibodies (Extended Data Fig. 2). For the other 4 factors (SRF, REX1, STAT3, TAL1) of the 32, we believe that either the known motifs in the database do not match the in vivo binding affinities for these factors in our cell types or that cross-reactivity of the antibody with other proteins is occurring. To be conservative, we have excluded all these factors from further analyses, figures, and the main manuscript.
The GATA4 and SMAD1 motif enrichment in Extended Data Fig. 8f was also carried out using Centrimo39 with weighted moving average of 50-bp window. Finally, motif enrichment for Fig. 5b was carried out by scanning 1,887 motifs (see above) within 500 bp of binding using Centrimo39 and displaying three or more of the most enriched DNA motifs per cell type.
TF dynamics and co-binding relationships
For quantifying TF dynamics between cell types and co-binding relationships between TFs, peak regions were merged if two peak centres were 1,000 bp or less apart, and P values were calculated using the hypergeometric distribution and subsequently corrected for multiple hypothesis testing. For each TF MNChIP in each condition, we calculated a vector of the –log10 P values for interactions with all other experiments. We then clustered all vectors along both rows and columns based on correlation distance using a hierarchical clustering algorithm with average linkage (Fig. 3a). For ease of visualization, we filtered out all experiments with no interactions at significance level P value < 10−5. To define classes of TF binding dynamics, binding was termed enhanced/suppressed if we observed at least a twofold increase/decrease in binding sites between two different conditions. If the number of binding sites had not decreased/increased twofold between two conditions, we defined the co-binding relationship as static if P value < 10−300, and dynamic if P value > 10−300.
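The overlap test and the class definitions above can be sketched in Python as follows. Because the P < 10−300 threshold lies at the edge of double-precision range, the sketch computes the hypergeometric tail in log space. The function names are hypothetical, and the choice of universe size N (for example, the total number of merged peak regions) is an assumption, as it is not specified above.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_log10_sf(k, N, K, n):
    """log10 of P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: universe size, K: sites in experiment A, n: sites in experiment B,
    k: number of shared sites.
    """
    lo, hi = max(k, 0, n + K - N), min(K, n)
    if lo > hi:
        return float("-inf")
    denom = log_choose(N, n)
    terms = [log_choose(K, i) + log_choose(N - K, n - i) - denom
             for i in range(lo, hi + 1)]
    m = max(terms)  # log-sum-exp to avoid underflow
    return (m + math.log(sum(math.exp(t - m) for t in terms))) / math.log(10)


def classify_dynamics(n_sites_a, n_sites_b, log10_p,
                      fold=2.0, log10_cutoff=-300.0):
    """Class of binding change from condition A to condition B."""
    if n_sites_b >= fold * n_sites_a:
        return "enhanced"
    if n_sites_b <= n_sites_a / fold:
        return "suppressed"
    return "static" if log10_p < log10_cutoff else "dynamic"
```

Working with log10 P directly means the static/dynamic cut-off can be applied even when the P value itself would underflow to zero.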
Defining chromatin state
For differential signal enrichment analysis, we first computed the number of uniquely aligned sequencing tag midpoints for all 1-kb tiles of the blacklist-filtered human genome. Genomic region blacklists were obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz.
For each histone mark and each condition, we then determined all 1-kb tiles significantly enriched over the whole-cell extract (WCE). To that end, we fitted local Poisson models to the read-count-normalized WCE tag distribution for each 1-kb tile of the human genome41. Only regions enriched threefold or higher compared to the WCE and significant after correcting for multiple testing using the Benjamini–Hochberg method at a q value ≤ 0.05 were retained. To identify differentially enriched regions between ES cells and each of the ES-cell-derived populations, we took advantage of a recently published analysis strategy based on mixture models that incorporates replicate information and corrects for differences in IP efficiency and signal-to-noise ratio42. We used the R implementation in the software package enrich to first fit a latent Poisson mixture model with two components to each ChIP-seq experiment in order to obtain an estimate of the fraction of reads in the signal component. Next, we used the initial parameter estimates from the latter model to fit a joint Poisson mixture model for each group of biological replicates. We then used the obtained models for each sample group to conduct pairwise comparisons accounting for sequencing depth and differences in IP efficiency. To that end, we made the assumption that the true number of enriched regions between two compared conditions for a given mark or factor is similar and set the p parameter in the enrich mix function to 1. Finally, we obtained a list of candidate differentially enriched regions at an FDR of 0.05 and retained only those regions that exhibited an absolute log2 difference ≥ 1.5 in the estimated tile enrichment levels and that were significantly enriched above background according to the first analysis step.
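The first step above (tile-level enrichment over WCE with Benjamini–Hochberg correction) can be illustrated with the simplified Python sketch below. It is an approximation under stated assumptions, not the procedure of ref. 41: the Poisson rate of each tile is taken directly from its depth-scaled WCE count with a floor of one read (the `pseudo` parameter), rather than from a fitted local model, and all function names are hypothetical. The mixture-model comparison performed with the enrich package is not reproduced.

```python
from math import exp, lgamma, log


def poisson_sf(observed, lam):
    """P(X >= observed) for X ~ Poisson(lam), with lam > 0."""
    if observed <= 0:
        return 1.0
    k = observed
    term = exp(-lam + k * log(lam) - lgamma(k + 1))
    total = 0.0
    # Sum the upper tail directly; terms shrink quickly once k > lam.
    while term > 0.0 and (k <= lam or term > total * 1e-17):
        total += term
        k += 1
        term *= lam / k
    return min(total, 1.0)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def enriched_tiles(chip_counts, wce_counts, chip_depth, wce_depth,
                   min_fold=3.0, max_q=0.05, pseudo=1.0):
    """Indices of tiles at least min_fold over scaled WCE with q <= max_q."""
    scale = chip_depth / wce_depth
    # Assumption: floor the WCE count so the Poisson rate is never zero.
    rates = [max(w, pseudo) * scale for w in wce_counts]
    pvals = [poisson_sf(c, lam) for c, lam in zip(chip_counts, rates)]
    qvals = benjamini_hochberg(pvals)
    return [
        i for i, (c, lam, q) in enumerate(zip(chip_counts, rates, qvals))
        if c >= min_fold * lam and q <= max_q
    ]
```

Applying both the fold-change filter and the q-value filter mirrors the two retention criteria stated above; either one alone would admit tiles that the text excludes.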
Next, we specifically decided to exclude more gradual changes in histone modifications and restricted the set of differentially enriched regions to those that were above background in one but not the other condition in each of the pairwise comparisons: ES cell vs dMS, ES cell vs dEN, ES cell vs dME and ES cell vs dEC. Based on these differential analysis results, we then binarized our ChIP-seq histone modification enrichment matrix. Next, we used this binarized matrix to assign each tile one of 10 states, now also incorporating DNA methylation data. The states were defined as follows (see below for details) with their order recapitulating their precedence: H3K4me3&H3K27me3, H3K4me3, H3K27me3&H3K4me1, H3K27ac, H3K4me1, H3K27me3, unmethylated region (UMR, where 0% ≤ UMR ≤ 10% methylation), intermediate methylated region (IMR, where 10% < IMR ≤ 60% methylation), highly methylated region (HMR, where 60% < HMR ≤ 100% methylation), none (no detectable histone modification enrichment or DNA methylation data for a given 1-kb tile).
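The precedence rule for the 10 states can be written compactly. The sketch below is illustrative only: the function name and the data representation (a set of enriched marks plus an optional methylation fraction per tile) are assumptions, whereas the state names, their order, and the methylation boundaries follow the definitions above.

```python
# Histone-based states in order of precedence; each entry lists the
# marks that must all be enriched in the binarized matrix.
STATE_PRECEDENCE = [
    ("H3K4me3&H3K27me3", {"H3K4me3", "H3K27me3"}),
    ("H3K4me3", {"H3K4me3"}),
    ("H3K27me3&H3K4me1", {"H3K27me3", "H3K4me1"}),
    ("H3K27ac", {"H3K27ac"}),
    ("H3K4me1", {"H3K4me1"}),
    ("H3K27me3", {"H3K27me3"}),
]


def assign_state(marks, methylation=None):
    """Assign one of the 10 states to a 1-kb tile.

    marks: set of histone marks enriched in the tile
    methylation: mean methylation as a fraction in [0, 1], or None
                 when no DNA methylation data are available
    """
    for name, required in STATE_PRECEDENCE:
        if required <= marks:  # all required marks are present
            return name
    if methylation is None:
        return "none"
    if methylation <= 0.10:
        return "UMR"
    if methylation <= 0.60:
        return "IMR"
    return "HMR"
```

Because the list is scanned from highest to lowest precedence, a tile carrying both H3K27ac and H3K4me1 is reported as H3K27ac, and methylation is consulted only when no histone state applies.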
Using chromatin data, we defined super-enhancers as in refs 24, 25. Briefly, we used the MACS36 peak-calling algorithm (default settings, except that the –p parameter was set to 1e−9) to detect enrichments in H3K27Ac ChIP-seq data for each cell type. Peaks were then merged if they were within a distance of 12.5 kb. We then ranked the stitched H3K27Ac-enriched regions based on the normalized, background-subtracted average read density (in units of reads-per-million-mapped per bp of stitched region). The cutoff for classifying super-enhancers was defined as in refs 24, 25, that is, as the point where a line with a slope of 1 is tangent to the curve of normalized region signal versus region ranking. The same procedure was used to define H3K4me1 super-enhancers per cell type.
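The stitching and cutoff steps can be illustrated with the minimal Python sketch below. It is not the published implementation of refs 24, 25: the function names are hypothetical, peaks are assumed to lie on a single chromosome, and the tangent point is approximated by rescaling rank and signal to the range 0 to 1 and taking the region that minimizes signal minus rank, which coincides with the slope-1 tangent when the ranked curve is convex.

```python
def stitch_peaks(peaks, max_gap=12_500):
    """Merge (start, end) peaks whose gap is at most max_gap bp."""
    stitched = []
    for start, end in sorted(peaks):
        if stitched and start - stitched[-1][1] <= max_gap:
            stitched[-1][1] = max(stitched[-1][1], end)
        else:
            stitched.append([start, end])
    return [tuple(region) for region in stitched]


def super_enhancer_cutoff(signals):
    """Signal at the approximate slope-1 tangent point.

    signals: density of each stitched region (at least two regions,
             largest value > 0). Regions with a signal above the
             returned value would be called super-enhancers.
    """
    ranked = sorted(signals)
    n = len(ranked)
    top = ranked[-1]
    # Both axes are rescaled to [0, 1]; a slope-1 tangent touches the
    # curve where (scaled signal - scaled rank) is smallest.
    tangent = min(range(n), key=lambda i: ranked[i] / top - i / (n - 1))
    return ranked[tangent]
```

With the signals `[1, 2, 3, 4, 5, 6, 7, 8, 20, 100]`, the cutoff falls at 8, so only the two regions with signals 20 and 100 exceed it.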
We also used this procedure to find super-enhancers with a more inclusive set of parameters (MACS parameter –p set to 1e−5 instead of 1e−9 and stitching distance set to 5 kb instead of 12.5 kb), but found no differences in our conclusions (Supplementary Table 5). We also found no difference when using other cut-offs for defining super-enhancers (top 250, top 500, top 1,000, and top 2,000 enhancer regions, Supplementary Table 5), and found that using a fixed threshold had the advantage of uniformity between cell types in the enrichment analysis. Finally, excluding all enriched regions within 2,500 bp of TSSs also led to highly similar results and did not change our conclusions.
Chromatin states versus super-enhancers
H3K27Ac chromatin states are 1-kb genomic tiles that are significantly enriched for H3K27Ac over whole cell extract (WCE) and not enriched for other chromatin marks of higher priority. These regions are the ones displayed in the chromatin state maps that happen to fall into stitched H3K27Ac super-enhancers. For an extended H3K27Ac region to be classified as a super-enhancer, it must be enriched in H3K27Ac read density relative to all other H3K27Ac enhancer regions (not relative to WCE) for a given cell type.
TF enrichment analysis
We assessed the significance of the overlap between TF binding and regions merged within super-enhancers using the hypergeometric distribution. For each cell type, we only used TF peak regions in that cell type and super-enhancers as defined by chromatin data for that cell type. We used the same approach to measure TF binding enrichment at poised enhancers, that is, regions enriched for both H3K4me1 and H3K27me3 histone modifications29. For chromatin state transition analysis, we defined the initial state as ES cells and the next cellular state as dMS or one of the three germ layers (dEN, dME, and dEC).
We then carried out TF enrichment analysis using MNChIP binding data per cell type and different epigenetic state transitions into that cell type. P values were again calculated using the hypergeometric distribution, and were subsequently corrected for multiple hypothesis testing. This analysis was used for both chromatin state transitions and DNA methylation state transitions. For Fig. 6b, we identified all differentially methylated 1-kb tiles in the genome (mean methylation difference ≥ 0.15) between ES cells and the three germ layers. In addition, we also identified regions that transitioned from an HMR state to an H3K27Ac state, termed regions that lose methylation and gain H3K27Ac. We then carried out the enrichment analysis for TF binding in these regions as described above.
Heat maps and composite plots
Heat maps were generated for regions −1 kb to 1 kb from the centre of each merged TF peak, using bins of size 50 bp. ChIP occupancy was normalized to sequencing depth as described above. Binding events for two or three ChIP-seq experiments were merged before heat map generation using Homer, as described above. ChIP-seq composite plots were generated for regions −5 kb to 5 kb from the centre of each TF peak, using bins of size 200 bp. Signal was normalized to sequencing depth, where 1 represents the mean ChIP occupancy at regions furthest from the peaks. DNA methylation composite plots were generated for regions −2 kb to 2 kb from the centre of each TF peak, using bins of size 100 bp. Mean methylation was calculated by averaging the methylation ratio at all unique CpGs within a given bin, excluding bins with no CpGs. P values for composite plots were calculated between two samples (for example, KD and control) by finding the normalized histone mark enrichment or normalized methylation level for each sample at 300-bp regions centred on each TF peak, and then using the paired t-test. Using a region size of 1 kb or 600 bp led to the same biological conclusions. Reduced-representation bisulfite sequencing captured only 1,897 of the 42,477 GATA4-bound regions in dEN and 2,331 of the 35,842 GATA4-bound regions in dME with sufficient CpG methylation coverage; hence only these regions were used for the composite plots in Fig. 6f, Extended Data Fig. 10e, and associated P value calculations.
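The binning used for the DNA methylation composite plots can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: the function name is hypothetical, all positions are assumed to lie on one chromosome, CpGs are pooled across peaks before averaging, and a CpG that lies within the window of two nearby peaks contributes once per peak. Bins without CpGs are reported as `None` so that they can be excluded.

```python
from bisect import bisect_left


def methylation_composite(peak_centres, cpgs, flank=2000, bin_size=100):
    """Mean methylation per bin around TF peak centres.

    peak_centres: peak centre coordinates (one chromosome)
    cpgs: dict mapping CpG position -> methylation ratio in [0, 1]
    Returns one value per bin from -flank to +flank; None marks bins
    that contain no CpGs.
    """
    n_bins = 2 * flank // bin_size
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    positions = sorted(cpgs)
    for centre in peak_centres:
        window_start = centre - flank
        # Half-open window [centre - flank, centre + flank)
        lo = bisect_left(positions, window_start)
        hi = bisect_left(positions, centre + flank)
        for pos in positions[lo:hi]:
            b = (pos - window_start) // bin_size
            sums[b] += cpgs[pos]
            counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

With the default arguments the function returns 40 bins of 100 bp, matching the −2 kb to 2 kb window described above.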
Gene Expression Omnibus
All data have been deposited in GEO under accession code GSE61475.
We would like to thank all members of the Meissner laboratory for their support and feedback. We also thank F. Kelley and other members of the Broad Technology Labs and Sequencing Platform as well as J. Doench and members of the Genetic Perturbation Platform at the Broad Institute. We would like to thank L. Gaffney for graphical support. This work was supported by the NIH Common Fund (U01ES017155), NIGMS (P01GM099117), NHGRI (P50HG006193) and the New York Stem Cell Foundation. A.M.T. was supported by NIH Ruth L. Kirschstein NRSA fellowship 5F32DK095537. A.M. is a New York Stem Cell Foundation Robertson Investigator.
Extended data figures
Gene expression levels in FPKM for the RNA-seq data generated for the mesendoderm (dMS) cell type in this study.
This table includes information for each factor (rows) in the study, including a brief description of why it was included (last column). In addition, the table includes detailed information for all antibodies used in this study, along with references that report the use and specificity of each antibody. TFs were selected from the literature and computationally based on germ-layer-specific gene expression patterns (Fig. 1c and Supplementary Information), focusing on TFs that are relevant for the endoderm and mesoderm populations, since the additional dMS intermediate affords higher temporal resolution to capture early rewiring dynamics as well as factors involved in the lineage bifurcation. The set includes key developmental regulators as well as transcriptional effector proteins of the TGF-β (SMAD2/3/4), BMP (SMAD1/4), and WNT (TCF4, LEF1) signalling pathways, which play a critical role during the establishment of the three germ layers13,15.
Uniquely aligned reads, number of binding events (peaks), and enrichment over background (% reads in peaks) for all ChIP-seq experiments (rows) generated in this study.
Coordinates and annotation of the nearest gene (Homer) for extended H3K27Ac domains (super-enhancers) detected in each cell type (ESC, dMS, dEN, dME, and dEC displayed as separate sheets) and for extended H3K4me1 domains in dEN.
TF binding enrichment analysis at super-enhancers detected using different thresholds. TF binding enrichment (–log10 P values) at super-enhancers defined as in refs 24, 25, using a variation of the published method with less stringent parameters (MACS parameter –p set to 1e−5 instead of 1e−9 and stitching distance set to 5 kb instead of 12.5 kb), and using several other thresholds (top 250, top 500, top 1,000, and top 2,000 enhancer regions). Different choices of super-enhancer threshold are displayed as different columns and cell types as different sheets. For each cell type, TF enrichments at super-enhancers are highly similar, arguing that the binding enrichment analysis is robust to the choice of threshold for classifying super-enhancers. We also repeated this analysis for H3K4me1 super-enhancers in dEN in sheet “dEN_K4me1”. Finally, sheet “dENspecific_K4me1” compares TF binding enrichment analysis of all H3K4me1 regions, dEN-specific regions relative to all 5 cell types, and dEN-specific regions within the three germ layers. The Spearman correlation of the non-zero enrichment values at dEN-specific regions versus dEN H3K4me1 domains is shown in the bottom row of this sheet.