Detection of new genomic control elements is critical in understanding transcriptional regulatory networks in their entirety. We studied the genome-wide binding locations of three key regulatory proteins (POU5F1, also known as OCT4; NANOG; and CTCF) in human and mouse embryonic stem cells. In contrast to CTCF, we found that the binding profiles of OCT4 and NANOG are markedly different, with only ~5% of the regions being homologously occupied. We show that transposable elements contributed up to 25% of the bound sites in humans and mice and have wired new genes into the core regulatory network of embryonic stem cells. These data indicate that species-specific transposable elements have substantially altered the transcriptional circuitry of pluripotent stem cells.
Although it has been recognized that the gain and loss of regulatory elements are common features of eukaryotic genomes1, 2, most studies investigating this have been limited to the detection of binding events in one species followed by an in silico analysis of evolutionary conservation3, 4, 5 or have been restricted by the scope and comparability of the functional datasets being analyzed6, 7. To systematically explore the impact of newly arisen regulatory elements in a mammalian system, we generated matching datasets in human and mouse undifferentiated embryonic stem cells and studied the role of OCT4, NANOG and CTCF. The first two are known key regulators in embryonic stem cells7, 8, and the third is an important factor in the organization of regulatory blocks9. Previous studies have pointed to both similarities and differences between the expression profiles of these cells10, 11. Additional insights gained about the evolution and the wiring of this core regulatory network could provide deeper understanding of pluripotent stem cells derived from various species12.
We began our analysis by generating chromatin immunoprecipitation sequencing (ChIP-Seq) libraries for these three factors and then determined their genome-wide occupancy profile in human embryonic stem cells (see Online Methods). We used the full set of binding regions (Fig. 1a) to enable analyses of loci across a range of enrichment levels. Using these binding regions, a de novo motif-finding method recapitulated the known OCT4, NANOG and CTCF DNA binding motifs and helped confirm the quality of the data (Online Methods, Supplementary Fig. 1 and Supplementary Note). Notably, the motifs defined from comparable mouse embryonic stem cell datasets13 explained the human binding regions nearly as well (Supplementary Fig. 1b). This confirms the high similarity of the DNA-binding specificity of these proteins in human and mouse embryonic stem cells.
In a preliminary study, we suggested that the overlap between human and mouse binding regions in embryonic stem cells was limited7. However, that earlier assessment was hindered by the fact that the dataset of human samples was not genome wide and that the detection technologies used for each species were different (array based versus sequencing based). In contrast, the human datasets presented here enable a direct comparison to the mouse datasets previously obtained13, and so, based on the regions detected in the human samples, we evaluated the proportion of regions that were also observed to be bound in mouse by looking for binding evidence within homologous windows of 1 kb in length (Online Methods). Overall, we found that only 2.0%, 1.9% and 16.7% of the regions occupied by OCT4, NANOG and CTCF in human were also occupied in mouse, respectively. Increasing the window sizes to 2 kb and 5 kb only had a moderate effect on the results (Supplementary Fig. 2a). Focusing on the top 10% most enriched regions, it is even more notable that only 3.8% of the OCT4 regions and 5.3% of the NANOG regions are conserved compared to 49.6% of the CTCF regions (Fig. 1b). To address potential issues with the sensitivity of the ChIP-Seq assays, we performed the converse analysis starting with the mouse binding regions and looking for evidence of binding in the human datasets but we also observed limited conservation (Supplementary Fig. 2b). Together, this confirms that the in vivo occupancy profiles of OCT4 and NANOG are notably different between human and mouse embryonic stem cells.
Recent studies have suggested that for a number of transcription factors, transposable elements have been a rich source of new binding sites4, 14. We were interested in measuring whether this phenomenon was also a major contributing factor for the binding sites of OCT4, NANOG and CTCF in human embryonic stem cells because this could affect the regulation of neighboring genes15, 16. By calculating the observed overlap between the binding regions of each factor and the various repeat families, we were able to identify specific transcription factor-repeat associations that were more common than those expected by chance (Online Methods). For instance, even though there are only 767 LTR9B repeats from the endogenous retrovirus 1 (ERV1) repeat family in the human genome, we observed that 255 (33.2%) of these repeats are bound by OCT4. By chance, we would have only expected 3.1 (0.4%), and the number seen here corresponds to an 82-fold enrichment. We call such binding sites repeat-associated binding sites (RABS). Looking at the tag density in and around repeat instances of over-represented families, it is clear that specific regions of their ancestral sequence are preferentially targeted (Fig. 1c and Supplementary Fig. 3). Moreover, in many cases, aligning the bound instances of a given repeat family can show that the same region of the ancestral sequence has a high degree of sequence similarity among the bound sequences and harbors the cognate binding motif (Fig. 1d).
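The LTR9B figures quoted above can be reproduced with a few lines of arithmetic. The following is only a worked check of the numbers stated in the text (767 repeats, 255 bound, 3.1 expected by chance); the variable names are illustrative and not taken from the original analysis.

```python
# Worked check of the LTR9B enrichment quoted in the text.
total_ltr9b = 767      # LTR9B repeats in the human genome (from the text)
observed_bound = 255   # repeats overlapping an OCT4 binding region (from the text)
expected_bound = 3.1   # overlaps expected by chance (from the text)

observed_fraction = observed_bound / total_ltr9b
expected_fraction = expected_bound / total_ltr9b
fold_enrichment = observed_bound / expected_bound

print(f"observed: {observed_fraction:.1%}")   # fraction of repeats bound
print(f"expected: {expected_fraction:.1%}")   # fraction expected by chance
print(f"fold enrichment: {fold_enrichment:.0f}")
```

Rounded to the precision used in the text, this gives 33.2% observed, 0.4% expected and an 82-fold enrichment.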
Collectively, we calculated that RABS accounted for 20.9%, 14.6% and 11.1% of the OCT4, NANOG and CTCF binding regions, respectively (Fig. 1e and Supplementary Table 1). Notably, the contributions of RABS were evenly distributed among the high- and the low-intensity binding regions for CTCF and were slightly skewed toward strongly bound sites for OCT4 and NANOG (Supplementary Fig. 4). For both OCT4 and NANOG, we found that the ERV1 repeat family is the largest contributor of RABS. In total, 2,464 (8.3%) of the OCT4 binding regions and 6,376 (7.2%) of the NANOG binding regions overlapped ERV1 repeats (Fig. 1f). Applying the same procedure to the mouse datasets showed that RABS account for 7.2%, 17.1% and 28.3% of the binding regions of Oct4, Nanog and Ctcf, respectively (Fig. 1e). It is notable that most of the families of transposable elements that have been exapted in the two species are different and correspond to species-specific sequences (Fig. 1f and Supplementary Table 2). Indeed, of the 6,231 OCT4 binding regions classified as RABS in human, only 58 (0.9%) have a homologous region in the mouse that is also bound.
To determine the functional relevance of RABS, we depleted human embryonic stem cells of POU5F1 (also known as OCT4) by RNA interference (RNAi) and examined differential gene expression by microarray analysis. We processed the microarray data and identified 721 genes that were downregulated and 1,407 genes that were upregulated (Online Methods and Supplementary Table 3). When we checked whether the differentially expressed genes had binding within 20 kb of their transcription start site (TSS), we observed an enrichment of OCT4 and NANOG binding regions, especially around the downregulated genes (Online Methods, Fig. 2a and Supplementary Fig. 5a). Moreover, we found that OCT4 regions overlapping a NANOG region were 1.85-fold over-represented in proximity to downregulated genes as compared to nonregulated genes (P < 1.0 × 10^−10; Fig. 2b). Similarly, conserved OCT4 regions were also enriched 1.96-fold (P = 0.0011). Also of note, OCT4 RABS showed a 1.86-fold enrichment (P = 5.6 × 10^−8), and breaking up the RABS by repeat family revealed that the enrichment increased to 3.1-fold (P = 2.5 × 10^−8) for binding sites embedded in the ERV1 repeat family. This is strong evidence for a functional role of the OCT4-ERV1 sites in transcriptional regulation.
Given that the majority of the OCT4 and NANOG binding regions are different in humans and mice (Fig. 1b) and that we had access to matching Pou5f1 RNAi data in mouse embryonic stem cells7, we investigated the binding profiles around conserved gene targets in further depth. We compared the expression of orthologous genes between humans and mice and identified 137 genes that were downregulated in both human and mouse (conserved targets) following RNAi treatment (Online Methods and Supplementary Table 4). Included in this list is POU5F1, as well as a number of other factors implicated in embryonic stem cell biology (for example, SOX2, NANOG, KLF4 and DPPA4). Although the strongest binding signal was observed in the immediate promoter of these genes, there was an enrichment of binding regions reaching up to 20 kb both upstream and downstream of the TSS (Online Methods and Supplementary Fig. 5b,c). In total, 72 of the 137 (53%) conserved targets had an OCT4-NANOG binding region, but only 11 of these (15%) were homologously bound in the mouse samples, whereas the other genes showed evidence of binding site turnover (Fig. 3a and Supplementary Table 5). For instance, AEBP2, which encodes a protein found in the PRC2 complex that is known to be important for stem cell self-renewal and differentiation17, is a typical example and shows evidence of binding site turnover (Fig. 3b). For this gene, the proximal promoter site in human overlaps a repeat that appears to be absent in mouse (Supplementary Fig. 6a). An exception to this is SOX2, which has a very well-conserved binding profile in mice and humans for the three factors considered here (Supplementary Fig. 6b).
Looking at the 584 genes that only showed downregulation in human embryonic stem cells, we found that 160 (27%) had an OCT4-NANOG binding region. Notably, for these human-specific targets, the fraction of binding regions corresponding to RABS was higher (22.5%) as compared to the conserved targets (12.4%). For instance, SCGB3A2 (encoding secretoglobin, family 3A, member 2), which is downregulated following POU5F1 RNAi treatment, contains two binding regions in its promoter that are bound by OCT4 and NANOG and that overlap ERV1 repeats (Fig. 3c). This gene, which was previously reported as one of the most highly expressed genes in human embryonic stem cells18, is not regulated in mouse, but this difference can now be explained by the presence of species-specific transposable elements. In total there are 50 human-specific targets that have a RABS, including 23 that have an ERV1-RABS (Fig. 4). We selected two of these ERV1-RABS and showed, using a luciferase assay, that they can drive enhancer activity and that this activity is ablated if the OCT4 motif is mutated (Supplementary Note). Together, these results suggest that many genes have been rewired into the core regulatory network of human embryonic stem cells following the insertion of transposable elements.
In summary, we found that CTCF has a stable occupancy profile not only across cell types19 but also across species. In contrast, OCT4 and NANOG have very different binding profiles in human and mouse embryonic stem cells, with only ~5% of their sites being homologously occupied. The fact that there is also a limited concordance between regions experimentally observed to be bound and conserved elements, as determined from multispecies sequence alignments (Supplementary Fig. 7), implies that in vivo maps in the relevant species will be important in the study of many mammalian systems. Moreover, to help explain the vast occupancy differences, we showed that species-specific transposable elements have been an important source of new sites in both species. Using matched binding and expression datasets, we also demonstrated that many of these transposable element–derived sites are found in the vicinity of conserved target genes in human and mouse. Finally, beyond the genes that have similar expression profile changes in human and mouse, we were also able to identify a group of human-specific target genes that show evidence of having been added to the core regulatory network of human embryonic stem cells via the insertion of transposable elements. Although we do not expect all binding events to directly influence gene expression, these data add important support to a seminal hypothesis on the impact of repeats on the evolution of transcription regulation20, 21, 22. Our results reveal the striking plasticity of the core regulatory network of mammalian embryonic stem cells and the importance that transposable elements have had in facilitating this functional turnover.
Whole-genome chromatin-immunoprecipitation datasets.
The hESC line H1 (WA-01, passage 28)23 was used for this study. The cells were cultured feeder free on Matrigel (Becton Dickinson)24. Conditioned medium used for culturing hESCs contained 20% knockout serum replacement, 1 mM L-glutamine, 1% nonessential amino acids, 0.1 mM 2-mercaptoethanol and an additional 8 ng/ml of basic fibroblast growth factor (Invitrogen) supplemented to the hESC unconditioned medium. The medium was changed daily. The hESCs were subcultured with 1 mg/ml collagenase IV (Gibco) every 5–7 d. The H1 hESCs were cross-linked with 1% formaldehyde for 10 min at room temperature, and the formaldehyde was then inactivated by the addition of 125 mM glycine. ChIP-Seq was carried out as described previously13. Briefly, chromatin extracts containing DNA fragments with an average size of 500 base pairs (bp) were immunoprecipitated. Illumina/Solexa adaptors were ligated to the ChIP DNA fragments (10 ng), which were then subjected to 15 cycles of PCR amplification. The fraction of fragments averaging 200 bp in length was selectively cut out from the gel and eluted using a Qiagen gel extraction kit. Using the Illumina/Solexa platform, 13–22 million 36-bp tags were sequenced from these samples, out of which 9.6, 9.9 and 12.6 million tags were mapped uniquely to the human genome (NCBI36/hg18 assembly) using the ELAND program (see URLs). The antibodies used were Abcam (AB19857) for OCT4, R&D (AF1997) for NANOG and Upstate (07-729) for CTCF.
Finally, using the program MACS25, the binding regions were ranked based on the enrichment of ChIP sequenced tags by comparing each ChIP library to an input library as a control. We identified 29,740, 88,351 and 87,883 peak regions for OCT4, NANOG and CTCF, respectively (Supplementary Tables 6–8). We defined the binding peaks as those above the P value cutoff of 1.00 × 10^−5. A number of factors will influence the resolution of the peak calling procedure (most notably initial fragment length and sequencing depth). For our analysis, we retained all peaks; however, it is possible that some of the peaks called in close proximity to each other might have originated from a single binding location.
De novo motif finding.
To find the motifs over-represented in the binding regions, we used the repeat-masked sequence from the regions 100 bp around the top 1,000 peaks of each transcription factor as input for the MDmodule program26. The highest-ranking motif in each library was similar to the known motif of the corresponding transcription factor (Supplementary Fig. 1a). For each identified motif, we scanned the bound regions using a previously described method7 with an e-value cutoff of 0.001 to identify the binding peaks that had the motif (Supplementary Fig. 1b). We also performed the same motif scan using the mouse PWMs previously identified13 to calculate the proportion of human binding regions that can be explained by the mouse motif. Finally, scanning larger 600-bp windows centered around the middle of the bound regions revealed a strong enrichment for the recognition motifs, especially within 60 bp of the peak (Supplementary Fig. 1c). Together these results help confirm the quality of the ChIP procedures.
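To illustrate what scanning a bound region with a position weight matrix (PWM) involves, here is a minimal sketch. It is not the previously described scanning method cited above, and it does not compute e-values: it uses a generic log-odds score against a uniform background, scans the forward strand only, and the 4-column PWM and test sequence are invented toy values.

```python
import math

BASES = "ACGT"

def log_odds_matrix(freqs, background=0.25, pseudo=0.01):
    """Convert per-position base frequencies (A, C, G, T) into log2-odds scores."""
    matrix = []
    for column in freqs:
        total = sum(column) + 4 * pseudo
        matrix.append([math.log2(((f + pseudo) / total) / background)
                       for f in column])
    return matrix

def best_hit(sequence, matrix):
    """Return (score, offset) of the best-scoring forward-strand window."""
    width = len(matrix)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing masked bases (e.g. N)
        score = sum(matrix[i][BASES.index(base)]
                    for i, base in enumerate(window))
        if score > best[0]:
            best = (score, start)
    return best

# Hypothetical toy PWM favouring the sequence A-T-G-C (not a real motif model).
toy_freqs = [
    [0.91, 0.03, 0.03, 0.03],
    [0.03, 0.03, 0.03, 0.91],
    [0.03, 0.03, 0.91, 0.03],
    [0.03, 0.91, 0.03, 0.03],
]
toy_matrix = log_odds_matrix(toy_freqs)
score, offset = best_hit("NNCCATGCAA", toy_matrix)
print(offset, round(score, 2))
```

In a real analysis the score threshold would be calibrated (for example against shuffled or background sequence) and both strands would be scanned; those steps are omitted here.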
Assessing conservation in vivo and in silico.
To identify the binding regions conserved in vivo, we first extended each region identified in human to 50 bp, 200 bp or 1,000 bp (1 kbp) windows surrounding the peaks and used liftOver27 with default parameters to determine the homologous regions on the mouse genome (NCBI36/mm8; Supplementary Table 9). For the rest of the study (to be conservative and to maximize overlap), we intersected the results from the 1-kbp windows with the mouse binding regions reported previously13 to identify the conserved binding regions. We also did the converse, starting from the mouse binding regions. The in vivo conservation estimates obtained in this way could have been affected by the choice of antibodies in the two species, but it is encouraging to see in the human regions similar levels of enrichment for motifs obtained independently in human and mouse (Supplementary Fig. 1b). This helps confirm the high similarity of the DNA binding specificity for these proteins in the two species and supports the comparability of the datasets. Finally, for the in silico analysis, we identified the human binding regions that overlap the 28-Way PhastCons Elements track28 from the UCSC Genome Browser27 using centered windows of fixed length (50 bp, 100 bp and 200 bp). The results are shown in Supplementary Table 10.
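The window-and-intersect step described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the coordinate conversion is represented by already converted windows (with `None` standing in for a region that could not be mapped), regions that fail conversion are counted as not conserved, and all coordinates are invented.

```python
def centered_window(chrom, summit, width):
    """Fixed-width window centred on a peak summit (half-open interval)."""
    start = max(0, summit - width // 2)
    return (chrom, start, start + width)

def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def conserved_fraction(lifted_windows, other_species_peaks):
    """Fraction of source regions whose converted window overlaps a peak.

    lifted_windows holds one entry per source region; None marks a region
    with no homologous coordinates (assumed here to count as not conserved).
    """
    by_chrom = {}
    for peak in other_species_peaks:
        by_chrom.setdefault(peak[0], []).append(peak)
    hits = 0
    for window in lifted_windows:
        if window is None:
            continue
        if any(overlaps(window, peak) for peak in by_chrom.get(window[0], [])):
            hits += 1
    return hits / len(lifted_windows)

# Toy example with invented coordinates.
human_window = centered_window("chr1", 1500, 1000)
lifted = [("chr1", 1000, 2000), ("chr1", 5000, 6000), None, ("chr2", 100, 1100)]
mouse_peaks = [("chr1", 1900, 2100), ("chr2", 2000, 2200)]
fraction = conserved_fraction(lifted, mouse_peaks)
print(human_window, fraction)
```

A linear scan per chromosome is adequate for a toy example; for tens of thousands of regions a sorted-interval or interval-tree lookup would be the more practical choice.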
Identification of RABS.
We used the 200-bp window surrounding the center of the transcription factor binding regions and intersected these with the RepeatMasker (see URLs) track from UCSC Genome Browser to find the number of overlaps of each transcription factor's binding regions with specific repeats. We also annotated each binding region with respect to its nearest RefSeq genes, up to a 100-kbp distance. We separated the binding regions into six categories according to the peak location: TSS (within 1 kbp of a TSS), promoter (up to 5 kbp upstream of TSS), intragenic (within the RefSeq gene boundary), proximal (up to 10 kbp away from the gene boundaries), distal (up to 100 kbp away from the gene boundaries) and desert (more than 100 kbp away from any RefSeq genes). Next, we generated a random dataset of 200,000 regions with the same annotation distribution as the true regions and intersected with the RepeatMasker track to obtain the expected number of overlaps of each transcription factor with repeat elements. We then used a one-sided binomial test to compare the observed number of repeats intersecting the true binding regions with the expected numbers from the annotation-matched background. We identified RABS as those repeats with statistically significant (P < 1 × 10^−5) association with a transcription factor's binding regions. We also did the same analysis for the mouse binding regions.
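One way to picture the one-sided binomial test described above is sketched below. The assumption made here is that the per-region overlap probability is estimated as the fraction of the 200,000 annotation-matched random regions that overlap a given repeat family, and that the number of trials is the number of true binding regions; the counts in the example are hypothetical and the function names are illustrative.

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_upper_tail(k, n, p):
    """One-sided P value P(X >= k) for X ~ Binomial(n, p).

    Terms are summed in log space; extremely small tails may underflow
    to 0.0, which is still below any practical significance cutoff.
    """
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    return min(1.0, math.exp(peak) * sum(math.exp(x - peak) for x in logs))

# Hypothetical counts for one repeat family (not taken from the paper).
random_regions = 200_000      # size of the annotation-matched background
random_overlaps = 2_000       # background regions overlapping the family
true_regions = 1_000          # binding regions of the factor
observed_overlaps = 40        # binding regions overlapping the family

background_rate = random_overlaps / random_regions
expected_overlaps = true_regions * background_rate
p_value = binom_upper_tail(observed_overlaps, true_regions, background_rate)
is_rabs_family = p_value < 1e-5   # cutoff stated in the text
print(expected_overlaps, p_value, is_rabs_family)
```

With these toy counts the family would be expected to overlap 10 regions by chance, so 40 observed overlaps falls well below the stated cutoff.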
Microarray expression analysis, target identification and network analysis.
The background-adjusted Illumina results were normalized using MeV by performing log2 transformation, followed by median centering on samples and median centering of genes across the samples. We used SAM29 with 5% false discovery rate and >1.5-fold cutoff to find the genes with statistically significant changes in expression upon RNAi treatment. We noted that depletion of POU5F1 by RNAi induced rapid differentiation of human embryonic stem cells. Therefore, the gene expression profile is a combination of primary and secondary gene expression changes. For the mouse RNAi results, we used the data as previously provided7. To determine an appropriate distance cutoff to associate binding regions to genes, we looked at the absolute enrichment of OCT4-NANOG binding regions in proximity of downregulated RefSeq genes (Supplementary Fig. 5a). To maximize enrichment and comprehensiveness but also limit the level of background noise, we identified targets of each transcription factor in each genome as genes with binding regions within 20 kbp of its TSS. We sorted the expression changes of the genes and to display general binding patterns we used a sliding window one-eighth the size of the gene list and calculated the proportion of the changing genes that are bound by each transcription factor. We compared this proportion with the number of genes in the whole array that were bound by the transcription factor as the background. P values associated with fold enrichments were calculated using a one-sided binomial proportion test.
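The sliding-window summary described above can be sketched as follows. This is a minimal illustration, assuming the window advances one gene at a time (the step size is not stated in the text) and using an invented toy gene list; `is_bound` stands for "has a binding region within 20 kbp of the TSS".

```python
def sliding_bound_fraction(genes, window_fraction=1 / 8):
    """Bound fraction in windows sliding over genes sorted by expression change.

    genes: list of (log_fold_change, is_bound) tuples.
    Returns (per-window bound fractions, background bound fraction).
    """
    ranked = sorted(genes, key=lambda gene: gene[0])
    flags = [1 if bound else 0 for _, bound in ranked]
    size = max(1, int(len(flags) * window_fraction))
    background = sum(flags) / len(flags)

    running = sum(flags[:size])
    fractions = [running / size]
    for i in range(size, len(flags)):
        running += flags[i] - flags[i - size]   # slide the window by one gene
        fractions.append(running / size)
    return fractions, background

# Toy list: 16 genes, the 4 most downregulated ones are bound.
toy_genes = [(change, change < -4) for change in range(-8, 8)]
fractions, background = sliding_bound_fraction(toy_genes)
enrichment = [f / background for f in fractions]
print(fractions[:5], background, enrichment[0])
```

Dividing each window's fraction by the array-wide background gives the fold enrichment; its significance would then be assessed with a one-sided binomial proportion test as stated in the text.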
Finally, to identify homologous genes in human and mouse, we selected the longest transcript to represent each RefSeq in each species and used liftOver to convert the coordinates into the other species. We then intersected the new coordinates with the RefSeq genes of that particular genome and identified those genes that intersect in the same strand as the homologous gene pairs from the two species.
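The gene-pairing procedure above can be outlined in code. This sketch makes two assumptions that the text does not spell out: transcript length is taken as genomic span, and the coordinate conversion between species is represented by precomputed records rather than an actual liftOver call. Gene names and coordinates are invented.

```python
def longest_transcripts(transcripts):
    """Keep the transcript with the largest genomic span for each gene."""
    best = {}
    for t in transcripts:
        span = t["end"] - t["start"]
        kept = best.get(t["gene"])
        if kept is None or span > kept["end"] - kept["start"]:
            best[t["gene"]] = t
    return best

def same_strand_overlap(a, b):
    """True if two records overlap on the same chromosome and strand."""
    return (a["chrom"] == b["chrom"] and a["strand"] == b["strand"]
            and a["start"] < b["end"] and b["start"] < a["end"])

def homologous_pairs(converted_genes, target_genes):
    """Pair each converted gene with target-genome genes it overlaps."""
    return [(src["gene"], dst["gene"])
            for src in converted_genes
            for dst in target_genes
            if same_strand_overlap(src, dst)]

human = [
    {"gene": "GENE_A", "chrom": "chr1", "start": 100, "end": 1100, "strand": "+"},
    {"gene": "GENE_A", "chrom": "chr1", "start": 100, "end": 3100, "strand": "+"},
    {"gene": "GENE_B", "chrom": "chr2", "start": 500, "end": 900, "strand": "-"},
]
representatives = longest_transcripts(human)

# Hypothetical mouse coordinates standing in for the converted human genes.
converted = [
    {"gene": "GENE_A", "chrom": "chr3", "start": 2000, "end": 4800, "strand": "+"},
    {"gene": "GENE_B", "chrom": "chr5", "start": 700, "end": 1100, "strand": "-"},
]
mouse = [
    {"gene": "Gene_a", "chrom": "chr3", "start": 1900, "end": 5000, "strand": "+"},
    {"gene": "Gene_x", "chrom": "chr5", "start": 600, "end": 1200, "strand": "+"},
]
pairs = homologous_pairs(converted, mouse)
print(representatives["GENE_A"]["end"], pairs)
```

In the toy data GENE_B overlaps Gene_x positionally but on the opposite strand, so it is not reported as a homologous pair.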
Raw sequence tags, peaks files and OCT4 RNAi expression files have been deposited to GEO with the accession code GSE21200.
H.-H.N. and G.B. designed the experiments. N.-Y.C., X.L. and Y.-S.C. performed the experiments. G.K. performed the data analysis with contributions from J.J. and C.H. G.B. wrote the manuscript with contributions from H.-H.N. and G.K.
- Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol. Biol. Evol. 19, 1114–1121 (2002).
- Divergence of transcription factor binding sites across related yeast species. Science 317, 815–819 (2007).
- Large-scale turnover of functional transcription factor binding sites in Drosophila. PLOS Comput. Biol. 2, e130 (2006).
- Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Res. 18, 1752–1762 (2008).
- Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).
- Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nat. Genet. 39, 730–732 (2007).
- The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat. Genet. 38, 431–440 (2006).
- Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122, 947–956 (2005).
- The protein CTCF is required for the enhancer blocking activity of vertebrate insulators. Cell 98, 387–396 (1999).
- Derivation of pluripotent epiblast stem cells from mammalian embryos. Nature 448, 191–195 (2007).
- New cell lines from mouse epiblast share defining features with human embryonic stem cells. Nature 448, 196–199 (2007).
- Stem cells and early lineage development. Cell 132, 527–531 (2008).
- Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106–1117 (2008).
- Species-specific endogenous retroviruses shape the transcriptional network of the human tumor suppressor protein p53. Proc. Natl. Acad. Sci. USA 104, 18613–18618 (2007).
- Endogenous retroviral LTRs as promoters for human genes: a critical assessment. Gene 448, 105–114 (2009).
- Transposable elements and the evolution of regulatory networks. Nat. Rev. Genet. 9, 397–405 (2008).
- SUZ12 is required for both the histone methyltransferase activity and the silencing function of the EED-EZH2 complex. Mol. Cell 15, 57–67 (2004).
- Gene expression patterns in human embryonic stem cells and human pluripotent germ cell tumors. Proc. Natl. Acad. Sci. USA 100, 13350–13355 (2003).
- Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 128, 1231–1245 (2007).
- Regulation of gene expression: possible role of repetitive sequences. Science 204, 1052–1059 (1979).
- The significance of responses of the genome to challenge. Science 226, 792–801 (1984).
- Retroposons—seeds of evolution. Science 251, 753 (1991).
- Embryonic stem cell lines derived from human blastocysts. Science 282, 1145–1147 (1998).
- Feeder-free growth of undifferentiated human embryonic stem cells. Nat. Biotechnol. 19, 971–974 (2001).
- Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
- Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA 100, 3339–3344 (2003).
- The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
- Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
- Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98, 5116–5121 (2001).
This work was supported by the Agency for Science, Technology and Research (A*STAR) of Singapore.
- Supplementary Text and Figures: Supplementary Figures 1–7, Supplementary Tables 1–10 and Supplementary Note.
- Supplementary Table 1: Human RABS for OCT4, NANOG and CTCF
- Supplementary Table 2: Mouse RABS for Oct4, Nanog and Ctcf
- Supplementary Table 3: Human and mouse POU5F1 and Pou5f1 RNAi results
- Supplementary Table 5: OCT4 binding regions around the conserved and the human-specific OCT4 target genes
- Supplementary Table 6: OCT4 binding regions