G-quadruplex DNA structures in human stem cells and differentiation

The establishment of cell identity during embryonic development involves the activation of specific gene expression programmes and is underpinned by epigenetic factors including DNA methylation and histone post-translational modifications. G-quadruplexes are four-stranded DNA secondary structures (G4s) that have been implicated in transcriptional regulation and cancer. Here, we show that G4s are key genomic structural features linked to cellular differentiation. We find that G4s are highly abundant in human embryonic stem cells and are lost during lineage specification. G4s are prevalent in enhancers and promoters. G4s that are found in common between embryonic and downstream lineages are tightly linked to transcriptional stabilisation of genes involved in essential cellular functions as well as transitions in the histone post-translational modification landscape. Furthermore, the application of small molecules that stabilise G4s causes a delay in stem cell differentiation, keeping cells in a more pluripotent-like state. Collectively, our data highlight G4s as important epigenetic features that are coupled to stem cell pluripotency and differentiation. Whether G-quadruplexes (G4s) regulate stem cell self-renewal and fate determination during embryonic development is not well understood. Here, the authors reveal that the embryonic stem cell state is defined by very high G4 abundance. G4s are progressively lost during differentiation as cells transit to lower lineage potential while artificial G4 stabilisation leads to delayed differentiation.

H uman embryonic stem cells (hESC) can indefinitely selfrenew yet retain the capacity to generate all embryonic cell lineages 1 . hESC pluripotency is governed by a core network of master transcription factors (TFs), including OCT4, NANOG and SOX2 2 , and their interplay with signalling pathways as well as chromatin and histone modifiers. This hESC TF network sustains the pluripotent state by simultaneously promoting chromatin plasticity whilst repressing key genes involved in lineage commitment through the establishment of unique epigenetic architecture. More specifically, the hESC epigenome comprises decondensed chromatin, high levels of active histone modifications and low levels of heterochromatic proteins 3,4 . Promoters primed for activation upon differentiation can also be marked by both active (H3K4me3) and repressive (H3K27me3) histone marks, allowing genes to be up or down-regulated in response to the appropriate developmental stimulus 5 . During differentiation, the epigenetic landscape is reorganised through increased DNA methylation, decreased chromatin accessibility and histone mark reassignments. Such changes during differentiation lead to a reorganisation of global 3D chromatin architecture to enable the activation of lineage-specific gene expression programmes and silencing of unrelated programmes [6][7][8] . This is underpinned by rearrangements in promoter-enhancer interactions and super-enhancers; large clusters of regulatory elements that serve as docking sites for TFs, epigenetic modifiers and basal transcriptional machinery [6][7][8] .
Of emerging importance to genome architecture and function are alternative DNA secondary structures known as G-quadruplexes (G4s) that form from certain guanine-rich sequences 9 (Fig. 1a). G4s arise from the self-association of guanine bases through Hoogsteen hydrogen bond base pairing (Fig. 1a). Computational analyses show that nucleic acid sequences with G4-forming potential are evolutionarily conserved 10 and non-randomly dispersed across the genome being particularly enriched in promoters, recombination sites and telomeres 11,12 . Experiments using an antibody, specifically raised against G4 secondary structures, reveal that the endogenous G4 genome landscape is tightly regulated since only~1-2% of over 700,000 human sequences can biophysically fold into a G4 structure in vitro 13 , actually do so within a chromatin context 14,15 . Previously, we showed that G4s have an increased prevalence in cancer cells 16 and are particularly associated with highly expressed and amplified genes in patient-derived aggressive breast cancer tissue 17 . In keeping with computational predictions, endogenous G4s in chromatin are primarily located in nucleosome-depleted regions (NDRs) of highly active promoters 14,18,19 . Accumulating evidence suggests that G4s act as recruitment hubs for TFs 15,20,21 , are involved in interactions 21 between distal genomic loci and can interact with/ modulate DNA methyltransferases 22,23 , which further connects G4s to transcriptional and epigenome regulation 9,24 . Given the extensive evidence, particularly from cancer cell studies, that G4s modulate the functions of the (epi)genome, we now present data on the potential roles of G4s in human embryonic stem cells and their differentiation into downstream lineage-specific states.

Results
G4 abundance in pluripotent human embryonic stem cells is lost during differentiation. To gain insights into the role of G4s during development, we utilised a human embryonic stem cell (hESC) system which enables differentiation to be studied in vitro and is an accepted model that recapitulates many aspects of cell lineage specification during human embryogenesis 25 . hESCs were differentiated into two well-characterised, multipotent stem cells each with differing lineage potentials: cranial neural crest cells (CNCCs) 26 and neural stem cells (NSCs 27 , Fig. 1b, c). NSCs undergo self-renewal and can generate neurons, astrocytes, and oligodendrocytes of the central nervous system 28 . CNCCs also self-renew but have a broader lineage potential giving rise to cranial neurons and glia as well as craniofacial mesodermal derivatives, such as bone, cartilage and smooth muscle 29,30 . We first confirmed stem cell derivation and identity using a range of established cell-lineage-specific antibody markers by immunofluorescence microscopy (IF), flow cytometry experiments and RNA-seq ( Fig. 1c and Supplementary Fig. 1). For example, hESCs showed specific expression of OCT4 (POU5F1) and NANOG, NSCs specifically expressed PAX6 and SOX1 and CNCCs specifically expressed TFAP2A and TWIST1.
To capture the incidence of G4s in cellular chromatin we deployed our well-characterised G4 structure-specific antibody, BG4 31 in chromatin immunoprecipitation sequencing (G4-ChIPseq) 14 experiments. Hereafter, we define G4s as G4-ChIP-seqpositive regions identified in at least 2 out of 3 biological replicates (see Methods section, Supplementary Fig. 2). We found 17950 robust G4s in pluripotent hESCs and upon their differentiation into CNCCSs or NSCs we observed 9456 and 4436 respectively (Fig. 1d). We confirmed that BG4 was indeed recognising a broad spectrum of G4 structural types (Supplementary Fig. 2h, i) as for previous studies 14,32 . Corroborating this finding, G4 loss upon differentiation was observed at the singlecell level via BG4 immunofluorescence (IF) staining (Supplementary Fig. 3a, b). G4 loss was also primarily independent of the cell cycle stage ( Supplementary Fig. 3c-e and Supplementary Discussion).
Most G4s in differentiated stem cell lineages are present in pluripotent embryonic stem cells. Remarkably, over 75% of G4s found in CNCCs or NSCs were also found in hESCs (Fig. 1e, f and Supplementary Fig. 4a, b). Despite the distinct lineages of NSCs and CNCCs, 83% (3,694/4,436) of NSC G4s were also observed in CNCCs ( Fig. 1f and Supplementary Fig. 4c) and 77% (3426/4436) of NSC G4s were common to both CNCCs and hESCs ( Supplementary Fig. 4d). Therefore, while G4s are lost during differentiation, G4s present in differentiated daughter cells largely occur at genomic locations where G4s were already present in the hESC state.
When hESCs differentiate, chromatin accessibility generally decreases as cell lineages become specified 7,8 , which we confirmed by Assay for Transposase-Accessible Chromatin (ATAC-seq, Supplementary Fig. 5a). Over 99% of G4s are located at NDRs regardless of stem cell type ( Supplementary Fig. 5b) and is similar to our previous observations in cancer 14 . Controlled differentiation provides the opportunity to evaluate whether G4s and chromatin structures are tightly coupled during cell state transitions. We, therefore, compared common and stem celltype-specific G4s ( Supplementary Fig. 5c, d) and open chromatin sites (as identified by ATAC-seq, Supplementary Fig. 5e, f) for hESCs, CNCCs and NSCs. Upon differentiation, we observed that G4 loss is accompanied by a corresponding loss of ATAC signal, conversely, the establishment of new G4s in differentiated daughter cells leads to a corresponding gain in ATAC signal. Hence, G4s in human stem cells are intimately linked to chromatin accessibility.
G4s are positioned within stem cell regulatory regions. To investigate possible functions of G4s in human stem cells, we next explored the locations of G4s in each stem cell type (Supplementary Fig. 6a). For each stem cell type, G4s were found primarily in promoters, enhancers, super-enhancers and sites occupied by lineage-specific transcription factors, which are of vital importance to transcriptional programmes and cell identity [33][34][35] (Fig. 2 and Supplementary Fig. 6). This included the binding sites of important core pluripotency master regulators OCT4, NANOG and KLF4 in hESCs 2,36 , the neural crest master regulators TFAP2A and NR2F1 in CNCCs 37 , the neuroectoderm lineage specification transcription factor PAX6 in NSCs 38 (Fig. 2c) and the H3 histone acetyltransferase p300, involved in enhancer activity 39 (Supplementary Fig. 6c). Notably, cell-typespecific G4s were found to be enriched at cell-type-specific enhancers while G4s common to two or more stem cell types were more enriched in promoter regions (Fig. 2a). Comparing our G4s maps with promoter-capture Hi-C maps for hESCs 40 and hESC-derived neuronal progenitor cells (developmentally analogous to NSC 41  promoters and promoter-interacting regions (Fig. 2e). G4s are also prevalent at the binding sites of known architectural proteins involved in promoter-enhancer chromosomal loop anchorage and formation in hESC (Fig. 2f). These observations implicate G4s in the long-distance regulation of promoter-enhancer interactions and overall 3D structural organisation of the genome 42 in stem cells. The majority of G4s were also located at hypomethylated CpG islands (> 60% of all G4s) in hESCs (Supplementary Fig. 6d) and is consistent with our earlier findings that showed an inverse relationship between G4 signals and proximal cytosine methylation in cancer cells 22 . Overall, our findings highlight an intimate relationship between the incidence of folded G4s in genomic regulatory regions and stem cell identity.
G4s in promoters are associated with H3K4me3 and bivalent promoter transitions during differentiation. In addition to promoters exclusively marked by either activating (H3K4me3) or repressive (H3K27me3) histone marks, bivalent promoters are classified by having both H3K4me3 and H3K27me3 (Fig. 3a). Bivalent promoters are prevalent in stem cells and are particularly important for developmental and lineage-specific genes that are generally repressed or lowly expressed, but are poised for rapid activation upon differentiation 5 . We next examined the relationship of folded G4s to different promoter classes and transcription. Most promoters with a G4 in each stem cell type were also marked by H3K4me3 (67-94%, Fig. 3b). G4s were also found to be enriched in bivalent promoters (Fig. 3c, d and Supplementary Fig. 7a, b) with almost half (3041/6350) of all hESC bivalent promoters carrying a G4 ( Supplementary Fig. 7c). This relationship is also apparent from the coincidence of hESC G4s at the binding sites of RING1B (Fig. 2f), a PRC1 component generally found at bivalent domains 43 . Our observation shows that folded G4 structures can physically co-exist with repressive histone marks in the context of bivalency. This provides experimental support for findings from computational studies that predict G4 sequence motif association with bivalent and polycomb-associated chromatin at DNA [44][45][46] . Our data thus highlight G4s as prevalent structural features in poised genes involved in developmental decisions. During lineage specification, bivalent promoters generally become transcriptionally repressed or active, concomitant with loss of either H3K4me3 or H3K27me3, respectively 8,47 . There are four combinations of promoter G4 transitions that are characterised by G4 presence (G4+) or absence (G4−) in the initial state (hESCs; G4 E ) and daughter cells (G4 D ) for a given gene ( Supplementary Fig. 8a, Methods section). Hereafter we refer to "G4 maintenance" to describe when promoter G4s are present in both hESCs and the differentiated daughter cell. Significantly more hESC bivalent promoters (p < 3.0E-05) transition to an active H3K4me3 status in CNCCs when maintaining (G4 E + G4 D +, 79%) or gaining (G4 E − G4 D +; 83%) a G4, as compared to those that lose (G4 E + G4 D −; 40%) or never had a G4 (G4 E − G4 D −, 30%, Fig. 3e). Conversely, hESC bivalent promoters that either lose or do not gain a G4 are more likely to (p < 2.6E-09) maintain bivalency G4 E + G4 D −; 43%; G4 E − G4 D −; 43%) or become repressed (H3K27me3; G4 E + G4 D −, 10%; G4 E − G4 D −; 18%) after differentiation (Fig. 3e). Thus, maintenance or acquisition of a promoter G4 appears to be related to the transition of bivalent hESC promoters towards an active H3K4me3 state in CNCC differentiation. Such bivalent promoters were also found to have on average higher H3K4me3 levels ( Fig. 3f and Supplementary Fig. 7d-f), chromatin accessibility ( Fig. 3g and Supplementary Fig. 7g, h) and gene expression (Fig. 3h, Supplementary Fig. 9 and Supplementary Discussion) compared to bivalent promoters without a G4. Furthermore, hESC bivalent promoters which maintain a G4 but lose repressive H3K27me3 in CNCC differentiation are strongly associated with developmental signalling pathways such as WNT signalling 29 (Supplementary Fig. 8b and Supplementary Data 1). In hESCs, the other main G4-positive promoter type is marked by H3K4me3. Significantly more (p < 1.4E-07) of these promoters retain their H3K4me3 status on CNCC differentiation when maintaining (G4 E + G4 D +; 98%) or gaining (G4 E − G4 D +; 99%) a G4, as compared to those that either lose (G4 E + G4 D −; 79%) or never had a G4 (G4 E − G4 D −; 84%) ( Fig. 3e). The majority (>75%) of H3K4me3 and bivalent G4-positive promoters in CNCCs also appear to have arisen from promoters in hESCs that also carried a G4 ( Supplementary Fig. 8c). Therefore, G4 maintenance during differentiation appears to be linked with propagation of the hESC histone promoter status to the daughter cells. The same G4-associated histone-mark transitions were also seen in the differentiation of hESCs into NSCs (Supplementary Fig. 8d-g), suggesting that the coupling of G4s with histone marks is a common feature of genes involved in transitioning to a more differentiated state.
Maintenance of promoter G4s is linked to transcriptional stabilisation between stem cell transitions. Many genes undergo transcriptional changes upon hESC differentiation into downstream cell lineages 6,8 . To understand how promoter G4 status relates to changes in gene expression during differentiation, the transcript levels in hESCs were compared to those in each daughter cell for the four combinations of promoter G4 transitions ( Supplementary Fig. 8a). We used a weighted linear regression model to fit expression level data and computed the residuals (i.e., distance from the regression line) to quantify transcriptional variability between the two cell states being considered (see Methods secion). G4 E + G4 D + promoters showed significantly lower (p < 2E-275) transcriptional variability between hESCs and daughter cells than G4 E − G4 D −, G4 E + G4 D − or G4 E − G4 D + promoter transitions (Fig. 4a-c and Supplementary Figs. 10a, b, 11, 12 and 13). This stabilisation is also observed when considering differential gene expression (FDR < 0.05, abs (Log2FC) >1) ( Supplementary Fig. 10c, d): the proportion of non-differentially expressed (DE) genes for G4 E + G4 D +promoters is~2-3-fold higher (p < 2E-312) than genes that are DE. In contrast, for G4 E − G4 D + and G4 E + G D − promoter transitions the proportion of DE genes is significantly Fig. 1 G4 abundance is linked to degree of stem cell plasticity. a G-quartet with 4 Hoogsteen-base-paired guanines, which stack to form a G-quadruplex (G4) structure stabilised by cations (e.g., K + ). b Overall study design and schematic showing differing lineage potential of stem cell types generated and analysed. Data generated in this study are indicated in blue text, with published datasets indicated in grey text. Schematic created with BioRender.com. c Immunofluorescent microscopy images for hESC (NANOG, OCT4), CNCC (p75NTR and TFAP2A) and NSC (PAX6 and NESTIN) markers. Inset: cell lineage marker merge with DAPI. Scale bar = 50 µm. d Number of confirmed G4-ChIP-seq sites (defined henceforth as "G4s") identified in each stem cell type. e Genome browser view of G4 signal for all three stem cell types across the promoters of the genes FZD2, MSX1, PTPRZ1 and CHD1. Yellow box highlights regions where G4s overlaps open chromatin (defined by ATAC-seq; ATAC) and genome sites that have the ability to fold into G4 structures in vitro (called OQs, observed quadruplex sequences 13 ). Genomic coordinates are indicated at the top. f Venn diagrams illustrating overlap of G4s between hESCs and CNCCs (top) or hESC and NSCs (middle) or CNCCs and NSCs (bottom).  Fold enrichment over random 10 12 14 16 18 20 h E S C C N C C N S C    . c-f Fold-enrichments over random (n = 1000 permutations) and proportion of G4s per stem cell type at c the binding sites of cell-specific transcription factors (TF); d super-enhancer elements as defined in Hnisz et al. 34 and Wilderman et al. 90 ; e promoter (bait) and promoter-interacting regions (PIRs) from promoter-capture HiC experiments (as defined in Freire-Pritchett et al. 40 . Centre and left panel: hESCs active promoters (H3K4me3 and/or H3K27ac), poised promoters (H3K4me3 and H3K27me3), repressed promoters (H3K27me3) and background (none). Right-most panel: NSC promoter-promoter PIRs and promoter-non-promoter PIR interactions of developmentally analogous H1-hESC-derived neural progenitor cells from Jung et al. 41 ; f binding sites of chromatin architecture proteins involved in promoter-enhancer looping. (p < 8E-14) higher than that of non-DE genes. This reveals that genes that lose, acquire, or never had a promoter G4 tend to have greater transcriptional changes during differentiation.
Promoter-related genetic and epigenetic features 48 , including H3K4me3 49 , are known to play a role in minimising transcriptional variability within a cellular population. This is thought to ensure the consistent expression of key genes when confronted with a fluctuating environment e.g during cellular differentiation. While we also observed transcriptional stabilisation for genes that maintain H3K4me3 49 (R 2 = 0.75 (CNCCs), 0.70 (NSCs)) and chromatin accessibility (R 2 = 0.75 (CNCCs), 0.74 (NSCs)) surprisingly the effect observed for G4s was markedly stronger (R 2 = 0.84 (CNCCs), 0.85 (NSCs); Supplementary Fig. 13c, d). Gene Ontology (GO) enrichment analysis for genes that are not differentially expressed and were G4 E +G4 D + upon differentiation, revealed essential cellular programmes as highest-ranking terms (FDR < 0.05) including splicing (e.g. SNRNP70, HNRNPA2B1, SF3B1), cell cycle (e.g. CUL3, PML), metabolism (e.g. VCP, PHF23), translation, ubiquitination (e.g. DERL3, CUL3) and chromatin maintenance (e.g. KANSL2, KDM1A; Fig. 4d, e). For selected genes, genome browser views and validation of G4 presence and gene expression by ChIP-PCR and RT-PCR respectively are presented in Supplementary Figs. 14 and 15). There was substantially lower enrichment in these pathways for genes with a G4 E −G4 D − promoter signature. KEGG, Reactome and Wikipathway functional analyses also confirmed these findings (Supplementary Data 2 and 3). Overall, these results show that G4 maintenance may be an important chromatin feature linked to the transcriptional stabilisation of genes essential for key cellular functions.
Promoter G4 landscape changes are linked to stem cell identity. Genes that acquire a promoter G4 (G4 E −G4 D +) during differentiation generally have increased expression in the daughter cell compared to hESCs ( Supplementary Fig. 10c, d). For CNCCs, the highest-ranking significant GO terms for this category were related to cell specification and include the key CNCC regulators TWIST1, TWIST2 and SNAI2 29 (Fig. 5a, b and Supplementary Figs. 15 and 16a-e). Likewise, for NSCs, many genes with increased expression that acquire a G4 were related to neurodevelopmental pathways such as Notch signalling e.g. DLL1 and HES1 essential for NSC maintenance 50 , SOX11 essential for NSC fate specification 51 and RFX4, which when disrupted leads to neurodevelopmental disease 52,53 (Fig. 5c, d and Supplementary Figs. 15 and 16a-e). Acquisition of G4s during differentiation may therefore play a role in lineage specification. Conversely, genes that lose a promoter G4 (G4 E +G4 D −) upon differentiation are generally down-regulated in the daughter cell ( Supplementary  Fig. 10c, d). For this latter category, top-ranking GO terms include genes associated with cellular tight junctions (e.g. hESCspecific CHD1 54 ) and alternative developmental lineages (e.g. GO: blood circulation and GO: circulatory system processes), and genes (e.g. PRDM14 55 and TERT 56 ) and signalling pathways (MAPK, RAP1, RAS and PI3K-AKT) important for hESC pluripotency and self-renewal 57 (Supplementary Figs. 15 and 16f-h). Thus, some G4s only exist in the hESC cell state and are specifically linked to the expression of key pluripotency genes whose expression is subsequently lost upon differentiation.
G4 stabilisation with small molecules delays hESC differentiation. For differentiation to proceed, the pluripotency transcriptional network that maintains self-renewal must be dismantled and lineage-specific transcriptional programmes activated 58 . G4s frequently occur in stem cell regulatory regions with many G4s being lost during differentiation, concomitant with the silencing of their associated genes. If G4s are coupled to gene expression, then introducing a factor that preserves G4 as folded structures in hESCs might present a barrier to differentiation. We tested this hypothesis by adding a small molecule that stabilises G4 structures to hESCs. OCT4-EGFP-expressing hESCs were differentiated into CNCCs 59 in the presence or absence of different concentrations of PhenDC3, a small molecule selective for G4s 60 (Fig. 6a), which does not result in detectable DNA damage or cell cycle check point arrest upon cellular treatment ( Supplementary Fig. 17a-e). Exit from pluripotency was monitored by measuring the loss of OCT4-EFGP by flow cytometry and IF, while CNCC differentiation was determined by IF microscopy for SOX10, a definitive marker of CNCCs 29 (Fig. 6b, c and Supplementary Fig. 17f, g). Loss of OCT4 (~50 and 90% at day 3 and 5 respectively) and gain of SOX10 (~5 and 40% at day 3 and 5 respectively) expression was first confirmed during differentiation in DMSO controls ( Fig. 6b-d). From day 3 of differentiation, there was a PhenDC3 dose-dependent significant (p < 0.01) increase in the proportion of live cells (6-37%) expressing OCT4-EGFP compared to DMSO-treated controls ( Supplementary Fig. 17f, g). This result was confirmed by IF analyses: PhenDC3 treatment resulted in 2-and 4-fold more OCT4 expressing cells on day 3 and 5, respectively (Fig. 6b, c), which was accompanied by up to a 4-fold decrease in the induction of SOX10-expressing cells (~10%) compared to DMSO controls (41%) (Fig. 6c, d). Global transcriptome analysis by RNA-seq verified that differentiating hESCs cultured in PhenDC3 had a transcriptional signature more similar to undifferentiated hESCs than to the differentiated controls ( Fig. 6e and Supplementary Fig. 18). Compared to DMSO and non-treated controls, treatment with PhenDC3 lead to fewer genes being differentially expressed during differentiation (37% less at day 3 and 43% less at day 5). Taken together these results suggest that artificial G4 stabilisation in hESCs poses a barrier to CNCC differentiation. Indeed delayed differentiation was also observed when using each of two structurally distinct G4-selective stabilising molecules: Nmethyl mesoporphyrin IX (NMM) 61,62 and 12459 63 (Supplementary Fig 19). This suggests that it is G4 stabilisation, rather than non-specific effects by a given molecule, that are the cause of delayed differentiation. High G4 abundance in hESCs appears to be associated with the pluripotent state, whereas dynamic changes in the G4 landscape are coupled to the transcriptional reprogramming that takes place during differentiation. The majority of daughter cell G4s are located at genomic loci that were also found in the embryonic state, suggestive of key common functions. During differentiation, we reveal that promoter G4 'maintenance' stabilises transcriptional levels of genes for essential cellular/homeostasis functions and is associated with important developmental changes in the histone modification   landscape. In addition to highlighting G4s as a feature associated with active promoters, our work has also revealed folded G4 structures are features of many bivalent promoters. To investigate a potential causative link between G4s and lineage commitment, we perturbed the endogenous G4 landscape in hESCs using G4 stabilising small-molecule ligands. This resulted in a differentiation delay due to failure of pluripotency exit and suggests that the high abundance of G4s in hESCs relative to CNCCs and NSCs acts to maintain the pluripotent state.
Our work provides insights into how the transcription of key lineage specification and essential cellular functions/homeostasis pathways are maintained during differentiation in the face of transcriptional noise 64 . Genes that maintain a G4 promoter upon differentiation fluctuate less in expression from the hESC state, as compared to genes that lose or gain a G4. Although a similar phenomenon is seen when promoters maintain their H3K4me3 and open chromatin status alone, this effect is much weaker than that of G4s. We propose, therefore, that promoter G4s that are 'maintained' from hESCs to daughter cells upon differentiation sustain expression of associated genes with less transcriptional variability. This effect is independent of the magnitude of gene expression; thus, a consistent explanation is that the G4 may have a structural role in helping keep the underlying chromatin state permissible for transcription. This is further supported by our recent findings in transformed cells showing that promoter G4 formation precedes transcription 65 . Indeed, we found that G4s preserved from hESCs to daughter cells travel with the H3K4me3 histone modification during differentiation whereas G4 loss is correlated with loss of chromatin accessibility. Additionally, we note that stem cell G4s form not only in gene promoters but also at enhancers, super enhancers and sites of chromatin looping interactions. Thus, G4s may be potentially important elements that modulate 3D chromatin organisation to promote transcriptional consistency during differentiation.
Histone modifications are dynamic features important in the control of gene expression and differentiation 66 and here we provide evidence that G4 structures may act as an additional layer within an epigenetic regulatory system. For example, we find that hESC bivalent promoters which keep their G4 status or gain a G4 upon differentiation are more likely to transition to an active H3K4me3 promoter state in the daughter cell. Compared to bivalent promoters that lack a G4, bivalent promoters that are marked with a G4 have higher levels of H3K4me3 and thus chromatin accessibility. Our findings may extend mathematical modelling that predicts chromatin bivalency at CpGI exists as a "bistable" system, frequently switching between active and silent chromatin states 67 . For instance, G4 folding/unfolding could theoretically provide a rapid and efficient mechanism to modulate between active and silent bivalent states. Indeed, G4s have been identified as binding sites for effector proteins with histone-modifying activities (e.g. ATRX and LSD1 9 ). Recent findings in zebrafish using antisense oligonucleotides directed against predicted promoter G4s lends further support for a developmental role of G4s as the resulting embryo phenotypes were similar to those of embryos deficient in the targeted genes 68 . Further work will be necessary to understand the detailed mechanistic interplay between the G4 and histone landscapes and subsequent transcriptomic changes.
In the future, it will be important to understand how G4 formation and loss is regulated during differentiation. SWItch/ Sucrose Non Fermentable (SWI/SNF) chromatin remodeler complexes, which deconstruct the pluripotency chromatin landscape during neurogenesis 69,70 may be directly involved. Members of this family (e.g. SMARCB1 and SMARCA2) were uncovered in genetics screens for G4-interactors 71 and recent work has demonstrated that these proteins preferentially recognise G4 structures in vitro 72 and localise to G4 structures in cancer chromatin 73 . It is interesting to note that perturbations in SWI/SNF complexes contribute to 20% of human cancers 74 and is immediately suggestive of a mechanism of how G4s may become reactivated in cancer.
In conclusion, our work has highlighted important and unanticipated roles for G4 DNA secondary structures in stem cell pluripotency and cell fate specification (Fig. 6f). As G4s presage the transitioning of epigenetic and transcriptional landscapes that occur during differentiation, G4s should now perhaps be considered epigenetic features in their own right.
RNA-Seq and library preparation. Total RNA was extracted from 1 × 10 6 cells using the RNeasy Plus Mini Kit (cat. #74136, Qiagen) as per manufacturer's instructions. RNA quality (RIN > 9) was checked via Bioanalyzer and RNA concentration determined using a Qubit fluorometer Libraries were generated from 500 ng RNA using the Truseq Stranded mRNA kit (cat. # 20020595, Illumina) as per manufacturer's instructions and sequenced on a HiSeq 4000 (Illumina) platform (50 bp single-end).
Sequencing data processing, G4 and gene-expression differential analysis. After quality controls checks with fastQC (version: 0.11.7) 78 , all fastq files have Fig. 5 Promoter G4 landscape changes are related to stem cell identity. a-d Top 20 significant (FDR < 0.05) biological processes obtained from Gene Ontology enrichment analysis (GO:BP) performed with g:Profiler 84 for genes showing increased expression in a CNCCs vs hESCs (FDR < 0.05, Log2FC > 1) that gain a promoter G4 (G4 E -G4 D +; n = 136 genes) and c in NSCs vs hESCs (FDR < 0.05, Log2FC > 1) that gain a promoter G4 (G4 D -G4 D +; n = 37 genes). Number of genes in intersection for each term shown in brackets. See Supplementary Data 2 and 3 for full list of GO and KEGG terms. Genome browser view for representative gene promoters demonstrating G4 E -G4 D + promoter status after hESC differentiation into b CNCCs or c NSCs. Yellow shading highlights location of cell-specific G4 across open chromatin sites (ATAC) and OQs 13 . Grey shading highlights G4s common to both CNCCs and NSCs. Heatmap showing median gene expression (Log10(TPM + 0.01)) is shown for the indicated genes. N = 5, 3 and 4 biologically independent samples for hESCs, CNCCs and NSCs, respectively. e, f Top 10 KEGG terms for genes with increased expression (FDR < 0.05, Log2FC > 1) in hESCs compared to CNCCs (n = 1,128 genes) or NSCs (n = 1175 genes) which lose a promoter G4 (G4 E + G4 D −) (e). Genes in bold represent those in common in NSC and CNCC differentiation. f Genome browser view for representative gene promoters demonstrating G4 E + G4 D − promoter status after hESC differentiation. G4-ChIP-seq and ATAC-seq reads were aligned to hg38 with bwa-mem (G4-ChIP-seq: bwa mem with default options, ATAC-seq: bwa mem -M -t 12; version: 0.7.17-r1188). Next, bam files with aligned reads were generated by using samtools (samtools view -Sb -F780 -q 10 -L; version 1.8) and they were sorted. Deduplication of the resulting mapped reads was performed with Picard MarkDuplicates (version: 2.20.3, http://broadinstitute.github.io/picard, REMOVE_DUPLICATES = true).
For ATAC-seq data, fragment size distribution was estimated using uniquely mapped reads bams with Picard CollectInsertSizeMetric. The amount of mitochondria contamination was checked by counting the number of unduplicated reads mapping to ChrM.
For both G4-ChIP-seq and ATAC-seq, regions with local enrichments (G4) or accessibility (ATAC) were identified using MACS2 (G4-ChIP-seq options: --broad, Cell G4 consensus regions were compared to the same cell ATAC consensus (ATAC) and to in-vitro observed G4 quadruplex (OQs 13 ) by evaluating the percentage of overlaps.
Differential G4-binding was done with edgeR (version 3.26.8). Library size and read coverage at a multi-cell G4 consensus (merging all three cells individual G4 consensus) regions were computed. Then, a generalised linear model (glmLRT) with default parameters (negative binomial log-linear distribution of read counts) was used to assess regions with the differential binding signals. Specifically, each cell was compared to one of the other two in turn and regions with differential signal were those with FDR ≤ 0.05.
RNA-seq trimmed reads were aligned to hg38 by using rsem (rsem-calculateexpression --phred33-quals -p 40 --bowtie2, version 1.3.1 77 ) and files containing genes and isoform level expression estimates were produced. Genes level expression estimates files (genes.results) were used for the subsequent analysis. When technical replicates were present, TPM (transcript per million) from the expected counts from the gene levels estimate files (genes.results) were averaged. As a quality control, the similarity across technical replicates was assessed by computing correlation between TPM and Pearson correlation was R > 0.95. Similarity across biological replicates was assessed by hierarchical clustering (experiments) paired to k-mean (promoters, with k = 2) by using Heatmap from the complexHeatmap R package (hierarchical clustering using Euclidean distance and average linkage, R version: 3.6.1 (2019-07-05)).
The differential gene expression analysis between each pair of cells (hESCs versus CNCCs, hESCs versus NSCs and NSCs versus CNCCs) was performed with EdgeR 81 using the expected_counts from the gene.results files generated during the alignment step. For each pairwise comparison, genes with differential expression (FDR < 0.05 and abs(log2FC)>1) were called using the function glmQLFit taking into account the presence of multiple biological replicates in the experimental design matrix (5 for hESCs, 3 for CNCCs and 4 for NSCs). For each cell type expression levels have been summarised as the median TPM observed across all the biological replicates. The differential gene expression analysis for samples in the PhenDC3 experiment was performed as above with differential expression cutoff FDR < 0.05 and abs(log2FC)>2.
External histone modifications ChIP-seq processing. Histone modification ChIP-seq maps were downloaded from the Sequence Read Archive database (SRA) using fastq-dump (version: 2.10). Adaptor trimming and alignment was performed similarly to the G4-ChIP-seq. The list of data used by cell type is the following. hESCs: • H3K4me3: GSM602296 Peak calling was performed with macs2 (default options for H3K4me3, --broad option for H3K27me3). When replicates were present, the consensus regions were defined as loci observed in at least two replicates. Genome-wide track files were generated by normalising read coverage by library size (RPM). Summarised RPM signal per cell type and mark was calculated as the median RPM G4 signal across replicates.
Other external data used for comparative analysis. Maps used for comparative analysis were taken from previous studies and public databases. When needed, genomic locations were converted into bed files and genomic coordinates from previous genome versions were lifted to hg38 by using USCS liftOver tool (command line at: https://genome-store.ucsc.edu).  Fig. 6 G4 structure stabilisation delays hESC differentiation. a Schematic of G4-ligand treatment experiment: OCT4-EGFP expressing hESCs were differentiated into CNCCs using a 2-day pulse using GSK3β (CHIR99021) and ROCK inhibition (Y-27632) 59 . Samples were taken at the indicated times to determine the percentage of cells positive for the pluripotency marker OCT4 (by flow cytometry and immunofluorescence microscopy (IF)) or SOX10 (by IF; lineage marker for CNCC) or RNA-seq. Insert chemical structure of the G4-specific ligand PhenDC3. b-d Proportion of b OCT4 and d SOX10 positive cells determined in IF studies. N = 52 fields of view per sample. One biological replicate (see Source Data for detailed cell counts). *: p < 0.05, one-sided Pearson's χ 2 test for proportions (see Source Data for exact p-values and number of cells analysed). See Supplementary Fig. 17 for additional biological replicates. c Representative confocal IF images of differentiating OCT4-EGFP hESCs treated with either DMSO or PhenDC3. SOX10 = red, OCT4 = green and DAPI (nuclear stain) = blue. Scale bar = 100 μm. e Heatmap showing gene expression (Log10 median gene expression (TPM + 0.1)) of the indicated pluripotency and cranial neural crest genes in hESCs and hESCs treated with cranial neural crest induction media alone (Control) or with DMSO or 2 μM PhenDC3 at 3 and 5 days of differentiation. f Schematic of the main findings of our study: G4 abundance correlates with stem cell plasticity (top left); G4 presence is associated with particular promoter histone landscapes in differentiation (top right); and maintenance of a promoter G4 from hESCs to differentiated daughter cells preserves the transcriptional output of a gene independent of transcription levels (bottom).  Characterisation of G4 fold-enrichments at sites of interests. G4 foldenrichments over random chance were evaluated at various sites of interest by using the Genomic Association Tester (GAT, https://gat.readthedocs.io/en/latest/ contents.html, 1000 randomisations) restricting the analysis the human whitelist.
Promoter annotations. Promoters were annotated based on various criteria: epigenetic data, G4 enrichment data in individual cell types and transitions of G4 enrichments from hESCs to one daughter cell (CNCCs or NSCs).
Promoter annotations based on epigenetic maps: • H3K4me3: promoter overlaps with at least one H3K4me3 region; • H3K27me3: promoter overlaps with at least one H3K27me3 region; • Bivalent: promoter overlaps with regions where both H3K4me3 and H3K27me3 colocalise; • Other: promoter overlaps with both H3K4me3 and H3K27me3 but the two marks do not colocalise; • Unmarked: promoter does not overlap with H3K4me3 or H3K27me3.
Promoter annotations based on G4 enrichments (in cell of interest): • G4 positive: promoter overlaps with at least one G4 cell-consensus peak.
• G4 negative: promoter does not overlap with any G4 cell-consensus peaks.
Promoters annotations describing how G4 enrichments transitioned from hESC (E) to each of the two daughter cells (D: daughter i.e., CNCC or NSC): • G4 E + G4 D −: G4 in hESC, no G4 region in daughter cell (G4 is lost in daughter cell).
• G4 E + G4 D + : G4 in both hESC and daughter cell (G4 is maintained from hESC to daughter cell).
Integration of expression levels, G4 signal levels, histone modification overlaps and G4 enrichment overlaps across all three stem cell types. Promoters annotations were combined into a single matrix together with: • median TPM across biological replicates for each individual cell type; • median RPM G4 signal at across replicates; • number of overlaps between promoter region and each of the three consensus G4 maps (hESCs, CNCCs, NSCs). All pie charts and alluvial plot are derived by querying this matrix with different filters. Difference in proportions was tested by using the proportion test based on the chi2 statistic (R function prop.test(), Pearson's χ 2 test for proportions).
Analysis of gene expression levels across promoter classes. Differences in expression levels across promoter classes (epigenetic and G4-based classification) were assessed. Genes were first defined by their "epigenetic"-classes (H3K4me3, H3K27me3, bivalent and unmarked) and subsequently stratified by the presence or absence of promoter G4 (G4+ and G4−) resulting in a total of distinct 8 groups. For each of the eight groups, we tested difference between the distribution of expression levels (median TPM) by using the Kolmogorov-Smirnov test (ks.test() function in R).
To assess whether gene expression levels directly recapitulate the G4 landscape, gene expression (median TPM) levels and G4 signal (median across replicates) at the promoters of the entire set of 58381 genes were calculated. Specifically, for each gene, a 6 element-long vector (one expression value for each of the 3 cells and one G4 promoter signal level for each of the 3 cells) was built. Values above 75th quantile (by column) were capped to control for outliers' effects and the final matrix was standardised by column (z-score). K-medoid clustering of the genes identified six different groups. To summarise global expression levels and G4 levels distributions in each cluster, 6 sets of boxplots were produced (3 boxes per cluster, where each box represents one stem cell type).
Analysis of transcriptional stabilisation. Transcriptional stabilisation and variability of individual genes between hESC (E) and differentiated cells (D, daughter) were explored in relation to the G4 promoter signatures with two orthogonal approaches.
Firstly, gene expression (TPM) levels were directly compared between hESCs and daughter cells. Only genes with TPM > 0 in at least one of the two cells under investigation were considered. For each promoter group (G4 E +G4 D −, G4 E +G4 D +, G4 E +G4 D +, G4 E −G4 D −), we fitted a weighted linear regression to model the relationship between the two sets of expression levels. The weights used in the fitting are expression levels of hESCs that represent conceptually the reference starting condition. Residuals of the data from the regressed model were computed and used to quantify the spread of the transcriptional variability. After each fitting step, the coefficient of "goodness of fit" R 2 was computed and the F-test was used to assess if there were significant differences in transcriptional stability between pairs of promoter groups. A similar analysis was performed using our ATAC-seq data and published H3K4me3 ChIP-seq data 6,83 . In each of the two cases, promoters were divided into four groups based on the presence/absence of the mark at promoters in hESCs/ daughter cells. The ranking of R 2 values was used to cross compare and determine which feature (G4, accessibility or H3K4me3) had a greater impact on stabilising expression when the feature is transmitted from hESCs to daughter cells.
Gene ontology and gene enrichment analysis. G:Profiler 84 web interface was used to determine enriched Gene Ontology (GO), Biological Processes (BP) terms, KEGG pathways, Reactome pathways and Wikipathways using the g:GOSt Functional Profiling analysis (using right-sided (enrichment) hyper-geometric test and Benjamini-Hochberg adjustment of the resulting p-value. All known genes and only experimental evidence codes (EXP, IDA, IPI, IMP, IGI, IEP) were used. Term size was limited to 20-450 genes.
G4 de novo motif discovery. De novo motif discovery was performed on the FASTA sequences (bedtools getfasta) extracted from the G4 consensus regions (pre-filtered 5-2000 bp) using MEME-ChIP 85 web-interface (options: Enrichment mode Classic; Set of known motifs: Eukaryote DNA, Human and Mouse (HOCOMOCO v11 FULL), order-1Background, MEME Site Distribution: 0 or 1 occurrence, MEME motif count: 3 and MEME Motif width: 6-30 wide (inclusive). Occurrences of selected motifs across G4 consensus regions were quantified by performing a dedicated search with FIMO, Find individual motif occurrences 86 . Density was calculated by dividing motif occurrence/residues. G4 immunofluorescence microscopy. Human stem cells were plated onto removable three-well chamber slides (cat. # 80381, Ibidi) pre-treated with the corresponding coating: Matrigel (hESCs), Geltrex (NSCs) or 7.5 μg/ml fibronectin (CNCCs). Cells were fixed in 4 % (w/v) paraformaldehyde pH 7.4 in PBS and permeabilised with PBS/0.1 % (v/v) Triton-X, with incubation for 10 min at RT for each step. Slides were blocked with 5% normal goat serum (cat. # ab7481, Abcam), 0.5% Tween20 in PBS for 1 h at 37°C and then directly incubated with 50 nM BG4 in blocking buffer for 1 h at 37°C. After 3 × 5 min washes with PBST (0.1 % (v/v) Tween20), slides were incubated with rabbit anti-FLAG (1:800, cat. # 2368, Cell Signaling Technology) for 1 h at 37°C. After washing, slides were incubated for 30 min at 37°C with goat anti-rabbit 647 secondary antibody (1:500, cat. # A32733, Invitrogen). Following three 5 min washes with PBST (DAPI counterstain (0.5 μg/ mL) was included in the second wash), chambers were removed and slides mounted to #1.5 coverslips using ProLong Diamond Antifade Mountant (cat. # P36961, ThermoFisher). 3 biological replicates per cell line were performed. Confocal z-stacks images (13 steps) were acquired using a Leica TCS SP8 microscope with a HC PL APO CS2 1.4 NA ×100 oil objective (Leica Microsystems) at a scan speed of 400 Hz and sampling rate of 0.11 μm × 0.11 μm × 0.30 μm. The DAPI channel was excited using 405 nm diode laser (at 405 nm) and the white-light pulsed laser (SuperK EXTRENE, NFT Photonics) was used to excite the secondary antibody fluorophore (at 647 nm). Fluorescence detection as performed sequentially with hybrid detectors (Leica HyD Photon Counter) at wavelength ranges of 657-733 nm and 415-525 nm for Alexa Fluor 647 and DAPI, respectively. The pinhole was set to one airy unit and laser power and gain settings were kept constant between samples and biological replicates. Images were deconvolved using Huygens Professional Software (Scientific Volume Imaging BV). Representative images were processed in the open source imaging software FIJI 87 (version 2.0.0-rc-69/1.52p) and ICY (version 1.0; http://icy.bioimageanalysis.org) 88 and assembled using Adobe Illustrator 2019. G4 signal density (sum G4 signal/ sum DAPI signal per nucleus), calculated on the sum projection of the Z-stack, was performed using FIJI followed by ICY. First in FIJI, nuclear regions of interest (ROIs) were generated from the DAPI channel using a Gaussian Blur (sigma = 3), Huang dark automatic thresholding and Watershed function to separate touching objects. Nuclei that did not segment properly were removed manually using the analyse particle function. ROIs were then used in the ICY protocol (ICY_microscope_i-mage_analysis_file found on lab github website) using all specified parameters to extract G4 and DAPI signals.
G4 selective ligands. The bisquinolinium phenanthroline derivative PhenDC3 was synthesised in-house according to the protocol outlined in De Cian et al. 60 . The triazine derivative 12459 was synthesised in-house according to the protocol outlined in Douarre et al. 63 and N-methyl mesoporphyrin IX (NMM) 61,62 was purchased from Merck cat. #258806. All G4 ligands were dissolved in DMSO to 10 mM stock solutions.