A CpG island promoter drives the CXXC5 gene expression

CXXC5 is a member of the zinc-finger CXXC family that binds to unmethylated CpG dinucleotides. CXXC5 modulates gene expressions resulting in diverse cellular events mediated by distinct signaling pathways. However, the mechanism responsible for CXXC5 expression remains largely unknown. We found here that of the 14 annotated CXXC5 transcripts with distinct 5′ untranslated regions encoding the same protein, transcript variant 2 with the highest expression level among variants represents the main transcript in cell models. The DNA segment in and at the immediate 5′-sequences of the first exon of variant 2 contains a core promoter within which multiple transcription start sites are present. Residing in a region with high G–C nucleotide content and CpG repeats, the core promoter is unmethylated, deficient in nucleosomes, and associated with active RNA polymerase-II. These findings suggest that a CpG island promoter drives CXXC5 expression. Promoter pull-down revealed the association of various transcription factors (TFs) and transcription co-regulatory proteins, as well as proteins involved in histone/chromatin, DNA, and RNA processing with the core promoter. Of the TFs, we verified that ELF1 and MAZ contribute to CXXC5 expression. Moreover, the first exon of variant 2 may contain a G-quadruplex forming region that could modulate CXXC5 expression.

www.nature.com/scientificreports/ with an electrophoretic migration of approximately 2500 nt in length similar to that of the annotated transcript variant 2, 2601 nt, while, as expected, a single GAPDH transcript of about 1500 nt was detected (Fig. 1f).
Because of these findings, we predicted that the promoter of CXXC5 resides in a transcript variant 2 region that encompasses a transcription start site (TSS). To identify the TSS(s), we used the 5′ Rapid Amplification of cDNA Ends (5′RACE) approach, which is designed to amplify nucleic acid sequences from a messenger RNA (mRNA) template between a defined internal site and unknown sequences at the 5′-end of the mRNA through the use of an adaptor RNA probe 49 . Although prone to biases introduced by various factors, including RNA secondary structures, G-C nucleotide content, and adaptor ligation efficiency 50 , 5′RACE has been successfully used for the identification of 5′-ends of numerous RNA transcripts 51 . We also used TFF1, a well-studied estrogen-responsive gene 19,52 , as a control for 5′RACE studies. 3′RACE was also used for the identification of 3′-transcript sequences of both CXXC5 and TFF1 transcripts.
The 3′RACE approach readily identified the 3′ end of the CXXC5 or TFF1 transcript (Supplementary Information, Fig. S1). 5′RACE of CXXC5, in contrast to that of TFF1 which generates a transcript with a single TSS 53 (Supplementary Information, Fig. S1), proved to be difficult, likely due to the high GC content (> 70%) of Exon3 and surrounding sequences. Nevertheless, our results based on the sequencing of PCR amplicons generated from cDNA libraries of MCF7 cells indicated that several 5′-ends of transcript variant 2 can be detected, suggestive of multiple TSSs (Fig. 2a). These results imply the presence of a transcription start region for transcript variant 2 rather than a distinct TSS.
Collectively, our results indicate that transcript variant 2 of CXXC5 composed of Exons 3, 10, and 11 is the main transcript in MCF7 and HL60 cells.
The CXXC5 promoter is located in a DNA segment encompassing the beginning of Exon3. Based on the similar results obtained in MCF7 and HL60 cells, we assessed the promoter activity of the putative promoter region of CXXC5 by generating a PCR amplicon of 1975 bp in length from MCF7 genomic DNA that includes 5′ upstream regions of Exon3, the entire Exon3, and Exon4 (Fig. 2b). The PCR amplicon was inserted into a reporter vector, pGL3-Basic, which has no promoter but bears the Firefly Luciferase cDNA as the reporter enzyme. We also used a reporter vector bearing the estrogen-responsive TFF1 gene promoter 19 as control. We found in transiently transfected MCF7 cells that the reporter enzyme activity from the putative CXXC5 promoter region, as from the TFF1 promoter, was significantly higher compared to that observed with the reporter vector bearing no promoter, from which the enzyme activity is set to one (Fig. 2c). To decipher the core promoter elements of the putative CXXC5 promoter, we generated sequential truncations at the 5′- or 3′-end of the region by PCR and inserted them into the reporter vector. Results from transiently transfected MCF7 cells indicated that DNA sequences of Exon3 produce the highest reporter activity (Fig. 2c). Further truncations and/or internal deletions as Segments (A-D) of Exon3 revealed that Segment A, corresponding to, and including the 5′ surrounding sequences of, Exon3 (Supplementary Information, Fig. S2) retains the promoter activity (Fig. 2d,e). Interestingly, Segment C alone suppresses (Fig. 2e), and in the presence of other segments lessens (Fig. 2d), the activity of the reporter enzyme. This suggests that Segment C alone includes DNA elements adversely affecting transcription. In keeping with this prediction, the genetic fusion of Segment C to the 3′-end sequences of the TFF1 promoter or of the strong human cytomegalovirus (CMV) promoter effectively repressed the Luciferase enzyme activity (Fig. 2f), in contrast to Segment D, which has minimal effects on the reporter activity induced by the CMV promoter. These results suggest that the core promoter elements of CXXC5 reside in Segment A.

Figure 1 caption (partial): The presence of CXXC5-transcript variants in MCF7 and HL60 cells was assessed with multiple rounds of PCRs with progressively nested primers specific to a variant together with a primer specific to Exon10, which is common to all CXXC5-transcript variants, followed by cloning and sequencing of the PCR amplicons. (d) Screenshots of the expression levels of CXXC5-transcript variants in healthy breast tissue and blood annotated by GTEx Portal. (e) qPCR of cDNA libraries was used to assess the expression levels of transcript variants in MCF7 and HL60 cells using primer sets also utilized in evaluating the presence of transcript variants. Asterisk (*) indicates the highest expression among transcript variants. (f) Northern blot analyses of RNA samples from MCF7 or HL60 cells. A biotinylated probe complementary to Exon10/11 present in all transcript variants and a biotinylated probe targeting Exon3 were used for the detection of the CXXC5 transcripts. A GAPDH targeting probe was also used for the detection of the GAPDH transcripts as control. The molecular ladder (nt) is indicated.
The methylation state of the putative CXXC5 promoter region. Based on the conclusion that transcript variant 2 is the main CXXC5 transcript in both MCF7 and HL60 cells, we initially carried out in silico analyses of a genomic region, about 1500 bp in length, of the CXXC5 locus, wherein Exon3 is situated, as the putative promoter region (Fig. 3a). The nucleotide sequence of the region revealed (1) a remarkably high (> 70%) G-C content (https://www.biologicscorp.com/tools/GCContent/), (2) greatly enriched CpG dinucleotide repeats (https://www.biologicscorp.com/tools/GCContent/), and (3) an asymmetric GC distribution, or GC skew, which is used as a measure of DNA strand asymmetry in the GC nucleotide distribution (http://genskew.csb.univie.ac.at/GenSkewServlet) as a property of CpG islands 54 3,4,56 . CGI promoters, which often define promoters of housekeeping, developmental and tissue-specific genes, show a transcriptionally permissive state, within which transcription initiation can occur at several closely spaced locations 3,4,56 . To examine the methylation state of Exon3 and the surrounding region including the putative CXXC5 promoter segment, we explored a targeted methylation profile of the region, as well as of Exon10 as a control for the methylated gene body of the CXXC5 locus, using bisulfite sequencing. Genomic DNA of MCF7 cells was subjected to bisulfite reaction to convert unmethylated cytosine residues to uracil followed by bisulfite PCR. PCR amplicons generated with bisulfite primers were cloned and sequenced. Sequences were then aligned to the genomic sequence of the corresponding CXXC5 regions using QUMA 57 (http://quma.cdb.riken.jp/). Results from MCF7 (Fig. 3b) as well as HL60 cells (Supplementary Information, Fig.
S3) indicated that the 5′-upstream region of Exon3 shows a high degree of CpG methylation, which declines precipitously thereafter and remains largely unmethylated throughout the region including Exon3 (Fig. 3b) and Exon4 (data not shown). This contrasts with Exon10, which is highly methylated (Fig. 3b). In common with all eukaryotic promoters, unmethylated CGI promoters also possess a nucleosome-free region surrounding TSSs 58 and contain dispersed nucleosomes decorated with H3K4me3, which marks active transcription 59-61 . To assess the nucleosome occupancy at the DNA region including the putative CXXC5 promoter elements, MCF7 cells were fixed, permeabilized, and subjected to Micrococcal Nuclease (MNase) for chromatin digestion (Fig. 3c). DNA was subsequently purified and analyzed for digestion patterns with agarose gel electrophoresis. DNA fragments corresponding to tri-nucleosomal and mono-nucleosomal DNA were excised from the gel and purified. The fragmented DNA, or the uncut genomic DNA of MCF7 cells as control, was used as the template for PCR to assess the presence of nucleosomes at Exon3. For initial analyses, five overlapping regions (depicted as T1-5, Fig. 3d) were subjected to PCR using the tri-nucleosomal DNA template with the region-specific primer pairs. For further verification, three sub-regions of Exon3 (depicted as M1-3, Fig. 3d) were also analyzed using the mono-nucleosomal DNA. The detection of a PCR amplicon from fragmented DNA compared to genomic DNA suggests the presence of nucleosomes. Results with tri- (Fig. 3e) or mono-nucleosomal (Fig. 3f) DNA template revealed that the 5′-surrounding sequences of Exon3 and Segment A are primarily nucleosome-deficient and the remaining segments of Exon3 contain nucleosomes.

Figure 3 caption (partial): The methylation state of the CXXC5 promoter region (−930 to −489 and −179 through Exon3; +1 indicates the beginning of Exon3) together with the 3′-end of Intron9 and Exon10 (−195 through +330; +1 marks the beginning of Exon10) as controls was examined with bisulfite sequencing. Isolated genomic DNA of MCF7 cells was subjected to bisulfite reaction for the conversion of unmethylated cytosine residues to uracil followed by bisulfite PCR. PCR amplicons produced with bisulfite primers were cloned and sequenced. Sequences aligned to the corresponding CXXC5 regions are depicted as a lollipop distribution. Filled circles indicate methylated and empty circles denote unmethylated CpG dinucleotides. (c) Nucleosome occupancy at the CXXC5 promoter elements was assessed with Micrococcal Nuclease (MNase) assay. MCF7 cells were fixed, permeabilized, and treated without (0) or with 500 or 1000 gel units (GU) of MNase for 15 or 20 min at 37 °C for chromatin digestion. Isolated DNA was analyzed with agarose gel electrophoresis. (d) Schematics of regions subjected to PCR. Isolated DNA fragments corresponding to (e) tri-nucleosomal (T1-5) and (f) mono-nucleosomal (M1-3) DNA were subjected to PCR using region-specific primer pairs. (g-i) ChIP analysis of Exon3. MCF7 cells, following chromatin digestion with MNase, were subjected to ChIP using species-specific IgG or an antibody specific to total H3 (g) or H3K4me3 (h). Isolated DNA following precipitation with Protein A/G conjugated magnetic beads was subjected to qPCR using region-specific primer sets. Asterisk (*) denotes significant changes depicted as fold change compared to IgG. (i) MCF7 cells fixed, lysed, and sonicated were subjected to ChIP using IgG, PolII, or ser5-phosphorylated PolII antibody followed by precipitation with Protein A/G conjugated magnetic beads. DNA samples were then used for qPCR with primer sets specific to Segment A or the promoter of GAPDH as control. Shown are the mean ± SE of three independent experiments performed in triplicate. Significant differences depicted with an asterisk (*) are shown as fold change compared to IgG.
To verify this finding, we carried out ChIP of Exon3 (Fig. 3g,h). MCF7 cells processed for chromatin digestion by the use of MNase, as described for nucleosome occupancy, were subjected to ChIP using an antibody specific to H3 (Fig. 3g) or tri-methylated histone H3 lysine 4, H3K4me3, (Fig. 3h), a histone modification used as a marker for actively transcribed genes 60 . Purified DNA was then subjected to qPCR using primers specific to Segments of Exon3. We found that Segment A is indeed devoid of H3 but the remaining segments of Exon3 bear H3 decorated with K4me3 modification. We also observed the presence of an active PolII at Exon3, shown on Segment A, as on the promoter of the housekeeping GAPDH gene as control, using ChIP-qPCR with an antibody specific to PolII or Ser5 phosphorylated PolII (Fig. 3i).
Our results collectively indicate that Segment A of Exon3 constitutes the core promoter element of CXXC5 located in a CGI.
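The sequence features used above to describe the region as a CGI (G-C content, CpG enrichment, and GC skew) can be computed directly from a nucleotide string. The following is a minimal Python sketch under stated assumptions, not the web tools used in this study: the function name `cpg_island_metrics` is hypothetical, and CpG enrichment is expressed here as the commonly used observed/expected ratio, (number of CpG × length) / (number of C × number of G).

```python
def cpg_island_metrics(seq: str) -> dict:
    """Return G-C content, CpG observed/expected ratio, and GC skew.

    Illustrative helper (hypothetical name); it is not the tool used in the paper.
    """
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact CpG count.
    cpg = s.count("CG")

    # Fraction of G or C nucleotides in the sequence.
    gc_content = (c + g) / n if n else 0.0
    # Observed CpG count relative to the count expected from C and G frequencies.
    cpg_obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    # Strand asymmetry of G versus C: positive when G exceeds C.
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0

    return {
        "gc_content": gc_content,
        "cpg_obs_exp": cpg_obs_exp,
        "gc_skew": gc_skew,
    }
```

For a region such as the one analyzed here, a `gc_content` value above 0.7 would correspond to the reported G-C content of > 70%.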
To assess the binding of TFs obtained with the promoter pull-down approach as the putative binders to sequences of Segment A, we initially carried out bioinformatics analyses using the Cistrome (http://cistrome.org/) database, a resource of human and mouse cis-regulatory information derived from ChIP-seq, DNase-seq, and ATAC-seq chromatin profiling assays to map the genome-wide locations of transcription factor binding sites 64 . Due to the availability of information on TFs in Cistrome, the possible association of 16 TFs (AFF1, ATF7, CREB1, ELF1, MAZ, MGA, MYNN, NFIA, NFIB, PRDM10, RB1, TFAP2C, TFAP4, ZBTB2, ZBTB7A, and ZBTB7B) with Exon3 and surrounding sequences was analyzed with datasets generated by the use of MCF7 cells and/or of other cell lines for which datasets were available. Results revealed that while ATF7, CREB1, MGA, MYNN, NFIA, NFIB, ZBTB2, or ZBTB7B does not appear to interact with the Exon3 region, the association of ELF1, TFAP4, or TFAP2C with the region in cells seems to be dependent on tissue-of-origin. On the other hand, AFF1, MAZ, PRDM10, RB1, or ZBTB7A could be involved in the regulation of CXXC5 expression in MCF7, and also in other, cells by interacting with the Exon3 region (Supplementary Information, Fig. S9).

Interactions of TFs with Segment A in cellula.
MAZ binds to DNA sequences with high G nucleotide content 65,66 , which are abundantly present in the CXXC5 core promoter. ELF1, upon binding to DNA, could regulate gene expression through interaction with RB1 67 . RB1, which we identified here as one of the Segment A interacting proteins as well, indirectly associates with DNA through interactions with, for example, members of the E2F family proteins and hematopoietic transcription factors 68 . Based on these observations, we reasoned that ELF1 and MAZ could be involved in the regulation of CXXC5 gene expression. We also carried out ChIP for RB1. To assess the possible presence of TFs on Segment A, we initially examined the efficiency of antibodies to precipitate the protein of interest with IB following ChIP (ChIP-IB) (Fig. 5a) and subsequently assessed the amount of isolated DNA with qPCR (ChIP-qPCR) (Fig. 5b) using primer sets specific for Segment A. As a control, we also used primers specific for the promoter of OAS1 (2′-5′-Oligoadenylate Synthetase 1), with which ELF1 is shown to interact 67 . Similarly, primer sets for the promoter of MYC (MYC Proto-Oncogene, BHLH Transcription Factor) were used as a control to assess the interaction of MAZ 69 or RB1 70 , as shown previously. We also used Exon2 of MB (Myoglobin) as control. In addition, ChIP using an antibody specific to CREB1 was conducted to ensure that CREB1 does not interact with the Exon3 region, as the findings of the Cistrome database suggested. Results revealed that ELF1 or RB1, like Ser5-phosphorylated PolII, indeed associates with Segment A, as each interacts with the promoter elements of the respective control gene but not with MB (Fig. 5b). MAZ synthesized endogenously in MCF7 cells displays an electrophoretic mobility of about 57 kDa (Fig. 6g) and co-migrates with the heavy chain of IgG in immunoprecipitates (Supplementary Information, Fig. S10a). This renders the presence of MAZ in precipitates difficult to decipher.
To ensure that the antibody, which recognizes sequences at the carboxyl-terminus of MAZ, precipitates the protein, we used an amino-terminally truncated MAZ (MAZΔN) with an estimated MM of 37 kDa (Supplementary Information, Fig. S10b). MCF7 cells were transiently transfected with an expression vector bearing the HA-MAZ or HA-MAZΔN cDNA. Cells were then subjected to ChIP-IB using the MAZ antibody. The presence of HA-MAZΔN in the precipitates indicated that the antibody immunoprecipitates the MAZ protein (Supplementary Information, Fig. S10b). Based on this finding, we carried out ChIP of MCF7 cells using the MAZ antibody. qPCR results revealed that MAZ interacts with Segment A and the MYC promoter but not with Exon2 of MB (Fig. 5b).
CREB1 did not show an association with Segment A or MB but it interacted with the promoter elements of CCNA2 (Supplementary Information, Fig. S10c,d), as shown previously 71 .

Sequence motif analyses for ELF1 and MAZ on Segment A. To examine binding sites for ELF1 or MAZ on Segment A, we performed sequence motif analyses using our motif analysis tool 72 and the JASPAR (http://jaspar.genereg.net/) database 73 , which is a resource for curated, non-redundant TF-binding profiles stored as position frequency matrices (PFMs). We identified potential binding sites for MAZ and ELF1 proteins in Segment A (Supplementary Information, Fig. S11a,b). Moreover, one of the characteristics of CGI promoters is the lack of sequence motifs for TATA-box or downstream promoter element (DPE) positioned at distinct locations relative to TSS that define non-CpG promoters 3,4,74 . Consistent with this, we found no such elements throughout the CXXC5 locus including Segment A.
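To illustrate how a PFM of the kind stored in JASPAR can be used to locate candidate binding sites, the sketch below converts a PFM into log-odds weights and scores every window of a sequence. It is a generic illustration and not the motif analysis tool used in this study. The function names, the pseudocount, the uniform background of 0.25, and the 4-position matrix `TOY_PFM` are all hypothetical; `TOY_PFM` is not an actual JASPAR profile for ELF1 or MAZ.

```python
import math

BASES = "ACGT"

# Hypothetical 4-position count matrix favoring "GGAA"; for illustration only.
TOY_PFM = {
    "A": [0, 0, 10, 10],
    "C": [0, 0, 0, 0],
    "G": [10, 10, 0, 0],
    "T": [0, 0, 0, 0],
}


def pfm_to_pwm(pfm, pseudocount=0.5, background=0.25):
    """Convert per-position counts into log2-odds weights against a uniform background."""
    length = len(pfm["A"])
    pwm = []
    for i in range(length):
        # The pseudocount avoids log(0) for bases never observed at a position.
        total = sum(pfm[b][i] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((pfm[b][i] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(seq, pwm):
    """Score every window on the given strand; returns (start, score) pairs."""
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing ambiguous bases
        hits.append((start, sum(pwm[i][b] for i, b in enumerate(window))))
    return hits


def best_hit(seq, pwm):
    """Return the highest-scoring (start, score) pair on the given strand."""
    return max(scan(seq, pwm), key=lambda hit: hit[1])
```

The sketch scans only the strand it is given; scanning the reverse complement as well would be needed to cover both strands, and any score threshold for calling a site would have to be chosen separately.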
To corroborate binding to the putative ELF1 motif or one of the MAZ motifs of Segment A, we performed electrophoretic mobility shift assays (EMSA), as we described previously 19 , using a 5′-end biotin-conjugated DNA substrate containing the ELF1 or MAZ binding motif present in Segment A and nuclear extracts of MCF7 cells (Fig. 6). When the DNA substrate for ELF1 (ELF1-RE, Fig. 6a) or MAZ (MAZ-RE, Fig. 6b) was incubated with nuclear extracts, a DNA-protein complex (asterisk) was observable on the gel. The inclusion of the ELF1 or MAZ antibody further retarded the migration of the protein-DNA complex. These results suggest that ELF1 or MAZ specifically interacts with the DNA substrate. The abrogation of the protein-DNA interaction with a DNA substrate bearing mutant sequences (Mut DNA) or with the inclusion of a 250-fold molar excess of the unbiotinylated (UnB DNA) ELF1-RE or MAZ-RE DNA further indicates that Segment A contains sequences for the binding of ELF1 or MAZ.
To assess the effects of ELF1 or MAZ on the expression of the reporter enzyme driven by Segment A, we transiently transfected MCF7 cells with the expression vector bearing the HA-ELF1 or HA-MAZ cDNA for 24 h (Fig. 6c). Results indicated that ELF1 or MAZ enhances the enzyme activity compared to levels observed with the vector. Moreover, we observed reduced levels of reporter enzyme activity driven by Segment A with the deleted ELF1 (SegAΔELF1-RE) or MAZ (SegAΔMAZ-RE) motif compared to the native sequence in MCF7 cells, whether or not cells were transfected with the expression vector bearing the HA-ELF1 or HA-MAZ cDNA. These results indicate that Segment A contains sequences for the binding of ELF1 and MAZ critical for the promoter activity.

Figure 5 caption (partial): Fragmented chromatin from MCF7 cells processed for ChIP was immunoprecipitated with a species-specific IgG or an antibody specific to Ser5P-PolII, RB1, or ELF1. Fragmented chromatin of MCF7 cells transiently transfected with MAZΔN was also processed for ChIP to ensure that the MAZ antibody directed to the carboxyl-terminus of MAZ to be used in ChIP is capable of precipitating the endogenous MAZ. ChIP precipitates were subjected to SDS-PAGE for Ser5P-PolII and RB1 or to SDS-PAGE for ELF1 and MAZΔN followed by IB using antibodies indicated for ChIP. Asterisk (*) indicates the protein of interest. Input, IgG together with heavy (HC) and light (LC) chains of IgG are indicated. Molecular masses (MM) in kDa are denoted. (b) ChIP-qPCR. While identical primer sets for each antibody were used in assessing the interaction of a transcription factor (TF) to Segment A as the core CXXC5 promoter elements (CXXC5) or to the Exon2 of Myoglobin (MB) as a control, distinct primer sets were used for the promoter of a target gene of each TF. The mean ± SE of three independent experiments performed in triplicate is shown. Asterisk (*) indicates significant differences depicted as fold change compared to IgG.
In assessing the effects of ELF1 or MAZ on CXXC5 expression in a chromatin context, we transiently transfected MCF7 cells with the expression vector bearing none (EV), the HA-ELF1, or HA-MAZ cDNA for 48 h. HA-ELF1 (Fig. 6d) or HA-MAZ (Fig. 6e) augmented the expression of CXXC5 as well as the corresponding control OAS1 or MYC compared to the vector as assessed with RT-qPCR. Furthermore, the reduction of ELF1 protein levels (Fig. 6f) in transient transfections in MCF7 cells with a siRNA pool that targets ELF1 effectively attenuated the expression of CXXC5 or OAS1 (Fig. 6h). We also observed effective repression of the protein levels of MAZ by a MAZ-specific siRNA pool (Fig. 6g). Unexpectedly, however, the suppression of MAZ synthesis did not alter the CXXC5 or the MYC expression (Fig. 6i). This suggests that MAZ at steady-state conditions in contrast to ELF1 may not contribute to CXXC5 expression.
Thus, it appears that although ELF1 and MAZ participate in the expression of CXXC5, the contributory effect of these TFs on the CXXC5 expression could be mechanistically distinct and context-dependent.

Figure 6 caption (partial): The transfection efficiency was monitored by the co-expression of pCMV-Renilla Luciferase. Cellular extracts were then subjected to dual-luciferase assays. Asterisks (*) indicate significant differences depicted as fold change compared to B, which was set to 1, whereas the superscript "a" denotes significant differences compared to SegA. (d, e) MCF7 cells were transfected with an expression vector bearing none (EV), HA-tagged ELF1, or HA-tagged MAZ cDNA. Isolated RNA was converted into cDNA libraries followed by qPCR using CXXC5-specific primers, OAS1-specific primers as the ELF1 control, or MYC-specific primers as the MAZ control. Results were normalized with the RPLP0 expression using the 2^(−ΔΔCT) method. (f, g) Nuclear extracts of untransfected (UT) MCF7 cells or of MCF7 cells transiently transfected with control siRNA (CtS), a pool of siRNA specific (siR) to ELF1 (f) or MAZ (g), or with an expression vector bearing the HA-ELF1 or the HA-MAZ cDNA for 48 h were subjected to IB using ELF1 or MAZ antibody. Membranes were then re-probed with the HA antibody. Molecular masses (MM) in kDa are indicated. (h, i) Total RNA from MCF7 cells transiently transfected with control siRNA (CtS) or siRNA specific (siR) to ELF1 (h) or MAZ (i) was processed for and subjected to qPCR using primers specific to CXXC5, OAS1, or MYC. Asterisks (*) denote significant differences depicted as fold change compared to CtS of three independent experiments.

Segment C may contain a G-quadruplex. Our reporter assays suggested that Segment C, which has a high G-C content, in the presence of other segments attenuates and alone represses the activity of the promoter driving the expression of the Luciferase cDNA as the reporter enzyme (Fig. 2).
G-rich sequences can self-associate into stacks of G-quartets to form complex structural motifs known as G-quadruplexes (G4s), which arise from Hoogsteen hydrogen bonding of four guanines arranged within a planar quartet (G-quartet) linked by loop nucleotides 75,76 . Self-stacking of the G4 structure is further stabilized by monovalent cations, including K+ 75,76 . G4s could play many essential functions, including in transcriptional events 75,76 . The consensus motif G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+ (G3+ = a run of three or more guanines; N1-7 = any 1-7 nucleotides) is used to identify potential G-quadruplexes from the primary sequence 76 .
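The consensus motif described above, four runs of three or more guanines separated by loops of 1-7 nucleotides, maps naturally onto a regular expression. The following Python sketch is a simple illustration of that pattern only; it is not G4Hunter or G4CatchAll, which use their own scoring schemes, and the function names are hypothetical. The negative strand is searched by scanning the reverse complement.

```python
import re

# Four G-runs (>= 3 guanines each) separated by three loops of 1-7 nucleotides.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_putative_g4(seq: str):
    """Return (strand, start, end, match) tuples for non-overlapping motif matches.

    Coordinates are 0-based and end-exclusive, reported on the strand searched
    (for "-", relative to the reverse-complemented sequence).
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for m in G4_PATTERN.finditer(s):
            hits.append((strand, m.start(), m.end(), m.group()))
    return hits
```

A match only indicates that the primary sequence is compatible with the consensus; whether a G4 actually forms has to be tested experimentally, as done here with ThT fluorescence, CD, and thermal denaturation.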
Analysis of the sequence of Segment C strands with G4 prediction tools including G4Hunter 77 (http://bioinformatics.ibp.cz/) and G4CatchAll 78 (http://homes.ieu.edu.tr/odoluca/G4Catchall/) revealed the possible presence of a G4 on the positive strand (Fig. 7a). Based on these results, we initially explored the presence of a G4 structure in Segment C using Thioflavin T (ThT), which interacts selectively with G4s resulting in a significant fluorescence enhancement 79 . Incubation with ThT of the putative G4 sequence of Segment C (SegC-G4), 34 nt long, in the presence of 70 mM KCl led to a substantial fluorescence increase, determined as F − F0 = 720 ± 15 (Fig. 7b,c). This increase in fluorescence was comparable to that observed with the Pu22 sequence (F − F0 = 495 ± 13) present in the promoter region of the VEGF (vascular endothelial growth factor) gene, which was previously characterized to form a G4 structure 80,81 . These results, together with the low fluorescence intensities of ThT at 488 nm with various mutant sequences of SegC-G4 designed to disrupt the G4 formation (SegC-Mut1-3), SegC complementary (SegC-Comp), or dT32, suggest the possible adoption of a G4 structure by the SegC-G4 sequence.
To verify that the SegC-G4 sequence indeed forms a G4 structure, we also used the Circular Dichroism (CD) approach, which is commonly utilized to determine the G4 topology of G-rich sequences 82,83 . The presence of a parallel G4 is characteristically associated with a positive band around 260 nm and a negative band around 240 nm. On the other hand, the formation of an antiparallel G4 reveals a positive band around 295 and 240 nm and a negative band near 260 nm. The hybrid type G4 structure is associated with a positive band at 290 nm together with a shoulder band at 260 nm and a negative band at around 240 nm 82-84 . In the CD spectrum of SegC-G4, a negative peak around 240 nm, a positive peak around 260 nm and another positive peak around 290 nm were observable (Fig. 7d). The presence of a negative peak around 240 nm and a positive peak around 260 nm is an indication of a parallel G4 structure. Besides, the presence of a positive peak around 290 nm suggests the existence of a second structure with a hybrid topology. The CD spectrum of SegC-Comp reveals an intense absorption maximum around 283 nm with a negative band around 254 nm, which might be correlated with the formation of an i-motif structure due to the C-rich content of the strand 85 . The mutant sequences (SegC-Mut1, SegC-Mut2, and SegC-Mut3) did not show the characteristic peak intensities of G4s. Compared to SegC-Mut1 and SegC-Mut2, the high intensity of the CD spectrum at 220 nm of SegC-Mut3 likely results from the A-rich content of this sequence 86 . Thermal denaturation, using spectroscopic methods, offers an approach for measuring the stability of nucleic acid structures 87 . CD thermal denaturation experiments were conducted to further examine the G-quadruplex structure of SegC-G4. CD spectra were recorded as a function of temperature (between 15 °C and 95 °C) (Supplementary Information, Fig. S12a).
The thermal denaturation profile obtained by monitoring the ellipticity change at 262 nm revealed a melting temperature (Tm) of 65 °C (Supplementary Information, Fig. S12). Furthermore, thermal denaturation monitored as changes in UV-Vis absorbance at 295 nm, a characteristic wavelength for G-quadruplexes, also revealed a characteristic thermal denaturation curve of a G4 structure 88 in SegC-G4 (Fig. 7e). Additionally, a decrease in the Tm of the SegC-G4 sequence from 65 to 45 °C in the absence of 70 mM KCl as the source of K+ for stabilization of G4 structures 75,76 further indicates that the SegC-G4 sequence adopts a G4 conformation. On the other hand, in agreement with our CD data, no thermal denaturation curve was obtained for the SegC-Mut1, SegC-Mut2, or SegC-Mut3 sequence (Fig. 7e).
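As an illustration of how a Tm can be read from a melting profile such as those described above, the sketch below normalizes the signal between its minimum and maximum and finds, by linear interpolation, the temperature at which the normalized signal crosses 0.5. This is one simple convention and an assumption on our part; the procedure actually used to derive the reported Tm values is not described in this section. The function name is hypothetical, and the approach presumes a monotonic two-state transition with reasonably flat baselines.

```python
def estimate_tm(temps, signal):
    """Estimate Tm as the temperature where the min-max normalized signal crosses 0.5.

    temps  : temperatures in ascending order
    signal : measured values (e.g. ellipticity or absorbance) at those temperatures
    """
    if len(temps) != len(signal) or len(temps) < 2:
        raise ValueError("temps and signal must have equal length of at least 2")
    lo, hi = min(signal), max(signal)
    if hi == lo:
        raise ValueError("signal is flat; no transition to analyze")

    # Scale the signal to the 0-1 range.
    frac = [(s - lo) / (hi - lo) for s in signal]

    # Find the first interval that brackets 0.5 and interpolate linearly within it.
    for i in range(len(frac) - 1):
        f0, f1 = frac[i], frac[i + 1]
        if f0 != f1 and (f0 - 0.5) * (f1 - 0.5) <= 0:
            return temps[i] + (0.5 - f0) * (temps[i + 1] - temps[i]) / (f1 - f0)
    raise ValueError("normalized signal never crosses 0.5")
```

Fitting a two-state model with sloping baselines, or locating the maximum of the first derivative, are alternative conventions that can give slightly different values for the same data.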
Since the fusion of Segment C to the 3′-end sequence of the CMV promoter effectively repressed the reporter enzyme activity induced by the promoter, in contrast to Segment D, which has minimal effects (Fig. 2f), we wanted to assess whether the removal of this G4 sequence in Segment C would restore the CMV-driven enzyme activity. In MCF7 cells transiently transfected with the reporter plasmid bearing the CMV promoter that drives the Luciferase cDNA as the reporter, the repression of the enzyme levels by the presence of Segment C (CMV-Pr-C) but not Segment D (CMV-Pr-D) was indeed effectively alleviated with the deletion of the G4 sequence in Segment C (CΔG4) (Fig. 7f).

Discussion
The majority of human gene promoters of housekeeping, developmental and tissue-specific genes are located within unmethylated CGIs that display a chromatin state permissive for transcription. Transcription from these 'broad or dispersed' promoters is initiated at multiple closely spaced TSSs, in contrast to the 'focused or sharp' non-CpG promoters of cell-type-specific genes, within which a single TSS initiates transcription 3,4,56,74 . While sequence motifs for TATA-box and downstream promoter element (DPE) positioned at distinct locations relative to TSS tend to characterize non-CpG promoters, CGI promoters generally lack these elements 3,4,74 . We identified here transcript variant 2, with the highest expression level among transcript variants in MCF7 and HL60 cells, as the main transcript of CXXC5. We also defined a DNA segment within and at the 5′ surrounding sequences of Exon3 as the core promoter region required for CXXC5 expression. Based on the DNA sequence composition and motifs, the chromatin configuration, and the presence of multiple TSSs together with an active PolII at the CXXC5 promoter, we suggest that a CGI promoter drives the expression of CXXC5.
Transcription is the result of the integrated effects of multiple inputs mediated by TFs whose activities are dynamically modulated in response to internal and external signaling cascades. Studies indicate that due to a high CpG density 89 and inherently unstable nucleosome architecture 90 , the chromatin accessibility of CGIs is critical for the binding of various transcription factors including the ZF-CXXC family proteins and the subsequent recruitment of DNA/histone modifiers and RNA polymerase machinery for transcription 3,4,74 . Our studies coupled with bioinformatics analyses suggest that AFF1, ELF1, MAZ, PRDM10, TFAP2C, TFAP4, and ZBTB7A transcription factors may be involved in the regulation of the CXXC5 expression; of these, ELF1, MAZ, TFAP2C, TFAP4, and ZBTB7A appear to be capable of interacting with sequences enriched with C and/or G nucleotides within the nucleosome-free CXXC5 promoter. We verified here that ELF1 and MAZ are critical components of the CXXC5 expression by directly interacting with cognate sequence motifs present in the CXXC5 promoter.
The regulation of CXXC5 expression is likely multifactorial involving many transcription factors with activator or repressor functions responding to distinct signaling pathways. ELF1 (E74 Like ETS Transcription Factor 1), a ubiquitously expressed gene product, is a member of the ELF subfamily of the ETS transcription factor family which plays diverse roles in regulating many essential processes including embryonic development, cell cycle control, cell proliferation, apoptosis, cell migration, hematopoiesis, and angiogenesis 91,92 . ELF1 interacts with a permutation of a consensus core sequence, AGGAA, (also, Supplementary Information, Fig. S11) 93 on DNA and acts as an activator or repressor of target gene expression. ELF1 could regulate gene expression through interaction with RB1 as well 67 . The interaction of ELF1 with the pocket region of the hypo-phosphorylated RB1 was shown to be critical for gene expression involved in cell cycle progression during T cell activation 94 . It is well established that hypo-phosphorylated RB1 restricts the ability of cells to replicate DNA by preventing G1 progression to the S phase of the cell cycle through repressing genes involved in cell cycle progression regulated by the E2F family and its obligatory dimerization partners DP family proteins through direct binding to E2F responsive elements 68,95 . Hyper-phosphorylation of RB1 leads to the dissociation of RB1 from E2F-DP complexes and subsequent activation of target gene expression 68,95 . We found here that ELF1 and RB1 are co-present, as observed with promoter pull-down, and that each is enriched at the CXXC5 promoter, as assessed with ChIP. Moreover, our studies revealed that ELF1 interacts with an ELF1 sequence motif on Segment A and modulates the CXXC5 expression. These observations, therefore, raise the possibility that the interaction between ELF1 and RB1 drives the CXXC5 expression in a cell cycle-dependent manner.
Indeed, our ongoing studies suggest that this might be the case. Like ELF1, MAZ is expressed ubiquitously in human tissues at varying levels 96 . MAZ is a transcription factor with six Cys2-His2 zinc fingers that recognizes a permutation of the cognate sequence GGGAGGG (see also Supplementary Information, Fig. S11), primarily present on nucleosome-free regions in broad promoters in contrast to focused promoters 97 . MAZ is implicated in a wide range of transcriptional roles, including transcription initiation 69 , transcriptional pausing of PolII during transcription elongation 98 , alternative splicing 98,99 , and transcription termination leading to the activation of polyadenylation 69,100 . We observed here that MAZ is enriched at the CXXC5 promoter as assessed by ChIP, binds to the MAZ binding motif as assessed by EMSA, and modulates the expression of CXXC5 as assessed by overexpression with the reporter promoter construct or the endogenous gene locus. These findings suggest that MAZ is a critical contributor to the expression of CXXC5. However, we also observed that the effective reduction of MAZ protein levels by a siRNA approach altered neither CXXC5 expression nor the expression of MYC, used as the control, in contrast to the reduction in ELF1 protein levels, which led to a decrease in CXXC5 expression. This suggests that MAZ may not be involved in the expression of CXXC5 under steady-state conditions but is involved in CXXC5 expression in response to a signaling pathway. It is also likely that the decrease or absence of MAZ is compensated by other transcription factors that bind to DNA motifs similar to those of MAZ. Indeed, the sequences of the binding sites for MAZ and SP1 (Specificity factor 1), which are often found within the same gene, are very similar: GGGAGGG and GGGCGG, respectively 65,66 . Studies further showed that SP1 binds, and competes with MAZ for binding, to the same GC-rich DNA-binding sites 101 .
Although how ELF1 or MAZ modulates CXXC5 expression is unclear, alterations in histone modifications of target gene promoters upon TF binding to DNA appear to be critical for gene expression 65,102 . MAZ, for example, represses transcription by recruiting histone deacetylases including HDAC1, HDAC2, and HDAC3 65 . Moreover, the interaction of FAC1 (Fetal Alzheimer's clone 1), a truncated isoform of the chromatin remodeler BPTF (bromodomain and PHD domain transcription factor), with MAZ is shown to alter the transcriptional activity of the protein 103 . It is therefore likely that upon association with DNA, MAZ and/or ELF1, directly or through co-regulatory proteins, interact with histone modifiers, as suggested here by the pull-down assay showing the association of histone acetyltransferase KAT2A as well as histone demethylases JMJD1C, KDM1A, and KDM2A with Segment A, and establish a chromatin state and structure permissive or restrictive for the transcriptional regulation of CXXC5.
While Segment A of the first exon of transcript variant 2 constitutes the core promoter for the CXXC5 gene, the surrounding regions may contribute to gene expression as they contain potential binding motifs for various transcription factors (data not shown) as well as structural features, exemplified here by the presence of a G4 conformation in Segment C. As non-canonical nucleic acid secondary structures formed within G-rich sequences in both DNA and RNA, G4s are widely found in promoter regions, immunoglobulin class switch regions, ribosomal DNA, mitochondrial DNA, replication initiation regions as well as in the extended repeat sequences in various pathologies 104,105 . G4s play fundamental roles in transcription, replication, genome stability, and epigenetic regulation as well as post-transcriptional events including RNA transport, localization, and translation 104,105 . G4 structures could act as modifiers of various TFs, as exemplified by p53 106 , at target promoter sites in the regulation of gene expression. The abilities of various proteins including helicases, chromatin/histone modifiers, and transcription factors to interact with G4s may be critical for the dynamic regulation of gene expression. For example, the binding of nucleolin to the nuclease hypersensitive element III1 (NHE III1) of MYC induces the formation of a G4 structure and reduces MYC transcription 107 ; the binding of NME2 (non-metastatic cells 2; nucleoside diphosphate kinase B), on the other hand, unfolds the G4 structure and promotes the transcription of MYC 108 . MAZ is also shown to bind to secondary DNA structures including G4s, which appears to be critical for transcriptional events of target genes 109-111 . In addition to Segment A, we also observe adjacent binding motifs for MAZ (GGGGAGGGGGAGGAGGG) in Segment C (Fig. 7A; Supplementary Information, Fig. S11). 
This raises the possibility that the interaction of MAZ with Segment C could also modulate the transcription of CXXC5 by forming or resolving the G4 structure.
Although promoters constitute the key platform for the assembly of pre-initiation complexes to mediate the directionality and accuracy of transcription initiation, enhancers are DNA regulatory elements that determine spatio-temporal expression even over long distances, regardless of their orientation relative to the core promoter 112,113 . It is well established that enhancers acting as binding targets for lineage-specific TFs are critical components of transcription by establishing proximity interactions with promoters in a cell-type-specific manner 112,113 . Given that transcription requires dynamic protein-protein interactions and the subsequent multistep ordered assembly of protein complexes within a temporally modulated chromatin architecture, a better understanding and delineation of the mechanistic features of CXXC5 expression in response to distinct signaling pathways including retinoic acid 11 , TGF-β 12 , BMP4 13,14 , Wnt3a 15-17 and estrogen 18-20 would be valuable for understanding both physiology and pathophysiology.

Engineering of reporter vectors.
To assess the promoter activity of the genomic region of CXXC5, we generated Luciferase reporter vectors bearing a DNA fragment containing the putative promoter elements of the CXXC5 gene. We used the pGL3-Basic Luciferase Reporter vector that bears the Firefly Luciferase cDNA as the reporter enzyme (Promega Corp., Madison, WI, USA). For the engineering of the reporter vector bearing the putative CXXC5 promoter containing genomic region, a DNA fragment of 1975 bp of the CXXC5 locus (GRCh38.p12 Primary Assembly, chromosome 5: 139647220-139649173) generated by PCR using the genomic DNA of MCF7 cells as a template was inserted into the pGL3-Basic vector with appropriate restriction enzymes.
To increase the resolution of the putative CXXC5 promoter region, we carried out deletions from both the 5′ and 3′ ends of the region by PCR and inserted the resulting fragments into the pGL3-Basic vector with appropriate restriction enzymes. All constructs were sequenced to verify PCR fidelity. TFF1 (also known as the pS2 gene) is a well-studied estrogen-responsive gene 19,52 . The human TFF1 gene confers E2 responsiveness through the binding of ER to a non-consensus ERE 115-117 . The pGL3-TFF1 reporter construct is responsive to E2 in transiently transfected cells synthesizing estrogen receptor(s), exemplified by MCF7 cells 19,118,119 . We used the pGL3-TFF1 reporter vector bearing the estrogen-responsive TFF1 gene promoter as a control under the steady-state cellular growth condition, in which the growth medium contains unprocessed fetal bovine serum (FBS) as opposed to FBS treated with charcoal-dextran to remove endogenous steroid hormones, including estrogens. In transfections, transfection efficiency was monitored with a reporter vector bearing the CMV promoter that drives the expression of the Renilla Luciferase cDNA (pCMV-RL, Promega Corp., Madison, WI, USA). For luciferase studies, cells were seeded in 48-well plates at 4 × 10^4 cells/well for 48 h. Cells were then transiently transfected with 125 ng of reporter vector together with 0.5 ng of pCMV-RL using Turbofect. Luciferase assays were performed with a Dual-Luciferase Assay Kit (Promega Corp., Madison, WI, USA) as described previously 19 .
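As an illustration of how dual-luciferase readings of this kind are commonly processed, the sketch below divides each Firefly reading by the Renilla reading of the same well and expresses a construct's activity as fold over the empty pGL3-Basic vector. The normalization scheme, the function names, and the example readings are assumptions made for illustration; the text states only that transfection efficiency was monitored with pCMV-RL.

```python
from statistics import mean


def relative_luciferase(firefly, renilla):
    """Return the Firefly/Renilla ratio for each well (assumed normalization)."""
    if len(firefly) != len(renilla):
        raise ValueError("need one Renilla reading per Firefly reading")
    return [f / r for f, r in zip(firefly, renilla)]


def fold_over_basic(construct_ratios, basic_ratios):
    """Mean normalized activity of a construct relative to pGL3-Basic."""
    return mean(construct_ratios) / mean(basic_ratios)


# Hypothetical raw readings (arbitrary light units) from duplicate wells.
promoter_wells = relative_luciferase([200.0, 400.0], [10.0, 20.0])
basic_wells = relative_luciferase([10.0, 20.0], [10.0, 20.0])
fold = fold_over_basic(promoter_wells, basic_wells)
```

Normalizing per well before averaging keeps a well with poor transfection from dominating the mean.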
PCR and RT-qPCR. Total RNA isolated from MCF7 or HL60 cells was used for cDNA synthesis (RevertAid First Strand cDNA Synthesis Kit, Thermo-Fisher), and transcript variant identification was carried out by PCR with transcript variant-specific primer sets (Supplementary Information, Table 1) followed by TA-cloning into the pGEM-T vector (Promega Corp., Madison, WI, USA) and sequencing (PRZ Biotechnology, Ankara, Turkey). Transcript variant quantification studies were carried out with RT-qPCR. The SsoAdvanced Universal SYBR Green Supermix (Bio-Rad Life Sciences Inc., Hercules, CA, USA), transcript variant-specific qPCR primers (Supplementary Information, Table 1), and DMSO when necessary were used. Expression levels of transcript variants were assessed with the efficiency-corrected form of the 2^(−ΔCT) method 120 and normalized using the RPLP0 expression levels. Relative expression levels of CXXC5 and of ELF1- or MAZ-regulated genes were assessed using the 2^(−ΔΔCT) method 120 and normalized using the RPLP0 expression levels. Results were expressed relative to the expression level of transcript variant 1, which was arbitrarily set to one. In all RT-qPCR experiments, MIQE Guidelines were followed 121 .
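The quantification described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function names and Ct values are hypothetical, amplification factors default to 2 (100% efficiency), and the efficiency-corrected quantity is written in the common form E_ref^Ct_ref / E_target^Ct_target, which reduces to 2^(−ΔCT) when both factors equal 2.

```python
def efficiency_corrected_quantity(ct_target, ct_ref, e_target=2.0, e_ref=2.0):
    """Target quantity relative to the reference gene (e.g. RPLP0).

    e_target and e_ref are amplification factors per cycle (2.0 = 100%).
    """
    return (e_ref ** ct_ref) / (e_target ** ct_target)


def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Fold change by the 2^(-ddCt) method, assuming 100% efficiency."""
    ddct = ((ct_target_treated - ct_ref_treated)
            - (ct_target_control - ct_ref_control))
    return 2.0 ** (-ddct)


def relative_to_calibrator(quantities, calibrator):
    """Scale all quantities so that the calibrator entry equals one."""
    return {name: q / quantities[calibrator] for name, q in quantities.items()}


# Hypothetical Ct values: two transcript variants and the reference gene.
quantities = {
    "variant_1": efficiency_corrected_quantity(ct_target=25.0, ct_ref=20.0),
    "variant_2": efficiency_corrected_quantity(ct_target=22.0, ct_ref=20.0),
}
scaled = relative_to_calibrator(quantities, "variant_1")
fold = fold_change_ddct(24.0, 20.0, 26.0, 20.0)
```

With these made-up values, variant 2 crosses the threshold three cycles earlier than variant 1, so it scales to eight times the calibrator.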
5′ or 3′ rapid amplification of cDNA ends (5′RACE and 3′RACE). For 5′RACE or 3′RACE studies, we used the RiboMinus Human/Mouse Transcriptome Isolation Kit (#K155001, Thermo Scientific, USA) to enrich the mRNA concentration in the total RNA population from MCF7 cells. We performed rRNA removal according to the manufacturer's instructions. The method is based on the selective depletion of rRNAs by the hybridization of rRNA to Locked Nucleic Acid (LNA) probes conjugated to magnetic beads. LNA, also referred to as inaccessible RNA, is a modified RNA nucleotide that significantly enhances hybridization to DNA or RNA. After the hybridization, LNA probe-bound rRNAs were captured with a magnetic stand and the supernatant, containing largely mRNAs depleted of rRNA, was recovered. Phenol:chloroform:isoamyl alcohol extraction and ethanol precipitation were then used for the purification of mRNAs. For the identification of the 5′- and 3′-ends of CXXC5 transcripts as well as of TFF1 as control, we used the FirstChoice RLM-RACE Kit (AM1700, ThermoFisher Scientific, USA) as directed by the manufacturer. For 5′RACE, in brief, purified mRNA (500 ng) was treated with calf intestinal alkaline phosphatase (CIP) at 37 °C for one hour. Following the termination of the CIP reaction, RNA was extracted with phenol:chloroform:isoamyl alcohol and ethanol precipitated. RNA resuspended in nuclease-free water was then treated with tobacco acid pyrophosphatase at 37 °C for one hour and subjected to ligation using a 5′RACE adapter and T4 RNA ligase. Ligated RNA products were subsequently reverse transcribed with M-MLV reverse transcriptase (M-MLV-RT) using random decamers at 42 °C for one hour. An aliquot of the reaction was then used for outer 5′ RLM-RACE PCR with a 5′RACE CXXC5- or TFF1-specific outer primer (Supplementary Information, Table 1). 
The outer PCR reaction was followed by inner 5′ RLM-RACE PCR using a 5′RACE CXXC5- or TFF1-specific inner primer and a 5′RACE inner primer (Supplementary Information, Table 1). For 3′RACE, purified mRNA was subjected to reverse transcription using M-MLV-RT and the 3′RACE Adapter provided by the kit at 42 °C for one hour. An aliquot of the reaction was used for PCR using a 3′RACE CXXC5- or TFF1-specific outer primer and the 3′RACE RLM adapter outer primer (Supplementary Information, Table 1).
Northern blotting. Preparation of biotin-tagged probes. PCR amplicons for the targeted identification of Exon3 and of the exon boundaries of Exon10 and Exon11 of CXXC5, as well as of the GAPDH cDNA fragment containing Exon5-8, were cloned into a vector. We then used biotin-conjugated vector-specific primers (Supplementary Information, Table 1) for the PCR amplification of double-stranded probe sequences. To examine the presence of CXXC5 transcripts in MCF7 and HL60 cells, a northern blot assay was performed with the NorthernMax Kit (Thermo-Fisher, AM1940) according to the manufacturer's instructions. In brief, 10 µg of RiboMinus-treated RNA samples as well as 3 µl of RNA ladder (RiboRuler High Range RNA Ladder, #SM1821, Thermo Scientific, USA) were mixed with 3 volumes of formaldehyde loading dye and incubated at 65 °C for 15 min. Samples were then loaded onto a denaturing agarose-LE gel, electrophoresed for 2 h, and transferred onto a positively charged PVDF membrane (Sigma-Aldrich, Roche, #11209272001). Transferred mRNAs were then cross-linked to the membrane with a UV transilluminator (312 nm wavelength) for 10 min. Membranes were placed into 15 ml sterile falcon tubes and pre-hybridization was initiated using 6 ml of ULTRAhyb buffer pre-heated to 42 °C in a vertical rotator for 40 min. Biotin-tagged probes were diluted tenfold with TE buffer containing 10 mM EDTA to a final volume of 100 µl and denatured at 90 °C for 10 min. 
The mixture was immediately added onto the membranes in falcon tubes and placed in the oven for hybridization overnight. Membranes were then washed twice with a low stringency buffer at RT for 5 min, followed by washing twice with a high stringency buffer for 15  However, due to the very high number of CpG positions at the center of Exon3 until the end of Exon4, designing bisulfite primers were precluded. We instead designed methyl-specific primers, which were based on methylated and unmethylated DNA sequences generated after bisulfite conversion. Methyl-specific PCRs were carried out with primer sets containing three or more CpG sites and these regions were PCR amplified as 3 overlapping segments. PCR was carried out with 2.5 units LongAmp Taq Polymerase (NEB, M0323) in 50 µl total reaction containing 0.5 µM forward and reverse primers. Template DNA was added to the reaction at 90 °C to prevent nonspecific primer annealing and the first two cycles of the reaction were carried out only in the presence of reverse primer complementary to the sense strand to avoid the formation of primer dimers. The subsequent 10 cycles were performed 5 °C above the annealing temperature followed by 30 cycles of PCR 123 . In all PCRs with bisulfite-converted DNA templates, we also used a bisulfite-converted "Universal Methylated Human DNA Standard" (Zymo Research, D5011) as control. PCR amplicons with expected sizes were excised from agarose gels, purified, and cloned into the pGEM-T vector (Promega Corp., Madison, WI, USA) by TA-cloning and sequenced (PRZ Biotechnology, Ankara, Turkey). Sequences were analyzed using the QUMA 57 tool.
Assessing nucleosome occupancy with micrococcal nuclease assay. To assess nucleosome occupancy at the putative promoter region of the CXXC5 gene, MCF7 cells, grown in six-well plates for 48 h were fixed with 2% formaldehyde in 1xPBS for 15  To digest RNA, 0.2 mg/ml RNAse A (Thermo Scientific, EN0531) was added onto the lysate and incubated at 37 °C for 30 min. Lastly, to digest cellular proteins, 0.25 mg/ml Proteinase K (Thermo Scientific, AM2542) was added into the lysate and incubated overnight at 65 °C. DNA was subsequently purified by phenol:chloroform:isoamyl alcohol (VWR, #K-169) and ethanol precipitation. To analyze the digestion pattern, purified DNA was loaded on an agarose gel and bands corresponding to the tri-nucleosomal and mono-nucleosomal DNAs were excised and gel purified using Zymoclean Gel DNA Recovery Kit (D4001). 50 ng of purified DNA or uncut genomic DNA was used as a template in PCR reactions to assess nucleosome occupancy. For the initial scanning, five overlapping regions (depicted as T1-5, Fig. 3) were amplified from the tri-nucleosomal DNA template. For further verification, 3 sub-regions (depicted as M1-3, Fig. 3) were analyzed from the mono-nucleosomal DNA by PCR.
Samples were then subjected to immunoprecipitation with Protein G-coupled magnetic beads (New England BioLabs) for anti-mouse antibodies or Protein A/G coupled magnetic beads (New England BioLabs) for anti-rabbit antibodies. After washes, de-crosslinking, and protein digestion, DNA was recovered with phenol:chloroform: isoamyl alcohol followed by ethanol precipitation. Samples (1 µl of 60 µl elution) were subjected to qPCR using ChIP primers (Supplementary Information Table 1)  Promoter pull-down. Based on luciferase reporter results, we amplified a 220 bp in length DNA fragment by PCR from the genomic DNA of MCF7 cells that includes the CXXC5 promoter (Segment A; − 117 to + 103, + 1 being the first nucleotide of the annotated Exon3) and inserted it into a vector with appropriate restriction enzyme cut sites. Similarly, a 220 bp in length DNA fragment within Exon10 of the CXXC5 gene as control DNA was cloned into the vector. 5′ end-biotinylated forward and reverse primers specific to the vector were then used for the amplification of Segment A and the control Exon10 DNA sequences by PCR. Biotinylated double-stranded PCR amplicons were recovered from agarose gels with Zymoclean Gel DNA Recovery Kit (Zymo Research).
Streptavidin magnetic beads (SMB, NEB) were blocked using 2% BSA in PBS for 2 h at 4 °C followed by two washes with 1xPBS. The blocked SMBs were resuspended in 200 µl PBS containing 0.5 mM phenylmethylsulfonyl fluoride (PMSF), 0.5 mM DTT, and 1xPI, and were then mixed with one ml of nuclear extracts for pre-clearing for 1 h at 4 °C in 300 µl 1xPBS. Subsequently, the pre-cleared nuclear extract was divided into three equal (about 400 µl) aliquots. Each aliquot was then mixed with one of the following: 10 µg of biotinylated double-stranded Segment A, 10 µg of control DNA, or beads alone, in the presence of 10 µg of Poly[d(I-C)], to form the protein-DNA complexes overnight at 4 °C on a rotator. The SMB-DNA-protein mixtures were subsequently washed with 1xPBS three times for 5 min each and resuspended in 200 µl 1xPBS for Mass Spectrometry (MS) analyses.
Protein identification by mass spectrometry. MS analyses of two biological replicates were carried out at the Koç University Proteomic Facility (Istanbul, Turkey). The SMB-DNA-protein mixtures were washed with 50 mM NH4HCO3, followed by reduction with 100 mM DTT in 50 mM NH4HCO3 at 56 °C for 45 min, and alkylation with 100 mM iodoacetamide at RT in the dark for 30 min. MS Grade Trypsin Protease (Pierce) was added onto the beads for overnight digestion at 37 °C (enzyme to protein ratio of 1:100). The resulting peptides were purified using C18 StageTips (ThermoFisher). Peptides were analyzed by online C18 nanoflow reversed-phase HPLC (2D nanoLC; Eksigent) linked to a Q-Exactive Orbitrap mass spectrometer (ThermoFisher). The data sets were searched against the human SWISS-PROT database version 2014_08. Proteome Discoverer (version 1.4; ThermoFisher) was used to identify proteins. The final protein lists were analyzed using the STRING v11 125 and DAVID 63,126 databases.
In silico analysis of TF motifs for the CXXC5 locus. Binding motifs for transcription factors. To find TF binding motifs, we developed a motif search tool 72 using all the available ChIP-Seq datasets at the Cistrome 64 database. This tool takes as inputs: (1) a set of binding locations from a sample of ChIP-Seq reads, given as MACS2 peak locations, (2) the reference sequence of the genomic locus to analyze, and (3) the binding motifs for a specific transcription factor from the JASPAR 73 database. The program conducts an approximate string search on binding locations of the reference sequence using the consensus binding motif as the query sequence. The program generates both forward and reverse strand hits, which are ranked by a logarithmic sequence similarity score on binding locations.
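The approximate string search on both strands can be illustrated with the sketch below. It is not the tool referenced above: the function names are hypothetical, mismatches are counted by Hamming distance, hits are ranked by the fraction of matching positions as a simple stand-in for the logarithmic similarity score (whose exact form is not given here), and the restriction to ChIP-Seq peak locations is omitted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif_hits(reference, motif, max_mismatches=1):
    """Scan both strands of `reference` for approximate motif matches.

    Returns (start, strand, matched_sequence, score) tuples, best first.
    `start` is 0-based on the forward strand; `score` is the fraction of
    positions that match the query (an assumed, illustrative score).
    """
    reference = reference.upper()
    motif = motif.upper()
    hits = []
    # A reverse-strand site appears on the forward strand as the
    # reverse complement of the motif.
    for strand, query in (("+", motif), ("-", reverse_complement(motif))):
        for start in range(len(reference) - len(query) + 1):
            window = reference[start:start + len(query)]
            mismatches = sum(a != b for a, b in zip(window, query))
            if mismatches <= max_mismatches:
                score = 1.0 - mismatches / len(query)
                hits.append((start, strand, window, score))
    hits.sort(key=lambda hit: (-hit[3], hit[0]))
    return hits


# Toy sequence (not the CXXC5 locus) with one site on each strand.
hits = find_motif_hits("TTGGGAGGGTTCCCTCCCAA", "GGGAGGG", max_mismatches=0)
```

Raising `max_mismatches` trades specificity for sensitivity, which matters for short GC-rich motifs such as those of MAZ.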
Electrophoretic mobility shift assay (EMSA). EMSA was conducted as described previously 19 . 5′ end biotin-labeled oligomers bearing the ELF1 or MAZ binding motif sequence were purchased from Integrated DNA Technologies (IDT Europe; Belgium) and annealed. Double-stranded DNA fragments were incubated in the presence or absence of nuclear extracts (45 μg) for 15 min. Reactions were further incubated without or with the ELF1- or MAZ-specific antibody for another 15 min. Samples were subjected to electrophoresis on a 5% non-denaturing polyacrylamide gel. Gel contents were subsequently electrophoretically transferred to a nylon membrane and processed for EMSA using the LightShift Chemiluminescent EMSA kit (Thermo-Fisher). In brief, the membrane was UV cross-linked and blocked for non-specific binding using a blocking buffer. The membrane was then probed with Streptavidin-Horseradish Peroxidase Conjugate in the blocking buffer for image development. Images were then captured using the ChemiDoc Imaging System (Bio-Rad).
Immunoblotting (IB). IB was carried out as described previously 19,29 . Briefly, cells were grown in six-well tissue culture plates in medium supplemented with 10% FBS for 48 h and transfected with a siRNA pool targeting ELF1 or MAZ and an expression vector bearing the HA-tagged ELF1 or the MAZ cDNA for 48 h. Cells were collected and protein isolation was performed using the NE-PER protein extraction kit (Thermo-Fisher). Protein concentration was determined using the Bradford Protein Assay (Bio-Rad). Nuclear extracts were subjected to denaturing SDS-PAGE and transferred to a membrane, and proteins were probed with an antibody specific to ELF1 or MAZ followed by a secondary antibody conjugated with horseradish peroxidase (Advansta). The membranes were then re-probed with the HA antibody (Abcam, ab9110) and subsequently with an HDAC1 antibody (Abcam, ab19845) as a loading control. Images were developed using the ECL-Substrate (Advansta) and captured with the ChemiDoc Imaging System (Bio-Rad).
ThT stock solution was prepared in Millipore water, and the concentration of the stock solution was determined by UV-vis spectroscopy using the molar extinction coefficient value of 36,000 M −1 cm −1 at 412 nm 79 . Igor Pro Software (WaveMetrics, Inc. Portland, OR, USA) was used for data analysis.
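Determining the ThT stock concentration from absorbance follows the Beer-Lambert law, c = A / (ε · l). The sketch below uses the extinction coefficient given above; the 1 cm path length, the example absorbance, and the function name are assumptions for illustration only.

```python
THT_EPSILON_412 = 36000.0  # M^-1 cm^-1 at 412 nm, as stated in the text


def concentration_molar(absorbance, epsilon=THT_EPSILON_412, path_cm=1.0):
    """Beer-Lambert law: concentration (M) = A / (epsilon * path length)."""
    if epsilon <= 0 or path_cm <= 0:
        raise ValueError("epsilon and path length must be positive")
    return absorbance / (epsilon * path_cm)


# Hypothetical reading: A412 = 0.36 in an assumed 1 cm cuvette.
stock_molar = concentration_molar(0.36)
stock_micromolar = stock_molar * 1e6
```

An absorbance of 0.36 under these assumptions corresponds to a concentration of about 10 µM; a diluted stock would be scaled back by its dilution factor.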

Assessment of the presence of a G-quadruplex in
Circular Dichroism (CD) spectroscopy. For CD spectroscopy, a Circular Dichroism JASCO J-815 spectropolarimeter (JASCO Inc., Easton, MD, USA) equipped with a Peltier-type temperature control system was used. Spectra of all samples for comparison were recorded at 5 °C using 10 mm quartz cells (3.5 mL, 111-QS, Hellma). The CD thermal denaturation experiment for SegC-G4 was performed by varying the temperature from 15 to 95 °C (and reverse) with a 5 °C/min increment and a 1-min waiting period for each temperature point. The Tm value for SegC-G4 was determined by the differentiation of the Normalized Ellipticity (mdeg) at 262 nm vs Temperature (°C) curve.
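Determining Tm by differentiation of the melting curve can be sketched as below: the first derivative of the normalized signal with respect to temperature is estimated by central differences, and Tm is taken as the temperature where its magnitude is largest. The function name and the synthetic sigmoidal curve are hypothetical; real data would usually be smoothed or interpolated first.

```python
import math


def melting_temperature(temps, signal):
    """Return the temperature at which |d(signal)/dT| is maximal.

    Uses central differences on interior points, so the first and last
    temperature points can never be reported as Tm.
    """
    if len(temps) != len(signal) or len(temps) < 3:
        raise ValueError("need at least three paired points")
    best_temp, best_slope = None, -1.0
    for i in range(1, len(temps) - 1):
        slope = abs((signal[i + 1] - signal[i - 1])
                    / (temps[i + 1] - temps[i - 1]))
        if slope > best_slope:
            best_temp, best_slope = temps[i], slope
    return best_temp


# Synthetic two-state unfolding curve with its midpoint placed at 60 °C.
temps = list(range(15, 100, 5))
signal = [1.0 / (1.0 + math.exp((t - 60) / 4.0)) for t in temps]
tm = melting_temperature(temps, signal)
```

The resolution of this estimate is limited by the temperature step, so a finer step or curve fitting gives a more precise Tm.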
UV-Vis absorption spectroscopy. A Cary 8454 spectrophotometer (Agilent Technologies; Santa Clara, CA, USA) equipped with a Peltier-type temperature control system was used to record the UV-Vis absorption spectra of the samples. UV-Vis thermal denaturation experiments were performed by changing the temperature between 15 and 95 °C (and reverse) with 2 °C/min increments.
Fluorescence spectroscopy. Fluorescence spectroscopy was performed with a Cary Eclipse Fluorescence Spectrometer (HORIBA Ltd., Kyoto, Japan). All oligonucleotide samples were prepared before experiments with the same annealing procedure described above. Parameters for the fluorescence experiments were: emission spectra collected between 430 and 700 nm, an excitation wavelength of 412 nm, 5.0 nm excitation and emission slit widths, operation at 800 V, and a 600 nm/min scan rate. For the fluorescence experiments, 0.5 µM ThT and 2.0 µM nucleic acid concentrations were selected as the optimal amounts. The results are shown as a bar graph of F − F0 at 488 nm, where F0 is the fluorescence of ThT alone and F is the fluorescence of ThT after the addition of the oligonucleotides.

Statistical analysis.
Experiments were repeated at least two independent times. Results, where and when appropriate, were presented as the mean ± standard error (S.E.) of three biological replicates. Statistical analyses were performed using a two-tailed unpaired t-test with a minimum confidence level of 95%.
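The summary statistics described above can be sketched as follows. The example assumes Student's unpaired t-test with pooled variance (the text does not state whether equal variances were assumed), uses hypothetical replicate values, and judges significance against tabulated two-tailed 95% critical values for small degrees of freedom rather than computing an exact p-value.

```python
from math import sqrt
from statistics import mean, stdev

# Two-tailed critical t values at the 95% confidence level, keyed by
# degrees of freedom (standard table values).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_se(values):
    """Return (mean, standard error of the mean)."""
    return mean(values), stdev(values) / sqrt(len(values))


def unpaired_t(a, b):
    """Student's unpaired t statistic with pooled variance, and its df."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * stdev(a) ** 2 + (n2 - 1) * stdev(b) ** 2) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical relative expression values from three biological replicates.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_stat, df = unpaired_t(control, treated)
significant = abs(t_stat) > T_CRIT_95[df]
```

With three replicates per group there are only four degrees of freedom, so the critical value is large and modest differences will not reach significance.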