## Results

### Nodal-driven Smads bind to GC regions for Gsc activation

We first determined whether nodal-activated Smad2/3 and Smad4 in mouse ES cells interact with the GC-rich region of the Gsc promoter that binds to Smad4 in vitro20 (Fig. 1a, b). We analyzed nine human genome-wide ChIP-Seq data sets9, 21,22,23,24 that were available in the NCBI GEO database (Supplementary Fig. 1a, b). The proximal promoter (PP) in human Gsc contains one CAGAC motif, which is located between the FoxH1 site and the TSS, and is not conserved in mouse, whereas the PP in mouse Gsc contains three CAGA repeats, which lack a crucial bp of the CAGAC SBE motif (Supplementary Fig. 1c).

The analysis revealed binding of Smad2/3 and Smad4 to an area of the Gsc PP that encompasses the GC-rich region and the nearby FoxH1 binding site, but excludes the CAGA and CAGAC sites (Fig. 1a, b; Supplementary Fig. 1 a, b). Smad binding also occurred in putative distal enhancer (DE) elements that contain CAGAC sequences (Supplementary Fig. 1a, d). These results showed that in ES cells stimulated by TGF-β signals, Smad2/3, and Smad4 bind to a GC-rich region of the PP of Gsc.

To determine the functional relevance of the GC-rich region, we used CRIPSR/Cas9-mediated mutagenesis25, 26 to generate focal deletions in the Gsc PP in mouse ES cells. The homozygous deletion of a 155 bp segment including the FoxH1 binding motif and a large portion of the GC-rich region (Supplementary Fig. 1e) abolished the induction of Gsc by activin A (a ligand for nodal receptors) (Fig. 1c, d). Moreover, the homozygous deletion of a 10-bp segment within the GC-rich region (GCGCCGGGGC), which spared both the FoxH1 and CAGA sites, was sufficient to abolish the Gsc response (Fig. 1d). These deletions in the Gsc locus did not alter the response of a separate Smad target gene, the negative feedback regulator Smad7, in the mutant cell clones, indicating that the cells remain capable of activin signal transduction (Fig. 1d).

Additionally, we provide Smad4 ChIP-qPCR data in ES cells showing that the 10 bp deletion in the GC-rich region abolished Smad4 interaction with this region without affecting Smad4 interaction with the DE elements (Fig. 1e).

### Smad4 binding to GC sequences from the Gsc proximal promoter

In order to characterize the specific GC-rich motifs recognized by Smad proteins, we first designed two 30-bp dsDNA oligonucleotides (GC1 and GC2 segments, Fig. 2a), and measured the interactions by electrophoretic mobility shift assays (EMSA), using recombinant Smad MH1 domains and Cy5-labeled DNAs (Fig. 2b–d). To narrow down the motifs present in these regions, we applied a sliding window approach (6-bp windows sliding by 2-bp steps, within 20-bp duplexes, Supplementary Table 1) covering the GC1 and GC2 segments. The Smad4 MH1 domain-bound dsDNA oligonucleotides containing the GGCGC/G, GGCCGC/G, and GGCTG sequences (underlined in brown, Fig. 2a, b, and Supplementary Table 1) with an affinity in the nanomolar range, as determined by isothermal titration calorimetry (ITC) (Supplementary Fig. 2b). Smad4 bound the GGCGC oligonucleotide with a K d of 160.3±0.2 nM. In parallel experiments, Smad4 bound a CAGAC oligonucleotide with K d of 270.5±0.1 nM (Supplementary Fig. 2b).

The Smad3 MH1 domain bound the GC1 oligonucleotide with a K d of ~100 nM, as estimated from EMSA (Fig. 2b). Smad1 and Smad8 MH1 domains also bind to these motifs efficiently with K d values similar to those displayed for a GC-rich BRE and for the palindromic AGAC15 (Fig. 2e; Supplementary Fig. 2a). These values are comparable to dissociation constants of other transcription factors in and their cognate DNA motifs, which range from 10–100 nM27,28,29,30. Smad4 binding to these 5-bp GC-rich motifs is conserved through evolution, as determined with Smad4 MH1 domains from Placozoa and Porifera (Fig. 2f). Not all GC-rich motifs bound to Smad4 with high affinity. dsDNA oligonucleotide with sequences CCGCGC, CCTGGGC, and GGGGCC failed to bind the Smad4 MH1 domain in the nanomolar range (Fig. 2c), as did the FoxH1 site and a palindromic CAGA site sequence (Fig. 2d).

### Structural flexibility of the DNA-binding hairpin

To investigate how Smad4 distinguishes between target and non-target GC-rich sequences, we first analyzed the properties of the MH1 domain devoid of DNA, using nuclear magnetic resonance (NMR) spectroscopy in solution. Flexibility and oligomerization properties of the MH1 domain were also analyzed by acquiring 15N NMR backbone relaxation data at two different protein concentrations. The 15N{1H} steady-state nuclear Overhauser effect (hNOE)31 values are homogeneous all along the sequence except for the β2–β3 pair, which forms the DNA-binding hairpin3. The β2–β3 hairpin is more flexible than the rest of the protein. (Fig. 3a) The analysis of the T1 and T2 experiments allowed us to calculate a correlation time τ c of 9.5 ns, which is in agreement with a sample of 15 kDa tumbling as a monomer in solution.

These structural properties were confirmed by small-angle X-ray scattering (SAXS) profiles acquired at the European Synchrotron Radiation Facility (ESRF Grenoble, beamline BM29). The data suggest a maximum protein size of 56.5 ± 1.4 Å, (D max) consistent with a globular monomeric particle in solution with a radius of gyration (R g) of 15.9 ± 1.0 Å. (Fig. 3a) Remarkably, improving the fitting of the experimental data required refining of the X-ray model. The best results were obtained using the backbone chemical shifts converted to dihedral angles and allowing for some flexibility in the extended N-terminus region of the domain and in the β-hairpin involved in DNA binding. A set of the best 10 structures with the lowest χ 2 to the experimental SAXS profile32 superimposed to the X-ray structure (PDB: 3QSV) is shown in Fig. 3b.

Overall these results indicate that in the absence of DNA, the Smad4 MH1 domain is well structured and behaves as a monomer in solution. The fold is similar to that determined by X-ray crystallography of the Smad4 MH1 complex with DNA13, except that the DNA-binding hairpin presents conformational flexibility in the absence of DNA. We postulate that the flexibility of the hairpin facilitates access to DNA duplexes with slightly distinct topologies (as defined by different A/T or G/C contents), allowing the conserved hairpin to bind different DNA motifs. This hypothesis is supported by the NMR titration experiments performed with two different DNA motifs (Fig. 3c), which induce similar chemical shifts in the same residues located in and around the conserved β2–β3 hairpin (Fig. 3d). Notably, the sequence of this hairpin is highly conserved in all R-Smads and in Smad4 (Supplementary Fig. 2c)33, suggesting that the flexibility observed in Smad4 also occurs in R-Smads.

### High-affinity recognition of the GGCGC motif by Smad4

Given that similar residues in the Smad4 MH1 domain interact with different DNAs, we set to identify which nucleotides directly contacted the protein. We applied high-throughput sparse matrix crystallization approaches using a set of 20 different duplexes, with varying DNA lengths and sequences, and obtained X-ray diffraction data of several complexes of Smad4 MH1 domain and these DNAs. The best diffracting crystals were obtained with DNAs of 16 and 18 bp. A complex with the 18 bp dsDNA oligonucleotide containing the GGCGCG sequence from the GC1 Gsc segment (Fig. 4a) yielded data at 2.05 Å (refer to Methods section and Table 1). The ASU (space group C2221) contains only one Smad4 MH1 monomer and a single DNA strand, while the second monomer and the complementary DNA strand in the biological assembly belong to the neighboring ASU (Fig. 4b).

In the complex, the convex face of the DNA-binding hairpin dives into the concave major groove of the duplex DNA containing five base pairs (6-GGCGC-10/9′-GCGCC-13′). The main DNA-binding region of the protein comprises the loop following the β1 strand, and the β2–β3hairpin (residues 75–89 in Smad4). The Smad4 β2–β3 hairpin contains three residues Arg81, Gln83, and Lys88, which are strictly conserved in all R-Smad and Smad4 proteins and define specific interactions with three nucleotides in the canonical GTCT site previously investigated14, 34. In the complex with the 6-GGCGCG-11 site, these three residues also participate in a network of hydrogen bonds with the first four consecutive base pairs of the GGCGC motif.

The electron densities for these functionally important residues, the rest of the protein and DNA are well defined (Fig. 4c). Specifically, the guanidine group of Arg81 side chain coordinates tightly the G6 nucleotide. The orientation of Arg81 is further stabilized by the interaction of Asp79 side chain with the guanidine site of Arg81. In addition, Lys88 side chain is also well defined through interactions with both G7 and G11′, the latter being located in the complementary chain. The Gln83 side chain carbonyl interacts with C10′, while the amide part is in the proximity of the C9′ phosphate. The complex is further reinforced by a set of hydrogen bond interactions between the backbones of Leu78, Gln83, and Ala85, with G11, G9′, and C10′, covering the 6-GGCGCG-11 area (Fig. 4d) and by a network of well-ordered water molecules, which are tightly bound at the interface of the protein–DNA-binding site (Fig. 4k). Ten of the water molecules are closely coordinated by residues of the β2–β3 hairpin and six bases of the GGGCGCGC region. Additionally, five more water molecules strengthen the interactions of the N-terminal helix α2 with the phosphates of the DNA backbone.

These results indicate that the complex interface is highly complementary and that one MH1 protein covers a DNA-binding site of six base pairs. When compared to the structure of the Smad4–GTCT complex13, the proteins are nearly identical with minor differences detected in one of the loops and at the first helix (Cα RMSD of 0.39 Å for 113 aligned residues). Most of the differences concentrate at the DNA-binding interface.

We also solved the X-ray crystal structure of Smad4 MH1 in complex with the GGCCG sequence, which is present in both the GC1 and GC2 segments (Fig. 2e). The complex with GGCCG (data at 3.03 Å, Space group P43) is similar to that with GGCGC, (Supplementary Fig. 4). The DNA sequence contains the GGCCG motif twice, and each of two bound MH1 monomers interacts slightly differently with this motif, with both binding sites in the biological unit displaying specific hydrogen bonds to either 5- or to 4-nucleotides, respectively. Data collection and statistics are shown in Table 1. These additional structures showed that Smad4 can recognize slightly different motifs (GGCGC and GGCCG) in the GC-rich region of Gsc.

### Smad4 recognition of the GGCGC motif across metazoans

We examined the recognition of the GGCGC site by the Smad4 MH1 domain of the Placozoa Trichoplax, metazoans whose bodies are composed by only four basic cell types and with one of the most divergent among the known Smad4 sequences relative to human Smad4 (Supplementary Fig. 2c). The crystal structure of the Trichoplax Smad4 MH1 domain bound to the GGCGC oligonucleotide reveals that this Smad4–DNA interaction is strictly conserved (Fig. 4e, f) (diffraction data obtained at 2.43 Å, space group P212121, data collection, and statistics shown in Table 1). A superposition of the human Smad4 and Trichoplax Smad4 complexes shows that both complexes are very similar, and in particular the protein binding mode (Cα RMSD of 0.46 Å for 115 aligned residues). The few observed differences are concentrated in and around two loops that also contain most of the amino-acid sequence differences (Fig. 4g). Collectively, the results show that Smad4 proteins from highly divergent metazoan species interact with the GGCGC motif with similar binding affinity and similar structural contacts.

### Mode of Smad3 binding to the GGCGC motif

Next, we investigated the binding mode of the Smad3 MH1 domain with the GGCGC motif. Crystals of human Smad3 MH1 domain bound to a 16mer dsDNA oligonucleotide containing the GGCGCG motif yielded data at 2.05 Å resolution using synchrotron radiation (I41 symmetry ID29, ESRF, Grenoble), which together with the Smad4 bound to the same site, are the highest-resolution structures determined for a DNA-bound Smad MH1 domain to date. The asymmetric unit (ASU) of the complex contains two Smad3 monomers and one dsDNA (Fig. 4h). Phases were obtained by molecular replacement using the crystal structure of the human Smad3 in complex with the GTCT motif, (PDB accession code: 1OZJ34). The electron density map obtained after refinement showed residues 10–135 of Smad3, and the dsDNA 16mer (data collection and statistics are shown in Table 1), including the presence of a bound Zn2+, coordinated by His126, Cys64, Cys109, and Cys121.

As in all other MH1 complexes, the three residues strictly conserved (Arg74 and Gln76 located in β2 and Lys81 in β3) are well defined in the structure and participate in a network of hydrogen bonds with the 5-GGCGCG-10 motif. The guanidine group of Arg74 is well ordered due to its tight coordination by the G5 nucleotide (fitting of the side chains to the density are shown in Fig. 4i). The Lys81 side chain is also well defined, through hydrogen bond interactions from the ɛ-amino group to the pair of nucleotides G6 and G10′ (in the complementary chain) and with several well-ordered water molecules in the proximity of the ɛ-amino group. The Gln76 side chain carbonyl interacts with N3 of C9′, and also with C7, G8, and G8′ mediated by the presence of several tightly bound water molecules (Fig. 4j, l). Various interactions occurred between Leu71, Gln76, Ser78, His79, and His101 backbone and the DNA phosphates. In addition, interactions from Lys33, Lys41, Leu71, Asp72, Arg74, Gln76, Ser78, and Lys81 to the DNA via bound water molecules, further stabilize the complex. In the past, it has been suggested that binding of the Smad3 MH1 domain to GC-containing motifs might be unspecific through contacts made from lysine and arginine residues present in helix α2 with the DNA34. However, among the many hydrogen bonds detected from the protein to DNA, only two of them involved two residues located in this helix and were facilitated via water molecules. In fact, Smad3 and Smad4 GGCGC complexes are structurally very similar with a Cα RMSD of 0.44 Å for 105 aligned residues and both complexes interact directly with the DNA using the β2–β3 hairpin (Fig. 4m).

The GC-rich region of Gsc contains a conserved GGCTG sequence, which resembles the GTCT-AGAC SBE (Fig. 5a). This GGCTG motif bound Smad3 and Smad4 MH1 domains with high affinity in the EMSA experiments. We determined how the T to G change modified the recognition of the motif by Smad3 and Smad4 using a palindromic version of the motif similar to the canonical SBE. The X-ray crystal structure of the Smad4 MH1 domain in complex with a GGCT-AGCC oligonucleotide was determined at 2.98 Å resolution (space group P212121). The ASU contains two Smad4 MH1 monomers and one dsDNA molecule (Fig. 5b). Contacts were detected from Arg81 to G5 in one DNA strand and from Gln83 to A9′ in the complementary DNA strand. Lys88 is bound by G10′ in one of the monomers, as in the previously described complex with the canonical GTCT complex (PDB: 3QSV). Interestingly, in the second monomer, the density of Lys88 suggests that both G6 and G10′ nucleotides are contributing to the interaction with the GGCT motif, similar to what we observed in the GGCGC complex (both conformations are shown in Fig. 5c, d).

The Smad3 MH1 domain complex with the GGCT oligonucleotide was refined at 2.12 Å resolution (space group I41) (Fig. 5e). Most of the hydrogen bonds with the GGCT site are very similar to those observed with Smad4 (Fig. 5f, g). The main difference is that in the Smad3 complex no hydrogen bonds are observed from the ɛ-amino group of Lys81 to the G6 nucleotide. In this complex, we also detected several tightly bound water molecules at the protein–DNA interface that contribute to the stabilization of the interactions (Fig. 5g).

### Comparative analysis of Smad binding to different motifs

We analyzed the superposition of the Smad4 MH1 domain bound to different DNAs (Fig. 6a) with Cα root mean square deviations (RMSD) of 0.47 Å and of 0.45 Å for the GTCT and GGCT, respectively, for 114 aligned residues. Similar RMSD values were obtained when we compared the Smad3 complexes. Both GGCGC and GTCT structural comparisons reveal differences in the DNA-binding mode, including the number of direct hydrogen bonds to the bases, and in the topology of the DNA structures, (Fig. 6b), although the same protein residues are involved in the interactions (Fig. 6c). The complexes of Smad4 or Smad3 MH1 domain with the ds-GGCGC motif show a narrower but deeper major groove and a wider and less pronounced minor groove (average width of 10.0 and 6.5 Å, respectively for the Smad4 complex) compared to this domain bound to GTCT/AGAC (3QSV) (10.6 and 5.9 Å, respectively). The conformation of the Smad-bound GGCT motif is intermediate between that of GTCT and that of GGCGC. Similar values were measured for Smad3 complexes. The similarities and differences of both topologies were quantified using Curves+33 (Fig. 6d). Similar topologies of GC-rich DNAs have been previously reported in other cases including the binding of several consecutive Zn-finger domains to GC motifs35. These differences in the DNA shape and flexibility provide a good match between the convex face of the DNA-binding Smad hairpin and the concave and narrow major groove of the GGCGC site, allowing the side chains of Arg81 and Lys88 to establish a well-defined set of hydrogen bonds with G nucleotides.

### A consensus GC-rich Smad binding element

The six MH1–DNA complex structures reveal common features of how Smad3 and Smad4 recognize GC motifs in a specific manner. They also illustrate why not all GC motifs can be Smad binding targets. Efficient interactions with GC sites occur only if a G nucleotide is located deep in the major grove, and establishes hydrogen bonds with the guanidinium group of Arg81. This interaction facilitates a complementary surface contact between the Smad DNA-binding hairpin and the major groove of the DNA. The ε-amino group of Lys88 also displays hydrogen bonds with two nucleotides in positions+1 and +2 (also guanines) with respect to the first G. Due to the extended side chain of the lysine, the guanines can be located at either side of the DNA duplex. These conserved sets of contacts define the first three nucleotides in the preferred motifs. Gln83 is more versatile and can form hydrogen bonds using its bi-functional side chain. Indeed, Gln83 interacts with GC in GGCGC (or CG in GGCCG) and with an adenosine in GTCT and GGCT.

Additional contacts are observed with the DNA backbone to define an area of interactions with up to 6 bp. The pattern of contacts and the binding mode elucidated here for Smad3 and Smad4 with the GGCGC motif are slightly different to that recently described for Smad5 bound to the palindromic GGCGCC site15. In the Smad5 complex, perhaps favored by the short palindromic nature of the motif, the binding mode is defined by two MH1 domains bound to the six nucleotides DNA site, with each domain binding only three contiguous nucleotides15. However, in the Smad3 and Smad4 complexes, we find that five nucleotides are covered by a single MH1 domain through direct interactions, and up to six nucleotides when we include hydrogen bonds with the DNA backbone. The comparison of the different hydrogen bonds detected in the Smad3 and Smad4 GGCGC complexes to those described for Smad5 bound to the GGCGCC site is included as (Supplementary Fig. 5).

The observed pattern of interactions defines a consensus GC-rich binding site of five nucleotides, the GGC(GC)|(CG) motif, that binds with high affinity to MH1 domains of Smads 1, 2, 3, 4, and 8. The 5 bp GGC(GC)|(CG) motif covers five bases of the GGC palindrome and a GCCG motif, which were previously identified as BREs36. We propose the term 5GC SBE to designate this general SBE.

### 5GC Smad binding elements are clustered

As revealed by the structures with the 5GC motifs, the fact that specific contacts in the complexes involve nucleotides present in both DNA strands provides an additional plasticity to the interactions but also adds complexity for the identification of these motifs in ChIP-Seq data analysis genome wide, explaining why the GC motifs have remained elusive in these analyses.

## Discussion

The Smad MH1 domain is highly conserved in distantly related metazoans, with no amino-acid sequence differences in the DNA-binding β-hairpin (with the exception of Drosophila Smad2). The structures determined here reveal that the β-hairpin is flexible and adaptable to bind with high-affinity several variants of the consensus sequence GGC(GC)|(CG), which we refer to as 5GC SBE. The structures also define the contacts of the Smad3 and Smad4 MH1 domains with specific DNA bases and phosphate groups.

Given the sequence identify of the β-hairpin and the overall high conservation of the MH1 domain in all R-Smad proteins and Smad4, it is not surprising that the various 5GC SBE motifs similarly bind to Smads 2 and 3, which operate in the TGF-β/nodal signaling, Smads 1, 5, and 8, which operate in BMP signaling, and Smad4, which is shared in all of these pathways. Binding of Smads 3 and 4 to the 5GC motifs is of similar or higher affinity than the binding of these proteins to the CAGAC motif GTCG (and to its variant GGCT), which has long been considered as a TGF-β responsive SBE. Smads 3 and 4 also bind to the motif GGCGCG, which is similar to the palindromic GGCGCC sequence previously identified as a BRE for BMP-activated Smad116.

The present results challenge long-standing perceptions that a dichotomy exists between the intrinsic preference of different Smad proteins for different SBEs. The concept that Smad2/3 and Smad4 primarily bind CAGAC SBEs, whereas Smads 1, 5, and 8 prefer GC-rich motifs seem too restrictive in light of the evidences. We propose that all R-Smad and Smad4 proteins recognize G(T/G)CTG motifs of the classical CAGAC SBE as well as GGC(GC)|(CG) motifs of the 5GC SBE. Analysis of ChIP-Seq data sets corresponding to CREs indicates that Smad responsive regions frequently contain both 5GC and GACAG SBEs, with a higher abundance of 5GC SBEs, which frequently occur as clusters. Additional biochemical and functional studies would need to be done in the future to validate the genome-wide findings and to narrow down the specific regions recognized by Smad proteins case by case.

Smad proteins do not bind to just any 5-bp GC-rich motif, and our structures of Smad–DNA complexes illuminate the basis for the selectivity of these interactions. These insights also suggest how three MH1 domains in a trimeric Smad complex may bind to DNA at once. The clustering of multiple SBEs in gene promoter regions may facilitate the cooperative binding of three Smad molecules in the same complex to a target promoter. This is illustrated by a model of three Smad MH1 domains binding to 5GC SBEs in the Gsc PP (Fig. 7c).

## Methods

### Protein expression and purification

The plasmid (Addgene #14959) encoding the human Smad4 protein was used as the template for cloning different constructs of the MH1 domains to optimize the yield and homogeneity of the recombinant proteins. The best results were obtained with the Pro10-Gly140 fragment of Smad4, which was expressed as a N-terminal His-tagged fusion protein followed by a TEV or 3C cleavage sites. The MH1 domain is very stable in solution, showing a melting temperature of 66.2 ± 0.1 °C determined by differential scanning calorimetry and well folded according to the NMR data. The 1D and 2D NMR data of this recombinant domain show a well-dispersed pattern of chemical shifts indicative of a folded sample (1H, 15N-Heteronuclear 2D Single Quantum Coherence, and Transverse Relaxation-Optimized SpectroscopY, HSQC-TROSY).

Two divergent Smad4 MH1 domains, (Amphimedon queenslandica, (Porifera) NCBI Reference Sequence: XP_003388571, Thr52-Thr192; and T. adhaerens (Placozoa), GenBank: EDV21247, Met1-Ser151) as well as the Hu-Smad3 (NCBI Reference Sequence: NP_005893, isoform 1, Met-Pro141), the Smad8/9 domain (Uniprot: O15198-1, Thr14-Pro144), Smad1 (Q15797-1, Thr10-Arg142), Smad5 (Q99717-1, Thr11-Arg 143) were cloned using a synthesized DNA template with optimized codons for bacterial expression (Thermo Fisher Scientific). All clones were confirmed by DNA sequencing.

All protein constructs were expressed in Escherichia coli BL21 (DE3) and purified following standard procedures. Unlabeled and labeled samples were prepared using LB and minimal media (M9) cultures, respectively. D2O (99.95%, Silantes), 15NH4Cl, and/or d-[13C] glucose (Cambridge isotopes) were used as sole hydrogen, nitrogen, and carbon sources, respectively, to prepare the labeled samples37. Proteins were expressed fused to a N-terminal His-tag followed by a TEV protease cleavage site.

Cells were cultured at 37 °C to reach an OD600 of 0.6. After induction with IPTG (final concentration of 0.4 mM) and overnight expression at 20 °C, bacterial cultures were centrifuged and cells were lysed using an EmulsiFlex-C5 (Avestin) in the presence of lysozyme. The soluble supernatants were purified by nickel-affinity chromatography (HiTrap Chelating HP column, GE Healthcare Life Science) using a NGC Quest 10 Plus Chromatography System (BIO-RAD). Eluted proteins were digested with TEV or 3C proteases (at 4 °C or room temperature, respectively), and further purified by ion exchange chromatography using a HiTrap SP HP (GE Healthcare), and size-exclusion chromatography on a HiLoadTM Superdex 75 16/60 prepgrade columns (GE Healthcare) equilibrated in 20 mM sodium phosphate (pH 6.3), 150 mM NaCl, and 2 mM TCEP (Melford, UK). For crystallography (Smad4 MH1 domains), the last step of purification was performed using 20 mM Tris buffer, (pH 7.2), 80 mM NaCl, and 2 mM TCEP. All clones were confirmed by DNA sequencing and the purified proteins were verified by MALDI-TOF mass spectrometry.

### Duplex DNAs

Duplex DNAs were annealed using complementary single-strand HPLC purified DNAs. DNAs were mixed at equimolar concentrations (1 mM), heated at 90 °C for 3 min and allowed to cool down to room temperature during 2 h. DNAs were purchased (at Biomers and/or at Metabion, Germany).

### Electrophoretic mobility shifts assay

Binding reactions were carried out for 30 min at room temperature in 10 μL of binding buffer (20 mM Tris pH 7.2, 80 mM NaCl, 2 mM DTT). A fixed concentration of 5′-end Cy5-labeled duplex DNA (7.5 nM) was incubated with increasing amounts of Smad4 MH1 domain. Electrophoresis was performed in non-denaturing 12.0% (19:1) polyacrylamide gels. The gels run for 1 h in 1× TG buffer (25 mM Tris, pH 8.4, 192 mM Glycine) at 90 V at 20 °C. The gels were exposed to a Typhoon imager (GE Healthcare) using a wavelength of 678/694 nm (excitation/emission maximum) for the Cy5 fluorophore.

### NMR chemical shift assignment and perturbation experiments

NMR data were recorded at 298 K on a Bruker Avance III 600-MHz spectrometer equipped with a quadruple (1H, 13C, 15N, 31P) resonance cryogenic probe head and a z-pulse field gradient unit at 298 K. Backbone 1H, 13C, and 15N resonance assignments were obtained by analyzing the TROSY versions of 3D HNCACB and HN(CO)CACB experiment pair using (1H/2H, 13C, 15N samples)38. To optimize the quality of the triple resonance backbone spectra, all experiments were acquired as Band-Selective Excitation Short-Transient-type experiments (BEST) with TROSY and non-uniform sampling (NUS), using two different buffer conditions (pH 7.5 and 5.5) to minimize the number of overlapped amides38, 39. This strategy allowed us to unambiguously assign 116 of the 125 possible amides (131 residues, 6 of them prolines). The comparison of the Smad4 MH1 (Cα and Cβ) chemical shifts to reference values, as well as the 15N edited-NOESY data, corroborated the presence of bound Zn2+ and of four helices and six strands, characteristic of the MH1 fold. The strands are ordered as three anti-parallel pairs: β1β5, β2β3, and β4β6. The presence of many long-range interactions confirmed that, the structure of the MH1 domain is well defined in solution, in the absence of DNA. Chemical shifts have been deposited in the BMRB (entry 26945). For the titration experiments, NUS acquisition strategy was also used to reduce experimental time and increase resolution. Protein samples (500 μM for backbone and 15N-NOESY experiments and 100 μM for titrations) were equilibrated in a buffer containing 20 mM sodium phosphate and 100 mM NaCl. All samples were supplemented with 10% D2O and pH adjusted to 6.3. Spectra were acquired using 15N-labeled protein samples at the indicated concentrations with progressively increasing amounts of the unlabeled DNA. Chemical shift perturbation analyses were performed with a 0.2 weighting of 15N with respect to 1H. NMRPipe40 and MddNMR39 were used for spectra processing and spectra assignment and analysis was performed with CARA41.

### Genome-editing with CRISPR/Cas9

sgRNAs targeting genomic regions of interest were designed using CRISPR Design Tool (http://crispr.mit.edu/)42 and synthesized by IDT, Inc. Single cells were sorted onto irradiated MEF feeder for increased viability through FACS 72 h post transfection. Mutant clones were first screened through aberrant melting temperature of qPCR products, then verified by PCR, TA cloning, Sanger sequencing, and CRISPResso analysis individually. The mESCs used for this experiment are diploid.

### Chromatin immunoprecipitation

mESCs E14Tg2a.IV (RRID: MMRRC_015890-UCD) were maintained on gelatin- (0.1%, Millipore, ES-006-B) coated plates with LIF-supplemented medium at 37 °C with 5% CO2. Basic ES cell medium included 80% knockout DMEM (Life Technologies, 10829-018), About 15% fetal bovine serum (HyClone, SH30071), 50 U Penicillin and 50 μg/mL Streptomycin (Cellgro, 30-001-CI), 1% non-essential amino acids (Life Technologies, 11140-050), 1% l-glutamine (Life Technologies, 25030081), 100 μM β-mercaptoethanol (Sigma-Aldrich, M6250), 1000 U/mL mLIF (Gemini Bio-Products, 400–495). EB formation and differentiation were carried out as described by the supplier (ATCC).

Mouse EBs were treated with recombinant human activin A (R&D Systems, 338-AC 50 ng/mL), fixed and quenched, sonicated to average fragment size of 250 bp in 1% SDS lysis buffer, and incubated with 60 μl Dynabeads protein G conjugated with 3–5 ug of indicated antibodies (SC-7966X, Santa Cruz Biotechnology). About 2% pre-cleared chromatin prior to primary antibody addition was kept as input DNA. Magnetic beads were washed, chromatin was eluted, and reverse crosslinked ChIP DNA was dissolved in 10 mM Tris pH 8.0 buffer for further analysis.

ChIP-Seq DNA samples were quantified and quality assessed by Ribogreen and Agilent Bioanalyzer. DNA fragments range from 200 to 600 bp were selected constructed for ChIP-Seq library with TruSeq ChIP Sample Prep Kit (Illumina) according to manufacturer’s instructions. Multiplexed sequencing libraries were run on a Hiseq2500 platform. Sequencing reads in FASTQ format were quality assessed by FastQC v0.11.3 for sequencing integrity and completeness. FASTQ reads were mapped to mouse genome mm9 (NCBI build 37, Jul/2007) with Bowtie2 with default filtering criteria. Resulted SAM files were converted to BAM files though Samtools 0.1.19. BAM files were sorted and indexed with Samtools. BAM files were converted to TDF file by IGV Tools 2.3.32 using the command “igvtools count -z 5 -w 25 -e 250”, specifying the coverage window size to be 25 bp and average fragment size of 250 bp. Additional information is described in the GSM2746361 entry.

### NMR relaxation measurements

T1 and T2 relaxation measurements were acquired using standard pulse sequences43. Peak integration values were fitted to a two-parameter function (Eq. 1), where I 0 and I(t) are the peak intensities at times 0 and t, respectively. The optimum value of the I 0 and the T 1,2 parameters are determined using the Levenberg–Marquardt optimization algorithm:

$$I\left( t \right) = I_0e^{\left( {\frac{{ - t}}{{T_{1,2}}}} \right)}$$
(1)

The rotational correlation time of the domains was calculated with the Eq. 2, using the approximation of slow molecular motion τ c larger than 0.5 ns and assuming that only J(0) and J(ωN) spectral density terms contribute to the overall value. νN is the 15N resonance frequency:

$$\left( {60.08 \times 10^6{\mathrm{Hz}}} \right)\tau _{\rm{c}} \approx \frac{1}{{4\pi \nu _N}}\sqrt {\left( {6\frac{{T_1}}{{T_2}} - 7} \right)}$$
(2)

Heteronuclear NOE experiments were acquired in an interleaved manner (the reference and the presaturated HSQC spectra). Steady-state values of 1H-15N NOEs resulted from the ratios of the peak intensities measured in the reference (I 0) and the presaturated (I S) spectra during the relaxation delay, as described. Background noise levels σ S and σ 0 were measured and used to determine the NOE standard deviation:

$$\frac{{\sigma _{{\rm{NOE}}}}}{{{\rm{NOE}}}} = \left( {\left( {\frac{{\sigma _{I_{\rm{S}}}}}{{I_{\rm{S}}}}} \right)^2 + \left( {\frac{{\sigma _{I_0}}}{{I_0}}} \right)^2} \right)^{\frac{1}{2}}$$
(3)

### SAXS data collection, analysis, and modeling

SAXS data were collected on samples of human Smad4 MH1 at protein concentrations ranging from 0.5 to 4.2 mg mL−1, in 20 mM Tris buffer, pH 7.2, 150 mM NaCl. Data were acquired at Beamline 29 (BM29) at the European Synchrotron Radiation Facility (ESRF; Grenoble, France). Protein samples were centrifuged for 10 min at 10.000×g prior to data acquisition. Experiments on BM29 were collected at 12.5 keV and data were recorded on a Pilatus 1 M detector, at 10 °C. We collected 10 frames for each of the samples (1 s exposure).

Image conversion to the 1D profile, data reduction, scaling, and buffer subtraction were done by the software pipeline available at the BM29 beamline. Further processing was done by the ATSAS software suite44. Guinier plot calculation, for radius of gyration estimation, was performed by PRIMUS, included in the ATSAS suite, using low q regions (q max × R g < 1.3).

Modeling with SAXS and NMR data were done with XPLOR-NIH45, using the SAXS 1D profile46 and the backbone chemical shifts of Smad4 MH1. The chemical shifts were converted to backbone, phi, and psi, dihedral angles using TALOSN47. Starting from the crystal structure (PDB: 3QSV), DNA and water molecules were removed and secondary structure elements were restrained with an RMSD potential, implemented in XPLOR-NIH. A simulated annealing procedure was applied using the SAXS and dihedral angles as restraints. Comparison between experimental and fitted SAXS profiles was done using CRYSOL32. The 10 structures, with the lowest χ 2 to the experimental SAXS profile, were selected for further analysis.

Calculation of the χ 2 metric, for N data points, is given by:

$$\chi ^2 = \frac{1}{N}\mathop {\sum }\limits_{i = 1}^N \frac{{\left[ {I^{{\rm{calc}}}\left( {q_i} \right) - I^{{\rm{obs}}}\left( {q_i} \right)} \right]^2}}{{\sigma _i}}$$
(4)

### X-ray

High-throughput crystallization screening and optimization experiments were performed at the HTX facility of the EMBL Grenoble Outstation48. The human and Trichoplax Smad4 as well as the human Smad3 MH1 domains were concentrated to 5 mg mL−1 prior to the addition of the annealed DNAs (Metabion) dissolved in 20 mM Tris pH 7, 10 mM NaCl. The final protein DNA molar ratio was 1:1. Screenings and optimizations were prepared by mixing 100 nL of the complex solution and 100 nL reservoir solution in 96-well plates. Crystals of the complexes were grown by sitting-drop vapor diffusion at 4 °C. Crystals were obtained with several DNAs and conditions and were reproducible. Several data sets were acquired for the best diffracting crystals and analyzed. Final conditions for the three best diffracting complexes were optimized as follows:

Human_Smad4 GGCGC complex: the crystallization condition was 17% PEG6000, 0.2 M sodium chloride, and 0.1 M sodium acetate at pH 5.0 in the crystallization buffer. Crystals were cryoprotected in mother liquid supplemented with 20% PEG MME 500. The purification buffer was supplemented with 20 mM calcium chloride.

Human_Smad4 GGCCG complex: 16% PEG MME 2000 and 0.1 M sodium acetate pH 5.0. Crystals were cryoprotected in mother liquid supplemented with 20% glycerol.

Trichoplax_Smad4 GGCGC complex: 4.8% PEG4k, 0.1 M sodium acetate at pH 4.6. Crystals were cryoprotected in mother liquid supplemented with 20% PEG MME 550.

Human_Smad4 GGCT complex: 24% PEG3350 and 0.2 M sodium chloride. Crystals were cryoprotected in mother liquor supplemented with 18% PEG MME 550.

Human_Smad3 GGCGC complex: the crystallization condition was 0.2 M lithium acetate, 20% PEG3350. Crystals were cryoprotected in mother liquid supplemented with 20% 550 MME.

Human_Smad3 GGCT complex: 0.02 M sodium potassium phosphate, 0.1 M BisTris propane pH 6.5, 20% PEG3350 cryoprotected with 20% 550 MME.

Crystals were frozen in liquid nitrogen. Diffraction data were recorded at the ESRF on the beamlines ID23-2 (for Hu_Smad4-GGCGC and Hu_Smad4-GGCCG), ID23-1 (for T_Smad4-GGCGC), or ID30A-3 (for Hu_Smad4-GGCT) and ID29 (for Hu_Smad3-GGCGC and GGCT complexes). The data were processed with XDS49 and scaled and merged with XSCALE50. Initial phases were obtained by molecular replacement using PHASER51 from the CCP4 suite (search model PDB code: 3QSV). REFMAC52, PHENIX53, and BUSTER54 were employed for the refinement and COOT55 for the manual improvement of the models. Water molecules bound at the DNA–protein interface were selected when they participate in at least three hydrogen bonds (cutoff distance of 3.5 Å).

### Ramachandran statistics

The human Smad4–GGCGC complex has 99.2 and 0.8% of the residues in the preferred and allowed regions of the Ramachandran plot, 96.7 and 3.3% for the Trichoplax Smad4-GGCG, 96.7 and 3.3% for the GGCT complex, 98 and 2% for the human Smad3–GGCGC complex, 98.4 and 1.6% for the Smad3–GGCT complex and 96.7 and 3.3% for Smad4–GGCCG complex. No structure displays outlier residues. The data were validated using MolProbity56.

Figures describing the structures were generated with UCSF Chimera57. Structures have been deposited at the PDB: 5MEY, 5MEZ, 5MF0, 5NM9, 5OD6, and 5ODG.

### Isothermal titration calorimetry

ITC measurements were performed using a nano ITC calorimeter (TA Instruments) at 20 °C. DNA and protein samples were dissolved in the same buffer and degassed before the experiments. Concentrations were determined using a NanoDrop system and their predicted extinction coefficients. The NanoAnalyze software (TA Instruments) was used to analyze the binding isotherms. Baseline controls were acquired with buffer and pure DNA solutions. Regarding the stoichiometry of the interactions, palindromic DNAs contain two theoretically equivalent sites (detected in the EMSA assays). However, if the stoichiometry is considered as one of the parameters that can be adjusted together with the K D, the stoichiometry values obtained were slightly below the expected 0.5, even for the SBE used as control. We assume that these discrepancies are due to unavoidable errors in the determination of the active concentration of duplex DNA present during the titration, and also due to the equilibrium between duplex and single-stranded DNA (some duplexes have melting points close to 37 °C). Fittings were performed using the independent binding sites model. For the GTCT, we obtained ΔH of −60 KJ mol−1, ΔS of −60, and ΔG of −42. Similar values were obtained for the GGCT, with −62, −74, and −40, respectively. We obtained ΔH of −23, ΔS +70, and ΔG of −44 for the GGCG.

### Computational methods

We downloaded the mm9 and hg19 human reference genomes using the UCSC58 genome Distributed Annotation System server (DAS). From this, we used the EMBOSS59 fuzznuc tool to scan for the following DNA patterns: CAGAC, AGAC, GGCCG, GGCTG, GGCGG, and GGCGC. For each of the patterns and from the output of the search, we generated a bed track that were displayed with the IGV 2.3.66 tool60, for visual inspection of the results.

The following Gene Expression Ominbus (GEO)61 data sets were downloaded for the analysis: (GEO accession numbers ordered as shown in the figures: GSM727561, GSM727586, GSM727564, GSM72758921, GSM126681723, GSM727557, GSM72758521, GSM5395489, GSM76175722. With the exception of GSM1266817, all remaining data sets were originally aligned to reference genome hg18. We have referenced them to hg19 using the UCSC liftOver tool58.

Additional controls were selected to determine whether other TFs that bind GC-rich sites affect the interaction of Smad proteins with 5GC SBEs (Supplementary Fig. 8). Encyclopedia of DNA Elements (ENCODE) ENCFF484WVM and ENCFF175VXL HepG2 cell data sets (generated by Dr. R. Meyers lab U54HG004576) were downloaded for comparison of Smad4 binding regions in cell lines and for control experiments using the SP1and CTFT transcription factors data. For these controls, we used Smad4 ChIP-Seq peaks that were not overlapped by SP1 ChIP-Seq peaks or by CTCF ChIP-Seq peaks (two different experiments) in human HepG2 cells. Each of the filtered data sets showed similar clustering values than the unfiltered Smad4 HepG2 data set. In addition, to explore the 5GC cluster enrichment outside of promoter regions, we also performed these analyses using all Smad4 ChIP-Seq peaks genome wide. The results still show a notable enrichment of 5GC SBE clusters (18 and 40% of the Smad binding regions have three or more 5GC motifs vs. 6% in the baseline) in hESC and in HepG2 cells, respectively.

We used the GEO data sets GSM178291462 and GSM2746361 for mouse data analysis. The Integrative Genomics Viewer (IGV) was used for the visualization of the data and for generating the maps displayed in several figures60.

### Motif analysis

To determine each motif frequency and the corresponding average frequencies and distributions, we created 200 bp DNA regions centered in ChIP-Seq regions that were up to 1000 bp from the transcription starting sites obtaining 1755 distinct regions from Smad1 (GSM1505745), 49 from Smad3 (GSM727585), and 155 from Smad4 (GSM727586). Similar analyses were performed with the Smad4 HepG2 (ENCFF484WVM) cells data set, obtaining 16,463 peaks. We also generated a baseline data set with 500 regions of 200 bp length generated from random human genomic coordinates. We determined the number of CAGAC, AGAC, and GGCGC, GGCTG, GGCCG or GGCGG sites in both ChIP-Seq data and baseline regions.

As additional data sets, we also analyzed 200 bp DNA regions centered in ChIP-Seq peaks genome wide (not restricted to promoters) for SMAD4 data sets GSM727586 (4531 regions) and ENCFF484WVM (45,887 regions). As additional controls, we analyzed peaks in both promoters and genome wide for the SMAD4 data set ENCFF484WVM in which we discarded all peaks that had any overlap with the SP1 data set ENCFF175VXL, obtaining 6393 and 23,084 distinct regions, respectively. The same control was performed discarding overlapped peaks with the CTCF data set ENCFF237OKO, obtaining 9035 and 36,587 distinct regions, respectively.

### Data availability

Smad4 ChIP-Seq data were deposited in the Gene Expression Omnibus database under accession number GSM2746361. NMR assignments and chemical shifts have been deposited in the Biological Magnetic Resonance Data Bank, BMRB entry 26945. Densities and coordinates have been deposited in the Protein Data Bank PDB entries are 5MEY, 5MEZ, 5MF0, 5NM9, 5OD6, and 5ODG63.