Transcription factors recognize specific genomic sequences to regulate complex gene-expression programs. Although it is well-established that transcription factors bind to specific DNA sequences using a combination of base readout and shape recognition, some fundamental aspects of protein–DNA binding remain poorly understood1,2. Many DNA-binding proteins induce changes in the structure of the DNA outside the intrinsic B-DNA envelope. However, how the energetic cost that is associated with distorting the DNA contributes to recognition has proven difficult to study, because the distorted DNA exists in low abundance in the unbound ensemble3,4,5,6,7,8,9. Here we use a high-throughput assay that we term SaMBA (saturation mismatch-binding assay) to investigate the role of DNA conformational penalties in transcription factor–DNA recognition. In SaMBA, mismatched base pairs are introduced to pre-induce structural distortions in the DNA that are much larger than those induced by changes in the Watson–Crick sequence. Notably, approximately 10% of mismatches increased transcription factor binding, and for each of the 22 transcription factors that were examined, at least one mismatch was found that increased the binding affinity. Mismatches also converted non-specific sites into high-affinity sites, and high-affinity sites into ‘super sites’ that exhibit stronger affinity than any known canonical binding site. Determination of high-resolution X-ray structures, combined with nuclear magnetic resonance measurements and structural analyses, showed that many of the DNA mismatches that increase binding induce distortions that are similar to those induced by protein binding—thus prepaying some of the energetic cost incurred from deforming the DNA. 
Our work indicates that conformational penalties are a major determinant of protein–DNA recognition, and reveals mechanisms by which mismatches can recruit transcription factors and thus modulate replication and repair activities in the cell10,11.
The data that support the findings in this study are available as Supplementary Tables in Excel format. Coordinates and structure factor amplitudes for the TBP-AC, TBP-CC(1a), TBP-CC(1b) and TBP-CC(2) structures have been deposited in the PDB under the accession codes 6UEO, 6UEP, 6UER and 6UEQ, respectively. The raw SaMBA data have been deposited in the Gene Expression Omnibus (GEO) under accession number GSE156375. The PDB entries used in this study are available in Extended Data Figs. 1, 2, 5, 7 and Supplementary Tables 5–7, 9. High-resolution gel images for the EMSA data are available at https://figshare.com/projects/DNA_mismatches_reveal_conformational_penalties_in_protein-DNA_recognition/83663.
The code used for the structural analyses presented in this study is available in GitHub at https://github.com/alhashimilab/TF_MM.
Rohs, R. et al. Origins of specificity in protein–DNA recognition. Annu. Rev. Biochem. 79, 233–269 (2010).
Siggers, T. & Gordân, R. Protein–DNA binding: complexities and multi-protein codes. Nucleic Acids Res. 42, 2099–2111 (2014).
Guéron, M., Kochoyan, M. & Leroy, J.-L. A single mode of DNA base-pair opening drives imino proton exchange. Nature 328, 89–92 (1987).
Nikolova, E. N. et al. Transient Hoogsteen base pairs in canonical duplex DNA. Nature 470, 498–502 (2011).
Fischer, M., Coleman, R. G., Fraser, J. S. & Shoichet, B. K. Incorporation of protein flexibility and conformational energy penalties in docking screens to improve ligand discovery. Nat. Chem. 6, 575–583 (2014).
Fraser, J. S. et al. Hidden alternative structures of proline isomerase essential for catalysis. Nature 462, 669–673 (2009).
Lorch, Y., Davis, B. & Kornberg, R. D. Chromatin remodeling by DNA bending, not twisting. Proc. Natl Acad. Sci. USA 102, 1329–1332 (2005).
Parvin, J. D., McCormick, R. J., Sharp, P. A. & Fisher, D. E. Pre-bending of a promoter sequence enhances affinity for the TATA-binding factor. Nature 373, 724–727 (1995).
Denny, S. K. et al. High-throughput investigation of diverse junction elements in RNA tertiary folding. Cell 174, 377–390 (2018).
Reijns, M. A. M. et al. Lagging-strand replication shapes the mutational landscape of the genome. Nature 518, 502–506 (2015).
Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016).
Rohs, R. et al. The role of DNA shape in protein–DNA recognition. Nature 461, 1248–1253 (2009).
Zeiske, T. et al. Intrinsic DNA shape accounts for affinity differences between Hox-cofactor binding sites. Cell Rep. 24, 2221–2230 (2018).
Azad, R. N. et al. Experimental maps of DNA structure at nucleotide resolution distinguish intrinsic from protein-induced DNA deformations. Nucleic Acids Res. 46, 2636–2647 (2018).
Olson, W. K., Gorin, A. A., Lu, X.-J., Hock, L. M. & Zhurkin, V. B. DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proc. Natl Acad. Sci. USA 95, 11163–11168 (1998).
Battistini, F. et al. How B-DNA dynamics decipher sequence-selective protein recognition. J. Mol. Biol. 431, 3845–3859 (2019).
Kunkel, T. A. & Erie, D. A. Eukaryotic mismatch repair in relation to DNA replication. Annu. Rev. Genet. 49, 291–313 (2015).
Lindahl, T. Instability and decay of the primary structure of DNA. Nature 362, 709–715 (1993).
Pich, O. et al. Somatic and germline mutation periodicity follow the orientation of the DNA minor groove around nucleosomes. Cell 175, 1074–1087 (2018).
Berger, M. F. et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24, 1429–1435 (2006).
Shen, N. et al. Divergence in DNA specificity among paralogous transcription factors contributes to their differential in vivo binding. Cell Syst. 6, 470–483 (2018).
Veprintsev, D. B. & Fersht, A. R. Algorithm for prediction of tumour suppressor p53 affinity for binding sites in DNA. Nucleic Acids Res. 36, 1589–1598 (2008).
Jolma, A. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20, 861–873 (2010).
Warren, C. L. et al. Defining the sequence-recognition profile of DNA-binding molecules. Proc. Natl Acad. Sci. USA 103, 867–872 (2006).
Benos, P. V., Bulyk, M. L. & Stormo, G. D. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 30, 4442–4451 (2002).
Chattopadhyay, A., Zandarashvili, L., Luu, R. H. & Iwahara, J. Thermodynamic additivity for impacts of base-pair substitutions on association of the Egr-1 zinc-finger protein with DNA. Biochemistry 55, 6467–6474 (2016).
Kitayner, M. et al. Diversity in DNA recognition by p53 revealed by crystal structures with Hoogsteen base pairs. Nat. Struct. Mol. Biol. 17, 423–429 (2010).
Golovenko, D. et al. New insights into the role of DNA shape on its recognition by p53 proteins. Structure 26, 1237–1250 (2018).
Alvey, H. S., Gottardo, F. L., Nikolova, E. N. & Al-Hashimi, H. M. Widespread transient Hoogsteen base pairs in canonical duplex DNA with variable energetics. Nat. Commun. 5, 4786 (2014).
Shi, H. et al. Atomic structures of excited state A-T Hoogsteen base pairs in duplex DNA by combining NMR relaxation dispersion, mutagenesis, and chemical shift calculations. J. Biomol. NMR 70, 229–244 (2018).
Kim, J. L., Nikolov, D. B. & Burley, S. K. Co-crystal structure of TBP recognizing the minor groove of a TATA element. Nature 365, 520–527 (1993).
Mondal, M., Mukherjee, S. & Bhattacharyya, D. Contribution of phenylalanine side chain intercalation to the TATA-box binding protein-DNA interaction: molecular dynamics and dispersion-corrected density functional theory studies. J. Mol. Model. 20, 2499 (2014).
Peyret, N., Seneviratne, P. A., Allawi, H. T. & SantaLucia, J., Jr. Nearest-neighbor thermodynamics and NMR of DNA sequences with internal A.A, C.C, G.G, and T.T mismatches. Biochemistry 38, 3468–3477 (1999).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
Zhou, H. et al. New insights into Hoogsteen base pairs in DNA duplexes from a structure-based survey. Nucleic Acids Res. 43, 3420–3433 (2015).
Lu, X.-J., Bussemaker, H. J. & Olson, W. K. DSSR: an integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Res. 43, e142 (2015).
Sathyamoorthy, B. et al. Insights into Watson–Crick/Hoogsteen breathing dynamics and damage repair from the solution structure and dynamic ensemble of DNA duplexes containing m1A. Nucleic Acids Res. 45, 5586–5601 (2017).
El Hassan, M. A. & Calladine, C. R. Two distinct modes of protein-induced bending in DNA. J. Mol. Biol. 282, 331–343 (1998).
Bailor, M. H., Mustoe, A. M., Brooks, C. L., III & Al-Hashimi, H. M. 3D maps of RNA interhelical junctions. Nat. Protocols 6, 1536–1545 (2011).
Bailor, M. H., Sun, X. & Al-Hashimi, H. M. Topology links RNA secondary structure with global conformation, dynamics, and adaptation. Science 327, 202–206 (2010).
Le Novère, N. MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics 17, 1226–1227 (2001).
Cheatham, T. E. III, Cieplak, P. & Kollman, P. A. A modified version of the Cornell et al. force field with improved sugar pucker phases and helical repeat. J. Biomol. Struct. Dyn. 16, 845–862 (1999).
Pérez, A., Luque, F. J. & Orozco, M. Dynamics of B-DNA on the microsecond time scale. J. Am. Chem. Soc. 129, 14739–14745 (2007).
Maier, J. A. et al. ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J. Chem. Theory Comput. 11, 3696–3713 (2015).
Salomon-Ferrer, R., Götz, A. W., Poole, D., Le Grand, S. & Walker, R. C. Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. explicit solvent particle mesh Ewald. J. Chem. Theory Comput. 9, 3878–3888 (2013).
Rossetti, G. et al. The structural impact of DNA mismatches. Nucleic Acids Res. 43, 4309–4321 (2015).
Arnold, F. H., Wolk, S., Cruz, P. & Tinoco, I. Jr. Structure, dynamics, and thermodynamics of mismatched DNA oligonucleotide duplexes d(CCCAGGG)2 and d(CCCTGGG)2. Biochemistry 26, 4068–4075 (1987).
Kouchakdjian, M., Li, B. F., Swann, P. F. & Patel, D. J. Pyrimidine.pyrimidine base-pair mismatches in DNA. A nuclear magnetic resonance study of T.T pairing at neutral pH and C.C pairing at acidic pH in dodecanucleotide duplexes. J. Mol. Biol. 202, 139–155 (1988).
Boulard, Y. et al. The pH dependent configurations of the C.A mispair in DNA. Nucleic Acids Res. 20, 1933–1941 (1992).
Peng, Y. & Alexov, E. Computational investigation of proton transfer, pKa shifts and pH-optimum of protein-DNA and protein-RNA complexes. Proteins 85, 282–295 (2017).
Chen, W., Morrow, B. H., Shi, C. & Shen, J. K. Recent development and application of constant pH molecular dynamics. Mol. Simul. 40, 830–838 (2014).
Rangadurai, A. et al. Why are Hoogsteen base pairs energetically disfavored in A-RNA compared to B-DNA? Nucleic Acids Res. 46, 11099–11114 (2018).
Patel, D. J., Kozlowski, S. A., Ikuta, S. & Itakura, K. Deoxyguanosine-deoxyadenosine pairing in the d(C-G-A-G-A-A-T-T-C-G-C-G) duplex: conformation and dynamics at and adjacent to the dG X dA mismatch site. Biochemistry 23, 3207–3217 (1984).
Webster, G. D. et al. Crystal structure and sequence-dependent conformation of the A.G mispaired oligonucleotide d(CGCAAGCTGGCG). Proc. Natl Acad. Sci. USA 87, 6693–6697 (1990).
Allawi, H. T. & SantaLucia, J., Jr. NMR solution structure of a DNA dodecamer containing single G.T mismatches. Nucleic Acids Res. 26, 4925–4934 (1998).
Boulard, Y., Cognet, J. A. & Fazakerley, G. V. Solution structure as a function of pH of two central mismatches, C.T and C.C, in the 29 to 39 K-ras gene sequence, by nuclear magnetic resonance and molecular dynamics. J. Mol. Biol. 268, 331–347 (1997).
Gordân, R. et al. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep. 3, 1093–1104 (2013).
Frank, F., Okafor, C. D. & Ortlund, E. A. The first crystal structure of a DNA-free nuclear receptor DNA binding domain sheds light on DNA-driven allostery in the glucocorticoid receptor. Sci. Rep. 8, 13497 (2018).
Takayama, Y., Sahu, D. & Iwahara, J. NMR studies of translocation of the Zif268 protein between its target DNA Sites. Biochemistry 49, 7998–8005 (2010).
Belo, Y. et al. Unexpected implications of STAT3 acetylation revealed by genetic encoding of acetyl-lysine. Biochim. Biophys. Acta 1863, 1343–1350 (2019).
Stelling, A. L. et al. Infrared spectroscopic observation of a G-C+ Hoogsteen base pair in the DNA:TATA-box binding protein complex under solution conditions. Angew. Chem. Int. Edn Engl. 58, 12010–12013 (2019).
Stephens, D. C. & Poon, G. M. Differential sensitivity to methylated DNA by ETS-family transcription factors is intrinsically encoded in their DNA-binding domains. Nucleic Acids Res. 44, 8671–8681 (2016).
Zhang, L. et al. SelexGLM differentiates androgen and glucocorticoid receptor DNA-binding preference over an extended binding site. Genome Res. 28, 111–121 (2018).
Vyas, P. et al. Diverse p53/DNA binding modes expand the repertoire of p53 response elements. Proc. Natl Acad. Sci. USA 114, 10624–10629 (2017).
Weinberg, R. L., Veprintsev, D. B. & Fersht, A. R. Cooperative binding of tetrameric p53 to DNA. J. Mol. Biol. 341, 1145–1159 (2004).
Sandelin, A., Alkema, W., Engström, P., Wasserman, W. W. & Lenhard, B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32, D91–D94 (2004).
Siggers, T. et al. Principles of dimer-specific gene regulation revealed by a comprehensive characterization of NF-κB family DNA binding. Nat. Immunol. 13, 95–102 (2012).
Luisi, B. F. et al. Crystallographic analysis of the interaction of the glucocorticoid receptor with DNA. Nature 352, 497–505 (1991).
Beno, I., Rosenthal, K., Levitine, M., Shaulov, L. & Haran, T. E. Sequence-dependent cooperative binding of p53 to DNA targets and its relationship to the structural properties of the DNA targets. Nucleic Acids Res. 39, 1919–1932 (2011).
Stephens, D. C. et al. Pharmacologic efficacy of PU.1 inhibition by heterocyclic dications: a mechanistic analysis. Nucleic Acids Res. 44, 4005–4013 (2016).
Siggers, T., Duyzend, M. H., Reddy, J., Khan, S. & Bulyk, M. L. Non-DNA-binding cofactors enhance DNA-binding specificity of a transcriptional regulatory complex. Mol. Syst. Biol. 7, 555 (2011).
Maerkl, S. J. & Quake, S. R. A systems approach to measuring the binding energy landscapes of transcription factors. Science 315, 233–237 (2007).
Geertz, M., Shore, D. & Maerkl, S. J. Massively parallel measurements of molecular interaction kinetics on a microfluidic platform. Proc. Natl Acad. Sci. USA 109, 16540–16545 (2012).
Drachkova, I. et al. Effect of TATA box polymorphisms in human β-globin gene promoter associated with β-thalassemia on interaction with TATA-binding protein. Russ. J. Genet. Appl. Res. 1, 183–188 (2011).
Drachkova, I. et al. The mechanism by which TATA-box polymorphisms associated with human hereditary diseases influence interactions with the TATA-binding protein. Hum. Mutat. 35, 601–608 (2014).
Leslie, A. G. The integration of macromolecular diffraction data. Acta Crystallogr. D 62, 48–57 (2006).
Potterton, E., Briggs, P., Turkenburg, M. & Dodson, E. A graphical user interface to the CCP4 program suite. Acta Crystallogr. D 59, 1131–1137 (2003).
Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr. D 66, 213–221 (2010).
Jones, T. A., Zou, J. Y., Cowan, S. W. & Kjeldgaard, M. Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr. A 47, 110–119 (1991).
Chen, V. B. et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D 66, 12–21 (2010).
Yang, S., Salmon, L. & Al-Hashimi, H. M. Measuring similarity between dynamic ensembles of biomolecules. Nat. Methods 11, 552–554 (2014).
Hombauer, H., Srivatsan, A., Putnam, C. D. & Kolodner, R. D. Mismatch repair, but not heteroduplex rejection, is temporally coupled to DNA replication. Science 334, 1713–1716 (2011).
Krokan, H. E., Drabløs, F. & Slupphaug, G. Uracil in DNA—occurrence, consequences and repair. Oncogene 21, 8935–8948 (2002).
Shen, J. C., Rideout, W. M., III & Jones, P. A. The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. Nucleic Acids Res. 22, 972–976 (1994).
We dedicate this paper to the memory of Dr Rosalind E. Franklin, on the occasion of the 100th anniversary of her birth. Dr Franklin’s legacy, including her crucial contribution to the discovery of the molecular structure of DNA, continues to inspire generations of diverse scientists around the world. We thank S. Adar for discussions that initiated this project; D. Herschlag for discussions and comments; E. Arbely, D. Golovenko, J. Iwahara, E. Ortlund and R. Young for providing recombinant purified protein; and L. McIntosh for providing expression plasmids. This work was supported by the National Institutes of Health (NIH) grants R01-GM135658 and R01-GM117106 (to R.G.) and R01-GM089846 (to H.M.A.-H.); a Duke University GCB Pilot Grant (to R.G. and H.M.A.-H.); and an Integrated DNA Technologies postdoctoral fellowship award (to A.A.). R.S. and M.A.S. were supported by NIH grant R35-GM130290 (to M.A.S.); A.S. and T.E.H. were supported by the Israel Science Foundation grant 1517/14 (to T.E.H.); S.X. and G.M.K.P. were supported by a National Science Foundation (NSF) grant MCB-2028902 (to G.M.K.P.); and M.F. and M.A.P. were supported by an NSF CAREER award MCB-1552862 (to M.A.P.). High-performance computing was partially supported by the Duke Center for Genomic and Computational Biology. We acknowledge the Advanced Light Source (ALS) at the Lawrence Berkeley National Laboratory for X-ray diffraction data collection on beamlines 8.3.1 and 5.0.1. Beamline 8.3.1 at the ALS is operated by the University of California Office of the President, Multicampus Research Programs and Initiatives grant MR-15-328599, the NIH (R01GM124149 and P30GM124169), Plexxikon and the Integrated Diffraction Analysis Technologies program of the US Department of Energy Office of Biological and Environmental Research. The Pilatus detector on beamline 5.0.1 was funded under NIH grant S10OD021832.
The ALS-ENABLE beamlines are supported in part by the NIH National Institute of General Medical Sciences grant P30 GM124169. The ALS is a national user facility operated by Lawrence Berkeley National Laboratory on behalf of the US Department of Energy under contract number DE-AC02-05CH11231, Office of Basic Energy Sciences. The Berkeley Center for Structural Biology is supported in part by the Howard Hughes Medical Institute.
The authors declare no competing interests.
Peer review information Nature thanks James Fraser, Remo Rohs and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
a, Distributions of base-pair parameters in free and TF-bound DNA, from PDB34 survey. Solid lines denote the median value of each parameter. Dashed lines denote the upper and lower bounds of the distribution for free (pink) and bound (green) DNA. 613 TF-bound structures and 409 free B-DNA structures, all with resolution < 3 Å, were used in the analysis (Methods). b, Percentage of structures with base pairs outside the B-DNA envelope. Among the 613 TF-bound structures, 41.1% (that is, 252) contain severe distortions of at least one base pair outside the free B-DNA envelope, with the envelope defined as at most 3 standard deviations above or below the mean. Only 16% (that is, 65) of the free B-DNA structures satisfy this criterion. (Using a less stringent definition of the B-DNA envelope, by considering two standard deviations above or below the mean, we found that 80.8% of the TF-bound structures contain at least one base pair outside the free B-DNA envelope, approximately twice the frequency observed in free DNA, which was 41.8%.) Considering the full range of base-pair parameter values as defining the free B-DNA envelope, we found that 11.3% (that is, 69) of the TF-bound structures contain at least one base pair with an extreme deformation that was never observed in any free DNA structure. c, Local deformations of base pairs observed in diverse TF-DNA complex structures. Left, 3D structures with the distorted base pairs highlighted in black boxes. Upper right, enlarged view of the base-pair structures with their base-pair parameters labelled. Lower right, schematic diagram of the corresponding base-pair parameters.
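The envelope criterion in panel b can be illustrated with a minimal sketch. It assumes a single base-pair parameter, an envelope of mean ± n standard deviations computed from the free B-DNA values, and a structure counted as distorted when at least one of its base pairs falls outside that range. The function names and the toy numbers are hypothetical and are not taken from the authors' analysis code.

```python
from statistics import mean, stdev

def bdna_envelope(free_values, n_sd=3.0):
    """Return (lower, upper) bounds as mean +/- n_sd sample standard deviations."""
    m = mean(free_values)
    s = stdev(free_values)
    return m - n_sd * s, m + n_sd * s

def fraction_outside(structures, lower, upper):
    """Fraction of structures with at least one base pair outside [lower, upper].

    `structures` is a list of lists; each inner list holds the parameter
    value of every base pair in one structure.
    """
    flagged = sum(
        1 for base_pairs in structures
        if any(v < lower or v > upper for v in base_pairs)
    )
    return flagged / len(structures)

# Toy values only (hypothetical, arbitrary units).
free = [0.0, 1.0, 2.0, 3.0, 4.0]
bound = [[1.0, 2.0], [2.0, 5.0], [3.0, 3.0]]
lower, upper = bdna_envelope(free, n_sd=1.0)
result = fraction_outside(bound, lower, upper)
```

Under these assumptions, changing `n_sd` from 3 to 2 corresponds to the less stringent envelope described in the legend.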
a, Base-pairing geometry of Watson–Crick base pairs and mismatches, obtained from a survey of crystal structures in the PDB34. Mismatches with modified bases and those that were metal-mediated were excluded from analysis (Methods). Predominant base-pairing geometries under neutral pH conditions are shown in black. Minor geometries are shown in grey. b, Melting energies for DNA mismatches relative to G-C and A-T Watson–Crick base pairs. See Methods for details. c, Distributions of structural parameters in Watson–Crick and mismatched DNA, from MD simulations. Solid lines denote the median value of each parameter. Observations from the MD simulation results: (1) G-T retains wobble geometry during the MD simulation, with sheared conformation (|shear| around 2 Å) accompanied by a slight stretch. (2) T-T shows wobble geometry with sheared conformation (|shear| around 2 Å). Unlike G-T, the T-T mismatch shows rapid dynamic equilibrium between the two wobble geometries, with either one of the Ts shifted in the minor groove direction. Despite this rapid dynamic equilibrium, the T-T base pair is still constricted, with a C1′–C1′ distance of 8–9.5 Å. (3) Similar to T-T, the C-T mismatch is also constricted, with two hydrogen bonds stably formed for most of the time. However, the C-T mismatch can transiently adopt a high-energy conformation that has only one hydrogen bond and is no longer constricted (C1′–C1′ distance around 10 Å), potentially owing to the close contact between T-O2 and C-O2. These high-energy species make up approximately 5% of the entire C-T MD trajectory. (4) C-C is partially constricted, with a C1′–C1′ distance around 9.8 Å, owing to unstable hydrogen bonding. (5) All pyrimidine-pyrimidine mismatches remain stacked in the helix, without swinging out of the helix, in the MD trajectories. (6) G-G does not experience anti-syn equilibrium during the simulation. 
The C1′–C1′ distance of G-G (G(syn)-G(anti) or G(anti)-G(syn)) is around 11.2–11.5 Å, which is larger than that of the canonical G-C base pair. (7) G(anti)-A(syn) is not constricted (C1′–C1′ distance around 11 Å) and G(anti)-A(anti) reveals a large C1′–C1′ distance of around 12.8 Å. Base-pair and base-step parameters of bases with syn conformation (marked with *) were not computed, and are thus greyed out, owing to an ill-defined coordinate frame (Methods). The C1′–C1′ distance is shown, as it is not affected by the change of coordinate frame. d, Mismatches can mimic distorted base-pair geometries observed in protein-bound DNA. Overlays of distorted (coloured) and idealized Watson–Crick (grey) base pairs from 3DNA (top); mismatches (coloured) and idealized Watson–Crick (grey) base pairs (middle); and mismatched and distorted Watson–Crick base pairs (right). The mismatched conformations are of free DNA and were obtained from MD simulations (Methods). The C-T mismatch can mimic an A-T Hoogsteen base pair by constricting the C1′–C1′ distance (taken from PDB 3KZ8). The G-T mismatch can mimic a sheared A-T base pair by shifting the T to the major groove direction (taken from PDB 4MZR).
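The C1′–C1′ distance used throughout panel c is the Euclidean distance between the C1′ atoms of the two paired nucleotides, which is why it does not depend on the base-pair coordinate frame. A minimal sketch follows, assuming the C1′ coordinates (in Å) have already been extracted from each trajectory frame. The function names and the coordinates are hypothetical.

```python
import math
from statistics import median

def c1_c1_distance(c1_a, c1_b):
    """Euclidean distance (same units as the input) between two C1' atoms."""
    return math.dist(c1_a, c1_b)

def median_c1_c1(frames):
    """Median C1'-C1' distance over frames.

    `frames` is a list of (c1_a, c1_b) coordinate pairs, one pair per frame.
    """
    return median(c1_c1_distance(a, b) for a, b in frames)

# Hypothetical coordinates in angstroms, for illustration only.
frames = [
    ((0.0, 0.0, 0.0), (9.0, 0.0, 0.0)),
    ((0.0, 0.0, 0.0), (0.0, 9.5, 0.0)),
    ((0.0, 0.0, 0.0), (6.0, 8.0, 0.0)),
]
typical = median_c1_c1(frames)
```

A per-frame distance series of this kind is the sort of quantity that could feed violin plots like those in the legend, although the plotting step is not shown here.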
a, Schematic representation of our experimental workflow to detect cross-hybridization. To check whether certain oligonucleotides hybridize with non-target complementary oligonucleotides, we designed an experiment in which only certain oligonucleotides (red) were labelled. If significant cross-hybridization had occurred, we would have detected fluorescent signal on the chip even for sequences without fluorescent complements in the hybridization solution (that is, for the sequences shown in blue). b, No significant cross-hybridization was detected. Bottom, list of 12 sequences used in the hybridization solution of one SaMBA experiment (red: fluorescently labelled oligonucleotides; blue: unlabelled). Top, fluorescent signal from the hybridization of these 12 sequences on the chip. For sequences on the chip whose complement is not labelled, the fluorescent signal is practically undetectable (blue), and it is several orders of magnitude lower than that of the sequences with a labelled complementary strand (red). Box plots show median signals over replicate DNA spots, with the bottom and top edges of each box indicating the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points not considered outliers. c, The effect of mismatches on hybridization. To estimate the efficiency of our hybridization protocol, we measured the hybridization signal of one specific sequence (sequence #3 for library v.1; see Methods, Supplementary Table 10) to different sequences containing multiple mismatches (0 to around 40) and to a completely different sequence (‘60*’). As expected, the hybridization was less efficient for sequences with large numbers of mismatches. However, for small numbers of mismatches the hybridization was highly efficient. Longer incubation time, higher oligonucleotide concentration, and normalization of the signal could enable the use of SaMBA for larger numbers of mismatches. 
Plot shows medians and standard deviations over all sequences containing the same number of mismatches, with 6 replicate spots per sequence. Mismatches were introduced randomly by generating N random base changes (N = 1–5, 10, 15, 25, 35, 45) to sequence #3, and repeating the procedure ten times for each N. This led to duplexes with 1 to 37 mismatches compared to the original sequence. d, Hybridization signal is highly reproducible. The correlation of hybridization signals between two replicate experiments was very high (R² = 0.99). Plot shows median values, computed over six replicate spots, based on data shown in c. e, Validation of mismatch effects by orthogonal methods. For p53, ETS1, and GR proteins, the log-transformed SaMBA binding intensities correlate with independent affinity measurements performed on mismatched and non-mismatched DNA sites (Methods). As in PBM experiments, median values over all replicates were used for SaMBA (n = 10 replicate spots); error bars show the median absolute deviation. Average values over replicates were used for the orthogonal methods (n = 6 independent measurements for p53, and n = 3 independent measurements for ETS1 and GR), with error bars showing the standard deviation. Red shaded region, 95% confidence interval for Pearson’s correlation. Binding free energy differences (ΔΔG) are shown between native Watson–Crick binding sites and the highest increase in binding due to a mismatch. Two SaMBA sites were tested for GR (see Methods). f, Correlation between binding data obtained by SaMBA versus independent methods. For SaMBA data the plots show the median values over replicate spots (n = 10 replicate spots), with error bars showing the median absolute deviation. For independent data (Methods) the plots show the binding affinities as reported in the respective papers. Red shaded region, 95% confidence interval for Pearson’s correlation. 
g, Standard equilibrium thermodynamics equations demonstrate that the logarithm of the Kd values of the TF–DNA complex is linearly proportional to the logarithm of the TF–DNA complex fluorescence signal, under certain conditions in which the TF concentration and the free DNA concentration are in excess compared to the concentration of the bound complex (and those remain constant during the reaction). h, Similar to g, for cases in which the DNA-bound species is a dimer.
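The linear relation described in panels g and h can be checked numerically with a small sketch. It assumes a simple equilibrium in which n TF molecules bind one DNA site, so that [complex] = [TF]^n [DNA] / Kd, that the free TF and free DNA concentrations stay constant, and that the fluorescence signal is proportional to [complex]. The function names, the concentrations, and the treatment of the dimer case as n = 2 are illustrative assumptions, not the authors' equations.

```python
import math

def complex_concentration(kd, tf, dna, n_tf=1):
    """[complex] = [TF]^n_tf * [DNA] / Kd, with free [TF] and [DNA] held constant."""
    return (tf ** n_tf) * dna / kd

def loglog_slope(kds, tf, dna, n_tf=1):
    """Least-squares slope of log10([complex]) against log10(Kd)."""
    xs = [math.log10(kd) for kd in kds]
    ys = [math.log10(complex_concentration(kd, tf, dna, n_tf)) for kd in kds]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical Kd values and concentrations, for illustration only.
kds = [1e-9, 1e-8, 1e-7, 1e-6]
slope_monomer = loglog_slope(kds, tf=1e-7, dna=1e-9, n_tf=1)
slope_dimer = loglog_slope(kds, tf=1e-7, dna=1e-9, n_tf=2)
```

In this model, taking logarithms gives log [complex] = n log [TF] + log [DNA] − log Kd, so both slopes equal −1 and only the intercept depends on n. That is one way to see why a log-transformed signal can be calibrated linearly against log Kd under the stated conditions.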
a, The magnitude of the energetic effects of mutations (light colours) and mismatches (dark colours) is similar. The effects were computed for all 7 proteins with available calibration data in our study, and for a total of 12 DNA sites (Methods). The effects of mismatches were calculated relative to the two closest Watson–Crick sequences (for example, for a G-T mismatch the closest Watson–Crick base pairs are G-C and A-T; the mismatch plots include both ΔΔG(G-C > G-T) and ΔΔG(A-T > G-T)). b, Mismatches and their corresponding mutations have different, even opposite effects on TF binding. Each mutation is compared to the two closest mismatches (for example, G-C > A-T is compared to both G-C > A-C and G-C > G-T). Top left quadrant, mutations increase binding, mismatches decrease binding. Top right quadrant, both mutations and mismatches decrease binding. Bottom left quadrant, both mutations and mismatches increase binding. Bottom right quadrant, mutations decrease binding, mismatches increase binding. The x axis and y axis show calibrated binding measurements computed from the median SaMBA signal intensities (over n = 10 replicate spots). c, Comparing the effect of mutations versus the cumulative effects of the two closest mismatches. Points close to the diagonal correspond to cases in which the effect of the mutation is approximately equal (within experimental noise) to the sum of the effects of the two mismatches. Points above the diagonal correspond to cases in which Watson–Crick mutations have either a more beneficial or a less detrimental effect on TF binding compared to the cumulative effect of the two mismatches. Points below the diagonal correspond to cases in which Watson–Crick mutations have either a less beneficial or a more detrimental effect on TF binding compared to the cumulative effect of the two mismatches. The x axis and y axis show calibrated binding measurements computed from the median SaMBA signal intensities (over n = 10 replicate spots). 
Please see Supplementary Table 4 for the raw binding data used to compute the measurements shown in this figure.
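The pairing rules used in this legend (a mismatch is compared with its two closest Watson–Crick base pairs, and a mutation with its two closest mismatches) can be written out as a short sketch. Base pairs are represented as strings such as "G-T", and the helper names are hypothetical rather than part of the authors' code.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def closest_watson_crick(mismatch):
    """Two Watson-Crick pairs that each differ from the mismatch by one base."""
    x, y = mismatch.split("-")
    if COMPLEMENT[x] == y:
        raise ValueError(f"{mismatch} is already a Watson-Crick base pair")
    # Keep the first base and fix the second, then keep the second and fix the first.
    return [f"{x}-{COMPLEMENT[x]}", f"{COMPLEMENT[y]}-{y}"]

def closest_mismatches(wc_from, wc_to):
    """Two mismatches lying between a Watson-Crick pair and its mutated form."""
    x1, y1 = wc_from.split("-")
    x2, y2 = wc_to.split("-")
    # Change one strand at a time: first strand only, then second strand only.
    return [f"{x2}-{y1}", f"{x1}-{y2}"]

print(closest_watson_crick("G-T"))       # ['G-C', 'A-T']
print(closest_mismatches("G-C", "A-T"))  # ['A-C', 'G-T']
```

Both printed results reproduce the examples given in panels a and b: G-T is compared with G-C and A-T, and the G-C > A-T mutation is compared with G-C > A-C and G-C > G-T.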
a, SaMBA profile for an ETS1-binding site, highlighting the G-A mismatch at position 6, which shows the largest increase in binding affinity. b, Distortions. In the bound ETS1–DNA complex (PDB ID: 1K79), the positions at which the recognition helix is inserted into the DNA major groove are significantly distorted, with bending (βh = 23°) towards the major groove, local unwinding (ζh = 23°), and minor groove widening. Position 6, the middle position of the GGA core binding region, is highlighted to show the expanded C1′–C1′ distance. The G-A mismatch at this position mimics the C1′–C1′ distance of the bound DNA. Violin plots of the MD simulation data show that the G-A mismatch in anti-anti configuration also mimics the minor groove width of the bound G-C. c, Base readout. According to MD simulation results, G-A (anti/anti) and G-T mismatches increase the overall number of hydrogen bonds and the buried surface area at the ETS1-DNA interface, compared to the Watson–Crick G-C pair (Methods). d, ETS1–DNA interface in the GGAA core binding region. Contacting residues in the recognition helix are shown in magenta. Direct hydrogen bond contacts with the bases are highlighted; such contacts occur only at the GGA bases, on the ‘lower’ strand of the shown Watson–Crick DNA site. e, f, Representative snapshots of different hydrogen bond interactions between Arg391 and the base pair at position 6, from MD simulations. The G-T mismatch shows an additional hydrogen bond compared to G-C and G-A. g, In a non-specific site where G-A increases the affinity to reach the specific range, MD simulations show that the G-A mismatch forms hydrogen bonds similar to those formed in specific sites (shown in panel f). h, Non-native hydrogen bond at position 4, owing to the G-A mismatch at position 6 in the specific ETS1-binding site. 
i, j, Non-native hydrogen bond interactions created in a non-specific site (g) at positions neighbouring the positions of the mismatch, either with the base (i) or the backbone (j). k, SaMBA profiles for additional ETS1-binding sites. We measured the effect of mismatches in four ETS1-binding sites in addition to the one shown in a. Although the profiles for different sites are quantitatively different and dependent on the flanks, the trends for increased binding due to mismatches are similar. For all cases, the A-G mismatch at position 6 significantly increases ETS1 binding. l, Structural features at the mismatch position. Violin plots show the local twisting and kinking at position 6, and the minor and major groove width at position 5–6 of ETS1-bound DNA, as well as the naked DNA for different base pairs, according to MD.
a, Mismatch profile for p53 reveals that increased TF binding occurs only due to C-T and T-T mismatches (red rectangle) at the same positions at which the Hoogsteen conformation is observed in p53–DNA complexes (PDB 3KZ8). b, MD simulation-based violin plots of the C1′–C1′ distance at position 2, as well as the minor groove width (at position 0–1), for p53-bound DNA and naked DNA (wild-type and mismatched) reveal that the minor groove for C-T and T-T mismatches is more similar to the bound form compared to the free A-T base pair. The plot also shows that the G-T mismatch, which reduces p53 binding, does not mimic these distortions seen in the bound DNA. Notably, a narrower minor groove at position 0–1 was previously suggested to be important for the interaction of the DNA with the Arg248 residue in p53 (ref. 27). c, d, NMR validation showing that T-T and C-T mimic the reduced C1′–C1′ distance observed in p53-bound DNA27,28. c, Chemical shift overlays of the 2D HSQC NMR spectra of the C1′–H1′, C4′–H4′ and C3′–H3′ regions for A6-DNA m1A in which the m1AT base pair is in the Hoogsteen conformation30 (left, green), A6-DNA TT (middle, blue) and A6-DNA CT (right, red) with unmodified A6-DNA (black) at pH 6.9, 25 °C. d, Bar plots of the individual chemical shift differences (relative to unmodified A6-DNA) of the C1′, C3′ and C4′ carbon atoms of A6-DNA m1A (top), A6-DNA TT (middle) and A6-DNA CT (bottom). Similarity between the Hoogsteen-induced chemical shift differences and mismatch shifts (relative to the Watson–Crick wild-type) is observed for both T-T and C-T. e, Additional comparisons of global features (twisting angle, local kinking, and kinking direction at position 2 and major groove width at position 0–1) reveal additional mimicry between C-T mismatch and the Hoogsteen conformation local twisting angle. f, Pyrimidine–pyrimidine mismatches (C-T, T-C, T-T and C-C) in all four positions in which the Hoogsteen conformation is observed (n = 16 mismatches total) increased p53 binding.
However, all other mismatches at these positions (n = 32 mismatches total) decreased p53 binding, or had non-significant effects. ΔΔG represents the differences between the p53-DNA binding energy of each mismatch versus the wild-type sequence, and was estimated using the calibration with EMSA measurements (Methods). Box plots show median signals over all mismatches, with the bottom and top edges of each box indicating the 25th and 75th percentiles, respectively. The whiskers extend to the most-extreme data points that are not considered outliers. g, Number of p53-DNA hydrogen bonds and buried surface area at p53-DNA interface, obtained from MD simulations, failed to explain the observed increase in p53 binding, consistent with the prepaying mechanism being a key determinant for binding in this case. h, DNA hairpin with four mismatches (in the four positions for which the Hoogsteen conformation was previously observed), strongly binds p53: 3–6 kBT stronger (depending on the data used for validation, Supplementary Tables 3, 4) compared to the highest-affinity p53-binding sites previously reported22. Notably, we expect the difference in binding affinity to other genomic p53 sites (ΔΔG) to be even larger, as most p53-binding sites in the genome are of lower binding affinities22.
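As a rough guide to the energy scale quoted in panels f and h, a free-energy difference in units of kBT can be converted into a fold-change in dissociation constant. The relation below assumes the standard link between binding free energy and Kd; the sign convention and the labels 1 and 2 (for the two sites being compared) are assumptions for illustration and are not taken from the Methods.

```latex
\Delta\Delta G \;=\; \Delta G_{1} - \Delta G_{2}
\;=\; k_{B}T \,\ln\!\frac{K_{d,1}}{K_{d,2}}
\qquad\Longrightarrow\qquad
\frac{K_{d,2}}{K_{d,1}} \;=\; \exp\!\left(-\frac{\Delta\Delta G}{k_{B}T}\right)
```

Under this convention, binding that is stronger by 3 kBT corresponds to a Kd lower by a factor of about e^3 ≈ 20, and binding that is stronger by 6 kBT corresponds to a factor of about e^6 ≈ 400.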
a, Mismatch profile for TBP. b, Correlations between TBP-binding levels and DNA duplex stability were computed over all 16 base-pair variants at positions 1 to 8 in the TBP site. Bar plots (left) represent the squared Pearson correlation coefficient (R²) at each position. For the only three positions with significant correlations (positions 2, 7, and 8), scatter plots are shown (right), with binding signals representing medians over 9 replicate spots. Blue shaded regions, 95% confidence interval for Pearson’s correlation. The sequences of the Watson–Crick and mismatched base pairs are shown in each scatter plot (for example, for position 8, GC stands for the wild-type G-C base pair in bold in the TBP site TATAAAAG, CC stands for C-C at this position, and so on). These high correlations are observed only in the unstacked base step positions. c, Left, structural overlays between TBP–DNA complexes with DNA mismatches (TBP-AC, orange; TBP-CC(2), cyan; TBP-CC(1a), purple; TBP-CC(1b), pink) and their corresponding Watson–Crick counterparts with single base substitutions (1QNE, green; 6NJQ, yellow). The base steps at position 7–8 are magnified and highlighted in black boxes. Structural overlays of the mismatched and Watson–Crick base pairs are shown below each box, with their DNA sequences. Right, overlays of protein–DNA interfaces of TBP–DNA complexes, comparing mismatched and Watson–Crick sites. Four phenylalanine residues, as well as other amino acids that are discussed in the Supplementary Discussion, are highlighted with dashed circles. d, Comparisons of the effects of Watson–Crick mutations versus the cumulative effects of the two closest mismatches, shown for the mismatches with new crystal structures. In all three cases the mismatches have significantly larger effects than the Watson–Crick mutations (see also Methods and Supplementary Table 4). ΔΔG values for TBP_site_1 in Supplementary Table 4 were used in these comparisons.
e, Example of a Watson–Crick mutation that has a similar effect (within experimental error, Supplementary Table 4) to the sum of the two closest mismatches. ΔΔG values for TBP_site_1 in Supplementary Table 4 were used in these comparisons.
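The per-position statistic in panel b is a squared Pearson correlation. The sketch below shows one way to compute it with the standard library only; it is an illustration rather than the authors' code, the function name is hypothetical, and the inputs would be the 16 duplex-stability values and the 16 median binding signals for one position.

```python
def pearson_r_squared(xs, ys):
    """Squared Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)
```

Because the coefficient is squared, a strong negative relationship between binding and stability yields a value as high as a strong positive one, so the sign has to be read from the scatter plot.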
a, TF–DNA complex formation involves creation of intermolecular interactions, as well as DNA conformational changes. Thermodynamically, these processes can be separated into two independent events, and thus an increase in binding affinity could stem from additional interactions (decrease of ΔGinteraction), and/or a reduction in the penalty to change the DNA conformation (decrease of ΔGpenalty). b, A reduction in the energetic penalty to distort the DNA (ΔGpenalty) could originate from DNA conformational changes owing to the mismatch, that is, before binding (for example, p53 and TBP, as described in the main text). c, A reduction in the energetic penalty for DNA distortion (ΔGpenalty) could also originate from changes in the bound DNA. For example, MD simulations of the DNA conformations in free form and in the MYC–DNA complex (for the wild-type A-T and the mismatch G-T) suggest that the reduced penalty in this case is primarily due to changes in the mismatched bound form. The extent of overlap of the kinking direction (γh) obtained from the MD simulations, analysed using a revised Jensen–Shannon divergence score (Ω)81, was Ω = 0.34 for the wild type versus Ω = 0.15 for the G-T mismatch. Representative structures of the DNA sites are shown for wild-type free (pink), wild-type bound (orange), G-T free (green) and G-T bound (blue). The MYC–MAX heterodimer is shown as a grey surface. d, Mismatches could lead to the formation of non-native interactions such as hydrogen bonds (left), electrostatic potential and shape sensing (centre), and water-mediated interactions (right). Red empty arrows point to the locations of the change. These changes could occur directly at the position of the mismatched base (for example, the G-T mismatch for ETS1), as well as at the positions of other bases and/or the backbone, owing to non-native structures (for example, the G-A mismatch for ETS1).
Notably, mismatches not only alter the potential interacting chemical groups of the replaced base, but can also alter the relative orientation of the interacting bases (as observed for the T in the wobble geometry on the left).
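The two-term picture in panel a can be written out explicitly. The equations below restate the legend under its stated assumption that the two contributions are independent; the subscripts "bind", "mm" and "WT" are notation introduced here for illustration.

```latex
\Delta G_{\mathrm{bind}} \;=\; \Delta G_{\mathrm{interaction}} + \Delta G_{\mathrm{penalty}}

\Delta\Delta G_{\mathrm{bind}}
\;=\; \Delta G_{\mathrm{bind}}^{\mathrm{mm}} - \Delta G_{\mathrm{bind}}^{\mathrm{WT}}
\;=\; \Delta\Delta G_{\mathrm{interaction}} + \Delta\Delta G_{\mathrm{penalty}}
```

In this form, a mismatch can strengthen binding (negative ΔΔG_bind) either by making ΔΔG_interaction negative, as in panel d, or by making ΔΔG_penalty negative, as in panels b and c, or through both terms at once.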
a, Mismatches can result from misincorporation of bases during DNA replication by DNA polymerases. The average rate at which replication errors are generated and escape proofreading is low in healthy cells (around 10⁻⁹), but high in certain cancers and cells with Pol-ε or Pol-δ mutations. Even in healthy cells, the rates of generation of individual mismatches vary by more than a million-fold17 depending on the sequence context and the type of mismatch. b, Mismatches result from genetic recombination. A characteristic feature of homologous recombination is the exchange of DNA strands, which results in the formation of heteroduplex DNA. Mismatches can result from genetic recombination when the parental chromosomes contain non-identical sequences. In addition, mismatches can arise during DNA synthesis associated with recombination repair. The repair of these mismatches might be less efficient, as it was previously shown82 that there is a strong temporal coupling between DNA replication and mismatch repair but a lack of temporal coupling for heteroduplex rejection82. c, Spontaneous deamination is common and estimated to occur 100–500 times per cell per day in humans83. G-T mismatches generated by deamination of 5-methylcytosine (5-meC) are not repaired by the DNA mismatch repair pathway and have considerably lower repair efficiency83. The high rate of 5-meC deamination, combined with the relatively slow repair of these mismatches in mammalian cells, contributes to making 5-meC a preferential target for point mutations (about 40-fold) compared to other nucleotides in the genome84, and one of the major sources of the frequent C-to-T mutations observed in human cells18. d, Transcription factors bound to mismatched DNA could interfere with Pol-δ strand displacement activity. Left, DNA synthesized by non-proofreading mismatch-prone Pol-α is normally displaced by the proofreading non-error-prone Pol-δ.
Right, it was previously shown10 that increased mutation signals arise from regions synthesized by Pol-α that contain TF-binding sites. This study suggested that mismatched DNA synthesized by non-proofreading Pol-α is rapidly bound by TFs that act as barriers to Pol-δ displacement of Pol-α-synthesized DNA, resulting in locally increased mutation rates in subsequent rounds of replication.
This file contains Supplementary Methods, a Supplementary Discussion and Supplementary References.
Original source images for all EMSA data reported in this study.
Supplementary Table 1. This table contains the raw and processed SaMBA data for the 22 TFs.
Supplementary Table 2. This table contains the raw and processed SaMBA data used to compare the Ets1 binding level at mismatched non-specific sites versus random DNA sites and specific sites from NMR and X-ray crystal structures of Ets1-DNA complexes.
Supplementary Table 3. This table contains Kd data from EMSA, FA, MITOMI and SPR experiments, used to calibrate our high-throughput SaMBA data.
Supplementary Table 4. This table contains the raw and processed binding data for all mismatches and mutations in 12 TF binding sites for the 7 TFs with calibration data in our study, as well as statistics of the comparisons between the effects of mismatches versus mutations.
Supplementary Table 5. This table shows the deviations from the B-DNA envelope for DNA structural parameters at each base pair position in 12 TF-DNA complexes.
Supplementary Table 6. (a) X-ray structures of protein-DNA complexes selected for structural analyses of mismatches that increase TF binding. (b) Distortions at DNA positions where mismatches increase TF binding affinity. (c) Distortions of DNA structural parameters of mismatches relative to Watson-Crick base pairs. (d) Mismatches that increase TF binding affinity and exhibit geometries similar to distorted base pairs in TF-bound DNA.
Supplementary Table 7. Analysis of TF-DNA hydrogen bonding and buried surface area in MD simulations of TF-DNA complexes with and without mismatches. Results shown are from MD simulations. DNA sequences were derived from the sequences used in SaMBA.
Supplementary Table 8. Summary statistics of base pair parameters (mean, maximum value, minimum value, and standard deviation) for base pairs in B-DNA (as well as TF-bound DNA), obtained from a comprehensive survey of structures deposited in PDB.
Supplementary Table 9. All possible single mismatches (excluding modified bases) surrounded by at least two canonical Watson-Crick bps on both sides, from PDB structures. The data were used to survey the DNA mismatch structure and geometry.
Supplementary Table 10. Fluorescent signal for DNA duplexes expected to contain labeled and unlabeled probes, from the hybridization of 12 sequences on a DNA chip (see also Figure S3). For the sequences with an unlabeled complementary strand (sequences 2, 4, 6, 8, 10, 12), the signal is several orders of magnitude lower than for the sequences with a labeled complementary strand (sequences 1, 3, 5, 7, 9, 11).
Supplementary Table 11. This table includes the chemical shift differences in the sugar C1′/C3′/C4′ carbons for T-T and C-T mismatches versus a locked Hoogsteen conformation (using N1-methyladenosine, or m1A), relative to the Watson-Crick base-paired duplex.
Cite this article
Afek, A., Shi, H., Rangadurai, A. et al. DNA mismatches reveal conformational penalties in protein–DNA recognition. Nature 587, 291–296 (2020). https://doi.org/10.1038/s41586-020-2843-2