DNA mismatches reveal conformational penalties in protein–DNA recognition


Transcription factors recognize specific genomic sequences to regulate complex gene-expression programs. Although it is well-established that transcription factors bind to specific DNA sequences using a combination of base readout and shape recognition, some fundamental aspects of protein–DNA binding remain poorly understood1,2. Many DNA-binding proteins induce changes in the structure of the DNA outside the intrinsic B-DNA envelope. However, how the energetic cost that is associated with distorting the DNA contributes to recognition has proven difficult to study, because the distorted DNA exists in low abundance in the unbound ensemble3,4,5,6,7,8,9. Here we use a high-throughput assay that we term SaMBA (saturation mismatch-binding assay) to investigate the role of DNA conformational penalties in transcription factor–DNA recognition. In SaMBA, mismatched base pairs are introduced to pre-induce structural distortions in the DNA that are much larger than those induced by changes in the Watson–Crick sequence. Notably, approximately 10% of mismatches increased transcription factor binding, and for each of the 22 transcription factors that were examined, at least one mismatch was found that increased the binding affinity. Mismatches also converted non-specific sites into high-affinity sites, and high-affinity sites into ‘super sites’ that exhibit stronger affinity than any known canonical binding site. Determination of high-resolution X-ray structures, combined with nuclear magnetic resonance measurements and structural analyses, showed that many of the DNA mismatches that increase binding induce distortions that are similar to those induced by protein binding—thus prepaying some of the energetic cost incurred from deforming the DNA. 
Our work indicates that conformational penalties are a major determinant of protein–DNA recognition, and reveals mechanisms by which mismatches can recruit transcription factors and thus modulate replication and repair activities in the cell10,11.


Fig. 1: SaMBA measures the effects of mismatches on protein–DNA binding in high throughput.
Fig. 2: The effects of DNA mismatches on TF binding.
Fig. 3: DNA mismatches that exhibit geometries similar to distorted base pairs in TF-bound DNA lead to increased binding affinity.

Data availability

The data that support the findings in this study are available as Supplementary Tables in Excel format. Coordinates and structure factor amplitudes for the TBP-AC, TBP-CC(1a), TBP-CC(1b) and TBP-CC(2) structures have been deposited in the PDB under the accession codes 6UEO, 6UEP, 6UER and 6UEQ, respectively. The raw SaMBA data have been deposited in the Gene Expression Omnibus (GEO) under accession number GSE156375. The PDB entries used in this study are available in Extended Data Figs. 1, 2, 5, 7 and Supplementary Tables 5–7, 9. High-resolution gel images for the EMSA data are available at https://figshare.com/projects/DNA_mismatches_reveal_conformational_penalties_in_protein-DNA_recognition/83663.

Code availability

The code used for the structural analyses presented in this study is available in GitHub at https://github.com/alhashimilab/TF_MM.


  1. Rohs, R. et al. Origins of specificity in protein–DNA recognition. Annu. Rev. Biochem. 79, 233–269 (2010).
  2. Siggers, T. & Gordân, R. Protein–DNA binding: complexities and multi-protein codes. Nucleic Acids Res. 42, 2099–2111 (2014).
  3. Guéron, M., Kochoyan, M. & Leroy, J.-L. A single mode of DNA base-pair opening drives imino proton exchange. Nature 328, 89–92 (1987).
  4. Nikolova, E. N. et al. Transient Hoogsteen base pairs in canonical duplex DNA. Nature 470, 498–502 (2011).
  5. Fischer, M., Coleman, R. G., Fraser, J. S. & Shoichet, B. K. Incorporation of protein flexibility and conformational energy penalties in docking screens to improve ligand discovery. Nat. Chem. 6, 575–583 (2014).
  6. Fraser, J. S. et al. Hidden alternative structures of proline isomerase essential for catalysis. Nature 462, 669–673 (2009).
  7. Lorch, Y., Davis, B. & Kornberg, R. D. Chromatin remodeling by DNA bending, not twisting. Proc. Natl Acad. Sci. USA 102, 1329–1332 (2005).
  8. Parvin, J. D., McCormick, R. J., Sharp, P. A. & Fisher, D. E. Pre-bending of a promoter sequence enhances affinity for the TATA-binding factor. Nature 373, 724–727 (1995).
  9. Denny, S. K. et al. High-throughput investigation of diverse junction elements in RNA tertiary folding. Cell 174, 377–390 (2018).
  10. Reijns, M. A. M. et al. Lagging-strand replication shapes the mutational landscape of the genome. Nature 518, 502–506 (2015).
  11. Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016).
  12. Rohs, R. et al. The role of DNA shape in protein–DNA recognition. Nature 461, 1248–1253 (2009).
  13. Zeiske, T. et al. Intrinsic DNA shape accounts for affinity differences between Hox-cofactor binding sites. Cell Rep. 24, 2221–2230 (2018).
  14. Azad, R. N. et al. Experimental maps of DNA structure at nucleotide resolution distinguish intrinsic from protein-induced DNA deformations. Nucleic Acids Res. 46, 2636–2647 (2018).
  15. Olson, W. K., Gorin, A. A., Lu, X.-J., Hock, L. M. & Zhurkin, V. B. DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proc. Natl Acad. Sci. USA 95, 11163–11168 (1998).
  16. Battistini, F. et al. How B-DNA dynamics decipher sequence-selective protein recognition. J. Mol. Biol. 431, 3845–3859 (2019).
  17. Kunkel, T. A. & Erie, D. A. Eukaryotic mismatch repair in relation to DNA replication. Annu. Rev. Genet. 49, 291–313 (2015).
  18. Lindahl, T. Instability and decay of the primary structure of DNA. Nature 362, 709–715 (1993).
  19. Pich, O. et al. Somatic and germline mutation periodicity follow the orientation of the DNA minor groove around nucleosomes. Cell 175, 1074–1087 (2018).
  20. Berger, M. F. et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24, 1429–1435 (2006).
  21. Shen, N. et al. Divergence in DNA specificity among paralogous transcription factors contributes to their differential in vivo binding. Cell Syst. 6, 470–483 (2018).
  22. Veprintsev, D. B. & Fersht, A. R. Algorithm for prediction of tumour suppressor p53 affinity for binding sites in DNA. Nucleic Acids Res. 36, 1589–1598 (2008).
  23. Jolma, A. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20, 861–873 (2010).
  24. Warren, C. L. et al. Defining the sequence-recognition profile of DNA-binding molecules. Proc. Natl Acad. Sci. USA 103, 867–872 (2006).
  25. Benos, P. V., Bulyk, M. L. & Stormo, G. D. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 30, 4442–4451 (2002).
  26. Chattopadhyay, A., Zandarashvili, L., Luu, R. H. & Iwahara, J. Thermodynamic additivity for impacts of base-pair substitutions on association of the Egr-1 zinc-finger protein with DNA. Biochemistry 55, 6467–6474 (2016).
  27. Kitayner, M. et al. Diversity in DNA recognition by p53 revealed by crystal structures with Hoogsteen base pairs. Nat. Struct. Mol. Biol. 17, 423–429 (2010).
  28. Golovenko, D. et al. New insights into the role of DNA shape on its recognition by p53 proteins. Structure 26, 1237–1250 (2018).
  29. Alvey, H. S., Gottardo, F. L., Nikolova, E. N. & Al-Hashimi, H. M. Widespread transient Hoogsteen base pairs in canonical duplex DNA with variable energetics. Nat. Commun. 5, 4786 (2014).
  30. Shi, H. et al. Atomic structures of excited state A-T Hoogsteen base pairs in duplex DNA by combining NMR relaxation dispersion, mutagenesis, and chemical shift calculations. J. Biomol. NMR 70, 229–244 (2018).
  31. Kim, J. L., Nikolov, D. B. & Burley, S. K. Co-crystal structure of TBP recognizing the minor groove of a TATA element. Nature 365, 520–527 (1993).
  32. Mondal, M., Mukherjee, S. & Bhattacharyya, D. Contribution of phenylalanine side chain intercalation to the TATA-box binding protein-DNA interaction: molecular dynamics and dispersion-corrected density functional theory studies. J. Mol. Model. 20, 2499 (2014).
  33. Peyret, N., Seneviratne, P. A., Allawi, H. T. & SantaLucia, J., Jr. Nearest-neighbor thermodynamics and NMR of DNA sequences with internal A.A, C.C, G.G, and T.T mismatches. Biochemistry 38, 3468–3477 (1999).
  34. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
  35. Zhou, H. et al. New insights into Hoogsteen base pairs in DNA duplexes from a structure-based survey. Nucleic Acids Res. 43, 3420–3433 (2015).
  36. Lu, X.-J., Bussemaker, H. J. & Olson, W. K. DSSR: an integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Res. 43, e142 (2015).
  37. Sathyamoorthy, B. et al. Insights into Watson–Crick/Hoogsteen breathing dynamics and damage repair from the solution structure and dynamic ensemble of DNA duplexes containing m1A. Nucleic Acids Res. 45, 5586–5601 (2017).
  38. El Hassan, M. A. & Calladine, C. R. Two distinct modes of protein-induced bending in DNA. J. Mol. Biol. 282, 331–343 (1998).
  39. Bailor, M. H., Mustoe, A. M., Brooks, C. L., III & Al-Hashimi, H. M. 3D maps of RNA interhelical junctions. Nat. Protocols 6, 1536–1545 (2011).
  40. Bailor, M. H., Sun, X. & Al-Hashimi, H. M. Topology links RNA secondary structure with global conformation, dynamics, and adaptation. Science 327, 202–206 (2010).
  41. Le Novère, N. MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics 17, 1226–1227 (2001).
  42. Cheatham, T. E. III, Cieplak, P. & Kollman, P. A. A modified version of the Cornell et al. force field with improved sugar pucker phases and helical repeat. J. Biomol. Struct. Dyn. 16, 845–862 (1999).
  43. Pérez, A., Luque, F. J. & Orozco, M. Dynamics of B-DNA on the microsecond time scale. J. Am. Chem. Soc. 129, 14739–14745 (2007).
  44. Maier, J. A. et al. ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J. Chem. Theory Comput. 11, 3696–3713 (2015).
  45. Salomon-Ferrer, R., Götz, A. W., Poole, D., Le Grand, S. & Walker, R. C. Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. explicit solvent particle mesh Ewald. J. Chem. Theory Comput. 9, 3878–3888 (2013).
  46. Rossetti, G. et al. The structural impact of DNA mismatches. Nucleic Acids Res. 43, 4309–4321 (2015).
  47. Arnold, F. H., Wolk, S., Cruz, P. & Tinoco, I. Jr. Structure, dynamics, and thermodynamics of mismatched DNA oligonucleotide duplexes d(CCCAGGG)2 and d(CCCTGGG)2. Biochemistry 26, 4068–4075 (1987).
  48. Kouchakdjian, M., Li, B. F., Swann, P. F. & Patel, D. J. Pyrimidine.pyrimidine base-pair mismatches in DNA. A nuclear magnetic resonance study of T.T pairing at neutral pH and C.C pairing at acidic pH in dodecanucleotide duplexes. J. Mol. Biol. 202, 139–155 (1988).
  49. Boulard, Y. et al. The pH dependent configurations of the C.A mispair in DNA. Nucleic Acids Res. 20, 1933–1941 (1992).
  50. Peng, Y. & Alexov, E. Computational investigation of proton transfer, pKa shifts and pH-optimum of protein-DNA and protein-RNA complexes. Proteins 85, 282–295 (2017).
  51. Chen, W., Morrow, B. H., Shi, C. & Shen, J. K. Recent development and application of constant pH molecular dynamics. Mol. Simul. 40, 830–838 (2014).
  52. Rangadurai, A. et al. Why are Hoogsteen base pairs energetically disfavored in A-RNA compared to B-DNA? Nucleic Acids Res. 46, 11099–11114 (2018).
  53. Patel, D. J., Kozlowski, S. A., Ikuta, S. & Itakura, K. Deoxyguanosine-deoxyadenosine pairing in the d(C-G-A-G-A-A-T-T-C-G-C-G) duplex: conformation and dynamics at and adjacent to the dG X dA mismatch site. Biochemistry 23, 3207–3217 (1984).
  54. Webster, G. D. et al. Crystal structure and sequence-dependent conformation of the A.G mispaired oligonucleotide d(CGCAAGCTGGCG). Proc. Natl Acad. Sci. USA 87, 6693–6697 (1990).
  55. Allawi, H. T. & SantaLucia, J., Jr. NMR solution structure of a DNA dodecamer containing single G.T mismatches. Nucleic Acids Res. 26, 4925–4934 (1998).
  56. Boulard, Y., Cognet, J. A. & Fazakerley, G. V. Solution structure as a function of pH of two central mismatches, C.T and C.C, in the 29 to 39 K-ras gene sequence, by nuclear magnetic resonance and molecular dynamics. J. Mol. Biol. 268, 331–347 (1997).
  57. Gordân, R. et al. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep. 3, 1093–1104 (2013).
  58. Frank, F., Okafor, C. D. & Ortlund, E. A. The first crystal structure of a DNA-free nuclear receptor DNA binding domain sheds light on DNA-driven allostery in the glucocorticoid receptor. Sci. Rep. 8, 13497 (2018).
  59. Takayama, Y., Sahu, D. & Iwahara, J. NMR studies of translocation of the Zif268 protein between its target DNA Sites. Biochemistry 49, 7998–8005 (2010).
  60. Belo, Y. et al. Unexpected implications of STAT3 acetylation revealed by genetic encoding of acetyl-lysine. Biochim. Biophys. Acta 1863, 1343–1350 (2019).
  61. Stelling, A. L. et al. Infrared spectroscopic observation of a G-C+ Hoogsteen base pair in the DNA:TATA-box binding protein complex under solution conditions. Angew. Chem. Int. Edn Engl. 58, 12010–12013 (2019).
  62. Stephens, D. C. & Poon, G. M. Differential sensitivity to methylated DNA by ETS-family transcription factors is intrinsically encoded in their DNA-binding domains. Nucleic Acids Res. 44, 8671–8681 (2016).
  63. Zhang, L. et al. SelexGLM differentiates androgen and glucocorticoid receptor DNA-binding preference over an extended binding site. Genome Res. 28, 111–121 (2018).
  64. Vyas, P. et al. Diverse p53/DNA binding modes expand the repertoire of p53 response elements. Proc. Natl Acad. Sci. USA 114, 10624–10629 (2017).
  65. Weinberg, R. L., Veprintsev, D. B. & Fersht, A. R. Cooperative binding of tetrameric p53 to DNA. J. Mol. Biol. 341, 1145–1159 (2004).
  66. Sandelin, A., Alkema, W., Engström, P., Wasserman, W. W. & Lenhard, B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32, D91–D94 (2004).
  67. Siggers, T. et al. Principles of dimer-specific gene regulation revealed by a comprehensive characterization of NF-κB family DNA binding. Nat. Immunol. 13, 95–102 (2012).
  68. Luisi, B. F. et al. Crystallographic analysis of the interaction of the glucocorticoid receptor with DNA. Nature 352, 497–505 (1991).
  69. Beno, I., Rosenthal, K., Levitine, M., Shaulov, L. & Haran, T. E. Sequence-dependent cooperative binding of p53 to DNA targets and its relationship to the structural properties of the DNA targets. Nucleic Acids Res. 39, 1919–1932 (2011).
  70. Stephens, D. C. et al. Pharmacologic efficacy of PU.1 inhibition by heterocyclic dications: a mechanistic analysis. Nucleic Acids Res. 44, 4005–4013 (2016).
  71. Siggers, T., Duyzend, M. H., Reddy, J., Khan, S. & Bulyk, M. L. Non-DNA-binding cofactors enhance DNA-binding specificity of a transcriptional regulatory complex. Mol. Syst. Biol. 7, 555 (2011).
  72. Maerkl, S. J. & Quake, S. R. A systems approach to measuring the binding energy landscapes of transcription factors. Science 315, 233–237 (2007).
  73. Geertz, M., Shore, D. & Maerkl, S. J. Massively parallel measurements of molecular interaction kinetics on a microfluidic platform. Proc. Natl Acad. Sci. USA 109, 16540–16545 (2012).
  74. Drachkova, I. et al. Effect of TATA box polymorphisms in human β-globin gene promoter associated with β-thalassemia on interaction with TATA-binding protein. Russ. J. Genet. Appl. Res. 1, 183–188 (2011).
  75. Drachkova, I. et al. The mechanism by which TATA-box polymorphisms associated with human hereditary diseases influence interactions with the TATA-binding protein. Hum. Mutat. 35, 601–608 (2014).
  76. Leslie, A. G. The integration of macromolecular diffraction data. Acta Crystallogr. D 62, 48–57 (2006).
  77. Potterton, E., Briggs, P., Turkenburg, M. & Dodson, E. A graphical user interface to the CCP4 program suite. Acta Crystallogr. D 59, 1131–1137 (2003).
  78. Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr. D 66, 213–221 (2010).
  79. Jones, T. A., Zou, J. Y., Cowan, S. W. & Kjeldgaard, M. Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr. A 47, 110–119 (1991).
  80. Chen, V. B. et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D 66, 12–21 (2010).
  81. Yang, S., Salmon, L. & Al-Hashimi, H. M. Measuring similarity between dynamic ensembles of biomolecules. Nat. Methods 11, 552–554 (2014).
  82. Hombauer, H., Srivatsan, A., Putnam, C. D. & Kolodner, R. D. Mismatch repair, but not heteroduplex rejection, is temporally coupled to DNA replication. Science 334, 1713–1716 (2011).
  83. Krokan, H. E., Drabløs, F. & Slupphaug, G. Uracil in DNA—occurrence, consequences and repair. Oncogene 21, 8935–8948 (2002).
  84. Shen, J. C., Rideout, W. M., III & Jones, P. A. The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. Nucleic Acids Res. 22, 972–976 (1994).



We dedicate this paper to the memory of Dr Rosalind E. Franklin, on the occasion of her 100th birthday anniversary. Dr Franklin’s legacy, including her crucial contribution to the discovery of the molecular structure of DNA, continues to inspire generations of diverse scientists around the world. We thank S. Adar for discussions that initiated this project; D. Herschlag for discussions and comments; E. Arbely, D. Golovenko, J. Iwahara, E. Ortlund and R. Young for providing recombinant purified protein; and L. McIntosh for providing expression plasmids. This work was supported by the National Institutes of Health (NIH) grants R01-GM135658 and R01-GM117106 (to R.G.) and R01-GM089846 (to H.M.A.-H.); a Duke University GCB Pilot Grant (to R.G. and H.M.A.-H.); and an Integrated DNA Technologies postdoctoral fellowship award (to A.A.). R.S. and M.A.S. were supported by NIH grant R35-GM130290 (to M.A.S.); A.S. and T.E.H. were supported by the Israel Science Foundation grant 1517/14 (to T.E.H.); S.X. and G.M.K.P. were supported by a National Science Foundation (NSF) grant MCB-2028902 (to G.M.K.P.); and M.F. and M.A.P. were supported by an NSF CAREER award MCB-1552862 (to M.A.P.). High-performance computing was partially supported by the Duke Center for Genomic and Computational Biology. We acknowledge the Advanced Light Source (ALS) at the Lawrence Berkeley National Laboratory for X-ray diffraction data collection on beamlines 8.3.1 and 5.0.1. Beamline 8.3.1 at the ALS is operated by the University of California Office of the President, Multicampus Research Programs and Initiatives grant MR-15-328599, the NIH (R01GM124149 and P30GM124169), Plexxikon and the Integrated Diffraction Analysis Technologies program of the US Department of Energy Office of Biological and Environmental Research. The Pilatus detector on beamline 5.0.1 was funded under NIH grant S10OD021832. The ALS-ENABLE beamlines are supported in part by the NIH National Institute of General Medical Sciences grant P30 GM124169. The ALS is a national user facility operated by Lawrence Berkeley National Laboratory on behalf of the US Department of Energy under contract number DE-AC02-05CH11231, Office of Basic Energy Sciences. The Berkeley Center for Structural Biology is supported in part by the Howard Hughes Medical Institute.

Author information




A.A., H.M.A.-H. and R.G. designed and supervised the study. A.A. generated high-throughput protein–DNA binding data. A.A., H. Shi, A.R. and H. Sahay analysed the data. H. Shi and A.R. contributed NMR data. A.S., S.X., M.F., M.A.P., G.M.K.P. and T.E.H. contributed experimental data on protein–DNA binding affinities: p53 (A.S., T.E.H.), ETS1 (S.X., G.M.K.P.) and GR (M.F., M.A.P.). Z.M. contributed high-throughput protein–DNA binding data. R.S. and M.A.S. contributed X-ray crystallography data. A.A., H. Shi, A.R., H.M.A.-H. and R.G. wrote the manuscript, with input from all authors. All of the authors critically reviewed the manuscript and approved the final version.

Corresponding authors

Correspondence to Hashim M. Al-Hashimi or Raluca Gordân.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature thanks James Fraser, Remo Rohs and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Structural deformations in TF-bound and unbound DNA.

a, Distributions of base-pair parameters in free and TF-bound DNA, from PDB34 survey. Solid lines denote the median value of each parameter. Dashed lines denote the upper and lower bounds of the distribution for free (pink) and bound (green) DNA. 613 TF-bound structures and 409 free B-DNA structures, all with resolution < 3 Å, were used in the analysis (Methods). b, Percentage of structures with base pairs outside the B-DNA envelope. Among the 613 TF-bound structures, 41.1% (that is, 252) contain severe distortions of at least one base pair outside the free B-DNA envelope, with the envelope defined as at most 3 standard deviations above or below the mean. Only 16% (that is, 65) of the free B-DNA structures satisfy this criterion. (Using a less stringent definition of the B-DNA envelope, by considering two standard deviations above or below the mean, we found that 80.8% of the TF-bound structures contain at least one base pair outside the free B-DNA envelope, approximately twice the frequency observed in free DNA, which was 41.8%.) Considering the full range of base-pair parameter values as defining the free B-DNA envelope, we found that 11.3% (that is, 69) of the TF-bound structures contain at least one base pair with an extreme deformation that was never observed in any free DNA structure. c, Local deformations of base pairs observed in diverse TF-DNA complex structures. Left, 3D structures with the distorted base pairs highlighted in black boxes. Upper right, enlarged view of the base-pair structures with their base-pair parameters labelled. Lower right, schematic diagram of the corresponding base-pair parameters.
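The envelope criterion used in panel b can be illustrated with a minimal Python sketch. It assumes that each structure is represented as an array of per-base-pair parameter values and that the envelope is the mean of the free-DNA values plus or minus a chosen number of standard deviations. The function names and array layout are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def envelope_bounds(free_values, n_sd=3.0):
    """Lower/upper envelope bounds per parameter from free B-DNA base pairs.

    free_values: array of shape (n_base_pairs, n_params) pooled over all
    free-DNA structures (assumed layout, for illustration only).
    """
    free_values = np.asarray(free_values, dtype=float)
    mean = free_values.mean(axis=0)
    sd = free_values.std(axis=0, ddof=1)  # sample standard deviation
    return mean - n_sd * sd, mean + n_sd * sd

def fraction_outside(structures, lower, upper):
    """Fraction of structures with at least one base pair outside the envelope.

    structures: list of arrays, each of shape (n_base_pairs, n_params).
    """
    flagged = [
        bool(np.any((np.asarray(bp) < lower) | (np.asarray(bp) > upper)))
        for bp in structures
    ]
    return sum(flagged) / len(flagged)
```

Tightening `n_sd` from 3 to 2 narrows the envelope, which is why the legend reports a higher percentage of structures with outlying base pairs under the less stringent definition.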

Extended Data Fig. 2 Structural characteristics of DNA mismatches.

a, Base-pairing geometry of Watson–Crick base pairs and mismatches, obtained from a survey of crystal structures in the PDB34. Mismatches with modified bases and those that were metal-mediated were excluded from analysis (Methods). Predominant base-pairing geometries under neutral pH conditions are shown in black. Minor geometries are shown in grey. b, Melting energies for DNA mismatches relative to G-C and A-T Watson–Crick base pairs. See Methods for details. c, Distributions of structural parameters in Watson–Crick and mismatched DNA, from MD simulations. Solid lines denote the median value of each parameter. Observations from the MD simulation results: (1) G-T retains wobble geometry during the MD simulation, with a sheared conformation (|shear| around 2 Å) accompanied by a slight stretch. (2) T-T shows wobble geometry with a sheared conformation (|shear| around 2 Å). Unlike G-T, the T-T mismatch shows a rapid dynamic equilibrium between the two wobble geometries, in which either one of the Ts is shifted towards the minor groove. Despite this rapid dynamic equilibrium, the T-T base pair remains constricted, with a C1′–C1′ distance of 8–9.5 Å. (3) Similar to T-T, the C-T mismatch is also constricted, with two hydrogen bonds stably formed for most of the time. However, the C-T mismatch can transiently adopt a high-energy conformation that has only one hydrogen bond and is no longer constricted (C1′–C1′ distance around 10 Å), potentially owing to the close contact between T-O2 and C-O2. These high-energy species make up approximately 5% of the C-T MD trajectory. (4) C-C is partially constricted, with a C1′–C1′ distance of around 9.8 Å, owing to unstable hydrogen bonding. (5) All pyrimidine-pyrimidine mismatches remain stacked in the helix and do not swing out of it in the MD trajectories. (6) G-G does not show anti-syn equilibrium during the simulation. The C1′–C1′ distance of G-G (G(syn)-G(anti) or G(anti)-G(syn)) is around 11.2–11.5 Å, which is larger than that of the canonical G-C base pair. (7) G(anti)-A(syn) is not constricted (C1′–C1′ distance around 11 Å) and G(anti)-A(anti) has a large C1′–C1′ distance of around 12.8 Å. Base-pair and base-step parameters of bases with syn conformation (marked with *) were not computed, and are thus greyed out, owing to an ill-defined coordinate frame (Methods). The C1′–C1′ distance is shown, as it is not affected by the change of coordinate frame. d, Mismatches can mimic distorted base-pair geometries observed in protein-bound DNA. Overlays of distorted (coloured) and idealized Watson–Crick (grey) base pairs from 3DNA (top); mismatches (coloured) and idealized Watson–Crick (grey) base pairs (middle); and mismatched and distorted Watson–Crick base pairs (right). The mismatched conformations are of free DNA and were obtained from MD simulations (Methods). The C-T mismatch can mimic an A-T Hoogsteen base pair by constricting the C1′–C1′ distance (taken from PDB 3KZ8). The G-T mismatch can mimic a sheared A-T base pair by shifting the T to the major groove direction (taken from PDB 4MZR).

Extended Data Fig. 3 Validation and calibration of SaMBA measurements.

a, Schematic representation of our experimental workflow to detect cross-hybridization. To check whether certain oligonucleotides hybridize with non-target complementary oligonucleotides, we designed an experiment in which only certain oligonucleotides (red) were labelled. If significant cross-hybridization occurred, we would have detected fluorescent signal on the chip even for sequences without fluorescent complements in the hybridization solution (that is, for the sequences shown in blue). b, No significant cross-hybridization was detected. Bottom, list of 12 sequences used in the hybridization solution of one SaMBA experiment (red: fluorescently labelled oligonucleotides; blue: unlabelled). Top, fluorescent signal from the hybridization of these 12 sequences on the chip. For the sequences on the chip for which their complement is not labelled, the fluorescent signal is practically undetectable (blue), and it is several orders of magnitude lower than the sequences with a labelled complementary strand (red). Box plots show median signals over replicate DNA spots, with the bottom and top edges of each box indicating the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points not considered outliers. c, The effect of mismatches on hybridization. To estimate the efficiency of our hybridization protocol, we measured the hybridization signal of one specific sequence (sequence #3 for library v.1; see Methods, Supplementary Table 10), to different sequences containing multiple mismatches (0 to around 40), and a completely different sequence (‘60*’). As expected, the hybridization was less efficient for sequences with large numbers of mismatches. However, for small numbers of mismatches the hybridization was highly efficient. Longer incubation time, higher oligonucleotide concentration, and normalization of the signal could enable the use of SaMBA for larger numbers of mismatches. 
Plot shows medians and standard deviations over all sequences containing the same number of mismatches, with 6 replicate spots per sequence. Mismatches were introduced randomly by generating N random base changes (N = 1–5, 10, 15, 25, 35, 45) to sequence #3, and repeating the procedure ten times for each N. This led to duplexes with 1 to 37 mismatches compared to the original sequence. d, Hybridization signal is highly reproducible. The correlation of hybridization signals between two replicate experiments was very high (R² = 0.99). Plot shows median values, computed over six replicate spots, based on data shown in c. e, Validation of mismatch effects by orthogonal methods. For p53, ETS1, and GR proteins, the log-transformed SaMBA binding intensities correlate with independent affinity measurements performed on mismatched and non-mismatched DNA sites (Methods). Similarly to PBM experiments, median values over all replicates were used for SaMBA (n = 10 replicate spots); error bars show the median absolute deviation. Average values over replicates were used for the orthogonal methods (n = 6 independent measurements for p53, and n = 3 independent measurements for ETS1 and GR), with error bars showing the standard deviation. Red shaded region, 95% confidence interval for Pearson’s correlation. Binding free energy differences (ΔΔG) are shown between native Watson–Crick binding sites and the highest increase in binding due to a mismatch. Two SaMBA sites were tested for GR (see Methods). f, Correlation between binding data obtained by SaMBA versus independent methods. For SaMBA data the plots show the median values over replicate spots (n = 10 replicate spots), with error bars showing the median absolute deviation. For independent data (Methods) the plots show the binding affinities as reported in the respective papers. Red shaded region, 95% confidence interval for Pearson’s correlation. 
g, Standard equilibrium thermodynamics equations demonstrate that the logarithm of the Kd values of the TF–DNA complex is linearly proportional to the logarithm of the TF–DNA complex fluorescence signal, under certain conditions in which the TF concentration and the free DNA concentration are in excess compared to the concentration of the bound complex (and those remain constant during the reaction). h, Similar to g, for cases in which the DNA-bound species is a dimer.
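The relationship described in panels g and h can be sketched as follows. This is a minimal derivation under the stated assumptions, not a reproduction of the equations in the figure. The symbols $[\mathrm{C}]$ and $[\mathrm{C}_2]$ (bound monomer and dimer complexes), $F$ (fluorescence signal) and $\alpha$ (an assumed constant of proportionality between signal and complex concentration) are notation introduced here for illustration. The dimer case assumes a single overall dissociation constant.

```latex
\begin{align}
% Monomer binding: TF + DNA <-> C
K_d &= \frac{[\mathrm{TF}]\,[\mathrm{DNA}]}{[\mathrm{C}]}, \qquad F = \alpha\,[\mathrm{C}] \\
\log K_d &= \log[\mathrm{TF}] + \log[\mathrm{DNA}] - \log[\mathrm{C}]
          = -\log F + \log\!\bigl(\alpha\,[\mathrm{TF}]\,[\mathrm{DNA}]\bigr) \\[6pt]
% Dimer binding: 2 TF + DNA <-> C_2 (overall constant, assumed)
K_d^{\mathrm{dimer}} &= \frac{[\mathrm{TF}]^{2}\,[\mathrm{DNA}]}{[\mathrm{C}_2]}, \qquad F = \alpha\,[\mathrm{C}_2] \\
\log K_d^{\mathrm{dimer}} &= -\log F + \log\!\bigl(\alpha\,[\mathrm{TF}]^{2}\,[\mathrm{DNA}]\bigr)
\end{align}
```

When $[\mathrm{TF}]$ and $[\mathrm{DNA}]$ are in excess and remain constant, the second term on each right-hand side is a constant, so $\log K_d$ is linear in $\log F$. In this idealized case, two sites measured under identical conditions would differ by $\Delta\Delta G = k_{\mathrm{B}}T \ln\!\bigl(K_{d,2}/K_{d,1}\bigr) = -k_{\mathrm{B}}T \ln\!\bigl(F_2/F_1\bigr)$.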

Extended Data Fig. 4 Comparing the effects of mutations versus mismatches on TF binding.

a, The magnitude of the energetic effects of mutations (light colours) and mismatches (dark colours) is similar. The effects were computed for all 7 proteins with available calibration data in our study, and for a total of 12 DNA sites (Methods). The effects of mismatches were calculated relative to the two closest Watson–Crick sequences (for example, for a G-T mismatch the closest Watson–Crick base pairs are G-C and A-T; the mismatch plots include both ΔΔG(G-C > G-T) and ΔΔG(A-T > G-T)). b, Mismatches and their corresponding mutations have different, even opposite effects on TF binding. Each mutation is compared to the two closest mismatches (for example, G-C > A-T is compared to both G-C > A-C and G-C > G-T). Top left quadrant, mutations increase binding, mismatches decrease binding. Top right quadrant, both mutations and mismatches decrease binding. Bottom left quadrant, both mutations and mismatches increase binding. Bottom right quadrant, mutations decrease binding, mismatches increase binding. The x axis and y axis show calibrated binding measurements computed from the median SaMBA signal intensities (over n = 10 replicate spots). c, Comparing the effect of mutations versus the cumulative effects of the two closest mismatches. Points close to the diagonal correspond to cases in which the effect of the mutation is approximately equal (within experimental noise) to the sum of the effects of the two mismatches. Points above the diagonal correspond to cases in which Watson–Crick mutations have either a more beneficial or a less detrimental effect on TF binding compared to the cumulative effect of the two mismatches. Points below the diagonal correspond to cases in which Watson–Crick mutations have either a less beneficial or a more detrimental effect on TF binding compared to the cumulative effect of the two mismatches. The x axis and y axis show calibrated binding measurements computed from the median SaMBA signal intensities (over n = 10 replicate spots). 
Please see Supplementary Table 4 for the raw binding data used to compute the measurements shown in this figure.
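The pairing of each Watson–Crick mutation with its two closest mismatches (panels b and c), and of each mismatch with its two closest Watson–Crick pairs (panel a), can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical. The quadrant labels assume that the x axis carries the mutation effect, the y axis carries the mismatch effect, and a positive ΔΔG means decreased binding; the legend does not state these conventions explicitly.

```python
from typing import List

# Base pairs are written "X-Y": top-strand base, then bottom-strand base.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def is_watson_crick(pair: str) -> bool:
    """True if the pair is one of A-T, T-A, G-C, C-G."""
    top, bottom = pair.split("-")
    return COMPLEMENT[top] == bottom


def closest_mismatches(wild_type: str, mutant: str) -> List[str]:
    """Two mismatches obtained by changing only one strand of the wild-type pair."""
    assert is_watson_crick(wild_type) and is_watson_crick(mutant)
    wt_top, wt_bottom = wild_type.split("-")
    mut_top, mut_bottom = mutant.split("-")
    return [f"{mut_top}-{wt_bottom}", f"{wt_top}-{mut_bottom}"]


def closest_watson_crick(mismatch: str) -> List[str]:
    """Two Watson-Crick pairs obtained by correcting one strand of a mismatch."""
    top, bottom = mismatch.split("-")
    return [f"{top}-{COMPLEMENT[top]}", f"{COMPLEMENT[bottom]}-{bottom}"]


def quadrant(ddg_mutation: float, ddg_mismatch: float) -> str:
    """Quadrant label, assuming x = mutation effect, y = mismatch effect,
    and positive values meaning decreased binding (assumed conventions)."""
    if ddg_mutation == 0 or ddg_mismatch == 0:
        return "on axis"
    vertical = "top" if ddg_mismatch > 0 else "bottom"
    horizontal = "right" if ddg_mutation > 0 else "left"
    return f"{vertical} {horizontal}"


def additivity_gap(ddg_mutation: float, ddg_mm_a: float, ddg_mm_b: float) -> float:
    """Mutation effect minus the summed effects of its two closest mismatches."""
    return ddg_mutation - (ddg_mm_a + ddg_mm_b)


print(closest_mismatches("G-C", "A-T"))   # the example pair given in panel b
print(closest_watson_crick("G-T"))        # the example pair given in panel a
```

A gap of zero from `additivity_gap` corresponds to a point on the diagonal in panel c; the input values would come from the calibrated data in Supplementary Table 4.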

Extended Data Fig. 5 The effects of mismatches on ETS1–DNA binding.

a, SaMBA profile for an ETS1-binding site, highlighting the G-A mismatch at position 6, which shows the largest increase in binding affinity. b, Distortions. In the bound ETS1–DNA complex (PDB ID: 1K79), the positions at which the recognition helix is inserted into the DNA major groove are significantly distorted, with bending (βh = 23°) towards the major groove, local unwinding (ζh = 23°), and minor groove widening. Position 6, the middle position of the GGA core binding region, is highlighted to show the expanded C1′–C1′ distance. The G-A mismatch at this position mimics the C1′–C1′ distance of the bound DNA. Violin plots of the MD simulation data show that the G-A mismatch in the anti/anti configuration also mimics the minor groove width of the bound G-C. c, Base readout. According to MD simulation results, G-A (anti/anti) and G-T mismatches increase the overall number of hydrogen bonds and the buried surface area at the ETS1–DNA interface, compared to the Watson–Crick G-C pair (Methods). d, ETS1–DNA interface in the GGAA core binding region. Contacting residues in the recognition helix are shown in magenta. Direct hydrogen bond contacts with the bases are highlighted; such contacts occur only at the GGA bases, on the ‘lower’ strand of the shown Watson–Crick DNA site. e, f, Representative snapshots of different hydrogen bond interactions between Arg391 and the base pair at position 6, from MD simulations. The G-T mismatch shows an additional hydrogen bond compared to G-C and G-A. g, In a non-specific site where G-A increases the affinity to reach the specific range, MD simulations show that the G-A mismatch forms hydrogen bonds similar to those formed in specific sites (shown in panel f). h, Non-native hydrogen bond at position 4, owing to the G-A mismatch at position 6 in the specific ETS1-binding site. 
i, j, Non-native hydrogen bond interactions created in a non-specific site (g) at positions neighbouring the positions of the mismatch, either with the base (i) or the backbone (j). k, SaMBA profiles for additional ETS1-binding sites. We measured the effect of mismatches in four ETS1-binding sites in addition to the one shown in a. Although the profiles for different sites are quantitatively different and dependent on the flanks, the trends for increased binding due to mismatches are similar. For all cases, the A-G mismatch at position 6 significantly increases ETS1 binding. l, Structural features at the mismatch position. Violin plots show the local twisting and kinking at position 6, and the minor and major groove width at position 5–6 of ETS1-bound DNA, as well as the naked DNA for different base pairs, according to MD.

Extended Data Fig. 6 The effects of mismatches on p53–DNA binding.

a, Mismatch profile for p53 reveals that increased TF binding occurs only due to C-T and T-T mismatches (red rectangle) at the same positions at which the Hoogsteen conformation is observed in p53–DNA complexes (PDB 3KZ8). b, MD simulation-based violin plots of C1′–C1′ distance at position 2, as well as the minor groove width (at position 0–1), for p53-bound DNA and naked DNA (wild-type and mismatched) reveal that the minor groove for C-T and T-T mismatches is more similar to the bound form than is that of the free A-T base pair. Plot also shows that the G-T mismatch, which reduces p53 binding, does not mimic these distortions seen in the bound DNA. Notably, a narrower minor groove at position 0–1 was previously suggested to be important for the interaction of the DNA with the Arg248 residue in p53 (ref. 27). c, d, NMR validation showing that T-T and C-T mimic the reduced C1′–C1′ distance observed in p53-bound DNA (refs. 27, 28). c, Chemical shift overlays of the 2D HSQC NMR spectra of the C1′–H1′, C4′–H4′ and C3′–H3′ regions for A6-DNA m1A in which the m1AT base pair is in the Hoogsteen conformation (ref. 30) (left, green), A6-DNA TT (middle, blue) and A6-DNA CT (right, red) with unmodified A6-DNA (black) at pH 6.9, 25 °C. d, Bar plots of the individual chemical shift differences (relative to unmodified A6-DNA) of the C1′, C3′ and C4′ carbon atoms of A6-DNA m1A (top), A6-DNA TT (middle) and A6-DNA CT (bottom). Similarity between the Hoogsteen-induced chemical shift differences and mismatch shifts (relative to the Watson–Crick wild-type) is observed for both T-T and C-T. e, Additional comparisons of global features (twisting angle, local kinking, and kinking direction at position 2 and major groove width at position 0–1) reveal additional mimicry between the C-T mismatch and the Hoogsteen conformation in the local twisting angle. f, Pyrimidine–pyrimidine mismatches (C-T, T-C, T-T and C-C) in all four positions in which Hoogsteen conformation is observed (n = 16 mismatches total) increased p53 binding. 
However, all other mismatches at these positions (n = 32 mismatches total) decreased p53 binding, or had non-significant effects. ΔΔG represents the differences between the p53–DNA binding energy of each mismatch versus the wild-type sequence, and was estimated using the calibration with EMSA measurements (Methods). Box plots show median signals over all mismatches, with the bottom and top edges of each box indicating the 25th and 75th percentiles, respectively. The whiskers extend to the most-extreme data points that are not considered outliers. g, Number of p53–DNA hydrogen bonds and buried surface area at p53–DNA interface, obtained from MD simulations, failed to explain the observed increase in p53 binding, consistent with the prepaying mechanism being a key determinant for binding in this case. h, DNA hairpin with four mismatches (in the four positions for which the Hoogsteen conformation was previously observed), strongly binds p53: 3–6 kBT stronger (depending on the data used for validation, Supplementary Tables 3, 4) compared to the highest-affinity p53-binding sites previously reported (ref. 22). Notably, we expect the difference in binding affinity to other genomic p53 sites (ΔΔG) to be even larger, as most p53-binding sites in the genome are of lower binding affinities (ref. 22).
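The mismatch counts quoted in panel f (n = 16 and n = 32) follow from simple enumeration, which can be checked with a short Python sketch. The variable names are illustrative; the only inputs taken from the legend are the four Hoogsteen positions and the definition of pyrimidine–pyrimidine mismatches.

```python
from itertools import product

BASES = "ACGT"
PYRIMIDINES = {"C", "T"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
N_HOOGSTEEN_POSITIONS = 4  # positions at which the Hoogsteen conformation is observed

# All 16 ordered base pairs (top strand, bottom strand) at one position.
all_pairs = list(product(BASES, repeat=2))

# Remove the four Watson-Crick pairs, leaving the 12 possible mismatches.
mismatches = [(top, bottom) for top, bottom in all_pairs if COMPLEMENT[top] != bottom]

# Split mismatches into pyrimidine-pyrimidine pairs and all others.
pyr_pyr = [p for p in mismatches if p[0] in PYRIMIDINES and p[1] in PYRIMIDINES]
others = [p for p in mismatches if p not in pyr_pyr]

n_pyr_pyr_total = len(pyr_pyr) * N_HOOGSTEEN_POSITIONS
n_others_total = len(others) * N_HOOGSTEEN_POSITIONS

print(sorted("-".join(p) for p in pyr_pyr))
print(n_pyr_pyr_total, n_others_total)
```

Each position contributes 4 pyrimidine–pyrimidine mismatches and 8 other mismatches, which across four positions gives the totals stated in the legend.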

Extended Data Fig. 7 The effects of mismatches on TBP–DNA binding.

a, Mismatch profile for TBP. b, Correlations between TBP-binding levels and DNA duplex stability were computed over all 16 base-pair variants at positions 1 to 8 in the TBP site. Bar plots (left) represent the squared Pearson correlation coefficient (R²) at each position. For the only three positions with significant correlations (positions 2, 7, and 8) the scatter plot correlation is presented (right), with binding signals representing medians over 9 replicate spots. Blue shaded regions, 95% confidence interval for Pearson’s correlation. The sequences of the Watson–Crick and mismatched base pairs are shown in each scatter plot (for example, for position 8, GC stands for the wild-type G-C base-pair in bold in the TBP site TATAAAAG, CC stands for C-C at this position, and so on). These high correlations are observed only in the unstacked base step positions. c, Left, structural overlays between TBP–DNA complexes with DNA mismatches (TBP-AC, orange; TBP-CC(2), cyan; TBP-CC(1a), purple; TBP-CC(1b), pink) and their corresponding Watson–Crick counterparts with single base substitutions (1QNE, green; 6NJQ, yellow). The base steps at position 7–8 are magnified and highlighted in black boxes. The structural overlay of the mismatch and the Watson–Crick base pairs are shown below each box, with their DNA sequences. Right, overlays of protein–DNA interfaces of TBP–DNA complexes, comparing mismatched and Watson–Crick sites. Four phenylalanine residues, as well as other amino acids that are discussed in the Supplementary Discussion, are highlighted with dashed circles. d, Comparisons of the effects of Watson–Crick mutations versus the cumulative effects of the two closest mismatches, shown for the mismatches with new crystal structures. In all three cases the mismatches have significantly larger effects than the Watson–Crick mutations (see also Methods and Supplementary Table 4). ΔΔG values for TBP_site_1 in Supplementary Table 4 were used in these comparisons. 
e, Example of a Watson–Crick mutation that has a similar effect (within experimental error, Supplementary Table 4) to the sum of the two closest mismatches. ΔΔG values for TBP_site_1 in Supplementary Table 4 were used in these comparisons.

Extended Data Fig. 8 Potential mechanisms for mismatch-enhanced TF binding.

a, TF–DNA complex formation involves creation of intermolecular interactions, as well as DNA conformational changes. Thermodynamically, these processes can be separated into two independent events, and thus an increase in binding affinity could stem from additional interactions (decrease of ΔGinteraction), and/or a reduction in the penalty to change the DNA conformation (decrease of ΔGpenalty). b, A reduction in the energetic penalty to distort the DNA (ΔGpenalty) could originate from DNA conformational changes owing to the mismatch, that is, before binding (for example, p53 and TBP, as described in the main text). c, A reduction in the energetic penalty for DNA distortion (ΔGpenalty) could also originate from changes in the bound DNA. For example, MD simulations of the DNA conformations in free form and in the MYC–DNA complex (for the wild-type A-T and the mismatch G-T) suggest that the reduced penalty in this case is primarily due to changes in the mismatched bound form. The extent of overlap of the kinking direction (γh) obtained from the MD simulations, analysed using a revised Jensen–Shannon divergence score (Ω; ref. 81), was Ω = 0.34 (wild type) versus Ω = 0.15 (G-T mismatch). Representative structures of the DNA sites are shown for wild-type free (pink), wild-type bound (orange), G-T free (green) and G-T bound (blue). The MYC–MAX heterodimer is shown as a grey surface. d, Mismatches could lead to the formation of non-native interactions such as hydrogen bonds (left), electrostatic potential and shape sensing (centre), and water-mediated interactions (right). Red empty arrows point to the locations of the change. These changes could occur directly at the position of the mismatched base (for example, the G-T mismatch for ETS1), as well as at the positions of other bases and/or the backbone, owing to non-native structures (for example, the G-A mismatch for ETS1). 
Notably, mismatches not only alter the potential interacting chemical groups of the replaced base, but can also alter the relative orientation of the interacting bases (as observed for the T in the wobble geometry on the left).
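The thermodynamic separation described in panel a can be written out as a short worked equation. This is a sketch in notation introduced here for illustration: the superscripts mm (mismatched site) and WC (Watson–Crick site) and the overall $\Delta G_{\mathrm{bind}}$ are assumptions, while $\Delta G_{\mathrm{interaction}}$ and $\Delta G_{\mathrm{penalty}}$ follow the legend.

```latex
\begin{align}
\Delta G_{\mathrm{bind}} &= \Delta G_{\mathrm{interaction}} + \Delta G_{\mathrm{penalty}} \\
\Delta\Delta G &= \Delta G_{\mathrm{bind}}^{\mathrm{mm}} - \Delta G_{\mathrm{bind}}^{\mathrm{WC}}
  = \underbrace{\bigl(\Delta G_{\mathrm{interaction}}^{\mathrm{mm}} - \Delta G_{\mathrm{interaction}}^{\mathrm{WC}}\bigr)}_{\Delta\Delta G_{\mathrm{interaction}}}
  + \underbrace{\bigl(\Delta G_{\mathrm{penalty}}^{\mathrm{mm}} - \Delta G_{\mathrm{penalty}}^{\mathrm{WC}}\bigr)}_{\Delta\Delta G_{\mathrm{penalty}}}
\end{align}
```

A mismatch increases binding when $\Delta\Delta G < 0$. This can happen through a negative $\Delta\Delta G_{\mathrm{interaction}}$ (new contacts, panel d), a negative $\Delta\Delta G_{\mathrm{penalty}}$ (a smaller distortion cost, panels b and c), or both.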

Extended Data Fig. 9 DNA mismatches in the cell.

a, Mismatches can result from misincorporation of bases during DNA replication by DNA polymerases. The average rate at which replication errors are generated and escape proofreading is low in healthy cells (around 10⁻⁹), but high in certain cancers and cells with Pol-ε or Pol-δ mutations. Even in healthy cells, the rates of generation of individual mismatches vary by more than a million-fold (ref. 17) depending on the sequence context and the type of mismatch. b, Mismatches result from genetic recombination. A characteristic feature of homologous recombination is the exchange of DNA strands, which results in the formation of heteroduplex DNA. Mismatches can result from genetic recombination when the parental chromosomes contain non-identical sequences. In addition, mismatches can arise during DNA synthesis associated with recombination repair. The repair of these mismatches might be less efficient, as it was previously shown (ref. 82) that there is a strong temporal coupling between DNA replication and mismatch repair but a lack of temporal coupling for heteroduplex rejection. c, Spontaneous deamination is common and estimated to occur 100–500 times per cell per day in humans (ref. 83). G-T mismatches generated by deamination of 5-methylcytosine (5-meC) are not repaired by the DNA mismatch repair pathway and have considerably lower repair efficiency (ref. 83). The high rate of 5-meC deamination, combined with their relatively slow repair in mammalian cells, contributes to making 5-meC a preferential target for point mutations (about 40-fold) compared to other nucleotides in the genome (ref. 84), and one of the major sources of the frequent C-to-T mutations observed in human cells (ref. 18). d, Transcription factors bound to mismatched DNA could interfere with Pol-δ strand displacement activity. Left, DNA synthesized by non-proofreading mismatch-prone Pol-α is normally displaced by the proofreading non-error-prone Pol-δ. 
Right, it was previously shown (ref. 10) that increased mutation signals arise from regions synthesized by Pol-α that contain TF-binding sites. This study suggested that mismatched DNA synthesized by non-proofreading Pol-α is rapidly bound by TFs that act as barriers to Pol-δ displacement of Pol-α-synthesized DNA, resulting in locally increased mutation rates in subsequent rounds of replication.

Extended Data Table 1 Data collection and refinement statistics for TBP–DNA mismatch structures

Supplementary information

Supplementary Information

This file contains Supplementary Methods, a Supplementary Discussion and Supplementary References.

Reporting Summary

Supplementary Figure 1

Original source images for all EMSA data reported in this study.

SaMBA data

Supplementary Table 1. This table contains the raw and processed SaMBA data for the 22 TFs.

Validation of the effect of mismatches in non-specific DNA

Supplementary Table 2. This table contains the raw and processed SaMBA data used to compare the Ets1 binding level at mismatched non-specific sites versus random DNA sites and specific sites from NMR and X-ray crystal structures of Ets1-DNA complexes.

Calibration of SaMBA data

Supplementary Table 3. This table contains Kd data from EMSA, FA, MITOMI and SPR experiments, used to calibrate our high-throughput SaMBA data.

Calibrated SaMBA data used to compare the effects of mismatches versus mutations on TF binding

Supplementary Table 4. This table contains the raw and processed binding data for all mismatches and mutations in 12 TF binding sites for the 7 TFs with calibration data in our study, as well as statistics of the comparisons between the effects of mismatches versus mutations.

Structural distortions in TF-bound DNA

Supplementary Table 5. This table shows the deviations from the B-DNA envelope for DNA structural parameters at each base pair position in 12 TF-DNA complexes.

Results of structural mimicry analysis

Supplementary Table 6. (a) X-ray structures of protein-DNA complexes selected for structural analyses of mismatches that increase TF binding. (b) Distortions at DNA positions where mismatches increase TF binding affinity. (c) Distortions of DNA structural parameters of mismatches relative to Watson-Crick base pairs. (d) Mismatches that increase TF binding affinity and exhibit geometries similar to distorted base pairs in TF-bound DNA.

Analysis of TF-DNA hydrogen bonding and buried surface area in MD simulations of TF-DNA complexes with and without mismatches

Supplementary Table 7. Results shown are from MD simulations. DNA sequences were derived from the sequences used in SaMBA.

Defining the B-DNA envelope

Supplementary Table 8. Summary statistics of base pair parameters (mean, maximum value, minimum value, and standard deviation) for base pairs in B-DNA (as well as TF-bound DNA), obtained from a comprehensive survey of structures deposited in PDB.

Structural survey of DNA mismatches

Supplementary Table 9. All possible single mismatches (excluding modified bases) surrounded by at least two canonical Watson-Crick base pairs on both sides, from PDB structures. These data were used to survey the structure and geometry of DNA mismatches.

SaMBA hybridization signal

Supplementary Table 10. Fluorescent signal for DNA duplexes expected to contain labeled and unlabeled probes, from the hybridization of 12 sequences on a DNA chip (see also Figure S3). For the sequences with an unlabeled complementary strand (sequences 2, 4, 6, 8, 10, 12), the signal is several orders of magnitude lower than for the sequences with a labeled complementary strand (sequences 1, 3, 5, 7, 9, 11).

NMR results confirming that T-T and C-T mismatches mimic Hoogsteen A-T geometry

Supplementary Table 11. This table includes the chemical shift differences in the sugar C1'/C3'/C4' carbons for T-T and C-T mismatches versus a locked Hoogsteen conformation (using N1-methyladenosine, or m1A), relative to the Watson-Crick base-paired duplex.


Cite this article

Afek, A., Shi, H., Rangadurai, A. et al. DNA mismatches reveal conformational penalties in protein–DNA recognition. Nature 587, 291–296 (2020). https://doi.org/10.1038/s41586-020-2843-2
