DNA mismatches reveal conformational penalties in protein–DNA recognition

Afek, Ariel; Shi, Honglue; Rangadurai, Atul; Sahay, Harshit; Senitzki, Alon; Xhani, Suela; Fang, Mimi; Salinas, Raul; Mielko, Zachery; Pufall, Miles A.; Poon, Gregory M. K.; Haran, Tali E.; Schumacher, Maria A.; Al-Hashimi, Hashim M.; Gordân, Raluca

doi:10.1038/s41586-020-2843-2

Article
Published: 21 October 2020

Nature volume 587, pages 291–296 (2020)

Abstract

Transcription factors recognize specific genomic sequences to regulate complex gene-expression programs. Although it is well-established that transcription factors bind to specific DNA sequences using a combination of base readout and shape recognition, some fundamental aspects of protein–DNA binding remain poorly understood^{1,2}. Many DNA-binding proteins induce changes in the structure of the DNA outside the intrinsic B-DNA envelope. However, how the energetic cost that is associated with distorting the DNA contributes to recognition has proven difficult to study, because the distorted DNA exists in low abundance in the unbound ensemble^{3–9}. Here we use a high-throughput assay that we term SaMBA (saturation mismatch-binding assay) to investigate the role of DNA conformational penalties in transcription factor–DNA recognition. In SaMBA, mismatched base pairs are introduced to pre-induce structural distortions in the DNA that are much larger than those induced by changes in the Watson–Crick sequence. Notably, approximately 10% of mismatches increased transcription factor binding, and for each of the 22 transcription factors that were examined, at least one mismatch was found that increased the binding affinity. Mismatches also converted non-specific sites into high-affinity sites, and high-affinity sites into ‘super sites’ that exhibit stronger affinity than any known canonical binding site. Determination of high-resolution X-ray structures, combined with nuclear magnetic resonance measurements and structural analyses, showed that many of the DNA mismatches that increase binding induce distortions that are similar to those induced by protein binding, thus prepaying some of the energetic cost incurred from deforming the DNA. Our work indicates that conformational penalties are a major determinant of protein–DNA recognition, and reveals mechanisms by which mismatches can recruit transcription factors and thus modulate replication and repair activities in the cell^{10,11}.

**Fig. 1: SaMBA measures the effects of mismatches on protein–DNA binding in high throughput.**

**Fig. 2: The effects of DNA mismatches on TF binding.**

**Fig. 3: DNA mismatches that exhibit geometries similar to distorted base pairs in TF-bound DNA lead to increased binding affinity.**

Data availability

The data that support the findings in this study are available as Supplementary Tables in Excel format. Coordinates and structure factor amplitudes for the TBP-AC, TBP-CC(1a), TBP-CC(1b) and TBP-CC(2) structures have been deposited in the PDB under the accession codes 6UEO, 6UEP, 6UER and 6UEQ, respectively. The raw SaMBA data have been deposited in the Gene Expression Omnibus (GEO) under accession number GSE156375. The PDB entries used in this study are available in Extended Data Figs. 1, 2, 5, 7 and Supplementary Tables 5–7, 9. High-resolution gel images for the EMSA data are available at https://figshare.com/projects/DNA_mismatches_reveal_conformational_penalties_in_protein-DNA_recognition/83663.

Code availability

The code used for the structural analyses presented in this study is available in GitHub at https://github.com/alhashimilab/TF_MM.

References

1. Rohs, R. et al. Origins of specificity in protein–DNA recognition. Annu. Rev. Biochem. 79, 233–269 (2010).
2. Siggers, T. & Gordân, R. Protein–DNA binding: complexities and multi-protein codes. Nucleic Acids Res. 42, 2099–2111 (2014).
3. Guéron, M., Kochoyan, M. & Leroy, J.-L. A single mode of DNA base-pair opening drives imino proton exchange. Nature 328, 89–92 (1987).
4. Nikolova, E. N. et al. Transient Hoogsteen base pairs in canonical duplex DNA. Nature 470, 498–502 (2011).
5. Fischer, M., Coleman, R. G., Fraser, J. S. & Shoichet, B. K. Incorporation of protein flexibility and conformational energy penalties in docking screens to improve ligand discovery. Nat. Chem. 6, 575–583 (2014).
6. Fraser, J. S. et al. Hidden alternative structures of proline isomerase essential for catalysis. Nature 462, 669–673 (2009).
7. Lorch, Y., Davis, B. & Kornberg, R. D. Chromatin remodeling by DNA bending, not twisting. Proc. Natl Acad. Sci. USA 102, 1329–1332 (2005).
8. Parvin, J. D., McCormick, R. J., Sharp, P. A. & Fisher, D. E. Pre-bending of a promoter sequence enhances affinity for the TATA-binding factor. Nature 373, 724–727 (1995).
9. Denny, S. K. et al. High-throughput investigation of diverse junction elements in RNA tertiary folding. Cell 174, 377–390 (2018).
10. Reijns, M. A. M. et al. Lagging-strand replication shapes the mutational landscape of the genome. Nature 518, 502–506 (2015).
11. Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016).
12. Rohs, R. et al. The role of DNA shape in protein–DNA recognition. Nature 461, 1248–1253 (2009).
13. Zeiske, T. et al. Intrinsic DNA shape accounts for affinity differences between Hox-cofactor binding sites. Cell Rep. 24, 2221–2230 (2018).
14. Azad, R. N. et al. Experimental maps of DNA structure at nucleotide resolution distinguish intrinsic from protein-induced DNA deformations. Nucleic Acids Res. 46, 2636–2647 (2018).
15. Olson, W. K., Gorin, A. A., Lu, X.-J., Hock, L. M. & Zhurkin, V. B. DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proc. Natl Acad. Sci. USA 95, 11163–11168 (1998).
16. Battistini, F. et al. How B-DNA dynamics decipher sequence-selective protein recognition. J. Mol. Biol. 431, 3845–3859 (2019).
17. Kunkel, T. A. & Erie, D. A. Eukaryotic mismatch repair in relation to DNA replication. Annu. Rev. Genet. 49, 291–313 (2015).
18. Lindahl, T. Instability and decay of the primary structure of DNA. Nature 362, 709–715 (1993).
19. Pich, O. et al. Somatic and germline mutation periodicity follow the orientation of the DNA minor groove around nucleosomes. Cell 175, 1074–1087 (2018).
20. Berger, M. F. et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24, 1429–1435 (2006).
21. Shen, N. et al. Divergence in DNA specificity among paralogous transcription factors contributes to their differential in vivo binding. Cell Syst. 6, 470–483 (2018).
22. Veprintsev, D. B. & Fersht, A. R. Algorithm for prediction of tumour suppressor p53 affinity for binding sites in DNA. Nucleic Acids Res. 36, 1589–1598 (2008).
23. Jolma, A. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20, 861–873 (2010).
24. Warren, C. L. et al. Defining the sequence-recognition profile of DNA-binding molecules. Proc. Natl Acad. Sci. USA 103, 867–872 (2006).
25. Benos, P. V., Bulyk, M. L. & Stormo, G. D. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 30, 4442–4451 (2002).
26. Chattopadhyay, A., Zandarashvili, L., Luu, R. H. & Iwahara, J. Thermodynamic additivity for impacts of base-pair substitutions on association of the Egr-1 zinc-finger protein with DNA. Biochemistry 55, 6467–6474 (2016).
27. Kitayner, M. et al. Diversity in DNA recognition by p53 revealed by crystal structures with Hoogsteen base pairs. Nat. Struct. Mol. Biol. 17, 423–429 (2010).
28. Golovenko, D. et al. New insights into the role of DNA shape on its recognition by p53 proteins. Structure 26, 1237–1250 (2018).
29. Alvey, H. S., Gottardo, F. L., Nikolova, E. N. & Al-Hashimi, H. M. Widespread transient Hoogsteen base pairs in canonical duplex DNA with variable energetics. Nat. Commun. 5, 4786 (2014).
30. Shi, H. et al. Atomic structures of excited state A-T Hoogsteen base pairs in duplex DNA by combining NMR relaxation dispersion, mutagenesis, and chemical shift calculations. J. Biomol. NMR 70, 229–244 (2018).
31. Kim, J. L., Nikolov, D. B. & Burley, S. K. Co-crystal structure of TBP recognizing the minor groove of a TATA element. Nature 365, 520–527 (1993).
32. Mondal, M., Mukherjee, S. & Bhattacharyya, D. Contribution of phenylalanine side chain intercalation to the TATA-box binding protein-DNA interaction: molecular dynamics and dispersion-corrected density functional theory studies. J. Mol. Model. 20, 2499 (2014).
33. Peyret, N., Seneviratne, P. A., Allawi, H. T. & SantaLucia, J., Jr. Nearest-neighbor thermodynamics and NMR of DNA sequences with internal A.A, C.C, G.G, and T.T mismatches. Biochemistry 38, 3468–3477 (1999).
34. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
35. Zhou, H. et al. New insights into Hoogsteen base pairs in DNA duplexes from a structure-based survey. Nucleic Acids Res. 43, 3420–3433 (2015).
36. Lu, X.-J., Bussemaker, H. J. & Olson, W. K. DSSR: an integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Res. 43, e142 (2015).
37. Sathyamoorthy, B. et al. Insights into Watson–Crick/Hoogsteen breathing dynamics and damage repair from the solution structure and dynamic ensemble of DNA duplexes containing m¹A. Nucleic Acids Res. 45, 5586–5601 (2017).
38. El Hassan, M. A. & Calladine, C. R. Two distinct modes of protein-induced bending in DNA. J. Mol. Biol. 282, 331–343 (1998).
39. Bailor, M. H., Mustoe, A. M., Brooks, C. L., III & Al-Hashimi, H. M. 3D maps of RNA interhelical junctions. Nat. Protocols 6, 1536–1545 (2011).
40. Bailor, M. H., Sun, X. & Al-Hashimi, H. M. Topology links RNA secondary structure with global conformation, dynamics, and adaptation. Science 327, 202–206 (2010).
41. Le Novère, N. MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics 17, 1226–1227 (2001).
42. Cheatham, T. E. III, Cieplak, P. & Kollman, P. A. A modified version of the Cornell et al. force field with improved sugar pucker phases and helical repeat. J. Biomol. Struct. Dyn. 16, 845–862 (1999).
43. Pérez, A., Luque, F. J. & Orozco, M. Dynamics of B-DNA on the microsecond time scale. J. Am. Chem. Soc. 129, 14739–14745 (2007).
44. Maier, J. A. et al. ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J. Chem. Theory Comput. 11, 3696–3713 (2015).
45. Salomon-Ferrer, R., Götz, A. W., Poole, D., Le Grand, S. & Walker, R. C. Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. explicit solvent particle mesh Ewald. J. Chem. Theory Comput. 9, 3878–3888 (2013).
46. Rossetti, G. et al. The structural impact of DNA mismatches. Nucleic Acids Res. 43, 4309–4321 (2015).
47. Arnold, F. H., Wolk, S., Cruz, P. & Tinoco, I. Jr. Structure, dynamics, and thermodynamics of mismatched DNA oligonucleotide duplexes d(CCCAGGG)2 and d(CCCTGGG)2. Biochemistry 26, 4068–4075 (1987).
48. Kouchakdjian, M., Li, B. F., Swann, P. F. & Patel, D. J. Pyrimidine.pyrimidine base-pair mismatches in DNA. A nuclear magnetic resonance study of T.T pairing at neutral pH and C.C pairing at acidic pH in dodecanucleotide duplexes. J. Mol. Biol. 202, 139–155 (1988).
49. Boulard, Y. et al. The pH dependent configurations of the C.A mispair in DNA. Nucleic Acids Res. 20, 1933–1941 (1992).
50. Peng, Y. & Alexov, E. Computational investigation of proton transfer, pKa shifts and pH-optimum of protein-DNA and protein-RNA complexes. Proteins 85, 282–295 (2017).
51. Chen, W., Morrow, B. H., Shi, C. & Shen, J. K. Recent development and application of constant pH molecular dynamics. Mol. Simul. 40, 830–838 (2014).
52. Rangadurai, A. et al. Why are Hoogsteen base pairs energetically disfavored in A-RNA compared to B-DNA? Nucleic Acids Res. 46, 11099–11114 (2018).
53. Patel, D. J., Kozlowski, S. A., Ikuta, S. & Itakura, K. Deoxyguanosine-deoxyadenosine pairing in the d(C-G-A-G-A-A-T-T-C-G-C-G) duplex: conformation and dynamics at and adjacent to the dG X dA mismatch site. Biochemistry 23, 3207–3217 (1984).
54. Webster, G. D. et al. Crystal structure and sequence-dependent conformation of the A.G mispaired oligonucleotide d(CGCAAGCTGGCG). Proc. Natl Acad. Sci. USA 87, 6693–6697 (1990).
55. Allawi, H. T. & SantaLucia, J., Jr. NMR solution structure of a DNA dodecamer containing single G.T mismatches. Nucleic Acids Res. 26, 4925–4934 (1998).
56. Boulard, Y., Cognet, J. A. & Fazakerley, G. V. Solution structure as a function of pH of two central mismatches, C.T and C.C, in the 29 to 39 K-ras gene sequence, by nuclear magnetic resonance and molecular dynamics. J. Mol. Biol. 268, 331–347 (1997).
57. Gordân, R. et al. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep. 3, 1093–1104 (2013).
58. Frank, F., Okafor, C. D. & Ortlund, E. A. The first crystal structure of a DNA-free nuclear receptor DNA binding domain sheds light on DNA-driven allostery in the glucocorticoid receptor. Sci. Rep. 8, 13497 (2018).
59. Takayama, Y., Sahu, D. & Iwahara, J. NMR studies of translocation of the Zif268 protein between its target DNA Sites. Biochemistry 49, 7998–8005 (2010).
60. Belo, Y. et al. Unexpected implications of STAT3 acetylation revealed by genetic encoding of acetyl-lysine. Biochim. Biophys. Acta 1863, 1343–1350 (2019).
61. Stelling, A. L. et al. Infrared spectroscopic observation of a G-C⁺ Hoogsteen base pair in the DNA:TATA-box binding protein complex under solution conditions. Angew. Chem. Int. Edn Engl. 58, 12010–12013 (2019).
62. Stephens, D. C. & Poon, G. M. Differential sensitivity to methylated DNA by ETS-family transcription factors is intrinsically encoded in their DNA-binding domains. Nucleic Acids Res. 44, 8671–8681 (2016).
63. Zhang, L. et al. SelexGLM differentiates androgen and glucocorticoid receptor DNA-binding preference over an extended binding site. Genome Res. 28, 111–121 (2018).
64. Vyas, P. et al. Diverse p53/DNA binding modes expand the repertoire of p53 response elements. Proc. Natl Acad. Sci. USA 114, 10624–10629 (2017).
65. Weinberg, R. L., Veprintsev, D. B. & Fersht, A. R. Cooperative binding of tetrameric p53 to DNA. J. Mol. Biol. 341, 1145–1159 (2004).
66. Sandelin, A., Alkema, W., Engström, P., Wasserman, W. W. & Lenhard, B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32, D91–D94 (2004).
67. Siggers, T. et al. Principles of dimer-specific gene regulation revealed by a comprehensive characterization of NF-κB family DNA binding. Nat. Immunol. 13, 95–102 (2012).
68. Luisi, B. F. et al. Crystallographic analysis of the interaction of the glucocorticoid receptor with DNA. Nature 352, 497–505 (1991).
69. Beno, I., Rosenthal, K., Levitine, M., Shaulov, L. & Haran, T. E. Sequence-dependent cooperative binding of p53 to DNA targets and its relationship to the structural properties of the DNA targets. Nucleic Acids Res. 39, 1919–1932 (2011).
70. Stephens, D. C. et al. Pharmacologic efficacy of PU.1 inhibition by heterocyclic dications: a mechanistic analysis. Nucleic Acids Res. 44, 4005–4013 (2016).
71. Siggers, T., Duyzend, M. H., Reddy, J., Khan, S. & Bulyk, M. L. Non-DNA-binding cofactors enhance DNA-binding specificity of a transcriptional regulatory complex. Mol. Syst. Biol. 7, 555 (2011).
72. Maerkl, S. J. & Quake, S. R. A systems approach to measuring the binding energy landscapes of transcription factors. Science 315, 233–237 (2007).
73. Geertz, M., Shore, D. & Maerkl, S. J. Massively parallel measurements of molecular interaction kinetics on a microfluidic platform. Proc. Natl Acad. Sci. USA 109, 16540–16545 (2012).
74. Drachkova, I. et al. Effect of TATA box polymorphisms in human β-globin gene promoter associated with β-thalassemia on interaction with TATA-binding protein. Russ. J. Genet. Appl. Res. 1, 183–188 (2011).
75. Drachkova, I. et al. The mechanism by which TATA-box polymorphisms associated with human hereditary diseases influence interactions with the TATA-binding protein. Hum. Mutat. 35, 601–608 (2014).
76. Leslie, A. G. The integration of macromolecular diffraction data. Acta Crystallogr. D 62, 48–57 (2006).
77. Potterton, E., Briggs, P., Turkenburg, M. & Dodson, E. A graphical user interface to the CCP4 program suite. Acta Crystallogr. D 59, 1131–1137 (2003).
78. Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr. D 66, 213–221 (2010).
79. Jones, T. A., Zou, J. Y., Cowan, S. W. & Kjeldgaard, M. Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr. A 47, 110–119 (1991).
80. Chen, V. B. et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D 66, 12–21 (2010).
81. Yang, S., Salmon, L. & Al-Hashimi, H. M. Measuring similarity between dynamic ensembles of biomolecules. Nat. Methods 11, 552–554 (2014).
82. Hombauer, H., Srivatsan, A., Putnam, C. D. & Kolodner, R. D. Mismatch repair, but not heteroduplex rejection, is temporally coupled to DNA replication. Science 334, 1713–1716 (2011).
83. Krokan, H. E., Drabløs, F. & Slupphaug, G. Uracil in DNA – occurrence, consequences and repair. Oncogene 21, 8935–8948 (2002).
84. Shen, J. C., Rideout, W. M., III & Jones, P. A. The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. Nucleic Acids Res. 22, 972–976 (1994).

Acknowledgements

We dedicate this paper to the memory of Dr Rosalind E. Franklin, on the occasion of the 100th anniversary of her birth. Dr Franklin’s legacy, including her crucial contribution to the discovery of the molecular structure of DNA, continues to inspire generations of diverse scientists around the world. We thank S. Adar for discussions that initiated this project; D. Herschlag for discussions and comments; E. Arbely, D. Golovenko, J. Iwahara, E. Ortlund and R. Young for providing recombinant purified protein; and L. McIntosh for providing expression plasmids. This work was supported by the National Institutes of Health (NIH) grants R01-GM135658 and R01-GM117106 (to R.G.) and R01-GM089846 (to H.M.A.-H.); a Duke University GCB Pilot Grant (to R.G. and H.M.A.-H.); and an Integrated DNA Technologies postdoctoral fellowship award (to A.A.). R.S. and M.A.S. were supported by NIH grant R35-GM130290 (to M.A.S.); A.S. and T.E.H. were supported by the Israel Science Foundation grant 1517/14 (to T.E.H.); S.X. and G.M.K.P. were supported by a National Science Foundation (NSF) grant MCB-2028902 (to G.M.K.P.); and M.F. and M.A.P. were supported by an NSF CAREER award MCB-1552862 (to M.A.P.). High-performance computing was partially supported by the Duke Center for Genomic and Computational Biology. We acknowledge the Advanced Light Source (ALS) at the Lawrence Berkeley National Laboratory for X-ray diffraction data collection on beamlines 8.3.1 and 5.0.1. Beamline 8.3.1 at the ALS is operated by the University of California Office of the President, Multicampus Research Programs and Initiatives grant MR-15-328599, the NIH (R01GM124149 and P30GM124169), Plexxikon and the Integrated Diffraction Analysis Technologies program of the US Department of Energy Office of Biological and Environmental Research. The Pilatus detector on beamline 5.0.1 was funded under NIH grant S10OD021832. The ALS-ENABLE beamlines are supported in part by the NIH National Institute of General Medical Sciences grant P30 GM124169. The ALS is a national user facility operated by Lawrence Berkeley National Laboratory on behalf of the US Department of Energy under contract number DE-AC02-05CH11231, Office of Basic Energy Sciences. The Berkeley Center for Structural Biology is supported in part by the Howard Hughes Medical Institute.

Author information

Authors and Affiliations

Center for Genomic and Computational Biology, Duke University School of Medicine, Durham, NC, USA
Ariel Afek, Harshit Sahay, Zachery Mielko & Raluca Gordân
Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
Ariel Afek & Raluca Gordân
Department of Chemistry, Duke University, Durham, NC, USA
Honglue Shi & Hashim M. Al-Hashimi
Department of Biochemistry, Duke University School of Medicine, Durham, NC, USA
Atul Rangadurai, Raul Salinas, Maria A. Schumacher & Hashim M. Al-Hashimi
Program in Computational Biology and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
Harshit Sahay
Department of Biology, Technion–Israel Institute of Technology, Haifa, Israel
Alon Senitzki & Tali E. Haran
Department of Chemistry, Georgia State University, Atlanta, GA, USA
Suela Xhani & Gregory M. K. Poon
Department of Biochemistry, Carver College of Medicine, University of Iowa, Iowa City, IA, USA
Mimi Fang & Miles A. Pufall
Holden Comprehensive Cancer Center, University of Iowa, Iowa City, IA, USA
Mimi Fang & Miles A. Pufall
Program in Genetics and Genomics, Duke University School of Medicine, Durham, NC, USA
Zachery Mielko
Center for Diagnostics and Therapeutics, Georgia State University, Atlanta, GA, USA
Gregory M. K. Poon
Department of Computer Science, Duke University, Durham, NC, USA
Raluca Gordân
Department of Molecular Genetics and Microbiology, Duke University School of Medicine, Durham, NC, USA
Raluca Gordân

Contributions

A.A., H.M.A.-H. and R.G. designed and supervised the study. A.A. generated high-throughput protein–DNA binding data. A.A., H. Shi, A.R. and H. Sahay analysed the data. H. Shi and A.R. contributed NMR data. A.S., S.X., M.F., M.A.P., G.M.K.P. and T.E.H. contributed experimental data on protein–DNA binding affinities: p53 (A.S., T.E.H.), ETS1 (S.X., G.M.K.P.) and GR (M.F., M.A.P.). Z.M. contributed high-throughput protein–DNA binding data. R.S. and M.A.S. contributed X-ray crystallography data. A.A., H. Shi, A.R., H.M.A.-H. and R.G. wrote the manuscript, with input from all authors. All of the authors critically reviewed the manuscript and approved the final version.

Corresponding authors

Correspondence to Hashim M. Al-Hashimi or Raluca Gordân.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature thanks James Fraser, Remo Rohs and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Structural deformations in TF-bound and unbound DNA.

a, Distributions of base-pair parameters in free and TF-bound DNA, from PDB³⁴ survey. Solid lines denote the median value of each parameter. Dashed lines denote the upper and lower bounds of the distribution for free (pink) and bound (green) DNA. 613 TF-bound structures and 409 free B-DNA structures, all with resolution < 3 Å, were used in the analysis (Methods). b, Percentage of structures with base pairs outside the B-DNA envelope. Among the 613 TF-bound structures, 41.1% (that is, 252) contain severe distortions of at least one base pair outside the free B-DNA envelope, with the envelope defined as at most 3 standard deviations above or below the mean. Only 16% (that is, 65) of the free B-DNA structures satisfy this criterion. (Using a less stringent definition of the B-DNA envelope, by considering two standard deviations above or below the mean, we found that 80.8% of the TF-bound structures contain at least one base pair outside the free B-DNA envelope, approximately twice the frequency observed in free DNA, which was 41.8%.) Considering the full range of base-pair parameter values as defining the free B-DNA envelope, we found that 11.3% (that is, 69) of the TF-bound structures contain at least one base pair with an extreme deformation that was never observed in any free DNA structure. c, Local deformations of base pairs observed in diverse TF-DNA complex structures. Left, 3D structures with the distorted base pairs highlighted in black boxes. Upper right, enlarged view of the base-pair structures with their base-pair parameters labelled. Lower right, schematic diagram of the corresponding base-pair parameters.
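
The envelope criterion described for panel b can be illustrated with a minimal sketch. It assumes a single base-pair parameter (the caption's analysis spans several), and the function names and toy values are hypothetical, not taken from the authors' code: the envelope is the mean of the free-DNA values plus or minus a chosen number of standard deviations, and a structure is counted if at least one of its base pairs falls outside it.

```python
import numpy as np

def b_dna_envelope(free_values, n_sd=3.0):
    """Envelope bounds: mean of the free-DNA values -/+ n_sd standard deviations."""
    free_values = np.asarray(free_values, dtype=float)
    mean, sd = free_values.mean(), free_values.std()
    return mean - n_sd * sd, mean + n_sd * sd

def fraction_outside(structures, bounds):
    """Fraction of structures with at least one base pair outside the bounds.

    structures: list of sequences, each holding one parameter value per base pair.
    """
    lower, upper = bounds
    flagged = [bool(np.any((np.asarray(s) < lower) | (np.asarray(s) > upper)))
               for s in structures]
    return sum(flagged) / len(flagged)

# Toy values (hypothetical, in Å); these are NOT data from the paper.
free_shear = [-0.2, -0.1, 0.0, 0.1, 0.2]
bound_structures = [[0.0, 0.1], [0.3, 1.5], [-0.6, 0.0], [0.2, -0.2]]

bounds = b_dna_envelope(free_shear)          # 3-SD envelope from free DNA
frac = fraction_outside(bound_structures, bounds)
print(f"envelope: {bounds[0]:.2f} to {bounds[1]:.2f} Å")
print(f"fraction of structures outside: {frac:.2f}")
```

Passing `n_sd=2.0` gives the less stringent envelope mentioned in the caption. The reported percentages are consistent with the stated counts: 252/613 ≈ 41.1% and 69/613 ≈ 11.3%.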

Extended Data Fig. 2 Structural characteristics of DNA mismatches.

a, Base-pairing geometry of Watson–Crick base pairs and mismatches, obtained from a survey of crystal structures in the PDB³⁴. Mismatches with modified bases and those that were metal-mediated were excluded from analysis (Methods). Predominant base-pairing geometries under neutral pH conditions are shown in black. Minor geometries are shown in grey. b, Melting energies for DNA mismatches relative to G-C and A-T Watson–Crick base pairs. See Methods for details. c, Distributions of structural parameters in Watson–Crick and mismatched DNA, from MD simulations. Solid lines denote the median value of each parameter. Observations from the MD simulation results: (1) G-T retains wobble geometry during the MD simulation, with a sheared conformation (|shear| around 2 Å) accompanied by a slight stretch. (2) T-T shows wobble geometry with a sheared conformation (|shear| around 2 Å). Unlike G-T, the T-T mismatch shows a rapid dynamic equilibrium between both wobble geometries, with either one of the Ts shifted in the minor groove direction. Despite this rapid dynamic equilibrium, the T-T base pair is still constricted, with a C1′–C1′ distance of 8–9.5 Å. (3) Similar to T-T, the C-T mismatch is also constricted, with two hydrogen bonds stably formed for most of the time. However, the C-T mismatch can transiently adopt a high-energy conformation that has only one hydrogen bond and is no longer constricted (C1′–C1′ distance around 10 Å), potentially owing to the close contact between T-O2 and C-O2. These high-energy species account for approximately 5% of the C-T MD trajectory. (4) C-C is partially constricted, with a C1′–C1′ distance around 9.8 Å, owing to unstable hydrogen bonding. (5) All pyrimidine-pyrimidine mismatches remain stacked in the helix, without swinging out of the helix, in the MD trajectories. (6) G-G does not experience anti-syn equilibrium during the simulation.
The C1′–C1′ distance of G-G (G(syn)-G(anti) or G(anti)-G(syn)) is around 11.2–11.5 Å, which is larger than the canonical G-C base pair. (7) G(anti)-A(syn) is not constricted (C1′–C1′ distance around 11 Å) and G(anti)-A(anti) reveals large C1′–C1′ distance around 12.8 Å. Base-pair and base-step parameters of bases with syn conformation (marked with *) were not computed, and are thus greyed out, owing to an ill-defined coordinate frame (Methods). The C1′–C1′ distance is shown, as it is not affected by the change of coordinate frame. d, Mismatches can mimic distorted base-pair geometries observed in protein-bound DNA. Overlays of distorted (coloured) and idealized Watson–Crick (grey) base pairs from 3DNA (top); mismatches (coloured) and idealized Watson–Crick (grey) base pairs (middle); and mismatched and distorted Watson–Crick base pairs (right). The mismatched conformations are of free DNA and were obtained from MD simulations (Methods). The C-T mismatch can mimic an A-T Hoogsteen base pair by constricting the C1′–C1′ distance (taken from PDB 3KZ8). The G-T mismatch can mimic a sheared A-T base pair by shifting the T to the major groove direction (taken from PDB 4MZR).
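
The C1′–C1′ distances quoted in panel c are plain Euclidean distances between the two C1′ atoms of a base pair, evaluated frame by frame along a trajectory. The following is a minimal sketch of that calculation under stated assumptions: the coordinates are toy values, the function names are hypothetical, and the 9.8 Å cutoff is an illustrative choice rather than a threshold taken from the paper's Methods.

```python
import numpy as np

def c1_c1_distance(c1_a, c1_b):
    """Per-frame distance (Å) between two C1' atoms.

    c1_a, c1_b: arrays of shape (n_frames, 3) with Cartesian coordinates in Å.
    """
    diff = np.asarray(c1_a, dtype=float) - np.asarray(c1_b, dtype=float)
    return np.linalg.norm(diff, axis=1)

def fraction_above(distances, cutoff):
    """Fraction of frames whose distance exceeds the cutoff."""
    return float(np.mean(np.asarray(distances) > cutoff))

# Toy four-frame trajectory (hypothetical coordinates, NOT simulation data).
strand_1 = np.zeros((4, 3))
strand_2 = np.array([[9.0, 0.0, 0.0],
                     [9.2, 0.0, 0.0],
                     [10.1, 0.0, 0.0],
                     [8.8, 0.0, 0.0]])

distances = c1_c1_distance(strand_1, strand_2)
# Illustrative cutoff separating constricted from non-constricted frames.
open_fraction = fraction_above(distances, cutoff=9.8)
print(distances)
print(open_fraction)
```

Applied to a real trajectory, a calculation of this kind would yield population estimates such as the approximately 5% of non-constricted C-T frames described in the caption; the cutoff would need to be chosen from the observed distance distribution.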

Extended Data Fig. 3 Validation and calibration of SaMBA measurements.

a, Schematic representation of our experimental workflow to detect cross-hybridization. To check whether certain oligonucleotides hybridize with non-target complementary oligonucleotides, we designed an experiment in which only certain oligonucleotides (red) were labelled. If significant cross-hybridization occurred, we would have detected fluorescent signal on the chip even for sequences without fluorescent complements in the hybridization solution (that is, for the sequences shown in blue). b, No significant cross-hybridization was detected. Bottom, list of 12 sequences used in the hybridization solution of one SaMBA experiment (red: fluorescently labelled oligonucleotides; blue: unlabelled). Top, fluorescent signal from the hybridization of these 12 sequences on the chip. For the sequences on the chip for which their complement is not labelled, the fluorescent signal is practically undetectable (blue), and it is several orders of magnitude lower than the sequences with a labelled complementary strand (red). Box plots show median signals over replicate DNA spots, with the bottom and top edges of each box indicating the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points not considered outliers. c, The effect of mismatches on hybridization. To estimate the efficiency of our hybridization protocol, we measured the hybridization signal of one specific sequence (sequence #3 for library v.1; see Methods, Supplementary Table 10), to different sequences containing multiple mismatches (0 to around 40), and a completely different sequence (‘60*’). As expected, the hybridization was less efficient for sequences with large numbers of mismatches. However, for small numbers of mismatches the hybridization was highly efficient. Longer incubation time, higher oligonucleotide concentration, and normalization of the signal could enable the use of SaMBA for larger numbers of mismatches. 
Plot shows medians and standard deviations over all sequences containing the same number of mismatches, with 6 replicate spots per sequence. Mismatches were introduced randomly by generating N random base changes (N = 1–5, 10, 15, 25, 35, 45) to sequence #3, and repeating the procedure ten times for each N. This led to duplexes with 1 to 37 mismatches compared to the original sequence. d, Hybridization signal is highly reproducible. The correlation of hybridization signals between two replicate experiments was very high (R² = 0.99). Plot shows median values, computed over six replicate spots, based on data shown in c. e, Validation of mismatch effects by orthogonal methods. For p53, ETS1, and GR proteins, the log-transformed SaMBA binding intensities correlate with independent affinity measurements performed on mismatched and non-mismatched DNA sites (Methods). Similarly to PBM experiments, median values over all replicates were used for SaMBA (n = 10 replicate spots); error bars show the median absolute deviation. Average values over replicates were used for the orthogonal methods (n = 6 independent measurements for p53, and n = 3 independent measurements for ETS1 and GR), with error bars showing the standard deviation. Red shaded region, 95% confidence interval for Pearson’s correlation. Binding free energy differences (ΔΔG) are shown between native Watson–Crick binding sites and the highest increase in binding due to a mismatch. Two SaMBA sites were tested for GR (see Methods). f, Correlation between binding data obtained by SaMBA versus independent methods. For SaMBA data the plots show the median values over replicate spots (n = 10 replicate spots), with error bars showing the median absolute deviation. For independent data (Methods) the plots show the binding affinities as reported in the respective papers. Red shaded region, 95% confidence interval for Pearson’s correlation. 
g, Standard equilibrium thermodynamics equations demonstrate that the logarithm of the K_d values of the TF–DNA complex is linearly proportional to the logarithm of the TF–DNA complex fluorescence signal, under certain conditions in which the TF concentration and the free DNA concentration are in excess compared to the concentration of the bound complex (and those remain constant during the reaction). h, Similar to g, for cases in which the DNA-bound species is a dimer.
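The relationship described in panels g and h can be sketched as follows. This is a minimal derivation, assuming a single-step binding equilibrium and a fluorescence signal F that is proportional to the concentration of the bound complex; the symbol F, the proportionality constant c, and the single-step dimer model are notation and assumptions introduced here for illustration, not taken from the figure.

```latex
% Monomer binding: TF + DNA <=> TF.DNA
K_d = \frac{[\mathrm{TF}]\,[\mathrm{DNA}]}{[\mathrm{TF{\cdot}DNA}]},
\qquad F = c\,[\mathrm{TF{\cdot}DNA}]
\;\Longrightarrow\;
\log F = \log\!\big(c\,[\mathrm{TF}]\,[\mathrm{DNA}]\big) - \log K_d

% Dimer binding (assumed single step): 2 TF + DNA <=> TF_2.DNA
K_d = \frac{[\mathrm{TF}]^{2}\,[\mathrm{DNA}]}{[\mathrm{TF_2{\cdot}DNA}]},
\qquad F = c\,[\mathrm{TF_2{\cdot}DNA}]
\;\Longrightarrow\;
\log F = \log\!\big(c\,[\mathrm{TF}]^{2}\,[\mathrm{DNA}]\big) - \log K_d
```

When the free TF and free DNA concentrations are in excess and remain effectively constant, the first term on the right is a constant, so log F varies linearly with log K_d in both cases.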

Extended Data Fig. 4 Comparing the effects of mutations versus mismatches on TF binding.

a, The magnitude of the energetic effects of mutations (light colours) and mismatches (dark colours) is similar. The effects were computed for all 7 proteins with available calibration data in our study, and for a total of 12 DNA sites (Methods). The effects of mismatches were calculated relative to the two closest Watson–Crick sequences (for example, for a G-T mismatch the closest Watson–Crick base pairs are G-C and A-T; the mismatch plots include both ΔΔG(G-C > G-T) and ΔΔG(A-T > G-T)). b, Mismatches and their corresponding mutations have different, even opposite effects on TF binding. Each mutation is compared to the two closest mismatches (for example, G-C > A-T is compared to both G-C > A-C and G-C > G-T). Top left quadrant, mutations increase binding, mismatches decrease binding. Top right quadrant, both mutations and mismatches decrease binding. Bottom left quadrant, both mutations and mismatches increase binding. Bottom right quadrant, mutations decrease binding, mismatches increase binding. The x axis and y axis show calibrated binding measurements computed from the median SaMBA signal intensities (over n = 10 replicate spots). c, Comparing the effect of mutations versus the cumulative effects of the two closest mismatches. Points close to the diagonal correspond to cases in which the effect of the mutation is approximately equal (within experimental noise) to the sum of the effects of the two mismatches. Points above the diagonal correspond to cases in which Watson–Crick mutations have either a more beneficial or a less detrimental effect on TF binding compared to the cumulative effect of the two mismatches. Points below the diagonal correspond to cases in which Watson–Crick mutations have either a less beneficial or a more detrimental effect on TF binding compared to the cumulative effect of the two mismatches. The x axis and y axis show calibrated binding measurements computed from the median SaMBA signal intensities (over n = 10 replicate spots). 
Please see Supplementary Table 4 for the raw binding data used to compute the measurements shown in this figure.
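The pairing rules used in panels a to c (which Watson–Crick pairs are "closest" to a mismatch, and which mismatches are "closest" to a mutation) can be sketched in a few lines. This is an illustrative sketch only: the function names and the "X-Y" string notation are assumptions made here, and the sign convention of the additivity comparison is left to the caller because the legend does not fix it.

```python
# Watson–Crick complement of each base.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def closest_watson_crick(mismatch):
    """Return the two Watson–Crick pairs that each keep one base of the mismatch."""
    top, bottom = mismatch.split("-")
    # Keep the first base and pair it with its complement, then do the
    # same for the second base.
    return [f"{top}-{COMPLEMENT[top]}", f"{COMPLEMENT[bottom]}-{bottom}"]


def closest_mismatches(wc_from, wc_to):
    """Return the two mismatches that lie between two Watson–Crick pairs."""
    from_top, from_bottom = wc_from.split("-")
    to_top, to_bottom = wc_to.split("-")
    # Change one strand at a time: either only the first base or only
    # the second base is replaced.
    return [f"{to_top}-{from_bottom}", f"{from_top}-{to_bottom}"]


def deviation_from_additivity(ddg_mutation, ddg_mismatch_1, ddg_mismatch_2):
    """Difference between a mutation's effect and the summed mismatch effects."""
    return ddg_mutation - (ddg_mismatch_1 + ddg_mismatch_2)


print(closest_watson_crick("G-T"))        # ['G-C', 'A-T']
print(closest_mismatches("G-C", "A-T"))   # ['A-C', 'G-T']
```

Both printed results match the examples given in the legend: G-T is compared against G-C and A-T, and the mutation G-C > A-T is compared against G-C > A-C and G-C > G-T.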

Extended Data Fig. 5 The effects of mismatches on ETS1–DNA binding.

a, SaMBA profile for an ETS1-binding site, highlighting the G-A mismatch at position 6, which shows the largest increase in binding affinity. b, Distortions. In the bound ETS1–DNA complex (PDB ID: 1K79), the positions at which the recognition helix is inserted into the DNA major groove are significantly distorted, with bending (β_h = 23°) towards the major groove, local unwinding (ζ_h = 23°), and minor groove widening. Position 6, the middle position of the GGA core binding region, is highlighted to show the expanded C1′–C1′ distance. The G-A mismatch at this position mimics the C1′–C1′ distance of the bound DNA. Violin plots of the MD simulation data show that the G-A mismatch in anti-anti configuration also mimics the minor groove width of the bound G-C. c, Base readout. According to MD simulation results, G-A (anti/anti) and G-T mismatches increase the overall number of hydrogen bonds and the buried surface area at the ETS1-DNA interface, compared to the Watson–Crick G-C pair (Methods). d, ETS1–DNA interface in the GGAA core binding region. Contacting residues in the recognition helix are shown in magenta. Direct hydrogen bond contacts with the bases are highlighted; such contacts occur only at the GGA bases, on the ‘lower’ strand of the shown Watson–Crick DNA site. e, f, Representative snapshots of different hydrogen bond interactions between Arg391 and the base pair at position 6, from MD simulations. The G-T mismatch shows an additional hydrogen bond compared to G-C and G-A. g, In a non-specific site where G-A increases the affinity to reach the specific range, MD simulations show that the G-A mismatch forms hydrogen bonds similar to those formed in specific sites (shown in panel f). h, Non-native hydrogen bond at position 4, owing to the G-A mismatch at position 6 in the specific ETS1-binding site. 
i, j, Non-native hydrogen bond interactions created in a non-specific site (g) at positions neighbouring the positions of the mismatch, either with the base (i) or the backbone (j). k, SaMBA profiles for additional ETS1-binding sites. We measured the effect of mismatches in four ETS1-binding sites in addition to the one shown in a. Although the profiles for different sites are quantitatively different and dependent on the flanks, the trends for increased binding due to mismatches are similar. For all cases, the A-G mismatch at position 6 significantly increases ETS1 binding. l, Structural features at the mismatch position. Violin plots show the local twisting and kinking at position 6, and the minor and major groove width at position 5–6 of ETS1-bound DNA, as well as the naked DNA for different base pairs, according to MD.

Extended Data Fig. 6 The effects of mismatches on p53–DNA binding.

a, Mismatch profile for p53 reveals that increased TF binding occurs only due to C-T and T-T mismatches (red rectangle) at the same positions at which the Hoogsteen conformation is observed in p53–DNA complexes (PDB 3KZ8). b, MD simulation-based violin plots of C1′–C1′ distance at position 2, as well as the minor groove width (at position 0–1), for p53-bound DNA and naked DNA (wild-type and mismatched) reveal that the minor groove for C-T and T-T mismatches is more similar to the bound form compared to the free A-T base pair. The plot also shows that the G-T mismatch, which reduces p53 binding, does not mimic these distortions seen in the bound DNA. Notably, a narrower minor groove at position 0–1 was previously suggested to be important for the interaction of the DNA with the Arg248 residue in p53²⁷. c, d, NMR validation showing that T-T and C-T mimic the reduced C1′–C1′ distance observed in p53-bound DNA²⁷,²⁸. c, Chemical shift overlays of the 2D HSQC NMR spectra of the C1′–H1′, C4′–H4′ and C3′–H3′ regions for A6-DNA m¹A in which the m¹AT base pair is in the Hoogsteen conformation³⁰ (left, green), A6-DNA TT (middle, blue) and A6-DNA CT (right, red) with unmodified A6-DNA (black) at pH 6.9, 25 °C. d, Bar plots of the individual chemical shift differences (relative to unmodified A6-DNA) of the C1′, C3′ and C4′ carbon atoms of A6-DNA m¹A (top), A6-DNA TT (middle) and A6-DNA CT (bottom). Similarity between the Hoogsteen-induced chemical shift differences and mismatch shifts (relative to the Watson–Crick wild-type) is observed for both T-T and C-T. e, Additional comparisons of global features (twisting angle, local kinking, and kinking direction at position 2 and major groove width at position 0–1) reveal additional mimicry between the C-T mismatch and the Hoogsteen conformation in the local twisting angle. f, Pyrimidine–pyrimidine mismatches (C-T, T-C, T-T and C-C) in all four positions in which Hoogsteen conformation is observed (n = 16 mismatches total) increased p53 binding. 
However, all other mismatches at these positions (n = 32 mismatches total) decreased p53 binding, or had non-significant effects. ΔΔG represents the differences between the p53-DNA binding energy of each mismatch versus the wild-type sequence, and was estimated using the calibration with EMSA measurements (Methods). Box plots show median signals over all mismatches, with the bottom and top edges of each box indicating the 25th and 75th percentiles, respectively. The whiskers extend to the most-extreme data points that are not considered outliers. g, Number of p53-DNA hydrogen bonds and buried surface area at p53-DNA interface, obtained from MD simulations, failed to explain the observed increase in p53 binding, consistent with the prepaying mechanism being a key determinant for binding in this case. h, DNA hairpin with four mismatches (in the four positions for which the Hoogsteen conformation was previously observed), strongly binds p53: 3–6 k_BT stronger (depending on the data used for validation, Supplementary Tables 3, 4) compared to the highest-affinity p53-binding sites previously reported²². Notably, we expect the difference in binding affinity to other genomic p53 sites (ΔΔG) to be even larger, as most p53-binding sites in the genome are of lower binding affinities²².
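To give a sense of scale for the 3–6 k_BT difference quoted in panel h, the sketch below converts a binding free energy difference expressed in units of k_BT into a fold change in K_d. It assumes the standard relation ΔΔG = k_BT · ln(K_d,reference / K_d,variant); the function names and the sign convention (positive values meaning tighter binding of the variant) are choices made here for illustration, not definitions taken from the paper.

```python
import math


def fold_change_in_kd(ddg_in_kbt):
    """Fold change in K_d for a free energy difference given in units of k_BT."""
    return math.exp(ddg_in_kbt)


def ddg_in_kbt_from_kd(kd_reference, kd_variant):
    """Free energy difference (in k_BT) between a reference and a variant site.

    Positive values mean the variant binds more tightly (lower K_d),
    under the sign convention assumed in this sketch.
    """
    return math.log(kd_reference / kd_variant)


for ddg in (3, 6):
    print(f"{ddg} k_BT -> {fold_change_in_kd(ddg):.1f}-fold change in K_d")
```

Under these assumptions, a difference of 3 k_BT corresponds to roughly a 20-fold change in K_d, and 6 k_BT to roughly a 400-fold change.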

Extended Data Fig. 7 The effects of mismatches on TBP–DNA binding.

a, Mismatch profile for TBP. b, Correlations between TBP-binding levels and DNA duplex stability were computed over all 16 base-pair variants at positions 1 to 8 in the TBP site. Bar plots (left) represent the squared Pearson correlation coefficient (R²) at each position. For the only three positions with significant correlations (positions 2, 7, and 8) the scatter plot correlation is presented (right), with binding signals representing medians over 9 replicate spots. Blue shaded regions, 95% confidence interval for Pearson’s correlation. The sequences of the Watson–Crick and mismatched base pairs are shown in each scatter plot (for example, for position 8, GC stands for the wild-type G-C base-pair in bold in the TBP site TATAAAAG, CC stands for C-C at this position, and so on). These high correlations are observed only in the unstacked base step positions. c, Left, structural overlays between TBP–DNA complexes with DNA mismatches (TBP-AC, orange; TBP-CC(2), cyan; TBP-CC(1a), purple; TBP-CC(1b), pink) and their corresponding Watson–Crick counterparts with single base substitutions (1QNE, green; 6NJQ, yellow). The base steps at position 7–8 are magnified and highlighted in black boxes. The structural overlay of the mismatch and the Watson–Crick base pairs are shown below each box, with their DNA sequences. Right, overlays of protein-DNA interfaces of TBP-DNA complexes, comparing mismatched and Watson–Crick sites. Four phenylalanine residues, as well as other amino acids that are discussed in the Supplementary Discussion are highlighted with dashed circles. d, Comparisons of the effects of Watson–Crick mutations versus the cumulative effects of the two closest mismatches, shown for the mismatches with new crystal structures. In all three cases the mismatches have significantly larger effects than the Watson–Crick mutations (see also Methods and Supplementary Table 4). ΔΔG values for TBP_site_1 in Supplementary Table 4 were used in these comparisons. 
e, Example of a Watson–Crick mutation that has a similar effect (within experimental error, Supplementary Table 4) to the sum of the two closest mismatches. ΔΔG values for TBP_site_1 in Supplementary Table 4 were used in these comparisons.
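The per-position analysis in panel b (a squared Pearson correlation computed over all 16 base-pair variants at one position) can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, and the binding and stability values must come from the actual data (for example Supplementary Table 1 and the melting energies); none are supplied here.

```python
from itertools import product

BASES = "ACGT"
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def base_pair_variants():
    """All 16 base-pair variants at one position (4 Watson–Crick, 12 mismatches)."""
    return [f"{top}-{bottom}" for top, bottom in product(BASES, repeat=2)]


def is_watson_crick(base_pair):
    """True if the 'X-Y' pair is one of the four Watson–Crick pairs."""
    top, bottom = base_pair.split("-")
    return (top, bottom) in WATSON_CRICK


def squared_pearson(x, y):
    """Squared Pearson correlation coefficient (R^2) between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


variants = base_pair_variants()
mismatches = [bp for bp in variants if not is_watson_crick(bp)]
print(len(variants), len(mismatches))
```

In use, `squared_pearson` would be called once per position with two lists of 16 values: the median binding signal and the duplex stability for each variant at that position.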

Extended Data Fig. 8 Potential mechanisms for mismatch-enhanced TF binding.

a, TF–DNA complex formation involves creation of intermolecular interactions, as well as DNA conformational changes. Thermodynamically, these processes can be separated into two independent events, and thus an increase in binding affinity could stem from additional interactions (decrease of ΔG_interaction), and/or a reduction in the penalty to change the DNA conformation (decrease of ΔG_penalty). b, A reduction in the energetic penalty to distort the DNA (ΔG_penalty) could originate from DNA conformational changes owing to the mismatch, that is, before binding (for example, p53 and TBP, as described in the main text). c, A reduction in the energetic penalty for DNA distortion (ΔG_penalty) could also originate from changes in the bound DNA. For example, MD simulations of the DNA conformations in free form and in the MYC–DNA complex (for the wild-type A-T and the mismatch G-T) suggest that the reduced penalty in this case is primarily due to changes in the mismatched bound form. The extent of overlap of the kinking direction (γ_h) obtained from the MD simulations was: Ω = 0.34 (wild type) versus Ω = 0.15 (G-T mismatch), and was analysed using a revised Jensen–Shannon divergence score (Ω)⁸¹. Representative structures of the DNA sites are shown for wild-type free (pink), wild-type bound (orange), G-T free (green) and G-T bound (blue). The MYC–MAX heterodimer is shown as a grey surface. d, Mismatches could lead to the formation of non-native interactions such as hydrogen bonds (left), electrostatic potential and shape sensing (centre), and water-mediated interactions (right). Red empty arrows point to the locations of the change. These changes could occur directly at the position of the mismatched base (for example, the G-T mismatch for ETS1), as well as at the positions of other bases and/or the backbone, owing to non-native structures (for example, the G-A mismatch for ETS1). 
Notably, mismatches not only alter the potential interacting chemical groups of the replaced base, but can also alter the relative orientation of the interacting bases (as observed for the T in the wobble geometry on the left).
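Panel c compares distributions of the kinking direction (γ_h) between free and bound DNA using a revised Jensen–Shannon divergence score (Ω). The revised score itself is not defined in this legend, so the sketch below shows only the standard Jensen–Shannon divergence between two binned angle distributions, as a rough illustration of the general idea. The bin count, the function names, and the use of base-2 logarithms are assumptions made here, and the output should not be read as the paper's Ω.

```python
import math


def angle_histogram(angles_deg, n_bins=36):
    """Normalized histogram of angles (in degrees) over the range 0 to 360."""
    counts = [0] * n_bins
    width = 360.0 / n_bins
    for angle in angles_deg:
        # Wrap the angle into [0, 360) and guard against rounding at the edge.
        index = min(int((angle % 360.0) / width), n_bins - 1)
        counts[index] += 1
    total = sum(counts)
    return [count / total for count in counts]


def jensen_shannon_divergence(p, q):
    """Standard Jensen–Shannon divergence (base 2) between two distributions.

    Returns 0 for identical distributions and 1 for non-overlapping ones.
    """
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


print(jensen_shannon_divergence([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

In practice, `angle_histogram` would be applied to the γ_h values sampled from the free and bound MD trajectories, and the two resulting histograms passed to `jensen_shannon_divergence`.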

Extended Data Fig. 9 DNA mismatches in the cell.

a, Mismatches can result from misincorporation of bases during DNA replication by DNA polymerases. The average rate at which replication errors are generated and escape proofreading is low in healthy cells (around 10⁻⁹), but high in certain cancers and cells with Pol-ε or Pol-δ mutations. Even in healthy cells, the rates of generation of individual mismatches vary by more than a million-fold¹⁷ depending on the sequence context and the type of mismatch. b, Mismatches result from genetic recombination. A characteristic feature of homologous recombination is the exchange of DNA strands, which results in the formation of heteroduplex DNA. Mismatches can result from genetic recombination when the parental chromosomes contain non-identical sequences. In addition, mismatches can arise during DNA synthesis associated with recombination repair. The repair of these mismatches might be less efficient, as it was previously shown⁸² that there is a strong temporal coupling between DNA replication and mismatch repair but a lack of temporal coupling for heteroduplex rejection. c, Spontaneous deamination is common and estimated to occur 100–500 times per cell per day in humans⁸³. G-T mismatches generated by deamination of 5-methylcytosine (5-meC) are not repaired by the DNA mismatch repair pathway and have considerably lower repair efficiency⁸³. The high rate of 5-meC deamination, combined with the relatively slow repair of these mismatches in mammalian cells, contributes to making 5-meC a preferential target for point mutations (about 40-fold) compared to other nucleotides in the genome⁸⁴, and one of the major sources of the frequent C-to-T mutations observed in human cells¹⁸. d, Transcription factors bound to mismatched DNA could interfere with Pol-δ strand displacement activity. Left, DNA synthesized by non-proofreading mismatch-prone Pol-α is normally displaced by the proofreading non-error-prone Pol-δ. 
Right, it was previously shown¹⁰ that increased mutation signals arise from regions synthesized by Pol-α that contain TF-binding sites. This study suggested that mismatched DNA synthesized by non-proofreading Pol-α is rapidly bound by TFs that act as barriers to Pol-δ displacement of Pol-α-synthesized DNA, resulting in locally increased mutation rates in subsequent rounds of replication.

Extended Data Table 1 Data collection and refinement statistics for TBP–DNA mismatch structures

Supplementary information

Supplementary Information

This file contains Supplementary Methods, a Supplementary Discussion and Supplementary References.

Reporting Summary

Supplementary Figure 1

Original source images for all EMSA data reported in this study.

Supplementary Table 1 SaMBA data. This table contains the raw and processed SaMBA data for the 22 TFs.

41586_2020_2843_MOESM5_ESM.xlsx

Supplementary Table 2 Validation of the effect of mismatches in non-specific DNA. This table contains the raw and processed SaMBA data used to compare the Ets1 binding level at mismatched non-specific sites versus random DNA sites and specific sites from NMR and X-ray crystal structures of Ets1-DNA complexes.

41586_2020_2843_MOESM6_ESM.xlsx

Supplementary Table 3 Calibration of SaMBA data. This table contains Kd data from EMSA, FA, MITOMI and SPR experiments, used to calibrate our high-throughput SaMBA data.

41586_2020_2843_MOESM7_ESM.xlsx

Supplementary Table 4 Calibrated SaMBA data used to compare the effects of mismatches versus mutations on TF binding. This table contains the raw and processed binding data for all mismatches and mutations in 12 TF binding sites for the 7 TFs with calibration data in our study, as well as statistics of the comparisons between the effects of mismatches versus mutations.

41586_2020_2843_MOESM8_ESM.xlsx

Supplementary Table 5 Structural distortions in TF-bound DNA. This table shows the deviations from the B-DNA envelope for DNA structural parameters at each base pair position in 12 TF-DNA complexes.

41586_2020_2843_MOESM9_ESM.xlsx

Supplementary Table 6 Results of structural mimicry analysis. (a) X-ray structures of protein-DNA complexes selected for structural analyses of mismatches that increase TF binding. (b) Distortions at DNA positions where mismatches increase TF binding affinity. (c) Distortions of DNA structural parameters of mismatches relative to Watson-Crick base pairs. (d) Mismatches that increase TF binding affinity and exhibit geometries similar to distorted base pairs in TF-bound DNA.

41586_2020_2843_MOESM10_ESM.xlsx

Supplementary Table 7 Analysis of TF-DNA hydrogen bonding and buried surface area in MD simulations of TF-DNA complexes with and without mismatches. Results shown are from MD simulations. DNA sequences were derived from the sequences used in SaMBA.

41586_2020_2843_MOESM11_ESM.xlsx

Supplementary Table 8 Defining the B-DNA envelope. Summary statistics of base pair parameters (mean, maximum value, minimum value, and standard deviation) for base pairs in B-DNA (as well as TF-bound DNA), obtained from a comprehensive survey of structures deposited in PDB.

41586_2020_2843_MOESM12_ESM.xlsx

Supplementary Table 9 Structural survey of DNA mismatches. All possible single mismatches (excluding modified bases) surrounded by at least two canonical Watson-Crick bps on both sides, from PDB structures. The data was used to survey the DNA mismatch structure and geometry.

41586_2020_2843_MOESM13_ESM.xlsx

Supplementary Table 10 SaMBA hybridization signal. Fluorescent signal for DNA duplexes expected to contain labeled and unlabeled probes, from the hybridization of 12 sequences on a DNA chip (see also Figure S3). For the sequences with an unlabeled complementary strand (sequences 2, 4, 6, 8, 10, 12), the signal is several orders of magnitude lower than for the sequences with a labeled complementary strand (sequences 1, 3, 5, 7, 9, 11).

41586_2020_2843_MOESM14_ESM.xlsx

Supplementary Table 11 NMR results confirming that T-T and C-T mismatches mimic Hoogsteen A-T geometry. This table includes the chemical shift differences in the sugar C1'/C3'/C4' carbons for T-T and C-T mismatches versus a locked Hoogsteen conformation (using N1-methyladenosine, or m1A), relative to the Watson-Crick base-paired duplex.

About this article

Cite this article

Afek, A., Shi, H., Rangadurai, A. et al. DNA mismatches reveal conformational penalties in protein–DNA recognition. Nature 587, 291–296 (2020). https://doi.org/10.1038/s41586-020-2843-2

Received: 05 January 2019
Accepted: 17 September 2020
Published: 21 October 2020
Issue Date: 12 November 2020
DOI: https://doi.org/10.1038/s41586-020-2843-2
