Abstract
Non-B DNA structures formed by repetitive sequence motifs are known instigators of mutagenesis in experimental systems. Analyzing this phenomenon computationally in the human genome requires careful disentangling of intrinsic confounding factors, including overlapping and interrupted motifs and recurrent sequencing errors. Here, we show that accounting for these factors eliminates all signals of repeat-induced mutagenesis that extend beyond the motif boundary, and eliminates or dramatically shrinks the magnitude of mutagenesis within some motifs, contradicting previous reports. Mutagenesis not attributable to artifacts revealed several biological mechanisms. Polymerase slippage generates frequent indels within every variety of short tandem repeat motif, implicating slipped-strand structures. Interruption-correcting single nucleotide variants within short tandem repeats may originate from error-prone polymerases. Secondary-structure formation promotes single nucleotide variants within palindromic repeats and duplications within direct repeats. G-quadruplex motifs cause recurrent sequencing errors, whereas mutagenesis at Z-DNAs is conspicuously absent.
Data availability
The datasets analyzed during this study are freely available from the gnomAD Consortium (https://gnomad.broadinstitute.org/downloads), the UCSC Genome Browser (https://genome.ucsc.edu), the non-B Database (https://nonb-abcc.ncifcrf.gov/apps/nBMST/default/) and other studies as cited. Instructions for accessing specific datasets are detailed in the code repository (see Code availability).
Code availability
The code used to perform the analysis in this study is available in a GitHub repository (https://github.com/ryanmcggg/nonb_motifs). Software and package version numbers are listed in the repository.
References
Khristich, A. N. & Mirkin, S. M. On the wrong DNA track: molecular mechanisms of repeat-mediated genome instability. J. Biol. Chem. 295, 4134–4170 (2020).
Du, X. et al. Potential non-B DNA regions in the human genome are associated with higher rates of nucleotide mutation and expression variation. Nucleic Acids Res. 42, 12367–12379 (2014).
Zou, X. et al. Short inverted repeats contribute to localized mutability in human somatic cells. Nucleic Acids Res. 45, 11213–11221 (2017).
Georgakopoulos-Soares, I. et al. Noncanonical secondary structures arising from non-B DNA motifs are determinants of mutagenesis. Genome Res. 28, 1264–1271 (2018).
Guiblet, W. M. et al. Non-B DNA: a major contributor to small- and large-scale variation in nucleotide substitution frequencies across the genome. Nucleic Acids Res. 49, 1497–1516 (2021).
Murat, P., Guilbaud, G. & Sale, J. E. DNA polymerase stalling at structured DNA constrains the expansion of short tandem repeats. Genome Biol. 21, 209 (2020).
Carlson, J. et al. Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nat. Commun. 9, 3753 (2018).
Tiao, G. & Goodrich, J. gnomAD v3.1 new content, methods, annotations, and data availability. gnomAD browser https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/ (2020).
Chambers, V. S. et al. High-throughput sequencing of DNA G-quadruplex structures in the human genome. Nat. Biotechnol. 33, 877–881 (2015).
Guiblet, W. M. et al. Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate. Genome Res. 28, 1767–1778 (2018).
Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).
Muyas, F. et al. Allele balance bias identifies systematic genotyping errors and false disease associations. Hum. Mutat. 40, 115–126 (2019).
Gadgil, R. Y. et al. Replication stress at microsatellites causes DNA double-strand breaks and break-induced replication. J. Biol. Chem. 295, 15378–15397 (2020).
Baptiste, B. A. et al. Mature microsatellites: mechanisms underlying dinucleotide microsatellite mutational biases in human cells. G3 (Bethesda). 3, 451–463 (2013).
Wang, Q. et al. Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes. Nat. Commun. 11, 2539 (2020).
Kockler, Z. W., Osia, B., Lee, R., Musmaker, K. & Malkova, A. Repair of DNA breaks by break-induced replication. Annu. Rev. Biochem. 90, 165–191 (2021).
Seplyarskiy, V. B. et al. Population sequencing data reveal a compendium of mutational processes in the human germ line. Science 373, 1030–1035 (2021).
Wang, G. & Vasquez, K. M. Z-DNA, an active element in the genome. Front. Biosci. 12, 4424–4438 (2007).
Brázda, V. et al. Cruciform structures are a common DNA feature important for regulating biological processes. BMC Mol. Biol. 12, 33 (2011).
Quilez, J. et al. Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res. 44, 3750–3762 (2016).
Guiblet, W. M. et al. Selection and thermostability suggest G-quadruplexes are novel functional elements of the human genome. Genome Res. 31, 1136–1149 (2021).
Fotsing, S. F. et al. The impact of short tandem repeat variation on gene expression. Nat. Genet. 51, 1652–1659 (2019).
Jakubosky, D. et al. Properties of structural variants and short tandem repeats associated with gene expression and complex traits. Nat. Commun. 11, 2927 (2020).
Kim, J. C. & Mirkin, S. M. The balancing act of DNA repeat expansions. Curr. Opin. Genet Dev. 23, 280–288 (2013).
Ananda, G. et al. Microsatellite interruptions stabilize primate genomes and exist as population-specific single nucleotide polymorphisms within individual human genomes. PLoS Genet. 10, e1004498 (2014).
Bacolla, A. et al. Local DNA dynamics shape mutational patterns of mononucleotide repeats in human genomes. Nucleic Acids Res. 43, 5065–5080 (2015).
Pfeiffer, F. et al. Systematic evaluation of error rates and causes in short samples in next-generation sequencing. Sci. Rep. 8, 10950 (2018).
Mukherjee, P., Lahiri, I. & Pata, J. D. Human polymerase kappa uses a template-slippage deletion mechanism, but can realign the slipped strands to favour base substitution mutations over deletions. Nucleic Acids Res. 41, 5024–5035 (2013).
McCulloch, S. D. & Kunkel, T. A. The fidelity of DNA synthesis by eukaryotic replicative and translesion synthesis polymerases. Cell Res. 18, 148–161 (2008).
Lovett, S. T. Encoded errors: mutations and rearrangements mediated by misalignment at repetitive DNA sequences. Mol. Microbiol. 52, 1243–1253 (2004).
Tirman, S. et al. Temporally distinct post-replicative repair mechanisms fill PRIMPOL-dependent ssDNA gaps in human cells. Mol. Cell. 81, 4026–4040.e8 (2021).
Stone, J. E., Lujan, S. A. & Kunkel, T. A. DNA polymerase zeta generates clustered mutations during bypass of endogenous DNA lesions in Saccharomyces cerevisiae. Environ. Mol. Mutagen. 53, 777–786 (2012).
Seplyarskiy, V. B., Bazykin, G. A. & Soldatov, R. A. Polymerase ζ activity is linked to replication timing in humans: evidence from mutational signatures. Mol. Biol. Evol. 32, 3158–3172 (2015).
Lovett, S. T. Template-switching during replication fork repair in bacteria. DNA Repair (Amst.). 56, 118–128 (2017).
Löytynoja, A. & Goldman, N. Short template switch events explain mutation clusters in the human genome. Genome Res. 27, 1039–1049 (2017).
Walker, C. R., Scally, A., De Maio, N. & Goldman, N. Short-range template switching in great ape genomes explored using pair hidden Markov models. PLoS Genet. 17, e1009221 (2021).
Bacolla, A., Tainer, J. A., Vasquez, K. M. & Cooper, D. N. Translocation and deletion breakpoints in cancer genomes are associated with potential non-B DNA-forming sequences. Nucleic Acids Res. 44, 5673–5688 (2016).
McKinney, J. A. et al. Distinct DNA repair pathways cause genomic instability at alternative DNA structures. Nat. Commun. 11, 236 (2020).
Meng, Y. et al. Z-DNA is remodelled by ZBTB43 in prospermatogonia to safeguard the germline genome and epigenome. Nat. Cell Biol. 24, 1141–1153 (2022).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Goldmann, J. M. et al. Parent-of-origin-specific signatures of de novo mutations. Nat. Genet. 48, 935–939 (2016).
Yuen, R. K. et al. Genome-wide characteristics of de novo mutations in autism. NPJ Genom. Med. 1, 16027 (2016).
Jónsson, H. et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 549, 519–522 (2017).
An, J. Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).
Halldorsson, B. V. et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, eaau1043 (2019).
Sasani, T. A. et al. Large, three-generation human families reveal post-zygotic mosaicism and variability in germline mutation accumulation. eLife 8, e46922 (2019).
Jónsson, H. et al. Differences between germline genomes of monozygotic twins. Nat. Genet. 53, 27–34 (2021).
Goes, F. S. et al. De novo variation in bipolar disorder. Mol. Psychiatry 26, 4127–4136 (2021).
Cer, R. Z. et al. Non-B DB v2.0: a database of predicted non-B DNA-forming motifs and its associated tools. Nucleic Acids Res. 41, D94–D100 (2013).
Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
Marsico, G. et al. Whole genome experimental maps of DNA G-quadruplexes in multiple species. Nucleic Acids Res. 47, 3862–3874 (2019).
Sung, W. et al. Evolution of the insertion-deletion mutation rate across the tree of life. G3 (Bethesda). 6, 2583–2591 (2016).
Acknowledgements
We thank V. Seplyarskiy and E. Koch for their important contributions. The project has been funded by National Institutes of Health grants R35-GM127131, R01-MH101244, U01-HG012009 and R01-HG010372 (awarded to S.S.).
Author information
Contributions
R.M. conceived the study, performed the analysis and wrote the manuscript. S.S. supervised the project, discussed results and consulted on the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Structural & Molecular Biology thanks Martin Taylor and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Carolina Perdigoto and Dimitris Typas were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Extended data
Extended Data Fig. 1 Caveats of the ‘Non-B Database’.
a, Overlapping motifs in Non-B DB. X-axis indicates motif category. Y-axis indicates the portion of motifs that do not overlap other categories (‘unique’) and the portion that overlap additional categories (indicated by color and abbreviation). b, Overlapping flanking regions in Non-B DB. For each motif in Non-B DB, the distance to the nearest repeat (including transposable elements from RepeatMasker where indicated) on either side was measured, and the smaller of the upstream and downstream values was taken. Overlapping motifs have a distance of 0. X-axis represents the percentile for the range of distances, and the Y-axis indicates the distance in nucleotides.
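The distance measurement described for panel b can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `gap`, `nearest_repeat_distance` and `percentile` are hypothetical, intervals are assumed to be 0-based half-open (start, end) pairs on a single chromosome, and the percentile uses a simple nearest-rank rule that may differ from the one used for the figure.

```python
import math

def gap(a, b):
    """Nucleotides separating two half-open intervals (0 if they overlap or abut)."""
    return max(0, b[0] - a[1], a[0] - b[1])

def nearest_repeat_distance(motif, repeats):
    """Smaller of the upstream and downstream distances to the nearest repeat.

    `repeats` must not contain `motif` itself. Returns None if it is empty.
    """
    # Repeats starting before the motif count as upstream, the rest as downstream.
    upstream = [gap(motif, r) for r in repeats if r[0] < motif[0]]
    downstream = [gap(motif, r) for r in repeats if r[0] >= motif[0]]
    candidates = [min(side) for side in (upstream, downstream) if side]
    return min(candidates) if candidates else None

def percentile(values, p):
    """Nearest-rank percentile (assumed rule) of a non-empty list, for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Toy example: the first two motifs overlap, the third is isolated.
motifs = [(100, 120), (118, 130), (200, 210)]
distances = [
    nearest_repeat_distance(m, [r for r in motifs if r != m]) for m in motifs
]
```

In this toy example the two overlapping motifs receive a distance of 0, consistent with the convention stated in the caption.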
Extended Data Fig. 2 Mutagenesis flanking non-B motifs.
X-axes are coordinates relative to the central motif (position 0 encompasses the entire repeat). Y-axes are relative mutation frequency (compared to the gnomAD average, normalized by trinucleotide mutation type). Data are presented as mean values, with 95% binomial confidence intervals indicated in transparency. Confidence values derived from n = 76,156 individuals and the dynamic loci count. Blue lines: no sequencing quality filters; green and yellow lines: increasingly stringent filters (‘pass’ indicates gnomAD’s passing quality filter based on a VQSLOD score of −2.774). Left: STR motifs (combined with their respective reverse complements). Right: other non-B motifs. STRs and symmetrical motifs exclude the shortest 80% by motif length. G4 motifs (strand-specific) detected under K+ conditions. Z-DNA motifs detected using the standard definition. See Supplementary Fig. 2a–c and Methods for more details.
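One way to compute a relative mutation frequency normalized by trinucleotide context, with a 95% binomial confidence interval, is sketched below. Several choices here are assumptions rather than the authors' documented method: the Wilson score interval stands in for the unspecified binomial interval, the number of sites is used as the number of trials, and the names `wilson_interval`, `relative_frequency`, `sites_by_context` and `genome_rate_by_context` are hypothetical.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def relative_frequency(observed, sites_by_context, genome_rate_by_context):
    """Observed mutation rate divided by the context-matched expected rate.

    sites_by_context: number of sites in the region for each trinucleotide context.
    genome_rate_by_context: genome-wide mutation rate per site for each context.
    Returns (relative frequency, lower bound, upper bound).
    """
    n = sum(sites_by_context.values())
    expected = sum(
        count * genome_rate_by_context[ctx]
        for ctx, count in sites_by_context.items()
    )
    expected_rate = expected / n
    low, high = wilson_interval(observed, n)
    # Scale the proportion and its interval by the expected rate.
    return (observed / n) / expected_rate, low / expected_rate, high / expected_rate

# Toy numbers, not taken from the study.
sites = {"ACG": 1000, "TTT": 1000}
rates = {"ACG": 0.10, "TTT": 0.02}
rel, low, high = relative_frequency(240, sites, rates)
```

A value of 1 on this scale would mean the region mutates at the rate expected from its trinucleotide composition alone.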
Extended Data Fig. 3 Absolute mutation frequency contributing to STR interruption reversions.
Mutation frequencies within STRs, represented as absolute frequencies (mutations per base). Frequency of interruption-perfecting SNVs (left), insertions within imperfect motifs (middle), and deletion of imperfections (right). X-axes are motif lengths. Y-axes are relative mutation frequency. Data presented as mean values, with error bars indicating 95% binomial confidence intervals. Confidence values derived from n = 76,156 individuals and the dynamic loci count. Blue: no sequencing quality filters, green and yellow: increasingly stringent quality filters. Motif sequence (combined with its reverse-complement) indicated at left of each row.
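To make the notion of an interruption-perfecting SNV concrete, the sketch below enumerates the single substitutions that would convert an imperfect STR tract into a perfect one. It is an illustrative assumption about how such variants could be identified, not the study's implementation. The helper names `is_perfect_repeat` and `perfecting_snvs` are hypothetical, and the tract is assumed to consist of A, C, G and T only.

```python
def is_perfect_repeat(seq, unit):
    """True if seq is an uninterrupted run of `unit` in any phase.

    A partial unit at either end of the tract is allowed.
    """
    k = len(unit)
    return any(
        all(seq[i] == unit[(i + phase) % k] for i in range(len(seq)))
        for phase in range(k)
    )

def perfecting_snvs(seq, unit):
    """List (position, ref, alt) substitutions that make an imperfect tract perfect."""
    if is_perfect_repeat(seq, unit):
        return []  # nothing to correct
    hits = []
    for i, ref in enumerate(seq):
        for alt in "ACGT":
            if alt == ref:
                continue
            mutated = seq[:i] + alt + seq[i + 1:]
            if is_perfect_repeat(mutated, unit):
                hits.append((i, ref, alt))
    return hits

# A (CA)n tract with one interruption: the T at index 6 replaces a C.
tract = "CACACATACACA"
candidates = perfecting_snvs(tract, "CA")
```

For this tract the only candidate is the T-to-C substitution at index 6, which restores a perfect (CA)6 repeat.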
Extended Data Fig. 4 Duplications within direct repeats.
a, Examples of duplications within direct repeats. Reference sequence shown. Blue text indicates location of repeat motifs. Blue highlight shows location of motif mismatches. Orange highlight indicates region duplicated in gnomAD. Orange text with blue highlight indicates that the duplication includes the interruption. b, Duplication position bias. Duplication start/end positions (blue) or positions flanking duplications (orange), categorized by their location (i.e., start of left motif, mismatch position in left motif, end of spacer, etc.). Frequency of the duplication at each position (observed) is divided by the portion of the motif for which the position accounts (expected). Represents n = 3,170 DR loci with spacer length <10 nt and containing a duplication >5 nt. c, Gap repair model explaining direct repeat duplications. The initial A-B-A pattern forms a slipped-strand structure. Post-replicative gap-filling (Polζ) fills in single-stranded loop regions, resulting in 4-way pseudo-Holliday junctions. Cleavage of the top strand allows the filled-in sequence to be ligated into the top strand. Either this can be repeated for the other loop, or replication of the top strand will produce a daughter cell with the A-B-A-B-A pattern.
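The observed/expected ratio described for panel b can be sketched as below. This is a simplified reading of the caption under stated assumptions: each position category is given a length in nucleotides, the expected share of breakpoints is taken to be proportional to that length, and the function name `position_enrichment` and the category labels are hypothetical.

```python
def position_enrichment(breakpoint_counts, category_lengths):
    """Observed share of breakpoints per category divided by its share of motif length.

    breakpoint_counts: duplication start/end counts per position category.
    category_lengths: nucleotides covered by each category across the motif.
    """
    total_count = sum(breakpoint_counts.values())
    total_length = sum(category_lengths.values())
    if total_count == 0 or total_length == 0:
        raise ValueError("counts and lengths must not sum to zero")
    return {
        cat: (breakpoint_counts.get(cat, 0) / total_count)
        / (length / total_length)
        for cat, length in category_lengths.items()
    }

# Toy numbers, not taken from the study.
counts = {"left_motif_start": 30, "left_motif_mismatch": 50, "spacer_end": 20}
lengths = {"left_motif_start": 1, "left_motif_mismatch": 1, "spacer_end": 2}
enrichment = position_enrichment(counts, lengths)
```

Ratios above 1 indicate categories where duplication boundaries fall more often than their share of the motif would predict.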
Extended Data Fig. 5 G4 motifs are prone to recurrent sequencing errors.
a, Insertions and b, deletions within G4 motifs. X-axis indicates positions within G-runs, or spacers between G-runs. Spacers of 1 nt in length are categorized separately. Y-axis is relative mutation frequency. Data presented as mean values, with error bars indicating 95% binomial confidence intervals. Confidence values derived from n = 76,156 individuals and the dynamic loci count. Blue: no sequencing quality filters, green and yellow: increasingly stringent filters. Arrows indicate magnitude and direction of change for individual trinucleotide mutation frequencies before and after sequencing quality filters. Mutations with large magnitude changes are highlighted in text.
Supplementary information
Supplementary Information
Supplementary Figs. 1–7.
About this article
Cite this article
McGinty, R.J., Sunyaev, S.R. Revisiting mutagenesis at non-B DNA motifs in the human genome. Nat Struct Mol Biol 30, 417–424 (2023). https://doi.org/10.1038/s41594-023-00936-6
DOI: https://doi.org/10.1038/s41594-023-00936-6