Autism spectrum disorder (ASD) is an early-onset developmental disorder characterized by deficits in communication and social interaction and restrictive or repetitive behaviours1,2. Family studies demonstrate that ASD has a substantial genetic basis with contributions both from inherited and de novo variants3,4. It has been estimated that de novo mutations may contribute to 30% of all simplex cases, in which only a single child is affected per family5. Tandem repeats (TRs), defined here as sequences of 1 to 20 base pairs in size repeated consecutively, comprise one of the major sources of de novo mutations in humans6. TR expansions are implicated in dozens of neurological and psychiatric disorders7. Yet, de novo TR mutations have not been characterized on a genome-wide scale, and their contribution to ASD remains unexplored. Here we develop new bioinformatics methods for identifying and prioritizing de novo TR mutations from sequencing data and perform a genome-wide characterization of de novo TR mutations in ASD-affected probands and unaffected siblings. We infer specific mutation events and their precise changes in repeat number, and primarily focus on more prevalent stepwise copy number changes rather than large expansions. Our results demonstrate a significant genome-wide excess of TR mutations in ASD probands. Mutations in probands tend to be larger, enriched in fetal brain regulatory regions, and are predicted to be more evolutionarily deleterious. Overall, our results highlight the importance of considering repeat variants in future studies of de novo mutations.
All TR genotypes and mutation calls are available through SFARI base accession code: SFARI_SSC_WGS_2b. Per-locus selection scores computed by SISTR are provided in Supplementary Data 1. The BrainSpan dataset is available at https://www.brainspan.org/static/download.html. The NHGRI GWAS catalogue is available at https://www.ebi.ac.uk/gwas/.
The (1) MonSTR software for identifying TR mutations (https://github.com/gymreklab/STRDenovoTools; https://zenodo.org/record/4279668) and (2) SISTR software for prioritizing TR mutations (https://github.com/BonnieCSE/SISTR; https://zenodo.org/record/4279700) are open source and available on GitHub. The code used to generate figures and results for this study is available at https://github.com/gymreklab/ssc-denovos-paper (https://zenodo.org/record/4279671).
American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders 5th edn (2013).
Rosti, R. O., Sadek, A. A., Vaux, K. K. & Gleeson, J. G. The genetic landscape of autism spectrum disorders. Dev. Med. Child Neurol. 56, 12–18 (2014).
Gaugler, T. et al. Most genetic risk for autism resides with common variation. Nat. Genet. 46, 881–885 (2014).
Iakoucheva, L. M., Muotri, A. R. & Sebat, J. Getting to the cores of autism. Cell 178, 1287–1298 (2019).
Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).
Willems, T., Gymrek, M., Poznik, G. D., Tyler-Smith, C. & Erlich, Y. Population-scale sequencing data enable precise estimates of Y-STR mutation rates. Am. J. Hum. Genet. 98, 919–933 (2016).
Hannan, A. J. Tandem repeats mediating genetic plasticity in health and disease. Nat. Rev. Genet. 19, 286–298 (2018).
Fischbach, G. D. & Lord, C. The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron 68, 192–195 (2010).
Turner, T. N. et al. Genomic patterns of de novo mutation in simplex autism. Cell 171, 710–722 (2017).
Mousavi, N., Shleizer-Burko, S., Yanicky, R. & Gymrek, M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 47, e90 (2019).
An, J. Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).
Gymrek, M., Willems, T., Reich, D. & Erlich, Y. Interpreting short tandem repeat variations in humans using mutational constraint. Nat. Genet. 49, 1495–1501 (2017).
Payseur, B. A., Jing, P. & Haasl, R. J. A genomic portrait of human microsatellite variation. Mol. Biol. Evol. 28, 303–312 (2011).
Sun, J. X. et al. A direct characterization of human mutation based on microsatellites. Nat. Genet. 44, 1161–1165 (2012).
Michaelson, J. J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).
O’Roak, B. J. et al. Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246–250 (2012).
Rahbari, R. et al. Timing, rates and spectra of human germline mutation. Nat. Genet. 48, 126–133 (2016).
Ellegren, H. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat. Genet. 24, 400–402 (2000).
Huang, Q. Y. et al. Mutation patterns at dinucleotide microsatellite loci in humans. Am. J. Hum. Genet. 70, 625–634 (2002).
Weber, J. L. & Wong, C. Mutation of human short tandem repeats. Hum. Mol. Genet. 2, 1123–1128 (1993).
Amos, W., Kosanović, D. & Eriksson, A. Inter-allelic interactions play a major role in microsatellite evolution. Proc. R. Soc. Lond. B 282, 20152125 (2015).
Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLOS Comput. Biol. 6, e1001025 (2010).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47 (D1), D886–D894 (2019).
Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
Werling, D. M. et al. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nat. Genet. 50, 727–736 (2018).
Trost, B. et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature 586, 80–86 (2020).
Zhou, J. et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 51, 973–980 (2019).
Grünewald, T. G. et al. Chimeric EWSR1-FLI1 regulates the Ewing sarcoma susceptibility gene EGR2 via a GGAA microsatellite. Nat. Genet. 47, 1073–1078 (2015).
Breuss, M. W. et al. Autism risk in offspring can be assessed through quantification of male sperm mosaicism. Nat. Med. 26, 143–150 (2020).
Mousavi, N. et al. TRTools: a toolkit for genome-wide analysis of tandem repeats. Bioinformatics btaa736 (2020).
Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 14, 590–592 (2017).
Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Quinlan, A. R. BEDTools: the Swiss-army tool for genome feature analysis. Curr. Protoc. Bioinformatics 47, 11.12.1–11.12.34 (2014).
Schuelke, M. An economic method for the fluorescent labeling of PCR fragments. Nat. Biotechnol. 18, 233–234 (2000).
Krebs, M. O. et al. Absence of association between a polymorphic GGC repeat in the 5′ untranslated region of the reelin gene and autism. Mol. Psychiatry 7, 801–804 (2002).
Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
Buniello, A. et al. The NHGRI–EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47 (D1), D1005–D1012 (2019).
Miller, J. A. et al. Transcriptional landscape of the prenatal human brain. Nature 508, 199–206 (2014).
Fotsing, S. F. et al. The impact of short tandem repeat variation on gene expression. Nat. Genet. 51, 1652–1659 (2019).
Fu, Y. X. & Chakraborty, R. Simultaneous estimation of all the parameters of a stepwise mutation model. Genetics 150, 487–497 (1998).
Haasl, R. J. & Payseur, B. A. Microsatellites as targets of natural selection. Mol. Biol. Evol. 30, 285–298 (2013).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate—a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
Battle, A., Brown, C. D., Engelhardt, B. E. & Montgomery, S. B. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
This study was supported by the Simons Foundation Autism Research Initiative (SFARI Grant no. 630705). I.M. was additionally supported by a predoctoral fellowship from the Autism Science Foundation. M.G. was additionally supported in part by the Office of The Director, National Institutes of Health under Award Number DP5OD024577 and NIH/NHGRI grants R01HG010149 and R21HG010070. K.E.L. was supported by the National Institutes of Health grant R35GM119856. We thank J. Gleeson, J. Sebat, A. Palmer and A. Goren for helpful comments on this study.
The authors declare no competing interests.
Peer review information Nature thanks Anders Børglum, Thomas Bourgeron, Anthony Hannan and Ryan Layer for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
a, Evaluation of a naive TR mutation-calling method. WGS was simulated for probands with mutations and controls with no mutation under three different scenarios for a range of mean sequencing coverages (Methods). Top plots show the sensitivity (blue line). Bottom plots show the false positive rate (FPR). Shaded bars show the percent of transmissions called as mutation (blue), no mutation (dark grey), or no call (light grey). b, Evaluation of MonSTR’s default model-based method. Plots are the same as in a, but based on MonSTR’s default model (Supplementary Methods). Note FPR lines are not visible because all are at 0%. c, Evaluation of TR mutation calling using default model-based MonSTR settings as a function of mutation size. The top plot is the same as in a, b, and shows the sensitivity to detect mutations as a function of their size. The bottom plot compares the estimated called mutation size (y-axis) with the true simulated mutation size (x-axis). Bubble sizes show the number of mutation calls represented at each point. d, Evaluation of TR mutation calling as a function of mutation size after quality filtering. Plots are the same as in c, but using the stringent quality filters in MonSTR applied to analyse the SSC cohort. Compared to default settings, sensitivity is decreased, especially for larger expansions, but inferred mutation sizes are unbiased. All plots are based on simulation of 100 randomly chosen TR loci (Methods). c, d show results for scenario #1.
a, Distribution of average TR mutation rates by period. For each repeat unit length (x-axis), bars give the genome-wide estimated TR mutation rate (y-axis, log10 scale). Average mutation rates were computed as the total number of mutations divided by the total number of children analysed. The numbers of TRs considered (rounded to the nearest 1,000) in each category are annotated. b, TR mutation rate vs. length. The x-axis shows the TR reference length (hg38) and the y-axis shows the log10 mutation rate estimated across all TRs with each reference length. Colours denote different repeat unit lengths. c, Number of TR mutations observed for CODIS markers. Red dots show observed mutation counts. Black dots show expected mutation counts and lines give 95% confidence intervals based on mutation rates reported by NIST (Methods). Each x-axis category denotes a separate CODIS marker. The total number of children analysed is annotated above each marker. d, Observed TR mutation counts are concordant with MUTEA. Boxes show the distribution of log10 mutation rates estimated by MUTEA12 (y-axis) at each TR with a given number of mutations observed in SSC children (x-axis). Black middle lines give medians and boxes span from the 25th percentile (Q1) to the 75th percentile (Q3). Whiskers extend to Q1-1.5*IQR (minima) and Q3+1.5*IQR (maxima), where IQR gives the interquartile range (Q3-Q1). Data are shown for n = 548,724 TRs for which MUTEA estimates were available. e, Determinants of TR mutation rates. The Poisson regression coefficient is shown for each feature in models trained separately for each repeat unit length (Methods). Features marked with an asterisk denote significant effects (two-sided P < 0.01 after Bonferroni correction for the number of features tested across all models). Nominal P-values are annotated above each plot. Error bars give 95% confidence intervals.
a, Mutation size distributions by repeat unit length. Histograms show the distribution (y-axis, fraction of total) of de novo TR mutation sizes for each repeat unit length (x-axis, number of repeat units). Mutations <0 denote contractions and >0 denote expansions. Colours denote different repeat unit lengths (grey = homopolymers; red = dinucleotides; gold = trinucleotides; blue = tetranucleotides; green = pentanucleotides; purple = hexanucleotides). b, c, Mutation size distributions by parental origin. Histograms show the distribution of de novo TR mutation sizes for mutations arising in the paternal (b) and maternal (c) germlines (homopolymers excluded). d, e, Mutation directionality bias in homozygous vs. heterozygous parents. In each plot, the x-axis gives the size of the parent allele relative to the reference genome (hg38). The y-axis gives the mean mutation size in terms of number of repeat units across all mutations with a given parent allele length. A separate coloured line is shown for each repeat unit length (red = dinucleotides; gold = trinucleotides; blue = tetranucleotides; green = pentanucleotides). Plots are restricted to mutations that were successfully phased to either the mother or the father for which the parent of origin was homozygous (d) or heterozygous (e). To restrict to highest confidence mutations, these plots are based only on mutations with step size of ±1 and for which the child had more than 10 enclosing reads supporting the de novo allele.
a, Number of recurrent mutations required to reach genome-wide significance. We performed a Fisher’s exact test to test for an excess of mutations in probands (n = 1,593) vs. non-ASD siblings (n = 1,593), for a different number of hypothetical mutation counts in probands (x-axis) and assuming 0 mutations observed in non-ASD siblings. The black line shows the two-sided P-value (log10 scale) obtained for each test. The grey dashed line denotes the P-value required to meet a genome-wide significance of P < 0.05 with Bonferroni multiple testing correction. b, Sample sizes required to identify genome-wide significant TRs. The x-axis shows sample size (log10 scale) in terms of the number of quad families analysed. Each line represents a different rate of mutation at a particular TR in probands, assuming 0 mutations at that TR in siblings (blue = 0.001%; orange = 0.01%; green = 0.05%; red = 0.1%; purple = 0.3%). The y-axis shows the power to detect a specific TR at genome-wide significance for each rate. c, Quantile–quantile plots for per-locus TR mutation burden testing. For each TR we performed a Fisher’s exact test to test for an excess of mutations in probands vs. siblings. The x-axis gives expected −log10 P-values under a null (uniform) distribution. The y-axis gives observed −log10 P-values from burden tests. Each dot represents a single TR. Black = all TRs. Gray = homopolymers excluded.
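The calculation behind panel a can be sketched as follows, using only the Python standard library. This is an illustration rather than the authors' code: it assumes each child is scored as carrying or not carrying a mutation at the TR, and the number of TRs tested (`n_tests`) is a hypothetical placeholder because the legend does not give it.

```python
from math import comb

N_PROBANDS = 1593  # from the legend
N_SIBLINGS = 1593  # from the legend


def fisher_two_sided(a, b, n1, n2):
    """Two-sided Fisher's exact P-value for a 2x2 carrier table.

    a of n1 children in group 1 and b of n2 children in group 2 carry a
    mutation. Sums hypergeometric probabilities of all tables that are
    no more likely than the observed one.
    """
    carriers = a + b
    denom = comb(n1 + n2, carriers)

    def prob(x):
        return comb(n1, x) * comb(n2, carriers - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, carriers - n2), min(carriers, n1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def min_count_for_significance(n_tests, alpha=0.05):
    """Smallest proband mutation count (with 0 in siblings) whose P-value
    passes a Bonferroni threshold of alpha / n_tests, or None if no count does."""
    threshold = alpha / n_tests
    for k in range(N_PROBANDS + 1):
        if fisher_two_sided(k, 0, N_PROBANDS, N_SIBLINGS) < threshold:
            return k
    return None


# Hypothetical number of TRs tested; replace with the real count.
k_needed = min_count_for_significance(n_tests=1_000_000)
```

Because the two groups are the same size, the two-sided P-value for k proband mutations and none in siblings is close to 2 × 0.5^k, which is why a few dozen recurrent mutations at a single TR would be needed under a threshold of this order.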
a, b, Bars show mean TR mutation counts in probands (red) vs. non-ASD siblings (blue) for TRs within 50kb of published GWAS associated SNPs (ASD = autism spectrum disorder; SCZ = schizophrenia; EA = educational attainment) considering (a) all TR mutations (ASD n = 4,213; SCZ n = 22,811; EA n = 25,668 TR mutations) or (b) TR mutations for which the mutant allele frequency is >5% in controls (SSC parents) (ASD n = 2,774; SCZ n = 14,661; EA n = 16,364 TR mutations). Error bars give 95% confidence intervals around the mean. Single asterisks denote nominally significant increases (Mann–Whitney one-sided P < 0.05). Double asterisks denote trends that are significant after Bonferroni correction for the six categories tested. Circles and squares show counts for females and males, respectively.
a, Ratio of median expression in proband-only genes to control-only genes across time points. The heatmap shows the ratio of the median expression of genes with only proband mutations (n = 268 genes) to that of genes with only mutations in non-ASD siblings (n = 242 genes). Each row shows a different brain structure from the BrainSpan dataset. Each column shows a different developmental time point. The black vertical line separates pre-natal from post-natal time points. Gray boxes indicate no data was available for that time point. Brain structure acronyms are defined in the legend of Fig. 3c. b, Proband TR mutations enriched for brain expression STRs. The quantile–quantile plot shows the distribution of expression STR (eSTR) unadjusted P-values based on associating TR length with gene expression in Brain-Caudate samples in the GTEx cohort46. eSTR association P-values are two-sided and are based on t-statistics computed using linear regression analyses performed previously. Each point represents a TR by gene association test using a linear regression model42. The x-axis gives expected −log10 P-values and the y-axis gives observed −log10 P-values. Red points show TRs with at least one de novo mutation in probands and 0 in controls. Blue points show TRs with at least one de novo mutation in controls and 0 in probands. We found no significant difference in either Brain-Cerebellum or the other 15 non-brain tissues analysed in that study, which we expected should not be relevant to ASD (not shown).
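The coordinates of a quantile–quantile plot like the one in panel b can be computed as in the sketch below. The expected quantile convention, rank/(n + 1), is an assumption chosen for illustration; the legend only states that expected values follow a uniform null distribution.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10 P pairs, sorted ascending.

    Assumption: the i-th smallest of n P-values is paired with the
    uniform quantile i / (n + 1). All P-values must be > 0.
    """
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    expected = sorted(-math.log10(rank / (n + 1)) for rank in range(1, n + 1))
    return list(zip(expected, observed))


# Toy P-values (not data from the study).
points = qq_points([0.5, 0.01, 0.1])
```

Points that rise above the diagonal at the upper end indicate smaller P-values than the null distribution predicts.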
a, Mutations in probands at coding or 5′ UTR TRs to unobserved alleles. Each panel shows a de novo TR mutation observed in ASD probands to an allele (x-axis, repeat copy number) not observed in SSC parents. Black histograms give the allele counts in parents. Red arrows denote the allele resulting from each specified de novo TR mutation. Pedigrees show genotypes of parents and the child with the mutation (probands = black diamonds; non-ASD siblings = white diamonds). The text below pedigrees gives the gene and region in which the mutation occurred. b, Mutations in non-ASD siblings at coding or 5′ UTR TRs to unobserved alleles. Plots are the same as in a, except that they show mutations in non-ASD siblings.
a, Mutation burden by gene annotation. b, Mutation burden by frequency of the allele arising by de novo mutation. The x-axis stratifies mutations based on non-overlapping bins of the frequency of the de novo allele in healthy controls (SSC parents). “All” includes all mutations. For other allele frequency bins, only TRs for which precise copy numbers could be inferred in at least 80% of SSC parents are included (Methods). AF = allele frequency. In both plots, the y-axis gives the relative risk (RR) in probands vs. non-ASD siblings. Dots show the estimated RR and lines give 95% confidence intervals. Grey = all samples; green = males only; purple = females only. Both plots show only TRs with repeat unit length >1bp.
a, STR mutation model. Mutation is modelled by a stochastic mutation matrix with length-dependent mutation rates and mutation sizes following a geometric distribution with a directional bias towards the central allele. Unless otherwise indicated, alleles are specified in terms of the number of repeat units away from the central, or modal, allele at each STR. b, STR selection model. Negative selection is modelled by a diploid selection surface constructed as a function of the fitness of the individual alleles. The fitness of each allele is calculated as a function of a selection coefficient s, where the central allele has optimal fitness (w = 1), and the fitness of other alleles is a function of the number of repeat units away from the optimal allele. c, Example output of forward simulations of allele frequencies. The simulation starts with one ancestral (“optimal”) allele. As s increases, variability in the resulting allele frequency distributions decreases as the less fit alleles are removed by natural selection. d, Overview of per-STR selection inference using Approximate Bayesian Computation. For each STR, the method takes a prior on s, mutation model, and demographic parameters, and the observed allele frequency distribution as input. It outputs a posterior distribution of s and a P-value from a likelihood ratio test of whether a model with selection fits better than a model without selection (s = 0).
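The mutation and selection models in panels a to c can be illustrated with a small deterministic forward iteration of allele frequencies. This is a sketch under stated assumptions and not the SISTR implementation: the legend does not give the functional forms, so the exponential length dependence of the mutation rate, the linear directional bias, the linear allele fitness 1 − s|a|, the additive diploid fitness, the infinite population and every parameter value below are hypothetical choices.

```python
import math


def mutation_matrix(k_max, mu0, length_slope, p_geom, beta):
    """Row-stochastic matrix over alleles -k_max..k_max (units from the mode).

    Assumptions: rate mu0 * exp(length_slope * a); geometric step sizes;
    expansion probability 0.5 - beta * a (bias towards the central allele);
    steps past the allele range pile up on the boundary allele.
    """
    n = 2 * k_max + 1
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a = i - k_max
        mu = min(1.0, mu0 * math.exp(length_slope * a))
        p_up = min(1.0, max(0.0, 0.5 - beta * a))
        matrix[i][i] += 1.0 - mu
        for direction, p_dir in ((1, p_up), (-1, 1.0 - p_up)):
            edge = n - 1 if direction == 1 else 0
            for step in range(1, n):
                j = min(n - 1, max(0, i + direction * step))
                matrix[i][j] += mu * p_dir * p_geom * (1.0 - p_geom) ** (step - 1)
            # Remaining tail of the geometric distribution goes to the boundary.
            matrix[i][edge] += mu * p_dir * (1.0 - p_geom) ** (n - 1)
    return matrix


def simulate(s, generations=500, k_max=10, mu0=1e-3,
             length_slope=0.1, p_geom=0.9, beta=0.05):
    """Iterate selection then mutation, starting from the central allele only."""
    n = 2 * k_max + 1
    matrix = mutation_matrix(k_max, mu0, length_slope, p_geom, beta)
    # Assumed allele fitness: 1 at the optimum, falling linearly with distance.
    fitness = [max(0.0, 1.0 - s * abs(i - k_max)) for i in range(n)]
    freqs = [0.0] * n
    freqs[k_max] = 1.0
    for _ in range(generations):
        mean_w = sum(p * w for p, w in zip(freqs, fitness))
        # Additive diploid fitness (w_a + w_b) / 2 gives this marginal update.
        freqs = [p * (w + mean_w) / (2.0 * mean_w) for p, w in zip(freqs, fitness)]
        freqs = [sum(freqs[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return freqs


def allele_variance(freqs):
    """Variance of allele size (repeat units from the mode) under freqs."""
    k_max = (len(freqs) - 1) // 2
    mean = sum(p * (i - k_max) for i, p in enumerate(freqs))
    return sum(p * (i - k_max - mean) ** 2 for i, p in enumerate(freqs))


neutral = simulate(s=0.0)
selected = simulate(s=0.1)
```

Comparing `allele_variance(neutral)` with `allele_variance(selected)` reproduces the qualitative behaviour described for panel c: a larger s leaves a narrower allele frequency distribution. A stochastic simulation with finite population size and a demographic model, as the legend describes, would replace the deterministic update with sampling.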
a, Comparison of true vs. inferred per-locus selection coefficients. The x-axis shows the true simulated value of s, and the y-axis shows the mean s value inferred by SISTR across 200 simulation replicates. b, Power to detect negative selection as a function of s. The x-axis shows the true simulated value of s, and the y-axis gives the power to reject the null hypothesis that s = 0. Left, middle, and right panels show results using models for dinucleotide, trinucleotide, and tetranucleotide TRs, respectively. Colours denote different optimal allele lengths. c, Inferred genome-wide distribution of s is robust to prior choice and demographic models. We applied SISTR genome-wide using 2 different demographic models (Supplementary Methods) and 3 different prior distributions (left panels) on s. Right panels show the inferred genome-wide distribution of s using different combinations of priors and demographic models. Only loci inferred to be under selection (adjusted SISTR P < 1%) are included in the histograms. Red, yellow, and blue denote dinucleotides (n = 29,874), trinucleotides (n = 39,250), and tetranucleotides (n = 13,099), respectively. d, Genes containing coding STRs under strong selection are more missense-constrained. The x-axis gives the missense constraint Z-score reported by gnomAD47. The y-axis gives the frequency of genes with each missense Z-score. e, Genes containing coding STRs under strong selection are more loss-of-function intolerant. The x-axis gives the pLI score measuring loss-of-function intolerance of each gene reported by gnomAD. For d, e, black bars show the distribution for all genes containing an STR not inferred to be under selection (n = 177; adjusted SISTR P ≥ 1%) and red bars show the distribution for all genes containing an STR inferred to be under selection (n = 21; adjusted SISTR P < 1%). Vertical lines show medians of each distribution.
For c–e, SISTR P-values are one-sided and based on the likelihood ratio test described in the Supplementary Methods.
Supplementary Methods; Supplementary Note; Supplementary Discussion. The supplementary information provides detailed descriptions of the MonSTR and SISTR methods, analysis of the contribution of genomic sequence features to TR mutation rates, further analyses comparing de novo TR mutations to published GWAS loci, and additional discussion points.
Supplementary Dataset 1: Per-locus selection coefficients inferred using SISTR.
This file contains Supplementary Tables 1-8.
Mitra, I., Huang, B., Mousavi, N. et al. Patterns of de novo tandem repeat mutations and their role in autism. Nature 589, 246–250 (2021). https://doi.org/10.1038/s41586-020-03078-7