Patterns of de novo tandem repeat mutations and their role in autism

Mitra, Ileena; Huang, Bonnie; Mousavi, Nima; Ma, Nichole; Lamkin, Michael; Yanicky, Richard; Shleizer-Burko, Sharona; Lohmueller, Kirk E.; Gymrek, Melissa

doi:10.1038/s41586-020-03078-7

Article
Published: 13 January 2021

Nature volume 589, pages 246–250 (2021)

Abstract

Autism spectrum disorder (ASD) is an early-onset developmental disorder characterized by deficits in communication and social interaction and restrictive or repetitive behaviours¹,². Family studies demonstrate that ASD has a substantial genetic basis with contributions both from inherited and de novo variants³,⁴. It has been estimated that de novo mutations may contribute to 30% of all simplex cases, in which only a single child is affected per family⁵. Tandem repeats (TRs), defined here as sequences of 1 to 20 base pairs in size repeated consecutively, comprise one of the major sources of de novo mutations in humans⁶. TR expansions are implicated in dozens of neurological and psychiatric disorders⁷. Yet, de novo TR mutations have not been characterized on a genome-wide scale, and their contribution to ASD remains unexplored. Here we develop new bioinformatics methods for identifying and prioritizing de novo TR mutations from sequencing data and perform a genome-wide characterization of de novo TR mutations in ASD-affected probands and unaffected siblings. We infer specific mutation events and their precise changes in repeat number, and primarily focus on more prevalent stepwise copy number changes rather than large expansions. Our results demonstrate a significant genome-wide excess of TR mutations in ASD probands. Mutations in probands tend to be larger, enriched in fetal brain regulatory regions, and are predicted to be more evolutionarily deleterious. Overall, our results highlight the importance of considering repeat variants in future studies of de novo mutations.

**Fig. 1: Identifying de novo TR mutations in the SSC cohort.**

**Fig. 2: Patterns of autosomal TR mutations.**

**Fig. 4: Prioritizing TR mutations by fitness effects.**

Data availability

All TR genotypes and mutation calls are available through SFARI Base under accession code SFARI_SSC_WGS_2b. Per-locus selection scores computed by SISTR are provided in Supplementary Data 1. The BrainSpan dataset is available at https://www.brainspan.org/static/download.html. The NHGRI GWAS catalogue is available at https://www.ebi.ac.uk/gwas/.

Code availability

The MonSTR software for identifying TR mutations (https://github.com/gymreklab/STRDenovoTools; Zenodo: https://zenodo.org/record/4279668) and the SISTR software for prioritizing TR mutations (https://github.com/BonnieCSE/SISTR; Zenodo: https://zenodo.org/record/4279700) are open source and available on GitHub. The code used to generate figures and results for this study is available at https://github.com/gymreklab/ssc-denovos-paper (Zenodo: https://zenodo.org/record/4279671).

References

1. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders 5th edn (2013).
2. Rosti, R. O., Sadek, A. A., Vaux, K. K. & Gleeson, J. G. The genetic landscape of autism spectrum disorders. Dev. Med. Child Neurol. 56, 12–18 (2014).
3. Gaugler, T. et al. Most genetic risk for autism resides with common variation. Nat. Genet. 46, 881–885 (2014).
4. Iakoucheva, L. M., Muotri, A. R. & Sebat, J. Getting to the cores of autism. Cell 178, 1287–1298 (2019).
5. Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).
6. Willems, T., Gymrek, M., Poznik, G. D., Tyler-Smith, C. & Erlich, Y. Population-scale sequencing data enable precise estimates of Y-STR mutation rates. Am. J. Hum. Genet. 98, 919–933 (2016).
7. Hannan, A. J. Tandem repeats mediating genetic plasticity in health and disease. Nat. Rev. Genet. 19, 286–298 (2018).
8. Fischbach, G. D. & Lord, C. The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron 68, 192–195 (2010).
9. Turner, T. N. et al. Genomic patterns of de novo mutation in simplex autism. Cell 171, 710–722 (2017).
10. Mousavi, N., Shleizer-Burko, S., Yanicky, R. & Gymrek, M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 47, e90 (2019).
11. An, J. Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).
12. Gymrek, M., Willems, T., Reich, D. & Erlich, Y. Interpreting short tandem repeat variations in humans using mutational constraint. Nat. Genet. 49, 1495–1501 (2017).
13. Payseur, B. A., Jing, P. & Haasl, R. J. A genomic portrait of human microsatellite variation. Mol. Biol. Evol. 28, 303–312 (2011).
14. Sun, J. X. et al. A direct characterization of human mutation based on microsatellites. Nat. Genet. 44, 1161–1165 (2012).
15. Michaelson, J. J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).
16. O’Roak, B. J. et al. Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246–250 (2012).
17. Rahbari, R. et al. Timing, rates and spectra of human germline mutation. Nat. Genet. 48, 126–133 (2016).
18. Ellegren, H. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat. Genet. 24, 400–402 (2000).
19. Huang, Q. Y. et al. Mutation patterns at dinucleotide microsatellite loci in humans. Am. J. Hum. Genet. 70, 625–634 (2002).
20. Weber, J. L. & Wong, C. Mutation of human short tandem repeats. Hum. Mol. Genet. 2, 1123–1128 (1993).
21. Amos, W., Kosanović, D. & Eriksson, A. Inter-allelic interactions play a major role in microsatellite evolution. Proc. R. Soc. Lond. B 282, 20152125 (2015).
22. Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLOS Comput. Biol. 6, e1001025 (2010).
23. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
24. Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47 (D1), D886–D894 (2019).
25. Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
26. Werling, D. M. et al. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nat. Genet. 50, 727–736 (2018).
27. Trost, B. et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature 586, 80–86 (2020).
28. Zhou, J. et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 51, 973–980 (2019).
29. Grünewald, T. G. et al. Chimeric EWSR1-FLI1 regulates the Ewing sarcoma susceptibility gene EGR2 via a GGAA microsatellite. Nat. Genet. 47, 1073–1078 (2015).
30. Breuss, M. W. et al. Autism risk in offspring can be assessed through quantification of male sperm mosaicism. Nat. Med. 26, 143–150 (2020).
31. Mousavi, N. et al. TRTools: a toolkit for genome-wide analysis of tandem repeats. Bioinformatics btaa736 (2020).
32. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
33. Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 14, 590–592 (2017).
34. Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
35. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
36. Quinlan, A. R. BEDTools: the Swiss-army tool for genome feature analysis. Curr. Protoc. Bioinformatics 47, 11–34 (2014).
37. Schuelke, M. An economic method for the fluorescent labeling of PCR fragments. Nat. Biotechnol. 18, 233–234 (2000).
38. Krebs, M. O. et al. Absence of association between a polymorphic GGC repeat in the 5′ untranslated region of the reelin gene and autism. Mol. Psychiatry 7, 801–804 (2002).
39. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
40. Buniello, A. et al. The NHGRI–EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47 (D1), D1005–D1012 (2019).
41. Miller, J. A. et al. Transcriptional landscape of the prenatal human brain. Nature 508, 199–206 (2014).
42. Fotsing, S. F. et al. The impact of short tandem repeat variation on gene expression. Nat. Genet. 51, 1652–1659 (2019).
43. Fu, Y. X. & Chakraborty, R. Simultaneous estimation of all the parameters of a stepwise mutation model. Genetics 150, 487–497 (1998).
44. Haasl, R. J. & Payseur, B. A. Microsatellites as targets of natural selection. Mol. Biol. Evol. 30, 285–298 (2013).
45. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate—a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
46. Battle, A., Brown, C. D., Engelhardt, B. E. & Montgomery, S. B. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
47. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

Acknowledgements

This study was supported by the Simons Foundation Autism Research Initiative (SFARI Grant no. 630705). I.M. was additionally supported by a predoctoral fellowship from the Autism Science Foundation. M.G. was additionally supported in part by the Office of The Director, National Institutes of Health under Award Number DP5OD024577 and NIH/NHGRI grants R01HG010149 and R21HG010070. K.E.L. was supported by the National Institutes of Health grant R35GM119856. We thank J. Gleeson, J. Sebat, A. Palmer and A. Goren for helpful comments on this study.

Author information

Authors and Affiliations

Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
Ileena Mitra
Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
Bonnie Huang & Michael Lamkin
Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, USA
Nima Mousavi
Department of Medicine, University of California San Diego, La Jolla, CA, USA
Nichole Ma, Richard Yanicky, Sharona Shleizer-Burko & Melissa Gymrek
Department of Ecology and Evolutionary Biology, University of California Los Angeles, Los Angeles, CA, USA
Kirk E. Lohmueller
Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA
Kirk E. Lohmueller
Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
Melissa Gymrek

Contributions

I.M. performed TR genotyping, identification of de novo mutations and downstream analyses, and helped to write the manuscript. B.H. developed SISTR and performed analysis of TR selection scores in the SSC cohort. N. Mousavi helped to design GangSTR analysis and filtering settings, and performed analyses to evaluate MonSTR. N. Ma performed capillary electrophoresis validation experiments. M.L. performed TR annotation for identification of determinants of TR mutation rates. R.Y. designed AWS cloud analysis pipelines. S.S.-B. helped to design and set up validation experiments. K.E.L. conceived the SISTR method, supervised analysis of TR selection scores and drafted the manuscript. M.G. conceived the study, designed and performed analyses, and drafted the manuscript. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Kirk E. Lohmueller or Melissa Gymrek.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature thanks Anders Børglum, Thomas Bourgeron, Anthony Hannan and Ryan Layer for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Evaluation of MonSTR using simulated data.

a, Evaluation of a naive TR mutation-calling method. WGS was simulated for probands with mutations and controls with no mutation under three different scenarios for a range of mean sequencing coverages (Methods). Top plots show the sensitivity (blue line). Bottom plots show the false positive rate (FPR). Shaded bars show the percentage of transmissions called as mutation (blue), no mutation (dark grey), or no call (light grey). b, Evaluation of MonSTR’s default model-based method. Plots are the same as in a, but based on MonSTR’s default model (Supplementary Methods). Note that FPR lines are not visible because all are at 0%. c, Evaluation of TR mutation calling using default model-based MonSTR settings as a function of mutation size. The top plot is the same as in a, b, and shows the sensitivity to detect mutations as a function of their size. The bottom plot compares the called mutation size (y-axis) with the true simulated mutation size (x-axis). Bubble sizes show the number of mutation calls represented at each point. d, Evaluation of TR mutation calling as a function of mutation size after quality filtering. Plots are the same as in c, but use the stringent MonSTR quality filters that were applied to analyse the SSC cohort. Compared to default settings, sensitivity is decreased, especially for larger expansions, but inferred mutation sizes are unbiased. All plots are based on simulation of 100 randomly chosen TR loci (Methods). c, d show results for scenario #1.

Extended Data Fig. 2 Genome-wide de novo TR mutation rate patterns.

a, Distribution of average TR mutation rates by period. For each repeat unit length (x-axis), bars give the genome-wide estimated TR mutation rate (y-axis, log₁₀ scale). Average mutation rates were computed as the total number of mutations divided by the total number of children analysed. The numbers of TRs considered (rounded to the nearest 1,000) in each category are annotated. b, TR mutation rate vs. length. The x-axis shows the TR reference length (hg38) and the y-axis shows the log₁₀ mutation rate estimated across all TRs with each reference length. Colours denote different repeat unit lengths. c, Number of TR mutations observed for CODIS markers. Red dots show observed mutation counts. Black dots show expected mutation counts and lines give 95% confidence intervals based on mutation rates reported by NIST (Methods). Each x-axis category denotes a separate CODIS marker. The total number of children analysed is annotated above each marker. d, Observed TR mutation counts are concordant with MUTEA. Boxes show the distribution of log₁₀ mutation rates estimated by MUTEA¹² (y-axis) at each TR with a given number of mutations observed in SSC children (x-axis). Black middle lines give medians and boxes span from the 25th percentile (Q1) to the 75th percentile (Q3). Whiskers extend to Q1 − 1.5 × IQR (minima) and Q3 + 1.5 × IQR (maxima), where IQR is the interquartile range (Q3 − Q1). Data are shown for n = 548,724 TRs for which MUTEA estimates were available. e, Determinants of TR mutation rates. The Poisson regression coefficient is shown for each feature in models trained separately for each repeat unit length (Methods). Features marked with an asterisk denote significant effects (two-sided P < 0.01 after Bonferroni correction for the number of features tested across all models). Nominal P-values are annotated above each plot. Error bars give 95% confidence intervals.
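
The two summary statistics named in this legend can be illustrated with a minimal sketch: the average mutation rate (total mutations divided by children analysed) and the box-plot whisker bounds Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. The function names and the toy inputs are hypothetical and are not taken from the study; the quartile interpolation method is also an assumption, since the legend does not specify one.

```python
from statistics import quantiles


def average_mutation_rate(total_mutations, n_children):
    """Average rate as described in the legend: mutations / children analysed."""
    if n_children <= 0:
        raise ValueError("n_children must be positive")
    return total_mutations / n_children


def whisker_bounds(values):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for a sequence of numbers.

    Assumption: quartiles use the 'inclusive' interpolation method;
    other plotting tools may use a slightly different definition.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


if __name__ == "__main__":
    # Toy numbers for illustration only.
    print(average_mutation_rate(50, 1000))
    print(whisker_bounds([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

Values falling outside the returned bounds would conventionally be drawn as outliers rather than covered by the whiskers.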

Extended Data Fig. 3 Biases in TR mutation sizes.

a, Mutation size distributions by repeat unit length. Histograms show the distribution (y-axis, fraction of total) of de novo TR mutation sizes for each repeat unit length (x-axis, number of repeat units). Mutations <0 denote contractions and >0 denote expansions. Colours denote different repeat unit lengths (grey = homopolymers; red = dinucleotides; gold = trinucleotides; blue = tetranucleotides; green = pentanucleotides; purple = hexanucleotides). b, c, Mutation size distributions by parental origin. Histograms show the distribution of de novo TR mutation sizes for mutations arising in the paternal (b) and maternal (c) germlines (homopolymers excluded). d, e, Mutation directionality bias in homozygous vs. heterozygous parents. In each plot, the x-axis gives the size of the parent allele relative to the reference genome (hg38). The y-axis gives the mean mutation size in terms of number of repeat units across all mutations with a given parent allele length. A separate coloured line is shown for each repeat unit length (red = dinucleotides; gold = trinucleotides; blue = tetranucleotides; green = pentanucleotides). Plots are restricted to mutations that were successfully phased to either the mother or the father and for which the parent of origin was homozygous (d) or heterozygous (e). To restrict to the highest-confidence mutations, these plots are based only on mutations with a step size of ±1 and for which the child had more than 10 enclosing reads supporting the de novo allele.

Extended Data Fig. 4 Power to detect per-locus TR mutation enrichments.

a, Number of recurrent mutations required to reach genome-wide significance. We performed a Fisher’s exact test to test for an excess of mutations in probands (n = 1,593) vs. non-ASD siblings (n = 1,593), for a range of hypothetical mutation counts in probands (x-axis) and assuming 0 mutations observed in non-ASD siblings. The black line shows the two-sided P-value (log₁₀ scale) obtained for each test. The grey dashed line denotes the P-value required to meet a genome-wide significance of P < 0.05 with Bonferroni multiple testing correction. b, Sample sizes required to identify genome-wide significant TRs. The x-axis shows sample size (log₁₀ scale) in terms of the number of quad families analysed. Each line represents a different rate of mutation at a particular TR in probands, assuming 0 mutations at that TR in siblings (blue = 0.001%; orange = 0.01%; green = 0.05%; red = 0.1%; purple = 0.3%). The y-axis shows the power to detect a specific TR at genome-wide significance for each rate. c, Quantile–quantile plots for per-locus TR mutation burden testing. For each TR we performed a Fisher’s exact test to test for an excess of mutations in probands vs. siblings. The x-axis gives expected −log₁₀ P-values under a null (uniform) distribution. The y-axis gives observed −log₁₀ P-values from burden tests. Each dot represents a single TR. Black = all TRs. Grey = homopolymers excluded.
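
The calculation behind panel a can be sketched as follows: a two-sided Fisher’s exact test on a 2 × 2 table with k mutated probands out of 1,593 and 0 mutated siblings out of 1,593, compared against a Bonferroni-corrected threshold. This is a minimal pure-Python sketch, not the authors’ code. The number of tests (`N_TESTS_ASSUMED`) is a hypothetical placeholder, because the number of TRs tested is not stated in this legend, and the helper names are illustrative.

```python
from fractions import Fraction
from math import comb

N_PROBANDS = 1593  # from the legend
N_SIBLINGS = 1593  # from the legend
N_TESTS_ASSUMED = 1_000_000  # hypothetical number of TRs tested


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Exact probability that the top-left cell equals x.
        return Fraction(comb(row1, x) * comb(row2, col1 - x), denom)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return float(sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs))


def min_mutations_for_significance(n_tests, alpha=0.05, max_k=200):
    """Smallest proband mutation count (0 in siblings) passing Bonferroni."""
    threshold = alpha / n_tests
    for k in range(1, max_k + 1):
        p = fisher_exact_two_sided(k, N_PROBANDS - k, 0, N_SIBLINGS)
        if p < threshold:
            return k
    return None


if __name__ == "__main__":
    print(min_mutations_for_significance(N_TESTS_ASSUMED))
```

Exact rational arithmetic is used for the table probabilities so that ties between equally likely tables are resolved without floating-point error; with equal group sizes, the mirror-image table (all mutations in siblings) always ties with the observed one.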

Extended Data Fig. 5 TR mutation burden near SNPs associated with ASD and related traits.

a, b, Bars show mean TR mutation counts in probands (red) vs. non-ASD siblings (blue) for TRs within 50 kb of published GWAS-associated SNPs (ASD = autism spectrum disorder; SCZ = schizophrenia; EA = educational attainment) considering (a) all TR mutations (ASD n = 4,213; SCZ n = 22,811; EA n = 25,668 TR mutations) or (b) TR mutations for which the mutant allele frequency is >5% in controls (SSC parents) (ASD n = 2,774; SCZ n = 14,661; EA n = 16,364 TR mutations). Error bars give 95% confidence intervals around the mean. Single asterisks denote nominally significant increases (Mann–Whitney one-sided P < 0.05). Double asterisks denote trends that are significant after Bonferroni correction for the six categories tested. Circles and squares show counts for females and males, respectively.

Extended Data Fig. 6 Proband de novo TR mutations enriched in brain-expressed genes.

a, Ratio of median expression in proband-only genes to control-only genes across time points. The heatmap shows the ratio of the median expression of genes with only proband mutations (n = 268 genes) to that of genes with only mutations in non-ASD siblings (n = 242 genes). Each row shows a different brain structure from the BrainSpan dataset. Each column shows a different developmental time point. The black vertical line separates prenatal from postnatal time points. Grey boxes indicate that no data were available for that time point. Brain structure acronyms are defined in the legend of Fig. 3c. b, Proband TR mutations enriched for brain expression STRs. The quantile–quantile plot shows the distribution of expression STR (eSTR) unadjusted P-values based on associating TR length with gene expression in Brain-Caudate samples in the GTEx cohort⁴⁶. eSTR association P-values are two-sided and are based on t-statistics computed using linear regression analyses performed previously. Each point represents a TR-by-gene association test using a linear regression model⁴². The x-axis gives expected −log₁₀ P-values and the y-axis gives observed −log₁₀ P-values. Red points show TRs with at least one de novo mutation in probands and 0 in controls. Blue points show TRs with at least one de novo mutation in controls and 0 in probands. We found no significant difference in either Brain-Cerebellum or the other 15 non-brain tissues analysed in that study, which we expected should not be relevant to ASD (not shown).

Extended Data Fig. 7 All coding and 5′ UTR mutations to novel alleles.

a, Mutations in probands at coding or 5′ UTR TRs to unobserved alleles. Each panel shows a de novo TR mutation observed in ASD probands to an allele (x-axis, repeat copy number) not observed in SSC parents. Black histograms give the allele counts in parents. Red arrows denote the allele resulting from each specified de novo TR mutation. Pedigrees show genotypes of parents and the child with the mutation (probands = black diamonds; non-ASD siblings = white diamonds). The text below pedigrees gives the gene and region in which the mutation occurred. b, Mutations in non-ASD siblings at coding or 5′ UTR TRs to unobserved alleles. Plots are the same as in a, but show mutations in non-ASD siblings.

Extended Data Fig. 8 TR mutation burden in ASD excluding homopolymers.

a, Mutation burden by gene annotation. b, Mutation burden by frequency of the allele arising by de novo mutation. The x-axis stratifies mutations based on non-overlapping bins of the frequency of the de novo allele in healthy controls (SSC parents). “All” includes all mutations. For other allele frequency bins, only TRs for which precise copy numbers could be inferred in at least 80% of SSC parents are included (Methods). AF = allele frequency. In both plots, the y-axis gives the relative risk (RR) in probands vs. non-ASD siblings. Dots show estimated RR and lines give 95% confidence intervals. Grey = all samples; green = males only; purple = females only. Both plots show only TRs with repeat unit length >1 bp.

Extended Data Fig. 9 A method to estimate selection coefficients for short TRs (STRs).

a, STR mutation model. Mutation is modelled by a stochastic mutation matrix with length-dependent mutation rates and mutation sizes following a geometric distribution with a directional bias towards the central allele. Unless otherwise indicated, alleles are specified in terms of the number of repeat units away from the central, or modal, allele at each STR. b, STR selection model. Negative selection is modelled by a diploid selection surface constructed as a function of the fitness of the individual alleles. The fitness of each allele is calculated as a function of a selection coefficient s, where the central allele has optimal fitness (w = 1), and the fitness of other alleles is a function of the number of repeat units away from the optimal allele. c, Example output of forward simulations of allele frequencies. The simulation starts with one ancestral (“optimal”) allele. As s increases, variability in the resulting allele frequency distributions decreases as the less fit alleles are removed by natural selection. d, Overview of per-STR selection inference using Approximate Bayesian Computation. For each STR, the method takes a prior on s, mutation model, and demographic parameters, and the observed allele frequency distribution as input. It outputs a posterior distribution of s and a P-value from a likelihood ratio test of whether a model with selection fits better than a model without selection (s = 0).
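
The mutation and selection models in panels a and b can be made concrete with a small sketch. It is not the SISTR implementation: the legend does not give functional forms, so the linear fitness decline, the multiplicative diploid fitness, the log-linear length dependence of the mutation rate, the linear directional bias and the boundary handling are all assumptions, as are every function and parameter name (`mu0`, `length_slope`, `p_geom`, `beta`).

```python
def allele_fitness(allele, s):
    """Fitness of an allele `allele` repeat units from the optimum (w = 1 at 0).

    Assumption: fitness declines linearly with distance, floored at 0.
    """
    return max(0.0, 1.0 - s * abs(allele))


def diploid_fitness(a1, a2, s):
    """Assumption: diploid fitness is the product of the allele fitnesses."""
    return allele_fitness(a1, s) * allele_fitness(a2, s)


def mutation_matrix(max_offset, mu0, length_slope, p_geom, beta):
    """Row-stochastic matrix M[i][j] = P(allele i mutates to allele j).

    Alleles run from -max_offset to +max_offset relative to the central
    allele. Step sizes are geometric with parameter p_geom; the chance of
    expanding is 0.5 - beta * allele, so mutations are biased towards 0.
    """
    n = 2 * max_offset + 1
    matrix = []
    for a in range(-max_offset, max_offset + 1):
        row = [0.0] * n
        # Assumed log-linear dependence of mutation rate on allele length.
        mu = min(1.0, mu0 * 10 ** (length_slope * a))
        p_up = min(1.0, max(0.0, 0.5 - beta * a))
        for direction, p_dir in ((1, p_up), (-1, 1.0 - p_up)):
            room = max_offset - direction * a  # steps left before boundary
            if room == 0:
                row[a + max_offset] += mu * p_dir  # no room: allele unchanged
                continue
            for k in range(1, room):
                step_prob = p_geom * (1.0 - p_geom) ** (k - 1)
                row[a + direction * k + max_offset] += mu * p_dir * step_prob
            # Remaining geometric tail mass is placed on the boundary allele.
            tail = (1.0 - p_geom) ** (room - 1)
            row[direction * max_offset + max_offset] += mu * p_dir * tail
        row[a + max_offset] += 1.0 - mu  # no mutation
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    m = mutation_matrix(max_offset=5, mu0=1e-3, length_slope=0.1,
                        p_geom=0.9, beta=0.05)
    print([round(sum(row), 12) for row in m])
    print(diploid_fitness(0, 2, s=0.01))
```

A forward simulation of the kind described in panel c would repeatedly apply such a matrix to an allele frequency vector and reweight genotypes by their fitness; that step is omitted here.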

Extended Data Fig. 10 Evaluation of SISTR.

a, Comparison of true vs. inferred per-locus selection coefficients. The x-axis shows the true simulated value of s, and the y-axis shows the mean s value inferred by SISTR across 200 simulation replicates. b, Power to detect negative selection as a function of s. The x-axis shows the true simulated value of s, and the y-axis gives the power to reject the null hypothesis that s = 0. Left, middle, and right panels show results using models for dinucleotide, trinucleotide, and tetranucleotide TRs, respectively. Colours denote different optimal allele lengths. c, Inferred genome-wide distribution of s is robust to prior choice and demographic models. We applied SISTR genome-wide using 2 different demographic models (Supplementary Methods) and 3 different prior distributions (left panels) on s. Right panels show the inferred genome-wide distribution of s using different combinations of priors and demographic models. Only loci inferred to be under selection (adjusted SISTR P < 1%) are included in the histograms. Red, yellow, and blue denote dinucleotides (n = 29,874), trinucleotides (n = 39,250), and tetranucleotides (n = 13,099), respectively. d, Genes containing coding STRs under strong selection are more missense-constrained. The x-axis gives the missense constraint Z-score reported by gnomAD⁴⁷. The y-axis gives the frequency of genes with each missense Z-score. e, Genes containing coding STRs under strong selection are more loss-of-function intolerant. The x-axis gives the pLI score measuring loss-of-function intolerance of each gene reported by gnomAD. For d, e, black bars show the distribution for all genes containing an STR not inferred to be under selection (n = 177; adjusted SISTR P ≥ 1%) and red bars show the distribution for all genes containing an STR inferred to be under selection (n = 21; adjusted SISTR P < 1%). Vertical lines show medians of each distribution. For c–e, SISTR P-values are one-sided and based on the likelihood ratio test described in the Supplementary Methods.

Supplementary information

Supplementary Information

Supplementary Methods; Supplementary Note; Supplementary Discussion. The supplementary information provides detailed descriptions of the MonSTR and SISTR methods, analysis of the contribution of genomic sequence features to TR mutation rates, further analyses comparing de novo TR mutations to published GWAS loci, and additional discussion points.

Reporting Summary

Supplementary Data

Supplementary Dataset 1: Per-locus selection coefficients inferred using SISTR.

Supplementary Tables

This file contains Supplementary Tables 1-8.

Peer Review File

About this article

Cite this article

Mitra, I., Huang, B., Mousavi, N. et al. Patterns of de novo tandem repeat mutations and their role in autism. Nature 589, 246–250 (2021). https://doi.org/10.1038/s41586-020-03078-7

Received: 14 February 2020
Accepted: 23 November 2020
Published: 13 January 2021
Issue Date: 14 January 2021
DOI: https://doi.org/10.1038/s41586-020-03078-7
