Identifying regions of the genome that are depleted of mutations can distinguish potentially deleterious variants. Short tandem repeats (STRs), also known as microsatellites, are among the largest contributors of de novo mutations in humans. However, per-locus studies of STR mutations have been limited to highly ascertained panels of several dozen loci. Here we harnessed bioinformatics tools and a novel analytical framework to estimate mutation parameters for each STR in the human genome by correlating STR genotypes with local sequence heterozygosity. We applied our method to obtain robust estimates of the impact of local sequence features on mutation parameters and used these estimates to create a framework for measuring constraint at STRs by comparing observed versus expected mutation rates. Constraint scores identified known pathogenic variants with early-onset effects. Our metric will provide a valuable tool for prioritizing pathogenic STRs in medical genetics studies.
Samocha, K.E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
Petrovski, S., Wang, Q., Heinzen, E.L., Allen, A.S. & Goldstein, D.B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
Gulko, B., Hubisz, M.J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).
di Iulio, J. et al. The human functional genome defined by genetic diversity. Preprint at bioRxiv http://dx.doi.org/10.1101/082362 (2016).
Willems, T., Gymrek, M., Highnam, G., Mittelman, D. & Erlich, Y. The landscape of human STR variation. Genome Res. 24, 1894–1904 (2014).
Mirkin, S.M. Expandable DNA repeats and human disease. Nature 447, 932–940 (2007).
Houge, G., Bruland, O., Bjørnevoll, I., Hayden, M.R. & Semaka, A. De novo Huntington disease caused by 26–44 CAG repeat expansion on a low-risk haplotype. Neurology 81, 1099–1100 (2013).
Amiel, J., Trochet, D., Clément-Ziza, M., Munnich, A. & Lyonnet, S. Polyalanine expansions in human. Hum. Mol. Genet. 13, R235–R243 (2004).
Press, M.O., Carlson, K.D. & Queitsch, C. The overdue promise of short tandem repeat variation for heritability. Trends Genet. 30, 504–512 (2014).
Gymrek, M. et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat. Genet. 48, 22–29 (2016).
Quilez, J. et al. Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res. 44, 3750–3762 (2016).
Hause, R.J., Pritchard, C.C., Shendure, J. & Salipante, S.J. Classification and characterization of microsatellite instability across 18 cancer types. Nat. Med. 22, 1342–1350 (2016).
Ballantyne, K.N. et al. Mutability of Y-chromosomal microsatellites: rates, characteristics, molecular bases, and forensic implications. Am. J. Hum. Genet. 87, 341–353 (2010).
Burgarella, C. & Navascués, M. Mutation rate estimates for 110 Y-chromosome STRs combining population and father–son pair data. Eur. J. Hum. Genet. 19, 70–75 (2011).
Sun, J.X. et al. A direct characterization of human mutation based on microsatellites. Nat. Genet. 44, 1161–1165 (2012).
Weber, J.L. & Wong, C. Mutation of human short tandem repeats. Hum. Mol. Genet. 2, 1123–1128 (1993).
Ellegren, H. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat. Genet. 24, 400–402 (2000).
Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
Willems, T., Gymrek, M., Poznik, G.D., Tyler-Smith, C. & Erlich, Y. Population-scale sequencing data enable precise estimates of Y-STR mutation rates. Am. J. Hum. Genet. 98, 919–933 (2016).
Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).
Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 14, 590–592 (2017).
1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Mastushita, M. et al. A glutamine repeat variant of the RUNX2 gene causes cleidocranial dysplasia. Mol. Syndromol. 6, 50–53 (2015).
Shibata, A. et al. Characterisation of novel RUNX2 mutation with alanine tract expansion from Japanese cleidocranial dysplasia patient. Mutagenesis 31, 61–67 (2016).
Goodman, F.R. et al. Synpolydactyly phenotypes correlate with size of expansions in HOXD13 polyalanine tract. Proc. Natl. Acad. Sci. USA 94, 7458–7463 (1997).
La Spada, A.R. & Taylor, J.P. Repeat expansion disease: progress and puzzles in disease pathogenesis. Nat. Rev. Genet. 11, 247–258 (2010).
Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Michaelson, J.J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).
Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Natl. Acad. Sci. USA 113, 11901–11906 (2016).
Huang, Q.Y. et al. Mutation patterns at dinucleotide microsatellite loci in humans. Am. J. Hum. Genet. 70, 625–634 (2002).
Haasl, R.J. & Payseur, B.A. Microsatellites as targets of natural selection. Mol. Biol. Evol. 30, 285–298 (2013).
Ballantyne, K.N. et al. Toward male individualization with rapidly mutating Y-chromosomal short tandem repeats. Hum. Mutat. 35, 1021–1032 (2014).
Amos, W., Kosanović, D. & Eriksson, A. Inter-allelic interactions play a major role in microsatellite evolution. Proc. Biol. Sci. 282, 20152125 (2015).
Garza, J.C., Slatkin, M. & Freimer, N.B. Microsatellite allele frequencies in humans and chimpanzees, with implications for constraints on allele size. Mol. Biol. Evol. 12, 594–603 (1995).
Excoffier, L. & Foll, M. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 27, 1332–1334 (2011).
Helgason, A. et al. The Y-chromosome point mutation rate in humans. Nat. Genet. 47, 453–457 (2015).
Poznik, G.D. et al. Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat. Genet. 48, 593–599 (2016).
We thank N. Patterson, M. Daly, Y. Wan, and A. Goren for helpful discussions. D.R. was supported by NIH grants GM100233 and HG006399 and is a Howard Hughes Medical Institute investigator. M.G. was supported by NIH/NIMH grant 1U01MH105669-01. Y.E. holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. This study was supported in part by National Institute of Justice grant 2014-DN-BX-K089 (Y.E., T.W., M.G.) and by a generous gift from Paul and Andria Heafy (Y.E.).
Y.E. is the Chief Science Officer of MyHeritage.com and consults for companies that operate in the DNA forensics domain.
Integrated supplementary information
(a) A mean-centered random walk imposes a length constraint on allele size. The leftmost diagram represents two copies of an AC repeat descended from a common ancestor. The upper right plot shows the number of repeats at each leaf node versus the number of generations passed for 100 simulations of a stepwise model (gray) and two example haplotypes (black; denoted as m and n). The lower right plot shows the same simulations using a mean-centered model. (b) Allelic variance saturates at large TMRCAs. The solid and dashed lines represent the relationship between mean squared allele difference and TMRCA under the stepwise model and mean-centered model, respectively. The mean-centered scenario recapitulates the saturation in the STR molecular clock observed in population data. (c) Example step-size distributions for a mean-centered model. Histograms give probability distributions for step sizes assuming current alleles of –2 (left), 0 (center), or +2 (right) repeats from the optimal allele.
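The contrast between the two models in this caption can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: it uses unit steps only, counts mutation events rather than generations, and assumes a hypothetical linear bias in which the probability of an upward step is 0.5 × (1 − beta × allele). The names `simulate_alleles`, `variance`, and `beta` are introduced here for illustration.

```python
import random


def simulate_alleles(n_walkers, n_mutations, beta, seed=0):
    """Return final allele sizes, in repeats relative to the optimal allele.

    beta = 0 gives a plain stepwise model. beta > 0 biases each unit step
    back toward 0, which is an ASSUMED form of a mean-centered model.
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(n_walkers):
        allele = 0
        for _ in range(n_mutations):
            # Assumed bias: P(step up) shrinks linearly as the allele grows.
            p_up = min(1.0, max(0.0, 0.5 * (1.0 - beta * allele)))
            allele += 1 if rng.random() < p_up else -1
        finals.append(allele)
    return finals


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


stepwise_var = variance(simulate_alleles(200, 2000, beta=0.0))
centered_var = variance(simulate_alleles(200, 2000, beta=0.1))
print("stepwise variance:", stepwise_var)
print("mean-centered variance:", centered_var)
```

Under these assumptions the stepwise variance keeps growing with the number of mutations, while the mean-centered variance levels off, which is the saturation behavior described in panel b.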
(a,b) Step-size distributions for dinucleotides (a) and tetranucleotides (b). Lines give the geometric distribution with parameter p, where p is the probability of a step of a single unit obtained from each study (Supplementary Table 1).
(a) Multiplier versus truth coverage on simulated data. For each mutation rate estimate, the standard error was multiplied by a constant (x axis). Truth coverage is calculated as the percentage of loci for which the true mutation rate falls within the maximum-likelihood estimate ± 1.96 × standard error × multiplier. Colors represent a range of simulated mutation rates. (b) Scaled multiplier (γ) versus truth coverage. The x axis represents the multiplier from panel a scaled by the absolute value of the log of the maximum-likelihood mutation rate estimate. The inset shows data aggregated across mutation rates for simulations with (red) and without (black) stutter noise. (c) Calibrating γ against MUTEA Y-STR estimates. (d) Calibrating γ against Ballantyne Y-STR estimates. (e) Calibrating γ for autosomal STRs.
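The truth coverage calculation described in panel a can be sketched as follows. The function name `truth_coverage` and the example values are hypothetical, and the sketch assumes that the true values, estimates, and standard errors are all on the same scale (for example log10 mutation rate).

```python
def truth_coverage(true_values, estimates, std_errors, multiplier=1.0):
    """Percentage of loci whose true value lies within
    estimate +/- 1.96 * standard error * multiplier."""
    covered = 0
    for truth, est, se in zip(true_values, estimates, std_errors):
        half_width = 1.96 * se * multiplier
        if est - half_width <= truth <= est + half_width:
            covered += 1
    return 100.0 * covered / len(true_values)


# Made-up values for three loci, for illustration only.
true_rates = [-3.0, -3.0, -3.0]
mle_rates = [-3.1, -2.5, -3.0]
std_errs = [0.1, 0.1, 0.1]

print(truth_coverage(true_rates, mle_rates, std_errs, multiplier=1.0))
print(truth_coverage(true_rates, mle_rates, std_errs, multiplier=3.0))
```

Widening the interval with a larger multiplier can only keep or raise the coverage, which is why the multiplier can be tuned until coverage reaches the nominal level.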
(a–d) Plots are shown for the log₁₀(mutation rate) (a), step-size parameter σ² = (2 – p)/p² (b), length constraint (c), and effective length constraint, defined as β/σ² (d).
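The step-size formula in this caption can be checked numerically. The sketch below assumes a symmetric step whose magnitude k ≥ 1 follows a geometric distribution with parameter p, so that the variance equals the second moment of the magnitude. The function names are hypothetical.

```python
def step_size_variance(p):
    """Closed form from the caption: sigma^2 = (2 - p) / p^2."""
    return (2.0 - p) / p ** 2


def step_size_variance_numeric(p, max_step=10_000):
    """Sum k^2 * P(|step| = k), with P(|step| = k) = p * (1 - p)^(k - 1).

    Assumes steps up and down are equally likely, so the mean step is 0.
    """
    return sum(k * k * p * (1.0 - p) ** (k - 1) for k in range(1, max_step + 1))


def effective_length_constraint(beta, p):
    """Effective length constraint as defined in the caption: beta / sigma^2."""
    return beta / step_size_variance(p)


print(step_size_variance(0.9))
print(step_size_variance_numeric(0.9))
print(effective_length_constraint(0.2, 0.9))
```

When p = 1 every mutation is a single-unit step, σ² equals 1, and the effective length constraint reduces to β.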
(a) Adjusting genotypes for stutter errors reduces bias. Solid black lines show simulated values of mutation rate and effective length constraint. Red dots give values estimated from genotypes with simulated stutter errors. Black dots give estimates after inferring stutter parameters and adjusting genotype likelihoods. Dashed gray lines give boundaries enforced during numerical likelihood maximization. (b) Stutter parameters are accurately recovered from simulated data. Black points represent estimated stutter parameters for reads with no simulated stutter errors (d = 0, υ = 0). Red points represent estimated stutter parameters for reads simulated at 5× coverage with stutter parameters υ = 0.1, d = 0.05, and p = 0.9.
(a,b) Mutation parameter estimates for mutation rate (a) and effective length constraint (b) are highly concordant across data sets (n = 48). (c) Y-STR mutation rate estimates are concordant with de novo studies. Each point represents a single Y-STR. Gray dashed lines denote the diagonal (n = 41). (d) Pairwise Pearson correlation between Y-STR estimates from each study.
Each dot represents an individual CODIS locus. The gray dashed line denotes the diagonal (n = 11).
(a,b) Shown are estimated length constraints (a) and step-size parameters (b) for 1,634 STRs also analyzed by Sun et al. Dashed lines give the median estimate across loci. Solid lines give the empirical mutation rate from trio data analyzed by Sun et al. A comparison of mutation rates is shown in Figure 2d.
(a) Probability of stutter deleting repeat units. (b) Probability of stutter inserting repeat units. (c) Parameter describing the geometric distribution of step sizes. Each plot shows the cumulative distribution across all autosomal loci.
Supplementary Figure 10 Comparison of per-locus mutation rate estimates versus rates predicted by MUTEA.
Each point represents a locus. The red line represents the diagonal (n = 480,623).
(a–c) Plots give the cumulative distributions of the per-locus estimates of mutation rate (a), length constraint (b), and step-size parameter (c). Dashed lines indicate the 50th percentile.
Each dot represents the mean mutation rate for each category. The dashed line gives the mean mutation rate across all loci for each motif length.
Distribution of constraint scores for loci with mutation rates not at the lower optimization boundary (gray) and for loci with high or undefined standard errors with mutation rates likely below our mutation rate detection threshold (cyan). The black line denotes a standard normal distribution.
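A score that compares observed with expected mutation rates, and that would follow a standard normal distribution for unconstrained loci, could take the form of a Z-score. The sketch below is an assumed form for illustration only; the exact definition used in the study is not given in this text. The function name, the use of log10 rates, and the way the two standard errors are combined are all assumptions.

```python
import math


def constraint_score(observed_log10_rate, expected_log10_rate,
                     observed_se, expected_se):
    """Hypothetical Z-score: (observed - expected) / combined standard error.

    Negative values mean the locus mutates more slowly than expected.
    The combined error assumes the two estimates are independent.
    """
    combined_se = math.sqrt(observed_se ** 2 + expected_se ** 2)
    return (observed_log10_rate - expected_log10_rate) / combined_se


# Made-up example: observed rate 100-fold lower than expected.
score = constraint_score(-5.0, -3.0, observed_se=0.6, expected_se=0.8)
print(score)
```

Under this assumed form, strongly negative scores would flag loci that are less mutable than their sequence features predict.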
Red denotes brain tissues. The gray line gives P = 0.05. Constraint score distributions were compared in the top 20% versus the bottom 80% of expressed genes in each tissue. The x axis gives unadjusted P values.
About this article
Cite this article
Gymrek, M., Willems, T., Reich, D. et al. Interpreting short tandem repeat variations in humans using mutational constraint. Nat Genet 49, 1495–1501 (2017). https://doi.org/10.1038/ng.3952