Interpreting short tandem repeat variations in humans using mutational constraint

Abstract

Identifying regions of the genome that are depleted of mutations can distinguish potentially deleterious variants. Short tandem repeats (STRs), also known as microsatellites, are among the largest contributors of de novo mutations in humans. However, per-locus studies of STR mutations have been limited to highly ascertained panels of several dozen loci. Here we harnessed bioinformatics tools and a novel analytical framework to estimate mutation parameters for each STR in the human genome by correlating STR genotypes with local sequence heterozygosity. We applied our method to obtain robust estimates of the impact of local sequence features on mutation parameters and used these estimates to create a framework for measuring constraint at STRs by comparing observed versus expected mutation rates. Constraint scores identified known pathogenic variants with early-onset effects. Our metric will provide a valuable tool for prioritizing pathogenic STRs in medical genetics studies.

Figure 1: Estimating STR mutation parameters from diploid data.
Figure 2: Accurate estimation of STR mutation parameters from simulated data.
Figure 3: A framework for measuring STR constraint.
Figure 4: Constraint scores can be used for STR prioritization.

References

  1. Samocha, K.E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).

  2. Petrovski, S., Wang, Q., Heinzen, E.L., Allen, A.S. & Goldstein, D.B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).

  3. Gulko, B., Hubisz, M.J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).

  4. di Iulio, J. et al. The human functional genome defined by genetic diversity. Preprint at bioRxiv http://dx.doi.org/10.1101/082362 (2016).

  5. Willems, T., Gymrek, M., Highnam, G., Mittelman, D. & Erlich, Y. The landscape of human STR variation. Genome Res. 24, 1894–1904 (2014).

  6. Mirkin, S.M. Expandable DNA repeats and human disease. Nature 447, 932–940 (2007).

  7. Houge, G., Bruland, O., Bjørnevoll, I., Hayden, M.R. & Semaka, A. De novo Huntington disease caused by 26–44 CAG repeat expansion on a low-risk haplotype. Neurology 81, 1099–1100 (2013).

  8. Amiel, J., Trochet, D., Clément-Ziza, M., Munnich, A. & Lyonnet, S. Polyalanine expansions in human. Hum. Mol. Genet. 13, R235–R243 (2004).

  9. Press, M.O., Carlson, K.D. & Queitsch, C. The overdue promise of short tandem repeat variation for heritability. Trends Genet. 30, 504–512 (2014).

  10. Gymrek, M. et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat. Genet. 48, 22–29 (2016).

  11. Quilez, J. et al. Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res. 44, 3750–3762 (2016).

  12. Hause, R.J., Pritchard, C.C., Shendure, J. & Salipante, S.J. Classification and characterization of microsatellite instability across 18 cancer types. Nat. Med. 22, 1342–1350 (2016).

  13. Ballantyne, K.N. et al. Mutability of Y-chromosomal microsatellites: rates, characteristics, molecular bases, and forensic implications. Am. J. Hum. Genet. 87, 341–353 (2010).

  14. Burgarella, C. & Navascués, M. Mutation rate estimates for 110 Y-chromosome STRs combining population and father–son pair data. Eur. J. Hum. Genet. 19, 70–75 (2011).

  15. Sun, J.X. et al. A direct characterization of human mutation based on microsatellites. Nat. Genet. 44, 1161–1165 (2012).

  16. Weber, J.L. & Wong, C. Mutation of human short tandem repeats. Hum. Mol. Genet. 2, 1123–1128 (1993).

  17. Ellegren, H. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat. Genet. 24, 400–402 (2000).

  18. Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).

  19. Willems, T., Gymrek, M., Poznik, G.D., Tyler-Smith, C. & Erlich, Y. Population-scale sequencing data enable precise estimates of Y-STR mutation rates. Am. J. Hum. Genet. 98, 919–933 (2016).

  20. Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).

  21. Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 14, 590–592 (2017).

  22. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).

  23. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

  24. Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012).

  25. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

  26. Mastushita, M. et al. A glutamine repeat variant of the RUNX2 gene causes cleidocranial dysplasia. Mol. Syndromol. 6, 50–53 (2015).

  27. Shibata, A. et al. Characterisation of novel RUNX2 mutation with alanine tract expansion from Japanese cleidocranial dysplasia patient. Mutagenesis 31, 61–67 (2016).

  28. Goodman, F.R. et al. Synpolydactyly phenotypes correlate with size of expansions in HOXD13 polyalanine tract. Proc. Natl. Acad. Sci. USA 94, 7458–7463 (1997).

  29. La Spada, A.R. & Taylor, J.P. Repeat expansion disease: progress and puzzles in disease pathogenesis. Nat. Rev. Genet. 11, 247–258 (2010).

  30. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  31. Michaelson, J.J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).

  32. Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Natl. Acad. Sci. USA 113, 11901–11906 (2016).

  33. Huang, Q.Y. et al. Mutation patterns at dinucleotide microsatellite loci in humans. Am. J. Hum. Genet. 70, 625–634 (2002).

  34. Haasl, R.J. & Payseur, B.A. Microsatellites as targets of natural selection. Mol. Biol. Evol. 30, 285–298 (2013).

  35. Ballantyne, K.N. et al. Toward male individualization with rapidly mutating Y-chromosomal short tandem repeats. Hum. Mutat. 35, 1021–1032 (2014).

  36. Amos, W., Kosanović, D. & Eriksson, A. Inter-allelic interactions play a major role in microsatellite evolution. Proc. Biol. Sci. 282, 20152125 (2015).

  37. Garza, J.C., Slatkin, M. & Freimer, N.B. Microsatellite allele frequencies in humans and chimpanzees, with implications for constraints on allele size. Mol. Biol. Evol. 12, 594–603 (1995).

  38. Excoffier, L. & Foll, M. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 27, 1332–1334 (2011).

  39. Helgason, A. et al. The Y-chromosome point mutation rate in humans. Nat. Genet. 47, 453–457 (2015).

  40. Poznik, G.D. et al. Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat. Genet. 48, 593–599 (2016).

Acknowledgements

We thank N. Patterson, M. Daly, Y. Wan, and A. Goren for helpful discussions. D.R. was supported by NIH grants GM100233 and HG006399 and is a Howard Hughes Medical Institute investigator. M.G. was supported by NIH/NIMH grant 1U01MH105669-01. Y.E. holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. This study was supported in part by National Institute of Justice grant 2014-DN-BX-K089 (Y.E., T.W., M.G.) and by a generous gift from Paul and Andria Heafy (Y.E.).

Author information

Contributions

M.G., D.R., and Y.E. conceived the study. M.G. prepared the initial manuscript and performed analyses. T.W. developed the likelihood-maximization procedure and helped design analyses. All authors contributed to the development of the mutation model and mutation rate estimation technique.

Corresponding author

Correspondence to Melissa Gymrek.

Ethics declarations

Competing interests

Y.E. is the Chief Science Officer of MyHeritage.com and consults for companies that operate in the DNA forensics domain.

Integrated supplementary information

Supplementary Figure 1 Modeling the STR mutation process.

(a) A mean-centered random walk imposes a length constraint on allele size. The leftmost diagram represents two copies of an AC repeat descended from a common ancestor. The upper right plot shows the number of repeats at each leaf node versus the number of generations passed for 100 simulations of a stepwise model (gray) and two example haplotypes (black; denoted as m and n). The lower right plot shows the same simulations using a mean-centered model. (b) Allelic variance saturates at large TMRCAs. The solid and dashed lines represent the relationship between mean squared allele difference and TMRCA under the stepwise model and mean-centered model, respectively. The mean-centered scenario recapitulates the saturation in the STR molecular clock observed in population data. (c) Example step-size distributions for a mean-centered model. Histograms give probability distributions for step sizes assuming current alleles of –2 (left), 0 (center), or +2 (right) repeats from the optimal allele.
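The contrast between the two models in panels a and b can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: it assumes unit steps only, and the parameter `beta` (strength of the pull toward the optimal allele, taken here as 0), the function names, and the unrealistically high mutation rate are illustrative choices made so that the example runs quickly.

```python
import random

def simulate_allele(generations, mu, beta, rng):
    """Return the allele (repeats relative to the optimum, 0) after `generations`."""
    allele = 0
    for _ in range(generations):
        if rng.random() < mu:  # a mutation occurs in this generation
            # beta = 0: unbiased stepwise model.
            # beta > 0: steps are biased back toward the optimum (assumed linear form).
            p_up = min(1.0, max(0.0, 0.5 * (1.0 - beta * allele)))
            allele += 1 if rng.random() < p_up else -1
    return allele

def mean_squared_difference(generations, mu, beta, n_pairs=300, seed=0):
    """Average (m - n)^2 over simulated pairs of haplotypes m and n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_pairs):
        m = simulate_allele(generations, mu, beta, rng)
        n = simulate_allele(generations, mu, beta, rng)
        total += (m - n) ** 2
    return total / n_pairs

stepwise = mean_squared_difference(1000, mu=0.05, beta=0.0)
centered = mean_squared_difference(1000, mu=0.05, beta=0.3)
print(stepwise, centered)
```

Under these assumptions the stepwise value keeps growing with the number of generations, whereas the mean-centered value levels off, which is the saturation behavior described in panel b.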

Supplementary Figure 2 Previously reported step-size distributions.

(a,b) Step-size distributions for dinucleotides (a) and tetranucleotides (b). Lines give the geometric distribution with parameter p, where p is the probability of a step of a single unit obtained from each study (Supplementary Table 1).

Supplementary Figure 3 Calibrating standard errors.

(a) Multiplier versus truth coverage on simulated data. For each mutation rate estimate, the standard error was multiplied by a constant (x axis). Truth coverage is calculated as the percentage of loci for which the true mutation rate falls within the maximum-likelihood estimate ± 1.96 × standard error × multiplier. Colors represent a range of simulated mutation rates. (b) Scaled multiplier (γ) versus truth coverage. The x axis represents the multiplier from panel a, scaled by the absolute value of the log of the maximum-likelihood mutation rate estimate. The inset shows data aggregated across mutation rates for simulations with (red) and without (black) stutter noise. (c) Calibrating γ against MUTEA Y-STR estimates. (d) Calibrating γ against Ballantyne Y-STR estimates. (e) Calibrating γ for autosomal STRs.
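The truth-coverage calculation described for panel a can be written as a short function. This is a sketch under stated assumptions: the function name and argument names are hypothetical, the inputs are taken to be on whatever scale the estimates and standard errors share (for example log10 mutation rate), and the γ scaling of panel b is not implemented.

```python
def truth_coverage(true_values, estimates, std_errors, multiplier=1.0, z=1.96):
    """Percentage of loci whose true value lies within
    estimate +/- z * standard error * multiplier."""
    covered = 0
    for truth, est, se in zip(true_values, estimates, std_errors):
        half_width = z * se * multiplier
        if est - half_width <= truth <= est + half_width:
            covered += 1
    return 100.0 * covered / len(true_values)

# Two toy loci: the first interval covers the truth, the second does not
# until the standard error is inflated by a larger multiplier.
truths = [-3.0, -4.0]
estimates = [-3.1, -5.0]
std_errors = [0.1, 0.1]
print(truth_coverage(truths, estimates, std_errors))
print(truth_coverage(truths, estimates, std_errors, multiplier=6.0))
```

A well-calibrated multiplier is one for which this percentage is close to the nominal 95% level.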

Supplementary Figure 4 Per-locus simulation results.

(a–d) Plots are shown for the log10 (mutation rate) (a), step-size parameter σ² = (2 – p)/p² (b), length constraint (c), and effective length constraint, defined as β/σ² (d).
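The step-size parameter in panel b, (2 - p)/p^2, equals the second moment of a geometric distribution on step sizes 1, 2, 3, ... with parameter p. The following numeric check is a sketch of that identity only; the function names are hypothetical, and reading the parameter as the variance of a signed, symmetric step is an interpretation rather than something stated in the caption.

```python
def geometric_pmf(k, p):
    """P(step size = k) for k = 1, 2, ... under a geometric distribution."""
    return p * (1.0 - p) ** (k - 1)

def step_size_second_moment(p, max_k=10_000):
    """Numerically sum k^2 * P(step size = k), truncated at max_k."""
    return sum(k * k * geometric_pmf(k, p) for k in range(1, max_k + 1))

p = 0.9
numeric = step_size_second_moment(p)
closed_form = (2.0 - p) / p ** 2
print(numeric, closed_form)
```

When p is close to 1 nearly all mutations are single-unit steps and the parameter approaches 1; smaller p puts more weight on multi-unit steps and inflates it.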

Supplementary Figure 5 Modeling genotyping errors.

(a) Adjusting genotypes for stutter errors reduces bias. Solid black lines show simulated values of mutation rate and effective length constraint. Red dots give values estimated from genotypes with simulated stutter errors. Black dots give estimates after inferring stutter parameters and adjusting genotype likelihoods. Dashed gray lines give boundaries enforced during numerical likelihood maximization. (b) Stutter parameters are accurately recovered from simulated data. Black points represent estimated stutter parameters for reads with no simulated stutter errors (d = 0, υ = 0). Red points represent estimated stutter parameters for reads simulated at 5× coverage with stutter parameters υ = 0.1, d = 0.05, and p = 0.9.
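A stutter model with the three parameters named in panel b can be sketched as follows. This is an illustrative form, not necessarily the exact model used in the study: it assumes υ (written `u` below) is the probability that a read gains repeat units, d is the probability that it loses repeat units, and the size of the change follows a geometric distribution with parameter p. The function name is hypothetical.

```python
def stutter_prob(delta, u, d, p):
    """Probability that a read differs from the true allele by `delta` repeats.

    delta > 0: stutter added repeats (probability u)
    delta < 0: stutter removed repeats (probability d)
    delta = 0: read reports the true allele
    """
    if delta == 0:
        return 1.0 - u - d
    size_prob = p * (1.0 - p) ** (abs(delta) - 1)  # geometric step size
    return (u if delta > 0 else d) * size_prob

# Parameter values taken from the simulation described in panel b.
u, d, p = 0.1, 0.05, 0.9
for delta in (-2, -1, 0, 1, 2):
    print(delta, stutter_prob(delta, u, d, p))
```

Under these assumptions, genotype likelihoods can be adjusted by weighting each observed read length by its stutter probability given the candidate allele.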

Supplementary Figure 6 Validating mutation parameters at Y-STRs.

(a,b) Mutation parameter estimates for mutation rate (a) and effective length constraint (b) are highly concordant across data sets (n = 48). (c) Y-STR mutation rate estimates are concordant with de novo studies. Each point represents a single Y-STR. Gray dashed lines denote the diagonal (n = 41). (d) Pairwise Pearson correlation between Y-STR estimates from each study.

Supplementary Figure 7 CODIS mutation rate estimates are concordant with de novo studies.

Each dot represents an individual CODIS locus. The gray dashed line denotes the diagonal (n = 11).

Supplementary Figure 8 Comparison of autosomal mutation parameters with de novo studies.

(a,b) Shown are estimated length constraints (a) and step-size parameters (b) for 1,634 STRs also analyzed by Sun et al. Dashed lines give the median estimate across loci. Solid lines give the empirical mutation rate from trio data analyzed by Sun et al. A comparison of mutation rates is shown in Figure 2d.

Supplementary Figure 9 Per-locus stutter parameter estimates by repeat motif length.

(a) Probability of stutter deleting repeat units. (b) Probability of stutter inserting repeat units. (c) Parameter describing the geometric distribution of step sizes. Each plot shows the cumulative distribution across all autosomal loci.

Supplementary Figure 10 Comparison of per-locus mutation rate estimates versus rates predicted by MUTEA.

Each point represents a locus. The red line represents the diagonal (n = 480,623).

Supplementary Figure 11 Per-locus estimates of mutation parameters.

(a–c) Plots give the cumulative distributions of the per-locus estimates of mutation rate (a), length constraint (b), and step-size parameter (c). Dashed lines indicate the 50th percentile.

Supplementary Figure 12 Relationship between mutation rate and local sequence features.

Each dot represents the mean mutation rate for each category. The dashed line gives the mean mutation rate across all loci for each motif length.

Supplementary Figure 13 Constraint score distribution.

Distribution of constraint scores for loci with mutation rates not at the lower optimization boundary (gray) and for loci with high or undefined standard errors with mutation rates likely below our mutation rate detection threshold (cyan). The black line denotes a standard normal distribution.
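Because the scores are compared against a standard normal distribution, one plausible form for a constraint score is a Z-score of the observed against the expected mutation rate. The sketch below is an assumption and not the definition given in the paper: the function name, the use of the log10 scale, and the single standard-error term are all hypothetical.

```python
import math

def constraint_score(observed_log10_mu, expected_log10_mu, std_error):
    """Assumed Z-score form: negative values mean the locus mutates
    more slowly than expected, i.e. it appears more constrained."""
    # Loci with undefined or unusable standard errors get no score.
    if std_error is None or not math.isfinite(std_error) or std_error <= 0:
        return None
    return (observed_log10_mu - expected_log10_mu) / std_error

print(constraint_score(-5.0, -4.0, 0.5))
print(constraint_score(-4.0, -4.0, float("nan")))
```

Returning no score for undefined standard errors mirrors the separate treatment of such loci in the figure, although how the study actually handles them is not specified here.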

Supplementary Figure 14 Enrichment of constrained STRs in highly expressed genes.

Red denotes brain tissues. The gray line gives P = 0.05. Constraint score distributions were compared in the top 20% versus the bottom 80% of expressed genes in each tissue. The x axis gives unadjusted P values.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14, Supplementary Tables 1–5 and Supplementary Note (PDF 3106 kb)

Life Sciences Reporting Summary (PDF 129 kb)

About this article

Cite this article

Gymrek, M., Willems, T., Reich, D. et al. Interpreting short tandem repeat variations in humans using mutational constraint. Nat Genet 49, 1495–1501 (2017). https://doi.org/10.1038/ng.3952

  • DOI: https://doi.org/10.1038/ng.3952
