Characterization and visualization of tandem repeats at genome scale

Abstract

Tandem repeat (TR) variation is associated with gene expression changes and numerous rare monogenic diseases. Although long-read sequencing provides accurate full-length sequences and methylation of TRs, there is still a need for computational methods to profile TRs across the genome. Here we introduce the Tandem Repeat Genotyping Tool (TRGT) and an accompanying TR database. TRGT determines the consensus sequences and methylation levels of specified TRs from PacBio HiFi sequencing data. It also reports reads that support each repeat allele. These reads can be subsequently visualized with a companion TR visualization tool. Assessing 937,122 TRs, TRGT showed a Mendelian concordance of 98.38%, allowing a single repeat unit difference. In six samples with known repeat expansions, TRGT detected all expansions while also identifying methylation signals and mosaicism and providing finer repeat length resolution than existing methods. Additionally, we released a database with allele sequences and methylation levels for 937,122 TRs across 100 genomes.
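The Mendelian concordance figure quoted above can be illustrated with a minimal sketch. It is not the TRGT implementation: it assumes each genotype is a pair of allele lengths expressed in repeat units, and the function names and the `tolerance` parameter are hypothetical. A trio counts as consistent if one child allele lies within the tolerance of a paternal allele and the other lies within the tolerance of a maternal allele.

```python
from typing import Iterable, Tuple

Genotype = Tuple[int, int]  # two allele lengths, in repeat units (assumed encoding)


def is_mendelian_consistent(child: Genotype, father: Genotype,
                            mother: Genotype, tolerance: int = 1) -> bool:
    """Return True if the child genotype is explainable by the parents.

    `tolerance` is the allowed difference in repeat units; 1 mirrors the
    "single repeat unit difference" described in the abstract.
    """
    def matches(allele: int, parent: Genotype) -> bool:
        # The allele matches if it is close enough to either parental allele.
        return any(abs(allele - p) <= tolerance for p in parent)

    c1, c2 = child
    # Try both assignments of child alleles to parents.
    return ((matches(c1, father) and matches(c2, mother)) or
            (matches(c1, mother) and matches(c2, father)))


def mendelian_concordance(trios: Iterable[Tuple[Genotype, Genotype, Genotype]],
                          tolerance: int = 1) -> float:
    """Fraction of (child, father, mother) genotype triples that are consistent."""
    trios = list(trios)
    if not trios:
        raise ValueError("at least one trio genotype is required")
    consistent = sum(is_mendelian_consistent(c, f, m, tolerance)
                     for c, f, m in trios)
    return consistent / len(trios)


# Illustrative (made-up) genotypes at three loci: child, father, mother.
example_trios = [
    ((10, 12), (10, 11), (12, 15)),  # exact match
    ((10, 13), (10, 11), (12, 15)),  # consistent only with a one-unit tolerance
    ((20, 30), (10, 11), (12, 15)),  # inconsistent
]
```

With these made-up genotypes, two of the three loci are consistent when a one-unit difference is allowed and only one is consistent when exact matches are required, which shows how the tolerance affects the reported rate.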

Fig. 1: An overview of TRGT and TRVZ.
Fig. 2: TRGT benchmarks.
Fig. 3: Genetic and epigenetic variation of n = 937,122 TR regions across 100 HPRC samples.
Fig. 4: Genetic variation of RFC1 repeat alleles.
Fig. 5: Genetic and epigenetic variation of FMR1 repeat.

Data availability

PacBio Revio sequencing of HG002, HG003 and HG004 samples has been deposited to the Sequence Read Archive (SRA)82. Version 0.7 of the HG002 assembly from the Telomere-to-Telomere Consortium was downloaded from GitHub40,83. The data created as part of Genomic Answers for Kids are available through NIH/NCBI dbGaP, accession number phs002206 (ref. 84). Human Pangenome Reference Consortium data are available at the SRA under BioProject ID PRJNA850430 (ref. 85) and the AWS Registry of Open Data86. The short-read data for HG002, HG003 and HG004 are available from the 1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7 within the AWS Registry of Open Data87. TRGT repeat catalogs and TRGTdb for 100 HPRC samples have been deposited into a dedicated Zenodo repository88.

Code availability

The source code of TRGT, TRVZ and TRGTdb is available on GitHub64.

References

  1. English, A. et al. Benchmarking of small and large variants across tandem repeats. Preprint at bioRxiv https://doi.org/10.1101/2023.10.29.564632 (2023).

  2. Caron, N. S., Wright, G. E. B. & Hayden, M. R. Huntington disease. In GeneReviews® (eds. Adam, M. P. et al.) (Univ. Washington, 1998).

  3. Siddique, N. & Siddique, T. Amyotrophic lateral sclerosis overview. In GeneReviews® (eds. Adam, M. P. et al.) (Univ. Washington, 2001).

  4. Hunter, J. E., Berry-Kravis, E., Hipp, H. & Todd, P. K. FMR1 disorders. In GeneReviews® (eds. Adam, M. P. et al.) (Univ. Washington, 1998).

  5. Gymrek, M. et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat. Genet. 48, 22–29 (2016).

  6. Erwin, G. S. et al. Recurrent repeat expansions in human cancer genomes. Nature 613, 96–102 (2023).

  7. Li, K., Luo, H., Huang, L., Luo, H. & Zhu, X. Microsatellite instability: a review of what the oncologist should know. Cancer Cell Int. 20, 16 (2020).

  8. Trost, B. et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature 586, 80–86 (2020).

  9. Mojarad, B. A. et al. Genome-wide tandem repeat expansions contribute to schizophrenia risk. Mol. Psychiatry 27, 3692–3698 (2022).

  10. Morales, F. et al. Somatic instability of the expanded CTG triplet repeat in myotonic dystrophy type 1 is a heritable quantitative trait and modifier of disease severity. Hum. Mol. Genet. 21, 3558–3567 (2012).

  11. Morales, F. et al. Longitudinal increases in somatic mosaicism of the expanded CTG repeat in myotonic dystrophy type 1 are associated with variation in age-at-onset. Hum. Mol. Genet. 29, 2496–2507 (2020).

  12. Overend, G. et al. Allele length of the DMPK CTG repeat is a predictor of progressive myotonic dystrophy type 1 phenotypes. Hum. Mol. Genet. 28, 2245–2254 (2019).

  13. Press, M. O., Carlson, K. D. & Queitsch, C. The overdue promise of short tandem repeat variation for heritability. Trends Genet. 30, 504–512 (2014).

  14. Payseur, B. A., Place, M. & Weber, J. L. Linkage disequilibrium between STRPs and SNPs across the human genome. Am. J. Hum. Genet. 82, 1039–1050 (2008).

  15. Zhou, Y. et al. Robust fragile X (CGG)n genotype classification using a methylation specific triple PCR assay. J. Med. Genet. 41, e45 (2004).

  16. Tarleton, J. Detection of FMR1 trinucleotide repeat expansion mutations using Southern blot and PCR methodologies. In Neurogenetics: Methods and Protocols (ed. Potter, N. T.) 29–39 (Springer, 2003).

  17. Rajan-Babu, I. S., Law, H. Y., Yoon, C. S., Lee, C. G. & Chong, S. S. Simplified strategy for rapid first-line screening of fragile X syndrome: closed-tube triplet-primed PCR and amplicon melt peak analysis. Expert Rev. Mol. Med. 17, e7 (2015).

  18. Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012).

  19. Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 14, 590–592 (2017).

  20. Dolzhenko, E. et al. Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res. 27, 1895–1903 (2017).

  21. Dashnow, H. et al. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 19, 121 (2018).

  22. Mousavi, N., Shleizer-Burko, S., Yanicky, R. & Gymrek, M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 47, e90 (2019).

  23. Dolzhenko, E. et al. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35, 4754–4756 (2019).

  24. Dolzhenko, E. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol. 21, 102 (2020).

  25. Dashnow, H. et al. STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci. Genome Biol. 23, 257 (2022).

  26. Hannan, A. J. Tandem repeats mediating genetic plasticity in health and disease. Nat. Rev. Genet. 19, 286–298 (2018).

  27. Ibañez, K. et al. Whole genome sequencing for the diagnosis of neurological repeat expansion disorders in the UK: a retrospective diagnostic accuracy and prospective clinical validation study. Lancet Neurol. 21, 234–245 (2022).

  28. Giesselmann, P. et al. Analysis of short tandem repeat expansions and their methylation state with nanopore sequencing. Nat. Biotechnol. 37, 1478–1481 (2019).

  29. Mitsuhashi, S. et al. Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biol. 20, 58 (2019).

  30. Chiu, R., Rajan-Babu, I. S., Friedman, J. M. & Birol, I. Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences. Genome Biol. 22, 224 (2021).

  31. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  32. De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).

  33. Oostra, B. A. & Willemsen, R. FMR1: a gene with three faces. Biochim. Biophys. Acta 1790, 467–477 (2009).

  34. Roy, S. et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. J. Mol. Diagn. 20, 4–27 (2018).

  35. Bakhtiari, M., Shleizer-Burko, S., Gymrek, M., Bansal, V. & Bafna, V. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res. 28, 1709–1719 (2018).

  36. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

  37. English, A. Project Adotto Tandem-Repeat Regions and Annotations. Zenodo https://doi.org/10.5281/zenodo.7013709 (2022).

  38. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).

  39. Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).

  40. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 41, 1474–1482 (2023).

  41. Tsai, Y. C. et al. Amplification-free, CRISPR–Cas9 targeted enrichment and SMRT sequencing of repeat-expansion disease causative genomic regions. Preprint at bioRxiv https://doi.org/10.1101/203919 (2017).

  42. Grosso, V. et al. Characterization of FMR1 repeat expansion and intragenic variants by indirect sequence capture. Front. Genet. 12, 743230 (2021).

  43. Mousavi, N. et al. TRTools: a toolkit for genome-wide analysis of tandem repeats. Bioinformatics 37, 731–733 (2020).

  44. Ziaei Jam, H. et al. A deep population reference panel of tandem repeat variation. Nat. Commun. 14, 6711 (2023).

  45. Dreos, R., Ambrosini, G., Cavin Périer, R. & Bucher, P. EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res. 41, D157–D164 (2013).

  46. Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, D493–D496 (2004).

  47. Vavouri, T. & Lehner, B. Human genes with CpG island promoters have a distinct transcription-associated chromatin organization. Genome Biol. 13, R110 (2012).

  48. Takai, D. & Jones, P. A. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl Acad. Sci. USA 99, 3740–3745 (2002).

  49. Rafehi, H. et al. Bioinformatics-based identification of expanded repeats: a non-reference intronic pentamer expansion in RFC1 causes CANVAS. Am. J. Hum. Genet. 105, 151–165 (2019).

  50. Cortese, A. et al. Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia. Nat. Genet. 51, 649–658 (2019).

  51. Akçimen, F. et al. Investigation of the RFC1 repeat expansion in a Canadian and a Brazilian ataxia cohort: identification of novel conformations. Front. Genet. 10, 1219 (2019).

  52. Fan, Y. et al. No biallelic intronic AAGGG repeat expansion in RFC1 was found in patients with late-onset ataxia and MSA. Parkinsonism Relat. Disord. 73, 1–2 (2020).

  53. Hagerman, R. J. et al. Fragile X syndrome. Nat. Rev. Dis. Primers 3, 17065 (2017).

  54. Yrigollen, C. M. et al. AGG interruptions and maternal age affect FMR1 CGG repeat allele stability during transmission. J. Neurodev. Disord. 6, 24 (2014).

  55. Huang, W. et al. Distribution of fragile X mental retardation 1 CGG repeat and flanking haplotypes in a large Chinese population. Mol. Genet. Genomic Med. 3, 172–181 (2015).

  56. Depienne, C. & Mandel, J. L. 30 years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am. J. Hum. Genet. 108, 764–785 (2021).

  57. Ashley, E. A. Towards precision medicine. Nat. Rev. Genet. 17, 507–522 (2016).

  58. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  59. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  60. Ward Jr, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963).

  61. TRGTdb tutorial. https://github.com/ACEnglish/trgt/blob/main/tdb_tutorial.md

  62. Stovner, E. B. & Sætrom, P. PyRanges: efficient comparison of genomic intervals in Python. Bioinformatics 36, 918–919 (2020).

  63. ACEnglish/trgt. https://github.com/ACEnglish/trgt/tree/main/notebooks

  64. Dolzhenko, E. et al. TRGT: tandem repeat genotyper. Github https://github.com/PacificBiosciences/trgt/ (2023).

  65. Index of /ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/LowComplexity. https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/LowComplexity/

  66. Table Browser. https://genome.ucsc.edu/cgi-bin/hgTables

  67. Repeats. http://useast.ensembl.org/info/genome/genebuild/assembly_repeats.html

  68. Bakhtiari, M., Park, J., Javadzadeh, S., Homer, N. & De Coster, W. A tool for genotyping Variable Number Tandem Repeats (VNTR) from sequence data. Github https://github.com/mehrdadbakhtiari/adVNTR (2023).

  69. Qiu, Y. J., Deshpande, V., Avdeyev, P., Dolzhenko, E. & Eberle, M. A. Illumina/RepeatCatalogs. Github https://github.com/Illumina/RepeatCatalogs (2023).

  70. Lucas, J., Li, H. & Jeltje human-pangenomics/HPP_Year1_Assemblies. Assemblies from HPP Year 1 production. Github https://github.com/human-pangenomics/HPP_Year1_Assemblies (2023).

  71. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).

  72. Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).

  73. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  74. Cohen, A. S. A. et al. Genomic answers for children: dynamic analyses of >1000 pediatric rare disease genomes. Genet. Med. 24, 1336–1348 (2022).

  75. Cheung, W. A. et al. Direct haplotype-resolved 5-base HiFi sequencing for genome-wide profiling of hypermethylation outliers in a rare disease cohort. Nat. Commun. 14, 3090 (2023).

  76. Pedersen, B. S. et al. Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches. Genome Med. 12, 62 (2020).

  77. Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124–1132 (2009).

  78. Töpfer, A. et al. PacificBiosciences/pbmm2. A minimap2 frontend for PacBio native data formats. Github https://github.com/PacificBiosciences/pbmm2 (2023).

  79. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

  80. Granger, B. E. & Perez, F. Jupyter: thinking and storytelling with code and data. Comput. Sci. Eng. 23, 7–14 (2021).

  81. pandas-dev/pandas: Pandas. Zenodo https://doi.org/10.5281/zenodo.10045529 (2023).

  82. Homo sapiens (human): WGS of GIAB HG002-4 trio with PacBio HiFi. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1028149 (2023).

  83. Hansen, N. F., Phillippy, A., Koren, S. & Walenz, B. Telomere-to-telomere consortium HG002 ‘Q100’ project. Github https://github.com/marbl/hg002 (2023).

  84. Genomic Answers for Kids (GA4K). dbGaP. https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002206.v4.p1

  85. Homo sapiens: Human Pangenome Reference Consortium (HPRC). https://www.ncbi.nlm.nih.gov/bioproject/730823 (2021).

  86. Human PanGenomics Project. https://registry.opendata.aws/hpgp-data/

  87. 1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7. https://registry.opendata.aws/ilmn-dragen-1kgp/

  88. Dolzhenko, E. & English, A. Repeat catalogs for TRGT. Zenodo https://doi.org/10.5281/zenodo.8329210 (2023).

Acknowledgements

We would like to thank M. Gymrek, I. Deveson and the anonymous reviewers for helping us to substantially improve the manuscript and TRGT. We are grateful to the Telomere-to-Telomere Consortium, the Human Pangenome Reference Consortium and the Genome in a Bottle Consortium for releasing datasets essential for this study. We would also like to acknowledge many TRGT users who provided valuable feedback that helped us to substantially improve the tool. We thank generous donors to the Genomic Answers for Kids program at Children’s Mercy Kansas City. A.E. was supported by grant HHSN268201800002I. H.D. was supported by grants K99HG012796 and 5T32HG008962-07. P.J. was supported by grants NS111602, HD104458 and HD104463. D.L.N. was supported by grants HD104463, NS051630 and HD103555. S.Z. was supported by grant 2R01NS072248. T.P. was supported by grant UL1TR002366. A.R.Q. was supported by grant R01HG010757. F.J.S. was supported by grants 1U01HG011758-01, 3OT2OD002751 and 1UG3NS132105-01.

Author information

Contributions

E.D. and M.A.E. devised and implemented the initial versions of TRGT and TRVZ. A.E. and F.J.S. implemented TRGTdb. H.D. performed analysis of samples with known expansions, in collaboration with W.A.C., C.B., E.F. and T.P. H.D., W.J.R., Z.K. and A.W. guided the development of TRGT. G.D.S.B., E.D., H.D. and M.C.D. performed benchmarking analyses. T.M. and G.D.S.B. contributed major improvements to the TRGT source code. E.D., H.D., A.E., G.D.S.B. and T.M. performed TR analyses in the HPRC samples. V.M.-C., T.D.B., P.J. and D.L.N. generated sequencing from prefrontal cortex samples of individuals with FMR1 expansions. M.A.E., F.J.S., A.R.Q., T.P. and S.Z. provided guidance and supervision. E.D., A.E., H.D., F.J.S. and M.A.E. wrote the manuscript, with assistance from C.K., K.P.C., W.J.R., Z.K., A.W. and A.R.Q. All authors read and approved the manuscript.

Corresponding author

Correspondence to Michael A. Eberle.

Ethics declarations

Competing interests

E.D., G.D.S.B., T.M., W.J.R., C.K., Z.K., K.P.C., A.W. and M.A.E. are employees and shareholders of Pacific Biosciences. F.J.S. received research support from Illumina, Pacific Biosciences, Nanopore and Genentech. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Ira Deveson, Melissa Gymrek and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–10, Supplementary Tables 1 and 2 and Supplementary Note

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dolzhenko, E., English, A., Dashnow, H. et al. Characterization and visualization of tandem repeats at genome scale. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-023-02057-3

  • DOI: https://doi.org/10.1038/s41587-023-02057-3
