Short tandem repeats (STRs) are repetitive DNA sequences composed of units of typically 2–6 base pairs. These sequences are hypermutable and highly polymorphic, making them potential contributors to diverse phenotypes and disorders [1]. To date, approximately 50 STR disorders have been identified, predominantly neuromuscular and neuropsychiatric conditions [2,3,4]. Although long-read sequencing technologies offer advantages for STR investigations, genomic data generation in clinical settings has primarily relied on short-read sequencing because of its cost-effectiveness. Fortunately, computational tools such as ExpansionHunter [5] have facilitated the reliable detection of repeat expansions in short-read datasets. Recent studies have demonstrated the feasibility of STR analysis in large-scale short-read genomes or exomes [6,7,8,9,10,11]. Therefore, we aimed to explore the diagnostic utility of STR analysis for identifying pathogenic repeat expansions using exome sequencing.

Materials and methods

Study cohorts and sequencing

The study cohorts comprised 6,099 exomes, derived from 2,510 Korean families with rare diseases, who underwent exome sequencing as part of further diagnostic work-ups (Supplementary Table 1).

Short tandem repeat analysis

Based on previous reports [7,8,9], we used ExpansionHunter (v5.0) [5] to detect repeat expansions within the target STRs. We selected 21 loci within 20 genes that were sufficiently covered by exomes (Supplementary Fig. 2) [4], and visually inspected candidates exceeding the pathogenic threshold using the Repeat Expansion Viewer (REViewer v0.2.7; Supplementary Fig. 3) [12].
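The screening step described above can be illustrated with a minimal sketch. It assumes that each call has already been reduced to a (sample, locus, genotype) tuple, with the genotype written as repeat counts such as "10/36" (similar to the REPCN genotype field in ExpansionHunter output). The function names, sample identifiers, and the two-locus threshold table are hypothetical; the thresholds shown are only the two stated in this report (AR ≥38 and ATXN1 ≥39 repeats), not the full 21-locus catalog used in the study.

```python
from dataclasses import dataclass

# Illustrative thresholds only: the two values quoted in this report.
# A real screen would load thresholds for every target locus from a curated catalog.
ILLUSTRATIVE_THRESHOLDS = {"AR": 38, "ATXN1": 39}


@dataclass(frozen=True)
class Candidate:
    sample: str
    locus: str
    longest_allele: int
    threshold: int


def parse_genotype(genotype: str) -> list[int]:
    """Parse a repeat genotype such as '10/36'; missing alleles ('.') are skipped."""
    return [int(part) for part in genotype.split("/") if part not in ("", ".")]


def flag_candidates(calls, thresholds=ILLUSTRATIVE_THRESHOLDS):
    """Return calls whose longest allele meets or exceeds the locus threshold."""
    flagged = []
    for sample, locus, genotype in calls:
        alleles = parse_genotype(genotype)
        if locus not in thresholds or not alleles:
            continue  # locus not screened, or no genotype was called
        longest = max(alleles)
        if longest >= thresholds[locus]:
            flagged.append(Candidate(sample, locus, longest, thresholds[locus]))
    return flagged


if __name__ == "__main__":
    # Hypothetical calls, not data from the study.
    calls = [
        ("S1", "AR", "21/38"),
        ("S2", "AR", "22/24"),
        ("S3", "ATXN1", "29/41"),
        ("S4", "ATXN1", "./."),
    ]
    for candidate in flag_candidates(calls):
        print(candidate)
```

Candidates flagged in this way would still require visual inspection of the aligned reads and orthogonal confirmation, as in the workflow above.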

Please refer to the Supplementary information for more detailed materials and methods used.


Results

Identified repeat expansions

In our cohort, the majority (94.0%) had pediatric-onset diseases; neurodevelopmental disorders were the most prevalent primary disease category (65.6%), and trio sequencing was used in 67.9% of the cases. Using ExpansionHunter, we targeted 20 genes with adequate locus coverage to detect pathogenic repeat expansions (Supplementary Fig. 2). Our initial analysis yielded 116 potential repeat expansions above recognized pathogenic thresholds (Supplementary Table 2). These candidates were further examined using REViewer, and genotype calls from regions with low coverage, suboptimal mapping quality, or alignment bias towards specific haplotypes were excluded to eliminate false positives (Supplementary Fig. 3). Consequently, 35 repeat expansions remained suspected after visual inspection, and subsequent validation confirmed 13 of them. Through genotype-phenotype correlation, these confirmed expansions led to the diagnosis of 13 individuals (7 probands and 6 parents) within 7 families (Table 1, Supplementary Fig. 5): dentatorubral-pallidoluysian atrophy (DRPLA; n = 3), spinocerebellar ataxia type 7 (SCA7; n = 2), and myotonic dystrophy type 1 (DM1; n = 2).

Table 1 Clinical findings and detected repeat expansions in families undergoing exome sequencing.

In the case of DRPLA (families 1–3), the three probands were initially referred to the clinic for developmental delays, and their parents were asymptomatic at the time of initial enrollment. The expanded alleles were found to be transmitted from their fathers, who later developed adult-onset DRPLA symptoms in their 40s. Brain MRI scans of the probands from families 1 and 2 revealed cerebellar atrophy, and cascade screening within these two families uncovered additional patients with DRPLA who presented with cerebellar ataxia (Fig. 1a).

Fig. 1: Pedigree and brain imaging findings in representative cases identified by short tandem repeat (STR) analysis.

a Dentatorubral-pallidoluysian atrophy (DRPLA) was incidentally identified by STR analysis in families 1 and 2. Trio exome sequencing was conducted because of developmental delay in the probands. Both fathers were initially asymptomatic; however, relevant clinical findings emerged during follow-up. The pedigrees were reconstructed after genotype-phenotype correlation, confirmation of DRPLA, and cascade screening for other affected members. b Spinocerebellar ataxia type 7 (SCA7) identified in families 4 and 5. The proband and her mother from family 4 underwent exome sequencing after receiving negative results from SCA panel tests at another institution. However, ExpansionHunter suggested repeat expansions in ATXN7. Subsequent fragment analysis and Cas9-mediated Nanopore long-read sequencing showed that the previous results were false negatives. Cascade screening of family 4 further confirmed ATXN7 repeat expansions in other family members. In family 5, the ATXN7 repeat counts estimated by ExpansionHunter were 10/36 in the proband and 10/42 in his father. Subsequent fragment analysis confirmed the repeat expansions (92 and 42 repeats, respectively). c Myotonic dystrophy type 1 (DM1) identified in family 6. Although ExpansionHunter estimated 64 and 62 repeats in the proband and his mother, respectively, Southern blotting revealed extremely long repeats in the proband (1171 repeats) and his mother (617 repeats). During reverse phenotyping of the mother, myotonia of the tongue and grip was noted. Brain magnetic resonance imaging revealed cerebellar atrophy in the probands, indicated by yellow arrows. Samples that underwent exome sequencing within the families are highlighted in pink. Black arrows denote the probands within the families.

Repeat expansions in ATXN7 were detected in two families (families 4 and 5), one with adult-onset symptoms and the other with childhood-onset symptoms in the probands (Fig. 1b). In family 4, the affected members commonly showed signs of cerebellar ataxia and foveal atrophy. These repeat expansions were confirmed using two orthogonal methods (fragment analysis and Nanopore long-read sequencing) after SCA panel tests at another institution had returned negative results, which were later determined to be false negatives. After the diagnosis, additional family members were also found to carry the repeat expansions. In family 5, the proband and his father had repeat counts estimated at 10/36 and 10/42 by ExpansionHunter, respectively. Fragment analysis later showed these expansions to be 92 repeats in the proband and 42 in the father. Although the proband initially exhibited only developmental delays, regression and cerebellar ataxia were noted at 6 years of age. Notably, anticipation was evident in this family; the proband was diagnosed with childhood-onset SCA7 before the father, who carried the pathogenic repeat expansion, became symptomatic [13].

For DM1, the repeat counts were validated using either fragment analysis or Southern blotting, depending on the length of the repeats. In family 6, ExpansionHunter estimated 64 repeats in the proband, whereas Southern blotting revealed an exceptionally long CTG repeat expansion of 1171 repeats, consistent with the congenital type of DM1. This allele was inherited from his mother (617 repeats), who exhibited tongue and grip myotonia during reverse phenotyping (Fig. 1c). In family 7, the proband had 57 CTG repeats, indicative of the mild type of DM1. He exhibited mild muscle weakness with skeletal anomalies, including foot deformities and neck webbing. Recent electromyography revealed myopathic findings. The expanded allele was inherited from his father, who had 44 CTG repeats, falling within the premutation range (35–49 repeats).

Expanded ATXN1 alleles with CAT interruptions

Among the 35 repeat expansions initially suspected through visual inspection, 22 were not subjected to further confirmatory methods, as subsequent evaluation deemed them likely non-pathogenic. Within the AR gene, we identified ten heterozygous repeat expansions (≥38 repeats) in females (5 alleles with 38 repeats, 2 alleles with 39 repeats, 2 alleles with 40 repeats, and 1 allele with 41 repeats). The female carrier frequency of the expanded allele was 0.52% (9 unrelated alleles in 1,746 mothers within trio- or quartet-sequenced samples). Additionally, we identified twelve expanded ATXN1 alleles (≥39 repeats) with interruptions, none of which were associated with clinical features of SCA1 at the time of evaluation. We observed that the presence of different thymidine (T) nucleotides within the interruptions allowed for accurate phasing and alignment (Supplementary Fig. 6). We found six different patterns of interruptions, in which each expanded allele had either two or four CAT interruptions, leading to amino acid changes from glutamine (Q) to histidine (H) residues. Notably, the (Q)26-31(H)(Q)(H)(Q)10 motif was the predominant pattern, observed in 9 individuals, as previously reported [14]. After excluding four related alleles, we identified eight expanded/interrupted alleles among 8,512 alleles originating from 4,256 unrelated individuals. This suggests that expanded/interrupted ATXN1 alleles may be present in approximately 0.19% of the Korean population (Supplementary Table 3).
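The interruption notation used above can be made concrete with a short sketch that translates an in-frame CAG/CAT tract into glutamine (Q) and histidine (H) residues and run-length encodes the result. The function names and the synthetic allele are hypothetical and are not taken from the study data. Counting CAT codons toward the total number of repeat units is also an assumption made here for illustration.

```python
from itertools import groupby

CODON_TO_RESIDUE = {"CAG": "Q", "CAT": "H"}  # glutamine, histidine


def tract_to_residues(dna: str) -> str:
    """Translate an in-frame CAG/CAT tract into a string of Q and H residues."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("tract length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    unknown = sorted(set(codons) - set(CODON_TO_RESIDUE))
    if unknown:
        raise ValueError(f"unexpected codons in tract: {unknown}")
    return "".join(CODON_TO_RESIDUE[codon] for codon in codons)


def compress(residues: str) -> str:
    """Run-length encode residues, writing single residues without a count."""
    parts = []
    for residue, run in groupby(residues):
        n = len(list(run))
        parts.append(f"({residue}){n}" if n > 1 else f"({residue})")
    return "".join(parts)


def summarize(dna: str) -> dict:
    """Report total repeat units, number of CAT interruptions, and the pattern."""
    residues = tract_to_residues(dna)
    return {
        "repeat_units": len(residues),
        "cat_interruptions": residues.count("H"),
        "pattern": compress(residues),
    }


if __name__ == "__main__":
    # Synthetic allele: 28 CAG, CAT, CAG, CAT, 10 CAG (41 units in total).
    allele = "CAG" * 28 + "CAT" + "CAG" + "CAT" + "CAG" * 10
    print(summarize(allele))
```

For this synthetic allele the pattern falls within the (Q)26-31(H)(Q)(H)(Q)10 family, with two CAT interruptions.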


Discussion

Our approach used ExpansionHunter to screen for repeat expansions and REViewer to visually inspect aligned reads, and we validated the candidates using orthogonal methods. After these steps and genotype-phenotype correlation, we diagnosed thirteen individuals from seven previously undiagnosed families across three distinct disorders. Our cohort primarily consisted of pediatric patients with neurodevelopmental or neuromuscular disorders, with repeat expansions confirmed in the ATN1, ATXN7, and DMPK genes. The overall diagnostic gain (0.28%, 7/2,510) was comparable to that of a previous study of a movement disorder cohort (0.24%, 7/2,867) [8], which involved six genes (ATXN1, ATXN3, ATXN7, HTT, NOP56, and PPP2R2B). A higher detection rate has been reported in a spinocerebellar ataxia cohort (4.4%, 22/498) [10], which included five genes (ATXN2, ATXN3, NOP56, AR, and HTT). We also incidentally found different patterns of expanded ATXN1 alleles with interruptions in twelve individuals who did not report SCA1-related phenotypes. These findings highlight the value of STR analysis, which is often overlooked in exome analysis.
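As a simple arithmetic check, the proportions quoted in the text can be recomputed from their numerators and denominators. The sketch below only reproduces counts stated in this report; the labels and the helper name `percent` are hypothetical.

```python
def percent(numerator: int, denominator: int, digits: int = 2) -> float:
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100.0 * numerator / denominator, digits)


# Counts as stated in the text: (numerator, denominator).
REPORTED = {
    "diagnostic gain, this cohort (families)": (7, 2510),
    "diagnostic gain, movement disorder cohort [8]": (7, 2867),
    "detection rate, spinocerebellar ataxia cohort [10]": (22, 498),
    "AR expanded alleles, unrelated mothers": (9, 1746),
    "ATXN1 expanded/interrupted alleles, unrelated individuals": (8, 4256),
}

if __name__ == "__main__":
    for label, (num, den) in REPORTED.items():
        print(f"{label}: {num}/{den} = {percent(num, den)}%")
```

Note that the 0.19% figure for ATXN1 corresponds to 8 carriers among 4,256 unrelated individuals; expressed per allele (8 of 8,512), the proportion would be roughly half of that.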

The capacity to detect repeat expansions in exomes relies strongly on read length and locus coverage [6]. The discrepancies observed between ExpansionHunter estimates and results from orthogonal methods (Table 1) underscore the difficulty of accurately estimating repeat counts from short reads, which can be significantly influenced by the number of reads anchored in the targeted regions. In particular, we could not assess the FMR1 region on the X chromosome because of insufficient coverage (Supplementary Fig. 2b), even though FMR1 expansion is one of the most common causes of repeat expansion disease in the pediatric population. Moreover, our study may underrepresent the actual frequency of repeat expansions, as false-negative results are possible in outliers with low coverage (Supplementary Fig. 4) [9]. Consequently, repeat counts estimated by ExpansionHunter require cautious interpretation and should be confirmed with orthogonal methods.
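The size of these discrepancies can be summarized with a short sketch that compares the ExpansionHunter estimate with the orthogonally confirmed repeat count for the four individuals whose paired values are given in the text and in the Fig. 1 legend. The function name and output layout are hypothetical; the fold difference is a descriptive ratio, not a correction factor.

```python
# (family, locus, individual, ExpansionHunter estimate, orthogonal result)
# Values as stated in the text and the Fig. 1 legend.
MEASUREMENTS = [
    ("family 5", "ATXN7", "proband", 36, 92),
    ("family 5", "ATXN7", "father", 42, 42),
    ("family 6", "DMPK", "proband", 64, 1171),
    ("family 6", "DMPK", "mother", 62, 617),
]


def compare_estimates(rows):
    """Return the absolute and fold difference between confirmed and estimated counts."""
    summary = []
    for family, locus, individual, estimated, confirmed in rows:
        summary.append({
            "family": family,
            "locus": locus,
            "individual": individual,
            "difference": confirmed - estimated,
            "fold": round(confirmed / estimated, 1),
        })
    return summary


if __name__ == "__main__":
    for row in compare_estimates(MEASUREMENTS):
        print(row)
```

In these examples the short-read estimate agrees with the confirmed count only for the shortest expanded allele (42 repeats), and the gap widens as the true expansion grows, which is consistent with the need for orthogonal confirmation described above.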

Internal sequence interruptions have been implicated in the disease phenotype, penetrance, and age of onset of various STR disorders [4]. We found interruptions within expanded ATXN1 alleles and observed several distinct patterns. These interruptions in the polyQ tract are understood to mitigate aggregate formation and increase the stability of repeat transmission to offspring, which may contribute to the absence of symptoms or the delayed age of onset seen in SCA1 [15]. A previous study reported expanded/uninterrupted alleles in 1.40% of Korean patients with cerebellar ataxia, and their onset age ranged from 44 to 59 years [16]. Therefore, it remains uncertain whether the carriers in our study might develop SCA1-related symptoms later in life. However, the proportion of expanded/interrupted alleles in our cohort (0.19%) appeared to be much lower than in the previous study, with the (Q)26-31(H)(Q)(H)(Q)10 motif emerging as the most common pattern in the Korean population.

In conclusion, this study, which encompassed a substantial number of pediatric patients and samples sequenced as trios or quartets within the East Asian population, serves to broaden the molecular spectrum and enhance the applicability of exome sequencing for STR assessments. The integration of STR analysis into the exome sequencing pipeline holds the potential to provide additional diagnostic opportunities.