Introduction

Microsatellites, or short tandem repeats (STRs), consist of tandemly arrayed motifs of 1–6 base pairs (bp). They are among the most useful and commonly employed genetic markers in population, forensic, and conservation genetics1, due to their variability and ubiquity. Their instability has relevant medical implications, being linked to cancer2 and to many other diseases. In particular, more than 40 neurological, neurodegenerative, and neuromuscular disorders are caused by repeat expansions of STRs in coding and non-coding regions3.

STRs undergo rapid length changes due to the insertion or deletion of one or multiple repeat units1,3. The primary mutational mechanism thought to lead to changes in STR length is polymerase template slippage during DNA replication4,5. A distinct pathway is associated with unequal crossing over, which may happen due to strand mispairing during recombination6.

The stepwise mutation model (SMM), introduced by Ohta and Kimura7 and Wehrhahn8, describes STR mutational dynamics in which parental alleles gain or lose a single repeat when transmitted to the offspring. The possibility of multistep changes was also considered, although at a much lower rate. Indeed, some works showed that multistep mutations represent 1% of the detected mutations for tri- and tetranucleotide STRs, a figure that rises to 30% for dinucleotide STRs9,10. The SMM has been used to model STR mutation and evolution and has been applied in diverse areas such as population genetics11, epidemiology12, and phylogeography13. The traditional approach for quantifying kinship likelihood ratios relies on a parameter that sets the decrease in probability for each additional repeat difference between parental and filial alleles. This so-called “mutation range” parameter is implemented in diverse software14,15. Despite the lack of statistical support, 0.1 is sometimes suggested as an overall value for the mutation range, meaning that a two-step mutation is 10 times rarer than a single-step one, a three-step mutation is 10 times rarer than a two-step one, and so on.
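The mutation range convention described above can be sketched numerically. The snippet below is a minimal illustration, assuming that the tenfold decay applies uniformly to every additional step and that step sizes are truncated at a chosen maximum; the function name `step_probabilities` and the `max_steps` cutoff are hypothetical choices made for this example and are not taken from any of the cited software.

```python
def step_probabilities(mutation_range, max_steps):
    """Conditional probability of a k-step change (k = 1..max_steps),
    given that a mutation occurred, under a geometric decay.

    Assumption: each additional step is `mutation_range` times as likely
    as the previous one, and steps beyond `max_steps` are ignored.
    """
    if not 0 < mutation_range < 1:
        raise ValueError("mutation_range must lie strictly between 0 and 1")
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    weights = [mutation_range ** (k - 1) for k in range(1, max_steps + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Commonly suggested overall value of 0.1, truncated at four steps.
probs = step_probabilities(0.1, max_steps=4)
multistep_share = sum(probs[1:])  # share of mutations involving 2+ steps

for k, p in enumerate(probs, start=1):
    print(f"{k}-step: {p:.6f}")
print(f"multistep share: {multistep_share:.6f}")
```

Under this assumed parametrization, a mutation range of 0.1 implies that roughly one in ten mutations is multistep, which is the benchmark against which per-motif multistep frequencies can be compared.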

To investigate the impact of the composition of the STR’s repetitive motif on the mutational dynamics, we have compiled data available for STRs located in the non-recombining region of the Y chromosome (Y-STRs). This region of the Y chromosome has no homologous region on the X chromosome and, as such, it does not undergo recombination during meiosis. Hence, in simple sequence markers, any change detected between father and son must be due to a mutation event. It is also noteworthy that the data obtained for this study were generated through genotyping platforms that do not discriminate variation in sequence, only differences in allele length (automated fragment size determination after capillary electrophoresis).

Indeed, the Y chromosome is an invaluable tool for the study of germline mutations and their biological mechanisms since it is exclusively transmitted through the paternal lineage in a haploid fashion. The non-recombining region of the Y chromosome (NRY) contains many STRs. When typing platforms discriminate solely the length of the allele, the Y chromosome, due to its specific mode of transmission, is the only component of the nuclear genome that allows the exact knowledge of which parental allele resulted in which filial one, allowing the unambiguous identification of any length mutation16.

In both autosomal and heterosomal modes of transmission, when no Mendelian incompatibilities are detected in parent(s)-child duos or trios, it is assumed that no mutation occurred. This unavoidably leads to an underestimation of the mutation rates, since ‘hidden’ or ‘covert’ mutations may be present17,18,19.

The most parsimonious approach is used when classifying a mutation as either single- or multistep, i.e., the mutation that requires the minimum number of steps to reconcile the observations with Mendelian transmission is assumed. This leads to an overestimation of the single-step mutation rates and a corresponding underestimation of those involving multiple steps. It is, however, noteworthy that this is more severe for autosomal than for X-chromosomal markers, since in father-daughter and mother-son transmissions the parental and filial alleles, respectively, are known20.

Here we intend to contribute to the improvement of the estimates and of the mutational model design by relating the sequence of the repetitive motif of Y chromosome-specific STRs (Y-STRs), rather than just its length, to the mutational dynamics.

We have found that the frequency of multistep mutations varies widely across repeat motif compositions and lengths, with differences of nearly an order of magnitude. The implications of these findings for population genetics, epidemiology, and phylogeography, and more generally for evolutionary studies where STR mutation models are used, are discussed.

Material and methods

Data from 44 published reports21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64 were gathered, comprising a total of 2,444 observed mutations in 476,306 allele transfers between father-son pairs, regarding 64 Y-STRs (see Tables S1 and S2). These data were obtained with genotyping platforms (automated fragment size determination after capillary electrophoresis) that do not discriminate variation in sequence, only size differences. As noted above, a change between a pair of paternal and filial Y-STR alleles implies that a mutation occurred. However, a match between the paternal and filial alleles indicates the absence of mutation only in simple-structure STRs (harboring a single repetitive motif). For STRs with a complex structure (having two or more adjacent repetitive motifs), two mutations may occur in opposite directions, maintaining the final size of the PCR amplicon. Hence, only with simple-structure STRs is it possible to determine the number of repeats involved in the allelic transmission. Thus, after compiling data from all studies including father-son duos, DYS389II, DYS390, DYS435, DYS446, DYS447, DYS520, DYS547, DYS552, and DXY156 were excluded from the analyses because they harbor complex structures. Markers containing several loci (multi-copy), such as DYS385a/b, DYS459a/b, DYS464a/b/c/d, DYS526a/b, DYS527a/b, DYF387S1, DYF399S1, DYF404S1, DYF403S1a/b, were also not considered since they do not allow the unambiguous assignment of a mutation to each locus. Structure groups with fewer than 10 reported mutations (DYS413, YCAII, DYS531, DYS587, DYS443, and DYS505) were also removed from the analyses due to a lack of statistical power. Finally, DYS622, DYS630, and DYS640 were not considered since no sequence information was found.

A final subset of 35 Y-STRs, comprising 323,818 allele transfers and 1,297 mutations, was then considered for further analyses (see Table S1).

STRs were grouped according to the sequence and length of the repetitive motif present in the leading strand (retrieved from GRCh38.p1465), resulting in 8 groups, as shown in Table 1.

Table 1 Grouping of the STRs analyzed according to the repetitive motif present in the leading strand.

In forensic genetics, STR nomenclature recommendations state that, although it is often possible to define different repetitive motifs within a 5’ to 3’ strand, the repeat sequence motif must be defined so that the first 5’-nucleotides that can represent a repetitive motif are used66. However, when a mutation occurs, it is impossible to discern whether the length change resulted from the addition or deletion of the designated repetitive motif or of any other. For example, if the repetitive motif of an STR is defined as TCTA, when a length mutation occurs, that repetitive motif might have been the one involved in the mutation, but so could the motifs CTAT, TATC, and ATCT (see Table 1 for the group information). It is impossible to discern which motif was involved in the mutation through capillary electrophoresis or sequencing. As such, in this work, STRs were grouped according to their structure and not their official nomenclature.
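The equivalence between a designated motif and its cyclic shifts can be made concrete with a short sketch. The helper names below (`rotations`, `reverse_complement`, `same_repeat_unit`) are hypothetical and serve only to illustrate the reasoning; they are not part of any nomenclature tool.

```python
def rotations(motif):
    """All cyclic shifts of a motif, e.g. TCTA -> TCTA, CTAT, TATC, ATCT."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Sequence of the opposite strand, read 5' to 3'."""
    return motif.translate(_COMPLEMENT)[::-1]


def same_repeat_unit(a, b, both_strands=False):
    """True if b is a cyclic shift of a; with both_strands=True, also
    accept a cyclic shift of the reverse complement of a."""
    if b in rotations(a):
        return True
    return both_strands and reverse_complement(b) in rotations(a)


print(sorted(rotations("TCTA")))
print(same_repeat_unit("TCTA", "GATA"))                     # same strand only
print(same_repeat_unit("TCTA", "GATA", both_strands=True))  # either strand
print(same_repeat_unit("GAAA", "CTTT", both_strands=True))
```

The same check on both strands shows why TCTA and GATA, and GAAA and CTTT, describe the same repeat read from opposite strands.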

As TCTA and GATA, and GAAA and CTTT, are complementary sequences, Fisher’s exact tests were performed to determine whether the members of each pair could be grouped, by assessing the statistical significance of the differences in the number of single- and multistep mutations within each pair (α = 0.05). No significant differences were detected in the comparison of GAAA with CTTT markers (p = 0.8415) nor in the comparison of TCTA with GATA markers (p = 0.0846). Hence, GAAA markers were grouped with CTTT markers, and GATA markers were grouped with TCTA markers.

The ratio between single- and multistep mutations was calculated for each of the above-defined groups of markers. Fisher’s exact tests were also used to assess the significance (α = 0.05) of the differences in single/multistep proportions between groups of markers.
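A comparison of this kind can be sketched with a two-sided Fisher’s exact test on a 2×2 table of single- and multistep counts. The implementation below uses only the standard library and sums the probabilities of all tables no more likely than the observed one, which is one common definition of the two-sided p-value; the source does not state which software or definition was used. The counts in the example are hypothetical placeholders, not values from Tables S1 or 2.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups of markers; columns are single- and multistep counts.
    The p-value sums hypergeometric probabilities of every table with the
    same margins that is no more likely than the observed table.
    """
    row1, col1, col2 = a + b, a + c, b + d
    n = a + b + c + d

    def numerator(x):
        # Unnormalized probability of x single-step mutations in group 1.
        return comb(col1, x) * comb(col2, row1 - x)

    lowest = max(0, row1 - col2)
    highest = min(row1, col1)
    observed = numerator(a)
    # Integer comparison avoids floating-point ties between tables.
    extreme = sum(
        numerator(x)
        for x in range(lowest, highest + 1)
        if numerator(x) <= observed
    )
    return extreme / comb(n, row1)


# Hypothetical counts (single-step, multistep) for two motif groups.
group_a = (45, 15)
group_b = (320, 3)
p_value = fisher_exact_two_sided(*group_a, *group_b)
print(f"p = {p_value:.3g}; significant at 0.05: {p_value < 0.05}")
```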

The number of repeats involved in allele transitions where mutations were observed was also analyzed for the complete set of 35 single-copy Y-STRs with simple structure.

In markers DYS19, DYS389I, and DYS635, allele calling includes the total number of repeats in polymorphic and contiguous non-polymorphic tracts. Proper adjustments were made for these markers to obtain the number of repeats of the polymorphic tract.

Some of the published reports51,56,58,61,63 do not indicate the alleles observed in the mutation, providing only information on the type of mutation observed (single- or multistep, gain or loss of repeats). These works were thus not included in the analyses involving the number of repeats.

Results and discussion

Although many studies report single-step mutations as much more frequent than multistep mutations, these results are usually presented as an overall value, and not analyzed per marker—see for example23,24,38. Our results regarding markers with simple structure show that, indeed, single-step mutations are more frequent than multistep ones (except for marker DYS438, see Table S1). However, the ratio between single- and multistep mutations varies widely between markers and groups of markers defined by their repetitive motif structure (see Table 2).

Table 2 Number of multistep (a) and total (b) mutations observed, multistep mutation frequency, and corresponding 95% confidence intervals, per group of markers.

The CTTTT group showed the highest frequency of multistep mutations (25% of the mutations observed), more than twice the frequency of the ATT and CTT groups, which had the second-highest frequencies (~12%). The lowest frequency of multistep mutations was observed for the TCTA/GATA group (~0.93%).

Comparing the two tetrameric groups, the GAAA/CTTT group showed a multistep mutation frequency 5.7 times higher than that of the TCTA/GATA group, and the corresponding confidence intervals do not overlap.
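For readers who wish to reproduce this kind of comparison, the sketch below computes a 95% confidence interval for a multistep proportion and checks whether two intervals overlap. The source does not state which interval method was used for Table 2; the Wilson score interval is assumed here as one common choice, and the counts are hypothetical placeholders rather than values from the tables.

```python
from math import sqrt

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wilson_interval(successes, total, z=Z_95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(first, second):
    """True if two closed intervals (low, high) share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical (multistep, total) mutation counts for two groups.
ci_a = wilson_interval(12, 220)
ci_b = wilson_interval(6, 650)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Non-overlapping intervals are a conservative indication of a difference; the pairwise Fisher’s exact tests remain the formal comparison.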

Ballantyne et al.67 concluded that motifs with strong purine:pyrimidine asymmetries have the highest diversity and variance. Our results indicate that this could also be a factor affecting the type of mutation, with a consequent impact on the variance in the number of repeats. For STRs with tetrameric motifs, the GAAA group, with a 4:0 ratio of purine:pyrimidine, presents a greater frequency of multistep mutations than the GATA group, with a 3:1 ratio (p < 0.0001, see Table 3). The same trend is observed for trimeric repeats, with the CTT having a higher ratio of multistep mutations than the ATT motifs. The frequency of multistep mutations is even higher for the pentameric motif CTTTT, with a 0:5 ratio of purine:pyrimidine. However, in this case, we cannot discern whether this difference is influenced by the higher asymmetry or by the larger number of nucleotides in the motif. Significant differences were also found between both ATT and CTT groups and the TCTA/GATA group (p = 0.0168 and p < 0.0001, respectively), and between both TCTA/GATA and GAAA/CTTT groups and the CTTTT group (p < 0.0001 and p = 0.0102, respectively; see Table 3).
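The purine:pyrimidine ratios quoted above follow directly from the motif sequences and can be tabulated with a few lines of code. The `asymmetry` score below is a hypothetical summary introduced only for this illustration (absolute difference between purine and pyrimidine counts, divided by motif length); it is not a measure defined in the source or in Ballantyne et al.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def purine_pyrimidine_counts(motif):
    """Return (purines, pyrimidines) for a motif made of A, C, G, T."""
    purines = sum(base in PURINES for base in motif)
    pyrimidines = sum(base in PYRIMIDINES for base in motif)
    if purines + pyrimidines != len(motif):
        raise ValueError(f"unexpected character in motif {motif!r}")
    return purines, pyrimidines


def asymmetry(motif):
    """Hypothetical score: 0 for balanced motifs, 1 for all-purine
    or all-pyrimidine motifs."""
    purines, pyrimidines = purine_pyrimidine_counts(motif)
    return abs(purines - pyrimidines) / len(motif)


for motif in ["ATT", "CTT", "GATA", "TCTA", "GAAA", "CTTT", "CTTTT", "AGAGAT"]:
    purines, pyrimidines = purine_pyrimidine_counts(motif)
    print(f"{motif}: {purines}:{pyrimidines} (asymmetry {asymmetry(motif):.2f})")
```

Because complementary motifs swap purines for pyrimidines, GATA (3:1) and TCTA (1:3) share the same asymmetry, as do GAAA (4:0) and CTTT (0:4).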

Table 3 P-values resulting from a pairwise Fisher test of the number of single- and multistep mutations between the STR groups defined by the repetitive motif (α = 0.05).

The correlation between the length of the repetitive motif and the mutation rate has been shown in different studies (e.g.67,68,69,70). Most of these studies also acknowledge the presence of mutations that escape the SMM, without, however, relating their frequency to the repetitive structure of the locus. Beyond these analyses, our work shows how frequently some STRs can escape the SMM. Most mutations obey the SMM, but some escape this model, for some markers and/or groups of markers more than for others. So, despite being the most used model, and suitable for most STRs, the SMM should be used with caution for others.

Martins et al.71 found that wild-type Machado-Joseph Disease alleles do not follow the single-step mutation model. Their results show that the frequency distribution of CAG alleles has been shaped by a multistep mutation mechanism. Indeed, this seems to be the case for some of the groups in this work, which show multistep mutation proportions of up to 25%.

Most works show a considerable disproportion between single- and multistep mutations, which might be due to the high number of GATA markers analyzed in the most used multiplexes. In recent years, more GAAA markers have been added to the commercially available typing kits, and the ratio between single- and multistep mutations will likely become less disproportionate. Penta- and hexameric motifs are much less represented in the generally used commercial kits, and so they have little effect on these overall rates.

The numbers of single- and multistep mutations, considering the number of repeats involved in the allele transmissions, were analyzed for the complete set of markers and structure groups (see Table 4).

Table 4 Number of single- and multistep mutations considering the number of repeats of the parental alleles for all the structure groups.

The high number of categories considered through this approach implies a low number of observations in each of them, so differences may not be detected even if they exist. Nevertheless, for a set of 22 numbers of repeats existing in at least 2 structure groups, 2 showed statistically significant differences (and 2 nearly significant). This supports that, at least in some cases, the structure of the repeat motif does influence the proportion of single- and multistep mutations, beyond the length of the polymorphic tract.

Conclusions

So far, diverse studies have shown the influence of several factors on STR mutation rates, such as the allele length, repeat motif size and sequence, parental sex, and age. Others have studied the correlation between the mutation rate and the nucleotide composition of the repetitive motif with the same number of base pairs (see, for example,67). However, the influence of the nucleotide composition of the repetitive motif on the type of mutation (single- or multistep) has not been systematically investigated. In this study, we took advantage of the mode of transmission of the non-recombining region of the Y chromosome, which enables the direct analysis of length mutations in markers with simple structure.

Despite the inescapable problem of the low number of observations when modeling rare events, this work supports that, just like mutation rates, the type of mutation (single- or multistep) is heterogeneous across STRs. This includes markers with the same length of the repetitive motif, as well as alleles with the same number of repeats, although from different markers. Comparing repetitive motifs with different sizes prevents us from discerning the reason for the observed imbalance between single- and multistep mutations. In any case, our work supports that the best fitting mutation model varies between markers.

The monomeric tract in motifs ATT, CTT, GAAA and CTTTT might be influencing slippage, or another mutation model might be operating since in these motifs the multistep mutation frequency is higher.

Most noteworthy is the case of one of the pentameric markers analyzed, DYS438, which does not fit the single-step mutation model, as half of the observed mutations involved several steps.

It is clear that, at least for some STR motif structures, the single stepwise mutation model represents, at best, a crude and biased oversimplification. The implications are manifold and affect many areas of study, such as human population and evolutionary history, genealogical studies, or forensics. Concerning forensic applications, the “mutation range” parameter of 0.1 frequently used in kinship computations seems to be too high for all tetrameric STRs analyzed and too low for pentameric ones. Based on the available data, the mutation range parameter estimates are 0.1333 for ATT markers, 0.1316 for CTT, 0.0554 for GAAA/CTTT, 0.0093 for motif TCTA/GATA, 0.3333 for CTTTT, and 0.0556 for AGAGAT (although in these two last cases more data are needed for a sound estimate).
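The source does not state how these mutation range estimates were derived. One reading that is consistent with the figures reported in the Results is that each estimate equals the ratio of multistep to single-step mutations in the group, so that a mutation range r corresponds to a multistep share of r / (1 + r) among all mutations. The sketch below applies this assumed relation to the reported values; the function names are hypothetical.

```python
DEFAULT_RANGE = 0.1  # overall value often suggested for kinship software

# Mutation range estimates as reported in the text.
reported_range = {
    "ATT": 0.1333,
    "CTT": 0.1316,
    "GAAA/CTTT": 0.0554,
    "TCTA/GATA": 0.0093,
    "CTTTT": 0.3333,
    "AGAGAT": 0.0556,
}


def multistep_share(mutation_range):
    """Share of multistep mutations implied by a multistep:single-step
    ratio (assumed interpretation of the mutation range estimate)."""
    return mutation_range / (1 + mutation_range)


def range_from_share(share):
    """Inverse relation: multistep:single-step ratio from a multistep share."""
    return share / (1 - share)


for group, r in reported_range.items():
    print(
        f"{group}: range {r:.4f}, "
        f"implied multistep share {multistep_share(r):.2%}, "
        f"{r / DEFAULT_RANGE:.2f}x the default"
    )
```

Under this reading, the CTTTT estimate of 0.3333 corresponds to the 25% multistep frequency reported for that group, and the ATT and CTT estimates correspond to roughly 12%.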

The development of new models of STR evolution that include all major factors known to influence mutation is challenging but crucial. Large datasets are needed to test mutation models and to estimate rates more accurately. One major setback is that some markers have extremely low mutation rates, which makes gathering enough data challenging; in such cases, targeted analyses are needed. Moreover, guidelines concerning mutation reporting should be established, a need particularly felt when dealing with STRs outside the NRY, as previously mentioned in72. The reported data should include parental age and genotypic information, such as the absolute frequencies of the observed alleles in one-generation profiles (separately for duos and trios in the case of either autosomal or X-chromosomal markers, comprising all the cases, with or without mutation, and for the full set of analyzed markers). Such enriched and organized datasets would improve mutation modeling, enabling allele-specific mutation rate estimates and allowing the discernment and quantification of the effects of the various factors influencing the fidelity of genetic transmission.