Introduction

Genetic information can be altered by base substitutions or by addition or deletion of nucleotides. These changes can be either beneficial, neutral or detrimental to the organism. In order to understand the mechanisms generating mutations, it is essential to investigate the nature of the DNA sequence alterations. Extensive studies in bacterial genetic systems have demonstrated the existence of deletion and insertion ‘hotspots’, involving repetitive sequences [1, 2]. In vitro frameshift fidelity assays using eukaryotic DNA polymerases have suggested that template-primer misalignment during DNA synthesis or recombination is probably the mechanism generating short deletion and insertion mutations [3]. According to this proposal, misaligned intermediates are formed as a result of slippage of DNA strands in regions containing repeated nucleotides. Therefore, the mechanism was termed slipped-strand mispairing (SSM) [4].

The eukaryotic genome contains many regions in which small motifs consisting of a single base or small number of bases are repeated in tandem multiple times (often over 20 repeated units in a run). These relatively high-copy-number simple tandem repeats comprise the highly polymorphic microsatellite sequences found in many noncoding regions of mammalian genomes. These polymorphisms derive from differences in the copy number of the tandemly repeated simple motifs, and are usually stably inherited. Therefore, these polymorphic sites are useful for genomic mapping and for fingerprinting studies. Analysis of several restriction fragment length polymorphism (RFLP) sites flanking polymorphic microsatellite sites has suggested that SSM, rather than unequal exchange between homologous chromosomes, is most probably the mechanism involved in generating copy number variation [5,6].

Studies in Saccharomyces cerevisiae have shown that (GT)n tracts are highly unstable, with length alterations at a minimal rate of 10⁻⁴ events per division [7]. Most of the changes involve additions or deletions of one or two repeated units. In addition, Strand et al. [8] have shown that the instability of poly(GT) sequences in yeast can be greatly increased by mutations in DNA mismatch repair genes. These results support the assumption that tract instability is associated with DNA polymerases slipping during replication. Another study has recently shown that DNA plasticity in mitochondrial DNA is a result of the instability of a 10-bp tandemly repeated sequence [9].

Several human diseases (fragile X, myotonic dystrophy, Huntington’s chorea, spinal and bulbar muscular atrophy and spinocerebellar ataxia type 1) have been found to be caused by somatic expansion of highly repeated tandem repetitive sequences [10–15]. In these diseases, a dramatic increase in the length of the repeated run occurs.

Recent progress in human molecular genetics has led to the identification of disease-causing mutations in several human genes. Analysis of the sequences involved in these mutations has provided an opportunity to investigate the mechanisms involved. In a study of 80 short deletion and insertion mutations, direct repeats of between 2 and 8 bp were found in the immediate vicinity of the majority of the analyzed mutations [16, 17]. These direct repeats either included or partially overlapped the deleted or inserted bases. A modified SSM model was proposed to explain these results. The existence of direct repeats in the vicinity of mutation sites is not surprising, since an extensive computer search of DNA sequences has shown significantly high levels of nontandem direct repeats (cryptic simplicity) in many coding or non-coding DNA sequences [18].

In this study, a more conservative analysis was performed in order to estimate the net influence of SSM in short (1–4 bp) insertion and deletion mutations. This was obtained by classifying a mutation as SSM only in those cases where the mutation could be simply explained by the SSM model as detailed in the Materials and Methods section. The results of this analysis are expected to enhance our understanding of the contribution of SSM to the mutability of human genes. We have studied three genes in which a large number of disease-causing mutations have been identified: (1) the cystic fibrosis (CF) transmembrane conductance regulator (CFTR) gene, in which 400 mutations have been identified [CF Genetic Analysis Consortium, pers. commun.]; (2) the β globin gene, in which over 100 mutations cause β thalassemia [19], and (3) the factor IX gene, in which over 300 mutations cause hemophilia B [20].

Materials and Methods

Sequences flanking mutations in the coding region of the CFTR, β-globin and factor IX genes were analyzed. The information was obtained from previously published reports [19, 20; CF Genetic Analysis Consortium, pers. commun.]. The analyzed mutations were located within published normal coding sequences as reported in GenBank.

Classification of Mutations

Mutations that were neither insertions nor deletions, and mutations in which the insertion or deletion was larger than 4 bp were classified as ‘other’ mutations. Classification as an SSM mutation was restricted to mutations of up to 4 bp since most length variation was found in repeat units of 1–4 bp [21]. Insertion or deletion mutations of 4 bp or less were classified as deriving from SSM if they met the following conditions: (1) deletion of n (1–4) bp was considered SSM, if and only if the adjacent n bases, from either side, were identical to the deleted bases; (2) insertion of n (1–4) bp was considered SSM if and only if the insertion was adjacent to a tandem repeat of at least 2 repeats with the same sequence as the inserted bases, and (3) when the deletion or insertion was itself a repetitive unit (e.g. TT, ACAC), only the repeat unit was compared with the adjacent bases to classify the nature of the mutation according to (1) and (2) (e.g. deletion of AA in a run of AAA would be considered SSM).

All other mutations were classified as non-SSM. At each SSM mutation, the length of the repetitive run in the normal sequence was counted. For example: (1) deletion of the underlined nucleotides in the normal sequence ACACACAC was counted as a 2-bp deletion SSM mutation in a run of four repeats and (2) insertion of the double underlined nucleotide in the sequence AAA was counted as a 1-bp insertion SSM mutation in a run of two repeats.
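The classification rules above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the study; the function names (repeat_unit, classify_deletion, classify_insertion) and the 0-based coordinates are hypothetical. Two assumptions are made: for an insertion, repeat copies on both sides of the insertion point are pooled when checking for "at least 2 repeats", and a mutation is taken at the position given, without realigning it to equivalent positions within the same run. The sequences ACACACAC, AA and AAA come from the examples in the text; GATC is an invented negative control.

```python
def repeat_unit(s: str) -> str:
    """Shortest unit u with s == u * k (condition 3: 'ACAC' -> 'AC')."""
    for size in range(1, len(s) + 1):
        if len(s) % size == 0 and s[:size] * (len(s) // size) == s:
            return s[:size]
    return s


def copies_left(seq: str, end: int, unit: str) -> int:
    """Number of tandem copies of unit that end exactly at index `end`."""
    n, count = len(unit), 0
    while end - n >= 0 and seq[end - n:end] == unit:
        count += 1
        end -= n
    return count


def copies_right(seq: str, start: int, unit: str) -> int:
    """Number of tandem copies of unit that begin exactly at index `start`."""
    n, count = len(unit), 0
    while seq[start:start + n] == unit:
        count += 1
        start += n
    return count


def classify_deletion(normal: str, start: int, n: int):
    """Return (is_ssm, run_length) for deleting normal[start:start + n]."""
    if not 1 <= n <= 4:
        return False, 0  # larger events fall under 'other' mutations
    deleted = normal[start:start + n]
    unit = repeat_unit(deleted)
    flank = copies_left(normal, start, unit) + copies_right(normal, start + n, unit)
    if flank == 0:
        return False, 0  # no identical unit on either side
    # The run in the normal sequence includes the deleted copies themselves.
    return True, flank + len(deleted) // len(unit)


def classify_insertion(normal: str, pos: int, inserted: str):
    """Return (is_ssm, run_length) for inserting `inserted` before normal[pos]."""
    if not 1 <= len(inserted) <= 4:
        return False, 0
    unit = repeat_unit(inserted)
    # Assumption: copies on both sides of the insertion point are pooled.
    run = copies_left(normal, pos, unit) + copies_right(normal, pos, unit)
    return (True, run) if run >= 2 else (False, 0)


print(classify_deletion("ACACACAC", 2, 2))  # 2-bp deletion, run of four repeats
print(classify_insertion("AA", 2, "A"))     # 1-bp insertion, run of two repeats
print(classify_deletion("AAA", 0, 2))       # AA deleted from AAA (condition 3)
print(classify_deletion("GATC", 1, 1))      # invented non-SSM control
```

Reducing the deleted or inserted bases to their shortest repeat unit before comparing them with the flanks is one way to read condition (3); it lets a deletion of AA in AAA be judged against single A units rather than against AA.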

It is important to note that deletion and insertion mutations might be classified as SSM if they happened to occur adjacent to a repetitive sequence, even if SSM is not actually the mechanism responsible for the mutation. Under the assumption that SSM is not involved in generating deletion and insertion mutations, the proportion of deletion and insertion mutations expected to be classified as SSM by chance was calculated as follows: (1) Deletion mutations. The proportion of 1-bp deletions was calculated by counting all the base pairs that if deleted would have been considered to result from SSM, divided by the total number of base pairs in the coding sequence analyzed. The proportions of all possible 2-, 3- and 4-bp deletions that would be classified as SSM were similarly calculated. (2) Insertion mutations. The proportion of 1-bp insertions was calculated by counting all the locations in which a 1-bp insertion would have been considered to result from SSM, and dividing by the total number of possible locations for insertions (the total number of base pairs + 1). The inserted nucleotides were taken to be A, C, G or T, each with equal probability. The proportion of 2-, 3- and 4-bp insertions was calculated in a similar manner by considering all possible 2, 3 and 4 bp at each point along the sequence. The net proportion of mutations that can be attributed to the mechanism of SSM was calculated as described in the Appendix.
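The chance expectation described above can be sketched as an enumeration over a coding sequence. This is an illustration under stated assumptions, not the original computation: the helper names (flank_copies, chance_deletion, chance_insertion) are hypothetical, the toy sequence AACG is invented, and for deletions of 2–4 bp the denominator is taken to be the number of possible start positions, which equals the number of base pairs when n = 1. The net proportion defined in the Appendix is not reproduced here.

```python
from itertools import product


def repeat_unit(s: str) -> str:
    """Shortest unit u with s == u * k (e.g. 'ACAC' -> 'AC')."""
    for size in range(1, len(s) + 1):
        if len(s) % size == 0 and s[:size] * (len(s) // size) == s:
            return s[:size]
    return s


def flank_copies(seq: str, left_end: int, right_start: int, unit: str) -> int:
    """Tandem copies of unit ending at left_end plus those starting at right_start."""
    n, count = len(unit), 0
    while left_end - n >= 0 and seq[left_end - n:left_end] == unit:
        count, left_end = count + 1, left_end - n
    while seq[right_start:right_start + n] == unit:
        count, right_start = count + 1, right_start + n
    return count


def chance_deletion(seq: str, n: int) -> float:
    """Fraction of all possible n-bp deletions that would be classified as SSM."""
    starts = range(len(seq) - n + 1)
    if len(starts) == 0:
        return 0.0
    hits = sum(
        flank_copies(seq, s, s + n, repeat_unit(seq[s:s + n])) >= 1
        for s in starts
    )
    return hits / len(starts)


def chance_insertion(seq: str, n: int) -> float:
    """Fraction of all (position, inserted n-mer) pairs classified as SSM.

    Every n-mer over A, C, G, T is weighted equally, and there are
    len(seq) + 1 possible insertion points.
    """
    positions = range(len(seq) + 1)
    hits = 0
    for inserted in map("".join, product("ACGT", repeat=n)):
        unit = repeat_unit(inserted)
        hits += sum(flank_copies(seq, p, p, unit) >= 2 for p in positions)
    return hits / (len(positions) * 4 ** n)


toy = "AACG"  # invented sequence, for illustration only
print(chance_deletion(toy, 1))   # 2 of 4 single-base deletions lie in the AA run
print(chance_insertion(toy, 1))  # 3 of 20 (position, base) pairs extend the AA run
```

Applied to a full coding sequence for n = 1 to 4, the two functions would give the by-chance proportions against which the observed SSM proportions are compared.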

Results

Several examples of the classification of mutations are presented in table 1. Table 2 presents the total number of mutations analyzed for each gene, the total number of insertion and deletion mutations and the proportion of the insertion and deletion mutations that can be classified as SSM mutations. The proportion of the total number of mutations that were insertions or deletions varied from 0.13 for factor IX to 0.45 for β globin. The proportion of total insertion and deletion mutations classified as SSM was 0.47.

Table 1 Examples of deletion (underlined) and insertion (double underlined) mutations of 4 bp or less found in the CFTR, β-globin and factor IX genes
Table 2 Insertion and deletion mutations in the CFTR, factor IX and β-globin genes

Table 3 presents results of the analysis of insertion and deletion mutations according to the number of deleted or inserted base pairs. Deletion and insertion mutations were analyzed separately. Results are presented for the three genes together, since similar proportions were found independently in each gene (data not shown). The number of deletion mutations (100) was 3-fold greater than the number of insertion mutations (34). In cases in which SSM mutations were present, the proportions of SSM mutations were tested for statistical difference from the proportions expected by random chance using an exact binomial test. With one exception, there were significantly more SSM mutations than would be expected by chance. The exception was the 1-bp deletions, for which the proportion was not significantly greater than that expected by chance (0.57 versus 0.48). The net proportion of mutations attributed to SSM was greater in insertions (0.36) than in deletions (0.24), with an overall proportion of 0.27.
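The exact binomial comparison mentioned above can be sketched as follows. The text does not state whether the test was one- or two-sided; this sketch assumes a one-sided upper-tail test, since the question is whether there are more SSM mutations than chance predicts. The function name binom_upper_tail is hypothetical, and the counts in the example are purely illustrative, not data from the study.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p).

    n: number of insertion (or deletion) mutations of a given size
    k: number of those classified as SSM
    p: proportion expected to be classified as SSM by chance
    """
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Illustrative numbers only: 3 SSM calls among 4 mutations, chance proportion 0.5.
p_value = binom_upper_tail(3, 4, 0.5)
print(p_value)  # (4 + 1) / 16 = 0.3125
```

A small tail probability would indicate more SSM-classified mutations than the chance proportion alone accounts for.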

Table 3 Total number of insertion and deletion mutations, number and proportion attributed to SSM according to type of mutation (insertion or deletion) and number of base pairs inserted or deleted

Table 4 presents the distribution of SSM insertion and deletion mutations, according to the number of repeats in the run. Mutations of 1 bp and mutations of > 1 bp were analyzed separately. It can be seen that the most frequent deletion mutation event is a 1-bp deletion in a run of two units. These mutations, however, are attributed mostly to a mechanism other than SSM, and classified as SSM only by chance. Insertion mutations are more common in longer runs.

Table 4 SSM mutations according to the type of mutation (insertion or deletion), the number of base pairs inserted or deleted (1 or > 1) and the number of repeats in the run at the site of the mutation (2, 3 or > 3) in the normal sequence

Discussion

The study analyzed 625 different disease-causing mutations in the coding regions of three human genes: CFTR, β globin and factor IX. A strict analysis was performed in order to estimate the net influence of SSM in short (1–4 bp) insertion and deletion mutations. This was obtained by classifying a mutation as SSM only in those cases in which the mutation could be simply explained by the SSM model. It is important to note that the repertoire of mutations found in patients showing disease symptoms does not reflect the entire spectrum of mutational events that might have occurred in these coding sequences in the course of evolution. Mutations without clinical effect or mutations that have not ‘survived’ through evolution are missing. The search uncovered 134 deletion or insertion mutations of 4 bp or less (table 2). These were 21% of all mutations. Our analysis revealed that a net proportion of 27% of all deletion or insertion mutations of 4 bp or less can be attributed to SSM events. In these cases, it is reasonable to assume that SSM, rather than unequal sister chromatid exchange, is the mechanism involved, since the analyzed mutations are within very short runs (2–7 repeats) which probably would not facilitate unequal pairing of DNA strands.

As seen in table 3, the proportion of deletion and insertion mutations that can be explained by SSM is significantly higher than expected by chance, except for the 1-bp deletions. These results indicate that SSM is a significant mechanism causing deletion and insertion mutations in human genes. The appearance of a simple run of 2 bp (AA, CC, GG and TT) is the most common repeat in any sequence. Therefore, it is not surprising that, as seen in table 4, the majority of the 1-bp deletion mutations are in runs of two repeats. This supports the suggestion that some of these mutations are caused by other mechanisms and are classified as SSM by chance only.

It is well recognized that the highly polymorphic microsatellite regions comprised of long runs (>20) of short tandem repeats are unstable, leading to changes in the number of the repeated unit. In addition, several human diseases have been found to be caused by the instability of long runs of 3 bp tandemly repeated. Our results suggest that very short runs (<8) of tandem repeats can also lead to misalignment and SSM. Since the frequency of short tandem repeats is higher than the frequency of long tandem repeats, the results presented here indicate that the instability of these short repeats contributes to DNA mutability in human genes.

Nontandem repeats, interrupted by a few base pairs, may also promote deletion and insertion mutations by SSM. This has been convincingly demonstrated in Escherichia coli [25, 26]. In our study, such mutations were not classified as SSM mutations. Nevertheless, several examples have been observed, such as a 4-bp deletion (double underlined) in the CF gene: CTACCAAGTCAACCAAACCATACAA-3667del4 [Estivill, pers. commun.]. This sequence comprises three repetitive units (underlined) of 4 bp, interrupted by a few base pairs. Thus, the proportion of deletion and insertion mutations in which short repeats are involved may well be even higher than found in this study. Furthermore, it has been suggested that the SSM mechanism is also involved in generating substitution mutations [27]. Thus, the SSM mechanism is probably a major mechanism for the generation of mutations in general.