Introduction

Usher syndrome type II (USH2) belongs to a genetically and phenotypically heterogeneous group of recessively inherited disorders that combine hearing loss and retinitis pigmentosa (RP). More specifically, USH2 displays moderate-to-severe hearing loss, postpubertal onset of RP and normal vestibular reflexes. Although three genes are responsible for USH2, USH2A accounts for more than 75% of USH2 cases.1, 2 Usher syndrome type IIA (USH2A; MIM 276901) represents the most common form of inherited deaf–blindness and is estimated to affect 1 in 17 000 individuals.3

The underlying USH2A gene was isolated by positional cloning.4 It was initially described as comprising 21 exons, the first of which is entirely non-coding, and as spanning a region of 250 kb; it was predicted to encode a 1546 amino-acid protein of 171 kDa. Today, this protein is recognised as the short isoform of usherin and is predicted to be a secreted extracellular protein.4, 5

Because the mutation detection rates obtained in screening studies were lower than expected, the existence of additional uncharacterised exons of USH2A was postulated. van Wijk et al6 identified 51 novel exons at the 3′ end of the gene, increasing its size to 800 kb. These authors also provided some evidence of alternative splicing. The predicted protein encoded by the longest open reading frame (5202 residues) is a member of the protein network known as the Usher interactome. This interactome has an essential role in the development of the stereocilia of the hair cells in the organ of Corti. In photoreceptors, the Usher interactome localises in the periciliary region and could be involved in the cargo transport between the inner and outer segment.7, 8, 9

Since the identification of the USH2A long isoform, a small number of mutation screenings have been reported, which indicate that the study of all 72 exons is mandatory for efficient molecular diagnosis.1, 2, 10, 11, 12, 13 As a result of these studies, together with the mutations of the short isoform reported before the year 2004, more than 210 mutations have been described. The great majority of these mutations are private or present in only a few families.14 However, a prevalent mutation located in exon 13, designated as c.2299delG, is frequently found in European and US patients, and also in isolated cases from South America, South Africa and Asia. The allele frequency distribution of c.2299delG varies geographically in Europe. This mutation accounts for 47.5% of USH2A alleles in Denmark and for 36% in Scandinavia,2 whereas an allelic frequency of 31% was found in the Netherlands,15 16–36% in the United Kingdom,16, 17 15% in Spain10 and 10% in France (unpublished results). A common ancestral origin has been hypothesised for the c.2299delG mutation on the basis that alleles bearing the c.2299delG mutation share the same core haplotype, restricted to the first 21 exons of the USH2A gene.18 In this study, we carry out an exhaustive analysis of the 51 additional exons of the long isoform that reveals high variability in numerous associated intragenic single nucleotide polymorphisms (SNPs), giving rise to at least 10 different c.2299delG haplotypes, but preserving the previously described core haplotype. All these data confirm the common origin of this ancestral mutation.

Materials and methods

Patients

A total of 27 patients were included in this study, of whom 17 were of Spanish origin and were recruited from the Federación de Asociaciones de Afectados de Retinosis Pigmentaria del Estado Español (FAARPEE) and from the ophthalmology and ENT Services of several Spanish hospitals. The remaining 10 patients were French and were recruited from medical genetics and ophthalmology clinics distributed all over France. The patients were classified as Usher type II on the basis of ophthalmological studies, including visual acuity, visual field and fundus ophthalmoscopy, electroretinography, pure-tone and speech audiometry and vestibular evaluation. For each patient, samples from parents were considered, as well as those from siblings, when possible. This study was approved by both the Hospital La Fe and CHU Montpellier Ethical Committees, and consent to genetic testing was obtained from adult probands or, in the case of minors, from their parents.

Controls

In total, 97 control chromosomes were used to establish the distribution of USH2A normal alleles. They were obtained from 50 trios (subject and both parents), 25 of French and 25 of Spanish origin, randomly chosen as the healthy control group. The members of these trios did not report symptoms or a history of Usher syndrome or related disorders.

DNA analysis of USH2A gene

Patient and control genomic DNA was extracted from peripheral blood samples using standard protocols. The 14 SNPs used to construct USH2A haplotypes of control and c.2299delG alleles were PCR amplified using the primers and PCR conditions previously described.5, 6 PCR products were directly sequenced on an ABI PRISM 3130xl (Applied Biosystems, CA, USA). The polymorphism IVS17-8T>G was not considered in this study. Because this variant was included in the core haplotypes defined by Dreyer et al,18 we indicate it in brackets to avoid any confusion when referring to Dreyer's data.

Construction of haplotypes

Parents and available siblings of the c.2299delG patients were used to infer the haplotypes linked to the c.2299delG mutation (M haplotypes). Similarly, control trios were used to establish normal USH2A haplotypes in a healthy population (C haplotypes). In all cases, haplotypes were manually generated by inheritance. In some cases, the data were not informative enough to establish the phase of the SNPs and some ambiguities remained. When possible, ambiguous haplotypes were ascribed to an already existing haplotype.
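The phasing-by-inheritance step described above can be illustrated with a minimal sketch. It assumes unordered genotypes at each SNP for a child and both parents; the function name `phase_child` and the toy genotypes are hypothetical and are not taken from the study data. A SNP is left unphased (`None`) when all three individuals are heterozygous, or when the genotypes are inconsistent with Mendelian inheritance, which mirrors the ambiguities mentioned in the text.

```python
from itertools import product

# One genotype per SNP, written as an unordered pair of alleles, e.g. ("A", "G").
Genotype = tuple[str, str]


def phase_child(child: list[Genotype],
                father: list[Genotype],
                mother: list[Genotype]) -> list:
    """Infer, SNP by SNP, which allele the child received from each parent.

    Returns a list with one entry per SNP: a (paternal_allele, maternal_allele)
    tuple when the phase is unambiguous, or None when it cannot be resolved.
    """
    phased = []
    for c, f, m in zip(child, father, mother):
        # Every (paternal, maternal) allele combination compatible with the child.
        options = {(a, b) for a, b in product(f, m) if sorted((a, b)) == sorted(c)}
        # Exactly one compatible combination means the phase is resolved.
        phased.append(options.pop() if len(options) == 1 else None)
    return phased


# Hypothetical three-SNP trio (illustrative values only).
child = [("A", "G"), ("C", "T"), ("G", "G")]
father = [("A", "A"), ("C", "T"), ("G", "A")]
mother = [("G", "G"), ("C", "T"), ("G", "G")]

print(phase_child(child, father, mother))
```

In this toy trio the second SNP stays unresolved because the child and both parents are heterozygous; in practice such positions would need additional relatives or assignment to an already observed haplotype, as described above.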

Construction of phylogenetic trees

Relationships between haplotypes were inferred using two approaches with three different data sets: the complete set of SNPs, the first 5 SNPs included in the first 21 exons of the gene and the last 9 SNPs located in the 3′ end of the gene (see Table 1). In the first approach, we constructed phylogenetic trees using a variety of methods and evolutionary models. However, the high levels of homoplasy present in this data set prevented the derivation of a single most reliable phylogenetic tree, either with the whole set or with any of the subsets of SNPs. Among the different trees obtained, we present the results obtained with the neighbour-joining method19 using the uncorrected number of differences between pairs of SNPs as a measure of their genetic divergence. Bootstrap support values were obtained using version 4.1 of the MEGA software (available from http://www.megasoftware.net).
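As an illustration of the distance measure used for the neighbour-joining analysis, the following sketch computes the uncorrected number of differences for every pair of haplotypes, counting the SNP positions at which they differ. The haplotype labels and allele strings are hypothetical, and the functions are not part of MEGA; the sketch only shows how such a distance matrix could be built before tree construction.

```python
def n_differences(h1: str, h2: str) -> int:
    """Uncorrected number of SNP positions at which two haplotypes differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return sum(a != b for a, b in zip(h1, h2))


def distance_matrix(haplotypes: dict[str, str]) -> list[list[int]]:
    """Symmetric matrix of pairwise differences, in the order of the dict keys."""
    names = list(haplotypes)
    return [[n_differences(haplotypes[a], haplotypes[b]) for b in names]
            for a in names]


# Hypothetical five-SNP haplotypes (one character per SNP allele).
toy = {"H1": "AGCAG", "H2": "AGCAA", "H3": "GACAA"}

for name, row in zip(toy, distance_matrix(toy)):
    print(name, row)
```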

Table 1 Location and repartition of the 14 SNPs used to establish the USH2A haplotypes

In addition, a median-joining network was obtained with the programme Network 4.5.10 (Fluxus Technology, http://www.fluxus-technology.com). A network represents all the alternative possibilities linking every haplotype considered through a minimum number of mutation steps and is not restricted to represent relationships as a single pathway. This is a more appropriate methodology than dichotomous phylogenetic trees for establishing relationships among closely related allele variants.20

Dating the USH2A c.2299delG mutation

To estimate the original date of the c.2299delG mutation in the USH2A gene, three mathematical approaches were applied, namely, a Monte Carlo likelihood method implemented in the programme BDMC21 v2.1,21 (http://sites.google.com/site/rannalaorg/software), a Markov chain method by means of the DMLE+ v2.2 software22, 23, 24 (http://www.dmle.org) and a moment method described by Bengtsson and Thomson.25

The programme BDMC21 v2.1 relies on the assumption that genetic variation among a group of highly linked polymorphic markers, defining a haplotype in which a novel non-recurrent mutation arose, is a function of the mutation frequencies of those linked markers and the time since the first occurrence of this unique mutation. For this approach, we considered information from the three variable SNPs closest to the c.2299delG mutation, namely, c.4714C>T, c.6506T>C and c.6875G>A. The confidence interval was estimated following the standard theory of maximum likelihood estimation.26 The second analysis was performed using the DMLE+ programme version 2.2, which takes into account the marker information from the entire haplotype on the basis of:

5′_c.373G>A_c.504A>G_c.1419C>T_IVS15+35G>A_c.4457G>A_c.4714C>T_c.6506T>C_c.6875G>A_c.10232A>C_c.11602A>G_c.11677C>A_c.12612A>G_c.12666A>G_c.13191G>A_3′.

This programme allows Bayesian inference of the mutation age based on the observed linkage disequilibrium (LD) at multiple genetic markers. For both approaches, we used a carrier frequency of USH2 of 1/106, a proportion of mutation-bearing chromosomes in our sample f=1.7 × 10^−5 and a population growth parameter d=0.05. Moreover, because the estimate of mutation age based on the DMLE+ v2.2 software seems to be sensitive to demographic parameters (growth rate, mutation frequency and population size),24 we analysed haplotype data considering a range of plausible growth rates (d=0.03–0.11) and proportions of chromosomes (f=1 × 10^−6–6 × 10^−5). After this, in order to verify the estimated allele age, we used a method described by Bengtsson and Thomson25 based on the algorithm g=logδ/log(1−θ), which depends on the LD (δ) and on the recombination frequency (θ), and is therefore insensitive to demographic parameters. For this analysis, we considered information from the three SNPs showing significant LD index values (δ), namely, c.4714C>T, c.6506T>C and c.10232A>C. SNPs c.373G>A, c.504A>G, c.1419C>T, IVS15+35G>A and c.4457G>A were not informative for this analysis because all disease chromosomes carried the same allele. SNPs c.6875G>A, c.11677C>A and c.12666A>G could not be used in this method because the proportion of disease chromosomes carrying the major allele (Pd) was lower than the proportion of normal chromosomes carrying that same allele (Pn). Finally, to set the genetic clock, we applied the Luria–Delbrück correction, that is, gc=g+g0,27 to avoid a possible underestimation.28, 29, 30
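The moment method can be sketched in a few lines. The formula g=logδ/log(1−θ) and the correction gc=g+g0 are taken from the text; the definition δ=(Pd−Pn)/(1−Pn) is an assumption here (it is a commonly used form of the LD index, but the text does not state it), and the function names and numerical inputs are purely illustrative. The correction term g0 is left as an input because its expression is not given in the text.

```python
import math


def ld_index(p_d: float, p_n: float) -> float:
    """LD index delta. ASSUMPTION: delta = (Pd - Pn) / (1 - Pn)."""
    if not (0.0 <= p_n < 1.0 and p_n < p_d <= 1.0):
        # Mirrors the exclusion of SNPs for which Pd is lower than Pn.
        raise ValueError("requires 0 <= Pn < Pd <= 1")
    return (p_d - p_n) / (1.0 - p_n)


def generations_since_mutation(delta: float, theta: float) -> float:
    """g = log(delta) / log(1 - theta), as given in the text."""
    return math.log(delta) / math.log(1.0 - theta)


def corrected_generations(g: float, g0: float) -> float:
    """Luria-Delbruck correction gc = g + g0; g0 must be supplied by the user."""
    return g + g0


# Illustrative values only (not taken from Table 4).
delta = ld_index(p_d=0.9, p_n=0.5)
g = generations_since_mutation(delta, theta=0.001)
print(f"delta = {delta:.2f}, g = {g:.1f} generations")
```

Under this definition, a marker for which every disease chromosome carries the same allele gives δ=1 and therefore g=0, which is consistent with such SNPs being described above as uninformative.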

Results

c.2299delG haplotypes

The c.2299delG haplotypes were built for the 27 USH2A patients using the 14 SNPs represented in Table 1. Seven of the patients were c.2299delG homozygotes (six were Spanish and one was French). A total of 10 different haplotypes were identified (M1–M10, see Table 2). The haplotypes were identical from exon 2 to 21, but the SNPs located along the 51 additional exons of the USH2A long isoform were variable (Table 2). The variability rate of the SNPs was uneven. For five of the SNPs (c.4714C>T, c.6875G>A, c.11602A>G, c.11677C>A and c.13191G>A) the same allele was present on at least eight haplotypes.

Table 2 Representation of the 10 different c.2299delG-linked haplotypes

Haplotype M1 was the most frequent in the Spanish population (8/23; frequency 0.35) followed by haplotype M2 (6/23; frequency of 0.26). Haplotype M1 was also the most prevalent in France, together with haplotype M8 (3/11; frequency 0.27). Haplotypes M4–M9 were restricted to either the Spanish or French populations. Haplotype M1 was the most common with a frequency of 0.32 (11/34) when both populations were pooled.

Control USH2A haplotypes

A total of 54 different haplotypes could be defined from the 97 control chromosomes (Supplementary Table 1). Variation was found along the entire gene; however, this variation is significantly higher in the region spanning exons 22 to 72. Two SNPs remained invariable: c.4714C>T and c.11677C>A. In addition, the c.6875G>A SNP had the same G allele in 53 of 54 haplotypes. Interestingly, this variant corresponds to the only CpG dinucleotide identified among the 14 SNPs (Table 1). Haplotype C1 was the most prevalent among the Spanish control population with a frequency of 0.1 (5/51) and haplotype C6 was the most frequent among the French controls with a frequency of 0.09 (4/46). Combining the data from both populations, haplotype C1 was the most prevalent with a frequency of 0.07 (7/97).

Relationship of haplotypes

The neighbour-joining tree for the entire haplotypes was rooted with the corresponding Pan troglodytes haplotype. It did not present a well-defined structure, as none of the nodes were supported by bootstrap analysis. Nevertheless, a small cluster encompassing six haplotypes related to the disease (M1, M2, M5–M8) was observed. The remaining disease-associated haplotypes did not group with this clade, but were not too distant from it (Supplementary Figure 1A). This pattern was very different from that obtained when only five SNPs from the first 21 exons of the gene were analysed. The common haplotype, including disease-related alleles as well as many others from control chromosomes, occupies an intermediate position between the oldest haplotypes, as inferred from their close relationship to the outgroup, and the most recently derived ones, that is, the group including C40–C46. Again, none of the nodes in this tree were supported by bootstrap analysis (Supplementary Figure 1B). This topology is markedly different from the one inferred from the remaining SNPs, those located at the 3′-end of the gene. Here, there was no longer a clear association between disease-related haplotypes, except for a small group including alleles M5–M8. Most of the other disease-related haplotypes were more closely related to control alleles than to any other disease allele but, again, these associations were not supported by bootstrap analysis (Supplementary Figure 1C).

The apparent lack of congruence between the phylogenetic histories of these alleles when considering SNPs from the 5′- and 3′-ends may be due to frequent recombination events. This was further checked by reconstruction of median-joining networks for the same three data sets described above. The three networks, especially those derived from the complete and the 3′-end sets of SNPs, present a high level of connectedness with many alternative routes connecting every possible pair of haplotypes (see Supplementary Figures 2A–C). There are also many haplotypes connected to several others with a minimum number of intermediate steps, and only a few haplotypes are connected to the rest through a single intermediate. This pattern is still present, although at a much reduced level, in the network derived from SNPs in the 5′-end of the gene (Supplementary Figure 2B), partly owing to the reduced number of different haplotypes in this part of the gene. The ancestral (C19, which includes Pan troglodytes) and the most abundant haplotypes are connected through an intermediate haplotype (either C28 or C17) and two point changes in SNPs c.3157+15G>A and c.4457G>A. These observations readily explain the difficulties, noted above, in reconstructing a phylogenetic tree with well-supported relationships. Although it is certainly possible to invoke homoplasic point mutations to explain these patterns, they are more likely due to a high level of recombination, with an apparently higher rate in the second part of the gene.

Dating the c.2299delG mutation

We estimated the allele age of the USH2A c.2299delG mutation using three mathematical approaches. Haplotype data were analysed for the Spanish and the French populations separately and also together in the pooled populations using both the BDMC21 v2.1 programme and the DMLE+ v2.2 software. Results were quite similar for both mathematical methods (Table 3). Taking the whole studied population into account, the estimated age of the c.2299delG mutation was 245.4 generations (95% CI 245.2–245.6) with the BDMC21 v2.1 programme and 231.3 generations (95% CI 204.8–245.6) with the DMLE+ v2.2 software. Assuming a generation time of 28 years,31 these results indicate that the c.2299delG mutation arose between 6476 and 6871 years ago.

Table 3 Summarised results of c.2299delG dating using the BDMC21 and DMLE+ programmes

For the DMLE+ v2.2 approach, the estimate varied widely with the growth rate, from g=374.96 (95% CI 321.84–464.08) for d=0.03 to g=111.84 (95% CI 97.39–138.38) for d=0.11, which corresponds to an estimated allelic age of about 10 500 years for the former and 3100 years for the latter. Across the analysed range of f, the estimate varied from g=288.4 (95% CI 256.4–346.8) for f=1 × 10^−6 to g=204.3 (95% CI 176.2–260.9) for f=6 × 10^−5, that is, from about 8000 to 5700 years, respectively (Figure 1a and b).

Figure 1 Estimation of c.2299delG allelic age using DMLE+. (a) Calculated using a variable proportion of mutated chromosomes (f). (b) Calculated using a variable population growth rate (d).

Finally, we analysed the LD data using the algorithm g=logδ/log(1–θ) and the Luria–Delbrück correction. These results showed that the c.2299delG mutation arose 95–206 generations ago and increased to 163–264 generations ago when applying the Luria–Delbrück correction. Assuming a generation time of 28 years,31 this would indicate that the USH2A c.2299delG mutation could have arisen 2700–5800 years ago or 4600–7400 years ago with the correction (Table 4).
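The conversions from generations to years reported in this section can be checked with a short calculation. The sketch below only uses the generation counts and the 28-year generation time quoted in the text; the helper name `years` and the dictionary labels are illustrative.

```python
GENERATION_TIME_YEARS = 28  # generation time assumed in the text


def years(generations: float, generation_time: float = GENERATION_TIME_YEARS) -> float:
    """Convert an allele age in generations to an age in years."""
    return generations * generation_time


# Generation counts quoted in the Results.
estimates = {
    "DMLE+ (pooled)": 231.3,
    "BDMC21 (pooled)": 245.4,
    "DMLE+ d=0.03": 374.96,
    "DMLE+ d=0.11": 111.84,
    "LD, lower bound": 95,
    "LD, upper bound": 206,
    "LD corrected, lower bound": 163,
    "LD corrected, upper bound": 264,
}

for label, g in estimates.items():
    print(f"{label}: {g} generations is about {years(g):.0f} years")
```

Rounded to the nearest hundred years, the last six entries reproduce the ranges given in the text (10 500 and 3100 years; 2700–5800 years; 4600–7400 years).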

Table 4 LD analysis and corrected estimated age (gc) of the c.2299delG (USH2A) mutation in South European patients

Discussion

The data obtained from the entire haplotypes of the control population reveal a highly variable genetic background, as 54 haplotypes could be identified in the Spanish and French populations with no evidence of a prevalent common haplotype (Supplementary Table 1). In the year 2001, 12 core haplotypes were identified by Dreyer et al18 in a Scandinavian control population. These were based on partial information as only part of the USH2A gene was then recognised. These authors identified a major haplotype ‘A-G-C-A-(T)-A’ with a frequency of 0.60. This core haplotype is also the most frequent one in our control group (C1 to C16), but overall represents less than 50%. The same core haplotype is found in all c.2299delG alleles within the first 21 exons, confirming the existence of high LD in this 250 kb region.

The C>T distribution of the c.4714 SNP is quite striking. The C allele is present in all control haplotypes, but it is carried by only two disease-associated haplotypes, M9 and M10, that represent less than 15% of the c.2299delG alleles. LD between the c.4714T allele and the c.2299delG mutation had already been noted in a French study.1 Dreyer et al2 identified this SNP in both the c.2299delG and control Scandinavian alleles. However, we do not know whether the majority of the c.2299delG patients in Northern Europe also carry the T allele at this position. Extending the studies to Northern Europe and other populations should help to clarify this point.

The variability observed in the additional portion of the gene spanning exons 22 to 72 (that is, about 500 kb) is quite puzzling, suggesting a high recombinational activity at the 3′ end of the gene and a conservation of the 5′ end. We analysed the mutability rate of USH2A SNPs by looking at CpG dinucleotides (Table 1). Only one CpG was found, in exon 36 at position 6875; CpG mutability therefore cannot explain the variability observed in the 3′ region. The median-joining networks reconstructed in order to find the relationship between haplotypes showed a high level of connectedness, especially for the second part of the gene. These networks are more easily explained by the existence of high recombination rates than by point mutations. These analyses using formal methods support the view that recombination events represent the predominant source of variability in this gene. Subsequently, we looked along the entire USH2A DNA sequence for a common sequence motif, CCNCCNTNNCCNC, associated with recombination hotspots in humans.32 A total of 20 motif locations were found, 4 within the first 20 introns and 16 between introns 21 and 71. Given that the first region spans about 250 kb and the second about 500 kb, the density of putative recombination hotspots is roughly twice as high in the most variable region.
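A motif scan of this kind can be sketched as follows. This is a minimal illustration and not the procedure actually used in the study: it assumes a plain nucleotide string as input, treats N as any of A, C, G or T, and searches both strands; the function names and the toy sequence are hypothetical.

```python
import re

MOTIF = "CCNCCNTNNCCNC"  # hotspot-associated motif quoted in the text
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a sequence over the alphabet A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def motif_to_regex(motif: str) -> re.Pattern:
    """Compile a motif, with N matching any base, as a lookahead pattern
    so that overlapping occurrences are also reported."""
    body = "".join("[ACGT]" if base == "N" else base for base in motif)
    return re.compile(f"(?=({body}))")


def find_motif(sequence: str, motif: str = MOTIF) -> list[tuple[int, str]]:
    """Return sorted (0-based start, strand) pairs for every motif occurrence."""
    sequence = sequence.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        regex = motif_to_regex(pattern)
        hits.extend((m.start(), strand) for m in regex.finditer(sequence))
    return sorted(hits)


# Toy sequence with one occurrence on each strand (illustrative only).
toy = "AAA" + "CCACCGTAACCTC" + "TTT" + "GAGGTTACGGTGG" + "AA"
print(find_motif(toy))
```

Counting hits separately within each region and dividing by the region length would give the hotspot density that the comparison between the 5′ and 3′ portions of the gene relies on.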

The three mathematical approaches yielded a wide range of estimated allelic ages for the c.2299delG mutation. When we applied BDMC21 and DMLE+, the estimated allelic age ranged from about 5500 to 7000 years. When we used variable f and d to assess the sensitivity of these methods to demographic parameters, the allelic age ranged from 3100 to 10 500 years. Finally, using a method based on genetic parameters, we obtained an estimate of 2700–5800 years. Labuda et al29 suggested that the genetic clock leads to an underestimation when it is applied to growing populations and used a correction (based on d and f) to avoid this underestimation. When we applied this correction, we obtained a range of ages from 4500 to 7500 years.

The programmes BDMC21 and DMLE+ are highly dependent on demographic parameters. In fact, when one uses a range of d and f, the estimated allelic age varies considerably, reflecting that these programmes hardly consider genetic data. Results are strongly biased because the c.2299delG allelic frequency estimate was based only on clinical data and on the current prevalence of the disease. Moreover, the overall demographic growth parameter for Europe may not be equivalent to the local growth rate for the Spanish and French populations. Thus, further studies considering the rest of the European populations are needed to estimate a more realistic figure for the original date of c.2299delG.

There are no data concerning the c.2299delG frequency among North-Africans. However, c.2299delG is not a prevalent mutation in the non-Ashkenazi Jewish populations from the South and Near East regions.11, 33, 34 This supports the hypothesis of the more recent migration fluxes across the Mediterranean Sea as a cause of the reduced frequency of the c.2299delG mutation within Northern Mediterranean populations. Another interesting point is the presence of c.2299delG in Asian patients. This mutation was found in isolated patients of Chinese origin.15 The recent studies carried out by Dai et al12 in China and Nakanishi et al13 in Japan indicate that c.2299delG is not common among Asian USH2 patients, although the authors only screened 6 and 10 patients, respectively. Further studies are needed to investigate the frequency of c.2299delG in this and other non-European populations.

In relation to those territories with a history of European colonisation, such as America and South Africa, it has already been pointed out by Dreyer et al18 that the recent waves of European migration to the New World and other countries would explain the presence of c.2299delG in these populations.

The exhaustive study of the 3′ region of the USH2A gene in our cohort of patients has revealed that haplotypes linked to the c.2299delG mutation show high variability, but preserve the previously described core haplotype ‘A-G-C-A-(T)-A’. This common haplotype is restricted to 250 kb in the 5′ region of this gene, which corresponds to the USH2A protein short isoform. By extending this study to the control population, we have shown the existence of LD restricted to this 250 kb region. The analysis of the relationship between USH2A haplotypes suggests that the major source of variability in this gene is recombination. The higher variability observed in the 3′ region could be explained by the accumulation of recombination hotspots observed in specific intronic sequences of this portion of the gene. It is difficult to ascertain whether the structural and dynamic differences observed between the 5′ and 3′ regions of the gene could have a functional significance. Our data do not allow us to estimate a realistic allelic age for c.2299delG. Nevertheless, this mutation seems to have a European ancestral origin.