Introduction

Spinal muscular atrophy (SMA; type I, OMIM# 253300; type II, OMIM# 253550; type III, OMIM# 253400), an autosomal recessive disorder characterized by loss of motor neurons in the anterior horn of the spinal cord, affects 1 in 10 000 live births.1 All three types of SMA are caused by mutations in SMN1 (OMIM# 600354) on chromosome 5q13.2 The coding regions of SMN1 and its homologue, SMN2 (OMIM# 601627) differ in only one base.3 Although translationally silent, this transition 840C>T in SMN2 exon 7 alters the splicing pattern of SMN2 transcripts, resulting in a lower level of full-length SMN mRNA from SMN2 than from SMN1.4, 5 Recent evidence suggests that 840C>T in SMN2 activates an exonic splicing silencer, which functions as a binding site for the known repressor protein hnRNP A1.6 Approximately 94% of clinically typical SMA patients lack both copies of SMN1 by either gene deletion or SMN1-to-SMN2 gene conversion (ie, 840C>T).7 Approximately 30 small intragenic mutations have also been identified.8

SMN gene dosage analyses, performed to identify SMA carriers with only one copy of SMN1, have also identified three copies of SMN1 in some normal individuals, implying the presence of two copies of SMN1 on a single chromosome 5.9, 10 The presence of this ‘2-copy-SMN1 allele’ is associated with a decreased SMN2 copy number.11 Importantly, SMA carriers may have two copies of SMN1 on one chromosome 5 and a deletion/conversion mutation of SMN1 on the other chromosome 5 (the ‘2+0’ SMN1 genotype).10 By dosage analysis alone, such carriers, as well as carriers of small, intragenic mutations, are indistinguishable from normal individuals with one copy of SMN1 on each chromosome 5, unless a monosomal analysis technique12 is utilized. Adding to the complexity of SMA genetics is a high de novo SMN1 mutation rate.13 The pericentromeric location14 and presence of the paralogous segmental duplication15 likely predispose the region to recombination events, leading to the high observed de novo deletion mutation rate.

Because of the genetic complexity of SMA and the high carrier frequency, genetic counseling and risk assessment are essential components of genetic testing for SMA patients and their families.16 However, previous risk-assessment studies were based on data derived from a relatively limited number of studies and subjects.1, 16, 17 For the most accurate possible risk calculations, we meta-analyzed all available and reliable data including our own new data to determine the SMN1 allele frequencies and de novo SMN1 mutation rates. To better understand the molecular evolution of the SMN region, we also estimated the various SMN1-SMN2 haplotype frequencies by haplotype reconstruction and meta-analysis.

Symbols used

Throughout the manuscript, a plus (+) symbol separates the SMN1 copy number on each chromosome; a colon (:) separates the total SMN1 copy number and the total SMN2 copy number; and a hyphen (-) separates the SMN1 copy number and the SMN2 copy number on the same chromosome (SMN1-SMN2 haplotype). An SMN1 genotype is indicated by ‘(SMN1 allele on one chromosome 5)+(SMN1 allele on the other chromosome 5).’ For example, a ‘2+0’ genotype denotes two copies of SMN1 on one chromosome 5 (‘2-copy allele’), and zero SMN1 copies on the other chromosome 5 (‘0-copy allele’).

Methods

Subject samples and SMN gene dosage analysis

SMN gene dosage analysis to determine the copy numbers of SMN1 and SMN2 was performed on genomic DNA samples from 176 unaffected individuals without a family history of SMA in the Molecular Pathology Laboratory, Hospital of the University of Pennsylvania, on a clinical basis. These were new cases in addition to those analyzed for our previous studies.11, 18 Results were anonymized and used for this study. There was no evidence for the presence of any ethnic group with skewed SMN1 or SMN2 copy numbers. Genomic DNA was extracted from peripheral blood leukocytes using Puregene reagents (Gentra Systems, Minneapolis, MN, USA). SMN gene dosage analysis was developed and validated as described previously.10, 19, 20 SMN1 and SMN2 copy numbers were determined by quantification of the PCR products after DraI digestion to differentiate SMN1 from SMN2, and normalizations to genomic standards, internal standards, and two-SMN1-copy controls. All samples were assayed in duplicate.

Literature used for data collection

We previously described methods of meta-analysis to determine SMN1 allele frequencies.1 We improved our methods by using SMN1 copy number data in the general population for the calculations of disease allele frequencies. At the time of our previous study, copy number data were insufficient. For the calculation of SMN1 disease allele frequencies, SMA population incidence data, which we used previously, may overestimate the disease allele frequency due to confounders such as consanguinity. Population data might also lead to underestimation of the disease allele frequency, because SMA patients with very severe disease might not survive to birth. In addition, SMA diagnoses in population studies are, in general, based on variable clinical criteria, rather than genetic testing results. To calculate SMN1 allele frequencies as accurately as possible, and in the absence of a single large study, all reliable data available in the literature should be used. We included the following data: the frequencies of individuals with one, two, three, and four copies of SMN1 in the general population9, 18, 21, 22, 23, 24 (and this study) (Table 1); the fraction of individuals for each SMA type who lack both copies of SMN1 exon 7 among those with identifiable mutations in both SMN1 alleles;25 and the frequency of patients completely lacking SMN1 who received a de novo SMN1 deletion/conversion mutation.1, 13, 18, 23, 25 We excluded data from other studies for reasons described previously.1 The data of Cusin et al23 include those of Gerard et al26 (V Cusin, personal communication). To determine the various SMN1-SMN2 haplotype frequencies, data were collected from studies of the distribution of SMN2 copy numbers in SMA patients who lack both copies of SMN111, 21, 26, 27, 28 (Table 2), and the distribution of the SMN1 and SMN2 copy numbers in the general population9, 11, 23, 24 (with V Cusin, personal communication; D Anhuf, personal communication; and this study) (Table 3).

Table 1 SMN1 copy number distributions among individuals without a family history of SMA in the general population
Table 2 SMN2 copy number distributions in SMA patients lacking SMN1
Table 3 Combined SMN1 and SMN2 copy number distributions in the general population

Haplotype reconstruction and statistical analysis

Data contained in Tables 2 and 3 were inspected and systematically rewritten in terms of the haplotypes illustrated in Figure 1. Disease haplotypes in Table 2 were reconstructed with the simplest explanatory haplotypes among disease alleles. Normal haplotypes in Table 3 were reconstructed from among the simplest explanatory normal alleles. Where haplotype reconstructions were equivocal, both haplotypes were tabulated with equal probabilities.

Figure 1 SMN1-SMN2 haplotypes and frequencies.

Reconstructed haplotypes were formatted for presentation using Tools for Population Genetic Analysis (TFPGA) [Mark P Miller, 1997. Tools for population genetic analysis (TFPGA) 1.3: A Windows program for the analysis of allozyme and molecular population genetic data. Computer software distributed by the author]. TFPGA provided both χ² and Fisher's exact statistics. Some frequencies were recalculated using SPSS for Windows, version 11.0.1. The reconstructed haplotypes were also compiled through submission to PHASE, version 2.0.2, written by Stephens et al.29

Results

SMN1 allele frequencies

SMN1 alleles were designated as follows: most SMN1 disease alleles are ‘0-copy alleles’ (SMN1 gene deletions or SMN1-to-SMN2 gene conversions). A normal chromosome 5 usually has one copy of SMN1 (‘1-copy allele’). However, two copies of SMN1 are occasionally present on a single normal chromosome 5 (‘2-copy allele’). SMN1 disease alleles with a small, intragenic mutation are referred to as ‘1D alleles’ (standing for ‘1-copy-Disease’); 1D alleles are indistinguishable from normal 1-copy alleles by dosage analysis. The allele frequencies are designated as follows:

  • a=0-copy (disease) allele frequency

  • b=1-copy (normal) allele frequency

  • c=2-copy (normal) allele frequency

  • d=1D (disease) allele frequency

Asymptomatic individuals who lack both copies of SMN1, all of whom were relatives of affected individuals, have been described.9, 30, 31, 32, 33, 34, 35 However, all 0-copy alleles are regarded as disease alleles for reasons described in detail elsewhere.1

Two asymptomatic individuals without a family history who have a polymorphism (9T) at the polythymidine tract (8T) in SMN1 intron 6 (IVS6-24dupT) have been reported (one18 and the other in this study). Thus, the observed frequency of the 9T allele among all normal SMN1 alleles is 2/2813=7.1 × 10−4. Since there is no evidence indicating that 9T alleles are anything other than polymorphisms without functional consequence, they are regarded as normal alleles.

We previously calculated SMA incidence as 1.0 × 10−4 using several large population studies.1 However, as described in Methods, using SMA population incidence data to calculate the SMN1 disease allele frequency is problematic, because of variable diagnostic criteria, consanguinity, and potential embryonic lethality of some severely affected fetuses. Instead, we calculated the frequencies of the various SMN1 alleles (except for the 1D alleles), using data on the frequencies of individuals with one, two, three, and four copies of SMN1 in the general population9, 18, 21, 22, 23, 24 (and this study) (Table 1). Assuming Hardy-Weinberg equilibrium,

From (3) and (6), b=0.947. After reiterative calculations, a=0.0132, b=0.947, c=0.0377.

Among patients with SMA of all three types who had identifiable mutations in both alleles of SMN1, Wirth et al25 found that 18 of 501 individuals (with a total of 1002 alleles) had one 0-copy allele and one 1D allele. These data imply that the ratio of a to d is approximately 984/18. Thus, d=0.0132 × 18/984=0.000241. If a, b, c, and d are proportionally adjusted so that their sum equals 1, then a (the 0-copy disease allele frequency)=0.0132, b (the 1-copy normal allele frequency)=0.949, c (the 2-copy normal allele frequency)=0.0378, and d (the 1D disease allele frequency)=0.000241 (Table 4). Studies to determine SMN1 copy numbers among carriers showed that a total of 19 of 494 carriers with a deletion/conversion mutation of SMN1 on one chromosome have two copies of SMN1,18, 21, 23, 36 implying that 19 (3.8%) of 494 normal alleles are 2-copy alleles, which matches the above SMN1 allele frequencies {c/(b+c)=3.8%}. The disease frequency derived from our data {(a+d)²=1.8 × 10−4} is considerably higher than the disease frequency (1.0 × 10−4) derived from large population studies1 (see also On-line Supplemental Material). This may be due in part to variable diagnostic criteria for inclusion of SMA cases and embryonic lethality in some severely affected fetuses.
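The arithmetic in this paragraph can be retraced with a short Python sketch. It takes the pre-adjustment values of a, b, and c and the 984:18 ratio quoted above as inputs; the variable names are ours, and the script is an illustrative check rather than the authors' original computation.

```python
# Pre-adjustment allele frequencies quoted in the text
a, b, c = 0.0132, 0.947, 0.0377

# 1D allele frequency from the a:d ratio of 984:18 (Wirth et al)
d = a * 18 / 984

# Rescale proportionally so that the four frequencies sum to 1
total = a + b + c + d
a_adj, b_adj, c_adj, d_adj = (x / total for x in (a, b, c, d))

# Expected fraction of affected individuals under Hardy-Weinberg equilibrium
disease_frequency = (a_adj + d_adj) ** 2

# Fraction of normal alleles that are 2-copy alleles
two_copy_share = c_adj / (b_adj + c_adj)

print(f"d = {d:.3g}")
print(f"adjusted a, b, c, d = {a_adj:.4f}, {b_adj:.3f}, {c_adj:.4f}, {d_adj:.6f}")
print(f"(a+d)^2 = {disease_frequency:.2e}")
print(f"c/(b+c) = {two_copy_share:.1%}")
```

The value of c/(b+c) can be compared directly with the independent carrier-based estimate of 19/494.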

Table 4 SMN1 allele and genotype frequencies

When an asymptomatic family member of individuals affected with clinically typical SMA of known type has two or three copies of SMN1, but no genetic testing has been performed on the affected individual, the type of SMA in the patient has an impact on risk assessment.1 This is because the fraction of 1D alleles among all disease alleles changes according to the type of SMA.25 Thus, using the data of Wirth et al,25 we calculated relative 1D allele and 0-copy allele frequencies for each SMA type (Table 4), but only for the purpose of risk assessment, and not for the calculations of the true SMN1 allele frequencies in the general population.

The observed frequency of 4-copy individuals in the general population, 7/1203=5.81 × 10−3, is significantly higher than the expected frequency of c²=1.43 × 10−3 (P<0.005 by χ² test), suggesting that ‘3-copy alleles’ (ie, one chromosome with 3 copies of SMN1) might exist. Assuming c′=the ‘3-copy allele’ frequency, c²+2bc′=5.81 × 10−3. Thus, c′=2.30 × 10−3, which is approximately one order of magnitude lower than c. Because the existence of the ‘3-copy allele’ remains unproven, and because, even if it exists, its frequency would be very small relative to other normal alleles, the SMN1 allele frequencies in Table 4 were determined on the assumption that ‘3-copy alleles’ do not exist.
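A minimal Python sketch of this comparison follows. The goodness-of-fit test is written as a one-degree-of-freedom χ² on two categories (4-copy versus all other individuals). That form is our assumption, since the exact test layout is not specified, and the small expected count means the P-value should be read as indicative only.

```python
import math

n = 1203            # individuals in the general-population sample
observed = 7        # individuals with four copies of SMN1
b, c = 0.949, 0.0378

# Expected 4-copy fraction if only 2-copy + 2-copy genotypes give four copies
expected_fraction = c ** 2
expected = n * expected_fraction

# Chi-square statistic on two categories (assumed test layout)
chi2 = ((observed - expected) ** 2 / expected
        + ((n - observed) - (n - expected)) ** 2 / (n - expected))
# Survival function of the chi-square distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

# Hypothetical 3-copy allele frequency c3 from c^2 + 2*b*c3 = observed fraction
c3 = (observed / n - expected_fraction) / (2 * b)

print(f"expected 4-copy fraction = {expected_fraction:.2e}")
print(f"chi2 = {chi2:.1f}, P = {p_value:.1e}")
print(f"c3 = {c3:.2e}")
```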

SMN1 de novo mutation rates

A total of 12 independent SMN1 de novo mutations (among 494 patients lacking SMN1) have been described. Wirth et al13 reported six paternal de novo deletions and one maternal de novo conversion25 among 340 patients lacking SMN1. Ogino et al1 reported two paternal de novo deletions among 53 patients. Cusin et al23 reported one paternal de novo deletion, one paternal de novo conversion, and one maternal de novo deletion among 101 patients. Using the methods of Wirth et al,13 we estimated paternal and maternal de novo mutation rates to be ν=2.11 × 10−4 and μ=4.15 × 10−5, respectively. Estimated rates of paternal and maternal de novo conversion mutations (based on one event for each) are the same: 2.07 × 10−5. These figures are based on small numbers of patients with de novo mutations; further studies are necessary to calculate de novo mutation rates more accurately.

SMN1-SMN2 haplotype frequencies

SMN1-SMN2 haplotypes are designated as ‘(SMN1 copy number)-(SMN2 copy number)’ (Figure 1). We did not take into account SMN1-SMN2 haplotypes with small intragenic SMN1 mutations, which are rare, due to insufficient published data. To determine SMN1-SMN2 haplotype frequencies, we used published data from several studies, and our own data, on SMN1 and SMN2 copy number distributions in SMA patients and individuals in the general population (Tables 2 and 3). Typically, haplotypes are determined by linkage studies. However, it is often difficult to determine the distribution of SMN1 and SMN2 copies between the two chromosomes 5 in a given individual, even after linkage analysis. Instead, to estimate the SMN1-SMN2 haplotype frequencies, we assumed Hardy–Weinberg equilibrium and constructed mathematical formulae using the observed SMN1 and SMN2 copy number distributions. The haplotype frequencies are designated as follows (a=e+f+g, b=h+i+j, and c=k+l):

  • e=0-0 (disease) haplotype frequency

  • f=0-1 (disease) haplotype frequency

  • g=0-2 (disease) haplotype frequency

  • h=1-0 (normal) haplotype frequency

  • i=1-1 (normal) haplotype frequency

  • j=1-2 (normal) haplotype frequency

  • k=2-0 (normal) haplotype frequency

  • l=2-1 (normal) haplotype frequency

To decrease errors associated with small numbers of observations, we calculated the disease haplotype frequencies (e, f, and g) using the SMN1:SMN2 copy number distribution data in patients (Table 2), rather than the SMN1:SMN2 copy number distribution data in normal populations (Table 3). From the data in the On-line Supplemental Material, we estimate that type I, type II, and type III comprise 57.5, 29.2, and 13.2%, respectively, of all typical SMA cases. The data in Table 2 do not take into account the incidence of each SMA type. Hence, to calculate e, f, and g in the general population, the data in Table 2 need to be adjusted according to the incidence of type II and type III SMA relative to that of type I. For type II SMA, the total number of patients should be adjusted to 271 × (0.292/0.575)=137.6, with 13.9 two-SMN2-copy individuals, 114.1 three-SMN2-copy individuals, and 9.6 four-SMN2-copy individuals. For type III, the total number of patients should be adjusted to 271 × (0.132/0.575)=62.2, with 1.04 two-copy individuals, 39.6 three-copy individuals, and 21.5 four-copy individuals. Then, the adjusted numbers (fractions) of SMA patients (total N=470.8) with one, two, three, and four copies of SMN2 are 23 (0.0488), 216.9 (0.461), 199.8 (0.424), and 31.2 (0.0662), respectively. The incidence per live birth of all types of SMA is defined as m (=1.80 × 10−4). Assuming Hardy–Weinberg equilibrium [a genotype is designated as ‘(total SMN1 copy number): (total SMN2 copy number)’],

From (10), g=3.45 × 10−3. Then, from (9), f=1.11 × 10−2, and from (7), e=3.96 × 10−4. Because this value for g is based on a relatively small number of individuals, the above value for f also has a substantial error. Therefore, formula (8) should be used to calculate f; from (8), f=8.96 × 10−3. Then, from (9), g=4.26 × 10−3, and from (7), e=4.90 × 10−4. After iterative calculations, f=8.87 × 10−3, g=4.30 × 10−3, and e=4.95 × 10−4. When e, f, and g are proportionally adjusted so that their sum equals a, f (the 0-1 haplotype frequency)=8.6 × 10−3, g (the 0-2 haplotype frequency)=4.2 × 10−3, and e (the 0-0 haplotype frequency)=4.8 × 10−4. These results indicate that the 0-1 haplotype, the 0-2 haplotype, and the 0-0 haplotype comprise 65, 31, and 3.6%, respectively, of disease haplotypes with 0 copies of SMN1.
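The iterative calculation above can be sketched in Python as a fixed-point iteration. The four Hardy–Weinberg relations in the comments are our reconstruction of formulae (7) through (10) from the genotype definitions and the quoted intermediate values; they should be treated as an assumption, and the function name and tolerance are illustrative.

```python
import math

m = 1.80e-4  # incidence of all SMA types per live birth
# Adjusted fractions of patients with 1, 2, 3, and 4 copies of SMN2
p1, p2, p3, p4 = 0.0488, 0.461, 0.424, 0.0662

# Assumed relations (reconstructed; numbering follows the text):
#   (7)  2*e*f          = p1 * m   (0:1 genotype)
#   (8)  f**2 + 2*e*g   = p2 * m   (0:2 genotype)
#   (9)  2*f*g          = p3 * m   (0:3 genotype)
#   (10) g**2           = p4 * m   (0:4 genotype)

def solve_disease_haplotypes(m, p1, p2, p3, p4, tol=1e-12, max_iter=100):
    """Return (e, f, g) by iterating (8) -> (9) -> (7) until f stabilises."""
    g = math.sqrt(p4 * m)        # starting value from (10)
    f = p3 * m / (2 * g)         # (9)
    e = p1 * m / (2 * f)         # (7)
    for _ in range(max_iter):
        f_new = math.sqrt(p2 * m - 2 * e * g)  # (8), based on the largest counts
        g = p3 * m / (2 * f_new)               # (9)
        e = p1 * m / (2 * f_new)               # (7)
        converged = abs(f_new - f) < tol
        f = f_new
        if converged:
            break
    return e, f, g

e, f, g = solve_disease_haplotypes(m, p1, p2, p3, p4)

# Rescale proportionally so that e + f + g equals the 0-copy allele frequency a
a = 0.0132
scale = a / (e + f + g)
e_adj, f_adj, g_adj = e * scale, f * scale, g * scale

print(f"e = {e:.3g}, f = {f:.3g}, g = {g:.3g}")
print(f"adjusted: e = {e_adj:.2g}, f = {f_adj:.2g}, g = {g_adj:.2g}")
```

Starting the iteration from (10) and then switching to (8) mirrors the order described in the text, in which the estimate based on the smallest patient group is used only as an initial value.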

Because a knockout of the murine SMN gene is associated with embryonic lethality,37 and because humans with the 0:0 genotype have never been observed, the expression of some full-length SMN protein seems to be essential for survival. Hence, by starting with data on individuals affected with SMA for our calculations, we are potentially introducing an ascertainment bias against the 0-0 haplotype, since we would never expect to see the 0:0 genotype among living individuals. However, the 0-0 haplotype is sufficiently rare that the existence of the 0:0 genotype can be ignored, and the far more common 0:1 genotype can be used, to estimate e. Using the data in Table 3, and assuming Hardy-Weinberg equilibrium,

To decrease errors associated with small numbers of subjects, we used equations containing larger numerators (ie, formulae (16) through (19) and (21) through (23)). Likewise, to calculate e, f, and g, we used population data based on large numbers of SMA patients (ie, formulae (7) through (10)), rather than small numbers of carrier individuals (ie, formulae (11) through (15)). The potential existence of a haplotype with three copies of SMN1 might compromise the accuracy of estimates based primarily on formulae (25) through (27), and, to lesser degrees, on formulae (21) through (24). Thus, we did not primarily use formulae (25) through (27).

With the above values for j, k and l, the values for h, i, j, k and l can be refined further.

Because b=h+i+j, we adjusted the values for h, i and j proportionately: h (the 1-0 haplotype frequency)=0.27, i (the 1-1 haplotype frequency)=0.66, and j (the 1-2 haplotype frequency)=0.015. Likewise, because c=k+l, we adjusted the values for k and l proportionately: k (the 2-0 haplotype frequency)=0.027, and l (the 2-1 haplotype frequency)=0.012. Using the above haplotype frequency estimates, and assuming Hardy–Weinberg equilibrium, the expected number of individuals with each genotype among the 841 individuals is in agreement with the raw population data (Table 3). The slightly higher-than-expected frequency of individuals with four copies of SMN1 may be due to the presence of ‘3-copy-SMN1 alleles,’ which we disregarded in our calculations of SMN1-allele and SMN-haplotype frequencies.
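As an illustration of how expected genotype counts follow from the haplotype frequencies, the Python sketch below pairs haplotypes at random (Hardy–Weinberg equilibrium) and sums the SMN1 and SMN2 copy numbers for each pair. It uses the rounded frequencies quoted in the text, which do not sum to exactly 1 and are therefore renormalised here; the resulting counts are approximate and are not a reproduction of Table 3.

```python
from collections import defaultdict
from itertools import product

# Rounded haplotype frequencies from the text: (SMN1 copies, SMN2 copies) -> frequency
haplotypes = {
    (0, 0): 4.8e-4, (0, 1): 8.6e-3, (0, 2): 4.2e-3,   # e, f, g
    (1, 0): 0.27,   (1, 1): 0.66,   (1, 2): 0.015,    # h, i, j
    (2, 0): 0.027,  (2, 1): 0.012,                    # k, l
}

# Renormalise because the rounded values do not sum to exactly 1
total = sum(haplotypes.values())
freqs = {hap: p / total for hap, p in haplotypes.items()}

def expected_genotype_counts(freqs, n):
    """Expected number of individuals per (SMN1 total, SMN2 total) genotype."""
    counts = defaultdict(float)
    # Ordered pairs of haplotypes; each unordered heterozygous pair is counted twice
    for (h1, q1), (h2, q2) in product(freqs.items(), repeat=2):
        genotype = (h1[0] + h2[0], h1[1] + h2[1])
        counts[genotype] += n * q1 * q2
    return dict(counts)

expected = expected_genotype_counts(freqs, 841)
for (smn1, smn2), count in sorted(expected.items()):
    print(f"{smn1}:{smn2}  {count:8.2f}")
```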

To verify our allele-frequency and haplotype calculations, we composed a simulation that reconstructed the haplotypes in the affected and normal populations using the data presented in Tables 2 and 3, as described in the Methods section. We used both χ² and Fisher's exact tests to determine compliance of the simulation data set with Hardy–Weinberg equilibrium. The simulation showed that three9, 24 (and this study) of the six normal populations examined for SMN haplotype analysis in Table 3 were consistent with a Hardy–Weinberg distribution. The remaining three normal populations showed increased numbers of the 1-1 haplotype, causing a distortion from Hardy–Weinberg equilibrium. Analysis of the haplotype reconstructions and simulations from the disease allele data set provided evidence for a relative two-fold increase in the frequency of the 0-2 haplotype among affected individuals as compared with the calculations assuming Hardy–Weinberg equilibrium carried out above.

Discussion

To test our assumption of Hardy–Weinberg equilibrium in calculating allele and haplotype frequencies, we carried out a simulation as described in Methods and Results. Overall, the simulation demonstrated some evidence for distortion from Hardy–Weinberg equilibrium among the normal alleles. The distortions from Hardy–Weinberg equilibrium could derive from the heterogeneity of populations examined or because of selective advantage for some alleles. We believe that our simulations support the additional evidence delineated below for selection at this complex locus.

Although functional evidence for selection of the 1-1 haplotype is lacking, it is the most common haplotype in Homo sapiens. Because the paralogous duplication giving rise to SMN2 occurred after human–primate divergence,41 it will be difficult to provide phylogenetic proof of selection. However, the young age of this haplotype structure, accompanied by elevated haplotype frequency, suggests selection. Our simulation data support overrepresentation of the 1-1 haplotype in the human gene pool. The availability of a HapMap should enable coalescent modeling studies using polymorphisms in the region and testing numerous human populations for elevated haplotype frequencies.38 This type of analysis should provide definitive evidence for or against selection of the 1-1 haplotype in human populations.

It is likely that all of the 0-0 haplotypes, and most, if not all, of the 0-1 haplotypes arose by de novo SMN1 deletion mutations. On the other hand, it is likely that most of the 0-2 haplotypes, and perhaps some of the 0-1 haplotypes, arose by de novo SMN1-to-SMN2 gene conversion mutations. Although some SMA type III patients reproduce, most of the SMA disease haplotypes, including the 0-2 haplotype, are genetically lethal when homozygous, or when heterozygous with another disease haplotype. Thus, disease haplotypes must be replenished constantly in the general population by de novo mutations. The available data on SMA patients who received de novo mutations (among all SMA patients) are limited, comprising only 12 such patients. However, two of these patients received de novo SMN1-to-SMN2 conversion mutations.23, 25 Other small intragenic SMN1 mutations (present in 1.8% of all disease alleles) are relatively rare. Hence, the 31% frequency of the 0-2 haplotype among all disease haplotypes implies that SMN1-to-SMN2 gene conversion (i.e., a single nucleotide substitution at position 840) occurs at a much higher frequency (approximately 10-fold higher) than all other small intragenic mutations combined. Thus, position 840 might be a mutation hotspot. A mutation rate of 1.2 × 10−9 mutations/site per year has been suggested for Homo sapiens.39, 40 This number agrees well with a more-recent direct estimate of the frequency of all kinds of small mutations (1.8 × 10−8/nucleotide per generation42) as well as an indirect, evolutionary estimate (2.5 × 10−8/nucleotide per generation43). Single nucleotide substitutions are 25 times more common than all other small mutations.41 The observed frequency of SMN1-to-SMN2 gene conversion is 2.1 × 10−5 per generation for both men and women, which is approximately three orders of magnitude higher than the average mutation frequency estimates above. 
Rochette et al41 suggested that gene conversion between SMN1 and SMN2 is facilitated by the 99.8% sequence identity between the two genes, and the fact that they are confined to a region of about 1 Mb. Alternative hypotheses for the relatively high frequency of the 0-2 haplotype include a selective advantage for the 0-2 haplotype over other disease haplotypes, that is, some mildly affected (eg, so-called type IV) individuals homozygous for the 0-2 haplotype may reproduce. However, the vast majority of SMA patients with four copies of SMN2 are affected by type III SMA, which is generally pediatric in onset and genetically lethal. Type IV SMA is too rare to account fully for the high frequency of the 0-2 haplotype.

The SMN1 disease allele frequency estimated from compiled large population incidence data (in On-line Supplemental Material) is 0.010, implying that the SMA carrier frequency is 1/50 under Hardy–Weinberg equilibrium. However, the carrier frequency estimated from our allele frequency data is 1/38 {=2 (a+d) (b+c)}. This discrepancy may be attributed to embryonic lethality of some severely affected fetuses, exclusion of some SMA cases with atypical features, and statistical variation.
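The two carrier-frequency estimates compared in this paragraph can be reproduced with a few lines of Python. The allele frequencies are the adjusted values reported above; the variable names are ours.

```python
# Adjusted allele frequencies (0-copy, 1-copy, 2-copy, 1D)
a, b, c, d = 0.0132, 0.949, 0.0378, 0.000241

# Carrier frequency from the allele frequencies: one disease allele, one normal allele
carrier_from_alleles = 2 * (a + d) * (b + c)

# Carrier frequency from the population-incidence estimate of the disease allele frequency
q_population = 0.010
carrier_from_incidence = 2 * q_population * (1 - q_population)

print(f"from allele frequencies: about 1 in {1 / carrier_from_alleles:.0f}")
print(f"from incidence data:     about 1 in {1 / carrier_from_incidence:.1f}")
```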

We recommend that genetic counselors use our allele and haplotype frequency estimates for risk calculations, as our estimates are the best available at this time. The take-home message of this article is that the SMN region in 5q is unstable and in active flux (based on the high de novo mutation rate and the recent evolutionary basis for the duplication), and that geneticists and genetic counselors must keep informed of changes in genomic sequence information (because the assembly is still inaccurate), de novo mutation rates (due to the small numbers of cases on which our estimates depend), and allele and haplotype frequencies (because the numbers of cases on which our estimates depend will increase over time, thereby improving accuracy).

In conclusion, the SMN1 allele frequencies (Table 4) and de novo mutation rates are based on all available and reliable data, and can be used for genetic counseling and accurate risk assessments for SMA patients and their families. An unusually high frequency of de novo deletion/conversion mutations occurs in the SMN1 gene, a substantial fraction of which seem to be de novo gene conversions from SMN1 to SMN2 (ie, de novo 840C>T mutation), suggesting that nucleotide position 840 might be a mutation hotspot. Additional studies will shed further light on the molecular evolution of the SMN1 and SMN2 genes.