Introduction

Cystic fibrosis (CF) is an autosomal recessive disease due to a deleterious mutation in a chloride channel gene (CFTR = CF transmembrane conductance regulator), located on chromosome 7. From a biological point of view, molecular cloning of the gene [1], purification of the protein, and subsequent analyses will increase understanding of the molecular and cellular physiopathology of CF (chronic obstructive lung disease and pancreatic enzyme insufficiency).

From a medical point of view, identification of a large number of deleterious mutations and of microsatellite sequences within the CFTR gene provides a highly effective means of prenatal diagnosis or even, in some specific populations, of carrier screening [2–5]. CF is a notorious conundrum in population genetics. Why is this disease so frequent among Caucasians but unknown in other populations? In Caucasians, the mean prevalence at birth is 1 in 2,500. According to the Hardy-Weinberg law, this means that the frequency of the mutation, or more exactly, the pooled frequency of the cluster of deleterious mutations of the CFTR gene, is equal to 2% (the square root of 1/2,500), and unaffected carriers are numerous: about 4%, or 1 in 25. How could a lethal mutation have reached such a frequency?
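The Hardy-Weinberg arithmetic above can be checked with a minimal sketch. It assumes random mating and treats all deleterious CFTR alleles as a single pooled allele; the function name `hardy_weinberg` is introduced here for illustration only.

```python
import math

def hardy_weinberg(prevalence_at_birth):
    """Return (q, carrier_frequency) for a recessive disease.

    Assumes Hardy-Weinberg equilibrium, so that the prevalence of
    affected homozygotes equals q squared.
    """
    q = math.sqrt(prevalence_at_birth)  # pooled frequency of deleterious alleles
    p = 1.0 - q                         # frequency of the normal allele
    carriers = 2.0 * p * q              # frequency of unaffected heterozygotes
    return q, carriers

q, carriers = hardy_weinberg(1 / 2500)
print(f"pooled allele frequency q = {q:.3f}")
print(f"carrier frequency 2pq = {carriers:.4f}, about 1 in {1 / carriers:.1f}")
```

The exact carrier frequency, 2pq, is slightly below 4%; the rounded figure of 1 in 25 quoted in the text corresponds to the approximation 2q.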

Since the early sixties, population geneticists have developed various hypotheses and models. A balance between negative selection and recurrent mutations was ruled out long before molecular data provided definitive evidence, and genetic drift [6–8] may be questionable. What frequency of the ΔF508 mutation would genetic drift have had to produce in remote eras (neolithic or paleolithic) before natural selection lowered it to the present 1.5%? The only surviving model therefore postulates selective factors favoring CF mutations through heterozygote advantage and/or meiotic drive [8–13]. But this balanced polymorphism hypothesis cannot be easily tested, either statistically, because the required sample size would be too large [7], or physiologically, because the CFTR protein is still under study.

Between October 1989 and December 1992, 250 mutations were characterized in the CF gene in a worldwide survey conducted by the Cystic Fibrosis Genetic Analysis Consortium (see Appendix). In this study, the cartographic location of the observed CF mutations was analyzed according to their nature (nonsense, splicing, frameshift or missense), using a null hypothesis of random mutation at each potential site. Molecular data were examined to see whether they pointed to the possible existence of selective factors favoring CF mutations. The set of CF mutations was therefore classified on the basis of geographic distribution. Each class of mutation was then analyzed according to the distribution of mutations within the gene or peptide chain, and according to the associated RFLP markers in linkage disequilibrium.

Material and Methods

Since identification of the predominant ΔF508 mutation of the CFTR gene in 1989 [14] and its subsequent study in all populations [15–17], the CF gene has been extensively screened. Our analysis refers to the 250 different mutations (50 nonsense, 33 splicing, 60 frameshift and 107 missense mutations or amino acid deletions) listed by the Cystic Fibrosis Genetic Analysis Consortium in December 1992 (partly confidential data). CF mutations were characterized within samples ranging from 29,567 CF chromosomes for the predominant ΔF508 mutation to a few hundred CF chromosomes for private mutations. For most of the mutations, a few thousand CF chromosomes were studied.

For each kind of mutation, the observed distribution of mutations between the exons of the CF gene was compared to the expected distribution using a null hypothesis of random mutation at potential sites, depending on the nature of the mutations studied. Since the CF gene sequence is known [18], the expected random distribution of mutations was calculated using either the respective and variable exon lengths for frameshift or missense mutations, or the number of potential sites within the DNA sequence for splicing or nonsense mutations. Exons 6a and 6b, 14a and 14b, and 17a and 17b were not distinguished. The gene is therefore partitioned into 24 exons and 23 introns. The cartographic location of the missense mutations between the domains of the protein was studied by grouping the corresponding exons.
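The comparison described above can be sketched as follows, assuming that the expected count for each exon is simply the total number of mutations multiplied by that exon's share of the total weight (exon length, or number of potential sites). The exon lengths and observed counts below are hypothetical toy values chosen for readability, not CFTR data.

```python
def expected_counts(weights, total):
    """Spread `total` mutations over exons in proportion to `weights`."""
    weight_sum = sum(weights)
    return [total * w / weight_sum for w in weights]

def chi_square(observed, expected):
    """Pearson goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical toy values (NOT taken from the CFTR sequence or from table 1).
exon_lengths = [100, 200, 300, 400]   # weights under the null hypothesis
observed = [3, 5, 6, 6]               # mutations observed in each exon

expected = expected_counts(exon_lengths, sum(observed))
statistic = chi_square(observed, expected)
degrees_of_freedom = len(observed) - 1
print(expected, statistic, degrees_of_freedom)
```

For splicing or nonsense mutations, the same functions would apply with the number of potential sites per exon or intron used as weights instead of exon length.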

Only 155 of the 250 CF mutations could be divided into four classes on the basis of geographic distribution, because the more recently characterized mutations cannot yet be classified in this way. The four classes are: private mutations, observed only once in the worldwide survey of the Cystic Fibrosis Genetic Analysis Consortium; demic mutations, observed twice or more but within the same population; local mutations, observed in two or three close populations or countries; and general mutations, observed everywhere or in most countries. This classification is provisional: since most laboratories did not test their patients’ DNA for all the identified mutations, some mutations classified as private may actually be demic or even local.

A great number of molecular markers have been detected near the CFTR locus, especially RFLPs such as XV2C/Taq1 and KM19/Pst1 [19]. Depending on the presence or absence of the respective endonuclease sites, there are four kinds of haplotypic or chromosomal combinations: A = (−,−); B = (−,+); C = (+,−); or D = (+,+). Since 1986, molecular analyses of RFLP haplotypes in affected and control individuals have provided evidence for a close association between the B haplotype and the disease. This association was probably due to a high disequilibrium between this marker and one predominant or several deleterious alleles. This hypothesis proved correct after identification of ΔF508 by Kerem et al. [14]. The expected random associations between mutations and each kind of RFLP haplotype were calculated according to the mean European frequency of these RFLP haplotypes on normal chromosomes [16].
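The haplotype coding and the expected random associations can be illustrated with a short sketch. The A to D labels follow the definitions given above, with the XV2C/Taq1 site first and the KM19/Pst1 site second. The haplotype frequencies are placeholders invented for the example; the actual mean European frequencies are those of reference [16] and are not reproduced here.

```python
# Haplotype labels as defined in the text: (XV2C/Taq1 site, KM19/Pst1 site),
# where True means the endonuclease site is present (+).
HAPLOTYPE_LABELS = {
    (False, False): "A",
    (False, True): "B",
    (True, False): "C",
    (True, True): "D",
}

def haplotype(xv2c_site_present, km19_site_present):
    """Return the A/B/C/D label for one chromosome."""
    return HAPLOTYPE_LABELS[(bool(xv2c_site_present), bool(km19_site_present))]

def expected_associations(n_mutations, normal_frequencies):
    """Expected number of mutations per haplotype under random association."""
    total = sum(normal_frequencies.values())
    return {h: n_mutations * f / total for h, f in normal_frequencies.items()}

# Placeholder frequencies on normal chromosomes (hypothetical, for illustration).
placeholder_frequencies = {"A": 0.30, "B": 0.10, "C": 0.45, "D": 0.15}
expected = expected_associations(20, placeholder_frequencies)
print(haplotype(False, True), expected)
```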

The significance level (p value) of χ² tests was calculated using the tabulated values, except for statistical tests for which sample sizes were too small. In these cases, the exact p values were computed using a Turbo Pascal program [20] which generates the exact probability distribution of χ².
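One way to obtain an exact p value for a small sample is to enumerate every possible table of counts, compute its multinomial probability under the null hypothesis, and sum the probabilities of all tables whose χ² is at least as large as the observed one. The sketch below follows that approach in Python; it is not the Turbo Pascal program of reference [20], whose algorithm is not described here, and all function names are illustrative.

```python
from math import factorial

def compositions(n, k):
    """Yield every way to split n items into k ordered non-negative counts."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_probability(counts, probs):
    """Probability of observing `counts` under cell probabilities `probs`."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)
    probability = float(coefficient)
    for c, p in zip(counts, probs):
        probability *= p ** c
    return probability

def chi_square(counts, probs):
    """Pearson statistic of `counts` against expected values n * probs."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def exact_p_value(observed, probs):
    """Sum the probabilities of all tables at least as extreme as `observed`."""
    threshold = chi_square(observed, probs)
    return sum(
        multinomial_probability(table, probs)
        for table in compositions(sum(observed), len(observed))
        if chi_square(table, probs) >= threshold - 1e-12  # tolerate rounding
    )

# Hypothetical small sample: 6 mutations over 3 equally likely classes.
print(exact_p_value([4, 1, 1], [1 / 3, 1 / 3, 1 / 3]))
```

Full enumeration is only practical for small samples and few classes, which is the situation in which the tabulated χ² values are unreliable.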

Results and Discussion

Cartographic Distribution of the 250 CF Mutations

The cartographic distribution is shown in table 1. As there is no disparity between observed and expected distributions, splicing and frameshift mutations may be considered to be randomly distributed. This conclusion remains valid when exons are grouped in order to obtain expected numbers higher than 5. Four years ago, when molecular geneticists started their hunt for CF mutations other than ΔF508, the search was not random with regard to domains or exons. The random distribution of splicing and frameshift mutations suggests that the whole gene has now been screened, so that there is no longer any census bias in the cartographic distribution of these kinds of mutations. The hypothesis of a mutation hot spot [17, 21] must therefore be questioned, at least for these kinds of mutation (splicing, frameshift).

Table 1 Disparity between observed and expected distributions of CF mutations depending on their nature

The test value for the distribution of nonsense mutations is borderline, even when exons are grouped. If it is significant, the low number of nonsense mutations in the C-terminal region of the protein would be in agreement with the observation that deletions of this domain may not affect protein activity [22].

There is a highly significant disparity between the observed and expected distributions of missense mutations or amino acid deletions (table 1, last column). As previously noted [21], there is an excess of mutations in NBF1 as well as in MS1. An alternative to the hypothesis of a mutation hot spot in this part of the gene is that amino acid substitutions in MS1 or NBF1 are far more critical for protein folding than those affecting MS2 or NBF2.

Sixty-four missense mutations out of a total of 107 could be classified according to their geographic dispersion. Private and demic mutations are randomly distributed within the CF gene, whereas local, and especially general, mutations are not (table 2). Mutations in both of these classes almost always lie within the MS1 and NBF1 domains. The fact that private and demic mutations are randomly distributed while local and general mutations, the so-called successful mutations, are mostly confined to specific locations in the peptide chain is in agreement with the hypothesis of selective factors favoring the expansion of these mutations. It is hardly likely that migration, founder effect or genetic drift alone would have favored the spread of 19 MS1 or NBF1 missense mutations out of a total of 23.

Table 2 Disparity between the observed and random expected distributions between the domains of the CFTR protein for missense mutations and amino acid deletions

Geographic Dispersion

The classification of the 155 mutations of the cluster is reported in table 3. Private mutations, though numerous, only account for 0.25% of CF chromosomes, due, of course, to their very low relative frequency. General mutations account for 11% of the cluster but for nearly 85% of all CF chromosomes (the prevalent ΔF508 mutation accounts for 67% of them).

Table 3 Classification of the set of CF mutations according to their geographic pattern

Without any general mutations, CF would be a very rare disease (one affected newborn in more than 100,000), which is probably the case in non-Caucasian populations. Even without ΔF508, CF would be a common recessive disease (one in 23,000). The fact that CF is so frequent among Caucasians is only due to general mutations, especially ΔF508, which from a population genetics standpoint are ‘successful mutations’ because they have diffused in most populations. Each of these mutations has reached a frequency which is not in agreement with mutation-selection balance, or even with genetic drift for ΔF508.
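These counterfactual prevalences follow from the Hardy-Weinberg relation and can be reproduced with a short sketch. It assumes a pooled allele frequency of 2% and uses the shares of CF chromosomes quoted above (about 85% for all general mutations, 67% for ΔF508); the function name is illustrative.

```python
def prevalence_without(q_total, removed_share):
    """Prevalence at birth if a given share of CF chromosomes did not exist.

    Assumes Hardy-Weinberg equilibrium: prevalence equals the squared
    pooled frequency of the remaining deleterious alleles.
    """
    q_remaining = q_total * (1.0 - removed_share)
    return q_remaining ** 2

q_total = 0.02  # pooled frequency of all CF mutations among Caucasians

without_general = prevalence_without(q_total, 0.85)  # no general mutations
without_df508 = prevalence_without(q_total, 0.67)    # no ΔF508 only

print(f"without general mutations: 1 in {1 / without_general:,.0f}")
print(f"without ΔF508: 1 in {1 / without_df508:,.0f}")
```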

The geographic pattern of local mutations (fig. 1) reflects the common origin and history of population migrations, for instance between Germany, Bohemia and Slovakia, between Germany and France, France and England, France and Canada, and particularly between Europe as a whole and North America.

Fig. 1 Geographic distribution of a few local cystic fibrosis mutations (points refer to countries where a mutation has been screened, not to the location of the mutation within each country).

Association of CF Mutations with RFLP Markers

To date, 47 mutations (15 private, 9 demic, 6 local and 17 general) have been reported together with their associated RFLP haplotypes. Table 4 shows the observed numbers of mutations for each class of mutation and each kind of associated RFLP haplotype. Two general mutations (S549N and R553X) were associated with two different haplotypes and were entered as two halves for each haplotype.
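The half-counting rule for mutations found on two different haplotypes can be sketched as a fractional tally, in which each mutation contributes a total weight of one, split equally between its associated haplotypes. The mutation names and haplotype assignments below are hypothetical placeholders, not the data of table 4.

```python
from collections import defaultdict

def tally_by_haplotype(associations):
    """Count mutations per haplotype, splitting ties into equal fractions.

    `associations` maps a mutation name to the list of RFLP haplotypes
    on which it has been observed.
    """
    counts = defaultdict(float)
    for haplotypes in associations.values():
        share = 1.0 / len(haplotypes)  # 1 for one haplotype, 0.5 for two
        for h in haplotypes:
            counts[h] += share
    return dict(counts)

# Hypothetical placeholder assignments, for illustration only.
example = {
    "mutation_1": ["B"],
    "mutation_2": ["B"],
    "mutation_3": ["A", "B"],  # seen on two haplotypes: half to each
    "mutation_4": ["C"],
}
counts = tally_by_haplotype(example)
print(counts)
```

With this rule the column totals still sum to the number of mutations, which keeps the observed counts comparable with the expected ones.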

Table 4 Numbers of observed mutations associated with each kind of RFLP haplotype, for each geographic class of CF mutation, and expected numbers of mutations assuming that RFLP haplotypes and mutations are randomly associated

There is clearly no disparity between the random expected and the observed distributions of associated RFLPs within private, demic and local mutations. In contrast, there is a large excess of B-associated haplotypes within the general mutations. Not only is ΔF508 largely associated with the B haplotype, but so too are some of the most frequent secondary mutations, namely 621+1G→T, A455E, 1717-1G→T, G542X, S549N, G551D, W1282X, and N1303K.

Widespread or so-called successful mutations seem to be more often associated with a B haplotype, although this marker is the least frequent among normal chromosomes. This fact may be consistent with the existence of selective factors postulated by advocates of meiotic drive or heterozygote advantage in order to explain the high disease frequency. Such selective factors, if they exist, could have been connected with a specific kind of mutation, thus leading to its geographic spread. Two kinds of selective factors may exist: those acting according to whether or not a mutation affects the MS1 or NBF1 domain, and those acting according to whether or not a mutation occurs on a B chromosome.

The B sequence is probably not responsible for the selection, but could be a marker in linkage disequilibrium both with the CFTR locus and another gene or DNA sequence responsible for selective effects (or meiotic drive). In this case, CF mutations could have been driven by hitchhiking, as previously suggested [23].

CF is a very peculiar disease in terms of population genetics analysis. The severity of the disease, and its frequency, have resulted in the rapid accumulation of much data, since well over one hundred laboratories perform RFLP analyses in prenatal diagnosis and identify mutations for biological and medical purposes. Within less than 5 years, polymorphism inside and around the gene has been better elucidated than in most diseases. It is therefore to be hoped that analysis of the cellular biology and physiology of the CFTR protein, as well as the search for other genes acting on the variable expressivity of the disease, will provide answers to the questions that have long puzzled population geneticists.