Introduction

Cystic fibrosis (CF) is the most frequent severe recessive disorder in European populations. The mutation spectrum of this disease, that is, the pool of alleles that cause CF in homozygosis or in heterozygosis with another mutation, is made up of more than 1000 different mutations;1 of those, a three-basepair deletion, F508del, accounts for roughly two-thirds of cases, and only four others are found at average frequencies >1%.

Mutation frequencies are geographically heterogeneous among European populations: F508del is ubiquitous, although its frequency ranges from less than 50% to almost 100%; other mutations reach significant frequencies only in part of the European continent, and many are rare and population specific. Most of the knowledge about frequency distributions has been obtained on a mutation-by-mutation basis and has consisted of mapping mutation frequencies,2 and of the consequent association of their geographic distribution with hypothetical demographic expansions of historical populations such as the Celts or the Phoenicians. However, this single-mutation approach fails to capture the complexity of both the mutation spectrum and the population history. Gene flow among populations tends to disperse a number of mutations simultaneously and not just one. Moreover, too often these ad hoc explanations are put forward without any quantitative consideration of the extent of gene flow required or of the basic population genetics involved.

A different, sounder approach is possible: considering the whole mutation spectrum at once and, beyond mere description and subsequent storytelling, applying to it the basic toolkit of population genetics, which can lead to more accurate interpretations of the history of the populations in relation to the natural history of genetic diseases. One basic requirement for that approach is the reasonable assumption that all CF mutations are selectively equivalent, that is, that they cause the same loss of fitness in homozygosity and give the same advantage to heterozygotes.3,4,5,6 Under this assumption, Reich and Lander7 modelled mutation diversity as a function of population history and of the overall frequency of disease-causing alleles. However, they considered one average spectrum per disease, and, as we show below, allele diversity can be very different among populations for the same disease. A basic property of a mutation set is its frequency spectrum, that is, how many mutations are found, and at what frequency, in a particular population. It can be understood cumulatively, that is, as how many mutations are required to explain a given relative frequency (say 90%) of CF chromosomes. This is of obvious interest in clinical testing, but it is also crucial in population genetics, since a frequency spectrum may be the result of population history and of other evolutionary factors.
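The cumulative reading of a frequency spectrum can be illustrated with a minimal sketch. The function name `mutations_needed` and both example spectra are hypothetical and are not data from this study; they only show how the number of mutations required to reach a 90% coverage differs between a spectrum dominated by one allele and a richer one.

```python
def mutations_needed(freqs, coverage=0.90):
    """Count how many mutations, taken from most to least frequent, are
    needed to reach the target cumulative relative frequency.
    Returns None if the listed mutations never reach the target."""
    total = 0.0
    for count, freq in enumerate(sorted(freqs, reverse=True), start=1):
        total += freq
        if total >= coverage:
            return count
    return None


# Hypothetical relative frequencies among CF chromosomes (illustration only).
poor_spectrum = [0.85, 0.04, 0.03, 0.02]
rich_spectrum = [0.50, 0.08, 0.06, 0.05, 0.04, 0.04,
                 0.03, 0.03, 0.03, 0.03, 0.02, 0.02]

needed_poor = mutations_needed(poor_spectrum)
needed_rich = mutations_needed(rich_spectrum)
```

In this illustration the richer spectrum requires several times more mutations to reach the same coverage, which is the practical meaning of a higher mutation diversity.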

We have compiled from the literature CFTR mutation frequencies for a wide set of populations from Europe, North Africa and SW Asia, and have described their spatial patterns, both for a few single mutations and for mutation diversity. We present an interpretation of our findings grounded in population genetics theory, which relates CF mutation frequency spectra to the patterns of effective population size across European populations and to selection.

Materials and methods

Databases

Mutation frequency spectra for CF were compiled from the literature for populations (countries or main regions within countries) from Europe, North Africa, and SW Asia, that is, areas with a relatively high prevalence of CF and where mutation spectra can be reported for a reasonable number of patients. Since we are interested in the analysis of geographical patterns in CF mutation spectra in a historical frame, data for Americans and Australians of European descent were not taken into account. These populations are clearly related to northern Europeans but are located at a large distance, which would distort the spatial analyses. For each population, the frequency of each allele associated with the disease, the geographical location and the sample size were considered. In each population, a fraction of CF chromosomes remained with an unrecognized mutation; this frequency of unknown mutations depends on technical limitations that may vary across studies. Populations with an undetermined location, with a sample size of less than 20 chromosomes, or with a frequency of unknown mutations over 60% were excluded from further analysis, which left a total of 94 populations in the database, with 32 mutations found in more than one population and 109 private (ie population specific) mutations. CF incidences were taken from the literature,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 where incidences were estimated either from neonatal screening programs or from general population screens; incidence values were available for 16 populations, 13 of which referred to whole countries. In those cases, data such as mutation frequencies that were compared to incidences were first pooled within the country.

Genetic diversity estimates

Two different genetic diversity estimates were obtained for each population using the Arlequin package:24 the expected heterozygosity and an estimate of θ based on expected heterozygosity.25,26 The latter is an estimate of 4Neμ(1−f0), where Ne is the effective population size, μ is the mutation rate and f0 is the frequency of the overall disease allele class.7 Computation of both the expected heterozygosity and θ requires the complete specification of the frequencies of all alleles. However, an average of 24.8% (ranging from 0 to 51.3%) of the CF chromosomes carried unknown CFTR mutations. Estimates of allele diversity are bracketed between two extremes: they would be minimal if all unknown mutations in a population were, in fact, the same allele, and they would be maximal if all unknown mutations were different from each other. Intensive mutation-detection efforts by means of denaturing high-performance liquid chromatography27 have shown that chromosomes bearing previously unknown mutations each carried a new, different, unique mutation. Thus, it is likely that the ‘unknown’ portion of the mutation spectrum contains a wide diversity of alleles. Under that assumption, and assuming as well the absence of phenocopies (non-CF patients diagnosed as having CF) and genocopies (CF caused by mutations in genes other than CFTR), mutation diversity has been estimated by assuming that the chromosomes in the ‘unknown mutation’ category each carry a different mutation. This implies that, given the frequency of the ‘unknown mutation’ category, we have considered the maximum allele diversity estimate.
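The bracketing of diversity between the two extreme treatments of unknown mutations can be sketched as follows. This is an illustration under stated assumptions, not the Arlequin implementation: heterozygosity is computed with the usual n/(n−1) sample-size correction, and θ is obtained from the infinite-allele equilibrium relation H = θ/(1+θ), which may differ from the exact estimator used by the package. All function names and the allele counts are hypothetical.

```python
def expected_heterozygosity(counts):
    """Expected heterozygosity from allele counts, with the n/(n-1)
    small-sample correction."""
    n = sum(counts)
    sum_sq = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1.0 - sum_sq)


def theta_from_h(h):
    """theta from heterozygosity, ASSUMING the infinite-allele
    equilibrium relation H = theta / (1 + theta)."""
    return h / (1.0 - h)


def diversity_bracket(known_counts, n_unknown):
    """Return (minimum, maximum) heterozygosity for a sample in which
    n_unknown chromosomes carry unidentified mutations."""
    # Minimum: all unknown chromosomes share one and the same allele.
    pooled = known_counts + ([n_unknown] if n_unknown else [])
    # Maximum: every unknown chromosome carries its own unique allele.
    split = known_counts + [1] * n_unknown
    return expected_heterozygosity(pooled), expected_heterozygosity(split)


# Hypothetical sample: 100 CF chromosomes, 20 of them unidentified.
known = [70, 6, 4]
h_min, h_max = diversity_bracket(known, 20)
theta_max = theta_from_h(h_max)
```

The maximum estimate corresponds to the convention adopted in the text; the gap between `h_min` and `h_max` grows with the fraction of unidentified chromosomes.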

Geographical patterns

A geographical description of allele frequency patterns was obtained by drawing maps of gene frequencies for the most common CF mutations (namely F508del, G542X, G551D, N1303K, and W1282X, which are the ones found at average frequencies >1%), as well as for genetic diversity estimates, by using Surfer 7.0 (http://www.goldensoftware.com) with the inverse-squared distance method. A regular grid covering Europe, North Africa, and the Middle East and limited between 30°N and 64°N and between 10°W and 42°E was used. Interpolation points were spaced 0.1° apart. For each interpolation point, only data points within the same landmass (island or continent) were considered. It should be noted that interpolation was used only to map allele frequencies and diversities, and that interpolated values were not used in any other analysis.
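Inverse-squared distance interpolation restricted to a landmass can be sketched as below. This is not the Surfer implementation: the use of great-circle distances, the function names, the landmass labels and the sample values are all assumptions made for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (mean Earth radius of 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def idw(lat, lon, landmass, samples, power=2.0):
    """Inverse-distance-weighted value at a grid point, using only the
    samples located on the same landmass. Returns None if there are none.
    samples: iterable of (lat, lon, landmass, value)."""
    num = den = 0.0
    for s_lat, s_lon, s_landmass, value in samples:
        if s_landmass != landmass:
            continue
        d = haversine_km(lat, lon, s_lat, s_lon)
        if d < 1e-9:          # grid point coincides with a sample
            return value
        weight = 1.0 / d ** power
        num += weight * value
        den += weight
    return num / den if den > 0 else None


# Hypothetical allele frequencies at three sampling locations.
samples = [
    (50.0, 0.0, "continent", 0.80),
    (50.0, 10.0, "continent", 0.60),
    (40.0, 9.0, "island", 0.50),
]
midpoint_value = idw(50.0, 5.0, "continent", samples)
```

With a power of 2, nearby samples dominate the estimate, and a grid point equidistant from two samples receives their average.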

The frequencies of the most common mutations, maximum genetic diversity estimates and countrywide CF incidences were subjected to spatial autocorrelation analysis28 by means of the SAAP program (http://www.exetersoftware.com/). Autocorrelation analysis, which consists of plotting a measure of correlation among pairs of populations classified according to the geographical distance between them, allows one to characterize geographical patterns such as clines (gradients), depressions (clines radiating from the center of the area considered), and isolation by distance, since each of these patterns leads to autocorrelation plots that are statistically significantly different from each other (see examples in Barbujani29). Thus, spatial autocorrelation analysis allows one to describe the spatial patterns objectively and to compare them with the expectations derived from demographic hypotheses, such as growth and migration.
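The quantity plotted in a correlogram, Moran's index computed separately for each distance class, can be sketched as follows. This is a simplified illustration with binary weights (1 for pairs falling in the class, 0 otherwise) and is not the SAAP code; the function name, the populations on a line and their frequencies are hypothetical.

```python
def morans_i(values, dist, lower, upper):
    """Moran's I for the pairs whose distance d satisfies
    lower <= d < upper. dist is a symmetric matrix of pairwise
    distances. Returns None if the class is empty or values are constant."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    sum_sq = sum(d * d for d in dev)
    cross = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            if lower <= dist[i][j] < upper:
                cross += dev[i] * dev[j]
                pairs += 1
    if pairs == 0 or sum_sq == 0:
        return None
    # Summing over unordered pairs halves both the cross-product sum and
    # the total weight, so the two factors of 2 cancel.
    return n * cross / (pairs * sum_sq)


# Hypothetical cline: five populations 1000 km apart along a transect.
positions = [0, 1000, 2000, 3000, 4000]
freqs = [0.9, 0.8, 0.7, 0.6, 0.5]
dist = [[abs(a - b) for b in positions] for a in positions]

i_short = morans_i(freqs, dist, 500, 1500)
i_long = morans_i(freqs, dist, 3500, 4500)
```

In this toy gradient the index is positive in the shortest distance class and negative in the longest one, which is the signature expected for a cline.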

Genetic distances

Reynolds' genetic distances30 based on CF mutation relative frequencies were computed. The same measure of genetic distance was used by Cavalli-Sforza et al31 to estimate genetic distances among European populations based on classical genetic polymorphisms (ie blood groups, protein polymorphisms, and HLA). Both distance matrices were compared by means of a Mantel test, which calculates a nonparametric index of matrix correlation;32 partial Mantel tests were also performed, in which the correlation was controlled for a geographic distance matrix.33 Reynolds' distances were also computed for Y-chromosome haplogroup frequencies34 and compared to CF mutation distances. Finally, CF mutation distances were compared to corrected pairwise distances for hypervariable region I mitochondrial DNA sequences;35,36,37 since some slightly negative values were obtained, a small positive quantity was added to all distance values so that all would be positive. Genetic distance calculations and Mantel tests were performed with Arlequin 2.000.24
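A permutation-based Mantel test can be sketched as below, together with a simple single-locus approximation of Reynolds' distance. Both are illustrations under stated assumptions and not the Arlequin code: the distance uses the form Σ(p−q)²/[2(1−Σpq)], the test is one-sided, and all function names and the toy matrices are hypothetical.

```python
import math
import random


def reynolds_distance(p, q):
    """Single-locus approximation of Reynolds' distance between two
    populations with allele frequency vectors p and q (an assumption;
    other formulations exist)."""
    num = sum((a - b) ** 2 for a, b in zip(p, q))
    den = 2.0 * (1.0 - sum(a * b for a, b in zip(p, q)))
    return num / den


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(a, b, permutations=999, seed=0):
    """One-sided Mantel test between square distance matrices a and b.
    Rows and columns of b are permuted jointly. Returns (r, p_value)."""
    n = len(a)
    rng = random.Random(seed)
    x = [a[i][j] for i in range(n) for j in range(i)]

    def lower_triangle(order):
        return [b[order[i]][order[j]] for i in range(n) for j in range(i)]

    r_obs = pearson(x, lower_triangle(list(range(n))))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        if pearson(x, lower_triangle(order)) >= r_obs - 1e-12:
            hits += 1
    return r_obs, hits / (permutations + 1)


# Toy example: matrix b is proportional to matrix a.
points = [0.0, 1.0, 2.5, 4.0, 7.0, 11.0]
mat_a = [[abs(u - v) for v in points] for u in points]
mat_b = [[2.0 * d for d in row] for row in mat_a]
r, p_value = mantel(mat_a, mat_b)
```

Permuting whole rows and columns together, rather than individual cells, preserves the dependence among the distances that involve the same population.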

Results

CF mutation frequency spectra were gathered for 94 populations from Europe, SW Asia, and North Africa. Information about populations, sample sizes, main mutation frequencies, and mutation diversities can be found in Table 1. In all, 32 different mutations were considered; complete mutation frequencies and other information can be found at http://www.upf.es/cexs/recerca/bioevo/index.htm. As seen in Figures 1 to 5, the most common CF mutations display clinal or largely clinal patterns as determined by their spatial correlograms. As for the direction of the clines, F508del peaks in NW Europe and declines towards SE Europe, G542X declines from SW to NE Europe, G551D is almost restricted to NW Europe, and N1303K and W1282X show gradients from SW Asia and SE Europe towards NW Europe.

Table 1 The 94 Middle Eastern, North African and European populations used in the analysis
Figure 1

Geographical distribution (a) and spatial autocorrelogram (b) of the F508del mutation in 94 Middle Eastern, North African, and European populations. The X-axis represents geographic distance between samples; the Y-axis represents Moran's index; a single asterisk (*) denotes P<0.05; double asterisks (**) denote P<0.01.

Figure 2

Geographical distribution (a) and spatial autocorrelogram (b) of the G542X mutation in 94 Middle Eastern, North African, and European populations. The X-axis represents geographic distance between samples; the Y-axis represents Moran's index; a single asterisk (*) denotes P<0.05; double asterisks (**) denote P<0.01.

Figure 3

Geographical distribution (a) and spatial autocorrelogram (b) of the G551D mutation in 94 Middle Eastern, North African, and European populations. The X-axis represents geographic distance between samples; the Y-axis represents Moran's index; a single asterisk (*) denotes P<0.05; double asterisks (**) denote P<0.01.

Figure 4

Geographical distribution (a) and spatial autocorrelogram (b) of the N1303K mutation in 94 Middle Eastern, North African, and European populations. The X-axis represents geographic distance between samples; the Y-axis represents Moran's index; a single asterisk (*) denotes P<0.05; double asterisks (**) denote P<0.01.

Figure 5

Geographical distribution (a) and spatial autocorrelogram (b) of the W1282X mutation in 94 Middle Eastern, North African, and European populations. The X-axis represents geographic distance between samples; the Y-axis represents Moran's index; a single asterisk (*) denotes P<0.05; double asterisks (**) denote P<0.01.

The richness of a mutation frequency spectrum can be summarized by several genetic parameters, such as allele diversity and θ. Maximum allele diversity (see Materials and methods) ranges from less than 0.4 to over 0.9, and shows a significant correlogram with a negative peak around 3000 km (P<0.0001, Figure 6(a)), with maxima in southern Europe (Turkey, Italy, Spain) and minima in northern and NW Europe (Denmark, Britain). Given the predominant S–N orientation of the cline, the maximum CF mutation diversity is strongly and negatively correlated with latitude (r=−0.587, P<0.001), although it is also slightly correlated with longitude (r=0.220, P=0.033). θ shows a similar spatial pattern, with a three-fold decrease from southern to northern/NW Europe, a significant correlogram (P<0.001, Figure 6(b)), and a significant negative correlation with latitude (r=−0.394, P<0.001).

Figure 6

Spatial autocorrelogram of genetic diversity calculated as expected heterozygosity of CF mutations (a) and of genetic diversity calculated as θ of CF mutations (b). The X-axis represents geographic distance between samples; the Y-axis represents Moran's index; a single asterisk (*) denotes P<0.05; double asterisks (**) denote P<0.01.

The previous analyses were performed on the mutation frequencies relative to the total CF chromosomes. For instance, on average 70% of CF chromosomes carry F508del, but, given the average CF incidence (1/2500 newborns), which implies that on average 2% of chromosomes carry a CF mutation, 1.4% of all chromosomes carry F508del. It should be noted that the prevalence of CF in Europe is irregular and correlated with longitude (r=−0.533, P=0.028, growing from east to west), and although the overall correlogram is significant (P=0.001), the pattern it shows is not significant at high geographical distance classes, and therefore can be interpreted as reflecting only isolation by distance.38 Furthermore, we did not find any correlation between CF incidence and F508del mutation frequency (r=−0.002; P=0.994).
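The arithmetic behind these figures can be made explicit with a short sketch. It assumes Hardy-Weinberg proportions and that every affected newborn carries two CF alleles, so that the incidence equals the square of the frequency of the CF allele class; the function name is hypothetical.

```python
import math


def allele_class_frequency(incidence):
    """Frequency of the CF allele class, ASSUMING Hardy-Weinberg
    proportions (incidence = q squared)."""
    return math.sqrt(incidence)


q = allele_class_frequency(1 / 2500)   # frequency of all CF alleles pooled
f508del_relative = 0.70                # share of CF chromosomes with F508del
f508del_absolute = f508del_relative * q
```

Here `q` corresponds to the 2% of chromosomes carrying a CF mutation and `f508del_absolute` to the 1.4% of all chromosomes carrying F508del.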

The relation among different CF mutation spectra across European populations can be investigated using standard population genetic methods, in order to relate it to the population history of the continent as reflected in other genome regions.

Genetic distances among CF mutation pools (ie a measure of the difference in relative CF mutation frequencies among pairs of populations) in different European populations were computed. By a Mantel test, we found a significant correlation between distances based on CF mutations and geographic distances (r=0.329, P<0.0005). For a subset of 23 populations, genetic distances based on classical polymorphisms31 were available. Distances based on mtDNA control region sequences35,36,37 were available for 27 populations, and distances based on Y-chromosome haplogroups34 for 28 populations. Classic and CF distances were not correlated (r=−0.039, P=0.48), even when controlling for geographic distance (r=−0.044, P=0.52). When known outliers such as Sardinians and Basques were removed from the analysis, the correlations increased, although without reaching statistical significance (r=0.083 and after controlling for geographical distance, r=0.090). The correlations with Y-chromosome-based distances were similar (r=0.147, P=0.116 and after controlling for geographical distance r=0.054, P=0.296). A higher and significant correlation was obtained for mtDNA control region sequences (r=0.423, P=0.004, and after controlling for geographic distance r=0.425, P=0.004).

We also analyzed the correlation between CF diversity and Y-chromosome and mtDNA control region diversity. In both cases, correlation was positive, although not statistically significant (r=0.308, P=0.111 for CF-Y chromosome and r=0.319, P=0.105 for CF-mtDNA diversity).

Discussion

We have described in detail the geographical variation pattern of the main CF mutations in Europe and in the immediately adjacent territories of Asia and Africa, and we have also characterized the diversity of mutation frequency spectra and their spatial pattern in this geographical frame. We have confirmed that the main CF mutations show frequency clines in Europe, and that this is also the case for mutation diversity. Mutation spectra are richer in southern than in northern Europe, but the pattern is not simply related to latitude. This has practical implications, since, on average, more mutations need to be assayed before finding the one(s) responsible for a particular case in southern than in northern Europe.

Measuring mutation diversity as θ opens the possibility of explaining this pattern. In a population at equilibrium, this parameter is an estimator of 4Neμ(1−f0), where Ne is the effective population size, μ is the mutation rate, and f0 is the prevalence of the overall disease allele class.7 Mutation rate is highly unlikely to be higher in southern than in northern Europe; therefore, a higher θ may be caused by a higher Ne and/or lower f0 in southern Europe. That is, we need to explore whether spatial patterns in incidence or effective population size may be related to CF mutation diversity. Estimates of CF incidence would indicate that the CF allele class is 3.8 times more frequent in Ireland than in Russia, the two extremes of the incidence of CF in Europe (see Table 1). This may pose a problem, since incidence may seem to have contributed to the observed spatial patterns in CF mutation diversity. However, the effects of the two parameters are not equivalent: for any Ne, doubling it would double θ; for a typical CF allele frequency f0=0.02, doubling it would mean just a 2% reduction in θ. The nonmutated CF chromosomes are the repository from which new mutant alleles arise and contribute to mutant diversity (new mutations arising on CF chromosomes already carrying F508del would normally go undetected, since mutation testing would stop after finding F508del; although, for counterexamples, see Savov et al39). Since even a large increase in the incidence has a small impact on the pool of normal chromosomes, variation in incidence is unlikely to have a large effect on mutation diversity. And, as predicted, the correlation between θ and the frequency of alleles carrying CF mutations is small and nonsignificant (r=0.061, P=0.821). Thus, the variation in CF incidence in Europe does not seem to contribute significantly to CF mutation diversity.
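The unequal sensitivity of θ to its two parameters can be checked numerically. The values of Ne and μ below are arbitrary placeholders chosen for illustration and are not estimates from this study; only the ratios matter.

```python
def theta(ne, mu, f0):
    """theta = 4 * Ne * mu * (1 - f0), as defined in the text."""
    return 4.0 * ne * mu * (1.0 - f0)


# Placeholder values (assumptions for illustration only).
NE, MU, F0 = 10_000, 1e-6, 0.02

baseline = theta(NE, MU, F0)
ratio_double_ne = theta(2 * NE, MU, F0) / baseline   # effect of doubling Ne
ratio_double_f0 = theta(NE, MU, 2 * F0) / baseline   # effect of doubling f0
reduction_percent = 100.0 * (1.0 - ratio_double_f0)
```

Doubling Ne doubles θ, whereas doubling f0 from 0.02 to 0.04 lowers θ by a factor of 0.96/0.98, that is, by about 2%.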

An additional factor may prevent incidence from modulating CF mutation diversity. The frequency of the CF allele class is likely to be in balancing selection equilibrium.3,4,5,6 It has been suggested that CF heterozygotes could have a selective advantage against cholera and other diarrhoeal diseases, even if Bertranpetit and Calafell3 showed that cholera by itself is not enough to explain selection on a CF background; that advantage would be given by any mutation that disrupts CFTR function, that is, by any mutation that causes CF. The selective advantage determines the overall frequency of the mutant allele class: for a selective advantage s, the equilibrium frequency of the mutant allele class is s/(1+s),40 although, for small values of s such as those expected for CF, the mutant frequency can take hundreds of generations to reach the equilibrium value. In this process, selection will pull up the frequency of any mutant allele as long as it confers a selective advantage to the heterozygote, that is, as discussed above, any CF-causing mutation. Thus, the initial spectrum of CF-causing mutations is that found when the selection process started; selection will increase the frequency of those mutations already present in the different populations, and, in the end, the total frequency of the mutant class may be similar across populations (as long as the selection pressure is similar), but the actual mutations that fill up this frequency determined by selection may be different across populations. Balancing selection will tend to increase the whole CF allele class frequency rather than reshape the spectrum itself, and, thus, it is not expected to have contributed to the pattern of CF mutation diversity in Europe. However, as discussed below, population history, through the action of gene flow and random drift, will then reshape the mutation spectra of the populations.
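The slow approach to the equilibrium frequency can be illustrated with a deterministic recursion. The genotype fitnesses are an assumption, since the text does not specify them: noncarriers 1−s, heterozygotes 1, and affected homozygotes 0, a parameterization chosen because it yields the equilibrium s/(1+s). The starting frequency and the value of s are hypothetical.

```python
def next_q(q, s):
    """One generation of selection on the mutant allele class, ASSUMING
    fitnesses 1 - s (noncarriers), 1 (heterozygotes), 0 (affected)."""
    p = 1.0 - q
    mean_fitness = p * p * (1.0 - s) + 2.0 * p * q
    # Only heterozygotes transmit mutant alleles; each carries one copy.
    return p * q / mean_fitness


def generations_to_fraction(q0, s, fraction=0.9, max_generations=100_000):
    """Generations needed to reach a given fraction of the equilibrium."""
    target = fraction * s / (1.0 + s)
    q, generation = q0, 0
    while q < target and generation < max_generations:
        q = next_q(q, s)
        generation += 1
    return generation


S = 0.02            # hypothetical heterozygote advantage
Q0 = 0.001          # hypothetical starting frequency
equilibrium = S / (1.0 + S)
wait = generations_to_fraction(Q0, S)

# Iterate for a long time to confirm convergence to s / (1 + s).
q_late = Q0
for _ in range(5000):
    q_late = next_q(q_late, S)
```

Under these assumptions the frequency grows by at most about 2% per generation, so that `wait` is well over a hundred generations, in line with the time scale mentioned in the text.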

We turn now to the factor that may have determined CF mutation diversity to the largest extent: effective population size. Taken at face value, a three-fold increase in θ from Britain (average θ=0.63) to Italy (average θ=2.20) should be the result of a historical effective population size three times larger in Italy than in Britain.41 Besides the milder, more favorable living conditions, several (pre)historic processes may account for a higher Ne in southern Europe. In the harshest phase of the last glaciation (the so-called Last Glacial Maximum, 18 000 years ago), human populations may have retreated to three glacial refugia in S Europe: Iberia, Italy, and the Balkans, from where they re-expanded when the ice sheet melted. It has been suggested that mtDNA42 (although see the ensuing debate in Simoni et al37 and Torroni et al43) and Y chromosome44 diversity patterns bear the traces of the postglacial re-expansions. Later on, the Neolithic expansion that carried the new farming lifestyle to Europe and brought a 10-fold population increase45 seems to have proceeded faster along the Mediterranean shores,37 where the population expansion may have started one to two millennia earlier than in northern Europe. All these factors may explain a higher long-term effective population size in the southeast.

In neutral genome regions, genetic distances can capture the shared history patterns among a set of populations. Thus, a correlation between distances based on CF mutation frequencies and distances based on other polymorphisms, once the shared effects of geographic distance are discounted, implies that the same population history has shaped variation at both loci. We have observed such a significant correlation with mtDNA, although that was not the case with the Y chromosome. Females seem to be more mobile than males,46,47 up to the point that mtDNA variation is much more homogeneous among European populations than Y-chromosome haplogroup frequencies.34,37 Although they show clear spatial patterns, the frequency of the main CF mutations is relatively homogeneous across European populations; an analysis of the molecular variance48 shows that differences among populations explain 3.34% of the variance in CF mutation frequencies; that fraction is 1.13% for mtDNA and 17.07% for Y-chromosome haplogroups in the panel of European populations used for comparison. Thus, the CF mutation landscape seems to correlate better with a homogeneous pattern such as that described by mtDNA than with a more structured pattern such as that found for the Y chromosome. The ubiquity of F508del and the wide areas of distribution of the other major mutations may be the result both of their antiquity and of the aid to dispersion contributed by heterozygote advantage.

The consideration of the whole mutation spectrum of CF has allowed us to interpret the natural history of this disease and its relation to demographic history in a much more comprehensive way than can be obtained with a single-mutation approach. Rather than a specific, ad hoc story per mutation, we can provide an overall frame to understand CF. We have shown how the interplay of population history and selection molded the mutation spectrum of CF; it can be expected that the same process has acted on other disease-related genes.