Introduction

Classical phenylketonuria (PKU) is a heterogeneous disorder caused by inborn errors of amino acid metabolism that lead to irreversible mental retardation in children if left untreated [1]. Penrose [2] observed that it is a genetic disorder transmitted as an autosomal recessive trait. Neonatal screening for PKU, which has been performed systematically for several years in developed countries, has led to a well-documented geographical distribution of its frequency. Its prevalence ranges from 4 × 10⁻⁴ in Turkey [3] to 8 × 10⁻⁶ in Japan [4]. The frequency of PKU among Caucasians is approximately 1 × 10⁻⁴ [5], so that the frequency of the autosomal gene is about 1%, according to the Hardy-Weinberg equilibrium.
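The gene frequency quoted above follows from the Hardy-Weinberg relation for a recessive trait, in which the prevalence of affected individuals equals q². The short Python sketch below only reproduces that arithmetic; the function names are illustrative and do not come from the cited references.

```python
import math


def recessive_allele_frequency(prevalence: float) -> float:
    """Allele frequency q of a recessive trait, assuming Hardy-Weinberg
    proportions so that prevalence = q**2."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie between 0 and 1")
    return math.sqrt(prevalence)


def carrier_frequency(prevalence: float) -> float:
    """Heterozygote (carrier) frequency 2pq under the same assumption."""
    q = recessive_allele_frequency(prevalence)
    return 2.0 * q * (1.0 - q)


# Prevalence of about 1 in 10,000 among Caucasians, as quoted in the text.
q = recessive_allele_frequency(1e-4)
print(f"q = {q:.4f}, carriers = {carrier_frequency(1e-4):.4f}")
```

With a prevalence of 1 × 10⁻⁴ this gives q = 0.01, i.e. the 1% stated in the text, and a carrier frequency close to 2%.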

PKU is primarily caused by a deficiency of the hepatic enzyme phenylalanine hydroxylase (PAH). The recent isolation of PAH cDNA [6] led to the identification of the gene structure [7] and its location in the 12q22–q24.1 region [8]. Restriction enzymes reveal eight RFLPs associated with the PAH gene [9]. These can be used in prenatal diagnosis and genetic counseling in PKU families [9–11]. The geographical distribution of the different mutant haplotypes defined by these RFLPs at the PAH locus has already been described in Europe, Asia and in Black Americans of African origin [see review in ref. 12]. Some of the mutant haplotypes are reported to decrease in frequency from east to west and from north to south (haplotype 2), whereas others seem to be present mostly in specific areas (e.g. haplotype 18 in China, haplotype 38 in France). The founder effect, migration and selective pressure are generally invoked to explain patterns in the geographical distribution of alleles.

However, these results are generally not statistically tested using a multivariate method, raising the question of the informativeness of the various PAH polymorphisms in localizing the geographical origins of mutations and describing the possible pattern of diffusion or selection.

The purpose of this article is to compare haplotype frequencies in different clusters of populations and, when possible, the distribution of the PKU and normal haplotypes within the same population, by performing multivariate analyses of the genetic diversity at the PAH locus.

Materials and Methods

The eight RFLPs [detected with BglII, EcoRI, EcoRV, HindIII, MspI, PvuII (two separate polymorphisms) and XmnI] have led to the identification of 60 different haplotypes. They are labeled 1–52 (table 1) according to the classification of Eisensmith and Woo [13]. Other haplotypes are labeled differently [according to ref. 14–16].

Table 1 RFLP haplotypes at the human PAH locus

We used published data for haplotype frequencies of PAH alleles in normal individuals and in PKU patients of various populations (table 2): Norway [17], Denmark [18], Sweden [14], Scotland [19], Poland [15, 20], Hungary [11], Czechoslovakia [11], Switzerland [19], Germany [21, 22], France [23], Italy [24], China [25], Japan [25], US Black [16], Turkey [26], Bulgaria [27] and Polynesia [28].

Table 2 Geographical distribution of 60 haplotypes at the human PAH locus

The data were first analyzed by a classical principal component analysis (PCA) of contingency tables [29].
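As an illustration of this step, the sketch below runs a PCA on a samples-by-haplotypes count table with NumPy. It is only one plausible reading of the procedure: the exact variant described in ref. 29 is not specified here, the choice to centre row frequencies column by column is an assumption, and the counts are hypothetical values that are not taken from table 2.

```python
import numpy as np


def pca_of_frequencies(counts):
    """PCA of a samples-by-haplotypes count table.

    Rows are converted to relative frequencies, each column is centred,
    and the principal axes are obtained from the SVD of the centred table.
    Returns (sample coordinates, proportion of variance per axis).
    """
    counts = np.asarray(counts, dtype=float)
    freqs = counts / counts.sum(axis=1, keepdims=True)  # row profiles
    centred = freqs - freqs.mean(axis=0)                # centre each haplotype
    u, sing, _ = np.linalg.svd(centred, full_matrices=False)
    coords = u * sing                                   # sample scores on each axis
    explained = sing**2 / np.sum(sing**2)               # share of variance per axis
    return coords, explained


# Hypothetical counts (NOT from table 2): 4 samples x 3 haplotypes.
toy_counts = [
    [30, 10, 5],
    [28, 12, 6],
    [5, 20, 25],
    [6, 18, 30],
]
coords, explained = pca_of_frequencies(toy_counts)
print(np.round(explained, 3))
```

The proportions in `explained` correspond to the percentages of variance reported per axis, and the columns of `coords` are the positions of the samples on those axes.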

Then statistical analyses [AMOVA; ref. 30] were performed on the basis of a linear model defining different effects on individual haplotypes:

$${p_{{\rm{jig}}}} = p + {a_{\rm{g}}} + {b_{{\rm{ig}}}} + {c_{{\rm{jig}}}},$$

where pjig is the s-dimensional vector of the jth haplotype in the ith population in the gth group of populations. This s-dimensional Boolean vector is of the form {r1, r2, …, rs}, where rm = 1 if the haplotype is cut at restriction site m, and 0 otherwise. The dimension is s = 8 RFLP sites (see table 1). For example, the Boolean vector corresponding to haplotype 1 is the following string of 0 and 1: {01001000}. p is the unknown expectation of pjig averaged over the whole study, a the group effect, b the population effect within groups and c the individual effect within populations. Effects are assumed to be uncorrelated and randomly distributed, with variances S²a, S²b and S²c for effects a, b and c, respectively, and S²t = S²a + S²b + S²c.

To estimate these variances, the distance δ²jk between two haplotypes j and k must first be defined [30].

In this study, we assume that RFLP sites are independent and equally informative. We also assume that the number of mutational steps separating two haplotypes is equal to the number of observed RFLP differences between them (for example haplotype 1 {01001000} and haplotype 3 {01010100} are separated by 3 steps; see table 1). This simplification, which is not theoretically required by Excoffier’s model, is essential since we cannot apply any evident probabilistic model to crossing-over events compared to mutation events. In other words, we assume that crossing-over producing a new haplotype from two common or rare haplotypes is an event which is less probable than the event producing the minimum number of steps required to obtain this new haplotype from another haplotype. For example, we assume that the distance between haplotypes 3 and 8 is more likely the result of three mutations than a single crossing-over of haplotypes 3 and 5 occurring between the restriction sites c and e (see table 1). This is a simplified model since at present we do not have accurate information on mutation rates characteristic of each restriction site or on the probability of each possible crossing-over between these restriction sites.

Under these conditions, the evolutionary distance δ²jk between haplotype j and haplotype k is given by the squared Euclidean distance between the vectors pjig and pkig:

$$\delta^2_{{\rm{jk}}} = \sum {({p_{{\rm{jig}}}} - {p_{{\rm{kig}}}})^2}$$

where the sum is over all s restriction sites.
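For 0/1 vectors, this sum of squared differences is simply the number of restriction sites at which the two haplotypes differ. The sketch below checks this on the two patterns quoted in the text for haplotypes 1 and 3; the function names are illustrative.

```python
from typing import List


def haplotype_vector(pattern: str) -> List[int]:
    """Turn a string such as '01001000' into a list of 0/1 site states."""
    if set(pattern) - {"0", "1"}:
        raise ValueError("pattern must contain only the characters 0 and 1")
    return [int(ch) for ch in pattern]


def squared_distance(hap_j: str, hap_k: str) -> int:
    """Sum over sites of squared differences between two haplotypes.

    For Boolean vectors this equals the number of differing sites.
    """
    vj, vk = haplotype_vector(hap_j), haplotype_vector(hap_k)
    if len(vj) != len(vk):
        raise ValueError("haplotypes must cover the same number of sites")
    return sum((a - b) ** 2 for a, b in zip(vj, vk))


HAP1 = "01001000"  # haplotype 1, as given in the text
HAP3 = "01010100"  # haplotype 3, as given in the text
print(squared_distance(HAP1, HAP3))  # 3
```

The value 3 matches the three mutational steps assumed between haplotypes 1 and 3.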

According to Li [31] and Excoffier et al. [30], the sum of squared differences δ²jk between all pairs of N haplotypes can be broken down into: (a) the sum of squared differences within a population; (b) the sum of squared differences between populations within the same population group, and (c) the sum of squared differences between population groups. These different sums of squares lead to the estimation of S²a, S²b and S²c. The significance of the different components is tested using a permutational approach, which is appropriate when the normality assumption is unwarranted, as is the case with molecular data [30].
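The decomposition can be sketched in a few lines of Python. This is a simplified illustration and not the published AMOVA implementation: it stops at the sums of squares without converting them into variance components, the permutation test uses the among-group sum of squares as its statistic (equivalent to the variance component only when group and population sizes are balanced), population names are assumed to be unique, and the records are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared distance between two 0/1 haplotype strings."""
    return sum((int(x) - int(y)) ** 2 for x, y in zip(a, b))


def ssd(haps):
    """Sum of squared deviations from pairwise distances:
    (1 / n) * sum over unordered pairs of the squared distance."""
    n = len(haps)
    if n == 0:
        return 0.0
    pairs = sum(sq_dist(haps[j], haps[k])
                for j in range(n) for k in range(j + 1, n))
    return pairs / n


def partition_ssd(records):
    """records: list of (group, population, haplotype) tuples."""
    by_pop, by_group = defaultdict(list), defaultdict(list)
    for group, pop, hap in records:
        by_pop[(group, pop)].append(hap)
        by_group[group].append(hap)
    total = ssd([hap for _, _, hap in records])
    within_pops = sum(ssd(h) for h in by_pop.values())
    within_groups = sum(ssd(h) for h in by_group.values())
    return {
        "among_groups": total - within_groups,
        "among_pops_within_groups": within_groups - within_pops,
        "within_pops": within_pops,
        "total": total,
    }


def permutation_p_among_groups(records, n_perm=999, seed=0):
    """Permute whole populations among groups and count how often the
    among-group sum of squares reaches the observed value."""
    rng = random.Random(seed)
    observed = partition_ssd(records)["among_groups"]
    pops = sorted({(g, p) for g, p, _ in records})
    groups = [g for g, _ in pops]
    hits = 0
    for _ in range(n_perm):
        shuffled = groups[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(pops, shuffled))
        permuted = [(relabel[(g, p)], p, h) for g, p, h in records]
        if partition_ssd(permuted)["among_groups"] >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical records (NOT from table 2): 2 groups x 2 populations.
TOY_RECORDS = (
    [("A", "pop1", "01001000")] * 3 + [("A", "pop1", "01010100")]
    + [("A", "pop2", "01001000")] * 2 + [("A", "pop2", "01010100")] * 2
    + [("B", "pop3", "10110011")] * 3 + [("B", "pop3", "10110001")]
    + [("B", "pop4", "10110011")] * 2 + [("B", "pop4", "00110011")] * 2
)
print(partition_ssd(TOY_RECORDS))
print(permutation_p_among_groups(TOY_RECORDS, n_perm=199))
```

With only four populations there are very few distinct regroupings, so the permutation p-value of this toy example cannot be small; the point is only to show how the three sums of squares add up to the total.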

As no information on the individual level of variability (within-individual haplotypic variability) is available, we do not take into consideration this level of variability. Consequently, instead of analyzing individuals, we applied Excoffier’s method to the N haplotypes observed in various populations.
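In practice this means turning a table of haplotype counts per population into one record per observed haplotype. A minimal sketch of that expansion follows, assuming the counts are held in nested dictionaries; the data layout, the function name and the counts are hypothetical, and only the two site patterns are taken from the text.

```python
def expand_counts(counts, patterns):
    """Expand haplotype counts into one record per observed haplotype.

    counts:   {population: {haplotype_label: number observed}}
    patterns: {haplotype_label: '01001000'-style site pattern}
    Returns a list of (population, pattern) tuples of length N.
    """
    records = []
    for population, hap_counts in counts.items():
        for label, n in hap_counts.items():
            if n < 0:
                raise ValueError("counts must be non-negative")
            records.extend([(population, patterns[label])] * n)
    return records


# Site patterns for haplotypes 1 and 3 as quoted in the text.
PATTERNS = {"1": "01001000", "3": "01010100"}
# Hypothetical counts, NOT taken from table 2.
TOY_COUNTS = {"popA": {"1": 4, "3": 1}, "popB": {"1": 2, "3": 3}}

haplotype_records = expand_counts(TOY_COUNTS, PATTERNS)
print(len(haplotype_records))  # 10
```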

Results

Table 2 shows the number of various normal and PKU haplotypes at the PAH locus observed in several populations.

The results given by PCA are plotted on figures 1 and 2. When PCA includes all populations, the frequencies of normal and PKU alleles of the US Black population account for the largest part (58.5%) of the variance on the first three axes. Clearly, this population is the most distant from the others, essentially because of the high frequencies of haplotypes 15, 35, 36 and BA (table 2, fig. 1), although normal and PKU frequencies of haplotypes, estimated on a very small sample, are very different within the US Black population. Another group of populations including China, Japan and Polynesia accounts for 43% of the variance on the second axis and 17% on the first three axes.

Fig. 1 PCA of the PAH haplotype frequencies on the whole samples. Percentages are the proportions of variance explained by the first two axes. USB and usb are PKU and normal haplotypes, respectively, of the US Black population. The Asiatic populations (including the Polynesians) form one cluster and the European populations another.

Fig. 2 PCA of the PAH haplotype frequencies in European populations. Upper case: PKU haplotypes; lower case: normal haplotypes. H1, H2, H5 and H6 are haplotypes 1, 2, 5 and 6 (table 1). Percentages are the proportions of variance explained by the first three axes (38.8% of the whole variance). The dashed line separates PKU and normal haplotypes.

In an attempt to locate more accurately the various European populations in the space of the allele frequencies, a PCA was then performed excluding US Black, Asiatic and Polynesian populations. Since Bulgarian data are not available for both PKU and normal haplotypes, this population was also removed from the analysis. In this case, each population includes both a normal and a PKU sample. Along the first three axes, accounting for 38.8% of the whole variance, PKU and normal populations are plotted in distinct regions of the figure. When all these European populations are considered together, samples of PKU haplotypes show a higher frequency of haplotype 2 (highly correlated to the first axis, r = 0.96) and a lower frequency of haplotype 6 (correlated to the second axis, r = 0.91). It should be noted that only the PKU populations of Poland, Czechoslovakia and Hungary are clustered together on the first three axes, not the normal populations, which are more widely scattered. Italian and Turkish PKU populations are also far from the corresponding normal populations, essentially owing to the high frequency of PAH haplotype 6 in PKU populations. However, it should be noted that the sample sizes are small and that haplotypes are not always typed for these two populations, so that their location on the diagram is still uncertain.

AMOVA was applied in different ways. First, seven population groups were formed, one including, as clearly suggested by the PCA (fig. 1), all Asian and Polynesian populations (China, Japan, Polynesia) and the other six each composed of two or three European populations. The US Black population was removed from the analysis. Analyses were separately performed for normal and PKU haplotypes.

Several distributions of European populations into the six groups were tested. All of them led to roughly the same result when Asiatic and Polynesian populations composed the seventh group (table 3): the variance among groups, which represents about 10% of the total variance, is significantly different from 0 at p < 0.01. This result holds for both normal and PKU alleles. Results of the following distribution are given in table 3: Norway and Sweden; Scotland and France; Denmark and Germany; Poland and Czechoslovakia; Turkey and Hungary; Italy and Switzerland; Polynesia, China and Japan.

Table 3 Hierarchical analysis of molecular variance at the PAH locus

When the Asiatic and Polynesian group is removed, the variance among the six European groups is no longer significantly different from 0 with normal alleles (table 3), regardless of how the European populations are distributed among the groups (two populations per group). The overall variance is almost totally explained by the within-population variance. Twenty different distributions of the twelve European populations into six groups were tested, all of them leading to the same results. These partitions were built either at random or on the basis of geographical proximity (or following PCA results). This last method would be expected to give the best results according to the hypothesis of the gradient of haplotype frequencies in Europe. However, the only significant results (p < 0.01) are observed with PKU alleles and when Czechoslovakia and Hungary are assigned to the same group (as given in table 3). In this case, both the variance among groups and the variance between populations within a group are significantly different from 0, even though they explain small parts of the whole variance (6.0 and 3.4%, respectively). When these two populations are not assigned to the same group, the variance among groups is no longer significantly different from 0.

AMOVA was also applied to different geographical clusters of populations, each of them including two different samples, normal and PKU haplotypes, of the same population. The variance between populations (table 3) is significantly different from 0 only when the Asiatic group (China, Japan and Polynesia) is considered, and not when only European populations are taken into account. Moreover, the variance between normal and PKU samples within populations is highly significant (p < 0.01) although this variance represents less than 5% of the total variance.

Discussion

Population differences for haplotype frequencies at the PAH locus are now well documented [12, 32, 33]. However, as far as we know, no inclusive statistical analysis of all available haplotypes has been performed. The PAH locus is highly polymorphic, since at least 71 different haplotypes have been listed recently [13]. A few haplotypes are found in all populations (haplotypes 1–4), whereas the majority are found in few, sometimes distantly related populations (table 2). Therefore, a specific statistical analysis is required to validate inferences as to the relationships between populations or a possible migratory flux between them.

Both PCA (giving heuristic descriptions of the data) and hierarchical analysis of variance (which tests different clustering hypotheses) lead to the same conclusions.

PKU and normal haplotypes clearly show a substantial divergence among Asian populations (including Polynesians), the US Black population and European populations. However, this intercontinental divergence explains no more than 6–7% of the molecular variance at this locus, with the within-population variance accounting for about 90% of the total variance.

Looking at the European populations, hierarchically structured groups of populations cannot be established significantly for either normal or PKU haplotypes. The only significant results indicate that Czechoslovakia and Hungary (and Poland) share the same profile of PAH haplotypes in the PKU population only. One can also note that, in Europe, the differences between the distributions of normal haplotypes and PKU haplotypes within the same population are very low (less than 5% of the whole variance), albeit significant (p < 0.01). In other words, PKU haplotype frequencies and normal haplotype frequencies at the PAH locus diverge slightly but systematically for all populations, as shown in figure 2. This can be explained by a higher frequency of haplotypes 2, 3 and 6, and lower frequencies of haplotypes 1 and 5, in PKU than in normal samples, whatever the geographical origin of the population.

However, attempts to interpret, at least in part, the multifactorial analysis plotted on figure 2 in terms of geography failed. PKU populations of Denmark, France and Scotland are close together on the first two axes. Poland, Hungary and Czechoslovakia are clearly plotted in the same cluster only for PKU populations. This result is statistically demonstrated by the variance analysis. No trend from east to west or from south to north, as suggested by several authors for some haplotypes [34], can be convincingly inferred when all haplotypes are taken into account.

One can reasonably argue that the sample size for each population is far too small to allow clear statistical conclusions, owing to lack of power. If the haplotype frequencies are too inaccurate to use a methodology built on estimation of frequencies, another approach may be possible by only noting, for each population, the presence or absence of each haplotype whatever its frequency and then looking for a parsimonious tree of populations, following a method suggested by Mickevitch and Mitter [35]. Unfortunately, this method leads to totally unresolved trees, since more than 100 equally parsimonious trees are obtained with the same data.

Moreover, the model used to evaluate distances between haplotypes before performing variance analyses is probably too simple and needs to be improved by integrating information on mutation rates and crossing-over probabilities as soon as they become available, and by following, for instance, the suggestions of Templeton et al. [36].

As far as European populations are concerned, one can also argue that we did not try all possible distributions of the twelve populations into groups of various sizes in order to obtain at least one with significant results. However, as explained, obvious and logical distributions based on geographical criteria or on PCA did not lead to significant results. In European populations, the proportion of variance within populations is clearly higher than both the variance among groups and the variance among populations within a group. This result holds even when one population group is very distant from the others (analysis with Japan, China and Polynesia as a supplementary group; table 3). Thus, there is no chance of reversing this trend merely by redistributing populations at random. Moreover, the advantage of PCA is that it gives a clear picture of the distribution of populations in the space of haplotype frequencies. As can be seen, the structure shown by the PCA is weak, so there is no chance that a better distribution can be found. Thus, if one distribution gives significant results by chance (about 50 among 1,000 random distributions at a probability level of 5% with unstructured data), it will have to be explained by factors other than known geographical location or known historical events.
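To put these numbers in perspective, the short sketch below counts how many distinct ways twelve populations can be split into six unordered pairs, and reproduces the expected number of chance findings at the 5% level. It is plain arithmetic under the assumption that the groupings are tested independently on unstructured data; the function name is illustrative.

```python
from math import factorial


def equal_size_partitions(n_items: int, group_size: int) -> int:
    """Number of ways to split n_items into unlabeled groups of equal size:
    n! / ((k!)**m * m!) with k = group_size and m = n_items // group_size."""
    if group_size <= 0 or n_items % group_size:
        raise ValueError("n_items must be a positive multiple of group_size")
    n_groups = n_items // group_size
    return factorial(n_items) // (factorial(group_size) ** n_groups
                                  * factorial(n_groups))


n_pairings = equal_size_partitions(12, 2)
expected_false_positives = 0.05 * 1000  # 5% level, 1,000 random groupings

print(n_pairings)                # 10395
print(expected_false_positives)  # 50.0
```

The twenty distributions actually tested are therefore a small fraction of the 10,395 possible pairings, which is the limitation acknowledged in the paragraph above.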

Obviously, the RFLP polymorphism at the PAH locus does not suggest any significant heterogeneity among European populations (except for haplotype 2) and, consequently, cannot provide unequivocal inferences on the past evolution of European populations. To answer this question, we need more information about the mutations in the PAH gene. Moreover, different mechanisms can explain the lack of geographical structure in haplotype frequencies, such as the founder effect, historical variations in the effective population size, sampling bias, selective pressure or migration flux occurring differently both for different haplotypes and in various spatial directions. As a result, one can only point out that PKU and normal haplotypes show significantly different frequencies. On the other hand, this locus will provide useful information on the divergence between Asiatic, European and African populations as soon as the present lack of extensive data on Africa and Asia is remedied. For inferring the evolution of human populations using polymorphism at the PAH locus, future work will no doubt be increasingly fruitful.