Introduction

The quantitative study of the DNA random diversity can only be approached through a suitable ‘window’ (nbp × ng) consisting of nbp base pairs, studied on a number ng of genomes. The whole spectrum of possible ‘windows’ ranges between two opposite and complementary ‘extreme windows’: those where many genes (that is many nbp) are studied on few ng (‘1st type loophole windows’) and those where only a single gene is studied on a very large ng sample (‘2nd type loophole windows’).

The type of molecular variation most suitable for these studies is the single nucleotide substitution in a coding sequence (cSNS; hereafter, when no ambiguity is possible, we shall also refer to cSNSs simply as substitutions). The number and the consequences of the cSNSs that may occur in a sequence of known length are known, and they include a perfectly defined ‘control class’ (the synonymous cSNSs) mainly consisting of neutral mutations (although some of them are known to affect splicing: for the possible molecular mechanisms see, eg, Pagani et al1,2).

The majority of the previous investigations3,4,5,6,7,8,9 adopted the ‘1st-type loophole’ approach (ng≈100); thus, in spite of their extensiveness (nbp≈100 Kb), which produced exhaustive information about the common variation, they could not, in principle, provide direct information about the range of rare variability. More recently, a few studies10,11,12 have adopted a less extreme approach by studying some genes on hundreds of random individuals from the major human groups (Africans, Europeans, Asians).

The present investigation adopted the ‘2nd-type loophole approach’. An average sample of 1550 European genomes was studied by examining the pattern of common (=polymorphic; q>0.005) as well as rare (=subpolymorphic) variability, of a single gene (the CFTR gene, Cystic Fibrosis Transmembrane conductance Regulator, consisting of 4.5 coding kb) which belongs to the class of autosomal nonduplicated genes performing an essential function, whose deleterious alleles are mainly recessive.

The lower limit of the range of variation that can be considered reliably explored with a given ng is the frequency at which the variants with the lowest q are expected to be found at least ca. 5 times (q × ng≈5): with ng≈1550, as in the present investigation, the lowest end of the range corresponds to q≈0.003, instead of 0.05 (corresponding to ng=100).
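The rule of thumb q × ng≈5 can be sketched in a few lines. The function name and the default of five expected copies are illustrative choices, not part of the original analysis.

```python
def lowest_reliable_q(ng, min_expected_copies=5):
    """Lowest allele frequency q such that q * ng equals min_expected_copies.

    ng is the number of genomes sampled; min_expected_copies is the
    (assumed) minimum number of times a variant should be expected to
    appear for its frequency class to count as reliably explored.
    """
    return min_expected_copies / ng


# Sample sizes mentioned in the text: 100, 400 and 1550 genomes.
for ng in (100, 400, 1550):
    print(ng, round(lowest_reliable_q(ng), 4))
```

With ng=1550 the threshold is about 0.003, and with ng=100 it is 0.05, as stated above.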

The large body of data generated by this investigation allowed us to study directly, for the first time, an issue of general relevance in molecular evolution, namely the pattern of nonsynonymous (NS) and synonymous (S) substitutions in the range of rare variability; it has also allowed us to evaluate the rate of misclassification (rare or common) of variants as a function of sample size.

Materials and methods

In this study, we systematically explored all the 27 exons of the CFTR gene, using denaturing gradient gel electrophoresis (DGGE) or denaturing high-performance liquid chromatography (DHPLC), genotyping methods with a very high efficiency (about 98%, see Bombieri et al13) in detecting molecular variation. Preliminary results on a subset of 400 individuals, and details on the method, are in Bombieri et al.13

The sample

All the individuals studied (present investigation and previous data13) come from six geographical areas, namely, Northern Italy (Verona), Central Italy (Rome), Southern France (Montpellier), Northern France (Brest), the Czech Republic (Prague) and Spain (Barcelona). All individuals gave their informed consent.

Mutation analysis

Genomic DNA was extracted from blood samples, amplified in vitro by PCR and analyzed by DGGE, as previously reported,13 or by DHPLC. Every mutant discovered by these methods was sequenced with the ABI PRISM 377 or 310 Sequence Analyser. The following cSNSs, numbered as given in Table 1, were studied by RFLP analysis: no. 1, see Ghanem et al;14 nos 20 and 59, see Fanen et al;15 no. 37, see Chillon et al;16 and nos 12, 24–26, 28, 29, 45, 48, 56, 60, methods available on request (cristina.bombieri@medgen.univr.it).

Table 1 List of the 61 cSNSs encountered in the present survey

Subdivision of the total number of the theoretically possible cSNSs into NS and S cSNSs

The CFTR coding sequence consists of 4443 bp (1480 sense codons+a stop codon); thus the total number of possible cSNSs is (4443 × 3)=13 329. To compute exactly how many of these 13 329 cSNSs would be NS and how many would be S, we considered the CFTR codon usage, rather than simply its amino-acid composition, as has commonly been done. The total number of possible NS cSNSs, N_NS, is 10 408 (including the eight stop codon → aa codon, and the 650 aa codon → stop codon) and that of the synonymous cSNSs, N_S, is 2921 (including the TGA stop codon → TAA stop codon). If the mutation rate μ were the same for NS and S substitutions and, once the mutations occurred, both types of substitutions had the same probability of being sampled, the expected ratio N_NS/N_S would be 3.56 (=10 408/2921). This ratio should be valid for both subpolymorphic and polymorphic cSNSs.
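The codon-by-codon count described above can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and an in-frame coding sequence given as a DNA string; the function name `count_ns_s` is hypothetical and this is not the authors' original program. As in the text, stop → stop changes are counted as synonymous, while aa → stop and stop → aa changes are counted as nonsynonymous.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_ns_s(cds):
    """Count all possible nonsynonymous and synonymous single-nucleotide
    substitutions in an in-frame coding sequence (upper-case DNA string).

    Returns (n_ns, n_s); their sum is 3 substitutions per nucleotide.
    """
    n_ns = n_s = 0
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        codon = cds[start:start + 3]
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    n_s += 1
                else:
                    n_ns += 1
    return n_ns, n_s


# Toy sequence (Met, Trp, stop), not the CFTR sequence.
print(count_ns_s("ATGTGGTGA"))
```

Applied to the full 4443-bp CFTR coding sequence, a count of this kind should return the two totals quoted above, provided the same reference sequence is used.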

These figures can also be used to estimate the probability of being polymorphic or subpolymorphic for the NS (PPolyNS and PSubPolyNS) and for the S (PPolyS and PSubPolyS) substitutions. For example, the estimate of PPolyS is the ratio between the number of S substitutions that have been shown to be polymorphic and 2921. It is worth pointing out that the definition of polymorphic adopted here for an allele is based only on its frequency, disregarding any of its possible phenotypic effects.

Estimate of θ and π and of the distribution of variant frequencies

Under neutrality, both π (mean heterozygosity per site) and θ (based on the number of segregating sites) are expected to be independent of the sample size and equal to 4Neμ.17,18,19 These parameters can be estimated as follows:

$$\pi = \frac{\sum_{i=1}^{n} H_i}{n_{bp}}$$

where Hi is the 2pq observed for each of the n cSNSs detected (see the last two columns of Table 1) and nbp is the length of the sequence studied;

$$\theta = \frac{n_{cSNS}}{n_{bp}\,\sum_{i=1}^{n_g-1} 1/i}$$

where $\sum_{i=1}^{n_g-1} 1/i$ is a factor that should counterbalance the expected increase of ncSNS associated with the increase of ng.

In Tajima's test,19 the null hypothesis of neutrality is rejected if a statistically significant difference between π and θ is observed, because different selective regimes affect these two quantities differently.20
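The two estimators and Tajima's D can be sketched as below. This is a minimal illustration under stated assumptions: θ is taken to be the standard Watterson estimator (the number of cSNSs divided by nbp and by the harmonic sum over 1..ng−1), D uses the standard constants from Tajima's test, a single mean ng is used for all sites, and the function names are hypothetical. It is not the authors' original code.

```python
from math import sqrt


def harmonic(n_g, power=1):
    """Sum of 1/i**power for i = 1 .. n_g - 1."""
    return sum(1.0 / i ** power for i in range(1, n_g))


def pi_per_site(heterozygosities, n_bp):
    """Mean heterozygosity per site: sum of the per-site 2pq divided by n_bp."""
    return sum(heterozygosities) / n_bp


def theta_per_site(n_csns, n_bp, n_g):
    """Watterson-type estimate: segregating sites corrected for sample size."""
    return n_csns / (harmonic(n_g) * n_bp)


def tajima_d(heterozygosities, n_g):
    """Tajima's D from the per-site heterozygosities of the segregating sites."""
    s = len(heterozygosities)   # number of segregating sites
    k = sum(heterozygosities)   # approx. mean number of pairwise differences
    a1, a2 = harmonic(n_g), harmonic(n_g, 2)
    b1 = (n_g + 1) / (3 * (n_g - 1))
    b2 = 2 * (n_g ** 2 + n_g + 3) / (9 * n_g * (n_g - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n_g + 2) / (a1 * n_g) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# 61 cSNSs, 4443 bp and a mean sample of 1550 genomes (figures from the text).
print(theta_per_site(61, 4443, 1550))
```

With these three figures the sketch gives θ≈17.3 × 10−4, the value reported for the whole sample.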

The distribution of gene frequencies expected by the neutral mutation model was calculated according to Glatt et al.10

Results

In the present investigation, a gene of ca. 4.5 coding kb was explored on a mean random sample of 1550 European (Italian, French, Spanish, Czech) genomes (total window (ng × nbp)≈7 Mb). The results are presented in Table 1 and Figure 1. Table 1 reports the detailed results derived from each subsample and Figure 1 shows the length of the analyzed DNA sequences encompassing the variable nucleotide sites and the cSNSs found, along with estimates of their frequencies and their position in the gene. A total of 61 cSNSs were found: 45 NS and 16 S. Only 12 (six NS and six S) showed a polymorphic frequency (q>0.005), corresponding to a mean density of 2.71 SNP per kb (the total length of the analyzed sequence being 4.4 kb), a figure that compares well with the mean density (1.77 per kb) published in the dbSNP website (http://www.ncbi.nlm.nih.gov/SNP).

Figure 1

The windows (ng × nbp) utilized to study the 27 CFTR exons and the position of the 61 discovered cSNSs with the frequency of the minor allele. Upper graph: The figure within each box (▪ and ♦) is the reference number of the cSNS as in Table 1. The q confidence limits are equal to 2.5 se. Lower graph: Each exon (or PCR-amplified segment (the exon 13 has been amplified into two segments)) is represented as a window with the horizontal and the vertical side indicating, respectively, the length of the fragment studied (nbp) and the sample size (ng) (some cSNSs have been examined on an additional sample (not shown in the graph) by a specific method, see Materials and methods). The two figures inside each window indicate the exon number and its distance, in kb, from the cap site (GDB accession nos AC000061-AC000111).

Two parameters describing the nucleotide diversity (θ, proportion of segregating sites; and π, mean heterozygosity) and Tajima's D were calculated for the whole sample (1550 genomes) and for two subsamples of 100 and 400 genomes, respectively (Table 2, section (a)). Furthermore, θ, π and Tajima's D were calculated separately for the NS and the S cSNSs in the whole sample (ng=1550) (Table 2, section (b)).

Table 2 Dependence of the pattern of the CFTR cSNS variability on the size, ng, of the random sample and on the type (NS or S) of cSNS

The data concerning the four subsamples of 100 genes are comparable with each other and with those of the literature obtained on samples of similar size and relating to very numerous genes: the present θ and π are similar to those of the other authors4,5 both in their values and in being in agreement with Tajima's model (Tajima's D not significantly different from zero). This is the first investigation in which a gene was studied on a very large ethnically homogeneous sample, so that a comparison of the present parameters θ and π, obtained on the whole sample (1550 genomes), is possible only with those of the few studies in which a large, even though ethnically heterogeneous, sample has been analysed.10,11,12 Both in the present and in the previous studies the values of π obtained on large samples are not significantly different from those derived from small samples; on the contrary, θ turned out to be higher both than π and than the θ observed with small samples.

Discussion

Effect of sample size on the apparent level of polymorphism

To evaluate the rate of misclassification of the variants detected as singletons with ng=100 (estimated q=0.01), we examined 400 random genomes (200 individuals) subdivided into four subsamples of 100 genes each (marked by 1st in Table 1) and compared the data obtained with these four subsamples with those observed in the whole study. In particular, we recorded the number of what we refer to as ‘false negatives’ (cSNSs that failed to be detected in a subsample ng=100, even though their ‘true’ frequency (estimated in the whole sample of 1550 genomes, on the average) was certainly higher than 0.01) and ‘false positives’ (cSNSs that exhibited, in one or more subsamples of 100 genes, a q=0.01 although their ‘true’ frequency was less than 0.005). There were six false negatives (0, 2, 1 and 3 in the four subsamples) and 14 false positives (7, 2, 3 and 2).

These data show that the number of cSNSs found with ng=100 is very likely to be an overestimate of the true number of cSNPs, since the false positives were more than the false negatives. Furthermore, the final q's of the 34 singletons were largely dispersed (from 0.0004 to 0.0286, Figure 2) and many were lower than 0.01 with a final mean value of only 0.0052 instead of 0.01. In addition, the singletons found in each of the four subsamples were almost never the same although they were all derived from the same population (Europeans).
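Why singletons in a sample of 100 genes are so unreliable can be illustrated with a simple sampling calculation. The sketch below assumes independent binomial sampling of gene copies from a population in which the variant has true frequency q; the function names are hypothetical and the calculation is not part of the original analysis.

```python
def p_detect(q, ng):
    """Probability that a variant of true frequency q is seen at least
    once among ng sampled gene copies (binomial sampling assumed)."""
    return 1.0 - (1.0 - q) ** ng


def p_singleton(q, ng):
    """Probability that the variant is seen exactly once among ng copies."""
    return ng * q * (1.0 - q) ** (ng - 1)


# A variant at the nominal singleton frequency (q = 0.01) is still missed
# in a sample of 100 with probability 1 - p_detect(0.01, 100).
print(1.0 - p_detect(0.01, 100))

# A rare variant (q = 0.001) nevertheless shows up as a singleton,
# and is then assigned q = 0.01, with probability p_singleton(0.001, 100).
print(p_singleton(0.001, 100))
```

Under this assumption a variant with true q=0.01 is missed in more than one-third of the subsamples of 100, while each variant with q=0.001 has roughly a 9% chance of appearing as a singleton. Because rare variants are far more numerous than common ones, the second effect can dominate, which is in line with the excess of false positives reported above.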

Figure 2

The distribution of the final q's (those observed on the sample of 1550 genomes) of the 34 ‘singletons’ found on the four subsamples of 100 genomes (see text).

Evidence suggesting a role of selection

The value of θ increased with the increase of sample size (6.3 × 10−4 with ng=100; 10.6 × 10−4 with ng=400 and 17.3 × 10−4 with ng=1550; see Table 2, section (a)), leading to a negative D value that is statistically significant only with large ng. Negative Tajima's D values are consistent with population expansion and/or negative selection. The observation that the present increase of θ is almost exclusively due to the NS cSNSs (see Table 2, section (b)) argues against a pure neutrality model: demographic events, in fact, would have affected both types of substitutions (NS and S) to the same extent. A selection process is also suggested by the pattern of NS and S substitutions (see later) and by the distribution of the variant frequencies: if one compares the observed distribution with that predicted by the neutral mutation model, there is a striking excess of rare variants, particularly of the NS type: in the class of variants with q<0.005, the expected number among the 45 NS variants was 14.8, while the observed was 39 (P≈0); for the 16 synonymous variants the number expected was 5.2 and the observed was 10 (P≈0.04). The slightly significant P for the synonymous substitutions may be due to recent population expansion.

High values of θ and significantly negative Tajima's D have been found also in the few investigations in which much larger samples than usual were studied.10,11,12 In particular, Glatt et al.10 compared the π, θ and D obtained on a subsample of 180 genomes with those obtained on a sample of 900 genomes and found a similar increase of θ almost exclusively due to the NS substitutions and a significantly negative D value only on the larger sample.

Patterns of NS and S substitutions

A reasonably reliable knowledge of the frequency of an adequate number of rare alleles enabled us to compare for the first time the NS with the S pattern of variation also in the subpolymorphic frequency range (Figure 3).

Figure 3

The q distribution of the 45 NS and 16 S cSNSs.

The table inserted in Figure 3 shows that the observed relative fractions of the NS and S cSNSs depend dramatically on the range of variation: the ratio turned out to be 6:6 among the 12 polymorphic substitutions (q > 0.005) and 39:10 among the 49 subpolymorphic (q < 0.005) substitutions (P<0.005).

In other words, the large yield of cSNSs detected by expanding the sample size mainly consists of NS cSNSs. If one takes into account the number of NS and that of S cSNSs that may occur in the CFTR gene (10 408 and 2921, respectively (see: Materials and methods)) it turns out that

  1) The probability of the NS substitutions of being polymorphic, PPolyNS, is much lower than that of the S substitutions (6/10 408=5.8 × 10−4 vs 6/2921=20.5 × 10−4) (P<0.03), whereas

  2) The estimates of the probabilities of being subpolymorphic, PSubPoly, are equal (PSubPolyNS=39/10 408=37.5 × 10−4; and PSubPolyS=10/2921=34.2 × 10−4; P>0.75); this implies that

  3) For an NS substitution the probability of being subpolymorphic is much higher than its probability of being polymorphic (39/10 408 vs 6/10 408), while this is not the case for the synonymous cSNSs (10/2921 vs 6/2921).
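The comparisons in points 1) and 2) can be checked with a 2 × 2 test on the counts given above. The text does not state which test was used; the sketch below assumes a Pearson chi-square test with one degree of freedom and no continuity correction, and the function name is hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for the table
    [[a, b], [c, d]]. Returns (chi2, p_value)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


N_NS, N_S = 10408, 2921  # possible NS and S substitutions (from the text)

# Polymorphic: 6 of the possible NS vs 6 of the possible S substitutions.
chi2_poly, p_poly = chi2_2x2(6, N_NS - 6, 6, N_S - 6)

# Subpolymorphic: 39 of the possible NS vs 10 of the possible S substitutions.
chi2_sub, p_sub = chi2_2x2(39, N_NS - 39, 10, N_S - 10)

print(p_poly, p_sub)
```

Under this assumption the two P values fall on the same side of the reported thresholds (P<0.03 and P>0.75, respectively). With expected counts as low as about 2.6, an exact test might be preferred for the first comparison.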

Obviously, the mean heterozygosity of the NS substitutions, HmeanNS, turned out to be much lower than that of the S substitutions, HmeanS (HmeanNS=0.65/45=0.014; HmeanS=0.96/16=0.060) (last two columns of Table 1).

Similar findings that PPolyNS < PPolyS were already reported for very many other loci4,5 and the commonly accepted explanation for this is that the majority of the NS cSNSs are deleterious whereas the majority of the S cSNSs are neutral. This explanation is also supported by previous evidence that the relative shortage of NS cSNPs concerns mainly the drastic aa substitutions21 and by the data in Figure 4, showing that the proportions of cSNPs found among the cSNSs detected in the present survey were 6/16, 5/30 and 0/14 in the classes of cSNSs with a ‘score of drasticity’ dr=0 (the synonymous cSNSs), with 0<dr<8 and with 8<dr<24, respectively.

Figure 4

Negative correlation between the ‘score of drasticity’, dr, of the cSNSs and the frequency q of their minor allele. The term ‘score of drasticity’, dr, adopted here for the cSNSs is a figure equal to the ‘score’ shown in the PAM 10 (point accepted mutation) matrix,22 except for the synonymous cSNSs to which, as a first approximation, we assigned a dr=0. The cSNSs 26 and 29 appear to be always associated.

However, another result of this study, which concerns the low range of variation, seems to deserve serious consideration. Although the majority of the NS substitutions are deleterious, their probability of being subpolymorphic turned out to be approximately equal to that of the S substitutions. This finding directly implies that the majority of the rare NS substitutions behave as if they were neutral despite being intrinsically deleterious. Besides having obvious relevance for Medical Genetics, this observation is evolutionarily important since it makes less vague and arbitrary the distinction between ‘common’ (ie polymorphic) and ‘rare’ (ie subpolymorphic) range of variation. In the common variation range, evolution proceeds deterministically, namely, selection overwhelms (or, more likely, overwhelmed) genetic drift; thus all the polymorphic alleles are either neutral or advantageous, at least in the heterozygous state. On the contrary, in the low range of variation evolution proceeds stochastically, namely, selection ceases (or ceased) to be effective so that rare alleles behave(d) as neutral, irrespective of being intrinsically neutral or deleterious. Further data are necessary to clarify if the threshold (more realistically a threshold range of q values) varies (for the position and/or steepness), as it would seem likely, between different genes and populations.

The combined action of recessivity and the ensuing genetic drift appears to be the only reasonable explanation for the ineffectiveness of selection against the NS cSNSs in the subpolymorphic range of variation (PSubPolyNS ≈ PSubPolyS).

Recessivity is implied by the strong frequency-dependence of the ratio nNS/nS (see Table in Figure 3), which virtually excludes an appreciable share of dominant-negative alleles (provided any such alleles actually exist for the CFTR gene) among the deleterious CFTR alleles. This notion is also supported by the well-known formal genetics of CF and by the monomeric structure of the CFTR protein. In other words, for this autosomal gene, a malfunction of a variant allele from a parent can have appreciable adverse phenotypic effects only in the few individuals where also the expression of the gene derived from the other parent is severely impaired.

The role of genetic drift in shaping the pattern of variation of the NS with respect to that of the S cSNSs is a direct consequence of the recessivity of the disadvantage associated with the deleterious alleles. As a matter of fact, for each deleterious allele, the effectiveness of genetic drift in reducing the consequences of its selective disadvantage should be proportional, at least as a first approximation, to the ratio between seqrel (=seq/q, the relative extent of the random fluctuations of its frequency q) and Δqrelsel (=Δq/q, the expected relative decrease of its frequency per generation caused by selection), and recessivity drastically reduces Δqrelsel. In fact, for a deleterious dominant allele Δqrelsel depends only on its severity (for a lethal dominant allele it is 1 by definition, regardless of its frequency and that of the other deleterious alleles); on the contrary, for a deleterious recessive allele it also depends on the cumulative frequency of the other deleterious alleles (lethal or sublethal). If the ratio seqrel/Δqrelsel is very high (because its numerator is high and/or because its denominator is very low) selection is quite ineffective. The numerator is a function of the demographic history of the population; the denominator is proportional to the cumulative frequency of the deleterious alleles (usually in the order of 0.01 or even less); thus its value is very close to 0. The finding that recessive deleterious or even lethal alleles are very numerous (owing to their behavior as neutral alleles, at least in populations that are not extremely large) supports the old and commonly accepted idea of the existence, for many genes, of a reservoir of potentially useful rare variability.
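The ratio between the relative drift fluctuation (seq/q) and the relative selective decrease (Δq/q) can be illustrated numerically. The sketch below rests on assumptions that are not in the text: one generation of binomial (Wright–Fisher) sampling of 2N gene copies for the drift term, and a first-order approximation in which a recessive allele is selected against only when paired with another deleterious allele, so that Δq/q ≈ s × Q, where s is the selection coefficient and Q the cumulative frequency of the deleterious alleles. All names and parameter values are hypothetical.

```python
from math import sqrt


def relative_drift_se(q, n_individuals):
    """seq/q after one generation of binomial sampling of 2N gene copies."""
    return sqrt((1.0 - q) / (2.0 * n_individuals * q))


def relative_selection_decrease(s, dominant, cumulative_q=0.0):
    """First-order Δq/q per generation.

    Dominant allele: every carrier is exposed to selection, so Δq/q ≈ s.
    Recessive allele: only carriers whose second allele is also
    deleterious are exposed, so Δq/q ≈ s * cumulative_q.
    """
    return s if dominant else s * cumulative_q


def drift_to_selection_ratio(q, n_individuals, s, dominant, cumulative_q=0.0):
    """Ratio of relative drift fluctuation to relative selective decrease."""
    return relative_drift_se(q, n_individuals) / relative_selection_decrease(
        s, dominant, cumulative_q
    )


# Illustrative values: a lethal allele (s = 1) at q = 0.001 in a population
# of 10 000 individuals, with cumulative deleterious frequency Q = 0.01.
recessive = drift_to_selection_ratio(0.001, 10_000, 1.0, False, 0.01)
dominant = drift_to_selection_ratio(0.001, 10_000, 1.0, True)
print(recessive, dominant)
```

With these illustrative values the ratio is well above 1 for the recessive allele and below 1 for the dominant one. The two differ by a factor of 1/Q, which is the sense in which recessivity makes selection ineffective against rare alleles.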