Introduction

Cystic fibrosis (CF; OMIM 219700), the most common life-threatening autosomal recessive disorder among Caucasians, is most frequently associated with the first variant discovered in the cystic fibrosis transmembrane conductance regulator gene (CFTR, OMIM 602421; reference sequence accession number NM_000492.3), the well-characterized c.1521_1523delCTT disease-causing variant, known commonly as F508del or p.Phe508del [1]. It accounts for about 70% of CF alleles in Europeans and Europe-derived populations such as Euro-Americans and explains the relatively high birth incidence and prevalence of CF among rare (Mendelian) genetic diseases affecting Caucasians [2]. CF undoubtedly originated in Europe, and there is a decreasing proportion of CF patients with p.(Phe508del) from northwestern to southeastern Europe [3, 4]. Although this European geographic gradient is well established [2, 4], the age of the c.1521_1523delCTT variant has remained uncertain and somewhat controversial. Our direct studies of ancient DNA (aDNA) from Iron Age archeological specimens have shown that the c.1521_1523delCTT variant was definitely present at least 2300 years ago; more specifically, this variant was discovered in 3 of 32 individuals buried around 350 Before the Common Era (BCE) near the Danube River in the vicinity of present-day Vienna, Austria [5]. The use of indirect strategies to estimate the age of this variant, however, has led to published age estimates that range from more than 2100 generations (about 50,000 years) to 3000 years ago [6, 7].

As preconception, prenatal, and neonatal screening for CF has proliferated during the past two decades [8, 9], the many thousands of individuals discovered to be heterozygous for the c.1521_1523delCTT allele have often raised questions about the origin and significance of carrying this mutation, whether in themselves or in their children identified through DNA-based neonatal screening tests [10]. It has not been possible to address their questions and concerns in this regard. Although a heterozygote selective advantage has been suspected [11, 12], and seems likely [13], efforts to identify it have been unsuccessful, despite many hypotheses such as protection from cholera [14], which was later refuted [15]. A major challenge in such research is the limited historical information that can be connected geographically, as has been done for hemoglobin S carriers, i.e., those with “sickle cell trait,” where an evidentiary strategy [16, 17] has convincingly confirmed the “malaria hypothesis” arising from the visionary 1949 report of Pauling et al. [18]. Similarly, a better understanding of when and where the c.1521_1523delCTT variant arose could aid in understanding why it became so frequent. Thus, the initial goal of our project is to gain more knowledge about the age of the c.1521_1523delCTT variant throughout Europe, and thereby potential insights about its dissemination, to provide clues as to a probable heterozygote selective advantage. Consequently, we organized a study of patients and families with this CF-causing allele drawn from eight regions across Europe comprising representative CF populations and also a Wisconsin, USA cohort with predominantly German ancestry. Recognizing that the principal CF-causing variant may have emerged in eastern Europe at a different time in history than in its western regions, we tested the hypothesis that the age of c.1521_1523delCTT varies among different European populations that are geographically dispersed.

Materials and methods

Specimens

To accomplish an age estimation of the principal CF-causing variant, we obtained blood specimens after informed consent from CF patients with the c.1521_1523delCTT variant and their parents. The blood was anticoagulated with EDTA and placed in plastic tubes labeled only with the family number and member (i.e., patient, mother, or father) but with no personal identifiers. In most of the countries, the DNA was extracted promptly in genetics laboratories and stored prior to analysis. In one (Austria), the blood was processed to prepare leukocyte-enriched samples in 2 ml microtubes after erythrocyte lysis and multiple washings. After being frozen, batched leukocyte specimens were shipped overnight on dry ice from Vienna to the Laboratoire de Génétique, Génomique fonctionnelle et Biotechnologies in Brest where DNA extractions were performed. Approvals from ethics committees were obtained at each institution, although some, such as the University of Wisconsin-Madison, designated the project as research exempt based on the investigators using deidentified samples and not returning results for any clinical practice purpose.

Selection of regional populations

We considered several factors in choosing the regions of Europe to investigate. Taking into account the well-established p.(Phe508del) variant gradient [3, 4] described above, our priority was to select countries that have a high proportion of patients with this allele [2] and a European CF center with an interested, cooperative director or geneticist, as well as knowledge of family ancestry and a willingness to provide specimens from the native population. The request to our collaborators was to select families who knew that their ancestors were native inhabitants in the region. However, it was not possible to carry out detailed genealogical studies on each family. Thus, we relied on family ancestry self-identification and the information drawn from collaborating CF centers. In addition, we selected one population of CF patients and their parents in the USA, or more specifically living in eastern Wisconsin, in which the ancestry was predominantly German. Massive immigration from Germany to Wisconsin occurred during the second half of the 19th century for economic reasons and by 1890 led to 626,394 German-Americans residing in Milwaukee and the eastern region of the State to account for the majority of the population there.

Population meeting criteria

A total of 190 CF patients and their parents were included in this study. Of this group, 185 were trios with DNA from both parents, while five were pairs with DNA available from only one parent. All the patients were shown to have the CFTR c.1521_1523delCTT variant; 166 were homozygous and 24 were compound heterozygotes with one c.1521_1523delCTT and one other CF-causing variant. These individuals were sampled from Albanian, Austrian, Czech, Danish, French, Greek, Irish, and Ukrainian populations as listed in Table 1.

Table 1 Origins and characteristics of the different patient populations

Estimation of tMRCA

All individuals were assessed for the same 10 microsatellite markers around the CFTR gene that were selected by Fichou et al. [7] (Table 2). The 10 informative microsatellite DNA sequences, amplified by multiplex PCR, were evaluated using Universal Fluorescent Labeling. Microsatellites were selected using freely available software (zeon.well.ox.ac.uk/git-bin/microsatellite.cgi). Contiguous sequences around the CFTR gene (localized at chromosome 7q31) were successively screened for microsatellite regions. Haplotypes were reconstructed using version 3.3.2 of the Beagle program [19] on both the trios and the pairs with 100 reconstructions. Only the unambiguous haplotypes that were the same over the 100 reconstructions were kept for the analysis. This stringent requirement was established to avoid the inclusion of uncertain haplotypes that could have biased our estimates. There were 272 such non-ambiguous haplotypes coming from 148 independent trios or pairs; 24 haplotypes were from heterozygous p.Phe508del carriers and 248 from homozygote CF patients and their parents.
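The stringency rule described above (keep a haplotype only if every reconstruction yields the same phasing) can be illustrated with a minimal Python sketch. It does not call Beagle; the input layout (one dictionary per reconstruction run, keyed by a hypothetical haplotype identifier), the function name, and the toy allele sizes are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# A phased haplotype is a tuple of microsatellite allele sizes, one per marker.
Haplotype = Tuple[int, ...]
# One reconstruction run maps a (hypothetical) haplotype ID to its phased alleles.
Reconstruction = Dict[str, Haplotype]


def unambiguous_haplotypes(runs: List[Reconstruction]) -> Dict[str, Haplotype]:
    """Keep only haplotypes that are phased identically in every run."""
    if not runs:
        return {}
    kept: Dict[str, Haplotype] = {}
    for hap_id, alleles in runs[0].items():
        # Drop the haplotype if any later run is missing it or phases it differently.
        if all(run.get(hap_id) == alleles for run in runs[1:]):
            kept[hap_id] = alleles
    return kept


if __name__ == "__main__":
    # Toy input: 3 runs instead of 100, with invented allele sizes.
    toy_runs = [
        {"fam1_paternal": (206, 114, 253), "fam2_paternal": (206, 116, 253)},
        {"fam1_paternal": (206, 114, 253), "fam2_paternal": (208, 116, 253)},
        {"fam1_paternal": (206, 114, 253), "fam2_paternal": (206, 116, 253)},
    ]
    print(unambiguous_haplotypes(toy_runs))
```

In this toy input only `fam1_paternal` survives, because `fam2_paternal` is phased differently in the second run.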

Table 2 The genetic markers in the region surrounding the CFTR gene used in this study

The age of the most recent common ancestor (tMRCA) of the c.1521_1523delCTT carriers was estimated in each population from the length of the haplotypes shared by the carriers using the Estiage program [20] under a stepwise mutation model at the different markers (assuming a mutation rate of 10⁻³ per meiosis). This program provides maximum likelihood estimates of the number of generations since the most recent common ancestor using the multilocus marker data from patients. It assumes that all affected patients in the sample descended from a common ancestor who introduced c.1521_1523delCTT n generations ago. An estimate of n is obtained from the size of the haplotype shared by individuals on both sides of the disease locus by finding the most likely positions of recombinations on the ancestral haplotype in the different patients; the value of n is then converted to age in years by multiplying by the assumed 25 years per generation. Calendar dates were computed using 2017 as the reference year. The archeological periods for Europe were assigned by traditional criteria in consultation with archeologists familiar with the European regions we selected as sources of DNA [21, 22].
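The two conversions described above (generations to years, then years before 2017 to a calendar date) are simple arithmetic and can be sketched in Python as follows. The function names are hypothetical, and the calendar step assumes plain subtraction from the reference year with no correction for the absence of a year zero, which is the convention that reproduces the dates reported in this article. The value of 189 generations in the usage note is back-calculated from 4725 years divided by 25 and is used only as an illustration.

```python
YEARS_PER_GENERATION = 25  # generation length assumed in the text
REFERENCE_YEAR = 2017      # reference year used for the computations


def generations_to_years(n_generations: float) -> float:
    """Convert a number of generations to an age in years."""
    return n_generations * YEARS_PER_GENERATION


def years_ago_to_calendar(years_ago: float,
                          reference_year: int = REFERENCE_YEAR) -> str:
    """Express an age in years before the reference year as a BCE/CE date.

    Assumption: simple subtraction, ignoring that there is no year zero.
    """
    year = round(reference_year - years_ago)
    if year > 0:
        return f"{year} CE"
    return f"{-year} BCE"


if __name__ == "__main__":
    age = generations_to_years(189)          # illustrative value: 4725 / 25
    print(age, years_ago_to_calendar(age))   # 4725 years -> 2708 BCE
    print(years_ago_to_calendar(1175))       # 842 CE
```

With these assumptions, 4725 years ago maps to 2708 BCE and 1175 years ago maps to 842 CE, matching the dates given for Ireland and Greece in the Results.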

To avoid a possible underestimation of the age associated with ancestral consanguinity, only one of the two haplotypes was considered in homozygous c.1521_1523delCTT carriers. Allele frequencies at the different markers were obtained by considering the parental haplotypes that were not transmitted to the patients in each population separately. Two hypotheses regarding the ancestral haplotype that carries the c.1521_1523delCTT allele were considered. First, it was reconstructed independently in each population by considering the most frequent allele at each position. Second, it was constrained to be the same in every population, considering the most frequent allele in the entire sample.
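The two ways of reconstructing the ancestral haplotype (per population, or pooled over the entire sample) both reduce to taking the most frequent allele at each marker position. The Python sketch below illustrates that idea only; it is not the Estiage implementation, and the function name, population labels, and three-marker toy haplotypes are assumptions introduced for the example.

```python
from collections import Counter
from typing import Sequence, Tuple


def consensus_haplotype(haplotypes: Sequence[Sequence[int]]) -> Tuple[int, ...]:
    """Return the most frequent allele at each marker position."""
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    n_markers = len(haplotypes[0])
    if any(len(h) != n_markers for h in haplotypes):
        raise ValueError("all haplotypes must cover the same markers")
    # zip(*haplotypes) walks through the data marker by marker (column-wise).
    # On a tie, Counter.most_common keeps the allele encountered first.
    return tuple(Counter(column).most_common(1)[0][0]
                 for column in zip(*haplotypes))


if __name__ == "__main__":
    # Invented carrier haplotypes for two hypothetical populations.
    carriers = {
        "pop_A": [(206, 114, 253), (206, 116, 253), (208, 114, 253)],
        "pop_B": [(206, 116, 251), (206, 116, 253), (204, 116, 253)],
    }
    # First hypothesis: one ancestral haplotype per population.
    for name, haps in carriers.items():
        print(name, consensus_haplotype(haps))
    # Second hypothesis: a single ancestral haplotype for the pooled sample.
    pooled = [h for haps in carriers.values() for h in haps]
    print("pooled", consensus_haplotype(pooled))
```

In the toy data the per-population consensus differs at the second marker, while pooling forces a single haplotype on both populations, which mirrors the contrast between the two hypotheses.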

Results

Haplotype studies

A total of 148 haplotypes carrying c.1521_1523delCTT were considered for age estimation. Only one haplotype was kept per family. Among these 148 haplotypes, 137 were different, but most (95.9%) of them shared a common allele (allele 253) at marker M09 located in intron 1 of the CFTR gene (Supplementary Table 1). This allele is the same as the arbitrarily termed “allele 256” in the previous study reported by Fichou et al. [7] for the Breton population. In each population, a different ancestral haplotype was imputed based on the data (Table 3), except in Greece and Albania, where the same ancestral haplotype was found.

Table 3 Age estimates of the CFTR p.Phe508del in the different populations

Time to the most recent common ancestor

Table 3 shows the estimated tMRCA values with their 95% confidence intervals (CIs) for each population group. The age estimates were quite different between western and eastern populations. In the former, the mean age estimates of the most recent common ancestors range from 4600 (both France and Denmark) to 4725 years ago (Ireland). It was of special interest to find that the population of predominantly German ancestry sampled in the USA was close to the northwestern European populations at 4625 years ago. In the two populations from the southeast of Europe (Greece and Albania), p.Phe508del must have been introduced much more recently based on our data revealing mean tMRCA values of 1175–1300 years ago with 95% CI ranges that do not overlap with the results obtained in western European populations. Compared with these two groups, the tMRCA values obtained in families residing in the two central European countries, the Czech Republic and Austria, were intermediate at 3200 and 3575 years ago, respectively, but the 95% CIs overlapped. In the Ukrainian group, our estimate is closest to the southeastern countries at 2150 years ago, but the sample size is comparatively small because only five trios met our criteria for inclusion, i.e., were kept for the Estiage analysis.
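The comparison of 95% CI ranges used above amounts to an interval-overlap check, sketched here in Python. The function name is hypothetical, and the interval bounds are arbitrary placeholders chosen for illustration; they are not values from Table 3. Non-overlap of two 95% CIs is a conservative indication of a difference, whereas overlapping intervals do not by themselves show that two estimates are equivalent.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """Return True if two closed intervals share at least one point."""
    a_low, a_high = a
    b_low, b_high = b
    return a_low <= b_high and b_low <= a_high


if __name__ == "__main__":
    # Placeholder intervals in arbitrary units, NOT data from this study.
    group_1 = (10.0, 20.0)
    group_2 = (18.0, 30.0)
    group_3 = (25.0, 40.0)
    print(intervals_overlap(group_1, group_2))  # overlapping ranges
    print(intervals_overlap(group_1, group_3))  # disjoint ranges
```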

The c.1521_1523delCTT variants found in the various populations investigated were most often carried on the same haplotype, namely microsatellite haplotype 206-114-199-276-300-253-p.Phe508del-227-305-287-240. Assuming there is a common ancestral haplotype, as other data also suggest [23], and constraining the analysis to this same haplotype in every population, we repeated the Estiage analysis with the most frequent allele. Supplementary Table 2 lists the estimates for each population and reveals the same trends described above but with wider 95% confidence limits.

Conversion of tMRCA values to chronologic archeological periods

Table 4 converts the tMRCA values of Table 3 to calendar years on an absolute chronological scale and also shows the archeological period. Three categories were evident: (1) Ireland = 2708 BCE, derivative predominantly German population in the USA (Wisconsin) = 2608 BCE, France = 2583 BCE, Denmark = 2583 BCE [none of these are statistically different]; (2) Austria = 1558 BCE and the Czech Republic = 1183 BCE [intermediate]; and (3) Albania = 717 CE and Greece = 842 CE [significantly different from the western populations with no overlap of the 95% CI values]. The mean values for the first category represent the Early Bronze Age, while the central European populations dated to the Middle and Late Bronze Ages. Lastly, the p.(Phe508del) variant was likely introduced into the southeastern population during and/or after the Roman Imperial Era of 31 BCE to 476 CE.

Table 4 Year of the most recent common ancestor with 95% CI values and probable archeological era of distribution

Data repository

These results are available in a public database that can be accessed through the European Nucleotide Archive (https://www.ebi.ac.uk/ena) using study accession number PRJEB27683 and study unique name ena-STUDY-INSERM UMR1078-09-07-2018-20:39:26:525-2156.

Discussion

As part of a larger investigation entitled “The Ancient Origin of Cystic Fibrosis,” we designed this project to gain a better understanding of when the p.(Phe508del) variant may have first arisen in Europe, i.e., an estimate of its age, by examining more geographically distributed and distinct trio populations than in previous studies [6, 7] focused on this issue. Although the populations sampled may not be truly pan-European, they represent regions separated by ~2000 miles, i.e., from Ireland to Athens, Greece. In addition, we selected European CF patients and their parents from regions in which geographic and thus genetic distances were previously documented by others [24, 25] using genome-wide high-density microarray data. In doing so, we hoped to shed light on where the p.(Phe508del) variant may have arisen and its pattern of dissemination. Addressing these issues, we reasoned, might in turn provide insights about the presumed p.(Phe508del) heterozygote selective advantage [13] or at least help guide future studies, as occurred with geographical and historical/temporal evidence [16, 17] confirming the “malaria hypothesis” regarding hemoglobin S heterozygosity. To elucidate the origin and explain the frequency of the p.(Phe508del) CFTR variant, we believe that it would be necessary to answer three “W questions,” namely when, where, and then why the p.(Phe508del) allele became so prevalent in northern European populations and their descendants.

Our results revealed mean tMRCA values ranging from 4725 to 1175 years ago and support the estimates of Serre et al. (3000–6000 years ago) [11] rather than those of Morral et al. (52,000 years ago) [6]; the latter figure was challenged by Kaplan et al. [26] because of disagreement with the assumptions used in the calculations. In addition, the tMRCA values from western European regions reported herein refine the results of Fichou et al. [7] from a study of Breton CF patients in which the Estiage analysis suggested that the most recent common ancestor lived 115 generations ago. That tMRCA value, however, may have underestimated the age of p.(Phe508del) in Brittany due to consideration of all the haplotypes, even those that were reconstructed with ambiguities, as well as a potential bias associated with consanguinity due to including both haplotypes in homozygous families. In the more stringent Estiage analyses reported herein, those potential biases were avoided for all populations, leading to estimates of the oldest tMRCA values corresponding to the Early Bronze Age in western Europe, which is generally agreed to begin around 3000 BCE. This finding extends our results from a direct investigation of aDNA in teeth from Iron Age burials near Vienna around 350 BCE and allows us to conclude that p.(Phe508del) was present in that region long before then. More specifically, in the Austrian families studied, the Estiage data revealed a mean tMRCA value of 3575 years ago, which converts to 1558 BCE (Middle Bronze Age) [22].

Perhaps most remarkably, the estimated ages of p.(Phe508del) in the three western European regions (France, Ireland, and Denmark) were similar with closely overlapping 95% CI values. This observation is also in line with previously documented spatial autocorrelograms expressing genetic and geographical distance for these populations [24]. Such data provide more insight about the ancient origin of CF in our judgment—both when and where—and lead us to propose that CFTR p.(Phe508del) is derived from ancestors who lived in western Europe during the Bronze Age, as early as 2700 BCE, and that its relatively rapid dissemination occurred because of human migrations around the northwestern Atlantic trading routes [21] and then towards central and eastern Europe [22]. Diffusion from northwestern to central Europe in approximately 1000 years is consistent with the prominent Bronze Age migrations evident in the archeological record [21, 22] and from genomic studies of aDNA [27]. On the other hand, we are assuming a discrete origin of the principal CF-causing variant, but it is possible that p.(Phe508del) arose more than once or earlier, and then reached western Europe subsequently through Neolithic migrations.

Considering potential explanations from archeological evidence regarding prehistoric settlements and migrations, and based on opinions from consulted European archeologists, we believe that the most likely phenomenon of Bronze Age human activities that could account for our CFTR p.(Phe508del) tMRCA observations is the Bell Beaker culture [22, 28, 29, 30]. Prehistorians have concluded that Bell Beaker folk appeared at the transition from the Late Neolithic period to the Early Bronze Age during the third millennium BCE somewhere in western Europe [22], although the exact region is uncertain [29]. They were distinguished by their ceramic beakers, pioneering metallurgy north of the Alps, and great mobility [30, 31]. Over ~1000 years, a network of small families and/or elite tribes spread their culture from west to east throughout western Europe and into regions that correspond closely to the present-day European Union, where the highest incidence of CF is found [32]. More specifically, their distinctive Bell Beaker pottery appeared and spread across western and central Europe beginning around 3000–2750 BCE and then disappeared between 2200 and 1800 BCE [22, 29]. Their migrations are linked to the advent of western and central European metallurgy, as they manufactured and traded metal goods, especially weapons, while traveling over long distances [30]. Most relevant to our study is the evidence that they migrated in a direction and over a time period that fits well with the pattern of tMRCA data we found for the p.(Phe508del) variant. Olalde et al. [29] have shown that both migration and cultural transmission played a major role in diffusion of the “Beaker Complex” and led to a “profound demographic transformation” of Britain after 2400 BCE. Moreover, the cultural elements that unite the widely distributed Beaker folk are so obvious that some have considered them a distinct ethnicity of Bronze Age people [33].

From our results, we propose the novel concept that large-scale, long-term west-to-east migrations of the Bell Beaker Europeans [22, 28, 29, 30] during the Bronze Age could explain the dissemination of p.(Phe508del) in Europe and its documented northwest-to-southeast gradient [4]. In fact, our tMRCA data also show a temporal gradient. Determining when the p.(Phe508del) variant was first introduced in Europe and discovering where it arose should provide new insights about the high prevalence of p.(Phe508del) heterozygotes. For instance, Bronze Age Europeans migrated extensively and apparently were not exposed to endemic infectious diseases or epidemics; thus, microbial-related selection as in sickle hemoglobin seems unlikely [34]. As more information on Bronze Age people and their practices during migrations [21, 22] becomes available through archeological and aDNA genomics research [29], more clues about selection factors should emerge.