Introduction

An ad hoc mining of the historical record can lead to a spurious association of any finding in human population genetics with any historical episode that could potentially explain it.1 In this context, the hypothesis that Basques represent an ancestral Paleolithic European population, already put forward on linguistic grounds in the 19th century, has been used as a recurrent explanation in a number of early population-genetic studies.2, 3, 4 Currently, most linguists agree that the Basque language (Euskera) should be considered pre-Roman and pre-Indo-European, with no robust phylogenetic relationships. Consequently, its origins are lower-bounded in the second millennium BC. Linguistically, any other hypothesis positing a more ancient origin for Basque cannot be proven with the currently available scientific methodology.5 However, the idea of a Basque gene pool that shows little influence from either the Neolithic or later population flows has spread through the literature as a circular argument that has led to the use of the Basque population as the representative gene pool of the first modern human settlers of Europe.6, 7, 8 Thus, with the general aim of investigating the evolutionary history of the Basques, we studied a large Basque sample by means of high-resolution compound Y-chromosome haplotypes and analyzed them in the context of the available European, North African and Near Eastern Y-chromosome data from the literature. The Y chromosome offers a series of advantages for these purposes. In particular, the nonrecombining nature of the Y chromosome facilitates the inference of compound haplotypes made up of slowly evolving single-nucleotide polymorphisms (SNPs) and more quickly evolving short tandem repeat (STR) loci, which makes it possible to study the Y phylogeny at different resolutions and thus at different time scales.

Materials and methods

Population samples

Y chromosomes from 168 unrelated Basque donors were used for the study: 72 from the province of Biscay, 74 from Gipuzkoa and 22 (Other Basques) from the Alava and Navarre provinces. Donors had at least four generations of ancestry in the Basque Country (recorded by Basque surnames), and within each of our three Basque samples (Biscay, Gipuzkoa and Others) all the grandparents of the donors were born in the same province. We also included in this study 459 non-Basque Iberians from diverse localities that had been partially genotyped previously9 and which have been further genotyped in this work along with an additional set of 233 non-Basque Iberians, and 75 North-African Berbers. In addition, data from 39 European and Near-Eastern populations were compiled from the literature for comparative purposes (Supplementary Information 1). Among these, two other Basque samples were included: on the one hand, a Basque sample representing a general sample from Gipuzkoa10, 11 (E Bosch, pers. comm.), and on the other hand, the Basque sample of Brión et al,12 whose precise geographical origin is unknown. Finally, some data (33 French from Normandy and 53 Georgians) correspond to unpublished work (JM Larruga, pers. comm.) and have been kindly submitted to us for comparative purposes.

Biallelic markers

A total of 45 binary markers have been analyzed. First, we genotyped nine genealogically basal markers in all individuals (SRY10831.1, YAP, M89, P2, M9, M201, M170, 12f2 and 92R7) and the remaining markers (M2, M12, M13, M20, M26, M34, M52, M65, M67, M70, M78, M81, M92, M107, M122, M123, M124, M148, M153, M163, M165, M166, M172, M173, M175, M178, M207, M224, M269, M342, M377, P15, P16, SRY10831.2, SRY2627 and Tat) were genotyped following the hierarchy of the genealogy.13 Markers were genotyped as in previous works9, 14 or by PCR-RFLP methods developed in this work. The M12, M20, M65, M92, M107, M122, M148, M163, M165, M207, M269 and M377 (PA Underhill, pers. comm.) markers were amplified using previously published primers10, 15, 16 and their allelic states diagnosed by means of the restriction enzymes NdeII, SspI, HinfI, HpyCH4IV, NlaIII, MaeIII, MnlI, RsaI, DraI, MvaI and PstI, respectively. Mismatched primers to create RFLPs were designed for M166 (Reverse 5′-CAGCGAATTAGATTTTCTTG-3′, digested with BsrDI), M224 (Reverse 5′-TGAAATATTTGGAAGGGCTGAA-3′, digested with AcuI), M342 (Forward 5′-GTTAAATTATGACTTACGGGCA-3′, digested with Bsp1286I), P15 (Forward 5′-TGCTTGAGGTTCTGAATCATA-3′, digested with NdeI) and P16 (Reverse 5′-CCTGTCAATATTCCTGTTAAT-3′, digested with Tsp509I). Markers M124 and M175 were genotyped with published primers10 by sequencing both strands (BigDye Terminator kit v.3.1) using an ABI Prism 310 Genetic Analyzer (Applied Biosystems, Foster City, CA, USA). Haplogroups were identified by following standardized nomenclature guidelines13, 17 (Figure 1). 
To increase the size of some population samples, we pooled samples of the same population that had been reported in different references at different levels of genealogical resolution (SNP coverage). The frequency data described for a marker in the sample with less genealogical resolution were subdivided according to the subgroup proportions observed in the more genealogically resolved sample of the same population. Both samples were then grouped together and hereafter considered as a single sample. Before applying this approach to a set of samples of the same population, congruency tests were carried out at the lowest common resolution level to check that the samples did not differ at this level. Only those samples that did not differ were subdivided at the next level of resolution determined by the more resolved sample.
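The pooling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function names and the example counts are hypothetical, and fractional counts are kept rather than rounded.

```python
def subdivide_counts(coarse_count, resolved_subgroup_counts):
    """Split the count of a coarsely typed haplogroup into subgroups.

    coarse_count: chromosomes assigned to the parent haplogroup in the
        less resolved sample.
    resolved_subgroup_counts: subgroup name -> count observed in the more
        resolved sample of the same population.
    The split follows the subgroup proportions of the resolved sample.
    """
    total = sum(resolved_subgroup_counts.values())
    if total == 0:
        raise ValueError("resolved sample has no chromosomes in this haplogroup")
    return {
        name: coarse_count * count / total
        for name, count in resolved_subgroup_counts.items()
    }


def pool_samples(coarse_count, resolved_subgroup_counts):
    """Add the subdivided coarse counts to the resolved counts."""
    split = subdivide_counts(coarse_count, resolved_subgroup_counts)
    return {
        name: resolved_subgroup_counts[name] + split[name]
        for name in resolved_subgroup_counts
    }


# Hypothetical counts: 20 chromosomes typed only to the parent haplogroup,
# and a resolved sample with 6 and 2 chromosomes in two subgroups.
resolved = {"subgroup_A": 6, "subgroup_B": 2}
print(subdivide_counts(20, resolved))
print(pool_samples(20, resolved))
```

In practice the split would only be applied after the congruency test at the lowest common resolution level has shown that the two samples do not differ.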

Figure 1

Genealogical relationships, nomenclature and frequencies of the Y chromosome haplogroups. Only informative markers are included. The status of the underlined marker was inferred. Markers M13, M20, M52, M65, M107, M122, M124, M148, M163, M165, M175, M224, M342 and M377 were also genotyped but not observed. BER: Berbers; BIS: Biscay Basques; GIP: Gipuzkoa Basques; OTH: Other Basques; IBE: non-Basque Iberians. Hg: Haplogroup.

STR markers

Our Basque sample had been previously typed for 11 Y-chromosome STRs,18 which made it possible to compare the STR variability within haplogroups both among our samples and among other North African, Iberian and European populations. In order to include the largest number of populations, only the following five STRs were included in the present analysis: DYS19, DYS390, DYS391, DYS392 and DYS393 (Supplementary Information 2). We used this particular order for the STR-haplotype nomenclature. In addition, a subset of Iberian and Berber samples belonging to the relatively well represented European haplogroups R1b3*-M269, R1b3d-M153 and R1b3f-SRY2627 were genotyped for the above set of five STRs. Five-STR haplotypes from the same haplogroups, characterized in samples of diverse European and Near Eastern origin, were also collected for comparative purposes. For R1b3d-M153, six chromosomes were included in the analysis, five of which (Iberians) were genotyped in this work. For R1*(xR1a,R1b3f)-M173, a total of 3087 chromosomes were considered. In addition to the data from the literature, 27 R1*(xR1a,R1b3f)-M173 non-Basque Iberians and five R1*(xR1a,R1b3f)-M173 North Africans were genotyped in this work. Finally, a total of 57 chromosomes were analyzed for R1b3f-SRY2627. In addition to the data from the literature, one European (French) R1b3f-SRY2627 sample was genotyped in this work.

Statistical analysis

Genetic diversities (Nei's heterozygosity, h, and the mean number of pairwise differences between five STR-loci haplotypes) were computed with ARLEQUIN 2.000.19 Pairwise differences in h were tested for significance with a Bayesian approach by means of the TEST_h_DIFF program (http://www.ucl.ac.uk/tcga/software/index.html) under the MATLAB v.6.5 environment (The MathWorks Inc.). Reynolds FST genetic distances20 between populations were also computed from haplogroup or haplotype frequencies by means of ARLEQUIN 2.000. Multidimensional scaling (MDS) was used to represent genetic distances in two-dimensional space using SPSS ver. 11.5.1 (SPSS Inc.). To estimate the time to most recent common ancestor (TMRCA) of some subtrees we used Batwing,21 assuming either constant population size or exponential growth (α=0.005/generation) and gamma priors for θ and ω. The mutation rates used were those described22 for the specific loci used herein, except for the two loci with mean mutation rates of zero, for which, in order to be conservative, we used the (higher) average rate described for Y-chromosome STR loci.23 The locus-specific mean mutation rates22 are higher (and therefore will provide younger age estimates) than the 95% upper limit of the generic rate23 (mean+2SD: 1.8 × 10⁻³). Generation time was assumed to be 25 years.
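For orientation, the two diversity measures named above can be sketched in a few lines. This is an illustrative sketch, not ARLEQUIN's implementation: it uses the standard unbiased estimator h = n/(n−1) × (1 − Σp²) and, as an assumption, counts the number of loci at which two haplotypes differ (a stepwise distance on repeat counts would be an alternative). The function names and the toy haplotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def nei_diversity(haplotypes):
    """Unbiased gene diversity h = n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two chromosomes are needed")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


def mean_pairwise_differences(haplotypes):
    """Mean number of loci that differ between all pairs of haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        raise ValueError("at least two chromosomes are needed")
    diffs = [sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs]
    return sum(diffs) / len(pairs)


# Toy sample of five-STR haplotypes in the order
# DYS19, DYS390, DYS391, DYS392, DYS393 (values are illustrative only).
sample = [(14, 24, 11, 13, 13)] * 3 + [(14, 23, 11, 13, 13)]
print(nei_diversity(sample))              # 0.5
print(mean_pairwise_differences(sample))  # 0.5
```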

Results

The Basque Y-chromosome gene pool

The majority (about 86%) of the Basque Y chromosomes belong to haplogroup R1*(xR1a,R1b3f)-M173, of which R1b3*-M269 accounts for 88% (Figure 1). As this haplogroup is also the most abundant type throughout Western Europe,16 it places the Basque Y chromosomes within the European landscape. This predominance is reflected in the low diversity values for the Basque populations (Table 1). Within R1b3*-M269, Basques also show a reduced STR diversity (Table 1). Thus, compared for instance with the non-Basque Iberians, the average number of mutations in our sampled Basques is significantly lower (Mann–Whitney U-test, P=0.009).

Table 1 Diversity values for the populations considered

The presence of R1b3b-M65, a possibly autochthonous Basque branch of R, was not confirmed in our Basque samples. Instead, we did detect in our Basque sample the putative Iberian markers R1b3d-M153 and R1b3f-SRY2627, although at lower frequencies than in earlier analyses on Basques11, 24 (Figure 1). Thus, R1b3d-M153 shows a frequency in our total Basque sample of 7.1%, a figure higher than the corresponding frequency in non-Basque Iberians (0.9%). R1b3f-SRY2627 shows, in our Basque sample, a frequency of 2.4% (5.2% in Iberians). This haplogroup is considered to be of Iberian origin, as the highest frequencies and diversities for R1b3f-SRY2627 have been described in the Mediterranean area of the Iberian Peninsula.24, 25

The existence of a minor NW African male component (1.2%) in Basques is confirmed by the presence of the African lineages E3*-P2 and E3b2-M81. The latter is the most widespread haplogroup of the E cluster in Iberia, reaching its highest frequencies both in southern and northern parts of the Iberian Peninsula.9, 26

Relationships among the Basque samples

To explore the hypothesis that the Basques might not constitute a homogeneous population,27, 28 we tested the pairwise FST values among the individual Basque samples. For the binary markers, after Bonferroni correction (α=0.008), the hypothesis of a single Basque population is rejected (Supplementary Information 3). Sequential testing rendered the most inclusive three-sample group formed by Biscay, Gipuzkoa-2 and Other Basques (hereafter referred to as ‘pooled Basques’). After Bonferroni correction of the pairwise comparisons between Gipuzkoa-1 and the samples forming this group (α=0.0167), Gipuzkoa-1 remained significantly different. However, the case for structure among Basques weakens when all the Basque samples are compared for their five-STR haplotype composition within the main haplogroup R1*(xR1a,R1b3f)-M173 (Supplementary Information 3). The modal haplotype is the same in the five samples (Biscay, Gipuzkoa-1, Gipuzkoa-2, Other Basques and the Basque sample from Brión et al12), and the number of different haplotypes is highest in Biscay (18/61). After Bonferroni correction (α=0.005) (Supplementary Information 3), all Basque samples showed nonsignificant differences for the five-STR haplotypes within the haplogroup. Therefore, the above differences between the samples from Gipuzkoa do not appear to result from a different internal composition within R1*(xR1a,R1b3f)-M173, but from the higher proportion of this lineage within our Gipuzkoa-1 sample. Thus, while these data cannot strictly be taken as proof of genetic structure within Basques,29 they do warn against taking any single Basque sample as representative of the Basques.
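The three corrected thresholds quoted above are consistent with a family-wise α of 0.05 divided by the number of comparisons (six pairs among four samples, three comparisons against Gipuzkoa-1, ten pairs among five samples). The family-wise level and the comparison counts are our inference from the reported values, not stated explicitly; the sketch below, with hypothetical function names, reproduces the arithmetic.

```python
from math import comb


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def pairwise_alpha(n_samples, alpha=0.05):
    """Threshold when every pair among n_samples is compared once."""
    return bonferroni_alpha(comb(n_samples, 2), alpha)


print(round(pairwise_alpha(4), 3))    # four samples, six pairs: 0.008
print(round(bonferroni_alpha(3), 4))  # three comparisons: 0.0167
print(round(pairwise_alpha(5), 3))    # five samples, ten pairs: 0.005
```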

Relationships between the Basque and other samples

It has been suggested that the British Celtic populations and the Basques are derived from common paternal ancestors and that genetic drift in these populations has not been sufficiently great to differentiate them.6 In this regard, for haplogroups, the pooled Basques are more diverse than the samples from Ireland (P<0.0001), Wales (P<0.0001) and Scotland (P=0.04), while the Gipuzkoa-1 sample does not show significant differences with these populations (P-values 0.88, 0.94, 0.054, respectively). In this context, pairwise comparisons (FST values) of the Basque samples with other European populations based on haplogroup frequencies show that Gipuzkoa-1 has its closest affinities with the Irish and Welsh (Supplementary Information 3). These similarities can be explained in terms of the R1*(xR1a,R1b3f)-M173 frequencies, which are highest in Gipuzkoa-1 (0.84) and Ireland (0.83). The pooled Basques (excluding Gipuzkoa-1) showed significant FST values with all the populations.

Within Western Europe, the low Basque haplogroup diversity stands out when compared to their geographical neighbors. Thus, for haplogroups, both the pooled Basque sample and the Gipuzkoa-1 sample are less diverse than the non-Basque Iberians (P=0.0004 and <0.0001, respectively). The overall landscape of haplotypic diversity within R1*(xR1a,R1b3f)-M173 (Figure 2) confirms that Basques are the least diverse of all populations. Basques, who share with Iberians and Italians the same Atlantic modal haplotype (14,24,11,13,13),6 show an outlier position (Figure 3), in agreement with their low diversity values.

Figure 2

Map of the frequency distribution of the R1*(xR1a,R1b3f)-M173 STR loci haplotypes in Europeans, Near East and North Africa. For clarity, only those haplotypes with a frequency above 2% are indicated. Populations: 1: Armenians (n=238); 2: Turkish (n=90); 3: Italians (n=20); 4: Berbers (n=23); 5: Non-Basque Iberians (n=437); 6: All Basques (n=209); 7: Croatians (n=34); 8: Austrians (n=42); 9: Germans (n=37); 10: Belgians (n=31); 11: Friesians (n=52); 12: Danes (n=77); 13: Norwegians (n=113); 14: Welsh (n=244); 15: Irish (n=285); 16: English (n=799); 17: Scottish (n=370); 18: Icelanders (n=75). Haplotype nomenclature refers to alleles of loci DYS19, DYS390, DYS391, DYS392 and DYS393 in that order.

Figure 3

Multidimensional Scaling Analysis of the R1*(xR1a,R1b3f)-M173 STR loci haplotypes in Europeans, Near East and North Africa. ARM: Armenians; AUS: Austrians; BAS: All Basques grouped; BEL: Belgians; BER: Berbers; DAN: Danes; ENG: English; FRI: Friesians; GER: Germans; IBE: Iberians; IRE: Irish; ITA: Italians; NWG: Norwegians; SCO: Scottish; TUR: Turkish; WAL: Welsh; CRO: Croatians; ICE: Icelanders. Samples typed in this work are indicated as follows: black circles: Basques; gray circles: Berbers; black triangle: Iberians.

Among the Basques, the Gipuzkoa-1 sample is the least diverse (Table 1). However, in this case, FST comparisons between the global Basque population and the rest of the populations (using the five STR-loci variability within R1*(xR1a,R1b3f)-M173) (Supplementary Information 3) do not show special affinities between the Basques and the Irish or Welsh. Similarly, the diversity of Basques for the R1*(xR1a,R1b3f)-M173 associated STR haplotypes is significantly different from that of Iberians, Irish, Welsh and Scottish (all P<0.001), whose diversity values are among the highest.

This contrasting pattern between haplogroup diversity and the R1*(xR1a,R1b3f)-M173 associated STR diversity in Basques, on the one hand, and in the Irish and Welsh, on the other, can be observed graphically in Figure 4. Thus, while Basques show a reduction in STR diversity proportional to their low haplogroup diversity values, the British populations in general, and the Welsh and Irish in particular, show a ‘saturation’ (steady-state) level of STR haplotype diversity despite their low binary diversity.

Figure 4

Haplogroup h (x-axis) vs R1*(xR1a,R1b3f)-M173 STR loci haplotype diversity (y-axis). We chose R1*(xR1a,R1b3f)-M173 because it is the major haplogroup in Basques and the British populations. For clarity, not all population names are indicated. Black circles: British populations; gray circles: Basque populations; empty circles: rest of the populations. The graph includes a linear regression line.

The age of the Basques

One approach to inferring the age of a specific population is to estimate the age of haplogroups that originated within that geographical area. As we have not detected R1b3b-M65 in Basques, the best candidates left are R1b3d-M153 and R1b3f-SRY2627. Given that R1b3f-SRY2627 has higher STR haplotype h in Iberians (0.83±0.06) than in Basques (0.73±0.08), although this difference is not significant (P=0.11), and that the average number of mutations is also higher in Iberians (1.7) than in Basques (1.3), the most plausible hypothesis is that this haplogroup originated in Iberians (Figure 5). R1b3d-M153 STR-haplotype diversity is not significantly different between Basques (h=0.66±0.13) and non-Basque Iberians (h=0.60±0.23) (P=0.6), and the number of different lineages in M153 Basques (seven out of 17 total) is similar to that in Iberians (three out of six total). However, the average pairwise mutational difference is twice as high in Basques (1.4 vs 0.7 in Iberians) (Figure 5). This could be indicative of a Basque origin for this haplogroup, particularly given that the sample size of Iberians is four times larger (Figure 1) and more geographically widespread. The fact that this haplogroup is absent in the sample ‘Other Basques’ does not contradict this point: from the binomial distribution, even if the real frequency of this haplogroup in that population were as high as 12%, we could still score zero observations of this haplogroup in a sample of 22 individuals with P=0.05. On the other hand, even in the less likely scenario that the mutation originated somewhere else and was introduced into the Basque Country by migration, from the statistical point of view the introduction of an allele by migration and the introduction of an allele by mutation are equivalent concepts. However, given the low frequency of this haplogroup outside the Basque Country, we favor a Basque origin for this haplogroup.
Alternatively, R1b3d-M153 may have been present in a common ancestral population, rising to relatively higher frequency in Basques through genetic drift. However, this scenario would also have led to a reduction in STR diversity of R1b3d-M153 in Basques, which is not supported by our data.
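The binomial argument for the ‘Other Basques’ sample can be checked directly: the probability of observing no carriers among n chromosomes when the true frequency is f is (1 − f)^n. The sketch below uses hypothetical function names; it gives a probability of about 0.06 for f = 0.12 and n = 22, and a frequency of about 12.7% at exactly P = 0.05, in line with the rounded figures quoted in the text.

```python
def prob_zero_observations(freq, n):
    """Binomial probability of seeing no carriers in a sample of n."""
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie between 0 and 1")
    return (1.0 - freq) ** n


def max_freq_with_zero(n, alpha=0.05):
    """Largest true frequency still compatible with zero carriers at level alpha."""
    return 1.0 - alpha ** (1.0 / n)


print(prob_zero_observations(0.12, 22))  # about 0.06
print(max_freq_with_zero(22))            # about 0.127
```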

Figure 5

Networks representing the diversity patterns for the Basque and other populations' STR haplotypes within defined haplogroups. (a) STR haplotypes within R1b3f-SRY2627, (b) within R1b3d-M153 and (c) within R1b3*-M269. Black circles represent Basque haplotypes and white circles, non-Basque Iberian haplotypes. In (a) and (b), gray circles represent European haplotypes, but in (c) they represent Berber haplotypes. Circle areas are proportional to frequency. Haplotype relationships were obtained by sequentially applying the reduced median and median-joining methods implemented in Network 4.0.36

Thus, under this assumption, the TMRCA of R1b3d-M153 could be taken as a lower bound for the age of the Basque population. Batwing simulations, using parameters obtained as described in Figure 6, indicate that the ages range between 17 900 (10 700–26 500) years, assuming exponential growth, and 21 300 (8500–51 000) years, under constant population size. Therefore, these ages indicate that this population, or at least some of its Y chromosome lineages, dates back to pre-Neolithic times. This estimate is supported by inferences made using highly variable autosomal minisatellite loci.30
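Because the ages are reported in years, it can help to see them in generations under the 25-year generation time assumed in the Statistical analysis section. The helper names below are illustrative and are not part of Batwing; the conversion is a simple rescaling.

```python
GENERATION_TIME_YEARS = 25  # generation time assumed in the text


def years_to_generations(years, generation_time=GENERATION_TIME_YEARS):
    """Convert an age in years into a number of generations."""
    return years / generation_time


def generations_to_years(generations, generation_time=GENERATION_TIME_YEARS):
    """Convert a number of generations into an age in years."""
    return generations * generation_time


# Point estimates quoted in the text.
print(years_to_generations(17900))  # exponential growth: 716.0 generations
print(years_to_generations(21300))  # constant size: 852.0 generations
```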

Figure 6

Expected diversity distributions using coalescent simulations. To determine suitable estimates of the effective population size and population growth coefficient for input in Batwing, coalescent simulations were run by means of Simcoal2. The evolution of five-STR haplotypes was simulated using the mutation rates described22 (see main text). For those loci with an average mutation rate of zero, the average Y-chromosome STR rate of 0.0007 per generation was used.23 Simulations assumed a stepwise mutation model with a geometric parameter of 0.5 and a range constraint of 25. Simulations indicate that a constant effective population size of 1000, or a population of effective size 5000 growing at an exponential rate of 0.005 per generation, can satisfactorily explain the diversity levels observed in the Basque samples. These parameter estimates were later used to estimate the age of the R1b3d-M153 lineage with Batwing. In this case, the effective population size was reduced proportionally to the frequency of this haplogroup (approx. 1/10). Continuous lines: simulations with constant population size; dashed lines: simulations with an exponential growth rate of 0.005/generation. Black lines: Ne=5000; gray lines: Ne=1000. White circles on the x-axis: Basque samples (from left to right: Gipuzkoa-1, Other Basques, Biscay, Gipuzkoa-2, Brión et al12 Basques); gray circles: British samples (the circle on the left represents any of the English, Scottish or Welsh samples; the circle on the right represents the Irish sample); black circle: Iberian sample.

Discussion

Combined haplotype information from slowly evolving (SNPs) and more quickly evolving (STRs) markers can be used to reveal greater detail about the demographic and evolutionary processes that have played a role in the history of human populations. However, a note of caution must be added: as with mtDNA, we are focusing here on just a single locus, which may not have been immune to the effects of selection. The analysis at the haplogroup level in the set of (mainly) European populations shows a marked drop in diversity in the Basque populations. This happens despite Basques having been represented in the Y-chromosome SNP-discovery panel of samples, a fact that, through ascertainment bias, may inflate the apparent diversity of the populations in the discovery panel. The low observed diversity causes a certain affinity between the Basques and the populations of the British Isles, particularly the Irish and Welsh. However, this may simply be the effect of convergent drift, as when we consider the STR haplotypes within the major haplogroup (R1b) the latter populations are no more closely connected to the Basques than other European populations. In particular, the Irish and Welsh show much higher diversities within R1b than Basques (and Gipuzkoa-1 in particular). The high associated STR diversity points to a prehistoric founder effect for the Welsh and the Irish, long enough ago that the more quickly evolving STR diversity has regenerated. Our own simulations with Simcoal2 show that a founding population of Ne=50 that splits off and grows at an exponential rate of α=0.005 can regenerate its five-STR haplotype diversity in 400 generations (assuming a source population with Ne of 5000) (not shown).

Basques share with the rest of Europeans both the most common haplogroup (R1*(xR1a,R1b3f)-M173) and the modal STR haplotypes within this haplogroup.15 The low STR diversity in Basques seems to be the result of a lower effective population size maintained through generations, which is particularly marked in Gipuzkoa-1. This low effective size may have allowed drift to drive some haplogroups to such high frequencies. It can also be argued that at least part of this conspicuously low diversity in Basques can be attributed to a sampling bias. One of the criteria for donors to be included in the sample was to have at least four generations of Basque ancestry (recorded by Basque surnames) plus, in many cases, a localized ancestry of their grandfathers within the district being sampled. Stringent criteria can lead to reduced diversity, as was the case with the Gaelic surnames.31 As this stringent criterion is not normally demanded in other sampling schemes, we may be introducing a reduction in the effective size of the population being sampled. First, we are discarding any contribution by external gene flow that may have taken place during approximately the last 100 years (a restriction that is not normally imposed on other samples) and second, we are removing any internal gene flow (among Basque districts) that may have taken place during that time. In fact, other Basque samples10, 11, 12 do not show such a marked drop in diversity as seen in our Basque samples, although they still show slightly lower values both in haplogroups and within R1*(xR1a,R1b3f)-M173 than, for instance, Iberians (Figure 5).

In any case, this low effective size is not the result of a recent founder effect, as our data support the hypothesis that at least some Y-chromosome lineages in modern Basques originated and have been evolving since pre-Neolithic times. We cannot gauge to what extent the origin and evolution of these lineages have been geographically local, but this possibility should be unsurprising given that there is evidence supporting human presence in the Basque Country since the Lower Paleolithic, about 150 000 years ago, although the oldest skeletal remains found correspond to the Neanderthals (in the Middle Paleolithic). As regards archaeological sites in the Basque Country,32 the Upper Paleolithic is one of the richest periods, with some of the sites showing continuity in habitation up to, at least, the Bronze Age (about 2000 BC). However, it can be argued that archaeology can seldom differentiate between the cultural and/or biological evolution of a single group and the possible replacement by new groups of incomers. Ancient DNA analysis focused on the Y chromosome could yield the proof needed to conclude a local evolution of Basques.

There is some evidence of a short-range outward flow of Basque Y chromosomes, as the presence of R1b3d-M153 chromosomes in Iberia suggests. This finding is in agreement with previous data,24 which provided more evidence for such gene flow between Basques and surrounding populations on the basis of haplogroup R1b3f-SRY2627. However, in agreement with additional data,33 our data do not show any signs of long-range diffusion of Basque Y-chromosome haplogroups into North Europe associated with the retreat of the last Glacial Maximum, as has been suggested for mitochondrial DNA.34, 35 Finally, while a pre-Neolithic settlement for the Basques can be posited, the strong genetic drift experienced by the Basques does not allow them to be considered either the only or the best representatives of the ancestral European gene pool. Similarly, genetic drift will make the determination of their population affinities difficult.