Introduction

The Caucasus region provides a great opportunity to investigate the influence of geography and language on the genetic structure of human populations. The geomorphological and linguistic landscapes are extremely complex and heterogeneous with several ethnic groups speaking Caucasian, Indo-European or Altaic languages isolated over long periods in the region. There is no traditional pan-Caucasian lingua franca, and bilingualism or multilingualism are the main means of inter-ethnic communication.1

Twenty-six out of the fifty autochthonous ethnic groups of the Caucasus live in Dagestan,2 the north-eastern Caucasian Republic within the Russian Federation. Archeological data indicate that the earliest settlers in the Dagestan highland region are at least 10 000 years old.3 These ethnic groups can be considered as primary isolates,4 because they have been isolated and endogamous for centuries. Throughout the history of the region there has been restricted gene flow between neighboring populations, a high degree of inbreeding (reinforced by Muslim tradition of marriages within kindred), and relatively stationary population sizes for many generations.5 Prolonged marriage isolation under strict patrilocality favored the creation of genetically homogeneous gene pools within these highland populations, as well as a considerable genetic differentiation between them.6 Many groups show high prevalence of certain complex diseases such as cardiovascular illnesses, cancer, schizophrenia, mental retardation and progressive muscular dystrophy.7, 8

Previous genetic studies have revealed that the Caucasian populations do not share the genetic variation believed by some researchers to be a signature of the Neolithic expansion through Europe, which has led some to infer that these populations are direct descendants of a more ancient Eurasian population.9, 10, 11 Others suggest that the genetic legacy of the Caucasus region is mainly because of the people who crossed or invaded this territory in historic times.12

A large amount of genetic data on the Caucasian populations, based on classical markers,9, 13, 14, 15 Alu insertion polymorphisms,10 mitochondrial DNA10, 16 and the non-recombining portion of the Y chromosome (NRY)17 have shown that neither geography nor linguistics have had a strong influence on the genetic structure of these ethnic groups and that genetic drift appears to be the main factor acting in the region. Furthermore, Y chromosomes of highland Dagestan ethnic groups appear to have undergone genetic drift independently of the population history of mitochondrial and autosomal loci, suggesting that social rules, and not simply geographic isolation, are the causal agents for the high NRY diversity observed among highlanders. However, these analyses are often based either on an insufficient number of ethnic groups or they lack sampling accuracy, because they do not consider only mono-ethnic settlements, nor do they use an adequate number of molecular markers to convey a more accurate profile of the variation.

In this study, we analyzed both stable and fast-mutating polymorphisms on the male-specific region of the Y chromosome that gave a substantial increment in terms of evolutionary resolution with respect to previous analyses12, 17, 18 and provided a finer framework upon which evolutionary hypotheses were tested. Our main goal was to analyze the Y-chromosomal variation in these ethnic groups. In this way, it could be possible to better define the genetic structure of Dagestan ethnic groups, verifying whether the extreme genetic diversity of these populations is mainly due to stochastic factors (especially drift) or because of cultural and geographic barriers.

Materials and methods

Subjects and populations

Blood samples were collected in the summer of 2004 from 105 healthy, unrelated males living in the Dagestan auls. The sample set analyzed included five different ethnic groups: Tabasarans (N=30), Kubachians (N=14), Laks (N=21), Avars (N=20) and Chechens-Akkins (N=20) (Figure 1).

Figure 1
figure 1

Geographic location of the Dagestan ethnic groups sampled.

Appropriate informed consent was obtained from all individuals participating in the study. The project and subjects consent forms were approved by the Dagestan IRB at the Institute of History, Archaeology and Ethnology, Dagestan Center of the Russian Academy of Sciences. Sampling accuracy was ensured by choosing only mono-ethnic settlements, preferring isolated highland communities with social rigid structures influencing marriage behaviors and selecting males who were unrelated for at least three generations. The small sample size reflects the difficulty in obtaining an elevated number of unrelated samples in strictly endogamous, small-sized, isolated communities.

All sampled individuals were Muslims residing in remote villages located between altitudes of approximately 550 m (Chechens) and 2 118 m (Laks and Avars) above sea level. These ethnic groups are thought to be indigenous to the region but, like other native peoples of the Caucasus region, their exact origins are unclear. The Laks, Chechens, Avars and Tabasarans are farmers, transhumant shepherds and livestock breeders, whereas Kubachians are well-known goldsmiths and silversmiths. Social and family relations are regulated by Adat (traditional Dagestan customary law that pre-dates Islam) and Shariah (Islamic law). Except for the Chechens, who generally marry outside the clan, most marriages are between first and second cousins and take place within the tukhum, extended family units, or clans, composed of several families descending from a common male ancestor. This marriage practice controls the inheritance of property and restricts male gene flow.

Y-chromosome genotyping

Genomic DNA was extracted from whole blood using QIAamp DNA Blood Mini kit (Qiagen, Hilden, Germany), and genotyped for 20 Y-STR loci (DYS19, DYS388, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS385 a/b, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, GATAC4, GATAH4 and YCAII a/b) and 30 Y-biallelic markers (Y Alu polymorphism, M96, M35, M78, M89, M201, M285, M287, P15, M170, M253, P37.2, M223, M267, M62, M365, M367, M368, M369, M390, M172, M9, M22, M45, M17, P25, M124, M2, M90, MEH2) that were previously reported to be polymorphic in Europe and the Near East.19, 20

Seventeen Y-STRs were typed using the AmpℓFSTR Y-filer kit (Applied Biosystems, Foster City, CA, USA). The length of PCR products was estimated on the ABI-PRISM 310 Genetic Analyzer using the Genotyper version 3.7 analysis software (Applied Biosystems) and the Y-filer reference ladders. The internal size standard used was GeneScan 500 LIZ (Applied Biosystems). DYS388 and YCAII a/b loci were amplified separately. Primers labeled with different fluorescent dye labels, FAM (blue) and JOE (green), allowed the detection of these loci within the same allelic size range (between 125 and 143 bp for DYS388 and 135–163 bp for YCAII). Fragments were separated on the ABI 310 Genetic Analyzer using GeneScan 500 ROX size standard (Applied Biosystems).

Biallelic Y-markers were typed following a phylogenetic hierarchical approach to define evolutionary Y lineages, or haplogroups (Hg). The Y-Hg nomenclature used here is in accordance with the recommendations of the Y Chromosome Consortium.21 M223-I2b1 and M17-R1a1a subclades designation was updated to the Y-haplogroup classification of International Society of Genetic Genealogy (ISOGG 2009, version 4.13).

Genomic DNA was amplified for Y-SNPs using the ABI-PRISM SNaPshot Multiplex Kit. Primers were drawn from the literature and modified to construct multiplexes with the software Autodimer version 1. Each primer was checked for possible extendable hairpin structures and dimer formation between primers. PCR products were purified using Shrimp Alkaline Phosphatase and Exonuclease I to remove unincorporated dNTPs and primers. The purified fragments were separated and detected on ABI 310 Genetic Analyzer's capillary electrophoresis platform using the GeneScan 120 LIZ internal size standard.

Initially the Y Alu polymorphism was typed in all individuals and Y-SNPs M96, M35 and M78 were only typed in individuals with the insertion. Markers M170, M172, M267, M201 were typed only in individuals with the derived M89 allele and the ancestral state at M9 and M45; M22 was typed only in individuals with the mutant state at M89 and M9 and the ancestral state at M45, whereas M17, P25, M124 were typed in individuals with the derived allele at M89, M9 and M45. M62, M365, M367, M368, M369, M390 were typed only in individuals with the derived allele at M267; M285, M287 and P15 were typed only in individuals with the derived state at M201 and M253; P37.2, M223 were only typed in individuals with the mutant state at M170. M2, M90, MEH2, M253, P37.2, M62, M365, M369, M390, M285, M287 and P15 were genotyped, but not observed in any of the analyzed samples.

Statistical analysis

Analysis of molecular variance (AMOVA), genetic distances and diversity indices were calculated using Arlequin 3.1122 (http://cmpg.unibe.ch/software/arlequin3). The statistical significance of Fst values was estimated by permutation analysis, using 10 000 permutations. Fst values were calculated based on the number of pairwise differences between haplotypes/haplogroups. The statistical significance of the correlation between geographic distance, language and genetic distance matrices, was evaluated by AMOVA with 10 000 permutations. The relationship between populations was also assessed through a multidimensional scaling (MDS) analysis, based on haplogroup frequencies and performed with the STATISTICA version 6.0 (StatSoft, Inc., Tulsa, OK, USA).

A median-joining network23 linking 20-locus haplotypes was constructed using the program Network 4.5.1.0 (http://www.fluxus-engineering.com). When constructing networks, the maximum value of 99 was assigned to each SNP whereas each STR locus was given a specific weight according to its variance; the weight of the ith STR was calculated as 10 Vm/Vi, where Vm is the mean variance of all STRs and Vi is the variance of the ith STR.24 DYS385 and YCAII were included in the networks as two separate loci.24

Bayesian coalescent analysis was performed using the program BATWING, which uses a Markov-chain Monte Carlo method25 to derive posterior distributions for a complete set of parameters that describe the relevant underlying model. The age of the most recent common ancestor of the lineage J1-M267 was calculated using this method as described earlier.11

We searched for rare or private alleles to demonstrate whether an ancient genetic structure is still preserved for the ethnic groups of Dagestan, comparing alleles experimentally found in the sample with those obtained from three databases:

  • YHRD database (release 27, built on 13 February 2009, consisting of 72 082 haplotypes in 538 populations), (www.yhrd.org).

  • Y-Chromosome database of Sorenson Molecular Genealogy Foundation (more than 31 000 haplotypes; last update at 26 February 2009), (www.smgf.org).

  • Internal database made up by published and unpublished data (51 149 individuals, 31 453 of which are Europeans).

Results

The main haplogroups detected in the five ethnic groups analyzed are shown in Figure 2, and allele frequencies are summarized in Supplementary Table 1s. Ten paternal lineages were observed: E1b1b1-M35, E1b1b1a-M78, G-M201, I-M170, J1-M267, J1e1-M367/M368, J2-M172, L-M22, R1a1a-M17, R1b1-P25. The most widespread haplogroup in the sample is haplogroup J1 (68.6%), in particular, J1x(J1a, J1b, J1c, J1e1, J1e2) and J1e1 (65 and 4%, respectively) followed by J2 (9.5%), R1b1 (5%), G, I and R1a1a (4%). No downstream mutations were observed in the majority of J1-M267 Y chromosomes and in the G-M201, I-M170 and E1b1b1-M35 lineages in these populations (Figure 3). This, and the presence of several rare or private alleles (Supplementary Table 2s) indicate the great antiquity of Y-chromosomal lineages in Dagestan ethnic groups and the long-term isolation of auls.

Figure 2
figure 2

Median-joining network constructed using 20 locus STR haplotypes. Circles represent haplotypes, with areas proportional to the number of individuals they contain. Pink: Kubachians; red: Avars; yellow: Chechens; green: Tabasarans; light blue: Laks. A full color version of this figure is available at the Journal of Human Genetics journal online.

Figure 3
figure 3

Y-chromosomal haplogroup frequencies in Dagestan ethnic groups, considering only the Y-Hg J, G, I and their subclades.

Nei's diversity estimates based on haplogroup frequencies, reveal low haplogroup diversity, with only Laks and Avars presenting a slightly higher diversity values (Supplementary Table 3s). In contrast, all Dagestan populations exhibited high haplotype diversity (Supplementary Table 3s), as expected in populations that have evolved under constant-size over several hundred generations.24 In the Kubachians, which are one of the most strictly endogamous ethnic groups, a significant loss of paternal lineages was observed (haplogroup diversity=0.264±0.136) and only two haplogroups, J1 and R1a1a, were found. However, they had the highest level of weighted mean intralineage mean pairwise difference (4.935) and haplotype diversity (HD=0.989±0.031), with no haplotype sharing across 20 Y-STR loci.

Considering the DYS388 and YCAII loci in the majority of individuals belonging to haplogroup J1, we did not detect the high repeat numbers (15) for DYS388 as reported in the Arab and Jewish populations from Near East nor did we find the single-banded motif YCAIIa22-YCAIIb22, associated with the diffusion of Arab people in Near East and North Africa.26 The Cohen, Galilee and I&P Arab modal haplotypes,27 which were reported in the Turkic populations from Anatolia19 were not found in the Dagestan samples.

Genetic variability of the ethnic groups analyzed was further evaluated in the wider context of the Caucasus region. Populations were compared on the basis of haplogroup frequencies (Table 1). The most frequent and widespread haplogroup in north Caucasian populations was J1 (Figure 4). Ethnic groups from Dagestan have the highest frequencies ever reported for this haplogroup (mean value 69%) that range from 86% in Kubachians to 43% in Laks. J1 frequency in the non-Dagestanian ethnic groups are significantly lower (mean value 20%), ranging from 32% (non-Dagestanian Chechens) to 3% (Ossetians). This haplogroup is scarcely found in the South Caucasus (11%) where haplogroup J2 is more common (25%). In the samples analyzed here the J2 lineage was present only in Chechens (25%), Laks (14.3%) and Avars (10%), with a mean value not exceeding 10%. In non-Dagestanian populations from North Caucasus this haplogroup is slightly more frequent (13%).

Table 1 Reference populations along with their geographic locations, the number of individuals analyzed (N), linguistic affiliations and data sources
Figure 4
figure 4

Y-Hg distribution in the samples analyzed. Abbreviations: Ab, Abazinians; Abk, Abkhazians; Ar, Armenians; Az, Azerbaijanians; Ch, non-Dagestanian Chechens; Da, Darginians; Ge, Georgians; In, Ingushians; Ka, Kabardinians; Le, Lezginians; Oss_Ardon, Ossetians from Ardon; Oss_Digora, Ossetians from Digora; Ru, Rutulians; Tur, Turks; Iran_Isf, Iranians from Isfahan; Iran_T, Iranians from Tehran. *North Caucasus; **South Caucasus; ***West Asia.

To visualize the relationships among Caucasians and West-Asian populations, we performed MDS analyses based on haplogroup frequencies (Figure 5). The Dagestan ethnic groups analyzed here are well separated from the remaining Caucasian and West-Asian populations in the first two dimensions. The North Caucasian ethnic groups analyzed in this study (Kubachians, Tabasarans and Avars) form a cluster that is distinct from the other northern (Kabardinians, Abazinians, Ingushians and non-Dagestanian Chechens) and southern (Georgians, Azerbaijanians, Armenians) Caucasian populations whereas the Chechens and Laks plot separately. Except for the outlier populations of Darginians and Ossetians (North Caucasus), all the other ethnic groups cluster together with populations of the same geographical area. Overall, there is a well-defined clustering by geography but no clear-cut population groupings by linguistic affiliation.

Figure 5
figure 5

MDS plot based on the matrix of haplogroup frequencies, showing genetic relationships among Caucasus and West-Asian populations. Populations analyzed in this study are indicated by white circles and population designations are as described in Figure 4. Symbols used to designate geographic clusters: ▪ South Caucasus; • North Caucasus West Asia. The stress value for the MDS plot is 0.0796.

AMOVA was used to evaluate the level of genetic differentiation between populations (Supplementary Table 4s). Although the majority of haplotype variation was accounted for by within population differences (86.88%), the variance among populations component (13.12%) was highly significant (P=0.00000±0.00000). To examine the genetic structure in Dagestan and surrounding populations, AMOVA was also performed on the paternal lineage frequencies using different grouping criteria (Supplementary Table 5s). Even if the more relevant percentage of variation was within populations (83.07 and 77.68%, respectively; P=0.0000), the variance between groups component was high and significant (9.75%; 14.94%; P=0.0000), underlining the peculiar characteristics of Dagestan ethnic groups.

The concordance between geography, language and genetic structure was further investigated by grouping populations according to geographic and linguistic criteria. The results (Table 2) show that a geographical rather than linguistic classification of Caucasian populations best accounts for the extant Y-chromosomal diversity. The within-groups component of the genetic variance was high and statistically significant: for geographically defined groups (RG: North Caucasus, South Caucasus, West Asia), FSC was 0.10570 (P=0.00000±0.00000); for linguistically defined groups (L: Caucasian, Indo-European, Turkic), FSC value was 0.12422 (P=0.00000±0.00000). In contrast, the among-groups component (FCT value) was high and significantly different from zero only for geographical classification (FCT=0.04105, P=0.02673±0.00153) and not for the linguistically defined group (FCT=0.01226; P=0.16109±0.00379).

Table 2 Analysis of molecular variance (AMOVA) according to geographic (RG) and linguistic (L) classification

Discussion

The Republic of Dagestan (Northern Caucasus, Russia) has a mosaic of ethnic groups with claims to long residence in the area.28 The linguistic and ethnic Caucasian diversity is particularly apparent in the small villages, or auls, along the often inaccessible highland region of the Great Caucasus Range.

This study presents the most comprehensive analyses of NRY markers in the Dagestan highland ethnic groups. The sampling strategy of this study was designed to obtain a realistic portrayal of the genetic structure and variability of all the ethnic groups analyzed, even if the sample size was low. We analyzed both stable and fast-mutating polymorphisms, which respond differently to demographic events and time scales and allowed a finer resolution of Y variation in the region. The results show that the patrilineal descent of the Dagestan groups represents a cul-de-sac in which NRY lineages (haplogroups/haplotypes) that originated during the Last Glacial Maximum (LGM) are still preserved. These auls may have functioned as refuges off the expansion routes of human populations since the early Neolithic age.

Overall, the Dagestan groups analyzed here represent outliers in the Caucasus Y genetic landscape because of the presence of rare alleles and prevalence of M267-derived J1 Y chromosomes. The absence of downstream mutations in clades J1-M267 (with the only exception of the J1e1 subclade), G-M201, I-M170 and E1b1b1-M35 chromosomes (Figure 3), may represent both the great antiquity of Y-paternal lineages of these ethnic groups and genetic drift in an isolated population. Haplogroup frequencies reported here are very different from those of previous studies12 (Figure 4). North, South Caucasus and West-Asian populations analyzed showed high frequencies for haplogroups J2-M172, G-M201 and I-M170, whereas in Dagestan these haplogroups are found in <10% of the population.

The genetic heterogeneity previously found in Caucasian populations did not correlate with either linguistic diversity or geographic barriers in the Caucasus.12 Based upon mitochondrial DNA and Y-markers the Indo-European speaking Armenians and Turkic-speaking Azerbaijanians were genetically more closely related to other Caucasus groups and not to other Indo-European or Turkic-speaking groups.12, 29 Although geographic classifications do provide a better fit for the genetic variability, correlations between genetic and geographic distances between groups were not statistically significant.12 Our results depict a significantly different scenario with the Dagestan ethnic groups showing a unique high genetic diversity derived from an ancient founder pool of haplotypes. Linguistic barriers do not appear to be the main determinants of the high between-population Y-chromosome variability observed in the Caucasus (Table 2, Figure 5). Instead, geographic factors appear to have shaped the high genetic diversity observed in this region. Ethnic groups from Northern Caucasus are often small-sized, isolated populations residing in highland regions and, hence, they are likely to have undergone long-term genetic drift. In the populations from Southern Caucasus gene flow, facilitated by the absence of strong geographic barriers, might have reduced the genetic distance between groups.

Time estimates were obtained 11 with BATWING, considering a constant-size population growth model. The time to the most recent common ancestor of a set of closely related Dagestan Y chromosomes belonging to haplogroup J1 was estimated to be 11 711 years (95% CI: 15 460–8930, using as priors a uniform distribution between germline and evolutionary estimated mutation rates and 25 years as inter-generation time). This is consistent with the events of spatial re-expansion that took place after the last glacial maximum in the northern hemisphere, the latest being caused by the end of the Younger Dryas cooling (12 900–11 600 ybp30). Moreover, this estimate is in agreement with the archeological record which indicates that some of these ethnic groups have lived in the Dagestan mountains for over 10 000 years.2, 31 These results suggest that colonization of Dagestan probably took place during an early phase of the peopling of Eurasia.

With respect to Semino's and Nebel's hypotheses26, 27 it is likely that the Islamic diffusion in historical time involved the Dagestan population but only from a cultural point of view. Numerous Near Eastern groups have invaded the Caucasus several times during the last two millennia, including the Arab caliph's occupation of Georgia in 654 AD, the Seljuk Turks invasion of the South Caucasus in the 11th century and repeated invasions by Turks and Persians.20, 32 Nevertheless, our results show that these did not result in any genetic imprint probably because the Great Caucasus Range might have represented an important genetic barrier to recent migrations. Geographic and cultural (the rigid structuring of group of patrilineal descent, the high rate of endogamy and inbreeding, the small population size of some ethnic units) factors, could not prevent the diffusion of Islam in 12th–14th AD, but may have represented a genetic watershed in historical times.

In conclusion, Dagestan highland villagers represent a unique example of biodiversity resulting from over four hundred generations of independent evolution. Presumably, all share a common Y genetic background dating back to an early phase of the post-glacial recolonization of Europe. Subsequent geographic isolation33 and endogamy of highland populations resulted in their high genetic differentiation. These factors have had a key role in shaping the present gene pools in Dagestan by facilitating the emergence of a high genetic divergence among ethnic groups and the maintenance of both rare allelic variants and ancestral Y-chromosomal lineages.