Introduction

Relatively recent archaeological evidence indicates that northeastern Europe was initially occupied by modern humans during the transition from the Middle to Upper Paleolithic periods (approximately 35 000–45 000 YBP).1 However, the last glacial maximum (LGM) forced the contraction of the entire European populace to a number of refugia in the Iberian Peninsula, present-day Ukraine and the northern Balkans.2 The region was impacted again 12 200–13 000 years ago by an expansion from southwestern Europe during the final stage of the LGM, an event still imprinted in the mtDNA landscape of the area.3 The next group of migrants to arrive in the locality is theorized to have been the Comb Ware people (predecessors of Finno-Ugric-speaking tribes, a branch of the Uralic language family) from the Ural Mountains about 6900 YBP.4

Populations within the Urals are characterized by high levels of genetic heterogeneity and various degrees of admixture between Europeans and Asians.5 It has been reported that these groups possess some Asian maternal DNA components.6, 7 Additional investigations utilizing the autosomal VNTR markers, D1S80 and 3′ApoB,8, 9, 10 TP53 single-nucleotide polymorphism (SNP) haplotypes11 along with Y-chromosomal analyses12 signal both Asian and European genetic constituents. For example, Y-chromosomal haplogroup N (specifically sub-haplogroups N1c and N1b), believed to be of Asian ancestry,13, 14, 15, 16 is found at high frequencies within the Urals; and its pronounced presence in the Baltic countries (Lithuania, Latvia, and Estonia), as well as in the Nordic Peninsula (Finland) and in the Saami of Sweden, argues for a Uralic genetic signature throughout northeastern Europe.17

Despite the marked genetic similarities between Finno-Ugric speakers (Finns, Estonians, the Saami, and groups found in the slopes of the Urals) and Latvians and Lithuanians, peoples from the latter two Baltic countries speak languages belonging to the Balto-Slavic branch of the Indo-European language family. The Indo-European languages are believed to have been initially spread by the Kurgan horse culture about 10 000 YBP.12, 13, 18 In spite of this, a lack of consensus on the roots of this civilization is reflected in the existence of varying theories claiming the Ukraine,13 the Central Asian steppes,12 and northern India18 all as plausible cradles for Proto-Indo-Europeans. Proto-Baltic ancestors, in turn, are speculated to have arrived from Central and southeastern Europe 5000–4000 YBP,19 triggering the contraction of the already present Finno-Ugric tribes to the north. Early genetic analyses based on blood groups and serum protein marker distributions indicate that the contemporary Balts constitute a composite of the Finno-Ugrians and Slavic groups.20 More recent work, utilizing Y-chromosomal short tandem repeats (STRs), suggests that the Baltic populations of Latvia and Lithuania are phylogenetically closer to each other than either is to their Finno-Ugric Estonian neighbors.21

The eastern Slavic populations (the present-day Russians, Byelorussians, and Ukrainians) are speculated to have descended from Proto-Slavic-speaking groups that extended into northeastern Europe from Central Europe during the early middle ages,22 yet the origins of these migrant tribes are widely debated.23 Two theories have been proposed on the origins of eastern Slavs: the hybridization and transformation hypotheses. According to the former, these groups arose as a result of fusion between the invading Slavic tribes and populations inhabiting Eastern Europe. Alternatively, the transformation model proposes that eastern Slavic groups gradually evolved in situ from ancient groups autochthonous to the area. Mitochondrial DNA,24, 25 Y-STR haplotypes,26, 27 and autosomal STR diversity distributions8, 28 endorse the hybridization theory supporting the Central European Slavic infusion into tribes previously residing in Eastern Europe.

A recent Y-chromosomal study addressing the intra-ethnic variation in Russian populations revealed that central and southern Slavic Russian groups cluster closely together, whereas northern groups exhibit genetic and phylogenetic affinities to Finno-Ugric peoples, suggesting an assimilation of the Uralic substrata throughout the area,23 a phenomenon previously observed using other marker systems, such as mtDNA,24, 25 Y-STR haplotypes,26, 27 and autosomal STR loci.8, 28 These and other publications14, 29 also claim that geographic partitioning rather than ethnolinguistic boundaries constitutes the main genetic barrier throughout Europe. Nevertheless, the complexity of the region (especially of northeastern groups) and the fusion of a plethora of peoples make the scenarios portrayed by this claim simplistic in nature.

To date, several studies have been performed to genetically characterize populations within both northeastern Europe and northwestern Asia; yet, the data are fragmentary and uneven in geographic scope, heterogeneous in the marker systems used, and at times contradictory. In addition, limited work has been conducted to integrate the existing information comprehensively in order to delineate migratory patterns and phylogenetic relationships. In the current study, high-resolution Y-chromosome binary markers were used to shed light on the paternal genetic histories of populations from the aforementioned regions and their relationships to previously published collections. Furthermore, 15 Y-STR loci were assayed for individuals from the SNP backgrounds R1a1, N1c1, and N1b to ascertain population expansion times and elucidate possible migratory scenarios.

Materials and methods

Sample collection and DNA isolation

Blood samples were collected in Vacutainer tubes from a total of 236 unrelated male individuals residing in the East European region of Russia (Arkhangelski (n=28), Kursk (n=40), Tver (n=38), Izhemski Komi (n=54), and Priluzski Komi (n=49)) and Siberia (Khanty (n=27)). Genealogical information was recorded for at least three generations to establish regional ancestry. Table 1 lists the sampling sizes, geographic locations, linguistic affiliations, and references of the previously published, geographically targeted populations under study.

Table 1 Populations examined in Y-SNP analyses

Total nucleic acid was isolated by standard phenol–chloroform extraction, as described by Antunez-de-Mayolo and collaborators.37 DNA was ethanol-precipitated and stored in 0.010 M Tris-EDTA (pH 8.0) at −80°C as stock solutions. The samples were procured with informed consent following all ethical guidelines as stipulated by all research institutions involved in the project.

Y-chromosome haplotyping

A total of 105 binary markers were hierarchically genotyped by PCR-RFLP, allele-specific PCR,38, 39 and amplicon size detection of the YAP polymorphic Alu insertion.40 Detailed information on the locations, allelic states, primer sequences, and references for each marker can be found at the Y-chromosome consortium web page (http://ycc.biosci.arizona.edu/nomenclature_system/index.html) and in subsequent publications.18, 41, 33

Y-STR genotyping

A total of 17 Y-STR markers (DYS19, DYS385 a/b, DYS389 I/II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, and Y-GATA H4) were PCR-amplified using the AmpFlSTR Yfiler kit (Applied Biosystems) according to the manufacturer's specifications for samples under the SNP backgrounds R1a1 (M198), N1c1 (M178) and N1b (P43). Fragment separation was conducted with an ABI Prism Genetic Analyzer (Applied Biosystems). The electropherogram profiles were then analyzed using the GeneScan 3.7 and Genotyper 3.7 NT software.

Time estimations

DYS385a and DYS385b were not included in time estimation or variance calculations, given their duplicative nature. Variance estimations were ascertained using the Vp function as shown by Kayser et al42 and were based on seven Y-STR loci (DYS19, DYS389 I, DYS389 II, DYS390, DYS391, DYS392, and DYS393) for R1a1 and six loci (DYS389 I, DYS389 II, DYS390, DYS391, DYS392, and DYS393) for N1c1- and N1b-derived samples given the limited number of loci reported for the previously published reference populations (Supplementary Table 1). Haplotype expansion times were defined using the programs NETWORK 4.2.00 and BATWING, assuming an average Y-STR mutation rate of 6.9 × 10⁻⁴,43 intergeneration times of 25 and 32 years,44 and exponential population growth from a constant size ancestral population.45 Assumptions for the BATWING analysis were followed, as previously described by Cinnioğlu et al,33 with the exception of the population growth rate (α), for which γ (1.01,1) was used instead.46 Median joining networks were constructed, also excluding DYS385 a/b, with the aid of the NETWORK 4.2.0045 software package (SNP-STR references and number of individuals are provided in Supplementary Table 1).
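The arithmetic behind a variance summary and a ρ-based (rho) age estimate of this kind can be sketched as follows. This is a minimal illustration and not the procedure implemented in NETWORK or BATWING: it assumes a strict single-step mutation model and a known founder haplotype, and the function names and toy haplotypes are hypothetical. Only the mutation rate (6.9 × 10⁻⁴) and the 25- and 32-year intergeneration times come from the text; treating that rate as per locus per generation is an assumption, and the plain average of per-locus variances shown here may differ from the Vp function of Kayser et al.

```python
from statistics import mean, variance

# Average Y-STR mutation rate quoted in the text
# (assumed here to be per locus per generation).
MU = 6.9e-4


def mean_locus_variance(haplotypes):
    """Average, over loci, of the sample variance in repeat counts."""
    loci = list(zip(*haplotypes))  # transpose: one tuple of counts per locus
    return mean(variance(counts) for counts in loci)


def rho_age_years(haplotypes, founder, gen_years, mu=MU):
    """Age from rho, the mean number of repeat steps away from the founder."""
    rho = mean(
        sum(abs(allele - f) for allele, f in zip(hap, founder))
        for hap in haplotypes
    )
    generations = rho / (mu * len(founder))
    return generations * gen_years


# Toy data: four chromosomes typed at two loci (made-up repeat counts).
toy = [(14, 13), (15, 13), (14, 13), (14, 14)]
founder = (14, 13)

print("mean per-locus variance:", mean_locus_variance(toy))
for gen_years in (25, 32):
    print(gen_years, "years/generation:", round(rho_age_years(toy, founder, gen_years)))
```

Because the age scales linearly with the intergeneration time, switching from 25 to 32 years multiplies every estimate by 1.28, which is why the choice of time frame matters when comparing ages across reports.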

Unfortunately, BATWING did not generate credible 95% CIs (confidence intervals) for most comparisons, and as such most of the values generated grossly disagree with coalescence time estimates performed by other authors.15, 18 Therefore, unless otherwise stated, age estimates used throughout the Results and Discussion sections will be the NETWORK estimations using a 25-year intergeneration time for comparison purposes, as most reports base their calculations on this time frame. BATWING estimates will only be referred to when credible 95% CIs are attained. Nevertheless, given that BATWING estimates did not generate credible 95% CIs in most instances, NETWORK calculations should be taken with caution, as there are likely to be violations of the dating method's assumptions.

Phylogenetic and statistical analyses

A correspondence analysis (CA) based on the frequencies of the binary markers defining major haplogroups (A–R) was generated to gauge genetic similarities among the populations using the NTSYSpc 2.02i software.47 CAs based on Y-STR haplotype frequencies were also conducted. Analyses of Molecular Variance (AMOVAs) and Fst distances were calculated using the Arlequin software package (version 3.11).48, 49 Significance was ascertained at α=0.05.
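As a rough illustration of what a pairwise Fst distance measures, the sketch below computes a simple diversity-based Fst from haplogroup frequencies in two populations. It is not the estimator used by Arlequin, which derives Fst within an AMOVA framework and tests significance by permutation; the function names and the frequencies are hypothetical.

```python
def gene_diversity(freqs):
    """Diversity of one population: 1 minus the sum of squared frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def pairwise_fst(freqs_a, freqs_b):
    """Fst = (Ht - Hs) / Ht for two populations, weighting both equally."""
    haplogroups = set(freqs_a) | set(freqs_b)
    pooled = {
        h: (freqs_a.get(h, 0.0) + freqs_b.get(h, 0.0)) / 2 for h in haplogroups
    }
    h_s = (gene_diversity(freqs_a) + gene_diversity(freqs_b)) / 2  # within
    h_t = gene_diversity(pooled)                                   # total
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical haplogroup frequencies, not taken from the study's tables.
pop_a = {"N1c1": 0.5, "R1a1": 0.5}
pop_b = {"N1c1": 1.0}

print(pairwise_fst(pop_a, pop_b))
print(pairwise_fst(pop_a, pop_a))
```

Two populations with identical haplogroup frequencies give a distance of zero, and the value grows as their frequency profiles diverge.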

Results

Haplogroup phylogeography

Of 105 binary markers typed, 48 were found to be polymorphic (Arkhangelski (18), Khanty (13), Izhemski Komi (13), Priluzski Komi (16), Kursk (29), and Tver (26)) in the 236 individuals who were examined (Figure 1). Sub-haplogroup N1c1 (M178) is shared across all the European and Uralic populations at varying frequencies, with the highest level detected in the Izhemski Komi collection (52%) and the lowest in the Siberian Khanty (4%), which exhibits a considerable proportion of haplogroup N1b (78%) (Figure 1). These findings parallel the results from other northeastern European populations (eg, Finland, Estonia, Lithuania and Latvia), which contain comparable frequencies of N1c; however, M178 was not typed in a previously published report.39 Haplogroup N1b is also found at appreciable frequencies in the Izhemski and Priluzski Komi groups (17% and 14%, respectively).

Figure 1

Haplogroup phylogeography throughout Northwestern Russia. Y-Chromosome markers typed hierarchically along with current haplogroup denominations and numbers of individuals belonging to each haplogroup.

R1a1 (defined by mutation M198) is shared across all the populations genotyped in this study, with frequencies ranging from 15% in the Khanty collection to 53 and 58% in Kursk and Tver, respectively (Figure 1). Haplogroup I derivatives, specifically I1 and I2a (defined by M253 and P37, respectively), are found at substantial proportions in the Slavic populations of Kursk and Tver (Figure 1), adding up to 13 and 18% of each population's paternal gene pool, respectively. The Arkhangelski group displays similar levels of I1 (14%) and is completely lacking I2a, exhibiting high frequencies of I2* (absent in both Tver and Kursk). The haplogroup distributions within Central Eurasia based on the six genotyped populations in this study and the reference collections are illustrated in Figure 2.

Figure 2

Haplogroup distributions throughout Central Eurasia. Population names and abbreviations: KAB (Kabardinians), LEZ (Lezgi), OSA (Ossetians Ardon), OSD (Ossetians Digora), ARM (Armenia), AZE (Azerbaijan), GEO (Georgia), IRQ (Iraq), LEB (Lebanon), SYR (Syria), TUR (Turkey), NIR (North Iran), SIR (South Iran), NPA (North Pakistan), SPA (South Pakistan), KAZ (Kazakhstan), TUK (Turkmenistan), UZB (Uzbekistan), BEL (Belarus), EST (Estonia), FIN (Finland), LAT (Latvia), LIT (Lithuania), POL (Poland), ROM (Romania), SLO (Slovakia), UKR (Ukraine), ARK (Arkhangelski), BEG (Belgorod), KRA (Krasnoborsk), KUR (Kursk), LIV (Livni), MEZ (Mezen), OST (Ostrov), PIN (Pinega), ROS (Roslavl), TVE (Tver), UNZ (Unzha), VOL (Vologda), KOI (Komi Izhemski), KOP (Komi Priluzski), MAR (Mari), EVE (Evenks), KAK (Khakassians), KHA (Khanty), and TUV (Tuvinians).

Population relationships

Genetic similarities between Finno-Ugric and Balto-Slavic populations are illustrated in Figure 3. The Slavic populations cluster tightly in the upper-left quadrant, with the Finno-Ugric Estonians and Finns partitioning loosely to the left of the aforementioned grouping along with their geographical neighbors Lithuania and Latvia. The Uralic populations segregate to the bottom-left quadrant midway between the Slavic cluster and a poorly defined Siberian grouping. The Khanty collection strays away from any pairings and lies at the extreme lower corner of the same portion of the graph. In the right half of the projection, Caucasian, Middle Eastern, and Central Asian populations follow an almost geographical cline from the North Caucasus toward the Middle East and then into Central Asia, running from the extreme right midway between the upper and lower quadrants to the center of the lower-right portion of the graph.

Figure 3

Correspondence Analysis (CA) based on major Y-chromosome haplogroup bifurcations (A–R).

When a continentally based AMOVA was conducted, variance components suggest a greater affinity for geographical influences rather than for linguistic ties (Table 2), supporting earlier findings.15, 29, 50 However, when only Balto-Slavic and Uralic groups are evaluated, both linguistic and geographic components yield similar variance component percentages, making it difficult to ascertain whether linguistics or geographical connections influence genetic relationships (Table 2).

Table 2 Analysis of molecular variance

Pairwise Fst distances are presented in Supplementary Table 2. Values of pairwise comparisons reported in red represent statistically nonsignificant distances at α=0.05, whereas estimates displayed in blue correspond to distances found nonsignificant after applying the Bonferroni correction for Type 1 errors at α=0.05/1050=0.000048. The northeastern European populations of Latvia and Lithuania were found to be more similar genetically to the Finno-Ugric populations than to their Slavic neighbors. Populations from Central Asia and Siberia (excluding Khanty) exhibit comparable average distance values (0.09540 and 0.090254, respectively) when compared among themselves (all generating significant Fst values), whereas populations from the Caucasus, which are found in much closer geographical proximity to each other than the aforementioned groups, display an average value of 0.09759, including several significant pairwise distance values (Supplementary Table 2).
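The corrected threshold quoted above is simply the nominal α divided by the number of comparisons. The short sketch below reproduces that arithmetic and shows how a p-value would fall into the three color-coded classes; the function names are hypothetical, and the handling of values exactly at a threshold is an assumption.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_tests


def classify(p_value, alpha=0.05, n_tests=1050):
    """Label a pairwise distance by whether its p-value survives correction."""
    if p_value >= alpha:
        return "nonsignificant"
    if p_value >= bonferroni_threshold(alpha, n_tests):
        return "nonsignificant after Bonferroni"
    return "significant"


# 0.05 / 1050, printed to six decimal places as in the text.
print(f"{bonferroni_threshold(0.05, 1050):.6f}")

# Made-up p-values, one for each class.
for p in (0.20, 0.001, 0.00001):
    print(p, classify(p))
```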

Y-STR variance, age estimates, and network projections

Distribution and Age Estimates of Haplogroup R1a1

The oldest age estimates for haplogroup R1a1, dating back to Mesolithic times (approximately 18 000 YBP), were detected in West India (16.7±4.3 kya), South India (18.2±5.5 kya), South Pakistan (18.7±4.7 kya), and Serbia (17.3±5.9 kya) (Table 3). However, STR variance is highest in southern India, with a value of 0.505 (Table 3).

Table 3 Haplotype variance and age estimations for haplogroup R1a1 (M198)

A NETWORK projection based on the Y-STR profiles of all R1a1 individuals is presented in Figure 4a. It is readily observed that the diversity of Asian haplotypes is far greater than that found in European populations. There are several specific clades exclusive to Asian groups; however, the same is not true for Europeans. The microsatellite distributions are especially interesting in Turkey (the only Anatolian group included), given the plethora of haplotypes present in the population. Supplementary Figure 1b displays the genetic relationships among R1a1 individuals of Russian Slavic descent and Uralic groups across 15 Y-STR loci in a Network projection. The distribution does not reflect population-specific partitioning or ancestral–descendant relationships; rather, all the collections appear to contain a widespread distribution of haplotypes, suggesting multiple founders. A CA plot based on the Y-STR profiles of individuals belonging to haplogroup R1a1 is presented in Supplementary Figure 1a.

Figure 4

NETWORK Projections for all populations analyzed. (a) R1a1 using 7 Y-STR loci; (b) N1c using 6 Y-STR loci; and (c) N1b using 6 Y-STR loci.

Haplogroups N1c (Tat) and N1c1 (M178)

Y-STR variance estimates for N1c1 reach levels as low as 0.079 in the Izhemski Komi group and as high as 0.226 in the Arkhangelski group (Table 4). Age estimates for haplogroup N1c1 (based on six STR loci) range from 7.2±3.4 kya in Tver to 9.7±5.8 kya in the Priluzski Komi population. Similar age estimates for other northeastern European populations were attained (Table 4); however, not all the reference populations were typed for M178 (samples typed for M178 are designated as N1c1-derived individuals in Table 4). When 15 STR loci are used, these values range from as low as 8.2±2.5 kya in the Arkhangelski collection to as high as 13.0±4.2 kya in the Komi from Priluzski. Yet, both Komi populations (Izhemski and Priluzski) exhibit N1c1 Network topologies consisting of two subclusters (Supplementary Figures 2b and c), each subcluster generating considerably lower ages; the values are 5.6±2.0 and 2.4±1.7 kya for the Izhemski Komi, and 5.5±2.0 and 2.1±0.8 kya for the Priluzski Komi (Table 4). Two distinct independent clusters are also observed when the two Komi populations are pooled together for age determinations (Supplementary Figure 2d).

Table 4 Haplotype variance and age estimations for haplogroups N1c (Tat), N1c1 (M178) and N1b (P43)

A Network projection based on N1c haplotype distributions exhibits a segregation between Asian and European groups despite some haplotypic sharing between these two (Figure 4b). Close relationships are observed among the Slavic and Uralic Russians, as most haplotypes present in one subset of populations are also present in the other. They both share clusters with European groups as well. Supplementary Figure 2e displays a Network Analysis of N1c1-derived individuals based on the 15 Y-STR loci. The Slavic populations (Arkhangelski, Kursk, and Tver) are found to the right of the projection along with some Komi haplotypes (shown in red and green); however, a split to the left of the NETWORK, shared only by the Priluzski and Izhemski Komi populations, suggests a different source for N1c1 (M178) in these Uralic groups or, alternatively, the expansion into the Komi territory of Slavic individuals. Supplementary Figure 2a illustrates a CA graph based on the Y-STR profile (six loci) of individuals possessing the N1c1 haplogroup.

Haplogroup N1b (P43)

N1b is the predominant haplogroup in the Khanty population; however, Y-STR variance values are much higher for both the Izhemski and Priluzski Komi groups (0.098 versus 0.181 and 0.611, respectively) (Table 4). Similarly, time estimates for the Khanty reveal a rather recent entrance of the haplogroup into the population (4.0±2.6 kya), whereas much earlier dates are obtained for the Izhemski (6.7±4.2 kya) and Priluzski (12.9±4.1 kya) Komi populations (Table 4). Variance calculations for the Pinega and Mezen populations yield Vp values of 0.163 and 0.083, respectively. Conversely, the other Slavs group attains a variance value of 0.653 and an age estimate of 18.1±6.4 kya; however, a bipartite structure is observed with separate clusters that attain ages of 6.0±3.7 and 6.0±3.2 kya. A Network Analysis, including Khanty, and the Uralic and Slavic Russian groups at a resolution of 15 Y-STR loci, displays a clear partition between the Slavic groups and the Khanty collection (Supplementary Figure 3b). Interestingly, the Izhemski Komi partitions to the portion of the projection encompassing the Slavic groups, while the Komi from Priluzski shares haplotypes with both clusters.

A Network projection based on Y-STR distributions of haplogroup N1b is presented in Figure 4c. Haplotype distributions in Uralic groups are widespread throughout the projection sharing clusters with both Asian and European Slavic populations. The Siberian Khanty collection segregates into one portion of the graph composed of Asian haplotypes, but shows some affinities to Uralic groups as well. A CA based on the Y-STR haplotype frequencies of these populations is presented in Supplementary Figure 3a.

Haplogroup N

Grouped age estimates based on the major bifurcations of haplogroup N were performed to achieve a consensus on the antiquity of each of its sub-haplogroups on a regional basis (Europe and Asia) and in specific ethnic groups (Russian Slavic and Russian Uralic) (Table 5). Estimates for M231 (N*) are highest among Mongolian/Siberian groups, reaching 23.7±5.4 kya compared with an overall value for all populations of 19.1±4.2 kya. BATWING expansion times for the same comparison yield an overall age of 3.8 kya for M231. The age for haplogroup N1a (M128) is more recent (6.9±3.9 kya) than that of its sister clades N1b (15.8±5.4 kya), which exhibits a bipartite substructure leading to separate age estimates of 7.8±3.7 and 5.0±2.2 kya (BATWING yields an estimate of 1.1 kya for this haplogroup), and N1c (10.7±3.4 kya). BATWING estimates for N1c achieve an age of 3.0 kya. Calculations using all clades within the haplogroup (N*, N1a, N1b, and N1c) yield an average age of 13.4±4.0 kya, whereas BATWING calculations provide a value of 3.4 kya. It should be noted that these BATWING age estimates do exhibit a credible 95% CI; however, they dispute previous findings by Rootsi et al15 and provide recent haplogroup ages. NETWORK projections, on the other hand, provide estimates that agree with previous findings.15

Table 5 Grouped age estimations

Discussion

Population relationships

The CA (Figure 3) based on major Y-SNP haplogroups reveals several distinct groupings reflecting both geographic and linguistic affiliation. With the exception of the Khanty, a clear cluster is formed among the Uralic-speaking populations, where Finland segregates at a distance from the rest. This partitioning may be related to Finland's low effective population size for long periods of time and local isolation of small groups, possibly causing major bottlenecks that significantly limit the current diversity of the population and allow for genetic drift.4

Lithuania and Latvia, both Indo-European-speaking groups, are also found within the Uralic assemblage. This phylogenetic connection between the Baltic- and Uralic-speaking collections is also reflected in Fst distances, where both Lithuania and Latvia exhibit nonsignificant genetic distances with all Uralic speakers, excluding Khanty (Supplementary Table 2). When distances within the group are averaged, Fst values are lower when Lithuania and Latvia are included in the calculations (Fstavg=0.03622) than when they are removed (Fstavg=0.05322). On the other hand, analyzing these two populations with Slavic-speaking groups leads to an increase from 0.09078 to 0.09222 in Fst distances. These data lend support to previous findings by Kasperaviciute et al,51 who propose a close relationship between Lithuanians and Latvians with their Finno-Ugric-speaking neighbors (Estonia and Finland). The Y-haplogroup distributions of Latvia and Lithuania also exhibit greater affinity with those of Uralic populations than with the other Indo-European-speaking groups. For example, both Baltic populations display considerable frequencies of haplogroup N1c (33 and 47% in Latvia and Lithuania, respectively), whereas in other geographically proximal Indo-European-speaking groups (ie, Belarus, Slovakia, and Poland), this frequency is only 2–5% (Figure 2). These data corroborate results by Laitinen et al,52 indicating that males from these Baltic and Uralic populations exhibit common genetic patrimonies, and suggest that the Uralic dominion encompassed a greater area than has been previously reported.
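The averaging described above, in which the mean pairwise Fst of a group is recomputed with and without particular populations, can be sketched as follows. The distances below are made up purely to show the mechanics (the real values are in Supplementary Table 2), and the helper name is hypothetical.

```python
from itertools import combinations


def mean_pairwise(dist, populations):
    """Average of dist over all unordered pairs drawn from populations."""
    pairs = list(combinations(sorted(populations), 2))
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


# Made-up pairwise distances, keyed by unordered population pair.
dist = {
    frozenset(("EST", "FIN")): 0.06,
    frozenset(("EST", "LAT")): 0.02,
    frozenset(("EST", "LIT")): 0.03,
    frozenset(("FIN", "LAT")): 0.05,
    frozenset(("FIN", "LIT")): 0.04,
    frozenset(("LAT", "LIT")): 0.01,
}

uralic = {"EST", "FIN"}
with_balts = uralic | {"LAT", "LIT"}

print("Uralic only:      ", mean_pairwise(dist, uralic))
print("Uralic plus Balts:", mean_pairwise(dist, with_balts))
```

A lower average after adding populations to a group indicates that the newcomers sit, on average, closer to the existing members than those members sit to each other.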

It has been reported that haplogroup distributions from western (Poles, Slovakians, Czechs, and Lusatians), southern (Slovenes, Croats, Bosnians, Montenegrins, Serbs, Macedonians, and Bulgarians), and eastern Slavs (Belarusians and Ukrainians) differ considerably from those of Russian Slavs, specifically northwestern Russians (also considered part of the eastern Slavs). For example, Slovakians, Ukrainians, Poles and Belarusians exhibit very low frequencies of N1c, whereas the haplogroup attains levels of 13, 13, and 29% in the Russian Slavic groups of Kursk, Tver, and Arkhangelski, respectively, despite the close geographical proximity of these groups. These differences are also observed between Russian groups, with southeastern Russians exhibiting frequencies of N1c as low as 5% in a collection from the Livni province and northeastern Russians possessing levels as high as 46% in the Mezen locality.23 N1c is particularly high in populations of Uralic descent and may signal genetic input from the autochthonous (former) groups of northeastern Europe. The Slavic Russian populations (Kursk, Tver, and Arkhangelski) also possess haplogroup I at frequencies of 15, 18, and 50%, respectively. The haplogroup is found at 18% in Ukraine, where it may have arisen during the LGM,13, 53 and similar frequency distributions of haplogroup I have been reported for other Russian groups.23 The distributions and clinal frequency gradients of N1c support the hybridization hypothesis for Slavic Russians and argue for considerably more genetic signals from Uralic tribes in northwestern Russian groups than in the rest of the eastern Slavic domain.

It should be noted that although statistically significant correlations are observed in the AMOVA between genetics and both linguistics and geography, a closer relationship is seen between geography and genetics (8.81% in the Among Groups comparison versus 7.87% in the Among Populations Within-Groups comparison) than between linguistics and genetics (6.51% of the variance attributable to the Among Groups comparison versus 10.10% to the Among Populations Within-Groups estimate) when populations throughout Eurasia are compared at the transcontinental level, as has been stated previously.15, 28, 29, 50 When only members of the Balto-Slavic linguistic branch of the Indo-European language family and Uralic groups are compared, neither linguistic nor geographic ties appear to define the genetic structure of the populations in question, suggesting that factors other than geographical proximity and linguistic affiliation have been involved in shaping the current genetic and phylogenetic relationships of members of these two linguistic families (Table 2).

A discontinuity is apparent between populations from North Caucasia and Baltic/Slavic/Uralic groups to the north in the distributions of haplogroups G and N (Figure 2). Haplogroup G is confined to the Caucasus and the Middle East and not detected in the northern groups (Slavic and other Eastern European populations) despite the lack of a major geographical barrier between the northern Caucasus and the aforementioned areas. Conversely, haplogroup N is not observed within the Caucasus despite its high frequencies and widespread distribution throughout northeastern Europe, Siberia, and Central Asia (these apparent disconnections have also been reported by Fechner et al50). Phylogenetic relationships also illustrate a disconnection between northeastern European populations, which despite their proximal geographical locations map at opposite ends of the plot (Figure 3), suggesting linguistic and/or ethnic obstacles to gene flow. Cultural barriers to genetic exchange have been previously observed in the Kalmyks, a group that after relocating to the area near the Caucasus from Mongolia has not received genetic inputs from North Caucasian groups.54 Populations from Caucasia, in turn, are described as traditional genetic isolates that have remained separate and independent from other groups for thousands of years.55

Haplogroup R1a1 is represented by complex diversity patterns

Haplogroup R1a1 (delineated by mutation M198) is believed to have originated in present day Ukraine13 following the LGM, and is thought to mark the expansion of the Kurgan horse culture.12 Kurgan migrations are believed to have occurred both into Europe and to the east, resulting in the dissemination of the Indo-European languages.56 Alternatively, Sengupta et al18 and Wells et al12 have proposed that the haplogroup originated in Northwestern India and in the Central Asian steppes, respectively, given the wide variety of R1a1 Y-STR haplotypes throughout these areas. Network age estimations from this study suggest that two separate groups exist within R1a1 with similar ages for populations found at the western (Serbia 17.3±5.4) and eastern (South Pakistan 18.7±4.7) poles of the expansion. These results along with time estimates for several other populations across Europe and Asia support the findings by Sengupta et al18 regarding the central Asian origins of the mutation. NETWORK projections also support an Asian origin to this haplogroup, given the plethora of STR haplotypes present in these groups versus those found in European populations (Figure 4a).

The R1a1 network projection in Supplementary Figure 1b based on 15 Y-STR loci lacks substructure along population lines. A central core of individuals and star-like topology is indicative of similar haplotypes from a common source for both the Slavic and Uralic Russians genotyped in this report. These results corroborate the comparable expansion time estimates based on 7 and 15 STR loci (Table 3).

Microevolutionary processes

The separation between the geographically proximal collections of the Priluzski Komi and Izhemski Komi in Supplementary Figure 2a is noteworthy. Similarly, North and South Pakistan partition distantly in the plot. In the case of South and North Pakistan, one possible explanation is the distinctive involvement of South Pakistan as a migratory corridor between the Middle East and Asia in the original migration of modern humans out of Africa followed by bidirectional dispersals.38 North Pakistan, on the other hand, located at the southwestern end of the Himalayan range, a known genetic as well as topo-geographical barrier,39 has more likely experienced limited dispersals allowing for the observed patterns.

The differences between the two Komi groups may reflect events regarding people with a common origin being differentially influenced genetically by unrelated migrations and/or genetically distinct populations adopting similar cultures and languages. It is possible that the observed genetic differences may reflect cultural and socioeconomic separations between the two groups, who despite inhabiting a close geographical area exhibit differing subsistence styles (the Komi from Priluzski are cattle breeders and farmers, whereas the Komi from Izhemski have adopted reindeer herding from neighboring Nenets).10 In support of this scenario, it is known that the Priluzski Komi belong to a group of populations that appear to have arisen much earlier historically than the Izhemski Komi, which, in turn, exhibit some peculiar linguistic traits not observed in other Komi populations.57 Yet, the profound differences in the Y-STR profiles and the separation from each other in the Network Analysis argue for populations with unique genetic backgrounds.

Possible origin and migration patterns of haplogroups N1c1 (M178) and N1b (P43)

Haplogroup N is found throughout North-Central Eurasia at varying frequencies with sub-haplogroup N1c being the most widespread.15 Proposed migratory routes based on Y-STR variance estimations have suggested that N1c carriers spread from northern China through Siberia to northeastern Europe.15 Sub-haplogroup N1c1 (defined by mutation M178), long believed to be restricted to Europe and to mark a recent Uralic migration into northern Europe,14 is now known to be widespread in northern China and Mongolia.16 However, Y-STR variance values from this study do not support a migratory route from the Urals to the northeastern Slavic domain, as Russian Slavic populations exhibit higher Y-STR diversity (as high as 0.226 in the Arkhangelski population) than those found in the Uralic groups (0.079 in the Izhemski Komi and 0.121 in the Priluzski Komi) (Table 4).

When Network projections are constructed for N1c1 using 15 Y-STR loci, topologies composed of two clusters are observed for both Komi populations (Supplementary Figures 2a and b), leading to separate time estimates at the individual cluster level of 5.6±2.0 and 2.4±1.7 kya for the Komi from Izhemski, and 5.5±2.0 and 2.1±0.8 kya for the Komi from Priluzski. When the two populations are grouped (Supplementary Figure 3d), similar age estimates are attained for each subcluster (Table 5). On the other hand, the Network projections for the Russian Slavic populations do not show dual clustering, and their ages range from 8.2±2.5 kya in Arkhangelski to 9.7±2.6 kya in Kursk (Table 4), and 8.3±1.6 kya when the three Slavic populations are grouped (Table 6). The presence of dual clusters in these Komi groups may explain the high age estimates previously observed for this region, which led to the suggestion that an east to west dispersal of N1c1 was the most likely migratory route taken by the haplogroup's carriers.15 It is possible that the age values previously reported15 are the result of subpopulation structure (known to lead to erroneously inflated accumulated ages) within the Uralic populations analyzed, probably resulting from input from different source populations (eg, of Asian and European descent).
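The effect of unrecognized substructure on accumulated variance can be illustrated with a minimal sketch. It assumes that Y-STR variance is summarized as the mean, across loci, of the variance in repeat counts, which may differ from the exact estimator used in this study. The function name and the toy haplotypes are hypothetical and are not data from this report.

```python
from statistics import pvariance


def mean_locus_variance(haplotypes):
    """Mean, across loci, of the population variance in repeat counts.

    Assumed summary statistic for illustration only; each haplotype is a
    tuple of repeat counts, one entry per Y-STR locus.
    """
    loci = list(zip(*haplotypes))  # regroup repeat counts by locus
    return sum(pvariance(locus) for locus in loci) / len(loci)


# Hypothetical toy haplotypes (three loci), NOT data from the study.
# Each cluster is tight around its own modal haplotype.
cluster_a = [(14, 11, 23), (14, 11, 23), (15, 11, 23), (14, 12, 23)]
cluster_b = [(16, 13, 25), (16, 13, 25), (17, 13, 25), (16, 14, 25)]

var_a = mean_locus_variance(cluster_a)
var_b = mean_locus_variance(cluster_b)

# Treating both clusters as one population mixes two modal haplotypes,
# so the between-cluster difference is counted as accumulated variance.
var_pooled = mean_locus_variance(cluster_a + cluster_b)

print(f"cluster A: {var_a:.3f}")
print(f"cluster B: {var_b:.3f}")
print(f"pooled:    {var_pooled:.3f}")
```

In this toy case the pooled variance is several times larger than that of either cluster, so any age estimate that scales with accumulated variance would be inflated accordingly if the two clusters were not analyzed separately.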

Similarly, haplotype variance calculations based on haplogroup N1c do not support an east to west dispersal, given that northeastern European populations, such as Finland (0.223), Estonia (0.206), Tver (0.183), Arkhangelski (0.226), and Kursk (0.167), possess higher variance levels than the Komi Izhemski and Priluzski collections (0.079 and 0.121, respectively). As such, these results suggest that, instead of the previously reported migratory scenario from the Urals to the west,14, 15 the flow of N1c may have occurred in the opposite direction. As older ages are observed when grouping All Asians versus All Europeans (Table 5) for N1c, the available data suggest that the mutation may have originated in northern China as previously reported,14, 15 but may have followed a different migratory route from that postulated elsewhere,15 reaching northeastern European populations before the Urals. The presence of haplogroup I (of European descent) in both Komi populations (specifically I-M253), in turn, suggests that European groups have contributed to these populations’ gene pools. The absence of I-M253 in the Khanty of West Siberia completes the demic decrease of this haplogroup (Europe–Urals–West Siberia), supporting the stipulated west to east Y-driven migration.
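The variance comparison above can be laid out explicitly. The short sketch below uses only the N1c variance values quoted in the preceding paragraph. The grouping into Komi and northeastern European collections follows the text, while the variable names are illustrative.

```python
# N1c Y-STR variance values as quoted in the text above.
reported_variance = {
    "Finland": 0.223,
    "Estonia": 0.206,
    "Tver": 0.183,
    "Arkhangelski": 0.226,
    "Kursk": 0.167,
    "Izhemski Komi": 0.079,
    "Priluzski Komi": 0.121,
}

# The two Uralic collections discussed in this section.
komi = {"Izhemski Komi", "Priluzski Komi"}

max_komi = max(v for pop, v in reported_variance.items() if pop in komi)
min_european = min(v for pop, v in reported_variance.items() if pop not in komi)

# Populations ordered from highest to lowest variance.
ranked = sorted(reported_variance, key=reported_variance.get, reverse=True)

for pop in ranked:
    print(f"{pop:15s} {reported_variance[pop]:.3f}")
print("Lowest European value exceeds highest Komi value:", min_european > max_komi)
```

Every northeastern European collection listed has a higher value than either Komi collection, which is the pattern the argument for a west to east flow relies on. The comparison is descriptive only and does not account for sample sizes or the uncertainty of the estimates.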

Haplogroup N1b has been reported to have separated into two clades of similar ages about 6.2 and 6.8 kya for Asia and Europe, respectively.15 Yet, time and variance estimations in this study indicate a much older origin for the haplogroup (12.9±4.1 kya) in the Priluzski Komi collection (Table 4). Comparable age estimates were obtained using two sets of 6 and 15 Y-STR markers; the battery of 6 loci is included in the group of 15. In the Network Analysis, the Khanty collection segregates into one portion of the bi-cluster topology observed (Supplementary Figure 3b) along with the Slavic populations identified as carrying Asian haplotypes,23 whereas the Komi from Izhemski and the Slavic Russian populations partition toward the other extreme of the projection. However, the Komi from Priluzski exhibit a bipartite distribution throughout the two star-like sub-clusters, providing an explanation for the population's high variance and old age estimates. These results make it possible to contemplate a scenario where the Komi from Priluzski have contributed differentially to populations within Asia and the Slavic domain. These findings should be further explored by examining other Uralic populations to elucidate whether the mutation did originate among the Priluzski Komi or whether other peoples within the region exhibit older age estimates and higher accumulated STR variance. Nevertheless, with the Khanty population, located eastward in northwest Siberia, exhibiting the most recent age and variance estimations, the data implicate migrations from the Urals into Siberia and Asia rather than the converse.