The global variation of the largely non-recombining male-specific Y chromosome has become one of the major sources in reconstructing ancient human migrations. The resolution of the Y chromosome phylogeny has increasingly been improved by the discovery of new binary polymorphisms, mostly single nucleotide polymorphisms (SNPs), that may, through their distinct geographic patterns, bear evidence of historic relationships between living populations.1, 2, 3, 4, 5

One of the most widespread and frequent branches of the Y phylogeny in Eurasia is haplogroup (hg) NO, defined by SNP marker M214 5 (corrected phylogeny in Cinnioglu et al6) (see Figure 1). It comprises a small number of NO* lineages that lack distinguishing derived SNP markers (Figure 2a) and two frequent sister clades, N and O, defined by markers M231 and M175, respectively (Figure 2b,c). Although the phylogeography of clade O has drawn considerable scrutiny,7, 8, 9, 10 knowledge about hg N is relatively limited with regard to its origin, phylogeographic patterning and demographic significance.

Figure 1

Phylogeny of NO clade. Phylogenetic relationships of the NO clade and its subclades together with the defining SNP markers. Mutation labeling follows the YCC nomenclature.1, 4 Left of the phylogeny, the ages in 1000 years (ky) of the splits between subclades are shown.

Figure 2

Geographical distribution of NO clade. (a–g) Spatial frequency distributions of the NO clade: NO*, N (overall distribution of hg N), O (overall distribution of hg O), N*, N1, N2, N3. Maps are based on data from Supplementary Table 1. We label various panels following the YCC ‘by mutation’ format by adding the relevant mutation suffix.

Materials and methods

Samples and DNA typing

A total of 5389 samples from 58 populations in different geographical regions were genotyped, or updated to the present phylogenetic resolution, in this study (M9-derived samples with the ancestral allele of the 92R7 marker were typed for M214, M231, M128, P43, Tat and M175). They were analyzed together with literature data on 8019 individuals from 90 populations (data presented in Supplementary Table 1). DNA samples were obtained from unrelated male volunteers after getting the informed consent from ethical committees of institutions involved.

Mutation labeling follows the YCC nomenclature.1, 4 The phylogenetic relations of markers M128, P43 and Tat, which characterize three subclades (N1–N3, respectively), were known earlier,2, 11, 12 but marker M231 6 (characterizing the whole N clade) was introduced into the tree of Y chromosome diversification only recently (see Figure 1). Marker M231 is phylogenetically equivalent to the more cumbersome LLY22g polymorphism,13 initially used to define haplogroup N.
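The marker hierarchy described above (M214 for NO, M175 for O, M231 for N, and M128, P43 and Tat for N1–N3) can be expressed as a simple decision rule. The following is a minimal sketch, not the typing pipeline used in this study; the function name and the representation of a genotype as a set of derived markers are assumptions made for illustration, and inconsistent genotypes (for example, two subclade markers derived at once) are not checked.

```python
def assign_haplogroup(derived_markers):
    """Assign an NO-clade label from the set of markers carrying the derived allele.

    Hypothetical helper: 'derived_markers' is a set of marker names such as
    {"M214", "M231", "Tat"}. Markers not in the set are treated as ancestral.
    """
    if "M214" not in derived_markers:
        return "outside NO"          # M214 defines the whole NO clade
    if "M175" in derived_markers:
        return "O"                   # M175 defines clade O
    if "M231" in derived_markers:    # M231 defines clade N
        if "M128" in derived_markers:
            return "N1"
        if "P43" in derived_markers:
            return "N2"
        if "Tat" in derived_markers:
            return "N3"
        return "N*"                  # N without any known subclade marker
    return "NO*"                     # NO without N or O markers


if __name__ == "__main__":
    print(assign_haplogroup({"M214", "M231", "P43"}))
```

Paragroups (NO*, N*) fall out of the rule as the cases in which a clade-defining marker is derived but none of the downstream markers is.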

Markers M128 and M214 2, 5 were assayed by sequencing the polymorphic sites, and markers M175,2 P43 11 and Tat 12 were assayed by the restriction-fragment length polymorphism (RFLP) method using restriction enzymes MboII, NlaIII and TaiI, respectively. The allelic state of M231, a SNP first described in Cinnioglu et al6 and originally assayed by denaturing high performance liquid chromatography, can be readily assayed by RFLP analysis (the TaqI enzyme cuts the ancestral allele G, producing 223 and 108 bp products, and does not cut the derived allele A, which remains at a length of 331 bp).
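The TaqI readout for M231 can be summarized as a small lookup on observed fragment sizes. The sketch below is illustrative only: the function name and the sizing tolerance of 5 bp are assumptions, not part of the published protocol; only the fragment lengths (223 and 108 bp for the cut ancestral allele, 331 bp for the uncut derived allele) come from the text.

```python
CUT_FRAGMENTS_BP = (223, 108)   # TaqI cuts the ancestral allele G
UNCUT_LENGTH_BP = 331           # derived allele A is not cut


def call_m231(fragments_bp, tolerance_bp=5):
    """Call the M231 allele from gel fragment sizes (in bp).

    'tolerance_bp' is an assumed allowance for sizing error on a gel.
    Returns "G (ancestral)", "A (derived)" or "undetermined".
    """
    sizes = sorted(fragments_bp, reverse=True)

    def close(observed, expected):
        return abs(observed - expected) <= tolerance_bp

    if len(sizes) == 1 and close(sizes[0], UNCUT_LENGTH_BP):
        return "A (derived)"
    if (len(sizes) == 2
            and close(sizes[0], CUT_FRAGMENTS_BP[0])
            and close(sizes[1], CUT_FRAGMENTS_BP[1])):
        return "G (ancestral)"
    return "undetermined"       # e.g. partial digestion or failed PCR


if __name__ == "__main__":
    print(call_m231([331]))
    print(call_m231([108, 223]))
```

A pattern showing all three bands would be returned as undetermined here; on a haploid Y chromosome such a pattern would more plausibly reflect incomplete digestion than heterozygosity.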

STRs were studied using Y-filer Kit (Applied Biosystems, Foster City, CA, USA). PCR products were analyzed on ABI 3100Avant genetic analyzer (Applied Biosystems) in the mode of standard fragment analysis protocol. GeneScan 500LIZ size standard (Applied Biosystems) was added to each sample for size scaling, and GeneMapper 3.5 (Applied Biosystems) was employed for allele scoring. Alleles were designated by repeat numbers.

Data analysis

Using the program Network, a median joining network of hg N–O haplotypes was constructed from data on 17 STRs (DYS19, DYS385a,b, DYS389I,II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, Y GATA) and bi-allelic markers (M231, M128, P43, TAT) in 58 individuals (STR data presented in Supplementary Table 2). Phylogenetic relationships between the haplotypes were determined by the median joining method after the data had been processed with the reduced median method described in Bandelt et al14.

The age calculations are based on STR variation and calculated according to published methods.15, 16
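As an illustration of how an age can be derived from STR variation, the sketch below computes the average squared difference (ASD) in repeat number between each chromosome and a founder haplotype, averages it across loci, and divides by a per-generation mutation rate. Several points are assumptions rather than details given in this section: the founder is approximated by the per-locus median, the default rate of 6.9e-4 per locus per 25-year generation is the figure commonly quoted for the evolutionary effective rate, and no standard error is computed. It should be read as a simplified sketch of this family of estimators, not as the exact procedure of the cited methods.

```python
from statistics import median


def age_from_str_variation(haplotypes,
                           rate_per_generation=6.9e-4,   # assumed default
                           years_per_generation=25):     # assumed default
    """Estimate the age (in years) of accumulated STR variation.

    'haplotypes' is a list of equal-length lists of repeat counts,
    one list per Y chromosome and one entry per STR locus.
    """
    n_loci = len(haplotypes[0])
    n_chr = len(haplotypes)

    # Approximate the founder haplotype by the median repeat count per locus.
    founder = [median(h[i] for h in haplotypes) for i in range(n_loci)]

    # Average squared difference from the founder, per locus, then over loci.
    asd_per_locus = [
        sum((h[i] - founder[i]) ** 2 for h in haplotypes) / n_chr
        for i in range(n_loci)
    ]
    mean_asd = sum(asd_per_locus) / n_loci

    generations = mean_asd / rate_per_generation
    return generations * years_per_generation


if __name__ == "__main__":
    # Hypothetical repeat counts at two loci in four chromosomes.
    sample = [[14, 10], [14, 10], [15, 10], [14, 11]]
    print(round(age_from_str_variation(sample)))
```

Because the estimator assumes single-step mutations, a founder lineage that differs by multi-repeat jumps at several loci would inflate the ASD and hence the age.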

Spatial frequency maps of NO clade and subclades were obtained applying the frequencies from Supplementary Table 1 (dots indicate the populations) in Surfer software (version 7, Golden Software, Inc.).

Results and discussion

Different distribution-pattern of N2 and N3 versus NO*, O and N*

Haplogroup N has a distribution that is both unique and widespread, spanning northern Eurasia from the Far East to Eastern Europe and showing higher frequencies at high latitudes.11, 12, 17 Here, we assess the history of this haplogroup via a detailed phylogeographic approach using samples from different regions of Europe, East/Southeast Asia and Oceania, ascertaining SNP markers defining haplogroup N, its subclades and sister-clade O. The analysis reveals that despite its ancient split from hg O, hg N subclades display more recent demographic temporality and a net counter-clockwise migratory trajectory distinctive from its hg O counterparts.

Although having variable frequency scales, the spatial distributions for ancestral paragroup NO-M214*, paragroup N-M231* and the prevalent hg O-M175 (Figure 2a, c, d) are generally congruent and highlight Southeast Asia as the most parsimonious source region of these clades. The spread pattern of paragroup NO* approximates the same regions of Southeast Asia as paragroup N*, although being present at an even lower frequency compared with N*18, 19 (data from Kayser et al19 updated in present study). More notable, however, is the fact that the spatial dynamics of the whole N and O haplogroups greatly differ from each other. The split between N* and O is dated to 34.6±4.7 thousand years (ky). The age of STR variation of hg O in Southeast Asia probably exceeds 26 ky,10 and its numerous subclades currently predominate in southern and southeastern Asia extending into northern China, Manchuria and some Siberian populations,7, 9, 11, 20, 21 as well as westward to the eastern sector of the Indian subcontinent10 and eastward to Oceania.18, 19

Distribution and spread of haplogroup N subclades

The N-haplogroups reflect a more recent demographic history. Ancestral paragroup N* is widely distributed, although with low frequencies, from Fiji, Borneo, Cambodia, southern China and Japan up to southern Siberia (Supplementary Table 1), while apparently absent in the Indian Peninsula.10 Its age of accumulated STR variation, estimated using the method from Zhivotovsky et al15, points to the late Pleistocene–early Holocene (11.9±2.5 to 12.6±3.1 ky), depending on the number of Y chromosomes and STR loci included in the analysis (see Table 1). However, it should be noted that the frequency of N* is extremely low.

Table 1 Coalescent times of haplogroups

In this regard, the age of accumulated STR variation in hg N, estimated on all combined data from N1, N2 and N3 at 15 loci (Supplementary Table 2), yields an estimate of 19.4±4.8 ky. However, as will be argued below, the European subcluster of N2 and the Yakutian N3 might have descended from single founders with multiple jumps at several loci, thus causing a possible shift in statistical estimates that assume a step-wise mutation model. When those chromosomes are excluded, the age of hg N STR diversity is somewhat younger, 14.2±4.0 ky.

Time calculations based on evolutionary- and pedigree-based methods give significantly different date estimates (Table 1). Both estimates are included because a consensus has not yet been reached among geneticists. Recent simulations demonstrate that pedigree rates do not consider the evolutionary consequences of population dynamics, such as the rapid extinction of newly arisen microsatellite alleles (Zhivotovsky, Underhill and Feldman).23 Thus, time estimates based on pedigree studies are younger and inconsistent with the archaeological record. Additional factors relevant to the issue include (i) ascertainment bias (studies reporting no mutations in a pedigree are less likely to be published); (ii) rate variation between loci, whereby the pedigree rate yields the average rate of the fastest evolving loci; (iii) saturation (the evolutionary rate calibration misses back-and-forth mutations).

The median joining network (Figure 3), based on 17 STR loci (Supplementary Table 2) and SNPs, shows the extent of variation within hg N subhaplogroups. Despite the current presence of N3 and N2 in various Siberian populations, including Chukchi and Yupik from Chukotka Peninsula in Beringia, these haplogroups are absent among Native Americans.24, 25, 26, 27, 28 This finding suggests that hg N chromosomes were likely not among the dominant and omnipresent types in Palaeolithic Siberians at the time of their likely colonization of the Americas some 12–17 ky ago,29 although the possibility that the N lineages became extinct during the colonization due to founder effect or drift cannot be excluded.

Figure 3

Median network of N–O haplogroups. Median joining network of hg N–O haplotypes was constructed based on data of 17 STRs and bi-allelic markers in 58 individuals by using the program Network. Each circle represents a haplotype, defined by a combination of STR markers. Circle size is shown proportional to haplotype frequency, according to data presented in Supplementary Table 2. Haplotypes are labeled as follows: Al–Altaian, Ba–Bashkir, Ch–Chinese, Ci–Chukchi, Cu–Chuvash, Eo–Eskimo, Ee–Estonian, Ev–Evenk, Fj–Fiji, Ka–Karelian, Kh–Khakash, Ko–Komi, Ma–Mari, Ru–Russian, Sl–Slovak, Ta–Tatar, Tu–Tuva, Ud–Udmurt, Uk–Ukrainian, Vp–Vepsa, Ya–Yakut. Colors indicate the subdivisions inside haplogroups: light green for European N3-haplotypes, dark green for Asian N3-haplotypes, light blue for European N2-haplotypes, blue for Asian N2-haplotypes, brown for N1-haplotypes, pink for N* haplotypes, purple for O-haplotype.

Haplogroup N3 is the most common subclade of hg N (Figure 2g, Supplementary Table 1), being almost universally the most frequent Y chromosome type among populations inhabiting north Eurasia,11, 12, 17, 26, 30, 31, 32, 33, 34, 35 while occurring at only marginal frequencies in China, Korea, Borneo and Japan.18, 19 Although N3 is prevalent throughout northern Asia, its distribution in Europe is restricted to the northern and eastern populations, showing a sharp east–west decline across Scandinavia and between Lithuania and Poland (Supplementary Table 1 and data in 17). The phylogeography of the NO* and N* lineages (Figure 2a, d) and the presence of N* chromosomes in southern East Asia (South China and Cambodia, see Supplementary Table 1) suggest that this region could be the source of the initial spread of hg N. In this scenario, the Altay/Sayan/southern Siberia region might have been a place of transition of hg N westward, as all major subclades of hg N are still to be found there.

Although the frequency of hg N3 is low in northern China and restricted to a few small populations, its STR variance is higher (0.26, averaged across eight loci: DYS19; DYS389I, II; DYS391; DYS392; DYS393 and DYS439, data from Sengupta et al10) than in Altai and in the Volga-Ural region (0.16 and 0.17, respectively), thus again pointing to northern China rather than southern Siberia as a possible place of expansion of hg N3. The age of accumulated N3-STR variation in North China is 11.8±6.8 ky, which falls at the boundary of the Pleistocene and Holocene, although this estimate should be treated with caution because of its very large standard error, caused by the limited sample size of the N3 chromosomes.
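The STR variance figures compared above are averages of per-locus variances in repeat number. A minimal sketch of that summary statistic follows; the function name and input layout are assumptions, and the use of the sample (n-1) variance is also an assumption, since the text does not state which estimator was applied.

```python
from statistics import variance


def mean_str_variance(haplotypes):
    """Average the variance in repeat number across STR loci.

    'haplotypes' is a list of equal-length lists of repeat counts,
    one list per chromosome. Requires at least two chromosomes,
    because the sample variance is undefined for a single value.
    """
    n_loci = len(haplotypes[0])
    per_locus = [
        variance([h[i] for h in haplotypes])   # sample variance (assumed)
        for i in range(n_loci)
    ]
    return sum(per_locus) / n_loci


if __name__ == "__main__":
    # Hypothetical repeat counts at two loci in three chromosomes.
    print(mean_str_variance([[14, 10], [15, 10], [16, 10]]))
```

Comparisons between populations are meaningful only when the same set of loci is used, which is why the loci included are listed alongside each variance in the text.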

According to our scenario, on the way through Siberia to eastern Europe, the N3-carriers might have been subjected to founder effects or strong genetic bottlenecks. Northeastern Europe can be considered as a place of secondary expansion of N3. Indeed, hg N3 occurs at high frequencies in the Volga-Ural Ugric groups and related Finns, Saami and Estonians. One may notice that while STR variation is relatively low in the Volga-Ural group, some north-European populations have high STR variance (eg, 0.32 in Finns: data from,36 without DYS385ab). The high STR variation among the latter, however, might not be a result of a long-term in situ differentiation of the founder lineage, but, rather a consequence of an admixture of separate N3 founder types.

Populations of eastern Europe on the most distant western border of the N3 spread area that have considerable frequencies of hg N3 from single sources are expected to have lower STR variation. As an example, STR variance in Baltic-Lithuanians and Latvians is 0.12 and 0.09, respectively (data on five loci).37 Some European populations have a low frequency of hg N3 combined with high STR variation, as in non-Saami Norwegians (0.27: data without DYS439), which may indicate recent gene flow from the neighboring Finno-Ugric populations.36, 38 A similar situation for Swedes was described recently in a study of the Swedish Y-chromosomal pool.39

Phylogenetic analysis of STR variation (Figure 3) shows two overlapping subclusters of N3, one of them encompassing predominantly Volga-Ural region, Finnic- as well as Turkic-speaking populations together with Altaian, and the other one both Baltic-Finnic (Estonians, Karelians and Vepsa) and east Slavs (Russians, Ukrainians), as well as West-Slavonic Slovak N3 chromosomes. The Yakut Y chromosomes form their specific branch; they are almost identical to each other, consistent with earlier studies.12, 40

The haplogroup N2 distribution (Figure 2e) exhibits an irregular frequency pattern in Siberian populations, extending in the western direction to eastern Europe as far as Vepsas and Karelians at the Baltic Sea. The highest frequencies of N2 are observed among north-west Siberian populations: 92% in the Nganassan, 78% in the Enets and 74% in the Tundra Nenets.11 In Europe, the N2 types have their highest frequency of 20% among Volga-Uralic populations.17 The extreme western border of the spread of N2 is Finland, where this haplogroup occurs only at marginal frequency – 0.4%.36 Yet interestingly, N2 is quite frequent among Vepsas (17.9%), a small Finnic population living in immediate proximity to Finns, Karelians and Estonians.

The network of N2 haplotypes shows a well-resolved bipartite STR distribution with separate Asian (Siberian) and European subclusters, denoted here by N2-A and N2-E, respectively (Figure 3). It can be speculated that the nearest Asian putative root subcluster, N2-A, originated first, later giving rise to the derived European subcluster, N2-E. Although N2-A has median repeat scores more similar to those for N3, the European subcluster N2-E differs sharply from N2-A in its STR composition at several loci (Table 2), thus suggesting that the European N2-chromosomes descended from a single founding haplotype. One can even speculate on the probable existence of binary polymorphisms yet to be discovered which would be unique to the N2-E cluster.
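The comparison of median repeat scores between subclusters can be sketched as follows. The function names, the dictionary layout and the repeat counts in the usage example are hypothetical and chosen only for illustration; they are not values from Table 2.

```python
from statistics import median


def median_repeat_scores(haplotypes):
    """Return the median repeat count per locus for a group of haplotypes.

    'haplotypes' is a list of dicts mapping locus name -> repeat count.
    """
    loci = haplotypes[0].keys()
    return {locus: median(h[locus] for h in haplotypes) for locus in loci}


def differing_loci(group_a, group_b, min_diff=1):
    """List loci whose median repeat scores differ by at least 'min_diff'."""
    med_a = median_repeat_scores(group_a)
    med_b = median_repeat_scores(group_b)
    return {
        locus: (med_a[locus], med_b[locus])
        for locus in med_a
        if abs(med_a[locus] - med_b[locus]) >= min_diff
    }


if __name__ == "__main__":
    # Invented repeat counts, for illustration only.
    asian = [{"DYS390": 23, "DYS391": 10}, {"DYS390": 23, "DYS391": 10},
             {"DYS390": 24, "DYS391": 10}]
    european = [{"DYS390": 25, "DYS391": 10}, {"DYS390": 25, "DYS391": 10},
                {"DYS390": 25, "DYS391": 11}]
    print(differing_loci(asian, european))
```

Loci at which the medians differ by more than one repeat are the ones most relevant to the single-founder argument, since they imply either several single-step mutations or a multi-repeat jump on the founding lineage.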

Table 2 Median repeat scores at the most informative STR markers for different N-haplogroups

N2-A and N2-E clusters are relatively young – the ages of accumulated STR variation in N2-A and N2-E are 6.2±2.0 and 6.8±2.9 ky, respectively; the lower value for presumably older hg N2-A can be explained by stronger bottlenecks in Siberian populations and by small sample sizes. Indeed, the indigenous Siberian populations are very small in size compared with most of east European populations; even the most numerous of the former, Yakuts and Buryats, reach only a few hundreds of thousands – compared with many millions of east Europeans.41, 42

Among our samples, N2-E is mainly restricted to the Volga-Ural region, which might be a possible source region for the northward and eastward (Khants and Mansis; data from Stepanov et al43) gene flow of N2. In contrast to NW-Siberian N2-A STR profile, the more western lineages are of the N2-E type. Interestingly, 14 N2-individuals from Turkey, data from Cinnioglu et al6 (updated in this study), belong to the Asian subcluster N2-A, suggesting that the clade N2 might have geographically expanded from Siberia westward by at least two different flows: one northwest through the Volga-Ural region, giving rise to N2-E, probably mainly via the Finno-Ugric group, and the other, N2-A, southwest together with Turkic languages. Therefore, the distinctive difference of N2-E from N2-A in their STR composition, as well as data on similarity of STR profiles at N3 in the Yakut, indicates the consequence of multiple postglacial founder events, especially in the re-peopling of sparsely inhabited territories, consistent with the view on Central Asia as the ‘land of bottlenecks’.44

The least frequent N subclade is N1 (Figure 2f), distributed with low frequencies in some Central Asian populations, Koreans, Northern Hans and Manchurian Evenks. Further large-scale studies at the present level of phylogenetic resolution (earlier literature often does not provide the necessary data) are needed to say more about the spread and distribution pattern of this clade.

In summary, Y chromosome haplogroup N presents a case of gene flow to eastern Europe that has its likely ultimate source in east Asia. There are no equal mtDNA counterparts for the NRY hg N narrative – the mtDNA haplogroups characteristic of southeast Asian populations occur in the east Baltics with a total frequency of less than 1%.17, 45, 46 Only some minor twigs of the Asian mtDNA tree, like Z1 and D5, having high diversity in Altai/Central Asia, occur at above 1% in some Nordic populations like Saami and Finns.17, 45, 47 However, numerous mtDNA haplogroups, such as B, C, D, F and G, do span from South China to Siberia and Central Asia, up to the Ural Mountains and, at already lower frequencies, to the Turkic and Finno-Ugric populations of the Volga basin.48, 49, 50, 51, 52, 53

Although the frequency scales of these haplogroups are significantly different across different loci, this independent evidence provided by maternal ancestry supports significant pre-historic migration of humans from southeast Asia, back to the West via the counter-clockwise northern route.