Introduction

Understanding how the ancestors of modern humans successfully settled in and adapted to areas with extreme conditions, such as highlands, remains a topic of wide interest. For instance, a recent study of mitochondrial genome information revealed that modern humans have successfully colonized the Tibetan Plateau since the Late Pleistocene.1 As a neighbor of Tibet, Nepal is located at the southern piedmont of the Himalayas and is bordered by India and China. Like the Tibetans, the Nepalese are also well known for living at high altitude. Consistent with the language affiliation of the Nepalese populations (viz. Indo-European and Tibeto-Burman),2, 3 recent studies on the Nepalese have detected substantial genetic contributions from South Asians and from East and West Eurasians,4, 5 and it has become evident that the genetic landscape of the Nepalese has been largely shaped by later immigrants from the neighboring regions.4, 5, 6 For instance, South Asian genetic components (for example, Y-chromosome haplogroups H-M52*, H-M69*, H1-M82*, H1-M370*, R1a1-M198 and R2-M124;4, 6 mitochondrial DNA (mtDNA) haplogroups M31, M33, M35, M38, R6 and R305, 7) are prevalent in the Nepalese, reflecting extensive connections between Nepal and India, whereas the East Eurasian influence on the Nepalese is surprisingly substantial, as manifested by the presence of Y-chromosome haplogroup O3-M1174, 5 and mtDNA haplogroups B5, D4 and G.5, 7 As the Himalayas act as a formidable geographical barrier between Nepal and East Asia (Tibet), how these East Eurasian components dispersed to Nepal has become a matter of debate.

Recently, by comparing the Y-chromosome lineages of the Nepalese and Tibetan populations, Gayden et al. proposed that the East Eurasian genetic components were introduced into Nepal directly from Tibet,4, 6 a scenario seemingly in agreement with the hypothesized retreat of Baric speakers.8 However, given the geographic proximity of Nepal to northeast India, the possibility that these East Eurasian components trace their origins to northeast India cannot be completely ruled out. Indeed, abundant East Eurasian mtDNA lineages have already been detected in northeast Indian populations,9, 10, 11, 12 raising the possibility that the East Eurasian influence traces back to the dispersal of Tibeto-Burman people who arrived in Nepal via northeast India.5 It is therefore plausible that the bearers of the East Eurasian genetic components arrived in Nepal either directly from Tibet (across the Himalayas) or from northeast India. To test both scenarios and gain deeper insight into the origin of the Nepalese, a simple approach is to compare, at higher molecular resolution, the phylogenetic affinity of the East Eurasian lineages observed in Nepal, northern India (including northeast, north and northwest India) and Tibet, which has become feasible with the recently released Tibetan mtDNA data.1, 13

To achieve this objective, mtDNA variation in 246 Nepalese individuals from Kathmandu and eastern Nepal was analyzed in this study (Supplementary Table S1, Supplementary Material online), and the recently released mtDNA data from Nepal5 and the neighboring regions, especially those from Tibet1, 13 and northeast and northwest India,9, 10, 11, 12, 14, 15, 16, 17 were also considered. To further understand the Nepalese mtDNA landscape, a total of 21 representative individuals were selected for complete mtDNA genome sequencing in order to determine their phylogenetic status unambiguously (Figure 2).

Materials and methods

Sample collection

Blood samples of 246 unrelated individuals were collected from Kathmandu and eastern Nepal with informed consent (Figure 1 and Table 1), and this project was approved by the Ethics Committee at Kunming Institute of Zoology, Chinese Academy of Sciences. The geographical locations of the populations are displayed in Figure 1.

Figure 1

Sampling locations of the populations analyzed in this study. The Nepalese populations collected in this study are highlighted with solid pentacles, and the black dots represent the populations retrieved from the literature.

Table 1 Information of the 43 populations analyzed in this study

DNA amplification, sequencing and quality control

Total DNA was isolated with the standard phenol/chloroform method and stored at −80 °C. An mtDNA segment (spanning nucleotide positions 16024–16569/1–576 and covering the whole mtDNA control region) was amplified and sequenced as fully described in our previous works.1, 18 Mutations were recorded by comparison with the revised Cambridge reference sequence (rCRS).19 All individuals were allocated to specific haplogroups based on their control-region information; the assignments were further confirmed by typing additional diagnostic coding-region mutations according to the reconstructed phylogenetic trees of East Asian,1, 20, 21, 22, 23, 24, 25 South Asian5, 7, 12, 26, 27, 28, 29, 30, 31, 32, 33 and Southeast Asian34, 35, 36, 37, 38 mtDNAs (Supplementary Table S1, Supplementary Material online). For the mtDNA samples of interest, the entire genome was amplified and sequenced as described elsewhere.1, 18 To avoid potential problems in mtDNA data quality, the necessary quality control measures21 and caveats39 were followed as described previously. The new mtDNA sequences reported in this study have been deposited in GenBank under accession numbers JF742217–JF742461 (for control-region sequences) and JF742196–JF742216 (for whole-mitochondrial genomes).

Data analysis

Principal component analysis (PCA) was conducted based on the East Eurasian haplogroup frequencies as described previously40 (Supplementary Table S2, Supplementary Material online). Genetic distances (Table 2) were estimated using the Arlequin 3.11 package.41 Admixture estimation was performed by the weighted least squares (WLS) method using the Statistical Package for the Social Sciences (SPSS) 14.0 software (Table 3).42 The reduced median networks of the haplogroups of interest were constructed using the Network 4.510 program (http://www.fluxus-engineering.com/sharenet.htm) and adjusted manually (Figures 4 and 5).43 The ages of two lineages (Table 4), one further defined by mutation 16193 and tentatively named G2a2, and the other characterized by mutations 16145 and 16316 and a back mutation at site 16362 and designated M9a1a2a, were estimated using the ρ statistic44, 45 with the suggested calibration rates.44, 46
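As an illustration of the PCA step, the following minimal Python sketch computes principal component scores from a population-by-haplogroup frequency matrix. It is not the procedure used in this study: the population names, the frequency values and the helper pca_scores are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical haplogroup-frequency table: rows are populations, columns are
# East Eurasian haplogroups. All numbers are invented for illustration only.
populations = ["PopA", "PopB", "PopC", "PopD"]
haplogroups = ["A11", "C", "G2a", "M9a"]
freq = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.12, 0.18, 0.32, 0.38],
    [0.40, 0.30, 0.20, 0.10],
    [0.38, 0.32, 0.18, 0.12],
])


def pca_scores(x, n_components=2):
    """Return PC scores and explained-variance ratios for the rows of x."""
    centered = x - x.mean(axis=0)  # center each haplogroup column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # project rows onto PCs
    explained = (s ** 2) / np.sum(s ** 2)  # share of variance per component
    return scores, explained[:n_components]


scores, explained = pca_scores(freq)
for name, (pc1, pc2) in zip(populations, scores):
    print(f"{name}: PC1={pc1:+.3f} PC2={pc2:+.3f}")
```

Populations with similar haplogroup profiles (here PopA and PopB, or PopC and PopD) receive similar scores on the first component, which is the kind of clustering a PCA plot makes visible.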

Table 2 The genetic distance between populations from Nepal, China (Tibet) and northern India based on the East Eurasian matrilineal components
Table 3 Admixture analysis of the Nepalese by comparing with its potential parental populations
Table 4 Estimated ages of haplogroups G2a2 and M9a1a2a based on the calibration rates proposed in Forster et al.44 and Soares et al.46

Results and discussion

On the basis of the combined information from control-region and partial coding-region segments, the majority (96.34%; 237/246) of the Nepalese mtDNAs could be unambiguously allocated to defined haplogroups of East Eurasian (36.59%; 90/246),1, 20, 21, 22, 23, 24, 25 South Asian (51.63%; 127/246)5, 7, 11, 12, 26, 27, 28, 29, 30, 31 and West Eurasian (8.13%; 20/246)26, 31, 47, 48 ancestry (Supplementary Table S2, Supplementary Material online). The genetic components of East Eurasian (36.59%) and South Asian (51.63%) ancestry thus make up the vast majority of the Nepalese gene pool (Supplementary Table S1, Supplementary Material online),5 and this pattern remains largely stable for both the East Eurasian (45.11%; 189/419) and South Asian (47.49%; 199/419) components after taking into account the recently reported Nepalese mtDNA data.5 As for the 21 samples with ambiguous phylogenetic status, complete sequencing of their mtDNA genomes revealed that virtually all of them in fact belong to already defined haplogroups, such as M3, M5, M18, M30, M35, M43, D4, R8 and M60. Of note, besides two singular branches identified in this study, we also defined a novel haplogroup characterized by variations 9266 and 11827, which is named M81 here (Figure 2).

Figure 2

Reconstructed mtDNA tree of the completely sequenced representatives of the major Nepalese mtDNA lineages. The phylogenetic tree was reconstructed on the basis of 58 mitochondrial genomes, of which 21 mtDNAs were sampled from Kathmandu, Nepal and generated in this study, whereas the remaining 37 related mtDNAs were collected from the literature.5, 26, 27, 31, 32, 33 Mutations are recorded according to the rCRS.19 Suffixes A, C, G and T indicate transversions, “d” signifies a deletion and a plus sign (+) an insertion; recurrent mutations are underlined. The prefix “h” indicates heteroplasmy and “@” highlights a back mutation. Length polymorphisms (for example, 309+C, 309+2C, 315+C and 315+2C) were ignored during the tree reconstruction. The reconstruction of highly recurrent mutations (for example, 16519, and the insertion/deletion of “CA” repeats in region 514–523) is tentative at best.

To gain more insight into the origin of the East Eurasian maternal components observed in the Nepalese, and thereby test the two competing scenarios of how these components were introduced into Nepal,4, 5, 8 we focused on the phylogenetic affinity between the East Eurasian haplogroups identified in the Nepalese and those from the Tibetan, northeast Indian and northwest Indian populations. Figure 3 shows the principal component analysis plot of the 43 populations under study, which was constructed solely from the East Eurasian lineages. Among the five Nepalese populations under study, three clustered with the Tibetans (Figure 3). When all the Nepalese regional populations were treated as a whole and Fst values were calculated against the populations from the neighboring regions, the smallest genetic distance was observed between the Nepalese and the Tibetans (Table 2). With the Tibetans and northern Indians taken as the parental populations, the admixture estimation revealed that the Tibetans made the major contribution to virtually all Nepalese populations (except for the eastern Tharu population; Table 3). We then compared the phylogenetic affinity of the East Eurasian lineages observed in the Nepalese (including haplogroups A11, C, G2a, M9a, F1c and Z; Figures 4 and 5) with those from the neighboring regions, for example, Tibet and northeast and northwest India, by means of median networks.43 On the basis of the constructed networks (Figures 4 and 5), several features could be observed: (1) the Nepalese share some basal or internal haplotypes with the Tibetans; (2) the Nepalese harbor a number of unique haplotypes at the terminal level, most of which branch off directly from nodes occupied almost exclusively by Tibetan lineages; and (3) only a few haplotypes are shared sporadically between the Nepalese and the northern Indians. Taken together, the Nepalese lineages of East Eurasian ancestry generally show much closer affinity with those from Tibet, although a few mtDNA haplotypes, likely resulting from recent gene flow, are shared between the Nepalese and northern (including northeast) Indians (Figures 4 and 5).
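To make the admixture estimation concrete, the sketch below shows a two-parent weighted least squares estimate in Python. It assumes the simple linear model hybrid = m × parent A + (1 − m) × parent B across haplogroup frequencies. The function name admixture_wls, the equal default weights and all frequencies are hypothetical; the weighting scheme actually used in SPSS for Table 3 is not specified in the text.

```python
def admixture_wls(hybrid, parent_a, parent_b, weights=None):
    """Estimate m, the share of parent_a, in hybrid = m*a + (1 - m)*b.

    Minimizes sum_i w_i * (h_i - (m*a_i + (1 - m)*b_i))**2 over m, which
    has the closed form used below.
    """
    if weights is None:
        weights = [1.0] * len(hybrid)  # assumption: equal weights
    num = 0.0
    den = 0.0
    for h, a, b, w in zip(hybrid, parent_a, parent_b, weights):
        diff = a - b
        num += w * (h - b) * diff
        den += w * diff * diff
    if den == 0.0:
        raise ValueError("parental populations have identical frequencies")
    return num / den


# Invented haplogroup frequencies, for illustration only.
parent_a = [0.50, 0.30, 0.20]
parent_b = [0.10, 0.30, 0.60]
hybrid = [0.40, 0.30, 0.30]  # constructed as 0.75*parent_a + 0.25*parent_b

m = admixture_wls(hybrid, parent_a, parent_b)
print(f"estimated contribution of parent A: {m:.2f}")
```

Because the toy hybrid population was constructed as an exact 75:25 mixture, the estimate recovers 0.75; with real frequencies the fit is only approximate and the estimate should be reported with its error.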

Figure 3

Principal component analysis (PCA) of the populations under study.

Figure 4

Median networks of haplogroups A11, C, F1c and Z. The reduced median networks of the haplogroups of interest were constructed using the Network 4.510 program (http://www.fluxus-engineering.com/sharenet.htm) and adjusted manually according to Bandelt et al.43 The data used here were collected from this study (Supplementary Table S1) and the literature (Table 1). The sequence variation used for network construction was confined to segment 16047–16497. Suffixes T, C and G refer to transversions; recurrent mutations are underlined and “@” denotes a reverse mutation; 16193+C was omitted from the median network. Codes in the circles refer to the population abbreviations displayed in Supplementary Table S2.

Figure 5

Median networks of haplogroups G2a2 and M9a1a2a. For more information, see Figure 4.

Even though we focused on the East Eurasian lineages identified in the Nepalese populations, we did observe a number of Nepalese-specific haplotypes, strongly suggesting a rather ancient origin and, most plausibly, de novo differentiation in Nepal. To obtain a hint of the arrival time of these lineages, we focused on two clades from haplogroups G2a and M9a1a2, because both clades contain Nepalese haplotypes at their terminal branches or basal nodes and have likely differentiated in Nepal; estimating their ages would thus help to date the migration from Tibet. The time estimates revealed that haplogroups G2a2 and M9a1a2a have very similar ages of about 5.7 kya, and this age becomes slightly older (6 kya) when the calibration rate proposed by Forster et al.44 is used. The very similar ages of both haplogroups, which likely differentiated in situ in Nepal, strongly suggest that the bearers of these East Eurasian maternal components arrived in Nepal no later than 5.7 kya (Table 4). Previous work has suggested that the maternal genetic components from northern East Eurasia were introduced into Tibet around 8.2 kya,1 and our time estimates fit this time frame very well. It is then conceivable that the settlement of Nepal by the bearers of the East Eurasian genetic components likely occurred before 5.7 kya, a result in good agreement with archeological findings of shared Neolithic features between Nepal and Tibet (and references therein).49
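The age estimates discussed above rest on the ρ statistic, the average number of mutational steps from the root haplotype of a clade to the sampled sequences, multiplied by a calibration rate. A minimal Python sketch follows. The clade composition and the value of YEARS_PER_MUTATION are placeholders chosen for illustration and are not taken from Table 4 or from the cited calibrations.

```python
def rho_statistic(haplotypes):
    """Average mutational distance from the root haplotype.

    haplotypes: list of (count, steps) pairs, where count is the number of
    individuals carrying a haplotype and steps is its number of mutations
    away from the root haplotype of the clade.
    """
    n = sum(count for count, _ in haplotypes)
    if n == 0:
        raise ValueError("no sequences supplied")
    return sum(count * steps for count, steps in haplotypes) / n


def age_from_rho(rho, years_per_mutation):
    """Convert rho to an age in years with a chosen calibration rate."""
    return rho * years_per_mutation


# Invented clade: 4 root-type sequences, 5 one step away, 1 two steps away.
clade = [(4, 0), (3, 1), (2, 1), (1, 2)]
YEARS_PER_MUTATION = 20000  # placeholder rate, not from the cited calibrations

rho = rho_statistic(clade)                   # (0 + 3 + 2 + 2) / 10
age = age_from_rho(rho, YEARS_PER_MUTATION)
print(f"rho = {rho:.2f}, age = {age:.0f} years")
```

Swapping in a different calibration rate rescales the age linearly, which is why the same clade can yield slightly different dates under different published rates.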

Previous studies have observed substantial East Eurasian genetic components in the Nepalese populations;4, 5 however, it remains controversial whether the East Eurasian lineages were introduced into Nepal directly from Tibet (across the Himalayas)4, 6 or via northeast India.5, 8, 50 By extensively analyzing mtDNA variation in populations from Nepal, Tibet and northern India, we found, on the basis of principal component analysis, Fst and admixture estimation, a closer genetic affinity between the Nepalese and the Tibetans. This result was further substantiated by the median networks (Figures 4 and 5), in which most of the Nepalese mtDNAs belonging to haplogroups prevalent among northern Asian populations either shared haplotypes with the Tibetans at the root level or branched off directly from nodes consisting almost exclusively of Tibetan lineages. Our results strongly suggest that most of the East Eurasian maternal components identified in the Nepalese were introduced directly from Tibet,4, 6 and the time estimates further suggest that this peopling event plausibly occurred about 6 kya. Indeed, this inference seems to be in striking accordance with the historically recorded passes (such as the Kodari and Rasuwa Passes), which have connected the Nepalese and the Tibetans since ancient times.3 However, the observed gene flow from northeast India suggests a genetic contribution, albeit limited, from this region, a scenario echoing the proposed inland dispersal route.50 In this spirit, our findings help complete the understanding of the origin of the Nepalese and of how the East Eurasian genetic components were introduced into Nepal. Taking into account the previous observations on the Y chromosome,4 it is now convincing that East Eurasians entered Nepal across the Himalayas around 6 kya, a scenario in good agreement with previous findings from linguistics and archeology.