Ancient Human Migrations to and through Jammu Kashmir- India were not of Males Exclusively

Jammu and Kashmir (J&K), the Northern most State of India, has been under-represented or altogether absent in most of the phylogenetic studies carried out in literature, despite its strategic location in the Himalayan region. Nonetheless, this region may have acted as a corridor to various migrations to and from mainland India, Eurasia or northeast Asia. The belief goes that most of the migrations post-late-Pleistocene were mainly male dominated, primarily associated with population invasions, where female migration may thus have been limited. To evaluate female-centered migration patterns in the region, we sequenced 83 complete mitochondrial genomes of unrelated individuals belonging to different ethnic groups from the state. We observed a high diversity in the studied maternal lineages, identifying 19 new maternal sub-haplogroups (HGs). High maternal diversity and our phylogenetic analyses suggest that the migrations post-Pleistocene were not strictly paternal, as described in the literature. These preliminary observations highlight the need to carry out an extensive study of the endogamous populations of the region to unravel many facts and find links in the peopling of India.


Result and Discussion
In this study, in order to address the compelling question as to whether, the maternal gene pool of J&K was conserved across millennia, as expected from other phylogenetic studies of Indian populations, we targeted the mitogenomes, the strongest genetic tool for tracing direct maternal inheritance 19 , of 83 individuals belonging to different ethnic groups across Jammu and Kashmir.
From the data set, we identified a high diversity of maternal HGs in the region as the 83 mitogenomes included types from haplogroup (HG) to M, U, H, W, R, K, F, D, T, A, C and I. These major HG, included the lineages from as far as both Europe and Central Asia. Further, the phylogenetic diversity documented here has not been observed before in Indian populations. From the 83 mitogenomes, we identified 19 novel lineages, belonging to the following groups: M3a1c, M3a1d, M3a2b, M3a2b1, M65a3, M65a3a, M5a1b1, M5a1b1a, M5a6, M39b1a, W6e, F1c1a2a, U2a1a1, U2c2, U2c2a, U2c2a1, K2a5c, A21a and R0a2o. HGs were assigned to each individual according to the nomenclature of PhyloTree.org 20 (mtDNA tree Build 17). These novel lineages were only designated when at least 2 individuals could be found belonging to that lineage (Table 1). Interestingly, some of the lineages called for a rearrangement of the existing phylogenetic tree (mtDNA tree build 17) to accommodate the previously unknown diversity. To avoid confusion, we named the lineages to follow the existing nomenclature of phylotree (build 17) but highlighted the branches that need rearrangement in the manuscript. In addition, a few individuals were observed with distinct haplotypes than those in existing phylotree (mtDNA tree build 17) and in literature. However, as these set of variants were observed only once, and thus were considered as possible private polymorphisms, and were added as new branches in the existing maternal phylogenetic tree (supplementary Data 1 and 2). It is possible that these branches are potentially new haplogroups with deep (Pleistocene) time differentiation (Fig. 1), yet we lacked replicate samples and further evidence to be characterized as such, due to a smaller sample size in the present preliminary study from the region. Also seen in previous work from India, we found HG U to be predominant and the second most frequent and highly diverse group in the present study. HG U is known to be distributed geographically from Europe and North Africa through India and Central Asia 21,27,28 . It is the second most frequent maternal lineage in India as it is also in Europe 29 . The autochthonous HGs like U2a, U2b and U2c were observed in Indian populations 28,30,31 . To these HGs, newer sub HGs are added in the present study. U2a, reported to be present in Pakistan 32 , showed a new sub-haplogroup, designated as U2a1a and defined by variant T15629C. Two sub-branches were added to the HG U2b2 and one sub-branch to U2c1. Basal HGs U2c2, U2c2a, U2c2a1 were also observed and defined by variants: C13934T, and T4772C, C16188T, A16207G, and T63C, C64T, G9554T, A15954G, G16213A, respectively. A sub-branch was added to U2e1′2′3 and U7a3. Further, two sub-branches were added to U7a, U7a3b; and one sub-branch each to U7a3a, U7b, U5a1i1 and U1a1c1d (Supplementary Data 2). HG U7 has been found in various parts of India, Pakistan and Iran 10 but is also widely distributed across Europe, near East and South Asia 33 . Additionally, one sub-branch was added to each HG R5a2 and R30b1; both R5 and R30 are common Indian HGs 31 .

Presence of non-Indian haplogroups in J&K.
We identified various other HGs (sub-branches and some new HGs) of R, W, H, I, T, K, C, A, F and D but with relatively less frequency. These HGs appeared in low frequency individually, which in the present study may be due to a small sample size. However, all these groups taken together represented 29/83 studied samples (34.9%), thus more than a third of lineages were contributed from outside of India. Some of these HGs were mainly the West Eurasian and North-East Asian HGs, rarely have been found in India. Moreover, given the extent of diversity, these are some of the most interesting observations made in the present study. Various phylogenetic analyses (phylogenetic tree and Bayesian analysis), indicate an introduction of this genetic component during post Last Glacial Maximum (<14000 YBP) to early Holocene period (>8000 YBP) (Fig. 1), which may coincide with dispersal of language and adoption of agriculture 34 . Thus, the diversity observed could have arisen as a result of large scale female migration to the region in post-settlement times, but could also be differentiation of many of the lineages in the region, in-situ, since Pleistocene settlement. In order to make this differentiation, we need further and more extensive study of the region, including possible ancient DNA contributions.
In the present study, various new HGs belonging to macro haplogroup N were observed. A new sub-haplogroup was added to R0a2: the R0a2o, defined by variant T11152C and two sub-branches (Supplementary Data 2). R0 is a West Eurasian haplogroup 35 , and has also been reported in population of Iran 36 . Further, one sub-branch each was added to HGs, W4 and W6, and two sub-braches to W3. Additionally, a new sub-haplogroup, not reported earlier, was defined as W6e with basal variant G143A and further differentiation in the region with variants G7521A, T16189C, C16355T and T16362C (Supplementary data 2). The W haplogroup has been reported to be the one of the dominant groups in Iran, Pakistan 10 ; and is also widely distributed with low frequencies over European continent, the Near East and West Asia 37 . Similarly haplogroup H, commonly present in Caucasus and widely spread within Europe with a rare presence in India 38 , was observed in the present study. Further, one sub-branch was added to each HG H2a and H6a1b: and two sub-branches to each sub-haplogropu H2b and H41a. Also, one sub-branch was added to each, I4b and T2g1 (I4b and T2 are present in Europe, the Near East and the Caucasia 39,40 ). Another novel sub-haplogroup of K2a5: K2a5c, defined by A15799G at basal position with further differentiation possessing variants C9356T and A3397G C7241T was reported for the first time in India. HG K2a5 has predominantly been documented in West Eurasians 41 . European gene pool has been affected by major gene flow from steppe region of North of the Caucasus and has resulted in a shift in the genetic profile of Europeans during Bronze Age 42,43 .
It has also been suggested that groups from the steppe regions of Eurasia were the source population of Indo-European languages to Europe 42 and also likely to South Asia 44 . This linguistic spread could explain the presence of these European lineage in South Asia. The other plausible migration of maternal European gene pool to this region could have been from the Southern coast of Iran, Afghanistan; and present Pakistan through Baluchistan, acting as conduits for human dispersals 36,45 . These geographically adjacent regions were connected with the State of Jammu and Kashmir and were also regions that saw similar historical gene flow from Europe as India did. therefore, the extent of diversity, we propose, may either indicate differentiation of various population groups in the neighboring region 36 or be the remnants of natural diversity associated with various migrations of high magnitude through various invaders, like Alexander the Great, the Persian Empire, Arabs, Turks, Syrians and Afghans. Thus, the historical migrants too could have introduced the genetic component of West and Central Asia to this region 12 which interestingly was not exclusively that of males.
In addition, many Eastern Asian haplogroups were also observed. Ancient branches of HG C(C4a1) present in Eastern Asia, may have been introduced to J&K with movement south from central and Eastern Asia with the concurrent expansions in Northern Asia 46 . HG A one of the major haplogroup in Tibetans 47 and a common haplogroup in Northern and Eastern Asia 48 , was found in the form of one new sub-haplogroup, A21a, defined by C12603T and T16092C variants. The two HGs A and C have also been reported in Tibet with a high frequency among Sherpa population 47 . Likewise, HG F1c, which is predominantly present in the Tibetans 49 and Chinese populations 50 was observed in J&K in sub-branches as F1c1a2a, defined by variants A234G and G1927A. Within haplogroup D4 seen in Han Chinese populations 51 , we added sub-branches to HG D4g2a1c and D4q. The presence of these lineages representing North East Asian gene pool in J&K may have been contributed through the Silk Route when people migrated in and out of the region.

Affinity with the Global Populations.
In order to visualize the maternal closeness of J&K population with various population of the world, genetic distances amongst different populations were estimated as Fst values 52 . In addition to J&K samples from present study samples from population sets, Africa(AFR) 53 , Caucasian(CAU) 54 , China(CH) 50 , Europe(EUR) 53 , India(IN1) 31 , India(IN2) 55 , Japan(JAP) 56 , assorted Jew(JEW) 57 , Onge(ONG) 58 , Pakistan(PAK) 32 , Papua New Guinea(PNG) 59 and Native Americans(AME) 53 were included for the analysis. Fst values were further used to generate Multi Dimensional Scaling (MDS) plot ( Supplementary Fig. 1). First plot was generated by taking population of JK as whole. It was observed that JK population set clusters in between Asian and European population sets ( Supplementary Fig. 1a). To have a better understanding of genetic affinity of samples bearing known Indian and non-Indian haplogroups found in J&K population, samples were divided into two sets JK1 and JK2. JK1 set constituted the HGs known in literature majorly as Indian population specific haplogroups (M, U2 & U7). JK2 set was comprised of all other HGs (A, C, D, F, H, I, K, R, T, U1, U5, W), found in the study. Interestingly, JK2 clustered with European and Caucasian population set ( Supplementary Fig. 1b).

Maternal population expansion in the region. The initial settlement of humans in South Asia is
reported to have occurred between 40,000-70,000 years before present (YBP) 60 , whereas the initial maternal colonization in Indian subcontinent is reported to have occurred around 40,000-45,000 YBP 61 . We found various lineages of macrohaplogroup M and N, clearly indicating high diversity and maternal gene flow to and from the region as discussed and also indicated by large negative values of Fu's Fs statistics and the highly significant values of Tajima's D (Supplementary Table 1). This substantiated the pattern of a maternal population expansion in Jammu and Kashmir population. Bayesian Skyline Analysis (BSA) and phylogenetic tree branching patterns indicated a gradual population growth over 35,000 YBP but an expansion episode can be detected around 15,000-8000 YBP, which could be attributed to an expansion after the Last Glacial Maximum (Fig. 1) that could be insitu differentiation or associated with agriculture and language dispersal also indicated by some of the archeological evidences from the region.
Coalescence Age estimates of major Haplogroups. We tried to estimate the coalescent ages (expressed as years before present, YBP) of major haplogroups found in our sample set using BEAST software 62 . The most frequent haplogroups of macrohaplogroup M in the present study were M3a, M5, M30, and M65; whereas, of macrohaplogroup N were U, U2, U7, W and H (Supplementary Table 2). Most of these coalescence time periods in the present study were in agreement with the estimates reported in literature. The coalescence time period of M3a has been reported as 16400 YBP 22 , and observed as 16838.1 YBP with 95% HPD of (12654. 3-22671.5). Similarly, M5a has been estimated with an age of 23100 YBP 22 , whereas we observed 15453.4 YBP with 95% HPD of (11878.9-20377.2). In case of M30, reported to have originated 15400 YBP 22 , showed in our estimates its age as 13663.9 YBP with 95% HPD of (10237.1-18628.2), in the region. The coalescence age of M65 which has been reported as 20600 YBP 34 , in our study has been observed as 18115.9 YBP with 95% HPD of (13786.7-24394.0).
One of the major haplogroups, haplogroup U and some of its lineages which have been reported to be differentiated in the Indian subcontinent 21,63,64 , showed its coalescence age of 33568.2 YBP with 95% HPD of (29494.3-39721.8) in J&K. The mean age of HG U has been reported as 46531.1 41 and the coalescence age of U2 has been reported as 42805.7 41 YBP. The age for the latter when calculated in the region was 28503.0 YBP with 95% HPD of (24666.1-33724.1). Further, HG U7 which has been known to be differentiated in south Asia with the age of 15600 YBP 33 , showed the coalescence age estimate of 13060.5 with 95% HPD of (10014.2-17556.7), in the present study. In contrast, the absolute coalescence age estimates of European haplogroups in J&K were not consistent with those in Europe, though 95% HPD values overlapped; West Eurasian haplogroup W has been reported in Europe with the age of 18400 YBP 37 , we observed an age of 15033.1 YBP with 95%HPD of (11539. 9-20022.5). While as, Haplogroup H which has been estimated to originate around (12846.0) YBP 41 , but our estimates show it around 15747.3 YBP in J&K with 95% HPD of (11917.5-21518.1). Thus, the diversity and deep in time coalescence ages of known European HGs in the region highlight the importance of the region for exploring its plausibility as differentiation ground of many maternal lineages that might have migrated to Europe from the region post glacial maximum. Overall, this age estimation has provided an overview of the expansion of mtDNA haplogroups in the region, indicating population expansion in the J&K region at different time intervals, most likely with post-settlement gene flow east from western Eurasia to India.

Conclusion
To conclude, the extent of presence of variants defining novel HGs or personal variants indicate high diversity in maternal genetic component of the population of J&K. Statistical analyses indicate that maternal population in J&K have undergone expansion, along with other regions of Indian sub-continent 9 . However, signatures of maternal gene pool expansion in the region past LGM and early Holocene era are also seen, and this is a unique observation for the present study. These distinct signatures and maternal lineages, never reported before in India, apparently suggest that this region might have served as a corridor, yet also as a reservoir for many unreported lineages.
The overall diversity seen in the maternal gene pool of J&K suggests that the migrations to and through this region were not exclusively of males. This data has refined the existing phylogenetic tree and added to the information further diversity of mtDNA in Indian populations. Further, this preliminary study highlights the importance of the region and emphasizes that the populations of this region should be studied extensively to understand the gene pool of Indian populations. Along with the Y chromosomal and mtDNA markers, a study of autosomal markers is also warranted in these population groups. It is anticipated to help in finding some of the missing links in the evolution of modern humans and their migratory history to and from the mainland India and the Indian subcontinent, a future perspective of our study. Further, we would like to emphasize that the endogamous populations should be studied with respect to their individual evolutionary and migration histories, rather than pooling these together as one group, an underlying drawback that has plagued many of the Indian population based studies in the past, diluting individual signatures and masking stories their DNA has to tell.

Materials and Methods
Sample collection. Samples were collected with informed consent from all participants. The study was approved by Institutional Ethical Review Board (IERB) of Shri Mata Vaishno Devi University. All experimental protocols were conducted according to the guidelines and regulations set by the IERB. The samples represent cumulatively the ethnic population of J&K, but belong to various endogamous groups (or subgroups), such as, Brukpa, Bakkarwal, Brahmin, Gujjar, Kashmiri Pandit, Kashmiri Muslim, Khatri, Lohar, Rajput, Sikh and SC (subgroups-Bhagat, Balmiki, Charmark, Mahasha, Screra, Tradiye) from three provinces of J&K i.e. Jammu, Kashmir and Ladakh.
Complete mtDNA sequencing. Complete mitochondrial sequencing was done by targeted amplification of the mitochondrial genome by long-range PCR, from genomic DNA. The mitochondrial genome was amplified as two long overlapping fragments of 9Kb each. Equimolar concentrations of the two fragments were pooled and taken for library preparation using Illumina TruSeq DNA sample preparation kits. The LR PCR products were fragmented to 300-400 bp size in Covaris M220 followed by end repair, adenylation and Illumina adaptor ligation. The adaptor ligated libraries were further amplified and size selected using Ampure XP beads. The libraries were then sequenced to more than 1000 × coverage on Illumina sequencing platform (HiSeq. 2500). The sequences obtained were aligned to revised Cambridge mitochondrial reference genome (RCRS) using BWA program 65,66 and analyzed using Picard and GATK-Lite toolkit 67 Phylogenetic analysis of Mitogenomes. Bayesian Skyline Analysis (BSA) was also done, to calculate effective population size with time using BEAST v1.8.2 suite 62 (Bayesian Evolutionary Analysis Sampling Trees) software. We performed several analyses on the partitioned mtDNA (partition of mtDNA was done using a custom python script into control region, tRNA plus rRNA regions, first, second, and third positions of codons in the protein coding regions) 33 , with a strict molecular clock and uncorrelated lognormal relaxed clock with HKY (Hasegawa Kishino and Yano) model 72 of nucleotide substitutions. The reasonable ESS (Estimated Sample Size) values i.e (>200) were obtained with strict clock. Bayesian Skyline algorithm was selected in BEAST to generate Bayesian Skyline Plot. Each MCMC (Markov Chain Monte Carlo) sample was based on a run of 30000000 generations sampled every 3000 steps, with the first 3000000 generations regarded as burn-in. Runs were made with a mutation rate (2.67 × 10 −8 ) 73 and (2.74 × 10 −8 ) 74 , however results presented here are based on (2.67 × 10 −8 ) 73 . We visualized BSP with Tracer software v1.6 ( Supplementary Fig. 2). Phylogenetic analysis of J&K Mitogenomes was also done using Densitree 75 . The age of most recent common ancestor (TMRCA) and 95% highest posterior density intervals of major haplogroups were calculated using human mitochondrial evolutionary rates 2.67 × 10 −8 with European mitogenomes 73 (Supplementary Fig. 3). Fst values were used to generate Multi Dimensional Scaling (MDS plot) using SPSS statistics software v.20 76 . Complete mtDNA sequences for various population groups were downloaded from (http://www.mtdb.igp.uu.se/) 77 and aligned using DNasp v.5 71 . The Fst values and statistical significance was estimated by permutation analysis, using 10,000 permutations by Arlequin software v.3.5 78 .