Genetic Diversity in Gorkhas: an Autosomal STR Study

Genotyping of highly polymorphic autosomal short tandem repeat (STR) markers is a potent tool for elucidating genetic diversity. In the present study, fifteen autosomal STR markers were analyzed in unrelated healthy male Gorkha individuals (n = 98) serving in the Indian Army by using AmpFlSTR Identifiler Plus PCR Amplification Kit. In total, 138 alleles were observed with corresponding allele frequencies ranging from 0.005 to 0.469. The studied loci were in Hardy-Weinberg Equilibrium (HWE). Heterozygosity ranged from 0.602 to 0.867. The most polymorphic locus was Fibrinogen Alpha (FGA) chain which was also the most discriminating locus as expected. Neighbor Joining (NJ) tree and principal component analysis (PCA) plot clustered the Gorkhas with those of Nepal and other Tibeto-Burman population while lowlander Indian population formed separate cluster substantiating the closeness of the Gorkhas with the Tibeto-Burman linguistic phyla. Furthermore, the dataset of STR markers obtained in the study presents a valuable information source of STR DNA profiles from personnel for usage in disaster victim identification in military exigencies and adds to the Indian database of military soldiers and military hospital repository.

The Gorkha (also spelt as Gurkha) soldiers are a dominant force in the Indian Army who have completed 200 years in the Armed Forces since their integration on April 24, 1815. They are courageous in battle and have won many gallantry awards and military honors. Gorkha was the usual designation of the reigning dynasty of Nepaul (Nepal) 1 and the designation had no ethnic connotation 2 . Historically, the terms 'Gurkha' and 'Gurkhali' are synonymous with 'Nepali' and were derived from the name of old kingdom of Gorkha (Gurkha), a hill town and present day district of Gorkha (~fifty kilometers west of Kathmandu) from which the kingdom of Nepal expanded 3 . As the kingdom spread across the Himalayas from Tibet to Sikkim, the king's warriors, taken from all groups in the area, came to be known as Gorkha soldiers 2 . During the Gurkha War (1814-1816) between the Gorkha Kingdom and the East India Company, the British were impressed by the Gorkhali soldiers whom they called Gurkhas 4 and they became a part of the British Indian Army. After India's independence in 1947, as part of the Tripartite Agreement, the Gurkha Regiment was split between the British and the Indian Army. There are two types of Gorkhas in the Indian Army: the original inhabitants of Nepal and the ones who are domiciled in India (who migrated from Nepal long ago and settled in the hilly region of Northern and North-East India).
Only a handful of published genetic studies on Gorkhas/Nepalese are available [5][6][7][8][9][10][11][12] . Nepal is situated just to the south of the Himalayan mountain peaks. Himalaya, stretching from Pakistan to Myanmar, forms the highest land boundary on our planet and is one of the linguistically most complex regions. The Great Himalayan region is of particular interest for studying human population prehistory. The geographical area comprising the present day Nepal provided corridors for human migration in ancient times, resulting in relatively early inhabitation of the area. Nepal is a multiethnic, multilingual and multicultural country made up of more than 125 ethnic/caste groups and more than 123 languages 13 which are derived mostly from three major language groups: Indo-Aryan, Tibeto-Burman and various indigenous language isolates. The two major groups in the Nepali society are Tibeto-Burman or Mongoloids from the north and the Indo-Aryans from the south. The larger groups can be divided on the basis of geographical locations by altitude: alpine based cultural groups (Sherpas, Dolpas, Larkes and Siars, Manang bas, Lo pas, Olangchung), temperate zone based cultural groups (Brahmins and Chhetris, Kiratis, Tamangs, Magars, Gurungs, Thakalis) and subtropical based cultural groups (Brahmans and Rajputs, Tharus, Rajbansis, Satars, Musalmans) 14 . It is generally believed that ancestors of ethnic groups inhabiting the northern Nepal, the middle hills and the north eastern region (Gurungs, Magars, Limbus, Rais, Sherpas, Tamangs and Thakalis etc.) came from Tibet, China, Myanmar, Mongolia and the Far East. Majority of Gurungs along with Magars and their Khasa counterparts formed the bulk of the famous Gorkha regiment of British and Indian Army.
Nepalese population displays strong genetic differentiation despite sharing close geographical proximity underscoring necessity for studies of Nepalese ethnic groups. The understanding of past genetic history, genetic affinity as well as population diversity has been vastly improved by the molecular genetic markers like microsatellite, mitochondrial DNA and Y chromosome markers. Microsatellite markers, particularly the autosomal short tandem repeats (STR), which have high polymorphic information content (PIC) and high power of discrimination 15 , have become the markers of choice for genetic analysis and understanding of genetic relatedness of the population. The present study investigates i) the genetic status of the Gorkhas of Tibeto-Burman linguistic phyla serving the Indian Army based on a set of 15 autosomal microsatellite (STR) markers, ii) the extent of affiliation with ethno-linguistically close population of Nepal, Tibet and other global Asian populations belonging to the Tibeto-Burman linguistic family and iii) construction of Indian soldier DNA dataset for human identification for military purposes.

Results and Discussion
In total, 138 alleles were observed in the studied population with allele frequencies ranging from 0.005 to 0.469. TPOX locus had the maximum allele frequency with allele 8 (0.469) being the most frequent allele in the population. Observed heterozygosity of TPOX, TH01 and D5S818 was found to be low being 0.602, 0.663 and 0.663 respectively. Remaining STR loci were highly polymorphic with observed heterozygosity values ranging from 0.724 for CSF1PO to 0.867 for D18S51. All loci met HWE expectations. Allele frequencies and summary statistics of the 15 autosomal STRs in the studied Gorkhas are presented in Table 1. Polymorphism Information Content (PIC) in the population ranged from 0.61 (TPOX) to 0.87 (FGA) and power of discrimination (PD) ranged from a minimum of 0.843 for TPOX to a maximum of 0.963 for FGA. The most polymorphic locus of FGA was also, as was expected, the most discriminating in the population. The power of exclusion (PE) ranged from a minimum of 0.293 for TPOX to a maximum of 0.770 for D21S11 and D19S433. The combined power of discrimination (CPD) and combined power of exclusion (CPE) for all 15 STR loci was 0.9999999986 and 0.9999993441 respectively. The average probability of matching value was found to be 0.082 and it was expressed as 1 in 14.7.
Phylogenetic analysis based on allelic frequencies was performed to investigate genetic relationship of the studied population with the neighboring and Indian lowlander population 8 Table 1). Allele frequencies at 11 out of the 15 loci in the Gorkhas were statistically similar to the Tibetan general population 10 . Interestingly, Sherpas, a Tibetan population from Namche Bazaar, Nepal 9 , situated at an elevation of 3440 m, showed similarity with the Gorkhas at only 9 out of 15 loci in the present study (Supplementary Table 1). This observation is not unexpected as in an earlier study, allelic and genotypic distributions between Sherpas of Namche Bazaar and non-Sherpas from Kathmandu valley were shown to differ significantly at 14 STR loci 9 . The Nei's D A distance matrix is presented in Table 3. Geographically close Gorkha population (present study) and those from Nepal 8,19 showed the minimum distance value reflecting closeness. Phylogenetic trees established with the genetic distance matrix constructed with allele frequency information of 15 STR markers showed clustering of the Gorkhas with the population from Nepal 19 , Kathmandu 10 and other Tibeto-Burman linguistic phyla as one group while the lowlander Indian population formed a separate cluster (Fig. 1). Figure 2 shows the PCA plot based on Component I and Component II scores. In general, the scattering pattern obtained in PCA ( Fig. 2) showed similarity with clustering pattern of NJ tree (Fig. 1). Four separate aggregates were evident in PCA plot: Indian lowlanders comprising Bhils and Tamils were placed in the upper right quadrant, Indian lowlanders comprising Brahmins, Komatis and Rajus were placed in the lower right quadrant, Gorkhas of present study were placed in the upper left quadrant along with Nepalese 8 , Kathmandu 10 , Nepal 19 as well as with Korean 18 and Chinese 20 population. It is worthwhile to mention here that low bootstrap value between Gorkhas and Nepalese was observed in the phylogenetic tree ( Fig. 1) created by software POPTREE2 23 . Nevertheless, further estimation of relatedness between the two populations by pair wise Fst values computed by Arlequin version 3.5 software 24 (Table 2), Nei's DA distance matrix obtained by POPTREE2 23 (Table 3) as well as placement of populations in PCA plot created by Past software package 25 (Fig. 2) substantiated the relatedness of the studied populations. Neighbor Joining dendogram constructed from Nepali speaking community from Sikkim along with the tribal population of Lepchas and Bhutias 26 and high altitude native population from Ladakh, particularly the Buddhists 27 (who also belong to Tibeto-Burman linguistic phyla) based on allele frequency information from 9 STR markers, did not cluster the Gorkhas with high altitude native (Buddhist) population from Ladakh or with Lepchas and Bhutias or Nepali speaking community of Sikkim (Fig. 3), which in all probability may reflect separate migration history.
The present study reveals phylogenetic relationship of Gorkhas of Tibeto-Mongol origin with neighboring Himalayan population. Our findings are in concordance with the previous studies on Indian Gorkhas which suggested genetic closeness of Gorkha population with that of Mongoloid origin based on HLA-A and B antigens 7 . A characteristic HLA A33-HLA B44 haplotype of Korean and Japanese population was shown to occur with significant positive association in Gorkhas 7 . An earlier study by Wang and colleagues 11 revealed that vast majority of Nepali gene pool comprised genetic component of East Eurasian (36.56%) and South Asian (51.63%) ancestry and 3 out of 5 Nepalese populations were clustered with the Tibetan population. The Himalayan region of the Indian subcontinent contains a complex linguistic pattern indicative of the region being an ancient source of genetically differentiated population and language 28  Indian subcontinent forming a linguistic boundary between the Tibeto-Burman and Indo-European language families 29 ; language and geography might have played equally important roles in defining the genetic composition of present day Himalayan population 28 . Geographical continuity is a major influencing factor of genetic affinity among diverse populations which are distributed over a wide geographical area after migration. Geographically adjoining populations from a common stock cluster together due to genetic affinity and the findings of the present study with autosomal STR markers substantiate the genetic affinity of the Gorkhas with the Tibeto-Burman linguistic phyla reflecting recent past genetic history and possible migration from Tibet as well as probable origin of the Gorkhas from Mongolian and/or Tibetan stocks. Studies with Y-chromosomal diversity revealed high frequency of East Asian specific haplogroup O3a5-M1324 in the Himalayan population of Nepal and Tibet suggesting a common ancestry for these linguistic subfamilies 30 . Higher prevalence of South Asian-derived Y-haplogroup R1a1-M198 was also reported in the Nepalese population of Newar and Kathmandu indicating significant genetic influence from the Indian subcontinent 30 . Previous studies had strongly suggested that most of the East Eurasian maternal components identified in Nepalese were directly introduced from Tibet 8, 28 . A maternal footprint of gene flow was reported between Nepal and India 31 .    It was observed that our study population comprised both Tamangic (Gurung, Tamang) and Magaric (Magar) groups of Tibeto-Burman language family, based on assessment of ethnicity from ethno-linguistic questionnaire ( Table 4). Further investigation of phylogenetic relationship between the Gurung, Tamang and Magar groups showed clustering of the Tamangs with Tibetans 10 , Tamangs 10 and Sherpas 9 while Gurungs and Magars showed genetic relatedness with those from Kathmandu, Nepal 9,10 (Fig. 4). Gurungs and Magars were also closely clustered suggesting common origin of these two ethnic groups (Fig. 4). This interesting observation, however, is required to be substantiated by increasing the markers and the ethnic groups. Although little is known about Tamang history, it is believed that they came from Tibet possibly around 3000 years ago. The Magar people (genetically and physically Mongoloid/East Asian) are believed to have migrated from Tibet via Sikkim although their origin is shrouded in mystery. Origin of Gurungs is also uncertain though linguistic evidence suggests that their ancestors may have migrated from Tibet about 2000 year ago. They are predominantly of Mongoloid racial stock and speak a language which largely belongs to the Tibeto-Burman language family 32 . Time estimation results indicate that people from Tibet began to migrate to Nepal around 6000 years ago 11 which is also in agreement with the archeological findings of reported sharing of Neolithic features between Nepal and Tibet 33 and historically recorded passes (Kodari and Rasuwa) which had connected the Nepalese and the Tibetans since the ancient times 34 . A recent study has revealed presence of Denisovan haplotype in the Himalayan population 35 .   India is home to various social groups with diverse ethnic and linguistic origins representing several racial stocks and social statuses that found place for themselves at various points of time and adapted to several ecological niches offered by the physiographic and climatic setting of the area. The waves of migration drew the ancestors of the majority of the present population of the area from the surrounding territories and across the Himalayas 36 . India, with diverse human population, provides a unique opportunity for population genetics explorations. Tibeto-Burman speaking population, which is one of the four major linguistic groups in India, differs from other linguistic families of India and contributes to the component of diversity to peopling of India. Very few studies are available addressing the extent of genetic diversity and genetic affinity in the Tibeto-Burman population [37][38][39][40][41] . The present study with autosomal STR markers is the first study conducted on Tibeto-Burman speaking Gorkhas from the Indian Armed Forces and substantiates the genetic affinity of the Gorkhas with the Tibeto-Burman linguistic phyla. The results demonstrate that geographic isolation has not played significant role in differentiation of genetic constitution of the Gorkhas whether they come from Nepal or are domiciled in India.
One of the other objectives of our study was to create a microsatellite dataset of Indian soldiers for human identification for military purposes. Military exigencies and major disasters such as wars, airplane accidents and maritime transport disasters leave military personnel highly vulnerable. Usage of weapons of mass destruction and terrorist action also expose the military to disaster and fatality. Identification of remains brings sense of comfort and closure to family and friends. The STR markers are important tools for human identification 18,42 . Various countries have national forensic DNA databases: the Combined DNA Index System (CODIS) in USA, the National DNA database (NDNAD) in UK and Fichier National Automatisé des Emprintes Génétiques (FNAEG) in France which is used by the national police force as well as the military police. Indian Armed Forces has initiated a project of DNA profiling of military personnel which is expected to be completed by 2020. The dataset of STR markers obtained in this study presents a valuable information source of STR DNA profiles from military personnel for usage in disaster victim identification in military exigencies. There is also an urgent need to formulate the DNA profiling laws in India.

Conclusion
The present study reveals phylogenetic relationship of Gorkhas of Tibeto-Mongol origin with other neighboring Tibeto-Mongol Himalayan population. The study substantiates genetic affinity of the Gorkhas with the Tibeto-Burman linguistic phyla reflecting recent past genetic history and origin of the Gorkhas from Mongolian and/or Tibetan stocks. Furthermore, the dataset of STR markers obtained in this study presents a valuable information source of STR DNA profiles from military personnel for usage in disaster victim identification in military exigencies and adds to the Indian database of military soldiers and military hospital repository.

Methods
The study and the experimental protocols were approved by the Ethics Committee of the Defence Institute of Physiology and Allied Sciences, Delhi. The participants (n = 100) were selected from the Gorkha regiment of the Indian Army who were part of an ongoing study of the Institute. All participants gave written informed consent before enrolment in the study. The experiments were conducted in accordance with quality control measures at Participants and sample collection. Of the participants who reported, 59 individuals were from Nepal and 41 from India. During the study period, they were stationed at sea level. The clan ties were determined based on self-reporting and ethnic backgrounds were ascertained through ethno-linguistic questionnaire. To ensure that the individuals were ethnically unmixed, both parents had to belong to the same group. 3-4 ml of peripheral blood was collected through venepuncture from the consenting individuals in K 2 EDTA vacutainers (BD, CA, USA) and stored at − 20 °C till further processing. DNA extraction and microsatellite typing. High molecular genomic DNA was extracted as per published procedure 43 . Quantity assessment was performed on Nanodrop 2000C (Thermo Fisher, USA) and quality checked by Agarose gel electrophoresis. DNA samples had a A 260 /A 280 ratio of 1.8-1.9. Further quantification of DNA was performed by the Quantifiler Duo Human DNA Quantification kit (Applied Biosystems, Foster City, CA) using 7500 Real Time PCR system (Applied Biosystems, Foster City, CA) with v1.1 software as per manufacturer's protocol.
Polymerase chain reaction (PCR) amplification was carried out for each sample using 15 autosomal STR loci markers (D8S1179, D21S11, D7S820, CSF1PO, D3S1358, TH01, D13S317, D16S539, D2S1338, D19S433, vWA, TPOX, D18S51, D5S818 and FGA) along with the gender determination marker Amelogenin with AmpFlSTR ® Identifiler ® Plus kit (Applied Biosystems, Foster City, CA) on GeneAmp PCR system 9700 Thermal Cycler following manufacturer's recommended protocol. 2 samples failed in amplification. Positive and negative amplification controls were used as per kit guidelines. The amplified products were run on 3500xL Genetic Analyzer (Applied Biosystems, Foster City, CA) using 36 cm capillary array and Dye set G5. Allelic ladder sample provided in the kit was included in each run. Data generated using capillary electrophoresis was analyzed using Gene Mapper ID-X version 1.4 software (Applied Biosystems, Foster City, CA) as per manufacturer's instructions. Allele calls were generated for all samples and exported in Excel format. Plot views were generated in PDF format.
Analytical methods. Allelic frequencies for the 15 STR loci and matching probability (Pm), power of discrimination (PD), power of exclusion (PE) and polymorphic information content (PIC) were computed using the PowerStats version 1.2 spreadsheet program 44 . Arlequin 24 version 3.5 was used to calculate observed (H obs ) and expected heterozygosity (H exp ) and Hardy-Weinberg Equilibrium (HWE). HWE based on the exact test was confirmed for all the studied 15 loci at a significance level of p > 0.003 after Bonferroni correction 45 (α = 0.05/15 = 0.003). In the absence of raw genotypic scores from other populations, published allele frequency datasets of STR loci from neighboring populations were used for population differentiation by Arlequin using Fst pair wise distance. Phylogenetic analysis based on allele frequencies were performed to investigate the genetic relationship between the Gorkha population, other neighboring population and Indian lowlanders using the set of 15 STR loci and 9 STR loci from different datasets. POPTREE2 software 23 was used for generating Neighbor Joining (NJ) dendograms as well as to derive Nei's genetic distances 46 . Robustness of the phylograms established by NJ tree was estimated by bootstrapping 1000 replicates over loci. Principal Component Analysis (PCA) plot was generated with Past software package 25 version 3.02 and used for graphical representation of the genetic distances (Dst) of the Gorkha population with other global/Indian lowlander populations.