X-chromosomal STR based genetic polymorphisms and demographic history of Sri Lankan ethnicities and their relationship with global populations

A new 16 X-short tandem repeat (STR) multiplex PCR system has recently been developed for Sr Lankans, though its applicability in evolutionary genetics and forensic investigations has not been thoroughly assessed. In this study, 838 unrelated individuals covering all four major ethnic groups (Sinhalese, Sri Lankan Tamils, Indian Tamils and Moors) in Sri Lanka were successfully genotyped using this new multiplex system. The results indicated a high forensic efficiency for the tested loci in all four ethnicities confirming its suitability for forensic applications of Sri Lankans. Allele frequency distribution of Indian Tamils showed subtle but statistically significant differences from those of Sinhalese and Moors, in contrast to frequency distributions previously reported for autosomal STR alleles. This suggest a sex biased demographic history among Sri Lankans requiring a separate X-STR allele frequency database for Indian Tamils. Substantial differences observed in the patterns of LD among the four groups demand the use of a separate haplotype frequency databases for each individual ethnicity. When analysed together with other 14 world populations, all Sri Lankan ethnicities except Indian Tamils clustered closely with populations from Indian Bhil tribe, Bangladesh and Europe reflecting their shared Indo-Aryan ancestry.

Sri Lanka is an island country in South Asia, located in the Indian Ocean, close to India. Due to its strategic position at middle of the maritime silk route from China to Europe, it was well known to the outside world from ancient times as a trading hub. The diverse ethnicities that compose the 20 million population inhabiting the island as per the last population census 1 have descended mainly from numerous groups of migrants who came to the island at various historical time periods. Their overpowering impact have confined the original inhabitants of the country to a few dry zone areas, forming a tribal group known as the Veddahs (aboriginals) 2 , represented by about 10,000 individuals 3 .
Today, the Sinhalese make the largest ethno-cultural group in Sri Lanka, having a population of 15.17 million (74.9% of the total population) 1 . The Sinhalese make a unique population in the world as the only ethnic group that speaks Sinhala, a branch of the Indo-European (Indo-Aryan) language family 4 . According to historical chronicles, the Bengali prince Vijaya and his seven hundred followers, who are descendants from the Indo-Aryans natives of the northern Indian subcontinent laid the foundation to the Sinhalese in 543 BCE 4 . They vanquished the aboriginal Veddas and converted Sri Lanka to a Sinhalese territory until the Dravidian rulers from South India invaded the northern part of the island in the fifth century AD 4 . Since then, there had been frequent migrations by South Indians into the country, which gave rise to the second largest ethnic group in Sri Lanka known as the "Ceylon Tamil" (2.27 million people representing 11.2%) 1 .
A third ethnic group was established in the country when Arab traders visiting the country for commercial purposes settled in Sri Lanka in 1000 AD, leading to intermarriages with the Sinhalese and the Sri Lankan Tamils 4 . The group now known as Moors comprise 9.2% 1 (1.86 million) of the Sri Lankan population and maintain unique sociocultural features which are based largely on the Islamic faith. They speak a Dravidian language that contains large number of Arabic words that is generally referred to as '' Arabic Tamil'' 5  www.nature.com/scientificreports/ DXS10135 showed the highest value for all forensic parameters, suggesting it to be the most informative marker for Sri Lankans, while DXS7423 showed the lowest and the least informative. These results indicate that the 16 X-STR markers are polymorphic enough for both the forensic and kinship analysis applications.

Population differentiation among Sri Lankan ethnicities.
A locus-by-locus analysis of molecular variance (AMOVA) was carried out including all 16 loci, grouping the four ethnicities based on their linguistic origins (Table 1) to understand the extent of genetic differentiation among them. Out of the four populations, Sinhalese are known to have an Indo-Aryan origin, which is different from the Dravidian linguistic origin of the other three ethnicities. However, a significant variation was not detected among the two linguistic groups (Fct = − 0.00059; P > 0.05) though a subtle, but statistically significant variation was detected among populations within groups (Fsc = 0.0018; P = 0.0108). The global AMOVA results as a weighted average over loci showed that most of the variance in the samples is attributable to within-individual variation (97.97%) and between ethnic group variation is around 0.18%. To better understand this observed population structure, pairwise comparisons (pairwise Fst analysis) were carried out among all four ethnicities (Table 2). According to the Fst values obtained, Indian Tamils were shown to have a subtle but statistically significant genetic subdivision from Sin-  www.nature.com/scientificreports/ halese (Fst = 0.0029; P = 0.0000) and Moors (Fst = 0.0038; P = 0.0000) while Sinhalese, Sri Lankan Tamils and Moors are highly panmictic (P > 0.05). Further, the two Tamil ethnicities were shown to share a common genetic background (P > 0.05).
Since phylogenetic trees constructed from genetic distances can easily deduce the evolutionary relationships and origins of different populations 16,17 , UPGMA method was applied for the four ethnicities to further evaluate their genetic affinities. As shown in the phylogram (Fig. 2), Sinhalese and Moors are genetically closely associated with each other and also with Sri Lankan Tamils to a lesser extent. Although Indian Tamil group is placed at a distant position from Moors and Sinhalese, the close genetic affinity between the two Tamil groups are apparent in the phylogram. In addition, the phylogram also supports an Indian origin for the Sri Lankan Moors as suggested by some historians 5 . Nei genetic distances for the six pairwise ethnic groups are listed in the Table 3.
These findings agree with the historical data on early settlement of the four ethnic groups in Sri Lanka. According to anthropological and archaeological evidence, Sri Lankan Tamils have a very long history in Sri Lanka and have lived in the island since at least around the second century BCE. They have arrived in Sri Lanka   www.nature.com/scientificreports/ from various parts of the Indian subcontinent, either with their original families or alone, and subsequently uniting with the Sinhalese through matrimonial bonds. Indian Tamils on the other hand were brought to Sri Lanka to work in estates during the British colonization and had minimum admixture with the native Sinhalese or with Sri Lankan Tamils, who were more economically independent at the time and had better social status. Even at present, majority of Indian Tamils are congregated around the plantation estates of central hill area of the country forming a separate community. In contrast, Sri Lankan Moors have descended exclusively from Muslim male merchants of either Arabic or of Indian origin 5 , who came to Sri Lanka for trading. During the fourteenth century, they started to settle in coastal areas in Sri Lanka and espoused local women, who were either Sinhalese or Sri Lankan Tamil. Thus, it is not surprising to see this local female ancestry reflected among Moors via our X-STR analysis, despite their Arabic origin, in the light that the X chromosome spends two third of its lifetime within females. Alternatively, the genetic similarity observed between Sri Lankan Tamils and Moors may be reflecting the Indian origin of Moors as some scholars claims it to be 5 . Likewise, the genetic similarity observed between the two Tamil populations might also lie in their common Indian origin.
Since the X chromosome reflect more of the maternal blood line, results reported in mitochondrial (mt) DNA studies are also of much relevance to the present context. Despite that the number of mt DNA studies conducted on Sri Lankan ethnicities are very limited, the available reports closely agree with our observations and indicated a very fine genetic structure with a more closer genetic relationship among Sinhalese and Sri Lankan Tamils, in comparison to that among Sinhalese and Indian Tamils 8 . In addition, haplotype sharing between Sinhalese and Moors has also been demonstrated 9 . When taken together with the fact that autosomal STR analysis have failed in detecting a genetic structure among the four ethnic groups 6 , our results suggest a sex-biased demographic history for Sri Lankan ethnicities.  Table S14).

Genetic distance between Sri
Among the three South Asian populations, Bhil tribe and Brahmin caste populations of India are the most geographically proximal populations to Sri Lankan ethnics groups. Both these Indian populations did not show a significant genetic differentiation with the Sri Lankan ethnicities' with respect to any of the tested loci. Similarly, Bangladesh population also did not exhibit any population subdivision with Sri Lankan ethnicities. However, Pakistan population demonstrated significant differentiation from one or more ethnic groups of Sri Lanka at two loci (DXS6789, DXS7424) out of the nine common loci compared.
These results suggest that allele distribution of the four ethnic groups of Sri Lanka is very similar to the two tested Indian populations and the Bangladesh population. Pakistani population also exhibit substantial similarity to Sri Lankan ethnic groups, though not to the same extent of Indian and Bangladesh populations. On the contrary, allelic distribution of many X-STR loci in Sri Lankan ethnic groups differ from Southeast Asian, East Asian, European and African populations. Among them, East Asian and African populations are the most genetically distant populations to Sri Lankans.
To further clarify the relationship between Sri Lankans and the world populations, pairwise Fst values were averaged over eight of the 16 studied X-STR loci (DXS10148, DXS10135, DXS8378, DXS7132, DXS10079, DXS10074, HPRTB and DXS7423) and were represented in a multidimensional scaling (MDS) plot (Fig. 3 44 ). According to the results observed, Sri Lankans were clustered together not only with Indians and Bangladeshi, but also with Europeans. Indian Tamils were placed towards the periphery of this main cluster, while Southeast Asians, East Asians and Africans were placed at a distant, outside the main cluster.
This presentation of MDS plot aligns well with the historical claims of population movements in Eurasia. Sinhalese are believed to be descended from Indo-Aryans, who set forth from boarders of Caspian and Black sea towards Europe and South Asia, early in the third millennium BC. Accordingly, many scholars hold the view that along with the Indo-Aryan language family, many Europeans and South Asian civilization of today share common genetic background reflecting their Bronze age common ancestors. Tamils on the other hand are believed to have descended from the indigenous people of Indian subcontinent. However, Sri Lankan Tamils have admixed with Sinhalese nearly over two millennia, unlike the Indian Tamils, which might explain their relative positions in the MDS plot.
LD and haplotype analysis. Population studies of various countries have illustrated that the LD is population specific 42,46,47 and does not necessarily present between markers with close physical proximity 48,49 . In the present study, Sri Lankan Tamils did not exhibit LD within any of the four clusters, while Sinhalese displayed LD within three of the four studied clusters after adjusting for multiple comparisons (Table 4). Further, in cluster I and IV, LD was detected only among Sinhalese population, and only between a single pair of loci (DXS10135 and DXS8378 in cluster I and DXS7424 and DXS101 in cluster II). The cluster II, which had the highest presentation of LD among the four clusters, displayed a highly significant (P = 0.0000) LD between the two marker pairs, DXS10074-XS10075 (in Sinhalese, Indian Tamils and Moors) and DXS7132-DXS10075 (among the Sinhalese). www.nature.com/scientificreports/ However, there was no LD detected between DXS10079 and DXS10074 among Sinhalese in contrast to the results obtained for our initial study, which was conducted with 120 Sinhalese males 15 . This disparity of observations might have caused by the limitations in the initial study with respect to the sample size requirements specified for LD analysis 50 . In cluster III, none of the markers showed a significant LD (corrected P = 0.0167), although there was a marginal LD (P = 0.0174) detected between DXS6801 and DXS6809 for Sinhalese population. Further, in addition to the pairwise LD analysis within separate clusters, pairwise LD was also analyzed between markers belonging to cluster III and IV, due to their relative close physical proximity (6.78 cM) compared to the other clusters. However, LD was not detected between any of the marker pairs (P > 0.05 for eight out of nine marker pairs) as shown in Table 4. Since the haplotypes defined by those loci which are found to be in LD tend to behave as alleles, forensic efficiency parameters were calculated based on their observed haplotype frequencies and are given in Table 5. As shown PIC, PD m and PD f values for all these haplotypes were greater than 0.9 while MEC values ranged above 0.8.
LD is generally expected to be high in populations, which are either small, reproductively isolated or having low population growth 48 . Indian Tamils who are tea estate workers, restricted mostly to central part of the country fits well into this description. In Sri Lanka, there are about 850,000 Indian Tamils, which had very limited genetic mixing with other populations due to social and cultural reasons. Nevertheless, only one loci pair among Indian Tamils showed LD according to our results. As pointed out by Kling et al. 50 , to acquire adequate power to detect LD in X-STR analysis, sample sizes over 200 are often needed. When considering the limited number of Indian Tamil samples used in the present study in calculating LD (63 males), it is possible that the low sample number www.nature.com/scientificreports/ to have masked the actual existence in LD among Indian Tamils. The same limitation is also valid for Sri Lankan Tamil and Moor ethnicities in which similarly low level of LD were detected. On the other hand, analysing a small sample from a large population can create LD, even among loci which are in linkage equilibrium, due to the under-representation of haplotypes 51 . In this light, it is advisable to increase the number of Sinhalese samples used for LD calculations (258 males in the present study), considering that there are about 15 million Sinhalese in the country. On the other hand, the high level of LD observed within Sinhalese might have resulted from its largely admixed nature. All these facts highlight the requirement of a more exhaustive data set with increased power to detect LD, before concluding on the true pattern of LD among Sri Lankan ethnicities. In general, the LD reported earlier for South Asian and South East Asian populations like Bangladeshi 20 , Indian Bhil tribe 18 and Malaysians 23 was quite low; i.e. no LD was reported among cluster I (DXS10148, DXS1035 and DXS8378) or in cluster II markers (DXS7132, DXS10079 and DXS10074). The marker, DXS10075, in which LD was detected with DXS7132 for Sinhalese was not investigated in any of these populations to make  www.nature.com/scientificreports/ a comparison. However, it is noteworthy that all these studies were conducted with male samples between 100 and160, which might have posed a limitation in detecting true LD in these populations. On the other hand, LD was not observed for cluster III or cluster IV in a study that investigated 302 Pakistan males 21 , the only Asian population for which LD data are available for these clusters. These data are suggestive of the existence of a complex pattern of LD among Asians. In contrast, a high level of LD was reported within these four clusters for Europeans like Swedish 52 and German [53][54][55] populations, in studies conducted with 450-800 male individuals. The genetic stability of the linked clusters and the degree of dependence between them can have a major influence on most forensic and kinship applications. The classical approach of studying such linkage between selected markers is via analysis of three generation pedigrees using LOD scores 56 . However, a linkage analysis was not conducted for the 16 X-STR markers investigated in the current study. Nevertheless, a previous research conducted for three generation families of Chinese origin has reported a significant linkage with maximum LOD scores > 2.0 for all pairwise markers in the cluster II, III and IV indicating their tightly linked nature 57 .
The haplotype frequencies obtained for the four ethnic groups for all four clusters are listed in Supplementary Tables S15-S18. The forensic efficiency parameters of these haplotypes are also provided in Table S19. In most of the previously published population studies, cluster 1 V comprises only DXS7424-DXS101 without DXS7133 55,58,59 . Likewise, the four markers included in the cluster II had been analyzed in two different combinations; i.e. DXS7132-DXS10079-DXS10074 (Argus X-12 kit) and DXS10079-DXS10074-DXS10075 54 . Therefore, to allow comparisons with these previous studies, the haplotype frequencies of above combinations are also listed in Supplementary Tables S20-S23. As shown in Supplementary Table S24, typing of male subsamples from the four ethnic groups yielded a total of 604, 246, 208 and 223 different haplotypes for Sinhalese, Sri Lankan Tamils, Indian Tamils and Moors, respectively. Cluster I produced the highest number of haplotypes for Sinhalese and Sri Lankan Tamils, while both cluster I and II produced the highest number of haplotypes for Indian Tamils and Moors. Among all the observed haplotypes, 96.85% of Sinhalese, 82.11% of Sri Lankan Tamil, 83.17% of Indian Tamil and 85.65% of Moor haplotypes showed frequencies < 0.020. Moreover, the most common haplotype was observed at a frequency ≤ 0.065 in all the four ethnicities, indicating the suitability of the selected X-STR clusters for haplotype based kinship analysis.
Among the four clusters, both clusters I and II proved relatively more informative for all four ethnicities as reflected by the haplotype diversity (HD = 0.9964-0.9987 and 0.9935-0.9973 respectively). In cluster I, this observation may have caused by the two highly polymorphic markers, DXS10135 and DXS10148 as described above and is consistent with other previously published population data 18,20,25 . Cluster II, on the other hand carries four markers compared to the other clusters, which consists of three markers each. This increased number of markers may have generated higher haplotype diversity with respect to cluster II among the studied populations. On the contrary, cluster III and cluster IV are equally informative (HD = 0.9873-0.9908 and 0.9882-0.9923 respectively), though not to the same extent as the clusters I and II. Nevertheless, in general, all four clusters showed a high haplotype diversity for all four ethnicities.
Combined forensic efficiency parameter data. In general, for those loci in LD, the combined forensic parameters are calculated based on the haplotypes defined by LD. However, a more conservative approach would be better suited in the current scenario, considering the limitations posed by the number of samples in calculating LD. Accordingly, the combined forensic efficiency parameters for the 16 X-STR were calculated based on the efficiency of haplotypes defined by the four clusters. In addition, the efficiency of the three individual markers were taken separately. The combined power of discrimination for both males (CPD m > 0.999999991334440) and females (CPD f > 0.999999999999996) as well as combined MEC indices calculated for deficiency (CMEC Kru > 0.999999242880176), normal trio (CMEC Des-trio > 0.999999984443189) and duo cases (CMEC Des-duo > 0.999999435330137) were equally high for all four ethnicities (Table 6). These results suggest that the 16 tested X-STR loci are appropriate candidates for kinship and forensic analysis among the four Sri Lankan ethnicities, especially with the cases involving female offspring. Table 6. Combined forensic efficiency parameters calculated for the 16 X-STR loci in the four ethnic groups based on the haplotypes frequencies of the four clusters and the allele frequencies of the three individual markers. CPD f combined power of discrimination in females, CPD m combined power of discrimination in males, CMECKru combined mean exclusion chance Kruger, CMEC Des-trio combined mean exclusion chance Desmaris trio, CMEC Des-duo combined mean exclusion chance Desmaris duo, SLT Sri Lankan Tamil, INT Indian Tamil. www.nature.com/scientificreports/

Conclusion
In this work, we report for the first time, X chromosome based population genetic data for all four major ethnic groups in Sri Lanka which covers 99.5% of the total population. According to our results, the 16 X-STR assay system used in the current study is highly polymorphic and exhibited high forensic efficiency for all four ethnicities tested indicating its suitability to be used in both evolutionary genetic analysis and forensic applications of Sri Lankans. The present study has also revealed subtle but statistically significant differences in X-STR based allele frequency distribution of Indian Tamils with Sinhalese and Moors, contrary to the highly homogeneous genetic outlook portrayed by autosomal STR analysis. While suggesting a sex biased demographic history for Sri Lankan ethnicities, this observation recommends the use of a separate X-STR allele frequency database for Indian Tamils for forensic and kinship application purposes. In contrast, the observed genetic admixture present within Sinhalese, Sri Lankan Tamil and Moor ethnicities suggest the possible use of a common allele database for the purpose. Further, the genetic distances observed among the Sri Lankans and other nationalities in the world visualized in the MDS plot render evidence to the ancient linguistic origin of Sri Lankan ethnicities-Indo-Aryan origin of Sinhalese and Dravidian origin of Tamil populations-which had been later affected to various degrees through genetic admixing between them. LD was detected along the X chromosome in all ethnic groups except Sri Lankan Tamils, which need to be considered during the likelihood calculations of kinship resolution and person identification. Further, the patterns of LD observed, which differ substantially among the four ethnicities request the use of different haplotype frequency databases for the four ethnicities for forensic purposes. Although the results of haplotype analysis suggest that the four studied X-STR clusters can provide a powerful tool for kinship testing and relationship identification of Sri Lankan ethnicities, a more exhaustive sampling of the two Tamil groups and Moors is recommended to confirm the LD and haplotype based observations.

Methods
Sample preparation and DNA extraction. The 60 and subjected to PCR amplification using the single tube 16 X-STR multiplex system described in Perera et al. 15 . Amplified products were resolved with capillary gel electrophoresis using ABI 3500 Genetic Analyzer (Thermo Fisher Scientific, USA) and data analysis allele designation was performed using GeneMapper IDX software (Thermo Fisher Scientific, USA).

Statistical analysis.
All population genetic parameters were calculated using ARLEQUIN 3.5. Allele and haplotype frequencies, exact test of differentiation for male and female allele frequencies, conformity of female subsamples to Hardy-Weinberg equilibrium and pairwise test of LD between pairs of markers within clusters of linked loci in the male sub samples were analysed separately for all four ethnic groups. LD between the markers of cluster III and IV was also analysed, considering the close physical proximity of the two clusters (6.78 cM). Forensic parameters, i.e. polymorphism information content (PIC), expected heterozygosity (He), mean exclusion chance in deficiency (MEC Kruger ), mean exclusion chance in Duos (MEC Desmaris duo ), mean exclusion chance in trios (MEC Desmaris trio ), power of discrimination for females (PD f ), power of discrimination for males (PD m ) were calculated using chromosome X web (http:// www. chrx-str. org) for all X-STR loci and haplotypes defined by the four clusters using MATLAB software (version R2017a) for all four ethnic groups. For those loci under LD, the parameters were also evaluated based on frequencies of haplotypes defined by them using the latter software.
Locus-by-locus analysis of molecular variance (AMOVA) was conducted by grouping the four ethnicities based on their linguistic origin; i.e. Indo-Aryan origin (Sinhalese) versus Dravidian origin (Tamils and Moors) to create a hierarchical structure. Nei's average number of pairwise differences within and between populations was used to detect pairwise differences among the four ethnic groups. The null distribution of pairwise Fst values was obtained by permuting haplotypes between populations for 1000 times. The significance level was kept at 0.05 and all analyses were conducted using Arlequin software v.3.5.1.2 61 . Further, for easier visualization of the observed genetic distances, a phylogenetic tree was also constructed with Molecular Evolutionary Genetics Analysis Version 6.0 (MEGA 6.0) 62 software using the Unweighted Pair Group Method with Arithmetic mean (UPGMA) method. The reliability of phylograms was estimated by bootstrapping 2000 replicates over loci and the extended majority rule consensus trees were inferred. To compare the Sri Lankan ethnicities with populations from other countries in the world, locus by locus pairwise genetic distances (Fst) were generated based on data available in the literature. Resulting Fst values were averaged over loci and represented in a multidimensional scaling (MDS) plot using IBM SPSS Statistics Version 21 statistical package 63 .

Data availability
The datasets generated during the current study are available from the corresponding author on reasonable request.