Genetic structure and polymorphisms of Gelao ethnicity residing in Southwest China revealed by X-chromosomal genetic markers

X-chromosome short tandem repeat markers (X-STRs) play an important role in forensic and population genetics because of their special mode of inheritance, their physical location on a single chromosome and the absence of recombination in male meiosis. Although the genetic diversity and forensic characteristics of X-STRs have been well studied in ethnically/linguistically diverse and demographically large Chinese populations, genetic evidence from the Gelao ethnicity is still sparse. Here, we genotyped the first batch of 19 X-STRs in 513 Chinese Gelao individuals (265 females and 248 males) and report the genetic polymorphisms and forensic characteristics based on single loci and seven linkage groups. DXS10135, with the highest PIC (0.9106), and LG1 (DXS10148-DXS10135-DXS8378), with the largest HD (0.9970), are the most polymorphic and informative. The CPDs in Gelao males and females are larger than 0.999999999997095 and 0.99999999999999999999918, respectively, and the combined MECs are larger than 0.999999975715109. Subsequently, we investigated the population relationships among 14 Chinese populations based on 19 X-STRs and among 23 populations based on 11 overlapping X-STRs. Our results reveal genetic differentiation among Tibeto-Burman, Altaic and other, homogeneous Chinese populations, and demonstrate that Guizhou Gelao has closer genetic relationships with Han Chinese and the geographically close Guizhou Miao.

The short tandem repeat (STR), a mutation-prone genetic marker also referred to as a microsatellite or simple sequence repeat (SSR), is widely distributed in the human genome (approximately 1.6 million STRs spanning nearly 1% of the human genome) [1][2][3] . An STR is a repetitive nucleotide sequence comprising a repeating motif of 2-6 base pairs 2 . Previous studies have suggested that slippage events during DNA replication give STRs a higher mutation rate, on average 10⁻³ to 10⁻⁴ mutations per generation, than other types of genetic markers, such as the binary markers of single nucleotide polymorphisms and insertion/deletions 4,5 . Large-scale surveys of large numbers of autosomal STR variations have demonstrated that STRs are associated with the regulation of gene expression and complex molecular phenotypic traits, as well as with the prevalence of and susceptibility to Mendelian diseases and cancers [4][5][6][7] . Y-chromosomal STRs, with their high mutation rates and male-specific inheritance, play an important role in population genetics, genealogical research, and evolutionary and forensic studies 8,9 . In forensic science, increasing attention has been paid to the rates and patterns of de novo STR mutations and to the genetic polymorphisms and forensic characteristics of the
We next performed the exact test using a Markov chain with a forecasted chain length of 1,000,000 and 100,000 dememorization steps to examine the Hardy-Weinberg equilibrium (HWE) of the 19 X-STRs in the 265 female individuals on the basis of the distributions of the observed heterozygosity (Ho) and expected heterozygosity (He) 51 . As shown in Table 1, Ho ranges from 0.5019 (DXS7423) to 0.9019 (DXS10135) and He from 0.5433 (DXS7423) to 0.9158 (DXS10135). No deviations from the HWE are observed with the exception of DXS10134 (p = 0.0360). After applying the Bonferroni correction (corrected threshold p = 0.0026), all tested X-STRs conform to the HWE. The allele frequencies of Gelao females and males are presented in Supplementary Tables S3, S4 and Figs S1, S2. A total of 229 alleles with allelic frequencies ranging from 0.0019 to 0.5736 are observed in females, and 201 alleles with allelic frequencies ranging from 0.0040 to 0.6169 in males. The Fst and corresponding p values were calculated using the exact test in locus-by-locus comparisons to explore the differentiation between female and male samples, and are presented in Supplementary Table S5. Because no statistically significant differences between males and females are observed, we pooled the male and female samples to recalculate the allele frequency distributions and forensic statistical parameters. As shown in Supplementary Table S6 and Fig. S3, a total of 242 alleles are identified with corresponding frequencies ranging from 0.013 to 0.5874.
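The corrected threshold quoted above follows from dividing the nominal significance level by the number of loci tested. The short Python sketch below reproduces that arithmetic; the nominal level of 0.05 is an assumption (it is not stated explicitly in the text, but it is consistent with the reported value of 0.0026).

```python
# Bonferroni correction for the HWE exact tests on the 19 X-STR loci.
ALPHA = 0.05   # assumed nominal significance level (not stated in the text)
N_LOCI = 19    # number of X-STRs tested


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, N_LOCI)
p_dxs10134 = 0.0360  # p value reported for DXS10134

print(f"corrected threshold = {threshold:.4f}")
print("DXS10134 deviates before correction:", p_dxs10134 < ALPHA)
print("DXS10134 deviates after correction:", p_dxs10134 < threshold)
```

Under this assumption, DXS10134 falls below the nominal level but not below the corrected one, which matches the statement that all loci conform to the HWE after correction.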

Forensic Parameters
Table 2. Forensic parameters of the seven linkage groups (LG1 to LG7) on the basis of the haplotype frequencies in the Guizhou Gelao population. HD, haplotype diversity; MP, match probability; PIC, polymorphism information content; PD f , power of discrimination in females; PD m , power of discrimination in males; MEC Krüger, mean paternity exclusion chance for autosomal STR markers in trios and complex kinship cases; MEC Kishida, mean paternity exclusion chance for X-chromosomal markers in trios involving daughters; MEC Desmarais, mean paternity exclusion chance for X-chromosomal markers in trios involving daughters (Desmarais version); MEC Desmarais Duo, mean paternity exclusion chance for X-chromosomal markers in father/daughter duos.
The reference populations comprised Southern Han (n = 308) 30 , Tibet Tibetan2 (n = 213) 30 , Xinjiang Uyghur2 (n = 211) 30 , Ningxia Hui (n = 200) 30 , Tibet Tibetan1 (n = 270) 26 , Xinjiang Uyghur1 (n = 220) 26 , Guanzhong Han (n = 474) 31 , Xinjiang Kazakh (n = 300) 39 , Xinjiang Xibe (n = 179) 40 , Liangshan Yi (n = 331) 27 , Sichuan Han (n = 201) 28 , Sichuan Tibetan (n = 235) 29 , and Guizhou Miao (n = 268) 60 . The first three principal components explained 58.687% of the total genetic variation (PC1: 29.081%, PC2: 19.604% and PC3: 10.003%). As shown in Fig. 2, PC1 separates the two Xinjiang Uyghur populations and the Kazakh population from the others, and PC2 differentiates the three Tibetan populations from the others. PC3 separates Ningxia Hui from the other tested populations. The PCA results based on allele frequency distributions revealed that Guizhou Gelao is more closely related to the Han Chinese populations, Miao and Xibe than to the others. Pairwise comparisons between the studied Gelao and the aforementioned 13 populations were subsequently made using Nei's genetic distances (Supplementary Table S10 and Fig. S6). Moderate genetic heterogeneity (mean ± SD: 0.0262 ± 0.0110) is observed among the Chinese populations, with genetic distances ranging from 0.0070 (between Guanzhong Han and Guizhou Gelao) to 0.0519 (between Xinjiang Uyghur2 and Sichuan Tibetan). Guizhou Gelao is most closely related to Guanzhong Han (0.0070) and most distantly related to Xinjiang Uyghur2 (0.0394), which is consistent with the origins of these populations. Subsequently, we conducted MDS based on the genetic distance matrix to further explore the genetic relationships and language affinities. As shown in Fig. 3, three Altaic-speaking populations are located in the second and third quadrants, with the exception of Xinjiang Xibe, which is located in the fourth quadrant. Four Tibeto-Burman-speaking populations are located in the first quadrant. However, Gelao, a Tai-Kadai-speaking population, is located in the fourth quadrant and has high genetic affinity with the Sinitic-speaking populations. The Hmong-Mien-speaking Guizhou Miao is positioned between Southern Han and Xinjiang Xibe. An N-J tree was constructed for these 14 populations belonging to four language families. We identified three main clusters: an Altaic-speaking cluster, a Tibeto-Burman-speaking cluster and a Sinitic-speaking cluster. Guizhou Gelao and Guizhou Miao form one branch, which then groups with the Sinitic-speaking populations in the same cluster.
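The distance-then-MDS workflow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say which of Nei's distances was used, so Nei's standard genetic distance is assumed here, and classical (Torgerson) MDS stands in for whatever MDS implementation was actually applied. The population names and allele frequencies are hypothetical toy values, not data from this study.

```python
import numpy as np


def nei_standard_distance(freqs_x, freqs_y):
    """Nei's standard genetic distance (assumed variant).

    freqs_x, freqs_y: lists of 1-D arrays, one per locus, with alleles
    listed in the same order in both populations.
    """
    jx = np.mean([np.sum(np.square(x)) for x in freqs_x])
    jy = np.mean([np.sum(np.square(y)) for y in freqs_y])
    jxy = np.mean([np.sum(x * y) for x, y in zip(freqs_x, freqs_y)])
    return -np.log(jxy / np.sqrt(jx * jy))


def classical_mds(dist, n_components=2):
    """Classical MDS: embed a symmetric distance matrix in n_components dimensions."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering          # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))


# Hypothetical toy data: three populations, two loci.
pops = {
    "PopA": [np.array([0.50, 0.30, 0.20]), np.array([0.60, 0.40])],
    "PopB": [np.array([0.45, 0.35, 0.20]), np.array([0.55, 0.45])],
    "PopC": [np.array([0.10, 0.20, 0.70]), np.array([0.20, 0.80])],
}
names = list(pops)
dist = np.array([[nei_standard_distance(pops[a], pops[b]) for b in names]
                 for a in names])
coords = classical_mds(dist)

for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:+.4f}, {y:+.4f})")
```

In this toy example PopA and PopB have similar allele frequencies, so their pairwise distance is smaller than either one's distance to PopC, which is the kind of pattern the quadrant-based reading of an MDS plot summarises.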
Genetic relationships and population structures revealed by 11 overlapping STRs among 23 nationwide populations. To glean further details of Chinese genetic structure, we combined the genetic variations of the Gelao investigated here with those of more previously published populations, comprising 22 reference groups 26-31,39-44,46-49 from 12 diverse ethnicities and six language families, on the basis of the 11 X-chromosomal genetic markers shared between the Investigator Argus X-12 QS Kit and the AGCU X-19 amplification system. We first explored genetic homogeneity and heterogeneity using PCA based on the allelic frequency distributions. The first three components accounted for 53.534% of the total variance. As shown in Fig

Discussions
Linkage and linkage disequilibrium. Before a new PCR amplification system is used extensively in forensic casework, its potential power (genetic polymorphisms and forensic parameters) needs to be evaluated so that population-specific reference data can be provided for a comprehensive database. Herein, genotype data for the 19 X-STRs included in the AGCU X19 kit were obtained from 513 unrelated Chinese Gelao individuals. Before analyzing the forensic population frequency data, we evaluated linkage disequilibrium. Linkage is the phenomenon whereby genetic markers that lie close together on a chromosome tend to be inherited as a unit during meiosis. Linkage disequilibrium, also referred to as allelic association, is the non-random association of alleles at different loci, which can be caused by linkage and by specific population history, such as population substructure, migration, non-random mating and genetic drift. In this study, linkage disequilibrium analyses were performed in both males and females. Most marker pairs in disequilibrium were observed within the linkage groups. Previous studies 49,55-59 based on large-scale pedigree and population genetic analyses revealed that the 19 X-STRs can be grouped into seven linkage groups (LG): LG1 comprises three loci located on the X-chromosomal short arm 55,56 , LG2 consists of three genetic markers located at the centromere with a low recombination rate 57 , and LG3 49 , LG4 58 , LG5 59 , LG6 56 and LG7 49 are located on the long arm. The DNA Commission of the International Society for Forensic Genetics (ISFG) recently recommended that haplotype frequencies should be used to calculate the likelihood when linkage exists among the included forensic X-STRs 50 . Thus, statistical parameters of forensic interest based on both single loci and linkage groups were analyzed.
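To make the notion of non-random allelic association concrete, the Python sketch below computes the descriptive coefficients D and r² for one allele pair at two loci. It is not the exact test used in this study (which was run in Arlequin); it only illustrates what "association" means numerically. Male samples are convenient for this purpose because males carry a single X chromosome, so their haplotypes are observed directly. The haplotypes and allele labels are hypothetical.

```python
from collections import Counter


def pairwise_ld(haplotypes, allele_a, allele_b):
    """Return (D, r2) for allele_a at locus 1 and allele_b at locus 2.

    haplotypes: list of (allele_at_locus1, allele_at_locus2) tuples,
    for example from hemizygous males.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_a = sum(c for (a, _), c in counts.items() if a == allele_a) / n
    p_b = sum(c for (_, b), c in counts.items() if b == allele_b) / n
    p_ab = counts[(allele_a, allele_b)] / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2


# Hypothetical male haplotypes at two tightly linked loci.
haps = [(10, 21)] * 40 + [(11, 22)] * 40 + [(10, 22)] * 10 + [(11, 21)] * 10
d, r2 = pairwise_ld(haps, 10, 21)
print(f"D = {d:.2f}, r2 = {r2:.2f}")
```

When the two alleles co-occur no more often than their individual frequencies predict, D is zero; because STRs are multi-allelic, a full analysis would consider all allele pairs rather than the single pair shown here.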
Forensic efficiency. The AGCU X-19 STR amplification system, which co-amplifies the 19 X-STRs with fluorescent detection, was developed specifically to facilitate the establishment of a Chinese X-STR reference database. To explore the power of this panel in complex paternity testing and individual identification, we comprehensively evaluated the forensic efficiency indexes and the genetic polymorphisms. A set of forensic parameters has been devised 51-54,61 , including GD, PIC, PE, PD f , PD m , and four mean paternity exclusion chance indexes. PIC and GD are applicable to both autosomal and X-chromosomal markers, and GD is also appropriate for Y-chromosomal markers 16,61 . MEC Krüger was conceived for deficiency cases in which the alleged father is unavailable and is replaced by the paternal grandmother, using X-chromosomal markers, and for normal trios using autosomal markers 52 . MEC Kishida and MEC Desmarais are specially designed for trios with a daughter 53,54 , and MEC Desmarais Duo is valid for father/daughter or mother/son duos on the basis of X-chromosomal markers 54 . In this study, the combined values of six of these parameters (PD m , PD f , MEC Krüger, MEC Kishida, MEC Desmarais and MEC Desmarais Duo) in the pooled Chinese Gelao population, on the basis of single-locus allele frequencies, are 0.99999999999985, 0.99999999999999999999973, 0.999999975715109, 0.999999999998337, 0.999999999998324, and 0.999999995577508, respectively. For the haplotype analyses, the combined powers of discrimination and mean paternity exclusion chances were also estimated. The combined PD m and PD f are 0.999999999997095 and 0.99999999999999999999918, respectively, which are slightly smaller than the values calculated from the allele frequency distributions. The cumulative mean paternity exclusion chances in trios are 0.999999999394923 (Krüger), 0.999999999991709 (Kishida) and 0.999999999996492 (Desmarais), and that in duos is 0.999999999682643 (Desmarais). The combined MEC Kishida and MEC Desmarais based on single-locus polymorphisms are larger than those based on the haplotype distributions of the seven linkage groups, whereas the combined MEC Krüger and MEC Desmarais Duo are higher when based on haplotype polymorphisms. Our findings, combined with our previous investigations [27][28][29] , indicate that the 19 X-STRs are informative and polymorphic in the Chinese Gelao population and that this amplification system can efficiently complement the analyses of autosomal 13 , mitochondrial and Y-chromosomal STRs 16 , single nucleotide polymorphisms (SNPs) 62 and insertion/deletions (InDels) 63 in forensic applications, especially in special and complicated kinship cases (deficiency cases of paternal grandmother/granddaughter duos, mother-son duos, and full or half-sibling duos involving two females, as well as some specific incest cases).
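Combined values of this kind are conventionally obtained with the product rule, 1 − Π(1 − x_i), over independent markers or linkage groups. The Python sketch below assumes that rule (the text does not spell out the formula) and uses hypothetical per-group values rather than the study's figures. Arbitrary-precision decimals are used because the female values reported above carry more than 20 significant digits, which is beyond the roughly 15 to 17 digits of a double-precision float.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # enough digits to resolve values such as 0.999...918


def combined_power(per_marker_values):
    """Combined power = 1 - product(1 - x_i), assuming independent markers."""
    residual = Decimal(1)
    for value in per_marker_values:
        residual *= Decimal(1) - Decimal(str(value))
    return Decimal(1) - residual


# Hypothetical per-linkage-group PD values (illustrative only).
pd_values = [0.99, 0.98, 0.97, 0.995, 0.96, 0.985, 0.99]
cpd = combined_power(pd_values)
print(cpd)
```

Treating each linkage group as one independent unit, rather than multiplying across all 19 loci, is what keeps the product rule defensible when loci within a group are tightly linked.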
Population genetic relationship. China, located in East Asia and comprising 56 ethnically/linguistically diverse ethnic groups officially recognized by the People's Republic of China as well as several unrecognized populations (such as Mosuo and Miyao), has been a subject of molecular anthropology, archaeology, population genetics and forensic genetics, which aim to shed light on the genetic diversity, origin, divergence, evolution, migration and admixture of eastern anatomically modern humans after their migration out of Africa around 50,000 BC [32][33][34][35] . The detailed genetic structures of Chinese minority ethnicities, with the exception of Uyghur and Tibetan 32,64 , and particularly that of the Chinese Gelao, remain unresolved. We used two different datasets to investigate Chinese population structure. Significant genetic differences were identified between Turkic-speaking, Tibeto-Burman-speaking and other Chinese populations, which is consistent with previous genetic studies 32 . Except for the Turkic and Tibeto-Burman populations, the other Chinese populations form a homogeneous group in this study. Our comprehensive population genetic comparisons demonstrated that Gelao shows genetic affinity with this homogeneous group, especially with Han Chinese and Guizhou Miao (a geographically neighboring population). A remarkable cluster structure was displayed by the different methods between Gelao and these diverse ethnic groups from different language families, mainly Sinitic-speaking (Han, Hui), Hmong-Mien-speaking (Miao, Yao, She) and Tai-Kadai-speaking (Zhuang) groups. Meanwhile, the closer genetic relationships between Gelao and other populations differ somewhat between methods and datasets: PCA showed that Gelao is close mainly to Han, Miao and Xibe in Fig. 2, but to Han, Miao, Zhuang and Hui in Fig. 4; MDS revealed closer genetic affinity between Gelao and the Sinitic-speaking populations, Miao and Xibe in Fig. 3, but between Gelao and the Sinitic-speaking populations and Zhuang, followed by Miao, Yao, Xibe and She, in Fig. 5; and the N-J tree grouped Gelao first with Guizhou Miao in one branch and then with the Sinitic-speaking populations in Fig. 3, whereas Gelao first clustered with Fujian She and then with Shanghai Han, Guizhou Miao and two other Han populations in Fig. 6. PCA, MDS and phylogenetic trees are the most well-known and widely used methods for examining the general patterns of population genetic relationships. Although an overall consensus was shown between the Gelao and the other homogeneous populations, exactly the same closer genetic relationships between the Gelao and other populations cannot be obtained using distinct descriptive methods, like the conclusions revealed by formal tests of Admixturetools 65 or TreeMix 66 . This is also consistent with previous studies based on Y-chromosomal and autosomal genetic markers 13,14,67,68 . In summary, our results based on X-chromosomal markers demonstrate genetic differentiation among Turkic, Tibeto-Burman and other admixed groups (homogeneous populations, including Gelao). These patterns of genetic variation and structure are caused by migration 34,35 , natural selection 64 , admixture 32 and religious and cultural diffusion 13,17,34 .
The apparent genetic affinity between the Gelao and all compared Han populations from distinct administrative regions, shown by all three methods, can also be explained as a mixed cluster pattern: an obvious ethnic cluster of different Han populations coupled with a probable geographical cluster of the Gelao and the local Han majority, since the two have a long history of living together and intermarrying in the northern part of Guizhou Province 36,37,69,70 . Additionally, Guizhou Miao is another minority group in Guizhou Province and is geographically close to Guizhou Gelao 60 . The close genetic relationship between Guizhou Miao and Guizhou Gelao is displayed more explicitly and consistently than those with other populations (except for Han Chinese) across all the tested methods. To better understand the origin and migration of the Gelao ethnicity and to dissect its fine-scale genetic structure and relationships with surrounding or related populations, additional genomic and population analyses based on higher-resolution genetic marker sets, such as high-density SNP chips and whole-genome sequencing data, are needed.

Conclusions
Tightly linked X-STR markers play an important role in complex forensic kinship cases and deficiency case identification. In this study, we genotyped 19 X-STRs in 513 unrelated Chinese Gelao individuals to investigate their forensic characteristics, and combined these data with 13 previously studied nationwide populations on the basis of 19 X-STRs, as well as with 22 reference populations on the basis of 11 overlapping X-STRs, to explore Chinese population genetic relationships along ethnic, geographical and linguistic divisions. All 19 X-STRs are in accordance with the HWE. Forensic parameters were estimated from both allele and haplotype frequency distributions. The locus DXS10135 and the linkage group DXS10148-DXS10135-DXS8378 are the most informative and polymorphic genetic markers in the Chinese Gelao population. High combined powers of discrimination and mean paternity exclusion chances are achieved with both the 19 X-STRs and the 7 linkage groups, with minor differences, indicating that this panel could complement autosomal, Y-chromosomal and mitochondrial markers in forensic deficiency cases. This study also provides a haplotype database for likelihood estimation in kinship identification in Guizhou Gelao. Additionally, our PCA, MDS and phylogenetic reconstructions, which are based on two sets of genetic markers from a large number of Chinese populations, are concordant in revealing the genetic distinctions among Tibeto-Burman-speaking populations, Altaic-speaking populations and populations of other Chinese language families. Moreover, Guizhou Gelao, a Tai-Kadai-speaking population, has closer genetic relationships with Han Chinese and the geographically close Guizhou Miao. Further whole-genome studies of modern or archaic samples from East Asia are needed because of the remaining uncertainty about the genetic relationships among Chinese populations.

Methods and Materials
Compliance with ethical standards and sample collections. This study was performed with the approval of the Ethics Committee of Zunyi Medical University and followed the guidelines published by the Center of Forensic Expertise, Affiliated Hospital of Zunyi Medical University. Each voluntary participant signed written informed consent after being informed of the aim of the study. A total of 513 human blood samples (265 females and 248 males) were collected from unrelated healthy Gelao individuals residing in Zunyi City in Guizhou Province, southwest China. Samples were collected only from individuals whose parents and paternal grandparents belonged to the Gelao ethnolinguistic group and who had no consanguineous marriages within three generations.
DNA extraction and quantification. Genomic DNA was extracted and isolated using the salting-out method. The DNA template was quantified using the Quantifiler Human DNA Quantification Kit (Thermo Fisher Scientific) on the 7500 Real-Time PCR System (Thermo Fisher Scientific) according to the manufacturer's instructions. All DNA samples were diluted to 1 ng/μl and stored at −20 °C until amplification.
Analytical method. We calculated the allele frequencies of the 19 X-STRs in the Gelao males, females and pooled population using the modified PowerStat V1.2 spreadsheet (Promega, Madison WI, USA). Haplotype distributions and the corresponding haplotype frequencies of the seven linkage groups were estimated by direct counting. The forensic statistical parameters polymorphism information content (PIC), power of exclusion (PE), paternity index (PI), power of discrimination in females (PD f ) and males (PD m ), and mean paternity exclusion chance (MEC) for trio cases introduced respectively by Krüger et al. 52 (MEC Krüger), Kishida et al. 53 (MEC Kishida) and Desmarais et al. 54 (MEC Desmarais), and for duo cases by Desmarais et al. 54 (MEC Desmarais Duo), were evaluated using the online tool provided by the ChrX-STR.org 2.0 database (http://www.chrx-str.org/). Gene diversity (GD) and haplotype diversity (HD) were estimated using Nei's formula 51 : $\mathrm{GD}\ (\mathrm{or}\ \mathrm{HD}) = \frac{N}{N-1}\left(1 - \sum_{i} P_i^{2}\right)$, where N and P_i respectively denote the population size and the ith allele frequency or haplotype frequency. The gender differentiation (Fst and corresponding p values), Hardy-Weinberg equilibrium (HWE) in females, and linkage disequilibrium (LD) in males and females were calculated using the Arlequin software (version 3.5.2) 71 . Finally, we used the newly developed software StatsX (Statistics for X-STR) v2.0 72 to examine and validate our analysis results.
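As an illustration of direct counting and of the frequency-based parameters named above, the following Python sketch computes Nei's diversity, N/(N − 1) × (1 − Σ P_i²), together with PIC, PD m and PD f. The closed-form expressions for PIC (1 − Σp² − Σ_{i≠j} p_i²p_j²), PD m (1 − Σp²) and PD f (1 − 2(Σp²)² + Σp⁴) are the commonly used ones and are assumed here; the study itself obtained these values from the ChrX-STR.org tool, whose internals are not described in the text. The haplotype labels are hypothetical.

```python
from collections import Counter


def frequencies(observations):
    """Allele or haplotype frequencies by direct counting."""
    counts = Counter(observations)
    total = len(observations)
    return [c / total for c in counts.values()]


def nei_diversity(freqs, n):
    """Nei's gene/haplotype diversity: N/(N-1) * (1 - sum(p_i^2))."""
    return n / (n - 1) * (1.0 - sum(p ** 2 for p in freqs))


def pic(freqs):
    """Polymorphism information content (assumed standard expression)."""
    s2 = sum(p ** 2 for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    return 1.0 - s2 - (s2 ** 2 - s4)


def pd_male(freqs):
    """Power of discrimination in males (one X chromosome per individual)."""
    return 1.0 - sum(p ** 2 for p in freqs)


def pd_female(freqs):
    """Power of discrimination in females, assuming HWE genotype frequencies."""
    s2 = sum(p ** 2 for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    return 1.0 - 2.0 * s2 ** 2 + s4


# Hypothetical three-locus haplotypes observed in four males.
male_haplotypes = [
    ("25.1", "21", "10"),
    ("25.1", "21", "10"),
    ("26.1", "19", "11"),
    ("27.1", "23", "10"),
]
hap_freqs = frequencies(male_haplotypes)
hd = nei_diversity(hap_freqs, len(male_haplotypes))
print(f"HD = {hd:.4f}, PD_m = {pd_male(hap_freqs):.4f}")
```

The N/(N − 1) factor corrects for sample size and matters most for small samples such as the four-haplotype toy set used here.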
To dissect the genetic heterogeneity and homogeneity between the studied Gelao population and other nationwide reference populations along ethnic, linguistic and administrative divisions, we first integrated our data with 13 previously investigated populations genotyped with the 19 X-STRs and then combined our data with 22 reference populations on the basis of the 11 overlapping X-STRs (DXS7132, DXS10079, DXS10074, DXS10103, HPRTB, DXS10101, DXS10134, DXS10148, DXS10135, DXS8378, and DXS7423). The first set of reference groups: Southern Han 30