Introduction

Southern Thailand lies on the Malay Peninsula, bordered by the Gulf of Thailand to the east, the Andaman Sea to the west, and Malaysia to the south. Its census population of ~9.16 million is about 13.35% of the national total (68.61 million in 2020)1. Most people are southern Thai Buddhists (66%) and southern Thai Muslims (33%), while minorities, e.g. the sea nomad and Maniq groups, account for about 0.33%2. The populations of the three sea nomad groups are 4000, 2000 and 3000 for the Moklen, Moken and Urak Lawoi’, respectively, while only 250 were recorded for the Maniq1,2. The languages spoken in southern Thailand belong to three linguistic families: Tai-Kadai (TK), Austroasiatic (AA) and Austronesian (AN). The AA-speaking Maniq, who are scattered through the jungle, are regarded as indigenous people of Southeast Asia and are often referred to as “negritos” because of their distinct phenotype and their traditional hunter-gatherer mode of subsistence3. The AN-speaking sea nomads used to subsist through maritime foraging for most of the year, although nowadays they prefer to settle in the coastal areas of Thailand and Myanmar4. Both the Maniq and the sea nomads are minority groups thought to have been native to southern Thailand since prehistoric times, together with other groups, e.g. the AA-speaking Mon and Khmer, before the region was occupied by the AN-speaking Malays and TK-speaking Thais, though the Mon and Khmer have since disappeared from southern Thailand5.

Autosomal short tandem repeats (STRs) offer several advantages for both population genetic and forensic studies: their distribution across the human genome, which lets them escape natural selection, their high polymorphism, and their informativeness for distinguishing recently diverged populations6,7. In Thailand, studies of forensic microsatellites and other markers have focused on northern, northeastern and central Thailand, leaving the southern region understudied8,9,10,11,12. The only study of autosomal STRs in southern Thailand indicated that Thai-Malay Muslims and Thai Buddhists living in the five deep southern Thai provinces did not differ significantly genetically13.

In addition, a few other genetic studies of southern Thai populations have used uniparentally inherited markers3,4. A mitochondrial (mt) DNA investigation of the Moken found the ancient basal mtDNA haplogroups M21d and M46 with very low genetic diversity4. The basal mtDNA haplogroups M21a, R21 and M17a and Y chromosomal haplogroup K were also observed in the Maniq, along with close genetic affinity between the Maniq and other indigenous peoples of Southeast Asia in Malaysia, reflecting an ancient ancestry of the Maniq and a common genetic ancestry of the indigenous peoples of Southeast Asia in the Malay Peninsula3.

To expand genetic studies in southern Thailand, we report genotypes of 15 autosomal STRs for seven southern Thai populations: one AA-speaking group (Maniq), four AN-speaking groups (Moklen, Moken, Urak Lawoi’ and southern Thai Muslim) and two TK-speaking groups (southern Thai Buddhist and southern Thai Takbai). We explored the genetic structure of southern Thai populations and their relationships with other Thai and Malaysian populations8,10,11,12,13,14. In addition, because a forensic database combining diverse southern Thai populations has not yet been established, we created a regional DNA database of 15 autosomal STRs for southern Thailand.

Results and discussions

Genetic diversities and forensic parameters

Raw genotypic data for 15 STRs in 334 southern Thai samples are provided in Table S1. Total genetic diversity across all southern Thai samples was 0.7871 ± 0.3945, whereas diversity in individual populations ranged from 0.6742 ± 0.3526 in the Maniq to 0.7943 ± 0.4012 in the southern Thai Buddhists (Table 1). The reduced genetic diversity of the Maniq is possibly driven by genetic drift associated with geographic isolation and very small population size, as reported previously3. When genetic diversity calculated from the same marker set was compared between two hunter-gatherer groups in Thailand, the Maniq from the South had a higher diversity value than the Mlabri from the North (0.547 ± 0.288)15, although the Maniq sample (n = 15) is smaller than the Mlabri sample (n = 19). Genetic diversity results for these 15 STRs in ~70 Thai populations8,9,10,11,12,13,15 also revealed that the Mlabri had the lowest genetic diversity, indicating strong genetic drift in the Mlabri. Among the sea nomads, excluding the Moken because of their small sample size, the Moklen and Urak Lawoi’ showed lower genetic diversity than other Thai and Malaysian populations (Table 1), reflecting a certain degree of genetic drift.
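As a rough illustration of the per-locus quantity behind such diversity values, the sketch below computes Nei's unbiased gene diversity from a list of allele copies. It assumes the estimator n/(n − 1) × (1 − Σp²), where n is the number of gene copies sampled; the function name and the toy alleles are hypothetical, and Arlequin's multi-locus averaging and its ± values are not reproduced here.

```python
from collections import Counter


def gene_diversity(alleles):
    """Nei's unbiased gene diversity for one locus.

    alleles: flat list of allele copies (two entries per diploid individual).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two gene copies")
    counts = Counter(alleles)
    # Sum of squared allele frequencies (expected homozygosity).
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    # Small-sample correction n / (n - 1).
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical toy data: two heterozygous individuals carrying alleles 8 and 11.
print(gene_diversity([8, 11, 8, 11]))
```

A locus fixed for a single allele gives 0, and the value approaches 1 as more alleles occur at similar frequencies, which is why small, drifted populations tend to score lower.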

Table 1 General information and results on genetic diversities of the studied and compared populations.

When genotype data from all 334 southern Thai samples were combined to calculate allele frequencies for the 15 STR loci (Table 2), two loci (D19S433 and D18S51) departed from Hardy–Weinberg equilibrium (HWE) even after Bonferroni adjustment (p < 0.0033). Although the forensic parameters show that both loci are highly discriminating (power of discrimination (PD) = 0.9246 for D19S433 and 0.9513 for D18S51; power of exclusion (PE) = 0.5757 for D19S433 and 0.6873 for D18S51), the departure from HWE must be taken into account in forensic investigations. A total of 157 alleles were detected, ranging from 6 alleles at TPOX to 21 alleles at FGA. The maximum allele frequency was observed at TPOX (0.5472). The lowest expected heterozygosity (HE) was observed at TPOX (0.6201), while the highest HE was at FGA (0.8690) (Table 2). The polymorphic information content (PIC) ranged from 0.5672 (TPOX) to 0.8529 (D2S1338), and matching probability (MP) values ranged from 0.0374 (FGA) to 0.2037 (TPOX) (Table 2). PD ranged from 0.7963 (TPOX) to 0.9673 (D2S1338) (Table 2), with a combined PD of 0.9999999999999999. PE ranged from 0.3121 (D3S1358) to 0.7588 (FGA) (Table 2), with a combined PE of 0.99999622.
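The forensic parameters above can be sketched for a single locus as follows. This is a minimal illustration, not the PowerStats spreadsheet itself: it assumes MP is the sum of squared genotype frequencies, PD = 1 − MP, PIC in the Botstein et al. form, and PE = h²(1 − 2hH²) with h and H the observed heterozygote and homozygote proportions. Function names and the toy genotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def forensic_parameters(genotypes):
    """Per-locus parameters from a list of (allele_a, allele_b) genotypes."""
    n = len(genotypes)
    # Treat (8, 11) and (11, 8) as the same genotype.
    genotype_counts = Counter(tuple(sorted(g)) for g in genotypes)
    allele_counts = Counter(allele for g in genotypes for allele in g)
    freqs = [count / (2 * n) for count in allele_counts.values()]

    # MP: chance that two random individuals share a genotype (assumed form).
    mp = sum((count / n) ** 2 for count in genotype_counts.values())
    pd = 1 - mp

    # PIC: 1 - sum(p_i^2) - sum over pairs of 2 * p_i^2 * p_j^2.
    pic = 1 - sum(p * p for p in freqs) - sum(
        2 * p * p * q * q for p, q in combinations(freqs, 2)
    )

    # PE from observed heterozygote (h) and homozygote (H) proportions.
    h = sum(1 for a, b in genotypes if a != b) / n
    H = 1 - h
    pe = h * h * (1 - 2 * h * H * H)
    return {"MP": mp, "PD": pd, "PIC": pic, "PE": pe}


def combined_pd(mp_values):
    """Combined PD: 1 minus the product of per-locus matching probabilities."""
    product = 1.0
    for mp in mp_values:
        product *= mp
    return 1 - product


def combined_pe(pe_values):
    """Combined PE: 1 minus the product of per-locus (1 - PE)."""
    product = 1.0
    for pe in pe_values:
        product *= 1 - pe
    return 1 - product


# Hypothetical toy locus with four individuals.
print(forensic_parameters([(8, 8), (8, 11), (11, 11), (11, 8)]))
```

Because the combined PD multiplies 15 small matching probabilities, the product quickly approaches the limit of double-precision arithmetic, which is consistent with a combined value that prints as a long run of nines.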

Table 2 Allele frequencies of total southern Thais based on the 15 autosomal STR loci (n = 334).

Genetic relatedness and genetic structure of southern Thai populations

One measure of genetic relationship among populations is genetic distance. Genetic distances (Rst) among 17 Thai and Malaysian populations showed that the Maniq (MN) and Urak Lawoi’ (UL) were genetically different from each other and from the other populations (Fig. 1), whereas the Moklen (MLK) differed significantly in almost all comparisons (p < 0.05), except those with the newly generated southern Thai Muslim (MST) and Moken samples. However, owing to their very small sample size, the Moken did not differ from most populations. In general, the Maniq and sea nomads from southern Thailand were genetically differentiated from the other groups. The Rst matrix was then used to construct multi-dimensional scaling (MDS) plots. The three-dimensional MDS result for dimensions 1 and 2 showed genetic distinction of the Maniq (MN) and the three sea nomad groups, i.e. Moklen (MLK), Moken (MOK) and Urak Lawoi’ (UL), from the other groups from Thailand and Malaysia. Dimension 3 showed genetic differences between the Urak Lawoi’ and other populations (Fig. 2A–C). The heat plot of the MDS indicated genetic distinction of the Moklen and Maniq in dimensions 1 and 2, respectively, and genetic difference of the Urak Lawoi’ from the other sea nomads in dimension 3 (Fig. 2D).

Figure 1

Heat plot of Rst values among all 17 populations. The “=” symbol indicates non-significant Rst values (p > 0.05).

Figure 2

The three-dimensional MDS plots for 17 populations (A–C) (stress = 0.0030) and the heat plot of standardized MDS values for five dimensions (D). See population abbreviations in Table 1. Red, purple, green, blue and black indicate populations from southern Thailand, northern Thailand, northeastern Thailand, central Thailand and Malaysia, respectively. Circle, square and triangle indicate the Austronesian, Tai-Kadai and Austroasiatic families, respectively.

To further explore cryptic population structure and genetic relationships among 16 populations (excluding the Moken) using STRUCTURE, we present results for K = 2 to 8 (Fig. 3A), with K = 5 identified as the most suitable number of clusters (Fig. 3B)16. The first cluster (orange) was found in the Maniq (MN), while the second cluster (purple) stood out in the sea nomads, Moklen (MLK) and Urak Lawoi’ (UL), supporting their genetic uniqueness (Fig. 3A). The other three clusters (dark blue, light blue and green) were distributed across all populations in different proportions: (1) the dark blue component was prominent in southern Thais (MST, MUS, BST and BUD), Malays (ML1 and ML2) and populations from central Thailand (MO and CT); (2) the light blue component was prominent in the other Thais from the northern (YO and YU) and northeastern (IS and KH) regions; and (3) the green component was broadly distributed across all populations, except for a reduction in the Maniq and Urak Lawoi’. Interestingly, although the Moklen and Urak Lawoi’ share their own cluster (purple), the Moklen exhibited more mixed ancestry than the Urak Lawoi’ (Fig. 3A), indicating stronger interactions between the Moklen and other populations.

Figure 3

STRUCTURE result at K = 2–8 (A). Each individual is represented by a single column that is divided into segments whose size and color correspond to the relative proportion of a particular cluster. Populations are separated by black lines and population codes are listed in Table 1. Number of populations with the highest posterior probability expressed as the Delta K (B).

Overall, the genetic relationship results yielded three main observations. First, the Maniq and sea nomads were extremely genetically different from other Thai and Malaysian populations. Their distinct genetic structure, coupled with low genetic diversity (Table 1), is probably driven by genetic drift and/or inbreeding due to geographical isolation and small census size. Reduced genetic diversity of the Maniq was also observed in a previous study of mtDNA and Y chromosomal variation3. Second, among the sea nomad groups (excluding the Moken), the Urak Lawoi’ and Moklen were genetically dissimilar, with the latter displaying genetic admixture with other populations. In terms of ethnolinguistic background, the Moklen are more closely related to the Moken, and both are more distant from the Urak Lawoi’4. Although the languages of the sea nomads all fall within the Malayo-Polynesian sub-family of the Austronesian family, different dialects are spoken: the Urak Lawoi’, or Orang Laut, speak a Malayic language that is distantly related to those of the Moken and Moklen, who share many cultural connections. In addition, the Urak Lawoi’ were culturally isolated, whereas the Moklen frequently interacted with and were influenced by other southern Thais1,17,18. Therefore, the unique genetic signature of the Urak Lawoi’ and the mixed ancestry of the Moklen are consistent with the ethnolinguistic and cultural evidence. Third, we found more genetic similarity between the major southern Thai groups and populations from central Thailand than with those from other regions. This result agrees with a recent genome-wide study19 and could be explained by historical evidence: there were movements from the central region to the south during the Ayutthaya Period (1350–1767 A.D.)20, and genetic admixture between southern Thais and Malays after the settlement period is also possible13.

Genetic relationships between southern Thai populations and other Asian populations

A neighbor-joining (NJ) tree based on allele frequencies of 15 STR loci in 29 Asian populations reveals four clusters. Cluster 1 consists of populations from Island Southeast Asia and Malaysia, while South Asian populations occupy cluster 2. Cluster 3 comprises Mainland Southeast Asian populations, and cluster 4 contains the Thai sea nomads, the Maniq from Thailand and Indonesians from Bali, with extreme divergence of the Maniq (Fig. 4). Interestingly, both southern Thai Muslim populations (MST and MUD) and the southern Thai Takbai are positioned close to the South Asian cluster 2. One southern Thai Buddhist population (BUD) groups with the other Mainland Southeast Asian populations of cluster 3, while the other southern Thai Buddhist population (BST) clusters with the southern Thai sea nomads in cluster 4 (Fig. 4). Several lines of archaeological evidence indicate prehistoric contacts between India and present-day Thailand (and Cambodia) during the Iron Age that brought exotic goods and the Buddhist and Hindu religions; early states in this area, e.g. Dvaravati in central Thailand and Langkasuka on the Malay Peninsula, were influenced by Indian cultures during their initial establishment5. The South Asian connections of southern Thai populations could therefore reflect past admixture, in agreement with a previous study of genome-wide data19.

Figure 4

Neighbor-joining (NJ) tree. The NJ tree is based on Fst computed from allele frequencies of 15 STR loci in 29 populations, including southern Thai populations (indicated by dots) and other comparative Thai and Asian populations.

Conclusion

We generated and analysed forensic STR loci in diverse ethnolinguistic groups from southern Thailand. In general, the Maniq and sea nomads are highly diverged from the other Thai groups, while the southern Thai populations are closer to the Malays and to populations from central Thailand. This reflects different genetic structures among the major Thai groups of each region and emphasizes the importance of generating an allele frequency database for the southern region of Thailand. The allele frequencies generated here from the combined STR data of several populations are therefore useful for further forensic investigation in the region. From an anthropological genetic perspective, although STRs offer lower resolution than genome-wide data for elucidating population history, several results here are concordant with previous genome-wide data, e.g. the close relationship between southern and central Thais, reflecting a certain usefulness of this marker set. In addition, the Moklen and Urak Lawoi’ sea nomads had not previously been genetically investigated; this study provides an initial genetic background for these enigmatic groups from southern Thailand. We found genetic distinction between the Urak Lawoi’ and Moklen: the former had a unique genetic profile, while the latter exhibited mixed ancestry, reflecting more interaction with other populations. A limitation of this study is the small sample size of the Moken, which prevented comparison of their results with those of other populations. Additional studies of sea nomads from other locations in southern Thailand, coupled with further details from other genetic markers, will provide more insight into the genetic ancestry of AN-speaking people in the Malay Peninsula.

Materials and methods

Sample

We newly collected 184 samples from seven populations: AA-speaking Maniq; AN-speaking Moklen, Moken, Urak Lawoi’ and southern Thai Muslim; and TK-speaking southern Thai Buddhist and southern Thai Takbai. Samples were collected using buccal swabs with written informed consent. Prior to sample collection, all volunteers were interviewed to select subjects unrelated for at least two generations. The rights of participants and their identity have been protected during the whole process of this research. All experiments were performed in accordance with relevant guidelines and regulations based on the experimental protocol on human subjects which was approved by the Khon Kaen University Ethic Committee (Protocol No. HE622223) and Naresuan University Institution Review Board (COA No. 0464/2017). Combined with previously published southern Thai Buddhist and southern Thai Muslim data13, this gives raw genotype data for a total of 334 southern Thai samples (Table S1).

Data collection

Genomic DNA was extracted from buccal swabs using the Gentra Puregene Buccal Cell Kit (Qiagen, Hilden, Germany) according to the manufacturer’s instructions. Each DNA sample was amplified for 15 STR loci in a multiplex PCR using the commercial AmpFlSTR Identifiler kit (Applied Biosystems, Foster City, CA, USA) according to the manufacturer’s protocol. The amplicons were genotyped by multi-capillary electrophoresis on an ABI 3130 DNA sequencer (Applied Biosystems), and alleles were called with GeneMapper v.3.2.1 software (Applied Biosystems).

Statistical analysis

Arlequin v.3.5.2.221 was used to calculate allele frequencies, Hardy–Weinberg equilibrium (HWE) P values, observed heterozygosity (HO), expected heterozygosity (HE), total alleles, and gene diversity (GD). Significance levels for HWE were adjusted according to the sequential Bonferroni correction (α = 0.05/15)22. We used the Excel PowerStats spreadsheet23 to compute several forensic parameters, including power of discrimination (PD), matching probability (MP), polymorphic information content (PIC), power of exclusion (PE), and typical paternity index (TPI), as well as the combined PD (CPD), combined MP (CMP), and combined PE (CPE). To reveal population relationships and population structure, we also included genotype data from eight additional populations from northern Thailand (Yuan and Yong), northeastern Thailand (Khmer and Lao Isan), central Thailand (Mon and central Thai)8,10,11,12,37 and Malaysia (two Malay populations)14 (Table 1; Fig. 5). A genetic distance matrix based on the sum of squared differences (Rst) was generated in Arlequin and then plotted in two dimensions by means of multidimensional scaling (MDS) using Statistica v.10 demo (StatSoft, Inc., USA). Heatmap visualizations of the Rst and MDS values were produced in R (R Development Core Team).
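The multiple-testing adjustment for the HWE tests can be sketched as below. This assumes the sequential Bonferroni procedure in Holm's step-down form, in which the smallest of m P values is compared with α/m (0.05/15 ≈ 0.0033 for 15 loci), the next with α/(m − 1), and so on until the first non-significant test. The function name and the example P values are hypothetical and are not taken from Table 2.

```python
def sequential_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest P value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        threshold = alpha / (m - rank)
        if p_values[idx] > threshold:
            # Step-down rule: stop at the first non-significant test.
            break
        reject[idx] = True
    return reject


# First-step threshold for 15 loci, as quoted in the text.
print(round(0.05 / 15, 4))

# Hypothetical P values for four loci.
print(sequential_bonferroni([0.03, 0.001, 0.5, 0.004]))
```

The first-step threshold is the strictest one, so a locus that fails HWE at p < 0.0033 is rejected regardless of how the remaining steps proceed.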

Figure 5

Map of the sampling locations of the 17 populations used in the analyses of genetic diversity and genetic structure. Colors indicate geographic region/country: red, purple, blue, green and black for populations from southern Thailand, northern Thailand, northeastern Thailand, central Thailand and Malaysia, respectively. Symbols indicate language family: circle, square and triangle for the Austronesian, Tai-Kadai and Austroasiatic families, respectively. (Adobe Illustrator CS4 14.0.0, http://www.adobe.com/sea/).

To delineate cryptic population structure with a Bayesian clustering method, we ran STRUCTURE version 2.3.4 with the following settings: admixture, correlated allele frequencies, and sampling locations as prior information (LOCPRIOR model)24,25,26. We ran ten replicates for each number of clusters (K) from 1 to 11, with a burn-in of 100,000 iterations followed by 200,000 iterations. We used STRUCTURE Harvester27 to compute the second-order rate of change of the log probability between successive K values (ΔK) in order to identify the optimal K for the data16. We used CLUMPAK28 and DISTRUCT29 to generate the final STRUCTURE results. To evaluate genetic relatedness with other Asian populations, we used POPTREE v.230 to generate a neighbor-joining (NJ) tree based on Fst computed from allele frequencies of 15 STR loci in 29 populations from South and Southeast Asia8,9,11,12,14,31,32,33,34,35,36,37,38,39.
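The ΔK statistic used to choose K can be sketched as follows. This is one common formulation, assumed here rather than taken from the STRUCTURE Harvester source: the absolute second-order difference of the mean log probability across replicates, divided by the standard deviation of the replicates at that K. The function name and the log-probability values are hypothetical.

```python
from statistics import mean, stdev


def delta_k(ln_prob):
    """Compute Delta K for every K that has both neighbours K - 1 and K + 1.

    ln_prob: dict mapping K to a list of ln P(D) values from replicate runs.
    """
    means = {k: mean(values) for k, values in ln_prob.items()}
    result = {}
    for k in sorted(ln_prob):
        if k - 1 not in means or k + 1 not in means:
            continue  # Delta K is undefined at the ends of the K range.
        sd = stdev(ln_prob[k])
        if sd == 0:
            continue  # Avoid division by zero when replicates are identical.
        second_order = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_order / sd
    return result


# Hypothetical replicate log probabilities for K = 1 to 4.
runs = {
    1: [-100.0, -102.0],
    2: [-80.0, -82.0],
    3: [-78.0, -80.0],
    4: [-77.0, -79.0],
}
scores = delta_k(runs)
print(max(scores, key=scores.get))
```

With runs for K = 1 to 11, ΔK is defined only for K = 2 to 10, since each value needs a neighbour on both sides.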

Ethics statement

The rights of participants and their identity have been protected during the whole process of this research. All experiments were performed in accordance with relevant guidelines and regulations based on the experimental protocol on human subjects which was approved by the Khon Kaen University Ethic Committee (Protocol No. HE622223) and Naresuan University Institution Review Board (COA No. 0464/2017).