Cultural variation impacts paternal and maternal genetic lineages of the Hmong-Mien and Sino-Tibetan groups from Thailand

The Hmong-Mien (HM) and Sino-Tibetan (ST) speaking groups are known as hill tribes in Thailand; they were the subject of the first studies to show an impact of patrilocality vs. matrilocality on patterns of mitochondrial (mt) DNA vs. male-specific portion of the Y chromosome (MSY) variation. However, HM and ST groups have not been studied in as much detail as other Thai groups; here we report and analyze 234 partial MSY sequences (∼2.3 mB) and 416 complete mtDNA sequences from 14 populations that, when combined with our previous published data, provides the largest dataset yet for the hill tribes. We find a striking difference between Hmong and IuMien (Mien-speaking) groups: the Hmong are genetically different from both the IuMien and all other Thai groups, whereas the IuMien are genetically more similar to other linguistic groups than to the Hmong. In general, we find less of an impact of patrilocality vs. matrilocality on patterns of mtDNA vs. MSY variation than previous studies. However, there is a dramatic difference in the frequency of MSY and mtDNA lineages of Northeast Asian (NEA) origin vs. Southeast Asian (SEA) origin in HM vs. ST groups: HM groups have high frequencies of NEA MSY lineages but lower frequencies of NEA mtDNA lineages, while ST groups show the opposite. A potential explanation is that the ancestors of Thai HM groups were patrilocal, while the ancestors of Thai ST groups were matrilocal. Overall, these results attest to the impact of cultural practices on patterns of mtDNA vs. MSY variation.


Introduction
Thailand occupies the center of Mainland Southeast Asia (MSEA) and shares borders with several countries: Laos and Myanmar in the North and West, Laos in the Northeast, Cambodia in the East, and Malaysia in the South (Fig. 1). With a census size of~68.61 million in 2018, there are 73 different recognized languages belonging to five different linguistic families: Tai-Kadai (TK, 89.4%), Austroasiatic (AA, 4.0%), Sino-Tibetan (ST, 3.2%), Austronesian (AN, 2.8%), and Hmong-Mien (HM, 0.2%) [1]. Archaeological evidence indicates the presence of modern humans in the area of Thailand since the late Pleistocene [2]. More recently, archaeogenetics studies indicate that modern AAspeaking groups in Southeast Asia (SEA) are descended from a dispersal of Neolithic farmers from southern China that occurred~4000 years ago (kya) [3,4]. Archaeological and linguistic evidence supports the presence of AA people by at least 2.5 kya, while a later expansion of the TKspeaking groups from southern China reached present-day Thailand~2 kya [2,5]. And, historical evidence indicates that the homeland of many ST and HM speaking hill tribe groups (e.g., Akha, Lisu, Lahu, Karen, Hmong, and IuMien) is in the area further north of Thailand, i.e., northern Myanmar, northern Laos, and southern China, and most of these groups migrated to present-day Thailand~200 ya [6,7]. Thus, the present-day cultural and linguistic variation in Thailand has multiple sources, but the HM and ST groups have not been studied in as much detail and the impact of this variation on genetic variation is still poorly understood.
The HM language family is one of the major language families in MSEA, comprising some 39 languages (35 Hmongic and 4 Mienic) distributed across China, northern Vietnam, northern Laos, and northern Thailand [8]. Although the Hmongic and Mienic languages are relatively similar to one another, there are differences due to divergent developments in the phonology [8]. The heartland of the Hmong people is considered to be in the present-day southern Chinese province of Kweichow, where they were established at least 2 kya, probably arriving from the east [6]. Migrations into Thailand through Laos are documented since the second half of the 19th century A.D. The IuMien are, like the Hmong, thought to have an origin in southeastern China, from which they started to migrate southwards to Vietnam in the 13th century A.D., entering Thailand about 200 ya [6,9]. With two main ST subfamilies, Chinese and Tibeto-Burman, the ST family is both large (~460 languages spoken by over a billion people) and spread across many countries in South, East and SEA, including China, Nepal, Bhutan, northeastern India, Pakistan, Myanmar, Bangladesh, Thailand, Vietnam, and Laos [10].
The HM and ST groups in Thailand are regarded as the hill tribes who inhabit the high mountainous northern and western region of Thailand. Consisting of~700,000 people, there are nine officially recognized hill tribes: the AAspeaking Lawa, Htin and Khmu; the HM-speaking Hmong and IuMien; and the ST-speaking Karen, Lahu (or Mussur), Akha, and Lisu [6,7]. In addition to living in a remote and isolated region of Thailand, the hill tribes are of interest for their cultural variation. In particular, postmarital residence pattern varies among the hill tribes, with some practicing patrilocality (i.e., following marriage, the woman moves to the residence of the man) while others are matrilocal (i.e., the man moves to the residence of the woman). The first study to document an effect of patrilocality vs. matrilocality on patterns of human mitochondrial (mt) DNA vs. malespecific portion of the Y chromosome (MSY) variation was carried out on the hill tribes [11], and has been further investigated in subsequent studies [9,12].
Our previous studies on the paternal and maternal genetic lineages and structure of many TK and AA groups indicated  [12][13][14][15]. Red stars, green triangles, black circles, and blue squares represent Hmong-Mien, Sino-Tibetan, Austroasiatic, and Tai-Kadai speaking populations, respectively. The barplots on the left and right sides of the map depict the proportion of MSY and mtDNA haplogroups specific to Northeast Asia (NEA), Southeast Asia (SEA), or of unknown/other origin. different and complex demographic histories in populations from Thailand and Laos [12][13][14][15]. However, the ST and HM speaking groups have not been studied in as much detail.
Here we generated and analyzed 416 complete mtDNA genome sequences and 234 partial sequences of the MSY from 14 populations belonging to 11 HM and ST speaking hill tribe populations, and from three non-hill tribe TK populations: the Shan, who migrated recently from Myanmar and live in the mountainous area of northern Thailand; and the Phutai and Lao Isan from the Northeast of Thailand (Fig. 1). This is the first detailed genetic study of Thai HM speaking groups and we also revisit the impact of patrilocality vs. matrilocality on patterns of mtDNA and MSY variation with higher-resolution methods and more populations than studied previously.  Table S1; Fig. 1). Genomic DNA samples of HM1-HM4 and Y1 and Y2 were from a previous study [16], while the remaining groups were newly-collected buccal samples obtained with written informed consent and with ethical approval from Khon Kaen University and the Ethics Commission of the University of Leipzig Medical Faculty. We extracted DNA using the Gentra Puregene Buccal Cell Kit (Qiagen, Germany) according to the manufacturer's directions.

Sequencing
Genomic libraries with double indices were prepared and enriched for mtDNA as described previously [17,18]. The libraries were sequenced on an Illumina Hiseq 2500 to obtain mtDNA consensus sequences as described in our previous studies [14,15]. We used Bustard for Illumina standard base calling and the read length was 76 bp. We also enriched for ∼2.3 mB of the MSY from the same genomic libraries for male samples via in-solution hybridization-capture using a previously designed probe set [12,14] and the Agilent Sure Select system (Agilent, CA). The target MSY regions are positioned between positions 9927192 and 13199427 on the human reference genome hg19 [12,14]. Sequencing was carried out on the Illumina HiSeq 2500 platform with paired-end reads of 125 bp length and we used Bustard for Illumina standard base calling. The process for manipulating raw sequencing data, alignment and post-processing pipeline of the sequencing data for both mtDNA and MSY were carried out as previously described [12][13][14][15]. We then manually checked and manipulated sequences with Bioedit (www.mbio.ncsu.edu/BioEdit/bioedit. html). The complete mtDNA sequence data set can be found at GenBank (accession number MT418943-MT419358). All reads that aligned to the region of the MSY that was targeted by the capture-enrichment array were deposited in the European Nucleotide Archive (study ID PRJEB36639). Final SNP genotypes and their chromosomal positions on hg19 are provided in Supplementary Table S2.

Statistical analysis
The newly-generated 234 MSY sequences from 14 populations were combined with 928 sequences from 59 populations from our previous studies [12,14] for a total of 1162 sequences belonging to 73 populations. For mtDNA, combining the 416 new sequences from this study with 1434 sequences from our previous studies [13][14][15] brings the total to 1850 sequences from 73 populations. Summary statistics of the genetic diversity within populations, the matrix of pairwise genetic distances (Φ st ), and analyses of molecular variance (AMOVA) were obtained with Arlequin 3.5.1.3 [19]. To visualize population relatedness, the R package [20] was used to carry out the nonmetric MDS analysis (based on the Φ st distance matrices for the MSY and mtDNA) (R function: isoMDS package: MASS) and to construct heat plots of the Φ st distance matrix and the matrix of shared haplotypes (R function: ape, pegas, adegenet and ggplot2 packages). STATISTICA 13.0 (StatSoft, Inc., USA) was used to carry out a correspondence analysis (CA) based on MSY and mtDNA haplogroup frequencies. For the MSY, haplogroup assignment was performed by yHaplo [21]. Haplogroups were assigned to the maximum depth possible given the phylogeny of ISOGG Y-DNA Haplogroup Tree 2015 (http://www.isogg.org/) and the available genetic markers in our target region. The derived SNPs for each sample are provided in Supplementary Table S2. The mtDNA haplogroups were assigned by Haplogrep2 [22] with PhyloTree mtDNA tree Build 17 (http://www. phylotree.org) [23] and the polymorphisms for each sample are provided in Supplementary Table S3. To obtain a broader picture of population relationships within SEA, we included publicly-available sequences from relevant populations for the MSY and mtDNA (Supplementary Table S4). To compare Northeast Asia (NEA) and SEA prevalent haplogroups with our studied populations, we calculated the frequency of MSY/mtDNA NEA and SEA prevalent haplogroups in the HM and ST speaking populations from China and Vietnam (Supplementary Table S5). To construct Bayesian skyline plots (BSP) per population and maximum clade credibility (MCC) trees per haplogroup, based on Bayesian Markov Chain Monte Carlo (MCMC) analyses, we used BEAST 1.8.4 [24]. For BSP plots by population, we conducted analyses both pooling all populations within the same ethnicity (e.g., pooling HM1-HM5), and for the individual populations. The Bayesian MCMC estimates (BE) and 95% highest posterior density (HPD) intervals of haplogroup coalescent times were calculated using the CongPy6 sequence (haplogroup A1b1-M14) for rooting the tree for MSY haplogroup C [25] and the Mbuti-3 sequence (haplogroup B-M182) for rooting the tree for haplogroups F and O [26]. BEAST input files were created with BEAUTi v1.8.2 after first running jModel test 2.1.7 in order to choose the most suitable model of sequence evolution [27]. The best substitution models are shown in Supplementary Table S6. We used an MSY mutation rate of 8.71 × 10 −10 substitutions/bp/year [28], and the BEAST input files were modified by an in-house script to add in the invariant sites found in our data set. Both strict and log normal relaxed clock models were run, with marginal likelihood estimation [29,30]. After each BEAST run, the Bayes factor was computed from the log marginal likelihood of both models to choose the best-fitting BSP/MCC tree (Supplementary Table S6). For mtDNA, we executed BSP analyses per population and the BEAST runs by haplogroup, with mutation rates of 1.708 × 10 −8 and 9.883 × 10 −8 for data partitioned between the coding and noncoding regions, respectively [31]. The RSRS was used for rooting the tree for mtDNA [32]. The strict clock model was used, and the best substitution models are shown in Supplementary Table S6. The Bayesian skyline piecewise linear tree prior for the dating and Bayesian skyline generation were applied, so as to allow for population size changes over time.
When we focus on the MSY lineages that are prevalent in NEA, i.e., C2e*, D-M174 and N* in our samples, the frequency of NEA lineages is >30% among the Hmong (HM2-HM4), Lawa (LW2), and Karen (KPA) from northern Thailand, and the Nyaw (NY) from northeastern Thailand. In contrast O1b*, which is the predominant lineage in SEA and at high frequency in AA speaking people [12], is the major lineage in the Thai/Lao AA speaking group (except for some Mon populations who show evidence of admixture with Central Thai populations). Interestingly, there is heterogeneity in the frequency of NEA/SEA lineages in the Hmong and Lawa groups: among the five Hmong populations, HM5 completely lacks NEA lineages, while within the Lawa groups, LW2 has 56% NEA lineages while LW3 has exclusively SEA lineages ( Fig. 1).

mtDNA
We generated 416 complete mtDNA sequences with mean coverage ranging from 35× to 7752× (overall average coverage 1934×) (Supplementary Table S3). When combined with 1434 sequences from our previous studies [13][14][15] there are in total 1850 sequences belonging to 73 populations, with 1125 haplotypes. When combined with our previous data there are a total of 285 haplogroups (Supplementary  Table S8); several were not reported previously from Thai/ Lao populations and these are specific to some populations, e.g., B4a5 (specific to the Hmong), and B5a1c1a, B5a1c1a1, D4e1a3, and F1g1 (specific to HM groups).

MSY
Overall, the HM, AA and ST groups tend to have more heterogeneous and lower genetic diversity values than the TK groups ( Fig. 2; Supplementary Table S9). By contrast, genetic diversities of the HM groups are not statistically different from the AA and ST groups, nor do the ST and AA groups differ significantly in genetic diversity values (Supplementary Table S9). At the individual population level, out of 63 populations, the Hmong (HM1) shows lower haplotype diversity than all other groups except for two hunter-gatherer groups (Mlabri (MA) and Maniq (MN)) and the Htin Mal (TN1). HM1 and HM5 have lower genetic diversity than the other HM populations, although HM2 shows the highest MPD values. Generally, the Hmong (HM1-HM5) groups show lower haplotype and haplogroup diversity than the In Mien (Y1-Y2) (Fig. 2a,  b). Of the eight ST speaking populations, the newly-studied Karen group (KSK3) exhibits lower genetic diversities while the Lisu (LS) and KSK1 have higher genetic diversities than the other ST speaking populations ( Fig. 2a-d).
Although low genetic diversities are observed in HM1, a significantly low Tajima's D value (Fig. 2d) suggests recent paternal expansion in this group. Significant negative Tajima's D values were observed more frequently in the TK than in the AA and HM groups (P < 0.05: 11/34 for TK, 6/24 for AA, and 2/7 for HM) but no significant Tajima's D values were observed in any of the ST-speaking groups (Fig. 2d).
mtDNA Along with the Mlabri (MA), Htin (TN1 and TN2), and Seak (SK), the ST speaking Lahu or Mussur (MR) shows low mtDNA haplotype and haplogroup diversities whereas the Lisu (LS) shows higher genetic diversities than the other ST populations (Fig. 2a-c). In contrast to the MSY, the Hmong groups exhibited generally higher genetic diversities than the ST and AA speaking groups (Supplementary Table S9). Both the ST and AA groups have lower genetic diversity values than the TK groups (Supplementary Table S9). As also seen in the MSY results, both IuMien populations (Y1 and Y2) show higher genetic diversities than the other Hmong and ST speaking groups (Fig. 2a-c). In agreement with the MSY results, a significantly negative Tajima's D value was observed more frequently for the TK than for the AA and HM groups (P < 0.05: 21/34 for TK, 5/24 for AA, and 2/7 for HM). Interestingly, the ST-speaking groups show no significant Tajima's D values while the two IuMien groups both show significant negative Tajima's D values (Fig. 2d).

MSY
The AMOVA results indicate that the variation among populations accounts for 13.72% of the total MSY genetic variance ( Table 1). The HM group shows the greatest genetic heterogeneity among populations, followed by the AA and ST groups; the TK group shows the lowest amongpopulation variation. The Thai Hmong, with five populations sampled, shows higher variation among the populations than do the other hill tribe groups. When HM and ST populations from Vietnam [33] were included in the analysis, genetic variation among populations of the ST and HM groups increased substantially, suggesting some differentiation between Vietnamese and Thai populations belonging to these two groups. However, a direct comparison of Thai ST vs. Vietnamese ST, and Thai HM vs. Vietnamese HM groups, showed no significant differences between groups. The variation among populations within groups of the IuMien and Lahu were much lower than for the Hmong, indicating genetic heterogeneity of the Hmong and ST populations and more homogeneity for the IuMien and Lahu. The MSY genetic variation showed significant differences among the four language families (HM, ST, AA, and TK), but the variation among groups was lower than the variation among populations within each group, indicating that language families do not correspond to genetic structure. All pairwise comparisons of the four language families showed significant differences among groups, but these were on the same order as the differences among populations within the same group. However, when the HM were separated into Hmong and IuMien, the pairwise comparisons of Hmong with other language families remained significant, while for the IuMien there were no significant differences with other language families, suggesting some differences between Hmong and IuMien groups. The lowest variation was between the TK and AA groups, indicating a relatively close genetic relationship between these two.

mtDNA
The total mtDNA variation among populations of 8.46% was lower than for the MSY ( Table 1). The mtDNA variation for the AA, ST, and TK groups was about the same as for the MSY, but was substantially less for the HM. The Htin have by far the largest variation among populations, while the Hmong show nonsignificant mtDNA variation among populations (0.53%), reflecting genetic homogeneity in their maternal side. The mtDNA genetic variation among the four language families (HM, ST, AA, and TK) was much smaller (1.09%) than the variation among populations assigned to each group (7.7%), indicating that as with the MSY, language families do not correspond to the genetic structure of these populations. The variation between pairs of linguistic groups shows in all comparisons that the variation between groups is lower than the variation among populations within groups. As with the MSY, the lowest variation (which is not significantly different from zero) is between the TK and AA groups, further supporting a close relationship between these two groups in Thailand.
The pooled mtDNA data of Lahu from Vietnam and Thailand revealed much larger variation for mtDNA (13.47%) than for the MSY (4.88%), in contrast to the larger MSY than mtDNA variation observed when pooling data from other groups from Thailand and Vietnam. In particular, the mtDNA variation among Hmong groups from Thailand and Vietnam was only 1.08%, which is not significantly different from zero. When the Hmong and IuMien were separately compared with other linguistic groups, significant differences were observed for the Hmong but not for IuMien, similar to the MSY results and further supporting the difference between Hmong and Mien groups.

Population affinity
MSY Shared haplotypes within populations are an indication of smaller population size and increase relatedness among individuals, while shared haplotypes between populations are an indication of recent shared ancestry or contact. There were shared MSY haplotypes within the HM groups, and some sharing between them and a few TK-speaking groups, except for HM5 and Y1, who did not share any haplotypes with any other populations (Fig. 3a). The Lisu shared haplotypes with both Lahu (MB and MR) populations, while both Lahu populations shared haplotypes among themselves and also with one group of central Thai (CT5). The newly studied Karen (KSK3) shared haplotypes with the other Karen populations (KSK1, KSK2, and KPW), and also with their neighbors, i.e., Shan (SH2) and Lawa (LW1) (Fig. 3a).
Genetic distance values are a further indication of genetic relationships among populations; the genetic distances (Φ st values) indicate, in general, genetic heterogeneity among AA populations and homogeneity among TK populations, as well as genetic differences between the AA (except the Mon) and TK populations. For the newlystudied HM groups, significant genetic differences between the Hmong and almost all other populations were observed, whereas the Φ st values for comparisons of the IuMien (Y1 and Y2) and Lisu with many populations were not significant (Fig. 3b). The new Karen group is significantly different from almost all populations except SH2 and KSK1. The two Lahu populations are genetically distinct from all other populations (except each other).
To further visualize the relationships based on the Φ st distance matrix, we carried out MDS analysis. The MDS plot for three dimensions indicates genetic distinction of the Maniq (MN), the hunter-gatherer group from southern Thailand, the Hmong groups (HM1-HM5) and the Karen (KSK3) (Supplementary Fig. S4), as further indicated in the MDS heat plot (Supplementary Fig. S5). Based on the MDS results for both the MSY and mtDNA, we removed five highly-diverged populations (MA, MN, TN1, TN2, and SK); a three-dimension MDS for the remaining 68 populations has an acceptable stress value (Fig. 4a-c). There was overall some clustering of populations according to  Fig. 4a-c) shows clustering of the Vietnamese HM-speaking Pathen and Hmong with the Thai Hmong populations, while the Vietnamese HM-speaking Yao are more similar to the Thai IuMien groups. The Karen (KSK3) remains distinct, but shows a close genetic relatedness to Burmese. Some of the TK-speaking groups from Vietnam, i.e., Nung, Tay and Thai, are close to the TK populations from northeastern Thailand, e.g., Phutai (PT1 and PT2), Kalueang (KL), and Black Tai (BT1).
To investigate which MSY haplogroups might be driving population relationships, we carried out a correspondence analysis (CA), which is based on haplogroup frequencies.
The results indicate that the genetic distinctiveness of the Hmong reflects high frequencies of haplogroups O2a2a1a2a1a2 (O-N5) and C2e2 (C-F845) (Supplementary mtDNA With respect to mtDNA haplotype sharing (Fig. 3a), the HM populations (HM1-HM5) share mtDNA haplotypes extensively with each other, including with the IuMien populations (Y1 and Y2), but do not share haplotypes with any other population, reflecting their unique genetic structure. As with the MSY results, the Lisu (LS) only shares haplotypes with both Lahu populations (MB and MR), indicating contact between them. The newly studied Karen population (KSK3) also exhibits large differences from the other Karen groups: there is no mtDNA haplotype sharing between the KSK3 and the other Karen populations, while  (Fig. 4d-f).
The MDS plot based on the Φ st distance matrix that includes comparative data from other SEA populations (Fig. 5d-f)   The CA analysis based on mtDNA haplogroup frequencies ( Supplementary Fig. S7) further confirms the distinctiveness of the HM groups based on several haplogroups, i.e., B5a1c1a, B5a1c1a1, B4a5, C7a, D4e1a3, F1g1, F1g2, N9a10 (16311C), and M74a. The Lahu and MO5 were distinguished in the third dimension, reflecting haplogroups B4e and D4j1a1. In the fourth dimension several groups are distinguished via many specific lineages, including all of the Karen groups and two Mon groups from the border between Thailand and Myanmar.

MSY
The BSPs of population size change (N e ) over time were constructed for each ethnicity. For the MSY, different trends were observed for different groups (Fig. 6). The N e of the Hmong gradually increased since~30 kya and then declined~2-3 kya, while for the Lahu the N e remained stable for a long period of time and then was sharply reduced around~1 kya. The Karen, Shan, and Phutai showed a similar trend: the N e gradually increased, and then decreased~5 kya, with sharp increases~2-3 kya, followed by another decrease~1 kya. The N e for the IuMien slightly increased, and then decreased~2-3 kya. In general, we see similar demographic changes at 5 kya, 2-3 kya and 1 kya in both the new groups studied here and in our previous studies of AA Mon, Khmer and Htin groups, and central Thai TK groups [12]. The first two changes may reflect male-specific expansion during the Neolithic period and the Bronze/Iron Age that are characteristic of modern AA and TK groups respectively, as discussed previously [12].  The BSPs for each ethnicity show that several groups, i.e., Lahu, Shan, Phutai, show a common trend of N e increasing 30-50 kya, and then stable until a decline ∼2-5 kya (Fig. 6; BSPs for each individual population are in Supplementary  Fig. S8). However, the Hmong and Karen showed a different pattern, namely a decrease in N e ∼5 kya followed by rapid growth ∼2.0-2.5 kya for the Hmong, and~1.0 kya for the Karen. In general, the BSP plots for the AA (Mon, Htin, Lawa, and Khmer) and TK populations (Yuan, Phuan, and Lue) [12] are similar to the BSPs observed for most of the groups in this study. The major TK groups (Khon Muenag, Central Thai, and Lao Isan) also showed a population increase ∼10 kya in our previous study [12]. Interestingly, the distinct BSP plot of the Hmong, showing population increase during the Bronze/Iron Age, indicates different demographic changes in the maternal vs. paternal side and supports genetic differences of the Hmong from the other populations indicated by other results (Figs. 3, 4).

Patrilocal vs. matrilocal genetic variation
There are nine official hill tribes in Thailand: the AAspeaking Lawa, Htin, and Khmu; the HM-speaking Hmong and IuMien; and the ST-speaking Karen, Lahu, Lisu and Akha. The Lahu, Karen, and Htin are matrilocal (i.e., the husband moves to the residence of the wife after marriage) whereas the others are patrilocal. Our previous study [12] had investigated four hill tribes (Lawa, Htin, Khmu, and Karen); here we add data from four additional hill tribes (Hmong, IuMien, Lahu, and Lisu) for a total of 23 populations belonging to eight hill tribes. Moreover, although the Palaung is not officially recognized as a hill tribe group, we include them in the analysis because they are a minority people from the same mountainous region of northern Thailand.
The Hmong (HM1-HM5), IuMien (Y1-Y2), Lisu (LS), Khmu (KA), Lawa (LW1-LW3), and Palaung (PL) groups practice patrilocality, whereas the Htin (TN1-TN3), Karen (KSK1-KSK3, KPA, and KPW) and Lahu (MR and MB) are matrilocal. If genetic variation was influenced by residence pattern, then lower within-population genetic diversity coupled with greater genetic heterogeneity among populations is expected for matrilocal groups than for patrilocal groups for the mtDNA, whereas the opposite pattern is expected for the MSY [11]. However, the MSY h and MPD values do not differ significantly between patrilocal and matrilocal groups (Supplementary Table S9 and Fig. S9). For mtDNA, genetic diversity values are significantly higher for patrilocal than for matrilocal groups for h but the differences are not statistically significant for MPD (Supplementary Table S9). Notably, the patrilocal groups HM1, HM4, and LS exhibit higher than average MPD values for the MSY (78.65-99.37, compared with the average of 51.74) and some matrilocal groups, e.g., KSK3 and TN1-TN2, show much lower MPD (6.92-20.69) than average (33.99) for mtDNA (Supplementary Table S9). Furthermore, the genetic diversity of all Htin groups are much lower than the other groups ( Supplementary Fig. S9).
Another potential effect of patrilocality vs. matrilocality is on the shared haplotypes between populations. If recent contact between populations is influenced by residence pattern, one would expect more MSY haplotype sharing among matrilocal groups than among patrilocal groups, and more mtDNA sharing among patrilocal groups than among matrilocal groups. However, the results for haplotype sharing between populations within matrilocal and patrilocal groups do not show a strong effect (Supplementary  Table S10). Haplotype sharing for the MSY is slightly lower on average for patrilocal groups (0.15) than for matrilocal groups (0.18), in accordance with expectations, but haplotype sharing for mtDNA is also lower on average for patrilocal groups (0.15) than for matrilocal groups (0.22), which is not in accordance with expectations based on residence pattern.

Discussion
Our previous studies have focused on the genetic ancestry of the TK and AA groups in Thailand and Laos, here we investigate the less well-studied HM and ST speaking groups from Thailand, to gain more insights into the genetic history of MSEA. We sequenced ∼2.3 mB of the MSY and complete mtDNA genomes of the HM and ST groups who are regarded as hill tribes from northern Thailand, as well as additional TK groups from northern and northeastern Thailand. Although we focus on the HM and ST groups, we note that the previously-observed general pattern of overall genetic homogeneity of Thailand TK groups [12] continues to be maintained with these additional TK groups, consistent with the idea that the TK language family spread via demic diffusion [13]. However, an additional insight arises when we compare the Thai TK data to similar mtDNA and MSY data from Vietnamese TK groups: [33] some of the TK-speaking groups from Vietnam (i.e., Nung, Tay, and Thai) are quite similar to the TK populations from northeastern Thailand (Phutai, Kalueang, and Black Tai) (Fig. 5). This is in agreement with historical evidence for a migration of the ancestors of the Phutai, Kalueang, and Black Tai from Vietnam through Laos during last 200 years [6,34].

Genetic differences between the Hmong and IuMien groups and their origins
Previous studies of HM groups have reported sequences of the mtDNA hypervariable region 1 with some diagnostic coding SNPs to define haplogroups [35], and Y-STR variation and genotypes for Y chromosomal bi-allelic loci [36]. Here we analyze complete mtDNA and partial MSY sequences from five Hmong and two IuMien populations from Thailand; strikingly, we find significant differences between Hmong and Mien populations in Thailand, with the IuMien more similar to other populations (Figs. 3, 4, Supplementary Fig. S6, S7 Apart from the genetic distinction from their linguistic relatives, the IuMien, the Hmong in Thailand are genetically distinct from almost all other groups (Fig. 3). There are no shared mtDNA haplotypes between HM populations and other Thai/Lao populations, and only a few shared MSY haplotypes (Fig. 3a), moreover, they do not overlap with other groups in the MDS analysis (Fig. 4), suggesting that they add unique genetic profiles that were not found in the previous studies of Thai/Lao AA, TK, and ST groups [12]. This striking genetic divergence of Hmong populations in Thailand may reflect cultural isolation. Hmong communities have strong connections and prefer to marry with other Hmong groups rather than with non-Hmong groups [6]. In contrast, the IuMien have shared haplotypes and closer genetic relatedness with several TK-speaking groups, indicating more contact with other groups. These results may reflect the pronounced IuMien culture for adoption. Based on ethnographic accounts from the 1960s, around 10-20% of adult IuMien were adopted from other ethnic groups (both highland and lowland), in order to increase the size of their household and their family's influence [6,9,37]. Another factor behind the genetic similarity of IuMien with other East Asian populations could be admixture, as suggested by sharing of features between IuMien (but not Hmong) and Sinitic languages [38].
Although the proto-HM groups were suggested to have originated in central and southern China during the Neolithic Period [35], their greatest ethnolinguistic diversity is found between the Yangtze and Mekong rivers today. In general, the ages of mtDNA and MSY haplogroups characteristic for HM groups are during the Holocene to the late Neolithic Period (Supplementary Figs. S2, S3). This is consistent with archaeological and historical evidence that the proto-HM group might be linked with the Neolithic cultures in the Middle Reach of the Yangtze River in southern China, namely the Daxi culture (5300-6400 kya) and the Qujialing culture (4600-5000 kya) [35]. Our results are also consistent with a recent study of HM groups from Húnán, China [39], which identified lineages within MSY haplogroup O-N5 as specific to Hmong (and dated to ∼2.33 kya) and mtDNA haplogroup B5a1c1a as correlating with Pahng and IuMien (and dated to ∼9.80 kya). However, we do find B5a1c1a* and B5a1c1a1 specific to Hmong, but not IuMien, in Thailand. Overall, the coalescent ages of both MSY and mtDNA lineages are in the same range.

The origin of the Sino-Tibetan groups
It has been proposed that Neolithic people living at least 6 kya [40] in northwestern China were probably the ancestors of modern ST populations. It has also been suggested that ST languages originated among millet farmers, located in North China, around 7.2 kya [41] or 5.9 kya [42]. Linguistic evidence then suggests differentiation between Sinitic and Tibeto-Burman languages, and also between Tibetan and Lolo-Burmese languages, and southward and westward expansions of ST groups [41,42].
Although an MSY lineage (O-M122*) was proposed to be characteristic of all modern ST populations [43], subsequent studies have found further differentiation, e.g., haplogroup O2a1c-002611, which is at high frequency in Han Chinese but found at very low frequencies in Tibeto-Burman populations [44,45]. Also, autosomal STR genotypes differentiate Tibetan and Lolo-Burmese speaking groups [46].
ST-speaking groups in Thailand have not been studied in the same detail as those in China; here we analyzed three groups: Lisu, Lahu (Mussur), and Karen. Lisu and Lahu speak Lolo-Burmese languages, while the Karen languages belong to a different branch, Karenic [6]. Historical evidence indicates that Lisu and Lahu migrated from southern China through Myanmar to northern Thailand about 100-200 years ago [6]. The Karen claim to be the first settlers in Myanmar who migrated from southern China before the arrival of Mon and Burmese people, and the Karen groups in Thailand have been migrating from Myanmar started around 1750 A.D. due to the growing influence of the Burmese [47].
The genetic distances between the Lisu and Lahu are significantly different from zero for both mtDNA and the MSY, and they also share both mtDNA and MSY haplotypes (Fig. 3a, b), indicating recent contact and/or shared ancestry. The two Lahu populations (Black Lahu (MB) and Red Lahu (MR)) are genetically similar to one another and both are genetically distinct from the other populations (Figs. 3b, 4b-d). However, the Lisu do not differ significantly from many AA and TK populations (Fig. 3b), suggesting interactions between the Lisu and other populations. For the Karen, we have added an additional Karen population (Skaw (KSK3)) to the previously studied Karen populations; KSK3 has very low haplogroup diversity and MPD values for the MSY (Fig. 2), suggesting strong genetic drift that has in turn increased their divergence from the other Karen populations (Fig. 4). This is in keeping with historical information: according to their oral history, KSK3 was founded~60 years ago by just 18 households in a very remote region that is isolated from other Karen villages. The other Karen groups are genetically similar to several populations, and also share many basal mtDNA M haplogroups (M21a, M* and M91a) with neighboring Austroasiatic populations, especially the Mon, suggesting significant admixing (Supplementary Table S8). Previous studies based on autosomal STRs and SNPs also support the relatedness of the Karen and other AA groups in Thailand [47,48].
The estimated coalescent ages of the predominant lineages in ST populations provide an upper bound for their divergence/contact from other groups. MSY haplogroup O2a2b1a1-Page23, equivalent to O-M117, which was previously reported to be abundant in TB groups in southwestern China and in Han Chinese [45], was dated to around~2.41 kya in this study. However, this MSY lineage also occurs in HM, TK, and AA populations, reflecting recent shared ancestry and/or contact ( Supplementary  Fig. S2). The ages of mtDNA haplogroups prevalent in Lahu and Lisu, namely A13, B4e, D4j1a1, and G1c, are dated to~6.74,~2.21,~2.49, and~2.20 kya, respectively ( Supplementary Fig. S3). Thus, the coalescent ages of many MSY and mtDNA lineages prevalent in ST groups are around the time of the Han expansion (~2.5 kya) [49].
Contrasting paternal and maternal genetic variation in patrilocal vs. matrilocal groups Previously, postmarital residence pattern has been shown to influence genetic variation in the hill tribes of Thailand [9,11,12]. Our previous study had investigated four hill tribes: Lawa, Htin, Khmu and Karen [12]. Here we added data from four additional hill tribes (Hmong, IuMien, Lahu, and Lisu) for the most detailed investigation to date, comprising a total of 23 populations belonging to eight hill tribes. The Hmong, IuMien, Lisu, Lawa and Khmu are patrilocal (i.e., the wife moves to the residence of the husband after marriage) whereas the others are matrilocal. If postmarital residence pattern is having an influence on patterns of genetic variation, we would expect larger between-group differences and smaller within-group diversity for patrilocal groups for the MSY, and the same trends for matrilocal groups for mtDNA [11].
In general, the within-population genetic diversity values were not in agreement with expectations (Supplementary  Table S9 and Fig. S10) whereas genetic differentiation between populations did go in the direction predicted by postmarital residence pattern (Table 1). However, when focusing on genetic differentiation within individual groups, the patrilocal Hmong and Lawa and the matrilocal Htin did fit with expectations, i.e., higher genetic differentiation among populations for the MSY than for mtDNA for the Hmong and Lawa, and the opposite for the Htin (Table 1). However, the matrilocal Karen show higher differentiation for the MSY than for mtDNA (Table 1), contrary to expectations and contrary to previous results based on four Karen populations [12]. The addition of the KSK3 population increases the between-population MSY genetic variance from 2.3 to 9.1%, while the between-population mtDNA genetic variance is relatively unchanged. The low MSY MPD value (Fig. 2c) and outlier position in the MSY MDS plots (Fig. 4), as well as their oral history, indicate a strong effect of genetic drift on MSY variation in KSK3, which might then mitigate any influence of postmarital residence pattern on MSY vs. mtDNA variation. Interestingly, overall we find less contrast between matrilocal and patrilocal groups than found previously for the hill tribes [9,11,12]. Presumably this is because of both more detailed sampling and higher-resolution analysis of the mtDNA and MSY genomes. And, this is not unexpected because while some studies find an impact of residence pattern on mtDNA/MSY variation, others do not [50,51]. Many different factors can influence mtDNA/MSY variation, e.g., micro-evolutionarily factors such as genetic drift (as seen with the TN1, LW2 and KSK3 population), physical landscape, and other human cultural patterns, e.g., adoption in IuMien and cultural isolation and intermarriage in Hmong and Lawa [9,12,52,53]; these can dilute or erase any potential impact of residence pattern.
Nonetheless, one striking pattern remains in our data, and that concerns NEA vs. SEA ancestry. Previous genetic studies supported a north-south division in East Asian peoples and with some spread of northern ancestry to the south [35,49]. There are distinct differences in the mtDNA and MSY lineages of NEA vs. SEA populations [48,49,54], and here we also find a higher frequency of both mtDNA and MSY lineages of SEA origin than of NEA origin in most of the studied populations. In general, the SEA specific maternal lineages (B5*, F1a*, M7b* and R9b*) are at an average frequency of 38.28%, while NEA mtDNA lineages (i.e., A*, D* and G*) have an average frequency of 9.38% (Supplementary Table S8). The MSY haplogroups also show major SEA lineages (O1b*) predominating at an average frequency of 45.35%, and minor NEA lineages (C2e*, D-M174 and N*) at an average frequency of 8.33% (Supplementary Table S7).
However, the HM and ST groups are a dramatic exception to this general pattern of higher SEA than NEA ancestry for both paternal and maternal lineages (Fig. 1). The estimated NEA maternal ancestry of the HM groups is 11.94%, comparable to that of other Thai/Lao populations (average = 9.11%), while the average frequency of NEA paternal lineages in HM groups is 24.72% (compared with the average frequency of 6.59% for other Thai/Lao populations). Conversely, in the ST groups we detect an average of 24.09% NEA maternal ancestry, which is much higher than the average NEA maternal ancestry for other Thai/Lao groups (7.57%), while the NEA paternal ancestry in ST groups is comparable to that in other Thai/Lao groups (11.95% vs. 7.88%).
Given that both HM and ST groups originated from southern China or northwestern China, it is likely that the ancestral HM and ST groups both had relatively high levels of NEA ancestry for both the MSY and mtDNA, as this has been reported for contemporary Chinese populations (Supplementary Table S5) [35,39,54,55]. We suggest that there was subsequent contact with SEA groups as their ancestors migrated southward, with HM populations incorporating more SEA maternal than paternal lineages, and ST populations incorporating more SEA paternal than maternal lineages. This could be explained if the ancestral HM group was patrilocal (as all HM populations are today), and so subsequent interactions between the HM ancestors and SEA groups incorporated more SEA mtDNA lineages than MSY lineages into HM populations. Conversely, if the ancestral ST group was matrilocal (as the Karen and Lahu are today), subsequent interactions between ST ancestors and SEA groups would have incorporated more SEA MSY lineages than mtDNA lineages into ST populations. Matrilocality for ancient ST groups has also been suggested based on linguistic evidence [56]. The fact that some ST populations are now patrilocal (e.g., Lisu) while still exhibiting higher frequencies of NEA maternal lineages may then reflect recent changes from matrilocality to patrilocality.
However, the interaction between NEA and SEA groups was undoubtedly a complex admixture process that started in prehistoric times and has continued into historic times, and there are many factors in addition to residence pattern that could have led to sex-biased admixture [55]. Additional studies of both other ethnolinguistic groups (e.g., Akha, who show strong cultural practice preservation) as well as populations sampled from different villages of Lahu, Lisu, and IuMien from northern Thailand and Hmong and Karen (Skaw and Kayah) from western and central Thailand, are necessary to further investigate this pattern.

Conclusion
We have carried out the most extensive study to date, using high-resolution methods, of the maternal and paternal lineages in HM and ST speaking groups of northern Thailand. We find unexpected differences between the Hmong and IuMien, which may reflect different cultural practices, and genetic heterogeneity among ST groups. Compared with previous studies, we find less contrast in genetic diversity and differentiation between matrilocal and patrilocal groups among the hill tribes. However, a novel finding of this study is the contrast between HM and ST groups, both assumed to have origins in southern China, in frequencies of NEA maternal and paternal lineages. We suggest that this striking difference reflects ancestral patrilocality for HM groups vs. ancestral matrilocality for ST groups. Overall, our results further attest to the impact of cultural practices on patterns of mtDNA vs. MSY variation in human populations.