The paternal and maternal genetic history of Vietnamese populations

Vietnam exhibits great cultural and linguistic diversity, yet the genetic history of Vietnamese populations remains poorly understood. Previous studies focused mostly on the majority Kinh group, and thus the genetic diversity of the many other groups has not yet been investigated. Here we analyze complete mtDNA genome sequences and ~2.3 Mb sequences of the male-specific portion of the Y chromosome from the Kinh and 16 minority populations, encompassing all five language families present in Vietnam. We find highly variable levels of diversity within and between groups that do not correlate with either geography or language family. In particular, the Mang and Sila have undergone recent, independent bottlenecks, while the majority group, Kinh, exhibits low levels of differentiation with other groups. The two Austronesian-speaking groups, Giarai and Ede, show a potential impact of matrilocality on their patterns of variation. Overall, we find that isolation, coupled with limited contact involving some groups, has been the major factor influencing the genetic structure of Vietnamese populations, and that there is substantial genetic diversity that is not represented by the Kinh.


Introduction
Southeast Asia (SEA) is a melting pot of ethnolinguistic diversity shaped by many demographic events, beginning with the initial arrival of anatomically modern humans at least 65 kya [1,2], and including migrations accompanying the spread of agriculture, in particular rice and millet farming, the expansion of the Austronesian (AN) language family, and movements of Tai-Kadai (TK) and Hmong-Mien (HM) speakers [3]. The languages spoken in SEA today belong to five language families: Austro-Asiatic (AA), AN, HM, Sino-Tibetan (ST), and TK. Geographically SEA is divided into two subregions, Island SEA (ISEA) and Mainland SEA (MSEA).
Vietnam is a multiethnic country that occupies a key position within MSEA and exhibits both geographic and ethnolinguistic diversity. The northern part of the country consists of highlands and the Red River delta; the central part also comprises highlands, while the southern part encompasses mostly coastal lowlands and the Mekong River delta. There are 54 official ethnicities in Vietnam, and a total of 109 different languages are spoken in the country. These belong to all of the five major language families present in SEA [4]. Groups speaking AA languages are distributed throughout the country, while those speaking TK, HM, or ST languages historically were found mainly in the north but are now also living in other areas; ANspeaking groups are located in the south part of central Vietnam and the Tay Nguyen highlands. The AA language family is considered the oldest within the area; AA languages are scattered across MSEA and South Asia, and the location of the AA homeland is under debate [5]. The AA languages are associated with a major occupation of MSEA after the introduction of agriculture [6].
AN speakers are found all over ISEA and Oceania, and trace at least a part of their ancestry to aboriginal Taiwanese AN-speaking populations, supporting a start of the AN expansion out of Taiwan about 4 kya [3,7]. The genetic composition of modern AN speakers in ISEA is heterogeneous; AN speakers in western Indonesia have substantial AA-related ancestry, caused most likely by a movement of AN speakers through MSEA mixing with AA speakers in Vietnam or peninsular Malaysia [3], while AN speakers in eastern Indonesia harbor both Papuan and AN-related ancestry [8]. AN speakers in MSEA include the Cham, Chru, Raglai, Giarai, and Ede from the south part of central Vietnam and Tay Nguyen highlands, Cham of Cambodia, and Malay groups in Malaysia and Thailand. In contrast to the predominant patrilocal residence pattern of other groups, AN groups are thought to have had an ancestral matrilocal residence pattern [9]. The TK and HM languages likely originated in the area of present-day southern China and north Vietnam then spread by multiple migrations southward to what is now Thailand, Laos, and other parts of Vietnam hundreds to thousands of years ago [10,11]. Whether the current distribution of these languages and the farming culture across MSEA was a result of human migration events (demic diffusion) or happened without the major movement of people (cultural diffusion) is still highly disputed. MtDNA variation in Thailand supports a model of demic diffusion of TK speakers [12], while recent studies based on ancient DNA provide further evidence for Neolithic and Bronze age migrations from East Asia [13], and explain present-day SEA populations as the result of admixture of early mainland Hòabìnhian hunter-gatherers and several migrant groups from ancient East Asia associated with speakers of the AN, AA, and TK languages [14]. ST languages are assumed to have diverged 5.9 kya in Northern China, and ST speakers are thought to have migrated southward into the area of MSEA about 3 kya [15,16].
While several genetic studies have focused on SEA, research on the ethnic groups in Vietnam remains rather limited [17][18][19][20][21][22][23]. Most of these studies either focused solely on the majority group in Vietnam, the Kinh, as representative of the entire country, or are based on a restricted number of SNPs, microsatellites, or only partial sequencing of mtDNA. Because the Kinh comprise 86% of the population, sampling individuals from this group is a promising way to capture the main signal of genetic diversity in Vietnam. But the complicated history of SEA indicates that there might be hidden complexity and genetic structure in the minority populations. We have therefore initiated a comprehensive study of the genetic history of Vietnamese ethnolinguistic groups. Here, we analyze sequences of full mtDNA genomes and~2.3 million bases of the malespecific portion of the Y chromosome (MSY) of the Kinh and 16 minority groups, encompassing all five language families, to investigate their maternal and paternal genetic structure. We use the genetic results based on our extensive sampling to investigate whether the genetic composition of the Kinh is a valid representation of all populations living in Vietnam today, and we assess the impact of geographic, linguistic, and cultural factors (i.e., postmarital residence pattern) on the genetic structure of Vietnamese populations.

Sample information
We analyzed DNA from 600 male Vietnamese individuals (Supplementary Material Table S1) belonging to 17 ethnic groups that speak languages belonging to the five major language families in Vietnam. In detail the data set consists of two AA speaking groups (Kinh and Mang), five TK speaking groups (Tay, Thai, Nung, Colao, and Lachi), two AN speaking groups (Giarai and Ede), three HM speaking groups (Pathen, Hmong, and Dao), and five ST speaking groups (Lahu, Hanhi, Phula, Lolo, and Sila). The average sampling locations per population are shown in Fig. 1. Ethnic groups sampled for this project, name, language affiliation, and census size were based on the General Sta-

MtDNA sequencing
DNA was extracted from blood using the Qiagen Tissue and Blood extraction kit. Double-stranded, double bar-coded Illumina sequencing libraries were constructed as described previously [24]. The libraries were enriched for mtDNA sequences via in-solution capture [25]. The mtDNA haplogroups were defined using haplogrep2 [26], and the phylogeny and terminology of Phylotree 17 [27], http://www.phylotree.org. For further experimental details see [28]. The complete mtDNA data set can be found at GenBank MH448947-MH449555.

Y-chromosome sequencing
We enriched for~2.3 million bases of the MSY from the same libraries used for mtDNA capture enrichment [28].
The MSY sequence processing pipeline is described elsewhere [29]. To increase the quality of the data set, we removed 41 positions with more than 16.6% missing information across the Vietnam MSY sequences and then imputed the remaining missing genotypes with BEAGLE 4.0 [30] using published reference sequences (Supplementary Material Table S2) from South Asia, East Asia, SEA, and Oceania [20,31,32]. After making initial haplogroup calls (using the procedure described below), we then went back and added additional reference samples from haplogroups C2-M217 and N1-M2291 for imputation, as these were present in 28 and 22 individuals respectively in the Vietnamese sample set but not present in the initial set of reference sequences used for imputation. Further, a merged A00 sequence [31] was added as an outgroup. From the combined data set we additionally excluded 53 positions not covered by more than 75% of the samples. The aligned MSY reads are deposited in the European Nucleotide Archive (PRJEB33028). Final SNP genotypes and their chromosomal positions on hg19 are provided in Supplementary Material Data 1 and Data 2. MSY haplogroups were called with yhaplo [33] using a stopping condition parameter "ancStopThresh" = 10. Haplogroups were typed to the maximum depth possible given the phylogeny of ISOGG version 11.04 (http://www.isogg.org/) and the available genetic markers in our target region. Labels denoted with an asterisk in the text and figures are paragroups that do not include subgroups.

Sequence analysis
For both markers, we calculated the mean number of pairwise differences by averaging over the sum of nucleotide differences for each pair of sequences within a population (R function: dist.dna package: ape) divided by the total number of pairs. The nucleotide diversity (π) and its variance were computed using the R function nuc.div (package:pegas). Because the MSY information is based on a set of linked SNPs without recombination, we use the term haplotype throughout this paper to refer to the MSY sequences, and not to STR profiles as has been done in the past. We calculated the number of unique haplotypes for each population and obtained the haplotype diversity (H) values using Arlequin version 3.5.2.2 [34]. To visualize π and H we calculated the percentage difference from the mean for each population. Arlequin version 3.5.2.2 [34] was additionally used to calculate the pairwise genetic distances (Φ ST distances) among the populations and the analyses of molecular variance (AMOVA) for both markers. The p values of the genetic distances were corrected for multiple testing by applying the Benjamini-Hochberg procedure. The Φ ST distances were used to compute nonmetric multidimensional scaling (MDS) plots. We created twodimensional projections (R function: isoMDS package: MASS) and calculated heatplots with five dimensions, showing per-dimension standardized values between 0 and 1. We calculated Mantel matrix correlation tests between genetic distances and great circle distances of the average geographical location per population using Pearson's correlation with 10,000 times random resampling. The correspondence analyses were computed in R using the libraries "vegan", "fields," and "ca". The haplotype sharing analysis was based on sequence haplotypes via string matching. We excluded Ns and indels for the mtDNA sequences; this step was not necessary for the MSY sequences because indels were not called and there were no Ns after imputation.
We performed mtDNA and MSY Bayesian analysis with BEAST 1.8 [35]. The software jmodeltest2 [36] was used to determine that the HKY + I + G and GTR models were the best substitution models for the mtDNA and MSY sequences, respectively. We partitioned the mtDNA genomes into the coding and noncoding sections and applied previously published and widely used mutation rates [37] of 1.708 × 10 −8 and 9.883 × 10 −8 mutations/site/ year, respectively. This partitioning was supported by Par-titionFinder2 [38]. For all MSY analyses the MSY mutation rate of 0.871 × 10 −9 , based on an Icelandic pedigree [39], was applied. Because of the uncertainty in MSY mutation rates [40] and to provide a comprehensive comparison, we additionally applied an ancient DNA calibrated MSY mutation rate (0.76 × 10 −9 substitutions/bp/year [41]) to MSY results where date estimates are relevant, as reported in the Supplementary Material Text. A Bayes factor analysis including marginal likelihood estimations [42] was used to test different clock models. We applied the Bayesian skyline piecewise linear tree prior for the dating and Bayesian skyline generation, so as to allow for population size changes over time. To ensure successful Bayesian estimation and to reach ESS values above 200, we combined multiple MCMC runs with 100 million steps using the BEAST logcombiner with a resampling up tõ 40,000 trees. We constructed MP trees for both markers and counted the mutations from the outgroup per sample. The mutation counts were used to compare the average distance of macrohaplogroups to the base of the trees as a measurement of branch length heterogeneity. We tested for significant differences in branch length distributions of major haplogroups with the Mann-Whitney U test.

MSY sequences
We sequenced 2,346,049 bases of the MSY of 600 Vietnamese from 17 populations to a mean coverage of 30.2 × (minimum: 5×, maximum: 72×). After filtering, there were 3932 SNP positions, including 1908 novel sites that have not been described previously (dbSNP Build 153, accessed October 9, 2019, http://www.ncbi.nlm.nih.gov/SNP/). Fiftyseven different haplogroups were present in the 17 populations (Table S3). A detailed analysis of the phylogeography of the MSY data set will be presented elsewhere; the focus of the present study is the comparison of patterns of mtDNA and MSY variation in the sampled Vietnamese populations, and so we only briefly mention some interesting features about the MSY haplogroup distribution (Supplementary Material Text 1, Fig. S1, Tables S3, S4).

Genetic diversity within populations
The nucleotide diversity (π) and haplotype diversity (H) for mtDNA and MSY sequences varied substantially among populations (Fig. 2, Table S5). Kinh had high values of both π and H for both genetic markers, compared with the mean across populations. Sila, Lachi, and Mang had lower than average H values for both markers. The Mang MSY π-value was particularly low (5.91 × 10 −6 ) compared with the mean (4.72 × 10 −5 ), reflecting the unusual MSY haplogroup composition of this group, dominated by haplogroup O1b-B426, which had a frequency of 97%. The HM groups Pathen and Hmong had higher than average π values for the MSY, which reflects the higher frequency of C and D haplogroup sequences in these two groups (Table S3). The two AN groups were notable in having substantially higher than average MSY H values but average or below average mtDNA H values. Overall, the variation in H and π values for both markers was not consistent within language families. We found high frequencies of both mtDNA and MSY haplotype sharing, up to 17% (Fig. 3), within all 17 populations. In total, 28.5% of the mtDNA and 24.7% of the MSY haplotypes were shared within at least one population. All groups shared mtDNA types within the population, and only the Giarai lacked any shared MSY haplotypes within the population. The highest frequencies of within-population MSY haplotype sharing were present among the Mang (0.15), Sila (0.17), and Colao (0.14), in keeping with their very low H values (Fig. 2).

Population relationships
We examined haplotype sharing between populations as an indication of recent genetic contact or shared ancestry (Fig. 3). In general, there were more occurrences of sharing between populations in the mtDNA sequences compared with the MSY sequences, but only 5.5% of the mtDNA and 2.8% of the MSY haplotypes were shared between populations. The two AN populations (Giarai and Ede) had the highest frequency of mtDNA haplotype sharing between them (0.16), which was even higher than their within-population sharing (0.10 and 0.07). However, in notable contrast to the high degree of mtDNA sharing, there was no MSY haplotype sharing between the AN groups. The three HM populations each shared mtDNA haplotypes with some TK populations, but not with other populations of their own language family. The Dao (HM) shared MSY haplotypes with the TK groups Nung and Lachi; they are located geographically close to one another and also shared mtDNA haplotypes. All other occurrences of MSY haplotype sharing involved populations that are settled in close geographical proximity, namely Sila-Hanhi, Lahu-Hanhi, and Phula-Tay. To visualize the relationships among Vietnamese populations, we generated MDS plots (Fig. 4)   in the periphery of the MDS plots (Fig. 4a, b). In addition, Phula (ST) stands out in the MSY MDS plot. Giarai and Ede showed large Φ ST distances for mtDNA but not for the MSY (Fig. 4), and larger Φ ST distances to most ST and HM populations than to Kinh (AA) and TK (except Lachi) groups. The Kinh group showed overall low genetic distances with other groups (Supplementary Material Fig. S2) and a central position in both MDS plots (Fig. 4). Because the rather high stress values of the two-dimensional MDS plots (Fig. 4a, b) indicated potentially more complex structure, we calculated a five-dimensional MDS and depicted the results in a heat plot (Supplementary Material Fig. S3A and B). The Kinh, Thai, and Tay remained centrally located across all five dimensions for both markers (Supplementary Material Fig. S3A and B), while the Mang remain an outlier in most dimensions in the MSY plot ( Supplementary Material Fig. S3B). We additionally explored population relationships based on haplogroup frequencies via a correspondence analysis (Supplementary Material Fig. S4). The results were similar to the MDS results for the AN groups, in that the Giarai and Ede were outliers for mtDNA but not for the MSY. Their mtDNA separation, in the first dimension of the plot, was driven mainly by the high frequencies of haplogroup M71 + C151T (37-42%), as well as by other exclusive haplogroups such as M68a1a, F1a4a, M21b, M24b, M68a1a, M73b, M74b1, M7b1a1f, and R9b1a1a. The second dimension separated the Mang (haplogroup A, D4, M71) and Pathen (A14, F1d, F2a). For the MSY, the high frequencies of F-M89 separated Phula (74%) and Lahu (32%) from the rest in the first dimension, while the second dimension spreads the populations between Mang and Hmong (Fig. S4).

Factors influencing the genetic structure
To test for correspondence between linguistic affiliation or geographic location with genetic structure, we analyzed three groupings (linguistic affiliation and two levels of geographical proximity) via AMOVA (Table 1). With respect to geography, we grouped populations on a broad scale by regions (political units), and on a finer scale by their origin in the same or neighboring districts. All three tested grouping patterns (language family, district, and region) revealed that~90% of the total mtDNA variation and~77% of the total MSY variation is explained by the differences within populations. Although the among-group component was significant for language family (1.8%) and districts (2.6%) for the mtDNA sequences, and for language family (4.4%) for the MSY sequences, in all of these cases the within-group component was considerably larger, indicating that differences between populations assigned to the same group were bigger than differences between populations assigned to different groups.
To further assess the impact of geography on the genetic structure of Vietnamese populations, we tested for correlations between the geographic and genetic distances for both mtDNA and MSY sequences ( Table 2). This analysis was carried out for the entire data set, for a subset excluding the Kinh, and for a subset excluding the Ede and Giarai. We excluded the Kinh to control for the influence of their geographically widespread distribution and potentially mixed gene pool, as they are the majority group in Vietnam. Giarai and Ede, the only two groups from the Central Highlands of Vietnam, were excluded to test for a potential bias caused by their unique geographic position, as including these groups results in a bimodal distribution of geographical distances. We found a significant correlation between the geographic distance and the mtDNA genetic distance matrices when analyzing the entire data set and the population subset excluding the Kinh (Table 2). However, the correlation between mtDNA distances and geography became nonsignificant when excluding the two Central Highlands groups (Ede and Giarai), suggesting that their large geographic distance (Fig. 1) and high mtDNA genetic distances (Fig. S2) from the other groups was driving the significant correlation. Furthermore, no significant correlation was detected for any comparisons of MSY genetic and geographic distances. However, there was a significant correlation between the genetic distance matrices of the two uniparental markers (Table 2) when the Kinh were included. We additionally carried out maximum parsimony and Bayesian analyses of the MSY sequences and tested for branch length heterogeneity; the results of these analyses (which indicate substantial branch length heterogeneity in the MSY tree, but not the mtDNA tree) are discussed in Supplementary Material Texts 2 and 3.

Discussion
Previous genetic studies of Vietnam have focused largely on the Kinh majority as representative of the country [14, 18, 19, 21-23, 43, 44]. In contrast, we have investigated the patterns of genetic variation in a large sample of ethnolinguistic groups from Vietnam that speak languages encompassing all of the five language families present in the country. We found varying levels of genetic diversity within and among groups, not all of which is represented in the Kinh. For example, the high genetic diversity observed within the Kinh and several TK speaking groups differed substantially from the much lower levels of diversity found in small populations like the Sila and the Mang (Fig. 2). The predominant sharing of haplotypes within populations (Fig. 3), as well as this reduced diversity (Fig. 2), suggests that many of the Vietnamese groups are relatively isolated from one another. The higher number of population pairs that share mtDNA haplotypes, compared with MSY sharing, likely reflects recent contact and is in accordance with the expectation of more female exchange due to patrilocality in all non-AN groups. However, lower levels of MSY sharing could also reflect the greater resolution of MSY haplotypes based on 2.3 Mb of sequence, compared with the 16.5 Kb mtDNA genome sequences. Considering the difference in mutation rates for the different molecules [37,39], a new mutation is expected in an MSY haplotype of our target size every 489.4 years, and in the whole mtDNA genome every 3624 years; thus, new  [29], and 6.9% for Angolan populations [46] (Supplementary Material Fig. S7).
There was no correspondence between the genetic structure and geography, as indicated by the absence of a significant correlation between genetic and geographic distances (Table 2), and the lack of any significant influence of geographical clusterings in the AMOVA (Table 1). While the mtDNA genetic structure is slightly influenced by geography, a significant correlation between mtDNA genetic distances and geographic distances disappears when the AN groups are excluded ( Table 2). Because there are other factors that differ between AN groups and non-AN groups, such as the postmarital residence pattern (discussed in more detail in Supplementary Material Text 4), which might influence the genetic structure, geography is not necessarily directly related to genetic structure. We also did not find any evidence for an association between genetic structure and language family affiliation ( Table 1). A consistent finding across Vietnamese populations was a higher female than male effective population size (Supplementary Material Text 2, Fig. S5), and more genetic structure on the MSY (MSY F ST = 23% vs. mtDNA F ST = 10%) ( Table 1). These are common patterns in human populations [31,44,47], and likely reflect a predominant patrilocal residence pattern and higher levels of female migration [48][49][50], a greater variance in reproductive success for males than for females [49,51], and male-specific cultural inheritance of fitness [31]. A recent study of MSEA populations supports the negative influence of clan fission and extinction dynamics on Y-chromosome effective population size in patrilineal societies [52]. Strikingly, the Pathen (HM) mtDNA and MSY effective population sizes were about the same. Why this is the case is not known, but we speculate that this could reflect greater homogeneity in male reproductive success for the Pathen, compared with other Vietnamese groups. On a global scale, previous analyses have revealed consistently higher female than male population sizes, and increases in both at~40-60 kya for all non-African continents [31]. While this signal of female population size increase was present in our study, the MYS lineages of all of the minority groups coalesced more recently than 40 kya. (Fig. S5). Previous results [31] further indicated a reduction in male Ne for continental populations between 8 and 4 kya, linked to the spread of Neolithic cultures and adoption of farming and changes in social structures, leading to an increase in the variance of male reproductive success and sex-specific demographic events. This trend was not found in the trajectories of most of our Vietnamese population, which may reflect the substantially lower structure present on the local compared with the continental scale.
Overall, it appears that the genetic structure of Vietnam has been primarily influenced by two main factors. The first is isolation and genetic drift, leading to high levels of genetic differentiation between groups and variable levels of genetic diversity within groups. Second, there has been limited recent contact between some groups, leading to some haplotype sharing. The levels of genetic differentiation among groups of 10% based on mtDNA and 23% based on the MSY (Table 1) are similar to what was found previously for populations from Northwestern Amazonia (13% mtDNA and 27% MSY; [29]) and higher than those found for Thai populations (8.5% mtDNA and 11% MSY; [45]). We also found particularly low levels of diversity in some specific groups, like the Mang and Sila (Fig. 2, Supplementary Material Table S5). The low levels of haplotype sharing for both markers (Fig. 3) are further evidence of isolation and limited contact between geographically close populations. The observed level of mtDNA haplotype sharing between Vietnamese groups (5.5%) is lower than that observed in most other studies of complete mtDNA genome sequences (Supplementary Material Fig. S6), while the MSY haplotype sharing between Vietnamese groups (2.2%) is also lower than what was observed in studies that sequenced the same regions of the MSY (Supplementary Material Fig. S7).
In addition to the general aspects of Vietnamese genetic diversity discussed above, our results provide some insights into the genetic profile and history of specific groups. These are discussed in detail in Supplementary Text Material Text 4, and include: the higher male than female isolation of HM groups; recent expansions and diversification of TK groups, along with some contact between them and HM groups; the impact of matrilocality on patterns of genetic variation in the AN groups; evidence for the probable incorporation of other groups into the Kinh during their initial spread; and the pronounced bottleneck (especially in the MSY sequences) in the Mang and Sila.

Conclusion
The sequencing of 2.34 Mb of the MSY chromosome and the complete mtDNA genome of 17 different ethnic groups (encompassing all five language families) enabled us to carry out the first comprehensive analysis of the genetic diversity within Vietnam. Overall, isolation leading to genetic drift has had an important impact on Vietnamese groups, with recent contact limited to some TK and HM groups. There are several differences between the maternal and paternal genetic history of some populations; in particular, a matrilocal vs. patrilocal residence pattern appears to be one of the major drivers of differences between the MSY and mtDNA signals in AN vs. other groups. There is also a profound impact of genetic drift for the Mang and Sila, especially in the MSY lineages, suggesting male-specific bottlenecks or founder events. And although we find genetic evidences for a central position of the Kinh as the majority group within the country, there is substantial genetic diversity in the other ethnic groups that is not represented in the Kinh. Genetic studies of the remaining ethnic groups in Vietnam, and expansion of the genetic data to include genome-wide variation will provide further insights into the genetic history of this key region of MSEA. gram "Investissements d'Avenir" (ANR-11-IDEX-0007) of the French government operated by the National Research Agency (ANR).

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/.