The genetic legacy of continental scale admixture in Indian Austroasiatic speakers

Surrounded by speakers of Indo-European, Dravidian and Tibeto-Burman languages, around 11 million Munda (a branch of Austroasiatic language family) speakers live in the densely populated and genetically diverse South Asia. Their genetic makeup holds components characteristic of South Asians as well as Southeast Asians. The admixture time between these components has been previously estimated on the basis of archaeology, linguistics and uniparental markers. Using genome-wide genotype data of 102 Munda speakers and contextual data from South and Southeast Asia, we retrieved admixture dates between 2000–3800 years ago for different populations of Munda. The best modern proxies for the source populations for the admixture with proportions 0.29/0.71 are Lao people from Laos and Dravidian speakers from Kerala in India. The South Asian population(s), with whom the incoming Southeast Asians intermixed, had a smaller proportion of West Eurasian genetic component than contemporary proxies. Somewhat surprisingly Malaysian Peninsular tribes rather than the geographically closer Austroasiatic languages speakers like Vietnamese and Cambodians show highest sharing of IBD segments with the Munda. In addition, we affirmed that the grouping of the Munda speakers into North and South Munda based on linguistics is in concordance with genome-wide data.

Considering the widespread sharing of words related to rice agriculture in all main branches of Austroasiatic, it has been proposed that this language family co-expanded with farming in MSEA and that the speakers of Munda languages spread to India as part of this farming expansion 9,10 . Alternatively, considering the deep splits of extant Munda and extinct Para-Munda languages and evidence for independent domestication of rice in India and in Southeast Asia, it has been proposed that Austroasiatic languages could have, instead, spread from India to Southeast Asia 11 . Given that about 25% of the genetic ancestry of Munda speakers has been shown to be shared with Southeast Asians, unlike in other Indian populations, and, reversely, because Burmese share some ancestry with Indian populations, it has been proposed that the expansion of rice farming may have involved bilateral movement of people 7 .
Studies analysing mtDNA and Y chromosome markers have revealed a sex-specific admixture pattern of admixture of Southeast and South Asian ancestry components for Munda speakers. While close to 100% of mtDNA lineages present in Mundas match those in other Indian populations, around 65% of their paternal genetic heritage is more closely related to Southeast Asian than South Asian variation 7,12,13 . Such a contrasting distribution of maternal and paternal lineages among the Munda speakers is a classic example of 'father tongue hypothesis' 14 . However, the temporality of this expansion is contentious 7,13,15,16 . Based on Y-STR data the coalescent time of Indian O2a-M95 haplogroup was estimated to be >10 KYA 7,13 . Recently, the reconstructed phylogeny of 8.8 Mb region of Y chromosome data showed that Indian O2a-M95 lineages coalesce within a clade nested within East/Southeast Asian within the last ~5-7 KYA 17 . This date estimate sets the upper boundary for the main episode of gene flow of Y chromosomes from Southeast Asia to India.
Previous autosomal study was limited to a single Austroasiatic population from Southeast Asia 7 , therefore in the present study, we generated and assembled large body of contextual genome-wide genotype data from Southeast Asia as well as from South Asia (Supplementary Table S1). We set out to affirm the signal of the admixture event in autosomal data and to address previously unresolved questions including: (i) autosomal date of the South and Southeast Asian admixture event in Munda; (ii) characteristics of the Indian ancestry component of the Mundas; (iii) who are the closest living descendants of the source populations of the ancient admixture; (iv) and if the grouping of the Munda speakers into North and South Mundas based on some linguistic models is supported by genetic data.
To address these questions, we analysed 102 individual samples from Munda speaking populations (including 10 newly reported samples) in context of 978 other samples (including 46 newly reported samples) from 72 populations mainly from India, Southeast Asia and East Asia. The Munda speakers are divided into North Mundas (NM) and South Mundas (SM) based on linguistic affinities. List of all the populations, sample sizes, and some additional information on the dataset can be found in Supplementary Table S1.

Results and Discussion
The Munda speakers as an admixed population. We first analysed Munda genomes with ADMIXTURE and PCA in context of other South and Southeast Asian populations and found that Munda share about three quarters of their genetic ancestry (k3-k5 components in Fig. 1, Supplementary Fig. S3) with Indian Dravidian and Indo-European speakers. Interestingly, Indian populations with the k3-k5 components have also a pink component (k2) which is widespread in European, West Eurasian, Near Eastern and Pakistani populations but missing in the Munda speakers (see Supplementary Fig. S3). Roughly one quarter of the ancestral components in the Mundas' genome (k6 -k12) are shared with Southeast Asians. There are two populations with a similar genetic profile to the Mundas in Central India: Dravidian speaking Gond who are known to have received a substantial gene flow from the Munda speakers 18 and a linguistic isolate Nihali.
Principal component analysis (PCA) roughly reflects geographical locations of studied populations (see Supplementary Fig. S4). Based on the first two components of PCA, the Mundas are genetically situated between South Asians and Southeast Asians and Oceanians. Furthermore, South and North Munda tribes are clearly different -South Mundas are genetically closer to Southeast Asians and Oceanians while North Mundas are closer to South Asians. In sum, the results of the ADMIXTURE and PCA are consistent with the model by which the genetic ancestry of Indian Munda speakers represents an admixture between Indian and Southeast Asian ancestries.
The scenario of independent evolution without admixture was rejected by 3-population formal test of admixture 6 for South Munda, Santhal (NM) and Ho (NM) speakers, as they yielded significantly negative f3 values (indicative of admixture) when tested together with populations from India and Southeast Asia (Supplementary Table S2). Birhor (NM) and Korwa (NM) speakers did not display significant admixture signal potentially because of the vast genetic drift they have gone through after the admixture event as they show the lowest average heterozygosity among the Munda speakers (Supplementary Table S3).
To understand further the position of Mundas in the genetic landscape of Indian populations, we plotted the second and third principal components from the global PCA analysis (see Supplementary Fig. S5). The Mundas were situated close to the Dravidian speaking southern Indian end of the gradient, near Pulliyar population from southwestern India, being stretched towards Southeast Asian populations, the closest ones being Bateq, Jehai, Kintaq and Mendriq from Malaysia.
The best contemporary proxies for admixture sources. Three populations that yield the highest outgroup-f3 values as the most affine sources of Southeast Asian ancestry in Munda are Lao from Laos, Dai from China and Murut from Borneo. From South Asia, the populations that produce the highest outgroup-f3 scores are Dravidian speaking Paniya and Pulliyar from Kerala region of India. For North Mundas, among the top Indian populations is also Indo-European speaking Chamar, whereas for South Mundas, there are Jarawa and Onge from Andaman Islands (Supplementary Table S2). Consistently, the South Munda speakers show the highest affinity to the Andamanese populations based on fineSTRUCTURE 19 analysis (see Supplementary Fig. S6).  Table S3). Surprisingly, the geographically closer Austroasiatic speakers from Southeast Asia, such as Cambodians and Vietnamese, do not share as many IBD segments with the Mundas. This effect could be caused by the fact that the mainland Southeast Asian populations have smaller proportions of the original Austroasiatic component in their genomes due to subsequent gene flow received from East Asia. Another explanation could be a more complex direction of gene flow in this area. Similar results were observed when using total lengths of shared IBD segments instead of their counts ( Supplementary Fig. S7).
When dividing the segments shared with the Mundas into two groups, short (<1 cM) and long (>1 cM), we noticed that the two sources, South Asian and Southeast Asian populations, clearly form two distinct groups based on shared segment length patterns (Fig. 2). Both, mainland and island Southeast Asian populations share a high number of long IBD segments with the Mundas while Indian Dravidian and Indo-European speaking populations share plenty of short IBD segments. Surprisingly, no difference was found in Indian Dravidian and Indo-European speakers in context of sharing DNA with the Mundas (Welch's t-test; short IBD P = 0.5218; long IBD P = 0.5302; all IBD P = 0.9305). The formation of the two groups seen on Fig. 2 could refer to different genetic distance between admixed populations and other populations from the corresponding areas; i.e., the Southeast Asian share of the Munda speakers' genomes has diverged from present day Southeast Asians more recently than the South Asian part from present day South Asians. This result has to be taken with caution as we found correlation between the shared IBD segment lengths and the average heterozygosity in these populations ( Supplementary Fig. S8, Table S3).
Admixture proportions suggest a novel scenario. We used qpAdm 21 to determine the relative proportions of West, Southeast and South Asian ancestries in Munda speakers, using a number of modern and ancient West Asian populations, Lao, Burmese, Cambodia or Mah Meri (Austroasiatic speaking Malaysian group), and Onge or Paniya as proxies for the three Asian components (Supplementary Table S4). Regardless of which West Asian population we used, we found that Munda speakers can be described on average as a mixture of ~29% Southeast Asian, 15.5% West Asian and 55.5% South Asian components. When rescaling the West and South Asian components to 1 to explore the Munda genetic composition prior to the introduction of the Southeast Asian component, we note that the West Asian component is lower (~22%) in Munda compared to Paniya (27%) (Supplementary Table S4: Total_Average_SEA = 0). Consistently with qpGraph analyses in Narasimhan et al. 22 ,  Table S4: SM_SEA = 0 and NM_SEA = 0 show 23% and 21% West Asian and 77% and 79% South Asian contributions) while the Southeast Asian component is 12% higher in South than in North Munda. The independence between the amount of Southeast and West Asian components in our North and South Munda populations contradicts the expectations and therefore points to an opposite and simpler scenario: both South and North Munda could be modelled as an initial admixture between Southeast Asian populations and an autochthonous Indian group with a slightly lower West/South Asian composition compared to what observed in Paniya today. South Munda then kept isolated from additional gene flow, while North Munda received a longer admixture pulse from the local Indian groups, which caused the dilution of the newly arrived Southeast Asian components in North Munda, without affecting the relative proportions of West and South Asian components.
To further explore the demographic history of Indian Munda speaking populations we used Admixture Graph (AG) approach implemented in qpGraph 21 that allows us to estimate goodness of fit between data and specified model that involves multiple admixture events and drift. Populations within North and South Munda language meta-groups differed from each other significantly in their affinity to these three ancestral sources when using f4 statistic, yielding |Z| scores higher than 3 (Supplementary Table S5), and could not thus be pooled together. The general model with three distinct sources of admixture ( Supplementary Fig. S9) offers a good fit with data in case of all 11 Munda speaking populations tested, however, with different proportions of individual ancestry components (Supplementary Table S6). Compared to qpAdm we observed that qpGraph consistently yielded higher estimates of West Eurasian ancestry in Munda speakers.
Dating the admixture event. We used ALDER to test this scenario and to infer the admixture time that led to the genesis of the Mundas 23 . The admixture midpoint was 3846 (3235-4457) years ago for South Mundas, which may point to the time of arrival of the Southeast Asian component in the area, and 2867 (1751-4525) years ago for North Mundas (Fig. 3). The longer (1000 years) admixture time between North Munda and local Indian populations is consistent with the ADMIXTURE, PCA and qpAdm results where we saw North Mundas having a bigger proportion of Indian ancestry (made up, proportionally, by ~21% West and 79% South Asian) and a smaller Southeast Asian fraction than South Mundas ( Supplementary Fig. S3, Fig. S4, Table S4).
While the ALDER dates that we obtained are, to our knowledge, the first estimates of the time of admixture of the Munda speakers based on genome-wide data, estimates from previous studies, based on other types of data, have yielded much earlier dates for the spread of Austroasiatic populations in India. Diamond and Bellwood 24   15 . The latter date estimate is very similar to ours and implies a significant male-specific founder event as part of the admixture process.
In this study, we have replicated a result previously reported in Chaubey et al. 7 that the Mundas lack one ancestral component (k2 on Fig. 1) that is characteristic to Indian Indo-European and Dravidian speaking populations. If this component came to India through one of the Indo-Aryan migrations 26 then it would be fair to presume that the Munda admixture happened before this component reached India or at least before it spread all over the country. However, the admixture time computed here, falls in the exact same timeframe as the ANI-ASI mixture has been estimated to have happened in India 5 through which the k2 component probably spread. Therefore, we propose that if the Munda admixture happened at the same time, it is possible for it to have happened in the eastern part of the country, east of Bangladesh, and later when populations from East Asia moved to the area, the Mundas migrated towards central India. Such a scenario, which may be further clarified by ancient DNA analyses, seems to be further supported by the fact that Mundas harbour a smaller fraction of West Asian ancestry compared to contemporary Paniya (Supplementary Table S4) and cannot therefore be seen as a simple admixture product of Southern Indian populations with incoming Southeast Asian ancestries.
Sex-biased admixture in Munda speakers. In Chaubey et al. 7 , it was shown that the Munda speakers have high frequencies (19-95%) of East Asian chromosome Y haplogroup O2a at the background of almost no detectable East Asian mitochondrial DNA signal pointing to a sex-biased nature of admixture between Austroasiatic speakers and their local Indian neighbouring populations. We used outgroup f3 analysis to contrast allele frequency patterns on the X chromosome versus those on the autosomal chromosomes to clarify the maternal side of this sex-biased admixture event. Our analysis revealed that on X chromosome, a Dravidian speaking group, North Kannadi, is relatively more similar to Munda speakers than on autosomes, while on autosomes Lao, Vietnamese and Burmese from Southeast Asia and Sino-Tibetan speaking Kuki from India have relatively higher f3 values than on X chromosome (Supplementary Fig. S10). This relatively higher autosomal affinity to Southeast Asian populations, however, is detectable only when testing South Munda speakers. The fact that South Munda speakers show more evident signs of a sex-specific admixture on maternal side is in accordance with the Y chromosome results from Chaubey et al. (2011), where South Munda speakers have also higher (0.73) average frequency of haplogroup O2a than North Munda speakers (0.62) 7 . This finding is consistent with our proposed scenario where South Munda kept isolated after the admixture event, while North Munda received additional admixture from local Indian groups, which diluted Southeast Asian component and blurred the signs of the sex-specific nature of the admixture event as the latter admixture pulse in North Munda was not sex-specific anymore.
Linguistics is in concordance with genome-wide data. Until now, we have presumed that the linguistic classification of the Mundas (North and South) is a suitable grouping criteria for genetic analyses. Here we take a glance at the genetic relationship between different North and South Munda populations. PCA of only Munda populations displayed North and South Mundas as separate groups, except one Juang and one Kharia individual fell together with North Mundas on first two principal components (see Supplementary Fig. S11). ADMIXTURE analysis showed that North Mundas have less of the combined k8-k11 genetic component than South Mundas (Wilcoxon rank sum test; N1 = 75; N2 = 11; P < 0.0001). These components were maximised in East and Southeast Asian samples. Smaller amount of Lao ancestry in North Mundas was also shown by qpAdm analysis (Supplementary Table S4). On the fineSTRUCTURE tree 19 , North and South Mundas clustered separately, except Figure 3. Admixture times as evaluated by ALDER. We let ALDER pair up populations from Southeast Asia and South Asia as several populations from either area were good proxies for the admixture event based on Refined IBD and f3 analyses. For accuracy, North Munda speaking Santhal, Ho, Korwa and Birhor were addressed separately as admixed populations; due to a small sample size South Munda speakers were treated as one population. Reference population pair was chosen based on LD decay curve amplitude. Standard errors are estimated by jackknifing on chromosomes. Generation length is 30 years 47 . For all the pairs, see Supplementary  Table S7.
Kharia samples (South Munda) which clustered with Asur and Ho samples from North Munda (Fig. 4). All these analyses showed that Kharia and Juang were the most similar population to North Mundas among South Munda populations. Refined IBD analysis infers that North Munda populations share more long and short IBD segments among each other than with South Munda populations (see Supplementary Fig. S12). Therefore, by and large, the linguistic classification justifies itself but Kharia and Juang do not fit in this simplification perfectly. Interestingly, although Diffloth's classification of the Munda languages into North and South Munda 27 is widely cited, in 2005, Diffloth changed the position of Kharia-Juang branch on the language tree from South Munda group to be a side www.nature.com/scientificreports/ branch of the group that was previously known as North Munda 28 . Hence, this is in accordance with our findings about Juang and Kharia genetic affinities.

Methods
Samples collection and genotyping. The analyses were performed on a merged dataset of 56 new samples together with 1024 previously published samples from different studies 4,7,[29][30][31][32][33][34][35][36][37] (Supplementary Table S1) all using different but compatible Illumina genotyping arrays. The new samples were collected from Laos (Lao N = 24), Bangladesh (Santhal (NM) N = 10), and East India (Hmar N = 4, Kom N = 2, Kuki N = 6, Mizo N = 5, Naga N = 1, Nyishi N = 4). DNA was extracted from blood samples collected from healthy adult donors who signed an informed consent form. New samples were genotyped using Illumina OmniExpress Bead Chips for 730k, 710k and 650k SNPs. Since all samples used in this study were genotyped by compatible Illumina genotyping arrays we do not expect notable batch biases. Indeed, we do not observe batch differences between samples from the same population (except in cases of large population groups where sampling locations differ like Burmese) but genotyped with different array (see for example Admixture results for Ho, Naga, Paniya and Santhal on Supplementary Fig. S3). The study was approved by Research Ethics Committee of the University of Tartu. All genotyped data will be made publicly available on the ebc.ee/free_data website and NCBI-GEO public repository (accession number XXXX).
Data curation. All the samples were filtered with plink v1.9 38 . Only SNPs on autosomal chromosomes with a minor allele frequency >1% and genotyping success >97% were used in the analyses. Only individuals with a genotyping success rate >97% were left in the sample set. 245848 variants and 1072 people passed the filters; 8 Gond were removed due to low genotyping success rate. For analyses that are affected by linkage disequilibrium (PCA, ADMIXTURE), dataset was further pruned by excluding SNPs with pairwise genotypic correlation r 2 > 0.4 in a window of 200 SNPs sliding the window by 25 SNPs at a time 39 . This left us 155743 SNPs.

Population structure analyses.
To capture genetic variability, we performed PCA using software EIGENSOFT 6.1.4 40 on pruned data of the whole filtered dataset (1072 individuals). To get some idea of the Munda speakers' genetic structure in context of other Asian populations, we ran ADMIXTURE 1.23 program 41 with random seed number generator on the LD pruned data set one hundred times at K = 2 to K = 18 ( Supplementary Fig. S1). Following an established procedure, we examined the log likelihood scores (LLs) of the individual runs and found that the highest K with stable (global maximum has been reached) LL values is K = 13. Based on cross-validation (CV) procedure, genetic structure of a sample set is best described choosing the value of K with the lowest CV error. In our dataset the lowest CV error was at K = 13 ( Supplementary Fig. S2).

Tests aimed at providing demographic inferences.
To test the admixture, we ran three-population formal test of admixture 6 using Popstats program by Skoglund et al. 42 . For f3 analysis, source 1 was South Asian or West Eurasian population and source 2 was Southeast Asian or East Asian population. Outcomes with |Z| > 3 were considered significant. All the South Munda speaking tribes (Bonda, Gadaba, Juang, Kharia, Savara) were treated as one population due to small sample size. We ran outgroup f3 statistic as f3 = (SouthMunda/Ho(NM), X, Yoruba) to find the closest modern populations from our data set for South and North Munda.
To compare the sharing of SNPs on X chromosome to autosomes in Munda speakers and other relevant populations from South and Southeast Asia, outgroup f3 statistics of the form f3(Yoruba; North/South Munda, reference population) were computed separately for X chromosomes and autosomes of the same individuals using AdmixTools-1.1 21 . Only individuals for which X chromosome information existed in the dataset, could be used; for individuals present in this analysis, see Supplementary Table 1.
To retrieve the admixture proportions, we ran the qpAdm software 21 testing the following South and North Munda populations (Bonda, Gadaba, Juang, Kharia, Savara, Asur, Birhor, Ho, Korwa, Mawasi, Santhal) as a three ways mixture of all possible combinations of West (Anatolia_N, Armenia_MLBA, Germans, Iran_N, IranianLaz2016), East (Lao, Cambodian, Burmese, Mah Meri) and South (Onge, Paniya) Asian groups and using as outgroups the following groups: Natufian, WHG, Han, Kankanaey, Karitiana, MbutiLaz2016, Papuan, Ust_ Ishim, Yorubas 43,44 . To retrieve the percentage for West Asian component, we modelled the Mundas as a mixture of IranianLaz2016, Onge, and Southeast Asian population. For Southeast Asian component, we modelled the Mundas as a mixture of IranianLaz2016, Paniya and Mah Meri population which turned out to be a better fit in qpAdm analysis than Lao, Burmese, and Cambodian. We could not use Iranian and Onge separately with Mah Meri to retrieve Southeast Asian component because Onge itself shares some Southeast Asian ancestry 45,46 .
To estimate goodness of fit between data and specified model that involves multiple admixture events and drift, we used qpGraph software 21 but first, to decide if we should make a graph for every 11 Munda population or group them in some way (e.g. North and South Munda), we ran f4 analysis as described in Lazaridis et al. (2016) supplementary information section 3 43 using AdmixTools-4.1 software 21 . There was no clear clustering pattern in the Munda populations and therefore we generated 11 separate qpGraphs.
We used ALDER 23 to infer admixture dates for South Munda, Ho (NM), Santhal (NM), Birhor (NM) and Korwa (NM). We used all the populations spanning from India to Europe from our data set as source 1 and all the populations from East and Southeast Asia as source 2. The population pairs to represent admixture times were chosen based on decay status and LD decay curve amplitude. Standard errors were estimated by jackknifing on chromosomes. We used generation length of 30 years 47 .
Haplotype-based analyses. To investigate the relationship between the Munda speakers and Andamanese, we used fineSTRUCTURE 19 . For this analysis, the data was previously phased with Beagle 3.3.2 48 . A co-ancestry matrix was constructed using ChromoPainter v1 19 with the default settings. From the co-ancestry matrix, the mean chunk lengths shared between Eurasian populations and Andamanese were extracted.