Introduction

The high ethnolinguistic diversity of Mainland Southeast Asia (MSEA) reflects the complex population history of this region1. Anatomically modern humans arrived in MSEA approximately 50,000 years ago2. The early to mid-Holocene hunter-gatherers associated with the Hoabinhian archaeological tradition (with genomic data available for individuals from Laos and Malaysia dated to ca. 8000 and 4500 years before present, respectively) were modelled as a deeply diverged East Eurasian lineage, related to present-day Andamanese and to MSEA foragers such as Jehai3. During the Neolithic period (starting at ~ 4500 years before present, BP), ancient MSEA populations exhibited an admixed genetic profile between the deeply diverged East Eurasian lineage and East Asians3,4. The same genetic mixture is typical for some present-day groups speaking Austroasiatic languages, such as the Mlabri4. The genetic structure of Bronze Age individuals from Northern Vietnam (dated to ~ 2000 BP) suggests an additional wave of migration from Southern China3,4.

Debates over the origins of early states in MSEA have hinged on the nature of contact with South Asia for more than 50 years (e.g., Cœdès 19685). Archaeological evidence revealed that the cultural interaction between Indian and local MSEA people began by the fourth century BCE, creating synergies that influenced the formation of early states in MSEA6,7. Various present-day MSEA populations, especially those highly influenced by Indian culture, harbor a low level of South Asian admixture: for instance, Bamar, Cham, Khmer, Malay, Mon, and Thai8,9,10. Using methods based on autosomal haplotypes, this genetic admixture was dated to around the 14th c. CE (1194 CE–1502 CE) in a Khmer group from Cambodia11). In a later study, the admixture date for the Khmer group from Cambodia was confirmed (12th–13th cc. CE), and admixture date estimates for a wider range of present-day SEA groups were obtained: they lie between the 16th c. CE for Bamar and the 4th cc. CE for Giarai (Jarai) from Vietnam9,10. Previous ancient DNA studies, although of very limited extent due to poor ancient DNA preservation in tropical climate, did not detect South Asian ancestry in any ancient MSEA individuals3,4.

Here we report a finding of substantial South Asian ancestry in a Protohistoric Period male child from the Vat Komnou cemetery at the Angkor Borei site on the western edge of Cambodia’s Mekong Delta. The walled and moated Angkor Borei site was first occupied in the middle of the first millennium BCE and is one of the earliest dated urban sites in the Mekong Delta in either Cambodia or Vietnam12. In addition to brick architectural monuments, associated moats, and ponds, the Vat Komnou mortuary assemblage includes human burials, beads, ceramics, multiple pig skulls, and other faunal remains13,14. The Vat Komnou cemetery is one of the largest archaeological skeletal samples analyzed to date from Cambodia.

Genetic data for the male child from the Vat Komnou cemetery (skeletal code AB M-40) were first reported by Lipson et al.4: 0.047 × coverage on the 1240K SNP panel used for enrichment of human ancient DNA15,16, and we increased the coverage to 0.061 × and generated variant calls at 64,103 1240K sites (as compared to 54,221 sites in Lipson et al.4). A direct radiocarbon date on the human bone (95% confidence interval is 78–234 calCE) was also obtained for the individual AB M-40 by Lipson et al.4.

Results

Genetic overview of ancient Southeast Asians

The location of the ancient individual from Cambodia (AB M-40 or I1680) is illustrated in Fig. 1. We explored genetic ancestry of this individual and other ancient individuals from Southeast Asia for whom published genetic data of acceptable quality were available (i.e., pseudo-haploid genotypes are available for more than 50,000 1240K sites). We computed principal components (PCs) using present-day populations from Europe, from East, Southeast, South, and Central Asia, the Andaman Islands, and Siberia, and projected 28 ancient individuals (from McColl et al.3, Lipson et al.4, and this study) onto these PCs. The PC1 vs. PC2 plot (Fig. 2, Suppl. Fig. S1) reveals three clusters of individuals: East and Southeast Asian (ESEA), European (EUR), and Andamanese Negrito. South Asian (SAS) populations form a cline between the European and Andamanese Negrito clusters, with populations from Pakistan and Northern India lying closer to the European cluster, in line with previous studies on the population genetics of South Asia17,18,19. The positions of populations speaking languages of the Munda branch of the Austroasiatic language family, such as Birhor and Kharia, deviate from the South Asian cline, and this is not unexpected since they were shown to harbor Southeast Asian admixture18,20. Kusunda, a group from Nepal, and Riang, a Sino-Tibetan-speaking group from India, fall close to the East and Southeast Asian cluster. These populations harbor a relatively high proportion of an ancestry component maximized in ESEA populations according to our ADMIXTURE analysis (Fig. 3). Central Asian and Siberian populations are distributed along PC1 between the SAS–EUR cline and the ESEA cluster, in line with many previous studies, such as Jeong et al.21. Most ancient individuals dated to 2500 BCE–1950 CE from Southeast Asia fall within the ESEA cluster, except for the individual AB M-40/I1680 from the Vat Komnou site who deviates from the ESEA cluster towards the South Asian cline. The ADMIXTURE22 analysis also indicates that this individual stands out among Southeast Asians since his genome harbors a high proportion of both “blue” and “grey” ancestry components, maximized in European and Andamanese populations, respectively. Both components are also prominent in South Asian populations (Fig. 3). Statistics f4(Orcadian, Japanese; AB M-40/I1680, another ancient ESEA individual) support the existence of the West Eurasian genetic component in the individual AB M-40/I1680 as all f4-statistic values are positive, with |Z| > 2 in most cases (Suppl. Fig. S2), which indicates that the Orcadian group shares more genetic drift with I1680 than with other ancient SEA individuals. All the other ancient SEA individuals show mixed positive and negative values of similar f4-statistics (Suppl. Fig. S2). However, given ADMIXTURE (Fig. 3) and f4-statistic results for South Asians, I1680 has obtained the West Eurasian genetic component most probably via South Asian groups. For example, statistics f4(Vellalar, Japanese; I1680, another ancient MSEA ind.) show results (Suppl. Fig. S3) similar to those for statistics f4 (Orcadian, Japanese; I1680, another ancient MSEA ind.) (Suppl. Fig. S2). The “red” component, which is maximized in African populations, is ubiquitously present in a trace amount in all ancient groups. This ubiquitous presence suggests we are dealing with an artefact. The Papuan-enriched component (pink), which is present in some ancient individuals and in ESEA and SAS present-day groups (Fig. 3), could represent a signal of the deeply diverged East Eurasian lineage3,4,10,23.

Figure 1
figure 1

Geographic locations of the ancient individual AB M-40/I1680 sequenced in this study (the red circle) and of the present-day South Asian surrogates used in qpAdm analysis (other marker shapes). The South Asian surrogates which are not included in any plausible qpAdm model are labeled with black cross markers.

Figure 2
figure 2

Principal component analysis (PCA). Ancient Southeast Asian individuals were projected on the principal components calculated using present-day Eurasian populations, and the first two components (PC1 and PC2) are shown. Abbreviations for meta-populations are as follows: CAS present-day Central Asians, ESEA present-day East and Southeast Asians, EUR present-day Europeans, NEGA present-day Andamanese Negritos, NEGM present-day Mainland Southeast Asian Negritos, SAS present-day South Asians, SIB present-day Siberians, Hoa. Hoabinhian culture, N Neolithic, LN Late Neolithic, PH Protohistoric period, BA Bronze Age, His. Historical period.

Figure 3
figure 3

An ADMIXTURE analysis plot showing results for 6 hypothetical ancestral groups (K = 6). Abbreviations for meta-populations are as follows: Ancient MSEA Ancient Mainland Southeast Asians, AFR present-day Africans, EUR present-day Europeans, CAU present-day Caucasians, ME present-day Middle Easterners, NEG present-day Andamanese Negritos, PAP present-day Papuans, SAM present-day Native Meso- and South Americans, ESEA present-day East and Southeast Asians, SIB present-day Siberians, CAS present-day Central Asians, SAS present-day South Asians, N Neolithic, LN Late Neolithic, PH Protohistoric period, BA bronze age, His. Historical period. The number of individuals for each population is indicated in brackets after the population name.

Present-day Southeast Asian groups, such as Kinh and Cambodians share more drift with the Protohistoric individual AB M-40/I1680 from Cambodia than most present-day ESEA populations as inferred using “outgroup” f3-statistics f3(Mbuti; AB M-40/I1680, X) (Fig. 4). To test if any ESEA group is asymmetrically related to the Protohistoric individual AB M-40/I1680 and present-day Cambodians, which could imply an additional ESEA gene flow to one of these groups, we calculated f4-statistics of the form f4(an ESEA group, all other ESEA groups; AB M-40/I1680, present-day Cambodians). Values of the statistics f4(Atayal, all other ESEA groups; AB M-40/I1680, present-day Cambodians) are all negative, with most of absolute Z-scores higher than 2 (Suppl. Fig. S4). f4-statistics of this type indicate that Atayal-related (probably Austronesian-speaking) groups and present-day Cambodians share additional genetic drift as compared to the Protohistoric individual AB M-40/I1680. On the other hand, statistics f4(Kinh [data from Mallick et al.24], all other ESEA groups; AB M-40/I1680, present-day Cambodians) are all positive, with most absolute Z-scores higher than 2 (Suppl. Fig. S4), which indicates attraction between present-day Kinh (from Mallick et al.24) and the Protohistoric individual AB M-40/I1680. However, when we replaced Kinh from Mallick et al.24 with a much larger Kinh group from the 1000 Genomes Project25, the f4-statistics showed a mix of positive and negative values (Suppl. Fig. S4). The dataset-dependent attraction between Kinh and the Protohistoric individual AB M-40/I1680 indicated that the question about their relationship cannot be settled at this point.

Figure 4
figure 4

Genetic affinity between the ancient individual AB M-40/I1680 and present-day East and Southeast Asian groups. Genetic drift sharing was estimated using outgroup f3-statistics in the form of f3(Mbuti; AB M-40/I1680, X), where X is a present-day East and Southeast Asian group.

To sum up, these PCA and ADMIXTURE results (and f4-statistics) suggest that substantial South Asian admixture is present in the individual AB M-40/I1680, which sets him apart from 25 other Southeast Asians for whom genetic data of acceptable quality was reported in the literature3,4.

South Asian ancestry in ancient Southeast Asians

We explored further the signal of South Asian admixture in ancient Southeast Asians by inspecting a scatterplot of outgroup f3-statistics, f3(Mbuti; South Asian, X) and f3(Mbuti; East Asian, X), where X are ancient individuals from Southeast Asia, present-day ESEA populations, Tibeto-Burman-speaking groups from India and Nepal, and Kusunda from Nepal (Fig. 5, Suppl. Fig. S5). We used the Brahmin from Uttar Pradesh26 (North Indians) and Vellalar26 (South Indians) as alternative SAS surrogates, and Atayal and Dai27 as alternative ESEA surrogates. Most present-day and ancient groups from ESEA demonstrate a linear relationship of genetic drift sharing with the South Asian and East Asian surrogates. f3-statistics for various groups, namely the ancient individual AB M-40/I1680 from Cambodia, Tibeto-Burman-speaking populations (Tibetan and Sherpa) and Kusunda from Nepal, Austroasiatic-speaking (Birhor and Kharia) and Tibeto-Burman-speaking (Riang) groups from India, do not conform to this trend line (Fig. 5, Suppl. Fig. S5). The shift indicates excessive shared genetic drift with the South Asian surrogate in these populations compared to most ESEA groups. Notably, South Asian admixture in Bamar, Tibetans, Sherpa, and Austroasiatic-speaking populations from India has been reported in previous studies8,18,20,28,29. The result is consistent across combinations of ESEA and SAS surrogates (Fig. 5, Suppl. Fig. S5).

Figure 5
figure 5

A biplot of outgroup f3-statistics. The plot of f3(Mbuti; Vellalar, X) vs. f3(Mbuti, Dai, X), where X is an ancient individual from Southeast Asia or present-day populations from East, Southeast, or South Asia. The plot demonstrates shared genetic drift between population X and Vellalar (a SAS population) or Dai (an ESEA population). The trend line illustrates the ratio of genetic drift sharing between most ESEA populations and the SAS and ESEA surrogate. Abbreviations: N Neolithic, LN Late Neolithic, PH Protohistoric period, BA Bronze Age, His. Historical period.

We formally tested for South Asian admixture in the ancient individual AB M-40/I1680 from Cambodia (along with other published ancient Southeast Asian individuals) using a qpAdm30 protocol with a “rotating” set of reference or “right” populations31. We first performed pairwise qpWave tests to check if the target ancient individual is cladal (genetically continuous) with a single reference population (see “Materials and methods” for details). None of these pairwise qpWave models were plausible (Suppl. Table S1). Next, we tested all possible pairs of ancestry sources (66 pairs) from the set of reference populations (see “Materials and methods”), and found that the genome of the Protohistoric Cambodian individual AB M-40/I1680 can be modelled as a two-way mixture of East Asian and South Asian populations (Fig. 6, Suppl. Table S1). Among plausible models (those with p-value is higher than 0.05 and with inferred admixture proportions ± 2 standard errors between 0 and 1 for all ancestry components) for the Protohistoric Cambodian individual AB M-40/I1680, Ami is the only East Asian surrogate, and Irula, Mala, and Vellalar are the only non-East Asian surrogates that result in plausible models (Fig. 6, Suppl. Table S1). Remarkably, all these surrogates are from Southern India (Fig. 1), and the fraction of Southern Indian ancestry was estimated at 42–49% across all plausible qpAdm models (Fig. 6, Suppl. Table S1). The fact that qpAdm models including European, Middle Eastern, Caucasian, or even Northern Indian ancestry sources are rejected suggests that contamination with modern DNA is an unlikely explanation for our results. Moreover, the libraries passed routine ancient DNA authenticity checks such as damage rate at a terminal read nucleotide: a damage rate below 0.03 is considered problematic4. See Suppl. Table S2 for detailed statistics for the merged data and for each library.

Figure 6
figure 6

Admixture proportions estimated by qpAdm. The plot illustrates all plausible qpAdm models inferred using the rotating strategy (see details in “Materials and methods”). All the plausible models consist of two surrogates: (1) an East and Southeast Asian surrogate (Ami, on the right-hand side of each bar) and (2) a South Asian surrogate (Irula, Mala, and Vellalar on the left-hand side of each bar). Results for all qpWave and qpAdm models tested are available in Suppl. Table S1.

The genetic makeup of some ancient individuals from Laos (La727), Malaysia (Ma554, Ma555), Thailand (Th531), and Vietnam (I859, I2947, I0626, Vt880) can be interpreted in the light of our qpAdm models as a two-way admixture of East Asians and a deeply diverged East Eurasian lineage (represented by Onge and Papuans in the set of reference populations) (Suppl. Table S1). The results for all the other ancient individuals are ambiguous, as plausible models for some individuals are diverse (i.e., East Asians + various proxies such as West Asians, Caucasians, Europeans, South Asians, the deep East Eurasian lineage). The fact that qpAdm failed to reject multiple models with various West Eurasian proxies could be caused by modern DNA contamination or by lack of data. For few individuals, we were unable to find any plausible qpAdm model (Suppl. Table S1).

The Protohistoric Cambodian individual AB M-40/I1680 harbors a considerably higher proportion of South Asian ancestry as compared to present-day Cambodians (42–49% vs 9%), as estimated in a previous study10. However, the qpAdm reference populations used here and in the previous study10 are different. To avoid an unforeseen bias which may cause the excess of South Asian ancestry in the Protohistoric Cambodian individual AB M-40/I1680, we estimated the proportion of South Asian admixture with the same set of surrogates and qpAdm “right” populations for the ancient individual and present-day Cambodians. Since no plausible model was found for present-day Cambodians when we used the reference population set from the analysis above, we replaced Ami with Dusun and performed qpAdm analysis with the new set of reference populations (see “Materials and methods”). The results indicate that South Asian ancestry in the ancient individual is indeed much higher than in present-day Cambodians: 37–44% and 12–15%, respectively (Suppl. Table S1).

Discussion

We report new archaeogenomic data from a radiocarbon-dated individual AB M-40/I1680 from Cambodia (95% confidence interval for the date is 78–234 calCE) that serves as a key data point for dating the beginning of the South Asian gene flow into Southeast Asia. This individual was previously analyzed by Lipson et al.4, but we generated additional data to increase the sensitivity of the analyses in the current study. South Asians were not considered as a potential ancestry source for MSEA groups in the original study, which explains the discrepancy between our results and those by Lipson et al.4. For instance, PCA in Fig. 1 of that study4 lacks any West Eurasian groups, and another PCA in Fig. S1 of that study4 lacks any South Asian groups except for the Andamanese. Nevertheless, according to the latter PCA4 the individual AB M-40/I1680 is shifted towards Europeans and Andamanese as compared to all present-day and most ancient MSEA individuals.

In addition to our genetic results, other anthropological results on the individual AB M-40/I1680 are compatible with his admixed genetic profile. Slight expression of Carabelli’s cusp was observed in the maxillary permanent right and left first molars and left second molar of this individual, and some of the highest frequencies of this dental trait have been reported for the present-day inhabitants of Western Eurasia and India32. Strontium isotopic analysis of this individual suggests limited evidence of human mobility33. Since the individual AB M-40/I1680 was a young child the strontium isotope data do not rule out the possibility of his parents’ mobility. Nevertheless, outgroup f3-statistics (Fig. 4) suggest that present-day populations geographically close to the individual AB M-40/I168, such as Kinh (Vietnamese) and Cambodians, are among the top present-day East Asian populations which harbor a large amount of shared genetic drift with the individual AB M-40/I1680. Thus, it is likely that the ancestors who contributed the East Asian genetic component to the individual AB M-40/I1680 were local Southeast Asians.

South Asian admixture is present in various present-day MSEA populations, especially those heavily influenced by Indian culture8,9,10. However, previous aDNA studies did not detect South Asian admixture in ancient individuals across MSEA, although due to the immense technical difficulties genomic data (of any quality) were generated for just 21 individuals from Cambodia, Laos, Malaysia, Thailand, and Vietnam dated to the 5th c. BCE and later3,4. As suggested by archaeological evidence6,7, we expect to find traces of interactions with South Asia starting in the 4th c. BCE.

An earlier study estimated the proportion of South Asian ancestry in present-day Cambodians at ca. 9%10. Notably, the proportion of South Asian ancestry in the ancient individual from Cambodia (point estimates ranging from 42 to 49% across all plausible qpAdm models) is considerably higher than in the present-day majority population of the country and in any present-day MSEA population investigated in the literature (see an overview in Changmai et al.10). In this study, we also demonstrate that the ancient individual from Cambodia harbors an approximately three times higher proportion of South Asian ancestry as compared to the present-day Cambodians, as estimated by qpAdm with the same set of reference populations (Suppl. Table S1). We were also not able to detect a clear signal of South Asian ancestry in any other 25 published ancient individuals from 2500 BCE to 1950 CE across Southeast Asia. Since we are dealing with a single individual only, it is hard to estimate the intensity of the gene flow and to extrapolate this estimate to the general Protohistoric population in the territory of Cambodia.

The location and date (1st–3rd c. CE) of the individual AB M-40/I1680 fit the early period of Funan, an early state in the territories of Cambodia and Vietnam. Chinese written sources documented that the Funan dynasty was established by an Indian Brahmin named Kaundinya and a local princess named Soma34. Archaeological evidence from glass and stone beads recovered from the Mekong Delta and peninsular Thailand35 and archaeobotanical remains36 suggests the possibility of multi-ethnic residence in areas of Protohistoric MSEA whose populations engaged in maritime trade (e.g., Bellina37). Collectively, these data suggest some level of Indian cultural influence in the Mekong Delta in the 1st–3rd c. CE. The only plausible South Asian genetic sources for the individual from Funan, inferred by the qpAdm method, are populations from Southern India. Even though the results suggest that South Indian populations are by far the most plausible surrogates for this individual, we caution that the actual ancient sources possibly had a genetic profile different to the present-day South Asian populations we used for the qpAdm analysis.

Consensus now holds that the early first-millennium Mekong Delta kingdoms associated with the Chinese-named Funan were the predecessor states for the Angkorian empire, but what language was spoken before the seventh century CE (when the earliest inscription in the Delta appeared, at Angkor Borei38) remains a mystery. The term “Funan” is based on the modern Chinese pronunciation. Whether the local pronunciation was “biunâm”, resembling the old Khmer word “bnam”, meaning “mountain,” remains an unresolved question5. Continuities in architectural style, imagery, and settlement forms from the Protohistoric through later Angkorian period suggest that some residents of Funan spoke old Khmer. Outgroup f3-statistics (Fig. 4) support this point as they show that Cambodians and Kinh (Vietnamese) are present-day groups sharing the highest amount of genetic drift with the ancient individual AB M-40/I1680 from Funan. When we compared the ancient individual AB M-40/I1680 and present-day Cambodians, we detected a signal of additional gene flow from Atayal-related (probably Austronesian-speaking) groups in present-day Cambodians. The signal, previously found by a different method in an earlier study10, could be a result of long-term interactions between Cambodians and Cham, an Austronesian-speaking group and the biggest ethnic minority group in present-day Cambodia1, documented throughout their history5. Kinh (data from Mallick et al.24) share more genetic drift with the ancient individual AB M-40/I1680 than present-day Cambodians (Suppl. Fig. S4), which is consistent with outgroup f3-statistic results (Fig. 4). However, the extra attraction was not detected in Kinh from the 1000 Genomes Project25 (Suppl. Fig. S4). We note that the Kinh group from Mallick et al.24 consists of just two individuals, while there are 97 individuals available from the 1000 Genomes Project25. Thus, these two Kinh individuals from Mallick et al. may be genetically close to the ancient individual AB M-40/I1680, but may not be representative of the general Kinh population.

Changmai et al.10 estimated the date of South Asian admixture events in various present-day MSEA populations between the 4th and 16th cc. CE. The inferred dates of South Asian admixture in present-day Cambodians and in a Khmer group from Thailand are 771–808 BP and 1218–1291 BP (95% confidence intervals), respectively10, but the date of the ancient individual AB M-40/I1680 is much older (1716–1872 calBP, 95% confidence interval). Indeed, true dates of admixture events, especially its starting date, do not necessarily fall into the range of admixture dates estimated from similar present-day populations from the same or proximate locations. Our results suggest that some South Asians migrated to MSEA and intermarried with local people before or at the early stage of state formation. These South Asians may have influenced the expansion of Indian culture and the establishment of Indian-style states (also known as Indianized states5). aDNA data from MSEA are scarce, so further data collection is necessary to trace the incorporation of South Asian ancestry into the populations of this region.

Materials and methods

Archaeological context, ancient sample preparation, and dataset compilation

Archaeological field investigations at the Vat Komnou cemetery (Angkor Borei, Cambodia) were undertaken by the Lower Mekong Archaeological Project (LOMAP) in 1999 and 2000 as a collaboration between the University of Hawai ‘i-Mānoa (USA), the Ministry of Culture and Fine Arts (Kingdom of Cambodia), and the Royal University of Fine Arts (Kingdom of Cambodia). A total of 111 individuals were sorted and analyzed from 57 burial features excavated from excavation unit AB7, a 5 m by 2 m unit near the edge of the cemetery mound. Many of the burials were commingled (in one case the remains of at least five individuals are represented) and lack clear grave cuts. The extensive commingling of the burials is due, in part, to past disturbances of the stratified nature of this densely packed cemetery, often resulting in subsequent interments in the same location14. Despite site disturbance, the presence of 33 primary burials, most interred in the same orientation (with head pointing southwest), suggests the excavated portion (approximately 0.03 percent of the mound) was part of a designated communal burial area.

AB M-40/I1680, a Protohistoric Period male individual (dated to 78–234 calCE4) from the Vat Komnou cemetery, is the same individual as AB40 in Lipson et al.4. The individual was sorted from commingled remains of at least five different individuals. The skeletal remains for AB M-40 included cranial and postcranial elements of a male child whose age was estimated to be 5–7 years at the time of death based on clavicular diaphyseal length, the degree of epiphyseal fusion, and dental calcification and eruption. The cranial elements include the right and left parietals, occipital, frontal, right zygomatic, inferior maxillary bones, left temporal, and the petrous portion of the right temporal. The mandible was missing the right ramus. A left clavicular shaft and a thoracic vertebra represent the postcranial remains for this individual. DNA was obtained from the petrous portion of the right temporal bone (for details on DNA extraction and library preparation see Lipson et al.4).

In the present study, we generated two single-stranded sequencing libraries for this individual (according to a protocol by Gansauge et al.39) in addition to 7 analyzed by Lipson et al.4, which increased the coverage for autosomal targets on the 1240K SNP panel15,16 from 0.047 × to 0.061 × and increased the number of usable SNPs from 54,221 to 64,103 1240K sites (see details on all the sequencing libraries in Suppl. Table S2). We merged the new data with published genome-wide data for ancient individuals from Southeast Asia3,4 and worldwide present-day populations24,27,40,41,42 using PLINK v.1.90b6.10 (https://www.cog-genomics.org/plink/). We kept only autosomal SNPs from the 1240K panel15,16 and filtered out ancient individuals who have data for fewer than 50,000 autosomal SNPs. Details of the dataset composition are available in Suppl. Table S3.

Principal component analysis (PCA)

We performed PCA using smartpca43 version 16,000 from the EIGENSOFT package (https://github.com/DReichLab/EIG). We computed principal components for present-day populations from Central, East, Southeast, and South Asia, Andamanese Islands, Siberia, and Europe. We then projected the ancient Southeast Asian individuals onto the principal components using options "lsqproject: YES" and "shrinkmode: YES".

ADMIXTURE

We performed LD-pruning using PLINK v.1.90b6.10 with options "-indep-pairwise 50 5 0.5" (window size = 50 SNPs, window step = 5 SNPs, r2 threshold = 0.5). The LD-pruned dataset included 523,650 SNPs. We performed ADMIXTURE analysis22 using ADMIXTURE v.1.3 (https://dalexander.github.io/admixture/download.html). We first tested 4–12 hypothetical ancestral populations (K from 4 to 12) with tenfold cross-validation and five independent algorithm iterations for each K value. The CV errors are almost indistinguishable for K from 6 to 10 and grow for K values above 10 (Suppl. Fig. S6). We subsequently ran up to 30 algorithm iterations for K = 6 and chose an iteration with the highest value of the model likelihood as the final result shown in Fig. 3. We visualized the ADMIXTURE results using AncestryPainter44 (https://www.picb.ac.cn/PGG/resource.php).

Outgroup f 3-statistics

We used the function "qp3pop" from the R package "ADMIXTOOLS2"45 with default settings (https://uqrmaie1.github.io/admixtools/index.html). For outgroup f3-statistics in Fig. 4, we calculated statistics of the form f3(Mbuti; the ancient individual AB M-40/I1680, X), where X are present-day populations from East and Southeast Asia.

For the f3-biplots in Fig. 5 and Suppl. Fig. S5, we computed statistics f3 (Mbuti; SAS, X) and f3 (Mbuti; ESEA, X), where X are ancient individuals from Southeast Asi or, present-day populations from East, Southeast, and South Asia; SAS are South Asian surrogates (Brahmin Uttar Pradesh or Vellalar); and ESEA are East and Southeast Asian surrogates (Atayal or Dai).

f 4-statistics

We computed f4-statistics using the function "qpdstat" from the R package "ADMIXTOOLS2"45 with default settings (https://uqrmaie1.github.io/admixtools/index.html). For the plot in Suppl. Figs. S2S4, we calculated statistics of the form f4(Orcadian, Japanese; an ancient MSEA individual, all other ancient MSEA individuals), f4(Vellalar, Japanese; an ancient MSEA individual, all other ancient MSEA individuals), and f4(an ESEA group, all other ESEA groups; I1680, present-day Cambodians), respectively.

qpWave and qpAdm

We used a set of 13 populations as “outgroup” (“right”) populations: 12 diverse worldwide populations (Mbuti, Ami, Japanese, Yi, Onge, Papuan, Pima, Karitiana, Druze, Adygei, Sardinian, and Orcadian) + one of the following South Asian populations (Balochi, Bengali, Brahmin Uttar Pradesh, Brahui, Burusho, Irula, Kalash, Makrani, Mala, Pathan, Punjabi, Rajput, Sindhi, and Vellalar). All the South Asian surrogates were composed of at least three individuals. In total, we had 14 alternative sets of reference populations. For consistency, in all qpWave and qpAdm analyses we fixed Mbuti as R1 in a matrix of statistics f4(L1,Li;R1,Rj).

We first performed pairwise qpWave modeling using the "qpwave" function from the R package "ADMIXTOOLS2"45. We considered a pair composed of the target individual (AB M-40/I1680 or one of 25 other ancient MSEA individuals) and one of the 12 reference populations (all except Mbuti) as "left populations". We assigned the remaining 12 reference populations as "right populations". We used a cut-off p-value of 0.05 and tested both "allsnps = TRUE" (using all the possible SNPs for each f4-statistic) and "allsnps = FALSE" settings (using only SNPs with no missing data at the group level across all the “left” and “right” populations). Overall, we tested 168 group pairs with qpWave per one target group per one "allsnps" setting.

As none of the qpWave models for all the ancient MSEA targets was fitting the data, we further tested 2-way admixture models using qpAdm with a “rotating” strategy31. All possible combinations of two reference populations (except for Mbuti) acted as surrogates for the target. The remaining 11 reference populations were assigned as "right populations". We defined a model that meets two following criteria as a "plausible model": (1) p-value is higher than 0.05; (2) inferred admixture proportions ± 2 standard errors lie between 0 and 1 for all ancestry components. We tested qpAdm models using the "qpadm" function from the R package "ADMIXTOOLS2". We tested both "allsnps = TRUE" and "allsnps = FALSE" settings. In total, 924 models per target per one "allsnps" setting were tested.

For comparing South Asian ancestry proportions in the Protohistoric Cambodian individual AB M-40/I1680 and in present-day Cambodians, we replaced Ami with Dusun in the set of reference populations. We modeled the ancient individual and present-day Cambodians as a 2-way admixture of Dusun and one of the following South Asian populations: Irula, Mala, and Vellalar (plausible South Asian sources for the Protohistoric Cambodian individual in the previous qpAdm analysis). The set of right populations consists of Mbuti, Japanese, Yi, Onge, Papuan, Pima, Karitiana, Druze, Adygei, Sardinian, and Orcadian.