Genomic insights into the formation of human populations in East Asia


The deep population history of East Asia remains poorly understood owing to a lack of ancient DNA data and sparse sampling of present-day people1,2. Here we report genome-wide data from 166 East Asian individuals dating to between 6000 BC and AD 1000 and 46 present-day groups. Hunter-gatherers from Japan, the Amur River Basin, and people of Neolithic and Iron Age Taiwan and the Tibetan Plateau are linked by a deeply splitting lineage that probably reflects a coastal migration during the Late Pleistocene epoch. We also follow expansions during the subsequent Holocene epoch from four regions. First, hunter-gatherers from Mongolia and the Amur River Basin have ancestry shared by individuals who speak Mongolic and Tungusic languages, but do not carry ancestry characteristic of farmers from the West Liao River region (around 3000 BC), which contradicts theories that the expansion of these farmers spread the Mongolic and Tungusic proto-languages. Second, farmers from the Yellow River Basin (around 3000 BC) probably spread Sino-Tibetan languages, as their ancestry dispersed both to Tibet, where it forms approximately 84% of the gene pool in some groups, and to the Central Plain, where it has contributed around 59–84% to modern Han Chinese groups. Third, people from Taiwan from around 1300 BC to AD 800 derived approximately 75% of their ancestry from a lineage that is widespread in modern individuals who speak Austronesian, Tai–Kadai and Austroasiatic languages, and that we hypothesize derives from farmers of the Yangtze River Valley. Ancient people from Taiwan also derived about 25% of their ancestry from a northern lineage that is related to, but different from, farmers of the Yellow River Basin, which suggests an additional north-to-south expansion. Fourth, ancestry from Yamnaya Steppe pastoralists arrived in western Mongolia after around 3000 BC but was displaced by previously established lineages even while it persisted in western China, as would be expected if this ancestry was associated with the spread of proto-Tocharian Indo-European languages. Two later gene flows affected western Mongolia: migrants after around 2000 BC with Yamnaya and European farmer ancestry, and episodic influences of later groups with ancestry from Turan.

Fig. 1: Overview.
Fig. 2: Model of deep population relationships.
Fig. 3: Estimates of mixture proportions using qpAdm.

Data availability

The aligned sequences are available through the European Nucleotide Archive under accession number PRJEB42781. The newly generated genotype data of 383 modern East Asian individuals have been deposited in Zenodo. The previously published data co-analysed with our newly reported data can be obtained as described in the original publications, which are all referenced in Supplementary Table 4; a compiled dataset that includes the merged genotypes used in this paper is available as the Allen Ancient DNA Resource at aadr-downloadable-genotypes-present-day-and-ancient-dna-data. Any other relevant data are available from the corresponding authors upon reasonable request.


We thank D. Anthony, O. Bar-Yosef, K. Brunson, R. Flad, P. Flegontov, Q. Fu, W. Haak, I. Lazaridis, M. Lipson, I. Mathieson, R. Meadow, I. Olalde, N. Patterson, P. Skoglund, D. Xu, P. Bellwood and C. Chiang for comments; N. Saitou and the Asian DNA Repository Consortium for sharing genotype data from present-day Japanese groups; T. Nishimoto and T. Fujisawa from the Rebun Town Board of Education for sharing the Funadomari Jomon samples, and H. Tanaka and W. Nagahara from the Archeological Center of Chiba City, who are excavators of the Rokutsu Jomon site. The excavations at the Boisman-2 site (Boisman culture), the Pospelovo-1 site (Yankovsky culture) and the Roshino-4 site (Heishui Mohe culture) were funded by the Far Eastern Federal University and the Institute of History, Archaeology and Ethnology Far Eastern Branch of the Russian Academy of Sciences; research on Pospelovo-1 is funded by RFBR project number 18-09-40101. C.-C.W. was funded by the Max Planck Society, the National Natural Science Foundation of China (NSFC 31801040), the Nanqiang Outstanding Young Talents Program of Xiamen University (X2123302), the Major project of National Social Science Foundation of China (20&ZD248), a European Research Council (ERC) grant to D. Xu (ERC-2019-ADG-883700-TRAM) and Fundamental Research Funds for the Central Universities (ZK1144). H.M. was supported by grant JSPS 16H02527. M.R. and C.-C.W. received funding from the ERC under the European Union’s Horizon 2020 research and innovation program (grant no. 646612) to M.R. H. Li was funded by NSFC (91731303, 31671297) and the B&R International Joint Laboratory of Eurasian Anthropology (18490750300). J.K. was funded by DFG grant KR 4015/1-1, the Baden Württemberg Foundation and the Max Planck Institute. Accelerator Mass Spectrometry radiocarbon dating work was supported by the National Science Foundation (NSF) (BCS-1460369) to D.J.K. and B.J.C. D.R. was funded by NSF grant BCS-1032255, NIH (NIGMS) grant GM100233, the Paul M. Allen Frontiers Group, John Templeton Foundation grant 61220, a gift from J.-F. Clin and the Howard Hughes Medical Institute.

C.-C.W., H.-Y.Y., A.N.P., H.M., A.M.K., L.J., H. Li, J.K., R.P. and D.R. conceptualized the study. C.-C.W., R.B., M. Mah, S.M., Z.Z., B.J.C. and D.R. carried out the formal analysis; C.-C.W., K. Sirak, O.C., A.K., N.R., A.M.K., M. Mah, S.M., K.W., N.A., N.B., K.C., F.C., K.S.D.C., B.J.C., L.E., S.F., D.K., A.M.L., K.M., M. Michel, J.O., K.T.O., K. Stewardson, S.W., S.Y., F.Z., J.G., Q.D., L.K., Dawei Li, Dongna Li, R.L., W.C., N., R.S., L.-X.W., L.W., G.X., H.Y., M.Z., G.H., X.Y., R.H., S.S., D.J.K., L.J., H. Li, J.K., R.P. and D.R. carried out the investigation. H.-Y.Y., A.N.P., R.B., D.T., J.Z., Y.-C.L., J.-Y.L., M. Mah, S.M., Z.Z., R.C., H. Looh, C.-J.H., C.-C.S., Y.G.N., A.V.T., A.A.T., S.L., Z.-Y.S., X.-M.W., T.-L.Y., X.H., L.C., H.D., J.B., E. Mijiddorj, D.E., T.-O.I., E. Myagmar, H.K.-K., M.N., K.-i.S., O.A.S., D.J.K., R.P. and D.R. provided resources. C.-C.W., K. Sirak, O.C., A.K., N.R., R.B., M. Mah, S.M., B.J.C., L.E., A.A.T. and D.R. curated the data. C.-C.W., H.-Y.Y., A.N.P., H.M., A.K. and D.R. wrote the paper. C.-C.W., H.-Q.Z., N.R., M.R., S.S., D.J.K., L.J., H. Li, J.K., R.P. and D.R. supervised the study.

Extended data figures and tables

Extended Data Fig. 1 PCA of ancient samples.

Projection of ancient samples onto PCA dimensions 1 and 2 defined by East Asian, European, Siberian and Native American populations.

Extended Data Fig. 2 PCA of present-day samples.

a, PCA dimensions 1 and 2 defined by present-day East Asian, European, Siberian and Native American populations. b, PCA dimensions 1 and 2 defined by present-day East Asian groups with little West Eurasian mixture.

Extended Data Fig. 3 Neighbour-joining tree of present-day East Eurasian individuals using the Human Origins dataset.

a, Neighbour-joining tree of present-day East Eurasian individuals based on FST distances using the Human Origins dataset. The branch length is shown in FST distance. b, Neighbour-joining tree of present-day East Eurasian individuals in which internal branches are all shown with the same branch length for better visualization.

Extended Data Fig. 4 Admixture plot at K = 15 using the Human Origins dataset.

a–f, We grouped the populations roughly into six groups based on geographical and genetic affinity. a, Populations mainly from Africa (yellow), America (magenta), West Eurasia (dark green and light brown) and Oceania (light magenta). b, Populations mainly from Mongolia (blue) and Siberia (purple). c, Populations mainly from southern China and Southeast Asia (light blue). d, Populations mainly from the Tibetan Plateau (olive) and Neolithic Yellow River Basin (red). e, Mainly Han Chinese groups from China (light blue and red). f, Populations mainly from the Amur River Basin (blue and red) and northeast Asia.

Extended Data Fig. 5 Estimates of population split times.

a, Cross-coalescence rates for selected population pairs. We ran MSMC for four pairs of populations: Tibetan–Ami, Tibetan–Atayal, Tibetan–Ulchi and Tibetan–Mixe. We used one individual from each population in this analysis. The modern genomic data for those individuals are from the Simons Genome Diversity Project. The times are calculated based on the mutation rate and generation time specified on the x axis. b, Cross-coalescence rates for selected population pairs. The same analysis as shown in a but using MSMC2 instead of MSMC, and using two individuals per population except for the Tibetan–Atayal pair, for which we used only one.
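The time scaling mentioned in a can be sketched as follows. This is a minimal illustration, assuming the usual MSMC convention that output times are expressed in units of the per-generation mutation rate; the function names and the default mutation rate and generation time are placeholders chosen for the example, not the values used for the figure.

```python
def scaled_time_to_years(scaled_time, mutation_rate=1.25e-8, generation_years=29.0):
    """Convert a mutation-scaled time to years.

    The defaults are illustrative assumptions, not the values from the figure.
    """
    if mutation_rate <= 0 or generation_years <= 0:
        raise ValueError("mutation_rate and generation_years must be positive")
    # Dividing by the per-generation mutation rate gives time in generations.
    generations = scaled_time / mutation_rate
    return generations * generation_years


def relative_cross_coalescence(lambda_within_1, lambda_across, lambda_within_2):
    """Relative cross-coalescence rate: 2 * across / (within_1 + within_2)."""
    denominator = lambda_within_1 + lambda_within_2
    if denominator <= 0:
        raise ValueError("within-population coalescence rates must sum to a positive value")
    return 2.0 * lambda_across / denominator
```

A relative rate close to 1 indicates that the two populations still behave as one, and a rate close to 0 indicates that they have fully separated; the time at which the curve drops through an intermediate value is commonly read as an approximate split time.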

Extended Data Fig. 6 Admixture graph model.

This figure is the same as Fig. 2 except we show the fitted genetic drifts on each lineage. We used all available sites in the dataset, comprising 1,237,207 SNPs, and also restricted to transversions only to confirm that the same model fit (Supplementary Information section 3). We started with a skeleton tree that fits the data for Denisovan, Mbuti, Onge, Tianyuan and Luxembourg Loschbour and one admixture event. We grafted on Mongolia East Neolithic, Late Neolithic farmers from the Upper Yellow River, Liangdao 2, Japan Jomon, Nepal Chokhopani, Taiwan Hanben and Late Neolithic farmers from the West Liao River in turn, adding them consecutively to all possible edges in the tree and retaining only graph solutions that provided no differences of |Z| > 3 between fitted and estimated statistics (maximum |Z| = 2.95 here). We used the MSMC and MSMC2 relative population split time estimates to constrain models. Deep splits are not well constrained because of the minimal availability of data on East Asian populations from the Upper Paleolithic. a, Locations and dates of the East Asian individuals used in model fitting, with colours indicating whether the majority ancestry is from the hypothesized coastal expansion (green), interior expansion south (red) and interior expansion north (blue). The map is based on the ‘Google Map Layer’ from ArcGIS Online Basemaps (map data ©2020 Google). The grey circles represent sampled populations and white circles represent unsampled hypothesized nodes. b, In the model visualization, we colour lineages modelled as deriving entirely from one of these expansions, and also colour populations according to ancestry proportions. Dashed lines represent admixture (proportions are marked), and we show the amount of genetic drift on each lineage in units of FST × 1,000.
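The acceptance rule for candidate graphs can be illustrated with a short sketch. It assumes that, for each f-statistic, a fitted value, an estimated value and a standard error are already available as plain lists; the function names and the way the inputs are supplied are hypothetical and do not reproduce any particular software interface.

```python
def max_abs_z(fitted, estimated, std_errors):
    """Largest |Z| over all statistics, with Z = (fitted - estimated) / standard error."""
    if not (len(fitted) == len(estimated) == len(std_errors)) or not fitted:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    return max(abs((f - e) / se) for f, e, se in zip(fitted, estimated, std_errors))


def graph_is_acceptable(fitted, estimated, std_errors, threshold=3.0):
    """Keep a graph only if no statistic deviates by `threshold` standard errors or more."""
    return max_abs_z(fitted, estimated, std_errors) < threshold
```

Under this rule a graph whose worst deviation is |Z| = 2.95 is retained, whereas a single statistic that is off by 3 or more standard errors is enough to reject the graph.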

Extended Data Fig. 7 Shared genetic drift among Tibetan groups, measured by f3(X, Y; Mbuti).

Lighter colours indicate more shared drift. The Lahu group with the Southeast Asian cluster, probably owing to substantial admixture. The Tibetan_Yajiang are geographically in the Tibeto-Burman Corridor but group with Core Tibetan individuals, presumably reflecting less genetic admixture from people of the Southeast Asian cluster.
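The statistic f3(X, Y; Mbuti) measures the genetic drift that X and Y share after their separation from the outgroup. The following is a minimal sketch of the idea, assuming that per-SNP allele frequencies for the three populations are already available as equal-length lists; the function name is hypothetical, and this simple average omits the finite-sample corrections and standard errors that dedicated software provides.

```python
def outgroup_f3(freq_x, freq_y, freq_outgroup):
    """Naive outgroup f3: mean over SNPs of (x - o) * (y - o).

    Larger values mean X and Y share more drift relative to the outgroup.
    """
    if not (len(freq_x) == len(freq_y) == len(freq_outgroup)) or not freq_x:
        raise ValueError("frequency lists must be non-empty and of equal length")
    total = sum((x - o) * (y - o) for x, y, o in zip(freq_x, freq_y, freq_outgroup))
    return total / len(freq_x)
```

Computing this value for every pair of groups gives a symmetric matrix that can be displayed as a heat map of pairwise shared drift.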

Extended Data Table 1 Population information for newly genotyped present-day individuals
Extended Data Table 2 Kinship detected between pairs of individuals

Supplementary information

Supplementary Information

This Supplementary Information file contains an Ethics Statement, Supplementary Information sections 1–4, including 15 Supplementary Figures, 5 Supplementary Tables and Supplementary References. The supplementary figures and tables provide information on the genetic structure and population history of East Asians.

Reporting Summary

Supplementary Tables

This zipped file contains 26 Supplementary Tables and a table guide.

Supplementary Data

Genotypes of the newly reported 166 ancient individuals.

Peer Review File

