Large-scale lexical and genetic alignment supports a hybrid model of Han Chinese demic and cultural diffusions


Han Chinese history has been shaped by substantial demographic activities and sociocultural transmissions. However, it remains challenging to assess the contributions of demic and cultural diffusion to Han culture and language, primarily because genetic–linguistic congruence has not been rigorously examined. Here we digitized a large-scale linguistic inventory comprising 1,018 lexical traits across 926 dialect varieties. Using phylogenetic analysis and admixture inference, we revealed a north–south gradient of lexical differences that probably resulted from historical migrations. Furthermore, we quantified extensive horizontal language transfers and pinpointed central China as a dialectal melting pot. Integrating genetic data from 30,408 Han Chinese individuals, we compared the lexical and genetic landscapes across 26 provinces. Our results support a hybrid model where demic diffusion predominantly impacts central China, while cultural diffusion and language assimilation occur in southwestern and coastal regions, respectively. This interdisciplinary study sheds light on the complex social-genetic history of the Han Chinese.

Fig. 1: Geographic characteristics of Chinese dialects.
Fig. 2: Manhattan plot of outlier lexical traits.
Fig. 3: Internal structure of Chinese dialects.
Fig. 4: Admixture patterns of Chinese dialects.
Fig. 5: Correlation between the genetic component of Northern Han populations and the linguistic component of northern language (Mandarin) in China.

Data availability

The lexical inventory of Chinese dialects and the other datasets needed to reproduce the results in the paper are available in the Supplementary Tables. The allele frequency data of the genetic variants used in this study are available via the PGG.Han website. The pairwise FST for any two provincial populations and the inferred ancestry composition for each provincial population are available on GitHub or via Zenodo (ref. 112). The use of genetic data in this work was approved by the National Health Commission of the People’s Republic of China (No. 2024BAT00503). The release of the summary statistics of the genetic data is recorded by the National Health Commission (NHC) of the People’s Republic of China at the Open Archive for Miscellaneous Data (OMIX) under accession number OMIX004518. All data generated or analysed during this study are included in this Article, its Supplementary Information and publicly available repositories.

Code availability

The code required to transform the data into the statistics and outputs reported in the paper is available in the Supplementary Information and has been deposited on GitHub and in Zenodo (ref. 113).


This research was supported by the National Natural Science Foundation of China (T2122007, 32288101, 32070577 and 32030020), National Key R&D Program of China (2023YFC2605400, 2020YFE0201600), National Social Science Foundation (23&ZD317 and 20&ZD301), the Shanghai Science and Technology Commission Program (23JS1410100), the Office of Global Partnerships (Key Projects Development Fund), and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 883700 TRAM). This work was also sponsored by the ‘Shuguang Program’ supported by the Shanghai Education Development Foundation and Shanghai Municipal Education Commission (20SG06), and by the Fundamental Research Funds for the Central Universities (2022ECNU-XWK-XK005). The computational work in this study was supported by the CFFF Computing Platform and the Human Phenome Data Center of Fudan University. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

C.Y., S. Yan, L.J., S.X. and M.Z. designed the research. C.Y., F.Y., Y.C., N.X., Z.W. and M.Z. collated the linguistic data of the Chinese dialects. X.Z. and S.X. assembled the genetic data of the Han Chinese and performed genetic data analysis. C.Y., S. Yang, B.W. and M.Z. performed the linguistic analyses and interdisciplinary alignment. C.Y., S. Yan, L.J., S.X. and M.Z. discussed the results. C.Y., X.Z., S.X. and M.Z. wrote and revised the paper. All authors approved the final version of the manuscript.

Correspondence to Li Jin, Shuhua Xu or Menghan Zhang.

The authors declare no conflict of interest.

Nature Human Behaviour thanks Randy J. Lapolla and Chuang-Chao Wang for their contribution to the peer review of this work.

Supplementary Note, Discussion, Figs. 1–3, and table of contents for Supplementary Tables 1–13.

Lexical inventory, datasets generated from statistical analysis, and other relevant analytical results.

Yang, C., Zhang, X., Yan, S. et al. Large-scale lexical and genetic alignment supports a hybrid model of Han Chinese demic and cultural diffusions. Nat Hum Behav (2024).

